Planning for NewSQL: A “How To” for Architects, Developers, and DBAs Using SQLFire

There is now near universal agreement that monolithic, disk-based relational databases (RDBMS) are not the way forward.

However, in moving away from them, we need to be careful we don’t throw the baby out with the bathwater.

Since RDBMS and SQL are often used synonymously, SQL has inappropriately become the poster child for everything wrong with relational databases. As we move away from RDBMSes, we should evaluate SQL on its own merits.

The fact is SQL remains a powerful tool. There is a large volume of SQL code in the world that performs important jobs every day, and no one wants to trash these investments. In addition, there is a significant amount of expertise and a large ecosystem supporting SQL. The power and sheer flexibility of SQL has made it a standard over decades.

Beyond being embedded into millions of lines of legacy code, we should also remember SQL is a remarkably expressive query language and, with its history in online transactional processing systems (OLTP), it is still the best choice for applications with heavy read-write requirements.

So, while we are looking to get the scalability of distributed data stores, for transactional systems we still very much need a SQL approach. This is why the industry has evolved a new class of RDBMSes, called NewSQL, that work very well with the newer-style in-memory data stores and data grids. NewSQL distributed data stores can provide extremely high throughput, low latency, elastic scale, and high availability. Pivotal SQLFire is such a NewSQL data fabric and has a proven ability to handle terabytes of data in a single cluster.

For those learning to architect, develop, and administer NewSQL solutions, designing SQLFire-based applications is not rocket science. While architecting for NewSQL is based on familiar territory, it does require a mindset shift much like using Hadoop instead of traditional data warehouses. This article discusses key considerations for building a robust, high performance application with cloud scale using Pivotal SQLFire.

Below, we dive into considerations for the data model, queries, JVM sizes, and capacity planning to help you approach NewSQL data grids using Pivotal SQLFire.

Get the Data Model Right

Your SQLFire application (and any database application) is doomed to failure unless you get the data model right. So what makes the ‘right data model’? We all know what this means in a traditional RDBMS, but a distributed, in-memory data fabric has additional considerations:

  • Figure out the relationships among the tables in the schema and see what elements of the schema would benefit from co-location (also covered in the SQLFire documentation). This is especially important when transactions are involved. Co-location can help scale transactions and reduce unnecessary network round trips.
  • Determine which tables need to be partitioned and which tables are best left replicated. For example, read-only tables with small amounts of data that are needed by almost all joins are good candidates for replication.
  • Decide which tables need to be persistent and which are better off non-persistent. Take into account the recovery aspect of the system when making this decision. Persistence will cost you some performance but will speed up recovery. Not everything needs to be persistent: ask yourself how a table is used and what it would cost to repopulate its data from a third-party data source.
  • Get the data types right. Using the right data types will help speed up queries and cut down entry overhead.
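As a sketch of these choices, consider the hypothetical DDL below. Table and column names are purely illustrative, and you should check the SQLFire DDL reference for the exact clause syntax, but it shows a replicated reference table, a partitioned and persistent parent table, and a child table co-located with its parent:

```sql
-- Reference data: small, read-only, joined everywhere -> replicate it.
CREATE TABLE country_codes (
  code CHAR(2) NOT NULL PRIMARY KEY,
  name VARCHAR(64)
) REPLICATE;

-- Parent table: partitioned on its primary key, persisted to disk,
-- with one redundant in-memory copy for availability.
CREATE TABLE customers (
  cust_id INT NOT NULL PRIMARY KEY,
  name    VARCHAR(100)
) PARTITION BY PRIMARY KEY
  REDUNDANCY 1
  PERSISTENT;

-- Child table: partitioned on the foreign key and co-located with
-- customers, so customer/order joins and transactions stay on one member.
CREATE TABLE orders (
  order_id INT NOT NULL PRIMARY KEY,
  cust_id  INT NOT NULL,
  amount   DECIMAL(10,2),
  CONSTRAINT orders_fk FOREIGN KEY (cust_id) REFERENCES customers (cust_id)
) PARTITION BY COLUMN (cust_id)
  COLOCATE WITH (customers)
  REDUNDANCY 1
  PERSISTENT;
```

Note how the partitioning column of the child table is the foreign key to the parent: co-location only helps when related rows hash to the same member.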

Get the Queries Right

Needless to say, most OLTP databases are measured by their ability to respond to queries in real time, and SQLFire is no exception. Primary-key lookups and get-convertible queries are blazingly fast in SQLFire. Joins and complex queries can drag down performance unless they are optimized with indexes. Understanding the result set sizes of your queries will help query performance and, more importantly, will generate less garbage in the product, helping keep performance numbers up. You need to capture each query that runs through the system and verify that it is optimized for the schema. Sometimes that means changing the schema, rewriting the query, or adding indexes to speed it up.
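To illustrate (table and column names here are hypothetical, carried over from the data-model discussion), a primary-key lookup needs no help, while a filter on a non-key column often needs an index:

```sql
-- Get-convertible: a primary-key lookup, served without a table scan.
SELECT name FROM customers WHERE cust_id = 42;

-- This filter would scan the whole table without an index...
SELECT order_id, amount FROM orders WHERE cust_id = 42 AND amount > 100.00;

-- ...so index the filter/join column to avoid the scan.
CREATE INDEX orders_cust_idx ON orders (cust_id);
```

Running EXPLAIN PLAN before and after adding the index is the way to confirm the optimizer actually picks it up.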

JVM Sizes Matter

Pivotal SQLFire is a Java-based product. Its performance depends on the amount of memory allocated to the JVM, which allows it to stay out of garbage collection (GC) pauses. SQLFire uses CMS (Concurrent Mark Sweep) by default. This does a good job of managing resources, and there is no need to tune the defaults (new size, stack size, etc.) unless there is enough evidence to suggest otherwise. There may be special cases depending on the size of the elements of a row and the number of compiled statements. All aspects of memory tuning can be viewed as stats via the Visual Statistics Display (VSD). Tuning the heap size allocated to the SQLFire server is the single biggest thing you can do to extract the highest performance from SQLFire.
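As a hedged illustration, heap settings can be passed through the sqlf launcher when starting a server; the directory, port, and sizes below are placeholders, and -J forwards an argument to the underlying JVM:

```shell
# Equal min/max heap avoids resize pauses; CMS is the default collector,
# so no GC flags are passed unless the stats justify them.
sqlf server start -dir=server1 -client-port=1527 \
  -J-Xms8g -J-Xmx8g
```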

Performance Tuning of SQLFire

After getting the data model right, ensuring that you are invoking the right queries, and tuning JVM sizes, there is still a lot of room to improve the performance of SQLFire in a running system. To get the highest performance, we recommend reviewing the following questions using stats and EXPLAIN PLAN:

  • Which are the expensive queries? (Statement Stats)
  • Are the queries using the right indexes? (EXPLAIN PLAN)
  • Are the servers seeing GC issues? (GC Stats)
  • Are there too many compiled queries, which cause new classes to be compiled and loaded thereby burning up PermGen? (GC Stats)
  • Are the JVMs sized correctly? (GC Pauses, GC Stats)
  • Will co-location help improve performance? (Study the data model and look for opportunities to co-locate tables.)
  • Are we using peer clients with transactions and is pr-single-hop being used? (This gives better performance and lower CPU on client applications, but, if you need transactional support from the application, you still have to use peer clients.)
  • How big are the rows and would we expect to see the de-serialization times we are seeing? (Look at the table schema and data types.)
  • Are the servers throttled on CPU or disk? (Machine Stats)
  • Are there too few application threads to drive throughput? (CPU usage is low and throughput is low.)
  • Is query performance getting worse as the data volume grows? (Usually points to an indexing problem.)
  • Can you create proper startup and shutdown scripts? (Start members one at a time, and shut them down one at a time).
  • Are you using redundancy zones? (Doing this can greatly help high availability and also make upgrades a lot easier.)
  • How well do you understand the networking environment? What else is happening on it? (We had a case where a deployment always got better throughput on weekends, but the customer had never investigated it. It turned out they used a shared message queue that was hit heavily during weekdays. We had another customer whose network would fail every day. Another customer had a 100 Mbps switch across some machines and wondered why performance was slower on some calls.)
  • What are the ulimit settings in the deployed environment?
  • Have you changed the conserve-sockets setting to false? (While this needs to be one of the load adaptive settings in SQLFire, currently it is not. You have to set this property uniformly on all servers.)
  • Is application code throwing exceptions that are filling up logs?
  • Is there an external database whose performance is very poor?
  • Are the buckets balanced? (Sometimes they are not, for whatever reason, and that can perpetuate poor performance. If the buckets are not balanced at startup, this is easily addressed by doing a rebalance.)
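The ordered startup/shutdown scripts suggested above might look like the following minimal sketch. Host, port, and directory names are placeholders; consult the sqlf reference for the full option list:

```shell
#!/bin/sh
# Start the locator first, then data servers one at a time.
sqlf locator start -dir=locator1 -peer-discovery-port=10101
sqlf server start -dir=server1 -locators=localhost[10101]
sqlf server start -dir=server2 -locators=localhost[10101]

# Shut down in reverse order: servers one at a time, locator last.
sqlf server stop -dir=server2
sqlf server stop -dir=server1
sqlf locator stop -dir=locator1
```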

Capacity Planning

Customers still have to make the leap from disk-based systems to memory-oriented systems. As a result, planning cluster size is important—having enough RAM, CPU, and disk. While the SQLFire Best Practices Guide and documentation cover more, here are the capacity planning steps that make sense for most applications:

  • How much raw data needs to be stored in SQLFire?
  • How much of this needs to be in memory?
  • How big is each row for each of the major tables?
  • If you are storing multiple days' worth of data, how many days of data is in SQLFire?
  • How many of the tables need indexes?
  • Based on the data types, what is the estimated size per row in bytes?
  • What is the heap size per SQLFire cluster?
  • If you are using WAN replication and planning for multiple clusters, are you applying the same sizing information gathered above to each cluster?
  • How much physical memory is there per box?
  • If WAN queues are involved, are you provisioning memory for those queues?

In addition to this, you need to plan for disk space based on the amount of data being written out to disk. A tricky thing to estimate is the amount of CPU and consequently the number of machines that you will need (independent of memory). This depends on the number of clients and how active they are. Another factor affecting CPU is the number of stored procedures being executed in the grid. Both of these will need to be derived empirically because they vary from one use case to another.
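The memory side of these questions reduces to simple arithmetic. The sketch below is a back-of-envelope calculator, not an official sizing formula: the index-overhead and heap-headroom factors are assumptions you should replace with numbers measured from your own schema and stats.

```python
def estimate_table_memory_gb(rows, bytes_per_row, redundancy=1,
                             index_overhead=0.2, heap_headroom=0.5):
    """Rough heap estimate (GB) for one table across the cluster.

    rows           -- total row count to keep in memory
    bytes_per_row  -- per-row size estimated from the data types
    redundancy     -- number of redundant copies (the REDUNDANCY clause)
    index_overhead -- assumed fraction added by indexes
    heap_headroom  -- assumed fraction reserved for GC headroom and temp space
    """
    raw = rows * bytes_per_row * (1 + redundancy)       # primary + redundant copies
    with_indexes = raw * (1 + index_overhead)           # add index overhead
    with_headroom = with_indexes * (1 + heap_headroom)  # leave GC headroom
    return with_headroom / (1024 ** 3)

# e.g. 100M rows of ~200 bytes with one redundant copy:
print(estimate_table_memory_gb(100_000_000, 200))  # about 67 GB under these assumed factors
```

Summing this over the major tables, plus WAN queue provisioning where applicable, gives the heap to spread across the cluster's JVMs.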

Try the online spreadsheet “Pivotal SQLFire – Capacity Planning – Sample Worksheet” to see how sizing is accomplished.

Debugging Issues in SQLFire

Despite your best attempts, you may encounter issues due to:

  • Environment issues that you could not have foreseen.
  • Capacity planning gone awry due to queries that did not behave as expected.
  • Persistence issues caused by startup/shutdown sequencing issues.
  • Bugs in application or product code.

When these issues happen, the stats, a thread dump from all the servers, and a heap dump will go a long way toward understanding the root cause of the problem. Stat files, viewable through VSD, can tell you about garbage collection pauses, memory pressure on the JVM, and network flakiness. They can also pinpoint whether application code (running in process within the servers) is responsible for server process thrashing. Failing to provision adequate disk space for the system is a common problem that will cause tables to become inaccessible.
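These artifacts can be captured with standard JDK tools; the pid below is a placeholder for a SQLFire server process:

```shell
# Thread dump to the server's stdout log (repeat a few times, seconds apart).
kill -3 <pid>            # or: jstack <pid> > threads.txt

# Heap dump for offline analysis.
jmap -dump:live,format=b,file=server-heap.hprof <pid>
```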

Further reading:

You can learn more about SQLFire through the product documentation and the SQLFire Best Practices Guide.
