There is no way around it.
You don’t even need a crystal ball or fortune teller to see this future.
In this article, we will focus on the fundamental computer science behind horizontally scaling data stores and the limits of common approaches. Then, we will explain the architecture for a hybrid data platform—combining Hadoop, massively parallel processing databases (MPP), NewSQL, and NoSQL types of systems. Lastly, we will cover some real-world use cases and show how this reference architecture can scale with us into the years ahead.
Horizontal Scale, a Definition
A scalable system is one that allows throughput or performance to increase proportionally to the hardware added. There are two basic ways that data stores do this: scaling up (vertical scale) and scaling out (horizontal scale). With scale up, we add more capacity or processing power to a single box or move to a bigger box. Generally, the costs for bigger and bigger boxes increase exponentially. At some point, we can hit two walls: 1) the approach stops making economic sense and 2) in extreme cases, no machine is big enough. In addition, web companies using open source technologies have proven that horizontal scale with commodity hardware works. This is why modern, high-scale architectures tend to look at horizontal approaches.
Still, there are challenges with horizontal scale, and we’ve seen a number of solutions show up in the market to help us achieve it.
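To make the definition of a scalable system concrete, here is a minimal sketch that compares observed speedup to hardware growth. The function name and the throughput numbers are hypothetical and chosen only for illustration; they are not benchmark results.

```python
def scaling_efficiency(baseline_tps, baseline_nodes, scaled_tps, scaled_nodes):
    """Ratio of observed speedup to hardware growth.

    1.0 means throughput grew proportionally to the hardware added;
    values below 1.0 mean the system scales sub-linearly.
    """
    speedup = scaled_tps / baseline_tps
    hardware_growth = scaled_nodes / baseline_nodes
    return speedup / hardware_growth


# Hypothetical measurements: 2 nodes grown to 8 nodes.
proportional = scaling_efficiency(10_000, 2, 40_000, 8)
sub_linear = scaling_efficiency(10_000, 2, 25_000, 8)
print(proportional, sub_linear)
```

A result close to 1.0 as nodes are added is what the rest of this article means by horizontal scale.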
Why Scaling Out is Hard
If we look at the physical architecture of a typical computer, we can begin to see where scale limits are encountered. A cluster of processors connects to main memory (RAM), external storage, internal disk, and the network. Regardless of where the data comes from, it is loaded into RAM before being accessed by the main physical processors. There are limiting factors where the components interact—the input and output (I/O) between network, memory, disk, and CPU—as well as within each of the components themselves. Within a horizontally scalable architecture, the components don’t just interact inside a single server. They have to also interact across machines so that the group of nodes acts like a single system. This adds an additional layer of I/O to the system, providing new complexities and scale problems to solve.
Even though disk speeds have increased dramatically over the past several years, disk I/O still has far higher latency than other operations, such as a network round trip within the same data center. In the chart below, we see that main memory is referenced in 100 ns, a round trip within a data center takes 500,000 ns, and a disk seek takes 10,000,000 ns. In other words, main memory is 5,000x faster than a data center round trip and 100,000x faster than a disk seek. In human terms, let’s say accessing main memory takes the same amount of time as warming up a bowl of nacho cheese in the microwave: 1 minute. Then the data center round trip takes almost 3.5 days to warm the cheese, and the disk seek takes almost 70 days.
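The arithmetic behind these comparisons can be checked with a short sketch. The latency figures are the ones quoted above; the function and variable names are illustrative.

```python
# Latency figures quoted in the text, in nanoseconds.
LATENCY_NS = {
    "main memory reference": 100,
    "data center round trip": 500_000,
    "disk seek": 10_000_000,
}


def slowdown_vs_memory(latency_ns):
    """How many times slower an operation is than a main memory reference."""
    return latency_ns / LATENCY_NS["main memory reference"]


def human_scale_days(latency_ns, memory_minutes=1.0):
    """Rescale a latency so that one memory reference takes `memory_minutes`."""
    minutes = slowdown_vs_memory(latency_ns) * memory_minutes
    return minutes / (60 * 24)


for name, ns in LATENCY_NS.items():
    print(f"{name}: {slowdown_vs_memory(ns):,.0f}x memory, "
          f"{human_scale_days(ns):.2f} days at human scale")
```

At this scale the data center round trip works out to roughly 3.47 days and the disk seek to roughly 69.4 days.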
The First Challenge to Horizontal Scale—Disk I/O
Because a disk seek is the slowest operation in this chain, disk I/O is the first challenge to solve. As software engineers and architects, we’ve looked at many solutions to this problem:
- Speeding disks up
- Solid state drives where data is stored persistently in flash memory
- Parallel disk I/O where we can write to multiple files and disks at once
- Avoiding updates so that we avoid disk seek
- Minimizing inserts by doing them in batch for better efficiency
- Writing asynchronously to remove I/O from the critical path of the transaction
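As one illustration of the batching idea in the list above, the sketch below uses Python's built-in sqlite3 module to compare committing every insert with committing once per batch. The table, the batch size, and the use of commit counts as a stand-in for disk syncs are assumptions made for the example. An in-memory database is used so the snippet runs anywhere, which also means it shows the pattern rather than real disk timings.

```python
import sqlite3

INSERT_SQL = "INSERT INTO orders (id, amount) VALUES (?, ?)"


def make_db():
    """Create a throwaway database with a hypothetical orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
    return conn


def insert_one_by_one(conn, rows):
    """Commit after every row: one transaction per insert."""
    commits = 0
    for row in rows:
        conn.execute(INSERT_SQL, row)
        conn.commit()
        commits += 1
    return commits


def insert_in_batches(conn, rows, batch_size=100):
    """Commit once per batch: far fewer transactions for the same rows."""
    commits = 0
    for start in range(0, len(rows), batch_size):
        conn.executemany(INSERT_SQL, rows[start:start + batch_size])
        conn.commit()
        commits += 1
    return commits


rows = [(i, i * 1.5) for i in range(1000)]
single_commits = insert_one_by_one(make_db(), rows)
batch_conn = make_db()
batch_commits = insert_in_batches(batch_conn, rows)
print(single_commits, batch_commits)
```

On a disk-backed database each commit typically forces a write to stable storage, so cutting the number of commits is where the efficiency gain comes from.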
For example, we have looked to column-oriented databases to minimize disk I/O. Particularly with analytical queries that look across sets of data in a column, there are advantages to columnar models. In short, a system can grab a column’s data from one identified place without having to scan an entire table to find matches. While we can also parallelize disk access for columnar models, traditional columnar databases are not horizontally scalable because all the processors and memory need to be on a single server. Recently, several columnar databases have added parallel processing and horizontal scale, which generally classifies them as MPPs (see below).
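The difference between row and column layouts can be sketched with plain Python data structures. This is a toy model, not how any particular columnar database stores data; the table contents and the "fields read" counter are assumptions used to stand in for I/O.

```python
# A small hypothetical table in row-oriented form.
rows = [
    {"order_id": 1, "customer": "acme", "amount": 120.0},
    {"order_id": 2, "customer": "globex", "amount": 80.5},
    {"order_id": 3, "customer": "acme", "amount": 42.25},
    {"order_id": 4, "customer": "initech", "amount": 300.0},
]

# The same table in column-oriented form: one list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def total_from_rows(table):
    """A row store touches every field of every row to sum one column."""
    total, fields_read = 0.0, 0
    for row in table:
        fields_read += len(row)
        total += row["amount"]
    return total, fields_read


def total_from_columns(table):
    """A column store reads only the one column the query needs."""
    amounts = table["amount"]
    return sum(amounts), len(amounts)


row_total, row_fields = total_from_rows(rows)
col_total, col_fields = total_from_columns(columns)
print(row_total, row_fields, col_total, col_fields)
```

Both layouts return the same total, but the columnar version reads a third of the fields for this three-column table, and the gap widens as tables get wider.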
With Hadoop, we have a horizontally scalable system for batch types of jobs through HDFS, a distributed file system. Storage nodes can be added to get to the petabyte level. The distributed data workloads use local CPU and memory and process in parallel on both structured and unstructured data. The scale issue here comes into play for low-latency access. The data is on disk, not in memory. There is a lot of data moving over the network to deal with distribution and processing. Hadoop and MapReduce weren’t originally built for transactions; they were designed for batch jobs. While Hadoop is superb at staging, ETL/ELT, and some OLAP, it is far from ideal for low-latency OLTP. Even with HBase, Hadoop’s transactional database based on columnar storage, the latency is still based on disk I/O, queries are based on API-level programming, and high availability isn’t mature.
Massively parallel processing (MPP) databases distribute the data across nodes, where it is stored on disk, and run query processing in parallel across those nodes. These systems can scale to the petabyte range and are great for structured data. With some, there is columnar storage built in. Still, transactions can be very slow when there are too many indexes, partitions, and distributions. Often, batch inserts are recommended and used for this reason. While the memory within nodes can act as a cache, ad-hoc requests still have disk-based latency, so queries can still queue up and create bottlenecks.
In-memory databases rely on memory and remove the reliance on disk for I/O. Their latency is based on the speed of memory, as much as 100,000x faster than a disk seek. Some of these systems can perform transactions in memory and propagate the transactions to disk or back-up nodes for high availability. In most cases, all the nodes also hold all the data, and of course this model doesn’t scale. Since data is not distributed, processing cannot be distributed either.
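The scaling limit of holding all the data on every node can be seen with some rough capacity arithmetic. The sketch below contrasts that layout with one where the data is split across nodes. The 64 GB node size and the two-copies setting are hypothetical, and real systems lose some memory to overhead that is ignored here.

```python
def replicated_capacity_gb(nodes, memory_per_node_gb):
    """Every node holds the full data set, so adding nodes adds no capacity."""
    return memory_per_node_gb if nodes > 0 else 0


def partitioned_capacity_gb(nodes, memory_per_node_gb, copies=2):
    """Data is split across nodes; `copies` is how many times each item is stored."""
    return nodes * memory_per_node_gb / copies


for nodes in (2, 4, 16):
    print(nodes,
          replicated_capacity_gb(nodes, 64),
          partitioned_capacity_gb(nodes, 64))
```

With full replication the usable capacity stays at 64 GB no matter how many nodes are added, while the split layout grows with the cluster.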
If in-memory databases could distribute processing power and spill to distributed disk as needed, without performance impact, we would begin to see behind the curtain and towards the future: a new way to scale with distributed memory, distributed CPUs, and distributed disk. However, the weak link then becomes the network. In this model, the network is the only remaining shared resource. It has to handle ever greater distribution as the system scales and is likely to be used more as each node is added. Network I/O becomes the second key challenge in designing a horizontally scalable system. As software engineers and architects, we’ve also looked at many solutions to this problem:
- Speeding networks up with gigabit Ethernet, Fibre Channel, and optical links
- Moving the processing to the data instead of the other way around
- Improving algorithms to avoid unnecessary hops across a network
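The second item in the list above, moving the processing to the data, can be sketched by comparing how many bytes would cross the network in each direction. Everything here is simulated in one process: NODE_DATA stands in for a hypothetical remote node, and JSON payload length stands in for network traffic.

```python
import json

# Hypothetical partition of order records held by a remote node.
NODE_DATA = [{"order_id": i, "amount": float(i % 50)} for i in range(10_000)]


def ship_data_to_client():
    """Client pulls every record over the network, then filters locally."""
    payload = json.dumps(NODE_DATA)  # what would cross the wire
    records = json.loads(payload)
    total = sum(r["amount"] for r in records if r["amount"] > 40)
    return total, len(payload)


def ship_function_to_data():
    """Node filters and sums locally, returning only the small result."""
    total = sum(r["amount"] for r in NODE_DATA if r["amount"] > 40)
    payload = json.dumps({"total": total})  # what would cross the wire
    return json.loads(payload)["total"], len(payload)


data_total, data_bytes = ship_data_to_client()
func_total, func_bytes = ship_function_to_data()
print(data_total == func_total, data_bytes, func_bytes)
```

Both approaches compute the same answer, but the second sends a few dozen bytes instead of the whole partition.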
In-Memory Data Grids—NoSQL and NewSQL
In-memory data grids combine solutions to overcome both disk I/O and network I/O:
- The data is distributed AND in memory for high-speed access
- Wherever possible, the processing is distributed instead of sending data over the network—data isn’t moved when processing can be moved more efficiently
- The network usage is orchestrated more intelligently to reduce hops
When a transaction happens on an in-memory data grid (IMDG), it is distributed across nodes with microsecond-level latency. Of course, not all businesses require this type of performance, but thousands of simultaneous transactions per second basically mandate it. With processing via functions, procedures, or queries, each member gets a request, partial results are sent back, and they are combined. This scatter-gather or MapReduce type of approach is the same model Hadoop uses, but it runs in real time with memory-level latency.
Unlike in-memory databases that have the entire data set replicated across each member in a cluster, IMDGs distribute parts of the data across members. The system is responsible for tracking itself and knowing where each piece of data is, making the location transparent to clients. Part of the approach to architecting and managing IMDGs is optimizing the data’s distribution and replication. For example, strongly correlated data is colocated on a peer to remove network hops within a single query. The system also distributes functions, procedures, and queries transparently to nodes hosting specific shards of data.
There is still a limit here: there must be a financially reasonable way to store the entire data set in memory, and petabyte-scale data can hit that limit.
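The partitioning, colocation, and scatter-gather ideas described for IMDGs can be sketched with a toy grid that lives in a single process. This is not the GemFire or SQLFire API; the Grid class, its methods, and the hash-based placement are assumptions made for illustration.

```python
import zlib


class Grid:
    """Toy in-memory grid: each member is a dict of customer_id -> amounts."""

    def __init__(self, member_count):
        self.members = [{} for _ in range(member_count)]

    def owner(self, customer_id):
        # Stable hash, so every client computes the same owning member.
        return zlib.crc32(customer_id.encode("utf-8")) % len(self.members)

    def put_order(self, customer_id, amount):
        # All orders for one customer land on one member (colocation).
        partition = self.members[self.owner(customer_id)]
        partition.setdefault(customer_id, []).append(amount)

    def customer_total(self, customer_id):
        # Single-member query: no other member is contacted.
        partition = self.members[self.owner(customer_id)]
        return sum(partition.get(customer_id, []))

    def grand_total(self):
        # Scatter: each member sums its own partition.
        partials = [sum(sum(amounts) for amounts in member.values())
                    for member in self.members]
        # Gather: combine the partial results.
        return sum(partials)


grid = Grid(member_count=3)
for customer, amount in [("acme", 100), ("globex", 250),
                         ("acme", 50), ("initech", 75)]:
    grid.put_order(customer, amount)

print(grid.customer_total("acme"))
print(grid.grand_total())
```

Clients call the same methods regardless of which member holds the data, which is the location transparency described above. A production grid would also handle replication and rebalancing, which this sketch leaves out.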
Designing for Cost-Effective, Infinite Scale
By comparing all of the horizontally scalable options, we can begin to see how they might fit together. No single solution can cost-effectively scale, deal with both structured and unstructured data, offer access via SQL or API, handle petabyte analysis, and respond in the sub-millisecond range on transactions.
Hadoop and MPP databases both have lower cost models because they are disk-based and can run on commodity hardware. They can also scale to the petabyte level and distribute queries or procedures for analyzing massive data sets at higher speeds. MPP databases are generally used for structured data and accessed by SQL. Hadoop is often used to process unstructured data and uses programmatic functions to do so. Of course, there are cases where Hadoop might run on structured data and MPP databases might store chunks of unstructured data. Together, they can support both structured and unstructured information using SQL, API-based access, or both.
Both NewSQL and NoSQL in-memory data grids have a higher cost model for storing data due to the use of memory instead of disk. IMDGs can scale to the hundreds of gigabytes level, and they can distribute queries, procedures, and transactions at low-latency response times for transactional scenarios. As the name states, NewSQL uses SQL on structured data, and NoSQL uses programmatic functions and APIs to access documents, key-values, objects, and more. Together, they provide real-time access to both structured and unstructured information using SQL and API-based access.
A combination of these four types of data stores can be applied in many scenarios and architectures. Commonly, the real-time data grid is used for transactions, real-time analytical queries, and “hot data” search. The real-time grid supports fast data—it takes high throughput, low-latency information from apps or acts as an advanced cache for entire systems and writes behind to the MPP database, HDFS, and even an enterprise data warehouse. The write-behind function allows memory to be freed from the grid—immediately or over time.
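The write-behind pattern described above can be sketched with a queue and a background thread. This is a single-process illustration under stated assumptions: a plain dict stands in for the MPP database or HDFS, and the WriteBehindCache class and its method names are hypothetical, not a product API.

```python
import queue
import threading


class WriteBehindCache:
    """Serves reads and writes from memory; persists to a slower store later."""

    def __init__(self, backing_store):
        self._memory = {}
        self._backing = backing_store  # stand-in for the disk-based store
        self._pending = queue.Queue()
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def put(self, key, value):
        self._memory[key] = value        # fast path: memory only
        self._pending.put((key, value))  # persistence happens off the hot path

    def get(self, key):
        if key in self._memory:
            return self._memory[key]
        return self._backing.get(key)    # fall back to the slower store

    def _drain(self):
        while True:
            key, value = self._pending.get()
            self._backing[key] = value
            self._pending.task_done()

    def flush(self):
        """Block until every queued write has reached the backing store."""
        self._pending.join()

    def evict(self, key):
        """Free memory for a key once it is safely persisted."""
        self.flush()
        self._memory.pop(key, None)


warehouse = {}
cache = WriteBehindCache(warehouse)
cache.put("order:1", {"amount": 120})
cache.flush()
cache.evict("order:1")
print(cache.get("order:1"))
```

The caller of put never waits on the slower store, and evict shows how memory can be freed after the data has been written behind.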
Using Pivotal’s data fabric, the grid runs as Pivotal SQLFire, Pivotal GemFire, or both. These data stores can handle thousands of concurrent requests with sub-millisecond latency, provide high availability inside the cluster, and share data across a WAN. GemFire’s event driven architecture and continuous querying can also send selected events to other systems like complex event processing platforms. The big data portions are handled by the Pivotal Greenplum Database, Pivotal HD (Hadoop), both, or even existing data infrastructure. This supports long-term storage, big data analytics queries, and big data batch jobs.
2 Mini Case Studies—Use Cases Requiring Infinite Scale
For one company, the volume of orders during busy hours amounted to thousands of transactions per second. Each order needed to be analyzed in real time for credit. In addition, each order could potentially trigger events in their supply chain and logistics systems based on longer term data and complex, predictive analytics. At the same time, there were 30,000 simultaneous queries for analysis and reporting, covering 200 metrics and helping management run a data-driven organization.
Their solution replaced a columnar database and placed SQLFire on top of Greenplum to gain a lower cost per transaction and horizontal scale. SQLFire handled the transactions, events, and reports based on data from the last 30 days—enough to keep in memory. Greenplum handled the complex, predictive, analytical functions.
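The split between recent data in memory and older data in the analytical store can be sketched as a simple routing rule. The 30-day window comes from the case study; the function name, the tier labels, and the dates are hypothetical, and the real system's routing logic is not described in this article.

```python
from datetime import date

HOT_WINDOW_DAYS = 30  # the case study kept the last 30 days in memory


def route_query(order_date, today):
    """Pick the tier that should answer a query for an order of a given date."""
    age_days = (today - order_date).days
    if age_days <= HOT_WINDOW_DAYS:
        return "in-memory grid"
    return "MPP database"


today = date(2024, 2, 1)
print(route_query(date(2024, 1, 20), today))
print(route_query(date(2023, 12, 1), today))
```

Keeping only the hot window in memory is what lets the grid stay within a financially reasonable memory footprint while the disk-based tier holds the full history.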
Another solution looked similar to this case study. A bank’s mainframe was failing to scale at the busiest times. As resources were used up, the MQSeries would begin to queue up requests and use more resources, bringing the mainframe to its knees. GemFire showed that it could ingest all the data from the mainframe’s MQSeries at rates of 100K transactions per second, handling 10x the expected number of transactions while keeping a low demand on processing and scaling linearly. GemFire piped the data into Greenplum’s database and provided a high-scale analytics platform on core banking transaction data that was previously only available to compliance and audit groups.
There is one additional and noteworthy element that can be placed in this architecture. With a product like Pivotal HD and HAWQ, we can also run SQL queries on Hadoop’s file system with very high performance.
For more information on the Pivotal Data Fabric and products covered in this post:
- Pivotal GemFire
- Pivotal SQLFire
- Pivotal Greenplum DB