More and more, today’s information-related businesses are facing extreme transaction processing environments, and real-time decision-making on this type of big data is becoming increasingly important. While industries have different contexts for real-time analysis and every enterprise architecture is unique, there are five key capabilities to help put companies in a position to do real-time analytics on massive amounts of incoming data.
What’s Driving the Real-Time Need?
Historically, we might think of using real-time data or predictive analytics on large data sets in the financial services industry for automated trading, preventing fraud with stolen credit cards, checking credit in real time, or even related areas like pre-paid mobile phone contracts. However, real-time analytics is being applied more widely, particularly in digital arenas where massive amounts of data are created.
Part of the reason is that we are all just dealing with bigger piles of data that are being created faster. Just a few years ago, we primarily used gigabyte to describe data stores. Now, we commonly use terabyte and petabyte because data stores are growing 50% year over year.
Industries are looking to use this data to analyze and optimize more business processes.
Financial trading platforms, historic real-time analysis and extreme transaction processing environments, are incorporating even more data by adding media feeds, news, social media sentiment, and even video information to make decisions in real time. Ad networks and media properties are analyzing and optimizing advertising pages, graphics, video, and links to keep you on their site or click more ads. E-commerce retailers continue to capture data about you to support and expand their growth—making offers in real-time. Telecommunications companies track mobile sessions to improve metering, billing, fraud, quality of service, security, location-based services, and more. Internet use is expanding to mobile phones and tablets, leading app developers to track in-app purchases in real-time. Of course sensor networks and internet-of-things infrastructures are being built in many cases to optimize the physical things that are being tracked and monitored, ultimately to make improvements in real-time.
With successes like these, the value for real-time, big data processing and analysis has been established in nearly every sector across the market. From the military to manufacturing, organizations are piling on the big data bandwagon.
Shifting from OLTP/ETL/OLAP to “Extreme Analysis Processing”
As the use of real-time analytics expands, traditional business intelligence and analytics shift as well. Historically, analysis has followed the architecture of capturing data in a traditional OLTP database, using ETL/ELT processes to migrate data, and storing the data into one or more analytical OLAP environments. That was needed since OLTP and OLAP database systems are particularly very different. While OLTPs are good on handling transactions but not optimised for analytical workloads, OLAPs are slow for transactional loads but excellent for analytics. Two different pieces needed for two different demands.
Previously, data analysts used 2 different systems for managing data (OLTP and OLAP) where the analytics data moved via ad-hoc requests, nightly jobs, or perhaps in batches every four to six hours. For significantly larger volumes of data or multiple, simultaneous transactions, the sheer amount of disk I/O can produce a bottleneck, and actually increase the gap between when transaction happened and when we can actually have analysis on top of it. For real-time systems that have attempted to use OLTP for analysis, the scale is limited because it is not optimized for analytical queries. For example, an e-commerce company might find that their database doesn’t scale well during holiday times, and adding a real-time layer to this system sounds like a recipe for down-time that would impact revenue.
However, the business value for analytics is really on finding opportunities before they’re gone, so its important to get that data analyzed real-time. There’s huge advantage on making the analytics run on near real time, either for traditional BI or, more important, Predictive Analytics. To get there, previously we’ve shared 4 architectures for building big data platforms. Today, let’s look at 5 specific capabilities to bring that data into your analytics platform real-time.
5 Key Capabilities for Extreme Transaction Processing & Real-time Analytics
In a real-time analytics environment dealing with extremely large volumes of data, there are five key capabilities to consider to help you achieve the scale and speed you need to succeed in this new era of big data.
1. Delivering In-Memory Transaction Speeds
The first step is to look at how your organization ingests data. Incoming transactions must be handled at significant volumes. Any time data is immediately stored to disk, we run into the slowest part of the transaction—I/O to disk latency. One way to address this issue is to remove the hop to disk. Using systems that can store transactions in memory regardless of writes via methods and APIs or SQL successfully eliminates this problem. In addition, a shared nothing architecture with independent processors and memory helps to distribute ingestion across as many servers as necessary, helping to deal more effectively with spikes in data volumes. This allows for cluster size increases and data redistributions without down-time and an ability to achieve horizontal scale in a linear or near-linear fashion.
2. Quickly Moving Unneeded Data to Disk for Long-Term Storage
Of course, the entire data set may not be able to stay in memory forever or may be cost-prohibitive. Yet, there may be a need to keep some data in memory longer. So, we also need the ability to persist from memory to disk through asynchronous writes based on certain data sets or keeping active values in memory so that we can do things like searches on “hot data” for most recently used (MRU) sets. By persisting to a massively parallel processing database, also built in a shared nothing architecture, we can process queries in parallel across disks at high performance for larger, older data sets that are no longer in-memory. The data movement can also happen without ETL/ELT types of delays and even write to Hadoop.
3. Distributing Data AND Processing for Speed
The idea of a distributed architecture where many smaller machines do smaller jobs to complete a bigger job faster makes sense. It solves for horizontal scalability, and gets hardware costs back in line with the value of the jobs. But its important to not just try to do the same big job on lots of little data sets. You actually need to break up the pieces of the job sequence and dedicate the right number of resources to each step of the way. This way, as data comes in, you can run both transactional and analytical workloads side by side. Carefully organizing how your data is distributed across nodes, and also separating processing into smaller jobs can help to do a lot more with data as it comes in. Use coordinator nodes to help execute queries in a scatter-gather fashion, allowing them to run in parallel and real-time. By partitioning data in memory, we can also query based on key and reduce scans across the entire data set. Lastly, we can ship functions to the nodes and move the query to the data, much like a map-reduce type of operation but at in-memory speeds. Replication and colocation also allow us to place sets of data together on a node to avoid joins across nodes and memory sets.
4. Supporting Continuous Queries for Real Time Events
Acting on data as it comes in helps to avoid having to look for the data later. By setting up continuous queries, we can determine when incoming data, like a stock trade, has a certain criteria and take action on it. The thread responsible for a certain in-memory data set reviews each insert or update and checks the criteria. As updates that meet the criteria are found, event notifications are sent to listeners waiting to receive and process the information. Having an event monitoring function available prevents regular full-table scans for matching criteria across large record sets and allows us to act on these transactions as they happen. Similarly, we can push these events to external clients and feed complex event processing (CEP) platforms for scenarios. This is not possible on a pure SQL architecture characterized by the typical OLTP request-response model.
5. Embedding Data into Apps or Apps into Databases
For some real-time scenarios, applications might need information from the process heap, like session state. They may also need to replicate information as if they were a low-latency cache. In-memory databases should allow embedding into a JVM alongside applications. This allows either one-hop or no-hop access to data, as the data lives in the application requesting it. The embedded members should allow for replicated and partitioned data, persist data to disk, communicate directly with other servers, and participate in distributed queries.
To learn more:
The following Pivotal Data Fabric products were covered in this post:
- Pivotal’s GemFire—NoSQL, in-memory data grid
- Pivotal SQLFire—NewSQL, in-memory data grid
- Pivotal Greenplum DB—massively parallel processing database
- Pivotal HD—Pivotal’s Hadoop distribution