Apache Hadoop and all its flavors of distributions are the hottest technologies on the market. Its fundamentally changing how we store, use and share data. It is pushing us all forward in many ways–how we socialize with friends, how science is zeroing in on new discoveries, and how industry is becoming more efficient.
But it is a major mind shift. I’ve had several conversations in the past two weeks with programmers and DBAs alike explaining these concepts. For those that have not yet experimented with it, they find the basic concepts of breaking apart databases and not using SQL to be equal parts confusing and interesting science. To that end, we’re going to take this conversation a little more broad and start to layout some of the primary concepts that new professionals to Hadoop can use as a primer.
To do this, examples work best. So we are going to use a basic word count program to illustrate how programming works within the MapReduce framework in Hadoop. We will explore four coding approaches using the native Hadoop library, or alternate libraries such as Pig, Hive and Cascading so programmers can evaluate which approach works best for their needs and skills.
Basic Programming in MapReduce
In concept, the function of MapReduce is not some new method of computing. We are still dealing with data input and output. If you know basic batch processing, MapReduce is familiar ground—we collect data, perform some function on it, and put it somewhere. The difference with MapReduce is that the steps are a little different, and we perform the steps on terabytes of data across 1000s of computers in parallel.
The typical introductory program or ‘Hello World’ for Hadoop is a word count program. Word count programs or functions do a few things: 1) look at a file with words in it, 2) determine what words are contained in the file, and 3) count how many times each word shows up and potentially rank or sort the results. For example, you could run a word count function on a 200 page book about software programming to see how many times the word “code” showed up and what other words were more or less common. A word count program like this is considered to be a simple program.
The word counting problem becomes more complex when we want it to run a word count function on 100,000 books, 100 million web pages, or many terabytes of data instead of a single file. For this volume of data, we need a framework like MapReduce to help us by applying the principle of divide and conquer—MapReduce basically takes each chapter of each book, gives it to a different machine to count, and then aggregates the results on another set of machines. The MapReduce workflow for such a word count function would follow the steps as shown in the diagram below:
- The system takes input from a file system and splits it up across separate Map nodes
- The Map function or code is run and generates an output for each Map node—in the word count function, every word is listed and grouped by word per node
- This output represents a set of intermediate key-value pairs that are moved to Reduce nodes as input
- The Reduce function or code is run and generates an output for each Reduce node—in the word count example, the reduce function sums the number of times a group of words or keys occurs
- The system takes the outputs from each node to aggregate a final view
So, where do we start programming?
There are really a few places we might start coding, but it depends on the scope of your system. It may be that you need to have a program that places data on the file system as input or removes it; however, data can also move manually. The main are we will start programming for is the colored Map and Reduce functions in the diagram above.
Of course, we must understand more about how storage and network are used as well as how data is split up, moved, and aggregated to ensure the entire unit of work functions and performs as we expect. These topics will be saved for future posts or you can dig into them on your own for now.
Code Examples—Hadoop, Pig, Hive, and Cascading
At a high level, people use the native Hadoop libraries to achieve the greatest performance and have the most fine-grained control. Pig is somewhere between the very SQL-like, database language provided by Hive and the very Java-like programming language provided by Cascading. Below, we walk through these four approaches.
Let’s look at the 4 options.
Native Hadoop Libraries
The native libraries provide developers with the most granularity of coding. Given that all other approaches are essentially abstractions, this language offers the least overhead and best performance. Most Hadoop queries are not singular, rather they are several queries strung together. For our simplistic example with a single query, it is likely the most efficient. However, once you have more complex series of jobs with dependencies, some of the abstractions offer more developer assistance.
In the example below, we see snippets from the standard word count example in Hadoop’s documentation. There are two basic things happening. One, the Mapper looks at a data set and reads it line by line. Then, the Mapper’s StringTokenizer function splits each line into words as key value pairs—this is what generates the intermediate output. To clarify, there is a key value pair for each instance of each word in the input file. At the bottom, we can see that the reducer code has received the key value pairs, counts each instance, and writes the information to disk.
Hive is a project that Facebook started in 2008 to make Hadoop behave more like a traditional data warehouse. Hive provides an even more SQL-like interface for MapReduce programming. In the example below, we see how Hive gets data from Hadoop Distributed File System (HDFS), creates a table for lines then does a select count on the table in a very SQL-like fashion. The lateral view applies splits, eliminates spaces, groups, and counts. Each of these commands maps to MapReduce functions covered above. Often considered the slower of the languages to do Hadoop with, this project is being actively worked on to speed it up 100x.
Cascading is neither a scripting nor a SQL-oriented language—it is a set of .jars that define data processing APIs, integration APIs, as well as a process planner and scheduler. As an abstraction of MapReduce, it may run slower than native Hadoop because of some overhead, but most developers don’t mind because its functions help complete projects faster, with less wasted time. For example, Cascading has a fail-fast planner which prevents it from running a Cascading Flow on the cluster if all the data/field dependencies are not satisfied in the Flow. It defines components and actions, sources, and output. As data goes from source to output, you apply a transformation, and we see an example of this below where lines, words, and counts are created and written to disk.
Just the Tip of the Iceberg
For Hadoop, there are many ways to skin this cat. These four examples are considered the more classic or standard platforms for writing MapReduce programs, probably because all except for Cascading is an Apache project. However, many more exist. Even Pivotal has one with our Pivotal HD Hadoop distribution called HAWQ. HAWQ is a true SQL engine that appeals to data scientists because of the level of familiarity and the amount of flexibility it offers. Also, it is fast. HAWQ can leverage local disk, rather than HDFS, for temporarily storing intermediate result, so it is able to perform joins, sorts and OLAP operations on data well beyond the total size of memory in the cluster.
- Great video that illustrates the basic concepts of MapReduce in more detail
- For more ideas on what to use Hadoop to do, see 20 Examples of Getting Big Results with Big Data
- Want it simpler? Learn about the unified configuration and POJO programming model provided by Spring for Apache Hadoop
- Read more on Pivotal HD, Pivotal’s distribution of Hadoop that supports the Apache projects as well as HAWQ for advanced SQL processing