In my previous post, you learned how to write a basic MapReduce job and run it on Apache Hadoop. In this post, we’ll delve deeper into MapReduce programming and cover some of the framework’s more advanced features. In particular, we’ll explore combiner functions, custom Writable types, counters, and unit testing with MRUnit.
The full code and short instructions for how to compile and run it are available at https://github.com/sryza/traffic-reduce.
In addition, this time we’ll write our MapReduce program using the “new” MapReduce API, a cleaned-up take on what MapReduce programs should look like that was introduced in Hadoop 0.20. Note that the difference between the old and new MapReduce API is entirely separate from the difference between MR1 and MR2: the API changes affect developers writing MapReduce code, while MR2 is an architectural change that, under the hood, extracts scheduling and resource management into YARN, allowing Hadoop to support other parallel execution frameworks and scale to larger clusters. Both MR1 and MR2 support both the old and new APIs.
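To give a feel for the difference, here is a minimal side-by-side sketch of a mapper in each API. The class names and bodies are made up for illustration and are not part of this post’s code:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Old API (org.apache.hadoop.mapred): Mapper is an interface, outputs go through
// an OutputCollector, and a Reporter handles progress and counters.
class OldApiMapper extends MapReduceBase
    implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    output.collect(new Text(value.toString()), new IntWritable(1));
  }
}

// New API (org.apache.hadoop.mapreduce): Mapper is a class, and a single Context
// object takes the place of OutputCollector and Reporter.
class NewApiMapper extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(new Text(value.toString()), new IntWritable(1));
  }
}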
It’s 11pm on a Thursday, and while Los Angeles is known for its atrocious traffic, you can usually count on being safe from heavy traffic five hours after rush hour. But when you merge onto the I-10 going west, it’s bumper to bumper for miles! What’s going on?
It has to be the Clippers game. With tens of thousands of cars leaving the Staples Center after a home-team basketball game, of course it’s going to be bad. But what about a Lakers game? How bad does it get for those? And what about on holidays, or during political events? It would be great if you could enter a time and determine how far traffic deviated from average for every road in the city.
CalTrans’ Performance and Measurement System (PeMS) provides detailed traffic data from sensors placed on freeways across the state, with updates coming in every 30 seconds. The Los Angeles area alone contains over 4,000 sensor stations. While this is frankly a boatload of data, MapReduce allows you to leverage a cluster to process it in a reasonable amount of time.
In this post, we’ll write a MapReduce program that computes the average traffic at each sensor for each time of the week, and next time, we’ll write a program that uses this information to build an index of the data, so that a program can easily query it to display data from the relevant time.
For our first MapReduce job, we would like to find the average traffic for each sensor station at each time of the week. While the data is available every 30 seconds, we don’t need such fine granularity, so we will use the five-minute summaries that PeMS also publishes. Thus, with 4,370 stations, we will be calculating 4,370 * (60 / 5) * 24 * 7 = 8,809,920 averages.
Each of our input data files contains the measurements for all the stations over a month. Each line contains a station ID, a time, some information about the station, and the measurements taken from that station during that time interval.
Here are some example lines. The fields that are useful to us are the first, which tells the time; the second, which tells the station ID; and the 10th, which gives a normalized vehicle count at that station at that time.
01/01/2012 00:00:00,312346,3,80,W,ML,.491,354,33,1128,.0209,71.5,0,0,0,0,0,0,260,.012,73.9,432,.0273,69.2,436,.0234,72.3,,,,,,,,,,,,,,,
01/01/2012 00:00:00,312347,3,80,W,OR,,236,50,91,,,,,,,,,91,,,,,,,,,,,,,,,,,,,,,,,
01/01/2012 00:00:00,312382,3,99,N,ML,.357,0,0,629,.0155,67,0,0,0,0,0,0,330,.0159,69.9,299,.015,63.9,,,,,,,,,,,,,,,,,,
01/01/2012 00:00:00,312383,3,99,N,OR,,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
01/01/2012 00:00:00,312386,3,99,N,ML,.42,0,0,1336,.0352,67.1,0,0,0,0,0,0,439,.0309,70.4,494,.039,67.4,403,.0357,63.4,,,,,,,,,,,,,,,
The mappers will parse the input lines and emit a key/value pair for each line, where the key is an ID that combines the station ID with the time of the week, and the value is the number of cars that passed over that sensor during that time. Each call to the reduce function receives a station/time of week and the vehicle count values over all the weeks, and computes their average.
An interesting inefficiency to note is that if a single mapper processes measurements over multiple weeks, it will end up with multiple outputs going to the same reducer. As these outputs are going to be averaged by the reducer anyway, we would be able to save I/O by computing partial averages before we have the complete data. To do this, we would need to maintain a count of how many data points are in each partial average, so that we can weight our final average by that count. For example, we could collapse a set of map outputs like 5, 6, 9, 10 into (avg=7.5, count=4). As each map output is written to disk on the mapper, sent over the network, and then possibly written to disk on the reducer, reducing the number of outputs in this way can save a fair amount of I/O.
MapReduce provides us with a way to do exactly this in the form of combiner functions. The framework calls the combiner function in between the map and reduce phases, with the combiner’s outputs sent to the reducer instead of the map outputs it was called on. The framework may choose to call a combiner function zero or more times; generally it is called before map outputs are persisted to disk, on both the map and reduce sides.
Thus, at a high level, our program looks like this:
map(line of input file) {
    parse line of input file
    emit (station id, time of week) -> (vehicle count, 1)
}

combine((station id, time of week), list of corresponding (vehicle count, 1)s) {
    take average of the input values
    emit (station id, time of week) -> (average vehicle count, size(list))
}

reduce((station id, time of week), list of corresponding (vehicle count, size)s) {
    take weighted average of the input values
    emit (station id, time of week) -> average vehicle count
}
MapReduce key and value classes implement Hadoop’s Writable interface so that they can be serialized to and from binary. While Hadoop provides a set of classes that implement Writable to serialize primitive types, the tuples we use in our pseudocode don’t map efficiently onto any of them. For our keys, we can concatenate the station ID with the time of week to represent them as strings and use the Text type. However, as our value tuple is composed of primitive types, a double and an integer, it would be nice not to have to convert them to and from strings each time we want to use them. We can accomplish this by implementing a Writable for them.
public class AverageWritable implements Writable {

  private int numElements;
  private double average;

  public AverageWritable() {}

  public void set(int numElements, double average) {
    this.numElements = numElements;
    this.average = average;
  }

  public int getNumElements() {
    return numElements;
  }

  public double getAverage() {
    return average;
  }

  @Override
  public void readFields(DataInput input) throws IOException {
    numElements = input.readInt();
    average = input.readDouble();
  }

  @Override
  public void write(DataOutput output) throws IOException {
    output.writeInt(numElements);
    output.writeDouble(average);
  }

  [toString(), equals(), and hashCode() shown in the github repo]
}
We deploy our Writable by including it in our job jar. To instantiate our Writable, the framework will call its no-argument constructor and then fill it in by calling its readFields method. Note that if we wanted to use a custom class as a key, it would need to implement WritableComparable so that it could be sorted.
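We don’t need a custom key type for this job, since our keys are Text, but for illustration, here is a minimal sketch of what a WritableComparable key might look like (StationTimeKey is a hypothetical name, not part of the traffic-reduce code):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class StationTimeKey implements WritableComparable<StationTimeKey> {

  private int stationId;
  private long timeOfWeek;

  public StationTimeKey() {}

  public void set(int stationId, long timeOfWeek) {
    this.stationId = stationId;
    this.timeOfWeek = timeOfWeek;
  }

  @Override
  public void readFields(DataInput input) throws IOException {
    stationId = input.readInt();
    timeOfWeek = input.readLong();
  }

  @Override
  public void write(DataOutput output) throws IOException {
    output.writeInt(stationId);
    output.writeLong(timeOfWeek);
  }

  // Defines the sort order used during the shuffle: by station, then by time of week.
  @Override
  public int compareTo(StationTimeKey other) {
    int cmp = Integer.compare(stationId, other.stationId);
    return cmp != 0 ? cmp : Long.compare(timeOfWeek, other.timeOfWeek);
  }
}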
With our custom data type in hand, we are at last ready to write our MapReduce program. Here is what our mapper looks like:
public class AveragerMapper extends Mapper<LongWritable, Text, Text, AverageWritable> {

  private AverageWritable outAverage = new AverageWritable();
  private Text id = new Text();

  @Override
  public void map(LongWritable key, Text line, Context context)
      throws InterruptedException, IOException {
    String[] tokens = line.toString().split(",");
    if (tokens.length < 10) {
      return;
    }
    String dateTime = tokens[0];
    String stationId = tokens[1];
    String trafficCount = tokens[9];

    if (trafficCount.length() > 0) {
      id.set(stationId + "_" + TimeUtil.toTimeOfWeek(dateTime));
      outAverage.set(1, Integer.parseInt(trafficCount));
      context.write(id, outAverage);
    }
  }
}
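The mapper relies on a TimeUtil.toTimeOfWeek() helper, whose actual implementation is in the github repo. As a rough sketch of what such a helper might look like, assuming it returns the number of seconds since the start of the week:

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;

public class TimeUtil {

  // Returns the number of seconds elapsed since the start of the week (Sunday 00:00:00)
  // for a timestamp formatted like "01/01/2012 00:00:00".
  public static long toTimeOfWeek(String dateTime) {
    try {
      SimpleDateFormat format = new SimpleDateFormat("MM/dd/yyyy HH:mm:ss");
      Calendar cal = Calendar.getInstance();
      cal.setTime(format.parse(dateTime));
      long daysIntoWeek = cal.get(Calendar.DAY_OF_WEEK) - Calendar.SUNDAY;
      return daysIntoWeek * 24 * 60 * 60
          + cal.get(Calendar.HOUR_OF_DAY) * 60 * 60
          + cal.get(Calendar.MINUTE) * 60
          + cal.get(Calendar.SECOND);
    } catch (ParseException e) {
      throw new IllegalArgumentException("Malformed timestamp: " + dateTime, e);
    }
  }
}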
You may notice that this mapper looks a little different from the mapper used in the last post. That’s because, as mentioned above, we’re using the new MapReduce API here; it’s a little cleaner, but Hadoop will support both APIs far into the future.
An astute observer will notice that our combiner and reducer are doing exactly the same thing – i.e. outputting a weighted average of the inputs. Thus, we can write the following reducer function, and pass it as a combiner as well:
public class AveragerReducer extends Reducer<Text, AverageWritable, Text, AverageWritable> {

  private AverageWritable outAverage = new AverageWritable();

  @Override
  public void reduce(Text key, Iterable<AverageWritable> averages, Context context)
      throws InterruptedException, IOException {
    double sum = 0.0;
    int numElements = 0;
    for (AverageWritable partialAverage : averages) {
      // weight partial average by number of elements included in it
      sum += partialAverage.getAverage() * partialAverage.getNumElements();
      numElements += partialAverage.getNumElements();
    }
    double average = sum / numElements;
    outAverage.set(numElements, average);
    context.write(key, outAverage);
  }
}
Using the new API, our driver class looks like this:
public class AveragerRunner {
  public static void main(String[] args) throws IOException, ClassNotFoundException,
      InterruptedException {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    Job job = new Job(conf);
    job.setJarByClass(AveragerRunner.class);
    job.setMapperClass(AveragerMapper.class);
    job.setReducerClass(AveragerReducer.class);
    job.setCombinerClass(AveragerReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(AverageWritable.class);
    job.setInputFormatClass(TextInputFormat.class);

    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

    job.waitForCompletion(true);
  }
}
Note that unlike last time, when we used KeyValueTextInputFormat, we use TextInputFormat for our input data. While KeyValueTextInputFormat splits up the line into a key and a value, TextInputFormat passes the entire line as the value, and uses its position in the file (as an offset from the first byte) as the key. The position is not used, which is fairly typical when using TextInputFormat.
In the real world, data is messy. Traffic sensor data, for example, contains records with missing fields all the time, as sensors in the wild are bound to malfunction at times. When running our MapReduce job, it is often useful to count up and collect metrics on the side about what our job is doing. For a program on a single computer, we might just do this by adding a count variable, incrementing it whenever our event of interest occurs, and printing it out at the end. But when our code is running in a distributed fashion, aggregating these counts gets hairy very quickly.
Luckily, Hadoop provides a mechanism to handle this for us, using Counters. MapReduce contains a number of built-in counters that you have probably seen in the output on completion of a MapReduce job.
Map-Reduce Framework
        Map input records=10
        Map output records=7
        Map output bytes=175
        [and many more]
This information is also available in the web UI, both per-job and per-task. To use our own counter, we can simply add a line like
context.getCounter("Averager Counters", "Missing vehicle flows").increment(1);
to the point in the code where the mapper comes across a record with a missing count. Then, when our job completes, we will see our count along with the built-in counters:
Averager Counters
Missing vehicle flows=2329
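Counter values can also be read programmatically from the driver once the job finishes. A minimal sketch, reusing the Job object from AveragerRunner above:

// In AveragerRunner's main(), after the job has finished:
job.waitForCompletion(true);
long missing = job.getCounters()
    .findCounter("Averager Counters", "Missing vehicle flows").getValue();
System.out.println("Records with missing vehicle flows: " + missing);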
It’s often convenient to wrap your entire map or reduce function in a try/catch and increment a counter in the catch block, using the exception class’s name as the counter’s name, to get a profile of what kinds of errors come up.
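Here is a minimal sketch of that pattern, assuming the real per-record logic has been pulled out into a hypothetical doMap() helper:

@Override
public void map(LongWritable key, Text line, Context context)
    throws InterruptedException, IOException {
  try {
    doMap(key, line, context); // the actual parsing and emitting logic
  } catch (Exception e) {
    // Group error counts by exception class name, e.g. "NumberFormatException=12"
    context.getCounter("Averager Errors", e.getClass().getSimpleName()).increment(1);
  }
}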
Running a MapReduce program on a cluster, if we even have access to one, can take a while. However, if we just want to make sure that our basic logic works, we have no need for all that machinery. Enter Apache MRUnit, an Apache project that makes writing JUnit tests for MapReduce programs about as easy as it can be. With MRUnit, we can test our mappers and reducers both separately and as a full flow.
To include it in our project, we add the following to the dependencies section of Maven’s pom.xml:
<dependency>
  <groupId>org.apache.mrunit</groupId>
  <artifactId>mrunit</artifactId>
  <version>0.9.0-incubating</version>
  <classifier>hadoop2</classifier>
</dependency>
The following contains a test for both the mapper and reducer, verifying that with sample inputs, they produce the expected outputs:
public class TestTrafficAverager {

  private MapDriver<LongWritable, Text, Text, AverageWritable> mapDriver;
  private ReduceDriver<Text, AverageWritable, Text, AverageWritable> reduceDriver;

  @Before
  public void setup() {
    AveragerMapper mapper = new AveragerMapper();
    AveragerReducer reducer = new AveragerReducer();
    mapDriver = MapDriver.newMapDriver(mapper);
    reduceDriver = ReduceDriver.newReduceDriver(reducer);
  }

  @Test
  public void testMapper() throws IOException {
    String line = "01/01/2012 00:00:00,311831,3,5,S,OR,,118,0,200,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,";
    mapDriver.withInput(new LongWritable(0), new Text(line));
    Text outKey = new Text("311831_" + TimeUtil.toTimeOfWeek("01/01/2012 00:00:00"));
    AverageWritable outVal = new AverageWritable();
    outVal.set(1, 200.0);
    mapDriver.withOutput(outKey, outVal);
    mapDriver.runTest();
  }

  @Test
  public void testReducer() {
    AverageWritable avg1 = new AverageWritable();
    avg1.set(1, 2.0);
    AverageWritable avg2 = new AverageWritable();
    avg2.set(3, 1.0);
    AverageWritable outAvg = new AverageWritable();
    outAvg.set(4, 1.25);
    Text key = new Text("331831_86400");
    reduceDriver.withInput(key, Arrays.asList(avg1, avg2));
    reduceDriver.withOutput(key, outAvg);
    reduceDriver.runTest();
  }
}
We can run our tests with “mvn test” in the project directory. If there are failures, information on why they failed is available in target/surefire-reports in the project directory.
A more in-depth MRUnit tutorial is available here: https://cwiki.apache.org/confluence/display/MRUNIT/MRUnit+Tutorial.
Like last time, we can build the jar with
mvn install
The full data is available at http://pems.dot.ca.gov/?dnode=Clearinghouse, but like last time, the github repo contains some sample data to run our program on. To place it on the cluster, we can run:
hadoop fs -mkdir trafficcounts
hadoop fs -put samples/input.txt trafficcounts
To run our program, we can use
hadoop jar target/trafficinduce-1.0-SNAPSHOT.jar AveragerRunner trafficcounts/input.txt trafficcounts/output
We can inspect the output with:
hadoop fs -cat trafficcounts/output/part-r-00000
Thanks for reading! Next time, we’ll delve into some more advanced MapReduce features, like the distributed cache, custom partitioners, and custom input and output formats.
Sandy Ryza is a Software Engineer on the Platform team.