How-To: Run a MapReduce Job in CDH4 using Advanced Features

In my previous post, you learned how to write a basic MapReduce job and run it on Apache Hadoop. In this post, we’ll delve deeper into MapReduce programming and cover some of the framework’s more advanced features. In particular, we’ll explore:

  • Combiner functions, a feature that allows you to aggregate map outputs before they are passed to the reducer, possibly greatly reducing the amount of data written to disk and sent over the network for certain types of jobs
  • Counters, a way to track how often user-defined events occur across an entire job – for example, count the number of bad records your MapReduce job encounters in all your data and feed it back to you, without any complex instrumentation on your part
  • Custom Writables, which let you go beyond the basic data types that Hadoop provides as keys and values for your mappers and reducers
  • MRUnit, a framework that facilitates unit testing of MapReduce programs

The full code and short instructions for how to compile and run it are available at https://github.com/sryza/traffic-reduce.

In addition, this time we’ll write our MapReduce program using the “new” MapReduce API, a cleaned-up take on what MapReduce programs should look like that was introduced in Hadoop 0.20. Note that the difference between the old and new MapReduce API is entirely separate from the difference between MR1 and MR2: the API changes affect developers writing MapReduce code, while MR2 is an architectural change that, under the hood, extracts the scheduling and resource-management aspects into YARN, allowing Hadoop to support other parallel execution frameworks and scale to larger clusters. Both MR1 and MR2 support the old and new MapReduce APIs.

The Use Case

It’s 11pm on a Thursday, and while Los Angeles is known for its atrocious traffic, you can usually count on being safe from heavy traffic five hours after rush hour. But when you merge onto the I-10 going west, it’s bumper to bumper for miles! What’s going on?

It has to be the Clippers game. With tens of thousands of cars leaving from the Staples Center after a home-team basketball game, of course it’s going to be bad. But what about a Lakers game? How bad does traffic get for those? And what about holidays and political events? It would be great if you could enter a time and determine how far traffic deviated from average for every road in the city.

Caltrans’ Performance Measurement System (PeMS) provides detailed traffic data from sensors placed on freeways across the state, with updates coming in every 30 seconds. The Los Angeles area alone contains over 4,000 sensor stations. While this is frankly a boatload of data, MapReduce allows you to leverage a cluster to process it in a reasonable amount of time.

In this post, we’ll write a MapReduce program that computes the averages; next time, we’ll write one that uses this information to build an index of the data, so that a program can easily query and display data for the relevant time.

The TrafficInduce Program

For our first MapReduce job, we would like to find the average traffic for each sensor station at each time of the week. While the data is available every 30 seconds, we don’t need such fine granularity, so we will use the five-minute summaries that PeMS also publishes. Thus, with 4,370 stations, we will be calculating 4,370 * (60 / 5) * 24 * 7 = 8,809,920 averages.

Each of our input data files contains the measurements for all the stations over a month. Each line contains a station ID, a time, some information about the station, and the measurements taken from that station during that time interval.

Here are some example lines. The fields that are useful to us are the first, which gives the time; the second, which gives the station ID; and the 10th, which gives a normalized vehicle count at that station during that interval.

01/01/2012 00:00:00,312346,3,80,W,ML,.491,354,33,1128,.0209,71.5,0,0,0,0,0,0,260,.012,73.9,432,.0273,69.2,436,.0234,72.3,,,,,,,,,,,,,,,
01/01/2012 00:00:00,312347,3,80,W,OR,,236,50,91,,,,,,,,,91,,,,,,,,,,,,,,,,,,,,,,,
01/01/2012 00:00:00,312382,3,99,N,ML,.357,0,0,629,.0155,67,0,0,0,0,0,0,330,.0159,69.9,299,.015,63.9,,,,,,,,,,,,,,,,,,
01/01/2012 00:00:00,312383,3,99,N,OR,,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
01/01/2012 00:00:00,312386,3,99,N,ML,.42,0,0,1336,.0352,67.1,0,0,0,0,0,0,439,.0309,70.4,494,.039,67.4,403,.0357,63.4,,,,,,,,,,,,,,,


The mappers will parse the input lines and emit a key/value pair for each line, where the key is an ID that combines the station ID with the time of the week, and the value is the number of cars that passed over that sensor during that time. Each call to the reduce function receives a station/time of week and the vehicle count values over all the weeks, and computes their average.

Combiners

An interesting inefficiency to note is that if a single mapper processes measurements over multiple weeks, it will end up with multiple outputs going to the same reducer. As these outputs are going to be averaged by the reducer anyway, we would be able to save I/O by computing partial averages before we have the complete data. To do this, we would need to maintain a count of how many data points are in each partial average, so that we can weight our final average by that count.  For example, we could collapse a set of map outputs like 5, 6, 9, 10 into (avg=7.5, count=4). As each map output is written to disk on the mapper, sent over the network, and then possibly written to disk on the reducer, reducing the number of outputs in this way can save a fair amount of I/O.
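
To make the weighting concrete, here is a tiny standalone sketch of the arithmetic (the second partial average is made up purely for illustration):

public class WeightedAverageExample {
  public static void main(String[] args) {
    // The partial average from above: 5, 6, 9, 10 -> (avg=7.5, count=4)
    double avg1 = 7.5;
    int count1 = 4;
    // A second, hypothetical partial average covering two more data points
    double avg2 = 12.0;
    int count2 = 2;

    // Weight each partial average by its count before merging
    double merged = (avg1 * count1 + avg2 * count2) / (count1 + count2); // (30 + 24) / 6 = 9.0
    System.out.println("(avg=" + merged + ", count=" + (count1 + count2) + ")");
  }
}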

MapReduce provides us with a way to do exactly this in the form of combiner functions. The framework calls the combiner function in between the map and reduce phase, with the combiner’s outputs sent to the reducer instead of the map outputs that it’s called on. The framework may choose to call a combiner function zero or more times – generally it is called before map outputs are persisted to disk, both on the map and reduce side.

Thus, from a high level, our program looks like this:

map(line of input file) {
  parse line of input file, emit (station id, time of week) -> (vehicle count, 1)
}

combine((station id, time of week), list of corresponding (vehicle count, 1)s) {
  take average of the input values, emit (station id, time of week) -> (average vehicle count, size(list))
}

reduce((station id, time of week), list of corresponding (vehicle count, size)s) {
  take weighted average of the input values, emit (station id, time of week) -> average vehicle count
}


Custom Writables

MapReduce key and value classes implement Hadoop’s Writable interface so that they can be serialized to and from binary. While Hadoop provides a set of classes that implement Writable to serialize primitive types, the tuples we use in our pseudocode don’t map cleanly onto any of them. For our keys, we can concatenate the station ID with the time of week to represent them as strings and use the Text type. However, as our value tuple is composed of primitive types, a double and an integer, it would be nice not to have to convert them to and from strings each time we want to use them. We can accomplish this by implementing a Writable for them.

public class AverageWritable implements Writable {

  private int numElements;
  private double average;
 
  public AverageWritable() {}
 
  public void set(int numElements, double average) {
    this.numElements = numElements;
    this.average = average;
  }
 
  public int getNumElements() {
    return numElements;
  }
 
  public double getAverage() {
    return average;
  }
 
  @Override
  public void readFields(DataInput input) throws IOException {
    numElements = input.readInt();
    average = input.readDouble();
  }

  @Override
  public void write(DataOutput output) throws IOException {
    output.writeInt(numElements);
    output.writeDouble(average);
  }

  [toString(), equals(), and hashCode() shown in the github repo]
}


We deploy our Writable by including it in our job jar. To instantiate our Writable, the framework will call its no-argument constructor and then fill it in by calling its readFields method. Note that if we wanted to use a custom class as a key, it would need to implement WritableComparable so that it can be sorted.
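
For illustration, a hypothetical key class for our station/time-of-week pair might look something like the sketch below; our actual job just encodes the pair as a string and uses Text, so this class is not part of the real program.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class StationTimeKey implements WritableComparable<StationTimeKey> {

  private int stationId;
  private int timeOfWeek;

  public StationTimeKey() {}

  public void set(int stationId, int timeOfWeek) {
    this.stationId = stationId;
    this.timeOfWeek = timeOfWeek;
  }

  @Override
  public void readFields(DataInput input) throws IOException {
    stationId = input.readInt();
    timeOfWeek = input.readInt();
  }

  @Override
  public void write(DataOutput output) throws IOException {
    output.writeInt(stationId);
    output.writeInt(timeOfWeek);
  }

  @Override
  public int compareTo(StationTimeKey other) {
    // Sort by station first, then by time of week; this ordering drives the shuffle sort
    if (stationId != other.stationId) {
      return stationId < other.stationId ? -1 : 1;
    }
    if (timeOfWeek != other.timeOfWeek) {
      return timeOfWeek < other.timeOfWeek ? -1 : 1;
    }
    return 0;
  }

  // A real key class should also override hashCode() and equals(), since the default
  // HashPartitioner uses hashCode() to decide which reducer a key goes to.
}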

At Last, the Program

With our custom data type in hand, we are at last ready to write our MapReduce program. Here is what our mapper looks like:

public class AveragerMapper extends Mapper<LongWritable, Text, Text, AverageWritable> {
 
  private AverageWritable outAverage = new AverageWritable();
  private Text id = new Text();
 
  @Override
  public void map(LongWritable key, Text line, Context context)
      throws InterruptedException, IOException {
    String[] tokens = line.toString().split(",");
    if (tokens.length < 10) {
      return;
    }
    String dateTime = tokens[0];
    String stationId = tokens[1];
    String trafficCount = tokens[9];

    if (trafficCount.length() > 0) {
      id.set(stationId + "_" + TimeUtil.toTimeOfWeek(dateTime));
      outAverage.set(1, Integer.parseInt(trafficCount));
     
      context.write(id, outAverage);
    }
  }
}


You may notice that this mapper looks a little bit different than the mapper used in the last post. This is because in this post we use the “new” MapReduce API, a rewrite of the MapReduce API that was introduced in Hadoop 0.20.  The newer one is a little bit cleaner, but Hadoop will support both APIs far into the future.
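
For comparison, here is roughly what a mapper skeleton looks like under the old API (a sketch only, using the org.apache.hadoop.mapred classes; everything in this post uses the new org.apache.hadoop.mapreduce classes instead):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class OldApiMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  @Override
  public void map(LongWritable key, Text value, OutputCollector<Text, Text> output,
      Reporter reporter) throws IOException {
    // In the old API, output goes through an OutputCollector and status/counters
    // go through a Reporter, rather than a single Context object.
    output.collect(new Text("key"), value);
  }
}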

An astute observer will notice that our combiner and reducer are doing exactly the same thing – i.e. outputting a weighted average of the inputs.  Thus, we can write the following reducer function, and pass it as a combiner as well:

public class AveragerReducer extends Reducer<Text, AverageWritable, Text, AverageWritable> {
 
  private AverageWritable outAverage = new AverageWritable();
 
  @Override
  public void reduce(Text key, Iterable<AverageWritable> averages, Context context)
      throws InterruptedException, IOException {
    double sum = 0.0;
    int numElements = 0;
    for (AverageWritable partialAverage : averages) {
      // weight partial average by number of elements included in it
      sum += partialAverage.getAverage() * partialAverage.getNumElements();
      numElements += partialAverage.getNumElements();
    }
    double average = sum / numElements;
   
    outAverage.set(numElements, average);
    context.write(key, outAverage);
  }
}


Using the new API, our driver class looks like this:

public class AveragerRunner {
  public static void main(String[] args) throws IOException, ClassNotFoundException,
      InterruptedException {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    Job job = new Job(conf);
    job.setJarByClass(AveragerRunner.class);
    job.setMapperClass(AveragerMapper.class);
    job.setReducerClass(AveragerReducer.class);
    job.setCombinerClass(AveragerReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(AverageWritable.class);
    job.setInputFormatClass(TextInputFormat.class);
   
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

    job.waitForCompletion(true);
  }
}


Note that unlike last time, when we used KeyValueTextInputFormat, we use TextInputFormat for our input data. While KeyValueTextInputFormat splits up the line into a key and a value, TextInputFormat passes the entire line as the value, and uses its position in the file (as an offset from the first byte) as the key.  The position is not used, which is fairly typical when using TextInputFormat.

Counters

In the real world, data is messy. Traffic sensor data, for example, contains records with missing fields all the time, as sensors in the wild are bound to malfunction. When running our MapReduce job, it is often useful to count up and collect metrics on the side about what the job is doing. For a program on a single computer, we might just do this by adding a count variable, incrementing it whenever our event of interest occurs, and printing it out at the end; but when our code runs in a distributed fashion, aggregating these counts gets hairy very quickly.

Luckily, Hadoop provides a mechanism to handle this for us, using Counters. MapReduce contains a number of built-in counters that you have probably seen in the output on completion of a MapReduce job.

Map-Reduce Framework
       Map input records=10
       Map output records=7
       Map output bytes=175
       [and many more]


This information is also available in the web UI, both per-job and per-task. To use our own counter, we can simply add a line like

context.getCounter("Averager Counters", "Missing vehicle flows").increment(1);


to the point in the code where the mapper comes across a record with a missing count. Then, when our job completes, we will see our count along with the built-in counters:

Averager Counters
   Missing vehicle flows=2329
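
If we want to act on the count programmatically instead of just reading it from the console, we can also pull it from the Job object in the driver once the job has finished, for example (a sketch, using the group and counter names from above):

// In AveragerRunner's main(), after job.waitForCompletion(true):
long missing = job.getCounters()
    .findCounter("Averager Counters", "Missing vehicle flows").getValue();
System.out.println("Records with missing vehicle flows: " + missing);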

It’s often convenient to wrap your entire map or reduce function in a try/catch and increment a counter in the catch block, using the exception class’s name as the counter’s name, to get a profile of what kinds of errors come up.
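
As a sketch of that pattern (hypothetical counter group name, with the real parsing logic elided):

@Override
public void map(LongWritable key, Text line, Context context)
    throws InterruptedException, IOException {
  try {
    // ... the normal parsing and emitting logic shown earlier ...
  } catch (Exception e) {
    // One counter per exception class name, e.g. "NumberFormatException"
    context.getCounter("Averager Errors", e.getClass().getSimpleName()).increment(1);
  }
}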

Testing

Running a MapReduce program on a cluster, if we even have access to one, can take a while. However, if we want to make sure that our basic logic works, we have no need for all the machinery. Enter Apache MRUnit, an Apache project that makes writing JUnit tests for MapReduce programs probably as easy as it could possibly be. Through MRUnit, we can test our mappers and reducers both separately and as a full flow. 

To include it in our project, we add the following to the dependencies section of Maven’s pom.xml:

<dependency>
   <groupId>org.apache.mrunit</groupId>
   <artifactId>mrunit</artifactId>
   <version>0.9.0-incubating</version>
   <classifier>hadoop2</classifier>
</dependency>


The following contains tests for both the mapper and the reducer, verifying that with sample inputs, they produce the expected outputs:

public class TestTrafficAverager {
  private MapDriver<LongWritable, Text, Text, AverageWritable> mapDriver;
  private ReduceDriver<Text, AverageWritable, Text, AverageWritable> reduceDriver;
 
  @Before
  public void setup() {
    AveragerMapper mapper = new AveragerMapper();
    AveragerReducer reducer = new AveragerReducer();
    mapDriver = MapDriver.newMapDriver(mapper);
    reduceDriver = ReduceDriver.newReduceDriver(reducer);
  }
 
  @Test
  public void testMapper() throws IOException {
    String line = "01/01/2012 00:00:00,311831,3,5,S,OR,,118,0,200,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,";
    mapDriver.withInput(new LongWritable(0), new Text(line));
    Text outKey = new Text("311831_" + TimeUtil.toTimeOfWeek("01/01/2012 00:00:00"));
    AverageWritable outVal = new AverageWritable();
    outVal.set(1, 200.0);
    mapDriver.withOutput(outKey, outVal);
    mapDriver.runTest();
  }
 
  @Test
  public void testReducer() {
    AverageWritable avg1 = new AverageWritable();
    avg1.set(1, 2.0);
    AverageWritable avg2 = new AverageWritable();
    avg2.set(3, 1.0);
    AverageWritable outAvg = new AverageWritable();
    outAvg.set(4, 1.25);
    Text key = new Text("331831_86400");
   
    reduceDriver.withInput(key, Arrays.asList(avg1, avg2));
    reduceDriver.withOutput(key, outAvg);
    reduceDriver.runTest();
  }
}


We can run our tests with “mvn test” in the project directory. If there are failures, details about them are available under target/surefire-reports in the project directory.

A more in-depth MRUnit tutorial is available here: https://cwiki.apache.org/confluence/display/MRUNIT/MRUnit+Tutorial.

Running Our Program on Hadoop

Like last time, we can build the jar with

mvn install


The full data is available at http://pems.dot.ca.gov/?dnode=Clearinghouse, but like last time, the github repo contains some sample data to run our program on.  To place it on the cluster, we can run:

hadoop fs -mkdir trafficcounts
hadoop fs -put samples/input.txt trafficcounts


To run our program, we can use

hadoop jar target/trafficinduce-1.0-SNAPSHOT.jar AveragerRunner trafficcounts/input.txt trafficcounts/output


We can inspect the output with:

hadoop fs -cat trafficcounts/output/part-r-00000


Thanks for reading! Next time, we’ll delve into some more advanced MapReduce features, like the distributed cache, custom partitioners, and custom input and output formats.

Sandy Ryza is a Software Engineer on the Platform team.

Ref: http://blog.cloudera.com/blog/2013/02/how-to-run-a-mapreduce-job-in-cdh4-using-advanced-features/
