In this blog post, I will demonstrate several techniques for local aggregation using the classic word count example. The original WordCount program comes from the Hadoop examples; later I'll improve it with the in-mapper combining algorithm.
```java
/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context
                           ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
- Local aggregation with a combiner. Combiners provide a general mechanism within the MapReduce framework to reduce the amount of intermediate data generated by the mappers. This results in a reduction in the number of intermediate key-value pairs that need to be shuffled across the network. In the WordCount example above, the combiner is enabled with a single line in the driver:
```java
job.setCombinerClass(IntSumReducer.class);
```
- An improvement on the basic algorithm is to use an associative array inside the mapper to tally up term counts within a single document: instead of emitting a key-value pair for each term in the document, this version emits a key-value pair for each unique term in the document. Given that some words appear frequently within a document, this can yield substantial savings in the number of intermediate key-value pairs emitted, especially for long documents. Only the mapper changes; the reducer is kept as-is:
```java
package wc;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

/**
 * Tally up term counts within a single document, then emit each term
 * as well as its total count in that document.
 * <p/>
 * User: George Sun
 * Date: 7/10/13
 * Time: 7:56 PM
 */
public class DocumentTokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final Text wordText = new Text();
    private final IntWritable totalCountInDocument = new IntWritable();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        Map<String, Integer> wordCounter = new HashMap<String, Integer>();
        StringTokenizer st = new StringTokenizer(value.toString());
        // Count every word in a document
        while (st.hasMoreTokens()) {
            String word = st.nextToken();
            if (wordCounter.containsKey(word)) {
                wordCounter.put(word, wordCounter.get(word) + 1);
            } else {
                wordCounter.put(word, 1);
            }
        }
        // Emit each word as well as its count
        for (Map.Entry<String, Integer> entry : wordCounter.entrySet()) {
            wordText.set(entry.getKey());
            totalCountInDocument.set(entry.getValue());
            context.write(wordText, totalCountInDocument);
        }
    }
}
```
- We can take this one step further. The workings of the algorithm critically depend on the details of how map and reduce tasks in Hadoop are executed. A mapper object is created for each map task, which is responsible for processing a block of input key-value pairs. Prior to processing any input key-value pairs, the mapper's setup(...) method is called, which is an API hook for user-specified code. In this case, we initialize an associative array (HashMap) for holding term counts. Since it is possible to preserve state across multiple calls of the map(...) method (one for each input key-value pair), we can continue to accumulate partial term counts in the associative array across multiple documents, and emit key-value pairs only when the mapper has processed all documents. That is, emission of intermediate data is deferred until the cleanup(...) method is called. This API hook provides an opportunity to execute user-specified code after the map method has been applied to all input key-value pairs of the input data split to which the map task was assigned.
```java
package wc;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

/**
 * Hold partial word counts in memory until all input key-value pairs of the
 * input data split have been processed, then emit them in cleanup().
 * <p/>
 * User: George Sun
 * Date: 7/10/13
 * Time: 8:17 PM
 */
public class InputDocumentsTokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private Map<String, Integer> wordCounter;
    private final Text wordText = new Text();
    private final IntWritable totalCountInDocument = new IntWritable();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        wordCounter = new HashMap<String, Integer>();
    }

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer st = new StringTokenizer(value.toString());
        // Count every word of every document this mapper sees
        while (st.hasMoreTokens()) {
            String word = st.nextToken();
            if (wordCounter.containsKey(word)) {
                wordCounter.put(word, wordCounter.get(word) + 1);
            } else {
                wordCounter.put(word, 1);
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit each word as well as its accumulated count
        for (Map.Entry<String, Integer> entry : wordCounter.entrySet()) {
            wordText.set(entry.getKey());
            totalCountInDocument.set(entry.getValue());
            context.write(wordText, totalCountInDocument);
        }
        super.cleanup(context);
    }
}
```
With this technique, we are in essence incorporating combiner functionality directly inside the mapper, so there is no need to run a separate combiner.
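To make the difference in job setup concrete, here is a minimal driver sketch for the in-mapper combining version. It is only an illustration under my own assumptions: the class name InMapperWordCount is made up, and I assume a copy of the original IntSumReducer is available in the wc package. The only substantive change from the original driver is that the setCombinerClass(...) call is no longer needed.

```java
package wc;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class InMapperWordCount {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: inmapperwordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "in-mapper combining word count");
        job.setJarByClass(InMapperWordCount.class);
        // The in-mapper combining mapper shown above
        job.setMapperClass(InputDocumentsTokenizerMapper.class);
        // Note: no job.setCombinerClass(...) -- aggregation already happened in the mapper.
        // Assumes a copy of the original IntSumReducer exists in the wc package.
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```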
Pros of in-mapper combining
There are two main advantages to using this design pattern.
- First, it provides control over when local aggregation occurs and how exactly it takes place. In contrast, the semantics of the combiner are underspecified in MapReduce. For example, Hadoop makes no guarantees on how many times the combiner is applied, or that it is even applied at all. Such indeterminism is unacceptable under some circumstances.
- Second, in-mapper combining will typically be more efficient than using actual combiners. One reason for this is the additional overhead associated with actually materializing the key-value pairs. Combiners reduce the amount of intermediate data that is shuffled across the network, but they don't actually reduce the number of key-value pairs that are emitted by the mappers in the first place.
Cons of in-mapper combining
- First, it breaks the functional programming underpinnings of MapReduce, since state is being preserved across multiple input key-value pairs. Ultimately, this isn't a big deal, since pragmatic concerns for efficiency often trump theoretical "purity", but there are practical consequences as well. Preserving state across multiple input instances means that algorithmic behavior may depend on the order in which input key-value pairs are encountered. This creates the potential for ordering-dependent bugs, which are difficult to debug on large datasets.
- Second, there is a fundamental scalability bottleneck associated with the in-mapper combining pattern. It critically depends on having sufficient memory to store intermediate results until the mapper has completely processed all key-value pairs in an input split.
One common solution to limiting memory usage when using the in-mapper combining technique is to "block" input key-value pairs and "flush" in-memory data structures periodically. The idea is simple: instead of emitting intermediate data only after every key-value pair has been processed, emit partial results after processing every n key-value pairs. This is straightforwardly implemented with a counter variable that keeps track of the number of input key-value pairs that have been processed. As an alternative, the mapper could keep track of its own memory footprint and flush intermediate key-value pairs once memory usage has crossed a certain threshold. In both approaches, either the block size or the memory usage threshold needs to be determined empirically: with too large a value, the mapper may run out of memory, but with too small a value, opportunities for local aggregation may be lost. Furthermore, in Hadoop physical memory is split between multiple tasks that may be running on a node concurrently; these tasks are all competing for finite resources, but since the tasks are not aware of each other, it is difficult to coordinate resource consumption effectively. In practice, however, one often encounters diminishing returns in performance gains with increasing buffer sizes, such that it is not worth the effort to search for an optimal buffer size.
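To illustrate the counter-based variant described above, here is a rough sketch of a word-count mapper that flushes its in-memory counts after every N input key-value pairs. The class name BlockingTokenizerMapper, the flush(...) helper, and the FLUSH_INTERVAL value of 10,000 are my own illustrative choices; as discussed, the threshold would have to be tuned empirically.

```java
package wc;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

/**
 * In-mapper combining with periodic flushing: emit partial counts after every
 * FLUSH_INTERVAL input key-value pairs to bound memory usage.
 */
public class BlockingTokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    // Arbitrary block size; in practice this must be determined empirically.
    private static final int FLUSH_INTERVAL = 10000;

    private Map<String, Integer> wordCounter;
    private long pairsSeen;
    private final Text wordText = new Text();
    private final IntWritable partialCount = new IntWritable();

    @Override
    protected void setup(Context context) {
        wordCounter = new HashMap<String, Integer>();
        pairsSeen = 0;
    }

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer st = new StringTokenizer(value.toString());
        while (st.hasMoreTokens()) {
            String word = st.nextToken();
            Integer count = wordCounter.get(word);
            wordCounter.put(word, count == null ? 1 : count + 1);
        }
        // Flush partial results after every FLUSH_INTERVAL input key-value pairs
        if (++pairsSeen % FLUSH_INTERVAL == 0) {
            flush(context);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit whatever is still buffered at the end of the split
        flush(context);
    }

    private void flush(Context context) throws IOException, InterruptedException {
        for (Map.Entry<String, Integer> entry : wordCounter.entrySet()) {
            wordText.set(entry.getKey());
            partialCount.set(entry.getValue());
            context.write(wordText, partialCount);
        }
        wordCounter.clear();
    }
}
```

Note that emitting the same key more than once from a single mapper is still correct here, because the reducer (and any combiner) simply sums all partial counts associated with a key.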
In MapReduce algorithms, the extent to which efficiency can be increased through local aggregation depends on the size of the intermediate key space, the distribution of keys themselves, and the number of key-value pairs that are emitted by each individual map task. Opportunities for aggregation, after all, come from having multiple values associated with the same key (whether one uses combiners or employs the in-mapper combining pattern).
Local aggregation is also an effective technique for dealing with reduce stragglers that result from a highly skewed (e.g., Zipfian) distribution of values associated with intermediate keys. In our word count example, we do not filter frequently occurring words: therefore, without local aggregation, the reducer that's responsible for computing the count of 'the' will have a lot more work to do than the typical reducer, and therefore will likely be a straggler. With local aggregation (either combiners or in-mapper combining), we substantially reduce the number of values associated with frequently occurring terms, which alleviates the reduce straggler problem.