Introduction
I am reading a book by Lin and Dyer (2010) that is very informative about designing efficient algorithms under the Map/Reduce (M/R) programming paradigm. Of particular interest is the “in-mapper combining” design pattern that I came across while reading it. As if engineers and data miners did not have to change their way of thinking enough to adapt to the M/R programming paradigm, our thinking and development must also be sensitive to the particular M/R framework in use. The in-mapper combining design pattern is meant to address some issues with M/R programming, in particular M/R programming on the Hadoop platform. In this blog post I will discuss the in-mapper combining design pattern and show some examples. This design pattern seems to me an excellent technical screening problem, if you are so (un)fortunate. :) Hereafter, I will refer to the in-mapper combining design pattern with the acronym IMCDP.
Need for the design pattern
IMCDP may result in a more efficient algorithm implementation. I would not say that it results in a more efficient algorithm, per se, since it does not necessarily change the running time complexity. In Lin’s book, the driving factor for IMCDP is to “substantially reduce both the number and size of key-value pairs that need to be shuffled from the mappers to the reducers.”
In the canonical example of word counting, a key-value pair is emitted for every word found. For example, if we had 1,000 words, then 1,000 key-value pairs would be emitted from the mappers to the reducer(s). In between this handing off of data from the mappers to the reducer(s), a shuffle and sort step on the key-value pairs occurs. If the number of “intermediary” key-value pairs (the key-value pairs being sent from the mappers to the reducer(s)) is extremely high, then this amount of data could be a pain point in the speed of completing the overall M/R job. With IMCDP, the idea is to reduce the number of intermediary key-value pairs being sent from the mappers to the reducer(s).
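To make the flow of intermediary key-value pairs concrete, here is a minimal plain-Java sketch of the map, shuffle-and-sort, and reduce steps for word counting. It does not use Hadoop at all; the class name NaiveWordCountSketch and its methods are hypothetical names I made up for illustration, and it assumes Java 9 or later (for List.of and Map.entry).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java sketch (no Hadoop); class and method names are hypothetical.
final class NaiveWordCountSketch {

    // "Map" phase: one (word, 1) pair per token, like the canonical mapper.
    static List<Map.Entry<String, Integer>> mapPhase(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                pairs.add(Map.entry(itr.nextToken(), 1));
            }
        }
        return pairs;
    }

    // "Shuffle and sort": group the values by key, with keys in sorted order.
    static TreeMap<String, List<Integer>> shuffleAndSort(
            List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // "Reduce" phase: sum the values collected for each key.
    static TreeMap<String, Integer> reducePhase(
            TreeMap<String, List<Integer>> grouped) {
        TreeMap<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            totals.put(e.getKey(), sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("to be or not to be", "to do is to be");
        List<Map.Entry<String, Integer>> pairs = mapPhase(lines);
        System.out.println("intermediary pairs: " + pairs.size());
        System.out.println("totals: " + reducePhase(shuffleAndSort(pairs)));
    }
}
```

For these two lines of input there are 11 tokens, so 11 intermediary pairs go through the shuffle and sort step even though there are only 6 distinct words. That gap is what IMCDP tries to close.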
Of course, you may be wondering, why not just use a combiner (the mini-reducer)? The reason is that even if we explicitly set one for an M/R job, Hadoop treats it as an optional optimization and may or may not run it. With IMCDP, the engineer can explicitly and deterministically control how to reduce the number of intermediary key-value pairs.
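Because the framework decides whether the combiner runs (and, as far as I understand Hadoop, it may run it zero, one, or several times), a job has to produce the same answer in every case. The plain-Java sketch below illustrates that property for word count, where summing is associative and commutative. It is not Hadoop code; CombinerOptionalSketch and its methods are hypothetical names, and it assumes Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch (no Hadoop); all names here are hypothetical.
final class CombinerOptionalSketch {

    // Sums values per key. For word count, the same logic can serve as
    // both the combiner and the reducer.
    static TreeMap<String, Integer> sumByKey(
            List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // One optional combiner pass: pairs sharing a key collapse into one
    // pair carrying a partial sum.
    static List<Map.Entry<String, Integer>> combine(
            List<Map.Entry<String, Integer>> pairs) {
        return new ArrayList<>(sumByKey(pairs).entrySet());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> raw = List.of(
                Map.entry("a", 1), Map.entry("b", 1),
                Map.entry("a", 1), Map.entry("a", 1));
        System.out.println(sumByKey(raw));                   // combiner skipped
        System.out.println(sumByKey(combine(raw)));          // combiner ran once
        System.out.println(sumByKey(combine(combine(raw)))); // combiner ran twice
    }
}
```

All three lines print the same totals; only the number of pairs handed to the final sum differs (4 without the combiner, 2 with it). Since we cannot count on that reduction happening, IMCDP moves the aggregation into the mapper where we control it.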
Canonical word count mapper
The canonical word count mapper program is shown below. As stated before, for every word found, a corresponding intermediary key-value pair is emitted from the mapper to the reducer: if we had 1,000,000 words, then 1,000,000 key-value pairs would be emitted. We should try to optimize the implementation of this word count mapper by reducing the amount of data that needs to be passed from the mapper to the reducer.
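    // At the top of the enclosing file:
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Assumed to be nested inside the job's driver class.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        // Reused for every emit to avoid allocating new objects per token.
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one); // one (word, 1) pair per token
            }
        }
    }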
A local IMCDP
In Lin’s book, he suggests that we use an associative array (in Java, a Map) to store the words and their associated frequencies. After we have counted all the words in the incoming text, we emit each word and its associated frequency. I have modified the canonical word count mapper as shown below.
    // Additional imports at the top of the enclosing file:
    import java.util.HashMap;
    import java.util.Map;

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Associative array scoped to this call, i.e. to one input record.
            Map<String, Integer> map = new HashMap<String, Integer>();
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                String token = itr.nextToken();
                if (map.containsKey(token)) {
                    int total = map.get(token) + 1; // auto-unboxes the Integer
                    map.put(token, total);
                } else {
                    map.put(token, 1);
                }
            }
            // Emit one (word, count) pair per distinct word in this record.
            for (Map.Entry<String, Integer> entry : map.entrySet()) {
                String sKey = entry.getKey();
                int total = entry.getValue().intValue();
                context.write(new Text(sKey), new IntWritable(total));
            }
        }
    }
I refer to this IMCDP approach as a local one because the associative array is local to the map method. As you can see, instead of emitting one key-value pair per token, we now emit one key-value pair per distinct word in each input record.
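The per-record counting step can be tried outside Hadoop. The sketch below is a plain-Java stand-in, with hypothetical names (LocalAggregationSketch, countRecord, tokenCount), that counts the words of a single record. It also uses Map.merge, which is one compact alternative to the containsKey/put branches; that is my substitution, not something taken from Lin’s book.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Plain-Java sketch (no Hadoop); class and method names are hypothetical.
final class LocalAggregationSketch {

    // Counts the tokens of one input record. The returned entries are what
    // a local IMCDP mapper would emit for that record.
    static Map<String, Integer> countRecord(String record) {
        Map<String, Integer> counts = new HashMap<>();
        StringTokenizer itr = new StringTokenizer(record);
        while (itr.hasMoreTokens()) {
            // Insert 1 for a new word, otherwise add 1 to the existing count.
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    // Number of pairs the canonical mapper would emit for the same record.
    static int tokenCount(String record) {
        return new StringTokenizer(record).countTokens();
    }

    public static void main(String[] args) {
        String record = "to be or not to be";
        System.out.println("pairs per token:         " + tokenCount(record));
        System.out.println("pairs per distinct word: " + countRecord(record).size());
    }
}
```

For the record "to be or not to be" the canonical mapper emits 6 pairs while the local IMCDP emits 4. The saving depends entirely on how often words repeat within a single record, so short records gain little.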
A global IMCDP
Lin suggests we may do even better than the local IMCDP with a global IMCDP approach. Instead of using an associative array per key-value input, we use an associative array per mapper. The associative array lives outside of the map method, as a field of the mapper, and so I refer to this approach as global. However, no matter the strategy or name, both approaches, local and global IMCDP, are still considered local aggregation techniques with respect to the mapper. The code below shows the global IMCDP approach.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        // One associative array per mapper, shared by every call to map().
        private Map<String, Integer> map;

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            Map<String, Integer> map = getMap();
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                String token = itr.nextToken();
                if (map.containsKey(token)) {
                    int total = map.get(token) + 1; // auto-unboxes the Integer
                    map.put(token, total);
                } else {
                    map.put(token, 1);
                }
            }
            // Nothing is emitted here; emission is deferred to cleanup().
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            Map<String, Integer> map = getMap();
            for (Map.Entry<String, Integer> entry : map.entrySet()) {
                String sKey = entry.getKey();
                int total = entry.getValue().intValue();
                context.write(new Text(sKey), new IntWritable(total));
            }
        }

        public Map<String, Integer> getMap() {
            if (null == map) { // lazy loading
                map = new HashMap<String, Integer>();
            }
            return map;
        }
    }
As you can see, we override Mapper’s cleanup method, which is called only once, after the mapper has processed all of its input, to emit each word and its associated frequency. With the global IMCDP, we are emitting potentially even fewer key-value pairs to the reducer.
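To see how much the global approach can save over the local one, here is a plain-Java sketch that mimics the mapper lifecycle (many map calls, then one cleanup) without any Hadoop classes. GlobalAggregationSketch and its methods are hypothetical stand-ins that I named for this illustration; it assumes Java 9 or later.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;

// Plain-Java sketch (no Hadoop); class and method names are hypothetical.
final class GlobalAggregationSketch {
    // A field, not a local variable: it survives across map() calls.
    private final Map<String, Integer> counts = new HashMap<>();

    // Stand-in for Mapper.map(): called once per record, emits nothing.
    void map(String record) {
        StringTokenizer itr = new StringTokenizer(record);
        while (itr.hasMoreTokens()) {
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
    }

    // Stand-in for Mapper.cleanup(): returns the pairs emitted at the end.
    Map<String, Integer> cleanup() {
        return new HashMap<>(counts);
    }

    // Pairs emitted by the global approach over all records of one mapper.
    static int globalPairCount(List<String> records) {
        GlobalAggregationSketch mapper = new GlobalAggregationSketch();
        for (String record : records) {
            mapper.map(record);
        }
        return mapper.cleanup().size();
    }

    // Pairs emitted by the local approach: distinct words, record by record.
    static int localPairCount(List<String> records) {
        int pairs = 0;
        for (String record : records) {
            Set<String> distinct = new HashSet<>();
            StringTokenizer itr = new StringTokenizer(record);
            while (itr.hasMoreTokens()) {
                distinct.add(itr.nextToken());
            }
            pairs += distinct.size();
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> records = List.of("to be or not to be", "to do is to be");
        System.out.println("local:  " + localPairCount(records));
        System.out.println("global: " + globalPairCount(records));
    }
}
```

With these two records, the local approach emits 8 pairs (4 distinct words in each record) and the global approach emits 6 (the distinct words across both records), compared with 11 for the canonical mapper.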
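The global IMCDP approach may run into a memory limitation. If the associative array grows to the point where memory runs out, your mapper task will crash. Lin suggests “flushing” the associative array every so often. Below, I show a way to flush the associative array once it holds at least FLUSH_SIZE distinct words. After a flush, the same word may be emitted again later by the same mapper, which is fine because the reducer sums the partial counts.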
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        // Flush once the map holds at least this many distinct words.
        private static final int FLUSH_SIZE = 1000;
        private Map<String, Integer> map;

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            Map<String, Integer> map = getMap();
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                String token = itr.nextToken();
                if (map.containsKey(token)) {
                    int total = map.get(token) + 1; // auto-unboxes the Integer
                    map.put(token, total);
                } else {
                    map.put(token, 1);
                }
            }
            flush(context, false); // checked once per input record
        }

        private void flush(Context context, boolean force)
                throws IOException, InterruptedException {
            Map<String, Integer> map = getMap();
            if (!force) {
                int size = map.size();
                if (size < FLUSH_SIZE) {
                    return;
                }
            }
            for (Map.Entry<String, Integer> entry : map.entrySet()) {
                String sKey = entry.getKey();
                int total = entry.getValue().intValue();
                context.write(new Text(sKey), new IntWritable(total));
            }
            map.clear(); // make sure to empty map
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            flush(context, true); // force flush no matter what at the end
        }

        public Map<String, Integer> getMap() {
            if (null == map) { // lazy loading
                map = new HashMap<String, Integer>();
            }
            return map;
        }
    }
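The effect of flushing can be checked with a small plain-Java simulation. FlushingAggregationSketch below is a hypothetical stand-in for the mapper above (no Hadoop classes, Java 9 or later assumed), with the flush threshold passed in as a constructor argument so that different values are easy to compare.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java sketch (no Hadoop); class and method names are hypothetical.
final class FlushingAggregationSketch {
    private final int flushSize;
    private final Map<String, Integer> counts = new HashMap<>();
    private final List<Map.Entry<String, Integer>> emitted = new ArrayList<>();

    FlushingAggregationSketch(int flushSize) {
        this.flushSize = flushSize;
    }

    // Stand-in for Mapper.map(): aggregate, then flush if the map is large.
    void map(String record) {
        StringTokenizer itr = new StringTokenizer(record);
        while (itr.hasMoreTokens()) {
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
        flush(false);
    }

    // Stand-in for Mapper.cleanup(): emit whatever is left.
    void cleanup() {
        flush(true);
    }

    private void flush(boolean force) {
        if (!force && counts.size() < flushSize) {
            return;
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            emitted.add(Map.entry(e.getKey(), e.getValue()));
        }
        counts.clear();
    }

    // Runs one simulated mapper over the records; returns the emitted pairs.
    static List<Map.Entry<String, Integer>> run(int flushSize, List<String> records) {
        FlushingAggregationSketch mapper = new FlushingAggregationSketch(flushSize);
        for (String record : records) {
            mapper.map(record);
        }
        mapper.cleanup();
        return mapper.emitted;
    }

    // Stand-in for the reducer: sums the partial counts per word.
    static TreeMap<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> records = List.of("to be or not to be", "to do is to be");
        System.out.println("flush at 4:    " + run(4, records).size() + " pairs");
        System.out.println("flush at 1000: " + run(1000, records).size() + " pairs");
        System.out.println(reduce(run(4, records)).equals(reduce(run(1000, records))));
    }
}
```

With a threshold of 4, the map is flushed after each of the two records, so 8 pairs are emitted and words such as "to" appear twice with partial counts. With a threshold of 1000, nothing is flushed until cleanup and only 6 pairs are emitted. The reduced totals are identical in both cases, so the threshold trades memory for shuffle volume without affecting the result.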
Summary and conclusion
IMCDP is a way to possibly improve the speed of an M/R job by reducing the number of intermediary key-value pairs emitted from mappers to reducers. Unlike a combiner, which may or may not run at all, IMCDP can be controlled and will always run when implemented. There are several ways to implement IMCDP, such as the two I refer to as local and global IMCDP, but they are all still local aggregation design patterns with respect to the mapper.
As always, I hope you enjoyed reading and that this blog post helped you. Cheers! Sib ntsib dua nawb mog! שלום!
Reference
- J. Lin and C. Dyer. Data-Intensive Text Processing with MapReduce. Synthesis Lectures on Human Language Technologies. Morgan and Claypool, 2010. http://www.morganclaypool.com/doi/abs/10.2200/S00274ED1V01Y201006HLT007?journalCode=hlt
https://vangjee.wordpress.com/2012/03/07/the-in-mapper-combining-design-pattern-for-mapreduce-programming/