What Are Combiner Functions?
“Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output—the combiner function’s output forms the input to the reduce function. Since the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all. In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer.” -- 《Hadoop: The Definitive Guide》
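Put simply, a combiner is a function that runs on the map side, after the mapper, and does much the same job as a reducer, which is why the book Hadoop in Action also calls it a "local reduce". Its benefit is that it reduces the amount of data sent over the network, which improves performance. Because it is only an optimization, though, Hadoop does not guarantee that it will run at all.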
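The "zero, one, or many times" requirement can be illustrated with a small in-memory simulation. This is only a sketch in Python, not Hadoop code: the word-count job, the two input splits, and the function names map_phase, combine, and reduce_phase are all assumptions made for illustration. It checks that the reducer's result is the same however often the combiner runs, while fewer pairs cross the "network".

```python
from collections import defaultdict


def map_phase(lines):
    """Emit (word, 1) pairs, as a word-count mapper would."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs):
    """Local reduce: sum the counts per key within one map task's output."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def reduce_phase(pairs):
    """Final reduce: sum the counts per key across all map outputs."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Two hypothetical input splits, one per map task.
splits = [["a b a", "b a"], ["a c"]]
map_outputs = [map_phase(split) for split in splits]

# Run the combiner zero, one, and two times on each map output.
no_combiner = reduce_phase(p for out in map_outputs for p in out)
one_pass = reduce_phase(p for out in map_outputs for p in combine(out))
two_passes = reduce_phase(p for out in map_outputs for p in combine(combine(out)))

# Number of pairs that would be transferred to the reducer in each case.
shuffled_without = sum(len(out) for out in map_outputs)
shuffled_with = sum(len(combine(out)) for out in map_outputs)

print(no_combiner == one_pass == two_passes)
print(shuffled_without, shuffled_with)
```

Summation works here because it is associative and commutative. An operation such as a plain average would not give the same result if the combiner were applied a varying number of times, so it could not be used as a combiner in this direct form.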
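There is a deeper design question behind this: it assumes that people favor fat mappers and slim reducers. In general, as much of the complex logic and computation as possible goes into the mapper, and the reducer only does simple aggregation. That is why there is a combiner on the mapper side but none on the reducer side.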
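Here, the same user can appear in any hour of the day, so all of a user's records must be brought together before the visit time can be computed. Clearly this cannot be done on the mapper side. Records are partitioned and sorted with the user ID as the partition key, so each user's records arrive at the reducer together. The reducer follows this standard template (in Perl):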
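# Expects tab-separated records on STDIN, already sorted by user ID.
# onBeginKey(), onSameKey() and onEndKey() are callbacks supplied by the job;
# onBeginKey() is expected to set $cur_key to $key.
while ( my $line = <STDIN> ) {
    chomp($line);
    ( $user_id, $country, $timestamp ) = split( /\t/, $line );

    # set base key: the user ID, since records are grouped per user
    $key = $user_id;

    if ($cur_key) {
        if ( $key ne $cur_key ) {
            # key changed: close the previous group and open a new one
            &onEndKey();
            &onBeginKey();
        }
        &onSameKey();
    }
    else {
        # first record of the input
        &onBeginKey();
        &onSameKey();
    }
}

# flush the last group
if ($cur_key) {
    &onEndKey();
}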
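For comparison, the same key-change pattern can be sketched in Python, where itertools.groupby takes over the begin/same/end bookkeeping. This is a sketch under assumptions the text does not confirm: timestamps are taken to be integer seconds, and "visit time" is taken to mean the span between a user's first and last record. The names parse and reduce_stream are hypothetical.

```python
import io
from itertools import groupby


def parse(line):
    """Split one tab-separated record into (user_id, country, timestamp)."""
    user_id, country, timestamp = line.rstrip("\n").split("\t")
    return user_id, country, int(timestamp)  # assumption: integer seconds


def reduce_stream(lines):
    """Yield (user_id, visit_seconds) for each user.

    Assumes the input is already sorted by user_id, as it is when the
    user ID is the partition key. groupby only merges adjacent records.
    """
    records = (parse(line) for line in lines if line.strip())
    for user_id, group in groupby(records, key=lambda record: record[0]):
        timestamps = [ts for _, _, ts in group]
        # Assumed definition of visit time: last timestamp minus first.
        yield user_id, max(timestamps) - min(timestamps)


# Small hypothetical sample standing in for the reducer's standard input.
sample = io.StringIO("u1\tCN\t100\nu1\tCN\t160\nu2\tUS\t50\n")
results = list(reduce_stream(sample))
print(results)
```

In a streaming job the same function would be fed sys.stdin instead of the sample, and each result would be printed as a tab-separated line.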