Suppose we add a Combiner class to the previous example:
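public static class Combiner extends Reducer<Text, Text, Text, Text> {
    // Runs on the map side: pre-sums the counts for each key so that
    // fewer records are shuffled. Input and output types must both
    // match the mapper's output types (Text, Text).
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long count = 0;
        for (Text val : values) {
            count += Long.parseLong(val.toString());
        }
        // Emit the partial sum as Text so the reducer can parse it like a mapper value.
        context.write(key, new Text(Long.toString(count)));
    }
}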
Then register it on the job with job.setCombinerClass(Combiner.class);
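To see why a combiner helps, its effect can be imitated without Hadoop at all. The following is a minimal plain-Java sketch (Java 9+), not Hadoop code: the class `CombinerSketch`, its methods `combine` and `reduce`, and the sample keys are hypothetical names chosen for illustration. Each map task's output is summed locally per key before the "shuffle", so the reduce side receives one record per key per map task instead of one record per input line. This is only safe because addition is associative and commutative; Hadoop may run the combiner zero, one, or several times, so the job has to produce the same result either way.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java imitation of a sum combiner; all names here are illustrative, not Hadoop APIs.
class CombinerSketch {

    // Sum the counts per key within one map task's output (the "combine" step).
    static Map<String, Long> combine(List<Map.Entry<String, Long>> mapOutput) {
        Map<String, Long> local = new TreeMap<>();
        for (Map.Entry<String, Long> record : mapOutput) {
            local.merge(record.getKey(), record.getValue(), Long::sum);
        }
        return local;
    }

    // Merge the records shuffled from every map task into final totals (the "reduce" step).
    static Map<String, Long> reduce(List<Map.Entry<String, Long>> shuffled) {
        return combine(shuffled); // same summation, applied across map tasks
    }

    public static void main(String[] args) {
        // Two hypothetical map tasks, each emitting (key, 1) per input record.
        List<Map.Entry<String, Long>> map1 = List.of(
                Map.entry("a", 1L), Map.entry("b", 1L), Map.entry("a", 1L));
        List<Map.Entry<String, Long>> map2 = List.of(
                Map.entry("a", 1L), Map.entry("a", 1L));

        // Without a combiner, every map output record is shuffled.
        List<Map.Entry<String, Long>> raw = new ArrayList<>(map1);
        raw.addAll(map2);

        // With a combiner, only the per-task partial sums are shuffled.
        List<Map.Entry<String, Long>> combined = new ArrayList<>(combine(map1).entrySet());
        combined.addAll(combine(map2).entrySet());

        System.out.println("shuffled without combiner: " + raw.size());
        System.out.println("shuffled with combiner:    " + combined.size());
        System.out.println("totals: " + reduce(combined));
    }
}
```

In this toy run 3 records cross the shuffle instead of 5, and the totals are identical whether or not the combine step is applied.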
Now compare the counters of the two runs to see the difference in efficiency. Without the combiner:
14/11/07 14:49:25 INFO mapreduce.Job: Counters: 38
File System Counters
FILE: Number of bytes read=52642504
FILE: Number of bytes written=95200714
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=608036374
HDFS: Number of bytes written=423
HDFS: Number of read operations=22
HDFS: Number of large read operations=0
HDFS: Number of write operations=5
Map-Reduce Framework
Map input records=2923923
Map output records=2923923
Map output bytes=20467464
Map output materialized bytes=26315322
Input split bytes=212
Combine input records=0
Combine output records=0
Reduce input groups=38
Reduce shuffle bytes=26315322
Reduce input records=2923923
Reduce output records=38
Spilled Records=5847846
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=252
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=1150484480
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=236907275
File Output Format Counters
Bytes Written=423
With the combiner:
14/11/07 16:04:49 INFO mapreduce.Job: Counters: 38
File System Counters
FILE: Number of bytes read=16224
FILE: Number of bytes written=704061
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=608036374
HDFS: Number of bytes written=423
HDFS: Number of read operations=22
HDFS: Number of large read operations=0
HDFS: Number of write operations=5
Map-Reduce Framework
Map input records=2923923
Map output records=2923923
Map output bytes=20467464
Map output materialized bytes=523
Input split bytes=212
Combine input records=2923923
Combine output records=39
Reduce input groups=38
Reduce shuffle bytes=523
Reduce input records=39
Reduce output records=38
Spilled Records=78
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=281
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=1154875392
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=236907275
File Output Format Counters
Bytes Written=423
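The first run (without the combiner) took 28 seconds; the second run (with the combiner) took 21 seconds. The counters show where the saving comes from: Reduce shuffle bytes fell from 26315322 to 523 and Reduce input records from 2923923 to 39, while the final output (38 records, 423 bytes) is unchanged.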
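To put numbers on the difference, the following plain-Java sketch plugs values from the two counter dumps into a few ratios. The class name `CounterComparison` and the helper `ratio` are hypothetical; the figures themselves are copied from the logs and the timings above.

```java
// Compares counters copied from the two job runs; class and method names are illustrative.
class CounterComparison {

    // How many times larger the "before" value is than the "after" value.
    static double ratio(long before, long after) {
        return (double) before / after;
    }

    public static void main(String[] args) {
        // Values without the combiner (before) and with it (after).
        long shuffleBytesBefore = 26_315_322L, shuffleBytesAfter = 523L;
        long reduceInputBefore = 2_923_923L, reduceInputAfter = 39L;
        long spilledBefore = 5_847_846L, spilledAfter = 78L;
        long secondsBefore = 28L, secondsAfter = 21L;

        System.out.printf("Reduce shuffle bytes:  %.0fx smaller%n",
                ratio(shuffleBytesBefore, shuffleBytesAfter));
        System.out.printf("Reduce input records:  %.0fx fewer%n",
                ratio(reduceInputBefore, reduceInputAfter));
        System.out.printf("Spilled records:       %.0fx fewer%n",
                ratio(spilledBefore, spilledAfter));
        System.out.printf("Wall-clock time:       %.2fx faster%n",
                ratio(secondsBefore, secondsAfter));
    }
}
```

The shuffle shrinks by a factor of roughly 50,000, yet the job finishes only about a quarter sooner (28 s down to 21 s). One plausible reading is that the map side dominates this job: Map input records and HDFS bytes read are identical in both runs, so the combiner removes the shuffle and reduce-side work but not the cost of reading and mapping the input.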