Hadoop: The Definitive Guide, ch08 MapReduce Features

1. Counters

1) Built-in counters

2) User-defined Java counters

[ate: /local/nomad2/hadoop/tomwhite-hadoop-book-32dae01 ]
>> hadoop jar ch08.jar MaxTemperatureWithCounters input/ncdc/all max-temp
12/07/03 19:53:21 INFO mapred.FileInputFormat: Total input paths to process : 2
12/07/03 19:53:21 INFO mapred.JobClient: Running job: job_201207030133_0002
12/07/03 19:53:22 INFO mapred.JobClient:  map 0% reduce 0%
12/07/03 19:53:37 INFO mapred.JobClient:  map 100% reduce 0%
12/07/03 19:53:49 INFO mapred.JobClient:  map 100% reduce 100%
12/07/03 19:53:54 INFO mapred.JobClient: Job complete: job_201207030133_0002
12/07/03 19:53:54 INFO mapred.JobClient: Counters: 29
12/07/03 19:53:54 INFO mapred.JobClient:   Job Counters 
12/07/03 19:53:54 INFO mapred.JobClient:     Launched reduce tasks=1
12/07/03 19:53:54 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=16305
12/07/03 19:53:54 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/07/03 19:53:54 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/07/03 19:53:54 INFO mapred.JobClient:     Launched map tasks=2
12/07/03 19:53:54 INFO mapred.JobClient:     Data-local map tasks=2
12/07/03 19:53:54 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10068
12/07/03 19:53:54 INFO mapred.JobClient:   File Input Format Counters 
12/07/03 19:53:54 INFO mapred.JobClient:     Bytes Read=147972
12/07/03 19:53:54 INFO mapred.JobClient:   File Output Format Counters 
12/07/03 19:53:54 INFO mapred.JobClient:     Bytes Written=18
12/07/03 19:53:54 INFO mapred.JobClient:   FileSystemCounters
12/07/03 19:53:54 INFO mapred.JobClient:     FILE_BYTES_READ=28
12/07/03 19:53:54 INFO mapred.JobClient:     HDFS_BYTES_READ=148184
12/07/03 19:53:54 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=62992
12/07/03 19:53:54 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=18
12/07/03 19:53:54 INFO mapred.JobClient:   TemperatureQuality
12/07/03 19:53:54 INFO mapred.JobClient:     1=13129
12/07/03 19:53:54 INFO mapred.JobClient:     9=1
12/07/03 19:53:54 INFO mapred.JobClient:   Air Temperature Records
12/07/03 19:53:54 INFO mapred.JobClient:     Missing=1
12/07/03 19:53:54 INFO mapred.JobClient:   Map-Reduce Framework
12/07/03 19:53:54 INFO mapred.JobClient:     Map output materialized bytes=34
12/07/03 19:53:54 INFO mapred.JobClient:     Map input records=13130
12/07/03 19:53:54 INFO mapred.JobClient:     Reduce shuffle bytes=34
12/07/03 19:53:54 INFO mapred.JobClient:     Spilled Records=4
12/07/03 19:53:54 INFO mapred.JobClient:     Map output bytes=118161
12/07/03 19:53:54 INFO mapred.JobClient:     Map input bytes=1777168
12/07/03 19:53:54 INFO mapred.JobClient:     Combine input records=13129
12/07/03 19:53:54 INFO mapred.JobClient:     SPLIT_RAW_BYTES=212
12/07/03 19:53:54 INFO mapred.JobClient:     Reduce input records=2
12/07/03 19:53:54 INFO mapred.JobClient:     Reduce input groups=2
12/07/03 19:53:54 INFO mapred.JobClient:     Combine output records=2
12/07/03 19:53:54 INFO mapred.JobClient:     Reduce output records=2
12/07/03 19:53:54 INFO mapred.JobClient:     Map output records=13129
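
The two custom groups in the output above (TemperatureQuality and Air Temperature Records) are tallied by the mapper as it parses each record. A minimal sketch of that bookkeeping without any Hadoop dependency: the `Temperature` enum and the `record` method are hypothetical names, and plain maps stand in for the framework's counter API.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.TreeMap;

class CounterSketch {
    // Hypothetical enum standing in for a user-defined counter group.
    enum Temperature { MISSING, MALFORMED }

    // Enum counters: the set of counter names is fixed at compile time.
    final Map<Temperature, Long> enumCounters = new EnumMap<>(Temperature.class);
    // Dynamic counters: names are strings discovered at run time.
    final Map<String, Long> qualityCounters = new TreeMap<>();

    // Tally one record; temperature is null when the field is missing.
    void record(Integer temperature, String qualityCode) {
        if (temperature == null) {
            enumCounters.merge(Temperature.MISSING, 1L, Long::sum);
        }
        qualityCounters.merge(qualityCode, 1L, Long::sum);
    }

    public static void main(String[] args) {
        CounterSketch c = new CounterSketch();
        c.record(22, "1");
        c.record(null, "9");
        c.record(-11, "1");
        System.out.println(c.enumCounters + " " + c.qualityCounters);
    }
}
```

In a real job the framework aggregates these tallies across all tasks and reports them with the job's other counters.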

[ate: /local/nomad2/hadoop/tomwhite-hadoop-book-32dae01 ]
>> hadoop jar ch08.jar MissingTemperatureFields job_201207030133_0002
Records with missing temperature fields: 0.01%
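
The 0.01% figure can be checked by hand from the counters of job_201207030133_0002 (Missing=1, Map input records=13130). A small sketch of the arithmetic, assuming the ratio is missing records over map input records; `missingPercent` is a hypothetical helper name.

```java
import java.util.Locale;

class MissingRatio {
    // Percentage of input records whose temperature field is missing.
    static double missingPercent(long missing, long totalRecords) {
        return 100.0 * missing / totalRecords;
    }

    public static void main(String[] args) {
        // Counter values taken from the job output: Missing=1, Map input records=13130.
        double pct = missingPercent(1, 13130);
        System.out.println(String.format(Locale.ROOT,
            "Records with missing temperature fields: %.2f%%", pct));
    }
}
```

The raw value is about 0.0076%, which rounds to 0.01% at two decimal places.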

2. Sorting

Sorting data is at the heart of MapReduce.

[ate: /local/nomad2/hadoop/tomwhite-hadoop-book-32dae01 ]
>> hadoop jar ch08.jar SortDataPreprocessor input/ncdc/all input/ncdc/all-seq
12/07/03 20:55:15 INFO mapred.FileInputFormat: Total input paths to process : 2
12/07/03 20:55:16 INFO mapred.JobClient: Running job: job_201207030133_0003
12/07/03 20:55:17 INFO mapred.JobClient:  map 0% reduce 0%
12/07/03 20:55:30 INFO mapred.JobClient:  map 100% reduce 0%
12/07/03 20:55:35 INFO mapred.JobClient: Job complete: job_201207030133_0003
12/07/03 20:55:35 INFO mapred.JobClient: Counters: 16
12/07/03 20:55:35 INFO mapred.JobClient:   Job Counters 
12/07/03 20:55:35 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=16560
12/07/03 20:55:35 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/07/03 20:55:35 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/07/03 20:55:35 INFO mapred.JobClient:     Launched map tasks=2
12/07/03 20:55:35 INFO mapred.JobClient:     Data-local map tasks=2
12/07/03 20:55:35 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
12/07/03 20:55:35 INFO mapred.JobClient:   File Input Format Counters 
12/07/03 20:55:35 INFO mapred.JobClient:     Bytes Read=147972
12/07/03 20:55:35 INFO mapred.JobClient:   File Output Format Counters 
12/07/03 20:55:35 INFO mapred.JobClient:     Bytes Written=163409
12/07/03 20:55:35 INFO mapred.JobClient:   FileSystemCounters
12/07/03 20:55:35 INFO mapred.JobClient:     HDFS_BYTES_READ=148184
12/07/03 20:55:35 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=41754
12/07/03 20:55:35 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=163409
12/07/03 20:55:35 INFO mapred.JobClient:   Map-Reduce Framework
12/07/03 20:55:35 INFO mapred.JobClient:     Map input records=13130
12/07/03 20:55:35 INFO mapred.JobClient:     Spilled Records=0
12/07/03 20:55:35 INFO mapred.JobClient:     Map input bytes=1777168
12/07/03 20:55:35 INFO mapred.JobClient:     Map output records=13129
12/07/03 20:55:35 INFO mapred.JobClient:     SPLIT_RAW_BYTES=212

Partial sort

[ate: /local/nomad2/hadoop/tomwhite-hadoop-book-32dae01 ]
>> hadoop jar ch08.jar SortByTemperatureUsingHashPartitioner -D mapred.reduce.tasks=30 input/ncdc/all-seq output-hashsort
12/07/03 22:28:32 INFO mapred.FileInputFormat: Total input paths to process : 2
12/07/03 22:28:33 INFO mapred.JobClient: Running job: job_201207030133_0004
12/07/03 22:28:34 INFO mapred.JobClient:  map 0% reduce 0%
12/07/03 22:28:47 INFO mapred.JobClient:  map 100% reduce 0%
12/07/03 22:28:59 INFO mapred.JobClient:  map 100% reduce 3%
12/07/03 22:29:02 INFO mapred.JobClient:  map 100% reduce 6%
12/07/03 22:29:08 INFO mapred.JobClient:  map 100% reduce 10%
12/07/03 22:29:11 INFO mapred.JobClient:  map 100% reduce 13%
12/07/03 22:29:23 INFO mapred.JobClient:  map 100% reduce 20%
12/07/03 22:29:32 INFO mapred.JobClient:  map 100% reduce 23%
12/07/03 22:29:38 INFO mapred.JobClient:  map 100% reduce 26%
12/07/03 22:29:41 INFO mapred.JobClient:  map 100% reduce 30%
12/07/03 22:29:47 INFO mapred.JobClient:  map 100% reduce 33%
12/07/03 22:29:56 INFO mapred.JobClient:  map 100% reduce 36%
12/07/03 22:30:02 INFO mapred.JobClient:  map 100% reduce 40%
12/07/03 22:30:05 INFO mapred.JobClient:  map 100% reduce 43%
12/07/03 22:30:11 INFO mapred.JobClient:  map 100% reduce 46%
12/07/03 22:30:14 INFO mapred.JobClient:  map 100% reduce 50%
12/07/03 22:30:23 INFO mapred.JobClient:  map 100% reduce 53%
12/07/03 22:30:29 INFO mapred.JobClient:  map 100% reduce 56%
12/07/03 22:30:35 INFO mapred.JobClient:  map 100% reduce 60%
12/07/03 22:30:38 INFO mapred.JobClient:  map 100% reduce 63%
12/07/03 22:30:44 INFO mapred.JobClient:  map 100% reduce 66%
12/07/03 22:30:47 INFO mapred.JobClient:  map 100% reduce 70%
12/07/03 22:30:59 INFO mapred.JobClient:  map 100% reduce 73%
12/07/03 22:31:02 INFO mapred.JobClient:  map 100% reduce 76%
12/07/03 22:31:08 INFO mapred.JobClient:  map 100% reduce 80%
12/07/03 22:31:11 INFO mapred.JobClient:  map 100% reduce 83%
12/07/03 22:31:17 INFO mapred.JobClient:  map 100% reduce 87%
12/07/03 22:31:23 INFO mapred.JobClient:  map 100% reduce 90%
12/07/03 22:31:32 INFO mapred.JobClient:  map 100% reduce 93%
12/07/03 22:31:35 INFO mapred.JobClient:  map 100% reduce 96%
12/07/03 22:31:41 INFO mapred.JobClient:  map 100% reduce 100%
12/07/03 22:31:46 INFO mapred.JobClient: Job complete: job_201207030133_0004
12/07/03 22:31:46 INFO mapred.JobClient: Counters: 26
12/07/03 22:31:46 INFO mapred.JobClient:   Job Counters 
12/07/03 22:31:46 INFO mapred.JobClient:     Launched reduce tasks=30
12/07/03 22:31:46 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=16282
12/07/03 22:31:46 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/07/03 22:31:46 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/07/03 22:31:46 INFO mapred.JobClient:     Launched map tasks=2
12/07/03 22:31:46 INFO mapred.JobClient:     Data-local map tasks=2
12/07/03 22:31:46 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=335658
12/07/03 22:31:46 INFO mapred.JobClient:   File Input Format Counters 
12/07/03 22:31:46 INFO mapred.JobClient:     Bytes Read=163409
12/07/03 22:31:46 INFO mapred.JobClient:   File Output Format Counters 
12/07/03 22:31:46 INFO mapred.JobClient:     Bytes Written=180399
12/07/03 22:31:46 INFO mapred.JobClient:   FileSystemCounters
12/07/03 22:31:46 INFO mapred.JobClient:     FILE_BYTES_READ=1882171
12/07/03 22:31:46 INFO mapred.JobClient:     HDFS_BYTES_READ=163635
12/07/03 22:31:46 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=4431596
12/07/03 22:31:46 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=180399
12/07/03 22:31:46 INFO mapred.JobClient:   Map-Reduce Framework
12/07/03 22:31:46 INFO mapred.JobClient:     Map output materialized bytes=1882351
12/07/03 22:31:46 INFO mapred.JobClient:     Map input records=13129
12/07/03 22:31:46 INFO mapred.JobClient:     Reduce shuffle bytes=1278651
12/07/03 22:31:46 INFO mapred.JobClient:     Spilled Records=26258
12/07/03 22:31:46 INFO mapred.JobClient:     Map output bytes=1842641
12/07/03 22:31:46 INFO mapred.JobClient:     Map input bytes=163159
12/07/03 22:31:46 INFO mapred.JobClient:     Combine input records=0
12/07/03 22:31:46 INFO mapred.JobClient:     SPLIT_RAW_BYTES=226
12/07/03 22:31:46 INFO mapred.JobClient:     Reduce input records=13129
12/07/03 22:31:46 INFO mapred.JobClient:     Reduce input groups=116
12/07/03 22:31:46 INFO mapred.JobClient:     Combine output records=0
12/07/03 22:31:46 INFO mapred.JobClient:     Reduce output records=13129
12/07/03 22:31:46 INFO mapred.JobClient:     Map output records=13129
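
With a hash partitioner, each of the 30 reducers produces a file that is sorted internally, but the files cannot simply be concatenated into one sorted sequence. A rough sketch of why, using plain Java collections in place of reducers; the class and method names are hypothetical, and the partition arithmetic mirrors what Hadoop's default HashPartitioner does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class HashPartitionSketch {
    // Mask the sign bit, then take the modulus, as the default HashPartitioner does.
    static int partition(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Route each key to a partition, then sort every partition independently.
    static List<List<Integer>> partialSort(List<Integer> keys, int numPartitions) {
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (Integer k : keys) {
            parts.get(partition(k, numPartitions)).add(k);
        }
        for (List<Integer> p : parts) {
            Collections.sort(p);
        }
        return parts;
    }

    public static void main(String[] args) {
        // Made-up temperature keys.
        List<Integer> temps = Arrays.asList(311, -22, 150, 0, 78, -105, 200);
        System.out.println(partialSort(temps, 3));
    }
}
```

Neighbouring keys land in unrelated partitions, so the result is sorted only within each partition.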

Lookups in partitioned MapFiles

[ate: /local/nomad2/hadoop/tomwhite-hadoop-book-32dae01 ]
>> hadoop jar ch08.jar SortByTemperatureToMapFile -D mapred.reduce.tasks=30 input/ncdc/all-seq output-hashmapso>
12/07/03 22:35:53 INFO mapred.FileInputFormat: Total input paths to process : 2
12/07/03 22:35:53 INFO mapred.JobClient: Running job: job_201207030133_0005
12/07/03 22:35:54 INFO mapred.JobClient:  map 0% reduce 0%
12/07/03 22:36:08 INFO mapred.JobClient:  map 100% reduce 0%
12/07/03 22:36:20 INFO mapred.JobClient:  map 100% reduce 3%
12/07/03 22:36:23 INFO mapred.JobClient:  map 100% reduce 6%
12/07/03 22:36:29 INFO mapred.JobClient:  map 100% reduce 10%
12/07/03 22:36:32 INFO mapred.JobClient:  map 100% reduce 13%
12/07/03 22:36:44 INFO mapred.JobClient:  map 100% reduce 20%
12/07/03 22:36:53 INFO mapred.JobClient:  map 100% reduce 23%
12/07/03 22:36:56 INFO mapred.JobClient:  map 100% reduce 26%
12/07/03 22:37:02 INFO mapred.JobClient:  map 100% reduce 30%
12/07/03 22:37:05 INFO mapred.JobClient:  map 100% reduce 33%
12/07/03 22:37:17 INFO mapred.JobClient:  map 100% reduce 40%
12/07/03 22:37:26 INFO mapred.JobClient:  map 100% reduce 43%
12/07/03 22:37:29 INFO mapred.JobClient:  map 100% reduce 46%
12/07/03 22:37:35 INFO mapred.JobClient:  map 100% reduce 50%
12/07/03 22:37:38 INFO mapred.JobClient:  map 100% reduce 53%
12/07/03 22:37:51 INFO mapred.JobClient:  map 100% reduce 60%
12/07/03 22:38:00 INFO mapred.JobClient:  map 100% reduce 63%
12/07/03 22:38:03 INFO mapred.JobClient:  map 100% reduce 66%
12/07/03 22:38:09 INFO mapred.JobClient:  map 100% reduce 70%
12/07/03 22:38:12 INFO mapred.JobClient:  map 100% reduce 73%
12/07/03 22:38:18 INFO mapred.JobClient:  map 100% reduce 74%
12/07/03 22:38:21 INFO mapred.JobClient:  map 100% reduce 77%
12/07/03 22:38:24 INFO mapred.JobClient:  map 100% reduce 80%
12/07/03 22:38:33 INFO mapred.JobClient:  map 100% reduce 83%
12/07/03 22:38:36 INFO mapred.JobClient:  map 100% reduce 86%
12/07/03 22:38:42 INFO mapred.JobClient:  map 100% reduce 90%
12/07/03 22:38:45 INFO mapred.JobClient:  map 100% reduce 93%
12/07/03 22:38:57 INFO mapred.JobClient:  map 100% reduce 100%
12/07/03 22:39:02 INFO mapred.JobClient: Job complete: job_201207030133_0005
12/07/03 22:39:02 INFO mapred.JobClient: Counters: 26
12/07/03 22:39:02 INFO mapred.JobClient:   Job Counters 
12/07/03 22:39:02 INFO mapred.JobClient:     Launched reduce tasks=30
12/07/03 22:39:02 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=16299
12/07/03 22:39:02 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/07/03 22:39:02 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/07/03 22:39:02 INFO mapred.JobClient:     Launched map tasks=2
12/07/03 22:39:02 INFO mapred.JobClient:     Data-local map tasks=2
12/07/03 22:39:02 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=330354
12/07/03 22:39:02 INFO mapred.JobClient:   File Input Format Counters 
12/07/03 22:39:02 INFO mapred.JobClient:     Bytes Read=163409
12/07/03 22:39:02 INFO mapred.JobClient:   File Output Format Counters 
12/07/03 22:39:02 INFO mapred.JobClient:     Bytes Written=186935
12/07/03 22:39:02 INFO mapred.JobClient:   FileSystemCounters
12/07/03 22:39:02 INFO mapred.JobClient:     FILE_BYTES_READ=1882171
12/07/03 22:39:02 INFO mapred.JobClient:     HDFS_BYTES_READ=163635
12/07/03 22:39:02 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=4431532
12/07/03 22:39:02 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=186935
12/07/03 22:39:02 INFO mapred.JobClient:   Map-Reduce Framework
12/07/03 22:39:02 INFO mapred.JobClient:     Map output materialized bytes=1882351
12/07/03 22:39:02 INFO mapred.JobClient:     Map input records=13129
12/07/03 22:39:02 INFO mapred.JobClient:     Reduce shuffle bytes=1054827
12/07/03 22:39:02 INFO mapred.JobClient:     Spilled Records=26258
12/07/03 22:39:02 INFO mapred.JobClient:     Map output bytes=1842641
12/07/03 22:39:02 INFO mapred.JobClient:     Map input bytes=163159
12/07/03 22:39:02 INFO mapred.JobClient:     Combine input records=0
12/07/03 22:39:02 INFO mapred.JobClient:     SPLIT_RAW_BYTES=226
12/07/03 22:39:02 INFO mapred.JobClient:     Reduce input records=13129
12/07/03 22:39:02 INFO mapred.JobClient:     Reduce input groups=116
12/07/03 22:39:02 INFO mapred.JobClient:     Combine output records=0
12/07/03 22:39:02 INFO mapred.JobClient:     Reduce output records=13129
12/07/03 22:39:02 INFO mapred.JobClient:     Map output records=13129
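
Because the job's partitioner decides which output file holds a key, a lookup only needs to open that one file. A minimal sketch of the idea, assuming hash partitioning and using one in-memory `TreeMap` per partition as a stand-in for a MapFile; all names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class PartitionedLookupSketch {
    // One sorted map per partition, standing in for one MapFile per reducer.
    final List<TreeMap<Integer, String>> parts = new ArrayList<>();

    PartitionedLookupSketch(int numPartitions) {
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new TreeMap<>());
        }
    }

    // Same rule for writing and reading, so a key is always found where it was put.
    int partition(int key) {
        return (Integer.hashCode(key) & Integer.MAX_VALUE) % parts.size();
    }

    void put(int key, String value) {
        parts.get(partition(key)).put(key, value);
    }

    // Only the single partition that can contain the key is consulted.
    String get(int key) {
        return parts.get(partition(key)).get(key);
    }

    public static void main(String[] args) {
        PartitionedLookupSketch index = new PartitionedLookupSketch(30);
        index.put(-100, "record with temperature -100"); // made-up value
        System.out.println(index.get(-100));
    }
}
```

The lookup is correct only if the reader uses the same partitioner and the same number of partitions as the job that wrote the files.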

Total sort

>> hadoop jar ch08.jar SortByTemperatureUsingTotalOrderPartitioner -D mapred.reduce.tasks=30 input/ncdc/all-seq output-totalsort
12/07/03 23:35:45 INFO mapred.FileInputFormat: Total input paths to process : 2
12/07/03 23:35:45 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/07/03 23:35:45 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
12/07/03 23:35:45 INFO compress.CodecPool: Got brand-new decompressor
12/07/03 23:35:45 INFO compress.CodecPool: Got brand-new decompressor
12/07/03 23:35:45 INFO compress.CodecPool: Got brand-new decompressor
12/07/03 23:35:45 INFO compress.CodecPool: Got brand-new decompressor
12/07/03 23:35:45 INFO lib.InputSampler: Using 1339 samples
12/07/03 23:35:45 INFO compress.CodecPool: Got brand-new compressor
12/07/03 23:35:45 INFO mapred.FileInputFormat: Total input paths to process : 2
12/07/03 23:35:45 INFO mapred.JobClient: Running job: job_201207030133_0006
12/07/03 23:35:46 INFO mapred.JobClient:  map 0% reduce 0%
12/07/03 23:36:01 INFO mapred.JobClient:  map 100% reduce 0%
12/07/03 23:36:13 INFO mapred.JobClient:  map 100% reduce 3%
12/07/03 23:36:16 INFO mapred.JobClient:  map 100% reduce 6%
12/07/03 23:36:25 INFO mapred.JobClient:  map 100% reduce 10%
12/07/03 23:36:28 INFO mapred.JobClient:  map 100% reduce 13%
12/07/03 23:36:37 INFO mapred.JobClient:  map 100% reduce 20%
12/07/03 23:36:49 INFO mapred.JobClient:  map 100% reduce 26%
12/07/03 23:36:58 INFO mapred.JobClient:  map 100% reduce 30%
12/07/03 23:37:01 INFO mapred.JobClient:  map 100% reduce 33%
12/07/03 23:37:10 INFO mapred.JobClient:  map 100% reduce 36%
12/07/03 23:37:16 INFO mapred.JobClient:  map 100% reduce 40%
12/07/03 23:37:19 INFO mapred.JobClient:  map 100% reduce 43%
12/07/03 23:37:25 INFO mapred.JobClient:  map 100% reduce 46%
12/07/03 23:37:31 INFO mapred.JobClient:  map 100% reduce 50%
12/07/03 23:37:40 INFO mapred.JobClient:  map 100% reduce 56%
12/07/03 23:37:49 INFO mapred.JobClient:  map 100% reduce 60%
12/07/03 23:37:52 INFO mapred.JobClient:  map 100% reduce 63%
12/07/03 23:38:01 INFO mapred.JobClient:  map 100% reduce 66%
12/07/03 23:38:04 INFO mapred.JobClient:  map 100% reduce 70%
12/07/03 23:38:13 INFO mapred.JobClient:  map 100% reduce 76%
12/07/03 23:38:22 INFO mapred.JobClient:  map 100% reduce 80%
12/07/03 23:38:25 INFO mapred.JobClient:  map 100% reduce 83%
12/07/03 23:38:34 INFO mapred.JobClient:  map 100% reduce 87%
12/07/03 23:38:37 INFO mapred.JobClient:  map 100% reduce 90%
12/07/03 23:38:40 INFO mapred.JobClient:  map 100% reduce 91%
12/07/03 23:38:46 INFO mapred.JobClient:  map 100% reduce 93%
12/07/03 23:38:49 INFO mapred.JobClient:  map 100% reduce 96%
12/07/03 23:38:58 INFO mapred.JobClient:  map 100% reduce 100%
12/07/03 23:39:03 INFO mapred.JobClient: Job complete: job_201207030133_0006
12/07/03 23:39:03 INFO mapred.JobClient: Counters: 26
12/07/03 23:39:03 INFO mapred.JobClient:   Job Counters 
12/07/03 23:39:03 INFO mapred.JobClient:     Launched reduce tasks=30
12/07/03 23:39:03 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=18040
12/07/03 23:39:03 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/07/03 23:39:03 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/07/03 23:39:03 INFO mapred.JobClient:     Launched map tasks=2
12/07/03 23:39:03 INFO mapred.JobClient:     Data-local map tasks=2
12/07/03 23:39:03 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=336193
12/07/03 23:39:03 INFO mapred.JobClient:   File Input Format Counters 
12/07/03 23:39:03 INFO mapred.JobClient:     Bytes Read=163409
12/07/03 23:39:03 INFO mapred.JobClient:   File Output Format Counters 
12/07/03 23:39:03 INFO mapred.JobClient:     Bytes Written=177339
12/07/03 23:39:03 INFO mapred.JobClient:   FileSystemCounters
12/07/03 23:39:03 INFO mapred.JobClient:     FILE_BYTES_READ=1882171
12/07/03 23:39:03 INFO mapred.JobClient:     HDFS_BYTES_READ=165067
12/07/03 23:39:03 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=4462828
12/07/03 23:39:03 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=177339
12/07/03 23:39:03 INFO mapred.JobClient:   Map-Reduce Framework
12/07/03 23:39:03 INFO mapred.JobClient:     Map output materialized bytes=1882351
12/07/03 23:39:03 INFO mapred.JobClient:     Map input records=13129
12/07/03 23:39:03 INFO mapred.JobClient:     Reduce shuffle bytes=1138806
12/07/03 23:39:03 INFO mapred.JobClient:     Spilled Records=26258
12/07/03 23:39:03 INFO mapred.JobClient:     Map output bytes=1842641
12/07/03 23:39:03 INFO mapred.JobClient:     Map input bytes=163159
12/07/03 23:39:03 INFO mapred.JobClient:     Combine input records=0
12/07/03 23:39:03 INFO mapred.JobClient:     SPLIT_RAW_BYTES=226
12/07/03 23:39:03 INFO mapred.JobClient:     Reduce input records=13129
12/07/03 23:39:03 INFO mapred.JobClient:     Reduce input groups=116
12/07/03 23:39:03 INFO mapred.JobClient:     Combine output records=0
12/07/03 23:39:03 INFO mapred.JobClient:     Reduce output records=13129
12/07/03 23:39:03 INFO mapred.JobClient:     Map output records=13129
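
A total sort needs partitions that cover consecutive key ranges, which is why the job samples the input first ("Using 1339 samples" in the log). A simplified sketch of range partitioning from a sample: the split-point rule below is an assumption chosen for illustration, not necessarily the exact rule the library uses, and the names are hypothetical.

```java
import java.util.Arrays;

class RangePartitionSketch {
    // Pick numPartitions - 1 split points at evenly spaced positions in the sorted sample.
    static int[] splitPoints(int[] sample, int numPartitions) {
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] splits = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            splits[i - 1] = sorted[i * sorted.length / numPartitions];
        }
        return splits;
    }

    // Keys below splits[0] go to partition 0, keys in [splits[0], splits[1]) to partition 1, and so on.
    static int partition(int key, int[] splits) {
        int pos = Arrays.binarySearch(splits, key);
        // Found: the key equals a split point and belongs to the partition that starts there.
        // Not found: binarySearch returns -(insertionPoint) - 1, and the insertion point is the partition.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        int[] sample = {5, -3, 20, 11, 0, 8, 15, -10, 30}; // made-up sampled keys
        int[] splits = splitPoints(sample, 3);
        System.out.println(Arrays.toString(splits));
        System.out.println(partition(4, splits) + " " + partition(5, splits) + " " + partition(100, splits));
    }
}
```

Every key in partition i is no greater than any key in partition i + 1, so sorting each partition and reading the outputs in partition order gives a globally sorted result. A representative sample keeps the partitions roughly equal in size.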

Secondary sort

[ate: /local/nomad2/hadoop/tomwhite-hadoop-book-32dae01 ]
>> hadoop jar ch08.jar MaxTemperatureUsingSecondarySort input/ncdc/all output-secondarysort
12/07/03 23:59:15 INFO mapred.FileInputFormat: Total input paths to process : 2
12/07/03 23:59:15 INFO mapred.JobClient: Running job: job_201207030133_0007
12/07/03 23:59:16 INFO mapred.JobClient:  map 0% reduce 0%
12/07/03 23:59:31 INFO mapred.JobClient:  map 100% reduce 0%
12/07/03 23:59:43 INFO mapred.JobClient:  map 100% reduce 100%
12/07/03 23:59:48 INFO mapred.JobClient: Job complete: job_201207030133_0007
12/07/03 23:59:48 INFO mapred.JobClient: Counters: 26
12/07/03 23:59:48 INFO mapred.JobClient:   Job Counters 
12/07/03 23:59:48 INFO mapred.JobClient:     Launched reduce tasks=1
12/07/03 23:59:48 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=16330
12/07/03 23:59:48 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/07/03 23:59:48 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/07/03 23:59:48 INFO mapred.JobClient:     Launched map tasks=2
12/07/03 23:59:48 INFO mapred.JobClient:     Data-local map tasks=2
12/07/03 23:59:48 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=9967
12/07/03 23:59:48 INFO mapred.JobClient:   File Input Format Counters 
12/07/03 23:59:48 INFO mapred.JobClient:     Bytes Read=147972
12/07/03 23:59:48 INFO mapred.JobClient:   File Output Format Counters 
12/07/03 23:59:48 INFO mapred.JobClient:     Bytes Written=18
12/07/03 23:59:48 INFO mapred.JobClient:   FileSystemCounters
12/07/03 23:59:48 INFO mapred.JobClient:     FILE_BYTES_READ=131296
12/07/03 23:59:48 INFO mapred.JobClient:     HDFS_BYTES_READ=148184
12/07/03 23:59:48 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=326482
12/07/03 23:59:48 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=18
12/07/03 23:59:48 INFO mapred.JobClient:   Map-Reduce Framework
12/07/03 23:59:48 INFO mapred.JobClient:     Map output materialized bytes=131302
12/07/03 23:59:48 INFO mapred.JobClient:     Map input records=13130
12/07/03 23:59:48 INFO mapred.JobClient:     Reduce shuffle bytes=131302
12/07/03 23:59:48 INFO mapred.JobClient:     Spilled Records=26258
12/07/03 23:59:48 INFO mapred.JobClient:     Map output bytes=105032
12/07/03 23:59:48 INFO mapred.JobClient:     Map input bytes=1777168
12/07/03 23:59:48 INFO mapred.JobClient:     Combine input records=0
12/07/03 23:59:48 INFO mapred.JobClient:     SPLIT_RAW_BYTES=212
12/07/03 23:59:48 INFO mapred.JobClient:     Reduce input records=0
12/07/03 23:59:48 INFO mapred.JobClient:     Reduce input groups=2
12/07/03 23:59:48 INFO mapred.JobClient:     Combine output records=0
12/07/03 23:59:48 INFO mapred.JobClient:     Reduce output records=2
12/07/03 23:59:48 INFO mapred.JobClient:     Map output records=13129
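
Secondary sort moves the value into a composite key so that the framework's sort orders it, while grouping still happens on the natural key alone. A small in-memory sketch of the pattern for maximum temperature per year; the `YearTemp` class, the method name, and the sample records are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SecondarySortSketch {
    // Composite key: the natural key (year) plus the value to sort on (temperature).
    static final class YearTemp {
        final int year;
        final int temp;
        YearTemp(int year, int temp) {
            this.year = year;
            this.temp = temp;
        }
    }

    static Map<Integer, Integer> maxPerYear(List<YearTemp> records) {
        List<YearTemp> sorted = new ArrayList<>(records);
        // Sort comparator: year ascending, then temperature descending.
        sorted.sort((a, b) -> a.year != b.year
            ? Integer.compare(a.year, b.year)
            : Integer.compare(b.temp, a.temp));
        // Grouping looks at the year only, so the first record of each group carries the maximum.
        Map<Integer, Integer> result = new LinkedHashMap<>();
        for (YearTemp r : sorted) {
            result.putIfAbsent(r.year, r.temp);
        }
        return result;
    }

    public static void main(String[] args) {
        List<YearTemp> records = Arrays.asList(
            new YearTemp(1902, 111), new YearTemp(1901, 317),
            new YearTemp(1901, -56), new YearTemp(1902, 244));
        System.out.println(maxPerYear(records));
    }
}
```

In a MapReduce job the same three roles are played by a partitioner on the year, a sort comparator on the full composite key, and a grouping comparator on the year.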

3. Joins

4. Side Data Distribution

Distributed cache

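Rather than serializing side data in the job configuration, it is preferable to distribute datasets using Hadoop's distributed cache mechanism. It provides a service that copies files and archives to the task nodes in time for the tasks to use them when they run. To save network bandwidth, files are normally copied to any particular node only once per job.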
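
From the task's point of view, a cached file is just a local file that can be read once during setup and kept in memory as a lookup table. A minimal sketch of that loading step in plain Java: the fixed-width layout (`ID_WIDTH`), the class and method names, and the sample lines are assumptions for illustration, not the real format of the station metadata file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StationLookupSketch {
    // Hypothetical layout: the first 12 columns hold the station ID, the rest of the line the name.
    static final int ID_WIDTH = 12;

    // Build the in-memory table from the lines of the side-data file.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> names = new HashMap<>();
        for (String line : lines) {
            if (line.length() <= ID_WIDTH) {
                continue; // skip blank or truncated lines
            }
            names.put(line.substring(0, ID_WIDTH).trim(), line.substring(ID_WIDTH).trim());
        }
        return names;
    }

    // Read the local copy of the file, as a task would do once before processing records.
    static Map<String, String> load(Path localFile) throws IOException {
        return parse(Files.readAllLines(localFile, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for the copy placed on the task node.
        Path f = Files.createTempFile("stations", ".txt");
        Files.write(f, Arrays.asList("000001-99999 STATION ONE", "000002-99999 STATION TWO"),
            StandardCharsets.UTF_8);
        System.out.println(load(f));
        Files.delete(f);
    }
}
```

Loading the table once per task keeps the per-record work to a single map lookup.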

>> hadoop jar ch08.jar MaxTemperatureByStationNameUsingDistributedCacheFile -files input/ncdc/metadata/stations-fixed-width.txt input/ncdc/all output
12/07/04 00:18:14 INFO mapred.FileInputFormat: Total input paths to process : 2
12/07/04 00:18:14 INFO mapred.JobClient: Running job: job_201207030133_0008
12/07/04 00:18:15 INFO mapred.JobClient:  map 0% reduce 0%
12/07/04 00:18:29 INFO mapred.JobClient:  map 100% reduce 0%
12/07/04 00:18:41 INFO mapred.JobClient:  map 100% reduce 100%
12/07/04 00:18:46 INFO mapred.JobClient: Job complete: job_201207030133_0008
12/07/04 00:18:46 INFO mapred.JobClient: Counters: 26
12/07/04 00:18:46 INFO mapred.JobClient:   Job Counters 
12/07/04 00:18:46 INFO mapred.JobClient:     Launched reduce tasks=1
12/07/04 00:18:46 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=17800
12/07/04 00:18:46 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/07/04 00:18:46 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/07/04 00:18:46 INFO mapred.JobClient:     Launched map tasks=2
12/07/04 00:18:46 INFO mapred.JobClient:     Data-local map tasks=2
12/07/04 00:18:46 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10372
12/07/04 00:18:46 INFO mapred.JobClient:   File Input Format Counters 
12/07/04 00:18:46 INFO mapred.JobClient:     Bytes Read=147972
12/07/04 00:18:46 INFO mapred.JobClient:   File Output Format Counters 
12/07/04 00:18:46 INFO mapred.JobClient:     Bytes Written=170
12/07/04 00:18:46 INFO mapred.JobClient:   FileSystemCounters
12/07/04 00:18:46 INFO mapred.JobClient:     FILE_BYTES_READ=234
12/07/04 00:18:46 INFO mapred.JobClient:     HDFS_BYTES_READ=148184
12/07/04 00:18:46 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=66722
12/07/04 00:18:46 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=170
12/07/04 00:18:46 INFO mapred.JobClient:   Map-Reduce Framework
12/07/04 00:18:46 INFO mapred.JobClient:     Map output materialized bytes=240
12/07/04 00:18:46 INFO mapred.JobClient:     Map input records=13130
12/07/04 00:18:46 INFO mapred.JobClient:     Reduce shuffle bytes=120
12/07/04 00:18:46 INFO mapred.JobClient:     Spilled Records=24
12/07/04 00:18:46 INFO mapred.JobClient:     Map output bytes=223193
12/07/04 00:18:46 INFO mapred.JobClient:     Map input bytes=1777168
12/07/04 00:18:46 INFO mapred.JobClient:     Combine input records=13129
12/07/04 00:18:46 INFO mapred.JobClient:     SPLIT_RAW_BYTES=212
12/07/04 00:18:46 INFO mapred.JobClient:     Reduce input records=12
12/07/04 00:18:46 INFO mapred.JobClient:     Reduce input groups=6
12/07/04 00:18:46 INFO mapred.JobClient:     Combine output records=12
12/07/04 00:18:46 INFO mapred.JobClient:     Reduce output records=6
12/07/04 00:18:46 INFO mapred.JobClient:     Map output records=13129

