Hadoop: The Definitive Guide, ch04: Hadoop I/O

1. Hadoop comes with a set of primitives for data I/O. Some of these are techniques that are more general than Hadoop, such as data integrity and compression, but deserve
special consideration when dealing with multiterabyte datasets. Others are Hadoop tools or APIs that form the building blocks for developing distributed systems, such as
serialization frameworks and on-disk data structures.

2. Compression

[ate: /local/nomad2/hadoop/tomwhite-hadoop-book-32dae01 ]
>> echo "text" | hadoop StreamCompressor org.apache.hadoop.io.compress.GzipCodec | gunzip -
12/07/02 00:21:12 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/07/02 00:21:12 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
text

[ate: /local/nomad2/hadoop/tomwhite-hadoop-book-32dae01 ]
>> echo "text" | hadoop PooledStreamCompressor org.apache.hadoop.io.compress.GzipCodec | gunzip -
12/07/02 00:24:45 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/07/02 00:24:45 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
12/07/02 00:24:45 INFO compress.CodecPool: Got brand-new compressor
text
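
The two runs above compress standard input with a gzip codec and pipe the result to gunzip, which restores the original text. The same round trip can be sketched in Java. This is a minimal sketch that assumes the JDK's java.util.zip streams as a stand-in for Hadoop's GzipCodec, so it runs without Hadoop on the classpath; the class name GzipRoundTrip and its helpers are hypothetical and are not taken from the book's example code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Stand-in for the "StreamCompressor | gunzip" pipeline, built on JDK gzip
// streams instead of Hadoop's GzipCodec (an assumption, to stay self-contained).
class GzipRoundTrip {

    // Compresses the input bytes into the gzip format.
    static byte[] compress(byte[] input) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the gzip stream writes the trailer, so read the buffer afterwards.
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Decompresses gzip bytes, playing the role of "gunzip -".
    static byte[] decompress(byte[] compressed) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream gzip =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gzip.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = "text\n".getBytes(StandardCharsets.UTF_8);
        byte[] restored = decompress(compress(original));
        // Prints the original input, as gunzip does in the transcript.
        System.out.print(new String(restored, StandardCharsets.UTF_8));
    }
}
```

The pooled variant in the second run differs only in where the compressor object comes from (the log line shows CodecPool handing out a compressor); the bytes produced should be equivalent.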

Using compression in MapReduce

3. Serialization

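Serialization is the process of turning structured objects into a byte stream, either for transmission over a network or for writing to persistent storage. Deserialization is the reverse process of turning a byte stream back into a series of structured objects.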
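
As a small illustration of this definition, the sketch below turns an int into a byte stream and back using the JDK's DataOutputStream and DataInputStream. Hadoop's IntWritable writes its value in the same four-byte big-endian form, but the class name IntSerializationSketch and its helper methods are hypothetical and only stand in for the Hadoop types.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper: serializes an int the way a DataOutput-based type would.
class IntSerializationSketch {

    // Structured value -> byte stream (four bytes, big-endian).
    static byte[] serialize(int value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Byte stream -> structured value.
    static int deserialize(byte[] data) {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(data))) {
            return in.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Renders bytes as lowercase hex so the wire format is visible.
    static String toHex(byte[] data) {
        StringBuilder hex = new StringBuilder();
        for (byte b : data) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = serialize(163);
        System.out.println(toHex(bytes));       // prints 000000a3
        System.out.println(deserialize(bytes)); // prints 163
    }
}
```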

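Serialization is used in two quite distinct areas of distributed data processing: interprocess communication and persistent storage.

In Hadoop, interprocess communication between nodes is implemented using remote procedure calls (RPC).

Several serialization frameworks: Apache Thrift, Google's Protocol Buffers, and Avro.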

4. File-based data structures

4.1 SequenceFileDemo

[ate: /local/nomad2/hadoop/tomwhite-hadoop-book-32dae01 ]
>> hadoop SequenceFileWriteDemo numbers.seq
12/07/02 01:11:00 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/07/02 01:11:00 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
12/07/02 01:11:00 INFO compress.CodecPool: Got brand-new compressor
[128]   100     One, two, buckle my shoe
[173]   99      Three, four, shut the door
[220]   98      Five, six, pick up sticks
[264]   97      Seven, eight, lay them straight
[314]   96      Nine, ten, a big fat hen
[359]   95      One, two, buckle my shoe
[404]   94      Three, four, shut the door
[451]   93      Five, six, pick up sticks
[495]   92      Seven, eight, lay them straight
[545]   91      Nine, ten, a big fat hen
[590]   90      One, two, buckle my shoe
[635]   89      Three, four, shut the door
[682]   88      Five, six, pick up sticks
[726]   87      Seven, eight, lay them straight
[776]   86      Nine, ten, a big fat hen
[821]   85      One, two, buckle my shoe
[866]   84      Three, four, shut the door
[913]   83      Five, six, pick up sticks
[957]   82      Seven, eight, lay them straight
[1007]  81      Nine, ten, a big fat hen
[1052]  80      One, two, buckle my shoe
[1097]  79      Three, four, shut the door
[1144]  78      Five, six, pick up sticks
[1188]  77      Seven, eight, lay them straight
[1238]  76      Nine, ten, a big fat hen
[1283]  75      One, two, buckle my shoe
[1328]  74      Three, four, shut the door
[1375]  73      Five, six, pick up sticks
[1419]  72      Seven, eight, lay them straight
[1469]  71      Nine, ten, a big fat hen
[1514]  70      One, two, buckle my shoe
[1559]  69      Three, four, shut the door
[1606]  68      Five, six, pick up sticks
[1650]  67      Seven, eight, lay them straight
[1700]  66      Nine, ten, a big fat hen
[1745]  65      One, two, buckle my shoe
[1790]  64      Three, four, shut the door
[1837]  63      Five, six, pick up sticks
[1881]  62      Seven, eight, lay them straight
[1931]  61      Nine, ten, a big fat hen
[1976]  60      One, two, buckle my shoe
[2021]  59      Three, four, shut the door
[2088]  58      Five, six, pick up sticks
[2132]  57      Seven, eight, lay them straight
[2182]  56      Nine, ten, a big fat hen
[2227]  55      One, two, buckle my shoe
[2272]  54      Three, four, shut the door
[2319]  53      Five, six, pick up sticks
[2363]  52      Seven, eight, lay them straight
[2413]  51      Nine, ten, a big fat hen
[2458]  50      One, two, buckle my shoe
[2503]  49      Three, four, shut the door
[2550]  48      Five, six, pick up sticks
[2594]  47      Seven, eight, lay them straight
[2644]  46      Nine, ten, a big fat hen
[2689]  45      One, two, buckle my shoe
[2734]  44      Three, four, shut the door
[2781]  43      Five, six, pick up sticks
[2825]  42      Seven, eight, lay them straight
[2875]  41      Nine, ten, a big fat hen
[2920]  40      One, two, buckle my shoe
[2965]  39      Three, four, shut the door
[3012]  38      Five, six, pick up sticks
[3056]  37      Seven, eight, lay them straight
[3106]  36      Nine, ten, a big fat hen
[3151]  35      One, two, buckle my shoe
[3196]  34      Three, four, shut the door
[3243]  33      Five, six, pick up sticks
[3287]  32      Seven, eight, lay them straight
[3337]  31      Nine, ten, a big fat hen
[3382]  30      One, two, buckle my shoe
[3427]  29      Three, four, shut the door
[3474]  28      Five, six, pick up sticks
[3518]  27      Seven, eight, lay them straight
[3568]  26      Nine, ten, a big fat hen
[3613]  25      One, two, buckle my shoe
[3658]  24      Three, four, shut the door
[3705]  23      Five, six, pick up sticks
[3749]  22      Seven, eight, lay them straight
[3799]  21      Nine, ten, a big fat hen
[3844]  20      One, two, buckle my shoe
[3889]  19      Three, four, shut the door
[3936]  18      Five, six, pick up sticks
[3980]  17      Seven, eight, lay them straight
[4030]  16      Nine, ten, a big fat hen
[4075]  15      One, two, buckle my shoe
[4140]  14      Three, four, shut the door
[4187]  13      Five, six, pick up sticks
[4231]  12      Seven, eight, lay them straight
[4281]  11      Nine, ten, a big fat hen
[4326]  10      One, two, buckle my shoe
[4371]  9       Three, four, shut the door
[4418]  8       Five, six, pick up sticks
[4462]  7       Seven, eight, lay them straight
[4512]  6       Nine, ten, a big fat hen
[4557]  5       One, two, buckle my shoe
[4602]  4       Three, four, shut the door
[4649]  3       Five, six, pick up sticks
[4693]  2       Seven, eight, lay them straight
[4743]  1       Nine, ten, a big fat hen

[ate: /local/nomad2/hadoop/tomwhite-hadoop-book-32dae01 ]
>> hadoop SequenceFileReadDemo numbers.seq
12/07/02 01:15:49 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/07/02 01:15:49 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
12/07/02 01:15:49 INFO compress.CodecPool: Got brand-new decompressor
[128]   100     One, two, buckle my shoe
[173]   99      Three, four, shut the door
[220]   98      Five, six, pick up sticks
[264]   97      Seven, eight, lay them straight
[314]   96      Nine, ten, a big fat hen
[359]   95      One, two, buckle my shoe
[404]   94      Three, four, shut the door
[451]   93      Five, six, pick up sticks
[495]   92      Seven, eight, lay them straight
[545]   91      Nine, ten, a big fat hen
[590]   90      One, two, buckle my shoe
[635]   89      Three, four, shut the door
[682]   88      Five, six, pick up sticks
[726]   87      Seven, eight, lay them straight
[776]   86      Nine, ten, a big fat hen
[821]   85      One, two, buckle my shoe
[866]   84      Three, four, shut the door
[913]   83      Five, six, pick up sticks
[957]   82      Seven, eight, lay them straight
[1007]  81      Nine, ten, a big fat hen
[1052]  80      One, two, buckle my shoe
[1097]  79      Three, four, shut the door
[1144]  78      Five, six, pick up sticks
[1188]  77      Seven, eight, lay them straight
[1238]  76      Nine, ten, a big fat hen
[1283]  75      One, two, buckle my shoe
[1328]  74      Three, four, shut the door
[1375]  73      Five, six, pick up sticks
[1419]  72      Seven, eight, lay them straight
[1469]  71      Nine, ten, a big fat hen
[1514]  70      One, two, buckle my shoe
[1559]  69      Three, four, shut the door
[1606]  68      Five, six, pick up sticks
[1650]  67      Seven, eight, lay them straight
[1700]  66      Nine, ten, a big fat hen
[1745]  65      One, two, buckle my shoe
[1790]  64      Three, four, shut the door
[1837]  63      Five, six, pick up sticks
[1881]  62      Seven, eight, lay them straight
[1931]  61      Nine, ten, a big fat hen
[1976]  60      One, two, buckle my shoe
[2021*] 59      Three, four, shut the door
[2088]  58      Five, six, pick up sticks
[2132]  57      Seven, eight, lay them straight
[2182]  56      Nine, ten, a big fat hen
[2227]  55      One, two, buckle my shoe
[2272]  54      Three, four, shut the door
[2319]  53      Five, six, pick up sticks
[2363]  52      Seven, eight, lay them straight
[2413]  51      Nine, ten, a big fat hen
[2458]  50      One, two, buckle my shoe
[2503]  49      Three, four, shut the door
[2550]  48      Five, six, pick up sticks
[2594]  47      Seven, eight, lay them straight
[2644]  46      Nine, ten, a big fat hen
[2689]  45      One, two, buckle my shoe
[2734]  44      Three, four, shut the door
[2781]  43      Five, six, pick up sticks
[2825]  42      Seven, eight, lay them straight
[2875]  41      Nine, ten, a big fat hen
[2920]  40      One, two, buckle my shoe
[2965]  39      Three, four, shut the door
[3012]  38      Five, six, pick up sticks
[3056]  37      Seven, eight, lay them straight
[3106]  36      Nine, ten, a big fat hen
[3151]  35      One, two, buckle my shoe
[3196]  34      Three, four, shut the door
[3243]  33      Five, six, pick up sticks
[3287]  32      Seven, eight, lay them straight
[3337]  31      Nine, ten, a big fat hen
[3382]  30      One, two, buckle my shoe
[3427]  29      Three, four, shut the door
[3474]  28      Five, six, pick up sticks
[3518]  27      Seven, eight, lay them straight
[3568]  26      Nine, ten, a big fat hen
[3613]  25      One, two, buckle my shoe
[3658]  24      Three, four, shut the door
[3705]  23      Five, six, pick up sticks
[3749]  22      Seven, eight, lay them straight
[3799]  21      Nine, ten, a big fat hen
[3844]  20      One, two, buckle my shoe
[3889]  19      Three, four, shut the door
[3936]  18      Five, six, pick up sticks
[3980]  17      Seven, eight, lay them straight
[4030]  16      Nine, ten, a big fat hen
[4075*] 15      One, two, buckle my shoe
[4140]  14      Three, four, shut the door
[4187]  13      Five, six, pick up sticks
[4231]  12      Seven, eight, lay them straight
[4281]  11      Nine, ten, a big fat hen
[4326]  10      One, two, buckle my shoe
[4371]  9       Three, four, shut the door
[4418]  8       Five, six, pick up sticks
[4462]  7       Seven, eight, lay them straight
[4512]  6       Nine, ten, a big fat hen
[4557]  5       One, two, buckle my shoe
[4602]  4       Three, four, shut the door
[4649]  3       Five, six, pick up sticks
[4693]  2       Seven, eight, lay them straight
[4743]  1       Nine, ten, a big fat hen

View the contents that were written:

[ate: /local/nomad2/hadoop/tomwhite-hadoop-book-32dae01 ]
>> hadoop fs -text numbers.seq |less

Sorting and merging sequence files

>> hadoop jar /local/nomad2/hadoop/hadoop-0.20.203.0/hadoop-examples-0.20.203.0.jar sort -r 1 \
more?> -inFormat org.apache.hadoop.mapred.SequenceFileInputFormat \
more?> -outFormat org.apache.hadoop.mapred.SequenceFileOutputFormat \
more?> -outKey org.apache.hadoop.io.IntWritable \
more?> -outValue org.apache.hadoop.io.Text \
more?> numbers.seq sorted
Running on 1 nodes to sort from hdfs://localhost/user/nomad2/numbers.seq into hdfs://localhost/user/nomad2/sorted with 1 reduces.
Job started: Mon Jul 02 01:22:26 CST 2012
12/07/02 01:22:26 INFO mapred.FileInputFormat: Total input paths to process : 1
12/07/02 01:22:26 INFO mapred.JobClient: Running job: job_201207012246_0008
12/07/02 01:22:27 INFO mapred.JobClient:  map 0% reduce 0%
12/07/02 01:22:40 INFO mapred.JobClient:  map 100% reduce 0%
12/07/02 01:22:52 INFO mapred.JobClient:  map 100% reduce 100%
12/07/02 01:22:57 INFO mapred.JobClient: Job complete: job_201207012246_0008
12/07/02 01:22:57 INFO mapred.JobClient: Counters: 26
12/07/02 01:22:57 INFO mapred.JobClient:   Job Counters 
12/07/02 01:22:57 INFO mapred.JobClient:     Launched reduce tasks=1
12/07/02 01:22:57 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=16289
12/07/02 01:22:57 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/07/02 01:22:57 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/07/02 01:22:57 INFO mapred.JobClient:     Launched map tasks=2
12/07/02 01:22:57 INFO mapred.JobClient:     Data-local map tasks=2
12/07/02 01:22:57 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10069
12/07/02 01:22:57 INFO mapred.JobClient:   File Input Format Counters 
12/07/02 01:22:57 INFO mapred.JobClient:     Bytes Read=6613
12/07/02 01:22:57 INFO mapred.JobClient:   File Output Format Counters 
12/07/02 01:22:57 INFO mapred.JobClient:     Bytes Written=4005
12/07/02 01:22:57 INFO mapred.JobClient:   FileSystemCounters
12/07/02 01:22:57 INFO mapred.JobClient:     FILE_BYTES_READ=3306
12/07/02 01:22:57 INFO mapred.JobClient:     HDFS_BYTES_READ=6868
12/07/02 01:22:57 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=70016
12/07/02 01:22:57 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=4005
12/07/02 01:22:57 INFO mapred.JobClient:   Map-Reduce Framework
12/07/02 01:22:57 INFO mapred.JobClient:     Map output materialized bytes=3312
12/07/02 01:22:57 INFO mapred.JobClient:     Map input records=100
12/07/02 01:22:57 INFO mapred.JobClient:     Reduce shuffle bytes=2811
12/07/02 01:22:57 INFO mapred.JobClient:     Spilled Records=200
12/07/02 01:22:57 INFO mapred.JobClient:     Map output bytes=3100
12/07/02 01:22:57 INFO mapred.JobClient:     Map input bytes=4660
12/07/02 01:22:57 INFO mapred.JobClient:     Combine input records=0
12/07/02 01:22:57 INFO mapred.JobClient:     SPLIT_RAW_BYTES=190
12/07/02 01:22:57 INFO mapred.JobClient:     Reduce input records=100
12/07/02 01:22:57 INFO mapred.JobClient:     Reduce input groups=100
12/07/02 01:22:57 INFO mapred.JobClient:     Combine output records=0
12/07/02 01:22:57 INFO mapred.JobClient:     Reduce output records=100
12/07/02 01:22:57 INFO mapred.JobClient:     Map output records=100
Job ended: Mon Jul 02 01:22:57 CST 2012
The job took 31 seconds.

4.2 MapFile

[ate: /local/nomad2/hadoop/tomwhite-hadoop-book-32dae01 ]
>> hadoop MapFileWriteDemo numbers.map
12/07/02 01:27:49 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/07/02 01:27:49 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
12/07/02 01:27:49 INFO compress.CodecPool: Got brand-new compressor
12/07/02 01:27:49 INFO compress.CodecPool: Got brand-new compressor


Converting a SequenceFile to a MapFile
[ate: /local/nomad2/hadoop/tomwhite-hadoop-book-32dae01 ]
>> hadoop jar /local/nomad2/hadoop/hadoop-0.20.203.0/hadoop-examples-0.20.203.0.jar sort -r 1 \^J-inFormat>
Running on 1 nodes to sort from hdfs://localhost/user/nomad2/numbers.seq into hdfs://localhost/user/nomad2/numbers.map with 1 reduces.
Job started: Mon Jul 02 01:31:58 CST 2012
12/07/02 01:31:58 INFO mapred.FileInputFormat: Total input paths to process : 1
12/07/02 01:31:58 INFO mapred.JobClient: Running job: job_201207012246_0010
12/07/02 01:31:59 INFO mapred.JobClient:  map 0% reduce 0%
12/07/02 01:32:13 INFO mapred.JobClient:  map 100% reduce 0%
12/07/02 01:32:25 INFO mapred.JobClient:  map 100% reduce 100%
12/07/02 01:32:30 INFO mapred.JobClient: Job complete: job_201207012246_0010
12/07/02 01:32:30 INFO mapred.JobClient: Counters: 26
12/07/02 01:32:30 INFO mapred.JobClient:   Job Counters 
12/07/02 01:32:30 INFO mapred.JobClient:     Launched reduce tasks=1
12/07/02 01:32:30 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=16355
12/07/02 01:32:30 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/07/02 01:32:30 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/07/02 01:32:30 INFO mapred.JobClient:     Launched map tasks=2
12/07/02 01:32:30 INFO mapred.JobClient:     Data-local map tasks=2
12/07/02 01:32:30 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10036
12/07/02 01:32:30 INFO mapred.JobClient:   File Input Format Counters 
12/07/02 01:32:30 INFO mapred.JobClient:     Bytes Read=6613
12/07/02 01:32:30 INFO mapred.JobClient:   File Output Format Counters 
12/07/02 01:32:30 INFO mapred.JobClient:     Bytes Written=4005
12/07/02 01:32:30 INFO mapred.JobClient:   FileSystemCounters
12/07/02 01:32:30 INFO mapred.JobClient:     FILE_BYTES_READ=3306
12/07/02 01:32:30 INFO mapred.JobClient:     HDFS_BYTES_READ=6868
12/07/02 01:32:30 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=70031
12/07/02 01:32:30 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=4005
12/07/02 01:32:30 INFO mapred.JobClient:   Map-Reduce Framework
12/07/02 01:32:30 INFO mapred.JobClient:     Map output materialized bytes=3312
12/07/02 01:32:30 INFO mapred.JobClient:     Map input records=100
12/07/02 01:32:30 INFO mapred.JobClient:     Reduce shuffle bytes=3312
12/07/02 01:32:30 INFO mapred.JobClient:     Spilled Records=200
12/07/02 01:32:30 INFO mapred.JobClient:     Map output bytes=3100
12/07/02 01:32:30 INFO mapred.JobClient:     Map input bytes=4660
12/07/02 01:32:30 INFO mapred.JobClient:     Combine input records=0
12/07/02 01:32:30 INFO mapred.JobClient:     SPLIT_RAW_BYTES=190
12/07/02 01:32:30 INFO mapred.JobClient:     Reduce input records=100
12/07/02 01:32:30 INFO mapred.JobClient:     Reduce input groups=100
12/07/02 01:32:30 INFO mapred.JobClient:     Combine output records=0
12/07/02 01:32:30 INFO mapred.JobClient:     Reduce output records=100
12/07/02 01:32:30 INFO mapred.JobClient:     Map output records=100
Job ended: Mon Jul 02 01:32:30 CST 2012
The job took 32 seconds.

[ate: /local/nomad2/hadoop/tomwhite-hadoop-book-32dae01 ]
>> hadoop fs -mv numbers.map/part-00000 numbers.map/data

[ate: /local/nomad2/hadoop/tomwhite-hadoop-book-32dae01 ]
>> hadoop MapFileFixer numbers.map
12/07/02 01:33:31 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/07/02 01:33:31 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
12/07/02 01:33:31 INFO compress.CodecPool: Got brand-new compressor
Created MapFile numbers.map with 100 entries
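
The steps above sort the sequence file by key, rename the output to numbers.map/data, and then rebuild the index. The idea behind that index can be sketched as follows: a MapFile keeps its records sorted by key, and the index holds only a subset of the keys together with their positions, so a lookup searches the small index first and then scans a short run of records. This is a hedged in-memory sketch, not Hadoop's MapFile implementation: the class SparseIndexSketch, its methods, and the interval of 16 are all hypothetical, and positions are array slots rather than byte offsets.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a sparse index over sorted records.
// Not Hadoop's MapFile code; positions are array slots, not byte offsets.
class SparseIndexSketch {
    private static final String[] DATA = {
        "One, two, buckle my shoe",
        "Three, four, shut the door",
        "Five, six, pick up sticks",
        "Seven, eight, lay them straight",
        "Nine, ten, a big fat hen"
    };

    private final int[] keys;       // sorted ascending, like the sorted data file
    private final String[] values;
    private final List<Integer> indexKeys = new ArrayList<>();
    private final List<Integer> indexPositions = new ArrayList<>();

    SparseIndexSketch(int[] sortedKeys, String[] values, int interval) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        this.keys = sortedKeys;
        this.values = values;
        // Record only every interval-th key, together with where it lives.
        for (int i = 0; i < sortedKeys.length; i += interval) {
            indexKeys.add(sortedKeys[i]);
            indexPositions.add(i);
        }
    }

    // Rebuilds the 100 records from the transcript, sorted by key 1..100.
    static SparseIndexSketch numbersExample(int interval) {
        int[] keys = new int[100];
        String[] values = new String[100];
        for (int k = 1; k <= 100; k++) {
            keys[k - 1] = k;
            values[k - 1] = DATA[(100 - k) % DATA.length];
        }
        return new SparseIndexSketch(keys, values, interval);
    }

    int indexSize() {
        return indexKeys.size();
    }

    // Returns the value for key, or null if the key is absent.
    String get(int key) {
        // Binary search for the last index entry whose key is <= the target.
        int lo = 0, hi = indexKeys.size() - 1, entry = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (indexKeys.get(mid) <= key) {
                entry = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (entry < 0) {
            return null; // target is smaller than every key
        }
        // Scan forward from the indexed position until the key is found or passed.
        for (int i = indexPositions.get(entry); i < keys.length && keys[i] <= key; i++) {
            if (keys[i] == key) {
                return values[i];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        SparseIndexSketch index = numbersExample(16);
        System.out.println(index.indexSize() + " index entries for 100 records");
        System.out.println(index.get(59)); // prints Three, four, shut the door
    }
}
```

A smaller interval makes lookups scan fewer records at the cost of a larger index, which is the trade-off a sparse index is meant to expose.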
