5.1 Data Integrity
When the volume of data flowing through a system is as large as what Hadoop is designed to handle, the chance of data loss or corruption becomes significant.
- Countermeasure: compute a checksum when data first enters the system, and compute it again whenever the data passes through an untrusted channel, comparing the two. A checksum can only detect corruption, not repair it, so using ECC memory (Error-Correcting Code memory) is also recommended.
A commonly used error-detecting code is CRC-32, which computes a 32-bit integer checksum for input of any size (illustrated in the sketch below).
- Hadoop's ChecksumFileSystem uses CRC-32.
- HDFS uses CRC-32C, a more efficient variant.
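A small illustration of the CRC-32 idea (this is not Hadoop's internal implementation; the sketch simply uses java.util.zip.CRC32): the same bytes always produce the same 32-bit checksum, so recomputing and comparing it reveals corruption.
import java.util.zip.CRC32;
public class Crc32Demo {
    public static void main(String[] args) {
        byte[] data = "hello hadoop".getBytes();
        CRC32 crc = new CRC32();
        crc.update(data);                 // checksum computed when the data enters the system
        long expected = crc.getValue();   // a 32-bit value, regardless of input size
        data[0] ^= 1;                     // simulate corruption in transit
        crc.reset();
        crc.update(data);
        // a mismatch detects the error, but gives no way to repair it
        System.out.println(expected == crc.getValue() ? "data intact" : "data corrupted");
    }
}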
5.1.1 Data Integrity in HDFS
5.2 Compression
Two main benefits:
- Reduces the space needed to store files
- Speeds up data transfer across the network and to or from disk
Every compression tool trades off time against space: -1 optimizes for compression speed, -9 for compression ratio.
5.2.1 Codecs
A codec is an implementation of a compression/decompression algorithm; in Hadoop, codecs are classes that implement the CompressionCodec interface.
Table 5-2. Hadoop compression codecs
Compression format | CompressionCodec implementation |
---|---|
deflate | org.apache.hadoop.io.compress.DefaultCodec |
gzip | org.apache.hadoop.io.compress.GzipCodec |
lz4 | org.apache.hadoop.io.compress.Lz4Codec |
snappy | org.apache.hadoop.io.compress.SnappyCodec |
bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
1. Compressing and decompressing streams with CompressionCodec
Example: compressing standard input and writing the result to standard output:
package com.zyf.study5;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;
import java.io.IOException;
public class StreamCompressor {
public static void main(String[] args) {
try {
Class codecClass = GzipCodec.class;
CompressionCodec compressionCodec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, new Configuration());
CompressionOutputStream outputStream = compressionCodec.createOutputStream(System.out);
IOUtils.copyBytes(System.in, outputStream, 4096, false);
outputStream.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
Edit the Maven pom.xml and configure the maven-jar-plugin so the jar manifest's main class points at the entry point:
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-jar-plugin</artifactId>
    <configuration>
        <archive>
            <manifest>
                <mainClass>com.zyf.study5.StreamCompressor</mainClass>
            </manifest>
        </archive>
    </configuration>
</plugin>
Run the program:
> echo 'Text' | hadoop jar hadoop-first-1.0-SNAPSHOT.jar | gunzip
Text
2. Inferring a CompressionCodec with CompressionCodecFactory
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
import java.net.URI;
public class FileDecompressor {
private static final Logger LOGGER = LoggerFactory.getLogger(FileDecompressor.class);
public static void main(String[] args) {
if (args.length == 0) {
System.err.println("Usage: FileDecompressor ");
System.exit(-1);
}
LOGGER.info("input path is " + args[0]);
Path path = new Path(args[0]);
Configuration conf = new Configuration();
CompressionCodecFactory compressionCodecFactory = new CompressionCodecFactory(conf);
CompressionCodec compressionCodec = compressionCodecFactory.getCodec(path);
if (compressionCodec == null) {
System.err.println("No Codec found for " + args[0]);
System.exit(-1);
}
try {
URI uri = URI.create(args[0]);
FileSystem fileSystem = FileSystem.get(uri, conf, "ossuser");
FSDataInputStream fsDataInputStream = fileSystem.open(path);
CompressionInputStream compressionInputStream = compressionCodec.createInputStream(fsDataInputStream);
String fileName = CompressionCodecFactory.removeSuffix(path.getName(), compressionCodec.getDefaultExtension());
LOGGER.info("output path is " + fileName);
FSDataOutputStream fsDataOutputStream = fileSystem.create(new Path(fileName));
IOUtils.copyBytes(compressionInputStream, fsDataOutputStream, 4096, false);
fsDataOutputStream.close();
fsDataInputStream.close();
} catch (IOException e) {
e.printStackTrace();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
Run the program:
> hadoop jar hadoop-first-1.0-SNAPSHOT.jar hdfs://127.0.0.1:9000/user/ossuser/test.txt.gz
19/05/07 18:57:46 INFO study5.FileDecompressor: input path is hdfs://127.0.0.1:9000/user/ossuser/test.txt.gz
19/05/07 18:57:48 WARN zlib.ZlibFactory: Failed to load/initialize native-zlib library
19/05/07 18:57:48 INFO compress.CodecPool: Got brand-new decompressor [.gz]
19/05/07 18:57:48 INFO study5.FileDecompressor: output path is test.txt
> hadoop jar hadoop-first-1.0-SNAPSHOT.jar hdfs://127.0.0.1:9000/user/ossuser/test.txt.bz2
19/05/07 18:57:46 INFO study5.FileDecompressor: input path is hdfs://127.0.0.1:9000/user/ossuser/test.txt.bz2
19/05/07 19:01:17 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version
19/05/07 19:01:17 INFO compress.CodecPool: Got brand-new decompressor [.bz2]
19/05/07 19:01:17 INFO study5.FileDecompressor: output path is test.txt
When the program finishes, a test.txt file containing the decompressed contents appears under the /user/ossuser/ directory.
CompressionCodecFactory loads all of the codec implementations listed in Table 5-2, plus any listed in the io.compression.codecs property; each codec knows its own default file extension.
3. Native libraries
Using native libraries gives better performance: compared with the Java implementation, native gzip roughly halves decompression time and cuts compression time by about 10%. The library location can be specified with the java.library.path system property, set in the scripts under etc/hadoop/, or set manually in the application. Native libraries can be disabled with io.native.lib.available=false.
4. CodecPool
If you compress or decompress many streams, CodecPool lets you reuse compressors and decompressors, amortizing the cost of creating them:
package com.zyf.study5;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.util.ReflectionUtils;
import java.io.IOException;
public class PooledStreamCompressor {
public static void main(String[] args) {
if (args.length == 0) {
System.err.println("Usage: PooledStreamCompressor ");
System.exit(-1);
}
Compressor compressor = null;
try {
Class classOfCodec = Class.forName(args[0]);
Configuration conf = new Configuration();
CompressionCodec compressionCodec = (CompressionCodec) ReflectionUtils.newInstance(classOfCodec, conf);
compressor = CodecPool.getCompressor(compressionCodec);
CompressionOutputStream outputStream = compressionCodec.createOutputStream(System.out, compressor);
IOUtils.copyBytes(System.in, outputStream, 4096, false);
outputStream.finish();
} catch (ClassNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} finally {
CodecPool.returnCompressor(compressor);
}
}
}
Run the program:
> echo 'gzip output' | hadoop jar hadoop-first-1.0-SNAPSHOT.jar org.apache.hadoop.io.compress.GzipCodec | gunzip
2019-05-07 19:37:57,413 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-05-07 19:37:57,414 INFO compress.CodecPool: Got brand-new compressor [.gz]
gzip output
5.2.2 Compression and Input Splits
gzip does not support splitting.
bzip2 does support splitting.
Which compression format should you use?
It depends on the size and format of your files and the tools you use. The following suggestions are ordered roughly from most to least effective:
- Use a container file format such as SequenceFile, Avro, ORCFile, or Parquet. These formats support both compression and splitting, ideally combined with a fast compressor such as LZO, LZ4, or Snappy.
- Use a compression format that supports splitting, such as bzip2, or one that can be indexed to support splitting, such as LZO.
- Split the file into chunks in the application and compress each chunk separately, choosing the chunk size so that each compressed chunk is roughly the size of an HDFS block.
- Store the files uncompressed.
For large files, do not use a compression format that does not support splitting: you lose data locality and the MapReduce job becomes inefficient.
5.2.3 Using Compression in MapReduce
- Input: if the input files are compressed, MapReduce uses CompressionCodecFactory to infer the codec from the file extension and decompresses the files automatically as they are read.
- Output: set mapreduce.output.fileoutputformat.compress=true to compress the output and mapreduce.output.fileoutputformat.compress.codec to choose the codec; alternatively, configure it through FileOutputFormat:
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
Run the program:
> hadoop jar hadoop-first-1.0-SNAPSHOT.jar input/input.txt.gz output
...
19/05/08 09:40:05 WARN zlib.ZlibFactory: Failed to load/initialize native-zlib library
19/05/08 09:40:05 INFO compress.CodecPool: Got brand-new decompressor [.gz]
Output files:
._SUCCESS.crc
.part-r-00000.gz.crc
_SUCCESS
part-r-00000.gz
If the output is written as a SequenceFile, i.e. job.setOutputFormatClass(SequenceFileOutputFormat.class) (note: the compression codec must then be compatible; the default is deflate, and setting FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class) will not work), you can set mapreduce.output.fileoutputformat.compress.type to control how compression is applied. The default is RECORD, which compresses each record individually; BLOCK is recommended.
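A minimal sketch of that configuration, assuming the new MapReduce API (org.apache.hadoop.mapreduce.lib.output); mapper/reducer and path setup are omitted:
Job job = Job.getInstance(new Configuration());
job.setOutputFormatClass(SequenceFileOutputFormat.class);
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);
// block compression groups many records together for a better compression ratio
SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);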
Property | Type | Default | Description |
---|---|---|---|
mapreduce.output.fileoutputformat.compress | boolean | false | Whether to compress the job output |
mapreduce.output.fileoutputformat.compress.codec | class name | org.apache.hadoop.io.compress.DefaultCodec | The compression codec for the output |
mapreduce.output.fileoutputformat.compress.type | String | RECORD | NONE, RECORD, or BLOCK |
Compressing map output
Compressing the intermediate map output with a fast codec such as LZO, Snappy, or LZ4 reduces the amount of data transferred over the network and can improve performance.
Property | Type | Default | Description |
---|---|---|---|
mapreduce.map.output.compress | boolean | false | Whether to compress map outputs |
mapreduce.map.output.compress.codec | class name | org.apache.hadoop.io.compress.DefaultCodec | The compression codec for map outputs |
Set map output compression in code as follows:
Configuration conf = new Configuration();
conf.setBoolean(Job.MAP_OUTPUT_COMPRESS, true);
conf.setClass(Job.MAP_OUTPUT_COMPRESS_CODEC, DefaultCodec.class, CompressionCodec.class);
Job job = Job.getInstance(conf);
5.3 Serialization
Serialization turns structured objects into a byte stream. It is used in two main areas of distributed data processing: interprocess communication and permanent storage. Hadoop uses RPC for interprocess communication: the RPC layer serializes a message into a binary stream and sends it to the remote node, which deserializes the binary stream back into the original message. It is desirable for an RPC serialization format to be:
- Compact: makes the best use of network bandwidth (and storage space)
- Fast: low serialization and deserialization overhead (efficient reading and writing of data)
- Extensible: the protocol stays backward and forward compatible (old formats can be read transparently)
- Interoperable: clients and servers written in different languages can interact (data in permanent storage can be read and written from different languages)
Deserialization turns a byte stream back into structured objects.
Hadoop's serialization format is Writable: compact and fast, but not easy to use or extend from languages other than Java. Avro overcomes some of Writable's shortcomings.
5.3.1 The Writable Interface
package org.apache.hadoop.io;
...
public interface Writable {
void write(DataOutput out) throws IOException;
void readFields(DataInput in) throws IOException;
}
1. Implementing the Writable interface
package com.zyf.study5;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.StringUtils;
import java.io.*;
public class MyWritable implements Writable {
private String name;
private int score;
public MyWritable() {
super();
}
public MyWritable(String name, int score) {
this.name = name;
this.score = score;
}
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(name);
out.writeInt(score);
}
@Override
public void readFields(DataInput in) throws IOException {
this.name = in.readUTF();
this.score = in.readInt();
}
public static byte[] serialize(MyWritable writable) throws IOException {
if (writable == null) {
return null;
}
ByteArrayOutputStream baos = new ByteArrayOutputStream();
DataOutputStream outputStream = new DataOutputStream(baos);
writable.write(outputStream);
outputStream.close();
return baos.toByteArray();
}
public static MyWritable deserialize(byte[] value) throws IOException {
if (value == null || value.length == 0) {
return null;
}
ByteArrayInputStream bais = new ByteArrayInputStream(value);
DataInputStream inputStream = new DataInputStream(bais);
MyWritable fromBytes = new MyWritable();
fromBytes.readFields(inputStream);
return fromBytes;
}
@Override
public String toString() {
return "MyWritable{" +
"name='" + name + '\'' +
", score=" + score +
'}';
}
public static void main(String[] args) throws IOException {
MyWritable myWritable = new MyWritable("abcdef", 33);
byte[] value = MyWritable.serialize(myWritable);
System.out.println(StringUtils.byteToHexString(value));
MyWritable fromBytes = MyWritable.deserialize(value);
System.out.println(fromBytes);
}
}
Output:
000661626364656600000021
MyWritable{name='abcdef', score=33}
In the serialized byte array, 0006 is the string length (6), 616263646566 is the UTF-8 encoding of "abcdef", and the final 4 bytes (00000021) are the integer 33.
2. The WritableComparable interface and comparators
IntWritable implements the WritableComparable interface, which extends both Writable and Comparable.
package org.apache.hadoop.io;
public class IntWritable implements WritableComparable<IntWritable> {
/** A Comparator optimized for IntWritable. */
public static class Comparator extends WritableComparator {
public Comparator() {
super(IntWritable.class);
}
@Override
public int compare(byte[] b1, int s1, int l1,
byte[] b2, int s2, int l2) {
int thisValue = readInt(b1, s1);
int thatValue = readInt(b2, s2);
return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
    }
  }
}
Type comparison is crucial in MapReduce, which has a key-sorting phase in the middle. Hadoop also provides an optimized comparator: WritableComparator implements RawComparator<WritableComparable> and Configurable.
package org.apache.hadoop.io;
import org.apache.hadoop.io.serializer.DeserializerComparator;
/**
*
* A {@link Comparator} that operates directly on byte representations of
* objects.
*
* @param <T>
* @see DeserializerComparator
*/
public interface RawComparator<T> extends Comparator<T> {
/**
* Compare two objects in binary.
* b1[s1:l1] is the first object, and b2[s2:l2] is the second object.
*
* @param b1 The first byte array.
* @param s1 The position index in b1. The object under comparison's starting index.
* @param l1 The length of the object in b1.
* @param b2 The second byte array.
* @param s2 The position index in b2. The object under comparison's starting index.
* @param l2 The length of the object under comparison in b2.
* @return An integer result of the comparison.
*/
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}
This interface lets implementations compare records directly in their byte-stream form, without first deserializing them into objects, avoiding the overhead of object creation. IntWritable.Comparator, for example, reads the integers straight out of the byte arrays and compares them. WritableComparator is a general-purpose implementation of RawComparator that provides two things:
1. A default implementation of the raw compare() method, which deserializes the bytes from the stream into objects and then invokes the objects' compare() method;
2. A factory for RawComparator instances: calling
public static WritableComparator get(Class<? extends WritableComparable> c)
returns the comparator for the given type. For example:
byte[] b1 = new byte[] {0, 0, 0, 15};
byte[] b2 = new byte[] {0, 0, 0, 13};
WritableComparator comparator = WritableComparator.get(IntWritable.class);
int compare = comparator.compare(b1, 0, 4, b2, 0, 4);
or
IntWritable intWritable1 = new IntWritable(100);
IntWritable intWritable2 = new IntWritable(101);
compare = comparator.compare(intWritable1, intWritable2);
5.3.2 Writable Implementations
1) Writable classes for Java primitives
Java primitive | Writable implementation | Serialized size (bytes) |
---|---|---|
boolean | BooleanWritable | 1 |
byte | ByteWritable | 1 |
short | ShortWritable | 2 |
char | none (IntWritable can be used) | - |
int | IntWritable | 4 |
 | VIntWritable | 1-5 |
long | LongWritable | 8 |
 | VLongWritable | 1-9 |
float | FloatWritable | 4 |
double | DoubleWritable | 8 |
Integers come in fixed-length and variable-length encodings. When the values to encode are mostly small, the variable-length encoding saves space: 127, for example, needs only one byte.
The fixed-length encoding suits values that are uniformly distributed across the whole value space, such as the output of a well-designed hash function.
In most cases value distributions are not uniform, so the variable-length encoding generally saves space.
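A quick way to see the difference is to serialize the same value both ways; a minimal sketch (class and method names are illustrative):
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.VIntWritable;
import org.apache.hadoop.io.Writable;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
public class VarIntSizeDemo {
    static int serializedSize(Writable w) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        w.write(new DataOutputStream(baos));
        return baos.size();
    }
    public static void main(String[] args) throws IOException {
        System.out.println(serializedSize(new IntWritable(127)));      // 4 bytes, fixed length
        System.out.println(serializedSize(new VIntWritable(127)));     // 1 byte
        System.out.println(serializedSize(new VIntWritable(1000000))); // grows with the value, at most 5 bytes
    }
}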
2) Text
Text is the Writable class for UTF-8 byte sequences and is generally regarded as the Writable equivalent of java.lang.String.
Text records the string's byte length as a variable-length integer, so it can hold up to 2 GB of data, and it uses standard UTF-8 encoding.
package org.apache.hadoop.io;
public class Text extends BinaryComparable
implements WritableComparable<BinaryComparable> {
private static final ThreadLocal<CharsetEncoder> ENCODER_FACTORY =
new ThreadLocal<CharsetEncoder>() {
@Override
protected CharsetEncoder initialValue() {
return Charset.forName("UTF-8").newEncoder().
onMalformedInput(CodingErrorAction.REPORT).
onUnmappableCharacter(CodingErrorAction.REPORT);
}
};
/** Set to contain the contents of a string.
*/
public void set(String string) {
try {
ByteBuffer bb = encode(string, true);
bytes = bb.array();
length = bb.limit();
}catch(CharacterCodingException e) {
throw new RuntimeException("Should not have happened ", e);
}
}
public static ByteBuffer encode(String string, boolean replace)
throws CharacterCodingException {
CharsetEncoder encoder = ENCODER_FACTORY.get();
if (replace) {
encoder.onMalformedInput(CodingErrorAction.REPLACE);
encoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
}
ByteBuffer bytes =
encoder.encode(CharBuffer.wrap(string.toCharArray()));
if (replace) {
encoder.onMalformedInput(CodingErrorAction.REPORT);
encoder.onUnmappableCharacter(CodingErrorAction.REPORT);
}
return bytes;
}
@Override
public void write(DataOutput out) throws IOException {
WritableUtils.writeVInt(out, length);
out.write(bytes, 0, length);
}
/** Write a UTF8 encoded string to out
*/
public static int writeString(DataOutput out, String s) throws IOException {
ByteBuffer bytes = encode(s);
int length = bytes.limit();
WritableUtils.writeVInt(out, length);
out.write(bytes.array(), 0, length);
return length;
}
}
Text indexing is in terms of byte positions within the UTF-8 encoded byte sequence, whereas String indexing is in terms of char (UTF-16 code units):
String str = "我要学Hadoop,哈";
Text text = new Text(str);
print("str.length()=" + str.length());
print("text.getLength()=" + text.getLength());
byte[] bytes = text.getBytes();
print("text.getBytes()=" + StringUtils.byteToHexString(bytes));
print("text.getBytes().length=" + bytes.length);
print("text.find(\"我\")=" + text.find("我"));
print("text.find(\"要\")=" + text.find("要"));
print("text.find(\"学\")=" + text.find("学"));
print("text.find(\"o\")=" + text.find("o"));
print(str.charAt(2));
print((char) text.charAt(6));
print(text.charAt(100));
//print(str.charAt(100));
Output:
str.length()=11
text.getLength()=19
text.getBytes()=e68891e8a681e5ada64861646f6f702ce59388000000000000
text.getBytes().length=25
text.find("我")=0
text.find("要")=3
text.find("学")=6
text.find("o")=12
学
学
-1
- String.length() returns the number of UTF-16 code units in the string, while text.getLength() returns the number of bytes in its UTF-8 encoding. The array returned by text.getBytes() may contain unused trailing bytes, as in the example above, which is why text.getBytes().length can be larger than the actual data length.
- text.find() is analogous to string.indexOf(), but the former returns a byte offset and the latter a character offset.
- Both String and Text have a charAt() method: String.charAt() returns the character at the given index and throws StringIndexOutOfBoundsException if the index exceeds the string length, whereas Text.charAt() takes a byte offset and returns -1 if the offset is past the end of the byte array.
Unicode code point | U+0041 | U+00DF | U+6771 | U+10400 |
---|---|---|---|---|
Name | A | ß | 東 | (a supplementary character, represented as a surrogate pair) |
Java representation | \u0041 | \u00DF | \u6771 | \uD801\uDC00 |
String s = "\u0041\u00DF\u6771\uD801\uDC00";
Text t = new Text(s);
print(s.length() + ", " + t.getLength());
print(s.getBytes(Charset.forName("UTF-8")).length + ", " + t.getLength());
print(s.indexOf("\u0041") + ", " + t.find("\u0041"));
print(s.indexOf("\u00DF") + ", " + t.find("\u00DF"));
print(s.indexOf("\u6771") + ", " + t.find("\u6771"));
print(s.indexOf("\uD801\uDC00") + ", " + t.find("\uD801\uDC00"));
print(s.charAt(0) + ", " + (char)t.charAt(0));
print(s.charAt(1) + ", " + (char)t.charAt(1));
print(s.charAt(2) + ", " + (char)t.charAt(3));
print(s.charAt(3) + ", " + (char)t.charAt(6));
print(s.codePointAt(0) == 0x0041);
print(s.codePointAt(1) == 0x00DF);
print(s.codePointAt(2) == 0x6771);
print(s.codePointAt(3) == 0x10400);
Output:
5, 10
10, 10
0, 0
1, 1
2, 3
3, 6
A, A
ß, ß
東, 東
?, Ѐ
true
true
true
true
As shown above, the length of a Text object is the number of bytes in its UTF-8 encoding (1 + 2 + 3 + 4 = 10).
Iterating over Text:
String s1 = "\u0041\u00DF\u6771\uD801\uDC00";
Text t1 = new Text(s1);
ByteBuffer byteBuffer = ByteBuffer.wrap(t1.getBytes(), 0, t1.getLength());
int value;
while(byteBuffer.hasRemaining() && (value = Text.bytesToCodePoint(byteBuffer)) != -1) {
print(Integer.toHexString(value));
}
Output:
41
df
6771
10400
Text is mutable: its value can be changed by calling set(). The Text API is less rich than String's, so it is common to convert a Text to a String for manipulation, as in the short sketch below.
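A minimal sketch of reusing a single Text instance and dropping down to String for the richer API:
Text t = new Text("hadoop");
t.set(new Text("pig"));               // reuse the same object with new contents
t.set("hive");                        // set() also accepts a String
String s = t.toString();              // convert to String for the richer String API
System.out.println(s.toUpperCase());  // HIVE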
3) BytesWritable
BytesWritable is a wrapper for an array of binary data. It is mutable, and the length of the array returned by getBytes() does not necessarily reflect the true size of the stored data (use getLength() for that). The serialized format is a 4-byte integer giving the number of data bytes, followed by the data itself.
BytesWritable bytesWritable = new BytesWritable(new byte[]{3, 5, 7});
print(StringUtils.byteToHexString(serialize(bytesWritable)));
Output:
00000003030507
4) NullWritable
NullWritable is a special kind of Writable with a zero-length serialization: it reads no bytes from the stream and writes none to it. It acts as a placeholder; in MapReduce, for example, a key or value slot that is not needed can be declared as NullWritable, which stores the constant empty value efficiently. NullWritable can also be used as the key of a SequenceFile.
public class NullWritable implements WritableComparable<NullWritable> {
private static final NullWritable THIS = new NullWritable();
private NullWritable() {} // no public ctor
public static NullWritable get() { return THIS; }
@Override
public String toString() {
return "(null)";
}
@Override
public int compareTo(NullWritable other) {
return 0;
}
@Override
public void readFields(DataInput in) throws IOException {}
@Override
public void write(DataOutput out) throws IOException {}
/** A Comparator "optimized" for NullWritable. */
public static class Comparator extends WritableComparator {
public Comparator() {
super(NullWritable.class);
}
@Override
public int compare(byte[] b1, int s1, int l1,
byte[] b2, int s2, int l2) {
return 0;
}
}
}
5) ObjectWritable and GenericWritable
ObjectWritable is a general-purpose wrapper for Strings, Java primitives, enums, Writables, and arrays of these types. It is useful when a field can hold values of more than one type: if, say, the values of a SequenceFile have several types, the value can be declared as ObjectWritable and each instance wrapped in an ObjectWritable. The drawback is that every serialization writes the wrapped type's fully qualified class name, which wastes a lot of space; GenericWritable addresses this.
public class ObjectWritable implements Writable, Configurable {
public static void writeObject(DataOutput out, Object instance,
Class declaredClass, Configuration conf, boolean allowCompactArrays)
throws IOException {
if (instance == null) { // null
instance = new NullInstance(declaredClass, conf);
declaredClass = Writable.class;
}
// Special case: must come before writing out the declaredClass.
// If this is an eligible array of primitives,
// wrap it in an ArrayPrimitiveWritable$Internal wrapper class.
if (allowCompactArrays && declaredClass.isArray()
&& instance.getClass().getName().equals(declaredClass.getName())
&& instance.getClass().getComponentType().isPrimitive()) {
instance = new ArrayPrimitiveWritable.Internal(instance);
declaredClass = ArrayPrimitiveWritable.Internal.class;
}
UTF8.writeString(out, declaredClass.getName()); // always write declared
if (declaredClass.isArray()) { // non-primitive or non-compact array
int length = Array.getLength(instance);
out.writeInt(length);
for (int i = 0; i < length; i++) {
writeObject(out, Array.get(instance, i),
declaredClass.getComponentType(), conf, allowCompactArrays);
}
} else if (declaredClass == ArrayPrimitiveWritable.Internal.class) {
((ArrayPrimitiveWritable.Internal) instance).write(out);
} else if (declaredClass == String.class) { // String
UTF8.writeString(out, (String)instance);
} else if (declaredClass.isPrimitive()) { // primitive type
if (declaredClass == Boolean.TYPE) { // boolean
out.writeBoolean(((Boolean)instance).booleanValue());
} else if (declaredClass == Character.TYPE) { // char
out.writeChar(((Character)instance).charValue());
} else if (declaredClass == Byte.TYPE) { // byte
out.writeByte(((Byte)instance).byteValue());
} else if (declaredClass == Short.TYPE) { // short
out.writeShort(((Short)instance).shortValue());
} else if (declaredClass == Integer.TYPE) { // int
out.writeInt(((Integer)instance).intValue());
} else if (declaredClass == Long.TYPE) { // long
out.writeLong(((Long)instance).longValue());
} else if (declaredClass == Float.TYPE) { // float
out.writeFloat(((Float)instance).floatValue());
} else if (declaredClass == Double.TYPE) { // double
out.writeDouble(((Double)instance).doubleValue());
} else if (declaredClass == Void.TYPE) { // void
} else {
throw new IllegalArgumentException("Not a primitive: "+declaredClass);
}
} else if (declaredClass.isEnum()) { // enum
UTF8.writeString(out, ((Enum)instance).name());
} else if (Writable.class.isAssignableFrom(declaredClass)) { // Writable
UTF8.writeString(out, instance.getClass().getName());
((Writable)instance).write(out);
} else if (Message.class.isAssignableFrom(declaredClass)) {
((Message)instance).writeDelimitedTo(
DataOutputOutputStream.constructOutputStream(out));
} else {
throw new IOException("Can't write: "+instance+" as "+declaredClass);
}
}
}
public class UTF8 implements WritableComparable<UTF8> {
public static int writeString(DataOutput out, String s) throws IOException {
int len = utf8Length(s);
out.writeShort(len);
writeChars(out, s, 0, s.length());
return len;
}
}
Serializing objects with ObjectWritable:
public static byte[] serialize(Object object) throws IOException {
ObjectWritable objectWritable = new ObjectWritable(object);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
DataOutputStream outputStream = new DataOutputStream(baos);
objectWritable.write(outputStream);
return baos.toByteArray();
}
public static void main(String[] args) throws IOException {
Person p = new Person("z1", 1);
byte[] values = serialize(p);
print(StringUtils.byteToHexString(values));
int[] array = new int[]{3, 5, 10};
values = serialize(array);
print(StringUtils.byteToHexString(values));
Person[] ps = new Person[] {new Person("z1", 1), new Person("z2", 2)};
values = serialize(ps);
print(StringUtils.byteToHexString(values));
}
Output explained
# Output 1 (spaces and line breaks added for readability; the actual output has none)
0028 636f6d2e7a79662e7374756479352e4f626a6563745772697461626c6544656d6f24506572736f6e
0028 636f6d2e7a79662e7374756479352e4f626a6563745772697461626c6544656d6f24506572736f6e 0002 7a31 00000001
# Explanation 1
{string length: 40} com.zyf.study5.ObjectWritableDemo$Person
{string length: 40} com.zyf.study5.ObjectWritableDemo$Person {string length: 2} z1 1
# Output 2 (spaces added for readability; the actual output has none)
0002 5b49 00000003 0003696e74 00000003 0003696e74 00000005 0003696e74 0000000a
# Explanation 2
{string length: 2} [I {array length: 3} {string length: 3} int 3 {string length: 3} int 5 {string length: 3} int 10
# Output 3 (spaces and line breaks added for readability; the actual output has none)
002b 5b4c 636f6d2e7a79662e7374756479352e4f626a6563745772697461626c6544656d6f24506572736f6e3b 00000002
0028 636f6d2e7a79662e7374756479352e4f626a6563745772697461626c6544656d6f24506572736f6e
0028 636f6d2e7a79662e7374756479352e4f626a6563745772697461626c6544656d6f24506572736f6e 0002 7a31 00000001
0028 636f6d2e7a79662e7374756479352e4f626a6563745772697461626c6544656d6f24506572736f6e
0028 636f6d2e7a79662e7374756479352e4f626a6563745772697461626c6544656d6f24506572736f6e 0002 7a32 00000002
# Explanation 3:
{string length: 43} [Lcom.zyf.study5.ObjectWritableDemo$Person; {array length: 2}
{string length: 40} com.zyf.study5.ObjectWritableDemo$Person
{string length: 40} com.zyf.study5.ObjectWritableDemo$Person {string length: 2} z1 1
{string length: 40} com.zyf.study5.ObjectWritableDemo$Person
{string length: 40} com.zyf.study5.ObjectWritableDemo$Person {string length: 2} z2 2
The output shows that ObjectWritable's serialized form is quite long. For an array, it first serializes the array's fully qualified class name, then the array length; for each element it recursively calls writeObject, which starts by writing the element's declared class name, and because the element is a Writable, that branch writes the element's actual class name once more before calling the element's write() method to serialize its fields. Using GenericWritable instead greatly reduces the space taken up by class names, as follows:
public static void main(String[] args) throws IOException {
Person p = new Person("z1", 1);
CustomGenericWritable customGenericWritable = new CustomGenericWritable();
customGenericWritable.set(p);
byte[] values = serialize(customGenericWritable);
print(StringUtils.byteToHexString(values));
MyWritable myWritable = new MyWritable("abcdef", 33);
customGenericWritable.set(myWritable);
values = serialize(customGenericWritable);
print(StringUtils.byteToHexString(values));
}
static class CustomGenericWritable extends GenericWritable {
final static Class[] TYPES = new Class[] {MyWritable.class, Person.class};
@Override
protected Class<? extends Writable>[] getTypes() {
return TYPES;
}
}
Output:
# Output 1:
01 0002 7a31 00000001
# Explanation 1:
{type id: index of the class in the TYPES array} {string length: 2} z1 1
# Output 2:
00 0006 616263646566 00000021
# Explanation 2:
{type id: index of the class in the TYPES array} {string length: 6} abcdef 33
6) Writable collections
The org.apache.hadoop.io package contains six Writable collection classes: ArrayWritable, ArrayPrimitiveWritable, TwoDArrayWritable, MapWritable, SortedMapWritable, and EnumSetWritable.
ArrayWritable and TwoDArrayWritable are Writable wrappers for one- and two-dimensional arrays of Writables; all elements must be instances of the same Writable class. Both provide a toArray() method that creates a shallow copy of the array. To store elements of different Writable types, wrap them in GenericWritable.
public Object toArray() {
Object result = Array.newInstance(valueClass, values.length);
for (int i = 0; i < values.length; i++) {
Array.set(result, i, values[i]);
}
return result;
}
Example:
public static void main(String[] args) throws IOException {
MyWritable[] myWritables = new MyWritable[] {new MyWritable("z1", 1), new MyWritable("z2", 2)};
ArrayWritable arrayWritable = new ArrayWritable(MyWritable.class, myWritables);
byte[] values = serialize(arrayWritable);
System.out.println(StringUtils.byteToHexString(values));
}
Output:
00000002 0002 7a31 00000001 0002 7a32 00000002
{array length: 2} {string length: 2} z1 1 {string length: 2} z2 2
ArrayPrimitiveWritable is a wrapper for arrays of Java primitives. Example:
public static void main(String[] args) throws IOException {
int[] array = new int[]{1, 3, 5, 7};
ArrayPrimitiveWritable writable = new ArrayPrimitiveWritable(array);
byte[] bytes = serialize(writable);
System.out.println(StringUtils.byteToHexString(bytes));
}
Output:
0003 696e74 00000004 00000001 00000003 00000005 00000007
MapWritable extends AbstractMapWritable and implements the Map interface, storing its key-value pairs in a HashMap. AbstractMapWritable maintains a mapping between key/value classes and numeric ids: when put() inserts a pair whose key and/or value class is not yet in the mapping, the class is added and assigned the next id, counting up from 1.
public class MapWritable extends AbstractMapWritable
implements Map<Writable, Writable> {
}
Example:
public static void main(String[] args) throws IOException {
MapWritable mapWritable = new MapWritable();
IntWritable intWritable1 = new IntWritable(1);
IntWritable intWritable2 = new IntWritable(2);
IntWritable intWritable3 = new IntWritable(3);
mapWritable.put(intWritable1, new MyWritable("z1", 1));
mapWritable.put(intWritable2, new MyWritable("z2", 2));
mapWritable.put(intWritable3, new ObjectWritableDemo.Person("z3", 2));
byte[] bytes = serialize(mapWritable);
System.out.println(StringUtils.byteToHexString(bytes));
}
Output:
02
01 0019 636f6d2e7a79662e7374756479352e4d795772697461626c65
02 0028 636f6d2e7a79662e7374756479352e4f626a6563745772697461626c6544656d6f24506572736f6e
00000003
85 00000001 01 0002 7a31 00000001
85 00000002 01 0002 7a32 00000002
85 00000003 02 0002 7a33 00000002
# Explanation:
{number of custom class-to-id entries: 2}
{id: 1}{string length: 25}com.zyf.study5.MyWritable
{id: 2}{string length: 40}com.zyf.study5.ObjectWritableDemo$Person
{map.size: 3}
{key class id: -123 (0x85), the predefined id for IntWritable}{key: 1}{value class id: 1}{string length: 2}z1 1
{key class id: -123 (0x85), the predefined id for IntWritable}{key: 2}{value class id: 1}{string length: 2}z2 2
{key class id: -123 (0x85), the predefined id for IntWritable}{key: 3}{value class id: 2}{string length: 2}z3 2
SortedMapWritable implements the SortedMap interface and stores its key-value pairs in a TreeMap.
5.3.3 Implementing a Custom Writable
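A hedged sketch of what a custom comparable Writable might look like, reusing the name/score fields of the earlier MyWritable (the class name and ordering rules here are illustrative, not from the original notes):
package com.zyf.study5;
import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
// Illustrative custom Writable: serializes a name and a score, and sorts by score, then name.
public class ScoreWritable implements WritableComparable<ScoreWritable> {
    private String name;
    private int score;
    public ScoreWritable() {}                      // no-arg constructor required for deserialization
    public ScoreWritable(String name, int score) {
        this.name = name;
        this.score = score;
    }
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);
        out.writeInt(score);
    }
    @Override
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();
        score = in.readInt();
    }
    @Override
    public int compareTo(ScoreWritable o) {        // used during MapReduce's sort phase
        int cmp = Integer.compare(score, o.score);
        return cmp != 0 ? cmp : name.compareTo(o.name);
    }
    @Override
    public int hashCode() {                        // used by HashPartitioner to assign reducers
        return name.hashCode() * 163 + score;
    }
    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof ScoreWritable)) return false;
        ScoreWritable other = (ScoreWritable) obj;
        return score == other.score && name.equals(other.name);
    }
}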
5.3.4 Serialization Frameworks
Although most MapReduce programs use Writable keys and values, this is not mandated by the MapReduce API: any type can be used as long as there is a mechanism for converting it to and from a binary representation. Hadoop has an API for pluggable serialization frameworks to support this. A serialization framework is represented by an implementation of org.apache.hadoop.io.serializer.Serialization, such as WritableSerialization, which defines how a given type is converted to and from binary.
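Additional frameworks are registered by listing their Serialization implementations in the io.serializations property; a minimal sketch (JavaSerialization is shown only as an example and is generally not recommended, since standard Java serialization is neither compact nor fast):
Configuration conf = new Configuration();
// WritableSerialization is registered by default; here Java serialization is added alongside it
conf.setStrings("io.serializations",
        "org.apache.hadoop.io.serializer.WritableSerialization",
        "org.apache.hadoop.io.serializer.JavaSerialization");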
5.4 File-Based Data Structures
5.4.1 SequenceFile
SequenceFile provides a persistent data structure for binary key-value pairs. It also works well as a container for small files, giving more efficient storage and processing.
1. Writing a SequenceFile
SequenceFile.createWriter takes further optional arguments, including the compression codec, a Progressable, and Metadata to store in the SequenceFile header.
The keys and values of a SequenceFile do not have to be Writables; any type supported by a serialization framework can be used.
package com.zyf.study5;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import java.io.IOException;
import java.net.URI;
public class SequenceFileWriterDemo {
final static String[] DATAS = {
"One, two, buckle my shoe",
"Three, four, shut the door",
"Five, sex, pick up sticks",
"Seven, eight, lay them straight",
"Nine, then, a big fat hen"
};
public static void main(String[] args) {
URI uri = URI.create("hdfs://127.0.0.1:9000/user/ossuser/sequenceFile.seq");
FSDataOutputStream outputStream = null;
SequenceFile.Writer writer = null;
try {
Configuration conf = new Configuration();
FileSystem fileSystem = FileSystem.get(uri, conf, "ossuser");
Path path = new Path(uri);
outputStream = fileSystem.create(path);
SequenceFile.Writer.Option streamOption = SequenceFile.Writer.stream(outputStream);
SequenceFile.Writer.Option keyOption = SequenceFile.Writer.keyClass(IntWritable.class);
SequenceFile.Writer.Option valueOption = SequenceFile.Writer.valueClass(Text.class);
writer = SequenceFile.createWriter(conf, streamOption, keyOption, valueOption);
IntWritable key = new IntWritable();
Text value = new Text();
for(int i=0; i<100; i++) {
key.set(100 - i);
value.set(DATAS[i % DATAS.length]);
if ((100-i) % 10 == 0) {
writer.sync();
}
System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
writer.append(key, value);
}
} catch (IOException e) {
e.printStackTrace();
} catch (InterruptedException e) {
e.printStackTrace();
} finally {
IOUtils.closeStreams(outputStream, writer);
}
}
}
This creates a SequenceFile.Writer and inserts a sync point every 10 records. writer.getLength() returns the current position in the file.
2. Reading a SequenceFile
- For Writable keys and values, iterate over the records by repeatedly calling public boolean next(Writable key, Writable val);
- For keys and values handled by a non-Writable serialization framework, use the following methods (a loop sketch follows this list):
public synchronized Object next(Object key) throws IOException
public synchronized Object getCurrentValue(Object val) throws IOException
A null return from next() signals the end of the file; otherwise, pass an object to getCurrentValue() to retrieve the value.
Make sure the io.serializations property includes the serialization framework being used.
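A minimal sketch of that non-Writable reading pattern (error handling and object reuse omitted; the reader is constructed as in the full example below):
Object key = null;
Object value = null;
while ((key = reader.next(key)) != null) {        // null means end of file
    value = reader.getCurrentValue(value);        // fetch the value for the record just read
    System.out.printf("%s\t%s%n", key, value);
}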
package com.zyf.study5;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
import java.io.IOException;
import java.net.URI;
public class SequenceFileReaderDemo {
public static void main(String[] args) {
URI uri = URI.create("hdfs://127.0.0.1:9000/user/ossuser/sequenceFile.txt");
SequenceFile.Reader reader = null;
try {
Configuration conf = new Configuration();
SequenceFile.Reader.Option fileOption = SequenceFile.Reader.file(new Path(uri));
reader = new SequenceFile.Reader(conf, fileOption);
Class keyClass = reader.getKeyClass();
Class valueClass = reader.getValueClass();
Writable key = (Writable) ReflectionUtils.newInstance(keyClass, conf);
Writable value = (Writable) ReflectionUtils.newInstance(valueClass, conf);
long position = reader.getPosition();
while(reader.next(key, value)) {
String syncSeen = reader.syncSeen() ? "*":"";
System.out.printf("[%s%s]%s %s\n", position, syncSeen, key, value);
position = reader.getPosition();
}
} catch (IOException e) {
e.printStackTrace();
} finally {
IOUtils.closeQuietly(reader);
}
}
}
Output:
[128*]100 One, two, buckle my shoe
[193]99 Three, four, shut the door
[240]98 Five, six, pick up sticks
[284]97 Seven, eight, lay them straight
[334]96 Nine, ten, a big fat hen
[378]95 One, two, buckle my shoe
[423]94 Three, four, shut the door
[470]93 Five, six, pick up sticks
[514]92 Seven, eight, lay them straight
[564]91 Nine, ten, a big fat hen
[608*]90 One, two, buckle my shoe
[673]89 Three, four, shut the door
...
You can seek to an exact position in the file with code like the following:
reader.seek(240);
reader.next(key, value);
System.out.printf("[%s%s]%s %s\n", position, syncSeen, key, value);
Output:
[240]98 Five, six, pick up sticks
However, if the given position is not a record boundary, the read fails:
reader.seek(239);
Output:
java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
Sync points can be used to find record boundaries. The following code positions the reader at the first sync point after position 239; if there were no sync point after the given position, the reader would be positioned at the end of the file. A SequenceFile with sync points can be used as MapReduce input: such files are splittable, so different parts can be processed independently by separate map tasks. See SequenceFileInputFormat for details.
reader.sync(239);
long position = reader.getPosition();
String syncSeen = reader.syncSeen() ? "*":"";
reader.next(key, value);
System.out.printf("[%s%s]%s %s\n", position, syncSeen, key, value);
Output:
[608]90 One, two, buckle my shoe
3. Reading a SequenceFile from the command line
hdfs dfs -text displays a SequenceFile in text form. It inspects a file's magic number to detect the file type and convert it to text; gzip, bzip2, Avro, and SequenceFile are supported. If you use custom key or value classes, make sure they are on Hadoop's classpath.
> hdfs dfs -text /user/ossuser/sequenceFile.txt
2019-05-08 14:57:58,744 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-05-08 14:57:59,914 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
100 One, two, buckle my shoe
99 Three, four, shut the door
98 Five, six, pick up sticks
...
> hdfs dfs -text /user/ossuser/test.txt.gz
abc
> hdfs dfs -text /user/ossuser/test.txt.bz2
bzip2 file
4. Sorting and merging SequenceFiles
>hadoop jar hadoop-3.1.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar sort -r 1 \
-inFormat org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat \
-outFormat org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat \
-outKey org.apache.hadoop.io.IntWritable \
-outValue org.apache.hadoop.io.Text \
/user/ossuser/sequenceFile.txt /user/ossuser/sorted
> hdfs dfs -text sorted/part-r-00000
1 Nine, ten, a big fat hen
2 Seven, eight, lay them straight
3 Five, six, pick up sticks
4 Three, four, shut the door
...
-r: number of reduce tasks
-inFormat: input format of the job
-outFormat: output format of the job
-outKey: output key type
-outValue: output value type
5. The SequenceFile format
There are three SequenceFile formats, depending on the compression mode; all three share the same header.
Header
- version - the first 3 bytes are the magic number SEQ, followed by 1 byte for the actual version number (e.g. SEQ4 or SEQ6)
- keyClassName - the key class
- valueClassName - the value class
- compression - boolean flag indicating whether the key-value pairs in the file are compressed
- blockCompression - boolean flag indicating whether block compression is applied to the key-value pairs
- compression codec - the CompressionCodec class used to compress keys and/or values (if compression is enabled)
- metadata - SequenceFile.Metadata for the file
- sync - a sync marker denoting the end of the header
Uncompressed format
- Header
- Record
- Record length
- Key length
- Key
- Value
- A sync marker every few hundred bytes or so.
Record-compressed format
- Header
- Record
- Record length
- Key length
- Key
- Compressed Value
- A sync marker every few hundred bytes or so.
Block-compressed format
- Header
- Record Block
- Uncompressed number of records in the block
- Size of the compressed key-lengths block
- Compressed key-lengths block
- Size of the compressed keys block
- Compressed keys block
- Size of the compressed value-lengths block
- Compressed value-lengths block
- Size of the compressed values block
- Compressed values block
- A sync marker after every block.
The key-lengths and value-lengths blocks hold the actual length of each key and value, encoded in zero-compressed integer (ZeroCompressedInteger) format.
Block compression compresses many records at once, so it can exploit similarity between records and achieves a better compression ratio than per-record compression. Records keep being added to a block until its size reaches at least io.seqfile.compress.blocksize, which defaults to 1 MB.
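A minimal sketch of requesting block compression when creating a writer; the codec choice is an assumption, and streamOption, keyOption, and valueOption are the options built in the earlier SequenceFileWriterDemo:
DefaultCodec codec = new DefaultCodec();
codec.setConf(conf);
SequenceFile.Writer.Option compressOption =
        SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, codec);
writer = SequenceFile.createWriter(conf, streamOption, keyOption, valueOption, compressOption);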
5.4.2 MapFile
A MapFile is a sorted SequenceFile with an index that allows lookups by key. The index is itself a SequenceFile holding a fraction of the keys (every 128th key by default); since the index can be loaded into memory, it provides fast lookups into the main data file, which is another SequenceFile containing all of the key-value pairs in sorted order.
When writing with MapFile.Writer, entries must be added in order, otherwise an IOException is thrown, as in the sketch below.
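A minimal sketch of writing and then looking up a MapFile (the path and sizes are illustrative; the options-based constructor is assumed, with the value class supplied via the SequenceFile.Writer option):
Configuration conf = new Configuration();
Path dir = new Path("numbers.map");   // a MapFile is a directory containing "data" and "index" files
MapFile.Writer writer = new MapFile.Writer(conf, dir,
        MapFile.Writer.keyClass(IntWritable.class),
        SequenceFile.Writer.valueClass(Text.class));
for (int i = 1; i <= 1024; i++) {
    writer.append(new IntWritable(i), new Text("value-" + i));   // keys must arrive in sorted order
}
writer.close();
MapFile.Reader reader = new MapFile.Reader(dir, conf);
Text value = new Text();
reader.get(new IntWritable(496), value);   // binary search in the in-memory index, then a short scan
System.out.println(value);                 // value-496
reader.close();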
MapFile variants
- ArrayFile: the key is an integer giving the element's index, and the value is a Writable.
- SetFile: a specialized MapFile for storing a set of Writable keys.
- BloomMapFile: provides a fast get() implementation, useful for sparsely populated files. It uses a dynamic Bloom filter, held in memory, to test whether the map contains a given key; the test is very fast but has a nonzero false-positive rate. The regular get() is called only when the membership test passes.
5.4.3 Other File Formats and Column-Oriented Formats
Avro is similar to SequenceFile: a binary format designed for large-scale data processing that is compact and splittable, but also portable across programming languages and described by a schema.
- Row-oriented: SequenceFile, Avro, and map files store the values of each row contiguously in the file. Even when only a few columns are needed, the whole row must be loaded into memory. Suited to workloads that read most of the columns.
- Column-oriented: rows are split into slices, and each slice is stored column by column. When only a few columns are read, only those columns need to be loaded into memory. Suited to workloads that read a small subset of columns, but not to streaming writes, since multiple files must be managed and writes are harder to control. Hive's early column-oriented format, RCFile (Record Columnar File), has been superseded by ORCFile (Optimized Record Columnar File) and Parquet.