Chapter 5: Hadoop I/O Study Notes

5.1 Data Integrity

When the amount of data a system has to process is as large as the limits of what Hadoop can handle, data loss or corruption becomes quite likely.

  • Countermeasure: compute a checksum when data first enters the system, then compute it again and compare after the data has passed through an unreliable channel. Checksums can only detect errors, they cannot repair them, which is also why ECC memory (Error-Correcting Code memory) is recommended.

A commonly used error-detecting code is CRC-32, which computes a 32-bit integer checksum for input of any size (a small sketch follows the list below).

  • Hadoop's ChecksumFileSystem uses CRC-32
  • HDFS uses CRC-32C, a more efficient variant
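
A minimal sketch of the checksum idea, using the JDK's java.util.zip.CRC32 class rather than Hadoop's own checksum machinery (the class name here is made up for illustration):

import java.util.zip.CRC32;

public class ChecksumSketch {

    public static void main(String[] args) {
        byte[] data = "some block of data".getBytes();

        // Checksum computed when the data first enters the system
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        long storedChecksum = crc.getValue();

        // ... the data travels through an unreliable channel ...

        // Recompute and compare; a mismatch means the data was corrupted (it cannot be repaired)
        CRC32 verify = new CRC32();
        verify.update(data, 0, data.length);
        System.out.println("data intact: " + (verify.getValue() == storedChecksum));
    }
}
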
5.1.1 Data Integrity in HDFS

5.2 Compression

Compression has two major benefits:

  • It reduces the space needed to store files
  • It speeds up data transfer across the network or to/from disk

All compression algorithms trade time against space: the -1 option optimizes for compression speed, -9 for the smallest compressed size.

5.2.1 Codecs

A codec is an implementation of a compression/decompression algorithm; in Hadoop, a codec is a class that implements the CompressionCodec interface.

Table 5-2. Hadoop compression codecs

Compression format   CompressionCodec implementation
deflate              org.apache.hadoop.io.compress.DefaultCodec
gzip                 org.apache.hadoop.io.compress.GzipCodec
lz4                  org.apache.hadoop.io.compress.Lz4Codec
snappy               org.apache.hadoop.io.compress.SnappyCodec
bzip2                org.apache.hadoop.io.compress.BZip2Codec
1. Compressing and decompressing streams with CompressionCodec

Example: compress standard input and write the result to the console:

package com.zyf.study5;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

import java.io.IOException;

public class StreamCompressor {

    public static void main(String[] args) {
        try {
            // Instantiate the codec reflectively (GzipCodec here); generics avoid the raw-type cast
            Class<? extends CompressionCodec> codecClass = GzipCodec.class;
            CompressionCodec compressionCodec = ReflectionUtils.newInstance(codecClass, new Configuration());

            // Wrap System.out in a compressing stream and copy stdin through it
            CompressionOutputStream outputStream = compressionCodec.createOutputStream(System.out);
            IOUtils.copyBytes(System.in, outputStream, 4096, false);
            outputStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Modify the Maven pom.xml to set the main class of the jar:

    
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <configuration>
          <archive>
            <manifest>
              <mainClass>com.zyf.study5.StreamCompressor</mainClass>
            </manifest>
          </archive>
        </configuration>
      </plugin>

Run the program:

> echo 'Text' | hadoop jar hadoop-first-1.0-SNAPSHOT.jar | gunzip
Text
2. Inferring a CompressionCodec with CompressionCodecFactory
package com.zyf.study5;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.net.URI;

public class FileDecompressor {

    private static final Logger LOGGER = LoggerFactory.getLogger(FileDecompressor.class);

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("Usage: FileDecompressor ");
            System.exit(-1);
        }

        LOGGER.info("input path is " + args[0]);

        Path path = new Path(args[0]);

        Configuration conf = new Configuration();
        CompressionCodecFactory compressionCodecFactory = new CompressionCodecFactory(conf);
        CompressionCodec compressionCodec = compressionCodecFactory.getCodec(path);
        if (compressionCodec == null) {
            System.err.println("No Codec found for " + args[0]);
            System.exit(-1);
        }

        try {
            URI uri = URI.create(args[0]);
            FileSystem fileSystem = FileSystem.get(uri, conf, "ossuser");
            FSDataInputStream fsDataInputStream = fileSystem.open(path);
            CompressionInputStream compressionInputStream = compressionCodec.createInputStream(fsDataInputStream);

            String fileName = CompressionCodecFactory.removeSuffix(path.getName(), compressionCodec.getDefaultExtension());
            LOGGER.info("output path is " + fileName);
            FSDataOutputStream fsDataOutputStream = fileSystem.create(new Path(fileName));

            IOUtils.copyBytes(compressionInputStream, fsDataOutputStream, 4096, false);

            fsDataOutputStream.close();
            fsDataInputStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

}

Run the program:

> hadoop jar hadoop-first-1.0-SNAPSHOT.jar hdfs://127.0.0.1:9000/user/ossuser/test.txt.gz
19/05/07 18:57:46 INFO study5.FileDecompressor: input path is hdfs://127.0.0.1:9000/user/ossuser/test.txt.gz
19/05/07 18:57:48 WARN zlib.ZlibFactory: Failed to load/initialize native-zlib library
19/05/07 18:57:48 INFO compress.CodecPool: Got brand-new decompressor [.gz]
19/05/07 18:57:48 INFO study5.FileDecompressor: output path is test.txt

> hadoop jar hadoop-first-1.0-SNAPSHOT.jar hdfs://127.0.0.1:9000/user/ossuser/test.txt.bz2
19/05/07 18:57:46 INFO study5.FileDecompressor: input path is hdfs://127.0.0.1:9000/user/ossuser/test.txt.bz2
19/05/07 19:01:17 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version
19/05/07 19:01:17 INFO compress.CodecPool: Got brand-new decompressor [.bz2]
19/05/07 19:01:17 INFO study5.FileDecompressor: output path is test.txt

When the program finishes, a test.txt file containing the decompressed data is created under /user/ossuser/.

CompressionCodecFactory loads all the codec implementations in Table 5-2, plus any listed in the io.compression.codecs property; each codec knows its own default filename extension.

3. Native libraries

Using the native libraries gives better performance: compared with the Java implementation, native gzip cuts decompression time by about 50% and compression time by about 10%. The library location can be specified with the java.library.path system property, set in the scripts under etc/hadoop/, or set manually in the application. The native libraries can be disabled by setting io.native.lib.available=false. A quick runtime check is sketched below.
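
A sketch of that runtime check, using Hadoop's NativeCodeLoader and ZlibFactory utility classes (whether they report true depends entirely on your platform and installation):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.zlib.ZlibFactory;
import org.apache.hadoop.util.NativeCodeLoader;

public class NativeCheck {

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // true only if libhadoop (the native-hadoop library) could be loaded
        System.out.println("native-hadoop loaded: " + NativeCodeLoader.isNativeCodeLoaded());
        // true only if the native zlib implementation is usable with this configuration
        System.out.println("native zlib usable: " + ZlibFactory.isNativeZlibLoaded(conf));
    }
}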

4. CodecPool (reusing compressors and decompressors to amortize the cost of creating them)
package com.zyf.study5;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.util.ReflectionUtils;

import java.io.IOException;

public class PooledStreamCompressor {

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("Usage: PooledStreamCompressor ");
            System.exit(-1);
        }

        Compressor compressor = null;
        try {
            Class classOfCodec = Class.forName(args[0]);

            Configuration conf = new Configuration();
            CompressionCodec compressionCodec = (CompressionCodec) ReflectionUtils.newInstance(classOfCodec, conf);

            compressor = CodecPool.getCompressor(compressionCodec);

            CompressionOutputStream outputStream = compressionCodec.createOutputStream(System.out, compressor);
            IOUtils.copyBytes(System.in, outputStream, 4096, false);

            outputStream.finish();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            CodecPool.returnCompressor(compressor);
        }
    }
}

Run the program:

> echo 'gzip output' | hadoop jar hadoop-first-1.0-SNAPSHOT.jar org.apache.hadoop.io.compress.GzipCodec | gunzip
2019-05-07 19:37:57,413 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-05-07 19:37:57,414 INFO compress.CodecPool: Got brand-new compressor [.gz]
gzip output
5.2.2 Compression and Input Splits

gzip does not support splitting
bzip2 supports splitting

Which compression format should you use?
It depends on the size and format of the files to be processed and on the tools in use. The suggestions below are ordered from most to least effective:

  • Use a container file format such as SequenceFile, Avro, ORCFile, or Parquet; these formats support both compression and splitting, ideally combined with a fast compressor such as LZO, LZ4, or Snappy.
  • Use a compression format that supports splitting, such as bzip2, or one that can be indexed to support splitting, such as LZO.
  • Split the file into chunks in the application and compress each chunk separately, choosing the chunk size so that the compressed chunks are close to the HDFS block size.
  • Store the files uncompressed.

For large files, do not use a compression format that does not support splitting: the job loses data locality and MapReduce becomes very inefficient.

5.2.3 Using Compression in MapReduce
  • If the input files are compressed, MapReduce infers the codec from the file extension via CompressionCodecFactory and decompresses them automatically when reading.
  • For the output, set mapreduce.output.fileoutputformat.compress=true and choose the codec with mapreduce.output.fileoutputformat.compress.codec, or configure it through FileOutputFormat:
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

Run the program:

> hadoop jar hadoop-first-1.0-SNAPSHOT.jar input/input.txt.gz output
...
19/05/08 09:40:05 WARN zlib.ZlibFactory: Failed to load/initialize native-zlib library
19/05/08 09:40:05 INFO compress.CodecPool: Got brand-new decompressor [.gz]

Resulting files:

._SUCCESS.crc
.part-r-00000.gz.crc
_SUCCESS
part-r-00000.gz

If the output is a SequenceFile, i.e. job.setOutputFormatClass(SequenceFileOutputFormat.class) (note: the compression format must then be compatible; the default is deflate, and it cannot be set via FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class)), the property mapreduce.output.fileoutputformat.compress.type controls how compression is applied. The default is RECORD, which compresses each record individually; BLOCK is recommended. A driver sketch follows.
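
A sketch of the corresponding driver fragment (the surrounding job setup is assumed; only the output-format and compression calls are shown):

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SequenceFileOutputSetup {

    // Configure a job to write block-compressed SequenceFile output.
    static void configureOutput(Job job) {
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(job, true);
        // Same effect as setting mapreduce.output.fileoutputformat.compress.type=BLOCK
        SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
    }
}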

Property                                          Type        Default                                      Description
mapreduce.output.fileoutputformat.compress        boolean     false                                        whether to compress the job output
mapreduce.output.fileoutputformat.compress.codec  class name  org.apache.hadoop.io.compress.DefaultCodec   the compression codec to use
mapreduce.output.fileoutputformat.compress.type   String      RECORD                                       NONE, RECORD, or BLOCK
Compressing map output

Compressing the intermediate map output with a fast codec such as LZO, Snappy, or LZ4 reduces the amount of data transferred across the network and can improve performance.

Property                             Type        Default                                      Description
mapreduce.map.output.compress        boolean     false                                        whether to compress map output
mapreduce.map.output.compress.codec  class name  org.apache.hadoop.io.compress.DefaultCodec   the compression codec to use

Map output compression can be enabled in code as follows:

Configuration conf = new Configuration();
conf.setBoolean(Job.MAP_OUTPUT_COMPRESS, true);
conf.setClass(Job.MAP_OUTPUT_COMPRESS_CODEC, DefaultCodec.class, CompressionCodec.class);

Job job = Job.getInstance(conf);

5.3 Serialization

  • Serialization is the process of turning structured objects into a byte stream. It is used in two main areas of distributed data processing: interprocess communication and persistent storage. In Hadoop, interprocess communication is implemented with RPC: the RPC layer serializes a message into a binary stream and sends it to the remote node, which deserializes the binary stream back into the original message. An RPC serialization format should generally be:

    • Compact, to make the best use of network bandwidth (and storage space)
    • Fast, to minimize the cost of serializing and deserializing (efficient reads and writes)
    • Extensible, so the protocol stays forward and backward compatible (old data formats remain readable)
    • Interoperable, so clients and servers written in different languages can talk to each other (data written in one language can be read in another)
  • Deserialization is the reverse process of turning a byte stream back into structured objects

Hadoop's serialization format is Writable: compact and fast, but hard to use or extend from languages other than Java. Avro overcomes some of Writable's limitations.

5.3.1 The Writable Interface
package org.apache.hadoop.io;
...
public interface Writable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}
1. Implementing the Writable interface
package com.zyf.study5;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.StringUtils;

import java.io.*;

public class MyWritable implements Writable {

    private String name;
    private int score;

    public MyWritable() {
        super();
    }

    public MyWritable(String name, int score) {
        this.name = name;
        this.score = score;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);
        out.writeInt(score);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.name = in.readUTF();
        this.score = in.readInt();
    }

    public static byte[] serialize(MyWritable writable) throws IOException {
        if (writable == null) {
            return null;
        }

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream outputStream = new DataOutputStream(baos);
        writable.write(outputStream);

        outputStream.close();
        return baos.toByteArray();
    }

    public static MyWritable deserialize(byte[] value) throws IOException {
        if (value == null || value.length == 0) {
            return null;
        }

        ByteArrayInputStream bais = new ByteArrayInputStream(value);
        DataInputStream inputStream = new DataInputStream(bais);

        MyWritable fromBytes = new MyWritable();
        fromBytes.readFields(inputStream);

        return fromBytes;
    }

    @Override
    public String toString() {
        return "MyWritable{" +
                "name='" + name + '\'' +
                ", score=" + score +
                '}';
    }

    public static void main(String[] args) throws IOException {
        MyWritable myWritable = new MyWritable("abcdef", 33);

        byte[] value = MyWritable.serialize(myWritable);
        System.out.println(StringUtils.byteToHexString(value));

        MyWritable fromBytes = MyWritable.deserialize(value);
        System.out.println(fromBytes);
    }
}

The output is:

000661626364656600000021
MyWritable{name='abcdef', score=33}

In the serialized byte array, 0006 is the length of the string, 616263646566 encodes the string abcdef, and the last 4 bytes (00000021) encode the int 33.

2. The WritableComparable interface and comparators

IntWritable implements the WritableComparable interface; WritableComparable<T> extends Writable and Comparable<T>:

package org.apache.hadoop.io;

public class IntWritable implements WritableComparable<IntWritable> {
  
  /** A Comparator optimized for IntWritable. */ 
  public static class Comparator extends WritableComparator {
    public Comparator() {
      super(IntWritable.class);
    }
    
    @Override
    public int compare(byte[] b1, int s1, int l1,
                       byte[] b2, int s2, int l2) {
      int thisValue = readInt(b1, s1);
      int thatValue = readInt(b2, s2);
      return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
    }
  }
}

Type comparison matters a great deal for MapReduce, because there is an intermediate sort-by-key phase. Hadoop also provides an optimized comparator: WritableComparator, which implements RawComparator<WritableComparable> and Configurable.

package org.apache.hadoop.io;

import org.apache.hadoop.io.serializer.DeserializerComparator;

/**
 * A {@link Comparator} that operates directly on byte representations of
 * objects.
 *
 * @param <T>
 * @see DeserializerComparator
 */
public interface RawComparator<T> extends Comparator<T> {

  /**
   * Compare two objects in binary.
   * b1[s1:l1] is the first object, and b2[s2:l2] is the second object.
   *
   * @param b1 The first byte array.
   * @param s1 The position index in b1. The object under comparison's starting index.
   * @param l1 The length of the object in b1.
   * @param b2 The second byte array.
   * @param s2 The position index in b2. The object under comparison's starting index.
   * @param l2 The length of the object under comparison in b2.
   * @return An integer result of the comparison.
   */
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}

This interface lets implementations compare records directly in their serialized form, without first deserializing the stream into objects, which avoids the overhead of creating new objects; IntWritable.Comparator, for example, reads the integers straight out of the byte arrays and compares them. WritableComparator is a general-purpose implementation of RawComparator that provides two things:

1. A default implementation of the raw compare() method, which deserializes the stream data into objects and then calls the objects' compare() method;
2. A factory for RawComparator instances: calling
public static WritableComparator get(Class<? extends WritableComparable> c) returns the comparator for the given type.

byte[] b1 = new byte[] {0, 0, 0, 15};
byte[] b2 = new byte[] {0, 0, 0, 13};

// Raw comparison, straight on the serialized bytes (positive here, since 15 > 13)
WritableComparator comparator = WritableComparator.get(IntWritable.class);
int compare = comparator.compare(b1, 0, 4, b2, 0, 4);

// Object comparison (negative here, since 100 < 101)
IntWritable intWritable1 = new IntWritable(100);
IntWritable intWritable2 = new IntWritable(101);
compare = comparator.compare(intWritable1, intWritable2);
5.3.2 Writable Implementations
1) Writable wrappers for Java primitives
Java primitive   Writable implementation   Serialized size (bytes)
boolean          BooleanWritable           1
byte             ByteWritable              1
short            ShortWritable             2
char             none (use IntWritable)    -
int              IntWritable               4
                 VIntWritable              1-5
long             LongWritable              8
                 VLongWritable             1-9
float            FloatWritable             4
double           DoubleWritable            8

Integers come in a fixed-length and a variable-length encoding. When the values to encode are mostly small, the variable-length encoding saves space; 127, for example, needs only one byte.
The fixed-length encoding is appropriate when values are distributed very uniformly across the whole value space, such as the output of a well-designed hash function.
In most cases numeric values are not uniformly distributed, so on the whole the variable-length encoding is the more space-efficient choice (see the sketch below).
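
A small sketch of the space saving, using WritableUtils (Hadoop's helper for the variable-length VInt/VLong encoding):

import org.apache.hadoop.io.WritableUtils;

public class VIntSizeDemo {

    public static void main(String[] args) {
        // Number of bytes the variable-length encoding needs for a few values
        System.out.println(WritableUtils.getVIntSize(127));       // 1 byte
        System.out.println(WritableUtils.getVIntSize(128));       // 2 bytes
        System.out.println(WritableUtils.getVIntSize(1000000L));  // 4 bytes (a fixed-length long always takes 8)
    }
}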

2) Text

Text is a Writable for UTF-8 byte sequences; it can be thought of as the Writable equivalent of String.
Text uses a variable-length int to record the number of bytes in the string, so it can hold at most 2 GB, and it uses standard UTF-8 encoding.

package org.apache.hadoop.io;
public class Text extends BinaryComparable
    implements WritableComparable<BinaryComparable> {

  private static final ThreadLocal<CharsetEncoder> ENCODER_FACTORY =
    new ThreadLocal<CharsetEncoder>() {
      @Override
      protected CharsetEncoder initialValue() {
        return Charset.forName("UTF-8").newEncoder().
               onMalformedInput(CodingErrorAction.REPORT).
               onUnmappableCharacter(CodingErrorAction.REPORT);
    }
  };

  /** Set to contain the contents of a string. 
   */
  public void set(String string) {
    try {
      ByteBuffer bb = encode(string, true);
      bytes = bb.array();
      length = bb.limit();
    }catch(CharacterCodingException e) {
      throw new RuntimeException("Should not have happened ", e); 
    }
  }

  public static ByteBuffer encode(String string, boolean replace)
    throws CharacterCodingException {
    CharsetEncoder encoder = ENCODER_FACTORY.get();
    if (replace) {
      encoder.onMalformedInput(CodingErrorAction.REPLACE);
      encoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
    }
    ByteBuffer bytes = 
      encoder.encode(CharBuffer.wrap(string.toCharArray()));
    if (replace) {
      encoder.onMalformedInput(CodingErrorAction.REPORT);
      encoder.onUnmappableCharacter(CodingErrorAction.REPORT);
    }
    return bytes;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    WritableUtils.writeVInt(out, length);
    out.write(bytes, 0, length);
  }

  /** Write a UTF8 encoded string to out
   */
  public static int writeString(DataOutput out, String s) throws IOException {
    ByteBuffer bytes = encode(s);
    int length = bytes.limit();
    WritableUtils.writeVInt(out, length);
    out.write(bytes.array(), 0, length);
    return length;
  }
}

Text indexing is based on byte positions within the UTF-8 encoded form, whereas String indexing is based on char positions:

        String str = "我要学Hadoop,哈";
        Text text = new Text(str);

        print("str.length()=" + str.length());
        print("text.getLength()=" + text.getLength());

        byte[] bytes = text.getBytes();
        print("text.getBytes()=" + StringUtils.byteToHexString(bytes));
        print("text.getBytes().length=" + bytes.length);

        print("text.find(\"我\")=" + text.find("我"));
        print("text.find(\"要\")=" + text.find("要"));
        print("text.find(\"学\")=" + text.find("学"));
        print("text.find(\"o\")=" + text.find("o"));

        print(str.charAt(2));
        print((char) text.charAt(6));
        print(text.charAt(100));
       //print(str.charAt(100));

Output:

str.length()=11
text.getLength()=19
text.getBytes()=e68891e8a681e5ada64861646f6f702ce59388000000000000
text.getBytes().length=25
text.find("我")=0
text.find("要")=3
text.find("学")=6
text.find("o")=12
学
学
-1
  • String.length() returns the number of chars, while text.getLength() returns the number of bytes after UTF-8 encoding. text.getBytes() returns the backing array, which can end with unused empty bytes (as in the example above), so text.getBytes().length can be larger than the actual byte length.
  • text.find() is similar to String.indexOf(): the former returns a byte offset, the latter a char offset.
  • Both String and Text have a charAt method: String.charAt returns the char at the given char index and throws StringIndexOutOfBoundsException if the index exceeds the string length; Text.charAt returns the character (as an int code point) at the given byte offset and returns -1 if the offset is beyond the byte array length.
Unicode code point    U+0041    U+00DF    U+6771    U+10400
Character             A         ß         東        (supplementary character, needs a surrogate pair)
Java representation   \u0041    \u00DF    \u6771    \uD801\uDC00
        String s = "\u0041\u00DF\u6771\uD801\uDC00";
        Text t = new Text(s);
        print(s.length() + ", " + t.getLength());
        print(s.getBytes(Charset.forName("UTF-8")).length + ", " + t.getLength());

        print(s.indexOf("\u0041") + ", " + t.find("\u0041"));
        print(s.indexOf("\u00DF") + ", " + t.find("\u00DF"));
        print(s.indexOf("\u6771") + ", " + t.find("\u6771"));
        print(s.indexOf("\uD801\uDC00") + ", " + t.find("\uD801\uDC00"));

        print(s.charAt(0) + ", " + (char)t.charAt(0));
        print(s.charAt(1) + ", " + (char)t.charAt(1));
        print(s.charAt(2) + ", " + (char)t.charAt(3));
        print(s.charAt(3) + ", " + (char)t.charAt(6));

        print(s.codePointAt(0) == 0x0041);
        print(s.codePointAt(1) == 0x00DF);
        print(s.codePointAt(2) == 0x6771);
        print(s.codePointAt(3) == 0x10400);

Output:

5, 10
10, 10
0, 0
1, 1
2, 3
3, 6
A, A
ß, ß
東, 東
?, Ѐ
true
true
true
true

As shown above, the length of the Text object is the number of bytes in its UTF-8 encoding (1 + 2 + 3 + 4 = 10).

Iterating over Text

        String s1 = "\u0041\u00DF\u6771\uD801\uDC00";
        Text t1 = new Text(s1);

        ByteBuffer byteBuffer = ByteBuffer.wrap(t1.getBytes(), 0, t1.getLength());
        int value;
        while(byteBuffer.hasRemaining() && (value = Text.bytesToCodePoint(byteBuffer)) != -1) {
            print(Integer.toHexString(value));
        }

Output:

41
df
6771
10400
  • Text is mutable: its value can be changed with set()
  • Text's API is not as rich as String's; convert to a String for more complex operations (see the sketch below)
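
A small sketch of both points, reusing a Text object via set() and converting to String for the richer API (the class name is illustrative):

import org.apache.hadoop.io.Text;

public class TextReuseDemo {

    public static void main(String[] args) {
        Text t = new Text("hadoop");
        t.set(new Text("hi"));                    // mutate in place; the backing array is reused when large enough
        System.out.println(t.getLength());        // 2: logical length in bytes
        System.out.println(t.getBytes().length);  // can be larger than 2: the backing array keeps its old capacity

        // Text's API is thin, so convert to String for richer string operations
        String s = t.toString();
        System.out.println(s.toUpperCase());
    }
}
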
3)BytesWritable

BytesWritable is a wrapper for an array of binary data. It is mutable, and getBytes().length does not reflect the actual length of the stored data (use getLength(); a sketch after the output below illustrates the difference). The serialized format is a 4-byte integer giving the number of data bytes, followed by the data itself.

BytesWritable bytesWritable = new BytesWritable(new byte[]{3, 5, 7});
print(StringUtils.byteToHexString(serialize(bytesWritable)));

Output:

00000003030507
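
A short sketch of the getLength()/getBytes() caveat (setCapacity() grows the backing array without changing the logical data length):

import org.apache.hadoop.io.BytesWritable;

public class BytesWritableDemo {

    public static void main(String[] args) {
        BytesWritable bw = new BytesWritable(new byte[]{3, 5, 7});
        bw.setCapacity(11);                        // enlarge the backing array; the data is preserved
        System.out.println(bw.getLength());        // 3  -- actual data length
        System.out.println(bw.getBytes().length);  // 11 -- backing array length
    }
}
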
4)NullWritable

NullWritable is a special Writable whose serialization has zero length: it neither reads from nor writes to the data stream and acts as a placeholder. In MapReduce, when a key or value position does not need to carry anything, declare it as NullWritable, which stores a constant empty value efficiently. NullWritable can also be used as a SequenceFile key.

public class NullWritable implements WritableComparable<NullWritable> {

  private static final NullWritable THIS = new NullWritable();

  private NullWritable() {}                       // no public ctor

  public static NullWritable get() { return THIS; }
  
  @Override
  public String toString() {
    return "(null)";
  }
  
  @Override
  public int compareTo(NullWritable other) {
    return 0;
  }
  
  @Override
  public void readFields(DataInput in) throws IOException {}
  @Override
  public void write(DataOutput out) throws IOException {}

  /** A Comparator "optimized" for NullWritable. */
  public static class Comparator extends WritableComparator {
    public Comparator() {
      super(NullWritable.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1,
                       byte[] b2, int s2, int l2) {
      return 0;
    }
  }
}
5) ObjectWritable and GenericWritable

ObjectWritable is a wrapper for String, Java primitive types, enums, Writables, or arrays of these types. It is useful when a field can hold values of more than one type: for example, if the values in a SequenceFile have several types, you can declare the value type as ObjectWritable and wrap each value in an ObjectWritable. The drawback is that every serialization writes out the full class name of the wrapped type, which wastes a lot of space; GenericWritable addresses this.

public class ObjectWritable implements Writable, Configurable {

    public static void writeObject(DataOutput out, Object instance,
        Class declaredClass, Configuration conf, boolean allowCompactArrays) 
    throws IOException {

    if (instance == null) {                       // null
      instance = new NullInstance(declaredClass, conf);
      declaredClass = Writable.class;
    }
    
    // Special case: must come before writing out the declaredClass.
    // If this is an eligible array of primitives,
    // wrap it in an ArrayPrimitiveWritable$Internal wrapper class.
    if (allowCompactArrays && declaredClass.isArray()
        && instance.getClass().getName().equals(declaredClass.getName())
        && instance.getClass().getComponentType().isPrimitive()) {
      instance = new ArrayPrimitiveWritable.Internal(instance);
      declaredClass = ArrayPrimitiveWritable.Internal.class;
    }

    UTF8.writeString(out, declaredClass.getName()); // always write declared

    if (declaredClass.isArray()) {     // non-primitive or non-compact array
      int length = Array.getLength(instance);
      out.writeInt(length);
      for (int i = 0; i < length; i++) {
            writeObject(out, Array.get(instance, i),
            declaredClass.getComponentType(), conf, allowCompactArrays);
      }
      
    } else if (declaredClass == ArrayPrimitiveWritable.Internal.class) {
      ((ArrayPrimitiveWritable.Internal) instance).write(out);
      
    } else if (declaredClass == String.class) {   // String
      UTF8.writeString(out, (String)instance);
      
    } else if (declaredClass.isPrimitive()) {     // primitive type

      if (declaredClass == Boolean.TYPE) {        // boolean
        out.writeBoolean(((Boolean)instance).booleanValue());
      } else if (declaredClass == Character.TYPE) { // char
        out.writeChar(((Character)instance).charValue());
      } else if (declaredClass == Byte.TYPE) {    // byte
        out.writeByte(((Byte)instance).byteValue());
      } else if (declaredClass == Short.TYPE) {   // short
        out.writeShort(((Short)instance).shortValue());
      } else if (declaredClass == Integer.TYPE) { // int
        out.writeInt(((Integer)instance).intValue());
      } else if (declaredClass == Long.TYPE) {    // long
        out.writeLong(((Long)instance).longValue());
      } else if (declaredClass == Float.TYPE) {   // float
        out.writeFloat(((Float)instance).floatValue());
      } else if (declaredClass == Double.TYPE) {  // double
        out.writeDouble(((Double)instance).doubleValue());
      } else if (declaredClass == Void.TYPE) {    // void
      } else {
        throw new IllegalArgumentException("Not a primitive: "+declaredClass);
      }
    } else if (declaredClass.isEnum()) {         // enum
      UTF8.writeString(out, ((Enum)instance).name());
    } else if (Writable.class.isAssignableFrom(declaredClass)) { // Writable
      UTF8.writeString(out, instance.getClass().getName());
      ((Writable)instance).write(out);

    } else if (Message.class.isAssignableFrom(declaredClass)) {
      ((Message)instance).writeDelimitedTo(
          DataOutputOutputStream.constructOutputStream(out));
    } else {
      throw new IOException("Can't write: "+instance+" as "+declaredClass);
    }
  }
}
public class UTF8 implements WritableComparable {
  public static int writeString(DataOutput out, String s) throws IOException {
    int len = utf8Length(s);
    out.writeShort(len);
    writeChars(out, s, 0, s.length());
    return len;
  }
}

Serializing objects with ObjectWritable:

    public static byte[] serialize(Object object) throws IOException {
        ObjectWritable objectWritable = new ObjectWritable(object);

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream outputStream = new DataOutputStream(baos);
        objectWritable.write(outputStream);
        return baos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Person p = new Person("z1", 1);
        byte[] values = serialize(p);
        print(StringUtils.byteToHexString(values));

        int[] array = new int[]{3, 5, 10};
        values = serialize(array);
        print(StringUtils.byteToHexString(values));

        Person[] ps = new Person[] {new Person("z1", 1), new Person("z2", 2)};
        values = serialize(ps);
        print(StringUtils.byteToHexString(values));
    }

Output explained

#Output 1 (spaces and line breaks added for readability; the actual output has none)
0028 636f6d2e7a79662e7374756479352e4f626a6563745772697461626c6544656d6f24506572736f6e
0028 636f6d2e7a79662e7374756479352e4f626a6563745772697461626c6544656d6f24506572736f6e 0002 7a31 00000001
#Explanation 1
{string length: 40} com.zyf.study5.ObjectWritableDemo$Person
{string length: 40} com.zyf.study5.ObjectWritableDemo$Person {string length: 2} z1 1

#Output 2 (spaces and line breaks added for readability; the actual output has none)
0002 5b49 00000003 0003696e74 00000003 0003696e74 00000005 0003696e74 0000000a
#Explanation 2
{string length: 2} [I {array length: 3} {string length: 3} int 3 {string length: 3} int 5 {string length: 3} int 10

#Output 3 (spaces and line breaks added for readability; the actual output has none)
002b 5b4c 636f6d2e7a79662e7374756479352e4f626a6563745772697461626c6544656d6f24506572736f6e3b 00000002 
0028      636f6d2e7a79662e7374756479352e4f626a6563745772697461626c6544656d6f24506572736f6e 
0028      636f6d2e7a79662e7374756479352e4f626a6563745772697461626c6544656d6f24506572736f6e 0002 7a31  00000001
0028      636f6d2e7a79662e7374756479352e4f626a6563745772697461626c6544656d6f24506572736f6e
0028      636f6d2e7a79662e7374756479352e4f626a6563745772697461626c6544656d6f24506572736f6e 0002 7a32  00000002

#Explanation 3:
{string length: 43} [Lcom.zyf.study5.ObjectWritableDemo$Person; {array length: 2}
{string length: 40} com.zyf.study5.ObjectWritableDemo$Person
{string length: 40} com.zyf.study5.ObjectWritableDemo$Person {string length: 2} z1 1
{string length: 40} com.zyf.study5.ObjectWritableDemo$Person
{string length: 40} com.zyf.study5.ObjectWritableDemo$Person {string length: 2} z2 2

The output above shows that values serialized with ObjectWritable are quite long. For an array, the array's full class name is serialized first, then the array length, and then writeObject is called recursively for each element; each call first serializes the element's declared class name, and because the element is a Writable, that branch serializes the element's concrete class name again before calling the element's
write method to serialize its fields. Using GenericWritable greatly reduces the space spent on class names, as shown below:

    public static void main(String[] args) throws IOException {
        Person p = new Person("z1", 1);

        CustomGenericWritable customGenericWritable = new CustomGenericWritable();
        customGenericWritable.set(p);
        byte[] values = serialize(customGenericWritable);
        print(StringUtils.byteToHexString(values));


        MyWritable myWritable = new MyWritable("abcdef", 33);
        customGenericWritable.set(myWritable);
        values = serialize(customGenericWritable);
        print(StringUtils.byteToHexString(values));
    }

    static class CustomGenericWritable extends GenericWritable {

        final static Class[] TYPES = new Class[] {MyWritable.class, Person.class};

        @Override
        protected Class[] getTypes() {
            return TYPES;
        }
    }

Output:

#Output 1:
01 0002 7a31 00000001
#Explanation 1:
{type id: index into the TYPES array} {string length: 2} z1 1

#Output 2:
00 0006 616263646566 00000021
#Explanation 2:
{type id: index into the TYPES array} {string length: 6} abcdef 33
6) Writable collections

The org.apache.hadoop.io package contains six Writable collection classes: ArrayWritable, ArrayPrimitiveWritable, TwoDArrayWritable, MapWritable, SortedMapWritable, and EnumSetWritable. ArrayWritable and TwoDArrayWritable wrap one- and two-dimensional arrays whose elements must all be of the same Writable type; both provide a toArray() method that makes a shallow copy. To store Writables of different types in one array, wrap them in GenericWritable.

  public Object toArray() {
    Object result = Array.newInstance(valueClass, values.length);
    for (int i = 0; i < values.length; i++) {
      Array.set(result, i, values[i]);
    }
    return result;
  }

Usage example:

    public static void main(String[] args) throws IOException {
        MyWritable[] myWritables = new MyWritable[] {new MyWritable("z1", 1), new MyWritable("z2", 2)};
        ArrayWritable arrayWritable = new ArrayWritable(MyWritable.class, myWritables);

        byte[] values = serialize(arrayWritable);
        System.out.println(StringUtils.byteToHexString(values));
    }

Output:

00000002 0002 7a31 00000001 0002 7a32 00000002
{array length: 2} {string length: 2} z1 1 {string length: 2} z2 2

ArrayPrimitiveWritable wraps arrays of Java primitive types. Usage example:

    public static void main(String[] args) throws IOException {
        int[] array = new int[]{1, 3, 5, 7};

        ArrayPrimitiveWritable writable = new ArrayPrimitiveWritable(array);
        byte[] bytes = serialize(writable);

        System.out.println(StringUtils.byteToHexString(bytes));
    }

Output:

0003 696e74 00000004 00000001 00000003 00000005 00000007

MapWritable extends AbstractMapWritable and implements the Map interface, storing its entries in a HashMap. AbstractMapWritable maintains a mapping between key/value classes and ids; when put() inserts a key and/or value whose class is not yet in that mapping, the class is added and assigned an id, starting from 1 and incrementing.

public class MapWritable extends AbstractMapWritable
  implements Map<Writable, Writable> {
}

Example code:

    public  static void main(String[] args) throws IOException {
        MapWritable mapWritable = new MapWritable();

        IntWritable intWritable1 = new IntWritable(1);
        IntWritable intWritable2 = new IntWritable(2);
        IntWritable intWritable3 = new IntWritable(3);

        mapWritable.put(intWritable1, new MyWritable("z1", 1));
        mapWritable.put(intWritable2, new MyWritable("z2", 2));
        mapWritable.put(intWritable3, new ObjectWritableDemo.Person("z3", 2));

        byte[] bytes = serialize(mapWritable);
        System.out.println(StringUtils.byteToHexString(bytes));
    }

Output:

02 
01 0019 636f6d2e7a79662e7374756479352e4d795772697461626c65
02 0028 636f6d2e7a79662e7374756479352e4f626a6563745772697461626c6544656d6f24506572736f6e
   00000003 
        85 00000001 01 0002 7a31 00000001 
        85 00000002 01 0002 7a32 00000002 
        85 00000003 02 0002 7a33 00000002
#Explanation:
{number of custom classes registered: 2}
{id: 1}{string length: 25}com.zyf.study5.MyWritable
{id: 2}{string length: 40}com.zyf.study5.ObjectWritableDemo$Person
{map.size: 3}
     {key class id, i.e. the predefined id for IntWritable, -123}{key value: 1}{value class id: 1}{string length: 2}z1 1
     {key class id, i.e. the predefined id for IntWritable, -123}{key value: 2}{value class id: 1}{string length: 2}z2 2
     {key class id, i.e. the predefined id for IntWritable, -123}{key value: 3}{value class id: 2}{string length: 2}z3 2

SortedMapWritable implements the SortedMap interface and stores its entries in a TreeMap.

5.3.3 Implementing a Custom Writable
5.3.4 Serialization Frameworks

Although most MapReduce programs use Writable keys and values, this is not mandated by the MapReduce API. Any type can be used as long as there is a mechanism for converting it to and from a binary representation. Hadoop has an API for pluggable serialization frameworks to support this: a serialization framework is represented by an implementation of org.apache.hadoop.io.serializer.Serialization, such as WritableSerialization, which defines the conversion between a type and its binary form. A small configuration sketch follows.
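
For example, the io.serializations property lists the registered frameworks; the sketch below adds JavaSerialization next to the default WritableSerialization so that plain Serializable types can be used (generally discouraged, since Java serialization is neither compact nor fast):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.serializer.JavaSerialization;
import org.apache.hadoop.io.serializer.WritableSerialization;

public class SerializationConfig {

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // io.serializations names the Serialization implementations the framework may use
        conf.setStrings("io.serializations",
                WritableSerialization.class.getName(),
                JavaSerialization.class.getName());
        System.out.println(conf.get("io.serializations"));
    }
}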

5.4 File-Based Data Structures

5.4.1 SequenceFile

A SequenceFile provides a persistent data structure for binary key-value pairs. It can also serve as a container for small files, giving more efficient storage and processing.

1. Writing a SequenceFile

Optional arguments to SequenceFile.createWriter also include a compression codec, a Progressable, and Metadata to add to the SequenceFile header; a sketch of passing them follows.
The keys and values of a SequenceFile do not have to be Writables; any type with a registered serialization can be used.
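
A sketch of those extra options (here block compression with DefaultCodec and one metadata entry; the conf and path arguments are assumed to be set up as in the full example below):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

import java.io.IOException;

public class SequenceFileWriterOptions {

    // Create a block-compressed SequenceFile writer with custom header metadata.
    static SequenceFile.Writer createWriter(Configuration conf, Path path) throws IOException {
        SequenceFile.Metadata metadata = new SequenceFile.Metadata();
        metadata.set(new Text("author"), new Text("zyf"));

        return SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, new DefaultCodec()),
                SequenceFile.Writer.metadata(metadata));
    }
}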

package com.zyf.study5;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

import java.io.IOException;
import java.net.URI;

public class SequenceFileWriterDemo {

    final static String[] DATAS = {
            "One, two, buckle my shoe",
            "Three, four, shut the door",
            "Five, sex, pick up sticks",
            "Seven, eight, lay them straight",
            "Nine, then, a big fat hen"
    };

    public static void main(String[] args) {
        URI uri = URI.create("hdfs://127.0.0.1:9000/user/ossuser/sequenceFile.seq");
        FSDataOutputStream outputStream = null;
        SequenceFile.Writer writer = null;
        try {
            Configuration conf = new Configuration();
            FileSystem fileSystem = FileSystem.get(uri, conf, "ossuser");

            Path path = new Path(uri);

            outputStream = fileSystem.create(path);
            SequenceFile.Writer.Option streamOption = SequenceFile.Writer.stream(outputStream);

            SequenceFile.Writer.Option keyOption = SequenceFile.Writer.keyClass(IntWritable.class);
            SequenceFile.Writer.Option valueOption = SequenceFile.Writer.valueClass(Text.class);

            writer = SequenceFile.createWriter(conf, streamOption, keyOption, valueOption);

            IntWritable key = new IntWritable();
            Text value = new Text();
            for(int i=0; i<100; i++) {
                key.set(100 - i);
                value.set(DATAS[i % DATAS.length]);

                if ((100-i) % 10 == 0) {
                    writer.sync();
                }

                System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);

                writer.append(key, value);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        } finally {
            IOUtils.closeStreams(outputStream, writer);
        }
    }
}

The program creates a SequenceFile.Writer and inserts a sync point every 10 records; writer.getLength() returns the current position in the file.

2. Reading a SequenceFile
  • For Writable keys and values, iterate over the records by calling public boolean next(Writable key, Writable val);
  • For values using other serialization frameworks, call the following methods (see the sketch below):
    public synchronized Object next(Object key) throws IOException
    public synchronized Object getCurrentValue(Object val) throws IOException
    If next() returns null, the end of the file has been reached; otherwise pass the value object to getCurrentValue() to obtain the value.
    Make sure io.serializations is set to include the serialization framework in use.
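
For the non-Writable case, the read loop looks roughly like this (a sketch; it assumes the file was written with a serialization registered in io.serializations and that the SequenceFile.Reader is opened as in the full example below):

import org.apache.hadoop.io.SequenceFile;

import java.io.IOException;

public class GenericSequenceFileRead {

    // Iterate records whose key/value types use a non-Writable serialization framework.
    static void dump(SequenceFile.Reader reader) throws IOException {
        Object key = null;
        Object value = null;
        while ((key = reader.next(key)) != null) {   // null return means end of file
            value = reader.getCurrentValue(value);   // value of the record just read
            System.out.println(key + "\t" + value);
        }
    }
}
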
package com.zyf.study5;

import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

import java.io.IOException;
import java.net.URI;

public class SequenceFileReaderDemo {

    public static void main(String[] args) {
        URI uri = URI.create("hdfs://127.0.0.1:9000/user/ossuser/sequenceFile.txt");

        SequenceFile.Reader reader = null;
        try {
            Configuration conf = new Configuration();

            SequenceFile.Reader.Option fileOption = SequenceFile.Reader.file(new Path(uri));
            reader = new SequenceFile.Reader(conf, fileOption);

            Class keyClass = reader.getKeyClass();
            Class valueClass = reader.getValueClass();
            Writable key = (Writable) ReflectionUtils.newInstance(keyClass, conf);
            Writable value = (Writable) ReflectionUtils.newInstance(valueClass, conf);

            long position = reader.getPosition();
            while(reader.next(key, value)) {
                String syncSeen = reader.syncSeen() ? "*":"";
                System.out.printf("[%s%s]%s %s\n", position, syncSeen, key, value);
                position = reader.getPosition();
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            IOUtils.closeQuietly(reader);
        }
    }
}

Output:

[128*]100 One, two, buckle my shoe
[193]99 Three, four, shut the door
[240]98 Five, sex, pick up sticks
[284]97 Seven, eight, lay them straight
[334]96 Nine, then, a big fat hen
[378]95 One, two, buckle my shoe
[423]94 Three, four, shut the door
[470]93 Five, sex, pick up sticks
[514]92 Seven, eight, lay them straight
[564]91 Nine, then, a big fat hen
[608*]90 One, two, buckle my shoe
[673]89 Three, four, shut the door
...

The following code seeks to a given position in the file:

reader.seek(240);
reader.next(key, value);
System.out.printf("[%s%s]%s %s\n", position, syncSeen, key, value);

Output:

[240]98 Five, sex, pick up sticks

But if the given position is not a record boundary, the read fails:

reader.seek(239);

Output:

java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:197)

Sync points can be used to find record boundaries. The code below positions the reader at the first sync point after position 239; if there is no sync point after the given position, the reader is positioned at the end of the file. A SequenceFile with sync points can be used as MapReduce input: such files can be split, with the different parts processed independently by separate map tasks. See SequenceFileInputFormat.

reader.sync(239);
long position = reader.getPosition();
String syncSeen = reader.syncSeen() ? "*":"";
reader.next(key, value);
System.out.printf("[%s%s]%s %s\n", position, syncSeen, key, value);

Output:

[608]90 One, two, buckle my shoe
3. Reading a SequenceFile on the command line

hdfs dfs -text displays a SequenceFile in text form. The option inspects a file's magic number to detect its type and convert it to text; gzip, bzip2, Avro, and SequenceFile are supported. If custom key/value classes are used, make sure they are on the Hadoop classpath.

> hdfs dfs -text /user/ossuser/sequenceFile.txt
2019-05-08 14:57:58,744 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-05-08 14:57:59,914 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
100     One, two, buckle my shoe
99      Three, four, shut the door
98      Five, sex, pick up sticks
...
> hdfs dfs -text /user/ossuser/test.txt.gz
abc
> hdfs dfs -text /user/ossuser/test.txt.bz2
bzip2 file
4. Sorting and merging SequenceFiles
> hadoop jar hadoop-3.1.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar sort -r 1 \
-inFormat org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat \
-outFormat org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat \
-outKey org.apache.hadoop.io.IntWritable \
-outValue org.apache.hadoop.io.Text \
/user/ossuser/sequenceFile.txt /user/ossuser/sorted

> hdfs dfs -text sorted/part-r-00000
1       Nine, then, a big fat hen
2       Seven, eight, lay them straight
3       Five, sex, pick up sticks
4       Three, four, shut the door
...

-r          number of reduce tasks
-inFormat   job input format
-outFormat  job output format
-outKey     output key type
-outValue   output value type

5. The SequenceFile format

Depending on the compression used, a SequenceFile has one of three formats; all three share the same header.

Header
  • version - the first 3 bytes are the magic SEQ, followed by 1 byte giving the actual version number (e.g. SEQ4 or SEQ6)
  • keyClassName - the key class
  • valueClassName - the value class
  • compression - a boolean indicating whether the keys/values in the file are compressed
  • blockCompression - a boolean indicating whether the keys/values are block-compressed
  • compression codec - the CompressionCodec class used to compress keys and/or values (if compression is enabled)
  • metadata - SequenceFile.Metadata for the file
  • sync - a sync marker that marks the end of the header
Uncompressed
  • Header
  • Record
    • Record length
    • Key length
    • Key
    • Value
  • A sync marker every 100 bytes or so.
Record compression
  • Header
  • Record
    • Record length
    • Key length
    • Key
    • Compressed Value
  • A sync marker every 100 bytes or so.
Block compression
  • Header
  • Record Block
    • Uncompressed number of records in the block
    • Compressed key-lengths block size
    • Compressed key-lengths block
    • Compressed keys block size
    • Compressed keys block
    • Compressed value-lengths block size
    • Compressed value-lengths block
    • Compressed values block size
    • Compressed values block
  • A sync marker after every block.

The key-lengths and value-lengths blocks contain the actual length of each key/value pair, encoded in ZeroCompressedInteger format.

Block compression compresses multiple records at once and can exploit similarity between records, which gives a higher compression ratio than compressing records individually. Records keep being added to a block until the block size is no smaller than io.seqfile.compress.blocksize, which defaults to 1 MB.

5.4.2 MapFile

A MapFile is a sorted SequenceFile with an index that supports lookups by key. The index, itself a SequenceFile, contains only a fraction of the keys (every 128th key by default); because the index can be loaded into memory, it provides fast lookups into the main data file, which is another SequenceFile containing all of the key-value pairs in sorted order. A minimal write/read sketch follows this paragraph.
When writing with MapFile.Writer, entries must be added in key order, otherwise an IOException is thrown.
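
A minimal write/read sketch (the path numbers.map is illustrative; a MapFile is actually a directory holding a data file and an index file):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

import java.io.IOException;

public class MapFileDemo {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path dir = new Path("numbers.map");

        try (MapFile.Writer writer = new MapFile.Writer(conf, dir,
                MapFile.Writer.keyClass(IntWritable.class),
                MapFile.Writer.valueClass(Text.class))) {
            for (int i = 1; i <= 100; i++) {
                // keys must be appended in increasing order, otherwise an IOException is thrown
                writer.append(new IntWritable(i), new Text("entry-" + i));
            }
        }

        try (MapFile.Reader reader = new MapFile.Reader(dir, conf)) {
            Text value = new Text();
            reader.get(new IntWritable(42), value);   // looks up the index, then scans the data file
            System.out.println(value);
        }
    }
}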

MapFile variants
  • ArrayFile - the key is an int representing the element's index, and the value is a Writable
  • SetFile - a specialized MapFile for storing a set of Writable keys
  • BloomMapFile - provides a high-performance get() implementation, useful for sparsely populated files. It uses a dynamic Bloom filter to test whether the map contains a given key; the test runs in memory and is very fast, but it has a non-zero false-positive rate. Only when the test passes is the regular get() called.
5.4.3 Other File Formats and Column-Oriented Formats

Avro datafiles are similar to SequenceFiles: binary and designed for large-scale data processing, compact and splittable, but also portable across programming languages and described by a schema.

  • Row-oriented: SequenceFile, Avro datafiles, and map files store all the values of a row contiguously in the file. Even when only a few columns are needed, the whole row must be loaded into memory. They suit workloads that read most of the columns of each row.
  • Column-oriented: the rows are split into slices, and each slice is stored column by column; when only a few columns are read, only those columns need to be loaded into memory. They suit workloads that access a small subset of the columns, but they are less suitable for streaming writes, since writing spans several files and is harder to control. Hive's earlier column-oriented format RCFile (Record Columnar File) has been superseded by ORCFile (Optimized Record Columnar File) and Parquet.
[Figures: a logical table; row-oriented storage; column-oriented storage (rows split into slices, with each column stored contiguously)]
