5.1 Data Integrity
When the volume of data flowing through a system is as large as what Hadoop is designed to handle, the chance of data loss or corruption becomes significant.
- Countermeasure: compute a checksum when data first enters the system, and compute it again whenever the data passes through an untrusted channel, comparing the two. A checksum can only detect corruption, not repair it, so using ECC memory (Error-Correcting Code memory) is also recommended.
A commonly used error-detecting code is CRC-32, which computes a 32-bit integer checksum for input of any size (illustrated in the sketch below).
- Hadoop's ChecksumFileSystem uses CRC-32.
- HDFS uses CRC-32C, a more efficient variant.
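A small illustration of the CRC-32 idea (this is not Hadoop's internal implementation; the sketch simply uses java.util.zip.CRC32): the same bytes always produce the same 32-bit checksum, so recomputing and comparing it reveals corruption.
import java.util.zip.CRC32;
public class Crc32Demo {
    public static void main(String[] args) {
        byte[] data = "hello hadoop".getBytes();
        CRC32 crc = new CRC32();
        crc.update(data);                 // checksum computed when the data enters the system
        long expected = crc.getValue();   // a 32-bit value, regardless of input size
        data[0] ^= 1;                     // simulate corruption in transit
        crc.reset();
        crc.update(data);
        // a mismatch detects the error, but gives no way to repair it
        System.out.println(expected == crc.getValue() ? "data intact" : "data corrupted");
    }
}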
5.1.1 Data Integrity in HDFS
5.2 Compression
Two main benefits:
- Reduces the space needed to store files
- Speeds up data transfer across the network and to or from disk
Every compression tool trades off time against space: -1 optimizes for compression speed, -9 for compression ratio.
5.2.1 Codecs
A codec is an implementation of a compression/decompression algorithm; in Hadoop, codecs are classes that implement the CompressionCodec interface.
Table 5-2. Hadoop compression codecs
Compression format | CompressionCodec implementation |
---|---|
deflate | org.apache.hadoop.io.compress.DefaultCodec |
gzip | org.apache.hadoop.io.compress.GzipCodec |
lz4 | org.apache.hadoop.io.compress.Lz4Codec |
snappy | org.apache.hadoop.io.compress.SnappyCodec |
bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
1. Compressing and decompressing streams with CompressionCodec
Example: compressing standard input and writing the result to standard output:
package com.zyf.study5;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;
import java.io.IOException;
public class StreamCompressor {
public static void main(String[] args) {
try {
Class codecClass = GzipCodec.class;
CompressionCodec compressionCodec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, new Configuration());
CompressionOutputStream outputStream = compressionCodec.createOutputStream(System.out);
IOUtils.copyBytes(System.in, outputStream, 4096, false);
outputStream.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
Edit the Maven pom.xml and configure the maven-jar-plugin so the jar manifest's main class points at the entry point:
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-jar-plugin</artifactId>
    <configuration>
        <archive>
            <manifest>
                <mainClass>com.zyf.study5.StreamCompressor</mainClass>
            </manifest>
        </archive>
    </configuration>
</plugin>
Run the program:
> echo 'Text' | hadoop jar hadoop-first-1.0-SNAPSHOT.jar | gunzip
Text
2. Inferring a CompressionCodec with CompressionCodecFactory
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
import java.net.URI;
public class FileDecompressor {
private static final Logger LOGGER = LoggerFactory.getLogger(FileDecompressor.class);
public static void main(String[] args) {
if (args.length == 0) {
System.err.println("Usage: FileDecompressor ");
System.exit(-1);
}
LOGGER.info("input path is " + args[0]);
Path path = new Path(args[0]);
Configuration conf = new Configuration();
CompressionCodecFactory compressionCodecFactory = new CompressionCodecFactory(conf);
CompressionCodec compressionCodec = compressionCodecFactory.getCodec(path);
if (compressionCodec == null) {
System.err.println("No Codec found for " + args[0]);
System.exit(-1);
}
try {
URI uri = URI.create(args[0]);
FileSystem fileSystem = FileSystem.get(uri, conf, "ossuser");
FSDataInputStream fsDataInputStream = fileSystem.open(path);
CompressionInputStream compressionInputStream = compressionCodec.createInputStream(fsDataInputStream);
String fileName = CompressionCodecFactory.removeSuffix(path.getName(), compressionCodec.getDefaultExtension());
LOGGER.info("output path is " + fileName);
FSDataOutputStream fsDataOutputStream = fileSystem.create(new Path(fileName));
IOUtils.copyBytes(compressionInputStream, fsDataOutputStream, 4096, false);
fsDataOutputStream.close();
fsDataInputStream.close();
} catch (IOException e) {
e.printStackTrace();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
Run the program:
> hadoop jar hadoop-first-1.0-SNAPSHOT.jar hdfs://127.0.0.1:9000/user/ossuser/test.txt.gz
19/05/07 18:57:46 INFO study5.FileDecompressor: input path is hdfs://127.0.0.1:9000/user/ossuser/test.txt.gz
19/05/07 18:57:48 WARN zlib.ZlibFactory: Failed to load/initialize native-zlib library
19/05/07 18:57:48 INFO compress.CodecPool: Got brand-new decompressor [.gz]
19/05/07 18:57:48 INFO study5.FileDecompressor: output path is test.txt
> hadoop jar hadoop-first-1.0-SNAPSHOT.jar hdfs://127.0.0.1:9000/user/ossuser/test.txt.bz2
19/05/07 18:57:46 INFO study5.FileDecompressor: input path is hdfs://127.0.0.1:9000/user/ossuser/test.txt.bz2
19/05/07 19:01:17 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version
19/05/07 19:01:17 INFO compress.CodecPool: Got brand-new decompressor [.bz2]
19/05/07 19:01:17 INFO study5.FileDecompressor: output path is test.txt
When the program finishes, a test.txt file containing the decompressed contents appears under the /user/ossuser/ directory.
CompressionCodecFactory loads all of the codec implementations listed in Table 5-2, plus any listed in the io.compression.codecs property; each codec knows its own default file extension.
3. Native libraries
Using native libraries gives better performance: compared with the Java implementation, native gzip roughly halves decompression time and cuts compression time by about 10%. The library location can be specified with the java.library.path system property, set in the scripts under etc/hadoop/, or set manually in the application. Native libraries can be disabled with io.native.lib.available=false.
4. CodecPool
If you compress or decompress many streams, CodecPool lets you reuse compressors and decompressors, amortizing the cost of creating them:
package com.zyf.study5;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.util.ReflectionUtils;
import java.io.IOException;
public class PooledStreamCompressor {
public static void main(String[] args) {
if (args.length == 0) {
System.err.println("Usage: PooledStreamCompressor ");
System.exit(-1);
}
Compressor compressor = null;
try {
Class classOfCodec = Class.forName(args[0]);
Configuration conf = new Configuration();
CompressionCodec compressionCodec = (CompressionCodec) ReflectionUtils.newInstance(classOfCodec, conf);
compressor = CodecPool.getCompressor(compressionCodec);
CompressionOutputStream outputStream = compressionCodec.createOutputStream(System.out, compressor);
IOUtils.copyBytes(System.in, outputStream, 4096, false);
outputStream.finish();
} catch (ClassNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} finally {
CodecPool.returnCompressor(compressor);
}
}
}
Run the program:
> echo 'gzip output' | hadoop jar hadoop-first-1.0-SNAPSHOT.jar org.apache.hadoop.io.compress.GzipCodec | gunzip
2019-05-07 19:37:57,413 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-05-07 19:37:57,414 INFO compress.CodecPool: Got brand-new compressor [.gz]
gzip output
5.2.2 Compression and Input Splits
gzip does not support splitting.
bzip2 does support splitting.
Which compression format should you use?
It depends on the size and format of your files and the tools you use. The following suggestions are ordered roughly from most to least effective:
- Use a container file format such as SequenceFile, Avro, ORCFile, or Parquet. These formats support both compression and splitting, ideally combined with a fast compressor such as LZO, LZ4, or Snappy.
- Use a compression format that supports splitting, such as bzip2, or one that can be indexed to support splitting, such as LZO.
- Split the file into chunks in the application and compress each chunk separately, choosing the chunk size so that each compressed chunk is roughly the size of an HDFS block.
- Store the files uncompressed.
For large files, do not use a compression format that does not support splitting: you lose data locality and the MapReduce job becomes inefficient.
5.2.3 Using Compression in MapReduce
- Input: if the input files are compressed, MapReduce uses CompressionCodecFactory to infer the codec from the file extension and decompresses the files automatically as they are read.
- Output: set mapreduce.output.fileoutputformat.compress=true to compress the output and mapreduce.output.fileoutputformat.compress.codec to choose the codec; alternatively, configure it through FileOutputFormat:
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
Run the program:
> hadoop jar hadoop-first-1.0-SNAPSHOT.jar input/input.txt.gz output
...
19/05/08 09:40:05 WARN zlib.ZlibFactory: Failed to load/initialize native-zlib library
19/05/08 09:40:05 INFO compress.CodecPool: Got brand-new decompressor [.gz]
Output files:
._SUCCESS.crc
.part-r-00000.gz.crc
_SUCCESS
part-r-00000.gz
If the output is written as a SequenceFile, i.e. job.setOutputFormatClass(SequenceFileOutputFormat.class) (note: the compression codec must then be compatible; the default is deflate, and setting FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class) will not work), you can set mapreduce.output.fileoutputformat.compress.type to control how compression is applied. The default is RECORD, which compresses each record individually; BLOCK is recommended.
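A minimal sketch of that configuration, assuming the new MapReduce API (org.apache.hadoop.mapreduce.lib.output); mapper/reducer and path setup are omitted:
Job job = Job.getInstance(new Configuration());
job.setOutputFormatClass(SequenceFileOutputFormat.class);
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);
// block compression groups many records together for a better compression ratio
SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);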
Property | Type | Default | Description |
---|---|---|---|
mapreduce.output.fileoutputformat.compress | boolean | false | Whether to compress the job output |
mapreduce.output.fileoutputformat.compress.codec | class name | org.apache.hadoop.io.compress.DefaultCodec | The compression codec for the output |
mapreduce.output.fileoutputformat.compress.type | String | RECORD | NONE, RECORD, or BLOCK |
Compressing map output
Compressing the intermediate map output with a fast codec such as LZO, Snappy, or LZ4 reduces the amount of data transferred over the network and can improve performance.
Property | Type | Default | Description |
---|---|---|---|
mapreduce.map.output.compress | boolean | false | Whether to compress map outputs |
mapreduce.map.output.compress.codec | class name | org.apache.hadoop.io.compress.DefaultCodec | The compression codec for map outputs |
Set map output compression in code as follows:
Configuration conf = new Configuration();
conf.setBoolean(Job.MAP_OUTPUT_COMPRESS, true);
conf.setClass(Job.MAP_OUTPUT_COMPRESS_CODEC, DefaultCodec.class, CompressionCodec.class);
Job job = Job.getInstance(conf);
5.3 Serialization
Serialization turns structured objects into a byte stream. It is used in two main areas of distributed data processing: interprocess communication and permanent storage. Hadoop uses RPC for interprocess communication: the RPC layer serializes a message into a binary stream and sends it to the remote node, which deserializes the binary stream back into the original message. It is desirable for an RPC serialization format to be:
- Compact: makes the best use of network bandwidth (and storage space)
- Fast: low serialization and deserialization overhead (efficient reading and writing of data)
- Extensible: the protocol stays backward and forward compatible (old formats can be read transparently)
- Interoperable: clients and servers written in different languages can interact (data in permanent storage can be read and written from different languages)
Deserialization turns a byte stream back into structured objects.
Hadoop's serialization format is Writable: compact and fast, but not easy to use or extend from languages other than Java. Avro overcomes some of Writable's shortcomings.
5.3.1 The Writable Interface
package org.apache.hadoop.io;
...
public interface Writable {
void write(DataOutput out) throws IOException;
void readFields(DataInput in) throws IOException;
}
1. Implementing the Writable interface
package com.zyf.study5;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.StringUtils;
import java.io.*;
public class MyWritable implements Writable {
private String name;
private int score;
public MyWritable() {
super();
}
public MyWritable(String name, int score) {
this.name = name;
this.score = score;
}
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(name);
out.writeInt(score);
}
@Override
public void readFields(DataInput in) throws IOException {
this.name = in.readUTF();
this.score = in.readInt();
}
public static byte[] serialize(MyWritable writable) throws IOException {
if (writable == null) {
return null;
}
ByteArrayOutputStream baos = new ByteArrayOutputStream();
DataOutputStream outputStream = new DataOutputStream(baos);
writable.write(outputStream);
outputStream.close();
return baos.toByteArray();
}
public static MyWritable deserialize(byte[] value) throws IOException {
if (value == null || value.length == 0) {
return null;
}
ByteArrayInputStream bais = new ByteArrayInputStream(value);
DataInputStream inputStream = new DataInputStream(bais);
MyWritable fromBytes = new MyWritable();
fromBytes.readFields(inputStream);
return fromBytes;
}
@Override
public String toString() {
return "MyWritable{" +
"name='" + name + '\'' +
", score=" + score +
'}';
}
public static void main(String[] args) throws IOException {
MyWritable myWritable = new MyWritable("abcdef", 33);
byte[] value = MyWritable.serialize(myWritable);
System.out.println(StringUtils.byteToHexString(value));
MyWritable fromBytes = MyWritable.deserialize(value);
System.out.println(fromBytes);
}
}
Output:
000661626364656600000021
MyWritable{name='abcdef', score=33}
In the serialized byte array, 0006 is the string length (6), 616263646566 is the UTF-8 encoding of "abcdef", and the final 4 bytes (00000021) are the integer 33.
2. The WritableComparable interface and comparators
IntWritable implements the WritableComparable interface, which extends both Writable and Comparable.
package org.apache.hadoop.io;
public class IntWritable implements WritableComparable<IntWritable> {
/** A Comparator optimized for IntWritable. */
public static class Comparator extends WritableComparator {
public Comparator() {
super(IntWritable.class);
}
@Override
public int compare(byte[] b1, int s1, int l1,
byte[] b2, int s2, int l2) {
int thisValue = readInt(b1, s1);
int thatValue = readInt(b2, s2);
return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
    }
  }
}
Type comparison is crucial in MapReduce, which has a key-sorting phase in the middle. Hadoop also provides an optimized comparator: WritableComparator implements RawComparator<WritableComparable> and Configurable.
package org.apache.hadoop.io;
import org.apache.hadoop.io.serializer.DeserializerComparator;
/**
*
* A {@link Comparator} that operates directly on byte representations of
* objects.
*
* @param <T>
* @see DeserializerComparator
*/
public interface RawComparator<T> extends Comparator<T> {
/**
* Compare two objects in binary.
* b1[s1:l1] is the first object, and b2[s2:l2] is the second object.
*
* @param b1 The first byte array.
* @param s1 The position index in b1. The object under comparison's starting index.
* @param l1 The length of the object in b1.
* @param b2 The second byte array.
* @param s2 The position index in b2. The object under comparison's starting index.
* @param l2 The length of the object under comparison in b2.
* @return An integer result of the comparison.
*/
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}
This interface lets implementations compare records directly in their byte-stream form, without first deserializing them into objects, avoiding the overhead of object creation. IntWritable.Comparator, for example, reads the integers straight out of the byte arrays and compares them. WritableComparator is a general-purpose implementation of RawComparator that provides two things:
1. A default implementation of the raw compare() method, which deserializes the bytes from the stream into objects and then invokes the objects' compare() method;
2. A factory for RawComparator instances: calling
public static WritableComparator get(Class<? extends WritableComparable> c)
returns the comparator for the given type. For example:
byte[] b1 = new byte[] {0, 0, 0, 15};
byte[] b2 = new byte[] {0, 0, 0, 13};
WritableComparator comparator = WritableComparator.get(IntWritable.class);
int compare = comparator.compare(b1, 0, 4, b2, 0, 4);
or
IntWritable intWritable1 = new IntWritable(100);
IntWritable intWritable2 = new IntWritable(101);
compare = comparator.compare(intWritable1, intWritable2);
5.3.2 Writable Implementations
1) Writable classes for Java primitives
Java primitive | Writable implementation | Serialized size (bytes) |
---|---|---|
boolean | BooleanWritable | 1 |
byte | ByteWritable | 1 |
short | ShortWritable | 2 |
char | none (IntWritable can be used) | - |
int | IntWritable | 4 |
 | VIntWritable | 1-5 |
long | LongWritable | 8 |
 | VLongWritable | 1-9 |
float | FloatWritable | 4 |
double | DoubleWritable | 8 |
Integers come in fixed-length and variable-length encodings. When the values to encode are mostly small, the variable-length encoding saves space: 127, for example, needs only one byte.
The fixed-length encoding suits values that are uniformly distributed across the whole value space, such as the output of a well-designed hash function.
In most cases value distributions are not uniform, so the variable-length encoding generally saves space.
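A quick way to see the difference is to serialize the same value both ways; a minimal sketch (class and method names are illustrative):
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.VIntWritable;
import org.apache.hadoop.io.Writable;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
public class VarIntSizeDemo {
    static int serializedSize(Writable w) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        w.write(new DataOutputStream(baos));
        return baos.size();
    }
    public static void main(String[] args) throws IOException {
        System.out.println(serializedSize(new IntWritable(127)));      // 4 bytes, fixed length
        System.out.println(serializedSize(new VIntWritable(127)));     // 1 byte
        System.out.println(serializedSize(new VIntWritable(1000000))); // grows with the value, at most 5 bytes
    }
}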
2) Text
Text is the Writable class for UTF-8 byte sequences and is generally regarded as the Writable equivalent of java.lang.String.
Text records the string's byte length as a variable-length integer, so it can hold up to 2 GB of data, and it uses standard UTF-8 encoding.
package org.apache.hadoop.io;
public class Text extends BinaryComparable
implements WritableComparable<BinaryComparable> {
private static final ThreadLocal<CharsetEncoder> ENCODER_FACTORY =
new ThreadLocal<CharsetEncoder>() {
@Override
protected CharsetEncoder initialValue() {
return Charset.forName("UTF-8").newEncoder().
onMalformedInput(CodingErrorAction.REPORT).
onUnmappableCharacter(CodingErrorAction.REPORT);
}
};
/** Set to contain the contents of a string.
*/
public void set(String string) {
try {
ByteBuffer bb = encode(string, true);
bytes = bb.array();
length = bb.limit();
}catch(CharacterCodingException e) {
throw new RuntimeException("Should not have happened ", e);
}
}
public static ByteBuffer encode(String string, boolean replace)
throws CharacterCodingException {
CharsetEncoder encoder = ENCODER_FACTORY.get();
if (replace) {
encoder.onMalformedInput(CodingErrorAction.REPLACE);
encoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
}
ByteBuffer bytes =
encoder.encode(CharBuffer.wrap(string.toCharArray()));
if (replace) {
encoder.onMalformedInput(CodingErrorAction.REPORT);
encoder.onUnmappableCharacter(CodingErrorAction.REPORT);
}
return bytes;
}
@Override
public void write(DataOutput out) throws IOException {
WritableUtils.writeVInt(out, length);
out.write(bytes, 0, length);
}
/** Write a UTF8 encoded string to out
*/
public static int writeString(DataOutput out, String s) throws IOException {
ByteBuffer bytes = encode(s);
int length = bytes.limit();
WritableUtils.writeVInt(out, length);
out.write(bytes.array(), 0, length);
return length;
}
}
Text indexing is in terms of byte positions within the UTF-8 encoded byte sequence, whereas String indexing is in terms of char (UTF-16 code units):
String str = "我要学Hadoop,哈";
Text text = new Text(str);
print("str.length()=" + str.length());
print("text.getLength()=" + text.getLength());
byte[] bytes = text.getBytes();
print("text.getBytes()=" + StringUtils.byteToHexString(bytes));
print("text.getBytes().length=" + bytes.length);
print("text.find(\"我\")=" + text.find("我"));
print("text.find(\"要\")=" + text.find("要"));
print("text.find(\"学\")=" + text.find("学"));
print("text.find(\"o\")=" + text.find("o"));
print(str.charAt(2));
print((char) text.charAt(6));
print(text.charAt(100));
//print(str.charAt(100));
Output:
str.length()=11
text.getLength()=19
text.getBytes()=e68891e8a681e5ada64861646f6f702ce59388000000000000
text.getBytes().length=25
text.find("我")=0
text.find("要")=3
text.find("学")=6
text.find("o")=12
学
学
-1
- String.length() returns the number of UTF-16 code units in the string, while text.getLength() returns the number of bytes in its UTF-8 encoding. The array returned by text.getBytes() may contain unused trailing bytes, as in the example above, which is why text.getBytes().length can be larger than the actual data length.
- text.find() is analogous to string.indexOf(), but the former returns a byte offset and the latter a character offset.
- Both String and Text have a charAt() method: String.charAt() returns the character at the given index and throws StringIndexOutOfBoundsException if the index exceeds the string length, whereas Text.charAt() takes a byte offset and returns -1 if the offset is past the end of the byte array.
Unicode code point | U+0041 | U+00DF | U+6771 | U+10400 |
---|---|---|---|---|
Name | A | ß | 東 | (a supplementary character, represented as a surrogate pair) |
Java representation | \u0041 | \u00DF | \u6771 | \uD801\uDC00 |
String s = "\u0041\u00DF\u6771\uD801\uDC00";
Text t = new Text(s);
print(s.length() + ", " + t.getLength());
print(s.getBytes(Charset.forName("UTF-8")).length + ", " + t.getLength());
print(s.indexOf("\u0041") + ", " + t.find("\u0041"));
print(s.indexOf("\u00DF") + ", " + t.find("\u00DF"));
print(s.indexOf("\u6771") + ", " + t.find("\u6771"));
print(s.indexOf("\uD801\uDC00") + ", " + t.find("\uD801\uDC00"));
print(s.charAt(0) + ", " + (char)t.charAt(0));
print(s.charAt(1) + ", " + (char)t.charAt(1));
print(s.charAt(2) + ", " + (char)t.charAt(3));
print(s.charAt(3) + ", " + (char)t.charAt(6));
print(s.codePointAt(0) == 0x0041);
print(s.codePointAt(1) == 0x00DF);
print(s.codePointAt(2) == 0x6771);
print(s.codePointAt(3) == 0x10400);
Output:
5, 10
10, 10
0, 0
1, 1
2, 3
3, 6
A, A
ß, ß
東, 東
?, Ѐ
true
true
true
true
As shown above, the length of a Text object is the number of bytes in its UTF-8 encoding (1 + 2 + 3 + 4 = 10).
Iterating over Text:
String s1 = "\u0041\u00DF\u6771\uD801\uDC00";
Text t1 = new Text(s1);
ByteBuffer byteBuffer = ByteBuffer.wrap(t1.getBytes(), 0, t1.getLength());
int value;
while(byteBuffer.hasRemaining() && (value = Text.bytesToCodePoint(byteBuffer)) != -1) {
print(Integer.toHexString(value));
}
Output:
41
df
6771
10400
Text is mutable: its value can be changed by calling set(). The Text API is less rich than String's, so it is common to convert a Text to a String for manipulation, as in the short sketch below.
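A minimal sketch of reusing a single Text instance and dropping down to String for the richer API:
Text t = new Text("hadoop");
t.set(new Text("pig"));               // reuse the same object with new contents
t.set("hive");                        // set() also accepts a String
String s = t.toString();              // convert to String for the richer String API
System.out.println(s.toUpperCase());  // HIVE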
3) BytesWritable
BytesWritable is a wrapper for an array of binary data. It is mutable, and the length of the array returned by getBytes() does not necessarily reflect the true size of the stored data (use getLength() for that). The serialized format is a 4-byte integer giving the number of data bytes, followed by the data itself.
BytesWritable bytesWritable = new BytesWritable(new byte[]{3, 5, 7});
print(StringUtils.byteToHexString(serialize(bytesWritable)));
Output:
00000003030507
4) NullWritable
NullWritable is a special kind of Writable with a zero-length serialization: it reads no bytes from the stream and writes none to it. It acts as a placeholder; in MapReduce, for example, a key or value slot that is not needed can be declared as NullWritable, which stores the constant empty value efficiently. NullWritable can also be used as the key of a SequenceFile.
public class NullWritable implements WritableComparable<NullWritable> {
private static final NullWritable THIS = new NullWritable();
private NullWritable() {} // no public ctor
public static NullWritable get() { return THIS; }
@Override
public String toString() {
return "(null)";
}
@Override
public int compareTo(NullWritable other) {
return 0;
}
@Override
public void readFields(DataInput in) throws IOException {}
@Override
public void write(DataOutput out) throws IOException {}
/** A Comparator "optimized" for NullWritable. */
public static class Comparator extends WritableComparator {
public Comparator() {
super(NullWritable.class);
}
@Override
public int compare(byte[] b1, int s1, int l1,
byte[] b2, int s2, int l2) {
return 0;
}
}
}
5) ObjectWritable and GenericWritable
ObjectWritable is a general-purpose wrapper for Strings, Java primitives, enums, Writables, and arrays of these types. It is useful when a field can hold values of more than one type: if, say, the values of a SequenceFile have several types, the value can be declared as ObjectWritable and each instance wrapped in an ObjectWritable. The drawback is that every serialization writes the wrapped type's fully qualified class name, which wastes a lot of space; GenericWritable addresses this.
public class ObjectWritable implements Writable, Configurable {
public static void writeObject(DataOutput out, Object instance,
Class declaredClass, Configuration conf, boolean allowCompactArrays)
throws IOException {
if (instance == null) { // null
instance = new NullInstance(declaredClass, conf);
declaredClass = Writable.class;
}
// Special case: must come before writing out the declaredClass.
// If this is an eligible array of primitives,
// wrap it in an ArrayPrimitiveWritable$Internal wrapper class.
if (allowCompactArrays && declaredClass.isArray()
&& instance.getClass().getName().equals(declaredClass.getName())
&& instance.getClass().getComponentType().isPrimitive()) {
instance = new ArrayPrimitiveWritable.Internal(instance);
declaredClass = ArrayPrimitiveWritable.Internal.class;
}
UTF8.writeString(out, declaredClass.getName()); // always write declared
if (declaredClass.isArray()) { // non-primitive or non-compact array
int length = Array.getLength(instance);
out.writeInt(length);
for (int i = 0; i < length; i++) {
writeObject(out, Array.get(instance, i),
declaredClass.getComponentType(), conf, allowCompactArrays);
}
} else if (declaredClass == ArrayPrimitiveWritable.Internal.class) {
((ArrayPrimitiveWritable.Internal) instance).write(out);
} else if (declaredClass == String.class) { // String
UTF8.writeString(out, (String)instance);
} else if (declaredClass.isPrimitive()) { // primitive type
if (declaredClass == Boolean.TYPE) { // boolean
out.writeBoolean(((Boolean)instance).booleanValue());
} else if (declaredClass == Character.TYPE) { // char
out.writeChar(((Character)instance).charValue());
} else if (declaredClass == Byte.TYPE) { // byte
out.writeByte(((Byte)instance).byteValue());
} else if (declaredClass == Short.TYPE) { // short
out.writeShort(((Short)instance).shortValue());
} else if (declaredClass == Integer.TYPE) { // int
out.writeInt(((Integer)instance).intValue());
} else if (declaredClass == Long.TYPE) { // long
out.writeLong(((Long)instance).longValue());
} else if (declaredClass == Float.TYPE) { // float
out.writeFloat(((Float)instance).floatValue());
} else if (declaredClass == Double.TYPE) { // double
out.writeDouble(((Double)instance).doubleValue());
} else if (declaredClass == Void.TYPE) { // void
} else {
throw new IllegalArgumentException("Not a primitive: "+declaredClass);
}
} else if (declaredClass.isEnum()) { // enum
UTF8.writeString(out, ((Enum)instance).name());
} else if (Writable.class.isAssignableFrom(declaredClass)) { // Writable
UTF8.writeString(out, instance.getClass().getName());
((Writable)instance).write(out);
} else if (Message.class.isAssignableFrom(declaredClass)) {
((Message)instance).writeDelimitedTo(
DataOutputOutputStream.constructOutputStream(out));
} else {
throw new IOException("Can't write: "+instance+" as "+declaredClass);
}
}
}
public class UTF8 implements WritableComparable<UTF8> {
public static int writeString(DataOutput out, String s) throws IOException {
int len = utf8Length(s);
out.writeShort(len);
writeChars(out, s, 0, s.length());
return len;
}
}
Serializing objects with ObjectWritable:
public static byte[] serialize(Object object) throws IOException {
ObjectWritable objectWritable = new ObjectWritable(object);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
DataOutputStream outputStream = new DataOutputStream(baos);
objectWritable.write(outputStream);
return baos.toByteArray();
}
public static void main(String[] args) throws IOException {
Person p = new Person("z1", 1);
byte[] values = serialize(p);
print(StringUtils.byteToHexString(values));
int[] array = new int[]{3, 5, 10};
values = serialize(array);
print(StringUtils.byteToHexString(values));
Person[] ps = new Person[] {new Person("z1", 1), new Person("z2", 2)};
values = serialize(ps);
print(StringUtils.byteToHexString(values));
}
Output explained
# Output 1 (spaces and line breaks added for readability; the actual output has none)
0028 636f6d2e7a79662e7374756479352e4f626a6563745772697461626c6544656d6f24506572736f6e
0028 636f6d2e7a79662e7374756479352e4f626a6563745772697461626c6544656d6f24506572736f6e 0002 7a31 00000001
# Explanation 1
{string length: 40} com.zyf.study5.ObjectWritableDemo$Person
{string length: 40} com.zyf.study5.ObjectWritableDemo$Person {string length: 2} z1 1
# Output 2 (spaces added for readability; the actual output has none)
0002 5b49 00000003 0003696e74 00000003 0003696e74 00000005 0003696e74 0000000a
# Explanation 2
{string length: 2} [I {array length: 3} {string length: 3} int 3 {string length: 3} int 5 {string length: 3} int 10
# Output 3 (spaces and line breaks added for readability; the actual output has none)
002b 5b4c 636f6d2e7a79662e7374756479352e4f626a6563745772697461626c6544656d6f24506572736f6e3b 00000002
0028 636f6d2e7a79662e7374756479352e4f626a6563745772697461626c6544656d6f24506572736f6e
0028 636f6d2e7a79662e7374756479352e4f626a6563745772697461626c6544656d6f24506572736f6e 0002 7a31 00000001
0028 636f6d2e7a79662e7374756479352e4f626a6563745772697461626c6544656d6f24506572736f6e
0028 636f6d2e7a79662e7374756479352e4f626a6563745772697461626c6544656d6f24506572736f6e 0002 7a32 00000002
# Explanation 3:
{string length: 43} [Lcom.zyf.study5.ObjectWritableDemo$Person; {array length: 2}
{string length: 40} com.zyf.study5.ObjectWritableDemo$Person
{string length: 40} com.zyf.study5.ObjectWritableDemo$Person {string length: 2} z1 1
{string length: 40} com.zyf.study5.ObjectWritableDemo$Person
{string length: 40} com.zyf.study5.ObjectWritableDemo$Person {string length: 2} z2 2
The output shows that ObjectWritable's serialized form is quite long. For an array, it first serializes the array's fully qualified class name, then the array length; for each element it recursively calls writeObject, which starts by writing the element's declared class name, and because the element is a Writable, that branch writes the element's actual class name once more before calling the element's write() method to serialize its fields. Using GenericWritable instead greatly reduces the space taken up by class names, as follows:
public static void main(String[] args) throws IOException {
Person p = new Person("z1", 1);
CustomGenericWritable customGenericWritable = new CustomGenericWritable();
customGenericWritable.set(p);
byte[] values = serialize(customGenericWritable);
print(StringUtils.byteToHexString(values));
MyWritable myWritable = new MyWritable("abcdef", 33);
customGenericWritable.set(myWritable);
values = serialize(customGenericWritable);
print(StringUtils.byteToHexString(values));
}
static class CustomGenericWritable extends GenericWritable {
final static Class[] TYPES = new Class[] {MyWritable.class, Person.class};
@Override
protected Class<? extends Writable>[] getTypes() {
return TYPES;
}
}
Output:
# Output 1:
01 0002 7a31 00000001
# Explanation 1:
{type id: index of the class in the TYPES array} {string length: 2} z1 1
# Output 2:
00 0006 616263646566 00000021
# Explanation 2:
{type id: index of the class in the TYPES array} {string length: 6} abcdef 33
6) Writable collections
The org.apache.hadoop.io package contains six Writable collection classes: ArrayWritable, ArrayPrimitiveWritable, TwoDArrayWritable, MapWritable, SortedMapWritable, and EnumSetWritable.
ArrayWritable and TwoDArrayWritable are Writable wrappers for one- and two-dimensional arrays of Writables; all elements must be instances of the same Writable class. Both provide a toArray() method that creates a shallow copy of the array. To store elements of different Writable types, wrap them in GenericWritable.
public Object toArray() {
Object result = Array.newInstance(valueClass, values.length);
for (int i = 0; i < values.length; i++) {
Array.set(result, i, values[i]);
}
return result;
}
Example:
public static void main(String[] args) throws IOException {
MyWritable[] myWritables = new MyWritable[] {new MyWritable("z1", 1), new MyWritable("z2", 2)};
ArrayWritable arrayWritable = new ArrayWritable(MyWritable.class, myWritables);
byte[] values = serialize(arrayWritable);
System.out.println(StringUtils.byteToHexString(values));
}
Output:
00000002 0002 7a31 00000001 0002 7a32 00000002
{array length: 2} {string length: 2} z1 1 {string length: 2} z2 2
ArrayPrimitiveWritable is a wrapper for arrays of Java primitives. Example:
public static void main(String[] args) throws IOException {
int[] array = new int[]{1, 3, 5, 7};
ArrayPrimitiveWritable writable = new ArrayPrimitiveWritable(array);
byte[] bytes = serialize(writable);
System.out.println(StringUtils.byteToHexString(bytes));
}
Output:
0003 696e74 00000004 00000001 00000003 00000005 00000007
MapWritable extends AbstractMapWritable and implements the Map interface, storing its key-value pairs in a HashMap. AbstractMapWritable maintains a mapping between key/value classes and numeric ids: when put() inserts a pair whose key and/or value class is not yet in the mapping, the class is added and assigned the next id, counting up from 1.
public class MapWritable extends AbstractMapWritable
implements Map<Writable, Writable> {
}
Example:
public static void main(String[] args) throws IOException {
MapWritable mapWritable = new MapWritable();
IntWritable intWritable1 = new IntWritable(1);
IntWritable intWritable2 = new IntWritable(2);
IntWritable intWritable3 = new IntWritable(3);
mapWritable.put(intWritable1, new MyWritable("z1", 1));
mapWritable.put(intWritable2, new MyWritable("z2", 2));
mapWritable.put(intWritable3, new ObjectWritableDemo.Person("z3", 2));
byte[] bytes = serialize(mapWritable);
System.out.println(StringUtils.byteToHexString(bytes));
}
Output:
02
01 0019 636f6d2e7a79662e7374756479352e4d795772697461626c65
02 0028 636f6d2e7a79662e7374756479352e4f626a6563745772697461626c6544656d6f24506572736f6e
00000003
85 00000001 01 0002 7a31 00000001
85 00000002 01 0002 7a32 00000002
85 00000003 02 0002 7a33 00000002
# Explanation:
{number of custom class-to-id entries: 2}
{id: 1}{string length: 25}com.zyf.study5.MyWritable
{id: 2}{string length: 40}com.zyf.study5.ObjectWritableDemo$Person
{map.size: 3}
{key class id: -123 (0x85), the predefined id for IntWritable}{key: 1}{value class id: 1}{string length: 2}z1 1
{key class id: -123 (0x85), the predefined id for IntWritable}{key: 2}{value class id: 1}{string length: 2}z2 2
{key class id: -123 (0x85), the predefined id for IntWritable}{key: 3}{value class id: 2}{string length: 2}z3 2
SortedMapWritable implements the SortedMap interface and stores its key-value pairs in a TreeMap.
5.3.3 Implementing a Custom Writable
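A hedged sketch of what a custom comparable Writable might look like, reusing the name/score fields of the earlier MyWritable (the class name and ordering rules here are illustrative, not from the original notes):
package com.zyf.study5;
import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
// Illustrative custom Writable: serializes a name and a score, and sorts by score, then name.
public class ScoreWritable implements WritableComparable<ScoreWritable> {
    private String name;
    private int score;
    public ScoreWritable() {}                      // no-arg constructor required for deserialization
    public ScoreWritable(String name, int score) {
        this.name = name;
        this.score = score;
    }
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);
        out.writeInt(score);
    }
    @Override
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();
        score = in.readInt();
    }
    @Override
    public int compareTo(ScoreWritable o) {        // used during MapReduce's sort phase
        int cmp = Integer.compare(score, o.score);
        return cmp != 0 ? cmp : name.compareTo(o.name);
    }
    @Override
    public int hashCode() {                        // used by HashPartitioner to assign reducers
        return name.hashCode() * 163 + score;
    }
    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof ScoreWritable)) return false;
        ScoreWritable other = (ScoreWritable) obj;
        return score == other.score && name.equals(other.name);
    }
}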
5.3.4 Serialization Frameworks
Although most MapReduce programs use Writable keys and values, this is not mandated by the MapReduce API: any type can be used as long as there is a mechanism for converting it to and from a binary representation. Hadoop has an API for pluggable serialization frameworks to support this. A serialization framework is represented by an implementation of org.apache.hadoop.io.serializer.Serialization, such as WritableSerialization, which defines how a given type is converted to and from binary.
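Additional frameworks are registered by listing their Serialization implementations in the io.serializations property; a minimal sketch (JavaSerialization is shown only as an example and is generally not recommended, since standard Java serialization is neither compact nor fast):
Configuration conf = new Configuration();
// WritableSerialization is registered by default; here Java serialization is added alongside it
conf.setStrings("io.serializations",
        "org.apache.hadoop.io.serializer.WritableSerialization",
        "org.apache.hadoop.io.serializer.JavaSerialization");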
5.4 File-Based Data Structures
5.4.1 SequenceFile
SequenceFile provides a persistent data structure for binary key-value pairs. It also works well as a container for small files, giving more efficient storage and processing.
1. Writing a SequenceFile
SequenceFile.createWriter takes further optional arguments, including the compression codec, a Progressable, and Metadata to store in the SequenceFile header.
The keys and values of a SequenceFile do not have to be Writables; any type supported by a serialization framework can be used.
package com.zyf.study5;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import java.io.IOException;
import java.net.URI;
public class SequenceFileWriterDemo {
final static String[] DATAS = {
"One, two, buckle my shoe",
"Three, four, shut the door",
"Five, sex, pick up sticks",
"Seven, eight, lay them straight",
"Nine, then, a big fat hen"
};
public static void main(String[] args) {
URI uri = URI.create("hdfs://127.0.0.1:9000/user/ossuser/sequenceFile.seq");
FSDataOutputStream outputStream = null;
SequenceFile.Writer writer = null;
try {
Configuration conf = new Configuration();
FileSystem fileSystem = FileSystem.get(uri, conf, "ossuser");
Path path = new Path(uri);
outputStream = fileSystem.create(path);
SequenceFile.Writer.Option streamOption = SequenceFile.Writer.stream(outputStream);
SequenceFile.Writer.Option keyOption = SequenceFile.Writer.keyClass(IntWritable.class);
SequenceFile.Writer.Option valueOption = SequenceFile.Writer.valueClass(Text.class);
writer = SequenceFile.createWriter(conf, streamOption, keyOption, valueOption);
IntWritable key = new IntWritable();
Text value = new Text();
for(int i=0; i<100; i++) {
key.set(100 - i);
value.set(DATAS[i % DATAS.length]);
if ((100-i) % 10 == 0) {
writer.sync();
}
System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
writer.append(key, value);
}
} catch (IOException e) {
e.printStackTrace();
} catch (InterruptedException e) {
e.printStackTrace();
} finally {
IOUtils.closeStreams(outputStream, writer);
}
}
}
This creates a SequenceFile.Writer and inserts a sync point every 10 records. writer.getLength() returns the current position in the file.
2. Reading a SequenceFile
- For Writable keys and values, iterate over the records by repeatedly calling public boolean next(Writable key, Writable val);
- For keys and values handled by a non-Writable serialization framework, use the following methods (a loop sketch follows this list):
public synchronized Object next(Object key) throws IOException
public synchronized Object getCurrentValue(Object val) throws IOException
A null return from next() signals the end of the file; otherwise, pass an object to getCurrentValue() to retrieve the value.
Make sure the io.serializations property includes the serialization framework being used.
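A minimal sketch of that non-Writable reading pattern (error handling and object reuse omitted; the reader is constructed as in the full example below):
Object key = null;
Object value = null;
while ((key = reader.next(key)) != null) {        // null means end of file
    value = reader.getCurrentValue(value);        // fetch the value for the record just read
    System.out.printf("%s\t%s%n", key, value);
}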
package com.zyf.study5;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
import java.io.IOException;
import java.net.URI;
public class SequenceFileReaderDemo {
public static void main(String[] args) {
URI uri = URI.create("hdfs://127.0.0.1:9000/user/ossuser/sequenceFile.txt");
SequenceFile.Reader reader = null;
try {
Configuration conf = new Configuration();
SequenceFile.Reader.Option fileOption = SequenceFile.Reader.file(new Path(uri));
reader = new SequenceFile.Reader(conf, fileOption);
Class keyClass = reader.getKeyClass();
Class valueClass = reader.getValueClass();
Writable key = (Writable) ReflectionUtils.newInstance(keyClass, conf);
Writable value = (Writable) ReflectionUtils.newInstance(valueClass, conf);
long position = reader.getPosition();
while(reader.next(key, value)) {
String syncSeen = reader.syncSeen() ? "*":"";
System.out.printf("[%s%s]%s %s\n", position, syncSeen, key, value);
position = reader.getPosition();
}
} catch (IOException e) {
e.printStackTrace();
} finally {
IOUtils.closeQuietly(reader);
}
}
}
Output:
[128*]100 One, two, buckle my shoe
[193]99 Three, four, shut the door
[240]98 Five, six, pick up sticks
[284]97 Seven, eight, lay them straight
[334]96 Nine, ten, a big fat hen
[378]95 One, two, buckle my shoe
[423]94 Three, four, shut the door
[470]93 Five, six, pick up sticks
[514]92 Seven, eight, lay them straight
[564]91 Nine, ten, a big fat hen
[608*]90 One, two, buckle my shoe
[673]89 Three, four, shut the door
...
You can seek to an exact position in the file with code like the following:
reader.seek(240);
reader.next(key, value);
System.out.printf("[%s%s]%s %s\n", position, syncSeen, key, value);
Output:
[240]98 Five, six, pick up sticks
However, if the given position is not a record boundary, the read fails:
reader.seek(239);
Output:
java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
Sync points can be used to find record boundaries. The following code positions the reader at the first sync point after position 239; if there were no sync point after the given position, the reader would be positioned at the end of the file. A SequenceFile with sync points can be used as MapReduce input: such files are splittable, so different parts can be processed independently by separate map tasks. See SequenceFileInputFormat for details.
reader.sync(239);
long position = reader.getPosition();
String syncSeen = reader.syncSeen() ? "*":"";
reader.next(key, value);
System.out.printf("[%s%s]%s %s\n", position, syncSeen, key, value);
Output:
[608]90 One, two, buckle my shoe
3. Reading a SequenceFile from the command line
hdfs dfs -text displays a SequenceFile in text form. It inspects a file's magic number to detect the file type and convert it to text; gzip, bzip2, Avro, and SequenceFile are supported. If you use custom key or value classes, make sure they are on Hadoop's classpath.
> hdfs dfs -text /user/ossuser/sequenceFile.txt
2019-05-08 14:57:58,744 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-05-08 14:57:59,914 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
100 One, two, buckle my shoe
99 Three, four, shut the door
98 Five, six, pick up sticks
...
> hdfs dfs -text /user/ossuser/test.txt.gz
abc
> hdfs dfs -text /user/ossuser/test.txt.bz2
bzip2 file
4. Sorting and merging SequenceFiles
>hadoop jar hadoop-3.1.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar sort -r 1 \
-inFormat org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat \
-outFormat org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat \
-outKey org.apache.hadoop.io.IntWritable \
-outValue org.apache.hadoop.io.Text \
/user/ossuser/sequenceFile.txt /user/ossuser/sorted
> hdfs dfs -text sorted/part-r-00000
1 Nine, ten, a big fat hen
2 Seven, eight, lay them straight
3 Five, six, pick up sticks
4 Three, four, shut the door
...
-r: number of reduce tasks
-inFormat: input format of the job
-outFormat: output format of the job
-outKey: output key type
-outValue: output value type
5. The SequenceFile format
There are three SequenceFile formats, depending on the compression mode; all three share the same header.
Header
- version - the first 3 bytes are the magic number SEQ, followed by 1 byte for the actual version number (e.g. SEQ4 or SEQ6)
- keyClassName - the key class
- valueClassName - the value class
- compression - boolean flag indicating whether the key-value pairs in the file are compressed
- blockCompression - boolean flag indicating whether block compression is applied to the key-value pairs
- compression codec - the CompressionCodec class used to compress keys and/or values (if compression is enabled)
- metadata - SequenceFile.Metadata for the file
- sync - a sync marker denoting the end of the header
Uncompressed format
- Header
- Record
- Record length
- Key length
- Key
- Value
- A sync marker every few hundred bytes or so.
Record-compressed format
- Header
- Record
- Record length
- Key length
- Key
- Compressed Value
- A sync marker every few hundred bytes or so.
Block-compressed format
- Header
- Record Block
- Uncompressed number of records in the block
- Size of the compressed key-lengths block
- Compressed key-lengths block
- Size of the compressed keys block
- Compressed keys block
- Size of the compressed value-lengths block
- Compressed value-lengths block
- Size of the compressed values block
- Compressed values block
- A sync marker after every block.
The key-lengths and value-lengths blocks hold the actual length of each key and value, encoded in zero-compressed integer (ZeroCompressedInteger) format.
Block compression compresses many records at once, so it can exploit similarity between records and achieves a better compression ratio than per-record compression. Records keep being added to a block until its size reaches at least io.seqfile.compress.blocksize, which defaults to 1 MB.
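A minimal sketch of requesting block compression when creating a writer; the codec choice is an assumption, and streamOption, keyOption, and valueOption are the options built in the earlier SequenceFileWriterDemo:
DefaultCodec codec = new DefaultCodec();
codec.setConf(conf);
SequenceFile.Writer.Option compressOption =
        SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, codec);
writer = SequenceFile.createWriter(conf, streamOption, keyOption, valueOption, compressOption);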
5.4.2 MapFile
A MapFile is a sorted SequenceFile with an index that allows lookups by key. The index is itself a SequenceFile holding a fraction of the keys (every 128th key by default); since the index can be loaded into memory, it provides fast lookups into the main data file, which is another SequenceFile containing all of the key-value pairs in sorted order.
When writing with MapFile.Writer, entries must be added in order, otherwise an IOException is thrown, as in the sketch below.
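A minimal sketch of writing and then looking up a MapFile (the path and sizes are illustrative; the options-based constructor is assumed, with the value class supplied via the SequenceFile.Writer option):
Configuration conf = new Configuration();
Path dir = new Path("numbers.map");   // a MapFile is a directory containing "data" and "index" files
MapFile.Writer writer = new MapFile.Writer(conf, dir,
        MapFile.Writer.keyClass(IntWritable.class),
        SequenceFile.Writer.valueClass(Text.class));
for (int i = 1; i <= 1024; i++) {
    writer.append(new IntWritable(i), new Text("value-" + i));   // keys must arrive in sorted order
}
writer.close();
MapFile.Reader reader = new MapFile.Reader(dir, conf);
Text value = new Text();
reader.get(new IntWritable(496), value);   // binary search in the in-memory index, then a short scan
System.out.println(value);                 // value-496
reader.close();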
MapFile variants
- ArrayFile: the key is an integer giving the element's index, and the value is a Writable.
- SetFile: a specialized MapFile for storing a set of Writable keys.
- BloomMapFile: provides a fast get() implementation, useful for sparsely populated files. It uses a dynamic Bloom filter, held in memory, to test whether the map contains a given key; the test is very fast but has a nonzero false-positive rate. The regular get() is called only when the membership test passes.
5.4.3 Other File Formats and Column-Oriented Formats
Avro is similar to SequenceFile: a binary format designed for large-scale data processing that is compact and splittable, but also portable across programming languages and described by a schema.
- Row-oriented: SequenceFile, Avro, and map files store the values of each row contiguously in the file. Even when only a few columns are needed, the whole row must be loaded into memory. Suited to workloads that read most of the columns.
- Column-oriented: rows are split into slices, and each slice is stored column by column. When only a few columns are read, only those columns need to be loaded into memory. Suited to workloads that read a small subset of columns, but not to streaming writes, since multiple files must be managed and writes are harder to control. Hive's early column-oriented format, RCFile (Record Columnar File), has been superseded by ORCFile (Optimized Record Columnar File) and Parquet.