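Avro is a data serialization system created by Doug Cutting (the father of Hadoop) to address a shortcoming of Hadoop's Writable types: the lack of language portability. To support multiple languages, Avro schemas are defined independently of any programming language. For more of Avro's features, see the official documentation [1].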
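Avro files are read and written according to a schema. Typically the schema is written in JSON, while the data itself is encoded in a compact binary format and can additionally be compressed to reduce the amount of data transferred.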
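To make "binary format" a little more concrete, here is a minimal sketch of how Avro's binary encoding represents an int or long: the value is zig-zag mapped and then written as a base-128 varint. The class and method names (ZigZagLongDemo, encodeLong, decodeLong) are made up for illustration and are not part of the Avro API; only the JDK is used.

```java
import java.io.ByteArrayOutputStream;

/** Illustrative only: hand-rolled version of Avro's zig-zag varint encoding for longs. */
final class ZigZagLongDemo {

    /** Zig-zag maps the value, then emits it 7 bits at a time, lowest bits first. */
    static byte[] encodeLong(long value) {
        // Zig-zag: 0, -1, 1, -2, ... become 0, 1, 2, 3, ... so small magnitudes stay short
        long n = (value << 1) ^ (value >> 63);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((n & ~0x7FL) != 0) {
            // High bit set means "more bytes follow"
            out.write((int) ((n & 0x7F) | 0x80));
            n >>>= 7;
        }
        out.write((int) n);
        return out.toByteArray();
    }

    /** Reverses encodeLong; expects exactly one encoded value in the array. */
    static long decodeLong(byte[] bytes) {
        long n = 0;
        int shift = 0;
        for (byte b : bytes) {
            n |= (long) (b & 0x7F) << shift;
            shift += 7;
        }
        // Undo the zig-zag mapping
        return (n >>> 1) ^ -(n & 1);
    }

    public static void main(String[] args) {
        for (long v : new long[] {0, -1, 1, 64, 1_400_000_000_000L}) {
            byte[] encoded = encodeLong(v);
            System.out.println(v + " -> " + encoded.length
                    + " byte(s), decoded back to " + decodeLong(encoded));
        }
    }
}
```

Small values such as 0, -1 and 1 take a single byte, which is one reason Avro data tends to be compact even before compression.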
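Field types in a schema fall into two groups: primitive types (null, boolean, int, long, float, double, bytes, string) and complex types (record, enum, array, map, union, fixed).
Of the complex types, record is the one used most often. Taking the twitter.avro file from [2] as an example, the file header looks like this when the file is opened: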
Objavro.codecnullavro.schemaò{"type":"record","name":"twitter_schema","namespace":"com.miguno.avro","fields":[{"name":"username","type":"string","doc":"Name of the user account on Twitter.com"},{"name":"tweet","type":"string","doc":"The content of the user's Twitter message"},{"name":"timestamp","type":"long","doc":"Unix epoch time in milliseconds"}],"doc:":"A basic schema for storing Twitter messages"}
After formatting, the schema reads:
{
  "type": "record",
  "name": "twitter_schema",
  "namespace": "com.miguno.avro",
  "fields": [
    {
      "name": "username", "type": "string",
      "doc": "Name of the user account on Twitter.com"
    },
    {
      "name": "tweet", "type": "string",
      "doc": "The content of the user's Twitter message"
    },
    {
      "name": "timestamp", "type": "long",
      "doc": "Unix epoch time in milliseconds"
    }
  ],
  "doc:": "A basic schema for storing Twitter messages"
}
Here, name is the name of the record or field, type specifies its type, and doc gives a more detailed description of it.
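As a rough illustration of how a record of this schema ends up on disk, the sketch below encodes one record by hand. In Avro's binary encoding a record is just its fields written in schema order, with no field names: a string is a zig-zag varint length followed by UTF-8 bytes, and a long is a zig-zag varint. The class name, helper names and sample values are hypothetical; real code would use Avro's own encoders.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Illustrative only: encodes (username, tweet, timestamp) in the field order of twitter_schema. */
final class TwitterRecordEncodingDemo {

    /** Avro long: zig-zag mapping followed by a base-128 varint. */
    static void writeLong(ByteArrayOutputStream out, long value) {
        long n = (value << 1) ^ (value >> 63);
        while ((n & ~0x7FL) != 0) {
            out.write((int) ((n & 0x7F) | 0x80));
            n >>>= 7;
        }
        out.write((int) n);
    }

    /** Avro string: byte length as a long, then the UTF-8 bytes. */
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    /** A record is the concatenation of its fields in schema order. */
    static byte[] encode(String username, String tweet, long timestamp) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, username);
        writeString(out, tweet);
        writeLong(out, timestamp);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up sample values, kept tiny so the bytes are easy to read
        byte[] bytes = encode("ab", "hi", 1L);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02x ", b));
        }
        System.out.println(hex.toString().trim()); // prints "04 61 62 04 68 69 02"
    }
}
```

Because no field names or type tags are written, the bytes cannot be interpreted without the schema, which is why the schema is stored in the file header.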
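An Avro data file consists of a Header followed by one or more data blocks. The Header starts with the magic bytes Obj plus 0x01 (the "Obj" at the very start of the dump above), followed by the MetaDatas and a 16-byte sync marker. The MetaDatas contain the codec and the schema. The codec is the compression applied to the data in each data block: either null (no compression) or deflate (later Avro releases also offer optional codecs such as snappy). deflate is the algorithm used by gzip; in my own experience the compression ratio is above 6x, although I have not measured it carefully.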
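The compression ratio depends heavily on the data, so it is worth measuring rather than guessing. The sketch below uses the JDK's java.util.zip classes to deflate some made-up, highly repetitive tweet-like text and report the ratio. It assumes a raw RFC 1951 stream (nowrap = true), which is how the Avro specification describes the deflate codec; the class name and sample data are hypothetical, and this is not Avro's own codec implementation.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Illustrative only: measures a deflate compression ratio with plain JDK classes. */
final class DeflateRatioDemo {

    static byte[] deflate(byte[] input) {
        // nowrap = true produces a raw deflate stream without zlib header or checksum
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater(true);
        // The Inflater Javadoc asks for one extra dummy input byte when nowrap is used
        inflater.setInput(Arrays.copyOf(compressed, compressed.length + 1));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unexpected input: stop instead of spinning
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt deflate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up sample: 1000 near-identical JSON-like tweets
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("{\"username\":\"user").append(i % 10)
              .append("\",\"tweet\":\"hello avro\",\"timestamp\":")
              .append(1366150681L + i).append("}\n");
        }
        byte[] input = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] compressed = deflate(input);
        System.out.printf("raw=%d bytes, deflated=%d bytes, ratio=%.1fx%n",
                input.length, compressed.length, (double) input.length / compressed.length);
        System.out.println("round trip ok: " + Arrays.equals(input, inflate(compressed)));
    }
}
```

Repetitive input like this compresses far better than typical real data, so the printed ratio should be read as an upper bound rather than as a figure to expect in production.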
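In fact, every data block is followed by a sync marker; see [4] for details. The sync marker lets a file be split and resynchronized during MapReduce processing; Avro itself was designed with MapReduce in mind.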
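The header layout can be sketched with a small hand-written parser. This is a simplified illustration based on the Avro object container file layout (magic bytes, a metadata map of string keys to byte values, then a 16-byte sync marker); the class AvroHeaderSketch and its methods are hypothetical and are not Avro's implementation. The sample header is built by hand, with zero bytes standing in for the random sync marker a real writer would generate.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: parses the header of an Avro data file from a byte array. */
final class AvroHeaderSketch {
    static final byte[] MAGIC = {(byte) 'O', (byte) 'b', (byte) 'j', 1};
    static final int SYNC_SIZE = 16;

    final Map<String, byte[]> meta = new LinkedHashMap<>();
    final byte[] sync = new byte[SYNC_SIZE];

    /** Reads one zig-zag varint long. */
    static long readLong(ByteBuffer in) {
        long n = 0;
        int shift = 0;
        int b;
        do {
            b = in.get() & 0xFF;
            n |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (n >>> 1) ^ -(n & 1);
    }

    /** Reads a length-prefixed byte sequence (used for both keys and values). */
    static byte[] readBytes(ByteBuffer in) {
        byte[] buf = new byte[(int) readLong(in)];
        in.get(buf);
        return buf;
    }

    static AvroHeaderSketch parse(byte[] fileStart) {
        ByteBuffer in = ByteBuffer.wrap(fileStart);
        byte[] magic = new byte[MAGIC.length];
        in.get(magic);
        if (!Arrays.equals(magic, MAGIC)) {
            throw new IllegalArgumentException("not an Avro data file");
        }
        AvroHeaderSketch header = new AvroHeaderSketch();
        long count;
        // The map is written as blocks of entries and ends with a block count of 0
        while ((count = readLong(in)) != 0) {
            if (count < 0) {
                count = -count;
                readLong(in); // a negative count is followed by the block's byte size
            }
            for (long i = 0; i < count; i++) {
                String key = new String(readBytes(in), StandardCharsets.UTF_8);
                header.meta.put(key, readBytes(in));
            }
        }
        in.get(header.sync);
        return header;
    }

    /** Builds a tiny header by hand: magic, {avro.codec: null}, end of map, sync. */
    static byte[] sampleHeader() {
        byte[] key = "avro.codec".getBytes(StandardCharsets.UTF_8);
        byte[] value = "null".getBytes(StandardCharsets.UTF_8);
        ByteBuffer out = ByteBuffer.allocate(
                MAGIC.length + 1 + 1 + key.length + 1 + value.length + 1 + SYNC_SIZE);
        out.put(MAGIC);
        out.put((byte) 2);                              // block count 1, zig-zag encoded
        out.put((byte) (key.length * 2)).put(key);      // short lengths fit in one byte
        out.put((byte) (value.length * 2)).put(value);
        out.put((byte) 0);                              // end of the metadata map
        out.put(new byte[SYNC_SIZE]);                   // placeholder sync marker
        return out.array();
    }

    public static void main(String[] args) {
        AvroHeaderSketch header = parse(sampleHeader());
        System.out.println("meta keys: " + header.meta.keySet());
        System.out.println("codec: "
                + new String(header.meta.get("avro.codec"), StandardCharsets.UTF_8));
        System.out.println("sync length: " + header.sync.length);
    }
}
```

A reader that has recorded the 16 sync bytes from the header can scan forward from an arbitrary offset until it sees them again, which is what makes splitting a file at block boundaries possible.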
The Header and DataBlock declarations can be found in the Avro source file org.apache.avro.file.DataFileStream.java.
// org.apache.avro.file.DataFileStream.java
public static final class Header {
  Schema schema;
  Map<String, byte[]> meta = new HashMap<String, byte[]>();
  private transient List<String> metaKeyList = new ArrayList<String>();
  byte[] sync = new byte[DataFileConstants.SYNC_SIZE]; // byte[16]
  private Header() {}
}

static class DataBlock {
  private byte[] data;
  private long numEntries;
  private int blockSize;
  private int offset = 0;
  private boolean flushOnWrite = true;

  private DataBlock(long numEntries, int blockSize) {
    this.data = new byte[blockSize];
    this.numEntries = numEntries;
    this.blockSize = blockSize;
  }
  // ... remaining members omitted
}
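Below is some test code for the Header and DataBlock. There are two ways to obtain the schema: call getSchema() on the reader directly, or fetch the schema as bytes from the Map<String, byte[]> meta and convert it to a String. The key set of that map is ["avro.codec", "avro.schema"].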
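import java.util.List;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.FsInput;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
    new FsInput(new Path("twitter.avro"), new Configuration()),
    new GenericDatumReader<GenericRecord>());
// print the schema (pretty-printed JSON)
System.out.println(reader.getSchema().toString(true));
// print the metadata keys and values
List<String> metaKeyList = reader.getMetaKeys();
System.out.println(metaKeyList.toString());
System.out.println(reader.getMetaString("avro.codec"));
System.out.println(reader.getMetaString("avro.schema"));
// print the block count (number of entries in the current block)
System.out.println(reader.getBlockCount());
// print the first record in the data block
System.out.println(reader.next());
reader.close();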
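The official site describes two ways to serialize: specific and generic.
The specific approach works from the generated User class and takes the schema from it to read and write Avro data.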
// Serialize user1, user2 and user3 to disk
DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter);
dataFileWriter.create(user1.getSchema(), new File("users.avro"));
dataFileWriter.append(user1);
dataFileWriter.append(user2);
dataFileWriter.append(user3);
dataFileWriter.close();

// Deserialize Users from disk
DatumReader<User> userDatumReader = new SpecificDatumReader<User>(User.class);
DataFileReader<User> dataFileReader =
    new DataFileReader<User>(new File("users.avro"), userDatumReader);
User user = null;
while (dataFileReader.hasNext()) {
  // Reuse user object by passing it to next(). This saves us from
  // allocating and garbage collecting many objects for files with
  // many items.
  user = dataFileReader.next(user);
  System.out.println(user);
}
dataFileReader.close();
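The generic approach obtains a schema up front and then reads and writes records against it, with no generated classes. Because an Avro file stores the schema in its header, the generic approach is the more common way to parse files.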
// schema is assumed to have been parsed beforehand, for example with
// new Schema.Parser().parse(new File("user.avsc"))
GenericRecord user1 = new GenericData.Record(schema);
user1.put("name", "Alyssa");
user1.put("favorite_number", 256);
// Leave favorite color null

GenericRecord user2 = new GenericData.Record(schema);
user2.put("name", "Ben");
user2.put("favorite_number", 7);
user2.put("favorite_color", "red");

// Serialize user1 and user2 to disk
File file = new File("users.avro");
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
dataFileWriter.create(schema, file);
dataFileWriter.append(user1);
dataFileWriter.append(user2);
dataFileWriter.close();

// Deserialize users from disk
DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(schema);
DataFileReader<GenericRecord> dataFileReader =
    new DataFileReader<GenericRecord>(file, datumReader);
GenericRecord user = null;
while (dataFileReader.hasNext()) {
  // Reuse user object by passing it to next(). This saves us from
  // allocating and garbage collecting many objects for files with
  // many items.
  user = dataFileReader.next(user);
  System.out.println(user);
}
dataFileReader.close();
The avro-tools jar provides a rich set of operations on Avro files, including cutting an Avro file down to produce test data.
Available tools:
  compile        Generates Java code for the given schema.
  concat         Concatenates avro files without re-compressing.
  fragtojson     Renders a binary-encoded Avro datum as JSON.
  fromjson       Reads JSON records and writes an Avro data file.
  fromtext       Imports a text file into an avro data file.
  getmeta        Prints out the metadata of an Avro data file.
  getschema      Prints out schema of an Avro data file.
  idl            Generates a JSON schema from an Avro IDL file
  induce         Induce schema/protocol from Java class/interface via reflection.
  jsontofrag     Renders a JSON-encoded Avro datum as binary.
  recodec        Alters the codec of a data file.
  rpcprotocol    Output the protocol of a RPC service
  rpcreceive     Opens an RPC Server and listens for one message.
  rpcsend        Sends a single RPC message.
  tether         Run a tethered mapreduce job.
  tojson         Dumps an Avro data file as JSON, one record per line.
  totext         Converts an Avro data file to a text file.
  trevni_meta    Dumps a Trevni file's metadata as JSON.
  trevni_random  Create a Trevni file filled with random instances of a schema.
  trevni_tojson  Dumps a Trevni file as JSON.