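Avro is a Hadoop subproject whose development was led by Doug Cutting, the creator of Hadoop (and also of Lucene and Nutch, hats off).
Avro is a data serialization system designed for applications that exchange large volumes of data.
Its main features: a binary serialization format that makes processing large volumes of data convenient and fast; and friendliness to dynamic languages, for which Avro provides mechanisms that make its data easy to handle.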
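Avro provides an RPC mechanism, and it can be used for both data storage and RPC without generating any extra API code. Code generation is optional, which sets Avro apart from protobuf and thrift. In addition, several projects on the Hadoop platform already use (or support) Avro as their data serialization service.
Although Avro provides an RPC mechanism, its core features mean it is typically used in "big data" storage scenarios (MapReduce): data is written with a schema to a local file or to HDFS, and a reader later iterates over the records using that schema. The schema can be changed and adjusted within limits, and Avro handles such changes compatibly. This kind of extensibility is exactly what file-based data storage needs.
Avro is schema-based, just like protobuf and thrift. Data types are declared in a schema file (.avsc) and RPC interfaces in a protocol file (.avpr), and Avro serializes data according to the schema when reading and writing. Because of the schema, the reading side and the writing side can be implemented in different languages. Avro schemas are written in JSON, so they are simple to write and easy to read. Avro does not yet support very many languages; those it does support include Java, C++, and Python.
When Avro writes data to a file, it stores the schema together with the actual data, so a reader can later process the data using that schema. If the reader uses a different schema, Avro provides compatibility mechanisms to resolve the difference.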
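The compatibility mechanism mentioned above can be sketched roughly as follows. This is a minimal illustration, assuming the Avro Java library is on the classpath. The class name `SchemaEvolutionSketch`, the cut-down one-field writer schema, and the extra `nickname` field with its default are all made up for the example and are not part of this article's user.avsc. The record is encoded with the writer schema and decoded with a reader schema that adds a field. With a data file the writer schema would be read from the file header; here both schemas are passed to the reader explicitly.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class SchemaEvolutionSketch {
    // Hypothetical "old" schema used by the writer: a single field.
    static final String WRITER_JSON =
        "{\"type\":\"record\",\"name\":\"User\",\"namespace\":\"com.beng.rpc.avro\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"}]}";

    // Hypothetical "new" schema used by the reader: adds a field WITH a default.
    static final String READER_JSON =
        "{\"type\":\"record\",\"name\":\"User\",\"namespace\":\"com.beng.rpc.avro\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"nickname\",\"type\":\"string\",\"default\":\"n/a\"}]}";

    /** Encodes a record with the writer schema, then decodes it with the reader schema. */
    static GenericRecord roundTrip(String name) {
        // Separate parsers, because one parser refuses to define "User" twice.
        Schema writerSchema = new Schema.Parser().parse(WRITER_JSON);
        Schema readerSchema = new Schema.Parser().parse(READER_JSON);
        try {
            GenericRecord written = new GenericData.Record(writerSchema);
            written.put("name", name);

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(writerSchema).write(written, encoder);
            encoder.flush();

            // The reader is given both schemas and resolves the differences.
            BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
            return new GenericDatumReader<GenericRecord>(writerSchema, readerSchema)
                .read(null, decoder);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        GenericRecord record = roundTrip("Danny");
        System.out.println(record.get("name"));     // value taken from the written data
        System.out.println(record.get("nickname")); // value filled in from the reader's default
    }
}
```

If the added field had no default, the reader would have nothing to fill in and the read would fail, which is why new fields are normally given defaults.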
When Avro is used for RPC, the client and the server first exchange schemas through a handshake before transferring any data, and agree on the schemas to use.
user.avsc
{"namespace": "com.beng.rpc.avro",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "type": ["int", "null"]},
{"name": "favorite_color", "type": ["string", "null"]}
]
}
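The schema above says that the User record has three fields: "name", "favorite_number", and "favorite_color". "type" declares a field's data type. For example, the type of "favorite_color" is ["string", "null"], which means the value may be either a string or null.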
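To make the union types concrete, here is a small sketch, assuming the Avro Java library is on the classpath. The class name `UnionFieldSketch` and the helper `isValid` are hypothetical. It parses the same schema from a string and uses `GenericData.validate` to check which values each field accepts.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class UnionFieldSketch {
    // Same schema as user.avsc, inlined so the example is self-contained.
    static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"namespace\":\"com.beng.rpc.avro\",\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"favorite_number\",\"type\":[\"int\",\"null\"]},"
        + "{\"name\":\"favorite_color\",\"type\":[\"string\",\"null\"]}]}");

    /** Builds a generic record from the three values and checks it against the schema. */
    static boolean isValid(Object name, Object favoriteNumber, Object favoriteColor) {
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("name", name);
        record.put("favorite_number", favoriteNumber);
        record.put("favorite_color", favoriteColor);
        return GenericData.get().validate(SCHEMA, record);
    }

    public static void main(String[] args) {
        System.out.println(isValid("Ann", 7, "red"));      // all fields set
        System.out.println(isValid("Ann", null, null));    // both union fields null
        System.out.println(isValid(null, 7, "red"));       // "name" is a plain string, not a union
        System.out.println(isValid("Ann", "seven", null)); // a String is not a branch of ["int","null"]
    }
}
```

The first two records should validate and the last two should not: only the fields declared as unions with "null" may be left empty.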
Compile it with avro-tools.jar:
java -jar avro-tools-1.7.7.jar compile schema user.avsc .
This writes the generated Java source files to the current directory.
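You can also compile it with Maven in Eclipse. The pom dependencies:
<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>1.7.7</version>
</dependency>
<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-tools</artifactId>
    <version>1.7.7</version>
</dependency>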
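The avro and avro-tools dependencies are the basic packages needed for Avro development. If you want Maven to generate Java code from the .avsc files, you also need to add the avro-maven-plugin shown below; otherwise it is not needed.
<plugin>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-maven-plugin</artifactId>
    <version>1.7.7</version>
    <executions>
        <execution>
            <phase>generate-sources</phase>
            <goals>
                <goal>schema</goal>
            </goals>
            <configuration>
                <sourceDirectory>${project.basedir}/src/main/resources/avro/</sourceDirectory>
                <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
            </configuration>
        </execution>
    </executions>
</plugin>
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-compiler-plugin</artifactId>
    <configuration>
        <encoding>utf-8</encoding>
    </configuration>
</plugin>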
Here is the generated User.java:
/**
* Autogenerated by Avro
*
* DO NOT EDIT DIRECTLY
*/
package com.beng.rpc.avro;
@SuppressWarnings("all")
@org.apache.avro.specific.AvroGenerated
public class User extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {
public static final org.apache.avro.Schema SCHEMA$ = new org.apache.avro.Schema.Parser().parse("{\"type\":\"record\",\"name\":\"User\",\"namespace\":\"com.beng.rpc.avro\",\"fields\":[{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"favorite_number\",\"type\":[\"int\",\"null\"]},{\"name\":\"favorite_color\",\"type\":[\"string\",\"null\"]}]}");
public static org.apache.avro.Schema getClassSchema() { return SCHEMA$; }
@Deprecated public java.lang.CharSequence name;
@Deprecated public java.lang.Integer favorite_number;
@Deprecated public java.lang.CharSequence favorite_color;
/**
* Default constructor. Note that this does not initialize fields
* to their default values from the schema. If that is desired then
* one should use newBuilder().
*/
public User() {}
/**
* All-args constructor.
*/
public User(java.lang.CharSequence name, java.lang.Integer favorite_number, java.lang.CharSequence favorite_color) {
this.name = name;
this.favorite_number = favorite_number;
this.favorite_color = favorite_color;
}
public org.apache.avro.Schema getSchema() { return SCHEMA$; }
// Used by DatumWriter. Applications should not call.
public java.lang.Object get(int field$) {
switch (field$) {
case 0: return name;
case 1: return favorite_number;
case 2: return favorite_color;
default: throw new org.apache.avro.AvroRuntimeException("Bad index");
}
}
// Used by DatumReader. Applications should not call.
@SuppressWarnings(value="unchecked")
public void put(int field$, java.lang.Object value$) {
switch (field$) {
case 0: name = (java.lang.CharSequence)value$; break;
case 1: favorite_number = (java.lang.Integer)value$; break;
case 2: favorite_color = (java.lang.CharSequence)value$; break;
default: throw new org.apache.avro.AvroRuntimeException("Bad index");
}
}
/**
* Gets the value of the 'name' field.
*/
public java.lang.CharSequence getName() {
return name;
}
/**
* Sets the value of the 'name' field.
* @param value the value to set.
*/
public void setName(java.lang.CharSequence value) {
this.name = value;
}
/**
* Gets the value of the 'favorite_number' field.
*/
public java.lang.Integer getFavoriteNumber() {
return favorite_number;
}
/**
* Sets the value of the 'favorite_number' field.
* @param value the value to set.
*/
public void setFavoriteNumber(java.lang.Integer value) {
this.favorite_number = value;
}
/**
* Gets the value of the 'favorite_color' field.
*/
public java.lang.CharSequence getFavoriteColor() {
return favorite_color;
}
/**
* Sets the value of the 'favorite_color' field.
* @param value the value to set.
*/
public void setFavoriteColor(java.lang.CharSequence value) {
this.favorite_color = value;
}
/** Creates a new User RecordBuilder */
public static com.beng.rpc.avro.User.Builder newBuilder() {
return new com.beng.rpc.avro.User.Builder();
}
/** Creates a new User RecordBuilder by copying an existing Builder */
public static com.beng.rpc.avro.User.Builder newBuilder(com.beng.rpc.avro.User.Builder other) {
return new com.beng.rpc.avro.User.Builder(other);
}
/** Creates a new User RecordBuilder by copying an existing User instance */
public static com.beng.rpc.avro.User.Builder newBuilder(com.beng.rpc.avro.User other) {
return new com.beng.rpc.avro.User.Builder(other);
}
/**
* RecordBuilder for User instances.
*/
public static class Builder extends org.apache.avro.specific.SpecificRecordBuilderBase<User>
implements org.apache.avro.data.RecordBuilder<User> {
private java.lang.CharSequence name;
private java.lang.Integer favorite_number;
private java.lang.CharSequence favorite_color;
/** Creates a new Builder */
private Builder() {
super(com.beng.rpc.avro.User.SCHEMA$);
}
/** Creates a Builder by copying an existing Builder */
private Builder(com.beng.rpc.avro.User.Builder other) {
super(other);
if (isValidValue(fields()[0], other.name)) {
this.name = data().deepCopy(fields()[0].schema(), other.name);
fieldSetFlags()[0] = true;
}
if (isValidValue(fields()[1], other.favorite_number)) {
this.favorite_number = data().deepCopy(fields()[1].schema(), other.favorite_number);
fieldSetFlags()[1] = true;
}
if (isValidValue(fields()[2], other.favorite_color)) {
this.favorite_color = data().deepCopy(fields()[2].schema(), other.favorite_color);
fieldSetFlags()[2] = true;
}
}
/** Creates a Builder by copying an existing User instance */
private Builder(com.beng.rpc.avro.User other) {
super(com.beng.rpc.avro.User.SCHEMA$);
if (isValidValue(fields()[0], other.name)) {
this.name = data().deepCopy(fields()[0].schema(), other.name);
fieldSetFlags()[0] = true;
}
if (isValidValue(fields()[1], other.favorite_number)) {
this.favorite_number = data().deepCopy(fields()[1].schema(), other.favorite_number);
fieldSetFlags()[1] = true;
}
if (isValidValue(fields()[2], other.favorite_color)) {
this.favorite_color = data().deepCopy(fields()[2].schema(), other.favorite_color);
fieldSetFlags()[2] = true;
}
}
/** Gets the value of the 'name' field */
public java.lang.CharSequence getName() {
return name;
}
/** Sets the value of the 'name' field */
public com.beng.rpc.avro.User.Builder setName(java.lang.CharSequence value) {
validate(fields()[0], value);
this.name = value;
fieldSetFlags()[0] = true;
return this;
}
/** Checks whether the 'name' field has been set */
public boolean hasName() {
return fieldSetFlags()[0];
}
/** Clears the value of the 'name' field */
public com.beng.rpc.avro.User.Builder clearName() {
name = null;
fieldSetFlags()[0] = false;
return this;
}
/** Gets the value of the 'favorite_number' field */
public java.lang.Integer getFavoriteNumber() {
return favorite_number;
}
/** Sets the value of the 'favorite_number' field */
public com.beng.rpc.avro.User.Builder setFavoriteNumber(java.lang.Integer value) {
validate(fields()[1], value);
this.favorite_number = value;
fieldSetFlags()[1] = true;
return this;
}
/** Checks whether the 'favorite_number' field has been set */
public boolean hasFavoriteNumber() {
return fieldSetFlags()[1];
}
/** Clears the value of the 'favorite_number' field */
public com.beng.rpc.avro.User.Builder clearFavoriteNumber() {
favorite_number = null;
fieldSetFlags()[1] = false;
return this;
}
/** Gets the value of the 'favorite_color' field */
public java.lang.CharSequence getFavoriteColor() {
return favorite_color;
}
/** Sets the value of the 'favorite_color' field */
public com.beng.rpc.avro.User.Builder setFavoriteColor(java.lang.CharSequence value) {
validate(fields()[2], value);
this.favorite_color = value;
fieldSetFlags()[2] = true;
return this;
}
/** Checks whether the 'favorite_color' field has been set */
public boolean hasFavoriteColor() {
return fieldSetFlags()[2];
}
/** Clears the value of the 'favorite_color' field */
public com.beng.rpc.avro.User.Builder clearFavoriteColor() {
favorite_color = null;
fieldSetFlags()[2] = false;
return this;
}
@Override
public User build() {
try {
User record = new User();
record.name = fieldSetFlags()[0] ? this.name : (java.lang.CharSequence) defaultValue(fields()[0]);
record.favorite_number = fieldSetFlags()[1] ? this.favorite_number : (java.lang.Integer) defaultValue(fields()[1]);
record.favorite_color = fieldSetFlags()[2] ? this.favorite_color : (java.lang.CharSequence) defaultValue(fields()[2]);
return record;
} catch (Exception e) {
throw new org.apache.avro.AvroRuntimeException(e);
}
}
}
}
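import java.io.File;
import java.io.IOException;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;

import com.beng.rpc.avro.User;

public class Main {
    public static void main(String[] args) throws IOException {
        // Three ways to create a User object
        // 1. Default constructor plus setters
        User user = new User();
        user.setName("Janny");
        user.setFavoriteColor("red");
        user.setFavoriteNumber(110);
        // 2. All-args constructor
        User user1 = new User("Danny", 119, "black");
        System.out.println(user1.toString());
        // 3. Builder pattern
        User user2 = User.newBuilder().setName("LiMing").setFavoriteColor("yellow").setFavoriteNumber(120).build();

        // Serialize the objects to user.avro; the schema is written into the file header
        String path = "user.avro";
        DatumWriter<User> userDatumWriter = new SpecificDatumWriter<>(User.class);
        DataFileWriter<User> dataFileWriter = new DataFileWriter<>(userDatumWriter);
        dataFileWriter.create(user1.getSchema(), new File(path));
        dataFileWriter.append(user1);
        dataFileWriter.append(user2);
        dataFileWriter.append(user);
        dataFileWriter.close();
        System.out.println();

        // Read the serialized file back, one record at a time
        DatumReader<User> reader = new SpecificDatumReader<>(User.class);
        DataFileReader<User> dataFileReader = new DataFileReader<>(new File(path), reader);
        User user4 = null;
        while (dataFileReader.hasNext()) {
            user4 = dataFileReader.next();
            System.out.println(user4);
        }
        dataFileReader.close();
    }
}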
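Here is what the serialized file looks like (it is a binary file, so only the embedded schema and the string values are readable):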
Objavro.schema�{"type":"record","name":"User","namespace":"com.beng.rpc.avro","fields":[{"name":"name","type":"string"},{"name":"favorite_number","type":["int","null"]},{"name":"favorite_color","type":["string","null"]}]}�f���rs뿾�z`
Danny�
blackLiMing�yellow
Janny�red�f��
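The unreadable bytes between the names and colors in the dump are not noise. According to the Avro specification's binary encoding, an int or long is written as a zig-zag variable-length integer, a string as its length followed by its UTF-8 bytes, a union as the index of the chosen branch followed by the value, and a record as its fields one after another. The sketch below reproduces the bytes of one record by hand using only the JDK. The class name `AvroBinarySketch` and its helpers are made up for illustration, and it covers only a single record, not the file header, block counts, or sync markers.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class AvroBinarySketch {
    /** Zig-zag plus variable-length encoding, as used for Avro int and long values. */
    static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);           // zig-zag: small magnitudes become small numbers
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80)); // 7 payload bits, high bit means "more follows"
            z >>>= 7;
        }
        out.write((int) z);
    }

    /** A string is its byte length (encoded as a long) followed by the UTF-8 bytes. */
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    /** Encodes one User record in which both union fields hold non-null values. */
    static byte[] encodeUser(String name, int favoriteNumber, String favoriteColor) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, name);
        writeLong(out, 0);              // union branch 0 of ["int","null"] is "int"
        writeLong(out, favoriteNumber);
        writeLong(out, 0);              // union branch 0 of ["string","null"] is "string"
        writeString(out, favoriteColor);
        return out.toByteArray();
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // prints: 0a 44 61 6e 6e 79 00 ee 01 00 0a 62 6c 61 63 6b
        System.out.println(toHex(encodeUser("Danny", 119, "black")));
    }
}
```

The length prefix of a five-character string is the byte 0x0a, which is also the newline character. That is presumably why "black" starts on a new line in the dump above.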
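The examples so far rely on Java classes generated from the schema.
It is also possible to skip code generation altogether, but then the developer has to supply the schema at runtime.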
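import java.io.File;
import java.io.InputStream;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;

// user.avsc is placed under the "resources/avro" directory
InputStream inputStream = ClassLoader.getSystemResourceAsStream("avro/user.avsc");
Schema schema = new Schema.Parser().parse(inputStream);
// Field names must match the ones declared in user.avsc
GenericRecord user = new GenericData.Record(schema);
user.put("name", "张三");
user.put("favorite_number", 30);
user.put("favorite_color", "blue");
// Write the record to a data file
File diskFile = new File("/data/users.avro");
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(datumWriter);
dataFileWriter.create(schema, diskFile);
dataFileWriter.append(user);
dataFileWriter.close();
// Read the records back, reusing one GenericRecord instance across iterations
DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(diskFile, datumReader);
GenericRecord _current = null;
while (dataFileReader.hasNext()) {
    _current = dataFileReader.next(_current);
    System.out.println(_current);
}
dataFileReader.close();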
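In this case no Java API is generated, so the developer has to know the structure of the schema in advance. Creating a User is much like building a JSON object: values are assigned through put calls. Deserialization works the same way and also needs the schema.
The GenericRecord interface provides a method for getting a value by field name: Object get(String fieldName). Note, however, that the implementation is not backed by a map but by an array whose slots correspond by index to the fields declared in the schema. put looks up the index for the field name and assigns to that slot; get looks up the index and reads from it. Schema compatibility is therefore handled in essentially the same way as with generated code.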
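The array-backed layout described above can be sketched with plain JDK classes. This is only an illustration of the idea, not Avro's actual code: the class `ArrayBackedRecordSketch` is hypothetical, the list of field names stands in for the schema's ordered fields, and the linear name lookup is a simplification.

```java
import java.util.Arrays;
import java.util.List;

public class ArrayBackedRecordSketch {
    private final List<String> fieldNames; // stands in for the schema's ordered field list
    private final Object[] values;         // one slot per field, at the same index

    ArrayBackedRecordSketch(String... fieldNames) {
        this.fieldNames = Arrays.asList(fieldNames);
        this.values = new Object[fieldNames.length];
    }

    /** Resolves a field name to its position in the schema. */
    private int indexOf(String fieldName) {
        int index = fieldNames.indexOf(fieldName);
        if (index < 0) {
            throw new IllegalArgumentException("Not a valid schema field: " + fieldName);
        }
        return index;
    }

    /** put by name: find the index, then assign to that array slot. */
    void put(String fieldName, Object value) {
        values[indexOf(fieldName)] = value;
    }

    /** get by name: find the index, then read that array slot. */
    Object get(String fieldName) {
        return values[indexOf(fieldName)];
    }

    /** get by position: what a reader or writer walking the schema in order would use. */
    Object get(int index) {
        return values[index];
    }

    public static void main(String[] args) {
        ArrayBackedRecordSketch user =
            new ArrayBackedRecordSketch("name", "favorite_number", "favorite_color");
        user.put("favorite_number", 30);
        System.out.println(user.get("favorite_number")); // looked up by name
        System.out.println(user.get(1));                 // same slot, looked up by index
    }
}
```

Because values live at fixed positions, serialization can walk the array in schema order without any name lookups, which is the same access pattern the generated get(int) and put(int, Object) methods use.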
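Use case: building a big data platform from three members of the Apache family: Apache Storm + Avro + Kafka.
Kryo is another serialization tool, worth looking into when there's time.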