The Avro serialization framework

Preface

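Avro began as a Hadoop subproject (it is now a top-level Apache project). Its development was led by Doug Cutting, the creator of Hadoop (and also of Lucene, Nutch, and other projects; respect).
Avro is a data serialization system designed for applications that exchange large volumes of data.
Its main features: it supports binary serialization, so large amounts of data can be processed conveniently and quickly; and it is friendly to dynamic languages, providing mechanisms that let them work with Avro data easily.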

Other serialization systems:

  • Google's Protocol Buffers
  • Facebook's Thrift
  • Hessian

Avro references:

  • http://avro.apache.org/docs/current/spec.html the official specification
  • https://blog.cloudera.com/blog/2009/11/avro-a-new-format-for-data-interchange/ an article by an expert
  • http://www.trieuvan.com/ a mirror for avro-tools.jar
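  1. Avro provides an RPC mechanism, and you can use Avro for data storage and RPC without generating any extra API code. Code generation is optional, which sets Avro apart from protobuf and Thrift. In addition, several projects on the Hadoop platform use (or support) Avro as their data serialization service.

  2. Although Avro provides RPC, its core features mean it is typically used in "big data" storage scenarios (MapReduce): data is written with a schema to a local file or to HDFS, and a reader then iterates over the records using the schema. The schema can be changed and adjusted within limits, and Avro handles the differences gracefully. This kind of extensibility is exactly what file-based data storage needs.

  3. Avro is schema-based, just like protobuf and Thrift. Data types are declared in a schema file (.avsc) and protocols (RPC interfaces) in a protocol file (.avpr), and Avro serializes data according to the schema when reading and writing. Because of the schema, readers and writers can be implemented in different languages. Avro schemas are written in JSON, so they are simple to write and easy to read. Avro does not yet support very many languages; they include Java, C++, and Python.

  4. When Avro writes data to a file, it stores the schema together with the data, so a reader can later process the data using that schema. If the reader uses a different schema, Avro provides compatibility mechanisms to resolve the difference.

    When Avro is used for RPC, the client and the server first exchange schemas through a handshake, before any data is transferred, and agree on the schemas to use.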
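
To make the compatibility mechanism in item 4 a little more concrete, here is a simplified sketch of the field-matching rules for records as I understand them from the Avro specification: fields are matched by name, fields that only the writer knows are ignored, and fields that only the reader knows take the reader's default value (or cause an error if there is none). It uses plain Java maps rather than the Avro API. The class SchemaResolutionSketch, the NO_DEFAULT marker, and the "email" field are hypothetical names made up for this illustration, and real Avro additionally handles type promotion, aliases, and nested types, which are left out here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaResolutionSketch {

    // Marker for reader fields that declare no default value (hypothetical helper).
    public static final Object NO_DEFAULT = new Object();

    /**
     * Projects a record written with the writer's schema onto the reader's schema.
     * readerFields maps each reader field name to its default value, or to NO_DEFAULT.
     */
    public static Map<String, Object> resolve(Map<String, Object> written,
                                              Map<String, Object> readerFields) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : readerFields.entrySet()) {
            String name = field.getKey();
            if (written.containsKey(name)) {
                // Field known to both sides: take the written value.
                result.put(name, written.get(name));
            } else if (field.getValue() != NO_DEFAULT) {
                // Field only the reader knows: fall back to the reader's default.
                result.put(name, field.getValue());
            } else {
                throw new IllegalStateException("Reader field has no default and was not written: " + name);
            }
        }
        // Fields that only the writer knows are never copied, so they are dropped.
        return result;
    }

    public static void main(String[] args) {
        // A record as written with the three-field User schema.
        Map<String, Object> written = new LinkedHashMap<>();
        written.put("name", "Danny");
        written.put("favorite_number", 119);
        written.put("favorite_color", "black");

        // A reader schema that drops favorite_number and adds a hypothetical email field.
        Map<String, Object> readerFields = new LinkedHashMap<>();
        readerFields.put("name", NO_DEFAULT);
        readerFields.put("favorite_color", NO_DEFAULT);
        readerFields.put("email", "unknown");

        // prints {name=Danny, favorite_color=black, email=unknown}
        System.out.println(resolve(written, readerFields));
    }
}
```

The practical takeaway is that adding a field is only safe for old data if the new field has a default, which is why "within limits" matters in item 2.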

Java example

First, create the user.avsc file:

user.avsc

{"namespace": "com.beng.rpc.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number",  "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}

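The schema above says that the User record has three fields: "name", "favorite_number", and "favorite_color". "type" declares a field's data type. For example, the type of "favorite_color" is ["string", "null"], which means the value may be either a string or null.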

Compile it with avro-tools.jar:

java -jar avro-tools-1.7.7.jar compile schema user.avsc  .

The generated files are written to the current directory.
You can also compile it with Maven in Eclipse. The pom dependencies:

  
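<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>1.7.7</version>
</dependency>

<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-tools</artifactId>
    <version>1.7.7</version>
</dependency>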
  

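The avro and avro-tools dependencies are the basic packages needed for Avro development. If you want Maven to generate Java code from your .avsc files, you also need to add the avro-maven-plugin configuration shown here; otherwise it is not required.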

  
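<plugin>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-maven-plugin</artifactId>
    <version>1.7.7</version>
    <executions>
        <execution>
            <phase>generate-sources</phase>
            <goals>
                <goal>schema</goal>
            </goals>
            <configuration>
                <sourceDirectory>${project.basedir}/src/main/resources/avro/</sourceDirectory>
                <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
            </configuration>
        </execution>
    </executions>
</plugin>
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-compiler-plugin</artifactId>
    <configuration>
        <encoding>utf-8</encoding>
    </configuration>
</plugin>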
  

Here is the generated User.java:

/**
 * Autogenerated by Avro
 * 
 * DO NOT EDIT DIRECTLY
 */
package com.beng.rpc.avro;  
@SuppressWarnings("all")
@org.apache.avro.specific.AvroGenerated
public class User extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {
  public static final org.apache.avro.Schema SCHEMA$ = new org.apache.avro.Schema.Parser().parse("{\"type\":\"record\",\"name\":\"User\",\"namespace\":\"com.beng.rpc.avro\",\"fields\":[{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"favorite_number\",\"type\":[\"int\",\"null\"]},{\"name\":\"favorite_color\",\"type\":[\"string\",\"null\"]}]}");
  public static org.apache.avro.Schema getClassSchema() { return SCHEMA$; }
  @Deprecated public java.lang.CharSequence name;
  @Deprecated public java.lang.Integer favorite_number;
  @Deprecated public java.lang.CharSequence favorite_color;

  /**
   * Default constructor.  Note that this does not initialize fields
   * to their default values from the schema.  If that is desired then
   * one should use newBuilder(). 
   */
  public User() {}

  /**
   * All-args constructor.
   */
  public User(java.lang.CharSequence name, java.lang.Integer favorite_number, java.lang.CharSequence favorite_color) {
    this.name = name;
    this.favorite_number = favorite_number;
    this.favorite_color = favorite_color;
  }

  public org.apache.avro.Schema getSchema() { return SCHEMA$; }
  // Used by DatumWriter.  Applications should not call. 
  public java.lang.Object get(int field$) {
    switch (field$) {
    case 0: return name;
    case 1: return favorite_number;
    case 2: return favorite_color;
    default: throw new org.apache.avro.AvroRuntimeException("Bad index");
    }
  }
  // Used by DatumReader.  Applications should not call. 
  @SuppressWarnings(value="unchecked")
  public void put(int field$, java.lang.Object value$) {
    switch (field$) {
    case 0: name = (java.lang.CharSequence)value$; break;
    case 1: favorite_number = (java.lang.Integer)value$; break;
    case 2: favorite_color = (java.lang.CharSequence)value$; break;
    default: throw new org.apache.avro.AvroRuntimeException("Bad index");
    }
  }

  /**
   * Gets the value of the 'name' field.
   */
  public java.lang.CharSequence getName() {
    return name;
  }

  /**
   * Sets the value of the 'name' field.
   * @param value the value to set.
   */
  public void setName(java.lang.CharSequence value) {
    this.name = value;
  }

  /**
   * Gets the value of the 'favorite_number' field.
   */
  public java.lang.Integer getFavoriteNumber() {
    return favorite_number;
  }

  /**
   * Sets the value of the 'favorite_number' field.
   * @param value the value to set.
   */
  public void setFavoriteNumber(java.lang.Integer value) {
    this.favorite_number = value;
  }

  /**
   * Gets the value of the 'favorite_color' field.
   */
  public java.lang.CharSequence getFavoriteColor() {
    return favorite_color;
  }

  /**
   * Sets the value of the 'favorite_color' field.
   * @param value the value to set.
   */
  public void setFavoriteColor(java.lang.CharSequence value) {
    this.favorite_color = value;
  }

  /** Creates a new User RecordBuilder */
  public static com.beng.rpc.avro.User.Builder newBuilder() {
    return new com.beng.rpc.avro.User.Builder();
  }
  
  /** Creates a new User RecordBuilder by copying an existing Builder */
  public static com.beng.rpc.avro.User.Builder newBuilder(com.beng.rpc.avro.User.Builder other) {
    return new com.beng.rpc.avro.User.Builder(other);
  }
  
  /** Creates a new User RecordBuilder by copying an existing User instance */
  public static com.beng.rpc.avro.User.Builder newBuilder(com.beng.rpc.avro.User other) {
    return new com.beng.rpc.avro.User.Builder(other);
  }
  
  /**
   * RecordBuilder for User instances.
   */
  public static class Builder extends org.apache.avro.specific.SpecificRecordBuilderBase<User>
    implements org.apache.avro.data.RecordBuilder<User> {

    private java.lang.CharSequence name;
    private java.lang.Integer favorite_number;
    private java.lang.CharSequence favorite_color;

    /** Creates a new Builder */
    private Builder() {
      super(com.beng.rpc.avro.User.SCHEMA$);
    }
    
    /** Creates a Builder by copying an existing Builder */
    private Builder(com.beng.rpc.avro.User.Builder other) {
      super(other);
      if (isValidValue(fields()[0], other.name)) {
        this.name = data().deepCopy(fields()[0].schema(), other.name);
        fieldSetFlags()[0] = true;
      }
      if (isValidValue(fields()[1], other.favorite_number)) {
        this.favorite_number = data().deepCopy(fields()[1].schema(), other.favorite_number);
        fieldSetFlags()[1] = true;
      }
      if (isValidValue(fields()[2], other.favorite_color)) {
        this.favorite_color = data().deepCopy(fields()[2].schema(), other.favorite_color);
        fieldSetFlags()[2] = true;
      }
    }
    
    /** Creates a Builder by copying an existing User instance */
    private Builder(com.beng.rpc.avro.User other) {
            super(com.beng.rpc.avro.User.SCHEMA$);
      if (isValidValue(fields()[0], other.name)) {
        this.name = data().deepCopy(fields()[0].schema(), other.name);
        fieldSetFlags()[0] = true;
      }
      if (isValidValue(fields()[1], other.favorite_number)) {
        this.favorite_number = data().deepCopy(fields()[1].schema(), other.favorite_number);
        fieldSetFlags()[1] = true;
      }
      if (isValidValue(fields()[2], other.favorite_color)) {
        this.favorite_color = data().deepCopy(fields()[2].schema(), other.favorite_color);
        fieldSetFlags()[2] = true;
      }
    }

    /** Gets the value of the 'name' field */
    public java.lang.CharSequence getName() {
      return name;
    }
    
    /** Sets the value of the 'name' field */
    public com.beng.rpc.avro.User.Builder setName(java.lang.CharSequence value) {
      validate(fields()[0], value);
      this.name = value;
      fieldSetFlags()[0] = true;
      return this; 
    }
    
    /** Checks whether the 'name' field has been set */
    public boolean hasName() {
      return fieldSetFlags()[0];
    }
    
    /** Clears the value of the 'name' field */
    public com.beng.rpc.avro.User.Builder clearName() {
      name = null;
      fieldSetFlags()[0] = false;
      return this;
    }

    /** Gets the value of the 'favorite_number' field */
    public java.lang.Integer getFavoriteNumber() {
      return favorite_number;
    }
    
    /** Sets the value of the 'favorite_number' field */
    public com.beng.rpc.avro.User.Builder setFavoriteNumber(java.lang.Integer value) {
      validate(fields()[1], value);
      this.favorite_number = value;
      fieldSetFlags()[1] = true;
      return this; 
    }
    
    /** Checks whether the 'favorite_number' field has been set */
    public boolean hasFavoriteNumber() {
      return fieldSetFlags()[1];
    }
    
    /** Clears the value of the 'favorite_number' field */
    public com.beng.rpc.avro.User.Builder clearFavoriteNumber() {
      favorite_number = null;
      fieldSetFlags()[1] = false;
      return this;
    }

    /** Gets the value of the 'favorite_color' field */
    public java.lang.CharSequence getFavoriteColor() {
      return favorite_color;
    }
    
    /** Sets the value of the 'favorite_color' field */
    public com.beng.rpc.avro.User.Builder setFavoriteColor(java.lang.CharSequence value) {
      validate(fields()[2], value);
      this.favorite_color = value;
      fieldSetFlags()[2] = true;
      return this; 
    }
    
    /** Checks whether the 'favorite_color' field has been set */
    public boolean hasFavoriteColor() {
      return fieldSetFlags()[2];
    }
    
    /** Clears the value of the 'favorite_color' field */
    public com.beng.rpc.avro.User.Builder clearFavoriteColor() {
      favorite_color = null;
      fieldSetFlags()[2] = false;
      return this;
    }

    @Override
    public User build() {
      try {
        User record = new User();
        record.name = fieldSetFlags()[0] ? this.name : (java.lang.CharSequence) defaultValue(fields()[0]);
        record.favorite_number = fieldSetFlags()[1] ? this.favorite_number : (java.lang.Integer) defaultValue(fields()[1]);
        record.favorite_color = fieldSetFlags()[2] ? this.favorite_color : (java.lang.CharSequence) defaultValue(fields()[2]);
        return record;
      } catch (Exception e) {
        throw new org.apache.avro.AvroRuntimeException(e);
      }
    }
  }
}

Using User.java

import java.io.File;
import java.io.IOException;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;

import com.beng.rpc.avro.User;

public class Main {

    public static void main(String[] args) throws IOException {

        // Three ways to create a User
        // 1. Default constructor plus setters
        User user = new User();
        user.setName("Janny");
        user.setFavoriteColor("red");
        user.setFavoriteNumber(110);

        // 2. All-args constructor
        User user1 = new User("Danny", 119, "black");
        System.out.println(user1.toString());

        // 3. Builder pattern
        User user2 = User.newBuilder().setName("LiMing").setFavoriteColor("yellow").setFavoriteNumber(120).build();

        // Serialize the objects to user.avro
        String path = "user.avro";
        DatumWriter<User> userDatumWriter = new SpecificDatumWriter<>(User.class);
        DataFileWriter<User> dataFileWriter = new DataFileWriter<>(userDatumWriter);
        dataFileWriter.create(user1.getSchema(), new File(path));
        dataFileWriter.append(user1);
        dataFileWriter.append(user2);
        dataFileWriter.append(user);
        dataFileWriter.close();
        System.out.println();

        // Read the serialized file back
        DatumReader<User> reader = new SpecificDatumReader<>(User.class);
        DataFileReader<User> dataFileReader = new DataFileReader<>(new File(path), reader);
        User user4 = null;
        while (dataFileReader.hasNext()) {
            user4 = dataFileReader.next();
            System.out.println(user4);
        }
        dataFileReader.close();
    }
}

Here is what the serialized file looks like:

Objavro.schema�{"type":"record","name":"User","namespace":"com.beng.rpc.avro","fields":[{"name":"name","type":"string"},{"name":"favorite_number","type":["int","null"]},{"name":"favorite_color","type":["string","null"]}]}�f���rs뿾�z`
Danny�
blackLiMing�yellow
Janny�red�f��
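
The unreadable characters between the names and colors in this dump are the binary encoding of the other values. As a rough illustration, the following sketch hand-encodes the first record ("Danny", 119, "black") following the rules in the Avro specification: int and long values use zig-zag variable-length encoding, a string is its UTF-8 length followed by its bytes, and a union value is prefixed with the index of the chosen branch. It uses only the JDK, not the Avro library; the class name AvroRecordBytes and its helper methods are my own, and it covers just the record body, not the file header or sync markers.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class AvroRecordBytes {

    // int/long: zig-zag encode, then emit 7 bits per byte, low-order group first.
    static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);                // zig-zag keeps small magnitudes short
        while ((z & ~0x7FL) != 0) {                   // more than 7 bits remain
            out.write((int) ((z & 0x7F) | 0x80));     // set the continuation bit
            z >>>= 7;
        }
        out.write((int) z);
    }

    // string: UTF-8 byte length (encoded as a long), then the bytes themselves.
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Encodes one User record body; field order follows user.avsc.
    public static byte[] encodeUser(String name, int favoriteNumber, String favoriteColor) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, name);
        writeLong(out, 0);                 // union ["int", "null"]: branch 0 is int
        writeLong(out, favoriteNumber);
        writeLong(out, 0);                 // union ["string", "null"]: branch 0 is string
        writeString(out, favoriteColor);
        return out.toByteArray();
    }

    // Formats bytes as space-separated upper-case hex.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints 0A 44 61 6E 6E 79 00 EE 01 00 0A 62 6C 61 63 6B
        System.out.println(hex(encodeUser("Danny", 119, "black")));
    }
}
```

The 0A byte before "black" is the string length 5 in zig-zag form, and it happens to be the newline character, which would explain why the dump breaks the line right after "Danny".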

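The example above generates Java code from the schema.

Without code generation

There is no need to generate Java code from the schema, but the developer has to supply the schema at runtime.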

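import java.io.File;
import java.io.IOException;
import java.io.InputStream;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;

public class GenericMain {

    public static void main(String[] args) throws IOException {
        // user.avsc is placed in the "resources/avro" directory
        InputStream inputStream = ClassLoader.getSystemResourceAsStream("avro/user.avsc");
        Schema schema = new Schema.Parser().parse(inputStream);

        // Field names must match the ones declared in user.avsc
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Zhang San");
        user.put("favorite_number", 30);
        user.put("favorite_color", "red");

        File diskFile = new File("/data/users.avro");
        DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
        DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(datumWriter);
        dataFileWriter.create(schema, diskFile);
        dataFileWriter.append(user);
        dataFileWriter.close();

        DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);
        DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(diskFile, datumReader);
        GenericRecord _current = null;
        while (dataFileReader.hasNext()) {
            // Passing the previous record lets Avro reuse the object
            _current = dataFileReader.next(_current);
            System.out.println(_current);
        }

        dataFileReader.close();
    }
}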

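In this case no Java API is generated, so developers need to know the schema's structure before serializing. Creating a User is much like building a JSON object: values are assigned with put. Deserialization likewise requires a schema to be specified.
The GenericRecord interface provides a method to get a value by field name: Object get(String fieldName). Note, however, that the implementation is backed not by a map but by an array whose slots correspond by index to the fields declared in the schema. put looks up the index for the field name and assigns the value at that index; get does the same lookup and returns the value. Schema compatibility is therefore handled in essentially the same way as with code generation.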
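
The array-backed layout described above can be sketched in a few lines. This is not Avro's actual implementation, only a minimal illustration of the idea under my own assumptions: the class ArrayBackedRecord is hypothetical, and throwing an exception for an unknown field name is a choice made here for simplicity, not a statement about how Avro behaves in every version.

```java
import java.util.HashMap;
import java.util.Map;

public class ArrayBackedRecord {

    // Field name -> position, built once from the declared field order.
    private final Map<String, Integer> positions = new HashMap<>();
    // Values live in an array, one slot per declared field.
    private final Object[] values;

    public ArrayBackedRecord(String... fieldNames) {
        for (int i = 0; i < fieldNames.length; i++) {
            positions.put(fieldNames[i], i);
        }
        values = new Object[fieldNames.length];
    }

    // Resolves a field name to its slot index.
    private int position(String fieldName) {
        Integer pos = positions.get(fieldName);
        if (pos == null) {
            throw new IllegalArgumentException("Not a field of this record: " + fieldName);
        }
        return pos;
    }

    // Name-based access is a lookup followed by an array access.
    public void put(String fieldName, Object value) {
        values[position(fieldName)] = value;
    }

    public Object get(String fieldName) {
        return values[position(fieldName)];
    }

    // Index-based access, similar in spirit to get(int field$) in the generated User class.
    public Object get(int pos) {
        return values[pos];
    }

    public static void main(String[] args) {
        ArrayBackedRecord user = new ArrayBackedRecord("name", "favorite_number", "favorite_color");
        user.put("favorite_number", 119);
        System.out.println(user.get(1));                  // prints 119
        System.out.println(user.get("favorite_color"));   // prints null (never set)
    }
}
```

Because values are stored by position, a reader or writer can walk the fields in schema order without any name lookups, which fits the way the binary format stores field values back to back with no names in between.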

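Use case: Apache Storm + Avro + Kafka, three members of the Apache family, for building a big data platform.

Kryo is another serialization tool; worth looking into when there's time.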
