Avro data format overview

1. API reference documentation

https://avro.apache.org/docs/current/api/java/index.html

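2. Defining the Avro data format

Official specification: https://avro.apache.org/docs/current/spec.html

Here we define a simple schema file, user.avsc. Note that the file extension must be .avsc, because the avro-maven-plugin configured in section 3 picks up *.avsc files by default. Its contents are: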

{
    "namespace": "com.yyj.avro.demo",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "string","default":""},
        {"name": "name",  "type": ["string", "null"]},
        {"name": "age", "type": ["int", "null"]}
    ]
}
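
As a quick sanity check, a schema like the one above can be loaded with Avro's Java API and its fields inspected. The following is a minimal sketch, assuming the Avro library is on the classpath; the class name UserSchemaInspect is made up for illustration, and the schema is inlined as a string instead of being read from user.avsc.

```java
import org.apache.avro.Schema;

public class UserSchemaInspect {
    // Same content as user.avsc, inlined so the sketch is self-contained.
    static final String USER_SCHEMA_JSON =
        "{"
        + "\"namespace\": \"com.yyj.avro.demo\","
        + "\"type\": \"record\","
        + "\"name\": \"User\","
        + "\"fields\": ["
        + "  {\"name\": \"id\", \"type\": \"string\", \"default\": \"\"},"
        + "  {\"name\": \"name\", \"type\": [\"string\", \"null\"]},"
        + "  {\"name\": \"age\", \"type\": [\"int\", \"null\"]}"
        + "]}";

    // Parses the JSON text into an Avro Schema object.
    static Schema parse() {
        return new Schema.Parser().parse(USER_SCHEMA_JSON);
    }

    public static void main(String[] args) {
        Schema schema = parse();
        // Full name = namespace + "." + name
        System.out.println(schema.getFullName());
        for (Schema.Field field : schema.getFields()) {
            // Prints the field name and its top-level Avro type (STRING, UNION, ...)
            System.out.println(field.name() + " : " + field.schema().getType());
        }
    }
}
```

The parser rejects malformed schemas with an exception, so running this is also a cheap way to validate an .avsc file before wiring it into a build.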

Records use the type name “record” and support the following attributes:

  • name: a JSON string providing the name of the record (required).

  • namespace: a JSON string that qualifies the name (optional).

  • doc: a JSON string providing documentation to the user of this schema (optional).

  • aliases: a JSON array of strings, providing alternate names for this record (optional).

  • fields: a JSON array, listing fields (required). Each field is a JSON object with the following attributes:

      • name: a JSON string providing the name of the field (required).

      • doc: a JSON string describing this field for users (optional).

      • type: a schema, as defined above.

      • default: A default value for this field, only used when reading instances that lack the field for schema evolution purposes. The presence of a default value does not make the field optional at encoding time. Permitted values depend on the field’s schema type, according to the table below. Default values for union fields correspond to the first schema in the union. Default values for bytes and fixed fields are JSON strings, where Unicode code points 0-255 are mapped to unsigned 8-bit byte values 0-255. Avro encodes a field even if its value is equal to its default.

        Field default values

        Avro type       JSON type   Example
        null            null        null
        boolean         boolean     true
        int, long       integer     1
        float, double   number      1.1
        bytes           string      "\u00FF"
        string          string      "foo"
        record          object      {"a": 1}
        enum            string      "FOO"
        array           array       [1]
        map             object      {"a": 1}
        fixed           string      "\u00ff"

      • order: specifies how this field impacts sort ordering of this record (optional). Valid values are “ascending” (the default), “descending”, or “ignore”. For more details on how this is used, see the sort order section of the specification.

      • aliases: a JSON array of strings, providing alternate names for this field (optional).

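Applied to the user.avsc example:

  • namespace: the package name of the class generated from the schema file.

  • type: always "record" for a record schema.

  • name: the name of the generated class.

  • fields: the names and types of the properties in the generated class. Here "type": ["int", "null"] means that age is an int but may also be null.

Primitive types: null, boolean, int, long, float, double, bytes, string

Complex types: record, enum, array, map, union, fixed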
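
The spec excerpt above states that default values for bytes and fixed fields are JSON strings in which code points 0-255 map to byte values 0-255. In Java terms that mapping coincides with ISO-8859-1 encoding. The sketch below illustrates it with the standard library only; the class and method names are hypothetical and this is not how Avro itself is implemented.

```java
import java.nio.charset.StandardCharsets;

public class BytesDefaultDemo {
    // Maps each code point 0-255 of a JSON default string to a single byte.
    // ISO-8859-1 performs exactly this one-to-one mapping.
    static byte[] defaultToBytes(String jsonDefault) {
        return jsonDefault.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] bytes = defaultToBytes("\u00FF");
        // Java bytes are signed, so mask with 0xFF to show the unsigned value.
        System.out.println(bytes.length + " byte(s), value " + (bytes[0] & 0xFF));
        // prints "1 byte(s), value 255"
    }
}
```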

3. Configuring the Maven dependencies

<dependencies>
    <dependency>
        <groupId>org.apache.avro</groupId>
        <artifactId>avro</artifactId>
        <version>1.8.2</version>
    </dependency>
</dependencies>
<build>
   <plugins>
       <plugin>
           <groupId>org.apache.avro</groupId>
           <artifactId>avro-maven-plugin</artifactId>
           <version>1.8.2</version>
           <executions>
               <execution>
                   <phase>generate-sources</phase>
                   <goals>
                       <goal>schema</goal>
                   </goals>
                   <configuration>
                       <sourceDirectory>${project.basedir}/src/main/resources/</sourceDirectory>
                       <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
                   </configuration>
               </execution>
           </executions>
       </plugin>
       <plugin>
           <groupId>org.apache.maven.plugins</groupId>
           <artifactId>maven-compiler-plugin</artifactId>
           <configuration>
               <source>1.8</source>
               <target>1.8</target>
           </configuration>
       </plugin>
   </plugins>
</build>
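
To check that the avro dependency is wired up correctly, one option is a small binary round trip that does not depend on the generated User class. The following is a minimal sketch using Avro's generic API (GenericRecord) with the schema from section 2 inlined; the class name GenericRoundTrip and the sample values are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class GenericRoundTrip {
    // Same content as user.avsc, inlined so the sketch is self-contained.
    static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"namespace\":\"com.yyj.avro.demo\",\"type\":\"record\",\"name\":\"User\","
        + "\"fields\":["
        + "{\"name\":\"id\",\"type\":\"string\",\"default\":\"\"},"
        + "{\"name\":\"name\",\"type\":[\"string\",\"null\"]},"
        + "{\"name\":\"age\",\"type\":[\"int\",\"null\"]}]}");

    // Serializes one record to Avro binary (no schema header, just the data).
    static byte[] encode(GenericRecord record) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(SCHEMA).write(record, encoder);
            encoder.flush();
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deserializes bytes that were written with the same schema.
    static GenericRecord decode(byte[] bytes) {
        try {
            BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
            return new GenericDatumReader<GenericRecord>(SCHEMA).read(null, decoder);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        GenericRecord user = new GenericData.Record(SCHEMA);
        user.put("id", "u-001");   // sample values, not from the original post
        user.put("name", "Alice");
        user.put("age", null);     // allowed because the type is ["int", "null"]
        System.out.println(decode(encode(user)));
    }
}
```

Note that decoded string values come back as Avro's Utf8 type, so call toString() before comparing them with java.lang.String.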

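4. Schema evolution comes in four kinds:

  • Backward: backward compatible. The new schema can read old data; fields that have no data in the old records are filled with their default values.

  • Forward: forward compatible. The old schema can read new data; Avro ignores the newly added fields.

  • Full: fully compatible, that is, both forward and backward compatible. Only fields that have default values may be added or removed.

  • Breaking: incompatible.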
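
The compatibility kinds above can be checked programmatically. The sketch below uses Avro's SchemaCompatibility helper and a GenericDatumReader constructed with both a writer and a reader schema. The three schema variants and the email field are hypothetical, chosen only to illustrate the rules; they are not part of the user.avsc example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;
import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class SchemaEvolutionDemo {
    static final String HEAD =
        "{\"type\":\"record\",\"name\":\"User\",\"namespace\":\"com.yyj.avro.demo\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"string\"}";

    // Old schema: only "id".
    static final Schema OLD = new Schema.Parser().parse(HEAD + "]}");
    // New schema: adds a hypothetical "email" field WITH a default.
    static final Schema NEW = new Schema.Parser().parse(
        HEAD + ",{\"name\":\"email\",\"type\":\"string\",\"default\":\"unknown\"}]}");
    // Variant that adds "email" WITHOUT a default.
    static final Schema NEW_NO_DEFAULT = new Schema.Parser().parse(
        HEAD + ",{\"name\":\"email\",\"type\":\"string\"}]}");

    // True if data written with `writer` can be read with `reader`.
    static boolean canRead(Schema reader, Schema writer) {
        return SchemaCompatibility.checkReaderWriterCompatibility(reader, writer).getType()
            == SchemaCompatibilityType.COMPATIBLE;
    }

    // Writes a record with OLD, then reads it back with NEW as the reader schema.
    static GenericRecord writeOldReadNew(String id) {
        try {
            GenericRecord oldRecord = new GenericData.Record(OLD);
            oldRecord.put("id", id);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(OLD).write(oldRecord, encoder);
            encoder.flush();
            // Constructor arguments are (writer schema, reader schema).
            GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(OLD, NEW);
            return reader.read(null, DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("backward (NEW reads OLD): " + canRead(NEW, OLD));
        System.out.println("forward  (OLD reads NEW): " + canRead(OLD, NEW));
        System.out.println("no default, NEW_NO_DEFAULT reads OLD: " + canRead(NEW_NO_DEFAULT, OLD));
        // The missing "email" field is filled from the reader schema's default.
        System.out.println(writeOldReadNew("u-001"));
    }
}
```

In this sketch, adding a field with a default is compatible in both directions (Full), while adding the same field without a default is only forward compatible: the old schema can still read new data, but the new schema cannot read old data.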
