English overview: https://en.wikipedia.org/wiki/Apache_Avro
Official documentation: http://avro.apache.org/docs/current/
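Avro is a data serialization system. It provides rich data structures, a compact and fast binary data format, a container file for storing persistent data, remote procedure call (RPC), and simple integration with dynamic languages.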
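The description above comes from the Baidu Baike entry for Avro, which is itself a translation of the official overview. As for why it is named Avro, I searched for a long time and could not find an explanation.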
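I have many large Avro files, each on the order of tens of millions of records, that need to be analyzed.
Starting from a working Hadoop environment, set up the Avro development environment as follows.
1. First, add the Avro dependency to pom.xml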
<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-mapred</artifactId>
    <version>1.7.7</version>
    <classifier>hadoop2</classifier>
</dependency>
Then add the Avro Maven plugin
<plugin>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-maven-plugin</artifactId>
    <version>1.7.7</version>
    <executions>
        <execution>
            <phase>generate-sources</phase>
            <goals>
                <goal>schema</goal>
            </goals>
            <configuration>
                <sourceDirectory>${project.basedir}/../</sourceDirectory>
                <outputDirectory>${project.build.directory}/generated-sources/java</outputDirectory>
            </configuration>
        </execution>
    </executions>
</plugin>
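2. Under the src directory, at the same level as main, create a schemas directory, and in it create the schema file (.avsc) that Avro generates code from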
{"namespace": "com.xmliu.hadoop.avsc",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "type": ["int", "null"]},
{"name": "favorite_color", "type": ["string", "null"]}
]
}
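To make the "compact binary format" concrete, here is a minimal sketch of how a single User datum under this schema could be laid out in Avro's binary encoding. It uses only the JDK, not the Avro library; the class and method names (AvroEncodingSketch, encodeUser and so on) are hypothetical and exist only for illustration. It assumes the encoding rules from the Avro specification: int and long values are zig-zag varints, a string is its UTF-8 length followed by the bytes, and a union is the zero-based branch index followed by the value. A real .avro container file additionally wraps such records in a header (which carries the schema) and sync markers, none of which is shown here.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class AvroEncodingSketch {
    // Zig-zag maps signed values to unsigned ones so small magnitudes stay short,
    // then emits 7 bits per byte, lowest group first, with a continuation bit.
    public static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
    }

    // A string is its UTF-8 byte length (as a zig-zag varint) followed by the bytes.
    public static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Fields are written in schema order. For the unions ["int", "null"] and
    // ["string", "null"], branch 0 carries a value and branch 1 (null) has no payload.
    public static byte[] encodeUser(String name, Integer favoriteNumber, String favoriteColor) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, name);
        if (favoriteNumber != null) {
            writeLong(out, 0);
            writeLong(out, favoriteNumber);
        } else {
            writeLong(out, 1);
        }
        if (favoriteColor != null) {
            writeLong(out, 0);
            writeString(out, favoriteColor);
        } else {
            writeLong(out, 1);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = encodeUser("Lendy", 23, "red");
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02x ", b));
        }
        System.out.println(bytes.length + " bytes: " + hex.toString().trim());
    }
}
```

Running main prints "13 bytes: 0a 4c 65 6e 64 79 00 2e 00 06 72 65 64". Field names never appear in the record body, which is why the reader needs the schema to decode it.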
3. Once assembly.xml is configured, build the project to generate the corresponding Java class. With the configuration above, the generated file should be at
./target/generated-sources/java/com/xmliu/hadoop/avsc/User.java
4. To make testing easier, generate an Avro file locally. Running the following code in a main method creates users.avro in the project root
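// Needs: java.io.File, java.io.IOException, org.apache.avro.file.DataFileWriter,
// org.apache.avro.io.DatumWriter, org.apache.avro.specific.SpecificDatumWriter
DatumWriter<User> writer = new SpecificDatumWriter<>(User.class);
// try-with-resources closes (and flushes) the writer even if an append fails
try (DataFileWriter<User> dataFileWriter = new DataFileWriter<>(writer)) {
    dataFileWriter.create(User.getClassSchema(), new File("users.avro"));
    User user1 = new User();
    user1.setFavoriteColor("red");
    user1.setFavoriteNumber(23);
    user1.setName("Lendy");
    dataFileWriter.append(user1);
    // More users can be created and appended here
} catch (IOException e) {
    e.printStackTrace();
}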
5. Read and analyze the file with MapReduce. The mapper and reducer used here come from the official Avro documentation and its example source code
try {
    Job job = Job.getInstance(getConf(), "avro");
    job.setJarByClass(AvroMain.class);
    FileInputFormat.setInputPaths(job, new Path(avroPath));
    FileOutputFormat.setOutputPath(job, new Path(outputPath));
    job.setInputFormatClass(AvroKeyInputFormat.class);
    job.setMapperClass(MyAvroJob.ColorCountMapper.class);
    AvroJob.setInputKeySchema(job, User.getClassSchema());
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setReducerClass(MyAvroJob.ColorCountReducer.class);
    AvroJob.setOutputKeySchema(job, Schema.create(Schema.Type.STRING));
    AvroJob.setOutputValueSchema(job, Schema.create(Schema.Type.INT));
    if (job.waitForCompletion(true)) {
        return 0;
    }
} catch (Exception e) {
    e.printStackTrace();
}
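Since the mapper and reducer themselves are not reproduced in this post, the following is a rough, JDK-only sketch of what the job computes, not the Hadoop code itself. It assumes the mapper emits (favorite color, 1) for each user and the reducer sums the ones per color, as the class names ColorCountMapper and ColorCountReducer suggest. The class ColorCountSketch and the "none" placeholder for users without a favorite color are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ColorCountSketch {
    // "Map" step: choose the key for one user. A null favorite color
    // (the "null" branch of the union) is counted under "none" (assumed placeholder).
    public static String mapColor(String favoriteColor) {
        return favoriteColor == null ? "none" : favoriteColor;
    }

    // "Reduce" step, done in memory: add 1 for every occurrence of each key.
    public static Map<String, Integer> countColors(List<String> favoriteColors) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String color : favoriteColors) {
            counts.merge(mapColor(color), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Stand-in for the favorite_color field of four User records
        List<String> colors = Arrays.asList("red", null, "blue", "red");
        System.out.println(countColors(colors));
    }
}
```

For this sample input, main prints {blue=1, none=1, red=2}. In the real job, Hadoop performs the grouping by key between the map and reduce phases, so neither class holds the whole map in memory.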
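Problems I ran into along the way:
1.
Error:java: Compilation failed: internal java compiler error
The fix is to adjust the Java compiler version configured for the project; see the article "Error:java: Compilation failed: internal java compiler error".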
2.
Error: org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter;
The fix for this one was to remove the following line
job.setOutputFormatClass(AvroKeyValueOutputFormat.class);
As I understand it, this line sets the job's output format to Avro, which is not needed here.
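3.
Class file for org.codehaus.jackson.JsonParser not found
The fix was a curious one: create a new module and redo the setup step by step. It is the same idea as restarting a phone when something goes wrong, and in effect a process of elimination.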
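When chasing problems like 2 and 3, it can help to check whether a class is visible at runtime and which jar it was loaded from. The following is a small diagnostic sketch using only the JDK. The class name ClasspathCheck is hypothetical, and whether a missing or duplicated jar was really the cause of the errors above is an assumption, not something confirmed here.

```java
import java.security.CodeSource;

public class ClasspathCheck {
    // Returns the location a class was loaded from, a placeholder when the JVM
    // reports no location (e.g. core JDK classes), or null when the class
    // cannot be found or linked.
    public static String locate(String className) {
        try {
            // initialize=false: find the class without running its static initializers
            Class<?> c = Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            return src == null ? "(bootstrap or unknown)" : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "org.codehaus.jackson.JsonParser",
            "org.apache.avro.generic.GenericData"
        };
        for (String name : names) {
            String where = locate(name);
            System.out.println(name + " -> " + (where == null ? "NOT on classpath" : where));
        }
    }
}
```

Run it with the same classpath as the job. If GenericData resolves to a jar other than the avro 1.7.7 one declared in pom.xml, a version conflict is worth investigating.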