Structurally, an Avro data file is similar to a SequenceFile. The schema is serialized into the file header, which makes deserialization straightforward. Each block contains a series of Avro records and is 16 KB by default. Avro data files support compression and are splittable.
To read and write Avro data against a byte stream from a program, you first need an Avro schema file.
The Avro schema file (.avsc):
{
"namespace": "com.hadoop2.data",
"type": "record",
"name": "StringPair",
"doc": "A pair of strings.",
"fields": [
{
"name": "left",
"type": "string"
},
{
"name": "right",
"type": "string"
}
]
}
Writing with Java:
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(App.class.getResourceAsStream("/StringPair.avsc"));
GenericRecord record = new GenericData.Record(schema);
record.put("left", "L");
record.put("right", "R");
ByteArrayOutputStream out = new ByteArrayOutputStream();
DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(record, encoder);
encoder.flush();
out.close();
A DatumWriter translates data objects into the types an Encoder understands; the Encoder then writes them to the output stream.
/** Write data of a schema.
* Implemented for different in-memory data representations.
*/
public interface DatumWriter<D> {
/** Set the schema. */
void setSchema(Schema schema);
/** Write a datum. Traverse the schema, depth first, writing each leaf value
* in the schema from the datum to the output. */
void write(D datum, Encoder out) throws IOException;
}
Reading:
DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(),null);
record = reader.read(null,decoder);
record.get("left");
record.get("right"); //输出类型是UTF8
System.out.println(record.toString());
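The avro-maven-plugin can generate the corresponding StringPair class from the schema at build time; add it to the project's pom.xml: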
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>${avro.version}</version>
  <executions>
    <execution>
      <id>schemas</id>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
      <configuration>
        <sourceDirectory>${project.basedir}/src/main/resources</sourceDirectory>
        <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>
Run mvn generate-sources in the project directory to generate the Java classes under the output directory. Alternatively, you can use the avro-tools jar, although that is a bit more cumbersome.
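For reference, a roughly equivalent avro-tools invocation (the jar version and paths here are illustrative, not taken from the original) is:
java -jar avro-tools-1.8.2.jar compile schema src/main/resources/StringPair.avsc src/main/java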
Using the generated StringPair class instead of GenericRecord:
StringPair pair = new StringPair();
pair.setLeft("L");
pair.setRight("R");
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
DatumWriter<StringPair> writer = new SpecificDatumWriter<>(StringPair.class);
Encoder encoder = EncoderFactory.get().binaryEncoder(byteArrayOutputStream,null);
writer.write(pair,encoder);
encoder.flush();
byteArrayOutputStream.close();
DatumReader<StringPair> reader = new SpecificDatumReader<>(StringPair.class);
Decoder decoder = DecoderFactory.get().binaryDecoder(byteArrayOutputStream.toByteArray(),null);
StringPair result = reader.read(null,decoder);
System.out.println(result);
As the figure at the beginning of the article shows, the data file header contains metadata (the Avro schema and a sync marker), followed by a series of blocks containing the serialized Avro objects. Blocks are separated by sync markers that are unique to the file and, much like HDFS block boundaries, allow a reader to seek to an arbitrary position in the file and then quickly resynchronize at the next block boundary. Avro data files are therefore splittable.
Avro data files conventionally use the .avro extension.
Another example:
DataFileWriter<Stock> writer = new DataFileWriter<>(new SpecificDatumWriter<>(Stock.class));
FileOutputStream outputStream = new FileOutputStream("/data/workspace/hadoop/target/stocks.avro");
writer.setCodec(CodecFactory.snappyCodec());
writer.create(Stock.SCHEMA$,outputStream);
AvroStockUtils.fromCsvStream(App.class.getResourceAsStream("/stocks.txt"))
.stream().forEach(s -> {
try {
writer.append(s);
} catch (IOException e) {
e.printStackTrace();
}
});
IOUtils.closeStream(writer);
IOUtils.closeStream(outputStream);
FileInputStream inputStream = new FileInputStream("/data/workspace/hadoop/target/stocks.avro");
DataFileStream<Stock> stream = new DataFileStream<>(inputStream, new SpecificDatumReader<>(Stock.class));
stream.forEach(s -> System.out.println(s));
IOUtils.closeStream(stream);
IOUtils.closeStream(inputStream);
The schema file (Stock.avsc):
{
"namespace":"com.hadoop2.data",
"name": "Stock",
"type": "record",
"fields":[
{"name": "symbol", "type":"string"},
{"name": "date", "type":"string"},
{"name": "open", "type":"double"},
{"name": "high", "type":"double"},
{"name": "low", "type":"double"},
{"name": "close", "type":"double"},
{"name": "volume", "type":"int"},
{"name": "adjClose", "type":"double"}
]
}
Output:
{"symbol": "MSFT", "date": "2002-01-02", "open": 66.65, "high": 67.11, "low": 65.51, "close": 67.04, "volume": 48124000, "adjClose": 27.4}
{"symbol": "MSFT", "date": "2001-01-02", "open": 44.13, "high": 45.0, "low": 42.88, "close": 43.38, "volume": 82413200, "adjClose": 17.73}
{"symbol": "MSFT", "date": "2000-01-03", "open": 117.37, "high": 118.62, "low": 112.0, "close": 116.56, "volume": 53228400, "adjClose": 47.64}
{"symbol": "YHOO", "date": "2009-01-02", "open": 12.17, "high": 12.85, "low": 12.12, "close": 12.85, "volume": 9514600, "adjClose": 12.85}
{"symbol": "YHOO", "date": "2008-01-02", "open": 23.8, "high": 24.15, "low": 23.6, "close": 23.72, "volume": 25671700, "adjClose": 23.72}
{"symbol": "YHOO", "date": "2007-01-03", "open": 25.85, "high": 26.26, "low": 25.26, "close": 25.61, "volume": 26352700, "adjClose": 25.61}
{"symbol": "YHOO", "date": "2006-01-03", "open": 39.69, "high": 41.22, "low": 38.79, "close": 40.91, "volume": 24227700, "adjClose": 40.91}
The above reads the data as specific Java objects. Alternatively, you can use GenericRecord, which is more verbose:
Schema schema = new Schema.Parser().parse(new File("/data/workspace/hadoop/src/main/resources/Stock.avsc"));
File file = new File("/data/workspace/hadoop/src/main/resources/stocks.avro");
DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(writer);
dataFileWriter.setCodec(CodecFactory.snappyCodec());
dataFileWriter.create(schema,file);
List<String> list = Files.lines(new File("/data/workspace/hadoop/src/main/resources/stocks.txt").toPath()).collect(Collectors.toList());
list.stream().forEach(s -> {
String[] arrays = s.split(",");
GenericRecord record = new GenericData.Record(schema);
record.put("symbol",arrays[0]);
record.put("date",arrays[1]);
record.put("open",Double.valueOf(arrays[2]));
record.put("high",Double.valueOf(arrays[3]));
record.put("low",Double.valueOf(arrays[4]));
record.put("close",Double.valueOf(arrays[5]));
record.put("volume",Integer.valueOf(arrays[6]));
record.put("adjClose",Double.valueOf(arrays[7]));
try {
dataFileWriter.append(record);
} catch (IOException e) {
e.printStackTrace();
}
});
IOUtils.closeStream(dataFileWriter);
DatumReader<GenericRecord> reader = new GenericDatumReader<>();
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(file, reader);
GenericRecord record;
while (dataFileReader.hasNext()){
record = dataFileReader.next();
System.out.println(record);
}
If you need random access within a data file, use the seek() and sync() methods.
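A minimal sketch of random access, reusing the file and schema from the GenericRecord example above (assuming the stocks.avro file has already been written): sync(pos) skips forward to the first block boundary at or after byte offset pos, previousSync() reports the current block's sync point, and seek(pos) jumps directly to a known sync point.
DataFileReader<GenericRecord> randomReader = new DataFileReader<>(file, new GenericDatumReader<>(schema));
randomReader.sync(0L);                        // move to the first block boundary at or after byte offset 0
long syncPoint = randomReader.previousSync(); // remember this block's sync point
while (randomReader.hasNext()) {
    System.out.println(randomReader.next());
}
randomReader.seek(syncPoint);                 // jump straight back to the remembered block boundary
if (randomReader.hasNext()) {
    System.out.println(randomReader.next());  // re-reads the first record of that block
}
randomReader.close();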
Again, using the maximum temperature example:
public class AvroGenricMaxTemperature extends Configured implements Tool {
private static final Schema SCHEMA = new Schema.Parser().parse("{" +
" \"type\": \"record\"," +
" \"name\": \"WeatherRecord\"," +
" \"doc\": \"A weather reading.\"," +
" \"fields\": [" +
"{\"name\": \"year\", \"type\": \"int\"}," +
"{\"name\": \"temperature\", \"type\": \"int\"}," +
"{\"name\": \"stationId\", \"type\": \"string\"}" +
" ]" +
"}");
public static void main(String[] args) throws Exception {
int code = ToolRunner.run(new AvroGenricMaxTemperature(),args);
System.exit(code);
}
@Override
public int run(String[] strings) throws Exception {
Job job = Job.getInstance(getConf(),"AvroGenricMaxTemperature");
job.setJarByClass(getClass());
//put the user's Avro jars ahead of Hadoop's on the task classpath
job.getConfiguration().setBoolean(Job.MAPREDUCE_JOB_USER_CLASSPATH_FIRST,true);
FileInputFormat.addInputPath(job,new Path("hdfs://hadoop:9000/user/madong/input"));
FileOutputFormat.setOutputPath(job,new Path("hdfs://hadoop:9000/user/madong/avro-out"));
AvroJob.setMapOutputKeySchema(job,Schema.create(Schema.Type.INT));
AvroJob.setMapOutputValueSchema(job,SCHEMA);
AvroJob.setOutputKeySchema(job,SCHEMA);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(AvroKeyOutputFormat.class);
job.setMapperClass(MaxTemperatureMapper.class);
job.setReducerClass(MaxTemperatureReducer.class);
return job.waitForCompletion(true) ? 0 : 1;
}
static class MaxTemperatureMapper extends Mapper<LongWritable,Text,AvroKey<Integer>,AvroValue<GenericRecord>>{
private NcdcRecordParser parser = new NcdcRecordParser();
private GenericRecord record = new GenericData.Record(SCHEMA);
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
parser.parse(value);
if (parser.isValidTemperature()){
record.put("year",parser.getYearInt());
record.put("temperature",parser.getAirTemperature());
record.put("stationId",parser.getStationId());
context.write(new AvroKey<>(parser.getYearInt()),new AvroValue<>(record));
}
}
}
static class MaxTemperatureReducer extends Reducer<AvroKey<Integer>,AvroValue<GenericRecord>,AvroKey<GenericRecord>,NullWritable>{
@Override
protected void reduce(AvroKey<Integer> key, Iterable<AvroValue<GenericRecord>> values, Context context) throws IOException, InterruptedException {
GenericRecord max = null;
for (AvroValue<GenericRecord> value : values){
GenericRecord record = value.datum();
if (max == null || (Integer)record.get("temperature") > (Integer)max.get("temperature")){
max = newWeatherRecord(record);
}
}
context.write(new AvroKey<>(max),NullWritable.get());
}
private GenericRecord newWeatherRecord(GenericRecord value) {
GenericRecord record = new GenericData.Record(SCHEMA);
record.put("year", value.get("year"));
record.put("temperature", value.get("temperature"));
record.put("stationId", value.get("stationId"));
return record;
}
}
}
An Avro MapReduce program differs from a regular Hadoop MapReduce program in two ways:
First, it uses Avro's Java wrapper types. In this program the key is the year and the value is the weather record, represented as a GenericRecord; for the map output and reduce input they are wrapped in AvroKey and AvroValue.
Second, the job is configured through AvroJob. AvroJob is mainly used to set the schemas for the map input and output and for the final job output. In the program above, no input schema is set because the input is plain text.
Avro defines its own sort order for objects. Each record field supports three ordering options: ascending (the default), descending, and ignore.
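For illustration (an assumed sketch; the actual SortedStringPair.avsc referenced below is not shown in the original), a schema that orders records by the right field in descending order and excludes left from the comparison could look like this:
{
  "namespace": "com.hadoop2.data",
  "type": "record",
  "name": "StringPair",
  "doc": "A pair of strings, sorted by the right field in descending order (illustrative).",
  "fields": [
    {"name": "left", "type": "string", "order": "ignore"},
    {"name": "right", "type": "string", "order": "descending"}
  ]
}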
The following example combines Avro with a MapReduce sort:
public class AvroSort extends Configured implements Tool {
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new AvroSort(), args);
System.exit(exitCode);
}
@Override
public int run(String[] strings) throws Exception {
Job job = Job.getInstance(getConf(),"AvroSort");
job.setJarByClass(getClass());
//put the user's Avro jars ahead of Hadoop's on the task classpath
job.getConfiguration().setBoolean(Job.MAPREDUCE_JOB_USER_CLASSPATH_FIRST,true);
FileInputFormat.addInputPath(job,new Path("hdfs://hadoop:9000/user/madong/avro/pairs.avro"));
FileOutputFormat.setOutputPath(job,new Path("hdfs://hadoop:9000/user/madong/avro-out"));
AvroJob.setDataModelClass(job, GenericData.class);
Schema schema = new Schema.Parser().parse(new File("/data/workspace/hadoop/src/main/resources/SortedStringPair.avsc"));
AvroJob.setInputKeySchema(job,schema);
// map output types: AvroKey, AvroValue
AvroJob.setMapOutputKeySchema(job,schema);
AvroJob.setMapOutputValueSchema(job,schema);
//job output types: AvroKey, NullWritable
AvroJob.setOutputKeySchema(job,schema);
job.setInputFormatClass(AvroKeyInputFormat.class);
job.setOutputFormatClass(AvroKeyOutputFormat.class);
job.setOutputKeyClass(AvroKey.class);
job.setOutputValueClass(NullWritable.class);
job.setMapperClass(SortMapper.class);
job.setReducerClass(SortReducer.class);
return job.waitForCompletion(true) ? 0 : 1;
}
static class SortMapper<K> extends Mapper<AvroKey<K>,NullWritable,AvroKey<K>,AvroValue<K>>{
@Override
protected void map(AvroKey<K> key, NullWritable value, Context context) throws IOException, InterruptedException {
context.write(key, new AvroValue<>(key.datum()));
}
}
static class SortReducer<K> extends Reducer<AvroKey<K>,AvroValue<K>,AvroKey<K>,NullWritable>{
@Override
protected void reduce(AvroKey<K> key, Iterable<AvroValue<K>> values, Context context) throws IOException, InterruptedException {
for (AvroValue<K> value : values){
context.write(new AvroKey<>(value.datum()), NullWritable.get());
}
}
}
}
Sorting happens during the MapReduce shuffle, and the sort function is determined by the Avro schema passed into the program.
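A minimal sketch of the underlying idea, assuming the StringPair schema parsed earlier and hypothetical field values: Avro's generic data model compares two records field by field against a schema, honouring each field's order attribute, so the schema alone determines how keys sort in the shuffle.
// Illustrative only: compare two GenericRecords using the schema's sort order.
GenericRecord a = new GenericData.Record(schema);
a.put("left", "a");
a.put("right", "1");
GenericRecord b = new GenericData.Record(schema);
b.put("left", "b");
b.put("right", "2");
int cmp = GenericData.get().compare(a, b, schema); // negative means a sorts before b
System.out.println(cmp);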