Getting Started with Apache Avro and Its Use in Spark

1 Introduction to Avro

In the early days of the internet, many projects ran on monolithic architectures, and Java's native serialization mechanism was good enough for most scenarios. As business and traffic grew, architectures gradually migrated to microservices. Each microservice may be written in a different language, and some services have high performance requirements for communication, which Java's native serialization cannot meet: it (1) does not work across languages and (2) produces comparatively large serialized output. Adopting a third-party serialization protocol therefore becomes necessary.
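To make the size problem concrete, the sketch below serializes a tiny object with Java's built-in mechanism and prints the byte count; the LegacyUser class is a hypothetical POJO made up for this measurement, not part of the project that follows.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// LegacyUser is a hypothetical POJO used only for this size comparison.
class LegacyUser implements Serializable {
    private static final long serialVersionUID = 1L;
    String name = "Jay";
    int id = 1024;
}

public class JavaSerializationSize {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(new LegacyUser());
        }
        // The stream embeds class metadata (class name, field names, type
        // signatures), so even this tiny object costs dozens of bytes.
        System.out.println("Java serialized size: " + bos.size() + " bytes");
    }
}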
Apache Avro was created by Doug Cutting, the father of Hadoop. Because it is language-neutral and its serialized output is compact, it is widely used in the big-data field. Avro schemas are usually written in JSON, and data is encoded in a binary format. The supported primitive types are: null, boolean, int, long, float, double, bytes, and string.
Complex types (record, enum, array, map, union, and fixed) are also supported; see the official site https://avro.apache.org/ for details.
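For instance, an array of strings and an enum are declared with JSON fragments like the following (standalone illustrations, not part of this article's project):

{"type": "array", "items": "string"}

{"type": "enum", "name": "Color", "symbols": ["RED", "GREEN", "BLUE"]}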

2 Replacing Java Native Serialization with Avro

2.1 Defining an Avro Schema

Create the Avro schema file user.avsc with the following content:

{
    "namespace": "com.dhhy.avro",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "id", "type": "int"},
        {"name": "phonenum", "type": ["string", "null"]}
    ]
}

Here the field name is a string and id is an int, both required; phonenum is declared as a union of string and null, so it may be null.
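As a quick sanity check before compiling, a small sketch like the one below parses user.avsc with the Avro library (the org.apache.avro:avro dependency introduced in section 2.3) and prints each field's resolved type; the class name SchemaCheck is ours, not part of the original project.

import org.apache.avro.Schema;
import java.io.File;

public class SchemaCheck {
    public static void main(String[] args) throws Exception {
        // Parse the schema file and dump every field with its Avro type
        Schema schema = new Schema.Parser().parse(new File("user.avsc"));
        for (Schema.Field field : schema.getFields()) {
            System.out.println(field.name() + " : " + field.schema());
        }
    }
}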

2.2 Compiling the Schema

Download the avro-tools-1.9.2.jar dependency and run: java -jar avro-tools-1.9.2.jar compile schema user.avsc ./
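Code generation can also be driven programmatically instead of through the CLI; a minimal sketch, assuming the org.apache.avro:avro-compiler artifact is on the classpath (the GenerateUser wrapper class is ours):

import org.apache.avro.compiler.specific.SpecificCompiler;
import java.io.File;

public class GenerateUser {
    public static void main(String[] args) throws Exception {
        // Equivalent to: java -jar avro-tools-1.9.2.jar compile schema user.avsc ./
        SpecificCompiler.compileSchema(new File("user.avsc"), new File("."));
    }
}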

2.3 Creating a Maven Project

The project is named KafkaStudy; its overall directory layout is shown below.
[Figure: overall directory structure of the KafkaStudy project]

1) The pom file is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.dhhy</groupId>
    <artifactId>kafkaStudy</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>2.1.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro</artifactId>
            <version>1.9.2</version>
        </dependency>

        <dependency>
            <groupId>com.google.code.gson</groupId>
            <artifactId>gson</artifactId>
            <version>2.2.4</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>8</source>
                    <target>8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

2) Copy the folder generated by compiling the schema into the KafkaStudy project, i.e. the User.java file.
3) Create a Java class AvroApp.java with the following content:

package com.dhhy.avro;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;
import java.io.File;
import java.io.IOException;

/**
 * Created by JayLai on 2020-02-24 16:44:18
 */
public class AvroApp {

    /**
     * Create a User object
     * @return a populated User instance
     */
    public User createUser(){
        User user = User.newBuilder()
                .setName("Jay")
                .setId(1024)
                .setPhonenum("18814123346")
                .build();

        return user;
    }

    /**
     * Serialize the User object to disk
     */
    public void serializeUser(User user, File file){
        DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
        DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter);

        try {
            dataFileWriter.create(user.getSchema(), file);
            dataFileWriter.append(user);
        } catch (IOException e) {
            e.printStackTrace();
        } finally{
            try{
                dataFileWriter.close();
            }catch(IOException e){
                e.printStackTrace();
            }
        }
    }



    /**
     * Deserialize the User object from disk back into memory
     */
    public void deserializeUser(File file){
        DatumReader<User> userDatumReader = new SpecificDatumReader<User>(User.class);
        DataFileReader<User> dataFileReader = null;


        try {
            dataFileReader = new DataFileReader<User>(file, userDatumReader);
            User user = null;
            while (dataFileReader.hasNext()) {
                user = dataFileReader.next(user);
                System.out.println(user);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally{
            try{
                if(dataFileReader != null){
                    dataFileReader.close();
                }
            }catch(IOException e){
                e.printStackTrace();
            }
        }
    }


    public static void main(String[] args) {
        AvroApp app = new AvroApp();
        File file = new File("users.avro");
        app.serializeUser(app.createUser(), file);
        app.deserializeUser(file);
    }


}

2.4 Testing

Run AvroApp.java. The console prints {"name": "Jay", "id": 1024, "phonenum": "18814123346"}, and a users.avro file appears on the local disk. This shows that the in-memory User object can be serialized to local disk with Avro and deserialized back into memory.
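The generated file can also be inspected directly with avro-tools, which ships a tojson command: java -jar avro-tools-1.9.2.jar tojson users.avro prints each record in the file as a line of JSON.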

3 Using Avro for Encoding and Decoding in Kafka

Because the Kafka message queue offers high throughput, most real-time data exchange between upstream and downstream systems in the big-data field goes through Kafka. Upstream producers, possibly written in different languages such as C++ or Java, use Avro to serialize data before writing it to the Kafka brokers; downstream clients, likewise written in several languages, use Avro to deserialize it, as shown in the figure.

[Figure: multi-language producers serialize with Avro into Kafka; multi-language consumers deserialize]

Starting from the KafkaStudy project, delete the two files that are no longer needed, AvroApp.java and users.avro. The overall layout:
[Figure: directory structure of the KafkaStudy project after the cleanup]

3.1 Creating a Topic

Log in to the server with Xshell, change to KAFKA_HOME, and create a topic named avro_topic:

hadoop@ubuntu:~/app/kafka$ ls
bin  config  kafka_log  libs  LICENSE  logs  NOTICE  site-docs  start.sh
hadoop@ubuntu:~/app/kafka$ bin/kafka-topics.sh --create --zookeeper 192.168.0.131:2181 --partitions 2 --replication-factor 1 --topic avro_topic
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic "avro_topic".

3.2 The Record Serialization Class on the Producer Side

package com.dhhy.avro;

import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumWriter;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Serializer;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Map;

/**
 * Serializer class
 * Created by JayLai on 2020-03-24 22:17:40
 */
public class AvroSerializer implements Serializer<User> {
    @Override
    public void configure(Map<String, ?> map, boolean b) {

    }

    @Override
    public byte[] serialize(String topic, User data) {
        if(data == null) {
            return null;
        }
        DatumWriter<User> writer = new SpecificDatumWriter<>(data.getSchema());
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().directBinaryEncoder(out, null);
        try {
            writer.write(data, encoder);
        }catch (IOException e) {
            throw new SerializationException(e.getMessage());
        }
        return out.toByteArray();
    }

    @Override
    public void close() {

    }
}
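A note on the encoder choice above: directBinaryEncoder writes every value straight through to the underlying ByteArrayOutputStream. EncoderFactory.get().binaryEncoder(out, null) would return a buffered encoder instead, which is generally faster for larger records but must be flushed with encoder.flush() before out.toByteArray() is called.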

3.3 The Producer

package com.dhhy.avro;

import com.google.gson.Gson;
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

/**
 * Created by JayLai on 2020-02-24 19:55:29
 */
public class AvroProducer {
    public static final String brokerList = "192.168.0.131:9092,192.168.0.132:9092,192.168.0.133:9092";
    public static final String topic = "avro_topic";
    static int count = 0;


    /**
     * Create a User object
     * @return a populated User instance
     */
    public static User createUser(){
        User user = User.newBuilder()
                .setName("Jay")
                .setId(++count)
                .setPhonenum("18814123456")
                .build();

        return user;
    }

    public static void main(String[] args) {
        User[] users = new User[10];
        for (int i = 0; i < 10; i++){
            users[i] = createUser();
        }


        Properties properties =  new Properties();
        properties.put("bootstrap.servers", brokerList);
        properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        properties.put("value.serializer", "com.dhhy.avro.AvroSerializer");


        //Configure the producer client parameters and create a KafkaProducer instance
        KafkaProducer<String, User> producer = new KafkaProducer<>(properties);
        //Send the messages
        Map<String, Object> map = new HashMap<>();
        Gson gson= new Gson();
        try {
            for (User user : users) {
                ProducerRecord<String, User> record = new ProducerRecord<>(topic, user);
                producer.send(record, new Callback() {
                    @Override
                    public void onCompletion(RecordMetadata metadata, Exception exception) {
                        if (exception != null){
                            exception.printStackTrace();
                        }else{
                            map.put("topic", metadata.topic());
                            map.put("partition", metadata.partition());
                            map.put("offset", metadata.offset());
                            map.put("user", user);
                            System.out.println(gson.toJson(map));
                        }
                    }
                });
            }
        }catch (Exception e){
            e.printStackTrace();
        }finally {
            //Close the producer client instance
            if(producer != null){
                producer.close();
            }
        }

    }
}
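Note that producer.close() in the finally block waits for all previously sent requests to complete, which is why the callbacks for all ten records get a chance to fire before the process exits.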

3.4 The Record Deserialization Class on the Consumer Side

package com.dhhy.avro;

import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.kafka.common.serialization.Deserializer;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Map;

/**
 * Deserializer class
 * Created by JayLai on 2020-03-24 22:30:03
 */
public class AvroDeserializer implements Deserializer<User>{
    @Override
    public void configure(Map<String, ?> map, boolean b) {

    }


    @Override
    public void close() {

    }

    @Override
    public User deserialize(String topic, byte[] data) {
        if(data == null) {
            return null;
        }
        User user = new User();
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        DatumReader<User> userDatumReader = new SpecificDatumReader<>(User.getClassSchema());
        BinaryDecoder decoder = DecoderFactory.get().directBinaryDecoder(in, null);
        try {
            user = userDatumReader.read(null, decoder);
        } catch (IOException e) {
            e.printStackTrace();
        }
        return user;
    }
}
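
Before wiring these two classes into Kafka, it is worth checking that they round-trip correctly on their own. A minimal sketch (the AvroSerdeRoundTrip class is ours, not part of the original project):

package com.dhhy.avro;

public class AvroSerdeRoundTrip {
    public static void main(String[] args) {
        // Build a User, push it through the producer-side serializer,
        // then read it back with the consumer-side deserializer.
        User original = User.newBuilder()
                .setName("Jay")
                .setId(7)
                .setPhonenum(null) // phonenum is nullable in the schema
                .build();
        byte[] bytes = new AvroSerializer().serialize("avro_topic", original);
        User copy = new AvroDeserializer().deserialize("avro_topic", bytes);
        System.out.println(original.equals(copy)); // expected: true
    }
}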

3.5 The Consumer

package com.dhhy.avro;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Created by JayLai on 2020-03-24 23:34:17
 */
public class AvroConsumer {
    public static final String brokerList = "192.168.0.131:9092,192.168.0.132:9092,192.168.0.133:9092";
    public static final String topic = "avro_topic";
    public static final String groupId = "avro_group_001";
    public static final AtomicBoolean isRunning =new AtomicBoolean(true);

    public static void main(String[] args) {
        Properties properties =  new Properties();
        properties.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.put("value.deserializer", "com.dhhy.avro.AvroDeserializer");
        properties.put("bootstrap.servers", brokerList);
        properties.put ("group.id", groupId);
        properties.put ("auto.offset.reset", "earliest");


        //Create a consumer client instance
        KafkaConsumer<String, User> consumer = new KafkaConsumer<>(properties);
        //Subscribe to the topic
        consumer.subscribe(Collections.singletonList(topic));
        consumer.partitionsFor(topic); // optional: warm up partition metadata for the topic

        //Poll for messages in a loop
        while (isRunning.get()) {
            ConsumerRecords<String, User> records =
                    consumer.poll(Duration.ofMillis(5000));

            System.out.println(records.count());

            for (ConsumerRecord<String, User> record : records) {
                System.out.println("{topic:" + record.topic() + " ,partition:" + record.partition()
                        + " ,offset:" + record.offset() +
                        " ,key:" + record.key() + " ,value:" + record.value().toString() + "}");
            }

        }

    }
}
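Two details are worth noting here: auto.offset.reset=earliest only takes effect when the consumer group has no committed offset yet (on later runs the consumer resumes from the last committed position), and because enable.auto.commit is left at its default of true, offsets are committed automatically in the background.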

3.6 Testing

Start the producer class AvroProducer first, then run the consumer class AvroConsumer; the consumer's console prints the consumed records.

4 Consuming Avro-Encoded Kafka Messages from Spark

Kafka's downstream consumers can also be different big-data components, such as Spark Streaming or Flink; here Spark Streaming is used as the example.
[Figure: Spark Streaming as one of several downstream consumers of Avro-encoded Kafka messages]

4.1 Creating a Maven Project

In practice, Spark applications are mostly developed in Scala, so create a new Scala project named sparkstudy.
The content of the pom file is as follows:


<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.dhhy</groupId>
    <artifactId>sparkstudy</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <scala.version>2.11.8</scala.version>
        <spark.version>2.3.0</spark.version>
    </properties>

    <dependencies>

        <!-- Scala -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>

        <!-- Spark SQL -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <groupId>org.spark-project.hive</groupId>
            <artifactId>hive-jdbc</artifactId>
            <version>1.2.1.spark2</version>
        </dependency>

        <!-- Spark ML -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <!-- Spark Streaming -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <!-- Spark Streaming + Kafka integration -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>0.10.2.1</version>
        </dependency>

        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.38</version>
        </dependency>

        <!-- Avro serialization -->
        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro</artifactId>
            <version>1.9.2</version>
        </dependency>

    </dependencies>

    <build>
        <plugins>
            <!-- mixed scala/java compile -->
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <executions>
                    <execution>
                        <id>compile</id>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                        <phase>compile</phase>
                    </execution>
                    <execution>
                        <id>test-compile</id>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                        <phase>test-compile</phase>
                    </execution>
                    <execution>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <!-- build an uber jar with all dependencies -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.4</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>assemble-all</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifest>
                            <addClasspath>true</addClasspath>
                            <mainClass>com.dhhy.spark.streaming.kafkaConsumerApp</mainClass>
                        </manifest>
                    </archive>
                </configuration>
            </plugin>
        </plugins>
    </build>

    <repositories>
        <repository>
            <id>aliyunmaven</id>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
        </repository>
    </repositories>
</project>

Then copy the two classes User.java and AvroDeserializer.java into sparkstudy.

4.2 The Consumer Class in the Spark Application

package com.dhhy.avro

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.{Seconds, StreamingContext}

object AvroSparkConsumer {

  def main(args: Array[String]): Unit = {

    //Configuration
    val conf = new SparkConf().setMaster("local[2]").setAppName("AvroSparkConsumer")

    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "192.168.0.131:9092,192.168.0.132:9092,192.168.0.133:9092", //Kafka cluster addresses
      "key.deserializer" -> classOf[StringDeserializer], //key deserializer, String here
      "value.deserializer" -> classOf[AvroDeserializer], //value deserializer, Avro here
      "group.id" -> "AvroConsumeGroup2", //Kafka consumer group
      "auto.offset.reset" -> "earliest" //start from the earliest offset when no offset has been committed
    )


    //Kafka topics to subscribe to
    val topics = Array("avro_topic")
    val stream: InputDStream[ConsumerRecord[String, User]] = KafkaUtils.createDirectStream[String, User](
      ssc,
      PreferConsistent,
      Subscribe[String, User](topics, kafkaParams)
    )

    stream.foreachRDD(rdd =>
      rdd.foreachPartition(partition =>
        partition.foreach(record =>
          println("{" +
            "topic:" + record.topic() +
            " ,partition:" + record.partition() +
            " ,offset:" + record.offset() +
            " ,key:" + record.topic() +
            " ,value:" + record.value().toString() +
            "}")
        )
      )

    )


    ssc.start()
    ssc.awaitTermination()
  }

}
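If you need finer control over delivery semantics, the kafka-0-10 integration also exposes each batch's offset ranges via HasOffsetRanges, and the stream can commit offsets back to Kafka through CanCommitOffsets once a batch's output has been handled; see the Spark Streaming + Kafka integration guide [4].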

4.3 Project Directory Structure

[Figure: directory structure of the sparkstudy project]

4.4 Testing

Run AvroSparkConsumer.scala; the consumed records are printed on the console.

5 References

[1] Hadoop: The Definitive Guide (Hadoop权威指南), 4th ed. Tsinghua University Press, 2017.
[2] Apache Avro website: https://avro.apache.org/
[3] 深入理解Kafka:核心设计与实践原理 (Understanding Kafka: Core Design and Practice), 1st ed. Publishing House of Electronics Industry, 2019.
[4] Apache Spark website: http://spark.apache.org/
