Writing Spark Streaming Processing Results to Kafka

Spark does not provide a straightforward, efficient way to send messages to Kafka. A naive first attempt typically looks like this:

import java.util.HashMap
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

input.foreachRDD(rdd =>
  // The KafkaProducer cannot be created here: it is not serializable,
  // so it cannot be shipped from the driver to the executors.
  rdd.foreachPartition(partition =>
    partition.foreach {
      case x: String => {
        val props = new HashMap[String, Object]()
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.StringSerializer")
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.StringSerializer")
        println(x)
        // A new producer is created for every single record, which is the core problem.
        val producer = new KafkaProducer[String, String](props)
        val message = new ProducerRecord[String, String]("output", null, x)
        producer.send(message)
      }
    }
  )
)

The drawback of this approach is obvious: for every record of every partition we create a new KafkaProducer and then use it to write the output. Because KafkaProducer is not serializable, we cannot simply create the instance on the driver, outside foreachPartition, and ship it to the executors.
Opening a new connection for every single record is both inflexible and inefficient.
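
A commonly suggested intermediate step (not part of this article's final solution, shown only to make the trade-off explicit) is to create one producer per partition inside foreachPartition. It avoids the serialization problem, but still opens a new connection for every partition of every micro-batch. The sketch below reuses the same names and imports as the example above:

input.foreachRDD(rdd =>
  rdd.foreachPartition { partition =>
    // One producer per partition per batch: better than one per record,
    // but still a new connection for every partition of every micro-batch.
    val props = new HashMap[String, Object]()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    partition.foreach(record =>
      producer.send(new ProducerRecord[String, String]("output", null, record))
    )
    producer.close()
  }
)

The broadcast wrapper described next goes one step further: it creates a single producer per executor JVM and reuses it across all batches and partitions.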

We therefore take the following approach:

  1. Create a KafkaProducer wrapper

    import java.util.concurrent.Future
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord, RecordMetadata}
    
    class KafkaSink[K, V](createProducer: () => KafkaProducer[K, V]) extends Serializable {
      /* This is the key idea that allows us to work around running into
         NotSerializableExceptions. */
      lazy val producer = createProducer()
    
      def send(topic: String, key: K, value: V): Future[RecordMetadata] =
        producer.send(new ProducerRecord[K, V](topic, key, value))
    
      def send(topic: String, value: V): Future[RecordMetadata] =
        producer.send(new ProducerRecord[K, V](topic, value))
    }
    
    object KafkaSink {
    
      import scala.collection.JavaConversions._
    
      def apply[K, V](config: Map[String, Object]): KafkaSink[K, V] = {
        val createProducerFunc = () => {
          val producer = new KafkaProducer[K, V](config)
          sys.addShutdownHook {
            // Ensure that, on executor JVM shutdown, the Kafka producer sends
            // any buffered messages to Kafka before shutting down.
            producer.close()
          }
          producer
        }
        new KafkaSink(createProducerFunc)
      }
    
      def apply[K, V](config: java.util.Properties): KafkaSink[K, V] = apply(config.toMap)
    }
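
    Because producer is a lazy val, the KafkaProducer is not created when KafkaSink is instantiated on the driver; it is created on the first send() call, i.e. on the executor that deserialized the wrapper. A minimal standalone sketch of using the wrapper directly (the broker address localhost:9092 and the topic name are illustrative assumptions):

    import org.apache.kafka.clients.producer.ProducerConfig
    import org.apache.kafka.common.serialization.StringSerializer

    val sink = KafkaSink[String, String](Map[String, Object](
      ProducerConfig.BOOTSTRAP_SERVERS_CONFIG -> "localhost:9092",   // assumed local broker, for illustration only
      ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG -> classOf[StringSerializer],
      ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG -> classOf[StringSerializer]
    ))
    // The underlying KafkaProducer is created here, lazily, on the first send.
    sink.send("my-output-topic", "hello").get()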
    
  2. Use a broadcast variable to give each executor its own wrapped KafkaProducer instance

    import org.apache.kafka.clients.producer.ProducerConfig
    import org.apache.kafka.common.serialization.StringSerializer
    import org.apache.spark.SparkConf
    import org.apache.spark.broadcast.Broadcast
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    
    val ssc: StreamingContext = {
      val sparkConf = new SparkConf().setAppName("spark-streaming-kafka-example").setMaster("local[2]")
      new StreamingContext(sparkConf, Seconds(5))
    }
    
    ssc.checkpoint("checkpoint-directory")
    
    // Initialize the KafkaSink and broadcast it; each executor deserializes the wrapper once
    // and lazily creates its own KafkaProducer on first use.
    val kafkaProducer: Broadcast[KafkaSink[String, String]] = {
      val kafkaProducerConfig: Map[String, Object] = getKafkaProducerParams()
      if (logger.isInfoEnabled) {
        logger.info("kafka producer init done!")
      }
      ssc.sparkContext.broadcast(KafkaSink[String, String](kafkaProducerConfig))
    }
    
    // properties and logger are assumed to be defined elsewhere (see the sketch after this step).
    def getKafkaProducerParams(): Map[String, Object] = {
      Map[String, Object](
        ProducerConfig.BOOTSTRAP_SERVERS_CONFIG -> properties.getProperty("kafka1.bootstrap.servers"),
        ProducerConfig.ACKS_CONFIG -> properties.getProperty("kafka1.acks"),
        ProducerConfig.RETRIES_CONFIG -> properties.getProperty("kafka1.retries"),
        ProducerConfig.BATCH_SIZE_CONFIG -> properties.getProperty("kafka1.batch.size"),
        ProducerConfig.LINGER_MS_CONFIG -> properties.getProperty("kafka1.linger.ms"),
        ProducerConfig.BUFFER_MEMORY_CONFIG -> properties.getProperty("kafka1.buffer.memory"),
        ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG -> classOf[StringSerializer],
        ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG -> classOf[StringSerializer]
      )
    }
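
    The code above references a properties object and a logger that the original never defines. A minimal sketch of one way to provide them, assuming the kafka1.* settings live in a kafka.properties file on the classpath and that slf4j is used for logging (both are assumptions):

    import java.util.Properties
    import org.slf4j.{Logger, LoggerFactory}

    val logger: Logger = LoggerFactory.getLogger("spark-streaming-kafka-example")

    val properties: Properties = {
      val p = new Properties()
      // kafka.properties is a hypothetical classpath resource holding the kafka1.* settings used above
      p.load(getClass.getResourceAsStream("/kafka.properties"))
      p
    }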
    
  3. Write to Kafka from Spark Streaming, reusing the same wrapped KafkaProducer instance (one per executor)

    import java.util.concurrent.Future
    import com.google.gson.Gson
    import org.apache.kafka.clients.producer.RecordMetadata
    import org.apache.spark.streaming.dstream.DStream
    
    val stream: DStream[String] = ???
    stream.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        // Send every record through the executor's shared producer; the sends are
        // asynchronous, so collect the returned Futures.
        val metadata: Stream[Future[RecordMetadata]] = partitionOfRecords.map(record => {
          kafkaProducer.value.send("my-output-topic", new Gson().toJson(record))
        }).toStream
        // Block until every record of this partition has been acknowledged by Kafka.
        metadata.foreach(data => {
          data.get()
        })
      })
    })
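
    The input stream is left as ??? above. For completeness, here is a sketch of how it could be wired up with the spark-streaming-kafka-0-10 integration and how the job is started; the input topic my-input-topic and the consumer group are illustrative assumptions, not part of the original:

    import org.apache.kafka.clients.consumer.ConsumerConfig
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val kafkaConsumerParams = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "cdh01:9092",            // assumed broker
      ConsumerConfig.GROUP_ID_CONFIG -> "spark-streaming-kafka-example",  // assumed group id
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer]
    )

    val stream: DStream[String] = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("my-input-topic"), kafkaConsumerParams)
    ).map(_.value())

    // ... the foreachRDD block above goes here ...

    ssc.start()
    ssc.awaitTermination()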
    
  4. Run a test

    Create the output topic

    kafka-topics --create --zookeeper cdh01:2181,cdh02:2181,cdh03:2181 --topic my-output-topic --partitions 3 --replication-factor 3
    

    Consume the topic data

    kafka-console-consumer --bootstrap-server cdh01:9092 --topic my-output-topic --from-beginning
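
    To drive the job during the test, messages can be pushed to the assumed input topic from the sketch in step 3 (my-input-topic is an illustrative name, and the topic must be created first):

    kafka-topics --create --zookeeper cdh01:2181,cdh02:2181,cdh03:2181 --topic my-input-topic --partitions 3 --replication-factor 3
    kafka-console-producer --broker-list cdh01:9092 --topic my-input-topic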
    
