Spark Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher)

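At the time of writing, the latest Spark release is 2.3.0, which updates the API that Spark Streaming uses to integrate with Kafka. The new API is still experimental and may change before it is finalized. This article shows how to use the 2.3.0 API.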

This version of the integration is marked as experimental, so the API is potentially subject to change.

pom.xml configuration

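Add the following dependencies:

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>2.3.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
            <version>2.3.0</version>
        </dependency>
    </dependencies>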

Code

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, TaskContext}

object SparkStreamingNewAPIExample {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("SparkStreamingNewAPIExample")
    val ssc = new StreamingContext(sparkConf, Seconds(10))

    val kafkaParams = scala.collection.Map[String, Object](
      "bootstrap.servers" -> "hostA:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "testGroup",
      "auto.offset.reset" -> "latest",
      "partition.assignment.strategy" -> "org.apache.kafka.clients.consumer.RangeAssignor",
      "enable.auto.commit" -> (true: java.lang.Boolean)
    )

    val topics = Array("topic1","topic2")
    
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )

    stream.foreachRDD { rdd =>
      // Offset ranges covered by this batch, one entry per Kafka topic-partition
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { records =>
        // RDD partitions map one-to-one to Kafka partitions, so the partition id indexes offsetRanges
        val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)

        println(s"The record from topic [${o.topic}] is in partition ${o.partition} which offset from ${o.fromOffset} to ${o.untilOffset}")
        println(s"The record content is ${records.toList.mkString}")
      }

      rdd.count()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Analysis

With the code above, Spark Streaming consumes topic1 and topic2 once every 10 seconds and prints information about each RDD to standard output.
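Note that KafkaUtils.createDirectStream differs considerably from the Spark 1.6.x version in both its parameters and its return value. The return value changes the most: the elements of the returned RDD are no longer key-value pairs but the richer ConsumerRecord[K, V] type.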
For example, log output like the following shows in detail which Kafka topic, partition, and offsets the data currently being processed by Spark came from.

The record is in partition 0 which offset from 23 to 25
The record content is ConsumerRecord(topic = topic1, partition = 0, offset = 23, CreateTime = 1487209064531, checksum = 2357653885, serialized key size = -1, serialized value size = 6, key = null, value = aaaaaa)ConsumerRecord(topic = topic1, partition = 0, offset = 24, CreateTime = 1487209065989, checksum = 2696444472, serialized key size = -1, serialized value size = 8, key = null, value = bbbbbbbb)
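Because each element is now a ConsumerRecord[K, V], code that expects the key-value pairs of the 1.6.x API has to extract them explicitly. The following is a minimal sketch of that step, assuming String keys and values as in the example above; the object name ConsumerRecordFields and the helpers toKeyValue and describe are hypothetical names introduced only for illustration.

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.{DStream, InputDStream}

// Hypothetical helper object, not part of the Spark or Kafka API.
object ConsumerRecordFields {
  // Turns the ConsumerRecord stream back into (key, value) pairs,
  // the element type that the 1.6.x direct stream returned.
  def toKeyValue(stream: InputDStream[ConsumerRecord[String, String]]): DStream[(String, String)] =
    stream.map(record => (record.key, record.value))

  // Formats the metadata carried by a single record:
  // topic, partition, and offset, followed by the value.
  def describe(record: ConsumerRecord[String, String]): String =
    s"${record.topic}-${record.partition}@${record.offset}: ${record.value}"
}
```

With the stream from the example, ConsumerRecordFields.toKeyValue(stream) would yield a DStream of pairs, and describe could replace the raw ConsumerRecord dump in the println call.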

Parameter notes

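In the code above, enable.auto.commit is set to true. This means offsets are committed automatically once the data has been consumed, so if the Spark Streaming program stops for some reason and is then restarted, it does not consume the same data again. This raises a problem: from the business point of view, the consumed data may not yet have gone through business processing, so it is not "consumed" in any meaningful sense. If the parameter is set to false, the business logic decides when consumption counts as complete. Offsets then have to be committed manually, which only requires adding stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges) before rdd.count().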
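The manual commit described above can be sketched as follows. This is only an illustration under stated assumptions: the broker address, group id, and deserializers are copied from the example, while the object name ManualCommitSketch and the helper processThenCommit are hypothetical. The function passed as process stands in for the business logic and must be serializable, because Spark ships it to the executors.

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

// Hypothetical helper object, not part of the Spark or Kafka API.
object ManualCommitSketch {
  // Same settings as the example above, except that auto commit is switched off.
  val kafkaParams: Map[String, Object] = Map[String, Object](
    "bootstrap.servers" -> "hostA:9092",
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "testGroup",
    "auto.offset.reset" -> "latest",
    "enable.auto.commit" -> (false: java.lang.Boolean)
  )

  // Runs `process` on every partition of each batch and commits the
  // batch's offsets only after that processing has returned.
  def processThenCommit(stream: InputDStream[ConsumerRecord[String, String]])(
      process: Iterator[ConsumerRecord[String, String]] => Unit): Unit =
    stream.foreachRDD { rdd =>
      // Must be read from the RDD produced directly by the Kafka stream
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition(process)
      // Asynchronous commit back to Kafka for the ranges of this batch
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }
}
```

A call might look like ManualCommitSketch.processThenCommit(stream) { records => records.foreach(r => println(r.value)) }. Because the commit happens after processing, a failure between the two steps can lead to the same batch being processed again after a restart, so the business logic should tolerate duplicates.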

References

  1. http://www.jianshu.com/p/05281717a451
  2. http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
  3. https://www.confluent.io/blog/tutorial-getting-started-with-the-new-apache-kafka-0-9-consumer-client/
