Integrating Spark Streaming with Kafka

Spark Streaming is connected to Kafka through KafkaUtils. The integration library comes in two versions:

                    spark-streaming-kafka-0-8    spark-streaming-kafka-0-10
Kafka version       0.8.2.1 or higher            0.10.0 or higher
Offset Commit API   ×                            ✓

The 0-8 integration has been deprecated and is not recommended.
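
The 0-10 integration ships as a separate artifact. A minimal sbt dependency might look like the sketch below; the Spark version shown is an assumption and should match your cluster:

// build.sbt -- "2.4.0" is an illustrative version; use the one matching your Spark installation
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.4.0"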

There are three consumption semantics when consuming from Kafka:
1. At most once: each record is consumed at most once (see the configuration sketch after this list)
2. At least once: each record is consumed at least once
3. Exactly once: each record is consumed exactly once
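
At-most-once behavior is usually obtained by letting the Kafka consumer auto-commit offsets, so an offset may be committed before the corresponding batch has actually been processed. A minimal configuration sketch, with illustrative values and the same brokers as the example below:

// At-most-once style consumer settings (values are illustrative).
// With auto-commit enabled, offsets can be committed before processing finishes,
// so a failure may skip records, but a record is never processed twice.
val atMostOnceParams = Map[String, Object](
  "bootstrap.servers" -> "c1:9092,c2:9092,c3:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "at_most_once_group",
  "enable.auto.commit" -> (true: java.lang.Boolean),
  "auto.commit.interval.ms" -> (5000: java.lang.Integer)
)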

For the at-least-once level, the recommended approach is to use the official commit API to maintain the Kafka offsets. The code is as follows:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{SparkConf, TaskContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object SSApp02 {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("hehe")
    val ssc = new StreamingContext(sparkConf, Seconds(3))
    ssc.sparkContext.setLogLevel("ERROR")
    // Kafka consumer configuration. Auto-commit is disabled so that offsets are
    // committed manually only after each batch has been processed.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "c1:9092,c2:9092,c3:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "use_a_separate_group_id_for_each_stream",
      "auto.offset.reset" -> "earliest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val topics = Array("testtopic")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )

    // Print the (key, value) pairs of each batch.
    stream.map(record => (record.key, record.value)).print()

    stream.foreachRDD { rdd =>
      // Grab the offset ranges of this batch before any transformation breaks the
      // one-to-one mapping between RDD partitions and Kafka partitions.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { iter =>
        val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
        println(s"partition=${o.partition}, fromOffset=${o.fromOffset}, untilOffset=${o.untilOffset}")
      }
      // Commit the offsets back to Kafka only after the batch has been processed.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
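
Because commitAsync runs after the batch has been processed, offsets only advance once the work for that batch is done. If the application fails between processing and the commit, the batch is reprocessed on restart, which is exactly what at-least-once means, so the downstream output should be idempotent or tolerant of duplicates.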

For an exactly-once implementation, see my other blog post:
Offset management for Kafka + Spark Streaming that guarantees no data loss and no duplicate consumption
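
As a rough illustration of the idea (not necessarily the approach from that post): store the offsets together with the results in a single transaction against an external store, and restart the stream from those stored offsets. In the sketch below the in-memory fromOffsets map stands in for such a store, and ssc and kafkaParams are the ones defined above; everything about the store itself is an assumption.

// Sketch of the exactly-once idea: offsets are persisted atomically with the results.
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign

// Offsets previously saved together with the processing results (illustrative value;
// in practice these would be read back from the transactional store on startup).
val fromOffsets: Map[TopicPartition, Long] = Map(new TopicPartition("testtopic", 0) -> 0L)

val exactlyOnceStream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets)
)

exactlyOnceStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val results = rdd.map(record => (record.key, record.value)).collect()
  // In a real job: open a transaction on the external store, write `results` and
  // `offsetRanges` atomically, then commit. If the transaction fails, neither the
  // data nor the offsets advance, so replaying the batch cannot create duplicates.
}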

Reposted from: https://my.oschina.net/dreamness/blog/3093153
