Spark Streaming 2.4.3 Integration with Kafka 0.10

1. Dependency Setup

Add the Spark Streaming + Kafka integration dependency to pom.xml, as follows:


<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
  <version>2.4.3</version>
</dependency>

Note: the Spark version here must match the version running on your cluster (and the _2.11 suffix must match your Scala version).

2. Spark Streaming Program

Taking a word count over data in Kafka as the example:

package org.apache.spark.examples.streaming

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * @Description: Word count over a Kafka topic, keeping running totals with updateStateByKey.
  * @Date: 2019/11/15
  */
object KafkaWordCount {

  // State update function: for each word (x), add this batch's counts (y)
  // to the previously accumulated total (z).
  val func = (iter: Iterator[(String, Seq[Int], Option[Int])]) => {
    iter.flatMap { case (x, y, z) => Some(y.sum + z.getOrElse(0)).map(count => (x, count)) }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Kafka Streaming")
      .setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // updateStateByKey requires checkpointing; adjust this path for your environment
    ssc.checkpoint("E:\\software\\checkpoint")

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "kafka_spark_streaming",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val topics = Array("testKafka")
    val kafkaStream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))
    val result = kafkaStream.flatMap(_.value().split(" "))
        .map((_, 1))
        .updateStateByKey(func, new HashPartitioner(ssc.sparkContext.defaultParallelism), true)

    result.print()

    ssc.start()
    ssc.awaitTermination()
  }
}


3. Key Points

3.1 LocationStrategies

  • LocationStrategies.PreferConsistent: distributes partitions evenly across the available executors;
  • LocationStrategies.PreferBrokers: if your executors run on the same hosts as the Kafka brokers, use PreferBrokers, which prefers to schedule each partition on the node hosting that partition's Kafka leader;
  • LocationStrategies.PreferFixed: if the data load is heavily skewed across partitions, use PreferFixed. It lets you specify an explicit mapping from partitions to hosts (any unspecified partitions fall back to a consistent placement); a sketch follows this list.
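
A minimal sketch of PreferFixed, assuming a two-partition testKafka topic and placeholder executor host names (the mapping and host names are illustrative, not part of the original example):

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferFixed

// Pin each partition of testKafka to a specific executor host (host names are placeholders).
// Partitions not listed here are placed as with PreferConsistent.
val hostMap = Map(
  new TopicPartition("testKafka", 0) -> "executor-host-1",
  new TopicPartition("testKafka", 1) -> "executor-host-2"
)
val locationStrategy = PreferFixed(hostMap)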

3.2 ConsumerStrategies

  • ConsumerStrategies.Subscribe: subscribes to a fixed collection of topics; multiple topics can be given, but the data format across those topics should be consistent;
  • ConsumerStrategies.SubscribePattern: subscribes to topics matched by a regular expression;
  • ConsumerStrategies.Assign: consumes a fixed collection of partitions.

If none of these strategies meets your needs, you can extend the public ConsumerStrategy class to implement a custom consumer strategy. A sketch of SubscribePattern and Assign is shown below.
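
A minimal sketch of the other two built-in strategies, reusing the kafkaParams map from the program above (the regular expression and the partition choice are illustrative assumptions):

import java.util.regex.Pattern
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.ConsumerStrategies.{Assign, SubscribePattern}

// Subscribe to every topic whose name starts with "test".
val patternStrategy = SubscribePattern[String, String](Pattern.compile("test.*"), kafkaParams)

// Or consume only partition 0 of testKafka.
val assignStrategy = Assign[String, String](
  List(new TopicPartition("testKafka", 0)),
  kafkaParams
)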

3.3 Offset Management When Integrating with Kafka

When integrating Spark Streaming with Kafka, you generally set enable.auto.commit to false, i.e. disable automatic offset commits.

Offsets can be managed in the following ways:

  • Checkpoint-based offsets: if Spark checkpointing is enabled, offsets are stored in the checkpoint. The drawback is that when the application code changes, the offsets saved in the old checkpoint may be lost.
  • Kafka-based offsets: use Kafka's commitAsync API to commit offsets manually. Compared with checkpoints, the advantage is that Kafka keeps the offsets durably (in a separate internal topic) no matter how you change or upgrade your application code. Kafka commits are not transactional, however, so your outputs must still be idempotent; see the snippet below.


// requires org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // some time later, after the outputs of this batch have completed;
  // commitAsync should only be called once the results are safely written
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
  • Your own offset store: store the offsets and the computation results transactionally, for example in a database, so that a failure can be rolled back.
// The details depend on your data store, but the general idea looks like this

// begin from the offsets committed to the database
val fromOffsets = selectOffsetsFromYourDatabase.map { resultSet =>
  new TopicPartition(resultSet.string("topic"), resultSet.int("partition")) -> resultSet.long("offset")
}.toMap

val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets)
)

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  val results = yourCalculation(rdd)

  // begin your transaction

  // update results
  // update offsets where the end of existing offsets matches the beginning of this batch of offsets
  // assert that offsets were updated correctly

  // end your transaction
}
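
A minimal sketch of what the transaction body could look like, assuming a plain JDBC connection and a hypothetical offsets(topic, partition, offset) table; this is one possible implementation of the idea, not part of the official API:

import java.sql.Connection
import org.apache.spark.streaming.kafka010.OffsetRange

// Write the batch results and the new offsets in a single database transaction,
// so that both succeed or both are rolled back together.
def commitBatch(conn: Connection,
                offsetRanges: Array[OffsetRange],
                saveResults: Connection => Unit): Unit = {
  conn.setAutoCommit(false)
  try {
    saveResults(conn) // persist the computed results
    val stmt = conn.prepareStatement(
      "UPDATE offsets SET offset = ? WHERE topic = ? AND partition = ?")
    offsetRanges.foreach { o =>
      stmt.setLong(1, o.untilOffset)
      stmt.setString(2, o.topic)
      stmt.setInt(3, o.partition)
      stmt.executeUpdate()
    }
    stmt.close()
    conn.commit()
  } catch {
    case e: Exception =>
      conn.rollback()
      throw e
  }
}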

