1. Dependency preparation
Add the Spark Streaming Kafka integration dependency to pom.xml, as follows:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.4.3</version>
</dependency>
Note: the Spark version should match the version used by your cluster.
2. Spark Streaming program
A word count over the data in Kafka, as an example:
package org.apache.spark.examples.streaming

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * @Description: stateful word count over a Kafka topic
 * @Date: 2019/11/15
 */
object KafkaWordCount {

  // Update function for updateStateByKey: for each key, add this batch's counts
  // (y.sum) to the previously accumulated state (z).
  val func = (iter: Iterator[(String, Seq[Int], Option[Int])]) => {
    iter.flatMap { case (x, y, z) => Some(y.sum + z.getOrElse(0)).map(count => (x, count)) }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Kafka Streaming")
      .setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // updateStateByKey requires a checkpoint directory
    ssc.checkpoint("E:\\software\\checkpoint")

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "kafka_spark_streaming",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val topics = Array("testKafka")

    // Direct stream: one RDD partition per Kafka partition
    val kafkaStream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))

    // Split each record value into words, count per word, and keep a running total
    val result = kafkaStream.flatMap(_.value().split(" "))
      .map((_, 1))
      .updateStateByKey(func, new HashPartitioner(ssc.sparkContext.defaultParallelism), true)

    result.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
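To try the example, a few test lines need to land in the testKafka topic. A minimal producer sketch is shown below (the kafka-console-producer shell tool works just as well); the sample lines are arbitrary:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object TestProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // Arbitrary sample lines for the word count
    Seq("hello spark", "hello kafka", "hello streaming").foreach { line =>
      producer.send(new ProducerRecord[String, String]("testKafka", line))
    }
    producer.close()
  }
}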
3. Key points
3.1 LocationStrategies
The location strategy tells Spark how to schedule Kafka partitions onto executors. spark-streaming-kafka-0-10 ships three strategies: PreferConsistent (spread partitions evenly across the available executors; the usual choice, and the one used above), PreferBrokers (only when the executors run on the same hosts as the Kafka brokers), and PreferFixed (pin specific partitions to specific hosts, e.g. when load is skewed); see the sketch below.
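The sketch below shows how each strategy can be constructed; the topic and host names passed to PreferFixed are hypothetical.

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.LocationStrategies

// Spread partitions evenly across executors (used in the word count example)
val consistent = LocationStrategies.PreferConsistent

// Only if executors are co-located with the Kafka brokers
val brokers = LocationStrategies.PreferBrokers

// Pin partitions to specific hosts; topic and host names here are made up
val fixed = LocationStrategies.PreferFixed(Map(
  new TopicPartition("testKafka", 0) -> "host1",
  new TopicPartition("testKafka", 1) -> "host2"
))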
3.2 ConsumerStrategies
The consumer strategy decides which topics and partitions the stream reads. The built-in strategies are Subscribe (a fixed list of topics, as used above), SubscribePattern (all topics matching a regex), and Assign (a fixed list of topic partitions); see the sketch below. If none of these meets your needs, you can extend the public class ConsumerStrategy to implement a custom consumer strategy.
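A minimal sketch of the three built-in strategies, reusing the kafkaParams map from the word count example; the topic names, regex, and partition numbers are only illustrative.

import java.util.regex.Pattern
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.ConsumerStrategies.{Assign, Subscribe, SubscribePattern}

// A fixed list of topics (used in the word count example)
val subscribe = Subscribe[String, String](Array("testKafka"), kafkaParams)

// All topics matching a regex; newly created matching topics are picked up too
val byPattern = SubscribePattern[String, String](Pattern.compile("test.*"), kafkaParams)

// A fixed set of topic partitions
val assign = Assign[String, String](
  List(new TopicPartition("testKafka", 0), new TopicPartition("testKafka", 1)),
  kafkaParams
)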
3.3 Offset management when integrating with Kafka
When integrating Spark Streaming with Kafka, enable.auto.commit is usually set to false, i.e. automatic offset commits are disabled, so that offsets are only committed after a batch's output has actually been produced.
Offsets can then be managed in the following ways: checkpointing, committing offsets back to Kafka with commitAsync, or storing offsets in your own data store.
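Checkpointing: when checkpointing is enabled, offsets are stored in the checkpoint together with the rest of the streaming state, and the job resumes from them on restart. A minimal sketch, assuming a checkpoint directory and a createContext() function (both hypothetical) that builds the complete streaming job:

import org.apache.spark.streaming.StreamingContext

// Recover the context (including stored offsets) from the checkpoint if one exists,
// otherwise build a fresh one with createContext()
val checkpointDir = "/path/to/checkpoint"  // hypothetical path
val context = StreamingContext.getOrCreate(checkpointDir, () => createContext(checkpointDir))
context.start()
context.awaitTermination()

Note that a checkpoint generally cannot be recovered after the application code changes, which is why the two approaches below are often preferred.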
Committing offsets back to Kafka:

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // some time later, after outputs have completed;
  // commitAsync is best called once the computation for the batch is done
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
Storing offsets in your own data store, and starting the stream from the saved offsets with Assign:

// The details depend on your data store, but the general idea looks like this:
// begin from the offsets committed to the database
val fromOffsets = selectOffsetsFromYourDatabase.map { resultSet =>
  new TopicPartition(resultSet.string("topic"), resultSet.int("partition")) -> resultSet.long("offset")
}.toMap

val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets)
)
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val results = yourCalculation(rdd)
  // begin your transaction
  // update results
  // update offsets where the end of existing offsets matches the beginning of this batch of offsets
  // assert that offsets were updated correctly
  // end your transaction
}
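A minimal sketch of what that transaction might look like against a JDBC store; the jdbcUrl, the kafka_offsets table, and the saveResults helper are hypothetical, not part of the Spark or Kafka APIs:

import java.sql.DriverManager
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val results = yourCalculation(rdd)               // placeholder from the snippet above

  val conn = DriverManager.getConnection(jdbcUrl)  // jdbcUrl is assumed to be defined
  try {
    conn.setAutoCommit(false)                      // begin the transaction

    saveResults(conn, results)                     // hypothetical helper that writes the results

    // Store the end of each offset range so the next run can resume from it
    offsetRanges.foreach { range: OffsetRange =>
      val ps = conn.prepareStatement(
        "UPDATE kafka_offsets SET untilOffset = ? WHERE topic = ? AND partition = ?")
      ps.setLong(1, range.untilOffset)
      ps.setString(2, range.topic)
      ps.setInt(3, range.partition)
      ps.executeUpdate()
      ps.close()
    }

    conn.commit()                                  // results and offsets succeed or fail together
  } catch {
    case e: Exception =>
      conn.rollback()
      throw e
  } finally {
    conn.close()
  }
}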