pom.xml
spark streaming:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.1.0</version>
</dependency>
kafka:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.2.0</version>
</dependency>
The Kafka project introduced different consumer APIs across its versions, so Spark ships two integration packages, 0-8 and 0-10. If you integrate Spark Streaming with the 0-8 Kafka integration, there are two approaches: the Receiver-based Approach and the Direct Approach (No Receivers). If you integrate with the 0-10 version, the only approach is similar to the 0-8 Direct Approach (No Receivers).
As you can see, the 0-8 integration is already deprecated, so the notes below focus on the 0-10 version. Before that, we briefly introduce the two 0-8 integration approaches.
Version 0-8
The first approach, the Receiver-based Approach, starts a Receiver to receive the data. The Receiver is implemented with Kafka's high-level consumer API. As with all receivers, the data received from Kafka is stored in Spark's executors, and the jobs launched by Spark Streaming then process that data.
However, under the default configuration this approach can lose data on failure. To ensure zero data loss, you additionally have to enable Spark Streaming's Write-Ahead Logs. This synchronously saves all the data received from Kafka into write-ahead logs on a distributed file system, so that all the data can be recovered after a failure.
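For reference, here is a minimal sketch of turning on the write-ahead log for receiver-based input; spark.streaming.receiver.writeAheadLog.enable is the standard switch, and the HDFS checkpoint path is only an assumed example:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("ReceiverWithWAL")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")   // enable the WAL for all receivers
val ssc = new StreamingContext(conf, Seconds(5))
// The WAL is written under the checkpoint directory, which should be on a fault-tolerant file system (example path)
ssc.checkpoint("hdfs://master:9000/spark/checkpoint")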
Key points:
Topic partitions in Kafka are not related to the partitions of the RDDs generated by Spark Streaming. Increasing the number of topic-specific threads in KafkaUtils.createStream() only increases the number of threads the single receiver uses to consume the topics; it does not increase the parallelism with which Spark processes the data.
Multiple Kafka input DStreams can be created with different groups and topics so that data is received in parallel through multiple receivers (see the sketch below).
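A minimal sketch of the receiver-based approach with several unioned receiver streams, assuming the spark-streaming-kafka-0-8 dependency, a ZooKeeper quorum at master:2181, and a topic named test (these values are assumptions, not taken from the notes above):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils   // from spark-streaming-kafka-0-8

val conf = new SparkConf().setMaster("local[4]").setAppName("ReceiverBasedWordCount")
val ssc = new StreamingContext(conf, Seconds(5))

// topic -> number of consumer threads inside one receiver; this does not increase RDD parallelism
val topicMap = Map("test" -> 2)

// Create several receiver streams and union them to receive data in parallel
val numReceivers = 3
val streams = (1 to numReceivers).map(_ => KafkaUtils.createStream(ssc, "master:2181", "kafkatest", topicMap))
val lines = ssc.union(streams).map(_._2)   // createStream yields (key, value) pairs; keep the values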
The new receiver-less direct approach, introduced in Spark 1.3, ensures stronger end-to-end guarantees. Instead of using receivers to receive data, this approach periodically queries Kafka for the latest offsets of each topic+partition and accordingly defines the offset ranges to process in each batch. When the jobs that process the data are launched, Kafka's simple consumer API is used to read the defined offset ranges from Kafka (similar to reading files from a file system).
Compared with the Receiver-based approach, this approach has the following advantages:
Simplified parallelism: there is no need to create multiple input Kafka streams and union them. With the direct stream, Spark Streaming creates as many RDD partitions as there are Kafka partitions to consume, and these partitions all read data from Kafka in parallel. So there is a one-to-one mapping between Kafka partitions and RDD partitions, which is easier to understand and tune.
Efficiency: achieving zero data loss with the Receiver-based approach requires the data to be stored in a Write-Ahead Log, which further replicates the data. This is actually inefficient, because the data is effectively replicated twice: once by Kafka, and a second time into the Write-Ahead Log. The second approach eliminates this problem: since there is no receiver, no Write-Ahead Log is needed. As long as you have sufficient Kafka retention, messages can be recovered from Kafka itself.
Exactly-once semantics: the Receiver-based approach uses Kafka's high-level API to store the consumed offsets in ZooKeeper, which is traditionally how data is consumed from Kafka. While this approach (combined with Write-Ahead Logs) can ensure zero data loss (i.e. at-least-once semantics), there is a small chance that some records get consumed twice under certain failures. This happens because of inconsistencies between the data reliably received by Spark Streaming and the offsets tracked by ZooKeeper. Therefore, the second approach uses the simple Kafka API that does not rely on ZooKeeper: Spark Streaming tracks the offsets within its checkpoints. This eliminates the inconsistency between Spark Streaming and Kafka/ZooKeeper, so each record is received by Spark Streaming effectively exactly once despite failures. To achieve exactly-once semantics for the output of your results, the output operation that saves the data to an external data store must be either idempotent, or an atomic transaction that saves both the results and the offsets.
Note that one disadvantage of this approach is that it does not update the offsets in ZooKeeper, so ZooKeeper-based Kafka monitoring tools will not show progress. However, you can access the offsets processed by this approach in each batch and update ZooKeeper yourself, as sketched below.
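A hedged sketch of reading the per-batch offset ranges from a direct stream so that you can update ZooKeeper (or any other store) yourself; directStream stands for a DStream created with createDirectStream, the 0-8 classes are shown here, and the 0-10 package exposes the same HasOffsetRanges trait:
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

directStream.foreachRDD { rdd =>
  // The offset ranges describe exactly which (topic, partition, fromOffset, untilOffset) this batch covers
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
    // write the offsets to ZooKeeper here if your monitoring tools need them
  }
}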
Version 0-10
The Spark Streaming integration for Kafka 0.10 is similar in design to the 0.8 Direct Stream approach. It provides simple parallelism, a 1:1 correspondence between Kafka partitions and Spark partitions, and access to offsets and metadata.
Below is a simple code template:
1. Add the dependency
groupId = org.apache.spark
artifactId = spark-streaming-kafka-0-10_2.11
version = 2.4.0
2. Code
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object KafkaDirectWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaDirectWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")
    ssc.checkpoint(".")

    // Kafka consumer configuration; auto commit is disabled so Spark controls the offsets
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "master:9092",
      "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
      "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
      "group.id" -> "kafkatest",
      "enable.auto.commit" -> "false"
    )

    val topics = Set("test")
    val consumerStrategies = ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
    val kafkaDStream = KafkaUtils.createDirectStream[String, String](ssc, LocationStrategies.PreferConsistent, consumerStrategies)

    // Word count over the message values of each batch
    val res = kafkaDStream
      .map(record => record.value())
      .flatMap(_.split(","))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    res.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
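Because enable.auto.commit is set to false in the template, a common follow-up is to commit the processed offsets back to Kafka yourself after each batch's output; a sketch using the 0-10 integration's CanCommitOffsets (the commit is asynchronous, and the placement of the output step is only a suggestion):
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

kafkaDStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... write the batch results to the external store first ...
  // then commit the offsets of this batch back to Kafka
  kafkaDStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}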
Regarding the batch duration, note the following:
If your Spark batch duration is larger than the default Kafka heartbeat session timeout (30 seconds), increase heartbeat.interval.ms and session.timeout.ms appropriately. For batches larger than 5 minutes, this will require changing group.max.session.timeout.ms on the broker.
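As a rough illustration of those settings (the values are examples only, not recommendations), the extra consumer properties are simply added to kafkaParams:
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "master:9092",
  "group.id" -> "kafkatest",
  "heartbeat.interval.ms" -> "20000",   // keep this well below session.timeout.ms (typically about 1/3)
  "session.timeout.ms" -> "60000"       // must cover the batch duration
  // for batches longer than 5 minutes, group.max.session.timeout.ms also has to be raised on the broker
)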
ConsumerRecord:
A key/value pair to be received from Kafka. This also consists of a topic name and a partition number from which the record is being received, an offset that points to the record in a Kafka partition, and a timestamp as marked by the corresponding ProducerRecord.
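Because each element of the 0-10 direct stream is a ConsumerRecord[String, String], all of that metadata is available directly on the record; for example, with the kafkaDStream from the template above:
// Pull the metadata and the payload out of every record
val withMeta = kafkaDStream.map(record =>
  (record.topic, record.partition, record.offset, record.timestamp, record.value))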
Two new classes appear in the template code above: LocationStrategies and ConsumerStrategies. They are explained below.
LocationStrategies:
The new Kafka consumer API will pre-fetch messages into buffers. Therefore it is important for performance reasons that the Spark integration keep cached consumers on executors (rather than recreating them for each batch), and prefer to schedule partitions on the host locations that have the appropriate consumers.
In most cases, you should use LocationStrategies.PreferConsistent as shown above. This will distribute partitions evenly across available executors. If your executors are on the same hosts as your Kafka brokers, use PreferBrokers, which will prefer to schedule partitions on the Kafka leader for that partition. Finally, if you have a significant skew in load among partitions, use PreferFixed. This allows you to specify an explicit mapping of partitions to hosts (any unspecified partitions will use a consistent location).
If you would like to disable the caching for Kafka consumers, you can set spark.streaming.kafka.consumer.cache.enabled to false. Disabling the cache may be needed to workaround the problem described in SPARK-19185. This property may be removed in later versions of Spark, once SPARK-19185 is resolved.
The cache is keyed by topicpartition and group.id, so use a separate group.id for each call to createDirectStream.
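A small sketch of the non-default location strategies (the host name and partition number are assumptions):
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.LocationStrategies

// PreferBrokers: only useful when executors run on the same hosts as the Kafka brokers
val byBrokers = LocationStrategies.PreferBrokers

// PreferFixed: pin heavily loaded partitions to specific executor hosts;
// unspecified partitions fall back to a consistent location
val hostMap = Map(new TopicPartition("test", 0) -> "worker1")
val byFixed = LocationStrategies.PreferFixed(hostMap)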
ConsumerStrategies:
The new Kafka consumer API has a number of different ways to specify topics, some of which require considerable post-object-instantiation setup. ConsumerStrategies provides an abstraction that allows Spark to obtain properly configured consumers even after restart from checkpoint.
ConsumerStrategies.Subscribe, as shown above, allows you to subscribe to a fixed collection of topics. SubscribePattern allows you to use a regex to specify topics of interest. Note that unlike the 0.8 integration, using Subscribe or SubscribePattern should respond to adding partitions during a running stream. Finally, Assign allows you to specify a fixed collection of partitions. All three strategies have overloaded constructors that allow you to specify the starting offset for a particular partition.
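To make the alternatives concrete, here is a sketch of SubscribePattern and Assign, reusing the kafkaParams from the template (the topic names, partition numbers, and offsets are assumptions):
import java.util.regex.Pattern
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.ConsumerStrategies

// SubscribePattern: consume every topic whose name matches a regex
val byPattern = ConsumerStrategies.SubscribePattern[String, String](Pattern.compile("test.*"), kafkaParams)

// Assign: consume a fixed set of partitions, here with explicit starting offsets per partition
val partitions = List(new TopicPartition("test", 0), new TopicPartition("test", 1))
val fromOffsets = Map(new TopicPartition("test", 0) -> 100L, new TopicPartition("test", 1) -> 200L)
val byAssign = ConsumerStrategies.Assign[String, String](partitions, kafkaParams, fromOffsets)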
If you have specific consumer setup needs that are not met by the options above, ConsumerStrategy is a public class that you can extend.
When integrating Kafka with Spark Streaming, if we use the dependency (
), the following exception is thrown: Caused by: java.lang.ClassNotFoundException: kafka.api.TopicMetadataRequest.
If we use the dependency (
), a different exception is thrown: AbstractMethodError: kafka.consumer.FetchRequestAndResponseMetrics.metricName Lcom/yammer/metrics/core/MetricName;.
With the dependency given on the official site (
), however, no exception occurs.