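1. Problem description
Every time the demo program that connects Spark Streaming to Kafka is restarted, it starts consuming from the first message in the Kafka queue again.
Changing enable.auto.commit and the related parameters has no effect.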
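2. Root cause analysis
The demo program creates its Kafka input stream with "KafkaUtils.createDirectStream". Internally this API uses the low-level Kafka client API (the simple consumer), which does not support automatic offset commits (commits to zookeeper).
Official documentation for "KafkaUtils.createDirectStream":
http://spark.apache.org/docs/2.2.0/streaming-kafka-0-8-integration.html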
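3. Solutions
Option 1) Write your own code against the zookeeper API to commit the offsets to zookeeper. When the service starts, read the offsets from zookeeper and pass them to "KafkaUtils.createDirectStream" as an input parameter.
Pros: integrates with zookeeper-based monitoring systems, so consumption can be monitored.
Cons: frequent offset reads and writes may degrade the performance of the zookeeper cluster, which in turn can affect the stability of the Kafka cluster.
Option 2) Write your own code to maintain the offsets and store them in MongoDB or redis.
Pros: no impact on the zookeeper cluster; consumption monitoring can be built independently on top of MongoDB or redis.
Cons: cannot be integrated with zookeeper-based monitoring systems.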
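Both options come down to the same contract: load the saved offsets at startup, and save the new ones after each batch. A minimal sketch of that contract follows. The names `OffsetStore` and `InMemoryOffsetStore` are hypothetical and exist only for illustration; the in-memory map merely stands in for zookeeper, MongoDB, or redis.

```scala
import scala.collection.mutable

object OffsetStoreSketch {
  // Hypothetical contract: offsets are keyed by consumer group and topic,
  // and stored per partition (partition id -> next offset to read).
  trait OffsetStore {
    def load(groupId: String, topic: String): Map[Int, Long]
    def save(groupId: String, topic: String, offsets: Map[Int, Long]): Unit
  }

  // Stand-in backend for illustration only; a real one would talk to
  // zookeeper, MongoDB, or redis.
  class InMemoryOffsetStore extends OffsetStore {
    private val data = mutable.Map.empty[(String, String), Map[Int, Long]]

    def load(groupId: String, topic: String): Map[Int, Long] =
      data.getOrElse((groupId, topic), Map.empty)

    // Merge, so partitions absent from this batch keep their last saved offset.
    def save(groupId: String, topic: String, offsets: Map[Int, Long]): Unit =
      data((groupId, topic)) = load(groupId, topic) ++ offsets
  }
}
```

Keeping the backend behind such an interface would make it possible to switch between the two options without touching the streaming code.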
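4. Code example
Following option 2 above, the offsets are stored in redis and read back from redis when the service restarts, so that messages are not consumed twice.
1) Scala utility class for redis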
package xxx.demo.scala_test

import redis.clients.jedis.{JedisPool, Protocol}
import org.apache.commons.pool2.impl.GenericObjectPoolConfig
import org.slf4j.LoggerFactory
import com.typesafe.scalalogging.slf4j.Logger

class RedisUtil extends Serializable {
  // JedisPool is not serializable, so it is excluded from serialization
  @transient private var pool: JedisPool = null
  // lazy, so the logger is re-created after deserialization instead of being null
  @transient lazy val logger = Logger(LoggerFactory.getLogger(classOf[RedisUtil]))

  // Creates the connection pool on the first call; later calls do nothing
  def makePool(redisHost: String, redisPort: Int,
               password: String, database: Int): Unit = {
    if (pool == null) {
      val poolConfig = new GenericObjectPoolConfig()
      pool = new JedisPool(poolConfig, redisHost, redisPort, Protocol.DEFAULT_TIMEOUT, password, database)
      // Release the pool when the JVM exits
      sys.addShutdownHook {
        pool.destroy()
        logger.debug("JedisPool destroyed by ShutdownHook")
      }
    }
  }

  def jedisPool: JedisPool = {
    assert(pool != null)
    pool
  }

  // Key of the redis hash that holds the offsets of one consumer group and topic
  def generateKafkaOffsetGroupIdTopicKey(groupId: String, topic: String): String = {
    groupId + "/" + topic
  }
}
2) Initialize the redis utility
import kafka.serializer.{StringDecoder, DefaultDecoder}
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka._
import org.apache.spark.sql.SparkSession
import org.bson.Document
import com.mongodb.spark.config._
import com.mongodb.spark._
import com.mongodb._
import xxx.demo.model._
import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig, Protocol}
import scala.collection.JavaConversions.{mapAsScalaMap}
import org.slf4j.LoggerFactory
import com.typesafe.scalalogging.slf4j.Logger
...
val redisUtil = new RedisUtil()
redisUtil.makePool(redisHost, redisPort, redisPassword, redisDatabase)
val jedisPool = redisUtil.jedisPool
Note: when redis does not require password authentication, redisPassword must be set to null. An empty string causes an error.
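If the password comes from a configuration file, a missing value often arrives as an empty string rather than null. One way to guard against that is a small normalizing helper, sketched below. The name `normalizePassword` is hypothetical and not part of Jedis or of the demo program.

```scala
object RedisPasswordSketch {
  // Returns null for a missing, empty, or blank password;
  // otherwise returns the password unchanged.
  def normalizePassword(raw: String): String =
    Option(raw).filter(_.trim.nonEmpty).orNull
}
```

It could then be applied at the call site, for example `redisUtil.makePool(redisHost, redisPort, RedisPasswordSketch.normalizePassword(redisPassword), redisDatabase)`.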
3) Read the last saved offsets from redis
val kafkaOffsetKey = redisUtil.generateKafkaOffsetGroupIdTopicKey(groupId, kafkaTopicName)
// Borrow a connection and return it to the pool once the hash has been read
val jedis = jedisPool.getResource()
val allOffset: java.util.Map[String, String] =
  try jedis.hgetAll(kafkaOffsetKey) finally jedis.close()
val fromOffsets = scala.collection.mutable.Map[TopicAndPartition, Long]()
if (allOffset != null && !allOffset.isEmpty()) {
  // Convert the Java Map returned by Jedis into a Scala Map
  val allOffsetScala: scala.collection.mutable.Map[String, String] = mapAsScalaMap[String, String](allOffset)
  for (offset <- allOffsetScala) {
    // Pass the saved offset to Kafka. offset._1 : partition, offset._2 : offset
    fromOffsets += (TopicAndPartition(newsAnalysisTopic, offset._1.toInt) -> offset._2.toLong)
  }
} else {
  // First run: nothing has been saved yet, so every partition starts at offset 0
  for (i <- 0 until newsAnalysisTopicPartitionCount) {
    fromOffsets += (TopicAndPartition(newsAnalysisTopic, i) -> 0L)
  }
}
logger.debug("fromOffsets : " + fromOffsets.toString())
// Convert the mutable Map into an immutable one
val imutableFromOffsets: Map[TopicAndPartition, Long] = fromOffsets.toMap
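The lookup in step 3 can also be expressed as a pure function, which makes it easy to test without redis or Kafka. The sketch below is an illustration under stated assumptions, not the demo program's code: `TopicPartition` is a stand-in for `kafka.common.TopicAndPartition`, and `resolveFromOffsets` is a hypothetical name. It also differs from step 3 in one respect: a partition that is missing from the saved hash (for example, one added to the topic after the last run) starts at 0 instead of being left out, and saved entries beyond `partitionCount` are ignored.

```scala
object FromOffsetsSketch {
  // Stand-in for kafka.common.TopicAndPartition, so the sketch runs without Kafka.
  final case class TopicPartition(topic: String, partition: Int)

  // stored: contents of the redis hash (field = partition id, value = next offset to read).
  // Returns one starting offset per partition in 0 until partitionCount.
  def resolveFromOffsets(stored: Map[String, String],
                         topic: String,
                         partitionCount: Int): Map[TopicPartition, Long] =
    (0 until partitionCount).map { p =>
      // Fall back to 0 when nothing has been saved for this partition
      val offset = stored.get(p.toString).map(_.toLong).getOrElse(0L)
      TopicPartition(topic, p) -> offset
    }.toMap
}
```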
4) Define the message handler: pick the required fields out of each message and its metadata
val messageHandler: (MessageAndMetadata[String, String]) => (String, String, Long, Int) =
  (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.message, mmd.offset, mmd.partition)
5) Create the Kafka input stream
val kafkaParam = Map[String, String](
  "bootstrap.servers" -> kafkaServer,
  "group.id" -> groupId,
  "client.id" -> clientId,
  "auto.offset.reset" -> "smallest",
  "enable.auto.commit" -> "false"
)
// Explicit starting offsets are passed in, so consumption starts from imutableFromOffsets
val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String, Long, Int)](ssc, kafkaParam, imutableFromOffsets, messageHandler)
Here ssc is the StreamingContext object.
6) In the business logic, write the offsets back to redis
val kafkaOffsetKey = redisUtil.generateKafkaOffsetGroupIdTopicKey(groupId, kafkaTopicName)
// _._1 : topic name, _._2 : message body, _._3 : offset, _._4 : partition
kafkaStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) { // skip empty batches, so the offsetRanges.foreach loop below does not run unexpectedly
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    // Process the data
    rdd.foreach { row =>
      logger.info("message : " + row)
    }
    // Open a redis transaction
    val jedis = jedisPool.getResource()
    try {
      val jedisPipeline = jedis.pipelined()
      jedisPipeline.multi()
      // Update the offsets
      offsetRanges.foreach { offsetRange =>
        logger.debug("partition : " + offsetRange.partition + " fromOffset: " + offsetRange.fromOffset + " untilOffset: " + offsetRange.untilOffset)
        jedisPipeline.hset(kafkaOffsetKey, offsetRange.partition.toString(), offsetRange.untilOffset.toString())
      }
      jedisPipeline.exec() // commit the transaction
      jedisPipeline.sync() // flush the pipeline and read the replies
    } finally {
      jedis.close() // return the connection to the pool, even if an error occurred
    }
  }
}
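The value saved per partition is untilOffset, which is exclusive in an offset range: it is the offset of the first message that has not been read yet, so it can be used directly as the starting offset on the next run. The sketch below shows the mapping from a batch's offset ranges to the redis hash fields. `BatchRange` is a stand-in for `org.apache.spark.streaming.kafka.OffsetRange`, and the helper names are hypothetical.

```scala
object OffsetHashSketch {
  // Stand-in for Spark's OffsetRange, so the sketch runs without Spark.
  final case class BatchRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

  // Hash fields to write: partition id -> next offset to read (untilOffset is exclusive).
  def toHashFields(ranges: Seq[BatchRange]): Map[String, String] =
    ranges.map(r => r.partition.toString -> r.untilOffset.toString).toMap

  // Number of messages covered by the batch, summed over all partitions.
  def messageCount(ranges: Seq[BatchRange]): Long =
    ranges.map(r => r.untilOffset - r.fromOffset).sum
}
```

Because the data is processed first and the offsets are saved afterwards, a crash between the two steps would cause that batch to be processed again after a restart. If that matters, the processing itself needs to be idempotent.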