Spark Streaming Pitfalls: Duplicate Kafka Consumption

1. Problem Description

Every time the demo program that connects Spark Streaming to Kafka is restarted, it starts consuming from the very first message in the Kafka topic.

Changing enable.auto.commit and related parameters has no effect.

2. Cause Analysis

The demo program creates the Kafka input stream with "KafkaUtils.createDirectStream". Internally this API uses the Kafka low-level consumer API, which does not support automatic offset commits (to ZooKeeper).

"KafkaUtils.createDirectStream"官方文档:

http://spark.apache.org/docs/2.2.0/streaming-kafka-0-8-integration.html
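For reference, the demo most likely created the stream with the simpler overload, roughly like the sketch below (the variable names are illustrative). With no explicitly supplied offsets and "auto.offset.reset" set to "smallest", every restart begins at the earliest message still available in the topic:

// Minimal sketch of the problematic call; no offsets are stored or restored anywhere
val kafkaParam = Map[String, String](
  "bootstrap.servers" -> kafkaServer,
  "group.id" -> groupId,
  "auto.offset.reset" -> "smallest"
)
val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParam, Set(kafkaTopicName)
)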

3. Solutions

Option 1) Write your own code against the ZooKeeper API to commit offsets to ZooKeeper; on startup, read the offsets back from ZooKeeper and pass them to "KafkaUtils.createDirectStream" as an input parameter (see the sketch right after this option).

Pros: integrates with ZooKeeper-based monitoring systems, so consumption can be monitored there.

Cons: frequent offset reads and writes may hurt ZooKeeper cluster performance and, in turn, the stability of the Kafka cluster.
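A minimal sketch of Option 1, assuming Apache Curator as the ZooKeeper client; the znode layout and helper names below are illustrative, not part of the original demo:

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry

// Hypothetical znode layout: /kafka-offsets/<groupId>/<topic>/<partition> -> offset
val zkClient = CuratorFrameworkFactory.newClient("zk-host:2181", new ExponentialBackoffRetry(1000, 3))
zkClient.start()

def offsetPath(groupId: String, topic: String, partition: Int): String =
  s"/kafka-offsets/$groupId/$topic/$partition"

// Commit one partition's offset to ZooKeeper
def commitOffset(groupId: String, topic: String, partition: Int, offset: Long): Unit = {
  val path = offsetPath(groupId, topic, partition)
  val data = offset.toString.getBytes("UTF-8")
  if (zkClient.checkExists().forPath(path) == null)
    zkClient.create().creatingParentsIfNeeded().forPath(path, data)
  else
    zkClient.setData().forPath(path, data)
}

// Read one partition's offset back on startup; None means it was never committed
def readOffset(groupId: String, topic: String, partition: Int): Option[Long] = {
  val path = offsetPath(groupId, topic, partition)
  if (zkClient.checkExists().forPath(path) == null) None
  else Some(new String(zkClient.getData().forPath(path), "UTF-8").toLong)
}

Offsets read this way can then be assembled into the fromOffsets map that "KafkaUtils.createDirectStream" accepts, exactly as the Redis-based code in section 4 does.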

Option 2) Maintain the offsets in your own code and store them in MongoDB or Redis.

Pros: no impact on ZooKeeper cluster performance; consumption monitoring can be built on top of MongoDB or Redis.

Cons: cannot be integrated with ZooKeeper-based monitoring systems.

4. Code Example

Following Option 2 above, the offsets are stored in Redis and read back from Redis when the service restarts, so no messages are consumed twice.

1) A Scala utility class for Redis

package xxx.demo.scala_test

import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig, Protocol}
import org.apache.commons.pool2.impl.GenericObjectPoolConfig
import org.slf4j.LoggerFactory
import com.typesafe.scalalogging.slf4j.Logger

class RedisUtil extends Serializable {

  // Marked @transient so the pool and logger are not serialized into Spark closures
  @transient private var pool: JedisPool = null
  @transient val logger = Logger(LoggerFactory.getLogger("cn.com.flaginfo.demo.scala_test.RedisUtil"))

  // Create a single JedisPool on first call and register a shutdown hook to destroy it
  def makePool(redisHost: String, redisPort: Int,
               password: String, database: Int): Unit = {
    if (pool == null) {
      val poolConfig = new GenericObjectPoolConfig()
      pool = new JedisPool(poolConfig, redisHost, redisPort, Protocol.DEFAULT_TIMEOUT, password, database)

      val hook = new Thread {
        override def run = {
          pool.destroy()
          logger.debug("JedisPool destroyed by ShutdownHook")
        }
      }
      sys.addShutdownHook(hook.run)
    }
  }

  def jedisPool: JedisPool = {
    assert(pool != null)
    pool
  }

  // Redis key under which the offsets of one (groupId, topic) pair are stored as a hash
  def generateKafkaOffsetGroupIdTopicKey(groupId: String, topic: String): String = {
    groupId + "/" + topic
  }
}

2) Initializing the Redis utility

import kafka.serializer.{StringDecoder, DefaultDecoder}
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka._
import org.apache.spark.sql.SparkSession
import org.bson.Document
import com.mongodb.spark.config._
import com.mongodb.spark._
import com.mongodb._
import xxx.demo.model._
import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig, Protocol}
import scala.collection.JavaConversions.{mapAsScalaMap}
import org.slf4j.LoggerFactory
import com.typesafe.scalalogging.slf4j.Logger

...

  // Build the connection pool once; jedisPool is reused everywhere below
  var redisUtil = new RedisUtil()
  redisUtil.makePool(redisHost, redisPort, redisPassword, redisDatabase)
  var jedisPool = redisUtil.jedisPool

Note: when no password authentication is required, redisPassword must be set to null; an empty string causes an error.
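A tiny sketch of that normalization (rawRedisPassword is a hypothetical name for the value read from configuration):

// JedisPool rejects an empty-string password, so map "no auth configured" to null
val redisPassword: String =
  if (rawRedisPassword == null || rawRedisPassword.trim.isEmpty) null else rawRedisPassword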

3) Reading the last offsets from Redis

var kafkaOffsetKey = redisUtil.generateKafkaOffsetGroupIdTopicKey(groupId, kafkaTopicName)
// Read all previously stored offsets for this group/topic (hash field = partition, value = offset)
val jedis = jedisPool.getResource()
val allOffset: java.util.Map[String, String] = jedis.hgetAll(kafkaOffsetKey)
jedis.close()

val fromOffsets = scala.collection.mutable.Map[TopicAndPartition, Long]()
if (allOffset != null && !allOffset.isEmpty()) {
  // Convert the Java Map returned by Jedis into a Scala Map
  var allOffsetScala: scala.collection.mutable.Map[String, String] = mapAsScalaMap[String, String](allOffset)
  for (offset <- allOffsetScala) {
    // Feed the stored offsets into the Kafka parameters.
    // offset._1: partition, offset._2: offset
    // newsAnalysisTopic holds the same topic name as kafkaTopicName above
    fromOffsets += (TopicAndPartition(newsAnalysisTopic, offset._1.toInt) -> offset._2.toLong)
  }
  logger.debug("fromOffsets : " + fromOffsets.toString())
}
else {
  // First run: no stored offsets yet, so start every partition at 0
  for (i <- 0 to (newsAnalysisTopicPartitionCount - 1)) {
    fromOffsets += (TopicAndPartition(newsAnalysisTopic, i) -> 0L)
  }
  logger.debug("fromOffsets : " + fromOffsets.toString())
}

// Convert the mutable map into an immutable one, as required by createDirectStream
var imutableFromOffsets = Map[TopicAndPartition, Long](
  fromOffsets.map(kv => (kv._1, kv._2)).toList: _*
)

4) Defining the message handler: extract the required fields from the metadata

val messageHandler: (MessageAndMetadata[String, String]) => (String, String, Long, Int) =
  (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.message, mmd.offset, mmd.partition)

5) Creating the Kafka input stream

val kafkaParam = Map[String, String](
  "bootstrap.servers" -> kafkaServer,
  "group.id" -> groupId,
  "client.id" -> clientId,
  "auto.offset.reset" -> "smallest",
  "enable.auto.commit" -> "false"
)

var kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String, Long, Int)](ssc, kafkaParam, imutableFromOffsets, messageHandler)

Here ssc is the StreamingContext object.
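For completeness, a minimal sketch of how ssc could be created; the application name and batch interval are illustrative, not taken from the original code:

// Illustrative StreamingContext setup; tune the batch interval to the workload
val sparkConf = new SparkConf().setAppName("kafka-offset-redis-demo")
val ssc = new StreamingContext(sparkConf, Seconds(5))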

6) Updating the offsets in Redis inside the business logic

var kafkaOffsetKey = redisUtil.generateKafkaOffsetGroupIdTopicKey(groupId, kafkaTopicName)

// _._1 : topic name, _._2 : message body, _._3 : offset, _._4 : partition
kafkaStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {  // This check keeps the offsetRanges.foreach loop from running on empty batches
    var offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    // Process the data
    rdd.foreach { row =>
      logger.info("message : " + row + offsetRanges)
    }
    // Open a Redis transaction on a pipeline
    var jedis = jedisPool.getResource()
    var jedisPipeline = jedis.pipelined()
    jedisPipeline.multi()
    // Update the offsets: one hash field per partition, value = the offset to resume from
    offsetRanges.foreach { offsetRange =>
      logger.debug("partition : " + offsetRange.partition + " fromOffset:  " + offsetRange.fromOffset + " untilOffset: " + offsetRange.untilOffset)
      jedisPipeline.hset(kafkaOffsetKey, offsetRange.partition.toString(), offsetRange.untilOffset.toString())
    }
    jedisPipeline.exec()  // Commit the transaction
    jedisPipeline.sync()  // Flush the pipeline and read back the responses
    jedis.close()
  }
}
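Because the offsets are written to Redis only after the batch has been processed, a failure between processing and the hset will replay that batch on the next start; the pattern is therefore at-least-once rather than exactly-once.

Finally, the streaming job still has to be started after all DStream operations are declared; a minimal driver tail looks like this:

// Start the streaming computation and block the driver until it is stopped
ssc.start()
ssc.awaitTermination()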

 

 
