kafka获取某一group当前消费情况的命令如下:
bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --topic ctr_kafka --zookeeper localhost:2181 --group mobil-canel-online
正常情况,会返回group:mobil-canel-online 消费 topic:ctr_kafka 每个partition下的lag情况。
但经由一次断网(导致了controller切换),却出现上述命令卡住的现象。运行上面命令,命令等待,无返回结果也不退出。
查看bin/kafka-run-class.sh所使用的log4j文件,修改config/tools-log4j.properties的log4j.rootLogger值由info改为DEBUG,查看DEBUG日志,发现请求一直失败然后重试。
后面开始对这个问题的分析。
kafka 在controller切换时,会调用:ControllerContext.onControllerFailover()->sendUpdateMetadataRequest()
def sendUpdateMetadataRequest(brokers: Seq[Int], partitions: Set[TopicAndPartition] = Set.empty[TopicAndPartition]) {
brokerRequestBatch.newBatch()
brokerRequestBatch.addUpdateMetadataRequestForBrokers(brokers, partitions)
brokerRequestBatch.sendRequestsToBrokers(epoch, controllerContext.correlationId.getAndIncrement)
}
Callbacks.addUpdateMetadataRequestForBrokers()->updateMetadataRequestMapFor()
val partitionStateInfo = if (beingDeleted) {
val leaderAndIsr = new LeaderAndIsr(LeaderAndIsr.LeaderDuringDelete, leaderIsrAndControllerEpoch.leaderAndIsr.isr)
PartitionStateInfo(LeaderIsrAndControllerEpoch(leaderAndIsr, leaderIsrAndControllerEpoch.controllerEpoch), replicas)
} else {
PartitionStateInfo(leaderIsrAndControllerEpoch, replicas)
}
Callback
s.
sendRequestsToBrokers()->ControllerContext.
sendRequest(
broker
,
updateMetadataRequest
,
null
)->(rpc到broker上)KafkaApis.
handleUpdateMetadataRequest()->
ReplicaManager.
maybeUpdateMetadataCache()
->MetadataCache.updateCache()
if (info.leaderIsrAndControllerEpoch.leaderAndIsr.leader == LeaderAndIsr.LeaderDuringDelete) {
removePartitionInfo(tp.topic, tp.partition)
stateChangeLogger.trace(("Broker %d deleted partition %s from metadata cache in response to UpdateMetadata request " +
"sent by controller %d epoch %d with correlation id %d")
.format(brokerId, tp, updateMetadataRequest.controllerId,
updateMetadataRequest.controllerEpoch, updateMetadataRequest.correlationId))
}
removePartitionInfo()里面会将改topic的所有partition信息都从内存中删除,最后,再删除该topic cache.remove(topic)
在执行ConsumerOffsetChecker操作时:
ConsumerOffsetChecker.main()->ClientUtils.channelToOffsetManager->queryChannel.send(ConsumerMetadataRequest(group))->(rpc到broker上)KafkaApis.handleConsumerMetadataRequest()->getTopicMetadata()->
metadataCache.getTopicMetadata(topics)
topic是:__consumer_offsets ,而由于这个topic进入了zookeeper中的/admin/delete_topics下面,从而
topics.isEmpty和cache.contains(topic) 两个条件都不满足,结果一直向上层返回空结果,最后得到一个errorResponse,然后进入不断的重试操作中。
至此问题症结已找到。就是由于很久之前曾经删除过__consumer_offsets topic,而未配置delete.topic.enable 为true,所以其实删除并未真正删除topic,仅仅是在zookeeper中增加了一个/admin/delete_topics/__consumer_offsets 节点,在controller未变化的时候,未触发上述sendUpdateMetadataRequest操作,从而kafka.tools.ConsumerOffsetChecker一直有效。经由断网,controller进行了切换,导致触发了上面的ControllerContext.onControllerFailover() 操作,broker内存中cache将该topic删除了,从而一直返回errorResponse。
从zookeeper中删除/admin/delete_topics/__consumer_offsets 节点
然后使controller切换(将现有controller节点停掉重启),从而触发上述ControllerContext.onControllerFailover() 操作,更新broker的内存结构。