最近在折腾Kafka日志集群,由于公司部署的应用不断增加,日志采集程序将采集到的日志发送到Kafka集群时出现了较大延迟,总的TPS始终上不去,为了不影响业务团队通过日志排查问题,采取了先解决问题,再排查的做法,对Kafka集群进行扩容,但扩容后尴尬的是新增加的5台机器中,有两台机器的消费发送响应时间比其他机器明显高出不少,为了确保消息服务的稳定性,又临时对集群进行缩容,将这台机器从集群中剔除,具体的操作就是简单粗暴的使用 kill pid命令,但意外发生了。
而Go客户端报的错误如下所示:
基本可以认为是部分分区没有在线Leader,无法成功发送消息。
那为什么会出现这个问题吗?Kafka一个节点下线,不是会自动触发故障转移,分区leader不是会被重新选举吗?请带着这个疑问,开始我们今天的探究之旅。
首先我们可以先看看当前存在问题的分区的路由信息,从第一张图中看出主题dw_test_kafka_0816000的101分区消息发送失败,我们在Zookeeper中看一下其状态,具体命令如下:
./zkCli.sh -server 127.0.0.1:2181
get -s /kafka_cluster_01/brokers/topics/dw_test_kafka_0816000/partitions/101/state
该命令可以看到对应分区的相信信息,如下图所示:
这里显示出leader的状态为-1,而isr列表中只有一副本,在broker-1上,但此时broker id为1的机器已经下线了,那为什么不会触发分区Leader重新选举呢?
其实看到这里,我相信你只要稍微细想一下,就能发现端倪,isr字段的值为1,说明该分区的副本数为1,说明该分区只在一个Broker上存储数据,一旦Broker下线,由于集群内其他Broker上并没有该分区的数据,此时是无法进行故障转移的,因为一旦要进行故障转移,分区的数据就会丢失,这样带来的影响将是非常严重的。
那为什么该主题的副本数会设置为1呢?那是因为当时集群的压力太大,节点之间复制数据量巨大,网卡基本满负荷在运转,而又是日志集群,对数据的丢失的接受程度较大,故当时为了避免数据在集群之间的大量复制,将该主题的副本数设置为了1。
但集群节点的停机维护是少不了的,总不能每一次停机维护,都会出现一段时间数据写入失败吧。要解决这个问题,我们在停机之前,需要先对主题进行分区移动,将该主题的分区从需要停机的集群中移除。
主题分区移动的具体做法,请参考我之前的一篇文章Kafka主题迁移实践 的第三部分。
Kafka单副本的主题在集群内一台节点下线后,将无法完成分区的故障转移机制,为了深入掌握底层的一些实现细节,我想再深入探究一下kafka节点下线的一些故障转移机制。
温馨提示:接下来主要是从源码角度深入探究实现原理,加深对这个过程的理解,如果大家不感兴趣,可以直接进入到本文的第4个部分:总结。
在Kafka中依赖的Zookeeper服务器上存储了当前集群内存活的broker信息,具体的路径为/{namespace}/brokers/brokers/ids,具体图示如下:
并且ids下的每一个节点记录了Broker的一些信息,例如对外提供服务的协议、端口等,值得注意的是这些节点为临时节点,如下图所示:
这样一旦对应的Broker宕机下线,对应的节点会删除,Kafka集群内的Controller角色在启动时会监听该节点下节点的变化,并作出响应,最终将会调用KafkaController的onBrokerFailure方法,具体代码如下所示:
这个方法实现比较复杂,我们在这里不做过多分散,重点查找分区的故障转移机制,也就是接下来我们将具体分析KafkaController的onReplicasBecomeOffline方法,主要探究分区的故障转移机制。
由于该方法实现复杂,接下来将分布对其进行详解。
Step1:从需要设置为下线状态分区进行分组,分组依据为是否需要删除,没有触发删除的集合用newofflineReplicasNotForDeletion表示,需要被删除的集合用newofflineReplicasForDeletion表示。
val (newOfflineReplicasForDeletion, newOfflineReplicasNotForDeletion) =
newOfflineReplicas.partition(p => topicDeletionManager.isTopicQueuedUpForDeletion(p.topic))
def isTopicQueuedUpForDeletion(topic: String): Boolean = {
if (isDeleteTopicEnabled) {
topicsToBeDeleted.contains(topic)
} else
false
}
Step2:挑选没有Leader的分区,用partitionsWithoutLeader,代码如下图所示:
val partitionsWithoutLeader = controllerContext.partitionLeadershipInfo.filter(partitionAndLeader =>
!controllerContext.isReplicaOnline(partitionAndLeader._2.leaderAndIsr.leader, partitionAndLeader._1) &&
!topicDeletionManager.isTopicQueuedUpForDeletion(partitionAndLeader._1.topic)).keySet
def isReplicaOnline(brokerId: Int, topicPartition: TopicPartition, includeShuttingDownBrokers: Boolean = false): Boolean = {
val brokerOnline = {
if (includeShuttingDownBrokers) liveOrShuttingDownBrokerIds.contains(brokerId)
else liveBrokerIds.contains(brokerId)
}
brokerOnline && !replicasOnOfflineDirs.getOrElse(brokerId, Set.empty).contains(topicPartition)
}
分区没有Leader的标准是:分区的Leader副本所在的Broker没有下线,并且没有被删除。
Step3:将没有Leader的分区状态变更为OfflinePartition(离线状态),这里的状态更新是放在kafka Controller中的内存中,具体的内存结构:Map[TopicPartition, PartitionState]。
Step4:Kafka分区状态机驱动(触发)分区状态为OfflinePartition、NewPartition向OnlinePartition转化,状态的转化主要包括两个重要的步骤:
由于篇幅的问题,我们这篇文章不会体系化的介绍Kafka分区状态机的实现细节,先重点关注OfflinePartition离线状态向OnlinePartition转化过程。
我们首先说明一下OfflinePartition离线状态向OnlinePartition转化过程时各个参数的含义:
这里判断一下分区是否有效的依据主要是要根据状态机设置的驱动条件,例如只有分区状态为OnlinePartition、NewPartition、OfflinePartition三个状态才能转换为OnlinePartition。
接下来重点看变更为OnlinePartition的具体实现逻辑,具体代码如下所示:
case OnlinePartition =>
//step1 start
val uninitializedPartitions = validPartitions.filter(partition => partitionState(partition) == NewPartition)
val partitionsToElectLeader = validPartitions.filter(partition => partitionState(partition) ==
OfflinePartition || partitionState(partition) == OnlinePartition)
//step1 end
// step2 start
if (uninitializedPartitions.nonEmpty) {
val successfulInitializations = initializeLeaderAndIsrForPartitions(uninitializedPartitions)
successfulInitializations.foreach { partition =>
stateChangeLog.trace(s"Changed partition $partition from ${partitionState(partition)} to
$targetState with state "+ s"${controllerContext.partitionLeadershipInfo(partition).leaderAndIsr}")
changeStateTo(partition, partitionState(partition), OnlinePartition)
}
}
// step2 end
// step3 start
if (partitionsToElectLeader.nonEmpty) {
val (successfulElections, failedElections) = electLeaderForPartitions(
partitionsToElectLeader, partitionLeaderElectionStrategyOpt.get)
successfulElections.foreach { partition =>
stateChangeLog.trace(s"Changed partition $partition from ${partitionState(partition)} to
$targetState with state " +
s"${controllerContext.partitionLeadershipInfo(partition).leaderAndIsr}")
changeStateTo(partition, partitionState(partition), OnlinePartition)
}
failedElections
} else {
Map.empty
}
// step3 end
case OfflinePartition =>
具体实现分为3个步骤:
接下来我们重点看一下离线状态变更为OnlinePartition的分区leader选举实现,具体方法为:PartitionStateMachine的electLeaderForPartitions方法,其代码如下所示:
private def electLeaderForPartitions(partitions: Seq[TopicPartition],
partitionLeaderElectionStrategy: PartitionLeaderElectionStrategy): (Seq[TopicPartition], Map[TopicPartition, Throwable]) = {
val successfulElections = mutable.Buffer.empty[TopicPartition]
var remaining = partitions
var failures = Map.empty[TopicPartition, Throwable]
while (remaining.nonEmpty) {
val (success, updatesToRetry, failedElections) = doElectLeaderForPartitions(partitions,
partitionLeaderElectionStrategy)
remaining = updatesToRetry
successfulElections ++= success
failedElections.foreach { case (partition, e) =>
logFailedStateChange(partition, partitionState(partition), OnlinePartition, e)
}
failures ++= failedElections
}
(successfulElections, failures)
}
这个方法的实现结构比较简单,返回值为两个集合,一个选举成功的集合,一个选举失败的集合,同时选举过程中如果出现可恢复异常,则会进行重试。
具体的重试逻辑由doElectLeaderForPartitions方法实现,该方法非常复杂。
分区选举由PartitionStateMachine的doElectLeaderForPartitions方法实现,接下来分步进行讲解。
Step1:首先从Zookeeper中获取需要选举分区的元信息,代码如下所示:
val getDataResponses = try {
zkClient.getTopicPartitionStatesRaw(partitions)
} catch {
case e: Exception =>
return (Seq.empty, Seq.empty, partitions.map(_ -> e).toMap)
}
Kafka中主题的路由信息存储在Zookeeper中,具体路径为:/{namespace}/brokers/topics/{topicName}}/partitions/{partition}/state,具体存储的内容如下所示:
Step2:将查询出来的主题分区元信息,组装成Map< TopicPartition, LeaderIsrAndControllerEpoch>的Map结构,代码如下所示:
val leaderIsrAndControllerEpochPerPartition = mutable.Buffer.empty[(TopicPartition, LeaderIsrAndControllerEpoch)]
getDataResponses.foreach { getDataResponse =>
val partition = getDataResponse.ctx.get.asInstanceOf[TopicPartition]
val currState = partitionState(partition)
if (getDataResponse.resultCode == Code.OK) {
val leaderIsrAndControllerEpochOpt = TopicPartitionStateZNode.decode(getDataResponse.data, getDataResponse.stat)
if (leaderIsrAndControllerEpochOpt.isEmpty) {
val exception = new StateChangeFailedException(s"LeaderAndIsr information doesn't exist for partition $partition in $currState state")
failedElections.put(partition, exception)
}
leaderIsrAndControllerEpochPerPartition += partition -> leaderIsrAndControllerEpochOpt.get
} else if (getDataResponse.resultCode == Code.NONODE) {
val exception = new StateChangeFailedException(s"LeaderAndIsr information doesn't exist for partition $partition in $currState state")
failedElections.put(partition, exception)
} else {
failedElections.put(partition, getDataResponse.resultException.get)
}
}
Step3:将分区中的controllerEpoch与当前Kafka Controller的epoch对比,刷选出无效与有效集合,具体代码如下所示:
val (invalidPartitionsForElection, validPartitionsForElection) = leaderIsrAndControllerEpochPerPartition.partition { case (_, leaderIsrAndControllerEpoch) =>
leaderIsrAndControllerEpoch.controllerEpoch > controllerContext.epoch
}
invalidPartitionsForElection.foreach { case (partition, leaderIsrAndControllerEpoch) =>
val failMsg = s"aborted leader election for partition $partition since the LeaderAndIsr path was " +
s"already written by another controller. This probably means that the current controller $controllerId went through " +
s"a soft failure and another controller was elected with epoch ${leaderIsrAndControllerEpoch.controllerEpoch}."
failedElections.put(partition, new StateChangeFailedException(failMsg))
}
if (validPartitionsForElection.isEmpty) {
return (Seq.empty, Seq.empty, failedElections.toMap)
}
如果当前控制器的controllerEpoch小于分区状态中的controllerEpoch,说明已有新的Broker已取代当前Controller成为集群新的Controller,本次无法进行Leader选取,并且打印日志。
Step4:根据Leader选举策略进行Leader选举,代码如下所示:
val (partitionsWithoutLeaders, partitionsWithLeaders) = partitionLeaderElectionStrategy match {
case OfflinePartitionLeaderElectionStrategy =>
leaderForOffline(validPartitionsForElection).partition { case (_, newLeaderAndIsrOpt, _) => newLeaderAndIsrOpt.isEmpty }
case ReassignPartitionLeaderElectionStrategy =>
leaderForReassign(validPartitionsForElection).partition { case (_, newLeaderAndIsrOpt, _) => newLeaderAndIsrOpt.isEmpty }
case PreferredReplicaPartitionLeaderElectionStrategy =>
leaderForPreferredReplica(validPartitionsForElection).partition { case (_, newLeaderAndIsrOpt, _) => newLeaderAndIsrOpt.isEmpty }
case ControlledShutdownPartitionLeaderElectionStrategy =>
leaderForControlledShutdown(validPartitionsForElection, shuttingDownBrokers).partition { case (_, newLeaderAndIsrOpt, _) => newLeaderAndIsrOpt.isEmpty }
}
由于我们这次是由OfflinePartition状态向OnlinePartition状态转换,进入的分支为leaderForOffline,稍后我们再详细介绍该方法,经过选举后的返回值为两个集合,其中partitionsWithoutLeaders表示未成功选举出Leader的分区,而partitionsWithLeaders表示成功选举出Leader的分区。
Step5:没有成功选举出Leader的分区打印对应日志,并加入到失败队列集合中,如下图所示:
partitionsWithoutLeaders.foreach { case (partition, _, _) =>
val failMsg = s"Failed to elect leader for partition $partition under strategy $partitionLeaderElectionStrategy"
failedElections.put(partition, new StateChangeFailedException(failMsg))
}
Step5:将选举结果更新到zookeeper中,如下所示:
val adjustedLeaderAndIsrs = partitionsWithLeaders.map { case (partition, leaderAndIsrOpt, _) => partition -> leaderAndIsrOpt.get }.toMap
val UpdateLeaderAndIsrResult(successfulUpdates, updatesToRetry, failedUpdates) = zkClient.updateLeaderAndIsr(
adjustedLeaderAndIsrs, controllerContext.epoch, controllerContext.epochZkVersion)
Step6:将最新的分区选举结果同步到其他Broker节点上。
successfulUpdates.foreach { case (partition, leaderAndIsr) =>
val replicas = controllerContext.partitionReplicaAssignment(partition)
val leaderIsrAndControllerEpoch = LeaderIsrAndControllerEpoch(leaderAndIsr, controllerContext.epoch)
controllerContext.partitionLeadershipInfo.put(partition, leaderIsrAndControllerEpoch)
controllerBrokerRequestBatch.addLeaderAndIsrRequestForBrokers(recipientsPerPartition(partition), partition,
leaderIsrAndControllerEpoch, replicas, isNew = false)
}
更新分区状态的请求LEADER_AND_ISR被其他Broker接受后,会根据分区的leader与副本信息,成为该分区的Leader节点或从节点,关于这块的实现细节在专栏的后续文章中会专门提及。
那OfflinePartitionLeaderElectionStrategy选举策略具体是如何进行选举的呢?接下来我们探究其实现细节。
OfflinePartitionLeaderElectionStrategy的选举策略实现代码见PartitionStateMachine的leaderForOffline,我们还是采取分步探讨的方式。
Step1:主要初始化几个集合,代码如下
val (partitionsWithNoLiveInSyncReplicas, partitionsWithLiveInSyncReplicas) = leaderIsrAndControllerEpochs.partition { case (partition, leaderIsrAndControllerEpoch) =>
val liveInSyncReplicas = leaderIsrAndControllerEpoch.leaderAndIsr.isr.filter(replica => controllerContext.isReplicaOnline(replica, partition))
liveInSyncReplicas.isEmpty
}
val (logConfigs, failed) = zkClient.getLogConfigs(partitionsWithNoLiveInSyncReplicas.map { case (partition, _) => partition.topic }, config.originals())
val partitionsWithUncleanLeaderElectionState = partitionsWithNoLiveInSyncReplicas.map { case (partition, leaderIsrAndControllerEpoch) =>
if (failed.contains(partition.topic)) {
logFailedStateChange(partition, partitionState(partition), OnlinePartition, failed(partition.topic))
(partition, None, false)
} else {
(partition, Option(leaderIsrAndControllerEpoch), logConfigs(partition.topic).uncleanLeaderElectionEnable.booleanValue())
}
} ++ partitionsWithLiveInSyncReplicas.map { case (partition, leaderIsrAndControllerEpoch) => (partition, Option(leaderIsrAndControllerEpoch), false) }
对上面的变量做一个简单介绍:
Step2:执行分区Leader选举,具体实现代码如下所示:
partitionsWithUncleanLeaderElectionState.map { case (partition, leaderIsrAndControllerEpochOpt, uncleanLeaderElectionEnabled) =>
val assignment = controllerContext.partitionReplicaAssignment(partition)
val liveReplicas = assignment.filter(replica => controllerContext.isReplicaOnline(replica, partition))
if (leaderIsrAndControllerEpochOpt.nonEmpty) {
val leaderIsrAndControllerEpoch = leaderIsrAndControllerEpochOpt.get
val isr = leaderIsrAndControllerEpoch.leaderAndIsr.isr
val leaderOpt = PartitionLeaderElectionAlgorithms.offlinePartitionLeaderElection(assignment, isr, liveReplicas.toSet, uncleanLeaderElectionEnabled, controllerContext)
val newLeaderAndIsrOpt = leaderOpt.map { leader =>
val newIsr = if (isr.contains(leader)) isr.filter(replica => controllerContext.isReplicaOnline(replica, partition))
else List(leader)
leaderIsrAndControllerEpoch.leaderAndIsr.newLeaderAndIsr(leader, newIsr)
}
(partition, newLeaderAndIsrOpt, liveReplicas)
} else {
(partition, None, liveReplicas)
}
}
首先解释如下几个变量的含义:
离线转在线的选举算法比较简单:如果unclean.leader.election.enable=false,则从存活的ISR集合中选择第一个成为分区的Leader,如果没有存活的ISR副本,并且unclean.leader.election.enable=true,则选择一个在线的副本,否则返回NONE,表示没有成功选择一个合适的Leader。
然后返回本次选举的结果,完成本次选举。
本文从一个生产实际故障开始进行分析,经过分析得出单副本主题在集群中单台节点下线会引起部分队列无法写入,解决办法是要先执行主题分区移动,也就是将需要停止的broker上所在的分区移动到其他broker上,这个过程并不会对消息发送,消息消费造成影响。
最后大家如果和我一样,喜欢看看分区故障转移相关实现细节的话,我也带领大家一睹源码,加深对分区选举机制的理解,做到举一反三。
原文首发:Kafka消费组无法消费问题排查实战
一键三连(关注、点赞、留言)是对我最大的鼓励。
各位技术朋友们,我是《RocketMQ技术内幕》一书作者,CSDN2020博客之星TOP2,热衷于中间件领域的技术分享,维护「中间件兴趣圈」公众号,旨在成体系剖析Java主流中间件,构建完备的分布式架构体系,欢迎大家大家关注我,回复「专栏」可获取15个专栏;回复「PDF」可获取海量学习资料,回复「加群」可以拉你入技术交流群,零距离与BAT大厂的大神交流。