Kafka's Replica Assignment Strategy, Part 2: What Happens When the Replica Count Drops to 0

This article discusses how a Kafka cluster handles the replicas and partitions hosted on a broker when that broker changes state, for example when it crashes or leaves the cluster.

Before getting into that, you need to be clear about how the leader and the followers divide the work in a Kafka cluster; see my earlier article, Kafka的leader选举过程 (Kafka's leader election process).

When covering Kafka's election process earlier, I mentioned that the successfully elected leader registers a number of watchers with ZooKeeper, among them:

replicaStateMachine.registerListeners()    // "/brokers/ids" -- the key one: watches followers joining and leaving the cluster
This line registers the watch on /brokers/ids. Stepping into that call,

private def registerBrokerChangeListener() = {
  zkUtils.zkClient.subscribeChildChanges(ZkUtils.BrokerIdsPath, brokerChangeListener)
}
we arrive here. ZkUtils.BrokerIdsPath is exactly the /brokers/ids path, so clearly the interesting part is brokerChangeListener. That listener is defined in ReplicaStateMachine.scala in the kafka.controller package.

class BrokerChangeListener() extends IZkChildListener with Logging {
  this.logIdent = "[BrokerChangeListener on Controller " + controller.config.brokerId + "]: "
  def handleChildChange(parentPath : String, currentBrokerList : java.util.List[String]) {
    info("Broker change listener fired for path %s with children %s".format(parentPath, currentBrokerList.sorted.mkString(",")))
    inLock(controllerContext.controllerLock) {
      if (hasStarted.get) {
        ControllerStats.leaderElectionTimer.time {
          try {
            val curBrokers = currentBrokerList.map(_.toInt).toSet.flatMap(zkUtils.getBrokerInfo)
            val curBrokerIds = curBrokers.map(_.id)
            val liveOrShuttingDownBrokerIds = controllerContext.liveOrShuttingDownBrokerIds
            val newBrokerIds = curBrokerIds -- liveOrShuttingDownBrokerIds
            val deadBrokerIds = liveOrShuttingDownBrokerIds -- curBrokerIds
            val newBrokers = curBrokers.filter(broker => newBrokerIds(broker.id))
            //The lines above are easy to follow: they work out which brokers have just joined and which have left
            controllerContext.liveBrokers = curBrokers
            val newBrokerIdsSorted = newBrokerIds.toSeq.sorted
            val deadBrokerIdsSorted = deadBrokerIds.toSeq.sorted
            val liveBrokerIdsSorted = curBrokerIds.toSeq.sorted
            info("Newly added brokers: %s, deleted brokers: %s, all live brokers: %s"
              .format(newBrokerIdsSorted.mkString(","), deadBrokerIdsSorted.mkString(","), liveBrokerIdsSorted.mkString(",")))
            newBrokers.foreach(controllerContext.controllerChannelManager.addBroker)
            deadBrokerIds.foreach(controllerContext.controllerChannelManager.removeBroker)
            //The two lines above maintain the map holding broker information
            if(newBrokerIds.nonEmpty)
              controller.onBrokerStartup(newBrokerIdsSorted)
            if(deadBrokerIds.nonEmpty)
              //This is the key line: how failed brokers are handled
              controller.onBrokerFailure(deadBrokerIdsSorted)
          } catch {
            case e: Throwable => error("Error while handling broker changes", e)
          }
        }
      }
    }
  }
}

As you can see, when the broker set changes, the Kafka controller infers the newly joined and newly departed brokers from the change in the znode's children, and for the departed brokers it calls onBrokerFailure.

Following that call -- only part of onBrokerFailure's source is excerpted here -- let's analyze it piece by piece.

def onBrokerFailure(deadBrokers: Seq[Int]){
.....
....
val deadBrokersSet = deadBrokers.toSet
// trigger OfflinePartition state for all partitions whose current leader is one amongst the dead brokers
//pick out every partition whose current leader is among the dead brokers
val partitionsWithoutLeader = controllerContext.partitionLeadershipInfo.filter(partitionAndLeader =>
  deadBrokersSet.contains(partitionAndLeader._2.leaderAndIsr.leader) &&
    !deleteTopicManager.isTopicQueuedUpForDeletion(partitionAndLeader._1.topic)).keySet
//handle the partitions that have lost their leader
partitionStateMachine.handleStateChanges(partitionsWithoutLeader, OfflinePartition)
In this snippet, all partitions whose leader is one of the dead brokers are extracted; since those leaders are now offline, these partitions need special treatment, namely the handleStateChanges call. Let's step into that function.

def handleStateChanges(partitions: Set[TopicAndPartition], targetState: PartitionState,
                       leaderSelector: PartitionLeaderSelector = noOpPartitionLeaderSelector,
                       callbacks: Callbacks = (new CallbackBuilder).build) {
....
....
case OfflinePartition =>
  // pre: partition should be in New or Online state
  assertValidPreviousStates(topicAndPartition, List(NewPartition, OnlinePartition, OfflinePartition), OfflinePartition)
  // should be called when the leader for a partition is no longer alive
  stateChangeLogger.trace("Controller %d epoch %d changed partition %s state from %s to %s"
    .format(controllerId, controller.epoch, topicAndPartition, currState, targetState))
  partitionState.put(topicAndPartition, OfflinePartition)

In this function, the OfflinePartition case merely marks the partition as offline in the partitionState HashMap; nothing else is changed. If you read the full source, you will notice a sendRequestsToBrokers call afterwards, but because the handling so far has modified neither the metadata nor the leadership, there is nothing to tell the follower brokers, so that sendRequestsToBrokers() effectively does nothing here.

Next, back to onBrokerFailure(): what does it do after that?

def onBrokerFailure(deadBrokers: Seq[Int]) {
...
...
    partitionStateMachine.triggerOnlinePartitionStateChange()

It calls the triggerOnlinePartitionStateChange() function:

def triggerOnlinePartitionStateChange() {
  try {
    brokerRequestBatch.newBatch()
    // try to move all partitions in NewPartition or OfflinePartition state to OnlinePartition state except partitions
    // that belong to topics to be deleted
    for((topicAndPartition, partitionState) <- partitionState
        if !controller.deleteTopicManager.isTopicQueuedUpForDeletion(topicAndPartition.topic)) {
      if(partitionState.equals(OfflinePartition) || partitionState.equals(NewPartition))
        //Offline and New partitions are filtered out here -- why not just pass them in as a parameter? Feels a bit odd.
        //For the broker-down case, the core of the earlier step was the single line partitionState.put(topicAndPartition, OfflinePartition),
        //but here the target state becomes OnlinePartition!
        handleStateChange(topicAndPartition.topic, topicAndPartition.partition, OnlinePartition, controller.offlinePartitionSelector,
                          (new CallbackBuilder).build)
    }
    brokerRequestBatch.sendRequestsToBrokers(controller.epoch)
Notice how similar this is to the earlier handleStateChanges? That's because both calls land in the same function -- but the arguments have changed. See that OnlinePartition? Earlier the partition was marked offline in the HashMap; now it is being processed again to bring it back online.

def handleStateChanges(partitions: Set[TopicAndPartition], targetState: PartitionState,
                       leaderSelector: PartitionLeaderSelector = noOpPartitionLeaderSelector,
                       callbacks: Callbacks = (new CallbackBuilder).build) {
....
....
case OnlinePartition =>
  assertValidPreviousStates(topicAndPartition, List(NewPartition, OnlinePartition, OfflinePartition), OnlinePartition)
  partitionState(topicAndPartition) match {
    case NewPartition =>
      // initialize leader and isr path for new partition
      initializeLeaderAndIsrForPartition(topicAndPartition)
    case OfflinePartition =>
      electLeaderForPartition(topic, partition, leaderSelector)
    case OnlinePartition => // invoked when the leader needs to be re-elected
      electLeaderForPartition(topic, partition, leaderSelector)
    case _ => // should never come here since illegal previous states are checked above
  }

Because of the earlier state change, we now take the OfflinePartition branch, and there is the exciting part: electLeaderForPartition, which re-elects the partition's leader. If the election code feels too long, just focus on the two lines I have marked with comments below as the key ones.

class OfflinePartitionLeaderSelector(controllerContext: ControllerContext, config: KafkaConfig)
  extends PartitionLeaderSelector with Logging {
  this.logIdent = "[OfflinePartitionLeaderSelector]: "

  def selectLeader(topicAndPartition: TopicAndPartition, currentLeaderAndIsr: LeaderAndIsr): (LeaderAndIsr, Seq[Int]) = {
    controllerContext.partitionReplicaAssignment.get(topicAndPartition) match {
      case Some(assignedReplicas) =>
        val liveAssignedReplicas = assignedReplicas.filter(r => controllerContext.liveBrokerIds.contains(r))
        val liveBrokersInIsr = currentLeaderAndIsr.isr.filter(r => controllerContext.liveBrokerIds.contains(r))
        val currentLeaderEpoch = currentLeaderAndIsr.leaderEpoch
        val currentLeaderIsrZkPathVersion = currentLeaderAndIsr.zkVersion
        val newLeaderAndIsr =
          if (liveBrokersInIsr.isEmpty) {
            // Prior to electing an unclean (i.e. non-ISR) leader, ensure that doing so is not disallowed by the configuration
            // for unclean leader election.
            if (!LogConfig.fromProps(config.originals, AdminUtils.fetchEntityConfig(controllerContext.zkUtils,
              ConfigType.Topic, topicAndPartition.topic)).uncleanLeaderElectionEnable) {
              throw new NoReplicaOnlineException(("No broker in ISR for partition " +
                "%s is alive. Live brokers are: [%s],".format(topicAndPartition, controllerContext.liveBrokerIds)) +
                " ISR brokers are: [%s]".format(currentLeaderAndIsr.isr.mkString(",")))
            }
            debug("No broker in ISR is alive for %s. Pick the leader from the alive assigned replicas: %s"
              .format(topicAndPartition, liveAssignedReplicas.mkString(",")))
            if (liveAssignedReplicas.isEmpty) {
              throw new NoReplicaOnlineException(("No replica for partition " +
                "%s is alive. Live brokers are: [%s],".format(topicAndPartition, controllerContext.liveBrokerIds)) +
                " Assigned replicas are: [%s]".format(assignedReplicas))
            } else {
              ControllerStats.uncleanLeaderElectionRate.mark()
              //The key point is as simple as this: liveAssignedReplicas.head
              val newLeader = liveAssignedReplicas.head
              warn("No broker in ISR is alive for %s. Elect leader %d from live brokers %s. There's potential data loss."
                .format(topicAndPartition, newLeader, liveAssignedReplicas.mkString(",")))
              new LeaderAndIsr(newLeader, currentLeaderEpoch + 1, List(newLeader), currentLeaderIsrZkPathVersion + 1)
            }
          } else {
            val liveReplicasInIsr = liveAssignedReplicas.filter(r => liveBrokersInIsr.contains(r))
            //The key point is as simple as this: liveReplicasInIsr.head
            val newLeader = liveReplicasInIsr.head
            debug("Some broker in ISR is alive for %s. Select %d from ISR %s to be the leader."
              .format(topicAndPartition, newLeader, liveBrokersInIsr.mkString(",")))
            new LeaderAndIsr(newLeader, currentLeaderEpoch + 1, liveBrokersInIsr.toList, currentLeaderIsrZkPathVersion + 1)
          }
        info("Selected new leader and ISR %s for offline partition %s".format(newLeaderAndIsr.toString(), topicAndPartition))
        (newLeaderAndIsr, liveAssignedReplicas)
      case None =>
        throw new NoReplicaOnlineException("Partition %s doesn't have replicas assigned to it".format(topicAndPartition))
    }
  }
}
The re-election logic is simple. Is there a live in-sync replica? If so, take one from the ISR list. If not, is there a replica that is out of sync but still alive? If so, take one from the list of live assigned replicas, while logging the warning "There's potential data loss." Not even a live replica left? Then all that remains is to throw an exception.
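
To boil that decision down, here is a minimal sketch of the selection order (pickNewLeader is a made-up name for illustration; it is not the real OfflinePartitionLeaderSelector):

// Hypothetical condensation of the leader choice described above -- not Kafka code.
def pickNewLeader(liveBrokersInIsr: Seq[Int], liveAssignedReplicas: Seq[Int]): Int = {
  if (liveBrokersInIsr.nonEmpty)
    liveBrokersInIsr.head               // clean election: a replica that is alive and in sync
  else if (liveAssignedReplicas.nonEmpty)
    liveAssignedReplicas.head           // unclean election: alive but out of sync, potential data loss
  else
    // Kafka throws NoReplicaOnlineException at this point
    throw new IllegalStateException("No replica for this partition is alive")
}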

OK, now that the partition leader has been re-elected, the replica relationships still have to be handled. From the structure of the Kafka source, you can see that the controller maintains two state machines, partitionStateMachine and replicaStateMachine. You can think of this as an object-oriented abstraction that models every partition and every replica as a stateful object; everything analyzed so far has happened inside the partitionStateMachine.
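
For orientation, these are the states those two state machines move partitions and replicas between, listed here as a bare Scala sketch (the names come from the kafka.controller package; the real definitions carry extra fields that are omitted):

// Partition states driven by PartitionStateMachine
sealed trait PartitionState
case object NewPartition          extends PartitionState
case object OnlinePartition       extends PartitionState
case object OfflinePartition      extends PartitionState
case object NonExistentPartition  extends PartitionState

// Replica states driven by ReplicaStateMachine
sealed trait ReplicaState
case object NewReplica                extends ReplicaState
case object OnlineReplica             extends ReplicaState
case object OfflineReplica            extends ReplicaState
case object ReplicaDeletionStarted    extends ReplicaState
case object ReplicaDeletionSuccessful extends ReplicaState
case object ReplicaDeletionIneligible extends ReplicaState
case object NonExistentReplica        extends ReplicaState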

Back to onBrokerFailure() from the beginning:

def onBrokerFailure(deadBrokers: Seq[Int]) {
....
....
var allReplicasOnDeadBrokers = controllerContext.replicasOnBrokers(deadBrokersSet)
val activeReplicasOnDeadBrokers = allReplicasOnDeadBrokers.filterNot(p => deleteTopicManager.isTopicQueuedUpForDeletion(p.topic))
// handle dead replicas
replicaStateMachine.handleStateChanges(activeReplicasOnDeadBrokers, OfflineReplica)

As just described, the state machine for the affected replicas is processed next, with OfflineReplica as the target state. On to the corresponding case branch:

case OfflineReplica =>
  assertValidPreviousStates(partitionAndReplica,
    List(NewReplica, OnlineReplica, OfflineReplica, ReplicaDeletionIneligible), targetState)
  // send stop replica command to the replica so that it stops fetching from the leader
  brokerRequestBatch.addStopReplicaRequestForBrokers(List(replicaId), topic, partition, deletePartition = false)
  // As an optimization, the controller removes dead replicas from the ISR
  val leaderAndIsrIsEmpty: Boolean =
    controllerContext.partitionLeadershipInfo.get(topicAndPartition) match {
      case Some(currLeaderIsrAndControllerEpoch) =>
        //Key call: removeReplicaFromIsr
        controller.removeReplicaFromIsr(topic, partition, replicaId) match {
          case Some(updatedLeaderIsrAndControllerEpoch) =>
            // send the shrunk ISR state change request to all the remaining alive replicas of the partition.
            val currentAssignedReplicas = controllerContext.partitionReplicaAssignment(topicAndPartition)
            if (!controller.deleteTopicManager.isPartitionToBeDeleted(topicAndPartition)) {
              brokerRequestBatch.addLeaderAndIsrRequestForBrokers(currentAssignedReplicas.filterNot(_ == replicaId),
                topic, partition, updatedLeaderIsrAndControllerEpoch, replicaAssignment)
            }
            replicaState.put(partitionAndReplica, OfflineReplica)
            stateChangeLogger.trace("Controller %d epoch %d changed state of replica %d for partition %s from %s to %s"
              .format(controllerId, controller.epoch, replicaId, topicAndPartition, currState, targetState))
            false
          case None =>
            true
        }
      case None =>
        true
    }
First, a StopReplica request is queued for the replica on the dead broker so that it stops fetching from the leader and no longer takes part in the ISR, i.e. the line below:

  brokerRequestBatch.addStopReplicaRequestForBrokers(List(replicaId), topic, partition, deletePartition = false)
Then the main handling logic moves on to removeReplicaFromIsr:

def removeReplicaFromIsr(topic: String, partition: Int, replicaId: Int): Option[LeaderIsrAndControllerEpoch] = {
......
......
if (leaderAndIsr.isr.contains(replicaId)) {
          // if the replica to be removed from the ISR is also the leader, set the new leader value to -1
          val newLeader = if (replicaId == leaderAndIsr.leader) LeaderAndIsr.NoLeader else leaderAndIsr.leader
          var newIsr = leaderAndIsr.isr.filter(b => b != replicaId)

          // if the replica to be removed from the ISR is the last surviving member of the ISR and unclean leader election
          // is disallowed for the corresponding topic, then we must preserve the ISR membership so that the replica can
          // eventually be restored as the leader.
          if (newIsr.isEmpty && !LogConfig.fromProps(config.originals, AdminUtils.fetchEntityConfig(zkUtils,
            ConfigType.Topic, topicAndPartition.topic)).uncleanLeaderElectionEnable) {
            info("Retaining last ISR %d of partition %s since unclean leader election is disabled".format(replicaId, topicAndPartition))
            newIsr = leaderAndIsr.isr
          }

          val newLeaderAndIsr = new LeaderAndIsr(newLeader, leaderAndIsr.leaderEpoch + 1,
            newIsr, leaderAndIsr.zkVersion + 1)
          // update the new leadership decision in zookeeper or retry
          val (updateSucceeded, newVersion) = ReplicationUtils.updateLeaderAndIsr(zkUtils, topic, partition,
            newLeaderAndIsr, epoch, leaderAndIsr.zkVersion)

          newLeaderAndIsr.zkVersion = newVersion
          finalLeaderIsrAndControllerEpoch = Some(LeaderIsrAndControllerEpoch(newLeaderAndIsr, epoch))
          controllerContext.partitionLeadershipInfo.put(topicAndPartition, finalLeaderIsrAndControllerEpoch.get)
          if (updateSucceeded)
            info("New leader and ISR for partition %s is %s".format(topicAndPartition, newLeaderAndIsr.toString()))
          updateSucceeded
        }
......
......
}
Some non-essential code has been trimmed above. See the key point? If the replica being removed was also the leader, the leader is marked as NoLeader, i.e. -1. The rest is syncing the new state to ZooKeeper and bumping the version numbers (the partition's leader epoch and the znode version).
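
Condensed into a standalone sketch (the case class and the shrinkIsr helper below are mine, purely to illustrate the decision removeReplicaFromIsr makes):

// Hypothetical illustration, not Kafka code: drop the dead replica from the ISR,
// demote the leader to -1 if the dead replica was the leader, bump epoch and zk version.
case class LeaderAndIsr(leader: Int, leaderEpoch: Int, isr: List[Int], zkVersion: Int)

def shrinkIsr(current: LeaderAndIsr, deadReplica: Int, uncleanElectionEnabled: Boolean): LeaderAndIsr = {
  val newLeader = if (deadReplica == current.leader) -1 else current.leader   // -1 == NoLeader
  val survivors = current.isr.filterNot(_ == deadReplica)
  // If this was the last ISR member and unclean election is disabled, keep the old ISR
  // so that this replica can eventually come back as the leader.
  val newIsr = if (survivors.isEmpty && !uncleanElectionEnabled) current.isr else survivors
  LeaderAndIsr(newLeader, current.leaderEpoch + 1, newIsr, current.zkVersion + 1)
}

// Example matching the experiment below: partition 1 of testSource has Leader: 2, Isr: 2, and broker 2 dies:
// shrinkIsr(LeaderAndIsr(2, 0, List(2), 0), deadReplica = 2, uncleanElectionEnabled = true)
//   == LeaderAndIsr(-1, 1, List(), 1)   -- i.e. the "Leader: -1, Isr:" seen in the describe output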

OK, that covers the main logic for a broker going down.

Now the question: if the number of replicas in a partition's ISR drops to 0, Kafka apparently does nothing further about it.

Reading this, I was frankly shocked, so I spent an entire morning searching the source for how Kafka handles the replica count dropping to 0. I never found any code where Kafka assigns a new machine to serve as the partition leader.

Since practice is the sole criterion of truth, let's just test it and see how Kafka behaves in this situation. The test environment is a Kafka cluster with 5 brokers.

Create a topic named testSource with 2 partitions and a replication factor of 1. A replication factor of 1 is used so that the replica-count-drops-to-0 situation is quicker to reproduce.
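
For reference, the topic would have been created with a command along these lines (same ZooKeeper address as in the describe commands below):

kafka-topics.sh --zookeeper 10.255.0.12:2181 --create --topic testSource --partitions 2 --replication-factor 1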

u928@master:~/yf/deploying/kafka-configs$ kafka-topics.sh --zookeeper 10.255.0.12:2181 --describe  --topic testSource
Topic:testSource	PartitionCount:2	ReplicationFactor:1	Configs:
	Topic: testSource	Partition: 0	Leader: 1	Replicas: 1	Isr: 1
	Topic: testSource	Partition: 1	Leader: 2	Replicas: 2	Isr: 2
Now kill the Kafka process on broker 2.

u928@master:~/yf/deploying/kafka-configs$ kafka-topics.sh --zookeeper 10.255.0.12:2181 --describe  --topic testSource
Topic:testSource	PartitionCount:2	ReplicationFactor:1	Configs:
	Topic: testSource	Partition: 0	Leader: 1	Replicas: 1	Isr: 1
	Topic: testSource	Partition: 1	Leader: -1	Replicas: 2	Isr: 

After broker 2 dies, the leader becomes -1 as expected -- just as mentioned above, with no leader left it is marked as NoLeader = -1 -- and the ISR becomes empty.

Can the topic still serve traffic at this point? Open a console producer and a console consumer:

u928@master:~/yf/deploying/kafka-configs$ kafka-console-producer.sh --broker-list 10.255.0.12:9092 --topic testSource
sdsd
sdsd
sdsd
aaa
u928@slaver2:~$ /opt/kafka0.10/bin/kafka-console-consumer.sh --bootstrap-server master:9092  --topic testSource
sdsd
sdsd
sdsd
aaa

OK, even with one partition down, Kafka still works.

Now kill broker 1 as well, the only remaining broker hosting a replica of this topic.

u928@master:~/yf/deploying/kafka-configs$ kafka-topics.sh --zookeeper 10.255.0.12:2181 --describe  --topic testSource
Topic:testSource	PartitionCount:2	ReplicationFactor:1	Configs:
	Topic: testSource	Partition: 0	Leader: -1	Replicas: 1	Isr: 
	Topic: testSource	Partition: 1	Leader: -1	Replicas: 2	Isr: 
Both partitions' leaders are now -1. Will this topic still work?

u928@master:~/yf/deploying/kafka-configs$ kafka-console-producer.sh --broker-list 10.255.0.12:9092 --topic testSource
dsd
fdfd
gfgfg
sdsd[2017-11-20 14:41:40,532] ERROR Error when sending message to topic testSource with key: null, value: 3 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 2 record(s) for testSource-0 due to 1504 ms has passed since batch creation plus linger time
[2017-11-20 14:41:40,536] ERROR Error when sending message to topic testSource with key: null, value: 5 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 2 record(s) for testSource-0 due to 1504 ms has passed since batch creation plus linger time

/opt/kafka0.10/bin/kafka-console-consumer.sh --bootstrap-server master:9092  --topic testSource
[2017-11-20 14:45:52,852] WARN Auto offset commit failed for group console-consumer-33901: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records. (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
^CProcessed a total of 0 messages
And there it is: this topic has well and truly stopped working.

In other words, the topic no longer serves traffic. Now bring the two brokers back up:

u928@master:~$ kafka-topics.sh --zookeeper 10.255.0.12:2181 --describe  --topic testSource
Topic:testSource	PartitionCount:2	ReplicationFactor:1	Configs:
	Topic: testSource	Partition: 0	Leader: 1	Replicas: 1	Isr: 1
	Topic: testSource	Partition: 1	Leader: 2	Replicas: 2	Isr: 2
The recovered brokers automatically rejoin the Kafka cluster, and service for this topic comes back as well.


Now for the summary:

Hard as it is to believe, Kafka's resilience is not as high as I had imagined. Once a topic has been created, the set of brokers serving it is also fixed (leaving aside deliberately adding partitions or replicas through the admin commands; here we only consider brokers going down or crashing). If all replicas in the ISR of one of a topic's partitions die, that partition stops serving, but clients can still send and receive data through the topic's other partitions. If all partitions of a topic die, then sorry, the whole topic stops serving; producers and consumers either block or raise errors until the designated brokers come back. There is one trick here: the cluster identifies a broker purely by its brokerId, so a machine configured with the same brokerId can take over the old broker's role even if it is not the same physical machine or IP.
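
To illustrate that brokerId trick: a replacement machine only needs to reuse the dead broker's id in its server.properties. A hypothetical sketch (the host address below is made up; only broker.id has to match):

# server.properties on the replacement machine -- hypothetical values except broker.id
broker.id=2                                    # same id as the dead broker, so it takes over that broker's role
listeners=PLAINTEXT://10.255.0.99:9092         # may be a different host/IP than before
log.dirs=/var/lib/kafka-logs
zookeeper.connect=10.255.0.12:2181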

Remember the questions posed at the beginning?

2. If a broker goes down, how are its replicas reassigned?

3. If, because of question 2, a broker becomes a new replica for a topic, what about all the earlier messages it does not have? Does it need to pull them from other brokers, and if so, would the large data volume put a heavy load on the network?

Now we have the answers. Under Kafka's default configuration:

2. Brokers are taken, one after another, from the partition's assigned replicas to serve as the new leader of the ISR, until none are left alive, at which point the partition stops serving.

3. No broker becomes a new replica for a topic because some other server crashed, so the question does not arise.
