The KafkaController Mechanism (Part 12): KafkaController Initialization and Failover

In a Kafka cluster only one Controller becomes the leader that manages the entire cluster. Every other broker still creates a KafkaController object, but all those objects can do is compete to become the new Controller when the current leader fails.
KafkaController startup and failover are closely tied to ZookeeperLeaderElector, which has two particularly important fields:
 

class ZookeeperLeaderElector(controllerContext: ControllerContext,
                             electionPath: String,
                             onBecomingLeader: () => Unit,
                             onResigningAsLeader: () => Unit,
                             brokerId: Int)
  extends LeaderElector with Logging {
  var leaderId = -1 // caches the broker id of the current Controller leader
  // create the election path in ZK, if one does not exist
  val index = electionPath.lastIndexOf("/")
  if (index > 0)
    controllerContext.zkUtils.makeSurePersistentPathExists(electionPath.substring(0, index))
  // LeaderChangeListener watches the data of the /controller node; when the leader id
  // stored in that node changes, the listener reacts as shown below.
  val leaderChangeListener = new LeaderChangeListener
}
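For context, KafkaController creates this elector against the /controller path and hands it its own failover and resignation callbacks. A minimal sketch of that wiring, following the field names used above (treat the exact argument list as an approximation of the 0.10.x source):

// Sketch (inside KafkaController): the elector watches /controller and calls back into
// the controller when this broker wins or loses the election.
private val controllerElector = new ZookeeperLeaderElector(
  controllerContext,
  ZkUtils.ControllerPath,     // "/controller"
  onControllerFailover,       // run when this broker becomes the Controller leader
  onControllerResignation,    // run when this broker has to give up leadership
  config.brokerId)

When the data stored under /controller changes, the inner LeaderChangeListener (an IZkDataListener registered on that path) handles it in handleDataChange(): it records the new leader and, if this broker has just lost leadership, invokes the resignation callback: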
  
def handleDataChange(dataPath: String, data: Object) {
  inLock(controllerContext.controllerLock) {
    val amILeaderBeforeDataChange = amILeader
    // record the broker id of the new Controller
    leaderId = KafkaController.parseControllerId(data.toString)
    info("New leader is %d".format(leaderId))
    // The old leader needs to resign leadership if it is no longer the leader
    // if this broker has just changed from Controller leader to follower, perform the corresponding cleanup
    if (amILeaderBeforeDataChange && !amILeader)
      onResigningAsLeader()
  }
}
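handleDataChange() relies on amILeader to detect the leader-to-follower transition; in the elector this is nothing more than comparing the cached leaderId with this broker's own id (sketch):

// Sketch: this broker considers itself the Controller iff the cached leader id is its own id.
def amILeader: Boolean = leaderId == brokerId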

The other LeaderChangeListener callback, handleDataDeleted(), fires when the /controller node itself is deleted; the old leader resigns (if it was the leader) and a new election is attempted:

def handleDataDeleted(dataPath: String) {
  inLock(controllerContext.controllerLock) {
    debug("%s leader change listener fired for path %s to handle data deleted: trying to elect as a leader"
      .format(brokerId, dataPath))
    if(amILeader)
      onResigningAsLeader()
    // attempt a new round of leader election
    elect
  }
}

ZookeeperLeaderElector.elect() is implemented as follows:

  def elect: Boolean = {
    val timestamp = SystemTime.milliseconds.toString
    val electString = Json.encode(Map("version" -> 1, "brokerid" -> brokerId, "timestamp" -> timestamp))
    // get the Controller leader id currently recorded in ZooKeeper
    leaderId = getControllerID
    /* 
     * We can get here during the initial startup and the handleDeleted ZK callback. Because of the potential race condition, 
     * it's possible that the controller has already been elected when we get here. This check will prevent the following 
     * createEphemeralPath method from getting into an infinite loop if this broker is already the controller.
     */
    // a leader already exists, so give up this election
    if(leaderId != -1) {
       debug("Broker %d has been elected as leader, so stopping the election process.".format(leaderId))
       return amILeader
    }

    try {
      // try to create the ephemeral node; if it already exists a ZkNodeExistsException is thrown
      val zkCheckedEphemeral = new ZKCheckedEphemeral(electionPath,
                                                      electString,
                                                      controllerContext.zkUtils.zkConnection.getZookeeper,
                                                      JaasUtils.isZkSecurityEnabled())
      zkCheckedEphemeral.create()
      info(brokerId + " successfully elected as leader")
      // creation succeeded: this broker is now the leader, so update leaderId
      leaderId = brokerId
      // this actually invokes onControllerFailover()
      onBecomingLeader()
    } catch {
      case e: ZkNodeExistsException =>
        // If someone else has written the path, then
        leaderId = getControllerID 

        if (leaderId != -1)
          debug("Broker %d was elected as leader instead of broker %d".format(leaderId, brokerId))
        else
          warn("A leader has been elected but just resigned, this will result in another round of election")

      case e2: Throwable =>
        error("Error while electing or becoming leader on broker %d".format(brokerId), e2)
        // handle failures from onBecomingLeader(): reset leaderId and delete the /controller path
        resign()
    }
    amILeader
  }
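elect() uses two small helpers worth spelling out: getControllerID() reads whatever broker id is currently stored in /controller (returning -1 when the node does not exist), and resign(), called on the error path above, clears the cached leaderId and deletes the /controller node so that another election round can start. A sketch of both, modeled on the surrounding code:

  // Sketch: read the broker id stored in /controller, or -1 if the node does not exist.
  private def getControllerID(): Int = {
    controllerContext.zkUtils.readDataMaybeNull(electionPath)._1 match {
      case Some(controller) => KafkaController.parseControllerId(controller)
      case None => -1
    }
  }

  // Sketch: give up leadership by clearing the cached id and deleting the ephemeral node.
  def resign(): Boolean = {
    leaderId = -1
    controllerContext.zkUtils.deletePath(electionPath)
  }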

  /**
   * This callback is invoked by the zookeeper leader elector on electing the current broker as the new controller.
   * It does the following things on the become-controller state change -
   * 1. Register controller epoch changed listener
   * 2. Increments the controller epoch
   * 3. Initializes the controller's context object that holds cache objects for current topics, live brokers and
   *    leaders for all existing partitions.
   * 4. Starts the controller's channel manager
   * 5. Starts the replica state machine
   * 6. Starts the partition state machine
   * If it encounters any unexpected exception/error while becoming controller, it resigns as the current controller.
   * This ensures another controller election will be triggered and there will always be an actively serving controller
   */
  def onControllerFailover() {
    if(isRunning) {
      info("Broker %d starting become controller state transition".format(config.brokerId))
      //read controller epoch from zk
      // loads the data at /controller_epoch (ControllerEpochPath) into the ControllerContext
      readControllerEpochFromZookeeper()
      // increment the controller epoch
      // increments the controller epoch and writes it back to ZooKeeper
      incrementControllerEpoch(zkUtils.zkClient)
      // before reading source of truth from zookeeper, register the listeners to get broker/topic callbacks
      // register the series of ZooKeeper listeners introduced earlier
      registerReassignedPartitionsListener()
      registerIsrChangeNotificationListener()
      registerPreferredReplicaElectionListener()
      partitionStateMachine.registerListeners()
      replicaStateMachine.registerListeners()
      // initialize the ControllerContext, mainly by reading topic, partition and replica metadata from ZooKeeper
      initializeControllerContext()
      // start the replicaStateMachine and initialize the state of every replica
      replicaStateMachine.startup()
      // start the partitionStateMachine and initialize the state of every partition
      partitionStateMachine.startup()
      // register the partition change listeners for all existing topics on failover
      // register a partition-change listener for every existing topic
      controllerContext.allTopics.foreach(topic => partitionStateMachine.registerPartitionChangeListener(topic))
      info("Broker %d is ready to serve as the new controller with epoch %d".format(config.brokerId, epoch))
      // update the broker state to RunningAsController
      brokerState.newState(RunningAsController)
      // handle partitions whose replicas are being reassigned
      maybeTriggerPartitionReassignment()
      // handle partitions awaiting preferred replica election
      maybeTriggerPreferredReplicaElection()
      /* send partition leadership info to all live brokers */
      // send an UpdateMetadataRequest to all live brokers
      sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq)
      // start the automatic partition rebalance task if it is enabled in the configuration
      if (config.autoLeaderRebalanceEnable) {
        info("starting the partition rebalance scheduler")
        autoRebalanceScheduler.startup()
        autoRebalanceScheduler.schedule("partition-rebalance-thread", checkAndTriggerPartitionRebalance,
          5, config.leaderImbalanceCheckIntervalSeconds.toLong, TimeUnit.SECONDS)
      }
      deleteTopicManager.start()
    }
    else
      info("Controller has been shut down, aborting startup/failover")
  }
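The first two steps above operate on the epoch stored under /controller_epoch: readControllerEpochFromZookeeper() loads the current value and its ZooKeeper version into the ControllerContext, and incrementControllerEpoch() then bumps the epoch with a conditional update so that a stale controller cannot overwrite a newer one. A rough sketch of the read side, assuming the usual zkUtils helpers (readData returns the node data together with its Stat):

  // Sketch: cache the current controller epoch and its zk version in the ControllerContext.
  private def readControllerEpochFromZookeeper() {
    if (controllerContext.zkUtils.pathExists(ZkUtils.ControllerEpochPath)) {
      val (epochString, stat) = controllerContext.zkUtils.readData(ZkUtils.ControllerEpochPath)
      controllerContext.epoch = epochString.toInt
      controllerContext.epochZkVersion = stat.getVersion
    }
  }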

The initializeControllerContext() method reads the required metadata from ZooKeeper:

  private def initializeControllerContext() {
    // update controller cache with delete topic information
    // read /brokers/ids to obtain the set of live brokers
    controllerContext.liveBrokers = zkUtils.getAllBrokersInCluster().toSet
    // read the /brokers/topics path to obtain all topics
    controllerContext.allTopics = zkUtils.getAllTopics().toSet
    // read the assignment data under /brokers/topics/<topic> and initialize the AR (assigned replicas) of every partition
    controllerContext.partitionReplicaAssignment = zkUtils.getReplicaAssignmentForTopics(controllerContext.allTopics.toSeq)
    controllerContext.partitionLeadershipInfo = new mutable.HashMap[TopicAndPartition, LeaderIsrAndControllerEpoch]
    controllerContext.shuttingDownBrokerIds = mutable.Set.empty[Int]
    // update the leader and isr cache for all existing partitions from Zookeeper
    // initialize the leader and ISR information of every partition
    updateLeaderAndIsrCache()
    // start the channel manager
    // start the ControllerChannelManager
    startChannelManager()
    // read the partitions that need preferred replica election
    initializePreferredReplicaElection()
    // read /admin/reassign_partitions and initialize the partitions whose replicas need to be reassigned
    initializePartitionReassignment()
    // initialize and start the TopicDeletionManager
    initializeTopicDeletion()
    info("Currently active brokers in the cluster: %s".format(controllerContext.liveBrokerIds))
    info("Currently shutting brokers in the cluster: %s".format(controllerContext.shuttingDownBrokerIds))
    info("Current list of topics in the cluster: %s".format(controllerContext.allTopics))
  }
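As a concrete illustration of what ends up in these caches: partitionReplicaAssignment maps each TopicAndPartition to its assigned replica list (AR), whose head is the preferred replica. For a hypothetical topic "test" with two partitions and replication factor 2 on a three-broker cluster, the cache would look roughly like this (values invented purely for illustration):

  import kafka.common.TopicAndPartition

  // Hypothetical cache contents after initializeControllerContext() on a 3-broker cluster.
  val partitionReplicaAssignment: Map[TopicAndPartition, Seq[Int]] = Map(
    TopicAndPartition("test", 0) -> Seq(0, 1),  // AR; the head (broker 0) is the preferred replica
    TopicAndPartition("test", 1) -> Seq(1, 2))  // AR; the head (broker 1) is the preferred replica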

Partition Rebalance
To keep leadership balanced across the cluster, onControllerFailover() starts a scheduled task named partition-rebalance-thread that performs automatic preferred-leader rebalancing. If some brokers go down, partition leadership tends to pile up on the surviving brokers; an overloaded broker then needs leadership moved back through preferred replica election. The task is controlled by auto.leader.rebalance.enable, leader.imbalance.per.broker.percentage and leader.imbalance.check.interval.seconds (the config fields referenced in the code above and below). The periodic check is checkAndTriggerPartitionRebalance():
 

  private def checkAndTriggerPartitionRebalance(): Unit = {
    if (isActive()) {
      trace("checking need to trigger partition rebalance")
      // get all the active brokers
      // for each live broker, collect the partitions whose preferred replica lives on it
      var preferredReplicasForTopicsByBrokers: Map[Int, Map[TopicAndPartition, Seq[Int]]] = null
      inLock(controllerContext.controllerLock) {
        // group partitions by the broker that hosts their preferred (first) replica
        preferredReplicasForTopicsByBrokers =
          controllerContext.partitionReplicaAssignment.filterNot(p => deleteTopicManager.isTopicQueuedUpForDeletion(p._1.topic)).groupBy {
            case(topicAndPartition, assignedReplicas) => assignedReplicas.head
          }
      }
      debug("preferred replicas by broker " + preferredReplicasForTopicsByBrokers)
      // for each broker, check if a preferred replica election needs to be triggered
      // compute the imbalance ratio for each broker
      preferredReplicasForTopicsByBrokers.foreach {
        case(leaderBroker, topicAndPartitionsForBroker) => {
          var imbalanceRatio: Double = 0
          var topicsNotInPreferredReplica: Map[TopicAndPartition, Seq[Int]] = null
          inLock(controllerContext.controllerLock) {
            topicsNotInPreferredReplica =
              topicAndPartitionsForBroker.filter {
                case(topicPartition, replicas) => {
                  controllerContext.partitionLeadershipInfo.contains(topicPartition) &&
                  controllerContext.partitionLeadershipInfo(topicPartition).leaderAndIsr.leader != leaderBroker
                }
              }
            debug("topics not in preferred replica " + topicsNotInPreferredReplica)
            val totalTopicPartitionsForBroker = topicAndPartitionsForBroker.size
            val totalTopicPartitionsNotLedByBroker = topicsNotInPreferredReplica.size
            // partitions not led by their preferred replica, divided by all partitions whose preferred replica is this broker
            imbalanceRatio = totalTopicPartitionsNotLedByBroker.toDouble / totalTopicPartitionsForBroker
            trace("leader imbalance ratio for broker %d is %f".format(leaderBroker, imbalanceRatio))
          }
          // check ratio and if greater than desired ratio, trigger a rebalance for the topic partitions
          // that need to be on this broker
          // if the ratio exceeds the configured threshold, trigger a preferred replica election
          if (imbalanceRatio > (config.leaderImbalancePerBrokerPercentage.toDouble / 100)) {
            topicsNotInPreferredReplica.foreach {
              case(topicPartition, replicas) => {
                inLock(controllerContext.controllerLock) {
                  // do this check only if the broker is live and there are no partitions being reassigned currently
                  // and preferred replica election is not in progress
                  if (controllerContext.liveBrokerIds.contains(leaderBroker) &&
                      controllerContext.partitionsBeingReassigned.size == 0 &&
                      controllerContext.partitionsUndergoingPreferredReplicaElection.size == 0 &&
                      !deleteTopicManager.isTopicQueuedUpForDeletion(topicPartition.topic) &&
                      controllerContext.allTopics.contains(topicPartition.topic)) {
                    // trigger the preferred replica election
                    onPreferredReplicaElection(Set(topicPartition), true)
                  }
                }
              }
            }
          }
        }
      }
    }
  }
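To make the ratio concrete: suppose broker 1 is the preferred replica for 4 partitions but currently leads only 3 of them. The imbalance ratio is 1/4 = 25%, which exceeds the default 10% threshold (leader.imbalance.per.broker.percentage), so a preferred replica election is triggered for the out-of-place partition. A toy calculation with invented numbers:

  // Toy example: broker 1 is the preferred leader of 4 partitions but actually leads only 3.
  val totalTopicPartitionsForBroker = 4
  val totalTopicPartitionsNotLedByBroker = 1
  val imbalanceRatio = totalTopicPartitionsNotLedByBroker.toDouble / totalTopicPartitionsForBroker
  // imbalanceRatio == 0.25; with leader.imbalance.per.broker.percentage = 10,
  // 0.25 > 10 / 100.0 holds, so onPreferredReplicaElection() is called for that partition.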

onControllerResignation
When the LeaderChangeListener observes that the data in /controller has been deleted, or that another broker has taken over leadership, the old leader invokes the resignation callback to clean up; that callback is the onControllerResignation() method:

  def onControllerResignation() {
    debug("Controller resigning, broker id %d".format(config.brokerId))
    // de-register listeners
    // unregister the ZooKeeper listeners
    deregisterIsrChangeNotificationListener()
    deregisterReassignedPartitionsListener()
    deregisterPreferredReplicaElectionListener()

    // shutdown delete topic manager
    if (deleteTopicManager != null)
      deleteTopicManager.shutdown()

    // shutdown leader rebalance scheduler
    if (config.autoLeaderRebalanceEnable)
      autoRebalanceScheduler.shutdown()

    inLock(controllerContext.controllerLock) {
      // de-register partition ISR listener for on-going partition reassignment task
      // remove the ISR-change listeners registered for partitions that are being reassigned
      deregisterReassignedPartitionsIsrChangeListeners()
      // shutdown partition state machine
      partitionStateMachine.shutdown()
      // shutdown replica state machine
      replicaStateMachine.shutdown()
      // shutdown controller channel manager
      if(controllerContext.controllerChannelManager != null) {
        controllerContext.controllerChannelManager.shutdown()
        controllerContext.controllerChannelManager = null
      }
      // reset controller context
      controllerContext.epoch=0
      controllerContext.epochZkVersion=0
      // switch the broker state back to RunningAsBroker
      brokerState.newState(RunningAsBroker)

      info("Broker %d resigned as the controller".format(config.brokerId))
    }
  }
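Finally, both halves of this lifecycle are wired together at broker startup: KafkaController.startup() registers a session-expiration listener and starts the elector, and the elector's own startup subscribes the LeaderChangeListener on /controller and immediately attempts an election. A sketch of the two entry points, following the structure of the code shown above:

  // Sketch: KafkaController.startup() registers the session expiration listener,
  // marks the controller as running and starts the leader elector.
  def startup() = {
    inLock(controllerContext.controllerLock) {
      info("Controller starting up")
      registerSessionExpirationListener()
      isRunning = true
      controllerElector.startup
      info("Controller startup complete")
    }
  }

  // Sketch: ZookeeperLeaderElector.startup watches /controller and tries to become leader right away.
  def startup {
    inLock(controllerContext.controllerLock) {
      controllerContext.zkUtils.zkClient.subscribeDataChanges(electionPath, leaderChangeListener)
      elect
    }
  }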

 
