Kafka 3.0 Source Code Notes (10) - Source Code Analysis of Leader-Follower Message Data Synchronization on the Kafka Server

Table of Contents

  • Preface
  • 1. The Flow of Leader-Follower Message Data Synchronization
  • 2. Source Code Analysis of Leader-Follower Message Data Synchronization
      • 2.1 Publishing Metadata Changes
      • 2.2 Consuming and Applying the Changed Metadata
      • 2.3 Message Data Synchronization Between Leader and Follower Replicas

Preface

At the end of Kafka 3.0 Source Code Notes (9) - Leader-Follower Synchronization of Metadata on the Kafka Server, the author mentioned that once metadata leader-follower synchronization completes, the metadata changes only affect the cluster after the broker module has observed and processed them. Using topic creation as the entry point, this article analyzes that process from the perspective of leader-follower synchronization of partition replica message data. Seen against Kafka's overall design, the complete produce/consume flow after a topic is created is shown in the figure below:

[Figure 1]

1. The Flow of Leader-Follower Message Data Synchronization

[Figure 2]

The figure above shows how, after a message is written, the follower replica synchronizes both the message data and the HW (high watermark) through Fetch requests. The process falls roughly into 4 stages:

  1. Initial state
    When the leader replica and all follower replicas of a partition hold the same messages, HW and LEO point to the same position. In the example figure, both the leader and the follower have stored the message at Offset=0, and both HW and LEO point to the not-yet-written position Offset=1
  2. Message write
    The node hosting the leader replica receives a produce request and appends the message to its local partition replica, after which the leader's LEO points to Offset=2. At the same time, the leader replica also attempts to update the HW of its local log, although at this stage the HW is not actually advanced. The HW update algorithm is as follows; every later mention of "attempting to update the local log's HW" refers to this procedure (see the sketch after this list):
    1. Iterate over remoteReplicasMap, which holds the partition's remote replicas, and take the smallest LEO among all replicas that are in the ISR or that are lagging but still catching up with the leader (as determined by the replica.lag.time.max.ms config) as the new HW candidate, i.e. new HW = min(LEOs). Note that this algorithm effectively imposes a strong-consistency requirement on leader-follower synchronization: a message only becomes externally visible and consumable once every active partition replica has stored it
    2. To keep the partition HW monotonically non-decreasing, compare the leader's local old HW with the new HW: the partition HW is updated to new HW only if old HW < new HW, otherwise it is left unchanged
  3. First Fetch request round trip
    The node hosting the follower replica periodically sends Fetch requests to the leader through its Fetcher thread to synchronize messages, and the request carries the local partition replica's LEO=1. On receiving the request, the node hosting the leader replica updates that follower's state in the target partition's remoteReplicasMap and attempts to update the local log's HW. As long as this follower is not the last one to catch up, the leader does not advance its local HW and simply returns the message records. When handling the Fetch response, the follower only appends the messages to its local log and moves its LEO to Offset=2
  4. Second Fetch request round trip
    Similar to the first round trip, except that the Fetcher thread's request now carries the local partition replica's LEO=2. Assuming all follower replicas have by now stored the new message, the leader's attempt to update the local log's HW succeeds and its HW moves to Offset=2, which is returned to the follower in the Fetch response. The follower then updates its local HW to Offset=2 based on the leader's HW, completing HW synchronization
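Below is a minimal, self-contained sketch of the HW-advancement rule described in stage 2. It is not Kafka's actual implementation; ReplicaState and maybeAdvanceHighWatermark are hypothetical names that only mirror the min-LEO / monotonic-advance idea (the real logic lives in Partition.scala#maybeIncrementLeaderHW(), analyzed in section 2.2):

    // Illustrative sketch only (not Kafka source): derive the leader's new HW candidate
    // from replica LEOs, then advance the HW only if the candidate is larger.
    object HighWatermarkSketch {
      final case class ReplicaState(brokerId: Int, leo: Long, lastCaughtUpTimeMs: Long)

      def maybeAdvanceHighWatermark(leaderLeo: Long,
                                    oldHw: Long,
                                    remoteReplicas: Iterable[ReplicaState],
                                    isr: Set[Int],
                                    replicaLagTimeMaxMs: Long,
                                    nowMs: Long): Long = {
        // Start from the leader's own LEO, then lower the candidate to the smallest LEO
        // among replicas that are in the ISR or are still considered to be catching up.
        var newHw = leaderLeo
        remoteReplicas.foreach { replica =>
          val catchingUp = nowMs - replica.lastCaughtUpTimeMs <= replicaLagTimeMaxMs
          if (replica.leo < newHw && (isr.contains(replica.brokerId) || catchingUp))
            newHw = replica.leo
        }
        // The partition HW is monotonically non-decreasing: only advance, never move back.
        if (newHw > oldHw) newHw else oldHw
      }
    }

Applied to the example above: with leader LEO=2 and one follower still at LEO=1, the candidate is 1 and the HW stays where it is; once every follower reports LEO=2, the candidate becomes 2 and the HW advances to Offset=2.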

2. Source Code Analysis of Leader-Follower Message Data Synchronization

After a new topic is added, getting message data synchronized between its leader and follower replicas roughly breaks down into the following phases:

  1. Publishing metadata changes
  2. Consuming and applying the changed metadata
  3. Message data synchronization between leader and follower replicas

[Figure 3]

2.1 Publishing Metadata Changes

  1. During BrokerServer startup, KafkaRaftManager.scala#register() is called to register the BrokerMetadataListener with the KafkaRaftClient. Whenever the HW of the metadata partition advances, BrokerMetadataListener.scala#handleCommit() is invoked to notify the listener. Its source is shown below, and the key steps are straightforward:

    1. Create a HandleCommitsEvent object, an asynchronous event wrapping the metadata records; when the event is processed, its HandleCommitsEvent#run() method is executed
    2. Call KafkaEventQueue.java#append() to put the event on the asynchronous queue. For how this event queue works, refer to Kafka 3.0 Source Code Notes (8) - How the Kafka Server Cluster Leader Handles CreateTopics Requests; it is not repeated here
    override def handleCommit(reader: BatchReader[ApiMessageAndVersion]): Unit =
     eventQueue.append(new HandleCommitsEvent(reader))
    
  2. HandleCommitsEvent#run() is fairly concise; its core logic is:

    1. Call BrokerMetadataListener.scala#loadBatches() to parse the metadata records and replay them into the MetadataDelta data structure
    2. Call BrokerMetadataListener.scala#publish() to publish the metadata through the metadata publisher
    class HandleCommitsEvent(reader: BatchReader[ApiMessageAndVersion])
       extends EventQueue.FailureLoggingEvent(log) {
     override def run(): Unit = {
       val results = try {
         val loadResults = loadBatches(_delta, reader)
         if (isDebugEnabled) {
           debug(s"Loaded new commits: ${loadResults}")
         }
         loadResults
       } finally {
         reader.close()
       }
       _publisher.foreach(publish(_, results.highestMetadataOffset))
    
       snapshotter.foreach { snapshotter =>
         _bytesSinceLastSnapshot = _bytesSinceLastSnapshot + results.numBytes
         if (shouldSnapshot()) {
           if (snapshotter.maybeStartSnapshot(results.highestMetadataOffset,
             _highestEpoch,
             _highestTimestamp,
             _delta.apply())) {
             _bytesSinceLastSnapshot = 0L
           }
         }
       }
     }
    }
    
  3. The internal logic of BrokerMetadataListener.scala#loadBatches() is simple: it iterates over the list of metadata records and calls MetadataDelta.java#replay() to load each one

    private def loadBatches(delta: MetadataDelta,
                           iterator: util.Iterator[Batch[ApiMessageAndVersion]]): BatchLoadResults = {
     val startTimeNs = time.nanoseconds()
     var numBatches = 0
     var numRecords = 0
     var batch: Batch[ApiMessageAndVersion] = null
     var numBytes = 0L
     while (iterator.hasNext()) {
       batch = iterator.next()
       var index = 0
       batch.records().forEach { messageAndVersion =>
         if (isTraceEnabled) {
           trace("Metadata batch %d: processing [%d/%d]: %s.".format(batch.lastOffset, index + 1,
             batch.records().size(), messageAndVersion.message().toString()))
         }
         delta.replay(messageAndVersion.message())
         numRecords += 1
         index += 1
       }
       numBytes = numBytes + batch.sizeInBytes()
       metadataBatchSizeHist.update(batch.records().size())
       numBatches = numBatches + 1
     }
     val newHighestMetadataOffset = if (batch == null) {
       _highestMetadataOffset
     } else {
       _highestMetadataOffset = batch.lastOffset()
       _highestEpoch = batch.epoch()
       _highestTimestamp = batch.appendTimestamp()
       batch.lastOffset()
     }
     val endTimeNs = time.nanoseconds()
     val elapsedUs = TimeUnit.MICROSECONDS.convert(endTimeNs - startTimeNs, TimeUnit.NANOSECONDS)
     batchProcessingTimeHist.update(elapsedUs)
     BatchLoadResults(numBatches, numRecords, elapsedUs, numBytes, newHighestMetadataOffset)
    }
    
  4. MetadataDelta.java#replay() dispatches on the type of the metadata record. A newly created topic produces a record of type TOPIC_RECORD, which triggers the corresponding MetadataDelta.java#replay() overload

    public void replay(ApiMessage record) {
         MetadataRecordType type = MetadataRecordType.fromId(record.apiKey());
         switch (type) {
             case REGISTER_BROKER_RECORD:
                 replay((RegisterBrokerRecord) record);
                 break;
             case UNREGISTER_BROKER_RECORD:
                 replay((UnregisterBrokerRecord) record);
                 break;
             case TOPIC_RECORD:
                 replay((TopicRecord) record);
                 break;
             case PARTITION_RECORD:
                 replay((PartitionRecord) record);
                 break;
             case CONFIG_RECORD:
                 replay((ConfigRecord) record);
                 break;
             case PARTITION_CHANGE_RECORD:
                 replay((PartitionChangeRecord) record);
                 break;
             case FENCE_BROKER_RECORD:
                 replay((FenceBrokerRecord) record);
                 break;
             case UNFENCE_BROKER_RECORD:
                 replay((UnfenceBrokerRecord) record);
                 break;
             case REMOVE_TOPIC_RECORD:
                 replay((RemoveTopicRecord) record);
                 break;
             case FEATURE_LEVEL_RECORD:
                 replay((FeatureLevelRecord) record);
                 break;
             case CLIENT_QUOTA_RECORD:
                 replay((ClientQuotaRecord) record);
                 break;
             case PRODUCER_IDS_RECORD:
                 // Nothing to do.
                 break;
             case REMOVE_FEATURE_LEVEL_RECORD:
                 replay((RemoveFeatureLevelRecord) record);
                 break;
             case BROKER_REGISTRATION_CHANGE_RECORD:
                 replay((BrokerRegistrationChangeRecord) record);
                 break;
             default:
                 throw new RuntimeException("Unknown metadata record type " + type);
         }
     }
    
    
  5. The MetadataDelta.java#replay() overload that handles TOPIC_RECORD is shown below; its core action is to call TopicsDelta.java#replay()

     public void replay(TopicRecord record) {
         if (topicsDelta == null) topicsDelta = new TopicsDelta(image.topics());
         topicsDelta.replay(record);
     }
    
  6. At this point TopicsDelta.java#replay() merely stashes the topic metadata carried by the record; it is actually used later

     public void replay(TopicRecord record) {
         TopicDelta delta = new TopicDelta(
             new TopicImage(record.name(), record.topicId(), Collections.emptyMap()));
         changedTopics.put(record.topicId(), delta);
     }
    
  7. Returning to sub-step 2 of step 2 in this section, BrokerMetadataListener.scala#publish() publishes the metadata through the metadata publisher, which triggers BrokerMetadataPublisher.scala#publish(). With that, publishing the metadata change is essentially complete

    private def publish(publisher: MetadataPublisher,
                       newHighestMetadataOffset: Long): Unit = {
     val delta = _delta
     _image = _delta.apply()
     _delta = new MetadataDelta(_image)
     publisher.publish(newHighestMetadataOffset, delta, _image)
    }
    

2.2 Consuming and Applying the Changed Metadata

  1. The implementation of BrokerMetadataPublisher.scala#publish() is shown below; the key steps are:

    1. If this is the first time metadata changes are published, BrokerMetadataPublisher.scala#initializeManagers() must be called to perform initialization. This mostly starts scheduled tasks, including periodic log flushing and recovery-point checks for log files, ISR shrinking of expired replicas in the replica manager, and the group coordinator's cleanup of expired consumer group metadata
    2. Then compute the metadata changes and handle them accordingly. This article follows topic changes, which trigger ReplicaManager.scala#applyDelta(), as the example
    override def publish(newHighestMetadataOffset: Long,
                        delta: MetadataDelta,
                        newImage: MetadataImage): Unit = {
     try {
       // Publish the new metadata image to the metadata cache.
       metadataCache.setImage(newImage)
    
       if (_firstPublish) {
         info(s"Publishing initial metadata at offset ${newHighestMetadataOffset}.")
    
         // If this is the first metadata update we are applying, initialize the managers
         // first (but after setting up the metadata cache).
         initializeManagers()
       } else if (isDebugEnabled) {
         debug(s"Publishing metadata at offset ${newHighestMetadataOffset}.")
       }
    
       // Apply feature deltas.
       Option(delta.featuresDelta()).foreach { featuresDelta =>
         featureCache.update(featuresDelta, newHighestMetadataOffset)
       }
    
       // Apply topic deltas.
       Option(delta.topicsDelta()).foreach { topicsDelta =>
         // Notify the replica manager about changes to topics.
         replicaManager.applyDelta(newImage, topicsDelta)
    
         // Handle the case where the old consumer offsets topic was deleted.
         if (topicsDelta.topicWasDeleted(Topic.GROUP_METADATA_TOPIC_NAME)) {
           topicsDelta.image().getTopic(Topic.GROUP_METADATA_TOPIC_NAME).partitions().entrySet().forEach {
             entry =>
               if (entry.getValue().leader == brokerId) {
                 groupCoordinator.onResignation(entry.getKey(), Some(entry.getValue().leaderEpoch))
               }
           }
         }
         // Handle the case where we have new local leaders or followers for the consumer
         // offsets topic.
         getTopicDelta(Topic.GROUP_METADATA_TOPIC_NAME, newImage, delta).foreach { topicDelta =>
           val changes = topicDelta.localChanges(brokerId)
    
           changes.deletes.forEach { topicPartition =>
             groupCoordinator.onResignation(topicPartition.partition, None)
           }
           changes.leaders.forEach { (topicPartition, partitionInfo) =>
             groupCoordinator.onElection(topicPartition.partition, partitionInfo.partition.leaderEpoch)
           }
           changes.followers.forEach { (topicPartition, partitionInfo) =>
             groupCoordinator.onResignation(topicPartition.partition, Some(partitionInfo.partition.leaderEpoch))
           }
         }
    
         // Handle the case where the old transaction state topic was deleted.
         if (topicsDelta.topicWasDeleted(Topic.TRANSACTION_STATE_TOPIC_NAME)) {
           topicsDelta.image().getTopic(Topic.TRANSACTION_STATE_TOPIC_NAME).partitions().entrySet().forEach {
             entry =>
               if (entry.getValue().leader == brokerId) {
                 txnCoordinator.onResignation(entry.getKey(), Some(entry.getValue().leaderEpoch))
               }
           }
         }
         // If the transaction state topic changed in a way that's relevant to this broker,
         // notify the transaction coordinator.
         getTopicDelta(Topic.TRANSACTION_STATE_TOPIC_NAME, newImage, delta).foreach { topicDelta =>
           val changes = topicDelta.localChanges(brokerId)
    
           changes.deletes.forEach { topicPartition =>
             txnCoordinator.onResignation(topicPartition.partition, None)
           }
           changes.leaders.forEach { (topicPartition, partitionInfo) =>
             txnCoordinator.onElection(topicPartition.partition, partitionInfo.partition.leaderEpoch)
           }
           changes.followers.forEach { (topicPartition, partitionInfo) =>
             txnCoordinator.onResignation(topicPartition.partition, Some(partitionInfo.partition.leaderEpoch))
           }
         }
    
         // Notify the group coordinator about deleted topics.
         val deletedTopicPartitions = new mutable.ArrayBuffer[TopicPartition]()
         topicsDelta.deletedTopicIds().forEach { id =>
           val topicImage = topicsDelta.image().getTopic(id)
           topicImage.partitions().keySet().forEach {
             id => deletedTopicPartitions += new TopicPartition(topicImage.name(), id)
           }
         }
         if (deletedTopicPartitions.nonEmpty) {
           groupCoordinator.handleDeletedPartitions(deletedTopicPartitions, RequestLocal.NoCaching)
         }
       }
    
       // Apply configuration deltas.
       Option(delta.configsDelta()).foreach { configsDelta =>
         configsDelta.changes().keySet().forEach { configResource =>
           val tag = configResource.`type`() match {
             case ConfigResource.Type.TOPIC => Some(ConfigType.Topic)
             case ConfigResource.Type.BROKER => Some(ConfigType.Broker)
             case _ => None
           }
           tag.foreach { t =>
             val newProperties = newImage.configs().configProperties(configResource)
             val maybeDefaultName = configResource.name() match {
               case "" => ConfigEntityName.Default
               case k => k
             }
             dynamicConfigHandlers(t).processConfigChanges(maybeDefaultName, newProperties)
           }
         }
       }
    
       // Apply client quotas delta.
       Option(delta.clientQuotasDelta()).foreach { clientQuotasDelta =>
         clientQuotaMetadataManager.update(clientQuotasDelta)
       }
    
       if (_firstPublish) {
         finishInitializingReplicaManager(newImage)
       }
     } catch {
       case t: Throwable => error(s"Error publishing broker metadata at ${newHighestMetadataOffset}", t)
         throw t
     } finally {
       _firstPublish = false
     }
    }
    
  2. The source of ReplicaManager.scala#applyDelta() is shown below; the key processing consists of the following steps:

    1. First call TopicsDelta.java#localChanges() to compute the topic changes carried by the metadata
    2. Once the topic changes are known, if the current node has been assigned as the leader replica for some partitions, ReplicaManager.scala#applyLocalLeadersDelta() is called to handle them; if the node has also been assigned follower replicas for some partitions, ReplicaManager.scala#applyLocalFollowersDelta() handles those
      def applyDelta(newImage: MetadataImage, delta: TopicsDelta): Unit = {
     // Before taking the lock, compute the local changes
     val localChanges = delta.localChanges(config.nodeId)
    
     replicaStateChangeLock.synchronized {
       // Handle deleted partitions. We need to do this first because we might subsequently
       // create new partitions with the same names as the ones we are deleting here.
       if (!localChanges.deletes.isEmpty) {
         val deletes = localChanges.deletes.asScala.map(tp => (tp, true)).toMap
         stateChangeLogger.info(s"Deleting ${deletes.size} partition(s).")
         stopPartitions(deletes).foreach { case (topicPartition, e) =>
           if (e.isInstanceOf[KafkaStorageException]) {
             stateChangeLogger.error(s"Unable to delete replica ${topicPartition} because " +
               "the local replica for the partition is in an offline log directory")
           } else {
             stateChangeLogger.error(s"Unable to delete replica ${topicPartition} because " +
               s"we got an unexpected ${e.getClass.getName} exception: ${e.getMessage}")
           }
         }
       }
    
       // Handle partitions which we are now the leader or follower for.
       if (!localChanges.leaders.isEmpty || !localChanges.followers.isEmpty) {
         val lazyOffsetCheckpoints = new LazyOffsetCheckpoints(this.highWatermarkCheckpoints)
         val changedPartitions = new mutable.HashSet[Partition]
         if (!localChanges.leaders.isEmpty) {
           applyLocalLeadersDelta(changedPartitions, delta, lazyOffsetCheckpoints, localChanges.leaders.asScala)
         }
         if (!localChanges.followers.isEmpty) {
           applyLocalFollowersDelta(changedPartitions, newImage, delta, lazyOffsetCheckpoints, localChanges.followers.asScala)
         }
         maybeAddLogDirFetchers(changedPartitions, lazyOffsetCheckpoints,
           name => Option(newImage.topics().getTopic(name)).map(_.id()))
    
         def markPartitionOfflineIfNeeded(tp: TopicPartition): Unit = {
           /*
            * If there is offline log directory, a Partition object may have been created by getOrCreatePartition()
            * before getOrCreateReplica() failed to create local replica due to KafkaStorageException.
            * In this case ReplicaManager.allPartitions will map this topic-partition to an empty Partition object.
            * we need to map this topic-partition to OfflinePartition instead.
            */
           if (localLog(tp).isEmpty)
             markPartitionOffline(tp)
         }
         localChanges.leaders.keySet.forEach(markPartitionOfflineIfNeeded)
         localChanges.followers.keySet.forEach(markPartitionOfflineIfNeeded)
    
         replicaFetcherManager.shutdownIdleFetcherThreads()
         replicaAlterLogDirsManager.shutdownIdleFetcherThreads()
       }
     }
    }
    
  3. TopicsDelta.java#localChanges() is shown below. As its javadoc notes, it computes the following 3 kinds of topic changes that need to be applied on this node. The per-topic change computation is done by TopicDelta.java#localChanges(), which is not covered in this article; an illustrative sketch of the classification idea follows the code below

    1. Local replicas the current node must delete
    2. New leader replicas the current node must now host
    3. New follower replicas the current node must now host
        /**
      * Find the topic partitions that have change based on the replica given.
      *
      * The changes identified are:
      *   1. topic partitions for which the broker is not a replica anymore
      *   2. topic partitions for which the broker is now the leader
      *   3. topic partitions for which the broker is now a follower
      *
      * @param brokerId the broker id
      * @return the list of topic partitions which the broker should remove, become leader or become follower.
      */
     public LocalReplicaChanges localChanges(int brokerId) {
         Set<TopicPartition> deletes = new HashSet<>();
         Map<TopicPartition, LocalReplicaChanges.PartitionInfo> leaders = new HashMap<>();
         Map<TopicPartition, LocalReplicaChanges.PartitionInfo> followers = new HashMap<>();
    
         for (TopicDelta delta : changedTopics.values()) {
             LocalReplicaChanges changes = delta.localChanges(brokerId);
    
             deletes.addAll(changes.deletes());
             leaders.putAll(changes.leaders());
             followers.putAll(changes.followers());
         }
    
         // Add all of the removed topic partitions to the set of locally removed partitions
         deletedTopicIds().forEach(topicId -> {
             TopicImage topicImage = image().getTopic(topicId);
             topicImage.partitions().forEach((partitionId, prevPartition) -> {
                 if (Replicas.contains(prevPartition.replicas, brokerId)) {
                     deletes.add(new TopicPartition(topicImage.name(), partitionId));
                 }
             });
         });
    
         return new LocalReplicaChanges(deletes, leaders, followers);
     }
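As a complement, here is an illustrative sketch (not the real TopicDelta implementation) of the per-topic classification idea: given a broker id, the previous partition assignments and the new ones, decide which local partitions to delete, lead, or follow. PartitionState, LocalChanges and classifyLocalChanges are hypothetical names, and the sketch ignores details the real code also tracks, such as whether a partition actually changed in this delta:

    // Illustrative sketch only: classify one topic's partitions for a given broker.
    object LocalChangesSketch {
      final case class PartitionState(partitionId: Int, leader: Int, replicas: Seq[Int])
      final case class LocalChanges(deletes: Set[Int], leaders: Set[Int], followers: Set[Int])

      def classifyLocalChanges(brokerId: Int,
                               previous: Map[Int, PartitionState],
                               current: Map[Int, PartitionState]): LocalChanges = {
        // Partitions this broker used to host but is no longer a replica of.
        val deletes = previous.collect {
          case (id, prev) if prev.replicas.contains(brokerId) &&
            !current.get(id).exists(_.replicas.contains(brokerId)) => id
        }.toSet
        // Partitions where this broker is now the leader.
        val leaders = current.collect {
          case (id, p) if p.replicas.contains(brokerId) && p.leader == brokerId => id
        }.toSet
        // Partitions where this broker is a replica but not the leader.
        val followers = current.collect {
          case (id, p) if p.replicas.contains(brokerId) && p.leader != brokerId => id
        }.toSet
        LocalChanges(deletes, leaders, followers)
      }
    }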
    
  4. With the metadata changes computed, return to sub-step 2 of step 2 in this section: if the current node has new leader replicas to host, ReplicaManager.scala#applyLocalLeadersDelta() is triggered. Its implementation is shown below, and the core processing is simple:

    1. Iterate over the new leader list and call ReplicaManager.scala#getOrCreatePartition() to create a local Partition object for each
    2. Call Partition.scala#makeLeader() to make the newly created Partition object the leader replica of the partition
    private def applyLocalLeadersDelta(
     changedPartitions: mutable.Set[Partition],
     delta: TopicsDelta,
     offsetCheckpoints: OffsetCheckpoints,
     newLocalLeaders: mutable.Map[TopicPartition, LocalReplicaChanges.PartitionInfo]
    ): Unit = {
     stateChangeLogger.info(s"Transitioning ${newLocalLeaders.size} partition(s) to " +
       "local leaders.")
     replicaFetcherManager.removeFetcherForPartitions(newLocalLeaders.keySet)
     newLocalLeaders.forKeyValue { case (tp, info) =>
       getOrCreatePartition(tp, delta, info.topicId).foreach { case (partition, isNew) =>
         try {
           val state = info.partition.toLeaderAndIsrPartitionState(tp, isNew)
           if (!partition.makeLeader(state, offsetCheckpoints, Some(info.topicId))) {
             stateChangeLogger.info("Skipped the become-leader state change for " +
               s"${tp} with topic id ${info.topicId} because this partition is " +
               "already a local leader.")
           }
           changedPartitions.add(partition)
         } catch {
           case e: KafkaStorageException =>
             stateChangeLogger.info(s"Skipped the become-leader state change for ${tp} " +
               s"with topic id ${info.topicId} due to disk error ${e}")
             val dirOpt = getLogDir(tp)
             error(s"Error while making broker the leader for partition ${tp} in dir " +
               s"${dirOpt}", e)
         }
       }
     }
    }
    
  5. The flow of Partition.scala#makeLeader() is reasonably clear; the key steps are:

    1. Call Partition.scala#updateAssignmentAndIsr() to update the ISR list of the partition's leader replica and the internal remote replica list remoteReplicasMap
    2. Call Partition.scala#createLogIfNotExists() to create the local log file for the leader replica; nothing is created if the file already exists
    3. Call Log.scala#maybeAssignEpochStartOffset() to record the partition's new leader epoch and start offset, which is later used during failure recovery; interested readers can refer to Kafka 3.0 Source Code Notes (12) - Source Code Analysis of the Partition Failure Recovery Mechanism on the Kafka Server
    4. Call Partition.scala#maybeIncrementLeaderHW() to attempt to update the partition's HW
    def makeLeader(partitionState: LeaderAndIsrPartitionState,
                  highWatermarkCheckpoints: OffsetCheckpoints,
                  topicId: Option[Uuid]): Boolean = {
     val (leaderHWIncremented, isNewLeader) = inWriteLock(leaderIsrUpdateLock) {
       // record the epoch of the controller that made the leadership decision. This is useful while updating the isr
       // to maintain the decision maker controller's epoch in the zookeeper path
       controllerEpoch = partitionState.controllerEpoch
    
       val isr = partitionState.isr.asScala.map(_.toInt).toSet
       val addingReplicas = partitionState.addingReplicas.asScala.map(_.toInt)
       val removingReplicas = partitionState.removingReplicas.asScala.map(_.toInt)
    
       updateAssignmentAndIsr(
         assignment = partitionState.replicas.asScala.map(_.toInt),
         isr = isr,
         addingReplicas = addingReplicas,
         removingReplicas = removingReplicas
       )
       try {
         createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints, topicId)
       } catch {
         case e: ZooKeeperClientException =>
           stateChangeLogger.error(s"A ZooKeeper client exception has occurred and makeLeader will be skipping the " +
             s"state change for the partition $topicPartition with leader epoch: $leaderEpoch ", e)
    
           return false
       }
    
       val leaderLog = localLogOrException
       val leaderEpochStartOffset = leaderLog.logEndOffset
       stateChangeLogger.info(s"Leader $topicPartition starts at leader epoch ${partitionState.leaderEpoch} from " +
         s"offset $leaderEpochStartOffset with high watermark ${leaderLog.highWatermark} " +
         s"ISR ${isr.mkString("[", ",", "]")} addingReplicas ${addingReplicas.mkString("[", ",", "]")} " +
         s"removingReplicas ${removingReplicas.mkString("[", ",", "]")}. Previous leader epoch was $leaderEpoch.")
    
       //We cache the leader epoch here, persisting it only if it's local (hence having a log dir)
       leaderEpoch = partitionState.leaderEpoch
       leaderEpochStartOffsetOpt = Some(leaderEpochStartOffset)
       zkVersion = partitionState.zkVersion
    
       // In the case of successive leader elections in a short time period, a follower may have
       // entries in its log from a later epoch than any entry in the new leader's log. In order
       // to ensure that these followers can truncate to the right offset, we must cache the new
       // leader epoch and the start offset since it should be larger than any epoch that a follower
       // would try to query.
       leaderLog.maybeAssignEpochStartOffset(leaderEpoch, leaderEpochStartOffset)
    
       val isNewLeader = !isLeader
       val curTimeMs = time.milliseconds
       // initialize lastCaughtUpTime of replicas as well as their lastFetchTimeMs and lastFetchLeaderLogEndOffset.
       remoteReplicas.foreach { replica =>
         val lastCaughtUpTimeMs = if (isrState.isr.contains(replica.brokerId)) curTimeMs else 0L
         replica.resetLastCaughtUpTime(leaderEpochStartOffset, curTimeMs, lastCaughtUpTimeMs)
       }
    
       if (isNewLeader) {
         // mark local replica as the leader after converting hw
         leaderReplicaIdOpt = Some(localBrokerId)
         // reset log end offset for remote replicas
         remoteReplicas.foreach { replica =>
           replica.updateFetchState(
             followerFetchOffsetMetadata = LogOffsetMetadata.UnknownOffsetMetadata,
             followerStartOffset = Log.UnknownOffset,
             followerFetchTimeMs = 0L,
             leaderEndOffset = Log.UnknownOffset)
         }
       }
       // we may need to increment high watermark since ISR could be down to 1
       (maybeIncrementLeaderHW(leaderLog), isNewLeader)
     }
     // some delayed operations may be unblocked after HW changed
     if (leaderHWIncremented)
       tryCompleteDelayedRequests()
     isNewLeader
    }
    
  6. The implementation of Partition.scala#maybeIncrementLeaderHW() is shown below; the algorithm was described in section 1 above (The Flow of Leader-Follower Message Data Synchronization) and is not repeated here

     private def maybeIncrementLeaderHW(leaderLog: Log, curTime: Long = time.milliseconds): Boolean = {
     // maybeIncrementLeaderHW is in the hot path, the following code is written to
     // avoid unnecessary collection generation
     var newHighWatermark = leaderLog.logEndOffsetMetadata
     remoteReplicasMap.values.foreach { replica =>
       // Note here we are using the "maximal", see explanation above
       if (replica.logEndOffsetMetadata.messageOffset < newHighWatermark.messageOffset &&
         (curTime - replica.lastCaughtUpTimeMs <= replicaLagTimeMaxMs || isrState.maximalIsr.contains(replica.brokerId))) {
         newHighWatermark = replica.logEndOffsetMetadata
       }
     }
    
     leaderLog.maybeIncrementHighWatermark(newHighWatermark) match {
       case Some(oldHighWatermark) =>
         debug(s"High watermark updated from $oldHighWatermark to $newHighWatermark")
         true
    
       case None =>
         def logEndOffsetString: ((Int, LogOffsetMetadata)) => String = {
           case (brokerId, logEndOffsetMetadata) => s"replica $brokerId: $logEndOffsetMetadata"
         }
    
         if (isTraceEnabled) {
           val replicaInfo = remoteReplicas.map(replica => (replica.brokerId, replica.logEndOffsetMetadata)).toSet
           val localLogInfo = (localBrokerId, localLogOrException.logEndOffsetMetadata)
           trace(s"Skipping update high watermark since new hw $newHighWatermark is not larger than old value. " +
             s"All current LEOs are ${(replicaInfo + localLogInfo).map(logEndOffsetString)}")
         }
         false
     }
    }
    
  7. Returning again to sub-step 2 of step 2 in this section: if the current node has new follower replicas to host, ReplicaManager.scala#applyLocalFollowersDelta() is triggered. Its implementation is shown below; the key steps are:

    1. Call ReplicaManager.scala#getOrCreatePartition() to create the local Partition object
    2. Call Partition.scala#makeFollower() to make the newly created Partition object a follower replica of the partition
    3. Call ReplicaFetcherManager.scala#addFetcherForPartitions() to set up a Fetcher thread for the partition's follower replica; this thread synchronizes message data from the partition's leader replica
    private def applyLocalFollowersDelta(
     changedPartitions: mutable.Set[Partition],
     newImage: MetadataImage,
     delta: TopicsDelta,
     offsetCheckpoints: OffsetCheckpoints,
     newLocalFollowers: mutable.Map[TopicPartition, LocalReplicaChanges.PartitionInfo]
    ): Unit = {
     stateChangeLogger.info(s"Transitioning ${newLocalFollowers.size} partition(s) to " +
       "local followers.")
     val shuttingDown = isShuttingDown.get()
     val partitionsToMakeFollower = new mutable.HashMap[TopicPartition, Partition]
     val newFollowerTopicSet = new mutable.HashSet[String]
     newLocalFollowers.forKeyValue { case (tp, info) =>
       getOrCreatePartition(tp, delta, info.topicId).foreach { case (partition, isNew) =>
         try {
           newFollowerTopicSet.add(tp.topic)
    
           if (shuttingDown) {
             stateChangeLogger.trace(s"Unable to start fetching ${tp} with topic " +
               s"ID ${info.topicId} because the replica manager is shutting down.")
           } else {
             val leader = info.partition.leader
             if (newImage.cluster.broker(leader) == null) {
               stateChangeLogger.trace(s"Unable to start fetching $tp with topic ID ${info.topicId} " +
                 s"from leader $leader because it is not alive.")
    
               // Create the local replica even if the leader is unavailable. This is required
               // to ensure that we include the partition's high watermark in the checkpoint
               // file (see KAFKA-1647).
               partition.createLogIfNotExists(isNew, false, offsetCheckpoints, Some(info.topicId))
             } else {
               val state = info.partition.toLeaderAndIsrPartitionState(tp, isNew)
               if (partition.makeFollower(state, offsetCheckpoints, Some(info.topicId))) {
                 partitionsToMakeFollower.put(tp, partition)
               } else {
                 stateChangeLogger.info("Skipped the become-follower state change after marking its " +
                   s"partition as follower for partition $tp with id ${info.topicId} and partition state $state.")
               }
             }
           }
           changedPartitions.add(partition)
         } catch {
           case e: Throwable => stateChangeLogger.error(s"Unable to start fetching ${tp} " +
               s"with topic ID ${info.topicId} due to ${e.getClass.getSimpleName}", e)
             replicaFetcherManager.addFailedPartition(tp)
         }
       }
     }
    
     // Stopping the fetchers must be done first in order to initialize the fetch
     // position correctly.
     replicaFetcherManager.removeFetcherForPartitions(partitionsToMakeFollower.keySet)
     stateChangeLogger.info(s"Stopped fetchers as part of become-follower for ${partitionsToMakeFollower.size} partitions")
    
     val listenerName = config.interBrokerListenerName.value
     val partitionAndOffsets = new mutable.HashMap[TopicPartition, InitialFetchState]
     partitionsToMakeFollower.forKeyValue { (topicPartition, partition) =>
       val node = partition.leaderReplicaIdOpt
         .flatMap(leaderId => Option(newImage.cluster.broker(leaderId)))
         .flatMap(_.node(listenerName).asScala)
         .getOrElse(Node.noNode)
       val log = partition.localLogOrException
       partitionAndOffsets.put(topicPartition, InitialFetchState(
         new BrokerEndPoint(node.id, node.host, node.port),
         partition.getLeaderEpoch,
         initialFetchOffset(log)
       ))
     }
    
     replicaFetcherManager.addFetcherForPartitions(partitionAndOffsets)
     stateChangeLogger.info(s"Started fetchers as part of become-follower for ${partitionsToMakeFollower.size} partitions")
    
     partitionsToMakeFollower.keySet.foreach(completeDelayedFetchOrProduceRequests)
    
     updateLeaderAndFollowerMetrics(newFollowerTopicSet)
    }
    
  8. Partition.scala#makeFollower() is straightforward; it is a stripped-down version of Partition.scala#makeLeader() and is not analyzed further here

     def makeFollower(partitionState: LeaderAndIsrPartitionState,
                    highWatermarkCheckpoints: OffsetCheckpoints,
                    topicId: Option[Uuid]): Boolean = {
     inWriteLock(leaderIsrUpdateLock) {
       val newLeaderBrokerId = partitionState.leader
       val oldLeaderEpoch = leaderEpoch
       // record the epoch of the controller that made the leadership decision. This is useful while updating the isr
       // to maintain the decision maker controller's epoch in the zookeeper path
       controllerEpoch = partitionState.controllerEpoch
    
       updateAssignmentAndIsr(
         assignment = partitionState.replicas.asScala.iterator.map(_.toInt).toSeq,
         isr = Set.empty[Int],
         addingReplicas = partitionState.addingReplicas.asScala.map(_.toInt),
         removingReplicas = partitionState.removingReplicas.asScala.map(_.toInt)
       )
       try {
         createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints, topicId)
       } catch {
         case e: ZooKeeperClientException =>
           stateChangeLogger.error(s"A ZooKeeper client exception has occurred. makeFollower will be skipping the " +
             s"state change for the partition $topicPartition with leader epoch: $leaderEpoch.", e)
    
           return false
       }
    
       val followerLog = localLogOrException
       val leaderEpochEndOffset = followerLog.logEndOffset
       stateChangeLogger.info(s"Follower $topicPartition starts at leader epoch ${partitionState.leaderEpoch} from " +
         s"offset $leaderEpochEndOffset with high watermark ${followerLog.highWatermark}. " +
         s"Previous leader epoch was $leaderEpoch.")
    
       leaderEpoch = partitionState.leaderEpoch
       leaderEpochStartOffsetOpt = None
       zkVersion = partitionState.zkVersion
    
       if (leaderReplicaIdOpt.contains(newLeaderBrokerId) && leaderEpoch == oldLeaderEpoch) {
         false
       } else {
         leaderReplicaIdOpt = Some(newLeaderBrokerId)
         true
       }
     }
    }
    
  9. ReplicaFetcherManager.scala#addFetcherForPartitions() is actually implemented by its parent class AbstractFetcherManager.scala#addFetcherForPartitions(). The key processing is the internal addAndStartFetcherThread() helper, which invokes the subclass method ReplicaFetcherManager.scala#createFetcherThread() to create a Fetcher thread and then starts it

    def addFetcherForPartitions(partitionAndOffsets: Map[TopicPartition, InitialFetchState]): Unit = {
     lock synchronized {
       val partitionsPerFetcher = partitionAndOffsets.groupBy { case (topicPartition, brokerAndInitialFetchOffset) =>
         BrokerAndFetcherId(brokerAndInitialFetchOffset.leader, getFetcherId(topicPartition))
       }
    
       def addAndStartFetcherThread(brokerAndFetcherId: BrokerAndFetcherId,
                                    brokerIdAndFetcherId: BrokerIdAndFetcherId): T = {
         val fetcherThread = createFetcherThread(brokerAndFetcherId.fetcherId, brokerAndFetcherId.broker)
         fetcherThreadMap.put(brokerIdAndFetcherId, fetcherThread)
         fetcherThread.start()
         fetcherThread
       }
    
       for ((brokerAndFetcherId, initialFetchOffsets) <- partitionsPerFetcher) {
         val brokerIdAndFetcherId = BrokerIdAndFetcherId(brokerAndFetcherId.broker.id, brokerAndFetcherId.fetcherId)
         val fetcherThread = fetcherThreadMap.get(brokerIdAndFetcherId) match {
           case Some(currentFetcherThread) if currentFetcherThread.sourceBroker == brokerAndFetcherId.broker =>
             // reuse the fetcher thread
             currentFetcherThread
           case Some(f) =>
             f.shutdown()
             addAndStartFetcherThread(brokerAndFetcherId, brokerIdAndFetcherId)
           case None =>
             addAndStartFetcherThread(brokerAndFetcherId, brokerIdAndFetcherId)
         }
    
         addPartitionsToFetcherThread(fetcherThread, initialFetchOffsets)
       }
     }
    }
    
  10. ReplicaFetcherManager.scala#createFetcherThread() creates a ReplicaFetcherThread object as the Fetcher thread instance. With that, consuming and applying the topic metadata change comes to an end

    override def createFetcherThread(fetcherId: Int, sourceBroker: BrokerEndPoint): ReplicaFetcherThread = {
    val prefix = threadNamePrefix.map(tp => s"$tp:").getOrElse("")
    val threadName = s"${prefix}ReplicaFetcherThread-$fetcherId-${sourceBroker.id}"
    new ReplicaFetcherThread(threadName, fetcherId, sourceBroker, brokerConfig, failedPartitions, replicaManager,
      metrics, time, quotaManager)
    }
    

2.3 Message Data Synchronization Between Leader and Follower Replicas

  1. As seen in the previous section, the ReplicaFetcherThread object is started as soon as it is created, which triggers ReplicaFetcherThread.scala#run(). This method is actually implemented by its parent class ShutdownableThread.scala#run(), whose core logic is a while loop that keeps calling the subclass implementation of AbstractFetcherThread.scala#doWork()

    override def run(): Unit = {
     isStarted = true
     info("Starting")
     try {
       while (isRunning)
         doWork()
     } catch {
       case e: FatalExitError =>
         shutdownInitiated.countDown()
         shutdownComplete.countDown()
         info("Stopped")
         Exit.exit(e.statusCode())
       case e: Throwable =>
         if (isRunning)
           error("Error due to", e)
     } finally {
        shutdownComplete.countDown()
     }
     info("Stopped")
    }
    
  2. AbstractFetcherThread.scala#doWork() contains only two method calls. AbstractFetcherThread.scala#maybeTruncate() handles log truncation during failure recovery and is not analyzed in this article. AbstractFetcherThread.scala#maybeFetch() actually performs message synchronization via Fetch requests in the following steps:

    1. Call the subclass implementation ReplicaFetcherThread.scala#buildFetch() to build the Fetch request
    2. Call AbstractFetcherThread.scala#processFetchRequest() to send the Fetch request and process the response data
    override def doWork(): Unit = {
     maybeTruncate()
     maybeFetch()
    }
    
    private def maybeFetch(): Unit = {
     val fetchRequestOpt = inLock(partitionMapLock) {
       val ResultWithPartitions(fetchRequestOpt, partitionsWithError) = buildFetch(partitionStates.partitionStateMap.asScala)
    
       handlePartitionsWithErrors(partitionsWithError, "maybeFetch")
    
       if (fetchRequestOpt.isEmpty) {
         trace(s"There are no active partitions. Back off for $fetchBackOffMs ms before sending a fetch request")
         partitionMapCond.await(fetchBackOffMs, TimeUnit.MILLISECONDS)
       }
    
       fetchRequestOpt
     }
    
     fetchRequestOpt.foreach { case ReplicaFetch(sessionPartitions, fetchRequest) =>
       processFetchRequest(sessionPartitions, fetchRequest)
     }
    }
    
  3. ReplicaFetcherThread.scala#buildFetch() is fairly simple. Note that the fetchOffset field of the Fetch request is filled with the local log's LEO, which was recorded in the partition's fetch state when the fetcher was initialized

    override def buildFetch(partitionMap: Map[TopicPartition, PartitionFetchState]): ResultWithPartitions[Option[ReplicaFetch]] = {
     val partitionsWithError = mutable.Set[TopicPartition]()
    
     val builder = fetchSessionHandler.newBuilder(partitionMap.size, false)
     partitionMap.forKeyValue { (topicPartition, fetchState) =>
       // We will not include a replica in the fetch request if it should be throttled.
       if (fetchState.isReadyForFetch && !shouldFollowerThrottle(quota, fetchState, topicPartition)) {
         try {
           val logStartOffset = this.logStartOffset(topicPartition)
           val lastFetchedEpoch = if (isTruncationOnFetchSupported)
             fetchState.lastFetchedEpoch.map(_.asInstanceOf[Integer]).asJava
           else
             Optional.empty[Integer]
           builder.add(topicPartition, new FetchRequest.PartitionData(
             fetchState.fetchOffset,
             logStartOffset,
             fetchSize,
             Optional.of(fetchState.currentLeaderEpoch),
             lastFetchedEpoch))
         } catch {
           case _: KafkaStorageException =>
             // The replica has already been marked offline due to log directory failure and the original failure should have already been logged.
             // This partition should be removed from ReplicaFetcherThread soon by ReplicaManager.handleLogDirFailure()
             partitionsWithError += topicPartition
         }
       }
     }
    
     val fetchData = builder.build()
     val fetchRequestOpt = if (fetchData.sessionPartitions.isEmpty && fetchData.toForget.isEmpty) {
       None
     } else {
       val requestBuilder = FetchRequest.Builder
         .forReplica(fetchRequestVersion, replicaId, maxWait, minBytes, fetchData.toSend)
         .setMaxBytes(maxBytes)
         .toForget(fetchData.toForget)
         .metadata(fetchData.metadata)
       Some(ReplicaFetch(fetchData.sessionPartitions(), requestBuilder))
     }
    
     ResultWithPartitions(fetchRequestOpt, partitionsWithError)
    }
    
  4. The key processing in AbstractFetcherThread.scala#processFetchRequest() is as follows:

    1. First call the subclass implementation ReplicaFetcherThread.scala#fetchFromLeader() to send the Fetch request to the partition's leader replica. Going further down involves the underlying networking component NetworkClient; interested readers can refer to the earlier article on locating the consumer group coordinator for how it works, which is not repeated here
    2. If the Fetch response contains message data, call the subclass method ReplicaFetcherThread.scala#processPartitionData() to append the messages to the local log
    3. If the server version supports log truncation during Fetch, any diverging-epoch information carried in the Fetch response is also collected while processing it. An epoch divergence between leader and follower replicas usually means a failure recovery has occurred, and the follower may need to truncate its log to stay consistent with the leader; this is finally handled by AbstractFetcherThread.scala#truncateOnFetchResponse() and is not analyzed in depth here
    private def processFetchRequest(sessionPartitions: util.Map[TopicPartition, FetchRequest.PartitionData],
                                   fetchRequest: FetchRequest.Builder): Unit = {
     val partitionsWithError = mutable.Set[TopicPartition]()
     val divergingEndOffsets = mutable.Map.empty[TopicPartition, EpochEndOffset]
     var responseData: Map[TopicPartition, FetchData] = Map.empty
    
     try {
       trace(s"Sending fetch request $fetchRequest")
       responseData = fetchFromLeader(fetchRequest)
     } catch {
       case t: Throwable =>
         if (isRunning) {
           warn(s"Error in response for fetch request $fetchRequest", t)
           inLock(partitionMapLock) {
             partitionsWithError ++= partitionStates.partitionSet.asScala
             // there is an error occurred while fetching partitions, sleep a while
             // note that `AbstractFetcherThread.handlePartitionsWithError` will also introduce the same delay for every
             // partition with error effectively doubling the delay. It would be good to improve this.
             partitionMapCond.await(fetchBackOffMs, TimeUnit.MILLISECONDS)
           }
         }
     }
     fetcherStats.requestRate.mark()
    
     if (responseData.nonEmpty) {
       // process fetched data
       inLock(partitionMapLock) {
         responseData.forKeyValue { (topicPartition, partitionData) =>
           Option(partitionStates.stateValue(topicPartition)).foreach { currentFetchState =>
             // It's possible that a partition is removed and re-added or truncated when there is a pending fetch request.
             // In this case, we only want to process the fetch response if the partition state is ready for fetch and
             // the current offset is the same as the offset requested.
             val fetchPartitionData = sessionPartitions.get(topicPartition)
             if (fetchPartitionData != null && fetchPartitionData.fetchOffset == currentFetchState.fetchOffset && currentFetchState.isReadyForFetch) {
               Errors.forCode(partitionData.errorCode) match {
                 case Errors.NONE =>
                   try {
                     // Once we hand off the partition data to the subclass, we can't mess with it any more in this thread
                     val logAppendInfoOpt = processPartitionData(topicPartition, currentFetchState.fetchOffset,
                       partitionData)
    
                     logAppendInfoOpt.foreach { logAppendInfo =>
                       val validBytes = logAppendInfo.validBytes
                       val nextOffset = if (validBytes > 0) logAppendInfo.lastOffset + 1 else currentFetchState.fetchOffset
                       val lag = Math.max(0L, partitionData.highWatermark - nextOffset)
                       fetcherLagStats.getAndMaybePut(topicPartition).lag = lag
    
                       // ReplicaDirAlterThread may have removed topicPartition from the partitionStates after processing the partition data
                       if (validBytes > 0 && partitionStates.contains(topicPartition)) {
                         // Update partitionStates only if there is no exception during processPartitionData
                         val newFetchState = PartitionFetchState(nextOffset, Some(lag),
                           currentFetchState.currentLeaderEpoch, state = Fetching,
                           logAppendInfo.lastLeaderEpoch)
                         partitionStates.updateAndMoveToEnd(topicPartition, newFetchState)
                         fetcherStats.byteRate.mark(validBytes)
                       }
                     }
                     if (isTruncationOnFetchSupported) {
                       FetchResponse.divergingEpoch(partitionData).ifPresent { divergingEpoch =>
                         divergingEndOffsets += topicPartition -> new EpochEndOffset()
                           .setPartition(topicPartition.partition)
                           .setErrorCode(Errors.NONE.code)
                           .setLeaderEpoch(divergingEpoch.epoch)
                           .setEndOffset(divergingEpoch.endOffset)
                       }
                     }
                   } catch {
                     case ime@( _: CorruptRecordException | _: InvalidRecordException) =>
                       // we log the error and continue. This ensures two things
                       // 1. If there is a corrupt message in a topic partition, it does not bring the fetcher thread
                       //    down and cause other topic partition to also lag
                       // 2. If the message is corrupt due to a transient state in the log (truncation, partial writes
                       //    can cause this), we simply continue and should get fixed in the subsequent fetches
                       error(s"Found invalid messages during fetch for partition $topicPartition " +
                         s"offset ${currentFetchState.fetchOffset}", ime)
                       partitionsWithError += topicPartition
                     case e: KafkaStorageException =>
                       error(s"Error while processing data for partition $topicPartition " +
                         s"at offset ${currentFetchState.fetchOffset}", e)
                       markPartitionFailed(topicPartition)
                     case t: Throwable =>
                       // stop monitoring this partition and add it to the set of failed partitions
                       error(s"Unexpected error occurred while processing data for partition $topicPartition " +
                         s"at offset ${currentFetchState.fetchOffset}", t)
                       markPartitionFailed(topicPartition)
                   }
                 case Errors.OFFSET_OUT_OF_RANGE =>
                   if (handleOutOfRangeError(topicPartition, currentFetchState, fetchPartitionData.currentLeaderEpoch))
                     partitionsWithError += topicPartition
    
                 case Errors.UNKNOWN_LEADER_EPOCH =>
                   debug(s"Remote broker has a smaller leader epoch for partition $topicPartition than " +
                     s"this replica's current leader epoch of ${currentFetchState.currentLeaderEpoch}.")
                   partitionsWithError += topicPartition
    
                 case Errors.FENCED_LEADER_EPOCH =>
                   if (onPartitionFenced(topicPartition, fetchPartitionData.currentLeaderEpoch))
                     partitionsWithError += topicPartition
    
                 case Errors.NOT_LEADER_OR_FOLLOWER =>
                   debug(s"Remote broker is not the leader for partition $topicPartition, which could indicate " +
                     "that the partition is being moved")
                   partitionsWithError += topicPartition
    
                 case Errors.UNKNOWN_TOPIC_OR_PARTITION =>
                   warn(s"Received ${Errors.UNKNOWN_TOPIC_OR_PARTITION} from the leader for partition $topicPartition. " +
                        "This error may be returned transiently when the partition is being created or deleted, but it is not " +
                        "expected to persist.")
                   partitionsWithError += topicPartition
    
                 case partitionError =>
                   error(s"Error for partition $topicPartition at offset ${currentFetchState.fetchOffset}", partitionError.exception)
                   partitionsWithError += topicPartition
               }
             }
           }
         }
       }
     }
    
     if (divergingEndOffsets.nonEmpty)
       truncateOnFetchResponse(divergingEndOffsets)
     if (partitionsWithError.nonEmpty) {
       handlePartitionsWithErrors(partitionsWithError, "processFetchRequest")
     }
    }
    
  5. The key processing in ReplicaFetcherThread.scala#processPartitionData() is clear-cut:

    1. Call Partition.scala#appendRecordsToFollowerOrFutureReplica() to append the message data locally. This is mainly a log file write; interested readers can refer to Kafka 3.0 Source Code Notes (7) - How the Kafka Server Handles Client Produce Requests. At this point the message data itself has been synchronized between leader and follower
    2. After the messages are written, the HW carried in the Fetch response is extracted and used to attempt an update of the local log's HW, which is done by calling Log.scala#updateHighWatermark()
    override def processPartitionData(topicPartition: TopicPartition,
                                     fetchOffset: Long,
                                     partitionData: FetchData): Option[LogAppendInfo] = {
     val logTrace = isTraceEnabled
     val partition = replicaMgr.getPartitionOrException(topicPartition)
     val log = partition.localLogOrException
     val records = toMemoryRecords(FetchResponse.recordsOrFail(partitionData))
    
     maybeWarnIfOversizedRecords(records, topicPartition)
    
     if (fetchOffset != log.logEndOffset)
       throw new IllegalStateException("Offset mismatch for partition %s: fetched offset = %d, log end offset = %d.".format(
         topicPartition, fetchOffset, log.logEndOffset))
    
     if (logTrace)
       trace("Follower has replica log end offset %d for partition %s. Received %d messages and leader hw %d"
         .format(log.logEndOffset, topicPartition, records.sizeInBytes, partitionData.highWatermark))
    
     // Append the leader's messages to the log
     val logAppendInfo = partition.appendRecordsToFollowerOrFutureReplica(records, isFuture = false)
    
     if (logTrace)
       trace("Follower has replica log end offset %d after appending %d bytes of messages for partition %s"
         .format(log.logEndOffset, records.sizeInBytes, topicPartition))
     val leaderLogStartOffset = partitionData.logStartOffset
    
     // For the follower replica, we do not need to keep its segment base offset and physical position.
     // These values will be computed upon becoming leader or handling a preferred read replica fetch.
     val followerHighWatermark = log.updateHighWatermark(partitionData.highWatermark)
     log.maybeIncrementLogStartOffset(leaderLogStartOffset, LeaderOffsetIncremented)
     if (logTrace)
       trace(s"Follower set replica high watermark for partition $topicPartition to $followerHighWatermark")
    
     // Traffic from both in-sync and out of sync replicas are accounted for in replication quota to ensure total replication
     // traffic doesn't exceed quota.
     if (quota.isThrottled(topicPartition))
       quota.record(records.sizeInBytes)
    
     if (partition.isReassigning && partition.isAddingLocalReplica)
       brokerTopicStats.updateReassignmentBytesIn(records.sizeInBytes)
    
     brokerTopicStats.updateReplicationBytesIn(records.sizeInBytes)
    
     logAppendInfo
    }
    
  6. As for how the partition's leader replica handles Fetch requests from followers, readers can refer to step 3 of the relevant section in Kafka 3.0 Source Code Notes (4) - How the Kafka Server Handles Client Fetch Requests. At that point ReplicaManager.scala#updateFollowerFetchState() is triggered, whose core processing is to execute Partition.scala#updateFollowerFetchState() so as to update the follower's LEO in the remote replica list kept by the leader

      private def updateFollowerFetchState(followerId: Int,
                                        readResults: Seq[(TopicPartition, LogReadResult)]): Seq[(TopicPartition, LogReadResult)] = {
     readResults.map { case (topicPartition, readResult) =>
       val updatedReadResult = if (readResult.error != Errors.NONE) {
         debug(s"Skipping update of fetch state for follower $followerId since the " +
           s"log read returned error ${readResult.error}")
         readResult
       } else if (readResult.divergingEpoch.nonEmpty) {
         debug(s"Skipping update of fetch state for follower $followerId since the " +
           s"log read returned diverging epoch ${readResult.divergingEpoch}")
         readResult
       } else {
         onlinePartition(topicPartition) match {
           case Some(partition) =>
             if (partition.updateFollowerFetchState(followerId,
               followerFetchOffsetMetadata = readResult.info.fetchOffsetMetadata,
               followerStartOffset = readResult.followerLogStartOffset,
               followerFetchTimeMs = readResult.fetchTimeMs,
               leaderEndOffset = readResult.leaderLogEndOffset)) {
               readResult
             } else {
               warn(s"Leader $localBrokerId failed to record follower $followerId's position " +
                 s"${readResult.info.fetchOffsetMetadata.messageOffset}, and last sent HW since the replica " +
                 s"is not recognized to be one of the assigned replicas ${partition.assignmentState.replicas.mkString(",")} " +
                 s"for partition $topicPartition. Empty records will be returned for this partition.")
               readResult.withEmptyFetchInfo
             }
           case None =>
             warn(s"While recording the replica LEO, the partition $topicPartition hasn't been created.")
             readResult
         }
       }
       topicPartition -> updatedReadResult
     }
    }
    
  7. Partition.scala#updateFollowerFetchState() contains no complex logic; its core processing is two steps, which concludes the analysis in this article

    1. Update the LEO and related fetch state of the follower replica that issued the Fetch request
    2. Call Partition.scala#maybeIncrementLeaderHW(), covered in step 6 of section 2.2, to attempt to update the partition HW while handling the follower's Fetch request. Only after the HW advances does a new message count as committed and become consumable
    def updateFollowerFetchState(followerId: Int,
                                followerFetchOffsetMetadata: LogOffsetMetadata,
                                followerStartOffset: Long,
                                followerFetchTimeMs: Long,
                                leaderEndOffset: Long): Boolean = {
     getReplica(followerId) match {
       case Some(followerReplica) =>
         // No need to calculate low watermark if there is no delayed DeleteRecordsRequest
         val oldLeaderLW = if (delayedOperations.numDelayedDelete > 0) lowWatermarkIfLeader else -1L
         val prevFollowerEndOffset = followerReplica.logEndOffset
         followerReplica.updateFetchState(
           followerFetchOffsetMetadata,
           followerStartOffset,
           followerFetchTimeMs,
           leaderEndOffset)
    
         val newLeaderLW = if (delayedOperations.numDelayedDelete > 0) lowWatermarkIfLeader else -1L
         // check if the LW of the partition has incremented
         // since the replica's logStartOffset may have incremented
         val leaderLWIncremented = newLeaderLW > oldLeaderLW
    
         // Check if this in-sync replica needs to be added to the ISR.
         maybeExpandIsr(followerReplica, followerFetchTimeMs)
    
         // check if the HW of the partition can now be incremented
         // since the replica may already be in the ISR and its LEO has just incremented
         val leaderHWIncremented = if (prevFollowerEndOffset != followerReplica.logEndOffset) {
           // the leader log may be updated by ReplicaAlterLogDirsThread so the following method must be in lock of
           // leaderIsrUpdateLock to prevent adding new hw to invalid log.
           inReadLock(leaderIsrUpdateLock) {
             leaderLogIfLocal.exists(leaderLog => maybeIncrementLeaderHW(leaderLog, followerFetchTimeMs))
           }
         } else {
           false
         }
    
         // some delayed operations may be unblocked after HW or LW changed
         if (leaderLWIncremented || leaderHWIncremented)
           tryCompleteDelayedRequests()
    
         debug(s"Recorded replica $followerId log end offset (LEO) position " +
           s"${followerFetchOffsetMetadata.messageOffset} and log start offset $followerStartOffset.")
         true
    
       case None =>
         false
     }
    }
    
