1. ReplicaManager Module Overview
ReplicaManager manages the replicas of each topic hosted on this broker, and all log read/write requests are processed through it.
Each ReplicaManager instance holds an allPartitions field of type Pool[TopicPartition, Partition]; Pool is essentially a thin wrapper around a Map. The key (TopicPartition) identifies the topic and partition number, and the value is the corresponding Partition object.
The Partition in turn holds the Replica objects for that partition. From the replica local to this broker we can reach the corresponding Log instance, which performs the actual log read and write operations.
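As a rough illustration of this lookup chain, here is a minimal, hypothetical helper (getPartition and HostedPartition follow the Kafka 2.x names used later in this article; the helper itself is not real Kafka code):

// Hypothetical sketch: allPartitions (Pool) -> Partition -> local replica's Log
def localLogFor(replicaManager: ReplicaManager, tp: TopicPartition): Option[Log] =
  replicaManager.getPartition(tp) match {
    case HostedPartition.Online(partition) => partition.log // the Log instance that serves reads/writes
    case _ => None // offline, or not hosted on this broker
  }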
2. ReplicaManager Initialization
- ReplicaManager is initialized via startup(), which mainly schedules three periodic tasks (see the sketch after this list):
- isr-expiration: periodically checks whether any replica in a topic-partition's ISR has fallen behind the leader (through lag or a crash) and must be removed from the ISR;
- isr-change-propagation: when a topic-partition's ISR changes, updates the corresponding zk node, which in turn triggers the controller to act on the ISR change;
- shutdown-idle-replica-alter-log-dirs-thread: periodically shuts down idle ReplicaAlterLogDirs fetcher threads (the threads that move replicas between log directories);
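A minimal sketch of what startup() looks like, modeled on Kafka 2.x (the scheduling periods and the omission of the log-dir failure handler setup are assumptions based on that version):

// Inside ReplicaManager (sketch); scheduler is Kafka's KafkaScheduler
def startup(): Unit = {
  // isr-expiration runs at half of replica.lag.time.max.ms
  scheduler.schedule("isr-expiration", maybeShrinkIsr _,
    period = config.replicaLagTimeMaxMs / 2, unit = TimeUnit.MILLISECONDS)
  scheduler.schedule("isr-change-propagation", maybePropagateIsrChanges _,
    period = 2500L, unit = TimeUnit.MILLISECONDS)
  scheduler.schedule("shutdown-idle-replica-alter-log-dirs-thread",
    shutdownIdleReplicaAlterLogDirsThread _, period = 10000L, unit = TimeUnit.MILLISECONDS)
  // (log-dir failure handler startup omitted here)
}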
2.1. maybeShrinkIsr()
This task does two things: it checks whether the current topic-partition has any out-of-sync replicas, and if it does, it shrinks the ISR and updates related state such as the HW.
The main flow is as follows:
def maybeShrinkIsr(): Unit = {
val needsIsrUpdate = inReadLock(leaderIsrUpdateLock) {
// Decide whether the ISR needs updating.
// The check is performed only when this replica is the partition leader.
// Out-of-sync test: followerReplica.logEndOffset != leaderEndOffset && (currentTimeMs - followerReplica.lastCaughtUpTimeMs) > maxLagMs
needsShrinkIsr()
}
val leaderHWIncremented = needsIsrUpdate && inWriteLock(leaderIsrUpdateLock) {
leaderLogIfLocal match {
case Some(leaderLog) =>
// Collect the ids of the out-of-sync replicas
val outOfSyncReplicaIds = getOutOfSyncReplicas(replicaLagTimeMaxMs)
// A non-empty out-of-sync set means the ISR must shrink
if (outOfSyncReplicaIds.nonEmpty) {
// Compute the new ISR
val newInSyncReplicaIds = inSyncReplicaIds -- outOfSyncReplicaIds
assert(newInSyncReplicaIds.nonEmpty)
info("Shrinking ISR from %s to %s. Leader: (highWatermark: %d, endOffset: %d). Out of sync replicas: %s."
.format(inSyncReplicaIds.mkString(","),
newInSyncReplicaIds.mkString(","),
leaderLog.highWatermark,
leaderLog.logEndOffset,
outOfSyncReplicaIds.map { replicaId =>
s"(brokerId: $replicaId, endOffset: ${getReplicaOrException(replicaId).logEndOffset})"
}.mkString(" ")
)
)
// Update the ISR in zk and in the local cache
shrinkIsr(newInSyncReplicaIds)
// we may need to increment high watermark since ISR could be down to 1
// Check whether the HW needs to advance: take the smallest LEO among the partition's remaining in-sync (caught-up) replicas as the new HW
maybeIncrementLeaderHW(leaderLog)
} else {
false
}
case None => false // do nothing if no longer leader
}
}
// some delayed operations may be unblocked after HW changed
// If the HW advanced, immediately try to complete pending delayed requests, mainly DelayedProduce, DelayedFetch and DelayedDeleteRecords
if (leaderHWIncremented)
tryCompleteDelayedRequests()
}
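The out-of-sync test referenced in the comments above can be read as a standalone predicate; a sketch derived directly from that condition (not the actual Partition method):

// A follower is out of sync when it has not fully caught up to the leader's LEO
// AND its last full catch-up is older than maxLagMs (replica.lag.time.max.ms)
def isOutOfSync(followerEndOffset: Long,
                leaderEndOffset: Long,
                lastCaughtUpTimeMs: Long,
                currentTimeMs: Long,
                maxLagMs: Long): Boolean =
  followerEndOffset != leaderEndOffset && (currentTimeMs - lastCaughtUpTimeMs) > maxLagMs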
2.2. maybePropagateIsrChanges()
This method writes the set of topic-partitions whose ISR has changed (isrChangeSet) to ZooKeeper via zkClient.propagateIsrChanges(); only through this does the Controller learn which topic-partitions' ISRs have changed.
def maybePropagateIsrChanges(): Unit = {
val now = System.currentTimeMillis()
isrChangeSet synchronized {
// There are pending ISR changes, and either:
if (isrChangeSet.nonEmpty &&
// no new ISR change has been recorded for at least IsrChangePropagationBlackOut ms, or
(lastIsrChangeMs.get() + ReplicaManager.IsrChangePropagationBlackOut < now ||
// more than IsrChangePropagationInterval ms have passed since the last propagation
lastIsrPropagationMs.get() + ReplicaManager.IsrChangePropagationInterval < now)) {
// Propagate the ISR changes through zk
zkClient.propagateIsrChanges(isrChangeSet)
// Clear the pending change set
isrChangeSet.clear()
// Record the propagation time
lastIsrPropagationMs.set(now)
}
}
}
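For context, partitions enter isrChangeSet through recordIsrChange(), which also refreshes lastIsrChangeMs so the blackout condition above works; a sketch along the lines of the Kafka 2.x implementation (lives inside ReplicaManager):

def recordIsrChange(topicPartition: TopicPartition): Unit = {
  isrChangeSet synchronized {
    // Remember which partition changed and when, for maybePropagateIsrChanges()
    isrChangeSet += topicPartition
    lastIsrChangeMs.set(System.currentTimeMillis())
  }
}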
3. Main Request-Handling Methods
ReplicaManager handles many of the request types dispatched by KafkaApis, chiefly LeaderAndIsr, StopReplica, UpdateMetadata, Produce, Fetch and ListOffset; the sketch below shows the dispatch path.
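Roughly, KafkaApis routes these requests to ReplicaManager as follows (a condensed, Kafka 2.x-style sketch of the switch inside KafkaApis.handle(); only the ReplicaManager-related cases are shown):

// Condensed sketch of KafkaApis.handle() routing
request.header.apiKey match {
  case ApiKeys.LEADER_AND_ISR  => handleLeaderAndIsrRequest(request)   // -> becomeLeaderOrFollower()
  case ApiKeys.STOP_REPLICA    => handleStopReplicaRequest(request)    // -> stopReplicas()
  case ApiKeys.UPDATE_METADATA => handleUpdateMetadataRequest(request) // -> maybeUpdateMetadataCache()
  case ApiKeys.PRODUCE         => handleProduceRequest(request)        // -> appendRecords()
  case ApiKeys.FETCH           => handleFetchRequest(request)          // -> fetchMessages()
  case ApiKeys.LIST_OFFSETS    => handleListOffsetRequest(request)     // -> fetchOffsetForTimestamp()
  case _                       => // other APIs omitted
}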
3.1. becomeLeaderOrFollower()
def becomeLeaderOrFollower(correlationId: Int,
leaderAndIsrRequest: LeaderAndIsrRequest,
onLeadershipChange: (Iterable[Partition], Iterable[Partition]) => Unit): LeaderAndIsrResponse = {
if (stateChangeLogger.isTraceEnabled) {
leaderAndIsrRequest.partitionStates.asScala.foreach { partitionState =>
stateChangeLogger.trace(s"Received LeaderAndIsr request $partitionState " +
s"correlation id $correlationId from controller ${leaderAndIsrRequest.controllerId} " +
s"epoch ${leaderAndIsrRequest.controllerEpoch}")
}
}
replicaStateChangeLock synchronized {
// Compare the controllerEpoch in the LeaderAndIsr request with the local one; a smaller epoch means a stale request
if (leaderAndIsrRequest.controllerEpoch < controllerEpoch) {
stateChangeLogger.warn(s"Ignoring LeaderAndIsr request from controller ${leaderAndIsrRequest.controllerId} with " +
s"correlation id $correlationId since its controller epoch ${leaderAndIsrRequest.controllerEpoch} is old. " +
s"Latest known controller epoch is $controllerEpoch")
leaderAndIsrRequest.getErrorResponse(0, Errors.STALE_CONTROLLER_EPOCH.exception)
} else {
val responseMap = new mutable.HashMap[TopicPartition, Errors]
val controllerId = leaderAndIsrRequest.controllerId
controllerEpoch = leaderAndIsrRequest.controllerEpoch
// First check partition's leader epoch
val partitionStates = new mutable.HashMap[Partition, LeaderAndIsrPartitionState]()
val updatedPartitions = new mutable.HashSet[Partition]
// Iterate over the partition states in the request, skipping partitions whose local replica sits in an offline log directory
leaderAndIsrRequest.partitionStates.asScala.foreach { partitionState =>
val topicPartition = new TopicPartition(partitionState.topicName, partitionState.partitionIndex)
// Look up the local Partition for this topicPartition
val partitionOpt = getPartition(topicPartition) match {
// The local replica is offline
case HostedPartition.Offline =>
stateChangeLogger.warn(s"Ignoring LeaderAndIsr request from " +
s"controller $controllerId with correlation id $correlationId " +
s"epoch $controllerEpoch for partition $topicPartition as the local replica for the " +
"partition is in an offline log directory")
responseMap.put(topicPartition, Errors.KAFKA_STORAGE_ERROR)
None
// The partition is hosted online
case HostedPartition.Online(partition) =>
updatedPartitions.add(partition)
Some(partition)
// The partition does not exist yet: create it, add it to allPartitions and mark it online
case HostedPartition.None =>
val partition = Partition(topicPartition, time, this)
allPartitions.putIfNotExists(topicPartition, HostedPartition.Online(partition))
updatedPartitions.add(partition)
Some(partition)
}
partitionOpt.foreach { partition =>
val currentLeaderEpoch = partition.getLeaderEpoch
val requestLeaderEpoch = partitionState.leaderEpoch
// Accept the new state only when the request carries a newer leader epoch
if (requestLeaderEpoch > currentLeaderEpoch) {
// If the leader epoch is valid record the epoch of the controller that made the leadership decision.
// This is useful while updating the isr to maintain the decision maker controller's epoch in the zookeeper path
if (partitionState.replicas.contains(localBrokerId))
partitionStates.put(partition, partitionState)
else {
stateChangeLogger.warn(s"Ignoring LeaderAndIsr request from controller $controllerId with " +
s"correlation id $correlationId epoch $controllerEpoch for partition $topicPartition as itself is not " +
s"in assigned replica list ${partitionState.replicas.asScala.mkString(",")}")
responseMap.put(topicPartition, Errors.UNKNOWN_TOPIC_OR_PARTITION)
}
} else if (requestLeaderEpoch < currentLeaderEpoch) {
stateChangeLogger.warn(s"Ignoring LeaderAndIsr request from " +
s"controller $controllerId with correlation id $correlationId " +
s"epoch $controllerEpoch for partition $topicPartition since its associated " +
s"leader epoch $requestLeaderEpoch is smaller than the current " +
s"leader epoch $currentLeaderEpoch")
responseMap.put(topicPartition, Errors.STALE_CONTROLLER_EPOCH)
} else {
stateChangeLogger.debug(s"Ignoring LeaderAndIsr request from " +
s"controller $controllerId with correlation id $correlationId " +
s"epoch $controllerEpoch for partition $topicPartition since its associated " +
s"leader epoch $requestLeaderEpoch matches the current leader epoch")
responseMap.put(topicPartition, Errors.STALE_CONTROLLER_EPOCH)
}
}
}
// Partitions for which this broker becomes the leader
val partitionsTobeLeader = partitionStates.filter { case (_, partitionState) =>
partitionState.leader == localBrokerId
}
// The remaining partitions become followers
val partitionsToBeFollower = partitionStates -- partitionsTobeLeader.keys
val highWatermarkCheckpoints = new LazyOffsetCheckpoints(this.highWatermarkCheckpoints)
val partitionsBecomeLeader = if (partitionsTobeLeader.nonEmpty)
// makeLeaders() switches the given partitions to leaders
makeLeaders(controllerId, controllerEpoch, partitionsTobeLeader, correlationId, responseMap,
highWatermarkCheckpoints)
else
Set.empty[Partition]
val partitionsBecomeFollower = if (partitionsToBeFollower.nonEmpty)
// makeFollowers() switches the given partitions to followers
makeFollowers(controllerId, controllerEpoch, partitionsToBeFollower, correlationId, responseMap,
highWatermarkCheckpoints)
else
Set.empty[Partition]
/*
* KAFKA-8392
* For topic partitions of which the broker is no longer a leader, delete metrics related to
* those topics. Note that this means the broker stops being either a replica or a leader of
* partitions of said topics
*/
// Remove the leader/follower metrics of topics this broker no longer leads or follows
val leaderTopicSet = leaderPartitionsIterator.map(_.topic).toSet
val followerTopicSet = partitionsBecomeFollower.map(_.topic).toSet
followerTopicSet.diff(leaderTopicSet).foreach(brokerTopicStats.removeOldLeaderMetrics)
// remove metrics for brokers which are not followers of a topic
leaderTopicSet.diff(followerTopicSet).foreach(brokerTopicStats.removeOldFollowerMetrics)
leaderAndIsrRequest.partitionStates.asScala.foreach { partitionState =>
val topicPartition = new TopicPartition(partitionState.topicName, partitionState.partitionIndex)
/*
* If there is offline log directory, a Partition object may have been created by getOrCreatePartition()
* before getOrCreateReplica() failed to create local replica due to KafkaStorageException.
* In this case ReplicaManager.allPartitions will map this topic-partition to an empty Partition object.
* we need to map this topic-partition to OfflinePartition instead.
*/
if (localLog(topicPartition).isEmpty)
markPartitionOffline(topicPartition)
}
// we initialize highwatermark thread after the first leaderisrrequest. This ensures that all the partitions
// have been completely populated before starting the checkpointing there by avoiding weird race conditions
startHighWatermarkCheckPointThread()
val futureReplicasAndInitialOffset = new mutable.HashMap[TopicPartition, InitialFetchState]
for (partition <- updatedPartitions) {
val topicPartition = partition.topicPartition
// A future log exists for this partition, i.e. it is being moved to another log directory
if (logManager.getLog(topicPartition, isFuture = true).isDefined) {
partition.log.foreach { log =>
val leader = BrokerEndPoint(config.brokerId, "localhost", -1)
// Add future replica to partition's map
// Create the future replica's log if it does not exist yet
partition.createLogIfNotExists(Request.FutureLocalReplicaId, isNew = false, isFutureReplica = true,
highWatermarkCheckpoints)
// pause cleaning for partitions that are being moved and start ReplicaAlterDirThread to move
// replica from source dir to destination dir
logManager.abortAndPauseCleaning(topicPartition)
futureReplicasAndInitialOffset.put(topicPartition, InitialFetchState(leader,
partition.getLeaderEpoch, log.highWatermark))
}
}
}
// Add ReplicaAlterLogDirs fetchers for the future replicas so data is copied into the destination directories
replicaAlterLogDirsManager.addFetcherForPartitions(futureReplicasAndInitialOffset)
// Shut down idle fetcher threads
replicaFetcherManager.shutdownIdleFetcherThreads()
replicaAlterLogDirsManager.shutdownIdleFetcherThreads()
// Invoke onLeadershipChange so that e.g. the GroupCoordinator can do its election/registration work
onLeadershipChange(partitionsBecomeLeader, partitionsBecomeFollower)
val responsePartitions = responseMap.iterator.map { case (tp, error) =>
new LeaderAndIsrPartitionError()
.setTopicName(tp.topic)
.setPartitionIndex(tp.partition)
.setErrorCode(error.code)
}.toBuffer
new LeaderAndIsrResponse(new LeaderAndIsrResponseData()
.setErrorCode(Errors.NONE.code)
.setPartitionErrors(responsePartitions.asJava))
}
}
}
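The onLeadershipChange callback invoked near the end is supplied by KafkaApis and hands the changed partitions to the coordinators; a simplified sketch (the real callback also covers the transaction coordinator):

// Sketch of the callback wired up in KafkaApis (simplified)
def onLeadershipChange(updatedLeaders: Iterable[Partition], updatedFollowers: Iterable[Partition]): Unit = {
  updatedLeaders.foreach { partition =>
    // This broker became leader of a __consumer_offsets partition: load its group state
    if (partition.topic == Topic.GROUP_METADATA_TOPIC_NAME)
      groupCoordinator.handleGroupImmigration(partition.partitionId)
  }
  updatedFollowers.foreach { partition =>
    // This broker lost leadership of a __consumer_offsets partition: unload its group state
    if (partition.topic == Topic.GROUP_METADATA_TOPIC_NAME)
      groupCoordinator.handleGroupEmigration(partition.partitionId)
  }
}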
3.2. makeLeaders()
This method switches the given partitions into leaders on this broker: it first stops their replica fetchers (a leader no longer fetches from anyone) and then updates each partition's state via Partition.makeLeader().
private def makeLeaders(controllerId: Int,
controllerEpoch: Int,
partitionStates: Map[Partition, LeaderAndIsrPartitionState],
correlationId: Int,
responseMap: mutable.Map[TopicPartition, Errors],
highWatermarkCheckpoints: OffsetCheckpoints): Set[Partition] = {
partitionStates.keys.foreach { partition =>
stateChangeLogger.trace(s"Handling LeaderAndIsr request correlationId $correlationId from " +
s"controller $controllerId epoch $controllerEpoch starting the become-leader transition for " +
s"partition ${partition.topicPartition}")
}
for (partition <- partitionStates.keys)
responseMap.put(partition.topicPartition, Errors.NONE)
val partitionsToMakeLeaders = mutable.Set[Partition]()
try {
// First stop fetchers for all the partitions
// Stop the fetchers so these replicas no longer sync messages from the old leader
replicaFetcherManager.removeFetcherForPartitions(partitionStates.keySet.map(_.topicPartition))
// Update the partition information to be the leader
partitionStates.foreach { case (partition, partitionState) =>
try {
// makeLeader() initializes the partition state: the log, local offsets, the followers' initial replica state, etc.
if (partition.makeLeader(controllerId, partitionState, correlationId, highWatermarkCheckpoints)) {
partitionsToMakeLeaders += partition
stateChangeLogger.trace(s"Stopped fetchers as part of become-leader request from " +
s"controller $controllerId epoch $controllerEpoch with correlation id $correlationId for partition ${partition.topicPartition} " +
s"(last update controller epoch ${partitionState.controllerEpoch})")
} else
stateChangeLogger.info(s"Skipped the become-leader state change after marking its " +
s"partition as leader with correlation id $correlationId from controller $controllerId epoch $controllerEpoch for " +
s"partition ${partition.topicPartition} (last update controller epoch ${partitionState.controllerEpoch}) " +
s"since it is already the leader for the partition.")
} catch {
case e: KafkaStorageException =>
stateChangeLogger.error(s"Skipped the become-leader state change with " +
s"correlation id $correlationId from controller $controllerId epoch $controllerEpoch for partition ${partition.topicPartition} " +
s"(last update controller epoch ${partitionState.controllerEpoch}) since " +
s"the replica for the partition is offline due to disk error $e")
val dirOpt = getLogDir(partition.topicPartition)
error(s"Error while making broker the leader for partition $partition in dir $dirOpt", e)
responseMap.put(partition.topicPartition, Errors.KAFKA_STORAGE_ERROR)
}
}
} catch {
case e: Throwable =>
partitionStates.keys.foreach { partition =>
stateChangeLogger.error(s"Error while processing LeaderAndIsr request correlationId $correlationId received " +
s"from controller $controllerId epoch $controllerEpoch for partition ${partition.topicPartition}", e)
}
// Re-throw the exception for it to be caught in KafkaApis
throw e
}
partitionStates.keys.foreach { partition =>
stateChangeLogger.trace(s"Completed LeaderAndIsr request correlationId $correlationId from controller $controllerId " +
s"epoch $controllerEpoch for the become-leader transition for partition ${partition.topicPartition}")
}
partitionsToMakeLeaders
}
3.3. appendRecords()
This method appends message records to the local leader replicas and, when required, waits until the records are replicated to the follower replicas, responding once replication completes or times out.
def appendRecords(timeout: Long,
requiredAcks: Short,
internalTopicsAllowed: Boolean,
origin: AppendOrigin,
entriesPerPartition: Map[TopicPartition, MemoryRecords],
responseCallback: Map[TopicPartition, PartitionResponse] => Unit,
delayedProduceLock: Option[Lock] = None,
recordConversionStatsCallback: Map[TopicPartition, RecordConversionStats] => Unit = _ => ()): Unit = {
if (isValidRequiredAcks(requiredAcks)) {
val sTime = time.milliseconds
// Append the messages to the local leader logs
val localProduceResults = appendToLocalLog(internalTopicsAllowed = internalTopicsAllowed,
origin, entriesPerPartition, requiredAcks)
debug("Produce to local log in %d ms".format(time.milliseconds - sTime))
// Build the per-partition produce status from the local append results
val produceStatus = localProduceResults.map { case (topicPartition, result) =>
topicPartition ->
ProducePartitionStatus(
result.info.lastOffset + 1, // required offset
new PartitionResponse(result.error, result.info.firstOffset.getOrElse(-1), result.info.logAppendTime,
result.info.logStartOffset, result.info.recordErrors.asJava, result.info.errorMessage)) // response status
}
recordConversionStatsCallback(localProduceResults.map { case (k, v) => k -> v.info.recordConversionStats })
// Does the response need to be delayed?
if (delayedProduceRequestRequired(requiredAcks, entriesPerPartition, localProduceResults)) {
// create delayed produce operation
val produceMetadata = ProduceMetadata(requiredAcks, produceStatus)
val delayedProduce = new DelayedProduce(timeout, produceMetadata, this, responseCallback, delayedProduceLock)
// create a list of (topic, partition) pairs to use as keys for this delayed produce operation
val producerRequestKeys = entriesPerPartition.keys.map(TopicPartitionOperationKey(_)).toSeq
// try to complete the request immediately, otherwise put it into the purgatory
// this is because while the delayed produce operation is being created, new
// requests may arrive and hence make this operation completable.
// Add the operation to the produce purgatory's watch list
delayedProducePurgatory.tryCompleteElseWatch(delayedProduce, producerRequestKeys)
} else {
// we can respond immediately
// Respond immediately
val produceResponseStatus = produceStatus.map { case (k, status) => k -> status.responseStatus }
responseCallback(produceResponseStatus)
}
} else {
// If required.acks is outside accepted range, something is wrong with the client
// Just return an error and don't handle the request at all
val responseStatus = entriesPerPartition.map { case (topicPartition, _) =>
topicPartition -> new PartitionResponse(Errors.INVALID_REQUIRED_ACKS,
LogAppendInfo.UnknownLogAppendInfo.firstOffset.getOrElse(-1), RecordBatch.NO_TIMESTAMP, LogAppendInfo.UnknownLogAppendInfo.logStartOffset)
}
responseCallback(responseStatus)
}
}
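The "needs a delayed response?" check above boils down to three conditions; a sketch in the spirit of Kafka's delayedProduceRequestRequired() (lives inside ReplicaManager):

// A delayed produce is required only when acks = -1 (all), the request actually
// carries data, and at least one partition was appended locally without error;
// if every partition failed, each already has an error and we can answer at once.
private def delayedProduceRequestRequired(requiredAcks: Short,
                                          entriesPerPartition: Map[TopicPartition, MemoryRecords],
                                          localProduceResults: Map[TopicPartition, LogAppendResult]): Boolean =
  requiredAcks == -1 &&
    entriesPerPartition.nonEmpty &&
    localProduceResults.values.count(_.exception.isDefined) < entriesPerPartition.size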
3.4. fetchMessages()
This method fetches messages from a replica, returning only when enough data has accumulated or the wait times out. Consumers that supply ClientMetadata may fetch from any replica, while followers (and older clients) can only fetch from the leader.
def fetchMessages(timeout: Long,
replicaId: Int,
fetchMinBytes: Int,
fetchMaxBytes: Int,
hardMaxBytesLimit: Boolean,
fetchInfos: Seq[(TopicPartition, PartitionData)],
quota: ReplicaQuota,
responseCallback: Seq[(TopicPartition, FetchPartitionData)] => Unit,
isolationLevel: IsolationLevel,
clientMetadata: Option[ClientMetadata]): Unit = {
// The request comes from a follower (replicaId is a valid broker id)
val isFromFollower = Request.isValidBrokerId(replicaId)
// The request comes from a consumer (neither a follower nor the future-replica mover)
val isFromConsumer = !(isFromFollower || replicaId == Request.FutureLocalReplicaId)
// Followers (and the future replica) read up to the log end offset;
// consumers under READ_COMMITTED read up to the last stable (committed) offset;
// otherwise consumers read up to the HW
val fetchIsolation = if (!isFromConsumer)
FetchLogEnd
else if (isolationLevel == IsolationLevel.READ_COMMITTED)
FetchTxnCommitted
else
FetchHighWatermark
// Restrict fetching to leader if request is from follower or from a client with older version (no ClientMetadata)
// Decide whether the fetch must be served by the leader
val fetchOnlyFromLeader = isFromFollower || (isFromConsumer && clientMetadata.isEmpty)
def readFromLog(): Seq[(TopicPartition, LogReadResult)] = {
val result = readFromLocalLog(
replicaId = replicaId,
fetchOnlyFromLeader = fetchOnlyFromLeader,
fetchIsolation = fetchIsolation,
fetchMaxBytes = fetchMaxBytes,
hardMaxBytesLimit = hardMaxBytesLimit,
readPartitionInfo = fetchInfos,
quota = quota,
clientMetadata = clientMetadata)
// Update the follower's fetch state (LEO, lastCaughtUpTimeMs, etc.)
if (isFromFollower) updateFollowerFetchState(replicaId, result)
else result
}
// Read the messages
val logReadResults = readFromLog()
// check if this fetch request can be satisfied right away
var bytesReadable: Long = 0
var errorReadingData = false
val logReadResultMap = new mutable.HashMap[TopicPartition, LogReadResult]
var anyPartitionsNeedHwUpdate = false
logReadResults.foreach { case (topicPartition, logReadResult) =>
if (logReadResult.error != Errors.NONE)
errorReadingData = true
bytesReadable = bytesReadable + logReadResult.info.records.sizeInBytes
logReadResultMap.put(topicPartition, logReadResult)
if (isFromFollower && logReadResult.followerNeedsHwUpdate) {
anyPartitionsNeedHwUpdate = true
}
}
// respond immediately if 1) fetch request does not want to wait
// 2) fetch request does not require any data
// 3) has enough data to respond
// 4) some error happens while reading data
// 5) any of the requested partitions need HW update
// Check whether we can respond right away
if (timeout <= 0 || fetchInfos.isEmpty || bytesReadable >= fetchMinBytes || errorReadingData || anyPartitionsNeedHwUpdate) {
val fetchPartitionData = logReadResults.map { case (tp, result) =>
tp -> FetchPartitionData(result.error, result.highWatermark, result.leaderLogStartOffset, result.info.records,
result.lastStableOffset, result.info.abortedTransactions, result.preferredReadReplica, isFromFollower && isAddingReplica(tp, replicaId))
}
responseCallback(fetchPartitionData)
} else {
// construct the fetch results from the read results
// Build the per-partition fetch status from the read results
val fetchPartitionStatus = new mutable.ArrayBuffer[(TopicPartition, FetchPartitionStatus)]
fetchInfos.foreach { case (topicPartition, partitionData) =>
logReadResultMap.get(topicPartition).foreach(logReadResult => {
val logOffsetMetadata = logReadResult.info.fetchOffsetMetadata
fetchPartitionStatus += (topicPartition -> FetchPartitionStatus(logOffsetMetadata, partitionData))
})
}
val fetchMetadata: SFetchMetadata = SFetchMetadata(fetchMinBytes, fetchMaxBytes, hardMaxBytesLimit,
fetchOnlyFromLeader, fetchIsolation, isFromFollower, replicaId, fetchPartitionStatus)
val delayedFetch = new DelayedFetch(timeout, fetchMetadata, this, quota, clientMetadata,
responseCallback)
// create a list of (topic, partition) pairs to use as keys for this delayed fetch operation
val delayedFetchKeys = fetchPartitionStatus.map { case (tp, _) => TopicPartitionOperationKey(tp) }
// try to complete the request immediately, otherwise put it into the purgatory;
// this is because while the delayed fetch operation is being created, new requests
// may arrive and hence make this operation completable.
delayedFetchPurgatory.tryCompleteElseWatch(delayedFetch, delayedFetchKeys)
}
}
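To recap the isolation decision at the top of fetchMessages() as a standalone function (a sketch that merely restates the logic shown above):

// Followers (and the future-replica mover) read up to the LEO; consumers read up
// to the LSO under READ_COMMITTED, and otherwise up to the HW.
def fetchIsolationFor(replicaId: Int, isolationLevel: IsolationLevel): FetchIsolation = {
  val isFromFollower = Request.isValidBrokerId(replicaId)
  val isFromConsumer = !(isFromFollower || replicaId == Request.FutureLocalReplicaId)
  if (!isFromConsumer) FetchLogEnd
  else if (isolationLevel == IsolationLevel.READ_COMMITTED) FetchTxnCommitted
  else FetchHighWatermark
}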