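ReplicaStateMachine is the state machine that the Controller Leader uses to track replica state. A replica is always in one of 7 states:
NewReplica: the state of a replica that has just been created, either for a new topic or during partition replica assignment. A replica in this state can only become a follower.
OnlineReplica: the state of a replica that is working normally. A replica in this state can become either the leader or a follower.
OfflineReplica: the broker hosting the replica has gone offline.
ReplicaDeletionStarted: the replica enters this state when its deletion begins.
ReplicaDeletionSuccessful: the replica was deleted successfully.
ReplicaDeletionIneligible: the replica enters this state if the deletion fails.
NonExistentReplica: the final state of a replica once its deletion has completed.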
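The seven states map naturally onto a sealed trait with one case object per state. The block below is a minimal sketch rather than Kafka's actual definition: the enclosing object `ReplicaStatesSketch` and the helper `canBecomeLeader` are hypothetical names introduced here, and the helper only encodes the rule stated above that an OnlineReplica can be a leader while a NewReplica can only be a follower.

```scala
object ReplicaStatesSketch {
  // One case object per state named in the list above.
  sealed trait ReplicaState
  case object NewReplica extends ReplicaState
  case object OnlineReplica extends ReplicaState
  case object OfflineReplica extends ReplicaState
  case object ReplicaDeletionStarted extends ReplicaState
  case object ReplicaDeletionSuccessful extends ReplicaState
  case object ReplicaDeletionIneligible extends ReplicaState
  case object NonExistentReplica extends ReplicaState

  val all: Seq[ReplicaState] = Seq(
    NewReplica, OnlineReplica, OfflineReplica, ReplicaDeletionStarted,
    ReplicaDeletionSuccessful, ReplicaDeletionIneligible, NonExistentReplica)

  // Hypothetical helper: per the descriptions above, only a replica that is
  // working normally (OnlineReplica) may act as leader.
  def canBecomeLeader(state: ReplicaState): Boolean = state == OnlineReplica
}
```

Because the trait is sealed, the compiler can warn when a `match` over `ReplicaState` misses a state, which is useful for a state machine with this many cases.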
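The transitions between these states, and what the Controller does on each, are as follows:
NonExistentReplica->NewReplica
The Controller sends a LeaderAndIsrRequest to the broker hosting the replica and an UpdateMetadataRequest to every live broker in the cluster.
NewReplica->OnlineReplica
The Controller adds the new replica to the AR (assigned replicas) set if it is not already there.
OnlineReplica,OfflineReplica->OnlineReplica
The Controller sends a LeaderAndIsrRequest to the broker hosting the replica and an UpdateMetadataRequest to every live broker in the cluster. The precondition check in handleStateChange() also accepts ReplicaDeletionIneligible as a starting state for this transition.
NewReplica,OnlineReplica,OfflineReplica,ReplicaDeletionIneligible->OfflineReplica
The Controller sends a StopReplicaRequest to the broker hosting the replica, then removes the replica from the ISR set, and finally sends a LeaderAndIsrRequest to the brokers hosting the remaining replicas and an UpdateMetadataRequest to every live broker in the cluster.
OfflineReplica->ReplicaDeletionStarted
The Controller sends a StopReplicaRequest (with deletePartition = true) to the broker hosting the replica.
ReplicaDeletionStarted->ReplicaDeletionSuccessful,ReplicaDeletionStarted->ReplicaDeletionIneligible,ReplicaDeletionSuccessful->NonExistentReplica
These transitions send no requests. The first two only update the recorded state; the last one also removes the replica from the cached AR set and drops its entry from replicaState.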
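The transition rules can be condensed into a lookup from a target state to the set of states it may be entered from. The sketch below is an illustration, not Kafka's code: `ReplicaTransitionSketch`, `validPreviousStates`, and `isValidTransition` are hypothetical names, and the sets are copied from the preconditions that handleStateChange() passes to assertValidPreviousStates.

```scala
object ReplicaTransitionSketch {
  sealed trait ReplicaState
  case object NewReplica extends ReplicaState
  case object OnlineReplica extends ReplicaState
  case object OfflineReplica extends ReplicaState
  case object ReplicaDeletionStarted extends ReplicaState
  case object ReplicaDeletionSuccessful extends ReplicaState
  case object ReplicaDeletionIneligible extends ReplicaState
  case object NonExistentReplica extends ReplicaState

  // For each target state, the states a replica may legally come from.
  def validPreviousStates(target: ReplicaState): Set[ReplicaState] = target match {
    case NewReplica => Set(NonExistentReplica)
    case OnlineReplica | OfflineReplica =>
      Set(NewReplica, OnlineReplica, OfflineReplica, ReplicaDeletionIneligible)
    case ReplicaDeletionStarted => Set(OfflineReplica)
    case ReplicaDeletionSuccessful | ReplicaDeletionIneligible => Set(ReplicaDeletionStarted)
    case NonExistentReplica => Set(ReplicaDeletionSuccessful)
  }

  // Hypothetical helper: true when moving from `from` to `to` is allowed.
  def isValidTransition(from: ReplicaState, to: ReplicaState): Boolean =
    validPreviousStates(to).contains(from)
}
```

Under these rules a replica cannot jump from OnlineReplica straight to ReplicaDeletionStarted; it has to pass through OfflineReplica first, so the broker stops fetching before deletion begins.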
On startup, ReplicaStateMachine initializes the replicaState map and then calls handleStateChanges() to try to move every live replica to the OnlineReplica state.
/**
 * Invoked on successful controller election. First registers a broker change listener since that triggers all
 * state transitions for replicas. Initializes the state of replicas for all partitions by reading from zookeeper.
 * Then triggers the OnlineReplica state change for all replicas.
 */
def startup() {
  // initialize replica state
  // (populates the replicaState map)
  initializeReplicaState()
  // set started flag
  hasStarted.set(true)
  // move all Online replicas to Online
  // (tries to move every live replica to the OnlineReplica state)
  handleStateChanges(controllerContext.allLiveReplicas(), OnlineReplica)
  info("Started replica state machine with initial state -> " + replicaState.toString())
}
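Each replica's initial state depends on whether its broker is alive, that is, whether the replica ID appears in controllerContext.liveBrokerIds. Replicas on live brokers start as OnlineReplica; all others start as ReplicaDeletionIneligible.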
/**
 * Invoked on startup of the replica's state machine to set the initial state for replicas of all existing partitions
 * in zookeeper
 */
private def initializeReplicaState() {
  for((topicPartition, assignedReplicas) <- controllerContext.partitionReplicaAssignment) {
    val topic = topicPartition.topic
    val partition = topicPartition.partition
    // iterate over the AR set of each partition
    assignedReplicas.foreach { replicaId =>
      val partitionAndReplica = PartitionAndReplica(topic, partition, replicaId)
      controllerContext.liveBrokerIds.contains(replicaId) match {
        // replicas on live brokers start as OnlineReplica, all others as ReplicaDeletionIneligible
        case true => replicaState.put(partitionAndReplica, OnlineReplica)
        case false =>
          // mark replicas on dead brokers as failed for topic deletion, if they belong to a topic to be deleted.
          // This is required during controller failover since during controller failover a broker can go down,
          // so the replicas on that broker should be moved to ReplicaDeletionIneligible to be on the safer side.
          replicaState.put(partitionAndReplica, ReplicaDeletionIneligible)
      }
    }
  }
}
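The initialization rule can also be written as a pure function from the replica assignment and the set of live brokers to a state map. This is a sketch under simplifying assumptions: `InitialReplicaStateSketch` and `initialStates` are hypothetical names, partitions are keyed by a plain `(topic, partition)` tuple, and `PartitionAndReplica` is a local stand-in for the class used in the source.

```scala
object InitialReplicaStateSketch {
  sealed trait ReplicaState
  case object OnlineReplica extends ReplicaState
  case object ReplicaDeletionIneligible extends ReplicaState

  // Local stand-in for the PartitionAndReplica key used by the state machine.
  final case class PartitionAndReplica(topic: String, partition: Int, replica: Int)

  // assignment: (topic, partition) -> AR list; liveBrokerIds: brokers currently alive.
  def initialStates(assignment: Map[(String, Int), Seq[Int]],
                    liveBrokerIds: Set[Int]): Map[PartitionAndReplica, ReplicaState] =
    assignment.toSeq.flatMap { case ((topic, partition), replicas) =>
      replicas.map { replicaId =>
        // Live broker -> OnlineReplica, otherwise -> ReplicaDeletionIneligible.
        val state: ReplicaState =
          if (liveBrokerIds.contains(replicaId)) OnlineReplica else ReplicaDeletionIneligible
        PartitionAndReplica(topic, partition, replicaId) -> state
      }
    }.toMap
}
```

For example, with the assignment `Map(("t", 0) -> Seq(1, 2))` and live brokers `Set(1)`, replica 1 starts as OnlineReplica and replica 2 as ReplicaDeletionIneligible.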
The core method of ReplicaStateMachine is handleStateChange(), which controls every ReplicaState transition.
def handleStateChange(partitionAndReplica: PartitionAndReplica, targetState: ReplicaState,
                      callbacks: Callbacks) {
  val topic = partitionAndReplica.topic
  val partition = partitionAndReplica.partition
  val replicaId = partitionAndReplica.replica
  val topicAndPartition = TopicAndPartition(topic, partition)
  // throw if the ReplicaStateMachine has not been started yet
  if (!hasStarted.get)
    throw new StateChangeFailedException(("Controller %d epoch %d initiated state change of replica %d for partition %s " +
                                          "to %s failed because replica state machine has not started")
                                            .format(controllerId, controller.epoch, replicaId, topicAndPartition, targetState))
  val currState = replicaState.getOrElseUpdate(partitionAndReplica, NonExistentReplica)
  try {
    // get the AR set of the partition
    val replicaAssignment = controllerContext.partitionReplicaAssignment(topicAndPartition)
    targetState match {
      // before each transition, check that the current state is a valid predecessor of targetState
      case NewReplica =>
        assertValidPreviousStates(partitionAndReplica, List(NonExistentReplica), targetState)
        // start replica as a follower to the current leader for its partition
        // read the partition's leader, ISR and related info from ZooKeeper
        val leaderIsrAndControllerEpochOpt = ReplicationUtils.getLeaderIsrAndEpochForPartition(zkUtils, topic, partition)
        leaderIsrAndControllerEpochOpt match {
          case Some(leaderIsrAndControllerEpoch) => // a replica in the NewReplica state cannot be the leader; fail if it is
            if(leaderIsrAndControllerEpoch.leaderAndIsr.leader == replicaId)
              throw new StateChangeFailedException("Replica %d for partition %s cannot be moved to NewReplica"
                .format(replicaId, topicAndPartition) + "state as it is being requested to become leader")
            // send a LeaderAndIsrRequest to the replica and an UpdateMetadataRequest to every live broker
            brokerRequestBatch.addLeaderAndIsrRequestForBrokers(List(replicaId),
                                                                topic, partition, leaderIsrAndControllerEpoch,
                                                                replicaAssignment)
          case None => // new leader request will be sent to this replica when one gets elected
        }
        // record the replica state as NewReplica
        replicaState.put(partitionAndReplica, NewReplica)
        stateChangeLogger.trace("Controller %d epoch %d changed state of replica %d for partition %s from %s to %s"
                                  .format(controllerId, controller.epoch, replicaId, topicAndPartition, currState,
                                          targetState))
      case ReplicaDeletionStarted =>
        assertValidPreviousStates(partitionAndReplica, List(OfflineReplica), targetState)
        replicaState.put(partitionAndReplica, ReplicaDeletionStarted)
        // send stop replica command
        // (StopReplicaRequest with deletePartition = true, plus a response callback)
        brokerRequestBatch.addStopReplicaRequestForBrokers(List(replicaId), topic, partition, deletePartition = true,
          callbacks.stopReplicaResponseCallback)
        stateChangeLogger.trace("Controller %d epoch %d changed state of replica %d for partition %s from %s to %s"
          .format(controllerId, controller.epoch, replicaId, topicAndPartition, currState, targetState))
      case ReplicaDeletionIneligible =>
        assertValidPreviousStates(partitionAndReplica, List(ReplicaDeletionStarted), targetState)
        // record the replica state as ReplicaDeletionIneligible
        replicaState.put(partitionAndReplica, ReplicaDeletionIneligible)
        stateChangeLogger.trace("Controller %d epoch %d changed state of replica %d for partition %s from %s to %s"
          .format(controllerId, controller.epoch, replicaId, topicAndPartition, currState, targetState))
      case ReplicaDeletionSuccessful =>
        assertValidPreviousStates(partitionAndReplica, List(ReplicaDeletionStarted), targetState)
        // record the replica state as ReplicaDeletionSuccessful
        replicaState.put(partitionAndReplica, ReplicaDeletionSuccessful)
        stateChangeLogger.trace("Controller %d epoch %d changed state of replica %d for partition %s from %s to %s"
          .format(controllerId, controller.epoch, replicaId, topicAndPartition, currState, targetState))
      case NonExistentReplica =>
        assertValidPreviousStates(partitionAndReplica, List(ReplicaDeletionSuccessful), targetState)
        // remove this replica from the assigned replicas list for its partition
        val currentAssignedReplicas = controllerContext.partitionReplicaAssignment(topicAndPartition)
        controllerContext.partitionReplicaAssignment.put(topicAndPartition, currentAssignedReplicas.filterNot(_ == replicaId))
        // drop the replica's entry from the state map
        replicaState.remove(partitionAndReplica)
        stateChangeLogger.trace("Controller %d epoch %d changed state of replica %d for partition %s from %s to %s"
          .format(controllerId, controller.epoch, replicaId, topicAndPartition, currState, targetState))
      case OnlineReplica =>
        assertValidPreviousStates(partitionAndReplica,
          List(NewReplica, OnlineReplica, OfflineReplica, ReplicaDeletionIneligible), targetState)
        replicaState(partitionAndReplica) match {
          // a NewReplica is added to the AR set
          case NewReplica =>
            // add this replica to the assigned replicas list for its partition
            val currentAssignedReplicas = controllerContext.partitionReplicaAssignment(topicAndPartition)
            if(!currentAssignedReplicas.contains(replicaId))
              controllerContext.partitionReplicaAssignment.put(topicAndPartition, currentAssignedReplicas :+ replicaId)
            stateChangeLogger.trace("Controller %d epoch %d changed state of replica %d for partition %s from %s to %s"
                                      .format(controllerId, controller.epoch, replicaId, topicAndPartition, currState,
                                              targetState))
          case _ =>
            // check if the leader for this partition ever existed
            controllerContext.partitionLeadershipInfo.get(topicAndPartition) match {
              // a leader exists: send a LeaderAndIsrRequest to the replica and an UpdateMetadataRequest to every live broker
              case Some(leaderIsrAndControllerEpoch) =>
                brokerRequestBatch.addLeaderAndIsrRequestForBrokers(List(replicaId), topic, partition, leaderIsrAndControllerEpoch,
                  replicaAssignment)
                replicaState.put(partitionAndReplica, OnlineReplica)
                stateChangeLogger.trace("Controller %d epoch %d changed state of replica %d for partition %s from %s to %s"
                  .format(controllerId, controller.epoch, replicaId, topicAndPartition, currState, targetState))
              case None => // that means the partition was never in OnlinePartition state, this means the broker never
                // started a log for that partition and does not have a high watermark value for this partition
            }
        }
        // record the replica state as OnlineReplica
        replicaState.put(partitionAndReplica, OnlineReplica)
      case OfflineReplica =>
        assertValidPreviousStates(partitionAndReplica,
          List(NewReplica, OnlineReplica, OfflineReplica, ReplicaDeletionIneligible), targetState)
        // send stop replica command to the replica so that it stops fetching from the leader
        // (deletePartition = false, so the replica itself is not deleted here)
        brokerRequestBatch.addStopReplicaRequestForBrokers(List(replicaId), topic, partition, deletePartition = false)
        // As an optimization, the controller removes dead replicas from the ISR
        val leaderAndIsrIsEmpty: Boolean =
          controllerContext.partitionLeadershipInfo.get(topicAndPartition) match {
            case Some(currLeaderIsrAndControllerEpoch) =>
              // remove the replica from the ISR
              controller.removeReplicaFromIsr(topic, partition, replicaId) match {
                case Some(updatedLeaderIsrAndControllerEpoch) =>
                  // send the shrunk ISR state change request to all the remaining alive replicas of the partition.
                  val currentAssignedReplicas = controllerContext.partitionReplicaAssignment(topicAndPartition)
                  if (!controller.deleteTopicManager.isPartitionToBeDeleted(topicAndPartition)) {
                    // send a LeaderAndIsrRequest to the other replicas and an UpdateMetadataRequest to every live broker
                    brokerRequestBatch.addLeaderAndIsrRequestForBrokers(currentAssignedReplicas.filterNot(_ == replicaId),
                      topic, partition, updatedLeaderIsrAndControllerEpoch, replicaAssignment)
                  }
                  replicaState.put(partitionAndReplica, OfflineReplica)
                  stateChangeLogger.trace("Controller %d epoch %d changed state of replica %d for partition %s from %s to %s"
                    .format(controllerId, controller.epoch, replicaId, topicAndPartition, currState, targetState))
                  false
                case None =>
                  true
              }
            case None =>
              true
          }
        if (leaderAndIsrIsEmpty && !controller.deleteTopicManager.isPartitionToBeDeleted(topicAndPartition))
          throw new StateChangeFailedException(
            "Failed to change state of replica %d for partition %s since the leader and isr path in zookeeper is empty"
            .format(replicaId, topicAndPartition))
    }
  }
  catch {
    case t: Throwable =>
      stateChangeLogger.error("Controller %d epoch %d initiated state change of replica %d for partition [%s,%d] from %s to %s failed"
                                .format(controllerId, controller.epoch, replicaId, topic, partition, currState, targetState), t)
  }
}
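The OfflineReplica branch is the most involved one, so its two data manipulations are isolated below: shrinking the ISR and choosing which brokers receive the updated LeaderAndIsrRequest. This is a simplified sketch, not the body of controller.removeReplicaFromIsr. The names `OfflineReplicaSketch`, `LeaderAndIsr`, `removeFromIsr`, and `leaderAndIsrRecipients` are hypothetical, and using -1 to mark a partition without a leader is an assumption made for illustration.

```scala
object OfflineReplicaSketch {
  // Hypothetical, simplified stand-in for the controller's leader/ISR record.
  final case class LeaderAndIsr(leader: Int, isr: List[Int])

  // Assumption for this sketch: -1 marks a partition that currently has no leader.
  val NoLeader: Int = -1

  // Drop replicaId from the ISR; if it was the leader, mark the partition leaderless.
  // A replica that is not in the ISR leaves the record unchanged.
  def removeFromIsr(current: LeaderAndIsr, replicaId: Int): LeaderAndIsr =
    if (!current.isr.contains(replicaId)) current
    else {
      val newLeader = if (current.leader == replicaId) NoLeader else current.leader
      LeaderAndIsr(newLeader, current.isr.filterNot(_ == replicaId))
    }

  // The shrunk ISR is sent to every assigned replica except the one going offline,
  // mirroring currentAssignedReplicas.filterNot(_ == replicaId) in the source above.
  def leaderAndIsrRecipients(assignedReplicas: Seq[Int], replicaId: Int): Seq[Int] =
    assignedReplicas.filterNot(_ == replicaId)
}
```

Keeping these steps as pure functions makes them easy to check in isolation, whereas the real method interleaves them with ZooKeeper updates and request batching.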