在讲解流程之前,先说明一下选举流程中涉及到的角色,以及涉及到的关键类和变量(源码参考版本:3.4.9):
角色:1.LOOKING:竞选
2.OBSERVING:观察
3.FOLLOWING:跟随者
4.LEADER:领导者
投票信息:
1.logicalclock(electionEpoch):本地选举周期,每次投票都会自增
2.epoch(peerEpoch):选举周期,每次选举最终确定完leader结束选举流程时会自增(真正zxid的前32位)
3.zxid:数据ID,每次数据变动都会自增(真正zxid的后32位,zxid一共64位)
4.sid:该投票信息所属的serverId
5.leader:提议的leader(被提议的server的serverId,即sid)
投票比较规则:
1.epoch大的胜出,否则进行步骤2
2.zxid大的胜出,否则进行步骤3
3.sid大的胜出
比较规则的源码如下:
/**
* Check if a pair (server id, zxid) succeeds our
* current vote.
*
* @param id Server identifier
* @param zxid Last zxid observed by the issuer of this vote
*/
protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
LOG.debug("id: " + newId + ", proposed id: " + curId + ", zxid: 0x" +
Long.toHexString(newZxid) + ", proposed zxid: 0x" + Long.toHexString(curZxid));
if(self.getQuorumVerifier().getWeight(newId) == 0){
return false;
}
/*
* We return true if one of the following three cases hold:
* 1- New epoch is higher
* 2- New epoch is the same as current epoch, but new zxid is higher
* 3- New epoch is the same as current epoch, new zxid is the same
* as current zxid, but server id is higher.
*/
return ((newEpoch > curEpoch) ||
((newEpoch == curEpoch) &&
((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
}
下面首先讲解一下大概的选举流程,这里暂时先不用考虑投票的数据是如何进行交互的,只管拿来用即可,后续会讲到选举期间投票数据是如何进行交互的。
1.首先更新logicalclock并提议自己为leader并广播出去
2.进入本轮投票的循环
3.从recvqueue队列中获取一个投票信息,如果为空则检查是否要重发自己的投票或者重连,否则进入步骤4
4.判断投票信息中的选举状态:
LOOKING状态:1.如果对方的logicalclock大于本地的logicalclock,则更新本地的logicalclock并清空本地投票信息统计箱recvset,并将自己作为候选和投票中的leader进行比较,选择大的作为新的投票,然后广播出去,否则进入步骤2
2.如果对方的logicalclock小于本地的logicalclock,则忽略对方的投票,重新进入下一轮选举流程,否则进入步骤3
3.如果两方的logicalclock相等,则比较当前本地被推选的leader和投票中的leader,选择大的作为新的投票,然后广播出去
4.把对方的投票信息保存到本地投票统计箱recvset中,判断当前被选举的leader是否在投票中占了大多数(大于一半的server数量),如果是则需再等待finalizeWait时间(从recvqueue继续poll投票消息)看是否有人修改了leader的候选,如果有则再将该投票信息再放回recvqueue中并重新开始下一轮循环,否则确定角色,结束选举
OBSERVING状态:没有投票权,无视直接进入下一轮选举
FOLLOWING/LEADING:1.如果对方的logicalclock等于本地的logicalclock,把对方的投票信息保存到本地投票统计箱recvset中,判断对方的投票信息是否在recvset中占大多数并且确认自己确实为leader,如果是则确定角色,结束选举,否则进入步骤2
2.将对方的投票信息放入本地统计不参与投票信息箱outofelection中,判断对方的投票信息是否在outofelection中占大多数并且确认自己确实为leader,如果是则更新logicalclock,并确定角色,结束选举,否则进入下一轮选举
选举流程源码如下:
/**
* Starts a new round of leader election. Whenever our QuorumPeer
* changes its state to LOOKING, this method is invoked, and it
* sends notifications to all other peers.
*
* 开始新的一轮leader选举。
* 每当当前的peer的选举状态为LOOKING时,这个方法就会执行,并且会向其他peer发送提议leader消息。
*
*/
public Vote lookForLeader() throws InterruptedException {
try {
self.jmxLeaderElectionBean = new LeaderElectionBean();
MBeanRegistry.getInstance().register(
self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
} catch (Exception e) {
LOG.warn("Failed to register with JMX", e);
self.jmxLeaderElectionBean = null;
}
if (self.start_fle == 0) {
self.start_fle = System.currentTimeMillis();
}
try {
//本机统计的投票信息
HashMap recvset = new HashMap();
//FOLLOWING LEADING状态的节点信息-->非LOOKING状态
HashMap outofelection = new HashMap();
int notTimeout = finalizeWait;
//提议选举自己为leader
synchronized(this){
logicalclock++;
updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
}
LOG.info("New election. My id = " + self.getId() +
", proposed zxid=0x" + Long.toHexString(proposedZxid));
sendNotifications();
/*
* Loop in which we exchange notifications until we find a leader
*
* 循环:开始交换提议信息,直到选举出leader
*/
while ((self.getPeerState() == ServerState.LOOKING) &&
(!stop)){
/*
* Remove next notification from queue, times out after 2 times
* the termination time
*/
Notification n = recvqueue.poll(notTimeout,
TimeUnit.MILLISECONDS);
/*
* Sends more notifications if haven't received enough.
* Otherwise processes new notification.
*/
if(n == null){
if(manager.haveDelivered()){
sendNotifications();
} else {
manager.connectAll();
}
/*
* Exponential backoff
*/
int tmpTimeOut = notTimeout*2;
notTimeout = (tmpTimeOut < maxNotificationInterval?
tmpTimeOut : maxNotificationInterval);
LOG.info("Notification time out: " + notTimeout);
}
else if(self.getVotingView().containsKey(n.sid)) {
/*
* Only proceed if the vote comes from a replica in the
* voting view.
*/
switch (n.state) {
case LOOKING:
// If notification > current, replace and send messages out
if (n.electionEpoch > logicalclock) {
logicalclock = n.electionEpoch;
recvset.clear();
if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
updateProposal(n.leader, n.zxid, n.peerEpoch);
} else {
updateProposal(getInitId(),
getInitLastLoggedZxid(),
getPeerEpoch());
}
sendNotifications();
} else if (n.electionEpoch < logicalclock) {
if(LOG.isDebugEnabled()){
LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
+ Long.toHexString(n.electionEpoch)
+ ", logicalclock=0x" + Long.toHexString(logicalclock));
}
break;
} else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
proposedLeader, proposedZxid, proposedEpoch)) {
updateProposal(n.leader, n.zxid, n.peerEpoch);
sendNotifications();
}
if(LOG.isDebugEnabled()){
LOG.debug("Adding vote: from=" + n.sid +
", proposed leader=" + n.leader +
", proposed zxid=0x" + Long.toHexString(n.zxid) +
", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
}
// 把对方的投票意愿缓存起来,用于最终的统计
recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
if (termPredicate(recvset,
new Vote(proposedLeader, proposedZxid,
logicalclock, proposedEpoch))) {
// Verify if there is any change in the proposed leader
while((n = recvqueue.poll(finalizeWait,
TimeUnit.MILLISECONDS)) != null){
if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
proposedLeader, proposedZxid, proposedEpoch)){
recvqueue.put(n);
break;
}
}
/*
* This predicate is true once we don't read any new
* relevant message from the reception queue
*/
if (n == null) {
self.setPeerState((proposedLeader == self.getId()) ?
ServerState.LEADING: learningState());
Vote endVote = new Vote(proposedLeader,
proposedZxid,
logicalclock,
proposedEpoch);
leaveInstance(endVote);
return endVote;
}
}
break;
case OBSERVING:
LOG.debug("Notification from observer: " + n.sid);
break;
case FOLLOWING:
case LEADING:
/*
* Consider all notifications from the same epoch
* together.
*/
if(n.electionEpoch == logicalclock){
recvset.put(n.sid, new Vote(n.leader,
n.zxid,
n.electionEpoch,
n.peerEpoch));
if(ooePredicate(recvset, outofelection, n)) {
self.setPeerState((n.leader == self.getId()) ?
ServerState.LEADING: learningState());
Vote endVote = new Vote(n.leader,
n.zxid,
n.electionEpoch,
n.peerEpoch);
leaveInstance(endVote);
return endVote;
}
}
/*
* Before joining an established ensemble, verify
* a majority is following the same leader.
*/
outofelection.put(n.sid, new Vote(n.version,
n.leader,
n.zxid,
n.electionEpoch,
n.peerEpoch,
n.state));
if(ooePredicate(outofelection, outofelection, n)) {
synchronized(this){
logicalclock = n.electionEpoch;
self.setPeerState((n.leader == self.getId()) ?
ServerState.LEADING: learningState());
}
Vote endVote = new Vote(n.leader,
n.zxid,
n.electionEpoch,
n.peerEpoch);
leaveInstance(endVote);
return endVote;
}
break;
default:
LOG.warn("Notification state unrecognized: {} (n.state), {} (n.sid)",
n.state, n.sid);
break;
}
} else {
LOG.warn("Ignoring notification from non-cluster member " + n.sid);
}
}
return null;
} finally {
try {
if(self.jmxLeaderElectionBean != null){
MBeanRegistry.getInstance().unregister(
self.jmxLeaderElectionBean);
}
} catch (Exception e) {
LOG.warn("Failed to unregister with JMX", e);
}
self.jmxLeaderElectionBean = null;
}
}
选举流程图如下:
上面讲解了快速的选举流程,那么选举中的数据是怎么交互的呢,下面来进行进一步的讲解:
在zookeeper的启动脚本zkServer.cmd可以看到有这么一行脚本内容:
set ZOOMAIN=org.apache.zookeeper.server.quorum.QuorumPeerMain
echo on
call %JAVA% "-Dzookeeper.log.dir=%ZOO_LOG_DIR%" "-Dzookeeper.root.logger=%ZOO_LOG4J_PROP%" -cp "%CLASSPATH%" %ZOOMAIN% "%ZOOCFG%" %*
我们得知启动类为:org.apache.zookeeper.server.quorum.QuorumPeerMain,跟踪代码可以得知选举流程为:
FastLeaderElection类中的lookForLeader()方法,实际发生网络交互的地方为QuorumCnxManager类,类图关系如下两图:
具体说明:
QuorumCnxManager类为实际发生网络交互的地方,负责网络通讯中收集与发送投票信息,有类图关系中可以看到此类中有个叫Listener的内部类,此类负责保证连接的一对一以及启动两个线程进行投票消息的收发:sendWorker和recvWorker;
FastLeaderElection类中也有两个内部类负责投票信息的收发:WorkerSender和WorkerReceiver。
消息发送条线:选举方法lookForLeader()中发送投票时是将投票信息放入FastLeaderElection类中的sendqueue队列中,而WorkerSender(FastLeaderElection):负责将sendqueue队列中的信息放入QuorumCnxManager类中的queueSendMap中;而sendWorker(QuorumCnxManager):负责将QuorumCnxManager类中的queueSendMap中的投票信息发送到网络上。
消息接收条线:recvWorker(QuorumCnxManager):负责接收网络上的投票信息,并放入QuorumCnxManager类的recrQueue队列中;WorkerReceiver(FastLeaderElection):负责从QuorumCnxManager类中的recrQueue队列中获取数据,并放入FastLeaderElection类中的recvqueue队列中。
自己拷贝了一份3.4.9的源码并添加了些许注释:https://github.com/learnertogether/zookeeper-3.4.9.git