对于Leader选举,就是Election接口和其实现类
AuthFastLeaderElection,LeaderElection其在3.4.0之后的版本中已经不建议使⽤。
Election源码分析
public interface Election {
public Vote lookForLeader() throws InterruptedException;
public void shutdown();
}
说明:
选举的⽗接⼝为Election,其定义了lookForLeader和shutdown两个⽅法,lookForLeader表示寻找Leader,shutdown则表示关闭,如关闭服务端之间的连接。
看默认实现类FastLeaderElection:
FastLeaderElection实现了Election接⼝,重写了接⼝中定义的lookForLeader⽅法和shutdown
⽅法。
在源码分析之前,我们⾸先介绍⼏个概念:
外部投票:特指其他服务器发来的投票。
内部投票:服务器⾃身当前的投票。
选举轮次:ZooKeeper服务器Leader选举的轮次,即logical clock(逻辑时钟)。
PK:指对内部投票和外部投票进⾏⼀个对⽐来确定是否需要变更内部投票。选票管理
sendqueue:选票发送队列,⽤于保存待发送的选票。
recvqueue:选票接收队列,⽤于保存接收到的外部投票。
lookForLeader函数
当 ZooKeeper 服务器检测到当前服务器状态变成 LOOKING 时,就会触发 Leader选举,即调⽤
lookForLeader⽅法来进⾏Leader选举。
public Vote lookForLeader() throws InterruptedException {
synchronized(this){
// ⾸先会将逻辑时钟⾃增,每进⾏⼀轮新的leader选举,都需要更新逻辑时钟
logicalclock++;
// 更新选票(初始化选票)
updateProposal(getInitId(), getInitLastLoggedZxid(),
getPeerEpoch());
}
LOG.info("New election. My id = " + self.getId() +
", proposed zxid=0x" + Long.toHexString(proposedZxid));
// 向其他服务器发送⾃⼰的选票(已更新的选票)
sendNotifications();
之后每台服务器会不断地从recvqueue队列中获取外部选票。如果服务器发现⽆法获取到任何外部投
票,就⽴即确认⾃⼰是否和集群中其他服务器保持着有效的连接,如果没有连接,则⻢上建⽴连接,如
果已经建⽴了连接,则再次发送⾃⼰当前的内部投票,其流程如下:
/*
* Sends more notifications if haven't received enough.
* Otherwise processes new notification.
*/
if(n == null){
//没有接收到选票,就检查是不是自己的选票都已经投递了,没投递就说明和服务器断开了连接
if(manager.haveDelivered()){
//已经投递了就再投一遍
sendNotifications();
} else {
// 断开了就连接
manager.connectAll();
}
/*
* Exponential backoff
*/
int tmpTimeOut = notTimeout*2;
notTimeout = (tmpTimeOut < maxNotificationInterval?
tmpTimeOut : maxNotificationInterval);
LOG.info("Notification time out: " + notTimeout);
}
// 收到了选票,则先校验投票的投票者和leader是否有效
else if (validVoter(n.sid) && validVoter(n.leader)) {
/*
* Only proceed if the vote comes from a replica in the current or next
* voting view for a replica in the current or next voting view.
*/
switch (n.state) {
case LOOKING:
// If notification > current, replace and send messages out
//比较外部投票的轮次和自身的轮次大小
if (n.electionEpoch > logicalclock.get()) {
//外部大于内部,则将自身轮次设为外部投票
logicalclock.set(n.electionEpoch);
//清空自身投票
recvset.clear();
//比较投票
if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
updateProposal(n.leader, n.zxid, n.peerEpoch);
} else {
updateProposal(getInitId(),
getInitLastLoggedZxid(),
getPeerEpoch());
}
sendNotifications();
} else if (n.electionEpoch < logicalclock.get()) {
先看totalOrderPredicate方法:
/**
* Check if a pair (server id, zxid) succeeds our
* current vote.
*
* @param id Server identifier
* @param zxid Last zxid observed by the issuer of this vote
*/
protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
LOG.debug("id: " + newId + ", proposed id: " + curId + ", zxid: 0x" +
Long.toHexString(newZxid) + ", proposed zxid: 0x" + Long.toHexString(curZxid));
if(self.getQuorumVerifier().getWeight(newId) == 0){
return false;
}
/*
* We return true if one of the following three cases hold:
* 1- New epoch is higher
* 2- New epoch is the same as current epoch, but new zxid is higher
* 3- New epoch is the same as current epoch, new zxid is the same
* as current zxid, but server id is higher.
*/
// 先比较轮次id,一样再比较事务id,一样再比较服务器id
return ((newEpoch > curEpoch) ||
((newEpoch == curEpoch) &&
((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
}
再回到刚刚的分支:
} else if (n.electionEpoch < logicalclock.get()) {
//外部投票小于内部直接丢弃
if(LOG.isDebugEnabled()){
LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
+ Long.toHexString(n.electionEpoch)
+ ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
}
break;
} //比较投票,若外部投票更大
else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
proposedLeader, proposedZxid, proposedEpoch)) {
//更新投票信息
updateProposal(n.leader, n.zxid, n.peerEpoch);
//发送投票
sendNotifications();
}
判断完外部投票后:
//归档投票信息
recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
//判断是否有leader可以选出
if (termPredicate(recvset,
new Vote(proposedLeader, proposedZxid,
logicalclock.get(), proposedEpoch))) {
// Verify if there is any change in the proposed leader
while((n = recvqueue.poll(finalizeWait,
TimeUnit.MILLISECONDS)) != null){
if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
proposedLeader, proposedZxid, proposedEpoch)){
recvqueue.put(n);
break;
}
}
/*
* This predicate is true once we don't read any new
* relevant message from the reception queue
*/
if (n == null) {
self.setPeerState((proposedLeader == self.getId()) ?
ServerState.LEADING: learningState());
Vote endVote = new Vote(proposedLeader,
proposedZxid, proposedEpoch);
leaveInstance(endVote);
return endVote;
}
}
查看termPredicate:
/**
* Termination predicate. Given a set of votes, determines if have
* sufficient to declare the end of the election round.
*
* @param votes
* Set of votes
* @param vote
* Identifier of the vote received last
*/
private boolean termPredicate(HashMap votes, Vote vote) {
SyncedLearnerTracker voteSet = new SyncedLearnerTracker();
voteSet.addQuorumVerifier(self.getQuorumVerifier());
if (self.getLastSeenQuorumVerifier() != null
&& self.getLastSeenQuorumVerifier().getVersion() > self
.getQuorumVerifier().getVersion()) {
voteSet.addQuorumVerifier(self.getLastSeenQuorumVerifier());
}
/*
* First make the views consistent. Sometimes peers will have different
* zxids for a server depending on timing.
*/
//遍历投票集合
for (Map.Entry entry : votes.entrySet()) {
//找出所有等于此次投票的记录
if (vote.equals(entry.getValue())) {
voteSet.addAck(entry.getKey());
}
}
//判断是否可以找出leader,即是否过半
return voteSet.hasAllQuorums();
}
至此,leader的选举就完成了。