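When ZooKeeper (zk) comes up, the first thing that comes to mind is its architecture: one leader with multiple followers and observers. This article analyzes the source code of zk's leader election. It has two parts: the first covers the election flow, and the second covers the core logic of the election algorithm. Read the election flow alongside the two diagrams; the algorithm part walks through zk's actual election implementation.
First, look at the two election diagrams:
Notes on the first diagram:
- QuorumPeer: drives the election
- Messenger: the class that actually sends and receives election messages. (Note: ZooKeeper provided three leader election algorithms: LeaderElection, a UDP-based FastLeaderElection, and a TCP-based FastLeaderElection. Starting with version 3.4.0 the first two were deprecated, leaving the TCP-based FastLeaderElection as the one to use, so the Messenger here is FastLeaderElection.Messenger.)
- WorkerReceiver and WorkerSender: the receiving and sending threads inside Messenger
- QuorumCnxManager: the socket layer used to exchange election messages. As the code later in this article shows, it manages plain Java sockets directly (there is no Netty and no selector): each peer connection gets its own SendWorker and RecvWorker thread, and messages pass through per-peer queues in arrival order
The second diagram is explained in more detail as we follow the code:
If anything below is unclear, a look at the two diagrams should clear it up.
Election flow
Every server starts a QuorumPeer, which drives the whole election. It is a thread class (not a separate process) and is started from the QuorumPeerMain class:
Startup call sequence:
QuorumPeerMain#main->QuorumPeerMain#initializeAndRun->QuorumPeerMain#runFromConfig (cluster) / ZooKeeperServerMain#main (standalone)
We follow the cluster path. After loading the configuration, QuorumPeerMain#runFromConfig starts the QuorumPeer thread, so we go straight to QuorumPeer's run method.
//main thread: the QuorumPeer loop switches among FastLeaderElection, Leader, Follower and Observer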
@Override
public void run() {
updateThreadName();
LOG.debug("Starting quorum peer");
try {
jmxQuorumBean = new QuorumBean(this);
MBeanRegistry.getInstance().register(jmxQuorumBean, null);
for(QuorumServer s: getView().values()){
ZKMBeanInfo p;
if (getId() == s.id) {
p = jmxLocalPeerBean = new LocalPeerBean(this);
try {
MBeanRegistry.getInstance().register(p, jmxQuorumBean);
} catch (Exception e) {
LOG.warn("Failed to register with JMX", e);
jmxLocalPeerBean = null;
}
} else {
RemotePeerBean rBean = new RemotePeerBean(this, s);
try {
MBeanRegistry.getInstance().register(rBean, jmxQuorumBean);
jmxRemotePeerBean.put(s.id, rBean);
} catch (Exception e) {
LOG.warn("Failed to register with JMX", e);
}
}
}
} catch (Exception e) {
LOG.warn("Failed to register with JMX", e);
jmxQuorumBean = null;
}
try {
/*
* Main loop
*/
while (running) {
switch (getPeerState()) {
case LOOKING:
LOG.info("LOOKING");
if (Boolean.getBoolean("readonlymode.enabled")) {
LOG.info("Attempting to start ReadOnlyZooKeeperServer");
// Create read-only server but don't start it immediately
final ReadOnlyZooKeeperServer roZk =
new ReadOnlyZooKeeperServer(logFactory, this, this.zkDb);
// Instead of starting roZk immediately, wait some grace
// period before we decide we're partitioned.
//
// Thread is used here because otherwise it would require
// changes in each of election strategy classes which is
// unnecessary code coupling.
Thread roZkMgr = new Thread() {
public void run() {
try {
// lower-bound grace period to 2 secs
sleep(Math.max(2000, tickTime));
if (ServerState.LOOKING.equals(getPeerState())) {
roZk.startup();
}
} catch (InterruptedException e) {
LOG.info("Interrupted while attempting to start ReadOnlyZooKeeperServer, not started");
} catch (Exception e) {
LOG.error("FAILED to start ReadOnlyZooKeeperServer", e);
}
}
};
try {
roZkMgr.start();
reconfigFlagClear();
if (shuttingDownLE) {
shuttingDownLE = false;
//-- initialize the threads and queues used during the election
startLeaderElection();
}
//-- entry point of the election logic
setCurrentVote(makeLEStrategy().lookForLeader());
} catch (Exception e) {
LOG.warn("Unexpected exception", e);
setPeerState(ServerState.LOOKING);
} finally {
// If the thread is in the the grace period, interrupt
// to come out of waiting.
roZkMgr.interrupt();
roZk.shutdown();
}
} else {
try {
reconfigFlagClear();
if (shuttingDownLE) {
shuttingDownLE = false;
startLeaderElection();
}
setCurrentVote(makeLEStrategy().lookForLeader());
} catch (Exception e) {
LOG.warn("Unexpected exception", e);
setPeerState(ServerState.LOOKING);
}
}
break;
case OBSERVING:
try {
LOG.info("OBSERVING");
setObserver(makeObserver(logFactory));
//6-- start the Observer
//6-- similar to a follower: register, sync transactions, then call processPacket
observer.observeLeader();
} catch (Exception e) {
LOG.warn("Unexpected exception",e );
} finally {
observer.shutdown();
setObserver(null);
updateServerState();
}
break;
case FOLLOWING://6-- a Follower first connects to the leader and syncs the write-transaction history; only then does it start the ZooKeeperServer to serve clients
try {
LOG.info("FOLLOWING");
setFollower(makeFollower(logFactory));
//6-- register with the Leader: connectToLeader->registerWithLeader->syncWithLeader
follower.followLeader();
} catch (Exception e) {
LOG.warn("Unexpected exception",e);
} finally {
follower.shutdown();
setFollower(null);
updateServerState();
}
break;
case LEADING://6-- the election is over and this peer has confirmed that it is the leader
LOG.info("LEADING");
try {
setLeader(makeLeader(logFactory));
//6-- run the actual leader logic
leader.lead();
setLeader(null);
} catch (Exception e) {
LOG.warn("Unexpected exception",e);
} finally {
if (leader != null) {
leader.shutdown("Forcing shutdown");
setLeader(null);
}
updateServerState();
}
break;
}
start_fle = Time.currentElapsedTime();
}
} finally {
LOG.warn("QuorumPeer main thread exited");
MBeanRegistry instance = MBeanRegistry.getInstance();
instance.unregister(jmxQuorumBean);
instance.unregister(jmxLocalPeerBean);
for (RemotePeerBean remotePeerBean : jmxRemotePeerBean.values()) {
instance.unregister(remotePeerBean);
}
jmxQuorumBean = null;
jmxLocalPeerBean = null;
jmxRemotePeerBean = null;
}
}
QuorumPeer has four working modes:
- LOOKING: election mode, runs FastLeaderElection
- LEADING: leader mode, starts the leader
- FOLLOWING: follower mode, starts the follower
- OBSERVING: observer mode, starts the observer
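The four modes above form a small state machine. The following is a minimal sketch of that loop, not ZooKeeper code: the class `PeerLoopSketch` and its methods are hypothetical, and it assumes that a role which ends (for example because the leader is lost) simply falls back to LOOKING.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the QuorumPeer main loop; not ZooKeeper code.
class PeerLoopSketch {
    enum ServerState { LOOKING, FOLLOWING, LEADING, OBSERVING }

    /** Role a peer takes once an election has named `leaderId` as leader. */
    static ServerState afterElection(long myId, long leaderId, boolean isObserver) {
        if (leaderId == myId) {
            return ServerState.LEADING;
        }
        return isObserver ? ServerState.OBSERVING : ServerState.FOLLOWING;
    }

    /** Runs the loop for a fixed number of iterations and records the states visited. */
    static List<ServerState> simulate(long myId, long leaderId, int iterations) {
        List<ServerState> trace = new ArrayList<>();
        ServerState state = ServerState.LOOKING;
        for (int i = 0; i < iterations; i++) {
            trace.add(state);
            switch (state) {
                case LOOKING:
                    // The election ends and the peer takes up its role
                    state = afterElection(myId, leaderId, false);
                    break;
                default:
                    // Assumption of this sketch: the role ends at once and the peer looks again
                    state = ServerState.LOOKING;
                    break;
            }
        }
        return trace;
    }
}
```

For example, `PeerLoopSketch.simulate(1L, 3L, 3)` visits LOOKING, FOLLOWING and LOOKING in that order.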
Start with the call sequence in election mode:
QuorumPeer#startLeaderElection->QuorumPeer#createElectionAlgorithm
The source of QuorumPeer#createElectionAlgorithm:
@SuppressWarnings("deprecation")
protected Election createElectionAlgorithm(int electionAlgorithm){
Election le=null;
//TODO: use a factory rather than a switch
switch (electionAlgorithm) {
case 0:
le = new LeaderElection(this);
break;
case 1:
le = new AuthFastLeaderElection(this);
break;
case 2:
le = new AuthFastLeaderElection(this, true);
break;
case 3:
QuorumCnxManager qcm = createCnxnManager();
QuorumCnxManager oldQcm = qcmRef.getAndSet(qcm);
if (oldQcm != null) {
LOG.warn("Clobbering already-set QuorumCnxManager (restarting leader election?)");
oldQcm.halt();
}
QuorumCnxManager.Listener listener = qcm.listener;
if(listener != null){
listener.start();
//-- initialize the two queues sendqueue and recvqueue
//-- attach the QuorumCnxManager created above
//-- initialize the WorkerSender and WorkerReceiver threads in Messenger
FastLeaderElection fle = new FastLeaderElection(this, qcm);
//-- start the WorkerSender and WorkerReceiver threads in Messenger
fle.start();
le = fle;
} else {
LOG.error("Null listener when initializing cnx manager");
}
break;
default:
assert false;
}
return le;
}
Focus on this part of case 3:
//-- initialize the two queues sendqueue and recvqueue
//-- attach the QuorumCnxManager created above
//-- initialize the WorkerSender and WorkerReceiver threads in Messenger
FastLeaderElection fle = new FastLeaderElection(this, qcm);
//-- start the WorkerSender and WorkerReceiver threads in Messenger
fle.start();
Read together with the second diagram, this code is very clear: in the end it starts the WorkerSender and WorkerReceiver threads in Messenger.
FastLeaderElection.Messenger.WorkerSender#run
public void run() {
while (!stop) {
try {
ToSend m = sendqueue.poll(3000, TimeUnit.MILLISECONDS);
if(m == null) continue;
//-- ends up calling QuorumCnxManager#toSend()
process(m);
} catch (InterruptedException e) {
break;
}
}
LOG.info("WorkerSender is down");
}
Call sequence:
FastLeaderElection.Messenger.WorkerSender#process->QuorumCnxManager#toSend
public void toSend(Long sid, ByteBuffer b) {
    /*
     * If sending message to myself, then simply enqueue it (loopback).
     */
    //-- the message is addressed to this server itself: put it straight into recvQueue
    if (this.mySid == sid) {
        b.position(0);
        addToRecvQueue(new Message(b.duplicate(), sid));
        /*
         * Otherwise send to the corresponding thread to send.
         */
    } else {//-- otherwise put it into queueSendMap
        /*
         * Start a new connection if doesn't have one already.
         */
        ArrayBlockingQueue<ByteBuffer> bq = new ArrayBlockingQueue<ByteBuffer>(
                SEND_CAPACITY);
        ArrayBlockingQueue<ByteBuffer> oldq = queueSendMap.putIfAbsent(sid, bq);
        //-- enqueue on whichever queue is registered in queueSendMap for this sid
        if (oldq != null) {
            addToSendQueue(oldq, b);
        } else {
            addToSendQueue(bq, b);
        }
        connectOne(sid);
    }
}
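The per-peer queue lookup in toSend relies on `putIfAbsent` so that concurrent senders end up sharing one queue per sid. A minimal sketch of that pattern follows; the class `SendQueueSketch`, its `CAPACITY` value and the `pending` helper are hypothetical and exist only for illustration.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-in for the queueSendMap handling in toSend; not ZooKeeper code.
class SendQueueSketch {
    static final int CAPACITY = 4; // hypothetical capacity chosen for this sketch

    final ConcurrentHashMap<Long, ArrayBlockingQueue<ByteBuffer>> queueSendMap =
            new ConcurrentHashMap<>();

    /** Queues a message for peer `sid`, creating that peer's queue on first use. */
    boolean enqueue(long sid, ByteBuffer message) {
        ArrayBlockingQueue<ByteBuffer> fresh = new ArrayBlockingQueue<>(CAPACITY);
        // putIfAbsent returns the queue already registered, or null if `fresh` was installed
        ArrayBlockingQueue<ByteBuffer> existing = queueSendMap.putIfAbsent(sid, fresh);
        ArrayBlockingQueue<ByteBuffer> target = (existing != null) ? existing : fresh;
        return target.offer(message);
    }

    /** Number of messages waiting for peer `sid`. */
    int pending(long sid) {
        ArrayBlockingQueue<ByteBuffer> q = queueSendMap.get(sid);
        return q == null ? 0 : q.size();
    }
}
```

Two messages for the same sid land on the same queue, while a message for a new sid creates a second queue.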
The call sequence after that:
QuorumCnxManager#connectOne->QuorumCnxManager#initiateConnection->QuorumCnxManager#startConnection
private boolean startConnection(Socket sock, Long sid)
throws IOException {
DataOutputStream dout = null;
DataInputStream din = null;
try {
// Use BufferedOutputStream to reduce the number of IP packets. This is
// important for x-DC scenarios.
BufferedOutputStream buf = new BufferedOutputStream(sock.getOutputStream());
dout = new DataOutputStream(buf);
// Sending id and challenge
// represents protocol version (in other words - message type)
dout.writeLong(PROTOCOL_VERSION);
dout.writeLong(self.getId());
String addr = formatInetAddr(self.getElectionAddress());
byte[] addr_bytes = addr.getBytes();
dout.writeInt(addr_bytes.length);
dout.write(addr_bytes);
dout.flush();
din = new DataInputStream(
new BufferedInputStream(sock.getInputStream()));
} catch (IOException e) {
LOG.warn("Ignoring exception reading or writing challenge: ", e);
closeSocket(sock);
return false;
}
// authenticate learner
QuorumPeer.QuorumServer qps = self.getVotingView().get(sid);
if (qps != null) {
// TODO - investigate why reconfig makes qps null.
authLearner.authenticate(sock, qps.hostname);
}
// If lost the challenge, then drop the new connection
if (sid > self.getId()) {
LOG.info("Have smaller server identifier, so dropping the " +
"connection: (" + sid + ", " + self.getId() + ")");
closeSocket(sock);
// Otherwise proceed with the connection
} else {
SendWorker sw = new SendWorker(sock, sid);
RecvWorker rw = new RecvWorker(sock, din, sid, sw);
sw.setRecv(rw);
SendWorker vsw = senderWorkerMap.get(sid);
if(vsw != null)
vsw.finish();
senderWorkerMap.put(sid, sw);
queueSendMap.putIfAbsent(sid, new ArrayBlockingQueue<ByteBuffer>(
SEND_CAPACITY));
sw.start();
rw.start();
return true;
}
return false;
}
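The `sid > self.getId()` check in startConnection acts as a tie-break: a server drops its own outgoing connection when the remote sid is larger, so only the connection initiated by the server with the larger sid is kept. A minimal sketch of that rule, with hypothetical names (`ConnectionTieBreakSketch`, `keepsOutgoing`, `survivingInitiator`):

```java
// Illustrates the sid tie-break seen in startConnection; not ZooKeeper code.
class ConnectionTieBreakSketch {
    /** Mirrors `if (sid > self.getId())`: true if the outgoing connection is kept. */
    static boolean keepsOutgoing(long mySid, long remoteSid) {
        return remoteSid <= mySid;
    }

    /** For two distinct servers, the sid whose outgoing connection survives. */
    static long survivingInitiator(long a, long b) {
        return Math.max(a, b);
    }
}
```

For servers 1 and 3, `keepsOutgoing(3, 1)` is true and `keepsOutgoing(1, 3)` is false, so the pair is left with the connection opened by server 3.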
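So go straight to the run methods of QuorumCnxManager.SendWorker and QuorumCnxManager.RecvWorker. This is where messages are actually sent and received; everything before it is layers of wrapping.
QuorumCnxManager.SendWorker#run
//-- send out the messages queued in queueSendMap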
@Override
public void run() {
threadCnt.incrementAndGet();
try {
/**
* If there is nothing in the queue to send, then we
* send the lastMessage to ensure that the last message
* was received by the peer. The message could be dropped
* in case self or the peer shutdown their connection
* (and exit the thread) prior to reading/processing
* the last message. Duplicate messages are handled correctly
* by the peer.
*
* If the send queue is non-empty, then we have a recent
* message than that stored in lastMessage. To avoid sending
* stale message, we should send the message in the send queue.
*/
//5-- look up this peer's queue in queueSendMap by its sid
ArrayBlockingQueue<ByteBuffer> bq = queueSendMap.get(sid);
if (bq == null || isSendQueueEmpty(bq)) {
//5-- the last message sent to this peer
ByteBuffer b = lastMessageSent.get(sid);
if (b != null) {
LOG.debug("Attempting to send lastMessage to sid=" + sid);
send(b);
}
}
} catch (IOException e) {
LOG.error("Failed to send last message. Shutting down thread.", e);
this.finish();
}
try {
while (running && !shutdown && sock != null) {
ByteBuffer b = null;
try {
ArrayBlockingQueue<ByteBuffer> bq = queueSendMap
.get(sid);
if (bq != null) {
b = pollSendQueue(bq, 1000, TimeUnit.MILLISECONDS);
} else {
LOG.error("No queue of incoming messages for " +
"server " + sid);
break;
}
//-- remember it in the lastMessageSent map
if(b != null){
lastMessageSent.put(sid, b);
send(b);
}
} catch (InterruptedException e) {
LOG.warn("Interrupted while waiting for message on queue",
e);
}
}
} catch (Exception e) {
LOG.warn("Exception when using channel: for id " + sid
+ " my id = " + QuorumCnxManager.this.mySid
+ " error = " + e);
}
//-- close the connection
this.finish();
LOG.warn("Send worker leaving thread " + " id " + sid + " my id = " + self.getId());
}
The send() method used above:
synchronized void send(ByteBuffer b) throws IOException {
byte[] msgBytes = new byte[b.capacity()];
try {
b.position(0);
b.get(msgBytes);
} catch (BufferUnderflowException be) {
LOG.error("BufferUnderflowException ", be);
return;
}
//-- dout (a DataOutputStream) is the write side of the socket
dout.writeInt(b.capacity());
dout.write(b.array());
dout.flush();
}
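send() writes a 4-byte length followed by the payload, and the receiving side reads the length first and then exactly that many bytes. The round trip can be sketched over in-memory streams as follows; `FramingSketch`, `frame`, `unframe` and the `MAX_PACKET` limit are hypothetical names and values for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Length-prefixed framing over in-memory streams; not ZooKeeper code.
class FramingSketch {
    static final int MAX_PACKET = 512 * 1024; // hypothetical upper bound for this sketch

    /** Writes a 4-byte big-endian length followed by the payload. */
    static byte[] frame(byte[] payload) {
        try {
            ByteArrayOutputStream wire = new ByteArrayOutputStream();
            DataOutputStream dout = new DataOutputStream(wire);
            dout.writeInt(payload.length);
            dout.write(payload);
            dout.flush();
            return wire.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the length, validates it, then reads exactly that many bytes. */
    static byte[] unframe(byte[] wire) {
        try {
            DataInputStream din = new DataInputStream(new ByteArrayInputStream(wire));
            int length = din.readInt();
            if (length <= 0 || length > MAX_PACKET) {
                throw new IOException("invalid packet length: " + length);
            }
            byte[] msg = new byte[length];
            din.readFully(msg, 0, length);
            return msg;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The explicit length is what lets the reader find message boundaries on a TCP stream, which has none of its own.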
Next, QuorumCnxManager.RecvWorker#run:
//-- receive messages
@Override
public void run() {
threadCnt.incrementAndGet();
try {
while (running && !shutdown && sock != null) {
/**
* Reads the first int to determine the length of the
* message
*/
//-- din (a DataInputStream) is the read side of the socket
int length = din.readInt();
if (length <= 0 || length > PACKETMAXSIZE) {
throw new IOException(
"Received packet with invalid packet: "
+ length);
}
/**
* Allocates a new ByteBuffer to receive the message
*/
byte[] msgArray = new byte[length];
din.readFully(msgArray, 0, length);
ByteBuffer message = ByteBuffer.wrap(msgArray);
//-- add to the recvQueue queue
//-- if the queue is full, the message at the head is removed first, then the new one is appended at the tail
addToRecvQueue(new Message(message.duplicate(), sid));
}
} catch (Exception e) {
LOG.warn("Connection broken for id " + sid + ", my id = "
+ QuorumCnxManager.this.mySid + ", error = " , e);
} finally {
LOG.warn("Interrupting SendWorker");
sw.finish();
closeSocket(sock);
}
}
The addToRecvQueue() method used above:
//5-- when the queue is full, drop the head and keep the tail
public void addToRecvQueue(Message msg) {
synchronized(recvQLock) {
if (recvQueue.remainingCapacity() == 0) {
try {
recvQueue.remove();
} catch (NoSuchElementException ne) {
// element could be removed by poll()
LOG.debug("Trying to remove from an empty " +
"recvQueue. Ignoring exception " + ne);
}
}
try {
recvQueue.add(msg);
} catch (IllegalStateException ie) {
// This should never happen
LOG.error("Unable to insert element in the recvQueue " + ie);
}
}
}
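addToRecvQueue keeps the newest messages by evicting the oldest one when the queue is full. A minimal, self-contained sketch of that drop-oldest policy follows; the class `DropOldestQueueSketch` is hypothetical, and it uses `poll()` and `offer()` instead of `remove()` and `add()` so that no exception handling is needed.

```java
import java.util.concurrent.ArrayBlockingQueue;

// Bounded queue that discards its oldest element when full; not ZooKeeper code.
class DropOldestQueueSketch<T> {
    private final ArrayBlockingQueue<T> queue;
    private final Object lock = new Object();

    DropOldestQueueSketch(int capacity) {
        queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Adds a message, evicting the head first if there is no room left. */
    void add(T msg) {
        synchronized (lock) {
            if (queue.remainingCapacity() == 0) {
                queue.poll(); // returns null rather than throwing if a consumer emptied the queue
            }
            queue.offer(msg);
        }
    }

    T poll() {
        return queue.poll();
    }

    int size() {
        return queue.size();
    }
}
```

With capacity 2, adding 1, 2 and 3 leaves 2 and 3 in the queue. For election traffic this trade-off seems reasonable, since a newer notification from a peer supersedes an older one.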
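Now go back to the second diagram, which really matters here. Turning to FastLeaderElection, look at FastLeaderElection.Messenger.WorkerReceiver#run:
//5-- FastLeaderElection's WorkerReceiver thread takes messages from the recvQueue of QuorumCnxManager, assembles them into Notification objects, and puts them on FastLeaderElection's recvqueue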
public void run() {
Message response;
while (!stop) {
// Sleeps on receive
try {
//5-- a message from the recvQueue of QuorumCnxManager
response = manager.pollRecvQueue(3000, TimeUnit.MILLISECONDS);
if(response == null) continue;
// The current protocol and two previous generations all send at least 28 bytes
if (response.buffer.capacity() < 28) {
LOG.error("Got a short response: " + response.buffer.capacity());
continue;
}
// this is the backwardCompatibility mode in place before ZK-107
// It is for a version of the protocol in which we didn't send peer epoch
// With peer epoch and version the message became 40 bytes
boolean backCompatibility28 = (response.buffer.capacity() == 28);
// this is the backwardCompatibility mode for no version information
boolean backCompatibility40 = (response.buffer.capacity() == 40);
response.buffer.clear();
// Instantiate Notification and set its attributes
Notification n = new Notification();
int rstate = response.buffer.getInt();
long rleader = response.buffer.getLong();
long rzxid = response.buffer.getLong();
long relectionEpoch = response.buffer.getLong();
long rpeerepoch;
int version = 0x0;
if (!backCompatibility28) {
rpeerepoch = response.buffer.getLong();
if (!backCompatibility40) {
/*
* Version added in 3.4.6
*/
version = response.buffer.getInt();
} else {
LOG.info("Backward compatibility mode (36 bits), server id: {}", response.sid);
}
} else {
LOG.info("Backward compatibility mode (28 bits), server id: {}", response.sid);
rpeerepoch = ZxidUtils.getEpochFromZxid(rzxid);
}
QuorumVerifier rqv = null;
// check if we have a version that includes config. If so extract config info from message.
if (version > 0x1) {
int configLength = response.buffer.getInt();
byte b[] = new byte[configLength];
response.buffer.get(b);
synchronized(self) {
try {
rqv = self.configFromString(new String(b));
QuorumVerifier curQV = self.getQuorumVerifier();
if (rqv.getVersion() > curQV.getVersion()) {
LOG.info("{} Received version: {} my version: {}", self.getId(),
Long.toHexString(rqv.getVersion()),
Long.toHexString(self.getQuorumVerifier().getVersion()));
if (self.getPeerState() == ServerState.LOOKING) {
LOG.debug("Invoking processReconfig(), state: {}", self.getServerState());
self.processReconfig(rqv, null, null, false);
if (!rqv.equals(curQV)) {
LOG.info("restarting leader election");
self.shuttingDownLE = true;
self.getElectionAlg().shutdown();
break;
}
} else {
LOG.debug("Skip processReconfig(), state: {}", self.getServerState());
}
}
} catch (IOException e) {
LOG.error("Something went wrong while processing config received from {}", response.sid);
} catch (ConfigException e) {
LOG.error("Something went wrong while processing config received from {}", response.sid);
}
}
} else {
LOG.info("Backward compatibility mode (before reconfig), server id: {}", response.sid);
}
/*
* If it is from a non-voting server (such as an observer or
* a non-voting follower), respond right away.
*/
if(!validVoter(response.sid)) {
Vote current = self.getCurrentVote();
QuorumVerifier qv = self.getQuorumVerifier();
ToSend notmsg = new ToSend(ToSend.mType.notification,
current.getId(),
current.getZxid(),
logicalclock.get(),
self.getPeerState(),
response.sid,
current.getPeerEpoch(),
qv.toString().getBytes());
sendqueue.offer(notmsg);
} else {
// Receive new message
if (LOG.isDebugEnabled()) {
LOG.debug("Receive new notification message. My id = "
+ self.getId());
}
// State of peer that sent this message
QuorumPeer.ServerState ackstate = QuorumPeer.ServerState.LOOKING;
switch (rstate) {
case 0:
ackstate = QuorumPeer.ServerState.LOOKING;
break;
case 1:
ackstate = QuorumPeer.ServerState.FOLLOWING;
break;
case 2:
ackstate = QuorumPeer.ServerState.LEADING;
break;
case 3:
ackstate = QuorumPeer.ServerState.OBSERVING;
break;
default:
continue;
}
n.leader = rleader;
n.zxid = rzxid;
n.electionEpoch = relectionEpoch;
n.state = ackstate;
n.sid = response.sid;
n.peerEpoch = rpeerepoch;
n.version = version;
n.qv = rqv;
/*
* Print notification info
*/
if(LOG.isInfoEnabled()){
printNotification(n);
}
/*
* If this server is looking, then send proposed leader
*/
if(self.getPeerState() == QuorumPeer.ServerState.LOOKING){
//5-- put it on recvqueue to be processed later
recvqueue.offer(n);
/*
* Send a notification back if the peer that sent this
* message is also looking and its logical clock is
* lagging behind.
*/
//-- the sender is also LOOKING but its election epoch lags behind our logical clock, so send our current vote back to it
if((ackstate == QuorumPeer.ServerState.LOOKING)
&& (n.electionEpoch < logicalclock.get())){
Vote v = getVote();
QuorumVerifier qv = self.getQuorumVerifier();
ToSend notmsg = new ToSend(ToSend.mType.notification,
v.getId(),
v.getZxid(),
logicalclock.get(),
self.getPeerState(),
response.sid,
v.getPeerEpoch(),
qv.toString().getBytes());
sendqueue.offer(notmsg);
}
} else {
/*
* If this server is not looking, but the one that sent the ack
* is looking, then send back what it believes to be the leader.
*/
//-- this server is not LOOKING but the sender is, so reply with the leader this server currently believes in
Vote current = self.getCurrentVote();
if(ackstate == QuorumPeer.ServerState.LOOKING){
if(LOG.isDebugEnabled()){
LOG.debug("Sending new notification. My id ={} recipient={} zxid=0x{} leader={} config version = {}",
self.getId(),
response.sid,
Long.toHexString(current.getZxid()),
current.getId(),
Long.toHexString(self.getQuorumVerifier().getVersion()));
}
QuorumVerifier qv = self.getQuorumVerifier();
ToSend notmsg = new ToSend(
ToSend.mType.notification,
current.getId(),
current.getZxid(),
current.getElectionEpoch(),
self.getPeerState(),
response.sid,
current.getPeerEpoch(),
qv.toString().getBytes());
//-- queued on sendqueue here; WorkerSender later hands it to the per-peer queue in queueSendMap
sendqueue.offer(notmsg);
}
}
}
} catch (InterruptedException e) {
LOG.warn("Interrupted Exception while waiting for new message" +
e.toString());
}
}
LOG.info("WorkerReceiver is down");
}
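The field order that WorkerReceiver reads (state, leader, zxid, election epoch, then optionally peer epoch and version) can be sketched as a small decoder. This is a simplified illustration under stated assumptions: the class and helper names are hypothetical, the config payload is ignored, and the legacy peer epoch is taken as the upper 32 bits of the zxid, which is what `ZxidUtils.getEpochFromZxid` is assumed to compute.

```java
import java.nio.ByteBuffer;

// Decodes the fixed-size prefix of an election notification; not ZooKeeper code.
class NotificationLayoutSketch {
    final int state;
    final long leader;
    final long zxid;
    final long electionEpoch;
    final long peerEpoch;
    final int version;

    NotificationLayoutSketch(int state, long leader, long zxid,
                             long electionEpoch, long peerEpoch, int version) {
        this.state = state;
        this.leader = leader;
        this.zxid = zxid;
        this.electionEpoch = electionEpoch;
        this.peerEpoch = peerEpoch;
        this.version = version;
    }

    static NotificationLayoutSketch decode(ByteBuffer buf) {
        int capacity = buf.capacity();
        if (capacity < 28) {
            throw new IllegalArgumentException("short message: " + capacity);
        }
        buf.clear(); // rewind to position 0 before reading
        int state = buf.getInt();
        long leader = buf.getLong();
        long zxid = buf.getLong();
        long electionEpoch = buf.getLong();
        long peerEpoch;
        int version = 0;
        if (capacity == 28) {
            peerEpoch = zxid >> 32; // legacy format: derive the epoch from the zxid
        } else {
            peerEpoch = buf.getLong();
            if (capacity != 40) {
                version = buf.getInt(); // 40-byte messages are treated as carrying no version
            }
        }
        return new NotificationLayoutSketch(state, leader, zxid, electionEpoch, peerEpoch, version);
    }

    /** Builds a 28-byte legacy message (no peer epoch, no version). */
    static ByteBuffer legacy28(int state, long leader, long zxid, long electionEpoch) {
        ByteBuffer buf = ByteBuffer.allocate(28);
        buf.putInt(state).putLong(leader).putLong(zxid).putLong(electionEpoch);
        return buf;
    }

    /** Builds a 44-byte message: all fields plus an empty config length. */
    static ByteBuffer current(int state, long leader, long zxid, long electionEpoch,
                              long peerEpoch, int version) {
        ByteBuffer buf = ByteBuffer.allocate(44);
        buf.putInt(state).putLong(leader).putLong(zxid).putLong(electionEpoch)
           .putLong(peerEpoch).putInt(version).putInt(0);
        return buf;
    }
}
```

The message length alone selects the format, which is how the receiver stays compatible with older peers.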
Focus on this part of the code:
/*
* If this server is looking, then send proposed leader
*/
if(self.getPeerState() == QuorumPeer.ServerState.LOOKING){
//5-- put it on recvqueue to be processed later
recvqueue.offer(n);
/*
* Send a notification back if the peer that sent this
* message is also looking and its logical clock is
* lagging behind.
*/
//-- the sender is also LOOKING but its election epoch lags behind our logical clock, so send our current vote back to it
if((ackstate == QuorumPeer.ServerState.LOOKING)
&& (n.electionEpoch < logicalclock.get())){
Vote v = getVote();
QuorumVerifier qv = self.getQuorumVerifier();
ToSend notmsg = new ToSend(ToSend.mType.notification,
v.getId(),
v.getZxid(),
logicalclock.get(),
self.getPeerState(),
response.sid,
v.getPeerEpoch(),
qv.toString().getBytes());
sendqueue.offer(notmsg);
}
} else {
/*
* If this server is not looking, but the one that sent the ack
* is looking, then send back what it believes to be the leader.
*/
//-- this server is not LOOKING but the sender is, so reply with the leader this server currently believes in
Vote current = self.getCurrentVote();
if(ackstate == QuorumPeer.ServerState.LOOKING){
if(LOG.isDebugEnabled()){
LOG.debug("Sending new notification. My id ={} recipient={} zxid=0x{} leader={} config version = {}",
self.getId(),
response.sid,
Long.toHexString(current.getZxid()),
current.getId(),
Long.toHexString(self.getQuorumVerifier().getVersion()));
}
QuorumVerifier qv = self.getQuorumVerifier();
ToSend notmsg = new ToSend(
ToSend.mType.notification,
current.getId(),
current.getZxid(),
current.getElectionEpoch(),
self.getPeerState(),
response.sid,
current.getPeerEpoch(),
qv.toString().getBytes());
//-- queued on sendqueue here; WorkerSender later hands it to the per-peer queue in queueSendMap
sendqueue.offer(notmsg);
}
}
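The branch above boils down to a small decision table over this server's state, the sender's state and the two election epochs. A minimal sketch, with hypothetical names (`ReplyDecisionSketch`, `Action`, `decide`):

```java
// Decision table for an incoming notification from a valid voter; not ZooKeeper code.
class ReplyDecisionSketch {
    enum Action { ENQUEUE, ENQUEUE_AND_REPLY_WITH_VOTE, REPLY_WITH_CURRENT_LEADER, NONE }

    static Action decide(boolean selfLooking, boolean senderLooking,
                         long senderElectionEpoch, long myLogicalClock) {
        if (selfLooking) {
            // Always hand the notification to lookForLeader; additionally help
            // a sender whose election epoch lags behind by sending our vote back
            if (senderLooking && senderElectionEpoch < myLogicalClock) {
                return Action.ENQUEUE_AND_REPLY_WITH_VOTE;
            }
            return Action.ENQUEUE;
        }
        // Not looking: only a looking sender needs to be told who the leader is
        return senderLooking ? Action.REPLY_WITH_CURRENT_LEADER : Action.NONE;
    }
}
```

Note that a server which has already settled on a leader never enqueues the notification; it only answers servers that are still looking.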
Election logic
First, there is a class that represents a vote:
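public class Vote {
    // version of the vote message format
    final private int version;
    // server id of the proposed leader
    final private long id;
    // zxid (transaction id) of the proposed leader
    final private long zxid;
    // logical clock, used to tell whether votes belong to the same election round
    final private long electionEpoch;
    // epoch of the proposed leader
    final private long peerEpoch;
    // server state
    final private ServerState state;
}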
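Three of these fields decide which of two votes wins: peerEpoch first, then zxid, then the server id. The comparison can be sketched as below. This is a simplified illustration: `VoteOrderSketch` and `newVoteWins` are hypothetical names, and the sketch leaves out any check on voter weight that the real `totalOrderPredicate` may perform.

```java
// Ordering of two votes by (peerEpoch, zxid, server id); not ZooKeeper code.
class VoteOrderSketch {
    /** True if the new vote should replace the current one. */
    static boolean newVoteWins(long newId, long newZxid, long newEpoch,
                               long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch; // higher peer epoch wins
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;   // then the more up-to-date transaction log
        }
        return newId > curId;           // finally the larger server id breaks the tie
    }
}
```

Because the server id is unique, two different votes never compare as equal, so all servers converge on the same winner.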
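Core algorithm: FastLeaderElection#lookForLeader (QuorumPeer#run calls it both when read-only mode is enabled and when it is not; the call blocks until the election finishes)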
public Vote lookForLeader() throws InterruptedException {
try {
self.jmxLeaderElectionBean = new LeaderElectionBean();
MBeanRegistry.getInstance().register(
self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
} catch (Exception e) {
LOG.warn("Failed to register with JMX", e);
self.jmxLeaderElectionBean = null;
}
if (self.start_fle == 0) {
self.start_fle = Time.currentElapsedTime();
}
try {
//-- votes received in this election, keyed by the sender's sid
HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();
HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();
int notTimeout = finalizeWait;
synchronized(this){
//5-- advance the election round (logical clock)
logicalclock.incrementAndGet();
//5-- propose this server as leader; nothing is sent yet, the broadcast happens below
updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
}
LOG.info("New election. My id = " + self.getId() +
", proposed zxid=0x" + Long.toHexString(proposedZxid));
//-- put the vote on sendqueue to be sent to the other servers
//-- this is the initial vote
sendNotifications();
/*
* Loop in which we exchange notifications until we find a leader
*/
while ((self.getPeerState() == ServerState.LOOKING) &&
(!stop)){
/*
* Remove next notification from queue, times out after 2 times
* the termination time
*/
//5-- recvqueue is filled by Messenger; a notification may also be put back by the finalizeWait check further down
//5-- notTimeout is the poll timeout
Notification n = recvqueue.poll(notTimeout,
TimeUnit.MILLISECONDS);
/*
* Sends more notifications if haven't received enough.
* Otherwise processes new notification.
*/
if(n == null){
//5-- check whether the send queues in queueSendMap have been drained; if so, send the notifications again
if(manager.haveDelivered()){
sendNotifications();
} else {
//5-- otherwise reconnect
//5-- the keys of queueSendMap are the sids of the servers
manager.connectAll();
}
/*
* Exponential backoff
*/
//5-- no vote arrived before the timeout, so back off: the next poll uses a longer timeout
int tmpTimeOut = notTimeout*2;
notTimeout = (tmpTimeOut < maxNotificationInterval?
tmpTimeOut : maxNotificationInterval);
LOG.info("Notification time out: " + notTimeout);
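} //5-- proceed only if both the sender and the leader it proposes are valid voters
else if (validVoter(n.sid) && validVoter(n.leader)) {
//5-- branch on the state reported by the sender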
/*
* Only proceed if the vote comes from a replica in the current or next
* voting view for a replica in the current or next voting view.
*/
switch (n.state) {
case LOOKING:
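//5-- a vote from a newer election round (electionEpoch) takes precedence
// within the same round, totalOrderPredicate prefers the higher peerEpoch,
// then the higher zxid, then the higher server id
// If notification > current, replace and send messages out
//5-- the sender's election round is ahead of ours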
if (n.electionEpoch > logicalclock.get()) {
logicalclock.set(n.electionEpoch);
//5-- clear the set of received votes
recvset.clear();
//5-- compare peerEpoch, zxid and server id
//5-- order: 1) higher peerEpoch, 2) higher zxid, 3) higher server id (the proposed leader's id)
if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
//5-- adopt the winning side's leader id, zxid and peerEpoch
updateProposal(n.leader, n.zxid, n.peerEpoch);
} else {
updateProposal(getInitId(),
getInitLastLoggedZxid(),
getPeerEpoch());
}
//5-- tell the other servers
sendNotifications();
} else if (n.electionEpoch < logicalclock.get()) {//5-- the sender is in an older round; ignore its vote
if(LOG.isDebugEnabled()){
LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
+ Long.toHexString(n.electionEpoch)
+ ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
}
break;
} else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
proposedLeader, proposedZxid, proposedEpoch)) {//5-- same round: the same comparison, this time against the current proposal
updateProposal(n.leader, n.zxid, n.peerEpoch);
//5-- tell the other servers
sendNotifications();
}
if(LOG.isDebugEnabled()){
LOG.debug("Adding vote: from=" + n.sid +
", proposed leader=" + n.leader +
", proposed zxid=0x" + Long.toHexString(n.zxid) +
", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
}
// don't care about the version if it's in LOOKING state
recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
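//5-- check whether the current proposal (proposedLeader, proposedZxid, proposedEpoch) is backed by a quorum of the votes received
//-- termPredicate() relies on QuorumHierarchical.containsQuorum() or QuorumMaj.containsQuorum()
//5-- The test is simple: have more than half of the servers voted for the leader this server proposes? If so, to be safe,
// wait up to finalizeWait (200 ms by default) for a final confirmation. If a better vote arrives in that window,
// the Notification is put back on recvqueue and the election continues. Otherwise the election ends and this server
// sets its state to LEADING, FOLLOWING or OBSERVING, depending on whether it is the elected leader.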
if (termPredicate(recvset,
new Vote(proposedLeader, proposedZxid,
logicalclock.get(), proposedEpoch))) {
// Verify if there is any change in the proposed leader
//-- check whether the chosen candidate has been superseded
//-- note the finalizeWait timeout on this poll
while((n = recvqueue.poll(finalizeWait,
TimeUnit.MILLISECONDS)) != null){
if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
proposedLeader, proposedZxid, proposedEpoch)){
//-- a better vote arrived: put it back on recvqueue and go around the loop again
recvqueue.put(n);
break;
}
}
/*
* This predicate is true once we don't read any new
* relevant message from the reception queue
*/
//5-- nothing newer is left in the queue: the leader is elected; check whether it is this server
if (n == null) {
self.setPeerState((proposedLeader == self.getId()) ?
ServerState.LEADING: learningState());
Vote endVote = new Vote(proposedLeader,
proposedZxid, logicalclock.get(),
proposedEpoch);
leaveInstance(endVote);
return endVote;
}
}
break;
case OBSERVING:
LOG.debug("Notification from observer: " + n.sid);
break;
case FOLLOWING:
case LEADING:
/*
* Consider all notifications from the same epoch
* together.
*/
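//5-- the sender is in the same election round
if(n.electionEpoch == logicalclock.get()){
//5-- record the vote in recvset
recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
//5-- does the sender's leader have a quorum in recvset, and does that leader confirm it is leading?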
if(termPredicate(recvset, new Vote(n.version, n.leader,
n.zxid, n.electionEpoch, n.peerEpoch, n.state))
&& checkLeader(outofelection, n.leader, n.electionEpoch)) {
self.setPeerState((n.leader == self.getId()) ?
ServerState.LEADING: learningState());
Vote endVote = new Vote(n.leader,
n.zxid, n.electionEpoch, n.peerEpoch);
leaveInstance(endVote);
return endVote;
}
}
/*
* Before joining an established ensemble, verify that
* a majority are following the same leader.
*/
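//-- different election round, e.g. this server crashed and restarted: collect the other servers' votes in outofelection,
//-- and join once the sender's leader has a quorum there and that leader confirms it is leading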
outofelection.put(n.sid, new Vote(n.version, n.leader,
n.zxid, n.electionEpoch, n.peerEpoch, n.state));
//-- backed by a quorum
if (termPredicate(outofelection, new Vote(n.version, n.leader,
n.zxid, n.electionEpoch, n.peerEpoch, n.state))
//-- the leader itself confirms that it is leading
&& checkLeader(outofelection, n.leader, n.electionEpoch)) {
synchronized(this){
logicalclock.set(n.electionEpoch);
self.setPeerState((n.leader == self.getId()) ?
ServerState.LEADING: learningState());
}
Vote endVote = new Vote(n.leader, n.zxid,
n.electionEpoch, n.peerEpoch);
leaveInstance(endVote);
return endVote;
}
break;
default:
LOG.warn("Notification state unrecoginized: " + n.state
+ " (n.state), " + n.sid + " (n.sid)");
break;
}
} else {
if (!validVoter(n.leader)) {
LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid);
}
if (!validVoter(n.sid)) {
LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid);
}
}
}
return null;
} finally {
try {
if(self.jmxLeaderElectionBean != null){
MBeanRegistry.getInstance().unregister(
self.jmxLeaderElectionBean);
}
} catch (Exception e) {
LOG.warn("Failed to unregister with JMX", e);
}
self.jmxLeaderElectionBean = null;
LOG.debug("Number of connection processing threads: {}",
manager.getConnectionThreadCount());
}
}
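The timeout handling in the loop above is a capped exponential backoff: each empty poll doubles the wait until a ceiling is reached. A minimal sketch follows; `BackoffSketch` and `nextTimeout` are hypothetical names, the starting value of 200 ms is the finalizeWait default mentioned in the code comments, and the 60000 ms cap is an assumption standing in for maxNotificationInterval.

```java
// Capped exponential backoff for the notification poll timeout; not ZooKeeper code.
class BackoffSketch {
    static final int FINALIZE_WAIT = 200;               // initial timeout in ms
    static final int MAX_NOTIFICATION_INTERVAL = 60000; // assumed cap in ms

    /** Doubles the timeout, but never beyond the cap. */
    static int nextTimeout(int current) {
        int doubled = current * 2;
        return doubled < MAX_NOTIFICATION_INTERVAL ? doubled : MAX_NOTIFICATION_INTERVAL;
    }
}
```

Starting from 200 ms the sequence is 400, 800, 1600 and so on, then stays at the cap, so an isolated server retries less and less often instead of flooding the network.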
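The logic of lookForLeader is:
1: Advance the election round and propose itself as leader, sending that vote to the other servers.
2: Enter the voting loop for this round and stay in it until the state is no longer LOOKING.
For each vote received, the action depends on the sender's state:
- LOOKING: compare the two votes. A vote from a newer election round (electionEpoch) takes precedence; within the same round the comparison order is peerEpoch, then zxid, then server id, and the higher value wins at each step. If its own vote changes, the server broadcasts the new vote. It then checks whether its current proposal (proposedLeader, proposedZxid, proposedEpoch) is backed by a majority of the votes received. This check lives in FastLeaderElection#termPredicate and relies on one of two implementations, QuorumMaj.containsQuorum() or QuorumHierarchical.containsQuorum(). If the check passes, the server still waits briefly for a better vote; if one arrives, it is put back on recvqueue and the loop continues.
- OBSERVING: the sender is an observer; ignore it.
- FOLLOWING and LEADING: if the sender is in the same election round, the sender has already finished its election. The server records the vote, and once the leader named by the sender has a majority in recvset and that leader confirms it is leading (checkLeader), the server adopts it as its own leader. If the sender is in a different round, this server has probably crashed and restarted: it collects the other servers' votes in outofelection, and once the sender's leader has a majority there and confirms it is leading, the server updates its election round and sets its state to FOLLOWING or LEADING.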
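The majority check described in the steps above can be sketched as a tally keyed by sender sid, so that a server which votes twice is only counted once. This is a simplified illustration in the spirit of QuorumMaj: `VoteTallySketch` and its methods are hypothetical, and it compares only the proposed leader id rather than the full vote.

```java
import java.util.HashMap;
import java.util.Map;

// Simple-majority tally over received votes; not ZooKeeper code.
class VoteTallySketch {
    private final int ensembleSize;
    // sender sid -> id of the leader that sender currently votes for
    private final Map<Long, Long> recvset = new HashMap<>();

    VoteTallySketch(int ensembleSize) {
        this.ensembleSize = ensembleSize;
    }

    /** Records (or overwrites) the vote of server `sid`. */
    void put(long sid, long leader) {
        recvset.put(sid, leader);
    }

    /** True if strictly more than half of the ensemble votes for `proposedLeader`. */
    boolean hasQuorum(long proposedLeader) {
        int supporters = 0;
        for (long leader : recvset.values()) {
            if (leader == proposedLeader) {
                supporters++;
            }
        }
        return supporters > ensembleSize / 2;
    }
}
```

In a five-server ensemble, two votes for server 3 are not enough, while a third vote from a different sender reaches the quorum.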
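That is all for this article. The first part really should be read alongside the two diagrams, otherwise it is easy to get lost; the election logic in the second part is fairly easy to follow from the text alone.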