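Introduction
A ZooKeeper ensemble with several nodes has to elect one node as the Leader before it can serve requests. So what does ZooKeeper's leader election protocol look like? Let's take a closer look.
Election protocol
ZooKeeper divides the nodes of an ensemble into two types:
- participant: takes part in the election
- observer: does not take part in the election
Participant nodes take part in leader election. The election proceeds as follows:
- After startup, each node creates its own vote, which carries three main pieces of information:
id: the id of the node proposed as leader, initially the node's own id
zxid: the latest transaction id processed by this node
electionEpoch: the identifier of the election round
- Each node sends its current vote to the other participants.
- Each node receives votes (r_vote) from other servers and decides by the following rules whether to update its own vote: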
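1. Compare vote.zxid with r_vote.zxid. If vote.zxid < r_vote.zxid, update the current vote to r_vote.id (together with r_vote.zxid), which means this node now backs the node proposed by r_vote. If vote.zxid > r_vote.zxid, keep the current vote. If vote.zxid == r_vote.zxid, apply rule 2 below.
2. Compare vote.id with r_vote.id. If vote.id > r_vote.id, keep the current vote; if vote.id < r_vote.id, update the current vote.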
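- Update the vote information.
- Check whether any node has received more than half of the votes. If so, that node is elected leader.
- If no node has received more than half of the votes, repeat from step 2.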
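The comparison in the rules above can be sketched as follows. This is a minimal illustration, not ZooKeeper's own class: `VoteOrder`, `SimpleVote`, and `supersedes` are hypothetical names, and the sketch only covers the zxid and id comparison described here (the real totalOrderPredicate call shown later also receives a peerEpoch argument).

```java
/** Hypothetical helper for illustration; not a ZooKeeper class. */
final class VoteOrder {

    /** Simplified vote: the proposed leader's id and the zxid attached to that proposal. */
    static final class SimpleVote {
        final long id;
        final long zxid;

        SimpleVote(long id, long zxid) {
            this.id = id;
            this.zxid = zxid;
        }
    }

    /** Returns true when the received vote should replace the current one. */
    static boolean supersedes(SimpleVote received, SimpleVote current) {
        if (received.zxid != current.zxid) {
            return received.zxid > current.zxid; // rule 1: the larger zxid wins
        }
        return received.id > current.id;         // rule 2: on a tie, the larger id wins
    }
}
```

A node would call `supersedes(rVote, vote)` for every incoming vote and adopt the received proposal only when it returns true.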
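Tips
Note that every node carries an electionEpoch when it starts an election, and within one round all nodes should have the same electionEpoch. If a node's electionEpoch is smaller than that of the other nodes, the node has fallen behind. It then has to clear the votes it has collected, update its electionEpoch, and join the new round.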
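A minimal sketch of this round bookkeeping, under the assumption that a node keeps one counter and one map of received votes. `ElectionRound` and its methods are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of per-round bookkeeping; not a ZooKeeper class. */
final class ElectionRound {
    private long logicalClock;
    // voter sid -> leader id proposed by that voter in the current round
    private final Map<Long, Long> received = new HashMap<>();

    ElectionRound(long initialClock) {
        this.logicalClock = initialClock;
    }

    /** Returns false when the notification belongs to an older round and is dropped. */
    boolean onNotification(long voterSid, long proposedLeader, long electionEpoch) {
        if (electionEpoch < logicalClock) {
            return false;                 // the sender is behind: ignore its vote
        }
        if (electionEpoch > logicalClock) {
            logicalClock = electionEpoch; // we are behind: adopt the newer round
            received.clear();             // votes from the old round no longer count
        }
        received.put(voterSid, proposedLeader);
        return true;
    }

    long logicalClock() {
        return logicalClock;
    }

    int receivedCount() {
        return received.size();
    }
}
```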
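Threads involved in the election
- WorkerSender
Takes this node's outgoing votes from the send queue and hands them over for delivery to other servers (no network I/O happens here; the votes are only moved into the pending per-peer queues)
- WorkerReceiver
Takes the votes sent by other servers from the receive queue (no network I/O happens here either; it only consumes the queue that holds the received votes)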
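Every participant establishes a network connection to every other participant.
- SendWorker
Each connection has a SendWorker that sends votes over the network to the peer node
- ReceiveWorker (named RecvWorker in the source code)
Each connection has a ReceiveWorker that reads the votes sent by the peer node over the network
- ListenerHandler
The thread on each node that handles connection requests from other nodes
- QuorumPeer
Updates its vote based on the votes received from other nodes and checks whether a leader has been elected. If a leader has been elected, it leaves the election and enters the data recovery phase; otherwise the election continues
The flow chart below shows how these threads interact.
With this groundwork in place, let's start analyzing the source code of ZooKeeper leader election.
Node startup entry point
QuorumPeerMain is the startup entry class of every server node.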
initializeAndRun
is the entry method. It mainly does three things:
- Parses zoo.cfg into the properties of QuorumPeerConfig
- Starts DatadirCleanupManager to periodically purge expired snapshot files
- Starts the node via runFromConfig
runFromConfig
This method is long, so I have annotated the main points.
public void runFromConfig(QuorumPeerConfig config) throws IOException, AdminServerException {
// A large chunk is omitted above; it does not affect understanding
if (config.getClientPortAddress() != null) {
// Get the server-side I/O factory; the default is NIOServerCnxnFactory
cnxnFactory = ServerCnxnFactory.createFactory();
// Configure ServerCnxnFactory: port, maximum number of client connections, creation of SelectorThread, ExpiredThread, etc.
cnxnFactory.configure(config.getClientPortAddress(), config.getMaxClientCnxns(), config.getClientPortListenBacklog(), false);
}
if (config.getSecureClientPortAddress() != null) {
secureCnxnFactory = ServerCnxnFactory.createFactory();
secureCnxnFactory.configure(config.getSecureClientPortAddress(), config.getMaxClientCnxns(), config.getClientPortListenBacklog(), true);
}
// QuorumPeer represents this server node; everything that follows revolves around it
quorumPeer = getQuorumPeer();
// Set the accessor for data and log files
quorumPeer.setTxnFactory(new FileTxnSnapLog(config.getDataLogDir(), config.getDataDir()));
quorumPeer.enableLocalSessions(config.areLocalSessionsEnabled());
quorumPeer.enableLocalSessionsUpgrading(config.isLocalSessionsUpgradingEnabled());
//quorumPeer.setQuorumPeers(config.getAllMembers());
// Set the leader election algorithm; currently there is only one: FastLeaderElection
quorumPeer.setElectionType(config.getElectionAlg());
// Set this node's sid
quorumPeer.setMyid(config.getServerId());
quorumPeer.setTickTime(config.getTickTime());
quorumPeer.setMinSessionTimeout(config.getMinSessionTimeout());
quorumPeer.setMaxSessionTimeout(config.getMaxSessionTimeout());
quorumPeer.setInitLimit(config.getInitLimit());
quorumPeer.setSyncLimit(config.getSyncLimit());
quorumPeer.setConnectToLearnerMasterLimit(config.getConnectToLearnerMasterLimit());
quorumPeer.setObserverMasterPort(config.getObserverMasterPort());
quorumPeer.setConfigFileName(config.getConfigFilename());
quorumPeer.setClientPortListenBacklog(config.getClientPortListenBacklog());
// Set ZooKeeper's database; note that no data has been recovered at this point
quorumPeer.setZKDatabase(new ZKDatabase(quorumPeer.getTxnFactory()));
quorumPeer.setQuorumVerifier(config.getQuorumVerifier(), false);
if (config.getLastSeenQuorumVerifier() != null) {
quorumPeer.setLastSeenQuorumVerifier(config.getLastSeenQuorumVerifier(), false);
}
quorumPeer.initConfigInZKDatabase();
quorumPeer.setCnxnFactory(cnxnFactory);
quorumPeer.setSecureCnxnFactory(secureCnxnFactory);
quorumPeer.setSslQuorum(config.isSslQuorum());
quorumPeer.setUsePortUnification(config.shouldUsePortUnification());
// Set the node type: participant or observer
quorumPeer.setLearnerType(config.getPeerType());
quorumPeer.setSyncEnabled(config.getSyncEnabled());
// A large chunk of code is omitted here
// Initialize quorumPeer; this mainly creates the helper classes for authentication
quorumPeer.initialize();
if (config.jvmPauseMonitorToRun) {
quorumPeer.setJvmPauseMonitor(new JvmPauseMonitor(config));
}
// Start quorumPeer
quorumPeer.start();
ZKAuditProvider.addZKStartStopAuditLog();
quorumPeer.join();
} catch (InterruptedException e) {
// warn, but generally this is ok
LOG.warn("Quorum Peer interrupted", e);
} finally {
if (metricsProvider != null) {
try {
metricsProvider.stop();
} catch (Throwable error) {
LOG.warn("Error while stopping metrics", error);
}
}
}
}
QuorumPeer.start()
This is where QuorumPeer starts.
public synchronized void start() {
// Check that this node is included in the server list from the configuration file
if (!getView().containsKey(myid)) {
throw new RuntimeException("My id " + myid + " not in the peer list");
}
// Recover the node's data; see https://www.jianshu.com/p/f10ffc0ff861
loadDataBase();
// Start SelectorThread and AcceptThread to get ready for client requests; see https://www.jianshu.com/p/8153a113fdf7
startServerCnxnFactory();
// try {
// adminServer.start();
// } catch (AdminServerException e) {
// LOG.warn("Problem starting AdminServer", e);
// System.out.println(e);
// }
// Start the leader election process
startLeaderElection();
startJvmPauseMonitor();
// QuorumPeer is itself a thread; start it now
super.start();
}
startLeaderElection
startLeaderElection creates the threads needed during leader election.
public synchronized void startLeaderElection() {
try {
if (getPeerState() == ServerState.LOOKING) {
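// Set the current vote, which carries three things: the proposed leader id, this node's latest transaction id (zxid), and this node's current epoch (getCurrentEpoch()).
// Every node proposes itself as Leader at startup. Rather shameless, isn't it?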
currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
}
} catch (IOException e) {
RuntimeException re = new RuntimeException(e.getMessage());
re.setStackTrace(e.getStackTrace());
throw re;
}
// Create the election algorithm
this.electionAlg = createElectionAlgorithm(electionType);
}
createElectionAlgorithm
Let's go straight to the source.
protected Election createElectionAlgorithm(int electionAlgorithm) {
Election le = null;
//TODO: use a factory rather than a switch
switch (electionAlgorithm) {
case 1:
throw new UnsupportedOperationException("Election Algorithm 1 is not supported.");
case 2:
throw new UnsupportedOperationException("Election Algorithm 2 is not supported.");
// ZooKeeper currently supports only one election algorithm
case 3:
// QuorumCnxManager is the class QuorumPeer uses to manage socket connections to other nodes
QuorumCnxManager qcm = createCnxnManager();
// Use qcmRef to check whether an old QuorumCnxManager already exists; if so, shut it down
QuorumCnxManager oldQcm = qcmRef.getAndSet(qcm);
if (oldQcm != null) {
LOG.warn("Clobbering already-set QuorumCnxManager (restarting leader election?)");
oldQcm.halt();
}
// Listener is the class that manages the ListenerHandlers
QuorumCnxManager.Listener listener = qcm.listener;
if (listener != null) {
// Start the listener, which in turn starts each ListenerHandler
listener.start();
// Create FastLeaderElection
FastLeaderElection fle = new FastLeaderElection(this, qcm);
// start() launches WorkerSender and WorkerReceiver
fle.start();
le = fle;
} else {
LOG.error("Null listener when initializing cnx manager");
}
break;
default:
assert false;
}
return le;
}
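Tips
- Why does QuorumCnxManager use a Listener to manage ListenerHandlers?
A server node may have several network interfaces and may open listening ports on the IP addresses of different interfaces. In that case one QuorumCnxManager can contain several ListenerHandlers, so a single Listener is used to manage them.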
FastLeaderElection
What happens when FastLeaderElection is created
1. The queues QuorumPeer uses to send and receive messages, sendqueue and recvqueue, are created
private void starter(QuorumPeer self, QuorumCnxManager manager) {
this.self = self;
proposedLeader = -1;
proposedZxid = -1;
// Create the sendqueue and recvqueue objects
sendqueue = new LinkedBlockingQueue<ToSend>();
recvqueue = new LinkedBlockingQueue<Notification>();
// Create the Messenger that manages WorkerSender and WorkerReceiver
this.messenger = new Messenger(manager);
}
2. The Messenger is created. Inside Messenger, a WorkerSender and a WorkerReceiver are created to handle the data in sendqueue and recvqueue
Messenger(QuorumCnxManager manager) {
this.ws = new WorkerSender(manager);
this.wsThread = new Thread(this.ws, "WorkerSender[myid=" + self.getId() + "]");
this.wsThread.setDaemon(true);
this.wr = new WorkerReceiver(manager);
this.wrThread = new Thread(this.wr, "WorkerReceiver[myid=" + self.getId() + "]");
this.wrThread.setDaemon(true);
}
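The pattern used by WorkerSender and WorkerReceiver, a daemon thread that polls a blocking queue with a timeout and hands each item to a handler, can be sketched like this. `QueueWorker` and `roundTrip` are hypothetical names; the 100 ms poll interval is an arbitrary value chosen for the sketch.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch of a queue-draining worker thread; not a ZooKeeper class. */
final class QueueWorker<T> implements Runnable {
    private final BlockingQueue<T> queue;
    private final Consumer<T> handler;
    private volatile boolean stop;

    QueueWorker(BlockingQueue<T> queue, Consumer<T> handler) {
        this.queue = queue;
        this.handler = handler;
    }

    void halt() {
        stop = true;
    }

    @Override
    public void run() {
        while (!stop) {
            try {
                // A timed poll lets the loop notice the stop flag even when the queue is idle.
                T item = queue.poll(100, TimeUnit.MILLISECONDS);
                if (item != null) {
                    handler.accept(item);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
    }

    /** Demo: pushes one message through a daemon worker and returns what the handler saw (null on timeout). */
    static String roundTrip(String message) {
        BlockingQueue<String> in = new LinkedBlockingQueue<>();
        BlockingQueue<String> out = new LinkedBlockingQueue<>();
        QueueWorker<String> worker = new QueueWorker<>(in, out::add);
        Thread t = new Thread(worker, "QueueWorker-sketch");
        t.setDaemon(true);
        t.start();
        in.add(message);
        try {
            return out.poll(2, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } finally {
            worker.halt();
        }
    }
}
```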
QuorumPeer.run
Let's look at the logic that the QuorumPeer thread runs.
try {
/*
* Main loop
*/
while (running) {
switch (getPeerState()) {
// Leader election logic
case LOOKING:
LOG.info("LOOKING");
// omitted.....
try {
reconfigFlagClear();
if (shuttingDownLE) {
shuttingDownLE = false;
startLeaderElection();
}
// QuorumPeer enters the election logic
setCurrentVote(makeLEStrategy().lookForLeader());
} catch (Exception e) {
LOG.warn("Unexpected exception", e);
setPeerState(ServerState.LOOKING);
}
break;
// Logic for acting as an observer
case OBSERVING:
// omitted............
break;
// Logic for acting as a follower
case FOLLOWING:
// omitted............
break;
// Logic for acting as the Leader
case LEADING:
// omitted............
break;
}
}
} finally {
// ignore this part of the code
}
}
FastLeaderElection.lookForLeader
The election itself happens in lookForLeader. The method is long, roughly 200 lines, so I have removed some of the less important code.
public Vote lookForLeader() throws InterruptedException {
// Some JMX-related code was removed here
self.start_fle = Time.currentElapsedTime();
try {
/*
* The votes from the current leader election are stored in recvset. In other words, a vote v is in recvset
* if v.electionEpoch == logicalclock. The current participant uses recvset to deduce on whether a majority
* of participants has voted for it.
*/
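// As the English comment above explains, recvset holds the votes received from each server:
// the key is the server's sid and the value is the vote that server proposes. recvset is used to decide whether a leader has been elected
Map<Long, Vote> recvset = new HashMap<Long, Vote>();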
/*
* The votes from previous leader elections, as well as the votes from the current leader election are
* stored in outofelection. Note that notifications in a LOOKING state are not stored in outofelection.
* Only FOLLOWING or LEADING notifications are stored in outofelection. The current participant could use
* outofelection to learn which participant is the leader if it arrives late (i.e., higher logicalclock than
* the electionEpoch of the received notifications) in a leader election.
*/
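// As the English comment above explains, outofelection holds votes from peers that are already FOLLOWING or LEADING; a node that arrives late uses it to learn who the Leader is
Map<Long, Vote> outofelection = new HashMap<Long, Vote>();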
int notTimeout = minNotificationInterval;
synchronized (this) {
// logicalclock identifies the election round
logicalclock.incrementAndGet();
// Update the Leader proposed by this node (proposedLeader, proposedZxid, proposedEpoch)
updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
}
LOG.info(
"New election. My id = {}, proposed zxid=0x{}",
self.getId(),
Long.toHexString(proposedZxid));
// Send our proposal to the other servers
sendNotifications();
SyncedLearnerTracker voteSet;
/*
* Loop in which we exchange notifications until we find a leader
*/
while ((self.getPeerState() == ServerState.LOOKING) && (!stop)) {
/*
* Remove next notification from queue, times out after 2 times
* the termination time
*/
// Take a vote sent by another server (or by ourselves) from recvqueue
Notification n = recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS);
/*
* Sends more notifications if haven't received enough.
* Otherwise processes new notification.
*/
if (n == null) {
// Nothing was received from recvqueue.
// If QuorumCnxManager already has socket connections to the other server nodes, send the notifications again
if (manager.haveDelivered()) {
sendNotifications();
} else {
// Otherwise QuorumPeer has QuorumCnxManager connect to the other server nodes
manager.connectAll();
}
/*
* Exponential backoff
*/
// Update notTimeout
int tmpTimeOut = notTimeout * 2;
notTimeout = Math.min(tmpTimeOut, maxNotificationInterval);
LOG.info("Notification time out: {}", notTimeout);
} else if (validVoter(n.sid) && validVoter(n.leader)) {
/*
* Only proceed if the vote comes from a replica in the current or next
* voting view for a replica in the current or next voting view.
*/
switch (n.state) {
case LOOKING:
if (getInitLastLoggedZxid() == -1) {
LOG.debug("Ignoring notification as our zxid is -1");
break;
}
if (n.zxid == -1) {
LOG.debug("Ignoring notification from member with -1 zxid {}", n.sid);
break;
}
// If notification > current, replace and send messages out
if (n.electionEpoch > logicalclock.get()) {
// The received vote belongs to a round newer than logicalclock: update logicalclock and
// clear the votes received so far
logicalclock.set(n.electionEpoch);
recvset.clear();
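// totalOrderPredicate compares the received vote with this node's vote in the way described at the start of the article, comparing zxid and then id (its arguments also include peerEpoch).
// It decides whether this node's vote must be updated; after an update the new proposal is sent to the other server nodes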
if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
updateProposal(n.leader, n.zxid, n.peerEpoch);
} else {
updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
}
sendNotifications();
} else if (n.electionEpoch < logicalclock.get()) {
// The received vote's electionEpoch is smaller than this node's election round, so the vote is simply discarded
LOG.debug(
"Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x{}, logicalclock=0x{}",
Long.toHexString(n.electionEpoch),
Long.toHexString(logicalclock.get()));
break;
} else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
// Same as the totalOrderPredicate analysis above
updateProposal(n.leader, n.zxid, n.peerEpoch);
sendNotifications();
}
LOG.debug(
"Adding vote: from={}, proposed leader={}, proposed zxid=0x{}, proposed election epoch=0x{}",
n.sid,
n.leader,
Long.toHexString(n.zxid),
Long.toHexString(n.electionEpoch));
// don't care about the version if it's in LOOKING state
// Add the received vote to recvset
recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
// Build a vote tracker from recvset and this node's vote;
// the tracker tells whether this node's vote is backed by more than half of the nodes
voteSet = getVoteTracker(recvset, new Vote(proposedLeader,proposedZxid , logicalclock.get(), proposedEpoch));
if (voteSet.hasAllQuorums()) {
// Even if this node's vote is backed by a majority of participants, wait up to finalizeWait ms on recvqueue to see whether a newly arrived vote would replace it
// Verify if there is any change in the proposed leader
while ((n = recvqueue.poll(finalizeWait, TimeUnit.MILLISECONDS)) != null) {
if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
recvqueue.put(n);
break;
}
}
/*
* This predicate is true once we don't read any new
* relevant message from the reception queue
*/
if (n == null) {
// Nothing arrived within finalizeWait, so everyone accepts the node proposed by our vote as Leader.
// Set the QuorumPeer state based on proposedLeader and this node's sid:
// if proposedLeader == sid this node becomes LEADING; otherwise a participant becomes FOLLOWING and an observer becomes OBSERVING
setPeerState(proposedLeader, voteSet);
// Build the final vote that describes the Leader
Vote endVote = new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch);
// leaveInstance clears recvqueue, marking the end of this election round
leaveInstance(endVote);
return endVote;
}
}
break;
case OBSERVING:
// A vote whose state is OBSERVING is ignored
LOG.debug("Notification from observer: {}", n.sid);
break;
case FOLLOWING:
case LEADING:
/*
* Consider all notifications from the same epoch
* together.
*/
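// One question here: when does a node receive a vote whose state is FOLLOWING or LEADING?
// In other words, when the ensemble has already elected a Leader.
// For example, when a new node joins the ensemble, it receives votes from the other server nodes in FOLLOWING or LEADING state; this is related to how WorkerReceiver works.
// The received vote's state is LEADING or FOLLOWING: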
if (n.electionEpoch == logicalclock.get()) {
// Same election round: add the vote to recvset
recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
voteSet = getVoteTracker(recvset, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
// Use voteSet to check whether a majority of participants back vote.leader, and also require that the Leader itself has sent its vote to this node
if (voteSet.hasAllQuorums() && checkLeader(recvset, n.leader, n.electionEpoch)) {
setPeerState(n.leader, voteSet);
Vote endVote = new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch);
leaveInstance(endVote);
return endVote;
}
}
/*
* Before joining an established ensemble, verify that
* a majority are following the same leader.
*
* Note that the outofelection map also stores votes from the current leader election.
* See ZOOKEEPER-1732 for more information.
*/
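// The vote is also recorded in outofelection (this covers votes from other rounds as well as same-round votes that did not settle the election above); the Leader is then looked up through outofelection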
outofelection.put(n.sid, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
voteSet = getVoteTracker(outofelection, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
if (voteSet.hasAllQuorums() && checkLeader(outofelection, n.leader, n.electionEpoch)) {
synchronized (this) {
logicalclock.set(n.electionEpoch);
setPeerState(n.leader, voteSet);
}
Vote endVote = new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch);
leaveInstance(endVote);
return endVote;
}
break;
default:
LOG.warn("Notification state unrecoginized: {} (n.state), {}(n.sid)", n.state, n.sid);
break;
}
} else {
if (!validVoter(n.leader)) {
LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid);
}
if (!validVoter(n.sid)) {
LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid);
}
}
}
return null;
} finally {
try {
if (self.jmxLeaderElectionBean != null) {
MBeanRegistry.getInstance().unregister(self.jmxLeaderElectionBean);
}
} catch (Exception e) {
LOG.warn("Failed to unregister with JMX", e);
}
self.jmxLeaderElectionBean = null;
LOG.debug("Number of connection processing threads: {}", manager.getConnectionThreadCount());
}
}
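The timeout handling in the loop above doubles notTimeout after every empty poll and caps it at maxNotificationInterval. A minimal sketch of that exponential backoff, with a hypothetical `NotificationBackoff` class and caller-supplied interval values:

```java
/** Hypothetical sketch of a capped exponential backoff; not a ZooKeeper class. */
final class NotificationBackoff {
    private final int maxIntervalMs;
    private int currentMs;

    NotificationBackoff(int minIntervalMs, int maxIntervalMs) {
        this.currentMs = minIntervalMs;
        this.maxIntervalMs = maxIntervalMs;
    }

    /** The timeout to use for the next poll. */
    int current() {
        return currentMs;
    }

    /** Called after a poll timed out: doubles the wait, capped at the maximum, and returns it. */
    int onTimeout() {
        currentMs = Math.min(currentMs * 2, maxIntervalMs);
        return currentMs;
    }
}
```

With a minimum of 200 ms and a maximum of 1000 ms, successive timeouts would yield 400, 800, 1000, 1000.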
The logic of the code above can be described by the following diagram.
That is how the QuorumPeer thread runs the election.
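The "more than half" test that the vote tracker performs can be sketched as below. This is a simplification: it assumes a plain majority over a fixed voter set, whereas the real code delegates the decision to hasAllQuorums(). `MajorityCheck` and `hasMajority` are hypothetical names.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a simple-majority check; not a ZooKeeper class. */
final class MajorityCheck {

    /**
     * @param votes     voter sid -> leader id that voter currently proposes
     * @param voters    sids that are allowed to vote
     * @param candidate the leader id being checked
     */
    static boolean hasMajority(Map<Long, Long> votes, Set<Long> voters, long candidate) {
        long supporters = votes.entrySet().stream()
                .filter(e -> voters.contains(e.getKey()) && e.getValue() == candidate)
                .count();
        return supporters > voters.size() / 2; // strictly more than half
    }
}
```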
Next, let's look at some of the details, which tie in with the other threads mentioned earlier.
sendNotifications
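Whenever a server node has just started, or updates its proposal after receiving an r_vote from another node, it calls sendNotifications to send its proposed Leader to the other participants. Let's look at the source of sendNotifications.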
private void sendNotifications() {
for (long sid : self.getCurrentAndNextConfigVoters()) {
QuorumVerifier qv = self.getQuorumVerifier();
// Wrap the proposed Leader information in a ToSend object and add it to sendqueue
ToSend notmsg = new ToSend(
ToSend.mType.notification,
proposedLeader,
proposedZxid,
logicalclock.get(),
QuorumPeer.ServerState.LOOKING,
sid,
proposedEpoch,
qv.toString().getBytes());
LOG.debug(
"Sending Notification: {} (n.leader), 0x{} (n.zxid), 0x{} (n.round), {} (recipient),"
+ " {} (myid), 0x{} (n.peerEpoch) ",
proposedLeader,
Long.toHexString(proposedZxid),
Long.toHexString(logicalclock.get()),
sid,
self.getId(),
Long.toHexString(proposedEpoch));
sendqueue.offer(notmsg);
}
}
WorkerSender.run
Let's look at the run method of WorkerSender, the thread that consumes sendqueue.
public void run() {
while (!stop) {
try {
// Take a ToSend message from sendqueue and hand it to process
ToSend m = sendqueue.poll(3000, TimeUnit.MILLISECONDS);
if (m == null) {
continue;
}
process(m);
} catch (InterruptedException e) {
break;
}
}
LOG.info("WorkerSender is down");
}
WorkerSender.process
void process(ToSend m) {
// Convert the ToSend into a ByteBuffer
ByteBuffer requestBuffer = buildMsg(m.state.ordinal(), m.leader, m.zxid, m.electionEpoch, m.peerEpoch, m.configData);
// Send requestBuffer to the target participant through QuorumCnxManager
manager.toSend(m.sid, requestBuffer);
}
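To make the wire format concrete, here is a sketch of an encoder and decoder whose field order follows what WorkerReceiver reads later in this article (state, leader, zxid, electionEpoch, peerEpoch, version, config length, config bytes). `NotificationCodec` is a hypothetical class and is not the real buildMsg implementation; the version value passed in is up to the caller.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical codec for illustration; not the real buildMsg. */
final class NotificationCodec {

    static final class Decoded {
        int state;
        long leader;
        long zxid;
        long electionEpoch;
        long peerEpoch;
        int version;
        String config;
    }

    static ByteBuffer encode(int state, long leader, long zxid, long electionEpoch,
                             long peerEpoch, int version, byte[] config) {
        // 4 (state) + 4 * 8 (four longs) + 4 (version) + 4 (config length) = 44 header bytes
        ByteBuffer buf = ByteBuffer.allocate(44 + config.length);
        buf.putInt(state);
        buf.putLong(leader);
        buf.putLong(zxid);
        buf.putLong(electionEpoch);
        buf.putLong(peerEpoch);
        buf.putInt(version);
        buf.putInt(config.length);
        buf.put(config);
        buf.flip(); // switch from writing to reading
        return buf;
    }

    static Decoded decode(ByteBuffer buf) {
        ByteBuffer in = buf.duplicate(); // leave the caller's position untouched
        Decoded d = new Decoded();
        d.state = in.getInt();
        d.leader = in.getLong();
        d.zxid = in.getLong();
        d.electionEpoch = in.getLong();
        d.peerEpoch = in.getLong();
        d.version = in.getInt();
        byte[] cfg = new byte[in.getInt()];
        in.get(cfg);
        d.config = new String(cfg, StandardCharsets.UTF_8);
        return d;
    }
}
```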
QuorumCnxManager.toSend
Let's see what happens in toSend, the method of the participant's connection manager.
public void toSend(Long sid, ByteBuffer b) {
/*
* If sending message to myself, then simply enqueue it (loopback).
*/
// A vote addressed to ourselves goes straight into recvQueue
if (this.mySid == sid) {
b.position(0);
addToRecvQueue(new Message(b.duplicate(), sid));
/*
* Otherwise send to the corresponding thread to send.
*/
} else {
/*
* Start a new connection if doesn't have one already.
*/
// queueSendMap is a ConcurrentHashMap that holds the messages this node sends to each of the other nodes
BlockingQueue<ByteBuffer> bq = queueSendMap.computeIfAbsent(sid, serverId -> new CircularBlockingQueue<>(SEND_CAPACITY));
// Store the vote destined for node sid in its blocking queue
addToSendQueue(bq, b);
// Establish the socket connection from this node to node sid
connectOne(sid);
}
}
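The routing in toSend, loopback for our own sid and a bounded per-peer queue otherwise, can be sketched as follows. `PeerOutbox` is a hypothetical class, messages are plain strings for brevity, and dropping the oldest entry when a queue is full is an assumption of this sketch rather than something the code above shows.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch of per-peer send queues; not a ZooKeeper class. */
final class PeerOutbox {
    private final long mySid;
    private final int capacity;
    final BlockingQueue<String> recvQueue = new LinkedBlockingQueue<>();
    final ConcurrentMap<Long, BlockingQueue<String>> sendQueues = new ConcurrentHashMap<>();

    PeerOutbox(long mySid, int capacity) {
        this.mySid = mySid;
        this.capacity = capacity;
    }

    void toSend(long sid, String message) {
        if (sid == mySid) {
            recvQueue.add(message); // loopback: our own vote never touches the network
            return;
        }
        // Create the queue for this peer on first use.
        BlockingQueue<String> q =
                sendQueues.computeIfAbsent(sid, s -> new ArrayBlockingQueue<>(capacity));
        while (!q.offer(message)) {
            q.poll(); // queue full: drop the oldest message (assumption of this sketch)
        }
    }
}
```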
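Tips
Before explaining connectOne, let's go over the network topology between ZooKeeper's voting nodes.
The figure below sketches the connections established among three nodes.
Every node connects to every other node. ZooKeeper handles the outgoing and incoming votes on a connection with SendWorker and ReceiveWorker respectively; both are thread classes. Because a connection is needed between any two nodes, ZooKeeper imposes the following constraint so that these connections are set up efficiently, reliably, and without duplicates:
Only the machine with the larger sid may initiate a connection to the machine with the smaller sid. For example, take two machines with sid 1 and sid 2:
If the server with sid=1 initiates a socket connection to the server with sid=2, the connection will not be kept: once the underlying socket is established, ZooKeeper compares its own sid with the remote server's sid and closes the socket if its own sid is smaller. If the server with sid=2 connects to the server with sid=1, the connection succeeds.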
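The constraint can be sketched as a small predicate plus a helper that lists which initiator-to-acceptor connections survive for a set of sids. `ConnectionRule` and its methods are hypothetical names; the predicate mirrors the `sid > self.getId()` check that appears in startConnection.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the sid-based connection rule; not a ZooKeeper class. */
final class ConnectionRule {

    /** The initiator closes the socket when the remote sid is larger than its own. */
    static boolean initiatorKeeps(long mySid, long remoteSid) {
        return !(remoteSid > mySid);
    }

    /** Lists the "initiator->acceptor" pairs that survive among the given sids. */
    static List<String> survivingConnections(long[] sids) {
        List<String> kept = new ArrayList<>();
        for (long from : sids) {
            for (long to : sids) {
                if (from != to && initiatorKeeps(from, to)) {
                    kept.add(from + "->" + to);
                }
            }
        }
        return kept;
    }
}
```

For three nodes this leaves exactly one connection per pair, each opened by the node with the larger sid.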
QuorumCnxManager.connectOne
connectOne builds the connections shown in the topology described above.
synchronized void connectOne(long sid) {
// senderWorkerMap stores the SendWorker for each sid
if (senderWorkerMap.get(sid) != null) {
// If a SendWorker for sid already exists, return right away (after an optional multi-address reachability check)
LOG.debug("There is a connection already for server {}", sid);
if (self.isMultiAddressEnabled() && self.isMultiAddressReachabilityCheckEnabled()) {
// since ZOOKEEPER-3188 we can use multiple election addresses to reach a server. It is possible, that the
// one we are using is already dead and we need to clean-up, so when we will create a new connection
// then we will choose an other one, which is actually reachable
senderWorkerMap.get(sid).asyncValidateIfSocketIsStillReachable();
}
return;
}
synchronized (self.QV_LOCK) {
boolean knownId = false;
// Resolve hostname for the remote server before attempting to
// connect in case the underlying ip address has changed.
self.recreateSocketAddresses(sid);
Map<Long, QuorumPeer.QuorumServer> lastCommittedView = self.getView();
QuorumVerifier lastSeenQV = self.getLastSeenQuorumVerifier();
Map<Long, QuorumPeer.QuorumServer> lastProposedView = lastSeenQV.getAllMembers();
if (lastCommittedView.containsKey(sid)) {
knownId = true;
LOG.debug("Server {} knows {} already, it is in the lastCommittedView", self.getId(), sid);
// If there is no socket connection to server sid yet, establish one through connectOne
if (connectOne(sid, lastCommittedView.get(sid).electionAddr)) {
return;
}
}
if (lastSeenQV != null
&& lastProposedView.containsKey(sid)
&& (!knownId
|| (lastProposedView.get(sid).electionAddr != lastCommittedView.get(sid).electionAddr))) {
knownId = true;
LOG.debug("Server {} knows {} already, it is in the lastProposedView", self.getId(), sid);
if (connectOne(sid, lastProposedView.get(sid).electionAddr)) {
return;
}
}
if (!knownId) {
LOG.warn("Invalid server id: {} ", sid);
}
}
}
The connectOne(sid, address) call above goes on to call initiateConnectionAsync().
QuorumCnxManager.initiateConnectionAsync
initiateConnectionAsync wraps the connection task in a QuorumConnectionReqThread and runs it asynchronously.
public boolean initiateConnectionAsync(final MultipleAddresses electionAddr, final Long sid) {
if (!inprogressConnections.add(sid)) {
// simply return as there is a connection request to
// server 'sid' already in progress.
LOG.debug("Connection request to server id: {} is already in progress, so skipping this request", sid);
return true;
}
try {
connectionExecutor.execute(new QuorumConnectionReqThread(electionAddr, sid));
connectionThreadCnt.incrementAndGet();
} catch (Throwable e) {
// Imp: Safer side catching all type of exceptions and remove 'sid'
// from inprogress connections. This is to avoid blocking further
// connection requests from this 'sid' in case of errors.
inprogressConnections.remove(sid);
LOG.error("Exception while submitting quorum connection request", e);
return false;
}
return true;
}
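The idea of "at most one pending connection attempt per sid" can be sketched with a concurrent set and an Executor. `ConnectionScheduler` is a hypothetical class. Unlike the method above, this sketch returns false when a request is already pending, and it clears the in-progress marker when the task finishes, which is an assumption about where that cleanup happens.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;

/** Hypothetical sketch of deduplicated asynchronous connection attempts; not a ZooKeeper class. */
final class ConnectionScheduler {
    private final Set<Long> inProgress = ConcurrentHashMap.newKeySet();
    private final Executor executor;

    ConnectionScheduler(Executor executor) {
        this.executor = executor;
    }

    /** Returns true if a new task was submitted, false if one is pending or submission failed. */
    boolean schedule(long sid, Runnable connectTask) {
        if (!inProgress.add(sid)) {
            return false; // a request for this sid is already in flight
        }
        try {
            executor.execute(() -> {
                try {
                    connectTask.run();
                } finally {
                    inProgress.remove(sid); // allow later attempts for this sid
                }
            });
            return true;
        } catch (RuntimeException e) {
            inProgress.remove(sid); // do not block future requests after a failed submit
            return false;
        }
    }
}
```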
QuorumConnectionReqThread
This thread class is responsible for opening the socket connection to a given server.
Let's look at initiateConnection, which its run method calls.
public void initiateConnection(final MultipleAddresses electionAddr, final Long sid) {
Socket sock = null;
try {
LOG.debug("Opening channel to server {}", sid);
if (self.isSslQuorum()) {
sock = self.getX509Util().createSSLSocket();
} else {
// Create the socket through a factory
sock = SOCKET_FACTORY.get();
}
setSockOpts(sock);
// Connect to the remote server
sock.connect(electionAddr.getReachableOrOne(), cnxTO);
if (sock instanceof SSLSocket) {
SSLSocket sslSock = (SSLSocket) sock;
sslSock.startHandshake();
LOG.info("SSL handshake complete with {} - {} - {}",
sslSock.getRemoteSocketAddress(),
sslSock.getSession().getProtocol(),
sslSock.getSession().getCipherSuite());
}
LOG.debug("Connected to server {} using election address: {}:{}",
sid, sock.getInetAddress(), sock.getPort());
} catch (X509Exception e) {
LOG.warn("Cannot open secure channel to {} at election address {}", sid, electionAddr, e);
closeSocket(sock);
return;
} catch (UnresolvedAddressException | IOException e) {
LOG.warn("Cannot open channel to {} at election address {}", sid, electionAddr, e);
closeSocket(sock);
return;
}
try {
// This method is analyzed below
startConnection(sock, sid);
} catch (IOException e) {
LOG.error(
"Exception while connecting, id: {}, addr: {}, closing learner connection",
sid,
sock.getRemoteSocketAddress(),
e);
closeSocket(sock);
}
}
QuorumCnxManager.startConnection
startConnection performs the connection constraint check described above and creates the corresponding SendWorker and ReceiveWorker thread objects.
private boolean startConnection(Socket sock, Long sid) throws IOException {
// Output stream
DataOutputStream dout = null;
// Input stream
DataInputStream din = null;
LOG.debug("startConnection (myId:{} --> sid:{})", self.getId(), sid);
try {
// Use BufferedOutputStream to reduce the number of IP packets. This is
// important for x-DC scenarios.
// Wrap the output stream
BufferedOutputStream buf = new BufferedOutputStream(sock.getOutputStream());
dout = new DataOutputStream(buf);
// Sending id and challenge
// First sending the protocol version (in other words - message type).
// For backward compatibility reasons we stick to the old protocol version, unless the MultiAddress
// feature is enabled. During rolling upgrade, we must make sure that all the servers can
// understand the protocol version we use to avoid multiple partitions. see ZOOKEEPER-3720
// Basic data sent when opening a session with another server node
long protocolVersion = self.isMultiAddressEnabled() ? PROTOCOL_VERSION_V2 : PROTOCOL_VERSION_V1;
// Send the protocol version
dout.writeLong(protocolVersion);
// Send this node's sid
dout.writeLong(self.getId());
// now we send our election address. For the new protocol version, we can send multiple addresses.
Collection addressesToSend = protocolVersion == PROTOCOL_VERSION_V2
? self.getElectionAddress().getAllAddresses()
: Arrays.asList(self.getElectionAddress().getOne());
String addr = addressesToSend.stream()
.map(NetUtils::formatInetAddr).collect(Collectors.joining("|"));
byte[] addr_bytes = addr.getBytes();
dout.writeInt(addr_bytes.length);
dout.write(addr_bytes);
dout.flush();
// Create the input stream
din = new DataInputStream(new BufferedInputStream(sock.getInputStream()));
} catch (IOException e) {
LOG.warn("Ignoring exception reading or writing challenge: ", e);
closeSocket(sock);
return false;
}
// authenticate learner
QuorumPeer.QuorumServer qps = self.getVotingView().get(sid);
if (qps != null) {
// TODO - investigate why reconfig makes qps null.
// If server authentication is configured, authenticate the remote server
authLearner.authenticate(sock, qps.hostname);
}
// If lost the challenge, then drop the new connection
if (sid > self.getId()) {
// This is the check for the socket connection constraint described above
LOG.info("Have smaller server identifier, so dropping the connection: (myId:{} --> sid:{})", self.get
// If sid > self.sid, close the socket
closeSocket(sock);
// Otherwise proceed with the connection
} else {
LOG.debug("Have larger server identifier, so keeping the connection: (myId:{} --> sid:{})", self.getI
// Create a SendWorker from sid and the established socket
SendWorker sw = new SendWorker(sock, sid);
// Create a RecvWorker from sid, the socket, and the input stream
RecvWorker rw = new RecvWorker(sock, din, sid, sw);
// SendWorker keeps a reference to RecvWorker
sw.setRecv(rw);
SendWorker vsw = senderWorkerMap.get(sid);
if (vsw != null) {
vsw.finish();
}
// Add the SendWorker to senderWorkerMap
senderWorkerMap.put(sid, sw);
// Initialize the send queue for sid in queueSendMap
queueSendMap.putIfAbsent(sid, new CircularBlockingQueue<>(SEND_CAPACITY));
// Start SendWorker and RecvWorker
sw.start();
rw.start();
return true;
}
return false;
}
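The bytes written at the start of a connection (protocol version, our sid, then a length-prefixed address string) can be reproduced in memory as follows. `HandshakeSketch` is a hypothetical class, the protocol version is whatever the caller passes in, and the `read` method is only an illustrative counterpart, since the receiving side is not shown in this article.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of the connection handshake layout; not a ZooKeeper class. */
final class HandshakeSketch {

    static byte[] write(long protocolVersion, long mySid, String electionAddr) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream dout = new DataOutputStream(new BufferedOutputStream(bytes))) {
            byte[] addr = electionAddr.getBytes(StandardCharsets.UTF_8);
            dout.writeLong(protocolVersion); // 8 bytes
            dout.writeLong(mySid);           // 8 bytes
            dout.writeInt(addr.length);      // 4 bytes
            dout.write(addr);                // variable length
            dout.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Illustrative reader: returns "version/sid/address". */
    static String read(byte[] handshake) {
        try (DataInputStream din = new DataInputStream(new ByteArrayInputStream(handshake))) {
            long version = din.readLong();
            long sid = din.readLong();
            byte[] addr = new byte[din.readInt()];
            din.readFully(addr);
            return version + "/" + sid + "/" + new String(addr, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```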
SendWorker
Let's see how SendWorker works.
public void run() {
threadCnt.incrementAndGet();
try {
/**
* If there is nothing in the queue to send, then we
* send the lastMessage to ensure that the last message
* was received by the peer. The message could be dropped
* in case self or the peer shutdown their connection
* (and exit the thread) prior to reading/processing
* the last message. Duplicate messages are handled correctly
* by the peer.
*
* If the send queue is non-empty, then we have a recent
* message than that stored in lastMessage. To avoid sending
* stale message, we should send the message in the send queue.
*/
// Look up this SendWorker's send queue in queueSendMap by sid
BlockingQueue<ByteBuffer> bq = queueSendMap.get(sid);
if (bq == null || isSendQueueEmpty(bq)) {
// On the first pass, if bq is null or empty, send whatever is stored in lastMessageSent, provided it holds data;
// SendWorker always stores the most recently sent message in lastMessageSent
ByteBuffer b = lastMessageSent.get(sid);
if (b != null) {
LOG.debug("Attempting to send lastMessage to sid={}", sid);
send(b);
}
}
} catch (IOException e) {
LOG.error("Failed to send last message. Shutting down thread.", e);
this.finish();
}
LOG.debug("SendWorker thread started towards {}. myId: {}", sid, QuorumCnxManager.this.mySid);
try {
// This is the main loop: keep taking votes from our own queue and sending them out
while (running && !shutdown && sock != null) {
ByteBuffer b = null;
try {
BlockingQueue<ByteBuffer> bq = queueSendMap.get(sid);
if (bq != null) {
b = pollSendQueue(bq, 1000, TimeUnit.MILLISECONDS);
} else {
LOG.error("No queue of incoming messages for server {}", sid);
break;
}
if (b != null) {
// Remember the latest vote in lastMessageSent
lastMessageSent.put(sid, b);
// Send the message over the underlying socket
send(b);
}
} catch (InterruptedException e) {
LOG.warn("Interrupted while waiting for message on queue", e);
}
}
} catch (Exception e) {
LOG.warn(
"Exception when using channel: for id {} my id = {}",
sid ,
QuorumCnxManager.this.mySid,
e);
}
this.finish();
LOG.warn("Send worker leaving thread id {} my id = {}", sid, self.getId());
}
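The "resend the last message when nothing newer is queued" behaviour described in the English comment of run() can be sketched without any sockets. `LastMessageCache` is a hypothetical class and messages are plain strings for brevity.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of the last-message resend logic; not a ZooKeeper class. */
final class LastMessageCache {
    private final Map<Long, String> lastSent = new ConcurrentHashMap<>();

    /** Message to resend when a worker starts, or null when the queue already holds fresher data. */
    String onWorkerStart(long sid, Queue<String> pending) {
        if (pending == null || pending.isEmpty()) {
            return lastSent.get(sid); // may be null if nothing was ever sent
        }
        return null;
    }

    /** Takes the next queued message, remembers it as the last one sent, and returns it. */
    String nextToSend(long sid, Queue<String> pending) {
        String msg = pending.poll();
        if (msg != null) {
            lastSent.put(sid, msg);
        }
        return msg;
    }
}
```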
ReceiveWorker
Having covered SendWorker's run method, let's look at ReceiveWorker's run method.
public void run() {
threadCnt.incrementAndGet();
try {
LOG.debug("RecvWorker thread towards {} started. myId: {}", sid, QuorumCnxManager.this.mySid);
// Loop reading messages from the data stream
while (running && !shutdown && sock != null) {
/**
* Reads the first int to determine the length of the
* message
*/
// Votes are sent as variable-length messages, so the message length is read first
int length = din.readInt();
if (length <= 0 || length > PACKETMAXSIZE) {
throw new IOException("Received packet with invalid packet: " + length);
}
/**
* Allocates a new ByteBuffer to receive the message
*/
final byte[] msgArray = new byte[length];
// Read the whole message body according to its length
din.readFully(msgArray, 0, length);
// Wrap the data in a Message and put it into RecvQueue for later processing
addToRecvQueue(new Message(ByteBuffer.wrap(msgArray), sid));
}
} catch (Exception e) {
LOG.warn(
"Connection broken for id {}, my id = {}",
sid,
QuorumCnxManager.this.mySid,
e);
} finally {
LOG.warn("Interrupting SendWorker thread from RecvWorker. sid: {}. myId: {}", sid, QuorumCnxManager.this.mySid);
sw.finish();
closeSocket(sock);
}
}
}
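The length-prefixed framing used by the read loop above can be tried out in memory. `FrameReader` is a hypothetical class, and its `MAX_FRAME` limit is an assumed value standing in for PACKETMAXSIZE, whose actual value is not shown in this article.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

/** Hypothetical sketch of length-prefixed framing; not a ZooKeeper class. */
final class FrameReader {
    static final int MAX_FRAME = 512 * 1024; // assumed limit for this sketch

    /** Prepends a 4-byte big-endian length to the body. */
    static byte[] frame(byte[] body) {
        return ByteBuffer.allocate(4 + body.length).putInt(body.length).put(body).array();
    }

    /** Reads one frame: first the length, then exactly that many bytes. */
    static byte[] readFrame(DataInputStream din) throws IOException {
        int length = din.readInt();
        if (length <= 0 || length > MAX_FRAME) {
            throw new IOException("Invalid frame length: " + length);
        }
        byte[] body = new byte[length];
        din.readFully(body, 0, length); // blocks until the whole body has arrived
        return body;
    }

    /** Convenience wrapper for in-memory data; returns null for a malformed or truncated frame. */
    static byte[] readFrameOrNull(byte[] wire) {
        try (DataInputStream din = new DataInputStream(new ByteArrayInputStream(wire))) {
            return readFrame(din);
        } catch (IOException e) {
            return null;
        }
    }
}
```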
WorkerReceiver
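As shown above, ReceiveWorker reads the votes, wraps them, and puts them into RecvQueue. The next step is to see how WorkerReceiver consumes the data in RecvQueue, so let's look at WorkerReceiver's run method.
This method is long; please bear with it.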
public void run() {
Message response;
// Main loop
while (!stop) {
// Sleeps on receive
try {
// Try to take a vote from RecvQueue
response = manager.pollRecvQueue(3000, TimeUnit.MILLISECONDS);
if (response == null) {
// Nothing received; continue
continue;
}
// A series of validity checks based on the message size follows
final int capacity = response.buffer.capacity();
// The current protocol and two previous generations all send at least 28 bytes
if (capacity < 28) {
LOG.error("Got a short response from server {}: {}", response.sid, capacity);
continue;
}
// this is the backwardCompatibility mode in place before ZK-107
// It is for a version of the protocol in which we didn't send peer epoch
// With peer epoch and version the message became 40 bytes
boolean backCompatibility28 = (capacity == 28);
// this is the backwardCompatibility mode for no version information
boolean backCompatibility40 = (capacity == 40);
response.buffer.clear();
// Instantiate Notification and set its attributes
Notification n = new Notification();
// Extract the fields from the message and use them to build the notification
int rstate = response.buffer.getInt();
long rleader = response.buffer.getLong();
long rzxid = response.buffer.getLong();
long relectionEpoch = response.buffer.getLong();
long rpeerepoch;
int version = 0x0;
QuorumVerifier rqv = null;
try {
if (!backCompatibility28) {
rpeerepoch = response.buffer.getLong();
if (!backCompatibility40) {
/*
* Version added in 3.4.6
*/
version = response.buffer.getInt();
} else {
LOG.info("Backward compatibility mode (36 bits), server id: {}", response.sid);
}
} else {
LOG.info("Backward compatibility mode (28 bits), server id: {}", response.sid);
rpeerepoch = ZxidUtils.getEpochFromZxid(rzxid);
}
// check if we have a version that includes config. If so extract config info from message.
if (version > 0x1) {
int configLength = response.buffer.getInt();
// we want to avoid errors caused by the allocation of a byte array with negative length
// (causing NegativeArraySizeException) or huge length (causing e.g. OutOfMemoryError)
if (configLength < 0 || configLength > capacity) {
throw new IOException(String.format("Invalid configLength in notification message! sid=%d, capacity=%d, version=%d, configLength=%d",
response.sid, capacity, version, configLength));
}
byte[] b = new byte[configLength];
// Read the config data
response.buffer.get(b);
synchronized (self) {
try {
// Build a QuorumVerifier from the config
rqv = self.configFromString(new String(b));
QuorumVerifier curQV = self.getQuorumVerifier();
if (rqv.getVersion() > curQV.getVersion()) {
LOG.info("{} Received version: {} my version: {}",
self.getId(),
Long.toHexString(rqv.getVersion()),
Long.toHexString(self.getQuorumVerifier().getVersion()));
if (self.getPeerState() == ServerState.LOOKING) {
LOG.debug("Invoking processReconfig(), state: {}", self.getServerState());
self.processReconfig(rqv, null, null, false);
if (!rqv.equals(curQV)) {
LOG.info("restarting leader election");
self.shuttingDownLE = true;
self.getElectionAlg().shutdown();
break;
}
} else {
LOG.debug("Skip processReconfig(), state: {}", self.getServerState());
}
}
} catch (IOException | ConfigException e) {
LOG.error("Something went wrong while processing config received from {}", response.sid);
}
}
} else {
LOG.info("Backward compatibility mode (before reconfig), server id: {}", response.sid);
}
} catch (BufferUnderflowException | IOException e) {
LOG.warn("Skipping the processing of a partial / malformed response message sent by sid={} (message length: {})",
response.sid, capacity, e);
continue;
}
/*
* If it is from a non-voting server (such as an observer or
* a non-voting follower), respond right away.
*/
// If the sender's sid is not a valid voter, reply right away with this node's current vote
if (!validVoter(response.sid)) {
Vote current = self.getCurrentVote();
QuorumVerifier qv = self.getQuorumVerifier();
ToSend notmsg = new ToSend(
ToSend.mType.notification,
current.getId(),
current.getZxid(),
logicalclock.get(),
self.getPeerState(),
response.sid,
current.getPeerEpoch(),
qv.toString().getBytes());
sendqueue.offer(notmsg);
} else {
// Receive new message
LOG.debug("Receive new notification message. My id = {}", self.getId());
// State of peer that sent this message
QuorumPeer.ServerState ackstate = QuorumPeer.ServerState.LOOKING;
switch (rstate) {
case 0:
ackstate = QuorumPeer.ServerState.LOOKING;
break;
case 1:
ackstate = QuorumPeer.ServerState.FOLLOWING;
break;
case 2:
ackstate = QuorumPeer.ServerState.LEADING;
break;
case 3:
ackstate = QuorumPeer.ServerState.OBSERVING;
break;
default:
continue;
}
// Populate the Notification with the values extracted from the message
n.leader = rleader;
n.zxid = rzxid;
n.electionEpoch = relectionEpoch;
n.state = ackstate;
n.sid = response.sid;
n.peerEpoch = rpeerepoch;
n.version = version;
n.qv = rqv;
/*
* Print notification info
*/
LOG.info(
"Notification: my state:{}; n.sid:{}, n.state:{}, n.leader:{}, n.round:0x{}, "
+ "n.peerEpoch:0x{}, n.zxid:0x{}, message format version:0x{}, n.config version:0x{}",
self.getPeerState(),
n.sid,
n.state,
n.leader,
Long.toHexString(n.electionEpoch),
Long.toHexString(n.peerEpoch),
Long.toHexString(n.zxid),
Long.toHexString(n.version),
(n.qv != null ? (Long.toHexString(n.qv.getVersion())) : "0"));
/*
* If this server is looking, then send proposed leader
*/
// If this node is in the LOOKING state, add the Notification to recvqueue
if (self.getPeerState() == QuorumPeer.ServerState.LOOKING) {
recvqueue.offer(n);
/*
* Send a notification back if the peer that sent this
* message is also looking and its logical clock is
* lagging behind.
*/
if ((ackstate == QuorumPeer.ServerState.LOOKING)
&& (n.electionEpoch < logicalclock.get())) {
// If the sender's election round is behind this node's round, send this node's vote back to that sid
Vote v = getVote();
QuorumVerifier qv = self.getQuorumVerifier();
ToSend notmsg = new ToSend(
ToSend.mType.notification,
v.getId(),
v.getZxid(),
logicalclock.get(),
self.getPeerState(),
response.sid,
v.getPeerEpoch(),
qv.toString().getBytes());
sendqueue.offer(notmsg);
}
} else {
/*
* If this server is not looking, but the one that sent the ack
* is looking, then send back what it believes to be the leader.
*/
// If this node is not LOOKING (a leader has already been elected), a LOOKING sender is answered with the vote this node currently holds
Vote current = self.getCurrentVote();
if (ackstate == QuorumPeer.ServerState.LOOKING) {
// If this node is itself the leader, hand over the pending leadingVoteSet and report the LOOKING sender's sid to it
if (self.leader != null) {
if (leadingVoteSet != null) {
self.leader.setLeadingVoteSet(leadingVoteSet);
leadingVoteSet = null;
}
self.leader.reportLookingSid(response.sid);
}
LOG.debug(
"Sending new notification. My id ={} recipient={} zxid=0x{} leader={} config version = {}",
self.getId(),
response.sid,
Long.toHexString(current.getZxid()),
current.getId(),
Long.toHexString(self.getQuorumVerifier().getVersion()));
// Send the current leader information back to the server identified by sid
QuorumVerifier qv = self.getQuorumVerifier();
ToSend notmsg = new ToSend(
ToSend.mType.notification,
current.getId(),
current.getZxid(),
current.getElectionEpoch(),
self.getPeerState(),
response.sid,
current.getPeerEpoch(),
qv.toString().getBytes());
sendqueue.offer(notmsg);
}
}
}
} catch (InterruptedException e) {
LOG.warn("Interrupted Exception while waiting for new message", e);
}
}
LOG.info("WorkerReceiver is down");
}
}
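The branching at the end of the loop above can be condensed into a small decision function. This is only a sketch for reading along: `ReplyDecision`, `State`, `Action` and `decide` are hypothetical names that do not exist in ZooKeeper, and the real code builds `ToSend` messages instead of returning an enum.

```java
/** Hypothetical condensation of the reply logic in WorkerReceiver; not ZooKeeper code. */
final class ReplyDecision {
    enum State { LOOKING, FOLLOWING, LEADING, OBSERVING }

    enum Action {
        REPLY_ONLY,        // sender is not a valid voter: answer with our current vote
        ENQUEUE,           // we are LOOKING: pass the notification on to recvqueue
        ENQUEUE_AND_REPLY, // as above, plus answer a LOOKING sender whose round lags behind
        REPLY_WITH_LEADER, // we already follow or lead: tell a LOOKING sender our vote
        IGNORE             // neither side is LOOKING: nothing to do
    }

    private ReplyDecision() {}

    static Action decide(State self, State sender, boolean senderIsVoter,
                         long senderElectionEpoch, long myLogicalClock) {
        if (!senderIsVoter) {
            return Action.REPLY_ONLY;
        }
        if (self == State.LOOKING) {
            boolean senderLags = sender == State.LOOKING
                    && senderElectionEpoch < myLogicalClock;
            return senderLags ? Action.ENQUEUE_AND_REPLY : Action.ENQUEUE;
        }
        return sender == State.LOOKING ? Action.REPLY_WITH_LEADER : Action.IGNORE;
    }
}
```

For example, `ReplyDecision.decide(State.LOOKING, State.LOOKING, true, 3L, 5L)` falls into the "enqueue and reply" branch, because the sender's round 3 is behind the local round 5.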
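That covers the WorkerReceiver workflow: WorkerReceiver decodes each incoming vote message into a Notification and, when this node is in the LOOKING state, adds it to recvqueue. The QuorumPeer thread then takes notifications from recvqueue and processes them, using the logic we analyzed earlier.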
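To make the wire layout that WorkerReceiver decodes easier to see, here is a minimal standalone sketch of the same field order: state (int), leader (long), zxid (long), electionEpoch (long), then optionally peerEpoch (long), version (int), and a length-prefixed config. `DecodedNotification`, `NotificationDecoder` and the `encode` helper are hypothetical names introduced for illustration, and the version value 0x2 written by `encode` is an assumption. The size checks mirror the quoted code, including treating a 40-byte message as carrying no version.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical holder for the decoded fields; not a ZooKeeper class. */
final class DecodedNotification {
    int state;
    long leader;
    long zxid;
    long electionEpoch;
    long peerEpoch;
    int version;
    String config; // stays null when the message carries no config
}

final class NotificationDecoder {
    private NotificationDecoder() {}

    static DecodedNotification decode(ByteBuffer buf) {
        int capacity = buf.capacity();
        if (capacity < 28) {
            throw new IllegalArgumentException("short message: " + capacity);
        }
        buf.clear(); // position = 0, limit = capacity
        DecodedNotification n = new DecodedNotification();
        n.state = buf.getInt();
        n.leader = buf.getLong();
        n.zxid = buf.getLong();
        n.electionEpoch = buf.getLong();
        if (capacity == 28) {
            // Oldest format: no peer epoch on the wire, derive it from the high 32 bits of zxid.
            n.peerEpoch = n.zxid >> 32;
            return n;
        }
        n.peerEpoch = buf.getLong();
        if (capacity == 40) {
            // As in the quoted code, a 40-byte message is handled without reading a version.
            return n;
        }
        n.version = buf.getInt();
        if (n.version > 0x1) {
            int configLength = buf.getInt();
            if (configLength < 0 || configLength > capacity) {
                throw new IllegalArgumentException("invalid configLength: " + configLength);
            }
            byte[] b = new byte[configLength];
            buf.get(b);
            n.config = new String(b, StandardCharsets.UTF_8);
        }
        return n;
    }

    /** Illustrative encoder for the newest layout (44 bytes plus the config bytes). */
    static ByteBuffer encode(int state, long leader, long zxid, long electionEpoch,
                             long peerEpoch, String config) {
        byte[] cfg = config.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(44 + cfg.length);
        buf.putInt(state).putLong(leader).putLong(zxid).putLong(electionEpoch)
           .putLong(peerEpoch).putInt(0x2).putInt(cfg.length).put(cfg);
        return buf;
    }
}
```

A round trip such as `NotificationDecoder.decode(NotificationDecoder.encode(0, 3L, 0x500000002L, 7L, 5L, "version=1"))` should give back the same field values, while a bare 28-byte buffer exercises the oldest branch, where the peer epoch is taken from the zxid.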
End
This completes our source-code walkthrough of ZooKeeper's leader election process.