ZooKeeper ZAB Leader Election: Source Code Analysis

Preface

In a ZooKeeper cluster with multiple nodes, one node must be elected as the Leader before the cluster can serve requests. So what does ZooKeeper's leader election protocol look like? Let's find out below.

The election protocol

ZooKeeper divides the nodes of a cluster into two types:

  • participant: takes part in the election
  • observer: does not take part in the election

Nodes of the participant type take part in electing the leader. The election proceeds as follows:

  1. After starting, each node generates its own vote, which carries three main pieces of information:
 id: the sid of the node it proposes as leader (initially itself)
 zxid: the latest transaction id this node has processed
 electionEpoch: the identifier of the current election round
  2. Each node sends its current vote to the other nodes taking part in the election.
  3. Each node receives the votes (r_vote) sent by the other servers and decides whether to update its own vote according to the following rules:
     1. Compare vote.zxid with r_vote.zxid. If r_vote.zxid > vote.zxid, update the local vote so that vote.id = r_vote.id, meaning this node now backs the node that r_vote proposes; if r_vote.zxid < vote.zxid, keep the current vote; if they are equal, fall through to rule 2 below.
     2. Compare vote.id with r_vote.id. If r_vote.id > vote.id, update the local vote; otherwise keep it.
  4. Record the received vote in the local tally.
  5. Check whether any node is backed by more than half of the participants; if so, that node is elected leader.
  6. If no node has a majority yet, go back to step 3 and keep exchanging votes.
Tips

A quick note: every node starts an election with an electionEpoch value, and within the same election round all nodes should share the same electionEpoch. If one node's electionEpoch is smaller than the others', that node has fallen behind; it must clear the votes it has collected so far, bump its electionEpoch, and join the new election round.
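
To make the comparison rule concrete, here is a minimal sketch in Java. It is close to what FastLeaderElection.totalOrderPredicate does in the source (the real predicate also skips voters with zero quorum weight and compares the proposed leader's epoch before the zxid); the getters on Vote match the actual class, while the helper itself is only for illustration.

    // Returns true if the received vote r should replace our current vote.
    boolean receivedVoteWins(Vote r, Vote mine) {
        if (r.getPeerEpoch() != mine.getPeerEpoch()) {
            return r.getPeerEpoch() > mine.getPeerEpoch(); // newer leader epoch wins
        }
        if (r.getZxid() != mine.getZxid()) {
            return r.getZxid() > mine.getZxid();           // larger zxid (more data) wins
        }
        return r.getId() > mine.getId();                   // final tie-break: larger sid wins
    }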

Threads involved in the election

- WorkerSender

Sends this node's vote to the other servers (no network I/O here; it only drains the election send queue, sendqueue, and hands each vote to the connection manager for delivery).

- WorkerReceiver

Receives the votes sent by other servers (again no network I/O; it only consumes the messages the connection manager has already read off the network and turns them into notifications on the receive queue, recvqueue).

Every node that takes part in voting establishes a network connection to every other voting node.

- SendWorker

Each connection has a SendWorker that pushes vote messages over the network to the node at the other end.

- ReceiveWorker

Each connection has a ReceiveWorker (RecvWorker in the source) that reads the vote messages the other node sends over the network.

- ListenerHandler

The handler thread on each node that accepts connection requests from the other nodes.

- QuorumPeer

Updates the local vote based on the votes received from the other nodes and checks whether a leader has been elected. If a leader has emerged, it leaves the election and moves on to data recovery; otherwise it keeps running the election.
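
Putting these pieces together, a vote roughly travels the following path between two nodes (a textual summary of the figure below; the names refer to the classes and queues described above):

    sendqueue --WorkerSender--> QuorumCnxManager.queueSendMap[sid] --SendWorker--> socket
    socket --RecvWorker--> QuorumCnxManager.recvQueue --WorkerReceiver--> recvqueue --> QuorumPeer (lookForLeader)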

The following diagram shows how these threads interact.


(figure: leader_elect_thread_exchange.png, thread interaction during leader election)

With this groundwork in place, let's start the source analysis of ZooKeeper's leader election.

Node startup entry point

QuorumPeerMain is the startup entry class of every server node.

initializeAndRun

This is the startup entry method; it mainly does the following three things:

  1. Parse zoo.cfg into the properties of QuorumPeerConfig
  2. Start DatadirCleanupManager to periodically clean up old snapshot files
  3. Start the node via runFromConfig
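
For context, a minimal zoo.cfg for a three-node ensemble might look like the following (hostnames and paths are made up for illustration; an observer node would additionally set peerType=observer in its own config and tag its server line with :observer):

    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    # server.<sid>=<host>:<quorum port>:<election port>
    server.1=zk1.example.com:2888:3888
    server.2=zk2.example.com:2888:3888
    server.3=zk3.example.com:2888:3888
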
runFromConfig

The method is long, so I have annotated the main points with comments.

 public void runFromConfig(QuorumPeerConfig config) throws IOException, AdminServerException {
     
           // A large chunk of code above is omitted; it does not affect understanding
            if (config.getClientPortAddress() != null) {
                //Get the server-side I/O connection factory; the default is NIOServerCnxnFactory
                cnxnFactory = ServerCnxnFactory.createFactory();
               //Configure the ServerCnxnFactory: port, max number of client connections, creation of the SelectorThreads, the connection-expiry thread, etc.
                cnxnFactory.configure(config.getClientPortAddress(), config.getMaxClientCnxns(), config.getClientPortListenBacklog(), false);
            }

            if (config.getSecureClientPortAddress() != null) {
                secureCnxnFactory = ServerCnxnFactory.createFactory();
                secureCnxnFactory.configure(config.getSecureClientPortAddress(), config.getMaxClientCnxns(), config.getClientPortListenBacklog(), true);
            }
            //QuorumPeer represents this server node; everything that happens next revolves around it
            quorumPeer = getQuorumPeer();
            //Set the accessor for the data and transaction-log directories
            quorumPeer.setTxnFactory(new FileTxnSnapLog(config.getDataLogDir(), config.getDataDir()));
            quorumPeer.enableLocalSessions(config.areLocalSessionsEnabled());
            quorumPeer.enableLocalSessionsUpgrading(config.isLocalSessionsUpgradingEnabled());
            //quorumPeer.setQuorumPeers(config.getAllMembers());
            //Set the leader election algorithm; currently only one is supported: FastLeaderElection
            quorumPeer.setElectionType(config.getElectionAlg());
            //Set this node's sid
            quorumPeer.setMyid(config.getServerId());
            quorumPeer.setTickTime(config.getTickTime());
            quorumPeer.setMinSessionTimeout(config.getMinSessionTimeout());
            quorumPeer.setMaxSessionTimeout(config.getMaxSessionTimeout());
            quorumPeer.setInitLimit(config.getInitLimit());
            quorumPeer.setSyncLimit(config.getSyncLimit());
            quorumPeer.setConnectToLearnerMasterLimit(config.getConnectToLearnerMasterLimit());
            quorumPeer.setObserverMasterPort(config.getObserverMasterPort());
            quorumPeer.setConfigFileName(config.getConfigFilename());
            quorumPeer.setClientPortListenBacklog(config.getClientPortListenBacklog());
           //Set ZooKeeper's ZKDatabase; note that at this point the data has not been restored yet
            quorumPeer.setZKDatabase(new ZKDatabase(quorumPeer.getTxnFactory()));
            quorumPeer.setQuorumVerifier(config.getQuorumVerifier(), false);
            if (config.getLastSeenQuorumVerifier() != null) {
                quorumPeer.setLastSeenQuorumVerifier(config.getLastSeenQuorumVerifier(), false);
            }
            quorumPeer.initConfigInZKDatabase();
            quorumPeer.setCnxnFactory(cnxnFactory);
            quorumPeer.setSecureCnxnFactory(secureCnxnFactory);
            quorumPeer.setSslQuorum(config.isSslQuorum());
            quorumPeer.setUsePortUnification(config.shouldUsePortUnification());
            //Set the node type: participant or observer
            quorumPeer.setLearnerType(config.getPeerType());
            quorumPeer.setSyncEnabled(config.getSyncEnabled());
          // Another large chunk of code omitted
           //Initialize the quorumPeer; this mainly creates the authentication helper classes
            quorumPeer.initialize();

            if (config.jvmPauseMonitorToRun) {
                quorumPeer.setJvmPauseMonitor(new JvmPauseMonitor(config));
            }
            //Start the quorumPeer
            quorumPeer.start();
            ZKAuditProvider.addZKStartStopAuditLog();
          
            quorumPeer.join();
        } catch (InterruptedException e) {
            // warn, but generally this is ok
            LOG.warn("Quorum Peer interrupted", e);
        } finally {
            if (metricsProvider != null) {
                try {
                    metricsProvider.stop();
                } catch (Throwable error) {
                    LOG.warn("Error while stopping metrics", error);
                }
            }
        }
    }
QuorumPeer.start()

This is where QuorumPeer is started.

   public synchronized void start() {
        //Check that this node is included in the server list from the config file
        if (!getView().containsKey(myid)) {
            throw new RuntimeException("My id " + myid + " not in the peer list");
        }
        //Restore this node's data; see https://www.jianshu.com/p/f10ffc0ff861
        loadDataBase();
       //Start the SelectorThreads and the AcceptThread to get ready for client requests; see https://www.jianshu.com/p/8153a113fdf7
        startServerCnxnFactory();
//        try {
//            adminServer.start();
//        } catch (AdminServerException e) {
//            LOG.warn("Problem starting AdminServer", e);
//            System.out.println(e);
//        }
        //Start the cluster leader election
        startLeaderElection();
        startJvmPauseMonitor();
        //QuorumPeer is itself a thread; start it now
        super.start();
    }

startLeaderElection

In startLeaderElection, the initial vote is set up and the threads needed during leader election are created.

public synchronized void startLeaderElection() {
        try {
            if (getPeerState() == ServerState.LOOKING) {
                //Initialize the current vote with its three main pieces of information: the proposed leader's id, this node's latest transaction id (zxid), and the current election round.
                //On startup every node proposes itself as the leader; cheeky, but that is how it bootstraps
                currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
            }
        } catch (IOException e) {
            RuntimeException re = new RuntimeException(e.getMessage());
            re.setStackTrace(e.getStackTrace());
            throw re;
        }
        //Create the election algorithm
        this.electionAlg = createElectionAlgorithm(electionType);
    }
createElectionAlgorithm

Straight to the source:

 protected Election createElectionAlgorithm(int electionAlgorithm) {
        Election le = null;

        //TODO: use a factory rather than a switch
        switch (electionAlgorithm) {
        case 1:
            throw new UnsupportedOperationException("Election Algorithm 1 is not supported.");
        case 2:
            throw new UnsupportedOperationException("Election Algorithm 2 is not supported.");
         //ZooKeeper currently supports only one election algorithm
        case 3:
           //QuorumCnxManager is the class QuorumPeer uses to manage socket connections to the other nodes
            QuorumCnxManager qcm = createCnxnManager();
            //Check via qcmRef whether an old QuorumCnxManager already exists; if so, shut it down
            QuorumCnxManager oldQcm = qcmRef.getAndSet(qcm);
            if (oldQcm != null) {
                LOG.warn("Clobbering already-set QuorumCnxManager (restarting leader election?)");
                oldQcm.halt();
            }
            //Listener manages the ListenerHandler threads
            QuorumCnxManager.Listener listener = qcm.listener;
            if (listener != null) {
               //Start the listener, which in turn starts the individual ListenerHandlers
                listener.start();
                //Create the FastLeaderElection instance
                FastLeaderElection fle = new FastLeaderElection(this, qcm);
                //start() launches the WorkerSender and WorkerReceiver threads
                fle.start();
                le = fle;
            } else {
                LOG.error("Null listener when initializing cnx manager");
            }
            break;
        default:
            assert false;
        }
        return le;
    }
Tips
  1. Why does QuorumCnxManager use a Listener to manage the ListenerHandlers?
    Because a server may have several network interfaces and may listen on the election port on the IP address of each of them. In that case a single QuorumCnxManager can hold multiple ListenerHandlers, so one Listener is used to manage them all.
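
As an illustration (syntax introduced by ZOOKEEPER-3188 in ZooKeeper 3.6; the addresses are made up, and the multi-address feature must be enabled, e.g. via the zookeeper.multiAddress.enabled system property), a server reachable over two networks can list both election addresses, and the Listener then starts one ListenerHandler per address:

    server.1=192.168.1.10:2888:3888|10.0.0.10:2888:3888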
FastLeaderElection

What happens when a FastLeaderElection is created?

  1. The queues QuorumPeer uses for sending and receiving election messages, sendqueue and recvqueue, are created
private void starter(QuorumPeer self, QuorumCnxManager manager) {
        this.self = self;
        proposedLeader = -1;
        proposedZxid = -1;
        //Create the sendqueue and recvqueue objects
        sendqueue = new LinkedBlockingQueue<ToSend>();
        recvqueue = new LinkedBlockingQueue<Notification>();
        //Create the Messenger, which manages the WorkerSender and WorkerReceiver threads
        this.messenger = new Messenger(manager);
    }

2. A Messenger is created; inside the Messenger, a WorkerSender and a WorkerReceiver are created to process the data in sendqueue and recvqueue

Messenger(QuorumCnxManager manager) {

            this.ws = new WorkerSender(manager);

            this.wsThread = new Thread(this.ws, "WorkerSender[myid=" + self.getId() + "]");
            this.wsThread.setDaemon(true);

            this.wr = new WorkerReceiver(manager);

            this.wrThread = new Thread(this.wr, "WorkerReceiver[myid=" + self.getId() + "]");
            this.wrThread.setDaemon(true);
        }

QuorumPeer.run

Let's look at the logic the QuorumPeer thread runs in its main loop.


try {
            /*
             * Main loop
             */
            while (running) {
                switch (getPeerState()) {
                //Handle the leader-election (LOOKING) logic
                case LOOKING:
                    LOG.info("LOOKING");
                       //omitted .....
                    
                        try {
                            reconfigFlagClear();
                            if (shuttingDownLE) {
                                shuttingDownLE = false;
                                startLeaderElection();
                            }
                            //QuorumPeer enters the election logic
                            setCurrentVote(makeLEStrategy().lookForLeader());
                        } catch (Exception e) {
                            LOG.warn("Unexpected exception", e);
                            setPeerState(ServerState.LOOKING);
                        }
                    break;
               //Logic when acting as an observer
                case OBSERVING:
                    // omitted ............
                    break;
               //Logic when acting as a follower
                case FOLLOWING:
                    // omitted ............
                    break;
                //Logic when acting as the leader
                case LEADING:
                      // omitted ............
                    break;
                }
            }
        } finally {
           // this part of the code is omitted
        }
    }
FastLeaderElection.lookForLeader

The election itself is carried out in lookForLeader. The method is long, roughly 200 lines, so I have removed some of the less important code.

 public Vote lookForLeader() throws InterruptedException {
          //Some JMX-related code removed here

        self.start_fle = Time.currentElapsedTime();
        try {
            /*
             * The votes from the current leader election are stored in recvset. In other words, a vote v is in recvset
             * if v.electionEpoch == logicalclock. The current participant uses recvset to deduce on whether a majority
             * of participants has voted for it.
             */
             //The English comment above already says it: recvset collects the vote sent by each server.
             //The key is the server's sid and the value is the vote that server backs; from recvset we can tell whether a leader has been elected
            Map<Long, Vote> recvset = new HashMap<Long, Vote>();

            /*
             * The votes from previous leader elections, as well as the votes from the current leader election are
             * stored in outofelection. Note that notifications in a LOOKING state are not stored in outofelection.
             * Only FOLLOWING or LEADING notifications are stored in outofelection. The current participant could use
             * outofelection to learn which participant is the leader if it arrives late (i.e., higher logicalclock than
             * the electionEpoch of the received notifications) in a leader election.
             */
            //outofelection stores LEADING/FOLLOWING votes (from this and previous elections);
            //a node that joins late uses it to learn which server is already the leader
            Map<Long, Vote> outofelection = new HashMap<Long, Vote>();

            int notTimeout = minNotificationInterval;

            synchronized (this) {
                //logicalclock identifies the election round; it is bumped at the start of each new election
                logicalclock.incrementAndGet();
                 //Update the leader info this node proposes (proposedLeader, proposedZxid, proposedEpoch)
                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
            }

            LOG.info(
                "New election. My id = {}, proposed zxid=0x{}",
                self.getId(),
                Long.toHexString(proposedZxid));
            //Send our proposal to the other servers
            sendNotifications();

            SyncedLearnerTracker voteSet;

            /*
             * Loop in which we exchange notifications until we find a leader
             */
            
            while ((self.getPeerState() == ServerState.LOOKING) && (!stop)) {
                /*
                 * Remove next notification from queue, times out after 2 times
                 * the termination time
                 */
                //Take a vote off recvqueue (votes from the other servers, plus the one we sent to ourselves)
                Notification n = recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS);

                /*
                 * Sends more notifications if haven't received enough.
                 * Otherwise processes new notification.
                 */
                if (n == null) {
                     //No vote was obtained from recvqueue.
                     //If QuorumCnxManager has already delivered the previously queued messages (connections are up), simply resend our notifications
                    if (manager.haveDelivered()) {
                        sendNotifications();
                    } else {
                        //Otherwise have QuorumCnxManager (re)establish the network connections to the other servers
                        manager.connectAll();
                    }

                    /*
                     * Exponential backoff
                     */
                     //Back off: double notTimeout (capped at maxNotificationInterval)
                    int tmpTimeOut = notTimeout * 2;
                    notTimeout = Math.min(tmpTimeOut, maxNotificationInterval);
                    LOG.info("Notification time out: {}", notTimeout);
                } else if (validVoter(n.sid) && validVoter(n.leader)) {
                    /*
                     * Only proceed if the vote comes from a replica in the current or next
                     * voting view for a replica in the current or next voting view.
                     */
                    switch (n.state) {
                    case LOOKING:
                        if (getInitLastLoggedZxid() == -1) {
                            LOG.debug("Ignoring notification as our zxid is -1");
                            break;
                        }
                        if (n.zxid == -1) {
                            LOG.debug("Ignoring notification from member with -1 zxid {}", n.sid);
                            break;
                        }
                        // If notification > current, replace and send messages out
                   
                        if (n.electionEpoch > logicalclock.get()) {
                           //The received vote belongs to a later election round than our logicalclock:
                           //adopt that round and clear all the votes collected so far
                            logicalclock.set(n.electionEpoch);
                            recvset.clear();
                            //totalOrderPredicate compares the received vote with our own, in the order described at the top of this article (zxid, then id; the real predicate also checks the proposed leader's epoch first).
                           //It decides whether our vote needs updating; either way the (possibly updated) proposal is then sent to the other servers
                            if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
                            } else {
                                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
                            }
                            sendNotifications();
                        } else if (n.electionEpoch < logicalclock.get()) {
                                 //The received vote's electionEpoch is older than ours: simply drop it
                                LOG.debug(
                                    "Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x{}, logicalclock=0x{}",
                                    Long.toHexString(n.electionEpoch),
                                    Long.toHexString(logicalclock.get()));
                            break;
                        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
                             //Same totalOrderPredicate logic as above
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                            sendNotifications();
                        }

                        LOG.debug(
                            "Adding vote: from={}, proposed leader={}, proposed zxid=0x{}, proposed election epoch=0x{}",
                            n.sid,
                            n.leader,
                            Long.toHexString(n.zxid),
                            Long.toHexString(n.electionEpoch));

                        // don't care about the version if it's in LOOKING state
                        //Record the received vote in recvset
                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
                        //Build the vote tracker from recvset and our current proposal;
                        //it tells us whether our proposal is backed by a majority of the participants
                        voteSet = getVoteTracker(recvset, new Vote(proposedLeader,proposedZxid , logicalclock.get(), proposedEpoch));
                        if (voteSet.hasAllQuorums()) {
                            //Even if our proposal already has a majority, wait up to finalizeWait ms on recvqueue to see whether a better vote arrives that would change our proposal

                            // Verify if there is any change in the proposed leader
                            while ((n = recvqueue.poll(finalizeWait, TimeUnit.MILLISECONDS)) != null) {
                                if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
                                    recvqueue.put(n);
                                    break;
                                }
                            }

                            /*
                             * This predicate is true once we don't read any new
                             * relevant message from the reception queue
                             */
                            if (n == null) {
                                //Nothing better arrived within finalizeWait, so everyone accepts the node our vote proposes as the leader.
                              //Set this QuorumPeer's state based on proposedLeader and our own sid:
                               //if proposedLeader == our sid we become LEADING; otherwise a participant becomes FOLLOWING and an observer becomes OBSERVING
                                setPeerState(proposedLeader, voteSet);
                               //Build the final vote that identifies the elected leader
                                Vote endVote = new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch);
                                //leaveInstance clears recvqueue, marking the end of this election round
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }
                        break;
                    case OBSERVING:
                      //A vote from a node in OBSERVING state: do nothing

                        LOG.debug("Notification from observer: {}", n.sid);
                        break;
                    case FOLLOWING:
                    case LEADING:
                        /*
                         * Consider all notifications from the same epoch
                         * together.
                         */
                         //When would a node receive a vote whose state is FOLLOWING or LEADING?
                          //In other words, the cluster already has a leader.
                         //For example, when a new node joins a running cluster it receives votes from the other servers that are in FOLLOWING/LEADING state; this ties in with how WorkerReceiver works.
                         //Handling of votes whose state is LEADING or FOLLOWING:
                        if (n.electionEpoch == logicalclock.get()) {
                            //Same election round: add the vote to recvset directly
                            recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
                            voteSet = getVoteTracker(recvset, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
                              //Use voteSet to check that a majority of participants back n.leader, and that the claimed leader itself has also sent us its vote (checkLeader)
                            if (voteSet.hasAllQuorums() && checkLeader(recvset, n.leader, n.electionEpoch)) {
                                setPeerState(n.leader, voteSet);
                                Vote endVote = new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }

                        /*
                         * Before joining an established ensemble, verify that
                         * a majority are following the same leader.
                         *
                         * Note that the outofelection map also stores votes from the current leader election.
                         * See ZOOKEEPER-1732 for more information.
                         */
                        //Different election round: put the vote into outofelection and use it below to discover the existing leader
                        outofelection.put(n.sid, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
                        voteSet = getVoteTracker(outofelection, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));

                        if (voteSet.hasAllQuorums() && checkLeader(outofelection, n.leader, n.electionEpoch)) {
                            synchronized (this) {
                                logicalclock.set(n.electionEpoch);
                                setPeerState(n.leader, voteSet);
                            }
                            Vote endVote = new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                        break;
                    default:
                        LOG.warn("Notification state unrecoginized: {} (n.state), {}(n.sid)", n.state, n.sid);
                        break;
                    }
                } else {
                    if (!validVoter(n.leader)) {
                        LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid);
                    }
                    if (!validVoter(n.sid)) {
                        LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid);
                    }
                }
            }
            return null;
        } finally {
            try {
                if (self.jmxLeaderElectionBean != null) {
                    MBeanRegistry.getInstance().unregister(self.jmxLeaderElectionBean);
                }
            } catch (Exception e) {
                LOG.warn("Failed to unregister with JMX", e);
            }
            self.jmxLeaderElectionBean = null;
            LOG.debug("Number of connection processing threads: {}", manager.getConnectionThreadCount());
        }
    }

The logic above can be summarized with the following diagram.


(figure: leader_election_logical.png, decision flow of lookForLeader)

That is the election logic the QuorumPeer thread runs.
Next let's look at some of the details, which involve the other threads mentioned earlier.
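
Before moving on, it is worth spelling out what the quorum check in the loop above amounts to. Conceptually, getVoteTracker(recvset, proposal).hasAllQuorums() boils down to something like the following simplification (the helper is hypothetical; the real SyncedLearnerTracker also handles weighted and hierarchical quorums as well as an in-flight reconfiguration with two quorum verifiers):

    // Does a strict majority of the voting members back exactly our proposal?
    boolean hasQuorum(Map<Long, Vote> recvset, Vote proposed, Set<Long> votingMembers) {
        long agree = recvset.entrySet().stream()
            .filter(e -> votingMembers.contains(e.getKey())) // count participants only
            .filter(e -> e.getValue().equals(proposed))      // who back the same leader/zxid/epoch
            .count();
        return agree > votingMembers.size() / 2;             // strict majority
    }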

sendNotifications

Whenever a server has just started, or has just updated its proposal because of an r_vote received from another node, it calls sendNotifications to send its proposed leader to the other participants. Let's walk through its source.

 private void sendNotifications() {
        for (long sid : self.getCurrentAndNextConfigVoters()) {
            QuorumVerifier qv = self.getQuorumVerifier();
            //Wrap this node's proposed leader info into a ToSend object and put it on sendqueue
            ToSend notmsg = new ToSend(
                ToSend.mType.notification,
                proposedLeader,
                proposedZxid,
                logicalclock.get(),
                QuorumPeer.ServerState.LOOKING,
                sid,
                proposedEpoch,
                qv.toString().getBytes());

            LOG.debug(
                "Sending Notification: {} (n.leader), 0x{} (n.zxid), 0x{} (n.round), {} (recipient),"
                    + " {} (myid), 0x{} (n.peerEpoch) ",
                proposedLeader,
                Long.toHexString(proposedZxid),
                Long.toHexString(logicalclock.get()),
                sid,
                self.getId(),
                Long.toHexString(proposedEpoch));

            sendqueue.offer(notmsg);
        }
    }
WorkerSender.run

Let's look at the run method of the WorkerSender thread, which consumes the sendqueue.

public void run() {
                while (!stop) {
                    try {
                        //Take a ToSend message off sendqueue and hand it to process()
                        ToSend m = sendqueue.poll(3000, TimeUnit.MILLISECONDS);
                        if (m == null) {
                            continue;
                        }

                        process(m);
                    } catch (InterruptedException e) {
                        break;
                    }
                }
                LOG.info("WorkerSender is down");
            }
WorkerSender.process
 void process(ToSend m) {
                //Serialize the ToSend into a ByteBuffer
                ByteBuffer requestBuffer = buildMsg(m.state.ordinal(), m.leader, m.zxid, m.electionEpoch, m.peerEpoch, m.configData);
              //Hand requestBuffer to QuorumCnxManager for delivery to the target participant
                manager.toSend(m.sid, requestBuffer);

            }
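
buildMsg serializes the vote into the notification wire format. The layout below is a minimal reconstruction inferred from the WorkerReceiver parsing code shown later in this article; the field order and sizes follow that code, but the helper itself is hypothetical.

    // state(int) | leader(long) | zxid(long) | electionEpoch(long) | peerEpoch(long)
    //            | version(int) | configLength(int) | configData(bytes)
    static ByteBuffer buildNotification(int state, long leader, long zxid,
                                        long electionEpoch, long peerEpoch,
                                        int version, byte[] configData) {
        ByteBuffer b = ByteBuffer.allocate(44 + configData.length);
        b.putInt(state);            // sender's ServerState (LOOKING, FOLLOWING, ...)
        b.putLong(leader);          // sid of the proposed leader
        b.putLong(zxid);            // sender's last logged zxid
        b.putLong(electionEpoch);   // sender's logical clock (election round)
        b.putLong(peerEpoch);       // epoch of the proposed leader
        b.putInt(version);          // protocol version
        b.putInt(configData.length);
        b.put(configData);          // serialized QuorumVerifier (cluster view)
        b.flip();
        return b;
    }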
QuorumCnxManager.toSend

Let's analyze what happens in the connection manager's toSend method.

 public void toSend(Long sid, ByteBuffer b) {                                                                                           
     /*                                                                                                                                 
      * If sending message to myself, then simply enqueue it (loopback).                                                                
      */ 
     //If the vote is addressed to ourselves, just put it straight into recvQueue (loopback)
     if (this.mySid == sid) {                                                                                                           
         b.position(0);                                                                                                                 
         addToRecvQueue(new Message(b.duplicate(), sid));                                                                               
         /*                                                                                                                             
          * Otherwise send to the corresponding thread to send.                                                                         
          */                                                                                                                            
     } else {                                                                                                                           
         /*                                                                                                                             
          * Start a new connection if doesn't have one already.                                                                         
          */  
        //queueSendMap (a ConcurrentHashMap) holds, per destination sid, the messages this node wants to send
         BlockingQueue<ByteBuffer> bq = queueSendMap.computeIfAbsent(sid, serverId -> new CircularBlockingQueue<>(SEND_CAPACITY));      
         //Put this vote, destined for server sid, on that blocking queue
         addToSendQueue(bq, b);       
        //Make sure a socket connection to server sid exists
         connectOne(sid);                                                                                                               
     }                                                                                                                                  
 }                                                                                                                                      
Tips

Before walking through connectOne, let's first look at the network connection topology between voting nodes.
The figure below sketches the connections established among three nodes.


(figure: connection_topology.png, election connection topology of a three-node ensemble)

Each node connects to every other node. For the votes flowing out of and into each connection, ZooKeeper uses a SendWorker and a ReceiveWorker respectively; both are thread classes. Since any two nodes need exactly one connection between them, ZooKeeper imposes a constraint so that these connections are established reliably and without waste:
only the machine with the larger sid is allowed to initiate the connection to the machine with the smaller sid. For example, take servers with sid 1 and sid 2:
if server 1 initiates a socket connection to server 2, the connection will not survive; once the underlying socket is established, ZooKeeper compares the local sid with the remote sid, and the side with the smaller sid closes the socket. If server 2 initiates the connection to server 1, it succeeds.
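
Conceptually, the check on the initiating side boils down to the following sketch (the real check appears in startConnection below, with a mirror-image check on the accepting side; startWorkers is a made-up name for what startConnection goes on to do):

    // After the socket handshake, the initiator compares sids:
    if (remoteSid > self.getId()) {
        // The remote end has the larger sid, so it should own this link;
        // drop our outgoing connection and wait for it to connect back to us.
        closeSocket(sock);
    } else {
        // We have the larger sid: keep the connection and attach workers to it.
        startWorkers(sock, remoteSid);
    }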

QuorumCnxManager.connectOne

connectOne is the method that actually builds the topology sketched above.

 synchronized void connectOne(long sid) {  
      //senderWorkerMap stores the SendWorker for each sid
     if (senderWorkerMap.get(sid) != null) {   
        //If a SendWorker for this sid already exists (optionally re-checking multi-address reachability), just return
         LOG.debug("There is a connection already for server {}", sid);                                                  
         if (self.isMultiAddressEnabled() && self.isMultiAddressReachabilityCheckEnabled()) {                            
             // since ZOOKEEPER-3188 we can use multiple election addresses to reach a server. It is possible, that the  
             // one we are using is already dead and we need to clean-up, so when we will create a new connection        
             // then we will choose an other one, which is actually reachable                                            
             senderWorkerMap.get(sid).asyncValidateIfSocketIsStillReachable();                                           
         }                                                                                                               
         return;                                                                                                         
     }                                                                                                                   
     synchronized (self.QV_LOCK) {                                                                                       
         boolean knownId = false;                                                                                        
         // Resolve hostname for the remote server before attempting to                                                  
         // connect in case the underlying ip address has changed.                                                       
         self.recreateSocketAddresses(sid);                                                                              
         Map<Long, QuorumPeer.QuorumServer> lastCommittedView = self.getView();                                          
         QuorumVerifier lastSeenQV = self.getLastSeenQuorumVerifier();                                                   
         Map<Long, QuorumPeer.QuorumServer> lastProposedView = lastSeenQV.getAllMembers();                               
         if (lastCommittedView.containsKey(sid)) {                                                                       
             knownId = true;                                                                                             
             LOG.debug("Server {} knows {} already, it is in the lastCommittedView", self.getId(), sid);   
             //If no socket to server sid exists yet, establish one via connectOne(sid, addr)
             if (connectOne(sid, lastCommittedView.get(sid).electionAddr)) {                                             
                 return;                                                                                                 
             }                                                                                                           
         }                                                                                                               
         if (lastSeenQV != null                                                                                          
             && lastProposedView.containsKey(sid)                                                                        
             && (!knownId                                                                                                
                 || (lastProposedView.get(sid).electionAddr != lastCommittedView.get(sid).electionAddr))) {              
             knownId = true;                                                                                             
             LOG.debug("Server {} knows {} already, it is in the lastProposedView", self.getId(), sid);                  
                                                                                                                         
             if (connectOne(sid, lastProposedView.get(sid).electionAddr)) {                                              
                 return;                                                                                                 
             }                                                                                                           
         }                                                                                                               
         if (!knownId) {                                                                                                 
             LOG.warn("Invalid server id: {} ", sid);                                                                    
         }                                                                                                               
     }                                                                                                                   
 }                                                                                                                       

The connectOne(sid, address) call above goes on to call initiateConnectionAsync().

QuorumCnxManager.initiateConnectionAsync

initiateConnectionAsync wraps the connection-establishment work in a QuorumConnectionReqThread and completes it asynchronously.

public boolean initiateConnectionAsync(final MultipleAddresses electionAddr, final Long sid) {                          
    if (!inprogressConnections.add(sid)) {                                                                              
        // simply return as there is a connection request to                                                            
        // server 'sid' already in progress.                                                                            
        LOG.debug("Connection request to server id: {} is already in progress, so skipping this request", sid);         
        return true;                                                                                                    
    }                                                                                                                   
    try {                                                                                                               
        connectionExecutor.execute(new QuorumConnectionReqThread(electionAddr, sid));                                   
        connectionThreadCnt.incrementAndGet();                                                                          
    } catch (Throwable e) {                                                                                             
        // Imp: Safer side catching all type of exceptions and remove 'sid'                                             
        // from inprogress connections. This is to avoid blocking further                                               
        // connection requests from this 'sid' in case of errors.                                                       
        inprogressConnections.remove(sid);                                                                              
        LOG.error("Exception while submitting quorum connection request", e);                                           
        return false;                                                                                                   
    }                                                                                                                   
    return true;                                                                                                        
}                                                                                                                       
QuorumConnectionReqThread

This thread class is responsible for establishing the socket connection to a given server.
Let's look at initiateConnection, which its run method calls.

 public void initiateConnection(final MultipleAddresses electionAddr, final Long sid) {               
     Socket sock = null;                                                                              
     try {                                                                                            
         LOG.debug("Opening channel to server {}", sid);                                              
         if (self.isSslQuorum()) {                                                                    
             sock = self.getX509Util().createSSLSocket();                                             
         } else {    
            //Create the socket via the factory
             sock = SOCKET_FACTORY.get();                                                             
         }                                                                                            
         setSockOpts(sock);    
        //Connect to the remote server's election address
         sock.connect(electionAddr.getReachableOrOne(), cnxTO);                                       
         if (sock instanceof SSLSocket) {                                                             
             SSLSocket sslSock = (SSLSocket) sock;                                                    
             sslSock.startHandshake();                                                                
             LOG.info("SSL handshake complete with {} - {} - {}",                                     
                      sslSock.getRemoteSocketAddress(),                                               
                      sslSock.getSession().getProtocol(),                                             
                      sslSock.getSession().getCipherSuite());                                         
         }                                                                                            
                                                                                                      
         LOG.debug("Connected to server {} using election address: {}:{}",                            
                   sid, sock.getInetAddress(), sock.getPort());                                       
     } catch (X509Exception e) {                                                                      
         LOG.warn("Cannot open secure channel to {} at election address {}", sid, electionAddr, e);   
         closeSocket(sock);                                                                           
         return;                                                                                      
     } catch (UnresolvedAddressException | IOException e) {                                           
         LOG.warn("Cannot open channel to {} at election address {}", sid, electionAddr, e);          
         closeSocket(sock);                                                                           
         return;                                                                                      
     }                                                                                                
                                                                                                      
     try {    
        //We analyze this method below
         startConnection(sock, sid);                                                                  
     } catch (IOException e) {                                                                        
         LOG.error(                                                                                   
           "Exception while connecting, id: {}, addr: {}, closing learner connection",                
           sid,                                                                                       
           sock.getRemoteSocketAddress(),                                                             
           e);                                                                                        
         closeSocket(sock);                                                                           
     }                                                                                                
 }                                                                                                    

QuorumCnxManager.startConnection

startConnection performs the connection constraint check described above and creates the corresponding SendWorker and RecvWorker thread objects.

 private boolean startConnection(Socket sock, Long sid) throws IOException {      
     //data output stream
     DataOutputStream dout = null;       
     //data input stream
     DataInputStream din = null;                                                                              
     LOG.debug("startConnection (myId:{} --> sid:{})", self.getId(), sid);                                    
     try {                                                                                                    
         // Use BufferedOutputStream to reduce the number of IP packets. This is                              
         // important for x-DC scenarios.     
           //Wrap the output stream in a buffer
         BufferedOutputStream buf = new BufferedOutputStream(sock.getOutputStream());                         
         dout = new DataOutputStream(buf);                                                                    
                                                                                                              
         // Sending id and challenge                                                                          
                                                                                                              
         // First sending the protocol version (in other words - message type).                               
         // For backward compatibility reasons we stick to the old protocol version, unless the MultiAddress  
         // feature is enabled. During rolling upgrade, we must make sure that all the servers can            
         // understand the protocol version we use to avoid multiple partitions. see ZOOKEEPER-3720     
        //Below we send some basic handshake data when establishing the session with the other server
         long protocolVersion = self.isMultiAddressEnabled() ? PROTOCOL_VERSION_V2 : PROTOCOL_VERSION_V1;     
         //Send the protocol version
         dout.writeLong(protocolVersion);        
         //Send our own sid
         dout.writeLong(self.getId());                                                                        
                                                                                                              
         // now we send our election address. For the new protocol version, we can send multiple addresses.   
         Collection<InetSocketAddress> addressesToSend = protocolVersion == PROTOCOL_VERSION_V2               
                 ? self.getElectionAddress().getAllAddresses()                                                
                 : Arrays.asList(self.getElectionAddress().getOne());                                         
                                                                                                              
         String addr = addressesToSend.stream()                                                               
                 .map(NetUtils::formatInetAddr).collect(Collectors.joining("|"));                             
         byte[] addr_bytes = addr.getBytes();                                                                 
         dout.writeInt(addr_bytes.length);                                                                    
         dout.write(addr_bytes);                                                                              
         dout.flush();                                                                                        
          //Create the data input stream
         din = new DataInputStream(new BufferedInputStream(sock.getInputStream()));                           
     } catch (IOException e) {                                                                                
         LOG.warn("Ignoring exception reading or writing challenge: ", e);                                    
         closeSocket(sock);                                                                                   
         return false;                                                                                        
     }                                                                                                        
                                                                                                              
     // authenticate learner                                                                                  
     QuorumPeer.QuorumServer qps = self.getVotingView().get(sid);                                             
     if (qps != null) {                                                                                       
         // TODO - investigate why reconfig makes qps null.      
        //If server authentication is configured, authenticate with the remote server
         authLearner.authenticate(sock, qps.hostname);                                                        
     }                                                                                                        
                                                                                                              
     // If lost the challenge, then drop the new connection                                                   
     if (sid > self.getId()) {       
         //This is the connection-establishment constraint check mentioned above
         LOG.info("Have smaller server identifier, so dropping the connection: (myId:{} --> sid:{})", self.getId(), sid);
         //The remote sid is larger than ours, so close the socket and let that server connect to us instead
         closeSocket(sock);                                                                                   
         // Otherwise proceed with the connection                                                             
     } else {                                                                                                 
         LOG.debug("Have larger server identifier, so keeping the connection: (myId:{} --> sid:{})", self.getI
         //根据sid和建立的socket建立SendWorker
         SendWorker sw = new SendWorker(sock, sid);      
        //根据sid,socket和输入信息流建立RecvWorker                                                    
         RecvWorker rw = new RecvWorker(sock, din, sid, sw);        
         //SendWorker持有RecvWorker的引用                                    
         sw.setRecv(rw);                                                                                      
                                                                                                              
         SendWorker vsw = senderWorkerMap.get(sid);                                                           
                                                                                                              
         if (vsw != null) {                                                                                   
             vsw.finish();                                                                                    
         }                                                                                                    
           
        //Register the SendWorker in senderWorkerMap
         senderWorkerMap.put(sid, sw);                                                                        
           //Initialize the send queue for this sid in queueSendMap
         queueSendMap.putIfAbsent(sid, new CircularBlockingQueue<>(SEND_CAPACITY));                           
         //Start the SendWorker and the RecvWorker
         sw.start();                                                                                          
         rw.start();                                                                                          
                                                                                                              
         return true;                                                                                         
                                                                                                   
     }                                                                                                        
     return false;                                                                                            
 }                                                                                                            
                                                                                                              
SendWorker

Let's analyze how a SendWorker works.

 public void run() {                                                        
      threadCnt.incrementAndGet();                                           
      try {                                                                  
          /**                                                                
           * If there is nothing in the queue to send, then we               
           * send the lastMessage to ensure that the last message            
           * was received by the peer. The message could be dropped          
           * in case self or the peer shutdown their connection              
           * (and exit the thread) prior to reading/processing               
           * the last message. Duplicate messages are handled correctly      
           * by the peer.                                                    
           *                                                                 
           * If the send queue is non-empty, then we have a recent           
           * message than that stored in lastMessage. To avoid sending       
           * stale message, we should send the message in the send queue.    
           */ 
          //Get this SendWorker's outgoing message queue from queueSendMap by sid
          BlockingQueue<ByteBuffer> bq = queueSendMap.get(sid);              
                if (bq == null || isSendQueueEmpty(bq)) {
                   //On the first pass, if bq is null or empty, resend whatever is stored in lastMessageSent (if anything);
                  //SendWorker always keeps the most recently sent message in lastMessageSent
                    ByteBuffer b = lastMessageSent.get(sid);
                    if (b != null) {
                        LOG.debug("Attempting to send lastMessage to sid={}", sid);
                        send(b);
                    }
                }
            } catch (IOException e) {
                LOG.error("Failed to send last message. Shutting down thread.", e);
                this.finish();
            }
            LOG.debug("SendWorker thread started towards {}. myId: {}", sid, QuorumCnxManager.this.mySid);

            try {
              //This is the main loop: keep pulling votes off our queue and sending them out
                while (running && !shutdown && sock != null) {

                    ByteBuffer b = null;
                    try {
                        BlockingQueue<ByteBuffer> bq = queueSendMap.get(sid);
                        if (bq != null) {
                            b = pollSendQueue(bq, 1000, TimeUnit.MILLISECONDS);
                        } else {
                            LOG.error("No queue of incoming messages for server {}", sid);
                            break;
                        }

                        if (b != null) {
                          //Remember the latest message in lastMessageSent
                            lastMessageSent.put(sid, b);
                           //Send the message over the underlying socket
                            send(b);
                        }
                    } catch (InterruptedException e) {
                        LOG.warn("Interrupted while waiting for message on queue", e);
                    }
                }
            } catch (Exception e) {
                LOG.warn(
                    "Exception when using channel: for id {} my id = {}",
                    sid ,
                    QuorumCnxManager.this.mySid,
                    e);
            }
            this.finish();

            LOG.warn("Send worker leaving thread id {} my id = {}", sid, self.getId());
        }


ReceiveWorker

Having covered SendWorker's run method, let's analyze RecvWorker's run method.

 public void run() {
            threadCnt.incrementAndGet();
            try {
                LOG.debug("RecvWorker thread towards {} started. myId: {}", sid, QuorumCnxManager.this.mySid);
                //Loop reading messages from the input stream
                while (running && !shutdown && sock != null) {
                    /**
                     * Reads the first int to determine the length of the
                     * message
                     */
                    //Vote messages use a variable-length format, so first read the message length
                    int length = din.readInt();
                    if (length <= 0 || length > PACKETMAXSIZE) {
                        throw new IOException("Received packet with invalid packet: " + length);
                    }
                    /**
                     * Allocates a new ByteBuffer to receive the message
                     */
                    final byte[] msgArray = new byte[length];
                    //Read the full message body based on that length
                    din.readFully(msgArray, 0, length);
                   //Wrap the bytes in a Message and put it on recvQueue for later processing
                    addToRecvQueue(new Message(ByteBuffer.wrap(msgArray), sid));
                }
            } catch (Exception e) {
                LOG.warn(
                    "Connection broken for id {}, my id = {}",
                    sid,
                    QuorumCnxManager.this.mySid,
                    e);
            } finally {
                LOG.warn("Interrupting SendWorker thread from RecvWorker. sid: {}. myId: {}", sid, QuorumCnxManager.this.mySid);
                sw.finish();
                closeSocket(sock);
            }
        }

    }
WorkerReceiver

From the analysis above, the votes read by RecvWorker are wrapped as Messages and placed on recvQueue. The next step is to see how WorkerReceiver consumes recvQueue, so let's go through WorkerReceiver's run method.
It is long; please bear with it.


  public void run() {

                Message response;
                //Main loop
                while (!stop) {
                    // Sleeps on receive
                    try {
                        //Try to take a vote message off recvQueue
                        response = manager.pollRecvQueue(3000, TimeUnit.MILLISECONDS);
                        if (response == null) {
                            //如果为空那么 继续
                            continue;
                        }
                        //A series of sanity checks based on the message size follows
                        final int capacity = response.buffer.capacity();

                        // The current protocol and two previous generations all send at least 28 bytes
                        if (capacity < 28) {
                            LOG.error("Got a short response from server {}: {}", response.sid, capacity);
                            continue;
                        }

                        // this is the backwardCompatibility mode in place before ZK-107
                        // It is for a version of the protocol in which we didn't send peer epoch
                        // With peer epoch and version the message became 40 bytes
                        boolean backCompatibility28 = (capacity == 28);

                        // this is the backwardCompatibility mode for no version information
                        boolean backCompatibility40 = (capacity == 40);

                        response.buffer.clear();

                        // Instantiate Notification and set its attributes
                        Notification n = new Notification();
                        //Extract the fields from the message that will populate the Notification
                        int rstate = response.buffer.getInt();
                        long rleader = response.buffer.getLong();
                        long rzxid = response.buffer.getLong();
                        long relectionEpoch = response.buffer.getLong();
                        long rpeerepoch;

                        int version = 0x0;
                        QuorumVerifier rqv = null;

                        try {
                            if (!backCompatibility28) {
                                rpeerepoch = response.buffer.getLong();
                                if (!backCompatibility40) {
                                    /*
                                     * Version added in 3.4.6
                                     */

                                    version = response.buffer.getInt();
                                } else {
                                    LOG.info("Backward compatibility mode (36 bits), server id: {}", response.sid);
                                }
                            } else {
                                LOG.info("Backward compatibility mode (28 bits), server id: {}", response.sid);
                                rpeerepoch = ZxidUtils.getEpochFromZxid(rzxid);
                            }

                            // check if we have a version that includes config. If so extract config info from message.
                            if (version > 0x1) {
                                int configLength = response.buffer.getInt();

                                // we want to avoid errors caused by the allocation of a byte array with negative length
                                // (causing NegativeArraySizeException) or huge length (causing e.g. OutOfMemoryError)
                                if (configLength < 0 || configLength > capacity) {
                                    throw new IOException(String.format("Invalid configLength in notification message! sid=%d, capacity=%d, version=%d, configLength=%d",
                                                                        response.sid, capacity, version, configLength));
                                }

                                byte[] b = new byte[configLength];
                                //Read the config payload
                                response.buffer.get(b);

                                synchronized (self) {
                                    try {
                                        //Build a QuorumVerifier from the received config
                                        rqv = self.configFromString(new String(b));
                                        QuorumVerifier curQV = self.getQuorumVerifier();
                                        if (rqv.getVersion() > curQV.getVersion()) {
                                            LOG.info("{} Received version: {} my version: {}",
                                                     self.getId(),
                                                     Long.toHexString(rqv.getVersion()),
                                                     Long.toHexString(self.getQuorumVerifier().getVersion()));
                                            if (self.getPeerState() == ServerState.LOOKING) {
                                                LOG.debug("Invoking processReconfig(), state: {}", self.getServerState());
                                                self.processReconfig(rqv, null, null, false);
                                                if (!rqv.equals(curQV)) {
                                                    LOG.info("restarting leader election");
                                                    self.shuttingDownLE = true;
                                                    self.getElectionAlg().shutdown();

                                                    break;
                                                }
                                            } else {
                                                LOG.debug("Skip processReconfig(), state: {}", self.getServerState());
                                            }
                                        }
                                    } catch (IOException | ConfigException e) {
                                        LOG.error("Something went wrong while processing config received from {}", response.sid);
                                    }
                                }
                            } else {
                                LOG.info("Backward compatibility mode (before reconfig), server id: {}", response.sid);
                            }
                        } catch (BufferUnderflowException | IOException e) {
                            LOG.warn("Skipping the processing of a partial / malformed response message sent by sid={} (message length: {})",
                                     response.sid, capacity, e);
                            continue;
                        }
                        /*
                         * If it is from a non-voting server (such as an observer or
                         * a non-voting follower), respond right away.
                         */
                        //If the sender's sid is not a valid voter (e.g. an observer), reply to it immediately with our current vote
                        if (!validVoter(response.sid)) {
                            Vote current = self.getCurrentVote();
                            QuorumVerifier qv = self.getQuorumVerifier();
                            ToSend notmsg = new ToSend(
                                ToSend.mType.notification,
                                current.getId(),
                                current.getZxid(),
                                logicalclock.get(),
                                self.getPeerState(),
                                response.sid,
                                current.getPeerEpoch(),
                                qv.toString().getBytes());

                            sendqueue.offer(notmsg);
                        } else {
                            // Receive new message
                            LOG.debug("Receive new notification message. My id = {}", self.getId());

                            // State of peer that sent this message
                            QuorumPeer.ServerState ackstate = QuorumPeer.ServerState.LOOKING;
                            switch (rstate) {
                            case 0:
                                ackstate = QuorumPeer.ServerState.LOOKING;
                                break;
                            case 1:
                                ackstate = QuorumPeer.ServerState.FOLLOWING;
                                break;
                            case 2:
                                ackstate = QuorumPeer.ServerState.LEADING;
                                break;
                            case 3:
                                ackstate = QuorumPeer.ServerState.OBSERVING;
                                break;
                            default:
                                continue;
                            }
                            //Populate the Notification with the fields extracted from the Message
                            n.leader = rleader;
                            n.zxid = rzxid;
                            n.electionEpoch = relectionEpoch;
                            n.state = ackstate;
                            n.sid = response.sid;
                            n.peerEpoch = rpeerepoch;
                            n.version = version;
                            n.qv = rqv;
                            /*
                             * Print notification info
                             */
                            LOG.info(
                                "Notification: my state:{}; n.sid:{}, n.state:{}, n.leader:{}, n.round:0x{}, "
                                    + "n.peerEpoch:0x{}, n.zxid:0x{}, message format version:0x{}, n.config version:0x{}",
                                self.getPeerState(),
                                n.sid,
                                n.state,
                                n.leader,
                                Long.toHexString(n.electionEpoch),
                                Long.toHexString(n.peerEpoch),
                                Long.toHexString(n.zxid),
                                Long.toHexString(n.version),
                                (n.qv != null ? (Long.toHexString(n.qv.getVersion())) : "0"));

                            /*
                             * If this server is looking, then send proposed leader
                             */
                            //If this node is in LOOKING state, add the Notification to recvqueue
                            if (self.getPeerState() == QuorumPeer.ServerState.LOOKING) {
                                recvqueue.offer(n);

                                /*
                                 * Send a notification back if the peer that sent this
                                 * message is also looking and its logical clock is
                                 * lagging behind.
                                 */
                                if ((ackstate == QuorumPeer.ServerState.LOOKING)
                                    && (n.electionEpoch < logicalclock.get())) {
                                    //The sender's election round lags behind ours, so send our own vote back to that sid
                                    Vote v = getVote();
                                    QuorumVerifier qv = self.getQuorumVerifier();
                                    ToSend notmsg = new ToSend(
                                        ToSend.mType.notification,
                                        v.getId(),
                                        v.getZxid(),
                                        logicalclock.get(),
                                        self.getPeerState(),
                                        response.sid,
                                        v.getPeerEpoch(),
                                        qv.toString().getBytes());
                                    sendqueue.offer(notmsg);
                                }
                            } else {
                                /*
                                 * If this server is not looking, but the one that sent the ack
                                 * is looking, then send back what it believes to be the leader.
                                 */
                                //This node is not in LOOKING state, i.e. it already knows the elected leader
                                Vote current = self.getCurrentVote();
                                if (ackstate == QuorumPeer.ServerState.LOOKING) {
                                    //If this node is the leader, attach the final vote set and record that response.sid is still looking
                                    if (self.leader != null) {
                                        if (leadingVoteSet != null) {
                                            self.leader.setLeadingVoteSet(leadingVoteSet);
                                            leadingVoteSet = null;
                                        }
                                        self.leader.reportLookingSid(response.sid);
                                    }


                                    LOG.debug(
                                        "Sending new notification. My id ={} recipient={} zxid=0x{} leader={} config version = {}",
                                        self.getId(),
                                        response.sid,
                                        Long.toHexString(current.getZxid()),
                                        current.getId(),
                                        Long.toHexString(self.getQuorumVerifier().getVersion()));
                                     //Send the current leader information back to the requesting sid
                                    QuorumVerifier qv = self.getQuorumVerifier();
                                    ToSend notmsg = new ToSend(
                                        ToSend.mType.notification,
                                        current.getId(),
                                        current.getZxid(),
                                        current.getElectionEpoch(),
                                        self.getPeerState(),
                                        response.sid,
                                        current.getPeerEpoch(),
                                        qv.toString().getBytes());
                                    sendqueue.offer(notmsg);
                                }
                            }
                        }
                    } catch (InterruptedException e) {
                        LOG.warn("Interrupted Exception while waiting for new message", e);
                    }
                }
                LOG.info("WorkerReceiver is down");
            }

        }

That is WorkerReceiver's workflow: it processes incoming vote messages into Notification objects and adds them to recvqueue, and QuorumPeer then fetches notifications from recvqueue and processes them, which is the logic we already analyzed earlier.
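
To tie the threads together, here is a compressed sketch of that receive path with simplified types; the queue and field names follow the snippets above, and everything else is illustrative:

    import java.nio.ByteBuffer;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.TimeUnit;

    // Sketch of the receive path:
    // RecvWorker -> recvQueue -> WorkerReceiver -> recvqueue -> QuorumPeer
    class ElectionReceivePathSketch {
        static class Notification { long leader; long zxid; long electionEpoch; }

        // Framed bytes produced by the per-connection RecvWorker threads.
        final BlockingQueue<ByteBuffer> recvQueue = new ArrayBlockingQueue<>(100);
        // Parsed notifications consumed on the QuorumPeer side during election.
        final BlockingQueue<Notification> recvqueue = new ArrayBlockingQueue<>(100);

        // WorkerReceiver: drain raw messages, parse them, publish Notifications.
        void workerReceiverLoop() throws InterruptedException {
            while (true) {
                ByteBuffer raw = recvQueue.poll(3000, TimeUnit.MILLISECONDS);
                if (raw == null) {
                    continue;           // nothing received, poll again
                }
                recvqueue.offer(parse(raw));
            }
        }

        // Reduced version of the field extraction shown in WorkerReceiver.run().
        Notification parse(ByteBuffer raw) {
            Notification n = new Notification();
            raw.getInt();               // state (ignored in this sketch)
            n.leader = raw.getLong();
            n.zxid = raw.getLong();
            n.electionEpoch = raw.getLong();
            return n;
        }
    }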

End

With that, we have completed the source-code analysis of ZooKeeper's leader election process.
