无微不至之Zookeeper源码深度讲解(2)-核心流程梳理

一.源码仓库:

zookeeper
基于分支3.4.14分支在windows系统启动流程进行分析。

二.流程分析:

源码入口
通过zkServer.cmd可执行文件内容可以看出zookeeper的服务端是通过org.apache.zookeeper.server.quorum.QuorumPeerMain这个类的main作为入口来启动服务端程序的.main方法传入的是我们zoo.cfg文件的地址,然后通过解析zoo.cfg文件,将key,value的配置信息转换成QuorumPeerConfig的对象,转换细节可以看QuorumPeerConfig.parse方法,其中转换后的核心配置参数有:

参数名	参数描述
dataLogDir	事务日志存储路径
dataDir	快照存储路径
electionType	选举算法,目前只支持3-快速选举算法
myid	当前服务id
tickTime	时间单位
initLimit
syncLimit	事务存储路径
minSessionTimeout	最小会话超时时间
maxSessionTimeout	最大会话超时时间
peerType	角色类型-OBSERVER,PARTICIPANT
clientPort	客户端连接端口
clientPortAddress	客户端连接Host
snapRetainCount	快照保留个数,最小为3
purgeInterval	快照清除间隔
server.sid	hostName:port(通信端口):electionPort(选举端口):peerType
maxClientCnxns	最大客户端连接数

拿到解析后的参数后,可以通过是否配置了server.id参数来决定是否集群启动还是单机启动,单机启动运行通过ZooKeeperServerMain#main方法启动,集群启动则还是在QuorumPeerMain#runFromConfig方法进行处理的,这里我们就直接讲解集群模式,因为集群模式比单机模式多了集群间的通信相关的处理,如Leader选举,数据同步,请求转发等.

    public void runFromConfig(QuorumPeerConfig config) throws IOException {
      try {
          ManagedUtil.registerLog4jMBeans();
      } catch (JMException e) {
          LOG.warn("Unable to register log4j JMX control", e);
      }
  
      LOG.info("Starting quorum peer");
      try {
          ServerCnxnFactory cnxnFactory = ServerCnxnFactory.createFactory();
          cnxnFactory.configure(config.getClientPortAddress(),
                                config.getMaxClientCnxns());

          quorumPeer = getQuorumPeer();

          quorumPeer.setQuorumPeers(config.getServers());
          quorumPeer.setTxnFactory(new FileTxnSnapLog(
                  new File(config.getDataLogDir()),
                  new File(config.getDataDir())));
          quorumPeer.setElectionType(config.getElectionAlg());
          quorumPeer.setMyid(config.getServerId());
          quorumPeer.setTickTime(config.getTickTime());
          quorumPeer.setInitLimit(config.getInitLimit());
          quorumPeer.setSyncLimit(config.getSyncLimit());
          quorumPeer.setQuorumListenOnAllIPs(config.getQuorumListenOnAllIPs());
          quorumPeer.setCnxnFactory(cnxnFactory);
          quorumPeer.setQuorumVerifier(config.getQuorumVerifier());
          quorumPeer.setClientPortAddress(config.getClientPortAddress());
          quorumPeer.setMinSessionTimeout(config.getMinSessionTimeout());
          quorumPeer.setMaxSessionTimeout(config.getMaxSessionTimeout());
          quorumPeer.setZKDatabase(new ZKDatabase(quorumPeer.getTxnFactory()));
          quorumPeer.setLearnerType(config.getPeerType());
          quorumPeer.setSyncEnabled(config.getSyncEnabled());

          // sets quorum sasl authentication configurations
          quorumPeer.setQuorumSaslEnabled(config.quorumEnableSasl);
          if(quorumPeer.isQuorumSaslAuthEnabled()){
              quorumPeer.setQuorumServerSaslRequired(config.quorumServerRequireSasl);
              quorumPeer.setQuorumLearnerSaslRequired(config.quorumLearnerRequireSasl);
              quorumPeer.setQuorumServicePrincipal(config.quorumServicePrincipal);
              quorumPeer.setQuorumServerLoginContext(config.quorumServerLoginContext);
              quorumPeer.setQuorumLearnerLoginContext(config.quorumLearnerLoginContext);
          }

          quorumPeer.setQuorumCnxnThreadsSize(config.quorumCnxnThreadsSize);
          quorumPeer.initialize();

          quorumPeer.start();
          quorumPeer.join();
      } catch (InterruptedException e) {
          // warn, but generally this is ok
          LOG.warn("Quorum Peer interrupted", e);
      }
    }

可以从代码片段中可以看出,新创建出了一个QuorumPeer对象,其实这就是OOP思想,当前实例代表着集群的一个节点,然后将QuorumPeerConfig重新设置给QuorumPeer对象,在这里出现几个新的类：

类名	类描述
FileTxnSnapLog	持久化核心类别,包括快照,事务日志操作
ServerCnxnFactory 3	服务端网络处理核心类,其实现包含NIO和Netty两种实现
ZKDatabase	内存操作核心类,通过树结构存储

在设置了参数之后,接下来调用了QuorumPeer#initialize方法,在这个方法里主要是一些鉴权类的对象实例化。核心还是QuorumPeer#start方法:

        loadDataBase();//将数据从快照和事务日志加载到内存中
        cnxnFactory.start();        //网络服务启动
        startLeaderElection(); //选举工作准备
        super.start();

loadDataBase:
在这个方法里主要是通过委托给ZKDatabase#loadDataBase进行加载工作的

    public long loadDataBase() throws IOException {
        long zxid = snapLog.restore(dataTree, sessionsWithTimeouts, commitProposalPlaybackListener);
        initialized = true;
        return zxid;
    }

    public long restore(DataTree dt, Map sessions, 
            PlayBackListener listener) throws IOException {
        snapLog.deserialize(dt, sessions); //数据反序列化
        return fastForwardFromEdits(dt, sessions, listener);
    }

 public long deserialize(DataTree dt, Map sessions)
            throws IOException {
        //找到有效的100个快照文件,降序
        List snapList = findNValidSnapshots(100);
        if (snapList.size() == 0) {
            return -1L;
        }
        File snap = null;
        boolean foundValid = false;
        for (int i = 0; i < snapList.size(); i++) {
            snap = snapList.get(i);
            InputStream snapIS = null;
            CheckedInputStream crcIn = null;
            try {
                LOG.info("Reading snapshot " + snap);
                snapIS = new BufferedInputStream(new FileInputStream(snap));
                crcIn = new CheckedInputStream(snapIS, new Adler32());
                InputArchive ia = BinaryInputArchive.getArchive(crcIn);
                //真正序列化的地方
                deserialize(dt,sessions, ia);
                long checkSum = crcIn.getChecksum().getValue();
                long val = ia.readLong("val");
                //校验快照文件的完整性
                if (val != checkSum) {
                    throw new IOException("CRC corruption in snapshot :  " + snap);
                }
                foundValid = true;
                break;
            } catch(IOException e) {
                LOG.warn("problem reading snap file " + snap, e);
            } finally {
                if (snapIS != null) 
                    snapIS.close();
                if (crcIn != null) 
                    crcIn.close();
            } 
        }
        if (!foundValid) {
            throw new IOException("Not able to find valid snapshots in " + snapDir);
        }
        //快照文件命名为snapshot.lastZxid
        dt.lastProcessedZxid = Util.getZxidFromName(snap.getName(), SNAPSHOT_FILE_PREFIX);
        return dt.lastProcessedZxid;
    }

在ZkDataBase里有一下几个核心属性:

表列 A	表列 B
DataTree dataTree	存储树结构
FileTxnSnapLog snapLog	事务快照持久化类别
,ConcurrentHashMap sessionsWithTimeouts	会话管理,sessionId

在loadDataBase方法中,可以看到调用的snapLog#restore方法,进入到restore方法中可以看到调用到的是FileTxnSnapLog#deserialize进行反序化,然后保存到传入的dt,sessions参数中,可以定位到FileTxnSnapLog#deserialize(DataTree dt, Map sessions,

        InputArchive ia)的这个重载方法来看下,如何对快照文件进行反序列化的:

    public void deserialize(DataTree dt, Map sessions,
            InputArchive ia) throws IOException {
        FileHeader header = new FileHeader();
        header.deserialize(ia, "fileheader");
        if (header.getMagic() != SNAP_MAGIC) {
            throw new IOException("mismatching magic headers "
                    + header.getMagic() + 
                    " !=  " + FileSnap.SNAP_MAGIC);
        }

首先通过文件输入流的包装类InputArchive进行读取,调用的是FileHeader#deserialize方法：

  public void deserialize(InputArchive a_, String tag) throws java.io.IOException {
    a_.startRecord(tag);
    magic=a_.readInt("magic");
    version=a_.readInt("version");
    dbid=a_.readLong("dbid");
    a_.endRecord(tag);
}

FileHeader实现Record接口,其实后面所有需要的序列化和反序列化的都实现了这个接口,通过传进来的输入流对象来自定义自己的序列化和反序列化细节.
在这里可以看到FileHeader的存储结构为:

属性值	占用大小	描述
magic	4字节	魔法数字
version	4字节	版本号
version	8字节	数据库id

经过FileHedare#deserialize方法后,已经从文件流读取了16个字节,接下来调用的是 SerializeUtils#deserializeSnapshot(dt,ia,sessions)进行其他内容的加载,

    public static void deserializeSnapshot(DataTree dt,InputArchive ia,
            Map sessions) throws IOException {
        //会话数量
        int count = ia.readInt("count");
        while (count > 0) {
            //会话id
            long id = ia.readLong("id");
            //会话超时时间
            int to = ia.readInt("timeout");
            sessions.put(id, to);
            if (LOG.isTraceEnabled()) {
                ZooTrace.logTraceMessage(LOG, ZooTrace.SESSION_TRACE_MASK,
                        "loadData --- session in archive: " + id
                        + " with timeout: " + to);
            }
            count--;
        }
        dt.deserialize(ia, "tree");
    }

可以看到首先是从流里面读取了4个字节的count属性,也就是会话数量,接着再遍历读取了8个字节sessionId（会话id）和4个字节的timeout（会话超时时间）,再赋值个给了sessions(也就是ZkDataBase的sessionsWithTimeouts属性),最后调用的是DataTree#deserialize进行真正存储内容的反序列化工作：

    public void deserialize(InputArchive ia, String tag) throws IOException {
        aclCache.deserialize(ia);
        nodes.clear();
        pTrie.clear();
        String path = ia.readString("path");
        while (!path.equals("/")) {
            DataNode node = new DataNode();
            ia.readRecord(node, "node");
            nodes.put(path, node);
            synchronized (node) {
                aclCache.addUsage(node.acl);
            }
            int lastSlash = path.lastIndexOf('/');
            if (lastSlash == -1) {
                root = node;
            } else {
                String parentPath = path.substring(0, lastSlash);
                node.parent = nodes.get(parentPath);
                if (node.parent == null) {
                    throw new IOException("Invalid Datatree, unable to find " +
                            "parent " + parentPath + " of path " + path);
                }
                node.parent.addChild(path.substring(lastSlash + 1));
                long eowner = node.stat.getEphemeralOwner();
                if (eowner != 0) {
                    HashSet list = ephemerals.get(eowner);
                    if (list == null) {
                        list = new HashSet();
                        ephemerals.put(eowner, list);
                    }
                    list.add(path);
                }
            }
            path = ia.readString("path");
        }
        nodes.put("/", root);

        setupQuota();

        aclCache.purgeUnused();
    }

网络传输(NIO)
zookeeper与客户端建立连接与请求与响应的数据传输都是通过ServerCnxnFactory这个类的实现类进行处理的,我们这里直接通过NIO的实现类NIOServerCnxnFactory来进行讲解,再QuorumPeer的start方法里我们看到调用NIOServerCnxnFactory#start方法.

    public void start() {
        // ensure thread is started once and only once
        if (thread.getState() == Thread.State.NEW) {
            thread.start();
        }
    }

再start方法里我们看到就简单调用了Thread#start方法启动线程.至于thread方法是在哪里进行初始化的,我可以定位到NIOServerCnxnFactory#configure方法里：

    public void configure(InetSocketAddress addr, int maxcc) throws IOException {
        configureSaslLogin();
        //初始化线程对象
        thread = new ZooKeeperThread(this, "NIOServerCxn.Factory:" + addr);
        thread.setDaemon(true);
        //设置最大连接数参数
        maxClientCnxns = maxcc;
        //初始化Socket相关配置
        this.ss = ServerSocketChannel.open();
        ss.socket().setReuseAddress(true);
        LOG.info("binding to port " + addr);
        ss.socket().bind(addr);
        ss.configureBlocking(false);
        ss.register(selector, SelectionKey.OP_ACCEPT);
    }

选举
在进启动了网络传输服务之后,就开始准备着选举前的一些准备工作,我们可以从QuorumPeer#start方法中的QuorumPeer#startLeaderElection()调用进行一个选举的切入点:

 synchronized public void startLeaderElection() {
     try {
//设置初始化投票
         currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
     } catch(IOException e) {
         RuntimeException re = new RuntimeException(e.getMessage());
         re.setStackTrace(e.getStackTrace());
         throw re;
     }
     for (QuorumServer p : getView().values()) {
         if (p.id == myid) {
             myQuorumAddr = p.addr;
             break;
         }
     }
     if (myQuorumAddr == null) {
         throw new RuntimeException("My id " + myid + " not in the peer list");
     }
     if (electionType == 0) {
         try {
             udpSocket = new DatagramSocket(myQuorumAddr.getPort());
             //启动响应线程
             responder = new ResponderThread();
             responder.start();
         } catch (SocketException e) {
             throw new RuntimeException(e);
         }
     }
     //根据配置的选举算法进行一些初始化工作
     this.electionAlg = createElectionAlgorithm(electionType);
 }

从startLeaderElection这个方法中可以看出,主要是将初始化投票设置为自身,sid为自身serverId,zxid为通过快照和事务日志加载后的最大lastZxid,还有peerEpoch(选举年代)也就是当前自身的选举年代,然后就是启动了ReponseThread这个响应线程,核心逻辑还是在createElectionAlgorithm这个方法中,我们可以跟进去看一下具体的代码逻辑:

   protected Election createElectionAlgorithm(int electionAlgorithm){
        Election le=null;
                
        //TODO: use a factory rather than a switch
        switch (electionAlgorithm) {
        case 0:
            le = new LeaderElection(this);
            break;
        case 1:
//已过时
            le = new AuthFastLeaderElection(this);
            break;
        case 2:
//已过时
            le = new AuthFastLeaderElection(this, true);
            break;
        case 3:
//创建连接管理器
            qcm = createCnxnManager();
            QuorumCnxManager.Listener listener = qcm.listener;
            if(listener != null){
                //启动监听其他节点的连接请求
                listener.start();
//实例化快速选举算法核心类
                le = new FastLeaderElection(this, qcm);
            } else {
                LOG.error("Null listener when initializing cnx manager");
            }
            break;
        default:
            assert false;
        }
        return le;
    }

从上述代码中,可以看出主要工作是实例化了一个QuorumCnxManager这个对象,也就是通过这个对象中的Listener这个类来处理和其他节点的连接请求,调用了Listener#start方法实际是运行到了Listener#run方法代码中:

        public void run() {
            int numRetries = 0;
            InetSocketAddress addr;
            while((!shutdown) && (numRetries < 3)){
                try {
                    //实例化ServerSocket
                    ss = new ServerSocket();
                    ss.setReuseAddress(true);
                    if (listenOnAllIPs) {
                        int port = view.get(QuorumCnxManager.this.mySid)
                            .electionAddr.getPort();
                        addr = new InetSocketAddress(port);
                    } else {
                        addr = view.get(QuorumCnxManager.this.mySid)
                            .electionAddr;
                    }
                    LOG.info("My election bind port: " + addr.toString());
                    setName(view.get(QuorumCnxManager.this.mySid)
                            .electionAddr.toString());
                    ss.bind(addr);
                    while (!shutdown) {
                        //阻塞等待其他节点请求连接
                        Socket client = ss.accept();
                        setSockOpts(client);
                        LOG.info("Received connection request "
                                + client.getRemoteSocketAddress());

                        if (quorumSaslAuthEnabled) {
                            receiveConnectionAsync(client);
                        } else {
                            //接受请求核心逻辑
                            receiveConnection(client);
                        }

                        numRetries = 0;
                    }
                } catch (IOException e) {
                    LOG.error("Exception while listening", e);
                    numRetries++;
                    try {
                        ss.close();
                        Thread.sleep(1000);
                    } catch (IOException ie) {
                        LOG.error("Error closing server socket", ie);
                    } catch (InterruptedException ie) {
                        LOG.error("Interrupted while sleeping. " +
                                  "Ignoring exception", ie);
                    }
                }
            }
            LOG.info("Leaving listener");
            if (!shutdown) {
                LOG.error("As I'm leaving the listener thread, "
                        + "I won't be able to participate in leader "
                        + "election any longer: "
                        + view.get(QuorumCnxManager.this.mySid).electionAddr);
            }
        }

该方法主要是使用jdk的阻塞io与其他节点建立连接,不了解的可以去自行补充一下jdk的socket编程基础知识,在第二个while循环中的ss.accept()代码是会一直阻塞等待其他节点请求连接,当其他节点建立连接后,就会返回一个Socket实例,然后将Socket实例传入receiveConnection方法中,然后我们就可以和其他节点进行通信了,具体receiveConnection代码逻辑如下:

    public void receiveConnection(final Socket sock) {
        DataInputStream din = null;
        try {
//将输入流进行多次包装
            din = new DataInputStream(
                    new BufferedInputStream(sock.getInputStream()));

//真正处理连接
            handleConnection(sock, din);
        } catch (IOException e) {
            LOG.error("Exception handling connection, addr: {}, closing server connection",
                     sock.getRemoteSocketAddress());
            closeSocket(sock);
        }
    }

将io输入流包装后,进一步调用了handleConnection进行连接的处理:

    private void handleConnection(Socket sock, DataInputStream din)
            throws IOException {
        Long sid = null;
        try {
            // 阻塞等待另外一个节点发送建立请求的第一个包
            //先读取8个字节,又可能sid（服务id）,也有可能是protocolVersion（协议版本）
            sid = din.readLong();
//读取到的是协议版本
            if (sid < 0) {
//进一步读取8个字节,就是真正的sid
                sid = din.readLong();
//读取4个字节,也就是读取到的是剩余的其他内容的字节数
                int num_remaining_bytes = din.readInt();
//进行字数校验
                if (num_remaining_bytes < 0 || num_remaining_bytes > maxBuffer) {
                    LOG.error("Unreasonable buffer length: {}", num_remaining_bytes);
                    closeSocket(sock);
                    return;
                }
                byte[] b = new byte[num_remaining_bytes];

            //一次性将所有剩下的字节内容读取到b这个字节数组中
                int num_read = din.read(b);
                if (num_read != num_remaining_bytes) {
                    LOG.error("Read only " + num_read + " bytes out of " + num_remaining_bytes + " sent by server " + sid);
                }
            }
            if (sid == QuorumPeer.OBSERVER_ID) {
                sid = observerCounter.getAndDecrement();
                LOG.info("Setting arbitrary identifier to observer: " + sid);
            }
        } catch (IOException e) {
            closeSocket(sock);
            LOG.warn("Exception reading or writing challenge: " + e.toString());
            return;
        }

        LOG.debug("Authenticating learner server.id: {}", sid);
        authServer.authenticate(sock, din);
        //如果读取的sid小于当前节点的sid,则关闭之前建立过的连接
        if (sid < this.mySid) {
            SendWorker sw = senderWorkerMap.get(sid);
            if (sw != null) {
                sw.finish();
            }
            LOG.debug("Create new connection to server: " + sid);
            closeSocket(sock);
            //关闭之前的连接后,由当前节点发起连接请求
            connectOne(sid);

        } else {
            //发送线程
            SendWorker sw = new SendWorker(sock, sid);
            //接受线程
            RecvWorker rw = new RecvWorker(sock, din, sid, sw);
            sw.setRecv(rw);
            SendWorker vsw = senderWorkerMap.get(sid);
            if(vsw != null)
                vsw.finish();
            senderWorkerMap.put(sid, sw);
            queueSendMap.putIfAbsent(sid, new ArrayBlockingQueue(SEND_CAPACITY));
            //启动发送线程
            sw.start();
            //启动接受线程
            rw.start();
            return;
        }
    }

从这段代码中可以看出,建立请求只能由sid大的一方发起,由sid小的一方接受,如现在有sid=1,sid=2,sid=3三个节点,那么只能由2这个节点发起连接请求,1这个这个节点处理连接请求.这样就保证了双方只保持着一条连接,因为Socket是全双工模式,支持双方进行通信.Socket可以通过ss.accept获取到,还可以通过当前方法的connectOne这个方法去和sid较小的节点进行连接：

    synchronized public void connectOne(long sid){
//就是判断sendWorkerMap中是否包含了当前sid
        if (!connectedToPeer(sid)){
            InetSocketAddress electionAddr;
            if (view.containsKey(sid)) {
            //拿到之前配置的server.id的选举地址
                electionAddr = view.get(sid).electionAddr;
            } else {
                LOG.warn("Invalid server id: " + sid);
                return;
            }
            try {
                LOG.debug("Opening channel to server " + sid);
//实例化Socket对象
                Socket sock = new Socket();
                setSockOpts(sock);
                //进行连接
                sock.connect(view.get(sid).electionAddr, cnxTO);
                LOG.debug("Connected to server " + sid);
                if (quorumSaslAuthEnabled) {
                    initiateConnectionAsync(sock, sid);
                } else {
                    //同步初始化连接,也就是将当前自身的一些信息发送给其他节点
                    initiateConnection(sock, sid);
                }
            } catch (UnresolvedAddressException e) {
                LOG.warn("Cannot open channel to " + sid
                        + " at election address " + electionAddr, e);
                if (view.containsKey(sid)) {
                    view.get(sid).recreateSocketAddresses();
                }
                throw e;
            } catch (IOException e) {
                LOG.warn("Cannot open channel to " + sid
                        + " at election address " + electionAddr,
                        e);
                if (view.containsKey(sid)) {
                    view.get(sid).recreateSocketAddresses();
                }
            }
        } else {
            LOG.debug("There is a connection already for server " + sid);
        }
    }

    public void initiateConnection(final Socket sock, final Long sid) {
        try {
            startConnection(sock, sid);
        } catch (IOException e) {
            LOG.error("Exception while connecting, id: {}, addr: {}, closing learner connection",
                     new Object[] { sid, sock.getRemoteSocketAddress() }, e);
            closeSocket(sock);
            return;
        }
    }

    private boolean startConnection(Socket sock, Long sid)
            throws IOException {
        DataOutputStream dout = null;
        DataInputStream din = null;
        try {
            dout = new DataOutputStream(sock.getOutputStream());
            //将自身sid发送给其他节点
            dout.writeLong(this.mySid);
            dout.flush();
            din = new DataInputStream(
                    new BufferedInputStream(sock.getInputStream()));
        } catch (IOException e) {
            LOG.warn("Ignoring exception reading or writing challenge: ", e);
            closeSocket(sock);
            return false;
        }
        // authenticate learner
        authLearner.authenticate(sock, view.get(sid).hostname);
        if (sid > this.mySid) {
            LOG.info("Have smaller server identifier, so dropping the " +
                     "connection: (" + sid + ", " + this.mySid + ")");
            closeSocket(sock);
            // Otherwise proceed with the connection
        } else {
            //以下逻辑就和通过ss.accept拿到socket对象之后一样的逻辑
            SendWorker sw = new SendWorker(sock, sid);
            RecvWorker rw = new RecvWorker(sock, din, sid, sw);
            sw.setRecv(rw);
            SendWorker vsw = senderWorkerMap.get(sid);
            if(vsw != null)
                vsw.finish();
            senderWorkerMap.put(sid, sw);
            queueSendMap.putIfAbsent(sid, new ArrayBlockingQueue(SEND_CAPACITY));
            sw.start();
            rw.start();
            return true;    
            
        }
        return false;
    }

从以上几个方法中可以看出,在通过ServerSocket.accpet和socket.connect拿到了Socket对象之后,实例化出来一个SendWorker和一个RecvWorker这个对象,并调用了各自的start方法去启动两个线程,其实就是通过这2个线程去完成和其他节点的请求和响应的数据传输工作,一个节点维护一个SendWorker、一个RecvWorker和通过queueSendMap来存储一个队列来进行通信的。
具体后面这3个对象是如何发挥作用的,会在选举细节中具体讲解.完成这一系列的选举准备工作后,我们回到QuorumPeer#start方法中,接下来QuorumPeer#start方法调用super.start()方法,因为QuorumPeer这个对象继承了ZooKeeperThread,而ZooKeeperThread又继承了jdk的Thread类,所以调用了super.start之后,就会单独开辟一个线程去执行QuorumPeer#run方法,也就是真正进行选举的地方:

    public void run() {
        setName("QuorumPeer" + "[myid=" + getId() + "]" +
                cnxnFactory.getLocalAddress());
        LOG.debug("Starting quorum peer");
        //1.jmx拓展点
        try {
            jmxQuorumBean = new QuorumBean(this);
            MBeanRegistry.getInstance().register(jmxQuorumBean, null);
            for(QuorumServer s: getView().values()){
                ZKMBeanInfo p;
                if (getId() == s.id) {
                    p = jmxLocalPeerBean = new LocalPeerBean(this);
                    try {
                        MBeanRegistry.getInstance().register(p, jmxQuorumBean);
                    } catch (Exception e) {
                        LOG.warn("Failed to register with JMX", e);
                        jmxLocalPeerBean = null;
                    }
                } else {
                    p = new RemotePeerBean(s);
                    try {
                        MBeanRegistry.getInstance().register(p, jmxQuorumBean);
                    } catch (Exception e) {
                        LOG.warn("Failed to register with JMX", e);
                    }
                }
            }
        } catch (Exception e) {
            LOG.warn("Failed to register with JMX", e);
            jmxQuorumBean = null;
        }
        2.//选举逻辑
        try {
            /*
             * Main loop
             */
            while (running) {
                switch (getPeerState()) {
                //1.Looking状态
                case LOOKING:
                    LOG.info("LOOKING");
                    //开启只读模式
                    if (Boolean.getBoolean("readonlymode.enabled")) {
                        LOG.info("Attempting to start ReadOnlyZooKeeperServer");
                        final ReadOnlyZooKeeperServer roZk = new ReadOnlyZooKeeperServer(
                                logFactory, this,
                                new ZooKeeperServer.BasicDataTreeBuilder(),
                                this.zkDb);
                        Thread roZkMgr = new Thread() {
                            public void run() {
                                try {
                                    // lower-bound grace period to 2 secs
                                    sleep(Math.max(2000, tickTime));
                                    if (ServerState.LOOKING.equals(getPeerState())) {
                                        roZk.startup();
                                    }
                                } catch (InterruptedException e) {
                                    LOG.info("Interrupted while attempting to start ReadOnlyZooKeeperServer, not started");
                                } catch (Exception e) {
                                    LOG.error("FAILED to start ReadOnlyZooKeeperServer", e);
                                }
                            }
                        };
                        try {
                            roZkMgr.start();
                            setBCVote(null);
                            setCurrentVote(makeLEStrategy().lookForLeader());
                        } catch (Exception e) {
                            LOG.warn("Unexpected exception",e);
                            setPeerState(ServerState.LOOKING);
                        } finally {
                            // If the thread is in the the grace period, interrupt
                            // to come out of waiting.
                            roZkMgr.interrupt();
                            roZk.shutdown();
                        }
                    } else {
                        try {
                            setBCVote(null);
                            //调用ElectionAlg#lookForLeader方法,然后返回选举后的投票信息
                            setCurrentVote(makeLEStrategy().lookForLeader());
                        } catch (Exception e) {
                            LOG.warn("Unexpected exception", e);
                            setPeerState(ServerState.LOOKING);
                        }
                    }
                    break;
                //选举结束,observer角色进如到此处
                case OBSERVING:
                    try {
                        LOG.info("OBSERVING");
                        setObserver(makeObserver(logFactory));
                        observer.observeLeader();
                    } catch (Exception e) {
                        LOG.warn("Unexpected exception",e );                        
                    } finally {
                        observer.shutdown();
                        setObserver(null);
                        setPeerState(ServerState.LOOKING);
                    }
                    break;
                //选举结束,Follower角色进入到此
                case FOLLOWING:
                    try {
                        LOG.info("FOLLOWING");
                        setFollower(makeFollower(logFactory));
                        follower.followLeader();
                    } catch (Exception e) {
                        LOG.warn("Unexpected exception",e);
                    } finally {
                        follower.shutdown();
                        setFollower(null);
                        setPeerState(ServerState.LOOKING);
                    }
                    break;
                //选举结束,Leader角色进入到此
                case LEADING:
                    LOG.info("LEADING");
                    try {
                        setLeader(makeLeader(logFactory));
                        leader.lead();
                        setLeader(null);
                    } catch (Exception e) {
                        LOG.warn("Unexpected exception",e);
                    } finally {
                        if (leader != null) {
                            leader.shutdown("Forcing shutdown");
                            setLeader(null);
                        }
                        setPeerState(ServerState.LOOKING);
                    }
                    break;
                }
            }
        } finally {
            LOG.warn("QuorumPeer main thread exited");
            try {
                MBeanRegistry.getInstance().unregisterAll();
            } catch (Exception e) {
                LOG.warn("Failed to unregister with JMX", e);
            }
            jmxQuorumBean = null;
            jmxLocalPeerBean = null;
        }
    }

我们可以从上诉代码中的MainLoop处开始看,进入while循环后,因为当前节点还是looking状态,苏所以进入到looking分支,在这个分支中可以看到首先判断当前节点是否是只读模式,因为当前不讲解只读模式,所以直接进入到另外一个分支：

                        setBCVote(null);
                        //调用ElectionAlg#lookForLeader方法,然后返回选举后的投票信息
                        setCurrentVote(makeLEStrategy().lookForLeader());

makeLEStrategy方法返回的其实就是我们在QuorumPeer#startLeaderElection方法中实例话出来的FastLeaderElection实例,然后调用FastLeaderElection#lookForLeader方法进行Leader选举:

  public Vote lookForLeader() throws InterruptedException {
        try {
            self.jmxLeaderElectionBean = new LeaderElectionBean();
            MBeanRegistry.getInstance().register(
                    self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
        } catch (Exception e) {
            LOG.warn("Failed to register with JMX", e);
            self.jmxLeaderElectionBean = null;
        }
        if (self.start_fle == 0) {
           self.start_fle = Time.currentElapsedTime();
        }
        try {
            HashMap recvset = new HashMap();

            HashMap outofelection = new HashMap();

            int notTimeout = finalizeWait;

            synchronized(this){
                logicalclock.incrementAndGet();
                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
            }

            LOG.info("New election. My id =  " + self.getId() +
                    ", proposed zxid=0x" + Long.toHexString(proposedZxid));
            sendNotifications();

            /*
             * Loop in which we exchange notifications until we find a leader
             */

            while ((self.getPeerState() == ServerState.LOOKING) &&
                    (!stop)){
                /*
                 * Remove next notification from queue, times out after 2 times
                 * the termination time
                 */
                Notification n = recvqueue.poll(notTimeout,
                        TimeUnit.MILLISECONDS);

                /*
                 * Sends more notifications if haven't received enough.
                 * Otherwise processes new notification.
                 */
                if(n == null){
                    if(manager.haveDelivered()){
                        sendNotifications();
                    } else {
                        manager.connectAll();
                    }

                    /*
                     * Exponential backoff
                     */
                    int tmpTimeOut = notTimeout*2;
                    notTimeout = (tmpTimeOut < maxNotificationInterval?
                            tmpTimeOut : maxNotificationInterval);
                    LOG.info("Notification time out: " + notTimeout);
                }
                else if(validVoter(n.sid) && validVoter(n.leader)) {
                    /*
                     * Only proceed if the vote comes from a replica in the
                     * voting view for a replica in the voting view.
                     */
                    switch (n.state) {
                    case LOOKING:
                        // If notification > current, replace and send messages out
                        if (n.electionEpoch > logicalclock.get()) {
                            logicalclock.set(n.electionEpoch);
                            recvset.clear();
                            if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
                            } else {
                                updateProposal(getInitId(),
                                        getInitLastLoggedZxid(),
                                        getPeerEpoch());
                            }
                            sendNotifications();
                        } else if (n.electionEpoch < logicalclock.get()) {
                            if(LOG.isDebugEnabled()){
                                LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                                        + Long.toHexString(n.electionEpoch)
                                        + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
                            }
                            break;
                        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                proposedLeader, proposedZxid, proposedEpoch)) {
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                            sendNotifications();
                        }

                        if(LOG.isDebugEnabled()){
                            LOG.debug("Adding vote: from=" + n.sid +
                                    ", proposed leader=" + n.leader +
                                    ", proposed zxid=0x" + Long.toHexString(n.zxid) +
                                    ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
                        }

                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

                        if (termPredicate(recvset,
                                new Vote(proposedLeader, proposedZxid,
                                        logicalclock.get(), proposedEpoch))) {

                            // Verify if there is any change in the proposed leader
                            while((n = recvqueue.poll(finalizeWait,
                                    TimeUnit.MILLISECONDS)) != null){
                                if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                        proposedLeader, proposedZxid, proposedEpoch)){
                                    recvqueue.put(n);
                                    break;
                                }
                            }

                            /*
                             * This predicate is true once we don't read any new
                             * relevant message from the reception queue
                             */
                            if (n == null) {
                                self.setPeerState((proposedLeader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(proposedLeader,
                                                        proposedZxid,
                                                        logicalclock.get(),
                                                        proposedEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }
                        break;
                    case OBSERVING:
                        LOG.debug("Notification from observer: " + n.sid);
                        break;
                    case FOLLOWING:
                    case LEADING:
                        /*
                         * Consider all notifications from the same epoch
                         * together.
                         */
                        if(n.electionEpoch == logicalclock.get()){
                            recvset.put(n.sid, new Vote(n.leader,
                                                          n.zxid,
                                                          n.electionEpoch,
                                                          n.peerEpoch));
                           
                            if(ooePredicate(recvset, outofelection, n)) {
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(n.leader, 
                                        n.zxid, 
                                        n.electionEpoch, 
                                        n.peerEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }

                        /*
                         * Before joining an established ensemble, verify
                         * a majority is following the same leader.
                         */
                        outofelection.put(n.sid, new Vote(n.version,
                                                            n.leader,
                                                            n.zxid,
                                                            n.electionEpoch,
                                                            n.peerEpoch,
                                                            n.state));
           
                        if(ooePredicate(outofelection, outofelection, n)) {
                            synchronized(this){
                                logicalclock.set(n.electionEpoch);
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());
                            }
                            Vote endVote = new Vote(n.leader,
                                                    n.zxid,
                                                    n.electionEpoch,
                                                    n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                        break;
                    default:
                        LOG.warn("Notification state unrecognized: {} (n.state), {} (n.sid)",
                                n.state, n.sid);
                        break;
                    }
                } else {
                    if (!validVoter(n.leader)) {
                        LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid);
                    }
                    if (!validVoter(n.sid)) {
                        LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid);
                    }
                }
            }
            return null;
        } finally {
            try {
                if(self.jmxLeaderElectionBean != null){
                    MBeanRegistry.getInstance().unregister(
                            self.jmxLeaderElectionBean);
                }
            } catch (Exception e) {
                LOG.warn("Failed to unregister with JMX", e);
            }
            self.jmxLeaderElectionBean = null;
            LOG.debug("Number of connection processing threads: {}",
                    manager.getConnectionThreadCount());
        }
    }

未完待续.......

无微不至之Zookeeper源码深度讲解(2)-核心流程梳理

一.源码仓库:

二.流程分析:

你可能感兴趣的:(zookeeper源码分析)