Contents:
- Data synchronization and initialization (after leader election)
- Role-specific request processing (leader, follower, observer)
1. Data Synchronization and Initialization
After a leader is elected, the other roles can serve clients only once their data is in sync with the leader's.
Data synchronization between servers takes one of three basic forms:
- SNAP (snapshot: the leader ships its entire snapshot file)
- DIFF (the leader sends only the transactions the follower is missing)
- TRUNC (the follower truncates transactions the leader does not have)
1.1 Basic Flow
In step 3, the Follower replies to the Leader with an ACKEPOCH containing its current largest zxid; the Leader compares that zxid against its own minZxid and maxZxid.
The range [minZxid, maxZxid] corresponds to a transaction queue cached on the leader side.
In step 6, the NEWLEADER packet signals that synchronization is done, i.e. the Leader has sent the Follower everything that needed syncing.
How the three modes differ:
If the Follower's zxid is smaller than minZxid, the data gap between Leader and Follower is very large, so SNAP is used: the Follower simply receives the snapshot file the Leader sends.
If the Follower's zxid falls within [minZxid, maxZxid], DIFF is used: the Leader only needs to send the transactions in [zxid, maxZxid], which the Follower persists and applies to its in-memory database.
If the Follower's zxid is greater than maxZxid, TRUNC is used: the Follower deletes its transaction log entries beyond maxZxid.
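As a minimal sketch of this three-way decision (in the real code it lives in LearnerHandler on the leader side, where minZxid/maxZxid come from ZKDatabase's getminCommittedLog()/getmaxCommittedLog(); chooseSyncMode and SyncMode below are illustrative names, and the real logic also special-cases an exact zxid match):

enum SyncMode { SNAP, DIFF, TRUNC }

// Illustrative: which sync mode the leader picks for a given follower zxid
static SyncMode chooseSyncMode(long followerZxid, long minZxid, long maxZxid) {
    if (followerZxid < minZxid) {
        return SyncMode.SNAP;   // too far behind: ship the full snapshot
    } else if (followerZxid <= maxZxid) {
        return SyncMode.DIFF;   // inside the cached window: send the missing txns
    } else {
        return SyncMode.TRUNC;  // ahead of the leader: truncate back to maxZxid
    }
}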
1.2 Class Overview
Learner class
Learner is the common parent of Follower and Observer. Its most important fields are leaderIs and leaderOs, the input and output streams of the connection to the Leader.
LearnerHandler class (extends ZooKeeperThread)
Created on the Leader side, one per connected learner, to handle the QuorumPacket traffic with that learner.
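For orientation, leaderIs and leaderOs are jute binary archives wrapped around the socket to the leader. A sketch of the wiring, simplified from Learner (setupLeaderStreams is an illustrative name; the real code does this while connecting to the leader):

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.net.Socket;
import org.apache.jute.BinaryInputArchive;
import org.apache.jute.BinaryOutputArchive;

static void setupLeaderStreams(Socket sock) throws IOException {
    // Input archive: readPacket deserializes QuorumPackets from the leader
    BinaryInputArchive leaderIs = BinaryInputArchive.getArchive(
            new BufferedInputStream(sock.getInputStream()));
    // Output archive: writePacket serializes QuorumPackets to the leader
    BinaryOutputArchive leaderOs = BinaryOutputArchive.getArchive(
            new BufferedOutputStream(sock.getOutputStream()));
}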
1.3 Detailed Walkthrough
In QuorumPeer, once FastLeaderElection has finished, the main loop in QuorumPeer's run method dispatches on the server's state:
try {
/*
* Main loop
*/
while (running) {
switch (getPeerState()) {
case LOOKING:
LOG.info("LOOKING");
if (Boolean.getBoolean("readonlymode.enabled")) {
LOG.info("Attempting to start ReadOnlyZooKeeperServer");
// Create read-only server but don't start it immediately
final ReadOnlyZooKeeperServer roZk = new ReadOnlyZooKeeperServer(
logFactory, this,
new ZooKeeperServer.BasicDataTreeBuilder(),
this.zkDb);
Thread roZkMgr = new Thread() {
public void run() {
try {
// lower-bound grace period to 2 secs
sleep(Math.max(2000, tickTime));
if (ServerState.LOOKING.equals(getPeerState())) {
roZk.startup();
}
} catch (InterruptedException e) {
LOG.info("Interrupted while attempting to start ReadOnlyZooKeeperServer, not started");
} catch (Exception e) {
LOG.error("FAILED to start ReadOnlyZooKeeperServer", e);
}
}
};
try {
roZkMgr.start();
setBCVote(null);
setCurrentVote(makeLEStrategy().lookForLeader());
} catch (Exception e) {
LOG.warn("Unexpected exception",e);
setPeerState(ServerState.LOOKING);
} finally {
// If the thread is in the grace period, interrupt
// to come out of waiting.
roZkMgr.interrupt();
roZk.shutdown();
}
} else {
try {
setBCVote(null);
setCurrentVote(makeLEStrategy().lookForLeader());
} catch (Exception e) {
LOG.warn("Unexpected exception", e);
setPeerState(ServerState.LOOKING);
}
}
break;
case OBSERVING:
try {
LOG.info("OBSERVING");
setObserver(makeObserver(logFactory));
observer.observeLeader();
} catch (Exception e) {
LOG.warn("Unexpected exception",e );
} finally {
observer.shutdown();
setObserver(null);
setPeerState(ServerState.LOOKING);
}
break;
case FOLLOWING:
try {
LOG.info("FOLLOWING");
setFollower(makeFollower(logFactory));
follower.followLeader();
} catch (Exception e) {
LOG.warn("Unexpected exception",e);
} finally {
follower.shutdown();
setFollower(null);
setPeerState(ServerState.LOOKING);
}
break;
case LEADING:
LOG.info("LEADING");
try {
setLeader(makeLeader(logFactory));
leader.lead();
setLeader(null);
} catch (Exception e) {
LOG.warn("Unexpected exception",e);
} finally {
if (leader != null) {
leader.shutdown("Forcing shutdown");
setLeader(null);
}
setPeerState(ServerState.LOOKING);
}
break;
}
}
} finally {
LOG.warn("QuorumPeer main thread exited");
try {
MBeanRegistry.getInstance().unregisterAll();
} catch (Exception e) {
LOG.warn("Failed to unregister with JMX", e);
}
jmxQuorumBean = null;
jmxLocalPeerBean = null;
}
The loop calls getPeerState() to read the server's current state. In the FOLLOWING state it invokes followLeader(): before it can serve clients, the Follower must first register itself with the Leader.
Registration consists of three main steps:
- Call connectToLeader to establish a connection to the Leader.
- Call registerWithLeader to register with the Leader, exchanging sid, zxid and epoch information; the Leader uses this to decide the synchronization mode.
- Call syncWithLeader to synchronize transaction data with the Leader, handling the SNAP/DIFF/TRUNC packets.
connectToLeader: creates a Socket connection to the Leader. The method is defined in Follower's parent class Learner and has a retry mechanism that attempts the connection up to 5 times. Once connected, the Leader creates a LearnerHandler dedicated to exchanging QuorumPacket messages with this Follower.
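A minimal sketch of that bounded retry, condensed from Learner.connectToLeader (leaderAddr and connectTimeout stand in for the real address and the tickTime-based timeout):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

static Socket connectWithRetry(InetSocketAddress leaderAddr, int connectTimeout)
        throws IOException, InterruptedException {
    Socket sock = new Socket();
    for (int tries = 0; tries < 5; tries++) {
        try {
            sock.connect(leaderAddr, connectTimeout);
            return sock;                 // connected: stop retrying
        } catch (IOException e) {
            if (tries == 4) throw e;     // fifth attempt failed: give up
            sock = new Socket();         // fresh socket for the next attempt
            Thread.sleep(1000);          // brief backoff between attempts
        }
    }
    return sock; // not reached; satisfies the compiler
}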
registerWithLeader: first sends a FOLLOWERINFO packet telling the Leader who this Follower is (its zxid and sid), then waits for the Leader's LEADERINFO packet to obtain the Leader's epoch and zxid, updating the Follower's own values to match (the Leader's information wins).
Finally it sends an ACKEPOCH packet, telling the Leader that this Follower is now aligned with the Leader's zxid (the exchange is sketched below).
syncWithLeader: synchronizes data with the Leader, i.e. brings the Leader's transactions over to the Follower.
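A simplified, follower-side sketch of that registration exchange (condensed from Learner.registerWithLeader; the real code also packs a LearnerInfo with the sid and protocol version into the FOLLOWERINFO data, and epoch bytes into the ACKEPOCH data):

QuorumPacket info = new QuorumPacket(Leader.FOLLOWERINFO, self.getLastLoggedZxid(), null, null);
writePacket(info, true);                 // 1. announce our last zxid (and sid)

QuorumPacket leaderInfo = new QuorumPacket();
readPacket(leaderInfo);                  // 2. LEADERINFO carries the leader's epoch/zxid
long newEpoch = ZxidUtils.getEpochFromZxid(leaderInfo.getZxid());
self.setAcceptedEpoch(newEpoch);         // adopt the leader's epoch

QuorumPacket ack = new QuorumPacket(Leader.ACKEPOCH, self.getLastLoggedZxid(), null, null);
writePacket(ack, true);                  // 3. acknowledge the new epoch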
1.3.1 Reading the Sync Packet
syncWithLeader first reads the synchronization packet; the key code is as follows:
QuorumPacket qp = new QuorumPacket();
long newEpoch = ZxidUtils.getEpochFromZxid(newLeaderZxid);
// In the DIFF case we don't need to do a snapshot because the transactions will sync on top of any existing snapshot
// For SNAP and TRUNC the snapshot is needed to save that history
boolean snapshotNeeded = true;
readPacket(qp);
// zxids of packets that have been committed
LinkedList<Long> packetsCommitted = new LinkedList<Long>();
// proposals received but not yet committed
LinkedList<PacketInFlight> packetsNotCommitted = new LinkedList<PacketInFlight>();
synchronized (zk) {
// In DIFF mode, no snapshot needs to be taken
if (qp.getType() == Leader.DIFF) {
LOG.info("Getting a diff from the leader 0x{}", Long.toHexString(qp.getZxid()));
snapshotNeeded = false;
}
else if (qp.getType() == Leader.SNAP) {
LOG.info("Getting a snapshot from leader 0x" + Long.toHexString(qp.getZxid()));
// The leader is going to dump the database
// clear our own database and read
// Clear the database (minZxid and maxZxid reset to 0) and rebuild the DataTree
zk.getZKDatabase().clear();
zk.getZKDatabase().deserializeSnapshot(leaderIs);
String signature = leaderIs.readString("signature");
if (!signature.equals("BenWasHere")) {
LOG.error("Missing signature. Got " + signature);
throw new IOException("Missing signature");
}
// Record the most recent zxid
zk.getZKDatabase().setlastProcessedZxid(qp.getZxid());
} else if (qp.getType() == Leader.TRUNC) {
//we need to truncate the log to the lastzxid of the leader
LOG.warn("Truncating log to get in sync with the leader 0x"
+ Long.toHexString(qp.getZxid()));
boolean truncated=zk.getZKDatabase().truncateLog(qp.getZxid());
if (!truncated) {
// not able to truncate the log
LOG.error("Not able to truncate the log "
+ Long.toHexString(qp.getZxid()));
System.exit(13);
}
zk.getZKDatabase().setlastProcessedZxid(qp.getZxid());
}
<1> SNAP: snapshot mode; the Leader transfers its complete database to the Follower.
<2> TRUNC: truncate mode; the Follower has more data than the Leader, so to stay consistent it must delete its surplus transactions.
<3> DIFF: diff mode; the Follower has fewer transactions than the Leader and needs to be topped up, so the Leader turns the missing transactions into PROPOSAL and COMMIT packets for the Follower to apply.
1.3.2 Handling Subsequent Packets
syncWithLeader then keeps processing QuorumPacket messages such as PROPOSAL, COMMIT and NEWLEADER. A PROPOSAL here is a write request the Leader sent while the sync was in progress; it is buffered in packetsNotCommitted.
outerLoop:
while (self.isRunning()) {
readPacket(qp);
switch(qp.getType()) {
// A write proposed during sync: queue it until its COMMIT arrives
case Leader.PROPOSAL:
PacketInFlight pif = new PacketInFlight();
pif.hdr = new TxnHeader();
pif.rec = SerializeUtils.deserializeTxn(qp.getData(), pif.hdr);
if (pif.hdr.getZxid() != lastQueued + 1) {
LOG.warn("Got zxid 0x"
+ Long.toHexString(pif.hdr.getZxid())
+ " expected 0x"
+ Long.toHexString(lastQueued + 1));
}
lastQueued = pif.hdr.getZxid();
packetsNotCommitted.add(pif);
break;
// Before NEWLEADER, apply the committed txn directly; afterwards just record its zxid
case Leader.COMMIT:
if (!writeToTxnLog) {
pif = packetsNotCommitted.peekFirst();
if (pif.hdr.getZxid() != qp.getZxid()) {
LOG.warn("Committing " + qp.getZxid() + ", but next proposal is " + pif.hdr.getZxid());
} else {
zk.processTxn(pif.hdr, pif.rec);
packetsNotCommitted.remove();
}
} else {
packetsCommitted.add(qp.getZxid());
}
break;
case Leader.INFORM:
/*
* Only observer get this type of packet. We treat this
* as receiving PROPOSAL and COMMIT.
*/
PacketInFlight packet = new PacketInFlight();
packet.hdr = new TxnHeader();
packet.rec = SerializeUtils.deserializeTxn(qp.getData(), packet.hdr);
// Log warning message if txn comes out-of-order
if (packet.hdr.getZxid() != lastQueued + 1) {
LOG.warn("Got zxid 0x"
+ Long.toHexString(packet.hdr.getZxid())
+ " expected 0x"
+ Long.toHexString(lastQueued + 1));
}
lastQueued = packet.hdr.getZxid();
if (!writeToTxnLog) {
// Apply to db directly if we haven't taken the snapshot
zk.processTxn(packet.hdr, packet.rec);
} else {
packetsNotCommitted.add(packet);
packetsCommitted.add(qp.getZxid());
}
break;
// UPTODATE: sync finished; start serving clients and leave the loop
case Leader.UPTODATE:
if (isPreZAB1_0) {
zk.takeSnapshot();
self.setCurrentEpoch(newEpoch);
}
self.cnxnFactory.setZooKeeperServer(zk);
break outerLoop;
case Leader.NEWLEADER: // Getting NEWLEADER here instead of in discovery
File updating = new File(self.getTxnFactory().getSnapDir(),
QuorumPeer.UPDATING_EPOCH_FILENAME);
if (!updating.exists() && !updating.createNewFile()) {
throw new IOException("Failed to create " +
updating.toString());
}
if (snapshotNeeded) {
zk.takeSnapshot();
}
self.setCurrentEpoch(newEpoch);
if (!updating.delete()) {
throw new IOException("Failed to delete " +
updating.toString());
}
writeToTxnLog = true;
isPreZAB1_0 = false;
writePacket(new QuorumPacket(Leader.ACK, newLeaderZxid, null, null), true);
break;
}
}
}
After that, the Leader sends a NEWLEADER packet, and the Follower replies to it with an ACK.
Finally, the Leader sends an UPTODATE packet to signal that synchronization is complete; at that point the Follower starts its server, breaks out of the loop, and wraps up the registration process.
1.3.3 Follower Main Flow
Follower is a subclass of Learner; its entry point is followLeader.
// Locate the elected leader
QuorumServer leaderServer = findLeader();
try {
// Connect to the leader (up to 5 attempts)
connectToLeader(leaderServer.addr, leaderServer.hostname);
// Register as a follower and obtain the new epoch zxid
long newEpochZxid = registerWithLeader(Leader.FOLLOWERINFO);
//check to see if the leader zxid is lower than ours
//this should never happen but is just a safety check
long newEpoch = ZxidUtils.getEpochFromZxid(newEpochZxid);
if (newEpoch < self.getAcceptedEpoch()) {
LOG.error("Proposed leader epoch " + ZxidUtils.zxidToString(newEpochZxid)
+ " is less than our accepted epoch " + ZxidUtils.zxidToString(self.getAcceptedEpoch()));
throw new IOException("Error: Epoch of leader is lower");
}
// Synchronize data with the leader
syncWithLeader(newEpochZxid);
QuorumPacket qp = new QuorumPacket();
while (this.isRunning()) {
readPacket(qp);
processPacket(qp);
}
At startup the Follower first synchronizes data with the Leader and then runs FollowerZooKeeperServer; alongside it, the while loop above keeps reading QuorumPacket messages and hands each one to processPacket.
processPacket handles the QuorumPackets the Leader sends over the quorum connection. It mainly responds to two types, PROPOSAL and COMMIT (there are also PING, COMMITANDACTIVATE and other packet types). A PROPOSAL carries a write transaction the Leader intends to execute; a COMMIT orders its commit. The Follower only lets a PROPOSAL's content take effect after receiving the corresponding COMMIT.
The same write transaction is thus executed once on the Leader and once on every Follower, keeping the cluster's data consistent.
case Leader.PROPOSAL:
TxnHeader hdr = new TxnHeader();
Record txn = SerializeUtils.deserializeTxn(qp.getData(), hdr);
if (hdr.getZxid() != lastQueued + 1) {
LOG.warn("Got zxid 0x"
+ Long.toHexString(hdr.getZxid())
+ " expected 0x"
+ Long.toHexString(lastQueued + 1));
}
lastQueued = hdr.getZxid();
fzk.logRequest(hdr, txn);
break;
case Leader.COMMIT:
fzk.commit(qp.getZxid());
break;
On PROPOSAL, the Follower calls FollowerZooKeeperServer's logRequest method; on COMMIT, it calls FollowerZooKeeperServer's commit method (both are sketched below).
PROPOSAL packet
The write-request packet the Leader sends to every follower in the cluster. When the Leader executes a write it must tell the Learners so that they execute it too, keeping the cluster's data consistent. PROPOSALs are processed strictly in order, which is one of ZooKeeper's core design principles.
COMMIT packet
Once the Leader considers a Proposal persisted by a majority of Followers and ready to execute, it sends a COMMIT packet telling each Follower it may commit that Proposal; the request finally reaches FinalRequestProcessor, which performs the write. This mechanism guarantees that every write is executed by a majority of the cluster's machines.
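A simplified sketch of how FollowerZooKeeperServer pairs the two packets (condensed from its logRequest and commit methods; the real code also checks that the committed zxid matches the head of pendingTxns and logs an error otherwise):

// Condensed from FollowerZooKeeperServer; error handling omitted
public void logRequest(TxnHeader hdr, Record txn) {
    Request request = new Request(null, hdr.getClientId(), hdr.getCxid(),
            hdr.getType(), null, null);
    request.hdr = hdr;
    request.txn = txn;
    request.zxid = hdr.getZxid();
    pendingTxns.add(request);               // hold it until the COMMIT arrives
    syncProcessor.processRequest(request);  // persist to the txn log, then ACK the leader
}

public void commit(long zxid) {
    Request request = pendingTxns.remove(); // must be the proposal with this zxid
    commitProcessor.commit(request);        // hand over for final application
}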
1.3.4 Observer Main Flow
Observer works much like Follower; the main difference is that it does not take part in leader election.
Observer's entry method is observeLeader. When QuorumPeer's state is OBSERVING, it starts the Observer and calls observeLeader.
observeLeader resembles Follower's followLeader: it first registers with the Leader, and after the transaction sync it enters the QuorumPacket processing loop, calling processPacket on each packet.
The Observer's processPacket is much simpler than the Follower's; its main job is handling INFORM packets, which carry the Leader's write commands.
The key difference is the INFORM message: when the Leader broadcasts a write transaction it sends Followers a PROPOSAL and waits for their acknowledgements, whereas it sends Observers an INFORM and requires no ACK in reply. A sketch of the INFORM handling follows.
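A simplified sketch of that handling (condensed from Observer.processPacket; an INFORM is deserialized just like a PROPOSAL but committed immediately, with no pending queue and no ACK):

case Leader.INFORM:
    // Deserialize the txn exactly as a PROPOSAL would be
    TxnHeader hdr = new TxnHeader();
    Record txn = SerializeUtils.deserializeTxn(qp.getData(), hdr);
    Request request = new Request(null, hdr.getClientId(), hdr.getCxid(),
            hdr.getType(), null, null);
    request.hdr = hdr;
    request.txn = txn;
    // Commit straight away: observers keep no pending queue and send no ACK
    ObserverZooKeeperServer obs = getObserverZooKeeperServer();
    obs.commitRequest(request);
    break;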