zk Source Reading 40: The Learner Source Code

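Summary

Earlier posts explained that server roles are divided into Leader and Learner (the latter subdivided into Follower and Observer). This post walks through the Learner code first.
Learner is the parent class of Follower and Observer and defines the behavior the two share.
The post covers the following: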

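Inheritance: the subclasses Follower and Observer
Inner class: PacketInFlight, a record of a message that has been proposed (PROPOSAL) but not yet committed (COMMIT)
Fields
Methods
  Interaction with the leader
    findLeader: look up the leader and its address
    connectToLeader: open a connection to the leader
    registerWithLeader: register with the leader by sending a LearnerInfo
    syncWithLeader: at startup the learner first synchronizes its data with the leader
  Reading and writing packets
  Session validation
    validateSession: called when a client reconnects in cluster mode; the learner has to ask the Leader whether the session is still valid and have it reactivated
    revalidate: handle the REVALIDATE reply received from the leader
  Other
    request: transactional requests are all processed by the Leader, so a learner forwards the ones it receives
    ping: when the learner receives a PING from the leader, it replies with a snapshot of its LearnerSessionTracker
Reflections
Open questions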

Inheritance

[Figure: Learner's inheritance hierarchy]

This reflects the split of Learner into two roles, Follower and Observer.

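Inner class

PacketInFlight is the record format used for a proposal that the Leader has sent out but that has not yet been acknowledged by a majority.
The class name means "a packet still in flight".
A record is created when a Follower reads a PROPOSAL message and when an Observer reads an INFORM message.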

    static class PacketInFlight {
        TxnHeader hdr; // transaction header
        Record rec; // transaction body
    }

Fields

[Figure: fields of the Learner class]

The source, with annotations:

    QuorumPeer self; // the QuorumPeer (this server in the ensemble) that owns the learner
    LearnerZooKeeperServer zk; // the ZooKeeper server running in the learner role
    
    protected BufferedOutputStream bufferedOutput;
    
    protected Socket sock;

    protected InputArchive leaderIs; // input from the leader
    protected OutputArchive leaderOs; // output to the leader
    /** the protocol version of the leader */
    protected int leaderProtocolVersion = 0x01; // protocol version reported by the leader
    
    protected static final Logger LOG = LoggerFactory.getLogger(Learner.class);

    static final private boolean nodelay = System.getProperty("follower.nodelay", "true").equals("true"); // whether to set TCP_NODELAY on the socket to the leader (defaults to true)

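    // When a client (re)connects to a learner, the learner sends a REVALIDATE request to the leader.
    // Until the reply arrives, the connection is kept in this map of pending validations.
    final ConcurrentHashMap<Long, ServerCnxn> pendingRevalidations =
        new ConcurrentHashMap<Long, ServerCnxn>();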

Methods

Interaction with the leader

This mainly consists of discovering the leader, connecting to it, registering with it, and synchronizing data with it.

findLeader

Find out who the leader is: iterate over all servers in the ensemble and pick the one whose sid matches the sid in currentVote.

    protected InetSocketAddress findLeader() {
        InetSocketAddress addr = null;
        // Find the leader by id
        Vote current = self.getCurrentVote();
        for (QuorumServer s : self.getView().values()) {
            if (s.id == current.getId()) { // this server's sid equals the sid in the current vote
                // Ensure we have the leader's correct IP address before
                // attempting to connect.
                s.recreateSocketAddresses();
                addr = s.addr;
                break;
            }
        }
        if (addr == null) {
            LOG.warn("Couldn't find the leader with id = "
                    + current.getId());
        }
        return addr;
    }
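
The lookup above boils down to matching the sid in the current vote against the servers in the view. Here is a minimal sketch of that idea, assuming hypothetical stand-in types (`Server` and a plain `Map` as the view) rather than ZooKeeper's own `QuorumServer` and `QuorumPeer`:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: Server and the Map-based view are hypothetical stand-ins,
// not ZooKeeper's QuorumServer / QuorumPeer classes.
class FindLeaderSketch {
    static final class Server {
        final long id;
        final String addr;

        Server(long id, String addr) {
            this.id = id;
            this.addr = addr;
        }
    }

    // Returns the address of the server whose id matches the voted leader id,
    // or null when no server in the view has that id.
    static String findLeader(Map<Long, Server> view, long votedLeaderId) {
        for (Server s : view.values()) {
            if (s.id == votedLeaderId) {
                return s.addr;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<Long, Server> view = new LinkedHashMap<>();
        view.put(1L, new Server(1L, "10.0.0.1:2888"));
        view.put(2L, new Server(2L, "10.0.0.2:2888"));
        System.out.println(findLeader(view, 2L)); // address of server 2
        System.out.println(findLeader(view, 9L)); // null: unknown sid
    }
}
```

As in the real method, a null result is left for the caller to handle; Learner#findLeader only logs a warning in that case.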

connectToLeader

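Opens the connection to the leader. The socket's read timeout is tickTime * initLimit, each connect attempt times out after tickTime * syncLimit, and up to five attempts are made, one second apart.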

    protected void connectToLeader(InetSocketAddress addr) 
    throws IOException, ConnectException, InterruptedException {
        sock = new Socket();        
        sock.setSoTimeout(self.tickTime * self.initLimit);
        for (int tries = 0; tries < 5; tries++) {
            try {
                sock.connect(addr, self.tickTime * self.syncLimit);
                sock.setTcpNoDelay(nodelay);
                break;
            } catch (IOException e) {
                if (tries == 4) {
                    LOG.error("Unexpected exception",e);
                    throw e;
                } else {
                    LOG.warn("Unexpected exception, tries="+tries+
                            ", connecting to " + addr,e);
                    sock = new Socket();
                    sock.setSoTimeout(self.tickTime * self.initLimit);
                }
            }
            Thread.sleep(1000);
        }
        leaderIs = BinaryInputArchive.getArchive(new BufferedInputStream(
                sock.getInputStream()));
        bufferedOutput = new BufferedOutputStream(sock.getOutputStream());
        leaderOs = BinaryOutputArchive.getArchive(bufferedOutput);
    }   
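
The retry structure of connectToLeader (a bounded number of attempts, a pause after each failure, and the last exception rethrown) can be isolated from the socket details. This is a sketch under assumptions: `Attempt` is a hypothetical callback standing in for `sock.connect(...)`, and the attempt count and pause are parameters rather than the hard-coded 5 and 1000 ms.

```java
import java.io.IOException;

// Sketch only: Attempt is a hypothetical callback standing in for sock.connect(...).
class ConnectRetrySketch {
    interface Attempt {
        void run() throws IOException;
    }

    // Tries up to maxTries times, pausing after each failed attempt.
    // Returns the number of attempts used; rethrows the last IOException.
    static int connectWithRetry(Attempt attempt, int maxTries, long pauseMillis)
            throws IOException, InterruptedException {
        if (maxTries <= 0) {
            throw new IllegalArgumentException("maxTries must be positive");
        }
        for (int tries = 0; ; tries++) {
            try {
                attempt.run();
                return tries + 1;
            } catch (IOException e) {
                if (tries == maxTries - 1) {
                    throw e; // out of attempts
                }
            }
            Thread.sleep(pauseMillis);
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Fails twice, then succeeds on the third attempt.
        int used = connectWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("connection refused");
            }
        }, 5, 10L);
        System.out.println("connected after " + used + " attempts");
    }
}
```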

registerWithLeader

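After connecting to the leader, the learner performs a handshake. The parameter indicates whether a following or an observing connection is being registered.
During registration the learner sends its basic information, called LearnerInfo, to the leader.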

    protected long registerWithLeader(int pktType) throws IOException{ // 5. register this learner with the leader
        /*
         * Send follower info, including last zxid and sid
         */
        long lastLoggedZxid = self.getLastLoggedZxid();
        QuorumPacket qp = new QuorumPacket();                
        qp.setType(pktType); // the type is Leader.FOLLOWERINFO or Leader.OBSERVERINFO
        qp.setZxid(ZxidUtils.makeZxid(self.getAcceptedEpoch(), 0));
        
        /*
         * Add sid to payload
         */
        LearnerInfo li = new LearnerInfo(self.getId(), 0x10000);
        ByteArrayOutputStream bsid = new ByteArrayOutputStream();
        BinaryOutputArchive boa = BinaryOutputArchive.getArchive(bsid);
        boa.writeRecord(li, "LearnerInfo"); // serialize this learner's info for the leader
        qp.setData(bsid.toByteArray());
        
        writePacket(qp, true); // send the LearnerInfo packet
        readPacket(qp); // read the leader's reply (newer versions send LEADERINFO, which carries the leader's state)
        final long newEpoch = ZxidUtils.getEpochFromZxid(qp.getZxid());
        if (qp.getType() == Leader.LEADERINFO) { // the leader speaks the newer protocol
            // we are connected to a 1.0 server so accept the new epoch and read the next packet
            leaderProtocolVersion = ByteBuffer.wrap(qp.getData()).getInt();
            byte epochBytes[] = new byte[4];
            final ByteBuffer wrappedEpochBytes = ByteBuffer.wrap(epochBytes);
            if (newEpoch > self.getAcceptedEpoch()) {
                wrappedEpochBytes.putInt((int)self.getCurrentEpoch());
                self.setAcceptedEpoch(newEpoch);
            } else if (newEpoch == self.getAcceptedEpoch()) {
                // since we have already acked an epoch equal to the leaders, we cannot ack
                // again, but we still need to send our lastZxid to the leader so that we can
                // sync with it if it does assume leadership of the epoch.
                // the -1 indicates that this reply should not count as an ack for the new epoch
                wrappedEpochBytes.putInt(-1);
            } else {
                throw new IOException("Leaders epoch, " + newEpoch + " is less than accepted epoch, " + self.getAcceptedEpoch());
            }
            QuorumPacket ackNewEpoch = new QuorumPacket(Leader.ACKEPOCH, lastLoggedZxid, epochBytes, null); // 8. after receiving the leader's state, reply with ACKEPOCH
            writePacket(ackNewEpoch, true);
            return ZxidUtils.makeZxid(newEpoch, 0);
        } else { // the leader speaks the old protocol (kept for compatibility)
            if (newEpoch > self.getAcceptedEpoch()) {
                self.setAcceptedEpoch(newEpoch);
            }
            if (qp.getType() != Leader.NEWLEADER) {
                LOG.error("First packet should have been NEWLEADER");
                throw new IOException("First packet should have been NEWLEADER");
            }
            return qp.getZxid();
        }
    } 
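
Two details of the handshake are easy to miss in the full listing: the epoch lives in the high 32 bits of a zxid, and the learner's ACKEPOCH payload depends on how the leader's epoch compares with the epoch the learner has already accepted. The sketch below isolates both. It assumes the usual zxid layout (epoch in the high 32 bits, counter in the low 32 bits); the method names are illustrative, and `IllegalStateException` stands in for the `IOException` thrown by the real method.

```java
// Sketch only: mirrors the epoch comparison in registerWithLeader.
// Assumes a zxid holds the epoch in its high 32 bits and a counter in its low 32 bits.
class EpochHandshakeSketch {
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    static long epochOf(long zxid) {
        return zxid >> 32;
    }

    // Value written into the ACKEPOCH payload.
    static int ackEpochPayload(long newEpoch, long acceptedEpoch, long currentEpoch) {
        if (newEpoch > acceptedEpoch) {
            return (int) currentEpoch; // accept the new epoch and report our current one
        } else if (newEpoch == acceptedEpoch) {
            return -1; // already acked this epoch: the reply must not count as an ack
        }
        throw new IllegalStateException("Leader's epoch " + newEpoch
                + " is less than accepted epoch " + acceptedEpoch);
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(makeZxid(5, 0))); // 500000000
        System.out.println(epochOf(makeZxid(5, 7)));          // 5
        System.out.println(ackEpochPayload(6, 5, 5));         // 5
        System.out.println(ackEpochPayload(5, 5, 5));         // -1
    }
}
```

In either accepted case the ACKEPOCH packet still carries lastLoggedZxid, which is what the leader later uses to pick a synchronization mode.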

syncWithLeader

At startup the learner first synchronizes its data with the leader.

    protected void syncWithLeader(long newLeaderZxid) throws IOException, InterruptedException{ // synchronize with the leader
        QuorumPacket ack = new QuorumPacket(Leader.ACK, 0, null, null);
        QuorumPacket qp = new QuorumPacket();
        long newEpoch = ZxidUtils.getEpochFromZxid(newLeaderZxid);
        
        readPacket(qp);   
        LinkedList<Long> packetsCommitted = new LinkedList<Long>();
        LinkedList<PacketInFlight> packetsNotCommitted = new LinkedList<PacketInFlight>(); // proposals received but not yet committed
        synchronized (zk) {
            if (qp.getType() == Leader.DIFF) { // DIFF: synchronize with the leader incrementally
                LOG.info("Getting a diff from the leader 0x" + Long.toHexString(qp.getZxid()));                
            }
            else if (qp.getType() == Leader.SNAP) { // SNAP: copy a full image of the leader's data into local memory
                LOG.info("Getting a snapshot from leader");
                // The leader is going to dump the database
                // clear our own database and read
                zk.getZKDatabase().clear();
                zk.getZKDatabase().deserializeSnapshot(leaderIs); // load the image sent by the leader into local memory
                String signature = leaderIs.readString("signature");
                if (!signature.equals("BenWasHere")) { // verify the signature
                    LOG.error("Missing signature. Got " + signature);
                    throw new IOException("Missing signature");                   
                }
            } else if (qp.getType() == Leader.TRUNC) { // TRUNC: roll the log back to the zxid sent by the leader
                //we need to truncate the log to the lastzxid of the leader
                LOG.warn("Truncating log to get in sync with the leader 0x"
                        + Long.toHexString(qp.getZxid()));
                boolean truncated=zk.getZKDatabase().truncateLog(qp.getZxid());
                if (!truncated) {
                    // not able to truncate the log
                    LOG.error("Not able to truncate the log "
                            + Long.toHexString(qp.getZxid()));
                    System.exit(13);
                }

            }
            else {
                LOG.error("Got unexpected packet from leader "
                        + qp.getType() + " exiting ... " );
                System.exit(13);

            }
            zk.getZKDatabase().setlastProcessedZxid(qp.getZxid());
            zk.createSessionTracker();            
            
            long lastQueued = 0;

            // in V1.0 we take a snapshot when we get the NEWLEADER message, but in pre V1.0
            // we take the snapshot at the UPDATE, since V1.0 also gets the UPDATE (after the NEWLEADER)
            // we need to make sure that we don't take the snapshot twice.
            boolean snapshotTaken = false; // whether a snapshot has already been taken (set when NEWLEADER arrives)
            // we are now going to start getting transactions to apply followed by an UPTODATE
            outerLoop:
            while (self.isRunning()) { // keep reading packets from the leader until UPTODATE signals that sync is complete
                readPacket(qp);
                switch(qp.getType()) {
                case Leader.PROPOSAL: // received a proposal
                    PacketInFlight pif = new PacketInFlight();
                    pif.hdr = new TxnHeader();
                    pif.rec = SerializeUtils.deserializeTxn(qp.getData(), pif.hdr);
                    if (pif.hdr.getZxid() != lastQueued + 1) {
                        LOG.warn("Got zxid 0x"
                                + Long.toHexString(pif.hdr.getZxid())
                                + " expected 0x"
                                + Long.toHexString(lastQueued + 1));
                    }
                    lastQueued = pif.hdr.getZxid();
                    packetsNotCommitted.add(pif); // not committed yet, so keep it in the not-committed queue
                    break;
                case Leader.COMMIT: // received a COMMIT
                    if (!snapshotTaken) { // no snapshot taken yet
                        pif = packetsNotCommitted.peekFirst();
                        if (pif.hdr.getZxid() != qp.getZxid()) {
                            LOG.warn("Committing " + qp.getZxid() + ", but next proposal is " + pif.hdr.getZxid());
                        } else {
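                            zk.processTxn(pif.hdr, pif.rec); // apply the transaction directly
                            packetsNotCommitted.remove(); // remove it from the not-committed records
                        }
                    } else {
                        packetsCommitted.add(qp.getZxid()); // snapshot already taken: remember the commit, it is replayed after zk.startup()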
                    }
                    break;
                case Leader.INFORM: // only observers receive INFORM during sync
                    /*
                     * Only observer get this type of packet. We treat this
                     * as receiving PROPOSAL and COMMMIT.
                     */
                    PacketInFlight packet = new PacketInFlight();
                    packet.hdr = new TxnHeader();
                    packet.rec = SerializeUtils.deserializeTxn(qp.getData(), packet.hdr);
                    // Log warning message if txn comes out-of-order
                    if (packet.hdr.getZxid() != lastQueued + 1) {
                        LOG.warn("Got zxid 0x"
                                + Long.toHexString(packet.hdr.getZxid())
                                + " expected 0x"
                                + Long.toHexString(lastQueued + 1));
                    }
                    lastQueued = packet.hdr.getZxid();
                    if (!snapshotTaken) { // no snapshot yet: apply the transaction directly
                        // Apply to db directly if we haven't taken the snapshot
                        zk.processTxn(packet.hdr, packet.rec);
                    } else {
                        packetsNotCommitted.add(packet);
                        packetsCommitted.add(qp.getZxid());
                    }
                    break;
                case Leader.UPTODATE: // a majority has acknowledged the leader and this learner is in sync: leave the loop
                    if (!snapshotTaken) { // true for the pre v1.0 case
                        zk.takeSnapshot();
                        self.setCurrentEpoch(newEpoch);
                    }
                    self.cnxnFactory.setZooKeeperServer(zk);                
                    break outerLoop;
                case Leader.NEWLEADER: // it will be NEWLEADER in v1.0
                    // Create updatingEpoch file and remove it after current
                    // epoch is set. QuorumPeer.loadDataBase() uses this file to
                    // detect the case where the server was terminated after
                    // taking a snapshot but before setting the current epoch.
                    File updating = new File(self.getTxnFactory().getSnapDir(),
                                        QuorumPeer.UPDATING_EPOCH_FILENAME);
                    if (!updating.exists() && !updating.createNewFile()) {
                        throw new IOException("Failed to create " +
                                              updating.toString());
                    }
                    zk.takeSnapshot(); // take a snapshot, then set the current epoch
                    self.setCurrentEpoch(newEpoch);
                    if (!updating.delete()) { // remove the updatingEpoch file
                        throw new IOException("Failed to delete " +
                                              updating.toString());
                    }
                    snapshotTaken = true; // the snapshot has been taken
                    writePacket(new QuorumPacket(Leader.ACK, newLeaderZxid, null, null), true); // reply to NEWLEADER with an ACK
                    break;
                }
            }
        }
        ack.setZxid(ZxidUtils.makeZxid(newEpoch, 0));
        writePacket(ack, true); // finally send one more ACK
        sock.setSoTimeout(self.tickTime * self.syncLimit);
        zk.startup(); // start the server
        /*
         * Update the election vote here to ensure that all members of the
         * ensemble report the same vote to new servers that start up and
         * send leader election notifications to the ensemble.
         * 
         * @see https://issues.apache.org/jira/browse/ZOOKEEPER-1732
         */
        self.updateElectionVote(newEpoch); // update the election vote (the fix for ZOOKEEPER-1732)

        // We need to log the stuff that came in between the snapshot and the uptodate
        if (zk instanceof FollowerZooKeeperServer) { // follower case
            FollowerZooKeeperServer fzk = (FollowerZooKeeperServer)zk;
            for(PacketInFlight p: packetsNotCommitted) {
                fzk.logRequest(p.hdr, p.rec);
            }
            for(Long zxid: packetsCommitted) {
                fzk.commit(zxid); // commit
            }
        } else if (zk instanceof ObserverZooKeeperServer) {
            // Similar to follower, we need to log requests between the snapshot
            // and UPTODATE
            ObserverZooKeeperServer ozk = (ObserverZooKeeperServer) zk;
            for (PacketInFlight p : packetsNotCommitted) {
                Long zxid = packetsCommitted.peekFirst();
                if (p.hdr.getZxid() != zxid) {
                    // log warning message if there is no matching commit
                    // old leader send outstanding proposal to observer
                    LOG.warn("Committing " + Long.toHexString(zxid)
                            + ", but next proposal is "
                            + Long.toHexString(p.hdr.getZxid()));
                    continue;
                }
                packetsCommitted.remove();
                Request request = new Request(null, p.hdr.getClientId(),
                        p.hdr.getCxid(), p.hdr.getType(), null, null);
                request.txn = p.rec;
                request.hdr = p.hdr;
                ozk.commitRequest(request);
            }
        } else {
            // New server type need to handle in-flight packets
            throw new UnsupportedOperationException("Unknown server type");
        }
    }

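This needs to be read together with LearnerHandler#run on the leader side. On the learner side, the main logic is:

1. In registerWithLeader, the learner answers the leader's LEADERINFO with an ACKEPOCH that carries its own lastLoggedZxid.
2. Based on that lastLoggedZxid, the leader tells the learner which synchronization mode to use:
  DIFF, SNAP, or a TRUNC that first rolls the learner back to a given zxid.
3. Once the mode is settled, the leader sends the learner the follow-up synchronization packets:
PROPOSAL (a proposal)
COMMIT (commit, for Followers)
INFORM (notification, for Observers)
UPTODATE (a majority of servers have finished synchronizing, so the learner can start serving)
NEWLEADER (the leader tells the learner that the synchronization data has all been sent)

Some parts of this code are still unclear to me; they are collected under Reflections and Open questions below.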
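
The bookkeeping that syncWithLeader does with snapshotTaken, packetsNotCommitted and packetsCommitted is easier to follow on a reduced model. The sketch below is a simplification under assumptions: packets carry only a type and a zxid, INFORM is left out, "applying" a transaction just records its zxid, and a null check is added where the original would fail on an empty queue. It is not the ZooKeeper implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

// Sketch only: a reduced model of the PROPOSAL / COMMIT handling in syncWithLeader.
class SyncLoopSketch {
    enum Type { PROPOSAL, COMMIT, NEWLEADER, UPTODATE }

    static final class Packet {
        final Type type;
        final long zxid;

        Packet(Type type, long zxid) {
            this.type = type;
            this.zxid = zxid;
        }
    }

    static final class Result {
        final List<Long> applied = new ArrayList<>();             // applied during the loop
        final LinkedList<Long> notCommitted = new LinkedList<>(); // proposals still waiting
        final LinkedList<Long> committed = new LinkedList<>();    // commits deferred until after startup
        boolean snapshotTaken;
    }

    static Result replay(List<Packet> fromLeader) {
        Result r = new Result();
        for (Packet p : fromLeader) {
            switch (p.type) {
            case PROPOSAL:
                r.notCommitted.add(p.zxid);
                break;
            case COMMIT:
                if (!r.snapshotTaken) {
                    // Before the snapshot: apply the head proposal if it matches.
                    Long head = r.notCommitted.peekFirst();
                    if (head != null && head == p.zxid) {
                        r.applied.add(r.notCommitted.remove());
                    }
                } else {
                    // After the snapshot: only remember the commit.
                    r.committed.add(p.zxid);
                }
                break;
            case NEWLEADER:
                r.snapshotTaken = true; // the real code takes the snapshot and ACKs here
                break;
            case UPTODATE:
                return r;
            }
        }
        return r;
    }

    public static void main(String[] args) {
        Result r = replay(Arrays.asList(
                new Packet(Type.PROPOSAL, 1), new Packet(Type.COMMIT, 1),
                new Packet(Type.PROPOSAL, 2), new Packet(Type.NEWLEADER, 0),
                new Packet(Type.COMMIT, 2), new Packet(Type.UPTODATE, 0)));
        System.out.println("applied=" + r.applied
                + " notCommitted=" + r.notCommitted
                + " committed=" + r.committed);
    }
}
```

In this model a COMMIT only lands in the deferred list when it arrives after NEWLEADER and before UPTODATE. Whether the leader ever produces that ordering is a separate question about LearnerHandler#run.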

Reading and writing packets

The code is pasted as-is:

    void writePacket(QuorumPacket pp, boolean flush) throws IOException { // send a packet to the leader; the record is written with the tag "packet"
        synchronized (leaderOs) {
            if (pp != null) {
                leaderOs.writeRecord(pp, "packet");
            }
            if (flush) {
                bufferedOutput.flush();
            }
        }
    }

    void readPacket(QuorumPacket pp) throws IOException { // read a packet from the leader
        synchronized (leaderIs) {
            leaderIs.readRecord(pp, "packet");
        }
        long traceMask = ZooTrace.SERVER_PACKET_TRACE_MASK;
        if (pp.getType() == Leader.PING) {
            traceMask = ZooTrace.SERVER_PING_TRACE_MASK;
        }
        if (LOG.isTraceEnabled()) {
            ZooTrace.logQuorumPacket(LOG, traceMask, 'i', pp);
        }
    }
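
The flush parameter of writePacket matters because the output goes through a BufferedOutputStream: without a flush, small packets can sit in the buffer and never reach the leader. A minimal sketch of that effect, assuming a ByteArrayOutputStream as a stand-in for the socket and an int as a stand-in for the packet:

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch only: a ByteArrayOutputStream stands in for the socket to the leader.
class FlushSketch {
    private final ByteArrayOutputStream wire = new ByteArrayOutputStream();
    private final BufferedOutputStream buffered = new BufferedOutputStream(wire);
    private final DataOutputStream out = new DataOutputStream(buffered);

    // Writes a 4-byte "packet" and optionally pushes the buffer to the wire.
    void writePacket(int type, boolean flush) throws IOException {
        synchronized (out) {
            out.writeInt(type);
            if (flush) {
                buffered.flush();
            }
        }
    }

    int bytesOnWire() {
        return wire.size();
    }

    public static void main(String[] args) throws IOException {
        FlushSketch s = new FlushSketch();
        s.writePacket(1, false);
        System.out.println(s.bytesOnWire()); // 0: still buffered
        s.writePacket(2, true);
        System.out.println(s.bytesOnWire()); // 8: both packets flushed
    }
}
```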

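Session validation

validateSession

Called when a client reconnects in cluster mode. The learner cannot decide by itself whether the session is still valid, so it sends a request to the Leader, which validates and reactivates the session.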

    void validateSession(ServerCnxn cnxn, long clientId, int timeout)
            throws IOException {
        LOG.info("Revalidating client: 0x" + Long.toHexString(clientId));
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);
        dos.writeLong(clientId); // write the client (session) id
        dos.writeInt(timeout);
        dos.close();
        QuorumPacket qp = new QuorumPacket(Leader.REVALIDATE, -1, baos
                .toByteArray(), null); // build the REVALIDATE packet
        pendingRevalidations.put(clientId, cnxn); // validation requested but no result yet: remember it in the map
        if (LOG.isTraceEnabled()) {
            ZooTrace.logTraceMessage(LOG,
                                     ZooTrace.SESSION_TRACE_MASK,
                                     "To validate session 0x"
                                     + Long.toHexString(clientId));
        }
        writePacket(qp, true); // send it
    }     

revalidate

Handles the REVALIDATE reply received from the leader.

    protected void revalidate(QuorumPacket qp) throws IOException {
        ByteArrayInputStream bis = new ByteArrayInputStream(qp
                .getData());
        DataInputStream dis = new DataInputStream(bis);
        long sessionId = dis.readLong();
        boolean valid = dis.readBoolean();
        ServerCnxn cnxn = pendingRevalidations
        .remove(sessionId); // validation finished: remove it from the map
        if (cnxn == null) {
            LOG.warn("Missing session 0x"
                    + Long.toHexString(sessionId)
                    + " for validation");
        } else {
            zk.finishSessionInit(cnxn, valid); // finish initializing the session
        }
        if (LOG.isTraceEnabled()) {
            ZooTrace.logTraceMessage(LOG,
                    ZooTrace.SESSION_TRACE_MASK,
                    "Session 0x" + Long.toHexString(sessionId)
                    + " is valid: " + valid);
        }
    }
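
validateSession and revalidate are two halves of one round trip: the request payload is a long session id plus an int timeout, and the reply is read back as a long session id plus a boolean. The sketch below models that round trip. Several parts are assumptions: a `String` stands in for `ServerCnxn`, `leaderReply` is a hypothetical stand-in for the leader side, and the return value replaces the call to `finishSessionInit`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: String stands in for ServerCnxn, leaderReply() for the leader side.
class RevalidateSketch {
    final ConcurrentHashMap<Long, String> pending = new ConcurrentHashMap<>();

    // Builds the REVALIDATE request payload and records the pending connection.
    byte[] validateSession(String cnxn, long sessionId, int timeout) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);
        dos.writeLong(sessionId);
        dos.writeInt(timeout);
        dos.close();
        pending.put(sessionId, cnxn);
        return baos.toByteArray();
    }

    // Hypothetical leader reply: session id followed by the verdict.
    static byte[] leaderReply(long sessionId, boolean valid) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);
        dos.writeLong(sessionId);
        dos.writeBoolean(valid);
        dos.close();
        return baos.toByteArray();
    }

    // Returns "<cnxn> valid" / "<cnxn> expired", or null if nothing was pending.
    String revalidate(byte[] reply) throws IOException {
        DataInputStream dis = new DataInputStream(new ByteArrayInputStream(reply));
        long sessionId = dis.readLong();
        boolean valid = dis.readBoolean();
        String cnxn = pending.remove(sessionId);
        return cnxn == null ? null : cnxn + (valid ? " valid" : " expired");
    }

    public static void main(String[] args) throws IOException {
        RevalidateSketch learner = new RevalidateSketch();
        byte[] request = learner.validateSession("cnxn-1", 0x1234L, 30000);
        System.out.println(request.length); // 12 bytes: 8 for the id, 4 for the timeout
        System.out.println(learner.revalidate(leaderReply(0x1234L, true)));
    }
}
```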

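Other

request

Transactional requests are all processed by the Leader, so when a learner receives one it forwards it to the Leader through this method.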

    void request(Request request) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream oa = new DataOutputStream(baos);
        oa.writeLong(request.sessionId);
        oa.writeInt(request.cxid);
        oa.writeInt(request.type);
        if (request.request != null) {
            request.request.rewind();
            int len = request.request.remaining();
            byte b[] = new byte[len];
            request.request.get(b);
            request.request.rewind();
            oa.write(b);
        }
        oa.close();
        QuorumPacket qp = new QuorumPacket(Leader.REQUEST, -1, baos
                .toByteArray(), request.authInfo);
        writePacket(qp, true);
    }
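
The payload built by request is a fixed 16-byte header (session id, cxid, type) followed by the raw request bytes, and the ByteBuffer is rewound both before and after copying so that the forward does not disturb its position for later readers. A minimal sketch of that encoding, with the method name and the plain parameters as illustrative assumptions:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

// Sketch only: encodes the same fields as Learner#request, without QuorumPacket.
class ForwardRequestSketch {
    static byte[] encode(long sessionId, int cxid, int type, ByteBuffer body)
            throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream oa = new DataOutputStream(baos);
        oa.writeLong(sessionId);
        oa.writeInt(cxid);
        oa.writeInt(type);
        if (body != null) {
            body.rewind();                         // copy from the start, whatever the position was
            byte[] b = new byte[body.remaining()];
            body.get(b);
            body.rewind();                         // leave the buffer readable from the start again
            oa.write(b);
        }
        oa.close();
        return baos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        ByteBuffer body = ByteBuffer.wrap(new byte[] {1, 2, 3});
        body.get(); // simulate an earlier partial read
        byte[] payload = encode(7L, 1, 5, body);
        System.out.println(payload.length);  // 19 = 16-byte header + 3 body bytes
        System.out.println(body.position()); // 0
    }
}
```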

ping

When the learner receives a PING from the leader, it replies with a snapshot of its LearnerSessionTracker.

    protected void ping(QuorumPacket qp) throws IOException {
        // Send back the ping with our session data
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(bos);
        HashMap<Long, Integer> touchTable = zk
                .getTouchSnapshot(); // session id -> timeout, taken from the LearnerSessionTracker
        for (Entry<Long, Integer> entry : touchTable.entrySet()) {
            dos.writeLong(entry.getKey());
            dos.writeInt(entry.getValue());
        }
        qp.setData(bos.toByteArray());
        writePacket(qp, true);
    }
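
The ping reply is a flat sequence of (long session id, int timeout) pairs, 12 bytes per session. The sketch below shows the encoding together with one way a receiver could decode it; `decode` is a hypothetical counterpart written for illustration, not taken from the leader's code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: encode() follows Learner#ping, decode() is a hypothetical receiver.
class PingPayloadSketch {
    static byte[] encode(Map<Long, Integer> touchTable) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(bos);
        for (Map.Entry<Long, Integer> entry : touchTable.entrySet()) {
            dos.writeLong(entry.getKey());  // session id
            dos.writeInt(entry.getValue()); // session timeout
        }
        dos.flush();
        return bos.toByteArray();
    }

    static Map<Long, Integer> decode(byte[] data) throws IOException {
        DataInputStream dis = new DataInputStream(new ByteArrayInputStream(data));
        Map<Long, Integer> sessions = new LinkedHashMap<>();
        while (dis.available() > 0) {
            long sessionId = dis.readLong();
            int timeout = dis.readInt();
            sessions.put(sessionId, timeout);
        }
        return sessions;
    }

    public static void main(String[] args) throws IOException {
        Map<Long, Integer> touched = new LinkedHashMap<>();
        touched.put(0x10L, 30000);
        touched.put(0x11L, 15000);
        byte[] payload = encode(touched);
        System.out.println(payload.length);  // 24 = 2 sessions * 12 bytes
        System.out.println(decode(payload).equals(touched)); // true
    }
}
```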

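Reflections

More on the message types used in data synchronization

[Figure: message types used during data synchronization]

A few additional points:

1. What is the difference between the DIFF and SNAP synchronization modes?
As an analogy, say the LEADER's state is A and the LEARNER's state is B.
DIFF: the LEADER tells the LEARNER to keep receiving PROPOSAL and COMMIT packets, so that the LEARNER moves step by step from B to B1, to B2, and so on until it reaches A.
SNAP: the LEADER tells the LEARNER directly: my state is A, copy it and you are done.

2. Once a DIFF, SNAP or TRUNC has been received, is synchronization finished in one shot?
No.
DIFF: it is followed by a series of PROPOSAL and COMMIT packets (or INFORM packets for an Observer) that bring the learner up to date step by step.
TRUNC: after rolling back to the given zxid, it is likewise followed by a series of PROPOSAL and COMMIT packets (or INFORM packets for an Observer).
SNAP: it is followed by the serialized content of the leader's db.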

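Reflections on syncWithLeader during data synchronization

There are quite a few points.

1. What does snapshotTaken mean?
It records whether NEWLEADER has been received. At that point the learner has received all the synchronization data from the leader and takes a snapshot.

2. What does packetsNotCommitted mean?
It holds the packets for which a PROPOSAL has been received but no COMMIT yet.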

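How session validation and reconnection differ between standalone mode and a Learner in cluster mode

Standalone mode: post 37 of this series covered the client reconnection mechanism.

ZooKeeperServer#reopenSession
ZooKeeperServer#revalidateSession

It checks that the session id and password match and, if they do, asks the sessionTracker whether the session has timed out.

Cluster mode: a Learner validating a session

Learner#validateSession
sends a REVALIDATE request to the leader and lets the leader decide whether the session has expired.

Learner#revalidate receives the leader's REVALIDATE reply and completes the session validation.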

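Where the Leader's role shows through

As described earlier, the Leader is the core of a ZooKeeper ensemble. Its main duties are:

  (1) It is the sole scheduler and processor of transactional requests, which guarantees that the ensemble processes transactions in order.
  (2) It coordinates the servers inside the ensemble.

For example, as described above, a learner cannot decide by itself whether a session is valid; it has to send a request to the leader and wait for the reply.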

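Open questions

Questions about syncWithLeader during data synchronization

1. What is packetsCommitted for?

2. When is the else branch inside case Leader.COMMIT taken?
According to LearnerHandler#run,

once the leader has sent NEWLEADER and a majority of learners have replied with ACK, it sends UPTODATE itself. If so, the learner never reaches the logic in that else branch.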

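Gripes

The code comments and the various compatibility paths make the code hard to follow.
I also could not find any state-machine diagram.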

References

Nothing matches this topic closely; a couple of loosely related sources:
http://www.cnblogs.com/leesf456/p/6139266.html
The book 《paxos到zk》
