ZooKeeper (7) - Server Cluster Mode - Startup Flow - 1

Overview

The earlier post ZooKeeper (5) - Server Standalone Mode - Startup Flow analyzed the server startup sequence, but in cluster mode the flow after leader election was left uncovered. This post continues with the post-election data synchronization and the ZooKeeperServer startup.

Data Synchronization

1. Entry Point

After leader election in cluster mode, each server runs different logic depending on its state (LEADING/FOLLOWING/OBSERVING):

case FOLLOWING:
    ......
    LOG.info("FOLLOWING");
    setFollower(makeFollower(logFactory));
    follower.followLeader();
    ......
    break;
case LEADING:
    ......
    setLeader(makeLeader(logFactory));
    leader.lead();
    setLeader(null);
    ......
    break;
  • This post focuses on FOLLOWING and LEADING; OBSERVING is similar to FOLLOWING and comparatively simpler;

2. Leader.lead

void lead() throws IOException, InterruptedException {
    ......
    // load data from the on-disk transaction logs
    zk.loadData();
    ......
    // start the thread that accepts connection requests from new learners
    cnxAcceptor = new LearnerCnxAcceptor();
    cnxAcceptor.start();
    ......
    // getEpochToPropose is also called from LearnerHandler
    long epoch = getEpochToPropose(self.getId(), self.getAcceptedEpoch());
    ......
    // waitForEpochAck is also called from LearnerHandler
    waitForEpochAck(self.getId(), leaderStateSummary);
    ......
    // waitForNewLeaderAck is also called from LearnerHandler
    waitForNewLeaderAck(self.getId(), zk.getZxid(), LearnerType.PARTICIPANT);
    ......
    // start the ZooKeeperServer and set its state to RUNNING
    startZkServer();
    ......
}
  • zk.loadData() loads the DataTree from the on-disk transaction logs; see ZooKeeper (3) - Persistence. Pay particular attention to the PlayBackListener.onTxnLoaded callback registered by ZKDatabase;
  • cnxAcceptor.start() creates one LearnerHandler thread per follower for data synchronization with the leader; this is the key part;
  • getEpochToPropose/waitForEpochAck/waitForNewLeaderAck are the follower-leader interaction points during data synchronization; each requires acknowledgement from a quorum of servers, and calling them here effectively registers the leader's own vote. They are analyzed in detail under LearnerHandler below;
  • startZkServer() starts the ZooKeeperServer; see ZooKeeper (5) - Server Standalone Mode - Startup Flow;
2.1 ZKDatabase.addCommittedProposal
protected LinkedList<Proposal> committedLog = new LinkedList<Proposal>();
public void addCommittedProposal(Request request) {
    WriteLock wl = logLock.writeLock();
    try {
        // acquire the write lock
        wl.lock();
        // committedLog holds at most commitLogCount (500) entries
        if (committedLog.size() > commitLogCount) {
            committedLog.removeFirst();
            minCommittedLog = committedLog.getFirst().packet.getZxid();
        }
        if (committedLog.size() == 0) {
            minCommittedLog = request.zxid;
            maxCommittedLog = request.zxid;
        }

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        BinaryOutputArchive boa = BinaryOutputArchive.getArchive(baos);
        try {
            request.hdr.serialize(boa, "hdr");
            if (request.txn != null) {
                request.txn.serialize(boa, "txn");
            }
            baos.close();
        } catch (IOException e) {
            LOG.error("This really should be impossible", e);
        }
        QuorumPacket pp = new QuorumPacket(Leader.PROPOSAL, request.zxid,
                baos.toByteArray(), null);
        Proposal p = new Proposal();
        p.packet = pp;
        p.request = request;
        // append the wrapped Proposal to committedLog
        committedLog.add(p);
        maxCommittedLog = p.packet.getZxid();
    } finally {
        wl.unlock();
    }
}
  • 1. This method is invoked as a callback while the txnLog is loaded, once per transaction;
  • 2. committedLog holds the deserialized transactions wrapped as Proposals, capped at 500 entries;
  • 3. minCommittedLog is the smallest zxid in committedLog; maxCommittedLog is the largest;
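The bookkeeping above (a bounded list plus min/max zxid tracking) can be sketched as a tiny standalone class. BoundedCommittedLog is an illustrative simplification, not the actual ZKDatabase code: the real class stores full Proposal objects and evicts only after the bound is exceeded, while this sketch stores bare zxids and caps strictly at 500.

```java
import java.util.LinkedList;

// Minimal sketch of the bounded committed-proposal cache kept by ZKDatabase.
// Only zxids are stored here; the real class stores full Proposal objects.
class BoundedCommittedLog {
    static final int COMMIT_LOG_COUNT = 500;   // same bound as commitLogCount

    private final LinkedList<Long> committedLog = new LinkedList<>();
    private long minCommittedLog;
    private long maxCommittedLog;

    synchronized void add(long zxid) {
        // evict the oldest entry once the cache is full, then advance min
        if (committedLog.size() >= COMMIT_LOG_COUNT) {
            committedLog.removeFirst();
            minCommittedLog = committedLog.getFirst();
        }
        if (committedLog.isEmpty()) {
            minCommittedLog = zxid;
        }
        committedLog.add(zxid);
        maxCommittedLog = zxid;
    }

    synchronized long getMin() { return minCommittedLog; }
    synchronized long getMax() { return maxCommittedLog; }
    synchronized int size() { return committedLog.size(); }
}
```

Eviction advances minCommittedLog to the new head, so the window [minCommittedLog, maxCommittedLog] always covers exactly what the cache can replay; that window is what drives the DIFF/TRUNC/SNAP decision in section 9.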

3. Follower.followLeader

void followLeader() throws InterruptedException {
    ......
    // locate the leader
    QuorumServer leaderServer = findLeader();            
    // connect to the leader
    connectToLeader(leaderServer.addr, leaderServer.hostname);
    // register by sending FOLLOWERINFO
    long newEpochZxid = registerWithLeader(Leader.FOLLOWERINFO);

    long newEpoch = ZxidUtils.getEpochFromZxid(newEpochZxid);
    if (newEpoch < self.getAcceptedEpoch()) {
        LOG.error("Proposed leader epoch " + ZxidUtils.zxidToString(newEpochZxid)
                + " is less than our accepted epoch " + ZxidUtils.zxidToString(self.getAcceptedEpoch()));
        throw new IOException("Error: Epoch of leader is lower");
    }
    // synchronize data with the leader
    syncWithLeader(newEpochZxid);
    QuorumPacket qp = new QuorumPacket();
    while (this.isRunning()) {
        // after startup, block waiting for requests from the leader
        readPacket(qp);
        processPacket(qp);
    }
    ......
}
  • findLeader() looks up the leader's QuorumServer in the servers configuration;
  • connectToLeader(leaderServer.addr, leaderServer.hostname) opens a connection to the leader and initializes leaderIs and leaderOs;
  • registerWithLeader(Leader.FOLLOWERINFO) sends this follower's sid to the leader, registering the follower with it;
  • newEpoch < self.getAcceptedEpoch(): if the leader's epoch is smaller than the locally accepted epoch (note: the epoch, not the zxid), an exception is thrown;
  • syncWithLeader(newEpochZxid) performs data synchronization with the leader;
  • readPacket(qp)/processPacket(qp): after synchronization, startup is complete and the follower blocks waiting for requests from the leader, such as ping;
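registerWithLeader and the epoch comparison above both rely on the zxid layout: the high 32 bits carry the epoch and the low 32 bits a per-epoch transaction counter. ZxidUtils is essentially the following bit arithmetic (ZxidMath here is an illustrative stand-in for the real helper):

```java
// Sketch of the zxid layout used by ZxidUtils: epoch in the high 32 bits,
// per-epoch transaction counter in the low 32 bits.
final class ZxidMath {
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xFFFFFFFFL);
    }

    static long epochOf(long zxid) {
        return zxid >> 32;
    }

    static long counterOf(long zxid) {
        return zxid & 0xFFFFFFFFL;
    }
}
```

This is why "acceptedEpoch padded with zeros" in section 6 is just makeZxid(acceptedEpoch, 0), and why comparing epochs means comparing the high halves of two zxids.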

The following sections walk through the leader-follower communication in detail.

[Figure: leader-follower data synchronization sequence]

4. Leader Starts Listening for Sync Connections (LearnerCnxAcceptor.run)

public void run() {
    ......
    // wait for follower connections on port 2888
    Socket s = ss.accept();
    s.setSoTimeout(self.tickTime * self.initLimit);
    s.setTcpNoDelay(nodelay);
    BufferedInputStream is = new BufferedInputStream(s.getInputStream());
    LearnerHandler fh = new LearnerHandler(s, is, Leader.this);
    // one LearnerHandler thread per follower
    fh.start();
    ......
}
  • Socket s = ss.accept() blocks waiting for follower connections, on port 2888 by default;
  • new BufferedInputStream(s.getInputStream()) wraps the socket's input stream to receive data on that socket;
  • new LearnerHandler(s, is, Leader.this) constructs a LearnerHandler from the socket, the input stream, and the leader;
  • fh.start() starts one LearnerHandler thread per follower to handle all communication with that follower;
4.1 Leader Sets Up the Learner Streams (LearnerHandler.run)
public void run() {
    // prepare input/output archives over the follower connection
    ia = BinaryInputArchive.getArchive(bufferedInput);
    bufferedOutput = new BufferedOutputStream(sock.getOutputStream());
    oa = BinaryOutputArchive.getArchive(bufferedOutput);
    ......
}

5. Follower Connects to the Leader (Learner.connectToLeader)

protected void connectToLeader(InetSocketAddress addr, String hostname)
            throws IOException, ConnectException, InterruptedException {
    sock = new Socket();        
    sock.setSoTimeout(self.tickTime * self.initLimit);
    // retry the connection up to 5 times
    for (int tries = 0; tries < 5; tries++) {
        try {
            sock.connect(addr, self.tickTime * self.syncLimit);
            sock.setTcpNoDelay(nodelay);
            break;
        } catch (IOException e) {
            // elided: rethrow on the final attempt
            ......
        }
        Thread.sleep(1000);
    }
    self.authLearner.authenticate(sock, hostname);
    // input archive: deserialize data received from the leader
    leaderIs = BinaryInputArchive.getArchive(new BufferedInputStream(sock.getInputStream()));
    bufferedOutput = new BufferedOutputStream(sock.getOutputStream());
    // output archive: serialize data sent to the leader
    leaderOs = BinaryOutputArchive.getArchive(bufferedOutput);
}   
  • sock.connect establishes the connection to the leader, retrying up to 5 times;
  • leaderIs deserializes data received on the input stream;
  • leaderOs serializes data and sends it to the leader on the output stream;

6. Follower Sends FOLLOWERINFO (Learner.registerWithLeader)

protected long registerWithLeader(int pktType) throws IOException{
    long lastLoggedZxid = self.getLastLoggedZxid();
    QuorumPacket qp = new QuorumPacket();                
    qp.setType(pktType);
    qp.setZxid(ZxidUtils.makeZxid(self.getAcceptedEpoch(), 0));
    LearnerInfo li = new LearnerInfo(self.getId(), 0x10000);
    ByteArrayOutputStream bsid = new ByteArrayOutputStream();
    BinaryOutputArchive boa = BinaryOutputArchive.getArchive(bsid);
    boa.writeRecord(li, "LearnerInfo");
    ......
}
  • boa.writeRecord serializes the LearnerInfo, whose bytes become the packet's data; the QuorumPacket itself is then sent to the leader (elided above);
  • Packet sent:
    type: FOLLOWERINFO;
    zxid: acceptedEpoch in the high 32 bits, low 32 bits zeroed;
    data: the serialized LearnerInfo;

7. Leader Handles FOLLOWERINFO (LearnerHandler.run)

public void run() {
    ......
    QuorumPacket qp = new QuorumPacket();
    // read the first packet: FOLLOWERINFO
    ia.readRecord(qp, "packet");
    // return immediately unless it is FOLLOWERINFO or OBSERVERINFO
    if(qp.getType() != Leader.FOLLOWERINFO && qp.getType() != Leader.OBSERVERINFO){
        LOG.error("First packet " + qp.toString() + " is not FOLLOWERINFO or OBSERVERINFO!");
        return;
    }
    byte learnerInfoData[] = qp.getData();
    if (learnerInfoData != null) {
        // legacy learners send only an 8-byte sid; details skipped
        if (learnerInfoData.length == 8) {
            ByteBuffer bbsid = ByteBuffer.wrap(learnerInfoData);
            this.sid = bbsid.getLong();
        } else {
            LearnerInfo li = new LearnerInfo();
            // deserialize the LearnerInfo
            ByteBufferInputStream.byteBuffer2Record(ByteBuffer.wrap(learnerInfoData), li);
            this.sid = li.getServerid();
            this.version = li.getProtocolVersion();
        }
    } else {
        this.sid = leader.followerCounter.getAndDecrement();
    }

    LOG.info("Follower sid: " + sid + " : info : " + leader.self.quorumPeers.get(sid));
                
    if (qp.getType() == Leader.OBSERVERINFO) {
          learnerType = LearnerType.OBSERVER;
    }            
    // extract the follower's accepted epoch from the packet's zxid
    long lastAcceptedEpoch = ZxidUtils.getEpochFromZxid(qp.getZxid());
    
    long peerLastZxid;
    StateSummary ss = null;
    long zxid = qp.getZxid();
    // blocks until a quorum has reported; returns the new epoch
    long newEpoch = leader.getEpochToPropose(this.getSid(), lastAcceptedEpoch);

    // legacy version, skip details
    if (this.getVersion() < 0x10000) {
        // we are going to have to extrapolate the epoch information
        long epoch = ZxidUtils.getEpochFromZxid(zxid);
        ss = new StateSummary(epoch, zxid);
        // fake the message
        leader.waitForEpochAck(this.getSid(), ss);
    } else {
        byte ver[] = new byte[4];
        ByteBuffer.wrap(ver).putInt(0x10000);
        QuorumPacket newEpochPacket = new QuorumPacket(Leader.LEADERINFO, ZxidUtils.makeZxid(newEpoch, 0), ver, null);
        // once a quorum of followers has registered, respond with LEADERINFO
        oa.writeRecord(newEpochPacket, "packet");
        bufferedOutput.flush();
        ......
}
  • ia.readRecord(qp, "packet") deserializes the QuorumPacket;
  • lastAcceptedEpoch extracts the follower's epoch from the follower's zxid;
  • leader.getEpochToPropose(this.getSid(), lastAcceptedEpoch) uses wait/notify to block until FOLLOWERINFO has arrived from a quorum, then returns the largest reported epoch + 1 as newEpoch;
  • oa.writeRecord(newEpochPacket, "packet") responds to the follower with LEADERINFO;
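The quorum wait behind getEpochToPropose reduces to: every participant reports (sid, epoch), callers block until a strict majority has reported, and all of them then return max(reported epochs) + 1. A stripped-down sketch of that handshake (EpochQuorum is an illustrative class, not ZooKeeper's code, which additionally enforces an initLimit-based timeout and validates the result):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the quorum wait used by getEpochToPropose/waitForEpochAck:
// each server id reports its epoch; callers block until a majority has
// reported, and the proposed epoch is max(reported) + 1.
class EpochQuorum {
    private final int voters;                  // total number of participants
    private final Set<Long> reported = new HashSet<>();
    private long maxEpoch = -1;
    private boolean decided = false;

    EpochQuorum(int voters) { this.voters = voters; }

    synchronized long epochToPropose(long sid, long epoch) throws InterruptedException {
        if (!decided) {
            maxEpoch = Math.max(maxEpoch, epoch);
            reported.add(sid);
            if (reported.size() > voters / 2) {    // strict majority reached
                decided = true;
                notifyAll();
            } else {
                while (!decided) {
                    wait();                        // block until quorum
                }
            }
        }
        return maxEpoch + 1;
    }
}
```

Note that the leader itself is one of the callers, which is exactly why Leader.lead invokes getEpochToPropose directly: its own vote counts toward the quorum alongside the LearnerHandler threads.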

8. Follower Handles LEADERINFO (Learner.registerWithLeader)

protected long registerWithLeader(int pktType) throws IOException{
    ......
    // block waiting for the leader's LEADERINFO
    readPacket(qp);        
    final long newEpoch = ZxidUtils.getEpochFromZxid(qp.getZxid());
    if (qp.getType() == Leader.LEADERINFO) {
        // we are connected to a 1.0 server so accept the new epoch and read the next packet
        leaderProtocolVersion = ByteBuffer.wrap(qp.getData()).getInt();
        byte epochBytes[] = new byte[4];
        final ByteBuffer wrappedEpochBytes = ByteBuffer.wrap(epochBytes);
        if (newEpoch > self.getAcceptedEpoch()) {
            wrappedEpochBytes.putInt((int)self.getCurrentEpoch());
            self.setAcceptedEpoch(newEpoch);
        } else if (newEpoch == self.getAcceptedEpoch()) {
            wrappedEpochBytes.putInt(-1);
        } else {
            throw new IOException("Leaders epoch, " + newEpoch + " is less than accepted epoch, " + self.getAcceptedEpoch());
        }
        QuorumPacket ackNewEpoch = new QuorumPacket(Leader.ACKEPOCH, lastLoggedZxid, epochBytes, null);
        // send ACKEPOCH to the leader
        writePacket(ackNewEpoch, true);
        return ZxidUtils.makeZxid(newEpoch, 0);
    ......
} 
  • readPacket(qp) blocks waiting for the leader's LEADERINFO;
  • newEpoch > self.getAcceptedEpoch(): the leader's epoch is greater than the local accepted epoch, so the follower replies with its currentEpoch and updates acceptedEpoch to the leader's;
  • newEpoch == self.getAcceptedEpoch(): reply with -1;
  • writePacket(ackNewEpoch, true) sends ACKEPOCH to the leader;
  • ackNewEpoch contents:
    type: ACKEPOCH;
    zxid: the follower's last logged zxid;
    data: currentEpoch or -1;

9. Leader Handles ACKEPOCH (LearnerHandler.run)

public void run() {
        ......
        QuorumPacket ackEpochPacket = new QuorumPacket();
        // after sending LEADERINFO, read the ACKEPOCH reply
        ia.readRecord(ackEpochPacket, "packet");
        if (ackEpochPacket.getType() != Leader.ACKEPOCH) {
            LOG.error(ackEpochPacket.toString() + " is not ACKEPOCH");
            return;
        }
        ByteBuffer bbepoch = ByteBuffer.wrap(ackEpochPacket.getData());
        ss = new StateSummary(bbepoch.getInt(), ackEpochPacket.getZxid());
        // wait for ACKEPOCH from a quorum
        leader.waitForEpochAck(this.getSid(), ss);
    }
    peerLastZxid = ss.getLastZxid();
    
    /* the default to send to the follower */
    // default sync mode is SNAP
    int packetToSend = Leader.SNAP;
    long zxidToSend = 0;
    long leaderLastZxid = 0;
    /** the packets that the follower needs to get updates from **/
    long updates = peerLastZxid;
    
    ReentrantReadWriteLock lock = leader.zk.getZKDatabase().getLogLock();
    ReadLock rl = lock.readLock();
    try {
        // acquire the read lock
        rl.lock();
        // the largest zxid in the leader's committedLog
        final long maxCommittedLog = leader.zk.getZKDatabase().getmaxCommittedLog();
        // the smallest zxid in committedLog (head of the list)
        final long minCommittedLog = leader.zk.getZKDatabase().getminCommittedLog();
        // committedLog
        LinkedList proposals = leader.zk.getZKDatabase().getCommittedLog();

        // follower is fully in sync with the leader: send an empty DIFF
        if (peerLastZxid == leader.zk.getZKDatabase().getDataTreeLastProcessedZxid()) {
            packetToSend = Leader.DIFF;
            zxidToSend = peerLastZxid;
        } else if (proposals.size() != 0) {
            LOG.debug("proposal size is {}", proposals.size());
            // follower's zxid lies within [minCommittedLog, maxCommittedLog]: DIFF sync
            if ((maxCommittedLog >= peerLastZxid) && (minCommittedLog <= peerLastZxid)) {
                long prevProposalZxid = minCommittedLog;
                boolean firstPacket=true;
                packetToSend = Leader.DIFF;
                zxidToSend = maxCommittedLog;

                for (Proposal propose: proposals) {
                    // skip the proposals the peer already has
                    // skip proposals the follower already has (zxid <= peerLastZxid)
                    if (propose.packet.getZxid() <= peerLastZxid) {
                        prevProposalZxid = propose.packet.getZxid();
                        continue;
                    } else {
                        if (firstPacket) {
                            firstPacket = false;
                            // Does the peer have some proposals that the leader hasn't seen yet
                            // first packet: the follower's zxid was not found in committedLog,
                            // i.e. the follower has proposals the leader has not seen,
                            // so send TRUNC back to prevProposalZxid before the DIFF
                            if (prevProposalZxid < peerLastZxid) {
                                // send a trunc message before sending the diff
                                packetToSend = Leader.TRUNC;                                        
                                zxidToSend = prevProposalZxid;
                                updates = zxidToSend;
                            }
                        }
                        // PROPOSAL
                        queuePacket(propose.packet);
                        QuorumPacket qcommit = new QuorumPacket(Leader.COMMIT, propose.packet.getZxid(),null, null);
                        // queue a COMMIT after each PROPOSAL
                        queuePacket(qcommit);
                    }
                }
            } else if (peerLastZxid > maxCommittedLog) {
                // follower's zxid is beyond the leader's max: send TRUNC
                packetToSend = Leader.TRUNC;
                zxidToSend = maxCommittedLog;
                updates = zxidToSend;
            }
        }
        // analyzed in a later post
        leaderLastZxid = leader.startForwarding(this, updates);

    } finally {
        rl.unlock();
    }

     QuorumPacket newLeaderQP = new QuorumPacket(Leader.NEWLEADER,
            ZxidUtils.makeZxid(newEpoch, 0), null, null);
     if (getVersion() < 0x10000) {
        oa.writeRecord(newLeaderQP, "packet");
    } else {
         // enqueue NEWLEADER into queuedPackets
        queuedPackets.add(newLeaderQP);
    }
    bufferedOutput.flush();
    //Need to set the zxidToSend to the latest zxid
    if (packetToSend == Leader.SNAP) {
        zxidToSend = leader.zk.getZKDatabase().getDataTreeLastProcessedZxid();
    }
    // send the sync-mode packet: DIFF, TRUNC, or SNAP
    oa.writeRecord(new QuorumPacket(packetToSend, zxidToSend, null, null), "packet");
    bufferedOutput.flush();
    
    /* if we are not truncating or sending a diff just send a snapshot */
    if (packetToSend == Leader.SNAP) {
        LOG.info("Sending snapshot last zxid of peer is 0x"
                + Long.toHexString(peerLastZxid) + " " 
                + " zxid of leader is 0x"
                + Long.toHexString(leaderLastZxid)
                + "sent zxid of db as 0x" 
                + Long.toHexString(zxidToSend));
        // Dump data to peer
        // serialize the snapshot into the output stream
        leader.zk.getZKDatabase().serializeSnapshot(oa);
        oa.writeString("BenWasHere", "signature");
    }
    bufferedOutput.flush();
    
    // Start sending packets
    new Thread() {
        public void run() {
            Thread.currentThread().setName(
                    "Sender-" + sock.getRemoteSocketAddress());
            try {
                // dedicated sender thread draining queuedPackets
                sendPackets();
            } catch (InterruptedException e) {
                LOG.warn("Unexpected interruption",e);
            }
        }
    }.start();
    ......
}
  • ia.readRecord(ackEpochPacket, "packet") receives the ACKEPOCH sent by the follower;
  • leader.waitForEpochAck(this.getSid(), ss) blocks until ACKEPOCH has arrived from a quorum of followers;
  • rl.lock() takes the ZKDatabase read lock to guard against concurrent writes (addCommittedProposal, analyzed above, takes the write lock);
  • leader.zk.getZKDatabase().getCommittedLog() is the committedLog maintained in 2.1, holding deserialized transactions wrapped as Proposals;
  • peerLastZxid == leader.zk.getZKDatabase().getDataTreeLastProcessedZxid(): the follower is fully in sync with the leader, so an empty DIFF is sent;
  • (maxCommittedLog >= peerLastZxid) && (minCommittedLog <= peerLastZxid): the follower's zxid lies within [minCommittedLog, maxCommittedLog], so a DIFF sync is performed;
  • propose.packet.getZxid() <= peerLastZxid skips the proposals in committedLog that the follower already has;
  • queuePacket(propose.packet) enqueues the PROPOSAL; the packet is the one built in 2.1: QuorumPacket pp = new QuorumPacket(Leader.PROPOSAL, request.zxid, baos.toByteArray(), null);
  • queuePacket(qcommit): each zxid's PROPOSAL is followed by its COMMIT;
  • peerLastZxid > maxCommittedLog: the follower's zxid is beyond the leader's max, so the follower must truncate the excess; TRUNC is sent;
  • leader.startForwarding(this, updates) is analyzed later with the request-processing flow;
  • queuedPackets.add(newLeaderQP) enqueues NEWLEADER into queuedPackets;
  • oa.writeRecord(new QuorumPacket(packetToSend, zxidToSend, null, null), "packet") sends the QuorumPacket of type SNAP/TRUNC/DIFF;
  • leader.zk.getZKDatabase().serializeSnapshot(oa): for SNAP, the snapshot is serialized and sent;
  • sendPackets() runs in a new thread sending the QuorumPackets queued in queuedPackets; when the queue is empty the thread blocks in queuedPackets.take();
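Setting the per-proposal details aside, the choice of sync mode above reduces to a pure function of the follower's last zxid and the leader's log bounds. A sketch under that simplification (SyncDecision is illustrative; the real code also handles an empty committedLog and the TRUNC-before-DIFF corner case described above):

```java
// Sketch of how LearnerHandler picks the sync strategy from the follower's
// last zxid and the bounds of the leader's in-memory committedLog.
final class SyncDecision {
    enum Mode { DIFF_EMPTY, DIFF, TRUNC, SNAP }

    static Mode decide(long peerLastZxid, long lastProcessedZxid,
                       long minCommittedLog, long maxCommittedLog) {
        if (peerLastZxid == lastProcessedZxid) {
            return Mode.DIFF_EMPTY;          // fully in sync: empty DIFF
        }
        if (peerLastZxid >= minCommittedLog && peerLastZxid <= maxCommittedLog) {
            return Mode.DIFF;                // replay PROPOSAL/COMMIT pairs
        }
        if (peerLastZxid > maxCommittedLog) {
            return Mode.TRUNC;               // follower is ahead: truncate
        }
        return Mode.SNAP;                    // too far behind: full snapshot
    }
}
```

The 500-entry committedLog bound from 2.1 is what makes SNAP necessary at all: a follower whose zxid has fallen below minCommittedLog can no longer be caught up by replaying proposals.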
    ----- That's all for this part; there is too much material for one post, continued in the next section -----
