前一篇介绍了Leader选举,这一篇介绍选举成功之后Leader和Follower之间的初始化。
先看Leader端操作
- case LEADING:
- LOG.info("LEADING");
- try {
- //初始化Leader对象
- setLeader(makeLeader(logFactory));
- //lead,线程在这里阻塞
- leader.lead();
- setLeader(null);
- } catch (Exception e) {
- LOG.warn("Unexpected exception",e);
- } finally {
- if (leader != null) {
- leader.shutdown("Forcing shutdown");
- setLeader(null);
- }
- setPeerState(ServerState.LOOKING);
- }
- break;
- }
Leader初始化
- Leader(QuorumPeer self,LeaderZooKeeperServer zk) throws IOException {
- this.self = self;
- try {
- 打开lead端口,这里是2888
- ss = new ServerSocket();
- ss.setReuseAddress(true);
- ss.bind(new InetSocketAddress(self.getQuorumAddress().getPort()));
- } catch (BindException e) {
- LOG.error("Couldn't bind to port "
- + self.getQuorumAddress().getPort(), e);
- throw e;
- }
- this.zk=zk;
- }
具体lead过程
- self.tick = 0;
- //从本地文件恢复数据
- zk.loadData();
- //leader的状态信息
- leaderStateSummary = new StateSummary(self.getCurrentEpoch(), zk.getLastProcessedZxid());
- // Start thread that waits for connection requests from
- // new followers.
- //启动lead端口的监听线程,专门用来监听新的follower
- cnxAcceptor = new LearnerCnxAcceptor();
- cnxAcceptor.start();
- readyToStart = true;
- //等待足够多的follower进来,代表自己确实是leader,此处lead线程可能会等待
- long epoch = getEpochToPropose(self.getId(), self.getAcceptedEpoch());
- .......
等待follower连接
- public long getEpochToPropose(long sid, long lastAcceptedEpoch) throws InterruptedException, IOException {
- synchronized(connectingFollowers) {
- if (!waitingForNewEpoch) {
- return epoch;
- }
- if (lastAcceptedEpoch >= epoch) {
- epoch = lastAcceptedEpoch+1;
- }
- //将自己加入连接队伍中,方便后续判断lead是否有效
- connectingFollowers.add(sid);
- QuorumVerifier verifier = self.getQuorumVerifier();
- //如果足够多的follower进入,选举有效,则无需等待,并通知其他的等待线程,类似于Barrier
- if (connectingFollowers.contains(self.getId()) &&
- verifier.containsQuorum(connectingFollowers)) {
- waitingForNewEpoch = false;
- self.setAcceptedEpoch(epoch);
- connectingFollowers.notifyAll();
- }
- //如果进入的follower不够,则进入等待,超时即为initLimit时间,
- else {
- long start = System.currentTimeMillis();
- long cur = start;
- long end = start + self.getInitLimit()*self.getTickTime();
- while(waitingForNewEpoch && cur < end) {
- connectingFollowers.wait(end - cur);
- cur = System.currentTimeMillis();
- }
- //超时了,退出lead过程,重新发起选举
- if (waitingForNewEpoch) {
- throw new InterruptedException("Timeout while waiting for epoch from quorum");
- }
- }
- return epoch;
- }
- }
好的,这个时候我们假设其他follower还没连接进来,那Leader就会在此等待。再来看Follower的初始化过程
- case FOLLOWING:
- try {
- LOG.info("FOLLOWING");
- //初始化Follower对象
- setFollower(makeFollower(logFactory));
- //follow动作,线程在此等待
- follower.followLeader();
- } catch (Exception e) {
- LOG.warn("Unexpected exception",e);
- } finally {
- follower.shutdown();
- setFollower(null);
- setPeerState(ServerState.LOOKING);
- }
- break;
具体follow过程
- void followLeader() throws InterruptedException {
- .......
- try {
- //根据sid找到对应leader,拿到lead连接信息
- InetSocketAddress addr = findLeader();
- try {
- //连接leader
- connectToLeader(addr);
- //注册follower,根据Leader和follower协议,主要是同步选举轮数
- long newEpochZxid = registerWithLeader(Leader.FOLLOWERINFO);
- //check to see if the leader zxid is lower than ours
- //this should never happen but is just a safety check
- long newEpoch = ZxidUtils.getEpochFromZxid(newEpochZxid);
- if (newEpoch < self.getAcceptedEpoch()) {
- LOG.error("Proposed leader epoch " + ZxidUtils.zxidToString(newEpochZxid)
- + " is less than our accepted epoch " + ZxidUtils.zxidToString(self.getAcceptedEpoch()));
- throw new IOException("Error: Epoch of leader is lower");
- }
- //同步数据
- syncWithLeader(newEpochZxid);
- QuorumPacket qp = new QuorumPacket();
- //接受Leader消息,执行并反馈给leader,线程在此自旋
- while (self.isRunning()) {
- readPacket(qp);
- processPacket(qp);
- }
- ......
- }
连接leader
- protected void connectToLeader(InetSocketAddress addr)
- throws IOException, ConnectException, InterruptedException {
- sock = new Socket();
- //设置读超时时间为initLimit对应时间
- sock.setSoTimeout(self.tickTime * self.initLimit);
- //重试5次,失败后退出follower角色,重新选举
- for (int tries = 0; tries < 5; tries++) {
- try {
- //连接超时
- sock.connect(addr, self.tickTime * self.syncLimit);
- sock.setTcpNoDelay(nodelay);
- break;
- } catch (IOException e) {
- if (tries == 4) {
- LOG.error("Unexpected exception",e);
- throw e;
- } else {
- LOG.warn("Unexpected exception, tries="+tries+
- ", connecting to " + addr,e);
- sock = new Socket();
- sock.setSoTimeout(self.tickTime * self.initLimit);
- }
- }
- Thread.sleep(1000);
- }
- leaderIs = BinaryInputArchive.getArchive(new BufferedInputStream(
- sock.getInputStream()));
- bufferedOutput = new BufferedOutputStream(sock.getOutputStream());
- leaderOs = BinaryOutputArchive.getArchive(bufferedOutput);
- }
假设这里follower顺利连上了leader,此时leader端会为每个follower启动单独IO线程,请看LearnerCnxAcceptor代码
- public void run() {
- try {
- while (!stop) {
- try{
- //线程在此等待连接
- Socket s = ss.accept();
- // start with the initLimit, once the ack is processed
- // in LearnerHandler switch to the syncLimit
- //读超时设为initLimit时间
- s.setSoTimeout(self.tickTime * self.initLimit);
- s.setTcpNoDelay(nodelay);
- //为每个follower启动单独线程,处理IO
- LearnerHandler fh = new LearnerHandler(s, Leader.this);
- fh.start();
- } catch (SocketException e) {
- .....
- }
leader端为follower建立IO线程,其处理过程和follower自身的主线程根据协议相互交互,以下将通过数据交换场景式分析这个过程,leader端IO线程LearnerHandler启动
- ia = BinaryInputArchive.getArchive(new BufferedInputStream(sock
- .getInputStream()));
- bufferedOutput = new BufferedOutputStream(sock.getOutputStream());
- oa = BinaryOutputArchive.getArchive(bufferedOutput);
- //IO线程等待follower发送包
- QuorumPacket qp = new QuorumPacket();
- ia.readRecord(qp, "packet");
follower端进入registerWithLeader处理
- long lastLoggedZxid = self.getLastLoggedZxid();
- QuorumPacket qp = new QuorumPacket();
- //type为Leader.FOLLOWERINFO
- qp.setType(pktType);
- qp.setZxid(ZxidUtils.makeZxid(self.getAcceptedEpoch(), 0));
- /*
- * Add sid to payload
- */
- LearnerInfo li = new LearnerInfo(self.getId(), 0x10000);
- ByteArrayOutputStream bsid = new ByteArrayOutputStream();
- BinaryOutputArchive boa = BinaryOutputArchive.getArchive(bsid);
- boa.writeRecord(li, "LearnerInfo");
- qp.setData(bsid.toByteArray());
- //发送LearnerInfo包
- writePacket(qp, true);
- //等待leader响应
- readPacket(qp);
leader端收到包处理
- byte learnerInfoData[] = qp.getData();
- if (learnerInfoData != null) {
- if (learnerInfoData.length == 8) {
- ByteBuffer bbsid = ByteBuffer.wrap(learnerInfoData);
- this.sid = bbsid.getLong();
- } else {
- //反序列化LearnerInfo
- LearnerInfo li = new LearnerInfo();
- ByteBufferInputStream.byteBuffer2Record(ByteBuffer.wrap(learnerInfoData), li);
- this.sid = li.getServerid();
- this.version = li.getProtocolVersion();
- }
- } else {
- this.sid = leader.followerCounter.getAndDecrement();
- }
- ......
- //follower的选举轮数
- long lastAcceptedEpoch = ZxidUtils.getEpochFromZxid(qp.getZxid());
- long peerLastZxid;
- StateSummary ss = null;
- long zxid = qp.getZxid();
- //将follower加入到connectingFollowers中,因为满足半数机器的条件,此时在此等待的leader主线程会退出等待,继续往下处理
- long newEpoch = leader.getEpochToPropose(this.getSid(), lastAcceptedEpoch);
- ......
- //发一个Leader.LEADERINFO包,带上新的epoch id
- byte ver[] = new byte[4];
- ByteBuffer.wrap(ver).putInt(0x10000);
- QuorumPacket newEpochPacket = new QuorumPacket(Leader.LEADERINFO, ZxidUtils.makeZxid(newEpoch, 0), ver, null);
- oa.writeRecord(newEpochPacket, "packet");
- bufferedOutput.flush();
- QuorumPacket ackEpochPacket = new QuorumPacket();
- //等待follower响应
- ia.readRecord(ackEpochPacket, "packet");
- if (ackEpochPacket.getType() != Leader.ACKEPOCH) {
- LOG.error(ackEpochPacket.toString()
- + " is not ACKEPOCH");
- return;
- }
- ByteBuffer bbepoch = ByteBuffer.wrap(ackEpochPacket.getData());
- ss = new StateSummary(bbepoch.getInt(), ackEpochPacket.getZxid());
- leader.waitForEpochAck(this.getSid(), ss);
- }
此时follower收到LEADERINFO包处理:
- final long newEpoch = ZxidUtils.getEpochFromZxid(qp.getZxid());
- if (qp.getType() == Leader.LEADERINFO) {
- // we are connected to a 1.0 server so accept the new epoch and read the next packet
- leaderProtocolVersion = ByteBuffer.wrap(qp.getData()).getInt();
- byte epochBytes[] = new byte[4];
- final ByteBuffer wrappedEpochBytes = ByteBuffer.wrap(epochBytes);
- //将自己的epoch发给leader
- if (newEpoch > self.getAcceptedEpoch()) {
- wrappedEpochBytes.putInt((int)self.getCurrentEpoch());
- self.setAcceptedEpoch(newEpoch);
- }
- ......
- //发送一个Leader.ACKEPOCH包,带上自己的最大zxid
- QuorumPacket ackNewEpoch = new QuorumPacket(Leader.ACKEPOCH, lastLoggedZxid, epochBytes, null);
- writePacket(ackNewEpoch, true);
- return ZxidUtils.makeZxid(newEpoch, 0);
- }
leader收到Leader.ACKEPOCH后进入waitForEpochAck处理
- public void waitForEpochAck(long id, StateSummary ss) throws IOException, InterruptedException {
- synchronized(electingFollowers) {
- if (electionFinished) {
- return;
- }
- if (ss.getCurrentEpoch() != -1) {
- ......
- //将follower添加到等待集合
- electingFollowers.add(id);
- }
- QuorumVerifier verifier = self.getQuorumVerifier();
- //判断是否满足选举条件,如果不满足进入等待,满足则通知其他等待线程,类似于Barrier
- if (electingFollowers.contains(self.getId()) && verifier.containsQuorum(electingFollowers)) {
- electionFinished = true;
- electingFollowers.notifyAll();
- }
- //follower还不够,等等吧
- else {
- long start = System.currentTimeMillis();
- long cur = start;
- long end = start + self.getInitLimit()*self.getTickTime();
- while(!electionFinished && cur < end) {
- electingFollowers.wait(end - cur);
- cur = System.currentTimeMillis();
- }
- if (!electionFinished) {
- throw new InterruptedException("Timeout while waiting for epoch to be acked by quorum");
- }
- }
- }
- }
假设IO线程在此等待,此时leader主线程在getEpochToPropose恢复后继续执行
- long epoch = getEpochToPropose(self.getId(), self.getAcceptedEpoch());
- zk.setZxid(ZxidUtils.makeZxid(epoch, 0));
- synchronized(this){
- lastProposed = zk.getZxid();
- }
- //发起一个NEWLEADER投票
- newLeaderProposal.packet = new QuorumPacket(NEWLEADER, zk.getZxid(),
- null, null);
- ......
- //投票箱
- outstandingProposals.put(newLeaderProposal.packet.getZxid(), newLeaderProposal);
- //自己默认同意
- newLeaderProposal.ackSet.add(self.getId());
- //等待follower进来
- waitForEpochAck(self.getId(), leaderStateSummary);
由于之前已经有follower进来,满足选举条件,则IO线程和leader主线程都继续往下执行,先看leader主线程
- //当前选票轮数
- self.setCurrentEpoch(epoch);
- // We have to get at least a majority of servers in sync with
- // us. We do this by waiting for the NEWLEADER packet to get
- // acknowledged
- //等待确认NEWLEADER包的follower足够多,那自己真的是leader了
- while (!self.getQuorumVerifier().containsQuorum(newLeaderProposal.ackSet)){
- //while (newLeaderProposal.ackCount <= self.quorumPeers.size() / 2) {
- //如果超过初始化时间initlimit,则退出lead过程,重新选举,有可能是follower同步数据比较慢
- if (self.tick > self.initLimit) {
- // Followers aren't syncing fast enough,
- // renounce leadership!
- StringBuilder ackToString = new StringBuilder();
- for(Long id : newLeaderProposal.ackSet)
- ackToString.append(id + ": ");
- shutdown("Waiting for a quorum of followers, only synced with: " + ackToString);
- HashSet<Long> followerSet = new HashSet<Long>();
- for(LearnerHandler f : getLearners()) {
- followerSet.add(f.getSid());
- }
- if (self.getQuorumVerifier().containsQuorum(followerSet)) {
- //if (followers.size() >= self.quorumPeers.size() / 2) {
- LOG.warn("Enough followers present. "+
- "Perhaps the initTicks need to be increased.");
- }
- return;
- }
- Thread.sleep(self.tickTime);
- self.tick++;
- }
这个时候IO线程继续执行
- /* the default to send to the follower */
- //默认发送一个SNAP包,要求follower同步数据
- int packetToSend = Leader.SNAP;
- long zxidToSend = 0;
- long leaderLastZxid = 0;
- /** the packets that the follower needs to get updates from **/
- long updates = peerLastZxid;
- /* we are sending the diff check if we have proposals in memory to be able to
- * send a diff to the
- */
- ReentrantReadWriteLock lock = leader.zk.getZKDatabase().getLogLock();
- ReadLock rl = lock.readLock();
- try {
- rl.lock();
- final long maxCommittedLog = leader.zk.getZKDatabase().getmaxCommittedLog();
- final long minCommittedLog = leader.zk.getZKDatabase().getminCommittedLog();
- LOG.info("Synchronizing with Follower sid: " + sid
- +" maxCommittedLog=0x"+Long.toHexString(maxCommittedLog)
- +" minCommittedLog=0x"+Long.toHexString(minCommittedLog)
- +" peerLastZxid=0x"+Long.toHexString(peerLastZxid));
- //看看是否还有需要投的票
- LinkedList<Proposal> proposals = leader.zk.getZKDatabase().getCommittedLog();
- //如果有,则处理这些投票
- if (proposals.size() != 0) {
- //如果follower还没处理这个分布式事务,有可能down掉了又恢复,则继续处理这个事务
- if ((maxCommittedLog >= peerLastZxid)
- && (minCommittedLog <= peerLastZxid)) {
- .......
- // If we are here, we can use committedLog to sync with
- // follower. Then we only need to decide whether to
- // send trunc or not
- packetToSend = Leader.DIFF;
- zxidToSend = maxCommittedLog;
- for (Proposal propose: proposals) {
- // skip the proposals the peer already has
- //这个已经被处理过了,无视
- if (propose.packet.getZxid() <= peerLastZxid) {
- prevProposalZxid = propose.packet.getZxid();
- continue;
- } else {
- // If we are sending the first packet, figure out whether to trunc
- // in case the follower has some proposals that the leader doesn't
- //发第一个事务之前先确认folloer是否比leader超前
- if (firstPacket) {
- firstPacket = false;
- // Does the peer have some proposals that the leader hasn't seen yet
- //follower处理事务比leader多,则发送TRUNC包,让follower回滚到和leader一致
- if (prevProposalZxid < peerLastZxid) {
- // send a trunc message before sending the diff
- packetToSend = Leader.TRUNC;
- zxidToSend = prevProposalZxid;
- updates = zxidToSend;
- }
- }
- //将事务发送到队列
- queuePacket(propose.packet);
- //立马接一个COMMIT包
- QuorumPacket qcommit = new QuorumPacket(Leader.COMMIT, propose.packet.getZxid(),
- null, null);
- queuePacket(qcommit);
- }
- }
- }
- //如果follower超前了,则发送TRUNC包,让其和leader同步
- else if (peerLastZxid > maxCommittedLog) {
- LOG.debug("Sending TRUNC to follower zxidToSend=0x{} updates=0x{}",
- Long.toHexString(maxCommittedLog),
- Long.toHexString(updates));
- packetToSend = Leader.TRUNC;
- zxidToSend = maxCommittedLog;
- updates = zxidToSend;
- } else {
- LOG.warn("Unhandled proposal scenario");
- }
- }
- //如果follower和leader同步,则发送DIFF包,而不需要follower拉数据
- else if (peerLastZxid == leader.zk.getZKDatabase().getDataTreeLastProcessedZxid()) {
- .....
- packetToSend = Leader.DIFF;
- zxidToSend = peerLastZxid;
- .......
- //NEWLEADER包添加到发送队列
- QuorumPacket newLeaderQP = new QuorumPacket(Leader.NEWLEADER,
- ZxidUtils.makeZxid(newEpoch, 0), null, null);
- if (getVersion() < 0x10000) {
- oa.writeRecord(newLeaderQP, "packet");
- } else {
- queuedPackets.add(newLeaderQP);
- }
- bufferedOutput.flush();
- //Need to set the zxidToSend to the latest zxid
- if (packetToSend == Leader.SNAP) {
- zxidToSend = leader.zk.getZKDatabase().getDataTreeLastProcessedZxid();
- }
- //发送一个DIFF或SNAP包
- oa.writeRecord(new QuorumPacket(packetToSend, zxidToSend, null, null), "packet");
- bufferedOutput.flush();
- ......
- // Start sending packets
- //启动一个异步发送线程
- new Thread() {
- public void run() {
- Thread.currentThread().setName(
- "Sender-" + sock.getRemoteSocketAddress());
- try {
- sendPackets();
- } catch (InterruptedException e) {
- LOG.warn("Unexpected interruption",e);
- }
- }
- }.start();
- /*
- * Have to wait for the first ACK, wait until
- * the leader is ready, and only then we can
- * start processing messages.
- */
- //等待follower确认
- qp = new QuorumPacket();
- ia.readRecord(qp, "packet");
在我们这个集群里。由于是刚启动的,所以leader会直接发送DIFF包,然后再发送一个NEWLEADER包
接着follower收到包处理,在syncWithLeader中
- QuorumPacket ack = new QuorumPacket(Leader.ACK, 0, null, null);
- QuorumPacket qp = new QuorumPacket();
- long newEpoch = ZxidUtils.getEpochFromZxid(newLeaderZxid);
- readPacket(qp);
- LinkedList<Long> packetsCommitted = new LinkedList<Long>();
- LinkedList<PacketInFlight> packetsNotCommitted = new LinkedList<PacketInFlight>();
- synchronized (zk) {
- //DIFF包
- if (qp.getType() == Leader.DIFF) {
- LOG.info("Getting a diff from the leader 0x" + Long.toHexString(qp.getZxid()));
- }
- //如果是SNAP包,则从leader复制一份镜像数据到本地内存
- else if (qp.getType() == Leader.SNAP) {
- LOG.info("Getting a snapshot from leader");
- // The leader is going to dump the database
- // clear our own database and read
- zk.getZKDatabase().clear();
- zk.getZKDatabase().deserializeSnapshot(leaderIs);
- String signature = leaderIs.readString("signature");
- if (!signature.equals("BenWasHere")) {
- LOG.error("Missing signature. Got " + signature);
- throw new IOException("Missing signature");
- }
- }
- //TRUNC包,回滚到对应事务
- else if (qp.getType() == Leader.TRUNC) {
- //we need to truncate the log to the lastzxid of the leader
- LOG.warn("Truncating log to get in sync with the leader 0x"
- + Long.toHexString(qp.getZxid()));
- boolean truncated=zk.getZKDatabase().truncateLog(qp.getZxid());
- ......
- //最新的事务id
- zk.getZKDatabase().setlastProcessedZxid(qp.getZxid());
- //启动过期session检查
- zk.createSessionTracker();
- long lastQueued = 0;
- // in V1.0 we take a snapshot when we get the NEWLEADER message, but in pre V1.0
- // we take the snapshot at the UPDATE, since V1.0 also gets the UPDATE (after the NEWLEADER)
- // we need to make sure that we don't take the snapshot twice.
- boolean snapshotTaken = false;
- // we are now going to start getting transactions to apply followed by an UPTODATE
- outerLoop:
- //同步完数据后,准备执行投票
- while (self.isRunning()) {
- readPacket(qp);
- switch(qp.getType()) {
- //将投票添加到待处理列表
- case Leader.PROPOSAL:
- PacketInFlight pif = new PacketInFlight();
- pif.hdr = new TxnHeader();
- pif.rec = SerializeUtils.deserializeTxn(qp.getData(), pif.hdr);
- if (pif.hdr.getZxid() != lastQueued + 1) {
- LOG.warn("Got zxid 0x"
- + Long.toHexString(pif.hdr.getZxid())
- + " expected 0x"
- + Long.toHexString(lastQueued + 1));
- }
- lastQueued = pif.hdr.getZxid();
- packetsNotCommitted.add(pif);
- break;
- //COMMIT则将事务交给Server处理掉
- case Leader.COMMIT:
- if (!snapshotTaken) {
- pif = packetsNotCommitted.peekFirst();
- if (pif.hdr.getZxid() != qp.getZxid()) {
- LOG.warn("Committing " + qp.getZxid() + ", but next proposal is " + pif.hdr.getZxid());
- } else {
- zk.processTxn(pif.hdr, pif.rec);
- packetsNotCommitted.remove();
- }
- } else {
- packetsCommitted.add(qp.getZxid());
- }
- break;
- case Leader.INFORM:
- TxnHeader hdr = new TxnHeader();
- Record txn = SerializeUtils.deserializeTxn(qp.getData(), hdr);
- zk.processTxn(hdr, txn);
- break;
- //UPTODATE包,说明同步成功,退出循环
- case Leader.UPTODATE:
- if (!snapshotTaken) { // true for the pre v1.0 case
- zk.takeSnapshot();
- self.setCurrentEpoch(newEpoch);
- }
- self.cnxnFactory.setZooKeeperServer(zk);
- break outerLoop;
- //NEWLEADER包,说明之前残留的投票已经处理完了,则将内存中数据写文件,并发送ACK包
- case Leader.NEWLEADER: // it will be NEWLEADER in v1.0
- zk.takeSnapshot();
- self.setCurrentEpoch(newEpoch);
- snapshotTaken = true;
- writePacket(new QuorumPacket(Leader.ACK, newLeaderZxid, null, null), true);
- break;
- }
- }
- }
follower在这里同步leader数据,在拿到NEWLEADER包之后序列化到文件,发送ACK包,leaderIO线程处理
- qp = new QuorumPacket();
- ia.readRecord(qp, "packet");
- if(qp.getType() != Leader.ACK){
- LOG.error("Next packet was supposed to be an ACK");
- return;
- }
- //ACK包处理,如果follower数据同步成功,则将它添加到NEWLEADER这个投票的结果中,这样leader主线程就会恢复执行
- leader.processAck(this.sid, qp.getZxid(), sock.getLocalSocketAddress());
- // now that the ack has been processed expect the syncLimit
- sock.setSoTimeout(leader.self.tickTime * leader.self.syncLimit);
- /*
- * Wait until leader starts up
- */
- //等待leader的server启动
- synchronized(leader.zk){
- while(!leader.zk.isRunning() && !this.isInterrupted()){
- leader.zk.wait(20);
- }
- }
- // Mutation packets will be queued during the serialize,
- // so we need to mark when the peer can actually start
- // using the data
- //
- //leader server启动后,发送一个UPTODATE包
- queuedPackets.add(new QuorumPacket(Leader.UPTODATE, -1, null, null));
具体的ACK包处理
- synchronized public void processAck(long sid, long zxid, SocketAddress followerAddr) {
- ......
- Proposal p = outstandingProposals.get(zxid);
- ......
- //将follower添加到结果列表
- p.ackSet.add(sid);
- ......
- //票数够了,则启动leader的server
- if (self.getQuorumVerifier().containsQuorum(p.ackSet)){
- .......
- } else {
- lastCommitted = zxid;
- LOG.info("Have quorum of supporters; starting up and setting last processed zxid: 0x{}",
- Long.toHexString(zk.getZxid()));
- //启动leader的zookeeper server
- zk.startup();
- zk.getZKDatabase().setlastProcessedZxid(zk.getZxid());
- }
- }
- }
由于follower进来已经满足投票条件,则leader 的server启动,如下
- public void startup() {
- if (sessionTracker == null) {
- createSessionTracker();
- }
- //session检查
- startSessionTracker();
- //处理链
- setupRequestProcessors();
- registerJMX();
- synchronized (this) {
- running = true;
- notifyAll();
- }
- }
- protected void setupRequestProcessors() {
- 后final处理器
- RequestProcessor finalProcessor = new FinalRequestProcessor(this);
- RequestProcessor toBeAppliedProcessor = new Leader.ToBeAppliedRequestProcessor(
- finalProcessor, getLeader().toBeApplied);
- 票结果确认
- commitProcessor = new CommitProcessor(toBeAppliedProcessor,
- Long.toString(getServerId()), false);
- commitProcessor.start();
- 票发起
- ProposalRequestProcessor proposalProcessor = new ProposalRequestProcessor(this,
- commitProcessor);
- proposalProcessor.initialize();
- 务预处理
- firstProcessor = new PrepRequestProcessor(this, proposalProcessor);
- ((PrepRequestProcessor)firstProcessor).start();
- }
leader启动后,发送一个UPTODATE包,follower处理
- //退出同步数据循环
- case Leader.UPTODATE:
- if (!snapshotTaken) { // true for the pre v1.0 case
- zk.takeSnapshot();
- self.setCurrentEpoch(newEpoch);
- }
- self.cnxnFactory.setZooKeeperServer(zk);
- break outerLoop;
- ......
- //再发ACK包
- ack.setZxid(ZxidUtils.makeZxid(newEpoch, 0));
- writePacket(ack, true);
leader的IO线程LearnerHandler进入主循环,收到ACK包处理
- while (true) {
- qp = new QuorumPacket();
- ia.readRecord(qp, "packet");
- ......
- tickOfLastAck = leader.self.tick;
- ByteBuffer bb;
- long sessionId;
- int cxid;
- int type;
- switch (qp.getType()) {
- //ACK包,看看之前的投票是否结束
- case Leader.ACK:
- ......
- leader.processAck(this.sid, qp.getZxid(), sock.getLocalSocketAddress());
- break;
- //PING包更新下session的超时时间,往前推
- case Leader.PING:
- // Process the touches
- ByteArrayInputStream bis = new ByteArrayInputStream(qp
- .getData());
- DataInputStream dis = new DataInputStream(bis);
- while (dis.available() > 0) {
- long sess = dis.readLong();
- int to = dis.readInt();
- leader.zk.touch(sess, to);
- }
- break;
- //REVALIDATE包,检查session是否还有效
- case Leader.REVALIDATE:
- bis = new ByteArrayInputStream(qp.getData());
- dis = new DataInputStream(bis);
- long id = dis.readLong();
- int to = dis.readInt();
- ByteArrayOutputStream bos = new ByteArrayOutputStream();
- DataOutputStream dos = new DataOutputStream(bos);
- dos.writeLong(id);
- boolean valid = leader.zk.touch(id, to);
- if (valid) {
- try {
- //set the session owner
- // as the follower that
- // owns the session
- leader.zk.setOwner(id, this);
- } catch (SessionExpiredException e) {
- LOG.error("Somehow session " + Long.toHexString(id) + " expired right after being renewed! (impossible)", e);
- }
- }
- if (LOG.isTraceEnabled()) {
- ZooTrace.logTraceMessage(LOG,
- ZooTrace.SESSION_TRACE_MASK,
- "Session 0x" + Long.toHexString(id)
- + " is valid: "+ valid);
- }
- dos.writeBoolean(valid);
- qp.setData(bos.toByteArray());
- queuedPackets.add(qp);
- break;
- //REQUEST包,事务请求,follower会将事务请求转发给leader处理
- case Leader.REQUEST:
- bb = ByteBuffer.wrap(qp.getData());
- sessionId = bb.getLong();
- cxid = bb.getInt();
- type = bb.getInt();
- bb = bb.slice();
- Request si;
- if(type == OpCode.sync){
- si = new LearnerSyncRequest(this, sessionId, cxid, type, bb, qp.getAuthinfo());
- } else {
- si = new Request(null, sessionId, cxid, type, bb, qp.getAuthinfo());
- }
- si.setOwner(this);
- leader.zk.submitRequest(si);
- break;
- default:
- }
- }
这个时候LearnerHandler线程已经启动完成,follower发完ACK包后
- writePacket(ack, true);
- //读超时为syncLimit时间
- sock.setSoTimeout(self.tickTime * self.syncLimit);
- //启动follower的zookeeper server
- zk.startup();
- // We need to log the stuff that came in between the snapshot and the uptodate
- if (zk instanceof FollowerZooKeeperServer) {
- FollowerZooKeeperServer fzk = (FollowerZooKeeperServer)zk;
- for(PacketInFlight p: packetsNotCommitted) {
- fzk.logRequest(p.hdr, p.rec);
- }
- for(Long zxid: packetsCommitted) {
- fzk.commit(zxid);
- }
- }
Follower的zookeeper server启动
- @Override
- protected void setupRequestProcessors() {
- RequestProcessor finalProcessor = new FinalRequestProcessor(this);
- commitProcessor = new CommitProcessor(finalProcessor,
- Long.toString(getServerId()), true);
- commitProcessor.start();
- firstProcessor = new FollowerRequestProcessor(this, commitProcessor);
- ((FollowerRequestProcessor) firstProcessor).start();
- syncProcessor = new SyncRequestProcessor(this,
- new SendAckRequestProcessor((Learner)getFollower()));
- syncProcessor.start();
- }
Follower进入主处理
- QuorumPacket qp = new QuorumPacket();
- while (self.isRunning()) {
- readPacket(qp);
- processPacket(qp);
- }
- protected void processPacket(QuorumPacket qp) throws IOException{
- switch (qp.getType()) {
- //PING包,写回session数据
- case Leader.PING:
- ping(qp);
- break;
- //PROPOSAL包,投票处理
- case Leader.PROPOSAL:
- TxnHeader hdr = new TxnHeader();
- Record txn = SerializeUtils.deserializeTxn(qp.getData(), hdr);
- if (hdr.getZxid() != lastQueued + 1) {
- LOG.warn("Got zxid 0x"
- + Long.toHexString(hdr.getZxid())
- + " expected 0x"
- + Long.toHexString(lastQueued + 1));
- }
- lastQueued = hdr.getZxid();
- fzk.logRequest(hdr, txn);
- break;
- //COMMIT包,提交事务
- case Leader.COMMIT:
- fzk.commit(qp.getZxid());
- break;
- case Leader.UPTODATE:
- LOG.error("Received an UPTODATE message after Follower started");
- break;
- case Leader.REVALIDATE:
- revalidate(qp);
- break;
- case Leader.SYNC:
- fzk.sync();
- break;
- }
- }
这个时候Follower也初始化完成,再看leader主线程,Leader主线程之前在等待follower同步结束,结束之后,leader主线程进入主循环,检查follower是否down掉
- while (true) {
- Thread.sleep(self.tickTime / 2);
- if (!tickSkip) {
- self.tick++;
- }
- HashSet<Long> syncedSet = new HashSet<Long>();
- // lock on the followers when we use it.
- syncedSet.add(self.getId());
- //检查每个follower是否还活着
- for (LearnerHandler f : getLearners()) {
- // Synced set is used to check we have a supporting quorum, so only
- // PARTICIPANT, not OBSERVER, learners should be used
- if (f.synced() && f.getLearnerType() == LearnerType.PARTICIPANT) {
- syncedSet.add(f.getSid());
- }
- f.ping();
- }
- //如果有follower挂掉导致投票不通过,则退出lead流程,重新选举
- if (!tickSkip && !self.getQuorumVerifier().containsQuorum(syncedSet)) {
- //if (!tickSkip && syncedCount < self.quorumPeers.size() / 2) {
- // Lost quorum, shutdown
- // TODO: message is wrong unless majority quorums used
- shutdown("Only " + syncedSet.size() + " followers, need "
- + (self.getVotingView().size() / 2));
- // make sure the order is the same!
- // the leader goes to looking
- return;
- }
- tickSkip = !tickSkip;
- }