Contents:
- Data synchronization and initialization (after leader election)
- Role-specific request processing (leader, follower, observer)
1. Data Synchronization and Initialization
After a leader is elected, the other roles can serve clients only once their data is in sync with the leader's.
Data synchronization between servers takes one of three basic forms:
- SNAP (snapshot: the leader ships its entire snapshot file)
- DIFF (the leader sends only the transactions the follower is missing)
- TRUNC (the follower truncates transactions the leader does not have)
1.1 Basic Flow
In step 3, the Follower replies to the Leader with an ACKEPOCH containing its current largest zxid; the Leader compares that zxid against its own minZxid and maxZxid.
The range [minZxid, maxZxid] corresponds to a transaction queue cached on the leader side.
In step 6, the NEWLEADER packet signals that synchronization is done, i.e. the Leader has sent the Follower everything that needed syncing.
How the three modes differ:
If the Follower's zxid is smaller than minZxid, the data gap between Leader and Follower is very large, so SNAP is used: the Follower simply receives the snapshot file the Leader sends.
If the Follower's zxid falls within [minZxid, maxZxid], DIFF is used: the Leader only needs to send the transactions in [zxid, maxZxid], which the Follower persists and applies to its in-memory database.
If the Follower's zxid is greater than maxZxid, TRUNC is used: the Follower deletes its transaction log entries beyond maxZxid.
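As a minimal sketch of this three-way decision (in the real code it lives in LearnerHandler on the leader side, where minZxid/maxZxid come from ZKDatabase's getminCommittedLog()/getmaxCommittedLog(); chooseSyncMode and SyncMode below are illustrative names, and the real logic also special-cases an exact zxid match):

enum SyncMode { SNAP, DIFF, TRUNC }

// Illustrative: which sync mode the leader picks for a given follower zxid
static SyncMode chooseSyncMode(long followerZxid, long minZxid, long maxZxid) {
    if (followerZxid < minZxid) {
        return SyncMode.SNAP;   // too far behind: ship the full snapshot
    } else if (followerZxid <= maxZxid) {
        return SyncMode.DIFF;   // inside the cached window: send the missing txns
    } else {
        return SyncMode.TRUNC;  // ahead of the leader: truncate back to maxZxid
    }
}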
1.2 Class Overview
Learner class
Learner is the common parent of Follower and Observer. Its most important fields are leaderIs and leaderOs, the input and output streams of the connection to the Leader.
LearnerHandler class (extends ZooKeeperThread)
Created on the Leader side, one per connected learner, to handle the QuorumPacket traffic with that learner.
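For orientation, leaderIs and leaderOs are jute binary archives wrapped around the socket to the leader. A sketch of the wiring, simplified from Learner (setupLeaderStreams is an illustrative name; the real code does this while connecting to the leader):

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.net.Socket;
import org.apache.jute.BinaryInputArchive;
import org.apache.jute.BinaryOutputArchive;

static void setupLeaderStreams(Socket sock) throws IOException {
    // Input archive: readPacket deserializes QuorumPackets from the leader
    BinaryInputArchive leaderIs = BinaryInputArchive.getArchive(
            new BufferedInputStream(sock.getInputStream()));
    // Output archive: writePacket serializes QuorumPackets to the leader
    BinaryOutputArchive leaderOs = BinaryOutputArchive.getArchive(
            new BufferedOutputStream(sock.getOutputStream()));
}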
1.3 Detailed Walkthrough
In QuorumPeer, once FastLeaderElection has finished, the main loop in QuorumPeer's run method dispatches on the server's state:
try {
/*
* Main loop
*/
while (running) {
switch (getPeerState()) {
case LOOKING:
LOG.info("LOOKING");
if (Boolean.getBoolean("readonlymode.enabled")) {
LOG.info("Attempting to start ReadOnlyZooKeeperServer");
// Create read-only server but don't start it immediately
final ReadOnlyZooKeeperServer roZk = new ReadOnlyZooKeeperServer(
logFactory, this,
new ZooKeeperServer.BasicDataTreeBuilder(),
this.zkDb);
Thread roZkMgr = new Thread() {
public void run() {
try {
// lower-bound grace period to 2 secs
sleep(Math.max(2000, tickTime));
if (ServerState.LOOKING.equals(getPeerState())) {
roZk.startup();
}
} catch (InterruptedException e) {
LOG.info("Interrupted while attempting to start ReadOnlyZooKeeperServer, not started");
} catch (Exception e) {
LOG.error("FAILED to start ReadOnlyZooKeeperServer", e);
}
}
};
try {
roZkMgr.start();
setBCVote(null);
setCurrentVote(makeLEStrategy().lookForLeader());
} catch (Exception e) {
LOG.warn("Unexpected exception",e);
setPeerState(ServerState.LOOKING);
} finally {
// If the thread is in the grace period, interrupt
// to come out of waiting.
roZkMgr.interrupt();
roZk.shutdown();
}
} else {
try {
setBCVote(null);
setCurrentVote(makeLEStrategy().lookForLeader());
} catch (Exception e) {
LOG.warn("Unexpected exception", e);
setPeerState(ServerState.LOOKING);
}
}
break;
case OBSERVING:
try {
LOG.info("OBSERVING");
setObserver(makeObserver(logFactory));
observer.observeLeader();
} catch (Exception e) {
LOG.warn("Unexpected exception",e );
} finally {
observer.shutdown();
setObserver(null);
setPeerState(ServerState.LOOKING);
}
break;
case FOLLOWING:
try {
LOG.info("FOLLOWING");
setFollower(makeFollower(logFactory));
follower.followLeader();
} catch (Exception e) {
LOG.warn("Unexpected exception",e);
} finally {
follower.shutdown();
setFollower(null);
setPeerState(ServerState.LOOKING);
}
break;
case LEADING:
LOG.info("LEADING");
try {
setLeader(makeLeader(logFactory));
leader.lead();
setLeader(null);
} catch (Exception e) {
LOG.warn("Unexpected exception",e);
} finally {
if (leader != null) {
leader.shutdown("Forcing shutdown");
setLeader(null);
}
setPeerState(ServerState.LOOKING);
}
break;
}
}
} finally {
LOG.warn("QuorumPeer main thread exited");
try {
MBeanRegistry.getInstance().unregisterAll();
} catch (Exception e) {
LOG.warn("Failed to unregister with JMX", e);
}
jmxQuorumBean = null;
jmxLocalPeerBean = null;
}
The loop calls getPeerState() to read the server's current state. In the FOLLOWING state it invokes followLeader(): before it can serve clients, the Follower must first register itself with the Leader.
Registration consists of three main steps:
- Call connectToLeader to establish a connection to the Leader.
- Call registerWithLeader to register with the Leader, exchanging sid, zxid and epoch information; the Leader uses this to decide the synchronization mode.
- Call syncWithLeader to synchronize transaction data with the Leader, handling the SNAP/DIFF/TRUNC packets.
connectToLeader: creates a Socket connection to the Leader. The method is defined in Follower's parent class Learner and has a retry mechanism that attempts the connection up to 5 times. Once connected, the Leader creates a LearnerHandler dedicated to exchanging QuorumPacket messages with this Follower.
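A minimal sketch of that bounded retry, condensed from Learner.connectToLeader (leaderAddr and connectTimeout stand in for the real address and the tickTime-based timeout):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

static Socket connectWithRetry(InetSocketAddress leaderAddr, int connectTimeout)
        throws IOException, InterruptedException {
    Socket sock = new Socket();
    for (int tries = 0; tries < 5; tries++) {
        try {
            sock.connect(leaderAddr, connectTimeout);
            return sock;                 // connected: stop retrying
        } catch (IOException e) {
            if (tries == 4) throw e;     // fifth attempt failed: give up
            sock = new Socket();         // fresh socket for the next attempt
            Thread.sleep(1000);          // brief backoff between attempts
        }
    }
    return sock; // not reached; satisfies the compiler
}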
registerWithLeader: first sends a FOLLOWERINFO packet telling the Leader who this Follower is (its zxid and sid), then waits for the Leader's LEADERINFO packet to obtain the Leader's epoch and zxid, updating the Follower's own values to match (the Leader's information wins).
Finally it sends an ACKEPOCH packet, telling the Leader that this Follower is now aligned with the Leader's zxid (the exchange is sketched below).
syncWithLeader: synchronizes data with the Leader, i.e. brings the Leader's transactions over to the Follower.
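A simplified, follower-side sketch of that registration exchange (condensed from Learner.registerWithLeader; the real code also packs a LearnerInfo with the sid and protocol version into the FOLLOWERINFO data, and epoch bytes into the ACKEPOCH data):

QuorumPacket info = new QuorumPacket(Leader.FOLLOWERINFO, self.getLastLoggedZxid(), null, null);
writePacket(info, true);                 // 1. announce our last zxid (and sid)

QuorumPacket leaderInfo = new QuorumPacket();
readPacket(leaderInfo);                  // 2. LEADERINFO carries the leader's epoch/zxid
long newEpoch = ZxidUtils.getEpochFromZxid(leaderInfo.getZxid());
self.setAcceptedEpoch(newEpoch);         // adopt the leader's epoch

QuorumPacket ack = new QuorumPacket(Leader.ACKEPOCH, self.getLastLoggedZxid(), null, null);
writePacket(ack, true);                  // 3. acknowledge the new epoch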
1.3.1 Reading the Sync Packet
syncWithLeader first reads the synchronization packet; the key code is as follows:
QuorumPacket qp = new QuorumPacket();
long newEpoch = ZxidUtils.getEpochFromZxid(newLeaderZxid);
// In the DIFF case we don't need to do a snapshot because the transactions will sync on top of any existing snapshot
// For SNAP and TRUNC the snapshot is needed to save that history
boolean snapshotNeeded = true;
readPacket(qp);
// zxids of packets that have been committed
LinkedList<Long> packetsCommitted = new LinkedList<Long>();
// proposals received but not yet committed
LinkedList<PacketInFlight> packetsNotCommitted = new LinkedList<PacketInFlight>();
synchronized (zk) {
// In DIFF mode, no snapshot needs to be taken
if (qp.getType() == Leader.DIFF) {
LOG.info("Getting a diff from the leader 0x{}", Long.toHexString(qp.getZxid()));
snapshotNeeded = false;
}
else if (qp.getType() == Leader.SNAP) {
LOG.info("Getting a snapshot from leader 0x" + Long.toHexString(qp.getZxid()));
// The leader is going to dump the database
// clear our own database and read
// Clear the database (minZxid and maxZxid reset to 0) and rebuild the DataTree
zk.getZKDatabase().clear();
zk.getZKDatabase().deserializeSnapshot(leaderIs);
String signature = leaderIs.readString("signature");
if (!signature.equals("BenWasHere")) {
LOG.error("Missing signature. Got " + signature);
throw new IOException("Missing signature");
}
// Record the most recent zxid
zk.getZKDatabase().setlastProcessedZxid(qp.getZxid());
} else if (qp.getType() == Leader.TRUNC) {
//we need to truncate the log to the lastzxid of the leader
LOG.warn("Truncating log to get in sync with the leader 0x"
+ Long.toHexString(qp.getZxid()));
boolean truncated=zk.getZKDatabase().truncateLog(qp.getZxid());
if (!truncated) {
// not able to truncate the log
LOG.error("Not able to truncate the log "
+ Long.toHexString(qp.getZxid()));
System.exit(13);
}
zk.getZKDatabase().setlastProcessedZxid(qp.getZxid());
}
<1> SNAP: snapshot mode; the Leader transfers its complete database to the Follower.
<2> TRUNC: truncate mode; the Follower has more data than the Leader, so to stay consistent it must delete its surplus transactions.
<3> DIFF: diff mode; the Follower has fewer transactions than the Leader and needs to be topped up, so the Leader turns the missing transactions into PROPOSAL and COMMIT packets for the Follower to apply.
1.3.2 Handling Subsequent Packets
syncWithLeader then keeps processing QuorumPacket messages such as PROPOSAL, COMMIT and NEWLEADER. A PROPOSAL here is a write request the Leader sent while the sync was in progress; it is buffered in packetsNotCommitted.
outerLoop:
while (self.isRunning()) {
readPacket(qp);
switch(qp.getType()) {
// A write proposed during sync: queue it until its COMMIT arrives
case Leader.PROPOSAL:
PacketInFlight pif = new PacketInFlight();
pif.hdr = new TxnHeader();
pif.rec = SerializeUtils.deserializeTxn(qp.getData(), pif.hdr);
if (pif.hdr.getZxid() != lastQueued + 1) {
LOG.warn("Got zxid 0x"
+ Long.toHexString(pif.hdr.getZxid())
+ " expected 0x"
+ Long.toHexString(lastQueued + 1));
}
lastQueued = pif.hdr.getZxid();
packetsNotCommitted.add(pif);
break;
// Before NEWLEADER, apply the committed txn directly; afterwards just record its zxid
case Leader.COMMIT:
if (!writeToTxnLog) {
pif = packetsNotCommitted.peekFirst();
if (pif.hdr.getZxid() != qp.getZxid()) {
LOG.warn("Committing " + qp.getZxid() + ", but next proposal is " + pif.hdr.getZxid());
} else {
zk.processTxn(pif.hdr, pif.rec);
packetsNotCommitted.remove();
}
} else {
packetsCommitted.add(qp.getZxid());
}
break;
case Leader.INFORM:
/*
* Only observer get this type of packet. We treat this
* as receiving PROPOSAL and COMMIT.
*/
PacketInFlight packet = new PacketInFlight();
packet.hdr = new TxnHeader();
packet.rec = SerializeUtils.deserializeTxn(qp.getData(), packet.hdr);
// Log warning message if txn comes out-of-order
if (packet.hdr.getZxid() != lastQueued + 1) {
LOG.warn("Got zxid 0x"
+ Long.toHexString(packet.hdr.getZxid())
+ " expected 0x"
+ Long.toHexString(lastQueued + 1));
}
lastQueued = packet.hdr.getZxid();
if (!writeToTxnLog) {
// Apply to db directly if we haven't taken the snapshot
zk.processTxn(packet.hdr, packet.rec);
} else {
packetsNotCommitted.add(packet);
packetsCommitted.add(qp.getZxid());
}
break;
// UPTODATE: sync finished; start serving clients and leave the loop
case Leader.UPTODATE:
if (isPreZAB1_0) {
zk.takeSnapshot();
self.setCurrentEpoch(newEpoch);
}
self.cnxnFactory.setZooKeeperServer(zk);
break outerLoop;
case Leader.NEWLEADER: // Getting NEWLEADER here instead of in discovery
File updating = new File(self.getTxnFactory().getSnapDir(),
QuorumPeer.UPDATING_EPOCH_FILENAME);
if (!updating.exists() && !updating.createNewFile()) {
throw new IOException("Failed to create " +
updating.toString());
}
if (snapshotNeeded) {
zk.takeSnapshot();
}
self.setCurrentEpoch(newEpoch);
if (!updating.delete()) {
throw new IOException("Failed to delete " +
updating.toString());
}
writeToTxnLog = true;
isPreZAB1_0 = false;
writePacket(new QuorumPacket(Leader.ACK, newLeaderZxid, null, null), true);
break;
}
}
}
After that, the Leader sends a NEWLEADER packet, and the Follower replies to it with an ACK.
Finally, the Leader sends an UPTODATE packet to signal that synchronization is complete; at that point the Follower starts its server, breaks out of the loop, and wraps up the registration process.
1.3.3 Follower Main Flow
Follower is a subclass of Learner; its entry point is followLeader.
// Locate the elected leader
QuorumServer leaderServer = findLeader();
try {
// Connect to the leader (up to 5 attempts)
connectToLeader(leaderServer.addr, leaderServer.hostname);
// Register as a follower and obtain the new epoch zxid
long newEpochZxid = registerWithLeader(Leader.FOLLOWERINFO);
//check to see if the leader zxid is lower than ours
//this should never happen but is just a safety check
long newEpoch = ZxidUtils.getEpochFromZxid(newEpochZxid);
if (newEpoch < self.getAcceptedEpoch()) {
LOG.error("Proposed leader epoch " + ZxidUtils.zxidToString(newEpochZxid)
+ " is less than our accepted epoch " + ZxidUtils.zxidToString(self.getAcceptedEpoch()));
throw new IOException("Error: Epoch of leader is lower");
}
// Synchronize data with the leader
syncWithLeader(newEpochZxid);
QuorumPacket qp = new QuorumPacket();
while (this.isRunning()) {
readPacket(qp);
processPacket(qp);
}
At startup the Follower first synchronizes data with the Leader and then runs FollowerZooKeeperServer; alongside it, the while loop above keeps reading QuorumPacket messages and hands each one to processPacket.
processPacket handles the QuorumPackets the Leader sends over the quorum connection. It mainly responds to two types, PROPOSAL and COMMIT (there are also PING, COMMITANDACTIVATE and other packet types). A PROPOSAL carries a write transaction the Leader intends to execute; a COMMIT orders its commit. The Follower only lets a PROPOSAL's content take effect after receiving the corresponding COMMIT.
The same write transaction is thus executed once on the Leader and once on every Follower, keeping the cluster's data consistent.
case Leader.PROPOSAL:
TxnHeader hdr = new TxnHeader();
Record txn = SerializeUtils.deserializeTxn(qp.getData(), hdr);
if (hdr.getZxid() != lastQueued + 1) {
LOG.warn("Got zxid 0x"
+ Long.toHexString(hdr.getZxid())
+ " expected 0x"
+ Long.toHexString(lastQueued + 1));
}
lastQueued = hdr.getZxid();
fzk.logRequest(hdr, txn);
break;
case Leader.COMMIT:
fzk.commit(qp.getZxid());
break;
On PROPOSAL, the Follower calls FollowerZooKeeperServer's logRequest method; on COMMIT, it calls FollowerZooKeeperServer's commit method (both are sketched below).
PROPOSAL packet
The write-request packet the Leader sends to every follower in the cluster. When the Leader executes a write it must tell the Learners so that they execute it too, keeping the cluster's data consistent. PROPOSALs are processed strictly in order, which is one of ZooKeeper's core design principles.
COMMIT packet
Once the Leader considers a Proposal persisted by a majority of Followers and ready to execute, it sends a COMMIT packet telling each Follower it may commit that Proposal; the request finally reaches FinalRequestProcessor, which performs the write. This mechanism guarantees that every write is executed by a majority of the cluster's machines.
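A simplified sketch of how FollowerZooKeeperServer pairs the two packets (condensed from its logRequest and commit methods; the real code also checks that the committed zxid matches the head of pendingTxns and logs an error otherwise):

// Condensed from FollowerZooKeeperServer; error handling omitted
public void logRequest(TxnHeader hdr, Record txn) {
    Request request = new Request(null, hdr.getClientId(), hdr.getCxid(),
            hdr.getType(), null, null);
    request.hdr = hdr;
    request.txn = txn;
    request.zxid = hdr.getZxid();
    pendingTxns.add(request);               // hold it until the COMMIT arrives
    syncProcessor.processRequest(request);  // persist to the txn log, then ACK the leader
}

public void commit(long zxid) {
    Request request = pendingTxns.remove(); // must be the proposal with this zxid
    commitProcessor.commit(request);        // hand over for final application
}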
1.3.4 Observer Main Flow
Observer works much like Follower; the main difference is that it does not take part in leader election.
Observer's entry method is observeLeader. When QuorumPeer's state is OBSERVING, it starts the Observer and calls observeLeader.
observeLeader resembles Follower's followLeader: it first registers with the Leader, and after the transaction sync it enters the QuorumPacket processing loop, calling processPacket on each packet.
The Observer's processPacket is much simpler than the Follower's; its main job is handling INFORM packets, which carry the Leader's write commands.
The key difference is the INFORM message: when the Leader broadcasts a write transaction it sends Followers a PROPOSAL and waits for their acknowledgements, whereas it sends Observers an INFORM and requires no ACK in reply. A sketch of the INFORM handling follows.
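A simplified sketch of that handling (condensed from Observer.processPacket; an INFORM is deserialized just like a PROPOSAL but committed immediately, with no pending queue and no ACK):

case Leader.INFORM:
    // Deserialize the txn exactly as a PROPOSAL would be
    TxnHeader hdr = new TxnHeader();
    Record txn = SerializeUtils.deserializeTxn(qp.getData(), hdr);
    Request request = new Request(null, hdr.getClientId(), hdr.getCxid(),
            hdr.getType(), null, null);
    request.hdr = hdr;
    request.txn = txn;
    // Commit straight away: observers keep no pending queue and send no ACK
    ObserverZooKeeperServer obs = getObserverZooKeeperServer();
    obs.commitRequest(request);
    break;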