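Overview
The earlier article Zookeeper (5) - Server Standalone Mode - Startup Flow analyzed the server startup flow, but left out what happens in cluster mode after leader election. This section picks up from there and covers data synchronization after the election and the startup of ZooKeeperServer.
Data synchronization
1. Entry point
In cluster mode, once leader election finishes, each server runs different logic depending on its current state (LEADING/FOLLOWING/OBSERVING):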
case FOLLOWING:
......
LOG.info("FOLLOWING");
setFollower(makeFollower(logFactory));
follower.followLeader();
......
break;
case LEADING:
......
setLeader(makeLeader(logFactory));
leader.lead();
setLeader(null);
......
break;
- The analysis below focuses on FOLLOWING and LEADING; OBSERVING is similar to FOLLOWING and somewhat simpler.
2. Leader.lead
void lead() throws IOException, InterruptedException {
......
// Load data from the log files on disk
zk.loadData();
......
// Start the thread that waits for connection requests from new followers
cnxAcceptor = new LearnerCnxAcceptor();
cnxAcceptor.start();
......
// getEpochToPropose is also called from LearnerHandler
long epoch = getEpochToPropose(self.getId(), self.getAcceptedEpoch());
......
// waitForEpochAck is also called from LearnerHandler
waitForEpochAck(self.getId(), leaderStateSummary);
......
// waitForNewLeaderAck is also called from LearnerHandler
waitForNewLeaderAck(self.getId(), zk.getZxid(), LearnerType.PARTICIPANT);
......
// Start ZooKeeperServer and set the zk state to running
startZkServer();
......
}
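- zk.loadData(): loads the DataTree from the log files on disk; see Zookeeper (3) - Persistence. The point to note here is the callback PlayBackListener.onTxnLoaded registered in ZKDatabase;
- cnxAcceptor.start(): starts the acceptor thread, which creates one LearnerHandler thread per follower to handle data synchronization between that follower and the leader; this is the key part;
- getEpochToPropose/waitForEpochAck/waitForNewLeaderAck: the interaction between followers and the leader during data synchronization. Each step needs the approval of more than half of the servers; the calls here add the leader's own vote, and the LearnerHandler analysis below covers them in detail;
- startZkServer(): starts ZooKeeperServer; see Zookeeper (5) - Server Standalone Mode - Startup Flow.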
2.1 ZKDatabase.addCommittedProposal
protected LinkedList<Proposal> committedLog = new LinkedList<Proposal>();
public void addCommittedProposal(Request request) {
WriteLock wl = logLock.writeLock();
try {
// Acquire the write lock
wl.lock();
// committedLog is bounded by commitLogCount (500)
if (committedLog.size() > commitLogCount) {
committedLog.removeFirst();
minCommittedLog = committedLog.getFirst().packet.getZxid();
}
if (committedLog.size() == 0) {
minCommittedLog = request.zxid;
maxCommittedLog = request.zxid;
}
ByteArrayOutputStream baos = new ByteArrayOutputStream();
BinaryOutputArchive boa = BinaryOutputArchive.getArchive(baos);
try {
request.hdr.serialize(boa, "hdr");
if (request.txn != null) {
request.txn.serialize(boa, "txn");
}
baos.close();
} catch (IOException e) {
LOG.error("This really should be impossible", e);
}
QuorumPacket pp = new QuorumPacket(Leader.PROPOSAL, request.zxid,
baos.toByteArray(), null);
Proposal p = new Proposal();
p.packet = pp;
p.request = request;
// Append the Proposal to committedLog
committedLog.add(p);
maxCommittedLog = p.packet.getZxid();
} finally {
wl.unlock();
}
}
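- This method is called back while the txnLog is loaded, once for every txnLog entry;
- committedLog holds the deserialized entries wrapped as Proposal objects, bounded at 500 entries (commitLogCount);
- minCommittedLog is the smallest zxid in committedLog; maxCommittedLog is the largest zxid in committedLog.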
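The bookkeeping described above can be sketched in isolation. The class below is a hypothetical, simplified model (the name CommittedLogSketch and its methods are not part of ZooKeeper); it tracks only zxids and follows the same order of operations as addCommittedProposal: evict the head when the bound is exceeded, then append and update the maximum.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical, simplified model of the committedLog bookkeeping; zxids only. */
final class CommittedLogSketch {
    private final int capacity;                 // plays the role of commitLogCount
    private final Deque<Long> zxids = new ArrayDeque<>();
    private long minCommittedLog;
    private long maxCommittedLog;

    CommittedLogSketch(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be >= 1");
        }
        this.capacity = capacity;
    }

    /** Same order of operations as addCommittedProposal in the excerpt above. */
    synchronized void add(long zxid) {
        if (zxids.size() > capacity) {
            // Evict the oldest entry; the new head is the new minimum
            zxids.removeFirst();
            minCommittedLog = zxids.getFirst();
        }
        if (zxids.isEmpty()) {
            minCommittedLog = zxid;
            maxCommittedLog = zxid;
        }
        zxids.addLast(zxid);
        maxCommittedLog = zxid;
    }

    synchronized long min() { return minCommittedLog; }
    synchronized long max() { return maxCommittedLog; }
    synchronized int size() { return zxids.size(); }
}
```

Because the size check uses `>` and runs before the append, this sketch settles at capacity + 1 entries once it is full: with a capacity of 3 and zxids 1 through 5 added in order, it holds 2, 3, 4, 5.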
3. Follower.followLeader
void followLeader() throws InterruptedException {
......
// Find the leader
QuorumServer leaderServer = findLeader();
// Open a connection to the leader
connectToLeader(leaderServer.addr, leaderServer.hostname);
// Register with the leader by sending FOLLOWERINFO
long newEpochZxid = registerWithLeader(Leader.FOLLOWERINFO);
long newEpoch = ZxidUtils.getEpochFromZxid(newEpochZxid);
if (newEpoch < self.getAcceptedEpoch()) {
LOG.error("Proposed leader epoch " + ZxidUtils.zxidToString(newEpochZxid)
+ " is less than our accepted epoch " + ZxidUtils.zxidToString(self.getAcceptedEpoch()));
throw new IOException("Error: Epoch of leader is lower");
}
// Synchronize data with the leader
syncWithLeader(newEpochZxid);
QuorumPacket qp = new QuorumPacket();
while (this.isRunning()) {
// After startup, block waiting for packets sent by the leader
readPacket(qp);
processPacket(qp);
}
......
}
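- findLeader(): looks up the leader's QuorumServer in the configured servers;
- connectToLeader(leaderServer.addr, leaderServer.hostname): opens a connection to the leader and initializes leaderIs and leaderOs;
- registerWithLeader(Leader.FOLLOWERINFO): sends this follower's sid to the leader, registering the follower with it;
- newEpoch < self.getAcceptedEpoch(): the leader's epoch is lower than the epoch this follower has already accepted (note that the comparison is on the epoch, not the zxid), so an exception is thrown;
- syncWithLeader(newEpochZxid): synchronizes data with the leader;
- readPacket(qp)/processPacket(qp): once synchronization finishes, startup is complete and the follower blocks waiting for packets from the leader, such as ping.
The following sections walk through the communication between the leader and the follower in detail.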
4. Leader starts listening for synchronization connections (LearnerCnxAcceptor.run)
public void run() {
......
// Wait for a follower to connect on the quorum port (2888 in the usual configuration)
Socket s = ss.accept();
s.setSoTimeout(self.tickTime * self.initLimit);
s.setTcpNoDelay(nodelay);
BufferedInputStream is = new BufferedInputStream(s.getInputStream());
LearnerHandler fh = new LearnerHandler(s, is, Leader.this);
// Create one LearnerHandler thread per follower
fh.start();
......
}
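- Socket s = ss.accept(): blocks until a follower connects; the leader listens on the quorum port from the server configuration, 2888 by convention;
- new BufferedInputStream(s.getInputStream()): wraps the socket's InputStream, ready to receive data on that socket;
- new LearnerHandler(s, is, Leader.this): builds a LearnerHandler from the socket, the input stream, and the leader;
- fh.start(): starts one LearnerHandler thread per follower, which handles all communication with that follower.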
4.1. Leader sets up the input stream (LearnerHandler.run)
public void run() {
// Wait for the follower to send requests
ia = BinaryInputArchive.getArchive(bufferedInput);
bufferedOutput = new BufferedOutputStream(sock.getOutputStream());
oa = BinaryOutputArchive.getArchive(bufferedOutput);
......
}
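To make the role of the input and output archives more concrete, here is a minimal sketch of a packet round trip over plain java.io streams. It is an illustration only: PacketCodecSketch and its Packet type are hypothetical, and the layout (int type, long zxid, length-prefixed data with -1 for null) is an assumption chosen for the example rather than a statement of the exact wire format used by BinaryOutputArchive.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical packet codec; the field layout is an assumption for illustration. */
final class PacketCodecSketch {
    static final class Packet {
        final int type;
        final long zxid;
        final byte[] data;

        Packet(int type, long zxid, byte[] data) {
            this.type = type;
            this.zxid = zxid;
            this.data = data;
        }
    }

    /** Writes type, zxid, then the data length (-1 for null) and the data bytes. */
    static void write(DataOutputStream out, Packet p) throws IOException {
        out.writeInt(p.type);
        out.writeLong(p.zxid);
        if (p.data == null) {
            out.writeInt(-1);
        } else {
            out.writeInt(p.data.length);
            out.write(p.data);
        }
        out.flush();
    }

    /** Reads the fields back in the same order they were written. */
    static Packet read(DataInputStream in) throws IOException {
        int type = in.readInt();
        long zxid = in.readLong();
        int len = in.readInt();
        byte[] data = null;
        if (len >= 0) {
            data = new byte[len];
            in.readFully(data);
        }
        return new Packet(type, zxid, data);
    }

    /** Serializes to an in-memory buffer and deserializes again. */
    static Packet roundTrip(Packet p) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            write(new DataOutputStream(buffer), p);
            return read(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the real code the two ends of this round trip sit on opposite sides of the socket: one side writes through its output archive and the other reads through its input archive.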
5. Follower connects to the Leader (Learner.connectToLeader)
protected void connectToLeader(InetSocketAddress addr, String hostname)
throws IOException, ConnectException, InterruptedException {
sock = new Socket();
sock.setSoTimeout(self.tickTime * self.initLimit);
// Try to connect at most 5 times
for (int tries = 0; tries < 5; tries++) {
try {
sock.connect(addr, self.tickTime * self.syncLimit);
sock.setTcpNoDelay(nodelay);
break;
}
......
}
Thread.sleep(1000);
}
self.authLearner.authenticate(sock, hostname);
// Input archive: reads and deserializes the data sent by the leader
leaderIs = BinaryInputArchive.getArchive(new BufferedInputStream(sock.getInputStream()));
bufferedOutput = new BufferedOutputStream(sock.getOutputStream());
// Output archive: serializes data and sends it to the leader
leaderOs = BinaryOutputArchive.getArchive(bufferedOutput);
}
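- sock.connect: connects to the leader, trying at most 5 times;
- leaderIs: the input archive; receives the stream from the leader and deserializes it;
- leaderOs: the output archive; serializes records and writes them to the stream going to the leader.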
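The bounded retry in connectToLeader follows a common pattern that can be sketched on its own. The helper below is hypothetical (ConnectRetrySketch and retry are names made up for this example) and generic over the attempted action, so it can be exercised without a real socket; the sleep interval is a parameter here, whereas the excerpt above sleeps for a fixed 1000 ms.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper showing a bounded retry loop; not ZooKeeper code. */
final class ConnectRetrySketch {
    /** Runs the attempt up to maxTries times, sleeping between failed attempts. */
    static <T> T retry(int maxTries, long sleepMillis, Callable<T> attempt) {
        if (maxTries < 1) {
            throw new IllegalArgumentException("maxTries must be >= 1");
        }
        Exception last = null;
        for (int tries = 0; tries < maxTries; tries++) {
            try {
                return attempt.call();
            } catch (Exception e) {
                last = e;
                if (tries == maxTries - 1) {
                    break; // no point in sleeping after the final attempt
                }
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while retrying", ie);
                }
            }
        }
        throw new IllegalStateException("gave up after " + maxTries + " tries", last);
    }
}
```

A caller would pass the connection attempt as the Callable, for example a lambda that opens a socket and returns it; the last failure is kept as the cause of the final exception so the reason for giving up is not lost.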
6. Follower sends FOLLOWERINFO (Learner.registerWithLeader)
protected long registerWithLeader(int pktType) throws IOException{
long lastLoggedZxid = self.getLastLoggedZxid();
QuorumPacket qp = new QuorumPacket();
qp.setType(pktType);
qp.setZxid(ZxidUtils.makeZxid(self.getAcceptedEpoch(), 0));
LearnerInfo li = new LearnerInfo(self.getId(), 0x10000);
ByteArrayOutputStream bsid = new ByteArrayOutputStream();
BinaryOutputArchive boa = BinaryOutputArchive.getArchive(bsid);
boa.writeRecord(li, "LearnerInfo");
......
}
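- boa.writeRecord: serializes the LearnerInfo that becomes the data of the QuorumPacket; the packet itself is sent in the elided part of the method;
- Data sent:
type: FOLLOWERINFO;
zxid: acceptedEpoch in the epoch part, with the counter part set to 0;
data: LearnerInfo;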
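Since the zxid in this packet is built with ZxidUtils.makeZxid(acceptedEpoch, 0), it helps to see how an epoch and a counter share one long. The sketch below assumes the usual layout, with the epoch in the upper 32 bits and the counter in the lower 32 bits; ZxidSketch is a hypothetical stand-in written for this example, not the ZxidUtils class itself.

```java
/** Hypothetical stand-in for zxid packing; assumes epoch in the high 32 bits. */
final class ZxidSketch {
    /** Packs an epoch and a counter into a single 64-bit zxid. */
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    /** Extracts the epoch from the upper 32 bits. */
    static long getEpochFromZxid(long zxid) {
        return zxid >> 32;
    }

    /** Extracts the counter from the lower 32 bits. */
    static long getCounterFromZxid(long zxid) {
        return zxid & 0xffffffffL;
    }

    public static void main(String[] args) {
        long zxid = makeZxid(5, 0);
        System.out.println("zxid = 0x" + Long.toHexString(zxid));
        System.out.println("epoch = " + getEpochFromZxid(zxid));
    }
}
```

Under this layout, a zxid built from an epoch and a counter of 0 sorts below every transaction zxid of the same epoch, which is why comparing epochs and comparing zxids are different checks.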
7. Leader handles FOLLOWERINFO (LearnerHandler.run)
public void run() {
......
QuorumPacket qp = new QuorumPacket();
// Read the first message: FOLLOWERINFO
ia.readRecord(qp, "packet");
// Anything other than FOLLOWERINFO or OBSERVERINFO: return immediately
if(qp.getType() != Leader.FOLLOWERINFO && qp.getType() != Leader.OBSERVERINFO){
LOG.error("First packet " + qp.toString() + " is not FOLLOWERINFO or OBSERVERINFO!");
return;
}
byte learnerInfoData[] = qp.getData();
if (learnerInfoData != null) {
// Legacy format; can be ignored
if (learnerInfoData.length == 8) {
ByteBuffer bbsid = ByteBuffer.wrap(learnerInfoData);
this.sid = bbsid.getLong();
} else {
LearnerInfo li = new LearnerInfo();
// Deserialize LearnerInfo
ByteBufferInputStream.byteBuffer2Record(ByteBuffer.wrap(learnerInfoData), li);
this.sid = li.getServerid();
this.version = li.getProtocolVersion();
}
} else {
this.sid = leader.followerCounter.getAndDecrement();
}
LOG.info("Follower sid: " + sid + " : info : " + leader.self.quorumPeers.get(sid));
if (qp.getType() == Leader.OBSERVERINFO) {
learnerType = LearnerType.OBSERVER;
}
// Extract the epoch from the zxid carried in qp
long lastAcceptedEpoch = ZxidUtils.getEpochFromZxid(qp.getZxid());
long peerLastZxid;
StateSummary ss = null;
long zxid = qp.getZxid();
// Returns the newest epoch to propose
long newEpoch = leader.getEpochToPropose(this.getSid(), lastAcceptedEpoch);
// Legacy protocol version; can be ignored
if (this.getVersion() < 0x10000) {
// we are going to have to extrapolate the epoch information
long epoch = ZxidUtils.getEpochFromZxid(zxid);
ss = new StateSummary(epoch, zxid);
// fake the message
leader.waitForEpochAck(this.getSid(), ss);
} else {
byte ver[] = new byte[4];
ByteBuffer.wrap(ver).putInt(0x10000);
QuorumPacket newEpochPacket = new QuorumPacket(Leader.LEADERINFO, ZxidUtils.makeZxid(newEpoch, 0), ver, null);
// Once more than half of the servers have registered, reply to the follower with LEADERINFO
oa.writeRecord(newEpochPacket, "packet");
bufferedOutput.flush();
......
}
- ia.readRecord(qp, "packet"): deserializes the QuorumPacket;
- lastAcceptedEpoch: the follower's epoch, extracted from the zxid the follower sent;
- leader.getEpochToPropose(this.getSid(), lastAcceptedEpoch): uses wait/notify to block until FOLLOWERINFO has arrived from more than half of the servers, then returns the largest follower epoch + 1 as newEpoch;
- oa.writeRecord(newEpochPacket, "packet"): replies to the follower with LEADERINFO.
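The wait/notify step can be sketched as a small monitor. This is a simplified model under stated assumptions, not the ZooKeeper implementation: EpochProposalSketch is a hypothetical class, the quorum is a plain majority of a fixed ensemble size, and the requirement that the leader itself be among the registered servers is left out.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified model of agreeing on a new epoch with wait/notify. */
final class EpochProposalSketch {
    private final int ensembleSize;
    private final long timeoutMillis;
    private final Set<Long> connecting = new HashSet<>();
    private long epoch = -1;
    private boolean waitingForNewEpoch = true;

    EpochProposalSketch(int ensembleSize, long timeoutMillis) {
        this.ensembleSize = ensembleSize;
        this.timeoutMillis = timeoutMillis;
    }

    /** Blocks until a majority has registered, then returns max accepted epoch + 1. */
    synchronized long getEpochToPropose(long sid, long lastAcceptedEpoch) {
        if (!waitingForNewEpoch) {
            return epoch; // already decided; late callers just read the result
        }
        if (lastAcceptedEpoch >= epoch) {
            epoch = lastAcceptedEpoch + 1;
        }
        connecting.add(sid);
        if (connecting.size() > ensembleSize / 2) {
            // Majority reached: fix the epoch and wake every waiting caller
            waitingForNewEpoch = false;
            notifyAll();
            return epoch;
        }
        long deadline = System.currentTimeMillis() + timeoutMillis;
        try {
            while (waitingForNewEpoch) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    throw new IllegalStateException("timed out waiting for a quorum");
                }
                wait(remaining);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a quorum", e);
        }
        return epoch;
    }
}
```

With an ensemble of 3, two callers reporting accepted epochs 5 and 3 both get 6 back regardless of which arrives first, and a third caller arriving later also gets 6 because the epoch has already been fixed.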
8. Follower handles LEADERINFO (Learner.registerWithLeader)
protected long registerWithLeader(int pktType) throws IOException{
......
// Block until the leader replies with LEADERINFO
readPacket(qp);
final long newEpoch = ZxidUtils.getEpochFromZxid(qp.getZxid());
if (qp.getType() == Leader.LEADERINFO) {
// we are connected to a 1.0 server so accept the new epoch and read the next packet
leaderProtocolVersion = ByteBuffer.wrap(qp.getData()).getInt();
byte epochBytes[] = new byte[4];
final ByteBuffer wrappedEpochBytes = ByteBuffer.wrap(epochBytes);
if (newEpoch > self.getAcceptedEpoch()) {
wrappedEpochBytes.putInt((int)self.getCurrentEpoch());
self.setAcceptedEpoch(newEpoch);
} else if (newEpoch == self.getAcceptedEpoch()) {
wrappedEpochBytes.putInt(-1);
} else {
throw new IOException("Leaders epoch, " + newEpoch + " is less than accepted epoch, " + self.getAcceptedEpoch());
}
QuorumPacket ackNewEpoch = new QuorumPacket(Leader.ACKEPOCH, lastLoggedZxid, epochBytes, null);
// Send ACKEPOCH
writePacket(ackNewEpoch, true);
return ZxidUtils.makeZxid(newEpoch, 0);
......
}
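- readPacket(qp): blocks until the leader replies with LEADERINFO;
- newEpoch > self.getAcceptedEpoch(): the epoch returned by the leader is newer than the accepted epoch, so the follower replies with its currentEpoch and updates its acceptedEpoch to the leader's epoch;
- newEpoch == self.getAcceptedEpoch(): replies to the leader with -1;
- writePacket(ackNewEpoch, true): sends ACKEPOCH to the leader;
- ackNewEpoch:
type: ACKEPOCH;
zxid: the follower's current zxid (lastLoggedZxid);
data: currentEpoch or -1.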
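The three-way comparison above can be isolated as a pure function. The sketch below is hypothetical (AckEpochSketch and Decision are names invented for this example) and uses an unchecked exception where the excerpt throws IOException, so that it stays easy to call.

```java
/** Hypothetical model of the follower's reaction to LEADERINFO; not ZooKeeper code. */
final class AckEpochSketch {
    /** What goes into the ACKEPOCH data, and the accepted epoch afterwards. */
    static final class Decision {
        final int ackPayload;
        final long acceptedEpoch;

        Decision(int ackPayload, long acceptedEpoch) {
            this.ackPayload = ackPayload;
            this.acceptedEpoch = acceptedEpoch;
        }
    }

    static Decision onLeaderInfo(long newEpoch, long acceptedEpoch, long currentEpoch) {
        if (newEpoch > acceptedEpoch) {
            // Newer epoch: report currentEpoch and accept the leader's epoch
            return new Decision((int) currentEpoch, newEpoch);
        }
        if (newEpoch == acceptedEpoch) {
            // Same epoch: reply with -1 and leave the accepted epoch unchanged
            return new Decision(-1, acceptedEpoch);
        }
        throw new IllegalStateException("Leader's epoch " + newEpoch
                + " is less than accepted epoch " + acceptedEpoch);
    }
}
```

Keeping the decision separate from the socket I/O makes each branch easy to check on its own: a newer epoch is accepted, an equal epoch is acknowledged with -1, and an older epoch is rejected.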
9. Leader handles ACKEPOCH (LearnerHandler.run)
public void run() {
......
QuorumPacket ackEpochPacket = new QuorumPacket();
// After sending LEADERINFO, read the follower's ACKEPOCH
ia.readRecord(ackEpochPacket, "packet");
if (ackEpochPacket.getType() != Leader.ACKEPOCH) {
LOG.error(ackEpochPacket.toString() + " is not ACKEPOCH");
return;
}
ByteBuffer bbepoch = ByteBuffer.wrap(ackEpochPacket.getData());
ss = new StateSummary(bbepoch.getInt(), ackEpochPacket.getZxid());
// Wait until ACKEPOCH has been received from more than half of the servers
leader.waitForEpochAck(this.getSid(), ss);
}
peerLastZxid = ss.getLastZxid();
/* the default to send to the follower */
// The default is SNAP
int packetToSend = Leader.SNAP;
long zxidToSend = 0;
long leaderLastZxid = 0;
/** the packets that the follower needs to get updates from **/
long updates = peerLastZxid;
ReentrantReadWriteLock lock = leader.zk.getZKDatabase().getLogLock();
ReadLock rl = lock.readLock();
try {
// Acquire the read lock
rl.lock();
// Largest zxid in committedLog
final long maxCommittedLog = leader.zk.getZKDatabase().getmaxCommittedLog();
// Smallest zxid in committedLog, i.e. the zxid at its head
final long minCommittedLog = leader.zk.getZKDatabase().getminCommittedLog();
// committedLog
LinkedList<Proposal> proposals = leader.zk.getZKDatabase().getCommittedLog();
// The follower is already fully in sync with the leader: send an empty DIFF
if (peerLastZxid == leader.zk.getZKDatabase().getDataTreeLastProcessedZxid()) {
packetToSend = Leader.DIFF;
zxidToSend = peerLastZxid;
} else if (proposals.size() != 0) {
LOG.debug("proposal size is {}", proposals.size());
// The follower's zxid lies between the leader's min and max zxid (inclusive): sync with DIFF
if ((maxCommittedLog >= peerLastZxid) && (minCommittedLog <= peerLastZxid)) {
long prevProposalZxid = minCommittedLog;
boolean firstPacket=true;
packetToSend = Leader.DIFF;
zxidToSend = maxCommittedLog;
for (Proposal propose: proposals) {
// skip the proposals the peer already has
// Skip the proposals in committedLog whose zxid is not greater than the follower's zxid
if (propose.packet.getZxid() <= peerLastZxid) {
prevProposalZxid = propose.packet.getZxid();
continue;
} else {
if (firstPacket) {
firstPacket = false;
// Does the peer have some proposals that the leader hasn't seen yet
// On the first packet to send: if the last skipped zxid is smaller than the follower's zxid, the follower holds a zxid that the leader does not have, so send TRUNC
if (prevProposalZxid < peerLastZxid) {
// send a trunc message before sending the diff
packetToSend = Leader.TRUNC;
zxidToSend = prevProposalZxid;
updates = zxidToSend;
}
}
// PROPOSAL
queuePacket(propose.packet);
QuorumPacket qcommit = new QuorumPacket(Leader.COMMIT, propose.packet.getZxid(),null, null);
// COMMIT: each PROPOSAL is queued together with one COMMIT
queuePacket(qcommit);
}
}
} else if (peerLastZxid > maxCommittedLog) {
// The follower's zxid is beyond the leader's max zxid: reply with TRUNC
packetToSend = Leader.TRUNC;
zxidToSend = maxCommittedLog;
updates = zxidToSend;
}
}
// TODO: analyzed later, together with the request-processing flow
leaderLastZxid = leader.startForwarding(this, updates);
} finally {
rl.unlock();
}
QuorumPacket newLeaderQP = new QuorumPacket(Leader.NEWLEADER,
ZxidUtils.makeZxid(newEpoch, 0), null, null);
if (getVersion() < 0x10000) {
oa.writeRecord(newLeaderQP, "packet");
} else {
// Add NEWLEADER to queuedPackets
queuedPackets.add(newLeaderQP);
}
bufferedOutput.flush();
//Need to set the zxidToSend to the latest zxid
if (packetToSend == Leader.SNAP) {
zxidToSend = leader.zk.getZKDatabase().getDataTreeLastProcessedZxid();
}
// Send SNAP, DIFF, or TRUNC
oa.writeRecord(new QuorumPacket(packetToSend, zxidToSend, null, null), "packet");
bufferedOutput.flush();
/* if we are not truncating or sending a diff just send a snapshot */
if (packetToSend == Leader.SNAP) {
LOG.info("Sending snapshot last zxid of peer is 0x"
+ Long.toHexString(peerLastZxid) + " "
+ " zxid of leader is 0x"
+ Long.toHexString(leaderLastZxid)
+ "sent zxid of db as 0x"
+ Long.toHexString(zxidToSend));
// Dump data to peer
// Serialize the snapshot
leader.zk.getZKDatabase().serializeSnapshot(oa);
oa.writeString("BenWasHere", "signature");
}
bufferedOutput.flush();
// Start sending packets
new Thread() {
public void run() {
Thread.currentThread().setName(
"Sender-" + sock.getRemoteSocketAddress());
try {
// This new thread sends the queued packets
sendPackets();
} catch (InterruptedException e) {
LOG.warn("Unexpected interruption",e);
}
}
}.start();
......
}
- ia.readRecord(ackEpochPacket, "packet"): receives the ACKEPOCH sent by the follower;
- leader.waitForEpochAck(this.getSid(), ss): blocks until ACKEPOCH has been received from more than half of the servers;
- rl.lock(): takes the ZKDatabase read lock to keep out concurrent writes (addCommittedProposal, analyzed earlier, takes the write lock);
- leader.zk.getZKDatabase().getCommittedLog(): the committedLog maintained in 2.1, holding the txnLog entries deserialized and wrapped as Proposal objects;
- peerLastZxid == leader.zk.getZKDatabase().getDataTreeLastProcessedZxid(): the follower is already fully in sync with the leader, so an empty DIFF is sent;
- (maxCommittedLog >= peerLastZxid) && (minCommittedLog <= peerLastZxid): the follower's zxid lies between the leader's min and max zxid (inclusive), so DIFF synchronization is used;
- propose.packet.getZxid() <= peerLastZxid: skips the proposals in committedLog that the follower already has;
- queuePacket(propose.packet): queues the PROPOSAL; the packet is the one built in 2.1 by QuorumPacket pp = new QuorumPacket(Leader.PROPOSAL, request.zxid, baos.toByteArray(), null);
- queuePacket(qcommit): each PROPOSAL is followed by a COMMIT for the same zxid;
- peerLastZxid > maxCommittedLog: the follower's zxid is beyond the leader's max zxid; the follower has to discard the excess, so TRUNC is returned;
- leader.startForwarding(this, updates): analyzed later, together with the request-processing flow;
- queuedPackets.add(newLeaderQP): adds NEWLEADER to queuedPackets;
- oa.writeRecord(new QuorumPacket(packetToSend, zxidToSend, null, null), "packet"): sends a QuorumPacket of type SNAP, TRUNC, or DIFF;
- leader.zk.getZKDatabase().serializeSnapshot(oa): for SNAP, serializes and sends the snapshot;
- sendPackets(): a new thread sends the QuorumPackets in queuedPackets; when queuedPackets is empty, the thread blocks in queuedPackets.take().
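The branch structure that picks between DIFF, TRUNC, and SNAP can be sketched as a pure function over zxids. This is a hypothetical, simplified model (SyncModeSketch, Mode, and Plan are names made up for this example): it ignores locking, packet queuing, and forwarding, and only reproduces the comparisons shown in the excerpt above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical, simplified model of the sync-mode decision; not ZooKeeper code. */
final class SyncModeSketch {
    enum Mode { DIFF, TRUNC, SNAP }

    static final class Plan {
        final Mode mode;
        final long zxidToSend;
        final List<Long> toReplay; // zxids that would be queued as PROPOSAL + COMMIT

        Plan(Mode mode, long zxidToSend, List<Long> toReplay) {
            this.mode = mode;
            this.zxidToSend = zxidToSend;
            this.toReplay = toReplay;
        }
    }

    /** committed must be sorted in ascending order, like committedLog. */
    static Plan choose(long peerLastZxid, long lastProcessedZxid, List<Long> committed) {
        List<Long> none = Collections.emptyList();
        if (peerLastZxid == lastProcessedZxid) {
            return new Plan(Mode.DIFF, peerLastZxid, none); // already in sync: empty DIFF
        }
        if (committed.isEmpty()) {
            return new Plan(Mode.SNAP, lastProcessedZxid, none);
        }
        long min = committed.get(0);
        long max = committed.get(committed.size() - 1);
        if (min <= peerLastZxid && peerLastZxid <= max) {
            Mode mode = Mode.DIFF;
            long zxidToSend = max;
            long prev = min;
            boolean first = true;
            List<Long> replay = new ArrayList<>();
            for (long zxid : committed) {
                if (zxid <= peerLastZxid) {
                    prev = zxid; // the peer already has this one
                    continue;
                }
                if (first) {
                    first = false;
                    if (prev < peerLastZxid) {
                        // The peer's last zxid is not in the log: truncate first
                        mode = Mode.TRUNC;
                        zxidToSend = prev;
                    }
                }
                replay.add(zxid);
            }
            return new Plan(mode, zxidToSend, replay);
        }
        if (peerLastZxid > max) {
            return new Plan(Mode.TRUNC, max, none);
        }
        return new Plan(Mode.SNAP, lastProcessedZxid, none); // too far behind
    }
}
```

With a committed log of 3, 4, 5, 6 and a leader that has processed up to 6, a peer at 4 gets a DIFF replaying 5 and 6, a peer at 9 gets a TRUNC back to 6, and a peer at 1 falls through to SNAP because its zxid is older than anything still in the log.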
-----End of this part. There is too much to cover in one post; continued in the next section-----