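Summary
Earlier posts explained that server roles are divided into Leader and Learner (the latter subdivided into Follower and Observer). This post walks through the Learner code first.
Learner is the parent class of Follower and Observer and defines the behavior the two share.
The main topics are:
Inheritance: the subclasses Follower and Observer
Inner class: PacketInFlight, a record of a message that has been proposed (PROPOSAL) but not yet committed (COMMIT)
Attributes
Methods
Interaction with the leader
findLeader: find the leader and resolve its address
connectToLeader: establish a connection to the leader
registerWithLeader: register with the leader by sending LearnerInfo
syncWithLeader: on startup, the learner first synchronizes its data with the leader
Reading and writing packets
Session validation
validateSession: called when a client reconnects in cluster mode; the learner sends a request to the leader to check whether the session is still valid and to reactivate it
revalidate: handle the REVALIDATE reply received from the leader
Other
request: requests that must be processed by the leader (transaction requests such as writes) are forwarded to the leader by the learner
ping: when the learner receives a PING from the leader, it replies with a snapshot of the LearnerSessionTracker
Thoughts
Questions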
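Inheritance
This reflects the split of Learner into two roles, Follower and Observer.
Inner class
PacketInFlight is the record format used for a proposal that the Leader has sent out but that has not yet been acknowledged by a quorum (more than half of the voters).
The class name means "a packet still in flight".
A record is created when a Follower reads a PROPOSAL message or an Observer reads an INFORM message.
static class PacketInFlight {
TxnHeader hdr; // transaction header
Record rec; // transaction record (body)
}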
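To make the "in flight" idea concrete, here is a minimal sketch of a queue of proposals waiting for their COMMIT. It is not ZooKeeper code: the class InFlightSketch and its methods are hypothetical, and it tracks only zxids instead of the full header and record.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical, simplified stand-in for tracking proposals that await a COMMIT. */
final class InFlightSketch {
    // Zxids of proposals received but not yet committed, in arrival order.
    private final Deque<Long> pending = new ArrayDeque<>();

    /** Record a PROPOSAL. */
    void propose(long zxid) {
        pending.addLast(zxid);
    }

    /** Apply a COMMIT; succeeds only if it matches the oldest pending proposal. */
    boolean commit(long zxid) {
        Long head = pending.peekFirst();
        if (head == null || head != zxid) {
            return false; // out-of-order commit: leave the queue untouched
        }
        pending.removeFirst();
        return true;
    }

    /** Number of proposals still in flight. */
    int pendingCount() {
        return pending.size();
    }
}
```

Commits are matched against the head of the queue because proposals are expected to be committed in the order they were proposed.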
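Attributes
The source code, with annotations:
QuorumPeer self; // the QuorumPeer (this server in the ensemble) that owns this learner
LearnerZooKeeperServer zk; // the ZooKeeper server running in the learner role
protected BufferedOutputStream bufferedOutput;
protected Socket sock;
protected InputArchive leaderIs; // input from the leader
protected OutputArchive leaderOs; // output to the leader
/** the protocol version of the leader */
protected int leaderProtocolVersion = 0x01; // protocol version of the leader
protected static final Logger LOG = LoggerFactory.getLogger(Learner.class);
static final private boolean nodelay = System.getProperty("follower.nodelay", "true").equals("true"); // whether to set TCP_NODELAY on the socket to the leader (default: true)
final ConcurrentHashMap<Long, ServerCnxn> pendingRevalidations =
new ConcurrentHashMap<Long, ServerCnxn>(); // when a client connects to the learner, the learner sends a REVALIDATE request to the leader; until the reply arrives, the connection is kept in this map as a pending validation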
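Methods
Interaction with the leader
The main steps are: find the leader, connect to it, register with it, and synchronize data with it.
findLeader
// Find out who the leader is: take the id from currentVote and scan every server in the ensemble for the one whose sid matches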
protected InetSocketAddress findLeader() {
InetSocketAddress addr = null;
// Find the leader by id
Vote current = self.getCurrentVote();
for (QuorumServer s : self.getView().values()) {
if (s.id == current.getId()) { // this server's sid matches the id in the current vote
// Ensure we have the leader's correct IP address before
// attempting to connect.
s.recreateSocketAddresses();
addr = s.addr;
break;
}
}
if (addr == null) {
LOG.warn("Couldn't find the leader with id = "
+ current.getId());
}
return addr;
}
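The lookup boils down to scanning the ensemble view for the server id named in the current vote. A minimal sketch, assuming the view is reduced to a map from sid to an address string (FindLeaderSketch and that map shape are hypothetical simplifications of QuorumPeer#getView):

```java
import java.util.Map;

final class FindLeaderSketch {
    /**
     * Returns the address registered for votedLeaderId,
     * or null if no server in the view has that id.
     */
    static String findLeader(Map<Long, String> view, long votedLeaderId) {
        for (Map.Entry<Long, String> server : view.entrySet()) {
            if (server.getKey() == votedLeaderId) { // Long is unboxed for the comparison
                return server.getValue();
            }
        }
        return null; // the caller is expected to log a warning, as the original does
    }
}
```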
connectToLeader
Establish a connection to the leader, retrying up to 5 times with a 1-second pause after each failed attempt.
protected void connectToLeader(InetSocketAddress addr)
throws IOException, ConnectException, InterruptedException {
sock = new Socket();
sock.setSoTimeout(self.tickTime * self.initLimit);
for (int tries = 0; tries < 5; tries++) {
try {
sock.connect(addr, self.tickTime * self.syncLimit);
sock.setTcpNoDelay(nodelay);
break;
} catch (IOException e) {
if (tries == 4) {
LOG.error("Unexpected exception",e);
throw e;
} else {
LOG.warn("Unexpected exception, tries="+tries+
", connecting to " + addr,e);
sock = new Socket();
sock.setSoTimeout(self.tickTime * self.initLimit);
}
}
Thread.sleep(1000);
}
leaderIs = BinaryInputArchive.getArchive(new BufferedInputStream(
sock.getInputStream()));
bufferedOutput = new BufferedOutputStream(sock.getOutputStream());
leaderOs = BinaryOutputArchive.getArchive(bufferedOutput);
}
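The retry pattern in connectToLeader can be isolated from the socket details. The sketch below is an illustration under assumptions: the Attempt interface stands in for Socket.connect, and the retry count and sleep time are parameters here, whereas the original hard-codes 5 tries and 1000 ms.

```java
import java.io.IOException;

final class RetrySketch {
    /** Hypothetical connection attempt; stands in for Socket.connect. */
    interface Attempt {
        void run() throws IOException;
    }

    /**
     * Tries up to maxTries times, sleeping after each failure.
     * Returns the number of attempts used; rethrows the last failure.
     */
    static int connectWithRetry(Attempt attempt, int maxTries, long sleepMillis)
            throws IOException, InterruptedException {
        for (int tries = 0; tries < maxTries; tries++) {
            try {
                attempt.run();
                return tries + 1; // connected, no sleep needed
            } catch (IOException e) {
                if (tries == maxTries - 1) {
                    throw e; // out of retries: surface the last error
                }
            }
            Thread.sleep(sleepMillis);
        }
        throw new IllegalArgumentException("maxTries must be positive");
    }
}
```

In the original, the socket read timeout is tickTime * initLimit while the connect timeout is tickTime * syncLimit, and a fresh Socket is created before every retry.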
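registerWithLeader
// After connecting to the leader, perform the handshake; the parameter says whether a following or an observing connection is being registered
// During registration the learner sends its basic information, called LearnerInfo, to the leader
protected long registerWithLeader(int pktType) throws IOException{ // step 5: register the current Follower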
/*
* Send follower info, including last zxid and sid
*/
long lastLoggedZxid = self.getLastLoggedZxid();
QuorumPacket qp = new QuorumPacket();
qp.setType(pktType); // the type is Leader.FOLLOWERINFO or Leader.OBSERVERINFO
qp.setZxid(ZxidUtils.makeZxid(self.getAcceptedEpoch(), 0));
/*
* Add sid to payload
*/
LearnerInfo li = new LearnerInfo(self.getId(), 0x10000);
ByteArrayOutputStream bsid = new ByteArrayOutputStream();
BinaryOutputArchive boa = BinaryOutputArchive.getArchive(bsid);
boa.writeRecord(li, "LearnerInfo"); // serialize the learner's info so it can be sent to the leader
qp.setData(bsid.toByteArray());
writePacket(qp, true); // send the LearnerInfo packet
readPacket(qp); // read the leader's reply (with a newer leader this is a LEADERINFO message carrying the leader's state)
final long newEpoch = ZxidUtils.getEpochFromZxid(qp.getZxid());
if (qp.getType() == Leader.LEADERINFO) { // newer-version leader
// we are connected to a 1.0 server so accept the new epoch and read the next packet
leaderProtocolVersion = ByteBuffer.wrap(qp.getData()).getInt();
byte epochBytes[] = new byte[4];
final ByteBuffer wrappedEpochBytes = ByteBuffer.wrap(epochBytes);
if (newEpoch > self.getAcceptedEpoch()) {
wrappedEpochBytes.putInt((int)self.getCurrentEpoch());
self.setAcceptedEpoch(newEpoch);
} else if (newEpoch == self.getAcceptedEpoch()) {
// since we have already acked an epoch equal to the leaders, we cannot ack
// again, but we still need to send our lastZxid to the leader so that we can
// sync with it if it does assume leadership of the epoch.
// the -1 indicates that this reply should not count as an ack for the new epoch
wrappedEpochBytes.putInt(-1);
} else {
throw new IOException("Leaders epoch, " + newEpoch + " is less than accepted epoch, " + self.getAcceptedEpoch());
}
QuorumPacket ackNewEpoch = new QuorumPacket(Leader.ACKEPOCH, lastLoggedZxid, epochBytes, null); // step 8: after receiving the leader's state, reply with an ACKEPOCH message
writePacket(ackNewEpoch, true);
return ZxidUtils.makeZxid(newEpoch, 0);
} else { // older-version leader (kept for compatibility)
if (newEpoch > self.getAcceptedEpoch()) {
self.setAcceptedEpoch(newEpoch);
}
if (qp.getType() != Leader.NEWLEADER) {
LOG.error("First packet should have been NEWLEADER");
throw new IOException("First packet should have been NEWLEADER");
}
return qp.getZxid();
}
}
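registerWithLeader relies on ZxidUtils to pack an epoch into the high 32 bits of a zxid. The sketch below re-implements that packing, together with the choice of the ACKEPOCH payload, as standalone static methods. The class name ZxidSketch and the use of IllegalStateException (the original throws IOException) are assumptions made for illustration.

```java
final class ZxidSketch {
    /** High 32 bits: epoch. Low 32 bits: counter. */
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32L) | (counter & 0xffffffffL);
    }

    static long epochOf(long zxid) {
        return zxid >> 32L;
    }

    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    /**
     * Value written into the ACKEPOCH payload: the learner's current epoch
     * when the leader's epoch is newer, or -1 when the learner has already
     * acked this epoch (the reply must then not count as an ack).
     */
    static int ackEpochPayload(long newEpoch, long acceptedEpoch, long currentEpoch) {
        if (newEpoch > acceptedEpoch) {
            return (int) currentEpoch;
        } else if (newEpoch == acceptedEpoch) {
            return -1;
        }
        throw new IllegalStateException("Leader's epoch " + newEpoch
                + " is less than accepted epoch " + acceptedEpoch);
    }
}
```

The zxid returned to the caller in the LEADERINFO branch is makeZxid(newEpoch, 0), that is, the first zxid of the new epoch.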
syncWithLeader
On startup, the learner first synchronizes its data with the leader.
protected void syncWithLeader(long newLeaderZxid) throws IOException, InterruptedException{ // synchronize with the leader
QuorumPacket ack = new QuorumPacket(Leader.ACK, 0, null, null);
QuorumPacket qp = new QuorumPacket();
long newEpoch = ZxidUtils.getEpochFromZxid(newLeaderZxid);
readPacket(qp);
LinkedList<Long> packetsCommitted = new LinkedList<Long>();
LinkedList<PacketInFlight> packetsNotCommitted = new LinkedList<PacketInFlight>(); // packets that have been proposed but not yet committed
synchronized (zk) {
if (qp.getType() == Leader.DIFF) { // DIFF: synchronize with the leader by applying the differences
LOG.info("Getting a diff from the leader 0x" + Long.toHexString(qp.getZxid()));
}
else if (qp.getType() == Leader.SNAP) { // SNAP: copy a snapshot of the leader's data into local memory
LOG.info("Getting a snapshot from leader");
// The leader is going to dump the database
// clear our own database and read
zk.getZKDatabase().clear();
zk.getZKDatabase().deserializeSnapshot(leaderIs); // load the snapshot sent by the leader into local memory
String signature = leaderIs.readString("signature");
if (!signature.equals("BenWasHere")) { // verify the signature
LOG.error("Missing signature. Got " + signature);
throw new IOException("Missing signature");
}
} else if (qp.getType() == Leader.TRUNC) { // TRUNC: roll back (truncate the log) to the zxid given by the leader
//we need to truncate the log to the lastzxid of the leader
LOG.warn("Truncating log to get in sync with the leader 0x"
+ Long.toHexString(qp.getZxid()));
boolean truncated=zk.getZKDatabase().truncateLog(qp.getZxid());
if (!truncated) {
// not able to truncate the log
LOG.error("Not able to truncate the log "
+ Long.toHexString(qp.getZxid()));
System.exit(13);
}
}
else {
LOG.error("Got unexpected packet from leader "
+ qp.getType() + " exiting ... " );
System.exit(13);
}
zk.getZKDatabase().setlastProcessedZxid(qp.getZxid());
zk.createSessionTracker();
long lastQueued = 0;
// in V1.0 we take a snapshot when we get the NEWLEADER message, but in pre V1.0
// we take the snapshot at the UPDATE, since V1.0 also gets the UPDATE (after the NEWLEADER)
// we need to make sure that we don't take the snapshot twice.
boolean snapshotTaken = false; // whether the snapshot has already been taken (set when NEWLEADER is received)
// we are now going to start getting transactions to apply followed by an UPTODATE
outerLoop:
while (self.isRunning()) { // startup sync: keep reading packets from the leader until UPTODATE signals that synchronization is complete
readPacket(qp);
switch(qp.getType()) {
case Leader.PROPOSAL: // received a proposal
PacketInFlight pif = new PacketInFlight();
pif.hdr = new TxnHeader();
pif.rec = SerializeUtils.deserializeTxn(qp.getData(), pif.hdr);
if (pif.hdr.getZxid() != lastQueued + 1) {
LOG.warn("Got zxid 0x"
+ Long.toHexString(pif.hdr.getZxid())
+ " expected 0x"
+ Long.toHexString(lastQueued + 1));
}
lastQueued = pif.hdr.getZxid();
packetsNotCommitted.add(pif); // not committed yet, so record it in the not-committed queue
break;
case Leader.COMMIT: // received a COMMIT
if (!snapshotTaken) { // no snapshot taken yet
pif = packetsNotCommitted.peekFirst();
if (pif.hdr.getZxid() != qp.getZxid()) {
LOG.warn("Committing " + qp.getZxid() + ", but next proposal is " + pif.hdr.getZxid());
} else {
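zk.processTxn(pif.hdr, pif.rec); // apply the transaction directly
packetsNotCommitted.remove(); // remove the matching entry from the not-committed queue
}
} else {
packetsCommitted.add(qp.getZxid()); // snapshot already taken: remember the zxid so it can be committed after the loop
}
break;
case Leader.INFORM: // only observers receive INFORM messages for synchronization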
/*
* Only observer get this type of packet. We treat this
* as receiving PROPOSAL and COMMMIT.
*/
PacketInFlight packet = new PacketInFlight();
packet.hdr = new TxnHeader();
packet.rec = SerializeUtils.deserializeTxn(qp.getData(), packet.hdr);
// Log warning message if txn comes out-of-order
if (packet.hdr.getZxid() != lastQueued + 1) {
LOG.warn("Got zxid 0x"
+ Long.toHexString(packet.hdr.getZxid())
+ " expected 0x"
+ Long.toHexString(lastQueued + 1));
}
lastQueued = packet.hdr.getZxid();
if (!snapshotTaken) { // no snapshot yet: apply the transaction directly
// Apply to db directly if we haven't taken the snapshot
zk.processTxn(packet.hdr, packet.rec);
} else {
packetsNotCommitted.add(packet);
packetsCommitted.add(qp.getZxid());
}
break;
case Leader.UPTODATE: // a quorum has acknowledged the new leader and this learner has finished syncing, so leave the loop
if (!snapshotTaken) { // true for the pre v1.0 case
zk.takeSnapshot();
self.setCurrentEpoch(newEpoch);
}
self.cnxnFactory.setZooKeeperServer(zk);
break outerLoop;
case Leader.NEWLEADER: // it will be NEWLEADER in v1.0
// Create updatingEpoch file and remove it after current
// epoch is set. QuorumPeer.loadDataBase() uses this file to
// detect the case where the server was terminated after
// taking a snapshot but before setting the current epoch.
File updating = new File(self.getTxnFactory().getSnapDir(),
QuorumPeer.UPDATING_EPOCH_FILENAME);
if (!updating.exists() && !updating.createNewFile()) {
throw new IOException("Failed to create " +
updating.toString());
}
zk.takeSnapshot(); // take a snapshot, then set the current epoch
self.setCurrentEpoch(newEpoch);
if (!updating.delete()) { // delete the updatingEpoch file
throw new IOException("Failed to delete " +
updating.toString());
}
snapshotTaken = true; // the snapshot has been taken
writePacket(new QuorumPacket(Leader.ACK, newLeaderZxid, null, null), true); // reply to NEWLEADER with an ACK
break;
}
}
}
ack.setZxid(ZxidUtils.makeZxid(newEpoch, 0));
writePacket(ack, true); // finally send one more ACK
sock.setSoTimeout(self.tickTime * self.syncLimit);
zk.startup(); // start the server
/*
* Update the election vote here to ensure that all members of the
* ensemble report the same vote to new servers that start up and
* send leader election notifications to the ensemble.
*
* @see https://issues.apache.org/jira/browse/ZOOKEEPER-1732
*/
self.updateElectionVote(newEpoch); // update the election vote (the fix for ZOOKEEPER-1732)
// We need to log the stuff that came in between the snapshot and the uptodate
if (zk instanceof FollowerZooKeeperServer) { // this server is a follower
FollowerZooKeeperServer fzk = (FollowerZooKeeperServer)zk;
for(PacketInFlight p: packetsNotCommitted) {
fzk.logRequest(p.hdr, p.rec);
}
for(Long zxid: packetsCommitted) {
fzk.commit(zxid); // commit
}
} else if (zk instanceof ObserverZooKeeperServer) {
// Similar to follower, we need to log requests between the snapshot
// and UPTODATE
ObserverZooKeeperServer ozk = (ObserverZooKeeperServer) zk;
for (PacketInFlight p : packetsNotCommitted) {
Long zxid = packetsCommitted.peekFirst();
if (p.hdr.getZxid() != zxid) {
// log warning message if there is no matching commit
// old leader send outstanding proposal to observer
LOG.warn("Committing " + Long.toHexString(zxid)
+ ", but next proposal is "
+ Long.toHexString(p.hdr.getZxid()));
continue;
}
packetsCommitted.remove();
Request request = new Request(null, p.hdr.getClientId(),
p.hdr.getCxid(), p.hdr.getType(), null, null);
request.txn = p.rec;
request.hdr = p.hdr;
ozk.commitRequest(request);
}
} else {
// New server type need to handle in-flight packets
throw new UnsupportedOperationException("Unknown server type");
}
}
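This needs to be read together with LearnerHandler#run on the leader side. On the learner side the main logic is:
1. In registerWithLeader above, the learner answers the leader's LEADERINFO with an ACKEPOCH that carries its own lastLoggedZxid.
2. Based on that lastLoggedZxid, the leader tells the learner which synchronization mode to use:
DIFF, SNAP, or TRUNC (first roll back to a given zxid).
3. Once the mode is decided, the leader keeps sending the learner the follow-up sync packets:
PROPOSAL (a proposal)
COMMIT (a commit, for Followers)
INFORM (a notification, for Observers)
UPTODATE (a quorum has finished syncing, so the server may start serving clients)
NEWLEADER (the leader tells the learner that the sync data has all been sent)
Some parts of the code are still unclear to me; they are collected in the Thoughts and Questions sections below.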
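The way snapshotTaken switches the loop between "apply now" and "buffer for later" is easier to see in a stripped-down simulation. The sketch below is a hypothetical model, not the real method: it keeps only zxids, ignores INFORM, and the packet sequence fed to it is invented by the caller rather than produced by a real leader.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

final class SyncLoopSketch {
    enum Type { PROPOSAL, COMMIT, NEWLEADER, UPTODATE }

    static final class Packet {
        final Type type;
        final long zxid;
        Packet(Type type, long zxid) { this.type = type; this.zxid = zxid; }
    }

    final List<Long> applied = new ArrayList<>();              // applied straight to the in-memory DB
    final LinkedList<Long> notCommitted = new LinkedList<>();  // proposed, waiting for a commit
    final LinkedList<Long> committedLater = new LinkedList<>(); // commits seen after the snapshot
    boolean snapshotTaken = false;

    /** Processes packets in order; returns true once UPTODATE is seen. */
    boolean run(List<Packet> packets) {
        for (Packet p : packets) {
            switch (p.type) {
                case PROPOSAL:
                    notCommitted.add(p.zxid);
                    break;
                case COMMIT:
                    if (!snapshotTaken) {
                        Long head = notCommitted.peekFirst();
                        if (head != null && head == p.zxid) {
                            applied.add(notCommitted.removeFirst());
                        }
                    } else {
                        committedLater.add(p.zxid); // replayed after the loop
                    }
                    break;
                case NEWLEADER:
                    snapshotTaken = true; // the real code takes the snapshot and ACKs here
                    break;
                case UPTODATE:
                    return true;
            }
        }
        return false;
    }
}
```

Anything still sitting in notCommitted and committedLater when the loop ends corresponds to what the real method logs and commits through the FollowerZooKeeperServer or ObserverZooKeeperServer afterwards.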
Reading and writing packets
The code is pasted as-is:
void writePacket(QuorumPacket pp, boolean flush) throws IOException { // send a packet; the record tag is "packet", for the leader to read
synchronized (leaderOs) {
if (pp != null) {
leaderOs.writeRecord(pp, "packet");
}
if (flush) {
bufferedOutput.flush();
}
}
}
void readPacket(QuorumPacket pp) throws IOException { // read a packet from the leader
synchronized (leaderIs) {
leaderIs.readRecord(pp, "packet");
}
long traceMask = ZooTrace.SERVER_PACKET_TRACE_MASK;
if (pp.getType() == Leader.PING) {
traceMask = ZooTrace.SERVER_PING_TRACE_MASK;
}
if (LOG.isTraceEnabled()) {
ZooTrace.logQuorumPacket(LOG, traceMask, 'i', pp);
}
}
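The flush flag matters because leaderOs writes into a BufferedOutputStream: without a flush, the bytes may stay in the buffer and never reach the leader. A small sketch of that behavior, where a ByteArrayOutputStream stands in for the socket and FlushSketch is a hypothetical class:

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

final class FlushSketch {
    private final ByteArrayOutputStream wire = new ByteArrayOutputStream(); // stands in for the socket
    private final BufferedOutputStream buffered = new BufferedOutputStream(wire);
    private final DataOutputStream out = new DataOutputStream(buffered);

    /** Writes one int; pushes it to the wire only when flush is true. */
    void write(int value, boolean flush) throws IOException {
        synchronized (out) { // one writer at a time, as writePacket does with leaderOs
            out.writeInt(value);
            if (flush) {
                buffered.flush();
            }
        }
    }

    /** Bytes that have actually left the buffer. */
    int bytesOnWire() {
        return wire.size();
    }
}
```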
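Session validation
validateSession
Called when a client reconnects in cluster mode. The learner cannot decide by itself whether the session is still valid, so it sends a request to the leader, which validates and reactivates the session.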
void validateSession(ServerCnxn cnxn, long clientId, int timeout)
throws IOException {
LOG.info("Revalidating client: 0x" + Long.toHexString(clientId));
ByteArrayOutputStream baos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(baos);
dos.writeLong(clientId); // write the clientId (the session id)
dos.writeInt(timeout);
dos.close();
QuorumPacket qp = new QuorumPacket(Leader.REVALIDATE, -1, baos
.toByteArray(), null); // build the REVALIDATE packet
pendingRevalidations.put(clientId, cnxn); // validation requested but no result yet, so record it in the map
if (LOG.isTraceEnabled()) {
ZooTrace.logTraceMessage(LOG,
ZooTrace.SESSION_TRACE_MASK,
"To validate session 0x"
+ Long.toHexString(clientId));
}
writePacket(qp, true); // send it
}
revalidate
// Handle the REVALIDATE reply received from the leader
protected void revalidate(QuorumPacket qp) throws IOException {
ByteArrayInputStream bis = new ByteArrayInputStream(qp
.getData());
DataInputStream dis = new DataInputStream(bis);
long sessionId = dis.readLong();
boolean valid = dis.readBoolean();
ServerCnxn cnxn = pendingRevalidations
.remove(sessionId); // validation finished, remove it from the map
if (cnxn == null) {
LOG.warn("Missing session 0x"
+ Long.toHexString(sessionId)
+ " for validation");
} else {
zk.finishSessionInit(cnxn, valid); // finish initializing the session
}
if (LOG.isTraceEnabled()) {
ZooTrace.logTraceMessage(LOG,
ZooTrace.SESSION_TRACE_MASK,
"Session 0x" + Long.toHexString(sessionId)
+ " is valid: " + valid);
}
}
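validateSession and revalidate form a request/reply pair joined by the pendingRevalidations map. The round trip can be sketched without any ZooKeeper types: the request payload is a long followed by an int, and the reply is a long followed by a boolean. In this sketch RevalidateSketch is hypothetical, a String label replaces ServerCnxn, and buildReply imitates what the leader would send back.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;

final class RevalidateSketch {
    // sessionId -> label of the connection waiting for the verdict
    final ConcurrentHashMap<Long, String> pending = new ConcurrentHashMap<>();

    /** Learner side: encode sessionId + timeout and remember who is waiting. */
    byte[] buildRequest(String cnxn, long sessionId, int timeout) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);
        dos.writeLong(sessionId);
        dos.writeInt(timeout);
        dos.close();
        pending.put(sessionId, cnxn);
        return baos.toByteArray();
    }

    /** Imitates the leader's reply payload: sessionId + verdict. */
    static byte[] buildReply(long sessionId, boolean valid) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);
        dos.writeLong(sessionId);
        dos.writeBoolean(valid);
        dos.close();
        return baos.toByteArray();
    }

    /** Learner side: returns "label:verdict", or null if nobody was waiting. */
    String handleReply(byte[] reply) throws IOException {
        DataInputStream dis = new DataInputStream(new ByteArrayInputStream(reply));
        long sessionId = dis.readLong();
        boolean valid = dis.readBoolean();
        String cnxn = pending.remove(sessionId);
        return cnxn == null ? null : cnxn + ":" + valid;
    }
}
```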
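Other
request
Requests that must be processed by the leader (transaction requests such as writes) are forwarded to the leader when a learner receives them.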
void request(Request request) throws IOException {
ByteArrayOutputStream baos = new ByteArrayOutputStream();
DataOutputStream oa = new DataOutputStream(baos);
oa.writeLong(request.sessionId);
oa.writeInt(request.cxid);
oa.writeInt(request.type);
if (request.request != null) {
request.request.rewind();
int len = request.request.remaining();
byte b[] = new byte[len];
request.request.get(b);
request.request.rewind();
oa.write(b);
}
oa.close();
QuorumPacket qp = new QuorumPacket(Leader.REQUEST, -1, baos
.toByteArray(), request.authInfo);
writePacket(qp, true);
}
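The payload that request builds has a fixed 16-byte prefix (sessionId, cxid, type) followed by the raw request body. A standalone sketch of the same layout, with the hypothetical class RequestPayloadSketch and plain parameters in place of the Request object:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

final class RequestPayloadSketch {
    /** Encodes sessionId (8 bytes), cxid (4), type (4), then the body if present. */
    static byte[] encode(long sessionId, int cxid, int type, ByteBuffer body)
            throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(baos);
        out.writeLong(sessionId);
        out.writeInt(cxid);
        out.writeInt(type);
        if (body != null) {
            body.rewind();                         // read from the start of the buffer
            byte[] b = new byte[body.remaining()];
            body.get(b);
            body.rewind();                         // leave the buffer reusable for the caller
            out.write(b);
        }
        out.close();
        return baos.toByteArray();
    }
}
```

The second rewind mirrors the original: the same ByteBuffer may be read again later, so its position is reset after copying.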
ping
When the learner receives a PING from the leader, it replies with the snapshot of the LearnerSessionTracker (the session touch table).
protected void ping(QuorumPacket qp) throws IOException {
// Send back the ping with our session data
ByteArrayOutputStream bos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(bos);
HashMap<Long, Integer> touchTable = zk
.getTouchSnapshot(); // snapshot of session id -> timeout
for (Entry<Long, Integer> entry : touchTable.entrySet()) {
dos.writeLong(entry.getKey());
dos.writeInt(entry.getValue());
}
qp.setData(bos.toByteArray());
writePacket(qp, true);
}
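The PING reply is just the touch table flattened into 12-byte entries (an 8-byte session id plus a 4-byte timeout). The sketch below shows that encoding and one way a receiver could parse it back. TouchTableSketch is hypothetical, and the decode half is an assumption about the reader side, which this post does not show.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

final class TouchTableSketch {
    /** Flattens sessionId -> timeout into consecutive (long, int) pairs. */
    static byte[] encode(Map<Long, Integer> touchTable) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(bos);
        for (Map.Entry<Long, Integer> entry : touchTable.entrySet()) {
            dos.writeLong(entry.getKey());
            dos.writeInt(entry.getValue());
        }
        dos.flush();
        return bos.toByteArray();
    }

    /** Reads (long, int) pairs until the payload is exhausted. */
    static Map<Long, Integer> decode(byte[] data) throws IOException {
        Map<Long, Integer> table = new LinkedHashMap<>();
        DataInputStream dis = new DataInputStream(new ByteArrayInputStream(data));
        while (dis.available() > 0) {
            long sessionId = dis.readLong();
            int timeout = dis.readInt();
            table.put(sessionId, timeout);
        }
        return table;
    }
}
```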
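Thoughts
More on the message types used in data synchronization
A few additional points:
1. What is the difference between the DIFF and SNAP synchronization modes?
As an analogy, say the LEADER is in state A and the LEARNER is in state B.
DIFF: the LEADER tells the LEARNER to keep receiving PROPOSAL and COMMIT messages and apply them one by one, so the LEARNER goes from B to B1, to B2, ... and finally reaches A.
SNAP: the LEADER simply tells the LEARNER "my state is A, copy it and become me".
2. Once the leader has sent DIFF, SNAP, or TRUNC, is the learner fully synchronized as soon as it receives that one packet?
No.
DIFF: it is followed by a series of PROPOSAL and COMMIT packets that bring the learner up to date step by step.
TRUNC: after the rollback to the given zxid, it may also be followed by a series of PROPOSAL and COMMIT packets.
SNAP: it is followed by the serialized content of the leader's database.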
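How the leader picks one of the three modes is decided in LearnerHandler#run, which this post does not reproduce. Purely as an assumption-laden illustration, the choice can be pictured as comparing the learner's last zxid with the range of transactions the leader still keeps in its committed log. The names SyncModeSketch, minCommittedLog and maxCommittedLog are used here for illustration, and the real rules have more cases than this.

```java
final class SyncModeSketch {
    enum Mode { DIFF, SNAP, TRUNC }

    /**
     * Simplified, assumed decision rule:
     * - learner is ahead of the leader's log   -> TRUNC (roll the learner back)
     * - learner is inside the retained range   -> DIFF  (replay the missing txns)
     * - learner is too far behind              -> SNAP  (ship a full snapshot)
     */
    static Mode choose(long peerLastZxid, long minCommittedLog, long maxCommittedLog) {
        if (peerLastZxid > maxCommittedLog) {
            return Mode.TRUNC;
        }
        if (peerLastZxid >= minCommittedLog) {
            return Mode.DIFF;
        }
        return Mode.SNAP;
    }
}
```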
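Thoughts on syncWithLeader during data synchronization
There are quite a few points here.
1. What does snapshotTaken mean?
It indicates whether the NEWLEADER packet has been received. At that point the learner has received the leader's sync data and takes a snapshot.
2. What does packetsNotCommitted mean?
It holds packets that have been proposed but not yet committed.
How session validation and reconnection differ between standalone mode and a Learner in cluster mode
Standalone mode: section 37 of this source-code series covered the client reconnect mechanism.
ZooKeeperServer#reopenSession
ZooKeeperServer#revalidateSession
The server checks whether the session id and password are OK; if so, it asks the sessionTracker whether the session has timed out.
Cluster mode: how a Learner validates a session
Learner#validateSession
sends a REVALIDATE request to the leader and lets the leader decide whether the session has expired.
Learner#revalidate receives the leader's REVALIDATE reply and completes the session validation.
Which points reflect the characteristics of the Leader role
As discussed earlier, the Leader server is the core of a ZooKeeper ensemble. Its main duties are:
(1) It is the sole scheduler and processor of transaction requests, which guarantees the ordering of transaction processing in the cluster.
(2) It is the scheduler of the servers inside the cluster.
For example, as mentioned above, to check whether a session is valid a learner must send a request to the leader and wait for the reply; it cannot decide on its own.
Questions
Questions about syncWithLeader during data synchronization
1. What is the purpose of packetsCommitted?
2. When is the else branch inside case Leader.COMMIT taken?
Judging from how LearnerHandler#run is written,
after the leader sends NEWLEADER it waits for ACKs from a quorum of learners and then sends UPTODATE right away, so the learner never seems to reach the logic in that else branch.
Complaints
The code comments and the various compatibility paths make the code hard to read.
I also could not find any state-machine diagram.
References
There is hardly any reference that matches closely; here are a couple that are somewhat related.
http://www.cnblogs.com/leesf456/p/6139266.html
The book 《paxos到zk》 (roughly "From Paxos to ZK")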