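0. Background
The HugeGraph project uses jraft as the underlying component for replica consistency. While reviewing code yesterday, I noticed that the signature of transferLeadershipTo differs from those of addPeer and removePeer. The latter two take a Closure object, and the caller checks the status passed to the Closure's run method to tell whether the operation succeeded. transferLeadershipTo takes no Closure and returns a Status directly. All three operate on the raft group, so why is this one alone synchronous, and how is it implemented? Curiosity got the better of me.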
1. Tracing the transferLeadershipTo method
// Checks and early returns omitted
final long lastLogIndex = this.logManager.getLastLogIndex();
if (!this.replicatorGroup.transferLeadershipTo(peerId, lastLogIndex)) {
    LOG.warn("No such peer {}.", peer);
    return new Status(RaftError.EINVAL, "No such peer %s", peer);
}
this.state = State.STATE_TRANSFERRING;
final Status status = new Status(RaftError.ETRANSFERLEADERSHIP,
    "Raft leader is transferring leadership to %s", peerId);
onLeaderStop(status);
LOG.info("Node {} starts to transfer leadership to peer {}.", getNodeId(), peer);
final StopTransferArg stopArg = new StopTransferArg(this, this.currTerm, peerId);
this.stopTransferArg = stopArg;
this.transferTimer = this.timerManager.schedule(() -> onTransferTimeout(stopArg),
    this.options.getElectionTimeoutMs(), TimeUnit.MILLISECONDS);
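The main steps here are:
- Call transferLeadershipTo on the replicatorGroup for the leader-to-be, passing lastLogIndex. Leave aside what it does for now; a scan of the lines below shows that almost nothing after it acts on the leader-to-be's PeerId, which suggests this call alone is what triggers the election on the leader-to-be;
- Set its own state to State.STATE_TRANSFERRING, meaning a transfer is in progress;
- Call onLeaderStop, which builds a task of type LEADER_STOP and puts it on the task queue; the user-defined state machine also receives the onLeaderStop event;
- Start a timer that runs onTransferTimeout(stopArg) after one election timeout. Leave that method aside for now.
So the main logic is roughly: first make the leader-to-be start an election; if no leader has been elected within one election timeout, onTransferTimeout runs.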
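The flow just described (return at once, arm a timer, undo the transfer if it has not completed in time) can be sketched on its own. This is a minimal model, not jraft code: the class name, the String results and the rollback to LEADER inside onTransferTimeout are assumptions made for illustration (the logs in section 2 do suggest that the node becomes leader again after the timeout).

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical model of "start the transfer, arm a timeout"; not jraft's NodeImpl.
final class TransferTimeoutSketch {
    enum State { LEADER, TRANSFERRING }

    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    private State state = State.LEADER;

    // Returns immediately; at this point the outcome of the transfer is unknown.
    synchronized String transferLeadershipTo(String peer, long electionTimeoutMs) {
        if (state != State.LEADER) {
            return "EBUSY"; // hypothetical result for "already transferring"
        }
        state = State.TRANSFERRING;
        timer.schedule(this::onTransferTimeout, electionTimeoutMs, TimeUnit.MILLISECONDS);
        return "OK";
    }

    // Assumed rollback: if the transfer is still pending, take leadership back.
    private synchronized void onTransferTimeout() {
        if (state == State.TRANSFERRING) {
            state = State.LEADER;
        }
    }

    synchronized State state() {
        return state;
    }

    // Polls until the expected state is seen or waitMs has elapsed.
    boolean awaitState(State expected, long waitMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(waitMs);
        while (System.nanoTime() < deadline) {
            if (state() == expected) {
                return true;
            }
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return state() == expected;
    }

    void shutdown() {
        timer.shutdownNow();
    }

    public static void main(String[] args) {
        TransferTimeoutSketch node = new TransferTimeoutSketch();
        System.out.println("call returned: " + node.transferLeadershipTo("127.0.0.1:8282", 100));
        System.out.println("state right after the call: " + node.state());
        System.out.println("back to LEADER after timeout: " + node.awaitState(State.LEADER, 2000));
        node.shutdown();
    }
}
```

The point of the sketch is that the value returned by the call only says the transfer was started; whether it succeeded is decided later, on the timer thread or by an election elsewhere.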
Since replicatorGroup.transferLeadershipTo(peerId, lastLogIndex) is so central, let's step into it:
public boolean transferLeadershipTo(final PeerId peer, final long logIndex) {
    final ThreadId rid = this.replicatorMap.get(peer);
    // Calls the method below
    return rid != null && Replicator.transferLeadership(rid, logIndex);
}
public static boolean transferLeadership(final ThreadId id, final long logIndex) {
    final Replicator r = (Replicator) id.lock();
    if (r == null) {
        return false;
    }
    // dummy is unlock in _transfer_leadership
    // Calls the method below
    return r.transferLeadership(logIndex);
}
private boolean transferLeadership(final long logIndex) {
    if (this.hasSucceeded && this.nextIndex > logIndex) {
        // _id is unlock in _send_timeout_now
        sendTimeoutNow(true, false);
        return true;
    }
    // Register log_index so that _on_rpc_return trigger
    // _send_timeout_now if _next_index reaches log_index
    this.timeoutNowIndex = logIndex;
    this.id.unlock();
    return true;
}
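The decision in the last method can be modelled in isolation: if the follower has already caught up, TimeoutNow goes out at once; otherwise the index is recorded and TimeoutNow is sent later, once replication passes it. The sketch below is a simplified stand-in, not the Replicator class: it leaves out locking and hasSucceeded, and the onAppendEntriesReturned hook is an assumption inferred from the "Register log_index ..." comment in the snippet above.

```java
// Hypothetical model of the replicator's decision; field names mirror the snippet above.
final class TimeoutNowTriggerSketch {
    private long nextIndex;        // next log index to send to this follower
    private long timeoutNowIndex;  // 0 means no transfer is pending
    private int timeoutNowSent;    // counts simulated TimeoutNow requests

    TimeoutNowTriggerSketch(long nextIndex) {
        this.nextIndex = nextIndex;
    }

    boolean transferLeadership(long logIndex) {
        if (nextIndex > logIndex) {
            // Follower already holds every entry up to logIndex: trigger its election now.
            sendTimeoutNow();
            return true;
        }
        // Follower is behind: remember the index and wait for replication to catch up.
        timeoutNowIndex = logIndex;
        return true;
    }

    // Assumed hook, called when an append-entries response advances nextIndex.
    void onAppendEntriesReturned(long newNextIndex) {
        nextIndex = newNextIndex;
        if (timeoutNowIndex > 0 && nextIndex > timeoutNowIndex) {
            sendTimeoutNow();
        }
    }

    private void sendTimeoutNow() {
        timeoutNowSent++;
        timeoutNowIndex = 0; // the pending transfer has been served
    }

    int timeoutNowSent() {
        return timeoutNowSent;
    }

    public static void main(String[] args) {
        TimeoutNowTriggerSketch lagging = new TimeoutNowTriggerSketch(5);
        lagging.transferLeadership(10);
        System.out.println("sent while lagging: " + lagging.timeoutNowSent());
        lagging.onAppendEntriesReturned(11);
        System.out.println("sent after catching up: " + lagging.timeoutNowSent());
    }
}
```

Note that both branches return true, which matches the original: the boolean only says the replicator exists and accepted the request, not that TimeoutNow has been sent.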
The key method is sendTimeoutNow. Following it further turns up nothing special, so here is the code as-is.
private void sendTimeoutNow(final boolean unlockId, final boolean stopAfterFinish) {
    sendTimeoutNow(unlockId, stopAfterFinish, -1);
}
private void sendTimeoutNow(final boolean unlockId, final boolean stopAfterFinish, final int timeoutMs) {
    final TimeoutNowRequest.Builder rb = TimeoutNowRequest.newBuilder();
    rb.setTerm(this.options.getTerm());
    rb.setGroupId(this.options.getGroupId());
    rb.setServerId(this.options.getServerId().toString());
    rb.setPeerId(this.options.getPeerId().toString());
    try {
        if (!stopAfterFinish) {
            // This RPC is issued by transfer_leadership, save this call_id so that
            // the RPC can be cancelled by stop.
            // The Future returned here is stored, but nothing ever waits on it
            this.timeoutNowInFly = timeoutNow(rb, false, timeoutMs);
            this.timeoutNowIndex = 0;
        } else {
            timeoutNow(rb, true, timeoutMs);
        }
    } finally {
        if (unlockId) {
            this.id.unlock();
        }
    }
}
private Future timeoutNow(final TimeoutNowRequest.Builder rb, final boolean stopAfterFinish,
                          final int timeoutMs) {
    final TimeoutNowRequest request = rb.build();
    return this.rpcService.timeoutNow(this.options.getPeerId().getEndpoint(), request, timeoutMs,
        new RpcResponseClosureAdapter() {
            @Override
            public void run(final Status status) {
                if (Replicator.this.id != null) {
                    onTimeoutNowReturned(Replicator.this.id, status, request, getResponse(), stopAfterFinish);
                }
            }
        });
}
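The logic of these two snippets: find the replicator for the leader-to-be, then use it to send the leader-to-be a TimeoutNow request. The Timeout here presumably refers to the election timeout, so TimeoutNow means "handle the election timeout right now". The puzzling part is this line: this.timeoutNowInFly = timeoutNow(rb, false, timeoutMs);
It merely stores the returned Future and never waits on it, so transferLeadershipTo is in essence still an asynchronous method.
Lesson learned: once the code sends a message, you can no longer step through it line by line. Luckily, the methods that handle messages/requests are named consistently, usually in the form handleXXXRequest.
A global search for handleTimeoutNowRequest indeed finds its implementation in NodeImpl: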
// Checks and early returns omitted
final long savedTerm = this.currTerm;
final TimeoutNowResponse resp = TimeoutNowResponse.newBuilder() //
    .setTerm(this.currTerm + 1) //
    .setSuccess(true) //
    .build();
// Parallelize response and election
done.sendResponse(resp);
doUnlock = false;
electSelf();
LOG.info("Node {} received TimeoutNowRequest from {}, term={}.", getNodeId(), request.getServerId(),
    savedTerm);
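This builds the response first, with success set to true and term set to the old term + 1, and only then calls electSelf to run for election. Is the jraft author so sure that electSelf can never fail? In an extreme case, the leader-to-be's machine crashes right after the response is sent and the election fails. What does the early response mean then: merely that the request was received? This looks odd to me.
I suspect transferLeadershipTo is an asynchronous (non-blocking) method, so how can that be verified? Simple: add a sleep to one step of the asynchronous handling; if the main method still returns instantly, it is asynchronous beyond doubt. So I added the following code: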
final TimeoutNowResponse resp = TimeoutNowResponse.newBuilder() //
    .setTerm(this.currTerm + 1) //
    .setSuccess(true) //
    .build();
try {
    // Log text means "sleep for a while, for debugging"
    LOG.info("==> 睡眠一会,用于调试");
    Thread.sleep(3000);
} catch (InterruptedException e) {
    e.printStackTrace();
}
// Parallelize response and election
done.sendResponse(resp);
doUnlock = false;
electSelf();
LOG.info("Node {} received TimeoutNowRequest from {}, term={}.", getNodeId(), request.getServerId(),
    savedTerm);
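Running it again, the main method indeed still returns instantly, which confirms it is asynchronous. Given that, I think it should take a Closure like addPeer and removePeer do, so the caller can handle the outcome in the Closure. The status the method returns today is not rigorous.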
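The same experiment can be reproduced without jraft: put a delay in the handler and time the call. The sketch below is only an analogy under stated assumptions; fireAndForget stands in for a call that hands work to another thread, and all names are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical probe: a call that returns long before its handler finishes is asynchronous.
final class AsyncProbeSketch {

    // Stand-in for "send a request whose handler sleeps for handlerDelayMs".
    static CompletableFuture<Void> fireAndForget(long handlerDelayMs) {
        return CompletableFuture.runAsync(() -> sleepQuietly(handlerDelayMs));
    }

    // Measures how long the given call takes to return to its caller.
    static long elapsedMs(Runnable call) {
        long start = System.nanoTime();
        call.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    private static void sleepQuietly(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // The Future is dropped, as in sendTimeoutNow: the caller never waits on it.
        long asyncMs = elapsedMs(() -> fireAndForget(500));
        // Waiting on the Future is what a synchronous method would have to do.
        long syncMs = elapsedMs(() -> fireAndForget(500).join());
        System.out.println("without waiting: " + asyncMs + " ms");
        System.out.println("with join():     " + syncMs + " ms");
    }
}
```

If the first number stays far below the injected delay while the second does not, the call is not waiting for its handler, which is the behaviour observed for transferLeadershipTo here.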
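I thought the question was settled and was about to stop there, but then I noticed that both the original leader and the leader-to-be were printing a stream of error logs, which looked like a bug in our own program.
2. Debugging and locating the bug
Here are the logs from when the errors first appeared:
Original leader node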
2020-10-26 16:23:50 105014 [grizzly-http-server-1] [INFO ] com.alipay.sofa.jraft.core.NodeImpl [] - Node starts to transfer leadership to peer 127.0.0.1:8282.
2020-10-26 16:23:50 105017 [JRaft-FSMCaller-Disruptor-0] [INFO ] com.baidu.hugegraph.backend.store.raft.StoreStateMachine [] - The node 127.0.0.1:8281 abdicated from leader
2020-10-26 16:23:50 105017 [JRaft-FSMCaller-Disruptor-0] [INFO ] com.alipay.sofa.jraft.core.StateMachineAdapter [] - onLeaderStop: status=Status[ETRANSFERLEADERSHIP<10013>: Raft leader is transferring leadership to 127.0.0.1:8282].
2020-10-26 16:23:52 106747 [Bolt-conn-event-executor-9-thread-1] [INFO ] com.alipay.sofa.jraft.rpc.impl.core.ClientServiceConnectionEventProcessor [] - Peer 127.0.0.1:8281 is connected
2020-10-26 16:24:00 115020 [JRaft-Node-ScheduleThreadPool5] [INFO ] com.alipay.sofa.jraft.core.NodeImpl [] - Node failed to transfer leadership to peer 127.0.0.1:8282, reached timeout.
2020-10-26 16:24:00 115021 [JRaft-FSMCaller-Disruptor-0] [INFO ] com.baidu.hugegraph.backend.store.raft.StoreStateMachine [] - The node 127.0.0.1:8281 become to leader
2020-10-26 16:24:00 115021 [JRaft-FSMCaller-Disruptor-0] [INFO ] com.alipay.sofa.jraft.core.StateMachineAdapter [] - onLeaderStart: term=1.
2020-10-26 16:24:01 116087 [JRaft-StepDownTimer-0] [WARN ] com.alipay.sofa.jraft.core.NodeImpl [] - Node steps down when alive nodes don't satisfy quorum, term=1, deadNodes=127.0.0.1:8282, conf=127.0.0.1:8281,127.0.0.1:8282.
2020-10-26 16:24:01 116088 [JRaft-FSMCaller-Disruptor-0] [INFO ] com.baidu.hugegraph.backend.store.raft.StoreStateMachine [] - The node 127.0.0.1:8281 abdicated from leader
2020-10-26 16:24:01 116088 [JRaft-FSMCaller-Disruptor-0] [INFO ] com.alipay.sofa.jraft.core.StateMachineAdapter [] - onLeaderStop: status=Status[ERAFTTIMEDOUT<10001>: Majority of the group dies: 1/2].
2020-10-26 16:24:01 116088 [JRaft-StepDownTimer-0] [INFO ] com.alipay.sofa.jraft.core.Replicator [] - Replicator Replicator [state=Replicate, statInfo=, peerId=127.0.0.1:8282, type=Follower] is going to quit
2020-10-26 16:24:01 116089 [JRaft-Rpc-Closure-Executor-3] [WARN ] com.baidu.hugegraph.backend.store.raft.RaftNode [] - Replicator 127.0.0.1:8282 prepare to offline
2020-10-26 16:24:12 126476 [JRaft-ElectionTimer-0] [INFO ] com.alipay.sofa.jraft.core.NodeImpl [] - Node term 1 start preVote.
2020-10-26 16:24:22 137260 [JRaft-ElectionTimer-0] [INFO ] com.alipay.sofa.jraft.core.NodeImpl [] - Node term 1 start preVote.
2020-10-26 16:24:32 147278 [JRaft-ElectionTimer-0] [INFO ] com.alipay.sofa.jraft.core.NodeImpl [] - Node term 1 start preVote.
2020-10-26 16:24:43 158161 [JRaft-ElectionTimer-0] [INFO ] com.alipay.sofa.jraft.core.NodeImpl [] - Node term 1 start preVote.
2020-10-26 16:24:49 163820 [Bolt-conn-event-executor-4-thread-1] [INFO ] com.alipay.sofa.jraft.rpc.RpcRequestProcessor [] - Connection disconnected: 127.0.0.1:57648
Original follower node
2020-10-26 16:23:50 92315 [Bolt-default-executor-5-thread-8] [INFO ] com.alipay.sofa.jraft.core.NodeImpl [] - ==> 睡眠一会,用于调试
2020-10-26 16:23:53 95318 [Bolt-default-executor-5-thread-8] [INFO ] com.alipay.sofa.jraft.core.NodeImpl [] - Node start vote and grant vote self, term=1.
2020-10-26 16:23:53 95320 [JRaft-FSMCaller-Disruptor-0] [INFO ] com.baidu.hugegraph.backend.store.raft.StoreStateMachine [] - The node 127.0.0.1:8282 abdicated from follower
2020-10-26 16:23:53 95320 [JRaft-FSMCaller-Disruptor-0] [INFO ] com.alipay.sofa.jraft.core.StateMachineAdapter [] - onStopFollowing: LeaderChangeContext [leaderId=127.0.0.1:8281, term=1, status=Status[ERAFTTIMEDOUT<10001>: A follower's leader_id is reset to NULL as it begins to request_vote.]].
2020-10-26 16:23:53 95321 [default-group/127.0.0.1:8282-AppendEntriesThread0] [WARN ] com.alipay.sofa.jraft.core.NodeImpl [] - Node ignore stale AppendEntriesRequest from 127.0.0.1:8281, term=1, currTerm=2.
2020-10-26 16:23:53 95328 [Bolt-default-executor-5-thread-8] [INFO ] com.alipay.sofa.jraft.storage.impl.LocalRaftMetaStorage [] - Save raft meta, path=/Users/liningrui/IdeaProjects/baidu/xbu-data/hugegraph/node2/raft-log/meta, term=2, votedFor=127.0.0.1:8282, cost time=1 ms
2020-10-26 16:23:53 95329 [Bolt-default-executor-5-thread-8] [INFO ] com.alipay.sofa.jraft.core.NodeImpl [] - Node received TimeoutNowRequest from 127.0.0.1:8281, term=1.
2020-10-26 16:23:56 98323 [server-info-db-worker-1] [WARN ] com.baidu.hugegraph.backend.store.raft.RaftNode [] - Waiting for raft group 'default-group' election cost 3.002s
2020-10-26 16:23:59 101324 [server-info-db-worker-1] [WARN ] com.baidu.hugegraph.backend.store.raft.RaftNode [] - Waiting for raft group 'default-group' election cost 6.003s
2020-10-26 16:24:02 104330 [server-info-db-worker-1] [WARN ] com.baidu.hugegraph.backend.store.raft.RaftNode [] - Waiting for raft group 'default-group' election cost 9.009s
2020-10-26 16:24:03 105339 [JRaft-RPC-Processor-7] [WARN ] com.alipay.sofa.jraft.core.NodeImpl [] - Node RequestVote to 127.0.0.1:8281 error: Status[EINTERNAL<1004>: RPC exception:Invoke timeout when invoke with callback.The address is 127.0.0.1:8281].
2020-10-26 16:24:04 105802 [JRaft-VoteTimer-0] [WARN ] com.alipay.sofa.jraft.core.NodeImpl [] - Candidate node term 2 steps down when election reaching vote timeout: fail to get quorum vote-granted.
2020-10-26 16:24:04 105803 [JRaft-VoteTimer-0] [INFO ] com.alipay.sofa.jraft.core.NodeImpl [] - Node term 2 start preVote.
2020-10-26 16:24:05 107332 [server-info-db-worker-1] [WARN ] com.baidu.hugegraph.backend.store.raft.RaftNode [] - Waiting for raft group 'default-group' election cost 12.011s
2020-10-26 16:24:08 110337 [server-info-db-worker-1] [WARN ] com.baidu.hugegraph.backend.store.raft.RaftNode [] - Waiting for raft group 'default-group' election cost 15.015s
2020-10-26 16:24:11 113342 [server-info-db-worker-1] [WARN ] com.baidu.hugegraph.backend.store.raft.RaftNode [] - Waiting for raft group 'default-group' election cost 18.021s
2020-10-26 16:24:12 113778 [Bolt-default-executor-5-thread-12] [INFO ] com.alipay.sofa.jraft.core.NodeImpl [] - Node received PreVoteRequest from 127.0.0.1:8281, term=2, currTerm=2, granted=true, requestLastLogId=LogId [index=46, term=1], lastLogId=LogId [index=46, term=1].
2020-10-26 16:24:14 115816 [JRaft-RPC-Processor-8] [WARN ] com.alipay.sofa.jraft.core.NodeImpl [] - Node PreVote to 127.0.0.1:8281 error: Status[EINTERNAL<1004>: RPC exception:Invoke timeout when invoke with callback.The address is 127.0.0.1:8281].
2020-10-26 16:24:14 116226 [JRaft-ElectionTimer-0] [INFO ] com.alipay.sofa.jraft.core.NodeImpl [] - Node term 2 start preVote.
2020-10-26 16:24:14 116345 [server-info-db-worker-1] [WARN ] com.baidu.hugegraph.backend.store.raft.RaftNode [] - Waiting for raft group 'default-group' election cost 21.024s
2020-10-26 16:24:17 119349 [server-info-db-worker-1] [WARN ] com.baidu.hugegraph.backend.store.raft.RaftNode [] - Waiting for raft group 'default-group' election cost 24.028s
2020-10-26 16:24:20 122350 [server-info-db-worker-1] [WARN ] com.baidu.hugegraph.backend.store.raft.RaftNode [] - Waiting for raft group 'default-group' election cost 27.029s
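The logs show how the two nodes behaved:
- The original leader first stepped down. 10 seconds later (the configured election timeout), onTransferTimeout fired (failed to transfer leadership) and it became leader again, but it stepped down again right away because the alive nodes no longer formed a quorum (steps down when alive nodes don't satisfy quorum);
- The leader-to-be slept for 3 seconds and then started its own election, but got no reply to its vote request within the timeout. It then began a new round of pre-vote, with assorted failure warnings. Throughout, it kept warning about how long it had been waiting for the election (that warning was added in HugeGraph).
Setting the error messages aside, think about what the correct flow should be. After sleeping for 3 seconds, the leader-to-be runs electSelf; it should then receive the vote of the original leader (which has stepped down) and become the new leader. So why didn't it? Something must have gone wrong in the voting step.
I added a log statement in electSelf: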
// Omitted
// Log text means "sending vote request to {}"
LOG.info("HG ==> 向 {} 发起投票请求", peer.getEndpoint());
this.rpcService.requestVote(peer.getEndpoint(), done.request, done);
// Omitted
This line is printed, so the vote request was indeed sent.
Next I added logging on the receiving side as well, in RequestVoteRequestProcessor.processRequest0:
if (request.getPreVote()) {
    // Log text means "about to handle PreVoteRequest"
    LOG.info("HG ==> 准备处理 PreVoteRequest");
    return service.handlePreVoteRequest(request);
} else {
    // Log text means "about to handle RequestVoteRequest"
    LOG.info("HG ==> 准备处理 RequestVoteRequest");
    return service.handleRequestVoteRequest(request);
}
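Here things go wrong: neither the if branch nor the else branch prints anything, so the original leader never received the leader-to-be's vote request at all.
Why not? Both processes run locally, so it cannot be the network, and a leader had already been elected earlier. Could all the rpc threads be busy, leaving none free to handle the request? But why would the rpc threads run out? It is a local setup, but the laptop is decently specced and nothing heavy was running; would that alone leave the threads unresponsive for this long?
Then I spotted the forwardToLeader method in our code. It forwards write requests from a follower to the leader, and it explicitly occupies an rpc thread. That looked suspicious, so I added logging there, and also at the receiving end (StoreCommandRequestProcessor.processRequest).
On the next run, I found requests being forwarded recursively (the Chinese text in the log below reads "thread ... submits the command and waits for it to execute"):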
2020-10-26 22:01:57 42474 [Bolt-default-executor-5-thread-18] [INFO ] com.baidu.hugegraph.backend.store.raft.StoreCommandRequestProcessor [] - 线程 Bolt-default-executor-5-thread-18 提交并等待命令执行
2020-10-26 22:01:57 42475 [Bolt-default-executor-5-thread-18] [INFO ] com.baidu.hugegraph.backend.store.raft.RaftNode [] - The node forward request to leader 127.0.0.1:8281
2020-10-26 22:01:57 42477 [Bolt-default-executor-5-thread-20] [INFO ] com.baidu.hugegraph.backend.store.raft.StoreCommandRequestProcessor [] - 线程 Bolt-default-executor-5-thread-20 提交并等待命令执行
2020-10-26 22:01:57 42478 [Bolt-default-executor-5-thread-20] [INFO ] com.baidu.hugegraph.backend.store.raft.RaftNode [] - The node forward request to leader 127.0.0.1:8281
2020-10-26 22:01:57 42480 [Bolt-default-executor-5-thread-1] [INFO ] com.baidu.hugegraph.backend.store.raft.StoreCommandRequestProcessor [] - 线程 Bolt-default-executor-5-thread-1 提交并等待命令执行
2020-10-26 22:01:57 42480 [Bolt-default-executor-5-thread-1] [INFO ] com.baidu.hugegraph.backend.store.raft.RaftNode [] - The node forward request to leader 127.0.0.1:8281
...
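This is odd. Reaching forwardToLeader means the current node is not the leader and must forward the request, yet receiving the request means the node is the leader. Such a contradictory state has to come from the code that determines the leader role. Tracing it led to this snippet: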
this.waitLeaderElected(RaftSharedContext.NO_TIMEOUT);
if (!this.isRaftLeader()) {
    this.forwardToLeader(command, closure);
    return;
}
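Here waitLeaderElected() uses Node.getLeaderId() to check whether a leader exists, and isRaftLeader() uses Node.isLeader() to check whether this node is the leader. The situation now is:
- Node.getLeaderId() says a leader exists, and leaderId points to this very node;
- Node.isLeader() says this node is not the leader.
So the cause of the recursive forwarding above is:
The leader-to-be forwarded a write request to the original leader. An rpc thread on the original leader received it and called submitAndWait to execute it. waitLeaderElected passed, but this.isRaftLeader() returned false, so execution entered forwardToLeader, which fetched the current leaderId (the node itself) and sent the request to itself. This repeated until the rpc threads were used up.
So by the time the leader-to-be sent its vote request, the original leader had no rpc thread left to accept it, and the vote timed out. The result: the original leader cannot be elected because its term is lower, and the leader-to-be cannot be elected because it never gets the original leader's vote, so the raft group can never elect a leader again.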
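The starvation described above is easy to reproduce with a plain fixed-size thread pool. The sketch below is a toy model, not HugeGraph or jraft code: handleWrite stands in for a handler that believes it is not the leader while the leader id points at itself, so it forwards to its own pool and blocks on the result. The refuseSelfForward flag is a hypothetical guard added only for contrast; the original post does not propose it.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model: a handler that forwards to itself and waits eats every rpc thread.
final class SelfForwardSketch {

    // Returns true if a "vote request" gets served while writes are being forwarded.
    static boolean voteServed(int rpcThreads, boolean refuseSelfForward, long waitMs) {
        ExecutorService rpcPool = Executors.newFixedThreadPool(rpcThreads);
        AtomicInteger handled = new AtomicInteger();
        try {
            // One forwarded write arrives at the node.
            rpcPool.submit(() -> handleWrite(rpcPool, handled, refuseSelfForward));
            // Give the forwarding chain time to occupy every rpc thread.
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(waitMs);
            while (handled.get() < rpcThreads && System.nanoTime() < deadline) {
                Thread.sleep(1);
            }
            // Now the vote request arrives and needs an rpc thread too.
            Future<String> vote = rpcPool.submit(() -> "granted");
            vote.get(waitMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false; // no free rpc thread picked up the vote in time
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            return false;
        } finally {
            rpcPool.shutdownNow(); // interrupts the blocked handlers so the JVM can exit
        }
    }

    // Models the state from the post: isLeader() is false, but getLeaderId() is this node.
    static void handleWrite(ExecutorService rpcPool, AtomicInteger handled, boolean refuseSelfForward) {
        handled.incrementAndGet();
        if (refuseSelfForward) {
            return; // hypothetical guard: fail fast instead of forwarding to ourselves
        }
        // "forwardToLeader": the leader is this node, so the request lands in the same pool.
        Future<?> forwarded = rpcPool.submit(() -> handleWrite(rpcPool, handled, refuseSelfForward));
        try {
            forwarded.get(); // blocks this rpc thread until the forwarded copy finishes
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (ExecutionException e) {
            // ignored in this sketch
        }
    }

    public static void main(String[] args) {
        System.out.println("vote served, self-forwarding:       " + voteServed(4, false, 300));
        System.out.println("vote served, self-forward refused:  " + voteServed(4, true, 300));
    }
}
```

In the first case every pool thread ends up blocked waiting on a copy of the request that can never run, so the vote times out; in the second the handler returns at once and the vote is served.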
The inconsistency between the two leader checks in jraft after onLeaderStop has been raised with the community: https://github.com/sofastack/sofa-jraft/issues/531