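In the Raft protocol, a node takes one of three roles: Leader, Follower, or Candidate.
An election takes place in two phases:
Every node starts in the follower state. If a follower receives no heartbeat from the leader within a certain period (either there is no leader yet, or the leader has crashed), it becomes a Candidate and starts an election. Before the election it increments its term; the term plays the same role as the epoch in ZooKeeper.
The candidate votes for itself, sends a vote request to the other nodes, and waits for their replies.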
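The transition described above can be sketched as a tiny state machine. This is only an illustration of the generic Raft rule, not Nacos code; `RaftRole`, `SketchNode`, and `onElectionTimeout` are hypothetical names introduced for the example.

```java
import java.util.concurrent.atomic.AtomicLong;

/** The three Raft roles. */
enum RaftRole { FOLLOWER, CANDIDATE, LEADER }

/** Hypothetical, simplified node used only to illustrate the transition. */
final class SketchNode {
    final String id;
    final AtomicLong term = new AtomicLong(0L);
    RaftRole role = RaftRole.FOLLOWER;   // every node starts as a follower
    String votedFor;                     // null until a vote is cast

    SketchNode(String id) {
        this.id = id;
    }

    /** Called when no leader heartbeat arrived before the timeout expired. */
    void onElectionTimeout() {
        if (role == RaftRole.LEADER) {
            return;                      // a leader does not wait for heartbeats
        }
        term.incrementAndGet();          // a new election starts a new term
        role = RaftRole.CANDIDATE;
        votedFor = id;                   // the candidate votes for itself
    }
}
```

Sending the vote request to the other nodes and collecting their replies is left out on purpose; the sketch covers only the local state change.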
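During this process, several outcomes are possible:
Constraints:
When Nacos Server starts, it calls RaftCore.init(), which sets up the cluster leader election and the heartbeat mechanism between nodes.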
/**
 * @author nacos
 */
@Component
public class RaftCore {
    @PostConstruct
    public void init() throws Exception {
        Loggers.RAFT.info("initializing Raft sub-system");
        executor.submit(notifier);
        long start = System.currentTimeMillis();
        // load the persisted datums and the last stored term from disk
        raftStore.loadDatums(notifier, datums);
        setTerm(NumberUtils.toLong(raftStore.loadMeta().getProperty("term"), 0L));
        Loggers.RAFT.info("cache loaded, datum count: {}, current term: {}", datums.size(), peers.getTerm());
        // wait until the notifier has no pending tasks left
        while (true) {
            if (notifier.tasks.size() <= 0) {
                break;
            }
            Thread.sleep(1000L);
        }
        initialized = true;
        Loggers.RAFT.info("finish to load data from disk, cost: {} ms.", (System.currentTimeMillis() - start));
        // leader election
        GlobalExecutor.registerMasterElection(new MasterElection());
        // heartbeat between cluster nodes
        GlobalExecutor.registerHeartbeat(new HeartBeat());
        Loggers.RAFT.info("timer started: leader timeout ms: {}, heart-beat timeout ms: {}",
            GlobalExecutor.LEADER_TIMEOUT_MS, GlobalExecutor.HEARTBEAT_INTERVAL_MS);
    }
}
Within init(), the election is set up by the call GlobalExecutor.registerMasterElection(new MasterElection()).
registerMasterElection() starts a scheduled task that runs the logic in MasterElection, shown next:
public class MasterElection implements Runnable {
    @Override
    public void run() {
        try {
            if (!peers.isReady()) {
                return;
            }
            // get the RaftPeer of the local node
            RaftPeer local = peers.local();
            // count the election timeout down by one tick
            local.leaderDueMs -= GlobalExecutor.TICK_PERIOD_MS;
            if (local.leaderDueMs > 0) {
                return;
            }
            // reset timeout: both the election timeout and the heartbeat timer
            local.resetLeaderDue();
            local.resetHeartbeatDue();
            // send the vote to the other Nacos nodes
            sendVote();
        } catch (Exception e) {
            Loggers.RAFT.warn("[RAFT] error while master election {}", e);
        }
    }
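The countdown in run() can be isolated into a small helper. The following is a minimal sketch, assuming a fixed timeout; Raft implementations usually add random jitter to the election timeout, which is omitted here. `TickTimer` and its methods are hypothetical names, not part of Nacos.

```java
/** Hypothetical countdown timer driven by a fixed-period scheduled task. */
final class TickTimer {
    private final long timeoutMs;
    private long dueMs;

    TickTimer(long timeoutMs) {
        this.timeoutMs = timeoutMs;
        this.dueMs = timeoutMs;
    }

    /**
     * Subtracts one tick. Returns true when the timeout has expired,
     * in which case the countdown is reset before returning.
     */
    boolean tick(long tickPeriodMs) {
        dueMs -= tickPeriodMs;
        if (dueMs > 0) {
            return false;        // not due yet, nothing to do on this tick
        }
        dueMs = timeoutMs;       // start the next countdown
        return true;
    }

    /** Restarts the countdown, for example when a heartbeat arrives. */
    void reset() {
        dueMs = timeoutMs;
    }

    long remainingMs() {
        return dueMs;
    }
}
```

With a 1500 ms timeout and a 500 ms tick, the third call to `tick(500)` reports expiry. Driving the timer from a periodic task keeps the timeout logic free of wall-clock reads, at the cost of a resolution of one tick.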
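The MasterElection task first obtains the RaftPeer of the local Nacos node. Among the RaftPeer fields used in the code here are ip, voteFor, term, state, leaderDueMs, and heartbeatDueMs.
With the local RaftPeer in hand, the task subtracts one tick from leaderDueMs and returns if the election timeout has not expired yet. Once it has expired, the task resets the election timeout and the heartbeat timer, then calls sendVote() to start the election: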
    public void sendVote() {
        // 1. Get the RaftPeer of the local Nacos node
        RaftPeer local = peers.get(NetUtils.localServer());
        Loggers.RAFT.info("leader timeout, start voting,leader: {}, term: {}",
            JSON.toJSONString(getLeader()), local.term);
        // 2. Clear the leader (set to null) and clear every peer's voteFor (set to null)
        peers.reset();
        // 3. Increment the local term by 1
        local.term.incrementAndGet();
        // 4. The local node votes for itself
        local.voteFor = local.ip;
        // 5. Switch the local node's state to CANDIDATE
        local.state = RaftPeer.State.CANDIDATE;
        Map<String, String> params = new HashMap<>(1);
        // 6. Serialize the local RaftPeer as the request parameter
        params.put("vote", JSON.toJSONString(local));
        // 7. Send the vote to every other node in the Nacos cluster via HttpClient
        for (final String server : peers.allServersWithoutMySelf()) {
            final String url = buildURL(server, API_VOTE);
            try {
                HttpClient.asyncHttpPost(url, null, params, new AsyncCompletionHandler() {
                    @Override
                    public Integer onCompleted(Response response) throws Exception {
                        if (response.getStatusCode() != HttpURLConnection.HTTP_OK) {
                            Loggers.RAFT.error("NACOS-RAFT vote failed: {}, url: {}", response.getResponseBody(), url);
                            return 1;
                        }
                        // 8. Read the other node's reply to the vote sent above
                        RaftPeer peer = JSON.parseObject(response.getResponseBody(), RaftPeer.class);
                        Loggers.RAFT.info("received approve from peer: {}", JSON.toJSONString(peer));
                        // 9. Decide which node is the Leader
                        peers.decideLeader(peer);
                        return 0;
                    }
                });
            } catch (Exception e) {
                Loggers.RAFT.warn("error while sending vote to server: {}", server);
            }
        }
    }
}
The main steps of sendVote() are the ones numbered 1 to 9 in the code comments above.
The logic of peers.reset() is:
public void reset() {
    leader = null;
    for (RaftPeer peer : peers.values()) {
        peer.voteFor = null;
    }
}
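The local vote is sent to the other nodes through HttpClient. On the receiving side the request is handled by RaftController.vote(), which returns that node's voting result: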
@NeedAuth
@PostMapping("/vote")
public JSONObject vote(HttpServletRequest request, HttpServletResponse response) throws Exception {
    RaftPeer peer = raftCore.receivedVote(
        JSON.parseObject(WebUtils.required(request, "vote"), RaftPeer.class));
    return JSON.parseObject(JSON.toJSONString(peer));
}
vote() mainly delegates to RaftCore.receivedVote().
This is where a Nacos node receives another node's vote and returns its own voting result:
public synchronized RaftPeer receivedVote(RaftPeer remote) {
    if (!peers.contains(remote)) {
        throw new IllegalStateException("can not find peer: " + remote.ip);
    }
    RaftPeer local = peers.get(NetUtils.localServer());
    // the candidate's term is not newer than ours: do not vote for it
    if (remote.term.get() <= local.term.get()) {
        String msg = "received illegitimate vote" +
            ", voter-term:" + remote.term + ", votee-term:" + local.term;
        Loggers.RAFT.info(msg);
        if (StringUtils.isEmpty(local.voteFor)) {
            local.voteFor = local.ip;
        }
        return local;
    }
    // the candidate's term is newer: become a follower, vote for it, adopt its term
    local.resetLeaderDue();
    local.state = RaftPeer.State.FOLLOWER;
    local.voteFor = remote.ip;
    local.term.set(remote.term.get());
    Loggers.RAFT.info("vote {} as leader, term: {}", remote.ip, remote.term);
    return local;
}
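The logic of this method is straightforward: if the remote node's term is not greater than the local term, the vote is rejected and the local node votes for itself if it has not voted yet; otherwise the local node resets its election timeout, becomes a follower, votes for the remote node, and adopts the remote term. In both cases it returns its own RaftPeer.
In RaftCore.sendVote() (section 2.3), each Nacos node sends its vote to the other nodes in the cluster. The request arrives at RaftController.vote() on each of them, which calls RaftCore.receivedVote() (section 2.4) to evaluate the incoming vote and return its own vote to the sending node.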
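The term comparison at the heart of receivedVote() can be tried out in isolation. The following is a minimal sketch that mirrors the rule in the excerpt, without the membership check and the timer reset; `VoteState` and `VoteRule` are hypothetical names, not Nacos classes.

```java
/** Hypothetical peer state holding only what the voting rule needs. */
final class VoteState {
    final String ip;
    long term;
    String voteFor;          // null until a vote is cast
    boolean follower;

    VoteState(String ip, long term) {
        this.ip = ip;
        this.term = term;
    }
}

final class VoteRule {
    private VoteRule() {
    }

    /**
     * Applies a vote request from remote to local and returns local,
     * which plays the role of the reply sent back to the candidate.
     */
    static VoteState receive(VoteState local, VoteState remote) {
        if (remote.term <= local.term) {
            // The candidate's term is not newer: do not vote for it.
            // Vote for ourselves if no vote has been cast yet.
            if (local.voteFor == null || local.voteFor.isEmpty()) {
                local.voteFor = local.ip;
            }
            return local;
        }
        // The candidate's term is newer: step down, grant the vote, adopt the term.
        local.follower = true;
        local.voteFor = remote.ip;
        local.term = remote.term;
        return local;
    }
}
```

Because the reply is always the receiver's own state, the candidate learns both whom the receiver voted for and which term the receiver is now in.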
Once RaftCore.sendVote() has received the voting result from another node, it calls decideLeader() to determine the Leader node:
public RaftPeer decideLeader(RaftPeer candidate) {
    // record the latest state reported by the replying peer
    peers.put(candidate.ip, candidate);
    SortedBag ips = new TreeBag();
    int maxApproveCount = 0;
    String maxApprovePeer = null;
    // count the votes and track the peer with the most votes
    for (RaftPeer peer : peers.values()) {
        if (StringUtils.isEmpty(peer.voteFor)) {
            continue;
        }
        ips.add(peer.voteFor);
        if (ips.getCount(peer.voteFor) > maxApproveCount) {
            maxApproveCount = ips.getCount(peer.voteFor);
            maxApprovePeer = peer.voteFor;
        }
    }
    // a leader is elected only when the top count reaches the majority
    if (maxApproveCount >= majorityCount()) {
        RaftPeer peer = peers.get(maxApprovePeer);
        peer.state = RaftPeer.State.LEADER;
        if (!Objects.equals(leader, peer)) {
            leader = peer;
            applicationContext.publishEvent(new LeaderElectFinishedEvent(this, leader));
            Loggers.RAFT.info("{} has become the LEADER", leader.ip);
        }
    }
    return leader;
}
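The method first finds the node with the most votes and its vote count, then checks whether that count reaches the majority of the Nacos cluster (majorityCount()). If it does not, the method returns the current leader, which is null right after a reset. If it does, the method marks that node as LEADER, stores it as the leader, and returns it.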
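The tally can be reproduced with plain JDK collections. The following is a minimal sketch, assuming that the majority is the usual clusterSize / 2 + 1; the body of majorityCount() is not shown in this article, so that formula is an assumption, and `VoteTally` is a hypothetical name.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

final class VoteTally {
    private VoteTally() {
    }

    /** Assumed majority rule: the smallest count above half of the cluster. */
    static int majority(int clusterSize) {
        return clusterSize / 2 + 1;
    }

    /**
     * votes holds one entry per peer: the id that peer voted for,
     * or null/empty if it has not voted. Returns the id that reached
     * the majority, or null when nobody has.
     */
    static String winner(Collection<String> votes, int clusterSize) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String vote : votes) {
            if (vote == null || vote.isEmpty()) {
                continue;                         // this peer has not voted
            }
            int count = counts.merge(vote, 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                best = vote;
            }
        }
        return bestCount >= majority(clusterSize) ? best : null;
    }
}
```

In a three-node cluster, two votes for the same node are enough, while a split of one vote each yields no leader and the election has to be repeated after the next timeout.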
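Besides the election described above, RaftCore.init() also sets up the cluster heartbeat right afterwards. It likewise registers a scheduled task, new HeartBeat(), which sends a heartbeat every 5 s: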
public class HeartBeat implements Runnable {
    @Override
    public void run() {
        try {
            if (!peers.isReady()) {
                return;
            }
            RaftPeer local = peers.local();
            local.heartbeatDueMs -= GlobalExecutor.TICK_PERIOD_MS;
            if (local.heartbeatDueMs > 0) {
                return;
            }
            local.resetHeartbeatDue();
            sendBeat();
        } catch (Exception e) {
            Loggers.RAFT.warn("[RAFT] error while sending beat {}", e);
        }
    }
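The task obtains the local RaftPeer and counts heartbeatDueMs down by one tick. Once the countdown expires, it resets the heartbeat timer and calls sendBeat() to send the heartbeat: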
    public void sendBeat() throws IOException, InterruptedException {
        RaftPeer local = peers.local();
        if (local.state != RaftPeer.State.LEADER && !STANDALONE_MODE) {
            return;
        }
        if (Loggers.RAFT.isDebugEnabled()) {
            Loggers.RAFT.debug("[RAFT] send beat with {} keys.", datums.size());
        }
        local.resetLeaderDue();
        // build data
        JSONObject packet = new JSONObject();
        packet.put("peer", local);
        JSONArray array = new JSONArray();
        if (switchDomain.isSendBeatOnly()) {
            Loggers.RAFT.info("[SEND-BEAT-ONLY] {}", String.valueOf(switchDomain.isSendBeatOnly()));
        }
        if (!switchDomain.isSendBeatOnly()) {
            for (Datum datum : datums.values()) {
                JSONObject element = new JSONObject();
                if (KeyBuilder.matchServiceMetaKey(datum.key)) {
                    element.put("key", KeyBuilder.briefServiceMetaKey(datum.key));
                } else if (KeyBuilder.matchInstanceListKey(datum.key)) {
                    element.put("key", KeyBuilder.briefInstanceListkey(datum.key));
                }
                element.put("timestamp", datum.timestamp);
                array.add(element);
            }
        }
        packet.put("datums", array);
        // broadcast
        Map params = new HashMap(1);
        params.put("beat", JSON.toJSONString(packet));
        String content = JSON.toJSONString(params);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        GZIPOutputStream gzip = new GZIPOutputStream(out);
        gzip.write(content.getBytes(StandardCharsets.UTF_8));
        gzip.close();
        byte[] compressedBytes = out.toByteArray();
        String compressedContent = new String(compressedBytes, StandardCharsets.UTF_8);
        if (Loggers.RAFT.isDebugEnabled()) {
            Loggers.RAFT.debug("raw beat data size: {}, size of compressed data: {}",
                content.length(), compressedContent.length());
        }
        for (final String server : peers.allServersWithoutMySelf()) {
            try {
                final String url = buildURL(server, API_BEAT);
                if (Loggers.RAFT.isDebugEnabled()) {
                    Loggers.RAFT.debug("send beat to server " + server);
                }
                HttpClient.asyncHttpPostLarge(url, null, compressedBytes, new AsyncCompletionHandler() {
                    @Override
                    public Integer onCompleted(Response response) throws Exception {
                        if (response.getStatusCode() != HttpURLConnection.HTTP_OK) {
                            Loggers.RAFT.error("NACOS-RAFT beat failed: {}, peer: {}",
                                response.getResponseBody(), server);
                            MetricsMonitor.getLeaderSendBeatFailedException().increment();
                            return 1;
                        }
                        peers.update(JSON.parseObject(response.getResponseBody(), RaftPeer.class));
                        if (Loggers.RAFT.isDebugEnabled()) {
                            Loggers.RAFT.debug("receive beat response from: {}", url);
                        }
                        return 0;
                    }
                    @Override
                    public void onThrowable(Throwable t) {
                        Loggers.RAFT.error("NACOS-RAFT error while sending heart-beat to peer: {} {}", server, t);
                        MetricsMonitor.getLeaderSendBeatFailedException().increment();
                    }
                });
            } catch (Exception e) {
                Loggers.RAFT.error("error while sending heart-beat to peer: {} {}", server, e);
                MetricsMonitor.getLeaderSendBeatFailedException().increment();
            }
        }
    }
}
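Roughly, the method works as follows:
It first checks whether the local node is the Leader and returns immediately if it is not (unless the server runs in standalone mode). If it is the Leader, it packs its RaftPeer together with the key and timestamp of each datum and sends them through HttpClient to the other nodes of the Nacos cluster, the followers. The request is handled by RaftController.beat(), which calls RaftCore.receivedBeat() and returns the remote node's RaftPeer to the sender. The sender then updates its RaftPeerSet with the reply, which keeps the nodes' view of the cluster consistent.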
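Before posting, sendBeat() compresses the JSON payload with GZIP. The following is a minimal round-trip sketch using only the JDK; `GzipSketch` is a hypothetical helper, and how the receiving Nacos node decompresses the body is not shown in this article.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipSketch {
    private GzipSketch() {
    }

    /** Encodes the text as UTF-8 and compresses it with GZIP. */
    static byte[] compress(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // closing the GZIP stream writes the trailer, so close before reading out
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Reverses compress(): inflates the bytes and decodes them as UTF-8. */
    static String decompress(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[1024];
            int read;
            while ((read = gzip.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Note that the compressed bytes are binary data. Turning them into a String, as the excerpt does for its debug log, is only useful for a rough size figure; the request body itself must be the raw byte array.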
Core code of receivedBeat():
if (local.state != RaftPeer.State.FOLLOWER) {
    Loggers.RAFT.info("[RAFT] make remote as leader, remote peer: {}", JSON.toJSONString(remote));
    // mk follower
    local.state = RaftPeer.State.FOLLOWER;
    local.voteFor = remote.ip;
}
final JSONArray beatDatums = beat.getJSONArray("datums");
local.resetLeaderDue();
local.resetHeartbeatDue();
peers.makeLeader(remote);
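This method checks whether the local node, the one receiving the heartbeat, is a follower; if it is not, it switches to FOLLOWER and records its vote for the remote node. It then resets its election and heartbeat timers and calls makeLeader() to record the remote node as the leader, which also changes any other non-follower node to follower. Finally, the receiving node's RaftPeer is returned to the sender.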
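The follower side of the heartbeat can be sketched as follows. This is a simplified illustration of the excerpt above: it omits any validation of the sender or its term, which the excerpt does not show, and `BeatReceiver` with its fields is a hypothetical name rather than Nacos code.

```java
/** Hypothetical receiver of leader heartbeats. */
final class BeatReceiver {
    enum State { FOLLOWER, CANDIDATE, LEADER }

    State state = State.CANDIDATE;
    String voteFor;
    String leaderIp;
    long leaderDueMs;
    private final long leaderTimeoutMs;

    BeatReceiver(long leaderTimeoutMs) {
        this.leaderTimeoutMs = leaderTimeoutMs;
        this.leaderDueMs = leaderTimeoutMs;
    }

    /** Handles a heartbeat sent by the node at senderIp. */
    void onBeat(String senderIp) {
        if (state != State.FOLLOWER) {
            // a leader is alive: give up candidacy (or leadership) and follow it
            state = State.FOLLOWER;
            voteFor = senderIp;
        }
        // the heartbeat restarts the election countdown
        leaderDueMs = leaderTimeoutMs;
        leaderIp = senderIp;
    }
}
```

Restarting the election countdown on every heartbeat is what stops a follower from starting a new election while the leader is still alive.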