I. The Raft Algorithm
Raft reaches consensus through an elected leader. A server in a Raft cluster is either a leader or a follower, and may be a candidate in the specific case of an election (when the leader is unavailable). The leader is responsible for replicating the log to the followers, and it periodically announces its existence by sending heartbeat messages. Each follower has a timeout (typically between 150 and 300 ms) within which it expects a heartbeat from the leader; the timeout is reset whenever a heartbeat arrives. If no heartbeat is received before the timeout expires, the follower changes its state to candidate and starts a leader election.
Note: Raft-style consensus algorithms appear throughout distributed middleware: Nacos, Kafka, RocketMQ, Flink, Pulsar, and so on.
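The follower-side timeout behavior described above can be sketched in a few lines. This is an illustrative model, not Nacos or any Raft library API; `FollowerTimer`, `onHeartbeat`, and `tick` are made-up names:

```java
import java.util.Random;

// Minimal sketch of a Raft follower's heartbeat timeout.
// All names here are illustrative, not a real Raft library API.
public class FollowerTimer {
    enum State { FOLLOWER, CANDIDATE, LEADER }

    State state = State.FOLLOWER;
    long deadlineMs;
    final Random rnd = new Random();

    // Raft randomizes the election timeout (here: 150-300 ms) so that
    // followers rarely become candidates at the same instant.
    long randomTimeoutMs() {
        return 150 + rnd.nextInt(151);
    }

    // Called whenever a heartbeat arrives: push the deadline forward.
    void onHeartbeat(long nowMs) {
        deadlineMs = nowMs + randomTimeoutMs();
    }

    // Called periodically: if the deadline passed without a heartbeat,
    // the follower becomes a candidate and (in real Raft) starts an election.
    State tick(long nowMs) {
        if (state == State.FOLLOWER && nowMs >= deadlineMs) {
            state = State.CANDIDATE;
        }
        return state;
    }
}
```

The randomization is the key design point: if all followers shared one fixed timeout, they would all become candidates simultaneously after a leader crash and split the vote.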
1. Each Nacos client instance sends a registration request (registerRequest) to one of the configured server addresses.
2. The server that receives the request forwards it to the leader node.
3. The leader stores the registration in a local in-memory map.
4. The leader periodically sends heartbeats to the registered machines based on this local map (localAllInfoMap).
5. If a heartbeat gets no response, the leader retries a few times; if there is still no response, that machine is removed from the map.
6. Each client also periodically sends a beatRequest to the Nacos server, so that once the network recovers it is registered with Nacos again.
7. In cluster mode, the servers exchange requests with one another and determine the leader node via the Raft algorithm.
8. The leader periodically sends a beatLeaderRequest to the followers. Having only the leader send these requests, rather than all nodes messaging each other, is itself a performance optimization.
9. If a server fails to reply, an election is triggered.
10. Followers also periodically send requests to the leader, but at a lower frequency than the leader's heartbeats; again, this is done for performance.
Details of the Raft implementation in Nacos:
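Steps 3-5 above (the leader keeps registrations in a local map, heartbeats them, and evicts entries after repeated failures) can be sketched as follows. This is a simplified model, not the actual Nacos implementation; `LeaderRegistry`, `MAX_RETRIES`, and the `reachable` predicate standing in for the network call are all illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Sketch of the leader-side registry: register instances, heartbeat them,
// and evict any instance that fails several consecutive heartbeats.
public class LeaderRegistry {
    static final int MAX_RETRIES = 3;  // illustrative threshold

    // Step 3: the leader keeps registrations in a local in-memory map
    // (address -> consecutive heartbeat failure count).
    final Map<String, Integer> localAllInfoMap = new ConcurrentHashMap<>();

    void register(String addr) {
        localAllInfoMap.put(addr, 0);
    }

    // Steps 4-5: periodically heartbeat every registered instance;
    // `reachable` stands in for the real network round trip.
    void heartbeatAll(Predicate<String> reachable) {
        localAllInfoMap.replaceAll((addr, failures) ->
                reachable.test(addr) ? 0 : failures + 1);
        // Evict instances that failed MAX_RETRIES consecutive heartbeats.
        localAllInfoMap.values().removeIf(failures -> failures >= MAX_RETRIES);
    }

    boolean contains(String addr) {
        return localAllInfoMap.containsKey(addr);
    }
}
```

Step 6 is the recovery path: an evicted instance keeps sending its own beatRequest, so it re-enters the map (via `register`) as soon as the network heals.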
RaftController.beat() handles the /v1/ns/raft/beat request. The HTTP endpoint that receives heartbeat packets:
@RestController
@RequestMapping(UtilsAndCommons.NACOS_NAMING_CONTEXT + "/raft")
public class RaftController {
......
@NeedAuth
@RequestMapping(value = "/beat", method = RequestMethod.POST)
public JSONObject beat(HttpServletRequest request, HttpServletResponse response) throws Exception {
String entity = new String(IoUtils.tryDecompress(request.getInputStream()), "UTF-8");
String value = URLDecoder.decode(entity, "UTF-8");
value = URLDecoder.decode(value, "UTF-8");
// Parse the heartbeat packet
JSONObject json = JSON.parseObject(value);
JSONObject beat = JSON.parseObject(json.getString("beat"));
// Process the heartbeat and return this node's info as the response
RaftPeer peer = raftCore.receivedBeat(beat);
return JSON.parseObject(JSON.toJSONString(peer));
}
......
}
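For context, the payload that arrives at this endpoint carries the sender's RaftPeer state plus the keys and timestamps of its datums. The sketch below only illustrates the rough shape of that JSON; `buildBeatJson` is a hypothetical helper, and the real sender additionally gzip-compresses and URL-encodes the body (which is why the controller above decompresses and URL-decodes it before parsing):

```java
// Illustrative sketch of the /v1/ns/raft/beat payload shape.
// buildBeatJson is a made-up helper, not part of Nacos.
public class BeatSender {
    // The beat carries the leader's own peer info plus datum keys and
    // timestamps, so followers can detect missing or stale data.
    static String buildBeatJson(String ip, long term, String datumKey, long timestamp) {
        return "{\"peer\":{\"ip\":\"" + ip + "\",\"state\":\"LEADER\",\"term\":" + term
                + ",\"voteFor\":\"" + ip + "\"},"
                + "\"datums\":[{\"key\":\"" + datumKey + "\",\"timestamp\":" + timestamp + "}]}";
    }
}
```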
RaftCore.receivedBeat() processes the heartbeat packet:
1. If the node receiving the heartbeat is not in the FOLLOWER role, set it to FOLLOWER and set its voteFor to the leader's IP;
2. Reset the local node's heartbeat timeout and election timeout;
3. Call PeerSet.makeLeader() to make this node update its leader (in other words, the leader tells the other nodes who the leader is via its heartbeats);
4. Check the Datums:
Iterate over the datums in the request parameters; if the follower does not have a datumKey, or its local timestamp is older, collect that datumKey;
For every 50 datumKeys collected, send a request to the leader's /v1/ns/raft/get path with those 50 keys as parameters, fetching the 50 corresponding latest Datum objects;
Iterate over those Datum objects; what happens next is similar to RaftCore.onPublish():
1. Call RaftStore#write to serialize the Datum to JSON and write it to the cacheFile
2. Put the Datum into RaftCore's datums map, keyed by the datum's key
3. Reset the local node's election timeout
4. Update the local node's term
5. Persist the local node's term to the properties file
6. Call notifier.addTask(datum, Notifier.ApplyAction.CHANGE) to notify the corresponding RaftListener
RaftCore.deleteDatum(String key) deletes stale Datums:
remove the key's Datum from the datums map;
RaftStore.delete(): delete the Datum's file on disk;
notifier.addTask(deleted, Notifier.ApplyAction.DELETE): notify the corresponding RaftListener of the DELETE event.
Finally, the local node's RaftPeer is returned as the HTTP response.
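The deletion flow just described can be condensed into a sketch. This is a simplified rendering, not the actual Nacos source; the in-memory maps and notification list below stand in for RaftStore and Notifier:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch of RaftCore.deleteDatum: remove from the in-memory
// datums map, delete the on-disk file, and notify listeners of DELETE.
// diskFiles and notifications are stand-ins for RaftStore and Notifier.
public class DatumStoreSketch {
    enum ApplyAction { CHANGE, DELETE }

    final Map<String, String> datums = new ConcurrentHashMap<>();    // key -> serialized datum
    final Map<String, String> diskFiles = new ConcurrentHashMap<>(); // stand-in for the cacheFile dir
    final List<String> notifications = new ArrayList<>();            // stand-in for Notifier tasks

    void put(String key, String value) {
        datums.put(key, value);
        diskFiles.put(key, value);
    }

    void deleteDatum(String key) {
        // 1. remove the key's Datum from RaftCore.datums
        String deleted = datums.remove(key);
        if (deleted != null) {
            // 2. RaftStore.delete(): delete the Datum's file on disk
            diskFiles.remove(key);
            // 3. notifier.addTask(deleted, ApplyAction.DELETE)
            notifications.add(key + ":" + ApplyAction.DELETE);
        }
    }
}
```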
@Component
public class RaftCore {
......
public RaftPeer receivedBeat(JSONObject beat) throws Exception {
final RaftPeer local = peers.local();
// Parse the info of the node that sent the heartbeat
final RaftPeer remote = new RaftPeer();
remote.ip = beat.getJSONObject("peer").getString("ip");
remote.state = RaftPeer.State.valueOf(beat.getJSONObject("peer").getString("state"));
remote.term.set(beat.getJSONObject("peer").getLongValue("term"));
remote.heartbeatDueMs = beat.getJSONObject("peer").getLongValue("heartbeatDueMs");
remote.leaderDueMs = beat.getJSONObject("peer").getLongValue("leaderDueMs");
remote.voteFor = beat.getJSONObject("peer").getString("voteFor");
// If the heartbeat was not sent by a LEADER node, throw an exception
if (remote.state != RaftPeer.State.LEADER) {
Loggers.RAFT.info("[RAFT] invalid state from master, state: {}, remote peer: {}",
remote.state, JSON.toJSONString(remote));
throw new IllegalArgumentException("invalid state from master, state: " + remote.state);
}
// If the local term is greater than the heartbeat's term, ignore the heartbeat
if (local.term.get() > remote.term.get()) {
Loggers.RAFT.info("[RAFT] out of date beat, beat-from-term: {}, beat-to-term: {}, remote peer: {}, and leaderDueMs: {}"
, remote.term.get(), local.term.get(), JSON.toJSONString(remote), local.leaderDueMs);
throw new IllegalArgumentException("out of date beat, beat-from-term: " + remote.term.get()
+ ", beat-to-term: " + local.term.get());
}
// If the current node is not a follower, demote it to follower
if (local.state != RaftPeer.State.FOLLOWER) {
Loggers.RAFT.info("[RAFT] make remote as leader, remote peer: {}", JSON.toJSONString(remote));
// mk follower
local.state = RaftPeer.State.FOLLOWER;
local.voteFor = remote.ip;
}
final JSONArray beatDatums = beat.getJSONArray("datums");
// Reset the heartbeat timeout and the election timeout
local.resetLeaderDue();
local.resetHeartbeatDue();
// Update leader info: set remote as the new leader and refresh the old leader's peer info
peers.makeLeader(remote);
// Put all of this node's datum keys into a map, each with value 0 (not yet seen)
Map<String, Integer> receivedKeysMap = new HashMap<String, Integer>(datums.size());
for (Map.Entry<String, Datum> entry : datums.entrySet()) {
receivedKeysMap.put(entry.getKey(), 0);
}
// Check the received datum list
List<String> batch = new ArrayList<String>();
if (!switchDomain.isSendBeatOnly()) {
int processedCount = 0;
Loggers.RAFT.info("[RAFT] received beat with {} keys, RaftCore.datums' size is {}, remote server: {}, term: {}, local term: {}",
beatDatums.size(), datums.size(), remote.ip, remote.term, local.term);
for (Object object : beatDatums) {
processedCount = processedCount + 1;
JSONObject entry = (JSONObject) object;
String key = entry.getString("key");
final String datumKey;
// Rebuild the datumKey (re-add the prefix; keys are sent without it)
if (KeyBuilder.matchServiceMetaKey(key)) {
datumKey = KeyBuilder.detailServiceMetaKey(key);
} else if (KeyBuilder.matchInstanceListKey(key)) {
datumKey = KeyBuilder.detailInstanceListkey(key);
} else {
// ignore corrupted key:
continue;
}
// Get the version (timestamp) of the received key
long timestamp = entry.getLong("timestamp");
// Mark the received key as seen (1) in the local key map
receivedKeysMap.put(datumKey, 1);
try {
// If the key exists locally with a version >= the received one, and there is still data to process, skip it
if (datums.containsKey(datumKey) && datums.get(datumKey).timestamp.get() >= timestamp && processedCount < beatDatums.size()) {
continue;
}
// If the key is missing locally, or the local version is older, add it to the batch for fetching
if (!(datums.containsKey(datumKey) && datums.get(datumKey).timestamp.get() >= timestamp)) {
batch.add(datumKey);
}
// Only fetch when the batch reaches 50 keys or all entries have been processed
if (batch.size() < 50 && processedCount < beatDatums.size()) {
continue;
}
String keys = StringUtils.join(batch, ",");
if (batch.size() <= 0) {
continue;
}
Loggers.RAFT.info("get datums from leader: {}, batch size is {}, processedCount is {}, datums' size is {}, RaftCore.datums' size is {}"
, getLeader().ip, batch.size(), processedCount, beatDatums.size(), datums.size());
// Fetch the data for the batched keys
// update datum entry
String url = buildURL(remote.ip, API_GET) + "?keys=" + URLEncoder.encode(keys, "UTF-8");
HttpClient.asyncHttpGet(url, null, null, new AsyncCompletionHandler<Integer>() {
@Override
public Integer onCompleted(Response response) throws Exception {
if (response.getStatusCode() != HttpURLConnection.HTTP_OK) {
return 1;
}
List<Datum> datumList = JSON.parseObject(response.getResponseBody(), new TypeReference<List<Datum>>() {
});
// Update local data
for (Datum datum : datumList) {
OPERATE_LOCK.lock();
try {
Datum oldDatum = getDatum(datum.key);
if (oldDatum != null && datum.timestamp.get() <= oldDatum.timestamp.get()) {
Loggers.RAFT.info("[NACOS-RAFT] timestamp is smaller than that of mine, key: {}, remote: {}, local: {}",
datum.key, datum.timestamp, oldDatum.timestamp);
continue;
}
raftStore.write(datum);
if (KeyBuilder.matchServiceMetaKey(datum.key)) {
Datum<Service> serviceDatum = new Datum<>();
serviceDatum.key = datum.key;
serviceDatum.timestamp.set(datum.timestamp.get());
serviceDatum.value = JSON.parseObject(JSON.toJSONString(datum.value), Service.class);
datum = serviceDatum;
}
if (KeyBuilder.matchInstanceListKey(datum.key)) {
Datum<Instances> instancesDatum = new Datum<>();
instancesDatum.key = datum.key;
instancesDatum.timestamp.set(datum.timestamp.get());
instancesDatum.value = JSON.parseObject(JSON.toJSONString(datum.value), Instances.class);
datum = instancesDatum;
}
datums.put(datum.key, datum);
notifier.addTask(datum.key, ApplyAction.CHANGE);
local.resetLeaderDue();
if (local.term.get() + 100 > remote.term.get()) {
getLeader().term.set(remote.term.get());
local.term.set(getLeader().term.get());
} else {
local.term.addAndGet(100);
}
raftStore.updateTerm(local.term.get());
Loggers.RAFT.info("data updated, key: {}, timestamp: {}, from {}, local term: {}",
datum.key, datum.timestamp, JSON.toJSONString(remote), local.term);
} catch (Throwable e) {
Loggers.RAFT.error("[RAFT-BEAT] failed to sync datum from leader, key: {} {}", datum.key, e);
} finally {
OPERATE_LOCK.unlock();
}
}
TimeUnit.MILLISECONDS.sleep(200);
return 0;
}
});
batch.clear();
} catch (Exception e) {
Loggers.RAFT.error("[NACOS-RAFT] failed to handle beat entry, key: {}", datumKey);
}
}
// If a key exists locally but is absent from the received key list, the leader has deleted it, so delete it locally too
List<String> deadKeys = new ArrayList<String>();
for (Map.Entry<String, Integer> entry : receivedKeysMap.entrySet()) {
if (entry.getValue() == 0) {
deadKeys.add(entry.getKey());
}
}
for (String deadKey : deadKeys) {
try {
deleteDatum(deadKey);
} catch (Exception e) {
Loggers.RAFT.error("[NACOS-RAFT] failed to remove entry, key={} {}", deadKey, e);
}
}
}
return local;
}
}
Summary
Nacos made several modifications when designing its own Raft variant:
Modification 1:
The leader's term never expires: every heartbeat it sends resets the term timer, so the leader only steps down if it crashes. This avoids frequent node-to-node communication, and the heartbeat also resets the other nodes to follower, preventing a prolonged dual-leader situation.
Modification 2:
The election does not use a two-phase election model, which simplifies the protocol; short-lived network partitions are handled by adding 100 to the term on each data change.
Characteristics:
1. The term changes in two places: leader election adds 1; a data update adds 100.
2. Only the leader may send heartbeats.
3. For data synchronization, the incoming term must be greater than or equal to the local term for an update to happen.
4. In an election, the initiator's term must be greater than the local term.
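These term rules can be made concrete with a small sketch. The names are illustrative; the `reconcile` method mirrors the `local.term.get() + 100 > remote.term.get()` branch seen in receivedBeat above:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the Nacos term arithmetic: +1 on leader election, +100 on a
// data update, plus the follower-side reconciliation from receivedBeat.
public class TermRules {
    // A leader election increments the term by 1.
    static long onElection(long term) {
        return term + 1;
    }

    // A data update increments the term by 100, so a leader that keeps
    // serving writes quickly outranks a stale leader stranded by a brief
    // partition, whose term only grew by election increments of 1.
    static long onDataUpdate(long term) {
        return term + 100;
    }

    // Follower-side reconciliation: if the local term is within 100 of the
    // remote (leader) term, adopt the leader's term outright; otherwise
    // catch up in steps of 100.
    static long reconcile(AtomicLong localTerm, long remoteTerm) {
        if (localTerm.get() + 100 > remoteTerm) {
            localTerm.set(remoteTerm);
        } else {
            localTerm.addAndGet(100);
        }
        return localTerm.get();
    }
}
```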
As for the dual-leader issue: according to the maintainers, the upcoming 1.4 release will avoid it.