ETCD Raft

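A problem every distributed system must solve: data consistency.

The Raft consensus algorithm

Raft uses the classic leader-follower model: only the leader handles requests. If a follower receives a request, it forwards the request to the leader, which processes it.

Of course, the algorithm also has to prove its correctness properties.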

Raft requires only two types of RPC

RequestVote RPCs initiated by candidates during elections
AppendEntries RPCs are initiated by leaders to replicate log entries and to provide a form of heartbeat

And add a third RPC for transferring snapshots between servers

Leaders send periodic heartbeats (AppendEntries RPCs that carry no log entries) to all followers in order to maintain their authority

If a follower receives no communication over a period of time called the election timeout, then it assumes there is no viable leader and begins an election to choose a new leader.

Terminology

Term: the logical clock in Raft

Node Status

Leader
Candidate
Follower

RPC Messages

Request Vote message
Append Entries message
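
As a rough illustration of the two message types, the sketch below models them as Python dataclasses. The field names follow the Raft paper's description of the RPC arguments, written in snake_case; the class names, the is_heartbeat helper, and the string command type are assumptions made for this example and are not taken from etcd.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LogEntry:
    term: int      # term in which the leader created the entry
    command: str   # opaque state-machine command (assumed to be a string here)


@dataclass
class RequestVote:
    term: int            # candidate's current term
    candidate_id: str    # who is asking for the vote
    last_log_index: int  # index of the candidate's last log entry
    last_log_term: int   # term of the candidate's last log entry


@dataclass
class AppendEntries:
    term: int            # leader's current term
    leader_id: str
    prev_log_index: int  # index of the entry immediately preceding the new ones
    prev_log_term: int   # term of that preceding entry
    leader_commit: int   # leader's commit index
    entries: List[LogEntry] = field(default_factory=list)

    def is_heartbeat(self) -> bool:
        # A heartbeat is simply an AppendEntries message that carries no entries.
        return not self.entries
```

A heartbeat and a replication message are then the same type, differing only in whether entries is empty.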

Leader Election

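Initially all nodes are followers. If a follower receives no heartbeat within its election timeout (a randomized interval), it becomes a candidate, votes for itself, and sends RequestVote messages to the other nodes. Once it has received votes from a majority of the nodes, it becomes the leader.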

If two nodes become candidates at the same time, the vote may split. When that election round times out, a new election starts; because election timeouts are randomized, a leader is usually elected quickly.
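
The election rules above can be sketched in a few lines of Python. This is a minimal illustration rather than etcd's implementation: the function names are hypothetical, and the 150 to 300 ms timeout range is only an assumed example value.

```python
import random


def election_timeout(rng, low_ms=150, high_ms=300):
    """Pick a randomized election timeout (the range is an assumed example)."""
    return rng.uniform(low_ms, high_ms)


def majority(cluster_size):
    """Smallest number of nodes that forms a majority of the cluster."""
    return cluster_size // 2 + 1


def wins_election(votes_granted, cluster_size):
    """votes_granted includes the candidate's vote for itself."""
    return votes_granted >= majority(cluster_size)
```

In a 4-node cluster where two candidates each collect 2 votes, neither reaches the majority of 3, which is the split-vote case; the randomized timeout makes it unlikely that both retry at the same moment.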

Log replication

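Once a node becomes leader, all writes are handled by the leader; requests received by followers are forwarded to the leader.

The leader uses empty AppendEntries messages as heartbeats to its followers.

After receiving a write request from a client, the leader notifies the followers with AppendEntries messages. Once a majority has acknowledged, the leader tells the client that the write succeeded; otherwise the entry stays uncommitted.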

Every change is written by the leader as an entry in the node's log, and only committed log entries change the node's value. The leader replicates the entry to every follower and waits until a majority of nodes have written it; it then commits the entry and notifies the followers.

The change is first appended to the leader's log and is then sent to the followers on the next heartbeat.
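
The majority-acknowledgement step might look like the following sketch. It simulates replication in memory under simplifying assumptions: logs are plain Python lists, the reachable set stands in for followers that acknowledge the message, and the function name replicate is made up for this example.

```python
def replicate(entry, leader_log, follower_logs, reachable):
    """Append entry on the leader, send it to reachable followers,
    and report whether a majority of the cluster now stores it."""
    leader_log.append(entry)
    acks = 1  # the leader counts itself
    for name, log in follower_logs.items():
        if name in reachable:
            log.append(entry)
            acks += 1
    cluster_size = len(follower_logs) + 1
    return acks >= cluster_size // 2 + 1
```

With four followers, two acknowledgements plus the leader make 3 of 5, so the entry can be committed. With only one reachable follower the entry remains in the leader's log but stays uncommitted.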

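When a leader comes to power, it initializes nextIndex for each follower to the index just after the last entry in its own log, and uses it when sending AppendEntries. The follower checks whether the entry preceding the new ones matches its own log and rejects the request if it does not. On rejection the leader decrements nextIndex and retries until it finds the point where the two logs agree. The follower then deletes its conflicting entries after that point and resynchronizes from the leader, so that the follower's log ends up matching the leader's.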

If a follower’s log is
inconsistent with the leader’s, the AppendEntries consistency
check will fail in the next AppendEntries RPC. After
a rejection, the leader decrements nextIndex and retries
the AppendEntries RPC. Eventually nextIndex will reach
a point where the leader and follower logs match.
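
The nextIndex back-off can be sketched as below. To keep it short, each log is represented only by the list of its entries' terms (Raft index i is list position i - 1), and both sides of the exchange run inside one function; the name sync_follower is hypothetical. Truncating everything after the matching point is a simplification that gives the same result here because the leader sends its whole remaining suffix.

```python
def sync_follower(leader_log, follower_log):
    """Bring follower_log in line with leader_log.
    Returns the number of AppendEntries attempts that were needed."""
    next_index = len(leader_log) + 1  # index just after the leader's last entry
    attempts = 0
    while True:
        attempts += 1
        prev_index = next_index - 1
        # Consistency check: the follower must hold an entry at prev_index
        # with the same term as the leader's entry at prev_index.
        matches = prev_index == 0 or (
            prev_index <= len(follower_log)
            and follower_log[prev_index - 1] == leader_log[prev_index - 1]
        )
        if matches:
            # Drop the follower's entries after the match, then copy the leader's suffix.
            del follower_log[prev_index:]
            follower_log.extend(leader_log[prev_index:])
            return attempts
        next_index -= 1  # rejected: back off by one entry and retry
```

For example, with leader terms [1, 1, 2, 3, 3] and follower terms [1, 1, 2, 2], the first two attempts are rejected, and the third matches at index 3, after which the follower's stale entry is replaced.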

In Raft, the leader handles inconsistencies by forcing
the followers’ logs to duplicate its own. This means that
conflicting entries in follower logs will be overwritten
with entries from the leader’s log.

A leader never overwrites or
deletes entries in its own log (the Leader Append-Only
Property in Figure 3).

Split network

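The split-brain problem: when a network partition occurs in the cluster, different partitions may end up with different leaders once election timeouts fire. However, as long as the cluster originally has an odd number of members, a split into two partitions always leaves one side in the minority. A client write sent to the leader in the minority partition cannot succeed, because it can never reach a majority; writes sent to the majority partition succeed. When the partition heals, the leader of the partition that accepted writes has the higher term, so the nodes in the minority partition accept the new leader's messages and discard their uncommitted entries, which restores log consistency across the cluster.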
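
The arithmetic behind this is small enough to write down. The helper below is a hypothetical illustration: a partition can commit writes only if it contains a majority of the original cluster.

```python
def partition_can_commit(partition_size, cluster_size):
    """True if this partition holds a majority of the original cluster."""
    return partition_size >= cluster_size // 2 + 1


def writable_partitions(partition_sizes):
    """Return the sizes of partitions that can still commit writes."""
    cluster_size = sum(partition_sizes)
    return [size for size in partition_sizes
            if partition_can_commit(size, cluster_size)]
```

For a 5-node cluster split 3/2, only the 3-node side can commit. At most one partition can ever hold a majority, which is why two leaders cannot both commit conflicting writes.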

Two key timeouts

The election timeout is the amount of time a follower waits until becoming a candidate.
Once a node becomes leader, it sends empty AppendEntries messages to its followers at intervals specified by the heartbeat timeout. Followers respond to each AppendEntries message. The election term continues until a follower stops receiving heartbeats and becomes a candidate.
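
How the two timeouts interact on the follower side can be sketched with a simple timer object. The class name FollowerTimer and the use of integer millisecond timestamps are assumptions for this example; a real implementation would read the time from a clock or tick counter.

```python
class FollowerTimer:
    """Tracks when a follower should give up on the leader and start an election."""

    def __init__(self, election_timeout_ms, now_ms=0):
        self.election_timeout_ms = election_timeout_ms
        self.deadline_ms = now_ms + election_timeout_ms

    def on_heartbeat(self, now_ms):
        # Every heartbeat from the leader pushes the deadline forward.
        self.deadline_ms = now_ms + self.election_timeout_ms

    def should_start_election(self, now_ms):
        # No heartbeat arrived for a full election timeout.
        return now_ms >= self.deadline_ms
```

As long as the heartbeat interval is well below the election timeout, the deadline keeps moving forward and the follower never becomes a candidate.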

If followers crash or run slowly,
or if network packets are lost, the leader retries AppendEntries
RPCs indefinitely (even after it has responded to
the client) until all followers eventually store all log entries

A log entry is committed once the leader
that created the entry has replicated it on a majority of
the servers (e.g., entry 7 in Figure 6)

Once a follower learns
that a log entry is committed, it applies the entry to its
local state machine (in log order)
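
Applying committed entries in log order can be sketched as follows. The key/value state machine, the tuple-shaped commands, and the function name apply_committed are assumptions for illustration; commit_index and last_applied mirror the paper's commitIndex and lastApplied.

```python
def apply_committed(log, commit_index, last_applied, state):
    """Apply entries (last_applied, commit_index] to state, in log order.
    log[i - 1] holds the (key, value) command stored at Raft index i.
    Returns the new last_applied value."""
    while last_applied < commit_index:
        last_applied += 1
        key, value = log[last_applied - 1]
        state[key] = value
    return last_applied
```

Entries beyond commit_index stay in the log untouched until the node learns that they are committed.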

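A follower that stops receiving heartbeats likewise switches its own state to candidate and starts an election. Each election increments the term, so a new leader's term is always greater than the previous leader's (by exactly 1 when the first election attempt succeeds).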

Log consistency guarantees
The first property follows from the fact that a leader creates at most one entry with a given log index in a given
term, and log entries never change their position in the
log. The second property is guaranteed by a simple consistency
check performed by AppendEntries. When sending
an AppendEntries RPC, the leader includes the index
and term of the entry in its log that immediately precedes
the new entries. If the follower does not find an entry in
its log with the same index and term, then it refuses the
new entries. The consistency check acts as an induction
step: the initial empty state of the logs satisfies the Log
Matching Property, and the consistency check preserves
the Log Matching Property whenever logs are extended.
As a result, whenever AppendEntries returns successfully,
the leader knows that the follower’s log is identical to its
own log up through the new entries
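
The follower side of this consistency check might look like the sketch below. It is a simplified illustration: entries are (term, command) tuples, Raft index i lives at list position i - 1, and the check of the leader's term and the commit index update are left out. The function name handle_append_entries is hypothetical.

```python
def handle_append_entries(log, prev_log_index, prev_log_term, entries):
    """Follower-side handling of AppendEntries. Returns True on success."""
    # Consistency check: refuse unless we hold the preceding entry
    # with the same index and term.
    if prev_log_index > 0:
        if prev_log_index > len(log) or log[prev_log_index - 1][0] != prev_log_term:
            return False
    for offset, entry in enumerate(entries):
        index = prev_log_index + 1 + offset
        if index <= len(log):
            if log[index - 1][0] != entry[0]:
                # Conflict: delete this entry and everything after it.
                del log[index - 1:]
                log.append(entry)
            # Same index and term: the entry is already present.
        else:
            log.append(entry)
    return True
```

A successful return means the follower's log matches the leader's up through the last entry that was sent.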

This means that log entries only flow in one direction,
from leaders to followers, and leaders never overwrite
existing entries in their logs.

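During voting, the candidate's log is compared with each voter's log: the candidate's log must be at least as up-to-date as the voter's log to receive the vote. Otherwise the vote is denied, and a candidate that cannot gather a majority of votes cannot become leader.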

A candidate must contact a majority of the cluster in order to be elected, which means that every committed
entry must be present in at least one of those servers. If the
candidate’s log is at least as up-to-date as any other log
in that majority (where “up-to-date” is defined precisely
below), then it will hold all the committed entries. The
RequestVote RPC implements this restriction: the RPC
includes information about the candidate’s log, and the
voter denies its vote if its own log is more up-to-date than
that of the candidate.

How log freshness is compared

Raft determines which of two logs is more up-to-date
by comparing the index and term of the last entries in the
logs. If the logs have last entries with different terms, then
the log with the later term is more up-to-date. If the logs
end with the same term, then whichever log is longer is
more up-to-date.
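
The comparison rule translates directly into code. The sketch below is one way a voter could decide whether a candidate's log qualifies; the function and parameter names are made up for this example.

```python
def candidate_is_up_to_date(cand_last_term, cand_last_index,
                            voter_last_term, voter_last_index):
    """True if the candidate's log is at least as up-to-date as the voter's."""
    if cand_last_term != voter_last_term:
        # Different last terms: the later term wins, regardless of length.
        return cand_last_term > voter_last_term
    # Same last term: the longer (or equally long) log wins.
    return cand_last_index >= voter_last_index
```

Note that a shorter log with a later last term beats a longer log with an earlier last term.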

I do not fully understand this part yet; it explains how Raft handles committing log entries from previous terms.

Raft never commits log entries from previous terms by counting
replicas. Only log entries from the leader’s current
term are committed by counting replicas; once an entry
from the current term has been committed in this way,
then all prior entries are committed indirectly because
of the Log Matching Property
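
One way to read this rule is as a condition on how the leader advances its commit index. The sketch below is an illustration under assumptions: log_terms holds the term of each entry in the leader's log (Raft index i at position i - 1), match_index maps each follower to the highest index known to be replicated on it, and the function name is hypothetical.

```python
def advance_commit_index(log_terms, match_index, commit_index, current_term):
    """Return the leader's new commit index."""
    cluster_size = len(match_index) + 1  # followers plus the leader itself
    for n in range(len(log_terms), commit_index, -1):
        replicas = 1 + sum(1 for m in match_index.values() if m >= n)
        on_majority = replicas >= cluster_size // 2 + 1
        # Only an entry from the leader's current term may be committed by counting.
        if on_majority and log_terms[n - 1] == current_term:
            return n  # all earlier entries become committed indirectly
    return commit_index
```

Consider a leader in term 4 whose entry at index 2 is from term 2 and sits on 3 of 5 servers: counting alone does not commit it. Once the leader replicates an entry from term 4 at index 3 to a majority, the commit index jumps to 3 and the term 2 entry is committed along with it.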

Follower crashes

Raft handles these failures by retrying indefinitely;
if the crashed server restarts, then the RPC will complete
successfully.

The leader uses a new RPC called InstallSnapshot to
send snapshots to followers that are too far behind; see
Figure 13.

To be cont.

http://thesecretlivesofdata.com/raft/
https://raft.github.io/