Let's continue with how a follower handles a message of type MsgApp (MsgApp is short for MsgAppend, i.e. the append-entries message):
func stepFollower(r *raft, m pb.Message) error {
switch m.Type {
case pb.MsgProp:
if r.lead == None {
r.logger.Infof("%x no leader at term %d; dropping proposal", r.id, r.Term)
return ErrProposalDropped
} else if r.disableProposalForwarding {
r.logger.Infof("%x not forwarding to leader %x at term %d; dropping proposal", r.id, r.lead, r.Term)
return ErrProposalDropped
}
m.To = r.lead
r.send(m)
case pb.MsgApp:
r.electionElapsed = 0
r.lead = m.From
r.handleAppendEntries(m)
case pb.MsgHeartbeat:
r.electionElapsed = 0
r.lead = m.From
r.handleHeartbeat(m)
case pb.MsgSnap:
r.electionElapsed = 0
r.lead = m.From
r.handleSnapshot(m)
case pb.MsgTransferLeader:
if r.lead == None {
r.logger.Infof("%x no leader at term %d; dropping leader transfer msg", r.id, r.Term)
return nil
}
m.To = r.lead
r.send(m)
case pb.MsgTimeoutNow:
r.logger.Infof("%x [term %d] received MsgTimeoutNow from %x and starts an election to get leadership.", r.id, r.Term, m.From)
// Leadership transfers never use pre-vote even if r.preVote is true; we
// know we are not recovering from a partition so there is no need for the
// extra round trip.
r.hup(campaignTransfer)
case pb.MsgReadIndex:
if r.lead == None {
r.logger.Infof("%x no leader at term %d; dropping index reading msg", r.id, r.Term)
return nil
}
m.To = r.lead
r.send(m)
case pb.MsgReadIndexResp:
if len(m.Entries) != 1 {
r.logger.Errorf("%x invalid format of MsgReadIndexResp from %x, entries count: %d", r.id, m.From, len(m.Entries))
return nil
}
r.readStates = append(r.readStates, ReadState{Index: m.Index, RequestCtx: m.Entries[0].Data})
}
return nil
}
As described earlier, raftNode drives the state machine through the step function, and on a follower that dispatches to func stepFollower(r *raft, m pb.Message). For a MsgApp the follower first resets electionElapsed to 0 — receiving an append from the leader counts as a heartbeat, so the follower will not start a new election when its election timer would otherwise fire — and records the sender as its current leader. It then calls r.handleAppendEntries(m), which in the normal case calls raftLog.maybeAppend:
func (l *raftLog) maybeAppend(index, logTerm, committed uint64, ents ...pb.Entry) (lastnewi uint64, ok bool) {
if l.matchTerm(index, logTerm) {
lastnewi = index + uint64(len(ents))
ci := l.findConflict(ents)
switch {
case ci == 0:
case ci <= l.committed:
l.logger.Panicf("entry %d conflict with committed entry [committed(%d)]", ci, l.committed)
default:
offset := index + 1
if ci-offset > uint64(len(ents)) {
l.logger.Panicf("index, %d, is out of range [%d]", ci-offset, len(ents))
}
l.append(ents[ci-offset:]...)
}
l.commitTo(min(committed, lastnewi))
return lastnewi, true
}
return 0, false
}
The follower first uses matchTerm(m.Index, m.LogTerm) to check that its own log already contains the entry immediately preceding the new ones, i.e. an entry at index m.Index whose term equals m.LogTerm. The underlying term lookup only answers for indexes inside [firstIndex-1, lastIndex], so entries can only be appended right after what the log already has: a Raft log never contains holes. The range check is in raftLog.term:
func (l *raftLog) term(i uint64) (uint64, error) {
// the valid term range is [index of dummy entry, last index]
dummyIndex := l.firstIndex() - 1
if i < dummyIndex || i > l.lastIndex() {
// TODO: return an error instead?
return 0, nil
}
...
}
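To make the no-holes rule concrete, here is a toy, self-contained version of the check — illustration only, not etcd code (the real raftLog also consults its unstable portion and the Storage interface):

package main

import "fmt"

type entry struct{ Index, Term uint64 }

// toyLog keeps a dummy entry in slot 0, mirroring raftLog's dummy entry at firstIndex-1.
type toyLog struct{ ents []entry }

func (l *toyLog) firstIndex() uint64 { return l.ents[0].Index + 1 }
func (l *toyLog) lastIndex() uint64  { return l.ents[len(l.ents)-1].Index }

// term only answers for indexes in [firstIndex-1, lastIndex]: the log is one
// contiguous range, so there is never a hole to ask about.
func (l *toyLog) term(i uint64) (uint64, bool) {
	if i < l.firstIndex()-1 || i > l.lastIndex() {
		return 0, false
	}
	return l.ents[i-l.ents[0].Index].Term, true
}

// matchTerm is the check maybeAppend applies to (m.Index, m.LogTerm).
func (l *toyLog) matchTerm(i, t uint64) bool {
	term, ok := l.term(i)
	return ok && term == t
}

func main() {
	l := &toyLog{ents: []entry{{0, 0}, {1, 1}, {2, 1}, {3, 2}}}
	fmt.Println(l.matchTerm(3, 2)) // true: we have entry 3 at term 2
	fmt.Println(l.matchTerm(3, 1)) // false: same index, different term
	fmt.Println(l.matchTerm(5, 2)) // false: index 5 is beyond the log, so accepting it would leave a hole
}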
Next, findConflict compares the entries sent by the leader with what the follower already has in its raftLog, looking for the first position where they disagree: the same index but a different term. This typically happens when an old leader crashed after replicating some entries to only part of the cluster and a new election took place; those conflicting entries have to be truncated. The truncation happens in the default case: everything from the first conflicting index onward is overwritten with the leader's entries (a conflict inside the already committed prefix would be a fatal bug, hence the panic). Finally, the follower advances its committed index to min(committed, lastnewi), i.e. the smaller of the commit index sent by the leader and the last index it has just appended.
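Here is a toy, self-contained walk-through of the conflict handling with concrete numbers — same rules, but not etcd code:

package main

import "fmt"

type entry struct{ Index, Term uint64 }

// findConflict returns the index of the first incoming entry that the
// follower either does not have yet or has with a different term.
// 0 means every incoming entry already matches.
func findConflict(have, incoming []entry) uint64 {
	for _, ne := range incoming {
		i := int(ne.Index) - 1 // entries start at index 1 in slot 0
		if i >= len(have) || have[i].Term != ne.Term {
			return ne.Index
		}
	}
	return 0
}

func minU64(a, b uint64) uint64 {
	if a < b {
		return a
	}
	return b
}

func main() {
	// Follower log: entries 1..3; entry 3 was written by an old leader at term 2.
	have := []entry{{1, 1}, {2, 1}, {3, 2}}
	// New leader (term 3) sends MsgApp: Index(prev)=2, LogTerm=1,
	// Entries=[{3,3},{4,3}], Commit=3.
	prevIndex, leaderCommit := uint64(2), uint64(3)
	incoming := []entry{{3, 3}, {4, 3}}

	ci := findConflict(have, incoming)                       // 3: same index, different term
	have = append(have[:ci-1], incoming[ci-prevIndex-1:]...) // truncate from 3 and overwrite
	lastNew := prevIndex + uint64(len(incoming))             // 4
	committed := minU64(leaderCommit, lastNew)               // 3

	fmt.Println(ci, lastNew, committed) // 3 4 3
	fmt.Println(have)                   // [{1 1} {2 1} {3 3} {4 3}]
}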
maybeAppend has two return values: the first is the follower's last log index after the append, and the second reports whether the append succeeded. In the success case, handleAppendEntries builds a message of type MsgAppResp whose Index is that new last index and sends it back to the leader. Returning false usually means the follower's log has diverged from what the leader assumed (for example after a network partition): the follower has no entry at m.Index with term m.LogTerm. The follower then rejects the append, but instead of forcing the leader to probe backwards one entry at a time, it uses the index and term the leader sent to compute a hint that is as close as possible to the real divergence point, so the leader needs far fewer probe round trips. For example, if the term the leader probed with is larger than anything in the follower's log, the follower hints back an index whose term is no larger than the leader's probe term. The detailed case analysis is in the comment in stepLeader where a MsgAppResp with Reject set is handled.
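For the reject path, recent etcd/raft versions build the response roughly as follows — a simplified sketch of the rejection branch of handleAppendEntries, with error handling and the explanatory comments omitted:

// the follower has no entry at (m.Index, m.LogTerm): reject, but return a
// hint (index + term) close to the real divergence point so the leader can
// skip most of the probing.
hintIndex := min(m.Index, r.raftLog.lastIndex())
hintIndex = r.raftLog.findConflictByTerm(hintIndex, m.LogTerm)
hintTerm, _ := r.raftLog.term(hintIndex) // the real code checks this error
r.send(pb.Message{
	To:         m.From,
	Type:       pb.MsgAppResp,
	Index:      m.Index,
	Reject:     true,
	RejectHint: hintIndex,
	LogTerm:    hintTerm,
})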
Both MsgAppResp and MsgApp are sent through r.send, which carries the following comment:
// send schedules persisting state to a stable storage and AFTER that
// sending the message (as part of next Ready message processing).
func (r *raft) send(m pb.Message) {
if m.From == None {
m.From = r.id
}
if m.Type == pb.MsgVote || m.Type == pb.MsgVoteResp || m.Type == pb.MsgPreVote || m.Type == pb.MsgPreVoteResp {
if m.Term == 0 {
panic(fmt.Sprintf("term should be set when sending %s", m.Type))
}
} else {
if m.Term != 0 {
panic(fmt.Sprintf("term should not be set when sending %s (was %d)", m.Type, m.Term))
}
if m.Type != pb.MsgProp && m.Type != pb.MsgReadIndex {
m.Term = r.Term
}
}
r.msgs = append(r.msgs, m)
}
So appending to r.msgs merely queues the message; persisting state and actually sending it must be triggered elsewhere. Searching for r.msgs leads to RawNode's HasReady function:
// HasReady called when RawNode user need to check if any Ready pending.
// Checking logic in this method should be consistent with Ready.containsUpdates().
func (rn *RawNode) HasReady() bool {
r := rn.raft
if !r.softState().equal(rn.prevSoftSt) {
return true
}
if hardSt := r.hardState(); !IsEmptyHardState(hardSt) && !isHardStateEqual(hardSt, rn.prevHardSt) {
return true
}
if r.raftLog.hasPendingSnapshot() {
return true
}
if len(r.msgs) > 0 || len(r.raftLog.unstableEntries()) > 0 || r.raftLog.hasNextEnts() {
return true
}
if len(r.readStates) != 0 {
return true
}
return false
}
Following the call chain upward, HasReady is called from the node.run goroutine; when it returns true, run assembles a Ready struct rd and, in the subsequent select statement, pushes it into the n.readyc channel.
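For reference, the Ready hand-off in node.run looks roughly like this — a heavily simplified sketch of raft/node.go with most channels and cases left out:

for {
	if advancec != nil {
		readyc = nil
	} else if n.rn.HasReady() {
		// build a Ready, but do not mark it accepted until someone takes it
		rd = n.rn.readyWithoutAccept()
		readyc = n.readyc
	}

	select {
	case readyc <- rd:
		n.rn.acceptReady(rd)
		advancec = n.advancec
	case <-advancec:
		n.rn.Advance(rd)
		rd = Ready{}
		advancec = nil
	case <-n.stop:
		close(n.done)
		return
	}
}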
On the other side, the start function of raftNode keeps watching that readyc channel through n.Ready():
func (n *node) Ready() <-chan Ready { return n.readyc }
Inside raftNode's start function (server/etcdserver/raft.go), once a Ready arrives on readyc it is processed as rd, and one of the first decisions is whether the entries carried in this Ready must be saved to the WAL before anything else. In recent etcd versions that decision is the return value of a small helper (shouldWaitWALSync), whose core is:
return lastCommittedEntry.Term > firstUnstableEntry.Term ||
(lastCommittedEntry.Term == firstUnstableEntry.Term && lastCommittedEntry.Index >= firstUnstableEntry.Index)
That is, if the newest committed entry is at or beyond the first unstable entry — the entries about to be applied overlap entries that have not been persisted yet — the WAL is written and synced first. Otherwise the WAL write is deferred, and if this node is the leader it hands the outgoing messages to the transport first, replicating to the followers in parallel with its own disk write:
// the leader can write to its disk in parallel with replicating to the followers and them
// writing to their disks.
// For more details, check raft thesis 10.2.1
if islead {
// gofail: var raftBeforeLeaderSend struct{}
r.transport.Send(r.processMessages(rd.Messages))
}
The transport in turn calls the Peer's send function; as its comment says, send is non-blocking and makes no promise that the remote peer actually receives the message:
type Peer interface {
// send sends the message to the remote peer. The function is non-blocking
// and has no promise that the message will be received by the remote.
// When it fails to send message out, it will report the status to underlying
// raft.
	send(m raftpb.Message)
	// ... (other methods elided)
}
After that, the entries in this Ready are saved to the WAL:
if !waitWALSync {
// gofail: var raftBeforeSave struct{}
if err := r.storage.Save(rd.HardState, rd.Entries); err != nil {
r.lg.Fatal("failed to save Raft hard state and entries", zap.Error(err))
}
}
Inside Save there is a check that decides whether the write must be synchronously flushed to disk before returning:
// MustSync returns true if the hard state and count of Raft entries indicate
// that a synchronous write to persistent storage is required.
func MustSync(st, prevst pb.HardState, entsnum int) bool {
// Persistent state on all servers:
// (Updated on stable storage before responding to RPCs)
// currentTerm
// votedFor
// log entries[]
return entsnum != 0 || st.Vote != prevst.Vote || st.Term != prevst.Term
}
So whenever the number of entries is non-zero, or the Term or Vote in the HardState has changed, the WAL write is followed by a synchronous flush; a change of the Commit field alone does not force one.
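To illustrate which combinations force a sync, here is a small standalone example calling the exported raft.MustSync (the import path below is the standalone raft module; etcd 3.5 vendors it as go.etcd.io/etcd/raft/v3):

package main

import (
	"fmt"

	"go.etcd.io/raft/v3"
	pb "go.etcd.io/raft/v3/raftpb"
)

func main() {
	prev := pb.HardState{Term: 5, Vote: 2, Commit: 10}

	// nothing changed and no new entries: no fsync required
	fmt.Println(raft.MustSync(prev, prev, 0)) // false

	// only the commit index advanced: still no fsync required
	cur := prev
	cur.Commit = 12
	fmt.Println(raft.MustSync(cur, prev, 0)) // false

	// new entries, or a term/vote change, always require fsync
	fmt.Println(raft.MustSync(prev, prev, 3))                           // true
	fmt.Println(raft.MustSync(pb.HardState{Term: 6, Vote: 2}, prev, 0)) // true
}

WAL.Save is where this check is consulted: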
func (w *WAL) Save(st raftpb.HardState, ents []raftpb.Entry) error {
w.mu.Lock()
defer w.mu.Unlock()
// short cut, do not call sync
if raft.IsEmptyHardState(st) && len(ents) == 0 {
return nil
}
mustSync := raft.MustSync(st, w.state, len(ents))
// TODO(xiangli): no more reference operator
for i := range ents {
if err := w.saveEntry(&ents[i]); err != nil {
return err
}
}
if err := w.saveState(&st); err != nil {
return err
}
curOff, err := w.tail().Seek(0, io.SeekCurrent)
if err != nil {
return err
}
if curOff < SegmentSizeBytes {
if mustSync {
return w.sync()
}
return nil
}
return w.cut()
}
Save first writes the entries to the WAL one by one and then writes the HardState; note that the HardState carries the Term, Vote and Commit fields, and Term and Vote are exactly the state the Raft paper requires to be on stable storage before responding to RPCs.
Finally, Save decides whether to call the system's fdatasync and wait for the I/O to complete (unless the current segment file has grown past SegmentSizeBytes, in which case cut() rolls over to a new segment). In the common case, whenever entries were appended, fdatasync is called, and sync() measures how long it takes:
if took > warnSyncDuration {
w.lg.Warn(
"slow fdatasync",
zap.Duration("took", took),
zap.Duration("expected-duration", warnSyncDuration),
)
}
If a sync takes longer than warnSyncDuration (a constant set to one second), a warning is logged — and a sync that slow very likely pushes the handling time of the client request past one second as well.
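As a standalone illustration of the same pattern — timing an fdatasync on a file and warning when it is slow — here is a sketch using the fileutil helper from go.etcd.io/etcd/client/pkg/v3 (older releases ship it as pkg/fileutil); this is not the WAL's actual sync method:

package main

import (
	"fmt"
	"os"
	"time"

	"go.etcd.io/etcd/client/pkg/v3/fileutil"
)

func main() {
	f, err := os.CreateTemp("", "waldemo-*.wal")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())

	if _, err := f.WriteString("some wal record bytes"); err != nil {
		panic(err)
	}

	const warnSyncDuration = time.Second // the threshold etcd uses
	start := time.Now()
	// Fdatasync falls back to a plain fsync on platforms without fdatasync.
	if err := fileutil.Fdatasync(f); err != nil {
		panic(err)
	}
	if took := time.Since(start); took > warnSyncDuration {
		fmt.Printf("slow fdatasync: took %v, expected < %v\n", took, warnSyncDuration)
	}
}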
As an aside, let's also look at the storage type's SaveSnap function:
// SaveSnap saves the snapshot file to disk and writes the WAL snapshot entry.
func (st *storage) SaveSnap(snap raftpb.Snapshot) error {
st.mux.RLock()
defer st.mux.RUnlock()
walsnap := walpb.Snapshot{
Index: snap.Metadata.Index,
Term: snap.Metadata.Term,
ConfState: &snap.Metadata.ConfState,
}
// save the snapshot file before writing the snapshot to the wal.
// This makes it possible for the snapshot file to become orphaned, but prevents
// a WAL snapshot entry from having no corresponding snapshot file.
err := st.s.SaveSnap(snap)
if err != nil {
return err
}
// gofail: var raftBeforeWALSaveSnaphot struct{}
return st.w.SaveSnapshot(walsnap)
}
storage first calls s.SaveSnap to write the snapshot contents to disk as a standalone file, and only then records a description of that snapshot (index, term, conf state) in the WAL. During crash recovery the order is reversed: the WAL record tells us which snapshot file should exist, and that file is then read to restore state. This ordering can leave an orphaned snapshot file, but it guarantees the WAL never references a snapshot file that does not exist.
(I'll leave snapshots as a topic for a later post.)
Back to the follower: after maybeAppend returns, it has the latest index of its log, from which it builds the response message for the leader:
if mlastIndex, ok := r.raftLog.maybeAppend(m.Index, m.LogTerm, m.Commit, m.Entries...); ok {
	r.send(pb.Message{To: m.From, Type: pb.MsgAppResp, Index: mlastIndex})
}
We have already seen that once send is called, the goroutine started by the upper-layer raftNode (start) persists the entries and then hands the message to the transport, which delivers it to the leader. Now back to stepLeader, to the branch that handles a successful MsgAppResp:
oldPaused := pr.IsPaused()
if pr.MaybeUpdate(m.Index) {
switch {
case pr.State == tracker.StateProbe:
pr.BecomeReplicate()
case pr.State == tracker.StateSnapshot && pr.Match >= pr.PendingSnapshot:
// TODO(tbg): we should also enter this branch if a snapshot is
// received that is below pr.PendingSnapshot but which makes it
// possible to use the log again.
r.logger.Debugf("%x recovered from needing snapshot, resumed sending replication messages to %x [%s]", r.id, m.From, pr)
// Transition back to replicating state via probing state
// (which takes the snapshot into account). If we didn't
// move to replicating state, that would only happen with
// the next round of appends (but there may not be a next
// round for a while, exposing an inconsistent RaftStatus).
pr.BecomeProbe()
pr.BecomeReplicate()
case pr.State == tracker.StateReplicate:
pr.Inflights.FreeLE(m.Index)
}
if r.maybeCommit() {
// committed index has progressed for the term, so it is safe
// to respond to pending read index requests
releasePendingReadIndexMessages(r)
r.bcastAppend()
} else if oldPaused {
// If we were paused before, this node may be missing the
// latest commit index, so send it.
r.sendAppend(m.From)
}
// We've updated flow control information above, which may
// allow us to send multiple (size-limited) in-flight messages
// at once (such as when transitioning from probe to
// replicate, or when freeTo() covers multiple messages). If
// we have more entries to send, send as many messages as we
// can (without sending empty messages for the commit index)
for r.maybeSendAppend(m.From, false) {
}
// Transfer leadership is in progress.
if m.From == r.leadTransferee && pr.Match == r.raftLog.lastIndex() {
r.logger.Infof("%x sent MsgTimeoutNow to %x after received MsgAppResp", r.id, m.From)
r.sendTimeoutNow(m.From)
}
}
The leader first calls pr.MaybeUpdate(m.Index), which advances this follower's Match (and Next) based on the acknowledged index. If Match actually moved forward we enter the if block; treating the follower as being in StateReplicate for now, Inflights.FreeLE then releases every in-flight message with index less than or equal to m.Index, freeing up the flow-control window. A rough sketch of MaybeUpdate is shown below.
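A rough sketch of tracker.Progress.MaybeUpdate — simplified, the real implementation also clears probe/pause state when Match advances:

// MaybeUpdate records that the follower has acknowledged index n.
// It returns true only if Match actually moved forward.
func (pr *Progress) MaybeUpdate(n uint64) bool {
	updated := false
	if pr.Match < n {
		pr.Match = n
		updated = true
	}
	if pr.Next < n+1 {
		pr.Next = n + 1
	}
	return updated
}

With Match (possibly) advanced, stepLeader then calls raft's maybeCommit: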
// maybeCommit attempts to advance the commit index. Returns true if
// the commit index changed (in which case the caller should call
// r.bcastAppend).
func (r *raft) maybeCommit() bool {
mci := r.prs.Committed()
return r.raftLog.maybeCommit(mci, r.Term)
}
Committed computes the highest log index known to be replicated on a quorum of the voters:
// Committed returns the largest log index known to be committed based on what
// the voting members of the group have acknowledged.
func (p *ProgressTracker) Committed() uint64 {
return uint64(p.Voters.CommittedIndex(matchAckIndexer(p.Progress)))
}
It converts p.Progress into the matchAckIndexer type, which implements the AckedIndex function:
// AckedIndex implements IndexLookuper.
func (l matchAckIndexer) AckedIndex(id uint64) (quorum.Index, bool) {
pr, ok := l[id]
if !ok {
return 0, false
}
return quorum.Index(pr.Match), true
}
That is, it returns the progress's Match value, which MaybeUpdate has just set to the index the follower acknowledged. The core of the calculation is MajorityConfig.CommittedIndex in the quorum package; the relevant part:
// Fill the slice with the indexes observed. Any unused slots will be
// left as zero; these correspond to voters that may report in, but
// haven't yet. We fill from the right (since the zeroes will end up on
// the left after sorting below anyway).
i := n - 1
for id := range c {
if idx, ok := l.AckedIndex(id); ok {
srt[i] = uint64(idx)
i--
}
}
}
// Sort by index. Use a bespoke algorithm (copied from the stdlib's sort
// package) to keep srt on the stack.
insertionSort(srt)
// The smallest index into the array for which the value is acked by a
// quorum. In other words, from the end of the slice, move n/2+1 to the
// left (accounting for zero-indexing).
pos := n - (n/2 + 1)
return Index(srt[pos])
CommittedIndex collects each voter's Match value into a slice, insertion-sorts it, and then takes the element at position n - (n/2 + 1). That element is the largest index acknowledged by at least a quorum (n/2 + 1) of the voters, and it becomes the candidate commit index; a concrete example is sketched below.
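A tiny standalone example of that rule for a five-voter group — the same arithmetic, not etcd code:

package main

import (
	"fmt"
	"sort"
)

func main() {
	// Match values the leader has recorded for five voters
	match := []uint64{12, 5, 0, 9, 7}
	n := len(match)

	srt := append([]uint64(nil), match...)
	sort.Slice(srt, func(i, j int) bool { return srt[i] < srt[j] })
	// srt = [0 5 7 9 12]

	pos := n - (n/2 + 1)  // 5 - 3 = 2
	fmt.Println(srt[pos]) // 7: indexes up to 7 are matched by 3 of the 5 voters
}

The quorum index computed this way is then handed to raftLog's maybeCommit: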
func (l *raftLog) maybeCommit(maxIndex, term uint64) bool {
if maxIndex > l.committed && l.zeroTermOnErrCompacted(l.term(maxIndex)) == term {
l.commitTo(maxIndex)
return true
}
return false
}
If maxIndex is greater than the log's current committed value and the entry at maxIndex was written in the current term (a leader may only commit entries from its own term), commitTo advances the committed index and the function returns true; otherwise it returns false. Assuming raftLog.maybeCommit returns true, raft.maybeCommit returns true as well, and then:
// committed index has progressed for the term, so it is safe
// to respond to pending read index requests
releasePendingReadIndexMessages(r)
r.bcastAppend()
the leader can release the pending read-index responses and broadcast the newly advanced commit index to the followers.
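For completeness, bcastAppend is roughly the following sketch: it walks the progress map and calls sendAppend for every peer except the leader itself, and the MsgApp that sendAppend builds carries the leader's current commit index.

// simplified sketch of raft.bcastAppend
func (r *raft) bcastAppend() {
	r.prs.Visit(func(id uint64, _ *tracker.Progress) {
		if id == r.id {
			return
		}
		r.sendAppend(id)
	})
}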
With that, this invocation of stepLeader is done, and the raft state machine returns to the node.run loop where HasReady is checked again. This time it returns true because the committed entries have advanced, i.e. r.raftLog.hasNextEnts() is true, so the Ready assembled in this round carries a non-empty set of committed entries. The upper-layer raftNode receives this Ready from the readyc channel and forwards it, wrapped in an apply struct, to the etcdserver object through the r.applyc channel:
select {
case r.applyc <- ap:
case <-r.stopped:
return
}
When etcdserver receives the ap struct, it schedules a job that runs applyAll over the committed entries:
select {
case ap := <-s.r.apply():
f := schedule.NewJob("server_applyAll", func(context.Context) { s.applyAll(&ep, &ap) })
sched.Schedule(f)
Following applyAll down the call chain, we eventually arrive at:
// applyEntryNormal applies an EntryNormal type raftpb request to the EtcdServer
func (s *EtcdServer) applyEntryNormal(e *raftpb.Entry) {
...
needResult := s.w.IsRegistered(id)
if needResult || !noSideEffect(&raftReq) {
if !needResult && raftReq.Txn != nil {
removeNeedlessRangeReqs(raftReq.Txn)
}
ar = s.uberApply.Apply(&raftReq, shouldApplyV3)
}
// do not re-toApply applied entries.
if !shouldApplyV3 {
return
}
if ar == nil {
return
}
if ar.Err != errors.ErrNoSpace || len(s.alarmStore.Get(pb.AlarmType_NOSPACE)) > 0 {
s.w.Trigger(id, ar)
return
}
...
}
After applying the entry it calls s.w.Trigger, which pushes the apply result into the channel registered for this request id in the corresponding list element:
func (w *list) Trigger(id uint64, x interface{}) {
idx := id % defaultListElementLength
w.e[idx].l.Lock()
ch := w.e[idx].m[id]
delete(w.e[idx].m, id)
w.e[idx].l.Unlock()
if ch != nil {
ch <- x
close(ch)
}
}
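The counterpart of Trigger is Register, which processInternalRaftRequestOnce calls before proposing; a simplified sketch of the list implementation in pkg/wait (the real code panics on a duplicate id):

func (w *list) Register(id uint64) <-chan interface{} {
	idx := id % defaultListElementLength
	newCh := make(chan interface{}, 1)
	w.e[idx].l.Lock()
	defer w.e[idx].l.Unlock()
	if _, ok := w.e[idx].m[id]; !ok {
		w.e[idx].m[id] = newCh
	}
	return newCh
}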
And this is exactly where processInternalRaftRequestOnce, which we saw in the first post, is waiting on ch for the result:
select {
case x := <-ch:
return x.(*apply2.Result), nil
Only at this point does etcdserver return the result to the client.
To recap: when a put request reaches the leader, the leader packages it into messages and sends them to the followers. Before a follower replies, it saves the entries to its WAL and calls fdatasync; the leader likewise writes the entries to its own WAL (and fdatasyncs) while the messages are sent asynchronously. Once the leader has acknowledgements from a quorum of nodes (itself included), it advances the committed index, propagates it to the followers, applies the committed entries to boltdb, and finally returns the apply result to the client. A single write therefore always involves leader-follower round trips plus durable disk writes on each node — an expensive path, but one that buys strong consistency.