etcd's put request (Part 2)

The follower node receives the append message

Let's continue with how a follower handles a message of type MsgApp — MsgApp, presumably, being short for MsgAppend:

func stepFollower(r *raft, m pb.Message) error {
	switch m.Type {
	case pb.MsgProp:
		if r.lead == None {
			r.logger.Infof("%x no leader at term %d; dropping proposal", r.id, r.Term)
			return ErrProposalDropped
		} else if r.disableProposalForwarding {
			r.logger.Infof("%x not forwarding to leader %x at term %d; dropping proposal", r.id, r.lead, r.Term)
			return ErrProposalDropped
		}
		m.To = r.lead
		r.send(m)
	case pb.MsgApp:
		r.electionElapsed = 0
		r.lead = m.From
		r.handleAppendEntries(m)
	case pb.MsgHeartbeat:
		r.electionElapsed = 0
		r.lead = m.From
		r.handleHeartbeat(m)
	case pb.MsgSnap:
		r.electionElapsed = 0
		r.lead = m.From
		r.handleSnapshot(m)
	case pb.MsgTransferLeader:
		if r.lead == None {
			r.logger.Infof("%x no leader at term %d; dropping leader transfer msg", r.id, r.Term)
			return nil
		}
		m.To = r.lead
		r.send(m)
	case pb.MsgTimeoutNow:
		r.logger.Infof("%x [term %d] received MsgTimeoutNow from %x and starts an election to get leadership.", r.id, r.Term, m.From)
		// Leadership transfers never use pre-vote even if r.preVote is true; we
		// know we are not recovering from a partition so there is no need for the
		// extra round trip.
		r.hup(campaignTransfer)
	case pb.MsgReadIndex:
		if r.lead == None {
			r.logger.Infof("%x no leader at term %d; dropping index reading msg", r.id, r.Term)
			return nil
		}
		m.To = r.lead
		r.send(m)
	case pb.MsgReadIndexResp:
		if len(m.Entries) != 1 {
			r.logger.Errorf("%x invalid format of MsgReadIndexResp from %x, entries count: %d", r.id, m.From, len(m.Entries))
			return nil
		}
		r.readStates = append(r.readStates, ReadState{Index: m.Index, RequestCtx: m.Entries[0].Data})
	}
	return nil
}

As described earlier, raftNode runs the step function, and on a follower node that is func stepFollower(r *raft, m pb.Message). For a MsgApp message the follower first resets electionElapsed to 0, which means it has just heard from the leader (much like a heartbeat) and will not start a new election when the election timeout fires. It then calls r.handleAppendEntries(m).
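For reference, here is a lightly abridged sketch of handleAppendEntries, based on the etcd raft code (the rejection branch is shown further below):

func (r *raft) handleAppendEntries(m pb.Message) {
	// If the append refers to an index we have already committed, simply
	// tell the leader how far this follower has committed.
	if m.Index < r.raftLog.committed {
		r.send(pb.Message{To: m.From, Type: pb.MsgAppResp, Index: r.raftLog.committed})
		return
	}

	if mlastIndex, ok := r.raftLog.maybeAppend(m.Index, m.LogTerm, m.Commit, m.Entries...); ok {
		r.send(pb.Message{To: m.From, Type: pb.MsgAppResp, Index: mlastIndex})
	} else {
		// ... reject and send back a hint about where the logs may diverge
		// (discussed below).
	}
}

In the normal case it calls raftLog.maybeAppend: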

func (l *raftLog) maybeAppend(index, logTerm, committed uint64, ents ...pb.Entry) (lastnewi uint64, ok bool) {
	if l.matchTerm(index, logTerm) {
		lastnewi = index + uint64(len(ents))
		ci := l.findConflict(ents)
		switch {
		case ci == 0:
		case ci <= l.committed:
			l.logger.Panicf("entry %d conflict with committed entry [committed(%d)]", ci, l.committed)
		default:
			offset := index + 1
			if ci-offset > uint64(len(ents)) {
				l.logger.Panicf("index, %d, is out of range [%d]", ci-offset, len(ents))
			}
			l.append(ents[ci-offset:]...)
		}
		l.commitTo(min(committed, lastnewi))
		return lastnewi, true
	}
	return 0, false
}

The follower first uses matchTerm to check that its own log contains an entry at the given index with the given term (the index and term of the entry that immediately precedes the new entries on the leader). In particular, that index must not be beyond the follower's last index, i.e., the new entries have to connect directly to the existing log: raft does not allow holes.
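matchTerm itself is a thin wrapper around term (a sketch following the raftLog code):

func (l *raftLog) matchTerm(i, term uint64) bool {
	t, err := l.term(i)
	if err != nil {
		return false
	}
	return t == term
}

and term only answers for indexes inside the follower's existing range, from the dummy entry up to the last index: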

func (l *raftLog) term(i uint64) (uint64, error) {
	// the valid term range is [index of dummy entry, last index]
	dummyIndex := l.firstIndex() - 1
	if i < dummyIndex || i > l.lastIndex() {
		// TODO: return an error instead?
		return 0, nil
	}
...
}

Next, findConflict compares the entries sent by the leader with the entries already in the follower's raftLog and looks for a conflict, that is, an entry with the same index but a different term. This typically happens when the old leader crashed and a new election took place after the old leader had replicated some entries to only part of the followers; those conflicting entries have to be truncated. The truncation happens in the default case: the conflicting suffix is overwritten with the entries from the leader (a conflict at or below the committed index is treated as a fatal error, hence the Panicf branch). Finally, the follower takes the minimum of the committed index sent by the leader and the last index it has just appended, and advances its own committed index to that value.
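For reference, findConflict scans the incoming entries for the first one whose term does not match the entry at the same index in the follower's log (sketch following the raftLog code):

func (l *raftLog) findConflict(ents []pb.Entry) uint64 {
	for _, ne := range ents {
		if !l.matchTerm(ne.Index, ne.Term) {
			if ne.Index <= l.lastIndex() {
				l.logger.Infof("found conflict at index %d [existing term: %d, conflicting term: %d]",
					ne.Index, l.zeroTermOnErrCompacted(l.term(ne.Index)), ne.Term)
			}
			return ne.Index
		}
	}
	return 0
}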

maybeAppend has two return values: the first is the follower's latest log index after the entries have been appended, and the second indicates whether the append succeeded. In the success case, handleAppendEntries assembles a MsgAppResp message whose Index is the follower's latest index and sends it back to the leader. A false return usually means the follower's log has diverged from the leader's in index and term, for example because of a network partition. In that case the follower uses the term and index the leader sent to reply with an index and term as close as possible to the point where the logs start to diverge, which reduces the number of probes the leader has to make. For example, if the term the leader probes with is already smaller than the follower's term at that index, the follower replies with a term and index no larger than the leader's, so that the initial point of divergence can be found quickly. See the comments in the leader's handling of a MsgAppResp with Reject set for the details.
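The rejection branch of handleAppendEntries (elided in the sketch above) computes that hint roughly as follows; findConflictByTerm returns the largest index in the follower's log whose term is no larger than the leader's probe term:

	} else {
		// Return a hint to the leader about the maximum (index, term) pair at
		// which the two logs could diverge.
		hintIndex := min(m.Index, r.raftLog.lastIndex())
		hintIndex = r.raftLog.findConflictByTerm(hintIndex, m.LogTerm)
		hintTerm, err := r.raftLog.term(hintIndex)
		if err != nil {
			panic(fmt.Sprintf("term(%d) must be valid, but got %v", hintIndex, err))
		}
		r.send(pb.Message{
			To:         m.From,
			Type:       pb.MsgAppResp,
			Index:      m.Index,
			Reject:     true,
			RejectHint: hintIndex,
			LogTerm:    hintTerm,
		})
	}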

The r.send function

Both MsgAppResp and MsgApp go through r.send, whose comment reads:

// send schedules persisting state to a stable storage and AFTER that
// sending the message (as part of next Ready message processing).
func (r *raft) send(m pb.Message) {
	if m.From == None {
		m.From = r.id
	}
	if m.Type == pb.MsgVote || m.Type == pb.MsgVoteResp || m.Type == pb.MsgPreVote || m.Type == pb.MsgPreVoteResp {
		if m.Term == 0 {	
			panic(fmt.Sprintf("term should be set when sending %s", m.Type))
		}
	} else {
		if m.Term != 0 {
			panic(fmt.Sprintf("term should not be set when sending %s (was %d)", m.Type, m.Term))
		}		
		if m.Type != pb.MsgProp && m.Type != pb.MsgReadIndex {
			m.Term = r.Term
		}
	}
	r.msgs = append(r.msgs, m)
}

So appending to r.msgs is expected to trigger persistence somewhere else. A global search for r.msgs leads to RawNode's HasReady function:

// HasReady called when RawNode user need to check if any Ready pending.
// Checking logic in this method should be consistent with Ready.containsUpdates().
func (rn *RawNode) HasReady() bool {
	r := rn.raft
	if !r.softState().equal(rn.prevSoftSt) {
		return true
	}
	if hardSt := r.hardState(); !IsEmptyHardState(hardSt) && !isHardStateEqual(hardSt, rn.prevHardSt) {
		return true
	}
	if r.raftLog.hasPendingSnapshot() {
		return true
	}
	if len(r.msgs) > 0 || len(r.raftLog.unstableEntries()) > 0 || r.raftLog.hasNextEnts() {
		return true
	}
	if len(r.readStates) != 0 {
		return true
	}
	return false
}

Tracing further up, HasReady is called from the node.run goroutine; when it returns true, a Ready struct rd is assembled and, in the following select statement, pushed onto the n.readyc channel.
On the other side, raftNode's start function keeps watching that readyc channel through the Ready method:

func (n *node) Ready() <-chan Ready { return n.readyc }

Inside raftNode's start function (server/etcdserver/raft.go), once a Ready arrives on the readyc channel, processing of rd begins; part of that is deciding whether the entries in this Ready have to be written to the WAL first:

return lastCommittedEntry.Term > firstUnstableEntry.Term ||
		(lastCommittedEntry.Term == firstUnstableEntry.Term && lastCommittedEntry.Index >= firstUnstableEntry.Index)
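For context, this comparison is the tail of shouldWaitWALSync in server/etcdserver/raft.go, which roughly reads (lightly abridged):

// shouldWaitWALSync: sync the WAL before sending any message only when the
// committed entries in this Ready overlap with its unstable entries.
func shouldWaitWALSync(rd raft.Ready) bool {
	if len(rd.CommittedEntries) == 0 || len(rd.Entries) == 0 {
		return false
	}
	firstUnstableEntry := rd.Entries[0]
	lastCommittedEntry := rd.CommittedEntries[len(rd.CommittedEntries)-1]
	return lastCommittedEntry.Term > firstUnstableEntry.Term ||
		(lastCommittedEntry.Term == firstUnstableEntry.Term && lastCommittedEntry.Index >= firstUnstableEntry.Index)
}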

That is, if the last committed entry is not older than the first unstable entry (the committed and unstable entries overlap), the WAL is synced first. Otherwise, if the node is the leader, it replicates the entries to the followers before writing its own disk:

// the leader can write to its disk in parallel with replicating to the followers and them
// writing to their disks.
// For more details, check raft thesis 10.2.1
if islead {
	// gofail: var raftBeforeLeaderSend struct{}
	r.transport.Send(r.processMessages(rd.Messages))
}

The transport in turn calls the Peer's send function; as its comment says, send is non-blocking:

type Peer interface {
	// send sends the message to the remote peer. The function is non-blocking
	// and has no promise that the message will be received by the remote.
	// When it fails to send message out, it will report the status to underlying
	// raft.
	send(m raftpb.Message)

After that, the entries in this Ready are saved to the WAL:

if !waitWALSync {
	// gofail: var raftBeforeSave struct{}
	if err := r.storage.Save(rd.HardState, rd.Entries); err != nil {
		r.lg.Fatal("failed to save Raft hard state and entries", zap.Error(err))
	}
}

Inside Save there is a check that decides whether to synchronously wait for the disk write to complete:

// MustSync returns true if the hard state and count of Raft entries indicate
// that a synchronous write to persistent storage is required.
func MustSync(st, prevst pb.HardState, entsnum int) bool {
	// Persistent state on all servers:
	// (Updated on stable storage before responding to RPCs)
	// currentTerm
	// votedFor
	// log entries[]
	return entsnum != 0 || st.Vote != prevst.Vote || st.Term != prevst.Term
}

So whenever the number of entries is non-zero (or the vote or term has changed), the write waits for the disk sync to complete.

storage

The Save function

func (w *WAL) Save(st raftpb.HardState, ents []raftpb.Entry) error {
	w.mu.Lock()
	defer w.mu.Unlock()

	// short cut, do not call sync
	if raft.IsEmptyHardState(st) && len(ents) == 0 {
		return nil
	}

	mustSync := raft.MustSync(st, w.state, len(ents))

	// TODO(xiangli): no more reference operator
	for i := range ents {
		if err := w.saveEntry(&ents[i]); err != nil {
			return err
		}
	}
	if err := w.saveState(&st); err != nil {
		return err
	}

	curOff, err := w.tail().Seek(0, io.SeekCurrent)
	if err != nil {
		return err
	}
	if curOff < SegmentSizeBytes {
		if mustSync {
			return w.sync()
		}
		return nil
	}

	return w.cut()
}

Save first writes the entries to the WAL file and then writes the hardState to the WAL as well. Note that the hardState contains:

  • the current Term
  • which node this node voted for
  • the index this node has committed

Finally, Save decides whether to call the system's fdatasync and wait for the I/O to complete. In general, as long as the number of appended entries is non-zero, fdatasync will be called.

if took > warnSyncDuration {
	w.lg.Warn(
		"slow fdatasync",
		zap.Duration("took", took),
		zap.Duration("expected-duration", warnSyncDuration),
	)
}

If this sync takes longer than 1 second (the warnSyncDuration constant), a warning is logged; a sync that slow very likely pushes the handling time of the request past 1 second as well.
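For context, that warning comes from the tail of the WAL's sync method, which, in a simplified sketch of the etcd WAL code, flushes the in-memory encoder and then times the fdatasync call:

func (w *WAL) sync() error {
	if w.encoder != nil {
		if err := w.encoder.flush(); err != nil {
			return err
		}
	}
	start := time.Now()
	err := fileutil.Fdatasync(w.tail().File)
	took := time.Since(start)
	// ... the slow-fdatasync warning shown above ...
	walFsyncSec.Observe(took.Seconds())
	return err
}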
Let's also take a quick look at storage's SaveSnap function.

The SaveSnap function

// SaveSnap saves the snapshot file to disk and writes the WAL snapshot entry.
func (st *storage) SaveSnap(snap raftpb.Snapshot) error {
	st.mux.RLock()
	defer st.mux.RUnlock()
	walsnap := walpb.Snapshot{
		Index:     snap.Metadata.Index,
		Term:      snap.Metadata.Term,
		ConfState: &snap.Metadata.ConfState,
	}
	// save the snapshot file before writing the snapshot to the wal.
	// This makes it possible for the snapshot file to become orphaned, but prevents
	// a WAL snapshot entry from having no corresponding snapshot file.
	err := st.s.SaveSnap(snap)
	if err != nil {
		return err
	}
	// gofail: var raftBeforeWALSaveSnaphot struct{}

	return st.w.SaveSnapshot(walsnap)
}

storage first calls s.SaveSnap to write the snapshot content to disk as a standalone file, and only then writes a description of that snapshot file to the WAL. This way, during crash recovery, the WAL tells us that a snapshot file exists on this machine, and the snapshot file can then be located from that description and used for recovery.

The raftStorage function

Left as a placeholder for a later post.

The leader receives MsgAppResp, the reply to its append

After calling maybeAppend, the follower has the latest index of its log and uses it to assemble the message it sends back to the leader:

	if mlastIndex, ok := r.raftLog.maybeAppend(m.Index, m.LogTerm, m.Commit, m.Entries...); ok {
		r.send(pb.Message{To: m.From, Type: pb.MsgAppResp, Index: mlastIndex})

We have already seen that after send is called, the goroutine started by the upper-layer raftNode (start) persists the log entries and, once persisted, sends the message to the leader. Now back to stepLeader:

			oldPaused := pr.IsPaused()
			if pr.MaybeUpdate(m.Index) {
				switch {
				case pr.State == tracker.StateProbe:
					pr.BecomeReplicate()
				case pr.State == tracker.StateSnapshot && pr.Match >= pr.PendingSnapshot:
					// TODO(tbg): we should also enter this branch if a snapshot is
					// received that is below pr.PendingSnapshot but which makes it
					// possible to use the log again.
					r.logger.Debugf("%x recovered from needing snapshot, resumed sending replication messages to %x [%s]", r.id, m.From, pr)
					// Transition back to replicating state via probing state
					// (which takes the snapshot into account). If we didn't
					// move to replicating state, that would only happen with
					// the next round of appends (but there may not be a next
					// round for a while, exposing an inconsistent RaftStatus).
					pr.BecomeProbe()
					pr.BecomeReplicate()
				case pr.State == tracker.StateReplicate:
					pr.Inflights.FreeLE(m.Index)
				}

				if r.maybeCommit() {
					// committed index has progressed for the term, so it is safe
					// to respond to pending read index requests
					releasePendingReadIndexMessages(r)
					r.bcastAppend()
				} else if oldPaused {
					// If we were paused before, this node may be missing the
					// latest commit index, so send it.
					r.sendAppend(m.From)
				}
				// We've updated flow control information above, which may
				// allow us to send multiple (size-limited) in-flight messages
				// at once (such as when transitioning from probe to
				// replicate, or when freeTo() covers multiple messages). If
				// we have more entries to send, send as many messages as we
				// can (without sending empty messages for the commit index)
				for r.maybeSendAppend(m.From, false) {
				}
				// Transfer leadership is in progress.
				if m.From == r.leadTransferee && pr.Match == r.raftLog.lastIndex() {
					r.logger.Infof("%x sent MsgTimeoutNow to %x after received MsgAppResp", r.id, m.From)
					r.sendTimeoutNow(m.From)
				}
			}

The leader first calls MaybeUpdate, which advances the follower's Match (and Next, the start index of the next append) based on the acknowledged index. If Match actually advanced, we enter the if block; with the follower in StateReplicate, the messages up to that index are freed from the inflight window.
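A sketch of Progress.MaybeUpdate, following the tracker code:

// MaybeUpdate is called when an MsgAppResp arrives from the follower, with the
// index acked by it. It returns false if the index comes from an outdated
// message; otherwise it updates the progress and returns true.
func (pr *Progress) MaybeUpdate(n uint64) bool {
	var updated bool
	if pr.Match < n {
		pr.Match = n
		updated = true
		pr.ProbeAcked()
	}
	if pr.Next < n+1 {
		pr.Next = n + 1
	}
	return updated
}

The leader then calls raft's maybeCommit: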

// maybeCommit attempts to advance the commit index. Returns true if
// the commit index changed (in which case the caller should call
// r.bcastAppend).
func (r *raft) maybeCommit() bool {
	mci := r.prs.Committed()
	return r.raftLog.maybeCommit(mci, r.Term)
}

Committed computes the largest index known to be committed from what each voting member has acknowledged:

// Committed returns the largest log index known to be committed based on what
// the voting members of the group have acknowledged.
func (p *ProgressTracker) Committed() uint64 {
	return uint64(p.Voters.CommittedIndex(matchAckIndexer(p.Progress)))
}

The function above converts p.Progress into the matchAckIndexer type, which implements the AckedIndex function:

// AckedIndex implements IndexLookuper.
func (l matchAckIndexer) AckedIndex(id uint64) (quorum.Index, bool) {
	pr, ok := l[id]
	if !ok {
		return 0, false
	}
	return quorum.Index(pr.Match), true
}

In other words, it returns the progress's Match value, which MaybeUpdate has just set to the follower's latest acknowledged index. CommittedIndex then collects these values:

		// Fill the slice with the indexes observed. Any unused slots will be
		// left as zero; these correspond to voters that may report in, but
		// haven't yet. We fill from the right (since the zeroes will end up on
		// the left after sorting below anyway).
		i := n - 1
		for id := range c {
			if idx, ok := l.AckedIndex(id); ok {
				srt[i] = uint64(idx)
				i--
			}
		}
	}

	// Sort by index. Use a bespoke algorithm (copied from the stdlib's sort
	// package) to keep srt on the stack.
	insertionSort(srt)

	// The smallest index into the array for which the value is acked by a
	// quorum. In other words, from the end of the slice, move n/2+1 to the
	// left (accounting for zero-indexing).
	pos := n - (n/2 + 1)
	return Index(srt[pos])

The committed index is computed by insertion-sorting the voters' latest acknowledged indexes and then taking position n - (n/2 + 1) in the sorted slice, which is the largest index already acknowledged by a majority of the nodes; that index becomes the candidate committed index.
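A tiny standalone illustration of that arithmetic (not etcd code, just the same computation) for five voters whose acknowledged Match values are 9, 7, 8, 5 and 6:

package main

import (
	"fmt"
	"sort"
)

func main() {
	// Match values reported by the 5 voting members.
	srt := []uint64{9, 7, 8, 5, 6}
	sort.Slice(srt, func(i, j int) bool { return srt[i] < srt[j] })
	n := len(srt)
	// The smallest index acked by a quorum: move n/2+1 in from the right.
	pos := n - (n/2 + 1)
	fmt.Println(srt[pos]) // 7 — indexes 7, 8 and 9 are acked by a majority
}

The chosen index is then handed to raftLog's maybeCommit: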

func (l *raftLog) maybeCommit(maxIndex, term uint64) bool {
	if maxIndex > l.committed && l.zeroTermOnErrCompacted(l.term(maxIndex)) == term {
		l.commitTo(maxIndex)
		return true
	}
	return false
}

If maxIndex is greater than the log's current committed value and the term of the entry at maxIndex equals the current term, the committed index is advanced to maxIndex and true is returned; otherwise false. raft's maybeCommit then returns the same result. Assuming both return true, the next step is:

// committed index has progressed for the term, so it is safe
// to respond to pending read index requests
releasePendingReadIndexMessages(r)
r.bcastAppend()

At this point the leader can release pending read-index responses and broadcast the newly advanced committed index.
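bcastAppend simply re-sends append messages, which now carry the advanced commit index, to every other peer; roughly, following the etcd raft code:

// bcastAppend sends RPC, with entries to all peers that are not up-to-date
// according to the progress recorded in r.prs.
func (r *raft) bcastAppend() {
	r.prs.Visit(func(id uint64, _ *tracker.Progress) {
		if id == r.id {
			return
		}
		r.sendAppend(id)
	})
}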

With that, this call to stepLeader is done, and the raft state machine returns to the HasReady check in its goroutine. This time HasReady returns true because the committed entries have changed, i.e., r.raftLog.hasNextEnts() is true, so the Ready assembled this round has non-empty committed entries. The upper-layer raftNode then receives this Ready from the readyc channel and passes it on to the etcdserver object through the r.applyc channel:

select {
case r.applyc <- ap:
case <-r.stopped:
	return
}

When etcdserver receives the ap struct, it assembles a scheduled job that runs applyAll over the committed entries of this round:

		select {
		case ap := <-s.r.apply():
			f := schedule.NewJob("server_applyAll", func(context.Context) { s.applyAll(&ep, &ap) })
			sched.Schedule(f)

Following applyAll down, we eventually end up in:

// applyEntryNormal applies an EntryNormal type raftpb request to the EtcdServer
func (s *EtcdServer) applyEntryNormal(e *raftpb.Entry) {
	...
	needResult := s.w.IsRegistered(id)
	if needResult || !noSideEffect(&raftReq) {
		if !needResult && raftReq.Txn != nil {
			removeNeedlessRangeReqs(raftReq.Txn)
		}
		ar = s.uberApply.Apply(&raftReq, shouldApplyV3)
	}

	// do not re-toApply applied entries.
	if !shouldApplyV3 {
		return
	}

	if ar == nil {
		return
	}

	if ar.Err != errors.ErrNoSpace || len(s.alarmStore.Get(pb.AlarmType_NOSPACE)) > 0 {
		s.w.Trigger(id, ar)
		return
	}
	...
}

After applying, it calls s.w.Trigger, and Trigger pushes the apply result onto the channel stored in the corresponding listElement:

func (w *list) Trigger(id uint64, x interface{}) {
	idx := id % defaultListElementLength
	w.e[idx].l.Lock()
	ch := w.e[idx].m[id]
	delete(w.e[idx].m, id)
	w.e[idx].l.Unlock()
	if ch != nil {
		ch <- x
		close(ch)
	}
}
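The channel retrieved here is the one that processInternalRaftRequestOnce registered earlier under the same id; its counterpart Register (pkg/wait, lightly abridged sketch) looks roughly like this:

func (w *list) Register(id uint64) <-chan interface{} {
	idx := id % defaultListElementLength
	newCh := make(chan interface{}, 1)
	w.e[idx].l.Lock()
	defer w.e[idx].l.Unlock()
	if _, ok := w.e[idx].m[id]; !ok {
		w.e[idx].m[id] = newCh
	} else {
		log.Panicf("dup id %x", id)
	}
	return newCh
}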

And this is exactly where processInternalRaftRequestOnce, which we saw in Part 1, waits on ch for the result:

select {
case x := <-ch:
	return x.(*apply2.Result), nil

Only at this point does etcdserver return the result to the client.

Summary

To recap: when a put request reaches the leader, the leader packages it into messages and sends them to the followers. Before a follower acknowledges such a message, it writes the entries to its WAL and calls fdatasync; the leader, while sending the messages asynchronously, also writes them to its own WAL and calls fdatasync. Once the leader has heard back from a quorum of nodes, it advances the committed index and notifies the followers of it, applies the committed entries to boltdb, and returns the apply result to the client. The whole write path therefore necessarily involves leader-follower communication as well as on-disk persistence — an expensive process, but one that guarantees strong consistency.
