etcd server study notes
The StartEtcd function in etcd.go starts an etcd server according to the given configuration.
if e.Server, err = etcdserver.NewServer(srvcfg); err != nil {
return e, err
}
Inside NewServer, the bootstrap function is called first and returns a bootstrappedServer struct. As the name suggests, bootstrap creates, from the configuration, the set of objects that the server will need for the rest of its startup:
b, err := bootstrap(cfg)
func bootstrap(cfg config.ServerConfig) (b *bootstrappedServer, err error) {
...
haveWAL := wal.Exist(cfg.WALDir())
ss := bootstrapSnapshot(cfg)
backend, err := bootstrapBackend(cfg, haveWAL, st, ss)
cluster, err := bootstrapCluster(cfg, bwal, prt)
s, err := bootstrapStorage(cfg, st, backend, bwal, cluster)
raft := bootstrapRaft(cfg, cluster, s.wal)
return &bootstrappedServer{
prt: prt,
ss: ss,
storage: s,
cluster: cluster,
raft: raft,
}, nil
...
}
Judging by the function names, each of these calls performs one piece of the startup work based on the configuration. Take bootstrapRaftFromCluster as an example:
func bootstrapRaftFromCluster(cfg config.ServerConfig, cl *membership.RaftCluster, ids []types.ID, bwal *bootstrappedWAL) *bootstrappedRaft {
...
s := bwal.MemoryStorage()
return &bootstrappedRaft{
lg: cfg.Logger,
heartbeat: time.Duration(cfg.TickMs) * time.Millisecond,
config: raftConfig(cfg, uint64(member.ID), s), //config.s
peers: peers,
storage: s,
}
...
}
Note that the storage passed into raftConfig here is a MemoryStorage. The MemoryStorage struct is fairly simple; it is essentially an in-memory array:
// MemoryStorage implements the Storage interface backed by an
// in-memory array.
type MemoryStorage struct {
// Protects access to all fields. Most methods of MemoryStorage are
// run on the raft goroutine, but Append() is run on an application
// goroutine.
sync.Mutex
hardState pb.HardState
snapshot pb.Snapshot
// ents[i] has raft log position i+snapshot.Metadata.Index
ents []pb.Entry
}
e.Server.Start()
The Start function mainly initializes member fields and channels, for example:
s.w = wait.New()
s.applyWait = wait.NewTimeList()
s.done = make(chan struct{})
s.stop = make(chan struct{})
s.stopping = make(chan struct{}, 1)
s.ctx, s.cancel = context.WithCancel(context.Background())
s.readwaitc = make(chan struct{}, 1)
s.readNotifier = newNotifier()
s.leaderChanged = notify.NewNotifier()
A normal read/write request first registers itself on w; once the etcd cluster has finished processing the request, the response is delivered back through w. Internally, w is a list of maps: a request is registered, keyed by its request ID, into the map whose list slot is derived from that ID. A minimal sketch of this register/trigger idea follows.
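To make the register/trigger idea concrete, here is a minimal sketch of it, simplified to a single mutex-protected map (the real wait package shards requests across a list of maps by ID, and its exact API may differ):
package main

import (
	"fmt"
	"sync"
)

// waiter maps a request ID to the channel the issuing goroutine blocks on.
type waiter struct {
	mu sync.Mutex
	m  map[uint64]chan interface{}
}

func newWaiter() *waiter {
	return &waiter{m: make(map[uint64]chan interface{})}
}

// Register creates a one-shot channel for the given request ID.
func (w *waiter) Register(id uint64) <-chan interface{} {
	ch := make(chan interface{}, 1)
	w.mu.Lock()
	w.m[id] = ch
	w.mu.Unlock()
	return ch
}

// Trigger delivers the result for the given request ID and unblocks the waiter.
func (w *waiter) Trigger(id uint64, result interface{}) {
	w.mu.Lock()
	ch, ok := w.m[id]
	delete(w.m, id)
	w.mu.Unlock()
	if ok {
		ch <- result
		close(ch)
	}
}

func main() {
	w := newWaiter()
	ch := w.Register(42)        // the request path registers and waits
	go w.Trigger(42, "applied") // the apply path triggers the result later
	fmt.Println(<-ch)           // prints "applied"
}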
Next, Start launches a goroutine:
// TODO: if this is an empty log, writes all peer infos
// into the first entry
go s.run()
This goroutine creates a scheduler that dispatches tasks in order, such as applying committed entries:
// asynchronously accept toApply packets, dispatch progress in-order
sched := schedule.NewFIFOScheduler(lg)
It then creates a raftReadyHandler, which decouples the raft logic from the server's state-machine logic, and starts the raftNode:
rh := &raftReadyHandler{
getLead: func() (lead uint64) { return s.getLead() },
updateLead: func(lead uint64) { s.setLead(lead) },
	updateLeadership: func(newLeader bool) {...},
}
s.r.start(rh)
r.start also launches a goroutine for the raftNode; that goroutine is one big for loop:
func (r *raftNode) start(rh *raftReadyHandler) {
go func() {
defer r.onStop()
islead := false
for {
select {
case <-r.ticker.C:
r.tick()
case rd := <-r.Ready():
...
if islead {
// gofail: var raftBeforeLeaderSend struct{}
r.transport.Send(r.processMessages(rd.Messages))
}
...
r.Advance()
case <-r.stopped:
return
}
}
}()
}
Here the raftNode drives the underlying raft library to interact with the other nodes according to the raft protocol. For example, the leader receives a Ready struct carrying newly proposed data, sends it to the followers via transport.Send, and calls Advance to report that the batch has been handled; a follower receives its own Ready structs and replies to the leader with the outcome of the messages it received.
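For reference, the contract the raft library expects from its user looks roughly like the sketch below. It is a simplified single-threaded version (etcd's real raftNode overlaps persistence, sending and applying), and the import paths assume the standalone raft module, which may differ between etcd versions:
package raftloop

import (
	"time"

	"go.etcd.io/raft/v3"
	pb "go.etcd.io/raft/v3/raftpb"
)

// readyLoop sketches the loop an application runs around a raft.Node.
// The save/send/apply callbacks stand in for etcd's WAL, transport and
// state machine.
func readyLoop(
	n raft.Node,
	storage *raft.MemoryStorage,
	ticker *time.Ticker,
	save func(st pb.HardState, ents []pb.Entry),
	send func(msgs []pb.Message),
	apply func(ent pb.Entry),
	stopc <-chan struct{},
) {
	for {
		select {
		case <-ticker.C:
			n.Tick() // drive elections and heartbeats
		case rd := <-n.Ready():
			save(rd.HardState, rd.Entries) // 1. persist hard state and new entries
			storage.Append(rd.Entries)     // 2. make them visible to the in-memory raft log
			send(rd.Messages)              // 3. ship messages to peers
			for _, ent := range rd.CommittedEntries {
				apply(ent) // 4. apply committed entries to the state machine
			}
			n.Advance() // 5. tell raft this Ready batch has been handled
		case <-stopc:
			n.Stop()
			return
		}
	}
}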
Finally, s.run settles into its big select loop until the server exits:
for {
select {
case ap := <-s.r.apply():
f := schedule.NewJob("server_applyAll", func(context.Context) { s.applyAll(&ep, &ap) })
sched.Schedule(f)
case leases := <-expiredLeaseC:
s.revokeExpiredLeases(leases)
case err := <-s.errorc:
lg.Warn("server error", zap.Error(err))
lg.Warn("data-dir used by this member must be removed")
return
case <-getSyncC():
if s.v2store.HasTTLKeys() {
s.sync(s.Cfg.ReqTimeout())
}
case <-s.stop:
return
}
}
As soon as the raft layer hands over entries that can be applied to the state machine, a job is created and the scheduler invokes s.applyAll to apply the committed entries. Expired leases are also handled inside this select. This makes heavy use of Go channels: whenever another goroutine finishes some piece of work, it notifies the corresponding channel, and this select receives the latest notification on each of them. A minimal sketch of this fan-in pattern is shown below.
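A minimal, generic sketch of that fan-in-plus-FIFO pattern (just the idea, not etcd's actual schedule package):
package main

import "fmt"

func main() {
	applyc := make(chan string) // e.g. entries ready to apply
	leasec := make(chan string) // e.g. expired leases
	stopc := make(chan struct{})
	jobs := make(chan func(), 128) // FIFO job queue
	done := make(chan struct{})

	// Single worker goroutine: jobs run one at a time, in submission order.
	go func() {
		for job := range jobs {
			job()
		}
		close(done)
	}()

	// Producers notify their channels when work is ready.
	go func() {
		applyc <- "entry-1"
		leasec <- "lease-42"
		close(stopc)
	}()

	// Fan-in loop: receive notifications and turn them into scheduled jobs.
	for {
		select {
		case e := <-applyc:
			jobs <- func() { fmt.Println("apply", e) }
		case l := <-leasec:
			jobs <- func() { fmt.Println("revoke", l) }
		case <-stopc:
			close(jobs)
			<-done // wait for queued jobs to finish
			return
		}
	}
}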
In addition, after the server starts, the handlers for the client-facing methods are bound to it. Here sctx stands for a context around a client listener; there is one sctx per configured client listen URL.
e.Server holds the server's internal processing logic, such as raft, storage, and the watchers. After that server has started, the client-serving path is invoked.
That path first sets up Go's built-in HTTP multiplexer (ServeMux), an object that maps HTTP path patterns to their handlers.
Then, for each sctx, its serve function is called to start a gRPC server and an HTTP server on that listener.
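Serving gRPC and plain HTTP on one listener means each connection has to be sniffed and routed to the right server. Below is a minimal sketch of that technique using the cmux library (which, to the best of my knowledge, is what etcd's serve path relies on); the listen address and the /health handler are made up for illustration:
package main

import (
	"net"
	"net/http"

	"github.com/soheilhy/cmux"
	"google.golang.org/grpc"
)

func main() {
	l, err := net.Listen("tcp", "127.0.0.1:2379") // hypothetical client URL
	if err != nil {
		panic(err)
	}

	m := cmux.New(l)
	// Connections whose content-type announces gRPC go to the gRPC server...
	grpcL := m.Match(cmux.HTTP2HeaderField("content-type", "application/grpc"))
	// ...everything else is treated as plain HTTP/1.x.
	httpL := m.Match(cmux.HTTP1Fast())

	grpcServer := grpc.NewServer()
	mux := http.NewServeMux()
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})

	go grpcServer.Serve(grpcL)
	go http.Serve(httpL, mux)

	if err := m.Serve(); err != nil {
		panic(err)
	}
}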
The gRPC server is started by the Server function in grpc.go under the v3rpc directory. That function first sets up the chains of unary and stream interceptors:
chainUnaryInterceptors := []grpc.UnaryServerInterceptor{
newLogUnaryInterceptor(s),
newUnaryInterceptor(s),
grpc_prometheus.UnaryServerInterceptor,
}
if interceptor != nil {
chainUnaryInterceptors = append(chainUnaryInterceptors, interceptor)
}
chainStreamInterceptors := []grpc.StreamServerInterceptor{
newStreamInterceptor(s),
grpc_prometheus.StreamServerInterceptor,
}
These interceptors are then appended to gopts, and the grpcServer is created from them:
grpcServer := grpc.NewServer(append(opts, gopts...)...)
pb.RegisterKVServer(grpcServer, NewQuotaKVServer(s))
pb.RegisterWatchServer(grpcServer, NewWatchServer(s))
pb.RegisterLeaseServer(grpcServer, NewQuotaLeaseServer(s))
pb.RegisterClusterServer(grpcServer, NewClusterServer(s))
pb.RegisterAuthServer(grpcServer, NewAuthServer(s))
pb.RegisterMaintenanceServer(grpcServer, NewMaintenanceServer(s))
After the gRPC server is created, the corresponding service implementations are registered on it, such as NewQuotaKVServer. RegisterKVServer wires the KV RPC methods to the KVServer's handlers so that specific client methods are dispatched to them:
var _KV_serviceDesc = grpc.ServiceDesc{
ServiceName: "etcdserverpb.KV",
HandlerType: (*KVServer)(nil),
Methods: []grpc.MethodDesc{
{
MethodName: "Range",
Handler: _KV_Range_Handler,
},
{
MethodName: "Put",
Handler: _KV_Put_Handler,
},
{
MethodName: "DeleteRange",
Handler: _KV_DeleteRange_Handler,
},
{
MethodName: "Txn",
Handler: _KV_Txn_Handler,
},
{
MethodName: "Compact",
Handler: _KV_Compact_Handler,
},
},
Streams: []grpc.StreamDesc{},
Metadata: "rpc.proto",
}
For example, _KV_Range_Handler wraps the KVServer's Range method into a function and passes it to the interceptor as an argument.
After the interceptor chain has run, the interceptor invokes the handler closure created inside the generated handler; _KV_Put_Handler, for instance, builds and calls:
handler := func(ctx context.Context, req interface{}) (interface{}, error) {
return srv.(KVServer).Put(ctx, req.(*PutRequest))
}
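For context, a gRPC unary interceptor has the signature below. This is a generic logging sketch, not etcd's actual interceptor, showing how the wrapped handler (for example the closure built by _KV_Put_Handler) ends up being invoked:
package example

import (
	"context"
	"log"

	"google.golang.org/grpc"
)

// unaryLogger is a sketch of a unary server interceptor: it runs before the
// wrapped handler, then passes control to it and can inspect the result.
func unaryLogger(
	ctx context.Context,
	req interface{},
	info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler,
) (interface{}, error) {
	log.Printf("rpc %s starting", info.FullMethod)
	resp, err := handler(ctx, req) // eventually reaches the generated handler's closure
	log.Printf("rpc %s finished, err=%v", info.FullMethod, err)
	return resp, err
}

// It would be installed with grpc.NewServer(grpc.ChainUnaryInterceptor(unaryLogger)).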
In the _KV_Put_Handler closure above, srv is short for server: it is the object created earlier by NewQuotaKVServer(s), and its Put method is implemented as follows:
func (s *quotaKVServer) Put(ctx context.Context, r *pb.PutRequest) (*pb.PutResponse, error) {
if err := s.qa.check(ctx, r); err != nil {
return nil, err
}
return s.KVServer.Put(ctx, r)
}
That is, after the quota check passes, it calls the embedded KVServer's Put. The KVServer here is the etcdserver.EtcdServer created earlier by StartEtcd.
Its Put is implemented as follows (search globally for "s *EtcdServer) Put"):
func (s *EtcdServer) Put(ctx context.Context, r *pb.PutRequest) (*pb.PutResponse, error) {
ctx = context.WithValue(ctx, traceutil.StartTimeKey, time.Now())
resp, err := s.raftRequest(ctx, pb.InternalRaftRequest{Put: r})
if err != nil {
return nil, err
}
return resp.(*pb.PutResponse), nil
}
which calls the raftRequest function:
func (s *EtcdServer) raftRequest(ctx context.Context, r pb.InternalRaftRequest) (proto.Message, error) {
return s.raftRequestOnce(ctx, r)
}
and ultimately processInternalRaftRequestOnce:
func (s *EtcdServer) processInternalRaftRequestOnce(ctx context.Context, r pb.InternalRaftRequest) (*apply2.Result, error) {
ai := s.getAppliedIndex()
ci := s.getCommittedIndex()
if ci > ai+maxGapBetweenApplyAndCommitIndex { // reject if committed runs too far ahead of applied
return nil, errors.ErrTooManyRequests
}
r.Header = &pb.RequestHeader{
ID: s.reqIDGen.Next(), // generate a unique request ID
}
// check authinfo if it is not InternalAuthenticateRequest
if r.Authenticate == nil {
authInfo, err := s.AuthInfoFromCtx(ctx)
if err != nil {
return nil, err
}
if authInfo != nil {
r.Header.Username = authInfo.Username
r.Header.AuthRevision = authInfo.Revision
}
}
data, err := r.Marshal()
if err != nil {
return nil, err
}
if len(data) > int(s.Cfg.MaxRequestBytes) {
return nil, errors.ErrRequestTooLarge
}
id := r.ID
if id == 0 {
id = r.Header.ID
}
ch := s.w.Register(id) // register a channel keyed by the unique ID and wait on it
cctx, cancel := context.WithTimeout(ctx, s.Cfg.ReqTimeout())
defer cancel()
start := time.Now()
err = s.r.Propose(cctx, data) // submit a proposal to the raft module
if err != nil {
proposalsFailed.Inc()
s.w.Trigger(id, nil) // GC wait
return nil, err
}
proposalsPending.Inc()
defer proposalsPending.Dec()
select {
case x := <-ch:
return x.(*apply2.Result), nil
case <-cctx.Done():
proposalsFailed.Inc()
s.w.Trigger(id, nil) // GC wait
return nil, s.parseProposeCtxErr(cctx.Err(), start)
case <-s.done:
return nil, errors.ErrStopped
}
}
Now look at s.r.Propose(cctx, data). The member r comes from:
r: *b.raft.newRaftNode(b.ss, b.storage.wal.w, b.cluster.cl),
newRaftNode is called with three arguments: a Snapshotter, the WAL, and the membership.RaftCluster that records cluster membership.
newRaftNode is a method on bootstrappedRaft and returns a pointer to a raftNode:
func (b *bootstrappedRaft) newRaftNode(ss *snap.Snapshotter, wal *wal.WAL, cl *membership.RaftCluster) *raftNode {
var n raft.Node
if len(b.peers) == 0 {
n = raft.RestartNode(b.config)
} else {
n = raft.StartNode(b.config, b.peers)
}
raftStatusMu.Lock()
raftStatus = n.Status
raftStatusMu.Unlock()
return newRaftNode(
raftNodeConfig{
lg: b.lg,
isIDRemoved: func(id uint64) bool { return cl.IsIDRemoved(types.ID(id)) },
Node: n,
heartbeat: b.heartbeat,
raftStorage: b.storage,
storage: serverstorage.NewStorage(b.lg, wal, ss),
},
)
}
This function first creates a raft.Node, n, which wraps the low-level implementation of the raft protocol. Both StartNode and RestartNode spawn a new goroutine that drives the protocol:
n := newNode(rn)
go n.run()
It then builds a temporary raftNodeConfig containing the main fields Node n, heartbeat, raftStorage (the MemoryStorage) and storage; the storage is created with the WAL, which tells us it is a persistent store. So a raftNode in fact holds two kinds of storage: the in-memory raftStorage for the raft log, and a WAL/snapshot-backed persistent storage.
Inside newRaftNode we finally see the raftNode being constructed:
r := &raftNode{
lg: cfg.lg,
tickMu: new(sync.Mutex),
raftNodeConfig: cfg,
// set up contention detectors for raft heartbeat message.
// expect to send a heartbeat within 2 heartbeat intervals.
td: contention.NewTimeoutDetector(2 * cfg.heartbeat),
readStateC: make(chan raft.ReadState, 1),
msgSnapC: make(chan raftpb.Message, maxInFlightMsgSnap),
applyc: make(chan toApply),
stopped: make(chan struct{}),
done: make(chan struct{}),
}
if r.heartbeat == 0 {
r.ticker = &time.Ticker{}
} else {
r.ticker = time.NewTicker(r.heartbeat)
}
It also initializes a ticker, r.ticker, from the configured heartbeat interval; this is presumably used to send heartbeats periodically and to drive other timer-based work (each tick ends up calling r.tick(), as seen in the raftNode loop earlier).
But raftNode itself does not implement Propose. The raftNode struct does, however, have an embedded field, one declared with only a type and no field name:
type raftNode struct {
lg *zap.Logger
tickMu *sync.Mutex
raftNodeConfig
//...
}
We might guess that raftNodeConfig implements Propose, but it does not either. Looking at raftNodeConfig's fields, it in turn embeds raft.Node, again without a field name:
type raftNodeConfig struct {
lg *zap.Logger
// to check if msg receiver is removed from cluster
isIDRemoved func(id uint64) bool
raft.Node
//...
}
raft.Node is an interface that includes the Propose method, so we need to find the concrete value that implements it. That is the variable n from newRaftNode above, created by
n = raft.StartNode(b.config, b.peers)
StartNode is implemented as follows:
func StartNode(c *Config, peers []Peer) Node {
if len(peers) == 0 {
panic("no peers given; use RestartNode instead")
}
rn, err := NewRawNode(c)
if err != nil {
panic(err)
}
err = rn.Bootstrap(peers)
if err != nil {
c.Logger.Warningf("error occurred during starting a new node: %v", err)
}
n := newNode(rn)
go n.run()
return &n
}
NewRawNode creates rn (a *RawNode), and newNode then creates n, which wraps rn, and returns it. So the implementation of Propose lives in raft/node.go:
func (n *node) Propose(ctx context.Context, data []byte) error {
return n.stepWait(ctx, pb.Message{Type: pb.MsgProp, Entries: []pb.Entry{{Data: data}}})
}
stepWait in turn calls:
return n.stepWithWaitOption(ctx, m, true)
where the third argument, wait, is true, meaning the call waits synchronously.
// Step advances the state machine using msgs. The ctx.Err() will be returned,
// if any.
func (n *node) stepWithWaitOption(ctx context.Context, m pb.Message, wait bool) error {
if m.Type != pb.MsgProp {
select {
case n.recvc <- m:
return nil
case <-ctx.Done():
return ctx.Err()
case <-n.done:
return ErrStopped
}
}
ch := n.propc
pm := msgWithResult{m: m}
if wait {
pm.result = make(chan error, 1)
}
select {
case ch <- pm:
if !wait {
return nil
}
case <-ctx.Done():
return ctx.Err()
case <-n.done:
return ErrStopped
}
select {
case err := <-pm.result:
if err != nil {
return err
}
case <-ctx.Done():
return ctx.Err()
case <-n.done:
return ErrStopped
}
return nil
}
Earlier, when the pb.Message was built, its Type was set to pb.MsgProp, so we can skip the first select and look at what follows.
The function wraps the arguments in a msgWithResult and sends it on n.propc. If wait were false it would return right away (the asynchronous case), but here wait is true, a synchronous operation, so it falls into the next select, which blocks until one of its cases is satisfied.
The question, then, is when some other goroutine will push a value into the pm.result channel. Since the previous select pushed pm onto n.propc, the answer must be wherever n.propc is consumed; that is where pm gets handled. The reply-channel pattern at work here is sketched below.
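A minimal, generic sketch of this request-with-reply-channel pattern (the struct name is reused purely for illustration; the real node multiplexes many more channels):
package main

import "fmt"

// msgWithResult pairs a request with a one-shot channel for its outcome,
// mirroring the shape used by the raft node.
type msgWithResult struct {
	m      string
	result chan error
}

func main() {
	propc := make(chan msgWithResult)

	// Consumer goroutine: the counterpart of node.run, which receives from
	// propc, processes the message, and reports the outcome on pm.result.
	go func() {
		for pm := range propc {
			fmt.Println("processing", pm.m)
			var err error // pretend the message was stepped successfully
			if pm.result != nil {
				pm.result <- err
				close(pm.result)
			}
		}
	}()

	// Producer side: the counterpart of stepWithWaitOption with wait=true.
	pm := msgWithResult{m: "MsgProp", result: make(chan error, 1)}
	propc <- pm // hand the proposal to the consumer
	if err := <-pm.result; err != nil { // block until the consumer replies
		fmt.Println("proposal failed:", err)
		return
	}
	fmt.Println("proposal accepted")
}
Back in the raft library, the consumer of n.propc is the node's run function: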
func (n *node) run() {
var propc chan msgWithResult
var readyc chan Ready
var advancec chan struct{}
var rd Ready
r := n.rn.raft
lead := None
for {
if advancec != nil {
readyc = nil
} else if n.rn.HasReady() {
// Populate a Ready. Note that this Ready is not guaranteed to
// actually be handled. We will arm readyc, but there's no guarantee
// that we will actually send on it. It's possible that we will
// service another channel instead, loop around, and then populate
// the Ready again. We could instead force the previous Ready to be
// handled first, but it's generally good to emit larger Readys plus
// it simplifies testing (by emitting less frequently and more
// predictably).
rd = n.rn.readyWithoutAccept()
readyc = n.readyc
}
if lead != r.lead {
if r.hasLeader() {
if lead == None {
r.logger.Infof("raft.node: %x elected leader %x at term %d", r.id, r.lead, r.Term)
} else {
r.logger.Infof("raft.node: %x changed leader from %x to %x at term %d", r.id, lead, r.lead, r.Term)
}
propc = n.propc
} else {
r.logger.Infof("raft.node: %x lost leader %x at term %d", r.id, lead, r.Term)
propc = nil
}
lead = r.lead
}
select {
// TODO: maybe buffer the config propose if there exists one (the way
// described in raft dissertation)
// Currently it is dropped in Step silently.
case pm := <-propc:
m := pm.m
m.From = r.id
err := r.Step(m)
if pm.result != nil {
pm.result <- err
close(pm.result)
}
case m := <-n.recvc:
// filter out response message from unknown From.
if pr := r.prs.Progress[m.From]; pr != nil || !IsResponseMsg(m.Type) {
r.Step(m)
}
case cc := <-n.confc:
_, okBefore := r.prs.Progress[r.id]
cs := r.applyConfChange(cc)
// If the node was removed, block incoming proposals. Note that we
// only do this if the node was in the config before. Nodes may be
// a member of the group without knowing this (when they're catching
// up on the log and don't have the latest config) and we don't want
// to block the proposal channel in that case.
//
// NB: propc is reset when the leader changes, which, if we learn
// about it, sort of implies that we got readded, maybe? This isn't
// very sound and likely has bugs.
if _, okAfter := r.prs.Progress[r.id]; okBefore && !okAfter {
var found bool
for _, sl := range [][]uint64{cs.Voters, cs.VotersOutgoing} {
for _, id := range sl {
if id == r.id {
found = true
break
}
}
if found {
break
}
}
if !found {
propc = nil
}
}
select {
case n.confstatec <- cs:
case <-n.done:
}
case <-n.tickc:
n.rn.Tick()
case readyc <- rd:
n.rn.acceptReady(rd)
advancec = n.advancec
case <-advancec:
n.rn.Advance(rd)
rd = Ready{}
advancec = nil
case c := <-n.status:
c <- getStatus(r)
case <-n.stop:
close(n.done)
return
}
}
}
Searching globally for propc shows that this channel is consumed in node's run function, which StartNode launched in its own goroutine; run is essentially the node's main function.
It is wrapped in one big for loop so that it can keep processing messages of all kinds.
At startup, advancec is nil, so the next question is whether n.rn.HasReady() is satisfied:
func (rn *RawNode) HasReady() bool {
r := rn.raft
if !r.softState().equal(rn.prevSoftSt) {
return true
}
if hardSt := r.hardState(); !IsEmptyHardState(hardSt) && !isHardStateEqual(hardSt, rn.prevHardSt) {
return true
}
if r.raftLog.hasPendingSnapshot() {
return true
}
if len(r.msgs) > 0 || len(r.raftLog.unstableEntries()) > 0 || r.raftLog.hasNextEnts() {
return true
}
if len(r.readStates) != 0 {
return true
}
return false
}
This function first checks whether the node's current softState differs from the previous one; softState is the state that does not need to be persisted, recording the cluster's current leader and the role of this node (leader, follower, or candidate). It then checks the hardState, the state that must be persisted, which holds the node's current Term, who it voted for in this term (Vote), and the committed index. Finally it checks whether there is a pending snapshot, whether there are messages to send, unstable entries, or committed entries ready to apply, and whether there are readStates.
If any of these conditions holds, it returns true, and a Ready struct is then populated for later use. In our scenario none of them holds yet: so far we have only pushed something onto propc and have not touched msgs or entries (msgs presumably being what must be sent over the network, entries what must be saved locally), so we skip this part for now.
Next, assume the cluster is in a stable state, which lets us skip the leader-change handling below and arrive at run's big select.
The first case receives from propc, so the data from our Put request is passed as an argument to r.Step.
r.Step is a method of the raft struct held inside the node; this is where the concrete raft protocol logic lives.
func (r *raft) Step(m pb.Message) error {
// Handle the message term, which may result in our stepping down to a follower.
switch {
case m.Term == 0:
// local message
case m.Term > r.Term:
if m.Type == pb.MsgVote || m.Type == pb.MsgPreVote {
force := bytes.Equal(m.Context, []byte(campaignTransfer))
inLease := r.checkQuorum && r.lead != None && r.electionElapsed < r.electionTimeout
if !force && inLease {
// If a server receives a RequestVote request within the minimum election timeout
// of hearing from a current leader, it does not update its term or grant its vote
r.logger.Infof("%x [logterm: %d, index: %d, vote: %x] ignored %s from %x [logterm: %d, index: %d] at term %d: lease is not expired (remaining ticks: %d)",
r.id, r.raftLog.lastTerm(), r.raftLog.lastIndex(), r.Vote, m.Type, m.From, m.LogTerm, m.Index, r.Term, r.electionTimeout-r.electionElapsed)
return nil
}
}
switch {
case m.Type == pb.MsgPreVote:
// Never change our term in response to a PreVote
case m.Type == pb.MsgPreVoteResp && !m.Reject:
// We send pre-vote requests with a term in our future. If the
// pre-vote is granted, we will increment our term when we get a
// quorum. If it is not, the term comes from the node that
// rejected our vote so we should become a follower at the new
// term.
default:
r.logger.Infof("%x [term: %d] received a %s message with higher term from %x [term: %d]",
r.id, r.Term, m.Type, m.From, m.Term)
if m.Type == pb.MsgApp || m.Type == pb.MsgHeartbeat || m.Type == pb.MsgSnap {
r.becomeFollower(m.Term, m.From)
} else {
r.becomeFollower(m.Term, None)
}
}
case m.Term < r.Term:
if (r.checkQuorum || r.preVote) && (m.Type == pb.MsgHeartbeat || m.Type == pb.MsgApp) {
// We have received messages from a leader at a lower term. It is possible
// that these messages were simply delayed in the network, but this could
// also mean that this node has advanced its term number during a network
// partition, and it is now unable to either win an election or to rejoin
// the majority on the old term. If checkQuorum is false, this will be
// handled by incrementing term numbers in response to MsgVote with a
// higher term, but if checkQuorum is true we may not advance the term on
// MsgVote and must generate other messages to advance the term. The net
// result of these two features is to minimize the disruption caused by
// nodes that have been removed from the cluster's configuration: a
// removed node will send MsgVotes (or MsgPreVotes) which will be ignored,
// but it will not receive MsgApp or MsgHeartbeat, so it will not create
// disruptive term increases, by notifying leader of this node's activeness.
// The above comments also true for Pre-Vote
//
// When follower gets isolated, it soon starts an election ending
// up with a higher term than leader, although it won't receive enough
// votes to win the election. When it regains connectivity, this response
// with "pb.MsgAppResp" of higher term would force leader to step down.
// However, this disruption is inevitable to free this stuck node with
// fresh election. This can be prevented with Pre-Vote phase.
r.send(pb.Message{To: m.From, Type: pb.MsgAppResp})
} else if m.Type == pb.MsgPreVote {
// Before Pre-Vote enable, there may have candidate with higher term,
// but less log. After update to Pre-Vote, the cluster may deadlock if
// we drop messages with a lower term.
r.logger.Infof("%x [logterm: %d, index: %d, vote: %x] rejected %s from %x [logterm: %d, index: %d] at term %d",
r.id, r.raftLog.lastTerm(), r.raftLog.lastIndex(), r.Vote, m.Type, m.From, m.LogTerm, m.Index, r.Term)
r.send(pb.Message{To: m.From, Term: r.Term, Type: pb.MsgPreVoteResp, Reject: true})
} else {
// ignore other cases
r.logger.Infof("%x [term: %d] ignored a %s message with lower term from %x [term: %d]",
r.id, r.Term, m.Type, m.From, m.Term)
}
return nil
}
switch m.Type {
case pb.MsgHup:
if r.preVote {
r.hup(campaignPreElection)
} else {
r.hup(campaignElection)
}
case pb.MsgVote, pb.MsgPreVote:
// We can vote if this is a repeat of a vote we've already cast...
canVote := r.Vote == m.From ||
// ...we haven't voted and we don't think there's a leader yet in this term...
(r.Vote == None && r.lead == None) ||
// ...or this is a PreVote for a future term...
(m.Type == pb.MsgPreVote && m.Term > r.Term)
// ...and we believe the candidate is up to date.
if canVote && r.raftLog.isUpToDate(m.Index, m.LogTerm) {
// Note: it turns out that that learners must be allowed to cast votes.
// This seems counter- intuitive but is necessary in the situation in which
// a learner has been promoted (i.e. is now a voter) but has not learned
// about this yet.
// For example, consider a group in which id=1 is a learner and id=2 and
// id=3 are voters. A configuration change promoting 1 can be committed on
// the quorum `{2,3}` without the config change being appended to the
// learner's log. If the leader (say 2) fails, there are de facto two
// voters remaining. Only 3 can win an election (due to its log containing
// all committed entries), but to do so it will need 1 to vote. But 1
// considers itself a learner and will continue to do so until 3 has
// stepped up as leader, replicates the conf change to 1, and 1 applies it.
// Ultimately, by receiving a request to vote, the learner realizes that
// the candidate believes it to be a voter, and that it should act
// accordingly. The candidate's config may be stale, too; but in that case
// it won't win the election, at least in the absence of the bug discussed
// in:
// https://github.com/etcd-io/etcd/issues/7625#issuecomment-488798263.
r.logger.Infof("%x [logterm: %d, index: %d, vote: %x] cast %s for %x [logterm: %d, index: %d] at term %d",
r.id, r.raftLog.lastTerm(), r.raftLog.lastIndex(), r.Vote, m.Type, m.From, m.LogTerm, m.Index, r.Term)
// When responding to Msg{Pre,}Vote messages we include the term
// from the message, not the local term. To see why, consider the
// case where a single node was previously partitioned away and
// it's local term is now out of date. If we include the local term
// (recall that for pre-votes we don't update the local term), the
// (pre-)campaigning node on the other end will proceed to ignore
// the message (it ignores all out of date messages).
// The term in the original message and current local term are the
// same in the case of regular votes, but different for pre-votes.
r.send(pb.Message{To: m.From, Term: m.Term, Type: voteRespMsgType(m.Type)})
if m.Type == pb.MsgVote {
// Only record real votes.
r.electionElapsed = 0
r.Vote = m.From
}
} else {
r.logger.Infof("%x [logterm: %d, index: %d, vote: %x] rejected %s from %x [logterm: %d, index: %d] at term %d",
r.id, r.raftLog.lastTerm(), r.raftLog.lastIndex(), r.Vote, m.Type, m.From, m.LogTerm, m.Index, r.Term)
r.send(pb.Message{To: m.From, Term: r.Term, Type: voteRespMsgType(m.Type), Reject: true})
}
default:
err := r.step(r, m)
if err != nil {
return err
}
}
return nil
}
When the message was built earlier, we never set its Term, so m.Term == 0 and raft's Step treats it as a local message. Since its Type is neither MsgHup nor a vote/pre-vote, we fall into the default case: r.step(r, m).
step is a field of r whose type is a function; it is reassigned whenever the node changes role, for example in becomeLeader:
func (r *raft) becomeLeader() {
// TODO(xiangli) remove the panic when the raft implementation is stable
if r.state == StateFollower {
panic("invalid transition [follower -> leader]")
}
r.step = stepLeader
r.reset(r.Term)
r.tick = r.tickHeartbeat
r.lead = r.id
r.state = StateLeader
// Followers enter replicate mode when they've been successfully probed
// (perhaps after having received a snapshot as a result). The leader is
// trivially in this state. Note that r.reset() has initialized this
// progress with the last index already.
r.prs.Progress[r.id].BecomeReplicate()
// Conservatively set the pendingConfIndex to the last index in the
// log. There may or may not be a pending config change, but it's
// safe to delay any future proposals until we commit all our
// pending log entries, and scanning the entire tail of the log
// could be expensive.
r.pendingConfIndex = r.raftLog.lastIndex()
emptyEnt := pb.Entry{Data: nil}
if !r.appendEntry(emptyEnt) {
// This won't happen because we just called reset() above.
r.logger.Panic("empty entry was dropped")
}
// As a special case, don't count the initial empty entry towards the
// uncommitted log quota. This is because we want to preserve the
// behavior of allowing one entry larger than quota if the current
// usage is zero.
r.reduceUncommittedSize([]pb.Entry{emptyEnt})
r.logger.Infof("%x became leader at term %d", r.id, r.Term)
}
A global search shows that this function field is assigned in several places (the become* role transitions). To keep things simple, assume the current node is the leader, so r.step has been set to stepLeader.
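This "swap the step function on every role change" trick is worth seeing in isolation; the sketch below is a generic illustration, not the raft library's actual types:
package main

import "fmt"

// node models a tiny state machine whose behavior is a swappable function
// field, the same trick raft uses with r.step and r.tick.
type node struct {
	state string
	step  func(n *node, msg string)
}

func stepLeader(n *node, msg string)   { fmt.Println("leader handling", msg) }
func stepFollower(n *node, msg string) { fmt.Println("follower forwarding", msg) }

func (n *node) becomeLeader()   { n.state = "leader"; n.step = stepLeader }
func (n *node) becomeFollower() { n.state = "follower"; n.step = stepFollower }

func main() {
	n := &node{}
	n.becomeFollower()
	n.step(n, "MsgProp") // follower behavior
	n.becomeLeader()
	n.step(n, "MsgProp") // leader behavior, no switch statement at the call site
}
With that pattern in mind, here is the actual stepLeader: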
func stepLeader(r *raft, m pb.Message) error {
// These message types do not require any progress for m.From.
switch m.Type {
case pb.MsgBeat:
r.bcastHeartbeat()
return nil
case pb.MsgCheckQuorum:
// The leader should always see itself as active. As a precaution, handle
// the case in which the leader isn't in the configuration any more (for
// example if it just removed itself).
//
// TODO(tbg): I added a TODO in removeNode, it doesn't seem that the
// leader steps down when removing itself. I might be missing something.
if pr := r.prs.Progress[r.id]; pr != nil {
pr.RecentActive = true
}
if !r.prs.QuorumActive() {
r.logger.Warningf("%x stepped down to follower since quorum is not active", r.id)
r.becomeFollower(r.Term, None)
}
// Mark everyone (but ourselves) as inactive in preparation for the next
// CheckQuorum.
r.prs.Visit(func(id uint64, pr *tracker.Progress) {
if id != r.id {
pr.RecentActive = false
}
})
return nil
case pb.MsgProp:
if len(m.Entries) == 0 {
r.logger.Panicf("%x stepped empty MsgProp", r.id)
}
if r.prs.Progress[r.id] == nil {
// If we are not currently a member of the range (i.e. this node
// was removed from the configuration while serving as leader),
// drop any new proposals.
return ErrProposalDropped
}
if r.leadTransferee != None {
r.logger.Debugf("%x [term %d] transfer leadership to %x is in progress; dropping proposal", r.id, r.Term, r.leadTransferee)
return ErrProposalDropped
}
for i := range m.Entries {
e := &m.Entries[i]
var cc pb.ConfChangeI
if e.Type == pb.EntryConfChange {
var ccc pb.ConfChange
if err := ccc.Unmarshal(e.Data); err != nil {
panic(err)
}
cc = ccc
} else if e.Type == pb.EntryConfChangeV2 {
var ccc pb.ConfChangeV2
if err := ccc.Unmarshal(e.Data); err != nil {
panic(err)
}
cc = ccc
}
if cc != nil {
alreadyPending := r.pendingConfIndex > r.raftLog.applied
alreadyJoint := len(r.prs.Config.Voters[1]) > 0
wantsLeaveJoint := len(cc.AsV2().Changes) == 0
var refused string
if alreadyPending {
refused = fmt.Sprintf("possible unapplied conf change at index %d (applied to %d)", r.pendingConfIndex, r.raftLog.applied)
} else if alreadyJoint && !wantsLeaveJoint {
refused = "must transition out of joint config first"
} else if !alreadyJoint && wantsLeaveJoint {
refused = "not in joint state; refusing empty conf change"
}
if refused != "" {
r.logger.Infof("%x ignoring conf change %v at config %s: %s", r.id, cc, r.prs.Config, refused)
m.Entries[i] = pb.Entry{Type: pb.EntryNormal}
} else {
r.pendingConfIndex = r.raftLog.lastIndex() + uint64(i) + 1
}
}
}
if !r.appendEntry(m.Entries...) {
return ErrProposalDropped
}
r.bcastAppend()
return nil
case pb.MsgReadIndex:
// only one voting member (the leader) in the cluster
if r.prs.IsSingleton() {
if resp := r.responseToReadIndexReq(m, r.raftLog.committed); resp.To != None {
r.send(resp)
}
return nil
}
// Postpone read only request when this leader has not committed
// any log entry at its term.
if !r.committedEntryInCurrentTerm() {
r.pendingReadIndexMessages = append(r.pendingReadIndexMessages, m)
return nil
}
sendMsgReadIndexResponse(r, m)
return nil
}
stepLeader handles many message types; we focus on MsgProp. Handling it starts with a few error checks, then iterates over every Entry in the message (so a message can carry several entries) to see whether it is a conf-change entry. Our Put request is certainly not, so we go straight to r.appendEntry:
func (r *raft) appendEntry(es ...pb.Entry) (accepted bool) {
li := r.raftLog.lastIndex()
for i := range es {
es[i].Term = r.Term
es[i].Index = li + 1 + uint64(i)
}
// Track the size of this uncommitted proposal.
if !r.increaseUncommittedSize(es) {
r.logger.Debugf(
"%x appending new entries to log would exceed uncommitted entry size limit; dropping proposal",
r.id,
)
// Drop the proposal.
return false
}
// use latest "last" index after truncate/append
li = r.raftLog.append(es...)
r.prs.Progress[r.id].MaybeUpdate(li)
// Regardless of maybeCommit's return, our caller will call bcastAppend.
r.maybeCommit()
return true
}
The function first stamps each Entry with the leader's current Term and assigns indexes that continue from lastIndex, incrementing by one per entry; this is exactly how raft appends to its log. It then accounts for the new uncommitted size and appends es (the entries) to the raftLog. Note that this raftLog is backed by memory.
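A tiny runnable example of that stamping, with a hypothetical lastIndex of 7 and leader term of 3 (the raftpb import path assumes the standalone raft module and may differ between etcd versions):
package main

import (
	"fmt"

	pb "go.etcd.io/raft/v3/raftpb"
)

func main() {
	const term, lastIndex = uint64(3), uint64(7) // hypothetical leader state
	es := []pb.Entry{{Data: []byte("a")}, {Data: []byte("b")}, {Data: []byte("c")}}

	// Same loop as appendEntry: continue the index sequence from lastIndex.
	for i := range es {
		es[i].Term = term
		es[i].Index = lastIndex + 1 + uint64(i)
	}
	for _, e := range es {
		fmt.Printf("index=%d term=%d data=%s\n", e.Index, e.Term, e.Data)
	}
	// Output: indexes 8, 9, 10, all at term 3.
}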
The raftLog is initialized along this path:
NewRawNode -> newRaft -> raftlog := newLogWithSize(c.Storage, c.Logger, c.MaxCommittedSizePerReady)
So the key question is what kind of storage c.Storage is.
NewRawNode is invoked from func (b *bootstrappedRaft) newRaftNode(ss *snap.Snapshotter, wal *wal.WAL, cl *membership.RaftCluster) *raftNode, where b.config is exactly the config parameter c passed to NewRawNode. That b is the bootstrappedRaft created earlier by raft := bootstrapRaft(cfg, cluster, s.wal); since our etcd starts as a brand-new raft cluster, bootstrapRaft ends up calling:
func bootstrapRaftFromCluster(cfg config.ServerConfig, cl *membership.RaftCluster, ids []types.ID, bwal *bootstrappedWAL) *bootstrappedRaft {
member := cl.MemberByName(cfg.Name)
peers := make([]raft.Peer, len(ids))
for i, id := range ids {
var ctx []byte
ctx, err := json.Marshal((*cl).Member(id))
if err != nil {
cfg.Logger.Panic("failed to marshal member", zap.Error(err))
}
peers[i] = raft.Peer{ID: uint64(id), Context: ctx}
}
cfg.Logger.Info(
"starting local member",
zap.String("local-member-id", member.ID.String()),
zap.String("cluster-id", cl.ID().String()),
)
s := bwal.MemoryStorage()
return &bootstrappedRaft{
lg: cfg.Logger,
heartbeat: time.Duration(cfg.TickMs) * time.Millisecond,
config: raftConfig(cfg, uint64(member.ID), s), //config.s
peers: peers,
storage: s,
}
}
From s := bwal.MemoryStorage() inside this function we can see that the storage in bootstrappedRaft's config is indeed a MemoryStorage. The MemoryStorage method first restores state from the WAL's snapshot, then appends the WAL (write-ahead log) entries not yet covered by the snapshot into the MemoryStorage's ents array. I read this as restoring the on-disk WAL data into the in-memory raft log; note that this data has already been persisted.
func (wal *bootstrappedWAL) MemoryStorage() *raft.MemoryStorage {
s := raft.NewMemoryStorage()
if wal.snapshot != nil {
s.ApplySnapshot(*wal.snapshot)
}
if wal.st != nil {
s.SetHardState(*wal.st)
}
if len(wal.ents) != 0 {
s.Append(wal.ents)
}
return s
}
Now let's step back and look at the raftLog struct:
type raftLog struct {
// storage contains all stable entries since the last snapshot.
storage Storage
// unstable contains all unstable entries and snapshot.
// they will be saved into storage.
unstable unstable
// committed is the highest log position that is known to be in
// stable storage on a quorum of nodes.
committed uint64
// applied is the highest log position that the application has
// been instructed to apply to its state machine.
// Invariant: applied <= committed
applied uint64
logger Logger
// maxNextEntsSize is the maximum number aggregate byte size of the messages
// returned from calls to nextEnts.
maxNextEntsSize uint64
}
The raftLog is created by newLogWithSize:
// newLogWithSize returns a log using the given storage and max
// message size.
func newLogWithSize(storage Storage, logger Logger, maxNextEntsSize uint64) *raftLog {
if storage == nil {
log.Panic("storage must not be nil")
}
log := &raftLog{
storage: storage,
logger: logger,
maxNextEntsSize: maxNextEntsSize,
}
firstIndex, err := storage.FirstIndex()
if err != nil {
panic(err) // TODO(bdarnell)
}
lastIndex, err := storage.LastIndex()
if err != nil {
panic(err) // TODO(bdarnell)
}
log.unstable.offset = lastIndex + 1
log.unstable.logger = logger
// Initialize our committed and applied pointers to the time of the last compaction.
log.committed = firstIndex - 1
log.applied = firstIndex - 1
return log
}
This function initializes the unstable member. unstable records the raftLog entries that have not yet been persisted; it is backed by a slice, and its offset maps slice positions to raft indexes, so offset is initialized to lastIndex + 1. LastIndex is computed as follows:
func (ms *MemoryStorage) lastIndex() uint64 {
return ms.ents[0].Index + uint64(len(ms.ents)) - 1
}
and FirstIndex as follows:
func (ms *MemoryStorage) firstIndex() uint64 {
return ms.ents[0].Index + 1
}
log.committed is initialized to firstIndex - 1, and so is log.applied. From this we can infer that the snapshot holds state the cluster already agrees is committed, while the WAL holds entries that are persisted but not necessarily committed yet.
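These index calculations can be checked with a small runnable example, assuming a hypothetical snapshot at index 10 followed by five WAL entries (again, the import paths assume the standalone raft module):
package main

import (
	"fmt"

	"go.etcd.io/raft/v3"
	pb "go.etcd.io/raft/v3/raftpb"
)

func main() {
	ms := raft.NewMemoryStorage()

	// Pretend the snapshot covers everything up to index 10.
	ms.ApplySnapshot(pb.Snapshot{Metadata: pb.SnapshotMetadata{Index: 10, Term: 2}})

	// Pretend the WAL still holds entries 11..15 that the snapshot doesn't cover.
	var ents []pb.Entry
	for i := uint64(11); i <= 15; i++ {
		ents = append(ents, pb.Entry{Index: i, Term: 2})
	}
	ms.Append(ents)

	first, _ := ms.FirstIndex() // 11 -> raftLog.committed/applied start at 10
	last, _ := ms.LastIndex()   // 15 -> unstable.offset starts at 16
	fmt.Println("firstIndex:", first, "lastIndex:", last)
}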
The r.raftLog.append call from before ends up in the unstable member's truncateAndAppend(ents):
func (u *unstable) truncateAndAppend(ents []pb.Entry) {
after := ents[0].Index
switch {
case after == u.offset+uint64(len(u.entries)):
// after is the next index in the u.entries
// directly append
u.entries = append(u.entries, ents...)
case after <= u.offset:
u.logger.Infof("replace the unstable entries from index %d", after)
// The log is being truncated to before our current offset
// portion, so set the offset and replace the entries
u.offset = after
u.entries = ents
default:
// truncate to after and copy to u.entries
// then append
u.logger.Infof("truncate the unstable entries before index %d", after)
u.entries = append([]pb.Entry{}, u.slice(u.offset, after)...)
u.entries = append(u.entries, ents...)
}
}
unstable's entries is a slice that, starting from position 0, holds the entries not yet persisted; the raft log index of an element is u.offset plus its slice position. The three cases mean: (1) the new entries start exactly at the next expected index, so they are simply appended; (2) they start at or before u.offset, so they replace the whole unstable slice and offset is moved back; (3) they start somewhere inside the existing slice, so the conflicting tail is truncated first and the new entries are appended after it.
raftLog.append finally returns the current last index (the newest index in unstable, or, if unstable has no entries, the newest index in storage). As I understand it, everything in storage has already been persisted, which matches the MemoryStorage method above: its initial ents all come from the WAL.
func (l *raftLog) append(ents ...pb.Entry) uint64 {
if len(ents) == 0 {
return l.lastIndex()
}
if after := ents[0].Index - 1; after < l.committed {
l.logger.Panicf("after(%d) is out of range [committed(%d)]", after, l.committed)
}
l.unstable.truncateAndAppend(ents)
return l.lastIndex()
}
After append completes, execution reaches r.prs.Progress[r.id].MaybeUpdate(li).
Let's first see how prs is initialized; prs is a ProgressTracker:
// MakeProgressTracker initializes a ProgressTracker.
func MakeProgressTracker(maxInflight int) ProgressTracker {
p := ProgressTracker{
MaxInflight: maxInflight,
Config: Config{
Voters: quorum.JointConfig{
quorum.MajorityConfig{},
nil, // only populated when used
},
Learners: nil, // only populated when used
LearnersNext: nil, // only populated when used
},
Votes: map[uint64]bool{},
Progress: map[uint64]*Progress{},
}
return p
}
It has two main members, the Votes map and the Progress map. According to the comments, Progress is what the leader uses to track the state of each follower's raftLog; the map key is the raft node ID.
// Progress represents a follower’s progress in the view of the leader. Leader
// maintains progresses of all followers, and sends entries to the follower
// based on its progress.
//
// NB(tbg): Progress is basically a state machine whose transitions are mostly
// strewn around `*raft.raft`. Additionally, some fields are only used when in a
// certain State. All of this isn't ideal.
type Progress struct {
Match, Next uint64
	//...
}
Progress contains the field Match, the latest raftLog index known to be replicated on that node (its match index). Calling MaybeUpdate records the current node's match index, which the leader later uses when tallying the indexes across all nodes.
Next comes the call to r.maybeCommit():
// maybeCommit attempts to advance the commit index. Returns true if
// the commit index changed (in which case the caller should call
// r.bcastAppend).
func (r *raft) maybeCommit() bool {
mci := r.prs.Committed()
return r.raftLog.maybeCommit(mci, r.Term)
}
prs.Committed collects the match index of every voter, sorts them (using an insertion sort internally), and returns the highest index that a quorum has reached. raftLog.maybeCommit(mci, r.Term) then takes that value and decides whether the raftLog's committed index needs to advance. A small worked example of this quorum calculation follows.
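A minimal sketch of that quorum calculation (an illustration of the idea, not the raft library's actual quorum package): with voter match indexes 9, 5 and 7, the index a majority has reached is 7.
package main

import (
	"fmt"
	"sort"
)

// quorumIndex returns the highest log index that a majority of voters
// have acknowledged (their match index is >= the returned value).
func quorumIndex(match []uint64) uint64 {
	sorted := append([]uint64(nil), match...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] }) // descending
	// With n voters, the (n/2+1)-th largest match index is replicated on a majority.
	return sorted[len(sorted)/2]
}

func main() {
	fmt.Println(quorumIndex([]uint64{9, 5, 7})) // prints 7
}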
The return value of r.maybeCommit does not matter much here; as the comment says, the caller calls bcastAppend either way. As the leader, it sends every follower the entries that follower is missing:
// maybeSendAppend sends an append RPC with new entries to the given peer,
// if necessary. Returns true if a message was sent. The sendIfEmpty
// argument controls whether messages with no entries will be sent
// ("empty" messages are useful to convey updated Commit indexes, but
// are undesirable when we're sending multiple messages in a batch).
func (r *raft) maybeSendAppend(to uint64, sendIfEmpty bool) bool {
pr := r.prs.Progress[to]
if pr.IsPaused() {
return false
}
m := pb.Message{}
m.To = to
term, errt := r.raftLog.term(pr.Next - 1)
ents, erre := r.raftLog.entries(pr.Next, r.maxMsgSize)
if len(ents) == 0 && !sendIfEmpty {
return false
}
if errt != nil || erre != nil { // send snapshot if we failed to get term or entries
if !pr.RecentActive {
r.logger.Debugf("ignore sending snapshot to %x since it is not recently active", to)
return false
}
m.Type = pb.MsgSnap
snapshot, err := r.raftLog.snapshot()
if err != nil {
if err == ErrSnapshotTemporarilyUnavailable {
r.logger.Debugf("%x failed to send snapshot to %x because snapshot is temporarily unavailable", r.id, to)
return false
}
panic(err) // TODO(bdarnell)
}
if IsEmptySnap(snapshot) {
panic("need non-empty snapshot")
}
m.Snapshot = snapshot
sindex, sterm := snapshot.Metadata.Index, snapshot.Metadata.Term
r.logger.Debugf("%x [firstindex: %d, commit: %d] sent snapshot[index: %d, term: %d] to %x [%s]",
r.id, r.raftLog.firstIndex(), r.raftLog.committed, sindex, sterm, to, pr)
pr.BecomeSnapshot(sindex)
r.logger.Debugf("%x paused sending replication messages to %x [%s]", r.id, to, pr)
} else {
m.Type = pb.MsgApp
m.Index = pr.Next - 1
m.LogTerm = term
m.Entries = ents
m.Commit = r.raftLog.committed
if n := len(m.Entries); n != 0 {
switch pr.State {
// optimistically increase the next when in StateReplicate
case tracker.StateReplicate:
last := m.Entries[n-1].Index
pr.OptimisticUpdate(last)
pr.Inflights.Add(last)
case tracker.StateProbe:
pr.ProbeSent = true
default:
r.logger.Panicf("%x is sending append in unhandled state %s", r.id, pr.State)
}
}
}
r.send(m)
return true
}
The function above either falls back to sending a snapshot (MsgSnap) when the follower lags so far behind that the needed term or entries are no longer available, or sends a MsgApp carrying the missing entries, the previous index/term the follower is expected to have, and the leader's committed index.
We will skip the transport details and look next at how a follower handles the message once it arrives.