从etcd server的启动开始说起(一)

etcd server 学习

startEtcd函数

etcd.go 内的startEtcd 函数会根据配置,启动一个etcd server。

创建etcdserver (etcdserver.NewServer)

if e.Server, err = etcdserver.NewServer(srvcfg); err != nil {
		return e, err
}

在NewServer函数内,首先调用了bootstrap函数,返回bootstrappedServer结构。顾名思义,bootstrap函数根据配置创建了接下来server启动要用到的一系列对象:

b, err := bootstrap(cfg)

bootstrap

func bootstrap(cfg config.ServerConfig) (b *bootstrappedServer, err error) {
...
	haveWAL := wal.Exist(cfg.WALDir())
	ss := bootstrapSnapshot(cfg)
	backend, err := bootstrapBackend(cfg, haveWAL, st, ss)
	cluster, err := bootstrapCluster(cfg, bwal, prt)
	s, err := bootstrapStorage(cfg, st, backend, bwal, cluster)
	raft := bootstrapRaft(cfg, cluster, s.wal)
	return &bootstrappedServer{
		prt:     prt,
		ss:      ss,
		storage: s,
		cluster: cluster,
		raft:    raft,
	}, nil
...
}

根据函数名推断,这里面应该是根据配置,进行了一系列启动动作:

  • 初始化快照创建者(比如快照存放目录)
  • 初始化DB,获取操作db的句柄,创建bolt db里面的元数据bucket
  • 读取cluster的初始配置(比如集群内有哪些成员),并计算出自己的成员ID等
  • 如果wal文件之前不存在的话,创建wal文件,并获取文件句柄
  • 初始化raft 协议所需要的配置

bootstrapRaft函数

在本地还没有WAL(例如新建集群首次启动)的情况下,bootstrapRaft会调用bootstrapRaftFromCluster:
func bootstrapRaftFromCluster(cfg config.ServerConfig, cl *membership.RaftCluster, ids []types.ID, bwal *bootstrappedWAL) *bootstrappedRaft {
...
	s := bwal.MemoryStorage()
	return &bootstrappedRaft{
		lg:        cfg.Logger,
		heartbeat: time.Duration(cfg.TickMs) * time.Millisecond,
		config:    raftConfig(cfg, uint64(member.ID), s), //config.s
		peers:     peers,
		storage:   s,
	}
...
}

可以看到这里raftConfig 里面的storage是MemoryStorage,MemoryStorage结构比较简单,主要是一个内存数组:

// MemoryStorage implements the Storage interface backed by an
// in-memory array.
type MemoryStorage struct {
	// Protects access to all fields. Most methods of MemoryStorage are
	// run on the raft goroutine, but Append() is run on an application
	// goroutine.
	sync.Mutex

	hardState pb.HardState
	snapshot  pb.Snapshot
	// ents[i] has raft log position i+snapshot.Metadata.Index
	ents []pb.Entry
}

启动etcdserver

	e.Server.Start()

Start函数内主要包含一些成员变量和channel的初始化,如

s.w = wait.New()
s.applyWait = wait.NewTimeList()
s.done = make(chan struct{})
s.stop = make(chan struct{})
s.stopping = make(chan struct{}, 1)
s.ctx, s.cancel = context.WithCancel(context.Background())
s.readwaitc = make(chan struct{}, 1)
s.readNotifier = newNotifier()
s.leaderChanged = notify.NewNotifier()

一个正常的读写请求会首先在w上注册,该请求被etcd集群处理完成后,再通过w触发相应的通知。w本身是一个由多个map组成的list:请求按请求id取模,注册到list中对应下标的那个map上。
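
这种“先注册、后触发”的等待机制,可以用下面这个与etcd实现无关的最小示意来理解(这里用单个map代替etcd里按id取模分片的map list,waiter/Register/Trigger只是沿用其语义的假设命名):

package main

import (
	"fmt"
	"sync"
)

// waiter 是一个极简的注册/触发示意:调用方用唯一id注册得到一个channel,
// 处理方完成任务后用同一个id触发,把结果塞进对应的channel。
type waiter struct {
	mu sync.Mutex
	m  map[uint64]chan interface{}
}

func newWaiter() *waiter { return &waiter{m: make(map[uint64]chan interface{})} }

func (w *waiter) Register(id uint64) <-chan interface{} {
	w.mu.Lock()
	defer w.mu.Unlock()
	ch := make(chan interface{}, 1)
	w.m[id] = ch
	return ch
}

func (w *waiter) Trigger(id uint64, x interface{}) {
	w.mu.Lock()
	ch := w.m[id]
	delete(w.m, id)
	w.mu.Unlock()
	if ch != nil {
		ch <- x
		close(ch)
	}
}

func main() {
	w := newWaiter()
	ch := w.Register(1)
	go w.Trigger(1, "applied") // 模拟请求被apply完成后触发通知
	fmt.Println(<-ch)          // 输出: applied
}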

接下来Start函数会启动一个协程。

	// TODO: if this is an empty log, writes all peer infos
	// into the first entry
	go s.run()

该协程创建了一个scheduler,用于调度执行一些任务,比如apply已经committed的entries:

// asynchronously accept toApply packets, dispatch progress in-order
	sched := schedule.NewFIFOScheduler(lg)

接下来创建了一个raftReadyHandler,用于将raft的逻辑和server状态机的逻辑解耦开来,并启动了raftNode。

rh := &raftReadyHandler{
	getLead:    func() (lead uint64) { return s.getLead() },
	updateLead: func(lead uint64) { s.setLead(lead) },
	updateLeadership: func(newLeader bool) { ... },
}
s.r.start(rh)

raftNode 的协程for 循环

这里r.start 也为raftNode 启动了一个协程,这个协程是一个大的for 循环:

func (r *raftNode) start(rh *raftReadyHandler) {
	go func() {
		defer r.onStop()
		islead := false

		for {
			select {
			case <-r.ticker.C:
				r.tick()
			case rd := <-r.Ready():
				...
				if islead {
					// gofail: var raftBeforeLeaderSend struct{}
					r.transport.Send(r.processMessages(rd.Messages))
				}
				...
				r.Advance()
			case <-r.stopped:
				return
			}
		}
	}()
}

这里是raftNode在调用底层的raft库,实现raft协议与其他节点的交互。比如leader从Ready结构中拿到新产生的日志条目和待发送的消息,通过transport.Send发给follower,并通过Advance函数告知raft库本轮Ready已经处理完;follower节点也会拿到自己的Ready结构,把对收到消息的处理结果回复给leader。
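
raft库对使用方的要求,大致就是“取Ready → 持久化 → 发消息 → 应用已提交条目 → Advance”这样一个循环。下面是raft库doc.go里给出的典型用法(做了简化:n是raft.Node,saveToStorage、send、process以及ticker、done都是使用方自己的代码,这里仅作占位):

for {
	select {
	case <-ticker.C:
		n.Tick()
	case rd := <-n.Ready():
		// 1. 先把HardState、新日志条目和快照持久化
		saveToStorage(rd.HardState, rd.Entries, rd.Snapshot)
		// 2. 把消息发给集群内的其他节点
		send(rd.Messages)
		// 3. 把已提交的条目应用到状态机
		for _, entry := range rd.CommittedEntries {
			process(entry)
		}
		// 4. 告知raft库本轮Ready已处理完毕
		n.Advance()
	case <-done:
		return
	}
}

etcd的raftNode协程做的正是这套流程,只是把“持久化”、“发送”、“应用”分别对接到了wal/MemoryStorage、transport和etcdserver的apply流程上。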

etcdserver的select循环

最后就陷入到select的大循环当中,直到server退出:

for {
		select {
		case ap := <-s.r.apply():
			f := schedule.NewJob("server_applyAll", func(context.Context) { s.applyAll(&ep, &ap) })
			sched.Schedule(f)
		case leases := <-expiredLeaseC:
			s.revokeExpiredLeases(leases)
		case err := <-s.errorc:
			lg.Warn("server error", zap.Error(err))
			lg.Warn("data-dir used by this member must be removed")
			return
		case <-getSyncC():
			if s.v2store.HasTTLKeys() {
				s.sync(s.Cfg.ReqTimeout())
			}
		case <-s.stop:
			return
		}
	}

可以看到,一旦raft协议传递过来一些可以apply到状态机的entries,就会形成一个调度job,由调度器调用s.applyAll函数,应用这些已经committed的entries;过期的lease也在这个select内被处理。这里充分运用了go的channel特性:其他协程完成了某件任务,就向对应的channel发通知,这个select负责在各个channel上接收最新的通知。
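
sched这种“按序调度任务”的模式,本质上就是一个goroutine顺序消费任务channel。下面是一个与etcd实现无关的最小示意(fifoScheduler等命名均为假设):

package main

import "fmt"

// fifoScheduler 按提交顺序串行执行任务:一个channel存任务,一个goroutine顺序消费。
type fifoScheduler struct {
	jobs chan func()
	done chan struct{}
}

func newFIFOScheduler() *fifoScheduler {
	s := &fifoScheduler{jobs: make(chan func(), 128), done: make(chan struct{})}
	go func() {
		for job := range s.jobs {
			job() // 串行执行,保证了apply这类任务的顺序性
		}
		close(s.done)
	}()
	return s
}

func (s *fifoScheduler) Schedule(job func()) { s.jobs <- job }

func (s *fifoScheduler) Stop() {
	close(s.jobs)
	<-s.done
}

func main() {
	sched := newFIFOScheduler()
	for i := 1; i <= 3; i++ {
		i := i
		sched.Schedule(func() { fmt.Println("apply entry", i) })
	}
	sched.Stop() // 输出顺序固定为 1, 2, 3
}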

另外,在server start后,会给server绑定处理client请求的handlers。其中,sctx代表server监听client的listener,每配置一个client listen url,就会有一个sctx。
e.Server包含server的内部处理逻辑,如raft、storage、watcher等。在该server start之后,就会调用:

  • e.servePeers() 处理来自集群内部其他成员的消息
  • e.serveClients() 处理来自client 的消息
  • e.serveMetrics() 处理metrics请求的响应

e.serveClients函数

该函数首先会启动一个go 内建的http multiplexer(MUX),是一个将http 各种pattern 路径和handler绑定起来的多路复用对象。
接下来针对每一个sctx, 会调用其serve函数,启动grpc server与http server。
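
标准库ServeMux的作用就是把URL路径分发到对应的handler,基本用法示意如下(路径、返回内容和端口均为假设,仅演示pattern与handler的绑定):

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	mux := http.NewServeMux()
	// 把一个URL pattern绑定到一个handler上
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, `{"health":"true"}`)
	})
	log.Fatal(http.ListenAndServe(":8080", mux))
}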

启动grpc server

grpc server 的启动由v3rpc目录下grpc.go文件内的Server函数触发。该函数内初始化了unary(一元)拦截器链和stream(流)拦截器链:

chainUnaryInterceptors := []grpc.UnaryServerInterceptor{
    newLogUnaryInterceptor(s),
    newUnaryInterceptor(s),
    grpc_prometheus.UnaryServerInterceptor,
}
if interceptor != nil {
    chainUnaryInterceptors = append(chainUnaryInterceptors, interceptor)
}

chainStreamInterceptors := []grpc.StreamServerInterceptor{
    newStreamInterceptor(s),
    grpc_prometheus.StreamServerInterceptor,
}
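
所谓unary拦截器,就是一个在真正的handler前后插入逻辑的函数,其签名由grpc库约定。下面是一个最简单的日志拦截器示意(仅说明拦截器的形态,并非etcd中newLogUnaryInterceptor的实现):

package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
)

// loggingUnaryInterceptor 在调用真正的handler前后记录方法名、耗时和错误。
func loggingUnaryInterceptor(
	ctx context.Context,
	req interface{},
	info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler,
) (interface{}, error) {
	start := time.Now()
	resp, err := handler(ctx, req) // 调用链上的下一个拦截器,或最终的handler
	log.Printf("grpc method=%s cost=%s err=%v", info.FullMethod, time.Since(start), err)
	return resp, err
}

func main() {
	// 多个拦截器可以串成一条链,按顺序依次执行
	s := grpc.NewServer(grpc.ChainUnaryInterceptor(loggingUnaryInterceptor))
	_ = s // 这里省略注册service和启动监听的部分
}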

随后,这些拦截器会被加入gopts,再根据gopts创建grpcServer,并注册各类service:

grpcServer := grpc.NewServer(append(opts, gopts...)...)

pb.RegisterKVServer(grpcServer, NewQuotaKVServer(s))
pb.RegisterWatchServer(grpcServer, NewWatchServer(s))
pb.RegisterLeaseServer(grpcServer, NewQuotaLeaseServer(s))
pb.RegisterClusterServer(grpcServer, NewClusterServer(s))
pb.RegisterAuthServer(grpcServer, NewAuthServer(s))
pb.RegisterMaintenanceServer(grpcServer, NewMaintenanceServer(s))

创建grpc server后,会往上面注册相应的服务实现,如NewQuotaKVServer(s)。RegisterKVServer通过下面的service描述,把KV服务的各个method与对应的Handler绑定起来,用来处理client发来的特定请求:

var _KV_serviceDesc = grpc.ServiceDesc{
    ServiceName: "etcdserverpb.KV",
    HandlerType: (*KVServer)(nil),
    Methods: []grpc.MethodDesc{
        {
            MethodName: "Range",
            Handler:    _KV_Range_Handler,
        },
        {
            MethodName: "Put",
            Handler:    _KV_Put_Handler,
        },
        {
            MethodName: "DeleteRange",
            Handler:    _KV_DeleteRange_Handler,
        },
        {
            MethodName: "Txn",
            Handler:    _KV_Txn_Handler,
        },
        {
            MethodName: "Compact",
            Handler:    _KV_Compact_Handler,
        },
    },
    Streams:  []grpc.StreamDesc{},
    Metadata: "rpc.proto",
}

比如_KV_Range_Handler函数,就是把KVServer的Range方法包装成一个handler函数,作为参数传给interceptor链;interceptor链执行完之后,再调用这个handler。以_KV_Put_Handler为例,它内部创建并最终调用了如下函数:

handler := func(ctx context.Context, req interface{}) (interface{}, error) {
        return srv.(KVServer).Put(ctx, req.(*PutRequest))
}

这里srv是server的缩写,是前面的NewQuotaKVServer(s) 创建出来的对象,其中的Put 方法实现如下:

func (s *quotaKVServer) Put(ctx context.Context, r *pb.PutRequest) (*pb.PutResponse, error) {
    if err := s.qa.check(ctx, r); err != nil {
        return nil, err
    }
    return s.KVServer.Put(ctx, r)
}

即检查了是否满足quota后,再调用内层KVServer的Put方法,最终到达etcdserver.EtcdServer的Put。这个server就是前面startEtcd流程中创建出来的。

Put 函数的实现

其Put函数实现如下(可以全局搜索“s *EtcdServer) Put”):

func (s *EtcdServer) Put(ctx context.Context, r *pb.PutRequest) (*pb.PutResponse, error) {
    ctx = context.WithValue(ctx, traceutil.StartTimeKey, time.Now())
    resp, err := s.raftRequest(ctx, pb.InternalRaftRequest{Put: r})
    if err != nil {
        return nil, err
    }
    return resp.(*pb.PutResponse), nil
}

即调用raftRequest函数:

func (s *EtcdServer) raftRequest(ctx context.Context, r pb.InternalRaftRequest) (proto.Message, error) {
    return s.raftRequestOnce(ctx, r)
}

raftRequestOnce最终会调用processInternalRaftRequestOnce函数:

func (s *EtcdServer) processInternalRaftRequestOnce(ctx context.Context, r pb.InternalRaftRequest) (*apply2.Result, error) {
    ai := s.getAppliedIndex()
    ci := s.getCommittedIndex()
    if ci > ai+maxGapBetweenApplyAndCommitIndex {  // 避免已committed但尚未apply到状态机的条目积压过多
        return nil, errors.ErrTooManyRequests
    }

    r.Header = &pb.RequestHeader{   
        ID: s.reqIDGen.Next(),  //产生一个唯一的ID
    }

    // check authinfo if it is not InternalAuthenticateRequest
    if r.Authenticate == nil {
        authInfo, err := s.AuthInfoFromCtx(ctx)
        if err != nil {
            return nil, err
        }
        if authInfo != nil {
            r.Header.Username = authInfo.Username
            r.Header.AuthRevision = authInfo.Revision
        }
    }

    data, err := r.Marshal()
    if err != nil {
        return nil, err
    }

    if len(data) > int(s.Cfg.MaxRequestBytes) {
        return nil, errors.ErrRequestTooLarge
    }

    id := r.ID
    if id == 0 {
        id = r.Header.ID
    }
    ch := s.w.Register(id)  //使用唯一id关联一个channel,在该channel上进行等待。

    cctx, cancel := context.WithTimeout(ctx, s.Cfg.ReqTimeout())
    defer cancel()

    start := time.Now()
    err = s.r.Propose(cctx, data) //向raft模块提出一个提议
    if err != nil {
        proposalsFailed.Inc()
        s.w.Trigger(id, nil) // GC wait
        return nil, err
    }
    proposalsPending.Inc()
    defer proposalsPending.Dec()

    select {
    case x := <-ch:
        return x.(*apply2.Result), nil
    case <-cctx.Done():
        proposalsFailed.Inc()
        s.w.Trigger(id, nil) // GC wait
        return nil, s.parseProposeCtxErr(cctx.Err(), start)
    case <-s.done:
        return nil, errors.ErrStopped
    }
}

注意看s.r.Propose(cctx, data)这个调用。s的成员变量r来自:

        r:                     *b.raft.newRaftNode(b.ss, b.storage.wal.w, b.cluster.cl),

可以看到我们给newRaftNode传入了三个参数:第一个是Snapshotter,第二个是wal,第三个是记录集群成员关系的membership.RaftCluster。
newRaftNode函数由bootstrappedRaft类型的对象调用,返回一个raftNode对象的指针:

func (b *bootstrappedRaft) newRaftNode(ss *snap.Snapshotter, wal *wal.WAL, cl *membership.RaftCluster) *raftNode {
	var n raft.Node
	if len(b.peers) == 0 {
		n = raft.RestartNode(b.config)
	} else {
		n = raft.StartNode(b.config, b.peers)
	}
	raftStatusMu.Lock()
	raftStatus = n.Status
	raftStatusMu.Unlock()
	return newRaftNode(
		raftNodeConfig{
			lg:          b.lg,
			isIDRemoved: func(id uint64) bool { return cl.IsIDRemoved(types.ID(id)) },
			Node:        n,
			heartbeat:   b.heartbeat,
			raftStorage: b.storage,
			storage:     serverstorage.NewStorage(b.lg, wal, ss),
		},
	)
}

该函数首先创建了一个raft.Node 对象n, 这个对象包含了raft协议的底层实现。无论是StartNode还是RestartNode都会生成一个新的协程,用来跟raft协议进行交互:

	n := newNode(rn)
	go n.run()

接下来创建raftNodeConfig临时变量,其中包含Node n、heartbeat、raftStorage(MemoryStorage)、storage这几个主要的成员变量。在创建storage的时候传入了wal,说明该storage应该是个持久化的storage。所以raftNode实际上包含两类storage:

  1. raftStorage 基于内存
  2. storage 基于硬盘 (wal)

在newRaftNode函数内部,我们真正看到了raftNode对象的创建:

    r := &raftNode{
        lg:             cfg.lg,
        tickMu:         new(sync.Mutex),
        raftNodeConfig: cfg,
        // set up contention detectors for raft heartbeat message.
        // expect to send a heartbeat within 2 heartbeat intervals.
        td:         contention.NewTimeoutDetector(2 * cfg.heartbeat),
        readStateC: make(chan raft.ReadState, 1),
        msgSnapC:   make(chan raftpb.Message, maxInFlightMsgSnap),
        applyc:     make(chan toApply),
        stopped:    make(chan struct{}),
        done:       make(chan struct{}),
    }
    if r.heartbeat == 0 {
        r.ticker = &time.Ticker{}
    } else {
        r.ticker = time.NewTicker(r.heartbeat)
    }

这其中还根据心跳时间的配置初始化了一个定时器,r.ticker。应该是为了定时发心跳,并触发其他操作用的。
可是raftNode本身并没有实现Propose函数。注意raftNode结构体有一个只写了类型、没有字段名的匿名嵌入成员,即raftNodeConfig:

type raftNode struct {
    lg *zap.Logger

    tickMu *sync.Mutex
    raftNodeConfig
    //...
}

我们猜测raftNodeConfig类型实现了Propose方法,可惜也没有;再看raftNodeConfig包含的成员,同样有一个匿名嵌入的raft.Node:

type raftNodeConfig struct {
    lg *zap.Logger

    // to check if msg receiver is removed from cluster
    isIDRemoved func(id uint64) bool
    raft.Node
    //...
}
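
这里利用的是Go匿名嵌入字段带来的“方法提升”(method promotion):外层结构体可以直接调用内嵌类型(或内嵌接口的动态值)的方法。一个与etcd无关的最小示意:

package main

import "fmt"

type Proposer interface {
	Propose(data string) error
}

type memProposer struct{}

func (memProposer) Propose(data string) error {
	fmt.Println("propose:", data)
	return nil
}

// wrapper 匿名嵌入了Proposer接口,自己没有实现Propose,
// 但调用 wrapper.Propose 时会自动转发给嵌入的接口值。
type wrapper struct {
	Proposer
}

func main() {
	w := wrapper{Proposer: memProposer{}}
	_ = w.Propose("hello") // 实际执行的是 memProposer.Propose
}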

raft.Node是包含Propose方法的interface,因此我们需要找到实际赋给它的实现,也就是前面newRaftNode中的n变量。该变量由

n = raft.StartNode(b.config, b.peers)

创建,该函数实现如下:

func StartNode(c *Config, peers []Peer) Node {
    if len(peers) == 0 {
        panic("no peers given; use RestartNode instead")
    }
    rn, err := NewRawNode(c)
    if err != nil {
        panic(err)
    }
    err = rn.Bootstrap(peers)
    if err != nil {
        c.Logger.Warningf("error occurred during starting a new node: %v", err)
    }

    n := newNode(rn)

    go n.run()
    return &n
}

其中NewRawNode创建了rn变量(*RawNode类型),newNode函数再基于rn创建node类型的n变量并返回。所以Propose的实现在文件raft/node.go中,Propose函数实现如下:

func (n *node) Propose(ctx context.Context, data []byte) error {
    return n.stepWait(ctx, pb.Message{Type: pb.MsgProp, Entries: []pb.Entry{{Data: data}}})
}

stepWait函数接下来调用:

	return n.stepWithWaitOption(ctx, m, true)

其中第三个参数bool wait的值为true,表示同步等待。

// Step advances the state machine using msgs. The ctx.Err() will be returned,
// if any.
func (n *node) stepWithWaitOption(ctx context.Context, m pb.Message, wait bool) error {
	if m.Type != pb.MsgProp {
		select {
		case n.recvc <- m:
			return nil
		case <-ctx.Done():
			return ctx.Err()
		case <-n.done:
			return ErrStopped
		}
	}
	ch := n.propc
	pm := msgWithResult{m: m}
	if wait {
		pm.result = make(chan error, 1)
	}
	select {
	case ch <- pm:
		if !wait {
			return nil
		}
	case <-ctx.Done():
		return ctx.Err()
	case <-n.done:
		return ErrStopped
	}
	select {
	case err := <-pm.result:
		if err != nil {
			return err
		}
	case <-ctx.Done():
		return ctx.Err()
	case <-n.done:
		return ErrStopped
	}
	return nil
}

前面在构造pb.Message的时候,其类型被初始化为pb.MsgProp,所以可以直接跳过第一个select,看后面的逻辑。
该函数用参数创建了一个msgWithResult结构,塞给了n.propc。如果参数wait为false,塞入成功后就直接返回,即异步提交;当前wait为true,是同步操作,所以进入下一个select。这个select会等待任意一个case满足条件,也即:

  • pm.result channel被其他线程塞入了1条数据
  • context 进入done 状态 (可能是超过context deadline了)
  • 节点进入done 状态,应该是节点被关闭了。

所以,我们主要查看pm.result channel什么时候会被其他协程塞入数据。上一个select中,pm被塞进了n.propc,因此对pm的处理一定发生在从n.propc channel取数据的地方。
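
这种“消息自带一个result channel、由处理方回填结果”的写法,是Go里常见的请求-应答模式,可以先用一个与etcd无关的最小示意来理解:

package main

import "fmt"

type request struct {
	data   string
	result chan error // 处理方把处理结果写回这个channel
}

func worker(reqc <-chan request) {
	for req := range reqc {
		fmt.Println("handling:", req.data) // 模拟处理请求
		req.result <- nil                  // 处理完毕,通知提交方
		close(req.result)
	}
}

func main() {
	reqc := make(chan request)
	go worker(reqc)

	req := request{data: "put foo bar", result: make(chan error, 1)}
	reqc <- req         // 对应 n.propc <- pm
	err := <-req.result // 对应 <-pm.result
	fmt.Println("done, err =", err)
}

回到etcd源码,真正从n.propc取数据并回填pm.result的,就是下面这个run函数: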

func (n *node) run() {
	var propc chan msgWithResult
	var readyc chan Ready
	var advancec chan struct{}
	var rd Ready

	r := n.rn.raft

	lead := None

	for {
		if advancec != nil {
			readyc = nil
		} else if n.rn.HasReady() {
			// Populate a Ready. Note that this Ready is not guaranteed to
			// actually be handled. We will arm readyc, but there's no guarantee
			// that we will actually send on it. It's possible that we will
			// service another channel instead, loop around, and then populate
			// the Ready again. We could instead force the previous Ready to be
			// handled first, but it's generally good to emit larger Readys plus
			// it simplifies testing (by emitting less frequently and more
			// predictably).
			rd = n.rn.readyWithoutAccept()
			readyc = n.readyc
		}

		if lead != r.lead {
			if r.hasLeader() {
				if lead == None {
					r.logger.Infof("raft.node: %x elected leader %x at term %d", r.id, r.lead, r.Term)
				} else {
					r.logger.Infof("raft.node: %x changed leader from %x to %x at term %d", r.id, lead, r.lead, r.Term)
				}
				propc = n.propc
			} else {
				r.logger.Infof("raft.node: %x lost leader %x at term %d", r.id, lead, r.Term)
				propc = nil
			}
			lead = r.lead
		}

		select {
		// TODO: maybe buffer the config propose if there exists one (the way
		// described in raft dissertation)
		// Currently it is dropped in Step silently.
		case pm := <-propc:
			m := pm.m
			m.From = r.id
			err := r.Step(m)
			if pm.result != nil {
				pm.result <- err
				close(pm.result)
			}
		case m := <-n.recvc:
			// filter out response message from unknown From.
			if pr := r.prs.Progress[m.From]; pr != nil || !IsResponseMsg(m.Type) {
				r.Step(m)
			}
		case cc := <-n.confc:
			_, okBefore := r.prs.Progress[r.id]
			cs := r.applyConfChange(cc)
			// If the node was removed, block incoming proposals. Note that we
			// only do this if the node was in the config before. Nodes may be
			// a member of the group without knowing this (when they're catching
			// up on the log and don't have the latest config) and we don't want
			// to block the proposal channel in that case.
			//
			// NB: propc is reset when the leader changes, which, if we learn
			// about it, sort of implies that we got readded, maybe? This isn't
			// very sound and likely has bugs.
			if _, okAfter := r.prs.Progress[r.id]; okBefore && !okAfter {
				var found bool
				for _, sl := range [][]uint64{cs.Voters, cs.VotersOutgoing} {
					for _, id := range sl {
						if id == r.id {
							found = true
							break
						}
					}
					if found {
						break
					}
				}
				if !found {
					propc = nil
				}
			}
			select {
			case n.confstatec <- cs:
			case <-n.done:
			}
		case <-n.tickc:
			n.rn.Tick()
		case readyc <- rd:
			n.rn.acceptReady(rd)
			advancec = n.advancec
		case <-advancec:
			n.rn.Advance(rd)
			rd = Ready{}
			advancec = nil
		case c := <-n.status:
			c <- getStatus(r)
		case <-n.stop:
			close(n.done)
			return
		}
	}
}

全局搜索propc,可以看到该channel在node的run函数里被处理,而run函数在StartNode的时候就由一个专门的goroutine执行。由此可见,run函数基本上就是node的主循环。

node 的run函数

该函数由一个大的for循环构成,用于不停地处理各种消息。
系统刚启动时,变量advancec为nil,于是先看n.rn.HasReady()条件是否满足:

func (rn *RawNode) HasReady() bool {
	r := rn.raft
	if !r.softState().equal(rn.prevSoftSt) {
		return true
	}
	if hardSt := r.hardState(); !IsEmptyHardState(hardSt) && !isHardStateEqual(hardSt, rn.prevHardSt) {
		return true
	}
	if r.raftLog.hasPendingSnapshot() {
		return true
	}
	if len(r.msgs) > 0 || len(r.raftLog.unstableEntries()) > 0 || r.raftLog.hasNextEnts() {
		return true
	}
	if len(r.readStates) != 0 {
		return true
	}
	return false
}

该函数首先检查节点当前的softState是否跟之前的softState相等。softState是不需要持久化保存的状态,里面记录了集群当前的leader和当前节点所处的角色(leader、follower、candidate)。接下来检查hardState,hardState是需要持久化保存的状态,其中包含当前节点所处的Term、这一轮投票投给了谁(Vote),以及已提交的索引(Commit)。后面再看是否有:

  • 处于Pending状态的snapshot,
  • 尚未持久化的日志条目
  • 尚未应用到状态机的条目(hasNextEnts)
  • 供线性读请求使用的readStates.

以上任意条件满足,都会返回true,接下来便会填充出一个Ready结构供后面使用。在我们当前的场景里,这些条件其实都还不满足,因为之前的逻辑只是往propc塞了一个msgWithResult,还没有产生msgs和entries(可以猜测msgs是需要通过网络发送的消息,entries则是需要本地保存的日志),所以先跳过这里。
随后我们假设集群当前处于稳定状态,即跳过下面对leader变换的处理逻辑。现在就来到了run 函数的大select结构。
第一个case就是从propc channel中接收数据,于是我们Put请求中的数据作为参数,被传给了r.Step函数。

r.Step函数

这个Step函数是node内部嵌着的raft结构体(r := n.rn.raft)的成员函数,里面应该就是raft协议处理各类消息的具体逻辑。

func (r *raft) Step(m pb.Message) error {
	// Handle the message term, which may result in our stepping down to a follower.
	switch {
	case m.Term == 0:
		// local message
	case m.Term > r.Term:
		if m.Type == pb.MsgVote || m.Type == pb.MsgPreVote {
			force := bytes.Equal(m.Context, []byte(campaignTransfer))
			inLease := r.checkQuorum && r.lead != None && r.electionElapsed < r.electionTimeout
			if !force && inLease {
				// If a server receives a RequestVote request within the minimum election timeout
				// of hearing from a current leader, it does not update its term or grant its vote
				r.logger.Infof("%x [logterm: %d, index: %d, vote: %x] ignored %s from %x [logterm: %d, index: %d] at term %d: lease is not expired (remaining ticks: %d)",
					r.id, r.raftLog.lastTerm(), r.raftLog.lastIndex(), r.Vote, m.Type, m.From, m.LogTerm, m.Index, r.Term, r.electionTimeout-r.electionElapsed)
				return nil
			}
		}
		switch {
		case m.Type == pb.MsgPreVote:
			// Never change our term in response to a PreVote
		case m.Type == pb.MsgPreVoteResp && !m.Reject:
			// We send pre-vote requests with a term in our future. If the
			// pre-vote is granted, we will increment our term when we get a
			// quorum. If it is not, the term comes from the node that
			// rejected our vote so we should become a follower at the new
			// term.
		default:
			r.logger.Infof("%x [term: %d] received a %s message with higher term from %x [term: %d]",
				r.id, r.Term, m.Type, m.From, m.Term)
			if m.Type == pb.MsgApp || m.Type == pb.MsgHeartbeat || m.Type == pb.MsgSnap {
				r.becomeFollower(m.Term, m.From)
			} else {
				r.becomeFollower(m.Term, None)
			}
		}

	case m.Term < r.Term:
		if (r.checkQuorum || r.preVote) && (m.Type == pb.MsgHeartbeat || m.Type == pb.MsgApp) {
			// We have received messages from a leader at a lower term. It is possible
			// that these messages were simply delayed in the network, but this could
			// also mean that this node has advanced its term number during a network
			// partition, and it is now unable to either win an election or to rejoin
			// the majority on the old term. If checkQuorum is false, this will be
			// handled by incrementing term numbers in response to MsgVote with a
			// higher term, but if checkQuorum is true we may not advance the term on
			// MsgVote and must generate other messages to advance the term. The net
			// result of these two features is to minimize the disruption caused by
			// nodes that have been removed from the cluster's configuration: a
			// removed node will send MsgVotes (or MsgPreVotes) which will be ignored,
			// but it will not receive MsgApp or MsgHeartbeat, so it will not create
			// disruptive term increases, by notifying leader of this node's activeness.
			// The above comments also true for Pre-Vote
			//
			// When follower gets isolated, it soon starts an election ending
			// up with a higher term than leader, although it won't receive enough
			// votes to win the election. When it regains connectivity, this response
			// with "pb.MsgAppResp" of higher term would force leader to step down.
			// However, this disruption is inevitable to free this stuck node with
			// fresh election. This can be prevented with Pre-Vote phase.
			r.send(pb.Message{To: m.From, Type: pb.MsgAppResp})
		} else if m.Type == pb.MsgPreVote {
			// Before Pre-Vote enable, there may have candidate with higher term,
			// but less log. After update to Pre-Vote, the cluster may deadlock if
			// we drop messages with a lower term.
			r.logger.Infof("%x [logterm: %d, index: %d, vote: %x] rejected %s from %x [logterm: %d, index: %d] at term %d",
				r.id, r.raftLog.lastTerm(), r.raftLog.lastIndex(), r.Vote, m.Type, m.From, m.LogTerm, m.Index, r.Term)
			r.send(pb.Message{To: m.From, Term: r.Term, Type: pb.MsgPreVoteResp, Reject: true})
		} else {
			// ignore other cases
			r.logger.Infof("%x [term: %d] ignored a %s message with lower term from %x [term: %d]",
				r.id, r.Term, m.Type, m.From, m.Term)
		}
		return nil
	}

	switch m.Type {
	case pb.MsgHup:
		if r.preVote {
			r.hup(campaignPreElection)
		} else {
			r.hup(campaignElection)
		}

	case pb.MsgVote, pb.MsgPreVote:
		// We can vote if this is a repeat of a vote we've already cast...
		canVote := r.Vote == m.From ||
			// ...we haven't voted and we don't think there's a leader yet in this term...
			(r.Vote == None && r.lead == None) ||
			// ...or this is a PreVote for a future term...
			(m.Type == pb.MsgPreVote && m.Term > r.Term)
		// ...and we believe the candidate is up to date.
		if canVote && r.raftLog.isUpToDate(m.Index, m.LogTerm) {
			// Note: it turns out that that learners must be allowed to cast votes.
			// This seems counter- intuitive but is necessary in the situation in which
			// a learner has been promoted (i.e. is now a voter) but has not learned
			// about this yet.
			// For example, consider a group in which id=1 is a learner and id=2 and
			// id=3 are voters. A configuration change promoting 1 can be committed on
			// the quorum `{2,3}` without the config change being appended to the
			// learner's log. If the leader (say 2) fails, there are de facto two
			// voters remaining. Only 3 can win an election (due to its log containing
			// all committed entries), but to do so it will need 1 to vote. But 1
			// considers itself a learner and will continue to do so until 3 has
			// stepped up as leader, replicates the conf change to 1, and 1 applies it.
			// Ultimately, by receiving a request to vote, the learner realizes that
			// the candidate believes it to be a voter, and that it should act
			// accordingly. The candidate's config may be stale, too; but in that case
			// it won't win the election, at least in the absence of the bug discussed
			// in:
			// https://github.com/etcd-io/etcd/issues/7625#issuecomment-488798263.
			r.logger.Infof("%x [logterm: %d, index: %d, vote: %x] cast %s for %x [logterm: %d, index: %d] at term %d",
				r.id, r.raftLog.lastTerm(), r.raftLog.lastIndex(), r.Vote, m.Type, m.From, m.LogTerm, m.Index, r.Term)
			// When responding to Msg{Pre,}Vote messages we include the term
			// from the message, not the local term. To see why, consider the
			// case where a single node was previously partitioned away and
			// it's local term is now out of date. If we include the local term
			// (recall that for pre-votes we don't update the local term), the
			// (pre-)campaigning node on the other end will proceed to ignore
			// the message (it ignores all out of date messages).
			// The term in the original message and current local term are the
			// same in the case of regular votes, but different for pre-votes.
			r.send(pb.Message{To: m.From, Term: m.Term, Type: voteRespMsgType(m.Type)})
			if m.Type == pb.MsgVote {
				// Only record real votes.
				r.electionElapsed = 0
				r.Vote = m.From
			}
		} else {
			r.logger.Infof("%x [logterm: %d, index: %d, vote: %x] rejected %s from %x [logterm: %d, index: %d] at term %d",
				r.id, r.raftLog.lastTerm(), r.raftLog.lastIndex(), r.Vote, m.Type, m.From, m.LogTerm, m.Index, r.Term)
			r.send(pb.Message{To: m.From, Term: r.Term, Type: voteRespMsgType(m.Type), Reject: true})
		}

	default:
		err := r.step(r, m)
		if err != nil {
			return err
		}
	}
	return nil
}

前面在构造msg的时候,我们没有对其成员变量Term赋值,于是m.Term == 0,raft的Step函数认为处理的是本地消息;再加上msg的Type既不是MsgHup,也不是MsgVote/MsgPreVote,所以进入default分支:r.step(r, m)。
step是r的成员变量,其类型是函数类型(stepFunc),会在节点角色转换时被赋值。例如becomeLeader函数里,r.step被赋值为stepLeader:

func (r *raft) becomeLeader() {
	// TODO(xiangli) remove the panic when the raft implementation is stable
	if r.state == StateFollower {
		panic("invalid transition [follower -> leader]")
	}
	r.step = stepLeader
	r.reset(r.Term)
	r.tick = r.tickHeartbeat
	r.lead = r.id
	r.state = StateLeader
	// Followers enter replicate mode when they've been successfully probed
	// (perhaps after having received a snapshot as a result). The leader is
	// trivially in this state. Note that r.reset() has initialized this
	// progress with the last index already.
	r.prs.Progress[r.id].BecomeReplicate()

	// Conservatively set the pendingConfIndex to the last index in the
	// log. There may or may not be a pending config change, but it's
	// safe to delay any future proposals until we commit all our
	// pending log entries, and scanning the entire tail of the log
	// could be expensive.
	r.pendingConfIndex = r.raftLog.lastIndex()

	emptyEnt := pb.Entry{Data: nil}
	if !r.appendEntry(emptyEnt) {
		// This won't happen because we just called reset() above.
		r.logger.Panic("empty entry was dropped")
	}
	// As a special case, don't count the initial empty entry towards the
	// uncommitted log quota. This is because we want to preserve the
	// behavior of allowing one entry larger than quota if the current
	// usage is zero.
	r.reduceUncommittedSize([]pb.Entry{emptyEnt})
	r.logger.Infof("%x became leader at term %d", r.id, r.Term)
}

经过全局搜索可以发现,有几处地方对该函数变量进行了赋值。我们简单一点,假设当前节点是leader,因此r.step被赋值为stepLeader函数:

func stepLeader(r *raft, m pb.Message) error {
	// These message types do not require any progress for m.From.
	switch m.Type {
	case pb.MsgBeat:
		r.bcastHeartbeat()
		return nil
	case pb.MsgCheckQuorum:
		// The leader should always see itself as active. As a precaution, handle
		// the case in which the leader isn't in the configuration any more (for
		// example if it just removed itself).
		//
		// TODO(tbg): I added a TODO in removeNode, it doesn't seem that the
		// leader steps down when removing itself. I might be missing something.
		if pr := r.prs.Progress[r.id]; pr != nil {
			pr.RecentActive = true
		}
		if !r.prs.QuorumActive() {
			r.logger.Warningf("%x stepped down to follower since quorum is not active", r.id)
			r.becomeFollower(r.Term, None)
		}
		// Mark everyone (but ourselves) as inactive in preparation for the next
		// CheckQuorum.
		r.prs.Visit(func(id uint64, pr *tracker.Progress) {
			if id != r.id {
				pr.RecentActive = false
			}
		})
		return nil
	case pb.MsgProp:
		if len(m.Entries) == 0 {
			r.logger.Panicf("%x stepped empty MsgProp", r.id)
		}
		if r.prs.Progress[r.id] == nil {
			// If we are not currently a member of the range (i.e. this node
			// was removed from the configuration while serving as leader),
			// drop any new proposals.
			return ErrProposalDropped
		}
		if r.leadTransferee != None {
			r.logger.Debugf("%x [term %d] transfer leadership to %x is in progress; dropping proposal", r.id, r.Term, r.leadTransferee)
			return ErrProposalDropped
		}

		for i := range m.Entries {
			e := &m.Entries[i]
			var cc pb.ConfChangeI
			if e.Type == pb.EntryConfChange {
				var ccc pb.ConfChange
				if err := ccc.Unmarshal(e.Data); err != nil {
					panic(err)
				}
				cc = ccc
			} else if e.Type == pb.EntryConfChangeV2 {
				var ccc pb.ConfChangeV2
				if err := ccc.Unmarshal(e.Data); err != nil {
					panic(err)
				}
				cc = ccc
			}
			if cc != nil {
				alreadyPending := r.pendingConfIndex > r.raftLog.applied
				alreadyJoint := len(r.prs.Config.Voters[1]) > 0
				wantsLeaveJoint := len(cc.AsV2().Changes) == 0

				var refused string
				if alreadyPending {
					refused = fmt.Sprintf("possible unapplied conf change at index %d (applied to %d)", r.pendingConfIndex, r.raftLog.applied)
				} else if alreadyJoint && !wantsLeaveJoint {
					refused = "must transition out of joint config first"
				} else if !alreadyJoint && wantsLeaveJoint {
					refused = "not in joint state; refusing empty conf change"
				}

				if refused != "" {
					r.logger.Infof("%x ignoring conf change %v at config %s: %s", r.id, cc, r.prs.Config, refused)
					m.Entries[i] = pb.Entry{Type: pb.EntryNormal}
				} else {
					r.pendingConfIndex = r.raftLog.lastIndex() + uint64(i) + 1
				}
			}
		}

		if !r.appendEntry(m.Entries...) {
			return ErrProposalDropped
		}
		r.bcastAppend()
		return nil
	case pb.MsgReadIndex:
		// only one voting member (the leader) in the cluster
		if r.prs.IsSingleton() {
			if resp := r.responseToReadIndexReq(m, r.raftLog.committed); resp.To != None {
				r.send(resp)
			}
			return nil
		}

		// Postpone read only request when this leader has not committed
		// any log entry at its term.
		if !r.committedEntryInCurrentTerm() {
			r.pendingReadIndexMessages = append(r.pendingReadIndexMessages, m)
			return nil
		}

		sendMsgReadIndexResponse(r, m)

		return nil
	}
	...
}

stepLeader函数内包含了对各种消息类型的处理,我们主要看MsgProp这一类消息。处理这类消息,首先是一些异常case的判断,然后遍历消息里的每条Entry(可见一条消息可以包含若干条Entry),查看它是否是confChange类型;我们的Put请求显然不属于这一类型,于是直接进入r.appendEntry的逻辑:

func (r *raft) appendEntry(es ...pb.Entry) (accepted bool) {
	li := r.raftLog.lastIndex()
	for i := range es {
		es[i].Term = r.Term
		es[i].Index = li + 1 + uint64(i)
	}
	// Track the size of this uncommitted proposal.
	if !r.increaseUncommittedSize(es) {
		r.logger.Debugf(
			"%x appending new entries to log would exceed uncommitted entry size limit; dropping proposal",
			r.id,
		)
		// Drop the proposal.
		return false
	}
	// use latest "last" index after truncate/append
	li = r.raftLog.append(es...)
	r.prs.Progress[r.id].MaybeUpdate(li)
	// Regardless of maybeCommit's return, our caller will call bcastAppend.
	r.maybeCommit()
	return true
}

该函数首先将参数里每一条Entry的Term赋值为当前leader的Term,Index则在lastIndex的基础上依次自增,这正是raft追加日志的设计。接下来增加uncommittedSize,随后往raftLog里面append这些entries。注意这里的raftLog其实是基于内存实现的。
该raftLog的初始化逻辑如下:
NewRawNode -> newRaft -> raftlog := newLogWithSize(c.Storage, c.Logger, c.MaxCommittedSizePerReady)
所以我们主要看c.Storage是什么存储。
而NewRawNode由前面提到的 func (b *bootstrappedRaft) newRaftNode(ss *snap.Snapshotter, wal *wal.WAL, cl *membership.RaftCluster) *raftNode 调用,其中传入的b.config就是NewRawNode的参数c。这个bootstrappedRaft类型的对象b,由bootstrap流程中的 raft := bootstrapRaft(cfg, cluster, s.wal) 创建;因为我们的etcd是以新建raft集群的方式启动的,所以会走到下面这个函数:

func bootstrapRaftFromCluster(cfg config.ServerConfig, cl *membership.RaftCluster, ids []types.ID, bwal *bootstrappedWAL) *bootstrappedRaft {
	member := cl.MemberByName(cfg.Name)
	peers := make([]raft.Peer, len(ids))
	for i, id := range ids {
		var ctx []byte
		ctx, err := json.Marshal((*cl).Member(id))
		if err != nil {
			cfg.Logger.Panic("failed to marshal member", zap.Error(err))
		}
		peers[i] = raft.Peer{ID: uint64(id), Context: ctx}
	}
	cfg.Logger.Info(
		"starting local member",
		zap.String("local-member-id", member.ID.String()),
		zap.String("cluster-id", cl.ID().String()),
	)
	s := bwal.MemoryStorage()
	return &bootstrappedRaft{
		lg:        cfg.Logger,
		heartbeat: time.Duration(cfg.TickMs) * time.Millisecond,
		config:    raftConfig(cfg, uint64(member.ID), s), //config.s
		peers:     peers,
		storage:   s,
	}
}

从该函数内的 s := bwal.MemoryStorage() 可以看出,bootstrappedRaft里config所用的Storage实为MemoryStorage。而bwal.MemoryStorage()会先应用wal对应的快照、设置HardState,再把尚未写进快照的wal(write ahead log)entries追加到MemoryStorage的ents数组中。可以理解为:把硬盘上wal里的数据恢复到内存的raft日志存储中,注意这部分数据是已经持久化过的。

func (wal *bootstrappedWAL) MemoryStorage() *raft.MemoryStorage {
	s := raft.NewMemoryStorage()
	if wal.snapshot != nil {
		s.ApplySnapshot(*wal.snapshot)
	}
	if wal.st != nil {
		s.SetHardState(*wal.st)
	}
	if len(wal.ents) != 0 {
		s.Append(wal.ents)
	}
	return s
}

我们现在回过头来看raftLog结构体:

type raftLog struct {
	// storage contains all stable entries since the last snapshot.
	storage Storage

	// unstable contains all unstable entries and snapshot.
	// they will be saved into storage.
	unstable unstable

	// committed is the highest log position that is known to be in
	// stable storage on a quorum of nodes.
	committed uint64
	// applied is the highest log position that the application has
	// been instructed to apply to its state machine.
	// Invariant: applied <= committed
	applied uint64

	logger Logger

	// maxNextEntsSize is the maximum number aggregate byte size of the messages
	// returned from calls to nextEnts.
	maxNextEntsSize uint64
}

该raftLog由newLogWithSize函数创建:

// newLogWithSize returns a log using the given storage and max
// message size.
func newLogWithSize(storage Storage, logger Logger, maxNextEntsSize uint64) *raftLog {
	if storage == nil {
		log.Panic("storage must not be nil")
	}
	log := &raftLog{
		storage:         storage,
		logger:          logger,
		maxNextEntsSize: maxNextEntsSize,
	}
	firstIndex, err := storage.FirstIndex()
	if err != nil {
		panic(err) // TODO(bdarnell)
	}
	lastIndex, err := storage.LastIndex()
	if err != nil {
		panic(err) // TODO(bdarnell)
	}
	log.unstable.offset = lastIndex + 1
	log.unstable.logger = logger
	// Initialize our committed and applied pointers to the time of the last compaction.
	log.committed = firstIndex - 1
	log.applied = firstIndex - 1

	return log
}

该函数对成员unstable进行了初始化。unstable用一个slice记录尚未持久化的entries,其offset加上slice下标才是该entry在raft日志中的index;从代码可以看到offset被初始化为lastIndex + 1,即下一条日志的index。lastIndex的计算方法如下:

func (ms *MemoryStorage) lastIndex() uint64 {
	return ms.ents[0].Index + uint64(len(ms.ents)) - 1
}

而FirstIndex的计算方法如下:

func (ms *MemoryStorage) firstIndex() uint64 {
	return ms.ents[0].Index + 1
}

而log.committed和log.applied都被初始化为firstIndex - 1。由此可以推断:快照(snapshot)中保存的是集群公认的、已经committed的状态;wal中保存的则是已经持久化、但不一定committed的条目。
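
用一组具体的数字验证上面的关系(假设快照做到了index=10):此时MemoryStorage的ents[0]是一个代表快照位置的哑条目(Index=10),后面跟着从wal恢复出来的3条日志11、12、13,即len(ents)=4。于是firstIndex = 10+1 = 11,lastIndex = 10+4-1 = 13,unstable.offset被初始化为14(也就是下一条尚未持久化日志的index),而committed和applied都被初始化为firstIndex-1 = 10,正好对应快照所代表的位置。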

之前的r.raftLog.append函数,实则调用的是raftLog成员unstable的truncateAndAppend(ents)函数:

func (u *unstable) truncateAndAppend(ents []pb.Entry) {
	after := ents[0].Index
	switch {
	case after == u.offset+uint64(len(u.entries)):
		// after is the next index in the u.entries
		// directly append
		u.entries = append(u.entries, ents...)
	case after <= u.offset:
		u.logger.Infof("replace the unstable entries from index %d", after)
		// The log is being truncated to before our current offset
		// portion, so set the offset and replace the entries
		u.offset = after
		u.entries = ents
	default:
		// truncate to after and copy to u.entries
		// then append
		u.logger.Infof("truncate the unstable entries before index %d", after)
		u.entries = append([]pb.Entry{}, u.slice(u.offset, after)...)
		u.entries = append(u.entries, ents...)
	}
}

可以看出unstable里面的entries实为slice结构,从下标0开始保存当前尚未持久化的entry;u.offset加上entries slice的下标,才是raft日志中真正的index。这三个case分别表示:

  • 即将追加的第一条entry的index,恰好等于unstable中最后一条entry的下一个index(after == offset + len(entries)),直接追加到entries后面即可
  • 即将追加的entry的index小于等于unstable中最老的index(u.offset),说明当前节点unstable里的日志需要整体回退、被覆盖:直接用新的ents替换entries,并把offset改为after
  • 即将追加的entry的index落在unstable现有区间的中间,说明存在重叠:保留after之前的条目,截断after及其之后的部分,再追加新的ents
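
用一组具体的数字走一遍这三个case(假设u.offset=5,u.entries当前覆盖index 5~8):若新来的ents从index 9开始,恰好等于offset+len(entries),命中第一个case,直接追加;若从index 4开始,命中第二个case,offset被改为4,entries整体被替换;若从index 7开始,命中第三个case,保留index 5、6两条,截掉7、8之后再追加新的ents。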

raftLog的append函数最后返回的是当前最新的index:优先取unstable内最新的index,如果unstable内的entries为空,则返回storage里面的最新index。可以理解为storage里的条目都已经持久化过,这一点从前面的MemoryStorage函数也能看出来:初始的ents都是从wal内读出来的。

func (l *raftLog) append(ents ...pb.Entry) uint64 {
	if len(ents) == 0 {
		return l.lastIndex()
	}
	if after := ents[0].Index - 1; after < l.committed {
		l.logger.Panicf("after(%d) is out of range [committed(%d)]", after, l.committed)
	}
	l.unstable.truncateAndAppend(ents)
	return l.lastIndex()
}

append执行完成后,来到 r.prs.Progress[r.id].MaybeUpdate(li)。先来看看prs是怎么被初始化的:prs是一个ProgressTracker类型的变量。

// MakeProgressTracker initializes a ProgressTracker.
func MakeProgressTracker(maxInflight int) ProgressTracker {
	p := ProgressTracker{
		MaxInflight: maxInflight,
		Config: Config{
			Voters: quorum.JointConfig{
				quorum.MajorityConfig{},
				nil, // only populated when used
			},
			Learners:     nil, // only populated when used
			LearnersNext: nil, // only populated when used
		},
		Votes:    map[uint64]bool{},
		Progress: map[uint64]*Progress{},
	}
	return p
}

可以看到其包含两个主要的成员变量, Votes map和 Progress map。而Progress按照注释描述,是leader用来记录每个follower当前raftLog的状态的。map的key即为raft的节点id。

// Progress represents a follower’s progress in the view of the leader. Leader
// maintains progresses of all followers, and sends entries to the follower
// based on its progress.
//
// NB(tbg): Progress is basically a state machine whose transitions are mostly
// strewn around `*raft.raft`. Additionally, some fields are only used when in a
// certain State. All of this isn't ideal.
type Progress struct {
	Match, Next uint64
	//...
}

可以看到Progress里面包含变量Match,表示leader视角下该节点raftLog已匹配到的最新index,即MatchIndex。调用MaybeUpdate会更新当前节点(leader自己)的Match,供后面leader统计各节点的index情况时使用。
接下来就是调用r.maybeCommit()函数了。

// maybeCommit attempts to advance the commit index. Returns true if
// the commit index changed (in which case the caller should call
// r.bcastAppend).
func (r *raft) maybeCommit() bool {
	mci := r.prs.Committed()
	return r.raftLog.maybeCommit(mci, r.Term)
}

prs.Committed会收集各voter的MatchIndex,内部通过插入排序将这些index排好序,取“已被过半数voter匹配到”的最大index返回。接下来raftLog.maybeCommit(mci, r.Term)拿到这个返回值,判断是否需要推进raftLog的committed值。
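
举个例子(假设集群有5个voter):若各节点的Match分别为{9, 8, 7, 5, 3},排序后取“至少被过半数(3个)节点匹配到”的最大index,即7(持有index≥7的节点有3个),于是mci=7;raftLog.maybeCommit再确认index 7这条日志属于当前Term后,才会把committed推进到7。
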
不过在这条路径上,r.maybeCommit的返回值并不重要,按照注释,调用方接下来都会调用bcastAppend函数:作为leader,给每个follower发去它们所缺的entries:

// maybeSendAppend sends an append RPC with new entries to the given peer,
// if necessary. Returns true if a message was sent. The sendIfEmpty
// argument controls whether messages with no entries will be sent
// ("empty" messages are useful to convey updated Commit indexes, but
// are undesirable when we're sending multiple messages in a batch).
func (r *raft) maybeSendAppend(to uint64, sendIfEmpty bool) bool {
	pr := r.prs.Progress[to]
	if pr.IsPaused() {
		return false
	}
	m := pb.Message{}
	m.To = to

	term, errt := r.raftLog.term(pr.Next - 1)
	ents, erre := r.raftLog.entries(pr.Next, r.maxMsgSize)
	if len(ents) == 0 && !sendIfEmpty {
		return false
	}

	if errt != nil || erre != nil { // send snapshot if we failed to get term or entries
		if !pr.RecentActive {
			r.logger.Debugf("ignore sending snapshot to %x since it is not recently active", to)
			return false
		}

		m.Type = pb.MsgSnap
		snapshot, err := r.raftLog.snapshot()
		if err != nil {
			if err == ErrSnapshotTemporarilyUnavailable {
				r.logger.Debugf("%x failed to send snapshot to %x because snapshot is temporarily unavailable", r.id, to)
				return false
			}
			panic(err) // TODO(bdarnell)
		}
		if IsEmptySnap(snapshot) {
			panic("need non-empty snapshot")
		}
		m.Snapshot = snapshot
		sindex, sterm := snapshot.Metadata.Index, snapshot.Metadata.Term
		r.logger.Debugf("%x [firstindex: %d, commit: %d] sent snapshot[index: %d, term: %d] to %x [%s]",
			r.id, r.raftLog.firstIndex(), r.raftLog.committed, sindex, sterm, to, pr)
		pr.BecomeSnapshot(sindex)
		r.logger.Debugf("%x paused sending replication messages to %x [%s]", r.id, to, pr)
	} else {
		m.Type = pb.MsgApp
		m.Index = pr.Next - 1
		m.LogTerm = term
		m.Entries = ents
		m.Commit = r.raftLog.committed
		if n := len(m.Entries); n != 0 {
			switch pr.State {
			// optimistically increase the next when in StateReplicate
			case tracker.StateReplicate:
				last := m.Entries[n-1].Index
				pr.OptimisticUpdate(last)
				pr.Inflights.Add(last)
			case tracker.StateProbe:
				pr.ProbeSent = true
			default:
				r.logger.Panicf("%x is sending append in unhandled state %s", r.id, pr.State)
			}
		}
	}
	r.send(m)
	return true
}

上面的函数:

  1. 根据该follower的进度(pr.Next),取到pr.Next-1处日志对应的term,并从pr.Next开始取leader节点的entries,总大小不超过r.maxMsgSize
  2. 构建Type为MsgApp的Message,装上这些entries,以及leader已经committed的index、LogTerm等字段。
  3. 更新pr.Next 和 pr.Inflights
  4. 将Message发送给从节点
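
举个数字例子(假设):若leader记录某follower的pr.Next=12,那么m.Index=11、m.LogTerm为index 11那条日志的Term,m.Entries从index 12开始;follower收到后会先检查自己index 11处日志的Term是否与m.LogTerm一致,一致才接受追加,这正是raft日志一致性检查的体现。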

下面我们直接看从节点收到消息后的处理,跳过发送逻辑。
