Reposted from the independent blog https://simplexity.cn/2019/03/28/consensus-pbft-paper/#more
Our algorithm can be used to implement any deterministic replicated service with a state and some operations.
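If there are `f` malicious nodes, at least `3f+1` nodes are needed to provide safety and liveness (proved in the paper Asynchronous Consensus and Broadcast Protocols). With `3f+1` nodes, the client is guaranteed to eventually receive a reply, provided it keeps retransmitting and the delivery time to the destination does not grow without bound. The algorithm does not address the problem of fault-tolerant privacy: a faulty replica may leak information to an attacker.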
`p = v mod |R|`
(`p` is the id of the primary node, `v` is the current view number, and `|R| = 3f+1`)
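The rotation rule `p = v mod |R|` can be sketched in a few lines. The function name `primary_for_view` is made up for this illustration; only the formula itself comes from the text.

```python
def primary_for_view(view: int, f: int) -> int:
    """Return the id of the primary for `view`, assuming |R| = 3f + 1 replicas
    with ids 0 .. |R|-1 (the id numbering is an assumption of this sketch)."""
    if view < 0 or f < 0:
        raise ValueError("view and f must be non-negative")
    num_replicas = 3 * f + 1  # |R|
    return view % num_replicas


# With f = 1 there are 4 replicas, so the primary rotates through ids 0..3.
print([primary_for_view(v, f=1) for v in range(6)])  # [0, 1, 2, 3, 0, 1]
```

Each view change increments `v`, so the primary role moves to the next replica id and a faulty primary cannot stay in charge indefinitely.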
- A client sends a request to invoke a service operation to the primary
- The primary multicasts the request to the backups
- Replicas execute the request and send a reply to the client
- The client waits for f + 1 replies from different replicas with the same result; this is the result of the operation.
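Client `c` sends `<REQUEST, o, t, c>$c` point-to-point to the primary, and the primary then multicasts it to all backups. `o` is the operation, `t` is a timestamp (for example the local machine time) used to ensure exactly-once semantics, and `<>$c` denotes that the message is signed by `c` (likewise below). Replica `i` answers with the corresponding reply `<REPLY, v, t, c, i, r>$i`. `v` is the current view number (a view change may have happened) and `r` is the execution result; each replica signs its own reply. The client can track the current view from the view number in the replies and so learn which replica is the current primary. Once the client has received f+1 valid replies (signatures verified) from different replicas with the same result, the result is valid and the request is complete.
If the client cannot collect f+1 valid results within the timeout, it broadcasts the request to all replicas. If the request has already been processed, a replica simply re-sends its reply to the client. If not, a backup redirects the request to the primary. If the primary then fails to multicast the request, it is suspected to be faulty, which eventually triggers a view change.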
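The f+1 rule on the client side can be sketched as below. The tuple layout of a reply and the function name `accepted_result` are assumptions for illustration, and signature checking is assumed to have happened before replies reach this function.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple

# (replica_id, view, timestamp, result); a simplified stand-in for REPLY.
Reply = Tuple[int, int, int, str]


def accepted_result(replies: Iterable[Reply], timestamp: int, f: int) -> Optional[str]:
    """Return the result reported by f+1 different replicas, or None if the
    client has to keep waiting (or retransmit after its timeout)."""
    voters: Dict[str, Set[int]] = defaultdict(set)  # result -> replica ids
    for replica_id, _view, t, result in replies:
        if t != timestamp:
            continue  # reply belongs to a different request
        voters[result].add(replica_id)  # a set ignores duplicate replies
        if len(voters[result]) >= f + 1:
            return result
    return None
```

With at most f faulty replicas, f+1 matching replies always include at least one from a non-faulty replica, which is why this count is enough.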
State recorded by a replica
The state of each replica includes the state of the service, a message log containing messages the replica has accepted, and an integer denoting the replica’s current view.
The whole flow has three phases: pre-prepare, prepare, and commit. Pre-prepare and prepare keep requests ordered within a single view; prepare and commit keep requests ordered across views.
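pre-prepare phase
The primary multicasts `<<PRE-PREPARE, v, n, d>$p, m>` to all backups. `n` is the sequence number the primary assigns to the message, which increases globally; `m` is the client's request message; `d` is the digest of the message. The pre-prepare message is then signed.
A backup accepts the pre-prepare message and enters the prepare phase if the following conditions hold (The last condition prevents a faulty primary from exhausting the space of sequence numbers by selecting a very large one.):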
* the signatures in the request and the pre-prepare message are correct and `d` is the digest for `m`;
* it is in view `v`;
* it has not accepted a pre-prepare message for view `v` and sequence number `n` containing a different digest;
* the sequence number in the pre-prepare message is between a low water mark, `h`, and a high water mark, `H`.
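The four acceptance conditions map directly onto a small check. This is only a sketch: the class and field names are invented here, `signature_ok` stands in for real signature verification, SHA-256 is an arbitrary choice of digest, and the range is treated as `h < n <= H`, which is an assumption since the text only says "between".

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Tuple


def digest(message: bytes) -> str:
    return hashlib.sha256(message).hexdigest()


@dataclass
class PrePrepare:
    view: int
    seq: int
    digest: str
    request: bytes
    signature_ok: bool = True  # placeholder for verifying both signatures


@dataclass
class Backup:
    view: int
    low: int   # low water mark h
    high: int  # high water mark H
    accepted: Dict[Tuple[int, int], str] = field(default_factory=dict)

    def accept_pre_prepare(self, msg: PrePrepare) -> bool:
        # 1. signatures are correct and d is the digest of m
        if not msg.signature_ok or digest(msg.request) != msg.digest:
            return False
        # 2. the backup is in view v
        if msg.view != self.view:
            return False
        # 3. no different digest was accepted for the same (v, n)
        previous = self.accepted.get((msg.view, msg.seq))
        if previous is not None and previous != msg.digest:
            return False
        # 4. the sequence number lies within the water marks
        if not (self.low < msg.seq <= self.high):
            return False
        self.accepted[(msg.view, msg.seq)] = msg.digest
        return True
```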
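prepare phase
Backup `i` multicasts a `<PREPARE, v, n, d, i>$i` message to all replicas (the primary and the backups), where `i` is the id of that node, and saves both the `PRE-PREPARE` and the `PREPARE` messages to its local log. When a replica receives a prepare message, it checks that the signature is correct, that `v` is its current view, that `n` is between the water marks `h` and `H`, and that the digest `d` equals the `d` in the `PRE-PREPARE` message saved locally. Once replica `i` has received `2f` such `PREPARE` messages from different replicas, it reaches the `prepared(m,v,n,i)` state and enters the commit phase.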
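A minimal sketch of the `prepared` predicate follows. The `PrepareLog` class is hypothetical, messages are reduced to `(view, seq, digest)` keys, and validation (signature, view, water marks) is assumed to have been done before a message is logged.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

Key = Tuple[int, int, str]  # (view, sequence number, digest)


@dataclass
class PrepareLog:
    f: int
    pre_prepared: Set[Key] = field(default_factory=set)
    prepares: Dict[Key, Set[int]] = field(default_factory=dict)

    def add_pre_prepare(self, view: int, seq: int, digest: str) -> None:
        self.pre_prepared.add((view, seq, digest))

    def add_prepare(self, view: int, seq: int, digest: str, sender: int) -> None:
        # Senders are kept in a set so a repeated PREPARE counts only once.
        self.prepares.setdefault((view, seq, digest), set()).add(sender)

    def prepared(self, view: int, seq: int, digest: str) -> bool:
        """True once the log holds the pre-prepare and 2f matching prepares."""
        key = (view, seq, digest)
        return key in self.pre_prepared and len(self.prepares.get(key, set())) >= 2 * self.f
```

Together with the pre-prepare from the primary, the 2f prepares mean 2f+1 replicas agreed on the same `(v, n, d)`.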
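commit phase
Replica `i` multicasts a `<COMMIT, v, n, D(m), i>$i` message to all replicas. A replica that receives a commit validates it and, if it is valid, writes it to its log. If replica `i` has reached the `prepared(m,v,n,i)` state and has received `2f+1` validated commits (its own commit included) whose `v`, `n`, and digest `d` are identical and match the prepare recorded in its log, it reaches the `committed-local(m,v,n,i)` state. Once its current state also reflects the in-order execution of all requests with sequence numbers lower than `n`, the replica executes the operation specified in `m` and returns the result to the client.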
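The two conditions for execution (`committed-local` plus all lower sequence numbers already executed) can be sketched as follows. The `Executor` class and its `apply` callback are hypothetical; the sketch stays within one view and assumes commits were already checked against the logged prepare.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Executor:
    f: int
    apply: Callable[[str], str]  # hypothetical state-machine step
    last_executed: int = 0       # highest sequence number executed so far
    prepared: Dict[int, str] = field(default_factory=dict)      # seq -> operation
    commits: Dict[int, Set[int]] = field(default_factory=dict)  # seq -> senders

    def mark_prepared(self, seq: int, operation: str) -> None:
        self.prepared[seq] = operation

    def add_commit(self, seq: int, sender: int) -> None:
        self.commits.setdefault(seq, set()).add(sender)

    def committed_local(self, seq: int) -> bool:
        return seq in self.prepared and len(self.commits.get(seq, ())) >= 2 * self.f + 1

    def execute_ready(self) -> List[str]:
        """Execute requests strictly in sequence order; stop at the first gap."""
        results = []
        while self.committed_local(self.last_executed + 1):
            self.last_executed += 1
            results.append(self.apply(self.prepared[self.last_executed]))
        return results
```

A request that commits out of order simply waits: `execute_ready` returns nothing for sequence number 2 until sequence number 1 has been executed.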
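Replicas periodically snapshot their local state; such a snapshot is called a checkpoint. When replica `i` produces a checkpoint, it multicasts a `<CHECKPOINT, n, d, i>$i` message to all other replicas, where `n` is the sequence number of the last request whose execution is reflected in the state and `d` is the digest of the state. Replicas receive checkpoint messages and record them in their logs. Once a replica has collected `2f+1` checkpoint messages with the same `n` and `d`, the checkpoint becomes a stable checkpoint. At that point all messages (pre-prepare, prepare, commit) with sequence numbers less than or equal to `n` are discarded, along with earlier stable checkpoints.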
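In practice, a replica holds several copies of the state at once: one stable checkpoint, several checkpoints that are not yet stable (produced but still short of 2f+1 checkpoint messages for that state), and the current state. The last sequence number `n` of the stable checkpoint is the low water mark `h`, and the high water mark is `H = h + k` (where `k` is big enough so that replicas do not stall waiting for a checkpoint to become stable).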
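How a stable checkpoint moves the water marks can be sketched as below. `CheckpointTracker` is a made-up name, the value of `k` is chosen by the caller, and actual log truncation is reduced to dropping vote records.

```python
from collections import defaultdict


class CheckpointTracker:
    def __init__(self, f: int, k: int):
        self.f = f
        self.k = k    # window size, so that H = h + k
        self.low = 0  # low water mark h
        self.votes = defaultdict(set)  # (n, d) -> ids of senders

    @property
    def high(self) -> int:
        return self.low + self.k

    def add_checkpoint(self, n: int, d: str, sender: int) -> bool:
        """Record a CHECKPOINT message; return True if (n, d) just became stable."""
        if n <= self.low:
            return False  # already covered by the current stable checkpoint
        self.votes[(n, d)].add(sender)
        if len(self.votes[(n, d)]) < 2 * self.f + 1:
            return False
        self.low = n  # the stable checkpoint defines the new low water mark
        # Discard bookkeeping for sequence numbers <= n.
        self.votes = defaultdict(
            set, {key: ids for key, ids in self.votes.items() if key[0] > n}
        )
        return True
```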
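If the client cannot collect f+1 valid results within the timeout, it broadcasts the request to all replicas. A backup that receives a request it has never executed forwards it to the primary and starts a timer. If the timer expires before the backup receives a pre-prepare from the primary, the backup judges the primary to be faulty and sends a view-change message. Backup `i` multicasts `<VIEW-CHANGE, v+1, n, C, P, i>$i`, where `v+1` is the new view number, `n` is the last sequence number of the backup's stable checkpoint `s`, `C` is the set of 2f+1 checkpoint messages that prove `s`, and `P` is a set of sets `Pm`. There is one `Pm` for each message `m` with a sequence number greater than `n` that reached the `prepared(m,v,n,i)` state; it contains the pre-prepare message and 2f matching prepare messages from other replicas.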
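When the primary of view `v+1` (computed as `p = (v+1) mod |R|`) has received 2f valid view-change messages, it multicasts `<NEW-VIEW, v+1, V, O>$p`. `V` is the set of view-change messages it received plus its own, and `O` is a set of pre-prepare messages. `O` is computed as follows, which is somewhat involved: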
1. The primary determines the sequence number `min-s` of the latest stable checkpoint in `V` and the highest sequence number `max-s` in a prepare message in `V`.
2. The primary creates a new pre-prepare message for view `v+1` for each sequence number `n` between `min-s` and `max-s`.
There are two cases: (1) there is at least one set in the `P` component of some view-change message in `V` with sequence number `n`, or (2) there is no such set.
In the first case, the primary creates a new message `<PRE-PREPARE, v+1, n, d>$p` where `d` is the request digest in the pre-prepare message for sequence number `n` with the highest view number in `V`.
In the second case, it creates a new pre-prepare message `<PRE-PREPARE, v+1, n, d(null)>$p` where `d(null)` is the digest of a special null request; a null request goes through the protocol like other requests, but its execution is a no-op. (Paxos [18] used a similar technique to fill in gaps.)
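The two steps above can be sketched in code. The types are heavily simplified and hypothetical: each view-change message is reduced to its stable checkpoint number and a flat list of prepared entries, signatures and proofs are omitted, and `NULL_DIGEST` is a placeholder for `d(null)`.

```python
from typing import Dict, List, NamedTuple, Tuple


class Prepared(NamedTuple):
    view: int    # view of the pre-prepare stored in the Pm set
    seq: int     # sequence number n
    digest: str  # request digest d


class ViewChange(NamedTuple):
    stable_seq: int           # n of the sender's stable checkpoint
    prepared: List[Prepared]  # simplified P component


NULL_DIGEST = "null"  # placeholder for the digest of the null request


def new_view_pre_prepares(new_view: int, V: List[ViewChange]) -> List[Tuple[int, int, str]]:
    """Return O as (view, seq, digest) triples for min-s < n <= max-s."""
    min_s = max(vc.stable_seq for vc in V)  # latest stable checkpoint in V
    best: Dict[int, Prepared] = {}
    for vc in V:
        for p in vc.prepared:
            # Case 1: keep the entry with the highest view for each n.
            if p.seq > min_s and (p.seq not in best or p.view > best[p.seq].view):
                best[p.seq] = p
    max_s = max(best, default=min_s)  # highest prepared sequence number in V
    return [
        # Case 2: gaps are filled with the null request.
        (new_view, n, best[n].digest if n in best else NULL_DIGEST)
        for n in range(min_s + 1, max_s + 1)
    ]
```

Filling gaps with null requests keeps the sequence numbers contiguous, so replicas in the new view can execute in order without waiting for a number that will never be assigned.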
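In other words, starting from the latest stable checkpoint, the primary replays every message that reached the `prepared(m,v,n,i)` state (`min-s` < `n` <= `max-s`) by re-sending its pre-prepare. Replicas run these messages through the prepare and commit phases again, but a replica does not re-execute a request that it already executed in the previous view.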
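At the same time, a replica records in its log the latest stable checkpoint (the one with the largest `n`) found in the view-change messages from all replicas. If its own stable checkpoint is not the globally latest one, it fetches the missing messages `m` and the latest stable checkpoint from other replicas.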
if prepared(m,v,n,i) is true then prepared(m’,v,n,j) is false for any non-faulty replica j(including i=j) and any m’ such that D(m’)!=D(m)
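This holds because reaching prepared for m means that 2f+1 replicas chose to send prepare(m). Suppose the number of faulty nodes is k (k<=f); then at least 2f+1-k non-faulty replicas sent prepare(m). If prepared(m’,v,n,j) were also true, at least 2f+1-k non-faulty replicas would likewise have sent prepare(m’). There are only 3f+1-k non-faulty nodes in total, and (2f+1-k)*2>3f+1-k, so these two groups of non-faulty nodes overlap. At least one non-faulty node would then have sent two different messages for the same view and sequence number, which is a contradiction.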
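The counting step in this argument is easy to check numerically. The sketch below (function name invented here) computes the minimum number of non-faulty replicas shared by the two quorums; the difference `2(2f+1-k) - (3f+1-k)` simplifies to `f+1-k`, which is at least 1 whenever `k <= f`.

```python
def min_honest_overlap(f: int, k: int) -> int:
    """Minimum number of non-faulty replicas common to two prepare quorums,
    given 3f+1 replicas of which k <= f are faulty."""
    honest_total = 3 * f + 1 - k       # all non-faulty replicas
    honest_per_quorum = 2 * f + 1 - k  # non-faulty members of one quorum
    return 2 * honest_per_quorum - honest_total  # pigeonhole bound


# The overlap equals f + 1 - k, so it is never empty when k <= f.
for f in range(1, 50):
    for k in range(f + 1):
        assert min_honest_overlap(f, k) == f + 1 - k >= 1
```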
The view-change protocol ensures that non-faulty replicas also agree on the sequence number of requests that commit locally in different views at different replicas. committed(m,v,n) is true if and only if prepared(m,v,n,i) is true for all i in some set of f+1 non-faulty replicas. if committed-local(m,v,n,i) is true for some non-faulty i then committed(m,v,n) is true.
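This holds because reaching the `committed-local(m,v,n,i)` state means that commit messages were received from at least f+1 non-faulty replicas, that is, at least f+1 non-faulty nodes reached the prepared(m,v,n,i) state. A non-faulty node moves from view v to view v+1 only after receiving a new-view message, and a new-view message contains 2f+1 view-change messages. By an argument similar to the one above, these two sets must intersect in at least one non-faulty node k. If m is already covered by k's stable checkpoint, it is eventually synchronized to all non-faulty nodes; if not, it is included in k's view-change message and goes through the three-phase flow again in view v+1, so the commit eventually becomes consistent as well.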
A client request designates a replica to send the result; all other replicas send replies containing just the digest of the result. (That is, a single designated replica returns the full result, and the others return only its digest so the client can verify it.) If the client does not receive a correct result from the designated replica, it retransmits the request as usual, requesting all replicas to send full replies.
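When a replica reaches the `prepared(m,v,n,i)` state, it is about to enter the commit phase. At this point, if all earlier requests with sequence numbers lower than `n` have already been executed to produce its current state, the replica executes the request tentatively and replies to the client directly, while continuing with the commit phase.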
- The client waits for 2f + 1 matching tentative replies. If it receives this many, the request is guaranteed to commit eventually. Otherwise, the client retransmits the request and waits for f + 1 non-tentative replies.
- A request that has executed tentatively may abort if there is a view change and it is replaced by a null request. In this case the replica reverts its state to the last stable checkpoint in the new-view message or to its last checkpointed state (depending on which one has the higher sequence number).
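The client-side rule for tentative replies can be sketched as follows. The reply layout `(replica_id, result, is_tentative)` and the function name are assumptions for illustration; signature checks are assumed to have been done already.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple

Reply = Tuple[int, str, bool]  # (replica_id, result, is_tentative)


def client_decision(replies: Iterable[Reply], f: int) -> Optional[str]:
    """Return an acceptable result, or None if the client must keep waiting
    or retransmit the request."""
    tentative: Dict[str, Set[int]] = defaultdict(set)
    final: Dict[str, Set[int]] = defaultdict(set)
    for replica_id, result, is_tentative in replies:
        (tentative if is_tentative else final)[result].add(replica_id)
    for result, ids in final.items():
        if len(ids) >= f + 1:      # ordinary rule: f+1 non-tentative replies
            return result
    for result, ids in tentative.items():
        if len(ids) >= 2 * f + 1:  # tentative rule: 2f+1 matching replies
            return result
    return None
```

The higher threshold for tentative replies reflects that a tentatively executed request may still be aborted by a view change.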
They(replica) send the reply only after all requests reflected in the tentative state have committed; this is necessary to prevent the client from observing uncommitted state.
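All replicas need to be fully connected and exchange several rounds of messages, so the number of messages in the system grows quadratically with the number of nodes. The minimum number of messages for a single request is `1 + 3f + 3f(3f-f) + (3f-f+1)(3f+1) + 3f-1`.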
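To get a feel for the numbers, the expression can be evaluated directly. The sketch below takes the formula exactly as stated above (it does not re-derive it) and checks that it simplifies to `12f^2 + 11f + 1`.

```python
def min_messages(f: int) -> int:
    """Evaluate the message-count expression from the text for a given f."""
    return 1 + 3 * f + 3 * f * (3 * f - f) + (3 * f - f + 1) * (3 * f + 1) + 3 * f - 1


for f in (1, 2, 3, 10):
    # The expression is a degree-2 polynomial in f.
    assert min_messages(f) == 12 * f * f + 11 * f + 1
    print(f"f={f}, nodes={3 * f + 1}, messages={min_messages(f)}")
```

For example, f = 1 (4 nodes) gives 24 messages and f = 2 (7 nodes) gives 71.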
pBFT— Understanding the Consensus Algorithm
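The vote threshold `T` must satisfy `T >= floor((R-f)/2) + 1 + f`, which guarantees correctness. Next consider liveness. Because faulty nodes may not take part in the vote (`P` denotes the non-faulty nodes), the total number of non-faulty nodes must be at least the threshold, i.e. `R-f >= T`.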
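Combining the two inequalities recovers the familiar `3f+1` bound. The sketch below assumes `T` is taken at its lower bound and searches for the smallest total `R` that still satisfies the liveness condition; the function names are made up for this check.

```python
def threshold(R: int, f: int) -> int:
    """Smallest T allowed by the correctness condition."""
    return (R - f) // 2 + 1 + f


def is_live(R: int, f: int) -> bool:
    """Liveness condition: the non-faulty nodes alone can reach T."""
    return R - f >= threshold(R, f)


def smallest_live_total(f: int) -> int:
    R = f + 1  # at least one non-faulty node
    while not is_live(R, f):
        R += 1
    return R


# The smallest workable total is 3f + 1 for every f checked here.
assert all(smallest_live_total(f) == 3 * f + 1 for f in range(0, 100))
```

Algebraically, `R-f >= floor((R-f)/2) + 1 + f` is equivalent to `ceil((R-f)/2) >= f+1`, that is `R-f >= 2f+1`.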
References
M. Castro and B. Liskov. Practical Byzantine Fault Tolerance. Proceedings of the Third Symposium on Operating Systems Design and Implementation, New Orleans, USA, February 1999