1. Learner
A learner listens to the acceptors' decisions. How does it find out which value was accepted? Like this: The acceptor should periodically repeat the state of the highest instance for which some value
was accepted. By doing so, it helps learners to stay up to date when message loss occurs, since they can realize if they have holes and act accordingly.
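A minimal sketch of how a learner might spot holes once it hears such a repeated message. The function name `find_holes` and the assumption that instances are numbered from 1 are illustrative and not taken from libpaxos.

```python
def find_holes(learned_iids, highest_iid_seen):
    """Return the instance ids up to highest_iid_seen that were never learned.

    Assumes instances are numbered from 1 (an assumption of this sketch).
    """
    learned = set(learned_iids)
    return [iid for iid in range(1, highest_iid_seen + 1) if iid not in learned]


# The learner delivered instances 1, 2 and 4, then hears an acceptor
# repeat its state for instance 6.
missing = find_holes([1, 2, 4], 6)
```

What the learner does about the missing instances (for example, asking the acceptors to repeat them) is left open here, as it is in the quoted text.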
2. Acceptor
It sits waiting for messages from proposers or learners and answers them.
The acceptor keeps a state record for each instance, consisting of < iid, B, V, VB > (the acceptor's state),
where B is an integer, the highest ballot number that was accepted or promised for this instance (the highest proposal number),
V is a client value (the value proposed by a client),
VB is the ballot number corresponding to the accepted value (the ballot under which V was accepted).
The three fields are initially empty.
The acceptor adds a periodic event which reminds it to repeat its most recent (highest instance number) accept; this is useful to keep learners up to date in low-traffic situations.
The acceptors need to send one further type of message, a so-called reject. This allows the proposer to skip a few ballots ahead when generating the next request.
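As a rough illustration of the record and of the periodic repeat, here is a sketch in Python. The class name `AcceptorRecord` and the helper `record_to_repeat` are hypothetical; only the four fields come from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AcceptorRecord:
    iid: int                    # instance id
    B: Optional[int] = None     # highest ballot accepted or promised
    V: Optional[str] = None     # accepted client value, if any
    VB: Optional[int] = None    # ballot under which V was accepted


def record_to_repeat(records):
    """Pick the record the periodic event would re-send: the highest
    instance for which some value was accepted, or None if there is none."""
    accepted = [r for r in records if r.V is not None]
    return max(accepted, key=lambda r: r.iid) if accepted else None


records = [
    AcceptorRecord(1, 101, "a", 101),
    AcceptorRecord(2, 201, "b", 201),
    AcceptorRecord(3, 301),          # promised only, nothing accepted yet
]
latest = record_to_repeat(records)   # the record for instance 2
```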
3. Proposer
The proposer is responsible for pushing values submitted by the clients until those are delivered. Proposers rely on an external leader election service, which should nominate a coordinator (or leader) among them.
This seems to mean that messages are first passed to the leader, which then submits them. Let's read on.
4. Leader Proposer
The leader proposer sends client values through the broadcast. For each client value submitted, it chooses the next unused instance and tries to bind the
value to it. This process is executed in two phases; the second phase can be started only on successful completion of the first one.
Phase 2 runs only after phase 1 has succeeded. Of course, once the leader sees that phase 1 has already succeeded for an instance, it can go straight to the phase 2 accept!
Every request from a client is stored in a pending list. The leader can run one instance at a time, or several instances concurrently.
If instance i successfully completed phase 1 with a null value, the leader can pop the next client value from the pending list, assign it to i and execute phase 2.
5. Phase 1
After sending the phase 1 message, the Leader has to wait until either (i) a majority (i.e., ⌊n/2⌋+1, where n is the number of acceptors) of promises are received from distinct acceptors, or (ii) the timeout expires. In the first case the instance is declared ready, and phase 2 can begin. In the second case the proposer will increment its ballot and retry executing phase 1; it cannot execute phase 2.
Either a majority of promises arrives or the timeout fires. libpaxos marks the first case (majority received) as ready, and only a ready instance can start phase 2. In the second case the leader must increase the ballot number and run phase 1 again; it cannot go on to phase 2.
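The majority rule and the two outcomes can be sketched as follows. The function names and the string labels are made up for illustration; the only parts taken from the text are the ⌊n/2⌋+1 quorum and the requirement that promises come from distinct acceptors.

```python
def majority(n_acceptors):
    """Quorum size: floor(n / 2) + 1."""
    return n_acceptors // 2 + 1


def phase1_status(promise_senders, n_acceptors, timed_out):
    """Classify a phase 1 attempt.

    promise_senders may contain duplicates (retransmissions); only
    distinct acceptors count toward the majority.
    """
    if len(set(promise_senders)) >= majority(n_acceptors):
        return "ready"      # phase 2 may begin
    if timed_out:
        return "retry"      # increment the ballot and redo phase 1
    return "waiting"
```

With three acceptors, promises from A2 and A3 are enough to declare the instance ready, while two copies of the same promise from A2 are not.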
6. Phase 2
Specifically, if all the promises received contained a null value V, the instance is empty. In this case the leader can send a value submitted by a client. If instead some value was found in the promises, the instance is reserved. The coordinator is forced to select the value V with the highest associated ballot VB and execute phase 2 with it.
That is, if any promise received from the acceptors contains a value, the leader is forced to submit the value with the highest ballot among them; otherwise it uses the value given by a client.
Once a value is picked with the above rule, the proposer sends an accept and again sets a timeout for this request. The accept consists of < i, b, v >, where i is the instance number, b is the ballot used in phase 1 and v is a value. Acceptors that did not acknowledge a higher ballot will accept the request and answer with a learn message, consisting of < i, b, v >. If a majority of acceptors accepted the value, it is safe to assume that the instance is closed with v permanently associated to it. Nothing can happen in the system that will change this fact, therefore learners can deliver the value to their clients. In case of request timeout, the Leader has to start over by incrementing the ballot and executing phase 1 again. We previously said that learn messages are sent by acceptors to learners. The Leader should receive those messages too or it can internally use a learner and wait for it to deliver.
In other words, the leader also sets a timeout for the phase 2 request. If a majority of acceptors accept, the value is permanently bound to the current instance and the learners pass it on to the clients. The leader also needs to find out that the value was accepted; otherwise, after the timeout, it picks a larger ballot number and starts again from phase 1.
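The selection rule is small enough to write down directly. In this sketch each promise is reduced to a `(V, VB)` pair; the function name `choose_phase2_value` is hypothetical.

```python
def choose_phase2_value(promises, client_value):
    """Apply the phase 2 value rule.

    promises: list of (V, VB) pairs taken from the promise messages,
              with V = None when the acceptor had accepted nothing.
    Returns the value with the highest VB if the instance is reserved,
    otherwise the client value (the instance is empty).
    """
    reserved = [(vb, v) for (v, vb) in promises if v is not None]
    if not reserved:
        return client_value
    return max(reserved, key=lambda pair: pair[0])[1]
```

For example, with promises `[("x", 101), (None, None), ("y", 201)]` the leader has to propose "y", whatever the client submitted.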
3. The prepare message
A prepare is a tuple < i, b >,
where i is the instance number and b is a ballot. Unless the ballot B in the acceptor record for instance i contains a number higher than b, the acceptor grants the request. It sets B = b and answers with a promise message. In other words, unless the ballot number already recorded by the acceptor is larger, it sets B = b.
4. promise message
consisting of < i, b, V, VB >
i and b are the same as above. V and VB are null if no value was accepted yet. That is, if the acceptor has not accepted any value for this instance before, V and VB are set to null.
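A sketch of how an acceptor might handle a prepare and build the promise, following the rule quoted above. The dictionary layout, the function name `on_prepare`, and the contents of the reject tuple are assumptions made for the example; the text only says that a reject lets the proposer skip ahead.

```python
def on_prepare(records, i, b):
    """Handle prepare <i, b> against the acceptor's records.

    records maps instance id -> {"B": ..., "V": ..., "VB": ...}.
    Returns a promise <i, b, V, VB>, or a reject carrying the higher
    ballot already recorded (the reject layout is hypothetical).
    """
    rec = records.setdefault(i, {"B": None, "V": None, "VB": None})
    if rec["B"] is not None and rec["B"] > b:
        return ("reject", i, rec["B"])
    rec["B"] = b
    return ("promise", i, b, rec["V"], rec["VB"])


records = {}
first = on_prepare(records, 1, 201)    # granted, nothing accepted yet
second = on_prepare(records, 1, 101)   # refused, 201 was already promised
```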
5. The accept message
An accept is a tuple < i, b, v >,
where i is the instance number, b is a ballot and v is a value. Unless the ballot B in the acceptor record for instance i contains a number higher than b, the acceptor grants the request. The acceptor (changing its own state) sets B = b, V = v, VB = b and answers with a learn message, consisting of < i, b, v >.
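The accept path can be sketched the same way. Again the record layout and the name `on_accept` are illustrative, and returning `None` for a refused request is a simplification of this sketch rather than libpaxos behaviour.

```python
def on_accept(records, i, b, v):
    """Handle accept <i, b, v>.

    Grants the request unless a higher ballot is recorded for instance i.
    On success the record becomes B = b, V = v, VB = b and the learn
    message <i, b, v> is returned; otherwise None is returned.
    """
    rec = records.setdefault(i, {"B": None, "V": None, "VB": None})
    if rec["B"] is not None and rec["B"] > b:
        return None
    rec.update(B=b, V=v, VB=b)
    return ("learn", i, b, v)


records = {1: {"B": 201, "V": None, "VB": None}}   # state after the promise
learn = on_accept(records, 1, 201, "v")
stale = on_accept(records, 1, 101, "w")            # older ballot, refused
```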
6. A question: can I discard a value that was proposed earlier? (The value has already been agreed on by a quorum.)
No.
7. What does libpaxos mean by atomic broadcast, and how does it differ from ZAB?
ZAB requires a) reliable delivery, b) total order, c) causal order, and d) the prefix property [2].
libpaxos, for its part, says: it requires FIFO order with respect to each proposer, it is quite easy to enforce those requirements on top of the Atomic Broadcast layer. Personally I think it should have these properties as well.
8. Does stable storage hurt performance?
Without doubts, this stable storage requirement is the factor with the most impact on performance. However, as we shall see later, there are ways to get around this issue. To deal with the storage problem, libpaxos chose to use Berkeley DB [OBS99] as a stable storage layer. The code is made so that we can easily change the "durability mode": from no durability at all, to strict log-based transactional storage with synchronous writes to disk, passing through other intermediate settings.
9. What do snapshots do?
When receiving this command, each acceptor knows that it can safely drop information about instances up to i, truncating its database. The semantics of this operation and its recovery procedure, in case of failure, depend on the application built on top.
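As a toy illustration of the truncation step, with the acceptor database modelled as a plain dictionary keyed by instance id (an assumption; the real storage layer is Berkeley DB), and assuming "up to i" includes i:

```python
def truncate_up_to(db, i):
    """Drop every instance record with id <= i and return how many were dropped."""
    stale = [iid for iid in db if iid <= i]
    for iid in stale:
        del db[iid]
    return len(stale)


db = {1: "a", 2: "b", 3: "c", 4: "d"}
dropped = truncate_up_to(db, 2)   # instances 1 and 2 are covered by the snapshot
```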
10. How does libpaxos deal with livelock?
I did not find anything about livelock in the libpaxos paper, so I asked the author, Marco, about it. He replied by email:
Some versions of libpaxos have no code to cover this case since they don't even handle the case in which the leader fails (or there are multiple leaders). In practice this problem can be easily solved in a number of ways. For example a weak failure detection is enough to know that some other leader is active and avoid disturbing it. A random back-off will work too and trades simplicity for some latency.
So one way to solve the problem is back-off. In addition, another of his papers, on performance [4], describes an optimization: phase 1 is pre-executed (no value is supplied at that point), and when phase 2 is to run, a value is picked and phase 2 is executed with it.
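A random back-off could look roughly like the sketch below. The exponential growth, the default `base` and `cap` values, and the function name are all assumptions of this example; the email only says that a random back-off works.

```python
import random


def backoff_delay(attempt, base=0.05, cap=2.0, rng=random):
    """Seconds to wait before retrying phase 1 after `attempt` failures.

    The ceiling doubles with each failed attempt (up to `cap`), and the
    actual delay is drawn uniformly below it so that two competing
    proposers are unlikely to retry at the same moment.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


rng = random.Random(7)   # seeded only to make the example repeatable
delays = [backoff_delay(attempt, rng=rng) for attempt in range(10)]
```

A proposer would sleep for the returned delay before incrementing its ballot and sending the next prepare.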
10. Performance optimization
As a performance optimization, a number of phase 1 instances are pre-executed by the leader, i.e. before the clients start to submit values we can pre-execute phase 1 for instances from 1 to 100. When a client value is submitted, it can be delivered in only two message delays (an accept and a learn). In principle, also phase 2 could be executed in parallel to achieve better throughput. Since our application has some strict FIFO requirements, we opt for a simpler version that does not implement this optimization.
One question here: would anything go wrong without FIFO?
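To make the idea concrete, here is a sketch that pairs queued client values with instances whose phase 1 was already completed. The helper `bind_values` is hypothetical, and the ballot that a real accept carries is left out for brevity.

```python
from collections import deque


def bind_values(ready_iids, pending):
    """Pair client values (FIFO) with pre-executed instances (lowest first).

    Both arguments are deques; consumed entries are removed from them.
    Returns the (instance, value) pairs for which an accept can be sent.
    """
    accepts = []
    while ready_iids and pending:
        accepts.append((ready_iids.popleft(), pending.popleft()))
    return accepts


ready = deque(range(1, 101))          # phase 1 pre-executed for instances 1..100
pending = deque(["a", "b", "c"])      # client values in submission order
accepts = bind_values(ready, pending)
```

Because values leave the pending list in order and are bound to increasing instance numbers, the per-proposer FIFO order is preserved as long as the instances are also delivered in order.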
11. ZK (ZooKeeper) is built on TCP; is libpaxos built on UDP or TCP? What are the advantages and disadvantages?
It is built on UDP multicast. TCP is a connection-oriented protocol: a connection exists between two processes (identified by port number) running on two hosts (identified by IP address), so it cannot do multicast. libpaxos says this about TCP: In a connection-oriented network this is more complicated since processes may leave, and come back or move to a different host. As for UDP multicast: Multicast also pushes the cost of sending multiple copies of a message down to the network switch, since the sender calls send only once. Most of the design issues addressed in this document are specific to LAN-based implementations using UDP Multicast.
Disadvantages: There are two major drawbacks of using UDP: the message size is limited by the MTU of the host OS and network switch, therefore there is a bound on the size of values that clients can submit. The second constraint is the performance of the network equipment when delivering high throughput multicast traffic.
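A back-of-the-envelope check of the size bound, assuming a typical Ethernet MTU of 1500 bytes, a 20-byte IPv4 header without options and an 8-byte UDP header. The `paxos_header` parameter stands for whatever libpaxos adds per message; its real size is not given in the text, so it defaults to 0 here.

```python
def max_value_size(mtu=1500, ip_header=20, udp_header=8, paxos_header=0):
    """Largest payload that fits in one unfragmented datagram."""
    return mtu - ip_header - udp_header - paxos_header


def fits_in_one_datagram(value, **sizes):
    """True if the client value (bytes) can travel in a single datagram."""
    return len(value) <= max_value_size(**sizes)


limit = max_value_size()   # 1500 - 20 - 8
```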
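11. The finite state machine of a libpaxos instance
The state machine describes the states of one instance and the order of the events.
Each instance is represented by a tuple <i, S, b, pset, v1, vb, v2, cv>, where i is the instance number, S is a symbol representing a state (as they appear in the figure), b is a ballot, pset is a set of promises received, v1 is the value found after a successful phase 1, vb is the corresponding ballot, v2 is the value to use for phase 2, cv is a flag indicating whether v2 is a value received from a client. Each instance is initialized as <0, empty, 0, {}, null, 0, null, false>.
i is the instance number
S is a symbol representing a state
b is a ballot number
pset is the set of promises the acceptors sent to the leader
v1 is the value obtained after phase 1 completes successfully
vb is the ballot number corresponding to v1
v2 is the value to be used in phase 2
cv indicates whether v2 is a value obtained from a client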
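S: A ballot is generated. For example, if the proposer ID is 2, the first ballot can be set to 102. Once the prepare has been sent to the acceptors, the instance state is updated.
T01: Phase 1 times out before a majority is reached. The leader increases the ballot, clears any promises received so far, and resends; the instance state is updated. [Why does the tuple end with v2, cv? My understanding is that a value has been taken directly from the pending list, although I think it would also be fine not to take one.]
P: The leader receives a promise pa from some acceptor. The ballot in pa must equal the ballot of the prepare that was sent out; otherwise pa is discarded. The instance state is updated by the following rule: if pa contains a value and the ballot of that value is larger than the current vb, v1 and vb are replaced; otherwise they are left unchanged.
R0: The leader has reached a majority of promises, but none of them contains a value. The instance state is updated: if v2 is non-null, the value used is the client value from the pending list mentioned in T01; otherwise a value is taken from the pending list.
NV: Phase 1 is complete, v2 is null, and the pending list is also empty, so the leader waits for a client value to arrive before using this instance.
A: If v2 was assigned in advance, the state is updated and the instance jumps to E.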
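R1: The leader has received a majority of promises and at least one promise contains a value, v1. The following cases are handled:
--v2 is null: the instance state is updated. What does this mean? The value carried by the promise is used.
--v1 = v2: the instance state is updated. (This can happen if phase 2 timed out. For example, after phase 1 none of the promises contained a value, so phase 2 took a client value from the pending list and sent it to the acceptors, and then timed out; the acceptors may therefore already hold the client value.)
--v1 != v2 && cv = true: v2 is put back at the head of the pending list and the instance state is updated.
--v1 != v2 && cv = false: since cv = false means that v2 is not a client value, it is simply discarded. [This seems a bit odd: v2 would exist because the client value v2 was sent straight to the acceptors in phase 1. Can the value really come from somewhere other than a client?]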
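E: Phase 2 is executed with b and v2, and the instance state is updated.
C: If the leader learns that the instance is closed, the C action is triggered and the state is updated. There is one more sentence after this that is a bit puzzling: This instance can be ignored until it is
delivered (D5). I think it means that if the learners also know the instance is closed, the C action can be ignored.
T02: Phase 2 times out, and the state is updated.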
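D0~D5: The accepted message may reach the leader late, or another proposal may have been submitted to the acceptors (before they had decided). In these cases the learner sends the value v' it received to the leader for comparison. (There seem to be two questions here: how does the learner know that the leader's message was delayed, and how does the learner detect two inconsistent proposals? The second one can probably be solved, though.)
--v' = v2 && cv = true: notify the client that the operation succeeded
--v' != v2 && cv = true: put v2 back at the head of the pending list
--otherwise: do nothing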
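The transitions P, R0/NV, R1 and D described above can be sketched as three small functions. This is only my reading of the notes, not the libpaxos code: the names are made up, the state symbol S is left out, and setting cv to false when the leader adopts v1 is an assumption.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Instance:
    i: int = 0
    b: int = 0
    pset: set = field(default_factory=set)   # acceptors that promised
    v1: Optional[str] = None                 # value found in phase 1
    vb: int = 0                              # ballot of v1
    v2: Optional[str] = None                 # value to use in phase 2
    cv: bool = False                         # is v2 a client value?


def on_promise(inst, acceptor_id, ballot, value, value_ballot):
    """P: record a promise; keep the value with the highest ballot."""
    if ballot != inst.b:
        return False                         # promise for another ballot
    inst.pset.add(acceptor_id)
    if value is not None and value_ballot > inst.vb:
        inst.v1, inst.vb = value, value_ballot
    return True


def on_ready(inst, pending):
    """R0 / R1, called once pset holds a majority.

    Returns False in the NV case (nothing to propose yet)."""
    if inst.v1 is None:                      # R0: instance is empty
        if inst.v2 is None:
            if not pending:
                return False                 # NV: wait for a client value
            inst.v2, inst.cv = pending.popleft(), True
        return True
    # R1: some promise carried a value
    if inst.v2 is not None and inst.v2 != inst.v1 and inst.cv:
        pending.appendleft(inst.v2)          # client value goes back to the head
    if inst.v2 != inst.v1:
        inst.v2, inst.cv = inst.v1, False    # adopt v1 (cv = False is assumed)
    return True


def on_deliver(inst, delivered, pending):
    """D: compare the delivered value v' with v2."""
    if not inst.cv:
        return "ignore"
    if delivered == inst.v2:
        return "notify_client"
    pending.appendleft(inst.v2)
    return "requeue"
```

For example, an instance that already holds client value "c1" and then finds "x" in a promise ends up proposing "x", with "c1" back at the head of the pending list.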
12. An example from libpaxos
Let us observe an example execution for one instance of the protocol. The network is composed of a single client C, two proposers P1 and P2, three acceptors A1,A2,A3. L1, L2 and L3 are three learners started by the client and by the proposers. P1 is initially the leader.
1 client: C
2 proposers: P1 and P2, with P1 as the initial leader
3 acceptors: A1 A2 A3
3 learners: L1 L2 L3
Initialization: C sends the value v to the current leader P1.
Phase 1a: P1 sends a prepare message consisting of an instance number (1 in
this case) and a ballot number (i.e., 101).
P1 does not receive any promise from the acceptors because of message loss. It increments the ballot (to 201) and retries
P1 sends <1, 101> to the acceptors; the message is lost, so it resends with <1, 201>.
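The ballots used here (101, then 201 for P1) and the earlier remark that proposer 2 could start at 102 are consistent with a scheme in which the low digits carry the proposer ID and each retry adds a fixed step. The sketch below assumes a step of 100; that constant and the function names are inferred for illustration, not taken from libpaxos.

```python
MAX_PROPOSERS = 100   # assumed step; proposer IDs must be smaller than this


def first_ballot(proposer_id):
    """First ballot of a proposer, e.g. 101 for P1 and 102 for P2."""
    return MAX_PROPOSERS + proposer_id


def next_ballot(ballot):
    """Ballot to use after a timeout; keeps the proposer ID in the low digits."""
    return ballot + MAX_PROPOSERS


def ballot_owner(ballot):
    """Recover the proposer ID from a ballot."""
    return ballot % MAX_PROPOSERS
```

Because two proposers never share an ID, they can never generate the same ballot, which is what makes ballots comparable across proposers.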
Phase 1b: The three acceptors receive the prepare request, since none of them acknowledged any ballot higher than 201 for instance 1, they update their state
in stable storage and send the corresponding promise message. P1 receives two promises from A2 and A3, it can declare this instance ready since the value in
both promises is null.
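The three acceptors receive <1, 201>, update their state < iid, B, V, VB > to <1, 201, null, null> (persisted), and reply to P1. (Because the acceptors have not accepted any value for instance 1 before, they send null back.)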
Phase 2a: Instance 1 is ready with no value in it. The proposer sends an accept message containing the previously used ballot 201 and the value v received from
the client.
After receiving the promise messages, P1 sends <1, 201, v> to the acceptors.
Here v is the value that came from the client.
Phase 2b: Again the acceptors did not acknowledge a ballot higher than 201, therefore they accept the request. After updating their state permanently, they
send a learn message to all learners, announcing their decision to accept v for instance 1 with ballot 201. As soon as the learners realize that a majority of
acceptors granted the same request (same ballot), they can deliver v. In this way, P1, P2 and C are notified that the value was accepted. In case of a timeout
in phase 2, P1 must restart from phase 1, using ballot 301.
On success the value is sent to the learners; on a timeout the leader starts again from phase 1, using ballot 301.
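The learner side of phase 2b, counting learn messages until a majority of acceptors has granted the same ballot, might be sketched like this. The class and method names are hypothetical.

```python
from collections import defaultdict


class Learner:
    def __init__(self, n_acceptors):
        self.quorum = n_acceptors // 2 + 1
        self.votes = defaultdict(set)   # (instance, ballot) -> acceptor ids
        self.delivered = {}             # instance -> delivered value

    def on_learn(self, acceptor_id, i, b, v):
        """Process learn <i, b, v>; return v once it can be delivered."""
        if i in self.delivered:
            return None                 # instance already closed
        self.votes[(i, b)].add(acceptor_id)
        if len(self.votes[(i, b)]) >= self.quorum:
            self.delivered[i] = v
            return v
        return None


learner = Learner(n_acceptors=3)
results = [
    learner.on_learn("A1", 1, 201, "v"),
    learner.on_learn("A1", 1, 201, "v"),   # duplicate, still one acceptor
    learner.on_learn("A3", 1, 201, "v"),   # second distinct acceptor
]
```

Votes are grouped by ballot as well as by instance, since the text requires a majority for the same request (same ballot).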
As another example, let us go through the worst possible scenario we can think of. The network is composed of two proposers P1 and P2, three acceptors
A1,A2,A3 and a client C. C, P1 and P2 internally start three learners, respectively Lc , L1 and L2.
1 client: C
2 proposers: P1 and P2
3 acceptors: A1 A2 A3
3 learners, all started internally by C, P1 and P2
P1 is the current leader and manages to deliver instances up to i-1 (included). P1 successfully completes phase 1 for instance i, it sends an accept containing the value vi submitted by C. Acceptors A1,A2 receive the valid request, accept it, and update their state in stable storage. At this point the value vi is unambiguously chosen for instance i, since a majority of acceptors accepted it. However, assume that both acceptors crash while trying to inform the learners about their decision. To make things worse, P1 crashes in the same moment. Due to a temporary network failure, A3 does not receive P1’s message, it stays alive but knows nothing about what happened.
P2 takes over the leadership: since it uses a learner, it knows that instances up to i-1 are already closed. It executes phase 1 for instance i. However, since not enough acceptors are responding to its requests, phase 1 keeps expiring; P2 can do nothing but increment the ballot and keep trying. After some time, acceptor A1 recovers. A majority of acceptors is now online, progress is (eventually) granted. The next phase 1 executed by P2 receives two acknowledgements and the one from A1 contains value vi. P2 is forced to execute phase 2 using vi rather than any other value. A1 and A3 grant the request, they accept, log to stable storage and inform the learners, which can deliver value vi for instance i. No matter how tragic the scenario is, the safety of the protocol reduces to a simple fact: if a majority of acceptors accepted the same request, an instance is closed. Any proposer trying to do something for that instance will realize during phase 1 that there is a value already. It is then forced by the protocol to help with the current situation rather than try to push a different client value.
P1 starts Paxos instance i (all instances up to i-1 have completed successfully). After completing phase 1, P1 submits vi to the acceptors in phase 2. A1 and A2 accept it but crash before they can notify the learners; worse still, P1 crashes too, and because of a network problem A3 never receives P1's message. P2 becomes the leader. Since it has its own learner, it knows that the instances up to i-1 have completed, so it starts phase 1 for instance i. With two acceptors down no majority can form, and all it can do is keep increasing the ballot. Then A1 comes back and grants P2's phase 1 request; P2 receives vi and is forced to submit vi as the value, and A1 and A3 accept the request. So no matter how nasty the scenario is, once a majority of acceptors has accepted the same request the instance is closed, and any value already held by the acceptors must be the one that is carried through.
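8. What is a Basic Paxos instance?
Each instance of the Basic Paxos protocol decides on a single output value.
A diagram shows it more clearly, as follows.
If you repost this, please credit the source: http://blog.csdn.net/techq/article/details/7337210 by Netease CHQ
weibo:http://weibo.com/1772403527
Corrections and discussion are welcome.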
[1] libpaxos http://www.inf.usi.ch/faculty/pedone/MScThesis/marco.pdf
[2] A simple totally ordered broadcast protocol http://research.yahoo.com/files/ladis08.pdf
[3] The various versions of libpaxos http://libpaxos.sourceforge.net/paxos_projects.php
[4] LibPaxos Performance Analysis http://libpaxos.sourceforge.net/files/Primim-SPLab08.pdf
[5] Another reader's annotated libpaxos code http://blog.csdn.net/luojian_scu/article/details/6885555