Proving the Impossibility of Consensus - A Brief Tour of FLP Impossibility

One of the most important results in distributed systems theory was published in April 1985 by Fischer, Lynch, and Paterson. Their short paper 'Impossibility of Distributed Consensus with One Faulty Process', which eventually won the Dijkstra award given to the most influential papers in distributed computing, definitively placed an upper bound on what it is possible to achieve with distributed processes in an asynchronous environment.

This particular result, known as the ‘FLP result’, settled a dispute that had been ongoing in distributed systems for the previous five to ten years. The problem of consensus - that is, getting a distributed network of processors to agree on a common value - was known to be solvable in a synchronous setting, where processes could proceed in simultaneous steps. In particular, the synchronous solution was resilient to faults, where processors crash and take no further part in the computation. Informally, synchronous models allow failures to be detected by waiting one entire step length for a reply from a processor, and presuming that it has crashed if no reply is received.

This kind of failure detection is impossible in an asynchronous setting, where there are no bounds on the amount of time a processor might take to complete its work and then respond with a message. Therefore it’s not possible to say whether a processor has crashed or is simply taking a long time to respond. The FLP result shows that in an asynchronous setting, where only one processor might crash, there is no distributed algorithm that solves the consensus problem.

In this post, I want to give a tour of the proof itself because, although it is quite subtle, it is short and profound. I’ll start by introducing consensus, and then after describing some notation and assumptions I’ll work through the main two lemmas in the paper.

If you want to follow along at home (highly, highly recommended), a copy of the paper is available here.

Consensus

The problem of consensus is genuinely fundamental to distributed systems research. Getting distributed processors to agree on a value has many, many applications. For example, the problem of deciding whether to commit a transaction to a database could be decided by a consensus algorithm between a majority of replicas. You can think of consensus as ‘voting’ for a value.

Several papers in the literature set the problem in the context of generals at different camps outside the enemy castle deciding whether or not to attack. A consensus algorithm that didn’t work would perhaps cause one general to attack while all the others stayed back - leaving the first general in trouble.

There are several variations on the consensus problem that differ in ‘strength’ - a solution to a stronger problem typically will solve weaker problems at the same time. A strong form of consensus is as follows:

Given a set of processors, each with an initial value:

  • All non-faulty processes eventually decide on a value
  • All processes that decide do so on the same value
  • The value that has been decided must have been proposed by some process

These three properties are referred to as termination, agreement and validity. Any algorithm that has these three properties can be said to solve the consensus problem.

Termination and agreement are fairly self-explanatory - note that we explicitly do not require any particular behavior from faulty nodes, which is just as well: they’re faulty! Validity is slightly more subtle, although it seems reasonable. The idea behind validity is that we want to exclude trivial solutions that just decide ‘No’ whatever the initial set of values is. Such an algorithm would satisfy termination and agreement but would be completely vacuous, and no use to us at all.
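
To make the three properties concrete, here is a minimal Python sketch - my own illustration, not anything from the paper - that checks the outcome of a single finite execution against all three. A real correctness argument must, of course, quantify over every possible execution, not just one.

```python
# Check one execution's outcome against the three consensus properties.
# 'proposals' maps each process to its initial value, 'decisions' maps
# each process to its decided value (or None if it never decided), and
# 'faulty' is the set of processes that crashed.

def check_consensus(proposals, decisions, faulty):
    non_faulty = set(proposals) - set(faulty)

    # Termination: every non-faulty process decides.
    termination = all(decisions.get(p) is not None for p in non_faulty)

    # Agreement: all processes that decide, decide on the same value.
    decided = [v for v in decisions.values() if v is not None]
    agreement = len(set(decided)) <= 1

    # Validity: any decided value was proposed by some process.
    validity = all(v in proposals.values() for v in decided)

    return termination and agreement and validity

# Three processes with binary initial values; p3 crashes before deciding.
print(check_consensus(proposals={"p1": 0, "p2": 1, "p3": 0},
                      decisions={"p1": 0, "p2": 0, "p3": None},
                      faulty={"p3"}))  # True
```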

The FLP proof actually concerns a slightly weaker form of consensus - for termination, it is enough only that some non-faulty process decides. Note that it’s not enough to have one distinguished process that always decides its own value, as that process might fail, and another will have to take its place for even weak termination to be satisfied.

Clearly, a solution to strong consensus will be a perfectly good solution for weak consensus as well, so by ruling out the possibility of the latter the FLP result similarly precludes a solution to the former.

For simplicity, the authors consider that the only values possible are 0 and 1. Every argument in the paper generalizes to more values.

System Model

You can’t reasonably talk about a distributed algorithm without talking about the context it is designed to execute in and the assumptions it makes about the underlying distributed system. This set of assumptions is known as the system model and its choice has a profound effect on what algorithms are appropriate. Indeed, this paper is concerned entirely with an impossibility result on one particular class of system models.

There is a huge variety in system models, ranging from those based solidly on real-world experience to more theoretical models. Most share some basic properties in common: a network of distributed processors which are vertices in a connected graph where the edges are communication links along which messages may be sent. The variation comes in how each processor views the passage of time, whether links are reliable and how - if at all - processors are allowed to fail.

The FLP result is based on the asynchronous model, which is actually a class of models which exhibit certain properties of timing. The main characteristic of asynchronous models is that there is no upper bound on the amount of time processors may take to receive, process and respond to an incoming message. Therefore it is impossible to tell if a processor has failed, or is simply taking a long time to do its processing. The asynchronous model is a weak one, but not completely physically unrealistic. We have all encountered web servers that seem to take an arbitrarily long time to serve us a page. Now that mobile ad-hoc networks are becoming more and more pervasive, we see that devices in those networks may power down during processing to save battery, only to reappear later and continue as though nothing had happened. This introduces an arbitrary delay which fits the asynchronous model.

Other models are arguably more appropriate for networks like the Internet, but the asynchronous model is very general without being completely unrealistic.

The communication links between processors are assumed to be reliable. It was well known that given arbitrarily unreliable links no solution for consensus could be found even in a synchronous model; therefore by assuming reliable links and yet still proving consensus impossible in an asynchronous system the FLP result is made stronger. TCP gives reliable enough message delivery that this is a realistic assumption.

Processors are allowed to fail according to the fail-stop model - this simply means that processors that fail do so by ceasing to work correctly. There are more general failure models, such as Byzantine failures where processors fail by deviating arbitrarily from the algorithm they are executing. Again, by strengthening the model - by making strong assumptions about what is possible - the authors make their result much more general. The argument is as follows: if a given problem is proven unsolvable in a very restricted environment where only a few things can go wrong, it will certainly be unsolvable in an environment where yet more things can go wrong.

Formal Model

In order to formally model these properties, some notation needs to be introduced. In section 2 of the paper the system model is carefully defined. There are N ≥ 2 processors which communicate by sending messages. A message is a pair (p, m), where p is the processor the message is intended for and m is the contents of the message. Messages are stored in an abstract data structure called the message buffer, a multiset - simply a set in which more than one of any element is allowed - which supports two operations. The first, send(p, m), simply places the message (p, m) in the message buffer. The second, receive(p), either returns a message for processor p (and removes it from the message buffer) or the special value θ, which does nothing.

The semantics of receive aren’t completely intuitive (but make perfect sense!). If there are no messages in the message system addressed to p then receive(p) will return θ. However if there are messages for p then receive(p) will return one of them at random or θ. This captures the idea that messages may be delivered non-deterministically and that they may be received in any order, as well as being arbitrarily delayed as per the asynchronous model. The only requirement is that calling receive(p) infinitely many times will ensure that all messages in the message buffer are eventually delivered to p. This means that messages may be arbitrarily delayed, but not completely lost.

A configuration is defined as the internal state of all of the processors - the current step in the algorithm that they are executing and the contents of their memory - together with the contents of the message buffer. The system moves from one configuration to the next by a step which consists of a processor p performing receive(p) and moving to another internal state. Therefore, the only way that the system state may evolve is by some processor receiving a message, or null, from the message buffer. Each step is therefore uniquely defined by the message that is received (possibly θ) and the process that received it. That pair is called an event in the paper, so configurations move from one to another through events.
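
These definitions are straightforward to mimic in code. The following sketch is my own rendering of the model, not code from the paper: a multiset message buffer with send and receive (where receive may non-deterministically return θ even when messages are pending), and a step function that consumes one event and advances a single processor’s state.

```python
import random
from collections import Counter

THETA = None  # the special null value θ

class MessageBuffer:
    """The multiset of undelivered (p, m) messages."""

    def __init__(self):
        self.pending = Counter()  # message -> multiplicity

    def send(self, p, m):
        self.pending[(p, m)] += 1

    def receive(self, p):
        messages = [msg for msg in self.pending.elements() if msg[0] == p]
        # Non-determinism: even when messages for p are pending, we may
        # return θ, modelling an arbitrary (but finite) delivery delay.
        if not messages or random.random() < 0.5:
            return THETA
        msg = random.choice(messages)
        self.pending[msg] -= 1
        return msg

def step(states, buffer, p, transition):
    """One step: p receives a message (or θ) and moves to a new internal
    state, possibly sending messages. 'transition' stands in for the
    protocol's deterministic transition function, which takes a state and
    a message and returns (new_state, outgoing_messages). Returns the
    event - the (processor, message) pair - that defined the step."""
    m = buffer.receive(p)
    states[p], outgoing = transition(states[p], m)
    for (q, msg) in outgoing:
        buffer.send(q, msg)
    return (p, m)
```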

Since the receive operation is non-deterministic, there are many different possible executions for a given initial state (which for consensus is differentiated simply by the set of starting values that each processor has). In order to show that an algorithm solves consensus, we have to show that it satisfies validity, agreement and termination for every possible execution. The set of executions includes the possibility that processors are delayed arbitrarily, which we model by having them receive θ for an arbitrary number of operations until they get their message. A particular execution, defined by a possibly infinite sequence of events from a starting configuration C, is called a schedule, and the sequence of steps taken to realise the schedule is a run.

Non-faulty processes take infinitely many steps in a run (presumably eventually just receiving θ once the algorithm has finished its work) - otherwise a process is considered faulty. An admissible run is one where at most one process is faulty (capturing the failure requirements of the system model) and every message is eventually delivered (this means that every processor is eventually chosen to receive infinitely many times). We say that a run is deciding provided that some process eventually decides according to the properties of consensus, and that a consensus protocol is totally correct if every admissible run is a deciding run.

With all those heavy definitions out of the way, the authors present their main theorem: no totally correct consensus algorithm exists. The idea is to show that there is some admissible run - i.e. one with at most one processor failure and eventual delivery of every message - that is not a deciding run - i.e. in which no processor ever decides. Exhibiting even one such run is enough to prove the general impossibility of consensus, because it can never be ruled out that the system starts in that configuration and follows that run, in which case the protocol runs forever without any processor deciding.

The Proof

As the paper says, ‘the basic idea is to show circumstances under which the protocol remains forever indecisive’. There are two steps to this, presented as two separate lemmas. The first demonstrates the existence of an initial condition which the second exploits. Things are tied together thereafter.

The First Lemma

(This is lemma 2 in the paper)

The point of the first lemma is to show that there is some initial configuration in which the decision is not predetermined, but is in fact arrived at as a result of the sequence of steps taken and the occurrence of any failures. For example, say that this initial configuration was two processors whose initial value was ‘0’ and one whose value was ‘1’. The authors show that what is decided from such a configuration can depend on the order in which messages are received and on whether any of the processors fail, not just on the initial values of the processors. This is due to the inherent non-determinism of the asynchronous system model.

The way this is done is quite neat. Suppose the opposite were true - that every initial configuration has a predetermined outcome. By validity, there must be some initial configuration from which ‘0’ is decided and some from which ‘1’ is decided (the all-‘0’ configuration can only decide ‘0’, and the all-‘1’ configuration only ‘1’). Each initial configuration is uniquely determined by the set of initial values in the processors. We can order these configurations in a chain where two configurations are next to each other if they differ by only one value, so the only difference between two adjacent configurations is the starting value of one processor.

At some point along this chain, a ‘0’-deciding initial configuration must be next to a ‘1’-deciding one, and therefore they differ in only one initial value. Call the ‘0’-deciding configuration C0 and the ‘1’-deciding configuration C1, and call the processor whose initial value differs between them p. Now suppose that p fails initially - we are allowed one processor failure. Then p neither sends nor receives any messages, so its initial value cannot be observed by the rest of the processors. Run the protocol from C0 with p failed in this way: one of the remaining processors must eventually decide ‘0’.

The crucial observation is that this same run can also be made from C1. The two configurations are exactly the same except for the value at p, which can contribute nothing to the run because p has failed, so any sequence of steps that can be taken from C0 can also be taken from C1. But if exactly the same sequence of steps is taken, then the eventual decision must be the same for both configurations. This contradicts our assumption that the results from C0 and C1 are predetermined by the initial configuration alone, and that the two configurations decide differently. So either one (or possibly both) of these configurations can potentially decide ‘0’ or ‘1’, and the eventual result depends on the pattern of failures and message deliveries.
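
The chain argument is essentially a pigeonhole: somewhere along a path from the all-‘0’ configuration to the all-‘1’ configuration, flipping a single initial value must flip the decision. Here is a small Python sketch of my own that walks such a chain, given a hypothetical outcome function that (per the assumption being contradicted) maps each initial configuration straight to a decision:

```python
# Walk a chain of initial configurations from all-zeros to all-ones,
# flipping one process's initial value at a time, and find an adjacent
# pair whose (assumed predetermined) decisions differ. 'outcome' stands
# in for the hypothetical map from initial configuration to decision.

def adjacent_differing_pair(n, outcome):
    values = [0] * n
    chain = [tuple(values)]
    for i in range(n):
        values[i] = 1
        chain.append(tuple(values))  # differs from predecessor at i only

    # By validity, outcome(chain[0]) == 0 and outcome(chain[-1]) == 1,
    # so some adjacent pair along the chain must disagree.
    for c0, c1 in zip(chain, chain[1:]):
        if outcome(c0) != outcome(c1):
            return c0, c1

# Example outcome function: majority vote over the initial values.
majority = lambda config: int(sum(config) > len(config) // 2)
print(adjacent_differing_pair(5, majority))
# -> ((1, 1, 0, 0, 0), (1, 1, 1, 0, 0))
```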

We call the configurations that may lead to either decision value bivalent, and configurations that will only result in one value 0-valent or 1-valent (collectively, univalent). I’ll use these terms for the rest of this article.

The Second Lemma

(This is lemma 3 in the paper)

The second lemma shows that bivalence is not confined to initial configurations: starting from any bivalent configuration, the protocol can always be steered to another bivalent configuration. Let C be a bivalent configuration, and let e = (p, m) be any event that is applicable to C. Let D be the set of configurations reached by taking any finite sequence of steps from C in which e is not received, and then finally receiving e. The lemma states that D contains a bivalent configuration.

The proof is by contradiction: suppose that every configuration in D is univalent. Since C is bivalent, there is some 0-valent configuration E0 reachable from C, and likewise some 1-valent configuration E1 (neither need belong to D). For each i, fix a configuration Fi in D as follows: if e was not received on the run from C to Ei, let Fi be the result of receiving e at Ei; if e was received somewhere along that run, let Fi be the configuration on that run reached immediately after receiving e.

Figure 2.1: Proof that D must contain both 0- and 1-valent configurations if it contains no bivalent configuration

That confusing definition is quite a simple idea: we are fixing an F0 in D that either comes before or after E0, depending on whether e was received in the run to E0.

Now we can see that F0 must be 0-valent. Either it comes before E0, in which case it must be 0-valent: it cannot be bivalent (by assumption), and if it were 1-valent the protocol would ‘change its mind’ between F0 and E0, which contradicts the 0-valency of E0.

If F0 instead comes after E0, then it is reachable from a 0-valent configuration and so must itself be 0-valent. Either way, D contains a 0-valent configuration.

We can define F1 relative to E1 similarly, to show that D contains a 1-valent configuration. Therefore we’ve shown that D must contain both 0-valent and 1-valent configurations if it contains no bivalent configurations.

We use this fact to set up the rest of the lemma’s proof. By an argument similar to the ‘chain’ used in the proof of the previous lemma, there must be two configurations C0 and C1, both reachable from C without receiving e, such that C1 is the result of receiving some message e′ at C0, and receiving e at Ci gives an i-valent configuration. Call these resulting configurations D0 and D1.

C0 and C1 are called ‘neighbours’ because they are separated by a single message delivery. Now consider the message that separates them, e′ = (p′, m′), and the two alternatives for its destination p′.

Figure 2.2: Proof that D0 must be bivalent if the two states preceding D1 are separated by messages intended for different processes

  1. If p′ ≠ p - i.e. the message that separates C0 and C1 is destined for a different processor than e is - then we can get to D1 by receiving e′ at D0. This is because if the destinations of a set of messages are all different, we can receive the messages in any order and reach the same configuration: all the processors see the same order of events, just at different times. (This is called the commutativity of schedules in the paper and is proved by its own lemma. I have left the proof out of this - already proof heavy - article, but there is a small sketch of the idea after Figure 2.3 below.) But D0 is 0-valent and D1 is 1-valent, so we should not be able to reach D1 from D0, as again this would imply the protocol had changed its mind. Therefore this is a contradiction.

  2. Instead consider p′ = p. Now take a finite deciding run from C0 (the configuration such that receiving e results in the 0-valent D0) in which p takes no steps - such a run must exist, to cope with the possibility of p failing. Let A be the configuration that results at the end of that deciding run. We can also apply that run to either D0 or D1 to get two new configurations E0 and E1, and by the commutativity of schedules we can equally receive e at A to get E0 (same set of messages, different delivery order), or e′ followed by e at A to get E1. (Remember that e and e′ are both available for receipt because p takes no steps in the deciding run.) But this implies that A is bivalent, which is a contradiction since the run to A is deciding.

Figure 2.3: Proof that no deciding run from C0 exists if D contains only univalent configurations, and both messages preceding D1 are delivered to the same host
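
As an aside, the commutativity of schedules leaned on in both cases is easy to see in miniature. In this toy sketch - again my own construction, not the paper’s - a processor’s state is just the list of messages it has received, so events addressed to different processors commute:

```python
import copy

def apply_event(config, event):
    """Receive one message: only the destination processor's state
    (here, its list of received messages) changes."""
    p, m = event
    config = copy.deepcopy(config)
    config[p].append(m)
    return config

config = {"p": [], "q": []}
e, e_prime = ("p", "m1"), ("q", "m2")

# Delivering e then e', or e' then e, yields the same configuration,
# because each processor sees the same sequence of its own messages.
assert (apply_event(apply_event(config, e), e_prime) ==
        apply_event(apply_event(config, e_prime), e))
```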

So, by contradiction, we have shown that D must contain a bivalent configuration.

Bringing it all together

The final step is to use the two lemmas to construct an infinite non-deciding run that is nevertheless admissible. This is done by piecing together sections of runs in such a way as to avoid ever making a decision - that is, to avoid ever entering a univalent configuration.

Start from a bivalent initial configuration C0 (which exists as a result of the first lemma). To make the run admissible, place the processors in the system in a queue and have them receive messages in queue order, each being placed at the back of the queue when it is done. This ensures that every message is eventually delivered.

Let e = (p, m) be the earliest pending message to the first processor in the queue (possibly the null message). Then by the second lemma there is a bivalent configuration C1, reachable from C0, in which e is the last message received. By the same argument we can reach another bivalent configuration C2 from C1, and so on forever. This run is admissible, but never reaches a decision, and it follows that the protocol is not totally correct.
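
The scheduling discipline here is plain round-robin, and the adversary’s loop is easy to sketch. The Python below is my own rendering, where extend_to_bivalent stands in for the second lemma: it returns a schedule, ending with the chosen event, that lands in another bivalent configuration. Such a schedule always exists by the lemma, though no real implementation of it is possible - that is rather the point.

```python
from collections import deque

def nondeciding_run(processes, config, earliest_message_for,
                    extend_to_bivalent, steps):
    """Build a prefix of the infinite non-deciding run. In the proof
    this loop continues forever; 'steps' merely bounds the sketch.
    'earliest_message_for' (hypothetical) returns the earliest pending
    message for a processor, or the null message θ."""
    queue = deque(processes)
    run = []
    for _ in range(steps):
        p = queue.popleft()
        e = (p, earliest_message_for(config, p))
        # By the second lemma, some schedule ending with e reaches
        # another bivalent configuration; the adversary follows it.
        schedule, config = extend_to_bivalent(config, e)
        run.extend(schedule)
        # Back of the queue: every processor is scheduled infinitely
        # often, so every message is eventually delivered.
        queue.append(p)
    return run, config
```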

Conclusion

Phew! That was a lot of technical proof to wade through. The meat of the proof is in the two lemmas. The first showed that there must be some initial configuration of the system in which the outcome of consensus is not already determined - you can’t just have a protocol that queries the state of the network and then indexes a massive look-up table, as failures or message delays are guaranteed to scupper that plan. This is the lemma where the possibility of failure is key: if there were no failures then the network state would be enough to determine consensus (because if there are no failures then message delay can be recognised for what it is).

The second lemma showed that, if a protocol starts in a bivalent configuration, it is always possible to steer it to another bivalent configuration by delaying a pending message. This can continue forever - even though all messages are eventually delivered - and so no protocol can guarantee that a univalent configuration will ever be reached.

What does this mean practically? The chance of entering that infinite undeciding loop can be made arbitrarily small - it’s one thing for nondecision to be possible, another thing entirely for it to be guaranteed. Still, there is much to be said for keeping this possibility in mind when your system livelocks during consensus and you don’t know why. Also, in practice, real systems aren’t always truly asynchronous.

This paper spawned a huge amount of subsequent work in the theoretical distributed systems community. Failure detectors were introduced by Chandra and Toueg to characterise just how much the asynchronous model needs to be strengthened to allow correct consensus. Randomised algorithms exist that drive the probability of nondecision to zero if run for long enough. Many proofs by reduction were published showing the equivalence of some problem to consensus, and therefore its impossibility. Different system models were postulated to circumvent the FLP result, including partial synchrony, which has had a lot of academic success.

The paper won the 2001 Dijkstra prize for the most influential paper in distributed computing. All three authors, Michael Fischer, Nancy Lynch and Mike Paterson are still very active and extremely eminent researchers in the field. Nancy Lynch, in particular, seems to have been involved in almost every important piece of work in distributed algorithms.
