http://www.bailis.org/blog/linearizability-versus-serializability
In this chapter, we will talk about some examples of algorithms and protocols for building fault-tolerant(可用性?)
distributed systems. 本章讲什么
The best way of building fault-tolerant systems is to find some general-purpose abstractions with useful guarantees, implement them once, and then let applications rely on those guarantees.
Linearizability: 提供 stronger consistency
, make replicated data appear as though there were only a single copy
, and to make all operations act on it atomically
Causality: which puts all operations in a single, totally ordered
timeline, causality provides us with a weaker consistency
model
Consensus algorithms are a huge breakthrough for distributed systems: they bring concrete
safety properties
(agreement, integrity, and validity) to systems where everything else is uncertain, and they nevertheless remainfault-tolerant
(able to make progress as long as a majority of nodes are working and reachable). They providetotal order broadcast
, and therefore they can also implement linearizable atomic operations in a fault-tolerant way.
定义: Deciding something in such a way that all nodes agree on what was decided, and such that the decision is irrevocable.
使用场景:
Zookeeper:
分布式 db 中,由于网络等因素,数据不一致一定会发生,因此 Most replicated databases provide at least eventual consistency
。
eventual consistency: 所有 replicas 的数据最终会达到一致。
when
the replicas will convergestronger consistency: 所有 replicas 数据总是保持一致
distributed consistency is mostly about
coordinating
the state of replicas in the face of delays and faults.
3 个特质:
定义:
Linearizability(atomic consistency) is a guarantee about single operations
on single objects.
It provides a real-time
(i.e., wall-clock) guarantee on the behavior of a set of single operations (often reads and writes) on a single object。
Linearizability is a recency guarantee
(once a new value has been written or read, all subsequent reads see the value that was written, until it is overwritten again) on reads and writes of a register (an individual object). It doesn’t group operations
together into transactions.
Vs Serializability:
linearizability can be viewed as a special case of strict serializability where transactions are restricted to consist of a single operation applied to a single object.
[http://www.bailis.org/blog/linearizability-versus-serializability/]
使用场景:
The most common approach to making a system fault-tolerant is to use replication? 怎么说
Single-leader replication(potentially linearizable): If you make reads from the leader, or from synchronously updated followers, they have the potential to be linearizable.
Consensus algorithms(linearizable): consensus protocols contain measures to prevent split brain and stale replicas.
Multi-leader replication(not linearizable): because they concurrently process writes on multiple nodes and asynchronously replicate them to other nodes.
Leaderless replication(probably not linearizable): sometimes claim that you can obtain “strong consistency” by requiring quorum reads and writes (w + r > n)
简单的使用 quorums,即使满足 w + r > n, 也不一定是 linearizable, 如:
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-PpyVwbKp-1579403595685)(evernotecid://A5948D7B-23EE-4089-A2BB-79DB53BFAC02/appyinxiangcom/26101857/ENResource/p45)]
The CAP theorem:
一般来说,分区容错无法避免,因此可以认为 CAP 的 P 总是成立。CAP 定理告诉我们,剩下的 C 和 A 无法同时做到。
CAP 的定义比较局限,The CAP theorem as formally defined is of very narrow scope: it only considers one consistency model (namely linearizability) and one kind of fault (network partitions,vi or nodes that are alive but disconnected from each other). It doesn’t say anything about network delays, dead nodes, or other trade-offs.
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-XbD8HtS6-1579403595686)(evernotecid://A5948D7B-23EE-4089-A2BB-79DB53BFAC02/appyinxiangcom/26101857/ENResource/p46)]
Linearizability and network delays:
A faster algorithm for linearizability does not exist, but weaker consistency models can be much faster, so this trade-off is important for latency-sensitive systems.
It turns out that there are deep connections between ordering, linearizability, and consensus.
Ordering helps preserve causality:
Causality imposes an ordering
on events,These chains of causally dependent operations define the causal order in the system—i.e., what happened before what.
Causality:
在 Casual 中没有依赖的操作可以并发执行,因此这些操作无法比较,所以 Casual 是 partial order(偏序关系).
Linearizability:
In a linearizable system, we have a total order of operations. Therefore, according to this definition, there are no concurrent operations in a linearizable datastore.
Linearizability implies causality
is what makes linearizable systems simple to understand and appealing.
Causal consistency
is the strongest possible consistency model that does not slow down due to network delays, and remains available in the face of network failures.
In order to maintain causality, you need to know which operation happened before which other operation.
In order to determine causal dependencies, we need some way of describing the “knowledge” of a node in the system.
Causal consistency goes further: it needs to track causal dependencies across the entire database, not just for a single key
In order to determine the causal ordering, the database needs to know which version of the data was read by the application
使用 logical clock 给 event 编号,we can use sequence numbers or timestamps to order events, and they provide a total order.
We can create equence numbers in a total order that is consistent with causality,先发生的 event 的 number 更小.
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-FVTv2jFg-1579403595686)(evernotecid://A5948D7B-23EE-4089-A2BB-79DB53BFAC02/appyinxiangcom/26101857/ENResource/p50)]
Each node has a unique identifier
,and each node keeps a counter
of the number of operations it has processed. The Lamport timestamp is then simply a pair of (counter, node ID). It provides total ordering.
原理解释:
This idea of knowing when your total order is
finalized
is captured in the topic of total order broadcast.
Total order broadcast is usually described as a protocol for exchanging messages
between nodes, Two safety properties always be satisfied:
This is no coincidence: it can be proved that a linearizable compare-and-set (or increment-and-get) register and total order broadcast are both equivalent to consensus. That is, if you can solve one of these problems, you can transform it into a solution for the others. This is quite a profound and surprising insight!
Atomic commit,即保证分布式事务的原子性,需要依赖 consensus algo,2PC is a kind of consensus algorithm, which solving atomic commit.
Atomicity prevents failed transactions from littering the database with half-finished results and half-updated state.
存储硬件层面保证:
Thus, it is a single device (the controller of one particular disk drive, attached to one particular node) that makes the commit atomic.
节点层面保证:
A node must only commit once it is certain that all other nodes in the transaction are also going to commitA transaction commit must be irrevocable。
应用层面保证:
However, from the database’s point of view this is a separate transaction, and thus any cross-transaction correctness
requirements are the application’s problem
Two-phase commit is an algorithm for achieving atomic transaction commit across multiple nodes—i.e., to ensure that either all nodes commit or all nodes abort.
在分布式系统里,每个节点都可以知晓自己操作的成功或者失败,却无法知道其他节点操作的成功或失败。当一个事务跨多个节点时,为了保持事务的原子性与一致性,需要引入一个协调(Coordinator)来统一掌控所有参与者(Participant)的操作结果,并指示它们是否要把操作结果进行真正的提交或者回滚.
2PC顾名思义分为两个阶段,其实施思路可概括为:
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-A9oElRSG-1579403595686)(evernotecid://A5948D7B-23EE-4089-A2BB-79DB53BFAC02/appyinxiangcom/26101857/ENResource/p47)]
Much of the performance cost inherent in two-phase commit is due to the additional disk forcing (fsync) that is required for crash recovery [88], and the additional network round-trips.
2PC can become stuck waiting for the coordinator to recover.
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-OLcLfF8h-1579403595687)(evernotecid://A5948D7B-23EE-4089-A2BB-79DB53BFAC02/appyinxiangcom/26101857/ENResource/p48)]
The only way 2PC can complete is by waiting for the coordinator to recover. This is why the coordinator must write its commit or abort decision to a transaction log on disk before sending commit or abort requests to participants:when the coordinator recovers, it determines the status of all in-doubt transactions by reading its transaction log. Any transactions that don’t have a commit record in the coordinator’s log are aborted. Thus, the commit point of 2PC comes down to a regular single-node atomic commit on the coordinator.
Two types of distributed transactions are often conflated:
Thus, by atomically committing the message and the side effects of its processing, we can ensure that the message is effectively processed exactly once
, even if it required a few retries before it succeeded.
Such a distributed transaction is only possible if all systems affected by the transaction are able to use the same atomic commit protocol.
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-IkjhNBwl-1579403595687)(evernotecid://A5948D7B-23EE-4089-A2BB-79DB53BFAC02/appyinxiangcom/26101857/ENResource/p49)]
The transaction coordinator implements the XA API.
The database cannot release those locks until the transaction commits or aborts, Therefore, when using two-phase commit, a transaction must hold onto the locks throughout the time it is in doubt.
This can cause large parts of your application to become unavailable until the in-doubt transaction is resolved.
Orphaned in-doubt transactions(如transaction log lost) cannot be resolved automatically, so they sit forever in the database, holding locks and blocking other transactions,只能通过管理员手动解决.
Many XA implementations have an emergency escape hatch called heuristic decisions: allowing a participant to unilaterally decide to abort or commit an in-doubt transaction without a definitive decision from the coordinator.
对于 XA transactions, the key realization is that the transaction coordinator is itself a kind of database (in which transaction outcomes are stored), and so it needs to be approached with the same care as any other important database.
思想: Everyone decides on the same outcome, and once you have decided, you cannot change your mind, a consensus algorithm must satisfy the following properties:
fault tolerance
: 当有 node crash,也可以达成决策Fault Tolerance:
never comes back
, 以避免无限等待 node recover.quorum
来允许部分 node crash.Consensu algorithm: vsr, paxos, raft, zab…
这些算法不直接使用上述 Consensus 模型,而是 they decide on a sequence of values
, which makes them total order broadcast
algorithms.
So, total order broadcast is equivalent to repeated rounds of consensus
(each consensus decision corresponding to one message delivery).
例子: [https://www.cnblogs.com/j-well/p/7061091.html]
Consensus algo 使用 leader,但是不保证 leader 唯一,所以需要解决选主问题:
当有多个 leader 时候,use a leader in some form or another, but they don’t guarantee that the leader is unique.
Node 为了确定自己是 leader,每次操作前需要举行投票确认自己的身份,因此 we have two rounds of voting: once to choose a leader, and a second time to vote on a leader’s proposal.
Consensus algorithms define a recovery process by which nodes can get into a consistent state after a new leader is elected, ensuring that the safety properties are always met.
影响性能
ZooKeeper 使用场景: