Distributed systems theory for the distributed systems engineer
Gwen Shapira, who at the time was an engineer at Cloudera and now is spreading the Kafka gospel, asked a question on Twitter that got me thinking.
I need to improve my proficiency in distributed systems theory. Where do I start? Any recommended books?
— Gwen (Chen) Shapira (@gwenshap) August 7, 2014
My response of old might have been “well, here’s the FLP paper, and here’s the Paxos paper, and here’s the Byzantine generals paper…”,
and I’d have prescribed a laundry list of primary source material which would have taken at least six months to get through if you rushed.
But I’ve come to thinking that recommending a ton of theoretical papers is often precisely the wrong way to go about learning distributed systems theory (unless you are in a PhD program).
Papers are usually deep, usually complex, and require both serious study, and usually significant experience to glean their important contributions and to place them in context.
What good is requiring that level of expertise of engineers?
And yet, unfortunately, there’s a paucity of good ‘bridge’ material that summarises, distills and contextualises the important results and ideas in distributed systems theory;
particularly material that does so without condescending.
Considering that gap led me to another interesting question:
What distributed systems theory should a distributed systems engineer know?
A little theory is, in this case, not such a dangerous thing.
So I tried to come up with a list of what I consider the basic concepts that are applicable to my every-day job as a distributed systems engineer.
Let me know what you think I missed!
First steps
These four readings do a pretty good job of explaining what about building distributed systems is challenging.
Collectively they outline a set of abstract but technical difficulties that the distributed systems engineer has to overcome, and set the stage for the more detailed investigation in later sections.
Distributed Systems for Fun and Profit is a short book which tries to cover some of the basic issues in distributed systems including the role of time and different strategies for replication.
Notes on distributed systems for young bloods - not theory, but a good practical counterbalance to keep the rest of your reading grounded.
A Note on Distributed Systems - a classic paper on why you can’t just pretend all remote interactions are like local objects.
The fallacies of distributed computing - 8 fallacies of distributed computing that set the stage for the kinds of things system designers forget.
You should know about safety and liveness properties:
safety properties say that nothing bad will ever happen. For example, the property of never returning an inconsistent value is a safety property, as is never electing two leaders at the same time.
liveness properties say that something good will eventually happen. For example, saying that a system will eventually return a result to every API call is a liveness property, as is guaranteeing that a write to disk always eventually completes.
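Note the asymmetry in how these two kinds of property can be checked: a finite trace of a system’s execution can demonstrate a safety violation, but it can never refute a liveness property, because the good thing might still be about to happen. Here is a minimal sketch in Python; the trace format and the “single leader” property are illustrative assumptions of mine, not taken from any particular system.

```python
# A minimal sketch of checking properties over a finite trace of events.
# The trace format and the properties themselves are illustrative assumptions.

trace = [
    {"event": "elect", "node": "A"},
    {"event": "elect", "node": "B"},   # A never stepped down!
]

def violates_safety(trace):
    """Safety ('never two leaders at once') can be refuted by a finite trace."""
    leaders = set()
    for e in trace:
        if e["event"] == "elect":
            leaders.add(e["node"])
        elif e["event"] == "step_down":
            leaders.discard(e["node"])
        if len(leaders) > 1:
            return True   # the bad thing happened: property violated
    return False

def refutes_liveness(trace):
    """Liveness ('every request eventually gets a response') can never be
    refuted by a finite trace: the response might still be on its way."""
    return False

print(violates_safety(trace))   # True: two concurrent leaders observed
```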
Failure and Time
Many difficulties that the distributed systems engineer faces can be blamed on two underlying causes:
Processes may fail
There is no good way to tell that they have done so
There is a very deep relationship between what, if anything, processes share about their knowledge of time, what failure scenarios are possible to detect, and what algorithms and primitives may be correctly implemented.
Most of the time, we assume that two different nodes have absolutely no shared knowledge of what time it is, or how quickly time passes.
You should know:
The (partial) hierarchy of failure modes: crash stop -> omission -> Byzantine. You should understand that what is possible at the top of the hierarchy must be possible at lower levels, and what is impossible at lower levels must be impossible at higher levels.
How you decide whether an event happened before another event in the absence of any shared clock. This means Lamport clocks and their generalisation to Vector clocks, but also see the Dynamo paper. (A small sketch follows this list.)
How big an impact the possibility of even a single failure can actually have on our ability to implement correct distributed systems (see my notes on the FLP result below).
Different models of time: synchronous, partially synchronous and asynchronous
That detecting failures is a fundamental problem, one that trades off accuracy and completeness - yet another safety vs liveness conflict. The paper that really set out failure detection as a theoretical problem is Chandra and Toueg’s ‘Unreliable Failure Detectors for Reliable Distributed Systems’. But there are several shorter summaries around - I quite like this random one from Stanford.
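To make the happened-before relation concrete, here is a minimal vector clock sketch in Python. It illustrates the general technique only; it is not code from Lamport’s paper or from Dynamo.

```python
# A minimal vector clock sketch: each process keeps one counter per process.

class VectorClock:
    def __init__(self, process_id, num_processes):
        self.pid = process_id
        self.clock = [0] * num_processes

    def tick(self):
        """Local event: increment our own component."""
        self.clock[self.pid] += 1

    def send(self):
        """Attach a copy of the clock to an outgoing message."""
        self.tick()
        return list(self.clock)

    def receive(self, msg_clock):
        """Merge: component-wise maximum with the message's clock, then tick."""
        self.clock = [max(a, b) for a, b in zip(self.clock, msg_clock)]
        self.tick()

def happened_before(a, b):
    """True iff a <= b component-wise and a != b."""
    return all(x <= y for x, y in zip(a, b)) and a != b

# Two processes; p0 sends a message to p1 while p1 does concurrent local work.
p0, p1 = VectorClock(0, 2), VectorClock(1, 2)
msg = p0.send()                            # p0 is now [1, 0]
p1.tick()                                  # p1 is now [0, 1]
concurrent = list(p1.clock)
p1.receive(msg)                            # p1 is now [1, 2]

print(happened_before(msg, p1.clock))      # True: the send precedes the receive
print(happened_before(msg, concurrent))    # False: concurrent events...
print(happened_before(concurrent, msg))    # ...in both directions
```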
The basic tension of fault tolerance
A system that tolerates some faults without degrading must be able to act as though those faults had not occurred.
This means usually that parts of the system must do work redundantly, but doing more work than is absolutely necessary typically carries a cost both in performance and resource consumption.
This is the basic tension of adding fault tolerance to a system.
You should know:
The quorum technique for ensuring single-copy serialisability. See Skeen’s original paper, but perhaps better is Wikipedia’s entry. (A small sketch follows this list.)
About 2-phase-commit, 3-phase-commit and Paxos, and why they have different fault-tolerance properties.
How eventual consistency, and other techniques, seek to avoid this tension at the cost of weaker guarantees about system behaviour. The Dynamo paper is a great place to start, but also Pat Helland’s classic Life Beyond Transactions is a must-read.
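The arithmetic behind the quorum technique is worth internalising: if writes must be acknowledged by W of N replicas and reads consult R of them, then R + W > N guarantees that every read quorum overlaps every write quorum, so a read always sees the latest completed write. A small sketch, assuming non-Byzantine replicas and a simple version counter (both illustrative choices of mine, not from Skeen’s paper):

```python
# A sketch of quorum reads and writes over N replicas.
# R + W > N means every read quorum intersects every write quorum.

N, W, R = 5, 3, 3
assert R + W > N

replicas = [{"version": 0, "value": None} for _ in range(N)]

def write(value, version, replica_ids):
    """A write succeeds once W replicas have applied it."""
    assert len(replica_ids) >= W
    for i in replica_ids:
        replicas[i] = {"version": version, "value": value}

def read(replica_ids):
    """Read R replicas; the highest version wins. The overlap with the
    last write quorum guarantees at least one replica has seen it."""
    assert len(replica_ids) >= R
    return max((replicas[i] for i in replica_ids), key=lambda r: r["version"])

write("x=1", version=1, replica_ids=[0, 1, 2])   # any W of the N replicas
print(read(replica_ids=[2, 3, 4]))               # sees version 1 via replica 2
```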
Basic primitives
There are few agreed-upon basic building blocks in distributed systems, but more are beginning to emerge. You should know what the following problems are, and where to find a solution for them:
Leader election (e.g. the Bully algorithm)
Consistent snapshotting (e.g. this classic paper from Chandy and Lamport)
Consensus (see 2PC and Paxos above)
Distributed state machine replication (Wikipedia is ok, Lampson’s paper is canonical but dry).
Broadcast - delivering messages to more than one node at once
Atomic broadcast - can you deliver a message to all nodes in a group, or none?
Gossip (the classic paper)
[Causal multicast](https://www.cs.cornell.edu/courses/cs614/2003sp/papers/BSS91.pdf) (but also consider the enjoyable back-and-forth between Birman and Cheriton: [the critique](https://www.cs.rice.edu/~alc/comp520/papers/Cheriton_Skeen.pdf) and [the response](https://www.cs.princeton.edu/courses/archive/fall07/cos518/papers/catocs-limits-response.pdf)).
Chain replication (a neat way of ensuring consistency and ordering of writes by organizing nodes into a virtual linked list; a small sketch follows this list).
The original paper
A series of improvements for read-mostly workloads
An experience report by @slfritchie
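To see the shape of chain replication, here is a toy sketch in Python: writes enter at the head and are forwarded along the chain, and reads are served by the tail, which only holds writes that every node has already applied. Failure handling and reconfiguration, which are where the real substance of the papers above lies, are deliberately omitted.

```python
# A toy chain replication sketch: writes flow head -> tail, reads hit the tail.
# Failure handling and reconfiguration (the hard parts) are omitted.

class Node:
    def __init__(self, name):
        self.name = name
        self.store = {}
        self.next = None            # successor in the chain

    def write(self, key, value):
        self.store[key] = value
        if self.next is not None:
            self.next.write(key, value)   # forward down the chain
        # in the real protocol, the tail acknowledges to the client here

def make_chain(names):
    nodes = [Node(n) for n in names]
    for a, b in zip(nodes, nodes[1:]):
        a.next = b
    return nodes[0], nodes[-1]      # (head, tail)

head, tail = make_chain(["n1", "n2", "n3"])
head.write("k", 42)                 # clients send writes to the head
print(tail.store["k"])              # clients read from the tail: 42
```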
Fundamental Results
Some facts just need to be internalised. There are more than this, naturally, but here’s a flavour:
You can’t implement consistent storage and respond to all requests if you might drop messages between processes. This is the CAP theorem. (A sketch of the trade-off follows this list.)
Consensus is impossible to implement in such a way that it both a) is always correct and b) always terminates if even one machine might fail in an asynchronous system with crash-stop failures (the FLP result). The first slides - before the proof gets going - of my Papers We Love SF talk do a reasonable job of explaining the result, I hope. Suggestion: there’s no real need to understand the proof.
Consensus is impossible to solve in fewer than 2 rounds of messages in general.
Atomic broadcast is exactly as hard as consensus - in a precise sense, if you solve atomic broadcast, you solve consensus, and vice versa. Chandra and Toueg prove this, but you just need to know that it’s true.
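As a rough illustration of the CAP trade-off (an illustration, not a proof), consider a two-replica register during a network partition: each read must either answer from possibly-stale local state or refuse to answer. The replica model below is entirely made up for this sketch.

```python
# A made-up two-replica register showing the choice CAP forces during a
# partition: answer anyway (available, maybe stale) or refuse (consistent).

class Replica:
    def __init__(self):
        self.value = 0
        self.partitioned = False    # can this replica reach its peer?

def write(replica, peer, value):
    replica.value = value
    if not replica.partitioned:
        peer.value = value          # replicate while the network allows

def read(replica, mode):
    if replica.partitioned and mode == "CP":
        raise RuntimeError("unavailable: cannot confirm the latest value")
    return replica.value            # "AP": answer anyway, possibly stale

a, b = Replica(), Replica()
write(a, b, 1)                      # replicated to b
a.partitioned = b.partitioned = True
write(a, b, 2)                      # b never hears about this write

print(read(b, mode="AP"))           # 1: available, but stale
try:
    read(b, mode="CP")
except RuntimeError as err:
    print(err)                      # consistent, but not available
```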
Real systems
The most important exercise to repeat is to read descriptions of new, real systems, and to critique their design decisions. Do this over and over again. Some suggestions:
Google:
GFS
Spanner
F1
Chubby
BigTable
MillWheel
Omega
Dapper
Paxos Made Live
The Tail At Scale
Not Google:
Dryad
Cassandra
Ceph
RAMCloud
HyperDex
PNUTS
Azure Data Lake Store
Postscript
If you tame all the concepts and techniques on this list, I’d like to talk to you about engineering positions working with the menagerie of distributed systems we curate at Cloudera.