
0x00 Preface

These notes are compiled from part of the first lecture of MIT 6.824, supplemented with material from a few books I consulted.

Resource sharing is the main motivation for constructing distributed systems! -- Distributed Systems: Concepts and Design, Chapter 1

0x01 What is a distributed system



A distributed system is a model in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal.


A distributed system is a collection of independent computers that appears to its users as a single coherent system.
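
To make "communicate and coordinate their actions by passing messages" concrete, here is a minimal sketch. It assumes two threads and two in-process queues as stand-ins for networked computers and the network; the names `worker` and `run_demo` are hypothetical and only illustrate the idea that the components share no state and interact solely through messages.

```python
import queue
import threading


def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """A component that shares no state with its peer; it only receives and sends messages."""
    while True:
        msg = inbox.get()
        if msg is None:  # sentinel message: the coordinator asks us to stop
            break
        outbox.put(("squared", msg, msg * msg))


def run_demo(values):
    """Coordinator: sends requests as messages, then collects the replies."""
    to_worker: queue.Queue = queue.Queue()
    from_worker: queue.Queue = queue.Queue()
    t = threading.Thread(target=worker, args=(to_worker, from_worker))
    t.start()
    for v in values:
        to_worker.put(v)
    to_worker.put(None)
    t.join()  # after join, every reply is already in from_worker
    results = []
    while not from_worker.empty():
        results.append(from_worker.get())
    return results
```

Calling `run_demo([2, 3])` returns the worker's replies in order. In a real system the queues would be replaced by sockets or RPC, and messages could be delayed or lost, which is where most of the difficulty discussed below comes from.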



  • Concurrency of components
  • Lack of a global clock
  • Independent failure of components



  • to connect physically separate entities
  • to achieve security via isolation
  • to tolerate faults via replication
  • to scale up throughput via parallel CPUs/mem/disk/net



  • complex: many concurrent parts
  • must cope with partial failure
  • tricky to realize performance potential


0x02 A few topics


So how do you design a perfect distributed system? I don't know either, so let's keep learning for now...

1. Consistency

Consistency is an issue for both replicated objects and transactions involving related updates to different objects (recall the ACID properties).


Achieving good behavior is hard!

  • "Replica" servers are hard to keep identical.
  • Clients may crash midway through multi-step update.
  • Servers crash at awkward moments, e.g. after executing but before replying.
  • Network may make live servers look dead; risk of "split brain".
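
The "after executing but before replying" case is easy to see in code. The sketch below is a toy, single-process model: `AppendServer`, `client_append`, and the `drop_reply` flag are hypothetical names I made up to simulate a lost reply. Tagging each request with an ID and caching the reply is one common remedy, not something the lecture notes above prescribe.

```python
class AppendServer:
    """Toy server whose append operation is not idempotent."""

    def __init__(self, dedup: bool):
        self.log = []
        self.dedup = dedup
        self.seen = {}  # request_id -> cached reply (used only when dedup=True)

    def append(self, request_id, value, drop_reply=False):
        if self.dedup and request_id in self.seen:
            reply = self.seen[request_id]  # duplicate: do not execute again
        else:
            self.log.append(value)  # the operation executes here
            reply = len(self.log)
            if self.dedup:
                self.seen[request_id] = reply
        if drop_reply:
            # Simulates a crash or network failure after executing, before replying.
            raise TimeoutError("executed, but the reply was lost")
        return reply


def client_append(server, request_id, value, lose_first_reply):
    """Client that retries once, reusing the same request ID."""
    try:
        return server.append(request_id, value, drop_reply=lose_first_reply)
    except TimeoutError:
        return server.append(request_id, value)
```

Without duplicate detection, a retry after a lost reply applies the append twice; with it, the server returns the cached reply and the log holds the value once. The client cannot tell the two failure modes apart (request lost vs. reply lost), which is why the server has to help.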

Consistency and performance are enemies.

  • Consistency requires communication, e.g. to get latest Put().
  • "Strong consistency" often leads to slow systems.
  • High performance often imposes "weak consistency" on applications.

People have pursued many design points in this spectrum.
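
Here is a rough sketch of why consistency costs communication. It assumes a toy in-memory store; `ReplicatedKV`, `reach`, `get_weak`, and `get_strong` are hypothetical names, and counting replica contacts in `messages` is my own crude proxy for network cost. Reading every replica and taking the newest version is only a stand-in for strong consistency, not a full protocol.

```python
class Replica:
    def __init__(self):
        self.data = {}  # key -> (version, value)


class ReplicatedKV:
    def __init__(self, n):
        self.replicas = [Replica() for _ in range(n)]
        self.messages = 0  # number of replica contacts, a proxy for communication cost
        self.version = 0

    def put(self, key, value, reach=1):
        """Write to the first `reach` replicas only; the rest stay stale."""
        self.version += 1
        for r in self.replicas[:reach]:
            r.data[key] = (self.version, value)
            self.messages += 1

    def get_weak(self, key, replica_index):
        """Ask a single replica: cheap, but may return stale data."""
        self.messages += 1
        return self.replicas[replica_index].data.get(key, (0, None))[1]

    def get_strong(self, key):
        """Ask every replica and return the value with the highest version."""
        self.messages += len(self.replicas)
        candidates = [r.data.get(key, (0, None)) for r in self.replicas]
        return max(candidates, key=lambda pair: pair[0])[1]
```

If a Put() has reached only one of three replicas, `get_weak` on another replica returns the old value after one message, while `get_strong` returns the latest value but contacts all three. That trade is the spectrum mentioned above.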

2. Fault tolerance

1000s of servers, complex net -> always something broken. We'd like to hide these failures from the application.

What we want:

  • Availability -- app can keep using its data despite failures
  • Durability -- app's data will come back to life when failures are repaired

How: replicated servers.

If one server crashes, client can proceed using the other(s).
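
A minimal sketch of "proceed using the other(s)", assuming each replica holds the same data and a crashed server shows up as a `ConnectionError`. `Server` and `get_with_failover` are hypothetical names used only for illustration.

```python
class Server:
    """Toy replica; `alive = False` simulates a crash."""

    def __init__(self, name, data):
        self.name = name
        self.data = dict(data)
        self.alive = True

    def get(self, key):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.data[key]


def get_with_failover(servers, key):
    """Try each replica in turn and return the first successful reply."""
    last_error = None
    for s in servers:
        try:
            return s.get(key)
        except ConnectionError as e:
            last_error = e  # remember the failure and move on to the next replica
    raise ConnectionError("all replicas failed") from last_error
```

The client only fails when every replica is down. Note that this sketch quietly assumes the replicas are identical, which is exactly the consistency problem from the previous topic.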

3. Performance

What we want: scalable throughput.

Nx servers -> Nx total throughput via parallel CPU, disk, net. So handling more load only requires buying more computers.

But scaling gets harder as N grows. Why?

  • Load imbalance, stragglers (some nodes are much slower than others).
  • Non-parallelizable code: initialization, interaction.
  • Bottlenecks from shared resources, e.g. network.
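
Two of these limits can be put into numbers. The sketch below uses Amdahl's law as the model for non-parallelizable code (my choice of model, the lecture notes do not name it) and treats a parallel job as finishing when its slowest server finishes. The function names are hypothetical.

```python
def amdahl_speedup(serial_fraction: float, n: int) -> float:
    """Amdahl's law: speedup on n servers when `serial_fraction` of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)


def parallel_finish_time(per_server_times):
    """A job split across servers is done only when the slowest server (the straggler) is done."""
    return max(per_server_times)
```

With a 10% serial part, `amdahl_speedup(0.1, 10)` is about 5.3 rather than 10, and no number of servers gets past 10x. Likewise, if three servers take 1 second and one straggler takes 5, `parallel_finish_time([1.0, 1.0, 1.0, 5.0])` is 5.0, so the fast servers sit idle most of the time.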

0xFF Summary



  • http://www.jianshu.com/p/2ed7ec08d0c3
  • Distributed Systems: Concepts and Design
  • https://pdos.csail.mit.edu/6.824/

