Lesson 1: Introduction to Distributed Systems

0x00 Introduction

This post focuses on distributed systems: the basic concepts and some of the main topics.

0x01 Distributed Systems

1. Definition

Various definitions of distributed systems have been given in the literature. Here are two.

One:

A distributed system is a model in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal.

The other:

A distributed system is a collection of independent computers that appears to its users as a single coherent system.

Three significant characteristics of distributed systems are:

concurrency of components
lack of a global clock
independent failure of components

Some examples of distributed systems, all built from multiple cooperating computers: big databases, P2P file sharing, MapReduce, DNS, etc. Lots of critical infrastructure is distributed!

2. Why distributed?

to connect physically separate entities
to achieve security via isolation
to tolerate faults via replication
to scale up throughput via parallel CPUs/mem/disk/net

3. Difficulties?

complex: many concurrent parts
must cope with partial failure
tricky to realize performance potential

0x02 Topics

1. Performance

What we want: scalable throughput.

Nx servers -> Nx total throughput via parallel CPU, disk, net. So handling more load only requires buying more computers.

But scaling gets harder as N grows. Why?

Load imbalance, stragglers (some nodes are much slower than the others).
Non-parallelizable code: initialization, interaction.
Bottlenecks from shared resources, e.g. network.
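
The effect of non-parallelizable code and stragglers can be made concrete with a little arithmetic. The sketch below uses Amdahl's law, which is not named in the notes above but is the standard way to bound speedup when a fixed fraction of the work is serial. The function names and the 5% serial fraction are illustrative assumptions, not measurements.

```python
from typing import Sequence


def amdahl_speedup(n_servers: int, serial_fraction: float) -> float:
    """Upper bound on speedup with n_servers when serial_fraction of the
    work (e.g. initialization, interaction) cannot be parallelized."""
    if n_servers < 1:
        raise ValueError("n_servers must be >= 1")
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_servers)


def parallel_finish_time(task_times: Sequence[float]) -> float:
    """A parallel job finishes only when its slowest task (the straggler) does."""
    return max(task_times)


if __name__ == "__main__":
    # Assumed workload: 5% of the work is serial.
    for n in (1, 10, 100, 1000):
        print(n, "servers ->", round(amdahl_speedup(n, 0.05), 2), "x speedup")
    # Nine fast tasks and one straggler: the straggler sets the finish time.
    print(parallel_finish_time([1.0] * 9 + [5.0]))
```

With a 5% serial fraction the speedup can never exceed 20x no matter how many servers are added, which is why "Nx servers -> Nx throughput" is an ideal rather than a guarantee.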

2. Fault tolerance

1000s of servers, complex net -> always something broken. We'd like to hide these failures from the application.

What we want:

Availability -- app can keep using its data despite failures
Durability -- app's data will come back to life when failures are repaired

How: replicated servers.

If one server crashes, client can proceed using the other(s).
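
A minimal sketch of that idea, assuming in-memory replicas that all hold the same data. The class names (Replica, FailoverClient, ServerDown) are hypothetical and exist only for this illustration; a real client cannot see a crash directly and usually has to infer it from timeouts.

```python
class ServerDown(Exception):
    """Hypothetical signal that a replica cannot be reached."""


class Replica:
    """One copy of the data. `alive` simulates a crash."""

    def __init__(self, name, data):
        self.name = name
        self.data = dict(data)
        self.alive = True

    def get(self, key):
        if not self.alive:
            raise ServerDown(self.name)
        return self.data[key]


class FailoverClient:
    """Tries each replica in order and returns the first successful answer."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def get(self, key):
        for replica in self.replicas:
            try:
                return replica.get(key)
            except ServerDown:
                continue  # this replica is down, try the next one
        raise ServerDown("all replicas are down")


if __name__ == "__main__":
    data = {"x": 1}
    a, b = Replica("a", data), Replica("b", data)
    client = FailoverClient([a, b])
    a.alive = False          # simulate a crash of the first server
    print(client.get("x"))   # still answered, by replica b
```

The failure is hidden from the application as long as at least one replica is up. Keeping the replicas identical when data changes is the hard part, and is the subject of the next topic.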

3. Consistency

Consistency is an issue both for replicated objects and for transactions involving related updates to different objects (recall the ACID properties).

Achieving good behavior is hard!

"Replica" servers are hard to keep identical.
Clients may crash midway through a multi-step update.
Servers crash at awkward moments, e.g. after executing but before replying.
Network may make live servers look dead; risk of "split brain".

Consistency and performance are enemies.

Consistency requires communication, e.g. to get latest Put().
"Strong consistency" often leads to slow systems.
High performance often imposes "weak consistency" on applications.
People have pursued many design points in this spectrum.
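
The trade-off can be seen in a toy model. The sketch below is not a real replication protocol: it assumes a single process, versioned values, and a write that reaches only one replica until an explicit sync. All names (VersionedReplica, ReplicatedKV, put_lazy, weak_get, strong_get) are made up for this illustration. The message counter stands in for network cost.

```python
class VersionedReplica:
    """Stores key -> (version, value); a higher version wins."""

    def __init__(self):
        self.store = {}

    def apply(self, key, version, value):
        if version > self.store.get(key, (0, None))[0]:
            self.store[key] = (version, value)

    def read(self, key):
        return self.store.get(key, (0, None))


class ReplicatedKV:
    def __init__(self, n_replicas):
        self.replicas = [VersionedReplica() for _ in range(n_replicas)]
        self.next_version = 1
        self.messages = 0  # rough stand-in for communication cost

    def put_lazy(self, key, value):
        """Fast write: update replica 0 only; the others catch up on sync()."""
        self.replicas[0].apply(key, self.next_version, value)
        self.next_version += 1
        self.messages += 1

    def sync(self):
        """Propagate replica 0's state to every other replica."""
        for replica in self.replicas[1:]:
            for key, (version, value) in self.replicas[0].store.items():
                replica.apply(key, version, value)
                self.messages += 1

    def weak_get(self, key, replica_index):
        """Cheap read from one replica; may return stale data."""
        self.messages += 1
        return self.replicas[replica_index].read(key)[1]

    def strong_get(self, key):
        """Ask every replica and keep the highest version: sees the latest
        Put() in this toy model, at the cost of one message per replica."""
        self.messages += len(self.replicas)
        return max((r.read(key) for r in self.replicas), key=lambda vv: vv[0])[1]


if __name__ == "__main__":
    kv = ReplicatedKV(3)
    kv.put_lazy("x", "v1")
    print(kv.weak_get("x", 2))   # stale: replica 2 has not seen the Put yet
    print(kv.strong_get("x"))    # latest value, but contacts all replicas
```

The weak read costs one message and can miss the latest Put(); the strong read finds it but has to talk to every replica. Real systems sit at many points between these two extremes.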

References

https://www.cl.cam.ac.uk/teaching/0910/ConcDistS/11a-cons-tx.pdf
https://en.wikipedia.org/wiki/Consistency_model
https://pdos.csail.mit.edu/6.824/notes/l01.txt
https://en.wikipedia.org/wiki/Distributed_computing