Cloud Computing on Coursera Week4

Cloud Computing Week4

===================

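Week 4 of Coursera's Cloud Computing course was tough: I spent two full days on it and still scored only 80 on the final quiz. Some questions were vaguely worded and I was not sure how to read them. I clearly need to shore up my distributed systems background, so here are my notes on the course content.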

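Key-value storage (NoSQL) is a storage architecture proposed as an alternative to traditional databases. The course contrasts scale up with scale out. Scale up means replacing machines with more powerful ones to increase capacity; its drawback is the very high cost. Scale out means incrementally growing the cluster by adding more COTS (components off the shelf) machines: each round retires some machines (because they are old or slow) and adds some new ones. Unlike scale up, it replaces only part of the fleet each time instead of swapping everything out.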

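Typical NoSQL systems are Cassandra (originally from Facebook) and HBase (the open-source Bigtable-style store, associated here with Yahoo). Taking Cassandra as the example: as a storage system that can span multiple data centers, it first has to decide which server a key-value pair resides on (the key-to-server mapping).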

There are two main replication strategies:

  1. SimpleStrategy
    RandomPartitioner: Chord-like hash partitioning
    ByteOrderedPartitioner: assigns ranges of keys to servers without hashing
  2. NetworkTopologyStrategy
    Per-DC deployment: place the first replica according to the partitioner, then go clockwise around the ring until hitting a different rack
    Snitches: SimpleSnitch / RackInferring snitch. For example: X.101.123.111, where 101, 123, 111 represent the DC, rack, and node (cluster), respectively.
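
To make the placement rule concrete, here is a minimal sketch of "first replica by partitioner, then clockwise until a different rack". It is my own illustration, not Cassandra's code: the helper names (`ring_position`, `infer_dc_rack`, `pick_replicas`), the use of MD5, and the 32-bit ring size are all assumptions.

```python
import hashlib
from bisect import bisect_left


def ring_position(name: str, ring_bits: int = 32) -> int:
    """Hash a key or server address onto a ring of size 2**ring_bits."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** ring_bits)


def infer_dc_rack(ip: str) -> tuple[str, str]:
    """RackInferring-style guess: read x.<DC>.<rack>.<node> from the address."""
    _, dc, rack, _ = ip.split(".")
    return dc, rack


def pick_replicas(key: str, server_ips: list[str], n_replicas: int) -> list[str]:
    """First replica by ring position, then walk clockwise preferring new racks."""
    if n_replicas <= 0 or not server_ips:
        return []
    ring = sorted(server_ips, key=ring_position)
    positions = [ring_position(ip) for ip in ring]
    # The first server at or after the key's position owns the key (wrap around).
    start = bisect_left(positions, ring_position(key)) % len(ring)
    clockwise = [ring[(start + i) % len(ring)] for i in range(len(ring))]

    chosen = [clockwise[0]]
    seen_racks = {infer_dc_rack(clockwise[0])}
    for ip in clockwise[1:]:
        if len(chosen) == n_replicas:
            break
        if infer_dc_rack(ip) not in seen_racks:
            chosen.append(ip)
            seen_racks.add(infer_dc_rack(ip))
    # Assumption: if there are not enough distinct racks, reuse racks clockwise.
    for ip in clockwise:
        if len(chosen) == n_replicas:
            break
        if ip not in chosen:
            chosen.append(ip)
    return chosen
```

Calling `pick_replicas("user:42", ["10.1.1.1", "10.1.1.2", "10.1.2.1", "10.1.3.1"], 3)` returns three servers on three different racks, because the second server on rack 1 is skipped while unseen racks remain.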

Once we know which servers hold a key, the next question is how to write. Writes need to be lock-free and fast. Cassandra's approach is to route each write through a coordinator node.

The write steps are:

  1. The client sends a write to one coordinator node in the Cassandra cluster
  2. The coordinator uses the Partitioner to send the query to all replica nodes responsible for the key
  3. When X replicas respond, the coordinator returns an acknowledgement to the client

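Writes come with a default mechanism called Hinted Handoff. If a replica is down, the coordinator writes to the other replicas and keeps the write locally until the down replica comes back, then replays the write to it. If all replicas are down, the coordinator keeps buffering the writes.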
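
The coordinator steps and Hinted Handoff can be sketched as a toy in-memory model. This is only an illustration under my own assumptions: `Replica`, `Coordinator`, `required_acks` (the X above) and `replay_hints` are hypothetical names, and a real coordinator would pick the replica set per key via the partitioner and talk to replicas over the network.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    up: bool = True
    data: dict = field(default_factory=dict)

    def write(self, key, value) -> bool:
        """Apply the write and acknowledge it, unless this replica is down."""
        if not self.up:
            return False
        self.data[key] = value
        return True


@dataclass
class Coordinator:
    replicas: list
    hints: list = field(default_factory=list)  # (replica name, key, value)

    def write(self, key, value, required_acks: int) -> bool:
        """Send the write to every replica; succeed once enough have acked."""
        acks = 0
        for replica in self.replicas:
            if replica.write(key, value):
                acks += 1
            else:
                # Hinted handoff: keep the write locally for the down replica.
                self.hints.append((replica.name, key, value))
        return acks >= required_acks

    def replay_hints(self) -> None:
        """Re-send stored writes to replicas that have come back up."""
        by_name = {r.name: r for r in self.replicas}
        remaining = []
        for name, key, value in self.hints:
            if not by_name[name].write(key, value):
                remaining.append((name, key, value))
        self.hints = remaining
```

With three replicas and one of them down, a write with `required_acks=2` still succeeds, and the missed write reaches the third replica when `replay_hints()` runs after it recovers.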

When a replica receives a write:

  1. Log it in the on-disk commit log for failure recovery (if a failure occurs, the write can be recovered from the log).
  2. Make changes to the appropriate memtables (a memtable is an in-memory representation of multiple key-value pairs; think of it as an in-memory write buffer).
  3. Later, when a memtable is full or old, flush it to disk. Specifically, one SSTable (Sorted String Table) serves as the data file and another SSTable of key-position pairs serves as the index file.
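
The three steps above can be sketched in memory as follows. This is a simplified illustration, not Cassandra's storage engine: the class name `ReplicaStore`, the `memtable_limit` threshold, the use of Python lists in place of disk files, and the `read` helper are all my own assumptions.

```python
import json


class ReplicaStore:
    def __init__(self, memtable_limit: int = 2):
        self.commit_log = []   # stand-in for the on-disk commit log
        self.memtable = {}     # in-memory key-value pairs
        self.sstables = []     # each entry: (sorted data file, index file)
        self.memtable_limit = memtable_limit

    def write(self, key, value) -> None:
        # Step 1: record the write in the commit log before anything else.
        self.commit_log.append(json.dumps({"key": key, "value": value}))
        # Step 2: apply the change to the memtable.
        self.memtable[key] = value
        # Step 3: flush once the memtable is full.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        """Write the memtable out as a sorted data file plus a key-position index."""
        data = sorted(self.memtable.items())
        index = {key: pos for pos, (key, _) in enumerate(data)}
        self.sstables.append((data, index))
        self.memtable = {}

    def read(self, key):
        """Check the memtable first, then SSTables from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for data, index in reversed(self.sstables):
            if key in index:
                return data[index[key]][1]
        return None
```

The important design point is the ordering inside `write`: the commit log entry is appended before the memtable changes, so a crash after step 1 still leaves enough information to recover the write.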

To check quickly whether a key might be present, Cassandra uses a Bloom filter.

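A Bloom filter is an m-bit map. When a key is inserted, the filter applies K hash functions to the key to get K hash values h1...hK, and sets bits h1...hK of the map to 1 (bits that are already 1 are left alone). To check whether a key exists, compute the same K hashes and check whether all K bits are 1; if they are, the key is considered present. This can yield false positives but never false negatives.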
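
Here is a minimal Bloom filter sketch matching that description. The class and method names are mine, and deriving the K hash functions by salting SHA-256 with the index `i` is an assumption made for simplicity; real implementations use cheaper hash functions.

```python
import hashlib


class BloomFilter:
    def __init__(self, m_bits: int = 1024, k_hashes: int = 3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = [0] * m_bits

    def _positions(self, key: str):
        """Yield the K bit positions h1...hK for this key."""
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key: str) -> None:
        """Set all K bits for the key (bits already set stay set)."""
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        """True if all K bits are 1: the key is probably present."""
        return all(self.bits[pos] for pos in self._positions(key))
```

After `bf.add("apple")`, `bf.might_contain("apple")` is always true. A key that was never added usually returns false, but can return true if other keys happened to set all of its K bits.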

Cassandra uses a gossip-based method to maintain the membership list, with an accrual detector serving as the failure detector that checks whether heartbeats are arriving.

Before explaining how Cassandra chooses how many replicas to involve, the course introduces a heavyweight theorem.

CAP Theorem

In a distributed system, you can satisfy at most 2 of the following 3 guarantees:

  1. Consistency: all nodes see the same data at any time, or reads return the latest value written by any client.
  2. Availability: the system allows operations all the time, and operations return quickly.
  3. Partition-tolerance: the system continues to work in spite of network partitions.

In real systems, guarantees 2 and 3 matter more, so Cassandra does not require strong consistency and offers eventual consistency instead. Eventual consistency does imply that the set of replicas is always trying to catch up and converge to the latest writes.

In addition, Cassandra-like key-value stores introduce the notion of a consistency level: a write consistency level of size W and a read consistency level of size R.

Assume there are N replicas of each key, and N is an even integer that is large enough. You are told that to maintain the strong consistency needed by your application, all conflicting writes must be detected by at least one replica (i.e., any two sets of written replicas must overlap) and a read must return the value of the latest acknowledged write (i.e., a read replica set must overlap with every written replica set).

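Given these requirements, any two written replica sets must overlap, so W > N/2 (two sets of size W drawn from N replicas are guaranteed to share a member only if 2W > N). A read replica set must also overlap every written replica set, so W + R > N.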
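
As a sanity check on the two inequalities, the sketch below compares them with a brute-force enumeration of every possible replica set for small N. The function names are my own; this only illustrates the counting argument and says nothing about how Cassandra implements consistency levels.

```python
from itertools import combinations


def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """The two quorum conditions: W > N/2 and W + R > N."""
    return w > n / 2 and w + r > n


def overlaps_by_brute_force(n: int, w: int, r: int) -> bool:
    """Enumerate all write sets of size w and read sets of size r over n replicas."""
    replicas = range(n)
    write_sets = [set(c) for c in combinations(replicas, w)]
    read_sets = [set(c) for c in combinations(replicas, r)]
    # Every pair of write sets must share at least one replica.
    writes_overlap = all(a & b for a in write_sets for b in write_sets)
    # Every read set must share at least one replica with every write set.
    reads_overlap = all(rd & wr for rd in read_sets for wr in write_sets)
    return writes_overlap and reads_overlap
```

For example, with N = 4 the choice W = 3, R = 2 satisfies both conditions, while W = 2, R = 3 fails because two write sets of size 2 can be disjoint.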

The key point about HBase: in order to guarantee that updates are not lost due to a crash, the HLog needs to be written before an update is added to the MemStore. I will not go into more detail here.

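Finally, synchronization. In a distributed system, servers exchange heartbeats, writes, reads and so on, so synchronization between them is extremely important: it determines the order of data transfers between servers and can even affect the failure detector's decisions.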

Two approaches to synchronization

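  1. Calibrate the clocks across servers to reduce clock skew. Common algorithms are Cristian's algorithm and the Network Time Protocol (NTP).
  2. Leave the server clocks uncalibrated and instead attach a timestamp to every message the servers exchange, which establishes the order of events. Because these timestamps respect the causality of events, this handles the ordering problem well. Common methods are Lamport timestamps and vector clocks.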
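
The second approach can be sketched with the standard update rules for Lamport timestamps and vector clocks. The class and function names are my own, and messages are modeled as plain method calls rather than network sends.

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """Local event or send: increment the counter and stamp the event."""
        self.time += 1
        return self.time

    def receive(self, sender_time: int) -> int:
        """On receive: jump past both the local and the sender's timestamp."""
        self.time = max(self.time, sender_time) + 1
        return self.time


class VectorClock:
    def __init__(self, n_processes: int, me: int):
        self.clock = [0] * n_processes
        self.me = me

    def tick(self) -> list:
        """Local event or send: increment only this process's entry."""
        self.clock[self.me] += 1
        return list(self.clock)

    def receive(self, sender_clock: list) -> list:
        """On receive: element-wise max, then increment this process's entry."""
        self.clock = [max(a, b) for a, b in zip(self.clock, sender_clock)]
        self.clock[self.me] += 1
        return list(self.clock)


def happened_before(a: list, b: list) -> bool:
    """Vector timestamp a causally precedes b: a <= b everywhere and a != b."""
    return all(x <= y for x, y in zip(a, b)) and a != b
```

Lamport timestamps guarantee that a send is stamped lower than its matching receive, but two unrelated events may still get ordered timestamps. Vector clocks add the ability to detect concurrency: if neither `happened_before(a, b)` nor `happened_before(b, a)` holds, the two events are concurrent.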
