The Gossip protocol is a P2P realization of the gossip idea. Modern distributed systems use it frequently, and it is often the only practical means, because the underlying topology is very complex and gossip is nonetheless effective.
The Gossip protocol is also jokingly called epidemic propagation, because its behavior closely resembles the way a virus spreads in the biological world.
Gossip (State Transfer Model)
In the state transfer model, each replica maintains a vector clock as well as a state version tree in which no state precedes or follows any other (based on vector clock comparison); in other words, the state version tree contains all of the conflicting updates.
At query time, the client will attach its vector clock and the replica will send back the subset of the state tree that precedes the client's vector clock (this provides monotonic read consistency). The client will then advance its vector clock by merging all the versions. This means the client is responsible for resolving the conflicts among all these versions, because when the client later sends an update, its vector clock will dominate (come after) all of them.
At update time, the client will send its vector clock and the replica will check whether the client's state precedes any of its existing versions; if so, it will throw away the client's update.
Replicas also gossip with each other in the background and try to merge their version trees.
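To make the vector clock comparisons above concrete, here is a minimal sketch (my own illustrative helpers, not code from the article) that classifies two clocks as equal, ordered, or concurrent, and merges them the way a client would after reading several versions:

```python
# A vector clock is modeled as a dict mapping replica id -> counter.

def compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent' for clocks a vs b."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a causally precedes b
    if b_le_a:
        return "after"       # a causally follows b
    return "concurrent"      # neither precedes the other: a conflict

def merge(a, b):
    """Pairwise maximum: the client's clock after merging the versions it read."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

# Example: two concurrent updates seen via replicas r1 and r2
v1 = {"r1": 2, "r2": 1}
v2 = {"r1": 1, "r2": 2}
print(compare(v1, v2))   # concurrent -> both versions stay in the state version tree
print(merge(v1, v2))     # {'r1': 2, 'r2': 2} (key order may vary)
```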
Gossip (Operation Transfer Model)
In an operation transfer approach, the sequence in which operations are applied is very important. At a minimum, causal order needs to be maintained. Because of this ordering issue, each replica has to defer executing an operation until all the preceding operations have been executed. Therefore replicas save operation requests to a log, exchange these logs among each other, and consolidate them to figure out the right sequence in which to apply the operations to their local store.
"Causal order" means every replica will apply changes to the "causes" before apply changes to the "effect". "Total order" requires that every replica applies the operation in the same sequence.
In this model, each replica i keeps a list of vector clocks: Vi is the replica's own vector clock, and Vj is replica i's view of replica j's clock as of the last gossip message received from j. There is also a V-state that represents the vector clock of the last updated state.
When a query is submitted by the client, it also sends along its vector clock, which reflects the client's view of the world. The replica will check whether it has a view of the state that is later than the client's view.
When an update operation is received, the replica will buffer the operation until it can be applied to the local state. Every submitted operation is tagged with two timestamps: V-client, the client's view when it made the update request, and V-@receive, the replica's view when it received the submission.
The update request sits in the queue until the replica has received all the other updates that this one depends on. This condition is met once the replica's vector clock Vi is larger than V-client.
In the background, the replicas exchange their logs of queued updates and update each other's vector clocks. After a log exchange, each replica checks whether certain operations can now be applied (i.e., all the operations they depend on have been received) and applies them accordingly. Note that multiple operations may become ready at the same time; the replica sorts these operations in causal order (using vector clock comparison) and applies them in that order.
Concurrent updates at different replicas can also happen, which means there can be multiple valid sequences of operations. For different replicas to apply concurrent updates in the same order, we need a total-ordering mechanism.
One approach is that whoever performs the update first acquires a monotonic sequence number, and latecomers follow that sequence. On the other hand, if the operations themselves are commutative, then the order in which they are applied doesn't matter.
After applying an update, the operation cannot be removed from the queue immediately, because the update may not yet have been fully propagated to every replica. We keep checking each replica's vector clock after log exchanges, and once we confirm that every replica has received this update, we remove it from the queue.
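As a rough illustration of this buffering logic, here is a sketch under assumed names (the `Replica` class and `apply_ready` method are hypothetical; ops are (key, value) pairs). Note that the real model keeps applied operations in the log until every replica is known to have them, whereas this sketch drops them immediately for brevity:

```python
def dominates(a, b):
    """True if vector clock a is at least as large as b in every component."""
    return all(a.get(k, 0) >= b.get(k, 0) for k in set(a) | set(b))

class Replica:
    def __init__(self, rid):
        self.rid = rid
        self.clock = {}      # Vi: this replica's own vector clock
        self.queue = []      # buffered (V-client, op) pairs from log exchange
        self.store = {}

    def receive(self, v_client, op):
        # Tag with V-client; V-@receive would be a snapshot of self.clock here.
        self.queue.append((v_client, op))

    def apply_ready(self):
        # Apply every op whose causal dependencies are satisfied (Vi dominates
        # V-client); keep the rest buffered for a later round.
        still_waiting = []
        # Sorting by the sum of clock entries respects causal order; the order
        # among concurrent ops is arbitrary (a total-order rule would fix that).
        for v_client, (key, value) in sorted(self.queue,
                                             key=lambda e: sum(e[0].values())):
            if dominates(self.clock, v_client):
                self.store[key] = value
                self.clock[self.rid] = self.clock.get(self.rid, 0) + 1
            else:
                still_waiting.append((v_client, (key, value)))
        self.queue = still_waiting
```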
Merkle tree
Data is stored in a tree structure: the hash of a leaf node is the hash of its content, and the hash of an internal node is the hash of the hashes of all its children. As soon as any node changes, the change propagates quickly up to the root hash. A system that needs to synchronize only has to keep polling the root hash; once it differs, walking down the tree locates the changed content in O(log N) time, and it can be synchronized immediately.
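A minimal sketch of the idea (illustrative only; real systems hash fixed key ranges and compare subtree hashes pairwise during anti-entropy):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(blocks):
    # Leaf hash = hash of the block; internal hash = hash of concatenated children.
    level = [h(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last hash on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

a = [b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"]
b = [b"k1=v1", b"k2=XX", b"k3=v3", b"k4=v4"]
print(merkle_root(a) == merkle_root(b))    # False: a change exists, descend to find it
```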
Paxos
Paxos is a technique for handling consistency; you can loosely think of it as a kind of transaction.
Other approaches include the Chubby lock service used by Google GFS; I'm not fond of that kind of heavyweight design, so I won't spend any ink on it here.
Background
These questions arise as the scale keeps growing, in particular once data has to be accessed across multiple machines and data centers.
1. Master/slave
This is the most common scheme for multi-datacenter data access, and for ordinary needs it is all you need. Hence the oft-quoted line "premature optimization is the root of all evil".
Pros: can be built directly on MySQL replication; mature and stable.
Cons: writes have a single point of failure; once the master goes down, the slaves cannot take writes. Slave replication lag is another small but annoying problem.
2. Multi-master
Multi-master means the system has multiple masters, each of which can both read and write, so versions must be merged according to timestamps or business logic. The distributed version control system Git, for example, can be thought of as a multi-master model. It offers eventual consistency. For merging multi-version data, techniques such as Dynamo's vector clocks can be borrowed.
Pros: removes the single point of failure.
Cons: consistency is hard to achieve, and the version-merging logic is complex.
3. Two-phase commit (2PC)
Two-phase commit is a relatively simple consistency algorithm. Since consensus algorithms are often easier to understand through a story (as with the Paxos paper "The Part-Time Parliament"), here is a similar story-style example.
A class wants to organize a reunion; the precondition is that the event is held only if every participant agrees, and if any single person declines it is cancelled. Run with the 2PC algorithm, the process is as follows:
Phase 1
Prepare: the organizer (coordinator) phones every participant and shares the participant list.
Proposal: propose holding the event on Saturday, 2pm-5pm.
Vote: each participant returns its vote to the coordinator: accept or reject.
Block: if it accepts, the participant locks the Saturday 2pm-5pm slot and takes no other bookings for it.
Phase 2
Commit: if all participants agree, the coordinator tells everyone to commit; otherwise it tells them to abort, and the participants release their locks. A minimal sketch of this flow follows below.
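Here is a toy sketch of that flow (the `Student`, `prepare`, `commit`, and `abort` names are made up for the party analogy, not a real 2PC implementation):

```python
def two_phase_commit(participants, proposal):
    # Phase 1: prepare/vote. Each participant answers accept (True) or reject (False).
    votes = [p.prepare(proposal) for p in participants]

    # Phase 2: commit only on unanimous accept; otherwise abort releases the locks.
    if all(votes):
        for p in participants:
            p.commit(proposal)
        return "committed"
    for p in participants:
        p.abort(proposal)
    return "aborted"

class Student:
    def __init__(self, name, free):
        self.name, self.free = name, free
    def prepare(self, slot):
        # Accept only if the slot is free (a real participant would also lock it here).
        return slot in self.free
    def commit(self, slot):
        print(f"{self.name}: attending {slot}")
    def abort(self, slot):
        print(f"{self.name}: {slot} released")

print(two_phase_commit([Student("A", {"Sat 2-5pm"}), Student("B", set())], "Sat 2-5pm"))
# B rejects, so the whole transaction aborts.
```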
Failure: analysis of typical failure cases
Participant failure:
If any participant fails to respond, the coordinator simply aborts.
Coordinator failure:
Takeover: if a participant does not hear a confirmation (commit/abort) from the coordinator for a while, it assumes the coordinator is gone and can automatically promote itself to a backup coordinator (watchdog).
Query: using the participant list received in phase 1, the watchdog issues a query.
Vote: all participants reply to the watchdog with their vote, accept or reject.
Commit: if everyone agrees, commit; otherwise abort.
Pros: simple to implement.
Cons: all participants must block, so throughput is low; there is no fault tolerance, and the failure of a single node fails the whole transaction.
4. Three-phase commit (3PC)
Three-phase commit is an improved version of 2PC. 2PC has some obvious weaknesses: for example, after the coordinator has decided to commit and started sending commit messages, a participant may suddenly crash. At that point the transaction can no longer be aborted, so the cluster is actually left inconsistent, and once the crashed node recovers its data differs from the other nodes'. 3PC therefore splits 2PC's commit step into two, preCommit and commit, as shown in the figure.
(Image source: http://en.wikipedia.org/wiki/File:Three-phase_commit_diagram.png)
As the diagram shows, once the cohorts (participants) have received preCommit, they commit by default even if they never receive the commit message, i.e. the "timeout causes commit" path in the figure.
If the coordinator crashes halfway through sending preCommit, the watchdog takes over and issues a query: if any node has received commit, or all nodes have received preCommit, the commit can proceed; otherwise abort.
Pros: consensus can still be reached after a single-point failure.
Cons: network partitions are a problem. For example, if the link between two data centers drops right after the preCommit messages go out, the data center holding the coordinator will abort while the data center holding the remaining replicas will commit.
Mike Burrows, the author of Google Chubby, once said: "there is only one consensus protocol, and that's Paxos"; all other approaches are just broken versions of Paxos. In other words, there is only one consensus algorithm in the world, namely Paxos, and every other consistency algorithm is an incomplete version of it. Compared with 2PC/3PC, Paxos improves on the following points:
P1a. Each execution of a Paxos instance is assigned a number, and the numbers must increase; a replica does not accept any proposal numbered lower than the highest number it has seen.
P2. Once a value v has been accepted by the replicas, any value approved afterwards must also be v; that is, there is no Byzantine-style two-faced behavior. In the party analogy above, once a participant has accepted the Saturday 2pm-5pm proposal, it cannot change its mind: whoever asks later will get the same accepted value.
A proposal only needs the agreement of a majority to pass. This makes Paxos more flexible than 2PC/3PC: in a cluster of 2f+1 nodes, up to f nodes may be unavailable.
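Here is a sketch of the acceptor-side rules P1a and P2 (an assumed structure, far from a complete Paxos implementation; it only shows rejecting low-numbered proposals and reporting an already-accepted value back to later proposers):

```python
class Acceptor:
    def __init__(self):
        self.promised = -1        # highest proposal number promised so far
        self.accepted = None      # (number, value) of the accepted proposal, if any

    def prepare(self, n):
        if n <= self.promised:
            return None                       # P1a: reject smaller/equal numbers
        self.promised = n
        return ("promise", n, self.accepted)  # report any already-accepted value

    def accept(self, n, value):
        if n < self.promised:
            return False
        self.promised = n
        self.accepted = (n, value)            # P2: from now on this value sticks
        return True

a = Acceptor()
print(a.prepare(1))        # ('promise', 1, None)
print(a.accept(1, "v"))    # True
print(a.prepare(0))        # None: rejected, number too small
print(a.prepare(2))        # ('promise', 2, (1, 'v')) -> later proposers must keep 'v'
```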
Paxos also has many finer-grained constraints; Google's Chubby in particular fills in the details of Paxos very completely from an engineering point of view, for instance how to avoid Byzantine problems: a node's persistent storage may fail, and that kind of Byzantine behavior would break Paxos's P2 constraint.
Comparison of how the approaches above work:
DHT (Distributed Hash Table)
Map Reduce Execution
MapReduce is everywhere by now, but it still deserves a mention.
See: http://zh.wikipedia.org/wiki/MapReduce
Handling Deletes
When we perform a delete we must be very careful, so as not to lose the corresponding version information.
Usually we mark the object with a "deleted" tag (a tombstone). After enough time has passed, and once we are sure the versions are consistent, we can remove it completely and reclaim its space.
Storage Implementation
One strategy is to make the storage implementation pluggable, e.g. a local MySQL DB, Berkeley DB, the filesystem, or even an in-memory hashtable can be used as the storage mechanism.
Another strategy is to implement the storage in a highly scalable way. Here are some techniques I learned from CouchDB and Google BigTable.
CouchDB has an MVCC model that uses a copy-on-modify approach. Any update causes a private copy of the data to be made, which in turn requires the index to be modified, producing a private copy of the index as well, all the way up to the root pointer.
Notice that updates happen in an append-only mode: the modified data is appended to the file and the old data becomes garbage, and periodic garbage collection compacts the data. This is how the model is implemented in memory and on disk.
In the Google BigTable model, the data is broken into multiple generations and memory holds the newest generation. A query searches the in-memory data as well as all the data sets on disk and merges the returned results. Whether a generation contains a key can be detected quickly by checking a Bloom filter.
When an update happens, both the in-memory data and the commit log are written, so that if the machine crashes before the in-memory data is flushed to disk, the state can be recovered from the commit log.
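A rough sketch of that read/write path (illustrative only; a plain Python set stands in for the Bloom filter, which gives no false negatives but of course none of the memory savings either):

```python
class Generation:
    def __init__(self, data):
        self.data = data
        self.filter = set(data)       # stand-in for a per-generation bloom filter

    def maybe_contains(self, key):
        return key in self.filter

class Store:
    def __init__(self):
        self.commit_log = []
        self.mem = {}                 # newest generation, held in memory
        self.disk = []                # older generations on disk, newest first

    def put(self, key, value):
        self.commit_log.append((key, value))   # written first, for crash recovery
        self.mem[key] = value

    def get(self, key):
        if key in self.mem:
            return self.mem[key]
        for gen in self.disk:
            # Skip generations whose filter says the key is definitely absent.
            if gen.maybe_contains(key) and key in gen.data:
                return gen.data[key]
        return None

s = Store()
s.disk.append(Generation({"old": 1}))
s.put("new", 2)
print(s.get("new"), s.get("old"), s.get("missing"))   # 2 1 None
```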
Node Membership Changes
Notice that virtual nodes can join and leave the network at any time without impacting the operation of the ring.
When a new node joins the network
- The newly joined node announces its presence (by broadcast or some other means).
- Its neighbor nodes adjust their key assignment and replication relationships; this step is usually synchronous.
- The newly joined node copies data asynchronously.
- The membership change is then propagated to the other nodes.
Notice that other nodes may not have updated their membership view yet, so they may still forward requests to the old nodes. But since these old nodes (the neighbors of the newly joined node) have already been updated (in step 2), they will forward the requests on to the newly joined node.
On the other hand, the newly joined node may still be downloading data and not yet ready to serve. We use the vector clock (described earlier) to determine whether the newly joined node is ready to serve a request; if not, the client can contact another replica.
When an existing node leaves the network (e.g. crash)
- The crashed node no longer responds to gossip messages, so its neighbors learn that it is gone.
- The neighbors update the membership information, take over the dead node's work by handing its key ranges to other nodes, and copy data asynchronously.
We haven't yet talked about how virtual nodes are mapped onto physical nodes. Many schemes are possible; the main goal is that replicas of the same virtual node should not sit on the same physical node. One simple scheme is to assign virtual nodes to physical nodes at random, while checking that no physical node ends up holding replicas of the same key range.
Notice that machine crashes happen at the physical-node level, and each physical node hosts many virtual nodes. So when a single physical node crashes, the workload of its multiple virtual nodes is scattered across many physical machines, and the extra load caused by the crash is evenly balanced.
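A sketch of the random assignment described above (a hypothetical helper; `random.sample` picks distinct physical nodes, which enforces the "no two replicas of a key range on one physical node" rule by construction):

```python
import random

def assign_virtual_nodes(key_ranges, physical_nodes, replicas=3):
    """Place the virtual-node replicas of each key range on distinct physical nodes."""
    placement = {}                                       # key range -> physical nodes
    for kr in key_ranges:
        placement[kr] = random.sample(physical_nodes, replicas)
    return placement

placement = assign_virtual_nodes(range(8), ["pm1", "pm2", "pm3", "pm4"], replicas=3)
for kr, nodes in placement.items():
    assert len(set(nodes)) == 3      # no physical node repeats within one key range
```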
Column Storage
Description
A database presents data as a two-dimensional table of rows and columns, yet it stores the data as one-dimensional strings. Take the following table as an example:
EmpId | Lastname | Firstname | Salary
1     | Smith    | Joe       | 40000
2     | Jones    | Mary      | 50000
3     | Johnson  | Cathy     | 44000
This simple table contains an employee ID (EmpId), name fields (Lastname and Firstname), and a salary (Salary).
The table is kept in the computer's memory (RAM) and in storage (the hard disk). Although memory and disk work differently, the operating system stores data in both the same way: the database must serialize this two-dimensional table into a series of one-dimensional bytes, which the operating system then writes to memory or disk.
A row-oriented database stores all the values of one row together, then the next row, and so on: 1,Smith,Joe,40000;2,Jones,Mary,50000;3,Johnson,Cathy,44000;
A column-oriented database stores all the values of one column together, then the next column, and so on: 1,2,3;Smith,Jones,Johnson;Joe,Mary,Cathy;40000,50000,44000;
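The two layouts can be reproduced with a few lines (this simply mirrors the example strings above):

```python
rows = [(1, "Smith", "Joe", 40000),
        (2, "Jones", "Mary", 50000),
        (3, "Johnson", "Cathy", 44000)]

# Row store: the values of one record stay together.
row_layout = ";".join(",".join(str(v) for v in r) for r in rows) + ";"

# Column store: the values of one column stay together (compresses well, fast scans).
col_layout = ";".join(",".join(str(v) for v in c) for c in zip(*rows)) + ";"

print(row_layout)   # 1,Smith,Joe,40000;2,Jones,Mary,50000;3,Johnson,Cathy,44000;
print(col_layout)   # 1,2,3;Smith,Jones,Johnson;Joe,Mary,Cathy;40000,50000,44000;
```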
Characteristics
- Good compression ratios. Since most database designs contain redundancy, the compression ratio can be very high: importing 40-odd MB of data into Infobright unexpectedly produced a data file of only a little over 1 MB.
- Computations over a single column are very fast.
- Easy to combine with MapReduce and key-value models.
- Reading an entire row is slower, but reading a subset of the columns is fast.
Simple analysis with source code