TAO: Facebook's Distributed Data Store for the Social Graph (paper reading notes)

Several fundamental problems

Before TAO, Facebook's main caching system was Memcache, but look-aside caches such as Memcache have several problems:

  • Inefficient edge lists
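    • A key-value cache such as Memcache is a poor fit for edge lists. In Facebook's huge social graph, fetching the full edge list of a node is a very common query, yet changing a single edge typically invalidates the whole list. The next query then has to reload the entire list into the cache, so the larger the edge list, the heavier the load on the backend database.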
  • Distributed control logic
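    • As noted in the previous post, Memcache's distributed control logic runs on the client side, which makes failure modes such as "thundering herds" hard to avoid. TAO instead exposes a simple API and moves the control logic into the cache itself.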
  • Expensive read-after-write consistency
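    • In Memcache, a write issued from a non-master (local) region sets a remote marker rk in the local cache and sends the write across regions to the master region, embedding the key and rk in the SQL statement. On a later read miss, the client checks whether rk is still in the local cache: if it is, the query goes to the master database; otherwise it goes to the local slave database. (If rk is still cached, the local database holds stale data because the master's update has not been replicated yet.) Cross-region communication adds significant latency, so the question is whether writes can go straight to the local cache while consistency is still preserved. TAO achieves this.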
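The edge-list problem in the first bullet can be illustrated with a toy look-aside cache. This is a minimal sketch, not Memcache's actual API: the class `LookasideEdgeCache`, its methods, and the `edges_loaded` counter are hypothetical names chosen for illustration.

```python
class LookasideEdgeCache:
    """Toy look-aside cache keyed by (id1, atype). Hypothetical, not Memcache's API."""

    def __init__(self, db):
        self.db = db              # dict standing in for the database
        self.cache = {}
        self.edges_loaded = 0     # number of edges fetched from the database

    def get_edges(self, id1, atype):
        key = (id1, atype)
        if key not in self.cache:                 # miss: reload the whole list
            edges = list(self.db.get(key, []))
            self.edges_loaded += len(edges)
            self.cache[key] = edges
        return self.cache[key]

    def add_edge(self, id1, atype, id2):
        key = (id1, atype)
        self.db.setdefault(key, []).append(id2)   # the write goes to the database
        self.cache.pop(key, None)                 # the entire cached list is invalidated


db = {(1, "friend"): list(range(1000))}
cache = LookasideEdgeCache(db)
cache.get_edges(1, "friend")       # cold miss loads 1000 edges
cache.add_edge(1, "friend", 5000)  # a single new edge invalidates the list
cache.get_edges(1, "friend")       # the next read reloads all 1001 edges
```

Adding one edge costs a full reload of the list on the next read, which is the overhead the notes describe.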

Goals for TAO

  • Provide a data store with a graph abstraction (vertexes and edges), not keys+values.
  • Optimize heavily for reads (99.8% of requests are reads).
  • Explicitly favor efficiency and availability over consistency.
    • Slightly stale data is often okay (for Facebook).
    • Communication between data centers in different regions is expensive.

TAO Data Model

The data model has two main types: objects and associations.

  • Objects (nodes): object id (64-bit integer), object type (otype), and data in the form of key-value pairs.
  • Associations: source object id (id1), association type (atype), destination object id (id2), 32-bit timestamp, and data in the form of key-value pairs.
  • [Figure: data model]
  • Example: encoding in TAO
    • Alice used her mobile phone to record her visit to a famous landmark with Bob. She ‘checked in’ to the Golden Gate Bridge and ‘tagged’ Bob to indicate that he is with her. Cathy added a comment that David has ‘liked’.
    • [Figure: the example encoded as objects and associations]
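The check-in example can be written down as objects and associations. This is a minimal sketch: the class names `TaoObject` and `TaoAssoc`, the numeric ids, the timestamps, and the type names such as `AUTHORED` and `TAGGED` are illustrative assumptions, not values taken from these notes.

```python
from dataclasses import dataclass, field


@dataclass
class TaoObject:
    id: int                                   # 64-bit object id
    otype: str                                # object type
    data: dict = field(default_factory=dict)  # key-value pairs


@dataclass
class TaoAssoc:
    id1: int                                  # source object id
    atype: str                                # association type
    id2: int                                  # destination object id
    time: int                                 # 32-bit timestamp
    data: dict = field(default_factory=dict)  # key-value pairs


# Ids and timestamps below are arbitrary placeholders.
objects = [
    TaoObject(105, "USER", {"name": "Alice"}),
    TaoObject(244, "USER", {"name": "Bob"}),
    TaoObject(379, "USER", {"name": "Cathy"}),
    TaoObject(471, "USER", {"name": "David"}),
    TaoObject(534, "LOCATION", {"name": "Golden Gate Bridge"}),
    TaoObject(632, "CHECKIN"),
    TaoObject(771, "COMMENT"),
]

assocs = [
    TaoAssoc(105, "AUTHORED", 632, 1000),   # Alice created the check-in
    TaoAssoc(632, "LOCATION", 534, 1000),   # the check-in is at the bridge
    TaoAssoc(632, "TAGGED", 244, 1000),     # Bob is tagged in the check-in
    TaoAssoc(379, "AUTHORED", 771, 1010),   # Cathy wrote the comment
    TaoAssoc(632, "COMMENT", 771, 1010),    # the comment belongs to the check-in
    TaoAssoc(471, "LIKES", 771, 1020),      # David liked the comment
]

# Who is tagged in the check-in?
tagged = [a.id2 for a in assocs if a.id1 == 632 and a.atype == "TAGGED"]
```

Actions with repeatable state (the check-in, the comment) become objects, while relationships between them become typed, timestamped edges.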

Association queries in TAO

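A very common query at Facebook is: given a node (object) and an edge type (atype), return the list of matching edges (the edge list), as with the 'check in' action above. Facebook designed the association list structure for this purpose.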

Structure of an association list: (id1, atype) → [a_new … a_old], where the elements are sorted by time in descending order, newest first.

TAO’s queries on association lists

  • Assoc_get(id1, atype, id2set, high?, low?)
  • Assoc_count(id1, atype)
  • Assoc_range(id1, atype, pos, limit)
  • Assoc_time_range(id1, atype, high, low, limit)
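The four queries above can be sketched over an in-memory association list. This is a minimal sketch under assumptions: the class `AssocStore` is hypothetical, the methods use lowercase names, `assoc_add` is a helper added only to populate the lists, and the exact bound semantics of `high` and `low` (inclusive here) are an assumption.

```python
class AssocStore:
    """Toy in-memory association lists, kept newest first."""

    def __init__(self):
        self.lists = {}   # (id1, atype) -> list of (time, id2), newest first

    def assoc_add(self, id1, atype, id2, time):
        # Helper for this sketch: replace any existing edge to id2, then re-sort.
        lst = self.lists.setdefault((id1, atype), [])
        lst[:] = [e for e in lst if e[1] != id2]
        lst.append((time, id2))
        lst.sort(key=lambda e: e[0], reverse=True)

    def assoc_get(self, id1, atype, id2set, high=None, low=None):
        # Edges to the given destinations, optionally bounded by time.
        return [(t, i) for t, i in self.lists.get((id1, atype), [])
                if i in id2set
                and (high is None or t <= high)
                and (low is None or t >= low)]

    def assoc_count(self, id1, atype):
        return len(self.lists.get((id1, atype), []))

    def assoc_range(self, id1, atype, pos, limit):
        # Up to `limit` edges starting at position `pos`, newest first.
        return self.lists.get((id1, atype), [])[pos:pos + limit]

    def assoc_time_range(self, id1, atype, high, low, limit):
        # Up to `limit` edges with low <= time <= high, newest first.
        hits = [e for e in self.lists.get((id1, atype), []) if low <= e[0] <= high]
        return hits[:limit]


store = AssocStore()
store.assoc_add(632, "COMMENT", 10, 100)
store.assoc_add(632, "COMMENT", 11, 200)
store.assoc_add(632, "COMMENT", 12, 300)

newest_two = store.assoc_range(632, "COMMENT", 0, 2)
older = store.assoc_time_range(632, "COMMENT", 250, 50, 10)
```

Keeping each list sorted newest first makes "the most recent N" a simple prefix read, which matches how these queries are typically used.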

TAO Architecture

  • Storage Layer
    • Objects and associations are stored in MySQL. Facebook's social graph is far too large for a single MySQL server, so the data has to be sharded.
      • Data is divided into logical shards; each shard is contained in a logical database.
      • Each object id contains an embedded shard_id that identifies its hosting shard.
  • Caching Layer

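    • Multiple cache servers together form a tier; a tier as a whole can serve any request.
    • Cache servers mainly cache three kinds of data: objects, association lists, and association counts.

    As traffic grows, more servers must be added, but simply enlarging a single tier tends to cause problems such as hot spots. The cache layer is therefore split into two levels, called leaders and followers.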

    • Leader and Followers
      • Leaders read from and write to MySQL, and serialize concurrent writes that arrive from followers.
      • Followers forward read misses and writes to a leader.
      • Clients can only interact with followers.
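To make the embedded shard_id from the storage layer concrete, here is a minimal sketch of packing a shard id into a 64-bit object id and routing by it. The bit layout (`SHARD_BITS = 16`), the function names, and the modulo placement of shards onto databases are all assumptions for illustration; these notes do not specify them.

```python
SHARD_BITS = 16                  # assumption: the actual bit layout is not given
LOCAL_BITS = 64 - SHARD_BITS


def make_object_id(shard_id, local_id):
    """Pack a shard id into the top bits of a 64-bit object id."""
    if not 0 <= shard_id < (1 << SHARD_BITS):
        raise ValueError("shard_id out of range")
    if not 0 <= local_id < (1 << LOCAL_BITS):
        raise ValueError("local_id out of range")
    return (shard_id << LOCAL_BITS) | local_id


def shard_of(object_id):
    """Recover the hosting shard from the object id alone."""
    return object_id >> LOCAL_BITS


def database_for(shard_id, databases):
    """Map a logical shard to a database (assumption: simple modulo placement)."""
    return databases[shard_id % len(databases)]


databases = ["db-a", "db-b", "db-c"]     # hypothetical database names
oid = make_object_id(shard_id=7, local_id=12345)
host = database_for(shard_of(oid), databases)
```

Because the shard is recoverable from the id itself, any cache server can route a request without a separate lookup table from object to shard.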

Caching Consistency

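Because the cache layer has two levels, a follower serves reads and forwards writes. Only on a read miss does it forward the request to a leader, which queries MySQL and updates the cache. The leader handles the two kinds of update differently: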

  • An object update
    • The leader forwards invalidation messages to each corresponding follower.
    • The follower that issued the write is updated synchronously on the reply from the leader.
  • An association update
    • As discussed earlier, in Memcache invalidating a single edge usually forces the whole edge list to be reloaded, which greatly increases latency and database load. TAO does the following instead:
      • The leader sends a refill message to notify followers.
      • If a follower has cached the association, it asks the leader to update its now-stale association list.
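The two update paths above can be sketched with one leader and several followers. This is a simplified, single-process sketch: the class and method names are hypothetical, a dict stands in for MySQL, and the invalidation and refill messages are delivered inline rather than asynchronously.

```python
class Leader:
    def __init__(self, db):
        self.db = db              # dict standing in for MySQL
        self.followers = []

    def read(self, key):
        return self.db.get(key)

    def write_object(self, key, value, origin):
        self.db[key] = value
        for f in self.followers:
            if f is not origin:
                f.invalidate(key)             # object update: invalidation message
        return value                          # reply updates the origin synchronously

    def write_assoc(self, key, id2, origin):
        self.db.setdefault(key, []).insert(0, id2)   # newest edge first
        for f in self.followers:
            if f is not origin:
                f.refill(key)                 # association update: refill message
        return list(self.db[key])


class Follower:
    def __init__(self, leader):
        self.leader = leader
        self.cache = {}
        leader.followers.append(self)

    def get(self, key):
        if key not in self.cache:             # read miss is forwarded to the leader
            value = self.leader.read(key)
            self.cache[key] = list(value) if isinstance(value, list) else value
        return self.cache[key]

    def put_object(self, key, value):
        self.cache[key] = self.leader.write_object(key, value, origin=self)

    def add_assoc(self, key, id2):
        self.cache[key] = self.leader.write_assoc(key, id2, origin=self)

    def invalidate(self, key):
        self.cache.pop(key, None)

    def refill(self, key):
        if key in self.cache:                 # only refresh lists that are cached
            self.cache[key] = list(self.leader.read(key))


leader = Leader({"obj:1": "old", (1, "friend"): [7]})
f1, f2 = Follower(leader), Follower(leader)
f1.get("obj:1"); f2.get("obj:1"); f2.get((1, "friend"))

f1.put_object("obj:1", "new")     # f2's copy of obj:1 is invalidated
f1.add_assoc((1, "friend"), 8)    # f2's cached list is refilled in place
```

The difference is that an object update simply drops other followers' copies, while an association update refreshes a cached list so that followers do not fall back to reloading it on the next read.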

Scaling Geographically

  • Master-Slave
    A. The master leader sends read misses, writes, and embedded consistency messages to the master database.
    B. Consistency messages are delivered to the slave leader as the replication stream updates the slave database.
    C. The slave leader sends writes to the master leader.
    D. The slave leader sends read misses to the replica DB.
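Steps A to D can be traced in a toy model with one master region and one slave region. This is a minimal sketch under assumptions: the class names, the in-memory replication log, and the explicit `apply_replication` call are hypothetical stand-ins for MySQL replication, and the slave leader caching the value returned by the master anticipates the changeset idea in the next section.

```python
class MasterLeader:
    def __init__(self):
        self.db = {}       # master database
        self.log = []      # pending replication entries (single slave assumed)

    def write(self, key, value):
        self.db[key] = value               # (A) the write reaches the master database
        self.log.append((key, value))
        return value


class SlaveLeader:
    def __init__(self, master):
        self.master = master
        self.db = {}       # local replica database
        self.cache = {}

    def write(self, key, value):
        # (C) writes are forwarded to the master leader
        self.cache[key] = self.master.write(key, value)
        return value

    def read(self, key):
        if key not in self.cache:
            self.cache[key] = self.db.get(key)   # (D) read miss goes to the replica DB
        return self.cache[key]

    def apply_replication(self):
        # (B) the replication stream updates the replica and carries
        # consistency messages that invalidate the slave leader's cache
        for key, value in self.master.log:
            self.db[key] = value
            self.cache.pop(key, None)
        self.master.log.clear()


master = MasterLeader()
slave = SlaveLeader(master)

slave.write("x", 1)
stale = slave.db.get("x")      # the replica has not seen the write yet
slave.apply_replication()
fresh = slave.read("x")        # cache entry was invalidated, so this hits the replica
```

Reads stay inside the local region at all times; only writes pay the cross-region round trip.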

Consistency

  • Changeset
    • TAO synchronously updates the cache with locally written values by having the master leader return a changeset when the write is successful.
    • This changeset is propagated through the slave leader (if any) to the follower tier that originated the write query.
  • Version number (in the persistent store and the cache)
    • The version number is incremented during each update.
    • The follower can safely invalidate its local copy of the data if the changeset indicates that its pre-update value was stale.
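The interplay of changesets and version numbers can be sketched as follows. This is one possible reading of the rule above, not TAO's actual implementation: `Changeset`, `VersionedStore`, and `FollowerCache` are hypothetical names, and the changeset is assumed to carry the pre-update version so the follower can compare it with its cached copy.

```python
from dataclasses import dataclass


@dataclass
class Changeset:
    key: str
    old_version: int    # version in the persistent store before the update
    new_version: int
    value: str


class VersionedStore:
    """Stands in for the persistent store behind the master leader."""

    def __init__(self):
        self.rows = {}   # key -> (version, value)

    def write(self, key, value):
        old_version = self.rows.get(key, (0, None))[0]
        self.rows[key] = (old_version + 1, value)   # version is incremented per update
        return Changeset(key, old_version, old_version + 1, value)


class FollowerCache:
    def __init__(self):
        self.entries = {}   # key -> (version, value)

    def apply(self, cs):
        cached = self.entries.get(cs.key)
        if cached is not None and cached[0] != cs.old_version:
            del self.entries[cs.key]    # pre-update copy was stale: invalidate it
            return "invalidated"
        self.entries[cs.key] = (cs.new_version, cs.value)
        return "applied"


store = VersionedStore()
a, b = FollowerCache(), FollowerCache()

first = a.apply(store.write("k", "v1"))    # a caches version 1
second = b.apply(store.write("k", "v2"))   # b writes; a has not heard about it yet
third = a.apply(store.write("k", "v3"))    # a's cached version 1 != pre-update version 2
```

In the last step follower `a` cannot safely patch its copy, because it missed the update from `b`, so it drops the entry and will fetch a fresh copy on the next read.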
