The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
Google
[Translated by 西周]
ABSTRACT
We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.
While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points.
The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.
In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real-world use.
1. INTRODUCTION
We have designed and implemented the Google File System (GFS) to meet the rapidly growing demands of Google’s data processing needs. GFS shares many of the same goals as previous distributed file systems such as performance, scalability, reliability, and availability. However, its design has been driven by key observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system design assumptions. We have reexamined traditional choices and explored radically different points in the design space.
First, component failures are the norm rather than the exception. The file system consists of hundreds or even thousands of storage machines built from inexpensive commodity parts and is accessed by a comparable number of client machines. The quantity and quality of the components virtually guarantee that some are not functional at any given time and some will not recover from their current failures. We have seen problems caused by application bugs, operating system bugs, human errors, and the failures of disks, memory, connectors, networking, and power supplies. Therefore, constant monitoring, error detection, fault tolerance, and automatic recovery must be integral to the system.
Second, files are huge by traditional standards. Multi-GB files are common. Each file typically contains many application objects such as web documents. When we are regularly working with fast-growing data sets of many TBs comprising billions of objects, it is unwieldy to manage billions of approximately KB-sized files even when the file system could support it. As a result, design assumptions and parameters such as I/O operation and block sizes have to be revisited.
Third, most files are mutated by appending new data rather than overwriting existing data. Random writes within a file are practically non-existent. Once written, the files are only read, and often only sequentially. A variety of data share these characteristics. Some may constitute large repositories that data analysis programs scan through. Some may be data streams continuously generated by running applications. Some may be archival data. Some may be intermediate results produced on one machine and processed on another, whether simultaneously or later in time. Given this access pattern on huge files, appending becomes the focus of performance optimization and atomicity guarantees, while caching data blocks in the client loses its appeal.
Fourth, co-designing the applications and the file system API benefits the overall system by increasing our flexibility. For example, we have relaxed GFS’s consistency model to vastly simplify the file system without imposing an onerous burden on the applications. We have also introduced an atomic append operation so that multiple clients can append concurrently to a file without extra synchronization between them. These will be discussed in more detail later in the paper.
Multiple GFS clusters are currently deployed for different purposes. The largest ones have over 1000 storage nodes, over 300 TB of disk storage, and are heavily accessed by hundreds of clients on distinct machines on a continuous basis.
2. DESIGN OVERVIEW
2.1 Assumptions
In designing a file system for our needs, we have been guided by assumptions that offer both challenges and opportunities. We alluded to some key observations earlier and now lay out our assumptions in more detail.
• The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.
• The system stores a modest number of large files. We expect a few million files, each typically 100 MB or larger in size. Multi-GB files are the common case and should be managed efficiently. Small files must be supported, but we need not optimize for them.
• The workloads primarily consist of two kinds of reads: large streaming reads and small random reads. In large streaming reads, individual operations typically read hundreds of KBs, more commonly 1 MB or more. Successive operations from the same client often read through a contiguous region of a file. A small random read typically reads a few KBs at some arbitrary offset. Performance-conscious applications often batch and sort their small reads to advance steadily through the file rather than go back and forth.
• The workloads also have many large, sequential writes that append data to files. Typical operation sizes are similar to those for reads. Once written, files are seldom modified again. Small writes at arbitrary positions in a file are supported but do not have to be efficient.
• The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. Our files are often used as producer-consumer queues or for many-way merging. Hundreds of producers, running one per machine, will concurrently append to a file. Atomicity with minimal synchronization overhead is essential. The file may be read later, or a consumer may be reading through the file simultaneously.
• High sustained bandwidth is more important than low latency. Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response time requirements for an individual read or write.
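The batching and sorting of small reads mentioned in the assumptions above can be illustrated with a minimal sketch. The function name `batch_reads`, the `(offset, length)` request shape, and the merging of touching ranges are assumptions made for illustration; the paper only says that applications batch and sort their reads.

```python
def batch_reads(requests, max_gap=0):
    """Sort (offset, length) reads and merge those that touch or overlap.

    `max_gap` (an assumed tuning knob) also merges reads separated by at
    most that many bytes, trading a little extra data for fewer requests.
    """
    merged = []
    for offset, length in sorted(requests):
        end = offset + length
        if merged and offset <= merged[-1][1] + max_gap:
            # Extend the previous range instead of issuing a new read.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([offset, end])
    return [(start, end - start) for start, end in merged]
```

Issuing the resulting reads in order moves steadily forward through the file instead of seeking back and forth.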
2.2 Interface
GFS provides a familiar file system interface, though it does not implement a standard API such as POSIX. Files are organized hierarchically in directories and identified by pathnames. We support the usual operations to create, delete, open, close, read, and write files.
Moreover, GFS has snapshot and record append operations. Snapshot creates a copy of a file or a directory tree at low cost. Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client’s append. It is useful for implementing multi-way merge results and producer-consumer queues that many clients can simultaneously append to without additional locking. We have found these types of files to be invaluable in building large distributed applications. Snapshot and record append are discussed further in Sections 3.4 and 3.3 respectively.
2.3 Architecture
A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients, as shown in Figure 1. Each of these is typically a commodity Linux machine running a user-level server process. It is easy to run both a chunkserver and a client on the same machine, as long as machine resources permit and the lower reliability caused by running possibly flaky application code is acceptable.
Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64 bit chunk handle assigned by the master at the time of chunk creation. Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range. For reliability, each chunk is replicated on multiple chunkservers. By default, we store three replicas, though users can designate different replication levels for different regions of the file namespace.
The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.
GFS client code linked into each application implements the file system API and communicates with the master and chunkservers to read or write data on behalf of the application. Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers. We do not provide the POSIX API and therefore need not hook into the Linux vnode layer.
Neither the client nor the chunkserver caches file data. Client caches offer little benefit because most applications stream through huge files or have working sets too large to be cached. Not having them simplifies the client and the overall system by eliminating cache coherence issues. (Clients do cache metadata, however.) Chunkservers need not cache file data because chunks are stored as local files and so Linux’s buffer cache already keeps frequently accessed data in memory.
2.4 Single Master
Having a single master vastly simplifies our design and enables the master to make sophisticated chunk placement and replication decisions using global knowledge. However, we must minimize its involvement in reads and writes so that it does not become a bottleneck. Clients never read and write file data through the master. Instead, a client asks the master which chunkservers it should contact. It caches this information for a limited time and interacts with the chunkservers directly for many subsequent operations.
Let us explain the interactions for a simple read with reference to Figure 1. First, using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file. Then, it sends the master a request containing the file name and chunk index. The master replies with the corresponding chunk handle and locations of the replicas. The client caches this information using the file name and chunk index as the key.
The client then sends a request to one of the replicas, most likely the closest one. The request specifies the chunk handle and a byte range within that chunk. Further reads of the same chunk require no more client-master interaction until the cached information expires or the file is reopened. In fact, the client typically asks for multiple chunks in the same request and the master can also include the information for chunks immediately following those requested. This extra information sidesteps several future client-master interactions at practically no extra cost.
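The client side of the read path described above can be sketched roughly as follows. This is a minimal illustration, not the GFS client: `locate`, `LocationCache`, the `ask_master` callable, and the 60-second timeout are all hypothetical, and the paper only says the information is cached "for a limited time".

```python
import time

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size given in Section 2.5


def locate(offset, length):
    """Split a byte range into (chunk_index, start_in_chunk, end_in_chunk) pieces."""
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    pieces = []
    end = offset + length
    while offset < end:
        index = offset // CHUNK_SIZE          # chunk index within the file
        start = offset % CHUNK_SIZE           # first byte inside that chunk
        stop = min(CHUNK_SIZE, start + (end - offset))
        pieces.append((index, start, stop))
        offset += stop - start
    return pieces


class LocationCache:
    """Client-side cache keyed by (file name, chunk index), with a timeout."""

    def __init__(self, ask_master, ttl_seconds=60.0, clock=time.monotonic):
        self._ask_master = ask_master  # hypothetical stand-in for the master RPC
        self._ttl = ttl_seconds        # assumed value, not taken from the paper
        self._clock = clock
        self._entries = {}

    def lookup(self, filename, chunk_index):
        key = (filename, chunk_index)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now < hit[0]:
            return hit[1]              # served without contacting the master
        info = self._ask_master(filename, chunk_index)  # (handle, replica locations)
        self._entries[key] = (now + self._ttl, info)
        return info

    def purge(self, filename):
        """Drop all entries for a file, as happens when the file is reopened."""
        for key in [k for k in self._entries if k[0] == filename]:
            del self._entries[key]
```

A read would call `locate` to find the chunks it touches, call `lookup` once per chunk, and then send the chunk handle and byte range directly to one of the returned replicas.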
2.5 Chunk Size
Chunk size is one of the key design parameters. We have chosen 64 MB, which is much larger than typical file system block sizes. Each chunk replica is stored as a plain Linux file on a chunkserver and is extended only as needed. Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest objection against such a large chunk size.
A large chunk size offers several important advantages. First, it reduces clients’ need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information. The reduction is especially significant for our workloads because applications mostly read and write large files sequentially. Even for small random reads, the client can comfortably cache all the chunk location information for a multi-TB working set. Second, since a client is more likely to perform many operations on a given large chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time. Third, it reduces the size of the metadata stored on the master. This allows us to keep the metadata in memory, which in turn brings other advantages that we will discuss in Section 2.6.1.
On the other hand, a large chunk size, even with lazy space allocation, has its disadvantages. A small file consists of a small number of chunks, perhaps just one. The chunkservers storing those chunks may become hot spots if many clients are accessing the same file. In practice, hot spots have not been a major issue because our applications mostly read large multi-chunk files sequentially.
However, hot spots did develop when GFS was first used by a batch-queue system: an executable was written to GFS as a single-chunk file and then started on hundreds of machines at the same time. The few chunkservers storing this executable were overloaded by hundreds of simultaneous requests. We fixed this problem by storing such executables with a higher replication factor and by making the batch-queue system stagger application start times. A potential long-term solution is to allow clients to read data from other clients in such situations.
2.6 Metadata
The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk’s replicas. All metadata is kept in the master’s memory. The first two types (namespaces and file-to-chunk mapping) are also kept persistent by logging mutations to an operation log stored on the master’s local disk and replicated on remote machines. Using a log allows us to update the master state simply, reliably, and without risking inconsistencies in the event of a master crash. The master does not store chunk location information persistently. Instead, it asks each chunkserver about its chunks at master startup and whenever a chunkserver joins the cluster.
2.6.1 In-Memory Data Structures
Since metadata is stored in memory, master operations are fast. Furthermore, it is easy and efficient for the master to periodically scan through its entire state in the background. This periodic scanning is used to implement chunk garbage collection, re-replication in the presence of chunkserver failures, and chunk migration to balance load and disk space usage across chunkservers. Sections 4.3 and 4.4 will discuss these activities further.
One potential concern for this memory-only approach is that the number of chunks and hence the capacity of the whole system is limited by how much memory the master has. This is not a serious limitation in practice. The master maintains less than 64 bytes of metadata for each 64 MB chunk. Most chunks are full because most files contain many chunks, only the last of which may be partially filled. Similarly, the file namespace data typically requires less than 64 bytes per file because it stores file names compactly using prefix compression.
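To make the memory bound concrete, the following back-of-the-envelope sketch applies the two figures quoted above (under 64 bytes per 64 MB chunk, under 64 bytes per file). The function names are hypothetical, binary units are assumed, and every chunk is treated as full, so the result is an upper bound rather than a measurement.

```python
CHUNK_SIZE = 64 * 2**20        # 64 MB per chunk
BYTES_PER_CHUNK_META = 64      # upper bound quoted in the text
BYTES_PER_FILE_META = 64       # upper bound quoted in the text


def chunk_metadata_bound(stored_bytes):
    """Upper bound on master memory used for chunk metadata."""
    chunks = -(-stored_bytes // CHUNK_SIZE)   # ceiling division
    return chunks * BYTES_PER_CHUNK_META


def file_metadata_bound(num_files):
    """Upper bound on master memory used for file namespace data."""
    return num_files * BYTES_PER_FILE_META


# 300 TB of chunk data needs at most 300 MB of chunk metadata,
# because 64 bytes / 64 MB is a ratio of one to 2**20.
print(chunk_metadata_bound(300 * 2**40) // 2**20, "MB")
```

The ratio explains why the text calls the limitation minor: each additional megabyte of master memory covers roughly another terabyte of chunk data.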
If necessary to support even larger file systems, the cost of adding extra memory to the master is a small price to pay for the simplicity, reliability, performance, and flexibility we gain by storing the metadata in memory.
2.6.2 Chunk Locations
The master does not keep a persistent record of which chunkservers have a replica of a given chunk. It simply polls chunkservers for that information at startup. The master can keep itself up-to-date thereafter because it controls all chunk placement and monitors chunkserver status with regular HeartBeat messages.
We initially attempted to keep chunk location information persistently at the master, but we decided that it was much simpler to request the data from chunkservers at startup, and periodically thereafter. This eliminated the problem of keeping the master and chunkservers in sync as chunkservers join and leave the cluster, change names, fail, restart, and so on. In a cluster with hundreds of servers, these events happen all too often.
Another way to understand this design decision is to realize that a chunkserver has the final word over what chunks it does or does not have on its own disks. There is no point in trying to maintain a consistent view of this information on the master because errors on a chunkserver may cause chunks to vanish spontaneously (e.g., a disk may go bad and be disabled) or an operator may rename a chunkserver.
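One way to picture this design is a purely in-memory table that is rebuilt from whatever each chunkserver reports, so that the chunkserver always has the final word. The sketch below is an illustration under that assumption; `LocationTable` and its methods are hypothetical names, and the real HeartBeat protocol is not described in this section.

```python
from collections import defaultdict


class LocationTable:
    """In-memory map from chunk handle to chunkserver names; never persisted."""

    def __init__(self):
        self._by_chunk = defaultdict(set)
        self._by_server = {}

    def report(self, server, handles):
        """Replace everything known about `server` with its latest report."""
        self.forget(server)
        self._by_server[server] = set(handles)
        for handle in handles:
            self._by_chunk[handle].add(server)

    def forget(self, server):
        """Drop a chunkserver that failed, left the cluster, or was renamed."""
        for handle in self._by_server.pop(server, set()):
            holders = self._by_chunk[handle]
            holders.discard(server)
            if not holders:
                del self._by_chunk[handle]

    def locations(self, handle):
        return sorted(self._by_chunk.get(handle, ()))
```

Because a report overwrites the previous entry for that server, a chunk that vanished from a bad disk simply stops appearing, and there is no persistent copy that could disagree with the chunkserver.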
2.6.3 Operation Log
The operation log contains a historical record of critical metadata changes. It is central to GFS. Not only is it the only persistent record of metadata, but it also serves as a logical time line that defines the order of concurrent operations. Files and chunks, as well as their versions (see Section 4.5), are all uniquely and eternally identified by the logical times at which they were created.
Since the operation log is critical, we must store it reliably and not make changes visible to clients until metadata changes are made persistent. Otherwise, we effectively lose the whole file system or recent client operations even if the chunks themselves survive. Therefore, we replicate it on multiple remote machines and respond to a client operation only after flushing the corresponding log record to disk both locally and remotely. The master batches several log records together before flushing, thereby reducing the impact of flushing and replication on overall system throughput.
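The rule of acknowledging a client only after its log record is durable, combined with batching, resembles group commit. The sketch below shows that idea on a local file; `OperationLog`, the `remotes` objects with a `store` method, the newline framing, and the fixed batch size are all assumptions made for illustration.

```python
import os


class OperationLog:
    """Buffers log records and acknowledges them only after a durable flush."""

    def __init__(self, path, remotes, batch_size=4):
        self._file = open(path, "ab")
        self._remotes = remotes        # hypothetical replicas exposing store(bytes)
        self._batch_size = batch_size  # assumed policy; the paper gives no number
        self._pending = []             # (record bytes, acknowledgement callback)

    def append(self, record, on_durable):
        """Queue a record; `on_durable` runs only once the record is flushed."""
        self._pending.append((record, on_durable))
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self):
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        payload = b"".join(record + b"\n" for record, _ in batch)
        self._file.write(payload)
        self._file.flush()
        os.fsync(self._file.fileno())  # local disk first
        for remote in self._remotes:   # then every remote replica
            remote.store(payload)
        for _, on_durable in batch:    # only now reply to the clients
            on_durable()

    def close(self):
        self.flush()
        self._file.close()
```

One disk sync and one round of replication are shared by every record in the batch, which is the throughput benefit the paragraph describes.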
The master recovers its file system state by replaying the operation log. To minimize startup time, we must keep the log small. The master checkpoints its state whenever the log grows beyond a certain size so that it can recover by loading the latest checkpoint from local disk and replaying only the limited number of log records after that. The checkpoint is in a compact B-tree like form that can be directly mapped into memory and used for namespace lookup without extra parsing. This further speeds up recovery and improves availability.
Because building a checkpoint can take a while, the master’s internal state is structured in such a way that a new checkpoint can be created without delaying incoming mutations. The master switches to a new log file and creates the new checkpoint in a separate thread. The new checkpoint includes all mutations before the switch. It can be created in a minute or so for a cluster with a few million files. When completed, it is written to disk both locally and remotely.
Recovery needs only the latest complete checkpoint and subsequent log files. Older checkpoints and log files can be freely deleted, though we keep a few around to guard against catastrophes. A failure during checkpointing does not affect correctness because the recovery code detects and skips incomplete checkpoints.
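The recovery procedure described above (load the latest complete checkpoint, replay only the later log files, skip incomplete checkpoints) can be sketched as follows. The data model is an assumption: state is a plain dictionary, log files are numbered, and each checkpoint records the number of the first log file it does not cover. The real checkpoint is a compact B-tree-like structure, which this sketch does not attempt to model.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    next_log: int                  # first log file not folded into this checkpoint
    state: dict = field(default_factory=dict)
    complete: bool = True          # False models a crash during checkpointing


def recover(checkpoints, logs):
    """Rebuild state from the newest complete checkpoint plus later log files.

    `logs` maps a log file number to its ordered list of (path, value)
    mutations; a value of None stands for a deletion.
    """
    usable = [c for c in checkpoints if c.complete]
    base = max(usable, key=lambda c: c.next_log, default=Checkpoint(next_log=0))
    state = dict(base.state)
    for number in sorted(n for n in logs if n >= base.next_log):
        for path, value in logs[number]:
            if value is None:
                state.pop(path, None)
            else:
                state[path] = value
    return state
```

Switching to a new log file when a checkpoint starts is what gives each checkpoint a clean `next_log` boundary: everything before the switch is in the checkpoint, everything after it is replayed.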
2.7 Consistency Model
GFS has a relaxed consistency model that supports our highly distributed applications well but remains relatively simple and efficient to implement. We now discuss GFS’s guarantees and what they mean to applications. We also highlight how GFS maintains these guarantees but leave the details to other parts of the paper.
The state of a file region after a data mutation depends on the type of mutation, whether it succeeds or fails, and whether there are concurrent mutations. Table 1 summarizes the result. A file region is consistent if all clients will always see the same data, regardless of which replicas they read from. A region is defined after a file data mutation if it is consistent and clients will see what the mutation writes in its entirety. When a mutation succeeds without interference from concurrent writers, the affected region is defined (and by implication consistent): all clients will always see what the mutation has written. Concurrent successful mutations leave the region undefined but consistent: all clients see the same data, but it may not reflect what any one mutation has written. Typically, it consists of mingled fragments from multiple mutations. A failed mutation makes the region inconsistent (hence also undefined): different clients may see different data at different times. We describe below how our applications can distinguish defined regions from undefined regions. The applications do not need to further distinguish between different kinds of undefined regions.
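The three outcomes in the paragraph above can be restated as a small decision function. This is only an encoding of the prose for writes; the names are hypothetical, and record appends behave differently, as the next paragraph explains.

```python
from enum import Enum


class RegionState(Enum):
    DEFINED = "defined"                        # consistent, shows one mutation in full
    CONSISTENT_UNDEFINED = "consistent but undefined"
    INCONSISTENT = "inconsistent"              # hence also undefined


def state_after_write(succeeded, concurrent):
    """Region state after a write, following the rules stated in the text."""
    if not succeeded:
        return RegionState.INCONSISTENT
    if concurrent:
        return RegionState.CONSISTENT_UNDEFINED
    return RegionState.DEFINED


def is_consistent(replica_contents):
    """A region is consistent when every replica holds the same bytes."""
    return len(set(replica_contents)) <= 1
```

For example, `state_after_write(True, True)` yields the consistent but undefined state: all replicas agree, yet the bytes may be mingled fragments of several writes.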
Data mutations may be writes or record appends. A write causes data to be written at an application-specified file offset. A record append causes data (the “record”) to be appended atomically at least once even in the presence of concurrent mutations, but at an offset of GFS’s choosing (Section 3.3). (In contrast, a “regular” append is merely a write at an offset that the client believes to be the current end of file.) The offset is returned to the client and marks the beginning of a defined region that contains the record. In addition, GFS may insert padding or record duplicates in between. They occupy regions considered to be inconsistent and are typically dwarfed by the amount of user data.
After a sequence of successful mutations, the mutated file region is guaranteed to be defined and contain the data written by the last mutation. GFS achieves this by (a) applying mutations to a chunk in the same order on all its replicas (Section 3.1), and (b) using chunk version numbers to detect any replica that has become stale because it has missed mutations while its chunkserver was down (Section 4.5). Stale replicas will never be involved in a mutation or given to clients asking the master for chunk locations. They are garbage collected at the earliest opportunity.
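Point (b) can be pictured as a version comparison on the master. The sketch below assumes the master knows one current version per chunk and that each chunkserver reports the version of its replica; the function names and the dictionary shape are hypothetical, and the full protocol is left to Section 4.5.

```python
def stale_replicas(current_version, replica_versions):
    """Chunkservers whose replica missed mutations (version behind the master's)."""
    return sorted(s for s, v in replica_versions.items() if v < current_version)


def replicas_for_clients(current_version, replica_versions):
    """Locations the master may hand out: stale replicas are never included."""
    stale = set(stale_replicas(current_version, replica_versions))
    return sorted(s for s in replica_versions if s not in stale)
```

A chunkserver that was down during a mutation comes back reporting an older version, so it is excluded from both client replies and future mutations until it is garbage collected.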
Since clients cache chunk locations, they may read from a stale replica before that information is refreshed. This window is limited by the cache entry’s timeout and the next open of the file, which purges from the cache all chunk information for that file. Moreover, as most of our files are append-only, a stale replica usually returns a premature end of chunk rather than outdated data. When a reader retries and contacts the master, it will immediately get current chunk locations.
Long after a successful mutation, component failures can of course still corrupt or destroy data. GFS identifies failed chunkservers by regular handshakes between master and all chunkservers and detects data corruption by checksumming (Section 5.2). Once a problem surfaces, the data is restored from valid replicas as soon as possible (Section 4.3). A chunk is lost irreversibly only if all its replicas are lost before GFS can react, typically within minutes. Even in this case, it becomes unavailable, not corrupted: applications receive clear errors rather than corrupt data.
2.7.2 Implications for Applications
GFS applications can accommodate the relaxed consistency model with a few simple techniques already needed for other purposes: relying on appends rather than overwrites, checkpointing, and writing self-validating, self-identifying records.
Practically all our applications mutate files by appending rather than overwriting. In one typical use, a writer generates a file from beginning to end. It atomically renames the file to a permanent name after writing all the data, or periodically checkpoints how much has been successfully written. Checkpoints may also include application-level checksums. Readers verify and process only the file region up to the last checkpoint, which is known to be in the defined state. Regardless of consistency and concurrency issues, this approach has served us well. Appending is far more efficient and more resilient to application failures than random writes. Checkpointing allows writers to restart incrementally and keeps readers from processing successfully written file data that is still incomplete from the application’s perspective.
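The write-then-rename pattern from the paragraph above has a familiar analogue on a local file system, shown here as a rough sketch. It uses Python's standard `os.replace` on local files as a stand-in, since the GFS client API itself is not shown in the paper; the `.tmp` suffix and the function name are assumptions.

```python
import os


def write_then_publish(final_path, pieces):
    """Write a whole file under a temporary name, then rename it into place."""
    tmp_path = final_path + ".tmp"     # hypothetical naming convention
    with open(tmp_path, "wb") as f:
        for piece in pieces:
            f.write(piece)
        f.flush()
        os.fsync(f.fileno())           # make the data durable before publishing
    os.replace(tmp_path, final_path)   # readers see either no file or the full file
```

Readers that only open the permanent name never observe a partially written file, which is the property the writer relies on.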
In the other typical use, many writers concurrently append to a file for merged results or as a producer-consumer queue. Record append’s append-at-least-once semantics preserves each writer’s output. Readers deal with the occasional padding and duplicates as follows. Each record prepared by the writer contains extra information like checksums so that its validity can be verified. A reader can identify and discard extra padding and record fragments using the checksums. If it cannot tolerate the occasional duplicates (e.g., if they would trigger non-idempotent operations), it can filter them out using unique identifiers in the records, which are often needed anyway to name corresponding application entities such as web documents. These functionalities for record I/O (except duplicate removal) are in library code shared by our applications and applicable to other file interface implementations at Google. With that, the same sequence of records, plus rare duplicates, is always delivered to the record reader.
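A self-validating, self-identifying record format of the kind described above might look like the following sketch. The layout (marker, 64-bit record id, payload length, CRC32), the marker value, and the byte-by-byte resynchronisation are all assumptions chosen for illustration; the paper does not specify the format used by Google's record I/O library. Duplicate filtering is folded into the reader here for brevity, although the text notes that the shared library leaves it to the application.

```python
import struct
import zlib

MAGIC = 0x5245434F                 # arbitrary marker chosen for this sketch
HEADER = struct.Struct("<IQII")    # marker, record id, payload length, CRC32


def encode(record_id, payload):
    """Frame a payload so a reader can validate and identify it."""
    crc = zlib.crc32(struct.pack("<Q", record_id) + payload)
    return HEADER.pack(MAGIC, record_id, len(payload), crc) + payload


def read_records(data):
    """Yield (record_id, payload) once per id, skipping padding and fragments."""
    seen = set()
    pos = 0
    while pos + HEADER.size <= len(data):
        magic, record_id, length, crc = HEADER.unpack_from(data, pos)
        body_start = pos + HEADER.size
        payload = data[body_start:body_start + length]
        valid = (
            magic == MAGIC
            and len(payload) == length
            and zlib.crc32(struct.pack("<Q", record_id) + payload) == crc
        )
        if not valid:
            pos += 1               # padding or a fragment: resynchronise
            continue
        pos = body_start + length
        if record_id not in seen:  # drop duplicates left by retried appends
            seen.add(record_id)
            yield record_id, payload
```

Given a region that contains a record, some padding, a truncated fragment, and a duplicate, the reader returns each valid record exactly once and in file order.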