HDFS Architecture (Apache Hadoop 2.1.1-beta)

Apache Hadoop 2.1.1-beta, excerpted from http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

HDFS Architecture

Introduction

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop Core project. The project URL is http://hadoop.apache.org/.

Assumptions and Goals

Hardware Failure

Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

Streaming Data Access

Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing than for interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas have been traded to increase data throughput rates.

Large Data Sets

Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.

Simple Coherency Model

HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. A Map/Reduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in the future.

“Moving Computation is Cheaper than Moving Data”

A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.

Portability Across Heterogeneous Hardware and Software Platforms

HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.

NameNode and DataNodes

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

[Image 1: HDFS Architecture]

The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine, but in a real deployment that is rarely the case.

The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.

The File System Namespace

HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems; one can create and remove files, move a file from one directory to another, or rename a file. HDFS does not yet implement user quotas or access permissions. HDFS does not support hard links or soft links. However, the HDFS architecture does not preclude implementing these features.

The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.

Data Replication

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time.

The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.

[Image 2: Data Replication]

Replica Placement: The First Baby Steps

The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems. This is a feature that needs lots of tuning and experience. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current implementation for the replica placement policy is a first effort in this direction. The short-term goals of implementing this policy are to validate it on production systems, learn more about its behavior, and build a foundation to test and research more sophisticated policies.

Large HDFS instances run on a cluster of computers that commonly spread across many racks. Communication between two nodes in different racks has to go through switches. In most cases, network bandwidth between machines in the same rack is greater than network bandwidth between machines in different racks.

The NameNode determines the rack id each DataNode belongs to via the process outlined in Hadoop Rack Awareness. A simple but non-optimal policy is to place replicas on unique racks. This prevents losing data when an entire rack fails and allows use of bandwidth from multiple racks when reading data. This policy evenly distributes replicas in the cluster, which makes it easy to balance load on component failure. However, this policy increases the cost of writes because a write needs to transfer blocks to multiple racks.

For the common case, when the replication factor is three, HDFS’s placement policy is to put one replica on one node in the local rack, another on a different node in the local rack, and the last on a different node in a different rack. This policy cuts the inter-rack write traffic, which generally improves write performance. The chance of rack failure is far less than that of node failure; this policy does not impact data reliability and availability guarantees. However, it does reduce the aggregate network bandwidth used when reading data since a block is placed in only two unique racks rather than three. With this policy, the replicas of a file do not evenly distribute across the racks. One third of replicas are on one node, two thirds of replicas are on one rack, and the other third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance.

The current, default replica placement policy described here is a work in progress.
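
The replication-factor-three policy described above can be sketched in a few lines of Python. This is only an illustration of the policy as this text states it, not the NameNode's actual code: the function name `choose_replica_targets`, the `(rack, node)` tuples, and the `nodes_by_rack` dictionary are all hypothetical.

```python
import random


def choose_replica_targets(writer, nodes_by_rack, rng=random):
    """Pick three DataNodes following the policy described in the text.

    writer        -- hypothetical (rack, node) pair for the node in the local rack
    nodes_by_rack -- hypothetical dict mapping rack id -> list of node names
    """
    local_rack, local_node = writer
    # Replica 1: one node in the local rack.
    first = (local_rack, local_node)
    # Replica 2: a different node in the same (local) rack.
    local_others = [n for n in nodes_by_rack[local_rack] if n != local_node]
    # Replica 3: a node in a different rack.
    remote_racks = [r for r in nodes_by_rack
                    if r != local_rack and nodes_by_rack[r]]
    if not local_others or not remote_racks:
        raise ValueError("topology too small for this policy")
    second = (local_rack, rng.choice(local_others))
    remote_rack = rng.choice(remote_racks)
    third = (remote_rack, rng.choice(nodes_by_rack[remote_rack]))
    return [first, second, third]
```

With this layout the three replicas always land on exactly two racks, which is the trade-off between write cost and read bandwidth that the paragraph discusses.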

Replica Selection

To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader. If there exists a replica on the same rack as the reader node, then that replica is preferred to satisfy the read request. If an HDFS cluster spans multiple data centers, then a replica that is resident in the local data center is preferred over any remote replica.
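
As a minimal sketch of this preference order, the Python below ranks replicas by a coarse distance and picks the closest one. The three-level distance scale and the `dc` and `rack` dictionary keys are assumptions made for the example; the text only states the ordering (same rack first, then local data center, then remote).

```python
def distance(reader, replica):
    """Hypothetical distance: 0 = same rack, 1 = same data center, 2 = remote."""
    if reader["dc"] == replica["dc"]:
        return 0 if reader["rack"] == replica["rack"] else 1
    return 2


def select_replica(reader, replicas):
    """Return the replica closest to the reader under the distance above."""
    return min(replicas, key=lambda replica: distance(reader, replica))


if __name__ == "__main__":
    reader = {"dc": "dc1", "rack": "r1"}
    replicas = [
        {"dc": "dc2", "rack": "r9", "node": "x"},
        {"dc": "dc1", "rack": "r2", "node": "y"},
        {"dc": "dc1", "rack": "r1", "node": "z"},
    ]
    print(select_replica(reader, replicas)["node"])
```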

Safemode

On startup, the NameNode enters a special state called Safemode. Replication of data blocks does not occur when the NameNode is in the Safemode state. The NameNode receives Heartbeat and Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks that a DataNode is hosting. Each block has a specified minimum number of replicas. A block is considered safely replicated when the minimum number of replicas of that data block has checked in with the NameNode. After a configurable percentage of safely replicated data blocks checks in with the NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. It then determines the list of data blocks (if any) that still have fewer than the specified number of replicas. The NameNode then replicates these blocks to other DataNodes.
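
The exit condition can be illustrated with a small Python sketch. The function names, the `replica_counts` dictionary, and the `threshold` parameter are hypothetical, and the additional 30-second wait mentioned above is left out for brevity.

```python
def safely_replicated_fraction(replica_counts, min_replicas):
    """Fraction of blocks whose reported replica count meets the minimum.

    replica_counts -- hypothetical dict: block id -> replicas reported so far
    """
    total = len(replica_counts)
    if total == 0:
        return 1.0  # assumption: an empty namespace counts as fully safe
    safe = sum(1 for count in replica_counts.values() if count >= min_replicas)
    return safe / total


def can_leave_safemode(replica_counts, min_replicas, threshold):
    """True once the configured fraction of blocks is safely replicated."""
    return safely_replicated_fraction(replica_counts, min_replicas) >= threshold


def blocks_to_replicate(replica_counts, target_replicas):
    """After leaving Safemode: blocks still below the specified replica count."""
    return sorted(block for block, count in replica_counts.items()
                  if count < target_replicas)
```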

The Persistence of File System Metadata

The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. For example, creating a new file in HDFS causes the NameNode to insert a record into the EditLog indicating this. Similarly, changing the replication factor of a file causes a new record to be inserted into the EditLog. The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode’s local file system too.

The NameNode keeps an image of the entire file system namespace and file Blockmap in memory. This key metadata item is designed to be compact, such that a NameNode with 4 GB of RAM is plenty to support a huge number of files and directories. When the NameNode starts up, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory representation of the FsImage, and flushes out this new version into a new FsImage on disk. It can then truncate the old EditLog because its transactions have been applied to the persistent FsImage. This process is called a checkpoint. In the current implementation, a checkpoint only occurs when the NameNode starts up. Work is in progress to support periodic checkpointing in the near future.
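
To make the checkpoint sequence concrete, here is a toy Python sketch. It assumes, purely for illustration, that the image is a JSON dictionary and the log is one JSON record per line; the real FsImage and EditLog formats are different, and the operation names `create`, `set_replication`, and `delete` are hypothetical.

```python
import json
import os


def apply_edit(namespace, edit):
    """Apply one logged transaction to the in-memory namespace (a dict)."""
    op = edit["op"]
    if op == "create":
        namespace[edit["path"]] = {"replication": edit["replication"]}
    elif op == "set_replication":
        namespace[edit["path"]]["replication"] = edit["replication"]
    elif op == "delete":
        namespace.pop(edit["path"], None)
    else:
        raise ValueError(f"unknown op: {op}")


def checkpoint(fsimage_path, editlog_path):
    """Load the image, replay the log, write a new image, truncate the log."""
    with open(fsimage_path) as f:
        namespace = json.load(f)
    with open(editlog_path) as f:
        for line in f:
            if line.strip():
                apply_edit(namespace, json.loads(line))
    # Write the new image to a temporary file and swap it in, so a crash
    # during the write does not destroy the previous image.
    tmp_path = fsimage_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(namespace, f)
    os.replace(tmp_path, fsimage_path)
    # The logged transactions are now part of the persistent image,
    # so the old log can be truncated.
    open(editlog_path, "w").close()
    return namespace
```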

The DataNode stores HDFS data in files in its local file system. The DataNode has no knowledge about HDFS files. It stores each block of HDFS data in a separate file in its local file system. The DataNode does not create all files in the same directory. Instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories appropriately. It is not optimal to create all local files in the same directory because the local file system might not be able to efficiently support a huge number of files in a single directory. When a DataNode starts up, it scans through its local file system, generates a list of all HDFS data blocks that correspond to each of these local files, and sends this report to the NameNode: this is the Blockreport.

The Communication Protocols

All HDFS communication protocols are layered on top of the TCP/IP protocol. A client establishes a connection to a configurable TCP port on the NameNode machine. It talks the ClientProtocol with the NameNode. The DataNodes talk to the NameNode using the DataNode Protocol. A Remote Procedure Call (RPC) abstraction wraps both the Client Protocol and the DataNode Protocol. By design, the NameNode never initiates any RPCs. Instead, it only responds to RPC requests issued by DataNodes or clients.

Robustness

The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failures are NameNode failures, DataNode failures and network partitions.

Data Disk Failure, Heartbeats and Re-Replication

Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.
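
The bookkeeping behind this can be sketched as follows. This is a simplified Python illustration under assumed data structures: `last_heartbeat`, `block_locations`, and the `timeout` value are hypothetical, and the text does not specify how long the NameNode waits before declaring a DataNode dead.

```python
def dead_datanodes(last_heartbeat, now, timeout):
    """DataNodes whose most recent Heartbeat is older than the timeout.

    last_heartbeat -- hypothetical dict: DataNode name -> time of last Heartbeat
    """
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout}


def under_replicated(block_locations, dead, replication_factor):
    """Map each under-replicated block to the number of replicas it is missing.

    block_locations -- hypothetical dict: block id -> set of DataNodes holding it
    """
    missing = {}
    for block, nodes in block_locations.items():
        live = len(nodes - dead)  # replicas on dead DataNodes no longer count
        if live < replication_factor:
            missing[block] = replication_factor - live
    return missing
```

A real scheduler would also have to choose source and target DataNodes for each missing replica; that part is omitted here.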

Cluster Rebalancing

The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one DataNode to another if the free space on a DataNode falls below a certain threshold. In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented.

Data Integrity

It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can occur because of faults in a storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents, it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file. If not, then the client can opt to retrieve that block from another DataNode that has a replica of that block.
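
A minimal Python sketch of this verify-then-fall-back behaviour is shown below. CRC32 from the standard `zlib` module is used only as a stand-in, since the text does not say which checksum algorithm HDFS uses; the function names and the list-of-bytes representation of replicas are hypothetical.

```python
import zlib


def block_checksum(data):
    """Checksum of one block (CRC32 chosen for illustration only)."""
    return zlib.crc32(data)


def read_verified_block(replica_copies, expected_checksum):
    """Return the first replica copy whose checksum matches the stored value.

    replica_copies -- hypothetical list of bytes objects, one per DataNode copy
    """
    for data in replica_copies:
        if block_checksum(data) == expected_checksum:
            return data
        # Mismatch: this copy arrived corrupted, so try the next replica.
    raise IOError("no replica matched the stored checksum")
```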

Metadata Disk Failure

The FsImage and the EditLog are central data structures of HDFS. A corruption of these files can cause the HDFS instance to be non-functional. For this reason, the NameNode can be configured to support maintaining multiple copies of the FsImage and EditLog. Any update to either the FsImage or EditLog causes each of the FsImages and EditLogs to get updated synchronously. This synchronous updating of multiple copies of the FsImage and EditLog may degrade the rate of namespace transactions per second that a NameNode can support. However, this degradation is acceptable because even though HDFS applications are very data intensive in nature, they are not metadata intensive. When a NameNode restarts, it selects the latest consistent FsImage and EditLog to use.

The NameNode machine is a single point of failure for an HDFS cluster. If the NameNode machine fails, manual intervention is necessary. Currently, automatic restart and failover of the NameNode software to another machine is not supported.

Snapshots

Snapshots support storing a copy of data at a particular instant of time. One usage of the snapshot feature may be to roll back a corrupted HDFS instance to a previously known good point in time. HDFS does not currently support snapshots but will in a future release.

Data Organization

Data Blocks

HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a different DataNode.
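
As a rough illustration of this chunking, the Python sketch below computes the block lengths for a file of a given size. The names `split_into_blocks` and `BLOCK_SIZE` are hypothetical and not part of any HDFS API; only the 64 MB figure comes from the text.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical block size cited in the text


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the length in bytes of each block of a file_size-byte file.

    Every block except possibly the last is exactly block_size bytes long.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    lengths = [block_size] * full_blocks
    if remainder:
        lengths.append(remainder)
    return lengths


if __name__ == "__main__":
    # A 200 MB file: three full 64 MB blocks plus one 8 MB block.
    mb = 1024 * 1024
    print([n // mb for n in split_into_blocks(200 * mb)])
```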

Staging

A client request to create a file does not reach the NameNode immediately. In fact, initially the HDFS client caches the file data into a temporary local file. Application writes are transparently redirected to this temporary local file. When the local file accumulates data worth over one HDFS block size, the client contacts the NameNode. The NameNode inserts the file name into the file system hierarchy and allocates a data block for it. The NameNode responds to the client request with the identity of the DataNode and the destination data block. Then the client flushes the block of data from the local temporary file to the specified DataNode. When a file is closed, the remaining un-flushed data in the temporary local file is transferred to the DataNode. The client then tells the NameNode that the file is closed. At this point, the NameNode commits the file creation operation into a persistent store. If the NameNode dies before the file is closed, the file is lost.

The above approach has been adopted after careful consideration of target applications that run on HDFS. These applications need streaming writes to files. If a client writes to a remote file directly without any client side buffering, the network speed and the congestion in the network impact throughput considerably. This approach is not without precedent. Earlier distributed file systems, e.g. AFS, have used client side caching to improve performance. A POSIX requirement has been relaxed to achieve higher performance of data uploads.

Replication Pipelining

When a client is writing data to an HDFS file, its data is first written to a local file as explained in the previous section. Suppose the HDFS file has a replication factor of three. When the local file accumulates a full block of user data, the client retrieves a list of DataNodes from the NameNode. This list contains the DataNodes that will host a replica of that block. The client then flushes the data block to the first DataNode. The first DataNode starts receiving the data in small portions (4 KB), writes each portion to its local repository and transfers that portion to the second DataNode in the list. The second DataNode, in turn, starts receiving each portion of the data block, writes that portion to its repository and then flushes that portion to the third DataNode. Finally, the third DataNode writes the data to its local repository. Thus, a DataNode can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline. Thus, the data is pipelined from one DataNode to the next.
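
The portion-by-portion forwarding can be illustrated with a single-threaded Python simulation. It shows only the order in which each 4 KB portion passes along the chain, not the concurrency of a real pipeline; the function names and the use of a `bytearray` to stand for each DataNode's local repository are assumptions of this sketch.

```python
PORTION_SIZE = 4 * 1024  # 4 KB portions, as described in the text


def portions(block, size=PORTION_SIZE):
    """Yield successive portions of a block."""
    for start in range(0, len(block), size):
        yield block[start:start + size]


def pipeline_write(block, datanodes):
    """Push a block through a chain of simulated DataNodes.

    datanodes -- hypothetical list of bytearray objects, one per DataNode,
                 ordered as in the list returned by the NameNode
    """
    for portion in portions(block):
        # Each DataNode writes the portion to its repository, then the same
        # portion moves on to the next DataNode in the chain.
        for repository in datanodes:
            repository.extend(portion)
```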

Accessibility

HDFS can be accessed from applications in many different ways. Natively, HDFS provides a File System Java API for applications to use. A C language wrapper for this Java API is also available. In addition, an HTTP browser can also be used to browse the files of an HDFS instance. Work is in progress to expose HDFS through the WebDAV protocol.

FS Shell

HDFS allows user data to be organized in the form of files and directories. It provides a command line interface called FS shell that lets a user interact with the data in HDFS. The syntax of this command set is similar to other shells (e.g. bash, csh) that users are already familiar with. Here are some sample action/command pairs:

Action                                                 Command
Create a directory named /foodir                       bin/hadoop dfs -mkdir /foodir
Remove a directory named /foodir                       bin/hadoop dfs -rmr /foodir
View the contents of a file named /foodir/myfile.txt   bin/hadoop dfs -cat /foodir/myfile.txt

FS shell is targeted for applications that need a scripting language to interact with the stored data.

DFSAdmin

The DFSAdmin command set is used for administering an HDFS cluster. These are commands that are used only by an HDFS administrator. Here are some sample action/command pairs:

Action                                     Command
Put the cluster in Safemode                bin/hadoop dfsadmin -safemode enter
Generate a list of DataNodes               bin/hadoop dfsadmin -report
Recommission or decommission DataNode(s)   bin/hadoop dfsadmin -refreshNodes

Browser Interface

A typical HDFS install configures a web server to expose the HDFS namespace through a configurable TCP port. This allows a user to navigate the HDFS namespace and view the contents of its files using a web browser.

Space Reclamation

File Deletes and Undeletes

When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly as long as it remains in /trash. A file remains in /trash for a configurable amount of time. After the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace. The deletion of a file causes the blocks associated with the file to be freed. Note that there could be an appreciable time delay between the time a file is deleted by a user and the time of the corresponding increase in free space in HDFS.

A user can undelete a file after deleting it as long as it remains in the /trash directory. If a user wants to undelete a file that he/she has deleted, he/she can navigate the /trash directory and retrieve the file. The /trash directory contains only the latest copy of the file that was deleted. The /trash directory is just like any other directory with one special feature: HDFS applies specified policies to automatically delete files from this directory. The current default policy is to delete files from /trash that are more than 6 hours old. In the future, this policy will be configurable through a well defined interface.
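
The expiry rule can be sketched as a small Python function. The `trash` dictionary of deletion timestamps and the function name are hypothetical; only the 6-hour default comes from the text.

```python
TRASH_LIFETIME_SECONDS = 6 * 60 * 60  # default policy from the text: 6 hours


def expired_trash_entries(trash, now, lifetime=TRASH_LIFETIME_SECONDS):
    """Paths in /trash whose age exceeds the configured lifetime.

    trash -- hypothetical dict: path under /trash -> deletion time in seconds
    now   -- current time in the same units
    """
    return sorted(path for path, deleted_at in trash.items()
                  if now - deleted_at > lifetime)
```

Files returned by this function are the ones the NameNode would remove from the namespace; anything younger can still be restored by moving it out of /trash.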

Decrease Replication Factor

When the replication factor of a file is reduced, the NameNode selects excess replicas that can be deleted. The next Heartbeat transfers this information to the DataNode. The DataNode then removes the corresponding blocks and the corresponding free space appears in the cluster. Once again, there might be a time delay between the completion of the setReplication API call and the appearance of free space in the cluster.

 

