This document is a starting point for users working with the Hadoop Distributed File System (HDFS), either as part of a Hadoop cluster or as a stand-alone, general-purpose distributed file system. While HDFS is designed to "just work" in many environments, a working knowledge of HDFS helps greatly with configuration improvements and diagnostics on a specific cluster.
HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data. The HDFS Architecture Guide describes HDFS in detail. This user guide primarily deals with the interaction of users and administrators with HDFS clusters. The HDFS architecture diagram depicts basic interactions among the NameNode, the DataNodes, and the clients. Clients contact the NameNode for file metadata or file modifications and perform actual file I/O directly with the DataNodes.
The following are some of the salient features that could be of interest to many users.
- File permissions and authentication.
- Rack awareness: takes a node's physical location into account while scheduling tasks and allocating storage.
- Safemode: an administrative mode for maintenance.
- fsck: a utility to diagnose the health of the file system and to find missing files or blocks.
- fetchdt: a utility to fetch a DelegationToken and store it in a file on the local system.
- Rebalancer: a tool to balance the cluster when the data is unevenly distributed among DataNodes.
- Upgrade and rollback: after a software upgrade, it is possible to roll back to the state HDFS was in before the upgrade in case of unexpected problems.
- Secondary NameNode: performs periodic checkpoints of the namespace and helps keep the size of the file containing the log of HDFS modifications within certain limits at the NameNode.
- Checkpoint node: performs periodic checkpoints of the namespace and helps minimize the size of the log stored at the NameNode containing changes to HDFS. It replaces the role previously filled by the Secondary NameNode, though it is not yet battle hardened. The NameNode allows multiple Checkpoint nodes simultaneously, as long as there are no Backup nodes registered with the system.
- Backup node: an extension to the Checkpoint node. In addition to checkpointing, it also receives a stream of edits from the NameNode and maintains its own in-memory copy of the namespace, which is always in sync with the active NameNode namespace state. Only one Backup node may be registered with the NameNode at once.
The following documents describe how to install and set up a Hadoop cluster:
The rest of this document assumes the user is able to set up and run an HDFS cluster with at least one DataNode. For the purpose of this document, both the NameNode and DataNode could be running on the same physical machine.
The NameNode and each DataNode run an internal web server that displays basic information about the current status of the cluster. With the default configuration, the NameNode front page is at http://namenode-name:50070/. It lists the DataNodes in the cluster and basic statistics of the cluster. The web interface can also be used to browse the file system (using the "Browse the file system" link on the NameNode front page).
Hadoop includes various shell-like commands that directly interact with HDFS and other file systems that Hadoop supports. The command bin/hdfs dfs -help lists the commands supported by the Hadoop shell. Furthermore, the command bin/hdfs dfs -help command-name displays more detailed help for a command. These commands support most of the normal file system operations like copying files, changing file permissions, etc. They also support a few HDFS-specific operations like changing the replication of files. For more information see .
The bin/hadoop dfsadmin command supports a few operations related to HDFS administration. The bin/hadoop dfsadmin -help command lists all the commands currently supported. For example:
The NameNode stores modifications to the file system as a log appended to a native file system file, edits. When a NameNode starts up, it reads the HDFS state from an image file, fsimage, and then applies the edits from the edits log file. It then writes the new HDFS state to fsimage and starts normal operation with an empty edits file. Since the NameNode merges the fsimage and edits files only during start up, the edits log file could get very large over time on a busy cluster. Another side effect of a larger edits file is that the next restart of the NameNode takes longer.
The secondary NameNode merges the fsimage and the edits log files periodically and keeps the edits log size within a limit. It is usually run on a different machine than the primary NameNode since its memory requirements are on the same order as the primary NameNode.
The start of the checkpoint process on the secondary NameNode is controlled by two configuration parameters.
The secondary NameNode stores the latest checkpoint in a directory which is structured the same way as the primary NameNode's directory, so that the checkpointed image is always ready to be read by the primary NameNode if necessary.
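To make the merge step concrete, here is a minimal Python sketch of replaying an edits log on top of a checkpointed image. It is a toy model under stated assumptions: the namespace is a plain dictionary, and the operation names (`create`, `delete`, `set_replication`) and the function `apply_edits` are hypothetical, not the on-disk formats or APIs Hadoop actually uses.

```python
def apply_edits(fsimage, edits):
    """Return a new namespace: the checkpointed image with each logged edit replayed in order."""
    namespace = dict(fsimage)  # copy so the input image is left untouched
    for op, path, value in edits:
        if op == "create":
            namespace[path] = value
        elif op == "delete":
            namespace.pop(path, None)
        elif op == "set_replication":
            # Build a new entry rather than mutating the one shared with fsimage.
            namespace[path] = {**namespace[path], "replication": value}
        else:
            raise ValueError(f"unknown edit operation: {op}")
    return namespace


# Hypothetical image and log, for illustration only.
fsimage = {"/a.txt": {"replication": 3}}
edits = [
    ("create", "/b.txt", {"replication": 3}),
    ("set_replication", "/a.txt", 2),
    ("delete", "/b.txt", None),
]

new_fsimage = apply_edits(fsimage, edits)  # the merged checkpoint
edits = []                                 # normal operation resumes with an empty log
print(new_fsimage)
```

The longer the log, the more operations have to be replayed, which is why an unmerged edits file slows down the next NameNode restart.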
The NameNode persists its namespace using two files: fsimage, which is the latest checkpoint of the namespace, and edits, a journal (log) of changes to the namespace since the checkpoint. When a NameNode starts up, it merges the fsimage and edits journal to provide an up-to-date view of the file system metadata. The NameNode then overwrites fsimage with the new HDFS state and begins a new edits journal.
The Checkpoint node periodically creates checkpoints of the namespace. It downloads fsimage and edits from the active NameNode, merges them locally, and uploads the new image back to the active NameNode. The Checkpoint node usually runs on a different machine than the NameNode since its memory requirements are on the same order as the NameNode. The Checkpoint node is started by bin/hdfs namenode -checkpoint on the node specified in the configuration file.
The location of the Checkpoint (or Backup) node and its accompanying web interface are configured via the dfs.namenode.backup.address and dfs.namenode.backup.http-address configuration variables.
The start of the checkpoint process on the Checkpoint node is controlled by two configuration parameters.
The Checkpoint node stores the latest checkpoint in a directory that is structured the same as the NameNode's directory. This allows the checkpointed image to be always available for reading by the NameNode if necessary. See Import checkpoint.
Multiple checkpoint nodes may be specified in the cluster configuration file.
The Backup node provides the same checkpointing functionality as the Checkpoint node, and also maintains an in-memory, up-to-date copy of the file system namespace that is always synchronized with the active NameNode state. Along with accepting a journal stream of file system edits from the NameNode and persisting it to disk, the Backup node also applies those edits to its own copy of the namespace in memory, thus creating a backup of the namespace.
The Backup node does not need to download the fsimage and edits files from the active NameNode in order to create a checkpoint, as would be required with a Checkpoint node or Secondary NameNode, since it already has an up-to-date copy of the namespace state in memory. The Backup node checkpoint process is more efficient as it only needs to save the namespace into the local fsimage file and reset edits.
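The difference from a download-and-merge checkpoint can be sketched with a small Python model. This is an illustration only: the class `BackupNodeModel`, its attributes, and the `create`/`delete` operations are hypothetical stand-ins, and the "files" are in-memory objects rather than real fsimage or edits files.

```python
class BackupNodeModel:
    """Toy model: an in-memory namespace kept current by a stream of edits."""

    def __init__(self):
        self.namespace = set()    # in-memory copy of the namespace (paths only)
        self.journal = []         # edits persisted since the last checkpoint
        self.saved_image = set()  # stands in for the local fsimage file

    def receive_edit(self, op, path):
        """Persist one streamed edit and apply it to the in-memory namespace."""
        self.journal.append((op, path))
        if op == "create":
            self.namespace.add(path)
        elif op == "delete":
            self.namespace.discard(path)
        else:
            raise ValueError(f"unknown edit operation: {op}")

    def checkpoint(self):
        """Save the current namespace and reset the journal; nothing is downloaded."""
        self.saved_image = set(self.namespace)
        self.journal.clear()


node = BackupNodeModel()
for edit in [("create", "/a"), ("create", "/b"), ("delete", "/a")]:
    node.receive_edit(*edit)
node.checkpoint()
print(sorted(node.saved_image), node.journal)
```

Because every edit has already been applied as it arrived, the checkpoint is just a save of the current state followed by a journal reset.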
As the Backup node maintains a copy of the namespace in memory, its RAM requirements are the same as the NameNode's.
The NameNode supports one Backup node at a time. No Checkpoint nodes may be registered if a Backup node is in use. Using multiple Backup nodes concurrently will be supported in the future.
The Backup node is configured in the same manner as the Checkpoint node. It is started with bin/hdfs namenode -backup.
The location of the Backup (or Checkpoint) node and its accompanying web interface are configured via the dfs.namenode.backup.address and dfs.namenode.backup.http-address configuration variables.
Use of a Backup node provides the option of running the NameNode with no persistent storage, delegating all responsibility for persisting the state of the namespace to the Backup node. To do this, start the NameNode with the -importCheckpoint option, along with specifying no persistent storage directories of type edits (dfs.namenode.edits.dir) in the NameNode configuration.
For a complete discussion of the motivation behind the creation of the Backup node and Checkpoint node, see HADOOP-4539. For command usage, see .
The latest checkpoint can be imported to the NameNode if all other copies of the image and the edits files are lost. In order to do that one should:
The NameNode will load the checkpoint from the dfs.namenode.checkpoint.dir directory and then save it to the NameNode directory(s) set in dfs.namenode.name.dir. The NameNode will fail if a legal image is contained in dfs.namenode.name.dir. The NameNode verifies that the image in dfs.namenode.checkpoint.dir is consistent, but does not modify it in any way.
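The two rules in this paragraph (refuse to import when the name directory already holds an image, and never modify the checkpoint) can be sketched in Python. This is a toy model, not Hadoop's implementation: the function `import_checkpoint`, the single file name `fsimage`, and the temporary directories are assumptions made for illustration, and the real on-disk layout is more involved.

```python
import shutil
import tempfile
from pathlib import Path

IMAGE_NAME = "fsimage"  # hypothetical file name for this toy model


def import_checkpoint(checkpoint_dir, name_dir):
    """Copy the checkpoint image into name_dir; refuse if an image already exists there."""
    checkpoint_dir, name_dir = Path(checkpoint_dir), Path(name_dir)
    source = checkpoint_dir / IMAGE_NAME
    target = name_dir / IMAGE_NAME
    if target.exists():
        raise RuntimeError(f"{name_dir} already contains an image; refusing to import")
    if not source.is_file():
        raise FileNotFoundError(f"no checkpoint image in {checkpoint_dir}")
    name_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, target)  # the checkpoint itself is only read
    return target


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp, "checkpoint")
    ckpt.mkdir()
    (ckpt / IMAGE_NAME).write_text("namespace state")

    imported = import_checkpoint(ckpt, Path(tmp, "name"))
    first_import_ok = imported.read_text() == "namespace state"
    checkpoint_untouched = (ckpt / IMAGE_NAME).read_text() == "namespace state"

    try:
        import_checkpoint(ckpt, Path(tmp, "name"))  # name dir now holds an image
        second_import_failed = False
    except RuntimeError:
        second_import_failed = True

print(first_import_ok, checkpoint_untouched, second_import_failed)
```

Failing on an existing image is a safety measure: an import should never silently replace metadata that might still be valid.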
HDFS data might not always be placed uniformly across the DataNodes. One common reason is the addition of new DataNodes to an existing cluster. While placing new blocks (data for a file is stored as a series of blocks), the NameNode considers various parameters before choosing the DataNodes to receive these blocks. Some of the considerations are:
Due to multiple competing considerations, data might not be uniformly placed across the DataNodes. HDFS provides a tool for administrators that analyzes block placement and rebalances data across the DataNodes. A brief administrator's guide for the rebalancer is attached as a PDF to HADOOP-1652.
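One way to picture the analysis step is to compare each DataNode's utilization with the cluster-wide average. The Python sketch below does that under stated assumptions: the function `classify_datanodes`, the node names, and the 10-percentage-point `threshold` are all hypothetical, and the sketch only classifies nodes; it does not model how blocks are actually moved.

```python
def classify_datanodes(used, capacity, threshold=0.10):
    """Label each node as over-utilized, under-utilized, or balanced.

    used and capacity map node name -> bytes; threshold is the allowed
    deviation from the cluster-wide utilization (an assumed default).
    """
    cluster_util = sum(used.values()) / sum(capacity[n] for n in used)
    labels = {}
    for node, node_used in used.items():
        util = node_used / capacity[node]
        if util > cluster_util + threshold:
            labels[node] = "over"
        elif util < cluster_util - threshold:
            labels[node] = "under"
        else:
            labels[node] = "balanced"
    return cluster_util, labels


# Hypothetical cluster: dn3 was added recently and is still empty.
used = {"dn1": 80, "dn2": 70, "dn3": 0}
capacity = {"dn1": 100, "dn2": 100, "dn3": 100}

cluster_util, labels = classify_datanodes(used, capacity)
print(cluster_util, labels)
```

In this example the new, empty node pulls the cluster average down, so both older nodes show up as over-utilized and become candidates for giving up blocks.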
Typically, large Hadoop clusters are arranged in racks, and network traffic between different nodes within the same rack is much more desirable than network traffic across racks. In addition, the NameNode tries to place replicas of a block on multiple racks for improved fault tolerance. Hadoop lets cluster administrators decide which rack a node belongs to through the configuration variable net.topology.script.file.name. When this script is configured, each node runs the script to determine its rack id. A default installation assumes all the nodes belong to the same rack. This feature and its configuration are further described in the PDF attached to HADOOP-692.
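A script referenced by net.topology.script.file.name could look roughly like the following Python sketch. It assumes the script is invoked with one or more host names or IP addresses as arguments and prints one rack id per argument; the address-to-rack table and the fallback name `/default-rack` are illustrative assumptions, so check the calling convention against the documentation for your Hadoop version before relying on it.

```python
import sys

# Hypothetical host-to-rack table; a real site would generate this from inventory data.
RACKS = {
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.2.21": "/rack2",
}
DEFAULT_RACK = "/default-rack"  # assumed fallback for unknown hosts


def rack_ids(hosts):
    """Return the rack id for each host, falling back to DEFAULT_RACK."""
    return [RACKS.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Print the rack ids for the hosts given on the command line.
    print(" ".join(rack_ids(sys.argv[1:])))
```

Keeping the lookup in a small pure function makes the mapping easy to test separately from the command-line wrapper.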
During start up the NameNode loads the file system state from the fsimage and the edits log file. It then waits for DataNodes to report their blocks so that it does not prematurely start replicating blocks even though enough replicas already exist in the cluster. During this time the NameNode stays in Safemode. Safemode for the NameNode is essentially a read-only mode for the HDFS cluster, where it does not allow any modifications to the file system or blocks. Normally the NameNode leaves Safemode automatically after the DataNodes have reported that most file system blocks are available. If required, HDFS can be placed in Safemode explicitly using the bin/hadoop dfsadmin -safemode command. The NameNode front page shows whether Safemode is on or off. A more detailed description and configuration is maintained as JavaDoc for setSafeMode().
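The "most file system blocks are available" condition can be read as a simple threshold check, sketched below in Python. The function name `can_leave_safemode` and the default threshold of 0.999 are assumptions for illustration; the actual rule and its configuration are described in the JavaDoc mentioned above.

```python
def can_leave_safemode(reported_blocks, total_blocks, threshold=0.999):
    """Return True once the reported fraction of blocks reaches the threshold.

    threshold is a hypothetical default, not a value taken from this guide.
    """
    if total_blocks == 0:
        return True  # an empty file system has nothing to wait for
    return reported_blocks / total_blocks >= threshold


# DataNodes report in over time; the NameNode stays read-only until enough blocks are seen.
for reported in (0, 500, 1000):
    print(reported, can_leave_safemode(reported, total_blocks=1000))
```

Waiting for this condition is what prevents the NameNode from scheduling needless re-replication for blocks whose DataNodes simply have not reported yet.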
HDFS supports the fsck command to check for various inconsistencies. It is designed for reporting problems with various files, for example, missing blocks for a file or under-replicated blocks. Unlike a traditional fsck utility for native file systems, this command does not correct the errors it detects. Normally the NameNode automatically corrects most of the recoverable failures. By default fsck ignores open files but provides an option to select all files during reporting. The HDFS fsck command is not a Hadoop shell command. It can be run as bin/hadoop fsck. For command usage, see . fsck can be run on the whole file system or on a subset of files.
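The report-only nature of fsck can be illustrated with a short Python sketch that classifies files by the state of their blocks without changing anything. The data layout (a list of live and expected replica counts per block) and the function `fsck_report` are hypothetical; they are not the output format or internals of the real command.

```python
def fsck_report(files):
    """Classify files by block health without modifying them.

    files maps path -> list of (live_replicas, expected_replicas), one pair per block.
    """
    report = {"missing": [], "under_replicated": [], "healthy": []}
    for path, blocks in sorted(files.items()):
        if any(live == 0 for live, _ in blocks):
            report["missing"].append(path)           # at least one block has no replica
        elif any(live < expected for live, expected in blocks):
            report["under_replicated"].append(path)  # readable, but below target
        else:
            report["healthy"].append(path)
    return report


# Hypothetical namespace snapshot.
files = {
    "/logs/a": [(3, 3), (3, 3)],
    "/logs/b": [(3, 3), (1, 3)],
    "/logs/c": [(0, 3)],
}
report = fsck_report(files)
print(report)
```

Running the check on a subset of files corresponds to passing a smaller dictionary; the input is never altered.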
HDFS supports the fetchdt command to fetch a Delegation Token and store it in a file on the local system. This token can later be used to access a secure server (the NameNode, for example) from a non-secure client. The utility uses either RPC or HTTPS (over Kerberos) to get the token, and thus requires Kerberos tickets to be present before the run (run kinit to get the tickets). The HDFS fetchdt command is not a Hadoop shell command. It can be run as bin/hadoop fetchdt DTfile. After you have the token, you can run an HDFS command without having Kerberos tickets by pointing the HADOOP_TOKEN_FILE_LOCATION environment variable to the delegation token file. For command usage, see the fetchdt command.
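The environment-variable step can be sketched in Python as follows. The helper `env_with_token` is hypothetical, the command lists only restate the invocations from the text, and the sketch deliberately builds the environment without launching anything, since running the commands requires a working Hadoop installation.

```python
import os


def env_with_token(token_file, base_env=None):
    """Return a copy of the environment that points Hadoop at a delegation token file."""
    env = dict(os.environ if base_env is None else base_env)
    env["HADOOP_TOKEN_FILE_LOCATION"] = str(token_file)
    return env


# Step 1 (needs Kerberos tickets, e.g. after kinit): fetch the token into DTfile.
fetch_cmd = ["bin/hadoop", "fetchdt", "DTfile"]

# Step 2 (no tickets needed): run a later command with the token file in its environment.
help_cmd = ["bin/hdfs", "dfs", "-help"]
env = env_with_token("DTfile", base_env={})

# With Hadoop installed, the commands could be launched like this:
#   import subprocess
#   subprocess.run(fetch_cmd, check=True)
#   subprocess.run(help_cmd, env=env_with_token("DTfile"), check=True)
print(env)
```

Passing the variable through a copied environment keeps the token location scoped to the child process instead of changing the caller's own environment.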
Typically, you will configure multiple metadata storage locations. Then, if one storage location is corrupt, you can read the metadata from one of the other storage locations.
However, what can you do if the only storage locations available are corrupt? In this case, there is a special NameNode startup mode called Recovery mode that may allow you to recover most of your data.
You can start the NameNode in recovery mode like so: namenode -recover
When in recovery mode, the NameNode will interactively prompt you at the command line about possible courses of action you can take to recover your data.
If you don't want to be prompted, you can give the -force option. This option will force recovery mode to always select the first choice. Normally, this will be the most reasonable choice.
Because Recovery mode can cause you to lose data, you should always back up your edit log and fsimage before using it.
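A backup before recovery can be as simple as copying the metadata directory to a timestamped location. The Python sketch below shows one way to do that; the function `backup_metadata`, the directory layout, and the file names are hypothetical, and the demonstration runs against a temporary directory rather than a real NameNode storage directory.

```python
import shutil
import tempfile
import time
from pathlib import Path


def backup_metadata(metadata_dir, backup_root):
    """Copy the whole metadata directory to a new timestamped directory and return its path."""
    metadata_dir = Path(metadata_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"{metadata_dir.name}-{stamp}"
    shutil.copytree(metadata_dir, target)  # raises if the target already exists
    return target


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for a NameNode storage directory.
    meta = Path(tmp, "name")
    meta.mkdir()
    (meta / "fsimage").write_text("image")
    (meta / "edits").write_text("log")

    backup = backup_metadata(meta, Path(tmp, "backups"))
    backed_up = sorted(p.name for p in backup.iterdir())

print(backed_up)
```

Copying the directory as a whole keeps the image and the edit log from the same point in time together, which matters if the recovery attempt has to be undone.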
When Hadoop is upgraded on an existing cluster, as with any software upgrade, it is possible there are new bugs or incompatible changes that affect existing applications and were not discovered earlier. In any non-trivial HDFS installation, it is not an option to lose any data, let alone to restart HDFS from scratch. HDFS allows administrators to go back to an earlier version of Hadoop and roll back the cluster to the state it was in before the upgrade. HDFS upgrade is described in more detail on the Wiki page. HDFS can have one such backup at a time. Before upgrading, administrators need to remove the existing backup using the bin/hadoop dfsadmin -finalizeUpgrade command. The following briefly describes the typical upgrade procedure:
- Stop the cluster and distribute the earlier version of Hadoop.
- Start the cluster with the rollback option (bin/start-dfs.sh -rollback).
The file permissions are designed to be similar to file permissions on other familiar platforms like Linux. Currently, security is limited to simple file permissions. The user that starts the NameNode is treated as the superuser for HDFS. Future versions of HDFS will support network authentication protocols like Kerberos for user authentication and encryption of data transfers. The details are discussed in the Permissions Guide.
Hadoop currently runs on clusters with thousands of nodes. The Wiki page lists some of the organizations that deploy Hadoop on large clusters. HDFS has one NameNode for each cluster. Currently the total memory available on the NameNode is the primary scalability limitation. On very large clusters, increasing the average size of files stored in HDFS helps with increasing cluster size without increasing memory requirements on the NameNode. The default configuration may not suit very large clusters. The Wiki page lists suggested configuration improvements for large Hadoop clusters.
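The effect of average file size on NameNode memory can be estimated with simple arithmetic, sketched below in Python. Every number here is an assumption chosen for illustration: the 128 MiB block size, the rough figure of 150 bytes of NameNode memory per file or block object, and the data volumes are not taken from this guide, and the function names are hypothetical.

```python
BLOCK_BYTES = 128 * 1024 * 1024  # assumed block size
BYTES_PER_OBJECT = 150           # assumed rough memory cost per file or block object


def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def namenode_objects(total_bytes, avg_file_bytes, block_bytes=BLOCK_BYTES):
    """Estimate how many file and block objects the NameNode must track."""
    files = ceil_div(total_bytes, avg_file_bytes)
    blocks_per_file = ceil_div(avg_file_bytes, block_bytes)
    return files + files * blocks_per_file


def namenode_memory_bytes(total_bytes, avg_file_bytes):
    """Rough memory estimate under the assumed per-object cost."""
    return namenode_objects(total_bytes, avg_file_bytes) * BYTES_PER_OBJECT


total = 10**15  # the same hypothetical data volume in both cases
small_files = namenode_memory_bytes(total, avg_file_bytes=1024**2)  # 1 MiB average
large_files = namenode_memory_bytes(total, avg_file_bytes=1024**3)  # 1 GiB average
print(small_files, large_files)
```

For the same amount of stored data, fewer and larger files mean far fewer objects for the NameNode to hold in memory, which is the scaling lever this paragraph describes.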