1 目标:This document is a starting point for users working with Hadoop Distributed File System (HDFS) either as a part of a Hadoop cluster or as a stand-alone general purpose distributed file system. While HDFS is designed to "just work" in many environments, a working knowledge of HDFS helps greatly with configuration improvements and diagnostics on a specific cluster.(这篇文档是用户开始使用HDFS,集群或者单点。hdfs被设计成运行在各种环境,hdfs working知识极大的帮助提高配置)
2 概述:HDFS is the primary distributed storage used by Hadoop applications. A HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data. The HDFS Architecture Guide describes HDFS in detail. This user guide primarily deals with the interaction of users and administrators with HDFS clusters. The HDFS architecture diagram depicts basic interactions among NameNode, the DataNodes, and the clients. Clients contact NameNode for file metadata or file modifications and perform actual file I/O directly with the DataNodes.
The following are some of the salient features that could be of interest to many users.
The secondary NameNode merges the fsimage and the edits log files periodically and keeps edits log size within a limit. It is usually run on a different machine than the primary NameNode since its memory requirements are on the same order as the primary NameNode.
The start of the checkpoint process on the secondary NameNode is controlled by two configuration parameters.
The secondary NameNode stores the latest checkpoint in a directory which is structured the same way as the primary NameNode's directory. So that the check pointed image is always ready to be read by the primary NameNode if necessary.
namenode在一个本地文件系统存储着对文件系统的变更,当namenode启动时,它从镜像,fsImage读hdfs状态,并把读到的改变应用到编辑日志文件。并且把新的hdfs状态写到fsimage。在namenode启动期间对fsimage和编辑日志文件的合并,使得编辑日志文件变得很大。另一方面大的编辑日志文件会使得下次的namenode启动变得更慢。第二namendoe定期和并fsimage和编辑日志文件,保持编辑日志文件在一个可控的大小,它通常运行在一个和主namenode不用的机器上,因为他要求和主namenode同样的存储顺序。
第二namenode通过两个配置参数来控制检查点进程:
1))dfs.namenode.checkpoint.period 两次连续检查间隔(默认:1小时)
2))dfs.namenode.checkpoint.txns 针对未检查的事物强制进行紧急的检查时间间隔(默认:1毫秒)
第二namenode在一个目录存储着最新的检查点,和主namenode目录一致,以便于检查点镜像在必要的情况下能被读到。
4)checkpoint node:NameNode persists its namespace using two files: fsimage, which is the latest checkpoint of the namespace and edits, a journal (log) of changes to the namespace since the checkpoint. When a NameNode starts up, it merges the fsimage and edits journal to provide an up-to-date view of the file system metadata. The NameNode then overwrites fsimage with the new HDFS state and begins a new edits journal.
The Checkpoint node periodically creates checkpoints of the namespace. It downloads fsimage and edits from the active NameNode, merges them locally, and uploads the new image back to the active NameNode. The Checkpoint node usually runs on a different machine than the NameNode since its memory requirements are on the same order as the NameNode. The Checkpoint node is started by bin/hdfs namenode -checkpoint on the node specified in the configuration file.
The location of the Checkpoint (or Backup) node and its accompanying web interface are configured via the dfs.namenode.backup.address and dfs.namenode.backup.http-address configuration variables.
The start of the checkpoint process on the Checkpoint node is controlled by two configuration parameters.
The Checkpoint node stores the latest checkpoint in a directory that is structured the same as the NameNode's directory. This allows the checkpointed image to be always available for reading by the NameNode if necessary. See Import checkpoint.
Multiple checkpoint nodes may be specified in the cluster configuration file.
检查点namenode在一个目录存储着最新的检查点,和namanode目录相同,这保证namenode在必要的时候使用检查过的镜像。(和namenode运行在不同的机器上)
5)backup node:
The Backup node provides the same checkpointing functionality as the Checkpoint node, as well as maintaining an in-memory, up-to-date copy of the file system namespace that is always synchronized with the active NameNode state. Along with accepting a journal stream of file system edits from the NameNode and persisting this to disk, the Backup node also applies those edits into its own copy of the namespace in memory, thus creating a backup of the namespace.
The Backup node does not need to download fsimage and edits files from the active NameNode in order to create a checkpoint, as would be required with a Checkpoint node or Secondary NameNode, since it already has an up-to-date state of the namespace state in memory. The Backup node checkpoint process is more efficient as it only needs to save the namespace into the local fsimage file and reset edits.
As the Backup node maintains a copy of the namespace in memory, its RAM requirements are the same as the NameNode.
The NameNode supports one Backup node at a time. No Checkpoint nodes may be registered if a Backup node is in use. Using multiple Backup nodes concurrently will be supported in the future.
The Backup node is configured in the same manner as the Checkpoint node. It is started with bin/hdfs namenode -backup.
The location of the Backup (or Checkpoint) node and its accompanying web interface are configured via the dfs.namenode.backup.address and dfs.namenode.backup.http-address configuration variables.
Use of a Backup node provides the option of running the NameNode with no persistent storage, delegating all responsibility for persisting the state of the namespace to the Backup node. To do this, start the NameNode with the -importCheckpoint option, along with specifying no persistent storage directories of type edits dfs.namenode.edits.dir for the NameNode configuration.
For a complete discussion of the motivation behind the creation of the Backup node and Checkpoint node, see HADOOP-4539. For command usage, see .
备份节点和检查点提供同样的功能,是基于内存的,保存着和active namenode节点同步的状态。现在namenode节点只支持一个backup节点,在以后的版本可能会支持并发的多个备份节点。
6)import checkpoint:
The latest checkpoint can be imported to the NameNode if all other copies of the image and the edits files are lost. In order to do that one should:
The NameNode will upload the checkpoint from the dfs.namenode.checkpoint.dir directory and then save it to the NameNode directory(s) set in dfs.namenode.name.dir. The NameNode will fail if a legal image is contained in dfs.namenode.name.dir. The NameNode verifies that the image in dfs.namenode.checkpoint.dir is consistent, but does not modify it in any way.
如果其他的image和编辑日志文件丢失了,最新的检查点可以被导入到namenode,想做到这一点需要一下配置:
1))配置dfs.namenode.name.dir指定到一个空目录
2))指定dfs.namenode.checkpoint.dir检查目录位置
3))使用-importCheckpoint启动namenode节点
7)Rebalancer:
HDFS data might not always be be placed uniformly across the DataNode. One common reason is addition of new DataNodes to an existing cluster. While placing new blocks (data for a file is stored as a series of blocks), NameNode considers various parameters before choosing the DataNodes to receive these blocks. Some of the considerations are:
Due to multiple competing considerations, data might not be uniformly placed across the DataNodes. HDFS provides a tool for administrators that analyzes block placement and rebalanaces data across the DataNode. A brief administrator's guide for rebalancer as a PDF is attached to HADOOP-1652
hdfs提供了均衡datanode数据的功能
8)rack awareness:Typically large Hadoop clusters are arranged in racks and network traffic between different nodes with in the same rack is much more desirable than network traffic across the racks. In addition NameNode tries to place replicas of block on multiple racks for improved fault tolerance. Hadoop lets the cluster administrators decide which rack a node belongs to through configuration variablenet.topology.script.file.name. When this script is configured, each node runs the script to determine its rack id. A default installation assumes all the nodes belong to the same rack. This feature and configuration is further described in PDF attached to HADOOP-692
考虑到网络IO,同一个块的copy放在同一个货架上更合理;把块copy放到不同的货架上降低故障数据丢失的危险,要在这之间建立一个平衡
9)safemode:During start up the NameNode loads the file system state from the fsimage and the edits log file. It then waits for DataNodes to report their blocks so that it does not prematurely start replicating the blocks though enough replicas already exist in the cluster. During this time NameNode stays in Safemode. Safemode for the NameNode is essentially a read-only mode for the HDFS cluster, where it does not allow any modifications to file system or blocks. Normally the NameNode leaves Safemode automatically after the DataNodes have reported that most file system blocks are available. If required, HDFS could be placed in Safemode explicitly using bin/hadoop dfsadmin -safemode command. NameNode front page shows whether Safemode is on or off. A more detailed description and configuration is maintained as JavaDoc for setSafeMode().
在namenode启动等待datanode汇报他们块信息这段时间,namenode处于安全模式(是只读的)。
10)fsck:HDFS supports the fsck command to check for various inconsistencies. It it is designed for reporting problems with various files, for example, missing blocks for a file or under-replicated blocks. Unlike a traditional fsck utility for native file systems, this command does not correct the errors it detects. Normally NameNode automatically corrects most of the recoverable failures. By default fsck ignores open files but provides an option to select all files during reporting. The HDFS fsck command is not a Hadoop shell command. It can be run as bin/hadoop fsck. For command usage, see . fsck can be run on the whole file system or on a subset of files.
检查各种矛盾的命令
11)fetchdt:HDFS supports the fetchdt command to fetch Delegation Token and store it in a file on the local system. This token can be later used to access secure server (NameNode for example) from a non secure client. Utility uses either RPC or HTTPS (over Kerberos) to get the token, and thus requires kerberos tickets to be present before the run (run kinit to get the tickets). The HDFS fetchdt command is not a Hadoop shell command. It can be run as bin/hadoop fetchdt DTfile. After you got the token you can run an HDFS command without having Kerberos tickets, by pointingHADOOP_TOKEN_FILE_LOCATION environmental variable to the delegation token file. For command usage, see command.
抓取授权的token并存储在一个本地文件中
12)recovery mode:Typically, you will configure multiple metadata storage locations. Then, if one storage location is corrupt, you can read the metadata from one of the other storage locations.
However, what can you do if the only storage locations available are corrupt? In this case, there is a special NameNode startup mode called Recovery mode that may allow you to recover most of your data.You can start the NameNode in recovery mode like so: namenode -recover
When in recovery mode, the NameNode will interactively prompt you at the command line about possible courses of action you can take to recover your data.If you don't want to be prompted, you can give the -force option. This option will force recovery mode to always select the first choice. Normally, this will be the most reasonable choice.Because Recovery mode can cause you to lose data, you should always back up your edit log and fsimage before using it.
一般情况下你可以配置多元数据存储地址,一个存储地址不能用,可以使用其他的的另一个。恢复模式,启动方式:namenode -recover,
13)upgrade and rollback:When Hadoop is upgraded on an existing cluster, as with any software upgrade, it is possible there are new bugs or incompatible changes that affect existing applications and were not discovered earlier. In any non-trivial HDFS installation, it is not an option to loose any data, let alone to restart HDFS from scratch. HDFS allows administrators to go back to earlier version of Hadoop and rollback the cluster to the state it was in before the upgrade. HDFS upgrade is described in more detail in Wiki page. HDFS can have one such backup at a time. Before upgrading, administrators need to remove existing backup using bin/hadoop dfsadmin -finalizeUpgrade command. The following briefly describes the typical upgrade procedure: