Big Data Series (2): HDFS (Hadoop Distributed File System), Part 1

HDFS Design

HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

Where HDFS Is Not a Good Fit

Low-latency data access
HDFS is optimized for delivering a high throughput of data, and this may come at the expense of latency.
Lots of small files
Because the namenode holds filesystem metadata in memory, the number of files a filesystem can hold is limited by the amount of memory on the namenode. As a rule of thumb, each file, directory, and block takes about 150 bytes. Storing millions of files is feasible, but billions is beyond the capability of current hardware.
Multiple writers, arbitrary file modifications
Files in HDFS may be written to by a single writer only. Writes are always made at the end of the file; there is no support for multiple writers or for modifications at arbitrary offsets in the file.
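
To make the small-files limit concrete, here is a rough sketch that applies the 150-bytes-per-object rule of thumb. The function names and the one-block-per-file default are assumptions made for illustration; real namenode heap usage depends on the Hadoop version and configuration.

```python
# Rule of thumb from the text: about 150 bytes of namenode memory per
# file, directory, or block. Everything else here is an illustrative assumption.
BYTES_PER_OBJECT = 150


def namenode_memory_bytes(num_files: int, blocks_per_file: int = 1,
                          num_dirs: int = 0) -> int:
    """Roughly estimate namenode memory needed for a namespace."""
    # One object per file, one per block, one per directory.
    objects = num_files + num_files * blocks_per_file + num_dirs
    return objects * BYTES_PER_OBJECT


def human(n: float) -> str:
    """Format a byte count using decimal units."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n < 1000:
            return f"{n:.1f} {unit}"
        n /= 1000
    return f"{n:.1f} PB"


for files in (1_000_000, 1_000_000_000):
    print(f"{files:>13,} files -> {human(namenode_memory_bytes(files))}")
```

Under these assumptions, a million single-block files need a few hundred megabytes, while a billion need a few hundred gigabytes of namenode memory.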

HDFS Concepts

Blocks

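A disk has a block size, which is the minimum amount of data that it can read or write. HDFS has blocks too, but they are a much larger unit: 64 MB by default (the Hadoop 1.x default; Hadoop 2.x and later default to 128 MB). Files in HDFS are broken into block-sized chunks, which are stored as independent units.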
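
The chunking described above can be sketched as follows. This is a minimal illustration that only computes offsets and lengths; the function name is hypothetical and is not part of any HDFS API.

```python
from typing import List, Tuple

MB = 1024 * 1024


def split_into_blocks(file_size: int,
                      block_size: int = 64 * MB) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs covering a file of file_size bytes."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    blocks = []
    offset = 0
    while offset < file_size:
        # The final chunk may be shorter than a full block.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


# A 200 MB file with 64 MB blocks: three full chunks and one 8 MB chunk.
for offset, length in split_into_blocks(200 * MB):
    print(f"offset={offset // MB:>3} MB  length={length // MB} MB")
```

Note that the last chunk is only as long as the remaining data.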

Why Is a Block in HDFS So Large?

HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks.
By making a block large enough, the time to transfer the data from the disk can be made significantly longer than the time to seek to the start of the block. The time to transfer a large file made of multiple blocks is then governed by the disk transfer rate.
For example, if the seek time is around 10 ms and the transfer rate is 100 MB/s, then to make the seek time 1% of the transfer time, the block size needs to be around 100 MB.
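
The arithmetic behind that figure can be checked with a few lines. This is only a back-of-the-envelope sketch; the function and parameter names are made up for illustration.

```python
def block_size_mb(seek_time_s: float, transfer_rate_mb_s: float,
                  seek_fraction: float) -> float:
    """Block size (MB) at which seek time is seek_fraction of transfer time."""
    # seek_time = seek_fraction * transfer_time, so solve for transfer_time.
    transfer_time_s = seek_time_s / seek_fraction
    # The block is whatever the disk can transfer in that time.
    return transfer_rate_mb_s * transfer_time_s


# Numbers from the text: 10 ms seek, 100 MB/s transfer, seek = 1% of transfer.
print(block_size_mb(seek_time_s=0.010, transfer_rate_mb_s=100.0,
                    seek_fraction=0.01))
```

With the numbers from the text, the transfer must take about one second, which at 100 MB/s corresponds to a block of about 100 MB.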

Advantages

Nothing requires the blocks of a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster.
Making the unit of abstraction a block rather than a file simplifies the storage subsystem:
- Because blocks are a fixed size, it is easy to calculate how many can be stored on a given disk, which simplifies storage management.
- Blocks are just chunks of data to be stored; file metadata such as permissions information does not need to be stored with the blocks, so another system can handle metadata separately.
Blocks fit well with replication for providing fault tolerance and availability. Each block is typically replicated three times.
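
As a toy illustration of why block-level replication helps, the sketch below spreads each block's replicas over distinct datanodes and checks that every block stays readable when one node fails. The round-robin placement is an assumption chosen for simplicity; it is not HDFS's actual placement policy, and all names are hypothetical.

```python
from typing import Dict, List


def place_replicas(num_blocks: int, datanodes: List[str],
                   replication: int = 3) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct datanodes (round-robin toy)."""
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        # Consecutive indices modulo the node count are distinct
        # because replication <= len(datanodes).
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)]
            for r in range(replication)
        ]
    return placement


def all_blocks_readable(placement: Dict[int, List[str]], failed: str) -> bool:
    """True if every block still has a replica on a node other than `failed`."""
    return all(any(dn != failed for dn in replicas)
               for replicas in placement.values())


nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_replicas(4, nodes)
print(placement)
print(all_blocks_readable(placement, failed="dn1"))
```

Because each block lives on several nodes, losing any single datanode leaves at least one copy of every block available.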

Namenode and datanode

An HDFS cluster has two types of node operating in a master-worker pattern: a namenode (the master) and a number of datanodes (the workers).

Namenode

The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log.
The namenode also knows the datanodes on which all the blocks for a given file are located. However, it does not store block locations persistently, since this information is reconstructed from the datanodes when the system starts.

Datanodes

Datanodes store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of the blocks that they are storing.
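
The division of state between the two node types can be sketched with a toy model: the namespace (path to block IDs) stands in for what is persisted, while the block-to-datanode map lives only in memory and is filled in from block reports. The class, method names, and block ID strings are all hypothetical and do not mirror Hadoop's real classes.

```python
from collections import defaultdict
from typing import Dict, List, Set


class ToyNamenode:
    """Toy model: persistent namespace plus in-memory block locations."""

    def __init__(self, namespace: Dict[str, List[str]]):
        # path -> ordered block IDs; stands in for the persisted
        # namespace image and edit log.
        self.namespace = namespace
        # block ID -> datanodes holding it; empty after every (re)start.
        self.block_locations: Dict[str, Set[str]] = defaultdict(set)

    def receive_block_report(self, datanode: str, block_ids: List[str]) -> None:
        """Replace what is known about `datanode` with its latest report."""
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, path: str) -> List[List[str]]:
        """For each block of `path`, list the datanodes known to hold it."""
        return [sorted(self.block_locations.get(block_id, set()))
                for block_id in self.namespace[path]]


nn = ToyNamenode({"/logs/a.txt": ["blk_1", "blk_2"]})
print(nn.locate("/logs/a.txt"))  # nothing known until datanodes report
nn.receive_block_report("dn1", ["blk_1"])
nn.receive_block_report("dn2", ["blk_1", "blk_2"])
print(nn.locate("/logs/a.txt"))
```

Right after start-up the toy namenode knows which blocks make up a file but not where they are; the locations appear only once the datanodes have reported in.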

HDFS Federation

The namenode keeps a reference to every file and block in the filesystem in memory, which means that on very large clusters with many files, memory becomes the limiting factor for scaling. HDFS Federation allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem namespace.

Under federation, each namenode manages a namespace volume, which is made up of the metadata for the namespace, and a block pool containing all the blocks for the files in the namespace.

Namespace volumes are independent of each other, which means namenodes do not communicate with one another.
Block pool storage is not partitioned, however, so datanodes register with each namenode in the cluster and store blocks from multiple block pools.
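
One way to picture the split namespace is a table that maps path prefixes to the namenode responsible for them. The sketch below is a hypothetical longest-prefix lookup; the table contents and the names `nn1` and `nn2` are assumptions for illustration, not a real Hadoop configuration or API.

```python
from typing import Dict


def resolve_namenode(path: str, mount_table: Dict[str, str]) -> str:
    """Return the namenode whose prefix is the longest match for `path`."""
    best = None
    for prefix in mount_table:
        # Match the prefix itself or anything below it, but not
        # siblings that merely share leading characters.
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(f"no namenode mounted for {path}")
    return mount_table[best]


# Hypothetical split: one namenode for /user, another for /share.
table = {"/user": "nn1", "/share": "nn2"}
print(resolve_namenode("/user/alice/data.csv", table))
print(resolve_namenode("/share/reports", table))
```

Each lookup touches only one namenode, which matches the point that namespace volumes are independent of each other.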

HDFS High-Availability

The combination of replicating namenode metadata on multiple filesystems and using the secondary namenode to create checkpoints protects against data loss, but it does not provide high availability of the filesystem. The namenode is still a single point of failure (SPOF).

To support HDFS High-Availability, there is a pair of namenodes in an active-standby configuration. In the event of the failure of the active namenode, the standby takes over its duties to continue servicing client requests without a significant interruption.
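
The active-standby idea can be reduced to a very small sketch. It deliberately leaves out everything that makes real failover hard, such as keeping the standby's state current and detecting the failure; the class and node names are hypothetical.

```python
from typing import Optional


class ToyHaPair:
    """Toy active-standby pair of namenodes (illustrative only)."""

    def __init__(self, active: str, standby: str):
        self.active: Optional[str] = active
        self.standby: Optional[str] = standby

    def fail_active(self) -> None:
        # The standby, if there is one, takes over; no standby remains.
        self.active, self.standby = self.standby, None

    def serve(self, request: str) -> str:
        if self.active is None:
            raise RuntimeError("no namenode available")
        return f"{self.active} handled {request}"


pair = ToyHaPair(active="nn1", standby="nn2")
print(pair.serve("ls /"))
pair.fail_active()
print(pair.serve("ls /"))  # the former standby now answers
```

From the client's point of view the same request keeps working after the failure, which is the property the pair is meant to provide.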