Official documentation: http://hadoop.apache.org/docs/r2.5.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine.
This impacted the total availability of the HDFS cluster in two major ways:
- In the case of an unplanned event such as a machine crash, the cluster would be unavailable until an operator restarted the NameNode.
- Planned maintenance events such as software or hardware upgrades on the NameNode machine would result in windows of cluster downtime.
The HDFS High Availability feature addresses the above problems by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby. This allows a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance.
In a typical HA cluster, two separate machines are configured as NameNodes. At any point in time, exactly one of the NameNodes is in an Active state, and the other is in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave, maintaining enough state to provide a fast failover if necessary.
In order for the Standby node to keep its state synchronized with the Active node, both nodes communicate with a group of separate daemons called "JournalNodes" (JNs). When any namespace modification is performed by the Active node, it durably logs a record of the modification to a majority of these JNs. The Standby node is capable of reading the edits from the JNs, and is constantly watching them for changes to the edit log. As the Standby node sees the edits, it applies them to its own namespace. In the event of a failover, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.
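The edit flow described above can be sketched as a toy in-memory model. This is only an illustration of "write to a majority, read what a majority holds", not Hadoop's implementation; the names JournalNode, log_edit, and read_committed_edits are hypothetical.

```python
class JournalNode:
    """Toy stand-in for a JournalNode: keeps edits in memory (hypothetical model)."""

    def __init__(self, up=True):
        self.up = up
        self.edits = {}  # txid -> edit record

    def write(self, txid, record):
        if not self.up:
            raise ConnectionError("JournalNode unreachable")
        self.edits[txid] = record


def log_edit(journal_nodes, txid, record):
    """Active side: an edit counts as durable once a majority of JNs acknowledge it."""
    acks = 0
    for jn in journal_nodes:
        try:
            jn.write(txid, record)
            acks += 1
        except ConnectionError:
            pass  # tolerate individual JN failures
    majority = len(journal_nodes) // 2 + 1
    if acks < majority:
        raise RuntimeError(f"edit {txid} reached only {acks} of {len(journal_nodes)} JNs")
    return acks


def read_committed_edits(journal_nodes, last_applied_txid):
    """Standby side: collect edits newer than last_applied_txid held by a majority."""
    majority = len(journal_nodes) // 2 + 1
    counts = {}  # txid -> (number of JNs holding it, record)
    for jn in journal_nodes:
        if not jn.up:
            continue
        for txid, record in jn.edits.items():
            if txid > last_applied_txid:
                holders, _ = counts.get(txid, (0, record))
                counts[txid] = (holders + 1, record)
    return [(txid, rec) for txid, (n, rec) in sorted(counts.items()) if n >= majority]


# Three JournalNodes, one of them down: writes still succeed on the other two.
jns = [JournalNode(), JournalNode(), JournalNode(up=False)]
log_edit(jns, 1, "mkdir /data")
log_edit(jns, 2, "create /data/file")
standby_namespace = [rec for _, rec in read_committed_edits(jns, last_applied_txid=0)]
```

With one of three JournalNodes down, both edits still reach a majority, and the standby side sees them in transaction order.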
In order to provide a fast failover, it is also necessary that the Standby node have up-to-date information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both.
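A minimal sketch of why reporting to both NameNodes keeps the Standby ready, assuming a toy model in which a block report is just a list of block IDs. ToyNameNode and send_block_report are hypothetical names, not Hadoop classes.

```python
class ToyNameNode:
    """Toy NameNode that only tracks which DataNodes hold which blocks."""

    def __init__(self, name):
        self.name = name
        self.block_locations = {}  # block ID -> set of DataNode names

    def receive_block_report(self, datanode, block_ids):
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)


def send_block_report(datanode, block_ids, namenodes):
    """A DataNode reports its blocks to every configured NameNode, not just the active one."""
    for nn in namenodes:
        nn.receive_block_report(datanode, block_ids)


active, standby = ToyNameNode("nn1"), ToyNameNode("nn2")
send_block_report("dn1", ["blk_1", "blk_2"], [active, standby])
send_block_report("dn2", ["blk_2"], [active, standby])
```

Because every report goes to both nodes, the standby's block map matches the active's and does not have to be rebuilt at failover time.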
It is vital for the correct operation of an HA cluster that only one of the NameNodes be Active at a time. Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called "split-brain scenario," the JournalNodes will only ever allow a single NameNode to be a writer at a time. During a failover, the NameNode which is to become active will simply take over the role of writing to the JournalNodes, which will effectively prevent the other NameNode from continuing in the Active state, allowing the new Active to safely proceed with failover.
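The single-writer guarantee can be illustrated with increasing writer epochs: whoever claims the newest epoch is the only accepted writer. The sketch below is a simplified model under that assumption, not the JournalNode protocol itself; FencedJournal and its methods are hypothetical names.

```python
class FencedJournal:
    """Toy JournalNode that accepts writes only from the newest writer epoch."""

    def __init__(self):
        self.promised_epoch = 0
        self.log = []

    def new_epoch(self, epoch):
        """A NameNode becoming active claims the writer role with a higher epoch."""
        if epoch <= self.promised_epoch:
            raise PermissionError(f"epoch {epoch} is not newer than {self.promised_epoch}")
        self.promised_epoch = epoch

    def write(self, epoch, record):
        """Reject any writer whose epoch has been superseded."""
        if epoch != self.promised_epoch:
            raise PermissionError(f"writer epoch {epoch} is fenced off")
        self.log.append(record)


journal = FencedJournal()
journal.new_epoch(1)               # nn1 becomes active
journal.write(1, "edit from nn1")
journal.new_epoch(2)               # failover: nn2 takes over the writer role
journal.write(2, "edit from nn2")
try:
    journal.write(1, "late edit from nn1")
    old_writer_rejected = False
except PermissionError:
    old_writer_rejected = True
```

Once nn2 has claimed the writer role, the late write from nn1 is refused, so nn1 can no longer behave as an Active node even if it has not noticed the failover.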
In order to deploy an HA cluster, you should prepare the following:
- NameNode machines - the machines on which you run the Active and Standby NameNodes should have equivalent hardware to each other, and equivalent hardware to what would be used in a non-HA cluster.
- JournalNode machines - the machines on which you run the JournalNodes. The JournalNode daemon is relatively lightweight, so these daemons may reasonably be collocated on machines with other Hadoop daemons, for example NameNodes, the JobTracker, or the YARN ResourceManager. Note: There must be at least 3 JournalNode daemons, since edit log modifications must be written to a majority of JNs. This will allow the system to tolerate the failure of a single machine. You may also run more than 3 JournalNodes, but in order to actually increase the number of failures the system can tolerate, you should run an odd number of JNs (i.e. 3, 5, 7, etc.). Note that when running with N JournalNodes, the system can tolerate at most (N - 1) / 2 failures and continue to function normally.
Note that, in an HA cluster, the Standby NameNode also performs checkpoints of the namespace state, and thus it is not necessary to run a Secondary NameNode, CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an error. This also allows one who is reconfiguring a non-HA-enabled HDFS cluster to be HA-enabled to reuse the hardware which they had previously dedicated to the Secondary NameNode.
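The JournalNode sizing rule above, at most (N - 1) / 2 tolerated failures, can be tabulated quickly to see why an even count buys nothing extra. The helper name tolerated_failures is hypothetical.

```python
def tolerated_failures(n_journal_nodes):
    """Failures survivable with N JournalNodes: floor((N - 1) / 2)."""
    if n_journal_nodes < 1:
        raise ValueError("need at least one JournalNode")
    return (n_journal_nodes - 1) // 2


for n in (3, 4, 5, 6, 7):
    print(n, "JournalNodes tolerate", tolerated_failures(n), "failure(s)")
```

Going from 3 to 4 nodes, or from 5 to 6, leaves the tolerated failure count unchanged, which is why odd counts are recommended.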
dfs.nameservices - the logical name for the nameservice
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
dfs.ha.namenodes.[nameservice ID] - the unique ID of each NameNode in the nameservice
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
Note: currently, at most two NameNodes may be configured per nameservice.
dfs.namenode.rpc-address.[nameservice ID].[name node ID] - the RPC address on which each NameNode listens
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>machine1.example.com:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>machine2.example.com:8020</value>
</property>
Note: if needed, you can also configure the "servicerpc-address" setting in the same way.
dfs.namenode.http-address.[nameservice ID].[name node ID] - the HTTP address on which each NameNode listens
<property>
<name>dfs.namenode.http-address.mycluster.nn1</name>
<value>machine1.example.com:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn2</name>
<value>machine2.example.com:50070</value>
</property>
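Note: if Hadoop's security features are enabled, you should also set the https-address for each NameNode.
dfs.namenode.shared.edits.dir - the URI identifying the group of JournalNodes through which the NameNodes write and read edits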
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/mycluster</value>
</property>
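The shape of the value above is a semicolon-separated list of host:port pairs followed by a journal ID. A small parser makes that structure explicit; this is a sketch, the function name parse_qjournal_uri is hypothetical, and falling back to port 8485 when no port is given is an assumption of this example.

```python
def parse_qjournal_uri(uri):
    """Split a qjournal:// URI into [(host, port), ...] and the journal ID."""
    prefix = "qjournal://"
    if not uri.startswith(prefix):
        raise ValueError("not a qjournal URI")
    authority, sep, journal_id = uri[len(prefix):].partition("/")
    if not sep or not journal_id:
        raise ValueError("missing journal ID")
    nodes = []
    for entry in authority.split(";"):
        host, _, port = entry.partition(":")
        if not host:
            raise ValueError("empty JournalNode host")
        # Assumption for this sketch: use 8485 when the port is omitted.
        nodes.append((host, int(port) if port else 8485))
    return nodes, journal_id


nodes, journal_id = parse_qjournal_uri(
    "qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/mycluster"
)
```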
dfs.client.failover.proxy.provider.[nameservice ID] - the Java class that HDFS clients use to locate the current Active NameNode
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
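Conceptually, a client-side failover provider tries the configured NameNodes in turn until one of them serves the request. The sketch below illustrates that idea only; it is not the code of ConfiguredFailoverProxyProvider, and StandbyError, call_with_failover, and the toy handlers are hypothetical.

```python
class StandbyError(Exception):
    """Raised by a toy NameNode that is not active (hypothetical)."""


def call_with_failover(namenodes, operation, max_attempts=4):
    """Try the configured NameNodes in turn until one serves the operation."""
    last_error = None
    for attempt in range(max_attempts):
        name, handler = namenodes[attempt % len(namenodes)]
        try:
            return name, handler(operation)
        except (StandbyError, ConnectionError) as err:
            last_error = err  # fail over to the next configured NameNode
    raise RuntimeError("no active NameNode found") from last_error


def nn1(operation):
    raise StandbyError("nn1 is standby")


def nn2(operation):
    return f"done: {operation}"


served_by, result = call_with_failover([("nn1", nn1), ("nn2", nn2)], "ls /")
```

The client addresses the logical nameservice and never needs to know in advance which of nn1 and nn2 is currently active.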
dfs.ha.fencing.methods - a list of scripts or Java classes used to fence the Active NameNode during a failover
sshfence - SSH to the Active NameNode and kill the process
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/exampleuser/.ssh/id_rsa</value>
</property>
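Or, with an optional username and port for the SSH connection and a connection timeout in milliseconds: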
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence([[username][:port]])</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>30000</value>
</property>
shell - run an arbitrary shell command to fence the Active NameNode
<property>
<name>dfs.ha.fencing.methods</name>
<value>shell(/path/to/my/script.sh arg1 arg2 ...)</value>
</property>
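Note: for sshfence, passwordless SSH login must be set up between the Active and Standby NameNode hosts.
dfs.journalnode.edits.dir - the local directory where the JournalNode daemon stores its state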
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/path/to/journal/node/local/data</value>
</property>
fs.defaultFS - the default path prefix used by the Hadoop FS client
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
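Since several of the property names above are derived from the nameservice ID and the NameNode IDs, a quick consistency check can catch typos before the daemons are started. This is a sketch only: check_ha_config is a hypothetical helper, it works on a plain dict rather than reading the XML files, and it treats every property shown above as required, which may be stricter than Hadoop itself.

```python
def check_ha_config(conf):
    """Return a list of problems found in a dict of HA-related properties."""
    problems = []
    nameservices = [ns for ns in conf.get("dfs.nameservices", "").split(",") if ns]
    if not nameservices:
        problems.append("dfs.nameservices is not set")
    for ns in nameservices:
        nn_ids = [i for i in conf.get(f"dfs.ha.namenodes.{ns}", "").split(",") if i]
        if len(nn_ids) != 2:
            problems.append(f"{ns}: expected 2 NameNode IDs, found {len(nn_ids)}")
        for nn in nn_ids:
            for key in ("rpc-address", "http-address"):
                name = f"dfs.namenode.{key}.{ns}.{nn}"
                if name not in conf:
                    problems.append(f"missing {name}")
        provider = f"dfs.client.failover.proxy.provider.{ns}"
        if provider not in conf:
            problems.append(f"missing {provider}")
    for name in ("dfs.namenode.shared.edits.dir", "dfs.ha.fencing.methods"):
        if name not in conf:
            problems.append(f"missing {name}")
    return problems


conf = {
    "dfs.nameservices": "mycluster",
    "dfs.ha.namenodes.mycluster": "nn1,nn2",
    "dfs.namenode.rpc-address.mycluster.nn1": "machine1.example.com:8020",
    "dfs.namenode.rpc-address.mycluster.nn2": "machine2.example.com:8020",
    "dfs.namenode.http-address.mycluster.nn1": "machine1.example.com:50070",
    "dfs.namenode.shared.edits.dir":
        "qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/mycluster",
    "dfs.ha.fencing.methods": "sshfence",
}
problems = check_ha_config(conf)
```

In this deliberately incomplete example, the check reports the missing http-address for nn2 and the missing failover proxy provider.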
After all of the necessary configuration options have been set, you must start the JournalNode daemons on the set of machines where they will run. This can be done by running the command "hadoop-daemon.sh start journalnode" and waiting for the daemon to start on each of the relevant machines.
Once the JournalNodes have been started, one must initially synchronize the two HA NameNodes' on-disk metadata.
- If you are setting up a fresh HDFS cluster, you should first run the format command (hdfs namenode -format) on one of the NameNodes.
- If you have already formatted the NameNode, or are converting a non-HA-enabled cluster to be HA-enabled, you should now copy over the contents of your NameNode metadata directories to the other, unformatted NameNode by running the command "hdfs namenode -bootstrapStandby" on the unformatted NameNode. Running this command will also ensure that the JournalNodes (as configured by dfs.namenode.shared.edits.dir) contain sufficient edits transactions to be able to start both NameNodes.
- If you are converting a non-HA NameNode to be HA, you should run the command "hdfs namenode -initializeSharedEdits", which will initialize the JournalNodes with the edits data from the local NameNode edits directories.
At this point you may start both of your HA NameNodes as you normally would start a NameNode.
You can visit each of the NameNodes' web pages separately by browsing to their configured HTTP addresses. You should notice that next to the configured address will be the HA state of the NameNode (either "standby" or "active"). Whenever an HA NameNode starts, it is initially in the Standby state.
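Note: the above is a translation of the official documentation, but hands-on verification showed that the steps as written have problems. I will reorganize the steps later based on actual practice.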
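One possible ordering of the commands above, written as a small planner. The ordering is an assumption of this sketch, not something the translated text states: it starts the first NameNode before running bootstrapStandby on the second, on the understanding that bootstrapStandby fetches metadata from a running NameNode. The host labels nn1 and nn2 and the function name bootstrap_plan are placeholders.

```python
def bootstrap_plan(fresh_cluster):
    """Return (host, command) steps for bringing up the two HA NameNodes."""
    steps = [("each JournalNode host", "hadoop-daemon.sh start journalnode")]
    if fresh_cluster:
        # New cluster: format one NameNode.
        steps.append(("nn1", "hdfs namenode -format"))
    else:
        # Converting a non-HA cluster: seed the JournalNodes from nn1's local edits.
        steps.append(("nn1", "hdfs namenode -initializeSharedEdits"))
    steps += [
        ("nn1", "hadoop-daemon.sh start namenode"),
        ("nn2", "hdfs namenode -bootstrapStandby"),
        ("nn2", "hadoop-daemon.sh start namenode"),
    ]
    return steps


for host, command in bootstrap_plan(fresh_cluster=True):
    print(f"{host}: {command}")
```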
Introduction
The above sections describe how to configure manual failover. In that mode, the system will not automatically trigger a failover from the active to the standby NameNode, even if the active node has failed. This section describes how to configure and deploy automatic failover.
Components
Automatic failover adds two new components to an HDFS deployment: a ZooKeeper quorum, and the ZKFailoverController process (abbreviated as ZKFC).
Apache ZooKeeper is a highly available service for maintaining small amounts of coordination data, notifying clients of changes in that data, and monitoring clients for failures. The implementation of automatic HDFS failover relies on ZooKeeper for the following things:
- Failure detection - each of the NameNode machines in the cluster maintains a persistent session in ZooKeeper. If the machine crashes, the ZooKeeper session will expire, notifying the other NameNode that a failover should be triggered.
- Active NameNode election - ZooKeeper provides a simple mechanism to exclusively elect a node as active. If the current active NameNode crashes, another node may take a special exclusive lock in ZooKeeper indicating that it should become the next active.
The ZKFailoverController (ZKFC) is a new component which is a ZooKeeper client which also monitors and manages the state of the NameNode. Each of the machines which runs a NameNode also runs a ZKFC, and that ZKFC is responsible for:
- Health monitoring - the ZKFC pings its local NameNode on a periodic basis with a health-check command. So long as the NameNode responds in a timely fashion with a healthy status, the ZKFC considers the node healthy. If the node has crashed, frozen, or otherwise entered an unhealthy state, the health monitor will mark it as unhealthy.
- ZooKeeper session management - when the local NameNode is healthy, the ZKFC holds a session open in ZooKeeper. If the local NameNode is active, it also holds a special "lock" znode. This lock uses ZooKeeper's support for "ephemeral" nodes; if the session expires, the lock node will be automatically deleted.
- ZooKeeper-based election - if the local NameNode is healthy, and the ZKFC sees that no other node currently holds the lock znode, it will itself try to acquire the lock. If it succeeds, then it has "won the election", and is responsible for running a failover to make its local NameNode active. The failover process is similar to the manual failover described above: first, the previous active is fenced if necessary, and then the local NameNode transitions to active state.
For more details on the design of automatic failover, refer to the design document attached to HDFS-2185 on the Apache HDFS JIRA.
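The three ZKFC responsibilities listed above can be tied together in a toy simulation. It is an in-memory model only: ToyZooKeeper is not the ZooKeeper client API, ToyZKFC is not the real ZKFailoverController, and fencing of the previous active is left out.

```python
class ToyZooKeeper:
    """In-memory stand-in for the ephemeral lock znode (hypothetical)."""

    def __init__(self):
        self.lock_owner = None

    def try_acquire(self, session):
        """Take the lock if it is free; report whether this session holds it."""
        if self.lock_owner is None:
            self.lock_owner = session
        return self.lock_owner == session

    def expire(self, session):
        """Session expiry deletes the ephemeral lock held by that session."""
        if self.lock_owner == session:
            self.lock_owner = None


class ToyZKFC:
    def __init__(self, name, zk):
        self.name, self.zk = name, zk
        self.state = "standby"

    def tick(self, namenode_healthy):
        """One monitoring round: health check, then session and election handling."""
        if not namenode_healthy:
            self.zk.expire(self.name)  # drop the session and with it the lock
            self.state = "unhealthy"
        elif self.zk.try_acquire(self.name):
            self.state = "active"      # won the election (fencing omitted here)
        else:
            self.state = "standby"
        return self.state


zk = ToyZooKeeper()
zkfc1, zkfc2 = ToyZKFC("nn1", zk), ToyZKFC("nn2", zk)
states = [
    (zkfc1.tick(True), zkfc2.tick(True)),   # nn1 takes the lock first
    (zkfc1.tick(False), zkfc2.tick(True)),  # nn1 fails, nn2 acquires the lock
]
```

In the first round nn1 becomes active and nn2 stays standby; once nn1 is reported unhealthy, its lock disappears and nn2 wins the next election.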
Deploying ZooKeeper
In a typical deployment, ZooKeeper daemons are configured to run on three or five nodes. Since ZooKeeper itself has light resource requirements, it is acceptable to collocate the ZooKeeper nodes on the same hardware as the HDFS NameNode and Standby Node. Many operators choose to deploy the third ZooKeeper process on the same node as the YARN ResourceManager. It is advisable to configure the ZooKeeper nodes to store their data on separate disk drives from the HDFS metadata for best performance and isolation.
The setup of ZooKeeper is out of scope for this document. We will assume that you have set up a ZooKeeper cluster running on three or more nodes, and have verified its correct operation by connecting using the ZK CLI.
Before you begin
Before you begin configuring automatic failover, you should shut down your cluster. It is not currently possible to transition from a manual failover setup to an automatic failover setup while the cluster is running.
Configuring automatic failover
Add the following to hdfs-site.xml:
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
Add the following to core-site.xml:
<property>
<name>ha.zookeeper.quorum</name>
<value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
Initializing HA state in ZooKeeper
After the configuration keys have been added, the next step is to initialize required state in ZooKeeper. You can do so by running the following command from one of the NameNode hosts.
$ hdfs zkfc -formatZK
Starting the cluster with start-dfs.sh
Since automatic failover has been enabled in the configuration, the start-dfs.sh script will now automatically start a ZKFC daemon on any machine that runs a NameNode. When the ZKFCs start, they will automatically select one of the NameNodes to become active.
Starting the cluster manually
If you manually manage the services on your cluster, you will need to manually start the zkfc daemon on each of the machines that runs a NameNode. You can start the daemon by running:
$ hadoop-daemon.sh start zkfc