Hadoop HA: active/standby failover does not work

Contents

  • 1. Missing command
      • Solution to problem 1
  • 2. Number of JournalNodes

Reference: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html

1. Missing command

As far as is known, the fuser command is required when the dfs.ha.fencing.methods parameter in hdfs-site.xml is set to sshfence. This parameter accepts two kinds of values: sshfence or shell().

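It is also worth noting how failover proceeds when the value is sshfence:

Assume two nodes, A and B, each run a NameNode, and A is the active node.
During failover the NameNode process on A dies. B then connects to A over SSH and runs fuser to kill the NameNode process once more. The purpose is to prevent split-brain.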

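This process has three possible points of failure that can block failover:

  1. Node A is unreachable. In that case B keeps retrying the connection and reports no route to hosts from xxx. See "Solution to problem 1" below.
  2. Node A is reachable but has no fuser command, so the process cannot be killed. The fix is to install the package that provides it:
     yum install -y psmisc
  3. Node A is alive but B cannot log in to it directly. This is a passwordless SSH configuration problem: during failover, B must be able to connect to A without a password. Either set up key-based login from B to A, or point to the key with the dfs.ha.fencing.ssh.private-key-files parameter in hdfs-site.xml.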
    Example:
    <property>
        <name>dfs.ha.fencing.ssh.private-key-files</name>
        <value>/root/.ssh/id_dsa</value>
    </property>
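The second and third causes can be checked ahead of time on each NameNode host. Below is a minimal sketch in Python, not part of Hadoop: the function name check_sshfence_prereqs is hypothetical, and it only inspects the local host (whether fuser is on the PATH and whether the private key file is readable). It does not test the SSH connection itself.

```python
import os
import shutil


def check_sshfence_prereqs(key_file, which=shutil.which):
    """Return a list of local problems that would stop sshfence from working.

    key_file: the path set in dfs.ha.fencing.ssh.private-key-files.
    which:    executable lookup function (injectable so it can be faked in tests).
    """
    problems = []
    # sshfence runs fuser on the node being fenced, so it must be installed there.
    if which("fuser") is None:
        problems.append("fuser not found (install psmisc)")
    # The node doing the fencing needs a readable private key to log in.
    if not os.access(key_file, os.R_OK):
        problems.append("private key not readable: " + key_file)
    return problems


if __name__ == "__main__":
    # Hypothetical usage: the key path matches the example configuration above.
    for problem in check_sshfence_prereqs("/root/.ssh/id_dsa"):
        print(problem)
```

Since either NameNode may end up fencing the other, it makes sense to run the check on both hosts.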

Solution to problem 1

From the official documentation: The sshfence option SSHes to the target node and uses fuser to kill the process listening on the service’s TCP port. In order for this fencing option to work, it must be able to SSH to the target node without providing a passphrase. Thus, one must also configure the dfs.ha.fencing.ssh.private-key-files option, which is a comma-separated list of SSH private key files.

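With sshfence, failover cannot complete when the active node's host is down entirely: the SSH connection fails, so fencing never succeeds. The workaround is to switch to the shell method. Note that shell(/bin/true) always reports success without actually fencing anything.
Edit hdfs-site.xml as follows, or see: Hadoop HA high-availability deployment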

    <property>
        <name>dfs.ha.fencing.methods</name>
        <value>shell(/bin/true)</value>
    </property>

2. Number of JournalNodes

To prevent split-brain, JournalNodes also use a majority mechanism: the cluster can only run normally while more than half of all JournalNodes are alive.

From the official documentation: JournalNode machines - the machines on which you run the JournalNodes. The JournalNode daemon is relatively lightweight, so these daemons may reasonably be collocated on machines with other Hadoop daemons, for example NameNodes, the JobTracker, or the YARN ResourceManager. Note: There must be at least 3 JournalNode daemons, since edit log modifications must be written to a majority of JNs. This will allow the system to tolerate the failure of a single machine. You may also run more than 3 JournalNodes, but in order to actually increase the number of failures the system can tolerate, you should run an odd number of JNs, (i.e. 3, 5, 7, etc.). Note that when running with N JournalNodes, the system can tolerate at most (N - 1) / 2 failures and continue to function normally.

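So with only two JournalNodes failover cannot work either: once one of them fails, the remaining one is not a majority. At least three JournalNodes are needed. JournalNodes are peers, with no primary or standby role among them. The number of failures that can be tolerated is (N - 1) / 2, rounded down.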
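The arithmetic can be worked through with a short Python sketch. The function names majority and tolerated_failures are made up for illustration; the formula (N - 1) / 2 comes from the official documentation quoted above, with integer division used for the rounding.

```python
def majority(n):
    """Smallest number of JournalNodes that is more than half of n."""
    return n // 2 + 1


def tolerated_failures(n):
    """JournalNode failures the cluster can survive: (N - 1) / 2, rounded down."""
    return (n - 1) // 2


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(n, "JournalNodes: majority =", majority(n),
              ", tolerated failures =", tolerated_failures(n))
```

Two JournalNodes tolerate zero failures, and four tolerate no more than three do, which is why the documentation recommends an odd number.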
