Chapter 7: Building a 3-Node High-Availability Hadoop + ZooKeeper Cluster, and How It Works

1. Principles

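First, a word on what ZooKeeper does in a Hadoop cluster. When we studied pseudo-distributed Hadoop we did not use ZooKeeper, because a pseudo-distributed setup has a single NameNode; there is no Active/Standby pair, so nothing needs to be switched automatically. A real Hadoop cluster is different. For high availability, the cluster runs its NameNodes as an active/standby pair: one is Active, the other Standby. If the Active NameNode goes down, the Standby must take over immediately so that the cluster keeps serving clients. ZooKeeper is what makes this automatic switchover possible.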

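In Hadoop 1.0 a cluster has only one NameNode, and once it goes down the service stops, which is a serious flaw. Hadoop 2.0 addresses this by abstracting the NameNode into a NameService: one NameService contains two NameNodes, as shown in the figure below. With two NameNodes, something has to coordinate them, and that coordinator is ZooKeeper. ZooKeeper provides an election mechanism that ensures only one NameNode in a NameService is active at any time, which is why ZooKeeper matters so much in Hadoop 2.0.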

[Figure 1]

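You may wonder whether a Hadoop cluster only ever has two NameNodes. It does not have to: a cluster handles massive amounts of data every day, and with a single pair of NameNodes their memory would eventually be exhausted. NameServices can therefore be scaled horizontally (HDFS calls this federation): a cluster can have several NameServices, each with two NameNodes, named NameService1, NameService2, and so on. Because DataNodes can keep being added, NameServices can be added along with them (within reason, of course; a suitable number is enough), as shown in the figure below.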

[Figure 2]

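Now let's look at a diagram of how Hadoop high availability works. NN stands for NameNode, DN for DataNode, and ZK for ZooKeeper. The cluster has two NameNodes, one Active and one Standby. The NameNodes are controlled by ZooKeeper, but not directly: in between sits the FailoverController (the ZKFC process). Every machine that runs a NameNode also runs a ZKFC, which can send commands to its NameNode, such as a state-transition command. The ZKFC also monitors its NameNode. When it finds that the NameNode is down, it reports this to ZooKeeper. Because ZooKeeper data is replicated across the ensemble, the ZKFC on the other NameNode's machine learns of the failure from ZK and then tells the NameNode it controls to switch from Standby to Active.

In more detail: at the start both NameNodes are working normally. The Active NameNode continuously writes its edits to a shared storage medium (the green database-shaped object in the figure below), and the Standby NameNode continuously reads those edits from the shared medium and applies them on its own machine, so the Standby's metadata closely tracks the Active's. Each FailoverController keeps monitoring its NameNode and reporting its status to ZooKeeper. Once the Active NameNode goes down, its FailoverController can no longer reach it and reports the failure to ZooKeeper. The other FailoverController picks this up from ZK and sends the NameNode it monitors the command to switch from Standby to Active.

The edits can be stored on NFS (Network File System) or on JournalNodes; this course uses JournalNodes. DataNodes connect to the NameService and talk to both the Active and the Standby NameNode, so when the Active goes down they automatically start working with the new Active.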

[Figure 3]
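To make the switchover decision concrete, here is a toy model in shell. It is only a sketch of the logic described above, not Hadoop code: the function name next_active and the healthy/down strings are made up for illustration, and the real ZKFC implements this with a lock held in ZooKeeper plus fencing of the old Active.

```shell
#!/usr/bin/env bash
# Toy model of the failover decision (hypothetical helper, not part of Hadoop).
# Usage: next_active <current_lock_holder> <nn1_health> <nn2_health>
#   current_lock_holder: nn1, nn2, or an empty string if nobody holds the lock
#   health values:       healthy or down
next_active() {
  local holder=$1 nn1=$2 nn2=$3

  # A healthy lock holder stays Active; nothing changes.
  if [ "$holder" = nn1 ] && [ "$nn1" = healthy ]; then echo nn1; return; fi
  if [ "$holder" = nn2 ] && [ "$nn2" = healthy ]; then echo nn2; return; fi

  # Otherwise the lock is free and the first healthy NameNode takes over.
  if [ "$nn1" = healthy ]; then
    echo nn1
  elif [ "$nn2" = healthy ]; then
    echo nn2
  else
    echo none
  fi
}

next_active nn1 down healthy   # the Active (nn1) died, so nn2 is promoted
```

The point of the model is that the Standby is promoted only when the current holder is unhealthy, which is how at most one NameNode ends up Active.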

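That is plenty of theory; let's start building the Hadoop cluster. First, the cluster plan: we will deploy a cluster of three machines, and the plan below shows which software each machine installs and which processes it runs. DFSZKFailoverController is the FailoverController process described in the figure above. You may wonder why the NameNode and the ResourceManager are not placed on the same machine. They could be, but both are managers (the NameNode manages HDFS, the ResourceManager manages YARN) and both consume a lot of resources, so it is better to put them on different machines where they do not compete. NodeManager and DataNode, on the other hand, are best placed on the same machine: the NodeManager will run MapReduce tasks, tasks need data, reading data locally is fastest, and the DataNode is exactly where the data is stored. The JournalNodes store the shared edits.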

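Notes:
An HDFS nameservice in Hadoop 2.0 normally consists of two NameNodes, one active and one standby. The Active NameNode serves clients; the Standby NameNode does not, and only keeps its state in sync with the Active NameNode so that it can take over quickly on failure.
Hadoop 2.0 officially provides two HDFS HA solutions, one based on NFS and one based on QJM (Quorum Journal Manager). Here we use the simpler one, QJM. With QJM the active and standby NameNodes share metadata through a group of JournalNodes, and a write counts as successful once it has reached a majority of them. For that reason an odd number of JournalNodes is usually configured.
A ZooKeeper ensemble is also configured for ZKFC (DFSZKFailoverController) failover: when the Active NameNode dies, the Standby NameNode is automatically switched to Active.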
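The majority rule is easy to check with a little arithmetic. The two helper names below are my own; the sketch only computes how many JournalNodes form a majority and how many can be lost.

```shell
#!/usr/bin/env bash
# Smallest number of JournalNodes that forms a majority of n.
majority() { echo $(( $1 / 2 + 1 )); }

# Number of JournalNodes that can fail while a majority is still reachable.
tolerated_failures() { echo $(( ($1 - 1) / 2 )); }

for n in 3 4 5; do
  echo "$n JournalNodes: majority=$(majority "$n"), tolerated failures=$(tolerated_failures "$n")"
done
```

Going from 3 to 4 JournalNodes raises the majority from 2 to 3 but still tolerates only one failure, which is why odd counts are the usual choice.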


2. Building the cluster

1. Cluster plan:

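Hostname  IP              NameNode  DataNode  YARN                          ZooKeeper  JournalNode
ubuntu    192.168.72.131  yes       yes       NodeManager                   yes        yes
ubuntu2   192.168.72.132  yes       yes       NodeManager                   yes        yes
ubuntu3   192.168.72.133  no        yes       ResourceManager, NodeManager  yes        yes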




2. Prerequisites

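In earlier chapters we already installed ZooKeeper, single-node Hadoop, the JDK and so on, cloned the VM into three machines, and set each machine's IP address and hostname.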

3. An aside: I need to correct a configuration mistake from an earlier chapter. Switch to the root user and run vim /etc/hosts

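Change the file as follows, that is, comment out the line "127.0.1.1      ubuntu2". If it is left in, the NameNode web page of the virtual machine cannot be opened from a browser on Windows. Make the change on all three virtual machines: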

127.0.0.1       localhost
#127.0.1.1      ubuntu2


192.168.72.131  ubuntu
192.168.72.132  ubuntu2
192.168.72.133  ubuntu3
# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
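A quick way to confirm the change on each machine is to look for an uncommented 127.0.1.1 entry. The helper below is a sketch of mine (the name has_loopback_alias is not a standard tool); it takes the file path as an argument so it can be tried on a copy first.

```shell
#!/usr/bin/env bash
# Succeeds if the given hosts file still has an active (uncommented)
# 127.0.1.1 entry; fails if the entry is absent or commented out.
has_loopback_alias() {
  grep -Eq '^[[:space:]]*127\.0\.1\.1[[:space:]]' "$1"
}

# Usage on each of the three machines:
#   has_loopback_alias /etc/hosts && echo "127.0.1.1 is still mapped, comment it out"
```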

4. Configuration

There is not much to configure here, because most of the work was done in earlier chapters.

(1) First, edit core-site.xml:

xiaoye@ubuntu:~$ cd hadoop/etc/hadoop
xiaoye@ubuntu:~/hadoop/etc/hadoop$ vim core-site.xml 


       
       
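<configuration>
        <!-- Default file system: the logical name of the HDFS nameservice -->
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://ns</value>
        </property>

        <!-- ZooKeeper ensemble used for automatic failover -->
        <property>
                <name>ha.zookeeper.quorum</name>
                <value>ubuntu:2181,ubuntu2:2181,ubuntu3:2181</value>
        </property>

        <!-- Hadoop temporary directory -->
        <property>
                <name>hadoop.tmp.dir</name>
                <value>/home/xiaoye/hadoop/tmp</value>
        </property>

        <!-- Left over from the earlier single-node setup. fs.default.name is the
             deprecated name of fs.defaultFS, so this entry conflicts with
             hdfs://ns above and can probably be removed. -->
        <property>
                <name>fs.default.name</name>
                <value>hdfs://localhost:9000</value>
        </property>
</configuration>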


       




In this file you only need to make the hostnames match your own machines.
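Because ha.zookeeper.quorum is typed by hand, a repeated host is an easy slip to make. The small check below is a sketch; the function name quorum_duplicates is my own, and it simply prints any host:port entry that appears more than once.

```shell
#!/usr/bin/env bash
# Print every host:port entry that occurs more than once in a
# comma-separated ZooKeeper quorum string.
quorum_duplicates() {
  printf '%s\n' "$1" | tr ',' '\n' | sort | uniq -d
}

# A quorum with a repeated entry is reported:
quorum_duplicates 'ubuntu:2181,ubuntu2:2181,ubuntu2:2181'
```

An empty result means every entry is distinct.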

(2)xiaoye@ubuntu:~/hadoop/etc/hadoop$ vim hdfs-site.xml 


 
       
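<configuration>
        <!-- Name of the HDFS nameservice; must match core-site.xml -->
        <property>
                <name>dfs.nameservices</name>
                <value>ns</value>
        </property>

        <!-- The nameservice ns has two NameNodes, nn1 and nn2 -->
        <property>
                <name>dfs.ha.namenodes.ns</name>
                <value>nn1,nn2</value>
        </property>

        <!-- RPC address of nn1. The value is missing in the source text;
             ubuntu:9000 is filled in to match the DataNode log shown later. -->
        <property>
                <name>dfs.namenode.rpc-address.ns.nn1</name>
                <value>ubuntu:9000</value>
        </property>

        <!-- HTTP address of nn1 -->
        <property>
                <name>dfs.namenode.http-address.ns.nn1</name>
                <value>ubuntu:50070</value>
        </property>

        <!-- RPC address of nn2 -->
        <property>
                <name>dfs.namenode.rpc-address.ns.nn2</name>
                <value>ubuntu2:9000</value>
        </property>

        <!-- HTTP address of nn2 (the property name is missing in the source
             text and is restored by symmetry with nn1) -->
        <property>
                <name>dfs.namenode.http-address.ns.nn2</name>
                <value>ubuntu2:50070</value>
        </property>

        <!-- Where the NameNode edits are shared on the JournalNodes. The value
             is missing in the source text; the one below is an assumption
             built from the three JournalNode hosts, the nameservice ns, and
             the default JournalNode RPC port 8485. -->
        <property>
                <name>dfs.namenode.shared.edits.dir</name>
                <value>qjournal://ubuntu:8485;ubuntu2:8485;ubuntu3:8485/ns</value>
        </property>

        <!-- Local directory where each JournalNode stores its data -->
        <property>
                <name>dfs.journalnode.edits.dir</name>
                <value>/home/xiaoye/hadoop/journal</value>
        </property>

        <!-- Switch automatically when the Active NameNode fails -->
        <property>
                <name>dfs.ha.automatic-failover.enabled</name>
                <value>true</value>
        </property>

        <!-- Class that clients use to find the Active NameNode. The value is
             missing in the source text; the class below is the usual
             Hadoop 2.x choice and is an assumption. -->
        <property>
                <name>dfs.client.failover.proxy.provider.ns</name>
                <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
        </property>

        <!-- Fencing methods, one per line -->
        <property>
                <name>dfs.ha.fencing.methods</name>
                <value>
                        sshfence
                        shell(/bin/true)
                </value>
        </property>

        <!-- sshfence needs passwordless SSH; this is the private key it uses -->
        <property>
                <name>dfs.ha.fencing.ssh.private-key-files</name>
                <value>/home/xiaoye/.ssh/id_rsa</value>
        </property>

        <!-- DataNode block storage -->
        <property>
                <name>dfs.datanode.data.dir</name>
                <value>/home/xiaoye/hadoop/hadoop/data</value>
        </property>

        <!-- NameNode metadata storage -->
        <property>
                <name>dfs.namenode.name.dir</name>
                <value>/home/xiaoye/hadoop/hadoop/name</value>
        </property>

        <!-- Number of block replicas -->
        <property>
                <name>dfs.replication</name>
                <value>2</value>
        </property>

        <!-- Enable WebHDFS -->
        <property>
                <name>dfs.webhdfs.enabled</name>
                <value>true</value>
        </property>
</configuration>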
   

Each setting here comes with an explanation; just substitute your own hostnames.
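With QJM, dfs.namenode.shared.edits.dir takes a qjournal:// URI that lists the JournalNodes and ends with the journal ID. The helper below shows how such a URI is put together. The function name qjournal_uri is made up for this post, and port 8485 in the usage line is the JournalNode RPC default, which I am assuming has not been changed on this cluster.

```shell
#!/usr/bin/env bash
# Build a qjournal:// URI.
# Usage: qjournal_uri <journal_id> <port> <host>...
qjournal_uri() {
  local id=$1 port=$2
  shift 2
  local hosts="" h
  for h in "$@"; do
    # Entries are joined with ';'
    hosts="${hosts:+$hosts;}$h:$port"
  done
  printf 'qjournal://%s/%s\n' "$hosts" "$id"
}

qjournal_uri ns 8485 ubuntu ubuntu2 ubuntu3
```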

(3)xiaoye@ubuntu:~/hadoop/etc/hadoop$ vim mapred-site.xml





 
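<configuration>
        <!-- Run MapReduce on YARN -->
        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>
</configuration>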
  




               

(4)xiaoye@ubuntu:~/hadoop/etc/hadoop$ vim yarn-site.xml 





 
  
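<configuration>
        <!-- Auxiliary service the NodeManager runs for the MapReduce shuffle -->
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>

        <!-- Host that runs the ResourceManager -->
        <property>
                <name>yarn.resourcemanager.hostname</name>
                <value>ubuntu3</value>
        </property>
</configuration>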
  


That's all.

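Of course, this only configures one machine; the other two need identical configuration. You can edit them one by one, or delete the Hadoop installation directory on the other two machines and copy it over from ubuntu with scp. The Hadoop directory is fairly large, though, so the copy takes a while.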
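If only the configuration files changed, copying just the etc/hadoop directory is much quicker than copying the whole installation. Here is a sketch of that, assuming the same directory layout and user on every machine; sync_conf and the DRY_RUN switch are names I made up for the example.

```shell
#!/usr/bin/env bash
# Copy a configuration directory to the same location on each given host.
# With DRY_RUN set, the scp commands are printed instead of executed.
# Usage: sync_conf <conf_dir> <host>...
sync_conf() {
  local conf_dir=$1
  shift
  local host
  for host in "$@"; do
    ${DRY_RUN:+echo} scp -r "$conf_dir" "$host:$(dirname "$conf_dir")/"
  done
}

# Preview first, then run for real:
#   DRY_RUN=1 sync_conf /home/xiaoye/hadoop/etc/hadoop ubuntu2 ubuntu3
#   sync_conf /home/xiaoye/hadoop/etc/hadoop ubuntu2 ubuntu3
```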


With the cluster set up, start all the processes:

[Figure 4]

As the figure above shows, there are three machines. Begin with the first one, ubuntu, and start ZooKeeper on it.

xiaoye@ubuntu:~$ ./zookeeper/bin/zkServer.sh start
JMX enabled by default
Using config: /home/xiaoye/zookeeper/bin/../conf/zoo.cfg

Starting zookeeper ... STARTED

Run the same command on the other two machines:

xiaoye@ubuntu2:~$ ./zookeeper/bin/zkServer.sh start
JMX enabled by default
Using config: /home/xiaoye/zookeeper/bin/../conf/zoo.cfg

Starting zookeeper ... STARTED

xiaoye@ubuntu3:~$ ./zookeeper/bin/zkServer.sh start
JMX enabled by default
Using config: /home/xiaoye/zookeeper/bin/../conf/zoo.cfg

Starting zookeeper ... STARTED

After starting, check the processes to see whether ZooKeeper came up.

xiaoye@ubuntu:~$ jps
1492 Jps

1467 QuorumPeerMain

The other machines should look the same.

Then check the status:

xiaoye@ubuntu:~$ ./zookeeper/bin/zkServer.sh status
JMX enabled by default
Using config: /home/xiaoye/zookeeper/bin/../conf/zoo.cfg

Mode: follower

This shows which node is the leader and which are followers.

Do the same on the other two machines:

xiaoye@ubuntu2:~$  ./zookeeper/bin/zkServer.sh status
JMX enabled by default
Using config: /home/xiaoye/zookeeper/bin/../conf/zoo.cfg

Mode: leader

xiaoye@ubuntu3:~$  ./zookeeper/bin/zkServer.sh status
JMX enabled by default
Using config: /home/xiaoye/zookeeper/bin/../conf/zoo.cfg

Mode: follower
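A healthy three-node ensemble reports exactly one leader and two followers. To pull just the mode out of the status output, a small filter like this works; zk_mode is a name of my own choosing and the filter only reads the text shown above.

```shell
#!/usr/bin/env bash
# Read `zkServer.sh status` output on stdin and print only the mode
# (leader, follower, or standalone).
zk_mode() {
  awk -F': ' '/^Mode:/ { print $2 }'
}

# Usage on each machine:
#   ./zookeeper/bin/zkServer.sh status 2>&1 | zk_mode
```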

Next, run start-all.sh to start all the Hadoop processes, then run jps to see what started:

xiaoye@ubuntu:~$ ./hadoop/sbin/start-all.sh 
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
18/04/01 19:58:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [ubuntu ubuntu2]
ubuntu: Warning: Permanently added 'ubuntu,192.168.72.131' (ECDSA) to the list of known hosts.
ubuntu2: Warning: Permanently added 'ubuntu2,192.168.72.132' (ECDSA) to the list of known hosts.
ubuntu: starting namenode, logging to /home/xiaoye/hadoop/logs/hadoop-xiaoye-namenode-ubuntu.out
ubuntu2: starting namenode, logging to /home/xiaoye/hadoop/logs/hadoop-xiaoye-namenode-ubuntu2.out
ubuntu: Warning: Permanently added 'ubuntu,192.168.72.131' (ECDSA) to the list of known hosts.
ubuntu2: Warning: Permanently added 'ubuntu2,192.168.72.132' (ECDSA) to the list of known hosts.
ubuntu3: Warning: Permanently added 'ubuntu3,192.168.72.133' (ECDSA) to the list of known hosts.
ubuntu: starting datanode, logging to /home/xiaoye/hadoop/logs/hadoop-xiaoye-datanode-ubuntu.out
ubuntu2: starting datanode, logging to /home/xiaoye/hadoop/logs/hadoop-xiaoye-datanode-ubuntu2.out
ubuntu3: starting datanode, logging to /home/xiaoye/hadoop/logs/hadoop-xiaoye-datanode-ubuntu3.out
Starting journal nodes [ubuntu ubuntu2 ubuntu3]
ubuntu3: Warning: Permanently added 'ubuntu3,192.168.72.133' (ECDSA) to the list of known hosts.
ubuntu2: Warning: Permanently added 'ubuntu2,192.168.72.132' (ECDSA) to the list of known hosts.
ubuntu: Warning: Permanently added 'ubuntu,192.168.72.131' (ECDSA) to the list of known hosts.
ubuntu3: starting journalnode, logging to /home/xiaoye/hadoop/logs/hadoop-xiaoye-journalnode-ubuntu3.out
ubuntu2: starting journalnode, logging to /home/xiaoye/hadoop/logs/hadoop-xiaoye-journalnode-ubuntu2.out
ubuntu: starting journalnode, logging to /home/xiaoye/hadoop/logs/hadoop-xiaoye-journalnode-ubuntu.out
18/04/01 19:58:30 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting ZK Failover Controllers on NN hosts [ubuntu ubuntu2]
ubuntu: Warning: Permanently added 'ubuntu,192.168.72.131' (ECDSA) to the list of known hosts.
ubuntu2: Warning: Permanently added 'ubuntu2,192.168.72.132' (ECDSA) to the list of known hosts.
ubuntu: starting zkfc, logging to /home/xiaoye/hadoop/logs/hadoop-xiaoye-zkfc-ubuntu.out
ubuntu2: starting zkfc, logging to /home/xiaoye/hadoop/logs/hadoop-xiaoye-zkfc-ubuntu2.out
starting yarn daemons
starting resourcemanager, logging to /home/xiaoye/hadoop/logs/yarn-xiaoye-resourcemanager-ubuntu.out
ubuntu: Warning: Permanently added 'ubuntu,192.168.72.131' (ECDSA) to the list of known hosts.
ubuntu3: Warning: Permanently added 'ubuntu3,192.168.72.133' (ECDSA) to the list of known hosts.
ubuntu2: Warning: Permanently added 'ubuntu2,192.168.72.132' (ECDSA) to the list of known hosts.
ubuntu3: starting nodemanager, logging to /home/xiaoye/hadoop/logs/yarn-xiaoye-nodemanager-ubuntu3.out
ubuntu2: starting nodemanager, logging to /home/xiaoye/hadoop/logs/yarn-xiaoye-nodemanager-ubuntu2.out
ubuntu: starting nodemanager, logging to /home/xiaoye/hadoop/logs/yarn-xiaoye-nodemanager-ubuntu.out
xiaoye@ubuntu:~$ jps
2129 DFSZKFailoverController
1974 JournalNode
2378 NodeManager
1467 QuorumPeerMain
2524 Jps
1660 NameNode

xiaoye@ubuntu:~$ 

jps lists six entries, but there should be seven. On closer inspection, the DataNode failed to start.
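Spotting a missing daemon by eye gets tedious across three machines. The sketch below compares jps output with a list of expected process names and prints the ones that are absent; missing_daemons is a hypothetical helper, and the expected list is whatever you planned for that machine.

```shell
#!/usr/bin/env bash
# Read `jps` output on stdin and print each expected daemon that is not running.
# Usage: jps | missing_daemons <MainClass>...
missing_daemons() {
  local jps_output name
  jps_output=$(cat)
  for name in "$@"; do
    # jps prints "<pid> <MainClass>"; compare the second column exactly.
    if ! printf '%s\n' "$jps_output" |
         awk -v n="$name" '$2 == n { found = 1 } END { exit !found }'; then
      echo "$name"
    fi
  done
}

# Usage on ubuntu:
#   jps | missing_daemons NameNode DataNode JournalNode DFSZKFailoverController QuorumPeerMain NodeManager
```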

Check the log with xiaoye@ubuntu:~$ tail -200 hadoop/logs/hadoop-xiaoye-datanode-ubuntu.log. The error is:

2018-04-01 19:58:37,145 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool (Datanode Uuid unassigned) service to ubuntu/192.168.72.131:9000. Exiting. 
java.io.IOException: Incompatible clusterIDs in /home/xiaoye/hadoop/hadoop/data: namenode clusterID = CID-657e9540-2de9-43a2-bf91-199a4334b05a; datanode clusterID = CID-b824b399-e941-4982-a618-7453739d3d55
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:517)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:265)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:293)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1109)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1080)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:320)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:220)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:824)

        at java.lang.Thread.run(Thread.java:748)

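To fix it, edit the VERSION file under the data directory and change its clusterID to match the NameNode's clusterID (the one in the name directory):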

xiaoye@ubuntu:~$ vim hadoop/hadoop/data/current/VERSION 

#Sun Apr 01 18:33:44 PDT 2018
storageID=DS-b1750224-83b2-4da4-9c69-2d16e2f47185
clusterID=CID-657e9540-2de9-43a2-bf91-199a4334b05a
cTime=0
datanodeUuid=ae0efde3-3eab-4423-b69c-a9a8c6ca0fd8
storageType=DATA_NODE

layoutVersion=-56
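Before restarting, it is worth confirming that the two VERSION files now agree. A sketch, with helper names of my own (cluster_id, same_cluster) and the directory layout used in this post:

```shell
#!/usr/bin/env bash
# Print the clusterID stored in a VERSION file.
cluster_id() {
  sed -n 's/^clusterID=//p' "$1"
}

# Succeeds only if both VERSION files carry the same, non-empty clusterID.
same_cluster() {
  local a b
  a=$(cluster_id "$1")
  b=$(cluster_id "$2")
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage:
#   same_cluster ~/hadoop/hadoop/name/current/VERSION \
#                ~/hadoop/hadoop/data/current/VERSION && echo "clusterIDs match"
```

A mismatch like the one in the log typically appears after the NameNode has been formatted again while the DataNode directory still holds data from the previous format.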

Then start the DataNode again on its own:

xiaoye@ubuntu:~$ vim hadoop/hadoop/data/current/VERSION 
xiaoye@ubuntu:~$ ./hadoop/sbin/hadoop-daemons.sh start datanode
ubuntu: Warning: Permanently added 'ubuntu,192.168.72.131' (ECDSA) to the list of known hosts.
ubuntu2: Warning: Permanently added 'ubuntu2,192.168.72.132' (ECDSA) to the list of known hosts.
ubuntu: starting datanode, logging to /home/xiaoye/hadoop/logs/hadoop-xiaoye-datanode-ubuntu.out
ubuntu3: Warning: Permanently added 'ubuntu3,192.168.72.133' (ECDSA) to the list of known hosts.
ubuntu2: datanode running as process 1650. Stop it first.
ubuntu3: datanode running as process 1541. Stop it first.
xiaoye@ubuntu:~$ jps
2129 DFSZKFailoverController
1974 JournalNode
2713 DataNode
2378 NodeManager
1467 QuorumPeerMain
1660 NameNode

2781 Jps

Everything looks fine now.

Because this is now a cluster, running the start scripts on ubuntu also starts the daemons on the other two machines.

Check the processes on the other two machines.

xiaoye@ubuntu2:~$ jps
1650 DataNode
2002 NodeManager
1747 JournalNode
1894 DFSZKFailoverController
2200 Jps

1466 QuorumPeerMain

The NameNode on ubuntu2 did not start. The log shows this error:

There appears to be a gap in the edit log. We expected txid 1, but got txid

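I spent a long time searching for a fix. The cause is said to be damaged NameNode metadata that needs to be repaired. What finally worked was the following (answer Y, then c):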

xiaoye@ubuntu2:~/hadoop$ ./bin/hadoop namenode -recover        

You have selected Metadata Recovery mode.  This mode is intended to recover lost metadata on a corrupt filesystem.  Metadata recovery mode often permanently deletes data from your HDFS filesystem.  Please back up your edit log and fsimage before trying this!


Are you ready to proceed? (Y/N)
 (Y or N) y
18/04/01 20:47:00 INFO namenode.MetaRecoveryContext: starting recovery...
18/04/01 20:47:01 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/04/01 20:47:01 WARN common.Util: Path /home/xiaoye/hadoop/hadoop/name should be specified as a URI in configuration files. Please update hdfs configuration.
18/04/01 20:47:01 WARN common.Util: Path /home/xiaoye/hadoop/hadoop/name should be specified as a URI in configuration files. Please update hdfs configuration.
18/04/01 20:47:01 WARN namenode.FSNamesystem: Only one image storage directory (dfs.namenode.name.dir) configured. Beware of data loss due to lack of redundant storage directories!
18/04/01 20:47:01 WARN common.Util: Path /home/xiaoye/hadoop/hadoop/name should be specified as a URI in configuration files. Please update hdfs configuration.
18/04/01 20:47:01 WARN common.Util: Path /home/xiaoye/hadoop/hadoop/name should be specified as a URI in configuration files. Please update hdfs configuration.
18/04/01 20:47:01 INFO namenode.FSNamesystem: No KeyProvider found.
18/04/01 20:47:01 INFO namenode.FSNamesystem: fsLock is fair:true
18/04/01 20:47:01 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
18/04/01 20:47:01 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
18/04/01 20:47:01 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
18/04/01 20:47:01 INFO blockmanagement.BlockManager: The block deletion will start around 2018 Apr 01 20:47:01
18/04/01 20:47:01 INFO util.GSet: Computing capacity for map BlocksMap
18/04/01 20:47:01 INFO util.GSet: VM type       = 64-bit
18/04/01 20:47:01 INFO util.GSet: 2.0% max memory 966.7 MB = 19.3 MB
18/04/01 20:47:01 INFO util.GSet: capacity      = 2^21 = 2097152 entries
18/04/01 20:47:01 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
18/04/01 20:47:01 INFO blockmanagement.BlockManager: defaultReplication         = 2
18/04/01 20:47:01 INFO blockmanagement.BlockManager: maxReplication             = 512
18/04/01 20:47:01 INFO blockmanagement.BlockManager: minReplication             = 1
18/04/01 20:47:01 INFO blockmanagement.BlockManager: maxReplicationStreams      = 2
18/04/01 20:47:01 INFO blockmanagement.BlockManager: shouldCheckForEnoughRacks  = false
18/04/01 20:47:01 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
18/04/01 20:47:01 INFO blockmanagement.BlockManager: encryptDataTransfer        = false
18/04/01 20:47:01 INFO blockmanagement.BlockManager: maxNumBlocksToLog          = 1000
18/04/01 20:47:01 INFO namenode.FSNamesystem: fsOwner             = xiaoye (auth:SIMPLE)
18/04/01 20:47:01 INFO namenode.FSNamesystem: supergroup          = supergroup
18/04/01 20:47:01 INFO namenode.FSNamesystem: isPermissionEnabled = true
18/04/01 20:47:01 INFO namenode.FSNamesystem: Determined nameservice ID: ns
18/04/01 20:47:01 INFO namenode.FSNamesystem: HA Enabled: true
18/04/01 20:47:01 INFO namenode.FSNamesystem: Append Enabled: true
18/04/01 20:47:01 INFO util.GSet: Computing capacity for map INodeMap
18/04/01 20:47:01 INFO util.GSet: VM type       = 64-bit
18/04/01 20:47:01 INFO util.GSet: 1.0% max memory 966.7 MB = 9.7 MB
18/04/01 20:47:01 INFO util.GSet: capacity      = 2^20 = 1048576 entries
18/04/01 20:47:01 INFO namenode.NameNode: Caching file names occuring more than 10 times
18/04/01 20:47:01 INFO util.GSet: Computing capacity for map cachedBlocks
18/04/01 20:47:01 INFO util.GSet: VM type       = 64-bit
18/04/01 20:47:01 INFO util.GSet: 0.25% max memory 966.7 MB = 2.4 MB
18/04/01 20:47:01 INFO util.GSet: capacity      = 2^18 = 262144 entries
18/04/01 20:47:01 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
18/04/01 20:47:01 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
18/04/01 20:47:01 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension     = 30000
18/04/01 20:47:01 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
18/04/01 20:47:01 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
18/04/01 20:47:01 INFO util.GSet: Computing capacity for map NameNodeRetryCache
18/04/01 20:47:01 INFO util.GSet: VM type       = 64-bit
18/04/01 20:47:01 INFO util.GSet: 0.029999999329447746% max memory 966.7 MB = 297.0 KB
18/04/01 20:47:01 INFO util.GSet: capacity      = 2^15 = 32768 entries
18/04/01 20:47:01 INFO namenode.NNConf: ACLs enabled? false
18/04/01 20:47:01 INFO namenode.NNConf: XAttrs enabled? true
18/04/01 20:47:01 INFO namenode.NNConf: Maximum size of an xattr: 16384
18/04/01 20:47:01 INFO hdfs.StateChange: STATE* Safe mode is ON. 
It was turned on manually. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
18/04/01 20:47:01 INFO common.Storage: Lock on /home/xiaoye/hadoop/hadoop/name/in_use.lock acquired by nodename 3269@ubuntu2
18/04/01 20:47:02 WARN ssl.FileBasedKeyStoresFactory: The property 'ssl.client.truststore.location' has not been set, no TrustStore will be loaded
18/04/01 20:47:02 INFO namenode.FSImageFormatPBINode: Loading 1 INodes.
18/04/01 20:47:02 INFO namenode.FSImageFormatProtobuf: Loaded FSImage in 0 seconds.
18/04/01 20:47:02 INFO namenode.FSImage: Loaded image for txid 0 from /home/xiaoye/hadoop/hadoop/name/current/fsimage_0000000000000000000
18/04/01 20:47:02 INFO namenode.FSImage: Reading org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@6aaceffd expecting start txid #1
18/04/01 20:47:02 INFO namenode.FSImage: Start loading edits file http://ubuntu:8480/getJournal?jid=ns&segmentTxId=2&storageInfo=-59%3A695608861%3A0%3ACID-657e9540-2de9-43a2-bf91-199a4334b05a, http://ubuntu3:8480/getJournal?jid=ns&segmentTxId=2&storageInfo=-59%3A695608861%3A0%3ACID-657e9540-2de9-43a2-bf91-199a4334b05a
18/04/01 20:47:02 INFO namenode.EditLogInputStream: Fast-forwarding stream 'http://ubuntu:8480/getJournal?jid=ns&segmentTxId=2&storageInfo=-59%3A695608861%3A0%3ACID-657e9540-2de9-43a2-bf91-199a4334b05a, http://ubuntu3:8480/getJournal?jid=ns&segmentTxId=2&storageInfo=-59%3A695608861%3A0%3ACID-657e9540-2de9-43a2-bf91-199a4334b05a' to transaction ID 1
18/04/01 20:47:02 INFO namenode.EditLogInputStream: Fast-forwarding stream 'http://ubuntu:8480/getJournal?jid=ns&segmentTxId=2&storageInfo=-59%3A695608861%3A0%3ACID-657e9540-2de9-43a2-bf91-199a4334b05a' to transaction ID 1
18/04/01 20:47:03 ERROR namenode.MetaRecoveryContext: There appears to be a gap in the edit log.  We expected txid 1, but got txid 2.
18/04/01 20:47:03 INFO namenode.MetaRecoveryContext: 
Enter 'c' to continue, ignoring missing  transaction IDs
Enter 's' to stop reading the edit log here, abandoning any later edits
Enter 'q' to quit without saving
Enter 'a' to always select the first choice in the future without prompting. (c/s/q/a)


c
18/04/01 20:47:05 INFO namenode.MetaRecoveryContext: Continuing
18/04/01 20:47:05 INFO namenode.FSEditLogLoader: replaying edit log: 2/2 transactions completed. (100%)
18/04/01 20:47:05 INFO namenode.FSImage: Edits file http://ubuntu:8480/getJournal?jid=ns&segmentTxId=2&storageInfo=-59%3A695608861%3A0%3ACID-657e9540-2de9-43a2-bf91-199a4334b05a, http://ubuntu3:8480/getJournal?jid=ns&segmentTxId=2&storageInfo=-59%3A695608861%3A0%3ACID-657e9540-2de9-43a2-bf91-199a4334b05a of size 1048576 edits # 1 loaded in 2 seconds
18/04/01 20:47:05 INFO namenode.FSNamesystem: Need to save fs image? false (staleImage=false, haEnabled=true, isRollingUpgrade=false)
18/04/01 20:47:05 INFO namenode.NameCache: initialized with 0 entries 0 lookups
18/04/01 20:47:05 INFO namenode.FSNamesystem: Finished loading FSImage in 3914 msecs
18/04/01 20:47:05 INFO namenode.FSImage: Save namespace ...
18/04/01 20:47:05 INFO namenode.NNStorageRetentionManager: Going to retain 2 images with txid >= 0
18/04/01 20:47:05 INFO namenode.MetaRecoveryContext: RECOVERY COMPLETE
18/04/01 20:47:05 INFO namenode.FSNamesystem: Stopping services started for active state
18/04/01 20:47:05 INFO namenode.FSNamesystem: Stopping services started for standby state
18/04/01 20:47:05 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu2/192.168.72.132
************************************************************/
xiaoye@ubuntu2:~/hadoop$ ./sbin/hadoop-daemons.sh start namenode
xiaoye@ubuntu3's password: ubuntu2: starting namenode, logging to /home/xiaoye/hadoop/logs/hadoop-xiaoye-namenode-ubuntu2.out
ubuntu: namenode running as process 1660. Stop it first.


ubuntu3: Permission denied, please try again.
xiaoye@ubuntu3's password: 
ubuntu3: starting namenode, logging to /home/xiaoye/hadoop/logs/hadoop-xiaoye-namenode-ubuntu3.out


xiaoye@ubuntu2:~/hadoop$ jps
3489 Jps
1650 DataNode
1747 JournalNode
1894 DFSZKFailoverController
1466 QuorumPeerMain

3404 NameNode

ubuntu2 now shows six entries as well.

Now check ubuntu3:

xiaoye@ubuntu3:~$ jps
1618 JournalNode
1541 DataNode
1431 QuorumPeerMain

2171 Jps

Good. Next, start YARN resource management:

xiaoye@ubuntu3:~$ ./hadoop/sbin/start-yarn.sh 
starting yarn daemons
starting resourcemanager, logging to /home/xiaoye/hadoop/logs/yarn-xiaoye-resourcemanager-ubuntu3.out
ubuntu: Warning: Permanently added 'ubuntu,192.168.72.131' (ECDSA) to the list of known hosts.
ubuntu2: Warning: Permanently added 'ubuntu2,192.168.72.132' (ECDSA) to the list of known hosts.
ubuntu3: Warning: Permanently added 'ubuntu3,192.168.72.133' (ECDSA) to the list of known hosts.
ubuntu3: starting nodemanager, logging to /home/xiaoye/hadoop/logs/yarn-xiaoye-nodemanager-ubuntu3.out
ubuntu: starting nodemanager, logging to /home/xiaoye/hadoop/logs/yarn-xiaoye-nodemanager-ubuntu.out
ubuntu2: starting nodemanager, logging to /home/xiaoye/hadoop/logs/yarn-xiaoye-nodemanager-ubuntu2.out
xiaoye@ubuntu3:~$ jps
1618 JournalNode
2563 Jps
1541 DataNode
2229 ResourceManager
1431 QuorumPeerMain

2347 NodeManager

On the other two machines, jps shows that the NodeManager process has started as well.

Here I noticed something in the jps output on ubuntu2:

xiaoye@ubuntu2:~$ jps
1650 DataNode
1747 JournalNode
1466 QuorumPeerMain
3404 NameNode
4174 Jps

4014 NodeManager

The zkfc process is not running. To fix this, start zkfc on its own from the ubuntu machine:

xiaoye@ubuntu:~$ ./hadoop/sbin/hadoop-daemons.sh start zkfc
ubuntu2: Warning: Permanently added 'ubuntu2,192.168.72.132' (ECDSA) to the list of known hosts.
ubuntu3: Warning: Permanently added 'ubuntu3,192.168.72.133' (ECDSA) to the list of known hosts.
ubuntu: Warning: Permanently added 'ubuntu,192.168.72.131' (ECDSA) to the list of known hosts.
ubuntu: zkfc running as process 2129. Stop it first.
ubuntu2: starting zkfc, logging to /home/xiaoye/hadoop/logs/hadoop-xiaoye-zkfc-ubuntu2.out
ubuntu3: starting zkfc, logging to /home/xiaoye/hadoop/logs/hadoop-xiaoye-zkfc-ubuntu3.out

Checking ubuntu2 again shows the zkfc process. Good enough.

Check the state in the browser:

ubuntu

[Figure 5]

Next we will stop the active NameNode and see whether the other one automatically goes from standby to active.


ubuntu2

[Figure 6]

Stop the NameNode on ubuntu.

xiaoye@ubuntu:~$ ./hadoop/sbin/hadoop-daemon.sh stop  namenode

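This was awkward: I stopped the NameNode on ubuntu, but ubuntu2 did not become active. Don't give up, though. Stop all the Hadoop processes on all three machines (ZooKeeper can keep running), then start everything again, strictly in this order: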

(1) Start the JournalNodes on all three machines: xiaoye@ubuntu:~$ hadoop/sbin/hadoop-daemons.sh start journalnode

(2) Start the NameNode on ubuntu alone. Note that this command is hadoop-daemon.sh, without the s: xiaoye@ubuntu:~$ hadoop/sbin/hadoop-daemon.sh start namenode

(3) On ubuntu2, run: xiaoye@ubuntu2:~$ ./hadoop/bin/hdfs namenode -bootstrapStandby

(4) On ubuntu2, start the NameNode alone: xiaoye@ubuntu2:~$ ./hadoop/sbin/hadoop-daemon.sh start namenode

(5) From ubuntu, start all the DataNodes: xiaoye@ubuntu:~$ ./hadoop/sbin/hadoop-daemons.sh start datanode

(6) On ubuntu3, start YARN resource management: xiaoye@ubuntu3:~$ ./hadoop/sbin/start-yarn.sh

(7) From ubuntu, start all the zkfc processes: xiaoye@ubuntu:~$ ./hadoop/sbin/hadoop-daemons.sh start zkfc

Now look at the state of ubuntu and ubuntu2 in the browser.

I ran into trouble again here: the NameNode on ubuntu would not start. The log shows the same error as before:

There appears to be a gap in the edit log. We expected txid 1, but got txid

The fix is the same. Afterwards, restart the NameNode on ubuntu on its own.

In the browser:

ubuntu:

[Figure 7]

ubuntu2:

[Figure 8]

Then stop the NameNode on ubuntu2.

xiaoye@ubuntu2:~$ ./hadoop/sbin/hadoop-daemon.sh stop namenode  

stopping namenode

ubuntu automatically becomes active:

[Figure 9] [Figure 10]
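The same check can be made from the command line: hdfs haadmin -getServiceState nn1 prints active or standby for the given NameNode ID. The wrapper below polls such a command until it reports active. wait_for_active and POLL_INTERVAL are names I made up for this sketch; the command to run is passed in as arguments, so the wrapper itself does not depend on Hadoop.

```shell
#!/usr/bin/env bash
# Run a state command repeatedly until it prints "active".
# Usage: wait_for_active <max_tries> <command> [args...]
# Returns 0 once the command prints "active", 1 if it never does.
wait_for_active() {
  local tries=$1
  shift
  local i state
  for (( i = 0; i < tries; i++ )); do
    state=$("$@" 2>/dev/null)
    [ "$state" = active ] && return 0
    sleep "${POLL_INTERVAL:-2}"   # seconds between attempts
  done
  return 1
}

# After stopping the NameNode on ubuntu2 (nn2), wait for nn1 to take over:
#   wait_for_active 15 ./hadoop/bin/hdfs haadmin -getServiceState nn1 && echo "nn1 is active"
```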

4. Possible errors and how to fix them:

This includes what to do when YARN fails to start on node 3:

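1. Some readers like to watch the zookeeper.out startup log while starting ZooKeeper. If you see the errors below after starting only one node, don't worry: they just mean the node cannot connect to the other two, which have not been started yet, so these errors are expected. To find out whether ZooKeeper really started correctly, rely on ./zkServer.sh status.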

2018-03-29 00:26:14,583 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@542] - Notification: 1 (n.leader), 0x700000000 (n.zxid), 0x17 (n.round), LOOKING (n.state), 1 (n.sid), 0xf (n.peerEPoch), LOOKING (my state)

2018-03-29 00:26:14,640 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@542] - Notification: 2 (n.leader), 0x700000000 (n.zxid), 0x16 (n.round), LOOKING (n.state), 2 (n.sid), 0xf (n.peerEPoch), LOOKING (my state)
2018-03-29 00:26:17,654 [myid:1] - WARN  [WorkerSender[myid=1]:QuorumCnxManager@368] - Cannot open channel to 3 at election address ubuntu3/192.168.72.133:3888
java.net.NoRouteToHostException: No route to host (Host unreachable)
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:327)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:393)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:365)
        at java.lang.Thread.run(Thread.java:748)
2018-03-29 00:26:17,655 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@542] - Notification: 1 (n.leader), 0x700000000 (n.zxid), 0x17 (n.round), LOOKING (n.state), 1 (n.sid), 0xf (n.peerEPoch), LOOKING (my state)
2018-03-29 00:26:20,725 [myid:1] - WARN  [WorkerSender[myid=1]:QuorumCnxManager@368] - Cannot open channel to 3 at election address ubuntu3/192.168.72.133:3888
java.net.NoRouteToHostException: No route to host (Host unreachable)
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:327)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:393)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:365)
        at java.lang.Thread.run(Thread.java:748)
2018-03-29 00:26:20,856 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@774] - Notification time out: 6400
2018-03-29 00:26:20,857 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@542] - Notification: 1 (n.leader), 0x700000000 (n.zxid), 0x17 (n.round), LOOKING (n.state), 1 (n.sid), 0xf (n.peerEPoch), LOOKING (my state)
2018-03-29 00:26:23,797 [myid:1] - WARN  [WorkerSender[myid=1]:QuorumCnxManager@368] - Cannot open channel to 3 at election address ubuntu3/192.168.72.133:3888

java.net.NoRouteToHostException: No route to host (Host unreachable)

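With the configuration above there should be no errors. If some of the three machines come up and others do not, don't agonize over the cause: kill all the ZooKeeper processes, start them again one by one in order, and only check the state with the status command once all three have been started.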

2.

xiaoye@ubuntu3:~/hadoop$ ./sbin/start-yarn.sh 

starting yarn daemons
starting resourcemanager, logging to /home/xiaoye/hadoop/logs/yarn-xiaoye-resourcemanager-ubuntu3.out
The authenticity of host 'ubuntu2 (192.168.72.132)' can't be established.
ECDSA key fingerprint is SHA256:TSAQ5j2Yx7F2wunlVGW7lyVpbVEJZyovXIPevsObNX0.
Are you sure you want to continue connecting (yes/no)? ubuntu: starting nodemanager, logging to /home/xiaoye/hadoop/logs/yarn-xiaoye-nodemanager-ubuntu.out
ubuntu3: starting nodemanager, logging to /home/xiaoye/hadoop/logs/yarn-xiaoye-nodemanager-ubuntu3.out
ubuntu2: Host key verification failed.

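Fix: solutions to this error can be found online (I solved it with the third option):

There are three ways to handle it:
1. Delete the line named in the error message. In the example this advice was taken from, that means deleting line 7 of /home/cobyeah/.ssh/known_hosts.

2. Delete the whole /home/cobyeah/.ssh/known_hosts file.

3. Change the configuration in /etc/ssh/ssh_config so that the problem does not come up again. Note that this turns off host key verification, so it is only appropriate on a private test network:
StrictHostKeyChecking no
UserKnownHostsFile /dev/null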
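A narrower variant of the third option is to apply those two settings only to the cluster hosts, in the user's own ~/.ssh/config, instead of changing /etc/ssh/ssh_config for every destination. This is a sketch of mine rather than what I actually did on the cluster; the function name cluster_ssh_config is hypothetical.

```shell
#!/usr/bin/env bash
# Print an ssh_config block that relaxes host key checking
# only for the hosts given as arguments.
cluster_ssh_config() {
  printf 'Host %s\n' "$*"
  printf '    StrictHostKeyChecking no\n'
  printf '    UserKnownHostsFile /dev/null\n'
}

# Review the output, then append it on each machine:
#   cluster_ssh_config ubuntu ubuntu2 ubuntu3 >> ~/.ssh/config
cluster_ssh_config ubuntu ubuntu2 ubuntu3
```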


3. If ZooKeeper fails to start with the problem below, change the owner or the permissions of the files:

xiaoye@ubuntu3:~$ zkServer.sh start
JMX enabled by default
Using config: /home/xiaoye/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... /home/xiaoye/zookeeper/bin/zkServer.sh: line 126: ./zookeeper.out: Permission denied
STARTED
xiaoye@ubuntu3:~$ ls
apache-activemq-5.15.3  Downloads         Music      zookeeper
classes                 examples.desktop  Pictures   zookeeper.out
derby.log               hadoop            Public
Desktop                 hive              Templates
Documents               metastore_db      Videos
xiaoye@ubuntu3:~$ cd zookeeper/
xiaoye@ubuntu3:~/zookeeper$ chown -R xiaoye data/ 
xiaoye@ubuntu3:~/zookeeper$ 


4. Setting up passwordless SSH login

xiaoye@ubuntu:~$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/xiaoye/.ssh/id_rsa): 
/home/xiaoye/.ssh/id_rsa already exists.
Overwrite (y/n)? y
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/xiaoye/.ssh/id_rsa.
Your public key has been saved in /home/xiaoye/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:E36xHQ1ExDlgQ4WlwXmScOxQhA2uP37Uikf+skQxgxc xiaoye@ubuntu
The key's randomart image is:
+---[RSA 2048]----+
|        o@E%B.   |
|       .o+O==o   |
|        +oBo...  |
|       o o.B .   |
|      . S +..    |
|       . +o .    |
|        o=..     |
|       ..o*      |
|        .o.+.    |
+----[SHA256]-----+
xiaoye@ubuntu:~$ ssh-copy-id ubuntu
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/xiaoye/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
xiaoye@ubuntu's password: 


Number of key(s) added: 1


Now try logging into the machine, with:   "ssh 'ubuntu'"
and check to make sure that only the key(s) you wanted were added.


xiaoye@ubuntu:~$ ssh-copy-id ubuntu2
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/xiaoye/.ssh/id_rsa.pub"
The authenticity of host 'ubuntu2 (192.168.72.132)' can't be established.
ECDSA key fingerprint is SHA256:TSAQ5j2Yx7F2wunlVGW7lyVpbVEJZyovXIPevsObNX0.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
xiaoye@ubuntu2's password: 


Number of key(s) added: 1


Now try logging into the machine, with:   "ssh 'ubuntu2'"
and check to make sure that only the key(s) you wanted were added.


xiaoye@ubuntu:~$ ssh-copy-id ubuntu3
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/xiaoye/.ssh/id_rsa.pub"
The authenticity of host 'ubuntu3 (192.168.72.133)' can't be established.
ECDSA key fingerprint is SHA256:TSAQ5j2Yx7F2wunlVGW7lyVpbVEJZyovXIPevsObNX0.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
xiaoye@ubuntu3's password: 


Number of key(s) added: 1


Now try logging into the machine, with:   "ssh 'ubuntu3'"
and check to make sure that only the key(s) you wanted were added.
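Once the keys are copied, it helps to verify that every host really accepts a login without a password, since the start scripts and sshfence both depend on it. A sketch: unreachable_hosts is a made-up helper, and the SSH variable exists only so that the ssh command can be swapped out when trying the function.

```shell
#!/usr/bin/env bash
# Print each host that cannot be reached without a password prompt.
# Usage: unreachable_hosts <host>...
unreachable_hosts() {
  local h
  for h in "$@"; do
    # BatchMode=yes makes ssh fail instead of asking for a password.
    "${SSH:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "$h" true || echo "$h"
  done
}

# Usage, on each machine that needs to reach the others:
#   unreachable_hosts ubuntu ubuntu2 ubuntu3
```

No output means all hosts accepted the key.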


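5. Conclusion
Many errors can come up while installing Hadoop and testing the machines, and that takes patience. It took me three full days to write this post. I ran into a lot of problems, but all of them were solved. Searching online helps, but when something goes wrong the first step is to read the logs; that is where the root cause shows up. Search afterwards. You may also have to restart the machines or the cluster many times, waiting a long while each time. Stick with it; persistence pays off.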

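To summarize the problems you may run into: passwordless SSH login, a DataNode or NameNode that fails to start, wrong hostnames, no access from Windows, zkfc failing to start, and so on. All of these are covered in this post. I hope it helps.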

Thanks for reading.
