Error
Error reported by the CM HDFS management UI (CM was down, so this message could not be viewed in the UI; it was taken from the logs):
The health test result for HDFS_CANARY_HEALTH has become bad: Canary test failed to create parent directory for /opt/tmp/.cloudera_health_monitoring_canary_files.
Troubleshooting and resolution
(1) The CM node of the CDH cluster was down
[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# /etc/init.d/cloudera-scm-server status
cloudera-scm-server dead but pid file exists
[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# /usr/java/jdk1.8.0_111/bin/jps
20656 Main
20626 Main
25667 Jps
20630 EventCatcherService
20632 AlertPublisher
29995 Main
10619 -- process information unavailable
# The output below shows no listener on port 7180, which means CM did not start properly; one Main process is missing
[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# ss -nltup|grep 718*
tcp LISTEN 0 50 *:7184 *:* users:(("java",20630,233))
tcp LISTEN 0 50 *:7185 *:* users:(("java",20630,241))
tcp LISTEN 0 5 *:4433 *:* users:(("python2.6",17152,8))
tcp LISTEN 0 5 127.0.0.1:7190 *:* users:(("python2.6",17152,11))
tcp LISTEN 0 5 *:7191 *:* users:(("python2.6",17152,7))
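The port check above can be made stricter. The pattern `718*` is a regular expression meaning `71` followed by any number of `8`s, which is why the `*:4433` line (pid 17152) also appears in the output. A minimal sketch of an exact check follows; the helper name `port_listening` is hypothetical and not part of any Cloudera tooling.

```shell
# Hypothetical helper: reads `ss -nlt` style output on stdin and succeeds only
# if some LISTEN line has a local address ending in :PORT.
port_listening() {
  grep -Eq "LISTEN.*:$1[[:space:]]"
}

# Usage:
#   ss -nlt | port_listening 7180 || echo "nothing is listening on 7180"
```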
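# Our CDH metadata is stored in a MySQL database. With CM down, the other CDH components cannot be viewed through CM, so query the database to see which nodes this CDH cluster contains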
mysql> select * from hosts;
+---------+-------------------------+--------------------------------------+----------------------------+----------------+----------+--------+---------------------+-------------------+--------------------+------------+-----------+----------------------+-------------+-------------------+----------------+
| HOST_ID | OPTIMISTIC_LOCK_VERSION | HOST_IDENTIFIER                      | NAME                       | IP_ADDRESS     | RACK_ID  | STATUS | CONFIG_CONTAINER_ID | MAINTENANCE_COUNT | DECOMMISSION_COUNT | CLUSTER_ID | NUM_CORES | TOTAL_PHYS_MEM_BYTES | PUBLIC_NAME | PUBLIC_IP_ADDRESS | CLOUD_PROVIDER |
+---------+-------------------------+--------------------------------------+----------------------------+----------------+----------+--------+---------------------+-------------------+--------------------+------------+-----------+----------------------+-------------+-------------------+----------------+
|       1 |                      11 | 264b10bb-b488-4ee7-8fcd-3c68f7a8860a | ec6s-logshedcl58manager-01 | 10.177.101.146 | /default | NA     |                   1 |                 0 |                  0 |          5 |         2 |           8251195392 | NULL        | NULL              | NULL           |
|       2 |                      17 | b584457b-705d-4b1f-8000-df0e6da1838d | ec6s-logshedcl58dn-03      | 10.177.102.38  | /default | NA     |                   1 |                 0 |                  0 |          5 |         2 |           8251195392 | NULL        | NULL              | NULL           |
|       3 |                      16 | e28dabc1-c105-464e-8bf6-0bd0435ace9a | ec6s-logshedcl58dn-02      | 10.177.102.193 | /default | NA     |                   1 |                 0 |                  0 |          5 |         2 |           8251195392 | NULL        | NULL              | NULL           |
|       4 |                      17 | 994cf04e-2510-426a-8336-6e2d28a3001d | ec6s-logshedcl58nn-02      | 10.177.102.218 | /default | NA     |                   1 |                 0 |                  0 |          5 |         2 |           8251195392 | NULL        | NULL              | NULL           |
|       5 |                      16 | a9cab0d5-5e48-49a7-8fb0-e57a0bac16db | ec6s-logshedcl58nn-01      | 10.177.101.60  | /default | NA     |                   1 |                 0 |                  0 |          5 |         2 |           8251195392 | NULL        | NULL              | NULL           |
|       6 |                      16 | 60bf1721-d6db-4d72-9164-41d89f81e789 | ec6s-logshedcl58dn-01      | 10.177.101.64  | /default | NA     |                   1 |                 0 |                  0 |          5 |         2 |           8251195392 | NULL        | NULL              | NULL           |
+---------+-------------------------+--------------------------------------+----------------------------+----------------+----------+--------+---------------------+-------------------+--------------------+------------+-----------+----------------------+-------------+-------------------+----------------+
6 rows in set (0.00 sec)

mysql> select * from roles;
+---------+----------------------------------------------------------+---------+--------------------+-------------------+------------+---------------+-------------------+--------------------+-------------------------+----------------------+------------------+
| ROLE_ID | NAME                                                     | HOST_ID | ROLE_TYPE          | CONFIGURED_STATUS | SERVICE_ID | MERGED_KEYTAB | MAINTENANCE_COUNT | DECOMMISSION_COUNT | OPTIMISTIC_LOCK_VERSION | ROLE_CONFIG_GROUP_ID | HAS_EVER_STARTED |
+---------+----------------------------------------------------------+---------+--------------------+-------------------+------------+---------------+-------------------+--------------------+-------------------------+----------------------+------------------+
|      14 | mgmt-HOSTMONITOR-92f15c379891f3c8dbdbbcbe57db9067        |       1 | HOSTMONITOR        | RUNNING           |          4 | NULL          |                 0 |                  0 |                       6 |                   25 |                1 |
|      15 | mgmt-EVENTSERVER-92f15c379891f3c8dbdbbcbe57db9067        |       1 | EVENTSERVER        | RUNNING           |          4 | NULL          |                 0 |                  0 |                       6 |                   21 |                1 |
|      16 | mgmt-ACTIVITYMONITOR-92f15c379891f3c8dbdbbcbe57db9067    |       1 | ACTIVITYMONITOR    | RUNNING           |          4 | NULL          |                 0 |                  0 |                       6 |                   22 |                1 |
|      17 | mgmt-SERVICEMONITOR-92f15c379891f3c8dbdbbcbe57db9067     |       1 | SERVICEMONITOR     | RUNNING           |          4 | NULL          |                 0 |                  0 |                       6 |                   24 |                1 |
|      18 | mgmt-ALERTPUBLISHER-92f15c379891f3c8dbdbbcbe57db9067     |       1 | ALERTPUBLISHER     | RUNNING           |          4 | NULL          |                 0 |                  0 |                       6 |                   20 |                1 |
|      19 | zookeeper-SERVER-5779e83332b2c66cc02029a8ab2c3628        |       3 | SERVER             | RUNNING           |          5 | NULL          |                 0 |                  0 |                       9 |                   27 |                1 |
|      20 | zookeeper-SERVER-c103ed4dcdd93fc8bbaf467aa1c6d927        |       2 | SERVER             | RUNNING           |          5 | NULL          |                 0 |                  0 |                       9 |                   27 |                1 |
|      21 | zookeeper-SERVER-dc971e0a60f4e798e85e2ab9bd57a041        |       6 | SERVER             | RUNNING           |          5 | NULL          |                 0 |                  0 |                       9 |                   27 |                1 |
|      23 | hdfs-NAMENODE-ed39ed17d751bee1bd6ad84c0db46ca1           |       5 | NAMENODE           | RUNNING           |          6 | NULL          |                 0 |                  0 |                      22 |                   30 |                1 |
|      24 | hdfs-DATANODE-c103ed4dcdd93fc8bbaf467aa1c6d927           |       2 | DATANODE           | RUNNING           |          6 | NULL          |                 0 |                  0 |                      10 |                   28 |                1 |
|      25 | hdfs-DATANODE-5779e83332b2c66cc02029a8ab2c3628           |       3 | DATANODE           | RUNNING           |          6 | NULL          |                 0 |                  0 |                      10 |                   28 |                1 |
|      26 | hdfs-DATANODE-dc971e0a60f4e798e85e2ab9bd57a041           |       6 | DATANODE           | RUNNING           |          6 | NULL          |                 0 |                  0 |                      10 |                   28 |                1 |
|      27 | hdfs-NAMENODE-16c21945a5f07e23a510dd5e32caa6dd           |       4 | NAMENODE           | RUNNING           |          6 | NULL          |                 0 |                  0 |                       6 |                   30 |                1 |
|      28 | hdfs-FAILOVERCONTROLLER-ed39ed17d751bee1bd6ad84c0db46ca1 |       5 | FAILOVERCONTROLLER | RUNNING           |          6 | NULL          |                 0 |                  0 |                       4 |                   29 |                1 |
|      29 | hdfs-FAILOVERCONTROLLER-16c21945a5f07e23a510dd5e32caa6dd |       4 | FAILOVERCONTROLLER | RUNNING           |          6 | NULL          |                 0 |                  0 |                       2 |                   29 |                1 |
|      30 | hdfs-JOURNALNODE-c103ed4dcdd93fc8bbaf467aa1c6d927        |       2 | JOURNALNODE        | RUNNING           |          6 | NULL          |                 0 |                  0 |                       2 |                   34 |                1 |
|      31 | hdfs-JOURNALNODE-dc971e0a60f4e798e85e2ab9bd57a041        |       6 | JOURNALNODE        | RUNNING           |          6 | NULL          |                 0 |                  0 |                       2 |                   34 |                1 |
|      32 | hdfs-JOURNALNODE-5779e83332b2c66cc02029a8ab2c3628        |       3 | JOURNALNODE        | RUNNING           |          6 | NULL          |                 0 |                  0 |                       2 |                   34 |                1 |
|      36 | kafka-KAFKA_BROKER-c103ed4dcdd93fc8bbaf467aa1c6d927      |       2 | KAFKA_BROKER       | RUNNING           |          8 | NULL          |                 0 |                  0 |                       9 |                   40 |                1 |
|      37 | kafka-KAFKA_BROKER-ed39ed17d751bee1bd6ad84c0db46ca1      |       5 | KAFKA_BROKER       | RUNNING           |          8 | NULL          |                 0 |                  0 |                      10 |                   40 |                1 |
|      38 | kafka-KAFKA_BROKER-16c21945a5f07e23a510dd5e32caa6dd      |       4 | KAFKA_BROKER       | RUNNING           |          8 | NULL          |                 0 |                  0 |                      10 |                   40 |                1 |
+---------+----------------------------------------------------------+---------+--------------------+-------------------+------------+---------------+-------------------+--------------------+-------------------------+----------------------+------------------+
21 rows in set (0.00 sec)

mysql> select * from services;
+------------+-------------------------+-----------+--------------+------------+-------------------+-----------------------------+------------+
| SERVICE_ID | OPTIMISTIC_LOCK_VERSION | NAME      | SERVICE_TYPE | CLUSTER_ID | MAINTENANCE_COUNT | DISPLAY_NAME                | GENERATION |
+------------+-------------------------+-----------+--------------+------------+-------------------+-----------------------------+------------+
|          4 |                      14 | mgmt      | MGMT         |       NULL |                 0 | Cloudera Management Service |          1 |
|          5 |                       7 | zookeeper | ZOOKEEPER    |          5 |                 0 | ZooKeeper                   |          1 |
|          6 |                      23 | hdfs      | HDFS         |          5 |                 0 | HDFS                        |          1 |
|          8 |                      15 | kafka     | KAFKA        |          5 |                 0 | Kafka                       |          1 |
+------------+-------------------------+-----------+--------------+------------+-------------------+-----------------------------+------------+
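The three tables above can be combined into a single role-per-host view, which is easier to read than cross-referencing HOST_ID and SERVICE_ID by hand. The sketch below only prints the SQL; the table and column names are taken from the output above, while the function name `roles_by_host_sql` and the database name and user in the usage line are assumptions.

```shell
# Hypothetical helper: prints a query that joins roles to hosts and services.
roles_by_host_sql() {
  cat <<'SQL'
SELECT h.NAME AS host, h.IP_ADDRESS, s.NAME AS service,
       r.ROLE_TYPE, r.CONFIGURED_STATUS
FROM roles r
JOIN hosts h ON h.HOST_ID = r.HOST_ID
JOIN services s ON s.SERVICE_ID = r.SERVICE_ID
ORDER BY h.NAME, s.NAME, r.ROLE_TYPE;
SQL
}

# Usage (database name "scm" and user "scm" are assumptions; use your own):
#   roles_by_host_sql | mysql -u scm -p scm
```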
# Restart the cloudera-scm-server service
[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# /etc/init.d/cloudera-scm-server status
cloudera-scm-server dead but pid file exists
[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# /etc/init.d/cloudera-scm-server stop
cloudera-scm-server is already stopped
[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# cat /var/run/cloudera-scm-server.pid
10617
[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# ps -ef|grep 10617
root 28331 27755 0 19:02 pts/3 00:00:00 grep 10617
[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# kill 20656
[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# kill 20626
[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# kill 20630
[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# kill 29995
[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# kill 20632
[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# /etc/init.d/cloudera-scm-server start
[root@ec6s-logshedcl58manager-01 ~]# /etc/init.d/cloudera-scm-server status
cloudera-scm-server (pid 1378) is running...
# Started normally
[root@ec6s-logshedcl58manager-01 ~]# /usr/java/jdk1.8.0_111/bin/jps
1380 Main
2469 Main
2471 EventCatcherService
7272 Jps
2473 AlertPublisher
2475 Main
2462 Main
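The "dead but pid file exists" status seen earlier means the pid file still names a process (10617) that is no longer running, which the `ps -ef|grep 10617` step confirmed by hand. A minimal sketch of the same check as a function follows; the name `pidfile_is_stale` is hypothetical, and the check assumes it runs as root, since `kill -0` also fails for a live process owned by another user.

```shell
# Hypothetical helper: succeeds if the pid file exists but names no live process.
pidfile_is_stale() {
  pidfile="$1"
  [ -f "$pidfile" ] || return 1          # no pid file at all: not "stale"
  pid=$(cat "$pidfile")
  case "$pid" in
    ''|*[!0-9]*) return 0 ;;             # empty or non-numeric content counts as stale
  esac
  if kill -0 "$pid" 2>/dev/null; then    # signal 0 only tests for existence
    return 1
  fi
  return 0
}

# Usage:
#   pidfile_is_stale /var/run/cloudera-scm-server.pid && echo "stale pid file"
```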
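(2) The two NameNodes could not communicate with each other, but neither had gone down
Once CM is back up, the NameNodes can be managed through the graphical UI. The UI shows that the NameNodes cannot communicate with each other and cannot write edit logs to the JournalNodes.
Error in the log: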
Jul 18, 5:38:09.355 PM FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog
Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.177.101.64:8485, 10.177.102.193:8485, 10.177.102.38:8485], stream=QuorumOutputStream starting at txid 1338050))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
    at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
    at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
    at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:651)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:585)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2752)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2624)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:599)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.create(AuthorizationProviderProxyClientProtocol.java:112)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:401)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2141)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Unknown Source)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1783)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2135)
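The FATAL entry above names the three JournalNode endpoints the NameNode was waiting on, so a quick first step is to pull those addresses out of the log and test whether each one accepts a TCP connection. The sketch below is one way to do that under stated assumptions: both function names are hypothetical, port 8485 is the one shown in the log, and `probe_addr` relies on bash's `/dev/tcp` and the coreutils `timeout` command. A successful connect only shows the port is reachable right now; it does not rule out the intermittent latency that causes a 20000ms quorum timeout.

```shell
# Hypothetical helper: reads log text on stdin and prints each distinct
# ip:8485 endpoint it mentions, one per line.
journalnode_addrs() {
  grep -oE '[0-9]{1,3}(\.[0-9]{1,3}){3}:8485' | sort -u
}

# Hypothetical helper: succeeds if a TCP connection to host:port opens
# within 3 seconds (bash only, because of /dev/tcp).
probe_addr() {
  host=${1%:*}
  port=${1##*:}
  timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Usage (namenode.log is a placeholder for the real NameNode log path):
#   journalnode_addrs < namenode.log | while read -r a; do
#     if probe_addr "$a"; then echo "$a reachable"; else echo "$a unreachable"; fi
#   done
```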
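The log shows that the NameNode failed to write its edits to the journal and timed out. The company runs on AWS EC2, and network maintenance may have been under way at the time, making the instances' network unstable. If this timeout occurs, the default of 20s can be raised to 60s, for example: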
#vim /etc/hadoop/conf/hdfs-site.xml
Then restart the two NameNodes one at a time through the CM management console: http://10.177.101.146:7180
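The post does not show what was changed in hdfs-site.xml. The 20000ms in the log message matches the default of the HDFS property `dfs.qjournal.write-txn.timeout.ms`, so the sketch below assumes that is the setting being raised to 60000. The helper `get_hadoop_prop` is hypothetical; it simply reads a property back from a config file so the change can be verified. On a CM-managed cluster the running NameNode's configuration is probably generated by CM, so the value may also need to be set in CM's HDFS configuration rather than only in this file.

```shell
# Assumed shape of the entry in hdfs-site.xml (value in milliseconds):
#   <property>
#     <name>dfs.qjournal.write-txn.timeout.ms</name>
#     <value>60000</value>
#   </property>

# Hypothetical helper: print the <value> that follows a given <name> in a
# Hadoop *-site.xml file. Assumes <value> is on the same line as <name> or
# on a later line, as in the shape shown above.
get_hadoop_prop() {
  file="$1"
  name="$2"
  awk -v n="$name" '
    index($0, "<name>" n "</name>") { found = 1 }
    found && match($0, /<value>[^<]*<\/value>/) {
      # strip the 7-character <value> and 8-character </value> tags
      print substr($0, RSTART + 7, RLENGTH - 15)
      exit
    }
  ' "$file"
}

# Usage:
#   get_hadoop_prop /etc/hadoop/conf/hdfs-site.xml dfs.qjournal.write-txn.timeout.ms
```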