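Symptom: a short while after running start-hbase.sh, all of the HMaster and RegionServer processes die on their own.
Because both the HMaster and the HRegionServer processes died, I kept assuming the cause was something else, and I didn't have the patience to read the logs. I wasted a lot of time on trial and error before noticing, by accident, that two of my nodes could not resolve the hostname of another node (hadoop.lsd4.com) at all, which is what caused the problem. Here is the log: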
2017-03-13 09:18:41,194 INFO [regionserver/hadoop.lsd2.com/192.168.56.12:16020] zookeeper.ZooKeeper: Initiating client connection, connectString=hadoop.lsd1.com:2181,hadoop.lsd2.com:2181,hadoop.lsd3.com:2181,hadoop.lsd4.com:2181 sessionTimeout=90000 watcher=regionserver:160200x0, quorum=hadoop.lsd1.com:2181,hadoop.lsd2.com:2181,hadoop.lsd3.com:2181,hadoop.lsd4.com:2181, baseZNode=/hbase
2017-03-13 09:18:41,195 WARN [regionserver/hadoop.lsd2.com/192.168.56.12:16020] zookeeper.RecoverableZooKeeper: Unable to create ZooKeeper Connection
java.net.UnknownHostException: hadoop.lsd4.com
at java.net.InetAddress.getAllByName0(InetAddress.java:1259)
at java.net.InetAddress.getAllByName(InetAddress.java:1171)
at java.net.InetAddress.getAllByName(InetAddress.java:1105)
at org.apache.zookeeper.client.StaticHostProvider.
at org.apache.zookeeper.ZooKeeper.
at org.apache.zookeeper.ZooKeeper.
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.checkZk(RecoverableZooKeeper.java:141)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:178)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1236)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1225)
at org.apache.hadoop.hbase.regionserver.HRegionServer.deleteMyEphemeralNode(HRegionServer.java:1416)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1090)
at java.lang.Thread.run(Thread.java:745)
2017-03-13 09:18:42,206 INFO [regionserver/hadoop.lsd2.com/192.168.56.12:16020] zookeeper.ZooKeeper: Initiating client connection, connectString=hadoop.lsd1.com:2181,hadoop.lsd2.com:2181,hadoop.lsd3.com:2181,hadoop.lsd4.com:2181 sessionTimeout=90000 watcher=regionserver:160200x0, quorum=hadoop.lsd1.com:2181,hadoop.lsd2.com:2181,hadoop.lsd3.com:2181,hadoop.lsd4.com:2181, baseZNode=/hbase
2017-03-13 09:18:42,207 WARN [regionserver/hadoop.lsd2.com/192.168.56.12:16020] zookeeper.RecoverableZooKeeper: Unable to create ZooKeeper Connection
java.net.UnknownHostException: hadoop.lsd4.com
at java.net.InetAddress.getAllByName0(InetAddress.java:1259)
at java.net.InetAddress.getAllByName(InetAddress.java:1171)
at java.net.InetAddress.getAllByName(InetAddress.java:1105)
at org.apache.zookeeper.client.StaticHostProvider.
at org.apache.zookeeper.ZooKeeper.
at org.apache.zookeeper.ZooKeeper.
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.checkZk(RecoverableZooKeeper.java:141)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:178)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1236)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1225)
at org.apache.hadoop.hbase.regionserver.HRegionServer.deleteMyEphemeralNode(HRegionServer.java:1416)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1090)
at java.lang.Thread.run(Thread.java:745)
2017-03-13 09:18:44,207 INFO [regionserver/hadoop.lsd2.com/192.168.56.12:16020] zookeeper.ZooKeeper: Initiating client connection, connectString=hadoop.lsd1.com:2181,hadoop.lsd2.com:2181,hadoop.lsd3.com:2181,hadoop.lsd4.com:2181 sessionTimeout=90000 watcher=regionserver:160200x0, quorum=hadoop.lsd1.com:2181,hadoop.lsd2.com:2181,hadoop.lsd3.com:2181,hadoop.lsd4.com:2181, baseZNode=/hbase
2017-03-13 09:18:44,208 WARN [regionserver/hadoop.lsd2.com/192.168.56.12:16020] zookeeper.RecoverableZooKeeper: Unable to create ZooKeeper Connection
java.net.UnknownHostException: hadoop.lsd4.com
at java.net.InetAddress.getAllByName0(InetAddress.java:1259)
at java.net.InetAddress.getAllByName(InetAddress.java:1171)
at java.net.InetAddress.getAllByName(InetAddress.java:1105)
at org.apache.zookeeper.client.StaticHostProvider.
at org.apache.zookeeper.ZooKeeper.
at org.apache.zookeeper.ZooKeeper.
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.checkZk(RecoverableZooKeeper.java:141)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:178)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1236)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1225)
at org.apache.hadoop.hbase.regionserver.HRegionServer.deleteMyEphemeralNode(HRegionServer.java:1416)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1090)
at java.lang.Thread.run(Thread.java:745)
2017-03-13 09:18:48,208 INFO [regionserver/hadoop.lsd2.com/192.168.56.12:16020] zookeeper.ZooKeeper: Initiating client connection, connectString=hadoop.lsd1.com:2181,hadoop.lsd2.com:2181,hadoop.lsd3.com:2181,hadoop.lsd4.com:2181 sessionTimeout=90000 watcher=regionserver:160200x0, quorum=hadoop.lsd1.com:2181,hadoop.lsd2.com:2181,hadoop.lsd3.com:2181,hadoop.lsd4.com:2181, baseZNode=/hbase
2017-03-13 09:18:48,209 WARN [regionserver/hadoop.lsd2.com/192.168.56.12:16020] zookeeper.RecoverableZooKeeper: Unable to create ZooKeeper Connection
java.net.UnknownHostException: hadoop.lsd4.com: unknown error
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:907)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1302)
at java.net.InetAddress.getAllByName0(InetAddress.java:1255)
at java.net.InetAddress.getAllByName(InetAddress.java:1171)
at java.net.InetAddress.getAllByName(InetAddress.java:1105)
at org.apache.zookeeper.client.StaticHostProvider.
at org.apache.zookeeper.ZooKeeper.
at org.apache.zookeeper.ZooKeeper.
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.checkZk(RecoverableZooKeeper.java:141)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:178)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1236)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1225)
at org.apache.hadoop.hbase.regionserver.HRegionServer.deleteMyEphemeralNode(HRegionServer.java:1416)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1090)
at java.lang.Thread.run(Thread.java:745)
2017-03-13 09:18:56,210 INFO [regionserver/hadoop.lsd2.com/192.168.56.12:16020] zookeeper.ZooKeeper: Initiating client connection, connectString=hadoop.lsd1.com:2181,hadoop.lsd2.com:2181,hadoop.lsd3.com:2181,hadoop.lsd4.com:2181 sessionTimeout=90000 watcher=regionserver:160200x0,
2017-03-13 09:18:56,211 ERROR [regionserver/hadoop.lsd2.com/192.168.56.12:16020] zookeeper.RecoverableZooKeeper: ZooKeeper delete failed after 4 attempts
2017-03-13 09:18:56,212 WARN [regionserver/hadoop.lsd2.com/192.168.56.12:16020] regionserver.HRegionServer: Failed deleting my ephemeral node
org.apache.zookeeper.KeeperException$OperationTimeoutException: KeeperErrorCode = OperationTimeout
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.checkZk(RecoverableZooKeeper.java:144)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:178)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1236)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1225)
at org.apache.hadoop.hbase.regionserver.HRegionServer.deleteMyEphemeralNode(HRegionServer.java:1416)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1090)
at java.lang.Thread.run(Thread.java:745)
2017-03-13 09:18:56,213 INFO [regionserver/hadoop.lsd2.com/192.168.56.12:16020] regionserver.HRegionServer: stopping server hadoop.lsd2.com,16020,1489411066501; zookeeper connection closed.
2017-03-13 09:18:56,213 INFO [regionserver/hadoop.lsd2.com/192.168.56.12:16020] regionserver.HRegionServer: regionserver/hadoop.lsd2.com/192.168.56.12:16020 exiting
2017-03-13 09:18:56,225 ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting
java.lang.RuntimeException: HRegionServer Aborted
at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:68)
at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:87)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
at org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2677)
[root@hadoop logs]#
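Looking back, reading the log carefully would have made the problem easy to find. In my case, the /etc/hosts files on the two nodes hadoop.lsd2.com and hadoop.lsd3.com had no mapping for hadoop.lsd4.com (I probably deleted it during some earlier experiment and never restored it), so the hostname could not be resolved when the nodes tried to communicate. Fix: copy the /etc/hosts file from the node with the most complete set of hostname mappings to every node, make sure every hostname in the cluster resolves on every node, and then restart the cluster.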
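
To catch this kind of problem before starting the cluster, a quick check like the one below can be run on each node. This is only a minimal sketch, not part of HBase: the function names `quorum_hosts` and `unresolvable` are made up for illustration, and it assumes the quorum hosts can be read from the `connectString=` field of a ZooKeeper client log line like the ones above.

```python
import re
import socket


def quorum_hosts(log_line):
    """Extract host names from the connectString=... field of a log line.

    Hypothetical helper: returns an empty list when the field is absent.
    """
    match = re.search(r"connectString=(\S+)", log_line)
    if match is None:
        return []
    hosts = []
    for entry in match.group(1).split(","):
        # Drop the ":port" suffix; an entry without a port is kept as-is.
        host = entry.rsplit(":", 1)[0]
        if host:
            hosts.append(host)
    return hosts


def unresolvable(hosts, resolve=socket.gethostbyname):
    """Return the hosts that the local resolver cannot map to an address.

    The resolver is injectable so the function can be exercised without
    touching the real DNS or /etc/hosts.
    """
    failed = []
    for host in hosts:
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(host)
    return failed
```

Calling `unresolvable(quorum_hosts(line))` with one of the "Initiating client connection" lines on hadoop.lsd2.com or hadoop.lsd3.com should have listed hadoop.lsd4.com in this case; an empty list on every node means every quorum hostname resolves there.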