Environment:
Hadoop 2.7.2 + HBase 1.2.2 + Zookeeper 3.4.10
11 servers, 1 master and 10 slaves; base spec per node: 128 GB RAM, 2 × 12-core CPUs (48 threads).
Each node runs HDFS (all 11), HBase (all 11), Zookeeper (all 11, partly sharing cluster resources), Yarn (all 11, running MR and Spark jobs), plus several multi-threaded business applications.
Problem description:
HBase reports errors frequently after startup, and the process then aborts:
RegionServer log:
java.io.EOFException: Premature EOF: no length prefix available
java.io.IOException: Broken pipe
[RS_OPEN_META-hd16:16020-0-MetaLogRoller] wal.ProtobufLogWriter: Failed to write trailer, non-fatal, continuing...
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.10.38:50010, 192.168.10.48:50010], original=[192.168.10.38:50010, 192.168.10.48:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting
java.lang.RuntimeException: HRegionServer Aborted
DataNode log:
java.io.EOFException: Premature EOF: no length prefix available
java.io.IOException: Broken pipe
java.io.InterruptedIOException: Interrupted while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/192.168.10.52:50010 remote=/192.168.10.48:48482]. 60000 millis timeout left.
java.io.IOException: Premature EOF from inputStream
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch
Analysis:
1. Socket timeouts. The first attempt was to increase the socket timeout settings (see the sketch below):
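The exact values used are not preserved in the post; a typical change of this kind, assuming the standard HDFS socket timeout properties in hdfs-site.xml, would look roughly like this (values are illustrative):
<!-- hdfs-site.xml: illustrative sketch, not the exact values used at the time -->
<property>
  <name>dfs.client.socket-timeout</name>
  <value>300000</value>   <!-- read timeout in ms; default is 60000, matching the "60000 millis timeout" in the logs -->
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>600000</value>   <!-- write timeout in ms; default is 480000 -->
</property>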
Restarting did not help; the errors continued.
2. HBase's JVM settings were poorly tuned, making Full GCs too frequent and too long, which in turn caused timeouts. After adjusting the JVM options in hbase-env.sh, Full GCs became noticeably less frequent and their duration dropped from tens of seconds to a few seconds:
export HBASE_MASTER_OPTS="-Xmx2000m -Xms2000m -Xmn750m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"
export HBASE_REGIONSERVER_OPTS="-Xmx12800m -Xms12800m -Xmn1000m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly"
Restarting did not help; the errors continued.
3. Increase the Zookeeper session timeout and make HBase restart, rather than abort, when its ZK session expires. The relevant property descriptions:
"HBase passes this to the zk quorum as suggested maximum time for a session. See http://hadoop.apache.org/zooke ... sions: 'The client sends a requested timeout, the server responds with the timeout that it can give the client. The current implementation requires that the timeout be a minimum of 2 times the tickTime (as set in the server configuration) and a maximum of 20 times the tickTime.' Set the zk tickTime with hbase.zookeeper.property.tickTime. In milliseconds."
"Zookeeper session expired will force regionserver exit. Enable this will make the regionserver restart."
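These descriptions match the zookeeper.session.timeout and hbase.regionserver.restart.on.zk.expire properties; assuming those are the two settings meant here, the hbase-site.xml entries would look roughly like this (the timeout value is illustrative, as the original value is not shown):
<!-- hbase-site.xml: sketch only -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>120000</value>   <!-- suggested session timeout in ms; must lie between 2x and 20x the zk tickTime -->
</property>
<property>
  <name>hbase.regionserver.restart.on.zk.expire</name>
  <value>true</value>     <!-- restart the regionserver on ZK session expiry instead of aborting -->
</property>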
After the restart the errors continued; the cluster just held out a little longer before failing, so the root cause was clearly not here.
4. After the two changes above we were stuck for a while. Further analysis showed that the cluster carries a heavy Yarn load: terabyte-scale MR and Spark jobs run routinely, driving disk I/O very high and causing responses to time out. We therefore adjusted the HDFS configuration (hdfs-site.xml), along the lines sketched below:
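The exact hdfs-site.xml changes are not reproduced here either; settings commonly raised when DataNodes respond slowly under heavy I/O include the RPC handler counts, for example (illustrative values only):
<!-- hdfs-site.xml: illustrative sketch of commonly tuned settings, not the original change -->
<property>
  <name>dfs.datanode.handler.count</name>
  <value>20</value>    <!-- default 10; server threads handling block requests on each DataNode -->
</property>
<property>
  <name>dfs.namenode.handler.count</name>
  <value>100</value>   <!-- default 10; RPC handler threads on the NameNode -->
</property>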
After the restart the errors still continued, again merely lasting a bit longer, so the root cause was not here either.
5. Finally we reasoned that the cluster load was simply too high, in particular the I/O pressure, so we tried increasing the DataNode's memory and connection count. Configuration as follows:
Note: I initially set this value to 16384 (some people online claim its valid range is [1-8192]); this is unverified, so to be safe I changed it to 8192. Anyone who can confirm the actual range is welcome to leave a comment.
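The connection-count setting itself is not shown in the text; assuming the note refers to dfs.datanode.max.transfer.threads (the successor of dfs.datanode.max.xcievers) in hdfs-site.xml, the change described would be:
<!-- hdfs-site.xml: assumed parameter; 8192 is the value the note settles on -->
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>8192</value>   <!-- default 4096; threads a DataNode may use for data transfer -->
</property>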
Increase the DataNode memory allocation (hadoop-env.sh):
export HADOOP_HEAPSIZE=16384   # previously 8192
After this restart HBase ran normally, and no nodes have dropped offline since.