HBase node failure: YouAreDeadException, Server REPORT rejected

During an HBase performance test, data had been loading all night; the next morning one node was found dead while everything else looked normal.

Checking the logs revealed the following problem:

12/01/04 09:45:39 FATAL regionserver.HRegionServer: ABORTING region server serverName=hadoop5.site,60020,1325663355680, load=(requests=983, regions=252, usedHeap=3085, maxHeap=4983): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing hadoop5.site,60020,1325663355680 as dead server
 org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing hadoop5.site,60020,1325663355680 as dead server
     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
     at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
     at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
     at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
     at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:735)
     at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:596)
     at java.lang.Thread.run(Thread.java:619)
 Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing hadoop5.site,60020,1325663355680 as dead server
     at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:204)
     at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:262)
     at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:669)
     at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
     at java.lang.reflect.Method.invoke(Method.java:597)
     at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
 
     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
     at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
     at $Proxy6.regionServerReport(Unknown Source)
     at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
     ... 2 more


Looking further up in the logs, you can see:

2012-01-04T09:42:27.829-0500: 24795.829: [GC 24795.829: [ParNew: 151317K->10586K(153344K), 0.5282750 secs] 4970251K->4832124K(5102976K) icms_dc=0 , 0.5284260 secs] [Times: user=3.29 sys=0.01, real=0.53 secs] 
2012-01-04T09:42:28.721-0500: 24796.721: [GC 24796.721: [ParNew (promotion failed): 146906K->140702K(153344K), 0.5622020 secs]24797.283: [CMS: 4824062K->3150755K(4949632K), 189.5658760 secs] 4968444K->3150755K(5102976K), [CMS Perm : 20156K->20153K(33704K)] icms_dc=0 , 190.1283170 secs] [Times: user=7.43 sys=0.96, real=190.14 secs] 

2012-01-04T09:45:38.852-0500: 24986.852: [GC [1 CMS-initial-mark: 3150755K(4949632K)] 3152726K(5102976K), 0.0015480 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
2012-01-04T09:45:38.853-0500: 24986.854: [CMS-concurrent-mark-start]
 
12/01/04 09:45:38 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 237682ms for sessionid 0x34a7b17bf80004, closing socket connection and attempting reconnect
 12/01/04 09:45:38 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 237682ms for sessionid 0x34a7b17bf80003, closing socket connection and attempting reconnect
 
12/01/04 09:45:38 WARN ipc.HBaseServer: IPC Server Responder, call getClosestRowBefore([B@166cb249, [B@3a2ce21f, [B@58b17f0f) from 192.9.200.164:34106: output error
 12/01/04 09:45:38 WARN ipc.HBaseServer: IPC Server Responder, call getClosestRowBefore([B@6816a498, [B@26902c8b, [B@435c6d74) from 192.9.200.238:38457: output error
 12/01/04 09:45:38 WARN ipc.HBaseServer: IPC Server Responder, call getClosestRowBefore([B@23b4b286, [B@2c348dba, [B@2e44c502) from 192.9.200.164:34106: output error
 12/01/04 09:45:38 WARN ipc.HBaseServer: PRI IPC Server handler 6 on 60020 caught: java.nio.channels.ClosedChannelException
     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
     at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
     at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
     at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)

Right before it died, the JVM went through a single GC pause of more than 190 seconds. During that pause the region server could not talk to ZooKeeper, its session timed out, and the cluster therefore treated the node as dead; when the region server finally reported in, the master rejected the report with YouAreDeadException and the server aborted.
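If pauses of this length cannot be ruled out entirely, one common mitigation is to give region servers a longer ZooKeeper session so that a single long pause does not immediately get them declared dead. Below is a minimal hbase-site.xml sketch; the 300000 ms value is only an illustration, and the ZooKeeper servers' own maxSessionTimeout caps whatever is requested here, so it may need to be raised as well:

    <!-- hbase-site.xml: let a region server survive a longer pause before its
         ZooKeeper session expires (value is illustrative, not a recommendation) -->
    <property>
      <name>zookeeper.session.timeout</name>
      <value>300000</value>
    </property>

The trade-off is that a genuinely failed region server will also take correspondingly longer to be detected and its regions longer to be reassigned.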

Next, the cause of this full GC: ParNew (promotion failed).

A promotion failure occurs when the survivor space is too small: surviving objects have to be promoted to the old generation, the old generation cannot accommodate them, and a full GC is triggered. There are two completely opposite ways to deal with it: enlarge the survivor space, or enlarge the old generation / effectively remove the survivor space.

Enlarging the survivor space means tuning -XX:SurvivorRatio, the ratio of the Eden size to a single survivor space; by default it is 32, i.e. Eden is 32 times the size of one survivor space. Note that there are two survivor spaces, so each one is really only 1/34 of the whole young generation. Lowering this value enlarges the survivor spaces, keeps objects in them longer, and reduces the number of objects promoted into the old generation. The opposite approach, removing the survivor space, pushes data that cannot be collected right away into the old generation as quickly as possible, so the old generation is collected more frequently and is less likely to blow up suddenly; this is done by setting -XX:SurvivorRatio to a very large value (for example 65536).
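To make the first option concrete, the young-generation flags can be passed to the region server JVM via hbase-env.sh. The sketch below assumes the flags go into HBASE_REGIONSERVER_OPTS, and every number in it (heap and young-generation sizes, SurvivorRatio, tenuring threshold, CMS trigger) is an illustrative assumption for a heap like the ~5 GB one in the log above, not a measured recommendation for this cluster:

    # hbase-env.sh -- illustrative sketch only; every number below is an assumption.
    # A smaller SurvivorRatio enlarges the survivor spaces so short-lived objects die
    # young instead of being promoted; a higher MaxTenuringThreshold keeps them in the
    # survivor spaces longer; starting CMS at 70% occupancy leaves the old generation
    # headroom so promotions are less likely to fail.
    export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
      -Xms5g -Xmx5g -Xmn256m \
      -XX:SurvivorRatio=4 -XX:MaxTenuringThreshold=15 \
      -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
      -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly"

Whichever direction is chosen, the effect should be verified against the GC log afterwards rather than assumed.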

There was also a system-level factor: unlike the other nodes, this machine additionally hosted an application that used about 2 GB of memory, which is why this node went down while the others did not.
