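During an HBase performance test, data was loaded overnight. In the morning I found that one node had gone down, while everything else was normal.
Checking the logs turned up the following problem: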
12/01/04 09:45:39 FATAL regionserver.HRegionServer: ABORTING region server serverName=hadoop5.site,60020,1325663355680, load=(requests=983, regions=252, usedHeap=3085, maxHeap=4983): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing hadoop5.site,60020,1325663355680 as dead server
org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing hadoop5.site,60020,1325663355680 as dead server
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:735)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:596)
    at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing hadoop5.site,60020,1325663355680 as dead server
    at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:204)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:262)
    at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:669)
    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
    at $Proxy6.regionServerReport(Unknown Source)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
    … 2 more
Looking further up in the log, you can see:
2012-01-04T09:42:27.829-0500: 24795.829: [GC 24795.829: [ParNew: 151317K->10586K(153344K), 0.5282750 secs] 4970251K->4832124K(5102976K) icms_dc=0 , 0.5284260 secs] [Times: user=3.29 sys=0.01, real=0.53 secs]
2012-01-04T09:42:28.721-0500: 24796.721: [GC 24796.721: [ParNew (promotion failed): 146906K->140702K(153344K), 0.5622020 secs]24797.283: [CMS: 4824062K->3150755K(4949632K), 189.5658760 secs] 4968444K->3150755K(5102976K), [CMS Perm : 20156K->20153K(33704K)] icms_dc=0 , 190.1283170 secs] [Times: user=7.43 sys=0.96, real=190.14 secs]
2012-01-04T09:45:38.852-0500: 24986.852: [GC [1 CMS-initial-mark: 3150755K(4949632K)] 3152726K(5102976K), 0.0015480 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2012-01-04T09:45:38.853-0500: 24986.854: [CMS-concurrent-mark-start]
12/01/04 09:45:38 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 237682ms for sessionid 0x34a7b17bf80004, closing socket connection and attempting reconnect
12/01/04 09:45:38 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 237682ms for sessionid 0x34a7b17bf80003, closing socket connection and attempting reconnect
12/01/04 09:45:38 WARN ipc.HBaseServer: IPC Server Responder, call getClosestRowBefore([B@166cb249, [B@3a2ce21f, [B@58b17f0f) from 192.9.200.164:34106: output error
12/01/04 09:45:38 WARN ipc.HBaseServer: IPC Server Responder, call getClosestRowBefore([B@6816a498, [B@26902c8b, [B@435c6d74) from 192.9.200.238:38457: output error
12/01/04 09:45:38 WARN ipc.HBaseServer: IPC Server Responder, call getClosestRowBefore([B@23b4b286, [B@2c348dba, [B@2e44c502) from 192.9.200.164:34106: output error
12/01/04 09:45:38 WARN ipc.HBaseServer: PRI IPC Server handler 6 on 60020 caught: java.nio.channels.ClosedChannelException
    at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
    at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
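Just before it went down, the region server ran a GC that lasted more than 190 s. During that pause it could not communicate with ZooKeeper, its session timed out, and the cluster treated the node as dead.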
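As a rough way to spot this kind of pause, the sketch below scans GC log text for the `real=... secs` field and reports pauses longer than a threshold. The helper name `long_pauses` and the 180 s threshold are assumptions for illustration; the real limit is whatever ZooKeeper session timeout your cluster is configured with.

```python
import re

# Matches the wall-clock pause reported at the end of a HotSpot GC log record.
REAL_SECS = re.compile(r"real=(\d+(?:\.\d+)?) secs")


def long_pauses(gc_log_text, threshold_secs):
    """Return every 'real=' pause (in seconds) that exceeds threshold_secs."""
    pauses = [float(m.group(1)) for m in REAL_SECS.finditer(gc_log_text)]
    return [p for p in pauses if p > threshold_secs]


# Two GC records copied from the log above.
sample = (
    "[GC 24795.829: [ParNew: 151317K->10586K(153344K), 0.5282750 secs] "
    "4970251K->4832124K(5102976K) icms_dc=0 , 0.5284260 secs] "
    "[Times: user=3.29 sys=0.01, real=0.53 secs] "
    "[GC 24796.721: [ParNew (promotion failed): 146906K->140702K(153344K), 0.5622020 secs]"
    "24797.283: [CMS: 4824062K->3150755K(4949632K), 189.5658760 secs] "
    "4968444K->3150755K(5102976K), [CMS Perm : 20156K->20153K(33704K)] "
    "icms_dc=0 , 190.1283170 secs] "
    "[Times: user=7.43 sys=0.96, real=190.14 secs]"
)

# Assumption: a 180 s session timeout; substitute your configured value.
ASSUMED_SESSION_TIMEOUT_SECS = 180.0

print(long_pauses(sample, ASSUMED_SESSION_TIMEOUT_SECS))  # [190.14]
```

Only the promotion-failed collection is flagged; the ordinary ParNew pause of 0.53 s is far below any plausible session timeout.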
Next, the cause of this full GC: ParNew (promotion failed).
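This happens when the Survivor space is too small, so objects are promoted to the old generation, and the old generation does not have enough room for them either, which triggers a full GC. There are two completely opposite approaches to the problem: enlarge the Survivor space (or the old generation), or remove the Survivor space altogether. Enlarging the Survivor space means adjusting the -XX:SurvivorRatio parameter, which is the ratio of the Eden size to the size of one Survivor space. The default is given here as 32, that is, Eden is 32 times the size of a Survivor space (the actual default depends on the JVM version and platform; on many HotSpot builds it is 8). Note that there are two Survivor spaces, so with a ratio of 32 each one takes up 1/34 of the whole young generation. Lowering this parameter enlarges the Survivor spaces, so objects stay there longer and fewer of them reach the old generation. The idea behind removing the Survivor space is to let most data that cannot be collected immediately move into the old generation as early as possible, so that the old generation is collected more often and is less likely to balloon. This is done by setting -XX:SurvivorRatio to a fairly large value (for example 65536).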
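The arithmetic behind -XX:SurvivorRatio can be sketched as follows. It assumes the simple model young = Eden + 2 × Survivor with Eden = SurvivorRatio × Survivor; a real JVM also rounds sizes to its own alignment, so treat the numbers as approximate. The function name `young_gen_layout` and the 340 MB young generation are hypothetical values chosen for illustration.

```python
def young_gen_layout(young_mb, survivor_ratio):
    """Split a young generation into (eden_mb, one_survivor_mb).

    Model: young = eden + 2 * survivor, and eden = survivor_ratio * survivor,
    so one survivor space is young / (survivor_ratio + 2).
    """
    survivor = young_mb / (survivor_ratio + 2)
    eden = survivor * survivor_ratio
    return eden, survivor


# Hypothetical 340 MB young generation under three SurvivorRatio settings.
for ratio in (8, 32, 65536):
    eden, survivor = young_gen_layout(340.0, ratio)
    print(f"SurvivorRatio={ratio}: eden={eden:.2f} MB, each survivor={survivor:.4f} MB")
```

With a ratio of 32 each Survivor space is 1/34 of the young generation (10 MB out of 340 MB here), and with 65536 the Survivor spaces shrink to almost nothing, which is the "remove the Survivor space" approach described above.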
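There was also a system-level cause: this machine hosts one more application than the other nodes, taking about 2 GB of memory. That is why this machine went down while the others had no problems.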