HBase regions went offline and all region servers went down, yet the HMaster stayed up.

Call exception, tries=11, retries=31, started=48384 ms ago, cancelled=false, msg=Call to hzd-t-vbdl-01/10.253.76.213:16020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..) failed: Connection refused: hzd-t-vbdl-01/10.253.76.213:16020, details=row '' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=hzd-t-vbdl-01,16020,1558678879676, seqNum=-1, see https://s.apache.org/timeout
17:26:39.566    ERROR    RegionNormalizerChore
Failed to normalize regions.
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=31, exceptions:
Mon May 27 17:26:39 CST 2019, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=68569: Call to hzd-t-vbdl-01/10.253.76.213:16020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..) failed: Connection refused: hzd-t-vbdl-01/10.253.76.213:16020 row '' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=hzd-t-vbdl-01,16020,1558678879676, seqNum=-1

    at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:299)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:242)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:58)
    at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithoutRetries(RpcRetryingCallerImpl.java:192)
    at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:266)
    at org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:434)
    at org.apache.hadoop.hbase.client.ClientScanner.nextWithSyncCache(ClientScanner.java:309)
    at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:594)
    at org.apache.hadoop.hbase.MetaTableAccessor.scanMeta(MetaTableAccessor.java:766)
    at org.apache.hadoop.hbase.MetaTableAccessor.scanMeta(MetaTableAccessor.java:734)
    at org.apache.hadoop.hbase.MetaTableAccessor.scanMeta(MetaTableAccessor.java:690)
    at org.apache.hadoop.hbase.MetaTableAccessor.fullScanTables(MetaTableAccessor.java:240)
    at org.apache.hadoop.hbase.master.TableStateManager.getTablesInStates(TableStateManager.java:189)
    at org.apache.hadoop.hbase.master.HMaster.normalizeRegions(HMaster.java:1748)
    at org.apache.hadoop.hbase.master.normalizer.RegionNormalizerChore.chore(RegionNormalizerChore.java:48)
    at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:186)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:111)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketTimeoutException: callTimeout=60000, callDuration=68569: Call to hzd-t-vbdl-01/10.253.76.213:16020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..) failed: Connection refused: hzd-t-vbdl-01/10.253.76.213:16020 row '' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=hzd-t-vbdl-01,16020,1558678879676, seqNum=-1
    at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:159)
    at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80)
    ... 3 more
Caused by: java.net.ConnectException: Call to hzd-t-vbdl-01/10.253.76.213:16020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..) failed: Connection refused: hzd-t-vbdl-01/10.253.76.213:16020
    at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:178)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:390)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:95)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:410)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:406)
    at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
    at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
    at org.apache.hadoop.hbase.ipc.BufferCallBeforeInitHandler.userEventTriggered(BufferCallBeforeInitHandler.java:92)
    at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:329)
    at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:315)
    at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:307)
    at org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.userEventTriggered(DefaultChannelPipeline.java:1377)
    at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:329)
    at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:315)
    at org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireUserEventTriggered(DefaultChannelPipeline.java:929)
    at org.apache.hadoop.hbase.ipc.NettyRpcConnection.failInit(NettyRpcConnection.java:179)
    at org.apache.hadoop.hbase.ipc.NettyRpcConnection.access$500(NettyRpcConnection.java:71)
    at org.apache.hadoop.hbase.ipc.NettyRpcConnection$3.operationComplete(NettyRpcConnection.java:267)
    at org.apache.hadoop.hbase.ipc.NettyRpcConnection$3.operationComplete(NettyRpcConnection.java:261)

 

Failed state=SERVER_CRASH_SPLIT_META_LOGS, retry pid=1631, state=RUNNABLE:SERVER_CRASH_SPLIT_META_LOGS, locked=true; ServerCrashProcedure server=hzd-t-vbdl-01,16020,1558950677747, splitWal=true, meta=true; cycles=0
java.io.IOException: error or interrupted while splitting logs in [hdfs://namenodens/hbase/WALs/hzd-t-vbdl-01,16020,1558950677747-splitting] Task = installed = 1 done = 0 error = 0
    at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:271)
    at org.apache.hadoop.hbase.master.MasterWalManager.splitLog(MasterWalManager.java:401)
    at org.apache.hadoop.hbase.master.MasterWalManager.splitMetaLog(MasterWalManager.java:299)
    at org.apache.hadoop.hbase.master.MasterWalManager.splitMetaLog(MasterWalManager.java:291)
    at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.splitMetaLogs(ServerCrashProcedure.java:241)
    at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:128)
    at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:59)
    at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:189)
    at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:965)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1742)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1481)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:78)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2058)    


ERROR: Region { meta => kylin_metadata,,1557233441904.0030550b445a31ce0771017dfbf656f9., hdfs => hdfs://namenodens/hbase/data/default/kylin_metadata/0030550b445a31ce0771017dfbf656f9, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => KYLIN_DAPNOIUKN6,\x00\x01,1557467750732.028a07abee8ff9d2039dbb43f522297c., hdfs => hdfs://namenodens/hbase/data/default/KYLIN_DAPNOIUKN6/028a07abee8ff9d2039dbb43f522297c, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => users,,1557299637703.0a5f935f12764f39232b9d55864e2c75., hdfs => hdfs://namenodens/hbase/data/default/users/0a5f935f12764f39232b9d55864e2c75, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => KYLIN_U1GTBC783V,\x00\x04,1557401362125.0f7b844b358c2ee2ba065df1b4123905., hdfs => hdfs://namenodens/hbase/data/default/KYLIN_U1GTBC783V/0f7b844b358c2ee2ba065df1b4123905, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => KYLIN_R757IS3HCB,,1557732773507.1d91b780f9ddc1b9a76e5ef7649340f4., hdfs => hdfs://namenodens/hbase/data/default/KYLIN_R757IS3HCB/1d91b780f9ddc1b9a76e5ef7649340f4, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => KYLIN_U1GTBC783V,\x00\x14,1557401362125.1e251beb3e1a2fcf4caad5c99ed23951., hdfs => hdfs://namenodens/hbase/data/default/KYLIN_U1GTBC783V/1e251beb3e1a2fcf4caad5c99ed23951, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => KYLIN_8KMDETCTD2,,1557459660987.1f2463332c849cec62051029bf0fbf5e., hdfs => hdfs://namenodens/hbase/data/default/KYLIN_8KMDETCTD2/1f2463332c849cec62051029bf0fbf5e, deployed => , replicaId =>  
 

ERROR: There is a hole in the region chain between  and .  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: Found inconsistency in table KYLIN_9BDZ4K8R2Q
ERROR: There is a hole in the region chain between  and .  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: Found inconsistency in table tmp_table
ERROR: There is a hole in the region chain between  and .  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: Found inconsistency in table KYLIN_CCWH4U80ZH
ERROR: There is a hole in the region chain between  and .  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: Found inconsistency in table KYLIN_4RCZ8UFD3Q
ERROR: There is a hole in the region chain between  and .  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: Found inconsistency in table KYLIN_30T77VXKML
ERROR: There is a hole in the region chain between  and .  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: Found inconsistency in table KYLIN_4KRZ6D9I6R
ERROR: There is a hole in the region chain between  and .  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: Found inconsistency in table KYLIN_HDCPL33NBQ
ERROR: There is a hole in the region chain between  and .  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: Found inconsistency in table kylin_metadata
ERROR: There is a hole in the region chain between  and .  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: Found inconsistency in table KYLIN_WXK3L2OR6T
ERROR: There is a hole in the region chain between  and .  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: Found inconsistency in table KYLIN_R757IS3HCB
ERROR: There is a hole in the region chain between  and .  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: Found inconsistency in table KYLIN_KHAYDY3U9V

 

-------------------------------##############################-------------------------------##############################

The error logs end here.

After the HBase region servers went down, restarting them did not fix the problem: hbase:meta, hbase:acl, and hbase:namespace were all still offline.

Running hbase hbck reported the inconsistencies shown above.

We then tried assigning the system regions with HBCK2:

-- hbase:acl (1558582510111.138d74a294eb284e61ba3bda29da606b)
-- hbase hbck -j ./hbase-hbck2-1.0.0-SNAPSHOT.jar assigns 1558582510111   (first attempt failed: assigns takes the encoded region name, not the timestamp)
hbase hbck -j ./hbase-hbck2-1.0.0-SNAPSHOT.jar assigns 138d74a294eb284e61ba3bda29da606b

-- hbase:meta
hbase hbck -j ./hbase-hbck2-1.0.0-SNAPSHOT.jar assigns 1588230740

-- namespace
hbase hbck -j ./hbase-hbck2-1.0.0-SNAPSHOT.jar assigns 9d13718e1465315927812f37d9aeb6d9

-- normal region (kylin_metadata)
hbase hbck -j ./hbase-hbck2-1.0.0-SNAPSHOT.jar assigns 0030550b445a31ce0771017dfbf656f9
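With this many regions reported as not deployed, assigning them one at a time gets tedious. A minimal sketch of batching it (the helper function and the `hbck.out` filename are hypothetical; it assumes the hbck report was saved to a file and that encoded region names are the usual 32 hex characters):

```shell
# Hypothetical helper: pull the 32-hex encoded region names out of a saved
# `hbase hbck` report, one per line.
extract_encoded_regions() {
  grep 'not deployed on any region server' "$1" \
    | sed -E 's/.*\.([0-9a-f]{32})\.,.*/\1/'
}

# Review the list first, then feed it to HBCK2 in one call, e.g.:
# extract_encoded_regions hbck.out \
#   | xargs hbase hbck -j ./hbase-hbck2-1.0.0-SNAPSHOT.jar assigns
```

The `xargs` step works because HBCK2's `assigns` accepts multiple encoded region names on one command line.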


However, the kylin_metadata table could not be recovered this way.

Deleting the cluster's /hbase znode path in ZooKeeper and restarting HBase to rebuild the metadata did not recover it either.

 

In the end the only option was to sideline the recovered.edits directories under the HBase directory on HDFS and then restart HBase; if a region still fails to come online, re-assign it:

hdfs dfs -mv /hbase/default/KYLIN_CCWH4U80ZH/06f74b91e6f7df813178f17500d76e02/recovered.edits /hbase/default/KYLIN_CCWH4U80ZH/06f74b91e6f7df813178f17500d76e02/recovered.edits_bak
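Doing that move by hand for every affected region is error-prone. A hedged sketch of generalizing it (the function name is made up; the table-directory layout is an assumption — stock HBase 2.x uses `/hbase/data/<namespace>/<table>/<region>`, while this cluster's paths apparently looked like `/hbase/default/...`, so adjust the root to match):

```shell
# Hypothetical helper: for each region directory under a table's HDFS dir,
# PRINT (do not execute) the command that sidelines its recovered.edits.
# Review the output, then pipe it to `sh` if it looks right.
sideline_recovered_edits() {
  local table_dir="$1"   # e.g. /hbase/data/default/KYLIN_CCWH4U80ZH
  hdfs dfs -ls -C "$table_dir" | while read -r region_dir; do
    echo "hdfs dfs -mv ${region_dir}/recovered.edits ${region_dir}/recovered.edits_bak"
  done
}
```

Printing the commands first is deliberate: `-ls -C` also lists non-region entries such as `.tabledesc`, so the output needs a quick eyeball before running it.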

Restart HBase.


hbase hbck -j ./hbase-hbck2-1.0.0-SNAPSHOT.jar assigns 06f74b91e6f7df813178f17500d76e02

Finally, we also raised HBase's zookeeper.session.timeout to 180000 ms and ZooKeeper's maxSessionTimeout to 180000 ms, and increased the HBase RegionServer Java heap size to 2 GB. It had previously been left at a 50 MB default; with that, OOMs and timeouts were inevitable.
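For reference, a sketch of where the session-timeout setting lives (the value is from this incident; the file location assumes a stock install rather than a management console such as Cloudera Manager):

```xml
<!-- hbase-site.xml: client-requested ZK session timeout in ms.
     Only effective if ZooKeeper's own limits allow it. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>180000</value>
</property>
```

On the ZooKeeper side this also requires `maxSessionTimeout=180000` in zoo.cfg, because the server clamps client-requested session timeouts to its configured range. The RegionServer heap is set in hbase-env.sh via `HBASE_HEAPSIZE` (or an `-Xmx` flag in `HBASE_REGIONSERVER_OPTS`).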

Our preliminary diagnosis is that GC pauses (from the undersized heap) caused the ZooKeeper session timeouts, which is why the region servers were declared dead.


This is only a rough outline of the process; the actual metadata-recovery steps must be adapted to your own situation.

 

 

On regions in transition (RIT):

https://mp.weixin.qq.com/s?__biz=MzU5OTQ1MDEzMA==&mid=2247483940&idx=1&sn=4121aa1bd7ef188ccc2d982b80d642f8&chksm=feb5f559c9c27c4f403f94c4221bf4a6a85f7e7a55fd4442e6c65bb2ebab6fd81d80300353d1&scene=27#wechat_redirect

 
