Copyright notice: this is an original article by the author and may not be reproduced without the author's permission. https://www.jianshu.com/p/7a8c947c8bb7
【Background】
An application failed while bulk loading HFiles into HBase, with the following error:
2019-04-03 11:27:18,509 [LoadIncrementalHFiles-2][org.apache.hadoop.hbase.client.RpcRetryingCaller:132] [INFO ] - Call exception, tries=20, retries=35, started=269051 ms ago, cancelled=false, msg=row '048000-1229171819-48889202-48889202-0-79818770#090' on table 'tbl_glhis_hb_swt_addn_inf' at region=tbl_glhis_hb_swt_addn_inf,048000-1229171819-48889202-48889202-0-79818770#090,1554144453180.10bd9b9ffb8bcaac4f939a79a29f4ba1., hostname=y3050705,60020,1517293052636, seqNum=1
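The HBase Master web UI showed that the region of tbl_glhis_hb_swt_addn_inf hosted on node y3050705 was unavailable, so the RegionServer service on that node was stopped manually. The application retried the load and hit the same error.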
【Repair process】
Run the following on any one RegionServer node:
export HADOOP_USER_NAME=hbase
hbase hbck -details tbl_glhis_hb_swt_addn_inf > 1.txt 2>&1
The log in 1.txt showed the following error:
---- Table 'tbl_achis_hb_trans_flow': overlap groups
There are 0 overlap groups with 0 overlapping regions
19/04/03 13:40:15 INFO util.HBaseFsck: Computing mapping of all store files
...........................................................................................java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid16636.hprof ...
Heap dump file created [330952255 bytes in 2.844 secs]
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 16636"...
Killed
[END] 2019/4/3 13:42:50
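The hbck run itself had died from a Java heap overflow. Checking /etc/hbase/conf/hbase-env.sh showed that
HBASE_OPTS set the maximum JVM heap size to 256 MB.
The JVM heap size was therefore raised temporarily in the shell on that RegionServer node: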
export HBASE_OPTS="-Xmx4294967296 -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -Djava.net.preferIPv4Stack=true $HBASE_OPTS"
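The -Xmx value above is given in bytes, which is easy to misread. As a quick sanity check, the sketch below converts an -Xmx value to MiB. The function name xmx_to_mib is hypothetical, and it assumes the value is either plain bytes or carries a k, m, or g suffix.

```shell
# Convert a JVM -Xmx value (without the "-Xmx" prefix) to MiB.
# Assumption: the value is plain bytes or ends in k/K, m/M or g/G.
xmx_to_mib() {
  v="$1"
  case "$v" in
    *[kK]) echo $(( ${v%?} / 1024 )) ;;      # kibibytes -> MiB
    *[mM]) echo $(( ${v%?} )) ;;             # already MiB
    *[gG]) echo $(( ${v%?} * 1024 )) ;;      # gibibytes -> MiB
    *)     echo $(( $v / 1024 / 1024 )) ;;   # plain bytes -> MiB
  esac
}

xmx_to_mib 4294967296   # prints 4096, i.e. 4 GiB
xmx_to_mib 256m         # prints 256
```

So the temporary setting raises the heap limit from 256 MB to 4 GiB.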
hbck was then run again:
hbase hbck -details tbl_glhis_hb_swt_addn_inf > 1.txt 2>&1
This time it completed and reported 4 inconsistencies:
4 inconsistencies detected.
Status: INCONSISTENT
The errors in the log were:
ERROR: Region { meta => tbl_glhis_hb_swt_addn_inf,048000-1229171819-48889202-48889202-0-79818770#090,1554144453180.10bd9b9ffb8bcaac4f939a79a29f4ba1., hdfs => null, deployed => , replicaId => 0 } found in META, but not in HDFS or deployed on any region server.
ERROR: Region { meta => tbl_glhis_hb_swt_addn_inf,047648-0225101088-48429202-48429202-0-79809774#090,1554144453180.bddba6808daaf163334bdda58fe8ff47., hdfs => null, deployed => , replicaId => 0 } found in META, but not in HDFS or deployed on any region server.
ERROR: Region { meta => null, hdfs => null, deployed => y3031903,60020,1534847671712;tbl_glhis_hb_swt_addn_inf,047648-0225101088-48429202-48429202-0-79809774#090,1554143472127.defad07ae1207932536feb505f0d047b., replicaId => 0 }, key=defad07ae1207932536feb505f0d047b, not on HDFS or in hbase:meta but deployed on y3031903,60020,1534847671712
at org.apache.hadoop.hbase.master.MasterRpcServices.offlineRegion(MasterRpcServices.java:1232)
ERROR: No regioninfo in Meta or HDFS. { meta => null, hdfs => null, deployed => y3031903,60020,1534847671712;tbl_glhis_hb_swt_addn_inf,047648-0225101088-48429202-48429202-0-79809774#090,1554143472127.defad07ae1207932536feb505f0d047b., replicaId => 0 }
ERROR: There is a hole in the region chain between 047648-0225101088-48429202-48429202-0-79809774#090 and 048337-0529174121-48028810-00098800-0-79829470#090. You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: Found inconsistency in table tbl_glhis_hb_swt_addn_inf
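A report saved with -details can be very long, so it helps to pull out only the error lines and the final status. The following is a minimal sketch, not part of hbck itself. The helper names are hypothetical, and it assumes that error lines start with "ERROR:" and that the status line starts with "Status:", as in the output shown here. The sample file is a shortened, made-up excerpt.

```shell
# Count the lines in a saved hbck report that start with "ERROR:".
hbck_error_count() {
  grep -c '^ERROR:' "$1"
}

# Print the last status word in the report (for example OK or INCONSISTENT).
hbck_status() {
  sed -n 's/^Status: *//p' "$1" | tail -n 1
}

# Usage against a small, hypothetical excerpt of a report.
report=$(mktemp)
cat > "$report" <<'EOF'
ERROR: There is a hole in the region chain between X and Y.
ERROR: Found inconsistency in table tbl_glhis_hb_swt_addn_inf
Status: INCONSISTENT
EOF
echo "errors: $(hbck_error_count "$report"), status: $(hbck_status "$report")"
rm -f "$report"
```

Against the real report from this incident the same helpers would be called as hbck_error_count 1.txt and hbck_status 1.txt.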
So a repair was run:
hbase hbck -fixHdfsHoles -fixMeta -fixAssignments tbl_glhis_hb_swt_addn_inf > 3.txt 2>&1
The log now showed that the inconsistencies were down to two:
ERROR: Region { meta => null, hdfs => hdfs://nameservice/hbase/data/default/tbl_glhis_hb_swt_addn_inf/1cb1bdb5a6d35637d1365d1bfddfa839, deployed => , replicaId => 0 } on HDFS, but not listed in hbase:meta or deployed on any region server
ERROR: There is a hole in the region chain between 047648-0225101088-48429202-48429202-0-79809774#090 and 048337-0529174121-48028810-00098800-0-79829470#090. You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: Found inconsistency in table tbl_glhis_hb_swt_addn_inf
hbck was then run once more, this time with -repairHoles:
hbase hbck -repairHoles tbl_glhis_hb_swt_addn_inf > repair.txt 2>&1
After that the table checked out as healthy with no inconsistencies, and the import succeeded when the application reran it.
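【Summary】
What does hbck check?
1. HBase region consistency
a. Every region in the cluster is assigned, and is deployed on exactly one RegionServer.
b. The region's state is consistent across memory, the hbase:meta table, and ZooKeeper.
2. HBase table integrity
For any table in the cluster, every rowkey must fall into exactly one region's key range.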
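To make the table integrity rule concrete, here is a small sketch of a chain check over region boundaries. It is a simplification for illustration only and not how hbck is implemented. The function name check_chain and the input format (one "startkey endkey" pair per line, sorted by start key, no empty keys) are assumptions.

```shell
# Read "startkey endkey" pairs, sorted by start key, from stdin and report
# places where consecutive regions leave a gap (hole) or overlap.
# Keys are compared as strings; only adjacent regions are compared.
check_chain() {
  awk '
    NR > 1 {
      if ((prev_end "") < ($1 ""))
        print "hole between " prev_end " and " $1
      else if ((prev_end "") > ($1 ""))
        print "overlap between " $1 " and " prev_end
    }
    { prev_end = $2 }
  '
}

# Regions [a,g) [g,m) [p,z): nothing covers the range from m to p.
printf '%s\n' 'a g' 'g m' 'p z' | check_chain   # prints: hole between m and p
```

A chain in which each end key equals the next start key prints nothing, which corresponds to every rowkey belonging to exactly one region.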
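Region inconsistencies fall mainly into the following types:
1. There is a hole in the region chain between X and Y.
This is a problem at the HDFS level: the region's .regioninfo (metadata) file does not exist. Repair it with "-fixHdfsHoles".
-fixHdfsHoles: repairs region holes (a key range that no region covers).
2. Found lingering reference file X.
This is almost always left behind by a region split; these files are reference (link) files. Repair it with "-fixReferenceFiles".
3. Region X on HDFS, but not listed in hbase:meta or deployed on any region server.
Here the region's data actually exists, but there is no entry for it in hbase:meta. Repair it with "-fixMeta" to bring the two back in sync.
-fixMeta: mainly repairs inconsistencies between the .regioninfo files and the hbase:meta table. HDFS is treated as the source of truth: if a region exists on HDFS but not in hbase:meta, a row is added to hbase:meta; conversely, if it does not exist on HDFS but does in hbase:meta, the corresponding row is deleted from hbase:meta.
4. Region X not deployed on any region server.
Here the region's HFiles and other data are all present, but the region is not online on any RegionServer. Repair it with "-fixAssignments".
-fixAssignments: repairs regions that are unassigned, incorrectly assigned, or assigned to more than one RegionServer at the same time.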
5. Repair with "-repairHoles", a shortcut for -fixAssignments -fixMeta -fixHdfsHoles.
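The mapping from error message to repair option described in types 1 to 4 can be written down as a small lookup. This is only a sketch of the list above. The function name suggest_fix is hypothetical, and the match strings are the message fragments quoted in this article, which may differ between HBase versions.

```shell
# Map one hbck ERROR line to the repair option named in the list above.
suggest_fix() {
  case "$1" in
    *"There is a hole in the region chain"*) echo "-fixHdfsHoles" ;;
    *"lingering reference file"*)            echo "-fixReferenceFiles" ;;
    *"but not listed in hbase:meta"*)        echo "-fixMeta" ;;
    *"not deployed on any region server"*)   echo "-fixAssignments" ;;
    *)                                       echo "unknown" ;;
  esac
}

suggest_fix "ERROR: Found lingering reference file X"   # prints -fixReferenceFiles
```

The lookup only names a candidate option. Whether to run it still depends on reading the full report first.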
Inconsistencies show up in many forms and their causes vary, but generally speaking RegionServer crashes and regions being moved between RegionServers are the basic causes.
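Step 1. hbase hbck: run the check and print all ERROR messages; each ERROR describes the problem it found.
Step 2. hbase hbck -fixTableOrphans: first repair missing tableinfo, regenerating the tableinfo file from the in-memory cache or from the table directory layout on HDFS.
Step 3. hbase hbck -fixHdfsOrphans: repair missing regioninfo, regenerating the .regioninfo file from the HFiles under the region directory.
Step 4. hbase hbck -fixHdfsOverlaps: repair overlapping regions by merging them into a single region directory and generating a new regioninfo.
Step 5. hbase hbck -fixHdfsHoles: repair missing regions, using the boundaries of the uncovered rowkey range to create a new region directory and regioninfo that plug the hole.
Step 6. hbase hbck -fixMeta: repair the meta table, using the regioninfo to regenerate the corresponding meta rows and filling in a default RegionServer assignment for them.
Step 7. hbase hbck -fixAssignments: bring the offline regions online. When a region is reopened it is assigned to a real RegionServer, and its row in the meta table is updated.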
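The seven steps can be laid out as a command list for one table. The sketch below only prints the commands in order and does not run hbck. The function name hbck_plan is hypothetical, and passing the table name after the option follows the commands used earlier in this article.

```shell
# Print, in order, the hbck commands for the seven steps above.
# Nothing is executed; the output is meant to be reviewed and run by hand.
hbck_plan() {
  echo "hbase hbck $1"
  for flag in -fixTableOrphans -fixHdfsOrphans -fixHdfsOverlaps \
              -fixHdfsHoles -fixMeta -fixAssignments; do
    echo "hbase hbck $flag $1"
  done
}

hbck_plan tbl_glhis_hb_swt_addn_inf
```

Printing rather than executing is a deliberate choice here: each repair option changes HDFS or hbase:meta, so it seems safer to rerun the plain check between steps and apply only the options the remaining errors call for.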