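I. Problem description
We found that one region of a table in the HBase cluster had been stuck in the SPLITTING state for a long time without finishing. Full GCs on the HMaster node were still able to reclaim memory.
While the region was stuck, create-table and drop-table operations submitted to the cluster had no effect: the tables stayed in the ENABLING and DISABLING states, and the operations never completed.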
test_mgq ENABLING
test_userfriend DISABLING
test_userfriendlwj ENABLING
The HMaster log showed the following messages:
2014-04-09 00:17:52,007 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: mau_selecteduser,2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF1QQQQQQQQQ,1396415717740.50aa699775f00b753aaaa6e029fed887. state=SPLITTING, ts=1396415877142, server=YZ18-134.opi.com,60020,1395937706714
2014-04-09 00:17:59,706 INFO org.apache.hadoop.hbase.master.handler.DisableTableHandler: Offlining 1 regions.
2014-04-09 00:18:02,007 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: mau_selecteduser,2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF1QQQQQQQQQ,1396415717740.50aa699775f00b753aaaa6e029fed887. state=SPLITTING, ts=1396415877142, server=YZ18-134.opi.com,60020,1395937706714
2014-04-09 00:18:12,007 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: mau_selecteduser,2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF1QQQQQQQQQ,1396415717740.50aa699775f00b753aaaa6e029fed887. state=SPLITTING, ts=1396415877142, server=YZ18-134.opi.com,60020,1395937706714
2014-04-09 00:18:22,007 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: mau_selecteduser,2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF1QQQQQQQQQ,1396415717740.50aa699775f00b753aaaa6e029fed887. state=SPLITTING, ts=1396415877142, server=YZ18-134.opi.com,60020,1395937706714
2014-04-09 00:18:32,007 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: mau_selecteduser,2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF1QQQQQQQQQ,1396415717740.50aa699775f00b753aaaa6e029fed887. state=SPLITTING, ts=1396415877142, server=YZ18-134.opi.com,60020,1395937706714
2014-04-09 00:18:39,602 DEBUG org.apache.hadoop.hbase.client.ClientScanner: Creating scanner over .META. starting at key ''
2014-04-09 00:18:39,602 DEBUG org.apache.hadoop.hbase.client.ClientScanner: Advancing internal scanner to startKey at ''
2014-04-09 00:18:40,220 DEBUG org.apache.hadoop.hbase.client.ClientScanner: Finished with scanning at {NAME => '.META.,,1', STARTKEY => '', ENDKEY => '', ENCODED => 1028785192,}
2014-04-09 00:18:41,549 DEBUG org.apache.hadoop.hbase.master.CatalogJanitor: Scanned 16924 catalog row(s) and gc'd 0 unreferenced parent region(s)
2014-04-09 00:18:42,007 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: mau_selecteduser,2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF1QQQQQQQQQ,1396415717740.50aa699775f00b753aaaa6e029fed887. state=SPLITTING, ts=1396415877142, server=YZ18-134.opi.com,60020,1395937706714
2014-04-09 00:18:52,007 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: mau_selecteduser,2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF1QQQQQQQQQ,1396415717740.50aa699775f00b753aaaa6e029fed887. state=SPLITTING, ts=1396415877142, server=YZ18-134.opi.com,60020,1395937706714
2014-04-09 00:19:02,007 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: mau_selecteduser,2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF1QQQQQQQQQ,1396415717740.50aa699775f00b753aaaa6e029fed887. state=SPLITTING, ts=1396415877142, server=YZ18-134.opi.com,60020,1395937706714
2014-04-09 00:19:12,008 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: mau_selecteduser,2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF1QQQQQQQQQ,1396415717740.50aa699775f00b753aaaa6e029fed887. state=SPLITTING, ts=1396415877142, server=YZ18-134.opi.com,60020,1395937706714
2014-04-09 00:19:22,007 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: mau_selecteduser,2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF1QQQQQQQQQ,1396415717740.50aa699775f00b753aaaa6e029fed887. state=SPLITTING, ts=1396415877142, server=YZ18-134.opi.com,60020,1395937706714
2014-04-09 00:19:32,007 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: mau_selecteduser,2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF1QQQQQQQQQ,1396415717740.50aa699775f00b753aaaa6e029fed887. state=SPLITTING, ts=1396415877142, server=YZ18-134.opi.com,60020,1395937706714
2014-04-09 00:19:42,007 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: mau_selecteduser,2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF1QQQQQQQQQ,1396415717740.50aa699775f00b753aaaa6e029fed887. state=SPLITTING, ts=1396415877142, server=YZ18-134.opi.com,60020,1395937706714
2014-04-09 00:19:47,451 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because 1 region(s) in transition: {50aa699775f00b753aaaa6e029fed887=mau_selecteduser,2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF1QQQQQQQQQ,1396415717740.50aa699775f00b753aaaa6e029fed887. state=SPLITTING, ts=1396415877142, server=YZ18-134.opi.com,60020,1395937706714}
2014-04-09 00:19:52,008 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: mau_selecteduser,2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF1QQQQQQQQQ,1396415717740.50aa699775f00b753aaaa6e029fed887. state=SPLITTING, ts=1396415877142, server=YZ18-134.opi.com,60020,1395937706714
^C
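The entries above repeat every ten seconds and all describe the same region. As a minimal sketch (the regular expression and the function name `parse_rit_line` are illustrative assumptions, matched only against the log format shown here), the relevant fields can be pulled out of one such line like this:

```python
import re

# Matches the AssignmentManager "Regions in transition timed out" entries
# in the format shown in the log excerpt above (an assumption: other HBase
# versions may format this message differently).
RIT_PATTERN = re.compile(
    r"^(?P<logged_at>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO "
    r"org\.apache\.hadoop\.hbase\.master\.AssignmentManager: "
    r"Regions in transition timed out:\s+(?P<region>\S+)\s+"
    r"state=(?P<state>\w+), ts=(?P<ts>\d+), server=(?P<server>\S+)$"
)


def parse_rit_line(line):
    """Return the fields of one timed-out entry as a dict, or None if the line does not match."""
    match = RIT_PATTERN.match(line.strip())
    if match is None:
        return None
    info = match.groupdict()
    info["ts"] = int(info["ts"])
    # The encoded region name is the hex string after the last dot in the
    # region name (the region name itself ends with a trailing dot).
    info["encoded_name"] = info["region"].rstrip(".").rsplit(".", 1)[-1]
    return info


SAMPLE = (
    r"2014-04-09 00:17:52,007 INFO org.apache.hadoop.hbase.master.AssignmentManager: "
    r"Regions in transition timed out: "
    r"mau_selecteduser,2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF2\xF1QQQQQQQQQ,"
    r"1396415717740.50aa699775f00b753aaaa6e029fed887. "
    r"state=SPLITTING, ts=1396415877142, server=YZ18-134.opi.com,60020,1395937706714"
)

if __name__ == "__main__":
    print(parse_rit_line(SAMPLE))
```

The `state`, `ts`, and `server` fields are the ones that matter for this incident: the state never changes, the ts never advances, and the server identifies the regionserver to inspect.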
The ts value in these entries is a Unix timestamp in milliseconds:
1396415877142
Converted to Beijing time (UTC+8), this is April 2, 2014, 1:17:57 PM.
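The conversion can be reproduced with a few lines of Python. This is a small sketch rather than part of the original investigation: the helper names are made up for illustration, and `stuck_for` additionally assumes that the HMaster log timestamps are written in Beijing local time, which the log itself does not state.

```python
from datetime import datetime, timedelta, timezone

BEIJING = timezone(timedelta(hours=8))  # UTC+8, no daylight saving
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def hbase_ts_to_beijing(ts_millis):
    """Convert a millisecond Unix timestamp to a Beijing-time datetime."""
    # Integer milliseconds are added to the epoch to avoid float rounding.
    return (EPOCH + timedelta(milliseconds=ts_millis)).astimezone(BEIJING)


def stuck_for(log_time, ts_millis):
    """How long the region had been in its state when the log line was written.

    Assumption: log_time ("YYYY-MM-DD HH:MM:SS,mmm") is in Beijing local time.
    """
    logged_at = datetime.strptime(log_time, "%Y-%m-%d %H:%M:%S,%f").replace(tzinfo=BEIJING)
    return logged_at - hbase_ts_to_beijing(ts_millis)


if __name__ == "__main__":
    print(hbase_ts_to_beijing(1396415877142))
    print(stuck_for("2014-04-09 00:17:52,007", 1396415877142))
```

Under that time-zone assumption, the region had already been in SPLITTING for more than six days when the first log line above was written.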
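The regionserver log on YZ18-134.opi.com showed:
Resolution:
1. The quick and blunt way:
Restart the regionserver on the affected machine (134) so that the regions on that machine are reported to the master again; the SPLITTING state then disappears.
After the restart, the mau_selecteduser table returned to normal, and the two test tables quickly reached their expected states as well: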
test_userfriend_community DISABLED
test_userfriend_communitylwj ENABLED
HBase Region Split process in detail
1. When the need for a Region Split is checked:
After every flush or compaction, the regionserver checks whether the conditions for a split are met.
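The text does not say what the split condition is. One common form is size-based: the idea behind HBase's ConstantSizeRegionSplitPolicy is to split a region once any of its stores grows beyond `hbase.hregion.max.filesize`. The sketch below only illustrates that idea; the function name is hypothetical, the 10 GB limit is an assumed value (the default differs between versions), and other policies such as IncreasingToUpperBoundRegionSplitPolicy compute the threshold differently.

```python
# Assumed value for hbase.hregion.max.filesize; check the cluster's own config.
MAX_FILESIZE_BYTES = 10 * 1024 ** 3


def should_split(store_sizes_bytes, max_filesize=MAX_FILESIZE_BYTES):
    """Return True if any store (column family) in the region exceeds the limit.

    store_sizes_bytes: iterable with the total size of each store in bytes.
    """
    return any(size > max_filesize for size in store_sizes_bytes)


if __name__ == "__main__":
    # A region with one small store and one 11 GB store would be split.
    print(should_split([512 * 1024 ** 2, 11 * 1024 ** 3]))
```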
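2. The Region Split process works as follows:
(1) The RegionServer creates a znode /hbase/region-in-transition/region-name in ZooKeeper and sets its content to SPLITTING.
(2) Because the Master watches /hbase/region-in-transition, it is notified as soon as (1) happens.
(3) The RegionServer creates a .splits directory under the parent region's directory in HDFS.
(4) The RegionServer closes the ParentRegion, forces a flush, and marks the region as offline in its local data structures. From this point on, client requests to the ParentRegion fail with NotServingRegionException, and the client keeps retrying.
(5) The RegionServer creates two subdirectories under .splits, one for each of the daughter regions A and B, together with the corresponding data structures. It then splits every StoreFile of the ParentRegion. At this stage the RegionServer only creates two reference files for each StoreFile of the ParentRegion; no data is rewritten.
(6) The RegionServer creates the actual storage directories in HDFS for the daughterA region and the daughterB region.
(7) The RegionServer sends a Put request to the .META. table. The request marks the ParentRegion as offline in .META. and adds the information about the daughterA and daughterB regions. At this point .META. does not yet contain separate entries for daughterA and daughterB: a scan of .META. shows that the ParentRegion is being split, but the daughters are not visible. If the Put succeeds, the ParentRegion is effectively split. If the Put fails, the Master and the next RegionServer that opens the ParentRegion clean up the dirty state left behind by the split.
(8) The RegionServer opens the daughterA and daughterB regions, which then start accepting writes.
(9) The RegionServer adds daughterA and daughterB to the .META. table. Only after this can clients discover the daughter regions and send requests to them.
(10) The RegionServer updates the state of the znode /hbase/region-in-transition/region-name in ZooKeeper to SPLIT. The Master is notified of the change, and the balancer can then assign the daughter regions to other RegionServers.
(11) After the split finishes, HDFS and .META. still hold the reference files that point to the parent region. These reference files are removed when the daughter regions run a major compaction that rewrites their StoreFiles. A garbage collection task in the Master periodically checks whether the daughter regions still contain reference files pointing to the parent region; once they do not, the Master deletes the parent region.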
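To relate these steps to the stuck region described earlier, here is a toy model of the sequence, not HBase's actual implementation. The enum members and function names are hypothetical labels for steps (1) to (10); the model only encodes two facts stated above: the znode stays in SPLITTING until step (10), and the .META. Put in step (7) decides whether an interrupted split is cleaned up or counts as done.

```python
from enum import IntEnum


class SplitStep(IntEnum):
    """Hypothetical labels for steps (1) to (10) of the split described above."""
    ZNODE_SPLITTING = 1
    MASTER_NOTIFIED = 2
    SPLITS_DIR_CREATED = 3
    PARENT_CLOSED = 4
    REFERENCES_CREATED = 5
    DAUGHTER_DIRS_CREATED = 6
    META_PARENT_OFFLINED = 7  # the Put to .META. that decides the outcome
    DAUGHTERS_OPENED = 8
    DAUGHTERS_IN_META = 9
    ZNODE_SPLIT = 10


def znode_state(last_completed):
    """State the Master sees in /hbase/region-in-transition/region-name."""
    if last_completed is None:
        return None  # no znode has been created yet
    if last_completed >= SplitStep.ZNODE_SPLIT:
        return "SPLIT"
    return "SPLITTING"


def outcome_if_interrupted(last_completed):
    """What happens to the parent if the split stops after last_completed."""
    if last_completed is None or last_completed < SplitStep.META_PARENT_OFFLINED:
        return "split state is cleaned up, parent region is kept"
    return "parent region counts as split"


if __name__ == "__main__":
    for step in SplitStep:
        print(step.value, step.name, znode_state(step), "|", outcome_if_interrupted(step))
```

In this model, a region that the Master reports as SPLITTING for days has stopped somewhere between steps (1) and (9), which is consistent with looking at the regionserver that owns the region rather than at the Master.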