线上和开发集群都遇到过master abort的情况,而且错误都是Unexpected state导致的FATAL错误
2014-01-08 13:44:40,770 FATAL org.apache.hadoop.hbase.master.HMaster: Master server abort: loaded coprocessors are: []
2014-01-08 13:44:40,771 FATAL org.apache.hadoop.hbase.master.HMaster: 69 Unexpected state : usertable2,,1389159324467.a37e60fd78a3bff2bddac6270f6b317c. state=PENDING_OPEN, ts=1389159880770, server=localhost.localdomain,60020,1389159296923 .. Cannot transit it to OFFLINE.
java.lang.IllegalStateException: 69 Unexpected state : usertable2,,1389159324467.a37e60fd78a3bff2bddac6270f6b317c. state=PENDING_OPEN, ts=1389159880770, server=localhost.localdomain,60020,1389159296923 .. Cannot transit it to OFFLINE.
在本地测试可以复现了这种情况,创建一个表,设置压缩算法,但是压缩算法本地不支持,这样master abort很容易复现。
create 'usertable2', { NAME=>'cf', VERSIONS=> 1, COMPRESSION => 'LZO', BLOCKCACHE => true}, { SPLITS => ['user']}
我们来看下HRegionServer的日志:
RegionServer:
2014-01-08 13:44:40,661 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x1437058a1d60001 Successfully transitioned node a37e60fd78a3bff2bddac6270f6b317c from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
2014-01-08 13:44:40,669 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:600
接着我们来看HMaster的日志:
Master:
2014-01-08 13:44:40,671 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 28 Handling transition=RS_ZK_REGION_FAILED_OPEN, server=localhost.localdomain,60020,1389159296923, region=a37e60fd78a3bff2bddac6270f6b317c
2014-01-08 13:44:40,679 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 28 Handling transition=RS_ZK_REGION_FAILED_OPEN, server=localhost.localdomain,60020,1389159296923, region=a37e60fd78a3bff2bddac6270f6b317c
RegionServer产生了两个zk事件,一个是RS_ZK_REGION_OPENING,一个是RS_ZK_REGION_FAILED_OPEN。而Master处理这两个事件的时间比较晚,所以读取到的数据都是RS_ZK_REGION_FAILED_OPEN,导致执行两次ClosedRegionHandler。
这两个ClosedRegionHandler同时执行,有一定的概率导致Unexpected state,进而master abort,看下面master的日志。
2014-01-08 13:44:40,677 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: 68 Handling CLOSED event for a37e60fd78a3bff2bddac6270f6b317c
2014-01-08 13:44:40,677 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 68 Forcing OFFLINE; was=usertable2,,1389159324467.a37e60fd78a3bff2bddac6270f6b317c. state=CLOSED, ts=1389159880663, server=localhost.localdomain,60020,1389159296923
2014-01-08 13:44:40,677 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x1437058a1d60005 Creating (or updating) unassigned node for a37e60fd78a3bff2bddac6270f6b317c with OFFLINE state
2014-01-08 13:44:40,683 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: 69 Handling CLOSED event for a37e60fd78a3bff2bddac6270f6b317c
2014-01-08 13:44:40,683 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 69 Forcing OFFLINE; was=usertable2,,1389159324467.a37e60fd78a3bff2bddac6270f6b317c. state=CLOSED, ts=1389159880663, server=localhost.localdomain,60020,1389159296923
2014-01-08 13:44:40,736 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 68 Found an existing plan for usertable2,,1389159324467.a37e60fd78a3bff2bddac6270f6b317c. destination server is localhost.localdomain,60020,1389159296923
2014-01-08 13:44:40,736 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 68 Using pre-existing plan for region usertable2,,1389159324467.a37e60fd78a3bff2bddac6270f6b317c.; plan=hri=usertable2,,1389159324467.a37e60fd78a3bff2bddac6270f6b317c., src=, dest=localhost.localdomain,60020,1389159296923
2014-01-08 13:44:40,736 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 68 Assigning region usertable2,,1389159324467.a37e60fd78a3bff2bddac6270f6b317c. to localhost.localdomain,60020,1389159296923
2014-01-08 13:44:40,762 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x1437058a1d60005 Creating (or updating) unassigned node for eff0a14a56f0750278268d6bb49f35e0 with OFFLINE state
2014-01-08 13:44:40,770 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 68 update region usertable2,,1389159324467.a37e60fd78a3bff2bddac6270f6b317c. to PENDING_OPEN
2014-01-08 13:44:40,770 FATAL org.apache.hadoop.hbase.master.HMaster: Master server abort: loaded coprocessors are: []
2014-01-08 13:44:40,771 FATAL org.apache.hadoop.hbase.master.HMaster: 69 Unexpected state : usertable2,,1389159324467.a37e60fd78a3bff2bddac6270f6b317c. state=PENDING_OPEN, ts=1389159880770, server=localhost.localdomain,60020,1389159296923 .. Cannot transit it to OFFLINE.
java.lang.IllegalStateException: 69 Unexpected state : usertable2,,1389159324467.a37e60fd78a3bff2bddac6270f6b317c. state=PENDING_OPEN, ts=1389159880770, server=localhost.localdomain,60020,1389159296923 .. Cannot transit it to OFFLINE.
at org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1896)
at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1700)
这里也就可以解释为什么master收到了同一个region的两个RS_ZK_REGION_FAILED_OPEN事件,不是zk导致的,而是时序导致的。