1 java.io.IOException: java.io.IOException: java.lang.IllegalArgumentException: offset (0) + length (8) exceed the capacity of the array: 4
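This occurred during a simple incr operation. The value had previously been written by a put as an int (vlen=4), which an increment cannot operate on; the value has to be stored as a long (vlen=8).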
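The width mismatch can be reproduced without a cluster. Below is a minimal sketch using only java.nio; the class `CounterWidth` and its methods are hypothetical names for illustration, not HBase API. It shows that an int encodes to 4 bytes, that reading 8 bytes from such a cell fails, and that a long encodes to the 8 bytes an increment expects.

```java
import java.nio.ByteBuffer;

final class CounterWidth {
    // Encode an int as a 4-byte big-endian cell value (vlen=4).
    static byte[] encodeInt(int v) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(v).array();
    }

    // Encode a long as an 8-byte big-endian cell value (vlen=8).
    static byte[] encodeLong(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    // Decode a cell as a long. Any length other than 8 is rejected, which is
    // the same kind of failure as "offset (0) + length (8) exceed the
    // capacity of the array: 4".
    static long decodeLong(byte[] cell) {
        if (cell.length != Long.BYTES) {
            throw new IllegalArgumentException(
                "expected 8 bytes but cell has " + cell.length);
        }
        return ByteBuffer.wrap(cell).getLong();
    }
}
```

In HBase client code, the corresponding fix would presumably be to write the initial counter value with Bytes.toBytes applied to a long rather than to an int.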
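2 When writing data to a column, one of two errors appeared: org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1 action: NotServingRegionException: 1 time, servers with issues: 10.xx.xx.37:60020, or org.apache.hadoop.hbase.NotServingRegionException: Region is not online: . In master-status, Regions in Transition showed regions stuck in the PENDING_OPEN state for more than ten minutes, which blocked requests. For now the machine 10.xx.xx.37 has been taken offline; the cluster then ran stably overnight with no blocking caused by splits, so a problem with that machine is suspected. The HMaster log shows this region server opening and closing regions continuously, without doing any split or flush.
RIT stands for region in transition. Each time the hbase master opens or closes a region, it inserts a record into the master's RIT, because the master's operations on a region must be atomic, and a region open or close is carried out by the HMaster and a region server working together. To provide coordination, rollback, and consistency for these operations, the HMaster uses the RIT mechanism together with the state of nodes in Zookeeper to keep the operations safe and consistent.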
OFFLINE,        // region is in an offline state
PENDING_OPEN,   // sent rpc to server to open but has not begun
OPENING,        // server has begun to open but not yet done
OPEN,           // server opened region and updated meta
PENDING_CLOSE,  // sent rpc to server to close but has not begun
CLOSING,        // server has begun to close but not yet done
CLOSED,         // server closed region and updated meta
SPLITTING,      // server started split of a region
SPLIT           // server completed split of a region
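Further investigation showed it was a load balancing problem: regions on the region server were being opened and closed over and over. See http://abloz.com/hbase/book.html#regions.arch.assignment. After the region server was restarted, things returned to normal.
Later, while the code was running, "region not online" appeared again. It is thrown as NotServingRegionException, which is documented as "Thrown by a region server if it is sent a request for a region it is not serving."
Why does the client keep requesting an offline region? These errors are concentrated in 3 of the 150 regions. Tracing the server-side log shows that the region is closed by CloseRegionHandler and only reopened about 20 minutes later. After the close, why is the client still requesting that same closed region?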
3 A switch that disables writes to hbase does not take effect
The fix is to configure the rpc timeout parameter and the retry count.
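One plausible reading of why these two settings matter: a call that is already in flight can block for roughly the retry count times the rpc timeout, so a switch is not observed until that call returns. The sketch below estimates this upper bound. The class `ClientBlockingBudget` is hypothetical, the constant pause between attempts is a simplification, and the property names in the comments (hbase.rpc.timeout, hbase.client.retries.number, hbase.client.pause) are my assumption about which settings the note refers to.

```java
final class ClientBlockingBudget {
    // Rough upper bound on how long a single client call can block:
    // every attempt may wait a full rpc timeout, and the client pauses
    // between attempts. Simplification: the pause is treated as constant.
    //   retries      ~ hbase.client.retries.number (assumed)
    //   rpcTimeoutMs ~ hbase.rpc.timeout (assumed)
    //   pauseMs      ~ hbase.client.pause (assumed)
    static long worstCaseMillis(int retries, long rpcTimeoutMs, long pauseMs) {
        if (retries < 1) {
            throw new IllegalArgumentException("retries must be >= 1");
        }
        return retries * rpcTimeoutMs + (retries - 1) * pauseMs;
    }
}
```

With illustrative values, worstCaseMillis(10, 60000, 1000) gives 609000 ms, about ten minutes, while worstCaseMillis(3, 5000, 200) gives 15400 ms.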
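4 flush, split, and compact cause stop-the-world pauses
Long-running flush and split operations leave the hbase server unable to respond to requests. The region size needs to be adjusted, and tests are needed to measure the number of flushes.
5 hbase parameter settings
6 region server crash
The region server crashed because a GC pause ran too long, so the connection between the region server and zookeeper timed out.
minSessionTimeout: in milliseconds, defaults to 2 times tickTime.
maxSessionTimeout: in milliseconds, defaults to 20 times tickTime.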
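These two bounds matter because the ZooKeeper server clamps the session timeout requested by a client into the range between them. A minimal sketch of that arithmetic, assuming both bounds are left at their defaults; the class `SessionTimeout` and the tickTime of 2000 ms used in the example are illustrative assumptions, not values taken from this cluster.

```java
final class SessionTimeout {
    // Effective session timeout when minSessionTimeout and maxSessionTimeout
    // are left at their defaults of 2x and 20x tickTime.
    static int negotiate(int requestedMs, int tickTimeMs) {
        int min = 2 * tickTimeMs;
        int max = 20 * tickTimeMs;
        // The requested value is clamped into [min, max].
        return Math.max(min, Math.min(max, requestedMs));
    }
}
```

For example, negotiate(180000, 2000) yields 40000: a client asking for 180 seconds would get only 40 seconds, so a GC pause longer than that expires the session unless maxSessionTimeout or tickTime is raised on the ZooKeeper side.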
7 A code problem caused a deadlock
8 operation too slow
2012-07-26 05:30:39,141 WARN org.apache.hadoop.ipc.HBaseServer: (operationTooSlow): {"processingtimems":69315,"ts":9223372036854775807,"client":"","starttimems":1343251769825,"queuetimems":0,"class":"HRegionServer","responsesize":0,"method":"delete","totalColumns":1,"table":"trackurl_status_list","families":{"sl":[{"timestamp":1343251769825,"qualifier":"zzzn1VlyG","vlen":0}]},"row":""}
Empty row-key, deleting a nonexistent column: takes 700ms
Empty row-key, deleting an existing column: takes 5ms
Non-empty row-key, deleting any column: takes 3ms
9 responseTooSlow
2012-07-31 17:52:06,619 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":1156438,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@3dbb29e5), rpc version=1, client version=29, methodsFingerPrint=-1508511443","client":"","starttimems":1343727170177,"queuetimems":0,"class":"HRegionServer","responsesize":0,"method":"multi"}
Quoting the hbase documentation: The output is tagged with operation e.g. (operationTooSlow) if the call was a client operation, such as a Put, Get, or Delete, which we expose detailed fingerprint information for. If not, it is tagged (responseTooSlow) and still produces parseable JSON output, but with less verbose information solely regarding its duration and size in the RPC itself.
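Since both kinds of line carry a tag and a processingtimems field, they can be picked out of a region server log with a couple of regular expressions. The sketch below is one way to do it with the Java standard library only; the class `SlowLogParser` is a hypothetical helper, and it deliberately extracts just these two fields rather than parsing the full JSON.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SlowLogParser {
    private static final Pattern TAG =
        Pattern.compile("\\((operationTooSlow|responseTooSlow)\\)");
    private static final Pattern PROCESSING =
        Pattern.compile("\"processingtimems\":(\\d+)");

    // Returns "operationTooSlow" or "responseTooSlow", or null if the
    // line carries neither tag.
    static String tag(String line) {
        Matcher m = TAG.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    // Returns the processingtimems value, or -1 if the field is absent.
    static long processingMillis(String line) {
        Matcher m = PROCESSING.matcher(line);
        return m.find() ? Long.parseLong(m.group(1)) : -1L;
    }
}
```

Applied to the two log lines above, this would report 69315 ms for the delete and 1156438 ms for the multi call.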
10 output error
2012-07-31 17:52:06,812 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call get([B@61574be4, {"timeRange":[0,9223372036854775807],"totalColumns":1,"cacheBlocks":true,"families":{"c":["ALL"]},"maxVersions":1,"row":"zOuu6TK"}), rpc version=1, client version=29, methodsFingerPrint=-1508511443 from output error
11 rollbackMemstore issue
2012-08-07 10:21:49,887 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: rollbackMemstore rolled back 0 keyvalues from start:0 to end:0
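The method is documented as: Remove all the keys listed in the map from the memstore. This method is called when a Put has updated memstore but subsequently fails to update the wal. This method is then invoked to rollback the memstore.
The loop in the method: for (int i = start; i < end; i++) {
With start:0 and end:0, as in the log line above, the loop body never runs, so no keyvalues are rolled back.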
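12 Bringing a new region server online causes region not online
Requests for a region are sent to the wrong region server.
13 Requests go to a region that no longer exists, and re-creating the table pool does not help
Timestamp of the request: 1342510667
Timestamp associated with the newest region's rowkey: 1344558957
It turned out that the region location table is maintained in HConnectionManager.
get with a Get, delete with a Delete, and incr with an Increment are handled in withRetries of the ServerCallable class.
Case 1: if an error occurs (SocketTimeoutException, ConnectException, RetriesExhaustedException), the region server location is cleared.
Case 2: if numRetries is set to 1, the loop runs only once, so connect(tries != 0) is connect(false), i.e. reload=false, and the location is not refreshed. The location is fetched again only when numRetries > 1.
get with a List of Gets, put with a Put or a List of Puts, and delete with a List of Deletes go through processBatch in HConnectionManager; when a batch get, put, or delete returns a problematic result, the region server location is refreshed.
Setting numRetries to more than 1 (3 in my case) solved the problem.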
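The effect of numRetries on case 2 can be illustrated with a small simulation. This is a sketch of the loop shape described above, not the HBase source: `RetrySketch`, `Locator`, and `staleCache` are hypothetical names, and the location lookup is reduced to returning a server name.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class RetrySketch {
    /** Hypothetical stand-in for a region location lookup. */
    interface Locator {
        String locate(boolean reload);
    }

    // Attempt number `tries` passes reload = (tries != 0), so only a
    // retry, never the first attempt, refreshes the cached location.
    static boolean callWithRetries(int numRetries, Locator locator, String liveServer) {
        for (int tries = 0; tries < numRetries; tries++) {
            String server = locator.locate(tries != 0);
            if (server.equals(liveServer)) {
                return true; // the request reached the server hosting the region
            }
            // Otherwise the real client would see NotServingRegionException
            // and go around the loop again, if any attempts remain.
        }
        return false;
    }

    // A locator that returns a stale server until it is asked to reload;
    // `reloads` counts how many times a reload was requested.
    static Locator staleCache(String stale, String fresh, AtomicInteger reloads) {
        return reload -> {
            if (reload) {
                reloads.incrementAndGet();
                return fresh;
            }
            return stale;
        };
    }
}
```

With numRetries set to 1 the call fails without ever reloading; with 3 it succeeds on the second attempt after a single reload, which matches the behaviour observed here.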
14 zookeeper.RecoverableZooKeeper(195): Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
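In the end, changing the configured port from 21818 to 2181 made it run normally. This change is probably only needed in a standalone setup.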
