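Background
Our company's CDH cluster has limited resources. We parse the data files reported by data vendors and write the results into HBase, and whenever reporting peaks (for example a bulk upload of historical data, or a bulk retransmission of abnormal data), HBase runs out of machine resources and the HBase service terminates abnormally. Tuning the number of concurrent threads that parse the data files and the memory allocated to the HBase service did not solve the problem, so every occurrence required a manual restart (we plan to expand the cluster later to handle the read/write load; for now this is a stopgap). To detect these terminations promptly, we monitored whether the HBase service processes deployed on each CDH node were alive. However, because the internal network cannot be reached from the external network, even when a termination was detected in the early hours of the morning we could not recover right away. We also traced each service process's startup command to locate the scripts that start and restart the services, but because of user and permission issues, attempts to restart the failed HBase service automatically through the CDH scripts never worked. Fortunately, Cloudera Manager (CM) exposes an open API.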
Solution
1. Get the cluster name
curl -u CM_username:password 'http://localhost:7180/api/v1/clusters'
2. Get the HBase service name
curl -u CM_username:password 'http://localhost:7180/api/v1/clusters/Cluster%201/services' # Note: the default cluster name contains a space, which must be escaped as %20
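The escaping in steps 1 and 2 can also be done programmatically. The following is a minimal sketch, assuming Python 3 and assuming the body returned by the clusters endpoint is a JSON object with an "items" list whose entries carry a "name" field. The helper names cluster_names and services_url and the sample body are hypothetical, used only for illustration.

```python
import json
from urllib.parse import quote

CM_BASE = "http://localhost:7180/api/v1"  # same host and port as the curl examples


def cluster_names(clusters_json):
    """Return the cluster names found in the body of GET /clusters."""
    return [c["name"] for c in json.loads(clusters_json)["items"]]


def services_url(cluster_name):
    """Build the services URL, percent-encoding the cluster name."""
    return "%s/clusters/%s/services" % (CM_BASE, quote(cluster_name, safe=""))


# Hypothetical response body, trimmed to the single field used here.
sample = '{"items": [{"name": "Cluster 1"}]}'
for name in cluster_names(sample):
    print(services_url(name))
```

Using quote with safe="" encodes the space as %20, so the cluster name never has to be escaped by hand.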
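3. Get the status of the HBase service and its roles
curl -u CM_username:password 'http://localhost:7180/api/v1/clusters/Cluster%201/services/{servicename}/roles' # Note: {servicename} is the HBase service name obtained in step 2, usually hbase
The JSON returned by this roles endpoint (the same one the monitoring script in the appendix calls) lists every master, regionserver, restserver and thriftserver role with its name, whether it is started, and its health status. The role name matters most: any later restart request must specify the list of role names to restart.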
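To illustrate how the role names can be pulled out of that JSON, here is a small sketch, assuming Python 3. It relies only on the fields the monitoring script at the end of this post reads (name, type, roleState, healthSummary). The function name roles_to_restart is hypothetical, and the sample data uses placeholder role names rather than real API output.

```python
import json


def roles_to_restart(roles_json):
    """Return the names of roles whose healthSummary is BAD."""
    roles = json.loads(roles_json).get("items", [])
    return [r["name"] for r in roles if r.get("healthSummary") == "BAD"]


# Hypothetical sample; the role names are placeholders, not real output.
sample = json.dumps({"items": [
    {"name": "master-xxxx", "type": "MASTER",
     "roleState": "STARTED", "healthSummary": "GOOD"},
    {"name": "regionserver-xxxxxxx", "type": "REGIONSERVER",
     "roleState": "STOPPED", "healthSummary": "BAD"},
]})
print(roles_to_restart(sample))
```

Filtering on healthSummary mirrors the check in the script; a different criterion (for example roleState) may suit other setups better.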
4. Restart the failed HBase roles
curl -X POST -H "Content-Type:application/json" -u CM_username:password \
-d '{ "items": ["regionserver-xxxxxxx","master-xxxx"] }' \
'http://localhost:7180/api/v1/clusters/Cluster%201/services/hbase/roleCommands/restart'
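The same restart request can be assembled with the Python 3 standard library instead of curl. This is a sketch under assumptions, not a definitive client: the function name build_restart_request and the credentials are placeholders, and the request is only built, not sent, so it can be inspected safely.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_restart_request(base_url, cluster, service, role_names, user, password):
    """Build (but do not send) the POST to roleCommands/restart."""
    url = "%s/clusters/%s/services/%s/roleCommands/restart" % (
        base_url, urllib.parse.quote(cluster, safe=""), service)
    # HTTP basic auth, equivalent to curl's -u user:password.
    token = base64.b64encode(
        ("%s:%s" % (user, password)).encode("utf-8")).decode("ascii")
    body = json.dumps({"items": list(role_names)}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json",
                 "Authorization": "Basic " + token})


req = build_restart_request(
    "http://localhost:7180/api/v1", "Cluster 1", "hbase",
    ["regionserver-xxxxxxx", "master-xxxx"], "username", "password")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req, timeout=30)
```

Keeping construction separate from sending makes it easy to log or check the exact URL and body before anything is restarted.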
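The status of other CDH services can be monitored, and the services started and stopped, in the same way.
Appendix:
- CM REST API documentation: http://cloudera.github.io/cm_api/apidocs/v1/index.html
- API usage examples: http://cloudera.github.io/cm_api/apidocs/v1/tutorial.html
- Source of the script that monitors the HBase service and restarts it automatically: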
import json
import subprocess
import time

# Placeholders: replace with the real CM credentials and host.
AUTH = "username:password"
BASE = "http://ip:7180/api/v1/clusters/Cluster%201/services/hbase"
# The original mixed /home/dqrcsc/check and /home/csc/check; one path is used here.
LOG = "/home/csc/check/hbase.logs"
stime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())

def log(msg):
    # Append one timestamped line to the log file.
    with open(LOG, "a") as f:
        f.write("%s %s\n" % (stime, msg))

log("check...")
# List all HBase roles; check_output waits until curl has finished.
out = subprocess.check_output(["curl", "-s", "-u", AUTH, BASE + "/roles"])
for item in json.loads(out.decode("utf-8"))["items"]:
    if item["healthSummary"] == "BAD":
        print(item["name"])
        print(item["type"])
        print(item["roleState"])
        print(item["healthSummary"])
        # Ask CM to restart only this role.
        subprocess.check_call(["curl", "-s", "-X", "POST",
                               "-H", "Content-Type:application/json", "-u", AUTH,
                               "-d", json.dumps({"items": [item["name"]]}),
                               BASE + "/roleCommands/restart"])
        log("%s %s restart" % (item["type"], item["name"]))
- Description from the official documentation of how CDH service startup is managed:
In a non-Cloudera Manager managed cluster, you most likely start a role instance using an init script, for example, service hadoop-hdfs-datanode start.
Cloudera Manager does not use init scripts for the daemons it manages; in a Cloudera Manager managed cluster, starting and stopping services using init scripts will not work.
In a Cloudera Manager managed cluster you can only start or stop services via Cloudera Manager. Cloudera Manager uses an open source process management tool called supervisord, which takes care of redirecting log files, notifying of process failure, setting the effective user ID of the calling process to the right user, and so on.
Cloudera Manager supports automatically restarting a crashed process. It will also flag a role instance with a bad state if it crashes repeatedly right after start up.