How to Monitor and Alert on CDH Cluster Services

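As is well known, the difficulty of operating a Hadoop-based big data platform lies in its many ecosystem components, the complex interactions among them, and the effort it takes to troubleshoot and fix problems. CDH (Cloudera's Distribution Including Apache Hadoop) eases this: its Cloudera Manager component makes it convenient to deploy and manage a Hadoop cluster.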

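Cloudera officially provides Java and Python API clients for Cloudera Manager so that users can query and operate on the state and resources of a CDH cluster. Since we run CDH 6.0.1 in production, this article uses CDH 6.0.1 as the example. It shows how to write a script with the Python 2.7 version of the Cloudera Manager API that monitors the health of each service in the cluster, and then how to send alerts through Zabbix.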

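1. Preparation

To call the Python 2.7 version of the Cloudera Manager API, install the required modules on the host where the monitoring script will ultimately be deployed, that is, the host running the zabbix agent. The following commands install them: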

pip install six
pip install urllib3
pip install certifi
pip install cm-client

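Our CDH cluster was deployed with Auto-TLS enabled, which means that all web traffic uses HTTPS and that a certificate manager is configured on the Cloudera Manager Server host. The certificate manager acts as the certificate authority (CA); in practice it is an executable that can generate certificates. It generated a certificate directory for every host of the initially installed cluster, under /var/lib/cloudera-scm-server/certmanager/hosts-key-store on the Cloudera Manager Server host. For the monitoring script run by the zabbix agent to be allowed to call the Cloudera Manager RESTful API, the agent's host must be certified by the certificate manager. When adding that host to the cluster, run the following command in the /var/lib/cloudera-scm-server/certmanager directory on the Cloudera Manager Server host: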

./generate_host_cert master.cdh.cluster.com

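Here master.cdh.cluster.com is the fully qualified domain name of that host, and the command generates the certificate directory for it. When the Python script later calls the cm_client API, its security configuration must reference the cm-auto-host_cert_chain.pem file in that directory (copy the file to the zabbix agent host first). The relevant code is as follows: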

import cm_client

# Username and password of the CM admin account
user = "admin"
pwd = "admin"
# Configure HTTPS authentication
cm_client.configuration.username = user
cm_client.configuration.password = pwd
cm_client.configuration.verify_ssl = True
# Path of truststore file in PEM
cm_client.configuration.ssl_ca_cert = '/var/lib/cloudera-scm-server/certmanager/hosts-key-store/master.cdh.cluster.com/cm-auto-host_cert_chain.pem'
# the host address url where Cloudera Manager is
api_host = 'https://master.cdh.cluster.com'
# the Cloudera Manager Server port
port = '7183'
# API version
api_version = 'v30'
# Construct base URL for API: https://cmhost:7183/api/v30
api_url = api_host + ':' + port + '/api/' + api_version
api_client = cm_client.ApiClient(api_url)
cluster_api_instance = cm_client.ClustersResourceApi(api_client)
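Before the client is used, two things in the configuration above are worth checking: that the base URL is assembled as intended, and that the copied PEM file is actually present and readable on the zabbix agent host. The following is a minimal sketch of such a pre-flight check. The helper names build_api_url and check_ca_file are hypothetical and are not part of cm_client.

```python
import os


def build_api_url(host, port, version):
    """Assemble the base URL, e.g. https://cmhost:7183/api/v30."""
    return "https://{0}:{1}/api/{2}".format(host, port, version)


def check_ca_file(path):
    """Return an error message if the PEM file is unusable, otherwise None."""
    if not os.path.isfile(path):
        return "CA chain file not found: " + path
    if not os.access(path, os.R_OK):
        return "CA chain file not readable by the current user: " + path
    return None


api_url = build_api_url("master.cdh.cluster.com", "7183", "v30")
print(api_url)

# Hypothetical path used only for illustration; it does not exist here.
problem = check_ca_file("/nonexistent/cm-auto-host_cert_chain.pem")
print(problem)
```

Running the check as the same user as the zabbix agent helps surface a missing or unreadable certificate file before any HTTPS request is made.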

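2. Completing the Script

A CDH service commonly shows one of four health states: GOOD, CONCERNING, BAD, and DISABLED, which appear in the Cloudera Manager web UI as green, yellow, red, and gray respectively. As long as resources are stable and sufficient, a service can recover from BAD to GOOD on its own. There is therefore no need to alert every time a service is seen in the BAD state; instead, the alert can be sent only after BAD has been detected 5 times in a row.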
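The "alert only after several consecutive detections" rule can be sketched in isolation, without Cloudera Manager or files. This is a simplified illustration that mirrors the counting used by the script in this section: each run records how many services matched, and once the window holds the required number of runs it is evaluated and cleared. The function name update_window is hypothetical.

```python
def update_window(window, matches, threshold):
    """Add this run's match count to the window.

    Returns (new_window, alert). The window is evaluated and reset as soon
    as it holds `threshold` runs; alert is True only if every run in it
    found at least one matching service.
    """
    window = window + [matches]
    if len(window) < threshold:
        return window, False
    alert = all(n > 0 for n in window)
    return [], alert


# Simulate six runs with a threshold of 3.
window = []
alerts = []
for matches in [1, 2, 0, 1, 1, 1]:
    window, alert = update_window(window, matches, 3)
    alerts.append(alert)
print(alerts)
```

Note that the window is cleared every `threshold` runs rather than sliding, so a BAD streak that straddles two windows is only reported once a whole window consists of non-zero counts.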

The complete script is as follows:

import cm_client
import sys

reload(sys)
sys.setdefaultencoding('utf-8')

# Health summary values that can be monitored: GOOD, CONCERNING, BAD, DISABLED
# Usage: python -W ignore cdhServicesMonitor.py BAD 5
if len(sys.argv) < 3:
    print("Missing parameters: expected <health status> <consecutive count>.")
    print("e.g. python -W ignore cdhServicesMonitor.py BAD 5")
    sys.exit(1)
user = "admin"
pwd = "admin"
# which status to monitor
status = sys.argv[1]
count = int(sys.argv[2])
# Configure HTTPS authentication
cm_client.configuration.username = user
cm_client.configuration.password = pwd
cm_client.configuration.verify_ssl = True
# Path of truststore file in PEM
cm_client.configuration.ssl_ca_cert = '/var/lib/cloudera-scm-server/certmanager/hosts-key-store/master.cdh.cluster.com/cm-auto-host_cert_chain.pem'
# the host address url where Cloudera Manager is
api_host = 'https://master.cdh.cluster.com'
# the Cloudera Manager Server port
port = '7183'
# API version
api_version = 'v30'
# Construct base URL for API: https://cmhost:7183/api/v30
api_url = api_host + ':' + port + '/api/' + api_version
api_client = cm_client.ApiClient(api_url)
cluster_api_instance = cm_client.ClustersResourceApi(api_client)
# Lists all known clusters.
api_response = cluster_api_instance.read_clusters(view='SUMMARY')
# Relative paths resolve against the working directory of the process that
# runs the script; consider absolute paths writable by the zabbix user.
count_file_name = "bad_count.txt"
content_file_name = "bad_content.txt"
# Truncate the details file left over from the previous run
open(content_file_name, 'w').close()
service_count = 0
services_api_instance = cm_client.ServicesResourceApi(api_client)
for cluster in api_response.items:
    # Fetch every service of this cluster together with its health summary
    services = services_api_instance.read_services(cluster.name, view='FULL')
    for service in services.items:
        if service.health_summary == status:
            outText = """
service: {0}
state: {1}
healthSummary: {2}
-------------------------""".format(service.type, service.service_state, service.health_summary)
            service_count = service_count + 1
            # Append the details of each matching service
            content_count_writer = open(content_file_name, 'a')
            content_count_writer.write(outText)
            content_count_writer.close()

# Record how many services matched in this run (0 included), one number per line
bad_count_writer = open(count_file_name, 'a')
bad_count_writer.write(str(service_count) + "\n")
bad_count_writer.close()

bad_counts = open(count_file_name).readlines()
if len(bad_counts) >= count:
    # Print the details only if every recorded run found a matching service
    if '0\n' not in bad_counts:
        sys.stdout.write(open(content_file_name).read())
    # Start a new window either way
    open(count_file_name, 'w').close()
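The filtering and formatting step of the script can be exercised without a live cluster. The sketch below uses a hypothetical FakeService namedtuple as a stand-in for the service objects returned by read_services; it only carries the three attributes the script reads, and the sample values are made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for the objects returned by read_services();
# only the attributes used by the monitoring script are modelled.
FakeService = namedtuple("FakeService", ["type", "service_state", "health_summary"])


def format_matching(services, status):
    """Return (number of matching services, report text) for one status."""
    blocks = []
    for service in services:
        if service.health_summary == status:
            blocks.append(
                "\nservice: {0}\nstate: {1}\nhealthSummary: {2}\n"
                "-------------------------".format(
                    service.type, service.service_state, service.health_summary))
    return len(blocks), "".join(blocks)


# Made-up sample data
sample = [
    FakeService("HDFS", "STARTED", "GOOD"),
    FakeService("HBASE", "STARTED", "BAD"),
    FakeService("KAFKA", "STOPPED", "DISABLED"),
]
matched, report = format_matching(sample, "BAD")
print(matched)
print(report)
```

Keeping this step in a function of its own makes it possible to check the report text that Zabbix will later search, independently of the API calls.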

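3. Configuring Zabbix Monitoring and Alerting

  • Configure the item

Define the item in the UserParameter setting of the zabbix agent configuration (zabbix_agentd.conf, or any .conf file under the zabbix_agentd.d directory), for example: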

UserParameter=cdh.service.state[*],python -W ignore /etc/zabbix/script/cdhServicesMonitor.py $1 $2

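Here * means the item key accepts any number of parameters. In the Zabbix web interface the parameters are separated by commas, as in cdh.service.state[BAD,5]. $1 and $2 stand for the first and second parameters passed to the script, BAD and 5 in this case, which are the two arguments the script above expects.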
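The mapping from the item key to the command line can be illustrated with a small simulation. This is only a simplified sketch of the substitution described above, not Zabbix's actual parser: it ignores quoting and nested brackets, and the function name expand_user_parameter is hypothetical.

```python
import re


def expand_user_parameter(command_template, item_key):
    """Replace $1..$9 in the command with the parameters from key[...]."""
    match = re.match(r'^[^\[\]]+\[(.*)\]$', item_key)
    params = [p.strip() for p in match.group(1).split(',')] if match else []

    def substitute(m):
        index = int(m.group(1))
        # Positions without a supplied parameter become empty strings
        return params[index - 1] if index <= len(params) else ''

    return re.sub(r'\$([1-9])', substitute, command_template)


command = expand_user_parameter(
    "python -W ignore /etc/zabbix/script/cdhServicesMonitor.py $1 $2",
    "cdh.service.state[BAD,5]")
print(command)
```

With the key cdh.service.state[BAD,5], the resulting command passes BAD and 5 as sys.argv[1] and sys.argv[2] to the script.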

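  • Configure the trigger

Once the item is configured, create the trigger for it. A trigger assigns a severity to the value collected by the item (that is, the output of the script the item runs). In the Zabbix web interface you give the trigger an expression and the severity to apply when the expression holds. In other words, you state which condition the item value must satisfy after a function is applied to it, and how severe the resulting alert is. For example, the expression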

{192.168.202.51:cdh.service.state[BAD,5].str(BAD)}=1

holds when the value of the item cdh.service.state[BAD,5] (the text printed by the script above) contains the string BAD.
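What the expression checks can be mimicked in a few lines. The sketch below is only an illustration of the substring test, not Zabbix code; str_function is a hypothetical name, and the sample text imitates what the monitoring script prints for one BAD service.

```python
def str_function(last_value, pattern):
    """Mimic str(pattern): 1 if pattern occurs in the latest value, else 0."""
    return 1 if pattern in last_value else 0


# Imitation of the script output for one unhealthy service (made-up values)
sample_output = (
    "\nservice: HBASE\nstate: STARTED\nhealthSummary: BAD\n"
    "-------------------------")

fires_on_bad = str_function(sample_output, "BAD") == 1
fires_on_empty = str_function("", "BAD") == 1
print(fires_on_bad)    # True: the expression holds
print(fires_on_empty)  # False: the script printed nothing, so no alert
```

If the item is configured with a different status, for example cdh.service.state[CONCERNING,5], the pattern in the trigger expression would need to change to match.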

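  • Configure the action

The last step is to create an alert action for the trigger, which decides whether to send an SMS, a WeChat group message, an email, and so on when the trigger's expression holds at a given severity. The media type and the trigger severity (Disaster, High, Average, Warning, and so on) can be configured to suit your environment.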
