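Our production Hadoop cluster was recently upgraded from MRv1 to MRv2, so the monitoring templates had to change as well.
The production cluster has about 200 machines. Monitoring modules are added by linking: one parent template links a large number of module templates, and hosts are then added to that parent template.
To add NodeManager (NM) monitoring, I tried to link the new template the same way, but linking it in the web UI kept returning a blank page.
To get the monitoring online quickly, I switched approach and used the host.update API to link the hosts directly to the NM template.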
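The host.update workaround can be sketched roughly as follows. This is a minimal illustration and not the script that was actually run: the helper name build_host_update and the IDs are made up, and it assumes the same JSON-RPC layout (with the auth token in the request body) that the test script in this post uses. As far as I know, the templates parameter of host.update replaces the host's currently linked templates, so the full list should be passed.

```python
import json


def build_host_update(hostid, template_ids, auth, request_id=1):
    """Build a JSON-RPC 2.0 request body for the Zabbix host.update method."""
    return {
        "jsonrpc": "2.0",
        "method": "host.update",
        "params": {
            "hostid": hostid,
            # Assumption: this list replaces the host's current template links.
            "templates": [{"templateid": tid} for tid in template_ids],
        },
        "auth": auth,
        "id": request_id,
    }


if __name__ == "__main__":
    # Placeholder IDs and token, for illustration only.
    payload = build_host_update("10001", ["10002"], "<auth token>")
    print(json.dumps(payload, indent=2))
```

The resulting dictionary would be serialized and POSTed to api_jsonrpc.php once per host.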
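Looking back at the problem:
Linking a template in the web UI also goes through the Zabbix template API (specifically, the template.update method).
So I called the API directly from a script to test it.
Test script: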
#!/usr/bin/env python
# Python 2 script: link a list of templates to a parent template via template.update.
import urllib2
import sys
import json


def requestJason(url, values):
    # Send one JSON-RPC request and return the decoded response.
    data = json.dumps(values)
    print data
    req = urllib2.Request(url, data, {'Content-Type': 'application/json-rpc'})
    response = urllib2.urlopen(req)
    data_get = response.read()
    output = json.loads(data_get)
    print output
    try:
        message = output['result']
    except KeyError:
        # The API reported an error: print its details and stop.
        print json.dumps(output['error']['data'])
        sys.exit(1)
    print json.dumps(message)
    return output


def authenticate(url, username, password):
    # Log in and return the auth token used by later calls.
    values = {'jsonrpc': '2.0',
              'method': 'user.login',
              'params': {
                  'user': username,
                  'password': password
              },
              'id': '0'
              }
    idvalue = requestJason(url, values)
    return idvalue['result']


def getTemplate(hostname, url, auth):
    # Look up a template by name and return its ID.
    values = {'jsonrpc': '2.0',
              'method': 'template.get',
              'params': {
                  'output': "extend",
                  'filter': {
                      'host': hostname
                  }
              },
              'auth': auth,
              'id': '2'
              }
    output = requestJason(url, values)
    print output['result'][0]['hostid']
    return output['result'][0]['hostid']


def changeTemplate(idx, id_list, url, auth):
    # Link the templates in id_list to the template with ID idx.
    values = {'jsonrpc': '2.0',
              'method': 'template.update',
              'params': {
                  "templateid": idx,
                  "templates": id_list
              },
              'auth': auth,
              'id': '2'
              }
    output = requestJason(url, values)
    print output


def main():
    id_list = []
    hostname = "Vipshop_Template_OS_Linux_Hadoop_Datanode_Pro"
    url = 'xxxx'
    username = 'admin'
    password = 'xxxx'
    auth = authenticate(url, username, password)
    idx = getTemplate(hostname, url, auth)
    temlist = ['Vipshop_Template_LB_Tengine_8090', 'Vipshop_Template_Redis_6379',
               'Vipshop_Template_Redis_6380', 'Vipshop_Template_Redis_6381',
               'Vipshop_Template_Redis_6382', 'Vipshop_Template_Redis_6383']
    for tem in temlist:
        idtemp = getTemplate(tem, url, auth)
        id_list.append({"templateid": idtemp})
    print id_list
    #id_list = [{"templateid":'10843'},{"templateid":"10554"},{"templateid":"10467"},{"templateid":"10560"},{"templateid":"10566"},{"templateid":"10105"}]
    changeTemplate(idx, id_list, url, auth)


if __name__ == '__main__':
    main()
Script result:
urllib2.HTTPError: HTTP Error 500: Internal Server Error
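An uncaught HTTPError hides the response body, which sometimes carries a hint about what failed. Below is a small sketch, written in Python 3 rather than the Python 2 of the test script, of a request helper that returns the status code and body even when the server answers with an error. The name post_json_rpc and the injectable urlopen parameter are my own additions for illustration, not part of the original script.

```python
import json
import urllib.error
import urllib.request


def post_json_rpc(url, payload, timeout=30, urlopen=urllib.request.urlopen):
    """POST a JSON-RPC payload and return (status, body), even on HTTP errors."""
    data = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json-rpc"}
    )
    try:
        with urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # HTTPError doubles as a response object, so its body can still be read.
        return err.code, err.read().decode("utf-8", "replace")
```

With a helper like this, a 500 or 502 from api_jsonrpc.php shows up as a return value that can be logged next to the request, instead of a traceback.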
Since the API call is really just a POST request with a JSON body, I verified it manually with curl:
curl -vvv -i -X POST -H 'Content-Type:application/json' -d '{"params": {"templates": [{"templateid": "10117"}, {"templateid": "10132"}, {"templateid": "10133"}, {"templateid": "10134"}, {"templateid": "10135"}, {"templateid": "10136"}], "templateid": "10464"}, "jsonrpc": "2.0", "method": "template.update", "auth": "421a04b400e859834357b5681a586a5f", "id": "2"}' http://zabbix.idc.vipshop.com/api_jsonrpc.php
This also returned a 500 error, which means the PHP backend hit an error while handling the request. I adjusted the PHP configuration and set the log level to debug:
php-fpm.conf: log_level = debug
The following errors showed up in the error log:
[04-May-2014 14:04:32.115189] WARNING: pid 6270, fpm_request_check_timed_out(), line 271: [pool www] child 6294, script '/apps/svr/zabbix/wwwroot/api_jsonrpc.php' (request: "POST /api_jsonrpc.php") executing too slow (1.269946 sec), logging
[04-May-2014 14:04:32.115327] DEBUG: pid 6270, fpm_got_signal(), line 72: received SIGCHLD
[04-May-2014 14:04:32.115371] NOTICE: pid 6270, fpm_children_bury(), line 227: child 6294 stopped for tracing
[04-May-2014 14:04:32.115385] NOTICE: pid 6270, fpm_php_trace(), line 142: about to trace 6294
[04-May-2014 14:04:32.115835] NOTICE: pid 6270, fpm_php_trace(), line 170: finished trace of 6294
[04-May-2014 14:04:32.115874] DEBUG: pid 6270, fpm_event_loop(), line 409: event module triggered 1 events
[04-May-2014 14:04:35.318614] WARNING: pid 6270, fpm_stdio_child_said(), line 166: [pool www] child 6294 said into stderr: "NOTICE: sapi_cgi_log_message(), line 663: PHP message: PHP Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 512 bytes) in /apps/svr/zabbix/wwwroot/api/classes/CItem.php on line 1088"
[04-May-2014 14:04:35.318665] DEBUG: pid 6270, fpm_event_loop(), line 409: event module triggered 1 events
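In other words, linking templates requires holding the related data in PHP's memory. The default memory_limit is 128M (the 134217728 bytes in the log), which is easily exceeded when there are many items and hosts.
I changed it to: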
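As a quick check of the numbers in that fatal error, here is a small Python sketch that pulls the limit out of the log line and converts it to MiB. The function name parse_memory_exhausted is hypothetical; the regular expression only targets the message format seen in the log above.

```python
import re

# Matches the PHP "memory exhausted" fatal error message as it appears in the log.
FATAL_RE = re.compile(
    r"Allowed memory size of (\d+) bytes exhausted \(tried to allocate (\d+) bytes\)"
)


def parse_memory_exhausted(line):
    """Return the memory limit and failed allocation size, or None if no match."""
    match = FATAL_RE.search(line)
    if match is None:
        return None
    limit_bytes = int(match.group(1))
    return {
        "limit_bytes": limit_bytes,
        "limit_mib": limit_bytes / (1024 * 1024),
        "requested_bytes": int(match.group(2)),
    }


if __name__ == "__main__":
    sample = ("PHP Fatal error: Allowed memory size of 134217728 bytes exhausted "
              "(tried to allocate 512 bytes)")
    print(parse_memory_exhausted(sample))
```

For the line in this log, 134217728 bytes works out to exactly 128 MiB, which matches the default memory_limit.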
memory_limit = 1280M
Testing again returned a 502 Bad Gateway error, which means the backend request timed out.
error log:
[04-May-2014 14:50:21.318071] WARNING: pid 4131, fpm_request_check_timed_out(), line 281: [pool www] child 4147, script '/apps/svr/zabbix/wwwroot/api_jsonrpc.php' (request: "POST /api_jsonrpc.php") execution timed out (10.030883 sec), terminating
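The execution time exceeded the request_terminate_timeout setting, and php-fpm killed the worker, which produced the 502.
I changed request_terminate_timeout to 1800 (it had been set to 10s) and max_execution_time to 0 (the default is 30s), then tested again. This time it worked.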
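When scanning a long php-fpm error log for this kind of problem, it can help to extract just the slow-request and timeout warnings. The sketch below is one possible way to do that in Python; the function name parse_fpm_timing is hypothetical, and the pattern covers only the two message forms that appear in the logs in this post.

```python
import re

# Covers the two php-fpm warnings seen in this post:
#   "... executing too slow (1.269946 sec), logging"
#   "... execution timed out (10.030883 sec), terminating"
EVENT_RE = re.compile(
    r"(executing too slow|execution timed out) \((\d+(?:\.\d+)?) sec\)"
)


def parse_fpm_timing(line):
    """Return (kind, seconds) for a slow or timed-out request line, else None."""
    match = EVENT_RE.search(line)
    if match is None:
        return None
    kind = "timed_out" if match.group(1) == "execution timed out" else "slow"
    return kind, float(match.group(2))


if __name__ == "__main__":
    sample = ('(request: "POST /api_jsonrpc.php") execution timed out '
              "(10.030883 sec), terminating")
    print(parse_fpm_timing(sample))
```

A "timed_out" result close to the configured request_terminate_timeout is a strong sign that the 502 came from php-fpm terminating the worker.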
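Summary:
Zabbix is different from a typical online application: an update made through the API is a batch operation, so it places real demands on memory and execution time.
The relevant PHP parameters therefore need to be set sensibly, and when debugging it helps to lower the log level to debug and enable the slow log to make problems easier to locate.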