NCPA is the monitoring agent that Nagios has released in recent years. It has matured steadily and is intended to replace the aging NRPE.
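First, the strengths of Nagios:
1. It is the de facto industry standard for monitoring and has focused on alerting for nearly twenty years (it was born in 1999). As the saying in the industry goes, the shadow of Nagios stands behind every monitoring system.
2. Good design does not go out of date, and Nagios is designed without a database. Compared with the bloat of Zabbix, Nagios is a model of the Unix philosophy: do one thing and do it well. With no database in the design, no database can hold it back.
3. It is written in C and performs very well. Before Nagios 4.0 it used a model similar to Apache prefork, which limited performance for a time, although until event-driven models appeared that was still the best option available. Since 4.0 it uses an event model similar to nginx's, which brings a qualitative performance gain at a very small memory cost; 10k+ is not a problem.
4. It has an excellent and very flexible plugin mechanism. Over more than ten years Nagios has accumulated a huge number of community-contributed plugins, and writing your own plugin is easy.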
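Where NCPA improves on NRPE:
1. It supports passive monitoring, i.e. NCPA reports to Nagios on its own initiative (via NRDP).
2. Like SNMP, NCPA needs almost no configuration and ships with basic checks such as CPU, memory, services and processes. NRPE requires defining a pile of checks on the client and then defining them again on the Nagios server, which is tedious.
3. Existing Nagios plugins keep working.
4. With a little scripting, the Nagios server can scan for NCPA clients with nmap and add basic monitoring automatically.
5. Apart from Python 2.7 it has no environment dependencies and does not intrude on the system.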
This article describes active monitoring based on nagios + ncpa, as a replacement for nrpe.
Environment
Server: CentOS 7 + nagios 4, IP: 192.168.1.200
Client: CentOS 7 + ncpa 2.0.6, IP: 192.168.1.50
Client configuration
1. Install ncpa
rpm -ivh https://assets.nagios.com/downloads/ncpa/ncpa-2.0.6.el7.x86_64.rpm
2. Start the ncpa services
/etc/init.d/ncpa_passive start
/etc/init.d/ncpa_listener start
chkconfig ncpa_listener on
chkconfig ncpa_passive on
3. Open firewall port 5693 on the client
iptables -A INPUT -p tcp --dport 5693 -j ACCEPT
or, to accept connections from the Nagios server only:
iptables -A INPUT -s 192.168.1.200 -p tcp --dport 5693 -j ACCEPT
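To confirm from the Nagios server that the port is actually reachable, a quick TCP probe is enough. The following is a minimal sketch, assuming bash (for its /dev/tcp pseudo-device) and the coreutils timeout command are available; the function name port_open is made up for illustration.

```shell
#!/usr/bin/env bash
# port_open HOST PORT [TIMEOUT_SECONDS]
# Returns 0 if a TCP connection can be opened, non-zero otherwise.
# Hypothetical helper: relies on bash's /dev/tcp and coreutils' timeout.
port_open() {
    local host=$1 port=$2 wait=${3:-2}
    timeout "$wait" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Example, run on the Nagios server (192.168.1.200):
#   port_open 192.168.1.50 5693 && echo "ncpa reachable" || echo "ncpa not reachable"
```

A successful connection only shows that the firewall lets the traffic through and that something is listening; it says nothing about whether the token is accepted.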
Server configuration
Install nagios (abridged)
yum install epel-release -y
yum install nagios httpd php php-pecl-zendopcache fping nmap -y
systemctl enable httpd nagios
systemctl start httpd nagios
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
mkdir -p /etc/nagios/bin
mkdir -p /etc/nagios/hosts
mkdir -p /etc/nagios/services
mkdir -p /etc/nagios/template
echo "cfg_dir=/etc/nagios/hosts" >> /etc/nagios/nagios.cfg
echo "cfg_dir=/etc/nagios/services" >> /etc/nagios/nagios.cfg
service nagios restart
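I. Host auto-discovery
Auto-discovery here means scanning the LAN with a scanner:
1. If an IP is already monitored, skip it.
2. If an IP is new, create a configuration file for it from a fixed template and notify the administrator.
3. If an IP that was discovered later disappears, nagios raises an alert and notifies the administrator.
This closes the loop on IP management for the LAN.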
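The three cases come down to comparing the set of hosts that answered the scan with the set of hosts that already have a configuration file. Here is a minimal sketch of that comparison, assuming one <ip>.cfg file per host in a directory; the function name classify_hosts and the output labels are illustrative and not part of nagios. In the real setup the third case is covered by nagios's own host check rather than by the scanner.

```shell
#!/usr/bin/env bash
# classify_hosts CONFIG_DIR < list-of-alive-ips
# Prints "known IP", "new IP" or "missing IP", one per line.
# Hypothetical helper for illustration only.
classify_hosts() {
    local dir=$1 host cfg seen
    seen=$(mktemp)
    while read -r host; do
        [ -n "$host" ] || continue
        echo "$host" >> "$seen"
        if [ -f "$dir/$host.cfg" ]; then
            echo "known $host"      # case 1: already monitored
        else
            echo "new $host"        # case 2: needs a config file
        fi
    done
    # case 3: configured hosts that did not answer the scan
    for cfg in "$dir"/*.cfg; do
        [ -e "$cfg" ] || continue
        host=$(basename "$cfg" .cfg)
        grep -qxF "$host" "$seen" || echo "missing $host"
    done
    rm -f "$seen"
}

# Example: fping -a -q -g 192.168.1.0/24 | classify_hosts /etc/nagios/hosts
```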
Configuring host auto-discovery with fping
Create the host template file /etc/nagios/template/host.cfg with the following content:
define host {
    host_name               HOST
    address                 HOST
    check_command           check-host-alive
    max_check_attempts      3
    check_interval          5
    retry_interval          1
    check_period            24x7
    contacts                nagiosadmin
    notification_interval   60
    notification_period     24x7
    notifications_enabled   1
}
Create the script /etc/nagios/bin/find-hosts.sh with the following content:
#!/usr/bin/env bash
# Install fping if it is missing
if [ ! -f /usr/sbin/fping ]; then
    yum install fping -y
fi

network=$1

echo_usage() {
    echo -e "\e[1;31mUsage: $0 [network] \e[0m"
    echo -e "example: \e[1;32m $0 192.168.0.0/24 \e[0m"
    echo
    exit 3
}

if [ -z "$network" ]; then
    echo_usage
fi

dir=/etc/nagios/hosts
host_template=/etc/nagios/template/host.cfg
result=$(mktemp /tmp/fping-XXXXXX)
mkdir -p "$dir"

# -a: print alive hosts only, -g: generate the target list from the network
fping -a -q -g "$network" > "$result"

i=0
while read -r host; do
    # Only hosts without a config file are new
    if [ ! -f "$dir/$host.cfg" ]; then
        echo "new host found $host"
        #mailx -s "new host found :$host" root@localhost
        sed "s/HOST/$host/g" "$host_template" > "$dir/$host.cfg"
        i=$((i + 1))
    fi
done < "$result"
rm -f "$result"

if [ "$i" -eq 0 ]; then
    echo "no new host found"
    exit 0
fi

# Restart nagios only if the new configuration validates
if nagios -v /etc/nagios/nagios.cfg | grep -q "Things look okay"; then
    echo "nagios configuration is OK"
    sleep 1
    service nagios restart
    echo "nagios restart successfully"
else
    echo "nagios restart failed. please check"
    exit 1
fi
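Run this script from a scheduled job (cron) to add host monitoring automatically. The script can also be modified to email the administrator every time a new machine is found.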
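One way to schedule it is a file under /etc/cron.d. The sketch below only prints the cron line; the 10-minute interval, the lock file path, the cron file name and the function name cron_entry are all assumptions chosen for illustration. flock -n makes a run exit immediately if the previous scan is still in progress.

```shell
#!/usr/bin/env bash
# cron_entry NETWORK [MINUTES]
# Prints an /etc/cron.d style line (with a user field) that runs
# find-hosts.sh every MINUTES minutes. Hypothetical helper.
cron_entry() {
    local network=$1 minutes=${2:-10}
    printf '*/%s * * * * root flock -n /var/lock/find-hosts.lock /etc/nagios/bin/find-hosts.sh %s\n' \
        "$minutes" "$network"
}

# Example:
#   cron_entry 192.168.1.0/24 10 > /etc/cron.d/nagios-find-hosts
```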
II. Service auto-discovery
Service auto-discovery with nmap + check_ncpa
1. Download check_ncpa
wget https://assets.nagios.com/downloads/ncpa/check_ncpa.tar.gz
tar zxvf check_ncpa.tar.gz
cp check_ncpa.py /usr/lib64/nagios/plugins/
cp check_ncpa.py /usr/bin/
2. Configure check_ncpa
Create the file /etc/nagios/conf.d/check_ncpa.cfg with the following content:
# 'check_ncpa' command definition
define command{
    command_name    check_ncpa
    command_line    $USER1$/check_ncpa.py -H $HOSTADDRESS$ -P 5693 -t mytoken $ARG1$
}
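In a service definition, everything after the "!" in check_command is passed to this command as $ARG1$. When a check misbehaves it helps to see the exact command line nagios will run. The sketch below mimics that expansion for the single-argument form only; the function name expand_check_ncpa is hypothetical, and the default plugin directory is an assumption about what $USER1$ points to on this installation.

```shell
#!/usr/bin/env bash
# expand_check_ncpa CHECK_COMMAND ADDRESS [PLUGIN_DIR]
# Prints the command line that the check_ncpa command definition produces.
# Handles only one "!"-separated argument ($ARG1$); illustration only.
expand_check_ncpa() {
    local check_command=$1 address=$2 plugin_dir=${3:-/usr/lib64/nagios/plugins}
    local arg1=${check_command#*!}   # text after the first "!"
    printf '%s/check_ncpa.py -H %s -P 5693 -t mytoken %s\n' \
        "$plugin_dir" "$address" "$arg1"
}

# Example:
#   expand_check_ncpa 'check_ncpa!-M cpu/percent -w 50 -c 80' 192.168.1.50
```

The printed line can be pasted into a shell on the Nagios server to reproduce the check by hand.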
3. Test check_ncpa.py
python check_ncpa.py -H 192.168.1.50 -P 5693 -t mytoken -l
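When testing by hand, remember that a Nagios plugin reports its result through its exit code as well as its output: 0 is OK, 1 is WARNING, 2 is CRITICAL and 3 is UNKNOWN. A small helper, with the made-up name nagios_status, makes that visible; treating every other code as UNKNOWN is a simplification made here.

```shell
#!/usr/bin/env bash
# nagios_status EXIT_CODE
# Translates a plugin exit code into the Nagios state name.
# Hypothetical helper; codes other than 0-2 are reported as UNKNOWN.
nagios_status() {
    case $1 in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;
    esac
}

# Example:
#   python check_ncpa.py -H 192.168.1.50 -P 5693 -t mytoken -M cpu/percent -w 50 -c 80
#   nagios_status $?
```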
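4. Create the service discovery templates
Routine checks fall into two groups: basic ones such as CPU, swap, load and disk, and services such as nginx.
Create the file /etc/nagios/template/ncpa-service.cfg with the following content: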
define service {
    host_name               HOST
    service_description     SERVICE
    check_command           check_ncpa!-M service/SERVICE
    max_check_attempts      3
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contacts                nagiosadmin
}
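HOST and SERVICE in this template are placeholders that get filled in with sed. A minimal sketch of that substitution follows; the function name render_service is illustrative. The match is case-sensitive, so lowercase keywords such as host_name and service_description are left alone.

```shell
#!/usr/bin/env bash
# render_service HOST SERVICE TEMPLATE_FILE
# Prints the template with the HOST and SERVICE placeholders filled in.
# Hypothetical helper; values containing "/" or "&" would need escaping
# before being handed to sed.
render_service() {
    local host=$1 service=$2 template=$3
    sed -e "s/HOST/$host/g" -e "s/SERVICE/$service/g" "$template"
}

# Example:
#   render_service 192.168.1.50 sshd /etc/nagios/template/ncpa-service.cfg \
#       >> /etc/nagios/services/192.168.1.50.cfg
```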
Create the file /etc/nagios/template/ncpa-basic.cfg with the following content:
# Monitor uptime to catch unexpected reboots
define service {
    host_name               HOST
    service_description     system uptime
    check_command           check_ncpa!-M system/uptime -w @60:120 -c @1:60
    max_check_attempts      3
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contacts                nagiosadmin
}
# Monitor CPU usage
define service {
    host_name               HOST
    service_description     CPU Usage
    check_command           check_ncpa!-M cpu/percent -w 50 -c 80 -q 'aggregate=avg'
    max_check_attempts      3
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contacts                nagiosadmin
}
# Monitor swap
define service {
    host_name               HOST
    service_description     swap Usage
    check_command           check_ncpa!-M memory/swap -w 512 -c 1024 -u mb
    max_check_attempts      3
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contacts                nagiosadmin
}
# Monitor the total number of processes
define service {
    host_name               HOST
    service_description     Process Count
    check_command           check_ncpa!-M processes -w 500 -c 1000
    max_check_attempts      3
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contacts                nagiosadmin
}
# Monitor disk space
define service {
    host_name               HOST
    service_description     Disk Usage
    check_command           check_ncpa!-M 'plugins/check_disk' -a "-w 20 -c 10 --local"
    max_check_attempts      3
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contacts                nagiosadmin
}
# Monitor system load
define service {
    host_name               HOST
    service_description     Load average
    check_command           check_ncpa!-M 'plugins/check_load' -a "-w 8,4,4 -c 12,8,8"
    max_check_attempts      3
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contacts                nagiosadmin
}
# Monitor zombie processes
define service {
    host_name               HOST
    service_description     Zombie Processes
    check_command           check_ncpa!-M 'plugins/check_procs' -a "-w 3 -c 5 -s Z"
    max_check_attempts      3
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contacts                nagiosadmin
}
Create the auto-discovery script /etc/nagios/bin/find-ncpa.sh with the following content:
#!/usr/bin/env bash
# Install nmap if it is missing
if [ ! -f /usr/bin/nmap ]; then
    yum install nmap -y
fi

network=$1

usage() {
    echo -e "\e[1;31mUsage: $0 [ip|ip-range|network] \e[0m"
    echo -e "example1: \e[1;32m $0 192.168.0.100 \e[0m"
    echo -e "example2: \e[1;32m $0 192.168.0.1-200 \e[0m"
    echo -e "example3: \e[1;32m $0 192.168.2.0/24 \e[0m"
    echo
    exit 3
}

if [ -z "$network" ]; then
    usage
fi

dir="/etc/nagios/services"
ncpa_basic_template="/etc/nagios/template/ncpa-basic.cfg"
ncpa_service_template="/etc/nagios/template/ncpa-service.cfg"

# -n skips DNS resolution so that field 5 of the report line is always the IP
nmap -n -sS -p 5693 --open "$network" | awk '/Nmap scan report for/{print $5}' > /tmp/ncpa_hosts.txt

while read -r host; do
    if [ ! -f "$dir/$host.cfg" ]; then
        # Basic checks first
        sed "s/HOST/$host/g" "$ncpa_basic_template" >> "$dir/$host.cfg"
        # Then one check per running service, skipping names that contain "@" or "systemd"
        /usr/bin/check_ncpa.py -H "$host" -t mytoken -M services -l | awk '/running/{print $1}' | tr -d '":' | grep -Ev "@|systemd" > "/tmp/$host.servicelist.txt"
        while read -r service; do
            sed -e "s/HOST/$host/g" -e "s/SERVICE/$service/g" "$ncpa_service_template" >> "$dir/$host.cfg"
        done < "/tmp/$host.servicelist.txt"
        rm -f "/tmp/$host.servicelist.txt"
    fi
done < /tmp/ncpa_hosts.txt
rm -f /tmp/ncpa_hosts.txt

# Restart nagios only if the new configuration validates
if nagios -v /etc/nagios/nagios.cfg | grep -q "Things look okay"; then
    echo "nagios configuration is OK"
    sleep 1
    service nagios restart
    echo "nagios restart successfully"
else
    echo "nagios restart failed. please check"
    exit 1
fi
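The service-name extraction in the middle of the script is the part most likely to need adjusting, so it is worth trying in isolation. The sketch below wraps the same pipeline in a function with the made-up name running_services. It assumes that check_ncpa.py -M services -l prints one indented "name": "state" pair per line; that layout is inferred from the pipeline itself, not taken from the NCPA documentation.

```shell
#!/usr/bin/env bash
# running_services < output of "check_ncpa.py ... -M services -l"
# Prints the names of running services, one per line, skipping names
# that contain "@" or "systemd". Hypothetical helper for illustration.
running_services() {
    awk '/running/ {print $1}' | tr -d '":' | grep -Ev '@|systemd'
}

# Example with input in the assumed format:
#   printf '    "sshd": "running",\n    "postfix": "stopped",\n' | running_services
```

Because the filter matches on the word "running" anywhere in the line, a service whose name itself contains "running" would also be picked up; that is acceptable for a first pass but worth knowing.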
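Business monitoring
Auto-discovery reduces the workload considerably, but business-specific checks still have to be added by hand.
For example, to monitor whether nginx has been restarted (i.e. whether its process has been running for more than 1800 seconds):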
# Monitor how long the process has been running
define service {
    host_name               HOST
    service_description     nginx uptime
    check_command           check_ncpa!-M plugins/check_procs -a "-a nginx -m ELAPSED -w @1800:3600 -c @1:1800"
    max_check_attempts      3
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contacts                nagiosadmin
}
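With a dynamic process model such as php-fpm, a master process is started as root while the child processes belong to an ordinary user and their number varies. It is therefore enough to monitor how long the master process has been running, following the same pattern: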
# Monitor php-fpm
define service {
    host_name               HOST
    service_description     php-fpm uptime
    check_command           check_ncpa!-M plugins/check_procs -a "-u root -a php-fpm -m ELAPSED -w @1800:3600 -c @1:1800"
    max_check_attempts      3
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contacts                nagiosadmin
}
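The thresholds in these checks use the Nagios plugin range notation, where a leading "@" means "alert when the value is inside the range, endpoints included". So -c @1:1800 is critical for a process younger than 30 minutes and -w @1800:3600 warns until it is an hour old. The sketch below models how that reads for a single elapsed-time value; it illustrates the notation and is not how check_procs is implemented, and both function names are made up.

```shell
#!/usr/bin/env bash
# inside_range VALUE START END -> exit 0 if START <= VALUE <= END
inside_range() {
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# elapsed_state SECONDS
# State for "-w @1800:3600 -c @1:1800"; critical is evaluated first.
elapsed_state() {
    if inside_range "$1" 1 1800; then
        echo CRITICAL
    elif inside_range "$1" 1800 3600; then
        echo WARNING
    else
        echo OK
    fi
}

# Example: elapsed_state 2400
```

The same reading applies to the uptime check in ncpa-basic.cfg, where -w @60:120 -c @1:60 flags a machine that booted within the last two minutes.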