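This series walks through, step by step, how to monitor the state of a big-data platform cluster. Following the previous article on installing and deploying Nagios, this one covers the main Nagios configuration files, how Nagios operates, and how to monitor a ZooKeeper cluster, using that one example throughout.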
nagios/
├── bin
├── etc
│   └── objects
├── libexec
├── sbin
├── share
│   ├── contexthelp
│   ├── docs
│   │   └── images
│   ├── images
│   │   └── logos
│   ├── includes
│   │   └── rss
│   │       └── extlib
│   ├── js
│   ├── locale
│   │   ├── de
│   │   │   └── LC_MESSAGES
│   │   └── fr
│   │       └── LC_MESSAGES
│   ├── media
│   ├── ssi
│   └── stylesheets
└── var
    ├── archives
    ├── rw
    └── spool
        └── checkresults
We describe only the important directories.
Expanding the configuration directory gives:
etc/
├── cgi.cfg
├── htpasswd.users
├── nagios.cfg
├── objects
│ ├── commands.cfg
│ ├── contacts.cfg
│ ├── localhost.cfg
│ ├── printer.cfg
│ ├── switch.cfg
│ ├── templates.cfg
│ ├── timeperiods.cfg
│ └── windows.cfg
└── resource.cfg
Expanding the Nagios plugin directory gives:
libexec/
├── check_apt
├── check_breeze
├── check_by_ssh
├── check_clamd -> check_tcp
├── check_cluster
├── check_dhcp
├── check_dig
├── check_disk
├── check_disk_smb
├── check_dns
├── check_dummy
├── check_file_age
├── check_flexlm
├── check_ftp -> check_tcp
├── check_http
├── check_icmp
├── check_ide_smart
├── check_ifoperstatus
├── check_ifstatus
├── check_imap -> check_tcp
├── check_ircd
├── check_jabber -> check_tcp
├── check_load
├── check_log
├── check_mailq
├── check_mrtg
├── check_mrtgtraf
├── check_nagios
├── check_nntp -> check_tcp
├── check_nntps -> check_tcp
├── check_nrpe
├── check_nt
├── check_ntp
├── check_ntp_peer
├── check_ntp_time
├── check_nwstat
├── check_oracle
├── check_overcr
├── check_ping
├── check_pop -> check_tcp
├── check_procs
├── check_real
├── check_rpc
├── check_sensors
├── check_simap -> check_tcp
├── check_smtp
├── check_spop -> check_tcp
├── check_ssh
├── check_ssmtp -> check_tcp
├── check_swap
├── check_tcp
├── check_time
├── check_udp -> check_tcp
├── check_ups
├── check_users
├── check_wave
├── negate
├── urlize
├── utils.pm
└── utils.sh
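Nagios is thoroughly object-oriented: its configuration is built from objects such as contacts, hosts, services, commands, and time periods.
Running the init script in trace mode (sh -x) shows how the Nagios service starts: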
[root@hadoop-ehp0 etc]# sh -x /etc/init.d/nagios start
+ prefix=/usr/local/nagios
+ exec_prefix=/usr/local/nagios
+ NagiosBin=/usr/local/nagios/bin/nagios
+ NagiosCfgFile=/usr/local/nagios/etc/nagios.cfg
+ NagiosCfgtestFile=/usr/local/nagios/var/nagios.configtest
+ NagiosStatusFile=/usr/local/nagios/var/status.dat
+ NagiosRetentionFile=/usr/local/nagios/var/retention.dat
+ NagiosCommandFile=/usr/local/nagios/var/rw/nagios.cmd
+ NagiosVarDir=/usr/local/nagios/var
+ NagiosRunFile=/usr/local/nagios/var/nagios.lock
+ NagiosLockDir=/var/lock/subsys
+ NagiosLockFile=nagios
+ NagiosCGIDir=/usr/local/nagios/sbin
+ NagiosUser=nagios
+ NagiosGroup=nagios
+ checkconfig=true
+ '[' -f /etc/rc.d/init.d/functions ']'
+ . /etc/rc.d/init.d/functions
++ TEXTDOMAIN=initscripts
++ umask 022
++ PATH=/sbin:/usr/sbin:/bin:/usr/bin
++ export PATH
++ '[' -z '' ']'
++ COLUMNS=80
++ '[' -z '' ']'
+++ /sbin/consoletype
++ CONSOLETYPE=pty
++ '[' -f /etc/sysconfig/i18n -a -z '' -a -z '' ']'
++ . /etc/profile.d/lang.sh
++ unset LANGSH_SOURCED
++ '[' -z '' ']'
++ '[' -f /etc/sysconfig/init ']'
++ . /etc/sysconfig/init
+++ BOOTUP=color
+++ RES_COL=60
+++ MOVE_TO_COL='echo -en \033[60G'
+++ SETCOLOR_SUCCESS='echo -en \033[0;32m'
+++ SETCOLOR_FAILURE='echo -en \033[0;31m'
+++ SETCOLOR_WARNING='echo -en \033[0;33m'
+++ SETCOLOR_NORMAL='echo -en \033[0;39m'
+++ PROMPT=yes
+++ AUTOSWAP=no
+++ ACTIVE_CONSOLES='/dev/tty[1-6]'
+++ SINGLE=/sbin/sushell
++ '[' pty = serial ']'
++ __sed_discard_ignored_files='/\(~|\.bak|\.orig|\.rpmnew|\.rpmorig|\.rpmsave\)$/d'
+ test -f /etc/sysconfig/nagios
+ USE_RAMDISK=0
+ test 0 -ne 0
+ '[' '!' -f /usr/local/nagios/bin/nagios ']'
+ '[' '!' -f /usr/local/nagios/etc/nagios.cfg ']'
+ case "$1" in
+ echo -n 'Starting nagios:'
Starting nagios:+ test true = true
+ check_config
++ mktemp /tmp/.configtest.XXXXXXXX
+ TMPFILE=/tmp/.configtest.NbW8s1FN
+ /usr/local/nagios/bin/nagios -vp /usr/local/nagios/etc/nagios.cfg
++ sed 's/ //g'
++ awk -F: '{print $2}'
++ grep '^Total Warnings:' /tmp/.configtest.NbW8s1FN
+ WARN=0
++ awk -F: '{print $2}'
++ grep '^Total Errors:' /tmp/.configtest.NbW8s1FN
++ sed 's/ //g'
+ ERR=0
+ test 0 = 0
+ test 0 = 0
+ echo 'OK - Configuration check verified'
+ chmod 0644 /usr/local/nagios/var/nagios.configtest
+ chown nagios:nagios /usr/local/nagios/var/nagios.configtest
+ /bin/rm /tmp/.configtest.NbW8s1FN
+ return 0
+ test -f /usr/local/nagios/var/nagios.lock
+ touch /usr/local/nagios/var/nagios.log /usr/local/nagios/var/retention.dat
+ rm -f /usr/local/nagios/var/rw/nagios.cmd
+ touch /usr/local/nagios/var/nagios.lock
+ chown nagios:nagios /usr/local/nagios/var/nagios.lock /usr/local/nagios/var/nagios.log /usr/local/nagios/var/retention.dat
+ /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
+ '[' -d /var/lock/subsys ']'
+ touch /var/lock/subsys/nagios
+ echo ' done.'
done.
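The trace shows the script first setting its variables, then sourcing /etc/rc.d/init.d/functions, which provides helper functions to the scripts under init.d. Next, check_config validates the configuration by running nagios -vp and reading the Total Warnings and Total Errors counts. Finally the script runs /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg. From this we can see that nagios.cfg is where all configuration begins.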
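The check_config step in the trace boils down to extracting two counters from the output of the configuration check. As a minimal Python sketch of that idea (the function names and the pass/fail policy are hypothetical and are not part of the init script):

```python
import re

def parse_preflight_totals(output: str) -> dict:
    """Extract the 'Total Warnings' and 'Total Errors' counts from config-check output."""
    totals = {}
    for line in output.splitlines():
        match = re.match(r"^Total (Warnings|Errors):\s*(\d+)\s*$", line)
        if match:
            totals[match.group(1).lower()] = int(match.group(2))
    return totals

def config_is_ok(output: str) -> bool:
    """Assumed policy: both counters must be present and equal to zero."""
    totals = parse_preflight_totals(output)
    return totals.get("warnings") == 0 and totals.get("errors") == 0

sample = "Checking misc settings...\nTotal Warnings: 0\nTotal Errors:   0\n"
print(parse_preflight_totals(sample))
print(config_is_ok(sample))
```

The init script does the same job with grep, awk, and sed on a temporary file; the sketch only illustrates which lines matter.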
Let's look at what nagios.cfg does. The relevant part:
cfg_file=/usr/local/nagios/etc/objects/commands.cfg
cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
cfg_file=/usr/local/nagios/etc/objects/timeperiods.cfg
cfg_file=/usr/local/nagios/etc/objects/templates.cfg
# Definitions for monitoring the local (Linux) host
cfg_file=/usr/local/nagios/etc/objects/localhost.cfg
This file simply pulls in several default object configuration files; the sections below go through each of them in turn.
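To see which object files a given nagios.cfg pulls in, the cfg_file and cfg_dir directives can be listed with a few lines of Python. This is only an illustrative sketch: the function name is made up, and it handles only simple key=value lines and # comments.

```python
def object_config_sources(main_cfg_text: str) -> list:
    """Return (directive, path) pairs for cfg_file / cfg_dir lines, skipping comments."""
    sources = []
    for raw in main_cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or commented-out directive
        key, sep, value = line.partition("=")
        if sep and key.strip() in ("cfg_file", "cfg_dir"):
            sources.append((key.strip(), value.strip()))
    return sources

excerpt = """
cfg_file=/usr/local/nagios/etc/objects/commands.cfg
cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
# Definitions for monitoring the local (Linux) host
cfg_file=/usr/local/nagios/etc/objects/localhost.cfg
"""
for directive, path in object_config_sources(excerpt):
    print(directive, path)
```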
An excerpt from commands.cfg:
# 'notify-host-by-email' command definition
define command{
command_name notify-host-by-email
command_line /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /bin/mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" $CONTACTEMAIL$
}
# 'notify-service-by-email' command definition
define command{
command_name notify-service-by-email
command_line /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$\n" | /bin/mail -s "** $NOTIFICATIONTYPE$ Service Alert: $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$
}
# 'check-host-alive' command definition
define command{
command_name check-host-alive
command_line $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
}
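As the comments make clear, these define commands for sending e-mail notifications, checking whether a host is alive, and so on. Something must invoke them; where that happens is covered later.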
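The $NAME$ tokens in each command_line are macros that Nagios fills in before running the command. A simplified Python sketch of that substitution follows. The function is hypothetical, the value given for USER1 is assumed to be the plugin directory used in this article, and unknown macros are left untouched here, which may differ from what Nagios itself does.

```python
import re

def expand_macros(command_line: str, macros: dict) -> str:
    """Replace each $NAME$ token with its value; unknown macros are left as-is."""
    def substitute(match):
        return macros.get(match.group(1), match.group(0))
    return re.sub(r"\$([A-Z0-9_]+)\$", substitute, command_line)

template = "$USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5"
macros = {
    "USER1": "/usr/local/nagios/libexec",   # assumed value of $USER1$
    "HOSTADDRESS": "192.168.137.101",       # address of the host being checked
}
print(expand_macros(template, macros))
```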
An excerpt from contacts.cfg:
define contact{
contact_name nagiosadmin ; Short name of user
use generic-contact ; Inherit default values from generic-contact template (defined above)
alias Nagios Admin ; Full name of user
email nagios@localhost ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
}
define contactgroup{
contactgroup_name admins
alias Nagios Administrators
members nagiosadmin
}
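Nagios is known for its alerting, and here we finally see the e-mail settings: this is where a contact's e-mail address is defined.
An excerpt from timeperiods.cfg: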
# This defines a timeperiod where all times are valid for checks,
# notifications, etc. The classic "24x7" support nightmare. :-)
define timeperiod{
timeperiod_name 24x7
alias 24 Hours A Day, 7 Days A Week
sunday 00:00-24:00
monday 00:00-24:00
tuesday 00:00-24:00
wednesday 00:00-24:00
thursday 00:00-24:00
friday 00:00-24:00
saturday 00:00-24:00
}
# 'workhours' timeperiod definition
define timeperiod{
timeperiod_name workhours
alias Normal Work Hours
monday 09:00-17:00
tuesday 09:00-17:00
wednesday 09:00-17:00
thursday 09:00-17:00
friday 09:00-17:00
}
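Notifications should not simply go out at any hour regardless of need. Time periods let us choose, for example, between 24x7 and working hours only, so that alerting matches our actual situation.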
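A time period is essentially a per-weekday list of time ranges. The following Python sketch shows how a moment could be tested against the two periods above. The data layout and function names are assumptions made for illustration, and treating the end of a range as exclusive is a choice of this sketch.

```python
from datetime import datetime

DAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]

# The two periods from the excerpt, written as weekday -> list of (start, end) ranges.
ALWAYS = {day: [("00:00", "24:00")] for day in DAYS}
WORKHOURS = {day: [("09:00", "17:00")] for day in DAYS[:5]}

def to_minutes(hhmm: str) -> int:
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)

def in_timeperiod(period: dict, moment: datetime) -> bool:
    """True if the moment falls inside one of the ranges defined for its weekday."""
    day = DAYS[moment.weekday()]          # Monday is 0 in datetime.weekday()
    now = moment.hour * 60 + moment.minute
    return any(to_minutes(start) <= now < to_minutes(end)
               for start, end in period.get(day, []))

print(in_timeperiod(WORKHOURS, datetime(2024, 1, 1, 10, 0)))   # a Monday morning
print(in_timeperiod(WORKHOURS, datetime(2024, 1, 6, 10, 0)))   # a Saturday morning
```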
An excerpt from templates.cfg:
# Generic contact definition template - This is NOT a real contact, just a template!
define contact{
name generic-contact ; The name of this contact template
service_notification_period 24x7 ; service notifications can be sent anytime
host_notification_period 24x7 ; host notifications can be sent anytime
service_notification_options w,u,c,r,f,s ; send notifications for all service states, flapping events, and scheduled downtime events
host_notification_options d,u,r,f,s ; send notifications for all host states, flapping events, and scheduled downtime events
service_notification_commands notify-service-by-email ; send service notifications via email
host_notification_commands notify-host-by-email ; send host notifications via email
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL CONTACT, JUST A TEMPLATE!
}
# Generic host definition template - This is NOT a real host, just a template!
define host{
name generic-host ; The name of this host template
notifications_enabled 1 ; Host notifications are enabled
event_handler_enabled 1 ; Host event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
notification_period 24x7 ; Send host notifications at any time
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
}
# Generic service definition template - This is NOT a real service, just a template!
define service{
name generic-service ; The 'name' of this service template
active_checks_enabled 1 ; Active service checks are enabled
passive_checks_enabled 1 ; Passive service checks are enabled/accepted
parallelize_check 1 ; Active service checks should be parallelized (disabling this can lead to major performance problems)
obsess_over_service 1 ; We should obsess over this service (if necessary)
check_freshness 0 ; Default is to NOT check service 'freshness'
notifications_enabled 1 ; Service notifications are enabled
event_handler_enabled 1 ; Service event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
is_volatile 0 ; The service is not volatile
check_period 24x7 ; The service can be checked at any time of the day
max_check_attempts 3 ; Re-check the service up to 3 times in order to determine its final (hard) state
normal_check_interval 10 ; Check the service every 10 minutes under normal conditions
retry_check_interval 2 ; Re-check the service every two minutes until a hard state can be determined
contact_groups admins ; Notifications get sent out to everyone in the 'admins' group
notification_options w,u,c,r ; Send notifications about warning, unknown, critical, and recovery events
notification_interval 60 ; Re-notify about service problems every hour
notification_period 24x7 ; Notifications can be sent out at any time
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}
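Contacts, hosts, services, time periods, and commands (invoked from services): there are only a handful of object types, so rather than repeating the same settings it makes sense to define templates that other objects inherit from. That is exactly what the definitions above do.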
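The effect of the use directive can be pictured as a recursive merge in which locally defined values win over inherited ones. The Python sketch below is an illustration only: it supports a single parent per object, the names are hypothetical, and it assumes that name, use, and register are not passed down to the inheriting object.

```python
NOT_INHERITED = {"name", "use", "register"}  # assumed to stay with the template itself

def resolve(name: str, definitions: dict) -> dict:
    """Merge a definition with everything it inherits through 'use'; local values win."""
    definition = definitions[name]
    merged = {}
    parent = definition.get("use")
    if parent is not None:
        inherited = resolve(parent, definitions)
        merged.update({k: v for k, v in inherited.items() if k not in NOT_INHERITED})
    merged.update({k: v for k, v in definition.items() if k != "use"})
    return merged

definitions = {
    "generic-contact": {
        "service_notification_period": "24x7",
        "host_notification_commands": "notify-host-by-email",
        "register": "0",
    },
    "nagiosadmin": {
        "use": "generic-contact",
        "alias": "Nagios Admin",
        "email": "nagios@localhost",
    },
}
print(resolve("nagiosadmin", definitions))
```

Resolved this way, nagiosadmin carries the notification settings of generic-contact without repeating them.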
Taking the bundled localhost.cfg as an example, here is an excerpt:
# Define a host for the local machine
define host{
use linux-server ; Name of host template to use
; This host definition will inherit all variables that are defined
; in (or inherited by) the linux-server host template definition.
host_name localhost
alias localhost
address 127.0.0.1
}
# Define a service to "ping" the local machine
define service{
use local-service ; Name of service template to use
host_name localhost
service_description PING
check_command check_ping!100.0,20%!500.0,60%
}
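localhost.cfg is where the concrete entities finally appear; the objects described earlier only lay the groundwork. Monitoring the state of a given service on a given host, and reacting to that state, is the whole point.
How does the monitoring host check the monitored hosts? Through the NRPE plugin. NRPE is also a client/server design: the monitoring host is the client, and each monitored host is a server that must run the NRPE daemon. At regular intervals the client sends check requests for the services we have defined to the servers. On receiving a request, a server calls its local Nagios plugins to check the service and returns the result to the client. The client then reacts to the result, for example by e-mail or phone, and makes it available in the web UI.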
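What a plugin hands back is a one-line message plus an exit status, conventionally 0 for OK, 1 for WARNING, 2 for CRITICAL, and 3 for UNKNOWN. As a rough Python sketch of a process-count check in that style (the function names and message wording are illustrative, and only the simple min:max threshold form is handled):

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3   # conventional plugin exit statuses

def outside_range(value: int, spec: str) -> bool:
    """True when value lies outside a 'min:max' range (simplified threshold syntax)."""
    low, _, high = spec.partition(":")
    return value < int(low) or value > int(high)

def check_process_count(count: int, critical: str):
    """Return (status, message) for a process count and a critical range."""
    if outside_range(count, critical):
        return CRITICAL, f"PROCS CRITICAL: {count} processes"
    return OK, f"PROCS OK: {count} processes"

print(check_process_count(1, "1:1"))   # exactly one matching process
print(check_process_count(0, "1:1"))   # the process has died
```

A real plugin would print the message and terminate with the status as its exit code; the sketch returns both so the logic is easy to inspect.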
To consolidate this, let's walk through the whole flow again by monitoring the QuorumPeerMain process on every node of a ZooKeeper cluster.
On the NRPE client (the monitoring host), add these commands to commands.cfg:
define command{
command_name check_nrpe
command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}
define command{
command_name check_nrpe_args
command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -a $ARG2$ ; -a forwards the arguments to the NRPE daemon
}
Define the host files: under etc, create a directory named hostservers containing group.cfg, hadoop-ehp1.cfg, hadoop-ehp2.cfg, and hadoop-ehp3.cfg:
hostservers/
├── group.cfg
├── hadoop-ehp1.cfg
├── hadoop-ehp2.cfg
└── hadoop-ehp3.cfg
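Register the directory in the main configuration file nagios.cfg (and comment out the localhost file, otherwise the configuration check later reports errors):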
#cfg_file=/usr/local/nagios/etc/objects/localhost.cfg
cfg_dir=/usr/local/nagios/etc/hostservers
The cfg_dir directive loads every .cfg file in that directory. group.cfg contains:
# Host group
define hostgroup{
hostgroup_name linux-servers ; The name of the hostgroup
alias Linux Servers ; Long name of the group
members hadoop-ehp1,hadoop-ehp2,hadoop-ehp3 ; Comma separated list of hosts that belong to this group
}
Each hadoop-ehpN.cfg looks like this (hadoop-ehp1.cfg shown):
# Host and services
define host{
use linux-server
host_name hadoop-ehp1
alias hadoop-ehp1
address 192.168.137.101
}
define service{
use generic-service
host_name hadoop-ehp1
service_description check_nrpe_users
check_command check_nrpe!check_users
}
define service{
use generic-service
host_name hadoop-ehp1
service_description QuorumPeerMain
check_command check_nrpe_args!check_procs_args!"-c1:1 -Cjava -aserver.quorum.QuorumPeerMain"
}
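check_procs_args is passed to the NRPE server as the name of the command to run, so the server must define that command in nrpe.cfg. For $ARG1$ to be filled in, the daemon must also accept command arguments, which means setting dont_blame_nrpe=1 and assumes NRPE was built with command-argument support: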
command[check_procs_args]=/usr/local/nagios/libexec/check_procs $ARG1$
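On the monitoring host, the check_command value is split on ! into the command name and the values of $ARG1$, $ARG2$, and so on. A minimal Python sketch of that split (the function name is hypothetical, and escaped exclamation marks are not handled):

```python
def split_check_command(check_command: str):
    """Split 'name!arg1!arg2' into the command name and an ARGn -> value mapping."""
    name, *args = check_command.split("!")
    return name, {f"ARG{i}": arg for i, arg in enumerate(args, start=1)}

name, args = split_check_command(
    'check_nrpe_args!check_procs_args!"-c1:1 -Cjava -aserver.quorum.QuorumPeerMain"'
)
print(name)
print(args["ARG1"])
print(args["ARG2"])
```

Here ARG1 is the command name looked up in nrpe.cfg, and ARG2 is the option string meant for check_procs on the monitored host.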
Restart the NRPE daemon.
On the monitoring host, set the contact's e-mail address in contacts.cfg:
define contact{
contact_name nagiosadmin ; Short name of user
use generic-contact ; Inherit default values from generic-contact template (defined above)
alias Nagios Admin ; Full name of user
email <your-email-address> ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
}
Edit /etc/mail.rc and add:
set from=<your-email-address> smtp=smtp.139.com
set smtp-auth-user=<your-email-address> smtp-auth-password=wxl123456 smtp-auth=login
Verify that the configuration is valid:
[root@hadoop-ehp0 nagios]# bin/nagios -v etc/nagios.cfg
Nagios Core 4.0.8
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 08-12-2014
License: GPL
Website: http://www.nagios.org
Reading configuration data...
Read main config file okay...
Read object config files okay...
Running pre-flight check on configuration data...
Checking objects...
Checked 6 services.
Checked 3 hosts.
Checked 1 host groups.
Checked 0 service groups.
Checked 1 contacts.
Checked 1 contact groups.
Checked 26 commands.
Checked 5 time periods.
Checked 0 host escalations.
Checked 0 service escalations.
Checking for circular paths...
Checked 3 hosts
Checked 0 service dependencies
Checked 0 host dependencies
Checked 5 timeperiods
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...
Total Warnings: 0
Total Errors: 0
Things look okay - No serious problems were detected during the pre-flight check
Verify that the command works on the NRPE server:
[root@hadoop-ehp0 nagios]# libexec/check_nrpe -H hadoop-ehp1 -c check_procs_args 11
PROCS OK: 141 processes | procs=141;;;0;
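The part of the plugin output after the | character is performance data in the form label=value;warn;crit;min;max. A small Python sketch that separates the human-readable text from the metric values (the function name is made up, and labels containing spaces are not handled):

```python
def parse_plugin_output(line: str):
    """Split 'TEXT | label=value;warn;crit;min;max' into the text and a label -> value map."""
    text, _, perf = line.partition("|")
    metrics = {}
    for item in perf.split():
        label, _, fields = item.partition("=")
        metrics[label] = fields.split(";")[0]   # first field is the measured value
    return text.strip(), metrics

text, metrics = parse_plugin_output("PROCS OK: 141 processes | procs=141;;;0;")
print(text)
print(metrics)
```

The warning and critical fields are empty in this output, which shows that no thresholds were in effect for that run.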
Restart Nagios.
Open the web UI to check:
Kill one of the QuorumPeerMain processes:
Check the mailbox; the notification has arrived:
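This article has shown, step by step, how Nagios monitors a cluster and sends e-mail notifications. As a monitoring tool for a big-data platform, however, it still falls short on fine-grained monitoring. In a later article we will integrate it with Ganglia to monitor the platform at a finer granularity.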