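Nagios Configuration Explained and Cluster Monitoring

1 Preface

This series walks step by step through monitoring a big data platform cluster. Picking up from the previous article on installing and deploying Nagios, this article covers the main Nagios configuration files and how Nagios operates, then shows how to monitor a Zookeeper cluster, using that one example throughout.

2 Nagios File Structure

2.1 File Structure on the Monitoring Host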

nagios/
├── bin
├── etc
│   └── objects
├── libexec
├── sbin
├── share
│   ├── contexthelp
│   ├── docs
│   │   └── images
│   ├── images
│   │   └── logos
│   ├── includes
│   │   └── rss
│   │       └── extlib
│   ├── js
│   ├── locale
│   │   ├── de
│   │   │   └── LC_MESSAGES
│   │   └── fr
│   │       └── LC_MESSAGES
│   ├── media
│   ├── ssi
│   └── stylesheets
└── var
    ├── archives
    ├── rw
    └── spool
        └── checkresults

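Only the important directories are described here:

  • bin: executable programs (the nagios daemon itself)
  • etc: configuration files (important)
  • libexec: Nagios plugins
  • sbin: CGI programs used by the web UI

The configuration directory expands as follows: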

etc/
├── cgi.cfg
├── htpasswd.users
├── nagios.cfg
├── objects
│   ├── commands.cfg
│   ├── contacts.cfg
│   ├── localhost.cfg
│   ├── printer.cfg
│   ├── switch.cfg
│   ├── templates.cfg
│   ├── timeperiods.cfg
│   └── windows.cfg
└── resource.cfg

The Nagios plugin directory expands as follows:

libexec/
├── check_apt
├── check_breeze
├── check_by_ssh
├── check_clamd -> check_tcp
├── check_cluster
├── check_dhcp
├── check_dig
├── check_disk
├── check_disk_smb
├── check_dns
├── check_dummy
├── check_file_age
├── check_flexlm
├── check_ftp -> check_tcp
├── check_http
├── check_icmp
├── check_ide_smart
├── check_ifoperstatus
├── check_ifstatus
├── check_imap -> check_tcp
├── check_ircd
├── check_jabber -> check_tcp
├── check_load
├── check_log
├── check_mailq
├── check_mrtg
├── check_mrtgtraf
├── check_nagios
├── check_nntp -> check_tcp
├── check_nntps -> check_tcp
├── check_nrpe
├── check_nt
├── check_ntp
├── check_ntp_peer
├── check_ntp_time
├── check_nwstat
├── check_oracle
├── check_overcr
├── check_ping
├── check_pop -> check_tcp
├── check_procs
├── check_real
├── check_rpc
├── check_sensors
├── check_simap -> check_tcp
├── check_smtp
├── check_spop -> check_tcp
├── check_ssh
├── check_ssmtp -> check_tcp
├── check_swap
├── check_tcp
├── check_time
├── check_udp -> check_tcp
├── check_ups
├── check_users
├── check_wave
├── negate
├── urlize
├── utils.pm
└── utils.sh

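3 How Nagios Operates

Nagios is object-oriented through and through: its configuration revolves around objects such as contacts, hosts, services, commands, and time periods.

3.1 Nagios Startup Flow

Run the init script in debug mode to trace how the nagios service starts: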

[root@hadoop-ehp0 etc]# sh -x /etc/init.d/nagios start
+ prefix=/usr/local/nagios
+ exec_prefix=/usr/local/nagios
+ NagiosBin=/usr/local/nagios/bin/nagios
+ NagiosCfgFile=/usr/local/nagios/etc/nagios.cfg
+ NagiosCfgtestFile=/usr/local/nagios/var/nagios.configtest
+ NagiosStatusFile=/usr/local/nagios/var/status.dat
+ NagiosRetentionFile=/usr/local/nagios/var/retention.dat
+ NagiosCommandFile=/usr/local/nagios/var/rw/nagios.cmd
+ NagiosVarDir=/usr/local/nagios/var
+ NagiosRunFile=/usr/local/nagios/var/nagios.lock
+ NagiosLockDir=/var/lock/subsys
+ NagiosLockFile=nagios
+ NagiosCGIDir=/usr/local/nagios/sbin
+ NagiosUser=nagios
+ NagiosGroup=nagios
+ checkconfig=true
+ '[' -f /etc/rc.d/init.d/functions ']'
+ . /etc/rc.d/init.d/functions
++ TEXTDOMAIN=initscripts
++ umask 022
++ PATH=/sbin:/usr/sbin:/bin:/usr/bin
++ export PATH
++ '[' -z '' ']'
++ COLUMNS=80
++ '[' -z '' ']'
+++ /sbin/consoletype
++ CONSOLETYPE=pty
++ '[' -f /etc/sysconfig/i18n -a -z '' -a -z '' ']'
++ . /etc/profile.d/lang.sh
++ unset LANGSH_SOURCED
++ '[' -z '' ']'
++ '[' -f /etc/sysconfig/init ']'
++ . /etc/sysconfig/init
+++ BOOTUP=color
+++ RES_COL=60
+++ MOVE_TO_COL='echo -en \033[60G'
+++ SETCOLOR_SUCCESS='echo -en \033[0;32m'
+++ SETCOLOR_FAILURE='echo -en \033[0;31m'
+++ SETCOLOR_WARNING='echo -en \033[0;33m'
+++ SETCOLOR_NORMAL='echo -en \033[0;39m'
+++ PROMPT=yes
+++ AUTOSWAP=no
+++ ACTIVE_CONSOLES='/dev/tty[1-6]'
+++ SINGLE=/sbin/sushell
++ '[' pty = serial ']'
++ __sed_discard_ignored_files='/\(~|\.bak|\.orig|\.rpmnew|\.rpmorig|\.rpmsave\)$/d'
+ test -f /etc/sysconfig/nagios
+ USE_RAMDISK=0
+ test 0 -ne 0
+ '[' '!' -f /usr/local/nagios/bin/nagios ']'
+ '[' '!' -f /usr/local/nagios/etc/nagios.cfg ']'
+ case "$1" in
+ echo -n 'Starting nagios:'
Starting nagios:+ test true = true
+ check_config
++ mktemp /tmp/.configtest.XXXXXXXX
+ TMPFILE=/tmp/.configtest.NbW8s1FN
+ /usr/local/nagios/bin/nagios -vp /usr/local/nagios/etc/nagios.cfg
++ sed 's/ //g'
++ awk -F: '{print $2}'
++ grep '^Total Warnings:' /tmp/.configtest.NbW8s1FN
+ WARN=0
++ awk -F: '{print $2}'
++ grep '^Total Errors:' /tmp/.configtest.NbW8s1FN
++ sed 's/ //g'
+ ERR=0
+ test 0 = 0
+ test 0 = 0
+ echo 'OK - Configuration check verified'
+ chmod 0644 /usr/local/nagios/var/nagios.configtest
+ chown nagios:nagios /usr/local/nagios/var/nagios.configtest
+ /bin/rm /tmp/.configtest.NbW8s1FN
+ return 0
+ test -f /usr/local/nagios/var/nagios.lock
+ touch /usr/local/nagios/var/nagios.log /usr/local/nagios/var/retention.dat
+ rm -f /usr/local/nagios/var/rw/nagios.cmd
+ touch /usr/local/nagios/var/nagios.lock
+ chown nagios:nagios /usr/local/nagios/var/nagios.lock /usr/local/nagios/var/nagios.log /usr/local/nagios/var/retention.dat
+ /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
+ '[' -d /var/lock/subsys ']'
+ touch /var/lock/subsys/nagios
+ echo ' done.'
done.

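Reading the trace: the script first sets a series of variables, then sources /etc/rc.d/init.d/functions, which provides helper functions for the scripts under init.d. It then runs check_config, which validates the configuration with nagios -vp and extracts the "Total Warnings" and "Total Errors" counts, and finally starts the daemon with /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg. In other words, nagios.cfg is where all configuration begins.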
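
The check_config step in the trace can be mirrored in a few lines of Python. This is only a sketch of the same idea (extract the two totals and compare both with zero, as the trace does before printing "OK"); the function names are hypothetical and nothing here is part of Nagios itself.

```python
import re

# Matches lines such as "Total Warnings: 0" or "Total Errors:   0".
TOTAL_RE = re.compile(r"^Total (Warnings|Errors):\s*(\d+)\s*$")


def preflight_totals(output: str) -> dict:
    """Collect the warning and error counts from pre-flight check output."""
    totals = {}
    for line in output.splitlines():
        match = TOTAL_RE.match(line)
        if match:
            totals[match.group(1).lower()] = int(match.group(2))
    return totals


def config_ok(output: str) -> bool:
    """True only when both counters are present and equal to zero."""
    totals = preflight_totals(output)
    return totals.get("warnings") == 0 and totals.get("errors") == 0


SAMPLE = "Total Warnings: 0\nTotal Errors:   0\n"
print(preflight_totals(SAMPLE), config_ok(SAMPLE))
```

In practice the text would come from running bin/nagios -v etc/nagios.cfg and capturing its standard output.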

Next, let's see what nagios.cfg does. The important part is:

cfg_file=/usr/local/nagios/etc/objects/commands.cfg
cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
cfg_file=/usr/local/nagios/etc/objects/timeperiods.cfg
cfg_file=/usr/local/nagios/etc/objects/templates.cfg

# Definitions for monitoring the local (Linux) host
cfg_file=/usr/local/nagios/etc/objects/localhost.cfg

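The file pulls in several default object configuration files through cfg_file directives. The following sections go through these files one by one.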
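
To see at a glance which object files a given nagios.cfg pulls in, the directives can be listed with a small script. The sketch below is an illustration only: it assumes the simple key=value layout shown above, treats lines starting with # as comments, and also picks up cfg_dir, the directory variant of the directive. The function name is hypothetical.

```python
from typing import List, Tuple


def object_sources(main_cfg_text: str) -> List[Tuple[str, str]]:
    """Return (directive, path) pairs for cfg_file / cfg_dir lines, in file order."""
    sources = []
    for raw in main_cfg_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments (including commented-out directives).
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() in ("cfg_file", "cfg_dir"):
            sources.append((key.strip(), value.strip()))
    return sources


SAMPLE = """\
cfg_file=/usr/local/nagios/etc/objects/commands.cfg
cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
cfg_file=/usr/local/nagios/etc/objects/timeperiods.cfg
cfg_file=/usr/local/nagios/etc/objects/templates.cfg

# Definitions for monitoring the local (Linux) host
cfg_file=/usr/local/nagios/etc/objects/localhost.cfg
"""

for directive, path in object_sources(SAMPLE):
    print(directive, path)
```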

3.2 Commands (commands.cfg)

Here is an excerpt:

# 'notify-host-by-email' command definition
define command{
    command_name    notify-host-by-email
    command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /bin/mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" $CONTACTEMAIL$
    }

# 'notify-service-by-email' command definition
define command{
    command_name    notify-service-by-email
    command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$\n" | /bin/mail -s "** $NOTIFICATIONTYPE$ Service Alert: $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$
    }

# 'check-host-alive' command definition
define command{
        command_name    check-host-alive
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
        }

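The comments say it all: these definitions provide commands for sending notification emails, checking whether a host is alive, and so on. Something must call them, but where? We will come back to that below.

3.3 Contacts (contacts.cfg)

Here is an excerpt: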

define contact{
    contact_name    nagiosadmin     ; Short name of user
    use             generic-contact     ; Inherit default values from generic-contact template (defined above)
    alias           Nagios Admin        ; Full name of user

    email           nagios@localhost    ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
        }

define contactgroup{
        contactgroup_name       admins
        alias                   Nagios Administrators
        members                 nagiosadmin
        }

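Nagios is known as an alerting powerhouse, and here we finally see settings related to sending email: a contact's email address is defined in this file.

3.4 Time Periods (timeperiods.cfg)

Here is an excerpt: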

# This defines a timeperiod where all times are valid for checks, 
# notifications, etc.  The classic "24x7" support nightmare. :-)
define timeperiod{
        timeperiod_name 24x7
        alias           24 Hours A Day, 7 Days A Week
        sunday          00:00-24:00
        monday          00:00-24:00
        tuesday         00:00-24:00
        wednesday       00:00-24:00
        thursday        00:00-24:00
        friday          00:00-24:00
        saturday        00:00-24:00
        }

# 'workhours' timeperiod definition
define timeperiod{
    timeperiod_name workhours
    alias       Normal Work Hours
    monday      09:00-17:00
    tuesday     09:00-17:00
    wednesday   09:00-17:00
    thursday    09:00-17:00
    friday      09:00-17:00
    }

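Alerts should not necessarily go out around the clock. Time periods let us choose between 24x7 and working hours, whichever matches how we actually operate.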
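
The effect of a timeperiod can be illustrated with a small membership test. The sketch below is a simplification rather than Nagios's own logic: it only handles the weekday entries used in the two definitions above, and it assumes the end of each range is exclusive. All names are hypothetical.

```python
from datetime import datetime

DAYS = ("monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday")

# The two timeperiods from the excerpt, as weekday -> "HH:MM-HH:MM" ranges.
WORKHOURS = {day: "09:00-17:00" for day in DAYS[:5]}
ALWAYS = {day: "00:00-24:00" for day in DAYS}


def _minutes(hhmm: str) -> int:
    """Convert 'HH:MM' to minutes since midnight ('24:00' gives 1440)."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def in_timeperiod(period: dict, moment: datetime) -> bool:
    """Return True if the moment falls inside one of that weekday's ranges."""
    spec = period.get(DAYS[moment.weekday()])  # Monday is index 0
    if spec is None:
        return False  # weekday not listed, e.g. weekends in 'workhours'
    now = moment.hour * 60 + moment.minute
    for time_range in spec.split(","):
        start, end = time_range.strip().split("-")
        if _minutes(start) <= now < _minutes(end):
            return True
    return False


print(in_timeperiod(WORKHOURS, datetime(2024, 1, 1, 10, 0)))  # a Monday morning
print(in_timeperiod(WORKHOURS, datetime(2024, 1, 6, 10, 0)))  # a Saturday morning
```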

3.5 Templates (templates.cfg)

Here is an excerpt:

# Generic contact definition template - This is NOT a real contact, just a template!
define contact{
        name                            generic-contact     ; The name of this contact template
        service_notification_period     24x7            ; service notifications can be sent anytime
        host_notification_period        24x7            ; host notifications can be sent anytime
        service_notification_options    w,u,c,r,f,s     ; send notifications for all service states, flapping events, and scheduled downtime events
        host_notification_options       d,u,r,f,s       ; send notifications for all host states, flapping events, and scheduled downtime events
        service_notification_commands   notify-service-by-email ; send service notifications via email
        host_notification_commands      notify-host-by-email    ; send host notifications via email
        register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL CONTACT, JUST A TEMPLATE!
        }

# Generic host definition template - This is NOT a real host, just a template!
define host{
        name                            generic-host    ; The name of this host template
        notifications_enabled           1           ; Host notifications are enabled
        event_handler_enabled           1           ; Host event handler is enabled
        flap_detection_enabled          1           ; Flap detection is enabled
        process_perf_data               1           ; Process performance data
        retain_status_information       1           ; Retain status information across program restarts
        retain_nonstatus_information    1           ; Retain non-status information across program restarts
        notification_period     24x7        ; Send host notifications at any time
        register                        0           ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
        }

# Generic service definition template - This is NOT a real service, just a template!
define service{
        name                            generic-service     ; The 'name' of this service template
        active_checks_enabled           1               ; Active service checks are enabled
        passive_checks_enabled          1               ; Passive service checks are enabled/accepted
        parallelize_check               1               ; Active service checks should be parallelized (disabling this can lead to major performance problems)
        obsess_over_service             1               ; We should obsess over this service (if necessary)
        check_freshness                 0               ; Default is to NOT check service 'freshness'
        notifications_enabled           1               ; Service notifications are enabled
        event_handler_enabled           1               ; Service event handler is enabled
        flap_detection_enabled          1               ; Flap detection is enabled
        process_perf_data               1               ; Process performance data
        retain_status_information       1               ; Retain status information across program restarts
        retain_nonstatus_information    1               ; Retain non-status information across program restarts
        is_volatile                     0               ; The service is not volatile
        check_period                    24x7            ; The service can be checked at any time of the day
        max_check_attempts              3           ; Re-check the service up to 3 times in order to determine its final (hard) state
        normal_check_interval           10          ; Check the service every 10 minutes under normal conditions
        retry_check_interval            2           ; Re-check the service every two minutes until a hard state can be determined
        contact_groups                  admins          ; Notifications get sent out to everyone in the 'admins' group
        notification_options        w,u,c,r         ; Send notifications about warning, unknown, critical, and recovery events
        notification_interval           60          ; Re-notify about service problems every hour
        notification_period             24x7            ; Notifications can be sent out at any time
         register                        0              ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        }

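Contacts, hosts, services, time periods, commands (invoked from services): since the definitions follow only a handful of patterns, it makes sense to write templates that other objects inherit from instead of repeating the same settings, and that is exactly what this file does.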
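
How an object picks up values from its template can be sketched as a recursive merge. This is an illustration under simplifying assumptions, not Nagios's implementation: only a single template name in use is handled, and objects are plain dictionaries. It does follow the documented rule that name, use, and register are not inherited. The function and variable names are hypothetical.

```python
# Directives that an object never inherits from its template.
NOT_INHERITED = {"name", "use", "register"}

# A trimmed copy of the generic-contact template from the excerpt above.
TEMPLATES = {
    "generic-contact": {
        "name": "generic-contact",
        "service_notification_period": "24x7",
        "host_notification_period": "24x7",
        "service_notification_commands": "notify-service-by-email",
        "host_notification_commands": "notify-host-by-email",
        "register": "0",
    },
}

# The nagiosadmin contact from contacts.cfg.
CONTACT = {
    "contact_name": "nagiosadmin",
    "use": "generic-contact",
    "alias": "Nagios Admin",
    "email": "nagios@localhost",
}


def resolve(obj: dict, templates: dict) -> dict:
    """Merge template values into obj; values set on obj itself win."""
    merged = {}
    parent = obj.get("use")
    if parent is not None:
        inherited = resolve(templates[parent], templates)
        merged.update({k: v for k, v in inherited.items()
                       if k not in NOT_INHERITED})
    merged.update(obj)
    return merged


resolved = resolve(CONTACT, TEMPLATES)
print(resolved["host_notification_commands"])
```

The resolved contact ends up with the notification commands from section 3.2, which answers the earlier question of where those commands are called from.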

3.6 Hosts and Services

Taking the bundled localhost.cfg as an example, here is an excerpt:

# Define a host for the local machine
define host{
        use                     linux-server            ; Name of host template to use
                            ; This host definition will inherit all variables that are defined
                            ; in (or inherited by) the linux-server host template definition.
        host_name               localhost
        alias                   localhost
        address                 127.0.0.1
        }
# Define a service to "ping" the local machine
define service{
        use                             local-service         ; Name of service template to use
        host_name                       localhost
        service_description             PING
        check_command                   check_ping!100.0,20%!500.0,60%
        }

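localhost.cfg is where the concrete objects finally appear; everything before it only lays the groundwork. Monitoring the state of a given service on a given host, and reacting to that state, is what we set out to do. How does the monitoring host check the monitored hosts? Through the NRPE add-on, which is itself client-server: the monitoring host is the client and each monitored host is the server (it must run the NRPE daemon). The client periodically sends the checks we have defined to each server; the server runs the corresponding local Nagios plugin and returns the result; the client then reacts to the result, by email or phone, and the results can also be viewed in the web UI.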

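4 Monitoring a Zookeeper Cluster

To reinforce the ideas above, we walk through the whole flow again, monitoring the QuorumPeerMain process on every node of a Zookeeper cluster.

4.1 Custom Commands

On the NRPE client side (the monitoring host), add these commands to commands.cfg: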

define command{
    command_name    check_nrpe
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
    }
define command{
    command_name    check_nrpe_args
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -a $ARG2$
    }
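
To make the macro mechanism concrete, the sketch below expands a service's check_command into the command line that would be executed. It is a simplified illustration, not Nagios's macro processor: unknown macros are left untouched, and the value of USER1 (normally set in resource.cfg) is assumed to be the libexec directory shown earlier. The function name is hypothetical.

```python
import re


def expand(command_line: str, check_command: str, macros: dict):
    """Split 'name!arg1!arg2' and substitute $MACRO$ placeholders."""
    name, *args = check_command.split("!")
    values = dict(macros)
    for index, arg in enumerate(args, start=1):
        values[f"ARG{index}"] = arg  # arguments become $ARG1$, $ARG2$, ...

    def substitute(match):
        # Leave macros we do not know about as they are.
        return values.get(match.group(1), match.group(0))

    return name, re.sub(r"\$(\w+)\$", substitute, command_line)


# command_line of the check_nrpe definition above.
CHECK_NRPE = "$USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$"

name, line = expand(
    CHECK_NRPE,
    "check_nrpe!check_users",
    {"USER1": "/usr/local/nagios/libexec",   # assumed resource.cfg value
     "HOSTADDRESS": "192.168.137.101"},
)
print(name)
print(line)
```

The text before the first ! selects the command definition, and each following piece fills the next $ARGn$ macro.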

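4.2 Host Groups and Services

Create a directory named hostservers under etc to hold the host group and host definitions, and in it create group.cfg, hadoop-ehp1.cfg, hadoop-ehp2.cfg, and hadoop-ehp3.cfg: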

hostservers/
├── group.cfg
├── hadoop-ehp1.cfg
├── hadoop-ehp2.cfg
└── hadoop-ehp3.cfg

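Add the directory to the main configuration file nagios.cfg (and comment out the localhost file; otherwise the configuration check later reports errors):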

#cfg_file=/usr/local/nagios/etc/objects/localhost.cfg
cfg_dir=/usr/local/nagios/etc/hostservers

The cfg_dir directive loads every .cfg file in the directory. The contents of group.cfg are:

# Host group
define hostgroup{
        hostgroup_name  linux-servers ; The name of the hostgroup
        alias           Linux Servers ; Long name of the group
        members         hadoop-ehp1,hadoop-ehp2,hadoop-ehp3     ; Comma separated list of hosts that belong to this group
        }

Each hadoop-ehpN.cfg file looks like this (hadoop-ehp1.cfg shown as the example):

# Host and services
define host{
       use                     linux-server
       host_name               hadoop-ehp1
       alias                   hadoop-ehp1
       address                 192.168.137.101
       }

define service{
       use                             generic-service
       host_name                       hadoop-ehp1
       service_description             check_nrpe_users
       check_command                   check_nrpe!check_users
       }       

define service{
       use                             generic-service
       host_name                       hadoop-ehp1
       service_description             QuorumPeerMain
       check_command                   check_nrpe_args!check_procs_args!"-c1:1 -Cjava -aserver.quorum.QuorumPeerMain"
       }

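The name check_procs_args is sent to the NRPE server side as the command to run, so the server must define it (edit nrpe.cfg). Because this command takes arguments, the NRPE daemon must also be allowed to accept them: it has to be built with --enable-command-args, and nrpe.cfg must set dont_blame_nrpe=1.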

command[check_procs_args]=/usr/local/nagios/libexec/check_procs $ARG1$

Restart the NRPE service.
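
The -c1:1 option passed to check_procs above is a threshold in the Nagios plugin range format: the check turns critical when the process count falls outside 1..1, that is, when QuorumPeerMain is not running or is running more than once. The sketch below illustrates how such a range is read. It is a simplified illustration of the format, not code taken from the plugins, and the function names are hypothetical.

```python
import math


def parse_range(spec: str):
    """Return (low, high, inside) for a plugin threshold range string."""
    inside = spec.startswith("@")  # '@' inverts: alert when inside the range
    if inside:
        spec = spec[1:]
    if ":" in spec:
        low_text, high_text = spec.split(":", 1)
    else:
        low_text, high_text = "0", spec  # a bare 'N' means 0..N
    if low_text == "~":
        low = -math.inf  # '~' stands for negative infinity
    elif low_text == "":
        low = 0.0
    else:
        low = float(low_text)
    high = math.inf if high_text == "" else float(high_text)
    return low, high, inside


def alerts(spec: str, value: float) -> bool:
    """True if the value should raise an alert for this range."""
    low, high, inside = parse_range(spec)
    within = low <= value <= high
    return within if inside else not within


for count in (0, 1, 2):
    print(count, alerts("1:1", count))
```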

4.3 Contacts

define contact{
        contact_name                    nagiosadmin     ; Short name of user
        use             generic-contact     ; Inherit default values from generic-contact template (defined above)
        alias                           Nagios Admin        ; Full name of user

        email                           [email protected]    ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
        }

Edit /etc/mail.rc and add:

set from=[email protected] smtp=smtp.139.com
set smtp-auth-user=[email protected] smtp-auth-password=wxl123456 smtp-auth=login

4.4 Verification

Check that the configuration files are valid:

[root@hadoop-ehp0 nagios]# bin/nagios -v etc/nagios.cfg

Nagios Core 4.0.8
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 08-12-2014
License: GPL

Website: http://www.nagios.org
Reading configuration data...
   Read main config file okay...
   Read object config files okay...

Running pre-flight check on configuration data...

Checking objects...
    Checked 6 services.
    Checked 3 hosts.
    Checked 1 host groups.
    Checked 0 service groups.
    Checked 1 contacts.
    Checked 1 contact groups.
    Checked 26 commands.
    Checked 5 time periods.
    Checked 0 host escalations.
    Checked 0 service escalations.
Checking for circular paths...
    Checked 3 hosts
    Checked 0 service dependencies
    Checked 0 host dependencies
    Checked 5 timeperiods
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...

Total Warnings: 0
Total Errors:   0

Things look okay - No serious problems were detected during the pre-flight check

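Verify that the command on the NRPE server side works. Note that check_nrpe forwards arguments only when they follow -a; the OK result for all 141 processes below suggests that the trailing 11 was not applied as a threshold: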

[root@hadoop-ehp0 nagios]# libexec/check_nrpe -H hadoop-ehp1 -c check_procs_args 11
PROCS OK: 141 processes | procs=141;;;0;
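
The plugin output above has two parts separated by |: a human-readable status text and performance data in the form label=value;warn;crit;min;max. The sketch below splits such a line. It is a minimal illustration that assumes labels without spaces or quotes, and the function name is hypothetical.

```python
FIELDS = ("value", "warn", "crit", "min", "max")


def parse_plugin_output(line: str):
    """Split plugin output into its status text and a dict of perfdata metrics."""
    text, _, perfdata = line.partition("|")
    metrics = {}
    for item in perfdata.split():
        label, sep, rest = item.partition("=")
        if not sep:
            continue  # not a label=value item
        parts = rest.split(";")
        # Pad missing trailing fields so every metric has all five keys.
        parts += [""] * (len(FIELDS) - len(parts))
        metrics[label] = dict(zip(FIELDS, parts))
    return text.strip(), metrics


text, metrics = parse_plugin_output("PROCS OK: 141 processes | procs=141;;;0;")
print(text)
print(metrics["procs"])
```

Empty fields, such as the warning and critical thresholds here, mean that no value was supplied for them.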

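Restart Nagios, then open the web UI to check the status of the hosts and services.

Kill one of the QuorumPeerMain processes.

Check the mailbox: the alert message has arrived.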

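5 Summary

This article walked step by step through monitoring a cluster with Nagios and sending alert emails. As a monitoring tool for a big data platform, however, Nagios still falls short in fine-grained monitoring, so a later article integrates it with Ganglia to monitor the platform at a finer granularity.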
