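The Nagios plugins check_esx3.pl and check_esxi_hardware.py can be used to monitor VMware ESX servers. check_esx3.pl mainly monitors system resources such as CPU and memory usage, while check_esxi_hardware.py mainly monitors hardware. This makes it possible to monitor either a single ESX(i) server or a VirtualCenter/vCenter server cluster. If a virtual data center (vCenter) is already deployed, monitor vCenter rather than individual ESX/vSphere servers.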
Installation:
1. Install the dependency libraries
[root@localhost ~]# yum -y install gcc openssl libssl libssl-dev perl-doc rpm
2. Download the VMware vSphere SDK for Perl:
check_esx3.pl requires the VMware vSphere SDK for Perl.
https://my.vmware.com/group/vmware/details?productId=491&downloadGroup=SDKPERL600
Registration and login are required. Download the 32-bit or 64-bit version that matches your operating system.
[root@localhost src]# tar zxvf VMware-vSphere-SDK-for-Perl-4.0.0-161974.x86_64.tar.gz
[root@localhost src]# cd vmware-vsphere-cli-distrib
[root@localhost vmware-vsphere-cli-distrib]# ./vmware-install.pl
3. Install check_esx3.pl
Download check_esx3.pl from the address below and place it in the libexec directory under the Nagios installation directory:
http://exchange.nagios.org/directory/Plugins/Operating-Systems/*-Virtual-Environments/VMWare/Vmware-ESX-%26-VM-host/details
[root@localhost src]# mv check_esx3-0.5.pl /usr/local/nagios/libexec/
[root@localhost src]# cd /usr/local/nagios/libexec/
[root@localhost libexec]# chmod +x check_esx3-0.5.pl
[root@localhost libexec]# ./check_esx3-0.5.pl -help
Can't locate Nagios/Plugin.pm in @INC .........
[root@localhost libexec]#
Install the Nagios::Plugin module
[root@localhost libexec]# perl -MCPAN -e 'install Nagios::Plugin'
[root@localhost libexec]# ./check_esx3-0.5.pl -help
Can't locate Nagios/Plugin.pm in @INC .........
[root@localhost libexec]#
Install rpmforge-release
http://download.slogra.com/rpmforge-release-0.5.2-2.el5.rf.i386.rpm
[root@localhost libexec]# wget http://download.slogra.com/rpmforge-release-0.5.2-2.el5.rf.i386.rpm
[root@localhost libexec]# rpm -ivh rpmforge-release-0.5.2-2.el5.rf.i386.rpm
Install the Perl modules
[root@localhost libexec]# yum -y install perl-Params-Validate perl-Math-Calc-Units perl-Regexp-Common perl-Class-Accessor perl-Config-Tiny perl-Nagios-Plugin.noarch
[root@localhost libexec]# ./check_esx3-0.5.pl -help
Can't locate LWP/UserAgent.pm in @INC
[root@localhost libexec]#
Install Bundle::LWP
[root@localhost libexec]# perl -MCPAN -eshell
cpan> install Bundle::LWP
Do you want to modify/update your configuration (y|n) ? [no] no
Shall I follow them and prepend them to the queue of modules we are processing right now? [yes] yes
cpan> exit
The first prompt asks whether to modify the existing configuration; answer no. The second asks whether to follow the module's dependencies and prepend them to the queue of modules being processed; answer yes.
[root@localhost libexec]# ./check_esx3-0.5.pl -help
Can't locate Zlib/Compress.pm in @INC
[root@localhost libexec]# perl -MCPAN -e 'install Compress::Zlib'
The SDK install script uses cpan to install Perl modules, and some of them fail to install. Install those modules manually with cpan; if that still fails, install them with yum. UUID is one example:
error: installed manually for use by vSphere CLI:
UUID 0.03 or newer
Solution:
[root@localhost libexec]# yum install perl-SOAP-Lite perl-Data-Dump perl-Class-MethodMaker perl-Crypt-SSLeay perl-libxml-perl perl-XML-LibXML-Common libuuid-devel uuid-perl -y
[root@localhost libexec]# perl -MCPAN -e 'install UUID'
[root@localhost libexec]# ./check_esx3.pl -H 10.10.2.233 -u root -p 'justin' -l cpu
CHECK_ESX3.PL CRITICAL - Server version unavailable at 'https://10.10.2.233:443/sdk/vimService.wsdl' at /usr/share/perl5/VMware/VICommon.pm line 545.
[root@localhost libexec]# vim check_esx3.pl
#!/usr/bin/perl -w
$ENV{PERL_LWP_SSL_VERIFY_HOSTNAME}=0;
#
# Nagios plugin to monitor vmware esx servers
[root@nagios libexec]# ./check_esx3.pl -H 10.10.2.233 -u root -p 'justin' -l cpu
CHECK_ESX3.PL CRITICAL - Server version unavailable at 'https://10.10.2.233:443/sdk/vimService.wsdl' at /usr/lib/perl5/5.8.8/VMware/VICommon.pm line 545.
The fix is to have check_esx3.pl tell LWP to ignore self-signed SSL certificates, which ESX/ESXi servers use by default. To do so, add the line "$ENV{PERL_LWP_SSL_VERIFY_HOSTNAME} = 0;" to check_esx3.pl.
[root@nagios libexec]# vim check_esx3.pl
$ENV{PERL_LWP_SSL_VERIFY_HOSTNAME} = 0;
#
# Nagios plugin to monitor vmware esx servers
#
# License: GPL
[root@nagios libexec]# ./check_esx3.pl -H 10.10.2.233 -u root -p 'justin' -l cpu
CHECK_ESX3.PL OK - cpu usage=55.00 MHz (0.12%) | cpu_usagemhz=55.00Mhz;; cpu_usage=0.12%;;
Run ./check_esx3.pl --help to see the options that check_esx3.pl accepts.
4. Install check_esxi_hardware.py
check_esxi_hardware.py requires Python and the Python extension package pywbem. Ports 443 and 5989 on the ESXi host must also be open to the Nagios monitoring server.
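Before installing the plugin it can help to confirm that both ports are reachable from the Nagios server. The following is a minimal sketch using only the Python standard library; the function names port_is_open and check_esxi_ports are made up for this illustration and are not part of either plugin.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_esxi_ports(host, ports=(443, 5989)):
    """Map each required port to whether it is reachable from this machine."""
    return {port: port_is_open(host, port) for port in ports}
```

Calling check_esxi_ports("10.10.2.233") from the Nagios server returns a dictionary keyed by port; a False value suggests a firewall rule or a stopped service on the ESXi side, although a TCP connect alone does not prove that the CIM service itself is healthy.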
[root@nagios libexec]# wget
[root@nagios libexec]# chown nagios.nagios check_esxi_hardware.py
[root@nagios libexec]# chmod 755 check_esxi_hardware.py
[root@nagios libexec]# ./check_esxi_hardware.py
Traceback (most recent call last):
  File "./check_esxi_hardware.py", line 222, in <module>
    import pywbem
ImportError: No module named pywbem
[root@nagios libexec]# ./check_esxi_hardware.py -h
Traceback (most recent call last):
  File "./check_esxi_hardware.py", line 222, in <module>
    import pywbem
ImportError: No module named pywbem
The pywbem module is not installed. Install this third-party Python module:
http://downloads.sourceforge.net/project/pywbem/pywbem/pywbem-0.7/pywbem-0.7.0.tar.gz?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Fpywbem%2Ffiles%2Fpywbem%2F&ts=1299742557&use_mirror=voxel
[root@nagios libexec]# wget http://downloads.sourceforge.net/project/pywbem/pywbem/pywbem-0.7/pywbem-0.7.0.tar.gz
[root@nagios libexec]# tar -zxvf pywbem-0.7.0.tar.gz
[root@nagios libexec]# cd pywbem-0.7.0
[root@nagios libexec]# python setup.py build
[root@nagios libexec]# python setup.py install --record files.txt
[root@nagios libexec]# ./check_esxi_hardware.py -H 10.10.2.233 -U nagios -P nagios -V dell
OK - Server: Dell Inc. PowerEdge R610 s/n: XXXXXX System BIOS: XXXXXXXXXXX
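Using pywbem 0.8.0 may prevent the plugin from working. Running python setup.py install --record files.txt records the installed files so that the module is easy to uninstall later: cat files.txt | xargs rm -rf
Both check_esx3.pl and check_esxi_hardware.py only need a read-only user name and password on the ESXi host. Run ./check_esx3.pl -help and ./check_esxi_hardware.py -help to see each plugin's usage.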
[root@localhost libexec]# ./check_esxi_hardware.py -help
Usage: check_esxi_hardware.py https://hostname user password system [verbose]
example: check_esxi_hardware.py https://my-shiny-new-vmware-server root fakepassword dell
or, using new style options:
usage: check_esxi_hardware.py -H hostname -U username -P password [-V system -v -p -I XX]
example: check_esxi_hardware.py -H my-shiny-new-vmware-server -U root -P fakepassword -V auto -I uk
or, verbosely:
usage: check_esxi_hardware.py --host=hostname --user=username --pass=password [--vendor=system --verbose --perfdata --html=XX]
Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
Mandatory parameters:
  -H HOST, --host=HOST  report on HOST
  -U USER, --user=USER  user to connect as
  -P PASS, --pass=PASS  password, if password matches file:, first line of given file will be used as password
Optional parameters:
  -V VENDOR, --vendor=VENDOR
                        Vendor code: auto, dell, hp, ibm, intel, or unknown (default)
  -v, --verbose         print status messages to stdout (default is to be quiet)
  -p, --perfdata        collect performance data for pnp4nagios (default is not to)
  -I XX, --html=XX      generate html links for country XX (default is not to)
  -t TIMEOUT, --timeout=TIMEOUT
                        timeout in seconds - no effect on Windows (default = no timeout)
                        comma-separated list of elements to ignore
  --no-power            don't collect power performance data
  --no-volts            don't collect voltage performance data
  --no-current          don't collect current performance data
  --no-temp             don't collect temperature performance data
  --no-fan              don't collect fan performance data
[root@localhost libexec]#
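Create a read-only user on the ESXi host
1) Log in to the ESXi host. In the "Local Users & Groups" tab, right-click in the blank area and choose "Add" to add a user.
On ESXi 6.0 and later the password complexity policy has to be adjusted first: change the password setting under security to retry=3 min=8,8,8,7,6
1. After logging in to the ESXi host, run the following command:
#vi /etc/pam.d/passwd
2. Find the following line (the second line below shows its general form):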
password requisite /lib/security/$ISA/pam_passwdqc.so retry=3 min=8,8,8,7,6
password requisite /lib/security/$ISA/pam_passwdqc.so retry=N min=N0,N1,N2,N3,N4
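Notes:
retry=3 means the user gets three attempts to enter a password;
N0 = 12: a password made of a single character class is accepted, but it must be at least 12 characters long;
N1 = 10: a password using two character classes must be at least 10 characters long;
N2 = 8: a passphrase must be at least 8 characters long;
N3 = 8: a password using three character classes (for example upper case, lower case, and digits) must be at least 8 characters long;
N4 = 7: a password using all four character classes (upper case, lower case, digits, and special characters) must be at least 7 characters long;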
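To make the mapping between the min= values and the N0 to N4 slots concrete, here is a small sketch that parses a pam_passwdqc line into labelled fields. The helper name parse_passwdqc is hypothetical; it only reads the text of the line and does not apply or validate the policy.

```python
import re

LABELS = ("N0", "N1", "N2", "N3", "N4")


def parse_passwdqc(line):
    """Extract retry and the five min= values from a pam_passwdqc line."""
    retry = re.search(r"\bretry=(\d+)", line)
    minimum = re.search(r"\bmin=(\S+)", line)
    if not minimum:
        raise ValueError("no min= option found")
    values = minimum.group(1).split(",")
    if len(values) != len(LABELS):
        raise ValueError("min= expects five comma-separated values")
    # pam_passwdqc also accepts the word "disabled" in a slot,
    # so non-numeric values are kept as strings.
    parsed = {label: (int(v) if v.isdigit() else v)
              for label, v in zip(LABELS, values)}
    parsed["retry"] = int(retry.group(1)) if retry else None
    return parsed
```

Applied to the line shown above with retry=3 min=8,8,8,7,6, the result maps N0, N1 and N2 to 8, N3 to 7 and N4 to 6.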
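2) Give the nagios user the read-only role. In the "Permissions" tab, right-click in the blank area, choose "Add Permission", and assign the read-only role to the nagios user.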
Test
[root@nagios libexec]# ./check_esxi_hardware.py -H 10.10.2.233 -U nagios -P nagios -V dell
UNKNOWN: Authentication Error
[root@localhost libexec]# ./check_esxi_hardware.py -H 10.15.98.204 -U nagios -P nagios -V auto -t 90 -i "IPMI SEL"
Traceback (most recent call last):
  File "./check_esxi_hardware.py", line 593, in <module>
    wbemclient = pywbem.WBEMConnection(hosturl, (user,password), no_verification=True)
TypeError: __init__() got an unexpected keyword argument 'no_verification'
[root@localhost libexec]#
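Authentication failed. According to information found online, the cause is a difference in behavior between ESXi versions.
Solution:
Log in to the ESXi host over SSH (SSH must first be enabled on the ESXi host).
~ # cat /etc/security/access.conf
# This file is autogenerated and must not be edited.
+:dcui:ALL
+:root:ALL
+:vpxuser:ALL
+:vslauser:ALL
-:nagios:ALL
-:ALL:ALL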
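Remove "-:nagios:ALL" and add "+:nagios:sfcb" after the "+:root:ALL" line, so that the file looks like this:
~ # cat /etc/security/access.conf
# This file is autogenerated and must not be edited.
+:dcui:ALL
+:root:ALL
+:nagios:sfcb
+:vpxuser:ALL
+:vslauser:ALL
-:ALL:ALL
This approach works well when users are rarely added, because the change only has to be made once. If users are added frequently, access.conf may be regenerated, in which case a scheduled task is needed to re-add "+:nagios:sfcb".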
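A scheduled task of that kind needs an edit that is safe to repeat. The sketch below shows one possible way to express the edit as a pure function on the file's text; the name allow_sfcb_user is invented for this example, and whether a Python interpreter is available on a given ESXi build is an assumption that should be checked before relying on it.

```python
def allow_sfcb_user(text, user="nagios"):
    """Return access.conf text with '+:<user>:sfcb' present and '-:<user>:ALL' removed."""
    allow = "+:%s:sfcb" % user
    deny = "-:%s:ALL" % user
    # Drop the explicit deny rule for this user, if present.
    lines = [ln for ln in text.splitlines() if ln.strip() != deny]
    stripped = [ln.strip() for ln in lines]
    if allow not in stripped:
        # Place the rule right after root's entry, as in the edited file above;
        # otherwise fall back to just before the catch-all deny rule.
        if "+:root:ALL" in stripped:
            pos = stripped.index("+:root:ALL") + 1
        elif "-:ALL:ALL" in stripped:
            pos = stripped.index("-:ALL:ALL")
        else:
            pos = len(lines)
        lines.insert(pos, allow)
    return "\n".join(lines) + "\n"
```

Because the function leaves an already-patched file unchanged, a periodic job can read /etc/security/access.conf, pass its content through the function, and write the file back only when the result differs.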
[root@nagios libexec]# ./check_esxi_hardware.py -H 10.10.2.233 -U nagios -P nagios -V dell
OK - Server: Dell Inc. PowerEdge R610 s/n: XXXXXX System BIOS: XXXXXXXXXXX
Both ./check_esx3.pl and ./check_esxi_hardware.py now return data correctly from the command line. The next step is to add them to the monitoring system.
5. Add the checks to Nagios
1) First add the commands to commands.cfg
[root@nagios libexec]# vim /usr/local/nagios/etc/objects/commands.cfg
#check_esxi_hardware.py
define command {
    command_name check_esxi_hardware
    command_line $USER1$/check_esxi_hardware.py -H $HOSTADDRESS$ -U $ARG1$ -P $ARG2$ -V $ARG3$ -I isolutions -p -t 20
}
#check_esx3.pl
define command{
    command_name check_esx3_cpu_usage
    command_line $USER1$/check_esx3.pl -H $HOSTADDRESS$ -u $ARG1$ -p $ARG2$ -l cpu -s usage -w $ARG3$ -c $ARG4$
}
define command{
    command_name check_esx3_mem_usage
    command_line $USER1$/check_esx3.pl -H $HOSTADDRESS$ -u $ARG1$ -p $ARG2$ -l mem -s usage -w $ARG3$ -c $ARG4$
}
define command{
    command_name check_esx3_swap_usage
    command_line $USER1$/check_esx3.pl -H $HOSTADDRESS$ -u $ARG1$ -p $ARG2$ -l mem -s swap -w $ARG3$ -c $ARG4$
}
define command{
    command_name check_esx3_net_usage
    command_line $USER1$/check_esx3.pl -H $HOSTADDRESS$ -u $ARG1$ -p $ARG2$ -l net -s usage -w $ARG3$ -c $ARG4$
}
# Example thresholds for check_esx3_vmfs: -w 80%: -c 90%: (alert when the value is below the threshold); EQ_LUN8 and EQ_LUN9 are excluded from the check
define command{
    command_name check_esx3_vmfs
    command_line $USER1$/check_esx3.pl -H $HOSTADDRESS$ -u $ARG1$ -p $ARG2$ -l vmfs -x EQ_LUN9,EQ_LUN8 -s $ARG3$ -w $ARG4$ -c $ARG5$
}
define command{
    command_name check_esx3_runtime_status
    command_line $USER1$/check_esx3.pl -H $HOSTADDRESS$ -u $ARG1$ -p $ARG2$ -l runtime -s status
}
define command{
    command_name check_esx3_runtime_issues
    command_line $USER1$/check_esx3.pl -H $HOSTADDRESS$ -u $ARG1$ -p $ARG2$ -l runtime -s issues
}
define command{
    command_name check_esx3_dc_cpu_usage
    command_line $USER1$/check_esx3.pl -D $ARG1$ -u $ARG2$ -p $ARG3$ -H $HOSTALIAS$ -l cpu -s usage -w $ARG4$ -c $ARG5$
}
define command{
    command_name check_esx3_dc_mem_usage
    command_line $USER1$/check_esx3.pl -D $ARG1$ -u $ARG2$ -p $ARG3$ -H $HOSTALIAS$ -l mem -s usage -w $ARG4$ -c $ARG5$
}
How to use the -w and -c alert thresholds
Range definition | Generate an alert if x... |
10 | < 0 or > 10, (outside the range of {0 .. 10}) |
10: | < 10, (outside {10 .. ∞}) |
~:10 | > 10, (outside the range of {-∞ .. 10}) |
10:20 | < 10 or > 20, (outside the range of {10 .. 20}) |
@10:20 | ≥ 10 and ≤ 20, (inside the range of {10 .. 20}) |
Command line | Meaning |
check_stuff -w10 -c20 | Critical if "stuff" is over 20, else warn if over 10 (will be critical if "stuff" is less than 0) |
check_stuff -w~:10 -c~:20 | Same as above. Negative "stuff" is OK |
check_stuff -w10: -c20 | Critical if "stuff" is over 20, else warn if "stuff" is below 10 (will be critical if "stuff" is less than 0) |
check_stuff -c1: | Critical if "stuff" is less than 1 |
check_stuff -w~:0 -c10 | Critical if "stuff" is above 10; Warn if "stuff" is above zero (will be critical if "stuff" is less than 0) |
check_stuff -c5:6 | Critical if "stuff" is less than 5 or more than 6 |
check_stuff -c@10:20 | OK if stuff is less than 10 or higher than 20, otherwise critical |
Reference: https://nagios-plugins.org/doc/guidelines.html#THRESHOLDFORMAT
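The range rules in the tables above can be checked with a few lines of code. This is a rough sketch of the threshold format as the tables describe it, not the code the plugins themselves use; parse_range and alerts are names chosen for this illustration.

```python
import math


def parse_range(spec):
    """Parse a threshold range string into (start, end, alert_inside)."""
    inside = spec.startswith("@")  # a leading @ inverts the test
    if inside:
        spec = spec[1:]
    if ":" in spec:
        start_s, end_s = spec.split(":", 1)
    else:
        start_s, end_s = "0", spec  # "10" is shorthand for "0:10"
    start = -math.inf if start_s == "~" else float(start_s or 0)
    end = math.inf if end_s == "" else float(end_s)
    if start > end:
        raise ValueError("range start must not exceed range end")
    return start, end, inside


def alerts(spec, value):
    """Return True if value should raise an alert for the given range."""
    start, end, inside = parse_range(spec)
    within = start <= value <= end
    return within if inside else not within
```

For example, alerts("10:", 9) is True because 9 lies below the range {10 .. ∞}, while alerts("@10:20", 15) is True because the @ prefix alerts on values inside the range.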
2) Define the monitored host and services
[root@nagios libexec]# vim /usr/local/nagios/etc/10.10.2.233.cfg
define host{
    use linux-server
    host_name vSphere3
    alias vSphere Host3(SSB412)
    address 10.2.1.153
    hostgroups ESX
    icon_image vmware.png
    icon_image_alt VMware vSphere (SSB412)
    vrml_image vmware.jpg
    statusmap_image vmware.gd2
    2d_coords 800,900
    parents RackSW_PDC_1
}
define service{
    use generic-service
    host_name VM-ESXi-01,VM-ESXi-02,vSphere1,vSphere2,vSphere3
    service_description CPU Usage
    check_command check_esx3_cpu_usage!nagios!password!100!110
}
define service{
    use generic-service
    host_name VM-ESXi-01,VM-ESXi-02,vSphere1,vSphere2,vSphere3
    service_description Memory Usage
    check_command check_esx3_mem_usage!nagios!password!100!110
}
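The check_command value is split on "!" into the command name and the $ARG1$, $ARG2$, ... values, which are then substituted into the command_line template from commands.cfg. The sketch below imitates that substitution so a definition can be sanity-checked by hand. It is a simplification, not Nagios's own implementation: it ignores escaped "!" characters, and the function name expand_check_command is hypothetical.

```python
def expand_check_command(check_command, command_line, macros=None):
    """Substitute $ARGn$ and any supplied macros into a command_line template."""
    name, *args = check_command.split("!")
    values = dict(macros or {})
    for i, arg in enumerate(args, start=1):
        values["$ARG%d$" % i] = arg
    expanded = command_line
    for macro, value in values.items():
        expanded = expanded.replace(macro, value)
    return name, expanded


name, cmd = expand_check_command(
    "check_esx3_cpu_usage!nagios!password!100!110",
    "$USER1$/check_esx3.pl -H $HOSTADDRESS$ -u $ARG1$ -p $ARG2$ "
    "-l cpu -s usage -w $ARG3$ -c $ARG4$",
    # Values assumed for this example: the libexec path and host address used above.
    {"$USER1$": "/usr/local/nagios/libexec", "$HOSTADDRESS$": "10.2.1.153"},
)
print(cmd)
```

Running the resulting command line by hand as the nagios user is a quick way to confirm that a service definition will behave as expected before reloading Nagios.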
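The configuration above is for reference only; adapt it to your actual environment.
If you use NagiosQL, first define the commands under the Commands section, much like the commands.cfg file,
then define the services under Supervision > Services,
and finally add the monitored hosts under Supervision > Hosts and select the services to monitor.
Additional notes:
check_esxi_hardware.py sometimes reports "Service Check Timed Out", and a timeout is still reported even when -t is added to the command. After changing timeout=0 to 60 inside check_esxi_hardware.py the timeouts seem to have stopped; this is still being observed.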
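Converting the time format of the Nagios log file
Each line of /usr/local/nagios/var/nagios.log has the format: timestamp, then text.
The timestamp is the number of seconds elapsed between 1970-01-01 and the time of the entry (Unix epoch time).
The command below converts it to a more familiar time format.
Back up the log file first and run the command on the copy. The command modifies the file in place, and once the log file has been modified Nagios can no longer read its time format and will malfunction.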
perl -i -pe '($t) = ($_ =~ m/^\[(\d+)\]/); $nice=scalar localtime $t; s/^\[(\d+)\]/[$nice]/' nagios.log.back
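The same conversion can be done without modifying any file by filtering lines through a small function. This is a sketch equivalent in spirit to the Perl one-liner above; the function name humanize and the output format string are choices made for this example.

```python
import re
import time

STAMP = re.compile(r"^\[(\d+)\]")


def humanize(line, to_struct=time.localtime):
    """Replace a leading [epoch] with a readable timestamp; other lines pass through."""
    def repl(match):
        when = to_struct(int(match.group(1)))
        return "[" + time.strftime("%Y-%m-%d %H:%M:%S", when) + "]"
    return STAMP.sub(repl, line, count=1)
```

Passing time.gmtime as the second argument gives UTC output instead of local time. Because the function only returns a new string, it can be applied to a copy of nagios.log or to a live stream without touching the file that Nagios reads.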