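Our company recently needed to roll out a monitoring system covering many hosts, and the environments and devices differ widely. I therefore wrote this installation guide so that our operations staff can deploy the monitoring by following it.
My system is Red Hat 5.4, with iptables and SELinux disabled.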
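Before going further it may be worth confirming that SELinux really is off. The sketch below only reads the persistent setting. The function name `selinux_mode` is illustrative, and `/etc/selinux/config` is assumed to be the usual Red Hat location.

```shell
# Print the SELINUX= mode recorded in an SELinux config file.
# Defaults to /etc/selinux/config (assumed standard Red Hat path).
selinux_mode() {
    conf="${1:-/etc/selinux/config}"
    # Take the value after "SELINUX=" and ignore comments and SELINUXTYPE=.
    sed -n 's/^SELINUX=[[:space:]]*//p' "$conf" | head -n 1
}
```

Calling `selinux_mode` with no argument should print `disabled` on a host prepared as described here. The runtime state can also be checked with `getenforce`, and the firewall with `service iptables status`.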
1. Add the RPMforge yum repository
- [root@localhost yum.repos.d]# wget http://packages.sw.be/rpmforge-release/rpmforge-release-0.5.1-1.el5.rf.i386.rpm
- [root@localhost yum.repos.d]# wget http://dag.wieers.com/rpm/packages/RPM-GPG-KEY.dag.txt
- [root@localhost yum.repos.d]# rpm -Uvh rpmforge-release-0.5.1-1.el5.rf.i386.rpm
- [root@localhost yum.repos.d]# rpm --import RPM-GPG-KEY.dag.txt
- [root@localhost yum.repos.d]# yum install yum-fastestmirror yum-presto
2. Install Apache (skip this step if it is already installed; otherwise install it with yum)
- [root@localhost ~]# yum -y install httpd
Nagios needs a few basic support packages:
- [root@localhost etc]# yum -y install gd gd-devel glibc glibc-common gcc
3. Configure Apache to support Nagios
(1) Create the nagios user
- [root@localhost ~]# useradd nagios
- [root@localhost etc]# /usr/sbin/groupadd nagcmd    # add the nagcmd group, used to submit external commands from the web interface
- [root@localhost etc]# /usr/sbin/usermod -a -G nagcmd nagios    # add the nagios user to the nagcmd group
- [root@localhost etc]# /usr/sbin/usermod -a -G nagcmd apache    # add the apache user to the nagcmd group
- [root@localhost etc]# /usr/sbin/usermod -a -G apache nagios    # add the nagios user to the apache group
- [root@localhost etc]# /usr/sbin/usermod -a -G nagios apache    # add the apache user to the nagios group
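The group memberships created above can be verified in one pass. This is a minimal sketch built on `id -nG`; the helper name `in_group` is illustrative only.

```shell
# Return success if USER ($1) is a member of GROUP ($2),
# either as primary or supplementary group.
in_group() {
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF "$2"
}

# Check the four memberships set up in this step; one line per pair.
for pair in nagios:nagcmd apache:nagcmd nagios:apache apache:nagios; do
    user=${pair%%:*}
    group=${pair##*:}
    if in_group "$user" "$group"; then
        echo "ok: $user is in $group"
    else
        echo "MISSING: $user is not in $group"
    fi
done
```

A running Apache only picks up new group memberships after it is restarted.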
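(2) Change the user and group that Apache runs as. For a source-built Apache the default is daemon (the yum package runs as apache), and it needs to be changed to nagios. Apache then has permission to access the Nagios installation directory and run the CGI commands, for example shutting down Nagios from the browser or stopping alert notifications for a failed object. (This step can be skipped: when I deployed Nagios I left Apache's user and group unchanged and had no problems.)
(3) Add the Nagios access directories (Nagios is installed under /usr/local/nagios) and enable HTTP user authentication. Append the following to the end of httpd.conf: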
- ScriptAlias /nagios/cgi-bin /usr/local/nagios/sbin
- <Directory "/usr/local/nagios/sbin">
- Options ExecCGI
- AllowOverride None
- Order allow,deny
- Allow from all
- AuthName "Nagios Access"
- AuthType Basic
- AuthUserFile /usr/local/nagios/etc/htpasswd
- Require valid-user
- </Directory>
- Alias /nagios /usr/local/nagios/share
- <Directory "/usr/local/nagios/share">
- Options None
- AllowOverride None
- Order allow,deny
- Allow from all
- AuthName "Nagios Access"
- AuthType Basic
- AuthUserFile /usr/local/nagios/etc/htpasswd
- Require valid-user
- </Directory>
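Both `<Directory>` blocks point at a password file that is only created later with `htpasswd`. The sketch below lists every `AuthUserFile` in a config file that does not exist yet. The function name `missing_auth_files` is hypothetical, and the config path is whatever file the block above was appended to.

```shell
# Print each AuthUserFile path referenced in an Apache config file ($1)
# that does not exist on disk yet.
missing_auth_files() {
    awk 'tolower($1) == "authuserfile" { print $2 }' "$1" | sort -u |
    while read -r path; do
        [ -f "$path" ] || echo "$path"
    done
}

# Example (path assumed for a yum-installed httpd):
# missing_auth_files /etc/httpd/conf/httpd.conf
# Apache's own syntax check can be run with: httpd -t
```

An empty result means every referenced password file is present. Until then the Nagios pages will reject all logins.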
4. Install Nagios
- [root@localhost tmp]# tar zxvf nagios-3.3.1.tar.gz
- [root@localhost tmp]# cd nagios
- [root@localhost nagios]# ./configure --prefix=/usr/local/nagios --with-command-group=nagcmd
- [root@localhost nagios]# make all
- [root@localhost nagios]# make install
- [root@localhost nagios]# make install-init
- [root@localhost nagios]# make install-config
- [root@localhost nagios]# make install-commandmode
- [root@localhost nagios]# make install-webconf
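After the `make` targets finish, a quick look at the install prefix can catch a failed step early. The sketch below checks a few paths that the steps above are expected to create. The function name and the exact list of paths are assumptions, not an exhaustive inventory.

```shell
# Report whether the main pieces of a Nagios install exist under a prefix.
# Returns non-zero if anything in the (assumed) list is missing.
check_nagios_layout() {
    prefix="${1:-/usr/local/nagios}"
    status=0
    for item in bin/nagios etc/nagios.cfg sbin share; do
        if [ -e "$prefix/$item" ]; then
            echo "found   $prefix/$item"
        else
            echo "missing $prefix/$item"
            status=1
        fi
    done
    return "$status"
}
```

Running `check_nagios_layout` with no argument inspects /usr/local/nagios, the prefix used in this guide.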
5. Install the Nagios plugins (nagios-plugins)
- [root@localhost nagios]# cd /tmp
- [root@localhost tmp]# tar zxvf nagios-plugins-1.4.15.tar.gz
- [root@localhost tmp]# cd nagios-plugins-1.4.15
- [root@localhost nagios-plugins-1.4.15]# ./configure --with-nagios-user=nagios --with-nagios-group=nagios
- [root@localhost nagios-plugins-1.4.15]# make
- [root@localhost nagios-plugins-1.4.15]# make install
- [root@localhost nagios-plugins-1.4.15]# cd /usr/local/
- [root@localhost local]# chown -R nagios:nagios nagios/
- [root@localhost local]# chown -R nagios:nagios nagios/*
- [root@localhost local]# cd nagios/etc/
- [root@localhost etc]# vim nagios.cfg    # edit nagios.cfg so that it contains the following:
- # host definitions
- cfg_file=/usr/local/nagios/etc/hosts.cfg
- # host group definitions
- cfg_file=/usr/local/nagios/etc/hostgroups.cfg
- # contact definitions
- cfg_file=/usr/local/nagios/etc/contacts.cfg
- # contact group definitions
- cfg_file=/usr/local/nagios/etc/contactgroups.cfg
- # service definitions
- cfg_file=/usr/local/nagios/etc/services.cfg
- # time period definitions
- cfg_file=/usr/local/nagios/etc/objects/timeperiods.cfg
- # command definitions
- cfg_file=/usr/local/nagios/etc/objects/commands.cfg
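Several of the files listed in nagios.cfg do not exist until they are written in the following steps. This sketch lists the `cfg_file=` entries whose target is still missing. The function name `missing_cfg_files` is illustrative.

```shell
# Print every cfg_file= path in a nagios.cfg ($1) that does not exist yet.
# Commented-out entries (lines starting with '#') are ignored.
missing_cfg_files() {
    sed -n 's/^cfg_file=//p' "$1" | while read -r path; do
        [ -f "$path" ] || echo "$path"
    done
}

# Example with the path used in this guide:
# missing_cfg_files /usr/local/nagios/etc/nagios.cfg
```

An empty result means every referenced object file is in place, so the pre-flight check will not fail on a missing file.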
- Edit cgi.cfg as follows:
- [root@localhost etc]# vim cgi.cfg
- # separate multiple users with commas
- authorized_for_system_information=nagios
- authorized_for_configuration_information=nagios
- authorized_for_system_commands=nagios
- authorized_for_all_services=nagios
- authorized_for_all_hosts=nagios
- authorized_for_all_service_commands=nagios
- authorized_for_all_host_commands=nagios
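When the same change has to be repeated on many servers, the seven `authorized_for_*` lines can be rewritten with `sed` instead of by hand. This is a rough sketch: the function name `set_cgi_users` is hypothetical, and it assumes user names contain no `/` or `&` characters.

```shell
# Set every active authorized_for_* directive in a cgi.cfg ($1)
# to the comma-separated user list in $2.
set_cgi_users() {
    file=$1
    users=$2
    tmp=$(mktemp) || return 1
    # Write to a temp file first, then copy back so the original
    # file keeps its owner and permissions.
    sed "s/^\(authorized_for_[a-z_]*\)=.*/\1=$users/" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example with the path and user from this guide:
# set_cgi_users /usr/local/nagios/etc/cgi.cfg nagios
```

Lines that are commented out with `#` are left untouched, as are all other settings in the file.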
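- (1) Configure the host file hosts.cfg
- define host{
- host_name web1 ; host name is web1; check it with the hostname command
- alias Nagios Server ; host alias
- address 192.168.10.223 ; IP address of the host
- check_command check-host-alive ; check command; must be defined in the command definition file (it is defined by default)
- check_interval 5 ; interval between checks
- retry_interval 1 ; interval between retries after a failed check
- max_check_attempts 5 ; maximum number of check attempts
- check_period 24x7 ; time period during which checks run
- process_perf_data 0
- retain_nonstatus_information 0
- contact_groups admin ; contact group that receives the email alerts
- notification_interval 30 ; interval between notifications
- notification_period 24x7 ; time period during which notifications are sent
- # States that trigger notifications. For hosts: d = DOWN, u = UNREACHABLE, r = recovery (a notification is sent when the host recovers).
- # For services the options are w = WARNING, u = UNKNOWN, c = CRITICAL, r = recovery; w,c,r is usually enough in production.
- notification_options d,u,r
- }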
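With many hosts to add, the host block above can be generated rather than typed. The sketch below prints a definition using the same values as the example. The function name `make_host_def` is illustrative, and `web2` / `192.168.10.224` in the usage line are made-up values.

```shell
# Print a host definition for NAME ($1) at ADDRESS ($2),
# using the same check and notification settings as the web1 example.
make_host_def() {
    cat <<EOF
define host{
    host_name               $1
    alias                   $1
    address                 $2
    check_command           check-host-alive
    check_interval          5
    retry_interval          1
    max_check_attempts      5
    check_period            24x7
    contact_groups          admin
    notification_interval   30
    notification_period     24x7
    notification_options    d,u,r
}
EOF
}

# Example (hypothetical host):
# make_host_def web2 192.168.10.224 >> /usr/local/nagios/etc/hosts.cfg
```

A newly generated host still has to be added to a host group and given services before it shows up usefully in the web interface.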
- (2) Configure the host group file hostgroups.cfg
- define hostgroup {
- hostgroup_name Nagios-Example ; name of the host group
- alias Nagios Example ; alias of the host group
- members web1 ; group members; must match host_name in hosts.cfg, otherwise verification fails
- }
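- (3) Configure the contact file contacts.cfg
- define contact{
- contact_name nagiosadmin ; contact name
- alias Nagios Admin ; contact alias
- service_notification_period 24x7 ; service notifications may be sent at any time
- host_notification_period 24x7 ; host notifications may be sent at any time
- service_notification_options w,u,c,r ; service states that trigger notifications
- host_notification_options d,u,r ; host states that trigger notifications
- service_notification_commands notify-service-by-email ; alert by email
- host_notification_commands notify-host-by-email ; same as above
- email [email protected] ; mailbox that receives the alerts
- }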
- (4) Configure the contact group file contactgroups.cfg
- define contactgroup{
- contactgroup_name admin ; name of the contact group
- alias Nagios Administrators ; alias of the contact group
- members nagiosadmin ; group members; must match contact_name in contacts.cfg
- }
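- (5) Configure the service file services.cfg
- define service {
- host_name web1 ; must match host_name in hosts.cfg
- service_description check-host-alive ; service description
- check_period 24x7 ; time period during which checks run
- max_check_attempts 4 ; maximum number of check attempts
- normal_check_interval 3 ; interval between checks
- retry_check_interval 2 ; interval between retries
- contact_groups admin ; contact group notified on failure
- notification_interval 10 ; interval between notifications
- notification_period 24x7 ; time period during which notifications are sent
- notification_options w,u,c,r ; states that trigger notifications: w = WARNING, u = UNKNOWN, c = CRITICAL, r = recovery
- check_command check-host-alive ; check command
- }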
- define service {
- host_name web1
- service_description PING
- check_period 24x7
- max_check_attempts 4
- normal_check_interval 3
- retry_check_interval 2
- contact_groups admin
- notification_interval 10
- notification_period 24x7
- notification_options w,u,c,r
- check_command check_ping!100.0,20%!500.0,60%
- }
- define service {
- host_name web1
- service_description Root Partition
- check_period 24x7
- max_check_attempts 4
- normal_check_interval 3
- retry_check_interval 2
- contact_groups admin
- notification_interval 10
- notification_period 24x7
- notification_options w,u,c,r
- check_command check_local_disk!20%!10%!/
- }
- define service {
- host_name web1
- service_description Current Users
- check_period 24x7
- max_check_attempts 4
- normal_check_interval 3
- retry_check_interval 2
- contact_groups admin
- notification_interval 10
- notification_period 24x7
- notification_options w,u,c,r
- check_command check_local_users!20!50
- }
- define service {
- host_name web1
- service_description Total Processes
- check_period 24x7
- max_check_attempts 4
- normal_check_interval 3
- retry_check_interval 2
- contact_groups admin
- notification_interval 10
- notification_period 24x7
- notification_options w,u,c,r
- check_command check_local_procs!250!400!RSZDT
- }
- define service {
- host_name web1
- service_description Current Load
- check_period 24x7
- max_check_attempts 4
- normal_check_interval 3
- retry_check_interval 2
- contact_groups admin
- notification_interval 10
- notification_period 24x7
- notification_options w,u,c,r
- check_command check_local_load!5.0,4.0,3.0!10.0,6.0,4.0
- }
- define service {
- host_name web1
- service_description Swap Usage
- check_period 24x7
- max_check_attempts 4
- normal_check_interval 3
- retry_check_interval 2
- contact_groups admin
- notification_interval 10
- notification_period 24x7
- notification_options w,u,c,r
- check_command check_local_swap!20!10
- }
- define service {
- host_name web1
- service_description SSH
- check_period 24x7
- max_check_attempts 4
- normal_check_interval 3
- retry_check_interval 2
- contact_groups admin
- notification_interval 10
- notification_period 24x7
- notifications_enabled 0
- notification_options w,u,c,r
- check_command check_ssh
- }
- define service {
- host_name web1
- service_description HTTP
- check_period 24x7
- max_check_attempts 4
- normal_check_interval 3
- retry_check_interval 2
- contact_groups admin
- notification_interval 10
- notification_period 24x7
- notifications_enabled 0
- notification_options w,u,c,r
- check_command check_http
- }
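The service blocks above differ only in host, description and check command, so they lend themselves to generation. The sketch below prints one block with the shared settings from this guide. The function name `make_service` is hypothetical, and it omits `notifications_enabled`, which the SSH and HTTP examples set to 0.

```shell
# Print a service definition for HOST ($1) with DESCRIPTION ($2)
# and CHECK_COMMAND ($3), using the shared settings shown above.
make_service() {
    cat <<EOF
define service {
    host_name               $1
    service_description     $2
    check_period            24x7
    max_check_attempts      4
    normal_check_interval   3
    retry_check_interval    2
    contact_groups          admin
    notification_interval   10
    notification_period     24x7
    notification_options    w,u,c,r
    check_command           $3
}
EOF
}

# Example: reproduce the PING service for web1.
# Single quotes keep the shell from touching '!' and '%'.
# make_service web1 PING 'check_ping!100.0,20%!500.0,60%'
```

Nagios object templates (the `use` directive) are another way to remove this repetition inside the config files themselves.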
Install NRPE:
- [root@localhost etc]# cd /tmp/
- [root@localhost tmp]# tar zxvf nrpe-2.12.tar.gz
- [root@localhost tmp]# cd nrpe-2.12
- [root@localhost nrpe-2.12]# ./configure --prefix=/usr/local/nrpe
- [root@localhost nrpe-2.12]# make
- [root@localhost nrpe-2.12]# make install
- [root@localhost nrpe-2.12]# cp /usr/local/nrpe/libexec/check_nrpe /usr/local/nagios/libexec
- [root@localhost nrpe-2.12]# cp /usr/local/nagios/libexec/check_disk /usr/local/nrpe/libexec
- [root@localhost nrpe-2.12]# cp /usr/local/nagios/libexec/check_load /usr/local/nrpe/libexec
- [root@localhost nrpe-2.12]# cp /usr/local/nagios/libexec/check_ping /usr/local/nrpe/libexec
- [root@localhost nrpe-2.12]# cp /usr/local/nagios/libexec/check_procs /usr/local/nrpe/libexec
- [root@localhost nrpe-2.12]# mkdir /usr/local/nrpe/etc
- [root@localhost nrpe-2.12]# cp sample-config/nrpe.cfg /usr/local/nrpe/etc/
Edit nrpe.cfg. On the monitoring server it can be left unchanged; on a client, change the following line:
allowed_hosts=127.0.0.1
Add the monitoring server's IP address to allowed_hosts.
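On a batch of clients the `allowed_hosts` line can be set with `sed`. This is a minimal sketch. The function name `set_allowed_hosts` is illustrative, and 192.168.10.223 in the usage line is assumed to be the monitoring server because hosts.cfg labels it "Nagios Server".

```shell
# Replace the allowed_hosts line in an nrpe.cfg ($1) with the
# comma-separated address list in $2.
set_allowed_hosts() {
    file=$1
    hosts=$2
    tmp=$(mktemp) || return 1
    # Write to a temp file first, then copy back so the original
    # file keeps its owner and permissions.
    sed "s/^allowed_hosts=.*/allowed_hosts=$hosts/" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example (server address assumed from the hosts.cfg example):
# set_allowed_hosts /usr/local/nrpe/etc/nrpe.cfg 127.0.0.1,192.168.10.223
```

The NRPE daemon has to be restarted before a changed `allowed_hosts` takes effect.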
- [root@localhost nrpe-2.12]# /usr/local/nrpe/bin/nrpe -c /usr/local/nrpe/etc/nrpe.cfg -d
- [root@localhost nrpe-2.12]# ps -ef|grep nrpe
- nagios 4465 1 0 21:02 ? 00:00:00 /usr/local/nrpe/bin/nrpe -c /usr/local/nrpe/etc/nrpe.cfg -d
- root 4467 12877 0 21:02 pts/2 00:00:00 grep nrpe
- [root@localhost nrpe-2.12]# lsof -i:5666
- COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
- nrpe 4465 nagios 4u IPv4 81685 TCP *:5666 (LISTEN)
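The same check can be made from the monitoring server without `lsof`, by trying to open a TCP connection to port 5666. The sketch below relies on bash's `/dev/tcp` pseudo-device, so it assumes a bash built with that feature. The function name `port_open` is illustrative.

```shell
# Return success if a TCP connection to HOST ($1) PORT ($2) can be opened.
# Uses bash's /dev/tcp redirection; the connection closes when the
# child shell exits. A filtered (silently dropped) port may block for a while.
port_open() {
    bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example: check the local NRPE daemon on its default port.
# port_open 127.0.0.1 5666 && echo "nrpe is listening"
```

The `check_nrpe` plugin copied into /usr/local/nagios/libexec earlier goes one step further: run with `-H` and the client address, it reports the NRPE version when the daemon accepts the connection.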
Change the owner and group of the Nagios and NRPE directories:
- [root@localhost local]# chown -R nagios:nagios /usr/local/nagios/*
- [root@localhost local]# chown -R nagios:nagios /usr/local/nrpe/*
8. Start Nagios
- [root@localhost etc]# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
- Nagios Core 3.3.1
- Copyright (c) 2009-2011 Nagios Core Development Team and Community Contributors
- Copyright (c) 1999-2009 Ethan Galstad
- Last Modified: 07-25-2011
- License: GPL
- Website: http://www.nagios.org
- Reading configuration data...
- Read main config file okay...
- Processing object config file '/usr/local/nagios/etc/objects/commands.cfg'...
- Processing object config file '/usr/local/nagios/etc/objects/timeperiods.cfg'...
- Processing object config file '/usr/local/nagios/etc/hosts.cfg'...
- Processing object config file '/usr/local/nagios/etc/hostgroups.cfg'...
- Processing object config file '/usr/local/nagios/etc/contacts.cfg'...
- Processing object config file '/usr/local/nagios/etc/contactgroups.cfg'...
- Processing object config file '/usr/local/nagios/etc/services.cfg'...
- Read object config files okay...
- Running pre-flight check on configuration data...
- Checking services...
- Checked 9 services.
- Checking hosts...
- Checked 1 hosts.
- Checking host groups...
- Checked 1 host groups.
- Checking service groups...
- Checked 0 service groups.
- Checking contacts...
- Checked 2 contacts.
- Checking contact groups...
- Checked 1 contact groups.
- Checking service escalations...
- Checked 0 service escalations.
- Checking service dependencies...
- Checked 0 service dependencies.
- Checking host escalations...
- Checked 0 host escalations.
- Checking host dependencies...
- Checked 0 host dependencies.
- Checking commands...
- Checked 24 commands.
- Checking time periods...
- Checked 5 time periods.
- Checking for circular paths between hosts...
- Checking for circular host and service dependencies...
- Checking global event handlers...
- Checking obsessive compulsive processor commands...
- Checking misc settings...
- Total Warnings: 0
- Total Errors: 0
- Things look okay - No serious problems were detected during the pre-flight check
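When the verification is scripted, the summary line can decide whether Nagios gets restarted. The sketch below reads the `-v` output on standard input and succeeds only if it reports zero errors. The function name `preflight_ok` is hypothetical.

```shell
# Read Nagios pre-flight output on stdin and succeed only if the
# summary reports "Total Errors: 0".
preflight_ok() {
    grep -q '^Total Errors:[[:space:]]*0[[:space:]]*$'
}

# Example with the paths from this guide:
# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg | preflight_ok \
#     && service nagios restart
```

This keeps a broken configuration from taking down a running monitor, because the restart only happens after a clean check.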
- [root@localhost etc]# chkconfig --add nagios    # register nagios as a service
- [root@localhost etc]# chkconfig nagios on    # start the service at boot
- [root@localhost etc]# service nagios start    # start nagios
- [root@localhost etc]# htpasswd -c /usr/local/nagios/etc/htpasswd nagios
- New password:
- Re-type new password:
- Adding password for user nagios
- [root@localhost etc]# echo "/usr/local/nrpe/bin/nrpe -c /usr/local/nrpe/etc/nrpe.cfg -d" >> /etc/rc.local
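If this guide is followed more than once on the same machine, the `echo ... >>` above adds a duplicate line to /etc/rc.local each time. A small guard avoids that. The helper name `append_once` is illustrative.

```shell
# Append LINE ($2) to FILE ($1) only if the file does not already
# contain exactly that line.
append_once() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Example with the NRPE start command from this guide:
# append_once /etc/rc.local "/usr/local/nrpe/bin/nrpe -c /usr/local/nrpe/etc/nrpe.cfg -d"
```

The match is on the whole line, so a differently spelled or commented-out copy of the command does not count as already present.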
Start sendmail so that the alert emails can be delivered:
- [root@localhost etc]# service sendmail start