Monit Monitoring Tool (Client)

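1. Installing Monit

On Red Hat and CentOS, download the RPM package matching your architecture (32-bit or 64-bit) from http://pkgs.repoforge.org/monit/ and install it.

On Debian, install it directly with: apt-get install monit

The latest release is available from the official site: http://mmonit.com/download/

To build from source:

Install the build dependency first: yum install pam-devel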

tar zxvf monit-5.7.tar.gz

cd monit-5.7

./configure --prefix=/usr/local/monit --sysconfdir=/usr/local/monit/etc

make

make install

mkdir -p /usr/local/monit/etc

cp monitrc /usr/local/monit/etc

chmod 600 /usr/local/monit/etc/monitrc

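2. Editing the configuration file

On CentOS with the RPM package, the configuration file is /etc/monit.conf.

On Debian with apt-get, the configuration file is /etc/monit/monitrc.

For a source build, it is under the install directory; with the configure options above, that is /usr/local/monit/etc/monitrc (the file copied in step 1).

Below is a simple example. The stock configuration file already contains many commented examples that are worth reading for reference.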

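set daemon 120 # poll interval: check services every 120 seconds

set logfile /var/log/monit.log # Monit log file

set httpd port 2812 and # Monit has a built-in HTTP server for viewing the monitored services

use address 192.168.10.197 # address the HTTP server binds to; comment this line out to listen on all local IPs

allow 192.168.10.0/24 # allow access from the 192.168.10.0/24 subnet

allow admin:monit # require username admin and password monit

set mailserver smtp.12320.tv port 25 USERNAME "zsxlmonitor" PASSWORD "123456" # mail server and account used to send alerts

# Format of the alert mail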

set mail-format {

from: [email protected]

subject: [From Monit]$SERVICE $EVENT at $DATE

message:Date:$DATE

ServerHost: $HOST

Item:$SERVICE

Problem: $DESCRIPTION.

Action:$ACTION

}

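# Alert recipients

set alert [email protected] with reminder on 3 cycles # while a failure persists, send a reminder every 3 cycles

or

set alert [email protected] # send an alert mail for every event

##################### Monitoring examples #####################

# Check the sshd service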

check process sshd with pidfile /var/run/sshd.pid

start program "/etc/init.d/sshd start"

stop program "/etc/init.d/sshd stop"

if failed port 22 protocol ssh then restart

if 5 restarts within 5 cycles then timeout

# Check the mysql service

check process mysql with pidfile /usr/local/mysql/var/vpser.pid

group database

start program = "/etc/init.d/mysql start"

stop program = "/etc/init.d/mysql stop"

if failed host 127.0.0.1 port 3306 then restart

if 5 restarts within 5 cycles then timeout

# Check the nginx service

check process nginx with pidfile /usr/local/nginx/logs/nginx.pid

start program = "/etc/init.d/nginx start"

stop program = "/etc/init.d/nginx stop"

if failed host localhost port 80 protocol http

then restart

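To run a custom script instead of a restart:

if failed host localhost port 80 protocol http then exec "/usr/bin/restart.sh"

The pidfile path and the start and stop commands in the configuration must be absolute paths, and their arguments must be correct; otherwise Monit cannot check or start the service properly.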
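For context, the exec line belongs inside a check block. The following is a minimal sketch that reuses the nginx check from the examples; /usr/bin/restart.sh is the script named in the text, and what it does is not shown there, so treat it as a placeholder for your own recovery script.

```
# Sketch: nginx check that runs a custom script instead of "restart".
# /usr/bin/restart.sh is the script named in the text; its contents are an assumption.
check process nginx with pidfile /usr/local/nginx/logs/nginx.pid
  start program = "/etc/init.d/nginx start"
  stop program = "/etc/init.d/nginx stop"
  if failed host localhost port 80 protocol http
    then exec "/usr/bin/restart.sh"
```

With exec, Monit only runs the script; bringing the service back is presumably the script's own job.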

If the httpd section is enabled, Monit can be managed directly at http://192.168.10.197:2812, where the running status of each monitored service is shown.

3. Starting and stopping Monit

/usr/local/monit/bin/monit # start

/usr/local/monit/bin/monit quit # stop

Optional action arguments for non-daemon mode are as follows:

start all - Start all services

start name - Only start the named service

stop all - Stop all services

stop name - Only stop the named service

restart all - Stop and start all services

restart name - Only restart the named service

monitor all - Enable monitoring of all services

monitor name - Only enable monitoring of the named service

unmonitor all - Disable monitoring of all services

unmonitor name - Only disable monitoring of the named service

reload - Reinitialize monit

status - Print full status information for each service

summary - Print short status information for each service

quit - Kill monit daemon process

validate - Check all services and start if not running

procmatch - Test process matching pattern

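4. Configuration syntax

4.1 Check types

Monitor a process [a running program, identified by its pidfile or by a pattern match]

CHECK PROCESS <unique name> <PIDFILE <path> | MATCHING <regex>>

Monitor a file

CHECK FILE <unique name> PATH <path>

CHECK FIFO <unique name> PATH <path>

Monitor a filesystem

CHECK FILESYSTEM <unique name> PATH <path>

Monitor a directory

CHECK DIRECTORY <unique name> PATH <path>

Monitor a remote host

CHECK HOST <unique name> ADDRESS <host address>

Monitor the system

CHECK SYSTEM <unique name>

Monitor a program

CHECK PROGRAM <unique name> PATH <executable file> [TIMEOUT <number> SECONDS]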

4.2 Actions [action]

ALERT Send an alert

RESTART Restart the service [using the defined start program and stop program: stop first, then start]

START Start the service [runs the defined start program]

STOP Stop the service [runs the defined stop program]

EXEC Run a script [runs the script given in double quotes (absolute path)]

UNMONITOR Stop monitoring the service

4.3 Resource items [RESOURCE]

--PROCESS

CPU([user|system|wait]) CPU usage breakdown

CPU CPU usage <%>

TOTALCPU CPU usage including child processes <%>

SWAP Swap usage < Byte, kB, MB, GB >

CHILDREN Number of child processes

MEMORY Memory usage < Byte, kB, MB, GB >

TOTALMEMORY Memory usage including child processes < Byte, kB, MB, GB >

LOADAVG([1min|5min|15min]) System load average

UPTIME Uptime < "SECONDS", "MINUTES", "HOURS", or "DAYS" >

--FILE

SIZE Size < "B","KB","MB","GB" >

PERMISSION Permissions

UID

GID

PID

PPID

TIMESTAMP Timestamp < "SECONDS", "MINUTES", "HOURS", or "DAYS" >

--FILESYSTEM

USAGE Amount used

SPACE Space usage [value]

INODE Inode usage <count or %>

--HOST

host [] Host [IP address or domain name]

port [] Port [number]

type [] Transport protocol [TCP|UDP|TCPSSL]

protocol [] Service protocol [APACHE-STATUS DNS DWP FTP GPS HTTP IMAP CLAMAV LDAP2 LDAP3 LMTP MEMCACHE MYSQL NNTP NTP3 POP POSTFIX-POLICY RADIUS RDATE RSYNC SIP SMTP SSH TNS PGSQL]

The HTTP protocol supports in addition:

REQUEST

HOSTHEADER

CHECKSUM

The Apache-status protocol supports in addition:

logging (loglimit)

closing connections (closelimit)

performing DNS lookups (dnslimit)

in keepalive with a client (keepalivelimit)

replying to a client (replylimit)

receiving a request (requestlimit)

initialising (startlimit)

waiting for incoming connections (waitlimit)

gracefully closing down (gracefullimit)

performing cleanup procedures (cleanuplimit)

-- PROGRAM

status Exit status of the program
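As an illustration of the PROGRAM check type together with the status test, here is a minimal sketch. The check name disk_health and the script path are hypothetical placeholders, not part of the original setup; the script is assumed to exit with 0 on success.

```
# Hypothetical example: the check name and script path are placeholders.
# Monit runs the script each cycle and tests its exit status.
check program disk_health with path "/usr/local/bin/check_disk.sh" timeout 60 seconds
  if status != 0 then alert
```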

4.4 Tests [TEST]

FAILED [RESOURCE] The item's value indicates a failure

CHANGED [RESOURCE] The item has changed

EXIST [item] The item exists

DOES NOT EXIST The item does not exist
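A short sketch of how these tests read in practice, using the control file from the source build in step 1 as the monitored file. The check name monitrc is an arbitrary label chosen for this illustration.

```
# Sketch: watch Monit's own control file (path taken from the source build above).
check file monitrc with path /usr/local/monit/etc/monitrc
  if does not exist then alert          # DOES NOT EXIST test
  if changed checksum then alert        # CHANGED test on the file content
  if changed timestamp then alert       # CHANGED test on the modification time
  if failed permission 600 then alert   # FAILED test against an expected value
```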

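4.5 Comparison operators

"<", ">", "!=", "=="

"gt", "lt", "eq", "ne"

"greater", "less", "equal", "notequal"

Each operator is followed by the value to compare against.

4.6 Check schedule

EVERY [number] CYCLES

every 2 cycles # check once every 2 cycles

EVERY [cron]

every "* * * * *" # fields: minute, hour, day of month, month, day of week; * matches any value, x-y is the range x to y, and "," separates individual values

every "* 8-19 * * 1-5" # Monday to Friday, between 08:00 and 19:59, check at the normal cycle interval

NOT EVERY [cron] The inverse of EVERY [cron]: the check is skipped during the matching times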
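The every statement goes inside a check block. The sketch below shows the three forms side by side; the check names and the report script path are hypothetical, while the pidfile paths are borrowed from the earlier examples.

```
# Form 1: run this check only on every second cycle.
check process web_port with pidfile /usr/local/nginx/logs/nginx.pid
  every 2 cycles
  if failed host localhost port 80 protocol http then alert

# Form 2: run only Monday to Friday during hours 8 through 19.
check program office_hours_job with path "/usr/local/bin/report.sh"
  every "* 8-19 * * 1-5"
  if status != 0 then alert

# Form 3: skip the check on Sundays during hours 0 through 3
# (for example, a maintenance window).
check process db_port with pidfile /usr/local/mysql/var/vpser.pid
  not every "* 0-3 * * 0"
  if failed host 127.0.0.1 port 3306 then alert
```

Check names must be unique within a control file, which is why these use names different from the earlier nginx and mysql checks.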

5. Example configurations

1. System performance

# Monitor system performance; the check is named myhost

check system myhost

# Alert if the 1-minute load average exceeds 4

if loadavg (1min) > 4 then alert

# Alert if the 5-minute load average exceeds 2

if loadavg (5min) > 2 then alert

# Alert if total memory usage exceeds 75%

if memory usage > 75% then alert

# Alert if swap usage exceeds 25%

if swap usage > 25% then alert

# Alert if CPU (user) usage exceeds 70%

if cpu usage (user) > 70% then alert

# Alert if CPU (system) usage exceeds 30%

if cpu usage (system) > 30% then alert

# Alert if CPU (wait) usage exceeds 20%

if cpu usage (wait) > 20% then alert

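2. Disk monitoring

# Monitor the filesystem on /dev/sdb1; the check is named datafs

check filesystem datafs with path /dev/sdb1

# Mount and unmount commands for the filesystem; uncomment to enable

#start program = "/bin/mount /data"

#stop program = "/bin/umount /data"

# Stop monitoring if the permissions are not 660

if failed permission 660 then unmonitor

# Stop monitoring if the owner (UID) is not root

if failed uid root then unmonitor

# Stop monitoring if the group (GID) is not disk

if failed gid disk then unmonitor

# Alert if space usage exceeds 80% in 5 of the last 15 cycles

if space usage > 80% for 5 times within 15 cycles then alert

# Stop the service (unmount, if the stop program above is enabled) if space usage exceeds 99%; disabled here

#if space usage > 99% then stop

# Alert if more than 30000 inodes are in use

if inode usage > 30000 then alert

# Stop the service (unmount, if the stop program above is enabled) if inode usage exceeds 99%

if inode usage > 99% then stop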

3. File monitoring

# Monitor the file /data/mydatabase.db; the check is named database

check file database with path /data/mydatabase.db

# Alert if the file permissions are not 700

if failed permission 700 then alert

# Alert if the file owner (UID) is not data

if failed uid data then alert

# Alert if the file group (GID) is not data

if failed gid data then alert

# Alert if the file's timestamp is more than 15 minutes old

if timestamp > 15 minutes then alert

# Run a script if the file is larger than 100 MB

if size > 100 MB then exec "/my/cleanup/script" as uid dba and gid dba

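4. Directory monitoring

# Monitor the directory /bin; the check is named bin

check directory bin with path /bin

# Stop monitoring if the directory permissions are not 755

# if failed permission 755 then unmonitor

# Stop monitoring if the directory owner (UID) is not 0

# if failed uid 0 then unmonitor

# Stop monitoring if the directory group (GID) is not 0

# if failed gid 0 then unmonitor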

5. Process monitoring

# Monitor a process by its pidfile /usr/local/apache/logs/httpd.pid; the check is named Apache

check process Apache with pidfile /usr/local/apache/logs/httpd.pid

start program = "/usr/local/apache/bin/httpd -k start"

stop program = "/usr/local/apache/bin/httpd -k stop"

# Alert if the process CPU usage stays above 60% for 5 cycles

if cpu > 60% for 5 cycles then alert

# Restart if the process CPU usage stays above 80% for 10 cycles

if cpu > 80% for 10 cycles then restart

# Restart if total memory usage (including children) stays above 200 MB for 5 cycles

# if totalmem > 200.0 MB for 5 cycles then restart

# Alert if the number of child processes exceeds 200 in 3 of the last 5 cycles

if children > 200 for 3 times within 5 cycles then alert

# Restart if the number of child processes exceeds 500 in 5 of the last 15 cycles

if children > 500 for 5 times within 15 cycles then restart

# Stop the service if the 5-minute load average stays above 10 for 8 cycles

#if loadavg(5min) greater than 10 for 8 cycles then stop

# Restart if the HTTP check on 127.0.0.1 port 80 times out (5 s) or fails in 5 of the last 10 cycles

if failed host 127.0.0.1 port 80 protocol http for 5 times within 10 cycles then restart

# Alert if the response from http://127.0.0.1/check.php does not contain "ok"

if failed url http://127.0.0.1/check.php

and content == 'ok'

then alert

# Restart if requesting /somefile.html from 127.0.0.1:80 fails

if failed host 127.0.0.1 port 80 protocol http and request "/somefile.html" then restart

# Send a raw request to the host and alert if the response does not match the expected pattern

if failed host 127.0.0.1 port 80

send "GET / HTTP/1.0\r\nHost: 127.0.0.1\r\n\r\n"

expect "HTTP/[0-9\.]{3} 200 OK"

then alert

# Alert based on apache-status values

if failed host 127.0.0.1 port 80 protocol apache-status

loglimit > 10% or

dnslimit > 50% or

waitlimit < 20%

then alert

# Restart if the HTTP check over SSL on port 443 fails or exceeds the 15-second timeout

#if failed port 443 type tcpssl protocol http with timeout 15 seconds then restart

# Treat the service as timed out after 3 restarts within 5 cycles

#if 3 restarts within 5 cycles then timeout

#depends on apache_bin

#group server
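The commented-out depends on apache_bin line refers to a second check named apache_bin, which this example does not define. A minimal sketch of what such a companion check might look like follows; the path reuses the httpd binary from the start program, and the expected permission and ownership values are assumptions to adjust for your system.

```
# Hypothetical companion check for "depends on apache_bin".
# The expected mode and owner below are assumptions, not values from the original.
check file apache_bin with path /usr/local/apache/bin/httpd
  if failed permission 755 then unmonitor
  if failed uid root then unmonitor
  if failed gid root then unmonitor
```

With the dependency enabled, Monit treats the Apache process check as depending on this file check, so a problem with the binary also affects the process check.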

6. Host monitoring

check host myserver with address 192.168.1.1

if failed icmp type echo count 3 with timeout 3 seconds then alert

if failed port 3306 protocol mysql with timeout 15 seconds then alert

if failed url http://user:[email protected]:8080/?querystring and content == 'action="j_security_check"' then alert
