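On Windows, HD Tune shows disk status, so you do not only find out about a problem after the disk has already died. On CentOS, SMART (Self-Monitoring, Analysis and Reporting Technology), read through smartmontools, provides the same kind of disk status checks.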
http://www.smartmontools.org/
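The examples below use a Dell R720 server: /dev/sda is an ordinary 1 TB disk with a SCSI interface, and /dev/sdd is a RAID 5 array built from three disks.
# df -h    # list the disk names
# dmesg | grep sdd    # look up the disk in the boot messages
sd 0:2:0:0: [sdd] Attached SCSI disk
# hdparm -I /dev/sda    # show disk hardware information, enabled features and so on; the output is very detailed
Now check the disk status with smartctl: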
# yum install smartmontools    # install smartmontools
# smartctl -H /dev/sdd    # check disk health
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.10.56-11.el6.centos.alt.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

SMART Health Status: OK
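The health line is worded differently depending on the device type: SCSI/SAS devices print "SMART Health Status: OK" as shown above, while ATA/SATA devices print "SMART overall-health self-assessment test result: PASSED". A minimal sketch of a helper that extracts the verdict from either form follows; the function name parse_health is only illustrative and is not part of smartmontools.

```shell
#!/bin/bash
# Print the health verdict found in `smartctl -H` output read from stdin.
#   SCSI/SAS : SMART Health Status: OK
#   ATA/SATA : SMART overall-health self-assessment test result: PASSED
# Prints nothing if neither line is present.
parse_health() {
    awk -F': *' '
        /^SMART Health Status:/                               { print $2; exit }
        /^SMART overall-health self-assessment test result:/  { print $2; exit }
    '
}

# Usage on a real system (needs root):
#   smartctl -H /dev/sda | parse_health
```

An empty result means smartctl printed no health line at all, which is worth treating differently from a failing disk.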
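# smartctl -a /dev/sda    # or: smartctl --all /dev/sda; prints all SMART information for the disk (note the lowercase -a; uppercase -A prints only the attribute table)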
# smartctl -a /dev/sdd
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.10.56-11.el6.centos.alt.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor: DELL
Product: PERC H310
Revision: 2.12
User Capacity: 598,879,502,336 bytes [598 GB]
Logical block size: 512 bytes
Logical Unit id:
Serial number:
Device type: disk
Local Time is: Wed Jan 14 15:37:39 2015 CST
Device does not support SMART

Error Counter logging not supported
Device does not support Self Test logging
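The output says "Device does not support SMART": /dev/sdd is the logical volume presented by the PERC H310 controller, not a physical disk. The physical disks behind the controller have to be queried with the megaraid device type instead.
Check the first disk of the RAID 5 array: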
# smartctl -a -d megaraid,0 /dev/sdd
Check the second and third disks the same way, then hook the result into Nagios or Zabbix alerting according to your own monitoring setup:
# smartctl -a -d megaraid,1 /dev/sdd
# smartctl -a -d megaraid,2 /dev/sdd
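The three commands above differ only in the disk index, so they can be wrapped in a loop. The following is a sketch under assumptions: the number of physical disks is known in advance and the indexes start at 0 and are contiguous, as on the server in this post. The function name megaraid_health and the overridable SMARTCTL variable are illustrative.

```shell
#!/bin/bash
# Path to smartctl; can be overridden from the environment (handy for dry runs).
SMARTCTL="${SMARTCTL:-/usr/sbin/smartctl}"

# Print one health line per physical disk behind a MegaRAID/PERC controller.
#   $1 = block device on the controller, e.g. /dev/sdd
#   $2 = number of physical disks (assumed known; indexes 0 .. $2-1)
megaraid_health() {
    dev="$1"
    count="$2"
    n=0
    while [ "$n" -lt "$count" ]; do
        # Keep only the health verdict line (SCSI or ATA wording).
        line=$("$SMARTCTL" -H -d "megaraid,$n" "$dev" \
            | grep -E '^SMART (Health Status|overall-health)' || true)
        echo "megaraid,$n: ${line:-no health line found}"
        n=$((n + 1))
    done
}

# Usage on the R720 described here (needs root):
#   megaraid_health /dev/sdd 3
```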
For other smartctl usage, the built-in help is very detailed:
# smartctl -h
Usage: smartctl [options] device

============================================ SHOW INFORMATION OPTIONS =====
  -h, --help, --usage
         Display this help and exit
  -V, --version, --copyright, --license
         Print license, copyright, and version information and exit
  -i, --info
         Show identity information for device
  -g NAME, --get=NAME
         Get device setting: all, aam, apm, lookahead, security, wcache
  -a, --all
         Show all SMART information for device
  -x, --xall
         Show all information for device
  --scan
         Scan for devices
  --scan-open
         Scan for devices and try to open each device

================================== SMARTCTL RUN-TIME BEHAVIOR OPTIONS =====
  -q TYPE, --quietmode=TYPE (ATA)
         Set smartctl quiet mode to one of: errorsonly, silent, noserial
  -d TYPE, --device=TYPE
         Specify device type to one of: ata, scsi, sat[,auto][,N][+TYPE], usbcypress[,X], usbjmicron[,x][,N], usbsunplus, marvell, areca,N/E, 3ware,N, hpt,L/M/N, megaraid,N, cciss,N, auto, test
  -T TYPE, --tolerance=TYPE (ATA)
         Tolerance: normal, conservative, permissive, verypermissive
  -b TYPE, --badsum=TYPE (ATA)
         Set action on bad checksum to one of: warn, exit, ignore
  -r TYPE, --report=TYPE
         Report transactions (see man page)
  -n MODE, --nocheck=MODE (ATA)
         No check if: never, sleep, standby, idle (see man page)

============================== DEVICE FEATURE ENABLE/DISABLE COMMANDS =====
  -s VALUE, --smart=VALUE
         Enable/disable SMART on device (on/off)
  -o VALUE, --offlineauto=VALUE (ATA)
         Enable/disable automatic offline testing on device (on/off)
  -S VALUE, --saveauto=VALUE (ATA)
         Enable/disable Attribute autosave on device (on/off)
  -s NAME[,VALUE], --set=NAME[,VALUE]
         Enable/disable/change device setting: aam,[N|off], apm,[N|off], lookahead,[on|off], security-freeze, standby,[N|off|now], wcache,[on|off]

======================================= READ AND DISPLAY DATA OPTIONS =====
  -H, --health
         Show device SMART health status
  -c, --capabilities (ATA)
         Show device SMART capabilities
  -A, --attributes
         Show device SMART vendor-specific Attributes and values
  -f FORMAT, --format=FORMAT (ATA)
         Set output format for attributes: old, brief, hex[,id|val]
  -l TYPE, --log=TYPE
         Show device log. TYPE: error, selftest, selective, directory[,g|s], xerror[,N][,error], xselftest[,N][,selftest], background, sasphy[,reset], sataphy[,reset], scttemp[sts,hist], scttempint,N[,p], scterc[,N,M], devstat[,N], ssd, gplog,N[,RANGE], smartlog,N[,RANGE]
  -v N,OPTION , --vendorattribute=N,OPTION (ATA)
         Set display OPTION for vendor Attribute N (see man page)
  -F TYPE, --firmwarebug=TYPE (ATA)
         Use firmware bug workaround: none, samsung, samsung2, samsung3, swapid
  -P TYPE, --presets=TYPE (ATA)
         Drive-specific presets: use, ignore, show, showall
  -B [+]FILE, --drivedb=[+]FILE (ATA)
         Read and replace [add] drive database from FILE [default is +/etc/smart_drivedb.h and then /usr/share/smartmontools/drivedb.h]

============================================ DEVICE SELF-TEST OPTIONS =====
  -t TEST, --test=TEST
         Run test. TEST: offline, short, long, conveyance, force, vendor,N, select,M-N, pending,N, afterselect,[on|off]
  -C, --captive
         Do test in captive mode (along with -t)
  -X, --abort
         Abort any non-captive test on device

=================================================== SMARTCTL EXAMPLES =====
  smartctl --all /dev/hda
         (Prints all SMART information)
  smartctl --smart=on --offlineauto=on --saveauto=on /dev/hda
         (Enables SMART on first disk)
  smartctl --test=long /dev/hda
         (Executes extended disk self-test)
  smartctl --attributes --log=selftest --quietmode=errorsonly /dev/hda
         (Prints Self-Test & Attribute errors)
  smartctl --all --device=3ware,2 /dev/sda
  smartctl --all --device=3ware,2 /dev/twe0
  smartctl --all --device=3ware,2 /dev/twa0
  smartctl --all --device=3ware,2 /dev/twl0
         (Prints all SMART info for 3rd ATA disk on 3ware RAID controller)
  smartctl --all --device=hpt,1/1/3 /dev/sda
         (Prints all SMART info for the SATA disk attached to the 3rd PMPort of the 1st channel on the 1st HighPoint RAID controller)
  smartctl --all --device=areca,3/1 /dev/sg2
         (Prints all SMART info for 3rd ATA disk of the 1st enclosure on Areca RAID controller)
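Besides its text output, smartctl reports problems through its exit status, which the man page describes as a bitmask (bits 0 to 7). The sketch below decodes such a status into readable lines. The wording of each message is paraphrased from the man page rather than quoted, and the function name decode_smartctl_status is illustrative, so check the man page of the installed version before relying on it.

```shell
#!/bin/bash
# Decode a smartctl exit status (bitmask) into one line per set bit.
# Bit meanings are paraphrased from the smartctl man page.
decode_smartctl_status() {
    status="$1"
    bit=0
    while [ "$bit" -le 7 ]; do
        if [ $(( status & (1 << bit) )) -ne 0 ]; then
            case "$bit" in
                0) msg="command line did not parse" ;;
                1) msg="device open failed" ;;
                2) msg="a SMART command failed or a checksum error was found" ;;
                3) msg="SMART status check returned DISK FAILING" ;;
                4) msg="prefail attributes at or below threshold" ;;
                5) msg="attributes were at or below threshold at some time in the past" ;;
                6) msg="device error log contains records of errors" ;;
                7) msg="self-test log contains records of errors" ;;
            esac
            echo "bit $bit: $msg"
        fi
        bit=$((bit + 1))
    done
}

# Usage on a real system (needs root):
#   smartctl -H /dev/sda >/dev/null; decode_smartctl_status $?
```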
http://linux-wiki.cn/wiki/zh-hans/SSD_(%E5%9B%BA%E6%80%81%E7%A1%AC%E7%9B%98)
Nagios setup
The script below checks the RAID 5 array, three disks in total.
root@web: /usr/local/nagios/libexec # vim check_disk_status.sh
#!/bin/bash
# Check the SMART health of each physical disk behind the MegaRAID (PERC) controller.
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
SMARTCTL="/usr/sbin/smartctl"
CHECK_DISK="/dev/sda"

# megaraid,0 .. megaraid,2 are the three physical disks of the RAID 5 array
for N in 0 1 2; do
    DISK_HEALTH=$($SMARTCTL -a -d megaraid,$N $CHECK_DISK | grep "SMART Health Status" | awk '{print $4}')
    if [ "$DISK_HEALTH" = "OK" ] || [ "$DISK_HEALTH" = "PASSED" ]; then
        echo "OK - $CHECK_DISK $((N + 1)) status is $DISK_HEALTH"
    else
        echo "CRITICAL - $CHECK_DISK $((N + 1)) status is $DISK_HEALTH"
        exit $STATE_CRITICAL
    fi
done
exit $STATE_OK

# chmod 755 check_disk_status.sh
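One refinement worth considering: if smartctl prints no health line at all (wrong device type, missing permissions), the extracted string is empty, and reporting that as CRITICAL hides the real cause. Nagios plugins conventionally return 3 for UNKNOWN. A minimal sketch of that mapping follows; health_to_state is a hypothetical helper, not part of the script above.

```shell
#!/bin/bash
# Conventional Nagios plugin exit codes
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

# Map a health string extracted from smartctl output to a plugin exit code.
health_to_state() {
    case "$1" in
        OK|PASSED) return "$STATE_OK" ;;
        "")        return "$STATE_UNKNOWN" ;;   # no health line was found
        *)         return "$STATE_CRITICAL" ;;  # anything else is a failure report
    esac
}

# Usage inside a check script:
#   health_to_state "$DISK_HEALTH" || exit $?
```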
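vim /usr/local/nagios/etc/nrpe.cfg
command[check_disk_status]=/usr/bin/sudo /usr/local/nagios/libexec/check_disk_status.sh
The command goes through sudo because /usr/sbin/smartctl must run as root to read the disk status. In sudoers, comment out requiretty (NRPE runs the command without a terminal) and allow the nagios user to run the script without a password:
vim /etc/sudoers
#Defaults requiretty
nagios ALL=(ALL) NOPASSWD:/usr/local/nagios/libexec/check_disk_status.sh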
Test it by running the command on the Nagios server:
root@nagios: /usr/local/nagios/libexec # ./check_nrpe -H 192.168.2.2 -c check_disk_status
OK - /dev/sda 1 status is OK
OK - /dev/sda 2 status is OK
OK - /dev/sda 3 status is OK
Define the Nagios service:
define service{
        use                     linux-service
        host_name               192_168_2_2
        service_description     check disk status
        check_command           check_nrpe!check_disk_status
}
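Then set the check interval to once a day, so the disks are not scanned constantly, which is not good for them either.
Reference: http://blog.chinaunix.net/uid-20592013-id-2436813.html
Run a script and send mail
The simplest approach is to add the script to crontab and just read the mail it sends. The script follows.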