Monitoring RAID controllers and disks in Nagios with check_megaraid_sas (a plugin based on the MegaCli tool)

If your RAID is built on an LSI MegaRAID controller, you can monitor the controller and its disks with LSI's MegaCli tool. Note: Dell PERC 5/6 (PowerEdge RAID Controller) cards are in fact LSI MegaRAID SAS controllers.
Download the latest MegaCli package from:
http://www.lsi.com/Search/Pages/results.aspx?k=megacli&r=assettype%3D%22AQ1NaXNjZWxsYW5lb3VzCWFzc2V0dHlwZQEBXgEk%22%20os%3D%22AQVMaW51eAJvcwEBXgEk%22

1. Prerequisites

1) Check the server model

# dmidecode -s system-product-name                 (newer dmidecode versions)
or
# dmidecode | grep "Product Name"                   (older dmidecode versions)
Lenovo WQ R520 G7
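On older dmidecode versions, grepping the full output for "Product Name" usually returns more than one line, because the Base Board block has its own Product Name. A minimal sketch for scripts, assuming older versions reject -s with a non-zero exit status and that the System Information block comes first (typical, not guaranteed); the function names are hypothetical.

```shell
# Hypothetical helper: print the first "Product Name:" value from full
# dmidecode output on stdin (normally the System Information block).
first_product_name() {
    sed -n 's/^[[:space:]]*Product Name:[[:space:]]*//p' | head -n 1
}

# Hypothetical wrapper: prefer `dmidecode -s`, fall back to parsing.
# Assumes old dmidecode versions exit non-zero when -s is unsupported.
server_model() {
    dmidecode -s system-product-name 2>/dev/null || dmidecode | first_product_name
}
```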

2) Confirm that the server uses a MegaRAID controller

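--HP ProLiant servers mostly use Smart Array controllers, so this method does not apply to them.

--Lenovo WanQuan servers may show the following (apparently not on every model):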
# dmesg | grep RAID
scsi0 : LSI SAS based MegaRAID driver
  Vendor: LSI       Model: MegaRAID 8300XLP  Rev: 2.02
md: Autodetecting RAID arrays.

--IBM System x servers may show:
# dmesg | grep RAID
scsi0 : LSI SAS based MegaRAID driver
  Vendor: IBM       Model: ServeRAID M5015   Rev: 2.0.
md: Autodetecting RAID arrays.

--Dell PowerEdge servers may show:
# dmesg | grep RAID
scsi0 : LSI Logic SAS based MegaRAID driver
md: Autodetecting RAID arrays.
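All three MegaRAID examples above contain the phrase "MegaRAID driver", so a script can test for it. A minimal sketch; the function name is hypothetical. Keep in mind that the kernel ring buffer is finite, so on a long-running host the boot messages may already have been overwritten, which could be one reason the check does not work everywhere.

```shell
# Hypothetical helper: succeed if kernel log text on stdin mentions the
# LSI MegaRAID SAS driver, as in the Lenovo, IBM and Dell examples.
has_megaraid_driver() {
    grep -q 'MegaRAID driver'
}

# Usage on a live host:
#   dmesg | has_megaraid_driver && echo "MegaRAID controller found"
```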

3) Check whether MegaCli is already installed

# rpm -qa | egrep 'Lib_Utils|MegaCli'
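The egrep above lists what is installed; to see what is still missing, the same rpm -qa output can be checked per package. A minimal sketch; the function name is hypothetical, and it assumes rpm -qa prints NAME-VERSION-RELEASE so that each package name is followed by "-".

```shell
# Hypothetical helper: read `rpm -qa` output on stdin and print each
# required package (Lib_Utils, MegaCli) that does not appear in it.
missing_megacli_pkgs() {
    installed=$(cat)
    for pkg in Lib_Utils MegaCli; do
        printf '%s\n' "$installed" | grep -q "^$pkg-" || echo "$pkg"
    done
}

# Usage: rpm -qa | missing_megacli_pkgs
```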

2. Install MegaCli

Downloading and installing the latest MegaCli is recommended, since newer versions can monitor more types of SAS disks.
# cd /tmp
# unzip 8.01.06_Linux_MegaCLI.zip                 (unpack the MegaCli package)
Archive:  8.01.06_Linux_MegaCLI.zip
  inflating: readme.txt              
  inflating: 8.01.06_Linux_MegaCLI.txt  
 extracting: MegaCliLin.zip          
# unzip MegaCliLin.zip                            (then unpack the inner MegaCliLin package)
Archive:  MegaCliLin.zip
  inflating: Lib_Utils-1.00-08.noarch.rpm  
replace readme.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: readme.txt              
  inflating: MegaCli-8.01.06-1.i386.rpm  
MegaCli-8.01.06-1.i386.rpm is the package we need (the same package is used on both 32-bit and 64-bit systems). If the system is missing MegaCli's dependencies, install Lib_Utils-1.00-08.noarch.rpm first:
# rpm -ivh Lib_Utils-1.00-08.noarch.rpm
# rpm -Uvh MegaCli-8.01.06-1.i386.rpm
# rpm -ql MegaCli                                 (list the files installed by the MegaCli package)
/opt/MegaRAID/MegaCli/MegaCli
/opt/MegaRAID/MegaCli/MegaCli64
On a 32-bit system use MegaCli; on a 64-bit system use MegaCli64.
# /opt/MegaRAID/MegaCli/MegaCli                   (run without arguments, it prints the error below)
or
# /opt/MegaRAID/MegaCli/MegaCli64                 (run without arguments, it prints the error below)
Fatal error - Command Tool invoked with wrong parameters
Exit Code: 0x01

3. Test MegaCli

# arch                                            (determine the OS architecture)
x86_64
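The binary names mix upper and lower case with digits, and the path is long, so it is convenient to create a symlink in /usr/bin:
# ln -sf /opt/MegaRAID/MegaCli/MegaCli /usr/bin/megacli            (32-bit system)
or
# ln -sf /opt/MegaRAID/MegaCli/MegaCli64 /usr/bin/megacli          (64-bit system)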
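The choice between the two symlink commands depends only on the architecture, so it can be automated. A minimal sketch using the paths from rpm -ql MegaCli above; the function name is hypothetical.

```shell
# Hypothetical helper: print the MegaCli binary for a `uname -m`/`arch`
# value, using the install paths listed by `rpm -ql MegaCli`.
megacli_for_arch() {
    case "$1" in
        x86_64) echo /opt/MegaRAID/MegaCli/MegaCli64 ;;
        i?86)   echo /opt/MegaRAID/MegaCli/MegaCli ;;
        *)      echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
}

# Usage (as root); ln only runs if the architecture was recognised:
#   bin=$(megacli_for_arch "$(uname -m)") && ln -sf "$bin" /usr/bin/megacli
```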

Now you can run MegaCli directly through the symlink:

# megacli -help                                   (show command help)

# megacli -adpCount                               (show the number of adapters)
# megacli -LdGetNum -aALL                         (show the number of logical drives)
# megacli -LdInfo -LALL -aAll                     (show all logical drive info; example from an IBM x3650)
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name               :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 1.086 TB
State               : Optimal
Strip Size          : 128 KB
Number Of Drives per span:4                     //each span (RAID1 group) contains 4 physical drives
Span Depth          : 2                         //2 RAID1 spans are striped together to form RAID10
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Access Policy       : Read/Write
Disk Cache Policy   : Disabled
Encryption Type     : None
Default Power Savings Policy: Controller Defined
Current Power Savings Policy: None
Can spin up in 1 minute: Yes
LD has drives that support T10 power conditions: Yes
LD's IO profile supports MAX power savings with cached writes: Yes
Exit Code: 0x00
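As the comments above explain, the number of physical drives in a spanned volume is drives per span times span depth (4 x 2 = 8 here). A minimal sketch that reads -LdInfo output; the function name is hypothetical, and it assumes the field labels shown above (the "Number Of Drives" prefix is matched so a label without "per span" would also be picked up).

```shell
# Hypothetical helper: read `megacli -LdInfo -LALL -aAll` output on stdin
# and print, per virtual drive, drives-per-span x span depth.
ld_drive_counts() {
    awk -F':' '
        /^Virtual Drive:/    { split($2, a, " "); vd = a[1] }
        /^Number Of Drives/  { per = $2 + 0 }
        /^Span Depth/        { depth = $2 + 0
                               printf "VD %s: %d drives (%d per span x %d spans)\n", vd, per * depth, per, depth }
    '
}

# Usage: megacli -LdInfo -LALL -aAll | ld_drive_counts
```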
# megacli -PdList -aAll | more                    (show all physical drive info; example from an IBM x3650)
Adapter #0
Enclosure Device ID: 252
Slot Number: 0
Enclosure position: 0
Device Id: 8
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 279.396 GB [0x22ecb25c Sectors]
Non Coerced Size: 278.896 GB [0x22dcb25c Sectors]
Coerced Size: 278.464 GB [0x22cee000 Sectors]
Firmware state: Online, Spun Up
SAS Address(0): 0x5000cca015512ae5
SAS Address(1): 0x0
Connected Port Number: 1(path0)
Inquiry Data: IBM-ESXSCBRCA300C3ETS0 NC610PFWEMUBECCXSA610    
IBM FRU/CRU: 42D0638     
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive:  Not Certified
Drive Temperature :38C (100.40 F)
...
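With many drives, scrolling through -PdList is tedious; the "Firmware state" line is the one to watch. A minimal sketch that prints every drive whose state does not start with "Online"; the function name is hypothetical. Hot spares and unconfigured drives will also be listed, which may or may not be what you want.

```shell
# Hypothetical helper: read `megacli -PdList -aAll` output on stdin and
# print enclosure:slot and firmware state for drives not "Online".
offline_drives() {
    awk -F': ' '
        /^Enclosure Device ID:/ { enc = $2 }
        /^Slot Number:/         { slot = $2 }
        /^Firmware state:/      { if ($2 !~ /^Online/) print enc ":" slot " " $2 }
    '
}

# Usage: megacli -PdList -aAll | offline_drives
```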
# megacli -cfgdsply -aALL | more                  (show the controller model, RAID configuration and disk info)
# megacli -FwTermLog -Dsply -aALL | more          (view the controller firmware log)
# megacli -AdpAllInfo -aALL | more                (show detailed controller information and capabilities)

4. Install check_megaraid_sas

check_megaraid_sas is a Nagios plugin, written in Perl, that gathers its monitoring data by calling MegaCli.
Download: http://www.techno-obscura.com/~delgado/code/check_megaraid_sas

# cd /tmp
# vi check_megaraid_sas
-------------------------------------------------------------------------
# Line 35: change to
use lib qw(/usr/local/nagios/libexec); # possible pathes to your Nagios plugins and utils.pm
# Lines 52-53: change to
my $megaclibin = '/usr/bin/megacli';  # the full path to your MegaCli binary
my $megacli = "$megaclibin";      # how we actually call MegaCli
-------------------------------------------------------------------------
# cp check_megaraid_sas /usr/local/nagios/libexec/check_megaraid_sas
# chmod 755 /usr/local/nagios/libexec/check_megaraid_sas
# /usr/local/nagios/libexec/check_megaraid_sas -h                   (show usage help)
Usage: /usr/local/nagios/libexec/check_megaraid_sas [-s number] [-m number] [-o number]
       -s is how many hotspares are attached to the controller
       -m is the number of media errors to ignore
       -p is the predictive error count to ignore
       -o is the number of other disk errors to ignore
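Before wiring the plugin into Nagios, it is worth confirming that what the edited lines point to actually exists: utils.pm in the use lib directory, the MegaCli path in $megaclibin, and the installed plugin itself. A minimal sketch; the function name is hypothetical and the defaults are the paths used above.

```shell
# Hypothetical helper: check the files the edited plugin relies on.
# $1 = Nagios libexec dir, $2 = MegaCli path (defaults as in this guide).
check_megaraid_prereqs() {
    libexec=${1:-/usr/local/nagios/libexec}
    megacli=${2:-/usr/bin/megacli}
    rc=0
    [ -f "$libexec/utils.pm" ] || { echo "missing $libexec/utils.pm" >&2; rc=1; }
    [ -x "$megacli" ] || { echo "not executable: $megacli" >&2; rc=1; }
    [ -x "$libexec/check_megaraid_sas" ] || { echo "plugin not installed in $libexec" >&2; rc=1; }
    return $rc
}

# Usage: check_megaraid_prereqs && echo "prerequisites OK"
```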

5. Test check_megaraid_sas

# /usr/local/nagios/libexec/check_megaraid_sas
WARNING: 0:0:RAID-10:6 drives:1.225TB:Optimal Drives:6 (365 Errors)
If errors are reported, use the following command to find out which physical drives have them:
# megacli -PdList -aAll| egrep "Slot Number|Error Count|Failure Count"
Slot Number: 0
Media Error Count: 0
Other Error Count: 36
Predictive Failure Count: 0
Slot Number: 1
Media Error Count: 0
Other Error Count: 37
Predictive Failure Count: 0
Slot Number: 2
Media Error Count: 0
Other Error Count: 92
Predictive Failure Count: 0
Slot Number: 3
Media Error Count: 0
Other Error Count: 90
Predictive Failure Count: 0
Slot Number: 4
Media Error Count: 0
Other Error Count: 56
Predictive Failure Count: 0
Slot Number: 5
Media Error Count: 0
Other Error Count: 54
Predictive Failure Count: 0
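Adding up the Other Error Count values above gives 36+37+92+90+56+54 = 365, which matches the 365 errors reported by the plugin. A minimal sketch that totals the three counters from -PdList output; the function name is hypothetical.

```shell
# Hypothetical helper: read `megacli -PdList -aAll` output on stdin and
# total the media, other and predictive-failure counters over all drives.
sum_pd_errors() {
    awk -F': ' '
        /^Media Error Count:/        { media += $2 }
        /^Other Error Count:/        { other += $2 }
        /^Predictive Failure Count:/ { pred  += $2 }
        END { printf "media=%d other=%d predictive=%d\n", media, other, pred }
    '
}

# Usage: megacli -PdList -aAll | sum_pd_errors
```

For the six drives listed above this prints media=0 other=365 predictive=0.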

If you are sure these errors can be ignored, run:
# /usr/local/nagios/libexec/check_megaraid_sas -o 365
OK: 0:0:RAID-10:6 drives:1.225TB:Optimal Drives:6 (365 Errors)

Output format:

<status>: <controller #>:<volume #>:<RAID level>:<volume drive count>:<volume size>:<volume status> Drives:<total drives attached to controller(s)>

The trailing "(365 Errors)" in the examples above is the error count the plugin found.
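If another script needs the individual fields, the line can be split with one regular expression. A minimal sketch that handles only the single-volume output shown above (I have not verified how the plugin formats several volumes); the function name is hypothetical.

```shell
# Hypothetical helper: split a single-volume check_megaraid_sas output line
# into named fields. Lines that do not match are printed unchanged.
parse_megaraid_output() {
    printf '%s\n' "$1" | sed -E 's/^([A-Z]+): ([0-9]+):([0-9]+):([^:]+):([0-9]+) drives:([^:]+):([^ ]+) Drives:([0-9]+).*/status=\1 controller=\2 volume=\3 level=\4 vol_drives=\5 size=\6 vol_state=\7 total_drives=\8/'
}

# Usage:
#   parse_megaraid_output "$(/usr/local/nagios/libexec/check_megaraid_sas -o 365)"
```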

All that remains is to define the Nagios command and service, which I won't go into in detail here.
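For reference, a minimal sketch of such definitions, written out by a hypothetical shell function. Assumptions: Nagios runs on the same host as the RAID controller (remote hosts would normally run the plugin through NRPE, not covered here), $USER1$ points to /usr/local/nagios/libexec as in the default resource.cfg, and the generic-service template and localhost host from the Nagios sample configuration exist. MegaCli generally needs root, so the nagios user may need sudo for the plugin to work.

```shell
# Hypothetical helper: write example Nagios object definitions to file $1.
# Quoted 'EOF' keeps $USER1$ and $ARG1$ literal for Nagios to expand.
write_megaraid_nagios_cfg() {
    cat > "$1" <<'EOF'
define command {
    command_name    check_megaraid_sas
    command_line    $USER1$/check_megaraid_sas -o $ARG1$
}

define service {
    use                     generic-service
    host_name               localhost
    service_description     RAID
    check_command           check_megaraid_sas!365
}
EOF
}
```

The 365 passed as $ARG1$ is the other-error threshold from the example above; adjust it for each host.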

--End--