Writing My Own Nagios Disk-Monitoring Script, check_disk

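Before I knew it, a month of my internship had gone by. My main job during that time was setting up a Nagios + Centreon monitoring platform. Doing it hands-on went fairly quickly; the setup had its share of bugs, but it went smoothly enough, and after that I started writing my own disk-monitoring script.
First, why write my own disk-monitoring script at all? Honestly, I'm not entirely sure myself, since Nagios-plugins already ships a check_disk script. My mentor probably wanted to give me some practice, and also wanted a script that better fits our actual setup.
The hardware I had to deal with: three servers forming a test cloud platform, two of them with RAID cards, two of them with SSDs, plus a number of HDDs. Yes, that's all, but for a newbie like me it was plenty to wrestle with.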


For hosts with a RAID card, MegaCli is a good choice. Download and install MegaCli yourself, then get started:

/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL  --- show RAID (logical drive) info
/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL    --- show RAID card info
/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL           --- show physical disk info

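Play around with these and look at what they print. There is a lot of output, so just skim it. If the host has no RAID card, running MegaCli on it prints nothing but Exit Code: 0x00
The one I mainly use is the third command, /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL
Then I grab the information I need: /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | grep -E 'Device Id|Error|Media Type'
Device Id: used when monitoring SSD life; it is just an ID
Error: the Error Count values are the error information we want to watch. 0 means no errors; anything non-zero is cause for concern
Media Type: the disk type. I mainly use it to find which Device Id belongs to the SSD in the host, because apart from this I don't know how a Device Id maps to a disk or a partition. Here is my output: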

[root@cloud-13 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL  | grep -E 'Device Id|Error|Media Type'
Device Id: 0
Media Error Count: 0
Other Error Count: 0
Media Type: Hard Disk Device
Device Id: 1
Media Error Count: 0
Other Error Count: 0
Media Type: Hard Disk Device
Device Id: 2
Media Error Count: 0
Other Error Count: 0
Media Type: Hard Disk Device
Device Id: 3
Media Error Count: 0
Other Error Count: 0
Media Type: Hard Disk Device
Device Id: 4
Media Error Count: 0
Other Error Count: 0
Media Type: Solid State Device

So all that's left is to write code that watches the numbers after Error Count, and that gives us the monitoring.
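
A minimal sketch of that idea, written for Python 3 as a standalone illustration rather than as part of the script at the end of this post: it folds the grep output shown above into one record per disk, picks out the SSD's Device Id, and finds the largest error counter. The names parse_pdlist, ssd_device_ids and worst_error and the record keys are made up for this example, and it assumes the output has exactly the four-line shape shown above.

```python
# Last two disks from the grep output above, used as sample input.
SAMPLE = """\
Device Id: 3
Media Error Count: 0
Other Error Count: 0
Media Type: Hard Disk Device
Device Id: 4
Media Error Count: 0
Other Error Count: 0
Media Type: Solid State Device
"""


def parse_pdlist(text):
    """Group 'Device Id / Error Count / Media Type' lines into one dict per disk."""
    disks = []
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "key: value" line
        key, value = key.strip(), value.strip()
        if key == "Device Id":
            current = {"device_id": int(value)}
            disks.append(current)
        elif current is None:
            continue  # ignore anything before the first Device Id
        elif key == "Media Error Count":
            current["media_errors"] = int(value)
        elif key == "Other Error Count":
            current["other_errors"] = int(value)
        elif key == "Media Type":
            current["media_type"] = value
    return disks


def ssd_device_ids(disks):
    """Device Ids whose Media Type says the disk is an SSD."""
    return [d["device_id"] for d in disks
            if d.get("media_type") == "Solid State Device"]


def worst_error(disks):
    """Return (count, device_id) for the largest error counter, or (0, None)."""
    worst = (0, None)
    for d in disks:
        count = max(d.get("media_errors", 0), d.get("other_errors", 0))
        if count > worst[0]:
            worst = (count, d["device_id"])
    return worst


disks = parse_pdlist(SAMPLE)
print(ssd_device_ids(disks), worst_error(disks))
```

Feeding it the full output of the MegaCli pipeline works the same way; the sample string only keeps the example self-contained.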
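I mentioned SSD life just now, so let me cover it here as well. smartctl can report SSD life (along with plenty of other results; SSD life is only one of them), but on a host with a RAID card it needs the Device Id obtained above.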

[root@cloud-13 ~]# smartctl -a -d megaraid,4 /dev/sdc1
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.123.2.openstack.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

/dev/sdc1 [megaraid_disk_04] [SAT]: Device open changed type from 'megaraid' to 'sat'
Smartctl open device: /dev/sdc1 [megaraid_disk_04] [SAT] failed: SATA device detected,
MegaRAID SAT layer is reportedly buggy, use '-d sat+megaraid,N' to try anyhow

On my host it asks me to add sat, so I did as told:

[root@cloud-13 ~]# smartctl -a -d sat+megaraid,4 /dev/sdc1
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.123.2.openstack.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     OCZ INTREPID 3600
Serial Number:    A21N8061423000004
LU WWN Device Id: 5 e83a97 100006dc5
Firmware Version: 1.4.6.0
User Capacity:    800,166,076,416 bytes [800 GB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ACS-2 (revision not indicated)
Local Time is:    Tue Aug 25 15:20:02 2015 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
Total time to complete Offline 
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x1d) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Abort Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x00) Error logging NOT supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   0) minutes.
Extended self-test routine
recommended polling time:        (   0) minutes.

SMART Attributes Data Structure revision number: 18
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0000   100   100   000    Old_age   Offline      -       0
  9 Power_On_Hours          0x0000   100   100   000    Old_age   Offline      -       3964
 12 Power_Cycle_Count       0x0000   100   100   000    Old_age   Offline      -       28
100 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       2547072
171 Unknown_Attribute       0x0000   090   000   000    Old_age   Offline      -       12030
174 Unknown_Attribute       0x0000   071   100   000    Old_age   Offline      -       20
184 End-to-End_Error        0x0000   009   100   000    Old_age   Offline      -       1282
187 Reported_Uncorrect      0x0000   100   100   000    Old_age   Offline      -       0
190 Airflow_Temperature_Cel 0x0000   048   054   000    Old_age   Offline      -       48
195 Hardware_ECC_Recovered  0x0000   000   100   000    Old_age   Offline      -       0
196 Reallocated_Event_Count 0x0000   000   100   000    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0000   000   100   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0000   100   100   000    Old_age   Offline      -       3562
199 UDMA_CRC_Error_Count    0x0000   100   100   000    Old_age   Offline      -       3443
202 Data_Address_Mark_Errs  0x0000   100   100   000    Old_age   Offline      -       2061332509
205 Thermal_Asperity_Rate   0x0000   100   100   000    Old_age   Offline      -       3000
206 Flying_Height           0x0000   000   100   000    Old_age   Offline      -       0
207 Spin_High_Current       0x0000   002   100   000    Old_age   Offline      -       64
208 Spin_Buzz               0x0000   000   100   000    Old_age   Offline      -       9
210 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
211 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
212 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
213 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
214 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
221 G-Sense_Error_Rate      0x0000   100   100   000    Old_age   Offline      -       0
222 Loaded_Hours            0x0000   100   100   000    Old_age   Offline      -       0
230 Head_Amplitude          0x0000   001   100   000    Old_age   Offline      -       1
233 Media_Wearout_Indicator 0x0000   100   000   000    Old_age   Offline      -       100
249 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       5792
251 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       22849

SMART Error Log not supported
Warning! SMART Self-Test Log Structure error: invalid SMART checksum.
SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


Device does not support Selective Self Tests/Logging

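Then just grab this line. The 100 means 100% of the life remains, i.e. no wear at all; the drive is new, after all.
233 Media_Wearout_Indicator 0x0000 100 000 000 Old_age Offline - 100
I followed these two blog posts, which explain it in detail: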

http://blog.yufeng.info/archives/1096
http://www.woxihuan.com/117417/1336095005082619.shtml
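
As a rough sketch of how the Media_Wearout_Indicator value could be pulled out in code (Python 3, illustrative only): one helper builds the smartctl argument list, adding -d sat+megaraid,N only when a Device Id is given, and another reads the normalized VALUE column of a named attribute row. smartctl_args and attribute_value are hypothetical names, and the parser assumes the ten-column attribute table layout shown in the smartctl output above.

```python
# The attribute row from the smartctl output above.
ROW = ("233 Media_Wearout_Indicator 0x0000   100   000   000    "
       "Old_age   Offline      -       100")


def smartctl_args(disk, megaraid_id=None):
    """Build the smartctl argument list for one disk.

    Behind a MegaRAID card the Device Id is passed as sat+megaraid,N,
    which is the form smartctl itself suggested on my host.
    """
    args = ["smartctl", "-a"]
    if megaraid_id is not None:
        args += ["-d", "sat+megaraid,%d" % megaraid_id]
    args.append(disk)
    return args


def attribute_value(smart_output, name):
    """Return the normalized VALUE column of a SMART attribute, or None."""
    for line in smart_output.splitlines():
        fields = line.split()
        # Row layout: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) >= 10 and fields[1] == name:
            return int(fields[3])
    return None


print(smartctl_args("/dev/sdc1", 4))
print(attribute_value(ROW, "Media_Wearout_Indicator"))
```

The argument list is meant to be handed to something like subprocess.run without a shell, so the grep step is replaced by attribute_value.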


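For hosts without a RAID card, smartctl works well for checking whether a disk has errors
# smartctl -a /dev/sdx    shows all information; sdx is your own disk device
Since I only need to watch the Error counter log, I can use this:
# smartctl -l error /dev/sdc    lists only the Error counter log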

[root@cloud-11 ~]# smartctl -l error /dev/sdc
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.123.2.openstack.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net


Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0      20680        755.998           0
write:         0        0         0         0       8177       1356.647           0
verify:        0        0         0         0        760         61.354           0

Non-medium error count:        0

Look at the error columns: 0 means no problem. Then just grab them in code.
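
A minimal sketch of grabbing those columns, in Python 3 and separate from the final script: it assumes the read/write/verify rows always have the eight whitespace-separated fields shown above, and the names parse_error_counters and max_error_count are invented for the example. The gigabytes and algorithm-invocation columns are skipped because they are not error counts.

```python
# Excerpt of the `smartctl -l error /dev/sdc` output above.
LOG = """\
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0      20680        755.998           0
write:         0        0         0         0       8177       1356.647           0
verify:        0        0         0         0        760         61.354           0

Non-medium error count:        0
"""


def parse_error_counters(text):
    """Map 'read'/'write'/'verify' to the error columns of their row."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 8 and fields[0] in ("read:", "write:", "verify:"):
            counters[fields[0].rstrip(":")] = {
                "ecc_fast": int(fields[1]),
                "ecc_delayed": int(fields[2]),
                "rereads_rewrites": int(fields[3]),
                "total_corrected": int(fields[4]),
                "total_uncorrected": int(fields[7]),
            }
    return counters


def max_error_count(counters):
    """Largest value across all error columns, or 0 if nothing was parsed."""
    return max((v for row in counters.values() for v in row.values()), default=0)


print(max_error_count(parse_error_counters(LOG)))
```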
On this host without a RAID card, checking the SSD with smartctl produces no Error counter log:

[root@cloud-11 ~]# smartctl -a /dev/sdb
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.123.2.openstack.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     OCZ INTREPID 3600
Serial Number:    A21N8061423000020
LU WWN Device Id: 5 e83a97 100006dd5
Firmware Version: 1.4.6.0
User Capacity:    800,166,076,416 bytes [800 GB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ACS-2 (revision not indicated)
Local Time is:    Tue Aug 25 15:34:29 2015 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (  25) The self-test routine was aborted by
                                        the host.
Total time to complete Offline 
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x1d) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Abort Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x00) Error logging NOT supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   0) minutes.
Extended self-test routine
recommended polling time:        (   0) minutes.

SMART Attributes Data Structure revision number: 18
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0000   100   100   000    Old_age   Offline      -       0
  9 Power_On_Hours          0x0000   100   100   000    Old_age   Offline      -       5116
 12 Power_Cycle_Count       0x0000   100   100   000    Old_age   Offline      -       12
100 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       4009824
171 Unknown_Attribute       0x0000   090   000   000    Old_age   Offline      -       12041
174 Unknown_Attribute       0x0000   066   100   000    Old_age   Offline      -       8
184 End-to-End_Error        0x0000   009   100   000    Old_age   Offline      -       1271
187 Reported_Uncorrect      0x0000   100   100   000    Old_age   Offline      -       0
190 Airflow_Temperature_Cel 0x0000   045   063   000    Old_age   Offline      -       45
195 Hardware_ECC_Recovered  0x0000   000   100   000    Old_age   Offline      -       0
196 Reallocated_Event_Count 0x0000   000   100   000    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0000   000   100   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0000   100   100   000    Old_age   Offline      -       2732
199 UDMA_CRC_Error_Count    0x0000   100   100   000    Old_age   Offline      -       2458
202 Data_Address_Mark_Errs  0x0000   100   100   000    Old_age   Offline      -       2371926836
205 Thermal_Asperity_Rate   0x0000   100   100   000    Old_age   Offline      -       3000
206 Flying_Height           0x0000   000   100   000    Old_age   Offline      -       0
207 Spin_High_Current       0x0000   003   100   000    Old_age   Offline      -       90
208 Spin_Buzz               0x0000   000   100   000    Old_age   Offline      -       14
210 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       9175
211 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
212 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
213 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
214 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
221 G-Sense_Error_Rate      0x0000   100   100   000    Old_age   Offline      -       0
222 Loaded_Hours            0x0000   100   100   000    Old_age   Offline      -       0
230 Head_Amplitude          0x0000   001   100   000    Old_age   Offline      -       1
233 Media_Wearout_Indicator 0x0000   100   000   000    Old_age   Offline      -       100
249 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       7079
251 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       20961

SMART Error Log not supported
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Aborted by host               90%         0         -
# 2  Short offline       Aborted by host               90%         0         -

Device does not support Selective Self Tests/Logging

But it does report the SSD's life:
233 Media_Wearout_Indicator 0x0000 100 000 000 Old_age Offline - 100
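After a long search I still found no way to detect errors on this SSD without RAID; I can only monitor its life. If anyone knows a way, please enlighten me.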
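
For what it's worth, turning that life value into a Nagios state takes only a few lines. The sketch below (Python 3) uses the same 50 and 20 cut-offs as the script at the end of this post; treat them as adjustable assumptions. wearout_status is a hypothetical helper name.

```python
# Standard Nagios plugin exit codes
STATUS_OK, STATUS_WARNING, STATUS_CRITICAL = 0, 1, 2


def wearout_status(life, warn_below=50, crit_below=20):
    """Map the remaining SSD life in percent to (exit_code, message)."""
    if life < crit_below:
        return STATUS_CRITICAL, "CRITICAL - SSD life %d%% < %d%%" % (life, crit_below)
    if life < warn_below:
        return STATUS_WARNING, "WARNING - SSD life %d%% < %d%%" % (life, warn_below)
    return STATUS_OK, "OK - SSD life %d%% left" % life


code, message = wearout_status(100)
print(code, message)
```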

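That more or less completes it. The overall approach is:

Use the detection tools:
For disks not behind a RAID card, use smartctl -a /dev/sdX and check whether the values in the Error counter log columns have increased;
For disks behind a RAID card, use MegaCli and watch the Error Count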
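
Whichever tool supplies the numbers, the decision is the same. A small Python 3 sketch, assuming the thresholds used in the script at the end of this post (0 is OK, 1 to 4 is WARNING, 5 or more is CRITICAL); status_from_errors is a hypothetical name, and the exit codes are the standard Nagios plugin ones.

```python
# Standard Nagios plugin exit codes
STATUS_OK, STATUS_WARNING, STATUS_CRITICAL, STATUS_UNKNOWN = 0, 1, 2, 3


def status_from_errors(error_counts, critical_at=5):
    """Map a list of error counters to (exit_code, message)."""
    if not error_counts:
        # Nothing parsed: better to say so than to report a false OK
        return STATUS_UNKNOWN, "UNKNOWN - no error counters found"
    worst = max(error_counts)
    if worst == 0:
        return STATUS_OK, "OK - no disk errors"
    if worst >= critical_at:
        return STATUS_CRITICAL, "CRITICAL - %d errors >= %d" % (worst, critical_at)
    return STATUS_WARNING, "WARNING - %d errors < %d" % (worst, critical_at)


# Counters could come from MegaCli (Media/Other Error Count per disk)
# or from the smartctl Error counter log columns.
print(status_from_errors([0, 0, 0, 0]))
```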


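Finally, my look into ioerr_cnt. The operating system is redhat 5.x (I don't remember the exact version); you can use df -h to see the disk partitions.
Every disk has this file under its directory, holding a single value: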

# cat /sys/block/sdb/device/ioerr_cnt 
0x1494

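Judging by the name, ioerr_cnt should be a count of IO errors, so its value would be the number of IO errors that have occurred. 0x1494 (5268 in decimal) is not a small number; does it signal disk errors?
My mentor then found me a discussion of this question on the Red Hat community site. If you're interested, look it up there yourself; I can't conveniently provide it here.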

[Troubleshooting] How do I determine which io are causing ioerr_cnt to increase?

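That article is about finding out which IO hit the error; it proposes a method for tracking down the IO that caused it. But even once you've found it, does it have anything to do with the disk's health? Put differently, if only some process hit an IO error and the cause lies in the process itself, then it has nothing to do with the disk at all.
I looked at ioerr_cnt on all 9 disks across my three hosts and found that only one disk had an ioerr_cnt of 0, even though smartctl and MegaCli reported 0 errors for all of them.
In the end I decided to give up checking ioerr_cnt, since it cannot be tied entirely to disk health, and to use MegaCli and smartctl as the standard.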
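
The survey above is easy to repeat. A minimal Python 3 sketch, assuming the counter lives at /sys/block/<disk>/device/ioerr_cnt as in the cat example and is printed as a hexadecimal string; parse_ioerr_cnt and read_ioerr_counts are names made up here.

```python
import glob
import os


def parse_ioerr_cnt(text):
    """Convert the file content, e.g. '0x1494', to an integer."""
    return int(text.strip(), 16)


def read_ioerr_counts(sys_block="/sys/block"):
    """Return {disk_name: ioerr_cnt} for every disk that exposes the counter."""
    counts = {}
    pattern = os.path.join(sys_block, "*", "device", "ioerr_cnt")
    for path in sorted(glob.glob(pattern)):
        # .../<disk>/device/ioerr_cnt -> the disk name is third from the end
        disk = path.split(os.sep)[-3]
        with open(path) as handle:
            counts[disk] = parse_ioerr_cnt(handle.read())
    return counts


print(parse_ioerr_cnt("0x1494"))
for disk, count in read_ioerr_counts().items():
    print(disk, count)
```

On a machine without such files the loop simply prints nothing.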


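Written down like this it feels like very little, but I spent nearly a week on the research plus several more days writing the code, all of it in Python. I had been away from Python for a long time, so I spent ages looking up how to use various functions. Still, I learned a lot. I used to be somewhat in awe of Nagios scripts (some of them are just gibberish when opened), but now I find they are actually fairly simple: the key is picking the right tools, and after that it is mostly string processing. Python is a fine thing.
Here is the final code. It's pretty simple, nothing fancy: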

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Description:
#   This application is used to discover the physical disks by using the MegaCli tool.
#
# Author: Jiang Chuan <[email protected]>
#

import commands
import os
import sys
import string
import argparse

SMARTCTL = 'smartctl'
ListError = '-l error'
DISK = '/dev/sdc'
LSPCI = 'lspci | grep -i raid'
MEGACLI = '/opt/MegaRAID/MegaCli/MegaCli64'
PDLIST = '-PDList -aALL'
DEVICE = '|grep \'Device Id\''
ERROR = '|grep Error'

# nagios exit code
STATUS_OK = 0
STATUS_WARNING = 1
STATUS_ERROR = 2
STATUS_UNKNOWN = 3

def check_smartctl():
    (status, output) = commands.getstatusoutput('%s %s %s' % (SMARTCTL, ListError, DISK))
    line = output.split('\n')
    if status != 0:
        # The diagnostic normally follows the banner; fall back to the last line
        detail = line[3] if len(line) > 3 else line[-1]
        print 'UNKNOWN|Something unexpected happened:' + detail
        return STATUS_UNKNOWN
    str_read = ''
    str_write = ''
    str_verify = ''
    for item in line:
        # Rows of the error counter log start with "read:", "write:" or "verify:"
        row = item.strip()
        if row.startswith('read:'):
            str_read = row
        elif row.startswith('write:'):
            str_write = row
        elif row.startswith('verify:'):
            str_verify = row
    # Judge only after every line of the output has been scanned
    if str_read == '' or str_write == '' or str_verify == '':
        print 'UNKNOWN|We can not get the error count,please check'
        return STATUS_UNKNOWN
    error_list = [max_error(str_read), max_error(str_write), max_error(str_verify)]
    if max(error_list) >= 5:
        print 'ERROR|There is too much error:' + str(error_list) + ' >= 5'
        return STATUS_ERROR
    elif max(error_list) == 0:
        print 'OK'
        return STATUS_OK
    else:
        print 'WARNING|There is some error need handle:' + str(error_list) + '< 5'
        return STATUS_WARNING

def max_error(row):
    # Fields: name, ECC fast, ECC delayed, rereads/rewrites, total corrected,
    # algorithm invocations, gigabytes processed, total uncorrected
    words = row.split()
    lis = [int(words[1]), int(words[2]), int(words[3]), int(words[4]), int(words[7])]
    return max(lis)

def check_lsi():
    (status, output) = commands.getstatusoutput('%s' % (LSPCI))
    if status != 0:
        # The pipeline status is grep's, which is non-zero when no line mentions RAID
        print 'UNKNOWN|LSPCI encounter a problem'
        return STATUS_UNKNOWN
    else:
        if(output.find('LSI') >=0 ):
            return STATUS_OK
        else:
            print 'ERROR|There is no lspci raid'
            return STATUS_ERROR

def check_MegaCli():
    # Stop here if no LSI RAID card is present
    lsi_status = check_lsi()
    if lsi_status != STATUS_OK:
        return lsi_status
    device_id = get_device_id()
    error_count = get_error_count()
    # Some judgement, maybe useless
    if len(device_id)<1 or len(error_count)<1:
        print 'ERROR|There is some error because one of the device_id and error_count is empty'
        return STATUS_ERROR
    elif len(device_id)*2 != len(error_count):
        print 'ERROR|There is some error because the num of error_count does not equal to double device_id'
        return STATUS_ERROR
    else:
        # Just for testing, print the error counts and the device_id
        # for i in range(len(device_id)):
        #     print 'Device_Id ' + str(device_id[i]) + ':'
        #     print 'Media Error Count :' + str(error_count[2*i])
        #     print 'Other Error Count :' + str(error_count[2*i+1])
        worst = max(error_count)
        # error_count holds two entries (media, other) per device, in device order
        worst_device = device_id[error_count.index(worst) // 2]
        if worst == 0:
            print 'OK'
            return STATUS_OK
        elif worst >= 5:
            print 'ERROR|There is ' + str(worst) + ' error in device ' + str(worst_device)
            return STATUS_ERROR
        else:
            print 'WARNING|There is ' + str(worst) + ' error in device ' + str(worst_device)
            return STATUS_WARNING

def get_device_id():
    (status, output) = commands.getstatusoutput('%s %s %s' % (MEGACLI, PDLIST, DEVICE))
    if status != 0:
        print 'ERROR|Error for get device id'
        # An empty list keeps len() and "in" usable for the callers
        return []
    device_id = []
    for item in output.split('\n'):
        if item.strip():
            # The value is the last field, e.g. "Device Id: 4"
            device_id.append(int(item.split()[-1]))
    return device_id

def get_error_count():
    (status, output) = commands.getstatusoutput('%s %s %s' % (MEGACLI, PDLIST, ERROR))
    if status != 0:
        print 'ERROR|Error for get MegaCli error count'
        return []
    error_count = []
    for item in output.split('\n'):
        if item.strip():
            # The value is the last field, e.g. "Media Error Count: 0"
            error_count.append(int(item.split()[-1]))
    return error_count

def check_ssd(device_id,disk):
    (status, output) = commands.getstatusoutput('%s %s%s %s %s' % (SMARTCTL, '-a -d sat+megaraid,', device_id,disk, '|grep Media_Wearout_Indicator'))
    if status != 0:
        print 'UNKNOWN|Something unexpected happened,now is doing check_ssd().'
        return STATUS_UNKNOWN
    else:
        # The fourth whitespace-separated field is the normalized VALUE column
        life = int(output.split()[3])
        if life >= 50:
            print 'OK|The life of the SSD is ' + str(life) +'% left'
            return STATUS_OK
        elif life >= 20:
            print 'WARNING|The life of the SSD is ' + str(life) + '% < 50%'
            return STATUS_WARNING
        else:
            print 'CRITICAL|The life of the SSD is ' + str(life) + '% < 20%'
            return STATUS_ERROR

def check_ssd_no_id(disk):
    (status, output) = commands.getstatusoutput('%s %s %s %s' % (SMARTCTL, '-a ', disk, '|grep Media_Wearout_Indicator'))
    if status != 0:
        print 'UNKNOWN|Something unexpected happened,now is doing check_ssd_no_id().'
        return STATUS_UNKNOWN
    else:
        # The fourth whitespace-separated field is the normalized VALUE column
        life = int(output.split()[3])
        if life >= 50:
            print 'OK|The life of the SSD is ' + str(life) +'% left'
            return STATUS_OK
        elif life >= 20:
            print 'WARNING|The life of the SSD is ' + str(life) + '% < 50%'
            return STATUS_WARNING
        else:
            print 'CRITICAL|The life of the SSD is ' + str(life) + '% < 20%'
            return STATUS_ERROR

def init_option():
    parser = argparse.ArgumentParser(description="DISK nagios plugin.")
    parser.add_argument('-r', '--raid', help='raid or not(y/n)')
    parser.add_argument('-s', '--ssd', help='ssd or not(y/n), need device_id(0,1,2) and disk(/dev/sdc)')
    parser.add_argument('-i', '--device', help='Device Id(0,1,2), which is needed in check_ssd')
    parser.add_argument('-d', '--disk', help='DISK(/dev/sdx),which is needed in check_ssd')
    return parser

def main():
    parser = init_option()
    args = parser.parse_args()
    # Only an explicit "-s y" asks for the SSD life check
    want_ssd = (args.ssd == 'y')
    if args.raid == 'y':
        if not want_ssd:
            return check_MegaCli()
        if not args.device or not args.disk:
            print 'Error|Check ssd needs device id and disk'
            return STATUS_ERROR
        # The Device Id must be one of those reported by MegaCli
        device_id = get_device_id()
        if int(args.device) in device_id:
            return check_ssd(args.device,args.disk)
        print 'Error|Device_Id ' + str(args.device) + ' is not in the list reported by MegaCli'
        return STATUS_ERROR
    else:
        if not want_ssd:
            return check_smartctl()
        # An SSD without MegaCli needs no device id, only the disk
        if not args.disk:
            print 'Error|Check the life of SSD with no ID must assign the DISK(/dev/sdx)'
            return STATUS_ERROR
        return check_ssd_no_id(args.disk)


if __name__ == '__main__':
    sys.exit(main())

# usage: check_disk_health_v2.py [-h] [-r RAID] [-s SSD] [-i DEVICE] [-d DISK]
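# To monitor a machine's disks you have to specify the following for each
# machine, because nothing is detected automatically:
# Does it have RAID?
# yes: check the SSD?
#     yes: check_ssd()
#     no:  check_MegaCli()
# no: check the SSD?
#     yes: check_ssd_no_id()
#     no:  check_smartctl()
#
# All parameters have to be given by hand, which is a bit of a hassle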
