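What is S.M.A.R.T.?
SMART (Self-Monitoring, Analysis and Reporting Technology) is a self-monitoring technology built into hard disks, and it was already widespread by the late 1990s. While it runs, every disk (IDE and SCSI alike) records a number of its own parameters, including model, capacity, temperature, density, sector status, seek time, transfer rate, and error rate.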

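After a disk has run for a few thousand hours, many of its internal physical parameters drift. When one of them crosses its alarm threshold, the disk is approaching failure.
The disk still works at that point, but if the user ignores the warning and keeps using it, the disk becomes very unreliable and may fail at any time.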

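Enabling SMART
SMART works together with the corresponding feature in the motherboard BIOS.
To use SMART, first enter the BIOS setup and enable the relevant option.
Motherboards from the Pentium II generation onward generally support SMART.
Once it is enabled in the BIOS, the rest is up to the operating system.
Unfortunately, Windows has no built-in SMART tool (third-party software has to be installed).
Linux, fortunately, has supported SMART for a long time.
If you install Linux in a virtual machine such as VMware, you may see one service fail at boot: smartd.
That service is the SMART daemon (it fails because the virtual disk of a VMware guest does not support SMART).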

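First, use dmesg to find the disk's device name.
For example, an IDE disk attached to the Slave position of the Primary IDE bus appears as /dev/hdb.
The h in hdb stands for IDE; a name such as sdb indicates a SATA or SCSI disk.
The final letter b marks the second disk on the Primary bus, that is, the Slave position.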
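
The naming rule above can be sketched as a small lookup. This is only an illustration: describe_disk_name is a made-up helper, and the hdc/hdd entries are an assumption based on the usual ordering of the old IDE names, not something stated in this article.

```shell
# describe_disk_name is a hypothetical helper for illustration only.
describe_disk_name() {
    name=${1#/dev/}                 # accept "/dev/hdb" as well as "hdb"
    case "$name" in
        hda) echo "IDE, Primary bus, Master" ;;
        hdb) echo "IDE, Primary bus, Slave" ;;
        hdc) echo "IDE, Secondary bus, Master" ;;   # assumed from the usual hdX ordering
        hdd) echo "IDE, Secondary bus, Slave" ;;    # assumed from the usual hdX ordering
        sd[a-z]) echo "SATA or SCSI disk" ;;
        *) echo "unknown device name" >&2; return 1 ;;
    esac
}

describe_disk_name /dev/hdb   # prints: IDE, Primary bus, Slave
```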

Check whether SMART support is enabled on the disk:
smartctl -i /dev/hdb

The output looks something like this:
[[email protected] ~]# smartctl -i /dev/hdb
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: ST380011A
Serial Number: 3JVAPRGH
Firmware Version: 3.04
User Capacity: 80,026,361,856 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 6
ATA Standard is: ATA/ATAPI-6 T13 1410D revision 2
Local Time is: Tue Apr 3 15:39:52 2007 CST
SMART support is: Available - device has SMART capability.
SMART support is: Disabled
SMART Disabled. Use option -s with argument 'on' to enable it.
[[email protected] ~]#
The line "SMART support is: Disabled" shows that SMART is available on this disk but not yet enabled.
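
For scripting, the same check can be done by reading the "SMART support is:" lines instead of eyeballing them. The sketch below assumes the output format shown above (other smartctl versions may word it differently); smart_state is a hypothetical helper name.

```shell
# smart_state reads `smartctl -i` output on stdin and prints
# "enabled", "disabled", or "unavailable" (no Enabled/Disabled line found).
smart_state() {
    awk '
        /^SMART support is: *Enabled/  { state = "enabled" }
        /^SMART support is: *Disabled/ { state = "disabled" }
        END { print (state == "" ? "unavailable" : state) }
    '
}

# Demo with the two lines from the sample output above.
smart_state <<'EOF'
SMART support is: Available - device has SMART capability.
SMART support is: Disabled
EOF
```

On a real machine the input would come from a pipe such as: smartctl -i /dev/hdb | smart_state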

Run the following command to enable SMART:
smartctl --smart=on --offlineauto=on --saveauto=on /dev/hdb

The output looks something like this:
[[email protected] ~]# smartctl --smart=on --offlineauto=on --saveauto=on /dev/hdb
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.
SMART Attribute Autosave Enabled.
SMART Automatic Offline Testing Enabled every four hours.
[[email protected] ~]#
SMART is now enabled on the disk.

Run the following command to check the disk's health:
smartctl -H /dev/hdb
The output looks something like this:
[[email protected] ~]# smartctl -H /dev/hdb
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
[[email protected] ~]#
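Note the value after "result": PASSED means the disk is in good health.
If a failure is reported here instead, replace the disk in the server as soon as possible.
SMART can only tell you that the disk is no longer healthy; how long it will keep running after the warning is unpredictable.
The alarm thresholds usually leave some margin, so a disk rarely dies the moment the warning appears and generally survives for a while.
Some disks keep running for years after a SMART warning; others die within days.
Either way, once a warning appears, never count on luck.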
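
The PASSED check is easy to automate. A minimal sketch, assuming the ATA-style result line shown above (SCSI disks word their health line differently, so this would need adapting); health_ok is a hypothetical helper name.

```shell
# health_ok reads `smartctl -H` output on stdin and succeeds (exit 0)
# only when the overall self-assessment line says PASSED.
health_ok() {
    grep -q 'self-assessment test result: PASSED'
}

# Demo with the result line from the sample output above.
if printf '%s\n' 'SMART overall-health self-assessment test result: PASSED' | health_ok; then
    echo "disk healthy"
else
    echo "replace the disk"
fi
```

On a real machine the input would come from: smartctl -H /dev/hdb | health_ok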

Run the following command to see the detailed attributes:
smartctl -A /dev/hdb
The output looks something like this:
[[email protected] ~]# smartctl -A /dev/hdb
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 053 051 006 Pre-fail Always - 11338710
3 Spin_Up_Time 0x0003 098 097 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 17
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 087 060 030 Pre-fail Always - 610059516
9 Power_On_Hours 0x0032 087 087 000 Old_age Always - 11974
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 110
194 Temperature_Celsius 0x0022 045 052 000 Old_age Always - 45
195 Hardware_ECC_Recovered 0x001a 053 051 000 Old_age Always - 11338710
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0000 100 253 000 Old_age Offline - 0
202 TA_Increase_Count 0x0032 100 253 000 Old_age Always - 0
[[email protected] ~]#
The attribute table above is useful for analysis and reference. The following command prints the complete report:
smartctl -a /dev/hdb
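
To see how much headroom each attribute has left, the VALUE and THRESH columns of the smartctl -A table can be compared with a short awk filter. This is a sketch under two assumptions: the column layout matches the table shown above, and a threshold of 000 is treated as "never fails". attr_margins is a hypothetical helper name.

```shell
# attr_margins reads `smartctl -A` output on stdin. For every attribute row
# it prints: NAME, VALUE minus THRESH, and a flag when VALUE <= THRESH.
attr_margins() {
    awk '
        $1 ~ /^[0-9]+$/ && NF >= 10 {       # attribute rows start with a numeric ID
            value  = $4 + 0                 # normalized VALUE column
            thresh = $6 + 0                 # THRESH column
            status = (thresh > 0 && value <= thresh) ? "AT_OR_BELOW_THRESHOLD" : "ok"
            printf "%s %d %s\n", $2, value - thresh, status
        }
    '
}

# Demo with two rows copied from the table above.
attr_margins <<'EOF'
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 053 051 006 Pre-fail Always - 11338710
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
EOF
```

On a real machine the input would come from: smartctl -A /dev/hdb | attr_margins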

Logging in to the server regularly to run smartctl is tedious, so Linux provides a system daemon for the job: smartd.
Edit its configuration file:
vi /etc/smartd.conf
Most of this file is likely to be commented-out documentation. All you need is one correct line for the disk in question:
/dev/hdb -H -m [email protected]
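With this configuration smartd works silently: as long as SMART reports PASSED it does nothing.
As soon as a failure is reported, it sends mail to the address given after -m.
After changing the configuration, restart the service:
/etc/init.d/smartd restart
That completes the SMART setup.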
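
Because most of smartd.conf is comments, it helps to list only the lines that smartd will actually act on before restarting. A small sketch; active_entries is a hypothetical helper, and admin@example.com is a placeholder address, not the one used in this article.

```shell
# active_entries reads smartd.conf text on stdin and prints only the
# lines that are neither blank nor comments.
active_entries() {
    grep -Ev '^[[:space:]]*(#|$)'
}

# Demo with a made-up configuration file.
active_entries <<'EOF'
# Sample configuration file for smartd.

#DEVICESCAN
/dev/hdb -H -m admin@example.com
EOF
```

On a real machine: active_entries < /etc/smartd.conf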
Disk checking tool for Linux: smartmontools
Smartmontools for SCSI disks: http://smartmontools.sourceforge.net/smartmontools_scsi.html

smartctl option list:

The following options are currently available for SCSI disks and tape drives unless otherwise noted:

    * -a | --all : equivalent to the combination -i -H -A -l error -l selftest options invoked in that order.
    * -A | --attributes : outputs the current device temperature, trip temperature, the number of elements in the grown defect list (GLIST) and data from the start-stop log page. Outputs some vendor specific information if available.
    * -C | --captive : used in conjunction with -t short or -t long options to do short or long self tests in the foreground. [Has no effect on tape drives.]
    * -d TYPE | --device=TYPE where TYPE is "ata", "scsi", "sat", "marvell", "3ware,N", "hpt,L/N[,M]" or "cciss,N". Overrides utility's guess about the class of the device which is based on the form of the nominated device's name.
    * -h | --help : outputs lengthy usage message and exits without any other action.
    * -H | --health : outputs single device health metric determined by the device manufacturer. This will be "OK" or a failure message.
    * -i | --info : outputs device identification information (derived from a SCSI INQUIRY command) and whether the device supports SMART (and temperature warnings) and if those facilities are currently enabled. The type of transport (e.g. FC or SAS) is also reported, if available. Some users have reported disks that report the wrong transport.
    * -l TYPE | --log=TYPE where TYPE is either "background", "selftest" or "error". Decodes and outputs the requested log. Note that --all does not include --log=background .
    * -q TYPE | --quietmode=TYPE where TYPE is either "silent" or "errorsonly". When the type is silent then nothing is output to the console but the exit status is set (so it is suitable for scripts). For "errorsonly" only errors are output to the console. The exit status is always set. [See the smartctl man page.]
    * -r TYPE | --report=TYPE where TYPE is either "ioctl[,]" or "scsiioctl[,]". Turns on low level debugging of issued commands and responses. These commands are issued through a system command called an "ioctl" in Unix. The debug can be for all issued commands (i.e. "ioctl") or only SCSI commands ("scsiioctl"). Optionally the TYPE can have a comma and a number post pended to increase the volume of debug. See this section for more details.
    * -s VALUE | --smart=VALUE where VALUE is either "on" or "off". Enables or disables SMART monitoring (and temperature warnings).
    * -S VALUE | --saveauto=VALUE where VALUE is either "on" or "off". Controls whether the error log values are preserved across device power cycles.
    * -t TEST | --test=TEST where TEST is either "offline", "short" or "long". Despite its name "offline" is a short foreground test that all SCSI devices should support. A "short" self test is typically 2 minutes or less. A "long" self test will be considerably longer than 2 minutes, depending on the size of the media. The estimated time that a "long" self test will take is printed after the "selftest" log (i.e. with '-l selftest' or '-a')
    * -V | --version : outputs the smartctl version number (including the cvs version of all its source files) and build information then exits without any other action.
    * -X | --abort : will terminate a background short or long self test. Usually the self test log notes that a self test has been aborted. [Has no effect on tape drives.]
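
The -q silent mode above is meant for scripts, which then have to interpret the exit status. As far as I recall from the smartctl man page, that status is a bitmask; the bit meanings below are reproduced from memory and should be checked against the man page of your installed version. decode_smartctl_status is a hypothetical helper name.

```shell
# decode_smartctl_status prints one line per bit set in a smartctl exit status.
# The bit meanings are an assumption taken from the smartctl man page.
decode_smartctl_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no problems reported"
        return 0
    fi
    bit=0
    while [ "$bit" -le 7 ]; do
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            case "$bit" in
                0) echo "command line did not parse" ;;
                1) echo "device open failed" ;;
                2) echo "a SMART command failed or a checksum error was found" ;;
                3) echo "SMART status check returned DISK FAILING" ;;
                4) echo "prefail attributes at or below threshold" ;;
                5) echo "attributes were at or below threshold in the past" ;;
                6) echo "error log contains records" ;;
                7) echo "self-test log contains error records" ;;
            esac
        fi
        bit=$((bit + 1))
    done
}

# Demo: 24 = 8 + 16, so bits 3 and 4 are set.
decode_smartctl_status 24
```

A typical call would be: smartctl -q silent -H /dev/hdb; decode_smartctl_status $?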

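Basic usage (append the device name, for example /dev/hdb, to each command):
1. smartctl -a           Show all SMART information, including whether SMART is enabled on the device.
2. smartctl -s on        Enable SMART if it is not enabled yet.
3. smartctl -t short     Run a short self-test in the background.
   smartctl -t long      Run a long self-test in the background.
   smartctl -C -t short  Run a short self-test in the foreground.
   smartctl -C -t long   Run a long self-test in the foreground.
These simply trigger the disk's built-in SMART self-test routines.
4. smartctl -X           Abort a running background self-test.
5. smartctl -l selftest  Show the self-test log.
6. smartctl -l error     Show the disk's error log.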
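
The steps above can be strung together into one routine. This is a rough sketch, not a tested maintenance script: run_foreground_selftest and the SMARTCTL override variable are made up for illustration, and it assumes the foreground (-C) test has finished by the time the command returns. Foreground tests reportedly keep the drive busy while they run, so try this on an idle disk first.

```shell
# SMARTCTL can be overridden, for example to substitute a stub for a dry run.
SMARTCTL=${SMARTCTL:-smartctl}

# run_foreground_selftest DEVICE [short|long]
# Enables SMART, runs a foreground self-test, then prints the self-test log.
run_foreground_selftest() {
    dev=$1
    kind=${2:-short}
    case "$kind" in
        short|long) ;;
        *) echo "usage: run_foreground_selftest DEVICE [short|long]" >&2; return 2 ;;
    esac
    $SMARTCTL -s on "$dev"            # step 2: make sure SMART is enabled
    $SMARTCTL -C -t "$kind" "$dev"    # step 3: foreground self-test
    $SMARTCTL -l selftest "$dev"      # step 5: show the self-test log
}

# Example call (not executed here): run_foreground_selftest /dev/hdb short
```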