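Original article: Test If Linux Server SCSI / SATA Hard Disk Going Bad
One of our regular readers asked:
How do I test whether my hard disk is failing? I only see a few errors in the /var/log/messages file.
I/O errors in /var/log/messages indicate that something is wrong with the hard disk and that it may be failing. You can check the disk with the smartctl command, a control and monitoring utility for SMART disks on Linux and Unix-like operating systems.
smartctl controls the Self-Monitoring, Analysis and Reporting Technology (SMART) system built into many ATA-3 and later ATA, IDE, and SCSI-3 hard drives. The purpose of SMART is to monitor the reliability of the hard drive, predict drive failures, and carry out different types of drive self-tests.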
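As a quick first check before turning to smartctl, the I/O errors mentioned above can be tallied per device. The sketch below is only an illustration: the function name io_errors_by_device is made up here, and it assumes the kernel logs these errors in the form "I/O error, dev sdb, sector N", which can differ between kernel versions.

```shell
# io_errors_by_device: read syslog lines on stdin and print "<device> <count>"
# for every kernel message of the (assumed) form "I/O error, dev sdX, ...".
io_errors_by_device() {
    awk '
        /I\/O error, dev / {
            # The device name is the field right after the word "dev".
            for (i = 1; i < NF; i++)
                if ($i == "dev") {
                    dev = $(i + 1)
                    sub(/,$/, "", dev)   # strip the trailing comma
                    count[dev]++
                    break
                }
        }
        END { for (d in count) printf "%s %d\n", d, count[d] }
    ' | sort
}
```

It could be used as: io_errors_by_device < /var/log/messages. A device that keeps accumulating errors is a good candidate for the smartctl checks that follow.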
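smartctl on the server
smartctl is a command-line utility designed to perform SMART tasks such as printing the SMART self-test and error logs, enabling and disabling SMART automatic testing, and initiating device self-tests. First, make sure SMART support is enabled in the BIOS. Then run the following commands: the first shows whether your disk supports SMART, and the second turns SMART on.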
# smartctl -i /dev/sdb
# smartctl -s on -d ata /dev/sdb
smartctl version 5.33 [x86_64-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.
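The "SMART support is:" lines printed by smartctl -i can also be checked from a script. This is a minimal sketch with a hypothetical helper name (smart_support_state); it assumes the ATA-style wording shown in the sample outputs in this article, and other device types may word these lines differently.

```shell
# smart_support_state: read `smartctl -i` output on stdin and print one of
# "enabled", "disabled", "available", "unavailable", or "unknown".
smart_support_state() {
    awk '
        /^SMART support is:/ {
            if ($0 ~ /Unavailable/)    state = "unavailable"
            else if ($0 ~ /Disabled/)  state = "disabled"
            else if ($0 ~ /Enabled/)   state = "enabled"
            else if (state == "")      state = "available"
        }
        END { print (state == "" ? "unknown" : state) }
    '
}
```

For example: smartctl -i /dev/sdb | smart_support_state. If it prints "disabled", the smartctl -s on command shown above should enable SMART on that disk.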
To run the overall-health self-assessment test, type:
# smartctl -d ata -H /dev/sdb
smartctl version 5.33 [x86_64-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
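Sample output from a failing disk (note the attribute flagged FAILING_NOW even though the overall result is still PASSED):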
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Please note the following marginal Attributes:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022 044 033 045 Old_age Always FAILING_NOW 56 (96 110 58 25)
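Because the overall result can read PASSED while an attribute is already failing, a script should look at both. The following is a rough sketch, not an official interface: smart_health_verdict is a name invented for this example, and it only understands the ATA-style result line shown above (SCSI disks report health with a different line, for which it prints "unknown").

```shell
# smart_health_verdict: read `smartctl -H` output on stdin and print
#   ok        overall result PASSED, no attribute flagged FAILING_NOW
#   marginal  overall result PASSED, but some attribute is FAILING_NOW
#   failed    overall result is anything other than PASSED
#   unknown   no overall-health result line was found
smart_health_verdict() {
    awk '
        /overall-health self-assessment test result:/ {
            result = ($NF == "PASSED") ? "ok" : "failed"
        }
        /FAILING_NOW/ { failing = 1 }
        END {
            if (result == "")                  print "unknown"
            else if (result == "ok" && failing) print "marginal"
            else                                print result
        }
    '
}
```

For example: smartctl -d ata -H /dev/sdb | smart_health_verdict. The sample output above would be classified as "marginal".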
The following command provides more detailed information about the failing disk:
# smartctl --attributes --log=selftest /dev/sda
Sample output:
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 098 092 006 Pre-fail Always - 238320363
3 Spin_Up_Time 0x0003 100 100 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 587
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 9
7 Seek_Error_Rate 0x000f 077 060 030 Pre-fail Always - 51672328
9 Power_On_Hours 0x0032 095 095 000 Old_age Always - 4805
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 586
184 Unknown_Attribute 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 417
188 Unknown_Attribute 0x0032 100 099 000 Old_age Always - 4295032833
189 High_Fly_Writes 0x003a 094 094 000 Old_age Always - 6
190 Airflow_Temperature_Cel 0x0022 044 033 045 Old_age Always FAILING_NOW 56 (96 122 58 25)
194 Temperature_Celsius 0x0022 056 067 000 Old_age Always - 56 (0 23 0 0)
195 Hardware_ECC_Recovered 0x001a 043 026 000 Old_age Always - 238320363
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 49
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 49
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 172082159686339
241 Unknown_Attribute 0x0000 100 253 000 Old_age Offline - 2155546016
242 Unknown_Attribute 0x0000 100 253 000 Old_age Offline - 3048586928
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 4789 1746972641
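In a table like the one above, the rows worth a second look are those with something in the WHEN_FAILED column and those counting reallocated or pending sectors. The sketch below picks them out; the helper name smart_attr_warnings and the choice of attribute IDs 5, 197, and 198 are assumptions made for illustration, not a rule defined by smartmontools.

```shell
# smart_attr_warnings: read `smartctl --attributes` output on stdin and print
# attributes that are flagged in WHEN_FAILED, plus IDs 5, 197 and 198
# (reallocated, pending, offline-uncorrectable sectors) with a non-zero raw value.
smart_attr_warnings() {
    awk '
        # Attribute rows: numeric ID, hex FLAG in column 3, at least 10 columns.
        $1 ~ /^[0-9]+$/ && $3 ~ /^0x[0-9a-fA-F]+$/ && NF >= 10 {
            if ($9 != "-")
                printf "%s %s: %s\n", $1, $2, $9
            else if (($1 == 5 || $1 == 197 || $1 == 198) && $10 + 0 > 0)
                printf "%s %s: raw value %s\n", $1, $2, $10
        }
    '
}
```

For example: smartctl --attributes /dev/sda | smart_attr_warnings. Raw values are vendor-specific, so a non-zero count is a hint to investigate rather than proof of failure.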
# smartctl -d ata -a /dev/sdb
Output:
smartctl version 5.33 [x86_64-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: WDC WD2500YS-01SHB0
Serial Number: WD-WCANY1729333
Firmware Version: 20.06C03
User Capacity: 251,000,193,024 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 7
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Wed Jul 4 15:04:38 2007 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (7800) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 92) minutes.
Conveyance self-test routine
recommended polling time: ( 6) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0003 190 187 021 Pre-fail Always - 5500
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 24
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 200 200 051 Pre-fail Always - 0
9 Power_On_Hours 0x0032 092 092 000 Old_age Always - 6382
10 Spin_Retry_Count 0x0013 100 253 051 Pre-fail Always - 0
11 Calibration_Retry_Count 0x0013 100 253 051 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 23
194 Temperature_Celsius 0x0022 127 096 000 Old_age Always - 23
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0009 200 200 051 Pre-fail Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
A note about RAID controllers
The syntax for looking at ATA disks behind a 3ware SCSI RAID controller is:
# smartctl -a -d 3ware,2 /dev/sda
# smartctl -a -d 3ware,0 /dev/twe0
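Task: Run an extended disk self-test
Suppose you want to start an extended self-test of drive /dev/sdb. You can issue this command on a running system. Once the test has completed, the results appear in the self-test log, which you can view with the '-l selftest' option.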
# smartctl -d ata -t long /dev/sdb
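Once the test has finished, the self-test log can be filtered for entries that did not complete cleanly. This is a small sketch under the assumption that log entries look like the "# 1 Extended offline ..." lines shown in this article; selftest_problems is a hypothetical helper name.

```shell
# selftest_problems: read `smartctl -l selftest` output on stdin and print
# every logged self-test that neither completed without error nor is still
# in progress (read failures, aborted or interrupted tests, and so on).
selftest_problems() {
    awk '/^# *[0-9]+ / && !/Completed without error/ && !/in progress/ { print }'
}
```

For example: smartctl -l selftest /dev/sdb | selftest_problems. No output means every logged test either passed or is still running.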
# smartctl -a /dev/sda
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: ST31500341AS
Serial Number: 9VS0TG4B
Firmware Version: CC1H
User Capacity: 1,500,301,910,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Mon Oct 26 21:16:15 2009 IST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 617) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x103f) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 098 092 006 Pre-fail Always - 238338845
3 Spin_Up_Time 0x0003 100 100 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 587
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 9
7 Seek_Error_Rate 0x000f 077 060 030 Pre-fail Always - 51672525
9 Power_On_Hours 0x0032 095 095 000 Old_age Always - 4806
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 586
184 Unknown_Attribute 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 417
188 Unknown_Attribute 0x0032 100 099 000 Old_age Always - 4295032833
189 High_Fly_Writes 0x003a 094 094 000 Old_age Always - 6
190 Airflow_Temperature_Cel 0x0022 044 033 045 Old_age Always FAILING_NOW 56 (96 126 58 25)
194 Temperature_Celsius 0x0022 056 067 000 Old_age Always - 56 (0 23 0 0)
195 Hardware_ECC_Recovered 0x001a 043 026 000 Old_age Always - 238338845
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 49
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 49
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 107168023974595
241 Unknown_Attribute 0x0000 100 253 000 Old_age Offline - 2155546480
242 Unknown_Attribute 0x0000 100 253 000 Old_age Offline - 3048590512
SMART Error Log Version: 1
ATA Error Count: 416 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 416 occurred at disk power-on lifetime: 4786 hours (199 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 ff ff ff ef 00 00:55:03.917 READ DMA EXT
27 00 00 00 00 00 e0 00 00:55:03.818 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 00:55:03.798 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 00:55:03.779 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 00:55:03.658 READ NATIVE MAX ADDRESS EXT
Error 415 occurred at disk power-on lifetime: 4786 hours (199 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 ff ff ff ef 00 00:55:00.927 READ DMA EXT
27 00 00 00 00 00 e0 00 00:55:00.837 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 00:55:00.817 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 00:55:00.800 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 00:55:00.747 READ NATIVE MAX ADDRESS EXT
Error 414 occurred at disk power-on lifetime: 4786 hours (199 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 ff ff ff ef 00 00:54:57.903 READ DMA EXT
27 00 00 00 00 00 e0 00 00:54:57.807 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 00:54:57.787 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 00:54:57.757 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 00:54:57.637 READ NATIVE MAX ADDRESS EXT
Error 413 occurred at disk power-on lifetime: 4786 hours (199 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 ff ff ff ef 00 00:54:54.862 READ DMA EXT
27 00 00 00 00 00 e0 00 00:54:54.767 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 00:54:54.746 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 00:54:54.728 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 00:54:54.677 READ NATIVE MAX ADDRESS EXT
Error 412 occurred at disk power-on lifetime: 4786 hours (199 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 ff ff ff ef 00 00:54:51.838 READ DMA EXT
27 00 00 00 00 00 e0 00 00:54:51.736 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 00:54:51.716 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 00:54:51.685 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 00:54:51.566 READ NATIVE MAX ADDRESS EXT
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 4789 1746972641
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
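If any of these tests reports errors, replace the hard disk and restore the data from backup.
Install smartd on the server to receive email alerts when problems are found
smartd is a daemon that monitors hard disks, and it attempts to enable SMART monitoring on them. It polls ATA and SCSI devices every 30 minutes (configurable), logging SMART errors and attribute changes through the SYSLOG interface. The default location for these syslog notifications and warnings is system-dependent (typically /var/log/messages or /var/log/syslog). In addition to logging to a file, smartd can be configured to send email warnings when it detects a problem. Depending on the type of problem, you may want to run self-tests on the disk, back up the disk, replace the disk, or use a manufacturer's utility to force reallocation of bad or unreadable disk sectors. For more information, see the guide on installing and configuring smartd.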
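As a rough illustration of the email alerts described above, the sketch below prints one smartd.conf entry per disk using the -a (monitor all attributes) and -m (mail address) directives. The function name smartd_conf_entries, the address, and the device names are placeholders, and the location of smartd.conf varies by distribution.

```shell
# smartd_conf_entries EMAIL DEVICE...: print one smartd.conf line per device
# that monitors all SMART attributes (-a) and mails warnings to EMAIL (-m).
smartd_conf_entries() {
    email=$1
    shift
    for dev in "$@"; do
        printf '%s -a -m %s\n' "$dev" "$email"
    done
}
```

For example, smartd_conf_entries admin@example.com /dev/sda /dev/sdb prints two lines that could be reviewed and then added to smartd.conf before restarting smartd.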
GNOME Disk Utility
Most Unix-like systems, such as FreeBSD and OpenBSD, ship with a graphical tool called Disks. It only works on desktop and laptop systems running GNOME. To open it:
Applications > System Tools > Disk Utility
Click SMART Data to view details:
An example of a healthy disk:
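Say hello to GSmartControl
GSmartControl is a hard disk health inspection tool and a graphical user interface for the smartctl command. It offers the following features:
1. Automatically reports and highlights any anomalies.
2. Can enable and disable SMART.
3. Can enable and disable automatic offline data collection, a short self-check that the drive performs every four hours with no impact on performance.
4. Can set global and per-drive options for smartctl.
5. Performs SMART self-tests.
6. Displays drive identity information, capabilities, attributes, and self-test logs.
7. Can read smartctl output from a saved file and interpret it as a virtual device.
8. Works on most operating systems supported by smartctl, such as the *BSDs and the various Linux distributions.
9. Comes with extensive help information.
On Debian or Ubuntu, you can install it with the apt-get command as follows: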
$ sudo apt-get install gsmartcontrol
On Fedora, CentOS, or RHEL, use the yum command instead:
# yum install gsmartcontrol
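Click a drive to view more information:
Click the Attributes tab:
Click the Perform Tests tab to run a short or extended disk test:
References:
* smartctl man page
* Monitoring hard disks with smartd under Linux or UNIX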