Testing Whether a Linux Server's SCSI/SATA Hard Disk Is Healthy

Original article: Test If Linux Server SCSI / SATA Hard Disk Going Bad

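    One of our regular readers asked:

    How do I test whether my hard disk is failing? I only see a few errors in /var/log/messages.

    I/O errors in /var/log/messages indicate that something is wrong with the hard disk, and that it may be failing. You can check the disk for faults with the smartctl command, a control and monitoring utility for SMART disks on Linux/Unix-like operating systems.

    smartctl controls the Self-Monitoring, Analysis and Reporting Technology (SMART) system built into many ATA-3 and later ATA, IDE, and SCSI-3 hard drives. The purpose of SMART is to monitor the reliability of the drive, predict drive failures, and carry out different types of drive self-tests.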
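
Before reaching for smartctl, it can help to see how many I/O errors the log actually holds. The helper below is a minimal sketch: the function name `count_io_errors` is made up for this example, and the pattern `I/O error` is an assumption about how your kernel words these messages, so adjust it to match your own logs.

```shell
# Count lines that mention an I/O error in a syslog-style file.
# Defaults to /var/log/messages; pass another path as the first argument.
# Assumption: the kernel reports disk trouble with the phrase "I/O error".
count_io_errors() {
    logfile=${1:-/var/log/messages}
    # grep -c prints the number of matching lines (0 if there are none)
    grep -c -i 'I/O error' "$logfile"
}
```

For example, `count_io_errors /var/log/syslog` prints the count for a system that logs to /var/log/syslog instead.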


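smartctl on the server

smartctl is a command-line tool designed to perform SMART tasks such as printing the SMART self-test and error logs, enabling and disabling SMART automatic testing, and starting device self-tests. First, make sure SMART support is enabled in the BIOS. Then run the following command to check whether your disk supports SMART: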

# smartctl -i /dev/sdb
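
The `-i` output contains two "SMART support is:" lines, one saying whether the drive has the capability and one saying whether it is switched on. As a rough sketch, the hypothetical helper `smart_ready` below reads that output on standard input and succeeds only when both lines look good. It assumes the ATA wording shown in the sample outputs in this article.

```shell
# Read `smartctl -i` output on stdin.
# Exit status 0 only if SMART is reported as both Available and Enabled.
smart_ready() {
    awk '
        /^SMART support is:/ && /Available/ { avail = 1 }
        /^SMART support is:/ && /Enabled/   { enabled = 1 }
        END { exit !(avail && enabled) }
    '
}
```

One possible use is `smartctl -i /dev/sdb | smart_ready || echo "SMART is not ready on /dev/sdb"`.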

To enable SMART, run:

# smartctl -s on -d ata /dev/sdb

Sample output:

smartctl version 5.33 [x86_64-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.

To display the overall health self-assessment result, enter:

# smartctl -d ata -H /dev/sdb

Sample output:

smartctl version 5.33 [x86_64-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
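
In a script you usually want just the verdict rather than the whole banner. The sketch below uses a hypothetical function `health_result` that pulls the word after "test result:" out of `smartctl -H` output. It matches the ATA wording shown above; SCSI disks phrase this line differently, so treat the pattern as an assumption.

```shell
# Read `smartctl -H` output on stdin and print only the verdict,
# for example PASSED. Prints nothing if the line is not present.
health_result() {
    sed -n 's/^SMART overall-health self-assessment test result: *//p'
}
```

For example, `smartctl -d ata -H /dev/sdb | health_result` prints the verdict for /dev/sdb.
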

Sample output from a failing disk:

smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Please note the following marginal Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   044   033   045    Old_age   Always   FAILING_NOW 56 (96 110 58 25)
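
Note that the overall result above still says PASSED even though one attribute is marked FAILING_NOW, so it is worth checking the WHEN_FAILED column as well. The following is a minimal sketch; `failed_attributes` is a made-up name, and it assumes the ten-column attribute table layout shown in this article.

```shell
# Read smartctl attribute output on stdin and list every attribute whose
# WHEN_FAILED column (field 9) is something other than "-".
# Output format: ID NAME WHEN_FAILED
failed_attributes() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $9 != "-" { print $1, $2, $9 }'
}
```

For example, `smartctl -d ata -H /dev/sdb | failed_attributes` prints one line per failing attribute and nothing for a healthy disk.
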
The following command gives more detailed information about the failing disk:

# smartctl --attributes --log=selftest /dev/sda

Sample output:

smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   098   092   006    Pre-fail  Always       -       238320363
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       587
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       9
  7 Seek_Error_Rate         0x000f   077   060   030    Pre-fail  Always       -       51672328
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4805
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       586
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       417
188 Unknown_Attribute       0x0032   100   099   000    Old_age   Always       -       4295032833
189 High_Fly_Writes         0x003a   094   094   000    Old_age   Always       -       6
190 Airflow_Temperature_Cel 0x0022   044   033   045    Old_age   Always   FAILING_NOW 56 (96 122 58 25)
194 Temperature_Celsius     0x0022   056   067   000    Old_age   Always       -       56 (0 23 0 0)
195 Hardware_ECC_Recovered  0x001a   043   026   000    Old_age   Always       -       238320363
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       49
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       49
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       172082159686339
241 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       2155546016
242 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       3048586928
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      4789         1746972641
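
In a table this long, three counters deserve a close look: Reallocated_Sector_Ct (ID 5), Current_Pending_Sector (ID 197), and Offline_Uncorrectable (ID 198). Non-zero raw values here generally point to bad sectors, and in the output above they read 9, 49, and 49. The sketch below uses a hypothetical helper, `nonzero_sector_counters`, to print only those counters when they are above zero. The choice of these three IDs is this example's assumption, not something smartctl enforces.

```shell
# Read smartctl attribute output on stdin.
# Print NAME=RAW_VALUE for attributes 5, 197 and 198 when the raw value
# is greater than zero. The check on field 3 (the 0x.... flag) skips
# unrelated lines that happen to start with the same number.
nonzero_sector_counters() {
    awk '($1 == 5 || $1 == 197 || $1 == 198) && $3 ~ /^0x/ && $10 + 0 > 0 {
        print $2 "=" $10
    }'
}
```

For example, `smartctl --attributes /dev/sda | nonzero_sector_counters` prints nothing for a disk whose three counters are all zero.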


You can get even more data by entering the following command:

# smartctl -d ata -a /dev/sdb

Output:

smartctl version 5.33 [x86_64-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model:     WDC WD2500YS-01SHB0
Serial Number:    WD-WCANY1729333
Firmware Version: 20.06C03
User Capacity:    251,000,193,024 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Wed Jul  4 15:04:38 2007 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (7800) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  92) minutes.
Conveyance self-test routine
recommended polling time:        (   6) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   190   187   021    Pre-fail  Always       -       5500
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       24
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6382
 10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   253   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       23
194 Temperature_Celsius     0x0022   127   096   000    Old_age   Always       -       23
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
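
The raw value of Power_On_Hours is easier to judge as days. In the output above it is 6382 hours, which is 265 days and 22 hours (265 × 24 = 6360, with 22 hours left over). A small sketch of that arithmetic follows; `power_on_time` is a hypothetical helper name, and it assumes the raw value is a plain hour count, which not every vendor guarantees.

```shell
# Read smartctl attribute output on stdin and print the Power_On_Hours
# raw value (attribute 9, field 10) as "D days + H hours".
power_on_time() {
    awk '$1 == 9 && $2 == "Power_On_Hours" {
        printf "%d days + %d hours\n", int($10 / 24), $10 % 24
    }'
}
```

For example, `smartctl -d ata -a /dev/sdb | power_on_time` prints the drive's age in days and hours.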


RAID controller notes

The syntax for checking ATA disks behind a 3ware SCSI RAID controller is:

# smartctl -a -d 3ware,2 /dev/sda
# smartctl -a -d 3ware,0 /dev/twe0
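
The number after `3ware,` selects the port on the controller, so checking every disk means repeating the command once per port. The sketch below only prints the commands so you can review them first. The function name `threeware_commands` is made up, and the port count is something you must supply yourself, since it depends on your controller.

```shell
# Print one smartctl command per 3ware port, from port 0 to PORTS-1.
# Usage: threeware_commands DEVICE PORTS
# Nothing is executed; the commands are only written to stdout.
threeware_commands() {
    dev=$1
    ports=$2
    port=0
    while [ "$port" -lt "$ports" ]; do
        printf 'smartctl -a -d 3ware,%d %s\n' "$port" "$dev"
        port=$((port + 1))
    done
}
```

For example, `threeware_commands /dev/twe0 4` lists the commands for ports 0 through 3. Once they look right, the output can be piped to `sh` as root.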

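For more information, see how to use the smartctl command to check disks behind Adaptec RAID and 3ware SCSI RAID controllers.

Task: Run an extended disk self-test

Suppose you want to start an extended self-test on /dev/sdb. You can run this command on a live system. The results are written to the self-test log, which you can display with the '-l selftest' option.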

# smartctl -d ata -t long /dev/sdb
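
Once the test has had time to finish, the newest entry in the self-test log (the line starting with `# 1`) holds the result. As a rough sketch, the hypothetical helper `latest_selftest_status` below extracts the Status column from that entry. It assumes the log layout shown in the sample outputs in this article, where columns are separated by two or more spaces.

```shell
# Read `smartctl -l selftest` output on stdin and print the Status column
# of the newest log entry (the line that starts with "# 1").
latest_selftest_status() {
    awk '/^# *1 / {
        line = $0
        sub(/^# *1 +/, "", line)       # drop the entry number
        split(line, col, /  +/)        # columns are split on 2+ spaces
        print col[2]                   # col[1] is the test description
        exit
    }'
}
```

For example, `smartctl -d ata -l selftest /dev/sdb | latest_selftest_status` prints a status such as "Completed without error" or "Completed: read failure".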

Sample detailed report for a failing disk:

# smartctl -a /dev/sda

Sample output:

smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model:     ST31500341AS
Serial Number:    9VS0TG4B
Firmware Version: CC1H
User Capacity:    1,500,301,910,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Mon Oct 26 21:16:15 2009 IST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.
General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever
					been run.
Total time to complete Offline
data collection: 		 ( 617) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x103f)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   098   092   006    Pre-fail  Always       -       238338845
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       587
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       9
  7 Seek_Error_Rate         0x000f   077   060   030    Pre-fail  Always       -       51672525
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4806
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       586
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       417
188 Unknown_Attribute       0x0032   100   099   000    Old_age   Always       -       4295032833
189 High_Fly_Writes         0x003a   094   094   000    Old_age   Always       -       6
190 Airflow_Temperature_Cel 0x0022   044   033   045    Old_age   Always   FAILING_NOW 56 (96 126 58 25)
194 Temperature_Celsius     0x0022   056   067   000    Old_age   Always       -       56 (0 23 0 0)
195 Hardware_ECC_Recovered  0x001a   043   026   000    Old_age   Always       -       238338845
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       49
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       49
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       107168023974595
241 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       2155546480
242 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       3048590512
SMART Error Log Version: 1
ATA Error Count: 416 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 416 occurred at disk power-on lifetime: 4786 hours (199 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      00:55:03.917  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:55:03.818  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:55:03.798  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:55:03.779  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:55:03.658  READ NATIVE MAX ADDRESS EXT
Error 415 occurred at disk power-on lifetime: 4786 hours (199 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      00:55:00.927  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:55:00.837  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:55:00.817  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:55:00.800  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:55:00.747  READ NATIVE MAX ADDRESS EXT
Error 414 occurred at disk power-on lifetime: 4786 hours (199 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      00:54:57.903  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:54:57.807  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:54:57.787  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:54:57.757  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:54:57.637  READ NATIVE MAX ADDRESS EXT
Error 413 occurred at disk power-on lifetime: 4786 hours (199 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      00:54:54.862  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:54:54.767  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:54:54.746  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:54:54.728  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:54:54.677  READ NATIVE MAX ADDRESS EXT
Error 412 occurred at disk power-on lifetime: 4786 hours (199 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      00:54:51.838  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:54:51.736  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:54:51.716  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:54:51.685  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:54:51.566  READ NATIVE MAX ADDRESS EXT
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      4789         1746972641
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
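
The report above is long, but one line sums up the error log: "ATA Error Count: 416". The sketch below uses a hypothetical helper, `ata_error_count`, to extract that number so it can be tracked over time. It assumes the ATA error log wording shown above and prints nothing when the log has no such line.

```shell
# Read `smartctl -a` (or `-l error`) output on stdin and print the number
# from the "ATA Error Count:" line, if there is one.
ata_error_count() {
    sed -n 's/^ATA Error Count: \([0-9][0-9]*\).*/\1/p'
}
```

For example, `smartctl -a /dev/sda | ata_error_count` prints 416 for the disk shown here. A count that keeps rising between runs is a strong hint to back up and replace the drive.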

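Restore from backup

If any of these tests reports an error, replace the disk and restore the data from backup.

Install smartd on the server to receive warning emails when problems are detected

smartd is a daemon that monitors hard disks. It attempts to enable SMART monitoring on ATA devices and polls these and SCSI devices every 30 minutes (configurable). It logs SMART errors and attribute changes through the SYSLOG interface. The default location of these SYSLOG notices and warnings is system dependent (typically /var/log/messages or /var/log/syslog). Besides logging to a file, smartd can also be configured to send email warnings when it detects a problem. Depending on the type of problem, you may want to run self-tests on the disk, back up the disk, replace the disk, or use a manufacturer's utility to force reallocation of bad or unreadable disk sectors. For more details, see the guide to installing and configuring smartd.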
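
To make the email warning concrete, here is a minimal sketch of one smartd configuration line, produced by a small shell function so the device and address are easy to change. The function name `smartd_conf_line` and the address in the example are placeholders. The directives are standard smartd.conf ones: `-a` monitors all SMART properties, `-o on` and `-S on` enable automatic offline testing and attribute autosave, `-s` schedules self-tests, and `-m` sets the mail recipient. The schedule shown (a short test daily at 02:00, a long test on Saturdays at 03:00) is only an example.

```shell
# Print a smartd.conf entry for one disk.
# Usage: smartd_conf_line DEVICE EMAIL
smartd_conf_line() {
    dev=$1
    mailto=$2
    # S/../.././02  = short self-test every day at 02:00
    # L/../../6/03  = long self-test every Saturday at 03:00
    printf '%s -a -o on -S on -s (S/../.././02|L/../../6/03) -m %s\n' "$dev" "$mailto"
}
```

For example, `smartd_conf_line /dev/sdb admin@example.com` prints a line that can be added to smartd's configuration file (commonly /etc/smartd.conf, though the path varies by distribution). Restart smartd afterwards so it picks up the change.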

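GNOME Disk Utility

Most Unix-like systems, such as FreeBSD and OpenBSD, ship with a graphical tool called Disk Utility. It works only on desktop and laptop systems running GNOME. To open Disk Utility: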

Applications > System Tools > Disk Utility

Click the hard disk:

[Image 1]

Click SMART Data to view the details:

[Image 2]

An example of a healthy disk:

[Image 3]

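Say hello to GSmartControl

GSmartControl is a hard disk health inspection tool and a graphical front end to the smartctl command. It has the following features:

1. Automatically reports and highlights any anomalies;

2. Can enable or disable SMART;

3. Can enable or disable Automatic Offline Data Collection, a short self-check that the drive performs every four hours with no impact on performance;

4. Supports configuring global and per-drive options for smartctl;

5. Runs SMART self-tests;

6. Displays drive information: capabilities, attributes, and self-test logs;

7. Can read smartctl output from a saved file and interpret it as a virtual device;

8. Works on most operating systems supported by smartctl, such as the BSDs and the various Linux distributions;

9. Comes with extensive help information.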


On Debian or Ubuntu, you can install it with the apt-get command as follows:

$ sudo apt-get install gsmartcontrol

On Fedora, CentOS, or RHEL, use the yum command to do the same:

# yum install gsmartcontrol

Sample output:

[Image 4]

Click a disk to view more information:

[Image 5]

Click the Attributes tab:

[Image 6]

Click the Perform Tests tab to run a short or extended disk test:

[Image 7]

References:

* smartctl man page

* Monitoring hard disk health with smartd under Linux or Unix
