Symptoms
Repair plan
1. Data backup
2. Recovery using the backup data
Operating system: SUSE Linux 11
File system: ext3
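On day X an alert came in. Writes to the file system on /dev/sda1 were failing because it had gone read-only, and the IP storage was also reporting alarms. We ran umount /img right away, but after unmounting, the file system could not be mounted again.
fdisk -l reported I/O errors. Rebooting the application server did not help: the mount still failed with I/O errors.
After the storage vendor resolved the problem on the IP storage, we rebooted the application server again, and the file system still could not be mounted.
The mount command hung for a long time without returning, yet /var/log/messages showed that the system was still scanning blocks: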
Nov 2 06:04:53 linux11 kernel: [128293.578670] Buffer I/O error on device sda1, logical block 483584660
Nov 2 06:04:53 linux11 kernel: [128293.578672] lost page write due to I/O error on sda1
Nov 2 06:05:01 linux11 /usr/sbin/cron[15283]: (root) CMD ( /opt/hp/hp-health/bin/check-for-restart-requests)
Nov 2 06:05:53 linux11 kernel: [128353.584893] sd 9:0:0:0: [sda] Unhandled sense code
Nov 2 06:05:53 linux11 kernel: [128353.584898] sd 9:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Nov 2 06:05:53 linux11 kernel: [128353.584901] sd 9:0:0:0: [sda] Sense Key : Medium Error [current]
Nov 2 06:05:53 linux11 kernel: [128353.584905] sd 9:0:0:0: [sda] Add. Sense: Medium not present
Nov 2 06:05:53 linux11 kernel: [128353.584910] sd 9:0:0:0: [sda] CDB: Write(10): 2a 00 e6 97 59 5f 00 00 08 00
Nov 2 06:05:53 linux11 kernel: [128353.584916] end_request: I/O error, dev sda, sector 3868678495
Nov 2 06:05:53 linux11 kernel: [128353.584920] Buffer I/O error on device sda1, logical block 483584804
Nov 2 06:05:53 linux11 kernel: [128353.584922] lost page write due to I/O error on sda1
Nov 2 06:05:53 linux11 kernel: [128353.599875] sd 9:0:0:0: [sda] Unhandled sense code
Nov 2 06:05:53 linux11 kernel: [128353.599878] sd 9:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Nov 2 06:05:53 linux11 kernel: [128353.599880] sd 9:0:0:0: [sda] Sense Key : Medium Error [current]
Nov 2 06:05:53 linux11 kernel: [128353.599883] sd 9:0:0:0: [sda] Add. Sense: Medium not present
Nov 2 06:05:53 linux11 kernel: [128353.599886] sd 9:0:0:0: [sda] CDB: Write(10): 2a 00 e6 97 5f bf 00 00 08 00
Nov 2 06:05:53 linux11 kernel: [128353.599890] end_request: I/O error, dev sda, sector 3868680127
Nov 2 06:05:53 linux11 kernel: [128353.599893] Buffer I/O error on device sda1, logical block 483585008
Nov 2 06:05:53 linux11 kernel: [128353.599895] lost page write due to I/O error on sda1
Nov 2 06:05:53 linux11 kernel: [128353.600872] sd 9:0:0:0: [sda] Unhandled sense code
Nov 2 06:05:53 linux11 kernel: [128353.600875] sd 9:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Nov 2 06:05:53 linux11 kernel: [128353.600877] sd 9:0:0:0: [sda] Sense Key : Medium Error [current]
Nov 2 06:05:53 linux11 kernel: [128353.600879] sd 9:0:0:0: [sda] Add. Sense: Medium not present
Nov 2 06:05:53 linux11 kernel: [128353.600882] sd 9:0:0:0: [sda] CDB: Write(10): 2a 00 e6 97 62 47 00 00 08 00
Nov 2 06:05:53 linux11 kernel: [128353.600887] end_request: I/O error, dev sda, sector 3868680775
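The kernel messages above (highlighted in red in the original post) show that the system was still working, so the engineer advised us to keep waiting. After about 20 hours the mount command finally returned: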
linux11:~ #mount /dev/sda1 /mnt/
mount: wrong fs type, bad option, bad superblock on /dev/sda1,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
linux11:~ #dmesg|tail -50
[138764.297170] sd 9:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[138764.297172] sd 9:0:0:0: [sda] Sense Key : Medium Error [current]
[138764.297175] sd 9:0:0:0: [sda] Add. Sense: Medium not present
[138764.297178] sd 9:0:0:0: [sda] CDB: Write(10): 2a 00 f2 1f f5 b7 00 00 10 00
[138764.297182] end_request: I/O error, dev sda, sector 4062180791
[138764.312193] sd 9:0:0:0: [sda] Unhandled sense code
[138764.312197] sd 9:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[138764.312199] sd 9:0:0:0: [sda] Sense Key : Medium Error [current]
[138764.312202] sd 9:0:0:0: [sda] Add. Sense: Medium not present
[138764.312204] sd 9:0:0:0: [sda] CDB: Write(10): 2a 00 f2 20 37 9f 00 00 08 00
[138764.312209] end_request: I/O error, dev sda, sector 4062197663
[138764.312224] sd 9:0:0:0: [sda] Unhandled sense code
[138764.312226] sd 9:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[138764.312228] sd 9:0:0:0: [sda] Sense Key : Medium Error [current]
[138764.312230] sd 9:0:0:0: [sda] Add. Sense: Medium not present
[138764.312233] sd 9:0:0:0: [sda] CDB: Write(10): 2a 00 f2 20 38 b7 00 00 08 00
[138764.312237] end_request: I/O error, dev sda, sector 4062197943
[138764.312242] sd 9:0:0:0: [sda] Unhandled sense code
[138764.312243] sd 9:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[138764.312245] sd 9:0:0:0: [sda] Sense Key : Medium Error [current]
[138764.312247] sd 9:0:0:0: [sda] Add. Sense: Medium not present
[138764.312250] sd 9:0:0:0: [sda] CDB: Write(10): 2a 00 f2 20 7f 87 00 00 08 00
[138764.312254] end_request: I/O error, dev sda, sector 4062216071
[138824.286688] sd 9:0:0:0: [sda] Unhandled sense code
[138824.286692] sd 9:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[138824.286696] sd 9:0:0:0: [sda] Sense Key : Medium Error [current]
[138824.286699] sd 9:0:0:0: [sda] Add. Sense: Medium not present
[138824.286704] sd 9:0:0:0: [sda] CDB: Write(10): 2a 00 f2 20 f2 bf 00 00 08 00
[138824.286710] end_request: I/O error, dev sda, sector 4062245567
[138824.286714] __ratelimit: 8 callbacks suppressed
[138824.286718] Buffer I/O error on device sda1, logical block 507780688
[138824.286719] lost page write due to I/O error on sda1
[138824.324706] sd 9:0:0:0: [sda] Unhandled sense code
[138824.324709] sd 9:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[138824.324711] sd 9:0:0:0: [sda] Sense Key : Medium Error [current]
[138824.324714] sd 9:0:0:0: [sda] Add. Sense: Medium not present
[138824.324717] sd 9:0:0:0: [sda] CDB: Write(10): 2a 00 f2 20 fa 1f 00 00 08 00
[138824.324722] end_request: I/O error, dev sda, sector 4062247455
[138824.324726] Buffer I/O error on device sda1, logical block 507780924
[138824.324727] lost page write due to I/O error on sda1
[138824.324741] sd 9:0:0:0: [sda] Unhandled sense code
[138824.324742] sd 9:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[138824.324744] sd 9:0:0:0: [sda] Sense Key : Medium Error [current]
[138824.324747] sd 9:0:0:0: [sda] Add. Sense: Medium not present
[138824.324749] sd 9:0:0:0: [sda] CDB: Write(10): 2a 00 f2 2e a1 17 00 00 08 00
[138824.324754] end_request: I/O error, dev sda, sector 4063142167
[138824.324756] Buffer I/O error on device sda1, logical block 507892763
[138824.324758] lost page write due to I/O error on sda1
[138824.324773] JBD: recovery failed
[138824.324774] EXT3-fs: error loading journal.
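The kernel messages above can be cross-checked against each other. The following is a minimal sketch, not part of the original troubleshooting: it decodes the LBA from a Write(10) CDB and maps a failing disk sector to the file-system block reported as "logical block". The partition start of 63 sectors is an assumption (it is not shown in the logs, but it is consistent with the sector/block pairs printed above), and 8 sectors per block assumes 512-byte sectors with the 4096-byte block size reported by dumpe2fs.

```shell
#!/usr/bin/env bash
# Decode the logical block address from a SCSI WRITE(10) CDB as printed by the kernel.
# Arguments are the ten hex bytes; bytes 3-6 hold the LBA, bytes 8-9 the length in sectors.
cdb_write10_lba() {
    echo $(( 16#$3$4$5$6 ))
}

cdb_write10_sectors() {
    echo $(( 16#$8$9 ))
}

# Map an absolute disk sector to a block number inside the partition.
# part_start and sectors_per_block are supplied by the caller (assumed values here).
sector_to_fs_block() {
    local sector=$1 part_start=$2 sectors_per_block=$3
    echo $(( (sector - part_start) / sectors_per_block ))
}

cdb_write10_lba 2a 00 e6 97 59 5f 00 00 08 00      # prints 3868678495
cdb_write10_sectors 2a 00 e6 97 59 5f 00 00 08 00  # prints 8
sector_to_fs_block 3868678495 63 8                 # prints 483584804
```

The decoded LBA equals the sector in the end_request line, and the mapped block equals the "logical block" in the Buffer I/O error line that follows it, so each group of messages describes one failed 4 KB write.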
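The engineer's initial diagnosis was a damaged superblock, and the following repair plan was drawn up:
1. Use dd to back up the original /dev/sda1 partition to another partition. The original partition is 2 TB, so a volume slightly larger than 2 TB was allocated on the IP storage and attached to the application server to hold the backup.
2. After the backup completes, repair the file system with fsck.ext3.
Create the new partition /dev/sdb1: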
linux11:/var/log #fdisk /dev/sdb
The number of cylinders for this disk is set to 267075.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
(e.g., DOS FDISK, OS/2 FDISK)
Command (m for help): n
Command action
e extended
p primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-267075, default 1):
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-267075, default 267075):
Using default value 267075
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.
linux11:/var/log #
linux11:/var/log #
linux11:/var/log #
linux11:/var/log # fdisk -l
Disk /dev/cciss/c0d0: 300.0 GB, 299966445568 bytes
255 heads, 63 sectors/track, 36468 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x000bf615
Device Boot Start End Blocks Id System
/dev/cciss/c0d0p1 * 1 38 305203+ 83 Linux
/dev/cciss/c0d0p2 39 4215 33551752+ 82 Linux swap / Solaris
/dev/cciss/c0d0p3 4216 36468 259072222+ 83 Linux
Disk /dev/sda: 2097.2 GB, 2097152000000 bytes
255 heads, 63 sectors/track, 254964 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x000a13a0
Device Boot Start End Blocks Id System
/dev/sda1 1 254964 2047998298+ 83 Linux
Disk /dev/sdb: 2196.8 GB, 2196766720000 bytes
255 heads, 63 sectors/track, 267075 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x24828d3f
Device Boot Start End Blocks Id System
/dev/sdb1 1 267075 2145279906 83 Linux
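Before starting a raw copy it is worth confirming that the target partition can hold the source. A minimal sketch, assuming sizes are compared in bytes; the function name target_can_hold is hypothetical, and the numbers are the 1K block counts from the fdisk listing above.

```shell
#!/usr/bin/env bash
# Succeeds only if the target is at least as large as the source (sizes in bytes).
target_can_hold() {
    local src_bytes=$1 dst_bytes=$2
    [ "$dst_bytes" -ge "$src_bytes" ]
}

# On a live system the sizes could be read with blockdev, for example:
#   src=$(blockdev --getsize64 /dev/sda1)
#   dst=$(blockdev --getsize64 /dev/sdb1)
# Here the 1K block counts from the fdisk listing are used instead
# (sda1 rounded up by one block because of its trailing "+").
src=$(( 2047998299 * 1024 ))
dst=$(( 2145279906 * 1024 ))

if target_can_hold "$src" "$dst"; then
    echo "target is large enough"
else
    echo "target is too small"
fi
```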
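Note: at this point we tried to format the new partition with mkfs. Because the file system is 2 TB, formatting took a very long time and we eventually cancelled it. Be aware that kill does not stop the process quickly either; you simply have to wait. We then re-allocated the storage space and partitioned it again, this time without formatting it.
Start the data backup: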
dd if=/dev/sda1 of=/dev/sdb1 bs=8M
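At first bs was not specified, so dd used its default block size of only 512 bytes. After about 30 hours of waiting, a speed measurement showed only about 1 MB/s, so the copy was interrupted and restarted with bs=8M.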
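To see why the block size matters, consider how many read/write pairs dd has to issue. The sketch below (the helper name dd_records is hypothetical) computes the record count in the same "full+partial" form that dd prints, using the total of 2097150257664 bytes that dd reports for this partition.

```shell
#!/usr/bin/env bash
# Print the number of full and partial records dd needs to copy total_bytes with block size bs.
dd_records() {
    local total_bytes=$1 bs=$2
    local full=$(( total_bytes / bs ))
    local partial=0
    if [ $(( total_bytes % bs )) -ne 0 ]; then
        partial=1
    fi
    echo "${full}+${partial}"
}

dd_records 2097150257664 512                    # prints 4095996597+0
dd_records 2097150257664 $(( 8 * 1024 * 1024 )) # prints 249999+1
```

With the 512-byte default, dd needs roughly 4.1 billion write calls; with bs=8M it needs about 250 thousand, which cuts the per-call overhead considerably.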
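The application server did not have the stat package installed, so here is an alternative way to measure the copy speed by attaching strace to the running dd process (PID 11929 here):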
>strace.log
time strace -o strace.log -p 11929
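Let it run for a while, then press Ctrl+C to stop it.
Count the number of write calls:
grep -c write strace.log
echo "<write count> * 8 / <elapsed seconds reported by time>" | bc
The result is the estimated copy speed in MB per second.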
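The estimate above can be wrapped in a small function. This is a sketch under stated assumptions: the log was produced by strace -o against a single dd process, so each write call appears on a line starting with "write(", and every write moves one full block. The function name estimate_mb_per_s and its arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Estimate throughput from a strace log of a running dd process.
#   $1  path to the strace log
#   $2  dd block size in MB (8 for bs=8M)
#   $3  seconds that strace was attached (the "real" time reported by time)
estimate_mb_per_s() {
    local log_file=$1 block_mb=$2 seconds=$3
    local writes
    # Anchor the pattern so that only write(2) calls are counted.
    writes=$(grep -c '^write(' "$log_file")
    echo $(( writes * block_mb / seconds ))
}

# Example (not run here): estimate_mb_per_s strace.log 8 60
```

GNU dd also prints its current statistics to stderr when it receives SIGUSR1 (kill -USR1 followed by the PID), which avoids attaching strace at all.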
The backup finished about 36 hours later (130468 s according to the dd summary):
249999+1 records in
249999+1 records out
2097150257664 bytes (2.1 TB) copied, 130468 s, 16.1 MB/s
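A quick sanity check is to confirm that the byte count reported by dd equals the size of the source partition. The sketch below converts the "Blocks" value from the fdisk listing into bytes. It assumes that the trailing "+" printed by this older fdisk marks an odd number of 512-byte sectors, that is, one extra half block; the helper name fdisk_blocks_to_bytes is hypothetical.

```shell
#!/usr/bin/env bash
# Convert an fdisk "Blocks" value (1K blocks, optional trailing "+") to bytes.
fdisk_blocks_to_bytes() {
    local blocks=$1 extra=0
    case $blocks in
        *+) blocks=${blocks%+}; extra=512 ;;
    esac
    echo $(( blocks * 1024 + extra ))
}

fdisk_blocks_to_bytes "2047998298+"   # prints 2097150257664, the byte count dd reported
echo $(( 130468 / 3600 ))             # prints 36, the elapsed time in whole hours
```

The result matches the dd summary exactly, which suggests that the whole of /dev/sda1 was copied.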
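Because the original application server was still serving the business from temporary storage, the partition was attached through the IP storage to another machine running the same operating system and repaired there. The first step is to locate the superblocks: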
linux11:~ #dumpe2fs /dev/sda1
dumpe2fs 1.41.9 (22-Aug-2009)
Filesystem volume name: <none>
Last mounted on: <not available>
Filesystem UUID: 34689bab-428f-4e84-b3b8-22351dfcbe9a
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 128000000
Block count: 511999574
Reserved block count: 25599978
Free blocks: 483185304
Free inodes: 126463697
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 901
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Filesystem created: Wed May 11 16:02:51 2011
Last mount time: Thu Aug 2 17:26:01 2012
Last write time: Thu Aug 2 17:26:01 2012
Mount count: 8
Maximum mount count: -1
Last checked: Wed May 11 16:02:51 2011
Check interval: 0 (<none>)
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: 7c5d0a45-f12f-4ce5-8e8a-cb1029acbf2d
Journal backup: inode blocks
Journal size: 128M
Group 0: (Blocks 0-32767)
Primary superblock at 0, Group descriptors at 1-123
Reserved GDT blocks at 124-1024
Block bitmap at 1025 (+1025), Inode bitmap at 1026 (+1026)
Inode table at 1027-1538 (+1027)
1255 free blocks, 7617 free inodes, 14 directories
Free blocks: 4442, 31514-32767
Free inodes: 576-8192
Group 1: (Blocks 32768-65535)
Backup superblock at 32768, Group descriptors at 32769-32891
Reserved GDT blocks at 32892-33792
Block bitmap at 33793 (+1025), Inode bitmap at 33794 (+1026)
Inode table at 33795-34306 (+1027)
7263 free blocks, 7415 free inodes, 16 directories
Free blocks: 34340-36863, 37296, 49851-52059, 52083-54611
Free inodes: 8970-16384
Group 2: (Blocks 65536-98303)
Block bitmap at 65536 (+0), Inode bitmap at 65537 (+1)
Inode table at 65538-66049 (+2)
4 free blocks, 7403 free inodes, 21 directories
Free blocks: 72068-72071
Free inodes: 17174-24576
Group 3: (Blocks 98304-131071)
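... there are many backup superblocks (the "Backup superblock at" lines, shown in red in the original post); the rest of the output is omitted ...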
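The positions of the backup superblocks follow a fixed pattern. With the sparse_super feature listed in the dumpe2fs output, backups are kept only in group 1 and in groups whose numbers are powers of 3, 5 and 7, which is why group 2 above has none. The sketch below lists those positions; it assumes that the first block is 0 (true for the 4096-byte block size here), and the function name backup_superblock_blocks is hypothetical.

```shell
#!/usr/bin/env bash
# List the block numbers of backup superblocks on a sparse_super ext2/ext3 file system.
#   $1  blocks per group
#   $2  total block count
backup_superblock_blocks() {
    local blocks_per_group=$1 block_count=$2
    # Number of block groups, rounding up for a partial last group.
    local groups=$(( (block_count + blocks_per_group - 1) / blocks_per_group ))
    local base g
    {
        echo 1
        for base in 3 5 7; do
            g=$base
            while [ "$g" -lt "$groups" ]; do
                echo "$g"
                g=$(( g * base ))
            done
        done
    } | sort -n | while read -r g; do
        echo $(( g * blocks_per_group ))
    done
}

# Values taken from the dumpe2fs output above.
backup_superblock_blocks 32768 511999574 | head -n 5
# prints 32768, 98304, 163840, 229376, 294912 (one per line)
```

On a real system, filtering the dumpe2fs output for "superblock" gives the same list directly; the calculation is mainly useful when the primary superblock is too damaged for dumpe2fs to read.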
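The file system keeps backup copies of the superblock at several locations. This time the copy at block 32768 was used for the repair: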
linux01:~ #fsck.ext3 -y -b 32768 /dev/sda1
e2fsck 1.41.9 (22-Aug-2009)
Superblock needs_recovery flag is clear, but journal has data.
Recovery flag not set in backup superblock, so running journal anyway.
/dev/sda1: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -(22281864--22282239) -(22282760--22282957) -(22282967--22283536) -(22283552--22284122) -(22284138--22284287) -(22284582--22285018) -(100555611--100556799) -(100580404--100581255)
Fix? yes
Free blocks count wrong for group #0 (31223, counted=1255).
Fix? yes
Free blocks count wrong for group #1 (31229, counted=7263).
Fix? yes
Free blocks count wrong for group #2 (32254, counted=4).
Fix? yes
Free blocks count wrong for group #3 (31229, counted=0).
Fix? yes
... (omitted) ...
Free inodes count wrong for group #15622 (8192, counted=7167).
Fix? yes
Directories count wrong for group #15622 (0, counted=34).
Fix? yes
Free inodes count wrong for group #15623 (8192, counted=6821).
Fix? yes
Directories count wrong for group #15623 (0, counted=52).
Fix? yes
Free inodes count wrong for group #15624 (8192, counted=7247).
Fix? yes
Directories count wrong for group #15624 (0, counted=21).
Fix? yes
Free inodes count wrong (127999989, counted=120656202).
Fix? yes
/dev/sda1: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda1: 7343798/128000000 files (2.8% non-contiguous), 259583311/511999574 blocks
The repair succeeded. The file system mounts normally again, and files and directories are accessible.
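When a repair like this is scripted rather than run by hand, the exit status of e2fsck is worth checking, because it is a bit mask and not a simple success/failure flag. A minimal sketch that decodes it according to the exit codes documented in the e2fsck manual page; the function name describe_fsck_status is hypothetical.

```shell
#!/usr/bin/env bash
# Translate an e2fsck exit status (a bit mask) into a readable description.
describe_fsck_status() {
    local status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    local msgs=()
    (( status & 1 ))   && msgs+=("errors corrected")
    (( status & 2 ))   && msgs+=("reboot recommended")
    (( status & 4 ))   && msgs+=("errors left uncorrected")
    (( status & 8 ))   && msgs+=("operational error")
    (( status & 16 ))  && msgs+=("usage or syntax error")
    (( status & 32 ))  && msgs+=("canceled by user request")
    (( status & 128 )) && msgs+=("shared library error")
    # Join the collected messages with commas.
    local IFS=','
    echo "${msgs[*]}"
}

# Example (not run here):
#   fsck.ext3 -y -b 32768 /dev/sda1
#   describe_fsck_status $?
describe_fsck_status 5   # prints errors corrected,errors left uncorrected
```

A run that ends with "FILE SYSTEM WAS MODIFIED", as above, would normally be expected to return a status with bit 1 set; bit 4 means another pass or a manual fix is still needed.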