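Environment:
iSCSI server: Windows Server 2008 R2 SP1 + Microsoft iSCSI Software Target 3.3
iSCSI client: RHEL 6.4 x86_64 + iscsi-initiator-utils-6.2.0.873-2.el6.x86_64
Reproducing the fault:
A Windows 2008 host served as the iSCSI server and mapped several disks to the two RAC nodes.
I wanted to simulate replacing the ASM disks in a RAC environment. After the replacement with the new disks succeeded, I disabled the mappings of the old LUNs. Once the RAC nodes were rebooted, they could no longer detect the iSCSI disks, and RAC could not start on any node.
Before the reboot: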
[root@rac1 ~]# fdisk -l
Disk /dev/sda: 32.2 GB, 32212254720 bytes
255 heads, 63 sectors/track, 3916 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000874ca
Device Boot Start End Blocks Id System
/dev/sda1 * 1 26 204800 83 Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2 26 3917 31251456 8e Linux LVM
Disk /dev/mapper/vg00-lv_root: 23.4 GB, 23408410624 bytes
255 heads, 63 sectors/track, 2845 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Disk /dev/mapper/vg00-lv_swap: 8589 MB, 8589934592 bytes
255 heads, 63 sectors/track, 1044 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Disk /dev/sdi: 21.5 GB, 21474836480 bytes
64 heads, 32 sectors/track, 20480 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xee8ee349
Device Boot Start End Blocks Id System
/dev/sdi1 1 20480 20971504 83 Linux
Disk /dev/sdj: 21.5 GB, 21474836480 bytes
64 heads, 32 sectors/track, 20480 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x7785654b
Device Boot Start End Blocks Id System
/dev/sdj1 1 20480 20971504 83 Linux
Disk /dev/sdk: 32.2 GB, 32212254720 bytes
64 heads, 32 sectors/track, 30720 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xcae7e952
Device Boot Start End Blocks Id System
/dev/sdk1 1 30720 31457264 83 Linux
Disk /dev/sdl: 2147 MB, 2147483648 bytes
67 heads, 62 sectors/track, 1009 cylinders
Units = cylinders of 4154 * 512 = 2126848 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x6e15849d
Device Boot Start End Blocks Id System
/dev/sdl1 1 1009 2095662 83 Linux
Disk /dev/sdm: 2147 MB, 2147483648 bytes
67 heads, 62 sectors/track, 1009 cylinders
Units = cylinders of 4154 * 512 = 2126848 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xb137a7a4
Device Boot Start End Blocks Id System
/dev/sdm1 1 1009 2095662 83 Linux
Disk /dev/sdn: 2147 MB, 2147483648 bytes
67 heads, 62 sectors/track, 1009 cylinders
Units = cylinders of 4154 * 512 = 2126848 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xb523ed12
Device Boot Start End Blocks Id System
/dev/sdn1 1 1009 2095662 83 Linux
Disk /dev/sdo: 2147 MB, 2147483648 bytes
67 heads, 62 sectors/track, 1009 cylinders
Units = cylinders of 4154 * 512 = 2126848 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xcda3ffae
Device Boot Start End Blocks Id System
/dev/sdo1 1 1009 2095662 83 Linux
After the reboot:
[root@rac1 ~]# fdisk -l
Disk /dev/sda: 32.2 GB, 32212254720 bytes
255 heads, 63 sectors/track, 3916 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000874ca
Device Boot Start End Blocks Id System
/dev/sda1 * 1 26 204800 83 Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2 26 3917 31251456 8e Linux LVM
Disk /dev/mapper/vg00-lv_root: 23.4 GB, 23408410624 bytes
255 heads, 63 sectors/track, 2845 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Disk /dev/mapper/vg00-lv_swap: 8589 MB, 8589934592 bytes
255 heads, 63 sectors/track, 1044 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Logging in to the iSCSI target again does not bring the disks back either:
[root@rac1 ~]# iscsiadm -m discovery -t st -p 192.168.253.100
192.168.253.100:3260,1 iqn.1991-05.com.microsoft:dnsserver-rac1-target
[root@rac1 ~]# iscsiadm -m node -T iqn.1991-05.com.microsoft:dnsserver-rac1-target -l
[root@rac1 ~]# fdisk -l
Disk /dev/sda: 32.2 GB, 32212254720 bytes
255 heads, 63 sectors/track, 3916 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000874ca
Device Boot Start End Blocks Id System
/dev/sda1 * 1 26 204800 83 Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2 26 3917 31251456 8e Linux LVM
Disk /dev/mapper/vg00-lv_root: 23.4 GB, 23408410624 bytes
255 heads, 63 sectors/track, 2845 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Disk /dev/mapper/vg00-lv_swap: 8589 MB, 8589934592 bytes
255 heads, 63 sectors/track, 1044 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Yet the status shown on the server side looks normal.
The client log does show errors, but not enough detail to analyze further:
[root@rac1 ~]# dmesg|tail
scsi33 : iSCSI Initiator over TCP/IP
scsi 33:0:0:0: Device offlined - not ready after error recovery
scsi34 : iSCSI Initiator over TCP/IP
scsi 34:0:0:0: Device offlined - not ready after error recovery
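To make the pattern in this log easier to see, the following is a minimal sketch that extracts the host:channel:target:LUN address of every device the kernel took offline. The function name offlined_scsi_devices is hypothetical, and the match assumes the message wording shown in the dmesg output above. In this log both addresses end in LUN 0, which is at least consistent with the scan failing at LUN 0.

```shell
# Print the H:C:T:L address of each SCSI device reported as offlined.
# Reads kernel log text on stdin. Assumes messages of the form:
#   scsi 33:0:0:0: Device offlined - not ready after error recovery
offlined_scsi_devices() {
    sed -n 's/^.*scsi \([0-9]*:[0-9]*:[0-9]*:[0-9]*\): Device offlined.*$/\1/p' | sort -u
}
```

Usage on a client would be: dmesg | offlined_scsi_devices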
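Test 1:
Disable every disk except LUN 0 (testing showed that the disks are detected after a reboot only if LUN 0 is kept).
Reboot the client. Every disk except sdb now has a different device name: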
[root@rac2 ~]# fdisk -l |grep "Disk /dev/sd*"
Disk /dev/sda: 32.2 GB, 32212254720 bytes
Disk /dev/sdb: 2147 MB, 2147483648 bytes
Disk /dev/sdc: 21.5 GB, 21474836480 bytes
Disk /dev/sdd: 21.5 GB, 21474836480 bytes
Disk /dev/sde: 32.2 GB, 32212254720 bytes
Disk /dev/sdf: 2147 MB, 2147483648 bytes
Disk /dev/sdg: 2147 MB, 2147483648 bytes
Disk /dev/sdh: 2147 MB, 2147483648 bytes
Device name changes:
sdi --> sdc
sdj --> sdd
sdk --> sde
sdl --> sdf
sdm --> sdg
sdn --> sdh
The ASM disk status is also normal, which shows that the device name changes do not affect how ASMLib identifies the disks:
[root@rac2 ~]# oracleasm querydisk -d /dev/sdc1
Device "/dev/sdc1" is marked an ASM disk with the label "NEW_DATA01"
[root@rac2 ~]# oracleasm querydisk -d /dev/sdd1
Device "/dev/sdd1" is marked an ASM disk with the label "NEW_DATA02"
[root@rac2 ~]# oracleasm querydisk -d /dev/sde1
Device "/dev/sde1" is marked an ASM disk with the label "NEW_FRA01"
[root@rac2 ~]# oracleasm querydisk -d /dev/sdf1
Device "/dev/sdf1" is marked an ASM disk with the label "NEW_OCR01"
[root@rac2 ~]# oracleasm querydisk -d /dev/sdg1
Device "/dev/sdg1" is marked an ASM disk with the label "NEW_OCR02"
[root@rac2 ~]# oracleasm querydisk -d /dev/sdh1
Device "/dev/sdh1" is marked an ASM disk with the label "NEW_OCR03"
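Because the sdX names shift while the ASM labels stay put, it can help to record a label-to-device table before and after a change. This is a minimal sketch, assuming the oracleasm querydisk -d output format shown above; the function name asm_disk_labels is hypothetical.

```shell
# Turn "oracleasm querydisk -d" output into "LABEL DEVICE" pairs.
# Reads the command output on stdin. Assumes lines of the form:
#   Device "/dev/sdc1" is marked an ASM disk with the label "NEW_DATA01"
asm_disk_labels() {
    # Splitting on double quotes puts the device in field 2 and the label in field 4.
    awk -F'"' '/is marked an ASM disk with the label/ { print $4, $2 }'
}
```

Usage on a node would be: oracleasm querydisk -d /dev/sdc1 | asm_disk_labels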
Try wiping sdb:
dd if=/dev/zero of=/dev/sdb bs=1024k count=2048
After removing the LUN 0 mapping and rebooting, the problem persists.
To get a better view of the disks mapped over iSCSI, install the lsscsi tool:
[root@rac1 ~]# yum install -y lsscsi
[root@rac1 ~]# lsscsi -t
[0:0:0:0] disk spi:0 /dev/sda
[4:0:0:0] cd/dvd sata: /dev/sr0
[33:0:0:0] disk iqn.1991-05.com.microsoft:dnsserver-rac2-target,t,0x1 /dev/sdb
[33:0:0:7] disk iqn.1991-05.com.microsoft:dnsserver-rac2-target,t,0x1 /dev/sdc
[33:0:0:8] disk iqn.1991-05.com.microsoft:dnsserver-rac2-target,t,0x1 /dev/sdd
[33:0:0:9] disk iqn.1991-05.com.microsoft:dnsserver-rac2-target,t,0x1 /dev/sde
[33:0:0:10] disk iqn.1991-05.com.microsoft:dnsserver-rac2-target,t,0x1 /dev/sdf
[33:0:0:11] disk iqn.1991-05.com.microsoft:dnsserver-rac2-target,t,0x1 /dev/sdg
[33:0:0:12] disk iqn.1991-05.com.microsoft:dnsserver-rac2-target,t,0x1 /dev/sdh
Remove the LUN 0 mapping:
sdb no longer appears in fdisk -l.
All of the iSCSI-mapped disk paths still appear in lsscsi -t.
After logging out of the target with iscsiadm:
[root@rac1 ~]# iscsiadm -m node -T iqn.1991-05.com.microsoft:dnsserver-rac1-target -u
lsscsi -t no longer shows any of the mapped devices:
[root@rac1 ~]# lsscsi -t
[0:0:0:0] disk spi:0 /dev/sda
[4:0:0:0] cd/dvd sata: /dev/sr0
Restore the LUN 0 mapping.
After logging in to the target again with iscsiadm, the disks are back to normal:
[root@rac1 ~]# iscsiadm -m node -T iqn.1991-05.com.microsoft:dnsserver-rac1-target -l
[root@rac2 ~]# lsscsi -t
[0:0:0:0] disk spi:0 /dev/sda
[4:0:0:0] cd/dvd sata: /dev/sr0
[34:0:0:0] disk iqn.1991-05.com.microsoft:dnsserver-rac2-target,t,0x1 /dev/sdb
[34:0:0:7] disk iqn.1991-05.com.microsoft:dnsserver-rac2-target,t,0x1 /dev/sdc
[34:0:0:8] disk iqn.1991-05.com.microsoft:dnsserver-rac2-target,t,0x1 /dev/sdd
[34:0:0:9] disk iqn.1991-05.com.microsoft:dnsserver-rac2-target,t,0x1 /dev/sde
[34:0:0:10] disk iqn.1991-05.com.microsoft:dnsserver-rac2-target,t,0x1 /dev/sdf
[34:0:0:11] disk iqn.1991-05.com.microsoft:dnsserver-rac2-target,t,0x1 /dev/sdg
[34:0:0:12] disk iqn.1991-05.com.microsoft:dnsserver-rac2-target,t,0x1 /dev/sdh
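Test 1 conclusion: the root cause is still unknown. Either Linux treats sdb in some special way, or the problem lies with the Windows iSCSI target.
Test 2:
Use Linux iSCSI target utils as the server and check whether the fault reproduces, to determine whether the Linux client or the Windows iSCSI target server is responsible.
For the setup procedure, see my other note, "iSCSI服务器以及客户端安装配置" (installing and configuring an iSCSI server and client): http://childres.blog.51cto.com/11420270/1763069
After starting the tgtd service and listing the current LUNs, it turns out that iSCSI target utils assigns LUN 0 to the controller by default: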
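Since Test 1 shows that the disks only come back after a reboot when LUN 0 is present, a quick check for sessions that expose LUNs but no LUN 0 may be useful before rebooting a node. This is a rough sketch, assuming the lsscsi -t output format shown above, where the third column of an iSCSI entry starts with "iqn."; the function name iscsi_hosts_without_lun0 is hypothetical.

```shell
# Print the SCSI host number of every iSCSI session that has LUNs but no LUN 0.
# Reads "lsscsi -t" output on stdin.
iscsi_hosts_without_lun0() {
    awk '
        $3 ~ /^iqn\./ {
            # $1 looks like [34:0:0:7]; strip the brackets and split H:C:T:L.
            addr = $1
            gsub(/\[|\]/, "", addr)
            split(addr, f, ":")
            seen[f[1]] = 1
            if (f[4] == "0") has0[f[1]] = 1
        }
        END {
            for (h in seen) if (!(h in has0)) print h
        }
    ' | sort -n
}
```

Usage on a client would be: lsscsi -t | iscsi_hosts_without_lun0. Empty output means every iSCSI session currently lists a LUN 0.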
[root@target ~]# tgtadm --mode target --op show
Target 1: iqn.2016-04.target:vdisk
System information:
Driver: iscsi
State: ready
I_T nexus information:
LUN information:
LUN: 0
Type: controller
SCSI ID: IET 00010000
SCSI SN: beaf10
Size: 0 MB, Block size: 1
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: null
Backing store path: None
Backing store flags:
LUN: 1
Type: disk
SCSI ID: IET 00010001
SCSI SN: beaf11
Size: 21475 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /iscsi_data/data1.img
Backing store flags:
LUN: 2
Type: disk
SCSI ID: IET 00010002
SCSI SN: beaf12
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /iscsi_data/fra1.img
Backing store flags:
LUN: 3
Type: disk
SCSI ID: IET 00010003
SCSI SN: beaf13
Size: 1074 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /iscsi_data/ocr1.img
Backing store flags:
LUN: 4
Type: disk
SCSI ID: IET 00010004
SCSI SN: beaf14
Size: 1074 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /iscsi_data/ocr2.img
Backing store flags:
LUN: 5
Type: disk
SCSI ID: IET 00010005
SCSI SN: beaf15
Size: 1074 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /iscsi_data/ocr3.img
Backing store flags:
Account information:
ACL information:
ALL
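The listing above is long, so a compact view of which LUN has which type makes the controller at LUN 0 stand out. This is a minimal sketch, assuming the "LUN:" and "Type:" line layout of the tgtadm output shown above; the function name tgt_lun_types is hypothetical.

```shell
# Condense "tgtadm --mode target --op show" output to "LUN TYPE" pairs.
# Reads the command output on stdin.
tgt_lun_types() {
    awk '
        $1 == "LUN:"               { lun = $2 }              # remember the current LUN number
        $1 == "Type:" && lun != "" { print lun, $2; lun = "" } # pair it with its type
    '
}
```

Usage on the target would be: tgtadm --mode target --op show | tgt_lun_types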
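After logging in on the client, LUN 0 (34:0:0:0) shows up as a storage controller with no disk device behind it, so there is no LUN 0 mapping that could be removed: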
[root@rac1 ~]# lsscsi -t
[0:0:0:0] disk spi:0 /dev/sda
[4:0:0:0] cd/dvd sata: /dev/sr0
[34:0:0:0] storage iqn.2016-04.target:vdisk,t,0x1 -
[34:0:0:1] disk iqn.2016-04.target:vdisk,t,0x1 /dev/sdb
[34:0:0:2] disk iqn.2016-04.target:vdisk,t,0x1 /dev/sdc
[34:0:0:3] disk iqn.2016-04.target:vdisk,t,0x1 /dev/sdd
[34:0:0:4] disk iqn.2016-04.target:vdisk,t,0x1 /dev/sde
[34:0:0:5] disk iqn.2016-04.target:vdisk,t,0x1 /dev/sdf
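Test 2 conclusion: LUN 0 of the Microsoft iSCSI Software Target carries the iSCSI controller information, so removing the LUN 0 mapping also affects the other LUNs and the client can no longer detect them properly.
P.S. All of the above was done on test virtual machines. I do not know whether iSCSI storage running Windows Storage Server (for example the HP StoreEasy series) has the same problem. I have no such device at the moment, so I may later download Windows Storage Server 2012 and test it in a virtual machine.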