The fault: after the host lost power, Solaris 11 would not come up. The boot phase reported: ERROR: boot-read fail
The console output at boot was:
SPARC T4-1, No Keyboard
Copyright (c) 1998, 2012, Oracle and/or its affiliates. All rights reserved.
OpenBoot 4.34.1, 32256 MB memory available, Serial #101718426.
Ethernet address 0:10:e0:10:19:9a, Host ID: 8610199a.
ERROR: boot-read fail
Boot device: net File and args:
1000 Mbps full duplex Link up
Requesting Internet Address for 0:10:e0:10:19:9a
Requesting Internet Address for 0:10:e0:10:19:9a
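Requesting Internet Address for 0:10:e0:10:19:9a
Because boot-device in OBP is set to "disk net", the system automatically fell back to net once booting from disk failed. Network boot was never configured for this system, so it just keeps trying to obtain an IP address from the network.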
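The fallback described above can be modelled in a few lines of Python. This is only an illustrative sketch of the order in which the aliases in boot-device are tried, not OBP's actual implementation; the function name boot_attempts and the fails parameter are made up for the example.

```python
def boot_attempts(boot_device: str, fails: set) -> list:
    """List the aliases tried, in order, stopping at the first one
    that does not fail outright (hypothetical model of the fallback)."""
    tried = []
    for alias in boot_device.split():
        tried.append(alias)
        if alias not in fails:
            break  # this device is used (here: net, which then waits for an IP)
    return tried


# boot-device was "disk net" and the disk boot failed, so net was tried next.
print(boot_attempts("disk net", fails={"disk"}))  # ['disk', 'net']
```

With a healthy disk the same call would stop at "disk" and never reach "net".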
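Resolution
1. Disable OS auto-boot from ILOM
The system had auto-boot?=true set in OBP, so the first step is to turn auto-boot off from ILOM.
Log in to ILOM over the serial port and disable OS auto-boot:
set /HOST/bootmode script="setenv auto-boot? false"
Restart the host:
stop /SYS
start /SYS
Switch to the console:
start /SP/console
After startup the system stops at the ok prompt.
2. Check the disks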
{0} ok probe-scsi-all
/pci@400/pci@2/pci@0/pci@f/pci@0/usb@0,2/hub@2/hub@3/storage@2
Unit 0 Removable Read Only device AMI Virtual CDROM 1.00
/pci@400/pci@2/pci@0/pci@c/SUNW,qlc@0,1
QLogic QLE2562 Host Adapter FCode(SPARC): 2.03 06/30/08
Wait for link up - /
Firmware version 4.03.02
Adapter portID - 10200
************** Fabric Attached Devices **************
Dev# 0 (0) PortID 10000 Port WWN 200a00a0b848305a
LUN 0(0) DISK SUN CSM200_R 0760
LUN 1(1) DISK SUN CSM200_R 0760
LUN 2(2) DISK SUN CSM200_R 0760
LUN 3(3) DISK SUN CSM200_R 0760
LUN 4(4) DISK SUN CSM200_R 0760
Dev# 1 (1) PortID 10100 Port WWN 200b00a0b848305a
LUN 0(0) DISK SUN CSM200_R 0760
LUN 1(1) DISK SUN CSM200_R 0760
LUN 2(2) DISK SUN CSM200_R 0760
LUN 3(3) DISK SUN CSM200_R 0760
LUN 4(4) DISK SUN CSM200_R 0760
/pci@400/pci@2/pci@0/pci@c/SUNW,qlc@0
QLogic QLE2562 Host Adapter FCode(SPARC): 2.03 06/30/08
Wait for link up - /
Firmware version 4.03.02
Adapter portID - 10200
************** Fabric Attached Devices **************
Dev# 0 (0) PortID 10000 Port WWN 200a00a0b8483059
LUN 0(0) DISK SUN CSM200_R 0760
LUN 1(1) DISK SUN CSM200_R 0760
LUN 2(2) DISK SUN CSM200_R 0760
LUN 3(3) DISK SUN CSM200_R 0760
LUN 4(4) DISK SUN CSM200_R 0760
Dev# 1 (1) PortID 10100 Port WWN 200b00a0b8483059
LUN 0(0) DISK SUN CSM200_R 0760
LUN 1(1) DISK SUN CSM200_R 0760
LUN 2(2) DISK SUN CSM200_R 0760
LUN 3(3) DISK SUN CSM200_R 0760
LUN 4(4) DISK SUN CSM200_R 0760
/pci@400/pci@2/pci@0/pci@4/scsi@0
FCode Version 1.00.61 , MPT Version 2.00, Firmware Version 9.00.00.00
Target 9
Unit 0 Removable Read Only device TEAC DV-W28SS-V 1.0B
SATA device PhyNum 6
/pci@400/pci@1/pci@0/pci@4/scsi@0
FCode Version 1.00.61 , MPT Version 2.00, Firmware Version 9.00.00.00
Target 381 Volume 0
Unit 0 Disk LSI Logical Volume 3000 583983104 Blocks, 298 GB
VolumeDeviceName 3e90f849ae6f04a5 VolumeWWID 0e90f849ae6f04a5
/pci@400/pci@1/pci@0/pci@0/SUNW,qlc@0,1
QLogic QLE2562 Host Adapter FCode(SPARC): 2.03 06/30/08
Wait for link up - /
Firmware version 4.03.02
Adapter portID - 10300
************** Fabric Attached Devices **************
Dev# 0 (0) PortID 10000 Port WWN 200a00a0b848305a
LUN 0(0) DISK SUN CSM200_R 0760
LUN 1(1) DISK SUN CSM200_R 0760
LUN 2(2) DISK SUN CSM200_R 0760
LUN 3(3) DISK SUN CSM200_R 0760
LUN 4(4) DISK SUN CSM200_R 0760
Dev# 1 (1) PortID 10100 Port WWN 200b00a0b848305a
LUN 0(0) DISK SUN CSM200_R 0760
LUN 1(1) DISK SUN CSM200_R 0760
LUN 2(2) DISK SUN CSM200_R 0760
LUN 3(3) DISK SUN CSM200_R 0760
LUN 4(4) DISK SUN CSM200_R 0760
/pci@400/pci@1/pci@0/pci@0/SUNW,qlc@0
QLogic QLE2562 Host Adapter FCode(SPARC): 2.03 06/30/08
Wait for link up - /
Firmware version 4.03.02
Adapter portID - 10300
************** Fabric Attached Devices **************
Dev# 0 (0) PortID 10000 Port WWN 200a00a0b8483059
LUN 0(0) DISK SUN CSM200_R 0760
LUN 1(1) DISK SUN CSM200_R 0760
LUN 2(2) DISK SUN CSM200_R 0760
LUN 3(3) DISK SUN CSM200_R 0760
LUN 4(4) DISK SUN CSM200_R 0760
Dev# 1 (1) PortID 10100 Port WWN 200b00a0b8483059
LUN 0(0) DISK SUN CSM200_R 0760
LUN 1(1) DISK SUN CSM200_R 0760
LUN 2(2) DISK SUN CSM200_R 0760
LUN 3(3) DISK SUN CSM200_R 0760
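LUN 4(4) DISK SUN CSM200_R 0760
In the output above, everything shown as a LUN is space presented by the storage array. The local disk (marked in red in the original output) is the LSI Logical Volume under /pci@400/pci@1/pci@0/pci@4/scsi@0. This system uses hardware RAID.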
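Picking the local RAID volume out of a long probe-scsi-all listing by eye is error-prone, so here is a small Python sketch that does it on a captured copy of the output. It assumes the capture has one item per line, as in the excerpt below, and that a hardware RAID volume is identified by a VolumeDeviceName line; the function name find_raid_volumes is made up for the example.

```python
import re

# Excerpt of the probe-scsi-all output shown above, one item per line.
SAMPLE = """\
/pci@400/pci@2/pci@0/pci@4/scsi@0
FCode Version 1.00.61 , MPT Version 2.00, Firmware Version 9.00.00.00
Target 9
Unit 0 Removable Read Only device TEAC DV-W28SS-V 1.0B
SATA device PhyNum 6
/pci@400/pci@1/pci@0/pci@4/scsi@0
FCode Version 1.00.61 , MPT Version 2.00, Firmware Version 9.00.00.00
Target 381 Volume 0
Unit 0 Disk LSI Logical Volume 3000 583983104 Blocks, 298 GB
VolumeDeviceName 3e90f849ae6f04a5 VolumeWWID 0e90f849ae6f04a5
"""


def find_raid_volumes(probe_output):
    """Return (controller_path, VolumeDeviceName) pairs found in the text."""
    volumes = []
    controller = None
    for line in probe_output.splitlines():
        line = line.strip()
        if line.startswith("/"):
            controller = line  # a new controller section starts here
            continue
        match = re.search(r"VolumeDeviceName\s+([0-9a-fA-F]+)", line)
        if match and controller:
            volumes.append((controller, match.group(1)))
    return volumes


for controller, name in find_raid_volumes(SAMPLE):
    print(controller, name)
```

On the excerpt this reports only the /pci@400/pci@1/pci@0/pci@4/scsi@0 controller, which is the one selected in the next step.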
3. Check the hardware RAID information
{0} ok select /pci@400/pci@1/pci@0/pci@4/scsi@0
{0} ok show-volumes
Volume 0 Target 381 Type RAID1 (Mirroring)
Name ids1vol WWID 0e90f849ae6f04a5
Optimal Enabled Volume Not Consistent
2 Members 583983104 Blocks, 298 GB
Disk 1
Primary Optimal
Target 9 HITACHI H106030SDSUN300G A2B0 PhyNum 0
Disk 0
Secondary Optimal
Target a HITACHI H106030SDSUN300G A2B0 PhyNum 1
{0} ok unselect-dev
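4. Boot the OS from an explicit path
Judging from the error reported during auto-boot (ERROR: boot-read fail), the boot block was most likely not found at boot time. That does not necessarily mean the disk or the boot block is damaged; the firmware may simply have been looking in the wrong place. In general, a host power loss does not cause anything as serious as disk damage. So I specified the path by hand to boot the OS.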
{0} ok boot /pci@400/pci@1/pci@0/pci@4/scsi@0/disk@w3e90f849ae6f04a5,0:a
Boot device: /pci@400/pci@1/pci@0/pci@4/scsi@0/disk@w3e90f849ae6f04a5,0:a File and args:
SunOS Release 5.11 Version 11.0 64-bit
Copyright (c) 1983, 2011, Oracle and/or its affiliates. All rights reserved.
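.....
The system came up.
As for why this is the right boot path, I was not really sure; it was purely a guess. The path /pci@400/pci@1/pci@0/pci@4/scsi@0/disk@w3e90f849ae6f04a5,0:a breaks down roughly as follows:
/pci@400/pci@1/pci@0/pci@4/scsi@0 This part is easy to understand: it is the location of the disk controller in the system's PCI tree.
/disk@ This part is fixed.
w3e90f849ae6f04a5 This value is the key. It corresponds to the "Y" in the logical disk name cXtYdZ, and it may be related to the following: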
Volume 0 Target 381 Type RAID1 (Mirroring)
Name ids1vol WWID 0e90f849ae6f04a5
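The w is fixed, the 3 is presumably the "3" in "Target 381", and e90f849ae6f04a5 is the WWID value with its leading "0" dropped. I do not know why it works this way. When I searched online at the time, I happened to find someone whose system disk was also on hardware RAID and whose system disk name in the OS looked similar. I analyzed that disk name carefully and arrived at the guess above.
,0:a This part is also easy to understand: 0 means d0, i.e. disk 0; a means partition (slice) 0.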
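The breakdown above can be expressed as a short Python sketch that splits a boot path into the same pieces. The function name split_boot_path and the dictionary keys are labels chosen for this example, not OBP terminology taken from the source.

```python
def split_boot_path(path):
    """Split an OBP disk path into controller, target, LUN and partition."""
    controller, _, leaf = path.rpartition("/")    # leaf: disk@w...,0:a
    node, _, rest = leaf.partition("@")           # node: "disk"
    address, _, partition = rest.partition(":")   # partition: "a"
    target, _, lun = address.partition(",")       # target: "w...", lun: "0"
    return {
        "controller": controller,  # location of the controller in the PCI tree
        "node": node,              # the fixed "disk" part
        "target": target,          # the "Y" in cXtYdZ
        "lun": lun,                # the "Z" in cXtYdZ (d0)
        "partition": partition,    # "a" is the first slice
    }


parts = split_boot_path(
    "/pci@400/pci@1/pci@0/pci@4/scsi@0/disk@w3e90f849ae6f04a5,0:a"
)
for key, value in parts.items():
    print(f"{key:10} {value}")
```

Splitting off the last path component first keeps the commas inside controller names such as SUNW,qlc@0,1 from being mistaken for the target/LUN separator.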
5. Set a boot disk alias
Once the system is up, set an alias so that the path does not have to be typed by hand on the next reboot.
root@racnode1:~# eeprom "nvramrc=devalias bootdisk /pci@400/pci@1/pci@0/pci@4/scsi@0/disk@w3e90f849ae6f04a5,0:a"
root@racnode1:~# eeprom boot-device="bootdisk"
root@racnode1:~# eeprom "use-nvramrc?=true"
Note: the value w3e90f849ae6f04a5 is actually the VolumeDeviceName value from the probe-scsi-all output with a "w" in front of it:
Target 381 Volume 0
Unit 0 Disk LSI Logical Volume 3000 583983104 Blocks, 298 GB
VolumeDeviceName 3e90f849ae6f04a5 VolumeWWID 0e90f849ae6f04a5
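Putting the note together with step 5, the boot path and the three eeprom commands can be generated from the controller path and the VolumeDeviceName. The Python below is a minimal sketch under that assumption; boot_path and eeprom_commands are hypothetical helper names, and the LUN 0 and slice a defaults simply mirror the path used in this case. It only builds strings and does not run anything on the host.

```python
def boot_path(controller, volume_device_name, lun=0, slice_letter="a"):
    """Build the disk path: controller + /disk@w<VolumeDeviceName>,<lun>:<slice>."""
    return f"{controller}/disk@w{volume_device_name},{lun}:{slice_letter}"


def eeprom_commands(alias, path):
    """Return the eeprom commands from step 5 for the given alias and path."""
    return [
        f'eeprom "nvramrc=devalias {alias} {path}"',
        f'eeprom boot-device="{alias}"',
        'eeprom "use-nvramrc?=true"',
    ]


path = boot_path("/pci@400/pci@1/pci@0/pci@4/scsi@0", "3e90f849ae6f04a5")
for command in eeprom_commands("bootdisk", path):
    print(command)
```

For the values in this case the printed commands match the ones entered by hand in step 5.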