As an OpenStack cluster grows, memory and disk errors start showing up more often. Here is one example.
After rebooting a hypervisor (HV), the ops team hit Input/output errors:
root@production-m1:~# virsh
-su: /usr/bin/virsh: Input/output error
root@production-m1:~#
Step 1: check whether the libvirt processes are still running

ps aux | grep libvirt
Step 2: try restarting libvirtd
root@production-m1:~# service libvirtd restart
-su: /usr/sbin/service: Input/output error
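At this point even /usr/sbin/service fails with an I/O error, which points at the root filesystem rather than at libvirt itself. A minimal sketch of checks that would confirm this, assuming the paths above; these commands are illustrative and not taken from the incident log:

grep ' / ' /proc/mounts                  # has / been remounted read-only ("ro")?
dd if=/usr/bin/virsh of=/dev/null bs=1M  # reading the binary directly also fails if the backing device is gone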
Step 3: check the disk mounts
root@production-m1:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       9.4G  1.9G  7.2G  21% /
udev            126G  4.0K  126G   1% /dev
tmpfs            51G  288K   51G   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            126G     0  126G   0% /run/shm
cgroup          126G     0  126G   0% /sys/fs/cgroup
/dev/sda5       2.2T  289G  1.8T  14% /data
/dev/sda3       9.4G  1.2G  7.9G  13% /var
Nothing looks wrong here.
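One caveat: df can answer from the in-memory superblock, so normal-looking numbers do not prove the underlying block device is still reachable. A few extra checks worth sketching here (the device name /dev/sda is an assumption based on the mounts above, not something from the incident log):

lsblk -o NAME,SIZE,STATE,MOUNTPOINT   # the kernel's view of each block device
cat /sys/block/sda/device/state       # "running" vs "offline" for the SCSI device
touch /io-test && rm /io-test         # a real write to / fails immediately if the device is offline or read-only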
Step 4: check the system logs
root@production-m1:~# dmesg | tail
[ 1905.986224] sd 6:0:0:0: rejecting I/O to offline device
[ 1905.986371] sd 6:0:0:0: rejecting I/O to offline device
[ 1905.989334] sd 6:0:0:0: rejecting I/O to offline device
[ 1905.989481] sd 6:0:0:0: rejecting I/O to offline device
[ 1905.992597] sd 6:0:0:0: rejecting I/O to offline device
[ 1905.992745] sd 6:0:0:0: rejecting I/O to offline device
[ 1905.995688] sd 6:0:0:0: rejecting I/O to offline device
[ 1905.995836] sd 6:0:0:0: rejecting I/O to offline device
[ 1905.998728] sd 6:0:0:0: rejecting I/O to offline device
[ 1905.998893] sd 6:0:0:0: rejecting I/O to offline device
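This message means the kernel has marked the SCSI device sd 6:0:0:0 (which appears to back /dev/sda here) as offline, so every I/O against it is rejected. A commonly used way to inspect, and sometimes recover, such a device without rebooting is through sysfs. This is a general sketch rather than a step taken in this incident, and forcing a genuinely failing disk back online can make matters worse:

cat /sys/block/sda/device/state                 # should show "offline"
echo running > /sys/block/sda/device/state      # ask the kernel to bring the device back online
echo "- - -" > /sys/class/scsi_host/host6/scan  # rescan SCSI host 6 (the "6" in sd 6:0:0:0)
dmesg | tail                                    # verify whether the rejection messages stop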
The catch: this HV was still hosting 12 VMs. The steps below record how the ops team went about repairing the disk.
The ops team found that / was already read-only and suggested booting into single-user mode and running fsck on /. That suggestion was rejected, on the grounds that:
When a disk may already be failing, it is not advisable to run fsck straight away. Production machines usually sit behind a disk array, so check the array's health first; this can be done through the ILOM interface.
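On HP hardware the same information is also available from inside the OS via hpacucli. The commands below are a sketch; the controller slot number is an assumption, and this is not necessarily the exact query the team ran:

hpacucli ctrl all show status             # controller, cache and battery status
hpacucli ctrl slot=0 pd all show status   # status of every physical drive behind the controller

In this case all four physical drives reported OK: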
physicaldrive 1I:0:1 (port 1I:box 0:bay 1, 1200.2 GB): OK
physicaldrive 1I:0:2 (port 1I:box 0:bay 2, 1200.2 GB): OK
physicaldrive 1I:0:3 (port 1I:box 0:bay 3, 1200.2 GB): OK
physicaldrive 1I:0:4 (port 1I:box 0:bay 4, 1200.2 GB): OK
The disk array seems to be healthy, but:
root@production-m1:~# hpacucli
HP Array Configuration Utility CLI 9.20.9.0
Bus error
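The Bus error is itself a clue: a SIGBUS at this point often means the process touched a memory-mapped page (for example part of its own binary or a library) that can no longer be read from the now-offline device. A hypothetical way to check which filesystem the tool is loaded from:

which hpacucli             # e.g. /usr/sbin/hpacucli
df -h "$(which hpacucli)"  # shows which (possibly offline) filesystem backs the binary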
Then came the frustrating part: the ops engineer simply rebooted the machine once more, and the problem was gone. Quoting the colleague verbatim:
I didn't change anything but after rebooting out of the array software the disks mounted and the server looks good