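Symptom:

    In one cluster, one machine shut down every day, apparently at random (per the logs below, a virtual machine instance on a compute node). The problem was tracked down and resolved; the notes are shared here in case they help others.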

    

Environment:

        Cluster: OpenStack + Ceph converged cluster, versions: Mitaka + Jewel

        Network: 10G NICs + bond0 (active-backup mode)

        OS version: CentOS 7.3


Error log (messages):

Aug 30 16:45:14 lxx-4-5 journal: internal error: End of file from monitor
Aug 30 16:45:14 lxx-4-5 avahi-daemon[2412]: Withdrawing address record for fe80::fc16:3eff:fef3:5076 on vnet5.
Aug 30 16:45:14 lxx-4-5 kernel: vlan206: port 3(vnet5) entered disabled state
Aug 30 16:45:14 lxx-4-5 kvm: 10 guests now active
Aug 30 16:45:14 lxx-4-5 avahi-daemon[2412]: Withdrawing workstation service for vnet5.
Aug 30 16:45:14 lxx-4-5 kernel: device vnet5 left promiscuous mode
Aug 30 16:45:14 lxx-4-5 kernel: vlan206: port 3(vnet5) entered disabled state
Aug 30 16:45:14 lxx-4-5 systemd: autolog.service holdoff time over, scheduling restart.
Aug 30 16:45:14 lxx-4-5 systemd: Started Autolog.
Aug 30 16:45:14 lxx-4-5 systemd: Starting Autolog...
Aug 30 16:45:14 lxx-4-5 systemd-machined: Machine qemu-22-instance-000002c9 terminated.
Aug 30 16:45:14 lxx-4-5 autolog: Don't have master process.
Aug 30 16:45:15 l22-4-5 journal: End of file while reading data: Input/output error
Aug 30 16:45:15 lxx-4-5 systemd: autolog.service holdoff time over, scheduling restart.
Aug 30 16:45:15 lxx-4-5 systemd: Started Autolog.
Aug 30 16:45:15 lxx-4-5 systemd: Starting Autolog...
Aug 30 16:45:15 lxx-4-5 autolog: Don't have master process.
Aug 30 16:45:15 lxx-4-5 systemd: autolog.service holdoff time over, scheduling restart.
Aug 30 16:45:15 lxx-4-5 systemd: Started Autolog.
Aug 30 16:45:15 lxx-4-5 systemd: Starting Autolog...


Key openstack-compute log entries:

2017-08-30 16:45:20.952 110867 DEBUG nova.compute.manager [req-0602316d-944c-42b4-9d3c-7d1b0e513765 - - - - -] [instance: 26f48b2e-f648-42e2-8133-7ebc060fd7ae] Updated the network info_cache for instance _heal_instance_info_cache /usr/lib/python2.7/site-packages/nova/compute/manager.py:5803
2017-08-30 16:45:30.033 110867 DEBUG nova.virt.driver [-] Emitting event  Stopped> emit_event /usr/lib/python2.7/site-packages/nova/virt/driver.py:1443
2017-08-30 16:45:30.034 110867 INFO nova.compute.manager [-] [instance: 08330b10-f106-4737-b9db-0e45c84abb2e] VM Stopped (Lifecycle Event)
2017-08-30 16:45:30.076 110867 DEBUG nova.compute.manager [req-5998b542-495c-41f2-8010-7f1c426f0127 - - - - -] [instance: 08330b10-f106-4737-b9db-0e45c84abb2e] Checking state _get_power_state /usr/lib/python2.7/site-packages/nova/compute/manager.py:1347
2017-08-30 16:45:30.079 110867 DEBUG nova.compute.manager [req-5998b542-495c-41f2-8010-7f1c426f0127 - - - - -] [instance: 08330b10-f106-4737-b9db-0e45c84abb2e] Synchronizing instance power state after lifecycle event "Stopped"; current vm_state: active, current task_state: None, current DB power_state: 1, VM power_state: 4 handle_lifecycle_event /usr/lib/python2.7/site-packages/nova/compute/manager.py:1276
2017-08-30 16:45:30.119 110867 INFO nova.compute.manager [req-5998b542-495c-41f2-8010-7f1c426f0127 - - - - -] [instance: 08330b10-f106-4737-b9db-0e45c84abb2e] During _sync_instance_power_state the DB power_state (1) does not match the vm_power_state from the hypervisor (4). Updating power_state in the DB to match the hypervisor.
2017-08-30 16:45:30.177 110867 WARNING nova.compute.manager [req-5998b542-495c-41f2-8010-7f1c426f0127 - - - - -] [instance: 08330b10-f106-4737-b9db-0e45c84abb2e] Instance shutdown by itself. Calling the stop API. Current vm_state: active, current task_state: None, original DB power_state: 1, current VM power_state: 4
2017-08-30 16:45:30.178 110867 DEBUG nova.compute.api [req-5998b542-495c-41f2-8010-7f1c426f0127 - - - - -] [instance: 08330b10-f106-4737-b9db-0e45c84abb2e] Going to try to stop instance force_stop /usr/lib/python2.7/site-packages/nova/compute/api.py:1954
2017-08-30 16:45:30.267 110867 DEBUG oslo_concurrency.lockutils [req-5998b542-495c-41f2-8010-7f1c426f0127 - - - - -] Lock "08330b10-f106-4737-b9db-0e45c84abb2e" acquired by "nova.compute.manager.do_stop_instance" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:270
2017-08-30 16:45:30.268 110867 DEBUG nova.compute.manager [req-5998b542-495c-41f2-8010-7f1c426f0127 - - - - -] [instance: 08330b10-f106-4737-b9db-0e45c84abb2e] Checking state _get_power_state /usr/lib/python2.7/site-packages/nova/compute/manager.py:1347
2017-08-30 16:45:30.270 110867 DEBUG nova.compute.manager [req-5998b542-495c-41f2-8010-7f1c426f0127 - - - - -] [instance: 08330b10-f106-4737-b9db-0e45c84abb2e] Stopping instance; current vm_state: active, current task_state: powering-off, current DB power_state: 4, current VM power_state: 4 do_stop_instance /usr/lib/python2.7/site-packages/nova/compute/manager.py:2545
2017-08-30 16:45:30.270 110867 INFO nova.compute.manager [req-5998b542-495c-41f2-8010-7f1c426f0127 - - - - -] [instance: 08330b10-f106-4737-b9db-0e45c84abb2e] Instance is already powered off in the hypervisor when stop is called.
2017-08-30 16:45:30.271 110867 DEBUG nova.objects.instance [req-5998b542-495c-41f2-8010-7f1c426f0127 - - - - -] Lazy-loading 'metadata' on Instance uuid 08330b10-f106-4737-b9db-0e45c84abb2e obj_load_attr /usr/lib/python2.7/site-packages/nova/objects/instance.py:895
2017-08-30 16:45:30.314 110867 INFO nova.virt.libvirt.driver [req-5998b542-495c-41f2-8010-7f1c426f0127 - - - - -] [instance: 08330b10-f106-4737-b9db-0e45c84abb2e] Instance already shutdown.
2017-08-30 16:45:30.318 110867 INFO nova.virt.libvirt.driver [-] [instance: 08330b10-f106-4737-b9db-0e45c84abb2e] Instance destroyed successfully.
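
The numeric power_state values in the entries above are easier to follow once decoded. The helper below is a small illustrative sketch, not part of nova: the function name `power_state_name` is made up here, and the code-to-name mapping is the one nova defines in nova/compute/power_state.py.

```shell
# Map a nova power_state code to its name.
# power_state_name is a hypothetical helper for reading the log; the values
# follow nova/compute/power_state.py.
power_state_name() {
    case "$1" in
        0) echo "NOSTATE" ;;
        1) echo "RUNNING" ;;
        3) echo "PAUSED" ;;
        4) echo "SHUTDOWN" ;;
        6) echo "CRASHED" ;;
        7) echo "SUSPENDED" ;;
        *) echo "UNKNOWN($1)" ;;
    esac
}

# Decode the two codes from the "Synchronizing instance power state" entry.
echo "DB: $(power_state_name 1), hypervisor: $(power_state_name 4)"
```

Read this way, the database still recorded the instance as running (1) while the hypervisor already reported it as shut down (4), which is why nova logged "Instance shutdown by itself" and called the stop API.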


Key log lines:

message : Aug 30 16:45:15 l22-4-5 journal: End of file while reading data: Input/output error

Openstack-compute: 2017-08-30 16:45:30.034 110867 INFO nova.compute.manager [-] [instance: 08330b10-f106-4737-b9db-0e45c84abb2e] VM Stopped (Lifecycle Event)
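
To check whether another compute node shows the same pattern, the two key lines can be searched for directly. This is a minimal sketch: the function name `find_vm_stop_events` is made up for illustration, and the default paths /var/log/messages and /var/log/nova/nova-compute.log are assumptions that may differ on your hosts.

```shell
# Print host-side and nova-side log lines that mark an unexpected guest stop.
# $1: syslog file         (assumed default: /var/log/messages)
# $2: nova-compute log    (assumed default: /var/log/nova/nova-compute.log)
find_vm_stop_events() {
    messages_log="${1:-/var/log/messages}"
    nova_log="${2:-/var/log/nova/nova-compute.log}"

    # libvirt lost its connection to the qemu monitor
    grep -E 'End of file (from monitor|while reading data)' "$messages_log"

    # nova noticed that the guest stopped without an API request
    grep -E 'VM Stopped \(Lifecycle Event\)|Instance shutdown by itself' "$nova_log"
}

# Usage on a compute node:
#   find_vm_stop_events
#   find_vm_stop_events /var/log/messages-20170830 /var/log/nova/nova-compute.log
```

Matching timestamps between the two groups of lines (here 16:45:14-15 on the host and 16:45:30 in nova) suggests the guest process died on the host first and nova only synchronized its state afterwards.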


Solution:

Upgrade the libvirt version:
libvirt-daemon-driver-secret-2.0.0-10.el7_3.9.x86_64
libvirt-daemon-lxc-2.0.0-10.el7_3.9.x86_64
libvirt-daemon-driver-lxc-2.0.0-10.el7_3.9.x86_64
libvirt-python-2.0.0-2.el7.x86_64
libvirt-daemon-2.0.0-10.el7_3.9.x86_64
libvirt-lock-sanlock-2.0.0-10.el7_3.9.x86_64
libvirt-daemon-driver-storage-2.0.0-10.el7_3.9.x86_64
libvirt-gobject-0.2.3-1.el7.x86_64
libvirt-nss-2.0.0-10.el7_3.9.x86_64
libvirt-daemon-driver-nwfilter-2.0.0-10.el7_3.9.x86_64
libvirt-gconfig-0.2.3-1.el7.x86_64
libvirt-snmp-0.0.3-5.el7.x86_64
libvirt-daemon-driver-nodedev-2.0.0-10.el7_3.9.x86_64
libvirt-glib-devel-0.2.3-1.el7.x86_64
libvirt-gobject-devel-0.2.3-1.el7.x86_64
libvirt-java-javadoc-0.4.9-4.el7.noarch
libvirt-daemon-driver-qemu-2.0.0-10.el7_3.9.x86_64
libvirt-daemon-kvm-2.0.0-10.el7_3.9.x86_64
libvirt-gconfig-devel-0.2.3-1.el7.x86_64
libvirt-login-shell-2.0.0-10.el7_3.9.x86_64
libvirt-client-2.0.0-10.el7_3.9.x86_64
libvirt-daemon-driver-interface-2.0.0-10.el7_3.9.x86_64
libvirt-devel-2.0.0-10.el7_3.9.x86_64
libvirt-cim-0.6.3-19.el7.x86_64
libvirt-glib-0.2.3-1.el7.x86_64
libvirt-java-devel-0.4.9-4.el7.noarch
libvirt-daemon-driver-network-2.0.0-10.el7_3.9.x86_64
libvirt-docs-2.0.0-10.el7_3.9.x86_64
libvirt-daemon-config-nwfilter-2.0.0-10.el7_3.9.x86_64
libvirt-2.0.0-10.el7_3.9.x86_64
libvirt-daemon-config-network-2.0.0-10.el7_3.9.x86_64
libvirt-java-0.4.9-4.el7.noarch
Upgrade the qemu version:
qemu-system-lm32-2.0.0-1.el7.6.x86_64
ipxe-roms-qemu-20160127-5.git6366fa7a.el7.noarch
qemu-system-cris-2.0.0-1.el7.6.x86_64
qemu-system-x86-2.0.0-1.el7.6.x86_64
qemu-kvm-tools-1.5.3-126.el7_3.10.x86_64
qemu-system-xtensa-2.0.0-1.el7.6.x86_64
qemu-system-arm-2.0.0-1.el7.6.x86_64
qemu-system-s390x-2.0.0-1.el7.6.x86_64
qemu-system-sh4-2.0.0-1.el7.6.x86_64
qemu-kvm-common-1.5.3-126.el7_3.10.x86_64
qemu-user-2.0.0-1.el7.6.x86_64
qemu-system-unicore32-2.0.0-1.el7.6.x86_64
libvirt-daemon-driver-qemu-2.0.0-10.el7_3.9.x86_64
qemu-guest-agent-2.5.0-3.el7.x86_64
qemu-common-2.0.0-1.el7.6.x86_64
qemu-system-or32-2.0.0-1.el7.6.x86_64
qemu-kvm-1.5.3-126.el7_3.10.x86_64
qemu-system-moxie-2.0.0-1.el7.6.x86_64
qemu-img-1.5.3-126.el7_3.10.x86_64
qemu-system-m68k-2.0.0-1.el7.6.x86_64
qemu-system-alpha-2.0.0-1.el7.6.x86_64
qemu-system-microblaze-2.0.0-1.el7.6.x86_64
qemu-system-mips-2.0.0-1.el7.6.x86_64
qemu-2.0.0-1.el7.6.x86_64
Upgrade the kernel:
[root@~]# rpm -qa|grep kernel
kernel-3.10.0-514.26.2.el7.x86_64
kernel-tools-libs-3.10.0-514.26.2.el7.x86_64
kernel-devel-3.10.0-327.36.3.el7.x86_64
kernel-tools-3.10.0-514.26.2.el7.x86_64
kernel-devel-3.10.0-123.el7.x86_64
kernel-3.10.0-327.36.3.el7.x86_64
abrt-addon-kerneloops-2.1.11-45.el7.centos.x86_64
kernel-3.10.0-514.2.2.el7.x86_64
kernel-3.10.0-123.el7.x86_64
kernel-3.10.0-327.22.2.el7.x86_64
kernel-devel-3.10.0-327.22.2.el7.x86_64
kernel-devel-3.10.0-514.26.2.el7.x86_64
kernel-devel-3.10.0-514.2.2.el7.x86_64
kernel-headers-3.10.0-514.26.2.el7.x86_64
[root@~]# uname  -r
3.10.0-514.26.2.el7.x86_64


Note: after upgrading, you must reboot the machine for the fix to take effect. Restarting the services alone does not work!!!
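
One way to confirm that a host has actually been rebooted into the upgraded kernel is to compare the running kernel with the newest installed one. The sketch below makes some assumptions: the helper names `newest_kernel` and `reboot_pending` are made up for illustration, and `sort -V` requires GNU coreutils. It checks the kernel only and says nothing about libvirt or qemu.

```shell
# Read package names such as "kernel-3.10.0-514.26.2.el7.x86_64" on stdin and
# print the highest version-release string. Lines like kernel-devel-* or
# kernel-tools-* are ignored because the pattern requires a digit right
# after "kernel-".
newest_kernel() {
    sed -n 's/^kernel-\([0-9].*\)$/\1/p' | sort -V | tail -n 1
}

# Succeed (exit 0) when the running kernel differs from the newest installed one.
# $1: running kernel (output of uname -r), $2: newest installed kernel
reboot_pending() {
    [ "$1" != "$2" ]
}

# Usage on the host:
#   newest=$(rpm -q kernel | newest_kernel)
#   if reboot_pending "$(uname -r)" "$newest"; then
#       echo "reboot required: running $(uname -r), newest installed $newest"
#   fi
```

With the package list shown above, the newest installed kernel is 3.10.0-514.26.2.el7.x86_64, which matches the `uname -r` output, so that host had already been rebooted into it.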