A newly deployed production database on Red Hat RHEL 7.2 kept crashing. The alert log showed the following errors:
ORA-27300: OS system dependent operation:semctl failed with status: 22
ORA-27301: OS failure message: Invalid argument
ORA-27302: failure occurred at: sskgpwrm1
ORA-27157: OS post/wait facility removed
ORA-27300: OS system dependent operation:semop failed with status: 36
ORA-27301: OS failure message: Identifier removed
ORA-27302: failure occurred at: sskgpwwait1
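Searching MOS turned up this document:
ORA-27300 ORA-27301 ORA-27302 ORA-27157 Database Crash (Doc ID 438205.1)
Its description matched my situation closely, including the operating system version. The cause it gives: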
The errors are signalling that something happened at the OS level with shared memory and/or semaphores. The semaphore sets could be removed manually, or they could be dying for some reason due to a hardware error.
Either when remounting the /dev/shm or You may want to check for any possibility of a user dba using the "ipcrm" command to kill the semaphores (accidentally) since the error ora-27301 (OS failure message: Identifier removed) suggests that. Also, it could have been a bad memory stick or something else at the OS level. Someone could also have removed the shared memory segments at the OS level for some specific reason, or by accident. Most likely something had removed the shared memory and semaphore sets in use by 'oracle'. This can only be done by a root-level user or 'oracle' itself who owns the resources. If someone logged in as root and removed all IPC resources, Oracle would crash when it lost the allocated shared memory/semaphores.
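This could be due to some outside user or application removing the semaphores/shared memory.
In other words, a person or an application removed the semaphores or shared memory, for example with the ipcrm command, and that brought Oracle down.
The developers will need to check whether anything at the application level does this.
The workaround Oracle provides: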
1) Set RemoveIPC=no in /etc/systemd/logind.conf if it is not in that file
2) Reboot the server or restart systemd-logind as follows:
# systemctl daemon-reload
# systemctl restart systemd-logind
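Before restarting anything, it can help to confirm what the file actually says. The following is a minimal sketch, not part of the MOS note: the function name removeipc_setting is my own, and it only reads the single file it is given, so it ignores any drop-in files and cannot tell you the compiled-in default that applies when the option is unset.

```shell
#!/bin/sh
# Print the RemoveIPC value found in a logind.conf-style file.
# Prints "unset" when the file has no uncommented RemoveIPC= line
# (or when the file cannot be read).
# Assumption: only this one file is inspected; drop-in directories are ignored.
removeipc_setting() {
    conf="${1:-/etc/systemd/logind.conf}"
    # Keep the value of the last uncommented RemoveIPC= line, if any.
    value=$(sed -n 's/^[[:space:]]*RemoveIPC[[:space:]]*=[[:space:]]*\([^[:space:]]*\).*$/\1/p' "$conf" 2>/dev/null | tail -n 1)
    if [ -n "$value" ]; then
        printf '%s\n' "$value"
    else
        printf 'unset\n'
    fi
}
```

Running `removeipc_setting /etc/systemd/logind.conf` should print `no` once step 1 has been applied. If it prints `unset`, the line is missing or still commented out.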
------------------------------------------------------------------------
It's 1:40 a.m. and I'm too sleepy to dig any further, so I'll just note this down for now and come back to it when the mood strikes.
--------------------------------------------Follow-up--------------------------------------------
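On the last day of April, the developers told me the problem was still not solved. I checked the alert log and the errors were the same as before.
I confirmed that the systemd-logind service was running normally. In other words, I had followed the MOS steps, yet the problem persisted.
I considered tracing ipcrm usage as MOS suggests, but that felt like a long shot, so I went through the alert log again more carefully.
That got my attention: every error occurred at 00:00, 06:00, 12:00, or 18:00.
Wait, those times looked familiar. Aren't those exactly when our clock synchronization runs?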
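I spotted the pattern by eye, but the same check can be scripted. The sketch below is an illustration under assumptions: the function name crashes_by_hour is made up, and it assumes the older alert log layout in which each entry is preceded by a timestamp line such as "Sat Apr 30 00:00:03 2016". Newer releases write timestamps in a different format, so the pattern would need adjusting there.

```shell
#!/bin/sh
# Count ORA-27157 occurrences per hour of day in an alert log.
# Assumption: timestamp lines look like "Sat Apr 30 00:00:03 2016".
crashes_by_hour() {
    awk '
        # Remember the hour (HH) of the most recent timestamp line.
        /^[A-Z][a-z][a-z] [A-Z][a-z][a-z] [ 0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] [0-9][0-9][0-9][0-9]/ {
            hour = substr($4, 1, 2)
            next
        }
        # ORA-27157 appears once per incident in the errors shown above.
        /ORA-27157/ && hour != "" { count[hour]++ }
        END { for (h in count) print h, count[h] }
    ' "$1" | sort
}
```

Called with the path to the alert log, it prints one line per hour that saw a crash, followed by the number of crashes in that hour. If only 00, 06, 12, and 18 show up, something scheduled is the likely suspect.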
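And there was the problem. We normally set up scheduled jobs under the root user, but on this server some colleague had carelessly set one up under the oracle user.
I tested it right away, and sure enough, as soon as the clock synchronization job under the oracle user ran, Oracle was game over.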
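A check along these lines could catch the same mistake on other servers. It is only a sketch: the function name find_time_sync_jobs is my own, and the list of command names (ntpdate, rdate, chronyc, hwclock) is a guess at common time-sync tools, not necessarily what was used on this server.

```shell
#!/bin/sh
# Filter crontab text on stdin down to uncommented entries that
# mention a time-sync command.
# Assumption: the command names below are illustrative, not exhaustive.
find_time_sync_jobs() {
    grep -v '^[[:space:]]*#' | grep -E 'ntpdate|rdate|chronyc|hwclock'
}

# Typical use, run as root:
#   crontab -l -u oracle | find_time_sync_jobs
```

Any line it prints for the oracle user is worth moving to root's crontab, which is where we normally keep these jobs.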
The truth was finally out, and it felt worth raising a glass to!