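In a test environment running an 11.2.0.4 RAC cluster, one of the nodes went down. After checking the logs, the following messages were found in the octssd log: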
2016-01-17 23:15:20.564: [ CTSS][1175029504]ctsscomm_recv_cb2: Receive incoming message event. Msgtype [3].
2016-01-17 23:15:20.564: [ CTSS][1175029504]ctsscomm_recv_cb4_2: Receive active version change msg. Old active version [186647552] New active version [186647552].
2016-01-17 23:15:20.564: [ CTSS][1175029504]ctsscomm_recv_cb2: Receive incoming message event. Msgtype [2].
2016-01-17 23:15:20.564: [ CTSS][1175029504]ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state[3].
2016-01-17 23:15:20.564: [ CTSS][1168725760]ctssslave_swm2_3: Received time sync message from master.
2016-01-17 23:15:20.565: [ CTSS][1168725760]ctssslave_swm: sendtime{sec[1453043718], usec[550689]}, receivetime{sec[1453043720], usec[564960]}.
2016-01-17 23:15:20.565: [ CTSS][1168725760]ctssslave_swm: The RTT of sync msg [2014271] is too large for time sync to be accurate. Recommends retry. Returns [17].
2016-01-17 23:15:20.565: [ CTSS][1168725760]ctssslave_swm: Received from master (mode [0x8c] nodenum [1] hostname [jason1] )
2016-01-17 23:15:20.565: [ CTSS][1168725760]ctsselect_monitor_steysync_mode: Failed in clsctssslave_sync_with_master [17]. Retries [0/3].
2016-01-17 23:15:20.565: [ CTSS][1168725760]ctssslave_swm1_1: Waiting for last time sync process to finish. sync_state[6].
2016-01-17 23:15:20.565: [ CTSS][1175029504]ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
2016-01-17 23:15:20.565: [ CTSS][1168725760]ctssslave_swm1_2: Ready to initiate new time sync process.
2016-01-17 23:15:20.565: [ CTSS][1168725760]ctssslave_swm2_1: Waiting for time sync message from master. sync_state[2].
2016-01-17 23:15:20.566: [ CTSS][1175029504]ctsscomm_recv_cb2: Receive incoming message event. Msgtype [2].
2016-01-17 23:15:20.566: [ CTSS][1175029504]ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state[3].
2016-01-17 23:15:20.566: [ CTSS][1168725760]ctssslave_swm2_3: Received time sync message from master.
2016-01-17 23:15:20.566: [ CTSS][1168725760]ctssslave_swm: The magnitude [733548803120 usec] of the offset [733548803120 usec] is larger than [86400000000 usec] sec which is the CTSS limit.
2016-01-17 23:15:20.566: [ CTSS][1168725760]ctsselect_monitor_steysync_mode: Failed in clsctssslave_sync_with_master [12]: Time offset is too much to be corrected
2016-01-17 23:15:20.566: [ CTSS][1175029504]ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
2016-01-17 23:15:21.287: [ CTSS][1190360832]ctss_checkcb: clsdm requested check alive. checkcb_data{mode[0xd0], offset[733548803 ms]}, length=[8].
2016-01-17 23:15:21.287: [ CTSS][1168725760]ctsselect_monitor_steysync_mode: CTSS daemon exiting [12].
2016-01-17 23:15:21.287: [ CTSS][1168725760]CTSS daemon aborting
2016-01-17 23:15:22.290: [ CTSS][1190360832]ctss_checkcb: clsdm requested check alive. checkcb_data{mode[0xd0], offset[733548803 ms]}, length=[8].
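The numbers in this log can be checked by hand. The following is a minimal Python sketch, not part of any Oracle tooling: the helper names `parse_offset_usec` and `rtt_usec` are hypothetical, and the regular expression assumes the message wording shown above. It converts the reported offset into days, compares it with the limit printed in the log, and shows that the difference between `sendtime` and `receivetime` equals the RTT value the log reports.

```python
import re

# Limit printed in the log line itself: 86400000000 usec = 24 hours.
CTSS_LIMIT_USEC = 86_400_000_000

# Message text copied from the octssd log above.
LOG_LINE = (
    "2016-01-17 23:15:20.566: [ CTSS][1168725760]ctssslave_swm: "
    "The magnitude [733548803120 usec] of the offset [733548803120 usec] "
    "is larger than [86400000000 usec] sec which is the CTSS limit"
)


def parse_offset_usec(line):
    """Return the offset in microseconds from the message, or None if absent.

    Hypothetical helper: the pattern only matches the wording seen in this log.
    """
    match = re.search(r"of the offset \[(-?\d+) usec\]", line)
    return int(match.group(1)) if match else None


def rtt_usec(send_sec, send_usec, recv_sec, recv_usec):
    """Elapsed microseconds between a send timestamp and a receive timestamp."""
    return (recv_sec - send_sec) * 1_000_000 + (recv_usec - send_usec)


offset = parse_offset_usec(LOG_LINE)
offset_days = offset / CTSS_LIMIT_USEC  # the limit is exactly one day

print(f"offset = {offset} usec = {offset_days:.2f} days")
print("exceeds CTSS limit:", abs(offset) > CTSS_LIMIT_USEC)

# sendtime / receivetime values taken from the ctssslave_swm line in the log.
print("RTT (usec):", rtt_usec(1453043718, 550689, 1453043720, 564960))
```

The offset works out to roughly 8.49 days, well beyond the one-day limit, which matches the "Time offset is too much to be corrected" failure that precedes the daemon exit.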
The system time on the two servers was as follows:
jason1:~ # date
Sat Jan 9 11:37:18 CST 2016
jason2:~ # date
Sun Jan 17 23:23:12 CST 2016
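The gap between the two `date` outputs can be worked out with a short Python sketch. The helper `parse_date_output` is hypothetical and assumes the exact six-field layout shown above; the time zone field is ignored because both servers report CST. Since the two commands were run at slightly different moments, the result is only approximate.

```python
from datetime import datetime


def parse_date_output(text):
    """Parse `date` output such as 'Sat Jan 9 11:37:18 CST 2016'.

    Hypothetical helper: assumes six whitespace-separated fields and
    English month abbreviations; the zone field is dropped.
    """
    _weekday, month, day, clock, _zone, year = text.split()
    return datetime.strptime(f"{month} {day} {clock} {year}", "%b %d %H:%M:%S %Y")


jason1 = parse_date_output("Sat Jan 9 11:37:18 CST 2016")
jason2 = parse_date_output("Sun Jan 17 23:23:12 CST 2016")

gap = jason2 - jason1
print("clock gap:", gap)
print("larger than one day:", gap.total_seconds() > 24 * 3600)
```

The gap comes to about 733554 seconds, a few seconds away from the 733548803 ms offset that CTSS reported, which is consistent with the two commands not being run at the same instant.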
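The clocks on the two servers differ by about 8 days, while the largest offset the Oracle time synchronization service (CTSS) will correct is 1 day (86400000000 usec, as the log shows). A difference of 8 days is far beyond that limit, so the CTSS daemon on one node aborted and the node was evicted from the cluster. For the same reason, the node reported errors when it tried to rejoin the cluster after restarting. Bringing the two servers' clocks into agreement resolves the node-down problem: first shut down the cluster, then set both nodes to the same, current time, and finally start the cluster again or reboot both servers.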
Reference: http://blog.itpub.net/4227/viewspace-695164/