Date | 2008-12-06 09:42:43 |
Component | CRS |
Title | What can cause a node eviction? |
Version | 10.1.0 - 11.1.0.7 |
Problem | Node evictions can occur in a cluster environment; the main question is why the eviction occurred. This note tries to make that investigation easier. |
Solution | There are four possible causes of a node eviction.
The title asks for a cause, but a node eviction is a symptom of another problem, not the cause itself. Always keep this in mind when investigating why a node eviction occurred.
Kernel hang: how this is detected depends on the operating system used. On Windows or Linux it is based on the hangcheck timer; on other Unix environments OPROCD is started. From Oracle 10.2.0.4 onward OPROCD is also active on Linux (still install the hangcheck timer). To check whether the hangcheck timer or OPROCD caused the node eviction, look in the OS log files for hangcheck timer messages; for OPROCD, check the OPROCD log file.
Another possible node eviction can be triggered by OCLSMON, starting with the 10.2.0.3 patchset. This Clusterware process checks whether there is an issue with CSSD. If there is, it kills the CSSD daemon, which leads to the eviction. When this occurs, check the oclsmon log file and contact Oracle Support.
This note does not focus on those causes, but on lost heartbeats. Below are two examples of a lost-heartbeat symptom. The OCSSD background process takes care of the heartbeats, and the cssd.log file contains detailed information about the node eviction. After an eviction, check the cssd.log files on all nodes in your cluster environment, starting with the evicted node. The information that is logged can change between patchsets and Oracle releases.
Node eviction due to a lost interconnect (symptom).
Oracle 11g
Oracle 10g
[ CSSD]2006-10-18 23:49:06.226 [1] >USER: NMEVENT_SUSPEND [00][00][00][06]
[ CSSD]2006-10-18 23:49:08.032 [1030] >TRACE: clssnmReadDskHeartbeat: node(2) is down. rcfg(23) wrtcnt(634354) LATS(2345205587) Disk lastSeqNo(634354)
[ CSSD]2006-10-18 23:49:09.199 [3600] >TRACE: clssnmCheckDskInfo: node(2) timeout(1167) state_network(0) state_disk(3) missCount(33)
[ CSSD]2006-10-18 23:49:10.199 [3600] >TRACE: clssnmCheckDskInfo: node(2) timeout(2167) state_network(0) state_disk(3) missCount(33)
…….
Here we see that the diskkillcheck is reported by node 1 and that this node is evicted. Because the interconnect is lost, the diskkillcheck is done using a poison packet sent through the voting disk.
Possible action: check the availability of the network adapters, look for heavy network load or port scans, and check the OS log files for reported errors related to the interconnect.
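When a cssd.log is long, it can help to pull out only the clssnmCheckDskInfo entries and watch how the timeout value develops. The following is a minimal sketch, assuming the 10g line layout shown in the example above; the function name `parse_check_dsk_info` and the dictionary keys are illustrative choices, not part of any Oracle tool, and the layout can differ between patchsets and releases.

```python
import re

# Pattern for the 10g clssnmCheckDskInfo trace lines shown in this note.
# Assumption: the real log uses the same field order; adjust for other releases.
CHECK_DSK_INFO = re.compile(
    r"\[\s*CSSD\](?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})"
    r"\s+\[\d+\]\s+>TRACE:\s+clssnmCheckDskInfo:"
    r"\s+node\((?P<node>\d+)\)"
    r"\s+timeout\((?P<timeout>\d+)\)"
    r"\s+state_network\((?P<state_network>\d+)\)"
    r"\s+state_disk\((?P<state_disk>\d+)\)"
    r"\s+missCount\((?P<miss_count>\d+)\)"
)


def parse_check_dsk_info(lines):
    """Return one dict per clssnmCheckDskInfo line; other lines are skipped."""
    entries = []
    for line in lines:
        match = CHECK_DSK_INFO.search(line)
        if match is None:
            continue
        entry = {"ts": match.group("ts")}
        for key in ("node", "timeout", "state_network", "state_disk", "miss_count"):
            entry[key] = int(match.group(key))
        entries.append(entry)
    return entries


# Sample lines copied from the 10g example in this note.
SAMPLE = [
    "[ CSSD]2006-10-18 23:49:06.226 [1] >USER: NMEVENT_SUSPEND [00][00][00][06]",
    "[ CSSD]2006-10-18 23:49:09.199 [3600] >TRACE: clssnmCheckDskInfo: node(2) "
    "timeout(1167) state_network(0) state_disk(3) missCount(33)",
    "[ CSSD]2006-10-18 23:49:10.199 [3600] >TRACE: clssnmCheckDskInfo: node(2) "
    "timeout(2167) state_network(0) state_disk(3) missCount(33)",
]

for entry in parse_check_dsk_info(SAMPLE):
    print(entry["ts"], "node", entry["node"], "timeout", entry["timeout"])
```

Running the same function over the cssd.log of every node, starting with the evicted one, gives a compact view of which node was reported and when.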
Node eviction due to a lost voting disk (symptom).
Below is an example where the heartbeat to the voting disk is lost.
[ CSSD]2006-10-11 00:35:33.658 [1801] >TRACE: clssnmHandleSync: Acknowledging sync: src[1] srcName[alligator] seq[9] sync[15]
[ CSSD]2006-10-11 00:35:36.956 [1801] >TRACE: clssnmHandleSync: diskTimeout set to (27000)ms
[ CSSD]2006-10-11 00:35:37.960 [3343] >TRACE: clscsendx: (11145a3f0) Physical connection (111459b30) not active
Possible action: check the availability of the disk subsystem and the OS log files for reported errors related to the voting disk.
Trace the heartbeat: if needed, you can enable a higher level of tracing to debug the heartbeat. The first command below enables level 5 tracing; the second sets the level back to 0, which disables the extra tracing again. Keep in mind that this can make your cssd.log grow quickly (4 lines added every second).
crsctl debug log css CSSD:5
crsctl debug log css CSSD:0
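To get a feel for how fast cssd.log grows while level 5 tracing is on, the "4 lines added every second" figure can be turned into totals. This is a rough sketch that takes that figure at face value; the helper name `extra_log_lines` is made up for illustration, and the real rate will vary with the release and with cluster activity.

```python
def extra_log_lines(seconds, lines_per_second=4):
    """Extra cssd.log lines written over the given period.

    Assumption: a constant rate of 4 lines per second, as stated in this note.
    """
    return seconds * lines_per_second


per_hour = extra_log_lines(60 * 60)
per_day = extra_log_lines(24 * 60 * 60)
print("extra lines per hour:", per_hour)
print("extra lines per day:", per_day)
```

At that rate the extra tracing adds 14,400 lines per hour, so it is worth setting the level back to 0 as soon as the problem has been captured.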
NOTICE: A node eviction is a symptom of another problem! |
From "ITPUB Blog", link: http://blog.itpub.net/23135684/viewspace-719226/. If you repost this article, please credit the source; otherwise legal liability will be pursued.