The purpose of this Note is to document the default CSS misscount timeout calculations in 10g Release 1, 10g Release 2, 11g and higher versions.
MISSCOUNT DEFINITION AND DEFAULT VALUES
The CSS misscount parameter represents the maximum time, in seconds, that a network heartbeat can be missed before entering into a cluster reconfiguration to evict the node. The default misscount values, in seconds, for each platform and version when using Oracle Clusterware* are:
OS | 10g (R1 & R2) | 11g |
Linux | 60 | 30 |
Unix | 30 | 30 |
VMS | 30 | 30 |
Windows | 30 | 30 |
*The CSS misscount default value when using vendor (non-Oracle) clusterware is 600 seconds. This is to allow the vendor clusterware ample time to resolve any possible split-brain scenarios.
On AIX platforms with HACMP, starting with 10.2.0.3 BP#1, the misscount is 30. This is documented in Note 551658.1.
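For readers who want these defaults in a scriptable form, the table and footnote above can be sketched as a small Python lookup. This is only an illustration: the names `DEFAULT_MISSCOUNT` and `default_misscount` are made up for this example and are not part of any Oracle tool, and the AIX/HACMP exception is deliberately not modeled.

```python
# Default CSS misscount values in seconds, copied from the table above.
# The dictionary and function names are hypothetical, for illustration only.
DEFAULT_MISSCOUNT = {
    ("linux", "10g"): 60,
    ("linux", "11g"): 30,
    ("unix", "10g"): 30,
    ("unix", "11g"): 30,
    ("vms", "10g"): 30,
    ("vms", "11g"): 30,
    ("windows", "10g"): 30,
    ("windows", "11g"): 30,
}

# Default when vendor (non-Oracle) clusterware is in use, per the footnote.
VENDOR_CLUSTERWARE_MISSCOUNT = 600


def default_misscount(os_name, release, vendor_clusterware=False):
    """Return the documented default misscount in seconds."""
    if vendor_clusterware:
        return VENDOR_CLUSTERWARE_MISSCOUNT
    return DEFAULT_MISSCOUNT[(os_name.lower(), release.lower())]


print(default_misscount("Linux", "10g"))                          # 60
print(default_misscount("Linux", "11g"))                          # 30
print(default_misscount("Unix", "10g", vendor_clusterware=True))  # 600
```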
CSS HEARTBEAT MECHANISMS AND THEIR INTERRELATIONSHIP
The synchronization services component (CSS) of the Oracle Clusterware maintains two heartbeat mechanisms: (1) the disk heartbeat to the voting device and (2) the network heartbeat across the interconnect. Together they establish and confirm valid node membership in the cluster. Both of these heartbeat mechanisms have an associated timeout value. The disk heartbeat has an internal I/O timeout interval (DTO, Disk TimeOut), in seconds, within which an I/O to the voting disk must complete. The misscount parameter (MC), as stated above, is the maximum time, in seconds, that a network heartbeat can be missed. The disk heartbeat I/O timeout interval is directly related to the misscount parameter setting. There has been some variation in this relationship between versions, as described below:
9.x.x.x | NOTE: MISSCOUNT WAS A DIFFERENT ENTITY IN THIS RELEASE |
10.1.0.2 | No one should be on this version |
10.1.0.3 | DTO = MC - 15 seconds |
10.1.0.4 | DTO = MC - 15 seconds |
10.1.0.4 + unpublished Bug 3306964 | DTO = MC - 3 seconds |
10.1.0.4 with CRS II Merge patch | DTO = Disktimeout (defaults to 200 seconds) normally, OR Misscount seconds only during initial cluster formation or slightly before reconfiguration |
10.1.0.5 | IOT = MC - 3 seconds |
10.2.0.1 + fix for unpublished Bug 4896338 | IOT = Disktimeout (defaults to 200 seconds) normally, OR Misscount seconds only during initial cluster formation or slightly before reconfiguration |
10.2.0.2 | Same as above (10.2.0.1 with the patch for Bug:4896338) |
10.1 - 11.1 | During node join and leave (reconfiguration), the cluster uses the Short Disk TimeOut (SDTO), which in all versions is SDTO = MC - reboottime (usually 3 seconds) |
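As a worked example of the formulas in the table above, the following Python sketch computes the disk timeout from misscount. The function names are hypothetical, and the 3-second reboottime is the "usual" value mentioned in the table, treated here as an assumed default.

```python
def short_disk_timeout(misscount, reboottime=3):
    """SDTO = MC - reboottime, used during reconfiguration (10.1 - 11.1).

    reboottime=3 is an assumption based on the "usually 3 seconds" remark.
    """
    return misscount - reboottime


def legacy_disk_timeout(misscount, margin):
    """DTO = MC - margin, e.g. margin=15 for 10.1.0.3 and 10.1.0.4."""
    return misscount - margin


# Using the Linux 10g default misscount of 60 seconds:
print(short_disk_timeout(60))       # 57
print(legacy_disk_timeout(60, 15))  # 45
```

With an 11g misscount of 30 seconds, the same formula gives an SDTO of 27 seconds.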
Misscount drives cluster membership reconfigurations and directly affects the availability of the cluster. In most cases, the default settings for MC should be acceptable. Modifying the default value of misscount not only influences the timeout interval for the I/O to the voting disk, but also influences the tolerance for missed network heartbeats across the interconnect.
LONG LATENCIES TO THE VOTING DISKS
If I/O latencies to the voting disk are greater than the default DTO calculations noted above, the cluster may experience CSS node evictions depending on (a) the Oracle Clusterware (CRS) version, (b) whether the merge patch has been applied and (c) the state of the cluster. More details on this are covered in the section "Change in Behavior with Bug:4896338 applied on top of 10.2.0.1".
These latencies can be attributed to any number of problems in the I/O subsystem or problems with any component in the I/O path. The following is a non-exhaustive list of reported problems which resulted in CSS node eviction due to latencies to the voting disk longer than the default Oracle Clusterware I/O timeout value (DTO):
The most common problems relate to multi-path IO software drivers, and the reconfiguration times resulting from a failure in the IO path. Hardware and (re)configuration issues that introduce these latencies should be corrected. Incompatible failover times with underlying OS, network or storage hardware or software may be addressed given a complete understanding of the considerations listed below.
Misscount should NOT be modified to work around the above-mentioned issues. Oracle Support recommends that you apply the latest patchset, which changes the CSS behaviour. More details are covered in the next section.
Change in Behavior with Bug:4896338 applied on top of 10.2.0.1
Starting with 10.2.0.1 + Bug:4896338, CSS will not evict the node from the cluster because an I/O to the voting disk takes more than misscount seconds, unless this happens during the initial cluster formation or slightly before reconfiguration.
So if we have N nodes in a cluster and one of the nodes takes more than misscount seconds to access the voting disk, the node will not be evicted as long as the access to the voting disk completes within disktimeout seconds. Consequently, with this patch, there is no need to increase the misscount at all.
Additionally this merge patch introduces Disktimeout which is the amount of time that a lack of disk ping to voting disk(s) will be tolerated.
Note: applying the patch will not change your value for Misscount.
The table below explains the conditions under which the eviction will occur:
Network Ping | Disk Ping | Reboot |
Completes within Misscount seconds | Completes within Misscount seconds | N |
Completes within Misscount seconds | Takes more than Misscount seconds but less than Disktimeout seconds | N |
Completes within Misscount seconds | Takes more than Disktimeout seconds | Y |
Takes more than Misscount seconds | Completes within Misscount seconds | Y |
* By default, Misscount is less than Disktimeout seconds.
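The eviction table above can be read as a simple decision rule. The Python sketch below encodes it under stated assumptions: the function name `node_reboots` is hypothetical, the defaults of 30 and 200 seconds are taken from the values quoted earlier in this Note, and the rule covers only steady-state operation with the patch applied (not initial cluster formation or the period slightly before reconfiguration). The table does not say what happens when a ping takes exactly the limit, so the strict comparisons here are a guess.

```python
def node_reboots(network_ping_s, disk_ping_s, misscount=30, disktimeout=200):
    """Return True if the node would be rebooted according to the table.

    network_ping_s / disk_ping_s: seconds taken by the network and disk
    heartbeats. Steady state only; boundary handling is an assumption.
    """
    if network_ping_s > misscount:
        # A missed network heartbeat leads to eviction regardless of disk ping.
        return True
    # The network is fine: only a disk ping beyond disktimeout causes a reboot.
    return disk_ping_s > disktimeout


# The four rows of the table, with misscount=30 and disktimeout=200:
print(node_reboots(10, 10))    # False
print(node_reboots(10, 120))   # False
print(node_reboots(10, 250))   # True
print(node_reboots(45, 10))    # True
```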
CONSIDERATIONS WHEN CHANGING MISSCOUNT FROM THE DEFAULT VALUE
To change MISSCOUNT back to the default, please refer to Note:284752.1
THIS IS THE ONLY SUPPORTED METHOD. NOT FOLLOWING THIS METHOD RISKS EVICTIONS AND/OR CORRUPTING THE OCR
10g Release 2 MIRRORED VOTING DISKS AND VENDOR MULTIPATHING SOLUTIONS
Oracle RAC 10g Release 2 allows for multiple voting disks so that the customer does not have to rely on a multipathing solution from a storage vendor. You can have n voting disks (up to 31), where n = m*2+1 and m is the number of disk failures you want to survive. Oracle recommends placing each voting disk on a separate physical disk.
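The n = m*2+1 rule is easy to check with a few lines of Python. This is a minimal sketch: the function name `voting_disks_needed` is hypothetical, and the limit of 31 is simply the figure quoted in the paragraph above.

```python
MAX_VOTING_DISKS = 31  # upper limit quoted in the text above


def voting_disks_needed(disk_failures_to_survive):
    """Return n = m*2+1 for m tolerated voting disk failures."""
    m = disk_failures_to_survive
    if m < 0:
        raise ValueError("number of failures must not be negative")
    n = m * 2 + 1
    if n > MAX_VOTING_DISKS:
        raise ValueError("would need more than %d voting disks" % MAX_VOTING_DISKS)
    return n


print(voting_disks_needed(1))   # 3
print(voting_disks_needed(2))   # 5
print(voting_disks_needed(15))  # 31
```

So surviving a single voting disk failure takes three voting disks, and surviving two takes five.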
From Oracle
-------------------------------------------------------------------------------------------------------
Blog: http://blog.csdn.net/tianlesoftware