Understanding Oracle RAC Brain Split Resolution (the cluster split-brain protocol)



About a week ago, a senior Oracle engineer explained to me and a customer how split-brain is handled in RAC. According to him, when a split-brain occurs the nodes race to grab the voting disks; the nodes that manage to seize (n/2+1) voting disks survive, while the nodes that fail to grab any voting disk are evicted from the cluster.

I have to say this view is far too casual. That a veteran engineer who has been doing maintenance work since Oracle 6 can still make such a conceptual mistake only shows how quickly Oracle technology keeps changing.

Before walking through how split-brain (Brain Split) is resolved, it is worth introducing the working framework of Oracle RAC CSS (Cluster Synchronization Services):

Oracle RAC CSS provides two background services: Group Management (GM) and Node Monitor (NM). GM manages group and lock services. At any point in time one node in the cluster acts as the GM master node. The other nodes serially send their GM requests to the master node, and the master node broadcasts cluster membership changes to the other nodes. Group membership is synchronized every time a cluster reconfiguration takes place, and each node interprets the membership change information independently.

The Node Monitor (NM) service, on the other hand, keeps node information consistent with third-party vendor clusterware through skgxn (skgxn-libskgxn.a, the library that provides node monitoring). In addition, NM maintains the familiar network heartbeat and disk heartbeat to confirm that nodes are still alive. When a cluster member fails to deliver a normal network heartbeat or disk heartbeat, NM evicts that member from the cluster, and the evicted node then reboots.

NM learns the endpoints it needs to listen on and talk to from the records in the OCR (the OCR stores the interconnect information), and sends its heartbeat over the network to the other cluster members. At the same time it monitors the network heartbeats coming from all the other members; such a network heartbeat is sent every second. If no network heartbeat from a node is received within the number of seconds specified by misscount (by the way: in 10.2.0.1 the default misscount is 60s on Linux and 30s on other platforms, or 600s when third-party vendor clusterware is used, and 10.2.0.1 did not yet introduce disktimeout; from 10.2.0.4 onwards misscount is 60s and disktimeout is 200s; since 11.2 misscount is 30s: CRS-4678: Successful get misscount 30 for Cluster Synchronization Services, CRS-4678: Successful get disktimeout 200 for Cluster Synchronization Services), that node is considered "dead". NM is also responsible for initiating cluster reconfiguration when other nodes join or leave the cluster.
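To make the misscount countdown concrete, below is a minimal Python sketch. It is purely illustrative and not Oracle code; the class name, the 30-second value and the 50%/75%/90% thresholds are assumptions modelled on the ocssd.log excerpts shown later in this article.

import time

MISSCOUNT = 30.0  # seconds; an assumed 11.2-style default, for illustration only

class PeerNode:
    def __init__(self, name):
        self.name = name
        self.last_heartbeat = time.time()

    def on_network_heartbeat(self):
        # called whenever a heartbeat packet arrives from this peer (roughly once a second)
        self.last_heartbeat = time.time()

    def missing_for(self):
        return time.time() - self.last_heartbeat

def poll_peers(peers):
    """Rough analogue of the clssnmPollingThread countdown seen in ocssd.log."""
    for peer in peers:
        missing = peer.missing_for()
        if missing >= MISSCOUNT:
            print(f"node {peer.name} considered dead, initiating cluster reconfiguration")
        elif missing >= MISSCOUNT * 0.5:
            # the real log prints warnings once 50%/75%/90% of misscount has elapsed
            print(f"WARNING: node {peer.name} heartbeat missing, "
                  f"eviction in {MISSCOUNT - missing:.1f} seconds")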

When resolving a split-brain, NM also monitors the voting disk to learn about the other competing sub-clusters. The notion of a sub-cluster deserves a short introduction. Imagine an environment with a very large number of nodes; take the 128-node cluster Oracle has officially built as our thought experiment. When a network failure occurs there are several possibilities. One is a global network failure, where none of the 128 nodes can exchange network heartbeats with any other, producing as many as 128 isolated single-node "islands". Another is a partial network failure, where the 128 nodes are split into several partitions, each containing more than one node; these partitions are what we call sub-clusters. When a network failure occurs, the nodes inside a sub-cluster can still exchange vote messages (vote mesg) with each other, but the sub-clusters and the isolated nodes can no longer communicate over the regular interconnect, and this is where NM reconfiguration needs the voting disk.

Because NM uses the voting disk to resolve communication failures caused by network problems, the voting disk must be accessible at all times. In the normal state every node performs disk heartbeat activity, i.e. it writes its disk heartbeat into a specific block of the voting disk, and this happens every second. At the same time CSS also reads, every second, a block called the "kill block"; when the content of the kill block indicates that the local node has been evicted from the cluster, CSS actively reboots the node.
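As a rough illustration of that once-per-second cycle, here is a small self-contained Python sketch; FakeVotingDisk and its fields are invented for the example and have nothing to do with the real on-disk voting file format.

import time
import threading

class FakeVotingDisk:
    """A tiny in-memory stand-in for a single voting disk (illustrative only)."""
    def __init__(self):
        self.heartbeat_blocks = {}   # node number -> timestamp of last disk heartbeat
        self.kill_blocks = {}        # node number -> True once that node has been evicted

    def write_heartbeat(self, node):
        self.heartbeat_blocks[node] = time.time()

    def kill_block_set(self, node):
        return self.kill_blocks.get(node, False)

def disk_heartbeat_loop(disk, my_node, stop_event, interval=1.0):
    """Once per second: write our disk heartbeat, then read our kill block."""
    while not stop_event.is_set():
        disk.write_heartbeat(my_node)            # disk heartbeat
        if disk.kill_block_set(my_node):         # eviction notice found in the kill block
            print(f"node {my_node}: kill block set, rebooting to leave the cluster")
            break
        time.sleep(interval)

# Example: node 2 heartbeats until someone writes its kill block.
# disk = FakeVotingDisk(); stop = threading.Event()
# threading.Thread(target=disk_heartbeat_loop, args=(disk, 2, stop)).start()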

To keep the disk heartbeat and kill-block reading described above working at all times, CSS requires that at least (N/2+1) of the voting disks be accessible from each node. This guarantees that any two nodes always share at least one voting disk that both of them can access. Under normal circumstances (note: genuinely calm, normal circumstances), as long as a node can access more online voting disks than it cannot access, the node happily stays alive; once the number of inaccessible voting disks exceeds the number of accessible ones, the Cluster Synchronization Services process fails and the node reboots. So the claim that two voting disks are enough for redundancy and that three or more voting disks are unnecessary is wrong. Oracle recommends at least three voting disks in a cluster.
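A one-line way to express this majority rule (a sketch, not Oracle's implementation):

def node_survives(total_voting_disks, accessible_voting_disks):
    """A node must access a strict majority, i.e. at least N//2 + 1, of the N voting disks."""
    return accessible_voting_disks >= total_voting_disks // 2 + 1

print(node_survives(3, 2))   # True  -> the node stays up
print(node_survives(3, 1))   # False -> CSS fails and the node reboots
print(node_survives(2, 1))   # False -> with 2 voting disks, losing one is already fatal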

Supplement 1:

Question:
Someone asked: must the number of voting disks be odd?

Answer:
In fact we only recommend an odd number of voting disks; it is not mandatory. In 10gR2 the upper limit on the number of voting disks is 32.

Question:
Can we use 2 or 4 voting disks?

Answer:
Yes, you can. But counts such as 2 or 4 are at a disadvantage under the hard disk-heartbeat rule that "at least (N/2+1) voting disks must be accessible" (see the small sketch after this list):
With 2 voting disks, no voting disk heartbeat failure can be tolerated.
With 3 voting disks, at most 1 voting disk heartbeat failure can be tolerated.
With 4 voting disks, still at most 1 voting disk heartbeat failure can be tolerated; the fault tolerance is the same as with 3 disks, yet the extra voting disk adds management cost and risk.
With 5 voting disks, at most 2 voting disk heartbeat failures can be tolerated.
With 6 voting disks, still at most 2 voting disk heartbeat failures can be tolerated; as with 4 versus 3, the extra disk only brings unnecessary management cost and risk.
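The same rule explains the whole list above; this tiny loop (illustrative only) prints the number of voting-disk heartbeat failures each configuration can tolerate:

def tolerated_failures(n):
    """Voting disks that may fail while a strict majority (n // 2 + 1) stays accessible."""
    return n - (n // 2 + 1)

for n in range(1, 7):
    print(f"{n} voting disk(s): tolerates {tolerated_failures(n)} heartbeat failure(s)")
# 1 -> 0, 2 -> 0, 3 -> 1, 4 -> 1, 5 -> 2, 6 -> 2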

Supplement 2:
Question:
If the network heartbeat between nodes is normal and a node can still heartbeat to more voting disks than it cannot access — for example, with 3 voting disks exactly 1 voting disk's disk heartbeat happens to time out — will a brain split occur?

Answer:
This situation triggers neither a brain split nor the node eviction protocol. When a single voting disk, or any number smaller than (N/2+1), suffers a disk heartbeat failure — possibly caused by a transient I/O error while accessing the voting disk — CSS immediately marks the failed voting disks OFFLINE. Although some voting disks are OFFLINE, at least (N/2+1) voting disks are still available, which guarantees that the eviction protocol is not invoked, so no node is rebooted. The Disk Ping Monitor Thread of the node monitor module (DPMT, clssnmDiskPMT) then repeatedly retries these failed OFFLINE voting disks; if a disk becomes accessible again and the data on it is verified to be uncorrupted, CSS marks that voting disk ONLINE again. But if the voting disk still cannot be accessed within 45s (the 45s here is derived from misscount and internal algorithms), DPMT writes a warning into cssd.log, such as:

 

[    CSSD]2011-11-11 20:11:20.668 >WARNING: clssnmDiskPMT: long disk latency (45940 ms) to voting disk (0//dev/asm-votedisk1)

 

Suppose the RAC environment that produced the clssnmDiskPMT warning above has 3 voting disks in total, and one of them, asm-votedisk1, has already been marked OFFLINE because of an I/O error or some other reason. If a second voting disk then also fails its disk heartbeat, the node has fewer than the required number (2) of voting disks, the eviction protocol is invoked, and the node reboots.

A heartbeat failure of a single voting disk, or of any number smaller than (N/2+1), only generates a warning rather than a fatal error. Because the majority of the voting disks are still accessible, these warnings are non-fatal and the eviction protocol is not triggered.
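The retry-and-warn behaviour described above can be pictured with the following sketch. It is only an analogy for what clssnmDiskPMT does; the probe callable is a placeholder supplied by the caller, the 45s and 200s values echo the text above, and nothing here is an Oracle API.

import time

def monitor_offline_votedisk(probe, warn_after=45.0, give_up_after=200.0, poll_interval=1.0):
    """probe() is a caller-supplied function returning True once the voting disk is readable again."""
    went_offline = time.time()
    warned = False
    while True:
        if probe():
            print("voting disk verified and accessible again -> mark it ONLINE")
            return True
        latency = time.time() - went_offline
        if latency > warn_after and not warned:
            print(f"WARNING: long disk latency ({latency * 1000:.0f} ms) to voting disk")
            warned = True
        if latency > give_up_after:
            return False   # disk stays OFFLINE; whether eviction follows depends on the majority rule
        time.sleep(poll_interval)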

 

When an actual NM reconfiguration takes place, all active nodes and the nodes currently joining the cluster participate in the reconfig; nodes that do not acknowledge (ack) are excluded from the new cluster membership. The reconfiguration actually consists of several phases (a sketch of the whole sequence follows the list):
1. Initialization phase — the reconfig manager (the node with the lowest cluster member number) signals the other nodes to start the reconfig
2. Vote phase — each node sends the membership it knows about to the reconfig manager
3. Split-brain check phase — the reconfig manager checks for a split-brain
4. Evict phase — the reconfig manager evicts the non-member nodes
5. Update phase — the reconfig manager sends the authoritative membership information to the member nodes
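The sequence above can be summarised with the following hypothetical Python walk-through from the reconfig manager's point of view; the data structures and the tie-break are modelled on the description in this article, not on real CSS code.

def run_reconfiguration(membership_views):
    """membership_views: node_id -> set of node_ids that node can still reach over the interconnect."""
    # 1. Initialization: the reconfig manager asks every node to start the reconfig;
    #    nodes that acknowledge take part in the new incarnation.
    participating = set(membership_views)

    # 2. Vote phase: each participating node reports the membership it knows about.
    votes = {n: membership_views[n] & participating for n in participating}

    # 3. Split-brain check: nodes reporting the same reachable set form one sub-cluster;
    #    the largest sub-cluster survives (lowest node number wins a tie).
    subclusters = {frozenset(view | {n}) for n, view in votes.items()}
    survivor = max(subclusters, key=lambda s: (len(s), -min(s)))

    # 4. Evict phase: every node outside the surviving sub-cluster is evicted.
    evicted = sorted(participating - survivor)

    # 5. Update phase: the authoritative membership is broadcast to the survivors.
    return {"members": sorted(survivor), "evicted": evicted}

# Example: a 3-node cluster in which node 1 has lost network contact with nodes 2 and 3.
print(run_reconfiguration({1: set(), 2: {3}, 3: {2}}))
# -> {'members': [2, 3], 'evicted': [1]}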

 

In the split-brain check phase the Reconfig Manager identifies the nodes that have no network heartbeat but still have a disk heartbeat, uses the network heartbeat (where still possible) and the disk heartbeat information to count the nodes in every competing sub-cluster, and decides which sub-cluster should survive based on the following factors (a small sketch of the decision follows the list):

  1. The sub-cluster with the largest number of nodes wins (Sub-cluster with largest number of Nodes)
  2. If the sub-clusters are the same size, the sub-cluster containing the lowest node number wins (Sub-cluster with lowest node number); for example, in a 2-node RAC it is always node 1 that wins.
  3. I/O fencing using a STONITH-style algorithm (remote power reset) is then applied to the losing nodes
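A minimal sketch of the two decision factors (the helper name is invented for the example):

def choose_surviving_subcluster(subclusters):
    """subclusters: iterable of sets of node numbers.
    Factor 1: the largest sub-cluster wins. Factor 2: on a tie, the lowest node number wins."""
    return max(subclusters, key=lambda s: (len(s), -min(s)))

print(choose_surviving_subcluster([{1, 2}, {3, 4}]))   # {1, 2}: 2-versus-2 tie, node 1 decides it
print(choose_surviving_subcluster([{1}, {2}]))         # {1}: in a 2-node RAC node 1 always wins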

 

Supplement:

STONITH is a commonly used I/O fencing algorithm and is the interface RAC relies on to shut down a remote node. The idea is extremely simple: when software running on one node wants to make sure that the other nodes in the cluster can no longer use some resource, it simply pulls the other nodes' power plug. It is a simple, reliable and somewhat brutal algorithm. The advantage of STONITH is that it has no specific hardware requirements and does not limit the scalability of the cluster.

The Process Monitor module of Oracle Clusterware is responsible for I/O fencing, ensuring that uncoordinated work by nodes/instances cannot cause corruption. The Process Monitor work is carried out by the hangcheck timer or by oprocd. On Linux, oprocd did not exist before 10.2.0.4 (on other Unix platforms it was there from 10.2.0.1), so before installing RAC you had to install the hangcheck timer separately to guarantee I/O fencing; from 10.2.0.4 onwards Linux also has oprocd (details in a separate article). These fencing processes are typically locked in memory, run in real time, sleep for a fixed interval and run as root; if such a process wakes up and finds it has overslept, it forces a reboot, and if the process itself fails the node also reboots. oprocd is therefore a very important process in a RAC environment; do not kill it manually.

 

 

After the split-brain check comes the evict phase. The evicted nodes receive the eviction message sent to them (if the network is still usable); if the message cannot be delivered, the eviction notice is written to the "kill block" on the voting disk instead. The surviving nodes then wait for the evicted nodes to indicate that they have received the eviction notice, either through a network message or through status information on the voting disk.
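The eviction-notice flow just described could be sketched as follows; the three callables are placeholders supplied by the caller and are not real CSS interfaces.

import time

def notify_eviction(victim, send_over_network, write_kill_block, got_ack, wait_seconds=30):
    """Try the interconnect first, fall back to the voting-disk kill block,
    then wait until the victim acknowledges (by message or by voting-disk status)."""
    if not send_over_network(victim):     # fails when the interconnect to the victim is down
        write_kill_block(victim)          # eviction notice written to the voting disk instead

    deadline = time.time() + wait_seconds
    while time.time() < deadline:
        if got_ack(victim):               # network message or status block on the voting disk
            return True
        time.sleep(1)
    return False                          # the victim never confirmed the eviction notice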

As you can see, during the brain split check Oracle CSS does its best to keep the largest sub-cluster alive so that the RAC system retains the highest possible availability. It is not, as that senior engineer claimed, a race in which nodes grab the voting disks during cluster reconfiguration to decide who survives.

The following are two example scenarios from a three-node RAC environment:

1. Node 1 loses its network; nodes 2 and 3 form a sub-cluster and evict node 1 through the voting disk:

The ocssd.log of node 1:
[    CSSD]2011-04-23 17:11:42.943 [3042950032] >WARNING: clssnmPollingThread: node vrh2 (2) at 50 3.280308e-268artbeat fatal, eviction in 29.610 seconds
[    CSSD]2011-04-23 17:11:42.943 [3042950032] >TRACE:   clssnmPollingThread: node vrh2 (2) is impending reconfig, flag 1037, misstime 30390
[    CSSD]2011-04-23 17:11:42.943 [3042950032] >WARNING: clssnmPollingThread: node vrh3 (3) at 50 3.280308e-268artbeat fatal, eviction in 29.150 seconds

Node 1 starts the misscount countdown against nodes 2 and 3

[    CSSD]2011-04-23 17:11:42.943 [3042950032] >TRACE:   clssnmPollingThread: node vrh3 (3) is impending reconfig, flag 1037, misstime 30850
[    CSSD]2011-04-23 17:11:42.943 [3042950032] >TRACE:   clssnmPollingThread: diskTimeout set to (57000)ms impending reconfig status(1)
[    CSSD]2011-04-23 17:11:44.368 [3042950032] >WARNING: clssnmPollingThread: node vrh2 (2) at 50 3.280308e-268artbeat fatal, eviction in 28.610 seconds
[    CSSD]2011-04-23 17:12:04.778 [3042950032] >WARNING: clssnmPollingThread: node vrh2 (2) at 75 3.280308e-268artbeat fatal, eviction in 14.580 seconds
[    CSSD]2011-04-23 17:12:04.779 [3042950032] >WARNING: clssnmPollingThread: node vrh3 (3) at 75 3.280308e-268artbeat fatal, eviction in 14.120 seconds
[    CSSD]2011-04-23 17:12:06.207 [3042950032] >WARNING: clssnmPollingThread: node vrh2 (2) at 75 3.280308e-268artbeat fatal, eviction in 13.580 seconds
[    CSSD]2011-04-23 17:12:17.719 [3042950032] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 3.280308e-268artbeat fatal, eviction in 5.560 seconds
[    CSSD]2011-04-23 17:12:17.719 [3042950032] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 3.280308e-268artbeat fatal, eviction in 5.100 seconds
[    CSSD]2011-04-23 17:12:19.165 [3042950032] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 3.280308e-268artbeat fatal, eviction in 4.560 seconds
[    CSSD]2011-04-23 17:12:19.165 [3042950032] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 3.280308e-268artbeat fatal, eviction in 4.100 seconds
[    CSSD]2011-04-23 17:12:20.642 [3042950032] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 3.280308e-268artbeat fatal, eviction in 3.560 seconds
[    CSSD]2011-04-23 17:12:20.642 [3042950032] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 3.280308e-268artbeat fatal, eviction in 3.100 seconds
[    CSSD]2011-04-23 17:12:22.139 [3042950032] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 3.280308e-268artbeat fatal, eviction in 2.560 seconds
[    CSSD]2011-04-23 17:12:22.139 [3042950032] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 3.280308e-268artbeat fatal, eviction in 2.100 seconds
[    CSSD]2011-04-23 17:12:23.588 [3042950032] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 3.280308e-268artbeat fatal, eviction in 1.550 seconds
[    CSSD]2011-04-23 17:12:23.588 [3042950032] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 3.280308e-268artbeat fatal, eviction in 1.090 seconds

The ocssd.log of node 2:

[    CSSD]2011-04-23 17:11:53.054 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 50 8.910601e-269artbeat fatal, eviction in 29.800 seconds
[    CSSD]2011-04-23 17:11:53.054 [3053439888] >TRACE:   clssnmPollingThread: node vrh1 (1) is impending reconfig, flag 1037, misstime 30200
[    CSSD]2011-04-23 17:11:53.054 [3053439888] >TRACE:   clssnmPollingThread: diskTimeout set to (57000)ms impending reconfig status(1)
[    CSSD]2011-04-23 17:11:54.516 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 50 8.910601e-269artbeat fatal, eviction in 28.790 seconds
[    CSSD]2011-04-23 17:12:14.826 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 75 8.910601e-269artbeat fatal, eviction in 14.800 seconds
[    CSSD]2011-04-23 17:12:16.265 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 75 8.910601e-269artbeat fatal, eviction in 13.800 seconds
[    CSSD]2011-04-23 17:12:27.755 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 8.910601e-269artbeat fatal, eviction in 5.800 seconds
[    CSSD]2011-04-23 17:12:29.197 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 8.910601e-269artbeat fatal, eviction in 4.800 seconds
[    CSSD]2011-04-23 17:12:30.658 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 8.910601e-269artbeat fatal, eviction in 3.800 seconds
[    CSSD]2011-04-23 17:12:32.133 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 8.910601e-269artbeat fatal, eviction in 2.800 seconds
[    CSSD]2011-04-23 17:12:33.602 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 8.910601e-269artbeat fatal, eviction in 1.790 seconds
[    CSSD]2011-04-23 17:12:35.126 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 8.910601e-269artbeat fatal, eviction in 0.800 seconds

[    CSSD]2011-04-23 17:12:35.399 [117574544] >TRACE:   clssnmHandleSync: diskTimeout set to (57000)ms
[    CSSD]2011-04-23 17:12:35.399 [117574544] >TRACE:   clssnmHandleSync: Acknowledging sync: src[3] srcName[vrh3] seq[21] sync[10]

clssnmHandleSync acknowledges the sync message sent by node 3

[    CSSD]2011-04-23 17:12:35.399 [5073104] >USER:    NMEVENT_SUSPEND [00][00][00][0e]

A Node Monitoring SUSPEND event occurs

[    CSSD]2011-04-23 17:12:35.405 [117574544] >TRACE:   clssnmSendVoteInfo: node(3) syncSeqNo(10)

clssnmSendVoteInfo sends the vote message (vote mesg) to node 3

[    CSSD]2011-04-23 17:12:35.415 [117574544] >TRACE:   clssnmUpdateNodeState: node 0, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new)
[    CSSD]2011-04-23 17:12:35.415 [117574544] >TRACE:   clssnmUpdateNodeState: node 1, state (3/0) unique (1303592601/1303592601) prevConuni(0) birth (9/9) (old/new)
[    CSSD]2011-04-23 17:12:35.415 [117574544] >TRACE:   clssnmDiscHelper: vrh1, node(1) connection failed, con (0xb7e80ae8), probe((nil))
[    CSSD]2011-04-23 17:12:35.415 [117574544] >TRACE:   clssnmDeactivateNode: node 1 (vrh1) left cluster

It is confirmed that node 1 has left the cluster

[    CSSD]2011-04-23 17:12:35.415 [117574544] >TRACE:   clssnmUpdateNodeState: node 2, state (3/3) unique (1303591210/1303591210) prevConuni(0) birth (2/2) (old/new)
[    CSSD]2011-04-23 17:12:35.415 [117574544] >TRACE:   clssnmUpdateNodeState: node 3, state (3/3) unique (1303591326/1303591326) prevConuni(0) birth (3/3) (old/new)
[    CSSD]2011-04-23 17:12:35.415 [117574544] >USER:    clssnmHandleUpdate: SYNC(10) from node(3) completed
[    CSSD]2011-04-23 17:12:35.416 [117574544] >USER:    clssnmHandleUpdate: NODE 2 (vrh2) IS ACTIVE MEMBER OF CLUSTER
[    CSSD]2011-04-23 17:12:35.416 [117574544] >USER:    clssnmHandleUpdate: NODE 3 (vrh3) IS ACTIVE MEMBER OF CLUSTER
[    CSSD]2011-04-23 17:12:35.416 [117574544] >TRACE:   clssnmHandleUpdate: diskTimeout set to (200000)ms
[    CSSD]2011-04-23 17:12:35.416 [3021970320] >TRACE:   clssgmReconfigThread:  started for reconfig (10)
[    CSSD]2011-04-23 17:12:35.416 [3021970320] >USER:    NMEVENT_RECONFIG [00][00][00][0c]
[    CSSD]2011-04-23 17:12:35.417 [3021970320] >TRACE:   clssgmCleanupGrocks: cleaning up grock crs_version type 2
[    CSSD]2011-04-23 17:12:35.417 [3021970320] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(0) grock(crs_version) birth(9/9)
[    CSSD]2011-04-23 17:12:35.418 [3021970320] >TRACE:   clssgmCleanupGrocks: cleaning up grock _ORA_CRS_FAILOVER type 3
[    CSSD]2011-04-23 17:12:35.418 [3021970320] >TRACE:   clssgmCleanupGrocks: cleaning up grock EVMDMAIN type 2
[    CSSD]2011-04-23 17:12:35.418 [3021970320] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(EVMDMAIN) birth(9/9)
[    CSSD]2011-04-23 17:12:35.418 [3021970320] >TRACE:   clssgmCleanupGrocks: cleaning up grock CRSDMAIN type 2
[    CSSD]2011-04-23 17:12:35.418 [3021970320] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(CRSDMAIN) birth(9/9)
[    CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE:   clssgmCleanupGrocks: cleaning up grock _ORA_CRS_MEMBER_vrh1 type 3
[    CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE:   clssgmCleanupGrocks: cleaning up grock _ORA_CRS_MEMBER_vrh2 type 3
[    CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE:   clssgmCleanupGrocks: cleaning up grock _ORA_CRS_MEMBER_vrh3 type 3
[    CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE:   clssgmCleanupGrocks: cleaning up grock ocr_crs type 2
[    CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(ocr_crs) birth(9/9)
[    CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE:   clssgmCleanupGrocks: cleaning up grock #CSS_CLSSOMON type 2
[    CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(#CSS_CLSSOMON) birth(9/9)
[    CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE:   clssgmEstablishConnections: 2 nodes in cluster incarn 10
[    CSSD]2011-04-23 17:12:35.419 [3063929744] >TRACE:   clssgmPeerDeactivate: node 1 (vrh1), death 10, state 0x80000000 connstate 0xa
[    CSSD]2011-04-23 17:12:35.419 [3063929744] >TRACE:   clssgmPeerListener: connects done (2/2)
[    CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE:   clssgmEstablishMasterNode: MASTER for 10 is node(2) birth(2)
[    CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE:   clssgmMasterCMSync: Synchronizing group/lock status
[    CSSD]2011-04-23 17:12:35.428 [3021970320] >TRACE:   clssgmMasterSendDBDone: group/lock status synchronization complete
[    CSSD]CLSS-3000: reconfiguration successful, incarnation 10 with 2 nodes

[    CSSD]CLSS-3001: local node number 2, master node number 2

Reconfiguration completed

[    CSSD]2011-04-23 17:12:35.440 [3021970320] >TRACE:   clssgmReconfigThread:  completed for reconfig(10), with status(1)

The ocssd.log of node 3:

[    CSSD]2011-04-23 17:12:36.303 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 50 1.867300e-268artbeat fatal, eviction in 29.220 seconds
[    CSSD]2011-04-23 17:12:36.303 [3053439888] >TRACE:   clssnmPollingThread: node vrh1 (1) is impending reconfig, flag 1037, misstime 30780
[    CSSD]2011-04-23 17:12:36.303 [3053439888] >TRACE:   clssnmPollingThread: diskTimeout set to (57000)ms impending reconfig status(1)
[    CSSD]2011-04-23 17:12:57.889 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 75 1.867300e-268artbeat fatal, eviction in 14.220 seconds
[    CSSD]2011-04-23 17:13:10.674 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 1.867300e-268artbeat fatal, eviction in 5.220 seconds
[    CSSD]2011-04-23 17:13:12.115 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 1.867300e-268artbeat fatal, eviction in 4.220 seconds
[    CSSD]2011-04-23 17:13:13.597 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 1.867300e-268artbeat fatal, eviction in 3.210 seconds
[    CSSD]2011-04-23 17:13:15.024 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 1.867300e-268artbeat fatal, eviction in 2.220 seconds
[    CSSD]2011-04-23 17:13:16.504 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 1.867300e-268artbeat fatal, eviction in 1.220 seconds
[    CSSD]2011-04-23 17:13:17.987 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 1.867300e-268artbeat fatal, eviction in 0.220 seconds
[    CSSD]2011-04-23 17:13:18.325 [3053439888] >TRACE:   clssnmPollingThread: Eviction started for node vrh1 (1), flags 0x040d, state 3, wt4c 0
[    CSSD]2011-04-23 17:13:18.326 [3032460176] >TRACE:   clssnmDoSyncUpdate: Initiating sync 10
[    CSSD]2011-04-23 17:13:18.326 [3032460176] >TRACE:   clssnmDoSyncUpdate: diskTimeout set to (57000)ms
[    CSSD]2011-04-23 17:13:18.326 [3032460176] >TRACE:   clssnmSetupAckWait: Ack message type (11)
[    CSSD]2011-04-23 17:13:18.326 [3032460176] >TRACE:   clssnmSetupAckWait: node(2) is ALIVE
[    CSSD]2011-04-23 17:13:18.326 [3032460176] >TRACE:   clssnmSetupAckWait: node(3) is ALIVE
[    CSSD]2011-04-23 17:13:18.327 [3032460176] >TRACE:   clssnmSendSync: syncSeqNo(10)
[    CSSD]2011-04-23 17:13:18.329 [3032460176] >TRACE:   clssnmWaitForAcks: Ack message type(11), ackCount(2)
[    CSSD]2011-04-23 17:13:18.329 [89033616] >TRACE:   clssnmHandleSync: diskTimeout set to (57000)ms
[    CSSD]2011-04-23 17:13:18.329 [89033616] >TRACE:   clssnmHandleSync: Acknowledging sync: src[3] srcName[vrh3] seq[21] sync[10]
[    CSSD]2011-04-23 17:13:18.330 [8136912] >USER:    NMEVENT_SUSPEND [00][00][00][0e]
[    CSSD]2011-04-23 17:13:18.332 [3032460176] >TRACE:   clssnmWaitForAcks: done, msg type(11)
[    CSSD]2011-04-23 17:13:18.332 [3032460176] >TRACE:   clssnmDoSyncUpdate: Terminating node 1, vrh1, misstime(60010) state(5)
[    CSSD]2011-04-23 17:13:18.332 [3032460176] >TRACE:   clssnmSetupAckWait: Ack message type (13)
[    CSSD]2011-04-23 17:13:18.332 [3032460176] >TRACE:   clssnmSetupAckWait: node(2) is ACTIVE
[    CSSD]2011-04-23 17:13:18.332 [3032460176] >TRACE:   clssnmSetupAckWait: node(3) is ACTIVE
[    CSSD]2011-04-23 17:13:18.334 [3032460176] >TRACE:   clssnmWaitForAcks: Ack message type(13), ackCount(2)
[    CSSD]2011-04-23 17:13:18.335 [89033616] >TRACE:   clssnmSendVoteInfo: node(3) syncSeqNo(10)
[    CSSD]2011-04-23 17:13:18.337 [3032460176] >TRACE:   clssnmWaitForAcks: done, msg type(13)

The above completes the vote message exchange between nodes 2 and 3; these messages contain the node identifier, the GM peer-to-peer listening endpoint and the view of cluster membership.

[    CSSD]2011-04-23 17:13:18.337 [3032460176] >TRACE:   clssnmCheckDskInfo: Checking disk info...

Node 3 starts checking the information on the voting disk

[    CSSD]2011-04-23 17:13:18.337 [3032460176] >TRACE:   clssnmCheckDskInfo: node 1, vrh1, state 5 with leader 1 has smaller cluster size 1; my cluster size 2 with leader 2

Another sub-cluster is found: it contains node 1, node 1 is its leader, and it is the smaller sub-cluster; nodes 3 and 2 form the larger sub-cluster, with node 2 as its leader

[    CSSD]2011-04-23 17:13:18.337 [3032460176] >TRACE:   clssnmEvict: Start
[    CSSD]2011-04-23 17:13:18.337 [3032460176] >TRACE:   clssnmEvict: Evicting node 1, vrh1, birth 9, death 10, impendingrcfg 1, stateflags 0x40d
The eviction of node 1 is initiated

[    CSSD]2011-04-23 17:13:18.337 [3032460176] >TRACE:   clssnmSendShutdown: req to node 1, kill time 443294
[    CSSD]2011-04-23 17:13:18.339 [3032460176] >TRACE:   clssnmDiscHelper: vrh1, node(1) connection failed, con (0xb7eaf220), probe((nil))
[    CSSD]2011-04-23 17:13:18.340 [3032460176] >TRACE:   clssnmWaitOnEvictions: Start
[    CSSD]2011-04-23 17:13:18.340 [3032460176] >TRACE:   clssnmWaitOnEvictions: node 1, vrh1, undead 1
[    CSSD]2011-04-23 17:13:18.340 [3032460176] >TRACE:   clssnmCheckKillStatus: Node 1, vrh1, down, LATS(443144),timeout(150)

clssnmCheckKillStatus checks whether node 1 is down

[    CSSD]2011-04-23 17:13:18.340 [3032460176] >TRACE:   clssnmSetupAckWait: Ack message type (15)
[    CSSD]2011-04-23 17:13:18.340 [3032460176] >TRACE:   clssnmSetupAckWait: node(2) is ACTIVE
[    CSSD]2011-04-23 17:13:18.340 [3032460176] >TRACE:   clssnmSetupAckWait: node(3) is ACTIVE
[    CSSD]2011-04-23 17:13:18.340 [3032460176] >TRACE:   clssnmSendUpdate: syncSeqNo(10)
[    CSSD]2011-04-23 17:13:18.341 [3032460176] >TRACE:   clssnmWaitForAcks: Ack message type(15), ackCount(2)
[    CSSD]2011-04-23 17:13:18.341 [89033616] >TRACE:   clssnmUpdateNodeState: node 0, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new)
[    CSSD]2011-04-23 17:13:18.341 [89033616] >TRACE:   clssnmUpdateNodeState: node 1, state (5/0) unique (1303592601/1303592601) prevConuni(1303592601) birth (9/9) (old/new)
[    CSSD]2011-04-23 17:13:18.341 [89033616] >TRACE:   clssnmDeactivateNode: node 1 (vrh1) left cluster

[    CSSD]2011-04-23 17:13:18.341 [89033616] >TRACE:   clssnmUpdateNodeState: node 2, state (3/3) unique (1303591210/1303591210) prevConuni(0) birth (2/2) (old/new)
[    CSSD]2011-04-23 17:13:18.341 [89033616] >TRACE:   clssnmUpdateNodeState: node 3, state (3/3) unique (1303591326/1303591326) prevConuni(0) birth (3/3) (old/new)
[    CSSD]2011-04-23 17:13:18.342 [89033616] >USER:    clssnmHandleUpdate: SYNC(10) from node(3) completed
[    CSSD]2011-04-23 17:13:18.342 [89033616] >USER:    clssnmHandleUpdate: NODE 2 (vrh2) IS ACTIVE MEMBER OF CLUSTER
[    CSSD]2011-04-23 17:13:18.342 [89033616] >USER:    clssnmHandleUpdate: NODE 3 (vrh3) IS ACTIVE MEMBER OF CLUSTER
[    CSSD]2011-04-23 17:13:18.342 [89033616] >TRACE:   clssnmHandleUpdate: diskTimeout set to (200000)ms
[    CSSD]2011-04-23 17:13:18.347 [3032460176] >TRACE:   clssnmWaitForAcks: done, msg type(15)
[    CSSD]2011-04-23 17:13:18.348 [3032460176] >TRACE:   clssnmDoSyncUpdate: Sync 10 complete!
[    CSSD]2011-04-23 17:13:18.350 [3021970320] >TRACE:   clssgmReconfigThread:  started for reconfig (10)
[    CSSD]2011-04-23 17:13:18.350 [3021970320] >USER:    NMEVENT_RECONFIG [00][00][00][0c]
[    CSSD]2011-04-23 17:13:18.351 [3021970320] >TRACE:   clssgmCleanupGrocks: cleaning up grock crs_version type 2
[    CSSD]2011-04-23 17:13:18.352 [3021970320] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(0) grock(crs_version) birth(9/9)
[    CSSD]2011-04-23 17:13:18.353 [3063929744] >TRACE:   clssgmDispatchCMXMSG(): got message type(7) src(2) incarn(10) during incarn(9/9)
[    CSSD]2011-04-23 17:13:18.354 [3021970320] >TRACE:   clssgmCleanupGrocks: cleaning up grock _ORA_CRS_FAILOVER type 3
[    CSSD]2011-04-23 17:13:18.355 [3021970320] >TRACE:   clssgmCleanupGrocks: cleaning up grock EVMDMAIN type 2
[    CSSD]2011-04-23 17:13:18.355 [3021970320] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(EVMDMAIN) birth(9/9)
[    CSSD]2011-04-23 17:13:18.355 [3021970320] >TRACE:   clssgmCleanupGrocks: cleaning up grock CRSDMAIN type 2
[    CSSD]2011-04-23 17:13:18.355 [3021970320] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(CRSDMAIN) birth(9/9)
[    CSSD]2011-04-23 17:13:18.355 [3021970320] >TRACE:   clssgmCleanupGrocks: cleaning up grock _ORA_CRS_MEMBER_vrh1 type 3
[    CSSD]2011-04-23 17:13:18.355 [3021970320] >TRACE:   clssgmCleanupGrocks: cleaning up grock _ORA_CRS_MEMBER_vrh2 type 3
[    CSSD]2011-04-23 17:13:18.356 [3021970320] >TRACE:   clssgmCleanupGrocks: cleaning up grock _ORA_CRS_MEMBER_vrh3 type 3
[    CSSD]2011-04-23 17:13:18.356 [3021970320] >TRACE:   clssgmCleanupGrocks: cleaning up grock ocr_crs type 2
[    CSSD]2011-04-23 17:13:18.356 [3021970320] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(ocr_crs) birth(9/9)
[    CSSD]2011-04-23 17:13:18.356 [3021970320] >TRACE:   clssgmCleanupGrocks: cleaning up grock #CSS_CLSSOMON type 2
[    CSSD]2011-04-23 17:13:18.356 [3021970320] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(#CSS_CLSSOMON) birth(9/9)
[    CSSD]2011-04-23 17:13:18.357 [3021970320] >TRACE:   clssgmEstablishConnections: 2 nodes in cluster incarn 10
[    CSSD]2011-04-23 17:13:18.366 [3063929744] >TRACE:   clssgmPeerDeactivate: node 1 (vrh1), death 10, state 0x80000000 connstate 0xa
[    CSSD]2011-04-23 17:13:18.367 [3063929744] >TRACE:   clssgmHandleDBDone(): src/dest (2/65535) size(68) incarn 10
[    CSSD]2011-04-23 17:13:18.367 [3063929744] >TRACE:   clssgmPeerListener: connects done (2/2)
[    CSSD]2011-04-23 17:13:18.369 [3021970320] >TRACE:   clssgmEstablishMasterNode: MASTER for 10 is node(2) birth(2)

Update phase

[    CSSD]CLSS-3000: reconfiguration successful, incarnation 10 with 2 nodes

[    CSSD]CLSS-3001: local node number 3, master node number 2

[    CSSD]2011-04-23 17:13:18.372 [3021970320] >TRACE:   clssgmReconfigThread:  completed for reconfig(10), with status(1)

2. In the other scenario node 1 has not joined the cluster and node 2's network fails; because node 2 has the smaller member number, it evicts node 3 through the voting disk:

The ocssd.log of node 2:

[    CSSD]2011-04-23 17:41:48.643 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 50 8.910601e-269artbeat fatal, eviction in 29.890 seconds
[    CSSD]2011-04-23 17:41:48.643 [3053439888] >TRACE:   clssnmPollingThread: node vrh3 (3) is impending reconfig, flag 1037, misstime 30110
[    CSSD]2011-04-23 17:41:48.643 [3053439888] >TRACE:   clssnmPollingThread: diskTimeout set to (57000)ms impending reconfig status(1)
[    CSSD]2011-04-23 17:41:50.132 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 50 8.910601e-269artbeat fatal, eviction in 28.890 seconds
[    CSSD]2011-04-23 17:42:10.533 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 75 8.910601e-269artbeat fatal, eviction in 14.860 seconds
[    CSSD]2011-04-23 17:42:11.962 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 75 8.910601e-269artbeat fatal, eviction in 13.860 seconds
[    CSSD]2011-04-23 17:42:23.523 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 8.910601e-269artbeat fatal, eviction in 5.840 seconds
[    CSSD]2011-04-23 17:42:24.989 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 8.910601e-269artbeat fatal, eviction in 4.840 seconds
[    CSSD]2011-04-23 17:42:26.423 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 8.910601e-269artbeat fatal, eviction in 3.840 seconds
[    CSSD]2011-04-23 17:42:27.890 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 8.910601e-269artbeat fatal, eviction in 2.840 seconds
[    CSSD]2011-04-23 17:42:29.382 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 8.910601e-269artbeat fatal, eviction in 1.840 seconds
[    CSSD]2011-04-23 17:42:30.832 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 8.910601e-269artbeat fatal, eviction in 0.830 seconds
[    CSSD]2011-04-23 17:42:32.020 [3053439888] >TRACE:   clssnmPollingThread: Eviction started for node vrh3 (3), flags 0x040d, state 3, wt4c 0
[    CSSD]2011-04-23 17:42:32.020 [3032460176] >TRACE:   clssnmDoSyncUpdate: Initiating sync 13
[    CSSD]2011-04-23 17:42:32.020 [3032460176] >TRACE:   clssnmDoSyncUpdate: diskTimeout set to (57000)ms
[    CSSD]2011-04-23 17:42:32.020 [3032460176] >TRACE:   clssnmSetupAckWait: Ack message type (11)
[    CSSD]2011-04-23 17:42:32.020 [3032460176] >TRACE:   clssnmSetupAckWait: node(2) is ALIVE
[    CSSD]2011-04-23 17:42:32.020 [3032460176] >TRACE:   clssnmSendSync: syncSeqNo(13)
[    CSSD]2011-04-23 17:42:32.021 [3032460176] >TRACE:   clssnmWaitForAcks: Ack message type(11), ackCount(1)
[    CSSD]2011-04-23 17:42:32.021 [117574544] >TRACE:   clssnmHandleSync: diskTimeout set to (57000)ms
[    CSSD]2011-04-23 17:42:32.021 [117574544] >TRACE:   clssnmHandleSync: Acknowledging sync: src[2] srcName[vrh2] seq[13] sync[13]
[    CSSD]2011-04-23 17:42:32.021 [3032460176] >TRACE:   clssnmWaitForAcks: done, msg type(11)
[    CSSD]2011-04-23 17:42:32.021 [3032460176] >TRACE:   clssnmDoSyncUpdate: Terminating node 3, vrh3, misstime(60000) state(5)
[    CSSD]2011-04-23 17:42:32.021 [3032460176] >TRACE:   clssnmSetupAckWait: Ack message type (13)
[    CSSD]2011-04-23 17:42:32.021 [3032460176] >TRACE:   clssnmSetupAckWait: node(2) is ACTIVE
[    CSSD]2011-04-23 17:42:32.021 [5073104] >USER:    NMEVENT_SUSPEND [00][00][00][0c]
[    CSSD]2011-04-23 17:42:32.021 [3032460176] >TRACE:   clssnmWaitForAcks: Ack message type(13), ackCount(1)
[    CSSD]2011-04-23 17:42:32.022 [117574544] >TRACE:   clssnmSendVoteInfo: node(2) syncSeqNo(13)
[    CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE:   clssnmWaitForAcks: done, msg type(13)
[    CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE:   clssnmCheckDskInfo: Checking disk info...
[    CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE:   clssnmCheckDskInfo: node 3, vrh3, state 5 with leader 3 has smaller cluster size 1; my cluster size 1 with leader 2

Checking the voting disk reveals that the sub-cluster led by node 3 is the smaller "sub-cluster" (node 3's node number is higher than node 2's); node 2's sub-cluster prevails

[    CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE:   clssnmEvict: Start
[    CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE:   clssnmEvict: Evicting node 3, vrh3, birth 3, death 13, impendingrcfg 1, stateflags 0x40d
[    CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE:   clssnmSendShutdown: req to node 3, kill time 1643084

Node 2 initiates the eviction and the shutdown request against node 3

[    CSSD]2011-04-23 17:42:32.023 [3032460176] >TRACE:   clssnmDiscHelper: vrh3, node(3) connection failed, con (0xb7e79bb0), probe((nil))
[    CSSD]2011-04-23 17:42:32.023 [3032460176] >TRACE:   clssnmWaitOnEvictions: Start
[    CSSD]2011-04-23 17:42:32.023 [3032460176] >TRACE:   clssnmWaitOnEvictions: node 3, vrh3, undead 1
[    CSSD]2011-04-23 17:42:32.023 [3032460176] >TRACE:   clssnmCheckKillStatus: Node 3, vrh3, down, LATS(1642874),timeout(210)
[    CSSD]2011-04-23 17:42:32.023 [3032460176] >TRACE:   clssnmSetupAckWait: Ack message type (15)
[    CSSD]2011-04-23 17:42:32.023 [3032460176] >TRACE:   clssnmSetupAckWait: node(2) is ACTIVE
[    CSSD]2011-04-23 17:42:32.023 [3032460176] >TRACE:   clssnmSendUpdate: syncSeqNo(13)
[    CSSD]2011-04-23 17:42:32.024 [3032460176] >TRACE:   clssnmWaitForAcks: Ack message type(15), ackCount(1)
[    CSSD]2011-04-23 17:42:32.024 [117574544] >TRACE:   clssnmUpdateNodeState: node 0, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new)
[    CSSD]2011-04-23 17:42:32.024 [117574544] >TRACE:   clssnmUpdateNodeState: node 1, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new)
[    CSSD]2011-04-23 17:42:32.024 [117574544] >TRACE:   clssnmUpdateNodeState: node 2, state (3/3) unique (1303591210/1303591210) prevConuni(0) birth (2/2) (old/new)
[    CSSD]2011-04-23 17:42:32.024 [117574544] >TRACE:   clssnmUpdateNodeState: node 3, state (5/0) unique (1303591326/1303591326) prevConuni(1303591326) birth (3/3) (old/new)
[    CSSD]2011-04-23 17:42:32.024 [117574544] >TRACE:   clssnmDeactivateNode: node 3 (vrh3) left cluster

[    CSSD]2011-04-23 17:42:32.024 [117574544] >USER:    clssnmHandleUpdate: SYNC(13) from node(2) completed
[    CSSD]2011-04-23 17:42:32.024 [117574544] >USER:    clssnmHandleUpdate: NODE 2 (vrh2) IS ACTIVE MEMBER OF CLUSTER
[    CSSD]2011-04-23 17:42:32.024 [117574544] >TRACE:   clssnmHandleUpdate: diskTimeout set to (200000)ms
[    CSSD]2011-04-23 17:42:32.024 [3032460176] >TRACE:   clssnmWaitForAcks: done, msg type(15)
[    CSSD]2011-04-23 17:42:32.024 [3032460176] >TRACE:   clssnmDoSyncUpdate: Sync 13 complete!
[    CSSD]2011-04-23 17:42:32.024 [3021970320] >TRACE:   clssgmReconfigThread:  started for reconfig (13)
[    CSSD]2011-04-23 17:42:32.024 [3021970320] >USER:    NMEVENT_RECONFIG [00][00][00][04]
[    CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE:   clssgmCleanupGrocks: cleaning up grock crs_version type 2
[    CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(2) grock(crs_version) birth(3/3)
[    CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE:   clssgmCleanupGrocks: cleaning up grock _ORA_CRS_FAILOVER type 3
[    CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(0) grock(_ORA_CRS_FAILOVER) birth(3/3)
[    CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE:   clssgmCleanupGrocks: cleaning up grock EVMDMAIN type 2
[    CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(3) grock(EVMDMAIN) birth(3/3)
[    CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE:   clssgmCleanupGrocks: cleaning up grock CRSDMAIN type 2
[    CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(3) grock(CRSDMAIN) birth(3/3)
[    CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE:   clssgmCleanupGrocks: cleaning up grock _ORA_CRS_MEMBER_vrh1 type 3
[    CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(0) grock(_ORA_CRS_MEMBER_vrh1) birth(3/3)
[    CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE:   clssgmCleanupGrocks: cleaning up grock _ORA_CRS_MEMBER_vrh3 type 3
[    CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(0) grock(_ORA_CRS_MEMBER_vrh3) birth(3/3)
[    CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE:   clssgmCleanupGrocks: cleaning up grock ocr_crs type 2
[    CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(3) grock(ocr_crs) birth(3/3)
[    CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE:   clssgmCleanupGrocks: cleaning up grock #CSS_CLSSOMON type 2
[    CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(3) grock(#CSS_CLSSOMON) birth(3/3)
[    CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE:   clssgmEstablishConnections: 1 nodes in cluster incarn 13
[    CSSD]2011-04-23 17:42:32.026 [3063929744] >TRACE:   clssgmPeerDeactivate: node 3 (vrh3), death 13, state 0x0 connstate 0xf
[    CSSD]2011-04-23 17:42:32.026 [3063929744] >TRACE:   clssgmPeerListener: connects done (1/1)
[    CSSD]2011-04-23 17:42:32.026 [3021970320] >TRACE:   clssgmEstablishMasterNode: MASTER for 13 is node(2) birth(2)
[    CSSD]2011-04-23 17:42:32.026 [3021970320] >TRACE:   clssgmMasterCMSync: Synchronizing group/lock status
[    CSSD]2011-04-23 17:42:32.026 [3021970320] >TRACE:   clssgmMasterSendDBDone: group/lock status synchronization complete
[    CSSD]CLSS-3000: reconfiguration successful, incarnation 13 with 1 nodes

[    CSSD]CLSS-3001: local node number 2, master node number 2

Reconfiguration completed

[    CSSD]2011-04-23 17:42:32.027 [3021970320] >TRACE:   clssgmReconfigThread:  completed for reconfig(13), with status(1)

The ocssd.log of node 3:

[    CSSD]2011-04-23 17:42:33.204 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 50 1.867300e-268artbeat fatal, eviction in 29.360 seconds
[    CSSD]2011-04-23 17:42:33.204 [3053439888] >TRACE:   clssnmPollingThread: node vrh2 (2) is impending reconfig, flag 1039, misstime 30640
[    CSSD]2011-04-23 17:42:33.204 [3053439888] >TRACE:   clssnmPollingThread: diskTimeout set to (57000)ms impending reconfig status(1)
[    CSSD]2011-04-23 17:42:55.168 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 75 1.867300e-268artbeat fatal, eviction in 14.330 seconds
[    CSSD]2011-04-23 17:43:08.182 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 1.867300e-268artbeat fatal, eviction in 5.310 seconds
[    CSSD]2011-04-23 17:43:09.661 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 1.867300e-268artbeat fatal, eviction in 4.300 seconds
[    CSSD]2011-04-23 17:43:11.144 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 1.867300e-268artbeat fatal, eviction in 3.300 seconds
[    CSSD]2011-04-23 17:43:12.634 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 1.867300e-268artbeat fatal, eviction in 2.300 seconds
[    CSSD]2011-04-23 17:43:14.053 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 1.867300e-268artbeat fatal, eviction in 1.300 seconds
[    CSSD]2011-04-23 17:43:15.467 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 1.867300e-268artbeat fatal, eviction in 0.300 seconds
[    CSSD]2011-04-23 17:43:15.911 [3053439888] >TRACE:   clssnmPollingThread: Eviction started for node vrh2 (2), flags 0x040f, state 3, wt4c 0
[    CSSD]2011-04-23 17:43:15.911 [3032460176] >TRACE:   clssnmDoSyncUpdate: Initiating sync 13
[    CSSD]2011-04-23 17:43:15.911 [3032460176] >TRACE:   clssnmDoSyncUpdate: diskTimeout set to (57000)ms
[    CSSD]2011-04-23 17:43:15.911 [3032460176] >TRACE:   clssnmSetupAckWait: Ack message type (11)
[    CSSD]2011-04-23 17:43:15.911 [3032460176] >TRACE:   clssnmSetupAckWait: node(3) is ALIVE
[    CSSD]2011-04-23 17:43:15.911 [3032460176] >TRACE:   clssnmSendSync: syncSeqNo(13)
[    CSSD]2011-04-23 17:43:15.911 [3032460176] >TRACE:   clssnmWaitForAcks: Ack message type(11), ackCount(1)
[    CSSD]2011-04-23 17:43:15.912 [89033616] >TRACE:   clssnmHandleSync: diskTimeout set to (57000)ms
[    CSSD]2011-04-23 17:43:15.912 [89033616] >TRACE:   clssnmHandleSync: Acknowledging sync: src[3] srcName[vrh3] seq[29] sync[13]
[    CSSD]2011-04-23 17:43:15.912 [8136912] >USER:    NMEVENT_SUSPEND [00][00][00][0c]
[    CSSD]2011-04-23 17:43:15.912 [3032460176] >TRACE:   clssnmWaitForAcks: done, msg type(11)
[    CSSD]2011-04-23 17:43:15.912 [3032460176] >TRACE:   clssnmDoSyncUpdate: Terminating node 2, vrh2, misstime(60010) state(5)
[    CSSD]2011-04-23 17:43:15.912 [3032460176] >TRACE:   clssnmSetupAckWait: Ack message type (13)
[    CSSD]2011-04-23 17:43:15.912 [3032460176] >TRACE:   clssnmSetupAckWait: node(3) is ACTIVE
[    CSSD]2011-04-23 17:43:15.913 [89033616] >TRACE:   clssnmSendVoteInfo: node(3) syncSeqNo(13)
[    CSSD]2011-04-23 17:43:15.912 [3032460176] >TRACE:   clssnmWaitForAcks: Ack message type(13), ackCount(1)
[    CSSD]2011-04-23 17:43:15.913 [3032460176] >TRACE:   clssnmCheckDskInfo: Checking disk info...
[    CSSD]2011-04-23 17:43:15.913 [3032460176] >ERROR:   clssnmCheckDskInfo: Aborting local node to avoid splitbrain.
[    CSSD]2011-04-23 17:43:15.913 [3032460176] >ERROR:   : my node(3), Leader(3), Size(1) VS Node(2), Leader(2), Size(1)

After reading the voting disk, node 3 finds the kill block and aborts itself to avoid a split brain!

[    CSSD]2011-04-23 17:43:15.913 [3032460176] >ERROR:   ###################################
[    CSSD]2011-04-23 17:43:15.913 [3032460176] >ERROR:   clssscExit: CSSD aborting
[    CSSD]2011-04-23 17:43:15.913 [3032460176] >ERROR:   ###################################
[    CSSD]--- DUMP GROCK STATE DB ---
[    CSSD]----------
[    CSSD]  type 2, Id 4, Name = (crs_version)
[    CSSD]  flags: 0x1000
[    CSSD]  grant: count=0, type 0, wait 0
[    CSSD]  Member Count =2, master 2
[    CSSD]   . . . . .
[    CSSD]     memberNo =2, seq 2
[    CSSD]     flags = 0x0, granted 0
[    CSSD]     refCnt = 1
[    CSSD]     nodeNum = 3, nodeBirth 3
[    CSSD]     privateDataSize = 0
[    CSSD]     publicDataSize = 0
[    CSSD]   . . . . .
[    CSSD]     memberNo =1, seq 12
[    CSSD]     flags = 0x1000, granted 0
[    CSSD]     refCnt = 1
[    CSSD]     nodeNum = 2, nodeBirth 2
[    CSSD]     privateDataSize = 0
[    CSSD]     publicDataSize = 0
[    CSSD]----------
[    CSSD]----------
[    CSSD]  type 3, Id 11, Name = (_ORA_CRS_FAILOVER)
[    CSSD]  flags: 0x0
[    CSSD]  grant: count=1, type 3, wait 1
[    CSSD]  Member Count =1, master -3
[    CSSD]   . . . . .
[    CSSD]     memberNo =0, seq 0
[    CSSD]     flags = 0x12, granted 1
[    CSSD]     refCnt = 1
[    CSSD]     nodeNum = 3, nodeBirth 3
[    CSSD]     privateDataSize = 0
[    CSSD]     publicDataSize = 0
[    CSSD]----------
[    CSSD]----------
[    CSSD]  type 2, Id 2, Name = (EVMDMAIN)
[    CSSD]  flags: 0x1000
[    CSSD]  grant: count=0, type 0, wait 0
[    CSSD]  Member Count =2, master 2
[    CSSD]   . . . . .
[    CSSD]     memberNo =2, seq 1
[    CSSD]     flags = 0x0, granted 0
[    CSSD]     refCnt = 1
[    CSSD]     nodeNum = 2, nodeBirth 2
[    CSSD]     privateDataSize = 508
[    CSSD]     publicDataSize = 504
[    CSSD]   . . . . .
[    CSSD]     memberNo =3, seq 2
[    CSSD]     flags = 0x0, granted 0
[    CSSD]     refCnt = 1
[    CSSD]     nodeNum = 3, nodeBirth 3
[    CSSD]     privateDataSize = 508
[    CSSD]     publicDataSize = 504
[    CSSD]----------
[    CSSD]----------
[    CSSD]  type 2, Id 5, Name = (CRSDMAIN)
[    CSSD]  flags: 0x1000
[    CSSD]  grant: count=0, type 0, wait 0
[    CSSD]  Member Count =1, master 3
[    CSSD]   . . . . .
[    CSSD]     memberNo =3, seq 2
[    CSSD]     flags = 0x0, granted 0
[    CSSD]     refCnt = 1
[    CSSD]     nodeNum = 3, nodeBirth 3
[    CSSD]     privateDataSize = 128
[    CSSD]     publicDataSize = 128
[    CSSD]----------
[    CSSD]----------
[    CSSD]  type 3, Id 12, Name = (_ORA_CRS_MEMBER_vrh1)
[    CSSD]  flags: 0x0
[    CSSD]  grant: count=1, type 3, wait 1
[    CSSD]  Member Count =1, master -3
[    CSSD]   . . . . .
[    CSSD]     memberNo =0, seq 0
[    CSSD]     flags = 0x12, granted 1
[    CSSD]     refCnt = 1
[    CSSD]     nodeNum = 3, nodeBirth 3
[    CSSD]     privateDataSize = 0
[    CSSD]     publicDataSize = 0
[    CSSD]----------
[    CSSD]----------
[    CSSD]  type 3, Id 12, Name = (_ORA_CRS_MEMBER_vrh3)
[    CSSD]  flags: 0x0
[    CSSD]  grant: count=1, type 3, wait 1
[    CSSD]  Member Count =1, master -3
[    CSSD]   . . . . .
[    CSSD]     memberNo =0, seq 0
[    CSSD]     flags = 0x12, granted 1
[    CSSD]     refCnt = 1
[    CSSD]     nodeNum = 3, nodeBirth 3
[    CSSD]     privateDataSize = 0
[    CSSD]     publicDataSize = 0
[    CSSD]----------
[    CSSD]----------
[    CSSD]  type 2, Id 3, Name = (ocr_crs)
[    CSSD]  flags: 0x1000
[    CSSD]  grant: count=0, type 0, wait 0
[    CSSD]  Member Count =2, master 3
[    CSSD]   . . . . .
[    CSSD]     memberNo =3, seq 2
[    CSSD]     flags = 0x0, granted 0
[    CSSD]     refCnt = 1
[    CSSD]     nodeNum = 3, nodeBirth 3
[    CSSD]     privateDataSize = 0
[    CSSD]     publicDataSize = 32
[    CSSD]   . . . . .
[    CSSD]     memberNo =2, seq 12
[    CSSD]     flags = 0x1000, granted 0
[    CSSD]     refCnt = 1
[    CSSD]     nodeNum = 2, nodeBirth 2
[    CSSD]     privateDataSize = 0
[    CSSD]     publicDataSize = 32
[    CSSD]----------
[    CSSD]----------
[    CSSD]  type 2, Id 1, Name = (#CSS_CLSSOMON)
[    CSSD]  flags: 0x1000
[    CSSD]  grant: count=0, type 0, wait 0
[    CSSD]  Member Count =2, master 2
[    CSSD]   . . . . .
[    CSSD]     memberNo =2, seq 1
[    CSSD]     flags = 0x1000, granted 0
[    CSSD]     refCnt = 1
[    CSSD]     nodeNum = 2, nodeBirth 2
[    CSSD]     privateDataSize = 0
[    CSSD]     publicDataSize = 0
[    CSSD]   . . . . .
[    CSSD]     memberNo =3, seq 2
[    CSSD]     flags = 0x1000, granted 0
[    CSSD]     refCnt = 1
[    CSSD]     nodeNum = 3, nodeBirth 3
[    CSSD]     privateDataSize = 0
[    CSSD]     publicDataSize = 0
[    CSSD]----------
[    CSSD]--- END OF GROCK STATE DUMP ---
[    CSSD]------- Begin Dump -------


Answers to readers' questions about this article

Question 1:

Some statements in this document do not seem very reliable to me.
For example:
"So the claim that two voting disks are enough for redundancy and that three or more voting disks are unnecessary is wrong. Oracle recommends at least three voting disks in a cluster." In fact it should simply be an odd number, and even a single voting disk works.

"In the split-brain check phase the Reconfig Manager identifies the nodes that have no network heartbeat but still have a disk heartbeat, uses the network heartbeat (where possible) and the disk heartbeat information to count the nodes in every competing sub-cluster, and decides which sub-cluster should survive based on the following factors:
1. The sub-cluster with the largest number of nodes (Sub-cluster with largest number of Nodes)
2. If the sub-clusters are the same size, the sub-cluster with the lowest node number (Sub-cluster with lowest node number); for example, in a 2-node RAC it is always node 1 that wins." ??
Is there an official source for this statement, or is it your own understanding?

If points 1 and 2 hold, then what is the voting disk even for? The winning node could be determined without it.

Answer:
1.
Of course a single voting disk also works, but note that the text says "recommended".

2. Without the votedisk, how would these sub-clusters communicate with each other?

As for "if the sub-clusters are the same size, the sub-cluster with the lowest node number wins (Sub-cluster with lowest node number)": this is described in Oracle's internal documentation and is confirmed by the experiments in this article.

Question 2:
The author has clearly put a lot of effort in, but there are at least two situations you may not have encountered:
1. Split-brain at the storage level, which is quite common in Exadata environments
2. What do you do when the evicted node has no CPU time slice with which to react to the eviction signal?

Answer:
The point of this article is to explain some of the principles behind split-brain resolution; real-world multi-node RAC situations can be far more complicated.

Question:
Could the author explain in more detail:
1. Why must the voting disks be an odd number? Does one voting disk count as one vote, so 3 voting disks count as 3 votes?
2. Under what circumstances does the voting disk take part in the vote?
I see: 1. the sub-cluster with the largest number of nodes wins (Sub-cluster with largest number of Nodes);
2. if the sub-clusters are the same size, the sub-cluster with the lowest node number wins (Sub-cluster with lowest node number); for example, in a 2-node RAC it is always node 1 that wins.
If points 1 and 2 above hold, when does the voting disk ever take part in the vote? Could you give an example? Thanks.

Answer:

1.
"To keep the disk heartbeat and kill-block reading working at all times, CSS requires that at least (N/2+1) of the voting disks be accessible from each node, which guarantees that any two nodes always share at least one voting disk that both of them can access."

It was never said that the number must be odd; an odd number is only recommended.

Why an odd number? For example, with 3 votedisks a node only needs to access 2 of them to be fine, whereas with only 2 votedisks in total a node must be able to access every voting disk.

Is 4 OK? Of course, 4 also works.

2.
"In the split-brain check phase the Reconfig Manager identifies the nodes that have no network heartbeat but still have a disk heartbeat, uses the network heartbeat (where possible) and the disk heartbeat information to count the nodes in every competing sub-cluster, and decides which sub-cluster should survive based on those factors"

See also the logs in the experiment section of this article, for example:

[ CSSD]2011-04-23 17:13:18.337 [3032460176] >TRACE: clssnmCheckDskInfo: node 1, vrh1, state 5 with leader 1 has smaller cluster size 1; my cluster size 2 with leader 2

3. Scenario 2 of the experiment illustrates this point:

"In the other scenario node 1 has not joined the cluster and node 2's network fails; because node 2 has the smaller member number it evicts node 3 through the voting disk"

"[ CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE: clssnmCheckDskInfo: node 3, vrh3, state 5 with leader 3 has smaller cluster size 1; my cluster size 1 with leader 2"

If this is still not clear, read the following passage:

The RM (Reconfig Manager) sends a sync message to all participating nodes. Participating nodes respond with a sync acknowledgement. After this the vote phase begins and the master sends a vote message to all participating nodes. Participating nodes respond with a vote info message containing their node identifier and GM peer to peer listening endpoint. In the split-check phase, the RM uses the voting disk to verify there is no split-brain. It finds nodes heartbeating to disk that are not connected via the network. If it finds these, it will determine which nodes are talking to which and the largest subcluster survives. For example, if we have a 5 node cluster and all of the nodes are heartbeating to the voting disk but only a group of 3 can communicate via the network and a group of 2 can communicate via the network, this means we have 2 subclusters. The largest subcluster (3) would survive while the other subcluster (2) would not. After this the evict phase would evict nodes previously in the cluster but not considered members in this incarnation. In this case we would send a message to evicted nodes (if possible) and write eviction notice to a 'kill' block in the voting file. We would wait for the node to indicate it got the eviction notice (wait for seconds). The wait is terminated by a message or status on the voting file indicating that the node got the eviction notice. In the update phase the master sends an update message containing the definitive cluster membership and node information for all participating nodes. The participating nodes send update acknowledgements. All members queue the reconfiguration event to their GM.

As far as voting disks are concerned, a node must be able to access strictly more than half of the voting disks at any time. So if you want to be able to tolerate a failure of n voting disks, you must have at least 2n+1 configured. (n=1 means 3 voting disks). You can configure up to 32 voting disks, providing protection against 15 simultaneous disk failures.
Oracle recommends that customers use 3 or more voting disks in Oracle RAC 10g Release 2. Note: For best availability, the 3 voting files should be physically separate disks. It is recommended to use an odd number as 4 disks will not be any more highly available than 3 disks, 1/2 of 3 is 1.5…rounded to 2, 1/2 of 4 is 2, once we lose 2 disks, our cluster will fail with both 4 voting disks or 3 voting disks.


Question:
I have an interesting scenario.
A 3-node RAC with 3 voting disks, where each node can write to two voting disks, and node 1 is the master node:
Node 1 lost access to vote2 and can only access vote1 and vote3
Node 2 lost access to vote3 and can only access vote2 and vote1
Node 3 lost access to vote1 and can only access vote2 and vote3
How does brain split take place? Is this supposition even possible?
Thanks Maclean in advance!


Answer:

Problems with access to a single vote disk will result in a Warning as opposed to an Error as long as there is still a majority of vote disks accessible from each node, so the warning is non-fatal, and in this case the eviction protocol is not invoked.



Supplement:

More on RAC Brain Split

Over the last couple of days, while interviewing DBA candidates, I asked about the concepts behind brain split resolution in Oracle RAC. Almost every candidate told me that "when there are only 2 nodes, the voting algorithm no longer works; the 2 nodes race to grab the quorum disk and whoever gets it first survives". Let us call this theory the "grab theory" for short.

The specific claims of the "grab theory" are usually more or less the following passage:

 

"In a cluster, the nodes learn about each other's health through some mechanism (the heartbeat) so that they can work together in a coordinated way. Suppose only the heartbeat fails while every node keeps running; each node then believes that the other nodes are down and that it is the sole survivor that should take control of the whole cluster. In a cluster environment the storage is shared, so this means a data disaster; this situation is the split-brain."
The usual way to solve this problem is a voting algorithm (Quorum Algorithm), which works roughly as follows:

Viewpoint 1:

The nodes in the cluster report their health to each other through the heartbeat mechanism; assume every report received counts as one vote. In a three-node cluster, during normal operation every node has 3 votes. When node A's heartbeat fails but node A is still running, the cluster splits into 2 partitions: node A in one, and the remaining 2 nodes in the other. One partition has to be removed to keep the cluster healthy. For a 3-node cluster, after A's heartbeat fails, B and C form one partition with 2 votes while A has only 1 vote. By the voting algorithm, the cluster formed by B and C gains control and A is evicted.

 

 

Viewpoint 2:

If there are only 2 nodes, the voting algorithm breaks down, because each node has exactly 1 vote. A third device therefore has to be introduced: the Quorum Device. The quorum device is usually a shared disk, also called the quorum disk, and it too represents one vote. When the heartbeat between the 2 nodes fails, both nodes race for the quorum disk's vote, and the earliest request is satisfied first. The node that obtains the quorum disk first therefore ends up with 2 votes, and the other node is evicted.

 

 

Viewpoint 1 in the passage above is essentially the same as the position I put forward in the earlier article. Here is my description again:

In the split-brain check phase the Reconfig Manager identifies the nodes that have no network heartbeat but still have a disk heartbeat, uses the network heartbeat (where possible) and the disk heartbeat information to count the nodes in every competing sub-cluster, and decides which sub-cluster should survive based on the following factors:

  1. The sub-cluster with the largest number of nodes wins (Sub-cluster with largest number of Nodes)
  2. If the sub-clusters are the same size, the sub-cluster with the lowest node number wins (Sub-cluster with lowest node number); for example, in a 2-node RAC it is always node 1 that wins.

Supplement: my earlier introduction of the sub-cluster concept:

"When resolving a split-brain, NM also monitors the voting disk to learn about the other competing sub-clusters. The notion of a sub-cluster deserves a short introduction. Imagine an environment with a very large number of nodes; take the 128-node cluster Oracle has officially built as our thought experiment. When a network failure occurs there are several possibilities. One is a global network failure, where none of the 128 nodes can exchange network heartbeats with any other, producing as many as 128 isolated single-node "islands". Another is a partial network failure, where the 128 nodes are split into several partitions, each containing more than one node; these partitions are what we call sub-clusters. When a network failure occurs, the nodes inside a sub-cluster can still exchange vote messages (vote mesg) with each other, but the sub-clusters and the isolated nodes can no longer communicate over the regular interconnect, and this is where NM reconfiguration needs the voting disk."

 

The disagreement is mainly this: the "grab theory" holds that with only 2 nodes it is the outcome of grabbing the votedisk that decides which node survives, and the "grab theory" says nothing about the case where several sub-clusters have the same number of nodes (for example a 4-node RAC in which nodes 1 and 2 form one sub-cluster and nodes 3 and 4 form another); following the 2-node reasoning, that case would again have to be decided by the sub-clusters racing for the votedisk.

 

Personally I consider this view (the "grab theory") to be wrong; both the ocssd.log written by css, the key CRS process, during an actual split-brain and Oracle's official internal documents demonstrate this.

 

Let us look at a scenario in 10.2 RAC. Suppose the cluster has 3 nodes, instance 1 has not been started, so there are only 2 active nodes, and node 2 then suffers a network failure. Because node 2 has the smaller member number, it evicts node 3 through the voting disk. The detailed log is as follows:

The relevant log lines below clearly show the NM (Node Monitor) service checking the votedisk information and working out which sub-cluster has the smaller cluster size

The ocssd.log of node 2:

[    CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE: clssnmCheckDskInfo: node 3, vrh3, state 5 with leader 3 has smaller cluster size 1; my cluster size 1 with leader 2

Checking the voting disk reveals that the sub-cluster led by node 3 is the smaller "sub-cluster" (node 3's node number is higher than node 2's); node 2's sub-cluster prevails

[    CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE:   clssnmEvict: Start
[    CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE:   clssnmEvict: Evicting node 3, vrh3, birth 3, death 13, impendingrcfg 1, stateflags 0x40d
[    CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE:   clssnmSendShutdown: req to node 3, kill time 1643084

Node 2 initiates the eviction and the shutdown request against node 3

The ocssd.log of node 3:
[    CSSD]2011-04-23 17:43:15.913 [3032460176] >ERROR:   clssnmCheckDskInfo: Aborting local node to avoid splitbrain.
[    CSSD]2011-04-23 17:43:15.913 [3032460176] >ERROR:   : my node(3), Leader(3), Size(1) VS Node(2), Leader(2), Size(1)

After reading the voting disk, node 3 finds the kill block and aborts itself to avoid a split brain!

 

 

In addition, several official notes on Metalink explicitly state the point made above; excerpts:

 

1.
When interconnect breaks – keeps the largest cluster possible up, other nodes will be evicted, in 2 node cluster lowest number node remains. 

2.
Node eviction: pick a cluster node as victim to reboot. Always keep the largest cluster possible up and evict the other nodes; with two nodes, keep the lowest numbered node up and evict the other.

 

It is true that some vendor Unix clusterware products may indeed resolve split-brain by letting whoever grabs the "quorum disk" first win, but the clusterware, or CRS (Cluster Ready Services), shipped with Oracle's own Real Application Clusters (RAC) since 10g does not behave that way during Brain Split Resolution; reasoning by analogy from those products will not lead to the correct conclusion here.



Question:


Having read the author's analysis and combined it with a situation I ran into at a customer site over the last two days, I believe the "grab theory" does exist in practice. The customer environment is a 2-node RAC in which one node's network heartbeat failed. Testing showed that no matter which of NODE1 or NODE2 was started first, the other node would reboot when it was started. So the author's claim that the node with the smaller member number evicts the one with the larger number was not confirmed here.



Answer:


Why is the "grab theory" wrong?

1. The grab theory has no theoretical basis: not a single official document describes it, while plenty of internal documents explicitly state "Sub-cluster with lowest node number".

2. The grab theory has no experimental basis: the cssd.log entries in the scenarios listed in this article clearly show the algorithm comparing sub-cluster sizes.

The test behaviour you describe does not prove anything; for that matter, your own test result does not support the "grab theory" either. If you want to use your test to explain the issue, please post the clssnmCheckDskInfo-related entries from both nodes at the time of the brain split.


Question:


[ CSSD]2011-10-30 01:15:33.879 >USER: Oracle Database 10g CSS Release 10.2.0.3.0 Production Copyright 1996, 2004 Oracle. All rights reserved.
[ CSSD]2011-10-30 01:15:33.879 >USER: CSS daemon log for node hdmis2, number 2, in cluster crs
[ CSSD]2011-10-30 01:15:33.885 [1] >TRACE: clssscmain: local-only set to false
[ CSSD]2011-10-30 01:15:33.893 [1] >TRACE: clssnmReadNodeInfo: added node 1 (hdmis1) to cluster
[ clsdmt]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=hdmis2DBG_CSSD))
[ CSSD]2011-10-30 01:15:33.899 [1] >TRACE: clssnmReadNodeInfo: added node 2 (hdmis2) to cluster
[ CSSD]2011-10-30 01:15:33.904 [1029] >TRACE: clssnm_skgxnmon: skgxn init failed
[ CSSD]2011-10-30 01:15:33.904 [1] >TRACE: clssnm_skgxnonline: Using vacuous skgxn monitor
[ CSSD]2011-10-30 01:15:33.906 [1] >TRACE: clssnmNMInitialize: misscount set to (30), impending reconfig threshold set to (26000)
[ CSSD]2011-10-30 01:15:33.907 [1] >TRACE: clssnmNMInitialize: diskShortTimeout set to (27000)ms
[ CSSD]2011-10-30 01:15:33.908 [1] >TRACE: clssnmNMInitialize: diskLongTimeout set to (200000)ms
[ CSSD]2011-10-30 01:15:33.912 [1] >TRACE: clssnmDiskStateChange: state from 1 to 2 disk (0//dev/rhdisk12)
[ CSSD]2011-10-30 01:15:33.912 [1030] >TRACE: clssnmvDPT: spawned for disk 0 (/dev/rhdisk12)
[ CSSD]2011-10-30 01:15:33.914 [1] >TRACE: clssnmDiskStateChange: state from 1 to 2 disk (1//dev/rhdisk13)
[ CSSD]2011-10-30 01:15:33.914 [1287] >TRACE: clssnmvDPT: spawned for disk 1 (/dev/rhdisk13)
[ CSSD]2011-10-30 01:15:33.916 [1] >TRACE: clssnmDiskStateChange: state from 1 to 2 disk (2//dev/rhdisk14)
[ CSSD]2011-10-30 01:15:33.916 [1544] >TRACE: clssnmvDPT: spawned for disk 2 (/dev/rhdisk14)
[ CSSD]2011-10-30 01:15:35.932 [1544] >TRACE: clssnmDiskStateChange: state from 2 to 4 disk (2//dev/rhdisk14)
[ CSSD]2011-10-30 01:15:35.935 [1287] >TRACE: clssnmDiskStateChange: state from 2 to 4 disk (1//dev/rhdisk13)
[ CSSD]2011-10-30 01:15:35.935 [1801] >TRACE: clssnmvKillBlockThread: spawned for disk 2 (/dev/rhdisk14) initial sleep interval (1000)ms
[ CSSD]2011-10-30 01:15:35.935 [1030] >TRACE: clssnmDiskStateChange: state from 2 to 4 disk (0//dev/rhdisk12)
[ CSSD]2011-10-30 01:15:35.935 [2058] >TRACE: clssnmvKillBlockThread: spawned for disk 1 (/dev/rhdisk13) initial sleep interval (1000)ms
[ CSSD]2011-10-30 01:15:35.935 [2315] >TRACE: clssnmvKillBlockThread: spawned for disk 0 (/dev/rhdisk12) initial sleep interval (1000)ms
[ CSSD]2011-10-30 01:15:35.935 [1] >TRACE: clssnmFatalInit: fatal mode enabled
[ CSSD]2011-10-30 01:15:35.943 [2829] >TRACE: clssnmconnect: connecting to node 2, flags 0x0001, connector 1
[ CSSD]2011-10-30 01:15:35.952 [2829] >TRACE: clssnmClusterListener: Listening on (ADDRESS=(PROTOCOL=tcp)(HOST=hdmis2_priv)(PORT=49895))

[ CSSD]2011-10-30 01:15:35.952 [2829] >TRACE: clssnmconnect: connecting to node 0, flags 0x0000, connector 1
[ CSSD]2011-10-30 01:15:35.952 [2829] >TRACE: clssnmconnect: connecting to node 1, flags 0x0001, connector 0
[ CSSD]2011-10-30 01:15:35.994 [3086] >TRACE: clssgmclientlsnr: listening on (ADDRESS=(PROTOCOL=ipc)(KEY=Oracle_CSS_LclLstnr_crs_2))
[ CSSD]2011-10-30 01:15:35.994 [3086] >TRACE: clssgmclientlsnr: listening on (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_hdmis2_crs))
[ CSSD]2011-10-30 01:15:36.003 [3857] >TRACE: clssgmPeerListener: Listening on (ADDRESS=(PROTOCOL=tcp)(DEV=25)(HOST=10.1.1.2)(PORT=32902))
[ CSSD]2011-10-30 01:15:35.935 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(10) LATS(4148068) Disk lastSeqNo(10)
[ CSSD]2011-10-30 01:15:36.135 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(10) LATS(4148268) Disk lastSeqNo(10)
[ CSSD]2011-10-30 01:15:36.136 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(10) LATS(4148268) Disk lastSeqNo(10)
[ CSSD]2011-10-30 01:15:37.138 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(11) LATS(4149271) Disk lastSeqNo(11)
[ CSSD]2011-10-30 01:15:37.139 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(11) LATS(4149271) Disk lastSeqNo(11)
[ CSSD]2011-10-30 01:15:37.139 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(11) LATS(4149271) Disk lastSeqNo(11)
[ CSSD]2011-10-30 01:15:38.145 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(12) LATS(4150278) Disk lastSeqNo(12)
[ CSSD]2011-10-30 01:15:38.149 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(12) LATS(4150281) Disk lastSeqNo(12)
[ CSSD]2011-10-30 01:15:38.149 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(12) LATS(4150281) Disk lastSeqNo(12)
[ CSSD]2011-10-30 01:15:39.148 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(13) LATS(4151281) Disk lastSeqNo(13)
[ CSSD]2011-10-30 01:15:39.159 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(13) LATS(4151291) Disk lastSeqNo(13)
[ CSSD]2011-10-30 01:15:39.159 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(13) LATS(4151291) Disk lastSeqNo(13)
[ CSSD]2011-10-30 01:15:40.156 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(14) LATS(4152288) Disk lastSeqNo(14)
[ CSSD]2011-10-30 01:15:40.160 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(14) LATS(4152292) Disk lastSeqNo(14)
[ CSSD]2011-10-30 01:15:40.169 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(14) LATS(4152301) Disk lastSeqNo(14)
[ CSSD]2011-10-30 01:15:41.158 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(15) LATS(4153291) Disk lastSeqNo(15)
[ CSSD]2011-10-30 01:15:41.160 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(15) LATS(4153292) Disk lastSeqNo(15)
[ CSSD]2011-10-30 01:15:41.179 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(15) LATS(4153311) Disk lastSeqNo(15)
[ CSSD]2011-10-30 01:15:42.160 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(16) LATS(4154293) Disk lastSeqNo(16)
[ CSSD]2011-10-30 01:15:42.166 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(16) LATS(4154299) Disk lastSeqNo(16)
[ CSSD]2011-10-30 01:15:42.189 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(16) LATS(4154321) Disk lastSeqNo(16)
[ CSSD]2011-10-30 01:15:43.161 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(17) LATS(4155293) Disk lastSeqNo(17)
[ CSSD]2011-10-30 01:15:43.177 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(17) LATS(4155309) Disk lastSeqNo(17)
[ CSSD]2011-10-30 01:15:43.199 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(17) LATS(4155331) Disk lastSeqNo(17)
[ CSSD]2011-10-30 01:15:44.161 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(18) LATS(4156294) Disk lastSeqNo(18)
[ CSSD]2011-10-30 01:15:44.178 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(18) LATS(4156311) Disk lastSeqNo(18)
[ CSSD]2011-10-30 01:15:44.209 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(18) LATS(4156341) Disk lastSeqNo(18)
[ CSSD]2011-10-30 01:15:45.162 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(19) LATS(4157294) Disk lastSeqNo(19)
[ CSSD]2011-10-30 01:15:45.187 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(19) LATS(4157319) Disk lastSeqNo(19)
[ CSSD]2011-10-30 01:15:45.219 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(19) LATS(4157351) Disk lastSeqNo(19)
[ CSSD]2011-10-30 01:15:46.162 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(20) LATS(4158294) Disk lastSeqNo(20)
[ CSSD]2011-10-30 01:15:46.196 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(20) LATS(4158328) Disk lastSeqNo(20)
[ CSSD]2011-10-30 01:15:46.229 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(20) LATS(4158361) Disk lastSeqNo(20)
[ CSSD]2011-10-30 01:15:47.162 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(21) LATS(4159295) Disk lastSeqNo(21)
[ CSSD]2011-10-30 01:15:47.265 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(21) LATS(4159397) Disk lastSeqNo(21)
[ CSSD]2011-10-30 01:15:47.267 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(21) LATS(4159399) Disk lastSeqNo(21)
[ CSSD]2011-10-30 01:15:48.163 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(22) LATS(4160295) Disk lastSeqNo(22)
[ CSSD]2011-10-30 01:15:48.268 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(22) LATS(4160400) Disk lastSeqNo(22)
[ CSSD]2011-10-30 01:15:48.269 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(22) LATS(4160401) Disk lastSeqNo(22)
[ CSSD]2011-10-30 01:15:49.163 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(23) LATS(4161296) Disk lastSeqNo(23)
[ CSSD]2011-10-30 01:15:49.278 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(23) LATS(4161410) Disk lastSeqNo(23)
[ CSSD]2011-10-30 01:15:49.279 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(23) LATS(4161411) Disk lastSeqNo(23)
[ CSSD]2011-10-30 01:15:50.164 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(24) LATS(4162296) Disk lastSeqNo(24)
[ CSSD]2011-10-30 01:15:50.288 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(24) LATS(4162420) Disk lastSeqNo(24)
[ CSSD]2011-10-30 01:15:50.289 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(24) LATS(4162421) Disk lastSeqNo(24)
[ CSSD]2011-10-30 01:15:51.164 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(25) LATS(4163297) Disk lastSeqNo(25)
[ CSSD]2011-10-30 01:15:51.289 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(25) LATS(4163422) Disk lastSeqNo(25)
[ CSSD]2011-10-30 01:15:51.296 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(25) LATS(4163428) Disk lastSeqNo(25)
[ CSSD]2011-10-30 01:15:52.165 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(26) LATS(4164297) Disk lastSeqNo(26)
[ CSSD]2011-10-30 01:15:52.290 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(26) LATS(4164422) Disk lastSeqNo(26)
[ CSSD]2011-10-30 01:15:52.296 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(26) LATS(4164429) Disk lastSeqNo(26)
[ CSSD]2011-10-30 01:15:53.165 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(27) LATS(4165297) Disk lastSeqNo(27)
[ CSSD]2011-10-30 01:15:53.290 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(27) LATS(4165422) Disk lastSeqNo(27)
[ CSSD]2011-10-30 01:15:53.297 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(27) LATS(4165429) Disk lastSeqNo(27)
[ CSSD]2011-10-30 01:15:54.165 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(28) LATS(4166298) Disk lastSeqNo(28)
[ CSSD]2011-10-30 01:15:54.290 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(28) LATS(4166423) Disk lastSeqNo(28)
[ CSSD]2011-10-30 01:15:54.297 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(28) LATS(4166430) Disk lastSeqNo(28)
[ CSSD]2011-10-30 01:15:55.166 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(29) LATS(4167298) Disk lastSeqNo(29)
[ CSSD]2011-10-30 01:15:55.291 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(29) LATS(4167423) Disk lastSeqNo(29)
[ CSSD]2011-10-30 01:15:55.298 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(29) LATS(4167430) Disk lastSeqNo(29)
[ CSSD]2011-10-30 01:15:56.166 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(30) LATS(4168299) Disk lastSeqNo(30)
[ CSSD]2011-10-30 01:15:56.291 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(30) LATS(4168424) Disk lastSeqNo(30)
[ CSSD]2011-10-30 01:15:56.298 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(30) LATS(4168431) Disk lastSeqNo(30)
[ CSSD]2011-10-30 01:15:57.167 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(31) LATS(4169299) Disk lastSeqNo(31)
[ CSSD]2011-10-30 01:15:57.292 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(31) LATS(4169424) Disk lastSeqNo(31)
[ CSSD]2011-10-30 01:15:57.299 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(31) LATS(4169431) Disk lastSeqNo(31)
[ CSSD]2011-10-30 01:15:58.167 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(32) LATS(4170299) Disk lastSeqNo(32)
[ CSSD]2011-10-30 01:15:58.292 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(32) LATS(4170424) Disk lastSeqNo(32)
[ CSSD]2011-10-30 01:15:58.299 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(32) LATS(4170431) Disk lastSeqNo(32)
[ CSSD]2011-10-30 01:15:59.168 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(33) LATS(4171300) Disk lastSeqNo(33)
[ CSSD]2011-10-30 01:15:59.292 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(33) LATS(4171425) Disk lastSeqNo(33)
[ CSSD]2011-10-30 01:15:59.299 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(33) LATS(4171432) Disk lastSeqNo(33)
[ CSSD]2011-10-30 01:16:00.168 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(34) LATS(4172300) Disk lastSeqNo(34)
[ CSSD]2011-10-30 01:16:00.293 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(34) LATS(4172425) Disk lastSeqNo(34)
[ CSSD]2011-10-30 01:16:00.300 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(34) LATS(4172432) Disk lastSeqNo(34)
[ CSSD]2011-10-30 01:16:01.168 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(35) LATS(4173301) Disk lastSeqNo(35)
[ CSSD]2011-10-30 01:16:01.293 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(35) LATS(4173426) Disk lastSeqNo(35)
[ CSSD]2011-10-30 01:16:01.300 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(35) LATS(4173433) Disk lastSeqNo(35)
[ CSSD]2011-10-30 01:16:02.169 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(36) LATS(4174301) Disk lastSeqNo(36)
[ CSSD]2011-10-30 01:16:02.294 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(36) LATS(4174426) Disk lastSeqNo(36)
[ CSSD]2011-10-30 01:16:02.301 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(36) LATS(4174433) Disk lastSeqNo(36)
[ CSSD]2011-10-30 01:16:03.169 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(37) LATS(4175302) Disk lastSeqNo(37)
[ CSSD]2011-10-30 01:16:03.294 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(37) LATS(4175426) Disk lastSeqNo(37)
[ CSSD]2011-10-30 01:16:03.301 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(37) LATS(4175434) Disk lastSeqNo(37)
[ CSSD]2011-10-30 01:16:04.170 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(38) LATS(4176302) Disk lastSeqNo(38)
[ CSSD]2011-10-30 01:16:04.295 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(38) LATS(4176427) Disk lastSeqNo(38)
[ CSSD]2011-10-30 01:16:04.302 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(38) LATS(4176434) Disk lastSeqNo(38)
[ CSSD]2011-10-30 01:16:05.170 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(39) LATS(4177302) Disk lastSeqNo(39)
[ CSSD]2011-10-30 01:16:05.295 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(39) LATS(4177427) Disk lastSeqNo(39)
[ CSSD]2011-10-30 01:16:05.302 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(39) LATS(4177434) Disk lastSeqNo(39)
[ CSSD]2011-10-30 01:16:06.171 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(40) LATS(4178303) Disk lastSeqNo(40)
[ CSSD]2011-10-30 01:16:06.295 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(40) LATS(4178428) Disk lastSeqNo(40)
[ CSSD]2011-10-30 01:16:06.303 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(40) LATS(4178435) Disk lastSeqNo(40)
[ CSSD]2011-10-30 01:16:07.171 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(41) LATS(4179303) Disk lastSeqNo(41)
[ CSSD]2011-10-30 01:16:07.296 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(41) LATS(4179428) Disk lastSeqNo(41)
[ CSSD]2011-10-30 01:16:07.303 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(41) LATS(4179435) Disk lastSeqNo(41)
[ CSSD]2011-10-30 01:16:08.171 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(42) LATS(4180304) Disk lastSeqNo(42)
[ CSSD]2011-10-30 01:16:08.296 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(42) LATS(4180429) Disk lastSeqNo(42)
[ CSSD]2011-10-30 01:16:08.303 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(42) LATS(4180436) Disk lastSeqNo(42)
[ CSSD]2011-10-30 01:16:09.172 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(43) LATS(4181304) Disk lastSeqNo(43)
[ CSSD]2011-10-30 01:16:09.297 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(43) LATS(4181429) Disk lastSeqNo(43)
[ CSSD]2011-10-30 01:16:09.304 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(43) LATS(4181436) Disk lastSeqNo(43)
[ CSSD]2011-10-30 01:16:10.172 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(44) LATS(4182305) Disk lastSeqNo(44)
[ CSSD]2011-10-30 01:16:10.297 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(44) LATS(4182429) Disk lastSeqNo(44)
[ CSSD]2011-10-30 01:16:10.304 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(44) LATS(4182437) Disk lastSeqNo(44)
[ CSSD]2011-10-30 01:16:11.173 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(45) LATS(4183305) Disk lastSeqNo(45)
[ CSSD]2011-10-30 01:16:11.297 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(45) LATS(4183430) Disk lastSeqNo(45)
[ CSSD]2011-10-30 01:16:11.305 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(45) LATS(4183437) Disk lastSeqNo(45)
[ CSSD]2011-10-30 01:16:12.173 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(46) LATS(4184305) Disk lastSeqNo(46)
[ CSSD]2011-10-30 01:16:12.298 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(46) LATS(4184430) Disk lastSeqNo(46)
[ CSSD]2011-10-30 01:16:12.305 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(46) LATS(4184437) Disk lastSeqNo(46)
[ CSSD]2011-10-30 01:16:13.173 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(47) LATS(4185306) Disk lastSeqNo(47)
[ CSSD]2011-10-30 01:16:13.298 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(47) LATS(4185431) Disk lastSeqNo(47)
[ CSSD]2011-10-30 01:16:13.306 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(47) LATS(4185438) Disk lastSeqNo(47)
[ CSSD]2011-10-30 01:16:14.174 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(48) LATS(4186306) Disk lastSeqNo(48)
[ CSSD]2011-10-30 01:16:14.299 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(48) LATS(4186431) Disk lastSeqNo(48)
[ CSSD]2011-10-30 01:16:14.306 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(48) LATS(4186438) Disk lastSeqNo(48)
[ CSSD]2011-10-30 01:16:15.174 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(49) LATS(4187307) Disk lastSeqNo(49)
[ CSSD]2011-10-30 01:16:15.299 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(49) LATS(4187431) Disk lastSeqNo(49)
[ CSSD]2011-10-30 01:16:15.306 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(49) LATS(4187439) Disk lastSeqNo(49)
[ CSSD]2011-10-30 01:16:16.175 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(50) LATS(4188307) Disk lastSeqNo(50)
[ CSSD]2011-10-30 01:16:16.299 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(50) LATS(4188432) Disk lastSeqNo(50)
[ CSSD]2011-10-30 01:16:16.307 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(50) LATS(4188439) Disk lastSeqNo(50)
[ CSSD]2011-10-30 01:16:17.175 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(51) LATS(4189308) Disk lastSeqNo(51)
[ CSSD]2011-10-30 01:16:17.300 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(51) LATS(4189432) Disk lastSeqNo(51)
[ CSSD]2011-10-30 01:16:17.307 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(51) LATS(4189440) Disk lastSeqNo(51)
[ CSSD]2011-10-30 01:16:18.176 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(52) LATS(4190308) Disk lastSeqNo(52)
[ CSSD]2011-10-30 01:16:18.300 [1030] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(52) LATS(4190433) Disk lastSeqNo(52)
[ CSSD]2011-10-30 01:16:18.308 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(52) LATS(4190440) Disk lastSeqNo(52)
[ CSSD]2011-10-30 01:16:19.176 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(53) LATS(4191308) Disk lastSeqNo(53)


Answer:


Here is my analysis of this log excerpt:

[ CSSD]2011-10-30 01:15:33.906 [1] >TRACE: clssnmNMInitialize: misscount set to (30), impending reconfig threshold set to (26000)
[ CSSD]2011-10-30 01:15:33.907 [1] >TRACE: clssnmNMInitialize: diskShortTimeout set to (27000)ms
[ CSSD]2011-10-30 01:15:33.908 [1] >TRACE: clssnmNMInitialize: diskLongTimeout set to (200000)ms

———— the 30-second misscount timer starts counting here

[ CSSD]2011-10-30 01:15:33.907 [1] >TRACE: clssnmNMInitialize: diskShortTimeout set to (27000)ms
[ CSSD]2011-10-30 01:15:33.908 [1] >TRACE: clssnmNMInitialize: diskLongTimeout set to (200000)ms

The disk heartbeat timeout limits are initialized here.
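A side note on the two thresholds printed above: they appear to correspond to the two disk-I/O timeouts CSS applies depending on cluster state, the short value (roughly misscount minus a small reboot margin) while a reconfiguration is in progress and the long value (equal to disktimeout) in steady state. The small Python sketch below only restates that rule with the values from this log; the constant names and the helper function are my own illustration, not an Oracle API.

# Values taken from the clssnmNMInitialize lines above; the rule itself is a
# hedged reading of how the short/long disk timeouts are used, not Oracle code.
MISSCOUNT_S        = 30      # network heartbeat timeout
DISK_SHORT_TIMEOUT = 27.0    # seconds, applies while a reconfiguration is in progress
DISK_LONG_TIMEOUT  = 200.0   # seconds (disktimeout), applies during normal operation

def disk_io_timeout(reconfig_in_progress: bool) -> float:
    # Pick the disk heartbeat I/O timeout that applies in the current state.
    return DISK_SHORT_TIMEOUT if reconfig_in_progress else DISK_LONG_TIMEOUT

print(disk_io_timeout(True))   # 27.0  -> during reconfiguration
print(disk_io_timeout(False))  # 200.0 -> steady state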

[ CSSD]2011-10-30 01:15:35.935 [1801] >TRACE: clssnmvKillBlockThread: spawned for disk 2 (/dev/rhdisk14) initial sleep interval (1000)ms
[ CSSD]2011-10-30 01:15:35.935 [1030] >TRACE: clssnmDiskStateChange: state from 2 to 4 disk (0//dev/rhdisk12)
[ CSSD]2011-10-30 01:15:35.935 [2058] >TRACE: clssnmvKillBlockThread: spawned for disk 1 (/dev/rhdisk13) initial sleep interval (1000)ms
[ CSSD]2011-10-30 01:15:35.935 [2315] >TRACE: clssnmvKillBlockThread: spawned for disk 0 (/dev/rhdisk12) initial sleep interval (1000)ms

The KillBlockThread is initialized; it is responsible for reading the "kill block".
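For readers who have not met the "kill block" before, the threads spawned above behave roughly like the following loop: re-read a dedicated block on each voting disk about once per second, and if that block says the local node has been evicted, the node reboots itself. The Python below is only a schematic sketch of that behaviour; the read_kill_block helper and the simulated reads are invented for illustration and say nothing about the real on-disk format.

import itertools
import time

# Hypothetical stand-in for reading the on-disk kill block: the simulation just
# answers "not evicted" a few times and then "evicted", so the loop is runnable.
_simulated_reads = itertools.chain([False] * 5, itertools.repeat(True))

def read_kill_block(voting_disk: str, node_number: int) -> bool:
    return next(_simulated_reads)

def kill_block_monitor(voting_disks, node_number, interval_s=1.0):
    # Schematic version of what clssnmvKillBlockThread is described as doing:
    # poll the kill block on every voting disk once per interval and self-evict
    # (in the real cluster: reboot) as soon as any copy marks this node evicted.
    while True:
        for disk in voting_disks:
            if read_kill_block(disk, node_number):
                print(f"kill block set on {disk}: node {node_number} would reboot now")
                return
        time.sleep(interval_s)

kill_block_monitor(["/dev/rhdisk12", "/dev/rhdisk13", "/dev/rhdisk14"],
                   node_number=2, interval_s=0.1)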

[ CSSD]2011-10-30 01:15:36.003 [3857] >TRACE: clssgmPeerListener: Listening on (ADDRESS=(PROTOCOL=tcp)(DEV=25)(HOST=10.1.1.2)(PORT=32902))
[ CSSD]2011-10-30 01:15:35.935 [1544] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(10) LATS(4148068) Disk lastSeqNo(10)
[ CSSD]2011-10-30 01:15:36.135 [1287] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(10) LATS(4148268) Disk lastSeqNo(10)

clssnmReadDskHeartbeat is responsible for reading the disk heartbeat information; it finds that node 1 has not written its disk heartbeat, so node 2 considers node 1 to be down.

disktimeout defaults to 200 seconds; once those 200 s expire, a reconfiguration takes place.

This log shows that, after the network failure occurred, node 1 failed to perform the disk heartbeat that every node must issue once per second. Note that the disk heartbeat is not something that happens only during a Brain Split; for details on the Disk Heartbeat (voting device), see the description below.

“Disk heartbeat
Each node writes a disk heartbeat to each voting disk once per second
Each node reads their kill block once per second, if the kill block is overwritten node commits suicide.
During reconfig (join or leave) CSSD monitors all nodes and determines whether a node has a disk heartbeat, including those with no network heartbeat.
If no disk heartbeat within I/O timeout (MissCount during cluster reconfiguration) then node is declared as dead.”
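A compact way to restate those quoted rules: every node stamps its own slot on each voting disk roughly once per second, and a peer whose stamp has stopped advancing within the applicable I/O timeout (misscount-based during a reconfiguration, disktimeout otherwise) is declared dead. The sketch below is merely that restatement in Python, loosely mirroring the wrtcnt / lastSeqNo bookkeeping visible in the log; all names and structures are invented for illustration and are not CSS's real data format.

import time

IO_TIMEOUT_S = 27.0   # roughly misscount-based during reconfiguration, 200 s otherwise

heartbeats = {}       # node_number -> (last write counter seen, time it last advanced)

def record_read(node_number: int, wrtcnt: int) -> None:
    # Call after reading a peer's heartbeat slot from a voting disk.
    last = heartbeats.get(node_number)
    if last is None or wrtcnt > last[0]:
        heartbeats[node_number] = (wrtcnt, time.monotonic())

def peer_is_dead(node_number: int) -> bool:
    # A peer is declared dead once its counter has not advanced within the timeout.
    last = heartbeats.get(node_number)
    if last is None:
        return True
    return (time.monotonic() - last[1]) > IO_TIMEOUT_S

# Example: as long as the counter keeps moving, the peer stays alive.
record_read(1, 10)
record_read(1, 11)
print(peer_is_dead(1))   # False while the counter keeps advancing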

A disk heartbeat timeout leading to a node being evicted is entirely normal; the log from this test does nothing to support the "voting-disk grab" theory.


Reply:


Thanks to Maclean Liu for the explanation; I understand now that my case should have nothing to do with Brain Split. But I still cannot figure out why, with NODE1 already up and its database running normally, NODE2 decided after starting that NODE1 was DOWN and then rebooted itself. If NODE2 starts first, NODE1 likewise reboots when it is started, with exactly the same errors as these. Fortunately, after the network was restored, both nodes came up. So I would still appreciate it if the author could find some time to help analyze the possible cause. Thanks!





http://www.oracledatabase12g.com/archives/oracle-rac-brain-split-resolution.html

http://www.oracledatabase12g.com/archives/%E5%86%8D%E8%AE%AErac-brain-split%E8%84%91%E8%A3%82.html


