Table of Contents
1. Modifying the source code
2. Test environment
3. Test result: a FAIL broadcast from a slave avoids the second round of voting
4. The failover process
5. Complete logs of each node
https://github.com/redis-io/redis/pull/7398
As analyzed earlier, when the network is healthy, the root cause of a slave's automatic failover needing two rounds of voting is this: when a slave decides that the master is objectively down, it does not broadcast that fact, so it cannot make the surviving masters responsible for voting update their own flags for the dead master. The clusterNode->flags that each node maintains for a given master are independent of one another. See the earlier post:
Analysis of why a Redis Cluster slave's automatic failover needs two rounds of voting
The goal of this test is to modify the source so that, when a slave decides the master is objectively down, it also broadcasts the FAIL information to the other nodes. On a healthy network we can assume the other nodes receive the broadcast at practically the same time; combined with the slave's delayed-voting mechanism, there will no longer be a failed first round of voting during an automatic slave failover.
1. Modifying the source code
Remove the if check from if (nodeIsMaster(myself)) clusterSendFail(node->name) in the markNodeAsFailingIfNeeded() function:
void markNodeAsFailingIfNeeded(clusterNode *node) {
    int failures;
    int needed_quorum = (server.cluster->size / 2) + 1;

    if (!nodeTimedOut(node)) return; /* We can reach it. */
    if (nodeFailed(node)) return; /* Already FAILing. */

    failures = clusterNodeFailureReportsCount(node);
    /* Also count myself as a voter if I'm a master. */
    if (nodeIsMaster(myself)) failures++;
    if (failures < needed_quorum) return; /* No weak agreement from masters. */

    serverLog(LL_NOTICE,
        "Marking node %.40s as failing (quorum reached).", node->name);

    /* Mark the node as failing. */
    node->flags &= ~CLUSTER_NODE_PFAIL;
    node->flags |= CLUSTER_NODE_FAIL;
    node->fail_time = mstime();

    /* Broadcast the failing node name to everybody, forcing all the other
     * reachable nodes to flag the node as FAIL. */
    if (nodeIsMaster(myself)) clusterSendFail(node->name); /* This check is why a slave does not broadcast when it detects the objective-down state */
    clusterDoBeforeSleep(CLUSTER_TODO_UPDATE_STATE|CLUSTER_TODO_SAVE_CONFIG);
}
Change it to:
void markNodeAsFailingIfNeeded(clusterNode *node) {
    int failures;
    int needed_quorum = (server.cluster->size / 2) + 1;

    if (!nodeTimedOut(node)) return; /* We can reach it. */
    if (nodeFailed(node)) return; /* Already FAILing. */

    failures = clusterNodeFailureReportsCount(node);
    /* Also count myself as a voter if I'm a master. */
    if (nodeIsMaster(myself)) failures++;
    if (failures < needed_quorum) return; /* No weak agreement from masters. */

    serverLog(LL_NOTICE,
        "Marking node %.40s as failing (quorum reached).", node->name);

    /* Mark the node as failing. */
    node->flags &= ~CLUSTER_NODE_PFAIL;
    node->flags |= CLUSTER_NODE_FAIL;
    node->fail_time = mstime();

    /* Broadcast the failing node name to everybody, forcing all the other
     * reachable nodes to flag the node as FAIL. */
    clusterSendFail(node->name); /* The if check is removed so that every node broadcasts when it detects the objective-down state */
    clusterDoBeforeSleep(CLUSTER_TODO_UPDATE_STATE|CLUSTER_TODO_SAVE_CONFIG);
}
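For context, clusterSendFail() is what actually spreads the FAIL state: it builds a CLUSTERMSG_TYPE_FAIL packet carrying the failing node's name and sends it to every node over the cluster bus. A simplified sketch, abridged from cluster.c (details may differ slightly between Redis versions):

void clusterSendFail(char *nodename) {
    unsigned char buf[sizeof(clusterMsg)];
    clusterMsg *hdr = (clusterMsg*) buf;

    /* Build a FAIL message carrying the name of the failing node. */
    clusterBuildMessageHdr(hdr,CLUSTERMSG_TYPE_FAIL);
    memcpy(hdr->data.fail.about.nodename,nodename,CLUSTER_NAMELEN);
    /* Broadcast it to every known node over the cluster bus. */
    clusterBroadcastMessage(buf,ntohs(hdr->totlen));
}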
2. Test environment
d36c0a9f6c37e639105ad3050c1230ed1a1ff76e 127.0.0.1:1222 master 1 /* the node that gets killed */
5b0484b7455dda80afed2c1c2caa2a7c99c40044 127.0.0.1:1333 master 2
f83970e54ad2574f7a282f5c5b75e9223712a27b 127.0.0.1:1444 master 3
de82e53869b1164697653fb39c384d54c0396038 127.0.0.1:2222 slave 1
1187cbcfb4381ce08edbb1d5663b16110b23eff1 127.0.0.1:2333 slave 2
9c79d082d934ee8ee40ce05b04c534de1f888370 127.0.0.1:2444 slave 3 /* note this slave's node id: it is the node that later sends the FAIL broadcast */
cluster-node-timeout=15000, i.e. 15 seconds
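For completeness, a minimal sketch of the per-node configuration this test relies on. The file names here are assumptions; only the port numbers and cluster-node-timeout correspond to the values above:

# redis1222.conf (assumed name) -- the other five nodes differ only in the port
port 1222
cluster-enabled yes
cluster-config-file nodes-1222.conf
# 15000 ms = 15 seconds, the timeout used in this test
cluster-node-timeout 15000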
3. Test result
The point is not to change the source code used in production. It is only to prove that a node judges whether a master is objectively down from the clusterNode->flags it maintains itself, and that this clusterNode->flags can flip to FAIL either because the node has collected PFAIL reports from a majority of masters, or because it has received a FAIL broadcast from another node. The clusterNode->flags that each node maintains for a given master are independent of one another.
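This is visible in how a node handles an incoming FAIL message: there is no quorum check at all, the receiver simply flips its own flags for the named node. A simplified sketch of that branch of clusterProcessPacket(), abridged from cluster.c (the exact code varies a little between Redis versions):

} else if (type == CLUSTERMSG_TYPE_FAIL) {
    clusterNode *failing;

    if (sender) {
        failing = clusterLookupNode(hdr->data.fail.about.nodename);
        if (failing &&
            !(failing->flags & (CLUSTER_NODE_FAIL|CLUSTER_NODE_MYSELF)))
        {
            serverLog(LL_NOTICE,
                "FAIL message received from %.40s about %.40s",
                hdr->sender, hdr->data.fail.about.nodename);
            /* Trust the sender: mark the node as FAIL locally without
             * re-checking any quorum on this node. */
            failing->flags |= CLUSTER_NODE_FAIL;
            failing->fail_time = mstime();
            failing->flags &= ~CLUSTER_NODE_PFAIL;
            clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG|
                                 CLUSTER_TODO_UPDATE_STATE);
        }
    }
}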
4. The failover process
1. 10:56:31.740
-- Master 1 is killed; slave 1 notices that the connection to the master is lost
6670:S 13 Jun 10:56:31.740 # Connection with master lost.
2. 10:56:48.645
-- Slave 3 has collected PFAIL reports from a majority of masters and marks master 1 as FAIL. Why does this not happen exactly 15 seconds after the kill? Because the PFAIL information itself needs time to propagate between nodes via gossip.
6682:S 13 Jun 10:56:48.645 * Marking node d36c0a9f6c37e639105ad3050c1230ed1a1ff76e as failing (quorum reached).
3. 10:56:48.646, almost simultaneously, all the other nodes receive slave 3's broadcast
-- Slave 1 receives the FAIL broadcast from slave 3 and sets master 1's clusterNode->flags to FAIL
tail -f redis2222.log
6670:S 13 Jun 10:56:48.646 * FAIL message received from 9c79d082d934ee8ee40ce05b04c534de1f888370 about d36c0a9f6c37e639105ad3050c1230ed1a1ff76e
-- Slave 2 receives the FAIL broadcast from slave 3 and sets master 1's clusterNode->flags to FAIL
tail -f redis3333.log
6678:S 13 Jun 10:56:48.646 * FAIL message received from 9c79d082d934ee8ee40ce05b04c534de1f888370 about d36c0a9f6c37e639105ad3050c1230ed1a1ff76e
-- Master 2 receives the FAIL broadcast from slave 3 and sets master 1's clusterNode->flags to FAIL
tail -f redis1333.log
6674:M 13 Jun 10:56:48.647 * FAIL message received from 9c79d082d934ee8ee40ce05b04c534de1f888370 about d36c0a9f6c37e639105ad3050c1230ed1a1ff76e
-- Master 3 receives the FAIL broadcast from slave 3 and sets master 1's clusterNode->flags to FAIL
tail -f redis1444.log
6686:M 13 Jun 10:56:48.647 * FAIL message received from 9c79d082d934ee8ee40ce05b04c534de1f888370 about d36c0a9f6c37e639105ad3050c1230ed1a1ff76e
4. 10:56:49.252, slave 1 detects through clusterCron that master 1's flags are FAIL and starts the failover election (the delay calculation is sketched after this walkthrough)
6670:S 13 Jun 10:56:48.646 # Start of election delayed for 557 milliseconds (rank #0, offset 58).
6670:S 13 Jun 10:56:49.252 # Starting a failover election for epoch 11.
5. Because they have already received slave 3's FAIL broadcast, both masters grant their votes (the vote-granting check is sketched at the end of this post)
-- Master 2
6674:M 13 Jun 10:56:48.647 * FAIL message received from 9c79d082d934ee8ee40ce05b04c534de1f888370 about d36c0a9f6c37e639105ad3050c1230ed1a1ff76e
6674:M 13 Jun 10:56:49.258 # Failover auth granted to de82e53869b1164697653fb39c384d54c0396038 for epoch 11
-- Master 3
6686:M 13 Jun 10:56:48.647 * FAIL message received from 9c79d082d934ee8ee40ce05b04c534de1f888370 about d36c0a9f6c37e639105ad3050c1230ed1a1ff76e
6686:M 13 Jun 10:56:49.258 # Failover auth granted to de82e53869b1164697653fb39c384d54c0396038 for epoch 11
6. Slave 1 wins the election and performs the failover
6670:S 13 Jun 10:56:48.646 * FAIL message received from 9c79d082d934ee8ee40ce05b04c534de1f888370 about d36c0a9f6c37e639105ad3050c1230ed1a1ff76e
6670:S 13 Jun 10:56:48.646 # Start of election delayed for 557 milliseconds (rank #0, offset 58).
6670:S 13 Jun 10:56:49.252 # Starting a failover election for epoch 11.
6670:S 13 Jun 10:56:49.259 # Failover election won: I'm the new master.
6670:S 13 Jun 10:56:49.259 # configEpoch set to 11 after successful failover
6670:M 13 Jun 10:56:49.259 * Discarding previously cached master state.
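The 557 ms delay seen in step 4 comes from how a slave schedules its election in clusterHandleSlaveFailover(): a fixed 500 ms (precisely so that FAIL messages have time to propagate), a random 0-500 ms component, and 1000 ms per replication rank. A simplified sketch, abridged from cluster.c:

/* Schedule the vote request: fixed delay + random jitter + rank penalty.
 * Rank 0 is the slave with the most up-to-date replication offset. */
server.cluster->failover_auth_time = mstime() +
    500 + /* Fixed delay of 500 milliseconds, let FAIL msg propagate. */
    random() % 500; /* Random delay between 0 and 500 milliseconds. */
server.cluster->failover_auth_rank = clusterGetSlaveRank();
server.cluster->failover_auth_time +=
    server.cluster->failover_auth_rank * 1000;

With rank #0 this is 500 ms plus a small random component, which matches the 557 milliseconds in the log above.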
5. Complete logs of each node
-- Slave 1
6670:S 13 Jun 10:56:31.740 # Connection with master lost. /* Notices that the connection to the master is lost */
6670:S 13 Jun 10:56:31.740 * Caching the disconnected master state.
6670:S 13 Jun 10:56:32.144 * Connecting to MASTER 127.0.0.1:1222
6670:S 13 Jun 10:56:32.144 * MASTER <-> SLAVE sync started
6670:S 13 Jun 10:56:32.144 # Error condition on socket for SYNC: Connection refused
6670:S 13 Jun 10:56:33.157 * Connecting to MASTER 127.0.0.1:1222
6670:S 13 Jun 10:56:33.157 * MASTER <-> SLAVE sync started
6670:S 13 Jun 10:56:33.157 # Error condition on socket for SYNC: Connection refused
6670:S 13 Jun 10:56:34.171 * Connecting to MASTER 127.0.0.1:1222
6670:S 13 Jun 10:56:34.171 * MASTER <-> SLAVE sync started
6670:S 13 Jun 10:56:34.171 # Error condition on socket for SYNC: Connection refused
6670:S 13 Jun 10:56:35.182 * Connecting to MASTER 127.0.0.1:1222
6670:S 13 Jun 10:56:35.182 * MASTER <-> SLAVE sync started
6670:S 13 Jun 10:56:35.182 # Error condition on socket for SYNC: Connection refused
6670:S 13 Jun 10:56:36.195 * Connecting to MASTER 127.0.0.1:1222
6670:S 13 Jun 10:56:36.195 * MASTER <-> SLAVE sync started
6670:S 13 Jun 10:56:36.195 # Error condition on socket for SYNC: Connection refused
6670:S 13 Jun 10:56:37.208 * Connecting to MASTER 127.0.0.1:1222
6670:S 13 Jun 10:56:37.208 * MASTER <-> SLAVE sync started
6670:S 13 Jun 10:56:37.208 # Error condition on socket for SYNC: Connection refused
6670:S 13 Jun 10:56:38.220 * Connecting to MASTER 127.0.0.1:1222
6670:S 13 Jun 10:56:38.220 * MASTER <-> SLAVE sync started
6670:S 13 Jun 10:56:38.220 # Error condition on socket for SYNC: Connection refused
6670:S 13 Jun 10:56:39.235 * Connecting to MASTER 127.0.0.1:1222
6670:S 13 Jun 10:56:39.235 * MASTER <-> SLAVE sync started
6670:S 13 Jun 10:56:39.235 # Error condition on socket for SYNC: Connection refused
6670:S 13 Jun 10:56:40.247 * Connecting to MASTER 127.0.0.1:1222
6670:S 13 Jun 10:56:40.247 * MASTER <-> SLAVE sync started
6670:S 13 Jun 10:56:40.247 # Error condition on socket for SYNC: Connection refused
6670:S 13 Jun 10:56:41.256 * Connecting to MASTER 127.0.0.1:1222
6670:S 13 Jun 10:56:41.256 * MASTER <-> SLAVE sync started
6670:S 13 Jun 10:56:41.256 # Error condition on socket for SYNC: Connection refused
6670:S 13 Jun 10:56:42.267 * Connecting to MASTER 127.0.0.1:1222
6670:S 13 Jun 10:56:42.267 * MASTER <-> SLAVE sync started
6670:S 13 Jun 10:56:42.267 # Error condition on socket for SYNC: Connection refused
6670:S 13 Jun 10:56:43.281 * Connecting to MASTER 127.0.0.1:1222
6670:S 13 Jun 10:56:43.281 * MASTER <-> SLAVE sync started
6670:S 13 Jun 10:56:43.281 # Error condition on socket for SYNC: Connection refused
6670:S 13 Jun 10:56:44.291 * Connecting to MASTER 127.0.0.1:1222
6670:S 13 Jun 10:56:44.291 * MASTER <-> SLAVE sync started
6670:S 13 Jun 10:56:44.291 # Error condition on socket for SYNC: Connection refused
6670:S 13 Jun 10:56:45.305 * Connecting to MASTER 127.0.0.1:1222
6670:S 13 Jun 10:56:45.305 * MASTER <-> SLAVE sync started
6670:S 13 Jun 10:56:45.306 # Error condition on socket for SYNC: Connection refused
6670:S 13 Jun 10:56:46.316 * Connecting to MASTER 127.0.0.1:1222
6670:S 13 Jun 10:56:46.316 * MASTER <-> SLAVE sync started
6670:S 13 Jun 10:56:46.316 # Error condition on socket for SYNC: Connection refused
6670:S 13 Jun 10:56:47.328 * Connecting to MASTER 127.0.0.1:1222
6670:S 13 Jun 10:56:47.328 * MASTER <-> SLAVE sync started
6670:S 13 Jun 10:56:47.328 # Error condition on socket for SYNC: Connection refused
6670:S 13 Jun 10:56:48.342 * Connecting to MASTER 127.0.0.1:1222
6670:S 13 Jun 10:56:48.342 * MASTER <-> SLAVE sync started
6670:S 13 Jun 10:56:48.342 # Error condition on socket for SYNC: Connection refused
6670:S 13 Jun 10:56:48.646 * FAIL message received from 9c79d082d934ee8ee40ce05b04c534de1f888370 about d36c0a9f6c37e639105ad3050c1230ed1a1ff76e /* Receives the FAIL broadcast from slave 3 and sets master 1's flags to FAIL */
6670:S 13 Jun 10:56:48.646 # Start of election delayed for 557 milliseconds (rank #0, offset 58). /* Delayed voting: since all nodes receive the FAIL broadcast at practically the same time, by the time the delay expires there is no node left that has not marked the dead master as objectively down when voting starts */
6670:S 13 Jun 10:56:49.252 # Starting a failover election for epoch 11.
6670:S 13 Jun 10:56:49.259 # Failover election won: I'm the new master.
6670:S 13 Jun 10:56:49.259 # configEpoch set to 11 after successful failover
6670:M 13 Jun 10:56:49.259 * Discarding previously cached master state.
-- Slave 2
6678:S 13 Jun 10:56:48.646 * FAIL message received from 9c79d082d934ee8ee40ce05b04c534de1f888370 about d36c0a9f6c37e639105ad3050c1230ed1a1ff76e /* Receives the FAIL broadcast from slave 3 */
-- Slave 3
6682:S 13 Jun 10:56:48.645 * Marking node d36c0a9f6c37e639105ad3050c1230ed1a1ff76e as failing (quorum reached). /* Has accumulated PFAIL reports from two masters via gossip, marks the dead master as FAIL, and sends the broadcast */
-- Master 2
6674:M 13 Jun 10:56:48.647 * FAIL message received from 9c79d082d934ee8ee40ce05b04c534de1f888370 about d36c0a9f6c37e639105ad3050c1230ed1a1ff76e /* Receives the FAIL broadcast from slave 3 and sets the clusterNode->flags it maintains for the dead master to FAIL */
6674:M 13 Jun 10:56:49.258 # Failover auth granted to de82e53869b1164697653fb39c384d54c0396038 for epoch 11 /* Sees that its own clusterNode->flags for the dead master is FAIL, so it grants the vote */
-- Master 3
6686:M 13 Jun 10:56:48.647 * FAIL message received from 9c79d082d934ee8ee40ce05b04c534de1f888370 about d36c0a9f6c37e639105ad3050c1230ed1a1ff76e /* Receives the FAIL broadcast from slave 3 and sets the clusterNode->flags it maintains for the dead master to FAIL */
6686:M 13 Jun 10:56:49.258 # Failover auth granted to de82e53869b1164697653fb39c384d54c0396038 for epoch 11 /* Sees that its own clusterNode->flags for the dead master is FAIL, so it grants the vote */
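Finally, the reason the two masters could grant their votes on the first attempt is the check inside clusterSendFailoverAuthIfNeeded(): a master refuses to vote unless, in its own view, the requesting slave's master is flagged FAIL (except for a forced manual failover). A simplified sketch of that check, abridged from cluster.c:

clusterNode *master = node->slaveof; /* node = the slave asking for the vote */

/* The slave's master must be in FAIL state in *this* voter's own
 * clusterNode->flags, unless the request carries the force-ack flag
 * set by a manual failover. */
if (nodeIsMaster(node) || master == NULL ||
    (!nodeFailed(master) && !force_ack))
{
    if (nodeIsMaster(node)) {
        serverLog(LL_WARNING,
            "Failover auth denied to %.40s: it is a master node",
            node->name);
    } else if (master == NULL) {
        serverLog(LL_WARNING,
            "Failover auth denied to %.40s: I don't know its master",
            node->name);
    } else if (!nodeFailed(master)) {
        serverLog(LL_WARNING,
            "Failover auth denied to %.40s: its master is up",
            node->name);
    }
    return;
}

Before the modification, this is exactly where the first round stalled: the voting masters had not yet flipped their own flags to FAIL, so they answered "its master is up" and refused to vote. With the slave broadcasting FAIL, every voter passes this check on the first round.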