A Study of Master-Slave Failover in Redis Cluster

Table of Contents

1. Foreword

2. Slave initiates an election

3. Master responds to the election

4. Election example

5. Hash slot propagation

6. A failover record (1)

6.1. Relevant parameters

6.2. Timeline

6.3. Another master's log

6.4. Another master's log

6.5. Slave log

7. A failover record (2)

7.1. Relevant parameters

7.2. Timeline

7.3. Another master's log

7.4. Another master's log

7.5. Slave log

8. Code for the slave's delayed election

 

  1. Foreword

The official Redis specification: https://redis.io/topics/cluster-spec. Note that starting with Redis 5.0, "slave" has been renamed "replica"; configuration items and parts of the documentation and variable names have been renamed accordingly.

Redis Cluster performs failover through an election decided by majority rule. Only masters may vote, so an election can only succeed while a majority of the masters are alive; the election itself is initiated by a slave.

Redis uses a concept similar to the term in the Raft algorithm; in Redis it is called the epoch. An epoch is an unsigned 64-bit integer, and every node's epoch starts at 0.

If a node receives an epoch larger than its own, it updates its own epoch to the received value (a trusted network is assumed, i.e., there is no Byzantine generals problem).

Every master broadcasts its epoch and the bitmap of slots it serves in its ping and pong messages. When a slave starts an election, it creates a new epoch (by incrementing). The epoch is persisted to the nodes.conf file, for example (latest epoch is 27, and the last vote was cast for epoch 27):

vars currentEpoch 27 lastVoteEpoch 27
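The epoch-adoption rule described above can be sketched as follows (a minimal illustration; the function name is ours, not from the Redis source):

```c
#include <assert.h>
#include <stdint.h>

/* If a received epoch is larger than ours, adopt it (a trusted network
 * is assumed, as in Redis Cluster). Returns 1 if our epoch changed. */
static int maybe_adopt_epoch(uint64_t *current_epoch, uint64_t received) {
    if (received > *current_epoch) {
        *current_epoch = received;
        return 1;
    }
    return 0;
}
```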

  2. Slave initiates an election

A slave initiates an election only when its master is in the fail state, and even then not immediately: it first waits the random delay below (always at least 0.5 seconds), so that multiple slaves do not start elections at the same time:

500 milliseconds + random delay between 0 and 500 milliseconds + SLAVE_RANK * 1000 milliseconds
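The formula maps directly onto a small helper (a sketch; the function name is ours):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Election delay: 500 ms fixed (lets the FAIL state propagate) +
 * 0..499 ms random (de-synchronizes sibling slaves) +
 * rank * 1000 ms (penalizes slaves with older replication data). */
static int64_t election_delay_ms(int slave_rank) {
    return 500 + (random() % 500) + (int64_t)slave_rank * 1000;
}
```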

 

Conditions for a slave to initiate an election:

  1. Its master is in the fail state (not merely pfail);
  2. Its master serves at least one slot;
  3. The replication link between the slave and the master has been down for no longer than a configurable limit (this ensures the slave's data is reasonably complete; operationally, never leave a slave unavailable for a long time — monitor it and restore abnormal slaves promptly).

 

Log from a slave that could not fail over automatically because it had been unavailable for too long:

12961:S 06 Jan 2019 19:00:21.969 # Currently unable to failover: Disconnected from master for longer than allowed. Please check the 'cluster-replica-validity-factor' configuration option.

 

The relevant source code:

/* This function is called if we are a slave node and our master serving
 * a non-zero amount of hash slots is in FAIL state.
 *
 * The goal of this function is:
 * 1) To check if we are able to perform a failover, is our data updated?
 * 2) Try to get elected by masters.
 * 3) Perform the failover informing all the other nodes.
 */
void clusterHandleSlaveFailover(void) {
    mstime_t data_age; /* time disconnected from the master, in milliseconds */
    mstime_t auth_age = mstime() - server.cluster->failover_auth_time;
    int needed_quorum = (server.cluster->size / 2) + 1;
    int manual_failover = server.cluster->mf_end != 0 && server.cluster->mf_can_start;
    mstime_t auth_timeout, auth_retry_time;

    auth_timeout = server.cluster_node_timeout*2;
    if (auth_timeout < 2000) auth_timeout = 2000;
    auth_retry_time = auth_timeout*2;

    /* ...... */

    /* Set data_age to the number of seconds we are disconnected from
     * the master. */
    if (server.repl_state == REPL_STATE_CONNECTED) {
        data_age = (mstime_t)(server.unixtime - server.master->lastinteraction) * 1000;
    } else {
        data_age = (mstime_t)(server.unixtime - server.repl_down_since) * 1000;
    }

    /* Remove the node timeout from the data age as it is fine that we are
     * disconnected from our master at least for the time it was down to be
     * flagged as FAIL, that's the baseline. */
    if (data_age > server.cluster_node_timeout)
        data_age -= server.cluster_node_timeout;

    /* Check if our data is recent enough according to the slave validity
     * factor configured by the user.
     *
     * Check bypassed for manual failovers. */
    if (server.cluster_slave_validity_factor &&
        data_age >
        (((mstime_t)server.repl_ping_slave_period * 1000) +
         (server.cluster_node_timeout * server.cluster_slave_validity_factor)))
    {
        /* The slave has been unavailable for too long to be promoted. */
        if (!manual_failover) { /* manual failovers bypass this check */
            clusterLogCantFailover(CLUSTER_CANT_FAILOVER_DATA_AGE);
            return;
        }
    }

    /* ...... */

    /* Ask for votes if needed. */
    /* failover_auth_sent marks whether the vote request has already been sent */
    if (server.cluster->failover_auth_sent == 0) {
        server.cluster->currentEpoch++;
        server.cluster->failover_auth_epoch = server.cluster->currentEpoch;
        serverLog(LL_WARNING,"Starting a failover election for epoch %llu.",
            (unsigned long long) server.cluster->currentEpoch);

        /* Broadcast FAILOVER_AUTH_REQUEST (a request to be voted master) to
         * every node, slaves included, but note that only masters respond. */
        clusterRequestFailoverAuth();
        server.cluster->failover_auth_sent = 1;
        clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG|
                             CLUSTER_TODO_UPDATE_STATE|
                             CLUSTER_TODO_FSYNC_CONFIG);
        return; /* Wait for replies. */
    }

    /* Check if we reached the quorum. */
    if (server.cluster->failover_auth_count >= needed_quorum) {
        /* We have the quorum, we can finally failover the master. */

        serverLog(LL_WARNING,
            "Failover election won: I'm the new master.");

        /* Update my configEpoch to the epoch of the election. */
        if (myself->configEpoch < server.cluster->failover_auth_epoch) {
            myself->configEpoch = server.cluster->failover_auth_epoch;
            serverLog(LL_WARNING,
                "configEpoch set to %llu after successful failover",
                (unsigned long long) myself->configEpoch);
        }

        /* Take responsibility for the cluster slots. */
        clusterFailoverReplaceYourMaster();
    } else {
        clusterLogCantFailover(CLUSTER_CANT_FAILOVER_WAITING_VOTES);
    }
}

 

The code above also shows that the cluster-slave-validity-factor configuration option controls whether a slave is allowed to be promoted to master.

 

Before starting an election, the slave first increments its own epoch (currentEpoch), then asks the other masters to vote for it by broadcasting a FAILOVER_AUTH_REQUEST packet to every master in the cluster.

After requesting votes, a slave waits up to twice NODE_TIMEOUT for the results, and never less than 2 seconds regardless of the value of NODE_TIMEOUT.

A master that grants its vote replies with FAILOVER_AUTH_ACK, and will not vote for another slave of the same master within NODE_TIMEOUT*2.

If the epoch carried in a FAILOVER_AUTH_ACK is smaller than the slave's own epoch, the response is simply discarded. Once the slave has received FAILOVER_AUTH_ACKs from a majority of the masters, it declares itself the winner of the election.

If the slave has not won within twice NODE_TIMEOUT (at least 2 seconds), it abandons this election and starts a new one after four times NODE_TIMEOUT (at least 4 seconds).
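These two windows correspond to auth_timeout and auth_retry_time in clusterHandleSlaveFailover; a minimal sketch of the computation (function names are ours):

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t mstime_t;

/* Voting window: wait up to 2 * NODE_TIMEOUT for votes, floor 2000 ms. */
static mstime_t auth_timeout_ms(mstime_t cluster_node_timeout) {
    mstime_t t = cluster_node_timeout * 2;
    return t < 2000 ? 2000 : t;
}

/* Retry window: a failed election may be retried after twice the
 * voting window, i.e. 4 * NODE_TIMEOUT with a floor of 4000 ms. */
static mstime_t auth_retry_ms(mstime_t cluster_node_timeout) {
    return auth_timeout_ms(cluster_node_timeout) * 2;
}
```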

 

The mandatory minimum delay of 0.5 seconds exists to let the master's fail state spread across the whole cluster; otherwise only a minority of masters might know about it, and a master only votes for the slaves of a master it considers failed. If a slave's master is not in the fail state from some master's point of view, that master will not vote for the slave. Redis spreads the fail state via the gossip protocol. The extra random delay on top of the fixed delay keeps multiple slaves from starting elections at the same instant.

 

A slave's SLAVE_RANK is derived from its replication offset relative to the master: the slave with the most up-to-date replication data has rank 0, the second most up-to-date has rank 1, and so on. This lets the slave with the most complete data start its election first. If the best-ranked slave is not elected, the other slaves will start elections soon afterwards (after at least 4 seconds).
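Rank assignment can be sketched as counting the sibling slaves with a newer replication offset, which is essentially what clusterGetSlaveRank does (the data layout here is simplified and the function name is ours):

```c
#include <assert.h>

/* A slave's rank = number of sibling slaves whose replication offset is
 * strictly larger; the most up-to-date slave therefore gets rank 0. */
static int slave_rank(const long long *sibling_offsets, int n,
                      long long my_offset) {
    int rank = 0;
    for (int i = 0; i < n; i++)
        if (sibling_offsets[i] > my_offset) rank++;
    return rank;
}
```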

After winning the election, the slave broadcasts a pong to all nodes in the cluster so that the reconfiguration (reflected in updates to nodes.conf) completes as quickly as possible. Nodes that are currently unreachable will eventually be reconfigured as well.

Other nodes may then observe two masters claiming responsibility for the same slots; the conflict is resolved in favor of the master with the larger epoch.

After a slave becomes a master, it does not start serving immediately, but leaves a short grace interval.

  3. Master responds to the election

When a master receives a slave's FAILOVER_AUTH_REQUEST, it grants its vote only under the following conditions:

  1. It votes at most once for any given epoch;
  2. It rejects every request carrying a smaller epoch;
  3. It never votes for an epoch smaller than its lastVoteEpoch;
  4. It only votes for a slave whose master it marks as fail;
  5. If the slave's currentEpoch is smaller than the master's currentEpoch, the request is ignored — with one subtle exception:
  • Suppose the master's currentEpoch is 5 and its lastVoteEpoch is 1 (this can happen after a failed election: currentEpoch was incremented, but since the election failed, lastVoteEpoch was not updated);
  • the slave's currentEpoch is 3;
  • the slave increments it and starts an election with epoch 4; the master responds with epoch 5, but the response happens to be delayed;
  • the slave starts a new election, this time with epoch 5 (each new election increments the epoch again); the delayed response now arrives, and the slave considers it valid.

 

After voting, the master records the epoch from the request in its lastVoteEpoch and persists it to nodes.conf. Masters play no part in choosing the best slave; instead, the best slave has the best SLAVE_RANK and therefore tends to start its election earliest.
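The voting rules above can be condensed into a small sketch. The struct and function names are ours, not the actual clusterSendFailoverAuthIfNeeded code; persistence to nodes.conf and the slot-coverage checks are omitted:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified per-master voting state; field names are illustrative. */
typedef struct {
    uint64_t current_epoch;
    uint64_t last_vote_epoch;
} master_state;

/* Grant a vote only if: the request's epoch is not stale, we have not
 * already voted for this (or a later) epoch, and the requester's master
 * is in the FAIL state from our point of view. */
static bool grant_vote(master_state *m, uint64_t req_epoch,
                       bool requester_master_failed) {
    if (req_epoch < m->current_epoch) return false;   /* stale request */
    if (req_epoch <= m->last_vote_epoch) return false; /* one vote per epoch */
    if (!requester_master_failed) return false;        /* master not FAIL */
    m->last_vote_epoch = req_epoch;  /* the real code persists this */
    if (m->current_epoch < req_epoch) m->current_epoch = req_epoch;
    return true;
}
```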

  4. Election example

Suppose a master has three slaves A, B, and C, and the master becomes unreachable:

  1. slave A wins the election and becomes master;
  2. A becomes unavailable due to a network partition;
  3. slave B wins an election;
  4. B becomes unavailable due to a network partition;
  5. the partition heals and A becomes available again.

 

B is down and A is available again. At the same moment, slave C starts an election, attempting to replace B as master. Since C's master is unreachable, C can win the election and increments its configEpoch. A cannot become master again, because C is already master and C's epoch is larger.

  5. Hash slot propagation

Hash slots are propagated in two ways:

  1. Heartbeat messages: a node's ping and pong messages always carry the hash slots it serves (or that its master serves);
  2. UPDATE messages: since heartbeats also carry epoch information, a receiver that finds the heartbeat's slot information stale replies with the newer configuration, forcing the sender to update its slot table.
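The effect of an UPDATE-style message can be sketched as reassigning every claimed slot whose current claim is backed by a smaller config epoch. The data layout below is simplified and hypothetical, not Redis's actual clusterNode structures:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_SLOTS 16384

/* Simplified slot table: who owns each slot, and the config epoch
 * backing that claim. */
typedef struct {
    int owner[NUM_SLOTS];
    uint64_t epoch[NUM_SLOTS];
} slot_table;

/* Apply a claim over a set of slots: a slot moves to the claimer only
 * if the claim's epoch is larger than the epoch of the current claim.
 * Returns the number of slots reassigned. */
static int apply_update(slot_table *t, int claimer, uint64_t claim_epoch,
                        const unsigned char *claimed /* NUM_SLOTS flags */) {
    int changed = 0;
    for (int s = 0; s < NUM_SLOTS; s++) {
        if (claimed[s] && claim_epoch > t->epoch[s]) {
            t->owner[s] = claimer;
            t->epoch[s] = claim_epoch;
            changed++;
        }
    }
    return changed;
}
```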
  6. A failover record (1)

The test cluster runs on a single physical machine, with cluster-node-timeout larger than repl-timeout.

    6.1. Relevant parameters

cluster-slave-validity-factor = 1

cluster-node-timeout = 30000

repl-ping-slave-period = 1

repl-timeout = 10

    6.2. Timeline

The switchover takes place within roughly 1 second of the master being marked FAIL.

master A marked the master fail at: 20:12:55.467

master B marked the master fail at: 20:12:55.467

master A voted at: 20:12:56.164

master B voted at: 20:12:56.164

slave started the election at: 20:12:56.160

slave scheduled the election at: 20:12:55.558 (delayed by 579 ms)

slave detected the heartbeat timeout with its master at: 20:12:32.810 (the switchover happened 24 seconds after this)

slave received FAIL about its own master from another master at: 20:12:55.467

last time the service was healthy before the switchover: 20:12:22/279275 (the outage began at roughly second 23)

service healthy again after the switchover: 20:12:59/278149

service unavailable for: about 37 seconds

    6.3. Another master's log

This master's ID is c67dc9e02e25f2e6321df8ac2eb4d99789917783.

30613:M 04 Jan 2019 20:12:55.467 * FAIL message received from bfad383775421b1090eaa7e0b2dcfb3b38455079 about 44eb43e50c101c5f44f48295c42dda878b6cb3e9 // received word from another master that 44eb43e50c101c5f44f48295c42dda878b6cb3e9 has failed

30613:M 04 Jan 2019 20:12:55.467 # Cluster state changed: fail

30613:M 04 Jan 2019 20:12:56.164 # Failover auth granted to 0ae8b5400d566907a3d8b425d983ac3b7cbd8412 for epoch 30 // vote granted in the election

30613:M 04 Jan 2019 20:12:56.204 # Cluster state changed: ok

30613:M 04 Jan 2019 20:12:56.708 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9

    6.4. Another master's log

This master's ID is bfad383775421b1090eaa7e0b2dcfb3b38455079.

30614:M 04 Jan 2019 20:12:55.467 * Marking node 44eb43e50c101c5f44f48295c42dda878b6cb3e9 as failing (quorum reached). // marking 44eb43e50c101c5f44f48295c42dda878b6cb3e9 as failed

30614:M 04 Jan 2019 20:12:56.164 # Failover auth granted to 0ae8b5400d566907a3d8b425d983ac3b7cbd8412 for epoch 30 // vote granted in the election

30614:M 04 Jan 2019 20:12:56.709 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9

    6.5. Slave log

The slave's master ID is 44eb43e50c101c5f44f48295c42dda878b6cb3e9; the slave's own ID is 0ae8b5400d566907a3d8b425d983ac3b7cbd8412.

30651:S 04 Jan 2019 20:12:32.810 # MASTER timeout: no data nor PING received... // master timeout detected 10 seconds after the master failed, because repl-timeout is 10

30651:S 04 Jan 2019 20:12:32.810 # Connection with master lost.

30651:S 04 Jan 2019 20:12:32.810 * Caching the disconnected master state.

30651:S 04 Jan 2019 20:12:32.810 * Connecting to MASTER 1.9.16.9:4073

30651:S 04 Jan 2019 20:12:32.810 * MASTER <-> REPLICA sync started

30651:S 04 Jan 2019 20:12:32.810 * Non blocking connect for SYNC fired the event.

 

30651:S 04 Jan 2019 20:12:43.834 # Timeout connecting to the MASTER...

30651:S 04 Jan 2019 20:12:43.834 * Connecting to MASTER 1.9.16.9:4073

30651:S 04 Jan 2019 20:12:43.834 * MASTER <-> REPLICA sync started

30651:S 04 Jan 2019 20:12:43.834 * Non blocking connect for SYNC fired the event.

30651:S 04 Jan 2019 20:12:54.856 # Timeout connecting to the MASTER...

30651:S 04 Jan 2019 20:12:54.856 * Connecting to MASTER 1.9.16.9:4073

30651:S 04 Jan 2019 20:12:54.856 * MASTER <-> REPLICA sync started

30651:S 04 Jan 2019 20:12:54.856 * Non blocking connect for SYNC fired the event.

30651:S 04 Jan 2019 20:12:55.467 * FAIL message received from bfad383775421b1090eaa7e0b2dcfb3b38455079 about 44eb43e50c101c5f44f48295c42dda878b6cb3e9 // received a FAIL message about its own master from another master

30651:S 04 Jan 2019 20:12:55.467 # Cluster state changed: fail

30651:S 04 Jan 2019 20:12:55.558 # Start of election delayed for 579 milliseconds (rank #0, offset 227360). // election scheduled with a 579 ms delay: 500 ms fixed plus 79 ms random; the rank is 0, so the rank delay is 0 ms

30651:S 04 Jan 2019 20:12:56.160 # Starting a failover election for epoch 30. // election started

30651:S 04 Jan 2019 20:12:56.180 # Failover election won: I'm the new master. // election won

30651:S 04 Jan 2019 20:12:56.180 # configEpoch set to 30 after successful failover

30651:M 04 Jan 2019 20:12:56.180 # Setting secondary replication ID to 154a9c2319403d610808477dcda3d4bede0f374c, valid up to offset: 227361. New replication ID is 927fb64a420236ee46d39389611ab2d8f6530b6a

30651:M 04 Jan 2019 20:12:56.181 * Discarding previously cached master state.

30651:M 04 Jan 2019 20:12:56.181 # Cluster state changed: ok

30651:M 04 Jan 2019 20:12:56.708 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9 // ignoring a message from 1.9.16.9:4077, which is not a cluster member

  7. A failover record (2)

The test cluster runs on a single physical machine, with cluster-node-timeout smaller than repl-timeout.

    7.1. Relevant parameters

cluster-slave-validity-factor = 1

cluster-node-timeout = 10000

repl-ping-slave-period = 1

repl-timeout = 30

    7.2. Timeline

The switchover takes place within roughly 1 second of the master being marked FAIL.

master A marked the master fail at: 20:37:10.398

master B marked the master fail at: 20:37:10.398

master A voted at: 20:37:11.084

master B voted at: 20:37:11.085

slave started the election at: 20:37:11.077

slave scheduled the election at: 20:37:10.475 (delayed by 539 ms)

slave detected the heartbeat timeout with its master at: never — the slave became master before the timeout could fire

slave received FAIL about its own master from another master at: 20:37:10.398

last time the service was healthy before the switchover: 20:36:55/266889 (the outage began at roughly second 56)

service healthy again after the switchover: 20:37:12/265802

service unavailable for: about 17 seconds

    7.3. Another master's log

This master's ID is c67dc9e02e25f2e6321df8ac2eb4d99789917783.

30613:M 04 Jan 2019 20:37:10.398 * Marking node 44eb43e50c101c5f44f48295c42dda878b6cb3e9 as failing (quorum reached).

30613:M 04 Jan 2019 20:37:10.398 # Cluster state changed: fail

30613:M 04 Jan 2019 20:37:11.084 # Failover auth granted to 0ae8b5400d566907a3d8b425d983ac3b7cbd8412 for epoch 32

30613:M 04 Jan 2019 20:37:11.124 # Cluster state changed: ok

30613:M 04 Jan 2019 20:37:17.560 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9

    7.4. Another master's log

This master's ID is bfad383775421b1090eaa7e0b2dcfb3b38455079.

30614:M 04 Jan 2019 20:37:10.398 * Marking node 44eb43e50c101c5f44f48295c42dda878b6cb3e9 as failing (quorum reached).

30614:M 04 Jan 2019 20:37:11.085 # Failover auth granted to 0ae8b5400d566907a3d8b425d983ac3b7cbd8412 for epoch 32

30614:M 04 Jan 2019 20:37:17.560 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9

    7.5. Slave log

The slave's master ID is 44eb43e50c101c5f44f48295c42dda878b6cb3e9; the slave's own ID is 0ae8b5400d566907a3d8b425d983ac3b7cbd8412.

30651:S 04 Jan 2019 20:37:10.398 * FAIL message received from c67dc9e02e25f2e6321df8ac2eb4d99789917783 about 44eb43e50c101c5f44f48295c42dda878b6cb3e9

30651:S 04 Jan 2019 20:37:10.398 # Cluster state changed: fail

30651:S 04 Jan 2019 20:37:10.475 # Start of election delayed for 539 milliseconds (rank #0, offset 228620).

30651:S 04 Jan 2019 20:37:11.077 # Starting a failover election for epoch 32.

30651:S 04 Jan 2019 20:37:11.100 # Failover election won: I'm the new master.

30651:S 04 Jan 2019 20:37:11.100 # configEpoch set to 32 after successful failover

30651:M 04 Jan 2019 20:37:11.100 # Setting secondary replication ID to 0cf19d01597610c7933b7ed67c999a631655eafc, valid up to offset: 228621. New replication ID is 53daa7fa265d982aebd3c18c07ed5f178fc3f70b

30651:M 04 Jan 2019 20:37:11.101 # Connection with master lost.

30651:M 04 Jan 2019 20:37:11.101 * Caching the disconnected master state.

30651:M 04 Jan 2019 20:37:11.101 * Discarding previously cached master state.

30651:M 04 Jan 2019 20:37:11.101 # Cluster state changed: ok

30651:M 04 Jan 2019 20:37:17.560 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9

  8. Code for the slave's delayed election

// Excerpted from Redis 5.0.3
// cluster.c

/* This function is called if we are a slave node and our master serving
 * a non-zero amount of hash slots is in FAIL state.
 *
 * The goal of this function is:
 * 1) To check if we are able to perform a failover, is our data updated?
 * 2) Try to get elected by masters.
 * 3) Perform the failover informing all the other nodes.
 */
void clusterHandleSlaveFailover(void) {
    /* ...... */

    /* Check if our data is recent enough according to the slave validity
     * factor configured by the user.
     *
     * Check bypassed for manual failovers. */
    if (server.cluster_slave_validity_factor &&
        data_age >
        (((mstime_t)server.repl_ping_slave_period * 1000) +
         (server.cluster_node_timeout * server.cluster_slave_validity_factor)))
    {
        if (!manual_failover) {
            clusterLogCantFailover(CLUSTER_CANT_FAILOVER_DATA_AGE);
            return;
        }
    }

    /* If the previous failover attempt timedout and the retry time has
     * elapsed, we can setup a new one. */
    if (auth_age > auth_retry_time) {
        server.cluster->failover_auth_time = mstime() +
            500 + /* Fixed delay of 500 milliseconds, let FAIL msg propagate. */
            random() % 500; /* Random delay between 0 and 500 milliseconds. */
        server.cluster->failover_auth_count = 0;
        server.cluster->failover_auth_sent = 0;
        server.cluster->failover_auth_rank = clusterGetSlaveRank();

        /* We add another delay that is proportional to the slave rank.
         * Specifically 1 second * rank. This way slaves that have a probably
         * less updated replication offset, are penalized. */
        server.cluster->failover_auth_time +=
            server.cluster->failover_auth_rank * 1000;

        /* However if this is a manual failover, no delay is needed. */
        if (server.cluster->mf_end) {
            server.cluster->failover_auth_time = mstime();
            server.cluster->failover_auth_rank = 0;
        }

        serverLog(LL_WARNING,
            "Start of election delayed for %lld milliseconds "
            "(rank #%d, offset %lld).",
            server.cluster->failover_auth_time - mstime(),
            server.cluster->failover_auth_rank,
            replicationGetSlaveOffset());

        /* Now that we have a scheduled election, broadcast our offset
         * to all the other slaves so that they'll updated their offsets
         * if our offset is better. */
        clusterBroadcastPong(CLUSTER_BROADCAST_LOCAL_SLAVES);
        return;
    }

    /* ...... */
}

 

 

 
