Redis cluster gossip

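Contents

I. Choosing which nodes to talk to

1. Every 0.1 seconds, try to reconnect to any node whose link is down

2. Every second, pick 5 random nodes and ping the one that has gone longest without a reply

3. Every 0.1 seconds, ping any node that has not answered for more than cluster-node-timeout/2

II. What a gossip ping carries

1. Information about the sending node itself

2. Information about 1/10 of the other nodes, and at least 3 of them when 1/10 comes to fewer than 3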


I. Choosing which nodes to talk to

Node selection happens in clusterCron()  /* This is executed 10 times every second */

1. Every 0.1 seconds, try to reconnect to any node whose link is down

/* Check if we have disconnected nodes and re-establish the connection. */
    di = dictGetSafeIterator(server.cluster->nodes);
    while((de = dictNext(di)) != NULL) {
        clusterNode *node = dictGetVal(de);
        ...
            link = createClusterLink(node);
            link->fd = fd;
            node->link = link;
            aeCreateFileEvent(server.el,link->fd,AE_READABLE,
                    clusterReadHandler,link);
            /* Queue a PING in the new connection ASAP: this is crucial
             * to avoid false positives in failure detection.
             *
             * If the node is flagged as MEET, we send a MEET message instead
             * of a PING one, to force the receiver to add us in its node
             * table. */
            old_ping_sent = node->ping_sent;
            clusterSendPing(link, node->flags & CLUSTER_NODE_MEET ?
                    CLUSTERMSG_TYPE_MEET : CLUSTERMSG_TYPE_PING);
            if (old_ping_sent) {
                /* If there was an active ping before the link was
                 * disconnected, we want to restore the ping time, otherwise
                 * replaced by the clusterSendPing() call. */
                node->ping_sent = old_ping_sent;
            }
            /* We can clear the flag after the first packet is sent.
             * If we'll never receive a PONG, we'll never send new packets
             * to this node. Instead after the PONG is received and we
             * are no longer in meet/handshake status, we want to send
             * normal PING packets. */
            node->flags &= ~CLUSTER_NODE_MEET;
            serverLog(LL_DEBUG,"Connecting with Node %.40s at %s:%d",
                    node->name, node->ip, node->port+CLUSTER_PORT_INCR);
        }
    }
    dictReleaseIterator(di);

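Example:

The log shows a new connection attempt to the unreachable node roughly every 0.1 s (port 12222 is the node's port 2222 plus CLUSTER_PORT_INCR):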
76879:M 09 Jun 17:00:49.276 . Connecting with Node 64cdc10096644b5bc3624f41ade916983806c47c at 10.200.35.93:12222
76879:M 09 Jun 17:00:49.276 . I/O error reading from node link: Connection refused
76879:M 09 Jun 17:00:49.376 . Connecting with Node 64cdc10096644b5bc3624f41ade916983806c47c at 10.200.35.93:12222
76879:M 09 Jun 17:00:49.376 . I/O error reading from node link: Connection refused
76879:M 09 Jun 17:00:49.477 . Connecting with Node 64cdc10096644b5bc3624f41ade916983806c47c at 10.200.35.93:12222
76879:M 09 Jun 17:00:49.477 . I/O error reading from node link: Connection refused
76879:M 09 Jun 17:00:49.577 . Connecting with Node 64cdc10096644b5bc3624f41ade916983806c47c at 10.200.35.93:12222
76879:M 09 Jun 17:00:49.578 . I/O error reading from node link: Connection refused
76879:M 09 Jun 17:00:49.678 . Connecting with Node 64cdc10096644b5bc3624f41ade916983806c47c at 10.200.35.93:12222
76879:M 09 Jun 17:00:49.678 . I/O error reading from node link: Connection refused
76879:M 09 Jun 17:00:49.778 . Connecting with Node 64cdc10096644b5bc3624f41ade916983806c47c at 10.200.35.93:12222
76879:M 09 Jun 17:00:49.778 . I/O error reading from node link: Connection refused
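
The reason the excerpt saves and restores ping_sent is easier to see in isolation. The following is a minimal sketch, not Redis code: reconnect_node and sketch_reconnect are hypothetical names, and the struct keeps only the three fields that matter here. It assumes, as the source comment says, that sending a PING overwrites ping_sent with the current time.

```c
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for clusterNode (not the Redis struct). */
typedef struct {
    int connected;        /* 1 if a link exists, 0 if it is down */
    long long ping_sent;  /* ms timestamp of the unanswered PING, 0 if none */
    int meet;             /* 1 while the node is still flagged MEET */
} reconnect_node;

/* Mimics one pass of the reconnect branch for a single node.
 * Returns 1 if a MEET would be sent, 0 if a PING would be sent,
 * and -1 if the node already has a link and nothing happens. */
static int sketch_reconnect(reconnect_node *n, long long now_ms) {
    if (n->connected) return -1;
    n->connected = 1;

    long long old_ping_sent = n->ping_sent;
    int sent_meet = n->meet;

    /* Assumption: sending the packet stamps ping_sent with "now". */
    n->ping_sent = now_ms;

    /* Restore the older timestamp so the wait for the original,
     * still unanswered PING keeps being measured from its real start. */
    if (old_ping_sent) n->ping_sent = old_ping_sent;

    /* MEET is only sent once per handshake. */
    n->meet = 0;
    return sent_meet;
}
```

With a node whose ping_sent is 1000, calling sketch_reconnect(&n, 1100) leaves ping_sent at 1000. Without the restore step, every 0.1 s reconnect would move the timestamp forward, and the time spent waiting for a PONG would never appear to grow.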

2. Every second, pick 5 random nodes and ping the one that has gone longest without a reply

/* Ping some random node 1 time every 10 iterations, so that we usually ping
* one random node every second. */
if (!(iteration % 10)) {
    int j;
    /* Check a few random nodes and ping the one with the oldest
     * pong_received time. */
    for (j = 0; j < 5; j++) {
        de = dictGetRandomKey(server.cluster->nodes);
        clusterNode *this = dictGetVal(de);
        /* Don't ping nodes disconnected or with a ping currently active. */
        if (this->link == NULL || this->ping_sent != 0) continue;
        if (this->flags & (CLUSTER_NODE_MYSELF|CLUSTER_NODE_HANDSHAKE))
            continue;
        if (min_pong_node == NULL || min_pong > this->pong_received) {
            min_pong_node = this;
            min_pong = this->pong_received;
        }
    }
    if (min_pong_node) {
        serverLog(LL_DEBUG,"Pinging node %.40s", min_pong_node->name);
        clusterSendPing(min_pong_node->link, CLUSTERMSG_TYPE_PING);
    }
}

Example:

The log shows one node being pinged about once per second (packet type 1 is the PONG that comes back):
76879:M 09 Jun 17:00:49.879 . Pinging node 178424affacc711aa18b46f67751072576592944                      /*ping*/
76879:M 09 Jun 17:00:49.879 . --- Processing packet of type 1, 2520 bytes
76879:M 09 Jun 17:00:49.880 . pong packet received: 0x2aadd9849e00
76879:M 09 Jun 17:00:49.880 . GOSSIP 64cdc10096644b5bc3624f41ade916983806c47c 10.200.35.93:2222 master
76879:M 09 Jun 17:00:49.880 . GOSSIP 3f81377c4930f5a90479e6ec2f93941e00c5ad67 10.200.35.94:2222 slave
76879:M 09 Jun 17:00:49.880 . GOSSIP bb7664a96fc83d3f31c2649ec37894a4944ed38b 10.200.35.93:3333 master
76879:M 09 Jun 17:00:50.882 . Pinging node 8a3f1674530b066e84149d2f107c400551066b7d                      /*ping*/
76879:M 09 Jun 17:00:50.882 . --- Processing packet of type 1, 2520 bytes
76879:M 09 Jun 17:00:50.882 . pong packet received: 0x2aadd984a800
76879:M 09 Jun 17:00:50.882 . GOSSIP 3f81377c4930f5a90479e6ec2f93941e00c5ad67 10.200.35.94:2222 slave
76879:M 09 Jun 17:00:50.882 . GOSSIP e353bf55e229998bc77408b5b0fe8194cbcd2c99 10.200.35.93:4444 master
76879:M 09 Jun 17:00:50.882 . GOSSIP 64cdc10096644b5bc3624f41ade916983806c47c 10.200.35.93:2222 master
76879:M 09 Jun 17:00:51.886 . Pinging node 3f81377c4930f5a90479e6ec2f93941e00c5ad67                      /*ping*/
76879:M 09 Jun 17:00:51.887 . --- Processing packet of type 1, 2520 bytes
76879:M 09 Jun 17:00:51.887 . pong packet received: 0x2aadd984b200
76879:M 09 Jun 17:00:51.887 . GOSSIP 178424affacc711aa18b46f67751072576592944 10.200.35.94:3333 slave
76879:M 09 Jun 17:00:51.887 . GOSSIP 64cdc10096644b5bc3624f41ade916983806c47c 10.200.35.93:2222 master
76879:M 09 Jun 17:00:51.887 . GOSSIP bb7664a96fc83d3f31c2649ec37894a4944ed38b 10.200.35.93:3333 master
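
The selection rule can be tried outside Redis with a small stand-alone sketch. Everything here is an assumption made for illustration: ping_candidate and pick_ping_target are hypothetical names, a plain array replaces the node dictionary, and rand() replaces dictGetRandomKey().

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical reduced view of a node; only the fields the rule looks at. */
typedef struct {
    int has_link;                /* 0 means disconnected */
    long long ping_sent;         /* non-zero means a PING is already pending */
    long long pong_received;     /* ms timestamp of the last PONG */
    int is_myself_or_handshake;  /* 1 for nodes that must be skipped */
} ping_candidate;

/* Draws `samples` random nodes (with replacement) and returns the index of
 * the eligible one with the oldest pong_received, or -1 if none qualified. */
static int pick_ping_target(const ping_candidate *nodes, size_t count,
                            int samples) {
    int best = -1;
    if (count == 0) return -1;
    for (int j = 0; j < samples; j++) {
        size_t i = (size_t)rand() % count;
        const ping_candidate *c = &nodes[i];
        /* Skip disconnected nodes and nodes with a PING in flight. */
        if (!c->has_link || c->ping_sent != 0) continue;
        if (c->is_myself_or_handshake) continue;
        if (best < 0 || nodes[best].pong_received > c->pong_received)
            best = (int)i;
    }
    return best;
}
```

Because the draws are random and may repeat, a given node can be missed for several seconds in a row, and a round can even end with no target at all. That is why the once-per-second random ping is backed by the timeout-based check in step 3.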

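3. Every 0.1 seconds, ping any node that has not answered for more than cluster-node-timeout/2

clusterCron() pings a node directly when no ping is pending and its last PONG is older than half the timeout. The excerpt below covers the other case, where a ping is already pending: the link is dropped, and the reconnect logic from step 1 then opens a new connection and sends a fresh PING on it.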

/* If we are waiting for the PONG more than half the cluster
 * timeout, reconnect the link: maybe there is a connection
 * issue even if the node is alive. */
if (node->link && /* is connected */
    now - node->link->ctime >
    server.cluster_node_timeout && /* was not already reconnected */
    node->ping_sent && /* we already sent a ping */
    node->pong_received < node->ping_sent && /* still waiting pong */
    /* and we are waiting for the pong more than timeout/2 */
    now - node->ping_sent > server.cluster_node_timeout/2)
{
    /* Disconnect the link, it will be reconnected automatically. */
    freeClusterLink(node->link);
}
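
The five conditions in that if statement can be read as a single predicate. Below is a minimal sketch under stated assumptions: link_state and should_drop_link are hypothetical names, and all times are plain millisecond values passed in by the caller instead of being read from the server struct.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the fields the check reads. */
typedef struct {
    bool has_link;            /* node->link != NULL */
    long long link_ctime;     /* when the current link was created (ms) */
    long long ping_sent;      /* pending PING timestamp, 0 if none (ms) */
    long long pong_received;  /* last PONG timestamp (ms) */
} link_state;

/* True when the link should be freed so that it gets re-established. */
static bool should_drop_link(const link_state *s, long long now,
                             long long node_timeout) {
    return s->has_link &&
           now - s->link_ctime > node_timeout &&  /* not just reconnected */
           s->ping_sent != 0 &&                   /* a PING is pending */
           s->pong_received < s->ping_sent &&     /* and still unanswered */
           now - s->ping_sent > node_timeout / 2; /* for more than timeout/2 */
}
```

For example, with node_timeout = 15000, a link created at 0 with a PING sent at 10000 and the last PONG at 5000 is dropped at now = 20000 (10000 ms waiting, more than 7500) but kept at now = 16000 (6000 ms waiting). The link_ctime condition stops a freshly recreated link from being torn down again right away.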

 

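To summarize:

1. Every 0.1 s, try to reconnect to any node whose link is down

2. Every second, sample 5 random nodes and ping the one that has gone longest without a reply

3. Every 0.1 s, ping any node that has not answered for more than cluster-node-timeout/2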

 

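II. What a gossip ping carries

1. Information about the sending node itself

2. Information about 1/10 of the other nodes, and at least 3 of them when 1/10 comes to fewer than 3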

void clusterSendPing(clusterLink *link, int type) {
    unsigned char *buf;
    clusterMsg *hdr;
    int gossipcount = 0; /* Number of gossip sections added so far. */
    int wanted; /* Number of gossip sections we want to append if possible. */
    int totlen; /* Total packet length. */
    /* freshnodes is the max number of nodes we can hope to append at all:
     * nodes available minus two (ourself and the node we are sending the
     * message to). However practically there may be less valid nodes since
     * nodes in handshake state, disconnected, are not considered. */
    int freshnodes = dictSize(server.cluster->nodes)-2;
    /* How many gossip sections we want to add? 1/10 of the number of nodes
     * and anyway at least 3. Why 1/10?
     *
     * If we have N masters, with N/10 entries, and we consider that in
     * node_timeout we exchange with each other node at least 4 packets
     * (we ping in the worst case in node_timeout/2 time, and we also
     * receive two pings from the host), we have a total of 8 packets
     * in the node_timeout*2 falure reports validity time. So we have
     * that, for a single PFAIL node, we can expect to receive the following
     * number of failure reports (in the specified window of time):
     *
     * PROB * GOSSIP_ENTRIES_PER_PACKET * TOTAL_PACKETS:
     *
     * PROB = probability of being featured in a single gossip entry,
     *        which is 1 / NUM_OF_NODES.
     * ENTRIES = 10.
     * TOTAL_PACKETS = 2 * 4 * NUM_OF_MASTERS.
     *
     * If we assume we have just masters (so num of nodes and num of masters
     * is the same), with 1/10 we always get over the majority, and specifically
     * 80% of the number of nodes, to account for many masters failing at the
     * same time.
     *
     * Since we have non-voting slaves that lower the probability of an entry
     * to feature our node, we set the number of entires per packet as
     * 10% of the total nodes we have. */
    wanted = floor(dictSize(server.cluster->nodes)/10);
    if (wanted < 3) wanted = 3;
    if (wanted > freshnodes) wanted = freshnodes;
    ...
}
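
The arithmetic behind the "80%" figure in the source comment can be written out as a worked equation. This is one reading of the comment, not an official derivation: it takes N as the number of nodes, assumes all of them are masters, and interprets the per-packet entry count as N/10 (the comment's "ENTRIES = 10" line is read that way here because only N/10 reproduces the stated 80%).

```latex
\begin{align*}
\text{expected reports}
  &= \text{PROB} \times \text{ENTRIES} \times \text{TOTAL\_PACKETS} \\
  &= \frac{1}{N} \times \frac{N}{10} \times (2 \times 4 \times N) \\
  &= 0.8\,N
\end{align*}
```

So within the node_timeout*2 validity window, a single PFAIL node is expected to be reported about 0.8N times, which is above the majority of N/2 that the comment refers to.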

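Example:

In a cluster with 3 masters and 3 slaves, 6/10 = 0.6 (0 after integer division), which is below the minimum of 3, so each ping carries information about 3 other nodes.
One node pings e353bf55e229998bc77408b5b0fe8194cbcd2c99: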
76879:M 09 Jun 17:00:52.889 . Pinging node e353bf55e229998bc77408b5b0fe8194cbcd2c99    
 
In the log of e353bf55e229998bc77408b5b0fe8194cbcd2c99, the incoming ping (packet type 0) carries 3 gossip entries:
77004:M 09 Jun 17:00:52.889 . --- Processing packet of type 0, 2520 bytes
77004:M 09 Jun 17:00:52.889 . Ping packet received: (nil)
77004:M 09 Jun 17:00:52.889 . ping packet received: (nil)
77004:M 09 Jun 17:00:52.889 . GOSSIP 64cdc10096644b5bc3624f41ade916983806c47c xxx:2222 master
77004:M 09 Jun 17:00:52.889 . GOSSIP 3f81377c4930f5a90479e6ec2f93941e00c5ad67 xxx:2222 slave
77004:M 09 Jun 17:00:52.889 . GOSSIP 178424affacc711aa18b46f67751072576592944 xxx:3333 slave
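
The three lines that compute wanted in clusterSendPing() are easy to check for other cluster sizes. Here is a small sketch, assuming the cluster size is passed in as a plain int: gossip_wanted is a hypothetical helper name, and the integer division stands in for floor(dictSize(...)/10).

```c
/* Number of gossip sections a ping tries to carry for a cluster of
 * `cluster_size` known nodes. Assumes cluster_size >= 2, so that
 * freshnodes (everyone except the sender and the receiver) is not negative. */
static int gossip_wanted(int cluster_size) {
    int freshnodes = cluster_size - 2;
    int wanted = cluster_size / 10;               /* 1/10 of the nodes */
    if (wanted < 3) wanted = 3;                   /* but at least 3 */
    if (wanted > freshnodes) wanted = freshnodes; /* and no more than exist */
    return wanted;
}
```

gossip_wanted(6) returns 3, which matches the three GOSSIP lines in the log above. A 100-node cluster gives 10, and a 3-node cluster gives 1, because only one node is left once the sender and the receiver are excluded.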

 
