[Translation] Redis Cluster Specification (Part 1)

Original: https://redis.io/topics/cluster-spec
The (reportedly) official Chinese translation: http://www.redis.cn/topics/cluster-spec.html
Another translation, which I personally find better: https://www.cnblogs.com/kaleidoscope/p/9635163.html

Redis Cluster Specification

Welcome to the Redis Cluster Specification. Here you'll find information about algorithms and design rationales of Redis Cluster. This document is a work in progress as it is continuously synchronized with the actual implementation of Redis.

Main properties and rationales of the design

Redis Cluster goals

Redis Cluster is a distributed implementation of Redis with the following goals, in order of importance in the design:

  • High performance and linear scalability up to 1000 nodes. There are no proxies, asynchronous replication is used, and no merge operations are performed on values.

  • Acceptable degree of write safety: the system tries (in a best-effort way) to retain all the writes originating from clients connected with the majority of the master nodes. Usually there are small windows where acknowledged writes can be lost. Windows to lose acknowledged writes are larger when clients are in a minority partition.

Translator's note: this point suggests that under a network partition (P), Redis Cluster chooses AP over CP. The "small windows" come from asynchronous replication; the window grows to roughly the duration of the partition when the client is on the minority side.

  • Availability: Redis Cluster is able to survive partitions where the majority of the master nodes are reachable and there is at least one reachable slave for every master node that is no longer reachable. Moreover using replicas migration, masters no longer replicated by any slave will receive one from a master which is covered by multiple slaves.

Translator's note: replica migration helps ensure every master keeps at least one slave, so that a master failure does not leave some slots unservable.

What is described in this document is implemented in Redis 3.0 or greater.

Implemented subset

Redis Cluster implements all the single key commands available in the non-distributed version of Redis. Commands performing complex multi-key operations like Set type unions or intersections are implemented as well as long as the keys all hash to the same slot.

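As a quick illustration of the same-slot rule (using the hash tags introduced next; addresses and stored values are assumed for the example), a redis-cli session might look like this, where the second MGET spans two slots and is rejected with the CROSSSLOT error rather than being split across nodes:

127.0.0.1:7000> MGET {user:1000}.name {user:1000}.surname
1) "Angela"
2) "White"
127.0.0.1:7000> MGET user:1000.name user:1001.name
(error) CROSSSLOT Keys in request don't hash to the same slot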

Redis Cluster implements a concept called hash tags that can be used in order to force certain keys to be stored in the same hash slot. However during manual resharding, multi-key operations may become unavailable for some time while single key operations are always available.

Translator's note: normally the slot is derived by taking CRC16 of the key and reducing it modulo 16384; hash tags exist to sidestep that default by hashing only part of the key. During a manual resharding, multi-key operations relying on hash tags may be temporarily unavailable.

Redis Cluster does not support multiple databases like the stand alone version of Redis. There is just database 0 and the SELECT command is not allowed.

Clients and Servers roles in the Redis Cluster protocol

In Redis Cluster nodes are responsible for holding the data, and taking the state of the cluster, including mapping keys to the right nodes. Cluster nodes are also able to auto-discover other nodes, detect non-working nodes, and promote slave nodes to master when needed in order to continue to operate when a failure occurs.

To perform their tasks all the cluster nodes are connected using a TCP bus and a binary protocol, called the Redis Cluster Bus. Every node is connected to every other node in the cluster using the cluster bus. Nodes use a gossip protocol to propagate information about the cluster in order to discover new nodes, to send ping packets to make sure all the other nodes are working properly, and to send cluster messages needed to signal specific conditions. The cluster bus is also used in order to propagate Pub/Sub messages across the cluster and to orchestrate manual failovers when requested by users (manual failovers are failovers which are not initiated by the Redis Cluster failure detector, but by the system administrator directly).

Since cluster nodes are not able to proxy requests, clients may be redirected to other nodes using redirection errors -MOVED and -ASK. The client is in theory free to send requests to all the nodes in the cluster, getting redirected if needed, so the client is not required to hold the state of the cluster. However clients that are able to cache the map between keys and nodes can improve the performance in a sensible way.

Translator's note: it is not entirely clear what "the map between keys and nodes" is meant to be. Computing CRC16(key) mod 16384 on the client to find the slot and then caching a slot-to-node map would make more sense: the number of keys is effectively unbounded, so caching a full key-to-node map could exhaust memory. If one did cache per-key results, a size-capped map with an eviction policy (Guava has a ready-made implementation) would work: look the key up in the map on each request, and recompute the slot on a miss.

Write safety

Redis Cluster uses asynchronous replication between nodes, and last failover wins implicit merge function. This means that the last elected master dataset eventually replaces all the other replicas. There is always a window of time when it is possible to lose writes during partitions. However these windows are very different in the case of a client that is connected to the majority of masters, and a client that is connected to the minority of masters.

Translator's note: a "lost write" here is one the server acknowledged to the client but that can no longer be read back afterwards.

Redis Cluster tries harder to retain writes that are performed by clients connected to the majority of masters, compared to writes performed in the minority side. The following are examples of scenarios that lead to loss of acknowledged writes received in the majority partitions during failures:

  1. A write may reach a master, but while the master may be able to reply to the client, the write may not be propagated to slaves via the asynchronous replication used between master and slave nodes. If the master dies without the write reaching the slaves, the write is lost forever if the master is unreachable for a long enough period that one of its slaves is promoted. This is usually hard to observe in the case of a total, sudden failure of a master node since masters try to reply to clients (with the acknowledge of the write) and slaves (propagating the write) at about the same time. However it is a real world failure mode.

  2. Another theoretically possible failure mode where writes are lost is the following:
  • A master is unreachable because of a partition.
  • It gets failed over by one of its slaves.
  • After some time it may be reachable again.
  • A client with an out-of-date routing table may write to the old master before it is converted into a slave (of the new master) by the cluster.

Translator's note: if the old master stays down long enough for one of its slaves to be promoted, it will be demoted to a slave (of the new master) once it comes back online.

The second failure mode is unlikely to happen because master nodes unable to communicate with the majority of the other masters for enough time to be failed over will no longer accept writes, and when the partition is fixed writes are still refused for a small amount of time to allow other nodes to inform about configuration changes. This failure mode also requires that the client's routing table has not yet been updated.

Translator's note: the point is that the conditions under which this failure mode can occur are extremely narrow.

Writes targeting the minority side of a partition have a larger window in which to get lost. For example, Redis Cluster loses a non-trivial number of writes on partitions where there is a minority of masters and at least one or more clients, since all the writes sent to the masters may potentially get lost if the masters are failed over in the majority side.

Specifically, for a master to be failed over it must be unreachable by the majority of masters for at least NODE_TIMEOUT, so if the partition is fixed before that time, no writes are lost. When the partition lasts for more than NODE_TIMEOUT, all the writes performed in the minority side up to that point may be lost. However the minority side of a Redis Cluster will start refusing writes as soon as NODE_TIMEOUT time has elapsed without contact with the majority, so there is a maximum window after which the minority becomes no longer available. Hence, no writes are accepted or lost after that time.

Availability

Redis Cluster is not available in the minority side of the partition. In the majority side of the partition assuming that there are at least the majority of masters and a slave for every unreachable master, the cluster becomes available again after NODE_TIMEOUT time plus a few more seconds required for a slave to get elected and failover its master (failovers are usually executed in a matter of 1 or 2 seconds).

This means that Redis Cluster is designed to survive failures of a few nodes in the cluster, but it is not a suitable solution for applications that require availability in the event of large net splits.

In the example of a cluster composed of N master nodes where every node has a single slave, the majority side of the cluster will remain available as long as a single node is partitioned away, and will remain available with a probability of 1-(1/(N*2-1)) when two nodes are partitioned away (after the first node fails we are left with N*2-1 nodes in total, and the probability of the only master without a replica to fail is 1/(N*2-1)).

For example, in a cluster with 5 nodes and a single slave per node, there is a 1/(5*2-1) = 11.11% probability that after two nodes are partitioned away from the majority, the cluster will no longer be available.

Translator's note: in plain terms, take a cluster of three master/slave pairs A/A', B/B', C/C'. If one node fails at random (say A), the cluster keeps serving. Five nodes remain, each equally likely to fail next; as long as A' survives the cluster is unaffected, so it can absorb a second failure with probability 4/5.

Thanks to a Redis Cluster feature called replicas migration the Cluster availability is improved in many real world scenarios by the fact that replicas migrate to orphaned masters (masters no longer having replicas). So at every successful failure event, the cluster may reconfigure the slaves layout in order to better resist the next failure.

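How eagerly replicas migrate is tunable. A master only donates a replica if it would still be left with a minimum number of healthy ones, controlled by a redis.conf directive (excerpt below; the value shown is the default):

# A master must keep at least this many working replicas before one of
# its replicas is allowed to migrate to a master that has none.
cluster-migration-barrier 1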

Performance

In Redis Cluster nodes don't proxy commands to the right node in charge for a given key, but instead they redirect clients to the right nodes serving a given portion of the key space.

Eventually clients obtain an up-to-date representation of the cluster and which node serves which subset of keys, so during normal operations clients directly contact the right nodes in order to send a given command.

Because of the use of asynchronous replication, nodes do not wait for other nodes' acknowledgment of writes (if not explicitly requested using the WAIT command).

Also, because multi-key commands are only limited to near keys, data is never moved between nodes except when resharding.

Translator's note: "near keys" means keys that hash to the same slot.

Normal operations are handled exactly as in the case of a single Redis instance. This means that in a Redis Cluster with N master nodes you can expect the same performance as a single Redis instance multiplied by N as the design scales linearly. At the same time the query is usually performed in a single round trip, since clients usually retain persistent connections with the nodes, so latency figures are also the same as the single standalone Redis node case.

Very high performance and scalability while preserving weak but reasonable forms of data safety and availability is the main goal of Redis Cluster.

Why merge operations are avoided

Redis Cluster design avoids conflicting versions of the same key-value pair in multiple nodes as in the case of the Redis data model this is not always desirable. Values in Redis are often very large; it is common to see lists or sorted sets with millions of elements. Also data types are semantically complex. Transferring and merging these kind of values can be a major bottleneck and/or may require the non-trivial involvement of application-side logic, additional memory to store meta-data, and so forth.

There are no strict technological limits here. CRDTs or synchronously replicated state machines can model complex data types similar to Redis. However, the actual run time behavior of such systems would not be similar to Redis Cluster. Redis Cluster was designed in order to cover the exact use cases of the non-clustered Redis version.

Overview of Redis Cluster main components

Keys distribution model

The key space is split into 16384 slots, effectively setting an upper limit for the cluster size of 16384 master nodes (however the suggested max size of nodes is in the order of ~ 1000 nodes).

Each master node in a cluster handles a subset of the 16384 hash slots. The cluster is stable when there is no cluster reconfiguration in progress (i.e. where hash slots are being moved from one node to another). When the cluster is stable, a single hash slot will be served by a single node (however the serving node can have one or more slaves that will replace it in the case of net splits or failures, and that can be used in order to scale read operations where reading stale data is acceptable).

The base algorithm used to map keys to hash slots is the following (read the next paragraph for the hash tag exception to this rule):

HASH_SLOT = CRC16(key) mod 16384

The CRC16 is specified as follows:

  • Name: XMODEM (also known as ZMODEM or CRC-16/ACORN)
  • Width: 16 bit
  • Poly: 1021 (That is actually x^16 + x^12 + x^5 + 1)
  • Initialization: 0000
  • Reflect Input byte: False
  • Reflect Output CRC: False
  • Xor constant to output CRC: 0000
  • Output for "123456789": 31C3
    14 out of 16 CRC16 output bits are used (this is why there is a modulo 16384 operation in the formula above).

In our tests CRC16 behaved remarkably well in distributing different kinds of keys evenly across the 16384 slots.

Note: A reference implementation of the CRC16 algorithm used is available in the Appendix A of this document.

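Tying the test vector above to the formula: CRC16("123456789") = 0x31C3, which is 12739 in decimal, and 12739 mod 16384 is still 12739, so that is the slot. You can confirm this against a running node with the CLUSTER KEYSLOT command:

127.0.0.1:7000> CLUSTER KEYSLOT "123456789"
(integer) 12739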

Keys hash tags

There is an exception for the computation of the hash slot that is used in order to implement hash tags. Hash tags are a way to ensure that multiple keys are allocated in the same hash slot. This is used in order to implement multi-key operations in Redis Cluster.

计算哈希槽的时候,有一种例外,被称为哈希标签。它能确保多个key被分配到同一个哈希槽上。这是为了实现Redis集群的多key操作。

In order to implement hash tags, the hash slot for a key is computed in a slightly different way in certain conditions. If the key contains a "{...}" pattern only the substring between { and } is hashed in order to obtain the hash slot. However since it is possible that there are multiple occurrences of { or } the algorithm is well specified by the following rules:

  • IF the key contains a { character.
  • AND IF there is a } character to the right of {
  • AND IF there are one or more characters between the first occurrence of { and the first occurrence of }.

Then instead of hashing the key, only what is between the first occurrence of { and the following first occurrence of } is hashed.

Translator's note: the "first pair" is not necessarily the pair the eye picks out: for a key like {{...}}, the extracted substring is {... . In short: find the first {, then the first } after it; if anything sits in between, only that substring is hashed. The examples below spell this out, though the Ruby code that follows may be the most direct statement of the rule.

Examples:

  • The two keys {user1000}.following and {user1000}.followers will hash to the same hash slot since only the substring user1000 will be hashed in order to compute the hash slot.
  • For the key foo{}{bar} the whole key will be hashed as usually since the first occurrence of { is followed by } on the right without characters in the middle.
  • For the key foo{{bar}}zap the substring {bar will be hashed, because it is the substring between the first occurrence of { and the first occurrence of } on its right.
  • For the key foo{bar}{zap} the substring bar will be hashed, since the algorithm stops at the first valid or invalid (without bytes inside) match of { and }.
  • What follows from the algorithm is that if the key starts with {}, it is guaranteed to be hashed as a whole. This is useful when using binary data as key names.

Adding the hash tags exception, the following is an implementation of the HASH_SLOT function in Ruby and C language.

Ruby example code:

def HASH_SLOT(key)
    s = key.index "{"
    if s
        e = key.index "}",s+1
        if e && e != s+1
            key = key[s+1..e-1]
        end
    end
    crc16(key) % 16384
end

C example code:

unsigned int HASH_SLOT(char *key, int keylen) {
    int s, e; /* start-end indexes of { and } */

    /* Search the first occurrence of '{'. */
    for (s = 0; s < keylen; s++)
        if (key[s] == '{') break;

    /* No '{' ? Hash the whole key. This is the base case. */
    if (s == keylen) return crc16(key,keylen) & 16383;

    /* '{' found? Check if we have the corresponding '}'. */
    for (e = s+1; e < keylen; e++)
        if (key[e] == '}') break;

    /* No '}' or nothing between {} ? Hash the whole key. */
    if (e == keylen || e == s+1) return crc16(key,keylen) & 16383;

    /* If we are here there is both a { and a } on its right. Hash
     * what is in the middle between { and }. */
    return crc16(key+s+1,e-s-1) & 16383;
}
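A small driver for the C function above makes the tag rules easy to check by hand. It assumes the crc16() reference implementation from Appendix A is compiled in; the printed slot numbers depend on crc16, but the first two lines must agree with each other:

#include <stdio.h>
#include <string.h>

unsigned int HASH_SLOT(char *key, int keylen); /* the function above */

int main(void) {
    char *keys[] = { "{user1000}.following", "{user1000}.followers",
                     "foo{}{bar}", "foo{{bar}}zap", "foo{bar}{zap}" };
    /* The first two keys share the user1000 tag and must map to the
     * same slot; the remaining keys exercise the edge cases listed
     * in the examples above. */
    for (int i = 0; i < 5; i++)
        printf("%-22s -> slot %u\n", keys[i],
               HASH_SLOT(keys[i], (int)strlen(keys[i])));
    return 0;
}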

Cluster nodes attributes

Every node has a unique name in the cluster. The node name is the hex representation of a 160 bit random number, obtained the first time a node is started (usually using /dev/urandom). The node will save its ID in the node configuration file, and will use the same ID forever, or at least as long as the node configuration file is not deleted by the system administrator, or a hard reset is requested via the CLUSTER RESET command.

The node ID is used to identify every node across the whole cluster. It is possible for a given node to change its IP address without any need to also change the node ID. The cluster is also able to detect the change in IP/port and reconfigure using the gossip protocol running over the cluster bus.

The node ID is not the only information associated with each node, but is the only one that is always globally consistent. Every node has also the following set of information associated. Some information is about the cluster configuration detail of this specific node, and is eventually consistent across the cluster. Some other information, like the last time a node was pinged, is instead local to each node.

Every node maintains the following information about other nodes that it is aware of in the cluster: The node ID, IP and port of the node, a set of flags, what is the master of the node if it is flagged as slave, last time the node was pinged and the last time the pong was received, the current configuration epoch of the node (explained later in this specification), the link state and finally the set of hash slots served.

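As a rough sketch in C (field names are assumed here, loosely modeled on the clusterNode struct in cluster.h), the per-peer state described above looks something like:

#include <stdint.h>

#define CLUSTER_SLOTS   16384
#define CLUSTER_NAMELEN 40                 /* 160-bit node ID, hex encoded */

struct peer_state {
    char     name[CLUSTER_NAMELEN+1];      /* node ID */
    char     ip[46];                       /* peer address */
    int      port;
    int      flags;                        /* master/slave, PFAIL/FAIL, ... */
    struct peer_state *slaveof;            /* its master, if flagged as slave */
    uint64_t ping_sent;                    /* when the last ping went out */
    uint64_t pong_received;                /* when the last pong came back */
    uint64_t config_epoch;                 /* current configuration epoch */
    int      link_up;                      /* state of the cluster bus link */
    uint8_t  slots[CLUSTER_SLOTS/8];       /* bitmap of hash slots it serves */
};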

A detailed explanation of all the node fields is described in the CLUSTER NODES documentation.

The CLUSTER NODES command can be sent to any node in the cluster and provides the state of the cluster and the information for each node according to the local view the queried node has of the cluster.

The following is sample output of the CLUSTER NODES command sent to a master node in a small cluster of three nodes.

$ redis-cli cluster nodes
d1861060fe6a534d42d8a19aeb36600e18785e04 127.0.0.1:6379 myself - 0 1318428930 1 connected 0-1364
3886e65cc906bfd9b1f7e7bde468726a052d1dae 127.0.0.1:6380 master - 1318428930 1318428931 2 connected 1365-2729
d289c575dcbc4bdd2931585fd4339089e461a27d 127.0.0.1:6381 master - 1318428931 1318428931 3 connected 2730-4095

In the above listing the different fields are in order: node id, address:port, flags, last ping sent, last pong received, configuration epoch, link state, slots. Details about the above fields will be covered as soon as we talk of specific parts of Redis Cluster.

Translator's note: the ping/pong fields are milliseconds since the Unix epoch; "last ping sent" is 0 when there is no ping currently awaiting a reply.

The Cluster bus

Every Redis Cluster node has an additional TCP port for receiving incoming connections from other Redis Cluster nodes. This port is at a fixed offset from the normal TCP port used to receive incoming connections from clients. To obtain the Redis Cluster port, 10000 should be added to the normal commands port. For example, if a Redis node is listening for client connections on port 6379, the Cluster bus port 16379 will also be opened.

Node-to-node communication happens exclusively using the Cluster bus and the Cluster bus protocol: a binary protocol composed of frames of different types and sizes. The Cluster bus binary protocol is not publicly documented since it is not intended for external software devices to talk with Redis Cluster nodes using this protocol. However you can obtain more details about the Cluster bus protocol by reading the cluster.h and cluster.c files in the Redis Cluster source code.

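For reference, the ports and the node configuration file mentioned above are all driven by a few redis.conf directives. A minimal cluster-node configuration might look like this (values are typical, not mandatory):

port 6379
cluster-enabled yes
cluster-config-file nodes-6379.conf
cluster-node-timeout 15000

With port 6379 the node also listens on 16379 for the cluster bus; nodes-6379.conf is where the node persists its ID and cluster state; cluster-node-timeout is the NODE_TIMEOUT discussed earlier, in milliseconds.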

Cluster topology

Redis Cluster is a full mesh where every node is connected with every other node using a TCP connection.

In a cluster of N nodes, every node has N-1 outgoing TCP connections, and N-1 incoming connections.

These TCP connections are kept alive all the time and are not created on demand. When a node expects a pong reply in response to a ping in the cluster bus, before waiting long enough to mark the node as unreachable, it will try to refresh the connection with the node by reconnecting from scratch.

While Redis Cluster nodes form a full mesh, nodes use a gossip protocol and a configuration update mechanism in order to avoid exchanging too many messages between nodes during normal conditions, so the number of messages exchanged is not exponential.

Nodes handshake

Nodes always accept connections on the cluster bus port, and even reply to pings when received, even if the pinging node is not trusted. However, all other packets will be discarded by the receiving node if the sending node is not considered part of the cluster.

A node will accept another node as part of the cluster only in two ways:

  • If a node presents itself with a MEET message. A meet message is exactly like a PING message, but forces the receiver to accept the node as part of the cluster. Nodes will send MEET messages to other nodes only if the system administrator requests this via the following command:

    CLUSTER MEET ip port

  • A node will also register another node as part of the cluster if a node that is already trusted will gossip about this other node. So if A knows B, and B knows C, eventually B will send gossip messages to A about C. When this happens, A will register C as part of the network, and will try to connect with C.

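For example, to bootstrap trust between two freshly started nodes (addresses assumed), the administrator would run:

$ redis-cli -p 7000 cluster meet 127.0.0.1 7001
OK

From there gossip takes over: any further node met by either side will eventually be discovered by both.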

This means that as long as we join nodes in any connected graph, they'll eventually form a fully connected graph automatically. This means that the cluster is able to auto-discover other nodes, but only if there is a trusted relationship that was forced by the system administrator.

This mechanism makes the cluster more robust but prevents different Redis clusters from accidentally mixing after change of IP addresses or other network related events.

Redirection and resharding

MOVED Redirection

A Redis client is free to send queries to every node in the cluster, including slave nodes. The node will analyze the query, and if it is acceptable (that is, only a single key is mentioned in the query, or the multiple keys mentioned are all to the same hash slot) it will lookup what node is responsible for the hash slot where the key or keys belong.

If the hash slot is served by the node, the query is simply processed, otherwise the node will check its internal hash slot to node map, and will reply to the client with a MOVED error, like in the following example:

GET x
-MOVED 3999 127.0.0.1:6381

The error includes the hash slot of the key (3999) and the ip:port of the instance that can serve the query. The client needs to reissue the query to the specified node's IP address and port. Note that even if the client waits a long time before reissuing the query, and in the meantime the cluster configuration changed, the destination node will reply again with a MOVED error if the hash slot 3999 is now served by another node. The same happens if the contacted node had no updated information.

Translator's note: for example, if the contacted node believes slot 3999 lives on A while it actually lives on B, the client receives a MOVED pointing at A; when A gets the retried request and finds it cannot serve it, it redirects the client once more, this time to B.

So while from the point of view of the cluster nodes are identified by IDs we try to simplify our interface with the client just exposing a map between hash slots and Redis nodes identified by IP:port pairs.

The client is not required to, but should try to memorize that hash slot 3999 is served by 127.0.0.1:6381. This way once a new command needs to be issued it can compute the hash slot of the target key and have a greater chance of choosing the right node.

An alternative is to just refresh the whole client-side cluster layout using the CLUSTER NODES or CLUSTER SLOTS commands when a MOVED redirection is received. When a redirection is encountered, it is likely multiple slots were reconfigured rather than just one, so updating the client configuration as soon as possible is often the best strategy.

Note that when the Cluster is stable (no ongoing changes in the configuration), eventually all the clients will obtain a map of hash slots -> nodes, making the cluster efficient, with clients directly addressing the right nodes without redirections, proxies or other single point of failure entities.

A client must be also able to handle -ASK redirections that are described later in this document, otherwise it is not a complete Redis Cluster client.

Cluster live reconfiguration

Redis Cluster supports the ability to add and remove nodes while the cluster is running. Adding or removing a node is abstracted into the same operation: moving a hash slot from one node to another. This means that the same basic mechanism can be used in order to rebalance the cluster, add or remove nodes, and so forth.

  • To add a new node to the cluster an empty node is added to the cluster and some set of hash slots are moved from existing nodes to the new node.
  • To remove a node from the cluster the hash slots assigned to that node are moved to other existing nodes.
  • To rebalance the cluster a given set of hash slots are moved between nodes.

The core of the implementation is the ability to move hash slots around. From a practical point of view a hash slot is just a set of keys, so what Redis Cluster really does during resharding is to move keys from an instance to another instance. Moving a hash slot means moving all the keys that happen to hash into this hash slot.

To understand how this works we need to show the CLUSTER subcommands that are used to manipulate the slots translation table in a Redis Cluster node.

The following subcommands are available (among others not useful in this case):

  • CLUSTER ADDSLOTS slot1 [slot2] ... [slotN]
  • CLUSTER DELSLOTS slot1 [slot2] ... [slotN]
  • CLUSTER SETSLOT slot NODE node
  • CLUSTER SETSLOT slot MIGRATING node
  • CLUSTER SETSLOT slot IMPORTING node

The first two commands, ADDSLOTS and DELSLOTS, are simply used to assign (or remove) slots to a Redis node. Assigning a slot means to tell a given master node that it will be in charge of storing and serving content for the specified hash slot.

After the hash slots are assigned they will propagate across the cluster using the gossip protocol, as specified later in the configuration propagation section.

The ADDSLOTS command is usually used when a new cluster is created from scratch to assign each master node a subset of all the 16384 hash slots available.

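For example, when bootstrapping a three-master cluster by hand, an administrator might split the slot space like this (ports assumed; the {a..b} ranges are bash brace expansion, which passes every slot number as a separate argument):

$ redis-cli -p 7000 cluster addslots {0..5460}
OK
$ redis-cli -p 7001 cluster addslots {5461..10922}
OK
$ redis-cli -p 7002 cluster addslots {10923..16383}
OK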

The DELSLOTS is mainly used for manual modification of a cluster configuration or for debugging tasks: in practice it is rarely used.

The SETSLOT subcommand is used to assign a slot to a specific node ID if the SETSLOT NODE form is used. Otherwise the slot can be set in the two special states MIGRATING and IMPORTING. Those two special states are used in order to migrate a hash slot from one node to another.

  • When a slot is set as MIGRATING, the node will accept all queries that are about this hash slot, but only if the key in question exists, otherwise the query is forwarded using a -ASK redirection to the node that is target of the migration.
  • When a slot is set as IMPORTING, the node will accept all queries that are about this hash slot, but only if the request is preceded by an ASKING command. If the ASKING command was not given by the client, the query is redirected to the real hash slot owner via a -MOVED redirection error, as would happen normally.

Let's make this clearer with an example of hash slot migration. Assume that we have two Redis master nodes, called A and B. We want to move hash slot 8 from A to B, so we issue commands like this:

  • We send B: CLUSTER SETSLOT 8 IMPORTING A
  • We send A: CLUSTER SETSLOT 8 MIGRATING B

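In terms of actual commands, SETSLOT takes node IDs rather than names, so with 7000/7001 as the (assumed) ports of A and B, and <id-of-A>/<id-of-B> standing in for their 40-character node IDs:

$ redis-cli -p 7001 cluster setslot 8 importing <id-of-A>
OK
$ redis-cli -p 7000 cluster setslot 8 migrating <id-of-B>
OK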

All the other nodes will continue to point clients to node "A" every time they are queried with a key that belongs to hash slot 8, so what happens is that:

  • All queries about existing keys are processed by "A".
  • All queries about non-existing keys in A are processed by "B", because "A" will redirect clients to "B".

Translator's note: "existing keys" covers both reads and writes on keys still present on A; operations on keys that are no longer there (including creation of new keys) are bounced to B with an -ASK, which is why new keys end up on B.

This way we no longer create new keys in "A". In the meantime, a special program called redis-trib used during reshardings and Redis Cluster configuration will migrate existing keys in hash slot 8 from A to B. This is performed using the following command:

CLUSTER GETKEYSINSLOT slot count

The above command will return count keys in the specified hash slot. For every key returned, redis-trib sends node "A" a MIGRATE command, that will migrate the specified key from A to B in an atomic way (both instances are locked for the time (usually very small time) needed to migrate a key so there are no race conditions). This is how MIGRATE works:

MIGRATE target_host target_port key target_database id timeout

MIGRATE will connect to the target instance, send a serialized version of the key, and once an OK code is received, the old key from its own dataset will be deleted. From the point of view of an external client a key exists either in A or B at any given time.

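Sketching one iteration of what redis-trib does during the migration (key names and addresses assumed; 0 is the target database and 5000 the timeout in milliseconds):

$ redis-cli -p 7000 cluster getkeysinslot 8 10
1) "{tag8}user:1"
2) "{tag8}user:2"
$ redis-cli -p 7000 migrate 127.0.0.1 7001 {tag8}user:1 0 5000
OK

This is repeated until GETKEYSINSLOT returns no more keys for the slot.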

In Redis Cluster there is no need to specify a database other than 0, but MIGRATE is a general command that can be used for other tasks not involving Redis Cluster. MIGRATE is optimized to be as fast as possible even when moving complex keys such as long lists, but in Redis Cluster reconfiguring the cluster where big keys are present is not considered a wise procedure if there are latency constraints in the application using the database.

When the migration process is finally finished, the SETSLOT NODE command is sent to the two nodes involved in the migration in order to set the slots to their normal state again. The same command is usually sent to all other nodes to avoid waiting for the natural propagation of the new configuration across the cluster.

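Continuing the A/B example, once slot 8 is empty on A the final step looks like this (same placeholder node ID as before):

$ redis-cli -p 7000 cluster setslot 8 node <id-of-B>
OK
$ redis-cli -p 7001 cluster setslot 8 node <id-of-B>
OK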

ASK redirection

In the previous section we briefly talked about ASK redirection. Why can't we simply use MOVED redirection? Because while MOVED means that we think the hash slot is permanently served by a different node and the next queries should be tried against the specified node, ASK means to send only the next query to the specified node.

This is needed because the next query about hash slot 8 can be about a key that is still in A, so we always want the client to try A and then B if needed. Since this happens only for one hash slot out of 16384 available, the performance hit on the cluster is acceptable.

We need to force that client behavior, so to make sure that clients will only try node B after A was tried, node B will only accept queries of a slot that is set as IMPORTING if the client sends the ASKING command before sending the query.

Basically the ASKING command sets a one-time flag on the client that forces a node to serve a query about an IMPORTING slot.

The full semantics of ASK redirection from the point of view of the client is as follows:

  • If ASK redirection is received, send only the query that was redirected to the specified node but continue sending subsequent queries to the old node.
  • Start the redirected query with the ASKING command.
  • Don't yet update local client tables to map hash slot 8 to B.

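Put together, a redirected query looks roughly like this on the wire (key name, value, and addresses are assumed; slot 8 is mid-migration and this key has already moved to B):

127.0.0.1:7000> GET {tag8}user:1
(error) ASK 8 127.0.0.1:7001
127.0.0.1:7001> ASKING
OK
127.0.0.1:7001> GET {tag8}user:1
"some-value"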

Once hash slot 8 migration is completed, A will send a MOVED message and the client may permanently map hash slot 8 to the new IP and port pair. Note that if a buggy client performs the map earlier this is not a problem since it will not send the ASKING command before issuing the query, so B will redirect the client to A using a MOVED redirection error.

Slots migration is explained in similar terms but with different wording (for the sake of redundancy in the documentation) in the CLUSTER SETSLOT command documentation.

Clients first connection and handling of redirections

While it is possible to have a Redis Cluster client implementation that does not remember the slots configuration (the map between slot numbers and addresses of nodes serving it) in memory and only works by contacting random nodes waiting to be redirected, such a client would be very inefficient.

Redis Cluster clients should try to be smart enough to memorize the slots configuration. However this configuration is not required to be up to date. Since contacting the wrong node will simply result in a redirection, that should trigger an update of the client view.

Clients usually need to fetch a complete list of slots and mapped node addresses in two different situations:

  • At startup in order to populate the initial slots configuration.
  • When a MOVED redirection is received.

Note that a client may handle the MOVED redirection by updating just the moved slot in its table, however this is usually not efficient since often the configuration of multiple slots is modified at once (for example if a slave is promoted to master, all the slots served by the old master will be remapped). It is much simpler to react to a MOVED redirection by fetching the full map of slots to nodes from scratch.

In order to retrieve the slots configuration Redis Cluster offers an alternative to the CLUSTER NODES command that does not require parsing, and only provides the information strictly needed to clients.

The new command is called CLUSTER SLOTS and provides an array of slots ranges, and the associated master and slave nodes serving the specified range.

The following is an example of output of CLUSTER SLOTS:

127.0.0.1:7000> cluster slots
1) 1) (integer) 5461
   2) (integer) 10922
   3) 1) "127.0.0.1"
      2) (integer) 7001
   4) 1) "127.0.0.1"
      2) (integer) 7004
2) 1) (integer) 0
   2) (integer) 5460
   3) 1) "127.0.0.1"
      2) (integer) 7000
   4) 1) "127.0.0.1"
      2) (integer) 7003
3) 1) (integer) 10923
   2) (integer) 16383
   3) 1) "127.0.0.1"
      2) (integer) 7002
   4) 1) "127.0.0.1"
      2) (integer) 7005

The first two sub-elements of every element of the returned array are the start-end slots of the range. The additional elements represent address-port pairs. The first address-port pair is the master serving the slot, and the additional address-port pairs are all the slaves serving the same slot that are not in an error condition (i.e. the FAIL flag is not set).

For example the first element of the output says that slots from 5461 to 10922 (start and end included) are served by 127.0.0.1:7001, and it is possible to scale read-only load contacting the slave at 127.0.0.1:7004.

CLUSTER SLOTS is not guaranteed to return ranges that cover the full 16384 slots if the cluster is misconfigured, so clients should initialize the slots configuration map filling the target nodes with NULL objects, and report an error if the user tries to execute commands about keys that belong to unassigned slots.

Before returning an error to the caller when a slot is found to be unassigned, the client should try to fetch the slots configuration again to check if the cluster is now configured properly.

Multiple keys operations

Using hash tags, clients are free to use multi-key operations. For example the following operation is valid:

MSET {user:1000}.name Angela {user:1000}.surname White

Multi-key operations may become unavailable when a resharding of the hash slot the keys belong to is in progress.

More specifically, even during a resharding the multi-key operations targeting keys that all exist and all still hash to the same slot (either the source or destination node) are still available.

Operations on keys that don't exist or are - during the resharding - split between the source and destination nodes, will generate a -TRYAGAIN error. The client can try the operation after some time, or report back the error.

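For example, with slot 8 mid-migration and one of the two keys already moved to the destination node, a client talking to the source would see something like the following (the exact error text may vary between versions):

127.0.0.1:7000> MGET {tag8}user:1 {tag8}user:2
(error) TRYAGAIN Multiple keys request during rekeying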

As soon as migration of the specified hash slot has terminated, all multi-key operations are available again for that hash slot.

Scaling reads using slave nodes

Normally slave nodes will redirect clients to the authoritative master for the hash slot involved in a given command, however clients can use slaves in order to scale reads using the READONLY command.

READONLY tells a Redis Cluster slave node that the client is ok reading possibly stale data and is not interested in running write queries.

When the connection is in readonly mode, the cluster will send a redirection to the client only if the operation involves keys not served by the slave's master node. This may happen because:

  1. The client sent a command about hash slots never served by the master of this slave.
  2. The cluster was reconfigured (for example resharded) and the slave is no longer able to serve commands for a given hash slot.

When this happens the client should update its hashslot map as explained in the previous sections.

The readonly state of the connection can be cleared using the READWRITE command.

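A minimal session against a replica might look like this (addresses and the slot number are illustrative only):

127.0.0.1:7004> GET {user:1000}.name
(error) MOVED 1649 127.0.0.1:7001
127.0.0.1:7004> READONLY
OK
127.0.0.1:7004> GET {user:1000}.name
"Angela"
127.0.0.1:7004> READWRITE
OK

After READWRITE, the same GET would once again be answered with a MOVED redirection.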
