[Original] RabbitMQ: Clustering and Network Partitions (translation)


Clustering and Network Partitions

RabbitMQ clusters do not tolerate network partitions well. If you are thinking of clustering across a WAN, don't. Use the federation or shovel plugin instead.

However, sometimes accidents happen. This page documents how to detect network partitions, some of the bad effects that may occur during a partition, and how to recover from one.

RabbitMQ stores information about queues, exchanges, bindings and so on in Erlang's distributed database, Mnesia. Many of the details of what happens around network partitions stem from Mnesia's behaviour.

Detecting network partitions

Mnesia will typically determine that a node is down if another node is unable to contact it for a minute or so (see the page on net_ticktime). If two nodes come back into contact, each having thought the other was down, Mnesia will determine that a partition has occurred. This is written to the RabbitMQ log in a form like:
=ERROR REPORT==== 15-Oct-2012::18:02:30 ===
Mnesia(rabbit@smacmullen): ** ERROR ** mnesia_event got
    {inconsistent_database, running_partitioned_network, hare@smacmullen}
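A log entry of this shape can be picked out mechanically. The following is a minimal Python sketch, assuming the event tuple always appears in the form shown in the excerpt above; the function name partitioned_peers is hypothetical and not part of RabbitMQ.

```python
import re

# Matches the Mnesia event tuple from the log excerpt above and captures the
# name of the peer node. The exact log layout is an assumption based on that
# single excerpt.
PARTITION_EVENT = re.compile(
    r"\{inconsistent_database,\s*running_partitioned_network,\s*([^}\s]+)\}"
)


def partitioned_peers(log_text):
    """Return the peer nodes named in partition events, in order, without duplicates."""
    seen = []
    for match in PARTITION_EVENT.finditer(log_text):
        node = match.group(1)
        if node not in seen:
            seen.append(node)
    return seen


sample = """=ERROR REPORT==== 15-Oct-2012::18:02:30 ===
Mnesia(rabbit@smacmullen): ** ERROR ** mnesia_event got
    {inconsistent_database, running_partitioned_network, hare@smacmullen}
"""
print(partitioned_peers(sample))
```

An empty result only means that no such entry was found in the text that was scanned; it says nothing about log files that were rotated away.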
RabbitMQ nodes record whether this event has ever occurred while the node is up, and expose this information through rabbitmqctl cluster_status and the management plugin.

rabbitmqctl cluster_status will normally show an empty list for partitions:
# rabbitmqctl cluster_status
Cluster status of node rabbit@smacmullen ...
[{nodes,[{disc,[hare@smacmullen,rabbit@smacmullen]}]},
 {running_nodes,[rabbit@smacmullen,hare@smacmullen]},
 {partitions,[]}]
...done.
However, if a network partition has occurred, information about the partitions will appear there:
# rabbitmqctl cluster_status
Cluster status of node rabbit@smacmullen ...
[{nodes,[{disc,[hare@smacmullen,rabbit@smacmullen]}]},
 {running_nodes,[rabbit@smacmullen,hare@smacmullen]},
 {partitions,[{rabbit@smacmullen,[hare@smacmullen]},
              {hare@smacmullen,[rabbit@smacmullen]}]}]
...done.
The management plugin API returns partition information for each node under partitions in /api/nodes. The management plugin UI shows a large red warning on the overview page if a partition has occurred.
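As an illustration of how the /api/nodes data might be checked, here is a small Python sketch. It works on an already decoded JSON body and assumes each node object carries a "name" field alongside the "partitions" list; the sample data is hypothetical and modelled on the cluster_status output above. Fetching the body over HTTP is deliberately left out.

```python
def nodes_reporting_partitions(nodes):
    """Map node name -> peers it is partitioned from, for nodes with a non-empty list."""
    return {
        node["name"]: list(node.get("partitions", []))
        for node in nodes
        if node.get("partitions")
    }


# Hypothetical, trimmed-down response body for illustration only.
sample_nodes = [
    {"name": "rabbit@smacmullen", "partitions": ["hare@smacmullen"]},
    {"name": "hare@smacmullen", "partitions": ["rabbit@smacmullen"]},
]
print(nodes_reporting_partitions(sample_nodes))
```

A healthy cluster would yield an empty dictionary, which makes the function easy to plug into a monitoring check.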

During a network partition

While a network partition is in place, the two (or more!) sides of the cluster can evolve independently, with each side thinking the other has crashed. Queues, bindings and exchanges can be created or deleted separately on each side. Mirrored queues which are split across the partition will end up with one master on each side of the partition, again with each side acting independently. Other undefined and weird behaviour may occur.

It is important to understand that this state of affairs persists after network connectivity is restored. The cluster will continue to act in this way until you take action to fix it.

Partitions caused by suspend / resume

While we refer to "network" partitions, a partition is really any case in which communication between the nodes of a cluster is interrupted without any node failing. In addition to network failures, suspending and resuming an entire OS can also cause partitions when done to running cluster nodes: the suspended node will not consider itself to have failed, or even stopped, but the other nodes in the cluster will consider it to have done so.

While you could suspend a cluster node by running it on a laptop and closing the lid, the most common cause is a virtual machine being suspended by the hypervisor. It is fine to run RabbitMQ clusters in virtualised environments, but you should make sure that VMs are not suspended while running. Note that some virtualisation features, such as migrating a VM from one host to another, tend to involve the VM being suspended.

Partitions caused by suspend and resume tend to be asymmetrical: the suspended node will not necessarily see the other nodes as having gone down, but it will be seen as down by the rest of the cluster. This has particular implications for pause_minority mode.

Recovering from a network partition

To recover from a network partition, first choose the partition which you trust the most. This partition becomes the authority for the state of Mnesia; any changes which have occurred on other partitions will be lost.

Stop all nodes in the other partitions, then start them all up again. When they rejoin the cluster they will restore their state from the trusted partition.

Finally, you should also restart all the nodes in the trusted partition to clear the warning.

It may be simpler to stop the whole cluster and start it again; if you do so, make sure that the first node you start is from the trusted partition.
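The manual procedure above can be written down as an ordered plan. The Python sketch below only computes the order of operations; how each step is carried out (for example through rabbitmqctl or a service manager) is outside its scope, and the function name recovery_plan and the action labels are hypothetical.

```python
def recovery_plan(partitions, trusted_index):
    """Return ordered (action, node) steps for the manual recovery described above.

    partitions    -- list of partitions, each a list of node names
    trusted_index -- index of the partition chosen as the authority
    """
    trusted = partitions[trusted_index]
    others = [
        node
        for index, partition in enumerate(partitions)
        if index != trusted_index
        for node in partition
    ]
    steps = [("stop", node) for node in others]       # stop every untrusted node
    steps += [("start", node) for node in others]     # they rejoin and resync
    steps += [("restart", node) for node in trusted]  # clears the partition warning
    return steps


plan = recovery_plan([["rabbit@a"], ["hare@b", "fox@c"]], trusted_index=0)
for action, node in plan:
    print(action, node)
```

Keeping the plan as data makes it easy to review before anything is actually stopped.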

Automatically handling partitions

RabbitMQ also offers three ways to deal with network partitions automatically: pause-minority mode, pause-if-all-down mode and autoheal mode. (The default behaviour is referred to as ignore mode.)

In pause-minority mode RabbitMQ will automatically pause cluster nodes which determine themselves to be in a minority (i.e. half or fewer of the total number of nodes) after seeing other nodes go down. It therefore chooses partition tolerance over availability from the CAP theorem. This ensures that, in the event of a network partition, the nodes of at most one partition will continue to run. The minority nodes pause as soon as a partition starts, and start again when the partition ends.
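The minority rule comes down to simple arithmetic. A minimal Python sketch, assuming the node counts itself among the nodes it can reach; the function name is hypothetical and this is not RabbitMQ's actual implementation:

```python
def should_pause_minority(reachable_nodes, total_nodes):
    """Return True if a node should pause under the pause-minority rule.

    reachable_nodes -- the node itself plus every peer it can still contact
    total_nodes     -- the number of nodes in the whole cluster

    A node keeps running only when it sees a strict majority, i.e. more than
    half of all nodes.
    """
    return 2 * reachable_nodes <= total_nodes


# Three-node cluster split 2 / 1: only the isolated node pauses.
print(should_pause_minority(2, 3), should_pause_minority(1, 3))
# Four-node cluster split 2 / 2: both halves pause.
print(should_pause_minority(2, 4))
```

Comparing 2 * reachable_nodes with total_nodes avoids any rounding questions for clusters with an odd number of nodes.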

In pause-if-all-down mode, RabbitMQ will automatically pause cluster nodes which cannot reach any of the listed nodes. This is close to pause-minority mode, but it allows an administrator to decide which nodes to prefer instead of relying on the context. For instance, if the cluster is made of two nodes in datacenter A and two nodes in datacenter B, and the link between the datacenters is lost, pause-minority mode will pause all nodes. In pause-if-all-down mode, if the administrator listed the two nodes in datacenter A, only the nodes in datacenter B will pause. Note that the listed nodes may themselves end up split across both sides of a partition: in this situation, no node will pause. That is why there is an additional ignore/autoheal argument to indicate how to recover from the partition.
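The datacenter example can be replayed with a small Python sketch. It assumes that a listed node counts as able to reach itself, which is what the example implies; node names and the function name are hypothetical.

```python
def should_pause_if_all_down(node, reachable_peers, listed_nodes):
    """Return True if `node` should pause: none of the listed nodes is reachable.

    Assumption: a node that is itself on the list always "reaches" itself.
    """
    visible = set(reachable_peers) | {node}
    return not (visible & set(listed_nodes))


listed = ["a1", "a2"]  # the administrator prefers datacenter A
# Link between the datacenters is lost: each node only sees its local peer.
print(should_pause_if_all_down("a1", ["a2"], listed))  # datacenter A node
print(should_pause_if_all_down("b1", ["b2"], listed))  # datacenter B node
```

With listed = ["a1", "b1"] the same partition leaves every node able to see one listed node, so nothing pauses; that is the case the extra ignore/autoheal argument is meant for.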

In autoheal mode RabbitMQ will automatically decide on a winning partition if a partition is deemed to have occurred, and will restart all nodes that are not in the winning partition. Unlike pause_minority mode, it therefore takes effect when a partition ends rather than when one starts.

The winning partition is the one which has the most clients connected (or, if this produces a draw, the one with the most nodes; if that still produces a draw, one of the partitions is chosen in an unspecified way).
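The selection rule can be expressed as a sort key. This is a Python sketch of the rule as described, not RabbitMQ's code; the dictionary layout is hypothetical, and because the final tie-break is unspecified, this sketch simply keeps the first candidate it encounters.

```python
def pick_winner(partitions):
    """Pick the winning partition: most connected clients, then most nodes.

    partitions -- list of dicts with "nodes" (list of names) and "clients" (int)
    On a full tie, max() returns the first candidate; that choice is arbitrary
    and stands in for the unspecified behaviour.
    """
    return max(partitions, key=lambda p: (p["clients"], len(p["nodes"])))


candidates = [
    {"nodes": ["rabbit@a"], "clients": 10},
    {"nodes": ["hare@b", "fox@c"], "clients": 10},
    {"nodes": ["owl@d"], "clients": 3},
]
print(pick_winner(candidates)["nodes"])
```

Every node outside the returned partition would then be restarted.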

You can enable any of these modes by setting the configuration parameter cluster_partition_handling for the rabbit application in your configuration file to one of:
pause_minority
{pause_if_all_down, [nodes], ignore | autoheal}
autoheal

Which mode should I pick?

It's important to understand that allowing RabbitMQ to deal with network partitions automatically does not make them less of a problem. Network partitions will always cause problems for RabbitMQ clusters; you just get some degree of choice over what kind of problems you get. As stated in the introduction, if you want to connect RabbitMQ clusters over generally unreliable links, you should use federation or the shovel.

With that said, you might wish to pick a recovery mode as follows:

ignore - Your network really is reliable. All your nodes are in one rack, connected by a switch, and that switch is also the route to the outside world. You don't want to run any risk of any part of your cluster shutting down if another part of it fails (or you have a two-node cluster).

pause_minority - Your network is maybe less reliable. You have clustered across 3 AZs in EC2, and you assume that only one AZ will fail at a time. In that scenario you want the remaining two AZs to continue working, and the nodes from the failed AZ to rejoin automatically and without fuss when the AZ comes back.

autoheal - Your network may not be reliable. You are more concerned with continuity of service than with data integrity. You may have a two-node cluster.

More about pause-minority mode

The Erlang VM on the paused nodes will continue running, but the nodes will not listen on any ports or do any other work. They check once per second whether the rest of the cluster has reappeared, and start up again if it has.

Note that nodes will not enter the paused state at startup, even if they are in a minority at that point. Any such minority at startup is expected to be due to the rest of the cluster not having been started yet.

Also note that RabbitMQ will pause nodes which are not in a strict majority of the cluster, i.e. a partition containing more than half of all nodes. It is therefore not a good idea to enable pause-minority mode on a cluster of two nodes, since in the event of any network partition or node failure both nodes will pause. However, pause_minority mode is likely to be safer than ignore mode for clusters of more than two nodes, especially if the most likely form of network partition is that a single minority of nodes drops off the network.

Finally, note that pause_minority mode will do nothing to defend against partitions caused by cluster nodes being suspended. This is because the suspended node never sees the rest of the cluster vanish, so it has no trigger to disconnect itself from the cluster.


