【Original】 Problems you may run into when using the heartbeat feature in RabbitMQ


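【Problem Scenario】
The client subscribes as a consumer to a queue on the RabbitMQ server. In the AMQP Connection.Tune-Ok method, the client sets heartbeat to 0, which asks the server not to enable heartbeats. The server then stops serving because of an unexpected power loss, and the client has no way to detect within a short time that the server has failed.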

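When this problem first showed up, testers and business developers came to me saying that the modified rabbitmq-c library might have a serious bug: the server had been shut down, so how could the client keep working as if nothing had happened? After hearing this, I asked just three questions and had the answer:
  • Does the application run purely as a consumer?
  • Can you confirm that the server stopped because of an unexpected power loss?
  • Is there any intermediate routing device between the server and the application?
They told me the answers were: yes, yes, and no. Heh, so the answer was already settled. Have you worked it out?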

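【Problem Analysis】
The problem can be analyzed at two levels:
1. The TCP level
At this level, the problem is a textbook case of the TCP "half-open" connection, which is typically described as follows:
A TCP connection is said to be half-open if one end has closed or aborted the connection without the other end knowing. This can happen whenever either host crashes. As long as no attempt is made to send data over a half-open connection, the end that is still up will not detect that the other end has failed.
A common cause of half-open connections is a client host that suddenly loses power instead of being shut down after the client application exits normally. Of course, "client host" here does not refer only to the client side.
When this happens, the side of the TCP connection that only receives and never sends can rely only on TCP's own keepalive mechanism to check whether the link is healthy. Keepalive has to be enabled on the socket, and with typical default settings the first probe is sent only after about 2 hours of idle time.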
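
The two-hour figure is only a default, and where the platform allows it keepalive can be tuned per socket. The following is a minimal sketch, assuming a Linux host (TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are Linux-specific options). The helper names enable_keepalive and keepalive_detect_seconds are hypothetical and are not part of rabbitmq-c:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper (not a rabbitmq-c API): turn on TCP keepalive for fd
 * and shorten its timers. Assumes Linux, where TCP_KEEPIDLE, TCP_KEEPINTVL
 * and TCP_KEEPCNT are available. Returns 0 on success, -1 on failure. */
int enable_keepalive(int fd, int idle_s, int interval_s, int probes)
{
    int on = 1;

    /* Keepalive is off by default; it must be enabled per socket. */
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) != 0)
        return -1;
    /* Idle seconds before the first probe (the usual default is 7200). */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof idle_s) != 0)
        return -1;
    /* Seconds between unanswered probes. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_s, sizeof interval_s) != 0)
        return -1;
    /* Unanswered probes tolerated before the connection is dropped. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof probes) != 0)
        return -1;
    return 0;
}

/* Approximate worst-case time to notice a dead peer with these settings. */
int keepalive_detect_seconds(int idle_s, int interval_s, int probes)
{
    return idle_s + interval_s * probes;
}
```

With an idle time of 30 s, an interval of 10 s and 3 probes, a dead peer would be noticed after roughly 60 s rather than hours. Keepalive only covers the TCP layer, so it complements the AMQP heartbeat rather than replacing it.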

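2. The AMQP level
At this level, because the client subscribes to the queue as a consumer, it never actively sends data to the RabbitMQ server over this AMQP/TCP connection. After the server stops because of an unexpected power loss, the consumer receives no AMQP-level close notification, so it cannot tell what has happened to its peer.
One possible fix is for the client, after N consecutive receive timeouts, to send an AMQP Heartbeat frame to probe whether the server is still healthy.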
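
The probe described above can be sketched as follows. The 8-byte layout is the AMQP 0-9-1 heartbeat frame (type 8, channel 0, empty payload, frame-end octet 0xCE). The names probe_state, on_recv_timeout, on_frame_received and build_heartbeat_frame, as well as the threshold N, are hypothetical and are not rabbitmq-c APIs:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HEARTBEAT_FRAME_LEN 8

/* Hypothetical per-connection counter of consecutive receive timeouts. */
struct probe_state {
    unsigned timeouts;   /* consecutive receive timeouts seen so far */
    unsigned threshold;  /* N: probe once this many timeouts accumulate */
};

/* Write an AMQP 0-9-1 heartbeat frame into buf; returns its length. */
size_t build_heartbeat_frame(uint8_t buf[HEARTBEAT_FRAME_LEN])
{
    static const uint8_t frame[HEARTBEAT_FRAME_LEN] = {
        0x08,                   /* frame type: heartbeat */
        0x00, 0x00,             /* channel 0 */
        0x00, 0x00, 0x00, 0x00, /* payload size 0 */
        0xCE                    /* frame-end octet */
    };
    memcpy(buf, frame, sizeof frame);
    return sizeof frame;
}

/* Call on every receive timeout. Returns 1 when a heartbeat probe should
 * be sent now (and resets the counter), 0 otherwise. */
int on_recv_timeout(struct probe_state *st)
{
    if (++st->timeouts < st->threshold)
        return 0;
    st->timeouts = 0;
    return 1;
}

/* Call whenever any frame arrives: the link is evidently alive. */
void on_frame_received(struct probe_state *st)
{
    st->timeouts = 0;
}
```

Writing the probe forces TCP to attempt a transmission. On a half-open connection the first write may still appear to succeed locally; the failure typically surfaces on a later send or receive, once retransmissions give up or a reset arrives.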


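The scenario above says that "the client sets heartbeat to 0 in the AMQP Connection.Tune-Ok method". What if heartbeat were set to 30 instead? In that case heartbeats are activated on both the server and the client. The server sends a heartbeat frame to the client whenever it has had nothing to send for a certain period; if it receives no data at all for a certain period, it declares a heartbeat timeout and eventually closes the TCP connection (see the linked reference). The client likewise starts maintaining a send timer and a receive timer for heartbeats, which are used to detect send-side and receive-side timeouts respectively.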
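
The two client-side timers can be sketched as below. The thresholds are assumptions chosen to match the log output later in this post (send a heartbeat after one idle interval, give up after two silent intervals). RabbitMQ and the modified rabbitmq-c library may use different values, and all names are hypothetical:

```c
#include <time.h>

/* Hypothetical heartbeat bookkeeping for one connection. */
struct hb_timers {
    int    interval_s; /* negotiated heartbeat in seconds, 0 = disabled */
    time_t last_sent;  /* when we last wrote any frame */
    time_t last_recv;  /* when we last read any frame */
};

void hb_init(struct hb_timers *t, int interval_s, time_t now)
{
    t->interval_s = interval_s;
    t->last_sent = now;
    t->last_recv = now;
}

/* Any outgoing frame counts as traffic and restarts the send timer. */
void hb_note_sent(struct hb_timers *t, time_t now) { t->last_sent = now; }

/* Any incoming frame, heartbeat or not, restarts the receive timer. */
void hb_note_recv(struct hb_timers *t, time_t now) { t->last_recv = now; }

/* True when we have been silent for a full interval and owe a heartbeat. */
int hb_send_due(const struct hb_timers *t, time_t now)
{
    return t->interval_s > 0 && now - t->last_sent >= t->interval_s;
}

/* True when the peer has been silent for two intervals (assumed limit). */
int hb_peer_expired(const struct hb_timers *t, time_t now)
{
    return t->interval_s > 0 &&
           now - t->last_recv >= 2 * (time_t)t->interval_s;
}
```

With heartbeat set to 0 both checks always return false, which is exactly the configuration in which the half-open connection went unnoticed.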

The amqp.h header shows the current state of heartbeat support in rabbitmq-c:
 * \param [in] heartbeat the number of seconds between heartbeat frame to 
 *             request of the broker. A value of 0 disables heartbeats. 
 *             Note rabbitmq-c only has partial support for hearts, as of 
 *             v0.4.0 heartbeats are only serviced during amqp_basic_publish(), 
 *             and amqp_simple_wait_frame()/amqp_simple_wait_frame_noblock() 
In rabbitmq-c 0.4.1, the current version on GitHub, heartbeat support is limited to the three APIs listed above.

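So the problem to solve can be stated as follows: after subscribing as a consumer to a queue on the server, the client must, whenever there is no business data to process, use Heartbeat frames to determine whether the server has failed (in other words, whether its own TCP connection has become half-open).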


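【Solution】
The recommended approach is:
  • The client must enable heartbeats (the foundation for solving the half-open problem);
  • The client must be able to send heartbeats when its send side is idle (because the client, as a producer, currently holds a long-lived connection to the RabbitMQ server);
  • The client must be able, when its receive side is idle, to judge whether the server (or the network) is healthy by watching for heartbeat frames sent by the server (because the client, as a consumer, also holds a long-lived connection to the RabbitMQ server and never actively sends data to it).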
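
For the consumer case, the receive side can be sketched as a small decision function driven by the result of each bounded wait. The event and action names, the 10-second wait slice and the 60-second limit are assumptions modeled on the log output shown later in this post, not the actual implementation in the modified library:

```c
/* What one bounded wait on the socket produced (hypothetical names). */
enum rx_event { RX_TIMEOUT, RX_HEARTBEAT, RX_DELIVER };

/* What the caller should do next. */
enum rx_action {
    ACT_WAIT_AGAIN,
    ACT_REPLY_HEARTBEAT,
    ACT_HANDLE_MESSAGE,
    ACT_CLOSE
};

struct consumer_rx {
    int slice_s;  /* length of one bounded wait, e.g. 10 */
    int limit_s;  /* silence tolerated before giving up, e.g. 60 */
    int silent_s; /* seconds since anything at all was received */
};

enum rx_action consumer_on_event(struct consumer_rx *c, enum rx_event ev)
{
    switch (ev) {
    case RX_HEARTBEAT:
        c->silent_s = 0;            /* server is alive; answer in kind */
        return ACT_REPLY_HEARTBEAT;
    case RX_DELIVER:
        c->silent_s = 0;            /* regular data also proves liveness */
        return ACT_HANDLE_MESSAGE;
    case RX_TIMEOUT:
    default:
        c->silent_s += c->slice_s;
        /* Too long without any frame: treat the link as half-open. */
        return c->silent_s >= c->limit_s ? ACT_CLOSE : ACT_WAIT_AGAIN;
    }
}
```

On ACT_CLOSE the caller would tear down the connection and leave it to the business layer to subscribe again.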

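Summary:
Once the client enables heartbeats, the server sends heartbeat frames to the client periodically whenever "certain conditions" are met, and it also checks whether it has received a heartbeat after being idle for the configured time. The client, when acting as a consumer, has to check whether it has received any data (regular data or heartbeat frames); if nothing arrives within a certain time, it assumes that the link may be broken. The business layer can then trigger re-establishment of the consume relationship.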


The following is the output printed with heartbeats enabled:
Output when the network is disconnected while running as a consumer
[warn] evsignal_init: socketpair: No error
drive_machine: [conn_init]  ---  TCP 3-way handshake start! --> [172.16.81.111:5672][s:53144]
drive_machine: [conn_connecting]  ---  connection timeout 1 time on socket(53144)
drive_machine: [conn_connected]  ---  connected on socket(53144)
53144: conn_state change   connected ==> snd_protocol_header
  --> Send Protocol.Header!
53144: conn_state change   snd_protocol_header ==> rcv_connection_start_method
[53144] drive_machine: wait for Connection.Start method another 10 seconds!!
  <-- Recv Connection.Start Method frame!
53144: conn_state change   rcv_connection_start_method ==> snd_connection_start_rsp_method
  --> Send Connection.Start-Ok Method frame!
53144: conn_state change   snd_connection_start_rsp_method ==> rcv_connection_tune_method
  <-- Recv Connection.Tune Method frame!
53144: conn_state change   rcv_connection_tune_method ==> snd_connection_tune_rsp_method
  --> Send Connection.Tune-Ok Method frame!
53144: conn_state change   snd_connection_tune_rsp_method ==> snd_connection_open_method
  --> Send Connection.Open Method frame!
53144: conn_state change   snd_connection_open_method ==> rcv_connection_open_rsp_method
  <-- Recv Connection.Open-Ok Method frame!
53144: conn_state change   rcv_connection_open_rsp_method ==> snd_channel_open_method
  --> Send Channel.Open Method frame!
53144: conn_state change   snd_channel_open_method ==> rcv_channel_open_rsp_method
[53144] drive_machine: wait for Channel.Open-Ok method another 10 seconds!!
  <-- Recv Channel.Open-Ok Method frame!
53144: conn_state change   rcv_channel_open_rsp_method ==> idle
[53144] drive_machine: [conn_idle]  ---  [CONSUMER]: Queue Declaring!
53144: conn_state change   idle ==> snd_queue_declare_method
  --> Send Queue.Declare Method frame!
53144: conn_state change   snd_queue_declare_method ==> rcv_queue_declare_rsp_method
[53144] drive_machine: wait for Queue.Declare-Ok method another 10 seconds!!
  <-- Recv Queue.Declare-Ok Method frame!
53144: conn_state change   rcv_queue_declare_rsp_method ==> idle
[53144] drive_machine: [conn_idle]  ---  [CONSUMER]: Queue Binding!
53144: conn_state change   idle ==> snd_queue_bind_method
  --> Send Queue.Bind Method frame!
53144: conn_state change   snd_queue_bind_method ==> rcv_queue_bind_rsp_method
[53144] drive_machine: wait for Queue.Bind method another 10 seconds!!
  <-- Recv Queue.Bind Method frame!
need to code something!
53144: conn_state change   rcv_queue_bind_rsp_method ==> idle
[53144] drive_machine: [conn_idle]  ---  [CONSUMER]: Basic QoS!
53144: conn_state change   idle ==> snd_basic_qos_method
  --> Send Basic.Qos Method frame!
53144: conn_state change   snd_basic_qos_method ==> rcv_basic_qos_rsp_method
  <-- Recv Queue.Qos-Ok Method frame!
need to code something!
53144: conn_state change   rcv_basic_qos_rsp_method ==> idle
[53144] drive_machine: [conn_idle]  ---  [CONSUMER]: Basic Consuming!
53144: conn_state change   idle ==> snd_basic_consume_method
  --> Send Basic.Consume Method frame!
53144: conn_state change   snd_basic_consume_method ==> rcv_basic_consume_rsp_method
  <-- Recv Basic.Consume-Ok Method frame!
need to code something!
53144: conn_state change   rcv_basic_consume_rsp_method ==> idle
[53144] drive_machine: [conn_idle]  ---  [CONSUMER]: Start waiting to recv!
53144: conn_state change   idle ==> rcv_basic_deliver_method
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
      ### Recv AMQP_FRAME_HEARTBEAT frame! ###
  <-- Recv Heartbeat frame!
53144: conn_state change   rcv_basic_deliver_method ==> snd_heartbeat
  --> Send Heartbeat frame!
53144: conn_state change   snd_heartbeat ==> rcv_basic_deliver_method
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
      ### Recv AMQP_FRAME_HEARTBEAT frame! ###
  <-- Recv Heartbeat frame!
53144: conn_state change   rcv_basic_deliver_method ==> snd_heartbeat
  --> Send Heartbeat frame!
53144: conn_state change   snd_heartbeat ==> rcv_basic_deliver_method
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
      ### Recv AMQP_FRAME_HEARTBEAT frame! ###
  <-- Recv Heartbeat frame!
53144: conn_state change   rcv_basic_deliver_method ==> snd_heartbeat
  --> Send Heartbeat frame!
53144: conn_state change   snd_heartbeat ==> rcv_basic_deliver_method
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: Recv nothing for 60s!
[53144] drive_machine: Maybe network broken or rabbitmq server fucked! Plz retry consuming!
53144: conn_state change   rcv_basic_deliver_method ==> close
[53144] drive_machine: [conn_close]  ---  Connection Disconnect!
### CB: Connection Disconnect!    Msg : [Connection Disconnect]



Output when the network is disconnected while running as a producer
[warn] evsignal_init: socketpair: No error
drive_machine: [conn_init]  ---  TCP 3-way handshake start! --> [172.16.81.111:5672][s:12184]
drive_machine: [conn_connecting]  ---  connection timeout 1 time on socket(12184)
drive_machine: [conn_connected]  ---  connected on socket(12184)
12184: conn_state change   connected ==> snd_protocol_header
  --> Send Protocol.Header!
12184: conn_state change   snd_protocol_header ==> rcv_connection_start_method
  <-- Recv Connection.Start Method frame!
12184: conn_state change   rcv_connection_start_method ==> snd_connection_start_rsp_method
  --> Send Connection.Start-Ok Method frame!
12184: conn_state change   snd_connection_start_rsp_method ==> rcv_connection_tune_method
[12184] drive_machine: wait for Connection.Tune method another 10 seconds!!
  <-- Recv Connection.Tune Method frame!
12184: conn_state change   rcv_connection_tune_method ==> snd_connection_tune_rsp_method
  --> Send Connection.Tune-Ok Method frame!
12184: conn_state change   snd_connection_tune_rsp_method ==> snd_connection_open_method
  --> Send Connection.Open Method frame!
12184: conn_state change   snd_connection_open_method ==> rcv_connection_open_rsp_method
[12184] drive_machine: wait for Connection.Open-Ok method another 10 seconds!!
  <-- Recv Connection.Open-Ok Method frame!
12184: conn_state change   rcv_connection_open_rsp_method ==> snd_channel_open_method
  --> Send Channel.Open Method frame!
12184: conn_state change   snd_channel_open_method ==> rcv_channel_open_rsp_method
[12184] drive_machine: wait for Channel.Open-Ok method another 10 seconds!!
  <-- Recv Channel.Open-Ok Method frame!
12184: conn_state change   rcv_channel_open_rsp_method ==> snd_channel_confirm_select_method
  --> Send Confirm.Select Method frame!
12184: conn_state change   snd_channel_confirm_select_method ==> rcv_channel_confirm_select_rsp_method
  <-- Recv Confirm.Select-Ok Method frame!
Channel in Confirm Mode!
12184: conn_state change   rcv_channel_confirm_select_rsp_method ==> idle
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find msg to send!
12184: conn_state change   idle ==> snd_basic_publish_method
  --> Send Basic.Publish Method frame!
12184: conn_state change   snd_basic_publish_method ==> snd_basic_content_header
  --> Send Content-Header frame!
12184: conn_state change   snd_basic_content_header ==> snd_basic_content_body
  --> Send Content-Body frame!
12184: conn_state change   snd_basic_content_body ==> rcv_basic_ack_method
  <-- Recv Basic.Ack Method frame!
### CB: Publisher Confirm -- [Basic.Ack]  Delivery_Tag:[1]  multiple:[0]
12184: conn_state change   rcv_basic_ack_method ==> idle
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
12184: conn_state change   idle ==> snd_heartbeat
  --> Send Heartbeat frame!
12184: conn_state change   snd_heartbeat ==> idle
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
12184: conn_state change   idle ==> snd_heartbeat
[12184] drive_machine: Send Heartbeat failed! status = -9
12184: conn_state change   snd_heartbeat ==> close
[12184] drive_machine: [conn_close]  ---  Connection Disconnect!
### CB: Connection Disconnect!    Msg : [Connection Disconnect]





