[ZooKeeper Study Notes, Part 5] ZooKeeper Connection Loss and Session Expiration

ZooKeeper session state transition diagram:

[Figure: ZooKeeper session state transitions]
 

Connection Loss:
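CONNECTION_LOSS means that the connection between the client and the server has been broken. For example, a client creates a ZooKeeper instance, which starts a session with the server, and then performs a series of operations. If the client crashes, the network fails, or the server crashes, the connection is lost. When that happens, a client process that is still running receives a Disconnected event. On receiving it, the client cannot tell whether the server crashed or the network failed; likewise, a server that is still running cannot tell whether the client crashed or the network failed.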

 

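After a connection loss, the client cannot assume anything about whether a request it had already sent was executed. For example, if the client issued a request to create a znode, any of the following may have happened:

1. The request reached the server and was executed, but the connection broke while the response was on its way back.

2. The request never reached the server, so it was not executed at all.

3. The request reached the server, and the server crashed while executing it.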

 

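The third case is an extreme one. Where strong consistency is required, any partial work done for the request should fail as a whole, leaving the server in the state it had before the request. ZooKeeper's read and write operations are atomic, so there is never a partial read or a partial write, and that is what preserves data consistency.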

 

After the connection breaks, the client cannot tell whether the cause is a network problem or a crashed ZooKeeper server. So, after receiving a CONNECTION_LOSS event:

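1. The client does not need to create a new ZooKeeper session. With the help of the ZooKeeper client library it stays in the CONNECTING state, and the library does not declare the session expired on its own while disconnected (the session timeout is specified by the client when it creates the ZooKeeper instance, but the client library does not detect session expiration itself).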

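2. The client needs to check the outcome of its last operation, for example by checking whether a znode exists to decide whether a create succeeded, or by checking a znode's data to decide whether an update succeeded.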

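3. As noted in 1, while the server is unavailable the client stays in the CONNECTING state with the help of the ZooKeeper client library. Once a ZooKeeper server becomes available again, the client library tries to re-establish the connection. If it reconnects before the session times out, the session is restored for the client, including the watchers already registered, and the client receives a SyncConnected event. If the timeout has already passed, the client receives a Session Expired event.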

 

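Suppose the session timeout is 10s and the connection breaks 3s into the session. My first guess was that reconnecting within 7s restores the session, and that anything longer leads to a session expired event. That calculation does not look right: according to the FAQ text quoted in the Session Expired section, the cluster expires a session when it has not heard from the client for the whole timeout period, so the clock effectively starts at the last heartbeat or request the server received, not at session creation. If the last contact was at about 3s, the client has until roughly 13s (about 10s after the disconnect) to reconnect.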
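The timing question above can be checked with a few lines of arithmetic. This is only a simplified model, assuming (per the FAQ quoted in the Session Expired section) that the timeout is measured from the last time the server heard from the client; `session_survives` and its parameters are hypothetical names, and the sketch ignores how a real server actually schedules expiry.

```python
def session_survives(last_heard_s: float, reconnect_s: float, timeout_s: float) -> bool:
    """Return True if the reconnect lands before the server-side deadline.

    Simplified model (assumption): the server expires the session once it
    has heard nothing from the client for timeout_s seconds.
    """
    deadline_s = last_heard_s + timeout_s
    return reconnect_s < deadline_s


# Timeout 10s, last contact at t=3s (the moment the link broke).
print(session_survives(3.0, 9.0, 10.0))   # True: reconnect at t=9s
print(session_survives(3.0, 14.0, 10.0))  # False: reconnect at t=14s
```

How long the session had existed before the disconnect does not appear in the calculation at all; only the gap since the last contact matters.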

 

2. How should I handle the CONNECTION_LOSS error?

CONNECTION_LOSS means the link between the client and server was broken. It doesn't necessarily mean that the request failed. If you are doing a create request and the link was broken after the request reached the server and before the response was returned, the create request will succeed. If the link was broken before the packet went onto the wire, the create request failed. Unfortunately, there is no way for the client library to know, so it returns CONNECTION_LOSS. The programmer must figure out if the request succeeded or needs to be retried. Usually this is done in an application specific way. Examples of success detection include checking for the presence of a file to be created or checking the value of a znode to be modified.
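The "figure out if the request succeeded or needs to be retried" step can be sketched as follows. Everything here is a stand-in for illustration: `FlakyStore`, `ConnectionLoss`, `NodeExists` and `create_checked` are hypothetical names, and the store is an in-memory fake rather than a real ZooKeeper client, so the sketch only shows the shape of the check-then-retry logic.

```python
class ConnectionLoss(Exception):
    """Hypothetical stand-in for the client library's CONNECTION_LOSS error."""


class NodeExists(Exception):
    """Hypothetical stand-in for a 'node already exists' error."""


class FlakyStore:
    """In-memory stand-in for a ZooKeeper server; not a real client API."""

    def __init__(self, fail_mode=None):
        self.nodes = {}
        self.fail_mode = fail_mode  # None, "before" or "after"; fires once

    def create(self, path, data):
        mode, self.fail_mode = self.fail_mode, None
        if mode == "before":
            raise ConnectionLoss()      # request never reached the server
        if path in self.nodes:
            raise NodeExists(path)
        self.nodes[path] = data
        if mode == "after":
            raise ConnectionLoss()      # applied, but the reply was lost
        return path

    def get(self, path):
        return self.nodes.get(path)


def create_checked(store, path, data, max_attempts=3):
    """Create a znode, verifying the outcome after each connection loss."""
    for _ in range(max_attempts):
        try:
            store.create(path, data)
            return "created"
        except ConnectionLoss:
            # Outcome unknown: look at the server state before retrying.
            if store.get(path) == data:
                return "verified"
    raise ConnectionLoss(f"gave up on {path}")


print(create_checked(FlakyStore("after"), "/job", b"x"))   # verified
print(create_checked(FlakyStore("before"), "/job", b"x"))  # created
```

Comparing the stored data only shows that a node with that content exists. If several clients could write the same content, the check would need something that identifies the writer, which is the application-specific part the FAQ mentions.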

When a client (session) becomes partitioned from the ZK serving cluster it will begin searching the list of servers that were specified during session creation. Eventually, when connectivity between the client and at least one of the servers is re-established, the session will either again transition to the "connected" state (if reconnected within the session timeout value) or it will transition to the "expired" state (if reconnected after the session timeout). The ZK client library will handle reconnect for you automatically. In particular we have heuristics built into the client library to handle things like "herd effect", etc... Only create a new session when you are notified of session expiration (mandatory).
 

Session Expired

Once the client receives a session expired event, that ZooKeeper object can no longer be used and a new one has to be created.

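A client that performs no create, delete, update, or read operations on znodes for a long time still does not see a session expiration, because the client periodically sends heartbeats to the ZooKeeper server to keep the session alive.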

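When a client is connected to a ZooKeeper cluster and the server it is connected to crashes, the ZooKeeper client library automatically connects it to another server. As long as the session has not expired, the session keeps its previous state, including the watchers already registered on it.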

 

 

 

 How should I handle SESSION_EXPIRED?

SESSION_EXPIRED automatically closes the ZooKeeper handle. In a correctly operating cluster, you should never see SESSION_EXPIRED. It means that the client was partitioned off from the ZooKeeper service for more than the session timeout and ZooKeeper decided that the client died. Because the ZooKeeper service is ground truth, the client should consider itself dead and go into recovery. If the client is only reading state from ZooKeeper, recovery means just reconnecting. In more complex applications, recovery means recreating ephemeral nodes, vying for leadership roles, and reconstructing published state.

Library writers should be conscious of the severity of the expired state and not try to recover from it. Instead libraries should return a fatal error. Even if the library is simply reading from ZooKeeper, the user of the library may also be doing other things with ZooKeeper that requires more complex recovery.
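For application code (as opposed to library code), the recovery described above might look roughly like this. It is a minimal sketch under stated assumptions: `FakeHandle`, `App`, the `connect` factory and the event strings are all hypothetical stand-ins, not the real client API, and "recovery" here covers only recreating ephemeral nodes.

```python
class SessionExpired(Exception):
    """Hypothetical stand-in for the SESSION_EXPIRED error."""


class FakeHandle:
    """In-memory stand-in for a ZooKeeper handle; not a real client API."""

    def __init__(self):
        self.ephemerals = set()
        self.expired = False

    def create_ephemeral(self, path):
        if self.expired:
            raise SessionExpired()  # an expired handle is unusable
        self.ephemerals.add(path)


class App:
    """Owns a handle and knows how to rebuild its own published state."""

    def __init__(self, connect, ephemeral_paths):
        self._connect = connect           # factory that opens a new session
        self._paths = list(ephemeral_paths)
        self.handle = None
        self.recover()

    def recover(self):
        # A new session starts empty, so ephemerals must be recreated.
        self.handle = self._connect()
        for path in self._paths:
            self.handle.create_ephemeral(path)

    def on_event(self, state):
        # Only expiration calls for a new session; a plain disconnect
        # is left to the client library's automatic reconnect.
        if state == "expired":
            self.recover()


app = App(FakeHandle, ["/workers/w1"])
old = app.handle
app.on_event("disconnected")
print(app.handle is old)        # True: same session, nothing to rebuild
old.expired = True
app.on_event("expired")
print(app.handle is old)        # False: a fresh handle replaced it
```

Keeping the list of ephemeral paths in the application, rather than in a shared library, follows the FAQ's point that only the application knows what full recovery involves.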

Session expiration is managed by the ZooKeeper cluster itself, not by the client. When the ZK client establishes a session with the cluster it provides a "timeout" value. This value is used by the cluster to determine when the client's session expires. Expiration happens when the cluster does not hear from the client within the specified session timeout period (i.e. no heartbeat). At session expiration the cluster will delete any/all ephemeral nodes owned by that session and immediately notify any/all connected clients of the change (anyone watching those znodes). At this point the client of the expired session is still disconnected from the cluster; it will not be notified of the session expiration until/unless it is able to re-establish a connection to the cluster. The client will stay in disconnected state until the TCP connection is re-established with the cluster, at which point the watcher of the expired session will receive the "session expired" notification.

Example state transitions for an expired session as seen by the expired session's watcher:

  1. 'connected' : session is established and client is communicating with cluster (client/server communication is operating properly)
  2. .... client is partitioned from the cluster
  3. 'disconnected' : client has lost connectivity with the cluster
  4. .... time elapses, after 'timeout' period the cluster expires the session, nothing is seen by client as it is disconnected from cluster
  5. .... time elapses, the client regains network level connectivity with the cluster
  6. 'expired' : eventually the client reconnects to the cluster, it is then notified of the expiration
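The transitions listed above can be replayed with a small simulation. This is a sketch, not ZooKeeper code: `timeline` is a hypothetical helper, and it assumes the server's last contact with the client coincides with the start of the partition.

```python
def timeline(timeout_s, partition_at_s, heal_at_s):
    """List (time, observer, event) tuples for one partition episode.

    Assumption: the last heartbeat reached the server exactly when the
    partition began, so the server-side deadline is partition + timeout.
    """
    events = [
        (0.0, "client", "connected"),
        (partition_at_s, "client", "disconnected"),
    ]
    expires_at_s = partition_at_s + timeout_s
    if heal_at_s < expires_at_s:
        # Reconnected in time: the same session continues.
        events.append((heal_at_s, "client", "connected"))
    else:
        # The server expires the session while the client cannot see it...
        events.append((expires_at_s, "server", "session expired"))
        # ...and the client only learns about it after reconnecting.
        events.append((heal_at_s, "client", "expired"))
    return events


for event in timeline(timeout_s=10, partition_at_s=5, heal_at_s=30):
    print(event)
```

With these numbers the server expires the session at t=15 but the client's watcher sees 'expired' only at t=30, which is the gap between steps 4 and 6 above.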

 

References

1. ZooKeeper FAQ

http://wiki.apache.org/hadoop/ZooKeeper/FAQ#1

2. ZooKeeper mailing list discussion on timeouts and disconnects

http://markmail.org/message/p5j7rvy5zf5qjsje#query:+page:1+mid:s4d7dnsxulv5yieh+state:results

3. Blog post on Session Timeout and Connection Loss

http://www.ngdata.com/so-you-want-to-be-a-zookeeper/

 

 

 

 
