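[ZooKeeper is open-source software from the Apache Hadoop ecosystem that acts as a distributed coordinator. This article is taken from the official ZooKeeper website: http://zookeeper.apache.org/doc/r3.4.5/zookeeperProgrammers.html]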
This document is a guide for developers wishing to create distributed applications that take advantage of ZooKeeper's coordination services. It contains conceptual and practical information.
The first four sections of this guide present higher-level discussions of various ZooKeeper concepts. These are necessary both for understanding how ZooKeeper works and for learning how to work with it. These sections do not contain source code, but they do assume familiarity with the problems associated with distributed computing. The sections in this first group are:
The next four sections provide practical programming information. These are:
The guide concludes with an appendix containing links to other useful ZooKeeper-related information.
Most of the information in this document is written to be accessible as stand-alone reference material. However, before starting your first ZooKeeper application, you should at least read the chapters on the ZooKeeper Data Model and ZooKeeper Basic Operations. Also, the Simple Programming Example [tbd] is helpful for understanding the basic structure of a ZooKeeper client application.
The ZooKeeper Data Model
ZooKeeper has a hierarchical namespace, much like a distributed file system. The only difference is that each node in the namespace can have data associated with it as well as children. It is like having a file system that allows a file to also be a directory. Paths to nodes are always expressed as canonical, absolute, slash-separated paths; there are no relative references. Any Unicode character can be used in a path, subject to the following constraints:
ZNodes
Every node in a ZooKeeper tree is referred to as a znode. Znodes maintain a stat structure that includes version numbers for data changes and ACL changes. The stat structure also has timestamps. The version number, together with the timestamp, allows ZooKeeper to validate the cache and to coordinate updates. Each time a znode's data changes, the version number increases. For instance, whenever a client retrieves data, it also receives the version of the data. When a client performs an update or a delete, it must supply the version of the data of the znode it is changing. If the version it supplies doesn't match the actual version of the data, the update will fail. (This behavior can be overridden. For more information see... )[tbd...]
Znodes are the main entity that a programmer accesses. They have several characteristics that are worth mentioning here.
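The version check described above behaves like a compare-and-set. The following is a minimal in-memory sketch of that idea, not the ZooKeeper client API: the class `VersionedNode` and its methods are hypothetical names invented for illustration, and a `false` return stands in for the failure a real client would see on a version mismatch.

```java
import java.nio.charset.StandardCharsets;

/** Toy stand-in for a single znode (hypothetical; not the ZooKeeper API). */
final class VersionedNode {
    private byte[] data;
    private int version = 0;

    VersionedNode(byte[] initial) {
        this.data = initial.clone();
    }

    synchronized int getVersion() {
        return version;
    }

    synchronized byte[] getData() {
        return data.clone();
    }

    /**
     * Replaces the data only if expectedVersion equals the current version.
     * Returns false (modelling a failed update) when the caller's version is stale.
     */
    synchronized boolean setData(byte[] newData, int expectedVersion) {
        if (expectedVersion != version) {
            return false;
        }
        data = newData.clone();
        version++; // every successful data change bumps the version
        return true;
    }

    public static void main(String[] args) {
        VersionedNode node = new VersionedNode("a".getBytes(StandardCharsets.UTF_8));
        int seen = node.getVersion(); // two clients both read version 0
        System.out.println(node.setData("b".getBytes(StandardCharsets.UTF_8), seen)); // true
        System.out.println(node.setData("c".getBytes(StandardCharsets.UTF_8), seen)); // false
    }
}
```

The second write fails because the first one already advanced the version, which is how two clients avoid silently overwriting each other.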
Watches
Clients can set watches on znodes. Changes to that znode trigger the watch and then clear the watch. When a watch triggers, ZooKeeper sends the client a notification. More information about watches can be found in the section ZooKeeper Watches.
Data Access
The data stored at each znode in a namespace is read and written atomically. Reads get all the data bytes associated with a znode and a write replaces all the data. Each node has an Access Control List (ACL) that restricts who can do what.
ZooKeeper was not designed to be a general database or large object store. Instead, it manages coordination data. This data can come in the form of configuration, status information, rendezvous, etc. A common property of the various forms of coordination data is that they are relatively small: measured in kilobytes. The ZooKeeper client and the server implementations have sanity checks to ensure that znodes have less than 1 MB of data, but the data should be much less than that on average. Operating on relatively large data sizes will cause some operations to take much more time than others and will affect the latencies of some operations because of the extra time needed to move more data over the network and onto storage media. If large data storage is needed, the usual pattern for dealing with such data is to store it on a bulk storage system, such as NFS or HDFS, and store pointers to the storage locations in ZooKeeper.
Ephemeral Nodes
ZooKeeper also has the notion of ephemeral nodes. These znodes exist as long as the session that created the znode is active. When the session ends, the znode is deleted. Because of this behavior, ephemeral znodes are not allowed to have children.
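To make the lifetime rule concrete, here is a small in-memory sketch, assuming a registry that maps each ephemeral path to the id of the session that created it. `EphemeralRegistry` and its method names are hypothetical and only model the behavior described above; a real server tracks far more state.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of ephemeral znodes tied to sessions (hypothetical; not the ZooKeeper API). */
final class EphemeralRegistry {
    // path -> id of the session that created the ephemeral node
    private final Map<String, Long> ownerByPath = new HashMap<>();

    void createEphemeral(String path, long sessionId) {
        // Ephemeral znodes may not have children, so refuse paths below an existing one.
        for (String existing : ownerByPath.keySet()) {
            if (path.startsWith(existing + "/")) {
                throw new IllegalStateException("ephemeral node cannot have children: " + existing);
            }
        }
        ownerByPath.put(path, sessionId);
    }

    /** Ending a session removes every ephemeral node that session created. */
    void endSession(long sessionId) {
        ownerByPath.values().removeIf(owner -> owner == sessionId);
    }

    boolean exists(String path) {
        return ownerByPath.containsKey(path);
    }

    public static void main(String[] args) {
        EphemeralRegistry registry = new EphemeralRegistry();
        registry.createEphemeral("/workers/w1", 1L);
        registry.createEphemeral("/workers/w2", 2L);
        registry.endSession(1L);
        System.out.println(registry.exists("/workers/w1")); // false
        System.out.println(registry.exists("/workers/w2")); // true
    }
}
```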
Sequence Nodes -- Unique Naming
When creating a znode you can also request that ZooKeeper append a monotonically increasing counter to the end of the path. This counter is unique to the parent znode. The counter has a format of %010d -- that is, 10 digits with 0 (zero) padding (the counter is formatted in this way to simplify sorting), i.e. "
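The effect of the %010d format can be seen with plain string formatting. This sketch only illustrates the padding and why it helps sorting; the path prefix `/queue/item-` and the helper `withSequence` are made-up names, and in a real deployment the server, not the client, assigns the counter.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class SequenceNames {
    /** Appends the counter as 10 zero-padded digits, matching the %010d format. */
    static String withSequence(String pathPrefix, int counter) {
        return pathPrefix + String.format(Locale.ROOT, "%010d", counter);
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        for (int counter : new int[] {10, 2, 1}) {
            names.add(withSequence("/queue/item-", counter));
        }
        // Because of the zero padding, a plain string sort agrees with numeric order.
        Collections.sort(names);
        System.out.println(names);
        // prints [/queue/item-0000000001, /queue/item-0000000002, /queue/item-0000000010]
    }
}
```

Without the padding, "item-10" would sort before "item-2", which is the problem the fixed-width format avoids.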
Time in ZooKeeper
ZooKeeper tracks time in multiple ways:
ZooKeeper Stat Structure
The Stat structure for each znode in ZooKeeper is made up of the following fields:
A ZooKeeper client establishes a session with the ZooKeeper service by creating a handle to the service using a language binding. Once created, the handle starts off in the CONNECTING state and the client library tries to connect to one of the servers that make up the ZooKeeper service, at which point it switches to the CONNECTED state. During normal operation the handle will be in one of these two states. If an unrecoverable error occurs, such as session expiration or authentication failure, or if the application explicitly closes the handle, the handle will move to the CLOSED state. The following figure shows the possible state transitions of a ZooKeeper client:
To create a client session the application code must provide a connection string containing a comma-separated list of host:port pairs, each corresponding to a ZooKeeper server (e.g. "127.0.0.1:4545" or "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002"). The ZooKeeper client library will pick an arbitrary server and try to connect to it. If this connection fails, or if the client becomes disconnected from the server for any reason, the client will automatically try the next server in the list, until a connection is (re-)established.
Added in 3.2.0: An optional "chroot" suffix may also be appended to the connection string. This will run the client commands while interpreting all paths relative to this root (similar to the unix chroot command). If used, the example would look like "127.0.0.1:4545/app/a" or "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002/app/a", where the client would be rooted at "/app/a" and all paths would be relative to this root, i.e. getting/setting/etc. "/foo/bar" would result in operations being run on "/app/a/foo/bar" (from the server perspective). This feature is particularly useful in multi-tenant environments where each user of a particular ZooKeeper service could be rooted differently. This makes re-use much simpler, as each user can code his/her application as if it were rooted at "/", while the actual location (say /app/a) could be determined at deployment time.
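The connection string format, including the chroot suffix, can be illustrated with a small parser. This is a sketch under simplifying assumptions, not the client library's own parsing code: `ConnectString` and `serverPath` are hypothetical names, and the parser treats the first "/" as the start of the chroot, so it does not handle anything beyond the host:port forms shown above.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that splits a connection string into servers and a chroot. */
final class ConnectString {
    final List<String> servers; // host:port entries, in the order given
    final String chroot;        // "" when the string has no chroot suffix

    ConnectString(String connectString) {
        int slash = connectString.indexOf('/');
        String hosts = slash < 0 ? connectString : connectString.substring(0, slash);
        String root = slash < 0 ? "" : connectString.substring(slash);
        this.chroot = root.equals("/") ? "" : root;
        this.servers = Arrays.asList(hosts.split(","));
    }

    /** Maps a path as the client sees it to the path the server operates on. */
    String serverPath(String clientPath) {
        if (chroot.isEmpty()) {
            return clientPath;
        }
        return clientPath.equals("/") ? chroot : chroot + clientPath;
    }

    public static void main(String[] args) {
        ConnectString cs =
            new ConnectString("127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002/app/a");
        System.out.println(cs.servers.size());          // 3
        System.out.println(cs.serverPath("/foo/bar"));  // /app/a/foo/bar
    }
}
```

Application code keeps using "/foo/bar" regardless of where the deployment roots it, which is the re-use benefit the paragraph describes.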
When a client gets a handle to the ZooKeeper service, ZooKeeper creates a ZooKeeper session, represented as a 64-bit number, that it assigns to the client. If the client connects to a different ZooKeeper server, it will send the session id as a part of the connection handshake. As a security measure, the server creates a password for the session id that any ZooKeeper server can validate. The password is sent to the client with the session id when the client establishes the session. The client sends this password with the session id whenever it reestablishes the session with a new server.
One of the parameters to the ZooKeeper client library call to create a ZooKeeper session is the session timeout in milliseconds. The client sends a requested timeout, and the server responds with the timeout that it can give the client. The current implementation requires that the timeout be a minimum of 2 times the tickTime (as set in the server configuration) and a maximum of 20 times the tickTime. The ZooKeeper client API allows access to the negotiated timeout.
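The bounds above amount to clamping the requested value into the range [2 x tickTime, 20 x tickTime]. The sketch below assumes the server simply clamps the request; the method name `negotiate` is hypothetical, and the tickTime of 2000 ms in the usage is an example value rather than something the text specifies.

```java
final class SessionTimeouts {
    /**
     * Returns the timeout a server would grant, assuming it clamps the request
     * into [2 * tickTime, 20 * tickTime] as the text describes.
     */
    static int negotiate(int requestedMs, int tickTimeMs) {
        int min = 2 * tickTimeMs;
        int max = 20 * tickTimeMs;
        return Math.max(min, Math.min(max, requestedMs));
    }

    public static void main(String[] args) {
        int tickTimeMs = 2000; // example value only
        System.out.println(negotiate(1000, tickTimeMs));  // 4000  (raised to the minimum)
        System.out.println(negotiate(10000, tickTimeMs)); // 10000 (within bounds)
        System.out.println(negotiate(60000, tickTimeMs)); // 40000 (capped at the maximum)
    }
}
```

Because the granted value can differ from the requested one, client code that depends on the timeout should read the negotiated value rather than assume its own request was honored.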
When a client (session) becomes partitioned from the ZK serving cluster it will begin searching the list of servers that were specified during session creation. Eventually, when connectivity between the client and at least one of the servers is re-established, the session will either transition back to the "connected" state (if reconnected within the session timeout value) or transition to the "expired" state (if reconnected after the session timeout). It is not advisable to create a new session object (a new ZooKeeper.class or zookeeper handle in the C binding) for disconnection. The ZK client library will handle reconnection for you. In particular, we have heuristics built into the client library to handle things like "herd effect", etc. Only create a new session when you are notified of session expiration (mandatory).
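The reconnect rule can be summarized as a small decision function. This is a deliberately simplified sketch: the enum and method names are hypothetical, and in reality the cluster decides expiration, so the client only learns the outcome after it reconnects.

```java
final class SessionStates {
    enum State { CONNECTING, CONNECTED, EXPIRED, CLOSED }

    /**
     * State a session ends up in once connectivity returns, given how long
     * the client was cut off and the negotiated session timeout.
     */
    static State afterReconnect(long disconnectedMs, long sessionTimeoutMs) {
        return disconnectedMs < sessionTimeoutMs ? State.CONNECTED : State.EXPIRED;
    }

    /** Only expiration calls for a new session object; plain disconnects do not. */
    static boolean needsNewSession(State state) {
        return state == State.EXPIRED;
    }

    public static void main(String[] args) {
        State quick = afterReconnect(3_000, 10_000);
        State slow = afterReconnect(15_000, 10_000);
        System.out.println(quick + " " + needsNewSession(quick)); // CONNECTED false
        System.out.println(slow + " " + needsNewSession(slow));   // EXPIRED true
    }
}
```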
Session expiration is managed by the ZooKeeper cluster itself, not by the client. When the ZK client establishes a session with the cluster it provides a "timeout" value, detailed above. This value is used by the cluster to determine when the client's session expires. Expiration happens when the cluster does not hear from the client within the specified session timeout period (i.e. no heartbeat). At session expiration the cluster will delete any/all ephemeral nodes owned by that session and immediately notify any/all connected clients of the change (anyone watching those znodes). At this point the client of the expired session is still disconnected from the cluster; it will not be notified of the session expiration until/unless it is able to re-establish a connection to the cluster. The client will stay in the disconnected state until the TCP connection is re-established with the cluster, at which point the watcher of the expired session will receive the "session expired" notification.
Example state transitions for an expired session as seen by the expired session's watcher:
Another parameter to the ZooKeeper session establishment call is the default watcher. Watchers are notified when any state change occurs in the client. For example, if the client loses connectivity to the server the client will be notified, or if the client's session expires, etc. This watcher should consider the initial state to be disconnected (i.e. before any state change events are sent to the watcher by the client lib). In the case of a new connection, the first event sent to the watcher is typically the session connection event.
The session is kept alive by requests sent by the client. If the session is idle for a period of time that would time out the session, the client will send a PING request to keep the session alive. This PING request not only allows the ZooKeeper server to know that the client is still active, but it also allows the client to verify that its connection to the ZooKeeper server is still active. The timing of the PING is conservative enough to ensure reasonable time to detect a dead connection and reconnect to a new server.
Once a connection to the server is successfully established (connected) there are basically two cases where the client lib generates connection loss (the result code in the C binding, an exception in Java -- see the API documentation for binding-specific details) when either a synchronous or asynchronous operation is performed and one of the following holds:
Added in 3.2.0 -- SessionMovedException. There is an internal exception that is generally not seen by clients called the SessionMovedException. This exception occurs because a request was received on a connection for a session which has been reestablished on a different server. The normal cause of this error is a client that sends a request to a server, but the network packet gets delayed, so the client times out and connects to a new server. When the delayed packet arrives at the first server, the old server detects that the session has moved, and closes the client connection. Clients normally do not see this error since they do not read from those old connections. (Old connections are usually closed.) One situation in which this condition can be seen is when two clients try to reestablish the same connection using a saved session id and password. One of the clients will reestablish the connection and the second client will be disconnected (causing the pair to attempt to re-establish its connection/session indefinitely).
All of the read operations in ZooKeeper -- getData(), getChildren(), and exists() -- have the option of setting a watch as a side effect. Here is ZooKeeper's definition of a watch: a watch event is a one-time trigger, sent to the client that set the watch, which occurs when the data for which the watch was set changes. There are three key points to consider in this definition of a watch:
Watches are maintained locally at the ZooKeeper server to which the client is connected. This allows watches to be lightweight to set, maintain, and dispatch. When a client connects to a new server, the watch will be triggered for any session events. Watches will not be received while disconnected from a server. When a client reconnects, any previously registered watches will be reregistered and triggered if needed. In general this all occurs transparently. There is one case where a watch may be missed: a watch for the existence of a znode not yet created will be missed if the znode is created and deleted while disconnected.
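The "one-time trigger" property is the part of the watch definition that most often surprises people, so here is a toy in-memory model of it. This is not the ZooKeeper API: `OneShotWatches` is a hypothetical class that stores strings in a map, and the watcher is a plain `Consumer` that receives the changed path.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Toy model of one-time watches set as a side effect of a read (hypothetical). */
final class OneShotWatches {
    private final Map<String, String> data = new HashMap<>();
    private final Map<String, List<Consumer<String>>> watches = new HashMap<>();

    /** Reads a value; a non-null watcher is registered as a side effect. */
    String getData(String path, Consumer<String> watcher) {
        if (watcher != null) {
            watches.computeIfAbsent(path, k -> new ArrayList<>()).add(watcher);
        }
        return data.get(path);
    }

    /** Writes a value, then fires and clears every watch set on that path. */
    void setData(String path, String value) {
        data.put(path, value);
        List<Consumer<String>> fired = watches.remove(path); // cleared before firing
        if (fired != null) {
            for (Consumer<String> watcher : fired) {
                watcher.accept(path);
            }
        }
    }

    public static void main(String[] args) {
        OneShotWatches store = new OneShotWatches();
        int[] notifications = {0};
        store.getData("/config", path -> notifications[0]++);
        store.setData("/config", "v1"); // fires the watch
        store.setData("/config", "v2"); // no watch left, nothing fires
        System.out.println(notifications[0]); // 1
    }
}
```

A client that wants to keep observing a znode has to read again and set a new watch after each notification.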
What ZooKeeper Guarantees about Watches
With regard to watches, ZooKeeper maintains these guarantees:
Things to Remember about Watches
ZooKeeper uses ACLs to control access to its znodes (the data nodes of a ZooKeeper data tree). The ACL implementation is quite similar to UNIX file access permissions: it employs permission bits to allow/disallow various operations against a node and the scope to which the bits apply. Unlike standard UNIX permissions, a ZooKeeper node is not limited by the three standard scopes for user (owner of the file), group, and world (other). ZooKeeper does not have a notion of an owner of a znode. Instead, an ACL specifies sets of ids and permissions that are associated with those ids.
Note also that an ACL pertains only to a specific znode. In particular it does not apply to children. For example, if /app is only readable by ip:172.16.16.1 and /app/status is world readable, anyone will be able to read /app/status; ACLs are not recursive.
ZooKeeper supports pluggable authentication schemes. Ids are specified using the form scheme:id, where scheme is the authentication scheme that the id corresponds to. For example, ip:172.16.16.1 is an id for a host with the address 172.16.16.1.
When a client connects to ZooKeeper and authenticates itself, ZooKeeper associates all the ids that correspond to a client with the client's connection. These ids are checked against the ACLs of znodes when a client tries to access a node. ACLs are made up of pairs of (scheme:expression, perms). The format of the expression is specific to the scheme. For example, the pair (ip:19.22.0.0/16, READ) gives the READ permission to any clients with an IP address that starts with 19.22.
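The (ip:19.22.0.0/16, READ) example relies on prefix matching of the client address against the ACL expression. The following sketch shows one way such a match can be computed for IPv4; it is an illustration only, not the code of ZooKeeper's ip scheme, and the class and method names are hypothetical.

```java
/** Illustrative IPv4 prefix matching for expressions like "19.22.0.0/16" (hypothetical). */
final class IpAclMatch {
    /** Parses a dotted-quad IPv4 address into a 32-bit value held in a long. */
    static long toBits(String ip) {
        String[] parts = ip.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ip);
        }
        long bits = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("bad octet: " + part);
            }
            bits = (bits << 8) | octet;
        }
        return bits;
    }

    /** True if clientIp falls inside aclExpr, written as "addr" or "addr/prefixBits". */
    static boolean matches(String clientIp, String aclExpr) {
        String[] expr = aclExpr.split("/", 2);
        int prefix = expr.length == 2 ? Integer.parseInt(expr[1]) : 32;
        if (prefix < 0 || prefix > 32) {
            throw new IllegalArgumentException("bad prefix length: " + prefix);
        }
        // Keep only the top `prefix` bits of each address before comparing.
        long mask = prefix == 0 ? 0 : (0xFFFFFFFFL << (32 - prefix)) & 0xFFFFFFFFL;
        return (toBits(clientIp) & mask) == (toBits(expr[0]) & mask);
    }

    public static void main(String[] args) {
        System.out.println(matches("19.22.7.1", "19.22.0.0/16"));     // true
        System.out.println(matches("19.23.0.1", "19.22.0.0/16"));     // false
        System.out.println(matches("172.16.16.1", "172.16.16.1"));    // true
    }
}
```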
ACL Permissions
ZooKeeper supports the following permissions:
The CREATE and DELETE permissions have been broken out of the WRITE permission for finer-grained access controls. The cases for CREATE and DELETE are the following:
You want A to be able to do a set on a ZooKeeper node, but not be able to CREATE or DELETE children.
CREATE without DELETE: clients create requests by creating ZooKeeper nodes in a parent directory. You want all clients to be able to add, but only the request processor can delete. (This is kind of like the APPEND permission for files.)
Also, the ADMIN permission is there since ZooKeeper doesn’t have a notion of file owner. In some sense the ADMIN permission designates the entity as the owner. ZooKeeper doesn’t support the LOOKUP permission (execute permission bit on directories to allow you to LOOKUP even though you can't list the directory). Everyone implicitly has LOOKUP permission. This allows you to stat a node, but nothing more. (The problem is, if you want to call zoo_exists() on a node that doesn't exist, there is no permission to check.)
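Since permissions are bits, the "CREATE without DELETE" case can be expressed as a bitmask check. In this sketch the numeric bit values are an assumption: they follow the values commonly listed for ZooKeeper's permission constants (READ=1, WRITE=2, CREATE=4, DELETE=8, ADMIN=16), but the text above does not state them, so check them against the API documentation before relying on them. The class `Perms` and the method `allows` are hypothetical.

```java
/** Permission bits and a mask check (bit values are an assumption, see lead-in). */
final class Perms {
    static final int READ = 1 << 0;
    static final int WRITE = 1 << 1;
    static final int CREATE = 1 << 2;
    static final int DELETE = 1 << 3;
    static final int ADMIN = 1 << 4;

    /** True if every bit in `wanted` is present in `granted`. */
    static boolean allows(int granted, int wanted) {
        return (granted & wanted) == wanted;
    }

    public static void main(String[] args) {
        // Request queue: any client may add requests, only the processor may delete.
        int submitter = READ | CREATE;
        int processor = READ | CREATE | DELETE;
        System.out.println(allows(submitter, CREATE)); // true
        System.out.println(allows(submitter, DELETE)); // false
        System.out.println(allows(processor, DELETE)); // true
    }
}
```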
Builtin ACL Schemes
ZooKeeper has the following built-in schemes:
ZooKeeper C client API
The following constants are provided by the ZooKeeper C library:
The following are the standard ACL IDs:
For ZOO_AUTH_IDS, the empty identity string should be interpreted as “the identity of the creator”.
The ZooKeeper client comes with three standard ACLs:
ZOO_OPEN_ACL_UNSAFE is a completely open, free-for-all ACL: any application can execute any operation on the node and can create, list and delete its children. ZOO_READ_ACL_UNSAFE is read-only access for any application. ZOO_CREATOR_ALL_ACL grants all permissions to the creator of the node. The creator must have been authenticated by the server (for example, using the “digest” scheme) before it can create nodes with this ACL.
The following ZooKeeper operations deal with ACLs:
The application uses the zoo_add_auth function to authenticate itself to the server. The function can be called multiple times if the application wants to authenticate using different schemes and/or identities.
The zoo_create(...) operation creates a new node. The acl parameter is a list of ACLs associated with the node. The parent node must have the CREATE permission bit set.
This operation returns a node’s ACL info.
This function replaces a node’s ACL list with a new one. The node must have the ADMIN permission set.
Here is sample code that makes use of the above APIs to authenticate itself using the “foo” scheme and create an ephemeral node “/xyz” with create-only permissions.
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "zookeeper.h"

static zhandle_t *zh;

/**
 * In this example this method gets the cert for your
 * environment -- you must provide
 */
char *foo_get_cert_once(char* id) { return 0; }

/** Watcher function -- empty for this example, not something you should
 * do in real code */
void watcher(zhandle_t *zzh, int type, int state, const char *path,
             void *watcherCtx) {}

int main(int argc, char **argv) {
  char buffer[512];
  char p[2048];
  char *cert=0;
  char appId[64];

  strcpy(appId, "example.foo_test");
  cert = foo_get_cert_once(appId);
  if(cert!=0) {
    fprintf(stderr,
            "Certificate for appid [%s] is [%s]\n",appId,cert);
    strncpy(p,cert, sizeof(p)-1);
    p[sizeof(p)-1] = '\0'; /* strncpy does not terminate on truncation */
    free(cert);
  } else {
    fprintf(stderr, "Certificate for appid [%s] not found\n",appId);
    strcpy(p, "dummy");
  }

  zoo_set_debug_level(ZOO_LOG_LEVEL_DEBUG);

  zh = zookeeper_init("localhost:3181", watcher, 10000, 0, 0, 0);
  if (!zh) {
    return errno;
  }
  /* authenticate with the "foo" scheme, using the cert as credentials */
  if(zoo_add_auth(zh,"foo",p,strlen(p),0,0)!=ZOK)
    return 2;

  /* a single ACL entry: the authenticated creator gets CREATE only */
  struct ACL CREATE_ONLY_ACL[] = {{ZOO_PERM_CREATE, ZOO_AUTH_IDS}};
  struct ACL_vector CREATE_ONLY = {1, CREATE_ONLY_ACL};
  int rc = zoo_create(zh,"/xyz","value", 5, &CREATE_ONLY, ZOO_EPHEMERAL,
                      buffer, sizeof(buffer)-1);

  /** this operation will fail with a ZNOAUTH error */
  int buflen= sizeof(buffer);
  struct Stat stat;
  rc = zoo_get(zh, "/xyz", 0, buffer, &buflen, &stat);
  if (rc) {
    fprintf(stderr, "Error %d at line %d\n", rc, __LINE__);
  }

  zookeeper_close(zh);
  return 0;
}
ZooKeeper runs in a variety of different environments with various different authentication schemes, so it has a completely pluggable authentication framework. Even the built-in authentication schemes use the pluggable authentication framework.
To understand how the authentication framework works, first you must understand the two main authentication operations. The framework first must authenticate the client. This is usually done as soon as the client connects to a server and consists of validating information sent from or gathered about a client and associating it with the connection. The second operation handled by the framework is finding the entries in an ACL that correspond to the client. ACL entries are <idspec, permissions> pairs. The idspec may be a simple string match against the authentication information associated with the connection or it may be an expression that is evaluated against that information. It is up to the implementation of the authentication plugin to do the match. Here is the interface that an authentication plugin must implement:
public interface AuthenticationProvider {
    String getScheme();
    KeeperException.Code handleAuthentication(ServerCnxn cnxn, byte authData[]);
    boolean isValid(String id);
    boolean matches(String id, String aclExpr);
    boolean isAuthenticated();
}
The first method, getScheme, returns the string that identifies the plugin. Because we support multiple methods of authentication, an authentication credential or an idspec will always be prefixed with scheme:. The ZooKeeper server uses the scheme returned by the authentication plugin to determine which ids the scheme applies to.
handleAuthentication is called when a client sends authentication information to be associated with a connection. The client specifies the scheme to which the information corresponds. The ZooKeeper server passes the information to the authentication plugin whose getScheme matches the scheme passed by the client. The implementor of handleAuthentication will usually return an error if it determines that the information is bad, or it will associate information with the connection using cnxn.getAuthInfo().add(new Id(getScheme(), data)).
The authentication plugin is involved in both setting and using ACLs. When an ACL is set for a znode, the ZooKeeper server will pass the id part of the entry to the isValid(String id) method. It is up to the plugin to verify that the id has a correct form. For example, ip:172.16.0.0/16 is a valid id, but ip:host.com is not. If the new ACL includes an "auth" entry, isAuthenticated is used to see if the authentication information for this scheme that is associated with the connection should be added to the ACL. Some schemes should not be included in auth. For example, the IP address of the client is not considered as an id that should be added to the ACL if auth is specified.
ZooKeeper invokes matches(String id, String aclExpr) when checking an ACL. It needs to match authentication information of the client against the relevant ACL entries. To find the entries which apply to the client, the ZooKeeper server will find the scheme of each entry and, if there is authentication information from that client for that scheme, matches(String id, String aclExpr) will be called with id set to the authentication information that was previously added to the connection by handleAuthentication and aclExpr set to the id of the ACL entry. The authentication plugin uses its own logic and matching scheme to determine if id is included in aclExpr.
There are two built-in authentication plugins: ip and digest. Additional plugins can be added using system properties. At startup the ZooKeeper server will look for system properties that start with "zookeeper.authProvider." and interpret the value of those properties as the class name of an authentication plugin. These properties can be set using -Dzookeeper.authProvider.X=com.f.MyAuth or by adding entries such as the following in the server configuration file:
authProvider.1=com.f.MyAuth
authProvider.2=com.f.MyAuth2
Care should be taken to ensure that the suffix on the property is unique. If there are duplicates such as -Dzookeeper.authProvider.X=com.f.MyAuth -Dzookeeper.authProvider.X=com.f.MyAuth2, only one will be used. Also, all servers must have the same plugins defined, otherwise clients using the authentication schemes provided by the plugins will have problems connecting to some servers.
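The property lookup and the duplicate-suffix caveat can be reproduced with `java.util.Properties` alone. This sketch is not the server's startup code: `AuthProviderProps.scan` is a hypothetical helper, and the class names `com.f.MyAuth` and `com.f.MyAuth2` are the placeholder names used in the text.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

/** Hypothetical helper that collects "zookeeper.authProvider.*" properties. */
final class AuthProviderProps {
    static final String PREFIX = "zookeeper.authProvider.";

    /** Returns suffix -> plugin class name for every property that starts with PREFIX. */
    static Map<String, String> scan(Properties props) {
        Map<String, String> found = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(PREFIX)) {
                found.put(key.substring(PREFIX.length()), props.getProperty(key));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("zookeeper.authProvider.1", "com.f.MyAuth");
        props.setProperty("zookeeper.authProvider.2", "com.f.MyAuth2");
        // Same suffix twice: the second value replaces the first, so only one survives.
        props.setProperty("zookeeper.authProvider.X", "com.f.MyAuth");
        props.setProperty("zookeeper.authProvider.X", "com.f.MyAuth2");
        System.out.println(scan(props));
        // prints {1=com.f.MyAuth, 2=com.f.MyAuth2, X=com.f.MyAuth2}
    }
}
```

A property key can hold only one value, which is why a repeated suffix silently drops one of the plugins.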
ZooKeeper is a high performance, scalable service. Both read and write operations are designed to be fast, though reads are faster than writes. The reason for this is that in the case of reads, ZooKeeper can serve older data, which in turn is due to ZooKeeper's consistency guarantees:
Sequential Consistency
Updates from a client will be applied in the order that they were sent.
Atomicity
Updates either succeed or fail -- there are no partial results.
Single System Image
A client will see the same view of the service regardless of the server that it connects to.
Reliability
Once an update has been applied, it will persist from that time forward until a client overwrites the update. This guarantee has two corollaries:
- If a client gets a successful return code, the update will have been applied. On some failures (communication errors, timeouts, etc.) the client will not know whether the update has been applied or not. We take steps to minimize the failures, but the guarantee holds only for successful return codes. (This is called the monotonicity condition in Paxos.)
- Any updates that are seen by the client, through a read request or successful update, will never be rolled back when recovering from server failures.
Timeliness
The client's view of the system is guaranteed to be up-to-date within a certain time bound (on the order of tens of seconds). Either system changes will be seen by a client within this bound, or the client will detect a service outage.
Using these consistency guarantees it is easy to build higher level functions such as leader election, barriers, queues, and read/write revocable locks solely at the ZooKeeper client (no additions needed to ZooKeeper). See Recipes and Solutions for more details.
The ZooKeeper client libraries come in two languages: Java and C. The following sections describe these.
Java Binding
There are two packages that make up the ZooKeeper Java binding: org.apache.zookeeper and org.apache.zookeeper.data. The rest of the packages that make up ZooKeeper are used internally or are part of the server implementation. The org.apache.zookeeper.data package is made up of generated classes that are used simply as containers.
The main class used by a ZooKeeper Java client is the ZooKeeper class. Its two constructors differ only by an optional session id and password. ZooKeeper supports session recovery across instances of a process. A Java program may save its session id and password to stable storage, restart, and recover the session that was used by the earlier instance of the program.
When a ZooKeeper object is created, two threads are created as well: an IO thread and an event thread. All IO happens on the IO thread (using Java NIO). All event callbacks happen on the event thread. Session maintenance, such as reconnecting to ZooKeeper servers and maintaining the heartbeat, is done on the IO thread. Responses for synchronous methods are also processed on the IO thread. All responses to asynchronous methods and watch events are processed on the event thread. There are a few things to notice that result from this design:
Note that if there is a change to /a between the asynchronous read and the synchronous read, the client library will receive the watch event saying /a changed before the response for the synchronous read, but because the completion callback is blocking the event queue, the synchronous read will return with the new value of /a before the watch event is processed.
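The ordering effect in the note above can be reproduced without a ZooKeeper server. The following is a minimal simulation, not client library code: the class EventThreadOrdering is hypothetical, a single-threaded executor stands in for the event thread, the calling thread plays the IO thread, and an AtomicReference stands in for the data of /a.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical simulation of the event-thread ordering described in the text.
class EventThreadOrdering {

    static List<String> run() {
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        AtomicReference<String> nodeA = new AtomicReference<>("old");      // data of /a
        ExecutorService eventThread = Executors.newSingleThreadExecutor(); // one callback at a time
        CountDownLatch callbackStarted = new CountDownLatch(1);
        CountDownLatch changeArrived = new CountDownLatch(1);
        try {
            // Completion callback of the asynchronous read of /a.
            eventThread.submit(() -> {
                log.add("completion callback started");
                callbackStarted.countDown();
                try {
                    changeArrived.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                // The "synchronous read" bypasses the event queue, so it sees the new value.
                log.add("synchronous read returned " + nodeA.get());
            });

            // IO thread: /a changes and the watch event is queued behind the running callback.
            callbackStarted.await();
            nodeA.set("new");
            eventThread.submit(() -> {
                log.add("watch event processed: /a changed");
            });
            changeArrived.countDown();

            eventThread.shutdown();
            eventThread.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

Because the watch event waits in the queue until the completion callback returns, the log shows the synchronous read observing the new value before the watch event is handled.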
Finally, the rules associated with shutdown are straightforward: once a ZooKeeper object is closed or receives a fatal event (SESSION_EXPIRED and AUTH_FAILED), the ZooKeeper object becomes invalid. On a close, the two threads shut down and any further access on the zookeeper handle is undefined behavior and should be avoided.
C Binding
The C binding has a single-threaded and a multi-threaded library. The multi-threaded library is easiest to use and is most similar to the Java API. This library will create an IO thread and an event dispatch thread for handling connection maintenance and callbacks. The single-threaded library allows ZooKeeper to be used in event driven applications by exposing the event loop used in the multi-threaded library.
The package includes two shared libraries: zookeeper_st and zookeeper_mt. The former only provides the asynchronous APIs and callbacks for integrating into the application's event loop. The only reason this library exists is to support platforms where a pthread library is not available or is unstable (e.g., FreeBSD 4.x). In all other cases, application developers should link with zookeeper_mt, as it includes support for both the Sync and Async APIs.
Installation
If you're building the client from a check-out from the Apache repository, follow the steps outlined below. If you're building from a project source package downloaded from Apache, skip to step 3.
Enables optimization and enables debug info compiler options. (Disabled by default.)
Disables Sync API support; zookeeper_mt library won't be built. (Enabled by default.)
Do not build static libraries. (Enabled by default.)
Do not build shared libraries. (Enabled by default.)
Note: See INSTALL for general information about running configure.
Using the C Client
You can test your client by running a ZooKeeper server (see instructions on the project wiki page on how to run it) and connecting to it using one of the cli applications that were built as part of the installation procedure. cli_mt (multithreaded, built against the zookeeper_mt library) is shown in this example, but you could also use cli_st (single threaded, built against the zookeeper_st library):
$ cli_mt zookeeper_host:9876
This is a client application that gives you a shell for executing simple ZooKeeper commands. Once successfully started and connected to the server, it displays a shell prompt. You can now enter ZooKeeper commands. For example, to create a node:
> create /my_new_node
To verify that the node's been created:
> ls /
You should see a list of the nodes that are children of the root node "/".
In order to be able to use the ZooKeeper API in your application you have to remember to
Refer to Program Structure, with Simple Example for examples of usage in Java and C. [tbd]
Building Blocks: A Guide to ZooKeeper Operations
This section surveys all the operations a developer can perform against a ZooKeeper server. It is lower level information than the earlier concepts chapters in this manual, but higher level than the ZooKeeper API Reference. It covers these topics:
Handling Errors
Both the Java and C client bindings may report errors. The Java client binding does so by throwing KeeperException; calling code() on the exception returns the specific error code. The C client binding returns an error code as defined in the enum ZOO_ERRORS. API callbacks indicate the result code for both language bindings. See the API documentation (javadoc for Java, doxygen for C) for full details on the possible errors and their meaning.
Connecting to ZooKeeper
Read Operations
Write Operations
Handling Watches
Miscellaneous ZooKeeper Operations
[tbd]
Gotchas: Common Problems and Troubleshooting
So now you know ZooKeeper. It's fast, simple, your application works, but wait ... something's wrong. Here are some pitfalls that ZooKeeper users fall into:
To avoid swapping, try to set the heap size to the amount of physical memory you have, minus the amount needed by the OS and cache. The best way to determine an optimal heap size for your configurations is to run load tests. If for some reason you can't, be conservative in your estimates and choose a number well below the limit that would cause your machine to swap. For example, on a 4G machine, a 3G heap is a conservative estimate to start with.
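The rule of thumb above is simple arithmetic, sketched here in Java. The class HeapSizeEstimate and the 1024 MB reserve are assumptions chosen to reproduce the 4G/3G example from the text; a real reserve should come from load tests on your own machines.

```java
// Hypothetical helper: heap budget = physical memory minus what the OS and cache need.
class HeapSizeEstimate {

    static long conservativeHeapMb(long physicalMb, long reservedForOsAndCacheMb) {
        if (reservedForOsAndCacheMb < 0 || reservedForOsAndCacheMb >= physicalMb) {
            throw new IllegalArgumentException("reserve must be between 0 and physical memory");
        }
        return physicalMb - reservedForOsAndCacheMb;
    }

    public static void main(String[] args) {
        // 4G machine with an assumed 1G reserve for the OS and cache.
        long heapMb = conservativeHeapMb(4096, 1024);
        System.out.println("suggested JVM flag: -Xmx" + heapMb + "m");

        // For comparison, the maximum heap the current JVM was started with.
        long currentMaxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("current JVM max heap: " + currentMaxMb + " MB");
    }
}
```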
Outside the formal documentation, there are several other sources of information for ZooKeeper developers.
ZooKeeper Whitepaper [tbd: find url]
The definitive discussion of ZooKeeper design and performance, by Yahoo! Research
API Reference [tbd: find url]
The complete reference to the ZooKeeper API
ZooKeeper Talk at the Hadoop Summit 2008
A video introduction to ZooKeeper, by Benjamin Reed of Yahoo! Research
Barrier and Queue Tutorial
The excellent Java tutorial by Flavio Junqueira, implementing simple barriers and producer-consumer queues using ZooKeeper.
ZooKeeper - A Reliable, Scalable Distributed Coordination System
An article by Todd Hoff (07/15/2008)
ZooKeeper Recipes
Pseudo-level discussion of the implementation of various synchronization solutions with ZooKeeper: Event Handles, Queues, Locks, and Two-phase Commits.
[tbd]
Any other good sources anyone can think of...