MODULE
supervisor
MODULE SUMMARY
Generic Supervisor Behaviour
DESCRIPTION
A behaviour module for implementing a supervisor, a process which supervises other processes called child processes. A child process can either be another supervisor or a worker process. Worker processes are normally implemented using one of the gen_event, gen_fsm, or gen_server behaviours. A supervisor implemented using this module will have a standard set of interface functions and include functionality for tracing and error reporting. Supervisors are used to build a hierarchical process structure called a supervision tree, a nice way to structure a fault-tolerant application. Refer to OTP Design Principles for more information.
A supervisor expects the definition of which child processes to supervise to be located in a callback module that exports a predefined set of functions.
Unless otherwise stated, all functions in this module will fail if the specified supervisor does not exist or if bad arguments are given.
Supervision Principles
The supervisor is responsible for starting, stopping and monitoring its child processes. The basic idea of a supervisor is that it should keep its child processes alive by restarting them when necessary.
The children of a supervisor are defined as a list of child specifications. When the supervisor is started, the child processes are started in order from left to right according to this list. When the supervisor terminates, it first terminates its child processes in reverse start order, from right to left.
A supervisor can have one of the following restart strategies:
* one_for_one - if one child process terminates and should be restarted, only that child process is affected.
* one_for_all - if one child process terminates and should be restarted, all other child processes are terminated and then all child processes are restarted.
* rest_for_one - if one child process terminates and should be restarted, the 'rest' of the child processes -- i.e. the child processes after the terminated child process in the start order -- are terminated. Then the terminated child process and all child processes after it are restarted.
* simple_one_for_one - a simplified one_for_one supervisor, where all child processes are dynamically added instances of the same process type, i.e. running the same code.
The functions terminate_child/2, delete_child/2 and restart_child/2 are invalid for simple_one_for_one supervisors and will return {error,simple_one_for_one} if the specified supervisor uses this restart strategy.
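The following is a minimal sketch of a rest_for_one supervisor. It assumes three identical workers named a, b and c. The module name rest_demo_sup and the helper start_worker/0 are hypothetical. To keep the example short, each worker is a bare process created with spawn_link/1 rather than a gen_server.

```erlang
-module(rest_demo_sup).
-behaviour(supervisor).
-export([start_link/0, start_worker/0, init/1]).

start_link() ->
    supervisor:start_link(?MODULE, []).

%% Hypothetical worker: a bare process linked to the caller (the
%% supervisor) that idles until it is killed. Returning {ok, Pid}
%% satisfies the start-function contract.
start_worker() ->
    {ok, spawn_link(fun idle/0)}.

idle() ->
    receive _ -> idle() end.

init([]) ->
    Child = fun(Id) ->
                {Id, {?MODULE, start_worker, []}, permanent, 1000, worker, [?MODULE]}
            end,
    %% Children start left to right: a, then b, then c.
    {ok, {{rest_for_one, 5, 10}, [Child(a), Child(b), Child(c)]}}.
```

Killing b with exit(B, kill) has the following effect under rest_for_one:

* The supervisor terminates c.
* It then restarts b and c in start order.
* a keeps its pid.

Under one_for_one only b would be restarted. Under one_for_all all three would be restarted.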
To prevent a supervisor from getting into an infinite loop of child process terminations and restarts, a maximum restart frequency is defined using two integer values MaxR and MaxT. If more than MaxR restarts occur within MaxT seconds, the supervisor terminates all child processes and then itself.
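The sketch below sets MaxR = 1 and MaxT = 60 so that the restart limit is easy to hit. These values are chosen only for illustration. freq_sup and start_worker/0 are hypothetical names.

```erlang
-module(freq_sup).
-behaviour(supervisor).
-export([start_link/0, start_worker/0, init/1]).

start_link() ->
    supervisor:start_link(?MODULE, []).

%% Hypothetical worker: an idle process linked to the supervisor.
start_worker() ->
    {ok, spawn_link(fun idle/0)}.

idle() ->
    receive _ -> idle() end.

init([]) ->
    MaxR = 1,   % at most one restart ...
    MaxT = 60,  % ... within any 60-second window
    Worker = {w, {?MODULE, start_worker, []}, permanent, 1000, worker, [?MODULE]},
    {ok, {{one_for_one, MaxR, MaxT}, [Worker]}}.
```

The first crash of w is absorbed by one restart. A second crash within 60 seconds exceeds MaxR, so the supervisor terminates its child and then exits with reason shutdown. A caller that should survive this must unlink from the supervisor, or trap exits, before provoking it.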
This is the type definition of a child specification:
child_spec() = {Id,StartFunc,Restart,Shutdown,Type,Modules}
Id = term()
StartFunc = {M,F,A}
M = F = atom()
A = [term()]
Restart = permanent | transient | temporary
Shutdown = brutal_kill | int()>=0 | infinity
Type = worker | supervisor
Modules = [Module] | dynamic
Module = atom()
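For illustration, here is one child specification per Type, written out field by field. my_server and my_sub_sup are hypothetical callback modules. check_childspecs/1 (described below) only checks the shape of a specification, so these modules need not exist for the check to pass.

```erlang
-module(spec_example).
-export([worker_spec/0, supervisor_spec/0]).

%% my_server is a hypothetical gen_server callback module.
worker_spec() ->
    {my_server,                     % Id: unique name within this supervisor
     {my_server, start_link, []},   % StartFunc {M,F,A}, called as apply(M,F,A)
     permanent,                     % Restart: permanent | transient | temporary
     5000,                          % Shutdown: brutal_kill | integer >= 0 | infinity
     worker,                        % Type: worker | supervisor
     [my_server]}.                  % Modules: [Module] | dynamic

%% my_sub_sup is a hypothetical supervisor callback module.
supervisor_spec() ->
    {my_sub_sup, {my_sub_sup, start_link, []}, permanent, infinity, supervisor, [my_sub_sup]}.
```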
* Id is a name that is used to identify the child specification internally by the supervisor.
* StartFunc defines the function call used to start the child process. It should be a module-function-arguments tuple {M,F,A} used as apply(M,F,A).
The start function must create and link to the child process, and should return {ok,Child} or {ok,Child,Info} where Child is the pid of the child process and Info an arbitrary term which is ignored by the supervisor.
The start function can also return ignore if the child process for some reason cannot be started, in which case the child specification will be kept by the supervisor but the non-existing child process will be ignored.
If something goes wrong, the function may also return an error tuple {error,Error}.
Note that the start_link functions of the different behaviour modules fulfill the above requirements.
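The following hypothetical start function meets these requirements without using a behaviour module. It is called inside the supervisor process, so spawn_link/1 links the new child to the supervisor. The module maybe_worker and its boolean argument are illustrative assumptions.

```erlang
-module(maybe_worker).
-export([start_link/1]).

%% Creates and links a child process and returns {ok, Pid}.
start_link(true) ->
    Pid = spawn_link(fun loop/0),
    {ok, Pid};
%% When the feature is switched off, no process is created. A supervisor
%% receiving ignore keeps the child specification with pid undefined.
start_link(false) ->
    ignore.

loop() ->
    receive
        stop -> ok;
        _Other -> loop()
    end.
```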
* Restart defines when a terminated child process should be restarted. A permanent child process is always restarted, a temporary child process is never restarted, and a transient child process is restarted only if it terminates abnormally, i.e. with an exit reason other than normal.
* Shutdown defines how a child process should be terminated. brutal_kill means the child process will be unconditionally terminated using exit(Child,kill). An integer timeout value means that the supervisor will tell the child process to terminate by calling exit(Child,shutdown) and then wait for an exit signal with reason shutdown back from the child process. If no exit signal is received within the specified time, the child process is unconditionally terminated using exit(Child,kill).
If the child process is another supervisor, Shutdown should be set to infinity to give the subtree ample time to shut down.
Important note on simple_one_for_one supervisors: The dynamically created child processes of a simple_one_for_one supervisor are not explicitly killed, regardless of shutdown strategy, but are expected to terminate when the supervisor does (that is, when an exit signal from the parent process is received).
Note that all child processes implemented using the standard OTP behaviour modules automatically adhere to the shutdown protocol.
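The following two-level tree sketches the Shutdown rules:

* The nested supervisor gets infinity.
* One worker gets a 2000 ms timeout.
* A temporary worker is killed with brutal_kill.

tree_sup is a hypothetical module that serves as both the top and the nested supervisor. The argument passed to init/1 selects which role it plays.

```erlang
-module(tree_sup).
-behaviour(supervisor).
-export([start_link/0, start_sub/0, start_worker/0, init/1]).

%% Top-level supervisor.
start_link() ->
    supervisor:start_link(?MODULE, top).

%% Nested supervisor, started as a child of the top-level one.
start_sub() ->
    supervisor:start_link(?MODULE, sub).

%% Hypothetical worker: an idle process linked to its supervisor.
start_worker() ->
    {ok, spawn_link(fun idle/0)}.

idle() ->
    receive _ -> idle() end.

init(top) ->
    %% A supervisor child gets infinity so its own subtree can shut down.
    Sub = {sub, {?MODULE, start_sub, []}, permanent, infinity, supervisor, [?MODULE]},
    {ok, {{one_for_one, 3, 10}, [Sub]}};
init(sub) ->
    %% w1: exit(Pid, shutdown), then exit(Pid, kill) if still alive after 2000 ms.
    W1 = {w1, {?MODULE, start_worker, []}, permanent, 2000, worker, [?MODULE]},
    %% w2: killed at once with exit(Pid, kill).
    W2 = {w2, {?MODULE, start_worker, []}, temporary, brutal_kill, worker, [?MODULE]},
    {ok, {{one_for_one, 3, 10}, [W1, W2]}}.
```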
* Type specifies if the child process is a supervisor or a worker.
* Modules is used by the release handler during code replacement to determine which processes are using a certain module. As a rule of thumb Modules should be a list with one element [Module], where Module is the callback module, if the child process is a supervisor, gen_server or gen_fsm. If the child process is an event manager (gen_event) with a dynamic set of callback modules, Modules should be dynamic. See OTP Design Principles for more information about release handling.
* Internally, the supervisor also keeps track of the pid Child of the child process, or undefined if no pid exists.
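The following sketches a child specification whose Modules field is dynamic. It assumes a supervisor that owns a gen_event manager registered locally as demo_event_mgr, which is a hypothetical name. The module name event_sup is also illustrative.

```erlang
-module(event_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link(?MODULE, []).

init([]) ->
    %% The handlers installed in an event manager change at runtime,
    %% so the release handler cannot be given a fixed module list.
    EventMgr = {event_mgr,
                {gen_event, start_link, [{local, demo_event_mgr}]},
                permanent, 5000, worker, dynamic},
    {ok, {{one_for_one, 3, 10}, [EventMgr]}}.
```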
EXPORTS
start_link(Module, Args) -> Result
start_link(SupName, Module, Args) -> Result
Types:
SupName = {local,Name} | {global,Name}
Name = atom()
Module = atom()
Args = term()
Result = {ok,Pid} | ignore | {error,Error}
Pid = pid()
Error = {already_started,Pid} | shutdown | term()
Creates a supervisor process as part of a supervision tree. The function will, among other things, ensure that the supervisor is linked to the calling process (its supervisor).
The created supervisor process calls Module:init/1 to find out about restart strategy, maximum restart frequency and child processes. To ensure a synchronized start-up procedure, start_link/2,3 does not return until Module:init/1 has returned and all child processes have been started.
If SupName={local,Name} the supervisor is registered locally as Name using register/2. If SupName={global,Name} the supervisor is registered globally as Name using global:register_name/2. If no name is provided, the supervisor is not registered.
Module is the name of the callback module.
Args is an arbitrary term which is passed as the argument to Module:init/1.
If the supervisor and its child processes are successfully created (i.e. if all child process start functions return {ok,Child}, {ok,Child,Info}, or ignore) the function returns {ok,Pid}, where Pid is the pid of the supervisor. If there already exists a process with the specified SupName the function returns {error,{already_started,Pid}}, where Pid is the pid of that process.
If Module:init/1 returns ignore, this function returns ignore as well and the supervisor terminates with reason normal. If Module:init/1 fails or returns an incorrect value, this function returns {error,Term} where Term is a term with information about the error, and the supervisor terminates with reason Term.
If any child process start function fails or returns an error tuple or an erroneous value, the function returns {error,shutdown} and the supervisor terminates all started child processes and then itself with reason shutdown.
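The following is a minimal sketch of start_link/3 with a locally registered name. It assumes a hypothetical module named_sup whose init/1 returns either a valid specification with no children or ignore, depending on its argument.

```erlang
-module(named_sup).
-behaviour(supervisor).
-export([start_link/0, start_link_ignored/0, init/1]).

%% Registered locally as named_sup via SupName = {local, named_sup}.
start_link() ->
    supervisor:start_link({local, named_sup}, ?MODULE, normal).

%% Unregistered; init/1 returns ignore, so start_link/2 returns ignore.
start_link_ignored() ->
    supervisor:start_link(?MODULE, ignore).

init(normal) ->
    {ok, {{one_for_one, 3, 10}, []}};
init(ignore) ->
    ignore.
```

Calling named_sup:start_link() a second time returns {error,{already_started,Pid}}, where Pid is the pid of the running instance. Calling start_link_ignored() returns ignore.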
start_child(SupRef, ChildSpec) -> Result
Types:
SupRef = Name | {Name,Node} | {global,Name} | pid()
Name = Node = atom()
ChildSpec = child_spec() | [term()]
Result = {ok,Child} | {ok,Child,Info} | {error,Error}
Child = pid() | undefined
Info = term()
Error = already_present | {already_started,Child} | term()
Dynamically adds a child specification to the supervisor SupRef which starts the corresponding child process.
SupRef can be:
* the pid,
* Name, if the supervisor is locally registered,
* {Name,Node}, if the supervisor is locally registered at another node, or
* {global,Name}, if the supervisor is globally registered.
ChildSpec should be a valid child specification (unless the supervisor is a simple_one_for_one supervisor, see below). The child process will be started by using the start function as defined in the child specification.
In the case of a simple_one_for_one supervisor, the child specification defined in Module:init/1 is used, and ChildSpec should instead be an arbitrary list of terms, List. The child process is then started by appending List to the existing start function arguments, i.e. by calling apply(M, F, A++List) where {M,F,A} is the start function defined in the child specification.
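The following sketches a simple_one_for_one supervisor. It assumes a hypothetical worker start function start_worker/2 in a hypothetical module pool_sup. The template's A list is [demo_pool], so supervisor:start_child(Sup, [alice]) results in the call apply(pool_sup, start_worker, [demo_pool] ++ [alice]).

```erlang
-module(pool_sup).
-behaviour(supervisor).
-export([start_link/0, start_worker/2, init/1]).

start_link() ->
    supervisor:start_link(?MODULE, []).

%% Hypothetical start function: Pool comes from the template's A list,
%% Name is the List passed to start_child/2.
start_worker(Pool, Name) ->
    {ok, spawn_link(fun() -> loop(Pool, Name) end)}.

%% Replies with the arguments it was started with.
loop(Pool, Name) ->
    receive
        {whoami, From} ->
            From ! {self(), Pool, Name},
            loop(Pool, Name)
    end.

init([]) ->
    %% Exactly one template; its Id is ignored and nothing starts here.
    Template = {ignored_id, {?MODULE, start_worker, [demo_pool]},
                temporary, brutal_kill, worker, [?MODULE]},
    {ok, {{simple_one_for_one, 5, 10}, [Template]}}.
```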
If a child specification with the specified Id already exists, ChildSpec is discarded and the function returns {error,already_present} if the corresponding child process is not running, or {error,{already_started,Child}} if it is running.
If the child process start function returns {ok,Child} or {ok,Child,Info}, the child specification and pid are added to the supervisor and the function returns the same value.
If the child process start function returns ignore, the child specification is added to the supervisor, the pid is set to undefined and the function returns {ok,undefined}.
If the child process start function returns an error tuple or an erroneous value, or if it fails, the child specification is discarded and the function returns {error,Error} where Error is a term containing information about the error and child specification.
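The hypothetical helper add_worker/2 below maps each outcome of start_child/2 described above to a tagged result. The module name dyn_sup and the return tags are illustrative. The supervisor starts with no children.

```erlang
-module(dyn_sup).
-behaviour(supervisor).
-export([start_link/0, start_worker/0, add_worker/2, init/1]).

start_link() ->
    supervisor:start_link(?MODULE, []).

%% No static children; they are added at runtime.
init([]) ->
    {ok, {{one_for_one, 3, 10}, []}}.

start_worker() ->
    {ok, spawn_link(fun idle/0)}.

idle() ->
    receive _ -> idle() end.

add_worker(Sup, Id) ->
    Spec = {Id, {?MODULE, start_worker, []}, permanent, 1000, worker, [?MODULE]},
    case supervisor:start_child(Sup, Spec) of
        {ok, Pid} -> {started, Pid};                 % Pid is undefined on ignore
        {ok, Pid, _Info} -> {started, Pid};
        {error, {already_started, Pid}} -> {running, Pid};
        {error, already_present} -> present_not_running;
        {error, Reason} -> {failed, Reason}
    end.
```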
terminate_child(SupRef, Id) -> Result
Types:
SupRef = Name | {Name,Node} | {global,Name} | pid()
Name = Node = atom()
Id = term()
Result = ok | {error,Error}
Error = not_found | simple_one_for_one
Tells the supervisor SupRef to terminate the child process corresponding to the child specification identified by Id. The process, if there is one, is terminated but the child specification is kept by the supervisor. This means that the child process may later be restarted by the supervisor. The child process can also be restarted explicitly by calling restart_child/2. Use delete_child/2 to remove the child specification.
See start_child/2 for a description of SupRef.
If successful, the function returns ok. If there is no child specification with the specified Id, the function returns {error,not_found}.
delete_child(SupRef, Id) -> Result
Types:
SupRef = Name | {Name,Node} | {global,Name} | pid()
Name = Node = atom()
Id = term()
Result = ok | {error,Error}
Error = running | not_found | simple_one_for_one
Tells the supervisor SupRef to delete the child specification identified by Id. The corresponding child process must not be running; use terminate_child/2 to terminate it first.
See start_child/2 for a description of SupRef.
If successful, the function returns ok. If the child specification identified by Id exists but the corresponding child process is running, the function returns {error,running}. If the child specification identified by Id does not exist, the function returns {error,not_found}.
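The following sketches the terminate-then-delete sequence. It assumes a hypothetical module remove_demo_sup with one permanent child, a. Calling delete_child/2 while a is running returns {error,running}, so the hypothetical helper remove_child/2 terminates the child first.

```erlang
-module(remove_demo_sup).
-behaviour(supervisor).
-export([start_link/0, start_worker/0, remove_child/2, init/1]).

start_link() ->
    supervisor:start_link(?MODULE, []).

start_worker() ->
    {ok, spawn_link(fun idle/0)}.

idle() ->
    receive _ -> idle() end.

init([]) ->
    Worker = {a, {?MODULE, start_worker, []}, permanent, 1000, worker, [?MODULE]},
    {ok, {{one_for_one, 3, 10}, [Worker]}}.

%% Stop the child process, then drop its specification.
remove_child(Sup, Id) ->
    case supervisor:terminate_child(Sup, Id) of
        ok -> supervisor:delete_child(Sup, Id);
        {error, _} = Error -> Error
    end.
```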
restart_child(SupRef, Id) -> Result
Types:
SupRef = Name | {Name,Node} | {global,Name} | pid()
Name = Node = atom()
Id = term()
Result = {ok,Child} | {ok,Child,Info} | {error,Error}
Child = pid() | undefined
Error = running | not_found | simple_one_for_one | term()
Tells the supervisor SupRef to restart a child process corresponding to the child specification identified by Id. The child specification must exist and the corresponding child process must not be running.
See start_child/2 for a description of SupRef.
If the child specification identified by Id does not exist, the function returns {error,not_found}. If the child specification exists but the corresponding process is already running, the function returns {error,running}.
If the child process start function returns {ok,Child} or {ok,Child,Info}, the pid is added to the supervisor and the function returns the same value.
If the child process start function returns ignore, the pid remains set to undefined and the function returns {ok,undefined}.
If the child process start function returns an error tuple or an erroneous value, or if it fails, the function returns {error,Error} where Error is a term containing information about the error.
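The hypothetical helper ensure_running/2 below is built on restart_child/2. It restarts a stopped child and reports a child that is already running. restart_demo_sup is an illustrative module with one permanent child, a.

```erlang
-module(restart_demo_sup).
-behaviour(supervisor).
-export([start_link/0, start_worker/0, ensure_running/2, init/1]).

start_link() ->
    supervisor:start_link(?MODULE, []).

start_worker() ->
    {ok, spawn_link(fun idle/0)}.

idle() ->
    receive _ -> idle() end.

init([]) ->
    Worker = {a, {?MODULE, start_worker, []}, permanent, 1000, worker, [?MODULE]},
    {ok, {{one_for_one, 3, 10}, [Worker]}}.

ensure_running(Sup, Id) ->
    case supervisor:restart_child(Sup, Id) of
        {ok, Pid} -> {restarted, Pid};
        {ok, Pid, _Info} -> {restarted, Pid};
        {error, running} -> already_running;
        {error, Reason} -> {error, Reason}          % e.g. not_found
    end.
```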
which_children(SupRef) -> [{Id,Child,Type,Modules}]
Types:
SupRef = Name | {Name,Node} | {global,Name} | pid()
Name = Node = atom()
Id = term() | undefined
Child = pid() | undefined
Type = worker | supervisor
Modules = [Module] | dynamic
Module = atom()
Returns a list with information about all child specifications and child processes belonging to the supervisor SupRef.
See start_child/2 for a description of SupRef.
The information given for each child specification/process is:
* Id - as defined in the child specification or undefined in the case of a simple_one_for_one supervisor.
* Child - the pid of the corresponding child process, or undefined if there is no such process.
* Type - as defined in the child specification.
* Modules - as defined in the child specification.
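The following sketch uses which_children/1 to find children that currently have a live process. It assumes a hypothetical module children_sup in which child b's start function returns ignore, so b's entry keeps Child = undefined. The is_pid/1 filter skips such entries.

```erlang
-module(children_sup).
-behaviour(supervisor).
-export([start_link/0, start_worker/1, running/1, init/1]).

start_link() ->
    supervisor:start_link(?MODULE, []).

%% Hypothetical start function: returns ignore when asked to skip.
start_worker(run) -> {ok, spawn_link(fun idle/0)};
start_worker(skip) -> ignore.

idle() ->
    receive _ -> idle() end.

init([]) ->
    A = {a, {?MODULE, start_worker, [run]}, permanent, 1000, worker, [?MODULE]},
    B = {b, {?MODULE, start_worker, [skip]}, permanent, 1000, worker, [?MODULE]},
    {ok, {{one_for_one, 3, 10}, [A, B]}}.

%% Ids of the children that currently have a pid.
running(Sup) ->
    [Id || {Id, Child, _Type, _Modules} <- supervisor:which_children(Sup),
           is_pid(Child)].
```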
check_childspecs([ChildSpec]) -> Result
Types:
ChildSpec = child_spec()
Result = ok | {error,Error}
Error = term()
This function takes a list of child specifications as argument and returns ok if all of them are syntactically correct, or {error,Error} otherwise.
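The following sketch runs check_childspecs/1 on one valid specification and one deliberately invalid one. The invalid one uses the atom sometimes, which is not a legal Restart value. spec_check is a hypothetical module. The check is syntactic, so spec_check:start/0 need not exist. The exact Error term is not matched because this page does not specify its format.

```erlang
-module(spec_check).
-export([good/0, bad/0]).

%% Well-formed: every field uses a value allowed by child_spec().
good() ->
    {a, {spec_check, start, []}, transient, brutal_kill, worker, [spec_check]}.

%% Malformed on purpose: 'sometimes' is not a valid Restart value.
bad() ->
    {b, {spec_check, start, []}, sometimes, brutal_kill, worker, [spec_check]}.
```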
CALLBACK FUNCTIONS
The following functions should be exported from a supervisor callback module.
EXPORTS
Module:init(Args) -> Result
Types:
Args = term()
Result = {ok,{{RestartStrategy,MaxR,MaxT},[ChildSpec]}} | ignore
RestartStrategy = one_for_all | one_for_one | rest_for_one | simple_one_for_one
MaxR = MaxT = int()>=0
ChildSpec = child_spec()
Whenever a supervisor is started using supervisor:start_link/2,3, this function is called by the new process to find out about restart strategy, maximum restart frequency and child specifications.
Args is the Args argument provided to the start function.
RestartStrategy is the restart strategy, and MaxR and MaxT define the maximum restart frequency of the supervisor. [ChildSpec] is a list of valid child specifications defining which child processes the supervisor should start and monitor. See the discussion about Supervision Principles above.
Note that when the restart strategy is simple_one_for_one, the list of child specifications must contain exactly one child specification (its Id is ignored). No child process is then started during the initialization phase; instead, all children are assumed to be started dynamically using supervisor:start_child/2.
The function may also return ignore.
SEE ALSO
gen_event(3), gen_fsm(3), gen_server(3), sys(3)
stdlib 1.16
Copyright © 1991-2009 Ericsson AB