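Analysis of the stop command in the Erlang application script

A more accurate title for this post would be "How to shut down an Erlang application safely".

Generating the Erlang application script

After creating an Erlang node with the rebar tool: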


 ./rebar create-node nodeid=hook_heroes

run the packaging command in the rel directory:


 ./rebar generate

This generates the complete release package, with the following directory layout:


 bin erts-6.0 lib log releases

Inside bin there is a startup script with the same name as the node, here hook_heroes.

To stop the service, we currently use:


 ./hook_heroes stop

Analysis of hook_heroes stop

hook_heroes stop runs the following:


 # Tell nodetool to initiate a stop
 $NODETOOL stop
 ES=$?
 if [ "$ES" -ne 0 ]; then
 exit $ES
 fi

nodetool here comes from:


 NODETOOL="$ERTS_PATH/escript $ERTS_PATH/nodetool

That is, the nodetool script in the erts directory, invoked with the argument stop.
nodetool is an escript whose purpose is "Helper Script for interacting with live nodes".


 case RestArgs of
 ["getpid"] ->
 io:format("~p\n",
 [list_to_integer(rpc:call(TargetNode, os, getpid, []))]);
 ["ping"] ->
 io:format("pong\n");
 ["stop"] ->
 io:format("~p\n", [rpc:call(TargetNode, init, stop, [], 60000)]);
 .......

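As you can see, it simply uses rpc:call(): it calls the stop function of the init module on TargetNode with an empty argument list and a 60000 ms timeout. Next, let's look at the stop function of the init module.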
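The same request can be made from any connected Erlang node. The following is a minimal sketch, not part of nodetool: the module name stop_node_demo and the {error, Reason} wrapper are assumptions made for illustration, and only the rpc:call/5 invocation mirrors the nodetool clause above.

```erlang
-module(stop_node_demo).
-export([stop/1]).

%% Ask TargetNode to shut down gracefully, using the same call as the
%% nodetool "stop" clause: init:stop() on the remote node, 60 s timeout.
%% The return-value mapping below is an illustrative assumption.
stop(TargetNode) when is_atom(TargetNode) ->
    case rpc:call(TargetNode, init, stop, [], 60000) of
        ok ->
            %% init:stop/0 returns ok immediately; the node is still
            %% shutting down in the background at this point.
            ok;
        {badrpc, Reason} ->
            %% The node was unreachable or the call failed.
            {error, Reason}
    end.
```

Note that a successful return only means the stop request was delivered; the target node may still be running its shutdown sequence.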

The init:stop() call

The init module documentation describes the module as "Coordination of System Startup",
and the description of stop is:


 All applications are taken down smoothly, all code is unloaded, and all ports are closed before the system terminates

It is obviously used to shut the system down; what matters is how it does so.

Entry point:


 stop() -> init ! {stop,stop}, ok.

It sends a {stop,stop} message to the init process.

The init process receives messages in a loop:


 loop(State) ->
 receive
 {'EXIT',Pid,Reason} ->
 Kernel = State#state.kernel,
 terminate(Pid,Kernel,Reason), %% If Pid is a Kernel pid, halt()!
 loop(State);
 {stop,Reason} ->
 stop(Reason,State);
 {From,fetch_loaded} -> %% The Loaded info is cleared in
 Loaded = State#state.loaded, %% boot_loop but is handled here
 From ! {init,Loaded}, %% anyway.
 loop(State);
 {From, {ensure_loaded, _}} ->
 From ! {init, not_allowed},
 loop(State);
 Msg ->
 loop(handle_msg(Msg,State))
 end.

The {stop,Reason} clause matches and calls stop(Reason,State) with Reason bound to stop,
which brings us here:


 stop(Reason,State) ->
 BootPid = State#state.bootpid,
 {_,Progress} = State#state.status,
 State1 = State#state{status = {stopping, Progress}},
 clear_system(BootPid,State1),
 do_stop(Reason,State1).

The functions to focus on are clear_system and do_stop.

The clear_system() function

clear_system() shuts down the processes in the VM, using just three function calls:


 clear_system(BootPid,State) ->
 Heart = get_heart(State#state.kernel), %A
 shutdown_pids(Heart,BootPid,State), %B
 unload(Heart). %C

A and C both deal with the Erlang startup flag -heart, whose meaning is explained in vm.args:


 Heartbeat management; auto-restarts VM if it dies or becomes unresponsive
 (Disabled by default..use with caution!)
 -heart

Normally -heart is not used,
so here we only look at what shutdown_pids() does.

The shutdown_pids() function


 shutdown_pids(Heart,BootPid,State) ->
 Timer = shutdown_timer(State#state.flags),
 catch shutdown(State#state.kernel,BootPid,Timer,State),
 kill_all_pids(Heart), % Even the shutdown timer.
 kill_all_ports(Heart),
 flush_timout(Timer).

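This first sets up the shutdown timer from the init flags, then shuts down the kernel processes, then kills all remaining processes and ports, and finally flushes any pending timeout from the timer.

Shutting down the kernel processes: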


 %%
 %% A kernel pid must handle the special case message
 %% {'EXIT',Parent,Reason} and terminate upon it!
 %%
 shutdown_kernel_pid(Pid, BootPid, Timer, State) ->
 Pid ! {'EXIT',BootPid,shutdown},
 shutdown_loop(Pid, Timer, State, []).

What is an Erlang kernel process?

This sentence is the key point: A kernel pid must handle the special case message and terminate upon it!
So what is a kernel process?
Take a look at bin/start.script:


 ...
 {kernelProcess,heart,{heart,start,[]}},
 {kernelProcess,error_logger,{error_logger,start_link,[]}},
 {kernelProcess,application_controller,
 {application_controller,start,
 [{application,kernel,
 ...

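Every process tagged kernelProcess is a kernel process, application_controller in particular.

Source: http://blog.yufeng.info/archives/1411

So what the supervision tree receives is {'EXIT',BootPid,shutdown}.

Killing the remaining processes: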


 kill_all_pids(Heart) ->
 case get_pids(Heart) of
 [] ->
 ok;
 Pids ->
 kill_em(Pids),
 kill_all_pids(Heart) % Continue until all are really killed.
 end.

Following the calls down, this ends up using:


 exit(Pid,kill)

which sends a kill exit signal to each process.

The supervisor terminate function

terminate() in supervisor looks like this:


 -spec terminate(term(), state()) -> 'ok'.
terminate(_Reason, #state{children=[Child]} = State) when ?is_simple(State) ->
 terminate_dynamic_children(Child, dynamics_db(Child#child.restart_type,
 State#state.dynamics),
 State#state.name);
 terminate(_Reason, State) ->
 terminate_children(State#state.children, State#state.name).

There are two cases: simple_one_for_one and everything else.
The terminate_dynamic_children() function:


 ...
 EStack = case Child#child.shutdown of
 brutal_kill ->
 ?SETS:fold(fun(P, _) -> exit(P, kill) end, ok, Pids),
 wait_dynamic_children(Child, Pids, Sz, undefined, EStack0);
 infinity ->
 ?SETS:fold(fun(P, _) -> exit(P, shutdown) end, ok, Pids),
 wait_dynamic_children(Child, Pids, Sz, undefined, EStack0);
 Time ->
 ?SETS:fold(fun(P, _) -> exit(P, shutdown) end, ok, Pids),
 TRef = erlang:start_timer(Time, self(), kill),
 wait_dynamic_children(Child, Pids, Sz, TRef, EStack0)
 end,
 ...

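This shows how the Shutdown field of the child spec affects the way child processes are stopped:
brutal_kill: sends a kill signal, which cannot be trapped. Even if the worker has called process_flag(trap_exit, true), it will not receive an {'EXIT',_From,Reason} message.
infinity and Time: both send a shutdown signal to the supervised worker, and a worker that has called process_flag(trap_exit, true) receives {'EXIT',_From,Reason}. The only difference is that infinity waits indefinitely, whereas Time sets a timeout: once it expires, the supervisor sends a kill signal and kills the worker outright.
This analysis agrees with what the Erlang documentation says about the gen_server terminate() callback: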


 If the gen_server is part of a supervision tree and is ordered by its supervisor to terminate, this function will be called with Reason=shutdown if the following conditions apply:
the gen_server has been set to trap exit signals, and
 the shutdown strategy as defined in the supervisor's child specification is an integer timeout value, not brutal_kill.
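To see the difference in practice, here is a minimal sketch that imitates the two signals a supervisor may send, without running a real supervisor. The module name terminate_demo, the observer-pid state and the run/0 helper are all assumptions made for illustration; the calling process plays the role of the parent.

```erlang
-module(terminate_demo).
-behaviour(gen_server).

-export([run/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2]).

%% The worker traps exits, as the gen_server documentation requires for
%% terminate/2 to be called on shutdown. The state is the observer's pid.
init(Observer) ->
    process_flag(trap_exit, true),
    {ok, Observer}.

handle_call(_Req, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.
handle_info(_Info, State) -> {noreply, State}.

%% Report that terminate/2 really ran, and with which reason.
terminate(Reason, Observer) ->
    Observer ! {terminate_called, self(), Reason},
    ok.

%% Start a worker linked to the caller (its parent), send it Signal the
%% way a supervisor would, and report whether terminate/2 was called.
stop_with(Signal) ->
    {ok, Pid} = gen_server:start_link(?MODULE, self(), []),
    exit(Pid, Signal),
    %% The caller traps exits (see run/0), so the worker's death shows
    %% up as an 'EXIT' message carrying its exit reason.
    receive {'EXIT', Pid, ExitReason} -> ok end,
    Called = receive
                 {terminate_called, Pid, R} -> {called, R}
             after 0 ->
                 not_called
             end,
    {Called, ExitReason}.

run() ->
    Old = process_flag(trap_exit, true),
    Graceful = stop_with(shutdown),   % what Time / infinity send first
    Brutal = stop_with(kill),         % what brutal_kill sends
    process_flag(trap_exit, Old),
    {Graceful, Brutal}.
```

Calling terminate_demo:run() from a shell should show terminate/2 being called for the shutdown signal and skipped for kill, in which case the worker's exit reason is killed.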

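When does a supervisor call terminate()?

This leaves one last question: when does a supervisor call terminate()? As analyzed earlier, when the kernel processes are shut down, the supervision tree receives the message {'EXIT',BootPid,shutdown} from BootPid. Since a supervisor is in fact a gen_server, let's look at its handle_info():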


 -spec handle_info(term(), state()) ->
 {'noreply', state()} | {'stop', 'shutdown', state()}.
handle_info({'EXIT', Pid, Reason}, State) ->
 case restart_child(Pid, Reason, State) of %重启child
 {ok, State1} -> %A
 {noreply, State1};
 {shutdown, State1} -> %B
 {stop, shutdown, State1}
 end;
handle_info(Msg, State) ->
 error_logger:error_msg("Supervisor received unexpected message: ~p~n",
 [Msg]),
 {noreply, State}.

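This code clearly handles exit signals sent by children and calls restart_child(). Tracing into restart_child() did not explain things either: the Pid passed in is not a child but BootPid, so branch A is always taken, which means terminate would never be called. I was stuck here.
I then went through the supervisor documentation and found, to my surprise, that it does not describe terminate() at all, so I was stuck again.
Finally I remembered that a supervisor is really a gen_server, so the place to look is the gen_server documentation on terminate():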


 ...
 Even if the gen_server is not part of a supervision tree, this function will be called if it receives an 'EXIT' message from its parent. Reason will be the same as in the 'EXIT' message.
 ...

This says that whenever a gen_server receives an 'EXIT' message from its parent, terminate() is called. That matches the message found in the earlier analysis:


 {'EXIT',BootPid,shutdown}

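Whether BootPid really is the parent of the supervisor is something I have not had time to investigate. I expect it is: otherwise something else would have to tell the top-level supervisor to shut down, and the name BootPid makes it quite plausible. I am leaving this as an open item to fill in later; it mainly concerns how init:start() boots the system.

Miscellaneous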
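The special treatment of the parent can be checked with a small experiment. The sketch below is illustrative only (the module name parent_exit_demo and its helpers are assumptions): a gen_server that traps exits receives the same shutdown signal twice, first from an unrelated process and then from the process that started it with start_link.

```erlang
-module(parent_exit_demo).
-behaviour(gen_server).

-export([run/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2]).

init(Observer) ->
    process_flag(trap_exit, true),
    {ok, Observer}.

handle_call(_Req, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.

%% 'EXIT' messages from processes other than the parent arrive here
%% like any other message; the server keeps running.
handle_info({'EXIT', From, Reason}, Observer) ->
    Observer ! {handle_info_exit, From, Reason},
    {noreply, Observer};
handle_info(_Other, Observer) ->
    {noreply, Observer}.

%% An 'EXIT' message from the parent never reaches handle_info/2;
%% gen_server calls terminate/2 directly.
terminate(Reason, Observer) ->
    Observer ! {terminate_called, Reason},
    ok.

run() ->
    Old = process_flag(trap_exit, true),
    {ok, Pid} = gen_server:start_link(?MODULE, self(), []),
    %% 1. An unrelated process sends the shutdown signal.
    Stranger = spawn(fun() -> exit(Pid, shutdown) end),
    First = receive
                {handle_info_exit, Stranger, shutdown} -> handled_as_info
            after 1000 -> timeout
            end,
    StillAlive = is_process_alive(Pid),
    %% 2. The parent (this process) sends the very same signal.
    exit(Pid, shutdown),
    Second = receive
                 {terminate_called, shutdown} -> terminate_called
             after 1000 -> timeout
             end,
    %% Consume the 'EXIT' message produced by the link to the worker.
    receive {'EXIT', Pid, _} -> ok after 1000 -> ok end,
    process_flag(trap_exit, Old),
    {First, StillAlive, Second}.
```

This also explains why nothing useful turned up in supervisor's handle_info(): the gen_server loop intercepts the parent's 'EXIT' message before the callback module sees it.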

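In our earlier code, the child spec of the player process set shutdown to brutal_kill, which was clearly a mistake: with that setting terminate is naturally never called when the application shuts down.
The article "Erlang OTP之terminate 深入分析" (an in-depth analysis of terminate in Erlang OTP) is based on Erlang R14A and recommends using one_for_one. The reason is simple: in Erlang R14A, supervisor's terminate() function looks like this: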
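A corrected child spec might look like the sketch below. Everything in it except the idea of replacing brutal_kill with a timeout is an assumption: the module name player, its start_link/0 function, the temporary restart type and the 5000 ms value are hypothetical and should be adapted to the real code.

```erlang
-module(player_spec_demo).
-export([player_child_spec/0]).

%% Hypothetical child spec for the player worker. The fourth element is
%% the shutdown field: an integer timeout in milliseconds instead of
%% brutal_kill, so a player that traps exits gets terminate/2 called.
player_child_spec() ->
    {player,                      % child id (assumed)
     {player, start_link, []},    % start function (assumed)
     temporary,                   % restart type (assumed)
     5000,                        % shutdown: wait up to 5 s, then kill
     worker,
     [player]}.
```

The worker must also call process_flag(trap_exit, true) in its init/1 for the timeout to have any effect on terminate().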
```
 terminate(_Reason, State) ->
 terminate_children(State#state.children, State#state.name),
 ok.
 
```
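Compared with the version 17 code, this does not handle simple_one_for_one separately. The children of a simple_one_for_one supervisor and of a one_for_one supervisor are stored differently inside the supervisor: the former are kept in the dynamics field,
the latter in the children field. The R14A version only terminates the children stored in children and does nothing at all for simple_one_for_one children.
I also repeated that article's experiment on my own machine, and my results do differ from his.

References

"Erlang OTP之terminate 深入分析" (an in-depth analysis of terminate in Erlang OTP)
"erlang init stop浅析" (a brief analysis of init stop in Erlang)
Erlang documentation
"”Erlang supervisor 极其白痴的 Bug“的澄清" (a clarification of "an extremely silly bug in Erlang supervisor"): this article briefly explains what an Erlang kernelProcess process is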
