Erlang OTP: An In-Depth Analysis of terminate

 Reposted from: 庆亮的博客-webgame架构 (Qingliang's blog, webgame architecture)

I. Overview of terminate and how the problem arises

 

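terminate is a gen_server callback. If a gen_server process has set trap_exit to true (process_flag(trap_exit, true)), terminate is called automatically when the process is shut down by its parent; it is also called whenever a callback returns a stop tuple. We can use this to do cleanup work when the process exits, such as persisting data or releasing resources. In practice, however, terminate does not necessarily get enough time to finish all of its work: the process may be forcibly killed by the system before that (for example, when beam is stopped via init:stop).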
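The behaviour described above can be checked in isolation with a minimal sketch. The module name term_demo and the {terminated, Pid, Reason} notification message are hypothetical and exist only for this illustration; the calling process plays the role of the parent and sends the same shutdown signal a supervisor would send.

```erlang
-module(term_demo).
-behaviour(gen_server).

-export([run/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2]).

%% Starts a linked gen_server, asks it to shut down the way a
%% supervisor would, and returns the reason seen by terminate/2.
run() ->
    %% Trap exits in the caller so the linked child's exit does not kill it.
    Old = process_flag(trap_exit, true),
    {ok, Pid} = gen_server:start_link(?MODULE, self(), []),
    exit(Pid, shutdown),
    Result = receive
                 {terminated, Pid, Reason} -> Reason
             after 1000 -> timeout
             end,
    %% Drop the 'EXIT' message from the link, then restore the old flag.
    receive {'EXIT', Pid, _} -> ok after 1000 -> ok end,
    process_flag(trap_exit, Old),
    Result.

init(Notify) ->
    %% Without this flag the shutdown signal would end the process
    %% before terminate/2 could run.
    process_flag(trap_exit, true),
    {ok, Notify}.

handle_call(_Request, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.
handle_info(_Info, State) -> {noreply, State}.

%% Hypothetical cleanup hook: only reports the reason to the caller.
terminate(Reason, Notify) ->
    Notify ! {terminated, self(), Reason},
    ok.
```

Calling term_demo:run() from the shell should return shutdown. Removing the process_flag call in init/1 is a quick way to see the timeout branch instead.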

 

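II. Testing terminate

An Erlang process can end in two ways: actively (for example, a player process ends itself after the player goes offline) or passively (init:stop).

When the system is stopped with init:stop (c:q() is a shorthand for it), all processes are stopped in turn. If a process is a supervisor, it first stops all of its children in turn and then ends itself; child supervisors are handled the same way. erlang:halt is different: it halts the runtime immediately without this orderly shutdown.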

 

Test first, then analyze the source. The tests cover four cases:

 

Process exits on its own + simple_one_for_one

init:stop + simple_one_for_one

Process exits on its own + one_for_one

init:stop + one_for_one

 

Source files:

test.erl (application)

test_sup.erl (supervisor)

test_server.erl (gen_server)

 

Source of test.erl:

 

-module(test).

-behaviour(application).

-export([start/0, start/2, stop/1]).

start() ->
    application:start(test).

start(_StartType, _StartArgs) ->
    case test_sup:start_link() of
        {ok, Pid} ->
            {ok, Pid};
        Error ->
            Error
    end.

stop(_State) ->
    ok.

 

Very simple: the app can be started directly with erl -name [email protected] -setcookie 123456 -boot start_sasl -s test start.

 

Source of test_sup.erl:

 

-module(test_sup).

-behaviour(supervisor).

%% API
-export([start_link/0]).

%% Supervisor callbacks
-export([init/1]).

-define(SERVER, ?MODULE).

start_link() ->
    supervisor:start_link({local, ?SERVER}, ?MODULE, []).

init([]) ->
    RestartStrategy = simple_one_for_one,
    MaxRestarts = 1000,
    MaxSecondsBetweenRestarts = 3600,

    SupFlags = {RestartStrategy, MaxRestarts, MaxSecondsBetweenRestarts},

    Restart = transient,
    Shutdown = 200000,
    Type = worker,

    AChild = {test_server, {test_server, start_link, []},
              Restart, Shutdown, Type, [test_server]},

    {ok, {SupFlags, [AChild]}}.

 

The source skeleton was generated by emacs. The only line we care about is RestartStrategy = simple_one_for_one, which will later be changed to one_for_one for comparison.

 

Source of test_server.erl:

 

-module(test_server).

-behaviour(gen_server).

-export([
         start/0,
         start_link/0
        ]).

-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

-define(SERVER, ?MODULE).

-record(state, {}).

start() ->
    {ok, _} = supervisor:start_child(test_sup, []).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

init([]) ->
    erlang:process_flag(trap_exit, true),
    {ok, #state{}}.

handle_call(_Request, _From, State) ->
    Reply = ok,
    {reply, Reply, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info({'EXIT', _, Reason}, State) ->
    io:format("exit:~p~n", [Reason]),
    {stop, normal, State};

handle_info(_Info, State) ->
    {noreply, State}.

terminate(Reason, _State) ->
    io:format("i'm terminate:~p~n", [Reason]),
    timer:sleep(10000),
    io:format("~s", ["end"]),
    ok.

code_change(_OldVsn, State, _Extra) ->
    {ok, State}.

 

A few notes on test_server:

2. handle_info({'EXIT', _, Reason}, State) -> makes it easy to trigger a process exit under simple_one_for_one

3. Cleanup work in terminate: io → timer:sleep → io

 

Startup command line: erl -name [email protected] -setcookie 123456 -boot start_sasl -s test start

 

 

1. Process exits on its own + simple_one_for_one

([email protected])1> erlang:exit(list_to_pid("<0.51.0>"), test).

exit:test

true

i'm terminate:normal

([email protected])2> end

 

terminate ran to completion.

 

2. init:stop + simple_one_for_one

 

 ([email protected])1> test_server:start().

{ok,<0.53.0>}

([email protected])2> init:stop().

ok

([email protected])3> i'm terminate:shutdown

[root@ming2_local_dev test]#

As you can see, terminate apparently did not get to finish: "end" was never printed.

 

3. Process exits on its own + one_for_one

([email protected])1> erlang:exit(list_to_pid("<0.51.0>"), test).

exit:test

true

i'm terminate:normal

([email protected])2> end

 

terminate ran to completion.

 

4. init:stop + one_for_one

([email protected])2> init:stop().

ok

([email protected])3> i'm terminate:shutdown

end[root@ming2_local_dev test]#

 

OK, our terminate ran to completion.

 

terminate test results

                            simple_one_for_one            one_for_one
Process exits on its own    runs to completion            runs to completion
init:stop                   does not run to completion    runs to completion

 

III. Low-level analysis

The result looks odd, so let's turn to the source code to explain it, starting with supervisor:

 

 

terminate(_Reason, State) ->
    terminate_children(State#state.children, State#state.name),
    ok.

 

terminate_children/2 is a tail-recursive function that terminates each child in turn:

terminate_children(Children, SupName) ->
    terminate_children(Children, SupName, []).

terminate_children([Child | Children], SupName, Res) ->
    NChild = do_terminate(Child, SupName),
    terminate_children(Children, SupName, [NChild | Res]);
terminate_children([], _SupName, Res) ->
    Res.

  

Next, look at do_terminate/2:

 

do_terminate(Child, SupName) when Child#child.pid =/= undefined ->
    case shutdown(Child#child.pid,
          Child#child.shutdown) of
    ok ->
        Child#child{pid = undefined};
    {error, OtherReason} ->
        report_error(shutdown_error, OtherReason, Child, SupName),
        Child#child{pid = undefined}
    end;
do_terminate(Child, _SupName) ->
    Child.

 

Continuing with shutdown/2:

shutdown(Pid, brutal_kill) ->
 
    case monitor_child(Pid) of
    ok ->
        exit(Pid, kill),
        receive
        {'DOWN', _MRef, process, Pid, killed} ->
            ok;
        {'DOWN', _MRef, process, Pid, OtherReason} ->
            {error, OtherReason}
        end;
    {error, Reason} ->     
        {error, Reason}
    end;

shutdown(Pid, Time) ->
   
    case monitor_child(Pid) of
    ok ->
        exit(Pid, shutdown), %% Try to shutdown gracefully
        receive
        {'DOWN', _MRef, process, Pid, shutdown} ->
            ok;
        {'DOWN', _MRef, process, Pid, OtherReason} ->
            {error, OtherReason}
        after Time ->
            exit(Pid, kill),  %% Force termination.
            receive
            {'DOWN', _MRef, process, Pid, OtherReason} ->
                {error, OtherReason}
            end
        end;
    {error, Reason} ->     
        {error, Reason}
    end.

 

 

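OK, terminating a child is handled case by case. First look at monitor_child/1. The code is commented in detail; in short, it deals with the case where the child exits on its own.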

 

monitor_child(Pid) ->
   
    %% Do the monitor operation first so that if the child dies
    %% before the monitoring is done causing a 'DOWN'-message with
    %% reason noproc, we will get the real reason in the 'EXIT'-message
    %% unless a naughty child has already done unlink…

    erlang:monitor(process, Pid),
    unlink(Pid),

    receive
    %% If the child dies before the unlink we must empty
    %% the mail-box of the 'EXIT'-message and the 'DOWN'-message.

    {'EXIT', Pid, Reason} ->
        receive
        {'DOWN', _, process, Pid, _} ->
            {error, Reason}
        end
    after 0 ->
        %% If a naughty child did unlink and the child dies before
        %% monitor the result will be that shutdown/2 receives a
        %% 'DOWN'-message with reason noproc.
        %% If the child should die after the unlink there
        %% will be a 'DOWN'-message with a correct reason
        %% that will be handled in shutdown/2.

        ok  
    end.

 

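Going back to shutdown/2, the main difference between the two clauses is the exit(Pid, Reason) line. If the child's shutdown strategy is brutal_kill, the child is killed directly. The kill signal cannot be trapped, so there is no chance for terminate to be called (terminate can be called only because the {'EXIT', _, _} message is trapped; see the gen_server implementation for details). If you want to clean up data on exit, you must not use brutal_kill here. Set a sufficiently large timeout (in milliseconds) instead, so the supervisor waits for the child to finish its cleanup: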

        exit(Pid, shutdown), %% Try to shutdown gracefully
        receive
        {'DOWN', _MRef, process, Pid, shutdown} ->
            ok;
        {'DOWN', _MRef, process, Pid, OtherReason} ->
            {error, OtherReason}
        after Time ->
            exit(Pid, kill),  %% Force termination.
            receive
            {'DOWN', _MRef, process, Pid, OtherReason} ->
                {error, OtherReason}
            end
        end

If the child has not ended within the specified time, it is forcibly killed.
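The difference between the two signals can be observed without a supervisor. The following is a small sketch, not part of the OTP source: the module name trap_demo and the ready/trapped messages are hypothetical. It sends shutdown and then kill to a process that traps exits and reports what happened in each case.

```erlang
-module(trap_demo).
-export([run/0]).

%% Returns {ResultForShutdown, ResultForKill}.
run() ->
    {signal(shutdown), signal(kill)}.

%% Spawns a monitored child that traps exits, sends it Reason as an
%% exit signal, and reports whether the child could react to it.
signal(Reason) ->
    Parent = self(),
    {Pid, MRef} = spawn_monitor(fun() -> child(Parent) end),
    %% Wait until the child has set trap_exit before signalling.
    receive {ready, Pid} -> ok end,
    exit(Pid, Reason),
    receive
        {trapped, Pid, Trapped} ->
            %% The child saw the signal as a message; drop its 'DOWN'.
            receive {'DOWN', MRef, process, Pid, _} -> ok end,
            {trapped, Trapped};
        {'DOWN', MRef, process, Pid, Down} ->
            %% The child died without getting a chance to react.
            {died, Down}
    end.

child(Parent) ->
    process_flag(trap_exit, true),
    Parent ! {ready, self()},
    receive
        {'EXIT', _From, Reason} ->
            Parent ! {trapped, self(), Reason}
    end.
```

trap_demo:run() should return {{trapped,shutdown},{died,killed}}: the shutdown signal arrives as a message the process can act on, while kill ends it immediately with reason killed. This is why a brutal_kill child, or one that overruns its shutdown timeout, never completes terminate.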

 

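In this part of the source we see no influence of the restart strategy (one_for_one …) on terminate, which does not match the test results above. Reading through the rest of the supervisor code, we find that child processes are started differently under simple_one_for_one and one_for_one: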

 

handle_call({start_child, EArgs}, _From, State) when ?is_simple(State) ->
    #child{mfa = {M, F, A}} = hd(State#state.children),
    Args = A ++ EArgs,
    case do_start_child_i(M, F, Args) of
    {ok, Pid} ->
        NState = State#state{dynamics =
                 ?DICT:store(Pid, Args, State#state.dynamics)},
        {reply, {ok, Pid}, NState};
    {ok, Pid, Extra} ->
        NState = State#state{dynamics =
                 ?DICT:store(Pid, Args, State#state.dynamics)},
        {reply, {ok, Pid, Extra}, NState};
    What ->
        {reply, What, State}
    end;

%%% The requests terminate_child, delete_child and restart_child are
%%% invalid for simple_one_for_one supervisors.

handle_call({_Req, _Data}, _From, State) when ?is_simple(State) ->
    {reply, {error, simple_one_for_one}, State};

handle_call({start_child, ChildSpec}, _From, State) ->
    case check_childspec(ChildSpec) of
    {ok, Child} ->
        {Resp, NState} = handle_start_child(Child, State),
        {reply, Resp, NState};
    What ->
        {reply, {error, What}, State}
    end;

 

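Child processes started under simple_one_for_one are not stored in the supervisor's state.children at all (they are kept in state.dynamics). In other words, in this version of supervisor, terminate simply ignores children started under simple_one_for_one. So when the supervisor ends, every simple_one_for_one child receives an {'EXIT', Pid, Reason} message through its link. A gen_server child that traps exits reacts to this message from its parent by calling terminate, which is why test 2 printed i'm terminate:shutdown. But while terminate is running, the app may already have ended, and the system that is shutting down kills the process directly (in fact, all remaining processes), so it has no time to run all of its code (see the earlier analysis 《init:stop浅析》, "A Brief Look at init:stop").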

 

IV. Conclusion, problem, and solution

 

1.  Conclusion

 

 

                            simple_one_for_one            one_for_one
Process exits on its own    runs to completion            runs to completion
init:stop                   does not run to completion    runs to completion

 

 

2.  Problem

With simple_one_for_one, some cleanup work, such as persisting data, may not complete properly when the system shuts down.

 

3.  Solution

Use one_for_one. The way children are started under one_for_one needs a few simple adjustments:

 

The supervisor's init returns an empty child list:

 

  init([]) ->
    RestartStrategy = one_for_one,
    MaxRestarts = 1000,
    MaxSecondsBetweenRestarts = 3600,
    SupFlags = {RestartStrategy, MaxRestarts, MaxSecondsBetweenRestarts},
    {ok, {SupFlags, []}}.

 

When calling start_child, the child spec needs a unique spec id, which is built by concatenation:

supervisor:start_child(mod_stall_sup,
                       {lists:concat(["mod_stall_server_", MAPID]),
                        {mod_stall_server, start_link, [MAPID]},
                        transient, 30000, worker, [mod_stall_server]})  
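The call above can be wrapped in a small helper so that the id is built in one place. This is only a sketch under assumptions: the module name stall_child_helper and its three functions are hypothetical, and it assumes that a supervisor registered as mod_stall_sup and a mod_stall_server module exist as in the snippet above. It also handles one consequence of one_for_one: the spec of a transient child stays in the supervisor after the child exits normally, so starting the same id again returns {error, already_present}.

```erlang
-module(stall_child_helper).
-export([child_id/1, child_spec/1, start_child/1]).

%% Builds the unique child id, e.g. "mod_stall_server_7" for map id 7.
child_id(MapId) ->
    lists:concat(["mod_stall_server_", MapId]).

%% Same child spec tuple as in the article, with a 30 s shutdown timeout
%% so that terminate/2 has time to finish.
child_spec(MapId) ->
    {child_id(MapId),
     {mod_stall_server, start_link, [MapId]},
     transient, 30000, worker, [mod_stall_server]}.

%% Starts the child under mod_stall_sup (assumed to be running).
%% If a stopped child with the same id is still known to the
%% supervisor, restart it from the stored spec instead.
start_child(MapId) ->
    case supervisor:start_child(mod_stall_sup, child_spec(MapId)) of
        {error, already_present} ->
            supervisor:restart_child(mod_stall_sup, child_id(MapId));
        Other ->
            Other
    end.
```

Whether to restart the stored spec or remove it with supervisor:delete_child/2 once a child has ended is a design choice that depends on how often the same id comes back.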

 
