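Preface
We recently upgraded to OTP 24. After the upgrade, a component that provides location-independent calls to entities started timing out whenever the called entity process was in the middle of exiting. Tracking it down showed that the cause is related to an improvement made in Erlang/OTP 24.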
https://www.erlang.org/blog/m...
Cause
Implementation overview
A call to an entity does not go straight to a pid. It is forwarded through the node's router:
sequenceDiagram
client->>+router: Request
router->>+entity: Request
entity-->>-router: Reply
router-->>-client: Reply
We later made one improvement: the called entity replies directly to the client.
sequenceDiagram
client->>+router: Request
router->>+entity: Request
entity-->>-client: Reply
When the called entity is in the middle of exiting, the long-lived router has to tell the client that the entity is exiting. Otherwise the client times out.
sequenceDiagram
client->>+router: Request
entity->>+entity: terminate start
Note over entity: the message never gets a chance to be handled
router->>+entity: Request
entity->>+entity: terminate end
client->>+client: timeout
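So the router monitors the entity, and when the entity exits abnormally the router replies to the client on its behalf. This is exactly where the problem shows up.
Root cause
First, look at the call code in Erlang/OTP 23: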
gen.erl:202
do_call(Process, Label, Request, Timeout) when is_atom(Process) =:= false ->
    Mref = erlang:monitor(process, Process),
    %% OTP-21:
    %% Auto-connect is asynchronous. But we still use 'noconnect' to make sure
    %% we send on the monitored connection, and not trigger a new auto-connect.
    %%
    erlang:send(Process, {Label, {self(), Mref}, Request}, [noconnect]),
    receive
        {Mref, Reply} ->
            erlang:demonitor(Mref, [flush]),
            {ok, Reply};
        {'DOWN', Mref, _, _, noconnection} ->
            Node = get_node(Process),
            exit({nodedown, Node});
        {'DOWN', Mref, _, _, Reason} ->
            exit(Reason)
    after Timeout ->
        erlang:demonitor(Mref, [flush]),
        exit(timeout)
    end.
On OTP 23, our code runs as follows.
sequenceDiagram
client->>+router: {Label, {client_pid, Mref}, Request}
entity->>+entity: terminate start
router->>+entity: Process.monitor ret Mref2
router->>+entity: {{client_pid, Mref}, {router_pid, Mref2}, Request}
entity->>+entity: terminate end
entity->>+router: {:DOWN, Mref2, down_type, entity_pid, reason}
router->>+client: {:DOWN, Mref, down_type, router_pid, reason}
Even if the entity happens to be exiting when the message arrives, the client is still notified of that event.
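The OTP 23 flow in the diagram above can be sketched roughly as follows. This is a minimal illustration, not the real component: the module name `router_sketch`, the `{route, Entity, Call}` message, and `demo/0` are hypothetical, and the sketch assumes the OTP 23 tag shape, where the second element of the `From` tuple is the bare `Mref`. Cleaning up the monitor after the entity has replied directly to the client is left out.

```erlang
-module(router_sketch).
-export([start/0, demo/0]).

%% Hypothetical router: forwards a gen-style call to an entity and
%% monitors the entity so the client hears about an exit.
start() ->
    spawn(fun() -> loop(#{}) end).

loop(Pending) ->
    receive
        {route, Entity, {_Label, {ClientPid, Mref}, Request}} ->
            Mref2 = erlang:monitor(process, Entity),
            %% Same message shape as in the diagram above.
            Entity ! {{ClientPid, Mref}, {self(), Mref2}, Request},
            loop(Pending#{Mref2 => {ClientPid, Mref}});
        {'DOWN', Mref2, process, _Entity, Reason} ->
            case maps:take(Mref2, Pending) of
                {{ClientPid, Mref}, Rest} ->
                    %% Forge the 'DOWN' that the client's gen:do_call
                    %% receive clause is waiting for (OTP 23: bare Mref).
                    ClientPid ! {'DOWN', Mref, process, self(), Reason},
                    loop(Rest);
                error ->
                    loop(Pending)
            end
    end.

%% An already-dead entity stands in for one that exits before it can
%% handle the request.
demo() ->
    Router = start(),
    {Entity, Mon} = spawn_monitor(fun() -> ok end),
    receive {'DOWN', Mon, process, Entity, _} -> ok end,
    Mref = make_ref(),
    Router ! {route, Entity, {'$gen_call', {self(), Mref}, ping}},
    receive
        {'DOWN', Mref, process, Router, Reason} -> {entity_down, Reason}
    after 1000 ->
        timeout
    end.
```

Calling `router_sketch:demo()` returns `{entity_down, noproc}`, because monitoring a process that no longer exists delivers a `'DOWN'` message with reason `noproc` right away.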
In OTP 24, however, a cross-node call looks like this:
gen.erl:223
do_call(Process, Label, Request, Timeout) when is_atom(Process) =:= false ->
    Mref = erlang:monitor(process, Process, [{alias,demonitor}]),
    Tag = [alias | Mref],
    %% OTP-24:
    %% Using alias to prevent responses after 'noconnection' and timeouts.
    %% We however still may call nodes responding via process identifier, so
    %% we still use 'noconnect' on send in order to try to send on the
    %% monitored connection, and not trigger a new auto-connect.
    %%
    erlang:send(Process, {Label, {self(), Tag}, Request}, [noconnect]),
    receive
        {[alias | Mref], Reply} ->
            erlang:demonitor(Mref, [flush]),
            {ok, Reply};
        {'DOWN', Mref, _, _, noconnection} ->
            Node = get_node(Process),
            exit({nodedown, Node});
        {'DOWN', Mref, _, _, Reason} ->
            exit(Reason)
    after Timeout ->
        erlang:demonitor(Mref, [flush]),
        receive
            {[alias | Mref], Reply} ->
                {ok, Reply}
        after 0 ->
            exit(timeout)
        end
    end.
So on OTP 24 the flow becomes this:
sequenceDiagram
client->>+router: {Label, {client_pid, [alias | Mref]}, Request}
entity->>+entity: terminate start
router->>+entity: Process.monitor ret Mref2
router->>+entity: {{client_pid, [alias | Mref]}, {router_pid, Mref2}, Request}
entity->>+entity: terminate end
entity->>+router: {:DOWN, Mref2, down_type, entity_pid, reason}
Note over client: the client waits for Mref, not [alias | Mref]
router->>+client: {:DOWN, [alias | Mref], down_type, router_pid, reason}
client->>+client: timeout
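With that, the fix is obvious: the 'DOWN' message the router sends to the client has to carry the bare Mref rather than the [alias | Mref] tag.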
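One way the router-side fix could look is a small helper that normalizes the tag before the `'DOWN'` message is built. This is only a sketch: the module `down_tag` and the functions `down_ref/1` and `down_msg/3` are hypothetical names, and the only tag shapes handled are the two shown in the gen.erl excerpts above (bare `Mref` on OTP 23, `[alias | Mref]` on OTP 24).

```erlang
-module(down_tag).
-export([down_ref/1, down_msg/3]).

%% Extract the monitor reference the client's receive clause matches on.
%% OTP 24 wraps it as [alias | Mref]; OTP 23 passes the bare Mref.
down_ref([alias | Mref]) when is_reference(Mref) -> Mref;
down_ref(Mref) when is_reference(Mref) -> Mref.

%% Build the 'DOWN' message the router forwards to the client.
down_msg(Tag, RouterPid, Reason) ->
    {'DOWN', down_ref(Tag), process, RouterPid, Reason}.
```

Matching on both shapes keeps the router usable by clients on either release, which matters while a cluster is being upgraded node by node.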
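Summary
Erlang's process aliases are a great change. They solve the problem of a reply arriving after a remote call has already timed out: a reply that arrives after the timeout is now dropped. That makes it safe to catch the timeout.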
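The dropping behaviour can be observed with a small experiment, sketched here under the assumption of OTP 24 or later. The module `alias_demo` and its messages are made up for illustration. The caller creates a monitor with the `{alias, demonitor}` option, gives up early by demonitoring (which also deactivates the alias), and only then lets the server reply through the alias.

```erlang
-module(alias_demo).
-export([late_reply_dropped/0]).

%% Requires OTP 24 or later (process aliases).
late_reply_dropped() ->
    Parent = self(),
    Server = spawn(fun() ->
        receive
            {request, ReplyTo} ->
                %% Wait until the caller has given up.
                receive go -> ok end,
                %% ReplyTo is an alias; it is inactive by now.
                ReplyTo ! {reply, late},
                %% A plain pid send still gets through.
                Parent ! done
        end
    end),
    Mref = erlang:monitor(process, Server, [{alias, demonitor}]),
    Server ! {request, Mref},
    %% Simulate the timeout branch of gen:do_call:
    %% demonitoring also deactivates the alias.
    erlang:demonitor(Mref, [flush]),
    Server ! go,
    receive done -> ok end,
    receive
        {reply, late} -> received
    after 0 ->
        dropped
    end.
```

`alias_demo:late_reply_dropped()` returns `dropped`: the message sent through the deactivated alias never reaches the caller's mailbox, while the `done` message sent to the pid does.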
Recommended reading, another blog post I wrote earlier:
On Erlang's timeout (original title: 谈谈erlang的timeout).