Original article; please credit the source when reposting: 服务器非业余研究 http://blog.csdn.net/erlib, author: Sunface
There are a few common causes of queues blowing up and overload in Erlang systems that most people will encounter sooner or later, no matter how they approach their system.
They’re usually symptomatic of having your system grow up and require some help scaling up, or of an unexpected type of failure that ends up cascading much harder than it should have.
Ironically, the process in charge of error logging is one of the most fragile ones. In a default Erlang install, the error_logger [2] process will take its sweet time to log things to disk or over the network, and will do so much more slowly than errors can be generated.
This is especially true of user-generated log messages (not for errors), and for crashes in large processes. For the former, this is because error_logger doesn’t really expect arbitrary levels of messages coming in continually. It’s for exceptional cases only and doesn’t expect lots of traffic. For the latter, it’s because the entire state of processes (including their mailboxes) gets copied over to be logged.
It only takes a few messages to cause memory to bubble up a lot, and if that’s not enough to cause the node to run Out Of Memory (OOM), it may slow the logger enough that additional messages will.
The best solution for this at the time of writing is to use lager as a substitute logging library.
While lager will not solve all your problems, it will truncate voluminous log messages, optionally drop OTP-generated error messages when they go over a certain threshold, and will automatically switch between asynchronous and synchronous modes for user-submitted messages in order to self-regulate.
It won’t be able to deal with very specific cases, such as when user-submitted messages are very large in volume and all coming from one-off processes. This is, however, a much rarer occurrence than everything else, and one where the programmer tends to have more control.
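As a rough illustration, the three behaviours described above map onto a handful of lager settings. The fragment below is a sketch of a sys.config entry; the option names are the ones I recall from lager's README, and the numbers are placeholders rather than recommendations, so check both against the lager version you deploy.

```erlang
%% sys.config (fragment). Values are illustrative only.
[
 {lager, [
   %% Drop OTP-generated (error_logger) messages beyond this many per second.
   {error_logger_hwm, 50},
   %% Log user-submitted messages asynchronously until lager's mailbox
   %% holds more than this many messages, then switch to synchronous mode.
   {async_threshold, 20},
   %% Switch back to asynchronous mode once the mailbox has shrunk to
   %% (async_threshold - async_threshold_window) messages.
   {async_threshold_window, 5}
 ]}
].
```

Truncation of large messages is, as far as I know, controlled at compile time through the `{lager_truncation_size, N}` entry in `erl_opts` rather than in sys.config.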
[2] Defined at http://www.erlang.org/doc/man/error_logger.html
Locking and blocking operations will often be problematic when they’re taking unexpectedly long to execute in a process that’s constantly receiving new tasks.
One of the most common examples I’ve seen is a process blocking while accepting a connection or waiting for messages with TCP sockets.
During blocking operations of this kind, messages are free to pile up in the message queue. (Translator's note, Sunface: a separate process dedicated to accepting connections is therefore a good choice.)
One particularly bad example was in a pool manager for HTTP connections that I had written in a fork of the lhttpc library. It worked fine in most test cases we had, and we even had a connection timeout set to 10 milliseconds to be sure it never took too long [3].
After a few weeks of perfect uptime, the HTTP client pool caused an outage when one of the remote servers went down.
The reason behind this degradation was that when the remote server would go down, all of a sudden, all connecting operations would take at least 10 milliseconds, the time before which the connection attempt is given up on. With around 9,000 messages per second to the central process, each usually taking under 5 milliseconds, the impact became similar to roughly 18,000 messages a second and things got out of hand.
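The 18,000 figure can be reconstructed with a back-of-the-envelope calculation. This is only a sketch: it assumes the load on the central process scales with the time spent per message, and it takes 5 ms as the usual per-message cost (the text only says "under 5 milliseconds", so the real ratio was at least this large).

```latex
\[
r_{\text{equivalent}} \approx r \cdot \frac{t_{\text{timeout}}}{t_{\text{usual}}}
  = 9000\ \text{msg/s} \cdot \frac{10\ \text{ms}}{5\ \text{ms}}
  = 18000\ \text{msg/s}
\]
```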
The solution we came up with was to leave the task of connecting to the caller process, and enforce the limits as if the manager had done it on its own. The blocking operations were now distributed to all users of the library, and even less work was required to be done by the manager, now free to accept more requests.
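A minimal sketch of that division of labour follows. It is not the lhttpc fork's actual code: the module name `conn_limiter`, its API, and the slot-counting scheme are all assumptions made for illustration. The manager only hands out and reclaims slots, which is fast, while the slow `gen_tcp:connect/4` call runs in the caller.

```erlang
-module(conn_limiter).
-behaviour(gen_server).

%% Hypothetical API, for illustration only.
-export([start_link/1, connect/4, release/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link(MaxConns) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, MaxConns, []).

%% Runs in the caller: reserve a slot (cheap), then perform the blocking
%% connect outside of the manager process.
connect(Host, Port, Opts, Timeout) ->
    case gen_server:call(?MODULE, checkout) of
        ok ->
            case gen_tcp:connect(Host, Port, Opts, Timeout) of
                {ok, Socket} ->
                    {ok, Socket};
                {error, Reason} ->
                    release(),
                    {error, Reason}
            end;
        {error, busy} ->
            {error, busy}
    end.

%% Give back one slot held by the calling process.
release() ->
    gen_server:cast(?MODULE, {checkin, self()}).

init(MaxConns) ->
    %% holders maps a monitor reference to the pid holding that slot.
    {ok, #{max => MaxConns, holders => #{}}}.

handle_call(checkout, {Pid, _Tag}, State = #{max := Max, holders := H}) ->
    case maps:size(H) < Max of
        true ->
            %% Monitor the holder so its slot is reclaimed if it dies.
            Ref = erlang:monitor(process, Pid),
            {reply, ok, State#{holders := H#{Ref => Pid}}};
        false ->
            {reply, {error, busy}, State}
    end.

handle_cast({checkin, Pid}, State = #{holders := H}) ->
    case [R || {R, P} <- maps:to_list(H), P =:= Pid] of
        [Ref | _] ->
            erlang:demonitor(Ref, [flush]),
            {noreply, State#{holders := maps:remove(Ref, H)}};
        [] ->
            {noreply, State}
    end.

handle_info({'DOWN', Ref, process, _Pid, _Reason}, State = #{holders := H}) ->
    {noreply, State#{holders := maps:remove(Ref, H)}};
handle_info(_Other, State) ->
    {noreply, State}.
```

With this shape, a dead remote server makes each caller wait for its own timeout, but the manager's mailbox keeps draining at its usual pace.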
(Translator's note, Sunface: in other words, each process does its own dedicated job.)
When there is any point of your program that ends up being a central hub for receiving messages, lengthy tasks should be moved out of there if possible. Handling predictable overload [4] situations by adding more processes, which either handle the blocking operations or instead act as a buffer while the "main" process blocks, is often a good idea.
There will be increased complexity in managing more processes for activities that aren’t intrinsically concurrent, so make sure you need them before programming defensively.
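One way to move lengthy tasks out of a central hub can be sketched as follows. The module name `hub` and its `request/1` function are hypothetical; the point is only that `handle_call/3` returns `{noreply, State}` right away and lets a short-lived worker answer the caller through `gen_server:reply/2`.

```erlang
-module(hub).
-behaviour(gen_server).

%% Hypothetical API, for illustration only.
-export([start_link/0, request/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Fun stands in for any potentially blocking task.
request(Fun) when is_function(Fun, 0) ->
    gen_server:call(?MODULE, {run, Fun}).

init([]) ->
    {ok, nostate}.

handle_call({run, Fun}, From, State) ->
    %% Hand the task to a worker and return immediately, so the hub
    %% can pick up the next message while the worker blocks.
    %% The worker is not linked: if it crashes, the caller times out.
    spawn(fun() -> gen_server:reply(From, Fun()) end),
    {noreply, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info(_Other, State) ->
    {noreply, State}.
```

A slow request no longer delays a fast one issued after it. The cost is the one the surrounding text warns about: one more process per request to supervise and reason about.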
Another option is to transform the blocking task into an asynchronous one. If the type of work allows it, start the long-running job and keep a token that identifies it uniquely, along with the original requester you’re doing work for. When the resource is available, have it send a message back to the server with the aforementioned token. The server will eventually get the message, match the token to the requester, and answer back, without being blocked by other requests in the meantime. [5]
This option tends to be more obscure than using many processes and can quickly devolve into callback hell, but may use fewer resources.
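The token-based approach can be sketched like this. The module `async_server`, its `fetch/1` function and the `start_job/3` helper are assumptions made for the example; `start_job/3` merely simulates a resource that answers later with a `{done, Token, Result}` message.

```erlang
-module(async_server).
-behaviour(gen_server).

%% Hypothetical API, for illustration only.
-export([start_link/0, fetch/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

fetch(Key) ->
    gen_server:call(?MODULE, {fetch, Key}).

init([]) ->
    %% State maps each token to the requester waiting on it.
    {ok, #{}}.

handle_call({fetch, Key}, From, Pending) ->
    Token = make_ref(),
    ok = start_job(self(), Token, Key),
    %% Do not reply yet: remember who asked and keep serving others.
    {noreply, Pending#{Token => From}}.

handle_cast(_Msg, Pending) ->
    {noreply, Pending}.

handle_info({done, Token, Result}, Pending) ->
    case maps:take(Token, Pending) of
        {From, Rest} ->
            %% Match the token to its requester and answer back.
            gen_server:reply(From, Result),
            {noreply, Rest};
        error ->
            %% Unknown or already answered token: ignore it.
            {noreply, Pending}
    end;
handle_info(_Other, Pending) ->
    {noreply, Pending}.

%% Stand-in for a resource with an asynchronous interface.
start_job(Server, Token, Key) ->
    spawn(fun() ->
                  timer:sleep(50),
                  Server ! {done, Token, {ok, Key}}
          end),
    ok.
```

A real version would also need to expire tokens whose answer never arrives, which is part of why this style gets harder to follow than the one-process-per-task alternative.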
[3] 10 milliseconds is very short, but was fine for collocated servers used for real-time bidding.
[4] Something you know for a fact gets overloaded in production.
[5] The redo application is an example of a library doing this, in its redo_block module. The [underdocumented] module turns a pipelined connection into a blocking one, but does so while maintaining pipeline aspects to the caller. This allows the caller to know that only one call failed when a timeout occurs, not all of the in-transit ones, without having the server stop accepting requests.
Messages you didn’t know about tend to be rather rare when using OTP applications.
Because OTP behaviours pretty much expect you to handle anything with some clause in handle_info/2, unexpected messages will not accumulate much.
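For reference, the kind of clause meant here looks roughly like the sketch below. The module name `quiet_server` and the `dropped` call are made up for the example; the catch-all `handle_info/2` clause simply counts and discards anything it does not recognise, so nothing lingers in the mailbox.

```erlang
-module(quiet_server).
-behaviour(gen_server).

%% Hypothetical module, for illustration only.
-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

init([]) ->
    %% State is the number of unexpected messages dropped so far.
    {ok, 0}.

handle_call(dropped, _From, Count) ->
    {reply, Count, Count};
handle_call(_Other, _From, Count) ->
    {reply, {error, unknown_call}, Count}.

handle_cast(_Msg, Count) ->
    {noreply, Count}.

%% An expected message, handled normally.
handle_info(tick, Count) ->
    {noreply, Count};
%% Catch-all: anything else is removed from the mailbox and counted.
handle_info(_Unexpected, Count) ->
    {noreply, Count + 1}.
```

In a real server the catch-all clause would usually log the message at a low level instead of counting it.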
However, all kinds of OTP-compliant systems end up having processes that may not implement a behaviour, or processes that go through a non-behaviour stretch where they take over message handling themselves.
If you’re lucky enough, monitoring tools [6] will show a constant memory increase, and inspecting for large queue sizes [7] will let you find which process is at fault.
You can then fix the problem by handling the messages as required.
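If no dedicated tooling is at hand, the inspection step can be approximated with built-in functions only. The helper below is a sketch, and the module name `queue_check` is made up; it lists the processes with the longest mailboxes using `erlang:processes/0` and `erlang:process_info/2`.

```erlang
-module(queue_check).
-export([top/1]).

%% Return up to N {QueueLen, Pid} pairs, longest mailbox first.
top(N) ->
    Lens = [{Len, Pid}
            || Pid <- erlang:processes(),
               %% process_info/2 returns undefined for a process that has
               %% died in the meantime; the pattern silently skips those.
               {message_queue_len, Len} <-
                   [erlang:process_info(Pid, message_queue_len)]],
    lists:sublist(lists:reverse(lists:sort(Lens)), N).
```

Calling `queue_check:top(5).` in a shell attached to the node gives a first list of suspects. Walking every process is not free on a node with very many processes, so use it sparingly in production.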
[6] See Section 5.1
[7] See Subsection 5.2.1