Erlang is rather unique in how it approaches failure compared to most other programming languages. The common way of thinking is that the language, programming environment, and methodology should do everything possible to prevent errors. Something going wrong at run time is something to be prevented, and if it cannot be prevented, then it’s out of scope for whatever solution people have been thinking about. The program is written once, and after that, it’s off to production, whatever may happen there. If there are errors, new versions will need to be shipped.
Erlang, on the other hand, takes the approach that failures will happen no matter what, whether they’re developer-, operator-, or hardware-related. It is rarely practical or even possible to get rid of all errors in a program or a system [1]. If you can deal with some errors rather than preventing them at all cost, then most undefined behaviours of a program can be folded into that "deal with it" approach.
This is where the "Let it Crash" [2] idea comes from: because you can now deal with failure, and because the cost of weeding out all of the complex bugs from a system before it hits production is often prohibitive, programmers should only deal with the errors they know how to handle, and leave the rest for another process (a supervisor) or the virtual machine to deal with.
Given that most bugs are transient [3], simply restarting a process back to a state known to be stable when it encounters an error can be a surprisingly good strategy.
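As a rough illustration of the restart strategy described above, here is a minimal sketch of a hand-rolled supervisor. It is not OTP's `supervisor` behaviour, and the module name `crash_demo`, the registered name `calc`, and the message shapes are all hypothetical choices made for this example. The worker only handles the request it understands and makes no attempt to guard against a division by zero; the parent process traps the exit and starts a fresh worker in a known-good state.

```erlang
-module(crash_demo).
-export([start/0, divide/2]).

%% Hypothetical sketch: a tiny supervisor that keeps one worker alive
%% under the registered name `calc`. Real systems would use OTP's
%% supervisor behaviour, which also limits how often restarts may happen.

start() ->
    Parent = self(),
    Sup = spawn(fun() ->
        %% Turn crashes of linked processes into {'EXIT', ...} messages.
        process_flag(trap_exit, true),
        Worker = start_worker(),
        Parent ! {self(), ready},
        loop(Worker)
    end),
    %% Wait until the first worker is registered before returning.
    receive {Sup, ready} -> {ok, Sup} end.

start_worker() ->
    Pid = spawn_link(fun worker/0),
    true = register(calc, Pid),
    Pid.

%% Supervisor loop: when the worker dies, start a fresh one.
loop(Worker) ->
    receive
        {'EXIT', Worker, Reason} ->
            io:format("worker died (~p), restarting~n", [Reason]),
            loop(start_worker())
    end.

%% The worker has no defensive code: dividing by zero simply crashes it.
worker() ->
    receive
        {divide, From, Ref, A, B} ->
            From ! {Ref, A div B},
            worker()
    end.

%% Client call. If the worker crashes while serving us, we time out.
divide(A, B) ->
    Ref = make_ref(),
    calc ! {divide, self(), Ref, A, B},
    receive
        {Ref, Result} -> {ok, Result}
    after 1000 ->
        {error, timeout}
    end.
```

Calling `crash_demo:divide(1, 0)` kills the worker and the caller sees a timeout, but a later `crash_demo:divide(10, 2)` is served by the restarted worker. One simplification to keep in mind: a call made in the short window between the crash and the re-registration of `calc` would itself fail, which is one of the details OTP's own tooling takes care of.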
Erlang is a programming environment whose approach is equivalent to the human body’s immune system, whereas most other languages only care about hygiene to make sure no germ enters the body. Both forms appear extremely important to me. Almost every environment offers varying degrees of hygiene. Nearly no other environment offers the immune system, where errors at run time can be dealt with and seen as survivable.
Because the system doesn’t collapse the first time something bad touches it, Erlang/OTP also allows you to be a doctor. You can go into the system, pry it open right there in production, carefully observe everything inside as it runs, and even try to fix it interactively. To continue with the analogy, Erlang allows you to perform extensive tests to diagnose the problem and various degrees of surgery (even very invasive surgery), without the patients needing to sit down or interrupt their daily activities.
This book intends to be a little guide about how to be the Erlang medic in a time of war. It is first and foremost a collection of tips and tricks to help understand where failures come from, and a dictionary of different code snippets and practices that helped developers debug production systems that were built in Erlang.
Who Is This Book For?
This book is not for beginners. There is a gap left between most tutorials, books, and training sessions on one side, and actually being able to operate, diagnose, and debug running systems once they’ve made it to production on the other. There’s a fumbling phase implicit in a programmer’s learning of a new language and environment, where they just have to figure out how to get out of the guidelines and step into the real world, with the community that goes with it.
This book assumes that the reader is proficient in basic Erlang and the OTP framework. Erlang/OTP features are explained as I see fit, usually when I consider them tricky, and it is expected that a reader who feels confused by usual Erlang/OTP material will have an idea of where to look for explanations if necessary [4]. What is not necessarily assumed is that the reader knows how to debug Erlang software, dive into an existing code base, diagnose issues, or has an idea of the best practices for deploying Erlang in a production environment [5].
How To Read This Book
This book is divided in two parts.
Part I focuses on how to write applications:
Chapter 1: how to dive into a code base;
Chapter 2: general tips on writing open source Erlang software;
Chapter 3: how to plan for overload in your system design.
Part II focuses on being an Erlang medic and on the vital signs of existing, living systems:
Chapter 4: how to connect to a running node;
Chapter 5: the basic runtime metrics available;
Chapter 6: how to perform a system autopsy using a crash dump;
Chapter 7: how to identify and fix memory leaks;
Chapter 8: how to find the cause of runaway CPU usage;
Chapter 9: how to trace Erlang function calls in production using recon [6], to pin down issues before they bring the system down.
Each chapter comes with exercises; try working through them yourself.