Reposted from: http://www.erlangatwork.com/2008/07/hunting-bugs.html
Our Erlang gateways were developed and deployed in phases starting with AIM/ICQ, GTalk, Yahoo, and finally MSN. Aside from minor protocol implementation bugs there were no problems, and we were very satisfied with stability and performance. However, not long after releasing the MSN gateway we noticed that it used a ton of memory, and also periodically suffered massive spikes in memory use which invoked the Linux kernel's OOM killer. This led to a crash course in debugging running Erlang apps, and a great appreciation for the years of real-world lessons that have influenced the features and design of Erlang/OTP.
The first thing I looked into was why the gateway would eventually use several gigabytes of memory with only a few hundred users online. Since each Erlang process has its own heap, I started by looking for the processes that were using the most memory. erlang:processes/0 returns a list of all running processes, and erlang:process_info/1 provides a ton of information about a process including heap use, stack size, etc. So I wrote a quick script to dump the process info of all processes to a file, sorted by total memory use. This was run on the live gateway instance.
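The original script isn't shown here, so what follows is only a minimal sketch of how such a dump might look. The module name proc_mem and both function names are made up for illustration; the only real calls are erlang:processes/0, erlang:process_info/1,2 and file:write_file/2.

```erlang
-module(proc_mem).
-export([top/1, dump/1]).

%% Return up to N {Bytes, Pid} pairs, largest memory use first.
%% process_info/2 returns undefined for a process that has already
%% exited; the pattern in the generator silently skips those.
top(N) ->
    Sizes = [{Mem, Pid} || Pid <- erlang:processes(),
                           {memory, Mem} <- [erlang:process_info(Pid, memory)]],
    lists:sublist(lists:reverse(lists:sort(Sizes)), N).

%% Write the full process_info/1 of every process to Path,
%% sorted by total memory use (hypothetical output layout).
dump(Path) ->
    All = top(length(erlang:processes())),
    Lines = [io_lib:format("~p bytes ~p~n~p~n~n",
                           [Mem, Pid, erlang:process_info(Pid)])
             || {Mem, Pid} <- All],
    file:write_file(Path, Lines).
```

Something like proc_mem:top(10) from a remote shell attached to the live node would list the ten largest processes without writing anything to disk.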
It turned out that only a few active MSN sessions were using the majority of the heap, and these sessions were for users with very large contact lists. After initial login, one session could be using > 1GB of heap.
Newer versions of the MSNP protocol use SOAP requests to get authorization tokens, contact lists, allow/block lists, etc. My initial implementation was very simple, using inets to submit the HTTP request, reading the full response body as a list, and then parsing that list with xmerl. These responses could be very large, and since the gateway was running on a 64-bit Erlang VM, each character would occupy 16 bytes of memory (one two-word list cell). xmerl's representation of an XML document also requires quite a bit of storage. A simple XML document such as:
<a><b>foo</b><c/></a>
is represented as:
{xmlElement,a,a,[],
{xmlNamespace,[],[]},
[],1,[],
[{xmlElement,b,b,[],
{xmlNamespace,[],[]},
[{a,1}],
1,[],
[{xmlText,[{b,1},{a,1}],1,[],"foo",text}],
[],"/tmp/",undeclared},
{xmlElement,c,c,[],
{xmlNamespace,[],[]},
[{a,1}],
2,[],[],[],undefined,undeclared}],
[],"/tmp/",undeclared}
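To make the 16-bytes-per-character figure concrete, here is a small sketch that compares the heap cost of a string held as a list with the payload size of the same text as a binary. The module name str_cost is made up, and erts_debug:flat_size/1 is an internal debugging helper rather than a documented API, so treat the numbers as a rough guide.

```erlang
-module(str_cost).
-export([list_bytes/1, binary_bytes/1]).

%% Heap bytes used by a string stored as a list of character codes.
%% flat_size/1 reports words; each list cell is two words, so on a
%% 64-bit VM (8-byte words) that is 16 bytes per character.
list_bytes(Str) when is_list(Str) ->
    erts_debug:flat_size(Str) * erlang:system_info(wordsize).

%% Payload bytes of the same Latin-1 text stored as a binary:
%% one byte per character, plus a small fixed header not counted here.
binary_bytes(Str) when is_list(Str) ->
    byte_size(list_to_binary(Str)).
```

On a 64-bit VM, str_cost:list_bytes("foo") comes to 48 bytes against 3 payload bytes for the binary, and that is before xmerl wraps the text in its own records.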
So I rewrote my SOAP module to use the streaming method http:request/4, which returns the HTTP response as a series of binary chunks. xmerl doesn't support parsing binaries, so I switched to erlsom, which does, and also converted the XML to a very simple and compact format:
{a,[],
[{b,[],[<<"foo">>]},
{c,[],[]}]}
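The rewritten SOAP module isn't included in the post, so the following is only a sketch of the two steps under some assumptions. It uses httpc, the later name of the inets client that was called http at the time, and it assumes erlsom:simple_form/1 yields {Tag, Attributes, Children} tuples with string tags. The module name soap_sketch and all function names are hypothetical.

```erlang
-module(soap_sketch).
-export([start_stream/1, collect/2, compact/1]).

%% Start an asynchronous request whose body is delivered to the
%% calling process as binary chunks. Requires inets to be started.
start_stream(Url) ->
    httpc:request(get, {Url, []}, [], [{sync, false}, {stream, self}]).

%% Gather the streamed chunks for one request id, in arrival order.
collect(ReqId, Acc) ->
    receive
        {http, {ReqId, stream_start, _Headers}} -> collect(ReqId, Acc);
        {http, {ReqId, stream, Chunk}}          -> collect(ReqId, [Chunk | Acc]);
        {http, {ReqId, stream_end, _Headers}}   -> {ok, lists:reverse(Acc)};
        {http, {ReqId, {error, Reason}}}        -> {error, Reason}
    after 30000 ->
        {error, timeout}
    end.

%% Convert an assumed erlsom simple-form tree (string tags, string or
%% binary text) into the compact {atom, Attrs, Children} shape with
%% binary text. Attributes are passed through unchanged.
compact({Tag, Attrs, Children}) when is_list(Tag) ->
    {list_to_atom(Tag), Attrs, [compact(C) || C <- Children]};
compact(Text) when is_list(Text) ->
    unicode:characters_to_binary(Text);
compact(Text) when is_binary(Text) ->
    Text.
```

One caveat with this sketch: list_to_atom/1 on tag names from untrusted XML can fill the atom table, so a real gateway would probably map a known set of tags or use list_to_existing_atom/1.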
After making these changes the amount of memory used per login dropped by a factor of 2.5 to 3. However, the gateway was still occasionally using up all available memory and dying at what appeared to be random intervals. My best guess was that something in the protocol stream was triggering this problem, so I updated the gateway to log each login attempt, and ran tcpdump to capture all MSN traffic. Eventually I was able to correlate the crashes with incoming status text messages from certain contacts of a few heysan users.
MSNP transports status text as an XML payload of the UBX command:
<Data><CurrentMedia></CurrentMedia><PSM>status text</PSM></Data>
I was still using xmerl to parse this small XML document and grab the cdata from the <PSM> tag. The status text of some contacts contained combinations of UTF-8 text and numeric Unicode entities such as :. Simply attempting to parse these small XML documents would cause xmerl to allocate more than 8GB of memory and thus kill the emulator. Parsing the UBX payload with erlsom instead of xmerl completely resolved the problem, but was a bit of a letdown after so much time spent hunting such an esoteric bug.
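For illustration, pulling the status text out of a parsed UBX payload could look roughly like this. The sketch assumes the payload has already been parsed (for example with erlsom) and converted into the compact {Tag, Attrs, Children} form with atom tags and binary text shown earlier; the module name ubx_psm and the function status_text/1 are hypothetical.

```erlang
-module(ubx_psm).
-export([status_text/1]).

%% Extract the text inside <PSM> from a parsed <Data> element.
%% Returns an empty binary when <PSM> is missing or empty.
%% Assumes <PSM> contains only text, no nested elements.
status_text({'Data', _Attrs, Children}) ->
    case lists:keyfind('PSM', 1, Children) of
        {'PSM', _, Content} -> iolist_to_binary(Content);
        false               -> <<>>
    end.
```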
UPDATE: the crash described above is fixed in xmerl-1.1.10, which is included in Erlang/OTP R12B-4.
Make good use of Erlang's built-in infrastructure and you get twice the result for half the effort!