The Wonderland of Dynamic Tracing (Part 3 of 7)

This is Part 3 of the 7-part series "The Wonderland of Dynamic Tracing". I will keep updating this series to reflect the state of the art of the dynamic tracing world.

The previous part, Part 2, introduced DTrace and SystemTap, two famous dynamic tracing frameworks. This part continues with real-world applications of SystemTap.

See also Part 1 and Part 4.

Applications of SystemTap in Production

The huge impact of DTrace today would not have been possible without the contributions of Brendan Gregg, the renowned leading expert on DTrace, whom I already mentioned in Part 2. He previously worked on the Solaris file system at Sun Microsystems and was one of the earliest users of DTrace. He has authored several books on DTrace and systems performance optimization, as well as many high-quality blog posts on dynamic tracing in general.

After leaving Taobao in 2011, I went to Fuzhou and enjoyed an “idyllic life” there for a whole year. During my last several months there, I dived into Brendan’s public blog and systematically studied DTrace and dynamic tracing. Before that, I had never even heard of DTrace; one of my Sina Weibo followers had mentioned it briefly in a comment. I was immediately intrigued and did my own research to learn more. I would never have imagined that this exploration would lead me to a totally new world and completely change my view of the entire computing field. So I devoted a lot of time to reading Brendan’s blog posts thoroughly, and ultimately my efforts paid off. Fully enlightened, I felt I could finally take in the subtleties of dynamic tracing.

Then in 2012, I ended my “idyllic life” in Fuzhou and left for the US to join the CDN service provider and network security company mentioned before. I immediately started applying SystemTap and the whole set of dynamic tracing methods I had acquired to the company’s global network, to solve very strange online problems. I noticed that my colleagues at the time would troubleshoot online problems by adding extra event-tracking code to the software system themselves. They did so by directly editing the source code to add various counters, or event-tracking statements that emitted log data, mostly in the applications' Lua code and sometimes even in the code base of systems software like Nginx. In this way, large amounts of logs were collected online in real time, then fed into a dedicated database for offline analysis. This practice, however, carried colossal costs. Not only did it sharply raise the cost of hacking on and maintaining the business system, but it also incurred heavy online costs from full-volume data collection and the storage of enormous amounts of log data.

Moreover, the following situation was not uncommon: Engineer A adds event-tracking code to the business logic, and Engineer B later does the same elsewhere; both pieces end up forgotten in the code base, never to be noticed again. The end result is that these ever-growing tracking points clutter the code base, and the invasive revisions make the corresponding software, whether systems software or business code, harder and harder to maintain.

Two serious problems exist with this way of doing metrics and event tracking. The first is that “too many” counters and logging statements get added. Out of a desire to cover everything, we tend to gather a lot of completely useless information, incurring unnecessary collection and storage costs. Even when sampling is already enough to analyze a problem, the habitual response is still to carry out whole-network, full-volume data collection, which is clearly very expensive in the long run. The second problem is that “too few” counters and logging statements get added. It is often very difficult to plan all necessary collection points up front, since nobody can predict, like a prophet, the problems that will need troubleshooting. Consequently, whenever a new problem emerges, the existing information is almost always insufficient. What follows is frequent revision of the software system and frequent online operations, creating a much heavier workload for development and operations engineers and a higher risk of severe online incidents.

Another brute-force debugging method that some operations engineers often use is to take a server offline, set up a series of temporary firewall rules to block or screen user traffic and their own monitoring traffic, and then fiddle with the production machine. This cumbersome process has a huge impact. First, since the machine can no longer serve traffic, the overall throughput of the online system suffers. Second, problems that reproduce only under real traffic will no longer recur. You can imagine how frustrating that can be.

Fortunately, SystemTap-based dynamic tracing offers an ideal solution to such problems while avoiding those headaches. You don’t have to change the software stack itself, be it systems software or business-level applications. I often write dedicated tools that place dynamic probes on the “key spots” of the relevant code paths. These probes collect information separately, and the debugging tool combines it and transmits it to the terminal. This way of working lets me quickly obtain the key information I need by sampling on one or a few machines, and get quick answers to some very basic questions that guide the subsequent (deeper) debugging work.
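To make this concrete, here is a minimal sketch of what such a dedicated SystemTap tool can look like: it times every call of one “key spot” inside running nginx worker processes and prints a small summary after ten seconds. The nginx binary path and the choice of ngx_http_upstream_connect as the probe target are illustrative assumptions only; a real tool would target whatever code path is under suspicion and needs the binary's debug symbols to resolve the function.

    # timed-hotspot.stp -- a minimal sketch, not one of my production tools.
    # Assumes an nginx binary at /usr/sbin/nginx built with debug symbols;
    # ngx_http_upstream_connect is only an example probe target.

    global entry_ts, elapsed_us

    probe process("/usr/sbin/nginx").function("ngx_http_upstream_connect") {
        entry_ts[tid()] = gettimeofday_us()
    }

    probe process("/usr/sbin/nginx").function("ngx_http_upstream_connect").return {
        t = entry_ts[tid()]
        if (t) {
            elapsed_us <<< gettimeofday_us() - t
            delete entry_ts[tid()]
        }
    }

    probe timer.s(10) {
        if (@count(elapsed_us))
            printf("%d calls, avg %d us, max %d us\n",
                   @count(elapsed_us), @avg(elapsed_us), @max(elapsed_us))
        else
            println("no calls observed")
        exit()
    }

Running such a script is as simple as sudo stap timed-hotspot.stp on the target machine, and the probes are removed automatically when the script exits, leaving no trace behind.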

We talked earlier about manually adding metrics and event-tracking/logging code to production systems, recording logs, and loading them into a database. That manual work is far inferior to treating the whole production system as a directly accessible “database”, from which we can obtain exactly the information we need, safely, quickly, and without leaving any trace. Following this train of thought, I wrote a number of debugging tools, most of them open source on GitHub. Many target systems software such as Nginx, LuaJIT, and the operating system kernel, while some focus on higher-level Web frameworks like OpenResty. They live in the following GitHub repositories: nginx-systemtap-toolkit, perl-systemtap-toolkit, and stappxx.

My SystemTap Tool Cloud

These tools helped me identify a lot of online problems, some quite unexpectedly. We will walk through five examples below.

Case #1: Slow Debugging Code Left in Production

The first example was an accidental discovery I made while analyzing online Nginx processes with SystemTap-generated flame graphs. I noticed that a big portion of CPU time was spent on a very strange code path, which turned out to be temporary debugging statements left behind by a former colleague while debugging an ancient problem. It was exactly the kind of “event tracking code” mentioned above: although the problem had long been fixed, the code was left there and forgotten, both online and in the company’s code repository. That piece of code came at a high price, namely ongoing performance overhead that went unnoticed for a long time. All I did was sample the process so that the tool could automatically draw a flame graph (covered in detail in Part 4), from which I could understand the problem and take action. Very efficient.
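For reference, the sampling side of such an analysis can be done with a short SystemTap script along the lines of the sketch below, written in the spirit of the ngx-sample-bt tool from nginx-systemtap-toolkit but not a copy of it. It samples user-space backtraces of one nginx worker and prints each unique stack with its hit count. The sampling window, the stack limit, and the assumption that stap is invoked with -x <worker pid> plus options to resolve nginx's symbols are all illustrative.

    # sample-bt-sketch.stp -- on-CPU user-space stack sampling (sketch).
    # Run roughly as: sudo stap -x <worker_pid> -d /usr/sbin/nginx --ldd sample-bt-sketch.stp
    # (the binary path is an assumption; adjust to your build).

    global bts

    probe timer.profile {
        # only sample the worker process we were pointed at via -x
        if (pid() == target())
            bts[ubacktrace()]++
    }

    probe timer.s(10) { exit() }

    probe end {
        # print each unique stack followed by its sample count
        foreach (bt in bts- limit 300) {
            print_ustack(bt)
            printf("\t%d\n", bts[bt])
        }
    }

The resulting stacks can typically be folded with the stackcollapse-stap.pl script and rendered into an SVG with flamegraph.pl from Brendan Gregg's FlameGraph repository.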

Case #2: Long request latency outliers

Long delays may be seen in only a very small portion of all online requests, the “request latency outliers”. Though few in number, they may have delays on the order of whole seconds. I used to run into such cases a lot. For example, one former colleague simply took a wild guess that my OpenResty had a bug. Unconvinced, I immediately wrote a SystemTap tool that sampled online requests delayed by more than one second. The tool directly measured the internal time breakdown of the problematic requests, including the latency of each typical I/O operation and the pure CPU computation time spent handling the request. It quickly showed that the delays appeared when OpenResty accessed a DNS server written in Go, and it then printed details of those long-tailed DNS queries, which all turned out to involve CNAME records. Mystery solved: the delay had nothing to do with OpenResty, and the finding paved the way for further investigation and optimization.
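A stripped-down sketch of that kind of tool is shown below. It times each request inside the nginx worker from ngx_http_process_request to ngx_http_free_request and records only the ones slower than one second. It is only an approximation of the real tool: it assumes an nginx binary with debug symbols (so the $r request pointer is accessible), and it does not break the time down into individual I/O operations the way the actual tool did.

    # slow-reqs-sketch.stp -- flag requests taking longer than 1s (sketch).
    # Assumes /usr/sbin/nginx with debug symbols; adjust the path as needed.

    global begin_ms, slow_ms

    probe process("/usr/sbin/nginx").function("ngx_http_process_request") {
        # key the start time by the request object pointer
        begin_ms[$r] = gettimeofday_ms()
    }

    probe process("/usr/sbin/nginx").function("ngx_http_free_request") {
        if ($r in begin_ms) {
            elapsed = gettimeofday_ms() - begin_ms[$r]
            delete begin_ms[$r]
            if (elapsed >= 1000)
                slow_ms <<< elapsed
        }
    }

    probe timer.s(60) {
        if (@count(slow_ms)) {
            printf("%d slow requests in the last minute:\n", @count(slow_ms))
            print(@hist_log(slow_ms))
        }
        exit()
    }

From a histogram like this, the next step is usually to zoom in on where inside those slow requests the time actually goes, which is what the real tool did.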

Case #3: From Network Issues to Disk Issues

The third example is very interesting: debugging that shifted from a network problem to a disk problem. My former colleagues and I once noticed that machines in one data center showed a dramatically higher ratio of network timeout errors than the other colocation facilities, although the ratio itself was only about 1 percent. At first we naturally focused on the network protocol stack, but a series of dedicated SystemTap tools aimed directly at those outlier requests later led me to a hard disk configuration issue. First-hand data steered us onto the right track very quickly: a presumed network issue turned out to be a disk problem.
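One simple way to test a “maybe it is the disk” hypothesis from the outside, without touching nginx at all, is to histogram the latency of VFS-level reads issued by the worker processes, as in the sketch below. Filtering by the process name "nginx" and probing only vfs.read are simplifications for illustration; the real tools were tied to the specific outlier requests.

    # nginx-read-latency-sketch.stp -- latency histogram of reads done by nginx (sketch).

    global begin_us, lat_us

    probe vfs.read {
        if (execname() == "nginx")
            begin_us[tid()] = gettimeofday_us()
    }

    probe vfs.read.return {
        if (tid() in begin_us) {
            lat_us <<< gettimeofday_us() - begin_us[tid()]
            delete begin_us[tid()]
        }
    }

    probe timer.s(10) {
        if (@count(lat_us))
            print(@hist_log(lat_us))
        exit()
    }

A long tail in such a histogram on one group of machines but not another is exactly the kind of first-hand data that redirects an investigation.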

Case #4: File Handle Cache Tradeoffs

The fourth example involves flame graphs again. In the CPU flame graph for the online nginx processes, we observed that file open and close operations took a significant portion of the total CPU time. Our natural response was to enable Nginx’s own file-handle cache, but it yielded no noticeable improvement. When we sampled a new flame graph, however, we found that the “spin lock” protecting the cache metadata was now consuming a lot of CPU time. The flame graph made everything clear: although we had enabled the cache, we had set it to such a large size that its benefits were cancelled out by the overhead of the metadata spin lock. Imagine if we had had no flame graphs and had only run black-box benchmarks: we would probably have concluded, wrongly, that Nginx’s file-handle cache was useless, instead of tuning the cache parameters.
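The file-handle cache in question is presumably Nginx's open_file_cache. Whether such a cache is doing its job can also be cross-checked from the outside by counting how many open and close system calls the workers actually make per second, as in the sketch below; the "nginx" process-name filter and the 10-second window are arbitrary choices for illustration.

    # nginx-open-close-sketch.stp -- rate of open/close syscalls from nginx workers (sketch).

    global opens, closes

    probe syscall.open ?, syscall.openat {
        if (execname() == "nginx")
            opens++
    }

    probe syscall.close {
        if (execname() == "nginx")
            closes++
    }

    probe timer.s(10) {
        printf("opens: %d/s, closes: %d/s\n", opens / 10, closes / 10)
        exit()
    }

If the cache is effective, the open rate should be far lower than the rate of requests hitting files; if it is not, the flame graph tells you where the time went instead.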

Case #5: Compiled Regex Cache Tuning

Now comes the last example of this section. After one online release, as I recall, the latest online flame graph revealed that compiling regular expressions was consuming a lot of CPU time, even though the caching of compiled regular expressions had already been enabled online. Apparently the number of regular expressions used by our business system had exceeded the maximum cache size. The obvious next step was simply to increase the cache size for regular expressions online, and as expected, the bottleneck immediately disappeared from our online flame graphs after the cache was enlarged.
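In OpenResty, the compiled-regex cache described here is most likely the one governed by the lua_regex_cache_max_entries directive (used together with the "o" option of the ngx.re API), though the text above does not spell that out. Either way, cache thrashing of this sort can be confirmed from the outside by counting regex compilations in a live worker, as in the rough sketch below; the binary path and the assumption that PCRE is linked statically with symbols available are illustrative only.

    # regex-compile-rate-sketch.stp -- count PCRE compilations in a worker (sketch).
    # The path is an assumption; with a dynamic PCRE, probe the libpcre shared object instead.

    global compiles

    probe process("/usr/local/openresty/nginx/sbin/nginx").function("pcre_compile") {
        compiles++
    }

    probe timer.s(5) {
        printf("pcre_compile calls in 5s: %d\n", compiles)
        exit()
    }

A steadily high compilation rate with the cache supposedly enabled is a strong hint that compiled entries are being evicted and recompiled.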

Wrapping up

These examples demonstrate that new problems keep occurring, and that they vary across data centers, across servers, and even across times of day on the same machine. Whatever the problem, the right approach is to analyze its root cause directly, sampling online at the scene of events, instead of jumping into trial and error driven by wild guesses. With the help of powerful observability tools, troubleshooting can achieve much more with much less effort.

After we founded OpenResty Inc., we developed OpenResty XRay as a brand new dynamic tracing platform, putting an end to the manual use of open-source solutions like SystemTap.

Conclusion

This part of the series covered various use cases of one of the most mature dynamic tracing frameworks, SystemTap. The next part, Part 4, will talk about a very powerful visualization method to analyze resource usage across all software code paths, Flame Graphs. Stay tuned!

A Word on OpenResty XRay

OpenResty XRay is a commercial dynamic tracing product offered by our OpenResty Inc. company. We use this product in articles like this one to intuitively demonstrate implementation details, as well as statistics about real-world applications and open-source software. In general, OpenResty XRay can help users gain deep insight into their online and offline software systems without any modifications or any other cooperation from those systems, and efficiently troubleshoot hard problems in performance, reliability, and security. It utilizes advanced dynamic tracing technologies developed by OpenResty Inc. and others.

You are welcome to contact us to try out this product for free.

OpenResty XRay Console Dashboard

About The Author

Yichun Zhang is the creator of the OpenResty® open-source project and the founder and CEO of OpenResty Inc. He has contributed a dozen open-source Nginx third-party modules and quite a few Nginx and LuaJIT core patches, and he designed the OpenResty XRay platform.

Translations

We provide the Chinese translation of this article on blog.openresty.com.cn ourselves. We also welcome interested readers to contribute translations in other natural languages, as long as the full article is translated without omissions. We thank them in advance.
