This is Part 4 of the series "The Wonderland of Dynamic Tracing", which consists of 7 parts. I will keep updating this series to reflect the state of the art of the dynamic tracing world.
The previous part, Part 3, introduced various real-world use cases of SystemTap in production. This part will take a close look at Flame Graphs, which were mentioned frequently there.
Flame Graphs
Flame Graphs have appeared many times in the previous parts of this series. So what are they? Flame Graphs are a kind of amazing visualization, invented by Brendan Gregg, whom I have already mentioned repeatedly before.
Flame Graphs function like X-ray images of a running software system. The graph integrates and displays temporal and spatial information in a very natural and vivid way, revealing a variety of quantitative statistical patterns in the system's performance.
I shall start with an example. The most classical kind of flame graph looks at the distribution of CPU time among all code paths of the target running software. The resulting diagram visibly distinguishes the code paths consuming more CPU time from those consuming less. Furthermore, flame graphs can be generated on different levels of the software stack, say, one on the C/C++ language level for systems software, and another on a higher level, such as that of a dynamic scripting language like Lua or Python. Different flame graphs often offer different perspectives, reflecting level-specific code hot spots.
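To give a concrete feel for how such graphs are produced, here is a minimal SystemTap sketch of an on-CPU sampler. It is not the exact tool I use in production; the nginx binary path, the 30-second sampling window, and the output handling are just assumptions for illustration. The collected stacks can then be folded and rendered with Brendan Gregg's open-source FlameGraph scripts (stackcollapse-stap.pl and flamegraph.pl).

```
# cpu-fg.stp: minimal on-CPU sampler sketch.
# On every profiling timer tick, record the user-space backtrace of the
# target process (when it happens to be running on a CPU) and count how
# often each unique stack shows up.
#
# Example invocation (binary path and PID are assumptions):
#   stap --ldd -d /usr/local/openresty/nginx/sbin/nginx \
#        -x <nginx worker pid> cpu-fg.stp > stacks.out

global stacks

probe timer.profile {
    # pid() is whatever process is on this CPU right now; only count our target
    if (pid() == target())
        stacks[ubacktrace()]++
}

# stop sampling after 30 seconds
probe timer.s(30) { exit() }

probe end {
    # print the 50 hottest unique stacks with their sample counts
    foreach (bt in stacks- limit 50) {
        print_ustack(bt)
        printf("\t%d\n", stacks[bt])
    }
}
```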
When dealing with the mailing lists of OpenResty, my own open-source software community, I often encourage users to proactively provide the flame graphs they sample when reporting a problem. The graph then works its magic, quickly revealing all the bottlenecks to everyone who sees it and saving everyone the trouble of endless trial and error. It is a big win for everybody.
It is worth noting that in the case of an unfamiliar program, a flame graph still makes it possible to get a big picture of any performance issues, without the need to read any source code of the target software. This capability is really marvelous, thanks to the fact that most programs are written to be reasonable or understandable, at least to some extent, meaning that each program already uses abstraction layers at the time of software construction, for example through functions or class methods. The names of these functions usually carry semantic information and are displayed directly on the flame graph. Each name serves as a hint of what the corresponding function does, and even of the corresponding code path as a whole. The bottlenecks in the program can thus be inferred. So it still comes down to the importance of proper function and module naming in the source code. The names are not only crucial for humans reading the source code, but also very helpful when debugging or profiling the binary programs. Flame graphs, in turn, also serve as a shortcut to learning unfamiliar software systems. Looked at the other way, important code paths are almost always those taking up a lot of time, and so they deserve special attention; otherwise something must be very wrong with the way the software is constructed.
The most classical flame graphs focus on the distribution of CPU time across all code paths of the target software system currently running. This is the CPU time dimension. Naturally, flame graphs can also be extended to other dimensions, like off-CPU time, when a process does not run on any CPU cores. Generally speaking, off-CPU time exists because the process is in a sleeping state for some reason. For example, the process could be waiting for certain system-level locks or for some blocking I/O operations to complete, or it has simply run out of the current CPU time slice assigned by the process scheduler of the operating system. All such circumstances prevent the process from running on any CPU core, yet a lot of wall-clock time is still spent. In contrast with the CPU time dimension, the off-CPU time dimension reveals invaluable information for analyzing the overhead of system locking (such as the sem_wait system call), blocking I/O operations (like open and read), as well as CPU contention among processes and threads. All of these become very obvious in off-CPU flame graphs, without getting overwhelmed by too many details which do not really matter.
Technically speaking, the off-CPU flame graph was the result of a bold attempt. One day, I was reading Brendan's blog article about off-CPU time while staying by Lake Tahoe, which straddles the California-Nevada border. A thought struck me: maybe off-CPU time, like CPU time, could be applied to flame graphs. Later I tried it in my previous employers' production systems, sampling off-CPU flame graphs of the nginx processes using SystemTap. And it worked! I tweeted about the success and got a warm response from Brendan Gregg. He told me how he had tried it without the desired results. I guess he had used off-CPU graphs for multi-threaded programs, like MySQL. Massive thread synchronization operations in such processes fill the off-CPU graph with so much noise that the really interesting parts get obscured. I chose a different use case: single-threaded programs like Nginx and OpenResty. In such processes, off-CPU flame graphs can often promptly reveal the blocking system calls that stall the Nginx event loop, like sem_wait and open, as well as preemption by the process scheduler. This makes them very helpful for analyzing such performance issues. The only noise is the epoll_wait system call at the heart of the Nginx event loop, which is easy to identify and ignore.
Similarly, we can extend the flame graph idea to other system resource metric dimensions, such as the number of bytes in memory leaks, file I/O latency, network bandwidth, and so on. I remember once using the "memory leak flame graph" tool I invented to rapidly figure out what was behind a very thorny leak issue in the Nginx core. Conventional tools like Valgrind and AddressSanitizer were unable to capture the leak because it lurked inside Nginx's own memory pool. In another case, the "memory leak flame graph" easily located a leak in an Nginx C module written by a European developer. He had been perplexed by that very subtle and slow leak for a long time, while I quickly pinpointed the culprit in his C code without reading his source code at all. In retrospect, that was indeed like magic. I hope you can now appreciate the versatility of flame graphs as a visualization method for many entirely different problems.
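To make the idea a little more tangible, here is a very rough "allocation flame graph" sketch in SystemTap. It is not my actual memory-leak tool, which also has to account for frees and for Nginx's pool allocator; the nginx binary path is an assumption, and reading the $size parameter requires an nginx binary built with debug symbols. It simply charges the bytes requested through ngx_alloc(), Nginx's malloc() wrapper, to the user-space backtraces that requested them.

```
# alloc-fg.stp: rough "allocation flame graph" sketch.
# Attribute the bytes requested via ngx_alloc() to the user backtraces that
# requested them. A real leak analysis must also subtract frees; this only
# shows where allocation pressure comes from.
#
# Example invocation (binary path and PID are assumptions):
#   stap --ldd -d /usr/local/openresty/nginx/sbin/nginx \
#        -x <nginx worker pid> alloc-fg.stp

global alloc_bytes

probe process("/usr/local/openresty/nginx/sbin/nginx").function("ngx_alloc") {
    if (pid() == target())
        alloc_bytes[ubacktrace()] += $size
}

# stop after 30 seconds
probe timer.s(30) { exit() }

probe end {
    # backtraces that requested the most bytes come first
    foreach (bt in alloc_bytes- limit 20) {
        print_ustack(bt)
        printf("\t%d bytes\n", alloc_bytes[bt])
    }
}
```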
Our OpenResty XRay product supports automated sampling for various types of flame graphs, including C/C++-level flame graphs, Lua-level flame graphs, off-CPU flame graphs, CPU flame graphs, dynamic memory allocation flame graphs, GC object reference relationship flame graphs, file I/O flame graphs, and many more!
Conclusion
This part of the series took a close look at Flame Graphs. The next part, Part 5, will cover the methodology commonly used in the troubleshooting process involving dynamic tracing technologies.
A Word on OpenResty XRay
OpenResty XRay is a commercial dynamic tracing product offered by our OpenResty Inc. company. We use this product in articles like this one to intuitively demonstrate implementation details, as well as statistics about real-world applications and open-source software. In general, OpenResty XRay can help users gain deep insight into their online and offline software systems without any modifications to, or other cooperation from, those systems, and efficiently troubleshoot really hard problems with performance, reliability, and security. It utilizes advanced dynamic tracing technologies developed by OpenResty Inc. and others.
You are welcome to contact us to try out this product for free.
About The Author
Yichun Zhang is the creator of the OpenResty® open source project. He is also the founder and CEO of the OpenResty Inc. company. He has contributed a dozen open-source Nginx third-party modules, many Nginx and LuaJIT core patches, and designed the OpenResty XRay platform.
Translations
We provide a Chinese translation for this article on blog.openresty.com.cn ourselves. We also welcome interested readers to contribute translations in other natural languages, as long as the full article is translated without any omissions. We thank them in advance.