In the same vein of my previous post on analyzing core dumps of .NET Core applications on Linux, let’s take a look at what it takes to do some basic performance profiling. When starting out, here are a few things I wrote down that would be nice to do:
With this task list in mind, let’s get started!
Generally speaking, a .NET Core application runs as a regular Linux process. There’s nothing particularly fancy involved, which means we can use perf and ftrace and even BPF-based tools to monitor performance. There’s just one catch: resolving symbols for call stacks. Here’s what happens when we profile a CPU-intensive application, running with defaults, using perf:
# perf record -F 97 -ag ^C[ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.364 MB perf.data (789 samples) ] # perf report
As you can see, debugging symbols are missing for pretty much everything under the dotnet process, so we only get addresses rather than method names. Fortunately, .NET Core ships with a knob that can be turned in order to get a perf map file generated in /tmp, which perf can then find and use for symbols. To turn on the knob, export COMPlus_PerfMapEnabled=1:
$ export COMPlus_PerfMapEnabled=1 $ dotnet run & [1] 23503 $ ls /tmp/perf* /tmp/perf-23503.map /tmp/perf-23517.map /tmp/perfinfo-23503.map /tmp/perfinfo-23517.map $ head -2 /tmp/perfinfo-23517.map ImageLoad;/usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Private.CoreLib.ni.dll;{14b5688c-fe9a-4a0d-a0d1-b3af5439e23b}; ImageLoad;/home/vagrant/Runny/bin/Debug/netcoreapp1.1/Runny.dll;{ebb3ede4-dc41-44f4-93d3-152cd0b54ac0}; $ head -2 /tmp/perf-23517.map 00007FABB90D4480 2e instance bool [System.Private.CoreLib] dynamicClass::IL_STUB_UnboxingStub() 00007FABB90D44D0 2e instance System.__Canon /* MT: 0x00007FABB8F60318 */ [System.Private.CoreLib] dynamicClass::IL_STUB_UnboxingStub()
Equipped with these files, we can repeat the perf recording and then the report looks a bit better, with symbols starting to appear, such as ConsoleApplication.Primes::CountPrimes. Note that because the .NET process wrote the perf map file, you might need to tell perf to ignore the fact that it’s not owned by root by using the -f switch ( perf report -f), or simply chown it.
Although, who reads perf reports anyway — let’s generate a flame graph!
Well, a flame graph is a flame graph, nothing special about .NET Core here once we have the right data in our perf files. Let’s go:
# git clone --depth=1 https://github.com/BrendanGregg/FlameGraph ... # perf script | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > flame.svg
Here’s a part of the generated flame graph, looking pretty good:
If you look closely, you’ll notice that some symbols are still missing — notably, we don’t have any symbols for libcoreclr.so. And that’s just the way it is:
$ objdump -t $(find /usr/share/dotnet -name libcoreclr.so) /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/libcoreclr.so: file format elf64-x86-64 SYMBOL TABLE: no symbols
If you build .NET Core from source, you can build with debug information, but that’s not what we get by default from the Microsoft package repository.
Now that we have the necessary building blocks for getting symbols resolved, we can of course move on to other events (and use other tools, too). For example, let’s trace context switches to see where our threads are getting blocked:
# perf record -e sched:sched_switch -ag ... # perf report -f
(This is a fairly typical stack for where the thread gets preempted to let another thread run, even though it hasn’t called any blocking API.)
Or, let’s try some of my favorite tools from BCC. For example, let’s trace file opens:
# opensnoop PID COMM FD ERR PATH 1 systemd 17 0 /proc/955/cgroup 24675 dotnet 3 0 /etc/ld.so.cache 24675 dotnet 3 0 /lib/x86_64-linux-gnu/libdl.so.2 24675 dotnet 3 0 /lib/x86_64-linux-gnu/libpthread.so.0 24675 dotnet 3 0 /usr/lib/x86_64-linux-gnu/libstdc++.so.6 ... 24689 dotnet 47 0 /home/vagrant/Runny/perfcollect 24689 dotnet 47 0 /home/vagrant/Runny/opens.txt 24689 dotnet 47 0 /home/vagrant/Runny/project.lock.json 24689 dotnet 47 0 /home/vagrant/Runny/.Program.cs.swp 24689 dotnet 47 0 /home/vagrant/Runny/Program.cs 24689 dotnet -1 13 /home/vagrant/Runny/perf.data.old
We can conclude that everything more or less works. I dare say this is even a little easier than the JVM situation, where we need an external agent to generate debugging symbols. On the other hand, you have to run the .NET Core process with the COMPlus_PerfMapEnabled environment variable at initialization time — you can’t generate the debugging information after the process has already started without it.
But then I tried one more thing. Let’s try to aggregate file read stacks by using the stackcount tool from BCC to probe read in libpthread (which is where .NET Core’s syscalls are routed through on my box). The result is not very pretty:
$ stackcount pthread:read -p 29751 Tracing 1 functions for "pthread:read"... Hit Ctrl-C to end. read [unknown] [unknown] [unknown] [unknown] void [Runny] ConsoleApplication.Program::Main(string[]) [unknown] [unknown] [unknown] [unknown] [unknown] coreclr_execute_assembly coreclr::execute_assembly(void*, unsigned int, int, char const**, char const*, unsigned int*) run(arguments_t const&) corehost_main ... snipped for brevity ... 16
The [unknown] frames prior to Main are not very surprising — this is libcoreclr.so, and we already know it doesn’t ship with debuginfo. But the top-most frames are disappointing — this is a managed assembly, with managed frames, and there’s no reason why we shouldn’t be able to trace them.
To figure out where these frames are coming from, I’m going to need addresses. With the -v switch, stackcountprints addresses in addition to symbols:
# stackcount pthread:read -v -p 29751 Tracing 1 functions for "pthread:read"... Hit Ctrl-C to end. ^C 7f77b10b1680 read 7f773651f267 [unknown] 7f773651e8d5 [unknown] 7f773651e880 [unknown] 7f773651846a [unknown] 7f77364bfb5d void [Runny] ConsoleApplication.Program::Main(string[]) ...
All right, so which module is 7f773651f267 in, for example? Let’s take a look at the loaded modules (I’m keeping only executable regions):
$ cat /proc/29751/maps | grep 'xp ' ... 7f77364bf000-7f77364c6000 rwxp 00000000 00:00 0 7f7736502000-7f7736530000 r-xp 00003000 fd:00 787585 /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Console.dll 7f7736534000-7f7736564000 r-xp 00003000 fd:00 787603 /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.IO.FileSystem.dll 7f7736577000-7f7736578000 r-xp 00002000 fd:00 787665 /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Threading.Thread.dll 7f773657a000-7f7736587000 r-xp 00002000 fd:00 787606 /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.IO.dll 7f773658a000-7f773659a000 r-xp 00002000 fd:00 787668 /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Threading.dll 7f773659d000-7f773659e000 r-xp 00002000 fd:00 787658 /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Text.Encoding.dll ...
OK, so we seem to be making progress — the desired address is clearly in the range that belongs to System.Console.dll. But, being a managed assembly, we’re not going to find any debug information in it:
$ file /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Console.dll /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Console.dll: PE32+ executable (DLL) (console) Mono/.Net assembly, for MS Windows $ objdump -tT /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Console.dll objdump: /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Console.dll: File format not recognized
Hmpf. So how are we supposed to get symbolic information for these addresses?
If you look online, you’ll find that there’s a tool on the .NET Core repos called perfcollect — essentially a Bash script for collecting performance information from .NET Core processes running on Linux. Let’s take a look.
The perfcollect tool is fairly self-contained, and installs its own dependencies, most notably perf and lttng — .NET Core on Linux uses LTTng to generate various events, including garbage collections, object allocations, thread starts, assembly loads, and many others. Then, perfcollect follows your instructions and runs perf and lttng to collect CPU sampling events, package them up to a big zip file, and hand that to you.
What are you supposed to do with that zip file? Open it on Windows, apparently, using PerfView. Now, I love PerfView, but a face palm is the only reasonable reaction to hearing this. What’s more, perfcollect does a bunch of work that you don’t really need if you plan to analyze the results on the same machine. But there’s one thing it does which sounds very relevant:
WriteStatus "Generating native image symbol files" # Get the list of loaded images and use the path to libcoreclr.so to find crossgen. # crossgen is expected to sit next to libcoreclr.so. local buildidList=`$perfcmd buildid-list | grep libcoreclr.so | cut -d ' ' -f 2`
That definitely sounds good! Turns out that .NET Core writes out an additional map file, named /tmp/perfinfo-$PID.map, which contains a list of image load events for your application’s assemblies. perfcollect then parses that list and invokes the crossgen tool to generate an additional perf map for each assembly, which can be fed into PerfView on the Windows side. Here’s what the perfinfo file looks like:
$ head -4 /tmp/perfinfo-29751.map ImageLoad;/usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Private.CoreLib.ni.dll;{14b5688c-fe9a-4a0d-a0d1-b3af5439e23b}; ImageLoad;/home/vagrant/Runny/bin/Debug/netcoreapp1.1/Runny.dll;{319d161b-f17e-44f6-a210-f297df920194}; ImageLoad;/usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Runtime.dll;{819d412e-d773-4dbb-8d01-20d412b6cf09}; ImageLoad;/usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/mscorlib.dll;{080dac22-6a0e-41ae-85fb-fb79cc07911b};
Now, that’s what crossgen is supposed to do. And according to the comment above, crossgen is also supposed to be in the same folder as libcoreclr.so. But it isn’t:
$ find /usr/share/dotnet -name crossgen
That’s right, no results. Looking online, it seems that crossgen is generated as part of a .NET Core build, and part of the CoreCLR runtime NuGet package, but it’s not part of the pre-packaged binaries you get from the Microsoft package repositories. But with a little effort borrowed from the corefx repo, we can fetch our own crossgen:
$ export CoreClrVersion=1.1.0 $ export Rid=$(dotnet --info | sed -n -e 's/^.*RID:[[:space:]]*//p') $ echo "{\"frameworks\":{\"netcoreapp1.1\":{\"dependencies\":{\"Microsoft.NETCore.Runtime.CoreCLR\":\"$CoreClrVersion\", \"Microsoft.NETCore.Platforms\": \"$CoreClrVersion\"}}},\"runtimes\":{\"$Rid\":{}}}" > project.json $ dotnet restore ./project.json --packages . ... output omitted for brevity ... $ ls ./runtime.$Rid.Microsoft.NETCore.Runtime.CoreCLR/$CoreClrVersion/tools crossgen
All right! So we have crossgen, at which point we can try it out to generate debug information for System.Console.dll, or any other assembly we need, really. Here goes:
$ crossgen /Platform_Assemblies_Paths /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0 \ /CreatePerfMap . /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Console.dll Microsoft (R) CoreCLR Native Image Generator - Version 4.5.22220.0 Copyright (c) Microsoft Corporation. All rights reserved. Successfully generated perfmap for native assembly '/usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Console.dll'.
What does this perfmap file look like? The same as any other perfmap, except the addresses are not absolute — they are offsets from the load address of that module:
$ head -4 System.Console.ni.\{3b33b403-e8c1-44af-a7fb-369b2603f2a3\}.map 0000000000017590 58 void [System.Console] Interop::ThrowExceptionForIoErrno(valuetype Interop/ErrorInfo,string,bool,class [System.Runtime]System.Func`2) 00000000000175F0 4d void [System.Console] Interop::CheckIo(valuetype Interop/Error,string,bool,class [System.Runtime]System.Func`2 ) 0000000000017640 82 int64 [System.Console] Interop::CheckIo(int64,string,bool,class [System.Runtime]System.Func`2 ) 00000000000176D0 17 int32 [System.Console] Interop::CheckIo(int32,string,bool,class [System.Runtime]System.Func`2 )
Well, let’s see if we can at least resolve our desired address by using this approach. If you go back above, we were chasing the address 7f773651f267, loaded into System.Console.dll. First, let’s find the base address where System.Console.dll is loaded:
$ cat /proc/29751/maps | grep System.Console.dll | head -1 7f77364ff000-7f7736500000 r--p 00000000 fd:00 787585 /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Console.dll
The offset, then, is:
$ echo 'ibase=16;obase=10;7F773651F267-7F77364FF000' | bc 20267
So now we need to look for this offset in the System.Console map file. The closest match is here:
0000000000020150 286 instance valuetype System.ConsoleKeyInfo [System.Console] System.IO.StdInReader::ReadKey(bool&)
With that, we have one frame resolved! There are only a few more This process begs to be automated. It would be great to automatically run crossgen, generate the map files with the relative addresses, convert them to absolute addresses, and merge them with the main /tmp/perf-PID.map file that other tools know and love. Read on!
Well, I wrote a small script called dotnet-mapgen.py that automates the above steps and produces a single, unified map file that contains both JIT-compiled addresses and addresses that lie in crossgen’d (AOT-compiled) modules, such as System.Console.dll. The script has two modes:
$ ./dotnet-mapgen.py generate $(pgrep -n dotnet) couldn't find crossgen, trying to fetch it automatically... crossgen succesfully downloaded and placed in libcoreclr's dir crossgen map generation: 15 succeeded, 2 failed
In the “generate” mode, the script first locates crossgen (downloading it if necessary, using the NuGet restore approach shown above), and then runs crossgen on all the managed assemblies loaded into the target process. The 2 failures in the above output are for assemblies that weren’t AOT-compiled. Note that this generation step can be done once, and the map files retained for subsequent runs — unless you change the set of AOT-compiled assemblies loaded into your process.
$ ./dotnet-mapgen.py merge $(pgrep -n dotnet) perfmap merging: 14 succeeded, 3 failed
In the “merge” mode, the script calculates absolute addresses for all the symbols generated in the previous step, and concatenates this information to the main /tmp/perf-PID.map file for the target process.
There’s just one final problem. Turns out, perf refuses to use the map file for symbols that are in memory regions that belong to a module (in our case above, System.Console.dll). And there’s no way to convince perf that it should try to resolve such addresses using the map file. Fortunately, I have a bit more control over BCC tools, so I proposed a PR for retrying symbol resolution using a map file if the symbol wasn’t resolved using the original module. With this patch, here’s stackcount‘s output:
# stackcount ... pthread:read Tracing 1 functions for "pthread:read"... Hit Ctrl-C to end. ^C read instance valuetype System.ConsoleKeyInfo [System.Console] System.IO.StdInReader::ReadKey(bool&) instance string [System.Console] System.IO.StdInReader::ReadLine(bool) instance string [System.Console] System.IO.StdInReader::ReadLine() string [System.Console] System.Console::ReadLine() void [Runny] ConsoleApplication.Program::Main(string[]) ... 16
Note how all symbols are now resolved to managed frames: the JIT-compiled Program::Main, and the AOT-compiled Console::ReadLine, StdInReader::ReadLine, and everything else.
Once this support lands in BCC, we can also do full-fidelity profiling with the profile tool, stack tracing with traceand stackcount, blocked time analysis using offcputime/ offwaketime, and a variety of other tools. For most purposes, the perf-based workflow shown in the beginning of the post is a poorer alternative, if you can run a recent-enough kernel with BPF support.
In a subsequent post, I also plan to explore the LTTng traces to see if we can trace garbage collections, object allocations, managed exceptions, and other events of interest.