Secrets of the Ftrace function tracer



Probably the most powerful tracer derived from Ftrace is the function tracer. It has the ability to trace practically every function in the kernel. It can be run not just for debugging or analyzing, but also to learn and observe the flow of the Linux kernel.

Two previous articles, Debugging the Linux Kernel Using Ftrace parts I and II, explain some of the basic features of Ftrace and the function tracer; this article assumes that the reader has already read them. As with the previous articles, the examples in this article expect that the user has already changed to the debug file system tracing directory. The kernel configuration options that need to be enabled to follow the examples in this article are:

    CONFIG_FUNCTION_TRACER
    CONFIG_DYNAMIC_FTRACE
    CONFIG_FUNCTION_GRAPH_TRACER

Note, the CONFIG_HAVE_FUNCTION_TRACER, CONFIG_HAVE_DYNAMIC_FTRACE, and CONFIG_HAVE_FUNCTION_GRAPH_TRACER options are enabled when the architecture supports the corresponding feature. Do not confuse them with the listed options. The features are only enabled when the listed configuration options are enabled and not when only the _HAVE_ options are.
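
One way to check which of these options a kernel was built with is to search its config file. The following is a minimal sketch, assuming the config is available as a plain-text file; the helper name check_ftrace_config and the example path are made up for this illustration and are not part of Ftrace.

```shell
# Hypothetical helper: report whether each option this article relies on
# is set to "y" in a plain-text kernel config file.
check_ftrace_config() {
    config_file=$1
    for opt in CONFIG_FUNCTION_TRACER CONFIG_DYNAMIC_FTRACE \
               CONFIG_FUNCTION_GRAPH_TRACER; do
        # Anchor the match at the start of the line so that the
        # CONFIG_HAVE_* lines are not mistaken for the real options.
        if grep -q "^${opt}=y\$" "$config_file"; then
            echo "$opt: enabled"
        else
            echo "$opt: not enabled"
        fi
    done
}

# Example (the path is an assumption; it differs between distributions):
#   check_ftrace_config "/boot/config-$(uname -r)"
```

The anchored pattern is the point of the sketch: a plain grep for FUNCTION_TRACER would also match the _HAVE_ variant and give a false positive.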

As shown in the previous articles, here is a quick example of how to enable the function tracer.

   [tracing]# echo function > current_tracer
   [tracing]# cat trace
          <idle>-0     [000] 1726568.996435: hrtimer_get_next_event <-get_next_timer_interrupt
          <idle>-0     [000] 1726568.996436: _spin_lock_irqsave <-hrtimer_get_next_event
          <idle>-0     [000] 1726568.996436: _spin_unlock_irqrestore <-hrtimer_get_next_event
          <idle>-0     [000] 1726568.996437: rcu_needs_cpu <-tick_nohz_stop_sched_tick
          <idle>-0     [000] 1726568.996438: enter_idle <-cpu_idle
          ...

The above shows the process name (<idle>), the PID (0), the CPU that the trace executed on ([000]), a time-stamp in seconds with the decimal places down to microseconds (1726568.996435), the function being traced (hrtimer_get_next_event), and the parent that called that function (get_next_timer_interrupt).
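
To make the column layout concrete, here is a small sketch that splits lines in this format into named fields with awk. It assumes exactly the five whitespace-separated columns shown above; newer kernels add extra flag columns, and task names containing spaces would need more care. The function name parse_trace_lines is hypothetical.

```shell
# Hypothetical helper: split function-tracer lines (in the older
# five-column format shown above) into labeled fields.
parse_trace_lines() {
    awk 'NF == 5 && $5 ~ /^<-/ {
        task = $1
        pid = task
        sub(/.*-/, "", pid)           # text after the last "-" is the PID
        sub(/-[0-9]+$/, "", task)     # what precedes it is the task name
        cpu = substr($2, 2, length($2) - 2)   # strip "[" and "]"
        ts = $3
        sub(/:$/, "", ts)             # drop the trailing colon
        parent = $5
        sub(/^<-/, "", parent)        # drop the "<-" marker
        printf "task=%s pid=%s cpu=%s time=%s func=%s parent=%s\n",
               task, pid, cpu, ts, $4, parent
    }'
}

# Example:
#   cat trace | parse_trace_lines
```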

Function filtering

Running the function tracer can be overwhelming. The amount of data may be vast and very hard for a person to take in. Ftrace provides a way to limit what functions you see. Two files exist that let you limit what functions are traced:

   set_ftrace_filter
   set_ftrace_notrace

These filtering features depend on the CONFIG_DYNAMIC_FTRACE option. As explained in the previous articles, when this configuration is enabled all of the mcount caller locations are stored and at boot time are converted into NOPs. These locations are saved and used to enable tracing when the function tracer is activated. But this also has a nice side effect: not all functions must be enabled. The above files will determine which functions get enabled and which do not.

When any function is listed in the set_ftrace_filter, only those functions will be traced. This will help the performance of the system when the trace is active. Tracing every function incurs a large overhead, but when using the set_ftrace_filter, only those functions listed in that file will have the NOPs changed to call the tracer. Depending on which functions are being traced, just having a couple of hundred functions enabled is hardly noticeable.

The set_ftrace_notrace file is the opposite of set_ftrace_filter. Instead of limiting the trace to a set of functions, functions listed in set_ftrace_notrace will not be traced. Some functions show up quite often and not only does tracing these functions slow down the system, they can fill up the trace buffer and make it harder to analyze the functions you care about. Functions such as rcu_read_lock() and spin_lock() fall into this category.

The process to add functions to these files typically uses bash redirection. Using the symbol '>' will remove all existing functions in the file and add what is being echoed into the file. Appending to the file using '>>' will keep the existing functions and add new ones.

   [tracing]# echo sys_read > set_ftrace_filter
   [tracing]# cat set_ftrace_filter
   sys_read
   [tracing]# echo sys_write >> set_ftrace_filter
   [tracing]# cat set_ftrace_filter
   sys_write
   sys_read
   [tracing]# echo sys_open > set_ftrace_filter
   [tracing]# cat set_ftrace_filter
   sys_open

To remove all functions just echo a blank line into the filter file.

   [tracing]# echo sys_read sys_open sys_write > set_ftrace_notrace
   [tracing]# cat set_ftrace_notrace
   sys_open
   sys_write
   sys_read
   [tracing]# echo > set_ftrace_notrace
   [tracing]# cat set_ftrace_notrace
   [tracing]#

The functions listed in these files can also be set on the kernel command line. The options ftrace_notrace and ftrace_filter will preset these files by listing a comma delimited set of functions.

   ftrace_notrace=rcu_read_lock,rcu_read_unlock,spin_lock,spin_unlock
   ftrace_filter=kfree,kmalloc,schedule,vmalloc_fault,spurious_fault

Functions added by the kernel command line set what will be in the corresponding filter files. These options only pre-load the files; functions can still be removed or added using the bash redirection as explained above.

The functions listed in set_ftrace_notrace take precedence. That is, if a function is listed in both set_ftrace_notrace and set_ftrace_filter, that function will not be traced.
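
The precedence rule can be modeled with ordinary text files. The sketch below assumes two files with one function name per line (for example, copies of the two tracing files) and only covers the case where the filter list is non-empty; the helper name effective_functions is made up for this illustration.

```shell
# Hypothetical helper: print the names that would still be traced when a
# function may appear in both lists. Entries in the notrace list win.
# $1 = file with the filter list, $2 = file with the notrace list.
effective_functions() {
    filter_list=$1
    notrace_list=$2
    # -v drops matching lines, -x matches whole lines only,
    # -F treats each name as a literal string, not a regular expression.
    grep -vxF -f "$notrace_list" "$filter_list"
}

# Example:
#   cat set_ftrace_filter > /tmp/filter
#   cat set_ftrace_notrace > /tmp/notrace
#   effective_functions /tmp/filter /tmp/notrace
```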

Wildcard filters

A list of functions that can be added to the filter files is shown in the available_filter_functions file. This list of functions was derived from the list of stored mcount callers previously mentioned.

   [tracing]# cat available_filter_functions | head -8
   _stext
   do_one_initcall
   run_init_process
   init_post
   name_to_dev_t
   create_dev
   T.627
   set_personality_64bit

You can grep this file and redirect the result into one of the filter files:

   [tracing]# grep sched available_filter_functions > set_ftrace_filter
   [tracing]# cat set_ftrace_filter | head -8
   save_stack_address_nosched
   mce_schedule_work
   smp_reschedule_interrupt
   native_smp_send_reschedule
   sys32_sched_rr_get_interval
   sched_avg_update
   proc_sched_set_task
   sys_sched_get_priority_max

Unfortunately, adding lots of functions to the filtering files is slow and you will notice that the above grep took several seconds to execute. This is because each function name written into the filter file will be processed individually. The above grep produces over 300 function names. Each of those 300 names will be compared (using strcmp()) against every function name in the kernel, which is quite a lot.

   [tracing]# wc -l  available_filter_functions
   24331 available_filter_functions

So the grep above caused set_ftrace_filter to generate over 300 * 24331 (7,299,300) comparisons!

Fortunately, these files also take wildcards; the following glob expressions are valid:

    value* - Select all functions that begin with value.
    *value* - Select all functions that contain the text value.
    *value - Select all functions that end with value.

The kernel contains a rather simple parser, and will not process value*value in the expected way. It will ignore the second value and select all functions that start with value regardless of what it ends with. Wildcards passed to the filter files are processed directly for each available function, which is much faster than passing in individual functions in a list.
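
Before writing a wildcard into a filter file, it can help to preview what it would select. The sketch below emulates the three documented forms (plus exact names) against a list with one function name per line, such as the older available_filter_functions format shown above. It is an approximation written for this illustration, not the kernel's parser, and the name preview_filter is hypothetical.

```shell
# Hypothetical helper: list the names in a file that one of the three
# documented glob forms would select.
# $1 = pattern (quote it!), $2 = file with one function name per line.
preview_filter() {
    pattern=$1
    list=$2
    case $pattern in
        \**\*)  # *value* : name contains value
            value=${pattern#\*}
            value=${value%\*}
            awk -v v="$value" 'index($0, v) > 0' "$list" ;;
        \**)    # *value : name ends with value
            value=${pattern#\*}
            awk -v v="$value" 'length($0) >= length(v) &&
                substr($0, length($0) - length(v) + 1) == v' "$list" ;;
        *\*)    # value* : name begins with value
            value=${pattern%\*}
            awk -v v="$value" 'index($0, v) == 1' "$list" ;;
        *)      # no wildcard: exact name
            awk -v v="$pattern" '$0 == v' "$list" ;;
    esac
}

# Example:
#   preview_filter 'set*' available_filter_functions | head -5
```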

Because the star (*) is also used by bash, it is best to wrap the input with quotes:

   [tracing]# echo set* > set_ftrace_filter
   [tracing]# cat set_ftrace_filter
   #### all functions enabled ####
   [tracing]# echo 'set*' > set_ftrace_filter
   [tracing]# cat set_ftrace_filter | head -5
   set_personality_64bit
   set_intr_gate_ist
   set_intr_gate
   set_intr_gate
   set_tsc_mode

The filters can also select only those functions that belong to a specific module by using the 'mod' command in the input to the filter file:

   [tracing]# echo ':mod:tg3' > set_ftrace_filter
   [tracing]# cat set_ftrace_filter |head -8
   tg3_write32
   tg3_read32
   tg3_write_flush_reg32
   tw32_mailbox_flush
   tg3_write32_tx_mbox
   tg3_read32_mbox_5906
   tg3_write32_mbox_5906
   tg3_disable_ints

This is very useful if you are debugging a single module, and only want to see the functions that belong to that module in the trace.

In the earlier articles, enabling and disabling recording to the ring buffer was done using the tracing_on file and the tracing_on() and tracing_off() kernel functions. But if you do not want to recompile the kernel, and you want to stop the tracing at a particular function, set_ftrace_filter has a method to do so.

The format of the command to have the function trace enable or disable the ring buffer is as follows:

   function:command[:count]

This will execute the command at the start of the function. The command is either traceon or traceoff, and an optional count can be added to have the command only execute a given number of times. If the count is left off (including the leading colon) then the command will be executed every time the function is called.

A while back, I was debugging a change to the kernel I made that was causing a segmentation fault to some programs. I was having a hard time catching the trace, because by the time I was able to stop the trace after seeing the segmentation fault, the data had already been overwritten. But the backtrace on the console showed that the function __bad_area_nosemaphore was being called. I was then able to stop the tracer with the following command:

  [tracing]# echo '__bad_area_nosemaphore:traceoff' > set_ftrace_filter
  [tracing]# cat set_ftrace_filter
  #### all functions enabled ####
  __bad_area_nosemaphore:traceoff:unlimited
  [tracing]# echo function > current_tracer

Notice that functions with commands do not affect the general filters. Even though a command has been added to __bad_area_nosemaphore, the filter still allowed all functions to be traced. Commands and filter functions are separate and do not affect each other. With the above command attached to the function __bad_area_nosemaphore, the next time the segmentation fault occurred, the trace stopped and contained the data I needed to debug the situation.
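
The function:command[:count] format lends itself to a small wrapper. The following is a sketch only: arm_trace_command is a made-up name, and it appends with '>>' (rather than the '>' used in the example above) on the assumption that any function filters already set should be left alone.

```shell
# Hypothetical helper: attach a traceon/traceoff command to a function.
# $1 = tracing directory, $2 = function name,
# $3 = traceon or traceoff, $4 = optional count.
arm_trace_command() {
    tracing_dir=$1
    func=$2
    cmd=$3
    count=${4:-}
    case $cmd in
        traceon|traceoff) ;;
        *) echo "unknown command: $cmd" >&2
           return 1 ;;
    esac
    # ${count:+:$count} adds ":count" only when a count was given.
    echo "${func}:${cmd}${count:+:$count}" >> "$tracing_dir/set_ftrace_filter"
}

# Example, run from the tracing directory:
#   arm_trace_command . __bad_area_nosemaphore traceoff
#   arm_trace_command . schedule traceoff 5
```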

Removing functions from the filters

As stated earlier, echoing in nothing with '>' will clear the filter file. But what if you only want to remove a few functions from the filter?

   [tracing]# cat set_ftrace_filter > /tmp/filter
   [tracing]# grep -v lock /tmp/filter > set_ftrace_filter

The above works, but as mentioned, it may take a while to complete if there were several functions already in set_ftrace_filter. The following does the same thing but is much faster:

   [tracing]# echo '!*lock*' >> set_ftrace_filter

The '!' symbol will remove functions listed in the filter file. As shown above, the '!' works with wildcards, but could also be used with a single function. Since '!' has special meaning in bash it must be wrapped with single quotes or bash will try to execute what follows it. Also note the '>>' is used. If you make the mistake of using '>' you will end up with no functions in the filter file.

Because the commands and filters do not interfere with each other, clearing the set_ftrace_filter will not clear the commands. The commands must be cleared with the '!' symbol.

   [tracing]# echo 'sched*' > set_ftrace_filter
   [tracing]# echo 'schedule:traceoff' >> set_ftrace_filter
   [tracing]# cat set_ftrace_filter | tail -5
   schedule_console_callback
   schedule_bh
   schedule_iso_resource
   schedule_reallocations
   schedule:traceoff:unlimited
   [tracing]# echo > set_ftrace_filter
   [tracing]# cat set_ftrace_filter
   #### all functions enabled ####
   schedule:traceoff:unlimited
   [tracing]# echo '!schedule:traceoff' >> set_ftrace_filter
   [tracing]# cat set_ftrace_filter
   #### all functions enabled ####
   [tracing]#

This may seem awkward, but having the '>' and '>>' only affect the functions to be traced, and not the function commands, actually simplifies the control between filtering functions and adding and removing commands.

Tracing a specific process

Perhaps you only need to trace a specific process, or set of processes. The file set_ftrace_pid lets you specify the processes that you want to trace. To trace just the current thread you can do the following:

   [tracing]# echo $$ > set_ftrace_pid

The above will set the function tracer to only trace the bash shell that executed the echo command. If you want to trace a specific process, you can create a shell script wrapper program.

   [tracing]# cat ~/bin/ftrace-me
   #!/bin/sh
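   # Find where debugfs is mounted (the third field of /proc/mounts is the type)
   DEBUGFS=$(awk '$3 == "debugfs" { print $2; exit }' /proc/mounts)
   echo $$ > "$DEBUGFS/tracing/set_ftrace_pid"
   echo function > "$DEBUGFS/tracing/current_tracer"
   # "$@" keeps arguments that contain spaces intact
   exec "$@"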
   [tracing]# ~/bin/ftrace-me ls -ltr

Note, you must clear the set_ftrace_pid file if you want to go back to generic function tracing after performing the above.

   [tracing]# echo -1 > set_ftrace_pid
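
Because the wrapper above ends with exec, it cannot clean up after the command finishes. A possible variation is sketched below: a child shell records its own PID and execs the command, and the caller then stops recording and clears the PID filter. The name trace_command is hypothetical, and using tracing_on to freeze the buffer is a choice made for this sketch; remember to write 1 to tracing_on before the next trace.

```shell
# Hypothetical wrapper: trace one command, then restore generic tracing.
# $1 = tracing directory; the remaining arguments are the command to run.
trace_command() {
    tracing_dir=$1
    shift
    # The child shell writes its own PID and then exec()s the command,
    # so the command runs under the PID that was written.
    TRACING_DIR=$tracing_dir sh -c '
        echo $$ > "$TRACING_DIR/set_ftrace_pid"
        echo function > "$TRACING_DIR/current_tracer"
        exec "$@"
    ' sh "$@"
    status=$?
    # Stop recording so the buffer keeps the data of interest,
    # then clear the PID filter as shown above.
    echo 0 > "$tracing_dir/tracing_on"
    echo -1 > "$tracing_dir/set_ftrace_pid"
    return $status
}

# Example, run from the tracing directory:
#   trace_command . ls -ltr
#   cat trace | head
```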

What calls a specific function?

Sometimes it is useful to know what is calling a particular function. The immediate predecessor is helpful, but an entire backtrace is even better. The function tracer contains an option that will create a backtrace in the ring buffer for every function that is traced. Since creating a backtrace for every function has a large overhead, which could live lock the system, care must be taken when using this feature. Imagine the timer interrupt on a slower system where it is run at 1000 Hz, that is, once every millisecond. It is quite possible that having every function that the timer interrupt calls produce a backtrace could take 1 millisecond to complete. By the time the timer interrupt returns, a new one will be triggered before any other work can be done, which leads to a live lock.

To use the function tracer backtrace feature, it is imperative that the functions being called are limited by the function filters. The option to enable the function backtracing is unique to the function tracer and activating it can only be done when the function tracer is enabled. This means you must first enable the function tracer before you have access to the option:

   [tracing]# echo kfree > set_ftrace_filter
   [tracing]# cat set_ftrace_filter
   kfree
   [tracing]# echo function > current_tracer
   [tracing]# echo 1 > options/func_stack_trace
   [tracing]# cat trace | tail -8
    => sys32_execve
    => ia32_ptregs_common
                cat-6829  [000] 1867248.965100: kfree <-free_bprm
                cat-6829  [000] 1867248.965100: <stack trace>

    => free_bprm
    => compat_do_execve
    => sys32_execve
    => ia32_ptregs_common
   [tracing]# echo 0 > options/func_stack_trace
   [tracing]# echo > set_ftrace_filter

Notice that I was careful to cat the set_ftrace_filter before enabling the func_stack_trace option to ensure that the filter was enabled. At the end, I disabled the options/func_stack_trace before disabling the filter. Also note that the option is non-volatile, that is, even if you enable another tracer plugin in current_tracer, the option will still be enabled if you re-enable the function tracer.
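
The same caution can be built into a script. The sketch below refuses to turn on func_stack_trace unless set_ftrace_filter actually limits tracing; the helper name enable_stack_trace is made up, and the check relies on the "all functions enabled" banner shown in the listings above.

```shell
# Hypothetical helper: enable func_stack_trace only when a function
# filter is in place. $1 = tracing directory.
enable_stack_trace() {
    tracing_dir=$1
    filter="$tracing_dir/set_ftrace_filter"
    # An unrestricted filter reads back as the banner line (or as nothing).
    if ! grep -q . "$filter" || grep -q 'all functions enabled' "$filter"; then
        echo "refusing: set_ftrace_filter does not limit tracing" >&2
        return 1
    fi
    # The option only exists once the function tracer is selected.
    echo function > "$tracing_dir/current_tracer"
    echo 1 > "$tracing_dir/options/func_stack_trace"
}

# Example, run from the tracing directory:
#   echo kfree > set_ftrace_filter
#   enable_stack_trace .
```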

The function_graph tracer

The function tracer is very powerful, but it may be difficult to understand the linear format that it produces. Frederic Weisbecker has extended the function tracer into the function_graph tracer. The function_graph tracer piggy-backs on most of the code created by the function tracer, but adds its own hook in the mcount call. Because it still uses the mcount calling methods, most of the function filtering explained above also applies to the function_graph tracer, with the exception of the traceon/traceoff commands and set_ftrace_pid (although the latter may change in the future).

The function_graph tracer was also explained in the previous articles, but the set_graph_function file was not described. The func_stack_trace used in the previous section can see what might call a function, but set_graph_function can be used to see what a function calls:

   [tracing]# echo kfree > set_graph_function
   [tracing]# echo function_graph > current_tracer
   [tracing]# cat trace
   # tracer: function_graph
   #
   # CPU  DURATION                  FUNCTION CALLS
   # |     |   |                     |   |   |   |
    0)               |  kfree() {
    0)               |    virt_to_cache() {
    0)               |      virt_to_head_page() {
    0)   0.955 us    |        __phys_addr();
    0)   2.643 us    |      }
    0)   4.299 us    |    }
    0)   0.855 us    |    __cache_free();
    0)   ==========> |
    0)               |    smp_apic_timer_interrupt() {
    0)               |      apic_write() {
    0)   0.849 us    |        native_apic_mem_write();
    0)   2.853 us    |      }
   [tracing]# echo > set_graph_function

This displays the call graph performed only by kfree(). The "==========>" shows that an interrupt happened during the call. The trace records all functions within the kfree() block, even those functions called by an interrupt that triggered while in the scope of kfree().

The function_graph tracer shows the time a function took in the duration field. In the previous articles, it was mentioned that only the leaf functions, the ones that do not call other functions, have an accurate duration, since the duration of parent functions also includes the overhead of the function_graph tracer calling the child functions. By using the set_ftrace_filter file, you can force any function into becoming a leaf function in the function_graph tracer, and this will allow you to see an accurate duration of that function.

   [tracing]# echo smp_apic_timer_interrupt > set_ftrace_filter
   [tracing]# echo function_graph > current_tracer
   [tracing]# cat trace | head
   # tracer: function_graph
   #
   # CPU  DURATION                  FUNCTION CALLS
   # |     |   |                     |   |   |   |
    1)   ==========> |
    1) + 16.433 us   |  smp_apic_timer_interrupt();
    1)   ==========> |
    1) + 25.897 us   |  smp_apic_timer_interrupt();
    1)   ==========> |
    1) + 24.764 us   |  smp_apic_timer_interrupt();

The above shows that the timer interrupt takes between 16 and 26 microseconds to complete.
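
Reading the range off by eye works for three lines; for a longer trace a short awk filter can do it. This sketch assumes the function_graph layout shown above, where each duration is a number followed by a separate "us" field; the name duration_range is hypothetical.

```shell
# Hypothetical helper: report the count, minimum, and maximum of the
# durations found in function_graph output read from standard input.
duration_range() {
    awk '{
        for (i = 2; i <= NF; i++) {
            # A duration is a number immediately followed by the unit "us".
            if ($i == "us" && $(i - 1) ~ /^[0-9.]+$/) {
                d = $(i - 1) + 0
                if (n == 0 || d < min) min = d
                if (n == 0 || d > max) max = d
                n++
            }
        }
    }
    END { if (n) printf "n=%d min=%.3f max=%.3f\n", n, min, max }'
}

# Example:
#   cat trace | duration_range
```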

Function profiling

oprofile and perf are very powerful profiling tools that take periodic samples of the system and can show where most of the time is spent. With the function profiler, it is possible to take a good look at the actual function execution and not just samples. If CONFIG_FUNCTION_GRAPH_TRACER is configured in the kernel, the function profiler will use the function graph infrastructure to record how long a function has been executing. If just CONFIG_FUNCTION_TRACER is configured, the function profiler will just count the functions being called.

   [tracing]# echo nop > current_tracer
   [tracing]# echo 1 > function_profile_enabled
   [tracing]# cat trace_stat/function0 | head
     Function                               Hit    Time            Avg
     --------                               ---    ----            ---
     schedule                             22943    1994458706 us     86931.03 us
     poll_schedule_timeout                 8683    1429165515 us     164593.5 us
     schedule_hrtimeout_range              8638    1429155793 us     165449.8 us
     sys_poll                             12366    875206110 us     70775.19 us
     do_sys_poll                          12367    875136511 us     70763.84 us
     compat_sys_select                     3395    527531945 us     155384.9 us
     compat_core_sys_select                3395    527503300 us     155376.5 us
     do_select                             3395    527477553 us     155368.9 us

The above also includes the times that a function has been preempted or schedule() was called and the task was swapped out. This may seem useless, but it does give an idea of what functions get preempted often. Ftrace also includes options that allow you to have the function graph tracer ignore the time the task was scheduled out.

   [tracing]# echo 0 > options/sleep-time
   [tracing]# echo 0 > function_profile_enabled
   [tracing]# echo 1 > function_profile_enabled
   [tracing]# cat trace_stat/function0  | head
     Function                               Hit    Time            Avg
     --------                               ---    ----            ---
     default_idle                          2493    6763414 us     2712.962 us
     native_safe_halt                      2492    6760641 us     2712.938 us
     sys_poll                              4723    714243.6 us     151.226 us 
     do_sys_poll                           4723    692887.4 us     146.704 us 
     sys_read                              9211    460896.3 us     50.037 us  
     vfs_read                              9243    434521.2 us     47.010 us  
     smp_apic_timer_interrupt              3940    275747.4 us     69.986 us  
     sock_poll                            80613    268743.2 us     3.333 us   

Note that the sleep-time option contains a "-" and is not sleep_time.

Disabling the function profiler and then re-enabling it causes the numbers to reset. In the listings shown here the entries appear in descending order of the Time column, but using scripts you can easily sort by any of the numbers. The trace_stat/function0 file only represents CPU 0. There exists a trace_stat/function# file for each CPU on the system. All functions that are traced and have been hit are in this file.
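
As an example of such a script, the sketch below re-sorts a profile listing by the Avg column. It assumes the six-field row layout shown in the listings (name, hits, time, "us", average, "us") with two header lines; the name sort_profile_by_avg is made up for this illustration.

```shell
# Hypothetical helper: sort function-profiler output (read from standard
# input) by the Avg column, largest first.
sort_profile_by_avg() {
    # Skip the two header lines, then sort numerically on field 5 (Avg).
    # LC_ALL=C keeps "." as the decimal point regardless of locale.
    tail -n +3 | LC_ALL=C sort -k5,5nr
}

# Example:
#   cat trace_stat/function0 | sort_profile_by_avg | head
```

Changing the key to -k2,2nr would sort by the Hit column instead.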

   [tracing]# cat trace_stat/function0  | wc -l
   2978

Functions that were not hit are not listed. The above shows that 2978 functions were hit since I started the profiling.

Another option that affects profiling is graph-time (again with a "-"). By default it is enabled. When enabled, the times for a function include the times of all the functions that were called within the function. As you can see from the output in the above examples, several system calls are listed with the highest average. When disabled, the times only include the execution of the function itself, and do not contain the times of functions that are called from the function:

   [tracing]# echo 0 > options/graph-time
   [tracing]# echo 0 > function_profile_enabled
   [tracing]# echo 1 > function_profile_enabled
   [tracing]# cat trace_stat/function0  | head
     Function                               Hit    Time            Avg
     --------                               ---    ----            ---
     mwait_idle                           10132    246835458 us     24361.96 us
     tg_shares_up                        154467    389883.5 us     2.524 us   
     _raw_spin_lock_irqsave              343012    263504.3 us     0.768 us   
     _raw_spin_unlock_irqrestore         351269    175205.6 us     0.498 us   
     walk_tg_tree                         14087    126078.4 us     8.949 us   
     __set_se_shares                     274937    88436.65 us     0.321 us   
     _raw_spin_lock                      100715    82692.61 us     0.821 us   
     kstat_irqs_cpu                      257500    80124.96 us     0.311 us   

Note that both sleep-time and graph-time also affect the duration times displayed by the function_graph tracer.

Conclusion

The function tracer is very powerful with lots of different options. It is already available in mainline Linux, and hopefully will be enabled by default in most distributions. It allows you to see into the depths of the kernel and with its arsenal of features, gives you a good idea of why things are happening the way they do. Start using the function tracer to open up the black box that we call the kernel. Have fun!
