Tracing the Linux kernel with ftrace

Tracing is a useful technique to find bugs in software, and ftrace is the tracing framework built into the Linux kernel.

But before we start talking about ftrace, let’s define what tracing is.

What is tracing?

Tracing is the specialized use of logging to record information about a program’s flow of execution.

You know when you add print messages in your code to debug it? You are tracing the application with your own “tracing system”. The debug messages added to the code are static tracepoints, and your “trace system” might send the debug messages to the program’s standard output or a log file.

Well, that works. But does not scale. Every time you want to trace some part of the code, you will have to add new debug messages and rebuild it. And you do not have much control over where the messages go. What if you want to save them to analyze later? What if you want to filter the messages? And the code will be a mess, with lots of print statements (you will have to be very careful to not commit any code with debug messages, right?).

The fact is that most of the time we don’t need to add debug messages to the code because there are already tracing tools and frameworks that can do that for us.

That’s where ftrace comes in…

What is ftrace?

Ftrace is a tracing framework for the Linux kernel. It was added to the kernel back in 2008 and has evolved a lot since then.

Ftrace stands for function tracer and basically lets you watch and record the execution flow of kernel functions. It was created by Steven Rostedt, derived from two other tools called latency tracer from Ingo Molnar and Steven’s logdev utility.

With ftrace you can really see what the kernel is doing. You can trace function calls and learn a lot about how the kernel works. You can find out which kernel functions are called when you run a user space application. You can profile functions, measure execution time and find out bottlenecks and performance issues. You can identify hangs in kernel space. You can measure the time it takes to run a real-time task and find out latency issues. You can measure stack usage in kernel space and find out possible stack overflows. You can really do a lot of things to monitor and find bugs in the kernel!

Black magic? Maybe a little. 🙂 But how does it work?

How does ftrace work?

There are two main types of tracing: static tracing and dynamic tracing.

Static tracing is implemented through static probes added in the source code. They have a low processing load, but the traced code is limited and defined at build time.

Dynamic tracing is implemented through dynamic probes injected into code, allowing to define at runtime the code to be traced. It has a certain processing load, but the range of source code to be traced is much larger.

Ftrace uses a combination of both static probes (function tracing, event tracing, etc) and dynamic probes (kprobes, uprobes, etc).

To trace function calls, ftrace will build the kernel with GCC’s -pg option.

When the kernel is built with -pg, GCC will prepend machine instructions to the prologue of every non-inlined function that will redirect the execution to the ftrace’s trampolines and tracer plugins that perform the actual tracing.

The disassembly listing below is the prologue of the function gpiod_set_value. The call to __gnu_mcount_nc on line 10 was added by GCC and will be used by ftrace to trace the function (this is the same technique used by gprof to profile user space applications).

$ arm-linux-gdb -q vmlinux
Reading symbols from vmlinux...done.
(gdb) disassemble /m gpiod_set_value
Dump of assembler code for function gpiod_set_value:
3133    {
   0xc03a83ac :     mov     r12, sp
   0xc03a83b0 :     push    {r4, r5, r11, r12, lr, pc}
   0xc03a83b4 :     sub     r11, r12, #4
   0xc03a83b8 :    push    {lr}            ; (str lr, [sp, #-4]!)
   0xc03a83bc :    bl      0xc001b420 
   0xc03a83c0 :    mov     r5, r1
   0xc03a83c8 :    mov     r4, r0

Ftrace is also able to trace events in the kernel.

Events are static tracepoints added by developers to monitor kernel subsystems like the scheduler, power management, interrupts, networking, gpio, etc. Just search for functions starting with trace_ in the Linux kernel source code and you will find several places where trace events are used:

$ grep -R trace_ drivers/gpio/*
drivers/gpio/gpiolib.c: trace_gpio_direction(desc_to_gpio(desc), 1, status);
drivers/gpio/gpiolib.c: trace_gpio_value(desc_to_gpio(desc), 0, val);
drivers/gpio/gpiolib.c: trace_gpio_direction(desc_to_gpio(desc), 0, ret);
drivers/gpio/gpiolib.c: trace_gpio_value(desc_to_gpio(desc), 1, value);
drivers/gpio/gpiolib.c:                 trace_gpio_value(desc_to_gpio(desc), 1, value);
drivers/gpio/gpiolib.c: trace_gpio_direction(desc_to_gpio(desc), value, err);
drivers/gpio/gpiolib.c: trace_gpio_direction(desc_to_gpio(desc), !value, err);
drivers/gpio/gpiolib.c: trace_gpio_value(desc_to_gpio(desc), 0, value);
drivers/gpio/gpiolib.c:                 trace_gpio_value(desc_to_gpio(desc), 0, value);

And how do you interface with ftrace?

When tracing is enabled, all collected trace data will be stored by ftrace in a ring buffer in memory.

There is a virtual filesystem called tracefs (usually mounted in /sys/kernel/tracing) to configure ftrace and collect the trace data. All manipulations are done with simple file operations in this directory.

And that’s one thing I like about ftrace. Since the interface to the tracing system is a file system, you can use simple tools like echo and cat to trace the Linux kernel!

So how do you use it?

How to use ftrace?

First, we should enable ftrace in the Linux kernel (CONFIG_FTRACE) and all tracers and options we plan to use. This is the menuconfig screen for kernel version 4.18 with most ftrace options enabled:

With ftrace enabled, we can mount the tracefs virtual filesystem:

# mount -t tracefs tracefs /sys/kernel/tracing

And this is the filesystem interface provided by ftrace:

# ls /sys/kernel/tracing/
README                      set_ftrace_filter
available_events            set_ftrace_notrace
available_filter_functions  set_ftrace_pid
available_tracers           set_graph_function
buffer_size_kb              set_graph_notrace
buffer_total_size_kb        snapshot
current_tracer              stack_max_size
dyn_ftrace_total_info       stack_trace
enabled_functions           stack_trace_filter
events                      timestamp_mode
free_buffer                 trace
function_profile_enabled    trace_clock
hwlat_detector              trace_marker
instances                   trace_marker_raw
max_graph_depth             trace_options
options                     trace_pipe
per_cpu                     trace_stat
printk_formats              tracing_cpumask
saved_cmdlines              tracing_max_latency
saved_cmdlines_size         tracing_on
saved_tgids                 tracing_thresh
set_event                   uprobe_events
set_event_pid               uprobe_profile

We can print the list of available tracers:

# cat available_tracers
hwlat   blk      function_graph   wakeup_dl   wakeup_rt 
wakeup  irqsoff  function         nop

There are function tracers (function, function_graph), latency tracers (wakeup_dl, wakeup_rt, irqsoff, wakeup, hwlat), I/O tracers (blk) and many more!

To enable a tracer, we just have to write its name to current_tracer:

# echo function > current_tracer

And we can read the trace buffer with the trace or trace_pipe file:

# cat trace
# tracer: function
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
          -0     [001] d...    23.695208: _raw_spin_lock_irqsave