Frequently Asked Questions About Using Tracepoints with eBPF/libbpf Programs

Preface

This article answers common questions about tracepoints that come up when writing eBPF/libbpf programs, such as programs of type BPF_PROG_TYPE_TRACEPOINT.

Types of eBPF Programs

This article focuses on the eBPF program type called BPF_PROG_TYPE_TRACEPOINT.

Events that can be Monitored by Tracepoints

  • To list the events that tracepoints can monitor, read the file /sys/kernel/debug/tracing/available_events. Each line in the file follows the format:

    <category>:<name>
    

    For example:

    syscalls:sys_enter_execve
    
  • You can also use the bpftrace tool for querying:

    $ sudo bpftrace -l tracepoint:* | grep 'sys_enter_execve'
    tracepoint:syscalls:sys_enter_execve
    tracepoint:syscalls:sys_enter_execveat
    

Format of SEC Content

The SEC format corresponding to the tracepoint event is:

SEC("tracepoint/<category>/<name>")

// For example:
// SEC("tracepoint/syscalls/sys_enter_openat")

or:

SEC("tp/<category>/<name>")

// For example:
// SEC("tp/syscalls/sys_enter_openat")

The values of <category> and <name> are taken from the contents listed in the available_events file.

SEC("tp/xx/yy") and SEC("tracepoint/xx/yy") are actually equivalent, and you can use either one according to personal preference.
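The mapping from an available_events line to a section name is mechanical, so it can be sketched in a few lines of user-space C. The helper below is hypothetical (event_to_sec_name is not part of libbpf); it only illustrates how "<category>:<name>" becomes "tracepoint/<category>/<name>".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not a libbpf API): convert one line of
 * available_events, "<category>:<name>", into the section name
 * "tracepoint/<category>/<name>".
 * Returns 0 on success, -1 if the line is malformed or out is too small.
 */
static int event_to_sec_name(const char *line, char *out, size_t out_len)
{
    assert(line != NULL && out != NULL);

    const char *colon = strchr(line, ':');
    if (colon == NULL || colon == line)
        return -1; /* no separator, or empty category */

    int category_len = (int)(colon - line);
    /* Stop the name at a trailing newline, e.g. when the line came from fgets(). */
    int name_len = (int)strcspn(colon + 1, "\n");
    if (name_len == 0)
        return -1; /* empty name */

    int n = snprintf(out, out_len, "tracepoint/%.*s/%.*s",
                     category_len, line, name_len, colon + 1);
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

Passing "syscalls:sys_enter_openat" fills the buffer with "tracepoint/syscalls/sys_enter_openat", the string used in the SEC example above. Replacing the "tracepoint/" prefix with "tp/" would give the equivalent short form.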

How to Determine the Parameter Types of the Tracepoint Event Handling Function and Obtain the Corresponding Kernel Call Parameters

Suppose we want to trace the chmod command and the fchmodat system call it makes. How do we determine the parameter type of the event handling function in the eBPF program, and how do we read the arguments of the fchmodat system call, such as the file name and the permission mode?

Determining the Tracepoint Event to be Tracked

The first step is to determine the system call used by chmod, which can be done in various ways, such as using the strace command:

$ strace chmod 600 a.txt
...
fchmodat(AT_FDCWD, "a.txt", 0600)       = 0
...

The second step is to find the tracepoint event that can be used for this system call:

$ sudo cat /sys/kernel/debug/tracing/available_events |grep fchmodat
syscalls:sys_exit_fchmodat
syscalls:sys_enter_fchmodat

We can see that there are two events: sys_enter_fchmodat and sys_exit_fchmodat. Here, we choose the sys_enter_fchmodat event for further explanation.

Determining the Information Included in the Event

The third step is to determine what information the event itself provides. We know that the fchmodat system call takes a file name and a mode, but we do not yet know whether an eBPF program can read them from the event.

  • You can obtain the information by viewing the contents of the file /sys/kernel/debug/tracing/events/<category>/<name>/format.

    For example, the contents of the file /sys/kernel/debug/tracing/events/syscalls/sys_enter_fchmodat/format for the sys_enter_fchmodat event are as follows:

    $ sudo cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_fchmodat/format
    name: sys_enter_fchmodat
    ID: 647
    format:
            field:unsigned short common_type;       offset:0;       size:2; signed:0;
            field:unsigned char common_flags;       offset:2;       size:1; signed:0;
            field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
            field:int common_pid;   offset:4;       size:4; signed:1;
    
            field:int __syscall_nr; offset:8;       size:4; signed:1;
            field:int dfd;  offset:16;      size:8; signed:0;
            field:const char * filename;    offset:24;      size:8; signed:0;
            field:umode_t mode;     offset:32;      size:8; signed:0;
    
    print fmt: "dfd: 0x%08lx, filename: 0x%08lx, mode: 0x%08lx", ((unsigned long)(REC->dfd)), ((unsigned long)(REC->filename)), ((unsigned long)(REC->mode))
    

    An ordinary eBPF program cannot directly access the first 8 bytes listed under format, that is, the common_* fields (some BPF helpers can access them) [1]. The remaining fields can generally be accessed, and the fields referenced in print fmt are the information we can obtain in the eBPF program.

  • You can also use the bpftrace tool to query:

    $ sudo bpftrace  -l tracepoint:syscalls:sys_enter_fchmodat -v
    tracepoint:syscalls:sys_enter_fchmodat
        int __syscall_nr
        int dfd
        const char * filename
        umode_t mode
    

From the above, we can see that we can obtain the dfd, filename, and mode information of the sys_enter_fchmodat event, which includes the file name and permission mode information mentioned earlier.
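The offset and size of each field can also be extracted programmatically from the format file. The following is a minimal user-space sketch, assuming each field line has the "field:...; offset:N; size:N; signed:N;" shape shown above; struct tp_field and parse_format_field are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed "field:" line of a tracepoint format file (hypothetical type). */
struct tp_field {
    char decl[64];  /* C declaration text, e.g. "const char * filename" */
    int offset;     /* byte offset of the field within the event record */
    int size;       /* size of the field in bytes */
    int is_signed;  /* 1 if the field is signed, 0 otherwise */
};

/*
 * Parse a single line such as
 *   "\tfield:umode_t mode;\toffset:32;\tsize:8;\tsigned:0;"
 * Returns 0 on success, -1 if the line is not a field line.
 */
static int parse_format_field(const char *line, struct tp_field *f)
{
    assert(line != NULL && f != NULL);

    const char *p = strstr(line, "field:");
    if (p == NULL)
        return -1;

    /* A space in the scanf format matches any run of whitespace, tabs included. */
    int n = sscanf(p, "field:%63[^;]; offset:%d; size:%d; signed:%d;",
                   f->decl, &f->offset, &f->size, &f->is_signed);
    return n == 4 ? 0 : -1;
}
```

Applied to the filename line of the sys_enter_fchmodat format file, this yields the declaration "const char * filename" with offset 24 and size 8, which are the values needed later when choosing or building the context structure.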

Determining the Parameters of the Event Handling Function

The fourth step is to determine the parameter type of the event handling function. Knowing what information the event provides is not enough; we also need to know how to read it in the eBPF program. That means finding out what type the handler's input parameter has, so that we can read the event's fields through it.

Based on vmlinux.h

One way is to search in the vmlinux.h file. Generally, sys_enter_xx corresponds to trace_event_raw_sys_enter, sys_exit_xx corresponds to trace_event_raw_sys_exit, and others generally correspond to trace_event_raw_<name>. If not found, you can refer to the example of trace_event_raw_sys_enter to find a similar struct.

For sys_enter_fchmodat, we use the struct trace_event_raw_sys_enter:

struct trace_event_raw_sys_enter {
    struct trace_entry ent;
    long int id;
    long unsigned int args[6];
    char __data[0];
};

In this struct, the args array stores the event's arguments in the same order as the fields referenced in print fmt in the format file from the third step. Therefore dfd is args[0], filename is args[1], and mode is args[2].
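The correspondence between args[n] and the offsets in the format file can be checked in user space. The sketch below mirrors the two structs under hypothetical names (trace_entry_mirror, sys_enter_mirror, and sys_enter_arg_offset are not kernel or libbpf identifiers). It assumes a 64-bit target where long is 8 bytes, models the entry header on the common_* fields of the format file, and leaves out the trailing __data member because it does not affect the offsets.

```c
#include <assert.h>
#include <stddef.h>

/* This sketch only makes sense where long is 8 bytes (LP64). */
_Static_assert(sizeof(long) == 8, "sketch assumes a 64-bit (LP64) target");

/* Mirror of the 8-byte event header described by the common_* fields. */
struct trace_entry_mirror {
    unsigned short type;         /* common_type,          offset 0, size 2 */
    unsigned char flags;         /* common_flags,         offset 2, size 1 */
    unsigned char preempt_count; /* common_preempt_count, offset 3, size 1 */
    int pid;                     /* common_pid,           offset 4, size 4 */
};

/* Mirror of struct trace_event_raw_sys_enter without the __data member. */
struct sys_enter_mirror {
    struct trace_entry_mirror ent; /* bytes 0..7 */
    long id;                       /* syscall number slot, bytes 8..15 */
    unsigned long args[6];         /* syscall arguments, from byte 16 on */
};

/* Byte offset of args[n] inside the event record. */
static size_t sys_enter_arg_offset(unsigned int n)
{
    assert(n < 6);
    return offsetof(struct sys_enter_mirror, args) + n * sizeof(unsigned long);
}
```

Under these assumptions args[0], args[1], and args[2] sit at offsets 16, 24, and 32, which are the offsets the format file reports for dfd, filename, and mode.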

With this information in hand, we can write the program. An example eBPF program for the sys_enter_fchmodat event looks like this:

SEC("tracepoint/syscalls/sys_enter_fchmodat")
int tracepoint__syscalls__sys_enter_fchmodat(struct trace_event_raw_sys_enter *ctx)
{
        // ...

        char *filename_ptr = (char *) BPF_CORE_READ(ctx, args[1]);
        bpf_core_read_user_str(&event->filename, sizeof(event->filename), filename_ptr);
        event->mode = BPF_CORE_READ(ctx, args[2]);

        // ...
}

Refer to the following links for complete examples using this method:

Manually Constructing the Parameter Structure

In addition to using the pre-defined structures in vmlinux.h, we can also customize a structure based on the content of the format file in the third step as the parameter of the eBPF program. For example, the contents of the file /sys/kernel/debug/tracing/events/syscalls/sys_enter_fchmodat/format for the sys_enter_fchmodat event are as follows:

$ sudo cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_fchmodat/format
name: sys_enter_fchmodat
ID: 647
format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;

        field:int __syscall_nr; offset:8;       size:4; signed:1;
        field:int dfd;  offset:16;      size:8; signed:0;
        field:const char * filename;    offset:24;      size:8; signed:0;
        field:umode_t mode;     offset:32;      size:8; signed:0;

print fmt: "dfd: 0x%08lx, filename: 0x%08lx, mode: 0x%08lx", ((unsigned long)(REC->dfd)), ((unsigned long)(REC->filename)), ((unsigned long)(REC->mode))

Based on this information, we can define the following structure as the parameter of the eBPF event handling function:

struct sys_enter_fchmodat_args {
    char _[16];
    long dfd;
    long filename_ptr;
    long mode;
};

In this structure, char _[16] stands for the first 16 bytes, which cover every field before dfd in the format file, including the padding between __syscall_nr and dfd. The dfd, filename_ptr, and mode members that the program wants to read are then defined one by one. The long type is used so that each member is 8 bytes, as the format file indicates. Field sizes differ from event to event, so adjust the types to the actual content of the format file. Other types also work, provided that the offset of each member matches the offset given in the format file.
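Because a wrong offset silently yields wrong data, it can help to let the compiler verify the layout. The following is a small sketch of that idea using other member types than long. The struct name sys_enter_fchmodat_args_checked is hypothetical, and the checks assume a 64-bit target where pointers and unsigned long are 8 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Variant of the custom context struct with more descriptive member types. */
struct sys_enter_fchmodat_args_checked {
    char pad[16];         /* common_* fields, __syscall_nr, and padding */
    long dfd;             /* format: offset 16, size 8 */
    const char *filename; /* format: offset 24, size 8 (user-space pointer) */
    unsigned long mode;   /* format: offset 32, size 8 */
};

/* Compile-time checks against the offsets printed in the format file. */
_Static_assert(offsetof(struct sys_enter_fchmodat_args_checked, dfd) == 16,
               "dfd must be at offset 16");
_Static_assert(offsetof(struct sys_enter_fchmodat_args_checked, filename) == 24,
               "filename must be at offset 24");
_Static_assert(offsetof(struct sys_enter_fchmodat_args_checked, mode) == 32,
               "mode must be at offset 32");
```

If a member type is later changed to something narrower, the assertions fail at compile time instead of the program reading the wrong bytes at run time.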

An example eBPF program for the sys_enter_fchmodat event that takes the manually constructed structure as its parameter looks like this:

SEC("tracepoint/syscalls/sys_enter_fchmodat")
int tracepoint__syscalls__sys_enter_fchmodat(struct sys_enter_fchmodat_args *ctx) {
    // ...

    char *filename_ptr = (char *)ctx->filename_ptr;
    bpf_core_read_user_str(&event->filename, sizeof(event->filename), filename_ptr);
    event->mode = (u32)ctx->mode;

    // ...
}

Refer to the following links for complete examples using this method:

