Frequently asked questions about using raw tracepoint with ebpf/libbpf programs

Preface¶

This article collects common questions about raw tracepoints that come up when writing eBPF/libbpf programs (such as programs of type BPF_PROG_TYPE_RAW_TRACEPOINT).

Types of eBPF Programs¶

This article focuses on the eBPF program type called BPF_PROG_TYPE_RAW_TRACEPOINT.

What events can be monitored by raw tracepoint¶

The events that a raw tracepoint can monitor are listed in the /sys/kernel/debug/tracing/available_events file.

The format of each line in the file is:

<category>:<name>

For example:

sched:sched_switch

However, raw tracepoint uses the value of <name> rather than the entire <category>:<name>, as described below.
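As a minimal sketch of that rule, the user-space helper below takes one entry in the <category>:<name> format and returns only the <name> part. The function name raw_tp_event_name is hypothetical and is not part of libbpf; the caller is assumed to have already stripped the trailing newline from the line read out of available_events.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return the <name> part of an available_events entry such as
 * "sched:sched_switch". The returned pointer aliases the input string.
 * Returns NULL if the entry has no ':' separator.
 * (raw_tp_event_name is a hypothetical helper name, not a libbpf API.)
 */
static const char *raw_tp_event_name(const char *entry)
{
    const char *colon = strchr(entry, ':');
    return colon ? colon + 1 : NULL;
}
```

For the entry sched:sched_switch, the helper returns a pointer to sched_switch, which is the value a raw tracepoint program uses.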

Format of SEC content¶

The SEC format for the raw tracepoint event is:

SEC("raw_tracepoint/<name>")

// for example:
// SEC("raw_tracepoint/sched_switch")

or:

SEC("raw_tp/<name>")

// for example:
// SEC("raw_tp/sched_switch")

The <name> values are those listed in the available_events file earlier.

SEC("raw_tp/xx") and SEC("raw_tracepoint/xx") are equivalent. Which one to use is a matter of personal preference.

There are two special cases:

The sys_enter_xxx events under the syscalls category are all represented by sys_enter: SEC("raw_tracepoint/sys_enter")
The sys_exit_xxx events under the syscalls category are all represented by sys_exit: SEC("raw_tracepoint/sys_exit")

That is, you can use the sys_enter and sys_exit events to monitor all system call events.
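The naming rules in this section can be summarized in a small user-space sketch that turns an available_events entry into the string passed to SEC(). The helper name raw_tp_section and its return convention are assumptions made for illustration; libbpf does not provide such a function.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build the SEC() string for an available_events entry.
 * Hypothetical helper for illustration; not part of libbpf.
 * Returns 0 on success, -1 on a malformed entry or a too-small buffer.
 */
static int raw_tp_section(const char *entry, char *out, size_t out_len)
{
    const char *colon = strchr(entry, ':');
    if (!colon || colon[1] == '\0')
        return -1;

    size_t cat_len = (size_t)(colon - entry);
    const char *name = colon + 1;

    /* Special cases: every syscalls:sys_enter_xxx / sys_exit_xxx event
     * is covered by the generic sys_enter / sys_exit raw tracepoint. */
    if (cat_len == strlen("syscalls") &&
        strncmp(entry, "syscalls", cat_len) == 0) {
        if (strncmp(name, "sys_enter_", 10) == 0)
            name = "sys_enter";
        else if (strncmp(name, "sys_exit_", 9) == 0)
            name = "sys_exit";
    }

    int n = snprintf(out, out_len, "raw_tracepoint/%s", name);
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

With this sketch, sched:sched_switch maps to raw_tracepoint/sched_switch, while syscalls:sys_enter_fchmodat maps to raw_tracepoint/sys_enter.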

How to determine the parameter type of the raw tracepoint event handler and get the corresponding kernel call parameters¶

Suppose that we want to monitor the fchmodat system call involved in the chmod command via raw tracepoint. Then, how do we determine the type of parameters of the event handler in ebpf, and how do we get the content of the corresponding fchmodat system call parameters? For example, get the name of the file to be operated on and the value of the permission mode to be operated on.

The first step is to find the raw tracepoint events that can be used for this system call. As mentioned earlier, you can use the sys_enter and sys_exit events to monitor all system call events.

The second step is to determine the argument type of the handler function. Every raw tracepoint handler receives a pointer to the same structure, bpf_raw_tracepoint_args.

struct bpf_raw_tracepoint_args {
    __u64 args[0];
};

The args array stores the information available for the event. Exactly what it contains depends on the event, which is what we determine in the third step.

The third step is to determine what information can be obtained from the event itself. Here is an example of sys_enter (taken from include/trace/events/syscalls.h, where most of the events are concentrated in the include/trace/events/ directory).

TRACE_EVENT_FN(sys_enter,
    TP_PROTO(struct pt_regs *regs, long id),
    TP_ARGS(regs, id),
    TP_STRUCT__entry(
        __field(    long,           id              )
        __array(    unsigned long,  args,   6       )
    ),
    TP_fast_assign(
        __entry->id = id;
        syscall_get_arguments(current, regs, __entry->args);
    ),
    TP_printk("NR %ld (%lx, %lx, %lx, %lx, %lx, %lx)",
          __entry->id,
          __entry->args[0], __entry->args[1], __entry->args[2],
          __entry->args[3], __entry->args[4], __entry->args[5]),
    syscall_regfunc, syscall_unregfunc
);

Where:

TP_PROTO(struct pt_regs *regs, long id) defines the information that can be read through args of bpf_raw_tracepoint_args, in the same order: args[0] is regs and args[1] is id. id is the id of the system call, and regs contains the arguments to the corresponding system call. You can filter by id to handle only fchmodat events (run the command ausyscall fchmodat to find the corresponding system call id).

After that, continue to get the corresponding system call parameters.
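As an alternative way to look up the id, the sketch below reads it from the C library headers on Linux. It assumes the program is compiled for the same architecture as the kernel being traced, because system call numbers differ between architectures. The function name fchmodat_syscall_nr is hypothetical.

```c
#include <sys/syscall.h>   /* SYS_fchmodat, from the libc/kernel headers */

/*
 * Return the fchmodat system call number for the architecture this
 * file is compiled for. On x86-64 this is 268; other architectures
 * use different numbers.
 */
static long fchmodat_syscall_nr(void)
{
    return SYS_fchmodat;
}
```

Printing the result of fchmodat_syscall_nr() on the target machine should give the same number as ausyscall fchmodat.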

The fchmodat system call has the following function definition:

int fchmodat(int dirfd, const char *pathname, mode_t mode, int flags);

Since regs is a pointer to struct pt_regs, the libbpf macros PT_REGS_PARM1_CORE(regs) through PT_REGS_PARM5_CORE(regs) return the first through fifth arguments. For fchmodat, PT_REGS_PARM1_CORE(regs) is dirfd, PT_REGS_PARM2_CORE(regs) is pathname, and PT_REGS_PARM3_CORE(regs) is mode. Note that on x86-64 the fourth system call argument is passed in a different register than the fourth function argument, so recent libbpf versions also provide PT_REGS_PARM4_CORE_SYSCALL for that case.
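To make the data flow concrete, here is a user-space mock of the decoding steps, assuming x86-64, where the first three arguments live in the di, si and dx registers. The mock_ structures and decode_fchmodat are hypothetical stand-ins for illustration only. A real eBPF program cannot dereference regs directly like this and has to go through the PT_REGS_PARMn_CORE macros and the bpf_core_read helpers.

```c
#include <stdint.h>

/* Mock of the x86-64 argument registers; the real struct pt_regs has more fields. */
struct mock_pt_regs {
    unsigned long di;   /* 1st argument: dirfd    */
    unsigned long si;   /* 2nd argument: pathname */
    unsigned long dx;   /* 3rd argument: mode     */
};

/* Mock of the context layout: args[0] = regs pointer, args[1] = syscall id. */
struct mock_raw_tp_args {
    uint64_t args[2];
};

/* Decoded fchmodat arguments (hypothetical helper type). */
struct fchmodat_call {
    int dirfd;
    const char *pathname;
    unsigned int mode;
};

/* Returns 1 and fills *out if the event is fchmodat (id 268 on x86-64), else 0. */
static int decode_fchmodat(const struct mock_raw_tp_args *ctx,
                           struct fchmodat_call *out)
{
    if (ctx->args[1] != 268)
        return 0;

    const struct mock_pt_regs *regs =
        (const struct mock_pt_regs *)(uintptr_t)ctx->args[0];

    out->dirfd = (int)regs->di;
    out->pathname = (const char *)regs->si;
    out->mode = (unsigned int)regs->dx;
    return 1;
}
```

The order of the checks mirrors the handler: filter on the id in args[1] first, and only then interpret args[0] as the register set.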

Once the information is determined, you can write the program. For example, the above example ebpf program that handles the fchmodat system call via the sys_enter event is as follows:

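#include "vmlinux.h"            // kernel types (u32, struct pt_regs); assumed generated with bpftool
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

char LICENSE[] SEC("license") = "GPL";  // bpf_trace_printk is GPL-only

SEC("raw_tracepoint/sys_enter")
int raw_tracepoint__sys_enter(struct bpf_raw_tracepoint_args *ctx)
{
    // args[1] is the syscall id (second TP_PROTO parameter)
    unsigned long syscall_id = ctx->args[1];
    if (syscall_id != 268)    // 268 is fchmodat on x86-64; skip other syscalls
        return 0;

    // args[0] is the pt_regs pointer (first TP_PROTO parameter)
    struct pt_regs *regs = (struct pt_regs *) ctx->args[0];

    char pathname[256];
    u32 mode;

    // second syscall argument: user-space pointer to the path string
    char *pathname_ptr = (char *) PT_REGS_PARM2_CORE(regs);
    bpf_core_read_user_str(&pathname, sizeof(pathname), pathname_ptr);

    // third syscall argument: permission mode
    mode = (u32) PT_REGS_PARM3_CORE(regs);

    char fmt[] = "fchmodat %s %d\n";
    bpf_trace_printk(fmt, sizeof(fmt), &pathname, mode);
    return 0;
}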

The full example code is available on GitHub:

The difference between raw tracepoint and tracepoint¶

The main difference is that a raw tracepoint does not pass a prepared context to the eBPF program the way a tracepoint does (for a tracepoint, the kernel constructs the appropriate parameter fields first). Instead, a raw tracepoint eBPF program accesses the raw parameters of the event.

Therefore, raw tracepoint usually performs slightly better than tracepoint (data from https://lwn.net/Articles/750569/).

samples/bpf/test_overhead performance on 1 cpu:

tracepoint     base   kprobe+bpf   tracepoint+bpf   raw_tracepoint+bpf
task_rename    1.1M   769K         947K             1.0M
urandom_read   789K   697K         750K             755K
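To read the table as relative overhead, each figure can be expressed as a percentage of the base column. The helper below is a small sketch of that arithmetic; percent_of_base is a hypothetical name, and the inputs are simply the table values in whatever unit the table uses.

```c
/*
 * Throughput of an instrumented run as a percentage of the
 * uninstrumented baseline (both values in the same unit).
 */
static double percent_of_base(double with_bpf, double base)
{
    return 100.0 * with_bpf / base;
}
```

For task_rename, this gives about 69.9% for kprobe+bpf (769K / 1.1M), 86.1% for tracepoint+bpf (947K / 1.1M) and 90.9% for raw_tracepoint+bpf (1.0M / 1.1M). For urandom_read, the figures are about 88.3%, 95.1% and 95.7% respectively, so the raw tracepoint stays closest to the baseline in both rows.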