A case in which tracing the execve system call with tracepoints does not accurately capture the system call parameters

Background¶
Reproduce the Problem Code¶
Reason¶
Verification¶
Solution¶
References¶

Background¶

Some time ago, a netizen consulted a problem he encountered in the ebpf channel on slack :

When writing an eBPF program that uses tracepoint/syscalls/sys_enter_execve to trace the execve system call, encountered a situation where there is a failure in reading the filename parameter of execve using bpf_probe_read_user_str in the recorded events, with an error code of -14/EFAULT. Especially when executing the tmux command, there will always be an event of a failed read:

ret = bpf_probe_read_user_str(event->args, ARGSIZE, (const char*)ctx->args[0]);
if (ret < 0) {
        bpf_printk("comm=%s XXX cannot read file name", tgid, pid, event->comm);
        return 0;
}

utempter-4031732 [027] d...1 3878624.643948: bpf_trace_printk: comm=tmux: server XXX cannot read file name
   <...>-4035723 [077] d...1 3878668.047954: bpf_trace_printk: comm=tmux: server XXX cannot read file name

This article will delve into the causes of this problem and propose a solution specifically designed for this situation.

Reproduce the Problem Code¶

Since the problem mentioned that the issue always occurs when executing the tmux command, and we can see in the output of bpf_printk that the process name for the corresponding event is utempter, we can try to write a simple program to reproduce the problem based on these two pieces of information.

After searching online for the relationship between tmux and utempter, and reading the relevant tmux source code, I learned the following information:

tmux uses the utempter_add_record function provided by utempter.h to create the required pseudo-terminal:
```
xasprintf(&cp, "tmux(%lu).%%%u", (long)getpid(), new_wp->id);
utempter_add_record(new_wp->fd, cp);
```

In the utempter_add_record function, it will first construct an args from the passed parameters and then invoke the execute_helper function:

#define UTEMPTER_DEFAULT_PATHNAME     LIBEXECDIR "/utempter/utempter"
static const char *utempter_pathname;

int utempter_add_record(int master_fd, const char *hostname)
  {
      const char *const args[] = {
          utempter_pathname ? : UTEMPTER_DEFAULT_PATHNAME,
          "add",
          hostname,
          0
      };
      int status = execute_helper(master_fd, args);
      // ...
  }

In the execute_helper function, the passed args will eventually be used to call the fork() and execv() functions to start a new process ( source code ):

static int execute_helper(int master_fd, const char *const argv[])
{
    // ...
    child = fork();
    if (!child) {
        do_child(master_fd, argv[0], (char *const *) argv);
    } else if (child < 0) {
    // ...
}
static void __attribute__((__noreturn__))
do_child(int master_fd, const char *path, char *const *argv)
{
    // ...
    execv(path, argv);
    print_dbg("execv: %s", strerror(errno));
    //...
}

Based on the above information, we can write the following program to reproduce this issue reproduce.c:

#define PATHNAME "/usr/bin/echo"

int main(int argc, char **argv) {
    const char *hostname = argv[0];
    const char *const args[] = {
        PATHNAME,
        "add",
        hostname,
        0
    };
    if (!fork()) {
        execv(args[0], (char *const *) args);
    }
}

Let's test the effectiveness of this program below:

Compile using GCC:
```
gcc -o reproduce reproduce.c
```

Trace the tracepoint/syscalls/sys_enter_execve event using bpftrace in another terminal:

sudo bpftrace -e 'tracepoint:syscalls:sys_enter_execve{printf("%s[%d]: %s\n", comm, pid, str(args->filename));}'

Run the example program:
```
./reproduce
```

At this time, the terminal running bpftrace will output results similar to the following:

bash[34882]: ./reproduce
reproduce[34883]:

In the previous results, we did not capture the value of the filename parameter used during the execution of the execve() system call by the reproducing program. From this, it can be seen that our program has reproduced the original issue.

Reason¶

Reminded by @ssfdust, someone in the bpftrace community has previously raised a similar issue: #2523 . In this issue, @fbs provided an explanation :

Like I said earlier, the data you're using isn't in memory yet. These static strings are compiled in and are not actually faulted into memory until they're accessed. The access won't happen until its read, which is after your bpftrace probe ran. BPF won't pull the data in so you get an EFAULT/-14.

By printing the values or just a random print of a constant string you pull the small amount of data into memory (as it goes by page, not by var) and then it works

That is:

In the sample program, when executing execve(), the file name string referred to by argv[0] is a static string. These static strings are stored in the .rodata section of the ELF (Executable and Linkable Format) file after compilation. They are loaded into memory only when the program accesses them, specifically when accessing the data stored in the .rodata section.
Based on point 1, the static string referenced by argv[0] in the sample program is only loaded into memory during the execution of the execve() system call. However, since our eBPF program runs before the execve() system call, attempting to read argv[0] using bpf_probe_read_user_str will fail and result in an EFAULT/-14 error code.
By actively loading strings into memory through printf and other methods, this problem can be resolved.

Summary from @jschwinger233 :

bpf_probe_read_user can only read data that has been loaded into memory (page-faulted into memory), and the above code has not read .rodata until the execve(2) time, thus not triggering a page fault. This can be verified using tracepoint:exceptions:page_fault*.

Verification¶

Add a printf statement¶

From the above, it can be seen that actually we only need to modify the program to proactively load the data in .rodata into memory before executing execve(), for example, using the printf function.

You can use the printf function to print argv[0], or you can directly print a static string:

if (!fork()) {
    printf("test\n");
    // or
    // printf("argv[0]: %s\n", args[0]);
    execv(args[0], (char *const *) args);
}

After verification, the modified program reproduce_printf1.c does allow the ebpf program to correctly read the value of filename.

Not using static strings¶

Since the issue is caused by the fact that argv[0] is a static string, you can try changing argv[0] to use a non-static string to see if the problem still exists. The modified content is as follows:

const char *const args[] = {
    argv[1],
    "add",
    hostname,
    0
};

After verification, the modified program reproduce_non_static.c can also allow the ebpf program to correctly read the value of filename.

I wonder if anyone has noticed that the value of argv[2] in our program reproducing the issue is actually a non-static string, so it can also be verified by reading the content of this data in the eBPF program. The expectation is to be able to read the content of this non-static string argv[2] without making any modifications:

Run the new bpftrace command:

sudo bpftrace -e 'tracepoint:syscalls:sys_enter_execve{
    printf("%s[%d]: filename: %s, argv[1]: %s, argv[2]: %s\n", comm, pid, str(args->filename), str(args->argv[1]), str(args->argv[2]));
}'

Run ./reproduce

The output example of the bpftrace command is as follows:

bash[36523]: filename: ./reproduce, argv[1]: , argv[2]:
reproduce[36524]: filename: , argv[1]: , argv[2]: ./reproduce

page fault¶

In modern Linux systems, when we run a binary executable file, the program loader (ELF executable loader) will use mmap for lazy loading based on the information defined in the ELF file, delaying the loading of some data. When the program actually tries to read these data, if the data being read has not been loaded into memory yet, it will trigger a page fault.

The data stored in the .rodata section of ELF files is exactly the data that will be lazily loaded.

.rodata¶

The .rodata section in an ELF file stores some read-only data, such as static strings. Let's first take a look at the data in the .rodata section of the compiled example program binary ELF file.

You can use the objdump command to view the contents:

$ objdump -s -j .rodata ./reproduce

./reproduce:     file format elf64-x86-64

Contents of section .rodata:
2000 01000200 2f757372 2f62696e 2f656368  ..../usr/bin/ech
2010 6f006164 6400                        o.add.

You can also use the pwntools library from the Python community for reading:

>>> from pwn import ELF
>>> elf = ELF('reproduce')
>>> elf.section('.rodata')
b'\x01\x00\x02\x00/usr/bin/echo\x00add\x00'

From the above output, it can be seen that it contains two static strings used in the source code.

Observing page fault¶

In the Linux system, when the kernel executes the execve() system call, it calls the kernel function strncpy_from_user (execve() -> getname() -> getname_flags(), strncpy_from_user()) to copy the string data contained in argv from user space to kernel space. This process involves reading user space data. If the user space data being read has not yet been loaded into memory, it will trigger a page fault.

We can perceive this event by observing exceptions/page_fault_kernel and exceptions/page_fault_user:

To simplify the verification process, we need to make a modification to the example code. Add a line of sleep(15) before executing the execv() function to ensure the program does not terminate too quickly (reproduce_sleep.c ):
```
if (!fork()) {
     sleep(15);
     execv(args[0], (char *const *) args);
}
```

Run the following bpftrace script in a terminal to observe page faults:

cat <<EOF | sudo bpftrace -
tracepoint:exceptions:page_fault_kernel /comm == "reproduce_sleep"/ {
    printf("[%s] page_fault_kernel: %s[%d] addr=%llx ip=%llx err=%lld\n",
            strftime("%M:%S", nsecs), comm, pid, args->address, args->ip, args->error_code);
}

tracepoint:exceptions:page_fault_user /comm == "reproduce_sleep"/ {
    printf("[%s] page_fault_user:   %s[%d] addr=%llx ip=%llx err=%lld\n",
            strftime("%M:%S", nsecs), comm, pid, args->address, args->ip, args->error_code);
}
EOF

Then run ./reproduce_sleep in another terminal.

View the memory mapping of the running reproduce_sleep process in the third terminal:

$ sudo cat /proc/41291/maps | grep reproduce_sleep
55b4d9e31000-55b4d9e32000 r--p 00000000 08:01 1634479                    /XXX/reproduce_sleep
55b4d9e32000-55b4d9e33000 r-xp 00001000 08:01 1634479                    /XXX/reproduce_sleep
55b4d9e33000-55b4d9e34000 r--p 00002000 08:01 1634479                    /XXX/reproduce_sleep
55b4d9e34000-55b4d9e35000 r--p 00002000 08:01 1634479                    /XXX/reproduce_sleep
55b4d9e35000-55b4d9e36000 rw-p 00003000 08:01 1634479                    /XXX/reproduce_sleep

From the above, it can be seen that the memory mapping starting address of this process is 55b4d9e31000, that is, 0x55b4d9e.

After the example program finishes running, the last two lines of the output of the bpftrace command will be similar to the following:
```
[47:40] page_fault_user:   reproduce_sleep[41290] addr=7f669db86a50 ip=7f669d9f8940 err=7
[47:55] page_fault_kernel: reproduce_sleep[41291] addr=55b4d9e33004 ip=ffffffff8e092a40 err=0
```
The 15-second gap between these two outputs confirms that the last page fault event was triggered when the execv() function was executed. Now let's locate the corresponding data based on the addr result in the event.
The value of the addr output of the last page_fault_kernel event above is 55b4d9e33004, which means the virtual memory address that triggered this page fault event is 0x55b4d9e33004.

Based on the recipe shared by @jschwinger233: real_addr = start_addr + (elf_address - section.Address + section.Offset) can be known:

0x55b4d9e33004 = 0x55b4d9e31000 + (elf_address - section.Address + section.Offset)
(elf_address - section.Address + section.Offset) = 0x55b4d9e33004 - 0x55b4d9e31000 = 8196 = 0x2004

Use the readelf command to read the virtual address, offset, and data size of the ELF .rodata section in the binary file reproduce_sleep:
```
$ readelf -S -W reproduce_sleep  | egrep '.rodata|Address'
  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
  [18] .rodata           PROGBITS        0000000000002000 002000 000016 00   A  0   0  4
```
From the above output, it can be seen that the virtual address of the .rodata section in this ELF file is 0000000000002000 or 0x2000, the offset is 002000 or 0x2000, the data size is 0x16, that is, the address range is 0x2000-0x2016.

Through 7 and 8, it can be known that the virtual memory address of the page fault event triggered when executing the execv() function is 0x55b4d9e33004, corresponding to the ELF address 0x2004

0x55b4d9e33004 = 0x55b4d9e31000 + (elf_address - section.Address + section.Offset)
(elf_address - section.Address + section.Offset) = 0x55b4d9e33004 - 0x55b4d9e31000 = 8196 = 0x2004
(elf_address - 0x2000 + 0x2000) = 0x2004
elf_address = 0x2004

This address is within the range of the .rodata address in the ELF file (0x2000-0x2016), corresponding to the static string /usr/bin/echo used in argv[0]:

$ objdump -s -j .rodata ./reproduce_sleep

./reproduce_sleep:     file format elf64-x86-64

Contents of section .rodata:
 2000 01000200 2f757372 2f62696e 2f656368  ..../usr/bin/ech
 2010 6f006164 6400                        o.add.

>>> elf = ELF('reproduce_sleep')
>>> elf.read(0x2004, 14)
b'/usr/bin/echo\x00'

$ gdb -ex 'disas/m main' -ex q reproduce_sleep | grep 'args\[\]' -A 13
9       const char *const args[] = {
   0x00000000000011d6 <+45>:        lea    0xe27(%rip),%rax        # 0x2004
   0x00000000000011dd <+52>:        mov    %rax,-0x30(%rbp)
   0x00000000000011e1 <+56>:        lea    0xe2a(%rip),%rax        # 0x2012
   0x00000000000011e8 <+63>:        mov    %rax,-0x28(%rbp)
   0x00000000000011ec <+67>:        mov    -0x38(%rbp),%rax
   0x00000000000011f0 <+71>:        mov    %rax,-0x20(%rbp)
   0x00000000000011f4 <+75>:        movq   $0x0,-0x18(%rbp)

10          PATHNAME,
11          "add",
12          hostname,
13          0
14      };

You can also confirm by printing the pointer address of args->filename in the eBPF program or bpftrace script. The value at this address will be consistent with the value of addr in the previous page fault event, both being 0x55b4d9e33004:
```
tracepoint:syscalls:sys_enter_execve {
    printf("%s[%d]: %p\n", comm, pid, args->filename);
}
```

BTW, you can trace the function information that triggers page fault events by using the perf trace -F all command.

Solution¶

There are many methods to solve this problem, one of which is to change to tracking the tracepoint/sched/sched_process_exec event to obtain the values of the filename and argv parameters in the execve() system call. Below is an example program corresponding to the implementation:

Example script for tracing the tracepoint/sched/sched_process_exec event using bpftrace(from @jschwinger233 ):

tracepoint:sched:sched_process_exec {
  $task=curtask;
  $arg_start=$task->mm->arg_start;
  $arg_end=$task->mm->arg_end;
  printf("%s[%d]: filename: %s, argv: %r\n", comm, pid, str(args->filename), buf(uptr($arg_start), $arg_end-$arg_start));
}

An example program that traces the tracepoint/sched/sched_process_exec event through eBPF C code: main.bpf.c