The Difference Between Container Privileged Mode and Non-Privileged Mode

Introduction

This article explores the differences between privileged and non-privileged (standard) modes in containers, with the goal of identifying the scenarios where privileged mode is genuinely required to meet a specific business need.

Privileged Mode

The CRI (Container Runtime Interface) describes privileged mode as follows:

// If set, run container in privileged mode.
// Privileged mode is incompatible with the following options. If
// privileged is set, the following features MAY have no effect:
// 1. capabilities
// 2. selinux_options
// 4. seccomp
// 5. apparmor
//
// Privileged mode implies the following specific options are applied:
// 1. All capabilities are added.
// 2. Sensitive paths, such as kernel module paths within sysfs, are not masked.
// 3. Any sysfs and procfs mounts are mounted RW.
// 4. AppArmor confinement is not applied.
// 5. Seccomp restrictions are not applied.
// 6. The device cgroup does not restrict access to any devices.
// 7. All devices from the host's /dev are available within the container.
// 8. SELinux restrictions are not applied (e.g. label=disabled).

The sections below walk through the effect of each item with an example.

All capabilities are added

In the standard mode, the processes inside a container are restricted to a limited set of Linux capabilities.

$ docker run --rm -it  r.j3ss.co/amicontained bash

Capabilities:
    BOUNDING -> chown dac_override fowner fsetid kill setgid setuid setpcap net_bind_service net_raw sys_chroot mknod audit_write setfcap

On the other hand, processes inside containers operating in privileged mode have access to all Linux capabilities.

$ docker run --privileged --rm -it  r.j3ss.co/amicontained bash

Capabilities:
    BOUNDING -> chown dac_override dac_read_search fowner fsetid kill setgid setuid setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm block_suspend audit_read

The same capability set can also be granted in standard mode by passing --cap-add=ALL.

$ docker run --cap-add=ALL  --rm -it  r.j3ss.co/amicontained bash
Capabilities:
        BOUNDING -> chown dac_override dac_read_search fowner fsetid kill setgid setuid setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm block_suspend audit_read
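The capability lists above can also be read directly from the kernel, which reports them as hexadecimal bitmasks in the Cap* fields of /proc/<pid>/status. The following is a minimal sketch, not the method amicontained uses: cap_is_set is a hypothetical helper name, the bit numbers are taken from linux/capability.h, and the two masks are simply the bitmask encodings of the two lists shown above.

```shell
# cap_is_set MASK_HEX BIT
# Succeeds (exit status 0) when capability number BIT is set in MASK_HEX,
# a hex bitmask as printed in the Cap* fields of /proc/<pid>/status.
cap_is_set() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Bit numbers from linux/capability.h: CAP_SYS_CHROOT is 18, CAP_SYS_ADMIN is 21.
default_mask=00000000a80425fb   # the 14 default capabilities listed above
full_mask=0000003fffffffff      # bits 0..37, chown through audit_read

if cap_is_set "$default_mask" 21; then echo "default: sys_admin"; else echo "default: no sys_admin"; fi
if cap_is_set "$full_mask" 21; then echo "full: sys_admin"; else echo "full: no sys_admin"; fi
```

Inside a container, the mask to test can be obtained with grep CapBnd /proc/self/status.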

Note that having every capability available does not by itself authorize an action. The bounding set shown above is only an upper limit on what a process may hold. When a container is started as a non-root user, its process typically ends up with an empty effective capability set, so even in privileged mode it remains subject to the usual permission checks. For example, changing the owner of a root-owned file succeeds as root but fails for UID 65534 in both modes:

$ docker run --rm -it debian:buster chown 65534 /var/log/lastlog

$ docker run -u 65534 --rm -it debian:buster chown 65534 /var/log/lastlog
chown: changing ownership of '/var/log/lastlog': Operation not permitted

$ docker run --privileged -u 65534 --rm -it debian:buster chown 65534 /var/log/lastlog
chown: changing ownership of '/var/log/lastlog': Operation not permitted
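One way to confirm why the last two commands fail is to look at the effective capability mask (CapEff) of the process rather than the bounding set. The sketch below assumes the status text is piped in on stdin; effective_caps_empty is a hypothetical helper name.

```shell
# effective_caps_empty: reads the text of /proc/<pid>/status on stdin and
# succeeds when the CapEff (effective capabilities) mask is zero.
effective_caps_empty() {
    eff=$(awk '/^CapEff:/ { print $2 }')
    [ $(( 0x${eff:-0} )) -eq 0 ]
}
```

A possible use is docker run --privileged -u 65534 --rm debian:buster cat /proc/self/status | effective_caps_empty && echo "no effective capabilities". The expectation that this reports an empty set for a non-root user is an assumption about the default Docker setup and is worth checking on your own host.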

Sensitive paths, such as kernel module paths within sysfs, are not masked.

In standard mode, the container runtime masks certain sensitive kernel paths, such as specific files and directories under /proc, by mounting something over them so that the original content is hidden. In the output below, some of these mounts are read-only and others are read-write, but in every case the container sees a tmpfs mount instead of the real kernel interface.

$ docker run --rm -it debian:buster mount |grep '/proc.*tmpfs'
tmpfs on /proc/acpi type tmpfs (ro,relatime)
tmpfs on /proc/kcore type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/keys type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/timer_list type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/sched_debug type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/scsi type tmpfs (ro,relatime)

In privileged mode, these masking mounts are not applied, so the real paths are visible.

$ docker run --privileged --rm -it debian:buster mount |grep '/proc.*tmpfs'
$
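The grep above matches on the human-readable output of mount. A slightly stricter check can parse the mount table by field instead. The sketch below assumes input in the /proc/self/mounts format (device, mount point, filesystem type, options); masked_proc_paths is a hypothetical helper name.

```shell
# masked_proc_paths: reads mount-table text in /proc/self/mounts format on
# stdin and prints every mount point below /proc whose filesystem type is
# tmpfs, which is how the masking mounts appear in the output above.
masked_proc_paths() {
    awk '$2 ~ /^\/proc\// && $3 == "tmpfs" { print $2 }'
}
```

For example, docker run --rm debian:buster cat /proc/self/mounts | masked_proc_paths should list the masked paths, and the same pipeline with --privileged should print nothing if the behavior matches the transcripts above.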

Any sysfs and procfs mounts are mounted RW.

In standard mode, kernel file systems such as sysfs and parts of procfs are mounted read-only inside the container. This prevents processes in the container from changing kernel settings on the host.

$ docker run --rm -it debian:buster mount |grep '(ro'
sysfs on /sys type sysfs (ro,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/systemd type cgroup (ro,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/memory type cgroup (ro,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/rdma type cgroup (ro,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/cpuset type cgroup (ro,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (ro,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (ro,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (ro,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/freezer type cgroup (ro,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/blkio type cgroup (ro,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (ro,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/devices type cgroup (ro,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/pids type cgroup (ro,nosuid,nodev,noexec,relatime,pids)
proc on /proc/bus type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/fs type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/irq type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sys type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sysrq-trigger type proc (ro,nosuid,nodev,noexec,relatime)
tmpfs on /proc/acpi type tmpfs (ro,relatime)
tmpfs on /proc/scsi type tmpfs (ro,relatime)
tmpfs on /sys/firmware type tmpfs (ro,relatime)

In contrast, under privileged mode, these kernel file systems are not mounted as read-only.

$ docker run --privileged --rm -it debian:buster mount |grep '(ro'
$
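Because grep '(ro' depends on ro being the first mount option, a field-based check may be more robust. The following sketch assumes input in the /proc/self/mounts format and looks for ro anywhere in the option list; ro_mount_points is a hypothetical helper name.

```shell
# ro_mount_points: reads mount-table text in /proc/self/mounts format on
# stdin and prints the mount point of every entry whose comma-separated
# option list (field 4) contains the option "ro".
ro_mount_points() {
    awk '{
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++) {
            if (opts[i] == "ro") { print $2; break }
        }
    }'
}
```

It can be fed from a container with docker run --rm debian:buster cat /proc/self/mounts | ro_mount_points.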

AppArmor confinement is not applied.

Seccomp restrictions are not applied.

In standard mode, a container can be confined with AppArmor and seccomp profiles. If no profile is specified, the container engine typically applies its own default profiles.

$ docker run --rm -it  r.j3ss.co/amicontained bash
AppArmor Profile: unconfined
Seccomp: filtering
Blocked Syscalls (63):
        MSGRCV SYSLOG SETPGID SETSID USELIB USTAT SYSFS VHANGUP PIVOT_ROOT _SYSCTL ACCT SETTIMEOFDAY MOUNT UMOUNT2 SWAPON SWAPOFF REBOOT SETHOSTNAME SETDOMAINNAME IOPL IOPERM CREATE_MODULE INIT_MODULE DELETE_MODULE GET_KERNEL_SYMS QUERY_MODULE QUOTACTL NFSSERVCTL GETPMSG PUTPMSG AFS_SYSCALL TUXCALL SECURITY LOOKUP_DCOOKIE CLOCK_SETTIME VSERVER MBIND SET_MEMPOLICY GET_MEMPOLICY KEXEC_LOAD ADD_KEY REQUEST_KEY KEYCTL MIGRATE_PAGES UNSHARE MOVE_PAGES PERF_EVENT_OPEN FANOTIFY_INIT NAME_TO_HANDLE_AT OPEN_BY_HANDLE_AT SETNS PROCESS_VM_READV PROCESS_VM_WRITEV KCMP FINIT_MODULE KEXEC_FILE_LOAD BPF USERFAULTFD PREADV2 PWRITEV2 PKEY_MPROTECT PKEY_ALLOC PKEY_FREE

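In privileged mode, however, the AppArmor and seccomp profiles are not applied. In the outputs shown here the AppArmor profile is reported as unconfined in both modes, which suggests that AppArmor is not enabled on the test host, so the visible difference is in seccomp.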

$ docker run --privileged --rm -it  r.j3ss.co/amicontained bash
AppArmor Profile: unconfined
Seccomp: disabled

AppArmor and seccomp can also be turned off individually in standard mode, using --security-opt apparmor=unconfined and --security-opt seccomp=unconfined respectively.
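The seccomp state of a process can also be checked without an extra tool: the kernel exposes it in the Seccomp field of /proc/<pid>/status, where 0 means disabled, 1 means strict mode, and 2 means filter mode. The sketch below is a hypothetical helper (seccomp_mode) that translates that field; it is not necessarily how amicontained obtains its result.

```shell
# seccomp_mode: reads the text of /proc/<pid>/status on stdin and prints
# the seccomp mode of that process as a word.
seccomp_mode() {
    mode=$(awk '/^Seccomp:/ { print $2 }')
    case "$mode" in
        0) echo disabled ;;
        1) echo strict ;;
        2) echo filter ;;
        *) echo unknown ;;
    esac
}
```

For example: docker run --rm debian:buster cat /proc/self/status | seccomp_mode.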

The device cgroup does not restrict access to any devices.

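In standard mode, the cgroup hierarchies are mounted read-only inside the container, so its processes cannot modify cgroup settings, including the access rules of the device cgroup.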

$ docker run --rm -it debian:buster mount | grep 'cgroup'
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (ro,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/memory type cgroup (ro,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/rdma type cgroup (ro,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/cpuset type cgroup (ro,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (ro,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (ro,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (ro,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/freezer type cgroup (ro,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/blkio type cgroup (ro,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (ro,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/devices type cgroup (ro,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/pids type cgroup (ro,nosuid,nodev,noexec,relatime,pids)

In contrast, privileged mode mounts the cgroup hierarchies read-write.

$ docker run --privileged --rm -it debian:buster mount | grep 'cgroup'
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
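The mount options show whether the cgroup files are writable, but not which devices the device cgroup allows. On cgroup v1 hosts like the one above, the current allow list can be read from the devices.list file of the devices controller, where the single entry a *:* rwm means that all devices are permitted. The sketch below is a hypothetical helper (devices_unrestricted) built on that assumption; it does not apply to cgroup v2, which has no such file.

```shell
# devices_unrestricted: reads the content of a cgroup v1 devices.list file
# on stdin and succeeds when it contains the allow-all entry "a *:* rwm".
devices_unrestricted() {
    grep -qx 'a \*:\* rwm'
}
```

A possible check, assuming the path /sys/fs/cgroup/devices/devices.list inside the container: docker run --privileged --rm debian:buster cat /sys/fs/cgroup/devices/devices.list | devices_unrestricted && echo "all devices allowed".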

All devices from the host's /dev are available within the container.

In standard mode, the container's /dev directory contains only a minimal set of device nodes. The host's own devices are not visible.

$ docker run --rm -it debian:buster ls /dev
console  fd    mqueue  ptmx  random  stderr  stdout  urandom
core     full  null    pts   shm     stdin   tty     zero

In privileged mode, however, the container's /dev directory also contains the device nodes from the host's /dev directory, including disks such as vda below.

$ docker run --privileged --rm -it debian:buster ls /dev
autofs           mapper              stdin   tty25  tty44  tty63    vcsa1
btrfs-control    mcelog              stdout  tty26  tty45  tty7     vcsa2
bus              mem                 tty     tty27  tty46  tty8     vcsa3
console          memory_bandwidth    tty0    tty28  tty47  tty9     vcsa4
core             mqueue              tty1    tty29  tty48  ttyS0    vcsa5
cpu              net                 tty10   tty3   tty49  ttyS1    vcsa6
cpu_dma_latency  network_latency     tty11   tty30  tty5   ttyS2    vcsu
cuse             network_throughput  tty12   tty31  tty50  ttyS3    vcsu1
dri              null                tty13   tty32  tty51  uhid     vcsu2
fb0              nvram               tty14   tty33  tty52  uinput   vcsu3
fd               port                tty15   tty34  tty53  urandom  vcsu4
full             ppp                 tty16   tty35  tty54  usbmon0  vcsu5
fuse             ptmx                tty17   tty36  tty55  usbmon1  vcsu6
hidraw0          ptp0                tty18   tty37  tty56  vcs      vda
hpet             pts                 tty19   tty38  tty57  vcs1     vda1
hwrng            random              tty2    tty39  tty58  vcs2     vfio
infiniband       raw                 tty20   tty4   tty59  vcs3     vga_arbiter
input            rtc0                tty21   tty40  tty6   vcs4     vhost-net
kmsg             shm                 tty22   tty41  tty60  vcs5     vhost-vsock
lightnvm         snapshot            tty23   tty42  tty61  vcs6     zero
loop-control     stderr              tty24   tty43  tty62  vcsa
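A quick way to tell the two listings apart is to count block devices, since host disks (vda and vda1 in the listing above) are absent from the standard-mode listing. The sketch below uses a hypothetical helper, count_block_devices, that takes the directory to scan as its argument.

```shell
# count_block_devices DIR
# Prints the number of block device nodes found under DIR.
count_block_devices() {
    echo $(( $(find "$1" -type b 2>/dev/null | wc -l) ))
}
```

Run inside the container as count_block_devices /dev, this would be expected to print 0 in standard mode and a non-zero value in privileged mode on a host with disks. If only one device is needed, docker run also accepts --device to expose a single host device without full privileged mode.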

SELinux restrictions are not applied (e.g. label=disabled).

In privileged mode, SELinux confinement is not applied to the container.

SELinux labeling can likewise be turned off in standard mode by passing --security-opt label=disable.
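On an SELinux-enabled host, the effect can be observed from the security context of the container process, which is readable from /proc/self/attr/current. The sketch below extracts the type field from such a context string; selinux_type is a hypothetical helper name, and the type names mentioned afterwards are assumptions about the commonly used container-selinux policy, not something shown in this article.

```shell
# selinux_type CONTEXT
# Prints the type field (third colon-separated field) of an SELinux
# context string of the form user:role:type:level.
selinux_type() {
    printf '%s\n' "$1" | cut -d: -f3
}

selinux_type 'system_u:system_r:container_t:s0:c123,c456'
```

With the container-selinux policy, a confined container would typically report container_t and an unconfined one spc_t, but this depends on the host's policy.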

