Introduction¶
This article explores the differences between privileged and non-privileged modes in containers, with a focus on identifying the scenarios in which privileged mode is genuinely required to meet specific business needs.
Privileged Mode¶
The CRI (Container Runtime Interface) describes privileged mode as follows:
// If set, run container in privileged mode.
// Privileged mode is incompatible with the following options. If
// privileged is set, the following features MAY have no effect:
// 1. capabilities
// 2. selinux_options
// 4. seccomp
// 5. apparmor
//
// Privileged mode implies the following specific options are applied:
// 1. All capabilities are added.
// 2. Sensitive paths, such as kernel module paths within sysfs, are not masked.
// 3. Any sysfs and procfs mounts are mounted RW.
// 4. AppArmor confinement is not applied.
// 5. Seccomp restrictions are not applied.
// 6. The device cgroup does not restrict access to any devices.
// 7. All devices from the host's /dev are available within the container.
// 8. SELinux restrictions are not applied (e.g. label=disabled).
Next, let's delve into each item's effect through illustrative examples.
All capabilities are added¶
In the standard mode, the processes inside a container are restricted to a limited set of Linux capabilities.
$ docker run --rm -it r.j3ss.co/amicontained bash
Capabilities:
    BOUNDING -> chown dac_override fowner fsetid kill setgid setuid setpcap net_bind_service net_raw sys_chroot mknod audit_write setfcap
On the other hand, processes inside containers operating in privileged mode have access to all Linux capabilities.
$ docker run --privileged --rm -it r.j3ss.co/amicontained bash
Capabilities:
    BOUNDING -> chown dac_override dac_read_search fowner fsetid kill setgid setuid setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm block_suspend audit_read
In standard mode, the same capability set can be granted explicitly with the --cap-add parameter.
$ docker run --cap-add=ALL --rm -it r.j3ss.co/amicontained bash
Capabilities:
    BOUNDING -> chown dac_override dac_read_search fowner fsetid kill setgid setuid setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm block_suspend audit_read
Note that having every capability available in privileged mode does not by itself authorize every action. If a container is started as a non-root user, even in privileged mode, its processes still cannot perform operations that the user is not otherwise permitted to perform.
$ docker run --rm -it debian:buster chown 65534 /var/log/lastlog
$ docker run -u 65534 --rm -it debian:buster chown 65534 /var/log/lastlog
chown: changing ownership of '/var/log/lastlog': Operation not permitted
$ docker run --privileged -u 65534 --rm -it debian:buster chown 65534 /var/log/lastlog
chown: changing ownership of '/var/log/lastlog': Operation not permitted
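The failed chown calls are consistent with how the kernel tracks capabilities per process: the bounding set only limits what a process may acquire, while permission checks use the effective set, which is normally empty for a non-root process. The following is a minimal sketch for inspecting both sets from inside a container, assuming a Linux /proc filesystem. The helper name decode_caps and the CAP_NAMES table are made up for illustration, and the table only covers the 38 capabilities listed in this article.

```shell
# Capability names in kernel bit order (bit 0 = chown), limited to the
# capabilities shown in this article.
CAP_NAMES="chown dac_override dac_read_search fowner fsetid kill setgid setuid
setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw
ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct
sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease
audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm
block_suspend audit_read"

# decode_caps HEXMASK: print the names of the bits set in a capability mask.
# The body runs in a subshell so its variables do not leak into the caller.
decode_caps() (
  mask=$((0x$1))
  bit=0
  out=""
  for name in $CAP_NAMES; do
    if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
      out="${out:+$out }$name"
    fi
    bit=$((bit + 1))
  done
  echo "$out"
)

# Show the bounding and effective sets of the current process.
if [ -r /proc/self/status ]; then
  for field in CapBnd CapEff; do
    hex=$(awk -v f="$field:" '$1 == f { print $2 }' /proc/self/status)
    [ -n "$hex" ] && echo "$field: $(decode_caps "$hex")"
  done
fi
```

For instance, `decode_caps a80425fb` prints the fourteen default capabilities from the standard-mode listing. In a privileged container started with -u 65534, one would expect CapBnd to list every capability while CapEff stays empty, although that combination was not run for this sketch.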
Sensitive paths, such as kernel module paths within sysfs, are not masked.¶
In standard mode, certain sensitive kernel paths, such as specific entries under /proc, are masked so that the container cannot see their real contents. The masking is implemented by mounting tmpfs-backed filesystems over these paths inside the container; as the mount options show, some of these mounts are read-only (ro) and others are read-write (rw).
$ docker run --rm -it debian:buster mount |grep '/proc.*tmpfs'
tmpfs on /proc/acpi type tmpfs (ro,relatime)
tmpfs on /proc/kcore type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/keys type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/timer_list type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/sched_debug type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/scsi type tmpfs (ro,relatime)
In privileged mode, these paths are no longer masked with tmpfs mounts.
$ docker run --privileged --rm -it debian:buster mount |grep '/proc.*tmpfs'
$
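The grep pattern used in these commands matches any line that contains both strings. As a slightly stricter alternative, the sketch below checks the filesystem type and the mount point as separate fields. It assumes the `mount` output format shown in this section and mount points without spaces; the function name masked_proc_paths is hypothetical.

```shell
# masked_proc_paths: read `mount` output on stdin and print every mount point
# below /proc whose filesystem type is tmpfs.
masked_proc_paths() {
  awk '$2 == "on" && $4 == "type" && $5 == "tmpfs" && $3 ~ /^\/proc\// { print $3 }'
}

# Inside a container: report the result for the current mount table.
if command -v mount >/dev/null 2>&1; then
  masked=$(mount 2>/dev/null | masked_proc_paths)
  if [ -n "$masked" ]; then
    printf 'masked paths:\n%s\n' "$masked"
  else
    echo "no tmpfs mounts found under /proc"
  fi
fi
```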
Any sysfs and procfs mounts are mounted RW.¶
In standard mode, certain kernel file systems, such as sysfs and procfs, are mounted inside the container as read-only. This is to prevent processes within the container from making unauthorized changes to the system kernel.
$ docker run --rm -it debian:buster mount |grep '(ro'
sysfs on /sys type sysfs (ro,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/systemd type cgroup (ro,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/memory type cgroup (ro,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/rdma type cgroup (ro,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/cpuset type cgroup (ro,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (ro,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (ro,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (ro,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/freezer type cgroup (ro,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/blkio type cgroup (ro,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (ro,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/devices type cgroup (ro,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/pids type cgroup (ro,nosuid,nodev,noexec,relatime,pids)
proc on /proc/bus type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/fs type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/irq type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sys type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sysrq-trigger type proc (ro,nosuid,nodev,noexec,relatime)
tmpfs on /proc/acpi type tmpfs (ro,relatime)
tmpfs on /proc/scsi type tmpfs (ro,relatime)
tmpfs on /sys/firmware type tmpfs (ro,relatime)
In contrast, under privileged mode, these kernel file systems are not mounted as read-only.
$ docker run --privileged --rm -it debian:buster mount |grep '(ro'
$
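The `grep '(ro'` filter only matches when ro is the first mount option. A sketch of a more explicit check follows, reading the /proc/self/mounts format (device, mount point, type, options) and looking for the exact ro option on kernel filesystems only. The function name ro_kernel_mounts is an illustrative choice, not part of any tool.

```shell
# ro_kernel_mounts: read /proc/self/mounts-style lines on stdin and print the
# mount point and type of every sysfs, proc, or cgroup mount whose option list
# contains the exact token "ro".
ro_kernel_mounts() {
  awk '
    $3 == "sysfs" || $3 == "proc" || $3 == "cgroup" || $3 == "cgroup2" {
      n = split($4, opts, ",")
      for (i = 1; i <= n; i++) {
        if (opts[i] == "ro") { print $2, $3; break }
      }
    }'
}

# Inside a container: list the read-only kernel filesystems, if any.
if [ -r /proc/self/mounts ]; then
  ro_kernel_mounts < /proc/self/mounts
fi
```

Going by the listings in this section, the function should print several lines in a standard container and nothing in a privileged one.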
AppArmor confinement is not applied.¶
Seccomp restrictions are not applied.¶
In standard mode, containers can be secured by setting up AppArmor or Seccomp security options. If these are not configured manually, the container engine typically activates certain default configurations for enhanced security.
$ docker run --rm -it r.j3ss.co/amicontained bash
AppArmor Profile: unconfined
Seccomp: filtering
Blocked Syscalls (63):
    MSGRCV SYSLOG SETPGID SETSID USELIB USTAT SYSFS VHANGUP PIVOT_ROOT _SYSCTL ACCT SETTIMEOFDAY MOUNT UMOUNT2 SWAPON SWAPOFF REBOOT SETHOSTNAME SETDOMAINNAME IOPL IOPERM CREATE_MODULE INIT_MODULE DELETE_MODULE GET_KERNEL_SYMS QUERY_MODULE QUOTACTL NFSSERVCTL GETPMSG PUTPMSG AFS_SYSCALL TUXCALL SECURITY LOOKUP_DCOOKIE CLOCK_SETTIME VSERVER MBIND SET_MEMPOLICY GET_MEMPOLICY KEXEC_LOAD ADD_KEY REQUEST_KEY KEYCTL MIGRATE_PAGES UNSHARE MOVE_PAGES PERF_EVENT_OPEN FANOTIFY_INIT NAME_TO_HANDLE_AT OPEN_BY_HANDLE_AT SETNS PROCESS_VM_READV PROCESS_VM_WRITEV KCMP FINIT_MODULE KEXEC_FILE_LOAD BPF USERFAULTFD PREADV2 PWRITEV2 PKEY_MPROTECT PKEY_ALLOC PKEY_FREE
However, in privileged mode, these AppArmor or Seccomp configurations become ineffective.
$ docker run --privileged --rm -it r.j3ss.co/amicontained bash
AppArmor Profile: unconfined
Seccomp: disabled
Additionally, in standard mode, AppArmor and Seccomp can each be disabled through their respective security options, for example --security-opt apparmor=unconfined and --security-opt seccomp=unconfined in Docker.
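The Seccomp state can also be read directly from the kernel, without a helper image. The sketch below assumes a Linux /proc filesystem; it translates the numeric Seccomp field of /proc/self/status (0, 1, or 2) into a label and prints the current LSM label when one is exposed. The function name seccomp_mode_name is hypothetical.

```shell
# seccomp_mode_name N: translate the numeric "Seccomp:" field of
# /proc/<pid>/status into a readable label.
seccomp_mode_name() {
  case "$1" in
    0) echo "disabled" ;;
    1) echo "strict" ;;
    2) echo "filter" ;;
    *) echo "unknown" ;;
  esac
}

# Report the Seccomp mode of the current process.
if [ -r /proc/self/status ]; then
  mode=$(awk '$1 == "Seccomp:" { print $2 }' /proc/self/status)
  echo "Seccomp: $(seccomp_mode_name "$mode")"
fi

# When AppArmor (or another LSM) is active, the current label or profile is
# exposed here; reading may fail on hosts without one, hence the fallback.
if [ -r /proc/self/attr/current ]; then
  echo "LSM label: $(cat /proc/self/attr/current 2>/dev/null || echo unavailable)"
fi
```

Run inside the two containers from this section, the first line would be expected to read "filter" in standard mode and "disabled" in privileged mode, matching what amicontained reports.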
The device cgroup does not restrict access to any devices.¶
In standard mode, the cgroup hierarchies under /sys/fs/cgroup are mounted read-only, so the container cannot change its own cgroup settings, including those of the devices controller.
$ docker run --rm -it debian:buster mount | grep 'cgroup'
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (ro,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/memory type cgroup (ro,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/rdma type cgroup (ro,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/cpuset type cgroup (ro,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (ro,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (ro,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (ro,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/freezer type cgroup (ro,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/blkio type cgroup (ro,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (ro,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/devices type cgroup (ro,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/pids type cgroup (ro,nosuid,nodev,noexec,relatime,pids)
In contrast, in privileged mode the same hierarchies are mounted read-write.
$ docker run --privileged --rm -it debian:buster mount | grep 'cgroup'
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
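The mount options show whether the cgroup filesystems are writable, but not which devices the device cgroup permits. On cgroup v1, which the listings here indicate, the devices controller exposes its allow list in a devices.list file, where the entry `a *:* rwm` means every device is allowed. The sketch below checks for that entry. It assumes the container's own devices cgroup is visible at /sys/fs/cgroup/devices, and the function name devices_unrestricted is hypothetical; cgroup v2 hosts have no such file.

```shell
# devices_unrestricted: read a cgroup v1 devices.list on stdin and succeed if
# it contains the wildcard entry that allows all device types and numbers.
devices_unrestricted() {
  grep -Eq '^a[[:space:]]+\*:\*[[:space:]]+rwm$'
}

list=/sys/fs/cgroup/devices/devices.list   # assumed cgroup v1 location
if [ -r "$list" ]; then
  if devices_unrestricted < "$list"; then
    echo "device cgroup: all devices allowed"
  else
    echo "device cgroup: restricted to the entries below"
    cat "$list"
  fi
fi
```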
All devices from the host's /dev are available within the container.¶
In standard mode, the host's device nodes are not exposed to the container; the container's /dev directory holds only a minimal default set.
# docker run --rm -it debian:buster ls /dev
console  fd    mqueue  ptmx  random  stderr  stdout  urandom
core     full  null    pts   shm     stdin   tty     zero
However, in privileged mode, the container's /dev directory also contains the device nodes from the host's /dev directory.
$ docker run --privileged --rm -it debian:buster ls /dev
autofs           mapper              stdin  tty25  tty44  tty63    vcsa1
btrfs-control    mcelog              stdout tty26  tty45  tty7     vcsa2
bus              mem                 tty    tty27  tty46  tty8     vcsa3
console          memory_bandwidth    tty0   tty28  tty47  tty9     vcsa4
core             mqueue              tty1   tty29  tty48  ttyS0    vcsa5
cpu              net                 tty10  tty3   tty49  ttyS1    vcsa6
cpu_dma_latency  network_latency     tty11  tty30  tty5   ttyS2    vcsu
cuse             network_throughput  tty12  tty31  tty50  ttyS3    vcsu1
dri              null                tty13  tty32  tty51  uhid     vcsu2
fb0              nvram               tty14  tty33  tty52  uinput   vcsu3
fd               port                tty15  tty34  tty53  urandom  vcsu4
full             ppp                 tty16  tty35  tty54  usbmon0  vcsu5
fuse             ptmx                tty17  tty36  tty55  usbmon1  vcsu6
hidraw0          ptp0                tty18  tty37  tty56  vcs      vda
hpet             pts                 tty19  tty38  tty57  vcs1     vda1
hwrng            random              tty2   tty39  tty58  vcs2     vfio
infiniband       raw                 tty20  tty4   tty59  vcs3     vga_arbiter
input            rtc0                tty21  tty40  tty6   vcs4     vhost-net
kmsg             shm                 tty22  tty41  tty60  vcs5     vhost-vsock
lightnvm         snapshot            tty23  tty42  tty61  vcs6     zero
loop-control     stderr              tty24  tty43  tty62  vcsa
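With a listing this long, it can help to print only the entries that privileged mode adds. The sketch below compares two /dev listings given as whitespace-separated strings. The function name extra_devices and the file names in the usage comment are hypothetical.

```shell
# extra_devices STANDARD PRIVILEGED: each argument is a whitespace-separated
# listing of /dev; print the names that appear only in the second listing.
# The body runs in a subshell so `set -f` and the variables stay local.
extra_devices() (
  set -f                          # no glob expansion while word-splitting
  known=" $(echo $1) "            # normalize to single spaces, padded
  for name in $2; do
    case "$known" in
      *" $name "*) ;;             # present in both listings
      *) echo "$name" ;;
    esac
  done
)

# Possible usage on a Docker host (no -it, so the output is one name per line):
#   docker run --rm debian:buster ls /dev > standard.txt
#   docker run --privileged --rm debian:buster ls /dev > privileged.txt
#   extra_devices "$(cat standard.txt)" "$(cat privileged.txt)"
```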
SELinux restrictions are not applied (e.g. label=disabled).¶
In privileged mode, security hardening configurations related to SELinux are disabled.
Similarly, in standard mode, SELinux confinement can be turned off through the corresponding security option, for example --security-opt label=disable in Docker.