容器特权模式与非特权模式的区别

前言

本文尝试解答容器特权模式和非特权模式的区别, 以及通过它们之间的区别找出哪些场景下必需使用特权模式才能实现业务需求。

特权模式

CRI(Container Runtime Interface) 中特权模式的说明如下:

// If set, run container in privileged mode.
// Privileged mode is incompatible with the following options. If
// privileged is set, the following features MAY have no effect:
// 1. capabilities
// 2. selinux_options
// 4. seccomp
// 5. apparmor
//
// Privileged mode implies the following specific options are applied:
// 1. All capabilities are added.
// 2. Sensitive paths, such as kernel module paths within sysfs, are not masked.
// 3. Any sysfs and procfs mounts are mounted RW.
// 4. AppArmor confinement is not applied.
// 5. Seccomp restrictions are not applied.
// 6. The device cgroup does not restrict access to any devices.
// 7. All devices from the host's /dev are available within the container.
// 8. SELinux restrictions are not applied (e.g. label=disabled).

下面我们将通过示例说明一下每一项的效果。

All capabilities are added

普通模式下容器内进程只可以使用有限的一些 linux capabilities:

$ docker run --rm -it  r.j3ss.co/amicontained bash

Capabilities:
    BOUNDING -> chown dac_override fowner fsetid kill setgid setuid setpcap net_bind_service net_raw sys_chroot mknod audit_write setfcap

但是,特权模式下的容器内进程可以使用所有的 linux capabilities:

$ docker run --privileged --rm -it  r.j3ss.co/amicontained bash

Capabilities:
    BOUNDING -> chown dac_override dac_read_search fowner fsetid kill setgid setuid setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm block_suspend audit_read

也可以通过手动自定义 --cap-add 参数的方式,在普通模式下实现类似的需求:

$ docker run --cap-add=ALL  --rm -it  r.j3ss.co/amicontained bash
Capabilities:
        BOUNDING -> chown dac_override dac_read_search fowner fsetid kill setgid setuid setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm block_suspend audit_read

BTW,特权模式下,容器内进程拥有使用所有的 linux capabilities 的能力,但是, 不表示进程就一定有使用某些 linux capabilities 的权限。比如,如果容器是以非 root 用户启动的, 就算它是以特权模式启动的容器,也不表示它就能够做一些无权限做的事情:

$ docker run --rm -it debian:buster chown 65534 /var/log/lastlog

$ docker run -u 65534 --rm -it debian:buster chown 65534 /var/log/lastlog
chown: changing ownership of '/var/log/lastlog': Operation not permitted

$ docker run --privileged -u 65534 --rm -it debian:buster chown 65534 /var/log/lastlog
chown: changing ownership of '/var/log/lastlog': Operation not permitted

Sensitive paths, such as kernel module paths within sysfs, are not masked.

普通模式下,部分内核模块路径比如 /proc 下的一些目录需要阻止写入、有些又需要允许读写, 这些文件目录将会以 tmpfs 文件系统的方式挂载到容器中,以实现目录 mask 的需求 (TODO: 待进一步更新说明):

$ docker run --rm -it debian:buster mount |grep '/proc.*tmpfs'
tmpfs on /proc/acpi type tmpfs (ro,relatime)
tmpfs on /proc/kcore type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/keys type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/timer_list type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/sched_debug type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/scsi type tmpfs (ro,relatime)

特权模式下,这些目录将不再以 tmpfs 文件系统的方式挂载:

$ docker run --privileged --rm -it debian:buster mount |grep '/proc.*tmpfs'
$

Any sysfs and procfs mounts are mounted RW.

普通模式下,部分内核文件系统(sysfs、procfs)会被以只读的方式挂载到容器中,以阻止容器内进程随意修改系统内核:

$ docker run --rm -it debian:buster mount |grep '(ro'
sysfs on /sys type sysfs (ro,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/systemd type cgroup (ro,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/memory type cgroup (ro,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/rdma type cgroup (ro,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/cpuset type cgroup (ro,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (ro,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (ro,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (ro,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/freezer type cgroup (ro,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/blkio type cgroup (ro,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (ro,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/devices type cgroup (ro,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/pids type cgroup (ro,nosuid,nodev,noexec,relatime,pids)
proc on /proc/bus type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/fs type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/irq type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sys type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sysrq-trigger type proc (ro,nosuid,nodev,noexec,relatime)
tmpfs on /proc/acpi type tmpfs (ro,relatime)
tmpfs on /proc/scsi type tmpfs (ro,relatime)
tmpfs on /sys/firmware type tmpfs (ro,relatime)

但是在特权模式下,内核文件系统将不再以只读的方式被挂载:

$ docker run --privileged --rm -it debian:buster mount |grep '(ro'
$

AppArmor confinement is not applied.

Seccomp restrictions are not applied.

普通模式下,可以通过配置 AppArmor 或 Seccomp 相关安全选项 (如果未配置的话,容器引擎默认也会启用一些对应的默认配置) 对容器进行加固:

$ docker run --rm -it  r.j3ss.co/amicontained bash
AppArmor Profile: unconfined
Seccomp: filtering
Blocked Syscalls (63):
        MSGRCV SYSLOG SETPGID SETSID USELIB USTAT SYSFS VHANGUP PIVOT_ROOT _SYSCTL ACCT SETTIMEOFDAY MOUNT UMOUNT2 SWAPON SWAPOFF REBOOT SETHOSTNAME SETDOMAINNAME IOPL IOPERM CREATE_MODULE INIT_MODULE DELETE_MODULE GET_KERNEL_SYMS QUERY_MODULE QUOTACTL NFSSERVCTL GETPMSG PUTPMSG AFS_SYSCALL TUXCALL SECURITY LOOKUP_DCOOKIE CLOCK_SETTIME VSERVER MBIND SET_MEMPOLICY GET_MEMPOLICY KEXEC_LOAD ADD_KEY REQUEST_KEY KEYCTL MIGRATE_PAGES UNSHARE MOVE_PAGES PERF_EVENT_OPEN FANOTIFY_INIT NAME_TO_HANDLE_AT OPEN_BY_HANDLE_AT SETNS PROCESS_VM_READV PROCESS_VM_WRITEV KCMP FINIT_MODULE KEXEC_FILE_LOAD BPF USERFAULTFD PREADV2 PWRITEV2 PKEY_MPROTECT PKEY_ALLOC PKEY_FREE

特权模式下,这些 AppArmor 或 Seccomp 相关配置将不再生效:

$ docker run --privileged --rm -it  r.j3ss.co/amicontained bash
AppArmor Profile: unconfined
Seccomp: disabled

普通模式下也可以通过对应的安全选项来禁用 AppArmor 或 Seccomp 特性。

The device cgroup does not restrict access to any devices.

默认模式下,只能以只读模式操作 cgroup

$ docker run --rm -it debian:buster mount | grep 'cgroup'
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (ro,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/memory type cgroup (ro,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/rdma type cgroup (ro,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/cpuset type cgroup (ro,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (ro,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (ro,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (ro,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/freezer type cgroup (ro,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/blkio type cgroup (ro,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (ro,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/devices type cgroup (ro,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/pids type cgroup (ro,nosuid,nodev,noexec,relatime,pids)

特权模式下,将可以对 cgroup 进行读写操作:

$ docker run --privileged --rm -it debian:buster mount | grep 'cgroup'
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)

All devices from the host's /dev are available within the container.

普通模式下,容器内 /dev 目录下看不到节点 /dev 目录下特有的 devices

# docker run --rm -it debian:buster ls /dev
console  fd    mqueue  ptmx  random  stderr  stdout  urandom
core     full  null    pts   shm     stdin   tty     zero

特权模式下,容器内的 /dev 目录会包含这些来自节点 /dev 目录下的那些内容:

$ docker run --privileged --rm -it debian:buster ls /dev
autofs           mapper              stdin   tty25  tty44  tty63    vcsa1
btrfs-control    mcelog              stdout  tty26  tty45  tty7     vcsa2
bus              mem                 tty     tty27  tty46  tty8     vcsa3
console          memory_bandwidth    tty0    tty28  tty47  tty9     vcsa4
core             mqueue              tty1    tty29  tty48  ttyS0    vcsa5
cpu              net                 tty10   tty3   tty49  ttyS1    vcsa6
cpu_dma_latency  network_latency     tty11   tty30  tty5   ttyS2    vcsu
cuse             network_throughput  tty12   tty31  tty50  ttyS3    vcsu1
dri              null                tty13   tty32  tty51  uhid     vcsu2
fb0              nvram               tty14   tty33  tty52  uinput   vcsu3
fd               port                tty15   tty34  tty53  urandom  vcsu4
full             ppp                 tty16   tty35  tty54  usbmon0  vcsu5
fuse             ptmx                tty17   tty36  tty55  usbmon1  vcsu6
hidraw0          ptp0                tty18   tty37  tty56  vcs      vda
hpet             pts                 tty19   tty38  tty57  vcs1     vda1
hwrng            random              tty2    tty39  tty58  vcs2     vfio
infiniband       raw                 tty20   tty4   tty59  vcs3     vga_arbiter
input            rtc0                tty21   tty40  tty6   vcs4     vhost-net
kmsg             shm                 tty22   tty41  tty60  vcs5     vhost-vsock
lightnvm         snapshot            tty23   tty42  tty61  vcs6     zero
loop-control     stderr              tty24   tty43  tty62  vcsa

SELinux restrictions are not applied (e.g. label=disabled).

特权模式下,SELinux 相关的安全加固配置将被禁用。

普通模式下也可以通过对应的安全选项来禁用 SELinux 特性。


Comments