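Preface
This article tries to explain the difference between privileged and unprivileged container modes, and to use those differences to identify the scenarios in which privileged mode is strictly required to meet a business need.
Privileged mode
The CRI (Container Runtime Interface) describes privileged mode as follows: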
    // If set, run container in privileged mode.
    // Privileged mode is incompatible with the following options. If
    // privileged is set, the following features MAY have no effect:
    // 1. capabilities
    // 2. selinux_options
    // 4. seccomp
    // 5. apparmor
    //
    // Privileged mode implies the following specific options are applied:
    // 1. All capabilities are added.
    // 2. Sensitive paths, such as kernel module paths within sysfs, are not masked.
    // 3. Any sysfs and procfs mounts are mounted RW.
    // 4. AppArmor confinement is not applied.
    // 5. Seccomp restrictions are not applied.
    // 6. The device cgroup does not restrict access to any devices.
    // 7. All devices from the host's /dev are available within the container.
    // 8. SELinux restrictions are not applied (e.g. label=disabled).
Below we walk through each item with an example.
All capabilities are added
In normal mode, processes in the container can use only a limited set of Linux capabilities:

    $ docker run --rm -it r.j3ss.co/amicontained bash
    Capabilities:
        BOUNDING -> chown dac_override fowner fsetid kill setgid setuid setpcap net_bind_service net_raw sys_chroot mknod audit_write setfcap

In privileged mode, however, processes in the container can use all Linux capabilities:

    $ docker run --privileged --rm -it r.j3ss.co/amicontained bash
    Capabilities:
        BOUNDING -> chown dac_override dac_read_search fowner fsetid kill setgid setuid setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm block_suspend audit_read

The same result can be obtained in normal mode by passing --cap-add explicitly:

    $ docker run --cap-add=ALL --rm -it r.j3ss.co/amicontained bash
    Capabilities:
        BOUNDING -> chown dac_override dac_read_search fowner fsetid kill setgid setuid setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm block_suspend audit_read
Note that privileged mode makes all Linux capabilities available to the container (they are in the bounding set), but this does not mean that a given process actually holds them. For example, if the container is started as a non-root user, its processes by default start with an empty effective capability set, so even in privileged mode they cannot do things they have no permission to do:

    $ docker run --rm -it debian:buster chown 65534 /var/log/lastlog
    $ docker run -u 65534 --rm -it debian:buster chown 65534 /var/log/lastlog
    chown: changing ownership of '/var/log/lastlog': Operation not permitted
    $ docker run --privileged -u 65534 --rm -it debian:buster chown 65534 /var/log/lastlog
    chown: changing ownership of '/var/log/lastlog': Operation not permitted
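One way to see why the last `chown` fails is to compare the bounding and effective capability sets recorded in /proc/<pid>/status. The following is a minimal sketch rather than part of the original walkthrough: `has_cap` is a hypothetical helper name, and the capability number is assumed to match <linux/capability.h> on the kernel in use.

```shell
#!/bin/sh
# has_cap MASK BIT
# Succeeds when capability number BIT is set in the hexadecimal MASK,
# as printed in the CapBnd/CapEff lines of /proc/<pid>/status.
# (Hypothetical helper, for illustration only.)
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Capability number taken from <linux/capability.h>.
CAP_CHOWN=0

# Print the bounding and effective sets of this shell and report
# whether CAP_CHOWN is present in each of them.
if [ -r "/proc/$$/status" ]; then
    for field in CapBnd CapEff; do
        mask=$(sed -n "s/^$field:[[:space:]]*//p" "/proc/$$/status")
        [ -n "$mask" ] || continue   # field missing: skip it
        if has_cap "$mask" "$CAP_CHOWN"; then
            echo "$field=$mask includes CAP_CHOWN"
        else
            echo "$field=$mask lacks CAP_CHOWN"
        fi
    done
fi
```

Run as a non-root user in a privileged container, this would be expected to report CAP_CHOWN in CapBnd but not in CapEff. The masks can also be checked by hand: the default capability list shown earlier corresponds to the mask `00000000a80425fb`, so `has_cap 00000000a80425fb 21` (21 being CAP_SYS_ADMIN) fails, while it succeeds for the full mask `0000003fffffffff`.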
Sensitive paths, such as kernel module paths within sysfs, are not masked.
In normal mode, some sensitive kernel paths, for example certain files and directories under /proc, have to be hidden from the container. The runtime masks them by mounting something harmless on top of them: masked directories are covered by an empty read-only tmpfs, and masked files by a bind mount of the container's /dev/null, which is why those entries appear below as rw tmpfs mounts with the same options as /dev:

    $ docker run --rm -it debian:buster mount |grep '/proc.*tmpfs'
    tmpfs on /proc/acpi type tmpfs (ro,relatime)
    tmpfs on /proc/kcore type tmpfs (rw,nosuid,size=65536k,mode=755)
    tmpfs on /proc/keys type tmpfs (rw,nosuid,size=65536k,mode=755)
    tmpfs on /proc/timer_list type tmpfs (rw,nosuid,size=65536k,mode=755)
    tmpfs on /proc/sched_debug type tmpfs (rw,nosuid,size=65536k,mode=755)
    tmpfs on /proc/scsi type tmpfs (ro,relatime)

In privileged mode, these paths are no longer covered by tmpfs mounts:

    $ docker run --privileged --rm -it debian:buster mount |grep '/proc.*tmpfs'
    $
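The `grep '/proc.*tmpfs'` filter used above matches anywhere on the line. As a slightly stricter sketch, the same check can be done field by field with awk. The function name `masked_proc_paths` is hypothetical, and the input is assumed to be in the `<source> on <target> type <fstype> (<options>)` layout printed by `mount`.

```shell
#!/bin/sh
# masked_proc_paths: read `mount` output on stdin and print every mount
# point below /proc whose filesystem type is tmpfs.
# (Hypothetical helper name, for illustration only.)
masked_proc_paths() {
    awk '$2 == "on" && $4 == "type" && $5 == "tmpfs" &&
         index($3, "/proc/") == 1 { print $3 }'
}
```

A possible invocation is `docker run --rm debian:buster mount | masked_proc_paths`, which for the normal-mode container above should list paths such as /proc/acpi and /proc/kcore and print nothing in privileged mode.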
Any sysfs and procfs mounts are mounted RW.
In normal mode, parts of the kernel filesystems (sysfs, procfs) are mounted read-only in the container, to stop processes in the container from modifying kernel settings at will:

    $ docker run --rm -it debian:buster mount |grep '(ro'
    sysfs on /sys type sysfs (ro,nosuid,nodev,noexec,relatime)
    cgroup on /sys/fs/cgroup/systemd type cgroup (ro,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
    cgroup on /sys/fs/cgroup/memory type cgroup (ro,nosuid,nodev,noexec,relatime,memory)
    cgroup on /sys/fs/cgroup/rdma type cgroup (ro,nosuid,nodev,noexec,relatime,rdma)
    cgroup on /sys/fs/cgroup/cpuset type cgroup (ro,nosuid,nodev,noexec,relatime,cpuset)
    cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (ro,nosuid,nodev,noexec,relatime,net_cls,net_prio)
    cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (ro,nosuid,nodev,noexec,relatime,cpu,cpuacct)
    cgroup on /sys/fs/cgroup/hugetlb type cgroup (ro,nosuid,nodev,noexec,relatime,hugetlb)
    cgroup on /sys/fs/cgroup/freezer type cgroup (ro,nosuid,nodev,noexec,relatime,freezer)
    cgroup on /sys/fs/cgroup/blkio type cgroup (ro,nosuid,nodev,noexec,relatime,blkio)
    cgroup on /sys/fs/cgroup/perf_event type cgroup (ro,nosuid,nodev,noexec,relatime,perf_event)
    cgroup on /sys/fs/cgroup/devices type cgroup (ro,nosuid,nodev,noexec,relatime,devices)
    cgroup on /sys/fs/cgroup/pids type cgroup (ro,nosuid,nodev,noexec,relatime,pids)
    proc on /proc/bus type proc (ro,nosuid,nodev,noexec,relatime)
    proc on /proc/fs type proc (ro,nosuid,nodev,noexec,relatime)
    proc on /proc/irq type proc (ro,nosuid,nodev,noexec,relatime)
    proc on /proc/sys type proc (ro,nosuid,nodev,noexec,relatime)
    proc on /proc/sysrq-trigger type proc (ro,nosuid,nodev,noexec,relatime)
    tmpfs on /proc/acpi type tmpfs (ro,relatime)
    tmpfs on /proc/scsi type tmpfs (ro,relatime)
    tmpfs on /sys/firmware type tmpfs (ro,relatime)

In privileged mode, however, the kernel filesystems are no longer mounted read-only:

    $ docker run --privileged --rm -it debian:buster mount |grep '(ro'
    $
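Matching the literal text `(ro` would also match any option that merely begins with those letters. As a hedged alternative, the sketch below only accepts `ro` as the first complete option and prints the filesystem type next to each mount point. The function name `ro_mounts` is hypothetical and the input is assumed to be `mount` output.

```shell
#!/bin/sh
# ro_mounts: read `mount` output on stdin and print "<fstype> <mountpoint>"
# for every mount whose option list starts with the option "ro".
# (Hypothetical helper name, for illustration only.)
ro_mounts() {
    awk '$4 == "type" && $6 ~ /^\(ro[,)]/ { print $5, $3 }'
}
```

For example, `docker run --rm debian:buster mount | ro_mounts | sort` groups the read-only mounts by filesystem type, which makes it easier to see that they fall into the sysfs, cgroup, proc and tmpfs groups shown above.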
AppArmor confinement is not applied.
Seccomp restrictions are not applied.
In normal mode, a container can be hardened by configuring AppArmor or Seccomp security options (if none are configured, the container engine applies its own default profiles):

    $ docker run --rm -it r.j3ss.co/amicontained bash
    AppArmor Profile: unconfined
    Seccomp: filtering
    Blocked Syscalls (63):
        MSGRCV SYSLOG SETPGID SETSID USELIB USTAT SYSFS VHANGUP PIVOT_ROOT _SYSCTL ACCT SETTIMEOFDAY MOUNT UMOUNT2 SWAPON SWAPOFF REBOOT SETHOSTNAME SETDOMAINNAME IOPL IOPERM CREATE_MODULE INIT_MODULE DELETE_MODULE GET_KERNEL_SYMS QUERY_MODULE QUOTACTL NFSSERVCTL GETPMSG PUTPMSG AFS_SYSCALL TUXCALL SECURITY LOOKUP_DCOOKIE CLOCK_SETTIME VSERVER MBIND SET_MEMPOLICY GET_MEMPOLICY KEXEC_LOAD ADD_KEY REQUEST_KEY KEYCTL MIGRATE_PAGES UNSHARE MOVE_PAGES PERF_EVENT_OPEN FANOTIFY_INIT NAME_TO_HANDLE_AT OPEN_BY_HANDLE_AT SETNS PROCESS_VM_READV PROCESS_VM_WRITEV KCMP FINIT_MODULE KEXEC_FILE_LOAD BPF USERFAULTFD PREADV2 PWRITEV2 PKEY_MPROTECT PKEY_ALLOC PKEY_FREE

In privileged mode, the AppArmor and Seccomp configuration no longer takes effect:

    $ docker run --privileged --rm -it r.j3ss.co/amicontained bash
    AppArmor Profile: unconfined
    Seccomp: disabled

AppArmor is reported as unconfined in both outputs, presumably because AppArmor is not enabled on the host used for these examples, so only the Seccomp line changes.
In normal mode, AppArmor or Seccomp can also be disabled individually through the corresponding security options, for example --security-opt apparmor=unconfined or --security-opt seccomp=unconfined.
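The Seccomp state can also be read without amicontained, from the `Seccomp:` field of /proc/<pid>/status, where the kernel reports 0 (disabled), 1 (strict) or 2 (filter). The following is a small sketch of that lookup; the function name `seccomp_mode_name` is hypothetical.

```shell
#!/bin/sh
# seccomp_mode_name N: translate the numeric Seccomp field of
# /proc/<pid>/status into a readable name.
# (Hypothetical helper name, for illustration only.)
seccomp_mode_name() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "strict" ;;
        2) echo "filter" ;;
        *) echo "unknown" ;;
    esac
}

# Report the Seccomp mode of this shell, if the kernel exposes it.
if [ -r "/proc/$$/status" ]; then
    mode=$(sed -n 's/^Seccomp:[[:space:]]*//p' "/proc/$$/status")
    echo "Seccomp: $(seccomp_mode_name "$mode")"
fi
```

Inside a normal Docker container with the default profile this would be expected to print `Seccomp: filter`, and `Seccomp: disabled` in a privileged one, matching the amicontained output above.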
The device cgroup does not restrict access to any devices.
In default mode, the cgroup hierarchies are mounted read-only, so processes in the container cannot change any cgroup settings, including the device rules under /sys/fs/cgroup/devices:

    $ docker run --rm -it debian:buster mount | grep 'cgroup'
    tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
    cgroup on /sys/fs/cgroup/systemd type cgroup (ro,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
    cgroup on /sys/fs/cgroup/memory type cgroup (ro,nosuid,nodev,noexec,relatime,memory)
    cgroup on /sys/fs/cgroup/rdma type cgroup (ro,nosuid,nodev,noexec,relatime,rdma)
    cgroup on /sys/fs/cgroup/cpuset type cgroup (ro,nosuid,nodev,noexec,relatime,cpuset)
    cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (ro,nosuid,nodev,noexec,relatime,net_cls,net_prio)
    cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (ro,nosuid,nodev,noexec,relatime,cpu,cpuacct)
    cgroup on /sys/fs/cgroup/hugetlb type cgroup (ro,nosuid,nodev,noexec,relatime,hugetlb)
    cgroup on /sys/fs/cgroup/freezer type cgroup (ro,nosuid,nodev,noexec,relatime,freezer)
    cgroup on /sys/fs/cgroup/blkio type cgroup (ro,nosuid,nodev,noexec,relatime,blkio)
    cgroup on /sys/fs/cgroup/perf_event type cgroup (ro,nosuid,nodev,noexec,relatime,perf_event)
    cgroup on /sys/fs/cgroup/devices type cgroup (ro,nosuid,nodev,noexec,relatime,devices)
    cgroup on /sys/fs/cgroup/pids type cgroup (ro,nosuid,nodev,noexec,relatime,pids)

In privileged mode, the cgroup hierarchies are mounted read-write:

    $ docker run --privileged --rm -it debian:buster mount | grep 'cgroup'
    tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
    cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
    cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
    cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
    cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
    cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
    cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
    cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
    cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
    cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
    cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
    cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
    cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)

Strictly speaking, the mount mode shown here follows from sysfs mounts being read-write in privileged mode. The device cgroup itself is the per-container list of device nodes that may be opened or created, and in privileged mode that list allows every device.
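The mount output shows how the cgroup filesystems are mounted, while the device rules themselves live in the devices controller. On a cgroup v1 host such as the one above, they can be read from `devices.list`, where the single rule `a *:* rwm` means that every device is allowed. The sketch below checks for that rule; the function name `allows_all_devices` is hypothetical, and the path assumes the cgroup v1 layout visible in the mount output (cgroup v2 has no such file).

```shell
#!/bin/sh
# allows_all_devices: read the contents of a cgroup v1 devices.list on
# stdin and succeed if it contains the catch-all rule "a *:* rwm".
# (Hypothetical helper name, for illustration only.)
allows_all_devices() {
    grep -q '^a \*:\* rwm$'
}

# Assumed cgroup v1 path, as seen in the mount output above.
list=/sys/fs/cgroup/devices/devices.list
if [ -r "$list" ]; then
    if allows_all_devices < "$list"; then
        echo "device cgroup: all devices allowed"
    else
        echo "device cgroup: restricted, $(wc -l < "$list") rules"
    fi
fi
```

In a privileged container this would be expected to report that all devices are allowed, and in a normal container a short list of rules for devices such as /dev/null and /dev/zero.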
All devices from the host's /dev are available within the container.
In normal mode, the container's /dev does not contain the devices that are specific to the node's /dev:

    $ docker run --rm -it debian:buster ls /dev
    console  fd    mqueue  ptmx  random  stderr  stdout  urandom
    core     full  null    pts   shm     stdin   tty     zero

In privileged mode, the container's /dev also contains the entries from the node's /dev:

    $ docker run --privileged --rm -it debian:buster ls /dev
    autofs           mapper              stdin   tty25  tty44  tty63    vcsa1
    btrfs-control    mcelog              stdout  tty26  tty45  tty7     vcsa2
    bus              mem                 tty     tty27  tty46  tty8     vcsa3
    console          memory_bandwidth    tty0    tty28  tty47  tty9     vcsa4
    core             mqueue              tty1    tty29  tty48  ttyS0    vcsa5
    cpu              net                 tty10   tty3   tty49  ttyS1    vcsa6
    cpu_dma_latency  network_latency     tty11   tty30  tty5   ttyS2    vcsu
    cuse             network_throughput  tty12   tty31  tty50  ttyS3    vcsu1
    dri              null                tty13   tty32  tty51  uhid     vcsu2
    fb0              nvram               tty14   tty33  tty52  uinput   vcsu3
    fd               port                tty15   tty34  tty53  urandom  vcsu4
    full             ppp                 tty16   tty35  tty54  usbmon0  vcsu5
    fuse             ptmx                tty17   tty36  tty55  usbmon1  vcsu6
    hidraw0          ptp0                tty18   tty37  tty56  vcs      vda
    hpet             pts                 tty19   tty38  tty57  vcs1     vda1
    hwrng            random              tty2    tty39  tty58  vcs2     vfio
    infiniband       raw                 tty20   tty4   tty59  vcs3     vga_arbiter
    input            rtc0                tty21   tty40  tty6   vcs4     vhost-net
    kmsg             shm                 tty22   tty41  tty60  vcs5     vhost-vsock
    lightnvm         snapshot            tty23   tty42  tty61  vcs6     zero
    loop-control     stderr              tty24   tty43  tty62  vcsa
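To see exactly which entries privileged mode adds, the two listings can be compared. This is a minimal sketch with a hypothetical helper name, `only_in_second`; it assumes the names contain no whitespace or glob characters, which holds for the /dev entries shown above.

```shell
#!/bin/sh
# only_in_second FIRST SECOND
# FIRST and SECOND are whitespace-separated lists of names, for example
# the output of `ls /dev`. Prints, one per line, the names that appear
# in SECOND but not in FIRST.
# (Hypothetical helper name, for illustration only.)
only_in_second() {
    second=$2
    # Rebuild FIRST as " name1 name2 ... " so that every name is
    # delimited by single spaces on both sides.
    first=" "
    for name in $1; do
        first="$first$name "
    done
    for name in $second; do
        case "$first" in
            *" $name "*) ;;                  # also present in FIRST
            *) printf '%s\n' "$name" ;;      # only present in SECOND
        esac
    done
}
```

A possible invocation is `only_in_second "$(docker run --rm debian:buster ls /dev)" "$(docker run --privileged --rm debian:buster ls /dev)"`. The `-t` flag is left out on purpose, since a pseudo-terminal would add carriage returns to the captured names.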
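SELinux restrictions are not applied (e.g. label=disabled).
In privileged mode, the SELinux-related hardening is disabled.
In normal mode, SELinux labeling can also be disabled through the corresponding security option, for example --security-opt label=disable.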