[runc Source Code Analysis] The runc spec Flow

DESC

NAME:
   docker-runc spec - create a new specification file

USAGE:
   docker-runc spec [command options] [arguments...]

DESCRIPTION:
   The spec command creates the new specification file named "config.json" for
the bundle.

 

    Usage example

    mkdir hello
    cd hello
    docker pull hello-world
    docker export $(docker create hello-world) > hello-world.tar
    mkdir rootfs
    tar -C rootfs -xf hello-world.tar
    runc spec
    sed -i 's;"sh";"/hello";' config.json
    runc run container1

 

The Spec struct

   The base configuration for running a container

// Spec is the base configuration for the container.
type Spec struct {
	// Version of the Open Container Initiative Runtime Specification with which the bundle complies.
	Version string `json:"ociVersion"`
	// Process configures the container process.
	Process *Process `json:"process,omitempty"`
	// Root configures the container's root filesystem.
	Root *Root `json:"root,omitempty"`
	// Hostname configures the container's hostname.
	Hostname string `json:"hostname,omitempty"`
	// Mounts configures additional mounts (on top of Root).
	Mounts []Mount `json:"mounts,omitempty"`
	// Hooks configures callbacks for container lifecycle events.
	Hooks *Hooks `json:"hooks,omitempty" platform:"linux,solaris"`
	// Annotations contains arbitrary metadata for the container.
	Annotations map[string]string `json:"annotations,omitempty"`

	// Linux is platform-specific configuration for Linux based containers.
	Linux *Linux `json:"linux,omitempty" platform:"linux"`
	// Solaris is platform-specific configuration for Solaris based containers.
	Solaris *Solaris `json:"solaris,omitempty" platform:"solaris"`
	// Windows is platform-specific configuration for Windows based containers.
	Windows *Windows `json:"windows,omitempty" platform:"windows"`
	// VM specifies configuration for virtual-machine-based containers.
	VM *VM `json:"vm,omitempty" platform:"vm"`
}
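
As a rough illustration of how the `omitempty` tags above shape the generated file, the sketch below marshals a trimmed local copy of the struct. The `Spec` and `Root` types here are cut-down stand-ins written for this example (the real ones live in the runtime-spec Go package), and the version string "1.0.0" is only a placeholder.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Root and Spec are trimmed local copies used only for this illustration.
type Root struct {
	Path     string `json:"path"`
	Readonly bool   `json:"readonly,omitempty"`
}

type Spec struct {
	Version  string `json:"ociVersion"`
	Root     *Root  `json:"root,omitempty"`
	Hostname string `json:"hostname,omitempty"`
}

// marshalSpec renders a spec as tab-indented JSON.
func marshalSpec(s *Spec) (string, error) {
	data, err := json.MarshalIndent(s, "", "\t")
	if err != nil {
		return "", err
	}
	return string(data), nil
}

func main() {
	// Hostname is empty, so omitempty drops it from the output.
	out, err := marshalSpec(&Spec{
		Version: "1.0.0", // placeholder value
		Root:    &Root{Path: "rootfs", Readonly: true},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Nil pointers and empty strings tagged with `omitempty` simply do not appear in the output, which is why a generated config.json contains only the sections that were filled in.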

 

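The LinuxCapabilities struct

     Capabilities come in two kinds: thread capabilities and file capabilities. When the kernel performs a permission check, what it ultimately tests is the thread's Effective set.

  • Effective: the set the kernel actually uses when performing capability checks for the thread.
  • Inheritable: when an executable file has capabilities in its inheritable set, calling execve on it intersects them with the caller's Inheritable set and adds the result to the new permitted set. When execve is called by a non-root user, inheritable capabilities are generally not preserved, so consider using the ambient set instead. Once a program drops a capability, it can regain it only by calling execve on a set-user-ID program or on a program whose file capabilities grant it.
  • Permitted: the limiting superset for the effective set; it also bounds what a thread without CAP_SETPCAP can add to its inheritable set. A capability that is not in the permitted set cannot be acquired through cap_set_proc. A thread that drops a capability from its permitted set cannot get it back by itself; the only way back is the execve route described above.
  • Ambient: introduced in Linux 4.3 to make up for the shortcomings of the inheritable set. The ambient set can be modified with prctl. It is cleared when executing a program that changes UID (GID) because of the set-user-ID (set-group-ID) bit, or a program that has file capabilities.
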
// LinuxCapabilities specifies the whitelist of capabilities that are kept for a process.
// http://man7.org/linux/man-pages/man7/capabilities.7.html
type LinuxCapabilities struct {
	// Bounding is the set of capabilities checked by the kernel.
	Bounding []string `json:"bounding,omitempty" platform:"linux"`
	// Effective is the set of capabilities checked by the kernel.
	Effective []string `json:"effective,omitempty" platform:"linux"`
	// Inheritable is the capabilities preserved across execve.
	Inheritable []string `json:"inheritable,omitempty" platform:"linux"`
	// Permitted is the limiting superset for effective capabilities.
	Permitted []string `json:"permitted,omitempty" platform:"linux"`
	// Ambient is the ambient set of capabilities that are kept.
	Ambient []string `json:"ambient,omitempty" platform:"linux"`
}
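
Since Permitted is the limiting superset for Effective, a configuration can be sanity-checked before use. The sketch below is one way to do that; `Capabilities` is a trimmed local stand-in for the struct above and `missingFromPermitted` is a hypothetical helper, not part of runc.

```go
package main

import "fmt"

// Capabilities is a trimmed local stand-in for specs.LinuxCapabilities.
type Capabilities struct {
	Effective []string
	Permitted []string
}

// missingFromPermitted returns the capabilities listed in Effective
// that are absent from Permitted.
func missingFromPermitted(c Capabilities) []string {
	permitted := make(map[string]bool, len(c.Permitted))
	for _, name := range c.Permitted {
		permitted[name] = true
	}
	var missing []string
	for _, name := range c.Effective {
		if !permitted[name] {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	c := Capabilities{
		Effective: []string{"CAP_KILL", "CAP_SYS_ADMIN"},
		Permitted: []string{"CAP_KILL", "CAP_AUDIT_WRITE"},
	}
	fmt.Println(missingFromPermitted(c)) // prints [CAP_SYS_ADMIN]
}
```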

 

1. Define specCommand

var specCommand = cli.Command{
	Name:      "spec",
	Usage:     "create a new specification file",
	ArgsUsage: "",
	Description: `The spec command creates the new specification file named "` + specConfig + `" for
the bundle.

 

2. Define the command-line options of runc spec

OPTIONS:
   --bundle value, -b value  path to the root of the bundle directory
   --rootless                generate a configuration for a rootless container

	Flags: []cli.Flag{
		cli.StringFlag{
			Name:  "bundle, b",
			Value: "",
			Usage: "path to the root of the bundle directory",
		},
		cli.BoolFlag{
			Name:  "rootless",
			Usage: "generate a configuration for a rootless container",
		},
	},
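
To see the two options in isolation, the sketch below parses the same flags with Go's standard `flag` package. This is a stand-in chosen so the example runs without third-party dependencies; runc itself declares them through the cli package shown above. `specOptions` and `parseSpecFlags` are names invented for this example.

```go
package main

import (
	"flag"
	"fmt"
)

// specOptions holds the two options accepted by "runc spec".
type specOptions struct {
	Bundle   string
	Rootless bool
}

// parseSpecFlags parses args with the standard flag package,
// registering both the long name and the short alias for the bundle path.
func parseSpecFlags(args []string) (specOptions, error) {
	var opts specOptions
	fs := flag.NewFlagSet("spec", flag.ContinueOnError)
	fs.StringVar(&opts.Bundle, "bundle", "", "path to the root of the bundle directory")
	fs.StringVar(&opts.Bundle, "b", "", "shorthand for --bundle")
	fs.BoolVar(&opts.Rootless, "rootless", false, "generate a configuration for a rootless container")
	if err := fs.Parse(args); err != nil {
		return specOptions{}, err
	}
	return opts, nil
}

func main() {
	opts, err := parseSpecFlags([]string{"--bundle", "/tmp/hello", "--rootless"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", opts) // prints {Bundle:/tmp/hello Rootless:true}
}
```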

 

3. Define the Action

   The main work is done by specconv.Example, which builds the spec that is then written out as the config.json configuration file

	Action: func(context *cli.Context) error {
		if err := checkArgs(context, 0, exactArgs); err != nil {
			return err
		}
		spec := specconv.Example()

		rootless := context.Bool("rootless")
		if rootless {
			specconv.ToRootless(spec)
		}

		checkNoFile := func(name string) error {
			_, err := os.Stat(name)
			if err == nil {
				return fmt.Errorf("File %s exists. Remove it first", name)
			}
			if !os.IsNotExist(err) {
				return err
			}
			return nil
		}
		bundle := context.String("bundle")
		if bundle != "" {
			if err := os.Chdir(bundle); err != nil {
				return err
			}
		}
		if err := checkNoFile(specConfig); err != nil {
			return err
		}
		data, err := json.MarshalIndent(spec, "", "\t")
		if err != nil {
			return err
		}
		return ioutil.WriteFile(specConfig, data, 0666)
	},

  3.1 Populate the Spec struct; the root filesystem defaults to the rootfs directory

// Example returns an example spec file, with many options set so a user can
// see what a standard spec file looks like.
func Example() *specs.Spec {
	return &specs.Spec{
		Version: specs.Version,
		Root: &specs.Root{
			Path:     "rootfs",
			Readonly: true,
		},

  3.2 Populate the Process struct of the Spec

   This includes the arguments, the environment variables, and the current working directory.

		Process: &specs.Process{
			Terminal: true,
			User:     specs.User{},
			Args: []string{
				"sh",
			},
			Env: []string{
				"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
				"TERM=xterm",
			},
			Cwd:             "/",
			NoNewPrivileges: true,

  3.3 Set the default capabilities

      At run time, docker can add capabilities to the container's initial process with --cap-add and remove them with --cap-drop

			Capabilities: &specs.LinuxCapabilities{
				Bounding: []string{
					"CAP_AUDIT_WRITE",
					"CAP_KILL",
					"CAP_NET_BIND_SERVICE",
				},
				Permitted: []string{
					"CAP_AUDIT_WRITE",
					"CAP_KILL",
					"CAP_NET_BIND_SERVICE",
				},
				Inheritable: []string{
					"CAP_AUDIT_WRITE",
					"CAP_KILL",
					"CAP_NET_BIND_SERVICE",
				},
				Ambient: []string{
					"CAP_AUDIT_WRITE",
					"CAP_KILL",
					"CAP_NET_BIND_SERVICE",
				},
				Effective: []string{
					"CAP_AUDIT_WRITE",
					"CAP_KILL",
					"CAP_NET_BIND_SERVICE",
				},
			},
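
The effect of adding and dropping capabilities on top of these defaults can be sketched as plain set arithmetic. This is an illustration only, not Docker's or runc's implementation: `adjustCaps` is a hypothetical helper, it uses full CAP_ names throughout, and it assumes that a drop wins when the same name is both added and dropped.

```go
package main

import (
	"fmt"
	"sort"
)

// adjustCaps returns defaults plus add minus drop, sorted and de-duplicated.
// Drops are applied last, so they win over adds (an assumption of this sketch).
func adjustCaps(defaults, add, drop []string) []string {
	set := make(map[string]bool)
	for _, c := range defaults {
		set[c] = true
	}
	for _, c := range add {
		set[c] = true
	}
	for _, c := range drop {
		delete(set, c)
	}
	out := make([]string, 0, len(set))
	for c := range set {
		out = append(out, c)
	}
	sort.Strings(out)
	return out
}

func main() {
	defaults := []string{"CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE"}
	fmt.Println(adjustCaps(defaults, []string{"CAP_NET_ADMIN"}, []string{"CAP_KILL"}))
	// prints [CAP_AUDIT_WRITE CAP_NET_ADMIN CAP_NET_BIND_SERVICE]
}
```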

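  3.4 Set Rlimits, the process resource limits

    Examples are the maximum size of a process's core file and the maximum size of its virtual memory.

  • The soft limit is the value the kernel actually enforces for the resource. For RLIMIT_NOFILE (the maximum number of files a process can open), the soft limit is commonly 1024 by default. For RLIMIT_CORE (the size of a core file), the soft limit can be as high as unlimited.
  • The hard limit only acts as the ceiling for the soft limit. Once a hard limit is set, the soft limit can only be set to a value no greater than it. The hard limit constrains unprivileged processes: they can lower it but cannot raise it again. A privileged process (one that has CAP_SYS_RESOURCE) can raise the hard limit, so it is bounded only by the kernel's own ceiling.
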
			Rlimits: []specs.POSIXRlimit{
				{
					Type: "RLIMIT_NOFILE",
					Hard: uint64(1024),
					Soft: uint64(1024),
				},
			},
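
To inspect the soft and hard values that actually apply to a running process, the limits can be read back with getrlimit. The sketch below does this for RLIMIT_NOFILE through Go's `syscall` package; `nofileLimits` is a name chosen for this example, and the values printed depend on the environment.

```go
package main

import (
	"fmt"
	"syscall"
)

// nofileLimits returns the soft and hard RLIMIT_NOFILE values
// of the current process.
func nofileLimits() (soft, hard uint64, err error) {
	var lim syscall.Rlimit
	if err = syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return 0, 0, err
	}
	return uint64(lim.Cur), uint64(lim.Max), nil
}

func main() {
	soft, hard, err := nofileLimits()
	if err != nil {
		panic(err)
	}
	// The soft limit never exceeds the hard limit.
	fmt.Printf("RLIMIT_NOFILE soft=%d hard=%d\n", soft, hard)
}
```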

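  3.4 Set the default Mounts

  • /proc: a pseudo (virtual) filesystem whose special files expose the current state of the running kernel; through them users can view information about the system hardware and the running processes
  • /dev: contains the device files for the devices used by the Linux system; these files are the interface for accessing the devices
  • /dev/pts: the directory holding the pseudo-terminal device files created for terminal sessions, for example after a remote login (telnet, ssh, and so on)
  • /dev/shm: lives in memory rather than on disk, so using /dev/shm is very fast because writes go straight to memory
  • /dev/mqueue: message queues are mounted under /dev/mqueue so that they are easy to inspect
  • /sys: the sysfs filesystem is always mounted at the /sys mount point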

    Destination      Type     Source   Options
    /proc            proc     proc     nil
    /dev             tmpfs    tmpfs    "nosuid", "strictatime", "mode=755", "size=65536k"
    /dev/pts         devpts   devpts   "nosuid", "noexec", "newinstance", "ptmxmode=0666", "mode=0620", "gid=5"
    /dev/shm         tmpfs    shm      "nosuid", "noexec", "nodev", "mode=1777", "size=65536k"
    /dev/mqueue      mqueue   mqueue   "nosuid", "noexec", "nodev"
    /sys             sysfs    sysfs    "nosuid", "noexec", "nodev", "ro"
    /sys/fs/cgroup   cgroup   cgroup   "nosuid", "noexec", "nodev", "relatime", "ro"

		Mounts: []specs.Mount{
			{
				Destination: "/proc",
				Type:        "proc",
				Source:      "proc",
				Options:     nil,
			},
			{
				Destination: "/dev",
				Type:        "tmpfs",
				Source:      "tmpfs",
				Options:     []string{"nosuid", "strictatime", "mode=755", "size=65536k"},
			},
			{
				Destination: "/dev/pts",
				Type:        "devpts",
				Source:      "devpts",
				Options:     []string{"nosuid", "noexec", "newinstance", "ptmxmode=0666", "mode=0620", "gid=5"},
			},
			{
				Destination: "/dev/shm",
				Type:        "tmpfs",
				Source:      "shm",
				Options:     []string{"nosuid", "noexec", "nodev", "mode=1777", "size=65536k"},
			},
			{
				Destination: "/dev/mqueue",
				Type:        "mqueue",
				Source:      "mqueue",
				Options:     []string{"nosuid", "noexec", "nodev"},
			},
			{
				Destination: "/sys",
				Type:        "sysfs",
				Source:      "sysfs",
				Options:     []string{"nosuid", "noexec", "nodev", "ro"},
			},
			{
				Destination: "/sys/fs/cgroup",
				Type:        "cgroup",
				Source:      "cgroup",
				Options:     []string{"nosuid", "noexec", "nodev", "relatime", "ro"},
			},
		},
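
The option strings above mix two kinds of entries: generic mount flags such as nosuid, and filesystem-specific settings such as mode=755, which mount(2) receives separately as a flags word and a data string. The sketch below shows one way such a list could be split. It is not runc's implementation: `splitOptions` and the small `knownFlags` table are written for this example, the table covers only the flags that appear in the defaults, and the numeric values are copied from the kernel's mount flag definitions.

```go
package main

import (
	"fmt"
	"strings"
)

// Mount flag values as defined by the Linux kernel (subset).
const (
	msRdonly      = 1
	msNosuid      = 2
	msNodev       = 4
	msNoexec      = 8
	msRelatime    = 1 << 21
	msStrictatime = 1 << 24
)

// knownFlags maps option names to mount flags; anything else is treated as data.
var knownFlags = map[string]uintptr{
	"ro":          msRdonly,
	"nosuid":      msNosuid,
	"nodev":       msNodev,
	"noexec":      msNoexec,
	"relatime":    msRelatime,
	"strictatime": msStrictatime,
}

// splitOptions separates generic mount flags from filesystem-specific data.
func splitOptions(options []string) (flags uintptr, data string) {
	var rest []string
	for _, o := range options {
		if f, ok := knownFlags[o]; ok {
			flags |= f
			continue
		}
		rest = append(rest, o)
	}
	return flags, strings.Join(rest, ",")
}

func main() {
	// The options of the default /dev mount.
	flags, data := splitOptions([]string{"nosuid", "strictatime", "mode=755", "size=65536k"})
	fmt.Printf("flags=%#x data=%q\n", flags, data) // prints flags=0x1000002 data="mode=755,size=65536k"
}
```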

  3.5 Set the Linux-specific configuration

		Linux: &specs.Linux{
			MaskedPaths: []string{
				"/proc/kcore",
				"/proc/latency_stats",
				"/proc/timer_list",
				"/proc/timer_stats",
				"/proc/sched_debug",
				"/sys/firmware",
				"/proc/scsi",
			},
			ReadonlyPaths: []string{
				"/proc/asound",
				"/proc/bus",
				"/proc/fs",
				"/proc/irq",
				"/proc/sys",
				"/proc/sysrq-trigger",
			},
			Resources: &specs.LinuxResources{
				Devices: []specs.LinuxDeviceCgroup{
					{
						Allow:  false,
						Access: "rwm",
					},
				},
			},
			Namespaces: []specs.LinuxNamespace{
				{
					Type: "pid",
				},
				{
					Type: "network",
				},
				{
					Type: "ipc",
				},
				{
					Type: "uts",
				},
				{
					Type: "mount",
				},
			},

 

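The Linux capability mechanism

     Linux defines several dozen capabilities (37 in the kernel version examined here; the count varies between kernel versions). The full list can be found in /usr/include/linux/capability.h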

  Transformation of capabilities during execve()

           P'(ambient) = (file is privileged) ? 0 : P(ambient)

           P'(permitted) = (P(inheritable) & F(inheritable)) |
                           (F(permitted) & cap_bset) | P'(ambient)

           P'(effective) = F(effective) ? P'(permitted) : P'(ambient)

           P'(inheritable) = P(inheritable)    [i.e., unchanged]

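      P   the capabilities of the process before the execve call
      P'  the capabilities of the process after the execve call
      F   the capabilities of the executable file
      cap_bset  the capability bounding set; here all of its bits are assumed to be set by default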
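
The transformation rules above can be written directly as bitmask operations. The sketch below does so under a few stated simplifications: each set is a 64-bit mask, whether the file counts as privileged is passed in as a plain boolean, and the type and function names (`CapSet`, `Proc`, `File`, `execveTransform`) are invented for this example. The bit positions used for CAP_KILL (5) and CAP_NET_BIND_SERVICE (10) follow capability.h.

```go
package main

import "fmt"

// CapSet is a capability set stored as a bitmask.
type CapSet uint64

const (
	capKill           CapSet = 1 << 5  // CAP_KILL
	capNetBindService CapSet = 1 << 10 // CAP_NET_BIND_SERVICE
)

// Proc holds the thread capability sets (P and P' in the rules above).
type Proc struct {
	Permitted, Inheritable, Effective, Ambient CapSet
}

// File holds the file capability sets (F). Effective is a single bit.
// Privileged is supplied by the caller as a simplification.
type File struct {
	Permitted, Inheritable CapSet
	Effective              bool
	Privileged             bool
}

// execveTransform applies the four rules in the order they are listed.
func execveTransform(p Proc, f File, bset CapSet) Proc {
	var next Proc
	if !f.Privileged {
		next.Ambient = p.Ambient
	}
	next.Permitted = (p.Inheritable & f.Inheritable) | (f.Permitted & bset) | next.Ambient
	if f.Effective {
		next.Effective = next.Permitted
	} else {
		next.Effective = next.Ambient
	}
	next.Inheritable = p.Inheritable // unchanged
	return next
}

func main() {
	p := Proc{Inheritable: capKill, Ambient: capKill}
	allBits := ^CapSet(0) // bounding set with every bit set

	// Plain binary without file capabilities: the ambient set carries over.
	fmt.Printf("%+v\n", execveTransform(p, File{}, allBits))

	// Binary with file capabilities: the ambient set is cleared.
	f := File{Permitted: capNetBindService, Effective: true, Privileged: true}
	fmt.Printf("%+v\n", execveTransform(p, f, allBits))
}
```

In the first call the process keeps CAP_KILL in its permitted and effective sets purely through the ambient set. In the second call the file's own permitted set decides the outcome and CAP_KILL survives only in the inheritable set.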

 

    Linux capability        Description
    CAP_AUDIT_CONTROL       Enable and disable kernel auditing; change auditing filter rules; retrieve auditing status and filtering rules. (Auditing covers recording file changes, recording user reads and writes of files, and even recording system calls and file change notifications.)
    CAP_AUDIT_WRITE         Write records to kernel auditing log
    CAP_KILL                Bypass permission checks for sending signals (see kill(2)). This includes use of the ioctl(2) KDSIGACCEPT operation.
    CAP_NET_BIND_SERVICE    Bind a socket to Internet domain privileged ports (port numbers less than 1024)
   

 

 
