NAME:
docker-runc spec - create a new specification fileUSAGE:
docker-runc spec [command options] [arguments...]DESCRIPTION:
The spec command creates the new specification file named "config.json" for
the bundle.
使用示例
mkdir hello
cd hello
docker pull hello-world
docker export $(docker create hello-world) > hello-world.tar
mkdir rootfs
tar -C rootfs -xf hello-world.tar
runc spec
sed -i 's;"sh";"/hello";' config.json
runc run container1
容器运行的基本配置
// Spec is the base configuration for the container.
type Spec struct {
// Version of the Open Container Initiative Runtime Specification with which the bundle complies.
Version string `json:"ociVersion"`
// Process configures the container process.
Process *Process `json:"process,omitempty"`
// Root configures the container's root filesystem.
Root *Root `json:"root,omitempty"`
// Hostname configures the container's hostname.
Hostname string `json:"hostname,omitempty"`
// Mounts configures additional mounts (on top of Root).
Mounts []Mount `json:"mounts,omitempty"`
// Hooks configures callbacks for container lifecycle events.
Hooks *Hooks `json:"hooks,omitempty" platform:"linux,solaris"`
// Annotations contains arbitrary metadata for the container.
Annotations map[string]string `json:"annotations,omitempty"`
// Linux is platform-specific configuration for Linux based containers.
Linux *Linux `json:"linux,omitempty" platform:"linux"`
// Solaris is platform-specific configuration for Solaris based containers.
Solaris *Solaris `json:"solaris,omitempty" platform:"solaris"`
// Windows is platform-specific configuration for Windows based containers.
Windows *Windows `json:"windows,omitempty" platform:"windows"`
// VM specifies configuration for virtual-machine-based containers.
VM *VM `json:"vm,omitempty" platform:"vm"`
}
capabilities可以分为线程capabilities和文件capabilities,而Linux内核最终检查的是进程能力中的Effective
- Effective:内核进行线程capabilities检查时实际使用到的集合
- Inheritable:可执行文件设置了inheritable bit位时,调用execve执行该程序会继承调用者的Inheritable集合,并将其加入到permitted集合。在非root用户下执行execve时,通常不会保留inheritable 集合,可以考虑使用ambient 集合,当一个程序drop掉一个capabilities时,只能通过execve执行SUID置位的程序或者程序的文件带有该capabilities的方式来获得该capabilities
- permitted:effective集合和inheritable集合的超集,限制了它们的范围,如果一个capabilities不存在permitted中,不可以通过cap_set_proc来获取的。一个进程在Permitted集合中丢失一个能力,它无论如何不能再次获取该能力(除非特权用户再次赋予它)
- ambient :内核4.3之后引入的,补充Inheritable使用上的缺陷,ambien集合可以使用函数prctl修改。当程序由于SUID(SGID)bit位而转变UID(GID),或执行带有文件capabilities的程序时会导致该集合被清空
// LinuxCapabilities specifies the whitelist of capabilities that are kept for a process.
// http://man7.org/linux/man-pages/man7/capabilities.7.html
type LinuxCapabilities struct {
// Bounding is the set of capabilities checked by the kernel.
Bounding []string `json:"bounding,omitempty" platform:"linux"`
// Effective is the set of capabilities checked by the kernel.
Effective []string `json:"effective,omitempty" platform:"linux"`
// Inheritable is the capabilities preserved across execve.
Inheritable []string `json:"inheritable,omitempty" platform:"linux"`
// Permitted is the limiting superset for effective capabilities.
Permitted []string `json:"permitted,omitempty" platform:"linux"`
// Ambient is the ambient set of capabilities that are kept.
Ambient []string `json:"ambient,omitempty" platform:"linux"`
}
var specCommand = cli.Command{
Name: "spec",
Usage: "create a new specification file",
ArgsUsage: "",
Description: `The spec command creates the new specification file named "` + specConfig + `" for
the bundle.
OPTIONS:
--bundle value, -b value path to the root of the bundle directory
--rootless generate a configuration for a rootless container
Flags: []cli.Flag{
cli.StringFlag{
Name: "bundle, b",
Value: "",
Usage: "path to the root of the bundle directory",
},
cli.BoolFlag{
Name: "rootless",
Usage: "generate a configuration for a rootless container",
},
},
主要函数在specconv.Example,生成config.json配置文件
Action: func(context *cli.Context) error {
if err := checkArgs(context, 0, exactArgs); err != nil {
return err
}
spec := specconv.Example()
rootless := context.Bool("rootless")
if rootless {
specconv.ToRootless(spec)
}
checkNoFile := func(name string) error {
_, err := os.Stat(name)
if err == nil {
return fmt.Errorf("File %s exists. Remove it first", name)
}
if !os.IsNotExist(err) {
return err
}
return nil
}
bundle := context.String("bundle")
if bundle != "" {
if err := os.Chdir(bundle); err != nil {
return err
}
}
if err := checkNoFile(specConfig); err != nil {
return err
}
data, err := json.MarshalIndent(spec, "", "\t")
if err != nil {
return err
}
return ioutil.WriteFile(specConfig, data, 0666)
},
// Example returns an example spec file, with many options set so a user can
// see what a standard spec file looks like.
func Example() *specs.Spec {
return &specs.Spec{
Version: specs.Version,
Root: &specs.Root{
Path: "rootfs",
Readonly: true,
},
包括环境变量,当前工作目录,
Process: &specs.Process{
Terminal: true,
User: specs.User{},
Args: []string{
"sh",
},
Env: []string{
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"TERM=xterm",
},
Cwd: "/",
NoNewPrivileges: true,
docker可以在run的时候使用--cap-add为容器的初始进程添加capabilities,--cap-drop移除capabilities
Capabilities: &specs.LinuxCapabilities{
Bounding: []string{
"CAP_AUDIT_WRITE",
"CAP_KILL",
"CAP_NET_BIND_SERVICE",
},
Permitted: []string{
"CAP_AUDIT_WRITE",
"CAP_KILL",
"CAP_NET_BIND_SERVICE",
},
Inheritable: []string{
"CAP_AUDIT_WRITE",
"CAP_KILL",
"CAP_NET_BIND_SERVICE",
},
Ambient: []string{
"CAP_AUDIT_WRITE",
"CAP_KILL",
"CAP_NET_BIND_SERVICE",
},
Effective: []string{
"CAP_AUDIT_WRITE",
"CAP_KILL",
"CAP_NET_BIND_SERVICE",
},
},
比如进程的core file的最大值,虚拟内存的最大值等
- soft limit是指内核所能支持的资源上限。比如对于RLIMIT_NOFILE(一个进程能打开的最大文件 数,内核默认是1024),soft limit最大也只能达到1024。对于RLIMIT_CORE(core文件的大小,内核不做限制),soft limit最大能是unlimited。
- hard limit在资源中只是作为soft limit的上限。当你设置hard limit后,你以后设置的soft limit只能小于hard limit。要说明的是,hard limit只针对非特权进程,也就是进程的有效用户ID(effective user ID)不是0的进程。具有特权级别的进程(具有属性CAP_SYS_RESOURCE),soft limit则只有内核上限
Rlimits: []specs.POSIXRlimit{
{
Type: "RLIMIT_NOFILE",
Hard: uint64(1024),
Soft: uint64(1024),
},
},
Destination | Type | source | Options |
/proc |
proc |
proc |
nil |
/dev |
tmpfs |
tmpfs |
"nosuid", "strictatime", "mode=755", "size=65536k" |
/dev/pts |
devpts |
devpts |
"nosuid", "noexec", "newinstance", "ptmxmode=0666", "mode=0620", "gid=5" |
/dev/shm |
tmpfs |
shm |
"nosuid", "noexec", "nodev", "mode=1777", "size=65536k" |
/dev/mqueue |
mqueue |
mqueue |
"nosuid", "noexec", "nodev" |
/sys |
sysfs |
sysfs |
"nosuid", "noexec", "nodev", "ro" |
/sys/fs/cgroup |
cgroup |
cgroup |
"nosuid", "noexec", "nodev", "relatime", "ro" |
Mounts: []specs.Mount{
{
Destination: "/proc",
Type: "proc",
Source: "proc",
Options: nil,
},
{
Destination: "/dev",
Type: "tmpfs",
Source: "tmpfs",
Options: []string{"nosuid", "strictatime", "mode=755", "size=65536k"},
},
{
Destination: "/dev/pts",
Type: "devpts",
Source: "devpts",
Options: []string{"nosuid", "noexec", "newinstance", "ptmxmode=0666", "mode=0620", "gid=5"},
},
{
Destination: "/dev/shm",
Type: "tmpfs",
Source: "shm",
Options: []string{"nosuid", "noexec", "nodev", "mode=1777", "size=65536k"},
},
{
Destination: "/dev/mqueue",
Type: "mqueue",
Source: "mqueue",
Options: []string{"nosuid", "noexec", "nodev"},
},
{
Destination: "/sys",
Type: "sysfs",
Source: "sysfs",
Options: []string{"nosuid", "noexec", "nodev", "ro"},
},
{
Destination: "/sys/fs/cgroup",
Type: "cgroup",
Source: "cgroup",
Options: []string{"nosuid", "noexec", "nodev", "relatime", "ro"},
},
},
Linux: &specs.Linux{
MaskedPaths: []string{
"/proc/kcore",
"/proc/latency_stats",
"/proc/timer_list",
"/proc/timer_stats",
"/proc/sched_debug",
"/sys/firmware",
"/proc/scsi",
},
ReadonlyPaths: []string{
"/proc/asound",
"/proc/bus",
"/proc/fs",
"/proc/irq",
"/proc/sys",
"/proc/sysrq-trigger",
},
Resources: &specs.LinuxResources{
Devices: []specs.LinuxDeviceCgroup{
{
Allow: false,
Access: "rwm",
},
},
},
Namespaces: []specs.LinuxNamespace{
{
Type: "pid",
},
{
Type: "network",
},
{
Type: "ipc",
},
{
Type: "uts",
},
{
Type: "mount",
},
},
Linux系统中共有37项特权,可在/usr/include/linux/capability.h文件中查看
P'(ambient) = (file is privileged) ? 0 : P(ambient)
P'(permitted) = (P(inheritable) & F(inheritable)) |
(F(permitted) & cap_bset) | P'(ambient)P'(effective) = F(effective) ? P'(permitted) : P'(ambient)
P'(inheritable) = P(inheritable) [i.e., unchanged]
P 在执行execve函数前,进程的能力 P' 在执行execve函数后,进程的能力 F 可执行文件的能力 cap_bset 系统能力的边界值,在此处默认全为1
linux capabilities | 描述 |
CAP_AUDIT_CONTROL | Enable and disable kernel auditing; change auditing filter rules; retrieve auditing status and filter‐ing rules 审计(记录文件变化、记录用户对文件的读写,甚至记录系统调用,文件变化通知) |
CAP_AUDIT_WRITE | Write records to kernel auditing log |
CAP_KILL | Bypass permission checks for sending signals (see kill(2)). This includes use of the ioctl(2) KDSIGAC‐CEPT operation. |
CAP_NET_BIND_SERVICE | Bind a socket to Internet domain privileged ports (port numbers less than 1024) |