前言
理解docker,主要从namesapce,cgroups,联合文件,运行时(runC),网络几个方面。接下来我们会花一些时间,分别介绍。
- docker系列--namespace解读
- docker系列--cgroups解读
- docker系列--unionfs解读
- docker系列--runC解读
- docker系列--网络模式解读
namesapce主要是隔离作用,cgroups主要是资源限制,联合文件主要用于镜像分层存储和管理,runC是运行时,遵循了oci接口,一般来说基于libcontainer。网络主要是docker单机网络和多主机通信模式。
runC
RunC 是一个轻量级的工具,它是用来运行容器的,只用来做这一件事,并且这一件事要做好。我们可以认为它就是个命令行小工具,可以不用通过 docker 引擎,直接运行容器。事实上,runC 是标准化的产物,它根据 OCI 标准来创建和运行容器。而 OCI(Open Container Initiative)组织,旨在围绕容器格式和运行时制定一个开放的工业化标准。
OCI 由 docker、coreos 以及其他容器相关公司创建于 2015 年,目前主要有两个标准文档:容器运行时标准 (runtime spec)和 容器镜像标准(image spec)。
runC 由golang语言实现,基于libcontainer库。从docker1.11以后,docker架构图:
编译
runc目前支持各种架构的Linux平台。必须使用Go 1.6或更高版本构建它才能使某些功能正常运行。
要启用seccomp支持,您需要在平台上安装libseccomp。
e.g. libseccomp-devel for CentOS, or libseccomp-dev for Ubuntu
否则,如果您不想使用seccomp支持构建runc,则可以在运行make时添加BUILDTAGS =“”。
# create a 'github.com/opencontainers' in your GOPATH/src
cd github.com/opencontainers
git clone https://github.com/opencontainers/runc
cd runc
make
sudo make install
编译选项
runc支持可选的构建标记,用于编译各种功能的支持。要将构建标记添加到make选项,必须设置BUILDTAGS变量。
make BUILDTAGS='seccomp apparmor'
Build Tag | Feature | Dependency |
---|---|---|
seccomp | Syscall filtering | libseccomp |
selinux | selinux process and mount labeling | |
apparmor | apparmor profile support | |
ambient | ambient capability support | kernel 4.3 |
使用runC
创建一个 OCI Bundle
要使用runc,您必须使用OCI包的格式容器。如果安装了Docker,则可以使用其导出方法从现有Docker容器中获取根文件系统。
# create the top most bundle directory
mkdir /mycontainer
cd /mycontainer
# create the rootfs directory
mkdir rootfs
# export busybox via Docker into the rootfs directory
docker export $(docker create busybox) | tar -C rootfs -xvf -
runc提供了一个spec命令来生成您可以编辑的基本模板规范。
runc spec
运行容器
先来准备一个工作目录,下面所有的操作都是在这个目录下执行的,比如 mycontainer:
# mkdir mycontainer
接下来,准备容器镜像的文件系统,我们选择从 docker 镜像中提取:
# mkdir rootfs
# docker export $(docker create busybox) | tar -C rootfs -xvf -
# ls rootfs
bin dev etc home proc root sys tmp usr var
有了 rootfs 之后,我们还要按照 OCI 标准有一个配置文件 config.json 说明如何运行容器,包括要运行的命令、权限、环境变量等等内容,runc 提供了一个命令可以自动帮我们生成:
# runc spec
# ls
config.json rootfs
这样就构成了一个 OCI runtime bundle 的内容,这个 bundle 非常简单,就上面两个内容:config.json 文件和 rootfs 文件系统。config.json 里面的内容很长,这里就不贴出来了,我们也不会对其进行修改,直接使用这个默认生成的文件。有了这些信息,runc 就能知道怎么怎么运行容器了,我们先来看看简单的方法 runc run(这个命令需要 root 权限),这个命令类似于 docker run,它会创建并启动一个容器:
runc run simplebusybox
/ # ls
bin dev etc home proc root sys tmp usr var
/ # hostname
runc
/ # whoami
root
/ # pwd
/
/ # ip addr
1: lo: mtu 65536 qdisc noqueue qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
/ # ps aux
PID USER TIME COMMAND
1 root 0:00 sh
11 root 0:00 ps aux
此时,另开一个终端,可以查看运行的容器信息:
runc list
ID PID STATUS BUNDLE CREATED OWNER
simplebusybox 18073 running /home/cizixs/Workspace/runc/mycontainer 2017-11-02T06:54:52.023379345Z root
runC代码解读
总体来说,runC代码比较简单。主要是引用github.com/urfave/cli库,实现了一系列命令
app.Commands = []cli.Command{
checkpointCommand,
createCommand,
deleteCommand,
eventsCommand,
execCommand,
initCommand,
killCommand,
listCommand,
pauseCommand,
psCommand,
restoreCommand,
resumeCommand,
runCommand,
specCommand,
startCommand,
stateCommand,
updateCommand,
}
熟悉docker命令的人,应该对此很熟悉了。
这些命令底层是调用 libcontainer库实现具体的操作。
例如create 命令:
var createCommand = cli.Command{
Name: "create",
Usage: "create a container",
ArgsUsage: `
Where "" is your name for the instance of the container that you
are starting. The name you provide for the container instance must be unique on
your host.`,
Description: `The create command creates an instance of a container for a bundle. The bundle
is a directory with a specification file named "` + specConfig + `" and a root
filesystem.
The specification file includes an args parameter. The args parameter is used
to specify command(s) that get run when the container is started. To change the
command(s) that get executed on start, edit the args parameter of the spec. See
"runc spec --help" for more explanation.`,
Flags: []cli.Flag{
cli.StringFlag{
Name: "bundle, b",
Value: "",
Usage: `path to the root of the bundle directory, defaults to the current directory`,
},
cli.StringFlag{
Name: "console-socket",
Value: "",
Usage: "path to an AF_UNIX socket which will receive a file descriptor referencing the master end of the console's pseudoterminal",
},
cli.StringFlag{
Name: "pid-file",
Value: "",
Usage: "specify the file to write the process id to",
},
cli.BoolFlag{
Name: "no-pivot",
Usage: "do not use pivot root to jail process inside rootfs. This should be used whenever the rootfs is on top of a ramdisk",
},
cli.BoolFlag{
Name: "no-new-keyring",
Usage: "do not create a new session keyring for the container. This will cause the container to inherit the calling processes session key",
},
cli.IntFlag{
Name: "preserve-fds",
Usage: "Pass N additional file descriptors to the container (stdio + $LISTEN_FDS + N in total)",
},
},
Action: func(context *cli.Context) error {
if err := checkArgs(context, 1, exactArgs); err != nil {
return err
}
if err := revisePidFile(context); err != nil {
return err
}
spec, err := setupSpec(context)
if err != nil {
return err
}
status, err := startContainer(context, spec, CT_ACT_CREATE, nil)
if err != nil {
return err
}
// exit with the container's exit status so any external supervisor is
// notified of the exit with the correct exit status.
os.Exit(status)
return nil
},
}
- 规定了每个命令所需的命令行参数。
- 具体执行逻辑。
其实如果需要更深的理解,更多需要理解libcontainer了。
主要有以下几个重要的文件需要理解
- factory.go
- container.go
- process.go
- init_linux.go
下面我们通过如何创建一个容器来剖析和理解上面的几个文件。
先调用spec, err := setupSpec(context)加载配置文件config.json的内容。此处是和咱们前面提到的OCI bundle 相关。
spec, err := setupSpec(context)
if err != nil {
return err
}
最终生成了Spec对象,spec定义如下:
// Spec is the base configuration for the container.
type Spec struct {
// Version of the Open Container Runtime Specification with which the bundle complies.
Version string `json:"ociVersion"`
// Process configures the container process.
Process *Process `json:"process,omitempty"`
// Root configures the container's root filesystem.
Root *Root `json:"root,omitempty"`
// Hostname configures the container's hostname.
Hostname string `json:"hostname,omitempty"`
// Mounts configures additional mounts (on top of Root).
Mounts []Mount `json:"mounts,omitempty"`
// Hooks configures callbacks for container lifecycle events.
Hooks *Hooks `json:"hooks,omitempty" platform:"linux,solaris"`
// Annotations contains arbitrary metadata for the container.
Annotations map[string]string `json:"annotations,omitempty"`
// Linux is platform-specific configuration for Linux based containers.
Linux *Linux `json:"linux,omitempty" platform:"linux"`
// Solaris is platform-specific configuration for Solaris based containers.
Solaris *Solaris `json:"solaris,omitempty" platform:"solaris"`
// Windows is platform-specific configuration for Windows based containers.
Windows *Windows `json:"windows,omitempty" platform:"windows"`
}
之后调用status, err := startcontainer(context, spec, CT_ACT_CREATE, nil)进行容器的创建工作。其中CT_ACT_CREATE表示创建操作。CT_ACT_CREATE是一个枚举类型。
type CtAct uint8
const (
CT_ACT_CREATE CtAct = iota + 1
CT_ACT_RUN
CT_ACT_RESTORE
)
status, err := startContainer(context, spec, CT_ACT_CREATE, nil)
而startcontainer具体代码:
func startContainer(context *cli.Context, spec *specs.Spec, action CtAct, criuOpts *libcontainer.CriuOpts) (int, error) {
id := context.Args().First()
if id == "" {
return -1, errEmptyID
}
notifySocket := newNotifySocket(context, os.Getenv("NOTIFY_SOCKET"), id)
if notifySocket != nil {
notifySocket.setupSpec(context, spec)
}
container, err := createContainer(context, id, spec)
if err != nil {
return -1, err
}
if notifySocket != nil {
err := notifySocket.setupSocket()
if err != nil {
return -1, err
}
}
// Support on-demand socket activation by passing file descriptors into the container init process.
listenFDs := []*os.File{}
if os.Getenv("LISTEN_FDS") != "" {
listenFDs = activation.Files(false)
}
r := &runner{
enableSubreaper: !context.Bool("no-subreaper"),
shouldDestroy: true,
container: container,
listenFDs: listenFDs,
notifySocket: notifySocket,
consoleSocket: context.String("console-socket"),
detach: context.Bool("detach"),
pidFile: context.String("pid-file"),
preserveFDs: context.Int("preserve-fds"),
action: action,
criuOpts: criuOpts,
init: true,
}
return r.run(spec.Process)
}
首先调用container, err := createContainer(context, id, spec)创建容器, 之后填充runner结构r。
func createContainer(context *cli.Context, id string, spec *specs.Spec) (libcontainer.Container, error) {
rootless, err := isRootless(context)
if err != nil {
return nil, err
}
config, err := specconv.CreateLibcontainerConfig(&specconv.CreateOpts{
CgroupName: id,
UseSystemdCgroup: context.GlobalBool("systemd-cgroup"),
NoPivotRoot: context.Bool("no-pivot"),
NoNewKeyring: context.Bool("no-new-keyring"),
Spec: spec,
Rootless: rootless,
})
if err != nil {
return nil, err
}
factory, err := loadFactory(context)
if err != nil {
return nil, err
}
return factory.Create(id, config)
}
注意factory, err := loadFactory(context)和factory.Create(id, config),这两个就是我们上面提到的factory.go。由工厂来根据配置config创建具体容器。
最后调用了run方法。run方法传递了一个process对象,表示容器内进程的信息。即上面提到的process.go文件中的内容。
// Process contains information to start a specific application inside the container.
type Process struct {
// Terminal creates an interactive terminal for the container.
Terminal bool `json:"terminal,omitempty"`
// ConsoleSize specifies the size of the console.
ConsoleSize *Box `json:"consoleSize,omitempty"`
// User specifies user information for the process.
User User `json:"user"`
// Args specifies the binary and arguments for the application to execute.
Args []string `json:"args"`
// Env populates the process environment for the process.
Env []string `json:"env,omitempty"`
// Cwd is the current working directory for the process and must be
// relative to the container's root.
Cwd string `json:"cwd"`
// Capabilities are Linux capabilities that are kept for the process.
Capabilities *LinuxCapabilities `json:"capabilities,omitempty" platform:"linux"`
// Rlimits specifies rlimit options to apply to the process.
Rlimits []POSIXRlimit `json:"rlimits,omitempty" platform:"linux,solaris"`
// NoNewPrivileges controls whether additional privileges could be gained by processes in the container.
NoNewPrivileges bool `json:"noNewPrivileges,omitempty" platform:"linux"`
// ApparmorProfile specifies the apparmor profile for the container.
ApparmorProfile string `json:"apparmorProfile,omitempty" platform:"linux"`
// Specify an oom_score_adj for the container.
OOMScoreAdj *int `json:"oomScoreAdj,omitempty" platform:"linux"`
// SelinuxLabel specifies the selinux context that the container process is run as.
SelinuxLabel string `json:"selinuxLabel,omitempty" platform:"linux"`
}
run方法主要是newProcess方法
process, err := newProcess(*config, r.init)
newProcess 主要是填充 libcontainer.Process 结构体,包括参数,环境变量,user 权限,工作目录,cpabilities,资源限制等。
具体的操作是:
switch r.action {
case CT_ACT_CREATE:
err = r.container.Start(process)
case CT_ACT_RESTORE:
err = r.container.Restore(process, r.criuOpts)
case CT_ACT_RUN:
err = r.container.Run(process)
default:
panic("Unknown action")
}
启动容器代码container.Start(process):
func (c *linuxContainer) start(process *Process) error {
parent, err := c.newParentProcess(process)
if err != nil {
return newSystemErrorWithCause(err, "creating new parent process")
}
if err := parent.start(); err != nil {
// terminate the process to ensure that it properly is reaped.
if err := ignoreTerminateErrors(parent.terminate()); err != nil {
logrus.Warn(err)
}
return newSystemErrorWithCause(err, "starting container process")
}
// generate a timestamp indicating when the container was started
c.created = time.Now().UTC()
if process.Init {
c.state = &createdState{
c: c,
}
state, err := c.updateState(parent)
if err != nil {
return err
}
c.initProcessStartTime = state.InitProcessStartTime
if c.config.Hooks != nil {
bundle, annotations := utils.Annotations(c.config.Labels)
s := configs.HookState{
Version: c.config.Version,
ID: c.id,
Pid: parent.pid(),
Bundle: bundle,
Annotations: annotations,
}
for i, hook := range c.config.Hooks.Poststart {
if err := hook.Run(s); err != nil {
if err := ignoreTerminateErrors(parent.terminate()); err != nil {
logrus.Warn(err)
}
return newSystemErrorWithCausef(err, "running poststart hook %d", i)
}
}
}
}
return nil
}
- newParentProcess
1.创建一对pipe,parentPipe和childPipe,作为 runc start 进程与容器内部 init 进程通信管道
2.创建一个命令模版作为 Parent 进程启动的模板
3.newInitProcess 封装 initProcess。主要工作为添加初始化类型环境变量,将namespace、uid/gid 映射等信息使用 bootstrapData 封装为一个 io.Reader
- newInitProcess
添加初始化类型环境变量,将namespace、uid/gid 映射等信息使用 bootstrapData 函数封装为一个 io.Reader,使用的是 netlink 用于内核间的通信,返回 initProcess 结构体。
最后调用func (l *linuxStandardInit) Init() error方法,这里是上面提到的init_linux.go文件。
func (l *linuxStandardInit) Init() error {
if !l.config.Config.NoNewKeyring {
ringname, keepperms, newperms := l.getSessionRingParams()
// Do not inherit the parent's session keyring.
sessKeyId, err := keys.JoinSessionKeyring(ringname)
if err != nil {
return errors.Wrap(err, "join session keyring")
}
// Make session keyring searcheable.
if err := keys.ModKeyringPerm(sessKeyId, keepperms, newperms); err != nil {
return errors.Wrap(err, "mod keyring permissions")
}
}
if err := setupNetwork(l.config); err != nil {
return err
}
if err := setupRoute(l.config.Config); err != nil {
return err
}
label.Init()
if err := prepareRootfs(l.pipe, l.config); err != nil {
return err
}
// Set up the console. This has to be done *before* we finalize the rootfs,
// but *after* we've given the user the chance to set up all of the mounts
// they wanted.
if l.config.CreateConsole {
if err := setupConsole(l.consoleSocket, l.config, true); err != nil {
return err
}
if err := system.Setctty(); err != nil {
return errors.Wrap(err, "setctty")
}
}
// Finish the rootfs setup.
if l.config.Config.Namespaces.Contains(configs.NEWNS) {
if err := finalizeRootfs(l.config.Config); err != nil {
return err
}
}
if hostname := l.config.Config.Hostname; hostname != "" {
if err := unix.Sethostname([]byte(hostname)); err != nil {
return errors.Wrap(err, "sethostname")
}
}
if err := apparmor.ApplyProfile(l.config.AppArmorProfile); err != nil {
return errors.Wrap(err, "apply apparmor profile")
}
if err := label.SetProcessLabel(l.config.ProcessLabel); err != nil {
return errors.Wrap(err, "set process label")
}
for key, value := range l.config.Config.Sysctl {
if err := writeSystemProperty(key, value); err != nil {
return errors.Wrapf(err, "write sysctl key %s", key)
}
}
for _, path := range l.config.Config.ReadonlyPaths {
if err := readonlyPath(path); err != nil {
return errors.Wrapf(err, "readonly path %s", path)
}
}
for _, path := range l.config.Config.MaskPaths {
if err := maskPath(path, l.config.Config.MountLabel); err != nil {
return errors.Wrapf(err, "mask path %s", path)
}
}
pdeath, err := system.GetParentDeathSignal()
if err != nil {
return errors.Wrap(err, "get pdeath signal")
}
if l.config.NoNewPrivileges {
if err := unix.Prctl(unix.PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); err != nil {
return errors.Wrap(err, "set nonewprivileges")
}
}
// Tell our parent that we're ready to Execv. This must be done before the
// Seccomp rules have been applied, because we need to be able to read and
// write to a socket.
if err := syncParentReady(l.pipe); err != nil {
return errors.Wrap(err, "sync ready")
}
// Without NoNewPrivileges seccomp is a privileged operation, so we need to
// do this before dropping capabilities; otherwise do it as late as possible
// just before execve so as few syscalls take place after it as possible.
if l.config.Config.Seccomp != nil && !l.config.NoNewPrivileges {
if err := seccomp.InitSeccomp(l.config.Config.Seccomp); err != nil {
return err
}
}
if err := finalizeNamespace(l.config); err != nil {
return err
}
// finalizeNamespace can change user/group which clears the parent death
// signal, so we restore it here.
if err := pdeath.Restore(); err != nil {
return errors.Wrap(err, "restore pdeath signal")
}
// Compare the parent from the initial start of the init process and make
// sure that it did not change. if the parent changes that means it died
// and we were reparented to something else so we should just kill ourself
// and not cause problems for someone else.
if unix.Getppid() != l.parentPid {
return unix.Kill(unix.Getpid(), unix.SIGKILL)
}
// Check for the arg before waiting to make sure it exists and it is
// returned as a create time error.
name, err := exec.LookPath(l.config.Args[0])
if err != nil {
return err
}
// Close the pipe to signal that we have completed our init.
l.pipe.Close()
// Wait for the FIFO to be opened on the other side before exec-ing the
// user process. We open it through /proc/self/fd/$fd, because the fd that
// was given to us was an O_PATH fd to the fifo itself. Linux allows us to
// re-open an O_PATH fd through /proc.
fd, err := unix.Open(fmt.Sprintf("/proc/self/fd/%d", l.fifoFd), unix.O_WRONLY|unix.O_CLOEXEC, 0)
if err != nil {
return newSystemErrorWithCause(err, "open exec fifo")
}
if _, err := unix.Write(fd, []byte("0")); err != nil {
return newSystemErrorWithCause(err, "write 0 exec fifo")
}
// Close the O_PATH fifofd fd before exec because the kernel resets
// dumpable in the wrong order. This has been fixed in newer kernels, but
// we keep this to ensure CVE-2016-9962 doesn't re-emerge on older kernels.
// N.B. the core issue itself (passing dirfds to the host filesystem) has
// since been resolved.
// https://github.com/torvalds/linux/blob/v4.9/fs/exec.c#L1290-L1318
unix.Close(l.fifoFd)
// Set seccomp as close to execve as possible, so as few syscalls take
// place afterward (reducing the amount of syscalls that users need to
// enable in their seccomp profiles).
if l.config.Config.Seccomp != nil && l.config.NoNewPrivileges {
if err := seccomp.InitSeccomp(l.config.Config.Seccomp); err != nil {
return newSystemErrorWithCause(err, "init seccomp")
}
}
if err := syscall.Exec(name, l.config.Args[0:], os.Environ()); err != nil {
return newSystemErrorWithCause(err, "exec user process")
}
return nil
}
(1)、该函数先处理l.config.Config.NoNewKeyring,l.config.Console, setupNetwork, setupRoute, label.Init()
(2)、if l.config.Config.Namespaces.Contains(configs.NEWNS) -> setupRootfs(l.config.Config, console, l.pipe)
(3)、设置hostname, apparmor.ApplyProfile(...), label.SetProcessLabel(...),l.config.Config.Sysctl
(4)、调用remountReadonly(path)重新挂载ReadonlyPaths,在配置文件中为/proc/asound,/proc/bus, /proc/fs等等
(5)、调用maskPath(path)设置maskedPaths,pdeath := system.GetParentDeathSignal(), 处理l.config.NoNewPrivileges
(6)、调用syncParentReady(l.pipe) // 告诉父进程容器可以执行Execv了, 从父进程来看,create已经完成了
(7)、处理l.config.Config.Seccomp 和 l.config.NoNewPrivileges, finalizeNamespace(l.config),pdeath.Restore(), 判断syscall.Getppid()和l.parentPid是否相等,找到name, err := exec.Lookpath(l.config.Args[0]),最后l.pipe.Close(),init完成。此时create 在子进程中也完成了。
(8)、fd, err := syscall.Openat(l.stateDirFD, execFifoFilename, os.O_WRONLY|syscall.O_CLOEXEC, 0) ---> wait for the fifo to be opened on the other side before exec'ing the user process,其实此处就是在等待start命令。之后,再往fd中写一个字节,用于同步:syscall.Write(fd, []byte("0"))
(9)、调用syscall.Exec(name, l.config.Args[0:], os.Environ())执行容器命令