Introduction

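When I first started working with Kubernetes, I could not work out how OCI, CRI, runC, containerd, and the various shims differ from and relate to one another. This article walks through the background of OCI, its impact on Docker, and the origin and evolution of CRI in the orchestrator Kubernetes.
The goal is to make those relationships clear.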

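Any discussion of OCI leads to Docker, because Docker came first and OCI followed, and Docker in turn leads to container technology. Docker was not the earliest container technology: chroot jails and Solaris Containers, for example, came before it. But it was Docker that brought container technology to its peak and got far more people to notice and use it.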

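Since Docker's release in 2013, its code activity on GitHub has stayed high, more and more individuals and companies have adopted it, and the idea that "a container is Docker" gradually became accepted. But if Docker is the container standard, where does that leave other container implementations, such as rkt from CoreOS?
In addition, the orchestration tools above the container layer (cluster schedulers such as Kubernetes and Mesos) were tightly coupled to Docker, so a change in Docker's interfaces could destabilize the orchestration layer or even cause service failures.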

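If Docker is the standard, any change in Docker's interfaces may force every related tool in the community to update or stop working. If there is no standard at all, container implementations fragment, with plenty of conflict and redundancy.
The community wanted neither outcome, and so OCI (the Open Container Initiative) was created.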

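Its core goal is to define and promote an open industry standard around container formats and runtimes, keeping containers flexible and open so that they can run on any hardware and system.
Containers should not be bound to a particular client or orchestration stack, should not be tightly tied to any particular vendor, and should be portable across operating systems.
The OCI website describes the OCI as follows: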

Established in June 2015 by Docker and other leaders in the container industry, the OCI currently contains two specifications: the Runtime Specification (runtime-spec) and the Image Specification (image-spec). The Runtime Specification outlines how to run a “filesystem bundle” that is unpacked on disk. At a high-level an OCI implementation would download an OCI Image then unpack that image into an OCI Runtime filesystem bundle. At this point the OCI Runtime Bundle would be run by an OCI Runtime.

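OCI was founded in 2015 by Docker and other leaders in the container industry. It currently consists of two main specifications: the container runtime specification (runtime-spec) and the container image specification (image-spec).
The two are connected through the standard OCI runtime filesystem bundle format: a tool converts an OCI image into a bundle, and an OCI-compliant runtime then recognizes that bundle and runs a container from it.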

The specifications cover two things:

- Rules for building images
- Rules for running images

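Container image specification (image-spec)
- Filesystem: stored as layers. Each layer records the changes relative to the layer below it; the spec defines which files a layer contains and how added, modified, and deleted files are represented.
- config file: holds the layer information of the filesystem (the hash of each layer plus history) and some information the container needs at run time (such as environment variables, working directory, command arguments, and the mount list).
- manifest file: references the image's config file, lists the image's layers, and carries optional annotations. A manifest holds information specific to one platform.
- index file: optional. It points to the manifest files for different platforms so that one image can be used across platforms: each platform has its own manifest, and the index ties them together.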
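
The relationship between index and manifest can be sketched in a few lines of Python. This is a toy model, not a registry client: the field names (`manifests`, `platform`, `architecture`, `os`, `digest`) follow the image-spec, while the digests, sizes, and the helper `select_manifest` are made up for illustration.

```python
# Toy OCI image index: one manifest descriptor per platform.
# Digests and sizes are placeholders, not real images.
INDEX = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.index.v1+json",
    "manifests": [
        {
            "mediaType": "application/vnd.oci.image.manifest.v1+json",
            "digest": "sha256:" + "a" * 64,
            "size": 1024,
            "platform": {"architecture": "amd64", "os": "linux"},
        },
        {
            "mediaType": "application/vnd.oci.image.manifest.v1+json",
            "digest": "sha256:" + "b" * 64,
            "size": 1024,
            "platform": {"architecture": "arm64", "os": "linux"},
        },
    ],
}


def select_manifest(index, os_name, architecture):
    """Return the manifest descriptor for the given platform, or None."""
    for descriptor in index["manifests"]:
        platform = descriptor.get("platform", {})
        if (platform.get("os") == os_name
                and platform.get("architecture") == architecture):
            return descriptor
    return None


chosen = select_manifest(INDEX, "linux", "arm64")
print(chosen["platform"])
```

A client on linux/arm64 would resolve the index to the second descriptor and then fetch that manifest, its config, and its layers by digest.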

Container runtime specification (runtime-spec)

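A container's state includes the following attributes:
- ociVersion: the OCI specification version
- id: the container ID, unique on the host
- status: the container's runtime status, which tracks its lifecycle
  - creating: the container is being created by the create command; creation covers the filesystem, namespaces, cgroups, user permissions, and so on
  - created: the container has been created but is not yet running; the image and configuration are valid and the container can run on the current platform
  - running: the container is running; its process is up and executing the task the user specified
  - stopped: the container has finished, failed, or been stopped, and its process has exited. In this state much of the container's information is still kept on the platform; it has not been fully deleted
- pid: the ID of the container process on the host
- bundle: the container's bundle directory, which holds the rootfs and the corresponding configuration
- annotations: annotations associated with the container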
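
The status values above form a small state machine. The sketch below models it in Python under stated assumptions: the class `ContainerState`, the `TRANSITIONS` table, and the example version string `"1.0.2"` are illustrative, and the `created -> stopped` edge assumes a container can be killed before it is started.

```python
# Allowed status transitions for the four states described above.
TRANSITIONS = {
    "creating": {"created"},
    "created": {"running", "stopped"},  # assumption: may be killed before start
    "running": {"stopped"},
    "stopped": set(),
}


class ContainerState:
    """Toy model of the state a runtime reports for one container."""

    def __init__(self, container_id, bundle):
        self.container_id = container_id
        self.bundle = bundle
        self.status = "creating"
        self.pid = None
        self.annotations = {}

    def transition(self, new_status, pid=None):
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        self.status = new_status
        if pid is not None:
            self.pid = pid
        if new_status == "stopped":
            self.pid = None  # the process has exited

    def to_dict(self):
        """Shape the state like the attribute list above."""
        return {
            "ociVersion": "1.0.2",  # example value only
            "id": self.container_id,
            "status": self.status,
            "pid": self.pid,
            "bundle": self.bundle,
            "annotations": self.annotations,
        }


state = ContainerState("demo", "/tmp/demo-bundle")
state.transition("created", pid=4242)
state.transition("running")
print(state.to_dict()["status"])
```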

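With the OCI standard in mind, consider how Docker's architecture evolved. Early on, all of Docker's functionality lived in the docker daemon. As features accumulated the daemon grew heavier and heavier, and in response to the community's call for OCI compatibility, Docker restructured its architecture.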

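The restructuring split the container-runtime code out of the docker daemon to form containerd. containerd exposes an API for running containers to Docker, the two communicate over gRPC, and containerd ultimately uses runC to actually run the container.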

The Docker architecture after the restructuring (Docker 1.11 and later) looks like this:

[Figure: containerd.png]

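runC grew out of Docker's libcontainer project. It wraps libcontainer and was donated to the OCI as a standards-compliant runtime implementation. As the figure above shows, the Docker engine itself is also built on runC.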

An answer on Stack Overflow about libcontainer and runC:

The Open Container Format (OCF) specification is a written document (or set of documents) defining what a "standard container" is,
in terms of filesystem, available operations and execution environment.
The document seems to be backed up with Go code. This specification is currently (July 2015) a work-in-progress.
Runc is an implementation of the standard. At the time of writing, it is basically a repackaging of libcontainer.
Docker uses libcontainer/runc, but adds a lot of tooling and features on top, such as volumes, networking and management of containers.
There is more information on the Docker blog and Open Containers site
If you're just getting started with containers, I would start with Docker and look into the other projects later once you understand how containers work.

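runC does exactly one thing: it runs containers. It provides a CLI (command-line interface) tool for creating and running containers, and it talks directly to the Linux kernel features that containers depend on, such as cgroups and namespaces.
It sets up the cgroups, namespaces, and the rest of the environment a container needs, and then creates the container's processes.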
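
What runC consumes is a bundle: a directory holding a `rootfs` and a `config.json`. The Python sketch below writes a deliberately incomplete bundle to show the layout. The helper `write_bundle` and all values are illustrative assumptions; a config that runC would actually accept needs more fields (user, mounts, Linux namespaces), and `runc spec` is the usual way to generate a full default.

```python
import json
import tempfile
from pathlib import Path


def write_bundle(bundle_dir, args):
    """Create <bundle_dir>/rootfs and a skeletal <bundle_dir>/config.json."""
    bundle = Path(bundle_dir)
    (bundle / "rootfs").mkdir(parents=True, exist_ok=True)
    config = {
        "ociVersion": "1.0.2",  # example value only
        "process": {
            "args": list(args),  # command to run inside the container
            "cwd": "/",
            "env": ["PATH=/usr/bin:/bin"],
        },
        "root": {"path": "rootfs", "readonly": True},
        "hostname": "demo",
    }
    config_path = bundle / "config.json"
    config_path.write_text(json.dumps(config, indent=2))
    return config_path


with tempfile.TemporaryDirectory() as tmp:
    path = write_bundle(tmp, ["sh", "-c", "echo hello"])
    print(sorted(p.name for p in Path(tmp).iterdir()))
```

With a populated rootfs and a complete config, commands such as `runc create` and `runc start` operate on a bundle directory laid out like this.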

On containerd-shim, here is the explanation from Michael Crosby (author of runC and containerd):

The shim allows for daemonless containers. It basically sits as the parent of the container's process to facilitate a few things. First it allows the runtimes, i.e. runc,to exit after it starts the container. This way we don't have to have the long running runtime processes for containers. When you start mysql you should only see the mysql process and the shim. Second it keeps the STDIO and other fds open for the container incase containerd and/or docker both die. If the shim was not running then the parent side of the pipes or the TTY master would be closed and the container would exit.
Finally it allows the container's exit status to be reported back to a higher level tool like docker without having the be the actual parent of the container's process and do a wait4.
I did a talk on this last week at dockercon US. You can see my slides here. https://github.com/crosbymichael/dockercon-2016
Hopefully that will explain a little more about how containerd and the shim work.

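The containerd-shim process is started by containerd, so containerd is the parent of containerd-shim, and the container process is started by containerd-shim. One benefit is that upgrading or restarting Docker or containerd does not affect container processes that are already running. If containerd were the direct parent of the containers, every crash or upgrade of containerd would take down all containers on the host. Introducing containerd-shim avoids this: when containerd exits or restarts, the shim is re-parented to PID 1, for example systemd.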
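
The re-parenting argument can be illustrated with a toy process tree. This is a pure in-memory simulation, not real process management: the `Proc` class and `kill` helper are invented for illustration, and it simplifies by always handing orphans to PID 1.

```python
class Proc:
    """A node in a simulated process tree."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.alive = True
        self.children = []
        if parent is not None:
            parent.children.append(self)


def kill(proc, init):
    """Mark proc dead and hand its orphaned children to init."""
    proc.alive = False
    for child in proc.children:
        child.parent = init
        init.children.append(child)
    proc.children = []
    if proc.parent is not None:
        proc.parent.children.remove(proc)


init = Proc("systemd")
containerd = Proc("containerd", init)
shim = Proc("containerd-shim", containerd)
mysqld = Proc("mysqld", shim)  # the container process

kill(containerd, init)
print(shim.parent.name, mysqld.parent.name, mysqld.alive)
# prints: systemd containerd-shim True
```

The container's parent never changes: only the shim is re-parented, which is why the container keeps running.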

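containerd is a simple daemon that uses runC to manage containers and exposes container functionality over gRPC. Compared with the Docker engine, containerd exposes create, read, update, and delete interfaces for containers over gRPC, whereas the Docker engine exposes a full-blown HTTP API covering images, volumes, networks, builds, and so on.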

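With OCI and Docker's OCI-compatible restructuring covered, we come to the main topic: CRI (the Container Runtime Interface).
CRI is a standard introduced by Kubernetes, and the fact that Kubernetes could set such a standard reflects its standing in container orchestration.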

Before discussing CRI, here is a brief look at how kubelet starts a container:

[Figure: 20190602155250485.png]

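- kubelet calls dockershim through the CRI interface (gRPC) and asks it to create a container. In this step kubelet acts as a simple CRI client and dockershim is the server that receives the request. Note that dockershim is embedded in kubelet.
- dockershim translates the request into one the Docker daemon understands and sends it to the Docker daemon to create a container.
- The Docker daemon asks containerd to create a container.
- containerd receives the request and creates a containerd-shim process, through which it operates the container. The container process needs a parent to do things such as collect its status and keep stdin and other fds open.
- containerd-shim then calls runC to start the container.
- Once runC has started the container it exits immediately. containerd-shim becomes the parent of the container process, collects its status, and reports it to containerd.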

From this flow we can see that kubelet interacts with external container runtimes through the CRI standard.

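Early Kubernetes versions, before 1.5, had Docker and rkt built in, so two runtimes were supported. Anyone who wanted a custom runtime at that point had a painful job: they had to modify the kubelet source code.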

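Different container runtimes also have different strengths, and as Kubernetes became the dominant container orchestrator, many users wanted it to support more runtimes to suit different users and environments.
So starting with Kubernetes 1.5 the CRI interface was added. With CRI, more container runtimes can be supported without modifying the kubelet source code.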

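At the same time the built-in Docker and rkt support was gradually removed from the Kubernetes source. By Kubernetes 1.11 the rkt code built into kubelet had been deleted and the CNI implementation had moved into dockershim.
Apart from Docker, every other container runtime is integrated through CRI.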

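An external container runtime is usually called a CRI shim. Besides implementing the CRI interface, it is also responsible for configuring networking for containers through CNI, which makes the community's many network plugins available.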

CRI defines two main interfaces, ImageService and RuntimeService, as shown below:

[Figure: 0.jpeg]

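ImageService: manages the image lifecycle
- List images
- Pull an image to the local node
- Query image status
- Remove a local image
- Query the disk space used by images
RuntimeService: manages the lifecycle of Pods and containers
- PodSandbox management interface
  A PodSandbox is an abstraction of a Kubernetes Pod. It gives containers an isolated environment (for example, by placing them under the same cgroup) and provides shared namespaces such as the network namespace. A PodSandbox usually corresponds to a pause container or a virtual machine.
- Container management interface
  Creates, starts, stops, and removes containers in a given PodSandbox.
- Streaming API interface
  Covers Exec, Attach, and PortForward, the three interfaces that exchange data with a container. They return the URL of the runtime's streaming server rather than interacting with the container directly.
- Status interface
  Covers querying the API version and querying the runtime status.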
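
The ordering these interfaces imply (pull an image, run a sandbox, then create and start a container inside it) can be sketched with an in-memory stand-in. The real CRI is a gRPC API with RPCs such as `PullImage`, `RunPodSandbox`, `CreateContainer`, and `StartContainer`; the Python class `FakeRuntime`, its snake_case methods, and its ID scheme are illustrative assumptions, not the actual API.

```python
import itertools


class FakeRuntime:
    """In-memory stand-in: method names mirror CRI RPCs, bodies are invented."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.images = set()
        self.sandboxes = {}
        self.containers = {}

    # ImageService-like methods
    def pull_image(self, image):
        self.images.add(image)
        return image

    def list_images(self):
        return sorted(self.images)

    def remove_image(self, image):
        self.images.discard(image)

    # RuntimeService-like methods
    def run_pod_sandbox(self, pod_name):
        sandbox_id = f"sandbox-{next(self._ids)}"
        self.sandboxes[sandbox_id] = {"name": pod_name, "state": "SANDBOX_READY"}
        return sandbox_id

    def create_container(self, sandbox_id, name, image):
        if sandbox_id not in self.sandboxes:
            raise KeyError(f"unknown sandbox {sandbox_id}")
        if image not in self.images:
            raise ValueError(f"image {image} not pulled")
        container_id = f"container-{next(self._ids)}"
        self.containers[container_id] = {
            "sandbox": sandbox_id,
            "name": name,
            "image": image,
            "state": "CONTAINER_CREATED",
        }
        return container_id

    def start_container(self, container_id):
        self.containers[container_id]["state"] = "CONTAINER_RUNNING"

    def stop_container(self, container_id):
        self.containers[container_id]["state"] = "CONTAINER_EXITED"


runtime = FakeRuntime()
runtime.pull_image("nginx:latest")
sandbox = runtime.run_pod_sandbox("web")
container = runtime.create_container(sandbox, "nginx", "nginx:latest")
runtime.start_container(container)
print(runtime.containers[container]["state"])
```

The sandbox comes first because containers in a Pod share what the sandbox provides, which is why `create_container` rejects an unknown sandbox ID.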

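Container runtimes that work with Kubernetes include:
- cri-o: a container runtime compatible with both OCI and CRI
- cri-containerd: a Kubernetes CRI implementation based on containerd
- rkt: the container runtime promoted by CoreOS as a rival to Docker
- docker: the runtime Kubernetes has supported from the very beginning; at the time of writing it is not yet fully decoupled from kubelet. Docker the company also promoted the OCI standard
- Kata Containers: conforms to the OCI specification and is compatible with CRI
- gVisor: a container runtime sandbox from Google (experimental)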

The container ecosystem can be viewed as three layers of abstraction:

Orchestration API -> Container API -> Kernel API

Orchestration API: the Kubernetes API is the standard at this layer, beyond dispute
Container API: the standard is CRI
Kernel API: the standard is OCI

References

https://www.opencontainers.org/
https://blog.docker.com/2017/07/demystifying-open-container-initiative-oci-specifications/
https://github.com/kubernetes/community/blob/master/contributors/devel/sig-node/container-runtime-interface.md
https://searchitoperations.techtarget.com/definition/Open-Container-Initiative
https://github.com/opencontainers/
https://github.com/containerd/containerd
https://stackoverflow.com/questions/31213126/libcontainer-vs-docker-vs-ocf-vs-runc
https://aleiwu.com/post/cncf-runtime-landscape/
https://groups.google.com/forum/#!topic/docker-dev/zaZFlvIx1_k
https://feisky.xyz/posts/kubernetes-container-runtime/
