Overview

What is Kubernetes?

image

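Core layer: Kubernetes' most essential functionality. Outward, it exposes APIs for building higher-level applications; inward, it provides a pluggable application execution environment
Application layer: deployment (stateless applications, stateful applications, batch jobs, clustered applications, etc.) and routing (service discovery, DNS resolution, etc.)
Governance layer: system metrics (infrastructure, container, and network metrics), automation (auto-scaling, dynamic provisioning, etc.), and policy management (RBAC, Quota, PSP, NetworkPolicy, etc.)
Interface layer: the kubectl command-line tool, client SDKs, and cluster federation
Ecosystem: the large ecosystem for container cluster management and scheduling that sits above the interface layer. It falls into two categories:
- Outside Kubernetes: logging, monitoring, configuration management, CI, CD, Workflow, FaaS, OTS applications, ChatOps, etc.
- Inside Kubernetes: CRI, CNI, CVI, image registries, Cloud Provider, configuration and management of the cluster itself, etc.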

Kubernetes is an open-source platform for automating deployment, scaling, and operations of application containers across clusters of hosts, providing container-centric infrastructure.

Features:

Deploy your applications quickly and predictably.
Scale your applications on the fly.
Seamlessly roll out new features.
Optimize use of your hardware by using only the resources you need.

Other characteristics:

portable: public, private, hybrid, multi-cloud
extensible: modular, pluggable, hookable, composable
self-healing: auto-placement, auto-restart, auto-replication, auto-scaling

Why containers?

Agile application creation and deployment: Increased ease and efficiency of container image creation compared to VM image use.
Continuous development, integration, and deployment: Provides for reliable and frequent container image build and deployment with quick and easy rollbacks (due to image immutability).
Dev and Ops separation of concerns: Create application container images at build/release time rather than deployment time, thereby decoupling applications from infrastructure.
Environmental consistency across development, testing, and production: Runs the same on a laptop as it does in the cloud.
Cloud and OS distribution portability: Runs on Ubuntu, RHEL, CoreOS, on-prem, Google Container Engine, and anywhere else.
Application-centric management: Raises the level of abstraction from running an OS on virtual hardware to run an application on an OS using logical resources.
Loosely coupled, distributed, elastic, liberated micro-services: Applications are broken into smaller, independent pieces and can be deployed and managed dynamically – not a fat monolithic stack running on one big single-purpose machine.
Resource isolation: Predictable application performance.
Resource utilization: High efficiency and density.

What Kubernetes provides

co-locating helper processes, facilitating composite applications and preserving the one-application-per-container model,
mounting storage systems,
distributing secrets,
application health checking,
replicating application instances,
horizontal auto-scaling,
naming and discovery,
load balancing,
rolling updates,
resource monitoring,
log access and ingestion,
support for introspection and debugging, and
identity and authorization.

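Summary: Kubernetes schedules, manages, and scales applications (Deployment/DaemonSet/StatefulSet/Job, health checks, auto-scaling, rolling updates); provides a platform for running them (logging, monitoring, service discovery, load balancing, authentication and authorization); and manages, controls, and allocates platform resources (memory, CPU, network, storage, images).

Consider the definition of an operating system:
An operating system (OS) is the collection of programs that controls and manages all of a computer system's hardware and software resources, organizes and schedules the computer's work and resource allocation, and provides convenient interfaces and environments to users and other software. Kubernetes is in effect a distributed operating system: it manages the software and hardware resources of a cluster of computers, organizes and schedules programs (containers) and resource allocation, and provides convenient interfaces and environments to users and other software.
Most concepts from a single-machine operating system already have, or are acquiring, a counterpart in k8s. For example, systemctl has a reload operation; k8s does not have one yet, but it is being worked on.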

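What Kubernetes is not

This section is interesting and well worth reading. Much of what Kubernetes is not is exactly what Kubernetes distributors have to consider and deliver.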

Does not limit the types of applications supported. It does not dictate application frameworks (e.g., Wildfly), restrict the set of supported language runtimes (for example, Java, Python, Ruby), cater to only 12-factor applications, nor distinguish apps from services. Kubernetes aims to support an extremely diverse variety of workloads, including stateless, stateful, and data-processing workloads. If an application can run in a container, it should run great on Kubernetes.
Does not provide middleware (e.g., message buses), data-processing frameworks (for example, Spark), databases (e.g., mysql), nor cluster storage systems (e.g., Ceph) as built-in services. Such applications run on Kubernetes.
Does not have a click-to-deploy service marketplace.
Does not deploy source code and does not build your application. Continuous Integration (CI) workflow is an area where different users and projects have their own requirements and preferences, so it supports layering CI workflows on Kubernetes but doesn’t dictate how layering should work.
Allows users to choose their logging, monitoring, and alerting systems. (It provides some integrations as proof of concept.)
Does not provide nor mandate a comprehensive application configuration language/system (for example, jsonnet).
Does not provide nor adopt any comprehensive machine configuration, maintenance, management, or self-healing systems.

Kubernetes Components

Role	Component	Description
Master Components	kube-apiserver	kube-apiserver exposes the Kubernetes API;
-	-	it is the front-end for the Kubernetes control plane.
Master Components	etcd	Kubernetes' backing store. All cluster data is stored here.
Master Components	kube-controller-manager	A single binary that includes:
-	-	1.Node Controller: noticing & responding when nodes go down.
-	-	2.Replication Controller: maintains the correct number of pods for every Replication Controller object.
-	-	3.Endpoints Controller: populates the Endpoints object (i.e. joins Services & Pods).
-	-	4.Service Account & Token Controllers:Create default accounts,API access tokens for namespaces.
-	-	5.others.
Master Components	cloud-controller-manager	A binary that runs the controllers which interact with cloud providers. It includes:
-	-	1.Node Controller: checking cloud provider,determine if node deleted in cloud after stops responding
-	-	2.Route Controller: For setting up routes in the underlying cloud infrastructure
-	-	3.Service Controller: For creating, updating and deleting cloud provider load balancers
-	-	4. Volume Controller: For creating,attaching,mounting,interacting with cloud provider to orchestrate volumes
Master Components	kube-scheduler	kube-scheduler watches newly created pods that have no node assigned, and selects a node for them to run on.
Master Components	addons	Addons are pods and services that implement cluster features.
-	-	Examples: DNS (Cluster DNS is a DNS server, in addition to the other DNS server(s) in your environment, which serves DNS records for Kubernetes services),
-	-	User interface, Container Resource Monitoring, Cluster-level Logging
Node components	kubelet	The primary node agent. Its main functions:
-	-	1.Watches for pods that have been assigned to its node (either by apiserver or via local configuration file)
-	-	2.Mounts the pod’s required volumes
-	-	3.Downloads the pod’s secrets
-	-	4.Runs the pod’s containers via docker (or, experimentally, rkt).
-	-	5.Periodically executes any requested container liveness probes.
-	-	6.Reports the status of the pod back to the rest of the system, by creating a “mirror pod” if necessary
-	-	7.Reports the status of the node back to the rest of the system.
Node components	kube-proxy	kube-proxy enables the Kubernetes service abstraction by maintaining network rules on the host and performing connection forwarding.
Node components	docker/rkt	for actually running containers.
Node components	supervisord	supervisord is a lightweight process babysitting system for keeping kubelet and docker running.
Node components	fluentd	fluentd is a daemon which helps provide cluster-level logging.

Kubernetes Objects

Understanding Kubernetes Objects

Categories

Category	Names
Resource objects	Pod, ReplicaSet, ReplicationController, Deployment, StatefulSet, DaemonSet, Job, CronJob, HorizontalPodAutoscaler
Configuration objects	Node, Namespace, Service, Secret, ConfigMap, Ingress, Label, ThirdPartyResource, ServiceAccount
Storage objects	Volume, PersistentVolume
Policy objects	SecurityContext, ResourceQuota, LimitRange

Kubernetes Objects are persistent entities in the Kubernetes system. Kubernetes uses these entities to represent the state of your cluster. Specifically, they can describe:

What containerized applications are running (and on which nodes): the applications
The resources available to those applications: the resources
The policies around how those applications behave, such as restart policies, upgrades, and fault-tolerance: the policies

Kubernetes Objects describe the desired state, which makes the system state-driven.

In short, Kubernetes objects are applications, resources, and policies.

Object Spec and Status

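Every object has two nested fields, the Object Spec and the Object Status.
The Object Spec describes the desired state and the Object Status describes the current state; the system works to make the Object Status match the Object Spec.

The Kubernetes Control Plane exists to drive each object's actual state toward its desired state.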
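
The spec/status relationship can be illustrated with a toy reconcile step. This is only a sketch under simplifying assumptions: the `reconcile` function and the single `replicas` field are hypothetical, and real controllers watch the API server rather than receive plain dictionaries.

```python
def reconcile(spec, status):
    """Return the actions needed to move the observed state toward the desired state.

    spec and status are plain dicts with a single hypothetical "replicas" field.
    """
    diff = spec["replicas"] - status["replicas"]
    if diff > 0:
        return ["create"] * diff      # too few replicas: create the missing ones
    if diff < 0:
        return ["delete"] * (-diff)   # too many replicas: delete the surplus
    return []                         # actual state already matches desired state


print(reconcile({"replicas": 3}, {"replicas": 1}))
```

A control loop would call such a step repeatedly, so that any drift between status and spec is corrected on the next pass.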


Name / NameSpace

Omitted.

Labels and Selectors

Labels are key/value pairs that are attached to objects, such as pods. Labels are intended to be used to specify identifying attributes of objects that are meaningful and relevant to users, but which do not directly imply semantics to the core system.
Labels are not unique: many objects can carry the same label.
Via a label selector, the client/user can identify a set of objects. The label selector is the core grouping primitive in Kubernetes.
The API currently supports two types of selectors: equality-based (e.g. environment = production) and set-based (e.g. environment in (production, qa)).

API

For examples, see kubernetes.io/docs/concep…

Labels can be used for LIST and WATCH filtering, and to set references in API objects.

Example of setting references in API objects

Some Kubernetes objects, such as services and replicationcontrollers, also use label selectors to specify sets of other resources, such as pods. However, they only support equality-based requirement selectors.

"selector": {
    "component" : "redis"
}

Newer resources, such as Job, Deployment, ReplicaSet, and DaemonSet, support set-based requirements as well.

selector:
  matchLabels:
    component: redis
  matchExpressions:
    - {key: tier, operator: In, values: [cache]}
    - {key: environment, operator: NotIn, values: [dev]}

Another use case is selecting nodes by label.
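
How a selector with matchLabels and matchExpressions is evaluated can be sketched in a few lines. This is an illustrative approximation, not the Kubernetes implementation; the `matches` function is a hypothetical helper, and all requirements are ANDed together.

```python
def matches(labels, match_labels=None, match_expressions=None):
    """Return True if the label dict satisfies every requirement of the selector."""
    # Equality-based part: every key must be present with exactly this value.
    for key, value in (match_labels or {}).items():
        if labels.get(key) != value:
            return False
    # Set-based part: each expression is one requirement.
    for expr in match_expressions or []:
        key, op, values = expr["key"], expr["operator"], expr.get("values", [])
        if op == "In":
            ok = key in labels and labels[key] in values
        elif op == "NotIn":
            ok = key not in labels or labels[key] not in values
        elif op == "Exists":
            ok = key in labels
        elif op == "DoesNotExist":
            ok = key not in labels
        else:
            raise ValueError("unknown operator: " + op)
        if not ok:
            return False
    return True


expressions = [
    {"key": "tier", "operator": "In", "values": ["cache"]},
    {"key": "environment", "operator": "NotIn", "values": ["dev"]},
]
pod_labels = {"component": "redis", "tier": "cache", "environment": "prod"}
print(matches(pod_labels, {"component": "redis"}, expressions))
```

The sample selector mirrors the YAML above: a pod labeled environment=dev would be rejected by the NotIn requirement.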

Annotations

Annotations attach arbitrary metadata to objects.

How they differ from labels:
You can use either labels or annotations to attach metadata to Kubernetes objects. Labels can be used to select objects and to find collections of objects that satisfy certain conditions. In contrast, annotations are not used to identify and select objects. The metadata in an annotation can be small or large, structured or unstructured, and can include characters not permitted by labels.

The Kubernetes API

Complete API details are documented using Swagger v1.2 and OpenAPI (that is, Swagger 2.0).

API versioning

For example, /api/v1. Versions are classified by stability: stable (v1), alpha (v1alpha1), and beta (v2beta3).

API groups

API groups make it easier to extend the Kubernetes API.
Currently there are several API groups in use:

the core (oftentimes called “legacy”, due to not having explicit group name) group, which is at REST path /api/v1 and is not specified as part of the apiVersion field, e.g. apiVersion: v1.
the named groups are at REST path /apis/$GROUP_NAME/$VERSION, and use apiVersion: $GROUP_NAME/$VERSION (e.g. apiVersion: batch/v1; another example is /apis/apps/v1beta2/).

There are currently two ways to extend the API: CustomResourceDefinition and kube-aggregator.

An API group can be enabled or disabled when the apiserver starts, for example:

--runtime-config=extensions/v1beta1/deployments=false,extensions/v1beta1/ingress=false
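
The mapping from an apiVersion value to its REST path prefix, as described in the two cases above, can be sketched as follows. The `rest_prefix` helper is hypothetical and exists only to illustrate the rule.

```python
def rest_prefix(api_version):
    """Map an apiVersion string to the REST path prefix that serves it."""
    if "/" not in api_version:
        # Core ("legacy") group: no group name, served under /api.
        return "/api/" + api_version
    # Named group: $GROUP_NAME/$VERSION, served under /apis.
    group, version = api_version.split("/", 1)
    return "/apis/{}/{}".format(group, version)


for value in ("v1", "batch/v1", "apps/v1beta2"):
    print(value, "->", rest_prefix(value))
```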

API Conventions

This part comes from github.com/kubernetes/…

Kinds fall into three categories:

Objects represent a persistent entity in the system. Examples: Pod, ReplicationController, Service, Namespace, Node
Lists are collections of resources of one (usually) or more (occasionally) kinds. Examples: PodLists, ServiceLists, NodeLists
Simple kinds are used for specific actions on objects and for non-persistent entities. Many simple resources are "subresources", such as /binding, /status, and /scale, each covering a small part of a resource.

Resources

All JSON objects returned by an API MUST have the following fields:

kind: a string that identifies the schema this object should have
apiVersion: a string that identifies the version of the schema the object should have

Objects

Object content	Description
Metadata	MUST: namespace, name, uid; SHOULD: resourceVersion, generation, creationTimestamp, deletionTimestamp, labels, annotations
Spec and Status	Status (current) is driven toward Spec (desired). A /status subresource MUST be provided to enable system components to update statuses of resources they manage. Status is often expressed as Conditions.
References to related objects	ObjectReference type

Lists and Simple kinds

Differing Representations

Verbs on Resources

github.com/kubernetes/…

PATCH is special in that it supports three patch types:

JSON Patch
Merge Patch
Strategic Merge Patch
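
To make the middle option concrete, here is a minimal sketch of JSON Merge Patch semantics as defined in RFC 7386, assuming that is what "Merge Patch" refers to here. The `merge_patch` function is illustrative and is not the API server's code.

```python
def merge_patch(target, patch):
    """Apply a JSON Merge Patch (RFC 7386) to target and return the result."""
    if not isinstance(patch, dict):
        # A non-object patch (scalar, list, null) replaces the target wholesale.
        return patch
    if not isinstance(target, dict):
        target = {}
    result = dict(target)
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)      # null means "remove this key"
        else:
            result[key] = merge_patch(result.get(key), value)
    return result


original = {"metadata": {"labels": {"a": "1", "b": "2"}}, "spec": {"replicas": 1}}
patch = {"metadata": {"labels": {"b": None, "c": "3"}}, "spec": {"replicas": 3}}
print(merge_patch(original, patch))
```

Note that lists are replaced as a whole under these semantics. Strategic Merge Patch differs mainly in how it merges lists, which this sketch does not attempt to model.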

Idempotency

All compatible Kubernetes APIs MUST support "name idempotency" and respond with an HTTP status code 409 (Conflict) when a POST names an object that already exists.

Optional vs. Required

Optional fields have the following properties:

They have +optional struct tag in Go.
They are a pointer type in the Go definition or have a built-in nil value
The API server should allow POSTing and PUTing a resource with this field unset

Use +optional, not omitempty, to mark a field as optional.

Defaulting

Late Initialization

Concurrency Control and Consistency

resourceVersion is used for concurrency control.
All Kubernetes resources have a "resourceVersion" field as part of their metadata.
Kubernetes leverages the concept of resource versions to achieve optimistic concurrency.
The resourceVersion is changed by the server every time an object is modified.
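
The optimistic-concurrency idea can be sketched with a small in-memory store. Everything here (`Store`, `Conflict`, the integer counter) is hypothetical; the real API server treats resourceVersion as an opaque string and reports a stale write with a 409 response.

```python
class Conflict(Exception):
    """Raised when a write carries a stale resourceVersion."""


class Store:
    def __init__(self):
        self._objects = {}
        self._counter = 0

    def create(self, name, spec):
        self._counter += 1
        self._objects[name] = {"name": name, "resourceVersion": str(self._counter), "spec": spec}
        return dict(self._objects[name])

    def get(self, name):
        return dict(self._objects[name])

    def update(self, obj):
        current = self._objects[obj["name"]]
        # Reject the write if someone else modified the object since it was read.
        if obj["resourceVersion"] != current["resourceVersion"]:
            raise Conflict("stale resourceVersion " + obj["resourceVersion"])
        self._counter += 1
        self._objects[obj["name"]] = dict(obj, resourceVersion=str(self._counter))
        return dict(self._objects[obj["name"]])


store = Store()
store.create("web", {"replicas": 1})
first, second = store.get("web"), store.get("web")
first["spec"] = {"replicas": 2}
store.update(first)                    # succeeds and bumps resourceVersion
second["spec"] = {"replicas": 5}
try:
    store.update(second)               # second reader still holds the old version
except Conflict as err:
    print("conflict:", err)
```

A client that hits the conflict would typically re-read the object, reapply its change, and retry.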

Serialization Format

Units

Selecting Fields

Object references

HTTP Status codes

Response Status Kind

Which API calls return the Status kind:
Kubernetes will always return the Status kind from any API endpoint when an error occurs. Clients SHOULD handle these types of objects when appropriate.

A Status kind will be returned by the API in two cases:
When an operation is not successful (i.e. when the server would return a non 2xx HTTP status code).
When a HTTP DELETE call is successful.

$ curl -v -k -H "Authorization: Bearer WhCDvq4VPpYhrcfmF6ei7V9qlbqTubUc" https://10.240.122.184:443/api/v1/namespaces/default/pods/grafana

> GET /api/v1/namespaces/default/pods/grafana HTTP/1.1
> User-Agent: curl/7.26.0
> Host: 10.240.122.184
> Accept: */*
> Authorization: Bearer WhCDvq4VPpYhrcfmF6ei7V9qlbqTubUc
>

< HTTP/1.1 404 Not Found
< Content-Type: application/json
< Date: Wed, 20 May 2015 18:10:42 GMT
< Content-Length: 232
<
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "pods \"grafana\" not found",
  "reason": "NotFound",
  "details": {
    "name": "grafana",
    "kind": "pods"
  },
  "code": 404
}

Events

Naming conventions

Label, selector, and annotation conventions

WebSockets and SPDY

The API therefore exposes certain operations over upgradeable HTTP connections (described in RFC 2817) via the WebSocket and SPDY protocols.
Two protocols are supported:

Streamed channels: Kubernetes supports a SPDY based framing protocol that leverages SPDY channels and a WebSocket framing protocol that multiplexes multiple channels onto the same stream by prefixing each binary chunk with a byte indicating its channel
Streaming response: HTTP Chunked Transfer-Encoding
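
The WebSocket framing mentioned above, where each binary chunk is prefixed with one byte naming its channel, can be sketched like this. The `frame` and `unframe` helpers are hypothetical, and the channel numbers in the example are illustrative assumptions rather than values taken from the source.

```python
def frame(channel, payload):
    """Prefix a binary chunk with a single byte identifying its channel."""
    if not 0 <= channel <= 255:
        raise ValueError("channel must fit in one byte")
    return bytes([channel]) + payload


def unframe(message):
    """Split a framed message back into (channel, payload)."""
    if not message:
        raise ValueError("empty message has no channel byte")
    return message[0], message[1:]


# Two logical streams multiplexed onto one connection (channel ids are assumed).
wire = [frame(1, b"hello\n"), frame(2, b"warning\n")]
for message in wire:
    print(unframe(message))
```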

Validation

Kubernetes Architecture

Nodes

Node Status	Description
Addresses	HostName/ExternalIP/InternalIP
Condition	OutOfDisk / Ready / MemoryPressure / DiskPressure / NetworkUnavailable
Capacity
Info

Management

Node Controller
The node controller is a Kubernetes master component which manages various aspects of nodes.

Responsibilities:

assigning a CIDR block to the node when it is registered
keeping the node controller’s internal list of nodes up to date with the cloud provider’s list of available machines
monitoring the nodes’ health
Starting in Kubernetes 1.6, the NodeController is also responsible for evicting pods that are running on nodes with NoExecute taints, when the pods do not tolerate those taints.
Starting in version 1.8, the node controller can be made responsible for creating taints that represent Node conditions.

Master-Node communication

Concepts Underlying the Cloud Controller Manager

The CCM consolidates all of the cloud-dependent logic from the preceding three components to create a single point of integration with the cloud. The new architecture with the CCM looks like this

image

TODO

Extending the Kubernetes API

Custom Resources

Custom resources

Custom controllers

CustomResourceDefinitions

API server aggregation

Extending the Kubernetes API with the aggregation layer

Containers

Images

Updating Images

The default pull policy is IfNotPresent which causes the Kubelet to not pull an image if it already exists.
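To force a pull, use imagePullPolicy: Always. The recommended practice is "versioned tag + IfNotPresent" rather than "latest + Always", because with latest you cannot tell which version is actually running. In practice, though, the pull is delegated to a runtime such as docker, so even with Always large amounts of data are not downloaded again because the layers already exist locally. In that respect Always is harmless.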

Using a Private Registry

Available options:
Using Google Container Registry
Using AWS EC2 Container Registry
Using Azure Container Registry (ACR)

Configuring Nodes to Authenticate to a Private Repository

Via $HOME/.docker/config.json (open question: what happens when the credentials expire?)

Pre-pulling Images

Specifying ImagePullSecrets on a Pod

Creating a Secret with a Docker Config

$ kubectl create secret docker-registry myregistrykey --docker-server=DOCKER_REGISTRY_SERVER --docker-username=DOCKER_USER --docker-password=DOCKER_PASSWORD --docker-email=DOCKER_EMAIL
secret "myregistrykey" created.

Bypassing kubectl create secrets

You can also bypass kubectl and create the secret from YAML, using the contents of .docker/config.json.
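
As a rough sketch of that manual route, the snippet below builds a Secret manifest of type kubernetes.io/dockerconfigjson from registry credentials. The `docker_registry_secret` helper and the sample server and credentials are made up for illustration; the output is JSON, which is also valid YAML.

```python
import base64
import json


def docker_registry_secret(name, server, username, password, email):
    """Build a Secret manifest holding a docker config for one registry."""
    auth = base64.b64encode("{}:{}".format(username, password).encode()).decode()
    config = {
        "auths": {
            server: {"username": username, "password": password, "email": email, "auth": auth}
        }
    }
    # Secret data values must be base64-encoded.
    encoded = base64.b64encode(json.dumps(config).encode()).decode()
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "kubernetes.io/dockerconfigjson",
        "data": {".dockerconfigjson": encoded},
    }


# Placeholder values only; substitute your own registry and credentials.
manifest = docker_registry_secret(
    "myregistrykey", "registry.example.com", "DOCKER_USER", "DOCKER_PASSWORD", "DOCKER_EMAIL"
)
print(json.dumps(manifest, indent=2))
```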

Referring to an imagePullSecrets on a Pod

How to use the imagePullSecrets you created:
Specify it in the pod spec, or have it set automatically through a ServiceAccount.

You can use this in conjunction with a per-node .docker/config.json. The credentials will be merged. This approach will work on Google Container Engine (GKE).

apiVersion: v1
kind: Pod
metadata:
  name: foo
  namespace: awesomeapps
spec:
  containers:
    - name: foo
      image: janedoe/awesomeapp:v1
  imagePullSecrets:
    - name: myregistrykey

Use Cases

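Among the use cases, the AlwaysPullImages admission controller deserves attention. It sometimes needs to be enabled, for example in multi-tenant clusters, because otherwise a pod may be able to use another tenant's images that are already cached on the node.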

Container Environment Variables

Container information

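Much metadata, such as pod information, can be exposed as environment variables through the downward API.
Secrets can also be exposed as environment variables.
Custom environment variables can be defined in the pod spec.

For the various ways of exposing metadata to a container as files or environment variables, see kubernetes.io/docs/tasks/… and the related documentation.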

Cluster information

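The host and port of every service that exists when the container is created are injected into the container as environment variables (as far as I can tell, only services in the same namespace). This means services can be reached even without the DNS addon, although this approach is unreliable.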
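
The naming rule for these variables can be sketched as follows: the service name is upper-cased, dashes become underscores, and _SERVICE_HOST and _SERVICE_PORT are appended. The `service_env` helper and the sample values are illustrative only.

```python
def service_env(name, cluster_ip, port):
    """Return the host/port environment variables derived from a service name."""
    prefix = name.upper().replace("-", "_")
    return {
        prefix + "_SERVICE_HOST": cluster_ip,
        prefix + "_SERVICE_PORT": str(port),
    }


print(service_env("redis-master", "10.0.0.11", 6379))
```

Inside a container, an application would read such variables with `os.environ.get`, keeping in mind that services created after the container started will be missing.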

Container Lifecycle Hooks

Container Hooks

Hook Details

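There are currently two hooks, PostStart and PreStop. If a hook call hangs, the Pod's state transition is blocked.

PostStart: executes immediately after a container is created. It is not guaranteed to run before the ENTRYPOINT.
PreStop: called immediately before a container is terminated. It runs synchronously, and the time it may take is bounded by the grace period.

Hook Handler Implementations

Two handler types are supported: Exec and HTTP.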

Hook Handler Execution

Hook handler calls are synchronous within the context of the Pod containing the Container. This means that for a PostStart hook, the Container ENTRYPOINT and hook fire asynchronously. However, if the hook takes too long to run or hangs, the Container cannot reach a running state.
The behavior is similar for a PreStop hook. If the hook hangs during execution, the Pod phase stays in a Terminating state and is killed after terminationGracePeriodSeconds of pod ends. If a PostStart or PreStop hook fails, it kills the Container.

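As the above shows, PostStart and PreStop are currently designed for very lightweight commands. For anything heavier, consider an init container, or a defer container (not implemented yet; there is an open issue).

Hook delivery guarantees

Delivery is at least once: a hook is normally called exactly once, but it may occasionally be called more than once.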

Debugging Hook Handlers

If a handler fails for some reason, it broadcasts an event.
You can see these events by running kubectl describe pod

Workloads

Pods

Pod Overview

What a Pod is: the smallest unit of deployment. It wraps one or more application containers, (shared) storage resources, a network IP, and options.
A Pod encapsulates an application container (or, in some cases, multiple containers), storage resources, a unique network IP, and options that govern how the container(s) should run. A Pod represents a unit of deployment: a single instance of an application in Kubernetes, which might consist of either a single container or a small number of containers that are tightly coupled and that share resources.
References:
blog.kubernetes.io/2015/06/the… (use cases for multiple containers in one pod: Sidecar (git, log...), Ambassador (proxy, transparent proxying), Adapter (exporter)...)
blog.kubernetes.io/2016/06/con…

Understanding Pods

image

How Pods manage multiple Containers

An example:

image

Multiple containers in a Pod share:

Networking
Storage

Working with Pods

Pods are designed as relatively ephemeral, disposable entities. Pods do not, by themselves, self-heal. Kubernetes uses a higher-level abstraction, called a Controller, that handles the work of managing the relatively disposable Pod instances.

Pods and Controllers

A Controller can create and manage multiple Pods for you, handling replication and rollout and providing self-healing capabilities at cluster scope. For example, if a Node fails, the Controller might automatically replace the Pod by scheduling an identical replacement on a different Node.

Some examples of Controllers that contain one or more pods include:

Deployment
StatefulSet
DaemonSet

Pod Templates

Controllers use Pod Templates to make actual pods.
Unlike a pod, which specifies the desired state of all containers belonging to it, a pod template does not specify the desired state of all replicas.

Pod Lifecycle

Pod phase

A Pod’s status field is a PodStatus object, which has a phase field.

Phase	Description
Pending	The Pod has been accepted by the Kubernetes system, but one or more of the Container images has not been created.
Running	The Pod has been bound to a node, and all of the Containers have been created. At least one Container is still running, or is in the process of starting or restarting
Succeeded	All Containers in the Pod have terminated in success, and will not be restarted.
Failed	All Containers in the Pod have terminated, and at least one Container has terminated in failure.
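Unknown	For some reason the state of the Pod could not be obtained, typically due to an error in communicating with the host of the Pod.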

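Pod termination

1. The user sends a command to delete the pod; the default grace period is 30 seconds.
2. The API server updates the Pod with the time after which it is considered "dead", together with the grace period.
3. The Pod shows up as "Terminating" in client commands.
4. (Simultaneously with step 3) When the kubelet sees that the pod has been marked as terminating, it starts shutting down the pod's processes:
   1. If the pod defines a preStop hook, it is invoked before the pod is stopped. If the preStop hook is still running when the grace period expires, step 2 is invoked again with a small (2-second) extension of the grace period.
   2. The processes in the Pod are sent the TERM signal.
5. (Simultaneously with step 3) The Pod is removed from the service's endpoints list and is no longer counted as part of the replication controller's set of running pods. Pods that shut down slowly may continue to serve traffic that load balancers forward to them.
6. When the grace period expires, any processes still running in the Pod are killed with SIGKILL.
7. The kubelet finishes deleting the Pod on the API server by setting the grace period to 0 (immediate deletion). The Pod disappears from the API and is no longer visible from the client.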
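
From the application's side, the sequence above means a process should react to TERM and finish before the grace period ends. The following is a minimal sketch of such a handler; the `GracefulShutdown` class is a hypothetical helper, and it assumes the process runs as a normal Python program whose main thread installs the handler.

```python
import signal


class GracefulShutdown:
    """Records that SIGTERM arrived so a work loop can stop cleanly."""

    def __init__(self):
        self.requested = False

    def install(self):
        # Must be called from the main thread; replaces the default
        # behavior, which would terminate the process immediately.
        signal.signal(signal.SIGTERM, self._on_term)

    def _on_term(self, signum, frame):
        self.requested = True


shutdown = GracefulShutdown()
shutdown.install()
# A server loop would check shutdown.requested between units of work,
# stop accepting new requests, finish in-flight ones, and then exit
# before the grace period expires and SIGKILL arrives.
```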

Pod conditions

A Pod has a PodStatus, which has an array of PodConditions. Each element of the PodCondition array has a type field and a status field.

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2017-10-28T06:30:03Z
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: 2017-10-28T06:30:13Z
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: 2017-10-28T06:30:03Z
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://dd82608cabe226247bcbc8d5fbce6121edf935320486c41046481000dbb7784f
    image: deis/brigade-api:latest
    imageID: docker-pullable://deis/brigade-api@sha256:943cf822adddf6869ff02d2e1a55cbb19c96d01be41e88d1d56bc16a50f5c91f
    lastState: {}
    name: brigade
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: 2017-10-28T06:30:06Z

Container probes

A Probe is a diagnostic performed periodically by the kubelet on a Container. To perform a diagnostic, the kubelet calls a Handler implemented by the Container.

Three kinds of handlers:

ExecAction
TCPSocketAction
HTTPGetAction

Three possible results: Success, Failure, Unknown
Two probe types: livenessProbe (tied to the restart policy) and readinessProbe
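
To give a feel for what a TCPSocketAction check amounts to, here is a small sketch that reports Success if a TCP connection can be opened and Failure otherwise. The `tcp_probe` function is hypothetical and is not the kubelet's code; it ignores periodic scheduling and failure thresholds.

```python
import socket


def tcp_probe(host, port, timeout=1.0):
    """Return "Success" if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "Success"
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return "Failure"


# Probe a throwaway local listener to show both outcomes.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]
print(tcp_probe("127.0.0.1", port))
listener.close()
print(tcp_probe("127.0.0.1", port))
```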

When should you use liveness or readiness probes?

todo

Pod and Container status

Restart policy

Pod lifetime

Use a Job for Pods that are expected to terminate, for example, batch computations. Jobs are appropriate only for Pods with restartPolicy equal to OnFailure or Never.
Use a ReplicationController, ReplicaSet, or Deployment for Pods that are not expected to terminate, for example, web servers. ReplicationControllers are appropriate only for Pods with a restartPolicy of Always.
Use a DaemonSet for Pods that need to run one per machine, because they provide a machine-specific system service.

image

Examples

Pod with a single container

Note that if a pod is designed to run to completion, its restartPolicy must not be Always.

Current pod phase	Container event	Pod restartPolicy	Action on container	Log	Resulting pod phase
Running	exits with success	Always	Restart Container	Log completion event	Running
Running	exits with success	OnFailure	-	Log completion event	Succeeded
Running	exits with success	Never	-	Log completion event	Succeeded
Running	exits with failure	Always	Restart Container	Log failure event	Running
Running	exits with failure	OnFailure	Restart Container	Log failure event	Running
Running	exits with failure	Never	-	Log failure event	Failed
Running	oom	Always	Restart Container	Log OOM event	Running
Running	oom	OnFailure	Restart Container	Log OOM event	Running
Running	oom	Never	-	Log OOM event	Failed
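
The table above can be condensed into a small decision function. This is only a restatement of the table for a single-container pod; `on_container_exit` is a hypothetical helper, and an OOM kill is treated as a failure.

```python
def on_container_exit(restart_policy, succeeded):
    """Return (restart_container, resulting_pod_phase) for a single-container pod."""
    if restart_policy == "Always":
        return (True, "Running")
    if restart_policy == "OnFailure":
        # Only failures (including OOM) trigger a restart.
        return (False, "Succeeded") if succeeded else (True, "Running")
    if restart_policy == "Never":
        return (False, "Succeeded") if succeeded else (False, "Failed")
    raise ValueError("unknown restartPolicy: " + restart_policy)


for policy in ("Always", "OnFailure", "Never"):
    print(policy, on_container_exit(policy, True), on_container_exit(policy, False))
```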

Pod with two containers

Current pod phase	container1 event	Pod restartPolicy	Action on container	Log	Resulting pod phase
Running	exits with failure	Always	Restart Container	Log failure event	Running
Running	exits with failure	OnFailure	Restart Container	Log failure event	Running
Running	exits with failure	Never	-	Log failure event	Running; if container2 also exits => Failed

Init Containers

Commonly used to perform set-up, or to wait for set-up to finish.
Init Containers are exactly like regular Containers, except:

They always run to completion.
Each one must complete successfully before the next one is started.

Detailed behavior

A Pod cannot be Ready until all Init Containers have succeeded.
If the Pod is restarted, all Init Containers must execute again.
readinessProbe and the like cannot be used on Init Containers.
Use activeDeadlineSeconds on the Pod and livenessProbe on the Container to prevent Init Containers from failing forever.

Pod Preset

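A pod preset is a way to inject metadata into pods.
A pod preset selects a class of pods and transparently modifies their pod specs at the admission controller, dynamically injecting information they depend on, such as env and volume mounts.

Behavior:
When a PodPreset is applied to one or more Pods, Kubernetes modifies the pod spec. For Env, EnvFrom, and VolumeMounts, Kubernetes modifies the spec of every container in the Pod; for Volume, Kubernetes modifies the Pod Spec.

Example: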

kind: PodPreset
apiVersion: settings.k8s.io/v1alpha1
metadata:
  name: allow-database
  namespace: myns
spec:
  selector:
    matchLabels:
      role: frontend
  env:
    - name: DB_PORT
      value: "6379"
  volumeMounts:
    - mountPath: /cache
      name: cache-volume
  volumes:
    - name: cache-volume
      emptyDir: {}

Reference: www.jianshu.com/p/83fe99a5e…
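
The injection behavior described above can be sketched on plain dictionaries. `apply_preset` is a hypothetical helper that only handles matchLabels, env, volumeMounts, and volumes; it skips EnvFrom and the conflict handling a real admission controller would need.

```python
import copy


def apply_preset(pod, preset):
    """Return a copy of pod with the preset injected, or pod itself if it does not match."""
    spec = preset["spec"]
    labels = pod["metadata"].get("labels", {})
    wanted = spec["selector"]["matchLabels"]
    if any(labels.get(key) != value for key, value in wanted.items()):
        return pod
    result = copy.deepcopy(pod)
    # env and volumeMounts go into every container of the pod.
    for container in result["spec"]["containers"]:
        container.setdefault("env", []).extend(copy.deepcopy(spec.get("env", [])))
        container.setdefault("volumeMounts", []).extend(copy.deepcopy(spec.get("volumeMounts", [])))
    # volumes go into the pod spec itself.
    result["spec"].setdefault("volumes", []).extend(copy.deepcopy(spec.get("volumes", [])))
    return result


preset = {
    "spec": {
        "selector": {"matchLabels": {"role": "frontend"}},
        "env": [{"name": "DB_PORT", "value": "6379"}],
        "volumeMounts": [{"mountPath": "/cache", "name": "cache-volume"}],
        "volumes": [{"name": "cache-volume", "emptyDir": {}}],
    }
}
pod = {"metadata": {"labels": {"role": "frontend"}}, "spec": {"containers": [{"name": "web"}]}}
print(apply_preset(pod, preset)["spec"])
```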

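Pod Security Policy

Admission control that includes PodSecurityPolicy governs the creation and modification of resources in the cluster, based on the capabilities those resources are permitted cluster-wide.
If a policy matches the request, the Pod is accepted; if the request matches no PSP, the Pod is rejected.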

jimmysong.io/kubernetes-…

Disruptions

Voluntary and Involuntary Disruptions

Unavoidable cases are involuntary disruptions to an application, for example: hardware failure, kernel panic, a node disappearing, or eviction of a pod because the node is out of resources.
Voluntary disruptions include, for example: deleting or updating a deployment or pod, and draining a node for repair, for an upgrade, or to scale the cluster down.

Dealing with Disruptions

How to mitigate involuntary disruptions: declare the resources you need, then replicate and spread.

Ensure your pod requests the resources it needs.
Replicate your application if you need higher availability
For even higher availability when running replicated applications, spread applications across racks (using anti-affinity) or across zones (if using a multi-zone cluster.)

How Disruption Budgets Work

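In Kubernetes, applications need to be deployed as replicated sets of pods so that the business is not interrupted and its SLA is not degraded. A PodDisruptionBudget sets either the minimum number or the minimum percentage of an application's pods that must be running, so that voluntary disruptions never take down too many of the application's pods at once.

Use tools that call the Eviction API, such as kubectl drain, instead of deleting pods directly, because the Eviction API respects Pod Disruption Budgets.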

PDBs cannot prevent involuntary disruptions from occurring, but they do count against the budget.
Pods which are deleted or unavailable due to a rolling upgrade to an application do count against the disruption budget, but controllers (like deployment and stateful-set) are not limited by PDBs when doing rolling upgrades – the handling of failures during application updates is configured in the controller spec.
When a pod is evicted using the eviction API, it is gracefully terminated.

References:
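
The budget arithmetic can be sketched as follows. `disruptions_allowed` is a hypothetical helper covering only a minAvailable-style budget; rounding a percentage up to a whole pod is stated here as an assumption about the controller's behavior.

```python
import math


def disruptions_allowed(healthy, expected, min_available):
    """Return how many pods may be evicted right now without breaking the budget.

    min_available is either an integer or a percentage string such as "50%".
    """
    if isinstance(min_available, str) and min_available.endswith("%"):
        # Assumption: percentages are resolved against the expected pod count
        # and rounded up to a whole pod.
        required = math.ceil(expected * int(min_available[:-1]) / 100)
    else:
        required = int(min_available)
    return max(0, healthy - required)


print(disruptions_allowed(healthy=3, expected=3, min_available=2))
print(disruptions_allowed(healthy=2, expected=3, min_available=2))
```

When the second call returns zero, an eviction request would be refused until another pod becomes healthy, which is why involuntary failures still eat into the budget.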
www.kubernetes.org.cn/2486.html
ju.outofmemory.cn/entry/32756…

PDB Example

Separating Cluster Owner and Application Owner Roles

How to perform Disruptive Actions on your Cluster

Write disruption tolerant applications and use PDBs

Controllers

Replica Sets

ReplicaSets are usually not used directly but through Deployments.
They are mainly used by Deployments as a mechanism to orchestrate pod creation, deletion and updates.

When to use a ReplicaSet

A ReplicaSet ensures that a specified number of pod replicas are running at any given time.

Working with ReplicaSets

Some operations:

kubectl delete. Kubectl will scale the ReplicaSet to zero and wait for it to delete each pod before deleting the ReplicaSet itself.
--cascade=false deletes only the ReplicaSet and leaves its pods running.
Changing a pod's labels isolates it from a ReplicaSet; the removed pod is replaced automatically.
Scale: update .spec.replicas
A ReplicaSet can also be a target for Horizontal Pod Autoscalers (HPA), for automatic scaling.

Replication Controller

Omitted; no longer recommended.

Deployments

A Deployment controller provides declarative updates for Pods and ReplicaSets.

Use case

Create a Deployment to rollout a ReplicaSet. The ReplicaSet creates Pods in the background. Check the status of the rollout to see if it succeeds or not.
Declare the new state of the Pods by updating the PodTemplateSpec of the Deployment. A new ReplicaSet is created and the Deployment manages moving the Pods from the old ReplicaSet to the new one at a controlled rate. Each new ReplicaSet updates the revision of the Deployment.
Rollback to an earlier Deployment revision if the current state of the Deployment is not stable. Each rollback updates the revision of the Deployment.
Scale up the Deployment to facilitate more load.
Pause the Deployment to apply multiple fixes to its PodTemplateSpec and then resume it to start a new rollout.
Use the status of the Deployment as an indicator that a rollout has stuck.
Clean up older ReplicaSets that you don’t need anymore

Create

Pod-template-hash label: this label ensures that child ReplicaSets of a Deployment do not overlap. It is generated by hashing the PodTemplate of the ReplicaSet and using the resulting hash as the label value that is added to the ReplicaSet selector, Pod template labels, and in any existing Pods that the ReplicaSet might have.

Update

Deployment can ensure that only a certain number of Pods may be down while they are being updated. By default, it ensures that at least 1 less than the desired number of Pods are up (1 max unavailable).

rollout, rollout history/status, undo......

Scaling

Proportional scaling: during a RollingUpdate (maxSurge, maxUnavailable), the total number of pods may briefly exceed the desired count.

$ kubectl get deploy
NAME                 DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment     10        10        10           10          50s

$ kubectl set image deploy/nginx-deployment nginx=nginx:sometag
deployment "nginx-deployment" image updated

$ kubectl get rs
NAME                          DESIRED   CURRENT   READY     AGE
nginx-deployment-1989198191   5         5         0         9s
nginx-deployment-618515232    8         8         8         1m
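
The bounds that maxSurge and maxUnavailable place on a rollout can be worked out with a short sketch. `rollout_bounds` is a hypothetical helper; the listing above (13 pods in total, 8 ready) is consistent with maxSurge=3 and maxUnavailable=2 for 10 replicas, but those settings are an assumption because the source does not state them.

```python
import math


def rollout_bounds(replicas, max_surge, max_unavailable):
    """Return (max_total_pods, min_available_pods) during a rolling update.

    Each setting is an integer or a percentage string such as "25%".
    """
    def resolve(value, round_up):
        if isinstance(value, str) and value.endswith("%"):
            exact = replicas * int(value[:-1]) / 100
            return math.ceil(exact) if round_up else math.floor(exact)
        return int(value)

    surge = resolve(max_surge, round_up=True)               # surge percentages round up
    unavailable = resolve(max_unavailable, round_up=False)  # unavailable percentages round down
    return replicas + surge, replicas - unavailable


print(rollout_bounds(10, 3, 2))
print(rollout_bounds(10, "25%", "25%"))
```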

Pausing and Resuming

Deployment status

Clean up Policy

You can set .spec.revisionHistoryLimit field in a Deployment to specify how many old ReplicaSets for this Deployment you want to retain

Note: canary deployments are not supported directly at present; the recommended approach is to use multiple Deployments.

StatefulSets

StatefulSets replaced PetSets as of 1.5. A StatefulSet manages the deployment and scaling of a set of Pods, and provides guarantees about the ordering and uniqueness of these Pods.
Stateful means:

Stable, unique network identifiers.
Stable, persistent storage.
Ordered, graceful deployment and scaling. (A Deployment's rolling update is not this strict.)
Ordered, graceful deletion and termination.

Components

Example components of a StatefulSet:

A Headless Service (with a selector), named nginx, is used to control the network domain. This kind of service has no load balancer and is not handled by kube-proxy; DNS returns the backend endpoints directly.
The StatefulSet, named web, has a Spec that indicates that 3 replicas of the nginx container will be launched in unique Pods.
The volumeClaimTemplates will provide stable storage using PersistentVolumes provisioned by a PersistentVolume Provisioner.

apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1beta2
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 3 # by default is 1
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: gcr.io/google_containers/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: my-storage-class
      resources:
        requests:
          storage: 1Gi

Pod Identity

Ordinal Index: each Pod in the StatefulSet will be assigned an integer ordinal, in the range [0,N), that is unique over the Set.
Stable Network ID: the pattern for the constructed hostname is $(statefulset name)-$(ordinal). The example above will create three Pods named web-0, web-1, web-2. A StatefulSet can use a Headless Service to control the domain of its Pods.
Stable Storage: PersistentVolume

Cluster Domain	Service (ns/name)	StatefulSet (ns/name)	StatefulSet Domain	Pod DNS	Pod Hostname
cluster.local	default/nginx	default/web	nginx.default.svc.cluster.local	web-{0..N-1}.nginx.default.svc.cluster.local	web-{0..N-1}
cluster.local	foo/nginx	foo/web	nginx.foo.svc.cluster.local	web-{0..N-1}.nginx.foo.svc.cluster.local	web-{0..N-1}
kube.local	foo/nginx	foo/web	nginx.foo.svc.kube.local	web-{0..N-1}.nginx.foo.svc.kube.local	web-{0..N-1}

Deployment and Scaling Guarantees

For a StatefulSet with N replicas, when Pods are being deployed, they are created sequentially, in order from {0..N-1}.
When Pods are being deleted, they are terminated in reverse order, from {N-1..0}.
Before a scaling operation is applied to a Pod, all of its predecessors must be Running and Ready.
Before a Pod is terminated, all of its successors must be completely shut down.

In Kubernetes 1.7 and later, StatefulSet allows you to relax its ordering guarantees while preserving its uniqueness and identity guarantees via its .spec.podManagementPolicy field.
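
A minimal sketch of relaxing the ordering, assuming the web StatefulSet from the manifest above. This is only a fragment: the selector, template, and volumeClaimTemplates are omitted and would stay as in that manifest.

```yaml
apiVersion: apps/v1beta2         # matches the manifest above; newer clusters use apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  podManagementPolicy: Parallel  # default is OrderedReady
  serviceName: "nginx"
  replicas: 3
  # selector, template and volumeClaimTemplates omitted (unchanged)
```

With Parallel, Pods are launched and terminated without waiting for their neighbours to become Ready or fully stop; names and storage remain stable.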

Update Strategies

Three options: On Delete, Rolling Updates, and Partitions (a partitioned rolling update).
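
The fragment below sketches how these strategies might be expressed in a StatefulSet spec. It is a partial spec only, and the partition value 2 is an arbitrary choice for illustration.

```yaml
# Fragment of a StatefulSet; all other fields omitted.
spec:
  updateStrategy:
    type: RollingUpdate   # use "OnDelete" to update Pods only when they are deleted manually
    rollingUpdate:
      partition: 2        # only Pods with ordinal >= 2 are updated; lower ordinals keep the old spec
```

Lowering the partition step by step is one way to stage a rollout across the set.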

Daemon Sets

A DaemonSet runs one Pod per node, acting as a daemon.
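
A minimal DaemonSet sketch follows. The name node-agent, the busybox image, and the sleep command are hypothetical placeholders; the apiVersion assumes a current cluster (older ones used extensions/v1beta1 or apps/v1beta2).

```yaml
apiVersion: apps/v1            # assumption: older clusters may need apps/v1beta2
kind: DaemonSet
metadata:
  name: node-agent             # hypothetical name
spec:
  selector:
    matchLabels:
      name: node-agent
  template:
    metadata:
      labels:
        name: node-agent       # must match spec.selector.matchLabels
    spec:
      containers:
      - name: agent
        image: busybox         # placeholder image
        command: ["sh", "-c", "while true; do sleep 3600; done"]
```

There is no replicas field: the controller creates one Pod on each eligible node.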

Alternatives to DaemonSet

Init Scripts
Static Pods: create Pods by writing a file to a certain directory watched by Kubelet.

Garbage Collection

Owners and dependents

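When you delete an object, you can specify whether the object’s dependents are also deleted automatically. Deleting dependents automatically is called cascading deletion. There are two modes of cascading deletion: background and foreground.
Foreground deletion: the root object first enters a "deletion in progress" state. => The garbage collector deletes all of the object's dependents. => The owner object is deleted.
Background deletion: Kubernetes deletes the owner object immediately, and the garbage collector then deletes the dependents in the background.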

When using cascading deletion with Deployments, you must use propagationPolicy: Foreground.
Custom resources do not currently support garbage collection.

Setting the cascading deletion policy

To control the cascading deletion policy, set the deleteOptions.propagationPolicy field on your owner object. Possible values include “Orphan”, “Foreground”, or “Background”.
The default garbage collection policy for many controller resources is orphan, including ReplicationController, ReplicaSet, StatefulSet, DaemonSet, and Deployment.
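
For illustration, the delete options might look like the sketch below. It is written in YAML to match the other manifests in these notes; the same fields can equally be sent as JSON, and either way this is the body of the DELETE request for the owner object.

```yaml
apiVersion: v1
kind: DeleteOptions
propagationPolicy: Foreground   # alternatives: Background, Orphan
```

With Orphan, the dependents are left in place and simply lose their owner.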

Jobs - Run to Completion

todo

Cron Jobs

todo

Configuration

Configuration Best Practices

This part reads a bit like an "Effective Kubernetes":

General Config Tips

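Keep configuration under version control so that it can be rolled back.
Prefer YAML over JSON.
Group related objects into a single YAML file.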
Don’t specify default values unnecessarily – simple and minimal configs will reduce errors.
Put an object description in an annotation to allow better introspection.

Services

Create the Service before the ReplicationController. This lets the scheduler spread the pods that comprise the service.
Don’t use hostPort (use a NodePort Service instead) or hostNetwork unless it is absolutely necessary.
Use headless services for easy service discovery when you don’t need kube-proxy load balancing.

Using Labels

todo

Container Images

Omitted.

Using kubectl

Use kubectl create -f where possible.
Use kubectl run and expose to quickly create and expose single container Deployments.

Managing Compute Resources for Containers

Resource requests and limits

todo

Assigning Pods to Nodes

nodeSelector

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    disktype: ssd

Interlude: built-in node labels

kubernetes.io/hostname
failure-domain.beta.kubernetes.io/zone
failure-domain.beta.kubernetes.io/region
beta.kubernetes.io/instance-type
beta.kubernetes.io/os
beta.kubernetes.io/arch

Affinity and anti-affinity

nodeAffinity
- requiredDuringSchedulingIgnoredDuringExecution
- preferredDuringSchedulingIgnoredDuringExecution
podAffinity
- requiredDuringSchedulingIgnoredDuringExecution
- preferredDuringSchedulingIgnoredDuringExecution
podAntiAffinity
- requiredDuringSchedulingIgnoredDuringExecution
- preferredDuringSchedulingIgnoredDuringExecution

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - e2e-az1
            - e2e-az2
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: another-node-label-key
            operator: In
            values:
            - another-node-label-value
  containers:
  - name: with-node-affinity
    image: gcr.io/google_containers/pause:2.0

Taints and Tolerations

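Node affinity is declared on the Pod and describes which nodes the Pod wants to run on. Taints are the opposite: they are set on nodes and let a node repel Pods that do not tolerate them.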
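
A small sketch of the toleration side, assuming a node has been tainted with a hypothetical key and value (dedicated=experimental). The Pod name and the taint are illustrative only.

```yaml
# Assumes a node was tainted beforehand, for example:
#   kubectl taint nodes <node-name> dedicated=experimental:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: nginx-tolerant        # hypothetical name
spec:
  containers:
  - name: nginx
    image: nginx
  tolerations:
  - key: "dedicated"          # hypothetical taint key
    operator: "Equal"
    value: "experimental"
    effect: "NoSchedule"
```

A toleration only permits scheduling onto the tainted node; it does not force it. Combine it with node affinity if the Pod must land there.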

Secrets

Organizing Cluster Access Using kubeconfig Files

Pod Priority and Preemption

Cluster Administration

Managing Resources

Organizing resource configurations

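Different resources can be written in a single YAML file.
kubectl create/delete ... -f <file, URL, or directory> (--recursive): create or delete resources
kubectl get: retrieve resources
kubectl label/annotate: add labels or annotations
kubectl scale/autoscale/apply/edit/patch/replace: update resources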

Cluster Networking

Highly-coupled container-to-container communications: this is solved by pods and localhost communications.
Pod-to-Pod communications: this is the primary focus of this document.
Pod-to-Service communications: this is covered by services.
External-to-Service communications: this is covered by services.

Kubernetes model

all containers can communicate with all other containers without NAT
all nodes can communicate with all containers (and vice-versa) without NAT
the IP that a container sees itself as is the same IP that others see it as

Implementations: Contiv, Contrail, Flannel, GCE, L2 networks and Linux bridging, Nuage, OpenVSwitch, OVN, Calico, Romana, Weave Net

Network Plugins

CNI plugins: adhere to the appc/CNI specification, designed for interoperability.
Kubenet plugin: implements basic cbr0 using the bridge and host-local CNI plugins

Logging and Monitoring Cluster Activity

Auditing

Kubernetes audit is part of kube-apiserver logging all requests coming to the server.
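
What gets logged can be narrowed with an audit policy file passed to kube-apiserver via --audit-policy-file. The sketch below is an assumption-laden example: the apiVersion applies to newer clusters (older ones used audit.k8s.io/v1beta1 or v1alpha1), and the rule set is arbitrary.

```yaml
apiVersion: audit.k8s.io/v1   # assumption: older clusters use v1beta1
kind: Policy
rules:
# Log full request and response bodies for Pod changes.
- level: RequestResponse
  resources:
  - group: ""                 # core API group
    resources: ["pods"]
# Log only metadata (user, verb, resource, timestamp) for everything else.
- level: Metadata
```

Rules are evaluated in order and the first match decides the level for a request.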

Resource Usage Monitoring

image

Configuring Out Of Resource Handling

Eviction Policy

The kubelet can pro-actively monitor for and prevent against total starvation of a compute resource. In those cases, the kubelet can pro-actively fail one or more pods in order to reclaim the starved resource. When the kubelet fails a pod, it terminates all containers in the pod, and the PodPhase is transitioned to Failed.
Eviction Thresholds:

A soft eviction threshold pairs an eviction threshold with a required administrator specified grace period
A hard eviction threshold has no grace period, and if observed, the kubelet will take immediate action to reclaim the associated starved resource
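
As a sketch, both kinds of threshold could be set in a kubelet configuration file like the one below. The numbers are arbitrary examples; clusters from the era of these notes set the same values through the kubelet flags --eviction-hard, --eviction-soft, and --eviction-soft-grace-period.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"     # act immediately below this
evictionSoft:
  memory.available: "300Mi"     # act only if this persists for the grace period
evictionSoftGracePeriod:
  memory.available: "1m30s"     # required for every soft threshold
```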

Using Multiple Clusters

Federation

Federation makes it easy to manage multiple clusters. It does so by providing 2 major building blocks:
- Sync resources across clusters: Federation provides the ability to keep resources in multiple clusters in sync. This can be used, for example, to ensure that the same deployment exists in multiple clusters.
- Cross cluster discovery: It provides the ability to auto-configure DNS servers and load balancers with backends from all clusters. This can be used, for example, to ensure that a global VIP or DNS record can be used to access backends from multiple clusters.

Setting up Cluster Federation with Kubefed

Cross-cluster Service Discovery using Federated Services

Guaranteed Scheduling For Critical Add-On Pods

The rescheduler ensures that critical add-ons are always scheduled. If the scheduler determines that no node has enough free resources to run the critical add-on pod given the pods that are already running in the cluster, the rescheduler tries to free up space for the add-on by evicting some pods; then the scheduler will schedule the add-on pod.
A temporary taint "CriticalAddonsOnly" can be placed on the node so that it is used only for the critical add-on Pod, preventing other Pods from being scheduled onto it.
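
For the add-on Pod itself to be scheduled onto a node carrying that taint, it needs a matching toleration. The sketch below follows the rescheduler-era convention as I understand it (kube-system namespace, the alpha critical-pod annotation, and a CriticalAddonsOnly toleration); the Pod name is hypothetical, and later Kubernetes releases replaced this mechanism with Pod priority classes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: critical-addon                  # hypothetical name
  namespace: kube-system
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""   # alpha-era marker, value is an empty string
spec:
  tolerations:
  - key: "CriticalAddonsOnly"
    operator: "Exists"
  containers:
  - name: addon
    image: gcr.io/google_containers/pause:2.0        # placeholder image
```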

Static Pods

Static pods are managed directly by the kubelet daemon on a specific node, without the API server observing them. A static pod has no associated replication controller; the kubelet daemon itself watches it and restarts it when it crashes. There is no health check, though. Static pods are always bound to one kubelet daemon and always run on the same node as it.
The kubelet automatically creates a so-called mirror pod on the Kubernetes API server for each static pod, so the pods are visible there, but they cannot be controlled from the API server.

If you are running clustered Kubernetes and are using static pods to run a pod on every node, you should probably be using a DaemonSet!

Static pods are configured through the kubelet flags --pod-manifest-path or --manifest-url.
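
A static pod is described by an ordinary Pod manifest. As a sketch, a file like the following could be dropped into the directory given to --pod-manifest-path (the directory itself, for example /etc/kubelet.d/, is an assumption, as are the names below).

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: static-web            # hypothetical name
  labels:
    role: myrole
spec:
  containers:
  - name: web
    image: nginx
    ports:
    - name: web
      containerPort: 80
      protocol: TCP
```

The kubelet rescans the directory periodically, so removing the file stops the pod.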

Using Sysctls in a Kubernetes Cluster

In Linux, the sysctl interface allows an administrator to modify kernel parameters at runtime. Parameters are available via the /proc/sys/ virtual process file system.
A number of sysctls are namespaced in today’s Linux kernels. This means that they can be set independently for each pod on a node.

Safe sysctl: In addition to proper namespacing a safe sysctl must be properly isolated between pods on the same node.
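
A sketch of setting a namespaced, safe sysctl for one Pod. This uses the securityContext.sysctls field found in newer releases; clusters from the era of these notes expressed the same thing with the security.alpha.kubernetes.io/sysctls annotation. The Pod name and container are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sysctl-example         # hypothetical name
spec:
  securityContext:
    sysctls:
    - name: kernel.shm_rmid_forced   # one of the sysctls in the safe set
      value: "1"
  containers:
  - name: app
    image: busybox             # placeholder image
    command: ["sleep", "3600"]
```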

Accessing Clusters

# Ways to access the REST API
# 1. Through kubectl proxy
$ kubectl proxy --port=8083 &
$ curl localhost:8083/api

# 2. Directly, with a bearer token
$ APISERVER=$(kubectl config view | grep server | cut -f 2- -d ":" | tr -d " ")
$ TOKEN=$(kubectl describe secret $(kubectl get secrets | grep default | cut -f1 -d ' ') | grep -E '^token' | cut -f2 -d':' | tr -d '\t')
$ curl $APISERVER/api --header "Authorization: Bearer $TOKEN" --insecure

There are several options for connecting to nodes, pods and services from outside the cluster:

Access services through public IPs: Use a service with type NodePort or LoadBalancer to make the service reachable outside the cluster. See
Access services, nodes, or pods using the Proxy Verb : Does apiserver authentication and authorization prior to accessing the remote service. Use this if the services are not secure enough to expose to the internet, or to gain access to ports on the node IP, or for debugging.
Access from a node or pod in the cluster : Run a pod, and then connect to a shell in it using kubectl exec. Connect to other nodes, pods, and services from that shell.

# Discovering built-in services
kubectl cluster-info

Types of Kubernetes proxies

The kubectl proxy:
- runs on a user’s desktop or in a pod
- proxies from a localhost address to the Kubernetes apiserver
- client to proxy uses HTTP
- proxy to apiserver uses HTTPS
- locates apiserver
- adds authentication headers
The apiserver proxy:
- is a bastion built into the apiserver
- connects a user outside of the cluster to cluster IPs which otherwise might not be reachable
- runs in the apiserver processes
- client to proxy uses HTTPS (or http if apiserver so configured)
- proxy to target may use HTTP or HTTPS as chosen by proxy using available information
- can be used to reach a Node, Pod, or Service
- does load balancing when used to reach a Service
The kube proxy:
- runs on each node
- proxies UDP and TCP
- does not understand HTTP
- provides load balancing
- is just used to reach services
A Proxy/Load-balancer in front of apiserver(s):
- existence and implementation varies from cluster to cluster (e.g. nginx)
- sits between all clients and one or more apiservers
- acts as load balancer if there are several apiservers
Cloud Load Balancers on external services:
- are provided by some cloud providers (e.g. AWS ELB, Google Cloud Load Balancer)
- are created automatically when the Kubernetes service has type LoadBalancer
- use UDP/TCP only
- implementation varies by cloud provider

Authenticating Across Clusters with kubeconfig

Storage

Volumes

Persistent Volumes

Services, Load Balancing, and Networking

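Pods are mortal, so Pod IP addresses cannot be relied upon to be stable over time.
This is why Services are needed.
The set of Pods targeted by a Service is (usually) determined by a Label Selector.
For Kubernetes-native applications, Kubernetes offers a simple Endpoints API that is updated whenever the set of Pods in a Service changes. For non-native applications, Kubernetes offers a virtual-IP-based bridge to Services which redirects to the backend Pods.
Creating a Service with a selector creates an Endpoints object that selects the backends. You can also omit the selector and manually create an Endpoints object with the same name, or use type: ExternalName to forward traffic to an external service.
Except for ExternalName, the virtual IP of a Service is implemented by kube-proxy.
Ingress works at layer 7; Services work at layer 4.
port and nodePort are both ports of the Service: the former is exposed to clients inside the cluster, the latter is opened on every node for access from outside.
Service load balancing has two modes: traffic either passes through the kube-proxy process itself (userspace mode) or is handled by iptables rules (iptables mode).
Environment variables such as {SVCNAME}_SERVICE_HOST and {SVCNAME}_SERVICE_PORT are injected into Pods.
Setting spec.clusterIP = None gives a headless Service, whose DNS name resolves to all of its Endpoints.
ServiceType: ClusterIP (default), NodePort (opens a port on every node that forwards to the Service), LoadBalancer (depends on the IaaS provider; gets an EXTERNAL-IP), ExternalName

Any type of Service can also be exposed on an external IP.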

kind: Service
apiVersion: v1
metadata:
  name: my-service
spec:
  selector:
    app: MyApp
  ports:
  - protocol: TCP
    port: 80          # port exposed by the Service
    targetPort: 9376  # port on the backend Pods; defaults to the value of port

kind: Service
apiVersion: v1
metadata:
  name: my-service
  namespace: prod
spec:
  type: ExternalName
  externalName: my.database.example.com
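
The selector-less variant mentioned in the notes can be sketched as a Service plus a manually created Endpoints object with the same name. The IP address below is a placeholder from a documentation range, not a real backend.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:                     # no selector, so no Endpoints object is created automatically
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376
---
apiVersion: v1
kind: Endpoints
metadata:
  name: my-service        # must match the Service name
subsets:
- addresses:
  - ip: 192.0.2.42        # placeholder backend address
  ports:
  - port: 9376
```

Traffic to the Service's cluster IP on port 80 is then routed to the listed address on port 9376, just as if a selector had picked it.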

DNS Pods and Services

Supported DNS names: my-svc.my-namespace.svc.cluster.local for Services and pod-ip-address.my-namespace.pod.cluster.local for Pods.
Default entries exist as well, such as the kubernetes Service.

Connecting Applications with Services

tutorial

Ingress Resources

tutorial

Network Policies

tutorial

kubectl exec -ti busybox -- nslookup kubernetes.default
