Background
Kubernetes provides a Probe mechanism for checking the condition of the Containers in a Pod. In this post I want to combine the official documentation and the source code to understand how Probes are used and implemented, and what the corresponding best practices are.
Container Probes
For a quick start, see the five-minute GCP video: Kubernetes Health Checks with Readiness and Liveness Probes (Kubernetes Best Practices).
A Probe is a periodic check that the Kubelet performs against a Container, used to determine whether the Container is alive, or whether it can serve traffic.
Depending on their purpose, the Kubelet uses two kinds of probes:
- livenessProbe: indicates whether the Container is running normally.
  - If the result is Failure, the Kubelet kills the container and decides whether to restart it according to its restart policy.
  - If the Container does not define a liveness probe, the result is treated as Success by default.
  - An initial delay needs to be configured to decide when probing starts; otherwise, a container whose initialization takes too long will be killed and restarted in an endless loop.
- readinessProbe: indicates whether the Container can serve requests.
  - If the result is Failure, the endpoints controller removes the Pod's IP from the endpoints of every matching Service.
  - Before the initial delay has elapsed, the readiness probe returns Failure by default.
  - If the Container does not define a readiness probe, the result is treated as Success by default.
A Probe works by invoking a Handler implemented by the Container. Three kinds of Handler are available:
- ExecAction: executes a specified command inside the container; the check is considered successful (Success) if the command exits with status 0, and failed (Failure) otherwise.
- TCPSocketAction: performs a TCP connection check against the container's IP:port; the check succeeds if that IP:port is open, and fails otherwise.
- HTTPGetAction: sends an HTTP GET request to the container's IP:port and a path; the check succeeds if the HTTP response status code is at least 200 and below 400, and fails otherwise.
Based on the result of the Handler call, each probe therefore produces one of three results:
- Success: the Handler returned success.
- Failure: the Handler returned failure.
- Unknown: the Handler could not be executed.
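To make the two probe types and the three handler kinds concrete, here is a minimal sketch using the Kubernetes core/v1 Go types (matching the release-1.12 API read later in this post, where Probe still embeds v1.Handler); the image, command, paths, and port numbers are only illustrative values.
package main

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
    container := v1.Container{
        Name:  "app",
        Image: "k8s.gcr.io/busybox",
        // Liveness via ExecAction: exit code 0 means Success.
        LivenessProbe: &v1.Probe{
            Handler: v1.Handler{
                Exec: &v1.ExecAction{Command: []string{"cat", "/tmp/healthy"}},
            },
            InitialDelaySeconds: 5,
            PeriodSeconds:       5,
        },
        // Readiness via HTTPGetAction: a status code in [200, 400) means Success.
        ReadinessProbe: &v1.Probe{
            Handler: v1.Handler{
                HTTPGet: &v1.HTTPGetAction{Path: "/healthz", Port: intstr.FromInt(8080)},
            },
            InitialDelaySeconds: 10,
            PeriodSeconds:       10,
        },
    }

    // TCPSocketAction only needs a port (and optionally a host) to connect to.
    tcpProbe := &v1.Probe{
        Handler: v1.Handler{
            TCPSocket: &v1.TCPSocketAction{Port: intstr.FromInt(3306)},
        },
    }

    fmt.Printf("%+v\n%+v\n", container, tcpProbe)
}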
When should liveness or readiness probes be used?
After reading the introduction to the two probe types above, a natural question is: must a container's liveness always be checked with a liveness probe, and its ability to serve always with a readiness probe?
The answer is no.
- On a container's liveness:
  - For problems that the container's own lifecycle management already handles, there is no need for a liveness probe. For example, if the container's PID 1 process exits on error, the Kubelet reconciles based on the container's state and the Pod's restartPolicy.
  - Use a custom liveness probe when you want the Kubelet to judge liveness by something other than the container's own state. For example, if the container's PID 1 is a long-running init process and you want the flask process it launches to determine liveness, a custom liveness probe lets the Kubelet kill the container when the flask process fails to start and reconcile it according to restartPolicy.
- On a container's readiness:
  - Use a readiness probe when you need a mechanism that distinguishes "the container has started" from "the container can serve traffic". For example, an application may start successfully but still need a long initialization phase (say, pulling a large amount of initial data) before it can serve requests; container liveness alone is not enough to decide the service state, and the container is only added to the Service endpoints once the readiness probe succeeds.
  - Use a readiness probe (separate from the liveness probe) when you want a live container to be treated as "under maintenance" based on some condition, so that the Kubelet automatically removes it from the endpoints and it stops serving traffic (the container is up, but the service it backs is being maintained); a sketch of this pattern follows after this list.
  - Problems that the container's lifecycle already solves do not need a readiness probe either: when a Pod is deleted, it is put into the unready state regardless of whether a readiness probe exists or what its result is.
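As an illustration of the maintenance case above, here is a hedged sketch (the marker-file path and the command are made up for the example): an exec readiness probe that succeeds only while a marker file is absent, so creating the file inside the container takes it out of the Service endpoints without killing it.
package main

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
)

func main() {
    // `test ! -e /tmp/maintenance` exits 0 (Success) while the file does not
    // exist. Running `touch /tmp/maintenance` inside the container flips the
    // probe to Failure, so the endpoints controller removes the Pod from the
    // Service endpoints; deleting the file puts it back. The liveness probe is
    // unaffected, so the container keeps running.
    maintenanceReadiness := &v1.Probe{
        Handler: v1.Handler{
            Exec: &v1.ExecAction{
                Command: []string{"sh", "-c", "test ! -e /tmp/maintenance"},
            },
        },
        PeriodSeconds: 5,
    }
    fmt.Printf("%+v\n", maintenanceReadiness)
}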
Probes in practice
exec-liveness.yaml:
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-exec
spec:
  containers:
  - name: liveness
    image: k8s.gcr.io/busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5
Observe the Pod's status:
root@kmaster135:/home/chenjiaxi01/yaml/pods/probe# kubectl describe pod liveness-exec
Name: liveness-exec
...
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m19s default-scheduler Successfully assigned default/liveness-exec to dnode136
Normal Killing 2m2s kubelet, dnode136 Killing container with id docker://liveness:Container failed liveness probe.. Container will be killed and recreated.
Warning Failed 107s kubelet, dnode136 Failed to pull image "k8s.gcr.io/busybox": rpc error: code = Unknown desc = Error response from daemon: Get https://k8s.gcr.io/v2/busybox/manifests/latest: dial tcp [2404:6800:4008:c06::52]:443: connect: network is unreachable
Warning Failed 107s kubelet, dnode136 Error: ErrImagePull
Normal BackOff 106s kubelet, dnode136 Back-off pulling image "k8s.gcr.io/busybox"
Warning Failed 106s kubelet, dnode136 Error: ImagePullBackOff
Normal Pulling 93s (x3 over 4m8s) kubelet, dnode136 pulling image "k8s.gcr.io/busybox"
Normal Pulled 72s (x2 over 3m18s) kubelet, dnode136 Successfully pulled image "k8s.gcr.io/busybox"
Normal Created 72s (x2 over 3m17s) kubelet, dnode136 Created container
Normal Started 72s (x2 over 3m17s) kubelet, dnode136 Started container
Warning Unhealthy 27s (x6 over 2m42s) kubelet, dnode136 Liveness probe failed: cat: can't open '/tmp/healthy': No such file or directory
As the events show, about 30 seconds in the liveness probe starts failing (cat: can't open '/tmp/healthy'), the Kubelet kills the container, and the container is restarted according to the default restartPolicy=Always.
One more issue is visible in the events: even though the image already exists on the Node, the Kubelet still tries to pull it from the remote registry. The reason is imagePullPolicy: Always; to skip pulling when the image is already present locally, set imagePullPolicy: IfNotPresent.
Implementation
Code version: release-1.12
- Data structures in the Kubelet
pkg/kubelet/kubelet.go
// Kubelet is the main kubelet implementation.
type Kubelet struct {
    kubeletConfiguration componentconfig.KubeletConfiguration
    ...
    // Handles container probing.
    probeManager prober.Manager
    // Manages container health check results.
    livenessManager proberesults.Manager
    ...
}
- Initialization
pkg/kubelet/kubelet.go
// NewMainKubelet instantiates a new Kubelet object along with all the required internal modules.
// No initialization of Kubelet and its modules should happen here.
func NewMainKubelet(kubeCfg *componentconfig.KubeletConfiguration, kubeDeps *KubeletDeps, crOptions *options.ContainerRuntimeOptions, standaloneMode bool, hostnameOverride, nodeIP, providerID string) (*Kubelet, error) {
    ...
    klet := &Kubelet{
        hostname:   hostname,
        nodeName:   nodeName,
        kubeClient: kubeDeps.KubeClient,
        ...
    }
    ...
    klet.probeManager = prober.NewManager(
        klet.statusManager,
        klet.livenessManager,
        klet.runner,
        containerRefManager,
        kubeDeps.Recorder)
    ...
}
- Startup
pkg/kubelet/kubelet.go
// Run starts the kubelet reacting to config updates
func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
    ...
    // Start component sync loops.
    kl.statusManager.Start()
    kl.probeManager.Start()
    ...
}
- Usage:
When a Pod is created: pkg/kubelet/kubelet.go
// HandlePodAdditions is the callback in SyncHandler for pods being added from
// a config source.
func (kl *Kubelet) HandlePodAdditions(pods []*v1.Pod) {
    start := kl.clock.Now()
    sort.Sort(sliceutils.PodsByCreationTime(pods))
    for _, pod := range pods {
        existingPods := kl.podManager.GetPods()
        // Always add the pod to the pod manager. Kubelet relies on the pod
        // manager as the source of truth for the desired state. If a pod does
        // not exist in the pod manager, it means that it has been deleted in
        // the apiserver and no action (other than cleanup) is required.
        kl.podManager.AddPod(pod)
        ...
        kl.probeManager.AddPod(pod)
    }
}
When a Pod is deleted: pkg/kubelet/kubelet.go
// HandlePodRemoves is the callback in the SyncHandler interface for pods
// being removed from a config source.
func (kl *Kubelet) HandlePodRemoves(pods []*v1.Pod) {
    start := kl.clock.Now()
    for _, pod := range pods {
        kl.podManager.DeletePod(pod)
        ...
        kl.probeManager.RemovePod(pod)
    }
}
- The prober.Manager interface
pkg/kubelet/prober/prober_manager.go
// Manager manages pod probing. It creates a probe "worker" for every container that specifies a
// probe (AddPod). The worker periodically probes its assigned container and caches the results. The
// manager use the cached probe results to set the appropriate Ready state in the PodStatus when
// requested (UpdatePodStatus). Updating probe parameters is not currently supported.
// TODO: Move liveness probing out of the runtime, to here.
type Manager interface {
    // AddPod creates new probe workers for every container probe. This should be called for every
    // pod created.
    AddPod(pod *v1.Pod)
    // RemovePod handles cleaning up the removed pod state, including terminating probe workers and
    // deleting cached results.
    RemovePod(pod *v1.Pod)
    // CleanupPods handles cleaning up pods which should no longer be running.
    // It takes a list of "active pods" which should not be cleaned up.
    CleanupPods(activePods []*v1.Pod)
    // UpdatePodStatus modifies the given PodStatus with the appropriate Ready state for each
    // container based on container running status, cached probe results and worker states.
    UpdatePodStatus(types.UID, *v1.PodStatus)
    // Start starts the Manager sync loops.
    Start()
}
prober.Manager is responsible for managing Pod probing and provides five methods:
- AddPod(pod *v1.Pod): called when a Pod is created; creates a new probe worker for every container probe.
- RemovePod(pod *v1.Pod): cleans up the state of a removed Pod, including terminating its probe workers and deleting cached results.
- CleanupPods(activePods []*v1.Pod): cleans up Pods that should no longer be running; unlike RemovePod, which is driven by the deletion of a specific Pod, it takes the current list of active Pods and, per the interface comment above, removes the state of everything not in that list.
- UpdatePodStatus(types.UID, *v1.PodStatus): updates the given PodStatus based on the containers' running status, the cached probe results, and the worker states.
- Start(): starts the Manager's sync loops.
With these five methods, the Manager creates a probe worker for each container probe via AddPod when a Pod is created; each worker periodically runs its probe and caches the result. Based on the cached results, the Manager sets the appropriate Ready state in the PodStatus via UpdatePodStatus when requested. When the Pod is deleted, the workers are reclaimed via RemovePod.
How should the TODO comment "Move liveness probing out of the runtime, to here" be understood?
- The implementation of the interface: prober.manager
type manager struct {
    // Map of active workers for probes
    workers map[probeKey]*worker
    // Lock for accessing & mutating workers
    workerLock sync.RWMutex
    // The statusManager cache provides pod IP and container IDs for probing.
    statusManager status.Manager
    // readinessManager manages the results of readiness probes
    readinessManager results.Manager
    // livenessManager manages the results of liveness probes
    livenessManager results.Manager
    // prober executes the probe actions.
    prober *prober
}
prober.manager contains the following fields:
- workers: maintains the mapping from probes to workers.
- workerLock: lock that must be held when accessing or mutating workers.
- statusManager: provides Pod and Container information (Pod IPs and container IDs) for probing.
- readinessManager: stores the results of readiness probes.
- livenessManager: stores the results of liveness probes.
- prober: executes the actual probe actions.
- worker: the core probing logic
The worker object encapsulates the main work of a single probe.
Its fields are as follows:
pkg/kubelet/prober/worker.go:37
// worker handles the periodic probing of its assigned container. Each worker has a go-routine
// associated with it which runs the probe loop until the container permanently terminates, or the
// stop channel is closed. The worker uses the probe Manager's statusManager to get up-to-date
// container IDs.
type worker struct {
    // Channel for stopping the probe.
    stopCh chan struct{}
    // The pod containing this probe (read-only)
    pod *v1.Pod
    // The container to probe (read-only)
    container v1.Container
    // Describes the probe configuration (read-only)
    spec *v1.Probe
    // The type of the worker.
    probeType probeType
    // The probe value during the initial delay.
    initialValue results.Result
    // Where to store this workers results.
    resultsManager results.Manager
    probeManager   *manager
    // The last known container ID for this worker.
    containerID kubecontainer.ContainerID
    // The last probe result for this worker.
    lastResult results.Result
    // How many times in a row the probe has returned the same result.
    resultRun int
    // If set, skip probing.
    onHold bool
    // proberResultsMetricLabels holds the labels attached to this worker
    // for the ProberResults metric.
    proberResultsMetricLabels prometheus.Labels
}
Its methods are:
- newWorker: initializes a worker for a container liveness/readiness probing task based on the given probeType and other parameters.
- run: runs the worker's doProbe periodically, at the user-specified Probe.PeriodSeconds, until a stop signal is received.
- stop: sends the stop signal to terminate the worker.
- doProbe: performs the actual probe and returns the result as true/false.
Let's look at the implementation of doProbe:
// doProbe probes the container once and records the result.
// Returns whether the worker should continue.
func (w *worker) doProbe() (keepGoing bool) {
    defer func() { recover() }() // Actually eat panics (HandleCrash takes care of logging)
    defer runtime.HandleCrash(func(_ interface{}) { keepGoing = true })
    ... // defensive checks: skip probing when it is not needed, e.g. the Pod or the Container no longer exists
    // TODO: in order for exec probes to correctly handle downward API env, we must be able to reconstruct
    // the full container environment here, OR we must make a call to the CRI in order to get those environment
    // values from the running container.
    result, err := w.probeManager.prober.probe(w.probeType, w.pod, status, w.container, w.containerID)
    if err != nil {
        // Prober error, throw away the result.
        return true
    }
    ... // based on the probe result and the corresponding configuration (e.g. failure/success thresholds), decide whether to record success
}
doProbe classifies the container's current state to decide whether probing is needed at all, then processes the probe result to decide whether to report success (true).
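The elided tail of doProbe is essentially threshold handling: a result only takes effect after it has been seen FailureThreshold (or SuccessThreshold) times in a row. Below is a minimal standalone model of that idea (my own sketch, not kubelet code; the string results and the driver loop in main are made up for illustration):
package main

import "fmt"

// thresholdTracker models the idea: a new result is only reported once it has
// been observed failureThreshold / successThreshold times in a row; until
// then, the previously reported state is kept.
type thresholdTracker struct {
    lastResult       string
    resultRun        int
    successThreshold int
    failureThreshold int
    reported         string
}

func (t *thresholdTracker) observe(result string) {
    if result == t.lastResult {
        t.resultRun++
    } else {
        t.lastResult = result
        t.resultRun = 1
    }
    if (result == "Failure" && t.resultRun < t.failureThreshold) ||
        (result == "Success" && t.resultRun < t.successThreshold) {
        return // below threshold: keep the previously reported state
    }
    t.reported = result
}

func main() {
    // The Probe spec defaults are successThreshold=1, failureThreshold=3.
    t := &thresholdTracker{successThreshold: 1, failureThreshold: 3, reported: "Success"}
    for i, r := range []string{"Failure", "Failure", "Success", "Failure", "Failure", "Failure"} {
        t.observe(r)
        fmt.Printf("probe #%d: raw=%s reported=%s\n", i+1, r, t.reported)
    }
}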
Next, look at w.probeManager.prober.probe, which supports the three probe handler types exec, tcp, and httpGet. The implementation is at pkg/kubelet/prober/prober.go:81:
// probe probes the container.
func (pb *prober) probe(probeType probeType, pod *v1.Pod, status v1.PodStatus, container v1.Container, containerID kubecontainer.ContainerID) (results.Result, error) {
    var probeSpec *v1.Probe
    switch probeType {
    case readiness:
        probeSpec = container.ReadinessProbe
    case liveness:
        probeSpec = container.LivenessProbe
    default:
        return results.Failure, fmt.Errorf("Unknown probe type: %q", probeType)
    }
    ...
    result, output, err := pb.runProbeWithRetries(probeType, probeSpec, pod, status, container, containerID, maxProbeRetries)
    ...
}
runProbeWithRetries wraps the retry logic and eventually calls runProbe, which implements a separate probing flow for each handler type. Given the background of this post, the HTTPGet implementation is what we care about most, and two questions in particular:
- Can the user specify the Host for HTTPGet?
- If the user does not specify one, what is the default Host (a guess would be the ClusterIP)?
pkg/kubelet/prober/prober.go:147
func (pb *prober) runProbe(probeType probeType, p *v1.Probe, pod *v1.Pod, status v1.PodStatus, container v1.Container, containerID kubecontainer.ContainerID) (probe.Result, string, error) {
    timeout := time.Duration(p.TimeoutSeconds) * time.Second
    if p.Exec != nil {
        glog.V(4).Infof("Exec-Probe Pod: %v, Container: %v, Command: %v", pod, container, p.Exec.Command)
        command := kubecontainer.ExpandContainerCommandOnlyStatic(p.Exec.Command, container.Env)
        return pb.exec.Probe(pb.newExecInContainer(container, containerID, command, timeout))
    }
    if p.HTTPGet != nil {
        scheme := strings.ToLower(string(p.HTTPGet.Scheme))
        // 1. The user can specify the Host for HTTPGet.
        // 2. If the user does not, the default Host is the PodIP.
        host := p.HTTPGet.Host
        if host == "" {
            host = status.PodIP
        }
        port, err := extractPort(p.HTTPGet.Port, container)
        if err != nil {
            return probe.Unknown, "", err
        }
        path := p.HTTPGet.Path
        glog.V(4).Infof("HTTP-Probe Host: %v://%v, Port: %v, Path: %v", scheme, host, port, path)
        url := formatURL(scheme, host, port, path)
        headers := buildHeader(p.HTTPGet.HTTPHeaders)
        glog.V(4).Infof("HTTP-Probe Headers: %v", headers)
        if probeType == liveness {
            return pb.livenessHttp.Probe(url, headers, timeout)
        } else { // readiness
            return pb.readinessHttp.Probe(url, headers, timeout)
        }
    }
    if p.TCPSocket != nil {
        port, err := extractPort(p.TCPSocket.Port, container)
        if err != nil {
            return probe.Unknown, "", err
        }
        host := p.TCPSocket.Host
        if host == "" {
            host = status.PodIP
        }
        glog.V(4).Infof("TCP-Probe Host: %v, Port: %v, Timeout: %v", host, port, timeout)
        return pb.tcp.Probe(host, port, timeout)
    }
    glog.Warningf("Failed to find probe builder for container: %v", container)
    return probe.Unknown, "", fmt.Errorf("Missing probe handler for %s:%s", format.Pod(pod), container.Name)
}
Tracing further down leads to DoHTTPProbe: pkg/probe/http/http.go:66
// DoHTTPProbe checks if a GET request to the url succeeds.
// If the HTTP response code is successful (i.e. 400 > code >= 200), it returns Success.
// If the HTTP response code is unsuccessful or HTTP communication fails, it returns Failure.
// This is exported because some other packages may want to do direct HTTP probes.
func DoHTTPProbe(url *url.URL, headers http.Header, client HTTPGetInterface) (probe.Result, string, error) {
    req, err := http.NewRequest("GET", url.String(), nil)
    ...
    if headers.Get("Host") != "" {
        req.Host = headers.Get("Host")
    }
    res, err := client.Do(req)
    if err != nil {
        // Convert errors into failures to catch timeouts.
        return probe.Failure, err.Error(), nil
    }
    defer res.Body.Close()
    ...
    if res.StatusCode >= http.StatusOK && res.StatusCode < http.StatusBadRequest {
        glog.V(4).Infof("Probe succeeded for %s, Response: %v", url.String(), *res)
        return probe.Success, body, nil
    }
    glog.V(4).Infof("Probe failed for %s with request headers %v, response body: %v", url.String(), headers, body)
    return probe.Failure, fmt.Sprintf("HTTP probe failed with statuscode: %d", res.StatusCode), nil
}
It builds and sends the HTTP GET request and checks the response, which completes the walkthrough of the HTTPGet probe flow.
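To recap what this code does, here is a small self-contained sketch using plain net/http (my own example, not kubelet code; the URL, port, and path are made up) that applies the same success criterion as DoHTTPProbe:
package main

import (
    "fmt"
    "net/http"
    "time"
)

// httpProbe mirrors DoHTTPProbe's criterion: communication errors count as
// failure, and any status code in [200, 400) counts as success.
func httpProbe(url string, timeout time.Duration) (bool, error) {
    client := &http.Client{Timeout: timeout}
    resp, err := client.Get(url)
    if err != nil {
        // Timeouts and connection errors are probe failures, not fatal errors.
        return false, err
    }
    defer resp.Body.Close()
    return resp.StatusCode >= http.StatusOK && resp.StatusCode < http.StatusBadRequest, nil
}

func main() {
    // The address and path are made-up example values.
    ok, err := httpProbe("http://127.0.0.1:8080/healthz", 1*time.Second)
    fmt.Println(ok, err)
}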
Other notes
Understanding select for concurrency control
// run periodically probes the container.
func (w *worker) run() {
    probeTickerPeriod := time.Duration(w.spec.PeriodSeconds) * time.Second
    // If kubelet restarted the probes could be started in rapid succession.
    // Let the worker wait for a random portion of tickerPeriod before probing.
    time.Sleep(time.Duration(rand.Float64() * float64(probeTickerPeriod)))
    probeTicker := time.NewTicker(probeTickerPeriod)
    defer func() {
        // Clean up.
        probeTicker.Stop()
        if !w.containerID.IsEmpty() {
            w.resultsManager.Remove(w.containerID)
        }
        w.probeManager.removeWorker(w.pod.UID, w.container.Name, w.probeType)
        ProberResults.Delete(w.proberResultsMetricLabels)
    }()

probeLoop:
    for w.doProbe() {
        // Wait for next probe tick.
        select {
        case <-w.stopCh:
            break probeLoop
        case <-probeTicker.C:
            // continue
        }
    }
}
The probeLoop label was not obvious to me at first, so here is a small sample to see how it behaves:
package main

import (
    "fmt"
    "time"
)

func main() {
    stopCh := make(chan int)
    ticker := time.NewTicker(1 * time.Second)
    go func() {
        time.Sleep(3 * time.Second)
        stopCh <- 0
        fmt.Println("Send to stopCh")
    }()
testLoop:
    for {
        select {
        case <-stopCh:
            fmt.Println("Receive from stopCh, break")
            break testLoop
        case <-ticker.C:
            fmt.Println("Running...")
            // continue
        }
    }
    fmt.Println("Done")
}
- probeLoop is just a label for the loop; without it, break inside the select would only break out of the select statement rather than out of the whole for loop.
- The time.Ticker usage is worth learning: it drives a periodic task until some stop signal is received.
- for {} is a loop that runs forever, equivalent to while True in Python.
How worker.stop is written
pkg/kubelet/prober/worker.go:147
// stop stops the probe worker. The worker handles cleanup and removes itself from its manager.
// It is safe to call stop multiple times.
func (w *worker) stop() {
    select {
    case w.stopCh <- struct{}{}:
    default: // Non-blocking.
    }
}
What is the difference between writing it this way and the following?
func (w *worker) stop() {
    w.stopCh <- struct{}{}
}
With the non-blocking version, the goroutine calling stop is not blocked even if the channel is already full; callers can invoke it repeatedly without causing problems, so stop is effectively idempotent and the code is more robust.
A sample:
package main

import (
    "fmt"
    "time"
)

var stopCh = make(chan struct{}, 1)

func nonblockingStop() {
    select {
    case stopCh <- struct{}{}:
        fmt.Println("Write to stopCh... Break")
    default:
        fmt.Println("Cannot write to stopCh... Running")
        // non-blocking
    }
}

func stop() {
    stopCh <- struct{}{}
}

func looping() {
testLoop:
    for {
        select {
        case <-stopCh:
            fmt.Println("Receive End Signal...Done")
            break testLoop
        default:
            fmt.Println("Cannot Receive End Signal...Done")
            time.Sleep(500 * time.Millisecond)
        }
    }
}

func main() {
    // make stop blocked
    go looping()
    time.Sleep(time.Second)
    for i := 0; i <= 2; i++ {
        //stop()
        nonblockingStop()
    }
    time.Sleep(3 * time.Second)
}
Calling stop() three times deadlocks, while nonblockingStop does not: looping consumes one signal and exits, the second blocking send sits in the buffer (capacity 1), and the third blocking send has no receiver left, so the main goroutine blocks forever.