Background
Kubernetes provides a Probe mechanism for checking the condition of the Containers in a Pod. In this post I want to combine the official documentation and the source code to understand how Probes are used and implemented, and what the corresponding best practices are.
Container Probes
For a quick start, see the five-minute GCP video: Kubernetes Health Checks with Readiness and Liveness Probes (Kubernetes Best Practices).
A Probe is a periodic check that the Kubelet performs against a Container, used to determine whether the Container is alive, or whether it can serve traffic.
Depending on their purpose, the Kubelet uses two kinds of probes:
- livenessProbe: indicates whether the Container is running normally.
  - If the result is Failure, the Kubelet kills the container and decides whether to restart it according to its restart policy.
  - If the Container does not define a liveness probe, the result is treated as Success by default.
  - An initial delay needs to be configured to decide when probing starts; otherwise, a container whose initialization takes too long will be killed and restarted in an endless loop.
- readinessProbe: indicates whether the Container can serve requests.
  - If the result is Failure, the endpoints controller removes the Pod's IP from the endpoints of every matching Service.
  - Before the initial delay has elapsed, the readiness probe returns Failure by default.
  - If the Container does not define a readiness probe, the result is treated as Success by default.
A Probe works by invoking a Handler implemented by the Container. Three kinds of Handler are available:
- ExecAction: executes a specified command inside the container; the check is considered successful (Success) if the command exits with status 0, and failed (Failure) otherwise.
- TCPSocketAction: performs a TCP connection check against the container's IP:port; the check succeeds if that IP:port is open, and fails otherwise.
- HTTPGetAction: sends an HTTP GET request to the container's IP:port and a path; the check succeeds if the HTTP response status code is at least 200 and below 400, and fails otherwise.
Based on the result of the Handler call, each probe therefore produces one of three results:
- Success: the Handler returned success.
- Failure: the Handler returned failure.
- Unknown: the Handler could not be executed.
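To make the two probe types and the three handler kinds concrete, here is a minimal sketch using the Kubernetes core/v1 Go types (matching the release-1.12 API read later in this post, where Probe still embeds v1.Handler); the image, command, paths, and port numbers are only illustrative values.
package main

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
    container := v1.Container{
        Name:  "app",
        Image: "k8s.gcr.io/busybox",
        // Liveness via ExecAction: exit code 0 means Success.
        LivenessProbe: &v1.Probe{
            Handler: v1.Handler{
                Exec: &v1.ExecAction{Command: []string{"cat", "/tmp/healthy"}},
            },
            InitialDelaySeconds: 5,
            PeriodSeconds:       5,
        },
        // Readiness via HTTPGetAction: a status code in [200, 400) means Success.
        ReadinessProbe: &v1.Probe{
            Handler: v1.Handler{
                HTTPGet: &v1.HTTPGetAction{Path: "/healthz", Port: intstr.FromInt(8080)},
            },
            InitialDelaySeconds: 10,
            PeriodSeconds:       10,
        },
    }

    // TCPSocketAction only needs a port (and optionally a host) to connect to.
    tcpProbe := &v1.Probe{
        Handler: v1.Handler{
            TCPSocket: &v1.TCPSocketAction{Port: intstr.FromInt(3306)},
        },
    }

    fmt.Printf("%+v\n%+v\n", container, tcpProbe)
}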
When should liveness or readiness probes be used?
After reading the introduction to the two probe types above, a natural question is: must a container's liveness always be checked with a liveness probe, and its ability to serve always with a readiness probe?
The answer is no.
- On a container's liveness:
  - For problems that the container's own lifecycle management already handles, there is no need for a liveness probe. For example, if the container's PID 1 process exits on error, the Kubelet reconciles based on the container's state and the Pod's restartPolicy.
  - Use a custom liveness probe when you want the Kubelet to judge liveness by something other than the container's own state. For example, if the container's PID 1 is a long-running init process and you want the flask process it launches to determine liveness, a custom liveness probe lets the Kubelet kill the container when the flask process fails to start and reconcile it according to restartPolicy.
- On a container's readiness:
  - Use a readiness probe when you need a mechanism that distinguishes "the container has started" from "the container can serve traffic". For example, an application may start successfully but still need a long initialization phase (say, pulling a large amount of initial data) before it can serve requests; container liveness alone is not enough to decide the service state, and the container is only added to the Service endpoints once the readiness probe succeeds.
  - Use a readiness probe (separate from the liveness probe) when you want a live container to be treated as "under maintenance" based on some condition, so that the Kubelet automatically removes it from the endpoints and it stops serving traffic (the container is up, but the service it backs is being maintained); a sketch of this pattern follows after this list.
  - Problems that the container's lifecycle already solves do not need a readiness probe either: when a Pod is deleted, it is put into the unready state regardless of whether a readiness probe exists or what its result is.
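As an illustration of the maintenance case above, here is a hedged sketch (the marker-file path and the command are made up for the example): an exec readiness probe that succeeds only while a marker file is absent, so creating the file inside the container takes it out of the Service endpoints without killing it.
package main

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
)

func main() {
    // `test ! -e /tmp/maintenance` exits 0 (Success) while the file does not
    // exist. Running `touch /tmp/maintenance` inside the container flips the
    // probe to Failure, so the endpoints controller removes the Pod from the
    // Service endpoints; deleting the file puts it back. The liveness probe is
    // unaffected, so the container keeps running.
    maintenanceReadiness := &v1.Probe{
        Handler: v1.Handler{
            Exec: &v1.ExecAction{
                Command: []string{"sh", "-c", "test ! -e /tmp/maintenance"},
            },
        },
        PeriodSeconds: 5,
    }
    fmt.Printf("%+v\n", maintenanceReadiness)
}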
Probes in practice
exec-liveness.yaml:
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-exec
spec:
  containers:
  - name: liveness
    image: k8s.gcr.io/busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5
Observe the Pod's status:
root@kmaster135:/home/chenjiaxi01/yaml/pods/probe# kubectl describe pod liveness-exec
Name: liveness-exec
...
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m19s default-scheduler Successfully assigned default/liveness-exec to dnode136
Normal Killing 2m2s kubelet, dnode136 Killing container with id docker://liveness:Container failed liveness probe.. Container will be killed and recreated.
Warning Failed 107s kubelet, dnode136 Failed to pull image "k8s.gcr.io/busybox": rpc error: code = Unknown desc = Error response from daemon: Get https://k8s.gcr.io/v2/busybox/manifests/latest: dial tcp [2404:6800:4008:c06::52]:443: connect: network is unreachable
Warning Failed 107s kubelet, dnode136 Error: ErrImagePull
Normal BackOff 106s kubelet, dnode136 Back-off pulling image "k8s.gcr.io/busybox"
Warning Failed 106s kubelet, dnode136 Error: ImagePullBackOff
Normal Pulling 93s (x3 over 4m8s) kubelet, dnode136 pulling image "k8s.gcr.io/busybox"
Normal Pulled 72s (x2 over 3m18s) kubelet, dnode136 Successfully pulled image "k8s.gcr.io/busybox"
Normal Created 72s (x2 over 3m17s) kubelet, dnode136 Created container
Normal Started 72s (x2 over 3m17s) kubelet, dnode136 Started container
Warning Unhealthy 27s (x6 over 2m42s) kubelet, dnode136 Liveness probe failed: cat: can't open '/tmp/healthy': No such file or directory
As the events show, about 30 seconds in the liveness probe starts failing (cat: can't open '/tmp/healthy'), the Kubelet kills the container, and the container is restarted according to the default restartPolicy=Always.
One more issue is visible in the events: even though the image already exists on the Node, the Kubelet still tries to pull it from the remote registry. The reason is imagePullPolicy: Always; to skip pulling when the image is already present locally, set imagePullPolicy: IfNotPresent.
Implementation
Code version: release-1.12
- Data structures in the Kubelet
pkg/kubelet/kubelet.go
// Kubelet is the main kubelet implementation.
type Kubelet struct {
    kubeletConfiguration componentconfig.KubeletConfiguration
    ...
    // Handles container probing.
    probeManager prober.Manager
    // Manages container health check results.
    livenessManager proberesults.Manager
    ...
}
- Initialization
pkg/kubelet/kubelet.go
// NewMainKubelet instantiates a new Kubelet object along with all the required internal modules.
// No initialization of Kubelet and its modules should happen here.
func NewMainKubelet(kubeCfg *componentconfig.KubeletConfiguration, kubeDeps *KubeletDeps, crOptions *options.ContainerRuntimeOptions, standaloneMode bool, hostnameOverride, nodeIP, providerID string) (*Kubelet, error) {
    ...
    klet := &Kubelet{
        hostname:   hostname,
        nodeName:   nodeName,
        kubeClient: kubeDeps.KubeClient,
        ...
    }
    ...
    klet.probeManager = prober.NewManager(
        klet.statusManager,
        klet.livenessManager,
        klet.runner,
        containerRefManager,
        kubeDeps.Recorder)
    ...
}
- Startup
pkg/kubelet/kubelet.go
// Run starts the kubelet reacting to config updates
func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
    ...
    // Start component sync loops.
    kl.statusManager.Start()
    kl.probeManager.Start()
    ...
}
- Usage:
When a Pod is created: pkg/kubelet/kubelet.go
// HandlePodAdditions is the callback in SyncHandler for pods being added from
// a config source.
func (kl *Kubelet) HandlePodAdditions(pods []*v1.Pod) {
    start := kl.clock.Now()
    sort.Sort(sliceutils.PodsByCreationTime(pods))
    for _, pod := range pods {
        existingPods := kl.podManager.GetPods()
        // Always add the pod to the pod manager. Kubelet relies on the pod
        // manager as the source of truth for the desired state. If a pod does
        // not exist in the pod manager, it means that it has been deleted in
        // the apiserver and no action (other than cleanup) is required.
        kl.podManager.AddPod(pod)
        ...
        kl.probeManager.AddPod(pod)
    }
}
When a Pod is deleted: pkg/kubelet/kubelet.go
// HandlePodRemoves is the callback in the SyncHandler interface for pods
// being removed from a config source.
func (kl *Kubelet) HandlePodRemoves(pods []*v1.Pod) {
    start := kl.clock.Now()
    for _, pod := range pods {
        kl.podManager.DeletePod(pod)
        ...
        kl.probeManager.RemovePod(pod)
    }
}
- The prober.Manager interface
pkg/kubelet/prober/prober_manager.go
// Manager manages pod probing. It creates a probe "worker" for every container that specifies a
// probe (AddPod). The worker periodically probes its assigned container and caches the results. The
// manager use the cached probe results to set the appropriate Ready state in the PodStatus when
// requested (UpdatePodStatus). Updating probe parameters is not currently supported.
// TODO: Move liveness probing out of the runtime, to here.
type Manager interface {
    // AddPod creates new probe workers for every container probe. This should be called for every
    // pod created.
    AddPod(pod *v1.Pod)
    // RemovePod handles cleaning up the removed pod state, including terminating probe workers and
    // deleting cached results.
    RemovePod(pod *v1.Pod)
    // CleanupPods handles cleaning up pods which should no longer be running.
    // It takes a list of "active pods" which should not be cleaned up.
    CleanupPods(activePods []*v1.Pod)
    // UpdatePodStatus modifies the given PodStatus with the appropriate Ready state for each
    // container based on container running status, cached probe results and worker states.
    UpdatePodStatus(types.UID, *v1.PodStatus)
    // Start starts the Manager sync loops.
    Start()
}
prober.Manager is responsible for managing Pod probing and provides five methods:
- AddPod(pod *v1.Pod): called when a Pod is created; creates a new probe worker for every container probe.
- RemovePod(pod *v1.Pod): cleans up the state of a removed Pod, including terminating its probe workers and deleting cached results.
- CleanupPods(activePods []*v1.Pod): cleans up Pods that should no longer be running; unlike RemovePod, which is driven by the deletion of a specific Pod, it takes the current list of active Pods and, per the interface comment above, removes the state of everything not in that list.
- UpdatePodStatus(types.UID, *v1.PodStatus): updates the given PodStatus based on the containers' running status, the cached probe results, and the worker states.
- Start(): starts the Manager's sync loops.
With these five methods, the Manager creates a probe worker for each container probe via AddPod when a Pod is created; each worker periodically runs its probe and caches the result. Based on the cached results, the Manager sets the appropriate Ready state in the PodStatus via UpdatePodStatus when requested. When the Pod is deleted, the workers are reclaimed via RemovePod.
How should the TODO comment "Move liveness probing out of the runtime, to here" be understood?
- The implementation of the interface: prober.manager
type manager struct {
    // Map of active workers for probes
    workers map[probeKey]*worker
    // Lock for accessing & mutating workers
    workerLock sync.RWMutex
    // The statusManager cache provides pod IP and container IDs for probing.
    statusManager status.Manager
    // readinessManager manages the results of readiness probes
    readinessManager results.Manager
    // livenessManager manages the results of liveness probes
    livenessManager results.Manager
    // prober executes the probe actions.
    prober *prober
}
prober.manager contains the following fields:
- workers: maintains the mapping from probes to workers.
- workerLock: lock that must be held when accessing or mutating workers.
- statusManager: provides Pod and Container information (Pod IPs and container IDs) for probing.
- readinessManager: stores the results of readiness probes.
- livenessManager: stores the results of liveness probes.
- prober: executes the actual probe actions.
- worker: the core probing logic
The worker object encapsulates the main work of a single probe.
Its fields are as follows:
pkg/kubelet/prober/worker.go:37
// worker handles the periodic probing of its assigned container. Each worker has a go-routine
// associated with it which runs the probe loop until the container permanently terminates, or the
// stop channel is closed. The worker uses the probe Manager's statusManager to get up-to-date
// container IDs.
type worker struct {
    // Channel for stopping the probe.
    stopCh chan struct{}
    // The pod containing this probe (read-only)
    pod *v1.Pod
    // The container to probe (read-only)
    container v1.Container
    // Describes the probe configuration (read-only)
    spec *v1.Probe
    // The type of the worker.
    probeType probeType
    // The probe value during the initial delay.
    initialValue results.Result
    // Where to store this workers results.
    resultsManager results.Manager
    probeManager   *manager
    // The last known container ID for this worker.
    containerID kubecontainer.ContainerID
    // The last probe result for this worker.
    lastResult results.Result
    // How many times in a row the probe has returned the same result.
    resultRun int
    // If set, skip probing.
    onHold bool
    // proberResultsMetricLabels holds the labels attached to this worker
    // for the ProberResults metric.
    proberResultsMetricLabels prometheus.Labels
}
Its methods are:
- newWorker: initializes a worker for a container liveness/readiness probing task based on the given probeType and other parameters.
- run: runs the worker's doProbe periodically, at the user-specified Probe.PeriodSeconds, until a stop signal is received.
- stop: sends the stop signal to terminate the worker.
- doProbe: performs the actual probe and returns the result as true/false.
Let's look at the implementation of doProbe:
// doProbe probes the container once and records the result.
// Returns whether the worker should continue.
func (w *worker) doProbe() (keepGoing bool) {
    defer func() { recover() }() // Actually eat panics (HandleCrash takes care of logging)
    defer runtime.HandleCrash(func(_ interface{}) { keepGoing = true })
    ... // defensive checks: skip probing when it is not needed, e.g. the Pod or the Container no longer exists
    // TODO: in order for exec probes to correctly handle downward API env, we must be able to reconstruct
    // the full container environment here, OR we must make a call to the CRI in order to get those environment
    // values from the running container.
    result, err := w.probeManager.prober.probe(w.probeType, w.pod, status, w.container, w.containerID)
    if err != nil {
        // Prober error, throw away the result.
        return true
    }
    ... // based on the probe result and the corresponding configuration (e.g. failure/success thresholds), decide whether to record success
}
doProbe classifies the container's current state to decide whether probing is needed at all, then processes the probe result to decide whether to report success (true).
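The elided tail of doProbe is essentially threshold handling: a result only takes effect after it has been seen FailureThreshold (or SuccessThreshold) times in a row. Below is a minimal standalone model of that idea (my own sketch, not kubelet code; the string results and the driver loop in main are made up for illustration):
package main

import "fmt"

// thresholdTracker models the idea: a new result is only reported once it has
// been observed failureThreshold / successThreshold times in a row; until
// then, the previously reported state is kept.
type thresholdTracker struct {
    lastResult       string
    resultRun        int
    successThreshold int
    failureThreshold int
    reported         string
}

func (t *thresholdTracker) observe(result string) {
    if result == t.lastResult {
        t.resultRun++
    } else {
        t.lastResult = result
        t.resultRun = 1
    }
    if (result == "Failure" && t.resultRun < t.failureThreshold) ||
        (result == "Success" && t.resultRun < t.successThreshold) {
        return // below threshold: keep the previously reported state
    }
    t.reported = result
}

func main() {
    // The Probe spec defaults are successThreshold=1, failureThreshold=3.
    t := &thresholdTracker{successThreshold: 1, failureThreshold: 3, reported: "Success"}
    for i, r := range []string{"Failure", "Failure", "Success", "Failure", "Failure", "Failure"} {
        t.observe(r)
        fmt.Printf("probe #%d: raw=%s reported=%s\n", i+1, r, t.reported)
    }
}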
Next, look at w.probeManager.prober.probe, which supports the three probe handler types exec, tcp, and httpGet. The implementation is at pkg/kubelet/prober/prober.go:81:
// probe probes the container.
func (pb *prober) probe(probeType probeType, pod *v1.Pod, status v1.PodStatus, container v1.Container, containerID kubecontainer.ContainerID) (results.Result, error) {
    var probeSpec *v1.Probe
    switch probeType {
    case readiness:
        probeSpec = container.ReadinessProbe
    case liveness:
        probeSpec = container.LivenessProbe
    default:
        return results.Failure, fmt.Errorf("Unknown probe type: %q", probeType)
    }
    ...
    result, output, err := pb.runProbeWithRetries(probeType, probeSpec, pod, status, container, containerID, maxProbeRetries)
    ...
}
runProbeWithRetries wraps the retry logic and eventually calls runProbe, which implements a separate probing flow for each handler type. Given the background of this post, the HTTPGet implementation is what we care about most, and two questions in particular:
- Can the user specify the Host for HTTPGet?
- If the user does not specify one, what is the default Host (a guess would be the ClusterIP)?
pkg/kubelet/prober/prober.go:147
func (pb *prober) runProbe(probeType probeType, p *v1.Probe, pod *v1.Pod, status v1.PodStatus, container v1.Container, containerID kubecontainer.ContainerID) (probe.Result, string, error) {
    timeout := time.Duration(p.TimeoutSeconds) * time.Second
    if p.Exec != nil {
        glog.V(4).Infof("Exec-Probe Pod: %v, Container: %v, Command: %v", pod, container, p.Exec.Command)
        command := kubecontainer.ExpandContainerCommandOnlyStatic(p.Exec.Command, container.Env)
        return pb.exec.Probe(pb.newExecInContainer(container, containerID, command, timeout))
    }
    if p.HTTPGet != nil {
        scheme := strings.ToLower(string(p.HTTPGet.Scheme))
        // 1. The user can specify the Host for HTTPGet.
        // 2. If the user does not, the default Host is the PodIP.
        host := p.HTTPGet.Host
        if host == "" {
            host = status.PodIP
        }
        port, err := extractPort(p.HTTPGet.Port, container)
        if err != nil {
            return probe.Unknown, "", err
        }
        path := p.HTTPGet.Path
        glog.V(4).Infof("HTTP-Probe Host: %v://%v, Port: %v, Path: %v", scheme, host, port, path)
        url := formatURL(scheme, host, port, path)
        headers := buildHeader(p.HTTPGet.HTTPHeaders)
        glog.V(4).Infof("HTTP-Probe Headers: %v", headers)
        if probeType == liveness {
            return pb.livenessHttp.Probe(url, headers, timeout)
        } else { // readiness
            return pb.readinessHttp.Probe(url, headers, timeout)
        }
    }
    if p.TCPSocket != nil {
        port, err := extractPort(p.TCPSocket.Port, container)
        if err != nil {
            return probe.Unknown, "", err
        }
        host := p.TCPSocket.Host
        if host == "" {
            host = status.PodIP
        }
        glog.V(4).Infof("TCP-Probe Host: %v, Port: %v, Timeout: %v", host, port, timeout)
        return pb.tcp.Probe(host, port, timeout)
    }
    glog.Warningf("Failed to find probe builder for container: %v", container)
    return probe.Unknown, "", fmt.Errorf("Missing probe handler for %s:%s", format.Pod(pod), container.Name)
}
Tracing further down leads to DoHTTPProbe: pkg/probe/http/http.go:66
// DoHTTPProbe checks if a GET request to the url succeeds.
// If the HTTP response code is successful (i.e. 400 > code >= 200), it returns Success.
// If the HTTP response code is unsuccessful or HTTP communication fails, it returns Failure.
// This is exported because some other packages may want to do direct HTTP probes.
func DoHTTPProbe(url *url.URL, headers http.Header, client HTTPGetInterface) (probe.Result, string, error) {
    req, err := http.NewRequest("GET", url.String(), nil)
    ...
    if headers.Get("Host") != "" {
        req.Host = headers.Get("Host")
    }
    res, err := client.Do(req)
    if err != nil {
        // Convert errors into failures to catch timeouts.
        return probe.Failure, err.Error(), nil
    }
    defer res.Body.Close()
    ...
    if res.StatusCode >= http.StatusOK && res.StatusCode < http.StatusBadRequest {
        glog.V(4).Infof("Probe succeeded for %s, Response: %v", url.String(), *res)
        return probe.Success, body, nil
    }
    glog.V(4).Infof("Probe failed for %s with request headers %v, response body: %v", url.String(), headers, body)
    return probe.Failure, fmt.Sprintf("HTTP probe failed with statuscode: %d", res.StatusCode), nil
}
It builds and sends the HTTP GET request and checks the response, which completes the walkthrough of the HTTPGet probe flow.
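To recap what this code does, here is a small self-contained sketch using plain net/http (my own example, not kubelet code; the URL, port, and path are made up) that applies the same success criterion as DoHTTPProbe:
package main

import (
    "fmt"
    "net/http"
    "time"
)

// httpProbe mirrors DoHTTPProbe's criterion: communication errors count as
// failure, and any status code in [200, 400) counts as success.
func httpProbe(url string, timeout time.Duration) (bool, error) {
    client := &http.Client{Timeout: timeout}
    resp, err := client.Get(url)
    if err != nil {
        // Timeouts and connection errors are probe failures, not fatal errors.
        return false, err
    }
    defer resp.Body.Close()
    return resp.StatusCode >= http.StatusOK && resp.StatusCode < http.StatusBadRequest, nil
}

func main() {
    // The address and path are made-up example values.
    ok, err := httpProbe("http://127.0.0.1:8080/healthz", 1*time.Second)
    fmt.Println(ok, err)
}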
Other notes
Understanding select for concurrency control
// run periodically probes the container.
func (w *worker) run() {
    probeTickerPeriod := time.Duration(w.spec.PeriodSeconds) * time.Second
    // If kubelet restarted the probes could be started in rapid succession.
    // Let the worker wait for a random portion of tickerPeriod before probing.
    time.Sleep(time.Duration(rand.Float64() * float64(probeTickerPeriod)))
    probeTicker := time.NewTicker(probeTickerPeriod)
    defer func() {
        // Clean up.
        probeTicker.Stop()
        if !w.containerID.IsEmpty() {
            w.resultsManager.Remove(w.containerID)
        }
        w.probeManager.removeWorker(w.pod.UID, w.container.Name, w.probeType)
        ProberResults.Delete(w.proberResultsMetricLabels)
    }()

probeLoop:
    for w.doProbe() {
        // Wait for next probe tick.
        select {
        case <-w.stopCh:
            break probeLoop
        case <-probeTicker.C:
            // continue
        }
    }
}
The probeLoop label was not obvious to me at first, so here is a small sample to see how it behaves:
package main

import (
    "fmt"
    "time"
)

func main() {
    stopCh := make(chan int)
    ticker := time.NewTicker(1 * time.Second)
    go func() {
        time.Sleep(3 * time.Second)
        stopCh <- 0
        fmt.Println("Send to stopCh")
    }()
testLoop:
    for {
        select {
        case <-stopCh:
            fmt.Println("Receive from stopCh, break")
            break testLoop
        case <-ticker.C:
            fmt.Println("Running...")
            // continue
        }
    }
    fmt.Println("Done")
}
- probeLoop is just a label for the loop; without it, break inside the select would only break out of the select statement rather than out of the whole for loop.
- The time.Ticker usage is worth learning: it drives a periodic task until some stop signal is received.
- for {} is a loop that runs forever, equivalent to while True in Python.
How worker.stop is written
pkg/kubelet/prober/worker.go:147
// stop stops the probe worker. The worker handles cleanup and removes itself from its manager.
// It is safe to call stop multiple times.
func (w *worker) stop() {
    select {
    case w.stopCh <- struct{}{}:
    default: // Non-blocking.
    }
}
What is the difference between writing it this way and the following?
func (w *worker) stop() {
    w.stopCh <- struct{}{}
}
With the non-blocking version, the goroutine calling stop is not blocked even if the channel is already full; callers can invoke it repeatedly without causing problems, so stop is effectively idempotent and the code is more robust.
A sample:
package main

import (
    "fmt"
    "time"
)

var stopCh = make(chan struct{}, 1)

func nonblockingStop() {
    select {
    case stopCh <- struct{}{}:
        fmt.Println("Write to stopCh... Break")
    default:
        fmt.Println("Cannot write to stopCh... Running")
        // non-blocking
    }
}

func stop() {
    stopCh <- struct{}{}
}

func looping() {
testLoop:
    for {
        select {
        case <-stopCh:
            fmt.Println("Receive End Signal...Done")
            break testLoop
        default:
            fmt.Println("Cannot Receive End Signal...Done")
            time.Sleep(500 * time.Millisecond)
        }
    }
}

func main() {
    // make stop blocked
    go looping()
    time.Sleep(time.Second)
    for i := 0; i <= 2; i++ {
        //stop()
        nonblockingStop()
    }
    time.Sleep(3 * time.Second)
}
Calling stop() three times deadlocks, while nonblockingStop does not: looping consumes one signal and exits, the second blocking send sits in the buffer (capacity 1), and the third blocking send has no receiver left, so the main goroutine blocks forever.