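Design and Implementation of Availability Monitoring for Key Services on a Kubernetes Container Cloud Platform

Monitoring Kubernetes Service Components

Check the health of the worker node hosts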

$ kubectl get nodes
NAME        STATUS   ROLES    AGE   VERSION
k8s-node1   Ready    <none>   46d   v1.13.1
k8s-node2   Ready    <none>   46d   v1.13.1
k8s-node3   Ready    <none>   46d   v1.13.1
  • Alert condition: raise an alert if the output does not contain a record for each of the 3 hosts, or if any host's STATUS is not Ready.
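
The alert condition above can be sketched as a small shell function that reads the `kubectl get nodes` output on standard input. The function name `check_nodes` and the OK/ALERT message format are illustrative assumptions, not part of the original setup.

```shell
# check_nodes EXPECTED_COUNT
# Reads `kubectl get nodes` output on stdin. Prints "OK" and returns 0 when
# exactly EXPECTED_COUNT nodes are listed and every STATUS is exactly "Ready";
# otherwise prints an "ALERT: ..." line and returns 1.
# (Hypothetical helper; a status such as "Ready,SchedulingDisabled" also alerts.)
check_nodes() {
    awk -v want="$1" '
        NF == 0 || $1 == "NAME" { next }      # skip blank lines and the header
        { total++ }
        $2 != "Ready" { bad = bad " " $1 "=" $2 }
        END {
            if (total + 0 != want + 0) {
                printf "ALERT: expected %d nodes, got %d\n", want, total
                exit 1
            }
            if (bad != "") {
                printf "ALERT: not Ready:%s\n", bad
                exit 1
            }
            print "OK"
        }'
}

# Usage (on a host with kubectl configured):
#   kubectl get nodes | check_nodes 3
```

The return code makes the function easy to call from cron or from a monitoring agent that treats a non-zero exit as a failure.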

Check the health of the kube-apiserver service process

$ systemctl status kube-apiserver|grep active
Active: active (running) since 三 2019-01-23 13:50:54 CST; 1 weeks 3 days ago
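  • Alert condition: the command must report "active (running)" on each of the three nodes that run kube-apiserver (in this deployment, the same nodes that run keepalived).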
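
The same "active (running)" test recurs for every systemd-managed service in this document, so it can be sketched once as a pair of helpers that parse the `Active:` line of `systemctl status`. The names `unit_state` and `check_unit` are hypothetical, and the message format is an assumption.

```shell
# unit_state: reads `systemctl status <unit>` output on stdin and prints the
# text that follows "Active: ", without the trailing " since ..." part,
# for example "active (running)".
unit_state() {
    sed -n 's/^ *Active: //p' | sed 's/ since .*$//' | head -n 1
}

# check_unit NAME: prints an OK or ALERT line for unit NAME based on stdin,
# returning 1 unless the state is exactly "active (running)".
check_unit() {
    state=$(unit_state)
    if [ "$state" = "active (running)" ]; then
        echo "OK: $1 is $state"
    else
        echo "ALERT: $1 is ${state:-unknown}"
        return 1
    fi
}

# Usage (run on every node that hosts the service):
#   systemctl status kube-apiserver | check_unit kube-apiserver
```

When only a yes/no answer is needed, `systemctl is-active --quiet kube-apiserver` gives the same information through its exit code without any parsing.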

Check the health of the cluster services

$ kubectl cluster-info
Kubernetes master is running at https://192.168.10.24:8443
CoreDNS is running at https://192.168.10.24:8443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
  • Alert condition: the output must contain both "Kubernetes master is running" and "CoreDNS is running".
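
A minimal sketch of this check is shown here. It assumes that the `kubectl cluster-info` output may contain ANSI color escape sequences around the component names, so it strips them defensively before matching. The function name `check_cluster_info` is illustrative.

```shell
# check_cluster_info: reads `kubectl cluster-info` output on stdin.
# Prints "OK" when both expected phrases are present; otherwise prints one
# ALERT line per missing phrase and returns 1.
check_cluster_info() {
    esc=$(printf '\033')
    # Remove color sequences such as ESC[0;32m so the phrases match as plain text.
    out=$(sed "s/${esc}\[[0-9;]*m//g")
    rc=0
    for want in "Kubernetes master is running" "CoreDNS is running"; do
        case "$out" in
            *"$want"*) ;;
            *) echo "ALERT: missing \"$want\""; rc=1 ;;
        esac
    done
    if [ "$rc" -eq 0 ]; then echo "OK"; fi
    return "$rc"
}

# Usage:
#   kubectl cluster-info | check_cluster_info
```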

Check the health of the k8s service components

$ kubectl get componentstatuses
NAME                 STATUS      MESSAGE              ERROR
controller-manager   Unhealthy   Get http://127.0.0.1:10252/healthz: net/http: HTTP/1.x transport connection broken: malformed HTTP response "\x15\x03\x01\x00\x02\x02"
scheduler            Healthy     ok
etcd-0               Healthy     {"health":"true"}
etcd-1               Healthy     {"health":"true"}
etcd-2               Healthy     {"health":"true"}
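  • When kubectl get componentstatuses runs, the apiserver sends its health requests to 127.0.0.1 by default. When controller-manager and scheduler run in cluster mode, they may not be on the same machine as kube-apiserver, in which case they are reported as Unhealthy even though they are working normally. (In the sample output above, the response bytes \x15\x03\x01 look like a TLS alert record, which suggests the port answered a plain-HTTP probe with TLS; that is another way for a working component to show up as Unhealthy.)
  • Alert condition: parse the STATUS column of the output and alert on any component that is not Healthy, allowing for the false positives described above.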
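
One way to apply this condition while tolerating the known false positives is to list the unhealthy components and let the caller name the ones to skip. The sketch below assumes a hypothetical helper called `unhealthy_components`; which components to ignore is a decision for the operator.

```shell
# unhealthy_components [IGNORED_NAME...]
# Reads `kubectl get componentstatuses` output on stdin and prints the name of
# every component whose STATUS is not "Healthy", skipping names given as
# arguments. Returns 1 if at least one component was printed, else 0.
unhealthy_components() {
    awk -v ignore=" $* " '
        NF == 0 || $1 == "NAME" { next }        # skip blank lines and the header
        index(ignore, " " $1 " ") { next }      # skip explicitly ignored names
        $2 != "Healthy" { print $1; found = 1 }
        END { exit (found ? 1 : 0) }'
}

# Usage:
#   kubectl get componentstatuses | unhealthy_components
#   kubectl get componentstatuses | unhealthy_components controller-manager scheduler
```

Ignoring controller-manager and scheduler here only makes sense if their health is covered by another check, such as the per-process checks on their own nodes.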

Check the health of the kube-controller-manager service process

$ systemctl status kube-controller-manager|grep Active
Active: active (running) since 三 2019-01-23 13:34:25 CST; 1 weeks 3 days ago
  • Alert condition: the command must report "active (running)" on each of the three nodes that run kube-controller-manager.

Check the health of the kube-scheduler service process

$ systemctl status kube-scheduler|grep Active
Active: active (running) since 三 2019-01-23 13:50:10 CST; 1 weeks 3 days ago
  • Alert condition: the command must report "active (running)" on each of the three nodes that run kube-scheduler.

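kube-scheduler listens on port 10251 and serves HTTP requests, so its metrics can be inspected directly

$ curl -s http://127.0.0.1:10251/metrics
  • Alert condition: the endpoint returns a large number of metrics; choose the ones you need and build alerts on them.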
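
As a starting point for picking individual metrics, the sketch below extracts the value of one unlabeled sample from the text returned by the endpoint. The helper name `metric_value` is hypothetical. The metric `go_goroutines` in the usage line is a standard metric of the Prometheus Go client and is assumed, not confirmed by this document, to be present in the kube-scheduler output.

```shell
# metric_value NAME
# Reads Prometheus text-format metrics on stdin and prints the value of the
# first sample whose name is exactly NAME (samples with labels do not match).
# Returns 1 when no such sample exists.
metric_value() {
    awk -v name="$1" '
        /^#/ { next }                           # skip HELP and TYPE comment lines
        $1 == name { print $2; found = 1; exit }
        END { exit (found ? 0 : 1) }'
}

# Usage (metric name is an assumption):
#   curl -s http://127.0.0.1:10251/metrics | metric_value go_goroutines
```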

Check the health of the docker service process

$ systemctl status docker|grep Active
Active: active (running) since 三 2019-01-23 13:50:10 CST; 1 weeks 3 days ago
  • Alert condition: the command must report "active (running)" on every worker node that runs the docker service.

Check the health of the kubelet service process

$ systemctl status kubelet|grep Active
Active: active (running) since 三 2019-01-23 13:50:10 CST; 1 weeks 3 days ago
  • Alert condition: the command must report "active (running)" on every worker node that runs kubelet.

Check the health of the kube-proxy service process

$ systemctl status kube-proxy|grep Active
Active: active (running) since 三 2019-01-23 13:50:20 CST; 1 weeks 3 days ago
  • Alert condition: the command must report "active (running)" on every worker node that runs kube-proxy.

Monitoring the etcd Service

Check the health of the etcd service process

[k8s@k8s-node3 ~]$ systemctl status etcd|grep active
Active: active (running) since 三 2019-01-23 13:50:44 CST; 1 weeks 3 days ago
  • Alert condition: on every node that runs the etcd service process, the output must show the state "active (running)".

Check the health of the etcd service

[k8s@k8s-node1 ~]$ ETCDCTL_API=3 /opt/k8s/bin/etcdctl --endpoints=https://192.168.10.21:2379 --cacert=/etc/kubernetes/cert/ca.pem --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem endpoint health
https://192.168.10.21:2379 is healthy: successfully committed proposal: took = 1.843545ms
[k8s@k8s-node1 ~]$ ETCDCTL_API=3 /opt/k8s/bin/etcdctl --endpoints=https://192.168.10.22:2379 --cacert=/etc/kubernetes/cert/ca.pem --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem endpoint health
https://192.168.10.22:2379 is healthy: successfully committed proposal: took = 2.342839ms
[k8s@k8s-node1 ~]$ ETCDCTL_API=3 /opt/k8s/bin/etcdctl --endpoints=https://192.168.10.23:2379 --cacert=/etc/kubernetes/cert/ca.pem --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem endpoint health
https://192.168.10.23:2379 is healthy: successfully committed proposal: took = 2.265994ms
  • Alert condition: the health check must report "healthy" for all three etcd endpoints.
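
The three health checks can be combined by counting the "is healthy" lines across all endpoints. This is a minimal sketch; the function name `check_etcd_health` and its messages are assumptions, while the etcdctl flags in the usage comment are the ones shown above.

```shell
# check_etcd_health EXPECTED_COUNT
# Reads the combined output of `etcdctl endpoint health` runs on stdin and
# compares the number of "is healthy" lines with EXPECTED_COUNT.
check_etcd_health() {
    # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
    healthy=$(grep -c ' is healthy' || true)
    if [ "$healthy" -eq "$1" ]; then
        echo "OK: $healthy/$1 endpoints healthy"
    else
        echo "ALERT: only $healthy/$1 endpoints healthy"
        return 1
    fi
}

# Usage, looping over the three endpoints from this document:
#   for ip in 192.168.10.21 192.168.10.22 192.168.10.23; do
#     ETCDCTL_API=3 /opt/k8s/bin/etcdctl --endpoints=https://$ip:2379 \
#       --cacert=/etc/kubernetes/cert/ca.pem --cert=/etc/etcd/cert/etcd.pem \
#       --key=/etc/etcd/cert/etcd-key.pem endpoint health 2>&1
#   done | check_etcd_health 3
```

etcdctl also accepts a comma-separated list in `--endpoints`, which would replace the loop with a single invocation.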

Monitoring the Platform's High-Availability Services

Check the health of the haproxy service process

[k8s@k8s-node1 ~]$ systemctl status haproxy|grep Active
Active: active (running) since 三 2019-01-23 13:29:40 CST; 1 weeks 3 days ago
  • Alert condition: on all 3 host nodes that run haproxy, the command output must show the state "active (running)".

Check the health of the keepalived service process

[k8s@k8s-node1 ~]$ systemctl status keepalived|grep Active
Active: active (running) since 三 2019-01-23 13:29:40 CST; 1 weeks 3 days ago
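  • Alert condition 1: on all 3 host nodes that run keepalived, the command output must show the state "active (running)".
  • Alert condition 2: monitor whether the keepalived VIP address is present and whether it has moved to another node, and alert on any abnormal state.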
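
For alert condition 2, a per-node building block is a test for whether the VIP is currently bound to a local interface. The sketch below parses `ip -o -4 addr show` output. The helper name `has_vip` is hypothetical, and the VIP 192.168.10.24 in the usage line is an assumption taken from the API endpoint address in the cluster-info output.

```shell
# has_vip VIP
# Reads `ip -o -4 addr show` output on stdin and returns 0 if VIP is bound to
# any interface, 1 otherwise. The address must match exactly, so 192.168.10.2
# does not match 192.168.10.24.
has_vip() {
    awk -v vip="$1" '
        $3 == "inet" {
            split($4, a, "/")                   # drop the /prefix length
            if (a[1] == vip) found = 1
        }
        END { exit (found ? 0 : 1) }'
}

# Usage (VIP value is an assumption):
#   if ip -o -4 addr show | has_vip 192.168.10.24; then
#       echo "VIP is on $(hostname)"
#   fi
```

Running this on all three nodes and collecting the results lets a monitor alert when no node, or more than one node, holds the VIP, and detect drift when the holder differs from the previously recorded one.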

Monitoring the Calico Network Service

Check that the Calico-related pods in the kube-system namespace are healthy

$ kubectl get pods --namespace=kube-system
NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-5d94b577bb-lqp7b   1/1     Running   5          10d
calico-node-658m6                          1/1     Running   7          37d
calico-node-8cpzs                          1/1     Running   10         37d
calico-node-cjjmb                          1/1     Running   5          37d
  • The output must contain one calico-kube-controllers-* pod whose STATUS is "Running".
  • With 3 worker nodes in use, the output must contain 3 calico-node-* pods, all with STATUS "Running".
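
Both pod conditions follow the same pattern: count the pods whose name starts with a given prefix and require each of them to be Running. A minimal sketch, assuming a hypothetical helper named `check_pods`:

```shell
# check_pods NAME_PREFIX EXPECTED_COUNT
# Reads `kubectl get pods` output on stdin. Succeeds when exactly
# EXPECTED_COUNT pods have a name starting with NAME_PREFIX and all of them
# have STATUS "Running"; otherwise prints an ALERT line and returns 1.
check_pods() {
    awk -v prefix="$1" -v want="$2" '
        index($1, prefix) == 1 {
            total++
            if ($3 != "Running") bad = bad " " $1 "=" $3
        }
        END {
            if (total + 0 != want + 0) {
                printf "ALERT: expected %d %s* pods, got %d\n", want, prefix, total
                exit 1
            }
            if (bad != "") {
                printf "ALERT: not Running:%s\n", bad
                exit 1
            }
            printf "OK: %d %s* pods Running\n", total, prefix
        }'
}

# Usage:
#   kubectl get pods --namespace=kube-system | check_pods calico-kube-controllers- 1
#   kubectl get pods --namespace=kube-system | check_pods calico-node- 3
```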

Check the number or names of the nodes in the Calico network

$ calicoctl get nodes|grep -v NAME
k8s-node1
k8s-node2
k8s-node3
  • With 3 worker nodes in use, the output must contain exactly 3 lines; if needed, the check can also verify that the node names match.
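
Checking the names as well as the count can be done by comparing the sorted node list with the expected one. The function name `check_calico_nodes` below is an illustrative assumption.

```shell
# check_calico_nodes EXPECTED_NAME...
# Reads `calicoctl get nodes` output on stdin (with or without the NAME header)
# and succeeds only when the set of node names equals the expected names.
check_calico_nodes() {
    expected=$(printf '%s\n' "$@" | sort)
    actual=$(awk 'NF > 0 && $1 != "NAME" { print $1 }' | sort)
    if [ "$actual" = "$expected" ]; then
        echo "OK: $# Calico nodes present"
    else
        echo "ALERT: Calico node list differs from expected"
        return 1
    fi
}

# Usage:
#   calicoctl get nodes | check_calico_nodes k8s-node1 k8s-node2 k8s-node3
```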

Check the running state of the Calico network nodes

# calicoctl node status
Calico process is running.

IPv4 BGP status
+---------------+-------------------+-------+------------+-------------+
| PEER ADDRESS  |     PEER TYPE     | STATE |   SINCE    |    INFO     |
+---------------+-------------------+-------+------------+-------------+
| 192.168.10.22 | node-to-node mesh | up    | 2019-01-23 | Established |
| 192.168.10.23 | node-to-node mesh | up    | 2019-01-23 | Established |
+---------------+-------------------+-------+------------+-------------+

IPv6 BGP status
No IPv6 peers found.
  • If the check is run on node 192.168.10.21, the output should look like the above: the other two nodes in the Calico network are listed as peers, each with STATE "up" and INFO "Established".
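
The peer table can be checked mechanically by splitting each row on the "|" separators. The sketch below looks only at the IPv4 table and assumes the column order shown above; the function name `check_bgp_peers` is hypothetical.

```shell
# check_bgp_peers EXPECTED_PEER_COUNT
# Reads `calicoctl node status` output on stdin. Succeeds when the IPv4 table
# lists exactly EXPECTED_PEER_COUNT peers, each with STATE "up" and INFO
# "Established".
check_bgp_peers() {
    awk -F'|' -v want="$1" '
        /^IPv6 BGP status/ { exit }             # stop before the IPv6 section
        NF >= 6 && $2 !~ /PEER ADDRESS/ {       # data rows only, not borders or header
            peer = $2; state = $4; info = $6
            gsub(/ /, "", peer); gsub(/ /, "", state); gsub(/ /, "", info)
            total++
            if (state != "up" || info != "Established") bad = bad " " peer
        }
        END {
            if (total + 0 != want + 0) {
                printf "ALERT: expected %d BGP peers, got %d\n", want, total
                exit 1
            }
            if (bad != "") {
                printf "ALERT: peers not established:%s\n", bad
                exit 1
            }
            printf "OK: %d BGP peers Established\n", total
        }'
}

# Usage (calicoctl node status needs root, as the "#" prompt above suggests):
#   calicoctl node status | check_bgp_peers 2
```

On a three-node cluster each node should see the other two, so the expected count is the node count minus one.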

Monitoring the CoreDNS Service

Check the running state of the coredns pods

$ kubectl get pods -n kube-system|grep -v NAME|grep coredns
coredns-fff89c9b9-5h9fv 1/1 Running 4 35d
coredns-fff89c9b9-pdd4w 1/1 Running 6 36d
coredns-fff89c9b9-wm2zh 1/1 Running 7 44d
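  • The number of coredns pods must match the expected count, and each must have STATUS "Running".

Run a dedicated check pod and periodically verify from it that k8s service names resolve to the correct addresses

Create the check pod. For convenience, it is simply defined to sleep for more than ten years: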

$ cat busybox-check.yaml
apiVersion: v1
kind: Pod
metadata:
  name: busybox-check
spec:
  containers:
  - image: busybox
    name: busybox-check
    command: [ "sleep", "360000000" ]

[k8s@k8s-node1 ~]$ kubectl create -f busybox-check.yaml

Send a single ping packet to the kubernetes service name:

$ kubectl exec busybox-check -- ping -c 1 kubernetes
PING kubernetes (10.254.0.1): 56 data bytes
64 bytes from 10.254.0.1: seq=0 ttl=64 time=0.074 ms

--- kubernetes ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.074/0.074/0.074 ms
  • Design suitable checks against the returned output to confirm that the DNS service is working properly.
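
One possible check is to compare the address on the first line of the ping output with the expected service IP. The sketch looks only at that line, so it tests name resolution rather than ICMP reachability. The helper names `resolved_ip` and `check_dns` are hypothetical, and the expected address 10.254.0.1 is taken from the sample output above.

```shell
# resolved_ip: reads busybox `ping` output on stdin and prints the address in
# parentheses on the first line, e.g. "PING kubernetes (10.254.0.1): ...".
resolved_ip() {
    sed -n '1s/^PING [^ ]* (\([0-9.]*\)).*$/\1/p'
}

# check_dns EXPECTED_IP: succeeds when the name resolved to EXPECTED_IP.
check_dns() {
    ip=$(resolved_ip)
    if [ "$ip" = "$1" ]; then
        echo "OK: resolved to $ip"
    else
        echo "ALERT: expected $1, got ${ip:-no answer}"
        return 1
    fi
}

# Usage:
#   kubectl exec busybox-check -- ping -c 1 kubernetes | check_dns 10.254.0.1
```

If resolution fails, ping writes its error to stderr and nothing reaches the function, which then reports "no answer".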
