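For background on how monitoring works in Kubernetes and how Prometheus scrapes metrics, see this article
Introduction to the k8s-mon project
Project address
No data from kube-state-metrics
Troubleshooting approach: DNS problems
First, check the logs of k8s-mon-deployment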
kubectl logs -l app=k8s-mon-deployment -n kube-admin |grep 8080
# An error like the following means the network is unreachable
# err="Get \"http://kube-state-metrics.kube-system:8080/metrics\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Check DNS by querying the CoreDNS service from a node
root@k8s-local-test-01:/etc/kubernetes/manifests$ kubectl get svc -n kube-system |grep dns
kube-dns ClusterIP 10.96.0.10 53/UDP,53/TCP,9153/TCP 73d
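From the node, ask CoreDNS to resolve the kube-state-metrics domain name
# 10.96.0.10 is the CoreDNS service address obtained above
# The node's DNS search domains do not include svc.cluster.local, so the FQDN is required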
dig kube-state-metrics.kube-system.svc.cluster.local @10.96.0.10
# If everything is working, the answer contains an A record like the one below
root@k8s-local-test-01:~$ dig kube-state-metrics.kube-system.svc.cluster.local @10.96.0.10
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.3 <<>> kube-state-metrics.kube-system.svc.cluster.local @10.96.0.10
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12799
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;kube-state-metrics.kube-system.svc.cluster.local. IN A
;; ANSWER SECTION:
kube-state-metrics.kube-system.svc.cluster.local. 25 IN A 10.100.30.129
;; Query time: 0 msec
;; SERVER: 10.96.0.10#53(10.96.0.10)
;; WHEN: Fri Apr 02 15:14:46 CST 2021
;; MSG SIZE rcvd: 141
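The lookup above can be wrapped in a small helper so the same check is easy to repeat for other Services. This is a minimal sketch: the function names `service_fqdn` and `resolve_service` are hypothetical, and `cluster.local` is assumed as the cluster domain (the usual default, but configurable per cluster).

```shell
# Build the fully qualified DNS name of a Service.
# Assumption: the cluster domain is cluster.local unless passed as $3.
service_fqdn() {
    # $1 = service name, $2 = namespace, $3 = cluster domain (optional)
    printf '%s.%s.svc.%s\n' "$1" "$2" "${3:-cluster.local}"
}

# Resolve a Service through a specific DNS server and print only the answers.
resolve_service() {
    # $1 = service name, $2 = namespace, $3 = DNS server IP
    dig +short "$(service_fqdn "$1" "$2")" "@$3"
}

# Usage (commented out so that defining the helpers has no side effects):
# resolve_service kube-state-metrics kube-system 10.96.0.10
```

An empty result from `resolve_service` means CoreDNS returned no A record for that name.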
Request kube-state-metrics from the node
root@k8s-local-test-01:/etc/kubernetes/manifests$ curl -s 10.100.30.129:8080/metrics |head
# HELP kube_certificatesigningrequest_labels Kubernetes labels converted to Prometheus labels.
# TYPE kube_certificatesigningrequest_labels gauge
# HELP kube_certificatesigningrequest_created Unix creation timestamp
# TYPE kube_certificatesigningrequest_created gauge
# HELP kube_certificatesigningrequest_condition The number of each certificatesigningrequest condition
# TYPE kube_certificatesigningrequest_condition gauge
# HELP kube_certificatesigningrequest_cert_length Length of the issued cert
# TYPE kube_certificatesigningrequest_cert_length gauge
# HELP kube_configmap_info Information about configmap.
# TYPE kube_configmap_info gauge
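If there is output, both CoreDNS resolution and access to the ksm (kube-state-metrics) service work from the node
Next, enter the k8s-mon-deployment container and check from there
# Command to enter the container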
kubectl -n kube-admin exec "$(kubectl -nkube-admin get pod -l app=k8s-mon-deployment -o jsonpath='{.items[0].metadata.name}')" -ti -- /bin/sh
# Ping kube-state-metrics.kube-system
PING kube-state-metrics.kube-system (10.100.30.129): 56 data bytes
64 bytes from 10.100.30.129: seq=0 ttl=64 time=0.097 ms
64 bytes from 10.100.30.129: seq=1 ttl=64 time=0.093 ms
64 bytes from 10.100.30.129: seq=2 ttl=64 time=0.114 ms
64 bytes from 10.100.30.129: seq=3 ttl=64 time=0.124 ms
# Request the ksm service with wget
/ # wget http://kube-state-metrics.kube-system.svc.cluster.local:8080/metrics -O m |head m
# HELP kube_certificatesigningrequest_labels Kubernetes labels converted to Prometheus labels.
# TYPE kube_certificatesigningrequest_labels gauge
# HELP kube_certificatesigningrequest_created Unix creation timestamp
# TYPE kube_certificatesigningrequest_created gauge
# HELP kube_certificatesigningrequest_condition The number of each certificatesigningrequest condition
# TYPE kube_certificatesigningrequest_condition gauge
# HELP kube_certificatesigningrequest_cert_length Length of the issued cert
# TYPE kube_certificatesigningrequest_cert_length gauge
# HELP kube_configmap_info Information about configmap.
# TYPE kube_configmap_info gauge
Connecting to kube-state-metrics.kube-system.svc.cluster.local:8080 (10.100.30.129:8080)
If the metrics can be fetched from the node but not from inside the pod, suspect a problem with CoreDNS or with the container network
Print the CoreDNS logs
root@k8s-local-test-01:/etc/kubernetes/manifests$ kubectl logs -l k8s-app=kube-dns -n kube-system -f
.:53
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.7.0
linux/amd64, go1.14.4, f59c03d
.:53
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.7.0
linux/amd64, go1.14.4, f59c03d
root@k8s-local-test-01:/etc/kubernetes/manifests$ kubectl logs -l k8s-app=kube-dns -n kube-system |grep -i error
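For container network problems, follow this document: https://juejin.cn/post/684490...
No data from apiserver and other service components
Troubleshooting approach: first check what errors the logs report
On the node, manually request the apiserver metrics with the token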
TOKEN=$(kubectl -n kube-admin get secret $(kubectl -n kube-admin get serviceaccount k8s-mon -o jsonpath='{.secrets[0].name}') -o jsonpath='{.data.token}' | base64 --decode )
curl https://localhost:6443/metrics --header "Authorization: Bearer $TOKEN" --insecure
# If everything is working, the metrics data is returned
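On Kubernetes 1.24 and later, token Secrets are no longer created automatically for ServiceAccounts, so `.secrets[0].name` may be empty and the TOKEN command above returns nothing. The sketch below tries the Secret first and falls back to `kubectl create token` (available from kubectl 1.24). The function name `get_sa_token` is hypothetical.

```shell
# Print a bearer token for a ServiceAccount.
# Runs in a subshell, "( ... )", so its variables do not leak into the caller.
get_sa_token() (
    # $1 = ServiceAccount name, $2 = namespace
    sa="$1"
    ns="$2"
    # Older clusters: the ServiceAccount references a token Secret.
    secret=$(kubectl -n "$ns" get serviceaccount "$sa" \
        -o jsonpath='{.secrets[0].name}' 2>/dev/null)
    if [ -n "$secret" ]; then
        kubectl -n "$ns" get secret "$secret" \
            -o jsonpath='{.data.token}' | base64 --decode
    else
        # Kubernetes 1.24+: request a short-lived token instead.
        kubectl -n "$ns" create token "$sa"
    fi
)

# Usage (commented out so that defining the helper has no side effects):
# TOKEN=$(get_sa_token k8s-mon kube-admin)
# curl https://localhost:6443/metrics --header "Authorization: Bearer $TOKEN" --insecure
```

If the curl call returns 403 even with a valid token, the ServiceAccount's RBAC permissions are the next thing to check.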
For service components that are not deployed as pods, list their addresses in the ConfigMap and set user_specified: true
kube_scheduler:
  user_specified: true
  addrs:
  - "https://1.1.1.1:1234/metrics"
  - "https://2.2.2.2:1234/metrics"
Logs report errors when pushing to the Nightingale (n9e) agent
level=error ts=2021-04-02T14:44:21.560+08:00 caller=push.go:79 msg=HttpPostPushDataBuildNewHttpPostReqError2 funcName=api-server url=http://localhost:2080/api/collector/push err="Post \"http://localhost:2080/api/collector/push\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Set the k8s-mon log level to debug to investigate
command:
- /opt/app/k8s-mon
- --config.file=/etc/k8s-mon/k8s-mon.yml
- --log.level=debug
The debug logs print the time taken by each stage
level=debug ts=2021-04-02T16:19:20.723+08:00 caller=kube_state_metrics.go:244 msg=DoCollectSuccessfullyReadyToPush funcName=kube-stats-metrics metrics_num=3551 time_took_seconds=0.276154232 metric_addr=http://kube-state-metrics.kube-system:8080/metrics
level=debug ts=2021-04-02T16:19:20.733+08:00 caller=kube_controller_manager.go:180 msg=DoCollectSuccessfullyReadyToPush funcName=kube-controller-manager metrics_num=642 time_took_seconds=0.286183625
level=debug ts=2021-04-02T16:19:20.845+08:00 caller=push.go:25 msg=PushWorkSuccess funcName=kube-controller-manager url=http://localhost:2080/api/collector/push metricsNum=642 time_took_seconds=0.111731185
level=debug ts=2021-04-02T16:19:20.935+08:00 caller=push.go:25 msg=PushWorkSuccess funcName=kube-stats-metrics url=http://localhost:2080/api/collector/push metricsNum=3551 time_took_seconds=0.212283608
level=debug ts=2021-04-02T16:19:21.459+08:00 caller=kube_apiserver.go:357 msg=DoCollectSuccessfullyReadyToPush funcName=api-server metrics_num=2168 time_took_seconds=1.012191635
level=debug ts=2021-04-02T16:19:21.639+08:00 caller=push.go:25 msg=PushWorkSuccess funcName=api-server url=http://localhost:2080/api/collector/push metricsNum=2168 time_took_seconds=0.179650444
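To compare stages quickly, the timing fields can be pulled out of the debug lines. This is a minimal sketch that assumes the log format shown above (space-separated `key=value` pairs). The function name `stage_timings` is hypothetical.

```shell
# Read k8s-mon log lines on stdin and print "msg funcName time_took_seconds"
# for every line that carries a time_took_seconds field.
stage_timings() {
    awk '{
        msg = ""; fn = ""; t = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^msg=/) msg = substr($i, 5)
            else if ($i ~ /^funcName=/) fn = substr($i, 10)
            else if ($i ~ /^time_took_seconds=/) t = substr($i, 19)
        }
        if (t != "") print msg, fn, t
    }'
}

# Usage (commented out so that defining the helper has no side effects):
# kubectl logs -l app=k8s-mon-deployment -n kube-admin | stage_timings | sort -k3 -n
```

Sorting by the third column puts the slowest stage last, which helps tell a slow collection (`DoCollectSuccessfullyReadyToPush`) from a slow push (`PushWorkSuccess`).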
You can also try pushing a data point to the Nightingale agent manually from the node
curl -X POST -H 'Accept: */*' -H 'Accept-Encoding: gzip, deflate' -H 'Connection: keep-alive' -H 'Content-Length: 183' -H 'Content-Type: application/json' -H 'User-Agent: python-requests/2.6.0 CPython/2.7.5 Linux/3.10.0-1160.11.1.el7.x86_64' -d '[{"tagsMap": {"k1": "v1"}, "step": 15, "endpoint": "1", "value": 1, "tags": "k1=v1", "timestamp": 1617346924, "metric": "abc_test", "extra": "", "nid": "1", "counterType": "COUNTER"}]' http://localhost:2080/api/collector/push
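The command above carries a fixed timestamp from April 2021. The sketch below builds the same payload with the current time and lets curl compute the Content-Length itself. It assumes the agent accepts exactly the fields used in the command above. The function name `build_push_payload` is hypothetical.

```shell
# Print a one-element JSON array in the shape used by the push example,
# stamped with the current Unix time.
build_push_payload() {
    # $1 = metric name, $2 = numeric value, $3 = endpoint, $4 = nid
    printf '[{"metric": "%s", "value": %s, "endpoint": "%s", "nid": "%s", "step": 15, "tags": "k1=v1", "tagsMap": {"k1": "v1"}, "timestamp": %s, "counterType": "COUNTER", "extra": ""}]\n' \
        "$1" "$2" "$3" "$4" "$(date +%s)"
}

# Usage (commented out so that defining the helper sends nothing):
# build_push_payload abc_test 1 1 1 |
#     curl -sS -m 5 -X POST -H 'Content-Type: application/json' \
#          --data-binary @- http://localhost:2080/api/collector/push
```

The `-m 5` option caps the request at five seconds, so a hanging agent shows up as a curl timeout rather than a stalled terminal.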
Is it a localhost vs. 127.0.0.1 problem?
- Try pinging localhost from inside the container
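One possible cause, stated here as an assumption rather than a confirmed diagnosis, is that localhost maps to the IPv6 address ::1 inside the container while the agent listens only on IPv4. The sketch below lists what the hosts file maps a name to. The function name `hosts_entries` is hypothetical.

```shell
# Print every address that a hosts file maps to the given name.
hosts_entries() {
    # $1 = host name, $2 = hosts file (defaults to /etc/hosts)
    awk -v name="$1" '
        $1 ~ /^#/ { next }    # skip comment lines
        { for (i = 2; i <= NF; i++) if ($i == name) print $1 }
    ' "${2:-/etc/hosts}"
}

# Usage inside the container (commented out so that defining the helper
# has no side effects):
# hosts_entries localhost
```

If the output includes ::1, it is worth trying 127.0.0.1 in place of localhost in the push URL to see whether the timeout disappears.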