k8s环境中监控不通问题排查思路

k8s中的监控原理、prometheus采集原理 可以看这个文章

k8s监控指标汇总,prometheus采集k8s原理解析

k8s-mon项目介绍

项目地址

kube-stats-metrics 没数据

排查思路 dns问题

首先观察k8s-mon-deployment的日志

kubectl logs  -l app=k8s-mon-deployment  -n kube-admin |grep 8080

# 如有下列报错说明网络不通
# err="Get \"http://kube-state-metrics.kube-system:8080/metrics\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
排查dns,在node上请求coredns 服务


root@k8s-local-test-01:/etc/kubernetes/manifests$ kubectl get svc  -n kube-system |grep dns
kube-dns             ClusterIP   10.96.0.10             53/UDP,53/TCP,9153/TCP   73d
在node上请求 coredns 解析 kube-stats-metrics 域名
# 10.96.0.10  为请求到的coredns svc 地址 
# 因为node上的搜索域没有 svc.cluster.local,所以需要FQDN
dig kube-state-metrics.kube-system.svc.cluster.local @10.96.0.10 
# 如果正常的话则会有如下 A记录 

root@k8s-local-test-01:~$ dig kube-state-metrics.kube-system.svc.cluster.local @10.96.0.10

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.3 <<>> kube-state-metrics.kube-system.svc.cluster.local @10.96.0.10
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12799
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;kube-state-metrics.kube-system.svc.cluster.local. IN A

;; ANSWER SECTION:
kube-state-metrics.kube-system.svc.cluster.local. 25 IN    A 10.100.30.129

;; Query time: 0 msec
;; SERVER: 10.96.0.10#53(10.96.0.10)
;; WHEN: Fri Apr 02 15:14:46 CST 2021
;; MSG SIZE  rcvd: 141
在node上请求kube-stats-metrics
root@k8s-local-test-01:/etc/kubernetes/manifests$ curl -s 10.100.30.129:8080/metrics |head 
# HELP kube_certificatesigningrequest_labels Kubernetes labels converted to Prometheus labels.
# TYPE kube_certificatesigningrequest_labels gauge
# HELP kube_certificatesigningrequest_created Unix creation timestamp
# TYPE kube_certificatesigningrequest_created gauge
# HELP kube_certificatesigningrequest_condition The number of each certificatesigningrequest condition
# TYPE kube_certificatesigningrequest_condition gauge
# HELP kube_certificatesigningrequest_cert_length Length of the issued cert
# TYPE kube_certificatesigningrequest_cert_length gauge
# HELP kube_configmap_info Information about configmap.
# TYPE kube_configmap_info gauge

如果有输出证明 node上请求 coredns没问题,请求ksm服务也没问题

然后进入k8s-mon-deployment容器中查看

# 进入容器命令
kubectl -n kube-admin  exec "$(kubectl -nkube-admin get pod -l app=k8s-mon-deployment -o jsonpath='{.items[0].metadata.name}')"  -ti -- /bin/sh 

# ping 一下 kube-state-metrics.kube-system
PING kube-state-metrics.kube-system (10.100.30.129): 56 data bytes
64 bytes from 10.100.30.129: seq=0 ttl=64 time=0.097 ms
64 bytes from 10.100.30.129: seq=1 ttl=64 time=0.093 ms
64 bytes from 10.100.30.129: seq=2 ttl=64 time=0.114 ms
64 bytes from 10.100.30.129: seq=3 ttl=64 time=0.124 ms

# wget 请求一下ksm服务 

/ # wget http://kube-state-metrics.kube-system.svc.cluster.local:8080/metrics -O m |head  m 
# HELP kube_certificatesigningrequest_labels Kubernetes labels converted to Prometheus labels.
# TYPE kube_certificatesigningrequest_labels gauge
# HELP kube_certificatesigningrequest_created Unix creation timestamp
# TYPE kube_certificatesigningrequest_created gauge
# HELP kube_certificatesigningrequest_condition The number of each certificatesigningrequest condition
# TYPE kube_certificatesigningrequest_condition gauge
# HELP kube_certificatesigningrequest_cert_length Length of the issued cert
# TYPE kube_certificatesigningrequest_cert_length gauge
# HELP kube_configmap_info Information about configmap.
# TYPE kube_configmap_info gauge
Connecting to kube-state-metrics.kube-system.svc.cluster.local:8080 (10.100.30.129:8080)

如果在node上可以获取到,但在pod中获取不到考虑 coredns有问题或者容器网络 有问题

打印coredns日志

oot@k8s-local-test-01:/etc/kubernetes/manifests$ kubectl logs  -l k8s-app=kube-dns  -n kube-system -f 
.:53
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.7.0
linux/amd64, go1.14.4, f59c03d
.:53
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.7.0
linux/amd64, go1.14.4, f59c03d
 
root@k8s-local-test-01:/etc/kubernetes/manifests$ kubectl logs  -l k8s-app=kube-dns  -n kube-system |grep -i error
容器网络问题 可以按照这个文档排查 https://juejin.cn/post/684490...

apiserver等服务组件没数据

排查思路 先看看日志 报什么错
在node上手动 带token 请求下apiserver 的metrics
TOKEN=$(kubectl -n kube-admin  get secret $(kubectl -n kube-admin  get serviceaccount k8s-mon -o jsonpath='{.secrets[0].name}') -o jsonpath='{.data.token}' | base64 --decode ) 
curl  https://localhost:6443/metrics --header "Authorization: Bearer $TOKEN" --insecure 

# 如果正常的话可以看到metrics数据
服务组件没有部署在pod中的需要在configMap中给出地址 并设置 user_specified:true
  kube_scheduler:
    user_specified: true
    addrs:
      - "https://1.1.1.1:1234/metrics"
      - "https://2.2.2.2:1234/metrics"

日志中报push到夜莺agent的错误

level=error ts=2021-04-02T14:44:21.560+08:00 caller=push.go:79 msg=HttpPostPushDataBuildNewHttpPostReqError2 funcName=api-server url=http://localhost:2080/api/collector/push err="Post \"http://localhost:2080/api/collector/push\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
 
可以将 k8s-mon的日志改为 debug 查看下
  command:
    - /opt/app/k8s-mon
    - --config.file=/etc/k8s-mon/k8s-mon.yml
    - --log.level=debug
debug日志会打印每个 阶段的耗时
level=debug ts=2021-04-02T16:19:20.723+08:00 caller=kube_state_metrics.go:244 msg=DoCollectSuccessfullyReadyToPush funcName=kube-stats-metrics metrics_num=3551 time_took_seconds=0.276154232 metric_addr=http://kube-state-metrics.kube-system:8080/metrics
level=debug ts=2021-04-02T16:19:20.733+08:00 caller=kube_controller_manager.go:180 msg=DoCollectSuccessfullyReadyToPush funcName=kube-controller-manager metrics_num=642 time_took_seconds=0.286183625
level=debug ts=2021-04-02T16:19:20.845+08:00 caller=push.go:25 msg=PushWorkSuccess funcName=kube-controller-manager url=http://localhost:2080/api/collector/push metricsNum=642 time_took_seconds=0.111731185
level=debug ts=2021-04-02T16:19:20.935+08:00 caller=push.go:25 msg=PushWorkSuccess funcName=kube-stats-metrics url=http://localhost:2080/api/collector/push metricsNum=3551 time_took_seconds=0.212283608
level=debug ts=2021-04-02T16:19:21.459+08:00 caller=kube_apiserver.go:357 msg=DoCollectSuccessfullyReadyToPush funcName=api-server metrics_num=2168 time_took_seconds=1.012191635
level=debug ts=2021-04-02T16:19:21.639+08:00 caller=push.go:25 msg=PushWorkSuccess funcName=api-server url=http://localhost:2080/api/collector/push metricsNum=2168 time_took_seconds=0.179650444
可以到node上面 手动推一条数据给夜莺的agent试试
curl -X POST -H 'Accept: */*' -H 'Accept-Encoding: gzip, deflate' -H 'Connection: keep-alive' -H 'Content-Length: 183' -H 'Content-Type: application/json' -H 'User-Agent: python-requests/2.6.0 CPython/2.7.5 Linux/3.10.0-1160.11.1.el7.x86_64' -d '[{"tagsMap": {"k1": "v1"}, "step": 15, "endpoint": "1", "value": 1, "tags": "k1=v1", "timestamp": 1617346924, "metric": "abc_test", "extra": "", "nid": "1", "counterType": "COUNTER"}]' http://localhost:2080/api/collector/push 
localhost 还是127.0.0.1问题?
  • 在容器内部ping localhost看看

你可能感兴趣的:(k8s环境中监控不通问题排查思路)