Building on kube-prometheus, this article implements the following: local PVC storage for grafana, prometheus, and alertmanager; Ingress access to Grafana through the shared ingress-nginx; a serviceMonitor for ingress-nginx (verifying that adding a monitoring target works); and a fix for the custom-metrics errors, which gets hpaV2 working.
Download the relevant files
# note: extracting this archive reported an error
wget https://github.com/coreos/kube-prometheus/archive/v0.3.0.tar.gz
kube-prometheus breaks down into roughly the following parts: the prometheus-operator itself (under setup/), the Prometheus and Alertmanager instances, node-exporter, Grafana, and ServiceMonitors for the cluster components. It also bundles the kube-state-metrics and prometheus-adapter projects; prometheus-adapter is covered separately further down.
Per the project documentation, the resources under setup/ should be applied first.
Apply the files under setup/
[root@docker-182 manifests]# k apply -f setup/
namespace/monitoring created
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
customresourcedefinition.apiextensions.k8s.io/alertmanagers.monitoring.coreos.com configured
customresourcedefinition.apiextensions.k8s.io/podmonitors.monitoring.coreos.com created
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
customresourcedefinition.apiextensions.k8s.io/prometheuses.monitoring.coreos.com configured
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
customresourcedefinition.apiextensions.k8s.io/prometheusrules.monitoring.coreos.com configured
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
customresourcedefinition.apiextensions.k8s.io/servicemonitors.monitoring.coreos.com configured
clusterrole.rbac.authorization.k8s.io/prometheus-operator created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-operator created
deployment.apps/prometheus-operator created
service/prometheus-operator created
serviceaccount/prometheus-operator created
[root@bj-k8s-master-56 ~]# k -n monitoring get all
NAME                                      READY   STATUS    RESTARTS   AGE
pod/prometheus-operator-6685db5c6-fsfsp   1/1     Running   0          80s

NAME                          TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
service/prometheus-operator   ClusterIP   None         <none>        8080/TCP   81s

NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/prometheus-operator   1/1     1            1           81s

NAME                                            DESIRED   CURRENT   READY   AGE
replicaset.apps/prometheus-operator-6685db5c6   1         1         1       81s
Apply the files under manifests/
[root@docker-182 manifests]# k apply -f .
alertmanager.monitoring.coreos.com/main created
secret/alertmanager-main created
service/alertmanager-main created
serviceaccount/alertmanager-main created
servicemonitor.monitoring.coreos.com/alertmanager created
secret/grafana-datasources created
configmap/grafana-dashboard-apiserver created
configmap/grafana-dashboard-cluster-total created
configmap/grafana-dashboard-controller-manager created
configmap/grafana-dashboard-k8s-resources-cluster created
configmap/grafana-dashboard-k8s-resources-namespace created
configmap/grafana-dashboard-k8s-resources-node created
configmap/grafana-dashboard-k8s-resources-pod created
configmap/grafana-dashboard-k8s-resources-workload created
configmap/grafana-dashboard-k8s-resources-workloads-namespace created
configmap/grafana-dashboard-kubelet created
configmap/grafana-dashboard-namespace-by-pod created
configmap/grafana-dashboard-namespace-by-workload created
configmap/grafana-dashboard-node-cluster-rsrc-use created
configmap/grafana-dashboard-node-rsrc-use created
configmap/grafana-dashboard-nodes created
configmap/grafana-dashboard-persistentvolumesusage created
configmap/grafana-dashboard-pod-total created
configmap/grafana-dashboard-pods created
configmap/grafana-dashboard-prometheus-remote-write created
configmap/grafana-dashboard-prometheus created
configmap/grafana-dashboard-proxy created
configmap/grafana-dashboard-scheduler created
configmap/grafana-dashboard-statefulset created
configmap/grafana-dashboard-workload-total created
configmap/grafana-dashboards created
deployment.apps/grafana created
service/grafana created
serviceaccount/grafana created
servicemonitor.monitoring.coreos.com/grafana created
clusterrole.rbac.authorization.k8s.io/kube-state-metrics created
clusterrolebinding.rbac.authorization.k8s.io/kube-state-metrics created
deployment.apps/kube-state-metrics created
role.rbac.authorization.k8s.io/kube-state-metrics created
rolebinding.rbac.authorization.k8s.io/kube-state-metrics created
service/kube-state-metrics created
serviceaccount/kube-state-metrics created
servicemonitor.monitoring.coreos.com/kube-state-metrics created
clusterrole.rbac.authorization.k8s.io/node-exporter created
clusterrolebinding.rbac.authorization.k8s.io/node-exporter created
daemonset.apps/node-exporter created
service/node-exporter created
serviceaccount/node-exporter created
servicemonitor.monitoring.coreos.com/node-exporter created
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io configured
clusterrole.rbac.authorization.k8s.io/prometheus-adapter created
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader configured
clusterrolebinding.rbac.authorization.k8s.io/prometheus-adapter created
clusterrolebinding.rbac.authorization.k8s.io/resource-metrics:system:auth-delegator created
clusterrole.rbac.authorization.k8s.io/resource-metrics-server-resources created
configmap/adapter-config created
deployment.apps/prometheus-adapter created
rolebinding.rbac.authorization.k8s.io/resource-metrics-auth-reader created
service/prometheus-adapter created
serviceaccount/prometheus-adapter created
clusterrole.rbac.authorization.k8s.io/prometheus-k8s created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-k8s created
servicemonitor.monitoring.coreos.com/prometheus-operator created
prometheus.monitoring.coreos.com/k8s created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s-config created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s created
role.rbac.authorization.k8s.io/prometheus-k8s-config created
role.rbac.authorization.k8s.io/prometheus-k8s created
role.rbac.authorization.k8s.io/prometheus-k8s created
role.rbac.authorization.k8s.io/prometheus-k8s created
prometheusrule.monitoring.coreos.com/prometheus-k8s-rules created
service/prometheus-k8s created
serviceaccount/prometheus-k8s created
servicemonitor.monitoring.coreos.com/prometheus created
servicemonitor.monitoring.coreos.com/kube-apiserver created
servicemonitor.monitoring.coreos.com/coredns created
servicemonitor.monitoring.coreos.com/kube-controller-manager created
servicemonitor.monitoring.coreos.com/kube-scheduler created
servicemonitor.monitoring.coreos.com/kubelet created
The CRDs created by kube-prometheus
[root@bj-k8s-master-56 ~]# k get crd -o wide
NAME                                    CREATED AT
alertmanagers.monitoring.coreos.com     2019-11-26T03:48:24Z
podmonitors.monitoring.coreos.com       2020-03-04T07:11:14Z
prometheuses.monitoring.coreos.com      2019-11-26T03:48:24Z
prometheusrules.monitoring.coreos.com   2019-11-26T03:48:24Z
servicemonitors.monitoring.coreos.com   2019-11-26T03:48:24Z
The prometheus resource defines how the Prometheus server should run
[root@bj-k8s-master-56 ~]# k -n monitoring get prometheus
NAME   AGE
k8s    36m
Likewise, the alertmanager resource defines how Alertmanager runs
[root@bj-k8s-master-56 ~]# kubectl -n monitoring get alertmanager
NAME   AGE
main   37m
Both Prometheus and Alertmanager run as StatefulSets
[root@bj-k8s-master-56 ~]# k -n monitoring get statefulset -o wide
NAME                READY   AGE   CONTAINERS                                                       IMAGES
alertmanager-main   3/3     34m   alertmanager,config-reloader                                     quay.io/prometheus/alertmanager:v0.18.0,quay.io/coreos/configmap-reload:v0.0.1
prometheus-k8s      1/2     33m   prometheus,prometheus-config-reloader,rules-configmap-reloader   quay.io/prometheus/prometheus:v2.11.0,quay.io/coreos/prometheus-config-reloader:v0.34.0,quay.io/coreos/configmap-reload:v0.0.1
The default Grafana deployment has no ConfigMap for its config file; it uses SQLite storage placed under /var/lib/grafana, which is mounted from an emptyDir.
[root@bj-k8s-node-84 ~]# mkdir /data/apps/data/pv/monitoring-grafana
[root@bj-k8s-node-84 ~]# chown 65534:65534 /data/apps/data/pv/monitoring-grafana
[root@docker-182 grafana]# k apply -f grafana-local-pv.yml,grafana-local-pvc.yml
persistentvolume/grafana-pv created
persistentvolumeclaim/grafana-pvc created
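The grafana-local-pv.yml and grafana-local-pvc.yml files are not shown above; a minimal sketch of what they plausibly contain, assuming a local-storage StorageClass, the directory created above, and the 16Gi capacity visible in the later kubectl get pvc output (the node hostname is an assumption):
apiVersion: v1
kind: PersistentVolume
metadata:
  name: grafana-pv
spec:
  capacity:
    storage: 16Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /data/apps/data/pv/monitoring-grafana
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values: ["bj-k8s-node-84.tmtgeo.com"]   # node where the directory was created (assumed)
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-pvc
  namespace: monitoring
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-storage
  resources:
    requests:
      storage: 16Gi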
Create the database (Grafana's backend is being switched from SQLite to MariaDB)
MariaDB [(none)]> create database k8s_55_grafana default character set utf8;
Query OK, 1 row affected (0.01 sec)
MariaDB [(none)]> grant all on k8s_55_grafana.* to grafana@'%';
Query OK, 0 rows affected (0.05 sec)
MariaDB [(none)]> flush privileges;
Query OK, 0 rows affected (0.03 sec)
[root@docker-182 grafana]# k apply -f grafana-mysql_endpoint.yaml
service/grafana-mysql created
endpoints/grafana-mysql created
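grafana-mysql_endpoint.yaml follows the standard pattern for pointing a Service at something outside the cluster: a selector-less Service plus hand-maintained Endpoints. A sketch (the MariaDB address is a placeholder):
apiVersion: v1
kind: Service
metadata:
  name: grafana-mysql
  namespace: monitoring
spec:
  ports:
  - port: 3306
    targetPort: 3306
---
apiVersion: v1
kind: Endpoints
metadata:
  name: grafana-mysql       # must match the Service name
  namespace: monitoring
subsets:
- addresses:
  - ip: 192.0.2.10          # external MariaDB host (placeholder)
  ports:
  - port: 3306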
[root@docker-182 grafana]# k55 apply -f grafana-config_cm.yaml
configmap/grafana-config created
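The grafana-config ConfigMap carries the grafana.ini mounted by the deployment diff below; a sketch of its database section, assuming the MariaDB database and user created above (the password is a placeholder):
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-config
  namespace: monitoring
data:
  grafana.ini: |
    [database]
    type = mysql
    host = grafana-mysql:3306
    name = k8s_55_grafana
    user = grafana
    password = CHANGEME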
[root@docker-182 grafana]# cp /data/apps/soft/ansible/kubernetes/kube-prometheus-0.3.0/manifests/grafana-deployment.yaml ./
[root@docker-182 grafana]# diff /data/apps/soft/ansible/kubernetes/kube-prometheus-0.3.0/manifests/grafana-deployment.yaml ./grafana-deployment.yaml
35a36,38
> - mountPath: /etc/grafana/grafana.ini
> name: grafana-ini
> subPath: grafana.ini
124c127,129
< - emptyDir: {}
---
> #- emptyDir: {}
> - persistentVolumeClaim:
> claimName: grafana-pvc
203a209,211
> - configMap:
> name: grafana-config
> name: grafana-ini
[root@docker-182 grafana]# k apply -f grafana-deployment.yaml
deployment.apps/grafana configured
[root@docker-182 ingress-nginx]# cat ../kube-prometheus/grafana/grafana_ingress.yaml
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: grafana-ingress
  namespace: monitoring
  annotations:
    # use the shared ingress-nginx
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/rewrite-target: /$2
    #nginx.ingress.kubernetes.io/app-root: /
spec:
  rules:
  - http:
      paths:
      - path: /mygrafana(/|$)(.*)
        backend:
          serviceName: grafana
          servicePort: 3000
[root@docker-182 grafana]# k apply -f grafana_ingress.yaml
ingress.networking.k8s.io/grafana-ingress created
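One caveat from experience (not part of the original setup): when Grafana is served under a sub-path through a rewrite like this, its redirects and asset URLs tend to break unless grafana.ini also knows about the prefix. A hypothetical addition to the grafana-config ConfigMap:
data:
  grafana.ini: |
    [server]
    root_url = %(protocol)s://%(domain)s/mygrafana/
    serve_from_sub_path = true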
By default only the ServiceMonitor for kube-scheduler exists; the kube-system/kube-scheduler Service it is meant to watch is never actually created, so one has to be added by hand.
[root@bj-k8s-master-56 ~]# k -n monitoring get servicemonitor kube-scheduler
NAME             AGE
kube-scheduler   19h
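prometheus-kubeSchedulerService.yaml itself is not shown. A sketch of what it needs to contain: a selector-less Service labelled the way the ServiceMonitor expects, plus hand-maintained Endpoints (the apply output below shows an Endpoints object being configured). The master IP and metrics port here are assumptions for a typical kubeadm-era cluster:
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-scheduler
  labels:
    k8s-app: kube-scheduler
spec:
  clusterIP: None
  ports:
  - name: http-metrics
    port: 10251
    targetPort: 10251
---
apiVersion: v1
kind: Endpoints
metadata:
  namespace: kube-system
  name: kube-scheduler
  labels:
    k8s-app: kube-scheduler
subsets:
- addresses:
  - ip: 192.0.2.20          # master node IP (placeholder)
  ports:
  - name: http-metrics
    port: 10251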
[root@docker-182 kube-prometheus]# k apply -f prometheus-kubeSchedulerService.yaml
service/kube-scheduler created
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
endpoints/kube-scheduler configured
[root@docker-182 kube-prometheus]# k apply -f prometheus-kubeControllerManagerService.yaml
service/kube-controller-manager created
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
endpoints/kube-controller-manager configured
There should also be similar monitoring for kube-proxy, but no resource file defining a kube-proxy target was found, so that is skipped for now.
I wanted a way to change how prometheus-operator runs its workloads at the source, that is, by modifying the resources before applying them rather than patching the generated StatefulSets after they hit the cluster.
It turns out the kube-prometheus project uses the jsonnet language: per its documentation, customization is supposed to be done in jsonnet, which then regenerates the YAML files.
Of course the already-generated YAML under manifests/ can simply be edited instead, and that is the approach used here.
# added under spec: in prometheus-prometheus.yaml
containers:
- name: prometheus
  args:
  - --web.console.templates=/etc/prometheus/consoles
  - --web.console.libraries=/etc/prometheus/console_libraries
  - --config.file=/etc/prometheus/config_out/prometheus.env.yaml
  - --storage.tsdb.path=/prometheus
  - --storage.tsdb.retention.time=360h   # originally 24h
  - --web.enable-lifecycle
  - --storage.tsdb.no-lockfile
  - --web.route-prefix=/
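As an aside, the Prometheus CRD also exposes retention as a first-class field, which avoids overriding the container args entirely (assuming the deployed prometheus-operator version supports it):
# equivalent effect, placed directly under spec: in prometheus-prometheus.yaml
retention: 360h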
[root@docker-182 manifests]# k apply -f prometheus-prometheus.yaml
prometheus.monitoring.coreos.com/k8s configured
This approach works.
Next, Prometheus's storage volume also needs to be changed to a PVC, so the data is not wiped when a pod is recreated. The PVCs must be named prometheus-k8s-db-prometheus-k8s-0 and prometheus-k8s-db-prometheus-k8s-1. Given a volumeClaimTemplate such as:
volumeClaimTemplates:
- metadata:
    name: db
  spec:
    accessModes: ["ReadWriteOnce"]
    resources:
      requests:
        storage: 1Gi
a StatefulSet names each PVC volume_name + '-' + pod_name. In the existing StatefulSet the data volume is named prometheus-k8s-db, which yields the names above, and it must be exactly those two names.
Ordinarily a StatefulSet can have PVCs generated dynamically through a storageClass, but there is no such storage backend here, so the PVCs are created manually and then referenced.
# 1000 and 1001 are the uid and gid, on the host, of the user Prometheus runs as
[root@docker-182 kube-prometheus]# ansible 10.111.32.94 -m file -a "path=/data/apps/data/pv/prometheus-k8s-db-prometheus-k8s-0 state=directory owner=1000 group=1001"
[root@docker-182 kube-prometheus]# ansible 10.111.32.178 -m file -a "path=/data/apps/data/pv/prometheus-k8s-db-prometheus-k8s-1 state=directory owner=1000 group=1001"
# create the PVs and PVCs
[root@docker-182 kube-prometheus]# k55 apply -f prometheus-k8s-db-prometheus-k8s-0_pv.yml
persistentvolume/prometheus-k8s-db-prometheus-k8s-0 created
[root@docker-182 kube-prometheus]# k55 apply -f prometheus-k8s-db-prometheus-k8s-0_pvc.yml
persistentvolumeclaim/prometheus-k8s-db-prometheus-k8s-0 created
[root@docker-182 kube-prometheus]# k55 apply -f prometheus-k8s-db-prometheus-k8s-1_pv.yml
persistentvolume/prometheus-k8s-db-prometheus-k8s-1 created
[root@docker-182 kube-prometheus]# k55 apply -f prometheus-k8s-db-prometheus-k8s-1_pvc.yml
persistentvolumeclaim/prometheus-k8s-db-prometheus-k8s-1 created
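The PV/PVC pairs follow the same shape as the Grafana ones earlier. One optional detail (an assumption, not taken from the original files): setting spec.volumeName pins each claim to its matching PV; without it, identically-sized claims can bind to the PVs crosswise, which is exactly what happens with the Alertmanager claims further down:
# hypothetical sketch of prometheus-k8s-db-prometheus-k8s-0_pvc.yml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-k8s-db-prometheus-k8s-0
  namespace: monitoring
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-storage
  volumeName: prometheus-k8s-db-prometheus-k8s-0   # pin to the matching PV
  resources:
    requests:
      storage: 200Gi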
Apply the update
# added under spec: in prometheus-prometheus.yaml
storage:
  volumeClaimTemplate:
    metadata:
      name: prometheus-k8s-db
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 200Gi
[root@docker-182 kube-prometheus]# k apply -f prometheus-prometheus.yaml
Checking the PVC status and the pod's volume definition confirms it worked.
[root@bj-k8s-master-56 ~]# k -n monitoring get pvc
NAME                                 STATUS   VOLUME                               CAPACITY   ACCESS MODES   STORAGECLASS    AGE
grafana-pvc                          Bound    grafana-pv                           16Gi       RWO            local-storage   5d1h
prometheus-k8s-db-prometheus-k8s-0   Bound    prometheus-k8s-db-prometheus-k8s-0   200Gi      RWO            local-storage   3m43s
prometheus-k8s-db-prometheus-k8s-1   Bound    prometheus-k8s-db-prometheus-k8s-1   200Gi      RWO            local-storage   3m33s
# the original volume definition was
- emptyDir: {}
  name: prometheus-k8s-db
# it is now overridden by the PVC
volumes:
- name: prometheus-k8s-db
  persistentVolumeClaim:
    claimName: prometheus-k8s-db-prometheus-k8s-0
The same applies to Alertmanager; here the PVC names must be alertmanager-main-db-alertmanager-main-0, alertmanager-main-db-alertmanager-main-1, and alertmanager-main-db-alertmanager-main-2.
Create the directories and the PVCs
[root@docker-182 kube-prometheus]# ansible 10.111.32.94 -m file -a "path=/data/apps/data/pv/alertmanager-main-db-alertmanager-main-0 state=directory owner=1000 group=1001"
[root@docker-182 kube-prometheus]# ansible 10.111.32.94 -m file -a "path=/data/apps/data/pv/alertmanager-main-db-alertmanager-main-1 state=directory owner=1000 group=1001"
[root@docker-182 kube-prometheus]# ansible 10.111.32.178 -m file -a "path=/data/apps/data/pv/alertmanager-main-db-alertmanager-main-2 state=directory owner=1000 group=1001"
# apply the PV and PVC resources
[root@docker-182 alertmanager]# ls -1r |while read line; do k apply -f ${line};done
persistentvolume/alertmanager-main-db-alertmanager-main-2 created
persistentvolumeclaim/alertmanager-main-db-alertmanager-main-2 created
persistentvolume/alertmanager-main-db-alertmanager-main-1 created
persistentvolumeclaim/alertmanager-main-db-alertmanager-main-1 created
persistentvolume/alertmanager-main-db-alertmanager-main-0 created
persistentvolumeclaim/alertmanager-main-db-alertmanager-main-0 created
Modify the alertmanager resource file and apply the change
# add a storage stanza under spec
storage:
  volumeClaimTemplate:
    metadata:
      name: alertmanager-main-db
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
# apply the change
[root@docker-182 alertmanager]# k apply -f alertmanager-alertmanager.yaml
alertmanager.monitoring.coreos.com/main configured
Verification looks good. Note that the claims bound to the PVs crosswise (claim -0 to PV -1, and so on); since the three PVs are interchangeable this is harmless, as the Running pods below confirm.
[root@bj-k8s-master-56 ~]# k -n monitoring get pvc |grep alertmanager
alertmanager-main-db-alertmanager-main-0   Bound   alertmanager-main-db-alertmanager-main-1   10Gi   RWO   local-storage   4h1m
alertmanager-main-db-alertmanager-main-1   Bound   alertmanager-main-db-alertmanager-main-2   10Gi   RWO   local-storage   4h1m
alertmanager-main-db-alertmanager-main-2   Bound   alertmanager-main-db-alertmanager-main-0   10Gi   RWO   local-storage   4h1m
[root@bj-k8s-master-56 ~]# k -n monitoring get statefulset alertmanager-main -o yaml
...
  volumeClaimTemplates:
  - metadata:
      creationTimestamp: null
      name: alertmanager-main-db
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      volumeMode: Filesystem
    status:
      phase: Pending
...
[root@bj-k8s-master-56 ~]# k -n monitoring get pod -o wide |grep alertmanager
alertmanager-main-0   2/2   Running   0   3m56s   10.20.60.180    bj-k8s-node-84.tmtgeo.com    <none>   <none>
alertmanager-main-1   2/2   Running   0   3m56s   10.20.245.249   bj-k8s-node-178.tmtgeo.com   <none>   <none>
alertmanager-main-2   2/2   Running   0   3m56s   10.20.60.179    bj-k8s-node-84.tmtgeo.com    <none>   <none>
Next, add a ServiceMonitor for ingress-nginx.
[root@docker-182 ingress-nginx]# cat ingress-serviceMonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/name: ingress-nginx
  name: ingress-nginx
  namespace: monitoring
spec:
  endpoints:
  - interval: 15s
    port: "10254"
    bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
  jobLabel: app.kubernetes.io/name
  namespaceSelector:
    matchNames:
    - ingress-nginx
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-nginx
After applying it, Prometheus logged errors:
level=error ts=2020-03-10T10:33:21.196Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:263: Failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" in the namespace \"ingress-nginx\""
level=error ts=2020-03-10T10:33:22.197Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:264: Failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"services\" in API group \"\" in the namespace \"ingress-nginx\""
level=error ts=2020-03-10T10:33:22.198Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:265: Failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" in the namespace \"ingress-nginx\""
At a glance this is clearly a permissions problem. But why can resources in the default kube-system namespace be scraped? Because kube-prometheus only creates per-namespace prometheus-k8s Roles for the namespaces it monitors (the three role.rbac.authorization.k8s.io/prometheus-k8s entries in the apply output above); ingress-nginx is not among them.
Create a ClusterRole and bind it to the prometheus-k8s ServiceAccount (the stock clusterRole could also simply have been modified instead):
[root@docker-182 kube-prometheus]# cat my-prometheus-clusterRoleBinding.yml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: my-prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch
- apiGroups: [""]
  resources:
  - configmaps
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: my-prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: my-prometheus
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
  namespace: monitoring
[root@docker-182 kube-prometheus]# k apply -f my-prometheus-clusterRoleBinding.yml
clusterrole.rbac.authorization.k8s.io/my-prometheus created
clusterrolebinding.rbac.authorization.k8s.io/my-prometheus created
With the clusterRoleBinding in place, the ingress-nginx endpoints can now be discovered, but the default endpoints only contain ports 80 and 443: the metrics port 10254 is exposed by the controller pods yet never declared in the ingress-nginx Service, so it cannot be detected here.
[root@bj-k8s-master-56 ~]# k -n ingress-nginx get endpoints -o wide
NAME            ENDPOINTS                                                        AGE
ingress-nginx   10.111.32.178:80,10.111.32.94:80,10.111.32.178:443 + 1 more...   4d17h
# add to the ports list in ingress-nginx-svc.yaml
- name: metrics
  port: 10254
  targetPort: 10254
Apply the change
[root@docker-182 ingress-nginx]# k apply -f ingress-nginx-svc.yaml
service/ingress-nginx configured
# the endpoints now include 10254
[root@bj-k8s-master-56 ~]# k -n ingress-nginx get endpoints
NAME            ENDPOINTS                                                          AGE
ingress-nginx   10.111.32.178:80,10.111.32.94:80,10.111.32.178:10254 + 3 more...   4d17h
Then change the endpoint in the ingress-nginx serviceMonitor to reference the port by name (a ServiceMonitor's port field takes a service port name, not a number):
endpoints:
- interval: 15s
  port: metrics
  bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
With that, the ingress-nginx target is fully added.
Next, load the dashboards from https://github.com/kubernetes/ingress-nginx/tree/master/deploy/grafana/dashboards.
Unfortunately many labels used by the Request Handling Performance dashboard are outdated and unusable, and quite a few panels in the NGINX Ingress controller dashboard do not work either.
kube-prometheus bundles kube-state-metrics and prometheus-adapter. The prometheus-adapter's apiservice is v1beta1.metrics.k8s.io:
[root@bj-k8s-master-56 ~]# k get apiservice |grep prome
v1beta1.metrics.k8s.io   monitoring/prometheus-adapter   True   54d
This is wrong: the v1beta1.metrics.k8s.io apiservice belongs to Kubernetes' own metrics-server. With kube-prometheus occupying it, anything that depends on this API, such as hpa resources, runs into trouble.
[root@bj-k8s-master-56 ~]# kubectl -n kube-system get pod -o wide |grep metrics
metrics-server-7ff49d67b8-mczv8   1/1   Running   2   51d   10.20.245.239   bj-k8s-node-178.tmtgeo.com   <none>   <none>
hpa v2
[root@docker-182 hpa]# k apply -f .
horizontalpodautoscaler.autoscaling/metrics-app-hpa created
deployment.apps/metrics-app created
service/metrics-app created
servicemonitor.monitoring.coreos.com/metrics-app created
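The metrics-app-hpa manifest itself is not shown; a sketch consistent with the kubectl get hpa output further down (min 2, max 10, http_requests averaging 800m), using the autoscaling/v2beta1 schema of that era:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: metrics-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: metrics-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metricName: http_requests        # custom metric served by prometheus-adapter
      targetAverageValue: 800m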
It errors out:
Type      Reason                         Age                  From                        Message
----      ------                         ----                 ----                        -------
Warning   FailedComputeMetricsReplicas   18m (x12 over 21m)   horizontal-pod-autoscaler   Invalid metrics (1 invalid out of 1), last error was: failed to get object metric value: unable to get metric http_requests: unable to fetch metrics from custom metrics API: no custom metrics API (custom.metrics.k8s.io) registered
Warning   FailedGetPodsMetric            73s (x80 over 21m)   horizontal-pod-autoscaler   unable to get metric http_requests: unable to fetch metrics from custom metrics API: no custom metrics API (custom.metrics.k8s.io) registered
The resources under experimental/custom-metrics-api provide the fix. First restore the metrics-server's apiservice:
[root@docker-182 metrics-server]# k apply -f metrics-apiservice.yaml
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io configured
[root@docker-182 custom-metrics-api]# ls *.yaml |while read line; do k apply -f ${line};done
clusterrolebinding.rbac.authorization.k8s.io/custom-metrics-server-resources created
apiservice.apiregistration.k8s.io/v1beta1.custom.metrics.k8s.io created
clusterrole.rbac.authorization.k8s.io/custom-metrics-server-resources created
configmap/adapter-config configured
clusterrolebinding.rbac.authorization.k8s.io/hpa-controller-custom-metrics created
servicemonitor.monitoring.coreos.com/sample-app created
service/sample-app created
deployment.apps/sample-app created
horizontalpodautoscaler.autoscaling/sample-app created
[root@docker-182 custom-metrics-api]# pwd
/data/apps/soft/ansible/kubernetes/kube-prometheus-0.3.0/experimental/custom-metrics-api
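For reference, the custom-metrics wiring lives in the adapter-config ConfigMap that was just re-applied (under its config.yaml key). A minimal discovery rule in prometheus-adapter's config format (a sketch, not the exact config shipped in experimental/custom-metrics-api) that would expose an http_requests_total counter as the http_requests pods metric:
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}"                          # http_requests_total -> http_requests
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'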
v1beta1.metrics.k8s.io is back to normal, but v1beta1.custom.metrics.k8s.io now reports an error:
[root@bj-k8s-master-56 ~]# k get apiservices |grep metric
v1beta1.custom.metrics.k8s.io   monitoring/prometheus-adapter   False (FailedDiscoveryCheck)   3m5s
v1beta1.metrics.k8s.io          kube-system/metrics-server      True                           54d
The error message:
Status:
  Conditions:
    Last Transition Time:  2020-03-11T07:45:25Z
    Message:               failing or missing response from https://10.20.60.171:6443/apis/custom.metrics.k8s.io/v1beta1: bad status from https://10.20.60.171:6443/apis/custom.metrics.k8s.io/v1beta1: 404
    Reason:                FailedDiscoveryCheck
    Status:                False
    Type:                  Available
Events:                    <none>
[root@bj-k8s-master-56 ~]# k -n monitoring get pod -o wide |grep adap
prometheus-adapter-68698bc948-qmpvr   1/1   Running   0   7d   10.20.60.171   bj-k8s-node-84.tmtgeo.com   <none>   <none>
The API is indeed unreachable:
[root@bj-k8s-master-56 ~]# curl -i -k https://10.20.60.171:6443/apis/custom.metrics.k8s.io
HTTP/1.1 404 Not Found
Content-Type: application/json
Date: Wed, 11 Mar 2020 07:56:10 GMT
Content-Length: 229
{
  "paths": [
    "/apis",
    "/apis/metrics.k8s.io",
    "/apis/metrics.k8s.io/v1beta1",
    "/healthz",
    "/healthz/ping",
    "/healthz/poststarthook/generic-apiserver-start-informers",
    "/metrics",
    "/version"
  ]
}
The paths list shows this adapter build only serves metrics.k8s.io, which suggests the image version is too old. There is no latest tag to pull:
[root@bj-k8s-node-84 ~]# docker pull quay.io/coreos/k8s-prometheus-adapter-amd64:latest
Error response from daemon: manifest for quay.io/coreos/k8s-prometheus-adapter-amd64:latest not found
The registry page does list a newer tag, v0.6.0. (Simply changing the image and letting the kubelet pull it would also work, but the network is unreliable, so the image was pulled manually and copied to every node.)
[root@bj-k8s-node-84 ~]# docker pull quay.io/coreos/k8s-prometheus-adapter-amd64:v0.6.0
[root@docker-182 adapter]# grep image: prometheus-adapter-deployment.yaml
image: quay.io/coreos/k8s-prometheus-adapter-amd64:v0.6.0
[root@docker-182 adapter]# k apply -f prometheus-adapter-deployment.yaml
deployment.apps/prometheus-adapter configured
Back to normal:
[root@bj-k8s-master-56 ~]# k get apiservices |grep custom
v1beta1.custom.metrics.k8s.io   monitoring/prometheus-adapter   True   117m
[root@bj-k8s-master-56 ~]# k -n monitoring get pod -o wide | grep adapter
prometheus-adapter-7b785b6685-z6gfp   1/1   Running   0   91s   10.20.60.183   bj-k8s-node-84.tmtgeo.com   <none>   <none>
[root@bj-k8s-master-56 ~]# curl -i -k https://10.20.60.183:6443/apis/custom.metrics.k8s.io
HTTP/1.1 200 OK
Content-Type: application/json
Date: Wed, 11 Mar 2020 09:41:45 GMT
Content-Length: 303
{
  "kind": "APIGroup",
  "apiVersion": "v1",
  "name": "custom.metrics.k8s.io",
  "versions": [
    {
      "groupVersion": "custom.metrics.k8s.io/v1beta1",
      "version": "v1beta1"
    }
  ],
  "preferredVersion": {
    "groupVersion": "custom.metrics.k8s.io/v1beta1",
    "version": "v1beta1"
  }
}
hpaV2 is back to normal as well
[root@bj-k8s-master-56 ~]# k get hpa
NAME              REFERENCE                TARGETS       MINPODS   MAXPODS   REPLICAS   AGE
metrics-app-hpa   Deployment/metrics-app   36133m/800m   2         10        4          159m
myapp             Deployment/myapp         23%/60%       2         5         2          178m
sample-app        Deployment/sample-app    400m/500m     1         10        1          126m
Thinking back: right after kube-prometheus was first deployed, with prometheus-adapter occupying v1beta1.metrics.k8s.io, the top command still worked. The reason is that the quay.io/coreos/k8s-prometheus-adapter-amd64:v0.5.0 image itself only serves /apis/metrics.k8s.io/v1beta1, exactly as recorded in the curl output above.