This all started with a monitoring requirement from a certain gentleman, so I am writing down a few notes here. All of the images used come from Bitnami.
Bitnami is an open-source project that produces installers, application stacks, and virtual appliances for web applications. Its sponsor, BitRock, was founded in 2003 in Seville, Spain by Daniel Lopez Ridruejo. Bitnami stacks can be installed on Linux, Windows, Mac OS X, and Solaris. -- from Baidu Baike.
Let's get started. First, deploy Prometheus. Its configuration file can be obtained from the official site, or pulled out of a running instance. The Prometheus data directory and the main configuration file are mounted separately. RBAC is used here so that Prometheus can read the relevant Kubernetes information; the permissions can be extended or trimmed as needed. First create a ServiceAccount, then a ClusterRole and a ClusterRoleBinding, giving Prometheus read access to the relevant Kubernetes resources.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: default
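Assuming the three manifests above are saved as prometheus-rbac.yaml (the file name is just an example), they can be applied and the granted permissions spot-checked like this:
# apply the ServiceAccount, ClusterRole and ClusterRoleBinding
kubectl apply -f prometheus-rbac.yaml
# spot-check that the prometheus ServiceAccount can list the resources it needs
kubectl auth can-i list nodes --as=system:serviceaccount:default:prometheus
kubectl auth can-i list pods --as=system:serviceaccount:default:prometheus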
Create the ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-cm
data:
  prometheus.yml: |
    global:
      evaluation_interval: 15s
      scrape_interval: 15s
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
    rule_files:
    scrape_configs:
    - job_name: "prometheus"
      static_configs:
      - targets: ["localhost:9090"]
Create the Prometheus Deployment (along with its Service and PVC), using a StorageClass to provision the PV automatically:
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  labels:
    app: prometheus
spec:
  ports:
  - name: http
    port: 9090
    targetPort: http
    nodePort: 30203
  selector:
    app: prometheus
  type: NodePort
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-pvc
  annotations:
    volume.beta.kubernetes.io/storage-class: "nfs-storage"
  labels:
    app: prometheus
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  selector:
    matchLabels:
      app: prometheus
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
      - image: docker.io/bitnami/prometheus:2.32.1
        name: prometheus
        ports:
        - containerPort: 9090
          name: http
        volumeMounts:
        - mountPath: /opt/bitnami/prometheus/data
          name: prometheus
        - mountPath: /opt/bitnami/prometheus/conf/prometheus.yml
          name: prometheus-config
          subPath: prometheus.yml
      volumes:
      - name: prometheus
        persistentVolumeClaim:
          claimName: prometheus-pvc
      - name: prometheus-config
        configMap:
          name: prometheus-cm
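Assuming the manifests above are saved as prometheus-cm.yaml and prometheus-deploy.yaml (names are just examples), bringing the stack up and sanity-checking it might look like this (10.0.4.15 is the node IP used throughout this post):
kubectl apply -f prometheus-cm.yaml -f prometheus-deploy.yaml
kubectl get pod,svc,pvc -l app=prometheus
# Prometheus exposes simple health endpoints; a 200 here means the server is up
curl http://10.0.4.15:30203/-/healthy
curl http://10.0.4.15:30203/-/ready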
Once it is up, Prometheus can be reached through the NodePort. Next, deploy node-exporter. The official documentation does not really recommend running it in a container and says little about containerized deployment; in practice you need to pay attention to the host's rootfs, sysfs, and procfs and use the host's PID namespace. It is deployed here as a DaemonSet with hostNetwork enabled, which makes it easy for Prometheus to scrape.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      name: node-exporter
      labels:
        app: node-exporter
    spec:
      hostNetwork: true
      hostPID: true
      containers:
      - name: node-exporter
        args:
        - --path.rootfs=/host
        - --path.procfs=/host/proc
        - --path.sysfs=/host/sys
        image: docker.io/bitnami/node-exporter:1.3.1
        ports:
        - name: tcp
          containerPort: 9100
        volumeMounts:
        - name: gen
          mountPath: /host
          readOnly: true
      volumes:
      - name: gen
        hostPath:
          path: /
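Because the DaemonSet runs with hostNetwork, each exporter listens on port 9100 of its node; a quick check against one node (file name and IP are assumptions) could be:
kubectl apply -f node-exporter-ds.yaml
kubectl get pods -l app=node-exporter -o wide
# node-exporter serves plain-text metrics on the host network
curl -s http://10.0.4.15:9100/metrics | head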
Modify the Prometheus ConfigMap and restart Prometheus. Hot reloading could also be enabled instead, but it was not used this time (see the sketch after the config below). CoreDNS metrics can be added as well.
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-cm
data:
  prometheus.yml: |
    global:
      evaluation_interval: 15s
      scrape_interval: 15s
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
    rule_files:
    scrape_configs:
    - job_name: "prometheus"
      static_configs:
      - targets: ["localhost:9090"]
    - job_name: "node-exporter"
      scrape_interval: 10s
      static_configs:
      - targets: ["10.0.4.15:9100"]
        labels:
          group: "k8s-test"
    - job_name: "coredns"
      static_configs:
      - targets: ["kube-dns.kube-system.svc.cluster.local:9153"]
        labels:
          group: "k8s-test"
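For the restart mentioned above, a rollout restart is enough. If hot reloading were enabled instead (it was not here), the Prometheus container would additionally need the --web.enable-lifecycle flag in its args, after which the config can be reloaded over HTTP. A rough sketch:
kubectl apply -f prometheus-cm.yaml
# simple route taken in this post: recreate the pod so it re-reads the ConfigMap
kubectl rollout restart deployment/prometheus
# alternative, assuming --web.enable-lifecycle has been added to the container args
curl -X POST http://10.0.4.15:30203/-/reload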
Next, Grafana can be deployed. The official documentation explains this in detail: Deploy Grafana on Kubernetes | Grafana Labs.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-pvc
  annotations:
    volume.beta.kubernetes.io/storage-class: "nfs-storage"
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: grafana
  name: grafana
spec:
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      securityContext:
        fsGroup: 472
        supplementalGroups:
        - 0
      containers:
      - name: grafana
        image: docker.io/grafana/grafana:8.3.3
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 3000
          name: http-grafana
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /robots.txt
            port: 3000
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 2
        livenessProbe:
          failureThreshold: 3
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 3000
          timeoutSeconds: 1
        resources:
          requests:
            cpu: 250m
            memory: 750Mi
        volumeMounts:
        - mountPath: /var/lib/grafana
          name: grafana-pv
      volumes:
      - name: grafana-pv
        persistentVolumeClaim:
          claimName: grafana-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
spec:
  ports:
  - port: 3000
    protocol: TCP
    targetPort: http-grafana
    nodePort: 32000
  selector:
    app: grafana
  sessionAffinity: None
  type: NodePort
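Assuming the manifests above are saved as grafana.yaml, Grafana can be brought up and checked like this (for this image the initial admin password defaults to admin unless it is overridden):
kubectl apply -f grafana.yaml
kubectl get pods,svc -l app=grafana
kubectl get pvc grafana-pvc
# Grafana's health endpoint returns JSON with the database status and version
curl http://10.0.4.15:32000/api/health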
Once Grafana is up, you can pick excellent dashboard templates as needed from the official Dashboards | Grafana Labs repository. Online, a dashboard can be loaded directly by its ID; offline, the JSON file can be downloaded on a machine with internet access, imported on the intranet, and then adjusted as needed. This time dashboard 1860, Node Exporter Full, was used, which is fairly comprehensive.
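For the offline route, the dashboard JSON can be fetched from grafana.com on an internet-connected machine and then imported through the UI (Dashboards -> Import) on the intranet; the revision-download URL below is the commonly used grafana.com API and should be treated as an assumption:
# download the "Node Exporter Full" dashboard (ID 1860) as JSON
curl -L -o node-exporter-full-1860.json https://grafana.com/api/dashboards/1860/revisions/latest/download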
Deploy metrics-server, which is mainly used to monitor CPU, memory, file descriptors, and so on, so RBAC is needed to grant the corresponding permissions. After deployment, kubectl top node and kubectl top pod can be used to view resource usage.
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    k8s-app: metrics-server
    rbac.authorization.k8s.io/aggregate-to-admin: "true"
    rbac.authorization.k8s.io/aggregate-to-edit: "true"
    rbac.authorization.k8s.io/aggregate-to-view: "true"
  name: system:aggregated-metrics-reader
rules:
- apiGroups:
  - metrics.k8s.io
  resources:
  - pods
  - nodes
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    k8s-app: metrics-server
  name: system:metrics-server
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  - nodes/stats
  - namespaces
  - configmaps
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server-auth-reader
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: extension-apiserver-authentication-reader
subjects:
- kind: ServiceAccount
  name: metrics-server
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server:system:auth-delegator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:auth-delegator
subjects:
- kind: ServiceAccount
  name: metrics-server
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    k8s-app: metrics-server
  name: system:metrics-server
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:metrics-server
subjects:
- kind: ServiceAccount
  name: metrics-server
  namespace: kube-system
---
apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server
  namespace: kube-system
spec:
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: https
  selector:
    k8s-app: metrics-server
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: metrics-server
  strategy:
    rollingUpdate:
      maxUnavailable: 0
  template:
    metadata:
      labels:
        k8s-app: metrics-server
    spec:
      containers:
      - args:
        - --cert-dir=/tmp
        - --secure-port=4443
        - --kubelet-preferred-address-types=InternalIP
        - --kubelet-use-node-status-port
        - --metric-resolution=15s
        - --kubelet-insecure-tls
        image: docker.io/bitnami/metrics-server:0.5.2
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /livez
            port: https
            scheme: HTTPS
          periodSeconds: 10
        name: metrics-server
        ports:
        - containerPort: 4443
          name: https
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /readyz
            port: https
            scheme: HTTPS
          initialDelaySeconds: 20
          periodSeconds: 10
        resources:
          requests:
            cpu: 100m
            memory: 200Mi
        securityContext:
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 1000
        volumeMounts:
        - mountPath: /tmp
          name: tmp-dir
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName: system-cluster-critical
      serviceAccountName: metrics-server
      volumes:
      - emptyDir: {}
        name: tmp-dir
---
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  labels:
    k8s-app: metrics-server
  name: v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: metrics-server
    namespace: kube-system
  version: v1beta1
  versionPriority: 100
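After applying the manifests (the file name below is an example), it takes a short while for the APIService to become Available; once it does, the kubectl top commands mentioned above should return data:
kubectl apply -f metrics-server.yaml
kubectl -n kube-system get pods -l k8s-app=metrics-server
# the aggregated API must report Available=True before kubectl top works
kubectl get apiservice v1beta1.metrics.k8s.io
kubectl top node
kubectl top pod -A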
Deploy kube-state-metrics, which is mainly used to monitor the state of Deployments, Pods, replicas, and so on. Here it simply reuses the permissions of the prometheus ServiceAccount.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  labels:
    k8s-app: kube-state-metrics
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: kube-state-metrics
  template:
    metadata:
      labels:
        k8s-app: kube-state-metrics
    spec:
      serviceAccountName: prometheus
      containers:
      - name: kube-state-metrics
        image: docker.io/bitnami/kube-state-metrics:2.3.0
        securityContext:
          runAsUser: 65534
        ports:
        - name: http-metrics   ## port exposing the Kubernetes object metrics
          containerPort: 8080
        - name: telemetry      ## port exposing kube-state-metrics' own telemetry metrics
          containerPort: 8081
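A quick way to confirm kube-state-metrics is serving data is to port-forward both ports and curl them locally (the file name below is an example):
kubectl apply -f kube-state-metrics.yaml
kubectl get pods -l k8s-app=kube-state-metrics
# forward both ports and look at the object metrics and the self-telemetry
kubectl port-forward deployment/kube-state-metrics 8080:8080 8081:8081 &
curl -s http://127.0.0.1:8080/metrics | head
curl -s http://127.0.0.1:8081/metrics | head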
Add the cAdvisor metrics. cAdvisor is Google's open-source tool for container resource monitoring and performance analysis; it does not need to be installed separately, since it is built into Kubernetes. The apiserver also exposes its own metrics. Modify the Prometheus ConfigMap, restart Prometheus after the update, check whether the targets are healthy, and curl the metrics endpoints to confirm that data is coming back.
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-cm
data:
  prometheus.yml: |
    global:
      evaluation_interval: 15s
      scrape_interval: 15s
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
    rule_files:
    scrape_configs:
    - job_name: "prometheus"
      static_configs:
      - targets: ["localhost:9090"]
    - job_name: "node-exporter"
      scrape_interval: 10s
      static_configs:
      - targets: ["10.0.4.15:9100"]
        labels:
          group: "k8s-test"
    - job_name: "coredns"
      static_configs:
      - targets: ["kube-dns.kube-system.svc.cluster.local:9153"]
        labels:
          group: "k8s-test"
    - job_name: "kubernetes-nodes"
      kubernetes_sd_configs:
      - role: node
        api_server: https://10.0.4.15:6443
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        tls_config:
          insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      tls_config:
        insecure_skip_verify: true
      relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
    - job_name: "kube-state-metrics"
      metrics_path: metrics
      kubernetes_sd_configs:
      - role: pod
        api_server: https://10.0.4.15:6443
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        tls_config:
          insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      tls_config:
        insecure_skip_verify: true
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_pod_ip]
        regex: (.+)
        target_label: __address__
        replacement: ${1}:8080
      - source_labels: ["__meta_kubernetes_pod_container_name"]
        regex: "^kube-state-metrics.*"
        action: keep
    - job_name: "kubernetes-nodes-cadvisor"
      metrics_path: /metrics
      scheme: https
      kubernetes_sd_configs:
      - role: node
        api_server: https://10.0.4.15:6443
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        tls_config:
          insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      tls_config:
        insecure_skip_verify: true
      relabel_configs:
      # turn each node label (.*) into a target label, keeping its value
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.*)
      # rewrite NodeIP:10250 to APIServerIP:6443
      - action: replace
        regex: (.*)
        source_labels: ["__address__"]
        target_label: __address__
        replacement: 10.0.4.15:6443
      - action: replace
        source_labels: [__meta_kubernetes_node_name]
        target_label: __metrics_path__
        regex: (.*)
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    - job_name: "k8s-apiserver"
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
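After updating the ConfigMap and restarting Prometheus, the targets page (Status -> Targets) should show all jobs as UP; the same information is available from the HTTP API, and a quick query confirms that samples are arriving (IP and port are from this post's test environment):
kubectl apply -f prometheus-cm.yaml
kubectl rollout restart deployment/prometheus
# list every target and its health as seen by Prometheus
curl -s http://10.0.4.15:30203/api/v1/targets | head -c 800
# run a simple query to confirm the new jobs are reporting
curl -s 'http://10.0.4.15:30203/api/v1/query?query=up' | head -c 800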
There are also plenty of Kubernetes-related dashboards for Grafana, such as 14518, which was used this time.
With that, monitoring of the Kubernetes cluster's node resources and higher-level resources is roughly in place. The missing alerting part will be added later.