Prometheus Operator (hereafter simply "the Operator") provides the following features:
1. Create/destroy: it becomes much easier to start a Prometheus instance in a Kubernetes namespace, so a specific application or team can adopt it with little effort.
2. Simple configuration: the basics of Prometheus, such as version, storage, and replicas, are configured through Kubernetes resources (see the sketch right after this list).
3. Target services via labels: monitoring target configuration is generated automatically from ordinary Kubernetes label queries; there is no need to learn a Prometheus-specific configuration language.
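For example, version, replica count, and storage are all declared on the Prometheus custom resource itself. A minimal sketch is shown below; the resource name and the StorageClass are illustrative placeholders, not values taken from this deployment:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example                     # illustrative name
  namespace: monitoring
spec:
  replicas: 2
  version: v2.15.2
  serviceAccountName: prometheus-k8s
  serviceMonitorSelector: {}
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: standard  # illustrative StorageClass
        resources:
          requests:
            storage: 10Gi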
The Prometheus Operator architecture is shown below:
In the architecture diagram above, the components run in the Kubernetes cluster in different ways:
- Operator: deploys and manages the Prometheus Server according to custom resources (Custom Resource Definitions, CRDs), watches these custom resources for change events, and reacts to them; it is the control centre of the whole system.
- Prometheus: declares the desired state of a Prometheus deployment; the Operator makes sure the running deployment always stays consistent with the definition.
- Prometheus Server: the Prometheus Server cluster that the Operator deploys from what is defined in the Prometheus custom resource; these custom resources can be viewed as the handle for managing the Prometheus Server cluster's StatefulSets.
- ServiceMonitor: declares which services to monitor and describes the list of targets Prometheus scrapes. It selects the matching Service endpoints via labels, and Prometheus Server pulls metrics from the selected Services (a minimal example follows this list).
- Service: simply put, the object being monitored by Prometheus.
- Alertmanager: declares the desired state of an Alertmanager deployment; the Operator makes sure the running deployment always stays consistent with the definition.
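A minimal ServiceMonitor might look like the sketch below; the app label, target namespace, and port name are illustrative assumptions rather than part of this deployment:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app                 # illustrative name
  namespace: monitoring
  labels:
    team: frontend                  # a label the Prometheus CR can select on
spec:
  selector:
    matchLabels:
      app: example-app              # selects Services carrying this label
  namespaceSelector:
    matchNames:
    - default
  endpoints:
  - port: web                       # must match a named port on the Service
    interval: 30s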
The installation and deployment steps follow.
https://github.com/prometheus-operator/kube-prometheus/tree/release-0.5
The Prometheus Operator release has to match the Kubernetes version; see the compatibility matrix on GitHub for details.
1. Download the YAML files from GitHub and sort them into directories
[root@master70 manifests]# ll
total 12
drwxr-xr-x 2 root root 241 Aug 14 19:57 01-node-exporter
drwxr-xr-x 2 root root 189 Aug 14 19:58 02-alertmanage
drwxr-xr-x 2 root root 272 Aug 14 19:58 03-kube-state-metrics
drwxr-xr-x 2 root root 254 Aug 14 19:58 04-grafana
drwxr-xr-x 2 root root 4096 Aug 14 19:58 05-prometheus-adapter
drwxr-xr-x 2 root root 4096 Aug 14 19:59 06-prometheus
drwxr-xr-x 2 root root 228 Aug 14 20:03 add
drwxr-xr-x 2 root root 4096 Aug 14 20:06 setup
[root@master70 manifests]# pwd
/root/2monitor/kube-prometheus-release-0.5/manifests
## Shell history of sorting the files
210 mkdir 01-node-exporter
211 mv node-exporter-* 01-node-exporter/
212 ll
213 mkdir 02-alertmanage
214 mv alertmanager-* 02-alertmanage/
215 ll
216 mkdir 03-kube-state-metrics
217 mv kube-state-metrics-* 03-kube-state-metrics/
218 ll
219 mkdir 04-grafana
220 mv grafana-* 04-grafana/
221 ll
222 mkdir 05-prometheus-adapter
223 mv prometheus-adapter-* 05-prometheus-adapter/
224 ll
225 vim prometheus-clusterRole.yaml
226 vim prometheus-clusterRoleBinding.yaml
227 mkdor 06-prometheus
228 mkdir 06-prometheus
229 ll
230 mv prometheus-* 06-prometheus/
2. First we need to install the CRD and Operator resource objects under the setup directory:
[root@master70 manifests]# cd setup/
[root@master70 setup]# ll
total 952
-rw-r--r-- 1 root root 60 Jun 24 19:32 0namespace-namespace.yaml
-rw-r--r-- 1 root root 268343 Jun 24 19:32 prometheus-operator-0alertmanagerCustomResourceDefinition.yaml
-rw-r--r-- 1 root root 12635 Jun 24 19:32 prometheus-operator-0podmonitorCustomResourceDefinition.yaml
-rw-r--r-- 1 root root 348686 Jun 24 19:32 prometheus-operator-0prometheusCustomResourceDefinition.yaml
-rw-r--r-- 1 root root 3605 Jun 24 19:32 prometheus-operator-0prometheusruleCustomResourceDefinition.yaml
-rw-r--r-- 1 root root 23305 Jun 24 19:32 prometheus-operator-0servicemonitorCustomResourceDefinition.yaml
-rw-r--r-- 1 root root 279736 Jun 24 19:32 prometheus-operator-0thanosrulerCustomResourceDefinition.yaml
-rw-r--r-- 1 root root 425 Jun 24 19:32 prometheus-operator-clusterRoleBinding.yaml
-rw-r--r-- 1 root root 1665 Jun 24 19:32 prometheus-operator-clusterRole.yaml
-rw-r--r-- 1 root root 1943 Jun 24 19:32 prometheus-operator-deployment.yaml
-rw-r--r-- 1 root root 239 Jun 24 19:32 prometheus-operator-serviceAccount.yaml
-rw-r--r-- 1 root root 422 Jun 24 19:32 prometheus-operator-service.yaml
Create them directly:
[root@master70 setup]# kubectl create -f .
customresourcedefinition.apiextensions.k8s.io/alertmanagers.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/prometheuses.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/prometheusrules.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/servicemonitors.monitoring.coreos.com created
Error from server (AlreadyExists): error when creating "0namespace-namespace.yaml": namespaces "monitoring" already exists
Error from server (AlreadyExists): error when creating "prometheus-operator-0podmonitorCustomResourceDefinition.yaml": customresourcedefinitions.apiextensions.k8s.io "podmonitors.monitoring.coreos.com" already exists
Error from server (AlreadyExists): error when creating "prometheus-operator-0thanosrulerCustomResourceDefinition.yaml": customresourcedefinitions.apiextensions.k8s.io "thanosrulers.monitoring.coreos.com" already exists
Error from server (AlreadyExists): error when creating "prometheus-operator-clusterRole.yaml": clusterroles.rbac.authorization.k8s.io "prometheus-operator" already exists
Error from server (AlreadyExists): error when creating "prometheus-operator-clusterRoleBinding.yaml": clusterrolebindings.rbac.authorization.k8s.io "prometheus-operator" already exists
Error from server (AlreadyExists): error when creating "prometheus-operator-deployment.yaml": deployments.apps "prometheus-operator" already exists
Error from server (AlreadyExists): error when creating "prometheus-operator-service.yaml": services "prometheus-operator" already exists
Error from server (AlreadyExists): error when creating "prometheus-operator-serviceAccount.yaml": serviceaccounts "prometheus-operator" already exists
If it reports that the CRDs already exist, delete them first:
[root@master70 setup]# kubectl delete -f .
namespace "monitoring" deleted
customresourcedefinition.apiextensions.k8s.io "alertmanagers.monitoring.coreos.com" deleted
customresourcedefinition.apiextensions.k8s.io "podmonitors.monitoring.coreos.com" deleted
customresourcedefinition.apiextensions.k8s.io "prometheuses.monitoring.coreos.com" deleted
customresourcedefinition.apiextensions.k8s.io "prometheusrules.monitoring.coreos.com" deleted
customresourcedefinition.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" deleted
customresourcedefinition.apiextensions.k8s.io "thanosrulers.monitoring.coreos.com" deleted
clusterrole.rbac.authorization.k8s.io "prometheus-operator" deleted
clusterrolebinding.rbac.authorization.k8s.io "prometheus-operator" deleted
deployment.apps "prometheus-operator" deleted
service "prometheus-operator" deleted
serviceaccount "prometheus-operator" deleted
[root@master70 setup]# kubectl delete crd prometheuses.monitoring.coreos.com
[root@master70 setup]# kubectl delete crd prometheusrules.monitoring.coreos.com
[root@master70 setup]# kubectl delete crd servicemonitors.monitoring.coreos.com
[root@master70 setup]# kubectl delete crd alertmanagers.monitoring.coreos.com
Create them again:
[root@master70 setup]# kubectl create -f .
namespace/monitoring created
customresourcedefinition.apiextensions.k8s.io/alertmanagers.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/podmonitors.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/prometheuses.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/prometheusrules.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/servicemonitors.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/thanosrulers.monitoring.coreos.com created
clusterrole.rbac.authorization.k8s.io/prometheus-operator created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-operator created
deployment.apps/prometheus-operator created
service/prometheus-operator created
serviceaccount/prometheus-operator created
[root@master70 setup]# kubectl get crd -n monitoring
NAME CREATED AT
alertmanagers.monitoring.coreos.com 2020-08-14T12:10:54Z
bgpconfigurations.crd.projectcalico.org 2020-08-14T10:57:22Z
bgppeers.crd.projectcalico.org 2020-08-14T10:57:22Z
blockaffinities.crd.projectcalico.org 2020-08-14T10:57:22Z
clusterauthtokens.cluster.cattle.io 2020-08-14T10:57:46Z
clusterinformations.crd.projectcalico.org 2020-08-14T10:57:22Z
clusteruserattributes.cluster.cattle.io 2020-08-14T10:57:46Z
felixconfigurations.crd.projectcalico.org 2020-08-14T10:57:22Z
globalnetworkpolicies.crd.projectcalico.org 2020-08-14T10:57:22Z
globalnetworksets.crd.projectcalico.org 2020-08-14T10:57:22Z
hostendpoints.crd.projectcalico.org 2020-08-14T10:57:22Z
ipamblocks.crd.projectcalico.org 2020-08-14T10:57:22Z
ipamconfigs.crd.projectcalico.org 2020-08-14T10:57:22Z
ipamhandles.crd.projectcalico.org 2020-08-14T10:57:22Z
ippools.crd.projectcalico.org 2020-08-14T10:57:22Z
networkpolicies.crd.projectcalico.org 2020-08-14T10:57:23Z
networksets.crd.projectcalico.org 2020-08-14T10:57:23Z
podmonitors.monitoring.coreos.com 2020-08-14T12:10:54Z
prometheuses.monitoring.coreos.com 2020-08-14T12:10:54Z
prometheusrules.monitoring.coreos.com 2020-08-14T12:10:54Z
servicemonitors.monitoring.coreos.com 2020-08-14T12:10:54Z
thanosrulers.monitoring.coreos.com 2020-08-14T12:10:55Z
# Check that the Pod status is Running
[root@master70 setup]# kubectl get pod -n monitoring -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
prometheus-operator-574fd8ccd9-xlx8t 2/2 Running 0 14m 10.42.1.11 node71
3. Change the Service exposure for grafana, prometheus, and alertmanager to NodePort (or expose them externally with an Ingress; a sketch of the Ingress alternative follows the three Service definitions below).
[root@master70 manifests]# cat 02-alertmanage/alertmanager-service.yaml
apiVersion: v1
kind: Service
metadata:
labels:
alertmanager: main
name: alertmanager-main
namespace: monitoring
spec:
type: NodePort
ports:
- name: web
port: 9093
targetPort: web
nodePort: 30093
selector:
alertmanager: main
app: alertmanager
sessionAffinity: ClientIP
[root@master70 manifests]# cat 04-grafana/grafana-service.yaml
apiVersion: v1
kind: Service
metadata:
labels:
app: grafana
name: grafana
namespace: monitoring
spec:
type: NodePort
ports:
- name: http
port: 3000
targetPort: http
nodePort: 30030
selector:
app: grafana
[root@master70 manifests]# cat 06-prometheus/prometheus-service.yaml
apiVersion: v1
kind: Service
metadata:
labels:
prometheus: k8s
name: prometheus-k8s
namespace: monitoring
spec:
type: NodePort
ports:
- name: web
port: 9090
targetPort: web
nodePort: 30090
selector:
app: prometheus
prometheus: k8s
sessionAffinity: ClientIP
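If Ingress is preferred over NodePort, a rough sketch for Grafana could look like the following. It assumes an ingress controller is already installed, the hostname is a placeholder, and the cluster still serves the networking.k8s.io/v1beta1 Ingress API:
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
spec:
  rules:
  - host: grafana.example.com       # placeholder hostname
    http:
      paths:
      - path: /
        backend:
          serviceName: grafana      # the grafana Service defined above
          servicePort: 3000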
4. Deploy node-exporter
[root@master70 manifests]# kubectl apply -f 01-node-exporter/
clusterrole.rbac.authorization.k8s.io/node-exporter created
clusterrolebinding.rbac.authorization.k8s.io/node-exporter created
daemonset.apps/node-exporter created
service/node-exporter created
serviceaccount/node-exporter created
servicemonitor.monitoring.coreos.com/node-exporter created
[root@master70 manifests]# kubectl get pod -n monitoring
NAME READY STATUS RESTARTS AGE
node-exporter-m9pgp 2/2 Running 0 3m56s
node-exporter-tsmfs 2/2 Running 0 3m56s
prometheus-operator-574fd8ccd9-xlx8t 2/2 Running 0 25m
5. Deploy kube-state-metrics
[root@master70 manifests]# kubectl apply -f 02-kube-state-metrics/
clusterrole.rbac.authorization.k8s.io/kube-state-metrics created
clusterrolebinding.rbac.authorization.k8s.io/kube-state-metrics created
deployment.apps/kube-state-metrics created
service/kube-state-metrics created
serviceaccount/kube-state-metrics created
servicemonitor.monitoring.coreos.com/kube-state-metrics created
[root@master70 manifests]# kubectl get pod -n monitoring
NAME READY STATUS RESTARTS AGE
kube-state-metrics-bdb8874fd-xptb5 3/3 Running 0 58s
6. Deploy Prometheus.
First modify the Prometheus configuration to use service discovery, so that Services/Pods in the Kubernetes cluster are discovered automatically.
- Create the additional scrape-config file:
[root@master70 06-prometheus]# cat ../add/prometheus-additional.yaml
- job_name: 'kubernetes-endpoints'
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
action: replace
target_label: __scheme__
regex: (https?)
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
action: replace
target_label: kubernetes_name
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
- For a Service in the cluster to be discovered automatically, it needs the annotation prometheus.io/scrape=true declared in its annotations (an example Service is sketched after the command below). Save the file above as prometheus-additional.yaml, then create a corresponding Secret object from it:
[root@master70 add]# kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring
secret/additional-configs created
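A Service that this kubernetes-endpoints job would pick up might look like the sketch below; the name, namespace, and port are illustrative, and the port/path annotations are only needed when they differ from the defaults:
apiVersion: v1
kind: Service
metadata:
  name: example-app                 # illustrative Service
  namespace: default
  annotations:
    prometheus.io/scrape: "true"    # required so the relabel rule keeps this target
    prometheus.io/port: "8080"      # optional: metrics port
    prometheus.io/path: "/metrics"  # optional: metrics path
spec:
  selector:
    app: example-app
  ports:
  - name: http
    port: 8080
    targetPort: 8080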
- Then, in the file that declares the prometheus resource object, reference this extra configuration through the additionalScrapeConfigs property (the last three lines of the tail output below; a clean sketch of the block follows it):
[root@master70 06-prometheus]# tail prometheus-prometheus.yaml
fsGroup: 2000
runAsNonRoot: true
runAsUser: 1000
serviceAccountName: prometheus-k8s
serviceMonitorNamespaceSelector: {}
serviceMonitorSelector: {}
version: v2.15.2
additionalScrapeConfigs:
name: additional-configs
key: prometheus-additional.yaml
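For clarity, here is a sketch of how the additionalScrapeConfigs block sits under spec in prometheus-prometheus.yaml (most other spec fields omitted):
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: v2.15.2
  additionalScrapeConfigs:          # references the Secret created above
    name: additional-configs
    key: prometheus-additional.yaml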
[root@master70 manifests]# kubectl apply -f 06-prometheus/
clusterrole.rbac.authorization.k8s.io/prometheus-k8s created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-k8s created
servicemonitor.monitoring.coreos.com/prometheus-operator created
prometheus.monitoring.coreos.com/k8s created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s-config created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s created
role.rbac.authorization.k8s.io/prometheus-k8s-config created
role.rbac.authorization.k8s.io/prometheus-k8s created
role.rbac.authorization.k8s.io/prometheus-k8s created
role.rbac.authorization.k8s.io/prometheus-k8s created
prometheusrule.monitoring.coreos.com/prometheus-k8s-rules created
service/prometheus-k8s created
serviceaccount/prometheus-k8s created
servicemonitor.monitoring.coreos.com/prometheus created
servicemonitor.monitoring.coreos.com/kube-apiserver created
servicemonitor.monitoring.coreos.com/coredns created
servicemonitor.monitoring.coreos.com/kube-controller-manager created
servicemonitor.monitoring.coreos.com/kube-scheduler created
servicemonitor.monitoring.coreos.com/kubelet created
[root@master70 manifests]# kubectl apply -f 05-prometheus-adapter/
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io configured
clusterrole.rbac.authorization.k8s.io/prometheus-adapter created
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-adapter created
clusterrolebinding.rbac.authorization.k8s.io/resource-metrics:system:auth-delegator created
clusterrole.rbac.authorization.k8s.io/resource-metrics-server-resources created
configmap/adapter-config created
deployment.apps/prometheus-adapter created
rolebinding.rbac.authorization.k8s.io/resource-metrics-auth-reader created
service/prometheus-adapter created
serviceaccount/prometheus-adapter created
- Access Prometheus; it is exposed on NodePort 30090.
[root@master70 manifests]# kubectl get svc -n monitoring
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-state-metrics ClusterIP None 8443/TCP,9443/TCP 3m45s
node-exporter ClusterIP None 9100/TCP 8m29s
prometheus-adapter ClusterIP 10.43.114.84 443/TCP 37s
prometheus-k8s NodePort 10.43.45.89 9090:30090/TCP 68s
prometheus-operated ClusterIP None 9090/TCP 69s
prometheus-operator ClusterIP None 8443/TCP 30m
- The Operator manifests already define a number of alerting rules.
However, switching to the Targets page shows none of the expected scrape targets. Check the Prometheus Pod logs:
$ kubectl logs -f prometheus-k8s-0 prometheus -n monitoring
......
level=error ts=2020-08-14T02:38:27.800Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:261: Failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" at the cluster scope"
level=error ts=2020-08-14T02:38:27.801Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:263: Failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" at the cluster scope"
All the messages are of the form "xxx is forbidden", which points to an RBAC permission problem. From the prometheus resource object's configuration we can see that Prometheus is bound to a ServiceAccount named prometheus-k8s, and that ServiceAccount is bound to a ClusterRole named prometheus-k8s, roughly through the binding sketched below:
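The binding typically looks like prometheus-clusterRoleBinding.yaml from the stock kube-prometheus manifests; a sketch:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-k8s
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-k8s              # the ClusterRole extended below
subjects:
- kind: ServiceAccount
  name: prometheus-k8s              # the ServiceAccount Prometheus runs as
  namespace: monitoring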
Modify the YAML file so the ClusterRole grants the missing permissions:
[root@master70 06-prometheus]# cat prometheus-clusterRole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: prometheus-k8s
rules:
- apiGroups:
- ""
resources:
- nodes
- services
- endpoints
- pods
- nodes/proxy
verbs:
- get
- list
- watch
- apiGroups:
- ""
resources:
- configmaps
- nodes/metrics
verbs:
- get
- nonResourceURLs:
- /metrics
verbs:
- get
[root@master70 06-prometheus]# kubectl apply -f .
clusterrole.rbac.authorization.k8s.io/prometheus-k8s configured
Re-apply the YAML files under the prometheus directory; after a short wait the endpoints targets show up on the Prometheus dashboard.
7. Deploy Grafana
[root@master70 manifests]# kubectl apply -f 07-grafana/
secret/grafana-datasources created
configmap/grafana-dashboard-apiserver created
configmap/grafana-dashboard-cluster-total created
configmap/grafana-dashboard-controller-manager created
configmap/grafana-dashboard-k8s-resources-cluster created
configmap/grafana-dashboard-k8s-resources-namespace created
configmap/grafana-dashboard-k8s-resources-node created
configmap/grafana-dashboard-k8s-resources-pod created
configmap/grafana-dashboard-k8s-resources-workload created
configmap/grafana-dashboard-k8s-resources-workloads-namespace created
configmap/grafana-dashboard-kubelet created
configmap/grafana-dashboard-namespace-by-pod created
configmap/grafana-dashboard-namespace-by-workload created
configmap/grafana-dashboard-node-cluster-rsrc-use created
configmap/grafana-dashboard-node-rsrc-use created
configmap/grafana-dashboard-nodes created
configmap/grafana-dashboard-persistentvolumesusage created
configmap/grafana-dashboard-pod-total created
configmap/grafana-dashboard-prometheus-remote-write created
configmap/grafana-dashboard-prometheus created
configmap/grafana-dashboard-proxy created
configmap/grafana-dashboard-scheduler created
configmap/grafana-dashboard-statefulset created
configmap/grafana-dashboard-workload-total created
configmap/grafana-dashboards created
deployment.apps/grafana created
service/grafana created
serviceaccount/grafana created
servicemonitor.monitoring.coreos.com/grafana created
[root@master70 manifests]# kubectl get pod,svc -n monitoring|grep grafana
pod/grafana-5c55845445-9hsjj 1/1 Running 0 46s
service/grafana NodePort 10.43.173.121 3000:30030/TCP 48s
Access Grafana.
The default login is admin/admin; you are asked to change the password on first login.
Many official dashboard templates are provided out of the box.
8. Deploy the alerting components
[root@master70 manifests]# kubectl apply -f 08-alertmanage/
alertmanager.monitoring.coreos.com/main created
secret/alertmanager-main created
service/alertmanager-main created
serviceaccount/alertmanager-main created
servicemonitor.monitoring.coreos.com/alertmanager created
[root@master70 manifests]# kubectl get pod,svc -n monitoring|grep alertman
pod/alertmanager-main-0 2/2 Running 0 16s
pod/alertmanager-main-1 2/2 Running 0 16s
pod/alertmanager-main-2 2/2 Running 0 15s
service/alertmanager-main NodePort 10.43.34.192 9093:30093/TCP 16s
service/alertmanager-operated ClusterIP None 9093/TCP,9094/TCP,9094/UDP 16s
Access the Alertmanager page.
It opens on the alerts view by default; check whether any alerts are firing.
Clicking Status in the menu bar shows the configuration, which actually comes from the alertmanager-secret.yaml file created earlier (the configuration is stored base64-encoded):
9. Configure DingTalk alerting
[root@master70 add]# cat 08-dingtalk-hook.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: dingding-config
namespace: monitoring
data:
config.yml: |-
templates:
- /etc/prometheus-webhook-dingtalk/templet.tmpl
targets:
webhook2:
url: https://oapi.dingtalk.com/robot/send?access_token=8c1293fb6418399f21526087f08cf6d241192531664a0de7e3a86004652
templet.tmpl: |-
{{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }}
{{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver }}{{ end }}
{{ define "__text_alert_list" }}{{ range . }}
**Labels**
{{ range .Labels.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}
**Annotations**
{{ range .Annotations.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}
**Source:** [{{ .GeneratorURL }}]({{ .GeneratorURL }})
{{ end }}{{ end }}
{{ define "default.__text_alert_list" }}{{ range . }}
---
**告警级别:** {{ .Labels.severity | upper }}
**运营团队:** {{ .Labels.team | upper }}
**触发时间:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}
**事件信息:**
{{ range .Annotations.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}
**事件标签:**
{{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}{{ end }}
{{ end }}
{{ end }}
{{ define "default.__text_alertresovle_list" }}{{ range . }}
---
**告警级别:** {{ .Labels.severity | upper }}
**运营团队:** {{ .Labels.team | upper }}
**触发时间:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}
**结束时间:** {{ dateInZone "2006.01.02 15:04:05" (.EndsAt) "Asia/Shanghai" }}
**事件信息:**
{{ range .Annotations.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}
**事件标签:**
{{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}{{ end }}
{{ end }}
{{ end }}
{{/* Default */}}
{{ define "default.title" }}{{ template "__subject" . }}{{ end }}
{{ define "default.content" }}#### \[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
{{ if gt (len .Alerts.Firing) 0 -}}
![警报 图标](https://ss0.bdstatic.com/70cFuHSh_Q1YnxGkpoWK1HF6hhy/it/u=3626076420,1196179712&fm=15&gp=0.jpg)
**====侦测到故障====**
{{ template "default.__text_alert_list" .Alerts.Firing }}
{{- end }}
{{ if gt (len .Alerts.Resolved) 0 -}}
{{ template "default.__text_alertresovle_list" .Alerts.Resolved }}
{{- end }}
{{- end }}
{{/* Legacy */}}
{{ define "legacy.title" }}{{ template "__subject" . }}{{ end }}
{{ define "legacy.content" }}#### \[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
{{ template "__text_alert_list" .Alerts.Firing }}
{{- end }}
{{/* Following names for compatibility */}}
{{ define "ding.link.title" }}{{ template "default.title" . }}{{ end }}
{{ define "ding.link.content" }}{{ template "default.content" . }}{{ end }}
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: dingtalk-hook
namespace: monitoring
spec:
selector:
matchLabels:
app: dingtalk-hook
template:
metadata:
labels:
app: dingtalk-hook
spec:
containers:
- name: dingtalk-hook
image: timonwong/prometheus-webhook-dingtalk:v1.2.2
args:
- '--web.listen-address=0.0.0.0:8060'
- '--log.level=info'
- '--config.file=/etc/prometheus-webhook-dingtalk/config.yml'
imagePullPolicy: IfNotPresent
ports:
- name: http
containerPort: 8060
resources:
requests:
cpu: 100m
memory: 32Mi
limits:
cpu: 200m
memory: 64Mi
volumeMounts:
- mountPath: /etc/prometheus-webhook-dingtalk/
name: config-yml
- mountPath: /etc/prometheus-webhook-dingtalk/
name: template-conf
volumes:
- configMap:
defaultMode: 420
name: dingding-config
name: config-yml
- configMap:
name: dingding-config
name: template-conf
---
apiVersion: v1
kind: Service
metadata:
name: dingtalk-hook
namespace: monitoring
spec:
ports:
- port: 8060
protocol: TCP
targetPort: 8060
name: http
selector:
app: dingtalk-hook
type: ClusterIP
Applying it fails with an error:
[root@master70 add]# kubectl apply -f 08-dingtalk-hook.yaml
configmap/dingding-config created
service/dingtalk-hook created
The Deployment "dingtalk-hook" is invalid: spec.template.spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/etc/prometheus-webhook-dingtalk/": must be unique
Comment out the second mount for now:
- mountPath: /etc/prometheus-webhook-dingtalk/
name: template-conf
Once the Deployment has been created, uncomment it and apply again. (Both volumes reference the same dingding-config ConfigMap, which already holds config.yml and templet.tmpl, so a single mount would also be enough; see the sketch after the transcript below.)
[root@master70 add]# vim 08-dingtalk-hook.yaml
[root@master70 add]# kubectl apply -f 08-dingtalk-hook.yaml
configmap/dingding-config unchanged
deployment.apps/dingtalk-hook created
service/dingtalk-hook unchanged
[root@master70 add]# vim 08-dingtalk-hook.yaml
[root@master70 add]# kubectl apply -f 08-dingtalk-hook.yaml
configmap/dingding-config unchanged
deployment.apps/dingtalk-hook configured
service/dingtalk-hook unchanged
[root@master70 add]# kubectl get pod -n monitoring|grep ding
dingtalk-hook-85894948b7-dcwrg 1/1 Running 0 2m16s
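Since both volumes reference the same dingding-config ConfigMap, and that ConfigMap already contains config.yml and templet.tmpl, an alternative to the comment/uncomment sequence is to keep a single mount and drop the second volume entirely; a sketch of the relevant part of the Deployment (only the changed fields shown):
        volumeMounts:
        - mountPath: /etc/prometheus-webhook-dingtalk/
          name: config-yml          # one mount exposes both config.yml and templet.tmpl
      volumes:
      - configMap:
          name: dingding-config
        name: config-yml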
Configure Alertmanager.
The alert receivers are email and a webhook.
The Secret carries the configuration base64-encoded:
[root@master70 08-alertmanage]# cat alertmanager-secret.yaml
apiVersion: v1
kind: Secret
metadata:
name: alertmanager-main
namespace: monitoring
data:
alertmanager.yaml: Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogNW0KICBzbXRwX3NtYXJ0aG9zdDogJ3NtdHAuMTYzLmNvbToyNScKICBzbXRwX2Zyb206ICdjZG5samNAMTYzLmNvbScKICBzbXRwX2F1dGhfdXNlcm5hbWU6ICdjZG5samNAMTYzLmNvbScKICBzbXRwX2F1dGhfcGFzc3dvcmQ6ICdsamNjLmNvbScKICBzbXRwX2hlbGxvOiAnMTYzLmNvbScKICBzbXRwX3JlcXVpcmVfdGxzOiBmYWxzZQpyb3V0ZToKICBncm91cF9ieTogWydqb2InLCAnc2V2ZXJpdHknXQogIGdyb3VwX3dhaXQ6IDMwcwogIGdyb3VwX2ludGVydmFsOiAzMHMKICByZXBlYXRfaW50ZXJ2YWw6IDFoCiAgcmVjZWl2ZXI6IGRlZmF1bHQKICByb3V0ZXM6CiAgLSByZWNlaXZlcjogd2ViaG9vawogICAgbWF0Y2hfcmU6CiAgICAgIHNldmVyaXR5OiBjcml0aWNhbHx3YXJuaW5nfG5vbmV8ZXJyb3IKcmVjZWl2ZXJzOgotIG5hbWU6ICdkZWZhdWx0JwogIGVtYWlsX2NvbmZpZ3M6CiAgLSB0bzogJzgwNzAxMTUyNUBxcS5jb20nCiAgICBzZW5kX3Jlc29sdmVkOiB0cnVlCi0gbmFtZTogJ3dlYmhvb2snCiAgd2ViaG9va19jb25maWdzOgogIC0gdXJsOiAnaHR0cDovL2Rpbmd0YWxrLWhvb2s6ODA2MC9kaW5ndGFsay93ZWJob29rMi9zZW5kJwogICAgc2VuZF9yZXNvbHZlZDogdHJ1ZQp0ZW1wbGF0ZXM6Ci0gJyoudG1wbCc=
The decoded configuration file is:
global:
resolve_timeout: 5m
smtp_smarthost: 'smtp.163.com:25'
smtp_from: '[email protected]'
smtp_auth_username: '[email protected]'
smtp_auth_password: 'xxxxxx'
smtp_hello: '163.com'
smtp_require_tls: false
route:
group_by: ['job', 'severity']
group_wait: 30s
group_interval: 30s
repeat_interval: 1h
receiver: default
routes:
- receiver: webhook
match_re:
severity: critical|warning|none|error
receivers:
- name: 'default'
email_configs:
- to: '[email protected]'
send_resolved: true
- name: 'webhook'
webhook_configs:
- url: 'http://dingtalk-hook:8060/dingtalk/webhook2/send'
send_resolved: true
templates:
- '*.tmpl'
Apply this file:
[root@master70 08-alertmanage]# kubectl apply -f alertmanager-secret.yaml
secret/alertmanager-main configured
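As an aside, a Kubernetes Secret can also carry this configuration as plain text under stringData, which avoids encoding base64 by hand. A minimal sketch, with the body standing in for the decoded configuration shown above:
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yaml: |-
    # paste the plain-text configuration shown above here
    global:
      resolve_timeout: 5m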
Modify an alert rule to test the DingTalk notification.
The alerting-rules file used in this article: prometheus-k8s-rules.yaml
Link: https://pan.baidu.com/s/1O8XNLxvM6ERmhGFjD3u7eg
Extraction code: trna
## Lower the threshold and shorten the duration, then re-apply the rules file (the PrometheusRule object wrapping this rule is sketched after the snippet below)
- alert: CPU报警
annotations:
description: '{{$labels.instance}}: NodeCpu usage above 85% (current value:
{{ $value }}'
summary: '{{$labels.instance}}: High NodeCpu usage detected'
expr: 100 - (irate(node_cpu_seconds_total{mode="idle"}[5m]) * 100) > 1
for: 1s
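For reference, rules like this one live under spec.groups in a PrometheusRule object, which the Operator picks up through its labels; a sketch of the surrounding structure (the group name is illustrative):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: prometheus-k8s-rules
  namespace: monitoring
  labels:
    prometheus: k8s                 # matched by the Prometheus CR's ruleSelector
    role: alert-rules
spec:
  groups:
  - name: node-usage.rules          # illustrative group name
    rules:
    - alert: CPU报警
      expr: 100 - (irate(node_cpu_seconds_total{mode="idle"}[5m]) * 100) > 1
      for: 1s
      annotations:
        summary: '{{$labels.instance}}: High NodeCpu usage detected'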
The alert is received successfully.
To be continued.