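After installing prometheus-operator, open the Alerts page of the Prometheus UI. It lists a large number of alerting rules, some of which are already firing.
Where do these alerts come from, and how should they notify us? With a standalone Prometheus, the AlertManager instances and the alerting rules files are specified in the Prometheus configuration file. How is this handled when Prometheus is deployed through the Operator? The Config page of the Prometheus Dashboard shows the AlertManager-related configuration: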
alerting:
  alert_relabel_configs:
  - separator: ;
    regex: prometheus_replica
    replacement: $1
    action: labeldrop
  alertmanagers:
  - kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names:
        - monitoring
    scheme: http
    path_prefix: /
    timeout: 10s
    api_version: v1
    relabel_configs:
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: alertmanager-main
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_endpoint_port_name]
      separator: ;
      regex: web
      replacement: $1
      action: keep
rule_files:
- /etc/prometheus/rules/prometheus-k8s-rulefiles-0/*.yaml
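The alertmanagers entry shows that the instances are found through Kubernetes service discovery with the endpoints role. The relabel rules keep only endpoints whose Service is named alertmanager-main and whose port is named web. Let's look at the alertmanager-main Service: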
$ kubectl describe -n monitoring svc alertmanager-main
Name: alertmanager-main
Namespace: monitoring
Labels: alertmanager=main
Annotations: kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"alertmanager":"main"},"name":"alertmanager-main","namespace":"...
Selector: alertmanager=main,app=alertmanager
Type: ClusterIP
IP: 10.16.131.214
Port: web 9093/TCP
TargetPort: web/TCP
Endpoints: 10.103.74.7:9093,10.103.75.9:9093,10.103.76.7:9093
Session Affinity: ClientIP
Events:
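To make the matching concrete, here is a minimal Python sketch of how the two keep rules filter discovered endpoints. It only imitates the behaviour (Prometheus anchors relabel regexes, so a full match is used); the helper name keep, the extra mesh port, and the grafana address are made up for illustration.

```python
import re

def keep(targets, source_label, regex):
    """Imitate a `keep` relabel rule: retain targets whose label value fully matches regex."""
    pattern = re.compile(regex)
    return [t for t in targets if pattern.fullmatch(t.get(source_label, ""))]

# Hypothetical discovery results; only the first entry mirrors the Service shown above.
targets = [
    {"__meta_kubernetes_service_name": "alertmanager-main",
     "__meta_kubernetes_endpoint_port_name": "web",
     "__address__": "10.103.74.7:9093"},
    {"__meta_kubernetes_service_name": "alertmanager-main",
     "__meta_kubernetes_endpoint_port_name": "mesh",   # assumed extra port
     "__address__": "10.103.74.7:6783"},
    {"__meta_kubernetes_service_name": "grafana",
     "__meta_kubernetes_endpoint_port_name": "http",
     "__address__": "10.103.75.20:3000"},              # assumed address
]

# Apply the two rules in the same order as relabel_configs.
kept = keep(targets, "__meta_kubernetes_service_name", "alertmanager-main")
kept = keep(kept, "__meta_kubernetes_endpoint_port_name", "web")
addresses = [t["__address__"] for t in kept]
print(addresses)
```

Only endpoints that pass both rules remain as AlertManager targets.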
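The Service is indeed named alertmanager-main and its port is named web, so it satisfies both rules and Prometheus is correctly wired to the AlertManager component. The alerting rules are loaded from every YAML file under /etc/prometheus/rules/prometheus-k8s-rulefiles-0/. We can enter the Prometheus Pod to check that the directory contains YAML files: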
$ kubectl exec -it prometheus-k8s-0 /bin/sh -n monitoring
Defaulting container name to prometheus.
Use 'kubectl describe pod/prometheus-k8s-0 -n monitoring' to see all of the containers in this pod.
/prometheus $ ls /etc/prometheus/rules/prometheus-k8s-rulefiles-0/
monitoring-prometheus-k8s-rules.yaml
/prometheus $ cat /etc/prometheus/rules/prometheus-k8s-rulefiles-0/monitoring-prometheus-k8s-rules.yaml
groups:
- name: k8s.rules
  rules:
  - expr: |
      sum(rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container_name!=""}[5m])) by (namespace)
    record: namespace:container_cpu_usage_seconds_total:sum_rate
......
This YAML file is generated from the PrometheusRule object we created earlier:
$ cat prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: prometheus-k8s-rules
  namespace: monitoring
spec:
  groups:
  - name: k8s.rules
    rules:
    - expr: |
        sum(rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container_name!=""}[5m])) by (namespace)
      record: namespace:container_cpu_usage_seconds_total:sum_rate
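The PrometheusRule here is named prometheus-k8s-rules and lives in the monitoring namespace. This suggests that when a PrometheusRule object is created, a matching <namespace>-<name>.yaml file is generated automatically in the prometheus-k8s-rulefiles-0 directory. So to add a custom alert in the future, we only need to define a PrometheusRule object. How does Prometheus pick up that PrometheusRule object? The answer lies in the prometheus resource object we created: it has an important ruleSelector attribute, a filter that selects rules. Here it requires PrometheusRule objects to carry the labels prometheus=k8s and role=alert-rules: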
ruleSelector:
  matchLabels:
    prometheus: k8s
    role: alert-rules
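The selection logic can be sketched in a few lines of Python. This is only an illustration of matchLabels semantics (every listed label must be present with the same value) and of the <namespace>-<name>.yaml naming observed above, not the Operator's actual code; the helper names and the my-rules object are hypothetical.

```python
def matches(selector, labels):
    """matchLabels semantics: every selector key must exist with an equal value."""
    return all(labels.get(key) == value for key, value in selector.items())

def rule_file_name(namespace, name):
    """File name pattern observed in the rulefiles directory."""
    return f"{namespace}-{name}.yaml"

rule_selector = {"prometheus": "k8s", "role": "alert-rules"}

prometheus_rules = [
    {"namespace": "monitoring", "name": "prometheus-k8s-rules",
     "labels": {"prometheus": "k8s", "role": "alert-rules"}},
    # Hypothetical object that lacks the prometheus=k8s label, so it is ignored.
    {"namespace": "monitoring", "name": "my-rules",
     "labels": {"role": "alert-rules"}},
]

selected_files = [
    rule_file_name(rule["namespace"], rule["name"])
    for rule in prometheus_rules
    if matches(rule_selector, rule["labels"])
]
print(selected_files)
```

Extra labels on a PrometheusRule do not prevent a match; a missing or different required label does.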
So to define a custom alerting rule, we only need to create a PrometheusRule object with the labels prometheus=k8s and role=alert-rules. For example, let's add a rule that alerts when disk usage on a cluster node exceeds 85%:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: disk-free-rules
  namespace: monitoring
spec:
  groups:
  - name: disk
    rules:
    - alert: diskFree
      annotations:
        summary: "Disk usage on instance {{ $labels.instance }} of job {{ $labels.job }} is above 85%"
        description: "Disk usage on {{ $labels.instance }} {{ $labels.mountpoint }} is above 85% (current value: {{ $value }}%), please handle it promptly"
      expr: |
        (1-(node_filesystem_free_bytes{fstype=~"ext4|xfs",mountpoint!="/boot"} / node_filesystem_size_bytes{fstype=~"ext4|xfs",mountpoint!="/boot"}) )*100 > 85
      for: 3m
      labels:
        level: disaster
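The arithmetic in the expression is simple: used percentage = (1 - free / size) * 100, compared against the `> 85` threshold. The Python sketch below reproduces that calculation on made-up sample values; the mountpoints and byte counts are assumptions, and the `for: 3m` hold period is not simulated.

```python
THRESHOLD_PERCENT = 85  # matches the "> 85" comparison in the rule expression

def disk_usage_percent(free_bytes, size_bytes):
    """Same arithmetic as the PromQL expression: (1 - free / size) * 100."""
    return (1 - free_bytes / size_bytes) * 100

GIB = 2 ** 30
# Hypothetical samples: mountpoint -> (free bytes, total bytes)
samples = {
    "/": (10 * GIB, 100 * GIB),      # 90% used
    "/data": (30 * GIB, 100 * GIB),  # 70% used
}

over_threshold = {
    mountpoint: round(disk_usage_percent(free, size), 1)
    for mountpoint, (free, size) in samples.items()
    if disk_usage_percent(free, size) > THRESHOLD_PERCENT
}
print(over_threshold)
```

In Prometheus the alert would only move from pending to firing once the condition has held for the full 3 minutes.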
Note that the labels must include at least prometheus=k8s and role=alert-rules. After creating the object, wait a moment and then check the rules directory in the container again:
/etc/prometheus/rules/prometheus-k8s-rulefiles-0 $ ls
monitoring-disk-free-rules.yaml monitoring-prometheus-k8s-rules.yaml
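The rule file we created has been injected into the rulefiles directory, which confirms the assumption above. The new alerting rule now also shows up on the Alerts page of the Prometheus Dashboard.
Configuring alerts
We now know how to add an alerting rule, but how are the alerts actually sent? With a standalone AlertManager, receivers are configured in its configuration file. Here the component is created through the alertmanager resource object provided by the Operator, so how do we change its configuration?
First, create an Ingress for the alertmanager-main Service. Once it is applied, the AlertManager configuration can be viewed under the status path of its web UI: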
$ cat ingress.yml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: kube-prometheus
  namespace: monitoring
spec:
  rules:
  - host: prometheus.zsf.com
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus-k8s
          servicePort: 9090
  - host: grafana.zsf.com
    http:
      paths:
      - path: /
        backend:
          serviceName: grafana
          servicePort: 3000
  - host: alertmanager.zsf.com
    http:
      paths:
      - path: /
        backend:
          serviceName: alertmanager-main
          servicePort: 9093
This configuration actually comes from alertmanager/alertmanager-secret.yaml:
# cat alertmanager/alertmanager-secret.yaml
apiVersion: v1
data:
  alertmanager.yaml: Imdsb2JhbCI6CiAgInJlc29sdmVfdGltZW91dCI6ICI1bSIKInJlY2VpdmVycyI6Ci0gIm5hbWUiOiAibnVsbCIKInJvdXRlIjoKICAiZ3JvdXBfYnkiOgogIC0gImpvYiIKICAiZ3JvdXBfaW50ZXJ2YWwiOiAiNW0iCiAgImdyb3VwX3dhaXQiOiAiMzBzIgogICJyZWNlaXZlciI6ICJudWxsIgogICJyZXBlYXRfaW50ZXJ2YWwiOiAiMTJoIgogICJyb3V0ZXMiOgogIC0gIm1hdGNoIjoKICAgICAgImFsZXJ0bmFtZSI6ICJXYXRjaGRvZyIKICAgICJyZWNlaXZlciI6ICJudWxsIg==
kind: Secret
metadata:
  name: alertmanager-main
  namespace: monitoring
type: Opaque
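Values under data in a Secret are base64-encoded, not encrypted, so they can be decoded anywhere. As a small sketch, the Python below round-trips a shortened configuration and decodes the first twelve characters of the value above; the variable names are illustrative only.

```python
import base64

# A shortened stand-in for an Alertmanager configuration (not the full file).
config_text = "global:\n  resolve_timeout: 5m\n"

# What ends up in the Secret's data field.
encoded = base64.b64encode(config_text.encode("utf-8")).decode("ascii")

# Decoding gives back the original text unchanged.
decoded = base64.b64decode(encoded).decode("utf-8")

# The first 12 base64 characters of the Secret above decode to 9 bytes.
prefix = base64.b64decode("Imdsb2JhbCI6").decode("utf-8")

print(encoded)
print(prefix)
```

The decoded prefix is the opening key of the stored configuration, which is a quick way to confirm that a value really is the expected file.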
Base64-decode the alertmanager.yaml value:
$ echo 'Imdsb2JhbCI6CiAgInJlc29sdmVfdGltZW91dCI6ICI1bSIKInJlY2VpdmVycyI6Ci0gIm5hbWUiOiAibnVsbCIKInJvdXRlIjoKICAiZ3JvdXBfYnkiOgogIC0gImpvYiIKICAiZ3JvdXBfaW50ZXJ2YWwiOiAiNW0iCiAgImdyb3VwX3dhaXQiOiAiMzBzIgogICJyZWNlaXZlciI6ICJudWxsIgogICJyZXBlYXRfaW50ZXJ2YWwiOiAiMTJoIgogICJyb3V0ZXMiOgogIC0gIm1hdGNoIjoKICAgICAgImFsZXJ0bmFtZSI6ICJXYXRjaGRvZyIKICAgICJyZWNlaXZlciI6ICJudWxsIg==' | base64 -d
"global":
  "resolve_timeout": "5m"
"receivers":
- "name": "null"
"route":
  "group_by":
  - "job"
  "group_interval": "5m"
  "group_wait": "30s"
  "receiver": "null"
  "repeat_interval": "12h"
  "routes":
  - "match":
      "alertname": "Watchdog"
    "receiver": "null"
The content matches the configuration shown on the status page. So to add our own receivers or message templates, we change this file:
# cat alertmanager.yaml
global:
  resolve_timeout: 5m
receivers:
- name: dingtalk-webhook
  webhook_configs:
  - send_resolved: true
    url: http://dingtalk-webhook:8060/dingtalk/guiji/send
route:
  group_by:
  - job
  group_interval: 5m
  group_wait: 30s
  receiver: dingtalk-webhook
  repeat_interval: 12h
  routes:
  - receiver: dingtalk-webhook
    group_wait: 10s
Save this configuration as alertmanager.yaml, then create a Secret object from it:
# First delete the existing Secret object
$ kubectl delete secret alertmanager-main -n monitoring
secret "alertmanager-main" deleted
$ kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml -n monitoring
secret "alertmanager-main" created
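To see what the route section does with an incoming alert, here is a simplified Python sketch of receiver selection: child routes are tried in order, the first one whose match labels all agree wins, and otherwise the parent's receiver is used. It deliberately ignores continue, match_re and grouping, so treat it as an approximation of AlertManager's routing tree rather than its implementation.

```python
def pick_receiver(route, alert_labels):
    """Return the receiver name for an alert, walking child routes depth-first."""
    for child in route.get("routes", []):
        wanted = child.get("match", {})
        if all(alert_labels.get(k) == v for k, v in wanted.items()):
            # A child without its own receiver inherits the parent's.
            inherited = {"receiver": route["receiver"], **child}
            return pick_receiver(inherited, alert_labels)
    return route["receiver"]

# Route from the default Secret decoded above.
default_route = {
    "receiver": "null",
    "routes": [{"match": {"alertname": "Watchdog"}, "receiver": "null"}],
}

# Route from the new alertmanager.yaml; the child has no match, so it accepts everything.
dingtalk_route = {
    "receiver": "dingtalk-webhook",
    "routes": [{"receiver": "dingtalk-webhook", "group_wait": "10s"}],
}

print(pick_receiver(default_route, {"alertname": "Watchdog"}))
print(pick_receiver(dingtalk_route, {"alertname": "diskFree"}))
```

With the default Secret every alert ends up at the null receiver, which is why nothing was delivered before the configuration was replaced.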
Configuring DingTalk alerts for prometheus-operator
Create the configuration file for the webhook:
# vim dingTalk-webhook-configmap.yml
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: monitoring
  name: dingtalk-webhook-config
data:
  config.yml: |
    # Request timeout
    timeout: 5s
    ## Customizable templates path
    templates:
      - /etc/prometheus-webhook-dingtalk/templates/*.tmpl
    ## You can also override default template using `default_message`
    ## The following example to use the 'legacy' template from v0.3.0
    # default_message:
    #   title: '{{ template "legacy.title" . }}'
    #   text: '{{ template "legacy.content" . }}'
    ## Targets, previously was known as "profiles"
    targets:
      guiji:
        url: https://oapi.dingtalk.com/robot/send?access_token=5752a9d10727165d116b883b4e7d312b781a3ed90fefa5d1a8f4d61f06343a27
        message:
          title: '{{ template "ding.link.title" . }}'
          text: '{{ template "ding.link.content" . }}'
        mention:
          all: true
          mobiles: ['18001587880']
Create the alert template configuration file:
# vim dingTalk-webhook-template.yml
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: monitoring
  name: dingtalk-webhook-template
data:
  template.tmpl: |
    {{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }}
    {{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver }}{{ end }}
    {{ define "__text_alert_list" }}{{ range . }}
    **Labels**
    {{ range .Labels.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
    {{ end }}
    **Annotations**
    {{ range .Annotations.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
    {{ end }}
    **Source:** [{{ .GeneratorURL }}]({{ .GeneratorURL }})
    {{ end }}{{ end }}
    {{ define "default.__text_alert_list" }}{{ range . }}
    ---
    **Severity:** {{ .Labels.severity | upper }}
    **Team:** {{ .Labels.team | upper }}
    **Started at:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}
    **Details:**
    {{ range .Annotations.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
    {{ end }}
    **Labels:**
    {{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }}> - {{ .Name }}: {{ .Value | markdown | html }}
    {{ end }}{{ end }}
    {{ end }}
    {{ end }}
    {{ define "default.__text_alertresovle_list" }}{{ range . }}
    ---
    **Severity:** {{ .Labels.severity | upper }}
    **Team:** {{ .Labels.team | upper }}
    **Started at:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}
    **Ended at:** {{ dateInZone "2006.01.02 15:04:05" (.EndsAt) "Asia/Shanghai" }}
    **Details:**
    {{ range .Annotations.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
    {{ end }}
    **Labels:**
    {{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }}> - {{ .Name }}: {{ .Value | markdown | html }}
    {{ end }}{{ end }}
    {{ end }}
    {{ end }}
    {{/* Default */}}
    {{ define "default.title" }}{{ template "__subject" . }}{{ end }}
    {{ define "default.content" }}#### \[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
    {{ if gt (len .Alerts.Firing) 0 -}}
    ![alert icon](https://ss0.bdstatic.com/70cFuHSh_Q1YnxGkpoWK1HF6hhy/it/u=3626076420,1196179712&fm=15&gp=0.jpg)
    **====Fault detected====**
    {{ template "default.__text_alert_list" .Alerts.Firing }}
    {{- end }}
    {{ if gt (len .Alerts.Resolved) 0 -}}
    {{ template "default.__text_alertresovle_list" .Alerts.Resolved }}
    {{- end }}
    {{- end }}
    {{/* Legacy */}}
    {{ define "legacy.title" }}{{ template "__subject" . }}{{ end }}
    {{ define "legacy.content" }}#### \[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
    {{ template "__text_alert_list" .Alerts.Firing }}
    {{- end }}
    {{/* Following names for compatibility */}}
    {{ define "ding.link.title" }}{{ template "default.title" . }}{{ end }}
    {{ define "ding.link.content" }}{{ template "default.content" . }}{{ end }}
Create the resource manifest for the webhook:
# cat dingTalk-webhook-deployment.yml
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: monitoring
  name: dingtalk-webhook
  labels:
    app: dingtalk-webhook
spec:
  selector:
    matchLabels:
      app: dingtalk-webhook
  replicas: 1
  template:
    metadata:
      labels:
        app: dingtalk-webhook
    spec:
      containers:
      - name: dingtalk-webhook
        image: harbor.zsf.com/public/prometheus-webhook-dingtalk
        args:
        - --config.file=/etc/prometheus-webhook-dingtalk/config.yml
        #- --ding.profile=guiji=https://oapi.dingtalk.com/robot/send?access_token=5752a9d10727165d116b883b4e7d312b781a3ed90fefa5d1a8f4d61f06343a27
        ports:
        - containerPort: 8060
          protocol: TCP
        volumeMounts:
        # With subPath, mountPath must be the full path of the target file
        - mountPath: "/etc/prometheus-webhook-dingtalk/config.yml"
          name: dingtalk-webhook-config
          subPath: config.yml
        - mountPath: "/etc/prometheus-webhook-dingtalk/templates/template.tmpl"
          name: dingtalk-webhook-template
          subPath: template.tmpl
      volumes:
      - name: dingtalk-webhook-config
        configMap:
          name: dingtalk-webhook-config
      - name: dingtalk-webhook-template
        configMap:
          name: dingtalk-webhook-template
---
apiVersion: v1
kind: Service
metadata:
  namespace: monitoring
  name: dingtalk-webhook
  labels:
    app: dingtalk-webhook
spec:
  selector:
    app: dingtalk-webhook
  ports:
  - name: http
    port: 8060
    targetPort: 8060
    protocol: TCP
After a short while, the alert notifications should start arriving.
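Rather than waiting for a real alert, the webhook can be exercised by hand with a payload shaped like the one AlertManager sends to webhook receivers. The Python sketch below builds such a payload; the label values, the externalURL and generatorURL hosts, and the helper names are assumptions for illustration, and send is defined but not called because the URL only resolves inside the cluster.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Receiver URL from alertmanager.yaml; reachable only from inside the cluster.
WEBHOOK_URL = "http://dingtalk-webhook:8060/dingtalk/guiji/send"

def build_test_payload(alertname, instance, severity="warning"):
    """Build a minimal firing notification in AlertManager's webhook JSON layout."""
    labels = {"alertname": alertname, "instance": instance, "severity": severity}
    annotations = {"summary": f"test alert for {instance}"}
    return {
        "version": "4",
        "groupKey": '{}:{alertname="' + alertname + '"}',
        "status": "firing",
        "receiver": "dingtalk-webhook",
        "groupLabels": {"alertname": alertname},
        "commonLabels": labels,
        "commonAnnotations": annotations,
        "externalURL": "http://alertmanager.zsf.com",        # assumed
        "alerts": [{
            "status": "firing",
            "labels": labels,
            "annotations": annotations,
            "startsAt": datetime.now(timezone.utc).isoformat(),
            "endsAt": "0001-01-01T00:00:00Z",
            "generatorURL": "http://prometheus.zsf.com/graph",  # assumed
        }],
    }

def send(payload, url=WEBHOOK_URL):
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status

payload = build_test_payload("diskFree", "node-1:9100")
print(json.dumps(payload, indent=2))
```

Calling send(payload) from a Pod in the monitoring namespace should produce a DingTalk message if the webhook, its configuration and the template are all in place.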