old_GGB

Prometheus告警处理

Prometheus告警简介

告警能力在Prometheus的架构中被划分成两个独立的部分。如下所示，通过在Prometheus中定义AlertRule（告警规则），Prometheus会周期性的对告警规则进行计算，如果满足告警触发条件就会向Alertmanager发送告警信息。

Prometheus告警处理

在Prometheus中一条告警规则主要由以下几部分组成：

告警名称：用户需要为告警规则命名，当然对于命名而言，需要能够直接表达出该告警的主要内容
告警规则：告警规则实际上主要由PromQL进行定义，其实际意义是当表达式（PromQL）查询结果持续多长时间（During）后出发告警

在Prometheus中，还可以通过Group（告警组）对一组相关的告警进行统一定义。当然这些定义都是通过YAML文件来统一管理的。

Alertmanager作为一个独立的组件，负责接收并处理来自Prometheus Server(也可以是其它的客户端程序)的告警信息。Alertmanager可以对这些告警信息进行进一步的处理，比如当接收到大量重复告警时能够消除重复的告警信息，同时对告警信息进行分组并且路由到正确的通知方，Prometheus内置了对邮件，Slack等多种通知方式的支持，同时还支持与Webhook的集成，以支持更多定制化的场景。例如，目前Alertmanager还不支持钉钉，那用户完全可以通过Webhook与钉钉机器人进行集成，从而通过钉钉接收告警信息。同时AlertManager还提供了静默和告警抑制机制来对告警通知行为进行优化。

Alertmanager特性

Alertmanager除了提供基本的告警通知能力以外，还主要提供了如：分组、抑制以及静默等告警特性：

Alertmanager特性

分组

分组机制可以将详细的告警信息合并成一个通知。在某些情况下，比如由于系统宕机导致大量的告警被同时触发，在这种情况下分组机制可以将这些被触发的告警合并为一个告警通知，避免一次性接受大量的告警通知，而无法对问题进行快速定位。

例如，当集群中有数百个正在运行的服务实例，并且为每一个实例设置了告警规则。假如此时发生了网络故障，可能导致大量的服务实例无法连接到数据库，结果就会有数百个告警被发送到Alertmanager。

而作为用户，可能只希望能够在一个通知中中就能查看哪些服务实例收到影响。这时可以按照服务所在集群或者告警名称对告警进行分组，而将这些告警内聚在一起成为一个通知。

告警分组，告警时间，以及告警的接受方式可以通过Alertmanager的配置文件进行配置。

抑制

抑制是指当某一告警发出后，可以停止重复发送由此告警引发的其它告警的机制。

例如，当集群不可访问时触发了一次告警，通过配置Alertmanager可以忽略与该集群有关的其它所有告警。这样可以避免接收到大量与实际问题无关的告警通知。

抑制机制同样通过Alertmanager的配置文件进行设置。

静默

静默提供了一个简单的机制可以快速根据标签对告警进行静默处理。如果接收到的告警符合静默的配置，Alertmanager则不会发送告警通知。

静默设置需要在Alertmanager的Werb页面上进行设置。

自定义Prometheus告警规则

Prometheus中的告警规则允许你基于PromQL表达式定义告警触发条件，Prometheus后端对这些触发规则进行周期性计算，当满足触发条件后则会触发告警通知。默认情况下，用户可以通过Prometheus的Web界面查看这些告警规则以及告警的触发状态。当Promthues与Alertmanager关联之后，可以将告警发送到外部服务如Alertmanager中并通过Alertmanager可以对这些告警进行进一步的处理。

定义告警规则

一条典型的告警规则如下所示：

groups:
- name: example
  rules:
  - alert: HighErrorRate
    expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
    for: 10m
    labels:
      severity: page
    annotations:
      summary: High request latency
      description: description info

在告警规则文件中，我们可以将一组相关的规则设置定义在一个group下。在每一个group中我们可以定义多个告警规则(rule)。一条告警规则主要由以下几部分组成：

alert：告警规则的名称。
expr：基于PromQL表达式告警触发条件，用于计算是否有时间序列满足该条件。
for：评估等待时间，可选参数。用于表示只有当触发条件持续一段时间后才发送告警。在等待期间新产生告警的状态为pending。
labels：自定义标签，允许用户指定要附加到告警上的一组附加标签。
annotations：用于指定一组附加信息，比如用于描述告警详细信息的文字等，annotations的内容在告警产生时会一同作为参数发送到Alertmanager。

为了能够让Prometheus能够启用定义的告警规则，我们需要在Prometheus全局配置文件中通过rule_files指定一组告警规则文件的访问路径，Prometheus启动后会自动扫描这些路径下规则文件中定义的内容，并且根据这些规则计算是否向外部发送通知：

rule_files:
  [ -  ... ]

默认情况下Prometheus会每分钟对这些告警规则进行计算，如果用户想定义自己的告警计算周期，则可以通过evaluation_interval来覆盖默认的计算周期：

global:
  [ evaluation_interval:  | default = 1m ]

模板化

一般来说，在告警规则文件的annotations中使用summary描述告警的概要信息，description用于描述告警的详细信息。同时Alertmanager的UI也会根据这两个标签值，显示告警信息。为了让告警信息具有更好的可读性，Prometheus支持模板化label和annotations的中标签的值。

通过$labels.变量可以访问当前告警实例中指定标签的值。$value则可以获取当前PromQL表达式计算的样本值。

# To insert a firing element's label values:
{{ $labels. }}
# To insert the numeric expression value of the firing element:
{{ $value }}

例如，可以通过模板化优化summary以及description的内容的可读性：

groups:
- name: example
  rules:

  # Alert for any instance that is unreachable for >5 minutes.
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

  # Alert for any instance that has a median request latency >1s.
  - alert: APIHighRequestLatency
    expr: api_http_request_latencies_second{quantile="0.5"} > 1
    for: 10m
    annotations:
      summary: "High request latency on {{ $labels.instance }}"
      description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"

查看告警状态

如下所示，用户可以通过Prometheus WEB界面中的Alerts菜单查看当前Prometheus下的所有告警规则，以及其当前所处的活动状态。

告警活动状态

同时对于已经pending或者firing的告警，Prometheus也会将它们存储到时间序列ALERTS{}中。

可以通过表达式，查询告警实例：

ALERTS{alertname="", alertstate="pending|firing", }

样本值为1表示当前告警处于活动状态（pending或者firing），当告警从活动状态转换为非活动状态时，样本值则为0。

实例：定义主机监控告警

修改Prometheus配置文件prometheus.yml,添加以下配置：

rule_files:
  - /etc/prometheus/rules/*.rules

在目录/etc/prometheus/rules/下创建告警文件hoststats-alert.rules内容如下：

groups:
- name: hostStatsAlert
  rules:
  - alert: hostCpuUsageAlert
    expr: sum(avg without (cpu)(irate(node_cpu{mode!='idle'}[5m]))) by (instance) > 0.85
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} CPU usgae high"
      description: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }})"
  - alert: hostMemUsageAlert
    expr: (node_memory_MemTotal - node_memory_MemAvailable)/node_memory_MemTotal > 0.85
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} MEM usgae high"
      description: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }})"

重启Prometheus后访问Prometheus UIhttp://127.0.0.1:9090/rules可以查看当前以加载的规则文件。

告警规则

切换到Alerts标签http://127.0.0.1:9090/alerts可以查看当前告警的活动状态。

告警活动状态

此时，我们可以手动拉高系统的CPU使用率，验证Prometheus的告警流程，在主机上运行以下命令：

cat /dev/zero>/dev/null

运行命令后查看CPU使用率情况，如下图所示：

Prometheus首次检测到满足触发条件后，hostCpuUsageAlert显示由一条告警处于活动状态。由于告警规则中设置了1m的等待时间，当前告警状态为PENDING，如下图所示：

如果1分钟后告警条件持续满足，则会实际触发告警并且告警状态为FIRING，如下图所示：

部署Alertmanager

Alertmanager和Prometheus Server一样均采用Golang实现，并且没有第三方依赖。一般来说我们可以通过以下几种方式来部署Alertmanager：二进制包、容器以及源码方式安装。

使用二进制包部署AlertManager

获取并安装软件包

Alertmanager最新版本的下载地址可以从Prometheus官方网站Download | Prometheus获取。

export VERSION=0.15.2
curl -LO https://github.com/prometheus/alertmanager/releases/download/v$VERSION/alertmanager-$VERSION.darwin-amd64.tar.gz
tar xvf alertmanager-$VERSION.darwin-amd64.tar.gz

创建alertmanager配置文件

Alertmanager解压后会包含一个默认的alertmanager.yml配置文件，内容如下所示：

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

Alertmanager的配置主要包含两个部分：路由(route)以及接收器(receivers)。所有的告警信息都会从配置中的顶级路由(route)进入路由树，根据路由规则将告警信息发送给相应的接收器。

在Alertmanager中可以定义一组接收器，比如可以按照角色(比如系统运维，数据库管理员)来划分多个接收器。接收器可以关联邮件，Slack以及其它方式接收告警信息。

当前配置文件中定义了一个默认的接收者default-receiver由于这里没有设置接收方式，目前只相当于一个占位符。关于接收器的详细介绍会在后续章节介绍。

在配置文件中使用route定义了顶级的路由，路由是一个基于标签匹配规则的树状结构。所有的告警信息从顶级路由开始，根据标签匹配规则进入到不同的子路由，并且根据子路由设置的接收器发送告警。目前配置文件中只设置了一个顶级路由route并且定义的接收器为default-receiver。因此，所有的告警都会发送给default-receiver。关于路由的详细内容会在后续进行详细介绍。

启动Alertmanager

Alermanager会将数据保存到本地中，默认的存储路径为data/。因此，在启动Alertmanager之前需要创建相应的目录：

./alertmanager

用户也在启动Alertmanager时使用参数修改相关配置。--config.file用于指定alertmanager配置文件路径，--storage.path用于指定数据存储路径。

查看运行状态

Alertmanager启动后可以通过9093端口访问，http://192.168.33.10:9093

Alertmanager页面

Alert菜单下可以查看Alertmanager接收到的告警内容。Silences菜单下则可以通过UI创建静默规则，这部分我们会在后续部分介绍。进入Status菜单，可以看到当前系统的运行状态以及配置信息。

关联Prometheus与Alertmanager

在Prometheus的架构中被划分成两个独立的部分。Prometheus负责产生告警，而Alertmanager负责告警产生后的后续处理。因此Alertmanager部署完成后，需要在Prometheus中设置Alertmanager相关的信息。

编辑Prometheus配置文件prometheus.yml,并添加以下内容

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

重启Prometheus服务，成功后，可以从http://192.168.33.10:9090/config查看alerting配置是否生效。

此时，再次尝试手动拉高系统CPU使用率：

cat /dev/zero>/dev/null

等待Prometheus告警进行触发状态：

查看Alertmanager UI此时可以看到Alertmanager接收到的告警信息。

Alertmanager配置概述

在上面的部分中已经简单介绍过，在Alertmanager中通过路由(Route)来定义告警的处理方式。路由是一个基于标签匹配的树状匹配结构。根据接收到告警的标签匹配相应的处理方式。这里将详细介绍路由相关的内容。

Alertmanager主要负责对Prometheus产生的告警进行统一处理，因此在Alertmanager配置中一般会包含以下几个主要部分：

全局配置（global）：用于定义一些全局的公共参数，如全局的SMTP配置，Slack配置等内容；
模板（templates）：用于定义告警通知时的模板，如HTML模板，邮件模板等；
告警路由（route）：根据标签匹配，确定当前告警应该如何处理；
接收人（receivers）：接收人是一个抽象的概念，它可以是一个邮箱也可以是微信，Slack或者Webhook等，接收人一般配合告警路由使用；
抑制规则（inhibit_rules）：合理设置抑制规则可以减少垃圾告警的产生

其完整配置格式如下：

global:
  [ resolve_timeout:  | default = 5m ]
  [ smtp_from:  ] 
  [ smtp_smarthost:  ] 
  [ smtp_hello:  | default = "localhost" ]
  [ smtp_auth_username:  ]
  [ smtp_auth_password:  ]
  [ smtp_auth_identity:  ]
  [ smtp_auth_secret:  ]
  [ smtp_require_tls:  | default = true ]
  [ slack_api_url:  ]
  [ victorops_api_key:  ]
  [ victorops_api_url:  | default = "https://alert.victorops.com/integrations/generic/20131114/alert/" ]
  [ pagerduty_url:  | default = "https://events.pagerduty.com/v2/enqueue" ]
  [ opsgenie_api_key:  ]
  [ opsgenie_api_url:  | default = "https://api.opsgenie.com/" ]
  [ hipchat_api_url:  | default = "https://api.hipchat.com/" ]
  [ hipchat_auth_token:  ]
  [ wechat_api_url:  | default = "https://qyapi.weixin.qq.com/cgi-bin/" ]
  [ wechat_api_secret:  ]
  [ wechat_api_corp_id:  ]
  [ http_config:  ]

templates:
  [ -  ... ]

route: 

receivers:
  -  ...

inhibit_rules:
  [ -  ... ]

在全局配置中需要注意的是resolve_timeout，该参数定义了当Alertmanager持续多长时间未接收到告警后标记告警状态为resolved（已解决）。该参数的定义可能会影响到告警恢复通知的接收时间，读者可根据自己的实际场景进行定义，其默认值为5分钟。在接下来的部分，我们将已一些实际的例子解释Alertmanager的其它配置内容。

内置告警接收器Receiver

前上一小节已经讲过，在Alertmanager中路由负责对告警信息进行分组匹配，并将像告警接收器发送通知。告警接收器可以通过以下形式进行配置：

receivers:
  -  ...

每一个receiver具有一个全局唯一的名称，并且对应一个或者多个通知方式：

name: 
email_configs:
  [ - , ... ]
hipchat_configs:
  [ - , ... ]
pagerduty_configs:
  [ - , ... ]
pushover_configs:
  [ - , ... ]
slack_configs:
  [ - , ... ]
opsgenie_configs:
  [ - , ... ]
webhook_configs:
  [ - , ... ]
victorops_configs:
  [ - , ... ]

目前官方内置的第三方通知集成包括：邮件、即时通讯软件（如Slack、Hipchat）、移动应用消息推送(如Pushover)和自动化运维工具（例如：Pagerduty、Opsgenie、Victorops）。Alertmanager的通知方式中还可以支持Webhook，通过这种方式开发者可以实现更多个性化的扩展支持。

与SMTP邮件集成

邮箱应该是目前企业最常用的告警通知方式，Alertmanager内置了对SMTP协议的支持，因此对于企业用户而言，只需要一些基本的配置即可实现通过邮件的通知。

在Alertmanager使用邮箱通知，用户只需要定义好SMTP相关的配置，并且在receiver中定义接收方的邮件地址即可。在Alertmanager中我们可以直接在配置文件的global中定义全局的SMTP配置：

global:
  [ smtp_from:  ]
  [ smtp_smarthost:  ]
  [ smtp_hello:  | default = "localhost" ]
  [ smtp_auth_username:  ]
  [ smtp_auth_password:  ]
  [ smtp_auth_identity:  ]
  [ smtp_auth_secret:  ]
  [ smtp_require_tls:  | default = true ]

完成全局SMTP之后，我们只需要为receiver配置email_configs用于定义一组接收告警的邮箱地址即可，如下所示：

name: 
email_configs:
  [ - , ... ]

每个email_config中定义相应的接收人邮箱地址，邮件通知模板等信息即可，当然如果当前接收人需要单独的SMTP配置，那直接在email_config中覆盖即可：

[ send_resolved:  | default = false ]
to: 
[ html:  | default = '{{ template "email.default.html" . }}' ]
[ headers: { : , ... } ]

如果当前收件人需要接受告警恢复的通知的话，在email_config中定义send_resolved为true即可。

如果所有的邮件配置使用了相同的SMTP配置，则可以直接定义全局的SMTP配置。

这里，以Gmail邮箱为例，我们定义了一个全局的SMTP配置，并且通过route将所有告警信息发送到default-receiver中:

global:
  smtp_smarthost: smtp.gmail.com:587
  smtp_from: 
  smtp_auth_username: 
  smtp_auth_identity: 
  smtp_auth_password: 

route:
  group_by: ['alertname']
  receiver: 'default-receiver'

receivers:
  - name: default-receiver
    email_configs:
      - to: 
        send_resolved: true

需要注意的是新的Google账号安全规则需要使用”应用专有密码“作为邮箱登录密码

这时如果手动拉高主机CPU使用率，使得监控样本数据满足告警触发条件。在SMTP配置正确的情况下，可以接收到如下的告警内容：

与Slack集成

Slack是非常流行的团队沟通应用，提供群组聊天和直接消息发送功能，支持移动端，Web 和桌面平台。在国外有大量的IT团队使用Slack作为团队协作平台。同时其提供了强大的集成能力，在Slack的基础上也衍生出了大量的ChatOps相关的技术实践。这部分将介绍如何将Slack集成到Alertmanager中。

认识Slack

Slack

Slack作为一款即时通讯工具，协作沟通主要通过Channel（平台）来完成，用户可以在企业中根据用途添加多个Channel，并且通过Channel来集成各种第三方工具。

例如，我们可以为监控建立一个单独的Channel用于接收各种监控信息：

创建Channel

通过一个独立的Channle可以减少信息对用户工作的干扰，并且将相关信息聚合在一起：

Monitoring

Slack的强大之处在于在Channel中添加各种第三方服务的集成，用户也可以基于Slack开发自己的聊天机器人来实现一些更高级的能力，例如自动化运维，提高开发效率等。

添加应用：Incomming Webhooks

为了能够在Monitoring中接收来自Alertmanager的消息，我们需要在Channel的设置选项中使用"Add an App"为Monitoring channel添加一个名为Incoming WebHooks的应用：

添加Incomming Webhooks

添加成功后Slack会显示Incoming WebHooks配置和使用方式：

Incomming Webhhook配置

Incomming Webhook的工作方式很简单，Slack为当前Channel创建了一个用于接收消息的API地址：

https://hooks.slack.com/services/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

用户只需要使用Post方式向Channel发送需要通知的消息即可，例如，我们可以在命令行中通过curl模拟一次消息通知：

curl -d "payload={'text': 'This is a line of text in a channel.\nAnd this is another line of text.'}" https://hooks.slack.com/services/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

在网络正常的情况下，在Channel中会显示新的通知信息，如下所示：

测试消息

除了发送纯文本以外，slack还支持在文本内容中添加链接，例如：

payload={"text": "A very important thing has occurred!  for details!"}

此时接收到的消息中建辉包含一个可点击的超链接地址。除了payload以外，Incomming Webhhook还支持一些其他的参数：

参数	作用	示例
username	设置当前聊天机器人的名称	webhookbot
icon_url	当前聊天机器人的头像地址	https://slack.com/img/icons/app-57.png
icon_emoji	使用emoji作为聊天机器人的头像	:ghost:
channel	消息发送的目标channel, 需要直接发给特定用户时使用@username即可	#monitoring 或者 @username

例如，使用以上参数发送一条更有趣的消息：

curl -X POST --data-urlencode "payload={'channel': '#monitoring', 'username': 'webhookbot', 'text': 'This is posted to #monitoring and comes from a bot named webhookbot.', 'icon_emoji': ':ghost:'}" https://hooks.slack.com/services/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

自定义消息

在Alertmanager中使用Slack

在了解了Slack以及Incomming Webhhook的基本使用方式后，在Alertmanager中添加Slack支持就非常简单了。

在Alertmanager的全局配置中，将Incomming Webhhook地址作为slack_api_url添加到全局配置中即可：

global:
  slack_api_url: https://hooks.slack.com/services/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

当然，也可以在每个receiver中单独定义自己的slack_configs即可：

receivers：
- name: slack
  slack_configs:
    - channel: '#monitoring'
      send_resolved: true

这里如果我们手动拉高当前主机的CPU利用率，在#Monitoring平台中，我们会接收到一条告警信息如下所示：

告警信息

而当告警项恢复正常后，则可以接收到如下通知：

告警恢复信息

对于Incomming Webhhook支持的其它自定义参数，也可以在slack_config中进行定义，slack_config的主要配置如下：

channel: 
[ send_resolved:  | default = false ]
[ api_url:  | default = global.slack_api_url ]
[ icon_emoji:  ]
[ icon_url:  ]
[ link_names:  | default = false ]
[ username:  | default = '{{ template "slack.default.username" . }}' ]
[ color:  | default = '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}' ]
[ footer:  | default = '{{ template "slack.default.footer" . }}' ]
[ pretext:  | default = '{{ template "slack.default.pretext" . }}' ]
[ text:  | default = '{{ template "slack.default.text" . }}' ]
[ title:  | default = '{{ template "slack.default.title" . }}' ]
[ title_link:  | default = '{{ template "slack.default.titlelink" . }}' ]
[ image_url:  ]
[ thumb_url:  ]

如果要覆盖默认的告警内容，直接使用Go Template即可。例如：

color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'

与企业微信集成

Alertmanager已经内置了对企业微信的支持，我们可以通过企业微信来管理报警，更进一步可以通过企业微信和微信的互通来直接将告警消息转发到个人微信上。

prometheus官网中给出了企业微信的相关配置说明

# Whether or not to notify about resolved alerts.
[ send_resolved:  | default = false ]

# The API key to use when talking to the WeChat API.
[ api_secret:  | default = global.wechat_api_secret ]

# The WeChat API URL.
[ api_url:  | default = global.wechat_api_url ]

# The corp id for authentication.
[ corp_id:  | default = global.wechat_api_corp_id ]

# API request data as defined by the WeChat API.
[ message:  | default = '{{ template "wechat.default.message" . }}' ]
[ agent_id:  | default = '{{ template "wechat.default.agent_id" . }}' ]
[ to_user:  | default = '{{ template "wechat.default.to_user" . }}' ]
[ to_party:  | default = '{{ template "wechat.default.to_party" . }}' ]
[ to_tag:  | default = '{{ template "wechat.default.to_tag" . }}' ]

企业微信相关概念说明请参考企业微信API说明，可以在企业微信的后台中建立多个应用，每个应用对应不同的报警分组，由企业微信来做接收成员的划分。具体配置参考如下：

global:
  resolve_timeout: 10m
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_secret: '应用的secret，在应用的配置页面可以看到'
  wechat_api_corp_id: '企业id，在企业的配置页面可以看到'
templates:
- '/etc/alertmanager/config/*.tmpl'
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  routes:
  - receiver: 'wechat'
    continue: true
inhibit_rules:
- source_match:
receivers:
- name: 'wechat'
  wechat_configs:
  - send_resolved: false
    corp_id: '企业id，在企业的配置页面可以看到'
    to_user: '@all'
    to_party: ' PartyID1 | PartyID2 '
    message: '{{ template "wechat.default.message" . }}'
    agent_id: '应用的AgentId，在应用的配置页面可以看到'
    api_secret: '应用的secret，在应用的配置页面可以看到'

配置模板示例如下：

{{ define "wechat.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 -}}
告警类型: {{ $alert.Labels.alertname }}
告警级别: {{ $alert.Labels.severity }}

=====================
{{- end }}
===告警详情===
告警详情: {{ $alert.Annotations.message }}
故障时间: {{ $alert.StartsAt.Format "2006-01-02 15:04:05" }}
===参考信息===
{{ if gt (len $alert.Labels.instance) 0 -}}故障实例ip: {{ $alert.Labels.instance }};{{- end -}}
{{- if gt (len $alert.Labels.namespace) 0 -}}故障实例所在namespace: {{ $alert.Labels.namespace }};{{- end -}}
{{- if gt (len $alert.Labels.node) 0 -}}故障物理机ip: {{ $alert.Labels.node }};{{- end -}}
{{- if gt (len $alert.Labels.pod_name) 0 -}}故障pod名称: {{ $alert.Labels.pod_name }}{{- end }}
=====================
{{- end }}
{{- end }}

{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 -}}
告警类型: {{ $alert.Labels.alertname }}
告警级别: {{ $alert.Labels.severity }}

=====================
{{- end }}
===告警详情===
告警详情: {{ $alert.Annotations.message }}
故障时间: {{ $alert.StartsAt.Format "2006-01-02 15:04:05" }}
恢复时间: {{ $alert.EndsAt.Format "2006-01-02 15:04:05" }}
===参考信息===
{{ if gt (len $alert.Labels.instance) 0 -}}故障实例ip: {{ $alert.Labels.instance }};{{- end -}}
{{- if gt (len $alert.Labels.namespace) 0 -}}故障实例所在namespace: {{ $alert.Labels.namespace }};{{- end -}}
{{- if gt (len $alert.Labels.node) 0 -}}故障物理机ip: {{ $alert.Labels.node }};{{- end -}}
{{- if gt (len $alert.Labels.pod_name) 0 -}}故障pod名称: {{ $alert.Labels.pod_name }};{{- end }}
=====================
{{- end }}
{{- end }}
{{- end }}

这时如果某一容器频繁重启，可以接收到如下的告警内容：

使用Webhook扩展Alertmanager

在某些情况下除了Alertmanager已经内置的集中告警通知方式以外，对于不同的用户和组织而言还需要一些自定义的告知方式支持。通过Alertmanager提供的webhook支持可以轻松实现这一类的扩展。除了用于支持额外的通知方式，webhook还可以与其他第三方系统集成实现运维自动化，或者弹性伸缩等。

在Alertmanager中可以使用如下配置定义基于webhook的告警接收器receiver。一个receiver可以对应一组webhook配置。

name: 
webhook_configs:
  [ - , ... ]

每一项webhook_config的具体配置格式如下：

# Whether or not to notify about resolved alerts.
[ send_resolved:  | default = true ]

# The endpoint to send HTTP POST requests to.
url: 

# The HTTP client's configuration.
[ http_config:  | default = global.http_config ]

send_resolved用于指定是否在告警消除时发送回执消息。url则是用于接收webhook请求的地址。http_configs则是在需要对请求进行SSL配置时使用。

当用户定义webhook用于接收告警信息后，当告警被触发时，Alertmanager会按照以下格式向这些url地址发送HTTP Post请求，请求内容如下：

{
  "version": "4",
  "groupKey": ,    // key identifying the group of alerts (e.g. to deduplicate)
  "status": "",
  "receiver": ,
  "groupLabels":

Prometheus告警处理

Prometheus告警简介

Alertmanager特性

分组

抑制

静默

自定义Prometheus告警规则

定义告警规则

模板化

查看告警状态

实例：定义主机监控告警

部署Alertmanager

使用二进制包部署AlertManager

获取并安装软件包

创建alertmanager配置文件

启动Alertmanager

查看运行状态

关联Prometheus与Alertmanager

Alertmanager配置概述

内置告警接收器Receiver

与SMTP邮件集成

与Slack集成

认识Slack

添加应用：Incomming Webhooks

在Alertmanager中使用Slack

与企业微信集成

使用Webhook扩展Alertmanager

使用Golang创建webhook服务

与钉钉集成

自定义webhook群机器人

定义转换器将告警通知转化为Dingtalk消息对象

创建Dingtalk通知发送包

扩展启动函数

使用Dingtalk扩展

自定义告警模板

屏蔽告警通知

抑制机制

临时静默

你可能感兴趣的:(prometheus,运维)