数据架构师

Prometheus Alertmanager报警组件

分享一个朋友的人工智能教程。零基础！通俗易懂！风趣幽默！还带黄段子！大家可以看看是否对自己有帮助：点击打开

docker/kubernetes入门视频教程

全栈工程师开发手册（作者：栾鹏）
架构系列文章

Prometheus Alertmanager

概述

Alertmanager与Prometheus是相互分离的两个组件。Prometheus服务器根据报警规则将警报发送给Alertmanager，然后Alertmanager将silencing、inhibition、aggregation等消息通过电子邮件、PaperDuty和HipChat发送通知。

设置警报和通知的主要步骤：

安装配置Alertmanager
配置Prometheus通过-alertmanager.url标志与Alertmanager通信
在Prometheus中创建告警规则

Alertmanager简介及机制

Alertmanager处理由例如Prometheus服务器等客户端发来的警报。它负责删除重复数据、分组，并将警报通过路由发送到正确的接收器，比如电子邮件、Slack等。Alertmanager还支持groups,silencing和警报抑制的机制。

分组

分组是指将同一类型的警报分类为单个通知。当许多系统同时宕机时，很有可能成百上千的警报会同时生成，这种机制特别有用。
例如，当数十或数百个服务的实例在运行，网络发生故障时，有可能一半的服务实例不能访问数据库。在prometheus告警规则中配置为每一个服务实例都发送警报的话，那么结果是数百警报被发送至Alertmanager。

但是作为用户只想看到单一的报警页面，同时仍然能够清楚的看到哪些实例受到影响，因此，可以通过配置Alertmanager将警报分组打包，并发送一个相对看起来紧凑的通知。

分组警报、警报时间，以及接收警报的receiver是在alertmanager配置文件中通过路由树配置的。

抑制(Inhibition)

抑制是指当警报发出后，停止重复发送由此警报引发其他错误的警报的机制。(比如网络不可达，导致其他服务连接相关警报)

例如，当整个集群网络不可达，此时警报被触发，可以事先配置Alertmanager忽略由该警报触发而产生的所有其他警报，这可以防止通知数百或数千与此问题不相关的其他警报。

抑制机制也是通过Alertmanager的配置文件来配置。

沉默(Silences)

Silences是一种简单的特定时间不告警的机制。silences警告是通过匹配器(matchers)来配置，就像路由树一样。传入的警报会匹配RE，如果匹配，将不会为此警报发送通知。

这个可视化编辑器可以帮助构建路由树。

silences报警机制可以通过Alertmanager的Web页面进行配置。

Alermanager的配置

Alertmanager通过命令行flag和一个配置文件进行配置。命令行flag配置不变的系统参数、配置文件定义的抑制(inhibition)规则、通知路由和通知接收器。

要查看所有可用的命令行flag，运行alertmanager -h。
Alertmanager支持在运行时加载配置，如果新配置语法格式不正确，更改将不会被应用，并记录语法错误。通过向该进程发送SIGHUP或向/-/reload端点发送HTTP POST请求来触发配置热加载。

配置文件

要指定加载的配置文件，需要使用-config.file标志。该文件使用YAML来完成，通过下面的描述来定义。带括号的参数表示是可选的，对于非列表的参数的值，将被设置为指定的缺省值。

通用占位符定义解释:

: 与正则表达式匹配的持续时间值,[0-9]+(ms|[smhdwy])
: 与正则表达式匹配的字符串,[a-zA-Z_][a-zA-Z0-9_]*
: unicode字符串
: 有效的文件路径
: boolean类型，true或者false
: 字符串
: 模板变量字符串

global全局配置文件参数在所有配置上下文生效，作为其他配置项的默认值,可被覆盖.

global:
  # ResolveTimeout is the time after which an alert is declared resolved
  # if it has not been updated.
  #解决报警时间间隔
  [ resolve_timeout:  | default = 5m ]

  # The default SMTP From header field.
  [ smtp_from:  ]
  # The default SMTP smarthost used for sending emails.
  [ smtp_smarthost:  ]
  # SMTP authentication information.
  [ smtp_auth_username:  ]
  [ smtp_auth_password:  ]
  [ smtp_auth_secret:  ]
  # The default SMTP TLS requirement.
  [ smtp_require_tls:  | default = true ]

  # The API URL to use for Slack notifications.
  [ slack_api_url:  ]

  [ pagerduty_url:  | default = "https://events.pagerduty.com/generic/2010-04-15/create_event.json" ]
  [ opsgenie_api_host:  | default = "https://api.opsgenie.com/" ]

# Files from which custom notification template definitions are read.
# The last component may use a wildcard matcher, e.g. 'templates/*.tmpl'.
templates:
  [ -  ... ]

# The root node of the routing tree.
route: 

# A list of notification receivers.
receivers:
  -  ...

# A list of inhibition rules.
inhibit_rules:
  [ -  ... ]

路由(route)

路由块定义了路由树及其子节点。如果没有设置的话，子节点的可选配置参数从其父节点继承。

每个警报都会在配置的顶级路由中进入路由树，该路由树必须匹配所有警报（即没有任何配置的匹配器）。然后遍历子节点。如果continue的值设置为false，它在第一个匹配的子节点之后就停止；如果continue的值为true，警报将继续进行后续子节点的匹配。如果警报不匹配任何节点的任何子节点（没有匹配的子节点，或不存在），该警报基于当前节点的配置处理。

路由配置格式

#报警接收器
[ receiver:  ]

#分组
[ group_by: '[' , ... ']' ]

# Whether an alert should continue matching subsequent sibling nodes.
[ continue:  | default = false ]

# A set of equality matchers an alert has to fulfill to match the node.
#根据匹配的警报，指定接收器
match:
  [ : , ... ]

# A set of regex-matchers an alert has to fulfill to match the node.
match_re:
#根据匹配正则符合的警告，指定接收器
  [ : , ... ]

# How long to initially wait to send a notification for a group
# of alerts. Allows to wait for an inhibiting alert to arrive or collect
# more initial alerts for the same group. (Usually ~0s to few minutes.)
[ group_wait:  ]

# How long to wait before sending notification about new alerts that are
# in are added to a group of alerts for which an initial notification
# has already been sent. (Usually ~5min or more.)
[ group_interval:  ]

# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more).
[ repeat_interval:  ]

# Zero or more child routes.
routes:
  [ -  ... ]

例子：

# The root route with all parameters, which are inherited by the child
# routes if they are not overwritten.
route:
  receiver: 'default-receiver'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [cluster, alertname]
  # All alerts that do not match the following child routes
  # will remain at the root node and be dispatched to 'default-receiver'.
  routes:
  # All alerts with service=mysql or service=cassandra
  # are dispatched to the database pager.
  - receiver: 'database-pager'
    group_wait: 10s
    match_re:
      service: mysql|cassandra
  # All alerts with the team=frontend label match this sub-route.
  # They are grouped by product and environment rather than cluster
  # and alertname.
  - receiver: 'frontend-pager'
    group_by: [product, environment]
    match:
      team: frontend

抑制规则 inhibit_rule

抑制规则，是存在另一组匹配器匹配的情况下，使其他被引发警报的规则静音。这两个警报，必须有一组相同的标签。

抑制配置格式

# Matchers that have to be fulfilled in the alerts to be muted.
##必须在要需要静音的警报中履行的匹配者
target_match:
  [ : , ... ]
target_match_re:
  [ : , ... ]

# Matchers for which one or more alerts have to exist for the
# inhibition to take effect.
#必须存在一个或多个警报以使抑制生效的匹配者。
source_match:
  [ : , ... ]
source_match_re:
  [ : , ... ]

# Labels that must have an equal value in the source and target
# alert for the inhibition to take effect.
#在源和目标警报中必须具有相等值的标签才能使抑制生效
[ equal: '[' , ... ']' ]

接收器(receiver)

顾名思义，警报接收的配置。

通用配置格式

# The unique name of the receiver.
name: 

# Configurations for several notification integrations.
email_configs:
  [ - , ... ]
pagerduty_configs:
  [ - , ... ]
slack_config:
  [ - , ... ]
opsgenie_configs:
  [ - , ... ]
webhook_configs:
  [ - , ... ]

邮件接收器email_config

# Whether or not to notify about resolved alerts.
#警报被解决之后是否通知
[ send_resolved:  | default = false ]

# The email address to send notifications to.
to: 
# The sender address.
[ from:  | default = global.smtp_from ]
# The SMTP host through which emails are sent.
[ smarthost:  | default = global.smtp_smarthost ]

# The HTML body of the email notification.
[ html:  | default = '{{ template "email.default.html" . }}' ] 

# Further headers email header key/value pairs. Overrides any headers
# previously set by the notification implementation.
[ headers: { : , ... } ]

Slcack接收器slack_config

# Whether or not to notify about resolved alerts.
[ send_resolved:  | default = true ]

# The Slack webhook URL.
[ api_url:  | default = global.slack_api_url ]

# The channel or user to send notifications to.
channel: 

# API request data as defined by the Slack webhook API.
[ color:  | default = '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}' ]
[ username:  | default = '{{ template "slack.default.username" . }}'
[ title:  | default = '{{ template "slack.default.title" . }}' ]
[ title_link:  | default = '{{ template "slack.default.titlelink" . }}' ]
[ pretext:  | default = '{{ template "slack.default.pretext" . }}' ]
[ text:  | default = '{{ template "slack.default.text" . }}' ]
[ fallback:  | default = '{{ template "slack.default.fallback" . }}' ]

Webhook接收器webhook_config

 # Whether or not to notify about resolved alerts.
[ send_resolved:  | default = true ]

 # The endpoint to send HTTP POST requests to.
url:

Alertmanager会使用以下的格式向配置端点发送HTTP POST请求：

{
  "version": "3",
  "groupKey":      // key identifying the group of alerts (e.g. to deduplicate)
  "status": "",
  "receiver": ,
  "groupLabels":