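Prometheus server does not send alert notifications itself; that job belongs to Alertmanager. Prometheus evaluates the alerting rules and pushes the resulting alerts to Alertmanager, which then decides from its own configuration whether a notification should go out and who should receive it.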

INSTALL

As with Prometheus, there are three ways to install Alertmanager:

Precompiled binaries
Docker images
Compiling the binary

I went with the first option. The latest precompiled binaries can be downloaded from https://prometheus.io/download/.

RUN

./alertmanager --config.file=simple.yml

Once it is running, the web UI is available at:

http://127.0.0.1:9093/#/alerts
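
To check that the running instance accepts alerts before wiring up Prometheus, a hand-made alert can be pushed to it. The following is a minimal sketch, assuming Alertmanager listens on 127.0.0.1:9093 as above and exposes its v2 API at /api/v2/alerts; the alert name, instance and severity are made-up values for illustration.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(name, instance, minutes=5):
    """Build a one-alert payload for Alertmanager's v2 alerts endpoint."""
    now = datetime.now(timezone.utc)
    return [{
        # Hypothetical labels; only alertname is a conventional label name.
        "labels": {"alertname": name, "instance": instance, "severity": "warning"},
        "annotations": {"summary": f"{instance}: manual test alert"},
        "startsAt": now.isoformat(),
        # Once endsAt has passed, the alert is treated as resolved.
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def post_alerts(alerts, base_url="http://127.0.0.1:9093"):
    """POST the alerts to Alertmanager and return the HTTP status code."""
    req = urllib.request.Request(
        base_url + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

With Alertmanager running, calling post_alerts(build_test_alert("ManualTest", "localhost:9100")) should make the alert show up on the alerts page of the web UI.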


CONFIG Prometheus

Add the following to the Prometheus configuration file:

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
      - targets: ["localhost:9093"]
# Load the alerting rules
rule_files:
  - "alert_rules.yml"

Contents of alert_rules.yml:

groups:
    - name: test-rule
      rules:
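      # Fire when disk usage stays above 80% for more than two minutes.
      # Note: node_exporter 0.16 and later add a _bytes suffix to these
      # metric names (for example node_filesystem_size_bytes).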
      - alert: NodeFilesystemUsage
        expr: (node_filesystem_size{device="rootfs"} - node_filesystem_free{device="rootfs"}) / node_filesystem_size{device="rootfs"} * 100 > 80
        for: 2m
        labels:
          team: node
        annotations:
          summary: "{{$labels.instance}}: High Filesystem usage detected"
          description: "{{$labels.instance}}: Filesystem usage is above 80% (current value is: {{ $value }})"
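      # Fire when memory usage stays above 80% for more than two minutes.
      # Note: node_exporter 0.16 and later add a _bytes suffix to these
      # metric names (for example node_memory_MemTotal_bytes).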
      - alert: NodeMemoryUsage
        expr: (node_memory_MemTotal - (node_memory_MemFree+node_memory_Buffers+node_memory_Cached )) / node_memory_MemTotal * 100 > 80
        for: 2m
        labels:
          team: node
        annotations:
          summary: "{{$labels.instance}}: High Memory usage detected"
          description: "{{$labels.instance}}: Memory usage is above 80% (current value is: {{ $value }})"

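groups defines the rule groups. The labels and annotations are attached to every alert a rule produces, and Alertmanager uses them when it routes the alert and builds the notification.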
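
To make the arithmetic behind the two rule expressions concrete, here is a small sketch that mirrors the formula (total - available) / total * 100 > 80. The byte counts are hypothetical sample values, not output from a real node.

```python
def usage_percent(total, available):
    """Mirror of the rule expression: (total - available) / total * 100."""
    return (total - available) / total * 100


def breaches(total, available, threshold=80.0):
    """True when usage is strictly above the threshold, like `> 80` in the rule."""
    return usage_percent(total, available) > threshold


# Hypothetical sample: 16 GiB of memory, 2 GiB counted as free + buffers + cached.
GIB = 1024 ** 3
print(usage_percent(16 * GIB, 2 * GIB))  # 87.5
print(breaches(16 * GIB, 2 * GIB))       # True
print(breaches(16 * GIB, 4 * GIB))       # False, 75% does not exceed 80%
```

Crossing the threshold alone is not enough for a notification: the condition also has to hold for the whole for: 2m window.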

CONCEPT of Alertmanager

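Grouping

Grouping bundles alerts of a similar nature into a single notification. This is especially useful during larger outages, when many systems fail at once and hundreds or thousands of alerts may fire at the same time.

For example, suppose dozens or hundreds of service instances are running when a network failure occurs, and half of them can no longer reach the database. If the alerting rules fire one alert per service instance, hundreds of alerts are sent to Alertmanager.

As a user you only want a single page, while still being able to see exactly which instances are affected. Alertmanager can therefore be configured to group these alerts and send one compact notification.

How alerts are grouped, the timing of the grouped notifications, and the receivers of those notifications are all configured through the routing tree in the configuration file.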
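
The grouping idea can be sketched in a few lines. This is only an illustration of bucketing by label values, not Alertmanager's actual implementation; the alerts and label names are made up, and treating a missing label as an empty string is an assumption of the sketch.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels.

    Each bucket stands for one notification. A missing label is treated as
    an empty string (an assumption of this sketch).
    """
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical alerts: three service instances lose the database.
alerts = [
    {"alertname": "DBUnreachable", "cluster": "eu", "instance": "app-1"},
    {"alertname": "DBUnreachable", "cluster": "eu", "instance": "app-2"},
    {"alertname": "DBUnreachable", "cluster": "us", "instance": "app-3"},
]

for key, members in group_alerts(alerts, ["alertname", "cluster"]).items():
    instances = ", ".join(a["instance"] for a in members)
    print(f"{key}: {len(members)} alert(s) -> {instances}")
```

Three alerts collapse into two notifications here, one per cluster, and each notification still lists the affected instances.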

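Inhibition

Inhibition suppresses notifications for certain alerts while certain other alerts are already firing.

For example, when an alert fires saying that an entire cluster is unreachable, Alertmanager can be configured to mute all other alerts concerning that cluster. This prevents notifications for hundreds or thousands of firing alerts that are unrelated to the actual issue.

Inhibition is configured in the Alertmanager configuration file.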

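Silences

A silence is a simple way to mute notifications for a given period of time. It is configured with matchers, just like the routing tree. Incoming alerts are checked against the matchers (equality or regular expression) of each active silence; if an alert matches all matchers of a silence, no notification is sent for it.

Silences are configured in the Alertmanager web UI.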
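
As a rough sketch of how silence matching works: a silence applies when it is active at the current time and every one of its matchers matches the alert's labels. The field names (name, value, isRegex, startsAt, endsAt) follow the shape used by Alertmanager's v2 API, but the code is an illustration rather than the real implementation, and Python's regular expressions differ in details from the Go ones Alertmanager uses.

```python
import re
from datetime import datetime, timezone


def matcher_matches(matcher, labels):
    """Check one matcher (dict with name, value, isRegex) against a label set."""
    actual = labels.get(matcher["name"], "")
    if matcher["isRegex"]:
        # Regex matchers are anchored: the whole label value must match.
        return re.fullmatch(matcher["value"], actual) is not None
    return actual == matcher["value"]


def is_silenced(labels, silences, now):
    """True if any silence is active at `now` and all of its matchers match."""
    for s in silences:
        active = s["startsAt"] <= now < s["endsAt"]
        if active and all(matcher_matches(m, labels) for m in s["matchers"]):
            return True
    return False


# Hypothetical silence: mute team=node alerts from db-* instances for two hours.
utc = timezone.utc
silences = [{
    "startsAt": datetime(2024, 1, 1, 11, 0, tzinfo=utc),
    "endsAt": datetime(2024, 1, 1, 13, 0, tzinfo=utc),
    "matchers": [
        {"name": "team", "value": "node", "isRegex": False},
        {"name": "instance", "value": "db-.*", "isRegex": True},
    ],
}]
noon = datetime(2024, 1, 1, 12, 0, tzinfo=utc)
print(is_silenced({"team": "node", "instance": "db-1"}, silences, noon))   # True
print(is_silenced({"team": "node", "instance": "web-1"}, silences, noon))  # False
```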

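On the Prometheus side, an alert is in one of three states during its lifetime:

inactive: the alert is neither pending nor firing
pending: the rule condition is true, but has not yet held for the duration set with for:
firing: the rule condition has held for at least the duration set with for: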
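
The three states can be illustrated with a tiny state machine. This is a simplified sketch of the behaviour, assuming the rule is evaluated at a fixed interval (60 seconds here, a value chosen for the example) and that the condition result of each evaluation is already known.

```python
def alert_states(condition_samples, for_seconds, interval_seconds):
    """Return the alert state after each rule evaluation.

    condition_samples holds one boolean per evaluation: True when the rule
    expression returned a result for this series.
    """
    states = []
    active_since = None  # evaluation time at which the condition became true
    for i, is_true in enumerate(condition_samples):
        now = i * interval_seconds
        if not is_true:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held = now - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


# Condition over six evaluations, with for: 2m and a 60 s evaluation interval.
samples = [False, True, True, True, False, True]
print(alert_states(samples, for_seconds=120, interval_seconds=60))
# ['inactive', 'pending', 'pending', 'firing', 'inactive', 'pending']
```

Note how a single evaluation where the condition is false resets the clock, so the alert has to go through pending again.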

CONFIG Alertmanager

global

global:
  # The smarthost and SMTP sender used for mail notifications.
  # Email settings
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'my_passwd'
  # HipChat is an enterprise chat tool, similar to RTX
  # (it has since been discontinued by Atlassian)
  # The auth token for Hipchat.
  hipchat_auth_token: '1234556789'
  # Alternative host for Hipchat.
  hipchat_api_url: 'https://hipchat.foobar.org/'

route

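The route block defines a node in the routing tree and its children. Optional configuration parameters that a child does not set are inherited from its parent node.

Every alert enters the routing tree at the top-level route, which must match all alerts (that is, it has no matchers of its own). The alert is then matched against the child nodes. If continue is false, matching stops after the first child that matches; if continue is true on a matching child, the alert goes on to be matched against the following siblings. If an alert matches none of a node's children (because none match, or because there are none), it is handled according to the configuration of the current node.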

route:
  # The child route trees.
  routes:
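  # This route performs a regular expression match on alert labels to
  # catch alerts that are related to a list of services.
  # It matches when the value of the "service" label satisfies the regex.
  # (Alertmanager 0.22 and later prefer the matchers field over
  # match and match_re.)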
  - match_re:
      service: ^(foo1|foo2|baz)$
    receiver: team-X-mails
    # The service has a sub-route for critical alerts, any alerts
    # that do not match, i.e. severity != critical, fall-back to the
    # parent node and are sent to 'team-X-mails'
    routes:
    - match:
        severity: critical
      receiver: team-X-pager
  - match:
      service: files
    receiver: team-Y-mails

    routes:
    - match:
        severity: critical
      receiver: team-Y-pager

  # This route handles all alerts coming from a database service. If there's
  # no team to handle it, it defaults to the DB team.
  - match:
      service: database
    receiver: team-DB-pager
    # Also group alerts by affected database.
    group_by: [alertname, cluster, database]
    routes:
    - match:
        owner: team-X
      receiver: team-X-pager
    - match:
        owner: team-Y
      receiver: team-Y-pager

The routing tree of a configuration file can be visualized at https://prometheus.io/webtools/alerting/routing-tree-editor/.
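
The matching rules described above can also be traced by hand with a short sketch. It models the example tree and walks it depth-first, honouring the continue option. The Route class and the root receiver name default-receiver are hypothetical (the excerpt does not show the root route's settings), and inheritance of other parameters is left out.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict = field(default_factory=dict)     # exact label values
    match_re: dict = field(default_factory=dict)  # anchored regular expressions
    cont: bool = False                            # the `continue` option
    routes: list = field(default_factory=list)

    def matches(self, labels):
        exact = all(labels.get(k, "") == v for k, v in self.match.items())
        regex = all(
            re.fullmatch(p, labels.get(k, "")) is not None
            for k, p in self.match_re.items()
        )
        return exact and regex


def resolve(route, labels):
    """Return the receivers an alert ends up at, walking the tree depth-first."""
    if not route.matches(labels):
        return []
    found = []
    for child in route.routes:
        hit = resolve(child, labels)
        found.extend(hit)
        if hit and not child.cont:
            break  # the first matching child wins unless it sets continue
    # No child matched: the current node handles the alert itself.
    return found or [route.receiver]


tree = Route("default-receiver", routes=[
    Route("team-X-mails", match_re={"service": "^(foo1|foo2|baz)$"}, routes=[
        Route("team-X-pager", match={"severity": "critical"}),
    ]),
    Route("team-Y-mails", match={"service": "files"}, routes=[
        Route("team-Y-pager", match={"severity": "critical"}),
    ]),
    Route("team-DB-pager", match={"service": "database"}, routes=[
        Route("team-X-pager", match={"owner": "team-X"}),
        Route("team-Y-pager", match={"owner": "team-Y"}),
    ]),
])

print(resolve(tree, {"service": "foo1", "severity": "critical"}))  # ['team-X-pager']
print(resolve(tree, {"service": "foo2", "severity": "warning"}))   # ['team-X-mails']
print(resolve(tree, {"service": "database", "owner": "team-Y"}))   # ['team-Y-pager']
print(resolve(tree, {"service": "cache"}))                         # ['default-receiver']
```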

Inhibition

# Inhibition rules allow to mute a set of alerts given that another alert is
# firing.
# We use this to mute any warning-level notifications if the same alert is 
# already critical.
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  # Apply inhibition if the alertname is the same.
  equal: ['alertname', 'cluster', 'service']
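
To see what this rule does to concrete alerts, here is a minimal sketch of the check. It is an illustration of the idea, not Alertmanager's code; the alerts are made up, and comparing a label that is missing on both sides as equal is an assumption of the sketch.

```python
def inhibited(target, firing_alerts, rule):
    """True if `rule` mutes `target` because of another firing alert."""
    def has(labels, wanted):
        return all(labels.get(k) == v for k, v in wanted.items())

    if not has(target, rule["target_match"]):
        return False
    for source in firing_alerts:
        if source is target or not has(source, rule["source_match"]):
            continue
        # Labels listed under `equal` must agree; a label missing on both
        # sides compares as equal here (empty string on each side).
        if all(source.get(k, "") == target.get(k, "") for k in rule["equal"]):
            return True
    return False


rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "cluster", "service"],
}
base = {"alertname": "NodeMemoryUsage", "cluster": "eu", "service": "api"}
critical = {**base, "severity": "critical"}
warning = {**base, "severity": "warning"}
other = {**base, "cluster": "us", "severity": "warning"}
firing = [critical, warning, other]

print(inhibited(warning, firing, rule))   # True: same alert is already critical
print(inhibited(other, firing, rule))     # False: the cluster label differs
print(inhibited(critical, firing, rule))  # False: not a warning-level alert
```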

receivers

receivers:
- name: 'team-X-mails'
  email_configs:
  - to: '[email protected]'

- name: 'team-X-pager'
  email_configs:
  - to: '[email protected]'
  pagerduty_configs:
  - service_key:

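Supported notification channels include:

email
WeChat Work (企业微信)
DingTalk (钉钉), through a webhook adapter
webhook
various other services popular outside China, such as PagerDuty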
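
The webhook channel is the most flexible of these, since Alertmanager simply POSTs a JSON document to a URL of your choice. Below is a minimal sketch of a receiving end, assuming the usual webhook payload fields (status, receiver, alerts with labels and annotations); the port and the message format are arbitrary choices for the example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize(payload):
    """Turn an Alertmanager webhook payload into a short text message."""
    lines = [f"[{payload['status'].upper()}] receiver={payload['receiver']}"]
    for alert in payload["alerts"]:
        name = alert["labels"].get("alertname", "unknown")
        summary = alert["annotations"].get("summary", "")
        lines.append(f"- {alert['status']}: {name} {summary}".rstrip())
    return "\n".join(lines)


class Hook(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body sent by Alertmanager and print a summary.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        print(summarize(payload))
        self.send_response(200)
        self.end_headers()


def serve(port=5001):
    """Run the receiver until interrupted (port 5001 is a hypothetical choice)."""
    HTTPServer(("127.0.0.1", port), Hook).serve_forever()
```

After calling serve(), a receiver whose webhook_configs entry points at http://127.0.0.1:5001/ would deliver its notifications to this handler. From there the message could be forwarded to any chat tool that lacks a built-in integration.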
