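Prometheus Alerting Configuration

1. Summary

This article describes how to monitor service status with Prometheus and generate alerts so that operations staff can respond quickly.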

2. Overall Architecture

This setup uses three components: the Prometheus server, Alertmanager, and Blackbox Exporter.
[Figure 1]

All of them can be downloaded from the official site: https://prometheus.io/download/

3. Deploying Prometheus

  • Download and extract
$ tar xvf prometheus-2.35.0-rc0.linux-amd64.tar.gz
$ ls prometheus-2.35.0-rc0.linux-amd64
console_libraries  consoles  LICENSE  NOTICE  prometheus  prometheus.yml  promtool
  • Start Prometheus. The contents of prometheus.yml are not covered in detail here; see the official documentation.
$ ./prometheus
ts=2022-04-12T06:20:30.952Z caller=main.go:488 level=info msg="No time or size retention was set so using the default time retention" duration=15d
ts=2022-04-12T06:20:30.953Z caller=main.go:525 level=info msg="Starting Prometheus" version="(version=2.35.0-rc0, branch=HEAD, revision=5b73e518260d8bab36ebb1c0d0a5826eba8fc0a0
  • Open localhost:9090 in a browser
    [Figure 2]

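4. Deploying Blackbox Exporter

Blackbox Exporter is the official black-box monitoring solution provided by the Prometheus community. It lets you probe network endpoints over HTTP, HTTPS, DNS, TCP, and ICMP.

  • Download and extract: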
$ tar xvf blackbox_exporter-0.20.0.linux-amd64.tar.gz
$ ls
blackbox_exporter  blackbox.yml  LICENSE  NOTICE
  • Below is a simplified probe configuration file, blackbox.yml
modules:
  http_2xx:
    prober: http
  http_post_2xx:
    prober: http
    http:
      method: POST
  tcp_connect:
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  grpc:
    prober: grpc
    grpc:
      tls: true
      preferred_ip_protocol: "ip4"
  grpc_plain:
    prober: grpc
    grpc:
      tls: false
      service: "service1"
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^SSH-2.0-"
      - send: "SSH-2.0-blackbox-ssh-check"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
      - send: "NICK prober"
      - send: "USER prober prober prober :prober"
      - expect: "PING :([^ ]+)"
        send: "PONG ${1}"
      - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp

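Note: for more HTTP request methods, HTTP headers, request parameters, auth, certificate authentication, and so on, see the official documentation.

  • Start a Blackbox Exporter instance with the following command, pointing it at the probe configuration file: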
blackbox_exporter --config.file=/etc/prometheus/blackbox.yml
  • Integrate with Prometheus: add the following job under scrape_configs in prometheus.yml to probe http://www.123.com and http://www.baidu.com
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]            # module name defined in blackbox.yml
    static_configs:
      - targets:
        - http://www.123.com        # http
        - http://www.baidu.com      # http
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115

  • Restart Prometheus and open the Prometheus web page to verify
    [Figure 3]
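
The relabel_configs in the blackbox job above can be hard to read at first. The following is a minimal Python sketch, not Prometheus code, that traces the three relabel steps for one target and builds the URL Prometheus would end up scraping. The function names relabel and scrape_url are hypothetical, and the model assumes that params entries show up as __param_<name> labels.

```python
from urllib.parse import urlencode

EXPORTER_ADDR = "127.0.0.1:9115"  # blackbox exporter address used in the relabel rule


def relabel(target: str, module: str = "http_2xx") -> dict:
    """Trace the three relabel steps for one static target (simplified model)."""
    labels = {"__param_module": module, "__address__": target}
    # Step 1: copy the original address into the ?target= URL parameter
    labels["__param_target"] = labels["__address__"]
    # Step 2: keep the probed URL as the visible `instance` label
    labels["instance"] = labels["__param_target"]
    # Step 3: send the actual scrape to the exporter instead of the target
    labels["__address__"] = EXPORTER_ADDR
    return labels


def scrape_url(labels: dict, metrics_path: str = "/probe") -> str:
    """Build the scrape URL from __address__ and the __param_* labels."""
    prefix = "__param_"
    params = {k[len(prefix):]: v for k, v in labels.items() if k.startswith(prefix)}
    return "http://{}{}?{}".format(labels["__address__"], metrics_path, urlencode(params))


if __name__ == "__main__":
    print(scrape_url(relabel("http://www.baidu.com")))
```

The point of the three rules is that the exporter gets scraped, while the instance label still shows the URL that was probed.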

5. Deploying Alertmanager

  • Download and extract:
$ tar xvf alertmanager-$VERSION.linux-amd64.tar.gz
$ ls
alertmanager  alertmanager.yml  amtool  data  LICENSE  NOTICE
  • The extracted Alertmanager directory contains a default alertmanager.yml configuration file with the following contents:
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'   # service that receives the alerts
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

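The Alertmanager configuration has two main parts: the route (route) and the receivers (receivers). Every alert enters the routing tree at the top-level route and is delivered to the matching receiver according to the routing rules.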
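
To make the routing idea concrete, here is a minimal Python sketch of a routing-tree walk. It is a simplified model, not Alertmanager's implementation: it only handles exact-match label matchers and ignores options such as continue and regex matching. The child route and the oncall.hook receiver are hypothetical; the default configuration has only the top-level route.

```python
def matches(route: dict, labels: dict) -> bool:
    """A route matches when every label in its `match` block equals the alert's label."""
    return all(labels.get(key) == value for key, value in route.get("match", {}).items())


def pick_receiver(route: dict, labels: dict) -> str:
    """Walk the tree depth-first: the first matching child wins, else use this route."""
    for child in route.get("routes", []):
        if matches(child, labels):
            # A child without its own receiver inherits the parent's
            merged = {"receiver": route["receiver"], **child}
            return pick_receiver(merged, labels)
    return route["receiver"]


# Top-level route from the default config, plus one hypothetical child route
ROOT = {
    "receiver": "web.hook",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "oncall.hook"},  # hypothetical
    ],
}

if __name__ == "__main__":
    print(pick_receiver(ROOT, {"alertname": "demo", "severity": "warning"}))
    print(pick_receiver(ROOT, {"alertname": "demo", "severity": "critical"}))
```

With only the top-level route, as in the default file, every alert simply ends up at web.hook.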

  • Start Alertmanager
./alertmanager

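You can also override settings with command-line flags when starting Alertmanager: --config.file sets the path to the Alertmanager configuration file, and --storage.path sets the data storage path.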

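  • Check the running status
    Once started, Alertmanager is reachable on port 9093: http://localhost:9093
    [Figure 4]
  • Connect Prometheus to Alertmanager
    In the Prometheus architecture, alerting is split into two independent parts: Prometheus generates the alerts, and Alertmanager handles everything that happens after an alert is generated. Once Alertmanager is deployed, Prometheus therefore needs to be told where to find it.
    Edit the Prometheus configuration file prometheus.yml and add the following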
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
  • Configure the alerting rules
    Create a new file named blackbox_rules.yml
groups:
- name: blackbox_network_stats
  rules:
  - alert: blackbox_network_stats
    expr: up == 0   # any valid PromQL expression
    for: 1m         # fire only if the value stays 0 for 1 minute
    labels:
      severity: critical
    annotations:
      description: 'Job {{ $labels.job }} {{ $labels.instance }}.'
      summary: '{{ $labels.instance }} down ! ! !'


Note: for PromQL syntax, see the official documentation.
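
The effect of for: 1m in the rule above is easiest to see in a small simulation. The sketch below is a simplified model of the inactive / pending / firing states, not Prometheus code. The function name alert_states and the 15-second evaluation interval in the example are assumptions made for illustration.

```python
def alert_states(samples, hold_seconds=60):
    """Return the alert state after each rule evaluation.

    samples: list of (timestamp_seconds, up_value) pairs, one per evaluation.
    hold_seconds: the `for` duration from the rule (1m = 60 seconds).
    """
    states = []
    active_since = None  # time at which `up == 0` first became true
    for timestamp, up in samples:
        if up != 0:
            # Condition no longer holds: the alert resets completely
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        states.append("firing" if held_for >= hold_seconds else "pending")
    return states


if __name__ == "__main__":
    # Target goes down at t=15 and recovers at t=90 (illustrative values)
    timeline = [(0, 1), (15, 0), (30, 0), (45, 0), (60, 0), (75, 0), (90, 1)]
    for (timestamp, _), state in zip(timeline, alert_states(timeline)):
        print(timestamp, state)
```

A target that recovers before the hold time elapses never leaves the pending state, so no notification is sent for it.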

  • Edit the Prometheus configuration file prometheus.yml and add the following
rule_files:
  - "blackbox_rules.yml"

  • Restart Prometheus; the configuration is now complete.
    Alerts can be viewed at http://localhost:9093/#/alerts
    [Figure 5]
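
Besides the web UI, alert state can also be checked from a script. The sketch below assumes the Prometheus HTTP API endpoint /api/v1/alerts on localhost:9090 and its usual response shape (data.alerts, each with labels and state). The SAMPLE response is made up for illustration, and the helper names are hypothetical.

```python
import json
from urllib.request import urlopen


def firing_alerts(api_response: dict) -> list:
    """Return (alertname, instance) pairs for alerts whose state is 'firing'."""
    alerts = api_response.get("data", {}).get("alerts", [])
    return [
        (alert["labels"].get("alertname"), alert["labels"].get("instance"))
        for alert in alerts
        if alert.get("state") == "firing"
    ]


def fetch_alerts(base_url: str = "http://localhost:9090") -> dict:
    """Fetch the current alerts from a running Prometheus (needs network access)."""
    with urlopen(base_url + "/api/v1/alerts", timeout=5) as response:
        return json.load(response)


# Illustrative response only; the values are invented for this example
SAMPLE = {
    "status": "success",
    "data": {
        "alerts": [
            {
                "labels": {"alertname": "blackbox_network_stats",
                           "instance": "http://www.baidu.com",
                           "severity": "critical"},
                "annotations": {},
                "state": "firing",
            },
            {
                "labels": {"alertname": "blackbox_network_stats",
                           "instance": "http://www.123.com",
                           "severity": "critical"},
                "annotations": {},
                "state": "pending",
            },
        ]
    },
}

if __name__ == "__main__":
    print(firing_alerts(SAMPLE))
```

Against a live server, firing_alerts(fetch_alerts()) would return the same kind of list.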

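6. Verifying Alert Delivery

To verify that alerts are received, I wrote a simple HTTP service and tested it through Alertmanager's webhook receiver.

  • Modify alertmanager.yml and restart Alertmanager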
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:8981/'  # address of the service that receives the alerts
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

  • A simple HTTP service written in Python (saved as server.py) that receives the alerts
# coding=utf-8
import json
import urllib.parse
from http.server import HTTPServer, BaseHTTPRequestHandler


def start_server():
    host = ('localhost', 8981)

    class Request(BaseHTTPRequestHandler):
        def do_POST(self):
            # Read exactly Content-Length bytes of the request body
            length = int(self.headers['Content-Length'])
            # Note: Alertmanager posts a JSON body. parse_qs is used here only
            # to dump the raw payload; json.loads would give a structured dict.
            post_data = urllib.parse.parse_qs(self.rfile.read(length).decode('utf-8'))
            data = {"Method:": self.command,
                    "Path:": self.path,
                    "Post Data": post_data}
            print(data)
            # Reply with 200 and echo the received data back as JSON
            self.send_response(200)
            self.send_header('Content-type', 'application/json')
            self.end_headers()
            self.wfile.write(json.dumps(data).encode())

    server = HTTPServer(host, Request)
    print("Starting server, listen at: %s:%s" % host)
    server.serve_forever()  # blocks until the process is interrupted


if __name__ == '__main__':
    start_server()

Start the service and watch for incoming alerts

$ python server.py 
Starting server, listen at: localhost:8981
{'Method:': 'POST', 'Post Data': {'{"receiver":"web\\\\.hook","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"blackbox_network_stats","instance":"172.17.0.1:8001","job":"kong","severity":"critical"},"annotations":{"description":"Job kong 172.17.0.1:8001.","summary":"172.17.0.1:8001 down ! ! !"},"startsAt":"2022-04-12T06:21:45.185Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://ubuntu:9090/graph?g0.expr': ['up == 0\\u0026g0.tab=1","fingerprint":"5776c946d916f29c"}],"groupLabels":{"alertname":"blackbox_network_stats"},"commonLabels":{"alertname":"blackbox_network_stats","instance":"172.17.0.1:8001","job":"kong","severity":"critical"},"commonAnnotations":{"description":"Job kong 172.17.0.1:8001.","summary":"172.17.0.1:8001 down ! ! !"},"externalURL":"http://ubuntu:9093","version":"4","groupKey":"{}:{alertname=\\"blackbox_network_stats\\"}","truncatedAlerts":0}\n']}, 'Path:': '/'}
127.0.0.1 - - [12/Apr/2022 14:30:30] "POST / HTTP/1.1" 200 -
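
The output above is hard to read because the webhook body is JSON but was parsed as a form-encoded query string. As a hedged alternative, the sketch below decodes the body with json.loads and prints one line per alert. The names summarize, AlertHandler, and serve are hypothetical, and only fields visible in the payload above (status, labels, annotations) are used.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize(body: bytes) -> list:
    """Decode an Alertmanager webhook body and return one summary line per alert."""
    payload = json.loads(body.decode("utf-8"))
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append("[{}] {} {}: {}".format(
            alert.get("status"),
            labels.get("alertname"),
            labels.get("instance"),
            annotations.get("summary"),
        ))
    return lines


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body and print a short line for each alert in it
        length = int(self.headers.get("Content-Length", 0))
        for line in summarize(self.rfile.read(length)):
            print(line)
        # An empty 200 response is enough to acknowledge the notification
        self.send_response(200)
        self.end_headers()


def serve(host: str = "localhost", port: int = 8981) -> None:
    """Run the receiver on the address configured in alertmanager.yml."""
    HTTPServer((host, port), AlertHandler).serve_forever()
```

Calling serve() starts the receiver on the same port as the webhook URL configured earlier.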

With that, the Prometheus alerting setup is complete.

Reference: https://prometheus.io/download/
