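This article describes how to monitor service status with Prometheus and generate alerts so that operations staff can respond quickly.
The setup uses three components: the Prometheus server, Alertmanager, and Blackbox Exporter.
All of them can be downloaded from the official site: https://prometheus.io/download/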
$ tar xvf prometheus-2.35.0-rc0.linux-amd64.tar.gz
$ cd prometheus-2.35.0-rc0.linux-amd64
$ ls
console_libraries consoles LICENSE NOTICE prometheus prometheus.yml promtool
$ ./prometheus
ts=2022-04-12T06:20:30.952Z caller=main.go:488 level=info msg="No time or size retention was set so using the default time retention" duration=15d
ts=2022-04-12T06:20:30.953Z caller=main.go:525 level=info msg="Starting Prometheus" version="(version=2.35.0-rc0, branch=HEAD, revision=5b73e518260d8bab36ebb1c0d0a5826eba8fc0a0
Blackbox Exporter is the official black-box monitoring solution provided by the Prometheus community. It lets you probe network endpoints over HTTP, HTTPS, DNS, TCP, and ICMP.
$ tar xvf blackbox_exporter-0.20.0.linux-amd64.tar.gz
$ ls
blackbox_exporter blackbox.yml LICENSE NOTICE
modules:
  http_2xx:
    prober: http
  http_post_2xx:
    prober: http
    http:
      method: POST
  tcp_connect:
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  grpc:
    prober: grpc
    grpc:
      tls: true
      preferred_ip_protocol: "ip4"
  grpc_plain:
    prober: grpc
    grpc:
      tls: false
      service: "service1"
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^SSH-2.0-"
      - send: "SSH-2.0-blackbox-ssh-check"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
      - send: "NICK prober"
      - send: "USER prober prober prober :prober"
      - expect: "PING :([^ ]+)"
        send: "PONG ${1}"
      - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp
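Note: for other HTTP request methods, HTTP headers, request parameters, auth, certificate authentication, and so on, see the official documentation.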
blackbox_exporter --config.file=/etc/prometheus/blackbox.yml
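Once the exporter is running, a probe can be triggered by hand before Prometheus is involved. The following is a minimal sketch, assuming the exporter listens on 127.0.0.1:9115 (the address used in the scrape configuration in this article). The helper names build_probe_url and parse_probe_success are made up for illustration; the /probe endpoint with its target and module parameters and the probe_success metric come from Blackbox Exporter itself.

```python
from urllib.parse import urlencode


def build_probe_url(target, module="http_2xx", exporter="127.0.0.1:9115"):
    """Return the Blackbox Exporter /probe URL for one target and module."""
    query = urlencode({"target": target, "module": module})
    return "http://%s/probe?%s" % (exporter, query)


def parse_probe_success(metrics_text):
    """Return True/False from the probe_success sample, or None if absent."""
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.partition(" ")
        if name == "probe_success":
            return float(value) == 1.0
    return None
```

To try it against a live exporter, fetch the URL with urllib.request.urlopen(build_probe_url("http://www.baidu.com")).read().decode() and pass the text to parse_probe_success.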
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]  # must match a module defined in blackbox.yml
    static_configs:
      - targets:
        - http://www.123.com    # http
        - http://www.baidu.com  # http
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115
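The three relabel rules are easier to follow when traced step by step. The sketch below is a simplified model of the default replace action only, not Prometheus's actual implementation. It shows how the probed URL moves from __address__ into the target query parameter and the instance label, while __address__ itself is rewritten to point at the exporter.

```python
import re


def relabel(labels, rules):
    """Apply 'replace' relabel rules in order and return the new label set."""
    labels = dict(labels)
    for rule in rules:
        # Source label values are joined with ';' (the default separator).
        source = ";".join(labels.get(name, "") for name in rule.get("source_labels", []))
        # The default regex is (.*), matched against the whole joined value.
        match = re.fullmatch(rule.get("regex", "(.*)"), source)
        if match is None:
            continue
        # The default replacement is $1, i.e. the first capture group.
        replacement = rule.get("replacement", "$1")
        value = re.sub(r"\$(\d+)", lambda m: match.group(int(m.group(1))), replacement)
        labels[rule["target_label"]] = value
    return labels


RULES = [
    {"source_labels": ["__address__"], "target_label": "__param_target"},
    {"source_labels": ["__param_target"], "target_label": "instance"},
    {"target_label": "__address__", "replacement": "127.0.0.1:9115"},
]

result = relabel({"__address__": "http://www.baidu.com", "job": "blackbox"}, RULES)
print(result)
```

After the rules run, Prometheus scrapes 127.0.0.1:9115/probe with target=http://www.baidu.com, and the resulting series carry instance="http://www.baidu.com" rather than the exporter's address.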
$ tar xvf alertmanager-$VERSION.linux-amd64.tar.gz
$ ls
alertmanager alertmanager.yml amtool data LICENSE NOTICE
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'  # service that receives the alerts
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
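The inhibit_rules section mutes warning alerts while a critical alert with the same alertname, dev, and instance labels is firing. The sketch below models that check on plain label dictionaries. It is a simplified illustration rather than Alertmanager's code, and the 10.0.0.2:80 instance is a made-up value.

```python
def is_inhibited(target, active_alerts, rule):
    """Return True if `target` is muted by some active alert under `rule`."""
    def matches(alert, matcher):
        return all(alert.get(k, "") == v for k, v in matcher.items())

    if not matches(target, rule["target_match"]):
        return False
    for source in active_alerts:
        if source is target or not matches(source, rule["source_match"]):
            continue
        # A missing label is treated like an empty value, so labels absent
        # on both sides (such as 'dev' here) still count as equal.
        if all(source.get(k, "") == target.get(k, "") for k in rule["equal"]):
            return True
    return False


RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}

critical = {"alertname": "blackbox_network_stats", "instance": "172.17.0.1:8001", "severity": "critical"}
warning = {"alertname": "blackbox_network_stats", "instance": "172.17.0.1:8001", "severity": "warning"}
other = {"alertname": "blackbox_network_stats", "instance": "10.0.0.2:80", "severity": "warning"}  # hypothetical

active = [critical, warning, other]
print(is_inhibited(warning, active, RULE), is_inhibited(other, active, RULE))
```

The warning for 172.17.0.1:8001 is muted because a critical alert shares its alertname and instance; the warning for the other instance is still delivered.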
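The Alertmanager configuration has two main parts: the route and the receivers. Every alert enters the routing tree at the top-level route in the configuration and is then sent to the matching receiver according to the routing rules.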
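The routing walk can be sketched in a few lines. This is a simplified model that assumes exact-match label matchers only. The child route and its oncall.hook receiver are hypothetical and do not appear in the configuration in this article, which has only the top-level route.

```python
def find_receivers(route, labels, inherited=None):
    """Walk the routing tree depth-first and return the receivers for an alert."""
    receiver = route.get("receiver", inherited)  # children inherit the parent's receiver
    matched = []
    for child in route.get("routes", []):
        if all(labels.get(k) == v for k, v in child.get("match", {}).items()):
            matched.extend(find_receivers(child, labels, receiver))
            if not child.get("continue", False):
                break  # first matching child wins unless 'continue' is set
    # If no child route matched, the current node handles the alert.
    return matched or [receiver]


ROUTE_TREE = {
    "receiver": "web.hook",
    "routes": [
        # Hypothetical child route, added only to show the tree walk.
        {"match": {"severity": "critical"}, "receiver": "oncall.hook"},
    ],
}

print(find_receivers(ROUTE_TREE, {"alertname": "x", "severity": "critical"}))
print(find_receivers(ROUTE_TREE, {"alertname": "x", "severity": "warning"}))
```

With only a top-level route, as in this article, every alert falls through to web.hook.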
./alertmanager
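You can also pass flags when starting Alertmanager to change its settings: --config.file specifies the path to the Alertmanager configuration file, and --storage.path specifies the data storage directory.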
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
groups:
- name: blackbox_network_stats
  rules:
  - alert: blackbox_network_stats
    expr: up == 0   # a PromQL expression
    for: 1m         # fire only if the value stays 0 for 1 minute
    labels:
      severity: critical
    annotations:
      description: 'Job {{ $labels.job }} {{ $labels.instance }}.'
      summary: '{{ $labels.instance }} down ! ! !'
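Note: for PromQL syntax, see the official documentation. The up metric only reports whether Prometheus could scrape a target; for targets checked through Blackbox Exporter, the probe result itself is exposed as the probe_success metric.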
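The effect of the for clause is that an alert first goes pending and only fires once the expression has been true for the whole duration. The sketch below replays a series of rule evaluations to show this. The 15-second evaluation interval and the sample values are assumptions for illustration.

```python
def alert_states(samples, for_seconds=60):
    """Return (timestamp, state) pairs for (timestamp, condition_true) evaluations."""
    active_since = None
    states = []
    for ts, condition in samples:
        if not condition:
            active_since = None  # any false evaluation resets the timer
            states.append((ts, "inactive"))
            continue
        if active_since is None:
            active_since = ts
        state = "firing" if ts - active_since >= for_seconds else "pending"
        states.append((ts, state))
    return states


# Hypothetical evaluations every 15 seconds; True means "up == 0" returned a result.
samples = [(0, False), (15, True), (30, True), (45, True), (60, True), (75, True), (90, False)]
states = alert_states(samples)
for ts, state in states:
    print(ts, state)
```

A target that recovers before the minute is up never leaves the pending state, so short blips do not reach Alertmanager.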
rule_files:
  - "blackbox_rules.yml"
To verify that alerts are actually delivered, I wrote a simple HTTP service and tested it through Alertmanager's webhook receiver.
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:8981/'  # address of the service that receives the alerts
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
# coding=utf-8
import json
import urllib.parse
from http.server import HTTPServer, BaseHTTPRequestHandler


def start_server():
    host = ('localhost', 8981)

    class Request(BaseHTTPRequestHandler):
        def do_POST(self):
            # Read exactly as many bytes as the client announced.
            length = int(self.headers['Content-Length'])
            # Note: Alertmanager posts a JSON body. parse_qs treats it as a
            # form-encoded query string, so the JSON text ends up split into
            # a dictionary key and value; json.loads would decode it properly.
            post_data = urllib.parse.parse_qs(self.rfile.read(length).decode('utf-8'))
            data = {"Method:": self.command,
                    "Path:": self.path,
                    "Post Data": post_data}
            print(data)
            # Echo what was received back to the caller as JSON.
            self.send_response(200)
            self.send_header('Content-type', 'application/json')
            self.end_headers()
            self.wfile.write(json.dumps(data).encode())

    server = HTTPServer(host, Request)
    print("Starting server, listen at: %s:%s" % host)
    server.serve_forever()  # blocks until the process is interrupted


if __name__ == '__main__':
    start_server()
Start the service and watch the alerts it receives:
$ python server.py
Starting server, listen at: localhost:8981
{'Method:': 'POST', 'Post Data': {'{"receiver":"web\\\\.hook","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"blackbox_network_stats","instance":"172.17.0.1:8001","job":"kong","severity":"critical"},"annotations":{"description":"Job kong 172.17.0.1:8001.","summary":"172.17.0.1:8001 down ! ! !"},"startsAt":"2022-04-12T06:21:45.185Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://ubuntu:9090/graph?g0.expr': ['up == 0\\u0026g0.tab=1","fingerprint":"5776c946d916f29c"}],"groupLabels":{"alertname":"blackbox_network_stats"},"commonLabels":{"alertname":"blackbox_network_stats","instance":"172.17.0.1:8001","job":"kong","severity":"critical"},"commonAnnotations":{"description":"Job kong 172.17.0.1:8001.","summary":"172.17.0.1:8001 down ! ! !"},"externalURL":"http://ubuntu:9093","version":"4","groupKey":"{}:{alertname=\\"blackbox_network_stats\\"}","truncatedAlerts":0}\n']}, 'Path:': '/'}
127.0.0.1 - - [12/Apr/2022 14:30:30] "POST / HTTP/1.1" 200 -
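In the output above, the JSON body is split at the first = sign because it was parsed as a query string. A receiver that decodes the body as JSON is easier to work with. The following is a minimal sketch of that variant. The names summarize_alerts, JsonAlertHandler, and make_server are made up for illustration, and SAMPLE_BODY is a trimmed payload modeled on the fields visible in the output above.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_alerts(body):
    """Decode an Alertmanager webhook body and return one line per alert."""
    payload = json.loads(body.decode("utf-8"))
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        lines.append("%s %s %s" % (
            alert.get("status", "unknown"),
            labels.get("alertname", "?"),
            labels.get("instance", "?"),
        ))
    return lines


class JsonAlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        for line in summarize_alerts(self.rfile.read(length)):
            print(line)
        self.send_response(200)
        self.end_headers()


def make_server(port=8981):
    """Build (but do not start) a server; call .serve_forever() to run it."""
    return HTTPServer(("localhost", port), JsonAlertHandler)


# Trimmed sample payload, modeled on the fields seen in the output above.
SAMPLE_BODY = json.dumps({
    "status": "firing",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "blackbox_network_stats", "instance": "172.17.0.1:8001"},
    }],
}).encode("utf-8")

print(summarize_alerts(SAMPLE_BODY))
```

Running make_server().serve_forever() starts the receiver on the same port as the webhook URL configured earlier.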
Reference: https://prometheus.io/download/