Installed software versions
Download | Prometheus
prometheus-2.33
Download Grafana | Grafana Labs
grafana-8.3.6
Releases · prometheus/node_exporter · GitHub
node_exporter-1.3.1
blackbox_exporter-0.20.0
alertmanager-0.24.0
cadvisor
mongodb_exporter :https://github.com/percona/mongodb_exporter
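Metric types:
Counter: a cumulative value that only increases (or resets to zero on restart); do not use a counter for a value that can decrease
Gauge: a single numeric value that can go up and down arbitrarily
Summary: the result of sampling observations over a period of time; quantiles are computed on the client side
Histogram: observations counted into buckets by upper bound, plus the sum of observed values and the total sample count; quantiles are computed on the server side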
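The bucket, sum and count bookkeeping of a histogram, and the "only increases" rule for counters, can be sketched in plain Python. This is a toy model under simplifying assumptions, not the Prometheus client library; the class and function names are made up for illustration, and real `increase()` also extrapolates over the query window.

```python
import bisect


class Histogram:
    """Toy cumulative histogram: counts per upper bound, plus sum and count."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        self.bucket_counts = [0] * len(self.upper_bounds)  # cumulative, like le="..."
        self.inf_count = 0  # the le="+Inf" bucket equals the total sample count
        self.total = 0.0    # sum of all observed values

    def observe(self, value):
        # An observation counts toward every bucket whose bound is >= value.
        first = bisect.bisect_left(self.upper_bounds, value)
        for i in range(first, len(self.upper_bounds)):
            self.bucket_counts[i] += 1
        self.inf_count += 1
        self.total += value


def counter_increase(samples):
    """Increase over successive counter samples; a drop is treated as a reset to zero."""
    increase = 0.0
    for prev, cur in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    return increase


h = Histogram([0.1, 0.5, 1.0])
for v in (0.05, 0.3, 0.7, 2.0):
    h.observe(v)
print(h.bucket_counts, h.inf_count, h.total)
print(counter_increase([10, 15, 3, 8]))
```

A gauge needs no such handling: its latest sample is simply the current value.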
Configuring prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 192.168.21.120:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "first_rules.yml"
  - "rules/*.yml"
# Scrape configurations: Prometheus itself, followed by the exporters.
scrape_configs:
  # The job name is added as a label `job=` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: 'linux'
    static_configs:
      - targets: ['localhost:9222','192.168.21.11:9222']
  - job_name: 'docker'
    static_configs:
      - targets: ['localhost:8080','192.168.21.11:8080']
  - job_name: 'mongo_exp'
    static_configs:
      - targets: ['192.168.21.11:9223']
        labels:
          unitname: "Mongodb_exporter"
  - job_name: 'port_status'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets: ['192.168.21.120:3000']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115 # host and port where blackbox_exporter runs
  - job_name: 'port_status_gyds'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
          - 192.168.21.11:9110
          - 192.168.21.11:9210
          - 192.168.21.11:8090
          - 192.168.21.11:9100
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.21.11:9115 # host and port where blackbox_exporter runs
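The three relabel steps used by the port_status jobs can be traced by hand. The sketch below only imitates what relabeling does to the labels of one target and shows the probe URL that results; the function name is hypothetical and this is not how Prometheus implements relabeling internally.

```python
from urllib.parse import urlencode


def blackbox_probe_url(target, module, exporter_addr):
    """Imitate the three relabel steps for a single static target."""
    labels = {"__address__": target, "__param_module": module}
    # Step 1: copy the configured target into the `target` URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Step 2: keep the probed address as the `instance` label.
    labels["instance"] = labels["__param_target"]
    # Step 3: send the actual scrape to blackbox_exporter instead.
    labels["__address__"] = exporter_addr
    query = urlencode(
        {"module": labels["__param_module"], "target": labels["__param_target"]}
    )
    return f"http://{labels['__address__']}/probe?{query}", labels["instance"]


url, instance = blackbox_probe_url(
    "192.168.21.120:3000", "tcp_connect", "127.0.0.1:9115"
)
print(url)
print(instance)
```

The scrape goes to the exporter, while the `instance` label still names the port being probed, so alerts and dashboards show the probed address.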
Alert rules
Scrape target down
groups:
  - name: example
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} has been down for more than 1 minute"
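How `for: 1m` interacts with a 15s evaluation interval can be sketched as a small state machine: the alert is pending while the expression has been true for less than the `for` duration, and firing once it has stayed true that long. This is a simplified model with a hypothetical function name; it ignores details such as state restoration after a restart.

```python
def alert_states(condition_by_eval, eval_interval_s=15, for_s=60):
    """Return the alert state after each rule evaluation."""
    states = []
    active_since = None
    for i, is_true in enumerate(condition_by_eval):
        now = i * eval_interval_s
        if not is_true:
            # Any evaluation where the expression is false resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


print(alert_states([True] * 5))
print(alert_states([True, True, False, True, True, True, True, True]))
```

With these settings a target has to be down for five consecutive evaluations before the alert reaches Alertmanager, and a single successful scrape in between restarts the wait.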
node_exporter alert rules
groups:
  - name: test
    rules:
      - alert: HighMemoryUsage
        expr: 100-(node_memory_Buffers_bytes+node_memory_Cached_bytes+node_memory_MemFree_bytes)/node_memory_MemTotal_bytes*100 > 30 # fires above 30%, although the description below says 80%
        for: 1m # how long the condition must hold before the alert is sent to alertmanager
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} memory usage is too high"
          description: "{{ $labels.instance }} of job {{$labels.job}}: memory usage is above 80%, current usage [{{ $value }}]."
      - alert: HighCpuUsage
        expr: 100-avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)*100 > 0 # fires above 0%, although the description below says 80%
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} CPU usage is too high"
          description: "{{ $labels.instance }} of job {{$labels.job}}: CPU usage is above 80%, current usage [{{ $value }}]."
          # Put as much alert detail as possible into the summary annotation, because the SMS/email/DingTalk notifications are built from its value.
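The arithmetic behind the two expressions can be checked with plain numbers. The sketch below mirrors the formulas only; the function names and sample values are made up for illustration.

```python
def memory_used_percent(total, free, buffers, cached):
    """100 - (Buffers + Cached + MemFree) / MemTotal * 100, as in the memory rule."""
    return 100 - (buffers + cached + free) / total * 100


def cpu_used_percent(idle_rates):
    """100 - avg(idle rate per core) * 100, as in the CPU rule.

    idle_rates holds, per core, the fraction of time spent idle (0..1).
    """
    return 100 - sum(idle_rates) / len(idle_rates) * 100


print(memory_used_percent(total=8000, free=1000, buffers=500, cached=2500))
print(cpu_used_percent([0.5, 0.75]))
```

Buffers and page cache are counted as available, so the memory figure reflects memory that applications actually hold.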
blackbox_exporter alert rules
groups:
  - name: site-status-alerts
    rules:
      - alert: docker_port # alert name
        expr: probe_success == 0
        for: 1h
        labels:
          status: critical
        annotations:
          summary: "{{$labels.instance}} is unreachable"
          description: "{{$labels.instance}} is unreachable"
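For the tcp_connect module, `probe_success` roughly answers one question: could a TCP connection to the target be opened before the timeout? A minimal standard-library sketch of that check, assuming a plain TCP probe without TLS or query/response steps; `tcp_probe` is a hypothetical helper, not part of blackbox_exporter.

```python
import socket


def tcp_probe(host, port, timeout_s=5.0):
    """Return 1 if a TCP connection can be opened within the timeout, else 0."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return 1
    except OSError:
        # Refused, unreachable and timed-out connections all count as failure.
        return 0
```

Calling `tcp_probe("192.168.21.120", 3000)` from the exporter host is a quick way to tell a monitoring misconfiguration from a port that is really closed.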
Deploying Prometheus
cd /home/suer/prometheus/prometheus-2.33
nohup ./prometheus --config.file=./prometheus.yml --log.level=debug --log.format=logfmt --web.enable-lifecycle &
Configure a systemd unit to start on boot
vi /usr/lib/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/introduction/overview/
[Service]
ExecStart=/home/suer/prometheus/prometheus-2.33/prometheus --config.file=/home/suer/prometheus/prometheus-2.33/prometheus.yml --log.level=debug --log.format=logfmt --web.enable-lifecycle --web.listen-address=:9090
[Install]
WantedBy=multi-user.target
systemctl daemon-reload
systemctl start prometheus
systemctl enable prometheus
# Hot reload of the configuration (requires --web.enable-lifecycle)
curl -XPOST http://192.168.21.120:9090/-/reload
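The same reload can be triggered from a script, for example after a configuration file has been regenerated. A minimal sketch using only the standard library; the function names are hypothetical, and it assumes Prometheus was started with `--web.enable-lifecycle` as above.

```python
import urllib.request


def build_reload_request(base_url):
    """Build the POST request for the /-/reload lifecycle endpoint."""
    return urllib.request.Request(base_url.rstrip("/") + "/-/reload", method="POST")


def reload_prometheus(base_url, timeout_s=10):
    """Ask Prometheus to reload its configuration; True on HTTP 200."""
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.status == 200
```

`reload_prometheus("http://192.168.21.120:9090")` raises `urllib.error.HTTPError` when the server answers with an error status, so a failed reload is not silently ignored.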
Deploying node_exporter
cd /home/suer/prometheus/node_exporter
nohup ./node_exporter --web.listen-address=":9222" --log.level="info" --log.format="logfmt" &
Configure a systemd unit to start on boot
vim /usr/lib/systemd/system/node_exporter.service
[Unit]
Description=node_exporter
After=network.target
[Service]
ExecStart=/home/suer/prometheus/node_exporter/node_exporter --web.listen-address=":9222" --log.level="info" --log.format="logfmt"
[Install]
WantedBy=multi-user.target
systemctl enable node_exporter.service
systemctl start node_exporter.service
Deploying blackbox_exporter
cd /home/suer/prometheus/blackbox_exporter-0.20.0
nohup ./blackbox_exporter --config.file=./blackbox.yml --web.listen-address=":9115" --log.level=debug > ./blackbox.out 2>&1 &
Deploying mongodb_exporter
nohup ./mongodb_exporter --mongodb.uri='mongodb://admin:[email protected]:27017/?authSource=admin' --compatible-mode --discovering-mode --web.listen-address=":9223" --log.level=debug > ./mongodb_exporter.out 2>&1 &
cAdvisor: collects resource usage and performance information for running containers
Deploy with Docker
docker run -d \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:ro \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--volume=/dev/disk/:/dev/disk:ro \
--publish=8080:8080 \
--detach=true \
--name=cadvisor \
google/cadvisor:latest
Download the binary: https://github.com/google/cadvisor/releases/latest
Run locally: ./cadvisor -port=8080 &>>/var/log/cadvisor.log
Huawei switches: "Monitoring Huawei network devices with snmp_exporter" (Jianshu)
Dell servers: "Monitoring Dell server hardware metrics with Prometheus" (屌丝的IT, cnblogs)
snmp.yml MIB configuration (module defined in generator.yml)
huawei_mib:
  walk:
    - sysUpTime
    - interfaces
    - ifXTable
    - sysDescr
    - sysName
    - 1.3.6.1.2.1.31.1.1.1.1
  ***
  version: 2
  auth:
    community: public_read
  lookups:
    - source_indexes: [ifIndex]
      lookup: ifAlias
    - source_indexes: [ifIndex]
      # Use OID to avoid conflict with PaloAlto PAN-COMMON-MIB.
      lookup: 1.3.6.1.2.1.2.2.1.2 # ifDescr
    - source_indexes: [ifIndex]
      # Use OID to avoid conflict with Netscaler NS-ROOT-MIB.
      lookup: 1.3.6.1.2.1.31.1.1.1.1 # ifName
  overrides:
    ifAlias:
      ignore: true # Lookup metric
    ifDescr:
      ignore: true # Lookup metric
    ifName:
      ignore: true # Lookup metric
    ifType:
      type: EnumAsInfo
Prometheus configuration
  - job_name: 'snmp_dell'
    scrape_interval: 1m # scrape interval; must not be shorter than scrape_timeout
    scrape_timeout: 1m # timeout; raised because snmp_exporter is slow to return data
    static_configs:
      - targets:
          - 10.1.0.1 # switch IP address
    metrics_path: /snmp
    params:
      module: [huawei_mib] # module name defined in the custom generator.yml
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.21.120:9116 # address of the snmp_exporter service
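Prometheus refuses to load a scrape job whose `scrape_timeout` is greater than its `scrape_interval`, so it helps to check the pair before raising the timeout for a slow SNMP device. A small sketch of that check; the helper names are hypothetical and the parser covers only simple durations such as `10s`, `1m` or `1m30s`.

```python
import re

_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(text):
    """Parse a simple Prometheus-style duration into seconds."""
    parts = re.findall(r"(\d+)(ms|s|m|h|d)", text)
    # Reject input with characters the pattern did not consume.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in parts)


def timeout_is_valid(scrape_interval, scrape_timeout):
    """True if the timeout does not exceed the interval."""
    return parse_duration(scrape_timeout) <= parse_duration(scrape_interval)


print(timeout_is_valid("10s", "1m"))
print(timeout_is_valid("1m", "1m"))
```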
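Configuring alertmanager.yml
global: # Global settings: timeout for treating an alert as resolved, SMTP settings, API URLs of the notification channels, and so on.
  resolve_timeout: 5m
  smtp_from: '[email protected]'
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'xxxxxxxxxxxxxxx'
  smtp_require_tls: false
  smtp_hello: 'qq.com'
route: # Alert routing policy: a tree that is matched depth-first, with sibling routes tried in the order listed.
  group_by: ['alertname'] # label(s) used to group alerts
  group_wait: 10s # wait 10s after the first alert of a new group so that alerts of the same group are sent together
  group_interval: 5s # wait before notifying about new alerts added to a group that has already been notified
  repeat_interval: 5m # interval before a still-firing alert is sent again; raise it to reduce repeated emails
  receiver: 'email' # default receiver
  routes: # child routes: which receiver gets which alerts
    - receiver: 'default-receiver'
      continue: true
      group_wait: 10s
    - receiver: 'fping-receiver'
      group_wait: 10s
      match_re: # alerts whose label dest matches szjf go to fping-receiver
        dest: szjf
receivers: # Notification targets, for example email, wechat, slack or webhook.
  - name: 'default-receiver'
    email_configs:
      - to: '[email protected]'
  - name: "fping-receiver"
    webhook_configs:
      - url: 'http://127.0.0.1:9095/dingtalk'
        send_resolved: true
  - name: 'email'
    #webhook_configs
    email_configs:
      - to: '[email protected]'
        send_resolved: true
inhibit_rules: # Inhibition: while an alert matching the source matchers is firing, alerts matching the target matchers are muted.
  - source_match: # the alert that, while firing, suppresses others
      severity: 'critical' # severity of the source alert
    target_match:
      severity: 'warning' # severity of the alerts to suppress
    equal: ['alertname', 'dev', 'instance'] # source and target must have equal values for these labels for the target to be suppressed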
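The inhibition rule above can be read as a predicate over pairs of alerts. The sketch below models it with plain dictionaries of labels; it is a simplified illustration with a hypothetical function name, not Alertmanager's code. A label missing on both sides is treated as equal, the same as an empty value.

```python
def is_inhibited(target, firing_alerts, source_match, target_match, equal):
    """Return True if `target` is muted by some firing source alert."""

    def matches(labels, matchers):
        return all(labels.get(name) == value for name, value in matchers.items())

    if not matches(target, target_match):
        return False
    for source in firing_alerts:
        if source is target or not matches(source, source_match):
            continue
        # All labels listed in `equal` must agree; a missing label counts as "".
        if all(source.get(name, "") == target.get(name, "") for name in equal):
            return True
    return False


rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}
critical = {"alertname": "DiskFull", "instance": "192.168.21.11:9222", "severity": "critical"}
warning = {"alertname": "DiskFull", "instance": "192.168.21.11:9222", "severity": "warning"}
other = {"alertname": "DiskFull", "instance": "localhost:9222", "severity": "warning"}

print(is_inhibited(warning, [critical], **rule))
print(is_inhibited(other, [critical], **rule))
```

Because `alertname` is part of `equal`, a critical alert only mutes a warning that carries the same alert name on the same instance.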
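Silences: a simple mechanism for muting alerts during a given time window
# QQ Mail: request an authorization code for third-party clients and use it as the SMTP password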
Installation
cd /home/suer/prometheus/alertmanager
nohup ./alertmanager --config.file=alertmanager.yml &
cd /home/suer/prometheus/grafana-8.3.6/bin
nohup ./grafana-server &
http://192.168.21.120:3000/
admin 123456
webhook (Python: fastapi, uvicorn)
# Start script
cd /home/suer && uvicorn alert:app --reload
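The notes do not show the contents of alert.py. As a rough sketch of what such a webhook has to do, the function below turns the JSON body that Alertmanager posts to a webhook receiver (top-level `status` and an `alerts` list with `labels` and `annotations`) into a short text message. The function name and message layout are assumptions; in a FastAPI app it could be called from the POST route that receives the payload.

```python
def format_alerts(payload):
    """Render an Alertmanager webhook payload as a plain-text message."""
    alerts = payload.get("alerts", [])
    lines = [f"[{payload.get('status', 'unknown')}] {len(alerts)} alert(s)"]
    for alert in alerts:
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            f"{alert.get('status', '')} {labels.get('alertname', '')} "
            f"{labels.get('instance', '')}: {annotations.get('summary', '')}"
        )
    return "\n".join(lines)


example = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "InstanceDown", "instance": "192.168.21.11:9222"},
            "annotations": {"summary": "Instance 192.168.21.11:9222 is down"},
        }
    ],
}
print(format_alerts(example))
```

Only the `summary` annotation is used here, which is why the rule files should put the useful detail there.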
ansible
ansible docker_node -m shell -a 'docker restart cadvisor'
sudo firewall-cmd --zone=public --add-port=9093/tcp --permanent
sudo firewall-cmd --reload
sudo firewall-cmd --zone=public --list-ports
Grafana dashboard IDs:
node_exporter dashboard: 8919
blackbox_exporter dashboard: 9965
cadvisor dashboard: 193
Kong for prometheus: 7424
mongodb: 14997
mysql: 14057
Test environment:
Grafana:
http://192.168.21.120:3000/
admin 123456
Alerts:
http://192.168.21.120:9093/#/alerts
Prometheus:
http://192.168.21.120:9090/