Prometheus and Alertmanager

Download the binary and the Docker image

  • Prometheus-binary
  • Docker-image

How to get metrics from target jobs

  • timeseries collection happens via a pull model over HTTP
  • pushing timeseries is supported via an intermediary gateway
  • targets are discovered via service discovery or static configuration
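
To make the pull model more concrete: each target exposes its current metric values as plain text over HTTP (conventionally at /metrics), and the server fetches and parses that text on every scrape. Below is a minimal sketch of the parsing half only. The function name parse_exposition and the sample payload are made up for illustration, and the real exposition format has more cases (escaped label values, timestamps, HELP/TYPE metadata) than this simplified parser handles.

```python
import re

# Matches lines such as: http_requests_total{method="get",code="200"} 1027
# Simplified: no escaped quotes in label values, no trailing timestamp.
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_exposition(text):
    """Parse a scrape payload into (metric_name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical payload, shaped like what a target might return.
payload = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
http_requests_total{method="post",code="400"} 3
process_open_fds 12
"""

for name, labels, value in parse_exposition(payload):
    print(name, labels, value)
```

Each tuple corresponds to one sample of one time series; the server attaches the scrape timestamp and target labels (such as job and instance) when it stores them.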

How to run Prometheus in a Docker container?

Pay attention to the permissions of the volume directories in newer Prometheus versions, because the base image used to build the prom/prometheus image has changed.
See the Dockerfile on hub.docker.com for details, as shown below.

FROM        quay.io/prometheus/busybox:latest
MAINTAINER  The Prometheus Authors 

COPY prometheus                             /bin/prometheus
COPY promtool                               /bin/promtool
COPY documentation/examples/prometheus.yml  /etc/prometheus/prometheus.yml
COPY console_libraries/                     /etc/prometheus/
COPY consoles/                              /etc/prometheus/

EXPOSE     9090
VOLUME     [ "/prometheus" ]
WORKDIR    /prometheus
ENTRYPOINT [ "/bin/prometheus" ]
CMD        [ "-config.file=/etc/prometheus/prometheus.yml", \
             "-storage.local.path=/prometheus", \
             "-web.console.libraries=/etc/prometheus/console_libraries", \
             "-web.console.templates=/etc/prometheus/consoles" ]

Run Prometheus v2.4 in a Docker container.

# Host directories for the configuration and the metric data
configure_file=/apps/prometheus/conf
prometheus_data=/data/prometheus
# The container process runs as "nobody", so that user must own both directories
chown -R nobody:nogroup "${prometheus_data}"
chown -R nobody:nogroup "${configure_file}"
docker run -d --name prometheus --restart=always -v "${configure_file}":/etc/prometheus/ -v "${prometheus_data}":/prometheus -p 9090:9090 prom/prometheus:latest --config.file=/etc/prometheus/prometheus.yml

Running Alertmanager

docker run  -d --name alertmanager --restart=always -p 10.1.100.231:9093:9093 -v /apps/alertmanager:/etc/alertmanager -v /data/alertmanager:/alertmanager prom/alertmanager:latest --config.file=/etc/alertmanager/config.yml

How to synchronize metric data between Prometheus servers?

???

How to apply new configuration files?

  • send SIGHUP to the Prometheus process
  • send an HTTP POST request to the /-/reload endpoint
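
Both reload methods can be driven from a script. The sketch below uses only the Python standard library; the helper names and the address in the usage comment are placeholders. As far as I know, Prometheus 2.x only serves /-/reload when it is started with --web.enable-lifecycle, so verify that prerequisite for your version.

```python
import os
import signal
import urllib.request


def reload_by_signal(pid):
    """Ask the process with the given PID to re-read its configuration."""
    os.kill(pid, signal.SIGHUP)


def build_reload_request(base_url):
    """Build (but do not send) the HTTP POST request for the reload endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_by_http(base_url, timeout=5.0):
    """Send the reload request and return the HTTP status code."""
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Usage (placeholder address; needs a running server):
#   reload_by_http("http://localhost:9090")
```

A plain GET (for example, opening the URL in a browser) is not enough; the request has to be a POST.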

How to specify a set of targets, via static_configs or dynamic service discovery?

Regular expressions

All regular expressions in Prometheus use RE2 syntax.

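How to upgrade the Prometheus server version?

Notes

  • Prometheus command-line flags differ between versions. For example, the Dockerfile above uses single-dash flags such as -config.file, while the v2.4 docker run command uses --config.file.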

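When alerting with Prometheus, the server evaluates its alerting rules and sends the resulting alerts to Alertmanager, which then manages them. Alertmanager usually delivers notifications through one of the following channels: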

  • Email
  • PagerDuty
  • webhook
  • Slack
  • OpsGenie

Steps to set up alerting and notifications

  • Set up and configure Alertmanager
  • Configure Prometheus to talk to the Alertmanager API
  • Create alerting rules in Prometheus

Advantages of Prometheus

  • See the article by CloudMan, "Prometheus到底NB在哪里?" (roughly, "What exactly makes Prometheus so great?")

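Alertmanager configuration

Alertmanager is configured through command-line flags and a configuration file. The command-line flags set fixed system parameters, while the configuration file defines notification routing and the notification receivers.

The visual editor can help you build the routing tree.

Run alertmanager -h to list the available command-line flags.

Alertmanager can reload its configuration file while the process is running. If the new configuration is not well-formed, the changes are not applied and an error is logged. A reload only happens when you send SIGHUP to the running process or send an HTTP POST request to /-/reload.

A route block defines a node in the routing tree and its child nodes. Optional configuration parameters that are not set are inherited from the parent node. Every alert enters the routing tree at the configured top-level route, which must match all alerts (that is, it has no configured matchers). The alert then traverses the child nodes. If continue is set to false, traversal stops after the first matching child. If continue is true on a matching node, the alert continues to be matched against subsequent siblings. If an alert matches none of a node's children (no child matches, or none exist), it is handled according to the configuration parameters of the current node.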

# The root route with all parameters, which are inherited by the child
# routes if they are not overwritten.
route:
  receiver: 'default-receiver'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [cluster, alertname]
  # All alerts that do not match the following child routes
  # will remain at the root node and be dispatched to 'default-receiver'.
  routes:
  # All alerts with service=mysql or service=cassandra
  # are dispatched to the database pager.
  - receiver: 'database-pager'
    group_wait: 10s
    match_re:
      service: mysql|cassandra
  # All alerts with the team=frontend label match this sub-route.
  # They are grouped by product and environment rather than cluster
  # and alertname.
  - receiver: 'frontend-pager'
    group_by: [product, environment]
    match:
      team: frontend
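
The traversal rules described above can be sketched in a few lines. This is a simplified model, not Alertmanager's actual implementation: the Route class and its field names are invented for illustration, only receiver inheritance is modelled (not group_by or the timing parameters), and match_re patterns are treated as fully anchored, emulated here with Python's re.fullmatch, which is not RE2 but behaves the same for these simple patterns.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Route:
    receiver: Optional[str] = None                          # None means "inherit from parent"
    match: Dict[str, str] = field(default_factory=dict)     # exact label matches
    match_re: Dict[str, str] = field(default_factory=dict)  # anchored regex matches
    continue_: bool = False                                 # keep trying later siblings?
    routes: List["Route"] = field(default_factory=list)

    def matches(self, labels):
        exact = all(labels.get(k) == v for k, v in self.match.items())
        regex = all(re.fullmatch(p, labels.get(k, "")) for k, p in self.match_re.items())
        return exact and regex


def resolve(route, labels, inherited=None):
    """Return the receivers that an alert with these labels is dispatched to."""
    receiver = route.receiver or inherited
    found = []
    for child in route.routes:
        if child.matches(labels):
            found.extend(resolve(child, labels, receiver))
            if not child.continue_:
                break  # stop after the first matching child
    # No child matched: the alert is handled by the current node.
    return found or [receiver]


# Mirrors the example configuration above.
root = Route(
    receiver="default-receiver",
    routes=[
        Route(receiver="database-pager", match_re={"service": "mysql|cassandra"}),
        Route(receiver="frontend-pager", match={"team": "frontend"}),
    ],
)

print(resolve(root, {"service": "mysql"}))   # ['database-pager']
print(resolve(root, {"team": "frontend"}))   # ['frontend-pager']
print(resolve(root, {"service": "mysqld"}))  # ['default-receiver']
```

The last call falls through to the root because the anchored pattern mysql|cassandra does not match "mysqld".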

Configuration for some common receivers (see https://prometheus.io/docs/alerting/configuration/):

  • slack
  • webhook
  • wechat
  • pagerduty

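Disclaimer: Prometheus automatically takes care of sending the alerts generated by its configured alerting rules. It is strongly recommended to configure alerting rules in Prometheus based on time series data rather than implementing a direct client.

Notification templates

Prometheus sends alerts to Alertmanager, and Alertmanager sends notifications to receivers. The notification templates can be customized, or you can use the default templates that ship with Alertmanager. The templates are based on the Go templating system.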

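Prometheus query expressions

Prometheus query functions: rate() vs irate()?

  • rate(): the average per-second rate of increase over a time range. rate() should only be used with counters. It is best suited to alerting and to graphing slow-moving counters.
  • irate(): the instant per-second rate of increase within a time range, calculated from the last two data points. It is best suited to graphing volatile, fast-moving counters. Breaks in monotonicity (such as counter resets caused by a target restart) are adjusted for automatically.
# Per-second rate of HTTP requests, averaged over the last 5 minutes
rate(http_requests_total{job="api-server"}[5m])
# Per-second rate of HTTP requests, based on the two most recent data points within the last 5 minutes
irate(http_requests_total{job="api-server"}[5m])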
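
To see how the two functions differ, the sketch below recomputes them by hand on a small list of (timestamp, value) samples. It is a simplification under stated assumptions: the function names and the sample data are invented, and Prometheus's real rate() also extrapolates to the window boundaries, which is left out here.

```python
def counter_increase(samples):
    """Total increase across samples, treating any drop as a counter reset."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the whole
        # current value counts as increase.
        total += curr - prev if curr >= prev else curr
    return total


def simple_rate(samples):
    """Average per-second increase over the whole window (no extrapolation)."""
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed


def simple_irate(samples):
    """Per-second increase between the last two samples only."""
    return simple_rate(samples[-2:])


# (timestamp_seconds, counter_value) pairs scraped every 60 s; the counter
# grows slowly and then jumps in the final interval.
window = [(0, 100), (60, 160), (120, 220), (180, 280), (240, 340), (300, 1000)]

print(simple_rate(window))   # (1000 - 100) / 300 s = 3.0
print(simple_irate(window))  # (1000 - 340) / 60 s = 11.0
```

The averaged value smooths out the final burst, which is why rate() is the steadier choice for alert conditions, while the two-point value reacts to it immediately.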

HTTP API

Reload the Prometheus configuration file (send an HTTP POST request to this URL)

http://192.168.20.161:9090/-/reload
