Prometheus and Alertmanager
Download the binary and the Docker image
- Prometheus binary
- Docker image
How to get metrics from target jobs
- time series collection happens via a pull model over HTTP
- pushing time series is supported via an intermediary gateway
- targets are discovered via service discovery or static configuration
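The push path mentioned in the list can be sketched as follows. This is a minimal sketch, assuming the intermediary gateway is a Prometheus Pushgateway; the function name push_metric, the address pushgateway.example.org:9091, and the metric name are hypothetical and only serve as illustration.

```shell
# Push one sample to a Pushgateway, which Prometheus then scrapes like any other target.
# Usage: push_metric <pushgateway_url> <job> <metric_name> <value>
push_metric() {
  gateway_url=$1; job=$2; metric=$3; value=$4
  # The Pushgateway accepts the text exposition format on /metrics/job/<job>.
  printf '%s %s\n' "$metric" "$value" |
    curl --silent --show-error --fail --data-binary @- "${gateway_url}/metrics/job/${job}"
}

# Example call (hypothetical address and metric name):
# push_metric http://pushgateway.example.org:9091 nightly_backup backup_duration_seconds 42
```

Pushing is mainly useful for short-lived batch jobs that may exit before a scrape happens; long-running services are normally scraped directly.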
How to run Prometheus in a Docker container?
Pay attention to the ownership of the volume directories with newer Prometheus versions, because the base image used to build the prom/prometheus image has changed.
See the Dockerfile on hub.docker.com for details, reproduced below.
FROM quay.io/prometheus/busybox:latest
MAINTAINER The Prometheus Authors
COPY prometheus /bin/prometheus
COPY promtool /bin/promtool
COPY documentation/examples/prometheus.yml /etc/prometheus/prometheus.yml
COPY console_libraries/ /etc/prometheus/
COPY consoles/ /etc/prometheus/
EXPOSE 9090
VOLUME [ "/prometheus" ]
WORKDIR /prometheus
ENTRYPOINT [ "/bin/prometheus" ]
CMD [ "-config.file=/etc/prometheus/prometheus.yml", \
"-storage.local.path=/prometheus", \
"-web.console.libraries=/etc/prometheus/console_libraries", \
"-web.console.templates=/etc/prometheus/consoles" ]
Run Prometheus v2.4 in a Docker container:
# Host directories for configuration and data.
configure_file=/apps/prometheus/conf
prometheus_data=/data/prometheus
# The container process runs as the unprivileged user nobody, so it must own both directories.
chown -R nobody:nogroup "${prometheus_data}"
chown -R nobody:nogroup "${configure_file}"
docker run -d --name prometheus --restart=always -v "${configure_file}":/etc/prometheus/ -v "${prometheus_data}":/prometheus -p 9090:9090 prom/prometheus:latest --config.file=/etc/prometheus/prometheus.yml
Run Alertmanager:
docker run -d --name alertmanager --restart=always -p 10.1.100.231:9093:9093 -v /apps/alertmanager:/etc/alertmanager -v /data/alertmanager:/alertmanager prom/alertmanager:latest --config.file=/etc/alertmanager/config.yml
How to synchronize metric data between Prometheus servers?
???
How to apply new configuration files?
- send SIGHUP to the Prometheus process
- send an HTTP POST request to the /-/reload endpoint
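Both reload options can be wrapped in small helpers. This is a sketch under a few assumptions: the function names are made up, the container is named prometheus as in the docker run command in these notes, and the default URL http://localhost:9090 is only a placeholder. In Prometheus 2.x the HTTP endpoint only responds when the server was started with --web.enable-lifecycle.

```shell
# Option 1: deliver SIGHUP to the Prometheus process running in a container.
# Usage: reload_by_signal [container_name]
reload_by_signal() {
  docker kill --signal=HUP "${1:-prometheus}"
}

# Option 2: HTTP POST to the /-/reload endpoint.
# Usage: reload_by_http [base_url]
reload_by_http() {
  curl --silent --show-error --fail -X POST "${1:-http://localhost:9090}/-/reload"
}

# Example calls:
# reload_by_signal prometheus
# reload_by_http http://192.168.20.161:9090
```

For a Prometheus process running directly on the host, kill -HUP with the process ID does the same as the first helper.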
How to specify a set of targets with static_configs or discover them dynamically?
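As a rough illustration of both approaches, the snippet below writes an example scrape configuration to a file. The file name, job names, target addresses, and the targets directory are all hypothetical; file-based discovery is used here as one example of dynamic discovery, and other mechanisms exist.

```shell
# Write an example scrape configuration to ./prometheus.example.yml.
cat > prometheus.example.yml <<'EOF'
global:
  scrape_interval: 15s

scrape_configs:
  # Statically listed targets.
  - job_name: 'node'
    static_configs:
      - targets: ['10.0.0.11:9100', '10.0.0.12:9100']
        labels:
          env: 'prod'

  # Targets discovered from files; Prometheus re-reads them when they change.
  - job_name: 'file-discovered'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'
        refresh_interval: 5m
EOF
```

Static targets only change when the configuration is reloaded, while the file-based job picks up edits to the target files without a reload.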
Regular expressions
All regular expressions in Prometheus use RE2 syntax.
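How to upgrade the Prometheus server version?
Things to watch out for
- Command-line flags differ between Prometheus versions (for example, 1.x uses single-dash flags such as -storage.local.path, while 2.x uses double-dash flags such as --storage.tsdb.path), so check the flags before upgrading.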
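When alerting with Prometheus, Prometheus evaluates the alerting rules and sends the resulting alerts to Alertmanager, which then manages them. Alertmanager typically sends notifications through one of the following: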
- PagerDuty
- webhook
- Slack
- OpsGenie
Steps to set up alerting and notifications
- Set up and configure Alertmanager
- Configure Prometheus to talk to the Alertmanager API
- Create alerting rules in Prometheus
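The second and third steps might look like the sketch below, which writes two example files in the Prometheus 2.x format. The file names, the rules directory, the group name, and the InstanceDown rule are illustrative assumptions; the Alertmanager address is the one used in the docker run command in these notes.

```shell
# Step 2: tell Prometheus where Alertmanager listens and which rule files to load.
# This fragment belongs in prometheus.yml.
cat > prometheus.alerting.example.yml <<'EOF'
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['10.1.100.231:9093']

rule_files:
  - '/etc/prometheus/rules/*.yml'
EOF

# Step 3: an alerting rule that fires when a target has been down for 5 minutes.
cat > instance_down.rules.example.yml <<'EOF'
groups:
  - name: example
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: 'Instance {{ $labels.instance }} is down'
EOF
```

The labels set on the rule (severity here) are what the Alertmanager routing tree can later match on. Rule files can be validated with promtool check rules before reloading.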
Advantages of Prometheus
- See CloudMan's article "Prometheus到底NB在哪里?" (roughly, "What makes Prometheus so great?")
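Alertmanager configuration
Alertmanager is configured through command-line flags and a configuration file. Settings made through command-line flags are fixed for the lifetime of the process, while the configuration file defines notification routing and notification receivers.
The visual editor can help you build the routing tree.
Run alertmanager -h to list the available command-line flags.
Alertmanager can reload its configuration file at runtime. If the new configuration is not well-formed, the changes are not applied and an error is logged. A reload is triggered only by sending SIGHUP to the running process or by sending an HTTP POST request to the /-/reload endpoint.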
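A route block defines a node in the routing tree and its children. Optional configuration parameters that are not set are inherited from the parent node. Every alert enters the routing tree at the configured top-level route, which must match all alerts (that is, it has no configured matchers). The alert then traverses the child nodes. If continue is set to false, traversal stops after the first matching child. If continue is true on a matching node, the alert continues to be matched against subsequent siblings. If an alert matches none of a node's children (no child matches, or there are none), it is handled according to the configuration parameters of the current node.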
# The root route with all parameters, which are inherited by the child
# routes if they are not overwritten.
route:
  receiver: 'default-receiver'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [cluster, alertname]
  # All alerts that do not match the following child routes
  # will remain at the root node and be dispatched to 'default-receiver'.
  routes:
  # All alerts with service=mysql or service=cassandra
  # are dispatched to the database pager.
  - receiver: 'database-pager'
    group_wait: 10s
    match_re:
      service: mysql|cassandra
  # All alerts with the team=frontend label match this sub-route.
  # They are grouped by product and environment rather than cluster
  # and alertname.
  - receiver: 'frontend-pager'
    group_by: [product, environment]
    match:
      team: frontend
Common receiver configurations (see https://prometheus.io/docs/alerting/configuration/):
- slack
- webhook
- pagerduty
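For orientation, here is a sketch of a receivers section covering the three receiver types listed above. The receiver names match the routing tree in these notes; the file name, the webhook URL, the Slack hook URL, the channel, and the PagerDuty key are placeholders, not real values.

```shell
# Write an example receivers section to ./alertmanager.receivers.example.yml.
cat > alertmanager.receivers.example.yml <<'EOF'
receivers:
  # Generic webhook: Alertmanager POSTs a JSON payload to this URL.
  - name: 'default-receiver'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/alerts'
  # Slack incoming webhook.
  - name: 'frontend-pager'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#frontend-alerts'
  # PagerDuty integration key.
  - name: 'database-pager'
    pagerduty_configs:
      - service_key: '<pagerduty-integration-key>'
EOF
```

Every receiver name referenced by a route must be defined in this section, otherwise Alertmanager rejects the configuration.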
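Disclaimer: Prometheus automatically takes care of sending the alerts generated by its configured alerting rules. It is strongly recommended to configure alerting rules in Prometheus based on time series data rather than implementing a direct client.
Notification templates
Prometheus sends alerts to Alertmanager, and Alertmanager sends notifications to receivers. The templates used for those notifications can be customized, or the built-in default templates can be used. The templates are based on the Go templating system.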
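Prometheus query expressions
Prometheus query functions: rate() vs irate()?
- rate(): the average per-second rate of increase over a time range. It should only be used with counters, and is best suited for alerting and for graphing slow-moving counters.
- irate(): the instantaneous per-second rate of increase within a time range, computed from the last two data points. Breaks in monotonicity (such as counter resets caused by a target restart) are adjusted for automatically.
# Per-second rate of HTTP requests, averaged over the last 5 minutes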
rate(http_requests_total{job="api-server"}[5m])
# Per-second rate of HTTP requests, computed from the two most recent data points within the last 5 minutes
irate(http_requests_total{job="api-server"}[5m])
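The difference between the two functions can be seen with a little arithmetic. The sketch below uses made-up counter samples and a simplified model: it ignores the extrapolation to the window boundaries and the counter-reset handling that Prometheus applies, so it illustrates the idea rather than reproducing Prometheus's exact results.

```shell
# Hypothetical samples "<seconds> <counter_value>", one scrape every 60s.
# Traffic is steady at 1 request/s, then doubles in the last interval.
samples='0 100
60 160
120 220
180 280
240 400'

# Simplified rate(): (last value - first value) / (last time - first time).
simple_rate=$(printf '%s\n' "$samples" |
  awk 'NR==1 {t0=$1; v0=$2} {t=$1; v=$2} END {printf "%.2f", (v-v0)/(t-t0)}')

# Simplified irate(): the same formula applied to the last two samples only.
simple_irate=$(printf '%s\n' "$samples" |
  awk '{pt=t; pv=v; t=$1; v=$2} END {printf "%.2f", (v-pv)/(t-pt)}')

echo "rate=$simple_rate irate=$simple_irate"
```

This prints rate=1.25 irate=2.00: the averaged rate smooths out the final burst, while the instant rate reflects only the most recent interval. That is why rate() suits alerting and irate() reacts faster on graphs.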
HTTP API
Reload the Prometheus configuration file by sending an HTTP POST request to:
http://192.168.20.161:9090/-/reload
Reference
- Unofficial Prometheus manual