Prometheus download page | Prometheus related documentation | Prometheus official documentation
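Prometheus periodically scrapes the state of monitored components over HTTP, so any component can be monitored as long as it exposes a suitable HTTP endpoint. The HTTP endpoint that outputs a monitored component's information is called an exporter, which is the data collection side. In practice the exporter is usually the part that needs the most integration work. Many mature exporters are already available online, and users can also write their own exporter with the officially provided SDKs.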
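To make the exporter idea concrete, here is a minimal sketch of one written with only the Python standard library rather than an official SDK. The metric names, the `METRICS` dictionary and port 8000 are illustrative assumptions; a real exporter would read its values from the monitored component.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory metrics; a real exporter would query the monitored component.
METRICS = {"demo_requests_total": 3, "demo_queue_length": 7}


def render_metrics(metrics):
    """Render name/value pairs as plain "name value" lines, one sample per line."""
    lines = [f"{name} {value}" for name, value in sorted(metrics.items())]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the rendered metrics on GET /metrics and 404 everywhere else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block and serve /metrics on the given port (8000 is an arbitrary choice)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


# Show what a scrape of /metrics would return.
print(render_metrics(METRICS), end="")
```

Calling `serve()` starts the endpoint; Prometheus could then scrape it like any other target by listing `host:8000` under `targets`.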
Note: Prometheus time series data is divided into four metric types: Counter, Gauge, Histogram, and Summary.
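Data collection and storage (TSDB), with support for the PromQL query language.
Prometheus Daemon: periodically scrapes metrics from its targets. Each scrape target must expose an HTTP endpoint that the server fetches on a schedule. Targets can be specified through the configuration file, text files, Zookeeper, Consul, or DNS SRV lookup. Long-lived services are scraped periodically in pull mode; short-lived jobs push their data actively through the push-gateway.
Prometheus: stores all scraped data locally, cleans up and aggregates it according to rules, and writes the results into new time series.
Grafana: Prometheus can act as a Grafana data source for rendering charts, and the data can also be served to other consumers through the API.
PushGateway: lets clients actively push metrics to the push-gateway (which behaves like a resident exporter service); Prometheus then scrapes the push-gateway periodically.
Alertmanager: a component independent of Prometheus. Alerting rules are written as PromQL expressions on the Prometheus side, and Alertmanager provides flexible alert notification.
Download page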
Installing from the release tarballs
# The three components
$ wget https://github.com/prometheus/node_exporter/releases/download/v0.18.1/node_exporter-0.18.1.linux-amd64.tar.gz -O node_exporter-0.18.1.tar.gz
$ wget https://github.com/prometheus/prometheus/releases/download/v2.10.0/prometheus-2.10.0.linux-amd64.tar.gz -O prometheus-2.10.0.linux-amd64.tar.gz
$ wget https://github.com/prometheus/pushgateway/releases/download/v0.8.0/pushgateway-0.8.0.linux-amd64.tar.gz -O pushgateway-0.8.0.linux-amd64.tar.gz
Installing with Docker
Note: Prometheus configuration files are written in YAML by default.
# Write the default Prometheus configuration file
$ cat prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  # - "first.rules"
  # - "second.rules"
scrape_configs:
  # Adds the base labels job="prometheus" and instance="localhost:9090" to every scraped metric
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
# Start the Prometheus service with hot reload enabled
$ docker run --name=prometheus -d -p 9090:9090 -v /Users/xuxuebiao/Desktop/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus --config.file=/etc/prometheus/prometheus.yml --web.enable-lifecycle
$ docker ps -l
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5e1698a320b2 prom/prometheus "/bin/prometheus -..." 3 days ago Up 2 minutes 0.0.0.0:9090->9090/tcp prometheus
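Note:
Prometheus is written in Go, so it ships as a single binary. Use --config.file to specify the configuration file and --web.enable-lifecycle to enable reloading the configuration remotely over HTTP. The reload is triggered with curl -X POST http://localhost:9090/-/reload
The Prometheus web UI is now reachable and shows the Prometheus status pages. With the configuration above, Prometheus scrapes the HTTP metrics it exposes about itself every 15s (the configured scrape_interval). These metrics are served at http://localhost:9090/metrics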
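The reload call can also be issued from a script. The following is a small sketch using only the Python standard library; the function names are hypothetical, and it assumes Prometheus was started with --web.enable-lifecycle and listens on localhost:9090 as in the command above.

```python
import urllib.request


def build_reload_request(base_url):
    """Build the POST request for the /-/reload lifecycle endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url="http://localhost:9090", timeout=5):
    """Send the reload request and return the HTTP status code of the reply."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status


# Building the request does not contact the server; reload_prometheus() does.
req = build_reload_request("http://localhost:9090")
print(req.get_method(), req.full_url)
```

Separating request construction from sending keeps the sketch easy to check without a running server.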
Providing metrics with node exporter
# Start node-exporter
$ docker run -d --name=node-exporter -p 9100:9100 prom/node-exporter
$ docker ps -l
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0f60bcce1ea6 prom/node-exporter "/bin/node_exporter" 3 days ago Up 42 seconds 0.0.0.0:9100->9100/tcp node-exporter
# View the metrics exposed by the service
$ curl http://localhost:9100/metrics
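The /metrics response is plain text, one sample per line, with optional labels in braces. As an illustration, here is a simplified parser sketch. The sample text is made up for the example (it is not captured node-exporter output), and the parser ignores timestamps and other details of the full format.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Illustrative input only; the values are invented.
SAMPLE_TEXT = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.5
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
"""


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comments and blanks."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample line this simplified parser understands
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


for name, labels, value in parse_metrics(SAMPLE_TEXT):
    print(name, labels, value)
```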
# Add the target to the Prometheus configuration and reload Prometheus
$ cat prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  # - "first.rules"
  # - "second.rules"
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
      # Add a target and attach a label to mark its metrics
      # Note: Prometheus defaults to scheme http and metrics_path /metrics, so the target needs neither the protocol nor the URI
      - targets: ['10.13.13.60:9100']
        labels:
          group: "client-node-exporter"
# Reload the Prometheus service
$ curl -X POST http://localhost:9090/-/reload
Note: querying several series at once requires some familiarity with writing Prometheus expressions (PromQL).
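Expressions can be tried in the web UI or sent to the HTTP API. The sketch below builds an instant-query URL for /api/v1/query with the Python standard library. The helper names are hypothetical, and the example expression assumes the `group` label from the configuration above.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, expr):
    """Build the instant-query URL, URL-encoding the PromQL expression."""
    query = urllib.parse.urlencode({"query": expr})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def instant_query(base_url, expr, timeout=5):
    """Run the query against a live server and return the list of result series."""
    with urllib.request.urlopen(build_query_url(base_url, expr), timeout=timeout) as resp:
        return json.load(resp)["data"]["result"]


# 'up' is the per-target health series that Prometheus records for every scrape.
url = build_query_url("http://localhost:9090", 'up{group="client-node-exporter"}')
print(url)
```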
Note: once the push-gateway service is running, its endpoint must also be added to Prometheus.
# Start the push-gateway service
$ docker run -d -p 9091:9091 --name pushgateway prom/pushgateway
# Check the push-gateway service
$ curl localhost:9091
# Test pushing one metric to the push-gateway
$ echo "tps 100" | curl --data-binary @- http://localhost:9091/metrics/job/xxb
# Push multiple metrics
cat <
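The multi-metric command above is truncated in the source and is not reconstructed here. As a separate illustration of the same idea, this Python sketch builds one request body with several samples and posts it to the same /metrics/job/xxb path. The `demo_latency_ms` metric and the helper names are assumptions made for the example.

```python
import urllib.request


def build_payload(metrics):
    """One "name value" line per metric; the body must end with a newline."""
    return "".join(f"{name} {value}\n" for name, value in metrics.items())


def build_push_request(base_url, job, metrics):
    """Build the POST request for the push-gateway job path."""
    url = f"{base_url.rstrip('/')}/metrics/job/{job}"
    body = build_payload(metrics).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def push(base_url, job, metrics, timeout=5):
    """Send the metrics to a running push-gateway and return the HTTP status code."""
    with urllib.request.urlopen(build_push_request(base_url, job, metrics), timeout=timeout) as resp:
        return resp.status


# Building the request does not contact the server; push() does.
req = build_push_request("http://localhost:9091", "xxb", {"tps": 100, "demo_latency_ms": 12})
print(req.full_url)
print(req.data.decode("utf-8"), end="")
```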
# Create the Grafana service
$ docker run -d -p 3000:3000 --name grafana grafana/grafana
$ curl localhost:3000
Add the data source and verify some basic data
# Push data to the Prometheus push-gateway to simulate metric reporting
$ echo "tps 10" | curl --data-binary @- http://localhost:9091/metrics/job/xxb
$ echo "tps 9" | curl --data-binary @- http://localhost:9091/metrics/job/xxb
$ echo "tps 20" | curl --data-binary @- http://localhost:9091/metrics/job/xxb
$ echo "tps 30" | curl --data-binary @- http://localhost:9091/metrics/job/xxb
$ echo "tps 310" | curl --data-binary @- http://localhost:9091/metrics/job/xxb
$ echo "tps 222" | curl --data-binary @- http://localhost:9091/metrics/job/xxb
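Alerting in Prometheus consists of two independent parts: alerting rules evaluated by the Prometheus server, and the Alertmanager that handles the resulting alerts.
The basic steps for setting up alerting and notifications:
Start the Alertmanager service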
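# Edit the Alertmanager configuration file
$ cat alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['cqh']
  group_wait: 10s       # how long to wait before sending the first notification for a new group
  group_interval: 10s   # how long to wait before notifying about new alerts added to a group
  repeat_interval: 1m   # how long to wait before re-sending a notification that was already sent
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://10.13.118.71:8889/open/test'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']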
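The inhibit_rules entry mutes a warning alert while a critical alert with the same alertname, dev and instance labels is firing. The sketch below models that matching logic in Python to show the idea. It is a simplified model rather than Alertmanager's implementation, and the alert name `HighTps` is a made-up example.

```python
def matches(labels, matcher):
    """True when every label in the matcher has exactly the given value."""
    return all(labels.get(name) == value for name, value in matcher.items())


def is_inhibited(target, firing_alerts, rule):
    """True when some firing source alert suppresses the target alert under the rule."""
    if not matches(target, rule["target_match"]):
        return False
    for source in firing_alerts:
        if source is target or not matches(source, rule["source_match"]):
            continue
        # Labels listed in "equal" must agree; a missing label counts as empty.
        if all(source.get(name, "") == target.get(name, "") for name in rule["equal"]):
            return True
    return False


RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}
critical = {"alertname": "HighTps", "instance": "10.13.118.71:9091", "severity": "critical"}
warning = {"alertname": "HighTps", "instance": "10.13.118.71:9091", "severity": "warning"}
other = {"alertname": "HighTps", "instance": "10.13.118.71:9100", "severity": "warning"}

firing = [critical, warning, other]
print(is_inhibited(warning, firing, RULE))  # True: same alertname and instance as the critical alert
print(is_inhibited(other, firing, RULE))    # False: the instance label differs
```

Note that the rule only matches alerts carrying a `severity` label, so alerts labelled differently are never inhibited by it.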
# Start the service
$ docker run -d -p 9093:9093 --name alertmanager -v /Users/xuxuebiao/Desktop/alertmanager.yml:/etc/alertmanager/alertmanager.yml prom/alertmanager
$ docker ps -l
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
c6ba74bfd03b prom/alertmanager "/bin/alertmanager..." 3 days ago Up 7 seconds 0.0.0.0:9093->9093/tcp alertmanager
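Configure the Alertmanager service in Prometheus
# Edit the rule configuration
$ cat rules.yml
groups:
  # Send a warning alert when tps stays above 150 for 10s
  - name: bgbiao
    rules:
      - alert: bgbiao测试
        expr: tps > 150
        for: 10s
        labels:
          status: warning
        annotations:
          # summary text: "tps exceeded the threshold of 150."
          summary: "{{$labels.instance}}:tps 超过阈值150."
          # description text: "tps exceeded the threshold! Current value: ..."
          description: "{{$labels.instance}}:tps 超过阈值!. 当前值: {{ $value }}"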
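The `for: 10s` clause means an alert does not fire on the first evaluation where `tps > 150` holds. It is first pending, and becomes firing only once the condition has held for at least the given duration across evaluations. A simplified Python model of that state machine follows, assuming evaluations every 15 seconds as configured by evaluation_interval. The function name and sample values are illustrative.

```python
def alert_states(samples, threshold, hold_seconds):
    """Return the alert state after each (timestamp, value) evaluation."""
    states = []
    active_since = None  # time of the first evaluation in the current run above the threshold
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp
            if timestamp - active_since >= hold_seconds:
                states.append("firing")
            else:
                states.append("pending")
        else:
            active_since = None  # condition cleared, so the alert resets
            states.append("inactive")
    return states


# Evaluations at t = 0s, 15s, 30s, 45s with hypothetical tps values.
evaluations = [(0, 100), (15, 200), (30, 222), (45, 100)]
print(alert_states(evaluations, threshold=150, hold_seconds=10))
```

With a 15s evaluation interval, a 10s hold therefore takes two consecutive evaluations above the threshold before the alert is sent on.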
# Modify the main Prometheus configuration file
$ cat prometheus.yml
global:
  # Default scrape interval is 15s
  scrape_interval: 15s
  # Interval for evaluating rules
  evaluation_interval: 15s
  # Extra labels attached to alerts and other data sent to external systems
  external_labels:
    monitor: "bgbiao-monitor"
rule_files:
  - /etc/prometheus/rules.yml
  # - "first.rules"
  # - "second.rules"
# Scrape targets
scrape_configs:
  - job_name: prometheus
    # Override the scrape interval (applies to this job only)
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']
        labels:
          group: "prom"
      - targets: ['10.13.118.71:9100']
        labels:
          group: "node-exporter"
      - targets: ['10.13.118.71:9091']
        labels:
          group: "push-gateway"
# Alerting targets
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["10.13.118.71:9093"]
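Every scraped sample gets the target's labels (job, instance, and any static labels such as group) attached. When the scraped data already carries one of those labels, as push-gateway metrics do with job="xxb" from the push URL, the default behaviour (honor_labels: false) keeps the scraped value under an exported_ prefix. This is why the alert payload shown later in this post contains both job="prometheus" and exported_job="xxb". A Python sketch of that merge, with a hypothetical function name:

```python
def attach_target_labels(scraped, target, honor_labels=False):
    """Merge target labels into a scraped sample's labels, resolving conflicts."""
    merged = dict(scraped)
    for name, value in target.items():
        if name not in merged:
            merged[name] = value
        elif not honor_labels:
            # Conflict: keep the scraped value under an "exported_" prefix.
            merged["exported_" + name] = merged[name]
            merged[name] = value
        # With honor_labels=True the scraped value wins and nothing changes.
    return merged


scraped = {"__name__": "tps", "job": "xxb"}
target = {"job": "prometheus", "instance": "10.13.118.71:9091", "group": "push-gateway"}
print(attach_target_labels(scraped, target))
```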
Reload the Prometheus service
curl -X POST http://localhost:9090/-/reload
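Push fresh data for testing
# Script for repeatedly pushing data to the push-gateway
$ cat test-abc.sh
#!/bin/bash
#Author_by:Andy_xu @JR-OPS
# Use the last digit of the epoch timestamp as a start column (cut counts from 1, so map 0 to 1)
num=$(date +%s | cut -c10-13)
[ "$num" -eq 0 ] && num=1
# Take the timestamp suffix from that column onward as a pseudo-random tps value
metrics=$(date +%s | cut -c"${num}"-13)
echo "$metrics"
echo "tps $metrics" | curl --data-binary @- http://localhost:9091/metrics/job/xxb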
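Note:
At this point Prometheus can monitor the resource usage of the basic services and, together with Alertmanager, detect alert conditions and raise alerts. How then do the alerts reach the responsible people in time? Earlier we configured a web-hook in Alertmanager, http://10.13.118.71:8889/open/test, which can be found on the Status page of the Alertmanager service. This web-hook is a convenient way to deliver the alerts.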
# A temporary web-hook service for testing
$ cat test-web-hook.go
/**
 * @File Name: test-web-hook.go
 * @Author: xxbandy @http://xxbandy.github.io
 * @Email:
 * @Create Date: 2019-06-19 14:06:48
 * @Last Modified: 2019-06-19 15:06:13
 * @Description: a temporary web-hook service for testing
 * @build:
     GOOS=darwin GOARCH=amd64 CGO_ENABLED=0 go build -ldflags '-w -s' -o prometheus-web-hook test-web-hook.go
 */
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"

	"github.com/gin-gonic/gin"
)

func main() {
	router := gin.Default()
	// Alertmanager delivers alerts with POST; GET is registered on the same handler.
	router.GET("/open/test", CollectData)
	router.POST("/open/test", CollectData)
	router.Run(":8889")
}

// CollectData prints the raw request body (the alert JSON) and replies 200.
func CollectData(c *gin.Context) {
	alertdata, err := ioutil.ReadAll(c.Request.Body)
	if err != nil {
		c.JSON(http.StatusBadRequest, nil)
		return
	}
	fmt.Println(string(alertdata))
	c.JSON(http.StatusOK, nil)
}
# Build the binary
$ GOOS=darwin GOARCH=amd64 CGO_ENABLED=0 go build -ldflags '-w -s' -o prometheus-web-hook test-web-hook.go
$ chmod a+x prometheus-web-hook
# Run the web-hook and collect alert information
$ ./prometheus-web-hook
[GIN-debug] [WARNING] Now Gin requires Go 1.6 or later and Go 1.7 will be required soon.
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.
[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
- using env: export GIN_MODE=release
- using code: gin.SetMode(gin.ReleaseMode)
[GIN-debug] GET /open/test --> main.CollectData (3 handlers)
[GIN-debug] POST /open/test --> main.CollectData (3 handlers)
[GIN-debug] Listening and serving HTTP on :8889
{"receiver":"web\\.hook","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"bgbiao测试","exported_job":"xxb","group":"push-gateway","instance":"10.13.118.71:9091","job":"prometheus","monitor":"bgbiao-monitor","status":"warning"},"annotations":{"description":"10.13.118.71:9091:tps 超过阈值!. 当前值: 26986","summary":"10.13.118.71:9091:tps 超过阈值150."},"startsAt":"2019-06-19T06:17:19.247257311Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://5e1698a320b2:9090/graph?g0.expr=tps > 150\u0026g0.tab=1"}],"groupLabels":{},"commonLabels":{"alertname":"bgbiao测试","exported_job":"xxb","group":"push-gateway","instance":"10.13.118.71:9091","job":"prometheus","monitor":"bgbiao-monitor","status":"warning"},"commonAnnotations":{"description":"10.13.118.71:9091:tps 超过阈值!. 当前值: 26986","summary":"10.13.118.71:9091:tps 超过阈值150."},"externalURL":"http://c6ba74bfd03b:9093","version":"4","groupKey":"{}:{}"}
[GIN] 2019/06/19 - 15:24:10 | 200 | 727.873µs | 10.13.118.71 | POST /open/test
{"receiver":"web\\.hook","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"bgbiao测试","exported_job":"xxb","group":"push-gateway","instance":"10.13.118.71:9091","job":"prometheus","monitor":"bgbiao-monitor","status":"warning"},"annotations":{"description":"10.13.118.71:9091:tps 超过阈值!. 当前值: 26986","summary":"10.13.118.71:9091:tps 超过阈值150."},"startsAt":"2019-06-19T06:17:19.247257311Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://5e1698a320b2:9090/graph?g0.expr=tps > 150\u0026g0.tab=1"}],"groupLabels":{},"commonLabels":{"alertname":"bgbiao测试","exported_job":"xxb","group":"push-gateway","instance":"10.13.118.71:9091","job":"prometheus","monitor":"bgbiao-monitor","status":"warning"},"commonAnnotations":{"description":"10.13.118.71:9091:tps 超过阈值!. 当前值: 26986","summary":"10.13.118.71:9091:tps 超过阈值150."},"externalURL":"http://c6ba74bfd03b:9093","version":"4","groupKey":"{}:{}"}
[GIN] 2019/06/19 - 15:25:20 | 200 | 129.897µs | 10.13.118.71 | POST /open/test
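Note: the web-hook service here simply prints all of the alert information. In practice the fields users care about can be extracted and sent straight to their devices, for example through DingTalk, WeChat, or SMS.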
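As a sketch of that extraction step, the following Python function pulls the status, alert name, instance and summary out of a webhook payload shaped like the one printed above and turns each alert into a one-line message. The function name and message layout are assumptions, the sample is a shortened version of the logged payload, and actually delivering the message to a chat or SMS service is left out.

```python
import json


def summarize_alerts(payload):
    """Turn an Alertmanager webhook JSON body into one short message per alert."""
    data = json.loads(payload)
    messages = []
    for alert in data.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        messages.append(
            "[{status}] {name} on {instance}: {text}".format(
                status=alert.get("status", "unknown"),
                name=labels.get("alertname", "?"),
                instance=labels.get("instance", "?"),
                text=annotations.get("summary", ""),
            )
        )
    return messages


# Shortened sample with the same structure as the payload logged by the web-hook.
SAMPLE_PAYLOAD = json.dumps({
    "receiver": "web\\.hook",
    "status": "firing",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "bgbiao测试", "instance": "10.13.118.71:9091"},
        "annotations": {"summary": "10.13.118.71:9091:tps 超过阈值150."},
    }],
})

for message in summarize_alerts(SAMPLE_PAYLOAD):
    print(message)
```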