Prometheus

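What is a TSDB? (Time Series Database)

Put simply, a TSDB is software optimized for handling time series data, that is, arrays of values indexed by time. Its workload typically has the following characteristics: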

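- Most operations are writes.
- Writes are almost always sequential appends; data usually arrives ordered by time.
- Writes rarely target data from long ago, and updates are rare. Data is usually written to the database within seconds or minutes of being collected.
- Deletes are usually done in bulk: a starting point in history is chosen and a contiguous range of blocks after it is removed. Deleting a single point in time, or scattered random times, is rare.
- The data volume usually far exceeds memory, so caching helps little and the system is generally I/O bound.
- Reads are typically sequential scans in ascending or descending time order.
- Highly concurrent reads are very common.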
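
The access pattern above can be illustrated with a minimal sketch. It is not how Prometheus stores data internally; `SeriesStore` and its methods are hypothetical names chosen for illustration. It only shows in-order appends, sequential range reads, and bulk deletion of old data.

```python
import bisect


class SeriesStore:
    """Toy in-memory store for a single time series (illustrative only)."""

    def __init__(self):
        self.timestamps = []  # stays sorted because writes arrive in time order
        self.values = []

    def append(self, ts, value):
        # Writes are sequential appends; data from the past is not rewritten.
        if self.timestamps and ts < self.timestamps[-1]:
            raise ValueError("out-of-order write rejected")
        self.timestamps.append(ts)
        self.values.append(value)

    def range(self, start, end):
        # Sequential read of every sample with start <= ts <= end.
        lo = bisect.bisect_left(self.timestamps, start)
        hi = bisect.bisect_right(self.timestamps, end)
        return list(zip(self.timestamps[lo:hi], self.values[lo:hi]))

    def delete_before(self, cutoff):
        # Bulk deletion of one contiguous block of old samples.
        idx = bisect.bisect_left(self.timestamps, cutoff)
        del self.timestamps[:idx]
        del self.values[:idx]


store = SeriesStore()
for ts in range(0, 50, 10):
    store.append(ts, ts * 2.0)
print(store.range(10, 30))
```

A real TSDB adds compression, on-disk blocks, and an index over many series; the sketch leaves all of that out.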

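What is Prometheus?

Prometheus is an open-source monitoring and alerting system and time series database (TSDB) originally developed at SoundCloud.

Prometheus joined the CNCF (Cloud Native Computing Foundation) in 2016 as the second project hosted by the foundation, after Kubernetes.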

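Prometheus features

- A multi-dimensional data model (a time series is identified by a metric name and a set of key/value pairs)
- A flexible query language (PromQL) that works across these dimensions
- No dependence on distributed storage; a single server node works on its own
- Time series are collected with a pull model over HTTP
- Time series can also be pushed through an intermediary gateway
- Targets are found through service discovery or static configuration
- Support for multiple kinds of visualization and dashboards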

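The Prometheus ecosystem

- The main Prometheus server, which scrapes and stores time series data
- Client libraries for instrumenting application code or writing exporters (Go, Java, Python, Ruby)
- A push gateway for supporting short-lived jobs
- Visual dashboards (two options, PromDash and Grafana; Grafana is currently the mainstream choice)
- Special-purpose exporters (for services such as HAProxy, StatsD, and Graphite)
- An experimental alert manager (alertmanager, which separately handles alert aggregation, dispatch, silencing, and so on)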

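Deployment and configuration

Download

URL: https://prometheus.io/download/

Deployment

Download prometheus-*.tar.gz

Extract it

Configuration

The prometheus directory contains a main configuration file named prometheus.yml. It holds most of the standard settings, plus the configuration Prometheus uses to monitor itself. The file looks like this: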

    # my global config
    global:
      scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute. (interval between scrapes)
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute. (interval between rule evaluations)
      # scrape_timeout is set to the global default (10s).

    # Alertmanager configuration
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - '172.17.20.231:20507' # address of the Alertmanager to connect to

    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      # - "first_rules.yml"
      # - "second_rules.yml"
      - "alert-rule.yml" # two kinds of rules go here: recording rules and alerting rules

    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'prometheus' # the scrape target

        # metrics_path defaults to '/metrics' (the exporter built into Prometheus)
        # scheme defaults to 'http'.

        static_configs:
          - targets: ['localhost:20504'] # port Prometheus listens on

      - job_name: 'spring-boot'
        metrics_path: '/prometheus' # path of the custom Spring Boot exporter
        static_configs:
          - targets: ['localhost:20506'] # port the Spring Boot application listens on
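
To make the defaults in the scrape configuration concrete, here is a small sketch of how a scrape URL is assembled from `scheme`, the target address, and `metrics_path`. The function name `scrape_urls` and the dictionary layout are assumptions made for illustration; they mirror the YAML keys but are not Prometheus code.

```python
def scrape_urls(job):
    """Build the URLs one scrape job would poll, applying the documented defaults."""
    scheme = job.get("scheme", "http")          # scheme defaults to 'http'
    path = job.get("metrics_path", "/metrics")  # metrics_path defaults to '/metrics'
    urls = []
    for static in job.get("static_configs", []):
        for target in static["targets"]:
            urls.append(f"{scheme}://{target}{path}")
    return urls


# The two jobs from the configuration above, written as plain dictionaries.
jobs = [
    {
        "job_name": "prometheus",
        "static_configs": [{"targets": ["localhost:20504"]}],
    },
    {
        "job_name": "spring-boot",
        "metrics_path": "/prometheus",
        "static_configs": [{"targets": ["localhost:20506"]}],
    },
]

for job in jobs:
    print(job["job_name"], scrape_urls(job))
```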

Startup

Write a startup script

nohup ./prometheus --config.file=prometheus.yml --web.enable-admin-api --web.listen-address=:20504 >/dev/null 2>&1 &

This starts Prometheus silently in the background; --web.listen-address sets the listening port.

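Metric types

- Counter: a cumulative value that only ever increases, or resets to zero when the process restarts. It is suited to things like request counts or error counts.
- Gauge: an instantaneous value that can go up or down arbitrarily.
- Histogram: samples observations over a period of time (usually request durations or response sizes) and counts them in configurable buckets, along with the total count and the sum of all observed values.
- Summary: very similar to a Histogram in that it samples observations over a period of time (usually request durations or response sizes), but it stores quantile values directly instead of deriving them from bucket counts.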
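
The semantics of the first three types can be sketched in a few lines. The classes below are hypothetical stand-ins written for illustration, not the official client library; Summary is omitted because streaming quantile estimation is beyond a short sketch.

```python
import bisect


class Counter:
    """Cumulative value that can only go up."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Instantaneous value that can be set to anything."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = float(value)


class Histogram:
    """Counts observations per bucket and tracks their sum and count."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)                # upper bounds of the buckets
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the +Inf bucket
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # bisect_left finds the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.sum += value
        self.count += 1

    def cumulative(self):
        # Buckets are reported cumulatively: each one includes all smaller ones.
        total, out = 0, []
        for bound, n in zip(self.bounds + [float("inf")], self.counts):
            total += n
            out.append((bound, total))
        return out


latency = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.5, 2.0):
    latency.observe(seconds)
print(latency.cumulative())
```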

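Time series data: instrumentation and querying

Every time series is made up of a metric (the metric name), a set of labels, and float64 sample values.

The standard format is <metric name>{<label name>=<label value>, ...}

For example: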

rpc_invoke_cnt_c{code="0",method="Session.GenToken",job="Center"} 5

rpc_invoke_cnt_c{code="0",method="Relation.GetUserInfo",job="Center"} 12

rpc_invoke_cnt_c{code="0",method="Message.SendGroupMsg",job="Center"} 12

rpc_invoke_cnt_c{code="4",method="Message.SendGroupMsg",job="Center"} 3

rpc_invoke_cnt_c{code="0",method="Tracker.Tracker.Get",job="Center"} 70

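This is a set of monitoring data that counts how many times RPC interfaces were handled.

Here rpc_invoke_cnt_c is the metric name, and each sample carries three labels: code is the error code, job is the service the metric belongs to, and method is the method the metric belongs to. The number at the end is the sample value.

In this example we have four dimensions (one metric name and three labels), which lets us run very complex queries with PromQL, Prometheus's powerful query language.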
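
As an illustration of the format, the sketch below splits one sample line into its metric name, labels, and value. It is a simplified parser written for this example, not the official one: it assumes label values contain no escaped quotes and that lines carry no trailing timestamp.

```python
import re

# name, optional {labels}, whitespace, value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Return (metric_name, labels_dict, float_value) for one sample line."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


name, labels, value = parse_sample(
    'rpc_invoke_cnt_c{code="4",method="Message.SendGroupMsg",job="Center"} 3'
)
print(name, labels, value)
```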

PromQL

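PromQL (Prometheus Query Language) is Prometheus's own DSL for querying data. It is highly expressive, supports conditional queries and operators, and ships with a large number of built-in functions for querying monitoring data along its various dimensions.

To measure the call rate of Relation.GetUserInfo in the Center component, we can use the following query: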

rate(rpc_invoke_cnt_c{method="Relation.GetUserInfo",job="Center"}[1m])

Or, to measure Center's overall RPC error rate broken down by method and error code:

sum by (method, code)(rate(rpc_invoke_cnt_c{job="Center",code!="0"}[1m]))
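
What `sum by (method, code)` does to the filtered series can be sketched outside PromQL. The input below is a list of hypothetical per-series rates with made-up numbers, including an `instance` label that does not appear in the samples above; the aggregation drops every label except the ones named in `by` and adds up the values that share a key.

```python
from collections import defaultdict


def sum_by(samples, by):
    """Group (labels, value) pairs by the given label names and sum the values."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple(labels.get(name, "") for name in by)
        totals[key] += value
    return dict(totals)


# Hypothetical per-second rates, one entry per series (numbers are invented).
rates = [
    ({"code": "4", "method": "Message.SendGroupMsg", "job": "Center", "instance": "a"}, 0.5),
    ({"code": "4", "method": "Message.SendGroupMsg", "job": "Center", "instance": "b"}, 0.25),
    ({"code": "0", "method": "Message.SendGroupMsg", "job": "Center", "instance": "a"}, 2.0),
]

# Equivalent of the selector {job="Center",code!="0"}.
errors = [
    (labels, value)
    for labels, value in rates
    if labels["job"] == "Center" and labels["code"] != "0"
]
result = sum_by(errors, by=("method", "code"))
print(result)
```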

To measure the average latency of each Center method, the following query is enough:

rate(rpc_invoke_time_h_sum{job="Center"}[1m]) / rate(rpc_invoke_time_h_count{job="Center"}[1m])

rate(http_requests_total[5m])

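Returns the per-second rate of HTTP requests, measured over the last 5 minutes, for each time series in the range vector.

increase(http_requests_total[5m])

Returns the number of HTTP requests measured over the last 5 minutes, for each time series in the range vector.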
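
The difference between the two functions is easiest to see on raw samples. The sketch below is a simplified model: it handles counter resets but, unlike Prometheus itself, does not extrapolate to the edges of the time window, so real query results can differ slightly. The function names mirror the PromQL ones only for readability.

```python
def increase(samples):
    """Total counter increase across (timestamp, value) samples, tolerating resets."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted, so the new value counts from zero.
        total += (cur - prev) if cur >= prev else cur
    return total


def rate(samples):
    """Average per-second increase between the first and the last sample."""
    if len(samples) < 2:
        return None  # a rate needs at least two samples
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase(samples) / elapsed


# Invented samples over 5 minutes, one per minute, with a reset at t=180.
window = [(0, 100), (60, 160), (120, 220), (180, 10), (240, 70), (300, 130)]
print(increase(window), rate(window))
```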

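Official function reference: https://prometheus.io/docs/querying/functions/

In addition, to make querying easier, it pays to name metrics and labels carefully when instrumenting code.

rpc_invoke_cnt_c counts RPC calls

api_req_num_cv counts HTTP API calls

msg_queue_cnt_c tracks queue length

Official naming guide: https://prometheus.io/docs/practices/naming/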

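Alerting

Deployment and installation

Download URL: https://prometheus.io/download/

Create a startup script

nohup ./alertmanager --web.listen-address=:20507 >/dev/null 2>&1 &

Adjust the configuration file

The alertmanager.yml file

Defining alerting rules

First define the alerting rules by listing the rule files in the Prometheus configuration: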

    rule_files:
      # - "first_rules.yml"
      # - "second_rules.yml"
      - "alert-rule.yml" # two kinds of rules go here: recording rules and alerting rules

Then write the corresponding alerting rule yourself:

    groups:
      - name: example
        interval: 1s
        rules:
          # Alert for any instance that is unreachable for longer than the 'for' duration (1s here).
          - alert: InstanceDown
            expr: up == 0
            for: 1s
            labels:
              severity: page
            annotations:
              summary: "Instance {{ $labels.instance }} down"
              description: "{{ $labels.instance }} of job {{ $labels.job }} has been down"

The above is an alerting rule for instance downtime.
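
The effect of the `for` clause can be sketched as a small state machine: an alert is pending while its expression has been true for less than the `for` duration, and firing once it has held for at least that long. The function below is an illustrative model with hypothetical names, not the Prometheus rule engine.

```python
def alert_states(evaluations, for_seconds):
    """Return (timestamp, state) for each rule evaluation.

    evaluations: list of (timestamp, condition_is_true), in time order.
    """
    active_since = None
    states = []
    for ts, condition in evaluations:
        if not condition:
            # The expression stopped matching, so the alert is cleared.
            active_since = None
            states.append((ts, "inactive"))
            continue
        if active_since is None:
            active_since = ts  # first evaluation where the expression matched
        held = ts - active_since
        states.append((ts, "firing" if held >= for_seconds else "pending"))
    return states


# Invented evaluation results for `up == 0`, checked every 15 seconds.
history = [(0, False), (15, True), (30, True), (45, False), (60, True)]
print(alert_states(history, for_seconds=15))
```

With a very short duration such as the 1s used in the rule, an alert moves to firing almost as soon as the expression matches, which is convenient for testing but noisy in production.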

Configuring alert delivery

Below is a simple configuration:

    global:
      smtp_smarthost: 'smtp.exmail.qq.com:25' # SMTP server used to send mail
      smtp_from: '[email protected]'
      smtp_auth_username: '[email protected]'
      smtp_auth_password: 'xxx'

    # The directory from which notification templates are read.
    templates:
      - '/etc/alertmanager/template/*.tmpl'

    # The root route on which each incoming alert enters.
    route:
      # The labels by which incoming alerts are grouped together. For example,
      # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
      # be batched into a single group.
      group_by: ['alertname', 'cluster', 'service'] # grouping used by the settings that follow

      # When a new group of alerts is created by an incoming alert, wait at
      # least 'group_wait' to send the initial notification.
      # This way ensures that you get multiple alerts for the same group that start
      # firing shortly after another are batched together on the first
      # notification. (How long a new group waits before sending, so that alerts can ride together.)
      group_wait: 5s

      # When the first notification was sent, wait 'group_interval' to send a batch
      # of new alerts that started firing for that group.
      group_interval: 1m # interval between notifications for one group

      # If an alert has successfully been sent, wait 'repeat_interval' to
      # resend them.
      repeat_interval: 3h # interval before resending

      # A default receiver
      receiver: zhangm # default recipient

    receivers: # all recipients
      - name: 'zhangm'
        email_configs:
          - to: '[email protected]'
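
The `group_by` setting decides which alerts end up in the same notification. A minimal sketch of that grouping step follows; it ignores the timing settings (`group_wait`, `group_interval`, `repeat_interval`), and the alert labels are invented for illustration.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical incoming alerts, each represented by its label set.
alerts = [
    {"alertname": "InstanceDown", "cluster": "A", "service": "center", "instance": "host1"},
    {"alertname": "InstanceDown", "cluster": "A", "service": "center", "instance": "host2"},
    {"alertname": "LatencyHigh", "cluster": "A", "service": "center", "instance": "host1"},
]

groups = group_alerts(alerts, ["alertname", "cluster", "service"])
for key, members in groups.items():
    print(key, len(members))
```

Here the two InstanceDown alerts share one group and would be sent in a single email, while LatencyHigh forms a group of its own.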

Visualization

Startup

Install Grafana: https://grafana.com/

Download the grafana.tar.gz package

Extract it

Change into the bin directory

nohup ./grafana-server >/dev/null 2>&1 &

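This starts Grafana in the background.

Configuration

To change the port, edit the http_port parameter in defaults.ini under the conf directory.

Web interface

Account and password

Default account: admin, password: admin

Adding a data source

Integration

For integration, see [Official Prometheus examples], [Integrating Prometheus with Play], and [Integrating Prometheus with Spring].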

References

Prometheus official website

[Getting started with Prometheus] (http://www.10tiao.com/html/357/201705/2247485232/1.html)

[Advanced Prometheus] (http://www.10tiao.com/html/357/201705/2247485249/1.html)

Online service monitoring with Prometheus at 360

Official Prometheus examples

Integrating Prometheus with Play

Integrating Prometheus with Spring
