Building and Operating an Industry-Standard Monitoring System with Prometheus

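About the author

Qin Chao (Shengju)

Graduated from Hohai University in 2013 with a master's degree in Computer Science and Technology.

Former technical expert at Ctrip Group.

Built Ctrip's flight ticket issuing and ticket validation systems from scratch, and holds patents including "Ticket Issuing Method and System for Flight Orders" and "A General Method for Validating Airline E-Tickets".

Was responsible for Ctrip's flight insurance systems covering issuance, refunds, changes, and claims. A large share of the group's revenue depended on these systems, so they had to be extremely stable.

Was responsible for Ctrip's flight payment gateway and upgraded its model in a project that spanned 4 business units and more than 100 developers. The reworked model supported many scenarios that benefited customers, the company, and the airlines alike.

Currently works in the logistics domain of Taocaicai on transport route planning, dispatching, field operations, and tiered management of drivers.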

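Where monitoring sits in the microservice stack

This is one selection of technologies from the microservice solutions currently on the market. Every component in it has many alternatives.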

Types of monitoring and how they compare

Source:

Go for Industrial Programming

https://peter.bourgon.org/go-for-industrial-programming/

Monitoring can be divided into three categories: Logging, Metrics, and Tracing.

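Logging

Inside Alibaba the usual choice is SLS; elsewhere it is generally ELK, that is, Elasticsearch + Logstash + Kibana.

Elasticsearch stores and queries the logs.

Logstash collects the logs.

Kibana displays the logs.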

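Tracing

Tracing is mainly used for call-chain monitoring.

There are many tracing systems to choose from today:

In 2002, eBay built CAL, the first call-chain monitoring system.

In 2010, Google published the paper "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure".

In 2011, Wu Qimin wrote CAT at Dianping (now part of Meituan).

In 2012, Twitter released Zipkin, an open-source implementation of Dapper.

In 2014, Alibaba built its own Dapper implementation, EagleEye.

In recent years the community has produced many similar call-chain monitoring systems, but nearly all of them follow the Dapper design, for example Uber's Jaeger and the OpenTracing specification.

The original text and a Chinese translation of "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure" are available at:

https://github.com/AlphaWang/alpha-dapper-translation-zh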

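Metrics

Alibaba's metrics monitoring uses Sunfire; outside Alibaba, the stack I have worked with is Prometheus + Grafana.

Prometheus handles collection, storage, and querying.

Grafana handles visualization.

A metrics monitoring system usually consists of these parts:

1. A metrics collector

2. A metrics query engine

3. Metrics storage

4. Dashboards for visualization

5. Alerts for notification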

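Time series databases

What does a time series database store?

Time series data comes from a source that emits one record at regular intervals; apart from the timestamp and the value, every record is the same. CPU utilization is an example: it keeps changing over time, so the data it produces is time series data.

A typical set of time series samples looks like this: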

http_requests_total{endpoint="/login",status="500",} 44545.0
http_requests_total{endpoint="/register",status="500",} 12781.0
http_requests_total{endpoint="/login",status="200",} 4434300.0
http_requests_total{endpoint="/register",status="200",} 1268621.0
http_requests_total{endpoint="/users/{id}",status="500",} 6397.0
http_requests_total{endpoint="/logout",status="200",} 2532440.0
http_requests_total{endpoint="/users/{id}",status="200",} 633429.0
http_requests_total{endpoint="/users",status="200",} 1899649.0
http_requests_total{endpoint="/users",status="500",} 19298.0
http_requests_total{endpoint="/logout",status="500",} 25668.0

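Here http_requests_total is the metric name. The endpoint and status pairs inside {} are called labels or tags. The number after them is the sample value. The timestamp is normally attached when the sample is recorded.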
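
To make the three parts concrete, here is a minimal Python sketch that splits one sample line into metric name, labels, and value. The function name parse_sample and both regular expressions are illustrative assumptions, not part of any Prometheus client library, and the sketch ignores the optional trailing timestamp and comment lines that the real text format allows.

```python
import re

# One sample line: metric_name{label="value",...} value
# Simplified on purpose: no timestamp, no comment lines.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one sample line into (metric name, labels dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


name, labels, value = parse_sample(
    'http_requests_total{endpoint="/login",status="500",} 44545.0'
)
print(name, labels, value)
```

The metric name plus the full label set identifies one series; only the value and the timestamp change from scrape to scrape.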

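Why not just store this in a relational database?

1. Metrics and tags change dynamically, and a traditional relational database cannot adapt to schema changes quickly.

2. Tags need heavy aggregation, and traditional relational databases handle neither materialized views nor real-time aggregation well enough.

3. The shape of time series data allows many storage-level optimizations that reduce disk usage and raise write throughput. For example, an LSM tree (Log-Structured Merge Tree) speeds up writes.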
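
As a rough illustration of the tag aggregation mentioned in point 2, the sketch below sums the ten sample values from the earlier example by one label. The helper sum_by is a hypothetical name; in PromQL the same grouping would be written as sum by (status) (http_requests_total).

```python
from collections import defaultdict

# The ten samples from the example above, as (labels, value) pairs.
SAMPLES = [
    ({"endpoint": "/login", "status": "500"}, 44545.0),
    ({"endpoint": "/register", "status": "500"}, 12781.0),
    ({"endpoint": "/login", "status": "200"}, 4434300.0),
    ({"endpoint": "/register", "status": "200"}, 1268621.0),
    ({"endpoint": "/users/{id}", "status": "500"}, 6397.0),
    ({"endpoint": "/logout", "status": "200"}, 2532440.0),
    ({"endpoint": "/users/{id}", "status": "200"}, 633429.0),
    ({"endpoint": "/users", "status": "200"}, 1899649.0),
    ({"endpoint": "/users", "status": "500"}, 19298.0),
    ({"endpoint": "/logout", "status": "500"}, 25668.0),
]


def sum_by(samples, label):
    """Group samples by the value of one label and sum each group."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels[label]] += value
    return dict(totals)


print(sum_by(SAMPLES, "status"))
print(sum_by(SAMPLES, "endpoint"))
```

Because any label can become the grouping key at query time, a fixed relational schema with one column per tag fits this workload poorly.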

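Prometheus's storage layer performs very well: a single server can ingest on the order of a million samples per second while using very little disk space.

Common time series databases include InfluxDB, OpenTSDB, and Prometheus.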

Comparing time series databases

Trend comparison

Data source:

https://db-engines.com/en/ranking_trend/time+series+dbms

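What is Prometheus

1. An open-source time series database. It joined the Cloud Native Computing Foundation (CNCF) in 2016 as the second hosted project after Kubernetes.

2. A monitoring tool that integrates well with Grafana and provides its own query language, PromQL.

3. Developed at SoundCloud and inspired by Google's Borgmon.

4. Supports both white-box and black-box monitoring.

5. Has a rich community ecosystem with many exporters, alerts, and dashboards.

6. High single-node performance: it can ingest samples at a rate on the order of a million per second and scrape thousands of targets at once.

Official site: https://prometheus.io/

A typical Prometheus + Grafana monitoring view looks like this: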

Prometheus architecture

The overall architecture is as follows:

Source:

https://prometheus.io/docs/introduction/overview/

Prometheus offers two kinds of storage: local and remote.

For details, see:

https://prometheus.io/docs/prometheus/latest/storage/

https://github.com/prometheus-junkyard/tsdb/blob/master/docs/format/README.md

https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index/

"Tech talk: How Prometheus stores data (Chen Hao)" https://www.bilibili.com/video/BV1a64y1X7ys?from=search&seid=16300830048851003304&spm_id_from=333.337.0.0

The storage structure is as follows:

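On-disk file layout

The basic unit of local storage is the block; each block initially covers a two-hour window of data.

A block contains a set of chunks plus the files meta.json, index, and tombstones.

tombstones records deletions, that is, which series and time ranges have been marked as deleted, so the data does not have to be removed from the chunk files right away.

The chunks hold the actual sample data.

meta.json holds the block's metadata, such as its start time and end time.

index holds the index used to locate data in the chunks quickly.

The index first builds a symbol table and assigns every series (a unique set of labels) an ID in sorted order, then uses an inverted index to speed up lookups.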

Take the series from the earlier example, each numbered with its ID:

1-http_requests_total{endpoint="/login",status="500",}

2-http_requests_total{endpoint="/register",status="500",}

3-http_requests_total{endpoint="/login",status="200",}

4-http_requests_total{endpoint="/register",status="200",}

5-http_requests_total{endpoint="/users/{id}",status="500",}

6-http_requests_total{endpoint="/logout",status="200",}

7-http_requests_total{endpoint="/users/{id}",status="200",}

8-http_requests_total{endpoint="/users",status="200",}

9-http_requests_total{endpoint="/users",status="500",}

10-http_requests_total{endpoint="/logout",status="500",}

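Then build an inverted index from each label pair to the IDs of the series that carry it:

endpoint="/login" [1,3]

status="500" [1,2,5,9,10]

To find the series with endpoint="/login" and status="500", intersect the two lists. The result, [1], is the ID of the matching series.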
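
The lookup can be sketched in a few lines of Python. This is a simplified model of the idea only, not Prometheus's on-disk index format; SERIES and build_postings are illustrative names, and the intersection here uses Python sets for brevity.

```python
from collections import defaultdict

# Series ID -> label set, numbered as in the list above.
SERIES = {
    1: {"endpoint": "/login", "status": "500"},
    2: {"endpoint": "/register", "status": "500"},
    3: {"endpoint": "/login", "status": "200"},
    4: {"endpoint": "/register", "status": "200"},
    5: {"endpoint": "/users/{id}", "status": "500"},
    6: {"endpoint": "/logout", "status": "200"},
    7: {"endpoint": "/users/{id}", "status": "200"},
    8: {"endpoint": "/users", "status": "200"},
    9: {"endpoint": "/users", "status": "500"},
    10: {"endpoint": "/logout", "status": "500"},
}


def build_postings(series):
    """Map every (label name, label value) pair to a sorted list of series IDs."""
    postings = defaultdict(list)
    for series_id in sorted(series):
        for name, value in series[series_id].items():
            postings[(name, value)].append(series_id)
    return dict(postings)


postings = build_postings(SERIES)
login = postings[("endpoint", "/login")]
errors = postings[("status", "500")]
print(login, errors, sorted(set(login) & set(errors)))
```

Each list comes out sorted because the series are visited in ID order, which is what makes cheaper intersection strategies possible.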

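This, however, raises the question of how to compute an intersection or a union efficiently.

Think about it: with a naive algorithm, finding endpoint="/login" AND status="500" costs O(m*n). With extra space, turning one of the lists into a hash set brings that down to O(m+n). How can we reach the same bound without the extra space?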
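
One common answer, offered here as a sketch rather than as the author's intended solution or Prometheus's actual code, relies on both ID lists already being sorted: walk them with two pointers and advance whichever side is smaller. Apart from the output list, no auxiliary structure is needed, and the cost is O(m+n). The function name intersect_sorted is an assumption for illustration.

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists of IDs with two pointers, in O(m+n) time."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Present in both lists: keep it and advance both pointers.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # a[i] is smaller than everything left in b, so it cannot match.
            i += 1
        else:
            j += 1
    return result


print(intersect_sorted([1, 3], [1, 2, 5, 9, 10]))
```

A union works the same way, except that the smaller element is emitted before its pointer advances and the leftover tail of either list is appended at the end.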

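As for the layout inside each file, see the documents linked above.

Notice that there is also a wal directory, which holds the write-ahead log.

Samples that Prometheus scrapes are not flushed to disk as blocks right away; they are held in memory first. If the process crashed at that point, the data would be lost. The WAL exists to prevent this: each write is appended to the log first, so the in-memory state can be rebuilt by replaying the log after a restart. Many databases use a WAL today.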
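
The write-ahead idea can be shown with a toy store. This is a minimal sketch under simplifying assumptions (one JSON record per line, an fsync on every append, no handling of a torn final record); the class TinyWAL and its file format are invented for illustration and have nothing to do with Prometheus's real WAL segment format.

```python
import json
import os
import tempfile


class TinyWAL:
    """Toy store: every write is appended to a log file before it touches memory."""

    def __init__(self, path):
        self.path = path
        self.memory = {}
        self._replay()
        self.log = open(path, "a", encoding="utf-8")

    def _replay(self):
        """Rebuild the in-memory state from the log after a restart."""
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                self.memory.setdefault(record["series"], []).append(
                    (record["ts"], record["value"])
                )

    def append(self, series, ts, value):
        record = {"series": series, "ts": ts, "value": value}
        self.log.write(json.dumps(record) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())  # on disk before the write is acknowledged
        self.memory.setdefault(series, []).append((ts, value))

    def close(self):
        self.log.close()


path = os.path.join(tempfile.mkdtemp(), "wal.log")
store = TinyWAL(path)
store.append("cpu_usage", 1, 0.42)
store.append("cpu_usage", 2, 0.57)
store.close()  # stand-in for a crash: the in-memory dict is discarded

recovered = TinyWAL(path)  # a fresh instance replays the log
print(recovered.memory)
recovered.close()
```

The log is only ever appended to, which is cheap on disk, while the expensive work of building compact blocks can happen later in the background.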

Collecting metrics

Exporters

For details, see:

https://prometheus.io/docs/instrumenting/exporters/

Alerts

For details, see:

Hands-on lab

Installing Prometheus

•brew install prometheus

•cd /usr/local/Cellar/prometheus/2.31.1/bin/

•./prometheus --config.file=/usr/local/etc/prometheus.yml

•Open http://localhost:9090/

Installing Grafana

brew install grafana

cd /usr/local/Cellar/grafana/8.3.2

grafana-server --config=/usr/local/etc/grafana/grafana.ini --homepath /usr/local/share/grafana --packaging=brew cfg:default.paths.logs=/usr/local/var/log/grafana cfg:default.paths.data=/usr/local/var/lib/grafana cfg:default.paths.plugins=/usr/local/var/lib/grafana/plugins

Open http://localhost:3000/

Starting the simulator

open /usr/local/etc/prometheus.yml

Add the following under scrape_configs:

  - job_name: "simulator"
    metrics_path: /prometheus
    static_configs:
      - targets: ["localhost:8080"]

Start http-simulator on port 8080:

Then open http://localhost:9090/ and check the Prometheus targets; the simulator's state should be UP.
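
If the http-simulator program is not at hand, a stand-in can be written with the Python standard library alone. The sketch below is not the author's simulator; it is a hypothetical replacement that serves made-up counters at /prometheus on port 8080, which is what the scrape config above expects. The names COUNTERS, render_metrics, MetricsHandler, and serve are assumptions for illustration.

```python
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

# Fake request counters keyed by (endpoint, status); the values are made up.
COUNTERS = {("/login", "200"): 0, ("/login", "500"): 0}


def render_metrics(counters):
    """Render the counters in the Prometheus text exposition format."""
    lines = ["# TYPE http_requests_total counter"]
    for (endpoint, status), value in sorted(counters.items()):
        lines.append(
            f'http_requests_total{{endpoint="{endpoint}",status="{status}"}} {float(value)}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/prometheus":
            self.send_error(404)
            return
        # Pretend some traffic arrived since the last scrape.
        for key in COUNTERS:
            COUNTERS[key] += random.randint(0, 5)
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Block forever, answering scrapes on the given port."""
    HTTPServer(("localhost", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint; Prometheus then scrapes it at its configured interval, and the counters only ever grow, as counters should.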

Adding a data source to Grafana

Installing node_exporter

brew install node_exporter

node_exporter

  # Scrape node_exporter metrics (add under scrape_configs in prometheus.yml)
  - job_name: "node_exporter"
    static_configs:
      - targets: ["localhost:9100"]
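
Once the jobs are configured, the built-in up metric shows whether each target is being scraped successfully. The sketch below asks the Prometheus HTTP API (/api/v1/query) for up and turns the answer into a per-target status. The helper names query_url, parse_up, and fetch_up are assumptions, as is the default base URL, which matches the local setup above.

```python
import json
import urllib.parse
import urllib.request


def query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_up(response):
    """Map each (job, instance) to True/False from an instant-vector response."""
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response}")
    return {
        (item["metric"].get("job"), item["metric"].get("instance")): item["value"][1] == "1"
        for item in response["data"]["result"]
    }


def fetch_up(base_url="http://localhost:9090"):
    """Query a running Prometheus server; requires network access to it."""
    with urllib.request.urlopen(query_url(base_url, "up")) as resp:
        return parse_up(json.load(resp))
```

With Prometheus running locally, fetch_up() returns one entry per scrape target, keyed by job name and instance address.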

Using Grafana dashboards

In Grafana, search the available dashboards and pick the one that matches the exporter you are using.
