Prometheus Monitoring

1. Introduction

Prometheus is an open-source system monitoring and alerting framework. Inspired by Google's Borgmon monitoring system, it was created in 2012 by former Google engineers working at SoundCloud, developed as a community open-source project, and officially released in 2015. In 2016, Prometheus joined the Cloud Native Computing Foundation, becoming its second hosted project after Kubernetes.

2. Components and Architecture

The Prometheus ecosystem consists of multiple components, many of which are optional:

  • Prometheus Server: collects and stores time-series data.
  • Client Library: generates the relevant metrics inside the service being monitored and exposes them to the Prometheus server; when the server pulls, it returns the metrics' current state directly.
  • Push Gateway: mainly for short-lived jobs. Because such jobs may have exited before Prometheus comes to pull, they push their metrics to the Push Gateway instead, which Prometheus then scrapes. This approach is meant for service-level metrics; for machine-level metrics, use the node exporter.
  • Exporters: expose the metrics of existing third-party services to Prometheus.
  • Alertmanager: receives alerts from the Prometheus server, deduplicates and groups them, routes them to the configured receivers, and sends out notifications. Common receivers include email, PagerDuty, OpsGenie, and webhooks.
  • Various other tools.

(Figure 1: Prometheus architecture)

The rough workflow is:

  • The Prometheus server periodically pulls metrics from configured jobs or exporters, receives metrics pushed through the Pushgateway, or pulls metrics from other Prometheus servers.
  • The Prometheus server stores the collected metrics locally and evaluates the defined alert.rules, recording new time series or pushing alerts to Alertmanager.
  • Alertmanager processes the received alerts according to its configuration and sends out notifications.
  • A graphical interface visualizes the collected data.

3. Metric Types

The Prometheus client libraries provide four main metric types:

Counter

  • A cumulative metric; typical uses include the number of requests, completed tasks, or errors.

For example, querying http_requests_total{method="get", job="Prometheus", handler="query"} returns 8; querying again 10 seconds later returns 14.

Gauge

  • A metric for a current value; typical uses include temperature or the number of running goroutines.
  • Can go up and down arbitrarily.

For example: go_goroutines{instance="172.17.0.2", job="Prometheus"} returns 147, then 124 ten seconds later.

Histogram

  • Can be thought of as a bar chart; typical uses include request duration and response size.
  • Samples observations into configurable buckets, enabling grouping and aggregation.
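A histogram's buckets are cumulative: each `le` bucket counts every observation less than or equal to its upper bound, and the client also keeps an implicit +Inf bucket, a running sum, and a total count. A minimal sketch of that bookkeeping in plain Go (the bucket bounds here are made up for illustration):

```go
package main

import "fmt"

// histogram keeps cumulative bucket counts plus sum and count,
// mirroring the _bucket/_sum/_count series Prometheus exposes.
type histogram struct {
	bounds []float64 // upper bounds (le), ascending; +Inf is implicit
	counts []int     // counts[i] = observations <= bounds[i]; last entry is +Inf
	sum    float64
	count  int
}

func newHistogram(bounds []float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]int, len(bounds)+1)}
}

func (h *histogram) observe(v float64) {
	for i, b := range h.bounds {
		if v <= b {
			h.counts[i]++ // cumulative: every bucket whose bound covers v
		}
	}
	h.counts[len(h.bounds)]++ // the +Inf bucket counts everything
	h.sum += v
	h.count++
}

func main() {
	h := newHistogram([]float64{50, 100, 200})
	for _, ms := range []float64{30, 80, 120, 180, 500} {
		h.observe(ms)
	}
	fmt.Println(h.counts) // [1 2 4 5]: le=50 → 1, le=100 → 2, le=200 → 4, +Inf → 5
	fmt.Println(h.sum, h.count)
}
```

This cumulative shape is exactly what the `search_cost_time_detail_bucket{...,le="..."}` lines in the scrape output later in this article show.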

Summary

  • Similar to a Histogram; typical uses include request duration and response size.
  • Provides a count and sum of the observations.
  • Provides quantiles, i.e. the tracked results can be divided by percentile.

Histograms and summaries serve similar purposes, but a summary computes its quantiles on the client side, recalculating after every new observation, which can hurt client performance, whereas a histogram's quantiles can be computed on the server side. Choose whichever container fits the service's situation.

4. Usage Example

This example uses the monitoring data in the broker of the distributed version of xid2.0.

Development environment: MacBook Pro

Language: Go

4.1 Client Side

Define two counters and one histogram:

var (
	durationsHistogram = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "xxx_op_time",
			Help:    "milliseconds latency distributions.",
			Buckets: metricsBuckets,
		},
		[]string{"handle", "stage"},
	)
	ReqCounter = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "xxx_http_request_total",
		},
		[]string{"handle"},
	)
	ResCounter = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "xxx_http_response_total",
		},
		[]string{"code"},
	)
)

Register the metrics at initialization:
prometheus.MustRegister(durationsHistogram)
prometheus.MustRegister(ResCounter)
prometheus.MustRegister(ReqCounter)


Example usage in the program:
// count requests
ReqCounter.WithLabelValues("/api/xxx/path").Inc()
// time a request
start := time.Now()
...
durationsHistogram.WithLabelValues(setNameString, "xxx").Observe(time.Since(start).Seconds() * 1000)

4.2 Server Installation

brew install prometheus  # the Prometheus server; `brew services` is bundled with Homebrew

Configure the scrape targets for Prometheus:

global:
  scrape_interval: 10s   # default pull interval
  scrape_timeout: 10s
  evaluation_interval: 10m
scrape_configs:
  - job_name: spring-boot
    scrape_interval: 5s
    scrape_timeout: 5s
    metrics_path: /metrics  # URI the metrics are served on
    scheme: http
    basic_auth:
      username: user
      password: 123456
    static_configs:
      - targets:
        - 127.0.0.1:8888 # the application's IP + port
        - 127.0.0.1:9999 # multiple clients may be listed

Start the server:

prometheus --config.file=PATH/prometheus.yml

Visit http://localhost:9090/targets to check the state of each target: the first service is running and shows UP; the second client has not been started, so it shows DOWN.

Click a target's Endpoint link to see the metrics it returns:

# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0.000349093
go_gc_duration_seconds{quantile="0.25"} 0.000438804
go_gc_duration_seconds{quantile="0.5"} 0.000482303
go_gc_duration_seconds{quantile="0.75"} 0.000549487
go_gc_duration_seconds{quantile="1"} 0.005140537
go_gc_duration_seconds_sum 9.595970185
go_gc_duration_seconds_count 17804
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 510
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.11.2"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 1.7242784e+07
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 1.09448288456e+11
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.558332e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 8.76379383e+08
# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
# TYPE go_memstats_gc_cpu_fraction gauge
go_memstats_gc_cpu_fraction 0.00043950832257221354
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 2.404352e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 1.7242784e+07
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 3.4349056e+07
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 2.3027712e+07
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 76103
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 0
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 5.7376768e+07
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.5578887209621124e+09
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 0
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 8.76455486e+08
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 82944
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 98304
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 387752
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 442368
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 2.1128848e+07
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 1.0895804e+07
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 9.732096e+06
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 9.732096e+06
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 8.2508024e+07
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 110
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 5349.91
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 655360
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 411
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 4.16768e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.55783237186e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 4.801744896e+09
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes -1
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 488
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
# HELP request_number 
# TYPE request_number counter
request_number{handle="/xid/v2/search"} 1.475309e+06
# HELP response_number 
# TYPE response_number counter
response_number{code="0"} 1.475209e+06
# HELP search_cost_time_detail milliseconds latency distributions.
# TYPE search_cost_time_detail histogram
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="rpc",le="0"} 0
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="rpc",le="50"} 0
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="rpc",le="100"} 20403
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="rpc",le="150"} 528289
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="rpc",le="200"} 1.436994e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="rpc",le="250"} 1.474139e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="rpc",le="300"} 1.475171e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="rpc",le="350"} 1.475203e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="rpc",le="400"} 1.475208e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="rpc",le="450"} 1.475208e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="rpc",le="500"} 1.475208e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="rpc",le="550"} 1.475208e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="rpc",le="600"} 1.475209e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="rpc",le="650"} 1.475209e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="rpc",le="700"} 1.475209e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="rpc",le="750"} 1.475209e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="rpc",le="800"} 1.475209e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="rpc",le="850"} 1.475209e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="rpc",le="900"} 1.475209e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="rpc",le="950"} 1.475209e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="rpc",le="+Inf"} 1.475209e+06
search_cost_time_detail_sum{set_name="wyx_50w_50_1",stage="rpc"} 2.3143792061221814e+08
search_cost_time_detail_count{set_name="wyx_50w_50_1",stage="rpc"} 1.475209e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="search",le="0"} 0
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="search",le="50"} 0
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="search",le="100"} 20158
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="search",le="150"} 514676
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="search",le="200"} 1.436145e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="search",le="250"} 1.474113e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="search",le="300"} 1.475162e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="search",le="350"} 1.475202e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="search",le="400"} 1.475208e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="search",le="450"} 1.475208e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="search",le="500"} 1.475208e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="search",le="550"} 1.475208e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="search",le="600"} 1.475209e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="search",le="650"} 1.475209e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="search",le="700"} 1.475209e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="search",le="750"} 1.475209e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="search",le="800"} 1.475209e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="search",le="850"} 1.475209e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="search",le="900"} 1.475209e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="search",le="950"} 1.475209e+06
search_cost_time_detail_bucket{set_name="wyx_50w_50_1",stage="search",le="+Inf"} 1.475209e+06
search_cost_time_detail_sum{set_name="wyx_50w_50_1",stage="search"} 2.3194894158179152e+08
search_cost_time_detail_count{set_name="wyx_50w_50_1",stage="search"} 1.475209e+06

5. Visualization with Grafana

Grafana visualizes the monitoring data and can process it using Prometheus's query language and functions. The dashboard below tracks total requests and total responses, real-time QPS, the number of goroutines over time, and request latency broken down by stage.

(Figure 2: Grafana dashboard)

5.1 Installation

brew install grafana
brew services start grafana  # start the service

Grafana serves on localhost:3000 by default; the default username and password are both admin, and the first login asks you to set a new password.

5.2 Basic Usage

  • First add a data source and choose Prometheus, as shown in the screenshot below.
  • Enter the HTTP URL, i.e. the address of the Prometheus server, then click Save & Test; proceed only once "Data Source is working" appears:

    (Figure 3)

  • Create a new dashboard and click Choose Visualization:

    (Figure 4)

  • In the query field, enter the metric to monitor; the UI suggests completions from the existing data. Once a valid query is entered, the graph appears: (Figure 5)
  • Set the panel to show the last 5 minutes of data with an automatic refresh every 5 seconds:

    (Figure 6)

5.3 Filtering and Processing Data

Query filtering

As seen when configuring Prometheus, multiple targets can be defined; in Grafana, query expressions can filter the data by target. First, look at the complete label set of one bucket of a histogram:

search_cost_time_detail_bucket{instance="10.31.11.158:8100",job="broker",set_name="wyx_50w_50_1",le="400",stage="search"}
search_cost_time_detail_bucket{instance="127.0.0.1:8100",job="broker",set_name="wyx_50w_50_1",le="400",stage="search"}

When brokers on different nodes use the same histogram name, the instance label selects data from a particular address, and the stage label likewise filters by stage. For example, to see only the data from 10.31.11.158:8100:

search_cost_time_detail_bucket{instance="10.31.11.158:8100"}

Similarly, to look only at a particular data set or stage, add the corresponding label matcher inside the braces.

Quantile calculation

histogram_quantile(0.9, rate(search_cost_time_detail_bucket{job="broker",stage="rpc"}[100s]))
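histogram_quantile picks the bucket that the target rank falls into and interpolates linearly between that bucket's bounds. A sketch of the calculation on a single set of cumulative bucket counts (in PromQL it actually runs on the rate() of each bucket series; the counts below are illustrative):

```go
package main

import "fmt"

// bucketQuantile mimics PromQL's histogram_quantile: given ascending
// upper bounds and cumulative counts (last entry is the +Inf bucket),
// find the bucket containing rank q*total and interpolate inside it.
func bucketQuantile(q float64, bounds []float64, cum []float64) float64 {
	total := cum[len(cum)-1]
	rank := q * total
	for i, c := range cum {
		if c >= rank {
			if i == len(bounds) { // rank fell in the +Inf bucket
				return bounds[len(bounds)-1]
			}
			lower, prev := 0.0, 0.0
			if i > 0 {
				lower, prev = bounds[i-1], cum[i-1]
			}
			// linear interpolation by the fraction of this bucket's count
			return lower + (bounds[i]-lower)*(rank-prev)/(c-prev)
		}
	}
	return bounds[len(bounds)-1]
}

func main() {
	bounds := []float64{100, 150, 200, 250}
	cum := []float64{20, 530, 1437, 1474, 1475} // last entry is +Inf
	// 90th percentile lands in the le=200 bucket, a bit below 194ms
	fmt.Println(bucketQuantile(0.9, bounds, cum))
}
```

Because the estimate is interpolated within a bucket, its accuracy depends on how finely the buckets are chosen around the latencies you care about.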

QPS

delta(request_number{handle="/xid/v2/search"}[10s])/10
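delta() measures the counter's increase over the 10s window, so dividing by 10 yields requests per second (PromQL's rate() combines both steps and also handles counter resets). The underlying arithmetic is simply:

```go
package main

import "fmt"

// qps derives requests-per-second from two counter samples taken
// secs seconds apart, the same arithmetic as delta(...[10s])/10.
func qps(prev, cur, secs float64) float64 {
	if cur < prev { // counter reset (process restart): treat cur as the whole increase
		prev = 0
	}
	return (cur - prev) / secs
}

func main() {
	// the counter went from 1469000 to 1475000 over a 10s window
	fmt.Println(qps(1469000, 1475000, 10)) // 600 requests/second
}
```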

5.4 Alerting

To watch a metric and send an email notification when it crosses a threshold, first edit the Grafana configuration file to enable SMTP:

[smtp]
enabled = true
host = smtp.qq.com:25
user = [email protected]
# If the password contains # or ; you have to wrap it with triple quotes. Ex """#password;"""
password = ******************
#cert_file =
#key_file =
skip_verify = true
from_address = [email protected]
from_name = Grafana
ehlo_identity =

Restart the Grafana service after changing the configuration.

Next, create a new Email notification channel under Alerting and save it.

Taking QPS as the example: alert when the 10s average QPS falls below 600. The rule is evaluated every 10s; the first evaluation that finds QPS below 600 does not fire or send mail. The email is sent only once the condition has held for 30s (configurable, via the duration in the rule's first row).

To test, the load-generation tool was stopped manually; after the QPS drop was detected and the 30s hold elapsed, the alert fired and the notification email below arrived:

When the alert resolves, a recovery email is sent as well.

 
