Prometheus (version 2.0.0) series, part 2

Understanding the Prometheus configuration

Prometheus is configured in YAML. This article walks through a sample configuration and explains it.

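The official configuration documentation is the authoritative reference (it is in English, so good English helps).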

# Global settings apply to every section below; a setting repeated in a specific section overrides the global value.
global:
  scrape_interval:     15s
  # Scrape interval: how often the Prometheus server pulls metrics from a given exporter/endpoint.
  evaluation_interval: 30s
  # Evaluation interval: how often the Prometheus server evaluates its recording and alerting rules against the collected metrics.
  # scrape_timeout is not set here, so it stays at the global default (10s).

  external_labels:
  # External labels: custom key/value pairs (several are allowed) attached to data sent to external systems.
    monitor: codelab
    foo:     bar

rule_files:
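# Rule files or paths to load. Several may be listed, and glob patterns such as * are supported (not full regular expressions). A sample rule file is shown later in this article.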
- "first.rules"
- "my/*.rules"

remote_write:
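# Remote write: sends the samples collected by this server to another host. write_relabel_configs can act on selected labels first (here, series whose name matches expensive.* are dropped).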
  - url: http://remote1/push
    write_relabel_configs:
    - source_labels: [__name__]
      regex:         expensive.*
      action:        drop
  - url: http://remote2/push

remote_read:
# Remote read: fetches samples from other hosts. required_matchers limits the endpoint to a specific service/job by filtering on labels.
  - url: http://remote1/read
    read_recent: true
  - url: http://remote3/read
    read_recent: false
    required_matchers:
      job: special

scrape_configs:
# Scrape configuration
- job_name: prometheus
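  # honor_labels resolves conflicts between labels in the scraped data and labels attached by the server.
  # true keeps the scraped labels; false (the default) renames the conflicting scraped labels to exported_<name> and applies the server-side labels.
  honor_labels: true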
  # scrape_interval is defined by the configured global (15s).
  # scrape_timeout is defined by the global default (10s).

  # metrics_path defaults to '/metrics'
  # scheme defaults to 'http'.

  file_sd_configs:
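  # File-based service discovery: targets are read from the listed files or glob patterns. Changes to these files are picked up through file watches without a reload, and the files are also re-read every refresh_interval. The file format is shown later in this article.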
    - files:
      - foo/*.slow.json
      - foo/*.slow.yml
      - single/file.yml
      refresh_interval: 10m
    - files:
      - bar/*.yaml

  static_configs:
  # Static configuration of targets, in the format below; the list contains endpoints (host:port).
  - targets: ['localhost:9090', 'localhost:9191']
    labels:  # Target labels: custom key/value pairs, several are allowed
      my:   label
      your: label

  relabel_configs:  # Relabeling rules
  - source_labels: [job, __meta_dns_name]  # Source labels
    regex:         (.*)some-[regex]  # Regular expression to match
    target_label:  job  # Target label
    replacement:   foo-${1}  # Replacement value
    # action defaults to 'replace'
  - source_labels: [abc]
    target_label:  cde
  - replacement:   static
    target_label:  abc
  - regex:
    replacement:   static
    target_label:  abc

  bearer_token_file: valid_token_file


- job_name: service-x

  basic_auth:  # Credentials for targets that require authentication
    username: admin_name
    password: "multiline\nmysecret\ntest"

  scrape_interval: 50s
  scrape_timeout:  5s

  sample_limit: 1000

  metrics_path: /my_path
  scheme: https

  dns_sd_configs:
  - refresh_interval: 15s
    names:
    - first.dns.address.domain.com
    - second.dns.address.domain.com
  - names:
    - first.dns.address.domain.com
    # refresh_interval defaults to 30s.

  relabel_configs:
  - source_labels: [job]
    regex:         (.*)some-[regex]
    action:        drop
  - source_labels: [__address__]
    modulus:       8
    target_label:  __tmp_hash
    action:        hashmod
  - source_labels: [__tmp_hash]
    regex:         1
    action:        keep
  - action:        labelmap
    regex:         1
  - action:        labeldrop
    regex:         d
  - action:        labelkeep
    regex:         k

  metric_relabel_configs:
  - source_labels: [__name__]
    regex:         expensive_metric.*
    action:        drop

- job_name: service-y

  consul_sd_configs:
  - server: 'localhost:1234'
    token: mysecret
    services: ['nginx', 'cache', 'mysql']
    scheme: https
    tls_config:
      ca_file: valid_ca_file
      cert_file: valid_cert_file
      key_file:  valid_key_file
      insecure_skip_verify: false

  relabel_configs:
  - source_labels: [__meta_sd_consul_tags]
    separator:     ','
    regex:         label:([^=]+)=([^,]+)
    target_label:  ${1}
    replacement:   ${2}

- job_name: service-z

  tls_config:
    cert_file: valid_cert_file
    key_file: valid_key_file

  bearer_token: mysecret

- job_name: service-kubernetes

  kubernetes_sd_configs:
  - role: endpoints
    api_server: 'https://localhost:1234'

    basic_auth:
      username: 'myusername'
      password: 'mysecret'

- job_name: service-kubernetes-namespaces

  kubernetes_sd_configs:
  - role: endpoints
    api_server: 'https://localhost:1234'
    namespaces:
      names:
        - default

- job_name: service-marathon
  marathon_sd_configs:
  - servers:
    - 'https://marathon.example.com:443'

    tls_config:
      cert_file: valid_cert_file
      key_file: valid_key_file

- job_name: service-ec2
  ec2_sd_configs:
    - region: us-east-1
      access_key: access
      secret_key: mysecret
      profile: profile

- job_name: service-azure
  azure_sd_configs:
    - subscription_id: 11AAAA11-A11A-111A-A111-1111A1111A11
      tenant_id: BBBB222B-B2B2-2B22-B222-2BB2222BB2B2
      client_id: 333333CC-3C33-3333-CCC3-33C3CCCCC33C
      client_secret: mysecret
      port: 9100

- job_name: service-nerve
  nerve_sd_configs:
    - servers:
      - localhost
      paths:
      - /monitoring

- job_name: 0123service-xxx
  metrics_path: /metrics
  static_configs:
    - targets:
      - localhost:9090

- job_name: 測試
  metrics_path: /metrics
  static_configs:
    - targets:
      - localhost:9090

- job_name: service-triton
  triton_sd_configs:
  - account: 'testAccount'
    dns_suffix: 'triton.example.com'
    endpoint: 'triton.example.com'
    port: 9163
    refresh_interval: 1m
    version: 1
    tls_config:
      cert_file: testdata/valid_cert_file
      key_file: testdata/valid_key_file

alerting:  # Alertmanager instances that receive alerts; more than one endpoint may be listed
  alertmanagers:
  - scheme: https
    static_configs:
    - targets:
      - "1.2.3.4:9093"
      - "1.2.3.5:9093"
      - "1.2.3.6:9093"
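
The first rule under relabel_configs in the prometheus job relies on the default replace action. As a rough illustration of what that action does, the Python sketch below joins the source label values with the separator, matches the regex against the whole joined string, and writes the expanded replacement to the target label. The function name relabel_replace and the sample label values are made up for this article, Python's re module stands in for the RE2 syntax Prometheus uses, and only the replace action is modeled, so treat it as an approximation rather than Prometheus's actual implementation.

```python
import re


def relabel_replace(labels, source_labels, target_label,
                    regex="(.*)", replacement="$1", separator=";"):
    """Approximate the 'replace' relabel action on a dict of labels.

    Hypothetical helper for illustration only; not part of Prometheus.
    """
    # Concatenate the source label values; a missing label counts as "".
    joined = separator.join(labels.get(name, "") for name in source_labels)

    # The regex has to match the entire joined value (it is anchored).
    match = re.fullmatch(regex, joined)
    if match is None:
        return dict(labels)  # no match: labels are left unchanged

    # Translate $1 / ${1} references into Python's \g<1> template syntax.
    template = re.sub(
        r"\$\{(\w+)\}|\$(\w+)",
        lambda m: r"\g<%s>" % (m.group(1) or m.group(2)),
        replacement.replace("\\", "\\\\"),
    )
    value = match.expand(template)

    result = dict(labels)
    if value:
        result[target_label] = value
    else:
        result.pop(target_label, None)  # an empty value removes the label
    return result


if __name__ == "__main__":
    before = {"job": "api", "__meta_dns_name": "some-x"}  # made-up values
    after = relabel_replace(
        before,
        source_labels=["job", "__meta_dns_name"],
        target_label="job",
        regex=r"(.*)some-[regex]",
        replacement="foo-${1}",
    )
    print(after["job"])
```

With these made-up values the joined string is api;some-x, so the capture group includes the separator and the job label becomes foo-api;. Note also that [regex] in the pattern is a character class matching a single r, e, g or x, not the literal word.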

Rule format:
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

Sample configuration

groups:  # Rule groups
- name: example  # Name of the first group
  rules:  # Rules in this group; a group may hold many alerting rules, two are listed here
  - alert: HighErrorRate1
    expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
    for: 10m
    labels:
      severity: page
    annotations:
      summary: High request latency
  - alert: HighErrorRate2
    expr: job:request_latency_seconds:mean5m{job="yourjob"} > 0.5
    for: 10m
    labels:
      severity: page
    annotations:
      summary: High request latency

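Sample target file for file_sd_configs, in JSON format. The top level is a list of services (target groups), and the targets within each one are also a list: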

[
{
    "targets": [
        "127.0.0.1:9104"
    ],
    "labels": {
        "job":"job1",
        "service":"service1"
    }
},
{
    "targets": [
        "127.0.0.1:9105",
        "127.0.0.1:9106",
        "127.0.0.1:9107"
    ],
    "labels": {
        "job":"job2",
        "service":"service2"
    }
}
]
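
Target files like this are often generated by a script rather than written by hand. The Python sketch below builds the same two target groups and writes them out as JSON. The helper name write_target_file and the output path targets.json are assumptions made for this example. It writes to a temporary file first and then renames it, so that Prometheus does not read a half-written file.

```python
import json
import os
import tempfile


def write_target_file(path, groups):
    """Write file_sd target groups to `path` as JSON (hypothetical helper)."""
    for group in groups:
        # Each group needs a list of "host:port" strings under "targets".
        if not isinstance(group.get("targets"), list):
            raise ValueError("each group needs a 'targets' list")

    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then rename it into
    # place so a reader never sees a partially written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=4)
    # mkstemp creates the file readable by its owner only; widen the mode so
    # a Prometheus process running as another user can read it.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)


GROUPS = [
    {"targets": ["127.0.0.1:9104"],
     "labels": {"job": "job1", "service": "service1"}},
    {"targets": ["127.0.0.1:9105", "127.0.0.1:9106", "127.0.0.1:9107"],
     "labels": {"job": "job2", "service": "service2"}},
]

if __name__ == "__main__":
    # "targets.json" is an example path; it has to match a pattern listed
    # under file_sd_configs in prometheus.yml for Prometheus to pick it up.
    write_target_file("targets.json", GROUPS)
```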

One more point worth noting concerns reloading the Prometheus configuration:

Prometheus can reload its configuration at runtime. If the new configuration is not well-formed, the changes will not be applied. A configuration reload is triggered by sending a SIGHUP to the Prometheus process or sending a HTTP POST request to the /-/reload endpoint (when the --web.enable-lifecycle flag is enabled). This will also reload any configured rule files.

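To reload the configuration by sending a request to the endpoint, Prometheus has to be started with the following flag (shown for two ways of running it):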

Starting from the command line:
nohup ./prometheus --web.enable-lifecycle --config.file=prometheus.yml &

Or edit the service unit file:
"/usr/lib/systemd/system/prometheus.service"
# -*- mode: conf -*-

[Unit]
Description=The Prometheus monitoring system and time series database.
Documentation=https://prometheus.io
After=network.target

[Service]
EnvironmentFile=-/etc/default/prometheus
User=prometheus
# Note: --web.enable-lifecycle is not present by default; it must be added to enable the Lifecycle APIs.
ExecStart=/usr/bin/prometheus \
          --web.enable-lifecycle \
          --config.file=/etc/prometheus/prometheus.yml \
          --storage.tsdb.path=/var/lib/prometheus/data \
          --web.console.libraries=/usr/share/prometheus/console_libraries \
          --web.console.templates=/usr/share/prometheus/consoles \
          $PROMETHEUS_OPTS
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure

[Install]
WantedBy=multi-user.target

There are two ways to make Prometheus reload its configuration:

1. Send a SIGHUP signal to the main Prometheus process:

kill -1 pid

2. Send a POST request to the reload endpoint:

curl -XPOST http://ip:9090/-/reload
# This method only works if Prometheus was started with the --web.enable-lifecycle flag described above.
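
Both reload methods can also be driven from a script, for instance at the end of a deployment step. The Python sketch below uses only the standard library. The function names, the default URL http://localhost:9090 and the pid argument are assumptions for illustration, and the HTTP variant still requires the --web.enable-lifecycle flag.

```python
import os
import signal
import urllib.request


def build_reload_request(base_url="http://localhost:9090"):
    """Build the POST request for the /-/reload endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_via_http(base_url="http://localhost:9090", timeout=10):
    """Ask Prometheus to reload over HTTP and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so a
    failed reload surfaces as an exception rather than a return value.
    """
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def reload_via_signal(pid):
    """Send SIGHUP to the Prometheus process with the given pid (Unix only)."""
    os.kill(pid, signal.SIGHUP)
```

Calling reload_via_http("http://ip:9090") corresponds to the curl command, and reload_via_signal(pid) corresponds to kill -1 pid.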
