Study notes on Prometheus federation, Alertmanager gossip, Thanos, and related topics

1. Federation cluster plan: three servers, running one global node and two shard nodes in total

Server  Role    IP:port
node1   global  192.168.11.192:9090
node2   shard1  192.168.11.193:9090
node3   shard2  192.168.11.194:9090

[Figure: architecture of a large-scale Prometheus deployment]

2. Prometheus (global node)

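The global Prometheus node acts as the alerting node. When near-real-time data is required, a proxy can forward queries directly to the backend (shard) Prometheus instances instead.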

# Create the Prometheus working directories
mkdir /data/prometheus/{data,conf,conf/rules,conf/sd_config} -p
chown -R  65534:65534 /data/prometheus/data

# Prometheus configuration file
cat > /data/prometheus/conf/prometheus.yml << 'EOF'
global:
  scrape_interval:     30s
  evaluation_interval: 30s
  scrape_timeout:      10s

# Load the alerting rules
rule_files:
  - "/etc/prometheus/rules/*.rules"

# Send alerts to the highly available Alertmanager cluster
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 192.168.11.192:9093
      - 192.168.11.193:9093
      - 192.168.11.194:9093
    timeout: 10s
 
scrape_configs:
# Federation: pull selected series from the shard nodes
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{job="node"}'
        - '{job="Linux"}'
        - '{job="springboot"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
        - 192.168.11.193:9090
        - 192.168.11.194:9090

  # Prometheus self-monitoring
  - job_name: prometheus
    metrics_path: '/metrics' # default
    scheme: 'http'  # default
    scrape_interval: 30s    # overrides the global setting
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: prometheus
EOF
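Before relying on the federate job, it can help to query a shard's /federate endpoint by hand with the same match[] selectors. The following is a minimal sketch that only prints the curl command rather than running it; federate_cmd is a hypothetical helper name, not something Prometheus provides.

```shell
#!/usr/bin/env bash
# federate_cmd (hypothetical helper): print a curl command that asks one shard
# for the series the global node would federate.
#   $1    shard address (host:port)
#   $2..  one or more match[] selectors
federate_cmd() {
  local target="$1" cmd m
  shift
  cmd="curl -sG 'http://${target}/federate'"
  for m in "$@"; do
    # --data-urlencode lets curl escape the braces and quotes in the selector
    cmd="${cmd} --data-urlencode 'match[]=${m}'"
  done
  printf '%s\n' "$cmd"
}

# Example: build the command for shard1 with two of the selectors from the config
federate_cmd 192.168.11.193:9090 '{job="node"}' '{__name__=~"job:.*"}'
```

Pasting the printed command into a shell on the global node should return the matching series in the text exposition format if the shard is reachable and has data for those selectors.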

# First set of alerting rules
curl https://raw.githubusercontent.com/NoviceZeng/DevOps/master/monitor/first_rules.yml -o /data/prometheus/conf/rules/first_rules.rules
sed -i 's/status:/severity:/g' /data/prometheus/conf/rules/first_rules.rules

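# Second set of alerting rules; named *.rules so that the rule_files glob in prometheus.yml picks it up
cat > /data/prometheus/conf/rules/alert.rules << 'EOF'
groups:
  - name: prometheus
    rules:
      - alert: PrometheusNodeDown
        # up == 0 means the last scrape of that target failed; without sum() the job and instance labels stay available to the annotations
        expr: up{job="prometheus"} == 0
        for: 1m
        labels:
          # "严重" (critical) must match the severity matcher in the Alertmanager route
          severity: 严重
          team: node-prometheus
        annotations:
          summary: "{{ $labels.job }} has been down for more than 1 minute!"
          description: "{{ $labels.instance }} stopped unexpectedly, please handle it as soon as possible!"
          value: '{{ $value }}'
EOF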
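Rule files can be validated before Prometheus loads them. The sketch below assumes promtool is available at /bin/promtool inside the prom/prometheus image and reuses the rules directory created earlier; check_rules is a hypothetical wrapper name.

```shell
#!/usr/bin/env bash
# check_rules (hypothetical wrapper): run "promtool check rules" in a throwaway
# container for each rule file name given, relative to the rules directory.
check_rules() {
  local f
  for f in "$@"; do
    docker run --rm \
      -v /data/prometheus/conf/rules:/etc/prometheus/rules:ro \
      --entrypoint /bin/promtool \
      prom/prometheus:v2.28.0 \
      check rules "/etc/prometheus/rules/${f}" || return 1
  done
}

# Usage (needs Docker on the host):
#   check_rules first_rules.rules
```

File names are passed explicitly because a shell glob would be expanded on the host, not inside the container.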

# Startup script
cat > /data/prometheus/start.sh << 'EOF'
docker run -d \
--name prometheus \
--restart=always \
-p 9090:9090 \
-v /data/prometheus/conf/prometheus.yml:/etc/prometheus/prometheus.yml  \
-v /data/prometheus/conf/rules:/etc/prometheus/rules \
-v /data/prometheus/conf/sd_config:/etc/prometheus/sd_config \
-v /data/prometheus/data:/data/prometheus \
-v /etc/localtime:/etc/localtime:ro \
prom/prometheus:v2.28.0 \
--web.read-timeout=5m \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/data/prometheus \
--web.max-connections=512 \
--storage.tsdb.retention.time=30d \
--query.timeout=2m \
--web.enable-lifecycle  \
--web.listen-address=:9090  \
--web.enable-admin-api
EOF
bash /data/prometheus/start.sh
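Once the container is running, the management endpoints give a quick check, and because the start script passes --web.enable-lifecycle, the configuration can be reloaded over HTTP without a restart. A minimal sketch, assuming Prometheus is reachable on localhost:9090; the function names and the PROM_URL variable are hypothetical.

```shell
#!/usr/bin/env bash
# Base URL of the Prometheus instance (assumption: the port published by start.sh)
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Succeeds when the server process is up
prom_healthy() { curl -fsS "${PROM_URL}/-/healthy"; }

# Succeeds when the server is ready to serve queries
prom_ready() { curl -fsS "${PROM_URL}/-/ready"; }

# Ask Prometheus to re-read prometheus.yml and the rule files
# (only works when --web.enable-lifecycle is set)
prom_reload() { curl -fsS -X POST "${PROM_URL}/-/reload"; }

# Usage:
#   prom_healthy && prom_ready
#   prom_reload
```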

3. Alertmanager high availability

# Create the Alertmanager working directories
mkdir /data/alertmanager/{conf,template} -p
# Alertmanager configuration file
cat > /data/alertmanager/conf/alertmanager.yml << 'EOF'
global:
  resolve_timeout: 1m
  smtp_from: '[email protected]'
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'XXXXXX'  
  smtp_require_tls: false
  smtp_hello: 'qq.com'
  
templates:
  - '/etc/alertmanager/email.tmpl' # email template file (path inside the container)

route:
  receiver: 'wechat.webhook'
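  # Group alerts by alertname (add more labels to the list as needed)
  group_by: ['alertname']
  # How long to wait before the first notification so that alerts of the same group are sent together
  group_wait: 1m
  # Interval between notifications when new alerts join an already-notified group
  group_interval: 10m
  # Re-send an unchanged alert only after 30m; repeats go out on a group_interval tick, so expect roughly (10+30)m here
  repeat_interval: 30m
  routes:
    # match_re can be used instead for regex matching
    - match:
        severity: 严重
      # On a match, send to the receiver named here (wechat.webhook)
      receiver: wechat.webhook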

receivers:
# WeCom (Enterprise WeChat) bot (method 2)
- name: 'wechat.webhook'
  webhook_configs:
  - url: 'http://192.168.11.221:18089/alert0'
    send_resolved: false

- name: 'web.hook'
  webhook_configs:
  - url: 'http://172.31.23.2:8080'
- name: 'email'
  email_configs:
    - to: '[email protected]'
      html: '{{ template "email.jwolf.html" . }}'
      send_resolved: true

# Inhibition rule: while a critical alert is firing, suppress the matching warning alert
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']      
EOF
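The "(10+30)m" remark on repeat_interval can be worked through with a small calculation. This is a simplified model, not Alertmanager's actual code: it assumes a group is only re-evaluated on group_interval ticks and that the previous notification is recorded slightly after its tick, so a tick landing exactly on repeat_interval is still too early. The function name next_repeat is hypothetical.

```shell
#!/usr/bin/env bash
# next_repeat (hypothetical helper): minutes until an unchanged alert is re-sent,
# under the simplified tick model described above.
#   $1  group_interval in minutes
#   $2  repeat_interval in minutes
next_repeat() {
  local group_interval="$1" repeat_interval="$2"
  # first tick that falls strictly after repeat_interval
  echo $(( (repeat_interval / group_interval + 1) * group_interval ))
}

next_repeat 10 30   # prints 40, the (10+30)m case from the config
next_repeat 10 25   # prints 30
```

Under this model, choosing a repeat_interval just below a multiple of group_interval avoids the extra wait.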


cat >  /data/alertmanager/conf/email.tmpl<< 'EOF'
{{ define "email.jwolf.html" }}
{{ range $i ,$alert := .Alerts }}
=========start==========
Severity: {{ $alert.Labels.severity }}
Alert name: {{ $alert.Labels.alertname }}
Affected host: {{ $alert.Labels.instance }}
Summary: {{ $alert.Annotations.summary }}
Details: {{ $alert.Annotations.description }}
Alert value: {{ $alert.Annotations.value }}
Triggered at: {{ $alert.StartsAt }}
=========end==========
{{ end }} {{ end }}
EOF

# Startup script
cat > /data/alertmanager/start.sh << 'EOF'
docker run -d \
--name alertmanager \
--restart=always \
--network host \
-v /data/alertmanager/conf/:/etc/alertmanager/ \
-v /etc/localtime:/etc/localtime:ro \
prom/alertmanager:v0.22.2 \
--config.file="/etc/alertmanager/alertmanager.yml" \
--cluster.listen-address="0.0.0.0:9094" \
--cluster.peer=192.168.11.192:9094
EOF
bash /data/alertmanager/start.sh
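The start script points every node at 192.168.11.192 as its only gossip peer. A possible variation, sketched here as an assumption rather than what these notes deployed, is to give each node the other two members, since --cluster.peer may be repeated. The helper name cluster_peer_flags is hypothetical.

```shell
#!/usr/bin/env bash
# cluster_peer_flags (hypothetical helper): print one --cluster.peer flag for
# every cluster member except the node itself.
#   $1    IP of the node the flags are for
#   $2..  IPs of all cluster members
cluster_peer_flags() {
  local self="$1" flags="" ip
  shift
  for ip in "$@"; do
    # a node does not need to list itself as a peer
    [ "$ip" = "$self" ] && continue
    flags="${flags} --cluster.peer=${ip}:9094"
  done
  # drop the leading space left by the first append
  printf '%s\n' "${flags# }"
}

# Flags for node2 (192.168.11.193):
cluster_peer_flags 192.168.11.193 192.168.11.192 192.168.11.193 192.168.11.194
# prints: --cluster.peer=192.168.11.192:9094 --cluster.peer=192.168.11.194:9094
```

With more than one peer listed, a node can still join the cluster when the first peer happens to be down at startup.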


4. Remote storage
Options:

  • Elasticsearch
  • OpenTSDB
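Prometheus ships samples to external storage through the remote_write section of prometheus.yml; for backends like the two above this usually means an adapter service sitting in between. The sketch below only prints the stanza to add. The adapter URL is a placeholder, not a real service, and remote_write_fragment is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# remote_write_fragment (hypothetical helper): print a remote_write stanza
# for prometheus.yml.
#   $1  write endpoint of the storage adapter (placeholder URL in the example)
remote_write_fragment() {
  cat << EOF
remote_write:
  - url: "$1"
EOF
}

# Example with a made-up adapter address
remote_write_fragment "http://adapter.example:9201/write"
```

The printed stanza belongs at the top level of prometheus.yml, next to scrape_configs.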

References:
https://blog.csdn.net/dengxiangbao3167/article/details/102365229

5. Thanos
Highly available Prometheus: Thanos
References:
https://www.kubernetes.org.cn/7217.html
https://www.cnblogs.com/danny-djy/articles/13230529.html
https://www.cnblogs.com/rongfengliang/p/11319933.html
