docker+prometheus+grafana+alertmanager alerting setup and usage (illustrated walkthrough)

Preface:

Once a service goes live, you need monitoring to know whether it is still available. If something breaks in production, you want to notice the fault before your customers do, which again requires monitoring. The same goes for databases, servers, and every other layer of the stack.

In recent years, with the rise of microservice architectures, the number of services keeps growing and so does the number of metrics, so monitoring has become more and more complex and needs a system that can adapt to this change.

We used to monitor with Zabbix and StatsD, but the popularity of containers and microservices calls for something new, and that is where the Prometheus project comes in.

All of the steps below assume Docker is already installed. If you are not sure how to install Docker, check my earlier posts or send me a private message!

 

1. Deploy cAdvisor

docker run --name cadvisor -d -p 8090:8080 \
  -v /:/rootfs:ro \
  -v /var/run:/var/run:rw \
  -v /sys:/sys:ro \
  -v /var/lib/docker/:/var/lib/docker:ro \
  google/cadvisor
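Once the container is running, a quick way to confirm cAdvisor is exporting container metrics (a minimal check, assuming it is reachable on the mapped port 8090):

# container_cpu_usage_seconds_total should appear once cAdvisor has collected the local containers
curl -s http://localhost:8090/metrics | grep container_cpu_usage_seconds_total | head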

 

2. Deploy node-exporter

docker run --name=node -d -p 9100:9100 \
  -v "/proc:/host/proc:ro" \
  -v "/sys:/host/sys:ro" \
  -v "/:/rootfs:ro" \
  --net="host" \
  quay.io/prometheus/node-exporter:v0.13.0 \
  -collector.procfs /host/proc \
  -collector.sysfs /host/sys \
  -collector.filesystem.ignored-mount-points "^/(sys|proc|dev|host|etc)($|/)"
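Likewise, you can confirm node-exporter is serving host metrics (a minimal check, assuming port 9100 is open on the host):

# node_load1 is the 1-minute load average used by the CPU alert rule later on
curl -s http://localhost:9100/metrics | grep ^node_load1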

 

3. Deploy Prometheus

docker run --name=prometheus -d -p 9090:9090 \
  -v `pwd`/conf/prometheus.yml:/etc/prometheus/prometheus.yml \
  --net=host \
  prom/prometheus

Prometheus configuration file:

global:
  scrape_interval:     15s
  evaluation_interval: 15s
  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
      monitor: 'docker-host-alpha'

# Load and evaluate rules in these files every 'evaluation_interval' seconds.
rule_files:
  - "targets.yml"
  - "host.yml"
  - "containers.yml"

# Scrape configurations.
scrape_configs:
  - job_name: 'app_cadvisor'
    scrape_interval: 5s
    static_configs:
      - targets: ['172.16.1.251:8090']

  - job_name: 'app_nodeexporter'
    scrape_interval: 5s
    static_configs:
      - targets: ['172.16.1.251:9100']
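The rule_files listed above only take effect if those files are also visible inside the container (for example by mounting the whole `pwd`/conf directory to /etc/prometheus instead of just prometheus.yml), and Prometheus still needs to be told where Alertmanager lives before it can deliver alerts. A minimal sketch of the extra alerting section, assuming Alertmanager will run on the same host (172.16.1.251) on its default port 9093:

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['172.16.1.251:9093']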

 

4. Configure Grafana

Create the data directory for Grafana first so the container can write to it, then start the container from the original working directory (the volume path uses `pwd`):

mkdir data
cd data
mkdir grafana
chmod 777 grafana
cd ..

docker run --name ziyun56.grafana -d -p 3000:3000 \
  -v `pwd`/data/grafana:/var/lib/grafana \
  grafana/grafana
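After Grafana starts, Prometheus still has to be added as a data source before dashboards can query it. You can do this in the web UI on port 3000, or script it through Grafana's HTTP API; a minimal sketch, assuming the default admin/admin credentials and the Prometheus instance above:

curl -s -X POST http://admin:admin@localhost:3000/api/datasources \
  -H 'Content-Type: application/json' \
  -d '{"name":"Prometheus","type":"prometheus","url":"http://172.16.1.251:9090","access":"proxy","isDefault":true}'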

5. Deploy Alertmanager

docker run --name alertmanager -d \
  -p 9093:9093 \
  -v `pwd`/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
  docker.io/prom/alertmanager:latest

Alertmanager configuration file:

route:
    receiver: 'zyoa_alert_webhook'

receivers:
    - name: 'zyoa_alert_webhook'
      webhook_configs:
        - url: 'http://172.16.253.121:56513/alert/webhook.do?access_token=prometheus_access_token&source=prometheus&dd_alert=true'
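To confirm the webhook route actually reaches your DingTalk forwarder, you can push a test alert straight into Alertmanager instead of waiting for Prometheus to fire one; a minimal sketch using Alertmanager's v1 alerts API (the alert name here is just an illustrative placeholder):

curl -s -X POST http://localhost:9093/api/v1/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"webhook_smoke_test","severity":"warning"},"annotations":{"summary":"Test alert for the zyoa_alert_webhook receiver"}}]'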

Note: the address in the Alertmanager config is mapped out so it can be reached over the public network. If you don't expose it as a separate service, the links in the notifications will point to internal-network addresses, and the messages you receive in DingTalk will not open!
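One common way to handle this is to start Alertmanager with the --web.external-url flag so that the links it generates use the public address; a sketch, where http://your-public-host:9093 is a placeholder to replace with your real mapped address (passing arguments after the image name replaces the default command, so --config.file has to be repeated):

docker run --name alertmanager -d -p 9093:9093 \
  -v `pwd`/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
  docker.io/prom/alertmanager:latest \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --web.external-url=http://your-public-host:9093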

 

The alerting rules, which Prometheus evaluates via the rule_files above and forwards to Alertmanager. I wrote three alerts as examples.

A rule that checks whether a monitored service (scrape target) has gone down:

groups:
- name: monitor
  rules:
  - alert: monitor_service_down
    expr: up == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "Monitor service non-operational"
      description: "Service {{ $labels.instance }} is down."
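You can see exactly what this rule evaluates by querying the up metric through Prometheus's HTTP API (each healthy target should report 1):

curl -s 'http://localhost:9090/api/v1/query?query=up'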

 

Rules that check for CPU, memory, and storage overload (in the CPU expression, the 8 corresponds to the host's CPU core count, so adjust it to your own machine):

groups:
- name: host
  rules:
  - alert: high_cpu_load
    expr: node_load1 / 8 * 100 > 60
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "Server cpu under high load"
      description: "Docker host cpu usage is {{ humanize $value}}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}."

  - alert: high_memory_load
    expr: (node_memory_MemTotal - (node_memory_MemFree + node_memory_Buffers + node_memory_Cached)) / node_memory_MemTotal * 100 > 95
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "Server memory is almost full"
      description: "Docker host memory usage is {{ humanize $value}}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}."

  - alert: high_storage_load
    expr: (node_filesystem_size - node_filesystem_free) / node_filesystem_size * 100 > 85
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "Server storage is almost full"
      description: "Docker host storage usage is {{ humanize $value}}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}."
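Before reloading Prometheus, the rule files can be validated with promtool, which ships inside the prom/prometheus image; a sketch assuming the monitor group is saved as targets.yml and the host group as host.yml under ./conf:

docker run --rm -v `pwd`/conf:/conf --entrypoint promtool prom/prometheus check rules /conf/targets.yml /conf/host.yml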

All of the configuration above has been tested by me personally. Leave a comment if you run into any problems!
