Preface:
Once a service goes live, you need monitoring to know whether it is still available. If a fault occurs in production, you want to detect it before your customers do, and that also requires monitoring. The same goes for databases, servers, and every other layer of the stack.
In recent years, with the rise of microservice architectures, the number of services and therefore the number of monitoring metrics has kept growing, making monitoring much more complex. The old tools we used, Zabbix and StatsD, struggle to keep up with containerization and microservices, so a new monitoring system was needed to match this change. That is where Prometheus comes in.
Everything below assumes Docker is already installed. If you don't know how to install Docker, check my earlier posts, or send me a private message!
1. Deploy cAdvisor
docker run --name cadvisor -d -p 8090:8080 \
-v /:/rootfs:ro \
-v /var/run:/var/run:rw \
-v /sys:/sys:ro \
-v /var/lib/docker/:/var/lib/docker:ro \
google/cadvisor
2. Deploy node-exporter
docker run --name=node -d -p 9100:9100 \
-v "/proc:/host/proc:ro" \
-v "/sys:/host/sys:ro" \
-v "/:/rootfs:ro" \
--net="host" \
quay.io/prometheus/node-exporter:v0.13.0 \
-collector.procfs /host/proc \
-collector.sysfs /host/sys \
-collector.filesystem.ignored-mount-points "^/(sys|proc|dev|host|etc)($|/)"
3. Deploy Prometheus
Note that the rule files referenced in prometheus.yml (targets.yml, host.yml, containers.yml) have to be mounted into the container as well, otherwise Prometheus cannot load them:
docker run --name=prometheus -d -p 9090:9090 \
-v `pwd`/conf/prometheus.yml:/etc/prometheus/prometheus.yml \
-v `pwd`/conf/targets.yml:/etc/prometheus/targets.yml \
-v `pwd`/conf/host.yml:/etc/prometheus/host.yml \
-v `pwd`/conf/containers.yml:/etc/prometheus/containers.yml \
--net=host \
prom/prometheus
The Prometheus configuration file (prometheus.yml):
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'docker-host-alpha'

# Load and evaluate rules from these files every 'evaluation_interval'.
rule_files:
  - "targets.yml"
  - "host.yml"
  - "containers.yml"

# Scrape configurations for the two exporters deployed above.
scrape_configs:
  - job_name: 'app_cadvisor'
    scrape_interval: 5s
    static_configs:
      - targets: ['172.16.1.251:8090']
  - job_name: 'app_nodeexporter'
    scrape_interval: 5s
    static_configs:
      - targets: ['172.16.1.251:9100']
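The configuration above scrapes the exporters but never tells Prometheus where Alertmanager lives, so alerts would have nowhere to go. A minimal `alerting` section, assuming Alertmanager runs on the same host on port 9093 (as deployed in step 5 below), would look like this:

```yaml
# Hypothetical addition to prometheus.yml; the target address assumes
# Alertmanager runs on the same host as the exporters.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['172.16.1.251:9093']
```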
4. Configure Grafana
Create the data directory first, so the container can write to it:
mkdir -p data/grafana
chmod 777 data/grafana
Then start Grafana:
docker run --name ziyun56.grafana -d -p 3000:3000 \
-v `pwd`/data/grafana:/var/lib/grafana \
grafana/grafana
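Once Grafana is up, you still need to add Prometheus as a data source. You can do that in the web UI (Configuration, Data Sources), or pre-provision it with a file. The sketch below follows Grafana's datasource provisioning convention; the file path and mount point are my assumptions, not part of the original setup:

```yaml
# Hypothetical file, e.g. ./provisioning/datasources/prometheus.yml, mounted
# into the container at /etc/grafana/provisioning/datasources/
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://172.16.1.251:9090
    isDefault: true
```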
5. Deploy Alertmanager
docker run --name alertmanager -d \
-p 9093:9093 \
-v `pwd`/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
docker.io/prom/alertmanager:latest
The Alertmanager configuration file (alertmanager.yml):
route:
  receiver: 'zyoa_alert_webhook'
receivers:
  - name: 'zyoa_alert_webhook'
    webhook_configs:
      - url: 'http://172.16.253.121:56513/alert/webhook.do?access_token=prometheus_access_token&source=prometheus&dd_alert=true'
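The route above is the bare minimum. In practice you usually also want grouping and repeat control so DingTalk is not flooded with duplicate messages. A sketch with common route fields (the values here are illustrative, not from the original setup):

```yaml
route:
  receiver: 'zyoa_alert_webhook'
  group_by: ['alertname', 'instance']  # batch related alerts into one notification
  group_wait: 30s       # wait before sending the first notification of a group
  group_interval: 5m    # wait before notifying about new alerts in an existing group
  repeat_interval: 4h   # re-send if the alert is still firing
```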
Note: the webhook address in Alertmanager is mapped out for access over the public network. If you skip this and use an internal address, the links in the alert notifications will point to the internal network, and the messages you receive in DingTalk will not open!
Below are the alerting rules; they are evaluated by Prometheus and routed through Alertmanager. The first group checks whether a scrape target, such as a Docker service, has gone down:
groups:
  - name: monitor
    rules:
      - alert: monitor_service_down
        expr: up == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Monitor service non-operational"
          description: "Service {{ $labels.instance }} is down."
Next, rules for high CPU, memory, and storage load:
groups:
  - name: host
    rules:
      - alert: high_cpu_load
        # dividing node_load1 by 8 assumes an 8-core host; adjust the divisor to your core count
        expr: node_load1 / 8 * 100 > 60
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: "Server cpu under high load"
          description: "Docker host cpu usage is {{ humanize $value }}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}."
      - alert: high_memory_load
        expr: (node_memory_MemTotal - (node_memory_MemFree + node_memory_Buffers + node_memory_Cached)) / node_memory_MemTotal * 100 > 95
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: "Server memory is almost full"
          description: "Docker host memory usage is {{ humanize $value }}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}."
      - alert: high_storage_load
        expr: (node_filesystem_size - node_filesystem_free) / node_filesystem_size * 100 > 85
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: "Server storage is almost full"
          description: "Docker host storage usage is {{ humanize $value }}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}."
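To see what the high_memory_load expression actually computes, here is the same arithmetic in plain shell with made-up sample values (16 GiB total, about 3.5 GiB free plus buffers plus cached). The numbers are purely illustrative:

```shell
# Reproduce the high_memory_load arithmetic with hypothetical sample values (bytes)
total=$((16 * 1024 * 1024 * 1024))   # node_memory_MemTotal
free=$((1 * 1024 * 1024 * 1024))     # node_memory_MemFree
buffers=$((512 * 1024 * 1024))       # node_memory_Buffers
cached=$((2 * 1024 * 1024 * 1024))   # node_memory_Cached
# (total - (free + buffers + cached)) / total * 100, integer-truncated
used_pct=$(( (total - (free + buffers + cached)) * 100 / total ))
echo "memory usage: ${used_pct}%"    # 78% here, so the > 95 alert would not fire
```

Free memory alone looks scarce in this sample, but because buffers and cached pages are subtracted too, usage comes out at 78%, which is why the rule counts reclaimable page cache as available rather than used.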
All of the configuration above has been personally tested. If you run into problems, leave a comment!