https://www.gitbook.com/book/songjiayang/prometheus/details (Prometheus in Action)
https://github.com/1046102779/prometheus (unofficial Chinese Prometheus manual)
http://www.bubuko.com/infodetail-2004088.html (monitoring a k8s cluster with Prometheus)
http://www.cnblogs.com/sfnz/p/6566951.html (installing Prometheus + Grafana to monitor MySQL, Redis, Kubernetes, etc., without Docker)
http://blog.csdn.net/wenwst/article/details/76624019 (deploying Prometheus and Grafana on Kubernetes 1.6 with persistent data)
https://github.com/jason-riddle/monitor-k8s-with-prom (Prometheus monitoring on Kubernetes)
https://github.com/kayrus/prometheus-kubernetes (prometheus-kubernetes)
https://github.com/prometheus/node_exporter (prometheus/node_exporter)
http://dockone.io/article/2579 (Prometheus monitoring practice on Kubernetes)
http://www.ywnds.com/?p=9656 (monitoring MySQL with Prometheus + Grafana in practice)
https://github.com/prometheus/prometheus/releases (Prometheus downloads)
https://github.com/prometheus/node_exporter/releases/ (node_exporter downloads)
https://laily.net/article/Prometheus%20%E5%88%9D%E4%BD%93%E9%AA%8C%281%29%20-%20%E5%AE%89%E8%A3%85 (First look at Prometheus (1): installation)
http://blog.csdn.net/u010871982/article/details/77838592?locationNum=2&fps=1 (a simple introduction to Prometheus)
https://www.robustperception.io/scaling-and-federating-prometheus/ (Prometheus federation)
http://dbaplus.cn/news-72-1462-1.html (360's online service monitoring practice based on Prometheus)
1. Installing Prometheus
[root@localhost prometheus]# wget https://github.com/prometheus/prometheus/releases/download/v1.7.1/prometheus-1.7.1.linux-amd64.tar.gz
[root@localhost prometheus]# mkdir /opt/prometheus
[root@localhost prometheus]# tar -zxvf prometheus-1.7.1.linux-amd64.tar.gz -C /opt/prometheus --strip-components=1
[root@localhost prometheus]# cd /opt/prometheus/
[root@localhost prometheus]# cp prometheus.yml prometheus.yml.back
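[root@localhost prometheus]# vim prometheus.yml   # note: YAML does not allow tab characters; indent with spaces only
# Global configuration
global:
  scrape_interval: 15s   # scrape targets every 15 seconds by default
  # These labels are attached by default to every time series of this server;
  # they are mainly used for federation, remote storage and Alertmanager.
  external_labels:
    monitor: 'codelab-monitor'
# Scrape target configuration
# Scrape Prometheus's own metrics
scrape_configs:
  # Every time series scraped by this job automatically gets the label {job="prometheus"}.
  - job_name: 'prometheus'
    # Override the global scrape interval: 5 seconds instead of 15.
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']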
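Because YAML rejects tab characters, a quick check before starting Prometheus can save a confusing parse error. The following is a minimal sketch; the function name count_tab_lines is made up for illustration and is not part of Prometheus.

```shell
#!/usr/bin/env bash
# count_tab_lines FILE
# Print how many lines of FILE contain a tab character (0 means the file is clean).
# Hypothetical helper, not part of Prometheus.
count_tab_lines() {
    # grep -c prints the count but exits non-zero when it is 0, so mask the status.
    grep -c "$(printf '\t')" "$1" || true
}
```

For the file edited above, count_tab_lines /opt/prometheus/prometheus.yml should print 0.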
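Start:
nohup ./prometheus --config.file=prometheus.yml &
Or (if the tarball was extracted under /opt without --strip-components):
nohup /opt/prometheus-1.7.1.linux-amd64/prometheus &
Now open http://localhost:9090/ in a browser to see the Prometheus graph page.
http://www.cnblogs.com/vovlie/p/Prometheus_install.html (reference)
Prometheus can reload its configuration without stopping the service. This is convenient while debugging: run the following after each configuration change to apply it:
# curl -X POST http://localhost:9090/-/reload
# netstat -antl|grep 9090   # check that it started successfully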
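Note that a plain grep 9090 also matches established connections and ports such as 19090. A stricter check, sketched below, filters netstat output for a socket in LISTEN state on exactly that port. port_is_listening is a hypothetical helper, and it assumes the Linux net-tools column layout (local address in column 4, state in column 6).

```shell
#!/usr/bin/env bash
# port_is_listening PORT
# Reads `netstat -antl` style output on stdin and succeeds (exit 0) only if
# some socket is in LISTEN state on exactly PORT.
# Hypothetical helper; assumes the net-tools column layout:
#   Proto Recv-Q Send-Q Local-Address Foreign-Address State
port_is_listening() {
    awk -v port="$1" '
        $6 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit !found }
    '
}
```

Typical use: netstat -antl | port_is_listening 9090 && echo "Prometheus is listening"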
To run it as a managed service, create a systemd unit file.
A dedicated user can be created to run it:
[root@localhost config]# useradd prometheus
[root@localhost ~]# vim /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network.target
[Service]
Type=simple
User=prometheus
# Adjust the paths below to the Prometheus installation directory and its prometheus.yml
ExecStart=/usr/local/prometheus/prometheus \
  -config.file=/usr/local/prometheus/prometheus.yml \
  -storage.local.path=/home/prometheusdata
Restart=on-failure
[Install]
WantedBy=multi-user.target
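Note: the storage directory given by -storage.local.path=/home/prometheusdata must be writable by the prometheus user created above.
After saving the file, run systemctl daemon-reload, then start the service with systemctl start prometheus
# systemctl enable prometheus.service
# systemctl restart prometheus.service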
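The permission requirement on the storage directory can be met by creating the directory and handing it to the service account before the first start. This is a minimal sketch; prepare_data_dir is a hypothetical helper name, and the 750 mode is an assumption rather than something Prometheus requires.

```shell
#!/usr/bin/env bash
# prepare_data_dir DIR OWNER
# Create DIR if needed and make OWNER (user name or numeric uid) its owner.
# Hypothetical helper; the 750 mode is an assumed, fairly strict default.
prepare_data_dir() {
    local dir="$1" owner="$2"
    mkdir -p "$dir"            # create the directory (and parents) if missing
    chown -R "$owner" "$dir"   # hand it to the service account
    chmod 750 "$dir"           # owner: rwx, group: r-x, others: none
}
```

Run as root, for example: prepare_data_dir /home/prometheusdata prometheus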
2. Installing Grafana
[root@localhost prometheus]# wget https://s3-us-west-2.amazonaws.com/grafana-releases/release/grafana-4.5.0-1.x86_64.rpm
[root@localhost prometheus]# yum install initscripts fontconfig -y
[root@localhost prometheus]# rpm -Uvh grafana-4.5.0-1.x86_64.rpm
warning: grafana-4.5.0-1.x86_64.rpm: Header V4 RSA/SHA1 Signature, key ID 24098cb6: NOKEY
error: Failed dependencies:
urw-fonts is needed by grafana-4.5.0-1.x86_64
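The installation fails with the dependency error above, so reinstall with the following command, which resolves the dependencies: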
[root@localhost prometheus]# yum localinstall grafana-4.5.0-1.x86_64.rpm
[root@localhost prometheus]# service grafana-server start #启动服务
Starting grafana-server (via systemctl): [ OK ]
[root@localhost prometheus]# netstat -anp|grep 3000
Port 3000 is now listening.
Open http://localhost:3000; the default username/password is admin/admin.
http://docs.grafana.org/installation/rpm/ (official Grafana documentation)
Grafana can be set up as a system service:
# mkdir -p /var/run/grafana
# chown grafana:grafana /var/run/grafana
# vim /etc/sysconfig/grafana-server
Add: PID_FILE_DIR=/var/run/grafana
# vim /etc/systemd/system/grafana.service
[Unit]
Description=Grafana Services
Documentation=https://github.com/grafana/grafana
After=network.target
[Service]
EnvironmentFile=/etc/sysconfig/grafana-server
User=grafana
Group=grafana
Type=simple
WorkingDirectory=/usr/share/grafana
RuntimeDirectory=grafana
RuntimeDirectoryMode=0750
ExecStart=/usr/sbin/grafana-server \
--config=${CONF_FILE} \
--pidfile=${PID_FILE_DIR}/grafana-server.pid \
cfg:default.paths.logs=${LOG_DIR} \
cfg:default.paths.data=${DATA_DIR} \
cfg:default.paths.plugins=${PLUGINS_DIR}
LimitNOFILE=10000
TimeoutStopSec=20
UMask=0027
[Install]
WantedBy=multi-user.target
# The variables used in the unit file above, such as ${CONF_FILE}, are read from /etc/sysconfig/grafana-server
# After changing a unit file, systemd must be reloaded first
# systemctl daemon-reload
# systemctl restart grafana.service
# systemctl enable grafana.service
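Connecting Prometheus to Grafana:
https://prometheus.io/docs/visualization/grafana/ (documentation on using Grafana with Prometheus)
Replacing Grafana's dashboards
Grafana ships with few ready-made dashboard templates; apart from those open-sourced by Percona, most have to be built by hand.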
[root@localhost prometheus]# yum install git -y
[root@localhost prometheus]# git clone https://github.com/percona/grafana-dashboards.git
Cloning into 'grafana-dashboards'...
remote: Counting objects: 1308, done.
remote: Compressing objects: 100% (31/31), done.
remote: Total 1308 (delta 32), reused 40 (delta 21), pack-reused 1256
Receiving objects: 100% (1308/1308), 6.39 MiB | 1.67 MiB/s, done.
Resolving deltas: 100% (982/982), done.
[root@localhost prometheus]# cp -r grafana-dashboards/dashboards /var/lib/grafana/
[root@localhost prometheus]# vim /etc/grafana/grafana.ini
Change it as follows:
[dashboards.json]
enabled = true
path = /var/lib/grafana/dashboards
[root@localhost prometheus]# service grafana-server restart
Or restart with:
[root@localhost prometheus]# systemctl restart grafana-server
3. Installing node_exporter
[root@localhost prometheus]# wget https://github.com/prometheus/node_exporter/releases/download/v0.14.0/node_exporter-0.14.0.linux-amd64.tar.gz
[root@localhost prometheus]# tar -zxvf node_exporter-0.14.0.linux-amd64.tar.gz
[root@localhost local]# mv /home/prometheus/node_exporter-0.14.0.linux-amd64 ./node_exporter-0.14.0
[root@localhost local]# cd node_exporter-0.14.0/
[root@localhost node_exporter-0.14.0]# nohup ./node_exporter &
Check that the process is running:
[root@localhost node_exporter-0.14.0]# ps -ef|grep node_exporter
root 24760 24106 0 14:39 pts/1 00:00:00 ./node_exporter
root 24766 24106 0 14:39 pts/1 00:00:00 grep --color=auto node_exporter
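A running process does not by itself show that metrics are being served. node_exporter listens on port 9100 by default and exposes plain-text metrics at /metrics, so one way to confirm it is to pull a single value out of that output. The sketch below uses a hypothetical helper, metric_value, which only handles metrics without labels (such as node_load1).

```shell
#!/usr/bin/env bash
# metric_value NAME
# Reads Prometheus text-format metrics on stdin and prints the value of the
# first sample whose name is exactly NAME.
# Hypothetical helper; only suitable for metrics without labels.
metric_value() {
    awk -v name="$1" '$1 == name { print $2; exit }'
}
```

Typical use: curl -s http://localhost:9100/metrics | metric_value node_load1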
node_exporter can also be run as a service:
[root@localhost ~]# vim /etc/systemd/system/node_exporter.service
A systemd unit for node_exporter looks like this:
[Unit]
Description=Prometheus node exporter
After=local-fs.target network-online.target network.target
Wants=local-fs.target network-online.target network.target
[Service]
Type=simple
# Run as the prometheus user
User=prometheus
ExecStart=/usr/local/prometheus/node_exporter/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target
# systemctl enable node_exporter.service
# systemctl restart node_exporter.service
4. Installing Alertmanager
http://blog.csdn.net/y_xiao_/article/details/50818451
Prometheus Alertmanager alerting component
http://www.jianshu.com/p/239b145e2acc (Prometheus Alertmanager alerting component)
Alertmanager alerting module
https://github.com/prometheus/alertmanager (Alertmanager on GitHub)
Alert template:
https://prometheus.io/blog/2016/03/03/custom-alertmanager-templates/ (custom Alertmanager templates)
Sending alert notifications to multiple destinations
https://www.robustperception.io/sending-alert-notifications-to-multiple-destinations/ (sending alerts to multiple destinations)
Alert tree:
https://prometheus.io/webtools/alerting/routing-tree-editor/ (Routing tree editor)
[root@localhost prometheus]# wget https://github.com/prometheus/alertmanager/releases/download/v0.9.1/alertmanager-0.9.1.linux-amd64.tar.gz
[root@localhost prometheus]# tar -zxvf alertmanager-0.9.1.linux-amd64.tar.gz
[root@localhost prometheus]# mv alertmanager-0.9.1.linux-amd64 /opt/alertmanager
[root@localhost prometheus]# cd /opt/alertmanager
[root@localhost prometheus]# nohup ./alertmanager -config.file=simple.yml &
Restart Prometheus, pointing it at Alertmanager:
# ./prometheus -config.file=prometheus.yml -alertmanager.url http://localhost:9093
Alertmanager can also reload its configuration file without a restart:
# curl -XPOST http://localhost:9093/-/reload
# Set up Alertmanager as a system service
# vim /etc/systemd/system/alertmanager.service
[Unit]
Description=Prometheus Alertmanager.
Documentation=https://github.com/prometheus/alertmanager
After=network.target
[Service]
EnvironmentFile=-/etc/alertmanager/template
User=root
ExecStart=/opt/alertmanager/alertmanager \
-config.file=/opt/alertmanager/simple.yml \
-storage.path=/home/alertmanager \
$ALERTMANAGER_OPTS
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
[Install]
WantedBy=multi-user.target
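Finally, run:
# systemctl enable alertmanager.service
# systemctl restart alertmanager.service
Open the Alertmanager page: http://ip:9093/#/alerts
Configuring Alertmanager
Alerting has two parts. The alert condition rules live by default in the Prometheus installation directory, in a file named alert.rules. The notification details, such as e-mail addresses and recipients, are set in simple.yml in the Alertmanager installation directory. Below are some basic, commonly used settings; adjust the thresholds and durations to your needs.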
#alert.rules:
ALERT node_down
IF up{job="node"} == 0
FOR 5m
ANNOTATIONS {
summary = "Node is down",
description = "Node has been unreachable for more than 5 minutes.",
severity = "warning"
}
ALERT snmp_down
IF up{job="snmp"} == 0
FOR 5m
ANNOTATIONS {
summary = "SNMP is down",
description = "SNMP has been unreachable for more than 5 minutes.",
severity = "warning"
}
ALERT fs_at_80_percent
IF hrStorageUsed{hrStorageDescr=~"/.+"} / hrStorageSize >= 0.8
FOR 15m
ANNOTATIONS {
summary = "File system {{$labels.hrStorageDescr}} is at 80%",
description = "{{$labels.hrStorageDescr}} has been at 80% for more than 15 Minutes.",
severity = "warning"
}
ALERT fs_at_90_percent
IF hrStorageUsed{hrStorageDescr=~"/.+"} / hrStorageSize >= 0.9
FOR 15m
ANNOTATIONS {
summary = "File system {{$labels.hrStorageDescr}} is at 90%",
description = "{{$labels.hrStorageDescr}} has been at 90% for more than 15 Minutes.",
severity = "average"
}
ALERT disk_load_mostly_random_reads
IF rate(diskIOReads{diskIODevice=~"sd[a-z]+"}[5m]) > 20 AND
rate(diskIONReadX{diskIODevice=~"sd[a-z]+"}[5m]) / rate(diskIOReads{diskIODevice=~"sd[a-z]+"}[5m]) < 10000
FOR 15m
ANNOTATIONS {
summary = "Disk {{$labels.diskIODevice}} reads are mostly random.",
description = "{{$labels.diskIODevice}} reads have been mostly random for the past 15 Minutes.",
severity = "info"
}
ALERT disk_load_mostly_random_writes
IF rate(diskIOWrites{diskIODevice=~"sd[a-z]+"}[5m]) > 20 AND
rate(diskIONWrittenX{diskIODevice=~"sd[a-z]+"}[5m]) / rate(diskIOWrites{diskIODevice=~"sd[a-z]+"}[5m]) < 10000
FOR 15m
ANNOTATIONS {
summary = "Disk {{$labels.diskIODevice}} writes are mostly random.",
description = "{{$labels.diskIODevice}} writes have been mostly random for the past 15 Minutes.",
severity = "info"
}
ALERT disk_load_high
IF diskIOLA1{diskIODevice=~"(s|v)d[a-z]+"} > 30
FOR 15m
ANNOTATIONS {
summary = "Disk {{$labels.diskIODevice}} is at 30%",
description = "{{$labels.diskIODevice}} Load has exceeded 30% over the past 15 Minutes.",
severity = "warning"
}
ALERT cpu_load_high
IF ssCpuIdle < 70
FOR 15m
ANNOTATIONS {
summary = "CPU is at 30%",
description = "CPU Load has constantly exceeded 30% over the past 15 Minutes.",
severity = "warning"
}
ALERT linux_load_high
IF laLoad1 > 50
FOR 15m
ANNOTATIONS {
summary = "Linux Load is above 50",
description = "Linux Load has constantly exceeded 50 over the past 15 Minutes.",
severity = "average"
}
ALERT if_operstatus_changed
IF delta(ifOperStatus[15m]) != 0
ANNOTATIONS {
summary = "Port {{$labels.ifDescr}} changed status",
description = "Port {{$labels.ifDescr}} went up or down in the past 15 Minutes",
severity = "info"
}
ALERT if_traffic_at_30_percent
IF ifSpeed > 10000000 AND
ifOperStatus == 1 AND
rate(ifInOctets[5m]) > ifSpeed * 0.3
FOR 15m
ANNOTATIONS {
summary = "Port {{$labels.ifDescr}} is at 30%",
description = "Port {{$labels.ifDescr}} has had at least 30% traffic over the past 15 Minutes.",
severity = "warning"
}
ALERT if_traffic_at_70_percent
IF ifSpeed > 10000000 AND
ifOperStatus == 1 AND
rate(ifInOctets[5m]) > ifSpeed * 0.7
FOR 15m
ANNOTATIONS {
summary = "Port {{$labels.ifDescr}} is at 70%",
description = "Port {{$labels.ifDescr}} has had at least 70% traffic over the past 15 Minutes.",
severity = "average"
}
# CPU alert
ALERT cpu_overload
IF node_load1 >= 0.8
FOR 3m
LABELS { severity = "all" }
ANNOTATIONS {
summary = "Instance {{ $labels.instance }} load1 at or above 0.8 for 3 minutes",
description = "{{ $labels.instance }} of job {{ $labels.job }} load1 at or above 0.8 for 3 minutes."
}
# Memory alert
ALERT memory_overload
IF (node_memory_MemTotal-node_memory_MemFree)/node_memory_MemTotal >= 0.8
FOR 3m
LABELS { severity = "all" }
ANNOTATIONS {
summary = "Instance {{ $labels.instance }} memory_load over 80% for 3 minutes",
description = "{{ $labels.instance }} of job {{ $labels.job }} memory_load over 80% for 3 minutes."
}
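To get a feel for the memory_overload threshold before enabling it, the same ratio, (MemTotal - MemFree) / MemTotal, can be computed locally. The sketch below assumes that node_memory_MemTotal and node_memory_MemFree correspond to the MemTotal and MemFree fields of /proc/meminfo; mem_used_ratio is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# mem_used_ratio [FILE]
# Print (MemTotal - MemFree) / MemTotal with two decimals, reading a file in
# /proc/meminfo format (default: /proc/meminfo itself).
# Hypothetical helper mirroring the memory_overload expression.
mem_used_ratio() {
    awk '
        /^MemTotal:/ { total = $2 }
        /^MemFree:/  { memfree = $2 }
        END { if (total > 0) printf "%.2f\n", (total - memfree) / total }
    ' "${1:-/proc/meminfo}"
}
```

A printed value of 0.80 or higher is what the rule would treat as overloaded. Since MemFree excludes buffers and page cache, this ratio is often high on a healthy Linux host.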
---------------------------------------------------
# simple.yml
It has three main parts: global sets the outgoing mail server, route sets the routing rules and alert timing intervals, and receivers sets the recipients.
global:
  # Sender address and SMTP settings
  smtp_smarthost: 'smtp.abc.com'
  smtp_from: '[email protected]'
  smtp_auth_username: 'prometheus'
  smtp_auth_password: 'abcd'
route:
  receiver: 'team-X-mails'
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 6h
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    # Apply inhibition if the alertname is the same.
    equal: ['alertname']
receivers:
  - name: 'team-X-mails'
    email_configs:
      - to: '[email protected]'
        send_resolved: true
# After changing the settings, reload the configuration file
5. Installing and configuring cAdvisor
docker run -d --restart=always --volume=/:/rootfs:ro --volume=/var/run:/var/run:rw --volume=/sys:/sys:ro --volume=/var/lib/docker/:/var/lib/docker:ro --volume=/dev/disk/:/dev/disk:ro --publish=8090:8080 --detach=true --name=cadvisor google/cadvisor:latest
Then open http://ip:8090 in a browser.
# Alert conditions for monitoring cAdvisor:
# vim containers.rules
ALERT cAdvisor_down
IF absent(container_memory_usage_bytes{name="cadvisor"})
FOR 1m
LABELS { severity = "critical" }
ANNOTATIONS {
summary= "cAdvisor containers down",
description= "cAdvisor container is down for more than 1 minutes."
}
ALERT cAdvisor_high_cpu
IF sum(rate(container_cpu_usage_seconds_total{name="cadvisor"}[1m])) / count(node_cpu{mode="system"}) * 100 > 10
FOR 5m
LABELS { severity = "warning" }
ANNOTATIONS {
summary= "cAdvisor high CPU usage",
description= "cAdvisor CPU usage is {{ humanize $value}}%."
}
ALERT cAdvisor_high_memory
IF sum(container_memory_usage_bytes{name="cadvisor"}) > 1200000000
FOR 5m
LABELS { severity = "warning" }
ANNOTATIONS {
summary = "cAdvisor high memory usage",
description = "cAdvisor memory consumption is at {{ humanize $value}}.",
}