$ groupadd prometheus
$ useradd -g prometheus -M -s /sbin/nologin prometheus
Download the Prometheus tarball:
$ wget https://github.com/prometheus/prometheus/releases/download/v2.14.0/prometheus-2.14.0.linux-amd64.tar.gz
Extract and install the Prometheus service:
$ tar xf prometheus-2.14.0.linux-amd64.tar.gz -C /srv/
$ cd /srv/
$ mv prometheus-2.14.0.linux-amd64/ prometheus
$ mkdir -pv /srv/prometheus/data
$ chown -R prometheus:prometheus /srv/prometheus
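The version string appears in both the download URL and the extracted directory name, so the two can easily drift apart. Below is a minimal sketch that derives both from a single value. The function names `prom_tarball_url` and `prom_dir_name` are made up for illustration, and the URL pattern is simply the one used in the wget command above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: derive the release URL and the extracted
# directory name from one version string.
prom_tarball_url() {
    local version="$1"
    echo "https://github.com/prometheus/prometheus/releases/download/v${version}/prometheus-${version}.linux-amd64.tar.gz"
}

prom_dir_name() {
    local version="$1"
    echo "prometheus-${version}.linux-amd64"
}

# Usage (not executed here; needs network access and root):
#   version=2.14.0
#   wget "$(prom_tarball_url "$version")"
#   tar xf "$(prom_dir_name "$version").tar.gz" -C /srv/
#   mv "/srv/$(prom_dir_name "$version")" /srv/prometheus
```

Keeping the version in one place means an upgrade only requires changing a single value.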
Create the Prometheus systemd unit file /usr/lib/systemd/system/prometheus.service:
[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target
[Service]
User=prometheus
Restart=on-failure
ExecStart=/srv/prometheus/prometheus \
--config.file=/srv/prometheus/prometheus.yml \
--storage.tsdb.path=/srv/prometheus/data
ExecReload=/bin/kill -HUP $MAINPID
[Install]
WantedBy=multi-user.target
For the complete Prometheus systemd unit file, see: prometheus.service
Edit the Prometheus configuration file /srv/prometheus/prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]
rule_files:
  #- "alert.rules"
scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    scrape_interval: 10s
    static_configs:
      - targets: ['<monitored-host-1-ip>:9100', '<monitored-host-2-ip>:9100']  # separate multiple hosts with commas
For the complete Prometheus configuration file, see: prometheus.yml
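Before starting or reloading the service, the configuration can be checked with `promtool`, which ships in the same release tarball as the `prometheus` binary. The wrapper below is a sketch: the function name `check_prom_config` is hypothetical, and the default paths assume the /srv/prometheus layout used in this post.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: validate a Prometheus config before (re)loading it.
# Arguments: $1 = config file, $2 = path to promtool (both optional).
check_prom_config() {
    local config="${1:-/srv/prometheus/prometheus.yml}"
    local promtool="${2:-/srv/prometheus/promtool}"
    if [ ! -x "$promtool" ]; then
        echo "promtool not found at $promtool" >&2
        return 127
    fi
    # "promtool check config" exits non-zero when the file is invalid.
    "$promtool" check config "$config"
}

# Usage (not executed here):
#   check_prom_config && systemctl reload prometheus.service
```

Chaining the check with the reload keeps a broken file from reaching the running server.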
Start the service (run the commands in order):
$ systemctl daemon-reload
$ systemctl start prometheus.service
$ systemctl enable prometheus.service
$ systemctl status prometheus.service
The Prometheus service supports hot-reloading its configuration:
$ systemctl reload prometheus.service
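Once the Prometheus service is up, the Prometheus UI is available at http://localhost:9090.
To monitor server CPU, memory, disk, I/O and similar metrics, install the node_exporter service on each monitored machine.
First, download the version to install from the node_exporter download page. Here we install node_exporter v0.17.0, the latest version at the time of writing.
$ wget https://github.com/prometheus/node_exporter/releases/download/v0.17.0/node_exporter-0.17.0.linux-amd64.tar.gz
Extract and install the node_exporter service (the command below assumes the tarball was saved under /opt/soft/):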
$ tar xf /opt/soft/node_exporter-0.17.0.linux-amd64.tar.gz -C /srv/
$ cd /srv/
$ mv node_exporter-0.17.0.linux-amd64/ node_exporter
$ chown -R prometheus:prometheus /srv/node_exporter
Create the node_exporter systemd unit file /usr/lib/systemd/system/node_exporter.service:
# Prometheus Node Exporter systemd unit
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
ExecStart=/srv/node_exporter/node_exporter
[Install]
WantedBy=default.target
For the complete node_exporter systemd unit file, see: node_exporter.service
Start the node_exporter service:
$ systemctl daemon-reload
$ systemctl enable node_exporter
$ systemctl start node_exporter
$ systemctl status node_exporter
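After the service starts, open http://<monitored-host-ip>:9100/metrics to check whether node_exporter is exposing the host's metrics. Once the metrics are returned correctly, node_exporter can be added to Prometheus as a scrape target.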
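The metrics endpoint returns plain text, one sample per line, so a quick command-line check is possible without a browser. The helper below is a sketch: `metric_value` is a made-up name, and it only handles metrics without labels, such as `node_load1`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of an unlabelled metric from
# Prometheus text-format output read on stdin.
metric_value() {
    # Comment lines start with "#", so their first field never equals the name.
    awk -v name="$1" '$1 == name { print $2 }'
}

# Usage (not executed here; assumes curl is installed and the
# placeholder is replaced with the monitored host's IP):
#   curl -s http://<monitored-host-ip>:9100/metrics | metric_value node_load1
```

An empty result suggests that the exporter is unreachable or that the metric name is wrong.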
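Edit the Prometheus configuration file /srv/prometheus/prometheus.yml and add the following:
scrape_configs:
  ...
  - job_name: 'node'
    scrape_interval: 10s
    static_configs:
      - targets: ['localhost:9100']
The Prometheus configuration file shown earlier already includes this change; it is repeated here only for reference.
Reload the Prometheus service:
$ systemctl reload prometheus.service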
First, set up the Grafana repo by manually creating the file /etc/yum.repos.d/grafana.repo:
[grafana]
name=grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
See the official Grafana documentation for reference.
Grafana can then be installed with yum:
$ yum makecache
$ yum -y install grafana
Once the installation finishes, start the service:
$ systemctl enable grafana-server
$ systemctl start grafana-server
Log in to Grafana
Open http://localhost:3000 in a browser; the default username/password is admin/admin.
Add a data source
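On the home page after logging in, click "Configuration > Data Sources" to open the add-data-source page, and configure it as follows:
Name: prometheus
Type: prometheus
URL: http://localhost:9090/
Access: Server
Uncheck Default, leave the remaining settings at their defaults, and click "Add".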
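The same data source can also be created without the UI through Grafana's HTTP API (`POST /api/datasources`). This is a sketch under a few assumptions: the helper name `grafana_datasource_json` is hypothetical, the credentials are still the default admin/admin, and `"access": "proxy"` is the API value that corresponds to the "Server" access mode.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the JSON body for Grafana's data source API.
# Arguments: $1 = data source name, $2 = Prometheus URL.
grafana_datasource_json() {
    local name="$1" url="$2"
    printf '{"name":"%s","type":"prometheus","url":"%s","access":"proxy","isDefault":false}\n' "$name" "$url"
}

# Usage (not executed here; assumes curl and default credentials):
#   grafana_datasource_json prometheus http://localhost:9090 |
#     curl -s -u admin:admin -H 'Content-Type: application/json' \
#          -X POST --data @- http://localhost:3000/api/datasources
```

Scripting this step is mainly useful when the same setup has to be repeated on several Grafana instances.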
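Import a dashboard
Download a dashboard from the Grafana website to your machine, e.g. https://grafana.com/dashboards/8919
Upload: upload the JSON file you downloaded
Grafana.com Dashboard: enter the dashboard link from the Grafana website (e.g. https://grafana.com/dashboards/1860)
You can either download the JSON file and upload it, or paste the link directly without downloading
Click Import to finish
1. Download and install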
$ wget https://github.com/prometheus/alertmanager/releases/download/v0.15.2/alertmanager-0.15.2.linux-amd64.tar.gz
$ tar zxf alertmanager-0.15.2.linux-amd64.tar.gz
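$ mv alertmanager-0.15.2.linux-amd64 /srv/alertmanager
Configuration file
Alertmanager has no built-in DingTalk support; DingTalk alerts are delivered through Alertmanager's generic webhook receiver. DingTalk is strict about message format, so a plugin is needed later to convert the format. Edit the configuration file:
$ vim /srv/alertmanager/alertmanager.yml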
global:
  resolve_timeout: 5m
route:
  receiver: webhook
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [alertname]
  routes:
    - receiver: webhook
      group_wait: 10s
      match:
        team: node
receivers:
  - name: webhook
    webhook_configs:
      - url: http://localhost:8060/dingtalk/ops_dingding/send
        send_resolved: true
Start Alertmanager
$ cd /srv/alertmanager
$ nohup ./alertmanager --config.file=alertmanager.yml > alertmanager.log 2>&1 &
# Check the port:
$ netstat -anpt | grep 9093
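Once port 9093 is listening, a hand-made alert can be pushed to Alertmanager to exercise the routing before any Prometheus rules exist. The sketch below assumes the v1 API path `/api/v1/alerts`, which the 0.15.x series serves; newer Alertmanager releases use `/api/v2/alerts` instead. The helper name `test_alert_json` and the alert values are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a one-alert JSON payload for Alertmanager.
# Arguments: $1 = alert name, $2 = instance label.
# The team=node label matches the sub-route in alertmanager.yml.
test_alert_json() {
    local alertname="$1" instance="$2"
    printf '[{"labels":{"alertname":"%s","instance":"%s","team":"node"},"annotations":{"summary":"manual test alert"}}]\n' "$alertname" "$instance"
}

# Usage (not executed here; assumes curl is installed):
#   test_alert_json TestAlert localhost:9100 |
#     curl -s -H 'Content-Type: application/json' \
#          -X POST --data @- http://localhost:9093/api/v1/alerts
```

If the webhook receiver is not running yet, the alert should still appear in the Alertmanager UI, with the delivery failure recorded in its log.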
Alert rules
Monitor whether hosts are alive
$ cd /srv/prometheus
$ cat rules.yml
groups:
  - name: test-rule
    rules:
      - alert: HostStatus
        expr: up == 0
        for: 2m
        labels:
          status: warning
        annotations:
          summary: "{{$labels.instance}}: server is down"
          description: "{{$labels.instance}}: server is down"
Modify the Prometheus configuration file
Update the alerting and rule_files sections
rule_files can list multiple rule files
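As a sketch of what the two sections might look like after the change, the function below prints a fragment to merge into prometheus.yml. The Alertmanager target is the one from the configuration shown earlier; the function name `alerting_fragment` is hypothetical, and the fragment assumes rules.yml sits in the same directory as prometheus.yml.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a sketch of the alerting and rule_files
# sections for prometheus.yml.
alerting_fragment() {
    cat <<'EOF'
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]
rule_files:
  - "rules.yml"   # relative paths are resolved against the directory of prometheus.yml
EOF
}

# Usage (not executed here): print the fragment, merge it by hand,
# then validate the rule file before reloading:
#   alerting_fragment
#   ./promtool check rules rules.yml && systemctl reload prometheus.service
```

Additional rule files can be added as further list entries under `rule_files`.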
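Integrate DingTalk with the Prometheus Alertmanager webhook
Reference: http://theo.im/blog/2017/10/16/release-prometheus-alertmanager-webhook-for-dingtalk/
Plugin download: https://github.com/timonwong/prometheus-webhook-dingtalk
Install
Replace the hostname with the host's IP address so that the URLs in alerts are directly reachable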
$ mkdir -p /usr/lib/golang/src/github.com/timonwong/
$ cd /usr/lib/golang/src/github.com/timonwong/
$ git clone https://github.com/timonwong/prometheus-webhook-dingtalk.git
$ cd prometheus-webhook-dingtalk
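$ make  # ignore any errors it reports
Start
If you don't know how to add a DingTalk robot, search online for instructions. ding.profile is the DingTalk robot's webhook.
$ nohup ./prometheus-webhook-dingtalk --ding.profile="ops_dingding=https://oapi.dingtalk.com/robot/send?access_token=xxx" > dingding.log 2>&1 &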