Deploying a Prometheus + node_exporter + Alertmanager + Grafana monitoring platform

Contents

  • Install and configure the Prometheus server
  • Install and configure node_exporter
  • Install Grafana for visualization
  • Install and configure Alertmanager

Installing Prometheus

  • System: CentOS 7
    For security, the services are not started as root. They run under a dedicated prometheus user instead, so create that user first:
$ groupadd prometheus
$ useradd -g prometheus -M -s /sbin/nologin prometheus

Download the Prometheus tarball:

$ wget https://github.com/prometheus/prometheus/releases/download/v2.14.0/prometheus-2.14.0.linux-amd64.tar.gz

Extract and install Prometheus:

$ tar xf prometheus-2.14.0.linux-amd64.tar.gz -C /srv/
$ cd /srv/
$ mv prometheus-2.14.0.linux-amd64/ prometheus
$ mkdir -pv /srv/prometheus/data
$ chown -R prometheus:prometheus /srv/prometheus
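
The rename step only works when the directory name matches the version that was actually downloaded. One way to keep the two in sync is to derive both names from a single version string. The following is a minimal sketch rather than part of the original procedure; prom_dirname and prom_tarball_url are hypothetical helper names introduced here for illustration.

```shell
#!/usr/bin/env bash
# Sketch: derive every version-dependent name from one variable so the
# download, extract, and rename steps cannot drift apart.
# PROM_VERSION, prom_dirname and prom_tarball_url are illustrative names.
PROM_VERSION="${PROM_VERSION:-2.14.0}"

prom_dirname() {
  # Name of the directory the release tarball unpacks into.
  printf 'prometheus-%s.linux-amd64\n' "$1"
}

prom_tarball_url() {
  # Download URL of the release tarball for the given version.
  printf 'https://github.com/prometheus/prometheus/releases/download/v%s/%s.tar.gz\n' \
    "$1" "$(prom_dirname "$1")"
}

# Example usage (commented out so that sourcing this file changes nothing):
#   wget "$(prom_tarball_url "$PROM_VERSION")"
#   tar xf "$(prom_dirname "$PROM_VERSION").tar.gz" -C /srv/
#   mv "/srv/$(prom_dirname "$PROM_VERSION")" /srv/prometheus
```

With this approach, changing PROM_VERSION is the only edit needed when moving to another release.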

Create the systemd unit file /usr/lib/systemd/system/prometheus.service:

[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target

[Service]
User=prometheus
Restart=on-failure
ExecStart=/srv/prometheus/prometheus \
  --config.file=/srv/prometheus/prometheus.yml \
  --storage.tsdb.path=/srv/prometheus/data
ExecReload=/bin/kill -HUP $MAINPID
[Install]
WantedBy=multi-user.target

For the complete Prometheus systemd unit file, see prometheus.service.
Edit the Prometheus configuration file /srv/prometheus/prometheus.yml:

global:
  scrape_interval:     15s 
  evaluation_interval: 15s 

alerting:
  alertmanagers:
  - static_configs:
    - targets: ["localhost:9093"]

rule_files:
  #- "alert.rules"
  
scrape_configs:
  - job_name: 'prometheus'
    scrape_interval:     5s
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    scrape_interval:     10s
    static_configs:
      - targets: ['HOST1_IP:9100','HOST2_IP:9100']  # replace with the IPs of the monitored hosts; separate multiple hosts with commas
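
When many hosts are monitored, the targets list for the node job can be generated instead of typed by hand. The sketch below shows one possible approach; node_targets is a hypothetical helper, port 9100 is the node_exporter port used in this guide, and the 192.0.2.x addresses are documentation placeholders.

```shell
#!/usr/bin/env bash
# Sketch: build the YAML flow sequence used in "targets:" from a list of
# host IPs. node_targets is an illustrative helper, not a Prometheus tool.
node_targets() {
  local port=9100 out="" host
  for host in "$@"; do
    # Append 'host:port', adding a comma before every entry except the first.
    out="${out:+$out,}'${host}:${port}'"
  done
  printf '[%s]\n' "$out"
}

# Example usage (commented out):
#   node_targets 192.0.2.10 192.0.2.11
```

After editing prometheus.yml, the promtool binary shipped in the Prometheus tarball can validate the file with `promtool check config /srv/prometheus/prometheus.yml` before the service is started or reloaded.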

For the complete Prometheus configuration file, see prometheus.yml.
Start the service by running these commands in order:

$ systemctl daemon-reload
$ systemctl start prometheus.service
$ systemctl enable prometheus.service 
$ systemctl status prometheus.service

Prometheus supports reloading its configuration without a restart:
$ systemctl reload prometheus.service
Once the service is running, the Prometheus UI is available at http://localhost:9090.

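Installing and configuring node_exporter

To monitor CPU, memory, disk, I/O, and similar host metrics, install the node_exporter service on every machine to be monitored.
First download the version you want from the node_exporter download page. This guide installs v0.17.0, the latest release at the time of writing.

$ wget https://github.com/prometheus/node_exporter/releases/download/v0.17.0/node_exporter-0.17.0.linux-amd64.tar.gz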

Extract and install node_exporter:

$ tar xf node_exporter-0.17.0.linux-amd64.tar.gz -C /srv/
$ cd /srv/
$ mv node_exporter-0.17.0.linux-amd64/ node_exporter
$ chown -R prometheus:prometheus /srv/node_exporter

Create the systemd unit file /usr/lib/systemd/system/node_exporter.service:

# Prometheus Node Exporter systemd unit
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
ExecStart=/srv/node_exporter/node_exporter

[Install]
WantedBy=default.target

For the complete node_exporter systemd unit file, see node_exporter.service.
Start the node_exporter service:

$ systemctl daemon-reload
$ systemctl enable node_exporter
$ systemctl start node_exporter
$ systemctl status node_exporter

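Once the service is running, open http://MONITORED_HOST_IP:9100/metrics to check that node_exporter is exposing the host's metrics. If the metrics are returned correctly, node_exporter can be added to Prometheus as follows.
Edit the Prometheus configuration file /srv/prometheus/prometheus.yml and add the following: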

scrape_configs:
  ...
  - job_name: 'node'
    scrape_interval:     10s
    static_configs:
      - targets: ['localhost:9100']

The Prometheus configuration file shown earlier already includes this job; it is repeated here only as a reminder.

Reload the Prometheus service:
$ systemctl reload prometheus.service
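
The /metrics check described earlier in this section can also be scripted. The sketch below assumes the standard Prometheus text exposition format, in which comment lines start with # and each sample line starts with the metric name; has_metric is a hypothetical helper, and node_load1 is used only as an example of a metric node_exporter normally exposes.

```shell
#!/usr/bin/env bash
# Sketch: succeed if the exposition text on stdin contains at least one
# sample line for the metric named in $1. has_metric is an illustrative name.
has_metric() {
  # Drop "# HELP" / "# TYPE" comment lines, then look for a line that starts
  # with the metric name followed by "{" (labels) or a space (no labels).
  grep -v '^#' | grep -q -E "^$1(\{| )"
}

# Example usage (commented out; MONITORED_HOST_IP is a placeholder):
#   curl -s http://MONITORED_HOST_IP:9100/metrics | has_metric node_load1 \
#     && echo "node_exporter is serving metrics"
```

Matching on the name followed by "{" or a space avoids false positives from metrics that merely share a prefix.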

Installing Grafana for visualization

First, set up the Grafana yum repository by creating /etc/yum.repos.d/grafana.repo by hand:

[grafana]
name=grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

See the official Grafana documentation for reference.

Then install Grafana with yum:

$ yum makecache
$ yum -y install grafana

Once the installation finishes, start the service:

$ systemctl enable grafana-server
$ systemctl start grafana-server

Log in to Grafana
Open http://localhost:3000 in a browser. The default username and password are admin/admin.
Add a data source
On the home page after logging in, click "Configuration-Data Sources" to open the data source page, and configure it as follows:

  • Name: prometheus
  • Type: prometheus
  • URL: http://localhost:9090/
  • Access: Server

Uncheck Default, leave the other settings at their defaults, and click "Add".
Import a dashboard

Download a dashboard from the Grafana website, for example https://grafana.com/dashboards/8919. There are two ways to import it:

  • Upload: upload the JSON file you downloaded.
  • Grafana.com Dashboard: enter the dashboard link from the Grafana website (for example https://grafana.com/dashboards/1860).

Either download the file and upload it, or paste the link without downloading anything. Then click Import.

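Deploying Alertmanager with DingTalk alerts

1. Download and install

$ wget https://github.com/prometheus/alertmanager/releases/download/v0.15.2/alertmanager-0.15.2.linux-amd64.tar.gz
$ tar zxf alertmanager-0.15.2.linux-amd64.tar.gz
$ mv alertmanager-0.15.2.linux-amd64 /srv/alertmanager

Configuration file
Alertmanager has no built-in DingTalk support; DingTalk alerts are delivered through its generic webhook receiver. DingTalk is strict about the message format, so a plugin is used later to convert the format. Edit the configuration file:
$ vim /srv/alertmanager/alertmanager.yml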

global:
  resolve_timeout: 5m
route:
  receiver: webhook
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [alertname]
  routes:
  - receiver: webhook
    group_wait: 10s
    match:
      team: node
receivers:
- name: webhook
  webhook_configs:
  - url: http://localhost:8060/dingtalk/ops_dingding/send 
    send_resolved: true

Start Alertmanager

$ cd /srv/alertmanager
$ nohup ./alertmanager --config.file=alertmanager.yml >alertmanager.log 2>&1 &
# Check that the port is listening:
$ netstat -anpt | grep 9093

Alert rules

Check whether each host is up

$ cd /srv/prometheus
$ cat rules.yml
groups:
    - name: test-rule
      rules:
      - alert: HostStatus
        expr: up == 0
        for: 2m
        labels:
          status: warning
        annotations:
          summary: "{{$labels.instance}}: server is down"
          description: "{{$labels.instance}}: server is down"

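Update the Prometheus configuration file
In prometheus.yml, update the alerting and rule_files sections so that Prometheus loads the rule file and sends alerts to Alertmanager.
rule_files can list more than one rule file.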
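
As a sketch of the rule_files part of this change, the helper below prints a rule_files section for any number of rule files. It assumes rules.yml is stored under /srv/prometheus, the installation directory used earlier in this guide; rule_files_block is a hypothetical helper name, and the printed text still has to be merged into prometheus.yml by hand.

```shell
#!/usr/bin/env bash
# Sketch: print a rule_files section listing every path given as an argument.
# rule_files_block is an illustrative helper, not part of Prometheus.
rule_files_block() {
  local path
  printf 'rule_files:\n'
  for path in "$@"; do
    printf '  - "%s"\n' "$path"
  done
}

# Example usage (commented out; the path is an assumption):
#   rule_files_block /srv/prometheus/rules.yml
```

The rule file itself can be validated with `promtool check rules rules.yml`, and the change takes effect after `systemctl reload prometheus.service`.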

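Connecting DingTalk to the Prometheus Alertmanager webhook
Reference: http://theo.im/blog/2017/10/16/release-prometheus-alertmanager-webhook-for-dingtalk/
Plugin download: https://github.com/timonwong/prometheus-webhook-dingtalk
Installation
Replace the hostname with the host IP so that the URLs in alert messages are easy to open.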

$ mkdir -p /usr/lib/golang/src/github.com/timonwong/
$ cd  /usr/lib/golang/src/github.com/timonwong/
$ git clone https://github.com/timonwong/prometheus-webhook-dingtalk.git
$ cd prometheus-webhook-dingtalk
$ make  # the original author notes that errors at this step can be ignored

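Start
If you have not added a DingTalk robot before, look up how to do so online. The value after the = in --ding.profile is the DingTalk robot's webhook URL.

$ nohup ./prometheus-webhook-dingtalk --ding.profile="ops_dingding=https://oapi.dingtalk.com/robot/send?access_token=xxx" >dingding.log 2>&1 &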
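
The profile name passed to --ding.profile has to match the path segment in the webhook URL configured in alertmanager.yml (ops_dingding in both places in this guide). The sketch below makes that relationship explicit; dingtalk_webhook_url is a hypothetical helper, and the default localhost:8060 is taken from the alertmanager.yml shown earlier rather than from the plugin's documentation.

```shell
#!/usr/bin/env bash
# Sketch: build the Alertmanager webhook URL for a given DingTalk profile.
# dingtalk_webhook_url is an illustrative helper name.
#   $1 = profile name used in --ding.profile
#   $2 = host:port the plugin listens on (assumed default: localhost:8060)
dingtalk_webhook_url() {
  local profile="$1" listen="${2:-localhost:8060}"
  printf 'http://%s/dingtalk/%s/send\n' "$listen" "$profile"
}

# Example usage (commented out):
#   dingtalk_webhook_url ops_dingding
```

If the profile is renamed, the url under webhook_configs in alertmanager.yml has to change with it, otherwise Alertmanager posts to a path the plugin does not serve.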

Test
Stopping node_exporter on a monitored host triggers the alert. After the exporter is started again, the alert is reported as resolved.