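Our company plans to move to microservices and deploy its projects on a Docker cluster, so we are switching our monitoring to Prometheus.
Installing Prometheus on Ubuntu (Docker version):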

1. Install Docker

apt-get install -y docker.io

2. Pull the Prometheus-related images

docker pull prom/node-exporter
docker pull prom/prometheus
docker pull grafana/grafana

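3. Start node-exporter

With --net="host" the container shares the host's network stack, so node-exporter listens on host port 9100 directly and no -p mapping is needed:

docker run -d \
  -v "/proc:/host/proc:ro" \
  -v "/sys:/host/sys:ro" \
  -v "/:/rootfs:ro" \
  --net="host" \
  prom/node-exporter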

Check that the port is listening:
netstat -tpln
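Reading the netstat output by eye gets tedious once several ports are involved. The following is a minimal sketch of a filter that checks one port; the function name port_listening is hypothetical, and it assumes the column layout printed by netstat -tln on Linux (local address in column 4, state in column 6).

```shell
#!/bin/sh
# port_listening PORT
# Reads `netstat -tln`-style output on stdin and succeeds (exit status 0)
# if some socket is in the LISTEN state on PORT.
# Hypothetical helper, not part of the original setup.
port_listening() {
  awk -v port="$1" '
    $6 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
    END { exit(found ? 0 : 1) }
  '
}

# Example usage (run on the monitored host):
#   netstat -tln | port_listening 9100 && echo "node-exporter is listening"
```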

4. Start Prometheus

Create a directory for the Prometheus configuration file:
mkdir /opt/prometheus
vim /opt/prometheus/prometheus.yml
Configuration file contents:

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['192.168.10.5:19090']

  - job_name: 'localhost'
    static_configs:
      - targets: ['192.168.10.5:9100']

  - job_name: 'server1'
    static_configs:
      - targets: ['192.168.10.11:9100']

  - job_name: 'server1_pushgateway'
    static_configs:
      - targets: ['192.168.10.11:9091']

  - job_name: 'server2'
    static_configs:
      - targets: ['192.168.10.110:9100']

  - job_name: 'server2_pushgateway'
    static_configs:
      - targets: ['192.168.10.110:9091']

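Prometheus sometimes fails with an error, and in most cases the cause is a formatting problem in the YAML configuration file. Pay attention to the nesting levels and to blank lines at the start of the file, and indent with spaces, never tabs.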
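Formatting problems like these can be caught before the container is started. The sketch below first looks for tab characters, then validates the file with promtool, which ships inside the prom/prometheus image. The function names find_tabs and check_config are hypothetical, and the example assumes the configuration lives at /opt/prometheus/prometheus.yml as above.

```shell
#!/bin/sh
# find_tabs FILE
# Prints every line of FILE that contains a tab, prefixed with its line number.
# Succeeds only if at least one tab was found. Hypothetical helper name.
find_tabs() {
  grep -n "$(printf '\t')" "$1"
}

# check_config FILE
# Validates FILE with promtool from the prom/prometheus image.
# Assumes Docker is installed, the image is pulled, and FILE is an absolute path.
check_config() {
  docker run --rm \
    -v "$1:/etc/prometheus/prometheus.yml:ro" \
    --entrypoint promtool \
    prom/prometheus check config /etc/prometheus/prometheus.yml
}

# Example usage:
#   find_tabs /opt/prometheus/prometheus.yml && echo "replace the tabs above with spaces"
#   check_config /opt/prometheus/prometheus.yml
```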

Start Prometheus:
docker run -d \
  -p 19090:9090 \
  -v /opt/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus

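Check that the port is listening, then open port 19090 of the host in a browser (http://192.168.10.5:19090 in this setup).

Check the connection status of the scrape targets on the Targets page (/targets).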
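The same target status is also available from the Prometheus HTTP API at /api/v1/targets, which is handy on a machine without a browser. This is a rough sketch that counts targets by health state with grep rather than a JSON parser; the function names are hypothetical, and the base URL in the usage comment is the one assumed throughout this post.

```shell
#!/bin/sh
# count_health STATE
# Reads the JSON returned by /api/v1/targets on stdin and prints how many
# targets report "health":"STATE" (for example up or down).
count_health() {
  grep -o "\"health\":\"$1\"" | wc -l | tr -d ' '
}

# targets_json BASE_URL
# Fetches the target list from a running Prometheus server.
targets_json() {
  curl -s "$1/api/v1/targets"
}

# Example usage:
#   targets_json http://192.168.10.5:19090 | count_health up
```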

5. Start Grafana

Create a directory to store Grafana's data:
mkdir /opt/grafana-storage
Set its permissions:
chmod 777 -R /opt/grafana-storage
Start Grafana:
docker run -d \
  -p 13000:3000 \
  --name=grafana \
  -v /opt/grafana-storage:/var/lib/grafana \
  grafana/grafana

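Check that the port is listening, then open port 13000 of the host in a browser.

Bind Prometheus: add a Prometheus data source in Grafana and point it at the Prometheus address (http://192.168.10.5:19090 in this setup).

Click Save & Test; once the test passes, the data source is ready.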
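The data source can also be created without the web UI through Grafana's HTTP API (POST /api/datasources). The sketch below is one way to do it; the function names are hypothetical, the admin:admin credentials in the usage comment are Grafana's defaults on a fresh install and are an assumption here, and the URLs are the ones used in this post.

```shell
#!/bin/sh
# datasource_payload NAME URL
# Prints the JSON body describing a Prometheus data source.
# NAME and URL must not contain double quotes (no escaping is done).
datasource_payload() {
  printf '{"name":"%s","type":"prometheus","url":"%s","access":"proxy","isDefault":true}\n' "$1" "$2"
}

# add_datasource GRAFANA_URL USER:PASSWORD NAME PROMETHEUS_URL
# POSTs the payload to Grafana's data source endpoint using basic auth.
add_datasource() {
  datasource_payload "$3" "$4" |
    curl -s -u "$2" -H 'Content-Type: application/json' \
      -X POST --data-binary @- "$1/api/datasources"
}

# Example usage (credentials are an assumption, change them to yours):
#   add_datasource http://192.168.10.5:13000 admin:admin Prometheus http://192.168.10.5:19090
```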

Import an official dashboard:

Copy the dashboard ID into Grafana's import page to download it.

Use Change to adjust the settings if needed, then save.

Select the IP address of the server you want to view.

You can also customize it to use your own IPs.

6. Use a custom script

Install Pushgateway with Docker:
docker pull prom/pushgateway
docker run -d -p 9091:9091 prom/pushgateway
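Before writing a full script, it can help to push a single test sample and confirm that Pushgateway accepts it. Pushgateway takes samples in the text exposition format, POSTed to /metrics/job/<job> with optional extra label pairs in the path. The helper names below are hypothetical, and the address in the usage comment is the server1 Pushgateway from the configuration above.

```shell
#!/bin/sh
# push_url BASE JOB [INSTANCE]
# Builds the Pushgateway URL for the given grouping key.
push_url() {
  if [ -n "$3" ]; then
    printf '%s/metrics/job/%s/instance/%s\n' "$1" "$2" "$3"
  else
    printf '%s/metrics/job/%s\n' "$1" "$2"
  fi
}

# push_metric URL NAME VALUE
# Sends one sample; the body must end with a newline, which printf adds.
push_metric() {
  printf '%s %s\n' "$2" "$3" | curl -s --data-binary @- "$1"
}

# Example usage:
#   push_metric "$(push_url http://192.168.10.11:9091 server1_pushgateway)" test_metric 1
```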

Write a script on the monitored machine. Below is a shell script I wrote to monitor GPU temperature:

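#!/bin/bash

# Short hostname, used as the instance label in the push URL
instance_name=$(hostname -f | cut -d'.' -f1)

while true; do
  # The temperature is the 3rd field (e.g. "45C") of each GPU's row in the nvidia-smi table.
  # Rows 12 and 18 come from the original script; rows 9 and 15 are inferred from the 3-line spacing.
  label1="server1_gpu1_temperature" # metric name
  server1_gpu1_temperature=$(nvidia-smi | awk 'NR==9' | awk '{print $3}' | awk -FC '{print $1}') # GPU 1
  label2="server1_gpu2_temperature"
  server1_gpu2_temperature=$(nvidia-smi | awk 'NR==12' | awk '{print $3}' | awk -FC '{print $1}') # GPU 2
  label3="server1_gpu3_temperature"
  server1_gpu3_temperature=$(nvidia-smi | awk 'NR==15' | awk '{print $3}' | awk -FC '{print $1}') # GPU 3
  label4="server1_gpu4_temperature"
  server1_gpu4_temperature=$(nvidia-smi | awk 'NR==18' | awk '{print $3}' | awk -FC '{print $1}') # GPU 4
  # Push all four samples to Pushgateway, one "name value" pair per line
  echo "$label1 $server1_gpu1_temperature
$label2 $server1_gpu2_temperature
$label3 $server1_gpu3_temperature
$label4 $server1_gpu4_temperature" | curl --data-binary @- "http://192.168.10.11:9091/metrics/job/server1_pushgateway/instance/$instance_name"
  sleep 10
done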
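Picking rows out of the nvidia-smi table by line number breaks as soon as the table layout or the GPU count changes. As an alternative sketch (not the method used in the script above), nvidia-smi can print just the needed fields as CSV via --query-gpu, and a small awk filter can turn them into metric lines. The function name format_gpu_metrics and the server1 prefix are illustrative assumptions.

```shell
#!/bin/sh
# format_gpu_metrics PREFIX
# Reads "index, temperature" CSV lines on stdin, as printed by
#   nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits
# and prints one "<PREFIX>_gpu<N>_temperature <value>" line per GPU.
# GPUs are numbered from 1 to match the metric names used in this post.
format_gpu_metrics() {
  awk -F', *' -v prefix="$1" '
    NF == 2 { printf "%s_gpu%d_temperature %s\n", prefix, $1 + 1, $2 }
  '
}

# Example usage on a machine with NVIDIA GPUs:
#   nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits |
#     format_gpu_metrics server1 |
#     curl --data-binary @- http://192.168.10.11:9091/metrics/job/server1_pushgateway
```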

Restart Pushgateway and Prometheus.
The custom metrics can then be found in Prometheus.

Build a GPU temperature panel in Grafana.

7. Push Grafana alerts to DingTalk

Add a DingTalk alert notification channel in Grafana.

Paste the DingTalk robot's webhook into the Url field under DingDing settings.
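If no alert arrives later, it is worth testing the webhook on its own first. The sketch below posts a plain text message to a DingTalk custom robot, which accepts a JSON body with msgtype set to text. The function names are hypothetical, the webhook URL is whatever DingTalk generated for your robot, and a robot with security settings enabled (keywords, signing, or IP allowlist) will reject messages that do not satisfy them.

```shell
#!/bin/sh
# dingtalk_text_payload MESSAGE
# Prints the JSON body for a plain text message.
# MESSAGE must not contain double quotes or backslashes (no escaping is done).
dingtalk_text_payload() {
  printf '{"msgtype":"text","text":{"content":"%s"}}\n' "$1"
}

# send_dingtalk WEBHOOK_URL MESSAGE
# Posts the message to the robot webhook.
send_dingtalk() {
  dingtalk_text_payload "$2" |
    curl -s -H 'Content-Type: application/json' --data-binary @- "$1"
}

# Example usage (substitute the webhook URL copied from your robot):
#   send_dingtalk "$WEBHOOK_URL" "GPU temperature test"
```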

Add an alert to the panel.

The alert is then received in DingTalk as expected.
