Prometheus official site: https://prometheus.io/

Prometheus Chinese documentation (GitHub): https://prometheus.fuckcloudnative.io/

Grafana: https://grafana.com

OneAlert: https://caweb.aiops.com/#/integrate

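Environment plan

Hostname        Host IP          Role

prometheus      192.168.6.109    prometheus
node_exporter   192.168.6.110    node_exporter

Initialize the servers

On every server, set the IP address and hostname, add the hosts to /etc/hosts, and synchronize the time.

Edit hosts

[root@localhost ~]#  vim /etc/hosts

Add the following line:

192.168.6.110 node1

If the virtual machine was cloned, change the last three characters of its UUID and check that no two machines share the same UUID.

hostnamectl set-hostname  prometheus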

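Time synchronization

1. Install ntpdate

Note: some releases do not ship ntpdate, so it has to be installed.

yum install -y ntpdate

2. Set the time zone to Shanghai, i.e. Beijing time (UTC+8)

Note: other time zones are available under /usr/share/zoneinfo.

cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime

3. Synchronize the time over NTP

ntpdate ntp6.aliyun.com

4. Synchronize the time automatically

(1) Synchronize from the boot script

vim /etc/rc.local

Add a time synchronization command:

/usr/sbin/ntpdate ntp6.aliyun.com

(2) Synchronize periodically with crontab
Run crontab -e to open the job list in a vi editing session, where jobs can be added or changed.
Format:

*/5 * * * * /usr/sbin/ntpdate ntp5.aliyun.com ntp6.aliyun.com ntp7.aliyun.com >/dev/null 2>&1

Run crontab -l to check that the job was added.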

Install Prometheus

[root@prometheus ~]# cd /opt
[root@prometheus opt]# mkdir soft
[root@prometheus opt]# cd soft
# Place prometheus-2.34.0.linux-amd64.tar.gz in /opt/soft before extracting it
[root@prometheus soft]# tar -zxvf prometheus-2.34.0.linux-amd64.tar.gz
[root@prometheus soft]# rm -rf prometheus-2.34.0.linux-amd64.tar.gz
[root@prometheus soft]# cd prometheus-2.34.0.linux-amd64/
[root@prometheus prometheus-2.34.0.linux-amd64]# pwd
[root@prometheus prometheus-2.34.0.linux-amd64]# ln -sv /opt/soft/prometheus-2.34.0.linux-amd64 /usr/local/prometheus

Manage Prometheus with systemd

# Edit the unit file
vim /etc/systemd/system/prometheus.service 
# Paste the following content (adjust as needed)

[Unit]
Description=Prometheus
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml --storage.tsdb.path=/usr/local/prometheus/data/ --web.listen-address=:9290 --web.enable-lifecycle
ExecStop=/usr/bin/pkill -f prometheus

[Install]
WantedBy=multi-user.target
# Save and exit
:wq

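Start Prometheus

# Reload the systemd configuration; changes to unit files only take effect after a reload.
systemctl daemon-reload
# Enable the service at boot
systemctl enable prometheus
# Start the service
systemctl start prometheus
# Check the service status
systemctl status prometheus

The web UI is now available at http://ip:9290/ (replace ip with the address of the Prometheus server).
If the page loads, Prometheus started successfully.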
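
As a command-line complement to opening the page, the sketch below probes Prometheus's `/-/healthy` and `/-/ready` management endpoints with curl. The function name `check_prometheus` and the `PROM_URL` variable are illustrative assumptions, not part of the original setup; adjust the address to your server.

```shell
# Base URL of Prometheus; an assumption, adjust host and port to your setup.
PROM_URL="${PROM_URL:-http://localhost:9290}"

# Probe the health and readiness endpoints; print one status line per endpoint.
# Returns 0 only if both endpoints answer successfully.
check_prometheus() {
  rc=0
  for endpoint in healthy ready; do
    if curl -fsS -o /dev/null "${PROM_URL}/-/${endpoint}"; then
      echo "${endpoint}: ok"
    else
      echo "${endpoint}: FAILED"
      rc=1
    fi
  done
  return "$rc"
}
```

Running `check_prometheus` right after `systemctl start prometheus` gives a scriptable result that can also be used from a remote machine by setting `PROM_URL`.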

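4 Overview of the UI (optional)

Alerts
Shows alert information. Each alert is in one of three states:
Inactive (normal; the alert condition is not met)
Pending (the condition is met but has not yet held for the configured duration; no alert has been sent)
Firing (the condition has held for the configured duration; the alert has been sent)
Graph
Queries the metrics stored in Prometheus with PromQL; results can be viewed as a table or as a graph
Status
1. TSDB Status: state of the time series database
2. Configuration: the loaded configuration
3. Rules: details of the alerting rules
4. Targets: all targets that Prometheus scrapes
Help
Classic UI

Explore the remaining pages on your own.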

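5 Common commands (optional)

1 Delete the data of a job

When a job is no longer used, its stored data can be removed with the delete API.

Note: the admin API must be enabled before this command works (start Prometheus with --web.enable-admin-api), and deleted data cannot be recovered.

Delete the data of the job named jobname. Repeated match[] arguments select the union of the matched series, so only the job selector is passed:

curl -X POST  -g 'http://host:9290/api/v1/admin/tsdb/delete_series?match[]={job="jobname"}'

host is the IP address or hostname of Prometheus
jobname is the name of the job to delete.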
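
The delete call only marks the series as deleted; the `clean_tombstones` admin endpoint then removes the data from disk. The sketch below wraps both calls. The function names `delete_series_url` and `delete_job` are hypothetical helpers introduced for illustration, and the script assumes Prometheus runs with `--web.enable-admin-api`.

```shell
# Build the delete_series URL for one job.
# $1 = host:port of Prometheus, $2 = job name
delete_series_url() {
  printf 'http://%s/api/v1/admin/tsdb/delete_series?match[]={job="%s"}\n' "$1" "$2"
}

# Delete the job's series, then ask Prometheus to purge the deleted data from disk.
# Requires Prometheus to be started with --web.enable-admin-api.
delete_job() {
  curl -fsS -X POST -g "$(delete_series_url "$1" "$2")" &&
    curl -fsS -X POST "http://$1/api/v1/admin/tsdb/clean_tombstones"
}
```

For example, `delete_job host:9290 jobname` removes the data of the job jobname. Printing the URL first with `delete_series_url host:9290 jobname` is a cheap way to double-check the selector before anything is deleted.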

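2 Hot-reload the configuration and rules

This endpoint is available because Prometheus was started with --web.enable-lifecycle.

curl -XPOST http://host:9290/-/reload

host is the IP address or hostname of Prometheus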

II. Install Grafana

Installing Grafana on the same node as Prometheus is recommended

1 Add the repo

vim /etc/yum.repos.d/grafana.repo
# Paste the following content

[grafana]
name=grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

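2 Install

yum install grafana -y

3 Start

# Enable Grafana at boot
systemctl enable grafana-server
# Start the Grafana service
systemctl start grafana-server

Open the Grafana page at http://ip:3000 and log in with username admin and password admin.

If the login page appears, Grafana started successfully.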
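
Grafana also exposes a small health endpoint, `/api/health`, which reports the state of its database as JSON. As a rough sketch for checking the service without a browser, the helper below extracts that field. The name `grafana_db_status` is an assumption made for this example.

```shell
# Extract the "database" field from the JSON returned by Grafana's /api/health.
# Reads the JSON on stdin and prints the field value (for example: ok).
grafana_db_status() {
  sed -n 's/.*"database"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}
```

Typical use is `curl -fsS http://ip:3000/api/health | grafana_db_status`, which should print ok when Grafana is up.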

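III. Monitor Linux servers

Monitor the CPU, memory, disk, and other metrics of Linux servers.
Workflow:

node_exporter collects the metrics
Prometheus pulls the metrics from the exporter and stores them
Grafana queries Prometheus and visualizes the data

1 Install node_exporter

node_exporter exposes the metrics of a single server, such as memory, disk, and CPU, for Prometheus to collect.

Every node that needs monitoring must have node_exporter installed as follows.
1 Download the package

# 1 Enter the installation directory
cd /opt/soft/
# 2 Download the package
wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
# 3 Extract
tar -zxvf node_exporter-1.3.1.linux-amd64.tar.gz
# 4 Create a symlink so installation paths stay uniform
ln -sv /opt/soft/node_exporter-1.3.1.linux-amd64 /usr/local/node_exporter

2 Manage node_exporter with systemd

# Edit the unit file
vim /etc/systemd/system/node_exporter.service 
# Paste the following content (adjust as needed)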

[Unit]
Description=node_exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/node_exporter/node_exporter --web.listen-address=:9120
ExecStop=/usr/bin/pkill -f node_exporter

[Install]
WantedBy=multi-user.target
# Save and exit
:wq

The default port 9100 was already in use here, so the port is changed to 9120 with a startup flag:
--web.listen-address=:9120

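3 Start node_exporter

# Reload the systemd configuration; changes to unit files only take effect after a reload.
systemctl daemon-reload
# Enable the service at boot
systemctl enable node_exporter
# Start the service
systemctl start node_exporter
# Check the service status
systemctl status node_exporter

Every node_exporter serves a simple page at http://ip:9120/; if it loads, node_exporter started successfully.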
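
Beyond the landing page, the data Prometheus actually scrapes is served at `/metrics` in plain text. The following sketch counts the samples of one metric in that output, which confirms that the exporter really produces data. The helper name `count_samples` is an assumption for illustration.

```shell
# Count the samples of one metric in Prometheus text exposition format.
# $1 = metric name; the exposition text is read from stdin.
# Comment lines (# HELP / # TYPE) do not match because they start with '#'.
count_samples() {
  grep -c "^$1[{ ]" || true
}
```

For example, `curl -fsS http://ip:9120/metrics | count_samples node_cpu_seconds_total` should print a number greater than 0 on a working exporter.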

2 Edit the Prometheus configuration

Default configuration

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9090"]

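Edit the configuration file

Note: the indentation of a YAML file must be kept intact, otherwise the file cannot be parsed.

# Enter the Prometheus installation directory
cd /usr/local/prometheus
# Edit the configuration file
vim prometheus.yml

# Paste the following at the end of the file
  - job_name: 'all_node'
    static_configs:
    - targets: ['node01:9120']
      labels:
        instance: node01
    - targets: ['node02:9120']
      labels:
        instance: node02
    - targets: ['node03:9120']
      labels:
        instance: node03

In - targets: ['node01:9120'], node01 is the hostname of a server running node_exporter; an IP address works as well. 9120 is the port node_exporter listens on. The other two servers are configured the same way.

This configuration tells Prometheus which server and port to pull data from. Because Prometheus itself listens on 9290 in this setup, the target of the prometheus job also changes from localhost:9090 to localhost:9290.

The modified configuration file looks like this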

global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9290"]

  - job_name: 'all_node'
    static_configs:
    - targets: ['node01:9120']
      labels:
        instance: node01
    - targets: ['node02:9120']
      labels:
        instance: node02
    - targets: ['node03:9120']
      labels:
        instance: node03

After changing the configuration, restart Prometheus for it to take effect

systemctl stop prometheus
systemctl start prometheus
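
Since the unit file starts Prometheus with `--web.enable-lifecycle`, a reload over HTTP is an alternative to a full restart, and the `promtool` binary shipped in the Prometheus tarball can validate the file first. The sketch below combines the two; the function name `safe_reload` and the three variables are assumptions chosen for this example.

```shell
# Paths and URL are assumptions matching the layout used in this article.
PROM_HOME="${PROM_HOME:-/usr/local/prometheus}"
PROMTOOL="${PROMTOOL:-$PROM_HOME/promtool}"
PROM_URL="${PROM_URL:-http://localhost:9290}"

# Validate prometheus.yml and reload Prometheus only if the file is valid.
safe_reload() {
  if ! "$PROMTOOL" check config "$PROM_HOME/prometheus.yml"; then
    echo "config invalid, not reloading" >&2
    return 1
  fi
  curl -fsS -X POST "$PROM_URL/-/reload" && echo "reloaded"
}
```

Checking before reloading avoids the situation where a YAML indentation mistake is only discovered after the service has already been stopped.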

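3 Verify the configuration

Open the Prometheus address http://ip:9290/
Steps: click Status, then click Targets

The all_node (N/N up) counter reflects the configuration above. With 3 servers configured it should read all_node (3/3 up).

If the number of targets shown matches the number configured, the configuration is correct. Otherwise check the hostname and port, and whether node_exporter started successfully.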
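
The same information is available from the HTTP API at `/api/v1/targets`, where each active target carries a `health` field. As a rough sketch that avoids extra tools such as jq, the helper below counts the targets reported as up. The name `count_up_targets` is an assumption, and the count covers all jobs, not only all_node.

```shell
# Count targets whose health is "up" in the JSON returned by /api/v1/targets.
# Reads the JSON on stdin and prints the count.
count_up_targets() {
  grep -o '"health":"up"' | grep -c '' || true
}
```

For example, `curl -fsS http://ip:9290/api/v1/targets | count_up_targets` should print 4 when the three nodes and the prometheus job itself are all up.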

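4 Configure Grafana

4.1 Create the data source

1 Open the Grafana address http://ip:3000 in a browser and log in with username admin and password admin

After logging in you land on the home page

2 Click the gear icon (settings) on the left, then click Data sources

3 Click Add data source

4 Type prometheus in the search box, then click Select

5 Enter a name for the data source (the default is fine) and the Prometheus address http://localhost:9290/. Here Prometheus and Grafana run on the same machine, so the host is localhost; change it if they run on different machines.

6 Click Save & test

The Prometheus data source is now created.

7 Click the icon in the top-left corner to return to the home page

4.2 Add a dashboard

1 Click +, then click Import

2 Enter 8919 and click Load

8919 is the ID of a dashboard published by another user, found on Grafana's official dashboard site https://grafana.com/grafana/dashboards/. Dashboards for other systems, such as Flink or MySQL, can be found on the same site.

3 Enter a name, select the data source, and click Import

Once the dashboard shows the server metrics, the configuration succeeded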

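IV. Monitor Flink

Flink ships with a metrics reporter that pushes metrics to PushGateway; Prometheus then pulls the metrics from PushGateway and stores them; Grafana queries Prometheus and displays the data.

Note: with the Flink-yarn service integrated in CDH, jobs must be submitted to the session that starts together with the Flink-yarn service, otherwise the job metrics cannot be monitored

1 Install PushGateway

Installing PushGateway on the same node as Prometheus is recommended

1.1 Download PushGateway

# Enter the installation directory
cd /opt/soft/
# Download the package
wget -c https://github.com/prometheus/pushgateway/releases/download/v1.4.2/pushgateway-1.4.2.linux-amd64.tar.gz
# Extract
tar -zxvf pushgateway-1.4.2.linux-amd64.tar.gz
# Create a symlink for easier management
ln -sv /opt/soft/pushgateway-1.4.2.linux-amd64 /usr/local/pushgateway

1.2 Manage PushGateway with systemd
vim /etc/systemd/system/pushgateway.service

Paste the following content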

[Unit]
Description=pushgateway
After=network.target

[Service]
Type=simple
ExecStart=/opt/soft/pushgateway-1.4.2.linux-amd64/pushgateway --web.listen-address=:9291
ExecStop=/usr/bin/pkill -f pushgateway

[Install]
WantedBy=multi-user.target

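Save and exit

:wq

--web.listen-address=:9291 changes the port from the default 9091 to 9291 to avoid a conflict

1.3 Start PushGateway

Reload the systemd configuration; changes to unit files only take effect after a reload.

systemctl daemon-reload

Enable the service at boot

systemctl enable pushgateway

Start the service

systemctl start pushgateway

Check the service status

systemctl status pushgateway


# Edit the Prometheus configuration file

vim /usr/local/prometheus/prometheus.yml

Append the following to the end of the file, as another entry under scrape_configs

  - job_name: 'pushgateway'
    honor_labels: true
    static_configs:
    - targets: ['node01:9291']
      labels:
        instance: 'pushgateway'

After changing the configuration, restart Prometheus for it to take effect

systemctl restart prometheus


1.4 Verify the configuration
Open http://ip:9290/targets in Prometheus; if the pushgateway target is listed as shown below, the configuration succeeded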
![1.png](https://upload-images.jianshu.io/upload_images/26919493-4622859824f1a226.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
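
To see the push path working before Flink is involved, a single test sample can be pushed to PushGateway by hand through its `/metrics/job/<job>` endpoint. The sketch below does that with curl. The function name `push_metric`, the `PUSHGATEWAY_URL` variable, and the demo job and metric names in the usage note are all assumptions made for illustration.

```shell
# Address of PushGateway; an assumption, adjust host and port to your setup.
PUSHGATEWAY_URL="${PUSHGATEWAY_URL:-http://localhost:9291}"

# Push one sample to PushGateway.
# $1 = job name, $2 = metric name, $3 = value
push_metric() {
  printf '%s %s\n' "$2" "$3" |
    curl -fsS --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/$1"
}
```

After `push_metric demo_job demo_metric 1`, the sample should show up on the PushGateway page and, after the next scrape, in Prometheus with the label job="demo_job" (kept because of honor_labels: true). The test group can be removed again with `curl -X DELETE http://localhost:9291/metrics/job/demo_job`.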


# V. Configure alerting

Alerting workflow

1.  The exporter collects metric data
2.  Prometheus pulls the metric data from the exporter and stores it
3.  Prometheus pushes triggered alerts to Alertmanager
4.  Alertmanager sends the alert notifications by email, DingTalk, and so on

## Create a DingTalk robot

1 Open the chat of a group and click the gear icon in the top-right corner
![1.png](https://upload-images.jianshu.io/upload_images/26919493-ace28981e16971a6.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
2 Click Smart Group Assistant (智能群助手)
![2.png](https://upload-images.jianshu.io/upload_images/26919493-a0027a92cac52099.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
3 Click +
![3.png](https://upload-images.jianshu.io/upload_images/26919493-5fb0258209bc5d55.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
4 Click +
![4.png](https://upload-images.jianshu.io/upload_images/26919493-ed2aba5990bfd67b.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
5 Scroll down and click Custom (自定义)
![5.png](https://upload-images.jianshu.io/upload_images/26919493-a60b5ad0a2cf5d61.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
6 Click Add
![6.png](https://upload-images.jianshu.io/upload_images/26919493-e1cf077672722119.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
7 Enter a name for the robot, select the signing option (加签), tick "I have read and agree", and click Finish
![7.png](https://upload-images.jianshu.io/upload_images/26919493-a061df3c2aa6c610.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
The robot is now added
![8.png](https://upload-images.jianshu.io/upload_images/26919493-630e1ca1f875180b.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

Click Copy here and save the Webhook URL; it is needed later

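## 2 Install dingtalk
dingtalk (prometheus-webhook-dingtalk) is a Prometheus add-on that sends alert notifications to DingTalk
### 2.1 Download

cd /opt/soft

Download the package

wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v2.0.0/prometheus-webhook-dingtalk-2.0.0.linux-amd64.tar.gz

Extract

tar -zxvf prometheus-webhook-dingtalk-2.0.0.linux-amd64.tar.gz

Create a symlink

ln -sv /opt/soft/prometheus-webhook-dingtalk-2.0.0.linux-amd64/ /usr/local/prometheus-webhook-dingtalk

### 2.2 Edit the dingtalk configuration file

Copy the example configuration to a new file named config.yml

cd /usr/local/prometheus-webhook-dingtalk
cp config.example.yml config.yml

Edit the configuration file

vim config.yml

Change every url in the configuration to the Webhook URL saved in the previous step. Because the robot was created with signing enabled, also set secret to the signing secret of the robot.

Assuming the Webhook URL is https://oapi.dingtalk.com/robot/send?access_token=abc, the configuration looks like this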

targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=abc
    # secret for signature
    secret: SEC000000000000000000000
  webhook2:
    url: https://oapi.dingtalk.com/robot/send?access_token=abc
  webhook_legacy:
    url: https://oapi.dingtalk.com/robot/send?access_token=abc
    # Customize template content
    message:
      # Use legacy template
      title: '{{ template "legacy.title" . }}'
      text: '{{ template "legacy.content" . }}'
  webhook_mention_all:
    url: https://oapi.dingtalk.com/robot/send?access_token=abc
    mention:
      all: true
  webhook_mention_users:
    url: https://oapi.dingtalk.com/robot/send?access_token=abc
    mention:
      mobiles: ['156xxxx8827', '189xxxx8325']


### 2.3 Manage dingtalk with systemd

Edit the unit file

vim /etc/systemd/system/prometheus-webhook-dingtalk.service

Paste the following content

[Unit]
Description=prometheus-webhook-dingtalk
After=network-online.target

[Service]
Restart=on-failure
ExecStart=/usr/local/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk --config.file=/usr/local/prometheus-webhook-dingtalk/config.yml --web.listen-address=:8260
ExecStop=/usr/bin/pkill -f prometheus-webhook-dingtalk

[Install]
WantedBy=multi-user.target

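Save and exit

:wq


--web.listen-address=:8260 changes the port from the default 8060 to 8260 to avoid a conflict

### 2.4 Start dingtalk

Reload the systemd configuration; changes to unit files only take effect after a reload.

systemctl daemon-reload

Enable the service at boot

systemctl enable prometheus-webhook-dingtalk

Start the service

systemctl start prometheus-webhook-dingtalk

Check the service status

systemctl status prometheus-webhook-dingtalk

Check the port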

lsof -i :8260
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
prometheu 7893 root 3u IPv6 110639539 0t0 TCP *:8260 (LISTEN)


## 3 Install alertmanager

Installing alertmanager on the same node as Prometheus is recommended

### 3.1 Download alertmanager

cd /opt/soft

Download the package

wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz

Extract

tar -zxvf alertmanager-0.24.0.linux-amd64.tar.gz

Create a symlink

ln -sv /opt/soft/alertmanager-0.24.0.linux-amd64/ /usr/local/alertmanager


### 3.2 Manage alertmanager with systemd

vim /etc/systemd/system/alertmanager.service

Paste the following content

[Unit]
Description=alertmanager
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/alertmanager/alertmanager --web.listen-address=:9293 --config.file=/usr/local/alertmanager/alertmanager.yml --storage.path=/usr/local/alertmanager/data/
ExecStop=/usr/bin/pkill -f alertmanager

[Install]
WantedBy=multi-user.target

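Save and exit

:wq


### 3.3 Start alertmanager

Reload the systemd configuration; changes to unit files only take effect after a reload.

systemctl daemon-reload

Enable the service at boot

systemctl enable alertmanager

Start the service

systemctl start alertmanager

Check the service status

systemctl status alertmanager


Open the alertmanager address http://localhost:9293/. If the page below appears, alertmanager started successfully; if it does not start, check the configuration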
![1.png](https://upload-images.jianshu.io/upload_images/26919493-d9981143ed2c3bf9.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
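
For alerts to actually reach DingTalk, alertmanager.yml needs a receiver that points at the dingtalk service started earlier. The article does not show that file, so the following is only a minimal sketch under stated assumptions: the receiver name dingtalk, the grouping intervals, and the choice of the webhook1 target are illustrative, while the URL follows the `/dingtalk/<target>/send` pattern of prometheus-webhook-dingtalk on the port 8260 used above. The helper name `write_alertmanager_config` is hypothetical.

```shell
# Write a minimal Alertmanager configuration that forwards every alert to the
# dingtalk webhook target "webhook1". $1 = path of the file to write.
# Receiver name and timing values are assumptions; tune them to your needs.
write_alertmanager_config() {
  cat > "$1" <<'EOF'
route:
  receiver: dingtalk
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
receivers:
  - name: dingtalk
    webhook_configs:
      - url: http://localhost:8260/dingtalk/webhook1/send
        send_resolved: true
EOF
}
```

A possible use is `write_alertmanager_config /usr/local/alertmanager/alertmanager.yml` followed by `systemctl restart alertmanager`. The amtool binary in the alertmanager directory can check the file beforehand with `amtool check-config`.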

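## 4 Edit the Prometheus configuration

### 1 Write the rule files

Create a directory for the alerting rule files

mkdir /usr/local/prometheus/rules

Enter the directory

cd /usr/local/prometheus/rules

Each rule file created below needs a name ending in .yml, because prometheus.yml later loads rules/*.yml.

#### 1 Create the CPU alert rule file with the following content

groups:
- name: CPU alert rules
  rules:
  - alert: Server CPU usage alert
    expr: 100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[1m]) )) * 100 > 85
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "CPU usage is spiking."
      description: "CPU usage is above 85% (current value: {{ $value }}%)"


#### 2 Create the disk alert rule file with the following content

groups:
- name: Disk usage alert rules
  rules:
  - alert: Server disk usage alert
    expr: 100 - node_filesystem_free_bytes{fstype=~"xfs|ext4"} / node_filesystem_size_bytes{fstype=~"xfs|ext4"} * 100 > 85
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: "Disk partition usage is too high"
      description: "Partition usage is above 85% (current value: {{ $value }}%)"


#### 3 Create the memory alert rule file with the following content

groups:
- name: Memory alert rules
  rules:
  - alert: Server memory usage alert
    expr: (1 - (node_memory_MemAvailable_bytes{job="all_node"} / (node_memory_MemTotal_bytes{job="all_node"}))) * 100 > 85
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "The server is running low on available memory."
      description: "Memory usage is above 85% (current value: {{ $value }}%)"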


#### 4 Create the alert rule file for the number of running Flink jobs

First file

groups:
- name: prod-realtime-flink-job-failure
  rules:
  - alert: Prod realtime Flink job failure alert
    expr: flink_jobmanager_numRunningJobs{job=~"flink_pushgateway.*"} < 3
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "prod Flink: a job has failed"
      description: "prod realtime Flink job failure: expected 3 running jobs (currently running: {{ $value }})"


*   expr: flink_jobmanager_numRunningJobs{job=~"flink_pushgateway.*"} < 3
    Three jobs are expected to be running; if the number of running jobs drops below 3, one of them has died
*   for: 1m
    The alert is sent once the expr condition has held for 1 minute

#### 5 Check whether a specific Flink job is alive

groups:
- name: MainClassName failure
  rules:
  - alert: Prod realtime Flink job failure alert
    expr: ((flink_jobmanager_job_uptime{ job_name="MainClassName"})-(flink_jobmanager_job_uptime{ job_name="MainClassName"} offset 10s))/1000 == 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "prod realtime job failed"
      description: "MainClassName (job xxx) failed (current value: {{ $value }})"


**Note: a specific job can only be monitored through MainClassName, the main class name it was started with**

The check relies on uptime, a metric that keeps increasing while the job runs

While the job is alive, the current uptime minus the uptime 10 seconds earlier equals 10 seconds, so the expression `((flink_jobmanager_job_uptime{ job_name="MainClassName"})-(flink_jobmanager_job_uptime{ job_name="MainClassName"} offset 10s))/1000` evaluates to a roughly constant value of about 10

If the job has failed, uptime no longer grows with time and stays at a fixed value, so the current uptime minus the uptime 10 seconds earlier equals 0. The alert fires when the expression `((flink_jobmanager_job_uptime{ job_name="MainClassName"})-(flink_jobmanager_job_uptime{ job_name="MainClassName"} offset 10s))/1000 == 0` holds
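
The arithmetic behind the expression can be replayed with plain numbers. The sketch below assumes, as the division by 1000 in the rule implies, that uptime is reported in milliseconds; the function name `uptime_delta_seconds` and the sample values are made up for illustration.

```shell
# Difference between two uptime samples, converted from milliseconds to seconds.
# $1 = current uptime in ms, $2 = uptime 10 seconds earlier in ms
uptime_delta_seconds() {
  echo $(( ($1 - $2) / 1000 ))
}

uptime_delta_seconds 125000 115000   # job alive: prints 10
uptime_delta_seconds 115000 115000   # uptime frozen: prints 0, the alert condition
```

Only the second case satisfies `== 0`, and the rule's `for: 1m` requires it to stay that way for a minute before the alert fires.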

### 2 Edit the configuration file

vim /usr/local/prometheus/prometheus.yml

Change the following sections

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9293

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"


Restart Prometheus

systemctl stop prometheus
systemctl start prometheus
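
Rule files with broken indentation or invalid expressions keep Prometheus from loading them, so it can help to check them with `promtool check rules` before restarting. The following is a small sketch; the function name `check_rules` and the two variables are assumptions matching the directory layout used in this article.

```shell
# Locations are assumptions based on the paths used in this article.
PROMTOOL="${PROMTOOL:-/usr/local/prometheus/promtool}"
RULES_DIR="${RULES_DIR:-/usr/local/prometheus/rules}"

# Validate every .yml rule file in the rules directory.
check_rules() {
  "$PROMTOOL" check rules "$RULES_DIR"/*.yml
}
```

`check_rules` returns a non-zero status when any file fails validation, so it can be chained as `check_rules && systemctl restart prometheus`.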


## 5 Verify the result

Open the Prometheus alerts page at http://node01:9290/alerts
The alerting rules configured above should be listed there
