First things first: this article walks through writing a Prometheus exporter that monitors Slurm; the install environment is Ubuntu 16.04.
1. Download Prometheus
Download it from the official site, then unpack it:
tar -zxvf prometheus-2.4.3.linux-amd64.tar.gz
cd prometheus-2.4.3.linux-amd64
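The tarball ships a prebuilt binary, so a quick version check confirms the download unpacked correctly:
./prometheus --version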
2. The prometheus.yml config file
Everything at the top is the default configuration; what you actually need to change is the last job_name, where you point Prometheus at the address you want to monitor. Here I'm monitoring my_slurm at localhost:8000 (it's better to write a real IP address rather than localhost; I'm just being lazy here :D).
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'my_demo'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9100']

  - job_name: 'my_slurm'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:8000']
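Optionally, before starting the server, the promtool binary that ships in the same tarball can sanity-check the edited file:
./promtool check config prometheus.yml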
3. No other monitoring is configured, so start the service directly:
./prometheus --config.file=prometheus.yml
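To confirm it came up without opening a browser, Prometheus 2.x exposes a plain-text health endpoint:
curl localhost:9090/-/healthy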
4. Install Slurm
See my previous article:
Installing Slurm on Ubuntu 16.04
5. Slurm's role
All Slurm has to do now is run a job; we then use Slurm commands to read off the resources that job is using. Let's start with a small example!
vim job.sh # create a job script; any simple delay will do (see the sketch below)
sbatch job.sh # submit the job; this returns the jobID
cat slurm-1.out # the 1 here is the jobID
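For reference, a minimal job.sh can be as simple as this (the 60-second sleep is arbitrary; it just keeps the job alive long enough to watch it):
#!/bin/bash
#SBATCH --job-name=demo
sleep 60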
scontrol show nodes # show the full state of all nodes
NodeName=localhost Arch=x86_64 CoresPerSocket=1
CPUAlloc=1 CPUErr=0 CPUTot=1 CPULoad=0.14 Features=(null)
Gres=(null)
NodeAddr=localhost NodeHostName=localhost Version=15.08
OS=Linux RealMemory=7965 AllocMem=1024 FreeMem=4005 Sockets=1 Boards=1
State=ALLOCATED ThreadsPerCore=1 TmpDisk=41197 Weight=1 Owner=N/A
BootTime=2018-12-03T14:23:17 SlurmdStartTime=2018-12-03T14:24:58
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
This prints a pile of information, but we only need to pick out the values of the attributes we care about (e.g. CPUTot, the total number of CPUs on the node) with a regular expression.
While the job is running, squeue shows the resources it is using; see the squeue documentation for more details.
squeue --format "%A:%c:%t" # %A is the jobID, %c the number of CPUs used, %t the job state
JOBID:MIN_CPUS:ST
35:1:R
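As an aside, the same regex approach works on this squeue output. Here is a hypothetical sketch (running_jobs_metric and the slurm_jobs_running name are mine, not part of the collector below) that counts jobs in state R:
import re
import subprocess
from prometheus_client.core import GaugeMetricFamily

def running_jobs_metric():
    # Each data line looks like "35:1:R"; count the ones whose state is R (running).
    out = subprocess.Popen('squeue --format "%A:%c:%t"',
                           stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True).communicate()[0].decode()
    running = len(re.findall(r'^\d+:\d+:R\s*$', out, re.MULTILINE))
    return GaugeMetricFamily('slurm_jobs_running', 'number of running jobs', value=float(running))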
6. Everything is ready; time to write a collector.
Prometheus has four metric types: Counter, Gauge, Summary, and Histogram.
Counter: starts at 0 and only ever increases.
Gauge: like a Counter, but can go both up and down.
The other two, Histogram and Summary, are rarely needed here.
Everything below uses the Gauge type; see the sketch right after this list.
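For intuition, here is a standalone sketch of the two common types using the prometheus_client library (the metric names are made up for illustration):
from prometheus_client import Counter, Gauge

requests_total = Counter('requests_total', 'only ever increases')
queue_size = Gauge('queue_size', 'can move in both directions')

requests_total.inc()  # +1; a Counter has no dec()
queue_size.set(5.0)   # a Gauge can be set directly...
queue_size.dec(2)     # ...and decreased again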
The subprocess calls used in the example are covered in my earlier article on using subprocess in Python.
Note: the value passed to GaugeMetricFamily must be a float.
import re
import subprocess

from prometheus_client.core import GaugeMetricFamily, REGISTRY
from prometheus_client import make_wsgi_app
from wsgiref.simple_server import make_server


class CustomCollector(object):
    def add(self, params):
        # Sum the per-node values captured by the regex.
        total = 0
        for i in params:
            total += int(i)
        return total

    def collect(self):
        output = subprocess.Popen("scontrol show nodes",
                                  stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
        # decode() so the regexes run on str rather than bytes on Python 3
        out_put = output.communicate()[0].decode()
        if out_put:
            count = re.findall(r'CPUTot=(\d+)', out_put)
            total_c = self.add(count)
            yield GaugeMetricFamily('slurm_cpu_total', 'total_count', value=float(total_c))

            used = re.findall(r'CPUAlloc=(\d+)', out_put)
            used_cpu = self.add(used)
            yield GaugeMetricFamily('slurm_cpu_used', 'used_count', value=float(used_cpu))

            real_memory = re.findall(r'RealMemory=(\d+)', out_put)
            total_memory = self.add(real_memory)
            yield GaugeMetricFamily('slurm_memory_total', 'total_memory', value=float(total_memory))

            alloc_memory = re.findall(r'AllocMem=(\d+)', out_put)
            used_memory = self.add(alloc_memory)
            yield GaugeMetricFamily('slurm_memory_used', 'used_memory', value=float(used_memory))


REGISTRY.register(CustomCollector())

if __name__ == '__main__':
    # Print the metrics once as a quick sanity check, then serve them on port 8000.
    coll = CustomCollector()
    for i in coll.collect():
        print(i)
    app = make_wsgi_app()
    httpd = make_server('', 8000, app)
    httpd.serve_forever()
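Once the script is running, you can hit the raw metrics endpoint to confirm the Gauges are exposed (the values will of course depend on your node):
curl localhost:8000/metrics
Among the default Python process metrics, you should see lines along the lines of slurm_cpu_total 1.0 and slurm_memory_total 7965.0.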
7. Create the job.sh file in the current directory and run it, then go to the prometheus directory and start the service (if you skipped step 3, start it now). Open a browser and go to localhost:9090.
The keywords to search for there are the names we gave the metrics when defining them.
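For example, searching slurm_cpu_used plots the allocated CPU count over time. The metrics can also be combined with ordinary PromQL in the same expression box, e.g. slurm_cpu_used / slurm_cpu_total for the fraction of CPUs currently allocated.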
That's all!