First things first: this article walks through writing a Prometheus exporter that monitors Slurm; the install environment is Ubuntu 16.04.
1. Download Prometheus
Download it from the official site, then unpack it:
tar -zxvf prometheus-2.4.3.linux-amd64.tar.gz
cd prometheus-2.4.3.linux-amd64
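The tarball ships a prebuilt binary, so a quick version check confirms the download unpacked correctly:
./prometheus --version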
2. The prometheus.yml config file
Everything at the top is the default configuration; what you actually need to change is the last job_name, where you point Prometheus at the address you want to monitor. Here I'm monitoring my_slurm at localhost:8000 (it's better to write a real IP address rather than localhost; I'm just being lazy here :D).
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'my_demo'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9100']

  - job_name: 'my_slurm'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:8000']
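Optionally, before starting the server, the promtool binary that ships in the same tarball can sanity-check the edited file:
./promtool check config prometheus.yml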
3. No other monitoring is configured, so start the service directly:
./prometheus --config.file=prometheus.yml
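To confirm it came up without opening a browser, Prometheus 2.x exposes a plain-text health endpoint:
curl localhost:9090/-/healthy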
4. Install Slurm
See my previous article:
Installing Slurm on Ubuntu 16.04
5. Slurm's role
All Slurm has to do now is run a job; we then use Slurm commands to read off the resources that job is using. Let's start with a small example!
vim job.sh # create a job script; any simple delay will do (see the sketch below)
sbatch job.sh # submit the job; this returns the jobID
cat slurm-1.out # the 1 here is the jobID
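For reference, a minimal job.sh can be as simple as this (the 60-second sleep is arbitrary; it just keeps the job alive long enough to watch it):
#!/bin/bash
#SBATCH --job-name=demo
sleep 60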
scontrol show nodes # show the full state of all nodes
NodeName=localhost Arch=x86_64 CoresPerSocket=1
CPUAlloc=1 CPUErr=0 CPUTot=1 CPULoad=0.14 Features=(null)
Gres=(null)
NodeAddr=localhost NodeHostName=localhost Version=15.08
OS=Linux RealMemory=7965 AllocMem=1024 FreeMem=4005 Sockets=1 Boards=1
State=ALLOCATED ThreadsPerCore=1 TmpDisk=41197 Weight=1 Owner=N/A
BootTime=2018-12-03T14:23:17 SlurmdStartTime=2018-12-03T14:24:58
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
This prints a pile of information, but we only need to pick out the values of the attributes we care about (e.g. CPUTot, the total number of CPUs on the node) with a regular expression.
While the job is running, squeue shows the resources it is using; see the squeue documentation for more details.
squeue --format "%A:%c:%t" # %A is the jobID, %c the number of CPUs used, %t the job state
JOBID:MIN_CPUS:ST
35:1:R
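As an aside, the same regex approach works on this squeue output. Here is a hypothetical sketch (running_jobs_metric and the slurm_jobs_running name are mine, not part of the collector below) that counts jobs in state R:
import re
import subprocess
from prometheus_client.core import GaugeMetricFamily

def running_jobs_metric():
    # Each data line looks like "35:1:R"; count the ones whose state is R (running).
    out = subprocess.Popen('squeue --format "%A:%c:%t"',
                           stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True).communicate()[0].decode()
    running = len(re.findall(r'^\d+:\d+:R\s*$', out, re.MULTILINE))
    return GaugeMetricFamily('slurm_jobs_running', 'number of running jobs', value=float(running))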
6. Everything is ready; time to write a collector.
Prometheus has four metric types: Counter, Gauge, Summary, and Histogram.
Counter: starts at 0 and only ever increases.
Gauge: like a Counter, but can go both up and down.
The other two, Histogram and Summary, are rarely needed here.
Everything below uses the Gauge type; see the sketch right after this list.
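For intuition, here is a standalone sketch of the two common types using the prometheus_client library (the metric names are made up for illustration):
from prometheus_client import Counter, Gauge

requests_total = Counter('requests_total', 'only ever increases')
queue_size = Gauge('queue_size', 'can move in both directions')

requests_total.inc()  # +1; a Counter has no dec()
queue_size.set(5.0)   # a Gauge can be set directly...
queue_size.dec(2)     # ...and decreased again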
The subprocess calls used in the example are covered in my earlier article on using subprocess in Python.
Note: the value passed to GaugeMetricFamily must be a float.
import re
import subprocess

from prometheus_client.core import GaugeMetricFamily, REGISTRY
from prometheus_client import make_wsgi_app
from wsgiref.simple_server import make_server


class CustomCollector(object):
    def add(self, params):
        # Sum the per-node values captured by the regex.
        total = 0
        for i in params:
            total += int(i)
        return total

    def collect(self):
        output = subprocess.Popen("scontrol show nodes",
                                  stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
        # decode() so the regexes run on str rather than bytes on Python 3
        out_put = output.communicate()[0].decode()
        if out_put:
            count = re.findall(r'CPUTot=(\d+)', out_put)
            total_c = self.add(count)
            yield GaugeMetricFamily('slurm_cpu_total', 'total_count', value=float(total_c))

            used = re.findall(r'CPUAlloc=(\d+)', out_put)
            used_cpu = self.add(used)
            yield GaugeMetricFamily('slurm_cpu_used', 'used_count', value=float(used_cpu))

            real_memory = re.findall(r'RealMemory=(\d+)', out_put)
            total_memory = self.add(real_memory)
            yield GaugeMetricFamily('slurm_memory_total', 'total_memory', value=float(total_memory))

            alloc_memory = re.findall(r'AllocMem=(\d+)', out_put)
            used_memory = self.add(alloc_memory)
            yield GaugeMetricFamily('slurm_memory_used', 'used_memory', value=float(used_memory))


REGISTRY.register(CustomCollector())

if __name__ == '__main__':
    # Print the metrics once as a quick sanity check, then serve them on port 8000.
    coll = CustomCollector()
    for i in coll.collect():
        print(i)
    app = make_wsgi_app()
    httpd = make_server('', 8000, app)
    httpd.serve_forever()
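Once the script is running, you can hit the raw metrics endpoint to confirm the Gauges are exposed (the values will of course depend on your node):
curl localhost:8000/metrics
Among the default Python process metrics, you should see lines along the lines of slurm_cpu_total 1.0 and slurm_memory_total 7965.0.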
7. Create the job.sh file in the current directory and run it, then go to the prometheus directory and start the service (if you skipped step 3, start it now). Open a browser and go to localhost:9090.
The keywords to search for there are the names we gave the metrics when defining them.
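For example, searching slurm_cpu_used plots the allocated CPU count over time. The metrics can also be combined with ordinary PromQL in the same expression box, e.g. slurm_cpu_used / slurm_cpu_total for the fraction of CPUs currently allocated.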
That's all!