A Simple Monitoring Program for Multiple GPU Servers

Preface

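Our lab has many GPU servers, and every time I wanted to run code I had to log in to them one by one to check whether the GPUs were in use. So I wrote this small program. The code is on GitHub at https://github.com/rikonaka/watchcorgi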

Sample output

curl http://127.0.0.1:7070/info
>> 2023-06-03 12:01:31 [watchcorgi]
+---------+------+------+-------------------------------------+------+-------------------+--------------+-----------+
|   name  |cpu[s]|cpu[u]|              gpu device             |gpu[u]|       gpu[m]      |   gpu user   |update time|
+---------+------+------+-------------------------------------+------+-------------------+--------------+-----------+
|   gpu1  | 0.0 %| 0.0 %|      A100-PCIE-40GB(460.106.00)     |  0 % |  0 MiB/40536 MiB  |     null     |  12:01:22 |
|         |      |      |      A100-PCIE-40GB(460.106.00)     | 17 % |  0 MiB/40536 MiB  |              |           |
+---------+------+------+-------------------------------------+------+-------------------+--------------+-----------+
|   gpu2  | 0.0 %| 0.0 %|  NVIDIA GeForce RTX 3090(515.65.01) |  0 % |  2 MiB/24576 MiB  |   StainAtt   |  12:01:30 |
|         |      |      |  NVIDIA GeForce RTX 3090(515.65.01) | 91 % |12611 MiB/24576 MiB|              |           |
+---------+------+------+-------------------------------------+------+-------------------+--------------+-----------+
|   gpu3  | 0.0 %| 0.0 %|NVIDIA GeForce GTX 1080 Ti(530.30.02)|  0 % |  0 MiB/11264 MiB  |     null     |  12:01:24 |
|         |      |      |NVIDIA GeForce GTX 1080 Ti(530.30.02)|  1 % |  0 MiB/11264 MiB  |              |           |
+---------+------+------+-------------------------------------+------+-------------------+--------------+-----------+
|   gpu4  | 0.0 %| 0.2 %|                                     |      |                   | driver failed|  12:01:25 |
+---------+------+------+-------------------------------------+------+-------------------+--------------+-----------+
|   gpu5  | 0.0 %| 0.0 %|NVIDIA GeForce RTX 2080 Ti(530.30.02)|  0 % |  0 MiB/11264 MiB  |     null     |  12:01:20 |
+---------+------+------+-------------------------------------+------+-------------------+--------------+-----------+
|   gpu6  | 0.1 %| 0.0 %|         Quadro P5000(510.54)        | 100 %|16145 MiB/16384 MiB|      CNN     |  12:01:29 |
+---------+------+------+-------------------------------------+------+-------------------+--------------+-----------+
|   gpu7  | 0.0 %| 0.0 %|      A100-PCIE-40GB(460.106.00)     |  0 % |39262 MiB/40536 MiB|    API-Net   |  12:01:28 |
|         |      |      |      A100-PCIE-40GB(460.106.00)     |  0 % |  3 MiB/40536 MiB  |              |           |
+---------+------+------+-------------------------------------+------+-------------------+--------------+-----------+
|   gpu8  | 0.0 %| 0.0 %|NVIDIA GeForce RTX 2080 Ti(510.47.03)|  0 % |  1 MiB/11264 MiB  |     null     |  12:01:26 |
|         |      |      |NVIDIA GeForce RTX 2080 Ti(510.47.03)|  0 % |  1 MiB/11264 MiB  |              |           |
+---------+------+------+-------------------------------------+------+-------------------+--------------+-----------+
|   gpu9  | 0.0 %| 0.0 %|  NVIDIA A100-PCIE-40GB(525.116.03)  | 83 % |18796 MiB/40960 MiB|OpenHGNN_final|  12:01:23 |
|         |      |      |                                     |      |                   |   StainAtt   |           |
+---------+------+------+-------------------------------------+------+-------------------+--------------+-----------+
|  gpu10  | 0.0 %| 0.0 %| NVIDIA GeForce RTX 3090(525.116.03) |  0 % |  0 MiB/24576 MiB  |     null     |  12:01:28 |
+---------+------+------+-------------------------------------+------+-------------------+--------------+-----------+
|  gpu11  | 0.5 %| 4.2 %|   NVIDIA A100-PCIE-40GB(515.65.01)  | 91 % | 3671 MiB/40960 MiB|     liif     |  12:01:26 |
+---------+------+------+-------------------------------------+------+-------------------+--------------+-----------+
|  gpu12  | 0.0 %| 0.0 %| NVIDIA GeForce RTX 4090(525.116.03) |  0 % |  0 MiB/24564 MiB  |     null     |  12:01:18 |
+---------+------+------+-------------------------------------+------+-------------------+--------------+-----------+
Powered by Rust
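
Since the whole point is to find a free machine quickly, the table can be filtered on the command line. The following is a minimal sketch, not part of watchcorgi: `idle_servers` is a name I made up, and it assumes the column order shown above (field 2 is the server name, field 8 is the GPU user) and that "null" means no user process was detected.

```shell
# idle_servers: read the /info table on stdin and print the names of
# servers whose "gpu user" column is "null".
# Assumption: columns are separated by "|" in the order shown above.
idle_servers() {
    awk -F'|' '
        NF >= 9 {
            name = $2; user = $8
            gsub(/^ +| +$/, "", name)   # trim padding spaces
            gsub(/^ +| +$/, "", user)
            if (name != "" && user == "null") print name
        }'
}

# Example (YOUR_SERVER_IP is a placeholder):
# curl -s http://YOUR_SERVER_IP:7070/info | idle_servers
```

Note that the user column is only filled in on the first row of a multi-GPU host, and a "null" user does not guarantee zero utilization (the second card of gpu1 above shows 17 %), so treat the result as a hint rather than a reservation.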

Basic installation

Before installing, make sure Redis is available on the machine that will run the server.
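
A quick way to confirm that Redis is up is to ping it with `redis-cli`, which answers `PONG` when the server is reachable. This is only a sketch: `redis_ready` and the `REDIS_CLI` override variable are hypothetical names, and it assumes Redis listens on the `redis-cli` default of 127.0.0.1:6379. Whether watchcorgi-server uses that default is not stated here.

```shell
# redis_ready: succeed only if Redis answers PING with PONG.
# REDIS_CLI is a hypothetical override for the redis-cli binary path.
redis_ready() {
    [ "$("${REDIS_CLI:-redis-cli}" ping 2>/dev/null)" = "PONG" ]
}

if redis_ready; then
    echo "redis is reachable"
else
    echo "redis is not reachable; start it before watchcorgi-server" >&2
fi
```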

Download the client and server programs. Put the client on each GPU server, and put the server on any other machine.

https://github.com/rikonaka/watchcorgi/releases

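Then run the client and the server. The client's --address parameter takes the update URL of the machine running the server, that is, its IP and port; the default port is 7070.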

watchcorgi-client --server gpu --address http://YOUR_SERVER_IP:7070/update --interval 9

On the server side, --address and --port set the listen address and listen port:

watchcorgi-server --address 0.0.0.0 --port 7070

That said, the better option is to let systemd manage both programs.

Installation with systemd

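The repository already provides service files; download them and adjust their contents to your setup. One version of each file is reproduced here.

This is the client's file. Remember to change the executable path to wherever your binary is located; to save effort, you can also just put the binary in /usr/bin.

Edit the --address parameter to the IP and port of the machine running the server. --interval is how often a monitoring packet is sent, with a minimum of 1. --server is the server type; the default here is gpu for a GPU server, and if you also have CPU-only servers, use cpu for those.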

[Unit]
Description=Watchcorgi Client Service
After=network.target

[Service]
Type=simple
User=root
Restart=on-failure
RestartSec=5s
ExecStart=/usr/bin/watchcorgi-client --server gpu --address http://192.168.1.206:7070/update --interval 9
LimitNOFILE=1048576

[Install]
WantedBy=multi-user.target
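
If you have many GPU servers, editing the file by hand on each one gets tedious. Below is a sketch of a small generator for the client unit; `render_client_unit` and its argument order are my own invention, not something shipped with watchcorgi. It emits only the directives needed to start the service and rejects intervals below the minimum of 1 mentioned above.

```shell
# render_client_unit SERVER_IP PORT INTERVAL [BIN]
# Prints a client unit file on stdout. Illustrative helper, not part of watchcorgi.
render_client_unit() {
    ip=$1; port=$2; interval=$3; bin=${4:-/usr/bin/watchcorgi-client}
    # Reject empty or non-numeric intervals, then enforce the minimum of 1.
    case $interval in
        ''|*[!0-9]*) echo "interval must be a positive integer" >&2; return 1 ;;
    esac
    [ "$interval" -ge 1 ] || { echo "interval must be at least 1" >&2; return 1; }
    cat <<EOF
[Unit]
Description=Watchcorgi Client Service
After=network.target

[Service]
Type=simple
User=root
Restart=on-failure
RestartSec=5s
ExecStart=$bin --server gpu --address http://$ip:$port/update --interval $interval
LimitNOFILE=1048576

[Install]
WantedBy=multi-user.target
EOF
}

# Example:
# render_client_unit 192.168.1.206 7070 9 > watchcorgi-client.service
```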

Next is the server's service file. Here --address and --port are the address and port the backend listens on; change them as needed.

[Unit]
Description=Watchcorgi Server Service
After=network.target

[Service]
Type=simple
User=root
Restart=on-failure
RestartSec=5s
ExecStart=/usr/bin/watchcorgi-server --address 0.0.0.0 --port 7070
LimitNOFILE=1048576

[Install]
WantedBy=multi-user.target

Then run:

# On each GPU server (client)
cp watchcorgi-client.service /etc/systemd/system
systemctl enable watchcorgi-client.service
systemctl start watchcorgi-client.service
# On the machine running the server
cp watchcorgi-server.service /etc/systemd/system
systemctl enable watchcorgi-server.service
systemctl start watchcorgi-server.service
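
After starting the services it is worth confirming that they actually stayed up. Here is a minimal sketch using `systemctl is-active`; the `check_units` helper is a hypothetical name of mine, not something from the repository.

```shell
# check_units UNIT...
# Prints the state of each unit and returns non-zero if any is not active.
check_units() {
    rc=0
    for unit in "$@"; do
        if systemctl is-active --quiet "$unit"; then
            echo "$unit: active"
        else
            echo "$unit: NOT active (see: journalctl -u $unit)"
            rc=1
        fi
    done
    return $rc
}

# On a GPU server:    check_units watchcorgi-client.service
# On the server host: check_units watchcorgi-server.service
```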

Frontend

There isn't one... I can't write pretty web pages. If anyone has the skills, feel free to build one. Command line forever!

A frontend can request

http://YOUR_SERVER_IP:7070/info2

to get the same data as JSON.
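
For anyone who wants to experiment with a frontend, the JSON can be pulled with curl. This is just a sketch: `fetch_info2` is a made-up helper name, the 5 second timeout is an arbitrary choice, and since the JSON layout is not described here the response is passed through untouched.

```shell
# fetch_info2 HOST [PORT]
# Prints the raw JSON returned by /info2, or fails if the request fails.
fetch_info2() {
    host=$1; port=${2:-7070}
    # -f: fail on HTTP errors, -sS: quiet but still show errors,
    # --max-time 5: give up after 5 seconds
    curl -fsS --max-time 5 "http://$host:$port/info2"
}

# Example: save a snapshot that a web page could read
# fetch_info2 YOUR_SERVER_IP > info2.json
```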
