With a single GPU, the output of one nvidia-smi call looks like:
Tue Aug 9 23:05:08 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44 Driver Version: 440.44 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp Off | 00000000:01:00.0 Off | N/A |
| 37% 58C P2 75W / 250W | 8481MiB / 12195MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2292 C python 8467MiB |
+-----------------------------------------------------------------------------+
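Here 8481MiB / 12195MiB shows the used and total GPU memory. The free memory can be computed with a shell pipeline. One catch: when the fan value is three digits wide, as in
|100% 88C P2 95W / 280W | 13900MiB / 24576MiB | 100% Default |
the leading | is fused with the following 100% into a single whitespace-separated field, which throws the field count off by one. So the pipeline first deletes the leading | with sed "s/^|//"; the field numbers then drop by 1, to $8 and $10.
nvidia-smi | \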
grep -E "[0-9]+MiB\s*/\s*[0-9]+MiB" | \
sed "s/^|//" | \
awk '{print ($8" "$10)}' | \
sed "s/\([0-9]\{1,\}\)MiB \([0-9]\{1,\}\)MiB/\1 \2/" | \
awk '{print $2 - $1}'
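Where:
| is a pipe, passing the output of one command to the next; \ is a line continuation.
grep -E "[0-9]+MiB\s*/\s*[0-9]+MiB" selects the line that contains the memory usage, giving:
| 37% 58C P2 75W / 250W | 8481MiB / 12195MiB | 2% Default |
sed "s/^|//" deletes the leading |, giving:
37% 58C P2 75W / 250W | 8481MiB / 12195MiB | 2% Default |
awk '{print ($8" "$10)}' picks out the two key fields, used and total memory (fields 8 and 10 of that line; they were $9 and $11 before the leading | was removed), giving:
8481MiB 12195MiB
sed "s/\([0-9]\{1,\}\)MiB \([0-9]\{1,\}\)MiB/\1 \2/" strips the MiB suffix (the escaped parentheses delimit capture groups; \1 and \2 refer to the captured strings), giving:
8481 12195
awk '{print $2 - $1}' subtracts the two columns to get the free memory, giving:
3714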
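To check the field numbering without a GPU at hand, the same stages can be run over captured rows. The sketch below is a minimal illustration, not the author's script: free_mib is a hypothetical helper name, the two sample rows are copied from the text above, and \s is spelled as the POSIX class [[:space:]] on the assumption that this is more portable across grep implementations.

```shell
#!/bin/bash
# free_mib (hypothetical name): reads nvidia-smi text on stdin and prints
# total - used (MiB) for every row that contains "<n>MiB / <n>MiB".
free_mib() {
  grep -E "[0-9]+MiB[[:space:]]*/[[:space:]]*[0-9]+MiB" |
    sed "s/^|//" |
    awk '{print ($8" "$10)}' |
    sed "s/\([0-9]\{1,\}\)MiB \([0-9]\{1,\}\)MiB/\1 \2/" |
    awk '{print $2 - $1}'
}

# Sample rows from the text: a two-digit and a three-digit fan value.
sample='| 37% 58C P2 75W / 250W | 8481MiB / 12195MiB | 2% Default |
|100% 88C P2 95W / 280W | 13900MiB / 24576MiB | 100% Default |'

printf '%s\n' "$sample" | free_mib
# prints:
# 3714
# 10676
```

Both rows resolve to fields 8 and 10 only because the leading | is removed first; without that sed stage the first row would need $9 and $11.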
With multiple GPUs, the output looks like:
Sun Aug 14 20:44:33 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05 Driver Version: 455.23.05 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3090 On | 00000000:4F:00.0 Off | N/A |
| 30% 28C P8 29W / 350W | 2381MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 3090 On | 00000000:52:00.0 Off | N/A |
| 30% 29C P8 33W / 350W | 0MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 3090 On | 00000000:56:00.0 Off | N/A |
| 35% 57C P2 257W / 350W | 12437MiB / 24268MiB | 48% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 3090 On | 00000000:57:00.0 Off | N/A |
| 30% 32C P8 28W / 350W | 0MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 GeForce RTX 3090 On | 00000000:CE:00.0 Off | N/A |
| 35% 59C P2 274W / 350W | 12353MiB / 24268MiB | 67% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 GeForce RTX 3090 On | 00000000:D1:00.0 Off | N/A |
| 36% 58C P2 263W / 350W | 12439MiB / 24268MiB | 75% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 GeForce RTX 3090 On | 00000000:D5:00.0 Off | N/A |
| 36% 61C P2 266W / 350W | 12355MiB / 24268MiB | 59% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 GeForce RTX 3090 On | 00000000:D6:00.0 Off | N/A |
| 36% 61C P2 275W / 350W | 12437MiB / 24268MiB | 50% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3948645 C python 2379MiB |
| 2 N/A N/A 1097294 C python 12435MiB |
| 4 N/A N/A 1200625 C python 12351MiB |
| 5 N/A N/A 1193557 C python 12437MiB |
| 6 N/A N/A 1187226 C python 12353MiB |
| 7 N/A N/A 1185608 C python 12435MiB |
+-----------------------------------------------------------------------------+
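The single-GPU command works unchanged here and prints the free memory of every GPU. A loop can be added to print the GPU index next to each value, which makes it easier to pick specific GPUs: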
res=$(nvidia-smi | \
grep -E "[0-9]+MiB\s*/\s*[0-9]+MiB" | \
sed "s/^|//" | \
awk '{print ($8" "$10)}' | \
sed "s/\([0-9]\{1,\}\)MiB \([0-9]\{1,\}\)MiB/\1 \2/" | \
awk '{print $2 - $1}')
i=0
for s in $res; do
  echo "$i: $s"
  i=$((i + 1))
done
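As an alternative to scraping the table, nvidia-smi can emit CSV through its --query-gpu and --format=csv,noheader,nounits options, and the index field then comes with each row. The sketch below only parses text of that shape: list_free is a hypothetical helper name, and the sample rows are hand-written from the table above rather than captured from a real run.

```shell
#!/bin/bash
# list_free (hypothetical name): reads "index, used, total" CSV rows on stdin
# and prints "index: free" for each GPU.
list_free() {
  awk -F', *' '{print $1 ": " ($3 - $2)}'
}

# Hand-written stand-in for the CSV output (values taken from GPUs 0-2 above).
sample_csv='0, 2381, 24268
1, 0, 24268
2, 12437, 24268'

printf '%s\n' "$sample_csv" | list_free
# prints:
# 0: 21887
# 1: 24268
# 2: 11831

# On a machine with a GPU the input would come from:
# nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits | list_free
```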
See [1] for selecting GPUs on the command line. GPU IDs start from 0; running on GPUs 2 and 3 looks like:
CUDA_VISIBLE_DEVICES=2,3 \
python main.py
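The VAR=value command form sets the variable only in the environment of that one command. A minimal sketch of that scoping, with sh -c standing in for python main.py (an assumption made so it runs without a GPU or the script):

```shell
#!/bin/bash
# Start from a clean state so the effect of the prefix assignment is visible.
unset CUDA_VISIBLE_DEVICES

# The assignment applies to the child process only.
seen=$(CUDA_VISIBLE_DEVICES=2,3 sh -c 'echo "$CUDA_VISIBLE_DEVICES"')

echo "child saw: $seen"                                  # child saw: 2,3
echo "parent has: ${CUDA_VISIBLE_DEVICES:-<unset>}"      # parent has: <unset>
```

Inside the launched process the visible devices are renumbered from 0, so physical GPUs 2 and 3 appear there as devices 0 and 1.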
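Sort the GPUs by free memory in descending order, then collect the IDs of those whose free memory reaches a given threshold, to be passed as the value of CUDA_VISIBLE_DEVICES.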
# global variable
gpu_id=-1
find_gpu()
{
res=$(nvidia-smi | \
grep -E "[0-9]+MiB\s*/\s*[0-9]+MiB" | \
sed "s/^|//" | \
awk '{print ($8" "$10)}' | \
sed "s/\([0-9]\{1,\}\)MiB \([0-9]\{1,\}\)MiB/\1 \2/" | \
awk '{print $2 - $1}')
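# sort by free memory, descending
# each line has the form: <gpu id> <free memory>
# wrapping the result in `()` turns it into an array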
i=0
res=($(for s in $res; do echo $i $s && i=`expr 1 + $i`; done | \
sort -n -k 2 -r))
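# first argument: number of GPUs needed, default = 1
n_gpu_req=${1-"1"}
# second argument: lower bound on free memory in MiB; a GPU is picked only if it has at least this much, default = 0
mem_lb=${2-"0"}
echo "Requiring ${n_gpu_req} GPUs with at least ${mem_lb}MiB free memory"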
gpu_id=-1
n=0
for i in $(seq 0 2 `expr ${#res[@]} - 1`); do
gid=${res[i]}
mem=${res[i+1]}
echo $gid: $mem
if [ $n -lt ${n_gpu_req} -a $mem -ge ${mem_lb} ]; then
if [ $n -eq 0 ]; then
gpu_id=$gid
else
gpu_id=${gpu_id}","$gid
fi
n=`expr 1 + $n`
else
# either enough GPUs were found, or every remaining GPU has too little free memory (the list is sorted descending)
break
fi
done
# if [ $n -lt ${n_gpu_req} ]; then
}
# usage
find_gpu 3
CUDA_VISIBLE_DEVICES=${gpu_id} \
python main.py
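The selection step can be tried on its own, without nvidia-smi, by feeding it "id free" pairs. The following is a compact sketch of the same idea using sort and awk instead of the array loop; pick_gpus is a hypothetical helper, and the pairs are the free-memory values worked out from the 8-GPU table above.

```shell
#!/bin/bash
# pick_gpus N MIN_FREE (hypothetical helper): reads "id free_mib" pairs on
# stdin and prints up to N ids with free_mib >= MIN_FREE, most free first,
# comma-separated; prints -1 when no GPU qualifies.
pick_gpus() {
  sort -k2,2nr -k1,1n |
    awk -v n="$1" -v lb="$2" '
      $2 >= lb && found < n { ids = ids sep $1; sep = ","; found++ }
      END { print (found ? ids : "-1") }'
}

# Free memory (total - used) per GPU, from the multi-GPU output above.
pairs='0 21887
1 24268
2 11831
3 24268
4 11915
5 11829
6 11913
7 11831'

printf '%s\n' "$pairs" | pick_gpus 3 0        # 1,3,0
printf '%s\n' "$pairs" | pick_gpus 3 22000    # 1,3
printf '%s\n' "$pairs" | pick_gpus 1 30000    # -1
```

Ties in free memory are broken by the lower GPU id here, a choice made for this sketch so the output is deterministic.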
Defining find_gpu in the same file bloats the command script. It can instead be written as a separate shell script and invoked from there. For sharing variables across scripts, see [5-7].
#!/bin/bash
# find-gpu.sh
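# first argument: number of GPUs needed, default = 1
n_gpu_req=${1-"1"}
# second argument: lower bound on free memory in MiB; a GPU is picked only if it has at least this much, default = 0
mem_lb=${2-"0"}
# third argument: matching mode, default = b
# b = best fit: prefer GPUs with the least free memory that is still enough
# w = worst fit: prefer GPUs with the most free memory
mode=${3-"b"}
# remaining arguments: GPU IDs to ignore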
_ignore=${@:4}
res=$(nvidia-smi | \
grep -E "[0-9]+MiB\s*/\s*[0-9]+MiB" | \
sed "s/^|//" | \
awk '{print ($8" "$10)}' | \
sed "s/\([0-9]\{1,\}\)MiB \([0-9]\{1,\}\)MiB/\1 \2/" | \
awk '{print $2 - $1}')
i=0
if [ $mode == "b" ]; then
res=($(for s in $res; do echo $i $s && i=`expr 1 + $i`; done | \
sort -n -k 2))
else
res=($(for s in $res; do echo $i $s && i=`expr 1 + $i`; done | \
sort -n -k 2 -r))
fi
gpu_id=-1
n_gpu_found=0
for i in $(seq 0 2 `expr ${#res[@]} - 1`); do
gid=${res[i]}
mem=${res[i+1]}
_flag=0 # whether to skip this GPU
for _ig in ${_ignore[@]}; do
if [ $_ig -eq $gid ]; then
_flag=1
break
fi
done
if [ $_flag -eq 1 ]; then continue; fi
# echo $gid: $mem
if [ ${n_gpu_found} -lt ${n_gpu_req} -a $mem -ge ${mem_lb} ]; then
if [ ${n_gpu_found} -eq 0 ]; then
gpu_id=$gid
else
gpu_id=${gpu_id}","$gid
fi
n_gpu_found=`expr 1 + ${n_gpu_found}`
# else
# break
fi
done
# echo found: ${n_gpu_found}: ${gpu_id}
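The two modes differ only in the direction of the sort, and the ignore list simply drops IDs from the candidates. A small sketch of that ordering, assuming "id free" pairs on stdin; fit_order is a hypothetical helper and the sample values are a subset of the 8-GPU table, chosen without ties:

```shell
#!/bin/bash
# fit_order MODE [IGNORED_ID...] (hypothetical helper): reads "id free_mib"
# pairs on stdin and prints the ids in the order they would be considered.
fit_order() {
  mode=$1; shift
  # best fit sorts ascending by free memory, anything else descending
  if [ "$mode" = "b" ]; then order=""; else order="-r"; fi
  sort -n -k 2 $order |
    awk -v ig=" $* " '
      index(ig, " " $1 " ") == 0 { printf "%s%s", sep, $1; sep = " " }
      END { print "" }'
}

pairs='0 21887
1 24268
2 11831
4 11915
5 11829'

printf '%s\n' "$pairs" | fit_order b        # 5 2 4 0 1
printf '%s\n' "$pairs" | fit_order w        # 1 0 4 2 5
printf '%s\n' "$pairs" | fit_order b 2 4    # 5 0 1
```

Best fit leaves the emptiest GPUs for jobs that need more memory, which is presumably why it is the default in find-gpu.sh.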
# example.sh
echo before: $gpu_id # empty
# note the invocation form:
# **not** `bash find-gpu.sh ...`,
# but a single dot `.`, so that gpu_id is set in the current shell
. ./find-gpu.sh 5
echo after: $gpu_id # e.g. 7,6,5,4,3
CUDA_VISIBLE_DEVICES=${gpu_id} \
python main.py
Usage of the . (dot) command, as printed by . --help:
.: . filename [arguments]
Execute commands from a file in the current shell.
Read and execute commands from FILENAME in the current shell. The
entries in $PATH are used to find the directory containing FILENAME.
If any ARGUMENTS are supplied, they become the positional parameters
when FILENAME is executed.
Exit Status:
Returns the status of the last command executed in FILENAME; fails if
FILENAME cannot be read.
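The difference between the two invocation forms can be seen with a throwaway script. This is a minimal sketch: the temporary file stands in for find-gpu.sh, sh is used for the child-process case, and no arguments are passed so that it also runs in shells whose . does not forward them.

```shell
#!/bin/bash
# A stand-in script that does nothing but set gpu_id.
tmp=$(mktemp)
echo 'gpu_id=5' > "$tmp"

gpu_id=""
sh "$tmp"                       # child process: its gpu_id vanishes with it
after_child=$gpu_id
echo "after sh:  '$after_child'"    # after sh:  ''

. "$tmp"                        # current shell: gpu_id is set right here
after_dot=$gpu_id
echo "after dot: '$after_dot'"      # after dot: '5'

rm -f "$tmp"
```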
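When several tasks can run in parallel, the shell script can poll for a GPU with enough free memory: if one is available it launches a process in the background, otherwise it waits: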
#!/bin/bash
set -e
for d in `ls data`; do
# find / wait for a GPU
while true; do
. scripts/find-gpu.sh 1 7051
# if no GPU is available, wait 15 min and retry
if [ $gpu_id == "-1" ]; then sleep 15m; else break; fi
done
echo $d
LD_LIBRARY_PATH=$HOME/miniconda3/envs/pt110/lib:$LD_LIBRARY_PATH \
CUDA_VISIBLE_DEVICES=$gpu_id \
python do_not_answer.py --input $d &
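# wait 20 s for it to start, so the free-memory and process counts below are accurate
sleep 20
# count the running processes; more than 4 hogs too many GPUs and draws complaints, so wait for the current batch to finish
# ([d] keeps grep from counting its own command line)
n=$(ps aux | grep '[d]o_not_answer.py' | wc -l)
if [ $n -gt 4 ]; then wait; fi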
done
wait
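The batch-throttling part of the loop can also be expressed with a plain counter instead of grepping ps, which avoids matching unrelated processes. The sketch below swaps in that counter technique and is not the script above: max_jobs, the task names, and the sleep payload are placeholders, and fractional sleep is assumed to be supported.

```shell
#!/bin/bash
# Launch tasks in the background, at most max_jobs per batch; after each full
# batch, wait for all of its jobs before starting more.
log=$(mktemp)
max_jobs=2
n=0
for task in a b c d e; do
  ( sleep 0.2; echo "done $task" >> "$log" ) &   # placeholder workload
  n=$((n + 1))
  if [ "$n" -ge "$max_jobs" ]; then
    wait      # block until the whole batch has finished
    n=0
  fi
done
wait          # collect the last, possibly partial, batch

finished=$(( $(wc -l < "$log") ))
echo "finished: $finished"    # finished: 5
rm -f "$log"
```

Waiting for a whole batch is simple but leaves GPUs idle while the slowest job of the batch finishes; that is the same trade-off the ps-based version makes.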