When running DNN training on aishell, run.sh calls local/nnet3/run_tdnn.sh, which in turn calls steps/nnet3/train_dnn.py. Inside the train function of train_dnn.py, train_lib.common.train_one_iteration runs one training iteration and prints a line like this to the console:
2019-04-05 17:15:44,522 [steps/nnet3/train_dnn.py:352 - train - INFO ] Iter: 0/236 Epoch: 0.00/4.0 (0.0% complete) lr: 0.003000
The call at the bottom of this chain is:
train_lib.common.train_one_iteration(
    dir=args.dir,
    iter=iter,
    srand=args.srand,
    egs_dir=egs_dir,
    num_jobs=current_num_jobs,  # number of concurrent GPU jobs
    num_archives_processed=num_archives_processed,
    num_archives=num_archives,
    learning_rate=lrate,
    dropout_edit_string=common_train_lib.get_dropout_edit_string(
        args.dropout_schedule,
        float(num_archives_processed) / num_archives_to_process,
        iter),
    train_opts=' '.join(args.train_opts),
    minibatch_size_str=args.minibatch_size,
    frames_per_eg=args.frames_per_eg,
    momentum=args.momentum,
    max_param_change=args.max_param_change,
    shrinkage_value=shrinkage_value,
    shuffle_buffer_size=args.shuffle_buffer_size,
    run_opts=run_opts)
Each iteration runs num_jobs jobs in parallel, and that number is controlled by the variable current_num_jobs:
current_num_jobs = int(0.5 + args.num_jobs_initial
                       + (args.num_jobs_final - args.num_jobs_initial)
                       * float(iter) / num_iters)
current_num_jobs is in turn controlled by args.num_jobs_initial and args.num_jobs_final, both of which are defined in local/nnet3/run_tdnn.sh:
num_jobs_initial=2
num_jobs_final=12
If you only have n GPUs, you must change num_jobs_final to n. Otherwise, as soon as the number of concurrent jobs exceeds the number of GPUs, training aborts with: Failed to create CUDA context, no more unused GPUs?
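To see when this bites, here is a minimal sketch of the same schedule (assuming the defaults above, num_jobs_initial=2 and num_jobs_final=12, the 236 iterations from the log line, and a hypothetical 2-GPU machine), computing the iteration at which the job count first exceeds the GPU count:

# Minimal sketch of the schedule above (values assumed: num_jobs_initial=2,
# num_jobs_final=12, num_iters=236 as in the log line, and a 2-GPU machine).
num_jobs_initial, num_jobs_final, num_iters = 2, 12, 236
num_gpus = 2  # hypothetical machine

def current_num_jobs(it):
    # same linear interpolation as in train_dnn.py
    return int(0.5 + num_jobs_initial
               + (num_jobs_final - num_jobs_initial) * float(it) / num_iters)

print(current_num_jobs(0), current_num_jobs(num_iters - 1))   # 2 12
first_bad = next(i for i in range(num_iters) if current_num_jobs(i) > num_gpus)
print("jobs exceed GPUs from iteration", first_bad)           # 12

With these numbers a third concurrent job is requested around iteration 12, which is when the error above would appear on a 2-GPU box unless num_jobs_final is lowered (or --use-gpu wait is used, see below).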
Let me add a note on how --use-gpu behaves differently with the values true and wait.
What happens with --use-gpu wait? With 2 GPUs and 3 concurrent jobs, only one of the GPUs got used here: the first job took it, and the other two jobs waited. The other GPU was already occupied by a different compute process, and because the compute mode is set to Exclusive Process, no other compute process is allowed to use it.
Fri Apr  5 17:31:49 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 950     Off  | 00000000:01:00.0  On |                  N/A |
| 46%   53C    P0    28W / 100W |    355MiB /  1995MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 950     Off  | 00000000:02:00.0 Off |                  N/A |
| 71%   80C    P0    87W / 100W |   1062MiB /  2002MiB |     96%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1646      G   /usr/lib/xorg/Xorg                           194MiB |
|    0      2647      G   /usr/bin/gnome-shell                         109MiB |
|    0      4152      G   /usr/lib/firefox/firefox                       1MiB |
|    0      8865      G   /usr/lib/firefox/firefox                       1MiB |
|    0     21056      C   /usr/lib/libreoffice/program/soffice.bin      32MiB |
|    1     29721      C   nnet3-train                                 1050MiB |
+-----------------------------------------------------------------------------+
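As a side note, if you only need the compute mode and load rather than the full table, a query like the following should work (a sketch, assuming an nvidia-smi recent enough to support --query-gpu; the field names are standard query properties, not taken from the output above):

# Query compute mode and load per GPU via nvidia-smi's CSV interface.
import subprocess

out = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,name,compute_mode,utilization.gpu,memory.used",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True)
print(out.stdout)
# e.g. "1, GeForce GTX 950, Exclusive_Process, 96 %, 1062 MiB"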
In the logs you can see the first job holding a GPU exclusively:
LOG (nnet3-train[5.5]:SelectGpuId():cu-device.cc:208) CUDA setup operating under Compute Exclusive Mode.
LOG (nnet3-train[5.5]:FinalizeActiveGpu():cu-device.cc:268) The active GPU is [1]: GeForce GTX 950 free:1879M, used:122M, total:2002M, free/total:0.938754 version 5.2
The second job is waiting:
WARNING (nnet3-train[5.5]:SelectGpuId():cu-device.cc:191) Will try again indefinitely every 5 seconds to get a GPU.
The third job is waiting:
WARNING (nnet3-train[5.5]:SelectGpuId():cu-device.cc:191) Will try again indefinitely every 5 seconds to get a GPU.
steps/nnet3/train_dnn.py takes a --use-gpu option that controls how GPUs are used. What happens with --use-gpu true? GPUs are handed out according to the num_jobs argument of train_one_iteration: each job takes one GPU exclusively, and if there are not enough GPUs to go around, the job fails with: Failed to create CUDA context, no more unused GPUs?
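To sum up the two modes, here is a purely illustrative sketch (my own simplification, not Kaldi's actual cu-device.cc logic) of what GPU selection amounts to for one job:

# Illustrative simplification of the behaviour described above; not Kaldi code.
# With --use-gpu true a job gives up as soon as no unused GPU is left, while
# with --use-gpu wait it retries every 5 seconds (cf. the log lines above).
import time

def select_gpu(free_gpus, use_gpu="true", retry_secs=5):
    """free_gpus: list of GPU ids not yet claimed by any compute process."""
    while True:
        if free_gpus:
            gpu = free_gpus.pop(0)  # claim one GPU exclusively for this job
            print("The active GPU is [%d]" % gpu)
            return gpu
        if use_gpu == "wait":
            print("Will try again every %d seconds to get a GPU." % retry_secs)
            time.sleep(retry_secs)
        else:  # use_gpu == "true"
            raise RuntimeError("Failed to create CUDA context, no more unused GPUs?")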