MindSpore-GPU-1.1.0
Runtime environment:
Windows 10 Home Chinese Edition, version 2004, build 20279.1
WSL2
Ubuntu 18.04
cudatoolkit 1.
cudnn
conda
【Steps & Problem Description】
python train.py --device_target="GPU"
【Log Information】(optional; paste the log content or attach it as a file)
============== Starting Training ==============
libnuma: Warning: Cannot read node cpumask from sysfs
numa_sched_setaffinity_v2_int() failed; abort
: Invalid argument
set_mempolicy: Function not implemented
numa_sched_setaffinity_v2_int() failed; abort
: Invalid argument
set_mempolicy: Function not implemented
[ERROR] MD(4311,python):2021-01-01-21:18:15.974.858 [mindspore/ccsrc/minddata/dataset/util/arena.cc:242] Init] cudaHostAlloc failed, ret[2], out of memory
[ERROR] KERNEL(4311,python):2021-01-01-21:23:15.982.517 [mindspore/ccsrc/backend/kernel_compiler/gpu/data/dataset_iterator_kernel.cc:114] ReadDevice] Get data timeout
[ERROR] DEVICE(4311,python):2021-01-01-21:23:15.982.628 [mindspore/ccsrc/runtime/device/gpu/gpu_kernel_runtime.cc:652] LaunchKernelDynamic] Op Error: Launch kernel failed. | Error Number: 0
Traceback (most recent call last):
File "train.py", line 71, in
dataset_sink_mode=args.dataset_sink_mode)
File "/home/bamboo/miniconda3/envs/ms/lib/python3.7/site-packages/mindspore/train/model.py", line 592, in train
sink_size=sink_size)
File "/home/bamboo/miniconda3/envs/ms/lib/python3.7/site-packages/mindspore/train/model.py", line 391, in _train
self._train_dataset_sink_process(epoch, train_dataset, list_callback, cb_params, sink_size)
File "/home/bamboo/miniconda3/envs/ms/lib/python3.7/site-packages/mindspore/train/model.py", line 452, in _train_dataset_sink_process
outputs = self._train_network(*inputs)
File "/home/bamboo/miniconda3/envs/ms/lib/python3.7/site-packages/mindspore/nn/cell.py", line 331, in __call__
out = self.compile_and_run(*inputs)
File "/home/bamboo/miniconda3/envs/ms/lib/python3.7/site-packages/mindspore/nn/cell.py", line 602, in compile_and_run
return _executor(self, *new_inputs, phase=self.phase)
File "/home/bamboo/miniconda3/envs/ms/lib/python3.7/site-packages/mindspore/common/api.py", line 582, in __call__
return self.run(obj, *args, phase=phase)
File "/home/bamboo/miniconda3/envs/ms/lib/python3.7/site-packages/mindspore/common/api.py", line 610, in run
return self._exec_pip(obj, *args, phase=phase_real)
File "/home/bamboo/miniconda3/envs/ms/lib/python3.7/site-packages/mindspore/common/api.py", line 75, in wrapper
results = fn(*arg, **kwargs)
File "/home/bamboo/miniconda3/envs/ms/lib/python3.7/site-packages/mindspore/common/api.py", line 593, in _exec_pip
return self._executor(args_list, phase)
RuntimeError: mindspore/ccsrc/runtime/device/gpu/gpu_kernel_runtime.cc:652 LaunchKernelDynamic] Op Error: Launch kernel failed. | Error Number: 0
Answer:
The error in the log mentions out of memory, so the machine is running short of memory. Note that cudaHostAlloc allocates pinned (page-locked) host memory, and return code 2 is cudaErrorMemoryAllocation, so it is host RAM (not GPU memory) that could not be allocated:
[ERROR] MD(4311,python):.../arena.cc:242] Init] cudaHostAlloc failed, ret[2], out of memory
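Since this environment is WSL2, keep in mind that WSL2 caps the memory visible to the Linux guest by default, so Windows may still have plenty of RAM while the Ubuntu VM does not. A quick, hedged check from inside WSL2 (the path and the 12GB value below are only illustrative, not from this machine):

# How much memory does the WSL2 VM actually see?
$ free -h
$ grep -E 'MemTotal|MemAvailable' /proc/meminfo

# If the total is too small, the limit can usually be raised from the Windows
# side by creating or editing C:\Users\<your user>\.wslconfig, for example:
#   [wsl2]
#   memory=12GB
# and then restarting WSL:
#   wsl --shutdown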
Please check how much memory the machine has. Your GPU machine is most likely a NUMA architecture, so check with the command numastat -cm; normally you should see output similar to the sample below.
When training starts on the GPU, the dataset module tries to allocate 2 GB of memory during initialization, so use the command above to verify that the machine has that much memory available.
The network itself will also allocate memory, so keep running the command above while training is in progress and watch whether memory fills up (a simple monitoring sketch follows the sample output).
$ numastat -cm
Per-node system memory usage (in MBs):
                Node 0     Node 1      Total
             ---------  ---------  ---------
MemTotal         12524      12915      25439
MemFree           ****       ****       ****
MemUsed           ****       ****        ***
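To watch memory while the network is running, as suggested above, one simple approach is to refresh numastat -cm periodically in a second terminal; the 5-second interval below is arbitrary:

# Refresh the per-node memory view every 5 seconds while train.py runs
$ watch -n 5 numastat -cm

# Alternative without watch: log a timestamp and overall memory usage in a loop
$ while true; do date; free -m | grep -i mem; sleep 5; done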