The first training epoch produces a loss, but during the second epoch a device/cuDNN/CUDA error occurs.

[ERROR] KERNEL(5124,7f9d16bf9240,python):2022-09-17-15:58:13.292.697 [mindspore/ccsrc/plugin/device/gpu/kernel/nn/flatten_gpu_kernel.h:44] Launch] cudaMemcpyAsync error in FlattenFwdGpuKernelMod::Launch, error code is 700 

[ERROR] DEVICE(5124,7f9d16bf9240,python):2022-09-17-15:58:13.292.710 [mindspore/ccsrc/plugin/device/gpu/hal/hardware/gpu_device_context.cc:540] LaunchKernel] Launch kernel failed, kernel full name: Gradients/Default/gradAdd/Reshape-op10082 

[CRITICAL] KERNEL(5124,7f9d16bf9240,python):2022-09-17-15:58:13.293.145 [mindspore/ccsrc/plugin/device/gpu/kernel/nn/conv2d_grad_input_gpu_kernel.h:117] Launch] cuDNN Error: ConvolutionBackwardData failed | Error Number: 8 CUDNN_STATUS_EXECUTION_FAILED 

The function call stack: 

In file /home/luoxuewei/miniconda3/lib/python3.9/site-packages/mindspore/ops/_grad/grad_nn_ops.py(65)/        dx = input_grad(dout, w, x_shape)/ 

 
[CRITICAL] KERNEL(5124,7f9d16bf9240,python):2022-09-17-15:58:13.293.288 [mindspore/ccsrc/plugin/device/gpu/kernel/nn/conv2d_grad_input_gpu_kernel.h:117] Launch] cuDNN Error: ConvolutionBackwardData failed | Error Number: 8 CUDNN_STATUS_EXECUTION_FAILED 

The function call stack: 

In file /home/luoxuewei/miniconda3/lib/python3.9/site-packages/mindspore/ops/_grad/grad_nn_ops.py(65)/        dx = input_grad(dout, w, x_shape)/ 

 
[CRITICAL] KERNEL(5124,7f9d16bf9240,python):2022-09-17-15:58:13.293.592 [mindspore/ccsrc/plugin/device/gpu/kernel/nn/conv2d_grad_filter_gpu_kernel.h:118] Launch] cuDNN Error: ConvolutionBackwardFilter failed | Error Number: 8 CUDNN_STATUS_EXECUTION_FAILED 

The function call stack: 

In file /home/luoxuewei/miniconda3/lib/python3.9/site-packages/mindspore/ops/_grad/grad_nn_ops.py(67)/        dw = filter_grad(dout, x, w_shape)/ 

 
[ERROR] DEVICE(5124,7f9d16bf9240,python):2022-09-17-15:58:13.293.700 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:167] SyncStream] cudaStreamSynchronize failed, ret[700], an illegal memory access was encountered 

Traceback (most recent call last): 

  File "/home/luoxuewei/shelei/PFST-LSTM-source-4567_x2ms/experiment/CIKM/dec_PFST_ConvLSTM_dataloader_Gan_SA_mindspore.py", line 526, in <module>

    model.train() 

  File "/home/luoxuewei/shelei/PFST-LSTM-source-4567_x2ms/experiment/CIKM/dec_PFST_ConvLSTM_dataloader_Gan_SA_mindspore.py", line 249, in train 

    output_G = trainer1(in_frame_dat, group_truth) 

  File "/home/luoxuewei/miniconda3/lib/python3.9/site-packages/mindspore/nn/cell.py", line 601, in __call__ 

    raise err 

  File "/home/luoxuewei/miniconda3/lib/python3.9/site-packages/mindspore/nn/cell.py", line 597, in __call__ 

    output = self._run_construct(cast_inputs, kwargs) 

  File "/home/luoxuewei/miniconda3/lib/python3.9/site-packages/mindspore/nn/cell.py", line 416, in _run_construct 

    output = self.construct(*cast_inputs, **kwargs) 

  File "/home/luoxuewei/miniconda3/lib/python3.9/site-packages/mindspore/nn/wrap/cell_wrapper.py", line 375, in construct 

    grads = self.grad(self.network, self.weights)(*inputs, sens) 

  File "/home/luoxuewei/miniconda3/lib/python3.9/site-packages/mindspore/ops/composite/base.py", line 399, in after_grad 

    return grad_(fn, weights)(*args, **kwargs) 

  File "/home/luoxuewei/miniconda3/lib/python3.9/site-packages/mindspore/common/api.py", line 93, in wrapper 

    results = fn(*arg, **kwargs) 

  File "/home/luoxuewei/miniconda3/lib/python3.9/site-packages/mindspore/ops/composite/base.py", line 391, in after_grad 

    out = _pynative_executor(fn, grad_.sens_param, *args, **kwargs) 

  File "/home/luoxuewei/miniconda3/lib/python3.9/site-packages/mindspore/common/api.py", line 951, in __call__ 

    return self._executor(sens_param, obj, args) 

RuntimeError: mindspore/ccsrc/plugin/device/gpu/kernel/nn/conv2d_grad_filter_gpu_kernel.h:118 Launch] cuDNN Error: ConvolutionBackwardFilter failed | Error Number: 8 CUDNN_STATUS_EXECUTION_FAILED 

The function call stack: 

In file /home/luoxuewei/miniconda3/lib/python3.9/site-packages/mindspore/ops/_grad/grad_nn_ops.py(67)/        dw = filter_grad(dout, x, w_shape)/ 

 
[ERROR] DEVICE(5124,7f9d16bf9240,python):2022-09-17-15:58:13.818.362 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:167] SyncStream] cudaStreamSynchronize failed, ret[700], an illegal memory access was encountered 

[ERROR] ME(5124,7f9d16bf9240,python):2022-09-17-15:58:13.818.390 [mindspore/ccsrc/runtime/hardware/device_context_manager.cc:81] WaitTaskFinishOnDevice] SyncStream failed 

[ERROR] DEVICE(5124,7f9d16bf9240,python):2022-09-17-15:58:13.829.692 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:158] DestroyStream] cudaStreamDestroy failed, ret[700], an illegal memory access was encountered 

[ERROR] DEVICE(5124,7f9d16bf9240,python):2022-09-17-15:58:13.829.710 [mindspore/ccsrc/plugin/device/gpu/hal/device/gpu_device_manager.cc:61] ReleaseDevice] Op Error: Failed to destroy CUDA stream. | Error Number: 0 

[ERROR] DEVICE(5124,7f9d16bf9240,python):2022-09-17-15:58:13.829.724 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:158] DestroyStream] cudaStreamDestroy failed, ret[700], an illegal memory access was encountered 

[ERROR] DEVICE(5124,7f9d16bf9240,python):2022-09-17-15:58:13.829.733 [mindspore/ccsrc/plugin/device/gpu/hal/device/gpu_device_manager.cc:61] ReleaseDevice] Op Error: Failed to destroy CUDA stream. | Error Number: 0 

[ERROR] DEVICE(5124,7f9d16bf9240,python):2022-09-17-15:58:13.830.140 [mindspore/ccsrc/plugin/device/gpu/hal/device/gpu_device_manager.cc:67] ReleaseDevice] cuDNN Error: Failed to destroy cuDNN handle | Error Number: 4 CUDNN_STATUS_INTERNAL_ERROR 

[ERROR] DEVICE(5124,7f9d16bf9240,python):2022-09-17-15:58:13.831.354 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:48] FreeDeviceMem] cudaFree failed, ret[700], an illegal memory access was encountered 

[CRITICAL] PRE_ACT(5124,7f9d16bf9240,python):2022-09-17-15:58:13.831.371 [mindspore/ccsrc/common/mem_reuse/mem_dynamic_allocator.cc:428] operator()] Free device memory[0x7f992e000000] error. 

Error in atexit._run_exitfuncs: 

RuntimeError: mindspore/ccsrc/common/mem_reuse/mem_dynamic_allocator.cc:428 operator()] Free device memory[0x7f992e000000] error. 

****************************************************Answer*****************************************************

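It looks like an operator ran into a memory-related problem (CUDA error 700, "an illegal memory access was encountered"). Could you follow the post below to help pinpoint exactly which operator is responsible?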

https://bbs.huaweicloud.com/forum/thread-169762-1-1.html

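Alternatively, if you are not sure how to do that, set the following two environment variables, rerun the training, and send us the resulting log: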

export CUDA_LAUNCH_BLOCKING=1
export GLOG_v=1
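If it is more convenient to drive this from Python than from the shell, the two variables can be set for a child process and the output captured to a file. The following is only a minimal sketch: the helper names `build_debug_env` and `run_with_debug_log` and the default log file name `debug.log` are made up for illustration and are not part of MindSpore. `CUDA_LAUNCH_BLOCKING=1` makes CUDA kernel launches synchronous, so the error is more likely to be reported at the kernel that actually caused it, and `GLOG_v=1` sets the MindSpore log level to INFO.

```python
import os
import subprocess
import sys


def build_debug_env(base_env=None):
    """Return a copy of the environment with the two debugging variables set.

    Hypothetical helper for illustration; base_env defaults to os.environ.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"  # make CUDA kernel launches synchronous
    env["GLOG_v"] = "1"                # MindSpore log level INFO
    return env


def run_with_debug_log(script_path, log_path="debug.log"):
    """Run the training script with the debug environment.

    stdout and stderr are both written to log_path so the full log
    can be attached to the forum thread. Returns the exit code.
    """
    with open(log_path, "w") as log_file:
        result = subprocess.run(
            [sys.executable, script_path],
            env=build_debug_env(),
            stdout=log_file,
            stderr=subprocess.STDOUT,
        )
    return result.returncode
```

For example, `run_with_debug_log("dec_PFST_ConvLSTM_dataloader_Gan_SA_mindspore.py")` would rerun the training script from the traceback and leave the log in `debug.log`. Running in a child process ensures the variables are in place before MindSpore and CUDA are initialized.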
