TensorFlow 2.2 error: Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

Project context:

System environment:

OS: Ubuntu 20.04
CUDA: 10.1
TensorFlow: 2.2
cuDNN: 7.6.5
TensorRT: 6.0.15 (TF 2.1 supports TensorRT 6.0)
GPU: RTX 2080 (8 GB) x 2


Problem description:

When running a newer TensorFlow release (2.1 supports CUDA 10.1; 2.0 supports CUDA 10.0), the following error appeared (code to reproduce the error: https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py):

2020-03-15 22:00:40.933209: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-03-15 22:00:40.952977: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "/home/smy/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/smy/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/smy/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[{{node conv2d_0_1/convolution}}]]
	 [[{{node loc_branch_concat_1/concat}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tenforflow_infer.py", line 143, in <module>
    inference(img, show_result=True, target_shape=(260, 260))
  File "tenforflow_infer.py", line 53, in inference
    y_bboxes_output, y_cls_output = tf_inference(sess, graph, image_exp)
  File "/home/smy/FaceMaskDetection/load_model/tensorflow_loader.py", line 38, in tf_inference
    feed_dict={image_tensor: img_arr})
  File "/home/smy/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/smy/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/smy/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/smy/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
……


Root cause analysis:

According to the discussion in the issues of the TensorFlow repository, the problem lies in how GPU memory is allocated on RTX 2070/2080 cards.


Solution:

Following the method suggested in the issues, add the following code at the start of the program:

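import tensorflow as tf

# gpus = tf.config.experimental.list_physical_devices('GPU')
gpus = tf.config.list_physical_devices('GPU')  # no longer experimental as of TF 2.1
print(gpus)  # lists only GPU 1 if the process was restricted to it beforehand (indices start at 0; this machine has two RTX 2080 cards)
tf.config.experimental.set_memory_growth(gpus[0], True)  # in that case gpus contains a single element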

In my own environment, however, this produced a different error:

ValueError: Memory growth cannot differ between GPU devices
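As far as I know, TensorFlow raises this error when memory growth is enabled on only some of the visible GPUs, which is what happens on a two-card machine when only gpus[0] is configured. A minimal sketch of one way to keep both cards, which is not the fix adopted in this post, is to apply the same setting to every GPU. The helper name enable_memory_growth_on_all_gpus is hypothetical, and its argument is assumed to be the tf.config module:

```python
def enable_memory_growth_on_all_gpus(tf_config):
    """Enable memory growth on every visible GPU (hypothetical helper).

    tf_config is assumed to be the tf.config module.
    """
    gpus = tf_config.list_physical_devices('GPU')
    for gpu in gpus:
        # Apply the same value to each device so the settings cannot differ.
        tf_config.experimental.set_memory_growth(gpu, True)
    return gpus
```

It would be called as enable_memory_growth_on_all_gpus(tf.config) right after import tensorflow as tf, before any op runs on the GPU.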

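Judging by the message, the two GPUs ended up with conflicting settings (TensorFlow requires the memory-growth setting to be the same on every visible GPU), so I tried using only one GPU: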

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1'  # expose only the second GPU (indices start at 0)

This resolved the error.
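Ordering matters with this approach: CUDA reads CUDA_VISIBLE_DEVICES when it initializes, so the variable should be set before TensorFlow touches the GPU, and the safest place is before the import. The sketch below wraps that in a small helper; restrict_visible_gpus is a hypothetical name, not part of TensorFlow or CUDA.

```python
import os
import sys


def restrict_visible_gpus(indices):
    """Expose only the given GPU indices to CUDA (hypothetical helper)."""
    if "tensorflow" in sys.modules:
        # If TensorFlow has already initialized CUDA, the new value may be ignored.
        print("warning: tensorflow is already imported; "
              "CUDA_VISIBLE_DEVICES may not take effect", file=sys.stderr)
    value = ",".join(str(i) for i in indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


restrict_visible_gpus([1])  # second card only; indices start at 0
# import tensorflow as tf   # import TensorFlow only after this point
```

With a single visible device, gpus[0] in the memory-growth snippet refers to that one card, so the setting can no longer differ between GPUs.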

