TensorFlow and cuDNN configuration error: "Failed to get convolution algorithm" or "cudnn64_7.dll not found"

For various reasons I moved from Linux to Windows, so TensorFlow and CUDA/cuDNN had to be set up again (which versions go together is already well covered online).

While debugging, I ran into this problem:

2021-01-24 17:50:45.163408: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2021-01-24 17:50:45.418034: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'cudnn64_7.dll'; dlerror: cudnn64_7.dll not found
2021-01-24 17:50:45.418182: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2021-01-24 17:50:45.418408: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2021-01-24 17:50:45.418507: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
         [[{{node res_net/conv2d/Conv2D}}]]
Traceback (most recent call last):
  File "e:/Python_Project/xxxx_project/ver3.0/py2xsl.py", line 123, in <module>
    predict_and_xlsx(file_path)
  File "e:/Python_Project/xxxx_project/ver3.0/py2xsl.py", line 113, in predict_and_xlsx
    model_pre_result =  predict(img_path)
  File "e:\Python_Project\xxxx_project\ver3.0\predict.py", line 211, in predict
    set_name_list, all_name_list,loc_list = detect_img(img_path)
  File "e:\Python_Project\xxxx_project\ver3.0\predict.py", line 144, in detect_img
    detect_result = detect_model.predict(img_send_array_)
  File "C:\Users\Administrator\envs\env1\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 909, in predict
    use_multiprocessing=use_multiprocessing)
  File "C:\Users\Administrator\envs\env1\lib\site-packages\tensorflow_core\python\keras\engine\training_arrays.py", line 722, in predict
    callbacks=callbacks)
  File "C:\Users\Administrator\envs\env1\lib\site-packages\tensorflow_core\python\keras\engine\training_arrays.py", line 393, in model_iteration
    batch_outs = f(ins_batch)
  File "C:\Users\Administrator\envs\env1\lib\site-packages\tensorflow_core\python\keras\backend.py", line 3740, in __call__
    outputs = self._graph_fn(*converted_inputs)
  File "C:\Users\Administrator\envs\env1\lib\site-packages\tensorflow_core\python\eager\function.py", line 1081, in __call__
    return self._call_impl(args, kwargs)
  File "C:\Users\Administrator\envs\env1\lib\site-packages\tensorflow_core\python\eager\function.py", line 1121, in _call_impl
    return self._call_flat(args, self.captured_inputs, cancellation_manager)
  File "C:\Users\Administrator\envs\env1\lib\site-packages\tensorflow_core\python\eager\function.py", line 1224, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager)
  File "C:\Users\Administrator\envs\env1\lib\site-packages\tensorflow_core\python\eager\function.py", line 511, in call
    ctx=ctx)
  File "C:\Users\Administrator\envs\env1\lib\site-packages\tensorflow_core\python\eager\execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
         [[node res_net/conv2d/Conv2D (defined at C:\Users\Administrator\envs\env1\lib\site-packages\tensorflow_core\python\framework\ops.py:1751) ]] [Op:__inference_keras_scratch_graph_3404]

Function call stack:
keras_scratch_graph

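Looking only at the last two lines of the error: I had run into this before on Linux, where the cause seemed to be what others online described, namely insufficient GPU memory. Closing the IDE or other programs did fix it there. (This applies only to those last two lines of the error and to that Linux machine; I can't speak for other systems.)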

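Today I hit the same error on Windows, and after a lot of searching I found an assortment of suggested fixes, including:

1. Add code to configure GPU memory

2. Configure environment variables

3. Change the cuDNN version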
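As an illustration of what item 1 typically looks like, here is a minimal sketch that asks TensorFlow to allocate GPU memory on demand instead of reserving it all at start-up. It is an assumption about what those posts meant, based on the TensorFlow 2.x `tf.config.experimental` API and the `TF_FORCE_GPU_ALLOW_GROWTH` environment variable. `enable_memory_growth` is a name made up for this example, and the `ImportError` fallback is only there so the snippet runs where TensorFlow is not installed.

```python
import os

# Assumption: TensorFlow reads this variable at start-up, so it is set
# before the first "import tensorflow".
os.environ.setdefault("TF_FORCE_GPU_ALLOW_GROWTH", "true")

try:
    import tensorflow as tf
except ImportError:  # keeps the sketch runnable without TensorFlow
    tf = None


def enable_memory_growth():
    """Turn on on-demand GPU memory allocation; return the number of GPUs found."""
    if tf is None:
        return 0
    gpus = tf.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        # Must run before any op has initialized the GPU.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)


print("GPUs configured:", enable_memory_growth())
```

This only helps when the failure really is a memory problem. It cannot help when a library such as cudnn64_7.dll is not found at all.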

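The one I found most convincing was the cuDNN version. But I had installed exactly the same versions as in my earlier setup, CUDA 10 + cuDNN 7.6.x, which had worked without any problems. Still, out of doubt, I swapped through quite a few versions.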

None of them helped. In the end I had to track down the cause myself, line by line.

While reading through the error output, I noticed that the log kept looking for DLL files, and one line reported: cudnn64_7.dll not found
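Picking that line out by eye is easy to miss in a long log. A minimal sketch of doing it mechanically follows, assuming the log has been saved as text and that the warning keeps the wording shown in the log above. `missing_libraries` is a hypothetical helper name, not part of TensorFlow.

```python
import re

# Wording of the dso_loader warning as it appears in the log above.
MISSING_LIB = re.compile(r"Could not load dynamic library '([^']+)'")


def missing_libraries(log_text):
    """Return the library names reported as not loadable, in order, without repeats."""
    found = []
    for line in log_text.splitlines():
        match = MISSING_LIB.search(line)
        if match and match.group(1) not in found:
            found.append(match.group(1))
    return found


sample = (
    "I tensorflow/stream_executor/platform/default/dso_loader.cc:44] "
    "Successfully opened dynamic library cublas64_100.dll\n"
    "W tensorflow/stream_executor/platform/default/dso_loader.cc:55] "
    "Could not load dynamic library 'cudnn64_7.dll'; "
    "dlerror: cudnn64_7.dll not found\n"
)
print(missing_libraries(sample))  # ['cudnn64_7.dll']
```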

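I had seen this file before: it sits in the bin folder of the extracted cuDNN archive. That reminded me of the two installation methods described online: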

1. Copy the extracted folder into the CUDA directory, keeping it as a folder named cudnn or cuda: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\cudnn

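2. Copy all files from the extracted folder into the matching folders under CUDA: bin to bin, include to include, lib to lib. In other words, merge the cuDNN files into the CUDA installation.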

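I had been using method 1, which probably also needs environment variables set up (presumably so that Windows can find the DLL in the extra folder), and it still did not work.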

On a hunch I tried method 2 and added the files to the CUDA installation. The target paths are:

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\include

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\lib\x64

Just drop the files from the bin, include, and lib\x64 folders of the extracted cuDNN archive into the corresponding folders above.
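The copy can also be scripted. The following is a minimal sketch under stated assumptions, not an official installer: it assumes the extracted archive has bin, include, and lib\x64 sub-folders as described above. `merge_cudnn_into_cuda` and the download location are hypothetical and need adjusting to your own machine.

```python
import shutil
from pathlib import Path

# Sub-folders that exist in both the cuDNN archive and the CUDA install.
SUBDIRS = [Path("bin"), Path("include"), Path("lib") / "x64"]


def merge_cudnn_into_cuda(cudnn_root, cuda_root):
    """Copy each file in cudnn_root/<sub> into cuda_root/<sub>; return the new paths."""
    copied = []
    for sub in SUBDIRS:
        src_dir = Path(cudnn_root) / sub
        dst_dir = Path(cuda_root) / sub
        if not src_dir.is_dir():
            continue  # archive layout differs from the assumption; skip it
        dst_dir.mkdir(parents=True, exist_ok=True)
        for src in src_dir.iterdir():
            if src.is_file():
                copied.append(Path(shutil.copy2(src, dst_dir)))
    return copied


if __name__ == "__main__":
    # Hypothetical location of the extracted archive; adjust to your download.
    cudnn = Path(r"C:\Users\me\Downloads\cudnn\cuda")
    cuda = Path(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0")
    if cudnn.is_dir():
        for path in merge_cudnn_into_cuda(cudnn, cuda):
            print("copied", path)
```

Writing under C:\Program Files normally requires an elevated (administrator) prompt, so the script has to be started from one.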

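After that the program ran and no longer reported the error. I then deleted the environment variables I had added blindly earlier (only the ones I had added blindly), rebooted, and it still ran. (Don't delete entries indiscriminately, or you may end up removing the paths that NVIDIA configured automatically as well.)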
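A quick way to confirm which PATH entries actually expose the DLL, before and after cleaning up environment variables, is sketched below. It assumes that PATH is how the library gets found here (Windows also searches a few other locations, which this check ignores). `dirs_containing` is a hypothetical helper name.

```python
import os
from pathlib import Path


def dirs_containing(filename, search_path=None):
    """Return the PATH directories that contain filename, in search order."""
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    hits = []
    for entry in search_path.split(os.pathsep):
        if entry and (Path(entry) / filename).is_file():
            hits.append(entry)
    return hits


hits = dirs_containing("cudnn64_7.dll")
print(hits if hits else "cudnn64_7.dll is not in any PATH directory")
```

If the list is empty, TensorFlow will most likely print the same "not found" warning again. If it shows the CUDA bin folder, the manually added entries are probably no longer needed.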

 

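PS: Searching purely for a result turns up all sorts of methods online. Next time I look for a fix, I hope to first analyze where the core of the problem lies. If you chase results alone, then among the mix of good and bad advice, and without understanding it yourself, you end up configuring blindly: the problem may remain unsolved, new problems may appear, and you may not even be able to get back to your original environment. The loss outweighs the gain.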

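Analyze the problem clearly before trying to solve it, and don't apply other people's methods blindly. As long as the reasoning is right and the method is right, the problem will get solved.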

 

 

 
