TensorFlow error: No OpKernel was registered to support Op 'NcclAllReduce'

Overview

While training a model with TensorFlow, I ran into the following error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' used by node AllReduceGrads/NcclAllReduce (defined at /home/zw/anaconda3/envs/tf_models/lib/python3.7/site-packages/tensorpack/graph_builder/utils.py:160) with these attrs: [reduction="sum", shared_name="c0", T=DT_FLOAT, num_devices=2]
Registered devices: [CPU, XLA_CPU, XLA_GPU]
Registered kernels:
  device='GPU'

	 [[AllReduceGrads/NcclAllReduce]]

Errors may have originated from an input operation.
Input Source operations connected to node AllReduceGrads/NcclAllReduce:
 tower0/gradients/AddN_373 (defined at /home/zw/anaconda3/envs/tf_models/lib/python3.7/site-packages/tensorpack/train/tower.py:276)
terminate called without an active exception
terminate called recursively
terminate called recursively
*** Received signal 6 ***
*** BEGIN MANGLED STACK TRACE ***
Aborted (core dumped)

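The error occurred when training in parallel on multiple GPUs. Training on a single GPU did not raise it, and the selected GPU showed about 135 MB of GPU memory in use.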
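
One way to confirm which devices TensorFlow actually registered is to ask it directly. The sketch below is illustrative and not part of the original setup: it assumes tensorflow.python.client.device_lib can be imported (it ships with TensorFlow 1.x), and the helper names registered_device_types and has_real_gpu are hypothetical. If the "Registered devices: [CPU, XLA_CPU, XLA_GPU]" line in the error is accurate, a plain GPU entry should be missing from the result.

```python
def registered_device_types():
    """Return the device types TensorFlow registered, or [] if TensorFlow is missing."""
    try:
        # device_lib is a module inside the TensorFlow Python package.
        from tensorflow.python.client import device_lib
    except ImportError:
        print("TensorFlow is not installed in this environment")
        return []
    return sorted({d.device_type for d in device_lib.list_local_devices()})


def has_real_gpu(device_types):
    """True only if a plain 'GPU' device is present; 'XLA_GPU' does not count."""
    return "GPU" in device_types


if __name__ == "__main__":
    types = registered_device_types()
    print("Registered device types:", types)
    print("Plain GPU device registered:", has_real_gpu(types))
```

The distinction matters here because the error message lists device='GPU' as the only registered kernel for NcclAllReduce, so an XLA_GPU entry alone is not enough.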

Environment

  • OS: Ubuntu 16.04
  • CUDA version: 10.1
  • cuDNN version: 8.0.2
  • tensorflow-gpu: 1.14.0

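Cause and solution

The error turned out to be an environment configuration problem. Scrolling back through the training output, above the error shown earlier, I found the following messages: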

2020-08-14 13:58:07.324004: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324109: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324205: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324311: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324415: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324508: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324599: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324614: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2020-08-14 13:58:07.324666: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
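
The loader messages above can be condensed into a list of missing libraries, and each one can then be probed directly. This is a minimal sketch under stated assumptions: the regular expression relies on the "Could not dlopen library '...'" wording seen in this log, the helper names missing_libraries and can_dlopen are hypothetical, and SAMPLE_LOG is an abbreviated stand-in for the real output. ctypes.CDLL is used as an approximation of what the dynamic loader does, so it honours LD_LIBRARY_PATH as set when Python started.

```python
import ctypes
import re

# Matches the loader lines printed by dso_loader.cc in the log above.
_MISSING = re.compile(r"Could not dlopen library '([^']+)'")

# Abbreviated stand-in for the real log output (illustrative only).
SAMPLE_LOG = (
    "I dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: ...\n"
    "I dso_loader.cc:53] Could not dlopen library 'libcudnn.so.7'; dlerror: ...\n"
)


def missing_libraries(log_text):
    """Return the library names that failed to load, in order, without duplicates."""
    seen = []
    for name in _MISSING.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


def can_dlopen(name):
    """Try to load a shared library by name; False if the loader cannot find it."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    for lib in missing_libraries(SAMPLE_LOG):
        print(lib, "loadable" if can_dlopen(lib) else "NOT loadable")
```

The version suffix in each name (for example .so.10.0) is the version this TensorFlow build asks for, which makes a mismatch with the installed toolkit easy to spot.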

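These messages show that the libcu*.so.10.0 libraries could not be found, so the error is clearly caused by the CUDA version: I had CUDA 10.1 installed, while TensorFlow 1.14 requires CUDA 10.0. The log also reports libcudnn.so.7 as missing, which is consistent with the cuDNN 8.0.2 installation listed in the environment. Because the libraries failed to load, TensorFlow skipped registering GPU devices, and NcclAllReduce, whose only registered kernel is for device 'GPU', had no kernel on the remaining devices (CPU, XLA_CPU, XLA_GPU). The fix is to change either the CUDA version or the TensorFlow version. TensorFlow's official documentation lists the matching TensorFlow and CUDA versions: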

[Figure: table of matching TensorFlow and CUDA versions from the official documentation]

Official documentation: https://www.tensorflow.org/install/source?hl=zh-cn

According to the version table above, tensorflow_gpu-1.14.0 corresponds to CUDA 10.0. In the end I changed the CUDA version, which solved the problem.
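
After switching versions, a quick sanity check is to see which CUDA toolkit directory LD_LIBRARY_PATH points at. The sketch below assumes CUDA lives in a versioned directory such as /usr/local/cuda-10.1, as in the log earlier. It will report nothing for an unversioned path like /usr/local/cuda. The names REQUIRED_CUDA, cuda_versions_on_path and path_matches are hypothetical, and the value "10.0" comes from the version table for tensorflow_gpu-1.14.0.

```python
import os
import re

# CUDA version that tensorflow_gpu-1.14.0 expects, per the version table.
REQUIRED_CUDA = "10.0"


def cuda_versions_on_path(ld_library_path):
    """Extract versions from entries such as /usr/local/cuda-10.1/lib64."""
    return re.findall(r"cuda-(\d+\.\d+)", ld_library_path)


def path_matches(ld_library_path, required=REQUIRED_CUDA):
    """True if at least one versioned CUDA directory on the path matches."""
    return required in cuda_versions_on_path(ld_library_path)


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    print("CUDA versions on LD_LIBRARY_PATH:", cuda_versions_on_path(current))
    print("Matches required CUDA", REQUIRED_CUDA + ":", path_matches(current))
```

With the original setting from the log (":/usr/local/cuda-10.1/lib64"), the check fails, which matches the loader messages.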

 

 
