The following error was raised while training a model with TensorFlow:
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' used by node AllReduceGrads/NcclAllReduce (defined at /home/zw/anaconda3/envs/tf_models/lib/python3.7/site-packages/tensorpack/graph_builder/utils.py:160) with these attrs: [reduction="sum", shared_name="c0", T=DT_FLOAT, num_devices=2]
Registered devices: [CPU, XLA_CPU, XLA_GPU]
Registered kernels:
device='GPU'
[[AllReduceGrads/NcclAllReduce]]
Errors may have originated from an input operation.
Input Source operations connected to node AllReduceGrads/NcclAllReduce:
tower0/gradients/AddN_373 (defined at /home/zw/anaconda3/envs/tf_models/lib/python3.7/site-packages/tensorpack/train/tower.py:276)
terminate called without an active exception
terminate called recursively
terminate called recursively
*** Received signal 6 ***
*** BEGIN MANGLED STACK TRACE ***
Aborted (core dumped)
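This error occurred only when training in parallel on multiple GPUs. Training on a single GPU did not raise it, and the specified GPU showed 135M of GPU memory in use.
The error is in fact caused mainly by an environment configuration problem. When the error above appears, the following messages can be found earlier in the output: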
2020-08-14 13:58:07.324004: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324109: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324205: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324311: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324415: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324508: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324599: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324614: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2020-08-14 13:58:07.324666: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
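To see at a glance which libraries failed to load, log lines like these can be summarized with a short script. This is a minimal sketch, assuming the log text is available as a string; the function name `missing_libraries` and the regular expression are my own illustration, not part of TensorFlow, and the sample reuses two lines from the log above.

```python
import re

# Matches the dso_loader "Could not dlopen library" lines shown above.
# The pattern is an assumption based on this log format only.
DLOPEN_RE = re.compile(
    r"Could not dlopen library '(?P<lib>[^']+)';.*?LD_LIBRARY_PATH: (?P<path>\S*)"
)


def missing_libraries(log_text):
    """Return (sorted missing library names, set of LD_LIBRARY_PATH values seen)."""
    libs, paths = set(), set()
    for line in log_text.splitlines():
        match = DLOPEN_RE.search(line)
        if match:
            libs.add(match.group("lib"))
            paths.add(match.group("path"))
    return sorted(libs), paths


# Two lines copied from the log above, used as sample input.
sample = """\
2020-08-14 13:58:07.324004: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324599: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
"""

libs, paths = missing_libraries(sample)
print("Missing libraries:", libs)
print("LD_LIBRARY_PATH values:", paths)
```

Seeing the requested version suffix next to the directory actually being searched makes a version mismatch easy to spot.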
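Analyzing the log messages above shows that the failure comes from the libcu*.so.10.0 libraries not being found, while LD_LIBRARY_PATH points only at /usr/local/cuda-10.1/lib64. Because no GPU libraries could be loaded, TensorFlow skipped registering GPU devices, which is why NcclAllReduce, an op with only a GPU kernel, had no device to run on. So the error is almost certainly caused by the CUDA version: I had CUDA 10.1 installed, whereas TensorFlow 1.14 requires CUDA 10.0. In this situation you either change the CUDA version or change the TensorFlow version. The official TensorFlow documentation lists which CUDA version corresponds to each TensorFlow release.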
Official documentation: https://www.tensorflow.org/install/source?hl=zh-cn
The version table there shows that tensorflow_gpu-1.14.0 corresponds to CUDA 10.0. In the end I changed the CUDA version, which resolved the problem.
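One way to confirm the fix before starting a long training run is to ask the dynamic loader for the same libraries TensorFlow tried to open. The following is a minimal sketch using Python's standard `ctypes` module. The library names are copied from the log above; the helper names `can_load` and `check_libs` are hypothetical. It assumes it is run in the same shell environment (same LD_LIBRARY_PATH) as the training job.

```python
import ctypes

# Library names copied from the "Could not dlopen library" messages above.
REQUIRED_LIBS = [
    "libcudart.so.10.0",
    "libcublas.so.10.0",
    "libcufft.so.10.0",
    "libcurand.so.10.0",
    "libcusolver.so.10.0",
    "libcusparse.so.10.0",
    "libcudnn.so.7",
]


def can_load(name):
    """Return True if the dynamic loader can open the shared library `name`."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        # Raised when the library cannot be found or opened.
        return False


def check_libs(names=REQUIRED_LIBS):
    """Map each library name to whether it could be loaded."""
    return {name: can_load(name) for name in names}


if __name__ == "__main__":
    for name, ok in check_libs().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

If any entry is reported as MISSING, TensorFlow 1.14 would most likely skip GPU registration again, and the multi-GPU run would fail in the same way.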