(Solved) Check failed: status == CUDNN_STATUS_SUCCESS (4 vs. 0) CUDNN_STATUS_INTERNAL_ERROR

While reproducing a paper recently, the final step, roslaunch tracking_slam tb3_test.launch, kept failing with the following error:

Check failed: status == CUDNN_STATUS_SUCCESS (4 vs. 0) CUDNN_STATUS_INTERNAL_ERROR

CUDA version: [screenshot not preserved]
cuDNN version: [screenshot not preserved]
GPU driver: [screenshot not preserved]
OpenCV version: [screenshot not preserved]
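
The versions can also be read from the command line rather than from screenshots. A minimal sketch for the cuDNN part, which has no version command of its own: cudnn_version is a hypothetical helper name, and the default header path /usr/local/cuda/include/cudnn.h is an assumption about where cuDNN was copied.

```shell
#!/usr/bin/env bash
# cudnn_version is a hypothetical helper: it prints the cuDNN version recorded
# in a cudnn.h header as MAJOR.MINOR.PATCH.
# The default path below is an assumption (a common CUDA install location).
cudnn_version() {
    local header="${1:-/usr/local/cuda/include/cudnn.h}"
    awk '
        $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
        $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
        $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
        # Fail if the header carried no version macros at all.
        END { if (major == "") exit 1; print major "." minor "." patch }
    ' "$header"
}
```

Usage: cudnn_version, or cudnn_version /path/to/cudnn.h. The other three versions can usually be read with nvcc --version (CUDA), nvidia-smi (driver) and pkg-config --modversion opencv (OpenCV), assuming those tools are on the PATH.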

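First, a recap of how caffe-segnet-cudnn5 was installed.

  1. In the tracking_slam package directory (~/catkin_ws/src/tracking_slam in the shell prompts below), clone the repository:
git clone https://github.com/TimoSaemann/caffe-segnet-cudnn5.git
  2. Enter the cloned directory and copy Makefile.config.example to Makefile.config:
sudo cp Makefile.config.example Makefile.config
  3. Open Makefile.config for editing from the caffe-segnet directory:
sudo gedit Makefile.config
  4. Edit Makefile.config as follows: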

Enable cuDNN:

Change
#USE_CUDNN := 1
to:
USE_CUDNN := 1

Set the OpenCV version:

Change
#OPENCV_VERSION := 3
to:
OPENCV_VERSION := 3

Enable the Python layer:

Change
#WITH_PYTHON_LAYER := 1
to:
WITH_PYTHON_LAYER := 1

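Set the Python paths:

In Makefile.config these are the PYTHON_INCLUDE and PYTHON_LIB variables; point them at the local Python 2.7 and NumPy installation. The exact values depend on the machine.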
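
The uncommenting edits for USE_CUDNN, OPENCV_VERSION and WITH_PYTHON_LAYER can also be applied without opening an editor. A minimal sketch, assuming the options appear as commented-out "KEY := value" lines the way they do in Makefile.config.example; enable_option is a hypothetical helper name, not part of Caffe.

```shell
#!/usr/bin/env bash
# enable_option is a hypothetical helper: it uncomments one "KEY := value"
# line in a Makefile.config-style file and leaves every other line untouched.
# Usage: enable_option <file> <KEY>
enable_option() {
    local file="$1" key="$2" tmp rc
    tmp="$(mktemp)" || return 1
    # Drop a leading "#" (plus optional spaces) in front of "KEY :=".
    sed -E "s/^#[[:space:]]*(${key}[[:space:]]*:=)/\1/" "$file" > "$tmp" \
        && cat "$tmp" > "$file"   # write back in place, keeping file permissions
    rc=$?
    rm -f "$tmp"
    return "$rc"
}
```

Usage, from the caffe-segnet-cudnn5 directory: enable_option Makefile.config USE_CUDNN, then the same for OPENCV_VERSION and WITH_PYTHON_LAYER. Lines that are already uncommented are left as they are.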

Edit the Makefile in the caffe-segnet directory:

Replace:
NVCCFLAGS += -ccbin=$(CXX) -Xcompiler -fPIC $(COMMON_FLAGS)
with:
NVCCFLAGS += -D_FORCE_INLINES -ccbin=$(CXX) -Xcompiler -fPIC $(COMMON_FLAGS)
Replace:
LIBRARIES += glog gflags protobuf boost_system boost_filesystem m hdf5_hl hdf5
with:
LIBRARIES += glog gflags protobuf boost_system boost_filesystem m hdf5_serial_hl hdf5_serial
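
The two Makefile substitutions can be scripted so they are easy to reapply after a fresh clone. This is only a sketch: patch_caffe_makefile is a hypothetical helper, and it assumes the NVCCFLAGS and LIBRARIES lines start at the beginning of a line, as in the lines quoted here.

```shell
#!/usr/bin/env bash
# patch_caffe_makefile is a hypothetical helper that applies the two Makefile
# edits quoted in this post. Running it twice leaves the file unchanged.
# Usage: patch_caffe_makefile <Makefile>
patch_caffe_makefile() {
    local file="$1" tmp rc
    tmp="$(mktemp)" || return 1
    # 1st expression: add -D_FORCE_INLINES in front of -ccbin on the NVCCFLAGS line.
    # 2nd expression: switch the trailing hdf5 libraries to their *_serial variants.
    sed -E \
        -e 's/^(NVCCFLAGS[[:space:]]*\+=[[:space:]]*)-ccbin/\1-D_FORCE_INLINES -ccbin/' \
        -e '/^LIBRARIES[[:space:]]*\+=/ s/hdf5_hl hdf5([[:space:]]*)$/hdf5_serial_hl hdf5_serial\1/' \
        "$file" > "$tmp" && cat "$tmp" > "$file"
    rc=$?
    rm -f "$tmp"
    return "$rc"
}
```

Usage, from the caffe-segnet-cudnn5 directory: patch_caffe_makefile Makefile. It is worth checking the result with git diff before building.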

Edit /usr/local/cuda/include/host_config.h:

Change
#error-- unsupported GNU version! gcc versions later than 5.0 are not supported!
to
//#error-- unsupported GNU version! gcc versions later than 5.0 are not supported!

Start the build from the caffe-segnet directory:

make all -j4

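All of the steps above went through, with a few small hiccups resolved along the way. The tests below, however, failed.

Check whether the build works:

sudo ldconfig /usr/local/cuda/lib64
sudo make test -j4
sudo make runtest -j4 

Error: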

F0927 16:20:13.189004 4091 math_functions.cu:394] Check failed: status == CURAND_STATUS_SUCCESS (201 vs. 0) CURAND_STATUS_LAUNCH_FAILURE
*** Check failure stack trace: ***
@ 0x7f6a1abc15cd google::LogMessage::Fail()
@ 0x7f6a1abc3433 google::LogMessage::SendToLog()
@ 0x7f6a1abc115b google::LogMessage::Flush()
@ 0x7f6a1abc3e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f6a0f5da3b4 caffe::caffe_gpu_rng_uniform<>()
@ 0x7f6a0f6008f3 caffe::PoolingLayer<>::Forward_gpu()
@ 0x47a436 caffe::Layer<>::Forward()
@ 0x480092 caffe::GradientChecker<>::CheckGradientSingle()
@ 0x52a592 caffe::GPUStochasticPoolingLayerTest_TestGradient_Test<>::TestBody()
@ 0x8f9923 testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0x8f373a testing::Test::Run()
@ 0x8f3888 testing::TestInfo::Run()
@ 0x8f3965 testing::TestCase::Run()
@ 0x8f4b7f testing::internal::UnitTestImpl::RunAllTests()
@ 0x8f4e93 testing::UnitTest::Run()
@ 0x46f22d main
@ 0x7f6a0e775840 __libc_start_main
@ 0x476ca9 _start
@ (nil) (unknown)
Makefile:526: recipe for target ‘runtest’ failed
make: *** [runtest] Aborted (core dumped)
xx@xx-OMEN-by-HP-Laptop:~/catkin_ws/src/tracking_slam/caffe-segnet-cudnn5$ make test -j4
make: Nothing to be done for ‘test’.

Attempt 1:
Installed the CUDA 8.0 patch from the official CUDA website; it did not help.

Attempt 2:
Go straight into the caffe-segnet-cudnn5 folder and build with CMake:

cd ../caffe-segnet-cudnn5/
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j4

The build succeeded.
But running the following commands produced an error:

roslaunch tracking_slam tb3_test.launch
rosbag play hd3_2018-12-14-16-29-16.bag

The dataset video showed up in the small panel on the left side of rviz, but no map was built.

Error:

F0928 17:40:46.945029 6136 cudnn_conv_layer.cpp:53] Check failed: status == CUDNN_STATUS_SUCCESS (4 vs. 0) CUDNN_STATUS_INTERNAL_ERROR

Attempt 3:

Uninstalled and reinstalled CUDA 8.0 and cuDNN 5.1, then in the caffe-segnet-cudnn5 directory:

make clean
make all -j4
sudo ldconfig /usr/local/cuda/lib64
sudo make test -j4
sudo make runtest -j4 

Still the same failure: Check failed: status == CURAND_STATUS_SUCCESS (201 vs. 0) CURAND_STATUS_LAUNCH_FAILURE.

Following the issue "CURAND_STATUS_SUCCESS (201 vs. 0) CURAND_STATUS_LAUNCH_FAILURE #1400", I removed the distribution's nvidia-cuda-toolkit package:

sudo apt-get remove --auto-remove nvidia-cuda-toolkit

After that, once more:

make clean
make all -j4
sudo make test -j4
sudo make runtest -j4 
xx@xx-OMEN-by-HP-Laptop:~/catkin_ws/src/tracking_slam/caffe-segnet-cudnn5$ sudo make runtest -j4 
.build_release/tools/caffe
.build_release/tools/caffe: error while loading shared libraries: libcudart.so.7.5: cannot open shared object file: No such file or directory
Makefile:526: recipe for target 'runtest' failed
make: *** [runtest] Error 127

Still an error. Along the way I also noticed that make test and make runtest always printed the following warnings:

/usr/bin/ld: warning: libcudart.so.7.5, needed by /usr/local/lib/libopencv_core.so, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: libnppc.so.7.5, needed by /usr/local/lib/libopencv_core.so, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: libnppi.so.7.5, needed by /usr/local/lib/libopencv_core.so, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: libnpps.so.7.5, needed by /usr/local/lib/libopencv_core.so, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: libcufft.so.7.5, needed by /usr/local/lib/libopencv_core.so, not found (try using -rpath or -rpath-link)
LD .build_release/src/caffe/test/test_upgrade_proto.o
............

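My guess is that /usr/local/lib/libopencv_core.so needs libraries from CUDA 7.5: this OpenCV 3.1.0 build was linked against CUDA 7.5 and cannot find them now that CUDA 8.0 is installed, so the two do not work together. Tomorrow I will switch OpenCV to 2.4.13 and try again.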
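
That guess can be checked directly: ldd lists the shared libraries a binary was linked against and marks the ones it cannot resolve as "not found". A small sketch; missing_libs is a hypothetical helper that filters ldd output read from stdin.

```shell
#!/usr/bin/env bash
# missing_libs is a hypothetical helper: it reads ldd output on stdin and
# prints only the names of the libraries that ldd could not resolve.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}
```

Usage: ldd /usr/local/lib/libopencv_core.so | missing_libs. If entries such as libcudart.so.7.5 show up, the OpenCV build itself most likely has to be rebuilt or replaced, since a CUDA 8.0 installation does not ship the 7.5 library versions.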


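First, uninstall OpenCV 3.1.0 following an earlier blog post: "Uninstalling and reinstalling OpenCV on Ubuntu (fixing the incompatibility between CUDA 10 and OpenCV 3.1.0)".

Then install OpenCV 2.4.13: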

  1. Install the build tools:
sudo apt-get install build-essential -y
  2. Install the required dependencies:
sudo apt-get install libgtk2.0-dev pkg-config libavcodec-dev libavformat-dev libswscale-dev -y
  3. Install the optional packages:
sudo apt-get install python-dev python-numpy libtbb2 libtbb-dev libjpeg-dev libpng-dev libtiff-dev libjasper-dev libdc1394-22-dev -y

sudo apt-get install libgtk2.0-dev -y

sudo apt-get install pkg-config -y
  4. Go to ~/catkin_ws/src and download OpenCV 2.4.13 from GitHub. The download is a zip archive rather than a git repository, so use wget, then unzip it in ~/catkin_ws/src:
wget https://github.com/Itseez/opencv/archive/2.4.13.zip

unzip 2.4.13.zip
  5. Enter the OpenCV directory, then build and install OpenCV 2.4.13 from source:
cd opencv-2.4.13/

mkdir build

cd build

cmake -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local .. 

make -j4

sudo make install
  6. Configure the OpenCV environment:
sudo gedit /etc/ld.so.conf.d/opencv.conf

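/etc/ld.so.conf.d/ did not contain an opencv.conf, so this effectively creates the file. Add this line to it:

/usr/local/lib

Save and exit.

sudo ldconfig    
sudo gedit /etc/bash.bashrc 

Append at the end:

PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig
export PKG_CONFIG_PATH

Save and exit.

Apply the configuration:

sudo -s
(enter the password when prompted)
source /etc/bash.bashrc
Ctrl+d  # (leave the root shell)
sudo updatedb # refresh the locate database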

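Then, in the caffe-segnet-cudnn5 directory, edit Makefile.config:

Comment out OPENCV_VERSION := 3 with a # (my reasoning: having gone back to OpenCV 2, the build should use the default OpenCV version). Otherwise make all -j4 fails, because libopencv_imgcodecs only exists in OpenCV 3:

/usr/bin/ld: cannot find -lopencv_imgcodecs
Makefile:566: recipe for target '.build_release/lib/libcaffe.so.1.0.0-rc3' failed
make: *** [.build_release/lib/libcaffe.so.1.0.0-rc3] Error 1

Then: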

make clean
make all -j4
sudo ldconfig /usr/local/cuda/lib64
sudo make test -j4
sudo make runtest -j4 

The tests passed.

roslaunch tracking_slam tb3_test.launch

1. First gotcha:

F0929 13:53:44.471045 22307 cudnn_conv_layer.cpp:53] Check failed: status == CUDNN_STATUS_SUCCESS (4 vs. 0)  CUDNN_STATUS_INTERNAL_ERROR
*** Check failure stack trace: ***
[ERROR] [1632894824.504076604]: PluginlibFactory: The plugin for class 'octomap_rviz_plugin/ColorOccupancyGrid' failed to load.  Error: According to the loaded plugin descriptions the class octomap_rviz_plugin/ColorOccupancyGrid with base class type rviz::Display does not exist. Declared types are  rviz/Axes rviz/Camera rviz/DepthCloud rviz/Effort rviz/FluidPressure rviz/Grid rviz/GridCells rviz/Illuminance rviz/Image rviz/InteractiveMarkers rviz/LaserScan rviz/Map rviz/Marker rviz/MarkerArray rviz/Odometry rviz/Path rviz/PointCloud rviz/PointCloud2 rviz/PointStamped rviz/Polygon rviz/Pose rviz/PoseArray rviz/PoseWithCovariance rviz/Range rviz/RelativeHumidity rviz/RobotModel rviz/TF rviz/Temperature rviz/WrenchStamped rviz_plugin_tutorials/Imu
[tracking_slam_node-1] process has died [pid 22084, exit code -6, cmd /home/xx/catkin_ws/devel/lib/tracking_slam/tracking_slam_node __name:=tracking_slam_node __log:=/home/xx/.ros/log/8d07e664-20e9-11ec-8a5c-887873831b6b/tracking_slam_node-1.log].
log file: /home/xx/.ros/log/8d07e664-20e9-11ec-8a5c-887873831b6b/tracking_slam_node-1*.log

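Fix: octomap_rviz_plugins

Following the issue "in ubuntu 18.04, using melodic ROS, can not install octomap_rviz_plugins #15", download the rviz plugin package, put the folder in the src directory of your ROS workspace, and run catkin_make; the plugin then gets installed.

2. Again:

roslaunch tracking_slam tb3_test.launch

Another error: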

cudnn_conv_layer.cpp:53] Check failed: status == CUDNN_STATUS_SUCCESS (4 vs. 0)  CUDNN_STATUS_INTERNAL_ERROR
*** Check failure stack trace: ***
[tracking_slam_node-2] process has died [pid 31579, exit code -6, cmd /home/xx/catkin_ws/devel/lib/tracking_slam/tracking_slam_node __name:=tracking_slam_node __log:=/home/xx/.ros/log/7afad10a-20f5-11ec-8a5c-887873831b6b/tracking_slam_node-2.log].
log file: /home/xx/.ros/log/7afad10a-20f5-11ec-8a5c-887873831b6b/tracking_slam_node-2*.log

3. Another attempt:

Uninstalled cuDNN 5.1.10, replaced it with the older cuDNN 5, rebuilt everything, and ran again:

roslaunch tracking_slam tb3_test.launch

Same error:

F0929 18:55:04.884161 16169 cudnn_conv_layer.cpp:53] Check failed: status == CUDNN_STATUS_SUCCESS (4 vs. 0)  CUDNN_STATUS_INTERNAL_ERROR
*** Check failure stack trace: ***
[ INFO] [1632912905.796436179]: Stereo is NOT SUPPORTED
[ INFO] [1632912905.796709093]: OpenGl version: 4.5 (GLSL 4.5).
0x16baca0 void QWindowPrivate::setTopLevelScreen(QScreen*, bool) ( QScreen(0x91b090) ): Attempt to set a screen on a child window.
0x16bba20 void QWindowPrivate::setTopLevelScreen(QScreen*, bool) ( QScreen(0x91b090) ): Attempt to set a screen on a child window.
0x16c9340 void QWindowPrivate::setTopLevelScreen(QScreen*, bool) ( QScreen(0x91b090) ): Attempt to set a screen on a child window.
0x16bb580 void QWindowPrivate::setTopLevelScreen(QScreen*, bool) ( QScreen(0x91b090) ): Attempt to set a screen on a child window.
[tracking_slam_node-1] process has died [pid 15938, exit code -6, cmd /home/xx/catkin_ws/devel/lib/tracking_slam/tracking_slam_node __name:=tracking_slam_node __log:=/home/xx/.ros/log/a3eb18e0-2113-11ec-8a5c-887873831b6b/tracking_slam_node-1.log].
log file: /home/xx/.ros/log/a3eb18e0-2113-11ec-8a5c-887873831b6b/tracking_slam_node-1*.log

4. Add engine: CAFFE

Following the issue "Check failed: status == CUDNN_STATUS_SUCCESS (4 vs. 0) CUDNN_STATUS_INTERNAL_ERROR #3", add the following to conv.prototxt:

engine: CAFFE

Still the same error.
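
For reference, in Caffe's layer definitions engine is a field of convolution_param, so that is where the line belongs in each convolution layer. The sketch below adds it to every such block in one pass. force_caffe_engine is a hypothetical helper, and it assumes each block opens with "convolution_param {" at the end of its line and does not already set engine; blocks written on a single line are left untouched.

```shell
#!/usr/bin/env bash
# force_caffe_engine is a hypothetical helper: it prints a copy of a prototxt
# with "engine: CAFFE" added as the first field of every convolution_param
# block whose opening brace ends the line. The input file is not modified.
# Usage: force_caffe_engine <in.prototxt> > <out.prototxt>
force_caffe_engine() {
    awk '
        { print }
        /convolution_param[[:space:]]*[{][[:space:]]*$/ {
            indent = $0
            sub(/convolution_param.*/, "", indent)   # keep only the leading whitespace
            print indent "  engine: CAFFE"
        }
    ' "$1"
}
```

Usage: force_caffe_engine conv.prototxt > conv_caffe_engine.prototxt, then point the node at the new file. With this setting Caffe uses its own convolution implementation for those layers rather than cuDNN.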

5. Another attempt:
Following the issue "CUDNN_STATUS_SUCCESS (4 vs. 0) CUDNN_STATUS_INTERNAL_ERROR #6873":

sudo rm -rf ~/.nv/

Still an error:

F0929 21:19:53.521602  6044 io.cpp:54] Check failed: fd != -1 (-1 vs. -1) File not found: /home/xx/catkin_ws/src/tracking_slam/config/segnet/segnet_pascal.caffemodel
*** Check failure stack trace: ***
[tracking_slam_node-1] process has died [pid 5799, exit code -6, cmd /home/xx/catkin_ws/devel/lib/tracking_slam/tracking_slam_node __name:=tracking_slam_node __log:=/home/xx/.ros/log/a3eb18e0-2113-11ec-8a5c-887873831b6b/tracking_slam_node-1.log].
log file: /home/xx/.ros/log/a3eb18e0-2113-11ec-8a5c-887873831b6b/tracking_slam_node-1*.log

This time I could not find anything about the error anywhere online.
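
Note that this log is a different failure from the cuDNN one: the check in io.cpp fails because the file named in the message, segnet_pascal.caffemodel, was not found at that path. A quick pre-flight check along these lines would catch that before launching; check_files is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# check_files is a hypothetical helper: it reports every argument that is not
# an existing regular file and returns non-zero if at least one is missing.
# Usage: check_files <file>...
check_files() {
    local missing=0 f
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Usage: check_files /home/xx/catkin_ws/src/tracking_slam/config/segnet/segnet_pascal.caffemodel && roslaunch tracking_slam tb3_test.launch, so the launch only starts once the model file is in place.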

6. Some posts online attribute this to insufficient GPU memory, but when I checked, plenty of memory was still free. [screenshot not preserved]


The Caffe build configuration also looks fine:

******************* Caffe Configuration Summary *******************
-- General:
--   Version           :   1.0.0-rc3
--   Git               :   abcf30d-dirty
--   System            :   Linux
--   C++ compiler      :   /usr/bin/c++
--   Release CXX flags :   -O3 -DNDEBUG -fPIC -Wall -Wno-sign-compare -Wno-uninitialized
--   Debug CXX flags   :   -g -fPIC -Wall -Wno-sign-compare -Wno-uninitialized
--   Build type        :   Release
-- 
--   BUILD_SHARED_LIBS :   ON
--   BUILD_python      :   ON
--   BUILD_matlab      :   OFF
--   BUILD_docs        :   ON
--   CPU_ONLY          :   OFF
--   USE_OPENCV        :   ON
--   USE_LEVELDB       :   ON
--   USE_LMDB          :   ON
--   ALLOW_LMDB_NOLOCK :   OFF
-- 
-- Dependencies:
--   BLAS              :   Yes (Atlas)
--   Boost             :   Yes (ver. 1.58)
--   glog              :   Yes
--   gflags            :   Yes
--   protobuf          :   Yes (ver. 2.6.1)
--   lmdb              :   Yes (ver. 0.9.17)
--   LevelDB           :   Yes (ver. 1.18)
--   Snappy            :   Yes (ver. 1.1.3)
--   OpenCV            :   Yes (ver. 3.3.1)
--   CUDA              :   Yes (ver. 8.0)
-- 
-- NVIDIA CUDA:
--   Target GPU(s)     :   Auto
--   GPU arch(s)       :   sm_61
--   cuDNN             :   Yes (ver. 5.0.5)
-- 
-- Python:
--   Interpreter       :   /usr/bin/python2.7 (ver. 2.7.12)
--   Libraries         :   /usr/lib/x86_64-linux-gnu/libpython2.7.so (ver 2.7.12)
--   NumPy             :   /usr/lib/python2.7/dist-packages/numpy/core/include (ver 1.11.0)
-- 
-- Documentaion:
--   Doxygen           :   No
--   config_file       :   
-- 
-- Install:
--   Install path      :   /home/xx/catkin_ws/src/tracking_slam/caffe-segnet-cudnn5/build/install
-- 


Update: I had most likely changed too much in Caffe's Makefile and Makefile.config. Leaving both unmodified, I built directly with CMake:

cd ../caffe-segnet-cudnn5/
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j4

After that, fix the relevant paths according to whatever error messages come up.


Reference:
caffe-segnet-cudnn5 installation
