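My machine configuration:
Ubuntu 18.04 + CUDA 10.0 + cuDNN 7.4.1.5
Problem: the installation method on the official Torch7 site does not work with CUDA 10.0. Below is a record of how I got Torch7 installed successfully this time,
along with the problems I hit when running the code of paper 8, "Deep Depth Completion of a Single RGB-D Image", and how I solved them.
The first two commands below print the CUDA version; nvidia-smi reports the driver status: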
nvcc --version
cat /usr/local/cuda/version.txt
nvidia-smi
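If the check has to be repeated inside a script, the release number can be pulled out of the nvcc output. This is only a minimal sketch: it assumes the output contains a line such as "Cuda compilation tools, release 10.0, V10.0.130", and the function name cuda_release is my own, not part of CUDA.

```shell
# Hypothetical helper: read `nvcc --version` output on stdin and print
# only the release number (for example 10.0).
# Assumes the output contains a "release X.Y" token.
cuda_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Usage (requires nvcc on PATH):
#   nvcc --version | cuda_release
```

If nothing is printed, the output did not contain a release line, which usually means nvcc is not the one from the CUDA toolkit.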
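If the NVIDIA driver is not working properly, uninstall it and install it again:
(1) First switch gcc to the gcc-8 version (otherwise the driver installation fails later); see https://blog.csdn.net/u013928488/article/details/107288413/ . Adjust the version number in the commands below to the one you need.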
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-6 100
sudo update-alternatives --config gcc
(2) Then follow this blog post to install the driver:
https://blog.csdn.net/weixin_43820996/article/details/100676292
Uninstall cmake
sudo apt remove --purge cmake
hash -r
Install cmake-3.17.2
sudo apt install build-essential libssl-dev
wget https://github.com/Kitware/CMake/releases/download/v3.17.2/cmake-3.17.2.tar.gz
tar -zxvf cmake-3.17.2.tar.gz
cd cmake-3.17.2
./bootstrap
make
sudo make install
cmake --version
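Reference: https://blog.csdn.net/liaoze22/article/details/107821653
Press Ctrl+Alt+T to open a terminal, then clone one of the two repositories below (both commands target ~/torch, so run only one of them):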
git clone https://github.com/torch/distro.git ~/torch --recursive
git clone https://github.com/nagadomi/distro.git ~/torch --recursive
cd ~/torch
sudo gedit install-deps # open the file to edit
Then, on lines 178 and 261 of the file, change sudo apt-get install -y python-software-properties to sudo apt-get install -y software-properties-common, and save.
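The same edit can be made without opening an editor. A minimal sketch, assuming GNU sed; fix_install_deps is a name I made up, and note that it replaces every occurrence of the old package name in the file rather than only lines 178 and 261.

```shell
# Hypothetical helper: swap the obsolete package name for its replacement
# everywhere in the given file (GNU sed, edits the file in place).
fix_install_deps() {
    sed -i 's/python-software-properties/software-properties-common/g' "$1"
}

# Usage, from ~/torch:
#   fix_install_deps install-deps
```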
cd ~/torch
git config --global url."https://".insteadOf git://
bash install-deps
rm -fr cmake/3.6/Modules/FindCUDA*
cd ~/torch
cd extra/cutorch
vim atomic.patch
Paste the following content into it:
diff --git a/lib/THC/THCAtomics.cuh b/lib/THC/THCAtomics.cuh
index 400875c..ccb7a1c 100644
--- a/lib/THC/THCAtomics.cuh
+++ b/lib/THC/THCAtomics.cuh
@@ -94,6 +94,7 @@ static inline __device__ void atomicAdd(long *address, long val) {
}
#ifdef CUDA_HALF_TENSOR
+#if !(__CUDA_ARCH__ >= 700 || !defined(__CUDA_ARCH__) )
static inline __device__ void atomicAdd(half *address, half val) {
unsigned int * address_as_ui =
(unsigned int *) ((char *)address - ((size_t)address & 2));
@@ -117,6 +118,7 @@ static inline __device__ void atomicAdd(half *address, half val) {
} while (assumed != old);
}
#endif
+#endif
Then save and close the file opened in vim: press Esc, then type :wq!
patch -p1 < atomic.patch
First obtain privileges; you need to switch to the root user here (important):
su root
Then install. (Install Torch with Lua 5.2: with LuaJIT, the later require 'cutorch', require 'nn', etc. fail.)
cd /home/**/torch # absolute path of torch; mine is /home/lt/torch
./clean.sh # run the clean.sh script
export TORCH_NVCC_FLAGS="-D__CUDA_NO_HALF_OPERATORS__"
TORCH_LUA_VERSION=LUA52 ./install.sh # run the install.sh script
Enter yes when prompted.
su root
source ~/.bashrc
source ~/.profile
(1)
su root
sudo gedit ~/.bashrc
Check whether a line like the following has been added to the end of ~/.bashrc:
. /home/XXX/torch/install/bin/torch-activate
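Instead of opening the file in an editor, the check can be done with grep. A minimal sketch; has_torch_activate is a hypothetical name, and it only looks for the path fragment torch/install/bin/torch-activate, so it assumes Torch was installed under a directory called torch.

```shell
# Hypothetical helper: succeed if the given file mentions the
# torch-activate script that install.sh is expected to add.
has_torch_activate() {
    grep -q 'torch/install/bin/torch-activate' "$1"
}

# Usage:
#   has_torch_activate ~/.bashrc && echo "activation line present"
```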
(2)
th
require'torch'
require'nn'
require'cutorch'
Reboot.
module 'cutorch' not found:Failed loading module cutorch in LuaRocks rock cutorch scm-1
Fix: open a terminal in the /home/**/torch/extra/cutorch directory and run:
su root
export TORCH_NVCC_FLAGS="-D__CUDA_NO_HALF_OPERATORS__"
luarocks make rocks/cutorch-scm-1.rockspec
th
require'cutorch'
CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
CUDA_cublas_device_LIBRARY (ADVANCED)
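Cause: the CUDA and CMake versions conflict. CUDA 10.0 needs CMake 3.12.2 or later.
Fix: reinstall CMake 3.14.3
(1) Download and extract (if you download from the official site yourself, you may end up with an archive that has no bootstrap file):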
sudo apt-get purge cmake
wget https://cmake.org/files/v3.14/cmake-3.14.3.tar.gz
sudo tar -zxv -f cmake-3.14.3.tar.gz
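(2) Extraction produces a single folder named cmake-3.14.3. If the folder or the files in it show a lock icon, their permissions are restricted, and you need to change them with chmod -R 777 cmake-3.14.3. If there is no lock, skip this step.
(3) Check that gcc and g++ are installed: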
gcc --version
(4) Install
Open a terminal in the extracted directory, or cd /home/**/cmake-3.14.3
sudo ./bootstrap
sudo make
sudo make install
cmake --version # check that the installation succeeded
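To confirm that the installed CMake meets the 3.12.2 minimum mentioned above, the two version strings can be compared in the shell. A minimal sketch, assuming GNU coreutils (for sort -V) and that cmake --version prints a first line of the form "cmake version 3.14.3"; version_ge is my own helper name.

```shell
# Succeeds when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage:
#   installed=$(cmake --version | sed -n 's/^cmake version //p')
#   version_ge "$installed" 3.12.2 && echo "CMake is new enough for CUDA 10.0"
```

A plain string comparison would get this wrong, because 3.9 sorts after 3.12 alphabetically; version sort compares the numeric components instead.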
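Then go back to the first problem (module 'cutorch' not found) and rerun its fix.
Problem
CUDA 10.0 requires gcc <= 7
Reinstall gcc, then reinstall cmake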
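Before reinstalling anything, it is worth checking which gcc is currently selected. A minimal sketch of the gcc <= 7 check; gcc_major_at_most is a hypothetical helper, and it assumes gcc -dumpversion prints either a bare major number or a dotted version such as 7.5.0.

```shell
# Succeeds when the major component of version $1 is at most $2.
gcc_major_at_most() {
    major=${1%%.*}          # strip everything from the first dot onward
    [ "$major" -le "$2" ]
}

# Usage:
#   gcc_major_at_most "$(gcc -dumpversion)" 7 || echo "gcc is too new for CUDA 10.0"
```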
Problem: require 'cutorch' fails with
cannot load '/home/lt/torch/install/lib/lua/5.2/libcutorch.so'
sudo chmod -R 777 /home/lt/torch
As a regular user, th, require 'cutorch' and require 'cunn' all succeeded.
Under su root, however, they failed.
Problem:
'libcudnn (R5) not found in library path.
Please install CuDNN from https://developer.nvidia.com/cuDNN
Then make sure files named as libcudnn.so.5 or libcudnn.5.dylib are placed in
your library load path (for example /usr/local/lib , or manually add a path to LD_LIBRARY_PATH)
Alternatively, set the path to libcudnn.so.5 or libcudnn.5.dylib
to the environment variable CUDNN_PATH and rerun torch.
For example: export CUDNN_PATH="/usr/local/cuda/lib64/libcudnn.so.5"
stack traceback:
[C]: in function 'error'
/home/lt/torch/install/share/lua/5.2/trepl/init.lua:389: in function 'require'
main_test_bound_realsense.lua:4: in main chunk
[C]: in function 'dofile'
Fix: install cuDNN
Installing from the Debian file: https://blog.csdn.net/weixin_45591044/article/details/104608506
cuDNN installed successfully, but the problem persisted.
sudo find / -name 'libcudnn.*' # locate the library
sudo cp -r /usr/lib/x86_64-linux-gnu/libcudnn.so.7.4.2 /usr/local/lib # copy it
sudo mv libcudnn.so.7.4.2 libcudnn.so.5 # rename it; cd /usr/local/lib first
sudo cp -r /usr/local/lib/libcudnn.so.5 /usr/local/cuda/lib64/libcudnn.so.5
th main_test_bound_realsense.lua -test_model ../pre_train_model/bound.t7 -test_file ./data_list/realsense_list.txt -root_path ../data/realsense/
could not load library /usr/local/lib:
Found Environment variable CUDNN_PATH = /usr/local/lib:/home/**/torch/install/bin/lua: /home/**/torch/install/share/lua/5.2/trepl/init.lua:389: /home/**/torch/install/share/lua/5.2/trepl/init.lua:389: /home/**/torch/install/share/lua/5.2/cudnn/ffi.lua:1743: could not load library /usr/local/lib:
stack traceback:
[C]: in function 'error'
/home/**/torch/install/share/lua/5.2/trepl/init.lua:389: in function 'require'
main_test_bound_realsense.lua:4: in main chunk
[C]: in function 'dofile'
...e/**/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: in ?
Fix: before running, first export CUDNN_PATH=/usr/local/cuda/lib64/libcudnn.so
To make it permanent:
sudo gedit /etc/profile
export CUDNN_PATH=/usr/local/cuda/lib64/libcudnn.so # add this at the end of the file
source /etc/profile
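If the export line is added from a script instead of an editor, it should not be appended twice on a rerun. A minimal sketch of an idempotent append; append_once is a hypothetical helper, not a standard command.

```shell
# Append line $2 to file $1 unless an identical line is already there.
# -x matches whole lines, -F treats the pattern as a fixed string.
append_once() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Usage (as root, because /etc/profile is system-wide):
#   append_once /etc/profile 'export CUDNN_PATH=/usr/local/cuda/lib64/libcudnn.so'
```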
Are you using an older or newer version of CuDNN?
cuDNN 7.4 is too new for Torch7.
Found Environment variable CUDNN_PATH = /usr/local/cuda/lib64/libcudnn.so/home/**/torch/install/bin/lua: /home/**/torch/install/share/lua/5.2/trepl/init.lua:389: /home/**/torch/install/share/lua/5.2/trepl/init.lua:389: /home/**/torch/install/share/lua/5.2/cudnn/ffi.lua:1618: These bindings are for CUDNN 5.x (5005 <= cudnn.version > 6000) , while the loaded CuDNN is version: 7401
Are you using an older or newer version of CuDNN?
stack traceback:
[C]: in function 'error'
/home/**/torch/install/share/lua/5.2/trepl/init.lua:389: in function 'require'
main_test_bound_realsense.lua:4: in main chunk
[C]: in function 'dofile'
...e/**/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: in ?
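The numbers in this message are easier to read once decoded. As far as I know, cuDNN up to version 8 packs its version into one integer as major*1000 + minor*100 + patch, so 7401 means 7.4.1. A minimal sketch under that assumption; cudnn_version_string is a name I made up.

```shell
# Decode cuDNN's integer version, assuming the encoding
# major*1000 + minor*100 + patch (e.g. 7401 -> 7.4.1).
cudnn_version_string() {
    v=$1
    printf '%d.%d.%d\n' $((v / 1000)) $(((v % 1000) / 100)) $((v % 100))
}

# Usage:
#   cudnn_version_string 7401
#   cudnn_version_string 5005
```

Read this way, the bindings expect a 5.x library (5.0.5 up to, but not including, 6.0.0), while the loaded library is 7.4.1.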
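Fix: following the blog post below, give Torch a set of bindings (probably some kind of mapping) that is compatible with the newer cuDNN structures (version 7 and above).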
https://blog.csdn.net/Geek_of_CSDN/article/details/80461129
cd /home/lt/torch
git clone https://github.com/soumith/cudnn.torch.git -b R7 && cd cudnn.torch && luarocks make cudnn-scm-1.rockspec
cannot open <../pre_train_model/bound.t7> in mode
/home/**/torch/install/bin/lua: cannot open <../pre_train_model/bound.t7> in mode r at /home/**/torch/pkg/torch/lib/TH/THDiskFile.c:673
stack traceback:
[C]: in ?
[C]: in function 'DiskFile'
/home/**/torch/install/share/lua/5.2/torch/File.lua:405: in function 'load'
main_test_bound_realsense.lua:38: in main chunk
[C]: in function 'dofile'
...e/**/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: in ?
Fix: copy the pre-trained model into place so that ../pre_train_model/bound.t7 exists.
attempt to call global 'unpack' (a nil value)
/home/**/torch/install/bin/lua: BatchIterator_scannet.lua:24: attempt to call global 'unpack' (a nil value)
stack traceback:
BatchIterator_scannet.lua:24: in function 'nextBatchRealsense'
main_test_bound_realsense.lua:50: in main chunk
[C]: in function 'dofile'
...e/**/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: in ?
Cause: in Lua 5.2 the unpack function has moved into the table library, so it must be called as table.unpack.
Fix: in the file BatchIterator_scannet.lua under DeepCompletionRelease-master/torch, change unpack on line 24 to table.unpack
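The one-line change can also be made from the shell. A minimal sketch, assuming GNU sed; use_table_unpack is a hypothetical helper that rewrites bare unpack( calls on one given line and leaves an existing table.unpack( untouched, so it is safe to run twice. It is not a general Lua refactoring tool.

```shell
# Hypothetical helper: on line $2 of file $1, rewrite bare unpack( calls
# to table.unpack(. A call already preceded by "." (as in table.unpack)
# or by an identifier character is left alone.
use_table_unpack() {
    sed -E -i "${2}s/(^|[^.[:alnum:]_])unpack\(/\1table.unpack(/g" "$1"
}

# Usage, from DeepCompletionRelease-master/torch:
#   use_table_unpack BatchIterator_scannet.lua 24
```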
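Success!
Lesson learned: when a problem appears, read the error message first and only then look for a solution online. Don't rush to search the moment something goes wrong.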
th main_test_realsense.lua -test_model ../pre_train_model/normal_scannet.t7 -test_file ./data_list/realsense_list.txt -root_path ../data/realsense/
module 'hdf5' not found:No LuaRocks module found for hdf5
A. Open a terminal in the /usr/lib/x86_64-linux-gnu directory and run:
sudo ln -s libhdf5_serial.so.100.0.1 libhdf5.so
sudo ln -s libhdf5_serial_hl.so libhdf5_hl.so
Reference: https://blog.csdn.net/weixin_43165871/article/details/88992354
sudo apt-get install libhdf5-serial-dev hdf5-tools
git clone https://github.com/deepmind/torch-hdf5
cd torch-hdf5
luarocks make hdf5-0-0.rockspec LIBHDF5_LIBDIR="/usr/lib/x86_64-linux-gnu/"
Problem: Missing dependencies for hdf5: totem
First follow: https://stackoverflow.com/questions/45499973/missing-dependency-for-hdf5-totem-error-failed-cloning-git-repository-git-clo
git clone https://github.com/deepmind/torch-totem.git
cd torch-totem
cp rocks/totem-0-0.rockspec ./ # copy the rockspec file into the root directory of the project
luarocks make
Then:
luarocks make hdf5-0-0.rockspec LIBHDF5_LIBDIR="/usr/lib/x86_64-linux-gnu/"
module 'bit' not found:No LuaRocks module found for bit
A: (1) First install Lua and LuaRocks
Install Lua 5.2.3: https://blog.csdn.net/hp_cpp/article/details/87641222
Install LuaRocks 3.0.4: https://blog.csdn.net/hp_cpp/article/details/87643911
(2) Then
luarocks install luabitop
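Failed! It could not be installed with luarocks install.
B: Install the bit module another way:
1. Download the library from http://bitop.luajit.org/download.html
2. Extract it with tar xvzf LuaBitOp-1.0.2.tar.gz
3. Open a terminal in the ~/LuaBitOp-1.0.2 directory and run make
4. make install
Success!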
/torch/install/share/lua/5.2/hdf5/ffi.lua:73: expected align(#) on line 679
Fix: switch both gcc and g++ to version 4.8:
sudo update-alternatives --config gcc
Solved!
/home/lt/torch/install/share/lua/5.2/hdf5/ffi.lua:88: Unsupported HDF5 version: 1.10.4
Approach A: install HDF5 1.8.12:
(1) Download from https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.8/hdf5-1.8.12/src/
(2)
sudo tar -xvf hdf5-1.8.12.tar.gz
cd hdf5-1.8.12/
sudo ./configure --prefix=/usr/local/hdf5 # installation prefix
sudo make
sudo make check
sudo make install
make check-install
Failed!
Approach B: edit the config.lua file in the /home/lt/torch/install/share/lua/5.2/hdf5 directory as follows (see https://github.com/deepmind/torch-hdf5/issues/76#issuecomment-292811730 ):
hdf5._config = {
HDF5_INCLUDE_PATH = "/usr/local/hdf5/include/",
HDF5_LIBRARIES = "/usr/local/hdf5/lib/libhdf5.so;/usr/lib/x86_64-linux-gnu/librt.so;/usr/lib/x86_64-linux-gnu/libpthread.so;/home/lt/anaconda3/lib/libz.so;/usr/lib/x86_64-linux-gnu/libdl.so;/usr/lib/x86_64-linux-gnu/libm.so"
}
Success!!
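Since HDF5_LIBRARIES in config.lua is a semicolon-separated list of absolute paths, a wrong entry is easy to miss. A minimal sketch that prints the entries which do not exist on disk; list_missing_libs is a hypothetical helper, and you have to paste the list in as its argument yourself.

```shell
# Hypothetical helper: split $1 on ';' and print every entry that does
# not exist on disk. Runs in a subshell so IFS and globbing settings
# do not leak into the calling shell.
list_missing_libs() (
    set -f          # no glob expansion while splitting
    IFS=';'
    for lib in $1; do
        [ -e "$lib" ] || printf '%s\n' "$lib"
    done
)

# Usage:
#   list_missing_libs "/usr/local/hdf5/lib/libhdf5.so;/usr/lib/x86_64-linux-gnu/librt.so"
```

No output means every listed library was found.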
In /media/lt/c8470c47-e40b-4a4a-9a40-c8c0736564fe/lt/lsn/depth/code8/code8/torch/data_list/realsense_list.txt, add the number of the image you want to test (e.g. 050),
and you will get the occlusion boundaries and surface normals for that image.