Configuring dvgo_cu on a Slurm Cluster

  • Configuring dvgo_cu
  • Errors
    • libcublasLt.so.11
    • libopcodes-2.30-55.el7.2.so
    • ‘-std=c++14’
    • collect2: error: ld returned 1 exit status
    • Command '['which', 'x86_64-conda_cos7-linux-gnu-c++']' returned non-zero exit status 1.
    • unsupported GNU version! gcc versions later than 10 are not supported!

Configuring dvgo_cu

module load cuda/11.1
pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html

conda install gcc_linux-64=9.3.0
conda install gxx_linux-64=9.3.0

In the sbatch script, run:

python setup.py install
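
The steps above could be combined into one job script. The following is only a sketch: the file name, job name, partition, GPU request, and log name are placeholders rather than values from this cluster, and the conda activation lines assume a standard Anaconda layout. The environment name vqrf is taken from the paths in the error messages below.

```shell
# Write a hypothetical job script to the current directory.
cat > build_dvgo_cu.sbatch <<'EOF'
#!/bin/bash -l
# Job name, partition, GPU request and log name are assumptions;
# adjust them to your cluster.
#SBATCH --job-name=build_dvgo_cu
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --output=build_dvgo_cu.%j.log

# Stop at the first failing command so the error is easy to find in the log.
set -e

# Same CUDA module as in the setup steps.
module load cuda/11.1

# Make "conda activate" available in a non-interactive shell, then
# activate the environment ("vqrf" is the name seen in the error paths).
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate vqrf

# Build and install the CUDA extension.
python setup.py install
EOF

# Submit it from the directory that contains setup.py:
# sbatch build_dvgo_cu.sbatch
```

Running the build inside the job means it uses the compute node's modules and compilers, which is where the errors below were seen.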

Errors

libcublasLt.so.11

Error message:

OSError: /public/data2/users/aigc1g02/anaconda3/envs/vqrf/lib/python3.8/site-packages/nvidia/cublas/lib/libcublas.so.11: symbol cublasLtHSHMatmulAlgoInit, version libcublasLt.so.11 not defined in file libcublasLt.so.11 with link time reference

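Cause: a version conflict between the installed PyTorch packages and the loaded CUDA version.

Fix: install the PyTorch build that matches CUDA 11.1: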

pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
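
To confirm that the installed wheel matches the loaded CUDA module, the local version tag (the part after "+") can be compared with the module version. This is a minimal sketch: cuda_to_tag and wheel_tag are hypothetical helper names, and it assumes the CUDA version is given as major.minor.

```shell
# Turn a CUDA major.minor version into a wheel tag, e.g. 11.1 -> cu111.
cuda_to_tag() {
    printf 'cu%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# Extract the local version tag from a torch version string,
# e.g. 1.9.1+cu111 -> cu111. Prints an empty line if there is no tag.
wheel_tag() {
    case "$1" in
        *+*) printf '%s\n' "${1##*+}" ;;
        *)   printf '\n' ;;
    esac
}

# In practice: installed=$(python -c 'import torch; print(torch.__version__)')
installed="1.9.1+cu111"
if [ "$(wheel_tag "$installed")" = "$(cuda_to_tag 11.1)" ]; then
    echo "torch build matches CUDA 11.1"
else
    echo "torch build does not match CUDA 11.1"
fi
```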

Refs:

  1. how does one fix when torch can’t find cuda, error: version libcublasLt.so.11 not defined in file libcublasLt.so.11 with link time reference?
  2. Conflict with cudatoolkit 11.0.221

libopcodes-2.30-55.el7.2.so

Error message:

/public/data2/software/gcc/8.3.1/usr/bin/…/libexec/gcc/x86_64-redhat-linux/8/as: error while loading shared libraries: libopcodes-2.30-55.el7.2.so: cannot open shared object file: No such file or directory
error: command ‘/public/data2/software/gcc/8.3.1/bin/gcc’ failed with exit code 1

Cause: the CUDA, GCC, and Linux versions do not match.

Fix:
Check the version information:

cat /etc/redhat-release

Check the CUDA toolkit version with nvcc -V.
Check the gcc version with gcc -v.

GCC and CUDA version compatibility: see the NVIDIA CUDA Installation Guide for Linux.
Click Archive in the upper-right corner, select your CUDA version, then open the Installation Guide for Linux.
Adjust the gcc version according to the table and your OS release.
That is, for CentOS Linux release 7.6.1810 (Core), set gcc to 4.8.5:

module unload gcc/8.3.1

(The system's original gcc version is 4.8.5.)
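
Because module load, module unload, and conda activation all change PATH, it helps to print which compiler actually gets picked up inside the job. A small sketch (which_tool is a hypothetical helper name):

```shell
# Report which executable a tool name resolves to on the current PATH.
which_tool() (
    if path=$(command -v "$1" 2>/dev/null); then
        printf '%s -> %s\n' "$1" "$path"
    else
        printf '%s -> not found\n' "$1"
    fi
)

# Run this inside the job, after every "module" and "conda activate" line.
for tool in gcc g++ nvcc; do
    which_tool "$tool"
done
```

The error messages in this post show gcc resolving to three different locations (the gcc/8.3.1 module, /usr/bin/gcc, and the gcc/5.5.0 module), so this check is worth repeating after each change.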

‘-std=c++14’

Error message:

gcc: error: unrecognized command line option ‘-std=c++14’
error: command ‘/usr/bin/gcc’ failed with exit code 1

Cause: the g++ version is too old; -std=c++14 needs g++ 5.2 or later.

Fix:

module load gcc/5.5.0
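
Whether the compiler now on PATH accepts the flag can be checked directly, without running the whole build. A minimal sketch (supports_cxx14 is a hypothetical helper name; it only checks that the flag is accepted, not that the full build will succeed):

```shell
# Return 0 if the given compiler accepts -std=c++14, 1 if it rejects it,
# and 2 if the compiler is not on PATH at all.
supports_cxx14() {
    command -v "$1" >/dev/null 2>&1 || return 2
    # Feed a trivial program on stdin and only check the syntax.
    if echo 'int main() { return 0; }' \
        | "$1" -std=c++14 -x c++ -fsyntax-only - >/dev/null 2>&1; then
        return 0
    else
        return 1
    fi
}

if supports_cxx14 g++; then
    echo "g++ accepts -std=c++14"
else
    echo "g++ does not accept -std=c++14 (or is not on PATH)"
fi
```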

collect2: error: ld returned 1 exit status

Error message:

/public/data2/users/aigc1g02/anaconda3/envs/vqrf/compiler_compat/ld: cannot find crti.o: No such file or directory
/public/data2/users/aigc1g02/anaconda3/envs/vqrf/compiler_compat/ld: cannot find -lm: No such file or directory
collect2: error: ld returned 1 exit status

error: command ‘/public/data2/software/gcc/5.5.0/bin/g++’ failed with exit code 1

Cause: the compiler is not recent enough. Anaconda's most recent compiler needs to be installed before installing the package.
(With root access, recreating the links with ln is another option.)

Fix:

conda install -c anaconda gcc_linux-64

Refs:
Install Problem with Anaconda: “lib64/crti.o: file not recognized: file format not recognized”

Command ‘[‘which’, ‘x86_64-conda_cos7-linux-gnu-c++’]’ returned non-zero exit status 1.

Error message:

subprocess.CalledProcessError: Command ‘[‘which’, ‘x86_64-conda_cos7-linux-gnu-c++’]’ returned non-zero exit status 1.

Cause: the conda C++ compiler that the build looks for is not installed in the environment.

Fix:

conda install -c anaconda gxx_linux-64
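
After the install, one way to confirm that the compiler is present is to look for it in the environment's bin directory. This is a sketch: find_conda_cxx is a hypothetical helper name, and the glob assumes the compiler follows the x86_64-conda*-linux-gnu-c++ naming seen in the error message.

```shell
# List conda-provided C++ compilers under a prefix (default: $CONDA_PREFIX).
# Exits non-zero when none is found.
find_conda_cxx() (
    prefix="${1:-${CONDA_PREFIX:-}}"
    ls "$prefix"/bin/x86_64-conda*-linux-gnu-c++ 2>/dev/null
)

find_conda_cxx || echo "no conda C++ compiler found; install gxx_linux-64"
```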

Refs:
x86_64-conda_cos6-linux-gnu-c++ command not found

unsupported GNU version! gcc versions later than 10 are not supported!

Error message:

/public/data2/software/cuda/11.1/include/crt/host_config.h:139:2: error: #error – unsupported GNU version! gcc versions later than 10 are not supported! The nvcc flag ‘-allow-unsupported-compiler’ can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.

error: command ‘/public/data2/software/cuda/11.1/bin/nvcc’ failed with exit code 1

Cause: the gcc version in the conda environment is too new (CUDA 11.1 rejects gcc versions later than 10).

Fix: downgrade gcc:

conda install gcc_linux-64=9.3.0
conda install gxx_linux-64=9.3.0
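
The version check that nvcc performs can be mirrored before starting the build. This is a sketch under two assumptions: gcc_ok_for_nvcc is a hypothetical helper name, and the limit of 10 is taken from the error message above for CUDA 11.1. It also assumes the conda compiler package exports CC on activation, falling back to gcc otherwise.

```shell
# Return 0 if the gcc major version is not newer than the supported maximum.
# $1: gcc version string such as "9.3.0"; $2: highest supported major version.
gcc_ok_for_nvcc() {
    [ "${1%%.*}" -le "$2" ]
}

# Use the compiler named by CC if it is set, otherwise plain gcc.
if version=$("${CC:-gcc}" -dumpversion 2>/dev/null); then
    if gcc_ok_for_nvcc "$version" 10; then
        echo "gcc $version is accepted by CUDA 11.1"
    else
        echo "gcc $version is too new for CUDA 11.1"
    fi
else
    echo "no gcc found on PATH"
fi
```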
