torch_xla reports missing CUDA libraries at runtime (caused by a mismatched torchvision version)

Environment

  • pytorch 2.0.0(+cuda)
  • cuda 11.7
  • torch-xla 2.0.0
  • tensorflow 2.11.1
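
Before digging into loader errors, it can help to confirm which builds are actually installed in the active interpreter, since the numbers above are the intended versions and a source build may report something different. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for illustration, and the package names are simply the ones listed above.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version or None} for each distribution name."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed in this interpreter's environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    report = installed_versions(["torch", "torchvision", "torch-xla", "tensorflow"])
    for name, version in report.items():
        print(f"{name:<12} {version or 'not installed'}")
```

Running this inside the same conda environment as the failing script gives roughly the same information as `pip list | grep torch`, but from Python itself.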

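Error message

Every CUDA library is clearly present, yet the loader reports that it cannot load the dynamic libraries. A closer look at the error output shows that the failure is actually triggered by this one symbol not being found: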

torch::jit::parseSchemaOrName(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)

(py3.10) [jack@td09 /mnt/data/jack/workspace/pytorch/torch_xla]# python resnet50_infer.py
/mnt/data/jack/anaconda3/envs/py3.10/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/mnt/data/jack/anaconda3/envs/py3.10/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
2023-12-16 14:20:57.896978: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: /mnt/data/jack/anaconda3/envs/py3.10/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE; LD_LIBRARY_PATH: /usr/local/cuda-11.7/lib64
2023-12-16 14:20:57.897105: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: /mnt/data/jack/anaconda3/envs/py3.10/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE; LD_LIBRARY_PATH: /usr/local/cuda-11.7/lib64
2023-12-16 14:20:57.897166: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: /mnt/data/jack/anaconda3/envs/py3.10/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE; LD_LIBRARY_PATH: /usr/local/cuda-11.7/lib64
2023-12-16 14:20:57.897237: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcurand.so.10'; dlerror: /mnt/data/jack/anaconda3/envs/py3.10/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE; LD_LIBRARY_PATH: /usr/local/cuda-11.7/lib64
2023-12-16 14:20:57.898062: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: /mnt/data/jack/anaconda3/envs/py3.10/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE; LD_LIBRARY_PATH: /usr/local/cuda-11.7/lib64
2023-12-16 14:20:57.898158: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: /mnt/data/jack/anaconda3/envs/py3.10/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE; LD_LIBRARY_PATH: /usr/local/cuda-11.7/lib64
2023-12-16 14:20:57.898173: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1934] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2023-12-16 14:20:57.996329: F tensorflow/tsl/platform/default/env.cc:74] Check failed: ret == 0 (11 vs. 0)Thread GrpcWorkerEnvPool creation via pthread_create() failed.
Aborted (core dumped)
(py3.10) [jack@td09 /mnt/data/jack/workspace/pytorch/torch_xla]# c++filt _ZN5torch3jit17parseSchemaOrNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
torch::jit::parseSchemaOrName(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
(py3.10) [jack@td09 /mnt/data/jack/workspace/pytorch/torch_xla]# pip list | grep torch
torch              2.0.0a0+gitec54f40 /mnt/data/jack/workspace/pytorch
torch-xla          2.0.0              /mnt/data/jack/workspace/pytorch/torch_xla
torchvision        0.15.2a0
(py3.10) [jack@td09 /mnt/data/jack/workspace/pytorch/torch_xla]#
(py3.10) [jack@td09 /mnt/data/jack/workspace/pytorch/torch_xla]#
(py3.10) [jack@td09 /mnt/data/jack/workspace/pytorch/torch_xla]# ls /usr/local/cuda-11.7/lib64
cmake                           libcudnn_cnn_train.so           libcufile.so.0               libnppial.so.11.7.4.75   libnppitc.so.11.7.4.75
libaccinj64.so                  libcudnn_cnn_train.so.8         libcufile.so.1.3.1           libnppial_static.a       libnppitc_static.a
libaccinj64.so.11.7             libcudnn_cnn_train.so.8.9.7     libcufile_static.a           libnppicc.so             libnpps.so
libaccinj64.so.11.7.101         libcudnn_cnn_train_static.a     libcufilt.a                  libnppicc.so.11          libnpps.so.11
libcublasLt.so                  libcudnn_cnn_train_static_v8.a  libcuinj64.so                libnppicc.so.11.7.4.75   libnpps.so.11.7.4.75
libcublasLt.so.11               libcudnn_ops_infer.so           libcuinj64.so.11.7           libnppicc_static.a       libnpps_static.a
libcublasLt.so.11.10.3.66       libcudnn_ops_infer.so.8         libcuinj64.so.11.7.101       libnppidei.so            libnvblas.so
libcublasLt_static.a            libcudnn_ops_infer.so.8.9.7     libculibos.a                 libnppidei.so.11         libnvblas.so.11
libcublas.so                    libcudnn_ops_infer_static.a     libcurand.so                 libnppidei.so.11.7.4.75  libnvblas.so.11.10.3.66
libcublas.so.11                 libcudnn_ops_infer_static_v8.a  libcurand.so.10              libnppidei_static.a      libnvjpeg.so
libcublas.so.11.10.3.66         libcudnn_ops_train.so           libcurand.so.10.2.10.91      libnppif.so              libnvjpeg.so.11
libcublas_static.a              libcudnn_ops_train.so.8         libcurand_static.a           libnppif.so.11           libnvjpeg.so.11.8.0.2
libcudadevrt.a                  libcudnn_ops_train.so.8.9.7     libcusolver_lapack_static.a  libnppif.so.11.7.4.75    libnvjpeg_static.a
libcudart.so                    libcudnn_ops_train_static.a     libcusolverMg.so             libnppif_static.a        libnvptxcompiler_static.a
libcudart.so.11.0               libcudnn_ops_train_static_v8.a  libcusolverMg.so.11          libnppig.so              libnvrtc-builtins.so
libcudart.so.11.7.99            libcudnn.so                     libcusolverMg.so.11.4.0.1    libnppig.so.11           libnvrtc-builtins.so.11.7
libcudart_static.a              libcudnn.so.8                   libcusolver.so               libnppig.so.11.7.4.75    libnvrtc-builtins.so.11.7.99
libcudnn_adv_infer.so           libcudnn.so.8.9.7               libcusolver.so.11            libnppig_static.a        libnvrtc-builtins_static.a
libcudnn_adv_infer.so.8         libcufft.so                     libcusolver.so.11.4.0.1      libnppim.so              libnvrtc.so
libcudnn_adv_infer.so.8.9.7     libcufft.so.10                  libcusolver_static.a         libnppim.so.11           libnvrtc.so.11.2
libcudnn_adv_infer_static.a     libcufft.so.10.7.2.91           libcusparse.so               libnppim.so.11.7.4.75    libnvrtc.so.11.7.99
libcudnn_adv_infer_static_v8.a  libcufft_static.a               libcusparse.so.11            libnppim_static.a        libnvrtc_static.a
libcudnn_adv_train.so           libcufft_static_nocallback.a    libcusparse.so.11.7.4.91     libnppist.so             libnvToolsExt.so
libcudnn_adv_train.so.8         libcufftw.so                    libcusparse_static.a         libnppist.so.11          libnvToolsExt.so.1
libcudnn_adv_train.so.8.9.7     libcufftw.so.10                 liblapack_static.a           libnppist.so.11.7.4.75   libnvToolsExt.so.1.0.0
libcudnn_adv_train_static.a     libcufftw.so.10.7.2.91          libmetis_static.a            libnppist_static.a       libOpenCL.so
libcudnn_adv_train_static_v8.a  libcufftw_static.a              libnppc.so                   libnppisu.so             libOpenCL.so.1
libcudnn_cnn_infer.so           libcufile_rdma.so               libnppc.so.11                libnppisu.so.11          libOpenCL.so.1.0
libcudnn_cnn_infer.so.8         libcufile_rdma.so.1             libnppc.so.11.7.4.75         libnppisu.so.11.7.4.75   libOpenCL.so.1.0.0
libcudnn_cnn_infer.so.8.9.7     libcufile_rdma.so.1.3.1         libnppc_static.a             libnppisu_static.a       stubs
libcudnn_cnn_infer_static.a     libcufile_rdma_static.a         libnppial.so                 libnppitc.so
libcudnn_cnn_infer_static_v8.a  libcufile.so                    libnppial.so.11              libnppitc.so.11
(py3.10) [jack@td09 /mnt/data/jack/workspace/pytorch/torch_xla]#
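
The same diagnosis (pull the mangled name out of the `undefined symbol:` messages, then demangle it) can be scripted when the log is long. This is only a sketch: the function names are hypothetical, the regular expression assumes the message format seen in the log above, and demangling is delegated to the `c++filt` tool, which is assumed to be on `PATH` (the input is returned unchanged otherwise).

```python
import re
import shutil
import subprocess

# Matches the mangled name that follows "undefined symbol:" in a loader error.
UNDEFINED_SYMBOL = re.compile(r"undefined symbol: (\w+)")


def undefined_symbols(log_text):
    """Return the distinct mangled symbols named in the log, in first-seen order."""
    seen = []
    for symbol in UNDEFINED_SYMBOL.findall(log_text):
        if symbol not in seen:
            seen.append(symbol)
    return seen


def demangle(symbol):
    """Demangle via c++filt when it is available; otherwise return the input."""
    if shutil.which("c++filt") is None:
        return symbol
    result = subprocess.run(
        ["c++filt", symbol], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

Fed the log above, `undefined_symbols` should yield a single entry, which makes it obvious that all of the "Could not load dynamic library" warnings quote the same torchvision error rather than six separate CUDA problems.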

Switching the torchvision version

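torchvision stood out: compared with a working environment inside Docker, its version was slightly different. I took a bold guess that torchvision was the culprit and went straight at it!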

(py3.10) [jack@td09 /mnt/data/jack/workspace/pytorch/torch_xla]# pip install torchvision==0.15.0
Collecting torchvision==0.15.0
  Using cached torchvision-0.15.0-cp310-cp310-manylinux1_x86_64.whl (6.0 MB)
Requirement already satisfied: numpy in /mnt/data/jack/anaconda3/envs/py3.10/lib/python3.10/site-packages (from torchvision==0.15.0) (1.26.2)
Requirement already satisfied: requests in /mnt/data/jack/anaconda3/envs/py3.10/lib/python3.10/site-packages (from torchvision==0.15.0) (2.31.0)
INFO: pip is looking at multiple versions of torchvision to determine which version is compatible with other requirements. This could take a while.
ERROR: Could not find a version that satisfies the requirement torch==2.0.0+cu117 (from torchvision) (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1, 2.0.0, 2.0.1, 2.1.0, 2.1.1, 2.1.2)
ERROR: No matching distribution found for torch==2.0.0+cu117
(py3.10) [jack@td09 /mnt/data/jack/workspace/pytorch/torch_xla]# pip install --no-deps torchvision==0.15.1
Collecting torchvision==0.15.1
  Using cached torchvision-0.15.1-cp310-cp310-manylinux1_x86_64.whl (6.0 MB)
Installing collected packages: torchvision
Successfully installed torchvision-0.15.1
(py3.10) [jack@td09 /mnt/data/jack/workspace/pytorch/torch_xla]#
(py3.10) [jack@td09 /mnt/data/jack/workspace/pytorch/torch_xla]# pip list | grep torch
torch              2.0.0a0+gitec54f40 /mnt/data/jack/workspace/pytorch
torch-xla          2.0.0              /mnt/data/jack/workspace/pytorch/torch_xla
torchvision        0.15.1
(py3.10) [jack@td09 /mnt/data/jack/workspace/pytorch/torch_xla]#
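
The first `pip install` above appears to fail because that torchvision wheel pins `torch==2.0.0+cu117`, while the installed torch is a source build reporting `2.0.0a0+gitec54f40`; pip then looks for a matching release on the index and finds none. Passing `--no-deps` makes pip skip dependency installation, which is why the second command goes through. The sketch below illustrates the mismatch with a deliberately simplified pin check: it compares version strings literally and does not implement full PEP 440 normalisation, and both function names are hypothetical.

```python
def split_local(version):
    """Split 'public+local' into (public, local); local is None when absent."""
    public, _, local = version.partition("+")
    return public, (local or None)


def satisfies_exact_pin(installed, pinned):
    """Simplified check of an '==' pin.

    Compares the public parts as plain strings (no PEP 440 normalisation) and,
    only when the pin carries a local label such as '+cu117', requires the
    installed local label to match as well.
    """
    installed_public, installed_local = split_local(installed)
    pinned_public, pinned_local = split_local(pinned)
    if installed_public != pinned_public:
        return False
    return pinned_local is None or installed_local == pinned_local


if __name__ == "__main__":
    # Versions taken from the pip output above.
    print(satisfies_exact_pin("2.0.0a0+gitec54f40", "2.0.0+cu117"))  # prints False
```

Skipping the dependency check only silences pip; whether the installed wheel is binary-compatible with a locally built torch still has to be confirmed by re-running the script.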

References

Refer here for torch==2.0.0+cu117
