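All of the CUDA libraries are clearly present, yet the program reports that the dynamic libraries cannot be loaded. A closer look at the error message shows that the failure is actually caused by a symbol that cannot be found:
torch::jit::parseSchemaOrName(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)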
(py3.10) [jack@td09 /mnt/data/jack/workspace/pytorch/torch_xla]# python resnet50_infer.py
/mnt/data/jack/anaconda3/envs/py3.10/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/mnt/data/jack/anaconda3/envs/py3.10/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
2023-12-16 14:20:57.896978: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: /mnt/data/jack/anaconda3/envs/py3.10/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE; LD_LIBRARY_PATH: /usr/local/cuda-11.7/lib64
2023-12-16 14:20:57.897105: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: /mnt/data/jack/anaconda3/envs/py3.10/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE; LD_LIBRARY_PATH: /usr/local/cuda-11.7/lib64
2023-12-16 14:20:57.897166: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: /mnt/data/jack/anaconda3/envs/py3.10/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE; LD_LIBRARY_PATH: /usr/local/cuda-11.7/lib64
2023-12-16 14:20:57.897237: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcurand.so.10'; dlerror: /mnt/data/jack/anaconda3/envs/py3.10/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE; LD_LIBRARY_PATH: /usr/local/cuda-11.7/lib64
2023-12-16 14:20:57.898062: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: /mnt/data/jack/anaconda3/envs/py3.10/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE; LD_LIBRARY_PATH: /usr/local/cuda-11.7/lib64
2023-12-16 14:20:57.898158: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: /mnt/data/jack/anaconda3/envs/py3.10/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE; LD_LIBRARY_PATH: /usr/local/cuda-11.7/lib64
2023-12-16 14:20:57.898173: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1934] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2023-12-16 14:20:57.996329: F tensorflow/tsl/platform/default/env.cc:74] Check failed: ret == 0 (11 vs. 0)Thread GrpcWorkerEnvPool creation via pthread_create() failed.
Aborted (core dumped)
(py3.10) [jack@td09 /mnt/data/jack/workspace/pytorch/torch_xla]# c++filt _ZN5torch3jit17parseSchemaOrNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
torch::jit::parseSchemaOrName(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
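Reading the dlerror text by eye is easy to get wrong because every warning names a CUDA library first. The following is a minimal sketch, not part of the original session, that pulls the shared object and the mangled symbol out of one log line. The helper names `parse_undefined_symbol` and `mentions_cxx11_abi` are hypothetical and were introduced only for this illustration.

```python
import re

# Matches the "<path>.so: undefined symbol: <mangled>" part of a dlerror message.
_DLERROR_RE = re.compile(
    r"(?P<lib>/\S+?\.so[\w.]*): undefined symbol: (?P<symbol>\w+)"
)


def parse_undefined_symbol(log_line):
    """Return (library_path, mangled_symbol), or None if the line has no such error."""
    match = _DLERROR_RE.search(log_line)
    if match is None:
        return None
    return match.group("lib"), match.group("symbol")


def mentions_cxx11_abi(mangled_symbol):
    """True if the Itanium-mangled name refers to the std::__cxx11 namespace."""
    return "7__cxx11" in mangled_symbol
```

Applied to any of the `Could not load dynamic library` lines above, the extracted path is `torchvision/image.so` rather than a file under `/usr/local/cuda-11.7/lib64`, which is what points the investigation away from CUDA itself.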
(py3.10) [jack@td09 /mnt/data/jack/workspace/pytorch/torch_xla]# pip list | grep torch
torch 2.0.0a0+gitec54f40 /mnt/data/jack/workspace/pytorch
torch-xla 2.0.0 /mnt/data/jack/workspace/pytorch/torch_xla
torchvision 0.15.2a0
(py3.10) [jack@td09 /mnt/data/jack/workspace/pytorch/torch_xla]#
(py3.10) [jack@td09 /mnt/data/jack/workspace/pytorch/torch_xla]# ls /usr/local/cuda-11.7/lib64
cmake libcudnn_cnn_train.so libcufile.so.0 libnppial.so.11.7.4.75 libnppitc.so.11.7.4.75
libaccinj64.so libcudnn_cnn_train.so.8 libcufile.so.1.3.1 libnppial_static.a libnppitc_static.a
libaccinj64.so.11.7 libcudnn_cnn_train.so.8.9.7 libcufile_static.a libnppicc.so libnpps.so
libaccinj64.so.11.7.101 libcudnn_cnn_train_static.a libcufilt.a libnppicc.so.11 libnpps.so.11
libcublasLt.so libcudnn_cnn_train_static_v8.a libcuinj64.so libnppicc.so.11.7.4.75 libnpps.so.11.7.4.75
libcublasLt.so.11 libcudnn_ops_infer.so libcuinj64.so.11.7 libnppicc_static.a libnpps_static.a
libcublasLt.so.11.10.3.66 libcudnn_ops_infer.so.8 libcuinj64.so.11.7.101 libnppidei.so libnvblas.so
libcublasLt_static.a libcudnn_ops_infer.so.8.9.7 libculibos.a libnppidei.so.11 libnvblas.so.11
libcublas.so libcudnn_ops_infer_static.a libcurand.so libnppidei.so.11.7.4.75 libnvblas.so.11.10.3.66
libcublas.so.11 libcudnn_ops_infer_static_v8.a libcurand.so.10 libnppidei_static.a libnvjpeg.so
libcublas.so.11.10.3.66 libcudnn_ops_train.so libcurand.so.10.2.10.91 libnppif.so libnvjpeg.so.11
libcublas_static.a libcudnn_ops_train.so.8 libcurand_static.a libnppif.so.11 libnvjpeg.so.11.8.0.2
libcudadevrt.a libcudnn_ops_train.so.8.9.7 libcusolver_lapack_static.a libnppif.so.11.7.4.75 libnvjpeg_static.a
libcudart.so libcudnn_ops_train_static.a libcusolverMg.so libnppif_static.a libnvptxcompiler_static.a
libcudart.so.11.0 libcudnn_ops_train_static_v8.a libcusolverMg.so.11 libnppig.so libnvrtc-builtins.so
libcudart.so.11.7.99 libcudnn.so libcusolverMg.so.11.4.0.1 libnppig.so.11 libnvrtc-builtins.so.11.7
libcudart_static.a libcudnn.so.8 libcusolver.so libnppig.so.11.7.4.75 libnvrtc-builtins.so.11.7.99
libcudnn_adv_infer.so libcudnn.so.8.9.7 libcusolver.so.11 libnppig_static.a libnvrtc-builtins_static.a
libcudnn_adv_infer.so.8 libcufft.so libcusolver.so.11.4.0.1 libnppim.so libnvrtc.so
libcudnn_adv_infer.so.8.9.7 libcufft.so.10 libcusolver_static.a libnppim.so.11 libnvrtc.so.11.2
libcudnn_adv_infer_static.a libcufft.so.10.7.2.91 libcusparse.so libnppim.so.11.7.4.75 libnvrtc.so.11.7.99
libcudnn_adv_infer_static_v8.a libcufft_static.a libcusparse.so.11 libnppim_static.a libnvrtc_static.a
libcudnn_adv_train.so libcufft_static_nocallback.a libcusparse.so.11.7.4.91 libnppist.so libnvToolsExt.so
libcudnn_adv_train.so.8 libcufftw.so libcusparse_static.a libnppist.so.11 libnvToolsExt.so.1
libcudnn_adv_train.so.8.9.7 libcufftw.so.10 liblapack_static.a libnppist.so.11.7.4.75 libnvToolsExt.so.1.0.0
libcudnn_adv_train_static.a libcufftw.so.10.7.2.91 libmetis_static.a libnppist_static.a libOpenCL.so
libcudnn_adv_train_static_v8.a libcufftw_static.a libnppc.so libnppisu.so libOpenCL.so.1
libcudnn_cnn_infer.so libcufile_rdma.so libnppc.so.11 libnppisu.so.11 libOpenCL.so.1.0
libcudnn_cnn_infer.so.8 libcufile_rdma.so.1 libnppc.so.11.7.4.75 libnppisu.so.11.7.4.75 libOpenCL.so.1.0.0
libcudnn_cnn_infer.so.8.9.7 libcufile_rdma.so.1.3.1 libnppc_static.a libnppisu_static.a stubs
libcudnn_cnn_infer_static.a libcufile_rdma_static.a libnppial.so libnppitc.so
libcudnn_cnn_infer_static_v8.a libcufile.so libnppial.so.11 libnppitc.so.11
(py3.10) [jack@td09 /mnt/data/jack/workspace/pytorch/torch_xla]#
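The directory listing above can also be checked mechanically instead of by eye. This is a small sketch under the assumption that the libraries are looked up by exact file name in the directories listed in `LD_LIBRARY_PATH`; the function name `missing_libraries` is hypothetical, and the real dynamic loader consults more locations than this.

```python
import os


def missing_libraries(required, search_dirs):
    """Return the names in `required` that are not found in any of `search_dirs`."""
    missing = []
    for name in required:
        if not any(os.path.exists(os.path.join(d, name)) for d in search_dirs):
            missing.append(name)
    return missing


# Library names taken from the "Could not load dynamic library" warnings above.
REQUIRED = [
    "libcublas.so.11", "libcublasLt.so.11", "libcufft.so.10",
    "libcurand.so.10", "libcusparse.so.11", "libcudnn.so.8",
]

if __name__ == "__main__":
    dirs = [d for d in os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep) if d]
    print(missing_libraries(REQUIRED, dirs))
```

An empty result here, combined with the warnings still appearing, is a hint that the reported failure comes from something other than a missing CUDA file.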
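Note torchvision: every undefined-symbol error names torchvision/image.so. Compared with a working environment running in Docker, the torchvision version here differs slightly, so my bold guess is that torchvision is the cause. Let's just go for it!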
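The comparison against the working Docker environment can be written down as a small diff of package versions. This is only a sketch: `installed_versions` and `diff_versions` are hypothetical helper names, and the `reference` values below are placeholders, not the actual contents of the Docker image, which are not shown here.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def diff_versions(current, reference):
    """Return {name: (current, reference)} for every package whose versions differ."""
    return {
        name: (current.get(name), ref)
        for name, ref in reference.items()
        if current.get(name) != ref
    }


# Versions reported by `pip list | grep torch` in the broken environment.
broken = {"torch": "2.0.0a0+gitec54f40", "torch-xla": "2.0.0", "torchvision": "0.15.2a0"}
# Placeholder values (assumption): replace with `pip list` output from the working environment.
reference = {"torch": "2.0.0a0+gitec54f40", "torch-xla": "2.0.0", "torchvision": "0.15.1"}
print(diff_versions(broken, reference))
```

In a live environment, `installed_versions(["torch", "torch-xla", "torchvision"])` can replace the hard-coded `broken` dictionary. A matching version number does not guarantee a matching binary interface, so this check narrows the search rather than proving the cause.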
(py3.10) [jack@td09 /mnt/data/jack/workspace/pytorch/torch_xla]# pip install torchvision==0.15.0
Collecting torchvision==0.15.0
Using cached torchvision-0.15.0-cp310-cp310-manylinux1_x86_64.whl (6.0 MB)
Requirement already satisfied: numpy in /mnt/data/jack/anaconda3/envs/py3.10/lib/python3.10/site-packages (from torchvision==0.15.0) (1.26.2)
Requirement already satisfied: requests in /mnt/data/jack/anaconda3/envs/py3.10/lib/python3.10/site-packages (from torchvision==0.15.0) (2.31.0)
INFO: pip is looking at multiple versions of torchvision to determine which version is compatible with other requirements. This could take a while.
ERROR: Could not find a version that satisfies the requirement torch==2.0.0+cu117 (from torchvision) (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1, 2.0.0, 2.0.1, 2.1.0, 2.1.1, 2.1.2)
ERROR: No matching distribution found for torch==2.0.0+cu117
(py3.10) [jack@td09 /mnt/data/jack/workspace/pytorch/torch_xla]# pip install --no-deps torchvision==0.15.1
Collecting torchvision==0.15.1
Using cached torchvision-0.15.1-cp310-cp310-manylinux1_x86_64.whl (6.0 MB)
Installing collected packages: torchvision
Successfully installed torchvision-0.15.1
(py3.10) [jack@td09 /mnt/data/jack/workspace/pytorch/torch_xla]#
(py3.10) [jack@td09 /mnt/data/jack/workspace/pytorch/torch_xla]# pip list | grep torch
torch 2.0.0a0+gitec54f40 /mnt/data/jack/workspace/pytorch
torch-xla 2.0.0 /mnt/data/jack/workspace/pytorch/torch_xla
torchvision 0.15.1
(py3.10) [jack@td09 /mnt/data/jack/workspace/pytorch/torch_xla]#
Refer here for torch==2.0.0+cu117.