Fixing the "Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions" error in a conda environment

1 Problem Description

While training a Bert-VITS2 model, the process crashed with the following error:

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f938874c4d7 in /root/anaconda3/envs/vits2/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f938871636b in /root/anaconda3/envs/vits2/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f93887f0fa8 in /root/anaconda3/envs/vits2/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #3:  + 0xdf9cde (0x7f9389615cde in /root/anaconda3/envs/vits2/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4:  + 0x4cce56 (0x7f93c7c8fe56 in /root/anaconda3/envs/vits2/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #5:  + 0x3ee77 (0x7f9388731e77 in /root/anaconda3/envs/vits2/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1be (0x7f938872a69e in /root/anaconda3/envs/vits2/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f938872a7b9 in /root/anaconda3/envs/vits2/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #8:  + 0x751fb8 (0x7f93c7f14fb8 in /root/anaconda3/envs/vits2/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x305 (0x7f93c7f15345 in /root/anaconda3/envs/vits2/lib/python3.9/site-packages/torch/lib/libtorch_python.so)

frame #32: __libc_start_main + 0xf5 (0x7f940561e555 in /lib64/libc.so.6)
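The log itself suggests passing CUDA_LAUNCH_BLOCKING=1 so that the stack trace points at the kernel that actually failed. A minimal sketch of doing this from Python follows, assuming the lines sit at the very top of the training entry script, before torch is imported and before any CUDA call is made. The helper name `enable_sync_cuda_errors` is hypothetical and not part of Bert-VITS2.

```python
import os


def enable_sync_cuda_errors() -> None:
    """Ask CUDA to launch kernels synchronously so errors surface at the failing call."""
    # Assumption: this runs before CUDA is initialized (safest: before `import torch`).
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


enable_sync_cuda_errors()
# import torch  # import torch only after the variable has been set
```

Setting the variable in the shell that launches training has the same effect. Synchronous launches slow training down, so this is only worth keeping on while debugging.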

2 Problem Analysis

First, check the installed CUDA and torch versions (excerpt from the installed-package list):

nvidia-cublas-cu11        11.10.3.66
nvidia-cuda-cupti-cu11    11.7.101
nvidia-cuda-nvrtc-cu11    11.7.99
nvidia-cuda-runtime-cu11  11.7.99
nvidia-cudnn-cu11         8.5.0.96
nvidia-cufft-cu11         10.9.0.58
nvidia-curand-cu11        10.2.10.91
nvidia-cusolver-cu11      11.4.0.1
nvidia-cusparse-cu11      11.7.4.91
nvidia-nccl-cu11          2.14.3
nvidia-nvtx-cu11          11.7.91
oauthlib                  3.2.2
openai-whisper            20231106
orjson                    3.9.10
packaging                 23.2
pandas                    2.1.2
phonemizer                3.2.1
Pillow                    10.1.0
pip                       23.3.1
platformdirs              3.11.0
pooch                     1.8.0
proces                    0.1.7
protobuf                  4.23.4
psutil                    5.9.6
pyasn1                    0.5.0
pyasn1-modules            0.3.0
pycparser                 2.21
pydantic                  2.4.2
pydantic_core             2.10.1
pydub                     0.25.1
Pygments                  2.16.1
pykakasi                  2.2.1
pylatexenc                2.10
pyopenjtalk               0.3.3
pyparsing                 3.1.1
pypinyin                  0.49.0
python-dateutil           2.8.2
python-multipart          0.0.6
pytz                      2023.3.post1
PyYAML                    6.0.1
rdflib                    7.0.0
referencing               0.30.2
regex                     2023.10.3
requests                  2.31.0
requests-oauthlib         1.3.1
resampy                   0.4.2
rfc3986                   1.5.0
rich                      13.6.0
rpds-py                   0.12.0
rsa                       4.9
safetensors               0.4.0
scikit-learn              1.3.2
scipy                     1.11.3
segments                  2.2.1
semantic-version          2.10.0
sentencepiece             0.1.99
setuptools                68.2.2
shellingham               1.5.4
six                       1.16.0
sniffio                   1.3.0
soundfile                 0.12.1
starlette                 0.27.0
sympy                     1.12
tabulate                  0.9.0
tensorboard               2.15.1
tensorboard-data-server   0.7.2
threadpoolctl             3.2.0
tiktoken                  0.5.1
tokenizers                0.14.1
tomlkit                   0.12.0
toolz                     0.12.0
torch                     2.0.1
torchaudio                2.0.2
torchvision               0.15.2
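The same version check can be scripted instead of read off a long package list. The sketch below uses only the standard library's `importlib.metadata`; the helper name `installed_versions` and the choice of packages to query are assumptions made for illustration.

```python
from importlib import metadata


def installed_versions(names):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in the active environment
    return versions


if __name__ == "__main__":
    wanted = ["torch", "torchaudio", "torchvision", "nvidia-cuda-runtime-cu11"]
    for name, ver in installed_versions(wanted).items():
        print(f"{name:28s}{ver}")
```

When torch is importable, `torch.__version__` and `torch.version.cuda` report the library version and the CUDA version it was built against.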

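Version compatibility reference: https://pytorch.org/get-started/locally/

Switching between several torch and CUDA versions did not help; the error persisted.

Further experiments showed that training on a reduced corpus, only 100 sentences from a single speaker, ran without the error. This points to the training data rather than the environment.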
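Narrowing the corpus by hand can be made systematic with bisection. The following is only a sketch under two assumptions: that a subset of the corpus fails exactly when it contains at least one bad sample, and that a callback `fails` (hypothetical; in practice it would run a short training job on the subset and report whether it crashed) is available.

```python
from typing import Callable, Sequence


def find_one_bad_sample(samples: Sequence[str],
                        fails: Callable[[Sequence[str]], bool]) -> str:
    """Bisect `samples` down to a single item for which `fails` returns True.

    Assumes a subset fails exactly when it contains at least one bad sample.
    """
    if not fails(samples):
        raise ValueError("the full set does not fail; nothing to bisect")
    candidates = list(samples)
    while len(candidates) > 1:
        mid = len(candidates) // 2
        left = candidates[:mid]
        # Keep whichever half still reproduces the failure.
        candidates = left if fails(left) else candidates[mid:]
    return candidates[0]


# Toy stand-in for a training run: "fails" if any sentence has ASCII letters.
def has_latin(subset):
    return any(c.isascii() and c.isalpha() for s in subset for c in s)


corpus = ["你好", "今天天气不错", "hello世界", "再见"]
bad = find_one_bad_sample(corpus, has_latin)
```

Each round halves the candidate set, so the number of trial runs grows with the logarithm of the corpus size. If several samples are bad, the function returns one of them; remove it and repeat.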

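3 Solution

In this case, the Bert-VITS2 training corpus must not contain English letters; sentences that do can trigger the error above. Filtering out every sentence that contains English letters with a regular expression resolved the problem.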

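import re

# Skip sentences that contain English letters.
# This fragment runs inside the loop over corpus lines;
# `content` holds the sentence text of the current `line`.
pattern = '[a-zA-Z]'
result = re.search(pattern, content)
if result:
    print(line)  # log the sentence that is being dropped
    continue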
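For reference, here is a self-contained version of the same filter. It is a sketch that assumes each corpus line is a `|`-separated record whose last field is the transcript, so that file paths and language tags are not mistaken for English text. The function name `split_corpus` and the sample lines are hypothetical.

```python
import re

LATIN = re.compile(r"[a-zA-Z]")


def split_corpus(lines, sep="|"):
    """Split lines into (kept, dropped) by whether the text field has Latin letters.

    Assumption: the transcript is the last `sep`-separated field of each line.
    """
    kept, dropped = [], []
    for line in lines:
        text = line.rstrip("\n").rsplit(sep, 1)[-1]
        (dropped if LATIN.search(text) else kept).append(line)
    return kept, dropped


kept, dropped = split_corpus([
    "a.wav|spk|ZH|你好世界\n",
    "b.wav|spk|ZH|打开APP看看\n",
])
```

Returning the dropped lines as well makes it easy to review what was removed before retraining. Note that `[a-zA-Z]` matches only ASCII letters; full-width letters such as `Ａ` would need to be added to the pattern if the corpus contains them.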
