Fixing RuntimeError: reduce failed to synchronize: device-side assert triggered

First, here is the error output:

/pytorch/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo, TensorInfo, TensorInfo, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [35,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo, TensorInfo, TensorInfo, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [35,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo, TensorInfo, TensorInfo, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [35,0,0], thread: [98,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo, TensorInfo, TensorInfo, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [35,0,0], thread: [99,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo, TensorInfo, TensorInfo, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [35,0,0], thread: [100,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo, TensorInfo, TensorInfo, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [35,0,0], thread: [101,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo, TensorInfo, TensorInfo, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [35,0,0], thread: [102,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo, TensorInfo, TensorInfo, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [35,0,0], thread: [103,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo, TensorInfo, TensorInfo, int, int, IndexType, 
......
......
......
/pytorch/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo, TensorInfo, TensorInfo, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [35,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "../paragrah_selector/para_sigmoid_train.py", line 533, in 
    main()
  File "../paragrah_selector/para_sigmoid_train.py", line 463, in main
    eval_loss = eval_model(model, eval_data, device)
  File "../paragrah_selector/para_sigmoid_train.py", line 419, in eval_model
    loss, logits = model(input_ids, segment_ids, input_mask, labels=label_ids)
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lisen/caiyun_projects/generative_mrc/paragrah_selector/modeling.py", line 1001, in forward
    loss = loss_fn(logits, labels)
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 504, in forward
    return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/functional.py", line 2027, in binary_cross_entropy
    input, target, weight, reduction_enum)
RuntimeError: reduce failed to synchronize: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered (insert_events at /pytorch/aten/src/THC/THCCachingAllocator.cpp:470)
frame #0: std::function::operator()() const + 0x11 (0x7f0e52afc021 in /home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f0e52afb8ea in /home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2:  + 0x13dbd92 (0x7f0e5e065d92 in /home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #3: at::TensorImpl::release_resources() + 0x50 (0x7f0e534c6440 in /home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #4:  + 0x2af03b (0x7f0e51bb703b in /home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #5: torch::autograd::Variable::Impl::release_resources() + 0x17 (0x7f0e51e29d27 in /home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #6:  + 0x124cfb (0x7f0e8ce4ccfb in /home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7:  + 0x3204af (0x7f0e8d0484af in /home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #8:  + 0x3204f1 (0x7f0e8d0484f1 in /home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

frame #25: __libc_start_main + 0xf0 (0x7f0ecf782830 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
(py36) lisen@octa:~/caiyun_projects/generative_mrc/script$ sh para_sigmoid_train.sh

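This error has a few possible causes:
1. A label falls outside the index range of the logits. For example, if the last dimension of logits has size 10 (valid indices 0 to 9) and labels contains a value greater than 9, such as 10 or 11, the assertion fails. Check your labels carefully.
2. An embedding lookup is out of range: for example, a position id exceeds the model's preset maximum length, or a token id exceeds the vocabulary size.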
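Both causes come down to an index that lies outside the valid range of the dimension it indexes. A minimal sketch of a check to run before the forward pass is shown here in plain Python. The function name `find_out_of_range` and the example sizes are hypothetical and do not come from the original code.

```python
def find_out_of_range(indices, size):
    """Return the values in `indices` that are not valid indices
    into a dimension of length `size` (valid range: 0..size-1)."""
    return [i for i in indices if not 0 <= i < size]


# Hypothetical sizes, for illustration only.
num_classes = 10              # last dimension of logits, valid labels are 0..9
labels = [3, 9, 10, 0, 11]    # 10 and 11 are out of range

bad_labels = find_out_of_range(labels, num_classes)
print(bad_labels)  # [10, 11]
```

For a PyTorch tensor, the same check can be run on `tensor.flatten().tolist()`, for example on `input_ids` against the vocabulary size or on `position_ids` against the number of position embeddings.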

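Now for the main point of this post. Knowing these two causes may still not be enough to locate the problem, so here is a simple debugging method that makes the cause obvious: run the model on the CPU. If it does not fit in memory, reduce the batch size. On the GPU, kernels run asynchronously, so the Python traceback can point at a later operation (here, binary_cross_entropy) rather than the one that actually failed. After switching to the CPU, I got the following error: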

File "../paragrah_selector/para_sigmoid_train.py", line 533, in 
    main()
  File "../paragrah_selector/para_sigmoid_train.py", line 463, in main
    eval_loss = eval_model(model, eval_data, device)
  File "../paragrah_selector/para_sigmoid_train.py", line 419, in eval_model
    loss, logits = model(input_ids, segment_ids, input_mask, labels=label_ids)
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lisen/caiyun_projects/generative_mrc/paragrah_selector/modeling.py", line 987, in forward
    _, pooled_output = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lisen/caiyun_projects/generative_mrc/paragrah_selector/modeling.py", line 705, in forward
    embedding_output = self.embeddings(input_ids, token_type_ids)
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lisen/caiyun_projects/generative_mrc/paragrah_selector/modeling.py", line 281, in forward
    position_embeddings = self.position_embeddings(position_ids)
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 118, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/functional.py", line 1454, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:191

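Reading the traceback makes the failure point clear:
  File "/home/lisen/caiyun_projects/generative_mrc/paragrah_selector/modeling.py", line 281, in forward
    position_embeddings = self.position_embeddings(position_ids)
The position ids exceeded the model's preset maximum length. Going back to check the data, I found that longer texts had indeed not been truncated to that length, which caused the error.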
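The fix described above, truncating long inputs to the model's preset length, might look like the following. This is only a sketch: `truncate_to_max_len`, `max_len`, and the example values are assumptions, not the author's actual code. A real BERT-style pipeline would usually also need to preserve the trailing separator token and truncate `segment_ids` and `input_mask` consistently.

```python
def truncate_to_max_len(token_ids, max_len):
    """Keep at most `max_len` token ids so that position ids
    stay within 0..max_len-1."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return token_ids[:max_len]


# Hypothetical values: a model with 8 position embeddings, a 12-token input.
max_len = 8
token_ids = list(range(100, 112))

truncated = truncate_to_max_len(token_ids, max_len)
position_ids = list(range(len(truncated)))

print(len(truncated))     # 8
print(max(position_ids))  # 7, the last valid position index
```

Truncating at preprocessing time, before tensors are built, keeps every position id inside the embedding table, so the lookup cannot go out of range for this reason.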
