CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\ScatterGatherKernel.cu:145: block: [82,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\ScatterGatherKernel.cu:145: block: [82,0,0], thread: [1,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\ScatterGatherKernel.cu:145: block: [82,0,0], thread:
...
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\ScatterGatherKernel.cu:145: block: [105,0,0], thread: [11,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\ScatterGatherKernel.cu:145: block: [105,0,0], thread: [12,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\ScatterGatherKernel.cu:145: block: [105,0,0], thread: [13,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
File "F:/research/deeplabv3/train.py", line 71, in <module>
train_segmentor(model, datasets, cfg, distributed=False, validate=True, meta=dict())
File "f:\research\openmmlab\mmsegmentation\mmseg\apis\train.py", line 194, in train_segmentor
runner.run(data_loaders, cfg.workflow)
File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\mmcv\runner\iter_based_runner.py", line 144, in run
iter_runner(iter_loaders[i], **kwargs)
File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\mmcv\runner\iter_based_runner.py", line 64, in train
outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\mmcv\parallel\data_parallel.py", line 77, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "f:\research\openmmlab\mmsegmentation\mmseg\models\segmentors\base.py", line 138, in train_step
losses = self(**data_batch)
File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\mmcv\runner\fp16_utils.py", line 119, in new_func
return old_func(*args, **kwargs)
File "f:\research\openmmlab\mmsegmentation\mmseg\models\segmentors\base.py", line 108, in forward
return self.forward_train(img, img_metas, **kwargs)
File "f:\research\openmmlab\mmsegmentation\mmseg\models\segmentors\encoder_decoder.py", line 144, in forward_train
loss_decode = self._decode_head_forward_train(x, img_metas,
File "f:\research\openmmlab\mmsegmentation\mmseg\models\segmentors\encoder_decoder.py", line 87, in _decode_head_forward_train
loss_decode = self.decode_head.forward_train(x, img_metas,
File "f:\research\openmmlab\mmsegmentation\mmseg\models\decode_heads\decode_head.py", line 233, in forward_train
losses = self.losses(seg_logits, gt_semantic_seg)
File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\mmcv\runner\fp16_utils.py", line 208, in new_func
return old_func(*args, **kwargs)
File "f:\research\openmmlab\mmsegmentation\mmseg\models\decode_heads\decode_head.py", line 270, in losses
seg_weight = self.sampler.sample(seg_logit, seg_label)
File "f:\research\openmmlab\mmsegmentation\mmseg\core\seg\sampler\ohem_pixel_sampler.py", line 56, in sample
sort_prob, sort_indices = seg_prob[valid_mask].sort()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Process finished with exit code -1073740791 (0xC0000409)
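As the error message suggests, CUDA kernel launches are asynchronous, so the Python stack trace may point at the wrong call. Setting CUDA_LAUNCH_BLOCKING=1 forces synchronous launches and gives an accurate trace. A minimal way to set it (the variable must be set before CUDA is initialized; putting it at the top of train.py is just one option):

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # must be set before torch touches CUDA

import torch  # import torch only after the variable is set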
Cause analysis:
1. The number of classes in the training labels does not match the specified num_classes.
Solution:
Check the annotations and see whether any label values fall outside the expected range. If they do, remove those samples or regenerate them.
Note: binary (two-class) labels should be created with PIL (see the code further below); labels written with OpenCV come out as 3-band images, which make the model error out when they are loaded.
Here is a simple script I use to check which class values appear in the labels:
import numpy as np
import cv2
import os.path as osp

dataroot = r'data/Satellite_buildings_data'
imgs = 'src'       # image sub-directory
labels = 'label'   # label sub-directory
train_data = osp.join(dataroot, labels)

with open(osp.join(dataroot, 'split/val.txt'), 'r') as f:
    lines = f.readlines()

# Count how many pixels of each label value appear across all images
label_count = dict()
for file in lines:
    label_img = cv2.imread(osp.join(train_data, file.strip() + '.png'))
    class_values = np.unique(label_img)
    for num in class_values:
        temp = np.sum(label_img == num)
        if str(num) in label_count:
            label_count[str(num)] = label_count[str(num)] + temp
        else:
            label_count[str(num)] = temp
print(label_count)
# Binarize labels for a two-class dataset using PIL (output values are 0 and 1)
import os
import numpy as np
from PIL import Image

dataroot = r'F:\research\floodext\floodDataset\labels'
out_root = r'F:\research\floodext\floodDataset\label'
for file in os.listdir(dataroot):
    img = Image.open(os.path.join(dataroot, file))
    img = img.point(lambda x: x > 0)   # map every non-zero pixel to 1
    print(np.unique(img))              # sanity check: should print [0 1]
    img.save(os.path.join(out_root, file))
Update 2023.9.28:
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: CUDA error: an illegal memory access was encountered
I recently hit a new problem while doing multi-class semantic segmentation (change detection that includes class labels); at its core it was again a mismatch between the number of label classes and the configured number. The fix is the same as above.
This one also cost me a day, so I am recording the debugging process. From the error message alone I assumed either GPU memory was running out or the loss function was broken. The loss function was an unlikely culprit, and if memory were the issue, lowering the batch size should have helped. With a batch size of 2 training did run, but both the loss and the accuracy were 0, and GPU memory usage sat at only about 2 GB, far below a fully loaded run. I finally went back and inspected the labels, and found that some of them were 3-channel color annotations. After converting them to class indices 0~n, training ran successfully.
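If your annotations are stored as color maps, a conversion along the following lines produces the single-channel 0~n index labels the model expects. This is only a sketch: the palette, paths, and class count are placeholders, not values from my dataset.

# Minimal sketch: convert 3-channel color labels into single-channel index labels.
import os
import numpy as np
from PIL import Image

PALETTE = {               # (R, G, B) -> class index; replace with your own colors
    (0, 0, 0): 0,         # background
    (255, 0, 0): 1,       # class 1
    (0, 255, 0): 2,       # class 2
}

in_dir = r'path/to/color_labels'    # placeholder paths
out_dir = r'path/to/index_labels'

for name in os.listdir(in_dir):
    rgb = np.array(Image.open(os.path.join(in_dir, name)).convert('RGB'))
    index = np.zeros(rgb.shape[:2], dtype=np.uint8)
    for color, cls in PALETTE.items():
        index[np.all(rgb == np.array(color), axis=-1)] = cls
    Image.fromarray(index).save(os.path.join(out_dir, name))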
2. NaN values in the prediction results
For example, after I enabled automatic mixed precision (AMP) training, some operators in the network apparently did not support it, produced NaN values, and the loss computation eventually failed with this error.
Solution:
Check the model outputs, the intermediate computations, and the network structure (a quick NaN check is sketched below).
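One simple way to locate where NaNs first appear is to assert on the tensors right before the loss is computed. The helper below is only illustrative; assert_finite and the seg_logits call site are my own names, not part of mmseg's API.

import torch

def assert_finite(name, tensor):
    # Fail fast on the Python side if the tensor contains NaN or Inf
    if torch.isnan(tensor).any() or torch.isinf(tensor).any():
        raise ValueError(f'{name} contains NaN or Inf values')

# Example call just before the loss:
# assert_finite('seg_logits', seg_logits)

torch.autograd.set_detect_anomaly(True) can also help pinpoint the operator that first produces NaN during the backward pass, at the cost of slower training.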
So there are two possible causes for this error. A different error I ran into:
File "f:\research\openmmlab\mmsegmentation\mmseg\models\backbones\resnet.py", line 662, in forward
x = self.stem(x)
File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\nn\modules\container.py", line 139, in forward
input = module(input)
File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\nn\modules\batchnorm.py", line 731, in forward
world_size = torch.distributed.get_world_size(process_group)
File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\distributed\distributed_c10d.py", line 867, in get_world_size
return _get_group_size(group)
File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\distributed\distributed_c10d.py", line 325, in _get_group_size
default_pg = _get_default_group()
File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\distributed\distributed_c10d.py", line 429, in _get_default_group
raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
Process finished with exit code -1073741510 (0xC000013A: interrupted by Ctrl+C)
Cause analysis:
Debugging shows that in torch\nn\modules\batchnorm.py, lines 725-732 (quoted below), need_sync is True but process_group is None. need_sync indicates whether synchronized BatchNorm (SyncBN) should be used. My config used SyncBN, but I was training on a single GPU, hence the error.
725     # Don't sync batchnorm stats in inference mode (model.eval()).
726     need_sync = (bn_training and self.training)
727     if need_sync:
728         process_group = torch.distributed.group.WORLD
729         if self.process_group:
730             process_group = self.process_group
731         world_size = torch.distributed.get_world_size(process_group)
732         need_sync = world_size > 1
Solution:
If you are training on multiple GPUs, keep 'SyncBN'; if you are training on a single GPU, change the norm type to 'BN' (see the config sketch below).
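In an MMSegmentation config this is controlled by norm_cfg, which the backbone and heads pick up. A minimal sketch (field names follow the standard mmseg configs; adapt it to your own config file):

# Single-GPU training: plain BatchNorm
norm_cfg = dict(type='BN', requires_grad=True)
# Multi-GPU (distributed) training: synchronized BatchNorm
# norm_cfg = dict(type='SyncBN', requires_grad=True)

model = dict(
    backbone=dict(norm_cfg=norm_cfg),
    decode_head=dict(norm_cfg=norm_cfg),
)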
File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "f:\research\openmmlab\mmsegmentation\mmseg\models\losses\cross_entropy_loss.py", line 271, in forward
loss_cls = self.loss_weight * self.cls_criterion(
File "f:\research\openmmlab\mmsegmentation\mmseg\models\losses\cross_entropy_loss.py", line 45, in cross_entropy
loss = F.cross_entropy(
File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\nn\functional.py", line 3014, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: weight tensor should be defined either for all or no classes
Process finished with exit code -1
Cause analysis:
This is a class-weight problem: check whether the length of the class weight list matches the number of classes. In my case I had switched to a new, two-class dataset, while the config still carried the 19-entry class weights from the original Cityscapes setup, hence the error. A config sketch is given below.
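In an MMSegmentation config the weights live in the loss settings of the decode head; class_weight must have exactly num_classes entries. A minimal sketch for a two-class dataset (the weight values are placeholders):

decode_head = dict(
    num_classes=2,
    loss_decode=dict(
        type='CrossEntropyLoss',
        use_sigmoid=False,
        loss_weight=1.0,
        class_weight=[1.0, 1.0],   # placeholder weights, one per class
    ),
)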