mmdetection training error (device-side assert triggered).
Conclusion first: in a COCO-format annotations.json, the category_id values under categories must not include 0 (i.e., the background class). As far as I can tell, mmdetection (at least the 1.x versions) reserves label 0 internally for the background, so the IDs in the annotation file should start from 1; if a 0 slips in, a classification target can fall outside the range [0, num_classes) expected by the loss, and the ClassNLLCriterion CUDA kernel fires the device-side assert shown below.
It took me several hours of debugging, and nearly drove me crazy, before I reached this conclusion. Posting it here so others can avoid the same pit.
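If you want to check or fix your own annotation file, here is a minimal sketch (the file names are placeholders, and it assumes a standard COCO layout with "categories" and "annotations" keys) that detects a 0 category_id and remaps all IDs so they start at 1:

```python
import json

# Placeholder paths; point these at your own annotation files.
src = 'annotations.json'
dst = 'annotations_fixed.json'

with open(src) as f:
    coco = json.load(f)

old_ids = sorted(c['id'] for c in coco['categories'])
if 0 in old_ids:
    # Build a mapping old_id -> new_id so the smallest ID becomes 1,
    # then apply it to both the categories and every annotation.
    id_map = {old: new for new, old in enumerate(old_ids, start=1)}
    for cat in coco['categories']:
        cat['id'] = id_map[cat['id']]
    for ann in coco['annotations']:
        ann['category_id'] = id_map[ann['category_id']]
    with open(dst, 'w') as f:
        json.dump(coco, f)
    print('Remapped category ids:', id_map)
else:
    print('No category_id of 0 found; categories look fine.')
```

After remapping, also double-check num_classes in your config (in mmdetection 1.x it should be the number of foreground classes plus 1 for background).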
The full error output is below:
2020-01-04 12:18:44,206 - INFO - Distributed training: False
2020-01-04 12:18:45,023 - INFO - load model from: torchvision://resnet50
2020-01-04 12:18:45,213 - WARNING - The model and loaded state dict do not match exactly
unexpected key in source state_dict: fc.weight, fc.bias
missing keys in source state_dict: layer3.4.conv2_offset.weight, layer2.3.conv2_offset.bias, layer4.0.conv2_offset.bias, layer3.3.conv2_offset.bias, layer2.1.conv2_offset.weight, layer3.1.conv2_offset.weight, layer3.3.conv2_offset.weight, layer4.2.conv2_offset.weight, layer2.0.conv2_offset.bias, layer2.1.conv2_offset.bias, layer3.0.conv2_offset.weight, layer4.2.conv2_offset.bias, layer2.0.conv2_offset.weight, layer2.2.conv2_offset.bias, layer2.2.conv2_offset.weight, layer3.2.conv2_offset.weight, layer3.2.conv2_offset.bias, layer3.0.conv2_offset.bias, layer4.1.conv2_offset.bias, layer4.1.conv2_offset.weight, layer4.0.conv2_offset.weight, layer2.3.conv2_offset.weight, layer3.5.conv2_offset.bias, layer3.5.conv2_offset.weight, layer3.4.conv2_offset.bias, layer3.1.conv2_offset.bias
loading annotations into memory...
Done (t=0.02s)
creating index...
index created!
2020-01-04 12:18:47,564 - INFO - load checkpoint from ./work_dirs/cascade_rcnn_dconv_c3-c5_r50_fpn_1x/latest.pth
2020-01-04 12:18:47,854 - INFO - Start running, host: root@91b7c01c2149, work_dir: /competition/work_dirs/cascade_rcnn_dconv_c3-c5_r50_fpn_1x
2020-01-04 12:18:47,854 - INFO - workflow: [('train', 1)], max: 12 epochs
/opt/conda/conda-bld/pytorch_1556653215914/work/aten/src/THCUNN/ClassNLLCriterion.cu:56: void ClassNLLCriterion_updateOutput_no_reduce_kernel(int, THCDeviceTensor
/opt/conda/conda-bld/pytorch_1556653215914/work/aten/src/THCUNN/ClassNLLCriterion.cu:56: void ClassNLLCriterion_updateOutput_no_reduce_kernel(int, THCDeviceTensor
/opt/conda/conda-bld/pytorch_1556653215914/work/aten/src/THCUNN/ClassNLLCriterion.cu:56: void ClassNLLCriterion_updateOutput_no_reduce_kernel(int, THCDeviceTensor
/opt/conda/conda-bld/pytorch_1556653215914/work/aten/src/THCUNN/ClassNLLCriterion.cu:56: void ClassNLLCriterion_updateOutput_no_reduce_kernel(int, THCDeviceTensor
Traceback (most recent call last):
File "tools/train.py", line 108, in
main()
File "tools/train.py", line 104, in main
logger=logger)
File "/competition/mmdet/apis/train.py", line 60, in train_detector
_non_dist_train(model, dataset, cfg, validate=validate)
File "/competition/mmdet/apis/train.py", line 221, in _non_dist_train
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/home/mmdetection/anaconda3/lib/python3.7/site-packages/mmcv/runner/runner.py", line 358, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/mmdetection/anaconda3/lib/python3.7/site-packages/mmcv/runner/runner.py", line 264, in train
self.model, data_batch, train_mode=True, **kwargs)
File "/competition/mmdet/apis/train.py", line 38, in batch_processor
losses = model(**data)
File "/home/mmdetection/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/mmdetection/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/mmdetection/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/competition/mmdet/core/fp16/decorators.py", line 75, in new_func
output = old_func(*new_args, **new_kwargs)
File "/competition/mmdet/models/detectors/base.py", line 86, in forward
return self.forward_train(img, img_meta, **kwargs)
File "/competition/mmdet/models/detectors/cascade_rcnn.py", line 219, in forward_train
loss_bbox = bbox_head.loss(cls_score, bbox_pred, *bbox_targets)
File "/competition/mmdet/core/fp16/decorators.py", line 152, in new_func
output = old_func(*new_args, **new_kwargs)
File "/competition/mmdet/models/bbox_heads/bbox_head.py", line 120, in loss
pos_bbox_pred = bbox_pred.view(bbox_pred.size(0), 4)[pos_inds]
RuntimeError: copy_if failed to synchronize: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered (insert_events at /opt/conda/conda-bld/pytorch_1556653215914/work/c10/cuda/CUDACachingAllocator.cpp:564)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f9374698dc5 in /home/mmdetection/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1:
frame #2: c10::TensorImpl::release_resources() + 0x50 (0x7f9374688640 in /home/mmdetection/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3:
frame #4:
frame #5:
frame #6: