Error: RuntimeError: CUDA error: device-side assert triggered

While running my CGAN code I hit several errors; this post records the fixes.

Error 1

C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [0,0,0], thread: [39,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [0,0,0], thread: [40,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
... (many similar lines in between, omitted here)
Traceback (most recent call last):
  File "D:\Pycharm_project\Data classfication\Data_Classfication_Project\CGAN\cgan.py", line 326, in <module>
    d_loss = (d_real_loss + d_fake_loss) / 2
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
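
A general note first: CUDA kernels run asynchronously, so the line in the traceback (the d_loss computation above) is usually not where the failure actually happened. As the message suggests, setting CUDA_LAUNCH_BLOCKING=1 forces synchronous launches so the traceback points at the real call. It must be set before any CUDA work runs, e.g. at the very top of the script:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before the first CUDA operation

(or run set CUDA_LAUNCH_BLOCKING=1 in the Windows console before starting the script).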

At first I assumed some tensor dimension in the computation was wrong and spent a long time hunting for it; it turned out the class labels were set up incorrectly.
The number-of-classes parameter is defined in the code as:

parser.add_argument("--n_classes", type=int, default=3, help="number of classes for dataset")  # number of classes

But when I assigned class labels to the training set, I wrote 1, 2, 3:

a = [1 for _ in range(47)]  # labels for class 1
b = [2 for _ in range(43)]  # labels for class 2
c = [3 for _ in range(39)]  # labels for class 3
label_list = a + b + c
labels = np.array(label_list)

while during training, the generated labels are:

gen_labels = Variable(LongTensor(np.random.randint(0, opt.n_classes, batch_size)))  # array of length batch_size with values in 0..n_classes-1 (here 0-2)
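
This mismatch is the real problem. Assuming the generator conditions on labels through an embedding, nn.Embedding(opt.n_classes, ...), as the standard CGAN implementation does, the table has exactly n_classes rows and only accepts indices 0 to n_classes-1; a real label of 3 reads past its end, which on the GPU surfaces as the Indexing.cu assert above. A minimal sketch reproducing it:

import torch
import torch.nn as nn

emb = nn.Embedding(3, 3)           # 3 classes -> valid indices are 0, 1, 2
ok = emb(torch.tensor([0, 1, 2]))  # fine
bad = emb(torch.tensor([3]))       # IndexError on CPU; device-side assert on CUDA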

So there are two possible fixes.
(1) Change the training-set class labels so they start at 0:

a = [0 for _ in range(47)]  # labels for class 1
b = [1 for _ in range(43)]  # labels for class 2
c = [2 for _ in range(39)]  # labels for class 3
label_list = a + b + c
labels = np.array(label_list)
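
(A style note: NumPy can build the same array in one call, e.g. labels = np.repeat(np.arange(3), [47, 43, 39]) gives 47 zeros, 43 ones and 39 twos.)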

(2) Or shift the labels generated during training up by one to match:

gen_labels = Variable(LongTensor(np.random.randint(1, opt.n_classes+1, batch_size)))  # array of length batch_size with values in 1..n_classes (here 1-3)

I used the first option. (Note that, assuming the embedding described above, option (2) alone still feeds index 3 into a table sized for n_classes entries, so the embedding would have to be enlarged as well; starting the labels at 0 is the cleaner fix.)

Error 2

When feeding the data in with a DataLoader, this error appeared:

Traceback (most recent call last):
  File "D:\Pycharm_project\Data classfication\Data_Classfication_Project\CGAN\cgan.py", line 299, in <module>
    gen_imgs = generator(z, gen_labels)
  File "D:\Anaconda\envs\gpu-pytorch\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Pycharm_project\Data classfication\Data_Classfication_Project\CGAN\cgan.py", line 103, in forward
    img = self.model(gen_input)  # img.shape是torch.Size([batch_size, 256])
  File "D:\Anaconda\envs\gpu-pytorch\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Anaconda\envs\gpu-pytorch\lib\site-packages\torch\nn\modules\container.py", line 217, in forward
    input = module(input)
  File "D:\Anaconda\envs\gpu-pytorch\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Anaconda\envs\gpu-pytorch\lib\site-packages\torch\nn\modules\batchnorm.py", line 171, in forward
    return F.batch_norm(
  File "D:\Anaconda\envs\gpu-pytorch\lib\site-packages\torch\nn\functional.py", line 2448, in batch_norm
    _verify_batch_size(input.size())
  File "D:\Anaconda\envs\gpu-pytorch\lib\site-packages\torch\nn\functional.py", line 2416, in _verify_batch_size
    raise ValueError("Expected more than 1 value per channel when training, got input size {}".format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 256])

Process finished with exit code 1

Changing batch_size from 64 to 63 made the error disappear, so my guess is: the dataset has 129 samples in total, so after two batches of 64 a single sample is left over, and that lone sample triggers the error (a minimal reproduction follows the code below). I didn't investigate further and simply told the DataLoader to drop the remainder; losing one sample doesn't matter:

dataloader = torch.utils.data.DataLoader(
    torch_dataset,
    batch_size=opt.batch_size,
    shuffle=True,
    drop_last=True   # note: True discards the leftover samples that do not fill a whole batch
)
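
The root cause is BatchNorm: in training mode, BatchNorm1d computes per-channel statistics over the batch, which is undefined for a single sample (129 = 2 × 64 + 1, so the third batch held exactly one). A minimal sketch reproducing the error:

import torch

bn = torch.nn.BatchNorm1d(256)
out = bn(torch.randn(2, 256))  # fine: batch statistics need at least 2 samples
out = bn(torch.randn(1, 256))  # ValueError: Expected more than 1 value per channel when training

With shuffle=True, a different leftover sample is dropped each epoch, so in practice no data is lost.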

Error 3

When saving the loss values with

lossData_d_loss.append([epoch, d_loss.data.numpy()])  
lossData_g_loss.append([epoch, g_loss.data.numpy()]) 

this error appeared:

Traceback (most recent call last):
  File "D:\Pycharm_project\Data classfication\Data_Classfication_Project\CGAN\cgan.py", line 347, in <module>
    lossData_d_loss.append([epoch, d_loss.data.numpy()])  
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

The cause: .numpy() only works on CPU tensors, and I was training on the GPU, so I changed the two lines to copy the tensor to host memory first (.cpu()) and detach it from the autograd graph (.detach()):

lossData_d_loss.append([epoch, d_loss.cpu().detach().numpy()]) 
lossData_g_loss.append([epoch, g_loss.cpu().detach().numpy()]) 

and the error was gone.
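
Since both losses are scalar tensors, .item() is an arguably simpler alternative: it copies the value to host memory as a plain Python float in one step, with no explicit .cpu() or .detach() needed:

lossData_d_loss.append([epoch, d_loss.item()])
lossData_g_loss.append([epoch, g_loss.item()])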
