While running my CGAN code I ran into several errors; this is a record of how I fixed them.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [0,0,0], thread: [39,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [0,0,0], thread: [40,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
.... (many similar lines in between, omitted here)
Traceback (most recent call last):
File "D:\Pycharm_project\Data classfication\Data_Classfication_Project\CGAN\cgan.py", line 326, in <module>
d_loss = (d_real_loss + d_fake_loss) / 2
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
At first I assumed some tensor dimension was wrong somewhere in the computation. After searching for quite a while, I realized the problem was actually in the class-label setup.
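As a side note on debugging: the traceback points at the loss computation, but as the message itself says, CUDA errors are reported asynchronously, so the reported line may be wrong. Re-running with CUDA_LAUNCH_BLOCKING=1 set before CUDA initializes makes the failing kernel surface at the real call site:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must run before any CUDA work, so put it first
import torch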
The number of classes is set in the code as follows:
parser.add_argument("--n_classes", type=int, default=3, help="number of classes for dataset")  # number of classes
But when I assigned class labels to the training set, I wrote 1, 2, 3:
a = [1 for _ in range(47)]  # labels for class 1
b = [2 for _ in range(43)]  # labels for class 2
c = [3 for _ in range(39)]  # labels for class 3
label_list = a + b + c
labels = np.array(label_list)
Meanwhile, the labels generated during training were:
gen_labels = Variable(LongTensor(np.random.randint(0, opt.n_classes, batch_size)))  # array of length batch_size with values in 0..2
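In the usual CGAN setup the labels index an nn.Embedding with n_classes entries, so valid indices are 0 to n_classes - 1; a label of 3 walks off the end of an embedding of size 3, which is exactly what the srcIndex < srcSelectDimSize assertion guards against. A minimal CPU reproduction (a sketch; the layer name is hypothetical):

import torch
import torch.nn as nn

label_emb = nn.Embedding(3, 8)    # n_classes = 3, so valid indices are 0, 1, 2
labels = torch.tensor([1, 2, 3])  # label 3 is out of range
label_emb(labels)                 # IndexError on CPU; on CUDA, the device-side assert above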
So the fix is either:
(1) shift the training-set labels to start from 0:
a = [0 for _ in range(47)]  # labels for class 1
b = [1 for _ in range(43)]  # labels for class 2
c = [2 for _ in range(39)]  # labels for class 3
label_list = a + b + c
labels = np.array(label_list)
(2) or shift the labels generated during training to match:
gen_labels = Variable(LongTensor(np.random.randint(1, opt.n_classes + 1, batch_size)))  # array of length batch_size with values in 1..3
I went with the first fix. Note that fix (2) on its own is only enough if the embedding that consumes the labels is also enlarged (e.g. to n_classes + 1 entries), since index 3 is still out of range for an embedding of size 3.
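Whichever convention you settle on, a cheap sanity check before training turns this class of bug into a readable error instead of a device-side assert. A sketch, assuming the 0-based labels from fix (1):

assert labels.min() >= 0 and labels.max() < opt.n_classes, \
    f"labels must lie in [0, {opt.n_classes - 1}], got [{labels.min()}, {labels.max()}]"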
When feeding the data in through a DataLoader, I got this error:
Traceback (most recent call last):
File "D:\Pycharm_project\Data classfication\Data_Classfication_Project\CGAN\cgan.py", line 299, in <module>
gen_imgs = generator(z, gen_labels)
File "D:\Anaconda\envs\gpu-pytorch\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\Pycharm_project\Data classfication\Data_Classfication_Project\CGAN\cgan.py", line 103, in forward
img = self.model(gen_input) # img.shape is torch.Size([batch_size, 256])
File "D:\Anaconda\envs\gpu-pytorch\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\Anaconda\envs\gpu-pytorch\lib\site-packages\torch\nn\modules\container.py", line 217, in forward
input = module(input)
File "D:\Anaconda\envs\gpu-pytorch\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\Anaconda\envs\gpu-pytorch\lib\site-packages\torch\nn\modules\batchnorm.py", line 171, in forward
return F.batch_norm(
File "D:\Anaconda\envs\gpu-pytorch\lib\site-packages\torch\nn\functional.py", line 2448, in batch_norm
_verify_batch_size(input.size())
File "D:\Anaconda\envs\gpu-pytorch\lib\site-packages\torch\nn\functional.py", line 2416, in _verify_batch_size
raise ValueError("Expected more than 1 value per channel when training, got input size {}".format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 256])
Process finished with exit code 1
Changing batch_size from 64 to 63 made the error disappear. My dataset has 129 samples in total, so with batch_size = 64 the last batch holds just one sample, and that single-sample batch is what triggers the error (BatchNorm cannot compute batch statistics from one sample in training mode). I didn't investigate alternative fixes in detail; since losing one sample doesn't matter here, I simply had the DataLoader drop the incomplete final batch:
dataloader = torch.utils.data.DataLoader(
    torch_dataset,
    batch_size=opt.batch_size,
    shuffle=True,
    drop_last=True  # NOTE: True discards the leftover samples that don't fill a whole batch
)
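For reference, the error really does come from BatchNorm: in training mode it needs more than one value per channel to compute batch statistics. A minimal reproduction of the exact error:

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(256)
bn.train()               # training mode: statistics are computed from the batch
bn(torch.randn(1, 256))  # ValueError: Expected more than 1 value per channel when training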
When saving the loss terms:
lossData_d_loss.append([epoch, d_loss.data.numpy()])
lossData_g_loss.append([epoch, g_loss.data.numpy()])
I got this error:
Traceback (most recent call last):
File "D:\Pycharm_project\Data classfication\Data_Classfication_Project\CGAN\cgan.py", line 347, in <module>
lossData_d_loss.append([epoch, d_loss.data.numpy()])
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
The cause: the numpy() method only works on CPU tensors, and mine were on the GPU, so I changed the two lines above to
lossData_d_loss.append([epoch, d_loss.cpu().detach().numpy()])
lossData_g_loss.append([epoch, g_loss.cpu().detach().numpy()])
and the error was gone.
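Since both losses are scalar tensors, an equivalent option is .item(), which copies the value to the host and returns a plain Python float, so no detach()/cpu() chain is needed:

lossData_d_loss.append([epoch, d_loss.item()])
lossData_g_loss.append([epoch, g_loss.item()])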