Problem description:
Traceback (most recent call last):
  File "train_cls.py", line 445, in <module>
    train(cfg=cfg)
  File "train_cls.py", line 288, in train
    cls1, cam1, cls2, cam2, cls3, cam3, cls4, cam4, attns = wetr(inputs, txt_features)
  File "/home/pengzhang/anaconda3/envs/TPRO/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data5/pengzhang/TPRO_nochange/cls_network/model.py", line 155, in forward
    k_fea4 = k_fc4(know_feature)
  File "/home/pengzhang/anaconda3/envs/TPRO/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data5/pengzhang/TPRO_nochange/cls_network/model.py", line 34, in forward
    x = self.fc1(x)
  File "/home/pengzhang/anaconda3/envs/TPRO/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pengzhang/anaconda3/envs/TPRO/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:1! (when checking argument for argument mat1 in method wrapper_addmm)
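The error says that, during training, tensors ended up on different devices, here the CPU and a GPU (cuda:1). Anyone who has run deep learning code knows that the inputs and the model must all be placed on the same device, either the GPU or the CPU, before training. At first I also assumed this was an ordinary device conflict, but a careful check of my own code turned up nothing. In the end, the last frame of the traceback above,
  File "/home/pengzhang/anaconda3/envs/TPRO/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
pointed me to the problem. Opening linear.py shows the following source: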
def forward(self, input: Tensor) -> Tensor:
    return F.linear(input, self.weight, self.bias)
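Read together with the error message (mat1 is the input, and it was found on cuda:1), this shows that the input tensor input is on the GPU, while self.weight and self.bias are still on the CPU. Moving self.weight and self.bias to the same GPU as input makes the error go away, i.e.: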
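def forward(self, input: Tensor) -> Tensor:
    # Move the parameters to whatever device the input is on.
    device = input.device
    # self.bias is None when the layer was created with bias=False.
    bias = self.bias.to(device) if self.bias is not None else None
    return F.linear(input, self.weight.to(device), bias)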
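Patching linear.py inside site-packages changes every nn.Linear in that environment, and the .to(device) calls copy the weights on each forward pass. A less invasive option is to leave PyTorch untouched, check where the parameters actually live, and move the offending layer once, where it is created. The following is only a minimal sketch under stated assumptions: report_devices is a hypothetical helper name, and k_fc is a stand-in for the layer from the traceback, not the real code from model.py.

```python
import torch
import torch.nn as nn


def report_devices(module: nn.Module) -> dict:
    """Map each parameter and buffer name to the device it lives on."""
    report = {name: str(p.device) for name, p in module.named_parameters()}
    report.update({name: str(b.device) for name, b in module.named_buffers()})
    return report


# Hypothetical stand-in for the layer that failed in the traceback.
k_fc = nn.Linear(8, 4)
x = torch.randn(2, 8)

# Use the GPU when one is available, so the sketch also runs on CPU-only machines.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the layer once, at creation time, instead of on every forward call.
k_fc = k_fc.to(device)
x = x.to(device)

devices = report_devices(k_fc)
out = k_fc(x)
print(devices)
```

Calling report_devices on the whole model just before the failing line should show which layers were left behind on the CPU.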
Open question:
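linear.py is official PyTorch source code. I have used this layer before without ever running into this problem: the parameters always ended up on the GPU automatically, and I only had to move the model and the inputs. I do not know why it happened this time, so if you know the reason, please leave a comment below!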
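One common cause, offered here as a guess rather than a diagnosis: model.to(device) and model.cuda() only move parameters of submodules that are registered on the model. The traceback calls k_fc4(know_feature) without a self. prefix, which suggests that k_fc4 may be a local or module-level variable, or an item taken from a plain Python list, and therefore not registered. The sketch below uses two hypothetical classes (PlainListModel and ModuleListModel are not from the original code) to show the difference.

```python
import torch.nn as nn


class PlainListModel(nn.Module):
    """Hypothetical model that keeps its layers in a plain Python list."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)
        # A plain list is invisible to nn.Module: these layers are not registered.
        self.k_fcs = [nn.Linear(8, 4) for _ in range(4)]


class ModuleListModel(nn.Module):
    """The same layers, registered through nn.ModuleList."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)
        self.k_fcs = nn.ModuleList([nn.Linear(8, 4) for _ in range(4)])


plain_names = [name for name, _ in PlainListModel().named_parameters()]
registered_names = [name for name, _ in ModuleListModel().named_parameters()]

print(plain_names)       # only the backbone parameters
print(registered_names)  # backbone plus every k_fcs layer
```

Parameters missing from named_parameters() are skipped by .to(device) and .cuda(), and an optimizer built from model.parameters() will not update them either. If that is what happened here, registering the layers (as attributes of the model or in an nn.ModuleList) would fix the error without touching linear.py.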