Problems encountered in multi-task learning…
https://blog.csdn.net/jacke121/article/details/79874555
https://blog.csdn.net/fu6543210/article/details/89790220
Symptom: after the first epoch, accuracy and loss stay essentially unchanged, oscillating without ever converging.
Epoch: [1 | 90] LR: 0.010000
Processing |################################| Batch (1311/1311) | Acc 77.0433 | BAcc 49.8469 | Loss 0.0124 | Total 0:55:21 | ETA 0:00:03
Processing |################################| Batch (165/165) | Acc 77.0377 | BAcc 38.5189 | Loss 0.5296 | Total 0:05:45 | ETA 0:00:02
Updated best Acc. 77.0377385218265
Updated best BAcc. 38.518869264643214
Epoch: [2 | 90] LR: 0.010000
Processing |################################| Batch (1311/1311) | Acc 77.0455 | BAcc 38.5228 | Loss 0.0123 | Total 0:56:42 | ETA 0:00:03
Processing |################################| Batch (165/165) | Acc 77.0377 | BAcc 38.5189 | Loss 0.5290 | Total 0:05:41 | ETA 0:00:03
Epoch: [3 | 90] LR: 0.010000
Processing |################################| Batch (1311/1311) | Acc 77.0455 | BAcc 39.7728 | Loss 0.0123 | Total 0:55:22 | ETA 0:00:03
Processing |################################| Batch (165/165) | Acc 77.0377 | BAcc 38.5189 | Loss 0.5293 | Total 0:06:03 | ETA 0:00:03
Epoch: [4 | 90] LR: 0.010000
Processing |################################| Batch (1311/1311) | Acc 77.0455 | BAcc 43.5228 | Loss 0.0123 | Total 0:55:52 | ETA 0:00:03
Processing |################################| Batch (165/165) | Acc 77.0377 | BAcc 38.5189 | Loss 0.5290 | Total 0:05:50 | ETA 0:00:02
Epoch: [5 | 90] LR: 0.010000
Processing |################################| Batch (1311/1311) | Acc 77.0455 | BAcc 42.8978 | Loss 0.0123 | Total 0:55:50 | ETA 0:00:03
Processing |################################| Batch (165/165) | Acc 77.0377 | BAcc 38.5189 | Loss 0.5290 | Total 0:05:39 | ETA 0:00:03
The situation above is usually caused by beginner mistakes. Run through the following checks first:
1. In the train function, verify the order of these three calls (the sketch after this checklist shows them inside a full training step):
optimizer.zero_grad()
loss_sum.backward()
optimizer.step()
2. Correct usage of BCEWithLogitsLoss: how to apply class weights and how to compute accuracy from its logits (a short sketch follows the solution link below):
https://www.jianshu.com/p/0062d04a2782
https://discuss.pytorch.org/t/bcewithlogitsloss-and-model-accuracy-calculation/59293/2
3. When computing accuracy, the running total of samples is not iterations * batch_size. It should be accumulated from the first iteration onward (as in the sketch below), because batches are not guaranteed to contain the same number of samples.
4. A learning rate that is too large can drive the loss to NaN, and one that is too small can leave the loss flat. Choose a suitable learning rate and a schedule for adjusting it; the criterion can also be tuned based on observed performance.
5. When updating and saving the training state, be careful: it is easy to compare only the current metric against the best one while never updating the version that is actually saved, e.g. checking acc but forgetting to update the stored best acc.
6. Try changing the model architecture and see whether performance changes. Lowering the learning rate or simplifying the model can also show whether training escapes this steady oscillation.
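As a reference for checks 1, 3, 4 and 5, here is a minimal per-epoch sketch. It is not the original training script: the model, the data loader and the checkpoint path are placeholders, and the single-output model is a simplification of the multi-task setup.

import torch
import torch.nn as nn

# placeholder model / data so the sketch runs standalone; substitute the real ones
model = nn.Linear(1024, 40)
train_loader = [(torch.randn(16, 1024), torch.randint(0, 2, (16, 40))) for _ in range(5)]

criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)  # check 4: lr schedule

best_acc = 0.0
for epoch in range(90):
    model.train()
    correct, total = 0, 0
    for inputs, targets in train_loader:
        logits = model(inputs)
        loss = criterion(logits, targets.float())
        optimizer.zero_grad()     # check 1: zero_grad, backward, step, in this order
        loss.backward()
        optimizer.step()
        preds = (torch.sigmoid(logits) > 0.5).float()
        correct += (preds == targets.float()).sum().item()
        total += targets.numel()  # check 3: accumulate the real sample count, not iters * batch_size
    scheduler.step()
    acc = 100.0 * correct / total
    if acc > best_acc:            # check 5: actually update the stored best value before saving
        best_acc = acc
        torch.save({'epoch': epoch, 'state_dict': model.state_dict(), 'best_acc': best_acc},
                   'model_best.pth')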
The final fix is here: https://discuss.pytorch.org/t/need-help-loss-and-acc-stay-the-same-too-early-by-using-bcewithlogitsloss/88994/4
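The thread above, like check 2, comes down to how BCEWithLogitsLoss is used. Below is a minimal sketch of weighting and of computing accuracy from the logits; the tensor shapes and the pos_weight value are illustrative assumptions, not taken from the thread.

import torch
import torch.nn as nn

# one pos_weight entry per attribute; values > 1 up-weight the positive class.
# 3.0 is illustrative - in practice estimate it from the label frequencies.
pos_weight = torch.full((40,), 3.0)
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(16, 40)                    # raw model outputs, no sigmoid in the model
targets = torch.randint(0, 2, (16, 40)).float()
loss = criterion(logits, targets)               # the sigmoid is applied inside the loss

preds = (torch.sigmoid(logits) > 0.5).float()   # threshold the probabilities at 0.5
acc = (preds == targets).float().mean().item()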
How to inspect the network structure:
import torch
import torch.nn as nn


class FC(nn.Module):
    def __init__(self, bitsPerAttr=1, ):
        super(FC, self).__init__()
        # two-layer head used by each task tower
        self.layers = nn.Sequential(
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Linear(512, bitsPerAttr),
        )

    def forward(self, x):
        output = self.layers(x)
        return output
# alexnet + 2fc
class Alexnet2fc(nn.Module):
    def __init__(self, taskNum=40, bitsPerAttr=1, ):
        if taskNum == 40: bitsPerAttr = 1
        if taskNum == 10: bitsPerAttr = 8
        print('Model:%s, TaskNum:%s, Bits/Attr:%s' % ('AlexNet2fc', taskNum, bitsPerAttr))
        super(Alexnet2fc, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, ),
            nn.Conv2d(64, 192, kernel_size=5, padding=2),
            nn.BatchNorm2d(192),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, ),
            nn.Conv2d(192, 384, kernel_size=3, padding=1),
            nn.BatchNorm2d(384),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, ),  # output: [N, 256, 2, 2] for the input size used here
            # note: these Linear layers act on the last (width) dimension of the 4D feature map
            nn.Linear(2, 512),
            nn.ReLU(),
            nn.Linear(512, 1024),
        )
        self.avgpool = nn.AdaptiveAvgPool2d((2, 2))  # nn.AdaptiveAvgPool2d expects a 3D or 4D input
        # wrong arch.: reusing the same self.fc instance in the ModuleList makes
        # every tower share one set of weights, so the tasks cannot specialize
        '''
        fc = nn.Sequential(
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Linear(512, bitsPerAttr),
            # nn.Softmax(dim=0)
        )
        self.fc = fc
        self.towers = nn.ModuleList([self.fc for _ in range(taskNum)])
        '''
        self.towers = nn.ModuleList([FC(bitsPerAttr) for _ in range(taskNum)])  # one independent head per task
    def forward(self, x):
        x = self.features(x)                        # conv feature map
        x = self.avgpool(x)                         # pooled to a fixed 2x2 spatial size
        x = torch.flatten(x, 1)                     # flattened to [N, 1024] (256 * 2 * 2)
        out = [tower(x) for tower in self.towers]   # one [N, bitsPerAttr] output per task
        return out
if __name__ == '__main__':
    model = Alexnet2fc()
    for name, para in model.named_parameters():
        print(name)
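The commented-out "wrong arch." above puts the same self.fc instance into the ModuleList forty times, so every tower shares one set of weights, while the FC() version builds an independent head per task. Here is a quick check of that difference (a sketch added here, not part of the original code; it assumes the imports and classes above are in scope) which also counts parameters, as in the parameter-counting references below.

def count_params(m):
    return sum(p.numel() for p in m.parameters())

shared = nn.Linear(1024, 512)
shared_towers = nn.ModuleList([shared for _ in range(40)])                    # 40 references to ONE module
separate_towers = nn.ModuleList([nn.Linear(1024, 512) for _ in range(40)])    # 40 independent modules

print(count_params(shared_towers))    # 524800 (1024*512 + 512): duplicated parameters are counted once
print(count_params(separate_towers))  # 20992000: forty times as many trainable parameters
print(count_params(Alexnet2fc()))     # total parameter count of the model defined above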
References: PyTorch convolutional-layer sharing, parameter sharing, and parameter counting.
https://blog.csdn.net/weixin_44058333/article/details/99701581
https://zhuanlan.zhihu.com/p/64425750