Paper: https://arxiv.org/abs/1810.08425v3
For a Chinese walkthrough of the paper, see https://zhuanlan.zhihu.com/p/59498319
The paper does two main things:
1. The detection backbone no longer uses an ImageNet-pretrained model; the whole detector is trained from scratch.
2. The backbone is redesigned as Root-ResNet.
This post experiments with Root-ResNet-18 in the darknet framework. I did not validate it on a public benchmark; it was applied directly to my own dataset. I also record the tricks for training from scratch without a pretrained model, mainly gradual warmup.
1. Modifying the backbone to Root-ResNet-18
For the backbone, the authors argue that detection differs from classification: opening with a 7*7 conv and max pooling saves computation, but for detection it means throwing away spatial information. The modified backbone therefore costs more computation, which is partly recovered in the detection head by cutting the channel count after the 32x downsampling. The main changes are listed below (a minimal sketch of the stem change follows the list):
(a) Remove the max pooling layer.
(b) Change the stride of the first conv layer from 2 to 1.
(c) Replace the 7*7 conv with three 3*3 convs.
(d) Reduce the channel count of the detection head to 128.
(e) Use BN in both the backbone and the detection head, and train from scratch with a large learning rate (0.05 in the paper).
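To make (a)-(c) concrete, here is a minimal PyTorch-style sketch contrasting the standard ResNet-18 stem with the root stem. PyTorch is not part of the darknet workflow used in this post; it is only for illustration, and LeakyReLU(0.1) is used to mirror darknet's leaky activation (the paper itself uses ReLU).
# Illustration only: standard ResNet-18 stem vs. Root-ResNet stem ((a)-(c) above).
# Assumes PyTorch is installed; not part of the darknet training pipeline.
import torch.nn as nn
# Standard ResNet-18 stem: 7*7 conv with stride 2 plus max pooling -> 4x downsampling.
standard_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.LeakyReLU(0.1, inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)
# Root-ResNet stem: max pooling removed, stride set to 1, and the 7*7 conv replaced by
# three 3*3 convs, so stage 1 sees full-resolution features (no early information loss).
root_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.LeakyReLU(0.1, inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.LeakyReLU(0.1, inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.LeakyReLU(0.1, inplace=True),
)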
The Root-ResNet18.cfg modified for darknet is below; the anchors were computed from my own dataset.
[net]
# Testing
# batch=1
# subdivisions=1
# Training
batch=64
subdivisions=16
width=416
height=416
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
learning_rate=0.0125
burn_in=4000
max_batches = 200000
policy=steps
steps=120000,170000
scales=.1,.1
[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky
# Residual Block
[convolutional]
batch_normalize=1
filters=64
size=3
stride=2
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=linear
[shortcut]
activation=leaky
from=-3
# Residual Block
[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=linear
[shortcut]
activation=leaky
from=-3
# Strided Residual Block
[convolutional]
batch_normalize=1
filters=128
size=3
stride=2
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=linear
[shortcut]
activation=leaky
from=-3
# Residual Block
[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=linear
[shortcut]
activation=leaky
from=-3
# Strided Residual Block
[convolutional]
batch_normalize=1
filters=256
size=3
stride=2
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=linear
[shortcut]
activation=leaky
from=-3
# Residual Block
[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=linear
[shortcut]
activation=leaky
from=-3
# Strided Residual Block
[convolutional]
batch_normalize=1
filters=512
size=3
stride=2
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=linear
[shortcut]
activation=leaky
from=-3
# Residual Block
[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=linear
[shortcut]
activation=leaky
from=-3
######################
[convolutional]
batch_normalize=1
filters=128
size=3
stride=2
pad=1
activation=leaky
[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=128
activation=leaky
[shortcut]
activation=leaky
from=-3
[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=128
activation=leaky
[shortcut]
activation=leaky
from=-3
[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=128
activation=leaky
[shortcut]
activation=leaky
from=-3
[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky
[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=128
activation=leaky
[shortcut]
activation=leaky
from=-3
[convolutional]
size=1
stride=1
pad=1
filters=18
activation=linear
[yolo]
mask = 0,1,2
anchors = 187,124, 146,172, 184,197
classes=1
num=3
jitter=.3
ignore_thresh = .5
truth_thresh = 1
random=0
At a 416*416 input the network only reaches stride 32 (five stride-2 convs, 416 / 2^5 = 13, giving the 13 x 13 output grid), so the conv layers after that all use stride 1; adjust as needed. The darknet layer printout:
layer filters size input output
0 conv 64 3 x 3 / 1 416 x 416 x 3 -> 416 x 416 x 64 0.598 BFLOPs
1 conv 64 3 x 3 / 1 416 x 416 x 64 -> 416 x 416 x 64 12.759 BFLOPs
2 conv 64 3 x 3 / 1 416 x 416 x 64 -> 416 x 416 x 64 12.759 BFLOPs
3 conv 64 3 x 3 / 2 416 x 416 x 64 -> 208 x 208 x 64 3.190 BFLOPs
4 conv 64 3 x 3 / 1 208 x 208 x 64 -> 208 x 208 x 64 3.190 BFLOPs
5 res 2 416 x 416 x 64 -> 208 x 208 x 64
6 conv 64 3 x 3 / 1 208 x 208 x 64 -> 208 x 208 x 64 3.190 BFLOPs
7 conv 64 3 x 3 / 1 208 x 208 x 64 -> 208 x 208 x 64 3.190 BFLOPs
8 res 5 208 x 208 x 64 -> 208 x 208 x 64
9 conv 128 3 x 3 / 2 208 x 208 x 64 -> 104 x 104 x 128 1.595 BFLOPs
10 conv 128 3 x 3 / 1 104 x 104 x 128 -> 104 x 104 x 128 3.190 BFLOPs
11 res 8 208 x 208 x 64 -> 104 x 104 x 128
12 conv 128 3 x 3 / 1 104 x 104 x 128 -> 104 x 104 x 128 3.190 BFLOPs
13 conv 128 3 x 3 / 1 104 x 104 x 128 -> 104 x 104 x 128 3.190 BFLOPs
14 res 11 104 x 104 x 128 -> 104 x 104 x 128
15 conv 256 3 x 3 / 2 104 x 104 x 128 -> 52 x 52 x 256 1.595 BFLOPs
16 conv 256 3 x 3 / 1 52 x 52 x 256 -> 52 x 52 x 256 3.190 BFLOPs
17 res 14 104 x 104 x 128 -> 52 x 52 x 256
18 conv 256 3 x 3 / 1 52 x 52 x 256 -> 52 x 52 x 256 3.190 BFLOPs
19 conv 256 3 x 3 / 1 52 x 52 x 256 -> 52 x 52 x 256 3.190 BFLOPs
20 res 17 52 x 52 x 256 -> 52 x 52 x 256
21 conv 512 3 x 3 / 2 52 x 52 x 256 -> 26 x 26 x 512 1.595 BFLOPs
22 conv 512 3 x 3 / 1 26 x 26 x 512 -> 26 x 26 x 512 3.190 BFLOPs
23 res 20 52 x 52 x 256 -> 26 x 26 x 512
24 conv 512 3 x 3 / 1 26 x 26 x 512 -> 26 x 26 x 512 3.190 BFLOPs
25 conv 512 3 x 3 / 1 26 x 26 x 512 -> 26 x 26 x 512 3.190 BFLOPs
26 res 23 26 x 26 x 512 -> 26 x 26 x 512
27 conv 128 3 x 3 / 2 26 x 26 x 512 -> 13 x 13 x 128 0.199 BFLOPs
28 conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BFLOPs
29 res 26 26 x 26 x 512 -> 13 x 13 x 128
30 conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BFLOPs
31 conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BFLOPs
32 res 29 13 x 13 x 128 -> 13 x 13 x 128
33 conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BFLOPs
34 conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BFLOPs
35 res 32 13 x 13 x 128 -> 13 x 13 x 128
36 conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BFLOPs
37 conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BFLOPs
38 res 35 13 x 13 x 128 -> 13 x 13 x 128
39 conv 18 1 x 1 / 1 13 x 13 x 128 -> 13 x 13 x 18 0.001 BFLOPs
40 detection
2. Gradual warmup
(a) Rethinking ImageNet Pre-training: https://arxiv.org/abs/1811.08883
In 2018, Kaiming He, Ross Girshick, and Piotr Dollár of Facebook AI Research showed in Rethinking ImageNet Pre-training that, with a relatively large learning rate (0.02), a linear warm-up schedule, and GN or SyncBN normalization, detection models can be trained from scratch, without ImageNet pretraining, and match models initialized from pretrained weights.
(b) Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour: https://arxiv.org/abs/1706.02677v1
Using a gradual warmup strategy with a minibatch size of 8192 spread across 256 GPUs, ResNet-50 was trained on ImageNet in one hour, matching the accuracy of training with a minibatch size of 256.
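As a rough sketch of what gradual warmup means (my own illustration, not code from either paper): the learning rate starts small and is increased linearly to the target rate over the first few thousand iterations, after which the normal schedule takes over. All values below are illustrative.
# Sketch of a gradual (linear) warmup schedule; values are illustrative only.
def warmup_lr(iteration, target_lr=0.05, warmup_iters=4000, start_lr=0.0005):
    """Linearly ramp the learning rate from start_lr to target_lr over warmup_iters."""
    if iteration >= warmup_iters:
        return target_lr
    progress = iteration / warmup_iters
    return start_lr + (target_lr - start_lr) * progress
# Example: the rate at a few points during warmup.
for it in (0, 1000, 2000, 4000):
    print(it, round(warmup_lr(it), 5))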
The prerequisite for training from scratch is BN, which makes a large learning rate and fast convergence possible.
In darknet, starting directly at a learning rate of 0.05 quickly produces NaN loss, so a gradual warmup strategy is needed.
For example, first train at a learning rate of 0.0005 for 5 epochs, i.e., set the iteration count n such that n * 64 (the batch size) = 5 * the number of training images.
With 4 GPUs, divide the target rate by the GPU count (0.0005 / 4 = 0.000125), so the cfg settings are as follows (a short calculation sketch comes after these settings):
learning_rate=0.000125
burn_in=4000
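A quick calculation of the two numbers above; the dataset size here is a made-up example.
# Compute the warmup iteration count n (n * batch = 5 epochs) and the per-cfg learning
# rate for multi-GPU training. dataset_size and num_gpus are example values.
dataset_size = 51200        # number of training images (example)
batch = 64                  # batch= in the cfg
epochs = 5
num_gpus = 4
n = epochs * dataset_size // batch      # iterations needed for 5 epochs
cfg_lr = 0.0005 / num_gpus              # target rate divided by GPU count, as in the cfg
print("warmup iterations n =", n)       # 4000 with these example numbers
print("learning_rate in cfg =", cfg_lr) # 0.000125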
After this run finishes, extract the warmed-up convolutional weights with darknet partial (the trailing number is the layer index up to which weights are kept):
./darknet partial cfg/Root-Resnet.cfg Root-Resnet_final.weights warmup1.conv.50 50
Starting from warmup1.conv.50, multiply the learning rate by 10, i.e., an effective rate of 0.005, and run a second warmup round:
learning_rate=0.00125
burn_in=4000
After this round, extract the weights again (same darknet partial command, pointed at the new final weights) to obtain warmup2.conv.50.
Finally, train from warmup2.conv.50 with the full learning rate of 0.05, set the step schedule, and train for many more epochs:
learning_rate=0.0125
burn_in=4000
After the final run, the results are no worse than training the same data starting from a pretrained model.
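For reference, here is a sketch of how darknet computes the current learning rate for the final run with the cfg values above (steps policy plus burn_in). The burn_in exponent of 4 is darknet's default power value as far as I know; treat it as an assumption and check your fork. base_lr here is the effective 0.05 (0.0125 in the cfg times 4 GPUs).
# Sketch of darknet's steps policy with burn_in, using the final-stage cfg values.
# power = 4 is assumed to be the darknet default; verify against your fork.
def darknet_lr(iteration, base_lr=0.05, burn_in=4000, power=4,
               steps=(120000, 170000), scales=(0.1, 0.1)):
    if iteration < burn_in:
        # Ramp up from ~0 to base_lr over the burn_in iterations.
        return base_lr * (iteration / burn_in) ** power
    lr = base_lr
    for step, scale in zip(steps, scales):
        if iteration >= step:
            lr *= scale                 # drop the rate at each step boundary
    return lr
# Example: the rate early in burn_in, after burn_in, and after each step drop.
for it in (100, 4000, 100000, 130000, 180000):
    print(it, darknet_lr(it))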