YOLOv3 Usage Notes - [CVPR2019] ScratchDet: Training Single-Shot Object Detectors from Scratch

Paper: https://arxiv.org/abs/1810.08425v3

For a walkthrough of the paper, see https://zhuanlan.zhihu.com/p/59498319 (in Chinese).

The paper does two main things:

1. Trains the detection backbone without an ImageNet pre-trained model

2. Redesigns the backbone as Root-ResNet

This post experiments with Root-ResNet-18 in the darknet framework. It has not been validated on a public benchmark; I applied it directly to my own dataset. It also records the tricks for training from scratch without a pre-trained model, chiefly gradual warmup.

1. Modifying the backbone to Root-ResNet-18

For the backbone, the authors argue that detection differs from classification: opening with a 7x7 conv and max pooling reduces computation, but in detection it means losing spatial information. The modified backbone therefore costs more computation, partly recovered in the detection part by reducing the channel count after the 32x downsampling. The main changes:

(a) Remove the max pooling

(b) Change the first conv layer's stride from 2 to 1

(c) Replace the 7x7 conv with three 3x3 convs

(d) Reduce the detection-part channel count to 128

(e) Use BN layers in both the backbone and the detection part, and train from scratch with a large learning rate (0.05 in the paper).
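To make the compute tradeoff in (a)-(c) concrete, here is a back-of-the-envelope comparison of the two stems (an illustrative sketch using darknet's 2x mult-add BFLOPs convention; layer shapes assume a 416x416 input):

```python
# Per-conv cost in BFLOPs, following darknet's convention:
# 2 * k*k*C_in * C_out * H_out * W_out (each mult-add counted as 2 FLOPs).
def conv_bflops(k, c_in, c_out, h_out, w_out):
    return 2 * k * k * c_in * c_out * h_out * w_out / 1e9

# Standard ResNet stem: one 7x7/2 conv down to 208x208
# (the 3x3 max pool that follows contributes no mult-adds).
resnet_stem = conv_bflops(7, 3, 64, 208, 208)

# Root-ResNet stem: three 3x3/1 convs at full 416x416 resolution.
root_stem = (conv_bflops(3, 3, 64, 416, 416)
             + 2 * conv_bflops(3, 64, 64, 416, 416))

print(f"7x7/2 + maxpool stem: {resnet_stem:.3f} BFLOPs")   # ~0.814
print(f"three 3x3/1 convs:    {root_stem:.3f} BFLOPs")     # ~26.116
```

The 0.598 + 12.759 + 12.759 BFLOPs of the first three layers in the darknet printout later in this post add up to the same ~26.1 BFLOPs, which is the price paid for keeping full-resolution information in the stem.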

 

[Figure 1: backbone network]
[Figure 2: detection network]

 

The Root-ResNet18.cfg as modified for darknet; the anchors are computed from my own dataset:

[net]
# Testing
# batch=1
# subdivisions=1
# Training
batch=64
subdivisions=16
width=416
height=416
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.0125
burn_in=4000
max_batches = 200000
policy=steps
steps=120000,170000
scales=.1,.1


[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky


[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky


[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky


# Strided Residual Block
[convolutional]
batch_normalize=1
filters=64
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=linear

[shortcut]
activation=leaky
from=-3

# Residual Block
[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=linear

[shortcut]
activation=leaky
from=-3

# Strided Residual Block
[convolutional]
batch_normalize=1
filters=128
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=linear

[shortcut]
activation=leaky
from=-3

# Residual Block
[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=linear

[shortcut]
activation=leaky
from=-3


# Strided Residual Block
[convolutional]
batch_normalize=1
filters=256
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=linear

[shortcut]
activation=leaky
from=-3

# Residual Block
[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=linear

[shortcut]
activation=leaky
from=-3


# Strided Residual Block
[convolutional]
batch_normalize=1
filters=512
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=linear

[shortcut]
activation=leaky
from=-3

# Residual Block
[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=linear

[shortcut]
activation=leaky
from=-3


######################

[convolutional]
batch_normalize=1
filters=128
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=128
activation=leaky

[shortcut]
activation=leaky
from=-3

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=128
activation=leaky


[shortcut]
activation=leaky
from=-3

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=128
activation=leaky

[shortcut]
activation=leaky
from=-3

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=128
activation=leaky

[shortcut]
activation=leaky
from=-3

[convolutional]
size=1
stride=1
pad=1
filters=18
activation=linear

[yolo]
mask = 0,1,2
anchors = 187,124,  146,172,  184,197
classes=1
num=3
jitter=.3
ignore_thresh = .5
truth_thresh = 1
random=0

At 416x416 input resolution the network can only go down to stride 32, so all subsequent conv layers use stride 1; modify as needed. The resulting layer printout:
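As a sanity check on the head configuration (an illustrative sketch, not darknet code): a stride-32 network on a 416x416 input yields a 13x13 grid, and the conv before [yolo] must have (classes + 5) * len(mask) filters, which is where the filters=18 above comes from:

```python
# Grid size of the final feature map and filter count of the conv
# feeding the [yolo] layer, derived from the cfg values above.
def yolo_head_filters(classes, anchors_in_mask):
    # Each anchor predicts 4 box coords + 1 objectness + class scores.
    return (classes + 5) * anchors_in_mask

width, height, stride = 416, 416, 32
grid_w, grid_h = width // stride, height // stride

print(grid_w, grid_h)              # 13 13
print(yolo_head_filters(1, 3))     # 18 (classes=1, mask = 0,1,2)
```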

layer     filters    size              input                output
    0 conv     64  3 x 3 / 1   416 x 416 x   3   ->   416 x 416 x  64  0.598 BFLOPs
    1 conv     64  3 x 3 / 1   416 x 416 x  64   ->   416 x 416 x  64  12.759 BFLOPs
    2 conv     64  3 x 3 / 1   416 x 416 x  64   ->   416 x 416 x  64  12.759 BFLOPs
    3 conv     64  3 x 3 / 2   416 x 416 x  64   ->   208 x 208 x  64  3.190 BFLOPs
    4 conv     64  3 x 3 / 1   208 x 208 x  64   ->   208 x 208 x  64  3.190 BFLOPs
    5 res    2                 416 x 416 x  64   ->   208 x 208 x  64
    6 conv     64  3 x 3 / 1   208 x 208 x  64   ->   208 x 208 x  64  3.190 BFLOPs
    7 conv     64  3 x 3 / 1   208 x 208 x  64   ->   208 x 208 x  64  3.190 BFLOPs
    8 res    5                 208 x 208 x  64   ->   208 x 208 x  64
    9 conv    128  3 x 3 / 2   208 x 208 x  64   ->   104 x 104 x 128  1.595 BFLOPs
   10 conv    128  3 x 3 / 1   104 x 104 x 128   ->   104 x 104 x 128  3.190 BFLOPs
   11 res    8                 208 x 208 x  64   ->   104 x 104 x 128
   12 conv    128  3 x 3 / 1   104 x 104 x 128   ->   104 x 104 x 128  3.190 BFLOPs
   13 conv    128  3 x 3 / 1   104 x 104 x 128   ->   104 x 104 x 128  3.190 BFLOPs
   14 res   11                 104 x 104 x 128   ->   104 x 104 x 128
   15 conv    256  3 x 3 / 2   104 x 104 x 128   ->    52 x  52 x 256  1.595 BFLOPs
   16 conv    256  3 x 3 / 1    52 x  52 x 256   ->    52 x  52 x 256  3.190 BFLOPs
   17 res   14                 104 x 104 x 128   ->    52 x  52 x 256
   18 conv    256  3 x 3 / 1    52 x  52 x 256   ->    52 x  52 x 256  3.190 BFLOPs
   19 conv    256  3 x 3 / 1    52 x  52 x 256   ->    52 x  52 x 256  3.190 BFLOPs
   20 res   17                  52 x  52 x 256   ->    52 x  52 x 256
   21 conv    512  3 x 3 / 2    52 x  52 x 256   ->    26 x  26 x 512  1.595 BFLOPs
   22 conv    512  3 x 3 / 1    26 x  26 x 512   ->    26 x  26 x 512  3.190 BFLOPs
   23 res   20                  52 x  52 x 256   ->    26 x  26 x 512
   24 conv    512  3 x 3 / 1    26 x  26 x 512   ->    26 x  26 x 512  3.190 BFLOPs
   25 conv    512  3 x 3 / 1    26 x  26 x 512   ->    26 x  26 x 512  3.190 BFLOPs
   26 res   23                  26 x  26 x 512   ->    26 x  26 x 512
   27 conv    128  3 x 3 / 2    26 x  26 x 512   ->    13 x  13 x 128  0.199 BFLOPs
   28 conv    128  3 x 3 / 1    13 x  13 x 128   ->    13 x  13 x 128  0.050 BFLOPs
   29 res   26                  26 x  26 x 512   ->    13 x  13 x 128
   30 conv    128  3 x 3 / 1    13 x  13 x 128   ->    13 x  13 x 128  0.050 BFLOPs
   31 conv    128  3 x 3 / 1    13 x  13 x 128   ->    13 x  13 x 128  0.050 BFLOPs
   32 res   29                  13 x  13 x 128   ->    13 x  13 x 128
   33 conv    128  3 x 3 / 1    13 x  13 x 128   ->    13 x  13 x 128  0.050 BFLOPs
   34 conv    128  3 x 3 / 1    13 x  13 x 128   ->    13 x  13 x 128  0.050 BFLOPs
   35 res   32                  13 x  13 x 128   ->    13 x  13 x 128
   36 conv    128  3 x 3 / 1    13 x  13 x 128   ->    13 x  13 x 128  0.050 BFLOPs
   37 conv    128  3 x 3 / 1    13 x  13 x 128   ->    13 x  13 x 128  0.050 BFLOPs
   38 res   35                  13 x  13 x 128   ->    13 x  13 x 128
   39 conv     18  1 x 1 / 1    13 x  13 x 128   ->    13 x  13 x  18  0.001 BFLOPs
   40 detection

 

2. Gradual warmup

(a) Rethinking ImageNet Pre-training, https://arxiv.org/abs/1811.08883

In 2018, Kaiming He, Ross Girshick, and Piotr Dollár of Facebook AI Research published Rethinking ImageNet Pre-training: with a large learning rate of 0.02, a linear warmup schedule, and GN/SyncBN normalization, detectors can be trained from random initialization (no ImageNet pre-training) and match the results of models fine-tuned from pre-trained weights.

(b) Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, https://arxiv.org/abs/1706.02677v1

Using a gradual warmup strategy with a batch size of 8192 on 256 GPUs, they trained ResNet-50 in one hour while matching the accuracy obtained with a batch size of 256.
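The two ingredients of that paper, the linear scaling rule and linear warmup, can be sketched as follows (function names are illustrative; the base learning rate of 0.1 for batch 256 is the paper's reference point):

```python
# Linear scaling rule: when the minibatch grows by a factor k,
# grow the learning rate by the same factor k.
def scaled_lr(base_lr=0.1, base_batch=256, batch=8192):
    return base_lr * batch / base_batch

# Gradual (linear) warmup: ramp from start_lr to target_lr over
# warmup_iters iterations, then hold the target.
def warmup_lr(it, warmup_iters, target_lr, start_lr):
    if it >= warmup_iters:
        return target_lr
    return start_lr + (target_lr - start_lr) * it / warmup_iters

target = scaled_lr()                        # 3.2 for batch 8192
print(target)
print(warmup_lr(500, 1000, target, 0.1))    # halfway up the ramp
print(warmup_lr(1000, 1000, target, 0.1))   # target reached
```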

[Figure 3]

 

The precondition for training from scratch is BN, which makes a large learning rate and fast convergence possible.

In darknet, starting straight off with a learning rate of 0.05 quickly produces NaN loss, so a gradual warmup strategy is needed.

For example, first run 5 epochs at a learning rate of 0.0005, setting the iteration count n so that n * 64 = 5 * (dataset size), since batch=64.
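The arithmetic of that rule, with a hypothetical dataset size (12,800 is a made-up example, not from this article):

```python
import math

# Iterations needed for a given number of warmup epochs at batch=64.
def warmup_iters(dataset_size, epochs=5, batch=64):
    return math.ceil(epochs * dataset_size / batch)

print(warmup_iters(12800))   # 12800 images -> 1000 iterations
```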

For example, the learning-rate settings for 4 GPUs:

learning_rate=0.000125
burn_in=4000

When that run finishes, extract the weights:

./darknet partial cfg/Root-Resnet.cfg Root-Resnet_final.weights warmup1.conv.50 50

On top of warmup1.conv.50, multiply the learning rate by 10, to 0.005, and run a second warmup:

learning_rate=0.00125
burn_in=4000

When that run finishes, extract the weights again to obtain warmup2.conv.50.

Finally, starting from warmup2.conv.50, use the large learning rate of 0.05, set the steps, and train a few more epochs:

learning_rate=0.0125
burn_in=4000
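Note that darknet's burn_in is not a linear ramp: in common darknet versions, get_current_rate() scales the learning rate as lr * (i / burn_in) ** power during burn-in, with power defaulting to 4 (verify against your own fork). A sketch:

```python
# Sketch of darknet's burn-in behavior (assumes the polynomial ramp
# with power=4 found in common darknet versions).
def burn_in_lr(i, learning_rate=0.0125, burn_in=4000, power=4):
    if i < burn_in:
        return learning_rate * (i / burn_in) ** power
    return learning_rate

print(burn_in_lr(1000))   # still tiny: 0.0125 * 0.25**4
print(burn_in_lr(4000))   # full learning rate from here on
```

Because of the fourth power, the effective learning rate stays very small for most of the 4000 burn-in iterations, which is what keeps the loss from diverging early on.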

After the final run, the result is no worse than training the same data from a pre-trained model.
