ScratchDet: Exploring to train single-shot object detectors from scratch
Abstract: Why rely on models pretrained on large-scale datasets?
The paper proposes ScratchDet.
we study the impact of BatchNorm on training detectors from scratch, and find that using BatchNorm on the backbone and detection head subnetworks makes the detector converge well from scratch.
1.Introduction
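Using pretrained models imposes serious limitations on object detection. First, fine-tuning can be regarded as a transfer learning problem, in which it is difficult to fill the domain gap perfectly between the source dataset and the target dataset. Second, classification and detection tasks have different degrees of sensitivity to translation. Classification favors translation invariance, so it needs downsampling operations (e.g., max pooling and stride-2 convolution) for better performance. In contrast, local texture information is more critical for object detection, so translation-invariant operations (e.g., downsampling) must be used with care. It is also inconvenient to change the model architecture.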
ResNet/VGGNet + SSD
BN reparameterizes the optimization problem to make its landscape significantly smoother instead of reducing the internal covariate shift. BN helps the detector converge well without relying on a pretrained model. An analysis of ResNet-based and VGGNet-based SSD shows that the sampling stride of the first convolution layer has a fairly large impact on performance. A root block is introduced.
2.Related work
Object detectors with pretrained network.
Train-from-scratch object detectors
BN (with BN, the learning rate can be larger, which speeds up model training)
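For reference, the normalization that BN applies to one channel over a mini-batch can be sketched as below. This is a minimal pure-Python illustration, not the paper's implementation; the function name `batch_norm`, the sample activations, and the default `eps` are assumptions chosen for the example.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Training-mode batch normalization of one channel over a mini-batch."""
    n = len(values)
    mean = sum(values) / n
    # Biased (divide-by-n) variance, as used in the BN forward pass.
    var = sum((v - mean) ** 2 for v in values) / n
    # Normalize, then apply the learnable scale (gamma) and shift (beta).
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

# Hypothetical, badly scaled activations for a single channel.
activations = [0.5, 120.0, -30.0, 75.0]
normalized = batch_norm(activations)

out_mean = sum(normalized) / len(normalized)
out_var = sum((v - out_mean) ** 2 for v in normalized) / len(normalized)
print(out_mean, out_var)  # close to zero mean and unit variance
```

Whatever the scale of the incoming activations, the output has roughly zero mean and unit variance, which is one intuition for why a larger learning rate becomes usable.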
3.ScratchDet
BatchNorm for SSD trained from scratch
DSOD uses DenseNet and did not identify the important role of BN.
BN on the backbone subnetwork
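We add BN on each conv layer in the backbone subnetwork and then train it from scratch. A relatively large learning rate, 0.01 or 0.05, can then be used to further improve performance, from 72.5% to 77.8% and 78%, compared with 77.2% for the pretrained model. This further shows that adding BN to the backbone subnetwork is an important step for improving SSD trained from scratch.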
BN on the detection head subnetwork
These results are very useful to explain the phenomenon that using large learning rate to train SSD with the original architecture from scratch or pretrained networks usually leads to gradient explosion, poor stability and weak prediction of gradients.
BN in the whole network
With BN in both parts and a relatively large lr, the from-scratch model reaches higher accuracy than the pretrained SSD.
Pair a large lr with BN.
Backbone Network redesign
Performance analysis of ResNet and VGGNet
In SSD, ResNet-101 performs better than VGGNet-16; in DSSD, VGGNet-16 performs better than ResNet-101.
We argue that this phenomenon is attributed to the downsampling operation in the first convolution layer (i.e., conv1_x with stride 2) of ResNet-101, which cuts off half of the raw image information. This operation significantly affects the detection accuracy, especially for small objects.
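Especially for small-object detection, halving the resolution at the very first layer loses too much information. Enlarging the input image (e.g., to 512×512) offsets this drawback, which is why ResNet performs somewhat better on SSD at that input size.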
In summary, the downsampling operation in the first convolution layer has a bad impact on the detection accuracy, especially for small objects.
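The effect of the first-layer stride on resolution can be checked with the standard convolution output-size formula. The sketch below assumes a 7 × 7 kernel with padding 3 (the usual ResNet setting, not stated in these notes); `conv_out_size` is a helper name introduced only for this example.

```python
def conv_out_size(n, kernel, stride, padding):
    """Spatial output size of a convolution (floor mode)."""
    return (n + 2 * padding - kernel) // stride + 1

for size in (300, 512):
    # First conv layer with stride 2: the feature map is halved immediately.
    halved = conv_out_size(size, kernel=7, stride=2, padding=3)
    # Same layer with stride 1: full resolution is preserved.
    kept = conv_out_size(size, kernel=7, stride=1, padding=3)
    print(f"input {size}: stride 2 -> {halved}, stride 1 -> {kept}")
```

With stride 2, a 300-pixel input already shrinks to 150 after one layer, so an object that is 16 pixels wide covers only about 8 positions before any further processing.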
Backbone network redesign for object detection
Root-ResNet
we remove the downsampling operation (i.e., change the stride from 2 to 1) in the first conv layer and replace the 7 × 7 convolution kernel by a stack of several 3 × 3 convolution filters (denoted as the root block). With these improvements, Root-ResNet is able to exploit more local information from the image, so as to extract powerful features for small object detection.
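One way to see why a stack of 3 × 3 filters can stand in for the 7 × 7 kernel is to compare receptive fields. The sketch below assumes a stack of three stride-1 3 × 3 layers (the notes only say "several"); `receptive_field` is a helper name introduced for illustration.

```python
def receptive_field(layers):
    """Receptive field of one output unit after a chain of conv layers.

    layers: list of (kernel, stride) tuples, applied in order.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= stride             # stride multiplies the step between neighboring units
    return rf

single_7x7 = receptive_field([(7, 1)])
three_3x3 = receptive_field([(3, 1)] * 3)  # assumed depth of the root block
print(single_7x7, three_3x3)
```

Under this assumption both configurations cover the same 7 × 7 input window, while the stacked version adds extra nonlinearities between the layers.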
Furthermore, we replace four convolution blocks (added by SSD to extract the feature maps with different scales) with four residual blocks to the end of the Root-ResNet. Each residual block is formed by two branches. One branch is a 1 × 1 convolution layer with stride 2 and the other one consists of a 3×3 convolution layer with stride 2 and a 3×3 convolution layer with stride 1. The number of output channels in each convolution layer is set to 128
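The two branches of such a residual block must produce feature maps of the same size so they can be summed. The sketch below checks this with size arithmetic only; the padding values (0 for the 1 × 1 conv, 1 for the 3 × 3 convs) and the starting size of 38 are assumptions for illustration, not values taken from the paper.

```python
def conv_out_size(n, kernel, stride, padding):
    """Spatial output size of a convolution (floor mode)."""
    return (n + 2 * padding - kernel) // stride + 1

def residual_block_out_size(n):
    """Output size of the two-branch block; both branches must agree."""
    # Branch 1: 1x1 conv, stride 2 (assumed padding 0).
    shortcut = conv_out_size(n, kernel=1, stride=2, padding=0)
    # Branch 2: 3x3 conv stride 2, then 3x3 conv stride 1 (assumed padding 1).
    main = conv_out_size(n, kernel=3, stride=2, padding=1)
    main = conv_out_size(main, kernel=3, stride=1, padding=1)
    assert shortcut == main, "branches must match to be added element-wise"
    return main

sizes = [38]  # hypothetical size of the last backbone feature map
for _ in range(4):  # the four extra residual blocks
    sizes.append(residual_block_out_size(sizes[-1]))
print(sizes)
```

With these paddings each block roughly halves the spatial size while both branches stay aligned, giving a pyramid of progressively coarser maps for multi-scale prediction.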
The larger the input size, the higher the mAP.
With BN and a large lr, the mAP is higher. Without BN, training with a large lr does not converge.