We call those plugin modules and post-processing methods that only increase the inference cost by a small amount but can significantly improve the accuracy of object detection "bag of specials". Generally speaking, these plugin modules are designed to enhance certain attributes of a model, such as enlarging the receptive field, introducing an attention mechanism, or strengthening feature integration capability, while post-processing is a method for screening model prediction results.
Common modules that can be used to enhance the receptive field are SPP [25], ASPP [5], and RFB [47]. The SPP module originated from Spatial Pyramid Matching (SPM) [39], whose original method was to split the feature map into several d×d equal blocks, where d can be {1, 2, 3, ...}, thus forming a spatial pyramid, and then extract bag-of-words features. SPP integrates SPM into CNNs and uses a max-pooling operation instead of the bag-of-words operation. Since the SPP module proposed by He et al. [25] outputs a one-dimensional feature vector, it cannot be applied in a Fully Convolutional Network (FCN). Thus, in the design of YOLOv3 [63], Redmon and Farhadi improved the SPP module to the concatenation of max-pooling outputs with kernel size k×k, where k = {1, 5, 9, 13}, and stride equal to 1. Under this design, a relatively large k×k max-pooling effectively increases the receptive field of the backbone feature. After adding the improved version of the SPP module, YOLOv3-608 upgrades AP50 by 2.7% on the MS COCO object detection task at the cost of 0.5% extra computation. The difference in operation between the ASPP [5] module and the improved SPP module is mainly that the original k×k kernel-size max-pooling with stride 1 is replaced by several 3×3 kernel-size dilated convolutions with dilation ratio equal to k and stride equal to 1. The RFB module uses several dilated convolutions of k×k kernels, with dilation ratio equal to k and stride equal to 1, to obtain a more comprehensive spatial coverage than ASPP. RFB [47] only costs 7% extra inference time to increase the AP50 of SSD on MS COCO by 5.7%.
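As a reference for the description above, the following is a minimal PyTorch-style sketch of such an improved SPP block (the class name SPPBlock and the default kernel sizes are ours for illustration; this is not the authors' reference implementation):

```python
import torch
import torch.nn as nn

class SPPBlock(nn.Module):
    """YOLOv3-style SPP: concatenate max-pooling outputs of several
    kernel sizes (stride 1, 'same' padding) with the input feature map."""

    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        # k = 1 is the identity, so only the larger kernels need pooling layers.
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in kernel_sizes
        )

    def forward(self, x):
        # Channel dimension grows by a factor of len(kernel_sizes) + 1.
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

# Example: a 512-channel backbone feature map becomes 2048 channels.
feat = torch.randn(1, 512, 19, 19)
print(SPPBlock()(feat).shape)  # torch.Size([1, 2048, 19, 19])
```

Because the poolings use stride 1 with padding, the spatial resolution is preserved, which is what makes the block usable inside an FCN-style detector head.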
The attention modules that are often used in object detection are mainly divided into channel-wise attention and point-wise attention, and the representatives of these two attention models are Squeeze-and-Excitation (SE) [29] and Spatial Attention Module (SAM) [85], respectively. Although the SE module can improve the power of ResNet50 on the ImageNet image classification task by 1% top-1 accuracy at the cost of only 2% extra computational effort, on a GPU it usually increases the inference time by about 10%, so it is more appropriate for use on mobile devices. SAM, on the other hand, only needs 0.1% extra calculation to improve the top-1 accuracy of ResNet50-SE by 0.5% on the ImageNet image classification task. Best of all, it does not affect the speed of inference on the GPU at all.
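To make the point-wise re-weighting concrete, below is a minimal sketch of a spatial attention module, assuming the CBAM-style formulation of [85] (pooling along the channel axis, a 7×7 convolution, and a sigmoid gate); the class and argument names are ours for illustration:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: pool along the channel axis,
    predict a per-location weight map, and rescale the input."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg_map = x.mean(dim=1, keepdim=True)        # B x 1 x H x W
        max_map = x.max(dim=1, keepdim=True).values  # B x 1 x H x W
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn                              # point-wise re-weighting
```

The extra cost is a single two-channel-input convolution per feature map, which is why the overhead is negligible compared with channel-wise SE.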
In terms of feature integration, the early practice was to use skip connections [51] or hyper-columns [22] to integrate low-level physical features with high-level semantic features. Since multi-scale prediction methods such as FPN have become popular, many lightweight modules that integrate different feature pyramids have been proposed. The modules of this sort include SFAM [98], ASFF [48], and BiFPN [77]. The main idea of SFAM is to use the SE module to execute channel-wise level re-weighting on multi-scale concatenated feature maps. As for ASFF, it uses softmax as point-wise level re-weighting and then adds feature maps of different scales. In BiFPN, multi-input weighted residual connections are proposed to execute scale-wise level re-weighting and then add feature maps of different scales.
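As a simplified illustration of the point-wise re-weighting idea in ASFF, the sketch below assumes the multi-scale feature maps have already been resized and projected to a common resolution and channel count (the real ASFF module also handles that alignment); all names are ours:

```python
import torch
import torch.nn as nn

class SoftmaxFusion(nn.Module):
    """ASFF-style point-wise fusion (simplified): predict one weight map per
    input scale, softmax across scales at every spatial location, then take
    the weighted sum of the (already aligned) feature maps."""

    def __init__(self, channels, num_inputs=3):
        super().__init__()
        # One 1x1 conv per input produces its spatial weight logits.
        self.weight_convs = nn.ModuleList(
            nn.Conv2d(channels, 1, kernel_size=1) for _ in range(num_inputs)
        )

    def forward(self, feats):
        # feats: list of tensors, each B x C x H x W at the same resolution.
        logits = torch.cat(
            [conv(f) for conv, f in zip(self.weight_convs, feats)], dim=1
        )
        weights = torch.softmax(logits, dim=1)  # B x N x H x W, sums to 1 per pixel
        stacked = torch.stack(feats, dim=1)     # B x N x C x H x W
        return (weights.unsqueeze(2) * stacked).sum(dim=1)
```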
In the research of deep learning, some people put their focus on searching for good activation functions. A good activation function can make the gradient propagate more efficiently, while at the same time not causing too much extra computational cost. In 2010, Nair and Hinton [56] proposed ReLU to substantially solve the gradient vanishing problem that is frequently encountered with the traditional tanh and sigmoid activation functions. Subsequently, LReLU [54], PReLU [24], ReLU6 [28], Scaled Exponential Linear Unit (SELU) [35], Swish [59], hard-Swish [27], and Mish [55], etc., which are also used to solve the gradient vanishing problem, have been proposed. The main purpose of LReLU and PReLU is to solve the problem that the gradient of ReLU is zero when the output is less than zero. As for ReLU6 and hard-Swish, they are specially designed for quantization networks. For self-normalizing a neural network, the SELU activation function was proposed to satisfy this goal. One thing to be noted is that both Swish and Mish are continuously differentiable activation functions.
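For reference, the two continuously differentiable activations mentioned above can be written directly from their published definitions (Swish with beta = 1 is also known as SiLU); a short sketch:

```python
import torch
import torch.nn.functional as F

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x); with beta = 1 this is also known as SiLU.
    return x * torch.sigmoid(beta * x)

def mish(x):
    # Mish: x * tanh(softplus(x)) = x * tanh(ln(1 + exp(x))).
    return x * torch.tanh(F.softplus(x))
```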
The post-processing method commonly used in deep learning-based object detection is NMS, which can be used to filter those BBoxes that badly predict the same object and retain only the candidate BBoxes with higher response. The way NMS tries to improve is consistent with the method of optimizing an objective function. The original method proposed by NMS does not consider the context information, so Girshick et al. [19] added the classification confidence score in R-CNN as a reference, and greedy NMS was performed in the order of confidence score, from high to low. As for soft NMS [1], it considers the problem that the occlusion of an object may cause the degradation of the confidence score in greedy NMS with IoU score. The way of thinking of the DIoU NMS [99] developers is to add the information of the center point distance to the BBox screening process on the basis of soft NMS. It is worth mentioning that, since none of the above post-processing methods directly refer to the captured image features, post-processing is no longer required in the subsequent development of anchor-free methods.
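A minimal sketch of the greedy NMS procedure described above, written in PyTorch for illustration only (production detectors typically rely on optimized library implementations such as torchvision.ops.nms):

```python
import torch

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and discard the
    remaining boxes whose IoU with it exceeds the threshold.
    boxes: N x 4 tensor in (x1, y1, x2, y2) format; scores: N tensor."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        best = order[0]
        keep.append(best.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        # Intersection of the best box with all remaining boxes.
        x1 = torch.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = torch.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = torch.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = torch.minimum(boxes[best, 3], boxes[rest, 3])
        inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_best + area_rest - inter)
        order = rest[iou <= iou_threshold]  # drop boxes that overlap the kept one too much
    return keep
```

Soft NMS decays the scores of overlapping boxes instead of discarding them, and DIoU NMS additionally factors the center-point distance into the suppression criterion.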