YOLOv3 Explained

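I have read the YOLO papers and run the source packages. This post takes some time to map my reading of the paper onto the source code, one to one, to deepen my understanding. Paper: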

https://pjreddie.com/darknet/yolo/

The source I read is the MXNet GluonCV implementation. Code: https://github.com/dmlc/gluon-cv

yolov3 network

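Darknet-53 has 53 layers; excluding the final FC layer, the remaining 52 convolutions serve as the backbone. The backbone is split into three stages, structured like an FPN: convolutions 1-26 form stage 1, convolutions 27-43 form stage 2, and convolutions 44-52 form stage 3. The shallower convolution (26) has a smaller receptive field and is responsible for detecting small objects; the deeper convolution (52) has a large receptive field and detects large objects more easily. The graph of the whole network is at the end of the article.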
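
The three stages end at feature maps of 52x52, 26x26 and 13x13 for a 416x416 input (see the summary table), which corresponds to strides of 8, 16 and 32. A minimal sketch of that arithmetic, where the function name `stage_feature_sizes` is made up for illustration:

```python
def stage_feature_sizes(input_size=416, strides=(8, 16, 32)):
    """Return the side length of the square feature map at each stride."""
    return [input_size // s for s in strides]


sizes = stage_feature_sizes(416)
cells = [s * s for s in sizes]
print(sizes)  # [52, 26, 13]
print(cells)  # [2704, 676, 169]
```

The cell counts 2704, 676 and 169 are the per-scale sizes that show up in the tensor shapes later in this post.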

yolov3 summary

conv_idx layer_idx type filters size / stride input -> output
0 0 conv 32 3 x 3 / 1 416 x 416 x 3 -> 416 x 416 x 32 0.299 BFLOPs
1 1 conv 64 3 x 3 / 2 416 x 416 x 32 -> 208 x 208 x 64 1.595 BFLOPs
2 2 conv 32 1 x 1 / 1 208 x 208 x 64 -> 208 x 208 x 32 0.177 BFLOPs
3 3 conv 64 3 x 3 / 1 208 x 208 x 32 -> 208 x 208 x 64 1.595 BFLOPs
  4 res 1 208 x 208 x 64 -> 208 x 208 x 64
4 5 conv 128 3 x 3 / 2 208 x 208 x 64 -> 104 x 104 x 128 1.595 BFLOPs
5 6 conv 64 1 x 1 / 1 104 x 104 x 128 -> 104 x 104 x 64 0.177 BFLOPs
6 7 conv 128 3 x 3 / 1 104 x 104 x 64 -> 104 x 104 x 128 1.595 BFLOPs
  8 res 5 104 x 104 x 128 -> 104 x 104 x 128
7 9 conv 64 1 x 1 / 1 104 x 104 x 128 -> 104 x 104 x 64 0.177 BFLOPs
8 10 conv 128 3 x 3 / 1 104 x 104 x 64 -> 104 x 104 x 128 1.595 BFLOPs
  11 res 8 104 x 104 x 128 -> 104 x 104 x 128
9 12 conv 256 3 x 3 / 2 104 x 104 x 128 -> 52 x 52 x 256 1.595 BFLOPs
10 13 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BFLOPs
11 14 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BFLOPs
   15 res 12 52 x 52 x 256 -> 52 x 52 x 256
12 16 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BFLOPs
13 17 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BFLOPs
   18 res 15 52 x 52 x 256 -> 52 x 52 x 256
14 19 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BFLOPs
15 20 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BFLOPs
   21 res 18 52 x 52 x 256 -> 52 x 52 x 256
16 22 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BFLOPs
17 23 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BFLOPs
   24 res 21 52 x 52 x 256 -> 52 x 52 x 256
18 25 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BFLOPs
19 26 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BFLOPs
   27 res 24 52 x 52 x 256 -> 52 x 52 x 256
20 28 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BFLOPs
21 29 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BFLOPs
   30 res 27 52 x 52 x 256 -> 52 x 52 x 256
22 31 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BFLOPs
23 32 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BFLOPs
   33 res 30 52 x 52 x 256 -> 52 x 52 x 256
24 34 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BFLOPs
25 35 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BFLOPs
   36 res 33 52 x 52 x 256 -> 52 x 52 x 256
fpn1---------------------------------------------------------
26 37 conv 512 3 x 3 / 2 52 x 52 x 256 -> 26 x 26 x 512 1.595 BFLOPs
27 38 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BFLOPs
28 39 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BFLOPs
   40 res 37 26 x 26 x 512 -> 26 x 26 x 512
29 41 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BFLOPs
30 42 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BFLOPs
   43 res 40 26 x 26 x 512 -> 26 x 26 x 512
31 44 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BFLOPs
32 45 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BFLOPs
   46 res 43 26 x 26 x 512 -> 26 x 26 x 512
33 47 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BFLOPs
34 48 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BFLOPs
   49 res 46 26 x 26 x 512 -> 26 x 26 x 512
35 50 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BFLOPs
36 51 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BFLOPs
   52 res 49 26 x 26 x 512 -> 26 x 26 x 512
37 53 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BFLOPs
38 54 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BFLOPs
   55 res 52 26 x 26 x 512 -> 26 x 26 x 512
39 56 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BFLOPs
40 57 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BFLOPs
   58 res 55 26 x 26 x 512 -> 26 x 26 x 512
41 59 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BFLOPs
42 60 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BFLOPs
   61 res 58 26 x 26 x 512 -> 26 x 26 x 512
fpn2------------------------------------------------------------
43 62 conv 1024 3 x 3 / 2 26 x 26 x 512 -> 13 x 13 x1024 1.595 BFLOPs
44 63 conv 512 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BFLOPs
45 64 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BFLOPs
   65 res 62 13 x 13 x1024 -> 13 x 13 x1024
46 66 conv 512 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BFLOPs
47 67 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BFLOPs
   68 res 65 13 x 13 x1024 -> 13 x 13 x1024
48 69 conv 512 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BFLOPs
49 70 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BFLOPs
   71 res 68 13 x 13 x1024 -> 13 x 13 x1024
50 72 conv 512 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BFLOPs
51 73 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BFLOPs
   74 res 71 13 x 13 x1024 -> 13 x 13 x1024
fpn3---------------------------------------------------------------  
52 75 conv 512 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BFLOPs
53 76 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BFLOPs
54 77 conv 512 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BFLOPs
55 78 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BFLOPs
56 79 conv 512 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BFLOPs
57 80 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BFLOPs
58 81 conv 75 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 75 0.026 BFLOPs
   82 yolo
   83 route 79
59 84 conv 256 1 x 1 / 1 13 x 13 x 512 -> 13 x 13 x 256 0.044 BFLOPs
   85 upsample 2x 13 x 13 x 256 -> 26 x 26 x 256
   86 route 85 61
60 87 conv 256 1 x 1 / 1 26 x 26 x 768 -> 26 x 26 x 256 0.266 BFLOPs
88 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BFLOPs
89 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BFLOPs
90 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BFLOPs
91 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BFLOPs
92 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BFLOPs
93 conv 75 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 75 0.052 BFLOPs
94 yolo
95 route 91
96 conv 128 1 x 1 / 1 26 x 26 x 256 -> 26 x 26 x 128 0.044 BFLOPs
97 upsample 2x 26 x 26 x 128 -> 52 x 52 x 128
98 route 97 36
99 conv 128 1 x 1 / 1 52 x 52 x 384 -> 52 x 52 x 128 0.266 BFLOPs
100 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BFLOPs
101 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BFLOPs
102 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BFLOPs
103 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BFLOPs
104 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BFLOPs
105 conv 75 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 75 0.104 BFLOPs
106 yolo
---------------------

YOLOOutputV3

It starts at the output of the tip convolution unit (conv + ReLU) and ends with the reshape into detections at inference time.

For the last scale:

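The convolution outputs 24 = (1 + 4 + 3) x 3 channels: the inner 3 is the number of classes, 4 describes one box, and 1 says whether an object is present. The trailing 3 = 6 / 2 is the number of anchors: the six anchor values are grouped in pairs, and each pair gives a width and a height.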
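
The channel count of the output convolution follows directly from this breakdown. A small sketch (the helper name `output_channels` is hypothetical) that also reproduces the 75 filters in the summary table, which matches a 20-class setup:

```python
def output_channels(num_classes, num_anchors=3):
    """Channels of the final 1x1 conv: anchors x (objectness + box + classes)."""
    return num_anchors * (1 + 4 + num_classes)


print(output_channels(3))   # 24, the 3-class case discussed here
print(output_channels(20))  # 75, the filter count in the summary table
```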
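yolo3.py -> YOLOOutputV3 -> line 123 ->
pred -> 24x169, i.e. (1+4+3) x 3 x 13 x 13 <--> (objectness + coordinates + classes) x number of anchors x height x width
pred -> 169x3x8: the first dimension is the feature-map position, the second is the anchor index, and the last holds the 8 predictions per anchor (objectness, box position, classes)
raw_box_centers -> 169x3x2: for each cell and each anchor, the relative center point
raw_box_scales -> 169x3x2: for each cell and each anchor, the scaling factor
objness -> 169x3x1: for each cell and each anchor, the objectness confidence
class_pred -> 169x3x3: for each cell and each anchor, the class probabilities
box_centers -> 169x3x2: for each cell and each anchor, the box center in input-image coordinates, with the offset added
box_scales -> 169x3x2: for each cell and each anchor, the box width and height in input-image coordinates; computed as the elementwise exponential (base e, about 2.718) of raw_box_scales, multiplied by the anchor
class_score -> 169x3x3: for each cell and each anchor, the per-class score multiplied by the objectness, so classification and objectness are combined into one score
bbox -> 169x3x4: for each cell and each anchor, the box corners (top-left, bottom-right)
offsets -> 169x1x2: the grid offset of each cell, x in 0..12 and y in 0..12. The predicted center within a cell is added to the offset of the cell's top-left corner and multiplied by the stride (32), turning the center from a relative into an absolute coordinate
anchor -> [[116,90],[156,198],[373,326]]: the sizes of the three anchors for the last scale (13x13). Compared with the order of definition the anchor groups are reversed, because the deepest layer is used to detect large objects. The three anchor groups defined for YOLOv3: anchors = [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]]
In training mode it returns bbox(1,507,4), raw_box_centers(1x169x3x2), raw_box_scales(1x169x3x2), objness(1x169x3x1), class_pred(1x169x3x3), anchor(1x1x3x2), offset(1x169x1x2)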
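
The decoding steps listed above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the GluonCV code: the function name `decode_scale`, the channel order (center, scale, objectness, classes) and the use of a sigmoid on centers, objectness and class scores (as in the YOLOv3 paper's box parameterization) are assumptions that the notes above do not spell out.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def decode_scale(pred, anchors, grid=13, stride=32):
    """Decode raw predictions of one scale.

    pred:    (grid*grid, num_anchors, 4 + 1 + num_classes) raw outputs.
    anchors: (1, num_anchors, 2) anchor sizes in input-image pixels.
    Assumed channel order: center(2), scale(2), objectness(1), classes.
    """
    raw_box_centers = pred[..., 0:2]
    raw_box_scales = pred[..., 2:4]
    objness = pred[..., 4:5]
    class_pred = pred[..., 5:]

    # offsets: (grid*grid, 1, 2) holding the (x, y) index of each cell
    ys, xs = np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij")
    offsets = np.stack([xs, ys], axis=-1).reshape(-1, 1, 2).astype(pred.dtype)

    # relative center + cell offset, scaled by the stride -> absolute center
    box_centers = (sigmoid(raw_box_centers) + offsets) * stride
    # elementwise exp of the raw scales, multiplied by the anchor size
    box_scales = np.exp(raw_box_scales) * anchors
    # class probability multiplied by objectness
    class_score = sigmoid(class_pred) * sigmoid(objness)
    # corner form: top-left and bottom-right
    half = box_scales / 2.0
    bbox = np.concatenate([box_centers - half, box_centers + half], axis=-1)
    return bbox, class_score


anchors = np.array([[116, 90], [156, 198], [373, 326]], dtype=np.float64)
pred = np.zeros((169, 3, 8))
bbox, class_score = decode_scale(pred, anchors.reshape(1, 3, 2))
print(bbox.shape, class_score.shape)  # (169, 3, 4) (169, 3, 3)
print(bbox.reshape(1, -1, 4).shape)   # (1, 507, 4)
```

Flattening the cell and anchor dimensions gives the 507 = 169 x 3 boxes returned for this scale.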

For the other two scales

The other two scales return, respectively:
bbox(1,2028,4), raw_box_centers(1x676x3x2), raw_box_scales(1x676x3x2), objness(1x676x3x1), class_pred(1x676x3x3), anchor(1x1x3x2), offset(1x676x1x2)
bbox(1,8112,4), raw_box_centers(1x2704x3x2), raw_box_scales(1x2704x3x2), objness(1x2704x3x1), class_pred(1x2704x3x3), anchor(1x1x3x2), offset(1x2704x1x2)
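The forward pass during training computes:
all_objectness: (1x507x1) concat (1x2028x1) concat (1x8112x1) -> (1x10647x1)
all_box_centers: (1x507x2) concat (1x2028x2) concat (1x8112x2) -> (1x10647x2)
all_box_scales: (1x507x2) concat (1x2028x2) concat (1x8112x2) -> (1x10647x2)
all_class_pred: (1x507x3) concat (1x2028x3) concat (1x8112x3) -> (1x10647x3)
These are compared against the constructed labels to compute the loss and update the parameters. The cell height, cell width and anchor count of every scale are all flattened into a single dimension.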
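
The concatenation across scales can be checked with dummy tensors. A minimal NumPy sketch, where `concat_scales` is a made-up helper and the zero arrays only stand in for real predictions:

```python
import numpy as np


def concat_scales(grids=(13, 26, 52), num_anchors=3, last_dim=2):
    """Build one dummy (1, H*W*anchors, last_dim) tensor per scale and concatenate."""
    per_scale = [np.zeros((1, g * g * num_anchors, last_dim)) for g in grids]
    return per_scale, np.concatenate(per_scale, axis=1)


per_scale, all_box_centers = concat_scales()
print([p.shape[1] for p in per_scale])  # [507, 2028, 8112]
print(all_box_centers.shape)            # (1, 10647, 2)
```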

yolo_output.gif

YOLODetectionBlockV3

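This block sits after feature extraction, between the feature extractor and the pred output, and is used for feature transformation, dimensionality reduction and so on. The source is in yolo3.py, class YOLODetectionBlockV3. A YOLODetectionBlock follows each stage, with channel set to [512, 256, 128], so the output channels of the successive blocks decrease: [512,1024,512,1024,512,1024], [256,512,256,512,256,512], [128,256,128,256,128,256]. Each group has 6 convolutions. The output of the last convolution (the tip) goes into the output layer for detection; the output of the 5th convolution goes through the transition layer, is concatenated with the corresponding stage, and then enters the next YOLODetectionBlockV3.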
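
The channel pattern of one block can be written down compactly. A small sketch, assuming (as the summary table suggests) that 1x1 convolutions output `channel` and 3x3 convolutions output `channel * 2`; the helper name `detection_block_channels` is hypothetical:

```python
def detection_block_channels(channel):
    """Output channels of the 6 convolutions in one block (1x1 and 3x3 alternate)."""
    return [channel if i % 2 == 0 else channel * 2 for i in range(6)]


for c in [512, 256, 128]:
    chans = detection_block_channels(c)
    route, tip = chans[4], chans[5]  # 5th conv feeds the transition, 6th is the tip
    print(chans, "route:", route, "tip:", tip)
```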

yolodetectionblock.png

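The transition between YOLODetectionBlockV3 blocks is a single convolution. After the convolution, one repeat along the height and one along the width of the feature map perform the upsampling, and a slice_like then makes the upsampled output exactly the same spatial size as the route feature so that the two can be concatenated.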
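
The repeat-then-slice upsampling can be imitated in NumPy. This is a sketch of the idea only: `upsample_and_concat` is a made-up name, and plain array slicing stands in for slice_like.

```python
import numpy as np


def upsample_and_concat(x, route, factor=2):
    """x: (N, C1, H, W); route: (N, C2, H2, W2) with H2 <= H*factor, W2 <= W*factor."""
    up = np.repeat(x, factor, axis=2)   # repeat along height
    up = np.repeat(up, factor, axis=3)  # repeat along width
    # crop to the route's spatial size (NumPy stand-in for slice_like)
    up = up[:, :, :route.shape[2], :route.shape[3]]
    return np.concatenate([up, route], axis=1)


x = np.zeros((1, 256, 13, 13))       # transition conv output at the 13x13 scale
route = np.zeros((1, 512, 26, 26))   # backbone feature at the 26x26 scale
print(upsample_and_concat(x, route).shape)  # (1, 768, 26, 26)
```

The 768 channels match the input of layer 87 in the summary table (256 upsampled + 512 from the route). The crop matters when the input size makes the route's side odd, for example 7 -> 14 cropped to 13.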

loss

loss = (center and scale loss) + (objectness loss) + (classification loss)

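The overall idea: every anchor of every cell is compared against the label to form the loss, and the label produces a mask. For the center point and the scale, only cells and anchors that contain an object contribute to the loss; all other positions are ignored through a mask value of 0. The objectness of every cell and anchor, whether or not it contains an object, contributes to the loss. Only cells that contain an object contribute to the classification loss. These correspond, in order, to the terms of the formula above. For a given cell, if a class is predicted the value is 1; so if the cell contains an object, is this position necessarily 1? And since one cell has several anchors, does the classification not need to be computed for each anchor? I did not fully understand this part of the source code.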
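
The masking idea can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration: the function names, the choice of sigmoid cross-entropy for center, objectness and class terms, the L1 term for scale, and the omission of any weighting or ignore rules. It is not a transcription of the GluonCV loss.

```python
import numpy as np


def sigmoid_bce(logits, targets):
    """Elementwise binary cross-entropy on raw logits (numerically stable form)."""
    return np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))


def yolo_loss_sketch(raw_centers, raw_scales, objness, class_pred,
                     center_t, scale_t, obj_t, class_t, obj_mask):
    """Each row is one (cell, anchor) pair.

    obj_mask: (N, 1), 1 where the pair is assigned an object, else 0.
    """
    center_loss = (obj_mask * sigmoid_bce(raw_centers, center_t)).sum()
    scale_loss = (obj_mask * np.abs(raw_scales - scale_t)).sum()
    obj_loss = sigmoid_bce(objness, obj_t).sum()  # every pair contributes
    cls_loss = (obj_mask * sigmoid_bce(class_pred, class_t)).sum()
    return center_loss + scale_loss + obj_loss + cls_loss


n, num_classes = 4, 3
obj_mask = np.array([[1.0], [0.0], [0.0], [0.0]])
loss = yolo_loss_sketch(
    np.zeros((n, 2)), np.zeros((n, 2)), np.zeros((n, 1)), np.zeros((n, num_classes)),
    np.zeros((n, 2)), np.zeros((n, 2)), np.zeros((n, 1)), np.zeros((n, num_classes)),
    obj_mask,
)
print(loss)
```

In this sketch the classification term is evaluated for every (cell, anchor) row and the mask then keeps only the rows assigned to an object, which is one possible reading of the per-anchor question raised above.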

anchors

anchors = [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]],
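
The flat groups can be reshaped into pairs and reversed so that the largest anchors go with the 13x13 scale, as described in the YOLOOutputV3 section. A short sketch, with the helper name `anchors_per_scale` made up here and each pair read as (width, height) by the usual YOLO convention:

```python
import numpy as np

anchors = [[10, 13, 16, 30, 33, 23],
           [30, 61, 62, 45, 59, 119],
           [116, 90, 156, 198, 373, 326]]


def anchors_per_scale(flat_anchors):
    """Reshape each flat group into (num_anchors, 2) pairs, deepest scale first."""
    groups = [np.array(g, dtype=np.float32).reshape(-1, 2) for g in flat_anchors]
    return groups[::-1]


for grid, a in zip([13, 26, 52], anchors_per_scale(anchors)):
    print(grid, a.tolist())
```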

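These anchor values should be widths and heights relative to a 416 input. When training at 320 or 608 instead of 416, they are presumably scaled proportionally, and the label values should be scaled proportionally as well.
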
Overall network structure diagram

yolov3_graph.png