【ResNeXt】《Aggregated Residual Transformations for Deep Neural Networks》

CVPR-2017


Torch code: https://github.com/facebookresearch/ResNeXt
Caffe code: https://github.com/soeaver/caffe-model/tree/master/cls/resnext
Caffe network visualization tool: http://ethereon.github.io/netscope/#/editor


Contents

  • 1 Background and Motivation
  • 2 Innovations / Contributions
  • 3 Advantages
  • 4 Related work
  • 5 Method
    • 5.1 split-transform-aggregate(modules)
    • 5.2 Architecture
  • 6 Experiments
    • 6.1 Experiment on ImageNet 1K
      • 6.1.1 Cardinality vs. Width
      • 6.1.2 Increasing Cardinality vs. Deeper / Wider
      • 6.1.3 Residual connection
      • 6.1.4 Comparisons with state-of-the-art results
    • 6.2 Experiments on ImageNet-5K
    • 6.3 Experiments on CIFAR-10
    • 6.4 Experiments on COCO object detection
  • 7 Conclusion


1 Background and Motivation

Research on visual recognition is undergoing a transition from “feature engineering” to “network engineering”.

The inspiration from the Inception family is split-transform-merge:

  • split: 1x1 convolutions
  • transform: 3x3, 5x5 convolutions (one per pathway)
  • merge: concatenate

The authors observe that the Inception family's carefully designed topologies are able to achieve compelling accuracy with low theoretical complexity.

But Inception has too many hyperparameters to design:

  • the filter numbers and sizes are tailored for each individual transformation (i.e., the design of the branches inside each Inception module)
  • the modules are customized stage-by-stage

As a result, it is in general unclear how to adapt the Inception architectures to new datasets / tasks.

Building on Inception's split-transform-merge idea, the authors adopt VGG / ResNet's strategy of repeating layers, combine it with grouped convolution, and propose ResNeXt. The resulting design is far more regular!

(Figure 1 of the paper: a block of ResNet (left) vs. a block of ResNeXt with cardinality = 32 (right))

2 Innovations / Contributions

1) Proposed the ResNeXt architecture
2) Brought group convolution into the spotlight (split-transform-aggregate)

3 Advantages

  • ILSVRC 2016 classification task (2nd place)
  • better results than its ResNet counterpart (on the ImageNet-5K set and the COCO detection set)

The ImageNet-5K set has 5000 classes. Keep in mind that ResNet was a bombshell: on release it swept nearly every visual-recognition leaderboard. ResNeXt does even better than ResNet!

increasing cardinality is a more effective way of gaining accuracy than going deeper or wider

4 Related work

  • Multi-branch convolutional networks
    Inception family, ResNet (two branches), deep neural decision forests
  • Grouped convolutions
    Grouped convolution first appeared in AlexNet (to split the model across GPUs). "To the best of our knowledge, there has been little evidence on exploiting grouped convolutions to improve accuracy." (no one had used grouped convolution to improve classification accuracy before)
  • Compressing convolutional networks
    These methods [6, 18, 21, 16] have shown elegant compromise of accuracy with lower complexity and smaller model sizes.
  • Ensembling
    Viewing ResNeXt as an ensemble is imprecise, because the paths are trained jointly, not independently.

5 Method

Note: in this paper, width refers to the number of channels (of a group), and depth refers to the number of layers.

5.1 split-transform-aggregate(modules)

1) Origin of the idea

Start from an ordinary fully-connected (inner-product) neuron:
(figure: a simple neuron performing an inner product, viewed as split-transform-aggregate)
$X = [x_1, x_2, ..., x_D]$ is a D-channel input vector.

Step 1: split $X$ into low-dimensional embeddings $x_i$
Step 2: transform each one: $w_i x_i$
Step 3: aggregate: $\sum_{i=1}^{D}$

Putting it together: $\sum_{i=1}^{D} w_i x_i$
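The three steps above can be sketched in a few lines of plain Python (the input values and weights below are made-up illustrative numbers):

```python
# Split-transform-aggregate view of an inner-product neuron.
x = [0.5, -1.0, 2.0, 3.0]   # D = 4 input channels, already "split" into x_i
w = [1.0, 0.5, -0.5, 2.0]   # one scalar weight w_i per channel

# transform each x_i by w_i, then aggregate by summing
out = sum(w_i * x_i for w_i, x_i in zip(w, x))
print(out)  # 5.0 -- identical to the plain inner product w . x
```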

2) Transplanting the idea

Now extend this to 2D convolution (split-transform-aggregate).
The authors generalize the fully-connected form and define the aggregated transformation as:
$\mathcal{F}(x) = \sum_{i=1}^{C} \tau_i(x)$
$C$ denotes the cardinality, i.e., the number of groups; $\tau_i$ should project $x$ into an (optionally low-dimensional) embedding and then transform it. (For intuition about low-dimensional embeddings, see any discussion of embedding layers in deep learning.)

With a residual connection added, this becomes:
$y = x + \sum_{i=1}^{C} \tau_i(x)$

Parameter counts (roughly equal):

  • left: $256 \cdot 64 + 64 \cdot 3 \cdot 3 \cdot 64 + 64 \cdot 256 = 69632 \approx 70k$
  • right: $256 \cdot 4 \cdot 32 + 4 \cdot 3 \cdot 3 \cdot 4 \cdot 32 + 4 \cdot 256 \cdot 32 = 70144 \approx 70k$ (cardinality = 32, width = 4)
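These counts are easy to verify in plain Python (a quick sketch; the function names are mine, and BN / bias parameters are ignored as in the paper's counting):

```python
def resnet_bottleneck_params(c_in=256, width=64):
    # 1x1 reduce (256 -> 64), 3x3 (64 -> 64), 1x1 expand (64 -> 256)
    return c_in * width + width * 3 * 3 * width + width * c_in

def resnext_block_params(c_in=256, cardinality=32, d=4):
    # each of the C paths: 1x1 (256 -> d), 3x3 (d -> d), 1x1 (d -> 256)
    per_path = c_in * d + d * 3 * 3 * d + d * c_in
    return cardinality * per_path

print(resnet_bottleneck_params())  # 69632, i.e. ~70k
print(resnext_block_params())      # 70144, i.e. ~70k
```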

3) Three equivalent forms

"We have trained all three forms and obtained the same results." Form (c) is chosen because it is more succinct and faster.
(Figure 3 of the paper: three equivalent building blocks — (a) aggregated residual transformations, (b) early concatenation, (c) grouped convolution)
Parameter counts:
(a) $256 \cdot 4 \cdot 32 + 4 \cdot 3 \cdot 3 \cdot 4 \cdot 32 + 4 \cdot 256 \cdot 32 = 70144$
(b) $256 \cdot 4 \cdot 32 + 4 \cdot 3 \cdot 3 \cdot 4 \cdot 32 + 128 \cdot 256 = 70144$
(c) $256 \cdot 128 + (128/32) \cdot 3 \cdot 3 \cdot (128/32) \cdot 32 + 128 \cdot 256 = 70144$

Since all three forms have three layers and the same resolution at each layer, equal parameter counts also imply equal FLOPs. Form (c) looks much cleaner than (a) and (b), and the authors use (c) in all subsequent designs.
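Form (c)'s count can be checked with a small helper for grouped-convolution parameters (a sketch; the helper name is mine):

```python
def grouped_conv_params(c_in, c_out, k, groups):
    # each group maps c_in/groups input channels to c_out/groups output channels
    return (c_in // groups) * k * k * (c_out // groups) * groups

# form (c): 1x1 (256 -> 128), grouped 3x3 (128 -> 128, 32 groups), 1x1 (128 -> 256)
form_c = 256 * 128 + grouped_conv_params(128, 128, 3, 32) + 128 * 256
print(form_c)  # 70144, matching forms (a) and (b)
```

Note how grouping divides the 3x3 cost by the number of groups: an ungrouped 128-to-128 3x3 conv would cost 147456 parameters by itself.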

4) A caveat

For the structure in Figure 3(c), the block depth must be $\geq 3$. Why?
(figure: a depth-2 aggregated block (left) vs. an equivalent plain, wide two-layer block (right))
Parameter counts:

  • left: $64 \cdot 3 \cdot 3 \cdot 4 \cdot 32 + 4 \cdot 3 \cdot 3 \cdot 64 \cdot 32 = 147456$
  • right: $64 \cdot 3 \cdot 3 \cdot 128 + 128 \cdot 3 \cdot 3 \cdot 64 = 147456$

As the counts show, with depth = 2 the aggregated block is equivalent to an ordinary, wide two-layer convolution, so grouping gains nothing!

5.2 Architecture

Two design rules:

  1) blocks that operate at the same resolution share the same hyper-parameters;
  2) when the resolution is halved, the number of channels is doubled.

The second rule will be familiar from the ResNet paper, and the explanation here is the same: "The second rule ensures that the computational complexity, in terms of FLOPs (floating-point operations, in # of multiply-adds), is roughly the same for all blocks."
(Table 1 of the paper: ResNet-50 vs. ResNeXt-50 (32×4d) architectures)
1) Understanding cardinality and width

$C$ is the cardinality, i.e., the number of groups;
$d = 4$ means width = 4, i.e., each group has 4 channels.

$C \cdot d$ = the number of filters (see Table 2). Note that this refers only to the filters of the first bottleneck block (in the example above, $32 \cdot 4 = 128$); the filter counts of all the other bottleneck blocks follow from 128 via the two design rules.

Be sure to keep this straight, or you will get confused: $C = 32$ stays fixed, but if $d$ also stayed at 4, the later widths 256, 512, 1024 would make no sense. This puzzled me for a long time!
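Under the two design rules, the per-stage widths can be derived mechanically (a sketch; the conv2-conv5 stage naming follows the usual ResNet convention):

```python
C, d = 32, 4            # cardinality stays fixed; only the group width d grows
widths = []
for stage in range(4):  # conv2, conv3, conv4, conv5
    widths.append(C * d)
    d *= 2              # rule 2: resolution halves, so channels double
print(widths)  # [128, 256, 512, 1024]
```

So for ResNeXt-50 (32×4d) the grouped-conv widths are 128, 256, 512, 1024 with C = 32 throughout, while d doubles: 4, 8, 16, 32.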


2) Details of the residual connection (shortcut)

ResNeXt uses shortcut option ResNet-B:

            same resolution     downsampling
ResNet-A    identity            zero padding
ResNet-B    identity            conv (stride = 2)
ResNet-C    conv (stride = 1)   conv (stride = 2)

(A convolution here can also be called a linear projection / mapping.)

3) Which layer of the bottleneck block gets stride = 2?
The bottleneck block is like a sandwich: a 1×1 convolution in front, another 1×1 convolution behind, and a grouped convolution in the middle. When the resolution is reduced, which layer uses stride = 2? Looking at the code at https://github.com/soeaver/caffe-model/tree/master/cls/resnext, stride = 2 is applied in the grouped-convolution layer, and among the repeated bottleneck blocks of a stage, the first block performs the downsampling!
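A minimal sketch of this layout in plain Python (the helper and the (name, c_in, c_out, stride) tuple format are my own, not taken from the linked code):

```python
def resnext_stage(num_blocks, in_ch, out_ch, downsample):
    """Layer specs for one ResNeXt stage: lists of (name, c_in, c_out, stride)."""
    mid = out_ch // 2  # grouped-conv width, e.g. 256 when out_ch = 512
    blocks = []
    for i in range(num_blocks):
        stride = 2 if (downsample and i == 0) else 1   # only the first block downsamples
        blocks.append([
            ("conv1x1", in_ch if i == 0 else out_ch, mid, 1),
            ("gconv3x3", mid, mid, stride),            # stride = 2 lives here
            ("conv1x1", mid, out_ch, 1),
        ])
    return blocks

stage = resnext_stage(4, 256, 512, downsample=True)
print(stage[0][1])  # ('gconv3x3', 256, 256, 2)
print(stage[1][1])  # ('gconv3x3', 256, 256, 1)
```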

6 Experiments

(figure: ResNet bottleneck block (left) vs. the ResNeXt template block (right))

  • left: $256 \cdot 64 + 64 \cdot 3 \cdot 3 \cdot 64 + 64 \cdot 256 = 69632 \approx 70k$
  • right: $C \cdot (256 \cdot d + d \cdot 3 \cdot 3 \cdot d + d \cdot 256) = 70144 \approx 70k$ (C = 32, d = 4)

The experiments explore both dimensions: C (cardinality) and d (width).

6.1 Experiment on ImageNet 1K

6.1.1 Cardinality vs. Width

ResNet alone, on ImageNet-1K:
(results table from the paper)
C and d are chosen under preserved complexity; increasing C is more effective (though there is no need to push the number of groups ever higher, since accuracy saturates).
(ablation table from the paper)

6.1.2 Increasing Cardinality vs. Deeper / Wider

Cardinality: more groups

Deeper: more layers in the network

Wider: more channels per group

(comparison table from the paper)

Conclusion: increasing cardinality C shows much better results than going deeper or wider.
Note: ResNeXt-101 even beats ResNet-200 at half the complexity, which again suggests that cardinality is a more effective dimension than depth or width.

6.1.3 Residual connection

(results table from the paper)

6.1.4 Comparisons with state-of-the-art results

(comparison table from the paper)

6.2 Experiments on ImageNet-5K

(results from the paper: table and figure)

6.3 Experiments on CIFAR-10

(results from the paper: table and figure)

6.4 Experiments on COCO object detection

(results table from the paper)

7 Conclusion

ResNeXt beats ResNet, and it is more regular and easier to design than the Inception family. It is the first architecture to use group convolution to improve accuracy. Make sure you understand the relation between cardinality and width, and what the authors mean by a low-dimensional embedding!
