论文:https://arxiv.org/pdf/2007.00992.pdf
代码:https://github.com/clovaai/rexnet
个人认为就是做了一堆实验+数学推导,实质性的东西就是提出了三个设计原则
Representational Bottleneck即特征描述的瓶颈就是中间某层对特征在空间维度进行较大比例的压缩(比如使用pooling时或者是降维),导致很多特征丢失。
基于数学推导和大量的实验提出了解决Representational Bottleneck
问题的三个设计原则
在网络中哪一层可能出现表征瓶颈呢?所有流行的深度网络都有类似的架构,有许多扩展层将图像输入的通道从3通道输入扩展到c通道然后输出预测。
首先,对块或层进行下采样就像展开层一样。其次,瓶颈模块和反向瓶颈块中的第一层也是一个扩展层。最后,还存在大量扩展输出通道大小的倒数第2层。
本文作者认为:表征瓶颈将发生在这些扩展层和倒数第2层。
这里作者其实可以看作在MobileNet这个网络上利用三原则重新设计,形成了自己ReXNet
基于MobileNet的新型网络RexNet?
哈哈哈
本文作者首先考虑了MobileNetV1。依次对接近倒数第2层的卷积做同样的修改。通过:
作者在MobileNetV1中做了类似MobileNetV2地更新。所有从末端到第1个的反向瓶颈都按照相同的原理依次修改。
在ResNet及其变体中,每个瓶颈块在第3个卷积层之后不存在非线性,所以扩展输入通道大小是唯一的补救办法。
倒数第2个层
很多网络架构在倒数第2层有一个输出通道尺寸较大的卷积层。这是为了防止最终分类器的表征瓶颈,但是倒数第2层仍然受到这个问题的困扰。于是作者扩大了倒数第2层的输入通道大小,并替换了ReLU6。
网络结构自己看代码或者论文
import torch
import torch.nn as nn
from math import ceil
# Memory-efficient Siwsh using torch.jit.script borrowed from the code in (https://twitter.com/jeremyphoward/status/1188251041835315200)
# Currently use memory-efficient Swish as default:
USE_MEMORY_EFFICIENT_SWISH = True
if USE_MEMORY_EFFICIENT_SWISH:
@torch.jit.script
def swish_fwd(x):
return x.mul(torch.sigmoid(x))
@torch.jit.script
def swish_bwd(x, grad_output):
x_sigmoid = torch.sigmoid(x)
return grad_output * (x_sigmoid * (1. + x * (1. - x_sigmoid)))
class SwishJitImplementation(torch.autograd.Function):
@staticmethod
def forward(ctx, x):
ctx.save_for_backward(x)
return swish_fwd(x)
@staticmethod
def backward(ctx, grad_output):
x = ctx.saved_tensors[0]
return swish_bwd(x, grad_output)
def swish(x, inplace=False):
return SwishJitImplementation.apply(x)
else:
def swish(x, inplace=False):
return x.mul_(x.sigmoid()) if inplace else x.mul(x.sigmoid())
class Swish(nn.Module):
def __init__(self, inplace=True):
super(Swish, self).__init__()
self.inplace = inplace
def forward(self, x):
return swish(x, self.inplace)
def ConvBNAct(out, in_channels, channels, kernel=1, stride=1, pad=0,
num_group=1, active=True, relu6=False):
out.append(nn.Conv2d(in_channels, channels, kernel,
stride, pad, groups=num_group, bias=False))
out.append(nn.BatchNorm2d(channels))
if active:
out.append(nn.ReLU6(inplace=True) if relu6 else nn.ReLU(inplace=True))
def ConvBNSwish(out, in_channels, channels, kernel=1, stride=1, pad=0, num_group=1):
out.append(nn.Conv2d(in_channels, channels, kernel,
stride, pad, groups=num_group, bias=False))
out.append(nn.BatchNorm2d(channels))
out.append(Swish())
class SE(nn.Module):
def __init__(self, in_channels, channels, se_ratio=12):
super(SE, self).__init__()
self.avg_pool = nn.AdaptiveAvgPool2d(1)
self.fc = nn.Sequential(
nn.Conv2d(in_channels, channels // se_ratio, kernel_size=1, padding=0),
nn.BatchNorm2d(channels // se_ratio),
nn.ReLU(inplace=True),
nn.Conv2d(channels // se_ratio, channels, kernel_size=1, padding=0),
nn.Sigmoid()
)
def forward(self, x):
y = self.avg_pool(x)
y = self.fc(y)
return x * y
class LinearBottleneck(nn.Module):
def __init__(self, in_channels, channels, t, stride, use_se=True, se_ratio=12,
**kwargs):
super(LinearBottleneck, self).__init__(**kwargs)
self.use_shortcut = stride == 1 and in_channels <= channels
self.in_channels = in_channels
self.out_channels = channels
out = []
if t != 1:
dw_channels = in_channels * t
ConvBNSwish(out, in_channels=in_channels, channels=dw_channels)
else:
dw_channels = in_channels
ConvBNAct(out, in_channels=dw_channels, channels=dw_channels, kernel=3, stride=stride, pad=1,
num_group=dw_channels, active=False)
if use_se:
out.append(SE(dw_channels, dw_channels, se_ratio))
out.append(nn.ReLU6())
ConvBNAct(out, in_channels=dw_channels, channels=channels, active=False, relu6=True)
self.out = nn.Sequential(*out)
def forward(self, x):
out = self.out(x)
if self.use_shortcut:
out[:, 0:self.in_channels] += x
return out
class ReXNetV1(nn.Module):
def __init__(self, input_ch=16, final_ch=180, width_mult=1.0, depth_mult=1.0, classes=1000,
use_se=True,
se_ratio=12,
dropout_ratio=0.2,
bn_momentum=0.9):
super(ReXNetV1, self).__init__()
layers = [1, 2, 2, 3, 3, 5]
strides = [1, 2, 2, 2, 1, 2]
use_ses = [False, False, True, True, True, True]
layers = [ceil(element * depth_mult) for element in layers]
strides = sum([[element] + [1] * (layers[idx] - 1)
for idx, element in enumerate(strides)], [])
if use_se:
use_ses = sum([[element] * layers[idx] for idx, element in enumerate(use_ses)], [])
else:
use_ses = [False] * sum(layers[:])
ts = [1] * layers[0] + [6] * sum(layers[1:])
self.depth = sum(layers[:]) * 3
stem_channel = 32 / width_mult if width_mult < 1.0 else 32
inplanes = input_ch / width_mult if width_mult < 1.0 else input_ch
features = []
in_channels_group = []
channels_group = []
# The following channel configuration is a simple instance to make each layer become an expand layer.
for i in range(self.depth // 3):
if i == 0:
in_channels_group.append(int(round(stem_channel * width_mult)))
channels_group.append(int(round(inplanes * width_mult)))
else:
in_channels_group.append(int(round(inplanes * width_mult)))
inplanes += final_ch / (self.depth // 3 * 1.0)
channels_group.append(int(round(inplanes * width_mult)))
ConvBNSwish(features, 3, int(round(stem_channel * width_mult)), kernel=3, stride=2, pad=1)
for block_idx, (in_c, c, t, s, se) in enumerate(zip(in_channels_group, channels_group, ts, strides, use_ses)):
features.append(LinearBottleneck(in_channels=in_c,
channels=c,
t=t,
stride=s,
use_se=se, se_ratio=se_ratio))
pen_channels = int(1280 * width_mult)
ConvBNSwish(features, c, pen_channels)
features.append(nn.AdaptiveAvgPool2d(1))
self.features = nn.Sequential(*features)
self.output = nn.Sequential(
nn.Dropout(dropout_ratio),
nn.Conv2d(pen_channels, classes, 1, bias=True))
def forward(self, x):
x = self.features(x)
x = self.output(x).squeeze()
return x
作者这里做了大量的实验,反正就是soft,感觉cv的论文一千篇论文有一千篇soft
实质性的东西没怎么讲,不算很大的突破,个人感觉就是给网络+大宽度系数,修改了一下非线性函数,当然作者做了很多数学推导,理论意义强于工程意义吧。
CVPR2021全新Backbone | ReXNet在CV全任务以超低FLOPs达到SOTA水平(文末下载论文和源码)