paper: EfficientNetV2: Smaller Models and Faster Training
official implementation: automl/efficientnetv2 at master · google/automl · GitHub
third-party implementation: mmpretrain/efficientnet_v2.py at main · open-mmlab/mmpretrain · GitHub
This paper improves on EfficientNet v1. By systematically studying the training bottlenecks of v1, the authors find the following problems: training with very large image sizes is slow; depthwise convolutions are slow in the early layers; and scaling up every stage equally is sub-optimal. Each of these is discussed below.
Based on these observations, the authors design a search space enriched with additional ops such as Fused-MBConv, and apply training-aware NAS and scaling to jointly optimize model accuracy, training speed, and parameter size. The resulting network is EfficientNetV2.
In addition, the paper proposes an improved progressive learning method: early in training, the model uses smaller inputs and weaker regularization; as training progresses, the input resolution and the regularization strength are gradually increased. This training scheme speeds up training without hurting accuracy.
The authors first study the training bottlenecks of v1 and propose some simple techniques to improve training efficiency.
Training with very large image sizes is slow
In v1, larger inputs consume more GPU memory. Since the total memory is fixed, the batch size has to be reduced, which in turn slows down training. A simple improvement is to apply FixRes, i.e., to train with smaller images than those used for inference. As Table 2 shows, a smaller input means less computation and allows a larger batch size, speeding up training by up to 2.2x.
This paper goes further and proposes a training method that progressively adjusts the input image size and the regularization during training; it is described in detail later.
Depthwise convolutions are slow in early layers but effective in later stages
Another training bottleneck of v1 is its extensive use of depthwise convolutions. Depthwise convolutions have fewer parameters and FLOPs than regular convolutions, but they often cannot fully exploit modern accelerators (modern hardware and training frameworks are heavily optimized for regular convolutions, but not for depthwise convolutions). The recently proposed Fused-MBConv makes better use of mobile or server accelerators: it replaces the 3x3 depthwise convolution and the 1x1 expansion convolution in MBConv with a single regular 3x3 convolution, as shown in Figure 2.
To systematically compare the two blocks, the authors gradually replace the MBConv blocks in EfficientNet-B4 with Fused-MBConv, as shown in Table 3. Replacing only stages 1-3 improves both accuracy and training speed at the cost of a small increase in parameters and FLOPs. However, replacing the MBConv blocks in all stages 1-7 with Fused-MBConv greatly increases parameters and FLOPs while lowering both accuracy and training speed. The right combination of the two building blocks therefore has to be found, and the authors use NAS to search for it automatically.
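To make the structural difference concrete, here is a minimal PyTorch sketch of the two blocks (my own simplified version, not the paper's or mmpretrain's code; SE and stochastic depth are omitted, and the full mmpretrain implementations appear later in this post):
import torch
import torch.nn as nn

def conv_bn_act(cin, cout, k, s, groups=1):
    # conv + BN + SiLU, the basic unit shared by both sketches
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, s, k // 2, groups=groups, bias=False),
        nn.BatchNorm2d(cout),
        nn.SiLU(inplace=True))

class MBConvSketch(nn.Module):
    """1x1 expansion -> 3x3 depthwise -> 1x1 projection."""
    def __init__(self, cin, cout, expand=4, stride=1):
        super().__init__()
        mid = cin * expand
        self.block = nn.Sequential(
            conv_bn_act(cin, mid, 1, 1),                   # 1x1 expansion
            conv_bn_act(mid, mid, 3, stride, groups=mid),  # 3x3 depthwise
            nn.Conv2d(mid, cout, 1, bias=False),           # 1x1 projection, no activation
            nn.BatchNorm2d(cout))
        self.use_skip = stride == 1 and cin == cout

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out

class FusedMBConvSketch(nn.Module):
    """A single regular 3x3 conv replaces the expansion + depthwise convs."""
    def __init__(self, cin, cout, expand=4, stride=1):
        super().__init__()
        mid = cin * expand
        self.block = nn.Sequential(
            conv_bn_act(cin, mid, 3, stride),              # fused regular 3x3 conv
            nn.Conv2d(mid, cout, 1, bias=False),           # 1x1 projection, no activation
            nn.BatchNorm2d(cout))
        self.use_skip = stride == 1 and cin == cout

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out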
Equally scaling up every stage is sub-optimal
v1 scales up all stages with a simple compound scaling rule; for example, with a depth coefficient of 2, the number of layers in every stage is doubled. However, different stages do not contribute equally to training speed and parameter efficiency. This paper therefore adopts a non-uniform scaling strategy that gradually adds more layers to the later stages of the network. In addition, v1 keeps increasing the input size, which leads to higher memory consumption and slower training; to address this, the scaling rule is slightly modified and the maximum input size is capped at a smaller value.
Compared with v1, the authors also modify some of the NAS settings, which are not detailed here. The searched EfficientNetV2-S architecture is shown in Table 4.
Compared with v1, several differences can be seen: (1) EfficientNetV2 extensively uses both MBConv and the newly added Fused-MBConv in the early layers; (2) it prefers a smaller expansion ratio for MBConv, since smaller expansion ratios tend to have less memory access overhead; (3) it prefers smaller 3x3 kernels but adds more layers to compensate for the reduced receptive field; (4) it removes the last stride-1 stage of v1, likely because of its large parameter count and memory access overhead.
The authors then scale up EfficientNetV2-S with a compound scaling rule similar to v1 to obtain EfficientNetV2-M/L, with a few additional optimizations (the maximum inference image size is restricted to 480, and more layers are gradually added to the later stages).
Image size plays an important role in training efficiency. Besides FixRes, many other networks also dynamically adjust the input size during training, but this usually causes a drop in accuracy. The authors hypothesize that the drop comes from unbalanced regularization: when training with different image sizes, the regularization strength should be adjusted accordingly. Large models usually need stronger regularization to fight overfitting; likewise, even for the same network, a smaller input implies a smaller effective capacity and thus only needs weak regularization, whereas a larger input means more computation and larger capacity and is therefore more prone to overfitting. To verify this hypothesis, the authors train the same network with different input sizes and different regularization strengths. As Table 5 shows, the regularization strength that yields the best accuracy grows as the input size increases.
Figure 4 illustrates the training process of the improved progressive learning proposed in this paper: in the early phase of training, small inputs and weak regularization let the network learn simple representations easily and quickly; the input size and the regularization strength are then gradually increased to make learning harder.
Specifically, suppose training takes \(N\) steps in total, the target input size is \(S_{e}\), and the target regularization strengths are \(\Phi_{e}=\left \{ \phi^{k}_{e} \right \} \), where \(k\) indexes a type of regularization such as the dropout rate or the mixup rate. The whole training process is divided into \(M\) stages; each stage \(1\le i\le M\) trains the model with input size \(S_{i}\) and regularization strengths \(\Phi_{i}=\left \{ \phi^{k}_{i} \right \} \). The last stage \(M\) uses input size \(S_{e}\) and regularization \(\Phi_{e}\). For convenience, denote the initial input size by \(S_{0}\) and the initial regularization by \(\Phi_{0}\); linear interpolation determines the concrete values for each stage, as written out below.
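Written out explicitly (my reconstruction of the linear interpolation described above; the paper's Algorithm 1 expresses the same rule with 0-indexed stages), for \(1\le i\le M\):
\[
S_{i} = S_{0} + \left(S_{e}-S_{0}\right)\cdot\frac{i-1}{M-1}, \qquad
\phi^{k}_{i} = \phi^{k}_{0} + \left(\phi^{k}_{e}-\phi^{k}_{0}\right)\cdot\frac{i-1}{M-1},
\]
and the model is then trained for \(N/M\) steps in stage \(i\) with input size \(S_{i}\) and regularization \(\Phi_{i}\).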
The authors run experiments on ImageNet; the progressive learning settings are listed in Table 6. Training lasts 350 epochs in total, split into 4 stages of roughly 87 epochs each; the min and max columns give the smallest and largest values of the input size and of each regularization strength. Following FixEfficientNet, the largest training input is about 20% smaller than the inference input, and no layers are fine-tuned after training.
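As a concrete illustration, here is a small sketch of how such a 4-stage schedule can be derived by linear interpolation (the min/max values below are placeholders for illustration, not the exact numbers in Table 6):
# Derive per-stage settings by linearly interpolating between min and max.
def progressive_schedule(min_val, max_val, num_stages=4):
    return [
        min_val + (max_val - min_val) * i / (num_stages - 1)
        for i in range(num_stages)
    ]

total_epochs = 350
num_stages = 4
epochs_per_stage = total_epochs // num_stages  # roughly 87 epochs per stage

# Placeholder min/max values, for illustration only.
image_sizes = progressive_schedule(128, 300)
dropout_rates = progressive_schedule(0.1, 0.3)
mixup_alphas = progressive_schedule(0.0, 0.2)

for stage in range(num_stages):
    print(f'stage {stage}: {epochs_per_stage} epochs, '
          f'image_size={int(image_sizes[stage])}, '
          f'dropout={dropout_rates[stage]:.2f}, '
          f'mixup={mixup_alphas[stage]:.2f}')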
The full ImageNet results are shown in Table 7. EfficientNetV2 has very fast inference and achieves better accuracy and parameter efficiency than previous ConvNets and Transformer models.
The rest of this post walks through the mmpretrain implementation. First the network architecture: below is the configuration of the S variant, which can be compared against Table 4 (M, L, and other variants also exist but are not covered here). The list contains 7 sub-lists corresponding to stages 1-7 in Table 4; stage 0 is not included. Also note that Table 4 does not list the expand_ratio of MBConv, whereas arch_settings gives the concrete values.
# Parameters to build layers. From left to right:
# - repeat (int): The repeat number of the block in the layer
# - kernel_size (int): The kernel size of the layer
# - stride (int): The stride of the first block of the layer
# - expand_ratio (int, float): The expand_ratio of the mid_channels
# - in_channel (int): The number of in_channels of the layer
# - out_channel (int): The number of out_channels of the layer
# - se_ratio (float): The squeeze ratio of SELayer.
# - block_type (int): -2: ConvModule, -1: EnhancedConvModule,
# 0: FusedMBConv, 1: MBConv
arch_settings = {
**dict.fromkeys(['small', 's'], [[2, 3, 1, 1, 24, 24, 0.0, -1],
[4, 3, 2, 4, 24, 48, 0.0, 0],
[4, 3, 2, 4, 48, 64, 0.0, 0],
[6, 3, 2, 4, 64, 128, 0.25, 1],
[9, 3, 1, 6, 128, 160, 0.25, 1],
[15, 3, 2, 6, 160, 256, 0.25, 1],
[1, 1, 1, 1, 256, 1280, 0.0, -2]])
}
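As a quick sanity check, the backbone can be instantiated and run roughly as follows (a sketch based on the mmpretrain source; the import path and the arch='s' argument are assumptions, so check the repository if they differ):
import torch
from mmpretrain.models import EfficientNetV2  # assumed import path

model = EfficientNetV2(arch='s')  # build the S variant from arch_settings
model.eval()

x = torch.rand(1, 3, 224, 224)
feats = model(x)  # tuple of feature maps selected by out_indices
print([f.shape for f in feats])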
The layers are then built. Stage 0 is just an ordinary convolution layer:
self.layers.append(
ConvModule(
in_channels=self.in_channels,
out_channels=self.arch[0][4],
kernel_size=3,
stride=2,
conv_cfg=self.conv_cfg,
norm_cfg=self.norm_cfg,
act_cfg=self.act_cfg))
Stages 1-7 are then built according to arch_settings:
in_channels = self.arch[0][4]
layer_setting = self.arch[:-1]
total_num_blocks = sum([x[0] for x in layer_setting])
block_idx = 0
dpr = [
x.item()
for x in torch.linspace(0, self.drop_path_rate, total_num_blocks)
] # stochastic depth decay rule
for layer_cfg in layer_setting:
layer = []
(repeat, kernel_size, stride, expand_ratio, _, out_channels,
se_ratio, block_type) = layer_cfg
for i in range(repeat):
stride = stride if i == 0 else 1
if block_type == -1:
has_skip = stride == 1 and in_channels == out_channels
droppath_rate = dpr[block_idx] if has_skip else 0.0
layer.append(
EnhancedConvModule(
in_channels=in_channels,
out_channels=out_channels,
kernel_size=kernel_size,
has_skip=has_skip,
drop_path_rate=droppath_rate,
stride=stride,
padding=1,
conv_cfg=None,
norm_cfg=self.norm_cfg,
act_cfg=self.act_cfg))
in_channels = out_channels
else:
mid_channels = int(in_channels * expand_ratio)
se_cfg = None
if block_type != 0 and se_ratio > 0:
se_cfg = dict(
channels=mid_channels,
ratio=expand_ratio * (1.0 / se_ratio),
divisor=1,
act_cfg=(self.act_cfg, dict(type='Sigmoid')))
block = FusedMBConv if block_type == 0 else MBConv
conv_cfg = self.conv_cfg if stride == 2 else None
layer.append(
block(
in_channels=in_channels,
out_channels=out_channels,
mid_channels=mid_channels,
kernel_size=kernel_size,
stride=stride,
se_cfg=se_cfg,
conv_cfg=conv_cfg,
norm_cfg=self.norm_cfg,
act_cfg=self.act_cfg,
drop_path_rate=dpr[block_idx],
with_cp=self.with_cp))
in_channels = out_channels
block_idx += 1
self.layers.append(Sequential(*layer))
# make the last layer
self.layers.append(
ConvModule(
in_channels=in_channels,
out_channels=self.out_channels,
kernel_size=self.arch[-1][1],
stride=self.arch[-1][2],
conv_cfg=self.conv_cfg,
norm_cfg=self.norm_cfg,
act_cfg=self.act_cfg))
Note that in Table 4, stage 1 uses Fused-MBConv, but in the implementation stage 1 is built with EnhancedConvModule, an ordinary ConvModule with an extra shortcut and DropPath. Since stage 1 has expand_ratio=1, a Fused-MBConv degenerates into a single 3x3 convolution plus the residual connection, which is exactly what EnhancedConvModule implements:
class EnhancedConvModule(ConvModule):
"""ConvModule with short-cut and droppath.
Args:
in_channels (int): Number of channels in the input feature map.
Same as that in ``nn._ConvNd``.
out_channels (int): Number of channels produced by the convolution.
Same as that in ``nn._ConvNd``.
kernel_size (int | tuple[int]): Size of the convolving kernel.
Same as that in ``nn._ConvNd``.
stride (int | tuple[int]): Stride of the convolution.
Same as that in ``nn._ConvNd``.
has_skip (bool): Whether there is short-cut. Defaults to False.
drop_path_rate (float): Stochastic depth rate. Default 0.0.
padding (int | tuple[int]): Zero-padding added to both sides of
the input. Same as that in ``nn._ConvNd``.
dilation (int | tuple[int]): Spacing between kernel elements.
Same as that in ``nn._ConvNd``.
groups (int): Number of blocked connections from input channels to
output channels. Same as that in ``nn._ConvNd``.
bias (bool | str): If specified as `auto`, it will be decided by the
norm_cfg. Bias will be set as True if `norm_cfg` is None, otherwise
False. Default: "auto".
conv_cfg (dict): Config dict for convolution layer. Default: None,
which means using conv2d.
norm_cfg (dict): Config dict for normalization layer. Default: None.
act_cfg (dict): Config dict for activation layer.
Default: dict(type='ReLU').
inplace (bool): Whether to use inplace mode for activation.
Default: True.
with_spectral_norm (bool): Whether use spectral norm in conv module.
Default: False.
padding_mode (str): If the `padding_mode` has not been supported by
current `Conv2d` in PyTorch, we will use our own padding layer
instead. Currently, we support ['zeros', 'circular'] with official
implementation and ['reflect'] with our own implementation.
Default: 'zeros'.
order (tuple[str]): The order of conv/norm/activation layers. It is a
sequence of "conv", "norm" and "act". Common examples are
("conv", "norm", "act") and ("act", "conv", "norm").
Default: ('conv', 'norm', 'act').
"""
def __init__(self, *args, has_skip=False, drop_path_rate=0, **kwargs):
super().__init__(*args, **kwargs)
self.has_skip = has_skip
if self.has_skip and (self.in_channels != self.out_channels
or self.stride != (1, 1)):
raise ValueError('the stride must be 1 and the `in_channels` and'
' `out_channels` must be the same , when '
'`has_skip` is True in `EnhancedConvModule` .')
self.drop_path = DropPath(
drop_path_rate) if drop_path_rate else nn.Identity()
def forward(self, x: torch.Tensor, **kwargs) -> torch.Tensor:
short_cut = x
x = super().forward(x, **kwargs)
if self.has_skip:
x = self.drop_path(x) + short_cut
return x
All remaining stages are built exactly as in Table 4. The MBConv code is shown below; it is the InvertedResidual block from MobileNet, with optional SE and DropPath.
class InvertedResidual(BaseModule):
"""Inverted Residual Block.
Args:
in_channels (int): The input channels of this module.
out_channels (int): The output channels of this module.
mid_channels (int): The input channels of the depthwise convolution.
kernel_size (int): The kernel size of the depthwise convolution.
Defaults to 3.
stride (int): The stride of the depthwise convolution. Defaults to 1.
se_cfg (dict, optional): Config dict for se layer. Defaults to None,
which means no se layer.
conv_cfg (dict): Config dict for convolution layer. Defaults to None,
which means using conv2d.
norm_cfg (dict): Config dict for normalization layer.
Defaults to ``dict(type='BN')``.
act_cfg (dict): Config dict for activation layer.
Defaults to ``dict(type='ReLU')``.
drop_path_rate (float): stochastic depth rate. Defaults to 0.
with_cp (bool): Use checkpoint or not. Using checkpoint will save some
memory while slowing down the training speed. Defaults to False.
init_cfg (dict | list[dict], optional): Initialization config dict.
"""
def __init__(self,
in_channels,
out_channels,
mid_channels,
kernel_size=3,
stride=1,
se_cfg=None,
conv_cfg=None,
norm_cfg=dict(type='BN'),
act_cfg=dict(type='ReLU'),
drop_path_rate=0.,
with_cp=False,
init_cfg=None):
super(InvertedResidual, self).__init__(init_cfg)
self.with_res_shortcut = (stride == 1 and in_channels == out_channels)
assert stride in [1, 2]
self.with_cp = with_cp
self.drop_path = DropPath(
drop_path_rate) if drop_path_rate > 0 else nn.Identity()
self.with_se = se_cfg is not None
self.with_expand_conv = (mid_channels != in_channels)
if self.with_se:
assert isinstance(se_cfg, dict)
if self.with_expand_conv:
self.expand_conv = ConvModule(
in_channels=in_channels,
out_channels=mid_channels,
kernel_size=1,
stride=1,
padding=0,
conv_cfg=conv_cfg,
norm_cfg=norm_cfg,
act_cfg=act_cfg)
self.depthwise_conv = ConvModule(
in_channels=mid_channels,
out_channels=mid_channels,
kernel_size=kernel_size,
stride=stride,
padding=kernel_size // 2,
groups=mid_channels,
conv_cfg=conv_cfg,
norm_cfg=norm_cfg,
act_cfg=act_cfg)
if self.with_se:
self.se = SELayer(**se_cfg)
self.linear_conv = ConvModule(
in_channels=mid_channels,
out_channels=out_channels,
kernel_size=1,
stride=1,
padding=0,
conv_cfg=conv_cfg,
norm_cfg=norm_cfg,
act_cfg=None)
def forward(self, x):
"""Forward function.
Args:
x (torch.Tensor): The input tensor.
Returns:
torch.Tensor: The output tensor.
"""
def _inner_forward(x):
out = x
if self.with_expand_conv:
out = self.expand_conv(out)
out = self.depthwise_conv(out)
if self.with_se:
out = self.se(out)
out = self.linear_conv(out)
if self.with_res_shortcut:
return x + self.drop_path(out)
else:
return out
if self.with_cp and x.requires_grad:
out = cp.checkpoint(_inner_forward, x)
else:
out = _inner_forward(x)
return out
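For example, the first block of stage 4 in Table 4 (MBConv with expand_ratio=4, 64 -> 128 channels, stride 2, SE ratio 0.25) could be built directly from the class above. This snippet is only an illustration and reuses the se_cfg construction shown in the backbone code earlier; the Swish activation is an assumption matching the backbone defaults:
import torch

# First MBConv block of stage 4: 64 -> 128 channels, stride 2, SE ratio 0.25.
block = InvertedResidual(
    in_channels=64,
    out_channels=128,
    mid_channels=64 * 4,              # expand_ratio = 4
    kernel_size=3,
    stride=2,
    se_cfg=dict(                      # same layout as in the backbone code above
        channels=64 * 4,
        ratio=4 * (1.0 / 0.25),
        divisor=1,
        act_cfg=(dict(type='Swish'), dict(type='Sigmoid'))),
    act_cfg=dict(type='Swish'))

x = torch.rand(1, 64, 32, 32)
print(block(x).shape)                 # torch.Size([1, 128, 16, 16])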
The Fused-MBConv code is shown below; it replaces the 3x3 depthwise convolution and the 1x1 expansion convolution in MBConv with a single regular 3x3 convolution.
class EdgeResidual(BaseModule):
"""Edge Residual Block.
Args:
in_channels (int): The input channels of this module.
out_channels (int): The output channels of this module.
mid_channels (int): The input channels of the second convolution.
kernel_size (int): The kernel size of the first convolution.
Defaults to 3.
stride (int): The stride of the first convolution. Defaults to 1.
se_cfg (dict, optional): Config dict for se layer. Defaults to None,
which means no se layer.
with_residual (bool): Use residual connection. Defaults to True.
conv_cfg (dict, optional): Config dict for convolution layer.
Defaults to None, which means using conv2d.
norm_cfg (dict): Config dict for normalization layer.
Defaults to ``dict(type='BN')``.
act_cfg (dict): Config dict for activation layer.
Defaults to ``dict(type='ReLU')``.
drop_path_rate (float): stochastic depth rate. Defaults to 0.
with_cp (bool): Use checkpoint or not. Using checkpoint will save some
memory while slowing down the training speed. Defaults to False.
init_cfg (dict | list[dict], optional): Initialization config dict.
"""
def __init__(self,
in_channels,
out_channels,
mid_channels,
kernel_size=3,
stride=1,
se_cfg=None,
with_residual=True,
conv_cfg=None,
norm_cfg=dict(type='BN'),
act_cfg=dict(type='ReLU'),
drop_path_rate=0.,
with_cp=False,
init_cfg=None):
super(EdgeResidual, self).__init__(init_cfg=init_cfg)
assert stride in [1, 2]
self.with_cp = with_cp
self.drop_path = DropPath(
drop_path_rate) if drop_path_rate > 0 else nn.Identity()
self.with_se = se_cfg is not None
self.with_residual = (
stride == 1 and in_channels == out_channels and with_residual)
if self.with_se:
assert isinstance(se_cfg, dict)
self.conv1 = ConvModule(
in_channels=in_channels,
out_channels=mid_channels,
kernel_size=kernel_size,
stride=stride,
padding=kernel_size // 2,
conv_cfg=conv_cfg,
norm_cfg=norm_cfg,
act_cfg=act_cfg)
if self.with_se:
self.se = SELayer(**se_cfg)
self.conv2 = ConvModule(
in_channels=mid_channels,
out_channels=out_channels,
kernel_size=1,
stride=1,
padding=0,
conv_cfg=None,
norm_cfg=norm_cfg,
act_cfg=None)
def forward(self, x):
def _inner_forward(x):
out = x
out = self.conv1(out)
if self.with_se:
out = self.se(out)
out = self.conv2(out)
if self.with_residual:
return x + self.drop_path(out)
else:
return out
if self.with_cp and x.requires_grad:
out = cp.checkpoint(_inner_forward, x)
else:
out = _inner_forward(x)
return out
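Similarly, the first block of stage 2 in Table 4 (Fused-MBConv with expand_ratio=4, 24 -> 48 channels, stride 2, no SE) could be built from the class above; again this is only an illustrative sketch, and the Swish activation is an assumption matching the backbone defaults:
import torch

# First Fused-MBConv block of stage 2: 24 -> 48 channels, stride 2, no SE.
block = EdgeResidual(
    in_channels=24,
    out_channels=48,
    mid_channels=24 * 4,   # expand_ratio = 4
    kernel_size=3,
    stride=2,
    se_cfg=None,
    act_cfg=dict(type='Swish'))

x = torch.rand(1, 24, 64, 64)
print(block(x).shape)      # torch.Size([1, 48, 32, 32])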