code
a structural re-parameterization technique
将训练时的多分枝融合为推理时的单分支结构。很多因素影响推理速度,FLOPs的数量值不能真实反应实际的速度。
通过对普通卷积进行编码得到的一条单分支网络,也是 一种 re-parameterization method。
W = d i a g ( a ) I + d i a g ( b ) W n o r m W = diag(a)I + diag(b){W_{norm}} W=diag(a)I+diag(b)Wnorm
W W W是最终的卷积参数, a a a和 b b b是一个待学习的参数, W n o r m W_{norm} Wnorm是归一化后的卷积核
另外,之前看过的ACBlock、DO-Conv 、ExpandNet也可以看作是一种模型重参数化方法,他们是把一个block看成一个卷积,去代替普通卷积,是一种component-level improvements 。而本文提出的方法对应一个扁平(plain)的卷积网络来说是很重要的。
Winograd is a classic algorithm for accelerating 3 × 3 conv (only if the stride is 1).
Fast:
Memory-economical
Flexible
ResNet,它构造了一个shortcut分支来将信息流建模为 y = x + f ( x ) y = x+ f (x) y=x+f(x),并使用一个残差块来学习 f f f
当x和f (x)的维度不匹配时,它就变成了 y = g ( x ) + f ( x ) y = g (x) +f (x) y=g(x)+f(x),其中 g ( x ) g (x) g(x)是一个由一个1×1conv实现的shortcut。对ResNets成功的一个解释是,这样的多分支体系结构使模型成为一个由许多较浅的模型[36]组成的隐式集成。
ResNet, which explicitly constructs a shortcut branch to model the information flow as y = x + f(x) and uses a residual block to learn f. When the dimensions of x and f(x) do not match, it becomes y = g(x)+f(x), where g(x)is a convolutional shortcut implemented by a 1×1 conv.
RepVGG使用了像是ResNet中的恒等连接的一个分支,并且加了一个1x1卷积分支,共三个卷积分支进行相加: y = x + g ( x ) + f ( x ) y =x+ g (x) +f (x) y=x+g(x)+f(x)
训练阶段:
M ( 2 ) = b n ( M ( 1 ) ∗ W ( 3 ) , µ ( 3 ) , σ ( 3 ) , γ ( 3 ) , β ( 3 ) ) + b n ( M ( 1 ) ∗ W ( 1 ) , µ ( 1 ) , σ ( 1 ) , γ ( 1 ) , β ( 1 ) ) + b n ( M ( 1 ) , µ ( 0 ) , σ ( 0 ) , γ ( 0 ) , β ( 0 ) ) . M(2)= bn(M(1)∗ W(3), µ(3), σ(3), γ(3), β(3))+ bn(M(1)∗ W(1), µ(1), σ(1), γ(1), β(1))+ bn(M(1), µ(0), σ(0), γ(0), β(0)) . M(2)=bn(M(1)∗W(3),µ(3),σ(3),γ(3),β(3))+bn(M(1)∗W(1),µ(1),σ(1),γ(1),β(1))+bn(M(1),µ(0),σ(0),γ(0),β(0)).
就是 y = b n ( 1 ∗ 1 C o n v ) + b n ( 3 ∗ 3 C o n v ) + b n ( x ) y=bn(1*1 Conv)+bn(3*3 Conv)+bn(x) y=bn(1∗1Conv)+bn(3∗3Conv)+bn(x)
推理阶段:
RepVGG将网络分成5个阶段
we set the number of groups g g g the 3rd, 5th, 7th, …, 21st layer of RepVGG-A and the additional 23rd, 25th and 27th layers of RepVGG-B.
Note that the 1×1 branch shall have the same g g g as the 3×3 conv.
RepVGG models are fast, simple and practical ConvNets designed for the maximum speed on GPU and specialized hardware, less concerning the number of parameters. They are more parameter-efficient than ResNets but may be less favored than the mobile-regime models like MobileNets[16, 30, 15] and ShuffleNets [41, 24] for low-power devices.
but may be less用的很是巧妙了,避重就轻。
We proposed RepVGG, a simple architecture with a stack of 3 × 3 conv and ReLU, which is especially suitable for GPU and specialized inference chips. With our structural re-parameterization method, it reaches over 80% top-1 accuracy on ImageNet and shows favorable speed-accuracy trade-off compared to the state-of-the-art models.
Precision表示被分为正例的示例中实际为正例的比例,precision=TP/(TP+FP)。即,一个二分类,类别分别命名为1和2,Precision就表示在类别1中,分对了的数量占了类别1总数量的多少;同理,也表示在类别2中,分对了的数量占类别2总数量的多少。那么这个指标越高,就表示越整齐不混乱。
而Accuracy是我们最常见的评价指标,accuracy = (TP+TN)/(P+N),这个很容易理解,就是被分对的样本数除以所有的样本数,通常来说,正确率越高,分类器越好。我们最常说的就是这个准确率。
import torch.nn as nn
import numpy as np
import torch
import copy
from se_block import SEBlock
import torch.utils.checkpoint as checkpoint
def conv_bn(in_channels, out_channels, kernel_size, stride, padding, groups=1):
result = nn.Sequential()
result.add_module('conv', nn.Conv2d(in_channels=in_channels, out_channels=out_channels,
kernel_size=kernel_size, stride=stride, padding=padding, groups=groups, bias=False))
result.add_module('bn', nn.BatchNorm2d(num_features=out_channels))
return result
class RepVGGBlock(nn.Module):
def __init__(self, in_channels, out_channels, kernel_size,
stride=1, padding=0, dilation=1, groups=1, padding_mode='zeros', deploy=False, use_se=False):
super(RepVGGBlock, self).__init__()
self.deploy = deploy
self.groups = groups
self.in_channels = in_channels
assert kernel_size == 3
assert padding == 1
padding_11 = padding - kernel_size // 2
self.nonlinearity = nn.ReLU() #激活函数
if use_se: #是否使用SE模块
# Note that RepVGG-D2se uses SE before nonlinearity. But RepVGGplus models uses SE after nonlinearity.
self.se = SEBlock(out_channels, internal_neurons=out_channels // 16)
else:
self.se = nn.Identity()
if deploy: #推理阶段,只有一层卷积,没有分支:rbr_reparam
self.rbr_reparam = nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=kernel_size, stride=stride,
padding=padding, dilation=dilation, groups=groups, bias=True, padding_mode=padding_mode)
else: #训练阶段:1. 恒等连接:rbr_identity 、2.1x1分支:rbr_1x1 3. kxk卷积分支(主分支):rbr_dense:通常k=3
self.rbr_identity = nn.BatchNorm2d(num_features=in_channels) if out_channels == in_channels and stride == 1 else None
self.rbr_dense = conv_bn(in_channels=in_channels, out_channels=out_channels, kernel_size=kernel_size, stride=stride, padding=padding, groups=groups)
self.rbr_1x1 = conv_bn(in_channels=in_channels, out_channels=out_channels, kernel_size=1, stride=stride, padding=padding_11, groups=groups)
print('RepVGG Block, identity = ', self.rbr_identity)
def forward(self, inputs):
if hasattr(self, 'rbr_reparam'): #推理阶段
return self.nonlinearity(self.se(self.rbr_reparam(inputs)))
if self.rbr_identity is None:
id_out = 0
else:
id_out = self.rbr_identity(inputs)
#返回三个卷积分支的值
return self.nonlinearity(self.se(self.rbr_dense(inputs) + self.rbr_1x1(inputs) + id_out))
# Optional. This may improve the accuracy and facilitates quantization in some cases.
# 1. Cancel the original weight decay on rbr_dense.conv.weight and rbr_1x1.conv.weight.
# 2. Use like this.
# loss = criterion(....)
# for every RepVGGBlock blk:
# loss += weight_decay_coefficient * 0.5 * blk.get_cust_L2()
# optimizer.zero_grad()
# loss.backward()
def get_custom_L2(self):
K3 = self.rbr_dense.conv.weight
K1 = self.rbr_1x1.conv.weight
t3 = (self.rbr_dense.bn.weight / ((self.rbr_dense.bn.running_var + self.rbr_dense.bn.eps).sqrt())).reshape(-1, 1, 1, 1).detach()
t1 = (self.rbr_1x1.bn.weight / ((self.rbr_1x1.bn.running_var + self.rbr_1x1.bn.eps).sqrt())).reshape(-1, 1, 1, 1).detach()
l2_loss_circle = (K3 ** 2).sum() - (K3[:, :, 1:2, 1:2] ** 2).sum() # The L2 loss of the "circle" of weights in 3x3 kernel. Use regular L2 on them.
eq_kernel = K3[:, :, 1:2, 1:2] * t3 + K1 * t1 # The equivalent resultant central point of 3x3 kernel.
l2_loss_eq_kernel = (eq_kernel ** 2 / (t3 ** 2 + t1 ** 2)).sum() # Normalize for an L2 coefficient comparable to regular L2.
return l2_loss_eq_kernel + l2_loss_circle
# This func derives the equivalent kernel and bias in a DIFFERENTIABLE way.
# You can get the equivalent kernel and bias at any time and do whatever you want,
# for example, apply some penalties or constraints during training, just like you do to the other models.
# May be useful for quantization or pruning.
def get_equivalent_kernel_bias(self): #得到conv 和 bn 融合后的卷积:kernel、bias,并相加
kernel3x3, bias3x3 = self._fuse_bn_tensor(self.rbr_dense)
kernel1x1, bias1x1 = self._fuse_bn_tensor(self.rbr_1x1)
kernelid, biasid = self._fuse_bn_tensor(self.rbr_identity)
return kernel3x3 + self._pad_1x1_to_3x3_tensor(kernel1x1) + kernelid, bias3x3 + bias1x1 + biasid
def _pad_1x1_to_3x3_tensor(self, kernel1x1):#将1x1卷积padding成3x3卷积
if kernel1x1 is None:
return 0
else:
return torch.nn.functional.pad(kernel1x1, [1,1,1,1])
def _fuse_bn_tensor(self, branch): #bn 和conv 融合:inference 时 用bn对conv的权重进行归一化
if branch is None:
return 0, 0
if isinstance(branch, nn.Sequential):
kernel = branch.conv.weight
running_mean = branch.bn.running_mean
running_var = branch.bn.running_var
gamma = branch.bn.weight
beta = branch.bn.bias
eps = branch.bn.eps
else:
assert isinstance(branch, nn.BatchNorm2d)
if not hasattr(self, 'id_tensor'):
input_dim = self.in_channels // self.groups
kernel_value = np.zeros((self.in_channels, input_dim, 3, 3), dtype=np.float32)
for i in range(self.in_channels):
kernel_value[i, i % input_dim, 1, 1] = 1
self.id_tensor = torch.from_numpy(kernel_value).to(branch.weight.device)
kernel = self.id_tensor
running_mean = branch.running_mean
running_var = branch.running_var
gamma = branch.weight
beta = branch.bias
eps = branch.eps
std = (running_var + eps).sqrt()
t = (gamma / std).reshape(-1, 1, 1, 1)
return kernel * t, beta - running_mean * gamma / std
def switch_to_deploy(self): #训练阶段转换为推理阶段
if hasattr(self, 'rbr_reparam'):
return
kernel, bias = self.get_equivalent_kernel_bias() #获得 1x1 、3x3、恒等连接的 bn、conv融合后的卷积
self.rbr_reparam = nn.Conv2d(in_channels=self.rbr_dense.conv.in_channels, out_channels=self.rbr_dense.conv.out_channels,
kernel_size=self.rbr_dense.conv.kernel_size, stride=self.rbr_dense.conv.stride,
padding=self.rbr_dense.conv.padding, dilation=self.rbr_dense.conv.dilation, groups=self.rbr_dense.conv.groups, bias=True)
self.rbr_reparam.weight.data = kernel
self.rbr_reparam.bias.data = bias
#将 融合后的卷积参数赋值给 inference时的卷积:rbr_reparam
self.__delattr__('rbr_dense') #删掉训练阶段的三个分支rbr_dense、rbr_1x1、rbr_identity
self.__delattr__('rbr_1x1')
if hasattr(self, 'rbr_identity'):
self.__delattr__('rbr_identity')
if hasattr(self, 'id_tensor'):
self.__delattr__('id_tensor')
self.deploy = True
switch_to_deploy()
函数进行融合分支。def repvgg_model_convert(model:torch.nn.Module, save_path=None, do_copy=True):
if do_copy:
model = copy.deepcopy(model)
for module in model.modules():
if hasattr(module, 'switch_to_deploy'): #删除bn和conv进行fuse过程中,将bn的值提取为参数时添加的变量:id_tensor
module.switch_to_deploy()
if save_path is not None:
torch.save(model.state_dict(), save_path)
return model