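As we know, when an image is fed into a neural network, the network extracts image features, and each layer produces feature maps of a different size. Figure 1 shows how the feature-map size changes as a VGG network extracts features from an image.
A feature map usually has the shape $[C,H,W]$ (the numbers in Figure 1 are given in $[H,W,C]$ format). While the model is training, the shape is $[B,C,H,W]$, where $B$ is the batch size, $C$ is the number of channels, $H$ is the height of the feature map, and $W$ is its width.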
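Question: why is the feature-map shape $[B,C,H,W]$ rather than some other layout?
Answer: when PyTorch handles images, a loaded image is represented in $[C,H,W]$ format. Training with a batch means there are several feature maps at once, and putting the batch size in the first dimension naturally gives $[B,C,H,W]$. This is simply PyTorch's convention.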
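The layout convention above can be sketched in a few lines. This is only an illustration: random tensors stand in for decoded images, and the names `images`, `batch`, `hwc_image`, and `chw_image` are made up for the example.

```python
import torch

# Three fake RGB "images", each already in PyTorch's [C, H, W] layout.
images = [torch.randn(3, 224, 224) for _ in range(3)]

# Stacking along a new leading dimension yields a [B, C, H, W] batch.
batch = torch.stack(images, dim=0)
print(batch.shape)  # torch.Size([3, 3, 224, 224])

# An array stored as [H, W, C] (the layout used for the numbers in Figure 1)
# can be rearranged with permute before it is passed to a PyTorch layer.
hwc_image = torch.randn(224, 224, 3)
chw_image = hwc_image.permute(2, 0, 1)
print(chw_image.shape)  # torch.Size([3, 224, 224])
```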
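While the network extracts image features, inserting channel attention and spatial attention between convolutional layers can strengthen its ability to extract those features. The code below applies attention to the feature maps themselves, so its input is a feature map of shape $[B,C,H,W]$ and its output is still a feature map of shape $[B,C,H,W]$. Let's look at three papers to see how these two attention mechanisms work.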
Abstract: The central building block of convolutional neural networks (CNNs) is the convolution operator, which enables networks to construct informative features by fusing both spatial and channel-wise information within local receptive fields at each layer. A broad range of prior research has investigated the spatial component of this relationship, seeking to strengthen the representational power of a CNN by enhancing the quality of spatial encodings throughout its feature hierarchy. In this work, we focus instead on the channel relationship and propose a novel architectural unit, which we term the “Squeeze-and-Excitation” (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. We show that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets. We further demonstrate that SE blocks bring significant improvements in performance for existing state-of-the-art CNNs at slight additional computational cost. Squeeze-and-Excitation Networks formed the foundation of our ILSVRC 2017 classification submission which won first place and reduced the top-5 error to 2.251%, surpassing the winning entry of 2016 by a relative improvement of ∼25%. Models and code are available at https://github.com/hujie-frank/SENet.
Starting from a single image, the network extracts features; the feature map of the current layer U has shape $[C,H,W]$.
Average pooling or max pooling is applied over the $[H,W]$ dimensions of the feature map, which shrinks it from $[C,H,W]$ to $[C,1,1]$. The $[C,1,1]$ result can be read as one number for each channel. Figure 4 illustrates this operation, step (2), in detail.
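The pooling step described above can be checked on a toy tensor. This is a minimal sketch, assuming a random feature map with a batch dimension of 1; the variable names are illustrative only.

```python
import torch
import torch.nn as nn

feature_map = torch.randn(1, 4, 8, 8)  # [B, C, H, W] with B = 1, C = 4

# Global pooling: an output size of 1 collapses H and W to a single value.
avg_pooled = nn.AdaptiveAvgPool2d(1)(feature_map)
max_pooled = nn.AdaptiveMaxPool2d(1)(feature_map)
print(avg_pooled.shape)  # torch.Size([1, 4, 1, 1])

# The same result written as an explicit reduction over the H and W axes.
manual_avg = feature_map.mean(dim=(2, 3), keepdim=True)
manual_max = feature_map.amax(dim=(2, 3), keepdim=True)
print(torch.allclose(avg_pooled, manual_avg, atol=1e-6))
print(torch.equal(max_pooled, manual_max))
```

Either way, each of the `C` channels ends up with exactly one scalar descriptor.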
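import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention: [B, C, H, W] -> [B, C, H, W]."""

    def __init__(self, mode, channels, ratio):
        super(SEBlock, self).__init__()
        self.avg_pooling = nn.AdaptiveAvgPool2d(1)
        self.max_pooling = nn.AdaptiveMaxPool2d(1)
        if mode == "max":
            self.global_pooling = self.max_pooling
        elif mode == "avg":
            self.global_pooling = self.avg_pooling
        else:
            raise ValueError("mode must be 'max' or 'avg'")
        # Bottleneck: C -> C // ratio -> C. The SE paper places a ReLU
        # between the two fully connected layers.
        self.fc_layers = nn.Sequential(
            nn.Linear(in_features = channels, out_features = channels // ratio, bias = False),
            nn.ReLU(),
            nn.Linear(in_features = channels // ratio, out_features = channels, bias = False),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        b, c, _, _ = x.shape
        v = self.global_pooling(x).view(b, c)   # squeeze: [B, C, H, W] -> [B, C]
        v = self.fc_layers(v).view(b, c, 1, 1)  # excitation: [B, C] -> [B, C, 1, 1]
        v = self.sigmoid(v)                     # channel weights in (0, 1)
        return x * v                            # rescale every channel


if __name__ == "__main__":
    model = SEBlock("max", 54, 9)
    feature_maps = torch.randn((8, 54, 32, 32))
    print(model(feature_maps).shape)  # the output keeps the input shape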
Abstract: Recently, channel attention mechanism has demonstrated to offer great potential in improving the performance of deep convolutional neural networks (CNNs). However, most existing methods dedicate to developing more sophisticated attention modules for achieving better performance, which inevitably increase model complexity. To overcome the paradox of performance and complexity trade-off, this paper proposes an Efficient Channel Attention (ECA) module, which only involves a handful of parameters while bringing clear performance gain. By dissecting the channel attention module in SENet, we empirically show avoiding dimensionality reduction is important for learning channel attention, and appropriate cross-channel interaction can preserve performance while significantly decreasing model complexity. Therefore, we propose a local crosschannel interaction strategy without dimensionality reduction, which can be efficiently implemented via 1D convolution. Furthermore, we develop a method to adaptively select kernel size of 1D convolution, determining coverage of local cross-channel interaction. The proposed ECA module is efficient yet effective, e.g., the parameters and computations of our modules against backbone of ResNet50 are 80 vs. 24.37M and 4.7e-4 GFLOPs vs. 3.86 GFLOPs, respectively, and the performance boost is more than 2% in terms of Top-1 accuracy. We extensively evaluate our ECA module on image classification, object detection and instance segmentation with backbones of ResNets and MobileNetV2. The experimental results show our module is more efficient while performing favorably against its counterparts.
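Given the aggregated feature $[C,1,1]$ obtained by average pooling, the ECA module generates channel weights by performing a 1D convolution with kernel size $k$, where $k$ is determined adaptively through a mapping of the channel dimension $C$.
The kernel size is determined adaptively by:

$$k = \left| \frac{\log_2 C + b}{\gamma} \right|_{odd}$$

Here $k$ is the kernel size, $C$ is the number of channels, $|\cdot|_{odd}$ means that $k$ may only take odd values, and $\gamma$ and $b$, set to 2 and 1 in the paper, control the ratio between the channel count $C$ and the kernel size $k$.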
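To get a feel for the formula, the kernel size can be tabulated for a few channel counts. This is a sketch under one assumption: $|\cdot|_{odd}$ is realised by truncating to an integer and moving an even result up to the next odd number. The helper name `eca_kernel_size` is hypothetical.

```python
import math


def eca_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
    """Kernel size from the ECA mapping, forced to be odd (assumed rounding rule)."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1


for c in (16, 64, 128, 256, 512, 2048):
    print(c, eca_kernel_size(c))
# 16 -> 3, 64 -> 3, 128 -> 5, 256 -> 5, 512 -> 5, 2048 -> 7
```

With $\gamma = 2$ and $b = 1$ the kernel grows only logarithmically with the channel count, so even very wide layers use a small 1D kernel.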
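import math
import torch
import torch.nn as nn


class ECABlock(nn.Module):
    """Efficient Channel Attention: a 1D convolution across channels, no dimensionality reduction."""

    def __init__(self, channels, gamma = 2, b = 1):
        super(ECABlock, self).__init__()
        # Adaptive kernel size; math.log2 avoids rounding surprises of math.log(x, 2).
        kernel_size = int(abs((math.log2(channels) + b) / gamma))
        # Force the kernel size to be odd so that "same" padding is symmetric.
        kernel_size = kernel_size if kernel_size % 2 else kernel_size + 1
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size = kernel_size, padding = (kernel_size - 1) // 2, bias = False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        v = self.avg_pool(x)  # [B, C, 1, 1]
        # [B, C, 1, 1] -> [B, C, 1] -> [B, 1, C] -> conv over channels -> [B, C, 1] -> [B, C, 1, 1]
        v = self.conv(v.squeeze(-1).transpose(-1, -2)).transpose(-1, -2).unsqueeze(-1)
        v = self.sigmoid(v)
        return x * v


if __name__ == "__main__":
    feature_maps = torch.randn((8, 54, 32, 32))
    model = ECABlock(54, gamma = 2, b = 1)
    print(model(feature_maps).shape)  # the output keeps the input shape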
# SEBlock: channel weights come from fully connected layers
def forward(self, x):
    b, c, _, _ = x.shape
    v = self.global_pooling(x).view(b, c)
    v = self.fc_layers(v).view(b, c, 1, 1)
    v = self.sigmoid(v)
    return x * v

# ECABlock: channel weights come from a 1D convolution
def forward(self, x):
    v = self.avg_pool(x)
    v = self.conv(v.squeeze(-1).transpose(-1, -2)).transpose(-1, -2).unsqueeze(-1)
    v = self.sigmoid(v)
    return x * v
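One practical consequence of the two `forward` variants is the number of learnable weights. The sketch below rebuilds only the weight-carrying layers, using the example sizes from the code (54 channels, ratio 9, kernel size 3); the helper `count_parameters` is hypothetical and the ReLU between the linear layers follows the SE paper.

```python
import torch.nn as nn

channels, ratio, kernel_size = 54, 9, 3

# SE-style bottleneck: two bias-free fully connected layers.
se_fc = nn.Sequential(
    nn.Linear(channels, channels // ratio, bias=False),
    nn.ReLU(),
    nn.Linear(channels // ratio, channels, bias=False),
)

# ECA-style 1D convolution shared by all channels.
eca_conv = nn.Conv1d(1, 1, kernel_size=kernel_size,
                     padding=(kernel_size - 1) // 2, bias=False)


def count_parameters(module: nn.Module) -> int:
    """Total number of learnable scalars in a module."""
    return sum(p.numel() for p in module.parameters())


se_params = count_parameters(se_fc)      # 2 * C * (C // ratio) = 648
eca_params = count_parameters(eca_conv)  # kernel_size = 3
print(se_params, eca_params)
```

The fully connected version grows quadratically with the channel count, while the convolutional version depends only on the kernel size.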
Abstract: We propose Convolutional Block Attention Module (CBAM), a simple yet effective attention module for feed-forward convolutional neural networks. Given an intermediate feature map, our module sequentially infers attention maps along two separate dimensions, channel and spatial, then the attention maps are multiplied to the input feature map for adaptive feature refinement. Because CBAM is a lightweight and general module, it can be integrated into any CNN architectures seamlessly with negligible overheads and is end-to-end trainable along with base CNNs. We validate our CBAM through extensive experiments on ImageNet-1K, MS COCO detection, and VOC 2007 detection datasets. Our experiments show consistent improvements in classification and detection performances with various models, demonstrating the wide applicability of CBAM. The code and models will be publicly available.
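We can view this as a feature map of size $[H,W]$ in which every point $(x,y)$, $x\in(0,H)$, $y\in(0,W)$, holds $C$ values. These values express how important that point of the feature map is; mapped back to the original image through the receptive field, they express the importance of the corresponding region. The network should adaptively focus on the places that need attention (places with larger values attract more attention), and this is what the spatial attention mechanism provides.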
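Before the full module, here is a minimal sketch of how the $C$ values at each point can be compressed into a two-channel spatial descriptor. It assumes a random feature map, and the names `avg_map`, `max_map`, and `descriptor` are illustrative.

```python
import torch

x = torch.randn(8, 54, 32, 32)  # [B, C, H, W]

# Reduce over the channel axis so every spatial position keeps one value.
avg_map = torch.mean(x, dim=1, keepdim=True)    # [B, 1, H, W]
max_map, _ = torch.max(x, dim=1, keepdim=True)  # [B, 1, H, W]

# Concatenate the two maps; a convolution can then turn them into
# a single [B, 1, H, W] map of spatial weights.
descriptor = torch.cat((max_map, avg_map), dim=1)  # [B, 2, H, W]
print(avg_map.shape, max_map.shape, descriptor.shape)
```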
import math
import torch
import torch.nn as nn


class Channel_Attention_Module_FC(nn.Module):
    """CBAM channel attention with a shared fully connected bottleneck."""

    def __init__(self, channels, ratio):
        super(Channel_Attention_Module_FC, self).__init__()
        self.avg_pooling = nn.AdaptiveAvgPool2d(1)
        self.max_pooling = nn.AdaptiveMaxPool2d(1)
        # Shared MLP applied to both pooled descriptors (ReLU in between, as in the CBAM paper).
        self.fc_layers = nn.Sequential(
            nn.Linear(in_features = channels, out_features = channels // ratio, bias = False),
            nn.ReLU(),
            nn.Linear(in_features = channels // ratio, out_features = channels, bias = False),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        b, c, h, w = x.shape
        avg_x = self.avg_pooling(x).view(b, c)
        max_x = self.max_pooling(x).view(b, c)
        # Sum the two branches before the sigmoid.
        v = self.fc_layers(avg_x) + self.fc_layers(max_x)
        v = self.sigmoid(v).view(b, c, 1, 1)
        return x * v
class Channel_Attention_Module_Conv(nn.Module):
    """ECA-style channel attention using a 1D convolution across channels."""

    def __init__(self, channels, gamma = 2, b = 1):
        super(Channel_Attention_Module_Conv, self).__init__()
        # Adaptive kernel size; math.log2 avoids rounding surprises of math.log(x, 2).
        kernel_size = int(abs((math.log2(channels) + b) / gamma))
        kernel_size = kernel_size if kernel_size % 2 else kernel_size + 1
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size = kernel_size, padding = (kernel_size - 1) // 2, bias = False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        v = self.avg_pool(x)  # [B, C, 1, 1]
        # [B, C, 1, 1] -> [B, 1, C] -> conv over channels -> [B, C, 1, 1]
        v = self.conv(v.squeeze(-1).transpose(-1, -2)).transpose(-1, -2).unsqueeze(-1)
        v = self.sigmoid(v)
        return x * v
class Spatial_Attention_Module(nn.Module):
    """CBAM spatial attention: one weight per spatial position."""

    def __init__(self, k: int):
        super(Spatial_Attention_Module, self).__init__()
        self.avg_pooling = torch.mean
        self.max_pooling = torch.max
        # To keep the output the same size as the input we need k = 1 + 2p,
        # where k is the kernel size and p is the padding.
        # p = 1 -> k = 3; p = 2 -> k = 5; p = 3 -> k = 7 all work.
        # p = 4 -> k = 9 is considered too large to use in a network.
        assert k in [3, 5, 7], "kernel size = 1 + 2 * padding, so kernel size must be 3, 5, 7"
        self.conv = nn.Conv2d(2, 1, kernel_size = (k, k), stride = (1, 1), padding = ((k - 1) // 2, (k - 1) // 2),
                              bias = False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # Compress the C channels to 1 while keeping the dimension.
        avg_x = self.avg_pooling(x, dim = 1, keepdim = True)
        max_x, _ = self.max_pooling(x, dim = 1, keepdim = True)
        v = self.conv(torch.cat((max_x, avg_x), dim = 1))  # [B, 2, H, W] -> [B, 1, H, W]
        v = self.sigmoid(v)
        return x * v
class CBAMBlock(nn.Module):
    """Channel attention followed by spatial attention."""

    def __init__(self, channel_attention_mode: str, spatial_attention_kernel_size: int, channels: int = None,
                 ratio: int = None, gamma: int = None, b: int = None):
        super(CBAMBlock, self).__init__()
        if channel_attention_mode == "FC":
            assert channels is not None and ratio is not None, \
                "FC channel attention block need feature maps' channels, ratio"
            self.channel_attention_block = Channel_Attention_Module_FC(channels = channels, ratio = ratio)
        elif channel_attention_mode == "Conv":
            assert channels is not None and gamma is not None and b is not None, \
                "Conv channel attention block need feature maps' channels, gamma, b"
            self.channel_attention_block = Channel_Attention_Module_Conv(channels = channels, gamma = gamma, b = b)
        else:
            raise ValueError("channel attention block must be 'FC' or 'Conv'")
        self.spatial_attention_block = Spatial_Attention_Module(k = spatial_attention_kernel_size)

    def forward(self, x):
        x = self.channel_attention_block(x)
        x = self.spatial_attention_block(x)
        return x


if __name__ == "__main__":
    feature_maps = torch.randn((8, 54, 32, 32))
    model = CBAMBlock("FC", 5, channels = 54, ratio = 9)
    print(model(feature_maps).shape)  # the output keeps the input shape
    model = CBAMBlock("Conv", 5, channels = 54, gamma = 2, b = 1)
    print(model(feature_maps).shape)
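Spatial attention and channel attention reach the same goal by different routes: both derive weights and apply them to the original feature map, except that one works over the $[H,W]$ dimensions and the other over the $[C]$ dimension. Because this approach can improve accuracy without adding much computation, it is a handy trick.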
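The difference between the two kinds of weights comes down to broadcasting. In the sketch below, random sigmoid outputs stand in for learned weights (an assumption made purely for illustration); only the shapes matter.

```python
import torch

x = torch.randn(8, 54, 32, 32)  # [B, C, H, W]

# Stand-ins for learned attention weights.
channel_weights = torch.sigmoid(torch.randn(8, 54, 1, 1))   # one weight per channel
spatial_weights = torch.sigmoid(torch.randn(8, 1, 32, 32))  # one weight per position

# Broadcasting expands the size-1 axes, so both products keep the input shape.
channel_refined = x * channel_weights
spatial_refined = x * spatial_weights
print(channel_refined.shape, spatial_refined.shape)

# A channel weight is shared by every position of that channel;
# a spatial weight is shared by every channel at that position.
same_channel = torch.equal(channel_refined[0, 0], x[0, 0] * channel_weights[0, 0, 0, 0])
same_position = torch.equal(spatial_refined[0, 5], x[0, 5] * spatial_weights[0, 0])
print(same_channel, same_position)
```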
Attention is all you need!