Residual attention network for image classification
Abstract: Our residual attention network is built by stacking attention modules that generate attention-aware features. Each module uses a bottom-up top-down feedforward structure. The proposed attention residual learning makes it possible to train deeper residual networks.
1. Introduction
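Previously, no attention mechanism had been applied to a feedforward network structure to achieve state-of-the-art image classification results. The proposed network has two properties: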
(1) Adding more attention modules consistently improves performance
(2) The network can be made deeper
Contributions:
(1) Stacked network structure
(2) Attention residual learning
(3) Bottom-up top-down feedforward attention
2. Related work
3. Residual attention network
Each attention module has two branches: a mask branch and a trunk branch. The trunk branch performs feature processing and can be adapted to any structure. In this work, we use pre-activation Residual Unit [11], ResNeXt [36] and Inception [32] as the basic units of our Residual Attention Networks to construct the Attention Module.
3.1 Attention Residual learning
However, naively stacking Attention Modules leads to an obvious performance drop. First, repeated dot products with a mask ranging from zero to one degrade the feature values in deep layers. Second, the soft mask can potentially break good properties of the trunk branch.
M(x), the output of the mask branch, acts as a feature selector that enhances good features and suppresses noise from the trunk features.
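A toy numeric sketch of the degradation described above, in plain Python. It assumes the attention residual learning formula from the original paper, H(x) = (1 + M(x)) * F(x), which these notes do not write out. The identity trunk, the fixed mask values, the depth of 5 and all function names are hypothetical choices for illustration, not the authors' implementation.

```python
def naive_attention(features, mask):
    """Naive attention: H(x) = M(x) * F(x), element-wise."""
    return [m * f for m, f in zip(mask, features)]


def residual_attention(features, mask):
    """Attention residual learning: H(x) = (1 + M(x)) * F(x), element-wise."""
    return [(1.0 + m) * f for m, f in zip(mask, features)]


def stack(module, features, mask, depth):
    """Apply the same attention module `depth` times (trunk = identity here)."""
    out = list(features)
    for _ in range(depth):
        out = module(out, mask)
    return out


features = [1.0, 2.0, 4.0]
mask = [0.5, 0.5, 0.0]  # hypothetical soft mask values in [0, 1]

naive_out = stack(naive_attention, features, mask, depth=5)
residual_out = stack(residual_attention, features, mask, depth=5)

print(naive_out)     # [0.03125, 0.0625, 0.0]
print(residual_out)  # [7.59375, 15.1875, 4.0]
```

With the naive form every feature shrinks, and a mask of 0 erases the feature entirely. With the residual form a mask of 0 passes the trunk feature through unchanged, so the mask can only emphasize features. The growth of the other two values is an artifact of this toy, which has no normalization layers.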
3.2 Soft Mask Branch
From the input, max pooling is performed several times to rapidly increase the receptive field after a small number of Residual Units. After reaching the lowest resolution, the global information is expanded by a symmetrical top-down architecture to guide the input features at each position. Linear interpolation upsamples the output after some Residual Units. The number of bilinear interpolation steps equals the number of max pooling steps, so the output has the same size as the input feature map. Then a sigmoid layer normalizes the output to the range (0, 1).
At the lowest resolution, this structure can be seen as having gathered the global information of the whole network, much like the 1x1 feature vector produced by global average pooling: very dense global information, which is then interpolated back up by linear interpolation.
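The pool, interpolate, sigmoid pipeline of the mask branch can be sketched in plain Python on a single-channel feature map. This is a simplified illustration under stated assumptions: the Residual Units between steps are left out (treated as identity), pooling is 2x2 with stride 2, interpolation is corner-aligned, and the function names and the 4x4 input are hypothetical.

```python
import math


def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on a 2D list (even height and width assumed)."""
    return [
        [max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
         for j in range(0, len(fmap[0]), 2)]
        for i in range(0, len(fmap), 2)
    ]


def bilinear_upsample_2x(fmap):
    """Double height and width by bilinear interpolation (corner-aligned)."""
    h_in, w_in = len(fmap), len(fmap[0])
    h_out, w_out = 2 * h_in, 2 * w_in
    out = []
    for i in range(h_out):
        # Map the output row back to a fractional input row.
        y = i * (h_in - 1) / (h_out - 1)
        y0 = int(y)
        y1 = min(y0 + 1, h_in - 1)
        wy = y - y0
        row = []
        for j in range(w_out):
            x = j * (w_in - 1) / (w_out - 1)
            x0 = int(x)
            x1 = min(x0 + 1, w_in - 1)
            wx = x - x0
            top = (1 - wx) * fmap[y0][x0] + wx * fmap[y0][x1]
            bottom = (1 - wx) * fmap[y1][x0] + wx * fmap[y1][x1]
            row.append((1 - wy) * top + wy * bottom)
        out.append(row)
    return out


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def soft_mask(fmap, num_pools=2):
    """Bottom-up (pool), then top-down (interpolate), then sigmoid.

    Residual Units between the steps are omitted in this sketch.
    """
    out = fmap
    for _ in range(num_pools):
        out = max_pool_2x2(out)
    for _ in range(num_pools):  # same count as the pooling steps
        out = bilinear_upsample_2x(out)
    return [[sigmoid(v) for v in row] for row in out]


fmap = [[float(4 * r + c) for c in range(4)] for r in range(4)]
mask = soft_mask(fmap, num_pools=2)
```

Here two pooling steps reduce the 4x4 map to a single value, so every position of the final mask carries the same global value, which matches the comparison with global average pooling above. The output has the same 4x4 size as the input because the number of upsampling steps equals the number of pooling steps.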
3.3 Spatial attention and channel attention
In our work, the attention provided by the mask branch adapts to the trunk branch features.
Three activation functions are used to produce the output:
F1: simple sigmoid
F2: L2 normalization across all channels at each spatial position, which removes spatial information
F3: normalization within each channel's feature map followed by a sigmoid, giving a soft mask related to spatial information only
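The three activation functions can be sketched in plain Python on a feature tensor stored as nested lists x[c][h][w]. The formulas follow the descriptions above; the eps term, the use of the population standard deviation in F3, the function names and the tiny example tensor are assumptions made for this sketch.

```python
import math
from statistics import mean, pstdev


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def f1_mixed(x):
    """F1: plain sigmoid applied to every element."""
    return [[[sigmoid(v) for v in row] for row in ch] for ch in x]


def f2_channel(x, eps=1e-12):
    """F2: L2-normalize the channel vector at each spatial position."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for i in range(H):
        for j in range(W):
            norm = math.sqrt(sum(x[c][i][j] ** 2 for c in range(C)))
            for c in range(C):
                out[c][i][j] = x[c][i][j] / (norm + eps)  # eps avoids 0/0
    return out


def f3_spatial(x, eps=1e-12):
    """F3: standardize each channel's feature map, then apply sigmoid."""
    out = []
    for ch in x:
        values = [v for row in ch for v in row]
        mu, sd = mean(values), pstdev(values)  # population std (assumption)
        out.append([[sigmoid((v - mu) / (sd + eps)) for v in row] for row in ch])
    return out


# Hypothetical tensor with 2 channels, height 1, width 2.
x = [[[3.0, 0.0]], [[4.0, 1.0]]]
m1 = f1_mixed(x)
m2 = f2_channel(x)
m3 = f3_spatial(x)
```

F2 makes the channel vector at every position unit length, so differences in overall magnitude between positions disappear. F3 removes each channel's own mean and scale, so only the pattern across positions within a channel remains.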
4. Experiments
5. Discussion
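1. Different attention modules capture different types of attention to guide feature learning
2. Attention residual learning allows deeper structures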