arXiv-2021
Origin of the paper's name (quoted from Polarized Self-Attention: Towards High-quality Pixel-wise Regression)
In photography, there are always random lights in transverse directions that produce glares/reflections. Polarized filtering, by only allowing the light to pass orthogonal to the transverse direction, can potentially improve the contrast of the photo. Due to the loss of total intensity, the light after filtering usually has a small dynamic range and therefore needs an additional boost, e.g. by High Dynamic Range (HDR), to recover the details of the original scene.
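To suppress glare/reflections, a polarizing filter is applied, but part of the total intensity is lost, so an additional boost (e.g. HDR) is needed to recover the details of the original scene.
This closely resembles attention.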
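Modeling the spatial and channel dimensions jointly without any dimensionality reduction makes computation and GPU memory explode. To avoid this, the authors adopt a polarized filtering mechanism in PSA.
(1) Filtering: completely collapse the features along one dimension (e.g. the channel dimension) while keeping the orthogonal dimension (e.g. the spatial dimension) at high resolution.
(2) High Dynamic Range (HDR): first apply Softmax on the smallest tensor in the attention module to widen the attention range, then apply Sigmoid for dynamic mapping.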
Deep learning for vision has moved from coarse-grained tasks (classification / detection) to fine-grained computer vision tasks (keypoint estimation / segmentation).
The pixel-wise regression problem has a higher problem complexity by the order of output element numbers.
Given this difficulty, the authors look for a plug-and-play solution that improves the accuracy of pixel-wise regression, and propose Polarized Self-Attention (PSA).
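In the attention design, the channel attention keeps as much channel information as possible, and the spatial attention keeps as much spatial information as possible.
At the output, softmax (associated with a Gaussian distribution) is combined with sigmoid (associated with a binomial distribution).
Borrowing the idea of polarized filtering from optics, the paper proposes Polarized Self-Attention, which gives clear gains on public datasets for keypoint detection and semantic segmentation.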
2D Gaussian distribution (keypoint heatmaps)
2D Binomial distribution (binary segmentation masks)
fuse softmax-sigmoid composition in both channel-only and spatial-only attention branches
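To make the two output distributions named above concrete, here is a minimal sketch of both target types, assuming PyTorch. The helper `gaussian_heatmap`, the keypoint position, `sigma`, and the thresholded toy mask are illustrative assumptions and do not come from the paper.

```python
import torch


def gaussian_heatmap(height, width, center_xy, sigma=2.0):
    """Unnormalized 2D Gaussian peaked at center_xy = (x, y). Hypothetical helper."""
    ys = torch.arange(height, dtype=torch.float32).unsqueeze(1)  # (H, 1)
    xs = torch.arange(width, dtype=torch.float32).unsqueeze(0)   # (1, W)
    cx, cy = center_xy
    # Broadcasting (H, 1) against (1, W) yields the full (H, W) grid.
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))


# Keypoint-style target: a smooth bump around one location.
heatmap = gaussian_heatmap(96, 72, center_xy=(30, 50))

# Segmentation-style target: every pixel is either 0 or 1 (toy mask made by
# thresholding the bump, only for illustration).
mask = (heatmap > 0.5).float()

print(heatmap.shape, mask.unique())
```

The heatmap takes continuous values with a single peak, while the mask is a per-pixel binary decision; these are the two kinds of output the softmax-sigmoid composition is meant to fit.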
Channel attention produces a $C\times1\times1$ tensor.
$C\times1\times1$ can also be obtained from the product $(C\times HW)\times(HW\times1\times1)$.
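The shape bookkeeping for the channel branch can be sketched with plain tensors. This assumes PyTorch; `value` and `query` are random stand-ins for the outputs of learned 1×1 convolutions, and the sizes are arbitrary.

```python
import torch

torch.manual_seed(0)
b, c, h, w = 2, 16, 7, 7

# Stand-ins for the projected features (assumption: random tensors instead of
# learned 1x1 convolutions).
value = torch.randn(b, c, h * w)   # (b, C, HW)
query = torch.randn(b, h * w, 1)   # (b, HW, 1)

# Softmax over the HW positions turns the query into pooling weights.
query = torch.softmax(query, dim=1)

# (C x HW) x (HW x 1) -> (C x 1), then add the second singleton spatial axis.
channel_descriptor = torch.matmul(value, query).unsqueeze(-1)  # (b, C, 1, 1)
print(channel_descriptor.shape)
```

Each of the C outputs is a weighted sum over all HW positions, so the spatial dimension is collapsed while the channel dimension is kept.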
Spatial attention produces a $1\times H\times W$ tensor.
$1\times H\times W$ can also be obtained from the product $(1\times C)\times(C\times HW)$ followed by a reshape.
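The spatial branch is the mirror image: the channel axis is collapsed and the HW axis survives. A minimal sketch, assuming PyTorch and random stand-ins for the projected `query` and `value`:

```python
import torch

torch.manual_seed(0)
b, c, h, w = 2, 16, 7, 7

# Stand-ins for the projected features (assumption: random tensors).
query = torch.softmax(torch.randn(b, 1, c), dim=-1)  # (b, 1, C), weights over channels
value = torch.randn(b, c, h * w)                     # (b, C, HW)

# (1 x C) x (C x HW) -> (1 x HW), then reshape into a 1 x H x W map.
spatial_map = torch.matmul(query, value).reshape(b, 1, h, w)
print(spatial_map.shape)
```

Every location of `spatial_map` is a weighted sum of the C channel values at that location, with the same channel weights shared by all positions.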
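Compared with CBAM (see the note [Attention] "CBAM: Convolutional Block Attention Module"):
The channel attention uses more spatial information (no global pooling), and the spatial attention makes fuller use of channel information (no mean over channels).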
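One way to see the difference from pooling-based attention: with uniform weights the matrix products reduce exactly to global average pooling and to the channel mean, so a non-uniform softmax query is a strict generalization. The sketch below assumes PyTorch, leaves out the learned 1×1 convolutions, and compares only against average pooling (CBAM also uses max pooling).

```python
import torch

torch.manual_seed(0)
b, c, h, w = 2, 8, 5, 5
x = torch.randn(b, c, h, w)
flat = x.reshape(b, c, h * w)

# Channel branch: a constant query gives uniform softmax weights 1/(H*W) ...
uniform_q = torch.softmax(torch.zeros(b, h * w, 1), dim=1)
pooled_uniform = torch.matmul(flat, uniform_q).squeeze(-1)   # (b, c)
# ... which is exactly global average pooling.
gap = x.mean(dim=(2, 3))                                     # (b, c)

# A non-constant query weights the positions differently.
random_q = torch.softmax(torch.randn(b, h * w, 1), dim=1)
pooled_weighted = torch.matmul(flat, random_q).squeeze(-1)   # (b, c)

# Spatial branch: uniform channel weights reduce to the mean over channels.
uniform_c = torch.softmax(torch.zeros(b, 1, c), dim=-1)
map_uniform = torch.matmul(uniform_c, flat).reshape(b, 1, h, w)
channel_mean = x.mean(dim=1, keepdim=True)

print(torch.allclose(pooled_uniform, gap, atol=1e-5))
print(torch.allclose(map_uniform, channel_mean, atol=1e-5))
```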
A look at the formulas
(1) Channel attention
$F_{SM}$ denotes softmax.
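Read off the implementation listed later in this post, the two branches can be written as follows. This is a reconstruction from that code, not a copy of the paper's equations. The symbols are assumptions: $W_q$, $W_v$, $W_z$ are the 1×1 convolutions of the respective branch, $\mathrm{LN}$ is layer normalization, $F_{GP}$ is global average pooling, $\sigma$ is the sigmoid, $\times$ is matrix multiplication, $\odot$ is element-wise multiplication with broadcasting, and reshapes are omitted.

```latex
% Channel-only branch: softmax over the HW positions, output of size C x 1 x 1
A^{ch}(X) = \sigma\Big(\mathrm{LN}\big(W_z\,[\,W_v(X)\times F_{SM}(W_q(X))\,]\big)\Big),
\qquad Z^{ch} = A^{ch}(X)\odot X

% Spatial-only branch: softmax over the C/2 channels, output of size 1 x H x W
A^{sp}(X) = \sigma\Big(F_{SM}\big(F_{GP}(W_q(X))\big)\times W_v(X)\Big),
\qquad Z^{sp} = A^{sp}(X)\odot X
```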
Relation of PSA to other Self-Attentions
A look at the code
import torch
from torch import nn


class ParallelPolarizedSelfAttention(nn.Module):
    """Channel-only and spatial-only branches applied to the same input, then summed."""

    def __init__(self, channel=512):
        super().__init__()
        self.ch_wv = nn.Conv2d(channel, channel // 2, kernel_size=(1, 1))
        self.ch_wq = nn.Conv2d(channel, 1, kernel_size=(1, 1))
        self.softmax_channel = nn.Softmax(1)   # softmax over the h*w positions
        self.softmax_spatial = nn.Softmax(-1)  # softmax over the c//2 channels
        self.ch_wz = nn.Conv2d(channel // 2, channel, kernel_size=(1, 1))
        self.ln = nn.LayerNorm(channel)
        self.sigmoid = nn.Sigmoid()
        self.sp_wv = nn.Conv2d(channel, channel // 2, kernel_size=(1, 1))
        self.sp_wq = nn.Conv2d(channel, channel // 2, kernel_size=(1, 1))
        self.agp = nn.AdaptiveAvgPool2d((1, 1))

    def forward(self, x):
        b, c, h, w = x.size()

        # Channel-only Self-Attention
        channel_wv = self.ch_wv(x)  # bs,c//2,h,w
        channel_wq = self.ch_wq(x)  # bs,1,h,w
        channel_wv = channel_wv.reshape(b, c // 2, -1)  # bs,c//2,h*w
        channel_wq = channel_wq.reshape(b, -1, 1)  # bs,h*w,1
        channel_wq = self.softmax_channel(channel_wq)  # bs,h*w,1
        channel_wz = torch.matmul(channel_wv, channel_wq).unsqueeze(-1)  # bs,c//2,1,1
        # Move channels last so LayerNorm normalizes over the c channels.
        channel_wz = self.ch_wz(channel_wz).reshape(b, c, 1).permute(0, 2, 1)  # bs,1,c
        channel_weight = self.sigmoid(self.ln(channel_wz)).permute(0, 2, 1).reshape(b, c, 1, 1)  # bs,c,1,1
        channel_out = channel_weight * x

        # Spatial-only Self-Attention
        spatial_wv = self.sp_wv(x)  # bs,c//2,h,w
        spatial_wq = self.sp_wq(x)  # bs,c//2,h,w
        spatial_wq = self.agp(spatial_wq)  # bs,c//2,1,1
        spatial_wv = spatial_wv.reshape(b, c // 2, -1)  # bs,c//2,h*w
        spatial_wq = spatial_wq.permute(0, 2, 3, 1).reshape(b, 1, c // 2)  # bs,1,c//2
        spatial_wq = self.softmax_spatial(spatial_wq)  # bs,1,c//2
        spatial_wz = torch.matmul(spatial_wq, spatial_wv)  # bs,1,h*w
        spatial_weight = self.sigmoid(spatial_wz.reshape(b, 1, h, w))  # bs,1,h,w
        spatial_out = spatial_weight * x

        # Parallel layout: both branches see x, and their outputs are added.
        out = spatial_out + channel_out
        return out


class SequentialPolarizedSelfAttention(nn.Module):
    """Channel-only branch first; the spatial-only branch then runs on its output."""

    def __init__(self, channel=512):
        super().__init__()
        self.ch_wv = nn.Conv2d(channel, channel // 2, kernel_size=(1, 1))
        self.ch_wq = nn.Conv2d(channel, 1, kernel_size=(1, 1))
        self.softmax_channel = nn.Softmax(1)   # softmax over the h*w positions
        self.softmax_spatial = nn.Softmax(-1)  # softmax over the c//2 channels
        self.ch_wz = nn.Conv2d(channel // 2, channel, kernel_size=(1, 1))
        self.ln = nn.LayerNorm(channel)
        self.sigmoid = nn.Sigmoid()
        self.sp_wv = nn.Conv2d(channel, channel // 2, kernel_size=(1, 1))
        self.sp_wq = nn.Conv2d(channel, channel // 2, kernel_size=(1, 1))
        self.agp = nn.AdaptiveAvgPool2d((1, 1))

    def forward(self, x):
        b, c, h, w = x.size()

        # Channel-only Self-Attention
        channel_wv = self.ch_wv(x)  # bs,c//2,h,w
        channel_wq = self.ch_wq(x)  # bs,1,h,w
        channel_wv = channel_wv.reshape(b, c // 2, -1)  # bs,c//2,h*w
        channel_wq = channel_wq.reshape(b, -1, 1)  # bs,h*w,1
        channel_wq = self.softmax_channel(channel_wq)  # bs,h*w,1
        channel_wz = torch.matmul(channel_wv, channel_wq).unsqueeze(-1)  # bs,c//2,1,1
        # Move channels last so LayerNorm normalizes over the c channels.
        channel_wz = self.ch_wz(channel_wz).reshape(b, c, 1).permute(0, 2, 1)  # bs,1,c
        channel_weight = self.sigmoid(self.ln(channel_wz)).permute(0, 2, 1).reshape(b, c, 1, 1)  # bs,c,1,1
        channel_out = channel_weight * x

        # Spatial-only Self-Attention, computed from the channel-gated features
        spatial_wv = self.sp_wv(channel_out)  # bs,c//2,h,w
        spatial_wq = self.sp_wq(channel_out)  # bs,c//2,h,w
        spatial_wq = self.agp(spatial_wq)  # bs,c//2,1,1
        spatial_wv = spatial_wv.reshape(b, c // 2, -1)  # bs,c//2,h*w
        spatial_wq = spatial_wq.permute(0, 2, 3, 1).reshape(b, 1, c // 2)  # bs,1,c//2
        spatial_wq = self.softmax_spatial(spatial_wq)  # bs,1,c//2
        spatial_wz = torch.matmul(spatial_wq, spatial_wv)  # bs,1,h*w
        spatial_weight = self.sigmoid(spatial_wz.reshape(b, 1, h, w))  # bs,1,h,w
        spatial_out = spatial_weight * channel_out
        return spatial_out
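

if __name__ == '__main__':
    # Quick shape check: the module should preserve the input shape.
    x = torch.randn(1, 512, 7, 7)  # renamed from `input` to avoid shadowing the builtin
    psa = SequentialPolarizedSelfAttention(channel=512)
    output = psa(x)
    print(output.shape)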
The code is fairly straightforward.
We add PSAs after the first 3 × 3 convolution in every residual block.
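The placement described above could look like the following sketch. Several things here are assumptions: a ResNet-style basic block is used, the attention is applied after the batch norm and ReLU that follow the first 3×3 convolution (the quoted sentence does not say where exactly), and `BasicBlockWithAttention` and its `attention` argument are hypothetical names. `nn.Identity()` stands in for a PSA module so the block runs on its own.

```python
import torch
from torch import nn


class BasicBlockWithAttention(nn.Module):
    """ResNet-style basic block with an attention module after the first 3x3 conv.

    `attention` is any shape-preserving module (hypothetical hook). In practice
    one would pass a PSA module built for `planes` channels.
    """

    def __init__(self, inplanes, planes, attention=None):
        super().__init__()
        self.conv1 = nn.Conv2d(inplanes, planes, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.attention = attention if attention is not None else nn.Identity()
        self.conv2 = nn.Conv2d(planes, planes, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.relu = nn.ReLU()
        # 1x1 projection on the skip path only when the channel counts differ.
        self.shortcut = (
            nn.Identity() if inplanes == planes
            else nn.Conv2d(inplanes, planes, 1, bias=False)
        )

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.attention(out)  # attention placed after the first 3x3 conv
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))


block = BasicBlockWithAttention(64, 64)
y = block(torch.randn(2, 64, 32, 32))
print(y.shape)
```

Because the attention module preserves the tensor shape, it can be dropped into an existing block without touching the surrounding layers, which is what makes the approach plug-and-play.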
MS-COCO 2017 human pose estimation (AP)
Pascal VOC2012 semantic segmentation (mIoU)
(1) Top-Down 2D Human Pose Estimation
Output heatmap size is 96 × 72 (1/4 of the input resolution).
(2) Semantic Segmentation
The gains are less pronounced than on keypoint estimation.
Running the channel attention and the spatial attention in parallel (p) or sequentially (s) makes little difference; the metric differences are marginal.
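The two layouts differ only in how the branch outputs are composed. A minimal sketch, assuming PyTorch; `channel_gate` and `spatial_gate` are toy stand-ins (sigmoid of a mean) for the learned channel-only and spatial-only branches, not the PSA branches themselves.

```python
import torch


def channel_gate(x):
    """Toy channel-only gate of shape (b, c, 1, 1). Stand-in for the learned branch."""
    return torch.sigmoid(x.mean(dim=(2, 3), keepdim=True))


def spatial_gate(x):
    """Toy spatial-only gate of shape (b, 1, h, w). Stand-in for the learned branch."""
    return torch.sigmoid(x.mean(dim=1, keepdim=True))


def parallel_layout(x):
    # Both gates are computed from the same input; the gated tensors are summed.
    return channel_gate(x) * x + spatial_gate(x) * x


def sequential_layout(x):
    # The spatial gate is computed from, and applied to, the channel-gated tensor.
    y = channel_gate(x) * x
    return spatial_gate(y) * y


torch.manual_seed(0)
x = torch.randn(2, 8, 5, 5)
out_p, out_s = parallel_layout(x), sequential_layout(x)
print(out_p.shape, out_s.shape)
```

Both layouts preserve the input shape. The parallel output is a sum of two gated copies of the input, while the sequential output is the input scaled by the product of two gates.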
Channel-only attention blocks put the same weights on different spatial locations, such that the classification task still benefits since its spatial information eventually collapses by pooling, and the anchor displacement regression in object detection benefits since the channel-only attention unanimously highlights all foreground pixels
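The authors have not yet evaluated how PSA performs in complex DCNN heads.
The optics backstory is told well and the gains are decent, but the analogy is not reinforced throughout the paper. The effort is spread out rather than concentrated on one point, so the argument does not land as precisely as it could.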