[2104] [NIPS 2021] Twins: Revisiting the Design of Spatial Attention in Vision Transformers

paper
code

Content

    • Contribution
    • Method
      • Twins-PCPVT
        • model architecture
        • conditional position encoding (CPE)
          • positional encoding generator (PEG)
        • architecture variants
      • Twins-SVT
        • model architecture
        • spatially separable self-attention (SSSA)
          • locally-grouped self-attention (LSA)
          • global sub-sampled attention (GSA)
          • transformer encoder
          • computational complexity
        • architecture variants
    • Experiment
        • image classification
        • object detection and instance segmentation
        • semantic segmentation
        • ablation study

Contribution

  • propose spatially separable self-attention (SSSA)
  • introduce Conditional Position Encoding (CPE)

Method

Twins-PCPVT: based on PVT (which uses only global attention) with conditional position encoding (CPE)
Twins-SVT: based on the proposed spatially separable self-attention (SSSA), which interleaves local and global attention

Twins-PCPVT

model architecture

[Image 1]
Architecture of Twins-PCPVT-S. “PEG” is the positional encoding generator from CPVT.

conditional position encoding (CPE)

given an input image of size $H\times W$, it is split into patches of size $S\times S$, so the number of patches is $N=\frac{HW}{S^2}$
the same number ($N$) of learnable absolute positional encoding vectors is added to the patch embeddings
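As a concrete example (the standard ViT setting), an input of $H=W=224$ with patch size $S=16$ gives $N=\frac{224\times 224}{16^2}=196$ patches.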

limitations of previous positional encodings

  1. the fixed length prevents the model from handling sequences longer than the longest training sequence
  2. the model is not translation-invariant, because a unique positional encoding vector is added to each patch

[Image 2]
Comparison of various positional encoding (PE) strategies tested on the ImageNet validation set in terms of top-1 accuracy. Removing the positional encodings greatly damages the performance. The relative positional encodings have inferior performance to the absolute ones.

solutions for aforementioned limitations

  1. remove the positional encoding
    the order of the input sequence is an important clue,
    and the model has no way to exploit this order without positional encodings.
  2. interpolate the positional encoding to be shorter than or equal to the fixed length
    the model requires fine-tuning, otherwise performance drops remarkably.
  3. introduce relative positional encoding (e.g. in Swin)
    cannot provide absolute position information, which is also important for classification tasks.
    complex and inefficient, since the inner implementation of the transformer needs to be modified.

requirements for the desired positional encoding

  1. make the input sequence permutation-variant but translation-invariant
  2. be inductive, i.e. able to handle sequences longer than those seen during training
  3. provide absolute position information to a certain degree

positional encoding generator (PEG)

reshape the flattened input sequence $X\in R^{B\times N\times C}$ back to $X_1\in R^{B\times C\times H\times W}$
apply a 2-D transformation $F$ to $X_1$ and get the output $X_2\in R^{B\times C\times H\times W}$, where $F$ is implemented by a 2-D convolution with kernel size $k$ ($k\geqslant 3$) and $\frac{k-1}{2}$ zero-padding
reshape $X_2$ back to produce the positional encoding $E\in R^{B\times N\times C}$
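A minimal PyTorch sketch of the PEG steps above, assuming a depth-wise 3×3 convolution as $F$ and a residual addition of the generated encoding (class and variable names are illustrative, not from the official implementation):

```python
import torch
import torch.nn as nn

class PEG(nn.Module):
    """Positional Encoding Generator: a depth-wise conv over the 2-D token grid."""
    def __init__(self, dim, k=3):
        super().__init__()
        # depth-wise conv; zero-padding (k-1)//2 keeps the spatial size unchanged
        self.proj = nn.Conv2d(dim, dim, k, stride=1, padding=(k - 1) // 2, groups=dim)

    def forward(self, x, H, W):
        # x: (B, N, C) flattened tokens with N = H * W
        B, N, C = x.shape
        feat = x.transpose(1, 2).reshape(B, C, H, W)   # back to a 2-D feature map
        pos = self.proj(feat)                          # conditional positional encoding
        pos = pos.flatten(2).transpose(1, 2)           # back to (B, N, C)
        return x + pos                                 # add the encoding to the tokens

# usage (illustrative sizes)
x = torch.randn(2, 14 * 14, 64)
out = PEG(64)(x, 14, 14)   # (2, 196, 64)
```

Because the convolution works on the 2-D grid directly, the same module handles any input resolution, which is what makes the encoding "conditional" on the input rather than of fixed length.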

[Image 3]
Schematic illustration of Positional Encoding Generator (PEG). Note d is the embedding size, N is the number of tokens. The function F can be depth-wise, separable convolution or other complicated blocks.

Q: Why does this work? How does this design inject position information into the Transformer?
A: Giving a Transformer position information essentially means assigning a position to each of the N vectors in the sequence. That position can be absolute or relative; relative position information is defined by picking a reference point and giving every vector a value describing its offset from that reference. PEG obtains the positional encoding with a convolution: the zero-padding of the convolution acts as the reference point, and the convolution extracts each vector's position relative to it. In one sentence:
The convolutional part of PEG uses zero-padding as the reference point and the convolution operation to extract relative position information, yielding a variable-length positional encoding suitable for the Transformer.
ref: zhihu

an extra learnable class token is needed to perform classification; it is not translation-invariant, although it can learn to be
CPVT-GAP replaces the class token with global average pooling (GAP) over the sequence, which is inherently translation-invariant
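A minimal sketch of such a GAP-based classification head, assuming the backbone outputs a (B, N, C) token sequence (names are illustrative):

```python
import torch.nn as nn

class GAPHead(nn.Module):
    """Classification head using global average pooling over all tokens (no cls token)."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, tokens):               # tokens: (B, N, C)
        x = self.norm(tokens).mean(dim=1)    # pooling over the sequence is translation-invariant
        return self.fc(x)
```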

[Image 4]
Vision Transformers: (a) ViT with explicit 1D learnable positional encodings (PE) (b) CPVT with conditional positional encoding from the proposed Position Encoding Generator (PEG) plugin, which is the default choice. (c) CPVT-GAP without class token (cls), but with global average pooling (GAP) over all items in the sequence. Note that GAP is a bonus version which has boosted performance.

architecture variants

[Image 5]
Configuration details of Twins-PCPVT.

Twins-SVT

model architecture

[Image 6]
Architecture of Twins-SVT-S. “PEG” is the positional encoding generator from CPVT.

spatially separable self-attention (SSSA)

[Image 7]
(a) Twins-SVT interleaves locally-grouped attention (LSA) and global sub-sampled attention (GSA). (b) Schematic view of the locally-grouped attention (LSA) and global sub-sampled attention (GSA).

locally-grouped self-attention (LSA)

the feature map $x\in R^{h\times w\times C}$ is divided into non-overlapping sub-windows of size $ws\times ws$
self-attention is applied within each sub-window, each of which contains $ws\times ws$ tokens
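A minimal PyTorch sketch of LSA, assuming $H$ and $W$ are divisible by the window size `ws` and using `nn.MultiheadAttention` for the per-window attention (names are illustrative):

```python
import torch
import torch.nn as nn

class LSA(nn.Module):
    """Locally-grouped self-attention: standard MHA inside non-overlapping ws x ws windows."""
    def __init__(self, dim, num_heads, ws):
        super().__init__()
        self.ws = ws
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, H, W):
        # x: (B, N, C) with N = H * W; H and W assumed divisible by ws
        B, N, C = x.shape
        ws = self.ws
        # partition the token grid into (H/ws) x (W/ws) windows of ws*ws tokens each
        x = x.reshape(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
        x, _ = self.attn(x, x, x)            # attention only within each window
        # restore the (B, N, C) layout
        x = x.reshape(B, H // ws, W // ws, ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)
        return x
```

All windows are processed together as an enlarged batch, which is why the attention cost grows with $ws^2$ per token rather than with $hw$.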

global sub-sampled attention (GSA)

[Image 8]
Multi-head attention (MHA) vs. spatial-reduction attention (SRA). With the spatial-reduction operation, the computational/memory cost of our SRA is much lower than that of MHA.
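A minimal PyTorch sketch of GSA, assuming the sub-sampling function is a strided convolution that summarizes each $r\times r$ window into one key/value token (names are illustrative; the sub-sampling-function ablation later in the notes compares alternatives):

```python
import torch
import torch.nn as nn

class GSA(nn.Module):
    """Global sub-sampled attention: every token attends to a sub-sampled set of keys/values."""
    def __init__(self, dim, num_heads, sr_ratio):
        super().__init__()
        # strided conv summarizes each sr_ratio x sr_ratio region into one representative token
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, H, W):
        # x: (B, N, C) with N = H * W
        B, N, C = x.shape
        feat = x.transpose(1, 2).reshape(B, C, H, W)
        kv = self.sr(feat).flatten(2).transpose(1, 2)   # (B, N / sr_ratio^2, C)
        kv = self.norm(kv)
        out, _ = self.attn(x, kv, kv)                   # full-resolution queries, reduced keys/values
        return out
```

Queries stay at full resolution, so every token still receives global context, but the key/value length shrinks by $r^2$, which is where the cost saving over plain MHA comes from.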

transformer encoder

with consecutive encoder blocks alternating between LSA and GSA, the transformer encoder is computed as
$$
\begin{aligned}
\widehat{z}_l &= LSA(LN(z_{l-1})) + z_{l-1} \\
z_l &= FFN(LN(\widehat{z}_l)) + \widehat{z}_l \\
\widehat{z}_{l+1} &= GSA(LN(z_l)) + z_l \\
z_{l+1} &= FFN(LN(\widehat{z}_{l+1})) + \widehat{z}_{l+1}
\end{aligned}
$$
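A minimal sketch of one LSA+GSA encoder pair implementing the four equations above, reusing the illustrative `LSA` and `GSA` modules from the earlier sketches (drop-path and other training details omitted):

```python
import torch.nn as nn

class FFN(nn.Sequential):
    """Two-layer MLP applied token-wise."""
    def __init__(self, dim, hidden):
        super().__init__(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

class SSSABlockPair(nn.Module):
    """One LSA block followed by one GSA block, each with pre-norm residual connections."""
    def __init__(self, dim, num_heads, ws, sr_ratio, mlp_ratio=4):
        super().__init__()
        self.norm1, self.lsa = nn.LayerNorm(dim), LSA(dim, num_heads, ws)
        self.norm2, self.ffn1 = nn.LayerNorm(dim), FFN(dim, dim * mlp_ratio)
        self.norm3, self.gsa = nn.LayerNorm(dim), GSA(dim, num_heads, sr_ratio)
        self.norm4, self.ffn2 = nn.LayerNorm(dim), FFN(dim, dim * mlp_ratio)

    def forward(self, x, H, W):
        x = x + self.lsa(self.norm1(x), H, W)   # z_hat_l
        x = x + self.ffn1(self.norm2(x))        # z_l
        x = x + self.gsa(self.norm3(x), H, W)   # z_hat_{l+1}
        x = x + self.ffn2(self.norm4(x))        # z_{l+1}
        return x
```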

computational complexity

in ViT, given an input feature map $x\in R^{h\times w\times C}$, the FLOPs of MSA are
$$\Omega(MSA)=4hwC^2+2(hw)^2C$$
for LSA, attention is restricted to $ws\times ws$ windows, with the $\frac{hw}{(ws)^2}$ windows processed in parallel as an enlarged batch, so
$$\Omega(LSA)=4hwC^2+2hw(ws)^2C$$
for GSA, with sub-sampling (reduction) ratio $r$,
$$\Omega(GSA)=4hwC^2+\frac{2(hw)^2C}{r^2}$$
summing up, with window size $k_1\times k_2$ and the sub-sampling ratio chosen so that $r^2=k_1k_2$ (one summary key per window), the cost of SSSA is
$$
\begin{aligned}
\Omega(SSSA) &= 8hwC^2+2\left(hw\,k_1k_2+\frac{(hw)^2}{r^2}\right)C \\
&= 8hwC^2+2\left(k_1k_2\,hw+\frac{(hw)^2}{k_1k_2}\right)C \\
&\geqslant 8hwC^2+2(hw)^{\frac32}C
\end{aligned}
$$
where the minimum is attained when $k_1k_2=\sqrt{hw}$
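A small helper to evaluate these complexity terms numerically; the configuration values in the usage line are only illustrative:

```python
def sssa_flops(h, w, C, ws, r):
    """Evaluate the SSSA cost above: projections plus local and global attention terms."""
    proj = 8 * h * w * C * C                 # QKV/output projections of the LSA + GSA blocks
    lsa = 2 * h * w * (ws * ws) * C          # local attention within ws x ws windows
    gsa = 2 * (h * w) ** 2 * C / (r * r)     # global attention with r^2-fold sub-sampling
    return proj + lsa + gsa

# e.g. a stage-1-sized feature map: 56x56 tokens, C=64, window 7, reduction 8
print(sssa_flops(56, 56, 64, 7, 8) / 1e9, "GFLOPs")
```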

architecture variants

[Image 9]
Configuration details of Twins-SVT.

Experiment

image classification

dataset ImageNet-1K, with the same regularization as DeiT
optimizer AdamW: batchsize=1024, 300 epochs, init lr=1e-3, linear warm-up 5 epochs, cosine decay to 0
stochastic depth 0.2, 0.3, 0.5 for Twins-S, Twins-B, Twins-L
max gradient norm clipped to 5.0

[Image 10]
Comparisons with SOTA methods for ImageNet-1K classification. Throughput is tested on the batch size of 192 on a single V100 GPU. All models are trained and evaluated on 224x224 resolution on ImageNet-1K dataset. “+”: w/ CPVT’s position encodings.

object detection and instance segmentation

framework RetinaNet
dataset COCO
optimizer AdamW: batchsize=16, 12 epochs, init lr=1e-4, weight decay=1e-4, warm-up 500 iterations, lr decayed by 10x at epochs 8 and 11

[Image 11]
Object detection performance on the COCO val2017 split using the RetinaNet framework. 1x is 12 epochs and 3x is 36 epochs. “MS”: Multi-scale training. FLOPs are evaluated on 800x600 resolution.

framework Mask R-CNN
dataset COCO
optimizer AdamW: init lr=2e-4, other settings follow the mmdetection defaults

[Image 12]
Object detection and instance segmentation performance on the COCO val2017 dataset using the Mask R-CNN framework. FLOPs are evaluated on a 800x600 image.

semantic segmentation

framework Semantic FPN
dataset ADE20K, ImageNet (pre-training)

  • Twins-PCPVT
    optimizer AdamW: batchsize=16, init lr=1e-4, polynomial lr decay (0.9)
  • Twins-SVT
    optimizer AdamW: batchsize=16, 160K iterations, init lr=6e-6, weight decay=5e-4, warm-up 1500 iterations, linear decay to 0

[Image 13]
Performance comparisons with different backbones on ADE20K validation dataset. FLOPs are tested on 512x512 resolution. All backbones are pretrained on ImageNet-1k except SETR, which is pretrained on ImageNet-21k dataset.

ablation study

configuration of LSA and GSA blocks

[Image 14]
Classification performance for different combinations of LSA (L) and GSA (G) blocks based on the small model.

sub-sampling function

[Image 15]
ImageNet classification performance of different forms of sub-sampled functions for the global sub-sampled attention (GSA).

position encoding

[Image 16]
Object detection performance on the COCO using different positional encoding strategies.

Swin equipped with CPVT's conditional positional encoding does not achieve better performance,
indicating that the improvement comes from the Twins-SVT paradigm rather than from the positional encodings
