paper
code
Twins-PCPVT: based on PVT and conditional positional encoding (CPE); it uses only global attention
Twins-SVT: based on the proposed spatially separable self-attention (SSSA), which interleaves local and global attention
Architecture of Twins-PCPVT-S. “PEG” is the positional encoding generator from CPVT.
given an input image of size $H\times W$, it is split into patches of size $S\times S$, so the number of patches is $N=\frac{HW}{S^2}$
the patch embeddings are added with the same number ($N$) of learnable absolute positional encoding vectors
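For example, with $H=W=224$ and $S=16$, $N=\frac{224\times 224}{16^2}=196$. A minimal PyTorch sketch of this tokenization step (the `PatchEmbed` name and the sizes are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into S x S patches and add learnable absolute positional encodings (ViT-style)."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2            # N = HW / S^2
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # one learnable positional vector per patch: fixed length, tied to the input size
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, C, H/S, W/S)
        x = x.flatten(2).transpose(1, 2)       # (B, N, C)
        return x + self.pos_embed              # absolute PE added to every token

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))    # shape (1, 196, 768)
```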
limitations of previous positional encodings
Comparison of various positional encoding (PE) strategies tested on the ImageNet validation set in terms of top-1 accuracy. Removing the positional encodings greatly damages the performance. The relative positional encodings have inferior performance to the absolute ones.
solutions to the aforementioned limitations
requirements for the desired positional encoding
reshape the flattened input sequence $X\in R^{B\times N\times C}$ to $X_1\in R^{B\times C\times H\times W}$
apply a 2D transformation $F$ on $X_1$ to obtain $X_2\in R^{B\times C\times H\times W}$, where $F$ is implemented by a 2D convolution with kernel size $k$ ($k\geqslant 3$) and $\frac{k-1}{2}$ zero-padding
reshape $X_2$ back to produce the positional encoding $E\in R^{B\times N\times C}$, which is added to the input tokens
Schematic illustration of Positional Encoding Generator (PEG). Note d is the embedding size, N is the number of tokens. The function F can be depth-wise, separable convolution or other complicated blocks.
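A minimal sketch of the PEG steps above, assuming $F$ is a depth-wise $3\times 3$ convolution (so $k=3$, zero-padding $1$); as in CPVT, the generated encoding is added back to the tokens. The class name and sizes are illustrative:

```python
import torch
import torch.nn as nn

class PEG(nn.Module):
    """Positional Encoding Generator: a zero-padded depth-wise conv over the 2D token map."""
    def __init__(self, dim=256, k=3):
        super().__init__()
        # depth-wise conv: groups=dim; padding (k-1)/2 keeps the spatial size unchanged
        self.proj = nn.Conv2d(dim, dim, k, stride=1, padding=(k - 1) // 2, groups=dim)

    def forward(self, x, H, W):                          # x: (B, N, C) with N = H * W
        B, N, C = x.shape
        feat = x.transpose(1, 2).reshape(B, C, H, W)     # step 1: back to 2D, (B, C, H, W)
        pos = self.proj(feat)                            # step 2: conditional PE via conv
        pos = pos.flatten(2).transpose(1, 2)             # step 3: reshape to (B, N, C)
        return x + pos                                   # encoding added to the tokens

out = PEG()(torch.randn(2, 14 * 14, 256), 14, 14)        # shape (2, 196, 256)
```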
Q: Why does this work? How does it introduce positional information into the Transformer?
A: Introducing positional information into a Transformer essentially means assigning a position to each of the N vectors in a sequence. That position can be absolute or relative; relative positional information is obtained by defining a reference point and giving each vector a signal describing its position relative to that point. This approach uses a convolution to produce the positional encoding: the zero-padding of the convolution acts as the reference point, and the convolution extracts each vector's position relative to it. In one sentence:
The convolution in PEG takes the zero-padding as the reference point and uses the convolution operation to extract relative positional information, thereby producing a variable-length positional encoding suitable for Transformers.
ref: zhihu
an extra learnable class token is needed to perform classification; it is not translation-invariant by construction, although it can learn to become translation-invariant
replace the class token with global average pooling (GAP), which is inherently translation-invariant
Vision Transformers: (a) ViT with explicit 1D learnable positional encodings (PE) (b) CPVT with conditional positional encoding from the proposed Position Encoding Generator (PEG) plugin, which is the default choice. (c) CPVT-GAP without class token (cls), but with global average pooling (GAP) over all items in the sequence. Note that GAP is a bonus version which has boosted performance.
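A small sketch contrasting the two classification heads, assuming an encoder output of shape $(B, N, C)$ (or $(B, N+1, C)$ when a class token is prepended) and a hypothetical linear `head`; the numbers are illustrative:

```python
import torch
import torch.nn as nn

B, N, C, num_classes = 2, 196, 384, 1000
head = nn.Linear(C, num_classes)

# (a)/(b): a class token is prepended, so the encoder output has N + 1 tokens;
# classification reads only token 0, which is not translation-invariant by construction
out_with_cls = torch.randn(B, N + 1, C)
logits_cls = head(out_with_cls[:, 0])

# (c) CPVT-GAP: no class token; average all N tokens, so a shift of the input
# merely permutes the tokens and leaves the pooled feature unchanged
out_no_cls = torch.randn(B, N, C)
logits_gap = head(out_no_cls.mean(dim=1))
```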
Architecture of Twins-SVT-S. “PEG” is the positional encoding generator from CPVT.
(a) Twins-SVT interleaves locally-grouped attention (LSA) and global sub-sampled attention (GSA). (b) Schematic view of the locally-grouped attention (LSA) and global sub-sampled attention (GSA).
the feature map $x\in R^{h\times w\times C}$ is divided into sub-windows of size $ws\times ws$
self-attention is applied within each sub-window, each of which contains $ws\times ws$ tokens
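A minimal sketch of the locally-grouped attention, assuming $h$ and $w$ are divisible by $ws$; `nn.MultiheadAttention` stands in for the paper's attention block:

```python
import torch
import torch.nn as nn

def local_attention(x, ws, attn):
    """Apply self-attention independently inside each ws x ws window.
    x: (B, h, w, C); attn: nn.MultiheadAttention with batch_first=True."""
    B, h, w, C = x.shape
    # partition: (B, h/ws, ws, w/ws, ws, C) -> (B * n_windows, ws*ws, C)
    x = x.reshape(B, h // ws, ws, w // ws, ws, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
    x, _ = attn(x, x, x)                         # attention restricted to one window
    # reverse the window partition
    x = x.reshape(B, h // ws, w // ws, ws, ws, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, h, w, C)
    return x

attn = nn.MultiheadAttention(embed_dim=96, num_heads=3, batch_first=True)
y = local_attention(torch.randn(2, 56, 56, 96), ws=7, attn=attn)   # (2, 56, 56, 96)
```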
Multi-head attention (MHA) vs. spatial-reduction attention (SRA). With the spatial-reduction operation, the computational/memory cost of our SRA is much lower than that of MHA.
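GSA sub-samples the keys and values in the same spirit as SRA; a minimal sketch, assuming the sub-sampling function is a strided convolution with ratio $r$ (the class name, sizes, and this particular sub-sampling choice are illustrative):

```python
import torch
import torch.nn as nn

class GSA(nn.Module):
    """Global attention whose keys/values come from an r-times sub-sampled feature map."""
    def __init__(self, dim=96, num_heads=3, r=7):
        super().__init__()
        self.sr = nn.Conv2d(dim, dim, kernel_size=r, stride=r)   # spatial reduction by r
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, H, W):                      # x: (B, N, C), N = H * W
        B, N, C = x.shape
        kv = x.transpose(1, 2).reshape(B, C, H, W)
        kv = self.sr(kv).flatten(2).transpose(1, 2)  # (B, N / r^2, C) summary tokens
        kv = self.norm(kv)
        out, _ = self.attn(x, kv, kv)                # hw queries attend to hw / r^2 keys
        return out

y = GSA()(torch.randn(2, 56 * 56, 96), 56, 56)       # (2, 3136, 96)
```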
with consecutive encoder blocks alternating between LSA and GSA, the transformer encoder is computed as
$$
\begin{aligned}
\widehat{z}_l&=LSA(LN(z_{l-1}))+z_{l-1} \\
z_l&=FFN(LN(\widehat{z}_l))+\widehat{z}_l \\
\widehat{z}_{l+1}&=GSA(LN(z_l))+z_l \\
z_{l+1}&=FFN(LN(\widehat{z}_{l+1}))+\widehat{z}_{l+1}
\end{aligned}
$$
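A sketch of one LSA→GSA pair that mirrors the four equations above; `lsa`, `gsa`, and the two FFNs are stand-in modules passed in from outside (pre-norm residual blocks):

```python
import torch
import torch.nn as nn

class TwinsBlockPair(nn.Module):
    """Two consecutive pre-norm encoder blocks: local attention first, then global sub-sampled attention."""
    def __init__(self, dim, lsa, gsa, ffn1, ffn2):
        super().__init__()
        self.lsa, self.gsa = lsa, gsa             # stand-ins for the LSA / GSA modules
        self.ffn1, self.ffn2 = ffn1, ffn2         # two independent feed-forward networks
        self.norm = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])

    def forward(self, z):
        z = self.lsa(self.norm[0](z)) + z         # z_hat_l     = LSA(LN(z_{l-1})) + z_{l-1}
        z = self.ffn1(self.norm[1](z)) + z        # z_l         = FFN(LN(z_hat_l)) + z_hat_l
        z = self.gsa(self.norm[2](z)) + z         # z_hat_{l+1} = GSA(LN(z_l)) + z_l
        z = self.ffn2(self.norm[3](z)) + z        # z_{l+1}     = FFN(LN(z_hat_{l+1})) + z_hat_{l+1}
        return z

dim = 96
blk = TwinsBlockPair(dim, nn.Linear(dim, dim), nn.Linear(dim, dim),
                     nn.Linear(dim, dim), nn.Linear(dim, dim))
out = blk(torch.randn(2, 196, dim))               # (2, 196, 96)
```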
in ViT, given an input feature map $x\in R^{h\times w\times C}$, the FLOPs of MSA are
$$\Omega(MSA)=4hwC^2+2(hw)^2C$$
for LSA, replace $h\times w$ with the window size $ws\times ws$ and multiply the batch size by the number of windows $n_{windows}=\frac{hw}{(ws)^2}$
$$\Omega(LSA)=4hwC^2+2hw(ws)^2C$$
for GSA, if reduction ratio is r
$$\Omega(GSA)=4hwC^2+\frac{2(hw)^2C}{r^2}$$
to sum up, for SSSA
$$
\begin{aligned}
\Omega(SSSA)&=8hwC^2+2\left(hw\,mn+\frac{(hw)^2}{r^2}\right)C \\
&=8hwC^2+2\left(k_1k_2\,hw+\frac{(hw)^2}{k_1k_2}\right)C \\
&\geqslant 8hwC^2+4(hw)^{\frac{3}{2}}C
\end{aligned}
$$
where the minimum is attained when $k_1k_2=\sqrt{hw}$
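A quick numeric check of these formulas (illustrative numbers only, e.g. a $56\times 56$ feature map with $C=64$ and $ws=r=7$):

```python
def flops_msa(h, w, C):
    return 4 * h * w * C**2 + 2 * (h * w)**2 * C

def flops_lsa(h, w, C, ws):
    return 4 * h * w * C**2 + 2 * h * w * ws**2 * C

def flops_gsa(h, w, C, r):
    return 4 * h * w * C**2 + 2 * (h * w)**2 * C / r**2

h, w, C, ws, r = 56, 56, 64, 7, 7
print(f"MSA : {flops_msa(h, w, C) / 1e9:.2f} GFLOPs")      # quadratic term in h*w dominates
print(f"LSA : {flops_lsa(h, w, C, ws) / 1e9:.2f} GFLOPs")  # attention term linear in h*w
print(f"GSA : {flops_gsa(h, w, C, r) / 1e9:.2f} GFLOPs")   # quadratic term divided by r^2
print(f"SSSA: {(flops_lsa(h, w, C, ws) + flops_gsa(h, w, C, r)) / 1e9:.2f} GFLOPs")
```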
dataset ImageNet-1K, with the same regularization recipe as DeiT
optimizer AdamW: batchsize=1024, 300 epochs, init lr=1e-3, linear warm-up 5 epochs, cosine decay to 0
stochastic depth 0.2, 0.3, 0.5 for Twins-S, Twins-B, Twins-L
max gradient norm clipped to 5.0
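A hedged PyTorch sketch of this recipe (the weight-decay value and the per-step scheduler form are assumptions, not copied from the authors' code):

```python
import math
import torch

def build_optimizer_and_schedule(model, steps_per_epoch, epochs=300, warmup_epochs=5,
                                 base_lr=1e-3):
    """AdamW + 5-epoch linear warm-up + cosine decay to 0, applied per step."""
    # weight decay is not listed above; 0.05 is a DeiT-style assumption
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.05)

    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = epochs * steps_per_epoch

    def lr_lambda(step):
        if step < warmup_steps:                               # linear warm-up
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))     # cosine decay to 0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# inside the training loop, clip the gradient norm to 5.0 before optimizer.step():
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
```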
Comparisons with SOTA methods for ImageNet-1K classification. Throughput is tested with a batch size of 192 on a single V100 GPU. All models are trained and evaluated at 224x224 resolution on the ImageNet-1K dataset. “+”: w/ CPVT’s position encodings.
framework RetinaNet
dataset COCO
optimizer AdamW: batchsize=16, 12 epochs, init lr=1e-4, weight decay=1e-4, linear warm-up for 500 iterations, lr decayed by 10x at epochs 8 and 11
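These settings correspond to a standard 1x schedule; a hedged mmdetection-style config sketch (field names follow mmdetection 2.x conventions; the per-GPU batch split and warmup_ratio are assumptions, not taken from the authors' config):

```python
# optimizer: AdamW with lr 1e-4 and weight decay 1e-4
optimizer = dict(type='AdamW', lr=1e-4, weight_decay=1e-4)
# 500-iteration linear warm-up, then decay the lr by 10x at epochs 8 and 11
lr_config = dict(policy='step', warmup='linear', warmup_iters=500,
                 warmup_ratio=0.001, step=[8, 11])
runner = dict(type='EpochBasedRunner', max_epochs=12)   # the 1x schedule
data = dict(samples_per_gpu=2)   # 8 GPUs x 2 images = total batch size 16 (assumed split)
```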
Object detection performance on the COCO val2017 split using the RetinaNet framework. 1x is 12 epochs and 3x is 36 epochs. “MS”: Multi-scale training. FLOPs are evaluated on 800x600 resolution.
framework Mask R-CNN
dataset COCO
optimizer AdamW: init lr=2e-4, other settings follow the mmdetection defaults
Object detection and instance segmentation performance on the COCO val2017 dataset using the Mask R-CNN framework. FLOPs are evaluated on a 800x600 image.
framework Semantic FPN
dataset ADE20K, ImageNet (pre-training)
Performance comparisons with different backbones on ADE20K validation dataset. FLOPs are tested on 512x512 resolution. All backbones are pretrained on ImageNet-1k except SETR, which is pretrained on ImageNet-21k dataset.
configuration of LSA and GSA blocks
Classification performance for different combinations of LSA (L) and GSA (G) blocks based on the small model.
sub-sampling function
ImageNet classification performance of different sub-sampling functions for the global sub-sampled attention (GSA).
position encoding
Object detection performance on COCO using different positional encoding strategies.
Swin equipped with CPVT's positional encoding does not achieve better performance
indicating that the improvement comes from the Twins-SVT paradigm (interleaved local and global attention) rather than from the positional encodings