Transformer Series: Pyramid Vision Transformer (PVT) v2 (CVMJ 2022)

1. Motivation

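The goal is to reduce the computational complexity of PVT and provide a new transformer baseline. PVT v1 has three limitations: 1) for high-resolution inputs, the computational complexity is still high; 2) PVT treats the image as a sequence of non-overlapping patches, which loses some local continuity; 3) the position encoding is fixed-size, so it cannot adapt to inputs of arbitrary size.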

2. Contribution

1) Linear-complexity attention layer (the downsampling operation in SRA is changed from a convolution to average pooling);

2) Overlapping patch embedding (the patch embedding is changed from non-overlapping to half-overlapping patches);

3) Convolutional feed-forward network (the fixed-size position encoding is replaced by zero-padding position encoding).

3. Methods

3.1 Linear Spatial Reduction Attention

[Image 1]

SRA uses a strided convolution for spatial reduction, whereas linear SRA uses average pooling to reduce the input (h*w) to a fixed size (P*P). Complexity analysis:

[Image 2]
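To make the "linear" part concrete, here is a rough sketch of how the attention cost scales in the two variants. It is a simplified cost model, not the paper's exact formula: it counts only the two matrix products inside attention (QK^T and the weighted sum over V) and ignores the projections and the reduction layer itself. The function names, the reduction ratio `reduction=8` and the pooling size `pool_size=7` are illustrative assumptions.

```python
def attention_matmul_cost(num_queries, num_keys, channels):
    """Multiply-adds for QK^T plus the weighted sum over V (simplified model)."""
    return 2 * num_queries * num_keys * channels


def sra_cost(h, w, c, reduction):
    """SRA: keys/values are downsampled by `reduction` per side, so their count still grows with h*w."""
    num_keys = (h // reduction) * (w // reduction)
    return attention_matmul_cost(h * w, num_keys, c)


def linear_sra_cost(h, w, c, pool_size):
    """Linear SRA: keys/values are pooled to a fixed pool_size x pool_size grid."""
    num_keys = pool_size * pool_size
    return attention_matmul_cost(h * w, num_keys, c)


if __name__ == "__main__":
    c = 64
    for side in (56, 112, 224):
        print(side, sra_cost(side, side, c, reduction=8), linear_sra_cost(side, side, c, pool_size=7))
```

Under this model, doubling the side length of the feature map multiplies the SRA cost by 16 but the linear SRA cost only by 4, i.e. the latter grows in proportion to the number of tokens h*w.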

3.2 Overlapping Patch Embedding

[Image 3]

The non-overlapping patches are replaced with half-overlapping ones. Concretely, this is implemented as a convolution with kernel size 2S-1, stride S, and padding S-1.
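A quick sanity check of these hyperparameters: with the standard convolution output-size formula, kernel 2S-1, stride S and padding S-1 give an output of exactly H/S whenever H is divisible by S, so the token grid has the same size as with non-overlapping patches of size S. The helper names below are hypothetical and the input sizes are only examples.

```python
def overlapping_patch_params(stride):
    """Convolution hyperparameters for a half-overlapping patch embedding with stride S."""
    return {"kernel_size": 2 * stride - 1, "stride": stride, "padding": stride - 1}


def conv_output_size(size, kernel_size, stride, padding):
    """Standard output-size formula for a convolution without dilation."""
    return (size + 2 * padding - kernel_size) // stride + 1


if __name__ == "__main__":
    params = overlapping_patch_params(4)  # kernel 7, stride 4, padding 3
    print(params, conv_output_size(224, **params))
```

Adjacent windows share kernel_size - stride = S-1 pixels along each axis, which is where the "half-overlap" comes from.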

3.3 Convolutional FeedForward

[Image 4]

A depth-wise convolution with kernel size 3*3 and padding 1 is inserted between the first FC layer and the GELU in the feed-forward network.
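Why zero padding can stand in for a position encoding is easiest to see on a toy input. The sketch below is a plain-Python 3*3 convolution over a single channel (a depth-wise convolution applies one such kernel per channel); it is an illustration only, not the PVT v2 implementation, and the function name is made up for this example. With a constant input, the outputs at corners, edges and interior positions differ purely because of the zeros added at the border, so the result carries information about where each token sits.

```python
def depthwise_conv3x3_zero_pad(channel, kernel):
    """3x3 convolution over one channel with zero padding 1 (output has the same size as the input)."""
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    # Positions outside the map contribute zero (zero padding).
                    if 0 <= y < h and 0 <= x < w:
                        total += channel[y][x] * kernel[di + 1][dj + 1]
            out[i][j] = total
    return out


if __name__ == "__main__":
    ones_map = [[1.0] * 4 for _ in range(4)]
    ones_kernel = [[1.0] * 3 for _ in range(3)]
    for row in depthwise_conv3x3_zero_pad(ones_map, ones_kernel):
        print(row)
```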

3.4 Details of PVTv2 Series

[Image 5]

4. Experiments

4.1 Image Classification

The settings are the same as for PVT v1.

[Image 6]

4.2 Object Detection

Frameworks: RetinaNet, Mask R-CNN, Cascade Mask R-CNN, ATSS, GFL, and Sparse R-CNN.

Other settings are the same as those in PVT v1.

[Image 7]

[Image 8]

4.3 Semantic Segmentation

Training schedule: 40K iterations.

Other settings are the same as those in PVT v1.

[Image 9]

4.4 Ablation Study

Overlapping patch embedding (OPE) is important.

Convolutional feed-forward network (CFFN) matters.

Linear SRA (LSRA) contributes to a better model.

[Image 10]

[Image 11]
