paper
code
demerits of self-attention (SA)
demerits of the multi-layer perceptron (MLP)
contributions
Results of different models on ImageNet-1K validation set. Left: Comparing the performance of recent models DeiT, PVT, Swin Transformer, ConvNeXt, Focal Transformer and our VAN. All these models have a similar amount of parameters. Right: Comparing the performance of recent models and our VAN while keeping the computational cost similar.
a large-kernel convolution introduces heavy computational overhead and a large number of parameters
solution: decompose the large-kernel convolution into a sequence of smaller convolutions
Decomposition diagram of a large-kernel convolution. A standard convolution can be decomposed into three parts: a depth-wise convolution (DW-Conv), a depth-wise dilation convolution (DW-D-Conv), and a pointwise convolution ($1\times1$ Conv). The colored grids represent the location of the convolution kernel and the yellow grid marks the center point. The diagram shows that a $13\times13$ convolution is decomposed into a $5\times5$ depth-wise convolution, a $5\times5$ depth-wise dilation convolution with dilation rate 3, and a pointwise convolution. Note: zero paddings are omitted in the figure.
a $K\times K$ large-kernel convolution is divided into 3 components: a $(2d-1)\times(2d-1)$ depth-wise convolution (local spatial), a $\lceil\frac{K}{d}\rceil\times\lceil\frac{K}{d}\rceil$ depth-wise dilation convolution with dilation $d$ (long-range spatial), and a $1\times1$ convolution (channel mixing)
where $K$ is the kernel size and $d$ is the dilation
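As a quick check of these component sizes, a tiny helper of my own (not from the paper's code) maps $(K, d)$ to the two spatial kernels:

```python
from math import ceil

def lka_components(K: int, d: int):
    """Kernel sizes of the two spatial parts of the decomposition."""
    dw_conv = 2 * d - 1       # local depth-wise convolution
    dw_d_conv = ceil(K / d)   # depth-wise dilation convolution (dilation d)
    return dw_conv, dw_d_conv

print(lka_components(13, 3))  # (5, 5): matches the 13x13 example in the figure above
print(lka_components(21, 3))  # (5, 7): the paper's default setting
```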
Desirable properties belonging to convolution, self-attention and LKA.
the LKA module is written as
$$
\begin{aligned}
Attention &= Conv_{1\times1}(DW\text{-}D\text{-}Conv(DW\text{-}Conv(F))) \\
Output &= Attention \otimes F
\end{aligned}
$$
where $F\in\mathbb{R}^{C\times H\times W}$ is the input feature, $Attention\in\mathbb{R}^{C\times H\times W}$ is the attention map, and $\otimes$ denotes element-wise multiplication
The structure of different modules: (a) the proposed Large Kernel Attention (LKA); (b) non-attention module; (c) the self-attention module; (d) a stage of our Visual Attention Network (VAN). “CFF” means convolutional feed-forward network. Residual connection is omitted in (d). The difference between (a) and (b) is the element-wise multiply. It is worth noting that (c) is designed for 1D sequences.
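A minimal PyTorch sketch of the LKA formula above under the default $K=21$, $d=3$ (so a $5\times5$ DW-Conv, a $7\times7$ DW-D-Conv with dilation 3, and a $1\times1$ conv); layer names and the shape check are my own, not necessarily the authors' implementation:

```python
import torch
import torch.nn as nn

class LKA(nn.Module):
    """Large Kernel Attention: Attention = Conv1x1(DW-D-Conv(DW-Conv(F))), Output = Attention * F."""
    def __init__(self, dim: int):
        super().__init__()
        # 5x5 depth-wise conv captures local structure (2d-1 = 5 for d = 3).
        self.dw_conv = nn.Conv2d(dim, dim, kernel_size=5, padding=2, groups=dim)
        # 7x7 depth-wise dilated conv captures long range (ceil(21/3) = 7, dilation 3, padding 9).
        self.dw_d_conv = nn.Conv2d(dim, dim, kernel_size=7, padding=9, dilation=3, groups=dim)
        # 1x1 conv mixes channels.
        self.pw_conv = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.pw_conv(self.dw_d_conv(self.dw_conv(x)))
        return attn * x  # element-wise gating of the input features

# Shape check: input and output share the same C x H x W size.
f = torch.randn(1, 64, 56, 56)
assert LKA(64)(f).shape == f.shape
```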
assuming the input and output features have the same size $\mathbb{R}^{C\times H\times W}$, the parameters and FLOPs of LKA are
$$
\begin{aligned}
\mathrm{Param} &= \left\lceil\frac{K}{d}\right\rceil\times\left\lceil\frac{K}{d}\right\rceil\times C+(2d-1)\times(2d-1)\times C+C\times C \\
\mathrm{FLOPs} &= \left(\left\lceil\frac{K}{d}\right\rceil\times\left\lceil\frac{K}{d}\right\rceil\times C+(2d-1)\times(2d-1)\times C+C\times C\right)\times H\times W
\end{aligned}
$$
where $K$ is the kernel size (default $K=21$) and $d$ is the dilation (default $d=3$)
Comparison of the number of parameters of different manners for a $21\times21$ convolution. X, Y and Ours denote standard convolution, MobileNet decomposition and our decomposition, respectively. The input and output features have the same size $H\times W\times C$. Note: bias is omitted to simplify the format.
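A small worked comparison of the three manners in the table (helper names are mine; bias omitted, as in the table):

```python
from math import ceil

def params_standard(K: int, C: int) -> int:
    # Standard KxK convolution with C input and C output channels.
    return K * K * C * C

def params_mobilenet(K: int, C: int) -> int:
    # MobileNet-style decomposition: KxK depth-wise conv + 1x1 point-wise conv.
    return K * K * C + C * C

def params_lka_decomp(K: int, C: int, d: int = 3) -> int:
    # LKA decomposition: ceil(K/d)^2 DW-D-Conv + (2d-1)^2 DW-Conv + 1x1 conv.
    return ceil(K / d) ** 2 * C + (2 * d - 1) ** 2 * C + C * C

K = 21  # default kernel size
for C in (32, 64, 160, 256):
    print(C, params_standard(K, C), params_mobilenet(K, C), params_lka_decomp(K, C))
```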
The detailed setting for different versions of the VAN. “e.r.” represents expansion ratio in the feed-forward network.
dataset ImageNet-1K, with augmentation
optimizer AdamW: batch size=1024, 310 epochs, momentum=0.9, weight decay=5e-2, init lr=5e-4, warm-up, cosine decay
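A hedged PyTorch sketch of this recipe; the warm-up length is an assumption (not given in the notes), and momentum 0.9 is read as AdamW's β1:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Conv2d(3, 32, 3)   # stand-in module for a VAN backbone

epochs = 310
warmup_epochs = 5                   # assumption: warm-up length is not specified in the notes

# AdamW with the listed hyper-parameters: lr=5e-4, beta1=0.9, weight decay=5e-2.
optimizer = AdamW(model.parameters(), lr=5e-4, betas=(0.9, 0.999), weight_decay=5e-2)

# Linear warm-up followed by cosine decay over the remaining epochs.
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=1e-3, total_iters=warmup_epochs),
        CosineAnnealingLR(optimizer, T_max=epochs - warmup_epochs),
    ],
    milestones=[warmup_epochs],
)

for epoch in range(epochs):
    # ... one training epoch over ImageNet-1K with batch size 1024 ...
    optimizer.step()   # placeholder; normally called per batch after loss.backward()
    scheduler.step()   # per-epoch schedule step
```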
Comparison with the state-of-the-art methods on the ImageNet validation set. Params denotes the number of parameters. GFLOPs denotes floating point operations. Top-1 Acc represents Top-1 accuracy.
framework RetinaNet, Mask R-CNN, Cascade Mask R-CNN, Sparse R-CNN
dataset COCO 2017
Object detection on the COCO 2017 dataset. #P denotes the number of parameters. RetinaNet 1× denotes that models are based on RetinaNet and trained for 12 epochs.
Object detection and instance segmentation on the COCO 2017 dataset. #P denotes the number of parameters. Mask R-CNN 1× denotes that models are based on Mask R-CNN and trained for 12 epochs. $AP^b$ and $AP^m$ refer to bounding box AP and mask AP, respectively.
Comparison with the state-of-the-art vision backbones on the COCO 2017 benchmark. All models are trained for 36 epochs. We calculate FLOPs with input size of $1280\times800$.
framework Semantic FPN, UperNet
dataset ADE20K
Results of semantic segmentation on the ADE20K validation set. The upper and lower parts are obtained under two different training/validation schemes. We calculate FLOPs with input size $512\times512$ for Semantic FPN and $2048\times512$ for UperNet.
Ablation study of different modules in LKA. Results show that each part is critical. Acc(%) means Top-1 accuracy on ImageNet validation set.
key findings
Ablation study of different kernel sizes in LKA. Acc(%) means Top-1 accuracy on ImageNet validation set.
key findings
Visualization results. All images come from different categories in ImageNet validation set. CAMs are produced using Grad-CAM. We compare the CAMs produced by Swin-T, ConvNeXt-T and VAN-Base.
key findings
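A minimal hook-based Grad-CAM sketch for producing such CAMs; the ResNet-18 stand-in backbone and the random input tensor are placeholders I chose, not the paper's setup:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

# Stand-in backbone; swap in a VAN checkpoint to reproduce the paper's CAMs.
model = resnet18(weights=None).eval()
target_layer = model.layer4          # last convolutional stage
store = {}

target_layer.register_forward_hook(lambda m, i, o: store.update(act=o))
target_layer.register_full_backward_hook(lambda m, gi, go: store.update(grad=go[0]))

x = torch.randn(1, 3, 224, 224)      # placeholder for a preprocessed ImageNet image
logits = model(x)
logits[0, logits[0].argmax()].backward()   # gradient of the top-1 class score

# Grad-CAM: channel weights = global-average-pooled gradients; weighted sum of activations + ReLU.
weights = store["grad"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * store["act"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalized [0, 1] heat-map
```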