[2202] Visual Attention Network

paper
code

Contents

    • Abstract
    • Method
        • large kernel attention (LKA)
          • computational complexity
        • architecture variants
    • Experiment
        • image classification
        • object detection and instance segmentation
        • semantic segmentation
        • ablation studies
          • architecture components
          • kernel size and dilation
        • visualization

Abstract

demerits of self-attention (SA)

  • treats images as 1D sequences and neglects their 2D structure
  • quadratic complexity is too expensive for high-resolution (HR) images
  • achieves spatial adaptability but ignores channel adaptability

demerits of multi-layer perceptron (MLP)

  • sensitive to input size and can only process fixed-size images
  • considers global information but ignores local structure

contributions

  • propose large kernel attention (LKA)
    local structure information, long-range dependence, adaptability in the channel dimension
  • present visual attention network (VAN) as a backbone based on LKA
    SOTA performance with fewer parameters and FLOPs

Results of different models on ImageNet-1K validation set. Left: Comparing the performance of recent models DeiT, PVT, Swin Transformer, ConvNeXt, Focal Transformer and our VAN. All these models have a similar amount of parameters. Right: Comparing the performance of recent models and our VAN while keeping the computational cost similar.

Method

large kernel attention (LKA)

large-kernel convolution brings a huge amount of computational overhead and parameters
solution decompose a large kernel convolution

Decomposition diagram of large-kernel convolution. A standard convolution can be decomposed into three parts: a depth-wise convolution (DW-Conv), a depth-wise dilation convolution (DW-D-Conv), and a pointwise convolution ($1\times1$ Conv). The colored grids represent the location of the convolution kernel and the yellow grid means the center point. The diagram shows that a $13\times13$ convolution is decomposed into a $5\times5$ depth-wise convolution, a $5\times5$ depth-wise dilation convolution with dilation rate 3, and a pointwise convolution. Note: zero paddings are omitted in the above figure.

a $K\times K$ large kernel convolution is divided into 3 components

  • a spatial local convolution: $(2d-1)\times(2d-1)$ depth-wise conv $\implies$ local contextual information
  • a spatial long-range convolution: $\lceil\frac{K}{d}\rceil\times\lceil\frac{K}{d}\rceil$ depth-wise dilated conv $\implies$ large receptive field
  • a channel convolution: $1\times1$ conv $\implies$ adaptability in the channel dimension

where $K$ is the kernel size and $d$ is the dilation

Desirable properties belonging to convolution, self-attention and LKA.

write the LKA module as
$$\begin{aligned} \mathrm{Attention}&=\mathrm{Conv}_{1\times1}(\mathrm{DW\text{-}D\text{-}Conv}(\mathrm{DW\text{-}Conv}(F))) \\ \mathrm{Output}&=\mathrm{Attention}\otimes F \end{aligned}$$

where $F\in\mathbb{R}^{C\times H\times W}$ is the input feature, $\mathrm{Attention}\in\mathbb{R}^{C\times H\times W}$ is the attention map, and $\otimes$ denotes element-wise multiplication
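
As a concrete reference, the following is a minimal PyTorch sketch of an LKA module with the default $K=21$, $d=3$, so the decomposition becomes a $5\times5$ depth-wise conv, a $7\times7$ depth-wise conv with dilation 3, and a $1\times1$ conv (padding chosen to keep the spatial size unchanged). It follows the formula above and is meant as an illustration rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn


class LKA(nn.Module):
    """Large Kernel Attention with K=21, d=3: a 5x5 depth-wise conv,
    a 7x7 depth-wise conv with dilation 3, and a 1x1 conv, followed by
    element-wise multiplication with the input features."""

    def __init__(self, dim: int):
        super().__init__()
        # (2d-1) x (2d-1) = 5x5 depth-wise conv: local contextual information
        self.dw_conv = nn.Conv2d(dim, dim, kernel_size=5, padding=2, groups=dim)
        # ceil(K/d) x ceil(K/d) = 7x7 depth-wise dilated conv: large receptive field
        self.dw_d_conv = nn.Conv2d(dim, dim, kernel_size=7, padding=9,
                                   dilation=3, groups=dim)
        # 1x1 conv: adaptability in the channel dimension
        self.pw_conv = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.pw_conv(self.dw_d_conv(self.dw_conv(x)))   # Attention
        return attn * x                                        # Output = Attention ⊗ F


if __name__ == "__main__":
    f = torch.randn(2, 64, 56, 56)     # batched F in R^{C x H x W}
    print(LKA(64)(f).shape)            # torch.Size([2, 64, 56, 56])
```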

The structure of different modules: (a) the proposed Large Kernel Attention (LKA); (b) non-attention module; (c) the self-attention module; (d) a stage of our Visual Attention Network (VAN). “CFF” means convolutional feed-forward network. Residual connection is omitted in (d). The difference between (a) and (b) is the element-wise multiply. It is worth noting that (c) is designed for 1D sequences.
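
To show how LKA is used inside a block, here is a rough sketch of the attention module in (a) and one block of a VAN stage in (d), reusing the `LKA` class from the previous sketch. The BatchNorm placement, the depth-wise $3\times3$ conv inside the CFF, and the default expansion ratio are assumptions borrowed from common implementations; layer scale and stochastic depth are omitted for brevity.

```python
import torch
import torch.nn as nn
# assumes the LKA class from the previous sketch is in scope


class SpatialAttention(nn.Module):
    """Attention module of (a): 1x1 conv -> GELU -> LKA -> 1x1 conv."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj_in = nn.Conv2d(dim, dim, kernel_size=1)
        self.act = nn.GELU()
        self.lka = LKA(dim)
        self.proj_out = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):
        return self.proj_out(self.lka(self.act(self.proj_in(x))))


class CFF(nn.Module):
    """Convolutional feed-forward network with a given expansion ratio."""

    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Conv2d(dim, hidden, kernel_size=1)
        self.dw = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(hidden, dim, kernel_size=1)

    def forward(self, x):
        return self.fc2(self.act(self.dw(self.fc1(x))))


class VANBlock(nn.Module):
    """One block of a VAN stage: attention + CFF, each with a residual connection."""

    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        self.norm1 = nn.BatchNorm2d(dim)
        self.attn = SpatialAttention(dim)
        self.norm2 = nn.BatchNorm2d(dim)
        self.cff = CFF(dim, expansion)

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        x = x + self.cff(self.norm2(x))
        return x
```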

computational complexity

assume the input and output have the same size $\mathbb{R}^{C\times H\times W}$
$$\begin{aligned} \mathrm{Param}&=\lceil\frac{K}{d}\rceil\times\lceil\frac{K}{d}\rceil\times C+(2d-1)\times(2d-1)\times C+C\times C \\ \mathrm{FLOPs}&=\left(\lceil\frac{K}{d}\rceil\times\lceil\frac{K}{d}\rceil\times C+(2d-1)\times(2d-1)\times C+C\times C\right)\times H\times W \end{aligned}$$

where $K$ is the kernel size with default $K=21$, and $d$ is the dilation with default $d=3$
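
To make the comparison in the table below concrete, a short script that evaluates these formulas for a standard $21\times21$ convolution versus the decomposition (bias omitted, as in the table); the channel count `C = 64` is only an example value.

```python
import math


def decomposed_params(K: int, d: int, C: int) -> int:
    """Parameters of the LKA decomposition (bias omitted)."""
    dw_d = math.ceil(K / d) ** 2 * C   # depth-wise dilated conv
    dw = (2 * d - 1) ** 2 * C          # depth-wise conv
    pw = C * C                         # 1x1 conv
    return dw_d + dw + pw


def standard_params(K: int, C: int) -> int:
    """Parameters of a standard KxK convolution (bias omitted)."""
    return K * K * C * C


K, d, C = 21, 3, 64
print(standard_params(K, C))        # 1,806,336
print(decomposed_params(K, d, C))   # 8,832
# FLOPs are the same expressions multiplied by H x W
```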

Comparison of parameters of different manners for a $21\times21$ convolution. X, Y and Our denote standard convolution, MobileNet decomposition and our decomposition, respectively. The input and output features have the same size $H\times W\times C$. Note: Bias is omitted for simplifying format.

architecture variants

The detailed setting for different versions of the VAN. “e.r.” represents expansion ratio in the feed-forward network.

Experiment

image classification

dataset ImageNet-1K, with data augmentation
optimizer AdamW: batch size=1024, 310 epochs, momentum=0.9, weight decay=5e-2, init lr=5e-4, warm-up, cosine decay
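
A minimal sketch of this optimizer and schedule in plain PyTorch; the placeholder model, the number of steps per epoch, and the 5-epoch warm-up length are illustrative assumptions, and the data augmentation pipeline is not shown.

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Conv2d(3, 16, 3)  # placeholder for a VAN backbone
epochs, warmup_epochs, steps_per_epoch = 310, 5, 1000  # warm-up length is an assumption

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-4,              # initial learning rate
    betas=(0.9, 0.999),   # beta1 = 0.9 ("momentum")
    weight_decay=5e-2,
)

total_steps = epochs * steps_per_epoch
warmup_steps = warmup_epochs * steps_per_epoch


def lr_lambda(step: int) -> float:
    if step < warmup_steps:  # linear warm-up
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to 0


scheduler = LambdaLR(optimizer, lr_lambda)
```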

Compare with the state-of-the-art methods on ImageNet validation set. Params means parameter. GFLOPs denotes floating point operations. Top-1 Acc represents Top-1 accuracy.

object detection and instance segmentation

framework RetinaNet, Mask R-CNN, Cascade Mask R-CNN, Sparse R-CNN
dataset COCO 2017

Object detection on COCO 2017 dataset. #P means parameter. RetinaNet $1\times$ denotes that models are based on RetinaNet and we train them for 12 epochs.

Object detection and instance segmentation on COCO 2017 dataset. #P means parameter. Mask R-CNN $1\times$ denotes that models are based on Mask R-CNN and we train them for 12 epochs. $AP^b$ and $AP^m$ refer to bounding box AP and mask AP respectively.

Comparison with the state-of-the-art vision backbones on COCO 2017 benchmark. All models are trained for 36 epochs. We calculate FLOPs with input size of $1280\times800$.

semantic segmentation

framework Semantic FPN, UperNet
dataset ADE20K

Results of semantic segmentation on ADE20K validation set. The upper and lower parts are obtained under two different training/validation schemes. We calculate FLOPs with input size $512\times512$ for Semantic FPN and $2048\times512$ for UperNet.

ablation studies

architecture components

Ablation study of different modules in LKA. Results show that each part is critical. Acc(%) means Top-1 accuracy on ImageNet validation set.

key findings

  • local structural information, long-range dependence, and adaptability in the channel dimension are all critical
  • the attention mechanism helps the network achieve adaptive properties

kernel size and dilation

Ablation study of different kernel size in LKA. Acc(%) means Top-1 accuracy on ImageNet validation set.

key findings

  • decomposing a $21\times21$ convolution works better than decomposing a $7\times7$ convolution
    $\implies$ a large kernel is critical for visual tasks
  • when decomposing a larger $35\times35$ convolution, the gain is not obvious compared with decomposing a $21\times21$ convolution

visualization


Visualization results. All images come from different categories in ImageNet validation set. CAM is produced by using Grad-CAM. We compare different CAMs produced by Swin-T, ConvNeXt-T and VAN-Base.

key findings

  • the activation area of VAN is more accurate
  • VAN shows obvious advantages when the object is dominant in the image $\implies$ ability to capture long-range dependence
