[2111] [CVPR 2022] Restormer: Efficient Transformer for High-Resolution Image Restoration

paper
code

Content

    • Abstract
    • Method
        • model architecture
        • multi-Dconv head transposed attention (MDTA)
        • gated-Dconv feed-forward network (GDFN)
        • progressive learning
    • Experiment
        • deraining
        • deblurring
        • denoising
        • ablation study
          • improvement in multi-head attention
          • improvement in feed-forward network
          • designs for decoder at level-1
          • progressive learning
          • deeper or wider Restormer

Abstract

  • propose Restormer, an encoder-decoder Transformer for multi-scale local-global representation
    no partition into local windows ⟹ exploit distant image context
  • propose a multi-Dconv head transposed attention (MDTA) module
    aggregate local and non-local pixel interactions ⟹ process HR images efficiently
  • propose a gated-Dconv feed-forward network (GDFN)
    perform controlled feature transformation

Our Restormer achieves the state-of-the-art performance on image restoration tasks while being computationally efficient.

Method

model architecture

Architecture of Restormer for high-resolution image restoration. Our Restormer consists of a multi-scale hierarchical design incorporating efficient Transformer blocks. The core modules of the Transformer block are: (a) multi-Dconv head transposed attention (MDTA), which performs (spatially enriched) query-key feature interaction across channels rather than the spatial dimension, and (b) gated-Dconv feed-forward network (GDFN), which performs controlled feature transformation, i.e., allows useful information to propagate further.

given a degraded image $I\in\Reals^{H\times W\times 3}$

feature extraction
obtain low-level feature embeddings $F_0\in\Reals^{H\times W\times C}$

transformer
pass $F_0$ through a 4-level encoder-decoder and transform it into deep features $F_d\in\Reals^{H\times W\times 2C}$
encoder reduces spatial size while expanding channel capacity
decoder recovers HR representations
down-/up-sampling uses pixel-unshuffle / pixel-shuffle (see the sampler sketch after the reconstruction step)
encoder features are concatenated with decoder features via skip connections for finer structural and textural detail

refinement
enrich $F_d$ at high spatial resolution, yielding $F_r\in\Reals^{H\times W\times 2C}$

reconstruction
generate a residual image $R\in\Reals^{H\times W\times 3}$ and add it to the input to obtain the restored image $\hat{I}=I+R$
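A minimal PyTorch sketch of how the pixel-(un)shuffle samplers between encoder-decoder levels could look; the 3×3 conv placement and the channel bookkeeping here are assumptions for illustration, not necessarily the official implementation:

```python
import torch.nn as nn

class Downsample(nn.Module):
    """Halve H and W, double C: 3x3 conv to C/2, then pixel-unshuffle x2."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=3, padding=1, bias=False),
            nn.PixelUnshuffle(2),  # (C/2, H, W) -> (2C, H/2, W/2)
        )

    def forward(self, x):
        return self.body(x)

class Upsample(nn.Module):
    """Double H and W, halve C: 3x3 conv to 2C, then pixel-shuffle x2."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels * 2, kernel_size=3, padding=1, bias=False),
            nn.PixelShuffle(2),  # (2C, H, W) -> (C/2, 2H, 2W)
        )

    def forward(self, x):
        return self.body(x)
```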

multi-Dconv head transposed attention (MDTA)

problem complexity of MHSA is $\mathcal{O}(H^2W^2)$ for an $H\times W$ input image
solution apply SA across the channel dimension instead of the spatial dimension ⟹ complexity $\mathcal{O}(HW)$

given layer-normalized features $Y\in\Reals^{H\times W\times C}$
step 1 generate query, key, value by convs
$$\begin{aligned} Q&=W_d^QW_p^QY \\ K&=W_d^KW_p^KY \\ V&=W_d^VW_p^VY \end{aligned}$$

where $W_p^{(\cdot)}$ is a $1\times1$ point-wise conv that aggregates pixel-wise cross-channel context, and $W_d^{(\cdot)}$ is a $3\times3$ depth-wise conv that encodes channel-wise spatial context
step 2 reshape $Q, K, V$ into $\hat{Q}, \hat{K}, \hat{V}$
$$\begin{aligned} Q\in\Reals^{H\times W\times C}&\xrightarrow{reshape}\hat{Q}\in\Reals^{HW\times C} \\ K\in\Reals^{H\times W\times C}&\xrightarrow{reshape}\hat{K}\in\Reals^{C\times HW} \\ V\in\Reals^{H\times W\times C}&\xrightarrow{reshape}\hat{V}\in\Reals^{HW\times C} \end{aligned}$$
(note $\hat{V}$ is $HW\times C$, so the matrix products below are dimensionally consistent)

step 3 calculate a transposed attention map $A\in\Reals^{C\times C}$ instead of $\Reals^{HW\times HW}$
$$Attention(\hat{Q}, \hat{K}, \hat{V})=\hat{V}\cdot softmax\left(\frac{\hat{K}\cdot\hat{Q}}{\alpha}\right)$$

where $\alpha$ is a learnable scaling parameter that controls the magnitude of the dot product
step 4 define the overall MDTA process as
$$\hat{X}=W_p\,Attention(\hat{Q}, \hat{K}, \hat{V})+X$$
where $W_p$ is a $1\times1$ conv and $X$ is the block input

as in conventional multi-head SA, the channels are divided into “heads” and separate attention maps are learned in parallel
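Putting the four steps together, here is a minimal PyTorch sketch of MDTA. The fused QKV convolution, the q/k normalization, and parameterizing the scaling as a learnable per-head multiplier (an equivalent reparameterization of dividing by $\alpha$) are assumptions of this sketch; LN is assumed to be applied outside, in the Transformer block:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDTA(nn.Module):
    def __init__(self, channels: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        # learnable scaling alpha, one per head (applied before softmax)
        self.alpha = nn.Parameter(torch.ones(num_heads, 1, 1))
        # W_p: 1x1 conv aggregating pixel-wise cross-channel context (Q, K, V fused)
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1, bias=False)
        # W_d: 3x3 depth-wise conv encoding channel-wise spatial context
        self.qkv_dw = nn.Conv2d(channels * 3, channels * 3, kernel_size=3,
                                padding=1, groups=channels * 3, bias=False)
        self.project_out = nn.Conv2d(channels, channels, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.qkv_dw(self.qkv(x)).chunk(3, dim=1)
        # reshape so attention runs across channels, not spatial positions
        q = q.reshape(b, self.num_heads, c // self.num_heads, h * w)
        k = k.reshape(b, self.num_heads, c // self.num_heads, h * w)
        v = v.reshape(b, self.num_heads, c // self.num_heads, h * w)
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        # transposed attention map A: (c/heads x c/heads) instead of (hw x hw)
        attn = torch.softmax((q @ k.transpose(-2, -1)) * self.alpha, dim=-1)
        out = (attn @ v).reshape(b, c, h, w)
        return self.project_out(out) + x  # X_hat = W_p . Attention(...) + X
```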

gated-Dconv feed-forward network (GDFN)

the regular FFN operates on each pixel location separately and identically
2 modifications: gating mechanism, depth-wise conv

advantages of GDFN

  • control information flow
  • allow each level to focus on details complementary to the other levels

given input features $X\in\Reals^{H\times W\times C}$
step 1 encode pixel- and channel-wise information
$$\begin{aligned} X_1&=W_d^1W_p^1\,LN(X) \\ X_2&=W_d^2W_p^2\,LN(X) \end{aligned}$$

where $W_p^{(\cdot)}$ is a $1\times1$ conv and $W_d^{(\cdot)}$ is a $3\times3$ depth-wise conv
step 2 gating mechanism: element-wise product of two parallel linear branches, one of which is activated with GELU
$$gate(X)=GELU(X_1)\odot X_2$$

step 3 define the overall GDFN as
$$\hat{X}=W_p^0\,gate(X)+X$$

where $\odot$ denotes element-wise multiplication
GDFN performs more operations than the regular FFN ⟹ reduce the expansion ratio $\gamma$ to keep parameters and FLOPs comparable
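A matching PyTorch sketch of GDFN under the same assumptions as above (LN applied outside in the block; the two parallel $W_d W_p$ branches fused into single convs producing both $X_1$ and $X_2$; the expansion value is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GDFN(nn.Module):
    def __init__(self, channels: int, expansion: float = 2.66):
        super().__init__()
        hidden = int(channels * expansion)
        # the two parallel W_p (1x1 conv) branches, fused into one conv
        self.project_in = nn.Conv2d(channels, hidden * 2, kernel_size=1, bias=False)
        # the two parallel W_d (3x3 depth-wise conv) branches, also fused
        self.dwconv = nn.Conv2d(hidden * 2, hidden * 2, kernel_size=3,
                                padding=1, groups=hidden * 2, bias=False)
        self.project_out = nn.Conv2d(hidden, channels, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = self.dwconv(self.project_in(x)).chunk(2, dim=1)
        # gate(X) = GELU(X1) ⊙ X2, then W_p^0 projects back to C channels
        return self.project_out(F.gelu(x1) * x2) + x  # with residual
```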

progressive learning

training a Transformer model on small cropped patches cannot encode global image statistics ⟹ sub-optimal performance on full-resolution images at test time
solution progressive learning: train the network on smaller patches in early epochs and on gradually larger patches in later epochs
reduce batch size as patch size increases ⟸ larger patches take longer to train per step
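A hedged sketch of such a schedule; the milestone iterations and (patch, batch) pairs below are illustrative values, not the paper's exact training configuration:

```python
from typing import Sequence, Tuple

def patch_schedule(
    iteration: int,
    milestones: Sequence[int] = (90_000, 150_000, 200_000),
    configs: Sequence[Tuple[int, int]] = ((128, 64), (192, 32), (256, 16), (384, 8)),
) -> Tuple[int, int]:
    """Return (patch_size, batch_size) for the current training iteration.

    Batch size shrinks as patch size grows, so per-step cost stays roughly
    constant while the model sees progressively larger image context.
    """
    stage = sum(iteration >= m for m in milestones)  # milestones passed so far
    return configs[stage]

# usage: pick the crop size and loader batch size for the current stage
patch, batch = patch_schedule(100_000)  # -> (192, 32)
```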

Experiment

deraining

Image deraining results. When averaged across all five datasets, our Restormer advances state-of-the-art by 1.05 dB.

Image deraining example. Our Restormer generates rain-free images with structural fidelity and without artifacts.

deblurring

Single-image motion deblurring results. Our Restormer is trained only on the GoPro dataset and directly applied to the HIDE and RealBlur benchmark datasets.

Single-image motion deblurring on GoPro. Restormer generates sharper and more visually faithful results.

Defocus deblurring comparisons on the DPDD test set (containing 37 indoor and 39 outdoor scenes). S: single-image defocus deblurring. D: dual-pixel defocus deblurring. Restormer sets a new state-of-the-art for both single-image and dual-pixel defocus deblurring.


Dual-pixel defocus deblurring comparison on the DPDD dataset. Compared to the other approaches, our Restormer more effectively removes blur while preserving the fine image details.

denoising

Gaussian grayscale image denoising comparisons for two categories of methods. Top super row: learning a single model to handle various noise levels. Bottom super row: training a separate model for each noise level.

Gaussian color image denoising. Our Restormer demonstrates favorable performance among both categories of methods. On the Urban100 dataset for noise level 50, Restormer yields a 0.41 dB gain over the CNN-based DRUNet and 0.2 dB over the Transformer model SwinIR.

Real image denoising on SIDD and DND datasets. “∗” denotes methods using additional training data. Our Restormer is trained only on the SIDD images and directly tested on DND. Among competing approaches, only Restormer surpasses 40 dB PSNR.


Visual results on image denoising. Top row: Gaussian grayscale denoising. Middle row: Gaussian color denoising. Bottom row: real image denoising. The image reproduction quality of our Restormer is more faithful to the ground-truth than other methods.

ablation study

  • improvement in multi-head attention
  • improvement in feed-forward network
  • designs for decoder at level-1
  • progressive learning
  • deeper or wider Restormer
