Our Restormer achieves state-of-the-art performance on image restoration tasks while being computationally efficient.
Architecture of Restormer for high-resolution image restoration. Our Restormer consists of a multi-scale hierarchical design incorporating efficient Transformer blocks. The core modules of the Transformer block are: (a) multi-Dconv head transposed attention (MDTA) that performs (spatially enriched) query-key feature interaction across channels rather than the spatial dimension, and (b) Gated-Dconv feed-forward network (GDFN) that performs controlled feature transformation, i.e., allowing useful information to propagate further.
given a degraded image $I\in\Reals^{H\times W\times 3}$
feature extraction
obtain low-level feature embeddings $F_0\in\Reals^{H\times W\times C}$
transformer
pass $F_0$ through a 4-level encoder-decoder and transform it into deep features $F_d\in\Reals^{H\times W\times 2C}$
encoder: reduces spatial size and expands channel capacity
decoder: recovers high-resolution representations
down-/up-sampler: pixel-unshuffle / pixel-shuffle
encoder features are concatenated with decoder features via skip connections for finer structural and textural detail
refinement
enrich $F_d$ at high spatial resolution to obtain $F_r\in\Reals^{H\times W\times 2C}$
reconstruction
generate a residual image $R\in\Reals^{H\times W\times 3}$ and add it to the degraded input to obtain the restored image $\hat{I}=I+R$ (see the pipeline sketch below)
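Putting the stages together, here is a minimal PyTorch sketch of this pipeline (module names are illustrative and the Transformer-block stacks are collapsed into placeholders; this is not the official implementation):

```python
import torch
import torch.nn as nn

class RestormerSketch(nn.Module):
    """High-level Restormer pipeline (placeholders stand in for the
    Transformer-block stacks; names are my own, not the official code)."""
    def __init__(self, channels: int = 48):
        super().__init__()
        # feature extraction: 3x3 conv producing F0 with C channels
        self.embed = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        # 4-level encoder-decoder of Transformer blocks; in the paper its
        # output Fd has 2C channels (the placeholder keeps C for simplicity)
        self.encoder_decoder = nn.Identity()
        # refinement stage: Transformer blocks at full resolution (placeholder)
        self.refinement = nn.Identity()
        # reconstruction: 3x3 conv producing the residual image R
        self.to_residual = nn.Conv2d(channels, 3, kernel_size=3, padding=1)

    def forward(self, degraded: torch.Tensor) -> torch.Tensor:
        f0 = self.embed(degraded)          # F0: low-level feature embeddings
        fd = self.encoder_decoder(f0)      # Fd: deep features
        fr = self.refinement(fd)           # Fr: refined high-resolution features
        residual = self.to_residual(fr)    # R
        return degraded + residual         # I_hat = I + R
```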
problem: complexity of MHSA is $\mathcal{O}(H^2W^2)$ for an $H\times W$ input image
solution: apply SA across the channel dimension instead of the spatial dimension $\implies$ complexity $\mathcal{O}(HW)$
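To make the gap concrete (assuming a $256\times256$ input and, say, $C=48$ channels): spatial SA materializes an attention map with $(HW)^2 = 65536^2 \approx 4.3\times10^9$ entries, whereas the transposed channel attention needs only $C^2 = 2304$.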
given layer-normalized features $Y\in\Reals^{H\times W\times C}$
step 1: generate query, key, and value by convolutions
$$\begin{aligned} Q&=W_d^QW_p^QY \\ K&=W_d^KW_p^KY \\ V&=W_d^VW_p^VY \end{aligned}$$
where $W_p^{(\cdot)}$ is a $1\times1$ conv that aggregates pixel-wise cross-channel context and $W_d^{(\cdot)}$ is a $3\times3$ depth-wise conv that encodes channel-wise spatial context
step 2: reshape $Q, K, V$ into $\hat{Q}, \hat{K}, \hat{V}$
$$\begin{aligned} Q, V\in\Reals^{H\times W\times C}&\xrightarrow{\text{reshape}}\hat{Q}, \hat{V}\in\Reals^{HW\times C} \\ K\in\Reals^{H\times W\times C}&\xrightarrow{\text{reshape}}\hat{K}\in\Reals^{C\times HW} \end{aligned}$$
step 3: calculate a transposed attention map $A\in\Reals^{C\times C}$ instead of $\Reals^{HW\times HW}$
$$\mathrm{Attention}(\hat{Q}, \hat{K}, \hat{V})=\hat{V}\cdot\mathrm{softmax}\!\left(\frac{\hat{K}\cdot\hat{Q}}{\alpha}\right)$$
where $\alpha$ is a learnable scaling parameter that controls the magnitude of the dot product before the softmax
step 4: define the overall MDTA process as
$$\hat{X}=W_p\,\mathrm{Attention}(\hat{Q}, \hat{K}, \hat{V})+X$$
where $W_p$ is a $1\times1$ conv that projects the attention output
similarly to conventional multi-head SA, divide the number of channels into "heads" and learn separate attention maps in parallel (see the sketch below)
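A minimal PyTorch sketch of MDTA following the four steps above (it uses a channel-major but mathematically equivalent layout; module names and details such as bias terms are my assumptions, not the official code):

```python
import torch
import torch.nn as nn

class MDTA(nn.Module):
    """Multi-Dconv head transposed attention (sketch)."""
    def __init__(self, channels: int, num_heads: int):
        super().__init__()
        assert channels % num_heads == 0
        self.num_heads = num_heads
        # learnable per-head scaling alpha used inside the softmax
        self.alpha = nn.Parameter(torch.ones(num_heads, 1, 1))
        # W_p: 1x1 conv for pixel-wise cross-channel context
        self.qkv_point = nn.Conv2d(channels, channels * 3, kernel_size=1, bias=False)
        # W_d: 3x3 depth-wise conv for channel-wise spatial context
        self.qkv_depth = nn.Conv2d(channels * 3, channels * 3, kernel_size=3,
                                   padding=1, groups=channels * 3, bias=False)
        self.project_out = nn.Conv2d(channels, channels, kernel_size=1, bias=False)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (B, C, H, W), assumed already layer-normalized
        b, c, h, w = y.shape
        q, k, v = self.qkv_depth(self.qkv_point(y)).chunk(3, dim=1)
        # split channels into heads: (B, heads, C/heads, HW)
        q = q.reshape(b, self.num_heads, c // self.num_heads, h * w)
        k = k.reshape(b, self.num_heads, c // self.num_heads, h * w)
        v = v.reshape(b, self.num_heads, c // self.num_heads, h * w)
        # transposed attention map A: (B, heads, C/heads, C/heads), not (HW, HW)
        attn = torch.softmax((q @ k.transpose(-2, -1)) / self.alpha, dim=-1)
        out = (attn @ v).reshape(b, c, h, w)   # back to (B, C, H, W)
        # the enclosing Transformer block adds the residual: x + MDTA(LN(x))
        return self.project_out(out)
```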
FFN operates on each pixel location separately and identically
2 modifications: gating mechanism, depth-wise conv
advantages of GDFN: controls information flow through the hierarchical levels, allowing each level to focus on fine details complementary to the other levels
given input features $X\in\Reals^{H\times W\times C}$
step 1: encode pixel-wise and channel-wise information
$$\begin{aligned} X_1&=W_d^1W_p^1\,\mathrm{LN}(X) \\ X_2&=W_d^2W_p^2\,\mathrm{LN}(X) \end{aligned}$$
where $W_p^{(\cdot)}$ is a $1\times1$ conv, $W_d^{(\cdot)}$ is a $3\times3$ depth-wise conv, and $\mathrm{LN}$ denotes layer normalization
step 2: gating mechanism: element-wise product of the two parallel paths, one of which is activated with GELU
$$\mathrm{gate}(X)=\mathrm{GELU}(X_1)\odot X_2$$
where $\odot$ denotes element-wise multiplication
step 3: define the overall GDFN as
$$\hat{X}=W_p^0\,\mathrm{gate}(X)+X$$
GDFN performs more operations than a conventional FFN $\implies$ reduce the expansion ratio $\gamma$ to keep parameters and FLOPs comparable (see the sketch below)
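A matching PyTorch sketch of GDFN (the two paths' $1\times1$ and depth-wise convs are fused into single layers for brevity; the default expansion ratio of 2.66 reflects my reading of the paper and is an assumption here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GDFN(nn.Module):
    """Gated-Dconv feed-forward network (sketch)."""
    def __init__(self, channels: int, expansion: float = 2.66):
        super().__init__()
        hidden = int(channels * expansion)   # expansion ratio gamma
        # W_p^1, W_p^2 fused into one 1x1 conv; W_d^1, W_d^2 fused into one
        # depth-wise 3x3 conv; together they produce the two parallel paths
        self.project_in = nn.Conv2d(channels, hidden * 2, kernel_size=1, bias=False)
        self.dwconv = nn.Conv2d(hidden * 2, hidden * 2, kernel_size=3,
                                padding=1, groups=hidden * 2, bias=False)
        self.project_out = nn.Conv2d(hidden, channels, kernel_size=1, bias=False)  # W_p^0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W), assumed already layer-normalized; the enclosing
        # Transformer block adds the residual: x + GDFN(LN(x))
        x1, x2 = self.dwconv(self.project_in(x)).chunk(2, dim=1)
        gated = F.gelu(x1) * x2              # gate(X) = GELU(X1) ⊙ X2
        return self.project_out(gated)       # W_p^0 gate(X)
```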
problem: a Transformer model trained on small cropped patches cannot encode global image statistics $\implies$ sub-optimal performance on full-resolution images at test time
solution: progressive learning: train the network on smaller patches in early epochs and on gradually larger patches in later epochs
reduce batch size as patch size increases $\impliedby$ larger patches take longer to process, keeping the time per optimization step roughly constant (see the schedule sketch below)
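A sketch of what such a schedule can look like (the iteration/patch/batch triples are illustrative, not the paper's exact configuration):

```python
# Progressive-learning schedule: patch size grows while batch size shrinks,
# keeping the cost per optimization step roughly constant. Values below are
# hypothetical, not the paper's exact settings.
SCHEDULE = [
    # (start_iteration, patch_size, batch_size)
    (0,       128, 64),
    (100_000, 192, 32),
    (200_000, 256, 16),
    (300_000, 384, 8),
]

def phase_at(iteration: int) -> tuple[int, int]:
    """Return the (patch_size, batch_size) active at a given training iteration."""
    patch, batch = SCHEDULE[0][1:]
    for start, p, b in SCHEDULE:
        if iteration >= start:
            patch, batch = p, b
    return patch, batch
```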
Image deraining results. When averaged across all five datasets, our Restormer advances state-of-the-art by 1.05 dB.
Image deraining example. Our Restormer generates rain-free images with structural fidelity and without artifacts.
Single-image motion deblurring results. Our Restormer is trained only on the GoPro dataset and directly applied to the HIDE and RealBlur benchmark datasets.
Single-image motion deblurring on GoPro. Restormer generates sharper and more visually faithful results.
Defocus deblurring comparisons on the DPDD test set (containing 37 indoor and 39 outdoor scenes). S: single-image defocus deblurring. D: dual-pixel defocus deblurring. Restormer sets a new state-of-the-art for both single-image and dual-pixel defocus deblurring.
Dual-pixel defocus deblurring comparison on the DPDD dataset. Compared to the other approaches, our Restormer more effectively removes blur while preserving the fine image details.
Gaussian grayscale image denoising comparisons for two categories of methods. Top super row: learning a single model to handle various noise levels. Bottom super row: training a separate model for each noise level.
Gaussian color image denoising. Our Restormer demonstrates favorable performance among both categories of methods. On the Urban100 dataset for noise level 50, Restormer yields a 0.41 dB gain over the CNN-based DRUNet and 0.2 dB over the Transformer-based SwinIR.
Real image denoising on SIDD and DND datasets. "$\ast$" denotes methods using additional training data. Our Restormer is trained only on the SIDD images and directly tested on DND. Among competing approaches, only Restormer surpasses 40 dB PSNR.
Visual results on image denoising. Top row: Gaussian grayscale denoising. Middle row: Gaussian color denoising. Bottom row: real image denoising. The image reproduction quality of our Restormer is more faithful to the ground-truth than other methods.