[2112] On Efficient Transformer and Image Pre-training for Low-level Vision

paper
code

Content

    • Abstract
    • Method
        • model architecture
        • (shifted) cross local attention
        • anti-blocking FFN
        • architecture variants
    • Pre-Training
        • pre-training on ImageNet
        • centered kernel alignment (CKA)
          • representation structure of EDT
        • single and multi-task pre-training
    • Experiment
        • super resolution
        • denoising
        • deraining
        • ablation study

Abstract

  • a highly efficient and generic transformer for low-level vision
    achieves SOTA under constrained parameters and computational complexity
  • the first in-depth study on image pre-training for low-level vision
    how pre-training affects the internal representations of models, and how to conduct effective pre-training

Comparison on PSNR (dB) performance of the proposed EDT and state-of-the-art methods in different low-level tasks.

Method

model architecture

The proposed encoder-decoder-based Transformer (EDT). EDT processes high-resolution (e.g., in deraining) and low-resolution (e.g., in SR, where $s$ refers to the scale) inputs using different paths, modeling long-range interactions at a low resolution for efficient computation.

consists of a lightweight conv-based encoder-decoder and a transformer body
limitation of transformers: high computational cost, which makes it difficult to handle HR inputs directly

Structures of the convolution blocks (CB) in the encoder and decoder. The dotted boxes and lines represent additional down-sampling and up-sampling operations when processing high-resolution inputs.

encoder

  • sr: maintain the original input size
  • dn, dr: down-sample to $\frac14$ of the input size with strided convs

note that a stack of early convs is useful for stabilizing the optimization.

transformer
achieves larger receptive fields at low computational cost $\impliedby$ the down-sampler in the encoder

Structure of the transformer block, including layer normalizations (LN), a (shifted) crossed local multi-head attention module ((S)CL-MSA) and an anti-blocking feed-forward network (Anti-FFN).

decoder

  • sr: maintain the original input size, with an extra up-sampler before the output
  • dn, dr: up-sample back to the original input size, mirroring the encoder's down-sampling

skip connections between encoder and decoder
enable fast convergence during training

(shifted) cross local attention

Shifted Crossed Local Attention. The horizontal and vertical window sizes are set to $(h, w)$ and $(w, h)$.

given input features $X\in\mathbb{R}^{(H\times W)\times C}$, split into 2 parts along the channel dimension
$$X=[X_1, X_2]$$

where $X_1, X_2\in\mathbb{R}^{(H\times W)\times\frac{C}{2}}$
each half performs MSA in either a horizontal or a vertical window
$$\begin{aligned} X_1'&=H\text{-}MSA(X_1) \\ X_2'&=V\text{-}MSA(X_2) \end{aligned}$$

where the window size is $(w, h)$ or $(h, w)$, with $w>h$
a projection layer finally fuses the attention results
$$CL\text{-}MSA(X)=proj([X_1', X_2'])$$

windows are shifted by $(\lfloor\frac{h}{2}\rfloor, \lfloor\frac{w}{2}\rfloor)$ or $(\lfloor\frac{w}{2}\rfloor, \lfloor\frac{h}{2}\rfloor)$
$\implies$ crossed windows with shifts dramatically increase the effective receptive field
complexity of crossed window attention for an $H\times W$ input
$$\Omega((S)CL\text{-}MSA)=4HWC^2+2hwHWC$$
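
Below is a minimal PyTorch sketch of the (non-shifted) crossed local attention described above, assuming $H$ and $W$ are divisible by both window sizes; the names (`CrossedLocalAttention`, `window_partition`, `win`) are placeholders, and relative position bias (if any) plus the shifted variant are omitted, so this is an illustration rather than the official EDT implementation.

```python
# Sketch of crossed local attention: channel split + horizontal/vertical window MSA.
import torch
import torch.nn as nn


def window_partition(x, win_h, win_w):
    """Split (B, H, W, C) features into (num_windows*B, win_h*win_w, C) tokens."""
    B, H, W, C = x.shape
    x = x.view(B, H // win_h, win_h, W // win_w, win_w, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win_h * win_w, C)


def window_reverse(windows, win_h, win_w, H, W):
    """Inverse of window_partition, back to (B, H, W, C)."""
    B = windows.shape[0] // ((H // win_h) * (W // win_w))
    x = windows.view(B, H // win_h, W // win_w, win_h, win_w, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)


class CrossedLocalAttention(nn.Module):
    def __init__(self, dim, num_heads=4, win=(6, 24)):
        super().__init__()
        assert dim % 2 == 0
        self.h, self.w = win  # horizontal window (h, w); vertical window (w, h)
        self.attn_h = nn.MultiheadAttention(dim // 2, num_heads, batch_first=True)
        self.attn_v = nn.MultiheadAttention(dim // 2, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)  # fuses the halves: CL-MSA(X) = proj([X1', X2'])

    def forward(self, x):  # x: (B, H, W, C)
        B, H, W, C = x.shape
        x1, x2 = x.chunk(2, dim=-1)  # split along the channel dimension

        # first half: MSA inside horizontal (h, w) windows
        w1 = window_partition(x1, self.h, self.w)
        w1, _ = self.attn_h(w1, w1, w1)
        x1 = window_reverse(w1, self.h, self.w, H, W)

        # second half: MSA inside vertical (w, h) windows
        w2 = window_partition(x2, self.w, self.h)
        w2, _ = self.attn_v(w2, w2, w2)
        x2 = window_reverse(w2, self.w, self.h, H, W)

        return self.proj(torch.cat([x1, x2], dim=-1))


if __name__ == "__main__":
    x = torch.randn(1, 48, 48, 64)  # H, W divisible by both 6 and 24
    print(CrossedLocalAttention(64, win=(6, 24))(x).shape)  # torch.Size([1, 48, 48, 64])
```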

anti-blocking FFN

$$Y=Linear(Act(ABFF(Act(Linear(X)))))$$

where $ABFF(\cdot)$ is implemented by a depth-wise conv with a 5×5 kernel
$\implies$ eliminates the possible blocking effect caused by window partition
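
A minimal PyTorch sketch of the anti-blocking FFN following the formula above; the class name `AntiBlockingFFN` is mine and GELU is assumed for $Act(\cdot)$, which the notes do not specify.

```python
# Sketch of Anti-FFN: Linear -> Act -> 5x5 depth-wise conv (ABFF) -> Act -> Linear.
import torch
import torch.nn as nn


class AntiBlockingFFN(nn.Module):
    def __init__(self, dim, expansion=2):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)
        self.act = nn.GELU()  # assumed activation
        # depth-wise 5x5 conv mixes features across window borders (ABFF)
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size=5, padding=2, groups=hidden)
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x):  # x: (B, H, W, C)
        x = self.act(self.fc1(x))
        x = x.permute(0, 3, 1, 2)          # (B, C, H, W) for the conv
        x = self.act(self.dwconv(x))
        x = x.permute(0, 2, 3, 1)          # back to (B, H, W, C)
        return self.fc2(x)


if __name__ == "__main__":
    print(AntiBlockingFFN(64)(torch.randn(1, 48, 48, 64)).shape)  # torch.Size([1, 48, 48, 64])
```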

architecture variants

Configurations of four variants of EDT. The parameter numbers and FLOPs are counted in denoising at $192\times192$ size.

uniformly set the block number in each stage to 6, the expansion ratio of the FFN to 2, and the window size $(h, w)$ to $(6, 24)$

Pre-Training

pre-training on ImageNet

dataset: 200K images from ImageNet (about 15.6% of the full training set)
degradation

  • sr: $\times2$, $\times3$, $\times4$ bicubic down-sampling (BI)
  • dn: Gaussian noise with $\sigma=15, 25, 50$
  • dr: light and heavy rain streaks

pre-training strategy

  • train a single model for a specific task, e.g. $\times2$ SR
  • train a single model for highly related tasks, e.g. $\times2$, $\times3$, $\times4$ SR
  • train a single model for unrelated tasks, e.g. $\times2$, $\times3$ SR and level-15 dn

pre-training for unrelated tasks is used in IPT.

centered kernel alignment (CKA)

used to study the representation similarity between hidden layers in a network
given $m$ data points, the outputs of 2 layers are $X\in\mathbb{R}^{m\times p_1}$ and $Y\in\mathbb{R}^{m\times p_2}$, with $p_1$ and $p_2$ output neurons respectively
use the Gram matrices $K=XX^T$ and $L=YY^T$ to compute CKA
$$CKA(K, L)=\frac{HSIC(K, L)}{\sqrt{HSIC(K, K)\,HSIC(L, L)}}$$

where HSIC is the Hilbert-Schmidt independence criterion

To measure the similarity between two sets of features, one typically measures the similarity between sample pairs within each feature space, summarized by representational similarity matrices.
Common similarity measures include the following:
a. dot-product-based similarity
(s.1)
$$\langle vec(XX^T), vec(YY^T)\rangle=tr(XX^TYY^T)={\Vert Y^TX\Vert}_F^2$$

where $XX^T$ measures the relations between samples from the view of feature $X$, and $YY^T$ does the same from the view of feature $Y$; the inner product of the two is the similarity between features $X$ and $Y$
b. Hilbert-Schmidt independence criterion (HSIC)
For zero-mean features $X$ and $Y$, (s.1) can be rewritten as (s.2)
$$\frac1{(m-1)^2}tr(XX^TYY^T) = {\Vert cov(X^T, Y^T)\Vert}_F^2$$

(s.2) follows from a simple derivation
$$\begin{aligned} \frac1{(m-1)^2}tr(XX^TYY^T)&=\frac1{(m-1)^2}tr((Y^TX)^T(Y^TX)) \\ &=\frac1{(m-1)^2}\sum_{i=1}^{p_2}\sum_{j=1}^{p_1}(Y^TX)_{i,j}^2 \\ &=\frac1{(m-1)^2}{\Vert Y^TX\Vert}_F^2 \\ &={\Vert cov(X^T, Y^T)\Vert}_F^2 \end{aligned}$$

HSIC replaces the squared Frobenius norm of the covariance in (s.2) with the squared Hilbert-Schmidt norm.
Define two kernel (Gram) matrices $K_{i,j}=k(x_i, x_j)$ and $L_{i,j}=l(y_i, y_j)$; HSIC can then be written as (s.3)
$$HSIC(K, L)=\frac1{(m-1)^2}\langle vec(HKH), vec(HLH)\rangle=\frac1{(m-1)^2}tr(KHLH)$$

where $HKH$ and $HLH$ are the centered Gram matrices, with centering matrix $H_n=I_n-\frac1n\mathbf{1}\mathbf{1}^T$
The connection between (s.2) and (s.3): with linear kernels $k(x, y)=l(x, y)=x^Ty$, HSIC reduces to the dot-product similarity
ref: zhihu
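
As a concrete reference for the formulas above, here is a small NumPy sketch of linear CKA (Gram matrices $K=XX^T$, $L=YY^T$ with linear kernels); this is illustrative code, not the analysis script used in the paper.

```python
# Linear CKA between two layer representations, following (s.3) with linear kernels.
import numpy as np


def hsic(K, L):
    """HSIC(K, L) = tr(KHLH) / (m-1)^2 for m x m Gram matrices."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m   # centering matrix H_m
    return np.trace(K @ H @ L @ H) / (m - 1) ** 2


def linear_cka(X, Y):
    """CKA between layer outputs X (m x p1) and Y (m x p2)."""
    K, L = X @ X.T, Y @ Y.T
    return hsic(K, L) / np.sqrt(hsic(K, K) * hsic(L, L))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((128, 256))                       # 128 samples, 256 neurons
    print(linear_cka(X, X))                                   # ~1.0 for identical features
    print(linear_cka(X, rng.standard_normal((128, 64))))      # near 0 for unrelated features
```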

representation structure of EDT


Sub-figures (a)-(c) show CKA similarities between all pairs of layers in the x2 EDT-S SR model, the x2 EDT-B SR model and the level-15 EDT-B denoising model with single-task pre-training, and the corresponding similarities between the models with and without pre-training are shown in (e)-(g). Sub-figure (d) shows the cross-model comparison between EDT-B SR and EDT-B denoising models and (h) shows the ratios of layer similarity larger than 0.6, where “s” means the similarity between the current layer in SR and any layer in denoising.

key findings

  • sr models show clear stages in their internal representations and the proportion of each stage varies with model size, while dn models present a relatively uniform structure
  • layers in dn models are more similar to the lower layers of sr models, containing more local information
  • single-task pre-training mainly affects the higher layers of sr models, but has limited impact on dn models

single and multi-task pre-training


Sub-figures (a)-(d) show CKA similarities of x2 SR models, without pre-training as well as with pre-training on a single task (x2), highly related tasks (x2, x3, x4 SR) and unrelated tasks (x2, x3 SR, level-15 denoising). Sub-figures (e)-(h) show the corresponding attention head mean distances of Transformer blocks. Note that we do not plot shifted local windows in (e)-(h) so that the last blue dotted line (“—”) has no matching point. The red boxes indicate the same attention modules.

key findings

  • representations of sr models contain more local information in early layers and more global information in higher layers
  • compared with single-task pre-training, multi-related-task pre-training (fig. g) converts more of the 3rd stage's global representations into local ones, enlarging the extent of the 2nd stage; this effect is weaker for multi-unrelated-task pre-training (fig. h)
  • all 3 pre-training strategies can greatly improve the performance by introducing different degrees of local information, treated as a kind of inductive bias, into the intermediate layers of the model, among which multi-related-task pre-training performs best

PSNR(dB) improvements of single-task, multi-related-task and multi-unrelated-task pre-training in $\times2$ SR.

multi-related-task pre-training is more effective and efficient, providing a single initialization for multiple tasks.
multi-task pre-training is also more data-efficient, since its setting enables the transformer body to see more samples per iteration.

PSNR(dB) improvements of different data scales during single-task pre-training for EDT-L in $\times2$ SR.

incremental PSNR improvements are observed on multiple SR benchmarks as the data scale increases
note that the pre-training iterations are doubled (from 500K to 1M) for the 400K data scale so that the data can be fully exploited, and the longer pre-training period largely increases the training burden.

PSNR(dB) results of different pre-training (single-task) data scales in $\times2$ SR. “EDT-B$\dag$” refers to the base model with single-task pre-training and “EDT-B$\ast$” represents the base model with multi-related-task pre-training. The best results are in bold.

Experiment

pre-training

  • input
    • sr, in the transformer body: $48\times48$ size
    • dn, dr: $192\times192$ size
  • optimizer Adam: $\beta_1=0.9$, $\beta_2=0.99$, batch size 8, 500K iterations
  • learning rate: initial 2e-4, halved at the 250K, 400K, 450K, 475K-th iterations

fine-tuning

  • extra 500K, 800K and 200K iterations for sr, dn and dr respectively
  • learning rate: constant 1e-5
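
For reference, a rough PyTorch sketch of the optimization setup listed above (Adam with $\beta_1=0.9$, $\beta_2=0.99$, initial lr 2e-4 halved at the stated milestones); `model`, the data loading and the loss are placeholders, and fine-tuning would simply switch to a constant lr of 1e-5.

```python
# Sketch of the step-based pre-training schedule; not the official training script.
import torch
from torch import nn, optim

model = nn.Conv2d(3, 3, 3, padding=1)  # placeholder for EDT
optimizer = optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.99))
scheduler = optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[250_000, 400_000, 450_000, 475_000], gamma=0.5)

for step in range(500_000):
    # ... sample a batch of 8 patches, forward, compute the reconstruction loss, backward ...
    optimizer.step()
    scheduler.step()  # iteration-based (not epoch-based) decay
```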

super resolution

Quantitative comparison for classical SR on PSNR(dB)/SSIM on the Y channel from the YCbCr space. “$\ddag$” means the $\times3$ and $\times4$ models of SwinIR are pre-trained on the $\times2$ setup and the training patch size is $64\times64$ (ours is $48\times48$). “$\dag$” indicates methods with pre-training. Best and second best results are in red and blue colors.

Quantitative comparison for lightweight SR on PSNR(dB)/SSIM on the Y channel. “$\ddag$” means the $\times3$ and $\times4$ models of SwinIR are pre-trained on the $\times2$ setup and the training patch size is $64\times64$ (ours is $48\times48$ and without pre-training).

denoising

Quantitative comparison for color image denoising on PSNR(dB) on RGB channels. “$\ddag$” means the $\sigma=25/50$ models of SwinIR are pre-trained on the $\sigma=15$ level. “$\dag$” indicates methods with pre-training. “$\ast$” means our model without down-sampling and without pre-training.

deraining

Quantitative comparison for image deraining on PSNR(dB)/SSIM on the Y channel. “$\dag$” indicates methods with pre-training.

ablation study

window size

Ablation study of window size on PSNR(dB) in $\times2$ SR. Best results are in bold.
