paper
code
Comparison on PSNR (dB) performance of the proposed EDT and state-of-the-art methods in different low-level tasks.
The proposed encoder-decoder-based Transformer (EDT). EDT processes high-resolution (e.g., in deraining) and low-resolution (e.g., in SR, $s$ refers to the scale) inputs using different paths, modeling long-range interactions at a low resolution for efficient computation.
consists of a lightweight conv-based encoder-decoder and a transformer body
limitation of transformers: high computational cost, which makes it difficult to handle HR inputs
Structures of the convolution blocks (CB) in the encoder and decoder. The dotted boxes and lines represent additional down-sampling and up-sampling operations when processing high-resolution inputs.
encoder
note that a stack of early convs is useful for stabilizing the optimization.
transformer
achieve larger receptive fields at low computational cost ⟸ down-sampler in the encoder
Structure of the transformer block, including layer normalizations (LN), a (shifted) crossed local multi-head attention module ((S)CL-MSA) and an anti-blocking feed-forward network (Anti-FFN).
decoder
skip connection
fast convergence during training
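As a rough PyTorch-style sketch of this encoder-body-decoder layout (module names, channel widths and the block internals are my own placeholders, not the official implementation):

```python
import torch
import torch.nn as nn

class EDTSketch(nn.Module):
    """Rough sketch: lite conv encoder -> transformer body -> lite conv decoder.
    The encoder may down-sample HR inputs so that attention runs at low resolution."""
    def __init__(self, channels=64, num_blocks=6, downsample_hr=False):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)   # early convs stabilize optimization
        enc = [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True)]
        if downsample_hr:                                  # extra down-sampling for HR tasks (e.g., deraining)
            enc.append(nn.Conv2d(channels, channels, 3, stride=2, padding=1))
        self.encoder = nn.Sequential(*enc)
        # placeholder for the transformer body ((S)CL-MSA + Anti-FFN blocks)
        self.body = nn.Sequential(*[nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_blocks)])
        dec = [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True)]
        if downsample_hr:                                  # matching up-sampling in the decoder
            dec.append(nn.Upsample(scale_factor=2, mode='nearest'))
        self.decoder = nn.Sequential(*dec)
        self.tail = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, x):
        feat = self.encoder(self.head(x))
        out = self.decoder(self.body(feat) + feat)         # skip connection helps fast convergence
        return self.tail(out)
```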
Shifted Crossed Local Attention. The horizontal and vertical window sizes are set to $(h, w)$ and $(w, h)$.
given input features $X\in{\Reals}^{(H\times W)\times C}$, split into 2 parts along the channel dimension
$$X=[X_1, X_2]$$
where $X_1, X_2\in{\Reals}^{(H\times W)\times\frac{C}2}$
each half performs MSA within either horizontal or vertical windows
$$\begin{aligned} X_1'&=H\text{-}MSA(X_1) \\ X_2'&=V\text{-}MSA(X_2) \end{aligned}$$
where the window size is $(w, h)$ or $(h, w)$, with $w>h$
a projection layer finally fuses the attention results
$$CL\text{-}MSA(X)=proj([X_1', X_2'])$$
windows are shifted by $(\lfloor\frac{h}2\rfloor, \lfloor\frac{w}2\rfloor)$ or $(\lfloor\frac{w}2\rfloor, \lfloor\frac{h}2\rfloor)$
⟹ crossed windows with shifts dramatically increase the effective receptive field
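A hedged PyTorch sketch of the crossed local attention idea (channel split, horizontal/vertical window MSA, fused by a projection); the window shift, padding and relative position bias are omitted for brevity, so this is an illustration rather than the paper's code:

```python
import torch
import torch.nn as nn

class WindowMSA(nn.Module):
    """Plain MSA inside non-overlapping windows of size (wh, ww).
    Assumes H % wh == 0 and W % ww == 0 for simplicity (the real code pads/shifts)."""
    def __init__(self, dim, window, num_heads):
        super().__init__()
        self.wh, self.ww = window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                                   # x: (B, H, W, C)
        B, H, W, C = x.shape
        wh, ww = self.wh, self.ww
        win = x.view(B, H // wh, wh, W // ww, ww, C).permute(0, 1, 3, 2, 4, 5)
        win = win.reshape(-1, wh * ww, C)                   # (B * num_windows, tokens, C)
        out, _ = self.attn(win, win, win)
        out = out.view(B, H // wh, W // ww, wh, ww, C).permute(0, 1, 3, 2, 4, 5)
        return out.reshape(B, H, W, C)

class CLMSA(nn.Module):
    """Crossed local MSA: half the channels attend in horizontal (h, w) windows,
    the other half in vertical (w, h) windows; a projection fuses the two halves."""
    def __init__(self, dim, window=(6, 24), num_heads=2):
        super().__init__()
        h, w = window
        self.h_msa = WindowMSA(dim // 2, (h, w), num_heads)  # horizontal windows
        self.v_msa = WindowMSA(dim // 2, (w, h), num_heads)  # vertical windows
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                    # x: (B, H, W, C)
        x1, x2 = x.chunk(2, dim=-1)
        return self.proj(torch.cat([self.h_msa(x1), self.v_msa(x2)], dim=-1))

# usage (H and W must be divisible by both window sides in this simplified version):
# x = torch.randn(1, 48, 48, 60); out = CLMSA(60)(x)   # out: (1, 48, 48, 60)
```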
complexity of crossed window attention for an $H\times W$ input
$$\Omega((S)CL\text{-}MSA)=4HWC^2+2hwHWC$$
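As a sanity check on the formula, a tiny helper that plugs in numbers (the channel count below is illustrative, not taken from the paper's tables):

```python
def clmsa_flops(H, W, C, h, w):
    """FLOPs of (S)CL-MSA per the formula above: 4*H*W*C^2 for the QKV/output
    projections plus 2*h*w*H*W*C for attention within (h, w) local windows."""
    return 4 * H * W * C**2 + 2 * h * w * H * W * C

# e.g. a 192x192 feature map, C=180 channels (illustrative), window (6, 24)
print(clmsa_flops(192, 192, 180, 6, 24) / 1e9, "GFLOPs")
```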
Anti-FFN: $Y=Linear(Act(ABFF(Act(Linear(X)))))$
where $ABFF(\cdot)$ is implemented by a depth-wise conv with a 5x5 kernel
⟹ eliminates the possible blocking effect caused by window partition
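A hedged sketch of this Anti-FFN structure; the activation choice (GELU) and the token-to-feature-map reshaping are my assumptions:

```python
import torch.nn as nn

class AntiFFN(nn.Module):
    """Sketch of the anti-blocking FFN: Linear -> Act -> ABFF (5x5 depth-wise conv)
    -> Act -> Linear; the depth-wise conv mixes information across window borders."""
    def __init__(self, dim, expansion=2):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)
        self.act = nn.GELU()
        # ABFF: depth-wise 5x5 convolution applied on the spatial layout of tokens
        self.abff = nn.Conv2d(hidden, hidden, 5, padding=2, groups=hidden)

    def forward(self, x, H, W):                    # x: (B, H*W, C)
        B, N, _ = x.shape
        y = self.act(self.fc1(x))
        y = y.transpose(1, 2).reshape(B, -1, H, W)  # tokens -> feature map for the conv
        y = self.abff(y)
        y = y.flatten(2).transpose(1, 2)            # feature map -> tokens
        return self.fc2(self.act(y))
```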
Configurations of four variants of EDT. The parameter numbers and FLOPs are counted in denoising at $192\times192$ size.
uniformly set the block number in each stage to 6, the expansion ratio of the FFN to 2, and the window size $(h, w)$ to (6, 24)
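A hedged config sketch of these shared hyper-parameters; the per-variant channel widths below are hypothetical placeholders only to illustrate the shape of such a config, not the paper's values:

```python
# shared settings across EDT variants (from the notes above)
shared = dict(blocks_per_stage=6, ffn_expansion=2, window_size=(6, 24))

# per-variant widths are made-up placeholders, not taken from the paper
variants = {
    "EDT-T": dict(channels=60,  **shared),
    "EDT-S": dict(channels=120, **shared),
    "EDT-B": dict(channels=180, **shared),
    "EDT-L": dict(channels=240, **shared),
}
```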
pre-training dataset: 200K images from ImageNet (about 15.6% of the full set)
degradation
pre-training strategy
pre-training for unrelated tasks is used in IPT.
study the representation similarity of hidden layers in the network
given $m$ data points, the outputs of 2 layers are $X\in{\Reals}^{m\times p_1}$ and $Y\in{\Reals}^{m\times p_2}$, with $p_1$ and $p_2$ output neurons respectively
use Gram matrices $K=XX^T$ and $L=YY^T$ to compute CKA
$$CKA(K, L)=\frac{HSIC(K, L)}{\sqrt{HSIC(K, K)HSIC(L, L)}}$$
where HSIC is the Hilbert-Schmidt independence criterion
the similarity between features is usually measured through the similarity between pairs of samples as seen by each feature; the resulting objects are representational similarity matrices
common choices of similarity measure include:
a. dot-product-based similarity
(s.1)
$$\langle vec(XX^T), vec(YY^T)\rangle=tr(XX^TYY^T)={\vert Y^TX\vert}_F^2$$
where $XX^T$ describes the relationships between samples from the viewpoint of feature X, $YY^T$ describes them from the viewpoint of feature Y, and the inner product of the two is the similarity between features X and Y
b. Hilbert-Schmidt independence criterion (HSIC)
for zero-mean features X and Y, (s.1) can be rewritten as (s.2)
$$\frac1{(m-1)^2}tr(XX^TYY^T) = {\vert cov(X^T, Y^T)\vert}_F^2$$
(s.2) follows from the short derivation
$$\begin{aligned} \frac1{(m-1)^2}tr(XX^TYY^T)&=\frac1{(m-1)^2}tr((Y^TX)^T(Y^TX)) \\ &=\frac1{(m-1)^2}\sum_{i=1}^{p_2}\sum_{j=1}^{p_1}(Y^TX)_{i,j}^2 \\ &=\frac1{(m-1)^2}{\vert Y^TX\vert}_F^2 \\ &={\vert cov(X^T, Y^T)\vert}_F^2 \end{aligned}$$
the Hilbert-Schmidt independence criterion replaces the squared Frobenius norm of the covariance in (s.2) with the squared Hilbert-Schmidt norm
defining two kernels $K_{i,j}=k(x_i, x_j)$ and $L_{i,j}=l(y_i, y_j)$, the Hilbert-Schmidt independence criterion can be written as (s.3)
$$HSIC=\frac1{(m-1)^2}\langle vec(HKH), vec(HLH)\rangle=\frac1{(m-1)^2}tr(KHLH)$$
where $HKH$ and $HLH$ are centered Gram matrices, with centering matrix $H_n=I_n-\frac1n11^T$
the link between (s.2) and (s.3): with a linear kernel $k(x, y)=l(x, y)=x^Ty$, the Hilbert-Schmidt independence criterion reduces to the dot-product similarity
ref: zhihu
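A compact NumPy sketch of linear CKA via HSIC, directly following (s.3) and the CKA definition above (the toy data is only to demonstrate the call):

```python
import numpy as np

def hsic(K, L):
    """Empirical HSIC with centered Gram matrices: tr(KHLH) / (m-1)^2."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ L @ H) / (m - 1) ** 2

def linear_cka(X, Y):
    """CKA(K, L) with linear-kernel Gram matrices K = XX^T, L = YY^T."""
    K, L = X @ X.T, Y @ Y.T
    return hsic(K, L) / np.sqrt(hsic(K, K) * hsic(L, L))

# toy check: layer activations of m samples with p1 / p2 neurons
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 32))                      # m=50, p1=32
Y = X @ rng.standard_normal((32, 16))                  # p2=16, correlated with X
print(linear_cka(X, Y))                                # higher for the correlated pair
print(linear_cka(X, rng.standard_normal((50, 16))))    # lower for unrelated features
```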
Sub-figures (a)-(c) show CKA similarities between all pairs of layers in the x2 EDT-S SR model, the x2 EDT-B SR model and the level-15 EDT-B denoising model with single-task pre-training; the corresponding similarities between the models with and without pre-training are shown in (e)-(g). Sub-figure (d) shows the cross-model comparison between the EDT-B SR and EDT-B denoising models, and (h) shows the ratios of layer similarity larger than 0.6, where "s" means the similarity between the current layer in SR and any layer in denoising.
key findings
Sub-figures (a)-(d) show CKA similarities of x2 SR models, without pre-training as well as with pre-training on a single task (x2), highly related tasks (x2, x3, x4 SR) and unrelated tasks (x2, x3 SR, level-15 denoising). Sub-figures (e)-(h) show the corresponding attention head mean distances of Transformer blocks. Note that we do not plot shifted local windows in (e)-(h) so that the last blue dotted line (“—”) has no matching point. The red boxes indicate the same attention modules.
key findings
PSNR(dB) improvements of single-task, multi-related-task and multi-unrelated-task pre-training in $\times2$ SR.
multi-related-task pre-training is more effective and efficient, providing initialization for multiple tasks.
multi-task pre-training is more effective and data-efficient, since its setting enables the transformer body to see more samples in each iteration.
PSNR(dB) improvements of different data scales during single-task pre-training for EDT-L in $\times2$ SR.
incremental PSNR improvements on multiple SR benchmarks by increasing the data scale
note that the pre-training iterations are doubled (from 500K to 1M) for the 400K data scale so that the data can be fully exploited, and the longer pre-training period largely increases the training burden.
PSNR(dB) results of different pre-training (single-task) data scales in $\times2$ SR. "EDT-B$\dag$" refers to the base model with single-task pre-training and "EDT-B$\ast$" represents the base model with multi-related-task pre-training. The best results are in bold.
pre-training
fine-tuning
Quantitative comparison for classical SR on PSNR(dB)/SSIM on the Y channel from the YCbCr space. "$\ddag$" means the $\times3$ and $\times4$ models of SwinIR are pre-trained on the $\times2$ setup and the training patch size is $64\times64$ (ours is $48\times48$). "$\dag$" indicates methods with pre-training. Best and second best results are in red and blue colors.
Quantitative comparison for lightweight SR on PSNR(dB)/SSIM on the Y channel. "$\ddag$" means the $\times3$ and $\times4$ models of SwinIR are pre-trained on the $\times2$ setup and the training patch size is $64\times64$ (ours is $48\times48$, without pre-training).
Quantitative comparison for color image denoising on PSNR(dB) on RGB channels. "$\ddag$" means the $\sigma=25/50$ models of SwinIR are pre-trained on the $\sigma=15$ level. "$\dag$" indicates methods with pre-training. "$\ast$" means our model without down-sampling and without pre-training.
Quantitative comparison for image deraining on PSNR(dB)/SSIM on the Y channel. "$\dag$" indicates methods with pre-training.
window size
Ablation study of window size on PSNR(dB) in $\times2$ SR. Best results are in bold.