【1】Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation -- NIPS2014

Paper: https://arxiv.org/pdf/1404.0736.pdf
Code: https://cs.nyu.edu/~denton/compress_conv.zip
Contributions.

  1. A collection of generic methods to exploit the redundancy inherent in deep CNNs.
  2. Empirical speedups of convolutional layers by a factor of 2-3x, and a 5-10x reduction in the number of parameters of fully connected layers.

Monochromatic Convolution Approximation.
Let $W \in \mathbb{R}^{C\times X \times Y \times F}$, e.g. $(C, X, Y, F) = (3, 7, 7, 96)$ for a first convolutional layer (3 input color channels, $7\times 7$ filters, 96 output features).
For every output feature $f$, consider the matrix $W_f \in \mathbb{R}^{C\times (XY)}$.
Compute the SVD $W_f = U_f S_f V_f^T$, where $U_f \in \mathbb{R}^{C\times C}$ $(3\times 3)$, $S_f \in \mathbb{R}^{C\times XY}$ $(3\times 49)$, and $V_f \in \mathbb{R}^{XY\times XY}$ $(49\times 49)$.
We can take the rank-1 approximation of $W_f$: $\tilde{W}_f = \tilde{U}_f\tilde{S}_f\tilde{V}_f^{T}$, where $\tilde{U}_f\in \mathbb{R}^{C\times 1}$, $\tilde{S}_f\in \mathbb{R}$, and $\tilde{V}_f\in \mathbb{R}^{1\times XY}$.

Further cluster the $F$ left singular vectors $\tilde{U}_f$ into $C'$ clusters, $C' < F$, using k-means.
Then $\tilde{W}_f = U_{c_f}\tilde{S}_f\tilde{V}_f^T$, where $U_{c_f}\in\mathbb{R}^{C\times 1}$ is the center of the cluster $c_f$ that filter $f$ is assigned to, so all filters in a cluster share one color transform.
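A minimal NumPy sketch of the two steps above (per-filter rank-1 SVD, then k-means over the color components); the random weights stand in for a trained first layer, and $C' = 16$ is an arbitrary choice for illustration:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Shapes follow the (3, 7, 7, 96) example; C_prime = 16 is an assumed cluster count.
C, X, Y, F, C_prime = 3, 7, 7, 96, 16
W = np.random.randn(C, X, Y, F)          # stand-in for trained first-layer weights

U_all = np.zeros((F, C))                 # per-filter color components
SV_all = np.zeros((F, X * Y))            # per-filter spatial components
for f in range(F):
    W_f = W[:, :, :, f].reshape(C, X * Y)
    U, S, Vt = np.linalg.svd(W_f, full_matrices=False)
    U_all[f] = U[:, 0]                   # rank-1 left singular vector
    SV_all[f] = S[0] * Vt[0]             # fold the singular value into the spatial part

# Cluster the F color components into C' clusters via k-means.
centers, labels = kmeans2(U_all, C_prime, minit='++', seed=0)

# Reassemble: every filter uses its cluster's color transform U_{c_f}.
W_approx = np.stack(
    [np.outer(centers[labels[f]], SV_all[f]).reshape(C, X, Y) for f in range(F)],
    axis=-1)
print('relative error:', np.linalg.norm(W - W_approx) / np.linalg.norm(W))
```

Folding the singular value into the spatial term keeps each cluster center a pure color transform, which is what allows all filters in a cluster to share it.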

Biclustering Approximation.
Let $W\in \mathbb{R}^{C\times X \times Y \times F}$, and consider the two matricizations $W_C\in\mathbb{R}^{C\times (XYF)}$ and $W_F\in\mathbb{R}^{(CXY)\times F}$.
Cluster the rows of $W_C$ into $G$ clusters.
Cluster the columns of $W_F$ into $H$ clusters.
This yields $G\times H$ sub-tensors $W(C_i, :, :, F_j)$, each $W_S\in\mathbb{R}^{\frac{C}{G}\times(XY)\times\frac{F}{H}}$.
Each sub-tensor contains similar elements, and thus is easier to fit with a low-rank approximation.
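A minimal NumPy sketch of the biclustering step. The layer shape and the choices $G=4$, $H=8$ are assumptions, and since k-means clusters are only approximately balanced, the sub-tensors are roughly of shape $\frac{C}{G}\times X\times Y\times \frac{F}{H}$:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

C, X, Y, F, G, H = 96, 5, 5, 256, 4, 8   # assumed layer shape and cluster counts
W = np.random.randn(C, X, Y, F)

# Cluster rows of W_C (one row per input channel) into G clusters.
W_C = W.reshape(C, X * Y * F)
_, row_labels = kmeans2(W_C, G, minit='++', seed=0)

# Cluster columns of W_F (one column per output feature) into H clusters.
W_F = W.reshape(C * X * Y, F)
_, col_labels = kmeans2(W_F.T, H, minit='++', seed=0)

# Collect the G x H sub-tensors; each is approximated separately afterwards.
sub_tensors = [[W[row_labels == i][:, :, :, col_labels == j]
                for j in range(H)] for i in range(G)]
print(sub_tensors[0][0].shape)           # roughly (C/G, X, Y, F/H)
```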

  1. Outer product decomposition (rank-$K$)
    Starting from $W^0 = W_S$, repeatedly subtract the best rank-1 fit:
    $W^{k+1}\leftarrow W^k-\alpha \otimes \beta\otimes \gamma$,
    where $\alpha\in \mathbb{R}^C, \beta\in\mathbb{R}^{XY}, \gamma\in\mathbb{R}^{F}$ minimize $\|W^k-\alpha \otimes \beta\otimes \gamma\|$. After $K$ steps, the accumulated rank-1 terms give a rank-$K$ approximation of $W_S$ (see the first sketch after this list).
  2. SVD decomposition
    For $W\in\mathbb{R}^{m\times nk}$, take a truncated SVD, $\tilde{W}\approx\tilde{U}\tilde{S}\tilde{V}^T$.
    $W$ can be compressed even further by applying SVD to $\tilde{V}$.
    Use $K_1$ and $K_2$ to denote the rank used in the first and second SVD (see the second sketch after this list).
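Below is a NumPy sketch of the greedy rank-$K$ outer product decomposition. The paper does not spell out the rank-1 solver, so alternating least squares is used here as one common choice; the sub-tensor shape and $K$ are illustrative:

```python
import numpy as np

def greedy_rank_K(W_S, K, iters=50):
    # Greedily fit and subtract K rank-1 terms alpha (x) beta (x) gamma,
    # each fit to the current residual by alternating least squares
    # (an assumed solver; the paper leaves this choice unspecified).
    C, XY, F = W_S.shape
    residual = W_S.copy()
    terms = []
    rng = np.random.default_rng(0)
    for _ in range(K):
        alpha, beta, gamma = rng.normal(size=C), rng.normal(size=XY), rng.normal(size=F)
        for _ in range(iters):
            alpha = np.einsum('cxf,x,f->c', residual, beta, gamma) / ((beta @ beta) * (gamma @ gamma))
            beta  = np.einsum('cxf,c,f->x', residual, alpha, gamma) / ((alpha @ alpha) * (gamma @ gamma))
            gamma = np.einsum('cxf,c,x->f', residual, alpha, beta) / ((alpha @ alpha) * (beta @ beta))
        residual -= np.einsum('c,x,f->cxf', alpha, beta, gamma)  # W^{k+1} <- W^k - a (x) b (x) g
        terms.append((alpha, beta, gamma))
    return terms, residual

W_S = np.random.randn(24, 25, 32)        # stand-in sub-tensor of shape (C/G, XY, F/H)
terms, residual = greedy_rank_K(W_S, K=8)
print('relative residual:', np.linalg.norm(residual) / np.linalg.norm(W_S))
```

And a sketch of the two-stage SVD compression. The particular reshaping of $\tilde{V}$ before the second SVD is an assumption for illustration; the shapes and ranks $K_1$, $K_2$ are arbitrary:

```python
import numpy as np

m, n, k, K1, K2 = 24, 25, 32, 12, 6      # assumed shapes and ranks
W = np.random.randn(m, n * k)            # sub-tensor flattened to an m x nk matrix

# First SVD: rank-K1 truncation, W ~ U1 diag(S1) V1t.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
U1, S1, V1t = U[:, :K1], S[:K1], Vt[:K1]

# Second SVD: reshape V~ (here to K1*n x k) and truncate to rank K2.
V_mat = V1t.reshape(K1 * n, k)
U2, S2, V2t = np.linalg.svd(V_mat, full_matrices=False)
V_approx = (U2[:, :K2] * S2[:K2]) @ V2t[:K2]

W_approx = (U1 * S1) @ V_approx.reshape(K1, n * k)
print('relative error:', np.linalg.norm(W - W_approx) / np.linalg.norm(W))
```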

Settings.
[Figure: experimental settings table from the paper.]
