Pruning Series 4: A Bayesian Optimization Framework for Neural Network Compression

This paper applies Bayesian optimization (BO) to model compression, proposing an optimization framework into which the concrete compression method can be plugged flexibly. The hyperparameter $\theta$ determines how small the final network is: for example, with pruning it is a threshold, and with SVD it is a rank.

This post focuses on how the compression hyperparameter $\theta$ is chosen. Following the BO methodology, we need to specify an objective function and an acquisition function.
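As a minimal sketch of the outer loop (assuming scikit-optimize's `gp_minimize` for illustration; the paper's own BO implementation may differ), the search over $\theta$ is a black-box minimization of the compression objective defined in the next section:

```python
from skopt import gp_minimize

def compress_and_evaluate(params):
    """Hypothetical stand-in: compress the network with hyperparameter
    theta, then return J_L(theta) = kappa * L + R (defined below).
    A smooth toy function is used here so the sketch runs as-is."""
    theta = params[0]
    loss_term = (theta - 0.7) ** 2  # plays the role of kappa * L
    size_term = 1.0 - theta         # plays the role of R
    return loss_term + size_term

# The GP surrogate and the acquisition function are handled inside gp_minimize.
result = gp_minimize(
    compress_and_evaluate,
    dimensions=[(0.0, 1.0)],  # search interval for theta
    n_calls=20,               # budget: each call is one compression run
    random_state=0,
)
print("best theta:", result.x[0], "objective:", result.fun)
```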

Objective function

The objective function needs to account for two things: 1. the quality of the compressed network, with $\mathcal{Q}(\tilde{f}_\theta)$ denoting the model's performance and $\mathcal{L}(\tilde{f}_\theta, f^*)$ its fidelity to the original network $f^*$; 2. the size of the obtained network, with $R(\tilde{f}_\theta, f^*)$ denoting the compression ratio. The optimization problem can then be written as:
$$\arg\max_\theta \underbrace{\left(\gamma\,\mathcal{Q}(\tilde{f}_\theta) + R(\tilde{f}_\theta, f^*)\right)^{-1}}_{J_Q(\theta)} \quad\text{or}\quad \arg\min_\theta \underbrace{\left(\kappa\,\mathcal{L}(\tilde{f}_\theta, f^*) + R(\tilde{f}_\theta, f^*)\right)}_{J_L(\theta)}$$
Here the knowledge distillation objective is used as the fidelity term:
$$\mathcal{L}(\tilde{f}_\theta, f^*) := \mathbb{E}_{x \sim P}\left[\|\tilde{f}_\theta(x) - f^*(x)\|_2^2\right] = \|f^* - \tilde{f}_\theta\|_2^2$$
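A minimal sketch of how $J_L(\theta)$ could be evaluated in PyTorch; the `compress` callable, the value of $\kappa$, the use of a held-out unlabeled loader, and taking $R$ as the parameter-count ratio are illustrative assumptions, not the paper's exact protocol:

```python
import copy
import torch

def function_norm(f_star, f_theta, loader, device="cpu"):
    """Monte Carlo estimate of E_{x~P} ||f_theta(x) - f*(x)||_2^2,
    i.e. the knowledge-distillation fidelity term L(f_theta, f*)."""
    f_star.eval()
    f_theta.eval()
    total, n = 0.0, 0
    with torch.no_grad():
        for x, _ in loader:
            x = x.to(device)
            diff = f_theta(x) - f_star(x)
            total += diff.flatten(1).pow(2).sum(dim=1).sum().item()
            n += x.size(0)
    return total / n

def compression_ratio(f_star, f_theta):
    """R(f_theta, f*), taken here as the parameter count of the
    compressed network relative to the original (smaller is better)."""
    size = lambda m: sum(p.numel() for p in m.parameters())
    return size(f_theta) / size(f_star)

def objective_JL(theta, f_star, compress, loader, kappa=1.0):
    """J_L(theta) = kappa * L(f_theta, f*) + R(f_theta, f*).
    `compress` is a hypothetical callable (model, theta) -> compressed model."""
    f_theta = compress(copy.deepcopy(f_star), theta)
    return kappa * function_norm(f_star, f_theta, loader) + \
           compression_ratio(f_star, f_theta)
```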

Acquisition function

[Figure 1: acquisition function]
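As a generic, self-contained illustration of this step (expected improvement under a GP surrogate is one standard acquisition choice; the paper's exact acquisition may differ), one BO iteration can be sketched as:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(candidates, gp, y_best):
    """EI(theta) = E[max(y_best - J(theta), 0)] for minimization."""
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# Toy history of (theta, J_L(theta)) evaluations.
thetas = np.array([[0.1], [0.4], [0.8]])
values = np.array([0.9, 0.5, 0.7])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(thetas, values)

grid = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
ei = expected_improvement(grid, gp, values.min())
next_theta = grid[np.argmax(ei)][0]  # theta to compress and evaluate next
print("next theta to try:", next_theta)
```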

Experiments

Comparison of different model selection methods on Resnet18

[Figure 2]

Knowledge distillation as a proxy for risk

A natural question is whether the knowledge distillation objective (the L2 function norm) is actually a good proxy for risk in network compression. The experiments show that using the function norm performs comparably to using the top-1 error rate.
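For completeness, here is a sketch of the alternative risk estimate being compared against; scoring on a labeled held-out loader is an assumption about the evaluation protocol:

```python
import torch

def top1_error(model, loader, device="cpu"):
    """Top-1 error rate on a held-out loader: the alternative to the
    function-norm proxy when scoring a compressed model."""
    model.eval()
    wrong, n = 0, 0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            pred = model(x).argmax(dim=1)
            wrong += (pred != y).sum().item()
            n += y.size(0)
    return wrong / n
```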

[Figure 3]

Compression of VGG-16

In this section, we demonstrate that our method finds compression parameters that compare favorably to state-of-the-art compression results reported on VGG-16 [10]. We first apply our method to compress the convolutional layers of VGG-16 using tensor decomposition, which has 13 parameters. After that, we fine-tune the compressed model for 5 epochs, using Stochastic Gradient Descent (SGD) with momentum 0.9 and learning rate 1e-4, decreased by a factor of 10 every epoch. Second, we apply another pass of our algorithm to compress the fully-connected layers of the fine-tuned model using SVD, which has 3 parameters. A single optimization takes approximately 10 minutes. Again, after the compression, we fine-tune the compressed model…
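As an illustration of the SVD step on a fully-connected layer (a sketch assuming a rank-$r$ truncated SVD replaces one `nn.Linear` with two smaller ones; the layer shape and rank are placeholders, not the paper's values):

```python
import torch
import torch.nn as nn

def svd_compress_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace W (out x in) with U_r S_r V_r^T, i.e. two linear maps:
    in -> rank (V_r^T, no bias) and rank -> out (U_r S_r, original bias)."""
    W = layer.weight.data            # shape (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]     # fold singular values into U
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = Vh[:rank, :].clone()
    second.weight.data = U_r.clone()
    if layer.bias is not None:
        second.bias.data = layer.bias.data.clone()
    return nn.Sequential(first, second)

# Fine-tuning schedule described above: SGD, momentum 0.9, lr 1e-4,
# decayed by 10x every epoch (StepLR is an assumed way to realize this).
fc = svd_compress_linear(nn.Linear(4096, 4096), rank=256)  # hypothetical rank
optimizer = torch.optim.SGD(fc.parameters(), lr=1e-4, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)
```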
[Figure 4]

Conclusion

In this work, we have developed a principled, fast, and flexible framework for optimizing neural network compression parameters…
