A pruning method based on the batch normalization layer.
Variational Convolutional Neural Network Pruning
To prune a filter in a convolutional layer, we need a criterion that measures how important that filter is. In this paper, the authors propose a new criterion called channel saliency, which comes from the batch normalization layer.
Below is the formulation of the batch normalization layer:
$$BN(x)=\frac{x-\mu_B}{\sqrt{\sigma_B^2+\epsilon}}$$
where $\mu_B$ and $\sigma_B^2$ are the mean and variance of the batch data, and $\epsilon$ is a small constant (e.g., 1e-5) that keeps the denominator from being zero.
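As a quick sanity check, the formula above can be reproduced in a few lines of PyTorch and compared against `nn.BatchNorm2d` (the layer the criterion below is read from). This is just a minimal verification sketch; note that PyTorch normalizes with the biased batch variance in training mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 4, 16, 16)            # (batch, channels, H, W)

bn = nn.BatchNorm2d(4, eps=1e-5)
bn.train()                               # use batch statistics, as in the formula
out_builtin = bn(x)

# Manual per-channel computation of gamma * BN(x) + beta
mu = x.mean(dim=(0, 2, 3), keepdim=True)                   # mu_B
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)   # sigma_B^2 (biased)
x_hat = (x - mu) / torch.sqrt(var + bn.eps)                # BN(x)
out_manual = bn.weight.view(1, -1, 1, 1) * x_hat + bn.bias.view(1, -1, 1, 1)

print(torch.allclose(out_builtin, out_manual, atol=1e-5))  # True
```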
Then the scaled output $x_{out}=\gamma BN(x)+\beta$ can be rewritten as
$$x_{out}=\gamma BN(x)+\beta=\gamma[BN(x)+\tilde{\beta}]$$
where $\tilde{\beta}=\beta/\gamma$, so $\gamma$ scales the entire output of the channel.
The parameter $\gamma$ here can be viewed as a criterion that directly measures the importance of the corresponding filter, so we give it the name channel saliency.
Below, I just use $\gamma$ to denote the channel saliency.
It is clear that we should prune the filters whose $\gamma$ in the following BN layer is small. The problem is that $\gamma$ changes drastically over training iterations, so the authors instead analyze the distribution of $\gamma$: when the distribution is concentrated around zero, the channel can be pruned safely.
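To make the criterion concrete, here is a minimal sketch that collects the channels of a BN layer whose saliency $|\gamma|$ falls below a cut-off. The fixed threshold is purely hypothetical and only for illustration; the paper reasons about the distribution of $\gamma$ rather than a hard cut-off.

```python
import torch
import torch.nn as nn

def small_gamma_channels(bn: nn.BatchNorm2d, threshold: float = 1e-2):
    """Return indices of channels whose saliency |gamma| is below `threshold`.

    `threshold` is a hypothetical cut-off for illustration only.
    """
    gamma = bn.weight.detach().abs()
    return torch.nonzero(gamma < threshold).flatten().tolist()

bn = nn.BatchNorm2d(8)
with torch.no_grad():
    bn.weight[:] = torch.tensor([0.9, 0.001, 0.4, 0.003, 1.2, 0.002, 0.7, 0.5])
print(small_gamma_channels(bn))   # -> [1, 3, 5]
```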
The general idea of VI (variational inference) is:
- Define a flexible family of distributions over the hidden variables, indexed by free parameters.
- Find the setting of the parameters that makes the chosen distribution closest to the desired posterior distribution.
- Thus the problem of finding a distribution becomes an optimization problem.
Let's start from Bayes' rule:
$$P(A_i|B)=\frac{P(B|A_i)P(A_i)}{\sum_j P(B|A_j)P(A_j)}$$
For a general problem, we want to find the posterior distribution $p(z|x)$; in fact,
$$p(z|x)=\frac{p(z)p(x|z)}{p(x)}$$
where $p(z)$ is the prior distribution and $p(x|z)$ is the likelihood function. The problem, however, is that $p(x)=\int p(z)p(x|z)\,dz$ is usually not computationally tractable. So VI uses another variational distribution $q_\theta(z)$, which depends on a parameter $\theta$, to approximate $p(z|x)$. Thus we solve the following optimization problem:
$$\min\limits_{\theta} KL(q_\theta(z)\,||\,p(z|x)) \tag{2.3.1}$$
where $KL$ denotes the Kullback–Leibler divergence (KL divergence).
The objective above is hard to optimize directly because it involves the posterior $p(z|x)$, which is exactly what we do not know. It is therefore usually transformed into the equivalent problem:
$$\max\limits_{\theta}ELBO(\theta,x)=\mathbb{E}_{q_\theta(z)}\left[\log{\frac{p(x,z)}{q_\theta(z)}}\right] \tag{2.3.2}$$
ELBO is short for Evidence Lower Bound. In many cases, $p(x,z)$ is our machine learning model.
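For completeness, the standard identity that connects (2.3.1) and (2.3.2) is
$$KL(q_\theta(z)\,||\,p(z|x))=\mathbb{E}_{q_\theta(z)}\left[\log\frac{q_\theta(z)}{p(z|x)}\right]=\log p(x)-\mathbb{E}_{q_\theta(z)}\left[\log\frac{p(x,z)}{q_\theta(z)}\right]=\log p(x)-ELBO(\theta,x)$$
Since $\log p(x)$ does not depend on $\theta$, minimizing the KL divergence in (2.3.1) is equivalent to maximizing the ELBO in (2.3.2).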
There exist many methods for optimizing Eq. (2.3.2), including the mean-field assumption, stochastic variational inference (SVI), black-box variational inference (BBVI), the reparameterization trick, etc.
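As a minimal sketch of one of these tools, the snippet below maximizes the ELBO of a toy conjugate-Gaussian model (prior $z\sim N(0,1)$, likelihood $x_i|z\sim N(z,1)$, variational family $q_\theta(z)=N(\mu,\sigma^2)$) with the reparameterization trick. It only illustrates ELBO maximization in general and is not the paper's setup.

```python
import torch

# Toy model: prior z ~ N(0, 1); likelihood x_i | z ~ N(z, 1).
# Variational family q_theta(z) = N(mu, sigma^2) with theta = (mu, log_sigma).
torch.manual_seed(0)
x = torch.randn(100) + 2.0                       # data centered around z = 2

mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

def log_normal(v, mean, std):
    # log density of N(mean, std^2) evaluated at v
    return -0.5 * ((v - mean) / std) ** 2 - torch.log(std) - 0.5 * torch.log(torch.tensor(2 * torch.pi))

for step in range(2000):
    opt.zero_grad()
    eps = torch.randn(16, 1)                     # Monte Carlo noise samples
    z = mu + torch.exp(log_sigma) * eps          # reparameterization: z = mu + sigma * eps
    log_prior = log_normal(z, torch.tensor(0.0), torch.tensor(1.0))
    log_lik = log_normal(x, z, torch.tensor(1.0)).sum(dim=1, keepdim=True)
    log_q = log_normal(z, mu, torch.exp(log_sigma))
    elbo = (log_prior + log_lik - log_q).mean()  # sampled estimate of Eq. (2.3.2)
    (-elbo).backward()
    opt.step()

print(mu.item(), torch.exp(log_sigma).item())    # close to the true posterior N(~1.98, ~0.1^2)
```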
In this paper, the concrete situation is
$$p(\gamma|\mathcal{D})=\frac{p(\gamma)p(\mathcal{D}|\gamma)}{p(\mathcal{D})}$$
where $\mathcal{D}$ is the dataset. However, $p(\mathcal{D})=\int p(\gamma)p(\mathcal{D}|\gamma)\,d\gamma=\int p(\mathcal{D},\gamma)\,d\gamma$ is computationally intractable.
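To make the general idea more tangible, here is a rough sketch of what a "variational" BN layer could look like: each channel's $\gamma$ is modeled as a Gaussian with learnable mean and standard deviation, sampled via the reparameterization trick, and a channel counts as prunable when that distribution concentrates around zero. The class name, the parameterization, and the tolerances are my own assumptions for illustration; this does not reproduce the paper's exact prior, posterior, or ELBO terms.

```python
import torch
import torch.nn as nn

class VariationalBN2d(nn.Module):
    """Sketch only: a BN layer whose gamma is a per-channel distribution N(mu, sigma^2),
    not the paper's exact formulation."""
    def __init__(self, num_channels: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gamma_mu = nn.Parameter(torch.ones(num_channels))
        self.gamma_log_sigma = nn.Parameter(torch.full((num_channels,), -3.0))
        self.beta = nn.Parameter(torch.zeros(num_channels))

    def forward(self, x):
        mu = x.mean(dim=(0, 2, 3), keepdim=True)
        var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
        x_hat = (x - mu) / torch.sqrt(var + self.eps)             # BN(x)
        sigma = torch.exp(self.gamma_log_sigma)
        gamma = self.gamma_mu + sigma * torch.randn_like(sigma)   # reparameterized sample of gamma
        return gamma.view(1, -1, 1, 1) * x_hat + self.beta.view(1, -1, 1, 1)

    def prunable_channels(self, mu_tol: float = 1e-2, sigma_tol: float = 1e-2):
        # A channel is treated as prunable when q(gamma) concentrates around zero,
        # i.e. both |mu| and sigma are small (tolerances are hypothetical).
        sigma = torch.exp(self.gamma_log_sigma)
        mask = (self.gamma_mu.abs() < mu_tol) & (sigma < sigma_tol)
        return torch.nonzero(mask).flatten().tolist()
```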
The detailed derivation is rather tedious and I don't think it is necessary to go through it here. The authors describe these mature techniques at length, which is unnecessary; an easy-to-understand reference would have been enough.
There are results on the CIFAR10, CIFAR100 and ImageNet2012 datasets, but I only list the results on CIFAR10 and ImageNet2012 here.
The result on VGG16 does not show much improvement. The result on ResNet differs quite a lot from my own numbers in terms of FLOPs: the FLOPs reported for the original model are much smaller than my own calculation, which makes the comparison unconvincing in my eyes.
The experiments on ImageNet2012 do not achieve state-of-the-art results.
This method uses the batch normalization layer to determine which channels to prune. The experiments shown here do not reach state-of-the-art results, but the idea is worth revisiting.