Table of Contents
0. Results first (updated)
1. Background
2. SVD decomposition (nn.Linear)
3. Tucker decomposition (nn.Conv2d)
3.1 Estimating the rank of a convolution layer's weight matrix
3.2 Tucker decomposition
4. Implementation code
5. Code link
6. References
------------Original model--------------
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 16, 62, 62] 448
ReLU-2 [-1, 16, 62, 62] 0
Conv2d-3 [-1, 32, 60, 60] 4,640
ReLU-4 [-1, 32, 60, 60] 0
Conv2d-5 [-1, 64, 58, 58] 18,496
ReLU-6 [-1, 64, 58, 58] 0
AdaptiveAvgPool2d-7 [-1, 64, 5, 5] 0
Linear-8 [-1, 10] 16,010
================================================================
Total params: 39,594
Trainable params: 39,594
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.05
Forward/backward pass size (MB): 5.99
Params size (MB): 0.15
Estimated Total Size (MB): 6.19
----------------------------------------------------------------
------------Original model accuracy--------------
Test set: Average loss: 0.0135, Accuracy: 7048/10000 (70%)
-----------Compressed model--------------
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 16, 62, 62] 448
ReLU-2 [-1, 16, 62, 62] 0
Conv2d-3 [-1, 8, 62, 62] 128
Conv2d-4 [-1, 12, 60, 60] 864
Conv2d-5 [-1, 32, 60, 60] 416
ReLU-6 [-1, 32, 60, 60] 0
Conv2d-7 [-1, 18, 60, 60] 576
Conv2d-8 [-1, 24, 58, 58] 3,888
Conv2d-9 [-1, 64, 58, 58] 1,600
ReLU-10 [-1, 64, 58, 58] 0
AdaptiveAvgPool2d-11 [-1, 64, 5, 5] 0
Linear-12 [-1, 5] 8,000
Linear-13 [-1, 10] 60
================================================================
Total params: 15,980
Trainable params: 15,980
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.05
Forward/backward pass size (MB): 7.67
Params size (MB): 0.06
Estimated Total Size (MB): 7.78
----------------------------------------------------------------
-----------Compressed model accuracy--------------
Test set: Average loss: 0.0268, Accuracy: 4586/10000 (46%)
-----------Accuracy after fine-tuning--------------
Test set: Average loss: 0.0139, Accuracy: 6901/10000 (69%)
-----------Model size comparison--------------
As the results above show, with essentially no loss in final accuracy (70% before vs. 69% after fine-tuning), the saved model shrinks from 158 KB to 68 KB, about 43% of its original size.
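The file sizes are consistent with the parameter counts reported by torchsummary if we assume float32 storage (4 bytes per parameter) plus some serialization overhead; the snippet below is just that bookkeeping, not part of the training code:

# Rough size check: 4 bytes per float32 parameter; the saved .pth files are a bit
# larger because of pickle/serialization overhead.
orig_params, compressed_params = 39_594, 15_980
print(orig_params * 4 / 1024)           # ~154.7 KiB, reported as ~158 KB on disk
print(compressed_params * 4 / 1024)     # ~62.4 KiB, reported as ~68 KB on disk
print(compressed_params / orig_params)  # ~0.40, consistent with the ~43% size ratio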
Of course, for many carefully hand-designed networks this method will not necessarily help.
When we train a neural network, the layer sizes are mostly chosen by experience, so the learned weights usually contain a good deal of redundancy. By factorizing the weight tensors and keeping only their principal components, we can compress the model.
Decomposing a linear layer is straightforward. For an m×n matrix, a full SVD produces three matrices of shapes m×m, m×n, and n×n. When m != n, PyTorch returns the reduced (thin) form of the factors; for a 5×4 matrix the decomposition gives:
# the bottom row of the 5×4 diagonal matrix is all zeros, so the last column of U is dropped
U.shape = torch.Size([5, 4])
# the diagonal matrix is returned as a 1-D vector of singular values and has to be expanded back later (torch.diag)
S.shape = torch.Size([4])
V.shape = torch.Size([4, 4])
Given U, S, and V, we keep only the first l singular values and their corresponding singular vectors, i.e. the components that contribute most to the matrix product, and use them to build two new linear layers that replace the original one. In my experiments l = 1 is not workable (the accuracy cannot be recovered), while l = 5 restores the accuracy well after fine-tuning. See reference 1 for the underlying theory.
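In matrix form, the rank-l truncation and the two replacement layers look like this (the parameter count drops from m·n to l·(m+n)):

$$W \approx U_l \,\mathrm{diag}(S_l)\, V_l^{\top}, \qquad y = Wx + b \approx U_l\big(\mathrm{diag}(S_l)\, V_l^{\top} x\big) + b$$

The first factor, diag(S_l)·V_l^T, becomes a bias-free Linear layer mapping the n inputs to l features, and U_l becomes a second Linear layer mapping those l features back to the m outputs, carrying the original bias.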
import torch
from torch import nn


def compress_linear(model, l=1):
    # model.weight has shape (out_features, in_features)
    U, S, V = torch.svd(model.weight)
    # keep the l largest singular values and the corresponding singular vectors
    U1 = U[:, :l]                         # (out_features, l)
    S1 = S[:l]
    V1 = V[:, :l]                         # (in_features, l)
    # fold the singular values into the first factor: weight ≈ U1 @ V2
    V2 = torch.mm(torch.diag(S1), V1.T)   # (l, in_features)
    # nn.Linear(in_features, out_features); its weight has shape (out, in)
    new_model = nn.Sequential(nn.Linear(V2.shape[1], V2.shape[0], bias=False),
                              nn.Linear(U1.shape[1], U1.shape[0], bias=True))
    new_model[0].weight = nn.Parameter(V2)
    new_model[1].weight = nn.Parameter(U1)
    new_model[1].bias = nn.Parameter(model.bias.data)
    return new_model


if __name__ == '__main__':
    model = nn.Linear(4, 4, bias=True)
    new_model = compress_linear(model, l=1)
    input = torch.randn(4, 4)
    with torch.no_grad():
        output1 = model(input)
        output2 = new_model(input)
    print(f"output1 = {output1}")
    print(f"output2 = {output2}")
Before decomposition:
old_module = Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
After decomposition:
new_module = Sequential(
(0): Conv2d(3, 2, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): Conv2d(2, 30, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2), bias=False)
(2): Conv2d(30, 64, kernel_size=(1, 1), stride=(1, 1))
)
The decomposed structure is a bottleneck: a 1×1 convolution first reduces the number of channels, an 11×11 convolution then extracts spatial features on the reduced channels, and a final 1×1 convolution restores the channel count. See reference 1 for the derivation; I won't repeat it here. A parameter-count check against the toy model's summary is given just below.
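As a check, take the toy model's second convolution, Conv2d(16, 32, 3), which originally has 32×16×3×3 + 32 = 4,640 parameters. Reading the estimated ranks off the compressed summary (8 for the input mode, 12 for the output mode), the three replacement layers have 16×8 = 128, 8×12×3×3 = 864, and 12×32 + 32 = 416 parameters — exactly the Conv2d-3/4/5 rows in the compressed summary, 1,408 parameters in total instead of 4,640.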
One caveat: to get the decomposition to run I had to modify the local numpy source, because a torch.Tensor ends up being passed where numpy expects an array. In numeric.py, add a = np.array(a) at the top of moveaxis (a is a torch.Tensor on the final call):
def moveaxis(a, source, destination):
    a = np.array(a)
In fromnumeric.py, comment out the block below — the same kind of type mismatch:
# if type(obj) is not mu.ndarray:
#     try:
#         reduction = getattr(obj, method)
#     except AttributeError:
#         pass
#     else:
#         # This branch is needed for reductions like any which don't
#         # support a dtype.
#         if dtype is not None:
#             return reduction(axis=axis, dtype=dtype, out=out, **passkwargs)
#         else:
#             return reduction(axis=axis, out=out, **passkwargs)
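Patching an installed copy of numpy is fragile. Depending on the tensorly version, it may be possible to avoid both patches by handing partial_tucker a plain numpy array instead of a torch tensor (and by using the dense tensorly.decomposition.partial_tucker rather than the sparse contrib variant). I have not verified this against the exact package versions used here, so treat it only as a sketch:

# Sketch: keep the data in numpy until the factors come back, so numpy's
# moveaxis / _wrapreduction never see a torch.Tensor. `layer` and `ranks` are the
# same variables used in tucker_decomposition_conv_layer below.
from tensorly.decomposition import partial_tucker

weights = layer.weight.data.cpu().numpy()
core, [last, first] = partial_tucker(weights, modes=[0, 1], rank=ranks, init='svd')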
The code below is given file by file: VBMF.py, estimate_ranks.py, and the convolution-decomposition module (imported later as decomposition_conv).
from __future__ import division
import numpy as np
from scipy.sparse.linalg import svds
from scipy.optimize import minimize_scalar


def VBMF(Y, cacb, sigma2=None, H=None):
    """Implementation of the analytical solution to Variational Bayes Matrix Factorization.

    This function can be used to calculate the analytical solution to VBMF.
    This is based on the paper and MatLab code by Nakajima et al.:
    "Global analytic solution of fully-observed variational Bayesian matrix factorization."

    Notes
    -----
    If sigma2 is unspecified, it is estimated by minimizing the free energy.
    If H is unspecified, it is set to the smallest of the sides of the input Y.
    To estimate cacb, use the function EVBMF().

    Attributes
    ----------
    Y : numpy-array
        Input matrix that is to be factorized. Y has shape (L,M), where L<=M.
    cacb : int
        Product of the prior variances of the matrices that factorize the input.
    sigma2 : int or None (default=None)
        Variance of the noise on Y.
    H : int or None (default = None)
        Maximum rank of the factorized matrices.

    Returns
    -------
    U : numpy-array
        Left-singular vectors.
    S : numpy-array
        Diagonal matrix of singular values.
    V : numpy-array
        Right-singular vectors.
    post : dictionary
        Dictionary containing the computed posterior values.

    References
    ----------
    .. [1] Nakajima, Shinichi, et al. "Global analytic solution of fully-observed variational Bayesian matrix factorization." Journal of Machine Learning Research 14.Jan (2013): 1-37.
    .. [2] Nakajima, Shinichi, et al. "Perfect dimensionality recovery by variational Bayesian PCA." Advances in Neural Information Processing Systems. 2012.
    """
    L, M = Y.shape  # has to be L<=M

    if H is None:
        H = L

    # SVD of the input matrix, max rank of H
    U, s, V = np.linalg.svd(Y)
    U = U[:, :H]
    s = s[:H]
    V = V[:H].T

    # Calculate residual
    residual = 0.
    if H < L:
        residual = np.sum(np.sum(Y ** 2) - np.sum(s ** 2))

    # Estimation of the variance when sigma2 is unspecified
    if sigma2 is None:
        upper_bound = (np.sum(s ** 2) + residual) / (L + M)
        if L == H:
            lower_bound = s[-1] ** 2 / M
        else:
            lower_bound = residual / ((L - H) * M)
        sigma2_opt = minimize_scalar(VBsigma2, args=(L, M, cacb, s, residual),
                                     bounds=[lower_bound, upper_bound], method='Bounded')
        sigma2 = sigma2_opt.x
        print("Estimated sigma2: ", sigma2)

    # Threshold gamma term
    # Formula above (21) from [1]
    thresh_term = (L + M + sigma2 / cacb ** 2) / 2
    threshold = np.sqrt(sigma2 * (thresh_term + np.sqrt(thresh_term ** 2 - L * M)))

    # Number of singular values where gamma>threshold
    pos = np.sum(s > threshold)

    # Formula (10) from [2]
    d = np.multiply(s[:pos],
                    1 - np.multiply(sigma2 / (2 * s[:pos] ** 2),
                                    L + M + np.sqrt((M - L) ** 2 + 4 * s[:pos] ** 2 / cacb ** 2)))

    # Computation of the posterior
    post = {}
    zeta = sigma2 / (2 * L * M) * (L + M + sigma2 / cacb ** 2 - np.sqrt((L + M + sigma2 / cacb ** 2) ** 2 - 4 * L * M))
    post['ma'] = np.zeros(H)
    post['mb'] = np.zeros(H)
    post['sa2'] = cacb * (1 - L * zeta / sigma2) * np.ones(H)
    post['sb2'] = cacb * (1 - M * zeta / sigma2) * np.ones(H)

    delta = cacb / sigma2 * (s[:pos] - d - L * sigma2 / s[:pos])
    post['ma'][:pos] = np.sqrt(np.multiply(d, delta))
    post['mb'][:pos] = np.sqrt(np.divide(d, delta))
    post['sa2'][:pos] = np.divide(sigma2 * delta, s[:pos])
    post['sb2'][:pos] = np.divide(sigma2, np.multiply(delta, s[:pos]))

    post['sigma2'] = sigma2
    post['F'] = 0.5 * (L * M * np.log(2 * np.pi * sigma2) + (residual + np.sum(s ** 2)) / sigma2 - (L + M) * H
                       + np.sum(M * np.log(cacb / post['sa2']) + L * np.log(cacb / post['sb2'])
                                + (post['ma'] ** 2 + M * post['sa2']) / cacb
                                + (post['mb'] ** 2 + L * post['sb2']) / cacb
                                + (-2 * np.multiply(np.multiply(post['ma'], post['mb']), s)
                                   + np.multiply(post['ma'] ** 2 + M * post['sa2'],
                                                 post['mb'] ** 2 + L * post['sb2'])) / sigma2))

    return U[:, :pos], np.diag(d), V[:, :pos], post


def VBsigma2(sigma2, L, M, cacb, s, residual):
    H = len(s)

    thresh_term = (L + M + sigma2 / cacb ** 2) / 2
    threshold = np.sqrt(sigma2 * (thresh_term + np.sqrt(thresh_term ** 2 - L * M)))
    pos = np.sum(s > threshold)

    d = np.multiply(s[:pos],
                    1 - np.multiply(sigma2 / (2 * s[:pos] ** 2),
                                    L + M + np.sqrt((M - L) ** 2 + 4 * s[:pos] ** 2 / cacb ** 2)))

    zeta = sigma2 / (2 * L * M) * (L + M + sigma2 / cacb ** 2 - np.sqrt((L + M + sigma2 / cacb ** 2) ** 2 - 4 * L * M))
    post_ma = np.zeros(H)
    post_mb = np.zeros(H)
    post_sa2 = cacb * (1 - L * zeta / sigma2) * np.ones(H)
    post_sb2 = cacb * (1 - M * zeta / sigma2) * np.ones(H)

    delta = cacb / sigma2 * (s[:pos] - d - L * sigma2 / s[:pos])
    post_ma[:pos] = np.sqrt(np.multiply(d, delta))
    post_mb[:pos] = np.sqrt(np.divide(d, delta))
    post_sa2[:pos] = np.divide(sigma2 * delta, s[:pos])
    post_sb2[:pos] = np.divide(sigma2, np.multiply(delta, s[:pos]))

    F = 0.5 * (L * M * np.log(2 * np.pi * sigma2) + (residual + np.sum(s ** 2)) / sigma2 - (L + M) * H
               + np.sum(M * np.log(cacb / post_sa2) + L * np.log(cacb / post_sb2)
                        + (post_ma ** 2 + M * post_sa2) / cacb + (post_mb ** 2 + L * post_sb2) / cacb
                        + (-2 * np.multiply(np.multiply(post_ma, post_mb), s)
                           + np.multiply(post_ma ** 2 + M * post_sa2, post_mb ** 2 + L * post_sb2)) / sigma2))
    return F


def EVBMF(Y, sigma2=None, H=None):
    """Implementation of the analytical solution to Empirical Variational Bayes Matrix Factorization.

    This function can be used to calculate the analytical solution to empirical VBMF.
    This is based on the paper and MatLab code by Nakajima et al.:
    "Global analytic solution of fully-observed variational Bayesian matrix factorization."

    Notes
    -----
    If sigma2 is unspecified, it is estimated by minimizing the free energy.
    If H is unspecified, it is set to the smallest of the sides of the input Y.

    Attributes
    ----------
    Y : numpy-array
        Input matrix that is to be factorized. Y has shape (L,M), where L<=M.
    sigma2 : int or None (default=None)
        Variance of the noise on Y.
    H : int or None (default = None)
        Maximum rank of the factorized matrices.

    Returns
    -------
    U : numpy-array
        Left-singular vectors.
    S : numpy-array
        Diagonal matrix of singular values.
    V : numpy-array
        Right-singular vectors.
    post : dictionary
        Dictionary containing the computed posterior values.

    References
    ----------
    .. [1] Nakajima, Shinichi, et al. "Global analytic solution of fully-observed variational Bayesian matrix factorization." Journal of Machine Learning Research 14.Jan (2013): 1-37.
    .. [2] Nakajima, Shinichi, et al. "Perfect dimensionality recovery by variational Bayesian PCA." Advances in Neural Information Processing Systems. 2012.
    """
    L, M = Y.shape  # has to be L<=M

    if H is None:
        H = L

    alpha = L / M
    tauubar = 2.5129 * np.sqrt(alpha)

    # SVD of the input matrix, max rank of H
    U, s, V = np.linalg.svd(Y)
    U = U[:, :H]
    s = s[:H]
    V = V[:H].T

    # Calculate residual
    residual = 0.
    if H < L:
        residual = np.sum(np.sum(Y ** 2) - np.sum(s ** 2))

    # Estimation of the variance when sigma2 is unspecified
    if sigma2 is None:
        xubar = (1 + tauubar) * (1 + alpha / tauubar)
        eH_ub = int(np.min([np.ceil(L / (1 + alpha)) - 1, H])) - 1
        upper_bound = (np.sum(s ** 2) + residual) / (L * M)
        lower_bound = np.max([s[eH_ub + 1] ** 2 / (M * xubar), np.mean(s[eH_ub + 1:] ** 2) / M])

        scale = 1.  # /lower_bound
        s = s * np.sqrt(scale)
        residual = residual * scale
        lower_bound = lower_bound * scale
        upper_bound = upper_bound * scale

        sigma2_opt = minimize_scalar(EVBsigma2, args=(L, M, s, residual, xubar),
                                     bounds=[lower_bound, upper_bound], method='Bounded')
        sigma2 = sigma2_opt.x
        print(sigma2)

    # Threshold gamma term
    threshold = np.sqrt(M * sigma2 * (1 + tauubar) * (1 + alpha / tauubar))
    pos = np.sum(s > threshold)

    # Formula (15) from [2]
    d = np.multiply(s[:pos] / 2, 1 - np.divide((L + M) * sigma2, s[:pos] ** 2) + np.sqrt(
        (1 - np.divide((L + M) * sigma2, s[:pos] ** 2)) ** 2 - 4 * L * M * sigma2 ** 2 / s[:pos] ** 4))

    # Computation of the posterior
    post = {}
    post['ma'] = np.zeros(H)
    post['mb'] = np.zeros(H)
    post['sa2'] = np.zeros(H)
    post['sb2'] = np.zeros(H)
    post['cacb'] = np.zeros(H)

    tau = np.multiply(d, s[:pos]) / (M * sigma2)
    delta = np.multiply(np.sqrt(np.divide(M * d, L * s[:pos])), 1 + alpha / tau)

    post['ma'][:pos] = np.sqrt(np.multiply(d, delta))
    post['mb'][:pos] = np.sqrt(np.divide(d, delta))
    post['sa2'][:pos] = np.divide(sigma2 * delta, s[:pos])
    post['sb2'][:pos] = np.divide(sigma2, np.multiply(delta, s[:pos]))
    post['cacb'][:pos] = np.sqrt(np.multiply(d, s[:pos]) / (L * M))
    post['sigma2'] = sigma2
    post['F'] = 0.5 * (L * M * np.log(2 * np.pi * sigma2) + (residual + np.sum(s ** 2)) / sigma2
                       + np.sum(M * np.log(tau + 1) + L * np.log(tau / alpha + 1) - M * tau))

    return U[:, :pos], np.diag(d), V[:, :pos], post


def EVBsigma2(sigma2, L, M, s, residual, xubar):
    H = len(s)

    alpha = L / M
    x = s ** 2 / (M * sigma2)

    z1 = x[x > xubar]
    z2 = x[x <= xubar]
    tau_z1 = tau(z1, alpha)

    term1 = np.sum(z2 - np.log(z2))
    term2 = np.sum(z1 - tau_z1)
    term3 = np.sum(np.log(np.divide(tau_z1 + 1, z1)))
    term4 = alpha * np.sum(np.log(tau_z1 / alpha + 1))

    obj = term1 + term2 + term3 + term4 + residual / (M * sigma2) + (L - H) * np.log(sigma2)
    return obj


def phi0(x):
    return x - np.log(x)


def phi1(x, alpha):
    return np.log(tau(x, alpha) + 1) + alpha * np.log(tau(x, alpha) / alpha + 1) - tau(x, alpha)


def tau(x, alpha):
    return 0.5 * (x - (1 + alpha) + np.sqrt((x - (1 + alpha)) ** 2 - 4 * alpha))
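Before wiring EVBMF into the rank estimator, it is worth sanity-checking it on its own. The snippet below is my own test, not one of the original files; with a clearly low-rank input and mild noise, the number of retained singular values should land close to the true rank.

# Quick sanity check of EVBMF: build a noisy rank-5 matrix and see how many singular
# values EVBMF keeps. Assumes the file above is saved as VBMF.py and importable
# (in the repo it lives in a `decomposition` package).
import numpy as np
import VBMF

np.random.seed(0)
true_rank = 5
A = np.random.randn(50, true_rank)
B = np.random.randn(true_rank, 100)
Y = A @ B + 0.1 * np.random.randn(50, 100)   # low-rank signal plus small noise

U, S, V, post = VBMF.EVBMF(Y)                # Y has shape (L, M) with L <= M, as required
print("estimated rank:", S.shape[0])         # should come out close to true_rank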
import tensorly
from decomposition import VBMF


def estimate_ranks(layer):
    """Unfold the 2 modes of the tensor the decomposition will
    be performed on, and estimate the ranks of the matrices using VBMF.
    """
    weights = layer.weight.data.numpy()
    unfold_0 = tensorly.base.unfold(weights, 0)
    unfold_1 = tensorly.base.unfold(weights, 1)
    _, diag_0, _, _ = VBMF.EVBMF(unfold_0)
    _, diag_1, _, _ = VBMF.EVBMF(unfold_1)
    ranks = [diag_0.shape[0], diag_1.shape[1]]
    return ranks
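For intuition about what the two unfoldings are (an illustration with a random array of the same shape as the toy model's second conv weight, not code from the repo): mode 0 flattens everything except the output-channel axis and mode 1 everything except the input-channel axis, so the two EVBMF calls estimate the output-channel rank and the input-channel rank respectively.

import numpy as np
import tensorly

# Conv2d(16, 32, kernel_size=3) stores its weight as (out_channels, in_channels, kH, kW)
w = np.random.randn(32, 16, 3, 3)
print(tensorly.base.unfold(w, 0).shape)  # (32, 144): one row per output channel
print(tensorly.base.unfold(w, 1).shape)  # (16, 288): one row per input channel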
from estimate_ranks import estimate_ranks
from tensorly.contrib.sparse.decomposition import partial_tucker
import numpy as np
import torch
from torch import nn


def tucker_decomposition_conv_layer(layer):
    """Gets a conv layer and returns a nn.Sequential object with the Tucker
    decomposition. The ranks are estimated with a Python implementation of VBMF
    https://github.com/CasvandenBogaard/VBMF
    """
    ranks = estimate_ranks(layer)
    print(layer, "VBMF Estimated ranks", ranks)
    core, [last, first] = \
        partial_tucker(layer.weight.data,
                       modes=[0, 1], rank=ranks, init='svd')

    # A pointwise convolution that reduces the channels from S to R3
    first_layer = torch.nn.Conv2d(in_channels=first.shape[0],
                                  out_channels=first.shape[1], kernel_size=1,
                                  stride=1, padding=0, dilation=layer.dilation, bias=False)

    # A regular 2D convolution layer with R3 input channels
    # and R4 output channels
    core_layer = torch.nn.Conv2d(in_channels=core.shape[1],
                                 out_channels=core.shape[0], kernel_size=layer.kernel_size,
                                 stride=layer.stride, padding=layer.padding, dilation=layer.dilation,
                                 bias=False)

    # A pointwise convolution that increases the channels from R4 to T
    last_layer = torch.nn.Conv2d(in_channels=last.shape[1],
                                 out_channels=last.shape[0], kernel_size=1, stride=1,
                                 padding=0, dilation=layer.dilation, bias=True)

    last_layer.bias.data = layer.bias.data

    first = torch.tensor(np.array(first))
    first_layer.weight.data = \
        torch.transpose(input=first, dim0=1, dim1=0).unsqueeze(-1).unsqueeze(-1)
    last = torch.tensor(np.array(last))
    last_layer.weight.data = last.unsqueeze(-1).unsqueeze(-1)
    core = torch.tensor(np.array(core))
    core_layer.weight.data = core

    new_layers = [first_layer, core_layer, last_layer]
    layer = nn.Sequential(*new_layers)
    return layer


if __name__ == '__main__':
    model = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=4, stride=1)
    model.weight = torch.nn.parameter.Parameter(torch.tensor([[1., 2., 3., 1.],
                                                              [2., 3., 4., 9.],
                                                              [2., 4., 3., 1.],
                                                              [9., 2., 3., 8.]]))
    print(model.weight)
    print(tucker_decomposition_conv_layer(model))
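A check I find useful (my own addition, not in the repo): compare the output of an original convolution with the output of its decomposed replacement on random input. Because the ranks are truncated, the two should be close but not identical. The sketch below reuses AlexNet's first conv layer, the same layer shown decomposed above; a freshly initialized random conv can give degenerate rank estimates, so a trained layer is the safer test case.

from torchvision import models

conv = models.alexnet(pretrained=True).features[0]   # Conv2d(3, 64, 11, stride=4, padding=2)
decomposed = tucker_decomposition_conv_layer(conv)

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    diff = (conv(x) - decomposed(x)).abs().mean()
print(diff.item())   # small but non-zero, since the truncated ranks discard some information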
The driver that walks a whole model and replaces its layers is given below. It is admittedly rather convoluted, so feel free to streamline it for your own use (a more compact sketch follows the listing).
from torch import nn
from decomposition_conv import tucker_decomposition_conv_layer
from torchvision import models
import torch
from torchsummary import summary
from decomposition_linear import compress_linear


def get_name(model):
    # name of the last top-level child; used to detect when the traversal is done
    for name, module in model.named_children():
        pname = name
    return pname


def decomposition_conv2D(model):
    for name, module in model.named_children():
        pname = get_name(model)
        if name == pname:
            try:
                for i in range(len(list(module))):
                    if isinstance(module[i], nn.Conv2d):
                        try:
                            print(f"old_module = {module[i]}")
                            module[i] = tucker_decomposition_conv_layer(module[i])
                            print(f"new_module = {module[i]}")
                            print("this conv_layer succeeded")
                            return decomposition_conv2D(model)
                        except:
                            print("this conv_layer failed")
                            continue
                return model
            except:
                return model
        else:
            try:
                for i in range(len(list(module))):
                    if isinstance(module[i], nn.Conv2d):
                        try:
                            print(f"old_module = {module[i]}")
                            module[i] = tucker_decomposition_conv_layer(module[i])
                            print("this conv_layer succeeded")
                            print(f"new_module = {module[i]}")
                            return decomposition_conv2D(model)
                        except:
                            print("this conv_layer failed")
                            continue
            except:
                continue


def decomposition_linear(model, l):
    for name, module in model.named_children():
        pname = get_name(model)
        if name == pname:
            try:
                for i in range(len(list(module))):
                    if isinstance(module[i], nn.Linear):
                        try:
                            print(f"old_module = {module[i]}")
                            module[i] = compress_linear(module[i], l)
                            print(f"new_module = {module[i]}")
                            print("this linear_layer succeeded")
                            return decomposition_linear(model, l)
                        except:
                            print("this linear_layer failed")
                            continue
                return model
            except:
                return model
        else:
            try:
                for i in range(len(list(module))):
                    if isinstance(module[i], nn.Linear):
                        try:
                            print(f"old_module = {module[i]}")
                            module[i] = compress_linear(module[i], l)
                            print("this linear_layer succeeded")
                            print(f"new_module = {module[i]}")
                            return decomposition_linear(model, l)
                        except:
                            print("this linear_layer failed")
                            continue
            except:
                continue


def decomposition_model(model, conv2D=True, Linear=True, l=1):
    if conv2D:
        model = decomposition_conv2D(model)
    if Linear:
        model = decomposition_linear(model, l)
    return model


class simple_model(nn.Module):
    def __init__(self):
        super(simple_model, self).__init__()
        self.feature = nn.Sequential(
            nn.Conv2d(3, 16, 3),
            nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((5, 5))
        )
        self.classifier = nn.Sequential(
            nn.Linear(64 * 5 * 5, 20),
            nn.Linear(20, 10),
        )

    def forward(self, x):
        x = self.feature(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x


if __name__ == '__main__':
    # model = simple_model()
    model = models.alexnet(pretrained=True)
    print("--------Original model--------")
    summary(model, (3, 224, 224), device="cpu")
    try:
        model.load_state_dict(torch.load("./model/model.pth"))
        print("weights loaded successfully")
    except:
        print("weights load failed")
    model = decomposition_model(model, l=5)
    print("--------Compressed model--------")
    summary(model, (3, 224, 224), device="cpu")
    torch.save(model.state_dict(), "model.pth")
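As noted above, the recursive traversal is cumbersome. Below is a more compact alternative I would suggest as a starting point (a sketch only: it assumes the same tucker_decomposition_conv_layer and compress_linear helpers, and it descends into every container rather than only the top-level blocks, so check it against your own model before relying on it):

def decompose_all(model, l=5):
    # Replace every nn.Conv2d / nn.Linear child in place, recursing into containers.
    for name, child in model.named_children():
        if isinstance(child, nn.Conv2d):
            try:
                setattr(model, name, tucker_decomposition_conv_layer(child))
            except Exception:
                print(f"skipping conv layer {name}")
        elif isinstance(child, nn.Linear):
            try:
                setattr(model, name, compress_linear(child, l))
            except Exception:
                print(f"skipping linear layer {name}")
        else:
            decompose_all(child, l)   # e.g. nn.Sequential blocks
    return model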
Model-Compression/Decomposition at main · liuweixue001/Model-Compression (github.com)
1. Accelerating Deep Neural Networks with Tensor Decompositions (jacobgil.github.io)