For a linear model $Y = XW + b$, where $X \in \mathbb{R}^{n \times d}$ ($n$ is the number of samples and $d$ is the feature dimension of each sample), $W \in \mathbb{R}^{d \times 1}$, and $Y \in \mathbb{R}^{n \times 1}$, a norm of the weight vector $W = (w_1, w_2, \dots, w_d)$ can be used to measure the model's complexity.
1-norm: $\Vert W \Vert_1 = \vert w_1\vert + \vert w_2\vert + \dots + \vert w_d\vert$
2-norm: $\Vert W \Vert_2 = \sqrt{\vert w_1\vert^2 + \vert w_2\vert^2 + \dots + \vert w_d\vert^2}$
p-norm: $\Vert W \Vert_p = \left(\vert w_1\vert^p + \vert w_2\vert^p + \dots + \vert w_d\vert^p\right)^{\frac{1}{p}}$
import torch

def my_pnorm(w, norm_size):
    # p-norm: (sum_i |w_i|^p)^(1/p)
    w_abs = abs(w)
    w_norm = w_abs ** norm_size
    ans = w_norm.sum() ** (1 / norm_size)
    return ans

fea_dim = 20
norm_size = 2
w = torch.normal(0, 1, size=(fea_dim, 1), requires_grad=True)
my_ans = my_pnorm(w, norm_size)
torch_ans = torch.norm(w, p=norm_size)  # p-norm computed by PyTorch
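The two values should agree; a quick sanity check (an added sketch, not part of the original code) could be:

    print(torch.allclose(my_ans, torch_ans))  # expected: True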
L1 regularization (Lasso regularization): the loss function uses the 1-norm as the penalty term: $loss = loss(xw+b, y) + \lambda \Vert w\Vert_1 = loss(xw+b, y) + \lambda \sum_i^d \vert w_i\vert$.
L2 regularization (Ridge regularization): the loss function uses the squared 2-norm $\Vert w\Vert_2^2$ as the penalty term (the factor of 2 in the denominator cancels conveniently when differentiating): $loss = loss(xw+b, y) + \frac{\lambda}{2} \Vert w\Vert_2^2 = loss(xw+b, y) + \frac{\lambda}{2} \sum_i^d \vert w_i\vert^2$.
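The code below implements only the L2 penalty (the l2_penalty function). For reference, an analogous L1 penalty helper, sketched here as a hypothetical addition, would simply sum the absolute values of the weights:

    def l1_penalty(w):
        # L1 penalty: sum of absolute values of the weights
        return torch.sum(torch.abs(w))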
Fit the following formula, where the feature dimension $d = 200$:
$y = 0.05 + \sum_{i=1}^{d} 0.01 x_i + noise \quad \text{where} \quad noise \sim N(0, 0.01^2)$
import torch
import random

def generate_data(w, num_examples, dim):
    # y = Xw + 0.05 + noise, noise ~ N(0, 0.01^2)
    X = torch.normal(0, 1, (num_examples, dim))
    labels = torch.matmul(X, w) + 0.05
    labels += torch.normal(0, 0.01, labels.shape)
    return X, labels
def data_iterater(X, y, batch_size, fea_dim):
    # Shuffle the samples and pack them into fixed-size batches
    # (assumes the number of samples is divisible by batch_size).
    num = len(X)
    indices = list(range(num))
    random.shuffle(indices)  # shuffle the sample order
    batch_X = torch.zeros([num // batch_size, batch_size, fea_dim])
    batch_y = torch.zeros([num // batch_size, batch_size, 1])
    for id, i in enumerate(range(0, num, batch_size)):
        batch_indices = torch.tensor(indices[i: min(i + batch_size, num)])
        batch_X[id, :, :] = X[batch_indices]
        batch_y[id, :, :] = y[batch_indices]
    return batch_X, batch_y
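The same shuffled mini-batching could also be obtained with PyTorch's built-in data utilities. A minimal sketch (not used in the code below, shown only for comparison):

    from torch.utils.data import TensorDataset, DataLoader

    dataset = TensorDataset(train_features, train_labels)
    loader = DataLoader(dataset, batch_size=50, shuffle=True)
    for X, y in loader:
        pass  # each iteration yields one shuffled mini-batch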
def init_params(fea_dim):
    # Random weights, zero bias
    w = torch.normal(0, 1, size=(fea_dim, 1), requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    return [w, b]

def l2_penalty(w):
    # L2 penalty: ||w||_2^2 / 2 (the 1/2 cancels when differentiating)
    return torch.sum(w.pow(2)) / 2

def Linear_Model(X, w, b):
    return torch.matmul(X, w) + b

def loss_sq(predict, y):
    # Squared loss per sample; summed over the batch before backward()
    return (predict - y) ** 2 / 2

def sgd(params, lr, batch_size):
    # Mini-batch SGD step; gradients are divided by batch_size because
    # the loss is summed (not averaged) over the batch.
    with torch.no_grad():
        for param in params:
            param -= lr * param.grad / batch_size
            param.grad.zero_()
def train(lambd, lr, batch_size, epochs):
    [w, b] = init_params(fea_dim)
    batch_X, batch_y = data_iterater(train_features, train_labels, batch_size, fea_dim)
    for epoch in range(epochs):
        for (X, y) in zip(batch_X, batch_y):
            predict = Linear_Model(X, w, b)
            with torch.enable_grad():
                # Data loss plus the L2 penalty scaled by lambd
                loss_batch = loss_sq(predict, y) + lambd * l2_penalty(w)
            loss_batch.sum().backward()
            sgd([w, b], lr, batch_size)
        with torch.no_grad():
            train_l = loss_sq(Linear_Model(train_features, w, b), train_labels)
            test_l = loss_sq(Linear_Model(test_features, w, b), test_labels)
            print(f'epoch {epoch + 1}, train_loss {float(train_l.mean()):f}, test_loss {float(test_l.mean()):f}')
    print('L2 norm of w:', torch.norm(w).item())
    print('first 5 dims of w: ', w[0:5])
    print('b:', b)
lambd = 0
batch_size = 50
lr = 0.03
epochs = 20
fea_dim = 200
num_examples = 1000
true_w = 0.01 * torch.ones((fea_dim,1))
train_features, train_labels = generate_data(true_w, num_examples, fea_dim)
test_features, test_labels = generate_data(true_w, num_examples, fea_dim)
train(lambd, lr, batch_size, epochs)
Results with lambd = 0:
epoch 1, train_loss 24.436758, test_loss 27.161360
epoch 2, train_loss 7.634212, test_loss 10.278726
epoch 3, train_loss 2.866694, test_loss 4.450128
epoch 4, train_loss 1.214711, test_loss 2.087337
epoch 5, train_loss 0.556567, test_loss 1.030553
epoch 6, train_loss 0.268969, test_loss 0.527635
epoch 7, train_loss 0.135160, test_loss 0.277830
epoch 8, train_loss 0.070024, test_loss 0.149674
epoch 9, train_loss 0.037195, test_loss 0.082194
epoch 10, train_loss 0.020178, test_loss 0.045883
epoch 11, train_loss 0.011149, test_loss 0.025981
epoch 12, train_loss 0.006262, test_loss 0.014898
epoch 13, train_loss 0.003571, test_loss 0.008642
epoch 14, train_loss 0.002068, test_loss 0.005068
epoch 15, train_loss 0.001217, test_loss 0.003006
epoch 16, train_loss 0.000730, test_loss 0.001805
epoch 17, train_loss 0.000448, test_loss 0.001101
epoch 18, train_loss 0.000284, test_loss 0.000685
epoch 19, train_loss 0.000188, test_loss 0.000438
epoch 20, train_loss 0.000130, test_loss 0.000291
L2 norm of w: 0.141380175948143
first 5 dims of w:  tensor([[0.0096],
        [0.0095],
        [0.0096],
        [0.0097],
        [0.0105]], grad_fn=<SliceBackward>)
b: tensor([0.0515], requires_grad=True)
Results with lambd = 3:
epoch 1, train_loss 0.462465, test_loss 0.633717
epoch 2, train_loss 0.013955, test_loss 0.019294
epoch 3, train_loss 0.007136, test_loss 0.008692
epoch 4, train_loss 0.005868, test_loss 0.007064
epoch 5, train_loss 0.005469, test_loss 0.006553
epoch 6, train_loss 0.005339, test_loss 0.006377
epoch 7, train_loss 0.005294, test_loss 0.006311
epoch 8, train_loss 0.005277, test_loss 0.006284
epoch 9, train_loss 0.005271, test_loss 0.006272
epoch 10, train_loss 0.005268, test_loss 0.006267
epoch 11, train_loss 0.005266, test_loss 0.006264
epoch 12, train_loss 0.005266, test_loss 0.006262
epoch 13, train_loss 0.005265, test_loss 0.006262
epoch 14, train_loss 0.005265, test_loss 0.006261
epoch 15, train_loss 0.005265, test_loss 0.006261
epoch 16, train_loss 0.005265, test_loss 0.006261
epoch 17, train_loss 0.005265, test_loss 0.006261
epoch 18, train_loss 0.005265, test_loss 0.006261
epoch 19, train_loss 0.005265, test_loss 0.006261
epoch 20, train_loss 0.005265, test_loss 0.006261
L2 norm of w: 0.03571242466568947
first 5 dims of w:  tensor([[0.0020],
        [0.0027],
        [0.0024],
        [0.0038],
        [0.0030]], grad_fn=<SliceBackward>)
b: tensor([0.0463], requires_grad=True)
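With lambd = 3 the L2 norm of w drops from about 0.141 to 0.036: the penalty shrinks the weights toward zero at the cost of a higher training and test loss. For reference, the same L2 penalty can be applied through PyTorch's built-in optimizer via the weight_decay argument. The following is only a sketch of that concise approach (the name net and the use of nn.Linear/nn.MSELoss are illustrative substitutes for the manual model above, not part of the original code):

    import torch
    from torch import nn

    net = nn.Linear(fea_dim, 1)   # same linear model, built with nn.Linear
    loss_fn = nn.MSELoss()
    # weight_decay adds an L2 penalty inside the optimizer update;
    # the bias gets its own parameter group so it is not penalized.
    optimizer = torch.optim.SGD([
        {'params': net.weight, 'weight_decay': 3},
        {'params': net.bias}], lr=0.03)

    for epoch in range(epochs):
        for X, y in zip(batch_X, batch_y):
            optimizer.zero_grad()
            l = loss_fn(net(X), y)
            l.backward()
            optimizer.step()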