These are my notes from the 14-day public course based on *Dive into Deep Learning* (《动手学深度学习》). I hope to stick with it and study well.
Linear regression assumes a linear relationship between the output and the inputs.
We use a linear model to generate a synthetic dataset:
$y=\omega_1x_1+\omega_2x_2+b$
Here $\omega_1, \omega_2$ are the weights and $b$ is the bias, a single scalar.
import torch
import numpy as np

# number of features
num_inputs = 2
# number of samples
num_examples = 1000
# true weights and bias used to generate the data
true_w = [2.5, -1.8]
true_b = 2.1
features = torch.randn(num_examples, num_inputs, dtype=torch.float32)
labels = true_w[0] * features[:, 0] + true_w[1] * features[:, 1] + true_b
# add Gaussian noise with standard deviation 0.01
labels += torch.tensor(np.random.normal(0, 0.01, size=labels.size()),
                       dtype=torch.float32)
Define the model
def linreg(X, w, b):
return torch.mm(X, w) + b
The loss function measures the error between the predicted value and the true value; the squared loss is commonly used:
$l^{(i)}(\omega,b)=\frac{1}{2}\left(\hat{y}^{(i)}-y^{(i)}\right)^2$
def squared_loss(y_hat, y):
return (y_hat - y.view(y_hat.size())) ** 2 / 2
Most deep learning models have no analytical solution, so an optimization algorithm is used to reduce the value of the loss function and obtain a numerical solution. A common choice is mini-batch stochastic gradient descent: choose initial values for the parameters, then iteratively update them in the direction of the negative gradient. In each iteration, a mini-batch of training samples is drawn at random, the derivative (gradient) of the average loss over these samples with respect to the model parameters is computed, and the product of this gradient and a preset positive number (the learning rate) is subtracted from the parameters.
def sgd(params, lr, batch_size):
    for param in params:
        param.data -= lr * param.grad / batch_size  # use .data to update param without gradient tracking
# lr is the learning rate (step size); param.grad is the gradient accumulated by backward()
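The training loop below draws mini-batches with a helper called data_iter, which these notes do not define. A minimal sketch of such an iterator, assuming it should yield shuffled mini-batches of (features, labels):

import random

def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)  # read the samples in random order
    for i in range(0, num_examples, batch_size):
        j = torch.LongTensor(indices[i: min(i + batch_size, num_examples)])
        yield features.index_select(0, j), labels.index_select(0, j)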
Model training
lr = 0.03
num_epochs = 5
batch_size = 10  # assumed value; not given in the notes
net = linreg
loss = squared_loss

# initialize the parameters (not shown in the original notes) and track their gradients
w = torch.tensor(np.random.normal(0, 0.01, (num_inputs, 1)),
                 dtype=torch.float32, requires_grad=True)
b = torch.zeros(1, dtype=torch.float32, requires_grad=True)

# training
for epoch in range(num_epochs):  # training repeats num_epochs times
    # in each epoch, every sample in the dataset is used once
    # X holds the features and y the labels of a mini-batch
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y).sum()
        # compute the gradient of the mini-batch loss with respect to w and b
        l.backward()
        # update the model parameters with mini-batch stochastic gradient descent
        sgd([w, b], lr, batch_size)
        # reset the parameter gradients
        w.grad.data.zero_()
        b.grad.data.zero_()
    train_l = loss(net(features, w, b), labels)
    print('epoch %d, loss %f' % (epoch + 1, train_l.mean().item()))
The learned weights come out to roughly [2.4999] and [-1.8002] and the bias to roughly 2.1004, close to the true values.
Define the model with PyTorch
import torch.nn as nn

class LinearNet(nn.Module):
    def __init__(self, n_feature):
        super(LinearNet, self).__init__()  # call the parent class constructor
        # function prototype: `torch.nn.Linear(in_features, out_features, bias=True)`
        self.linear = nn.Linear(n_feature, 1)

    def forward(self, x):
        y = self.linear(x)
        return y
net = LinearNet(num_inputs)
# ways to build a (multi-layer) network with nn.Sequential
# method one
net = nn.Sequential(
nn.Linear(num_inputs, 1)
# other layers can be added here
)
# method two
# create an empty nn.Sequential, then add named modules to it
net = nn.Sequential()
net.add_module('linear', nn.Linear(num_inputs, 1))
# net.add_module ......
# method three
from collections import OrderedDict
net = nn.Sequential(OrderedDict([
('linear', nn.Linear(num_inputs, 1))
# ......
]))
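Whichever construction is used, it is common to initialize the linear layer's parameters explicitly before training. A short sketch using torch.nn.init, assuming net is one of the nn.Sequential variants above (so net[0] is the linear layer):

from torch.nn import init

init.normal_(net[0].weight, mean=0.0, std=0.01)  # small random weights
init.constant_(net[0].bias, val=0.0)             # zero bias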
Use nn's built-in mean squared error loss directly:
loss = nn.MSELoss()
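The notes stop at the loss here, so the following is only a sketch of how the concise version could be trained end to end; the DataLoader, optimizer choice, learning rate and batch size are assumptions, not taken from the notes:

import torch.utils.data as Data
import torch.optim as optim

batch_size = 10                                   # assumed value
dataset = Data.TensorDataset(features, labels)    # reuse the generated data
data_iter = Data.DataLoader(dataset, batch_size, shuffle=True)
optimizer = optim.SGD(net.parameters(), lr=0.03)  # lr assumed to match the from-scratch version

for epoch in range(3):
    for X, y in data_iter:
        l = loss(net(X), y.view(-1, 1))  # reshape y so MSELoss compares matching shapes
        optimizer.zero_grad()
        l.backward()
        optimizer.step()
    print('epoch %d, loss %f' % (epoch + 1, l.item()))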
Softmax regression is a single-layer neural network for classification over discrete categories; its output layer is a fully connected layer.
$o_i=x\omega_{i}+b_{i}$
The softmax operator transforms the outputs into positive values that sum to 1, i.e., a valid probability distribution:
$\hat{y}_1,\hat{y}_2,\hat{y}_3=\text{softmax}(o_1,o_2,o_3)$
where $\hat{y}_j=\frac{\exp(o_j)}{\sum_{i=1}^3\exp(o_i)}$
Because exp is monotonic, the softmax operator does not change which output is largest, so the predicted category is unchanged.
def softmax(X):
    X_exp = X.exp()
    partition = X_exp.sum(dim=1, keepdim=True)
    return X_exp / partition  # broadcasting divides each row by its row sum
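A quick sanity check of softmax (the input values below are random; what matters is that each output row is positive and sums to 1):

X = torch.rand((2, 5))
X_prob = softmax(X)
print(X_prob, X_prob.sum(dim=1))  # each row sums to 1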
def net(X):
return softmax(torch.mm(X.view((-1, num_inputs)), W) + b)
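The net above uses num_inputs, W and b, which these notes never initialize. A minimal sketch, assuming the Fashion-MNIST setting used elsewhere in the course (28×28 images flattened into 784 features, 10 classes):

num_inputs = 784   # assumed: 28 * 28 flattened image
num_outputs = 10   # assumed: 10 classes

W = torch.tensor(np.random.normal(0, 0.01, (num_inputs, num_outputs)),
                 dtype=torch.float32, requires_grad=True)
b = torch.zeros(num_outputs, dtype=torch.float32, requires_grad=True)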
The cross-entropy loss is better suited to measuring the difference between two probability distributions. The cross-entropy is
$H(y^{(i)},\hat{y}^{(i)})=-\sum_{j=1}^{q}y_j^{(i)}\log\hat{y}^{(i)}_j$
The cross-entropy loss function is simply the average of the cross-entropy over all $n$ training samples: $\ell(\Theta)=\frac{1}{n}\sum_{i=1}^{n}H(y^{(i)},\hat{y}^{(i)})$
def cross_entropy(y_hat, y):
return - torch.log(y_hat.gather(1, y.view(-1, 1)))
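To see what gather does here, a tiny made-up example with two samples and three classes; gather(1, y.view(-1, 1)) picks out each sample's predicted probability for its true class:

y_hat = torch.tensor([[0.1, 0.3, 0.6],
                      [0.3, 0.2, 0.5]])  # made-up predicted probabilities
y = torch.LongTensor([0, 2])             # true class indices
y_hat.gather(1, y.view(-1, 1))           # tensor([[0.1000], [0.5000]])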
Model training
def train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size,
              params=None, lr=None, optimizer=None):
    for epoch in range(num_epochs):
        train_l_sum, train_acc_sum, n = 0.0, 0.0, 0
        for X, y in train_iter:
            y_hat = net(X)
            l = loss(y_hat, y).sum()
            # zero the gradients
            if optimizer is not None:
                optimizer.zero_grad()
            elif params is not None and params[0].grad is not None:
                for param in params:
                    param.grad.data.zero_()
            l.backward()
            if optimizer is None:
                d2l.sgd(params, lr, batch_size)
            else:
                optimizer.step()
            train_l_sum += l.item()
            train_acc_sum += (y_hat.argmax(dim=1) == y).sum().item()
            n += y.shape[0]
        test_acc = evaluate_accuracy(test_iter, net)
        print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f'
              % (epoch + 1, train_l_sum / n, train_acc_sum / n, test_acc))
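train_ch3 also calls evaluate_accuracy, which does not appear in these notes. A minimal sketch of such a helper, assuming the iterator yields (features, labels) batches and net returns one row of class scores per sample:

def evaluate_accuracy(data_iter, net):
    acc_sum, n = 0.0, 0
    for X, y in data_iter:
        # count predictions whose argmax matches the true label
        acc_sum += (net(X).argmax(dim=1) == y).float().sum().item()
        n += y.shape[0]
    return acc_sum / n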
train_ch3(net, train_iter, test_iter, cross_entropy, num_epochs, batch_size, [W, b], lr)
Suppose the multilayer perceptron has a single hidden layer whose output is $H$. The hidden layer and the output layer are both fully connected layers, with weights and biases $W_h, b_h, W_o, b_o$.
The output is computed as
$H=XW_h+b_h,\qquad O=HW_o+b_o$
Substituting the first equation into the second gives $O=XW_hW_o+b_hW_o+b_o$, which is still an affine function of $X$, so stacking the two layers is equivalent to a single-layer network.
The fix is to apply a nonlinear transformation to the hidden layer's output, so that the mapping through to the output layer is no longer affine. Such a nonlinear function is called an activation function (a code sketch of such a network follows the activation functions below).
Commonly used activation functions include:
The ReLU function
$\mathrm{ReLU}(x)=\max(x,0)$
def relu(X):
return torch.max(input=X, other=torch.tensor(0.0))
ReLU is typically used only in hidden layers. Because it is cheap to compute, it is usually the preferred choice when the network has many layers.
The sigmoid function
$\mathrm{sigmoid}(x)=\frac{1}{1+\exp(-x)}$
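With the activation functions in place, here is a sketch of a single-hidden-layer MLP net in the same from-scratch style as the softmax model; the hidden width of 256 is an assumption, and num_inputs / num_outputs are the values assumed in the softmax section (784 and 10):

num_hiddens = 256  # assumed hidden layer width

W1 = torch.tensor(np.random.normal(0, 0.01, (num_inputs, num_hiddens)),
                  dtype=torch.float32, requires_grad=True)
b1 = torch.zeros(num_hiddens, dtype=torch.float32, requires_grad=True)
W2 = torch.tensor(np.random.normal(0, 0.01, (num_hiddens, num_outputs)),
                  dtype=torch.float32, requires_grad=True)
b2 = torch.zeros(num_outputs, dtype=torch.float32, requires_grad=True)

def net(X):
    X = X.view((-1, num_inputs))
    H = relu(torch.mm(X, W1) + b1)        # hidden layer with ReLU activation
    return softmax(torch.mm(H, W2) + b2)  # output layer, so it pairs with cross_entropy above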
Model training reuses the same train_ch3 function defined in the softmax regression section above.