Multilayer Perceptron

So far we have described single-layer neural networks, including linear regression and softmax regression. Deep learning, however, mainly focuses on multilayer models. In this section, we take the multilayer perceptron (MLP) as an example to introduce the concept of multilayer neural networks.
Hidden Layers
Activation Functions
The ReLU (rectified linear unit) function provides a simple nonlinear transformation. Given an element $x$, the function is defined as $\text{ReLU}(x) = \max(x, 0)$. In other words, the ReLU function keeps only positive elements and sets negative elements to zero.
%matplotlib inline
import torch
import numpy as np
import matplotlib.pylab as plt
from IPython import display

def use_svg_display():
    """Use svg format to display plots in Jupyter."""
    display.set_matplotlib_formats('svg')

def set_figsize(figsize=(3.5, 2.5)):
    use_svg_display()
    # Set the figure size
    plt.rcParams['figure.figsize'] = figsize

def xyplot(x_vals, y_vals, name):
    set_figsize(figsize=(5, 2.5))
    plt.plot(x_vals.detach().numpy(), y_vals.detach().numpy())
    plt.xlabel('x')
    plt.ylabel(name + '(x)')

x = torch.arange(-8.0, 8.0, 0.1, requires_grad=True)
y = x.relu()
xyplot(x, y, 'relu')
Above we plot the ReLU function using the relu function provided by Tensor. As we can see, this activation function is piecewise linear. Clearly, when the input is negative, the derivative of the ReLU function is 0; when the input is positive, the derivative is 1. Although the ReLU function is not differentiable at 0, we can take its derivative there to be 0. Below we plot the derivative of the ReLU function.
y.sum().backward()
xyplot(x, x.grad, 'grad of relu')
The ReLU function is one of the most popular activation functions in deep learning. Compared with the sigmoid and tanh functions, it has the following advantages:

- It is cheap to compute: no exponentials, just a comparison with 0.
- For positive inputs its gradient is the constant 1, so it does not saturate there and alleviates the vanishing gradient problem.
- It produces sparse activations, since negative inputs are zeroed out.

Shortcomings of the ReLU function:

- When x < 0 the gradient is exactly 0, so a neuron that only ever receives negative inputs stops updating entirely (the dead ReLU problem).
- Its output is not zero-centered.

To address the zero gradient of ReLU for x < 0, Leaky ReLU uses a small nonzero slope on the negative side; this mitigates the dead ReLU problem, as sketched below.
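The original listing does not include code for Leaky ReLU; here is a minimal sketch that reuses the xyplot helper defined above (the slope 0.01 is an assumption, matching PyTorch's default negative_slope):

# Leaky ReLU: max(x, 0) + negative_slope * min(x, 0)
# negative_slope=0.01 is PyTorch's default; chosen here as an assumption.
x = torch.arange(-8.0, 8.0, 0.1, requires_grad=True)
y = torch.nn.functional.leaky_relu(x, negative_slope=0.01)
xyplot(x, y, 'leaky relu')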
The sigmoid function transforms an element's value into the interval between 0 and 1: $\text{sigmoid}(x) = \frac{1}{1 + \exp(-x)}$. The sigmoid function was common in early neural networks, but it has largely been replaced by the simpler ReLU function. In the chapter on recurrent neural networks, we will exploit the fact that its range lies between 0 and 1 to control the flow of information through the network. Below we plot the sigmoid function. When the input is close to 0, the sigmoid function is approximately a linear transformation.
y = x.sigmoid()
xyplot(x, y, 'sigmoid')
By the chain rule, the derivative of the sigmoid function is $\text{sigmoid}'(x) = \text{sigmoid}(x)\left(1-\text{sigmoid}(x)\right)$. Below we plot this derivative. When the input is 0, the derivative reaches its maximum of 0.25; the farther the input moves away from 0, the closer the derivative gets to 0.
x.grad.zero_()
y.sum().backward()
xyplot(x, x.grad, 'grad of sigmoid')
Advantages of the sigmoid function:

- Its output range is 0 to 1. Because the output is bounded between 0 and 1, it normalizes the output of each neuron.
- It suits models that output a predicted probability, since probabilities also lie between 0 and 1.
- The gradient is smooth, avoiding "jumpy" output values.
- The function is differentiable, so the slope of the sigmoid curve can be found at any point.
- It gives clear-cut predictions, i.e., outputs very close to 1 or 0.
Shortcomings of the sigmoid activation function:

- It saturates: as the gradient plot above shows, the derivative approaches 0 for inputs far from 0 (and never exceeds 0.25), which leads to vanishing gradients in deep networks.
- Its output is not zero-centered, which can slow convergence.
- The exponential is relatively expensive to compute.
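The saturation effect is easy to verify numerically; the following small snippet (not in the original) evaluates the sigmoid gradient at a few points with autograd:

# Sigmoid saturation: gradients shrink rapidly as |x| grows.
x = torch.tensor([0.0, 5.0, 10.0], requires_grad=True)
torch.sigmoid(x).sum().backward()
print(x.grad)  # approximately [0.25, 6.6e-3, 4.5e-5]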
The tanh (hyperbolic tangent) function transforms an element's value into the interval between -1 and 1: $\text{tanh}(x) = \frac{1 - \exp(-2x)}{1 + \exp(-2x)}$. Next we plot the tanh function. When the input is close to 0, the tanh function is approximately a linear transformation. Although its shape closely resembles that of the sigmoid function, the tanh function is symmetric about the origin.
y = x.tanh()
xyplot(x, y, 'tanh')
By the chain rule, the derivative of the tanh function is $\text{tanh}'(x) = 1 - \text{tanh}^2(x)$. Below we plot this derivative. When the input is 0, the derivative reaches its maximum of 1; the farther the input moves away from 0, the closer the derivative gets to 0.
x.grad.zero_()
y.sum().backward()
xyplot(x, x.grad, 'grad of tanh')
You can think of the tanh function as two sigmoid functions put together; in practice, tanh is usually preferred over sigmoid. Negative inputs map to negative outputs, inputs near zero map to outputs near zero, and positive inputs map to positive outputs. The main difference between the two lies in the output range: tanh ranges over (-1, 1) and is centered at 0, which generally works better than the sigmoid function.

Shortcomings of tanh:

- Like sigmoid, it saturates: when the input is large in magnitude, the output curve is nearly flat and the gradient is small, which hinders weight updates (the vanishing gradient problem).
- The exponential is relatively expensive to compute.

A quick numerical check of the derivative formula follows.
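The identity $\text{tanh}'(x) = 1 - \text{tanh}^2(x)$ can be checked against autograd directly (an illustrative snippet, not in the original):

# Verify tanh'(x) = 1 - tanh(x)^2 using autograd.
x = torch.arange(-3.0, 3.0, 0.5, requires_grad=True)
torch.tanh(x).sum().backward()
print(torch.allclose(x.grad, 1 - torch.tanh(x) ** 2))  # True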
Multilayer Perceptron
A multilayer perceptron is a neural network composed of fully connected layers with at least one hidden layer, in which the output of each hidden layer is transformed by an activation function. The number of layers and the number of hidden units in each hidden layer are hyperparameters. Taking a single hidden layer as an example and using the notation defined earlier in this section, the multilayer perceptron computes its output as follows:
$$
\begin{aligned}
\boldsymbol{H} &= \phi(\boldsymbol{X} \boldsymbol{W}_h + \boldsymbol{b}_h),\\
\boldsymbol{O} &= \boldsymbol{H} \boldsymbol{W}_o + \boldsymbol{b}_o,
\end{aligned}
$$
where $\phi$ denotes the activation function. For classification problems, we can apply the softmax operation to the output $\boldsymbol{O}$ and use the cross-entropy loss function from softmax regression. For regression problems, we set the number of outputs to 1 and feed the output $\boldsymbol{O}$ directly to the squared loss used in linear regression.
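To make the shapes concrete, here is a small illustrative forward pass with made-up sizes (a batch of 2 examples, 4 inputs, 5 hidden units, 3 outputs; these numbers are assumptions for illustration only):

import torch

X = torch.randn(2, 4)                         # 2 examples, 4 input features
W_h, b_h = torch.randn(4, 5), torch.zeros(5)  # hidden layer with 5 units
W_o, b_o = torch.randn(5, 3), torch.zeros(3)  # output layer with 3 units
H = torch.relu(X @ W_h + b_h)                 # phi = ReLU here
O = H @ W_o + b_o
print(H.shape, O.shape)  # torch.Size([2, 5]) torch.Size([2, 3])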
多层感知机的从零开始实现
Let us implement a multilayer perceptron by hand. First we import the required packages and modules. To obtain and read the data, we continue to use the Fashion-MNIST dataset, and we will use the multilayer perceptron to classify its images.
import torch
import numpy as np
import torchvision
import sys

def load_data_fashion_mnist(batch_size, resize=None, root='~/Datasets/FashionMNIST'):
    """Download the Fashion-MNIST dataset and load it into memory."""
    trans = []
    if resize:
        trans.append(torchvision.transforms.Resize(size=resize))
    trans.append(torchvision.transforms.ToTensor())
    transform = torchvision.transforms.Compose(trans)
    mnist_train = torchvision.datasets.FashionMNIST(root=root, train=True, download=True, transform=transform)
    mnist_test = torchvision.datasets.FashionMNIST(root=root, train=False, download=True, transform=transform)
    if sys.platform.startswith('win'):
        num_workers = 0  # 0 means no extra worker processes are used to read the data
    else:
        num_workers = 4
    train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True, num_workers=num_workers)
    test_iter = torch.utils.data.DataLoader(mnist_test, batch_size=batch_size, shuffle=False, num_workers=num_workers)
    return train_iter, test_iter
batch_size = 256
train_iter, test_iter = load_data_fashion_mnist(batch_size)
print(train_iter, test_iter)
<torch.utils.data.dataloader.DataLoader object at 0x000001BBFF227040> <torch.utils.data.dataloader.DataLoader object at 0x000001BBFF20F6A0>
Defining the Model Parameters
num_inputs, num_outputs, num_hiddens = 784, 10, 256
W1 = torch.tensor(np.random.normal(0, 0.01, (num_inputs, num_hiddens)), dtype=torch.float)
b1 = torch.zeros(num_hiddens, dtype=torch.float)
W2 = torch.tensor(np.random.normal(0, 0.01, (num_hiddens, num_outputs)), dtype=torch.float)
b2 = torch.zeros(num_outputs, dtype=torch.float)
params = [W1, b1, W2, b2]
for param in params:
    param.requires_grad_(requires_grad=True)
print(W1.shape,b1.shape,W2.shape,b2.shape)
print(W1.grad,b1.grad,W2.grad,b2.grad)
print(W1.requires_grad,b1.requires_grad,W2.requires_grad,b2.requires_grad)
torch.Size([784, 256]) torch.Size([256]) torch.Size([256, 10]) torch.Size([10])
None None None None
True True True True
Defining the Activation Function
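The listing for the activation function itself is missing here; the net function below calls relu, so a definition is needed. A minimal implementation matching the earlier definition $\text{ReLU}(x) = \max(x, 0)$ is:

def relu(X):
    # Element-wise max(X, 0), following the ReLU definition above.
    return torch.max(input=X, other=torch.tensor(0.0))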
Defining the Model
As in softmax regression, we use the view function to reshape each original image into a vector of length num_inputs, and then implement the computation of the multilayer perceptron described above.
def net(X):
    X = X.view((-1, num_inputs))
    H = relu(torch.matmul(X, W1) + b1)
    return torch.matmul(H, W2) + b2
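A quick sanity check of the forward pass (an illustrative snippet, not in the original; the batch size of 2 is arbitrary):

# Feed a dummy batch through the network and check the output shape.
X = torch.randn(2, 1, 28, 28)
print(net(X).shape)  # torch.Size([2, 10])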
Defining the loss function: for better numerical stability, we directly use PyTorch's function that combines the softmax operation and the cross-entropy loss computation.
loss = torch.nn.CrossEntropyLoss()

# For reference, a manual cross-entropy implementation (not used below);
# note that it expects y_hat to contain probabilities, not logits:
def cross_entropy(y_hat, y):
    return - torch.log(y_hat.gather(1, y.view(-1, 1)))
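Why the fused version is more numerically stable can be seen with an extreme logit (an illustrative example, not in the original): the manual route underflows the true-class probability to 0 and produces inf, while CrossEntropyLoss works in log-space.

logits = torch.tensor([[0.0, 1000.0]])
y = torch.tensor([0])
print(torch.nn.CrossEntropyLoss()(logits, y))      # tensor(1000.)
probs = torch.softmax(logits, dim=1)               # [[0., 1.]] after underflow
print(-torch.log(probs.gather(1, y.view(-1, 1))))  # tensor([[inf]])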
Training the Model
def train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size,
              params=None, lr=None, optimizer=None):
    for epoch in range(num_epochs):
        train_l_sum, train_acc_sum, n = 0.0, 0.0, 0
        for X, y in train_iter:
            y_hat = net(X)
            l = loss(y_hat, y).sum()
            # Zero the gradients
            if optimizer is not None:
                optimizer.zero_grad()
            elif params is not None and params[0].grad is not None:
                for param in params:
                    param.grad.data.zero_()
            l.backward()
            if optimizer is None:
                sgd(params, lr, batch_size)
            else:
                optimizer.step()  # used by the concise implementation later
            train_l_sum += l.item()
            train_acc_sum += (y_hat.argmax(dim=1) == y).sum().item()
            n += y.shape[0]
        test_acc = evaluate_accuracy(test_iter, net)
        print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f'
              % (epoch + 1, train_l_sum / n, train_acc_sum / n, test_acc))

def sgd(params, lr, batch_size):
    for param in params:
        param.data -= lr * param.grad / batch_size  # note: param.data is used so the update itself is not tracked by autograd

def evaluate_accuracy(data_iter, net):
    acc_sum, n = 0.0, 0
    for X, y in data_iter:
        acc_sum += (net(X).argmax(dim=1) == y).float().sum().item()
        n += y.shape[0]
    return acc_sum / n
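Updating through param.data keeps the parameter update out of the autograd graph; an equivalent and more modern idiom (an alternative sketch, not from the original) uses torch.no_grad():

# Equivalent SGD step using torch.no_grad() instead of param.data.
def sgd_no_grad(params, lr, batch_size):
    with torch.no_grad():
        for param in params:
            param -= lr * param.grad / batch_size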
num_epochs, lr = 10, 100.0
train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, params, lr)
epoch 1, loss 0.0030, train acc 0.710, test acc 0.780
epoch 2, loss 0.0019, train acc 0.825, test acc 0.803
epoch 3, loss 0.0017, train acc 0.843, test acc 0.827
epoch 4, loss 0.0015, train acc 0.855, test acc 0.826
epoch 5, loss 0.0014, train acc 0.865, test acc 0.769
epoch 6, loss 0.0014, train acc 0.870, test acc 0.862
epoch 7, loss 0.0013, train acc 0.875, test acc 0.777
epoch 8, loss 0.0013, train acc 0.878, test acc 0.804
epoch 9, loss 0.0012, train acc 0.882, test acc 0.856
epoch 10, loss 0.0012, train acc 0.886, test acc 0.831
Prediction
from IPython import display
from matplotlib import pyplot as plt

def get_fashion_mnist_labels(labels):
    text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
                   'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
    return [text_labels[int(i)] for i in labels]

def use_svg_display():
    # Display plots as vector graphics
    display.set_matplotlib_formats('svg')

def show_fashion_mnist(images, labels):
    use_svg_display()
    # The underscore denotes a variable we ignore (do not use)
    _, figs = plt.subplots(1, len(images), figsize=(12, 12))
    for f, img, lbl in zip(figs, images, labels):
        f.imshow(img.view((28, 28)).numpy())
        f.set_title(lbl)
        f.axes.get_xaxis().set_visible(False)
        f.axes.get_yaxis().set_visible(False)
    plt.show()
print(net)
X, y = next(iter(test_iter))
true_labels = get_fashion_mnist_labels(y.numpy())
pred_labels = get_fashion_mnist_labels(net(X).argmax(dim=1).numpy())
titles = [true + '\n' + pred for true, pred in zip(true_labels, pred_labels)]
show_fashion_mnist(X[0:9], titles[0:9])
Concise Implementation of the Multilayer Perceptron

We now use PyTorch to implement the multilayer perceptron from the previous section.
import torch
from torch import nn
from torch.nn import init
import torchvision
import numpy as np
import sys
def load_data_fashion_mnist(batch_size, resize=None, root='~/Datasets/FashionMNIST'):
    """Download the Fashion-MNIST dataset and load it into memory."""
    trans = []
    if resize:
        trans.append(torchvision.transforms.Resize(size=resize))
    trans.append(torchvision.transforms.ToTensor())
    transform = torchvision.transforms.Compose(trans)
    mnist_train = torchvision.datasets.FashionMNIST(root=root, train=True, download=True, transform=transform)
    mnist_test = torchvision.datasets.FashionMNIST(root=root, train=False, download=True, transform=transform)
    if sys.platform.startswith('win'):
        num_workers = 0  # 0 means no extra worker processes are used to read the data
    else:
        num_workers = 4
    train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True, num_workers=num_workers)
    test_iter = torch.utils.data.DataLoader(mnist_test, batch_size=batch_size, shuffle=False, num_workers=num_workers)
    return train_iter, test_iter
def train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size,
              params=None, lr=None, optimizer=None):
    for epoch in range(num_epochs):
        train_l_sum, train_acc_sum, n = 0.0, 0.0, 0
        for X, y in train_iter:
            y_hat = net(X)
            l = loss(y_hat, y).sum()
            # Zero the gradients
            if optimizer is not None:
                optimizer.zero_grad()
            elif params is not None and params[0].grad is not None:
                for param in params:
                    param.grad.data.zero_()
            l.backward()
            if optimizer is None:
                sgd(params, lr, batch_size)
            else:
                optimizer.step()  # used here, since we pass an optimizer below
            train_l_sum += l.item()
            train_acc_sum += (y_hat.argmax(dim=1) == y).sum().item()
            n += y.shape[0]
        test_acc = evaluate_accuracy(test_iter, net)
        print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f'
              % (epoch + 1, train_l_sum / n, train_acc_sum / n, test_acc))
class FlattenLayer(nn.Module):
    def __init__(self):
        super(FlattenLayer, self).__init__()
    def forward(self, x):  # x shape: (batch, *, *, ...)
        return x.view(x.shape[0], -1)

def evaluate_accuracy(data_iter, net):
    acc_sum, n = 0.0, 0
    for X, y in data_iter:
        acc_sum += (net(X).argmax(dim=1) == y).float().sum().item()
        n += y.shape[0]
    return acc_sum / n
batch_size = 256
train_iter, test_iter = load_data_fashion_mnist(batch_size)  # load the data
num_inputs, num_outputs, num_hiddens = 784, 10, 256  # MLP hyperparameters

net = nn.Sequential(  # build the model
    FlattenLayer(),
    nn.Linear(num_inputs, num_hiddens),
    nn.ReLU(),
    nn.Linear(num_hiddens, num_outputs),
)

for params in net.parameters():  # initialize the parameters
    init.normal_(params, mean=0, std=0.01)

loss = torch.nn.CrossEntropyLoss()  # loss function
optimizer = torch.optim.SGD(net.parameters(), lr=0.5)  # optimizer

num_epochs = 10
train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, None, None, optimizer)
epoch 1, loss 0.0031, train acc 0.706, test acc 0.778
epoch 2, loss 0.0019, train acc 0.822, test acc 0.830
epoch 3, loss 0.0016, train acc 0.844, test acc 0.850
epoch 4, loss 0.0015, train acc 0.855, test acc 0.832
epoch 5, loss 0.0014, train acc 0.864, test acc 0.860
epoch 6, loss 0.0013, train acc 0.873, test acc 0.817
epoch 7, loss 0.0013, train acc 0.878, test acc 0.839
epoch 8, loss 0.0013, train acc 0.881, test acc 0.853
epoch 9, loss 0.0012, train acc 0.885, test acc 0.853
epoch 10, loss 0.0012, train acc 0.888, test acc 0.852
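Incidentally, recent PyTorch versions (1.2 and later) include a built-in nn.Flatten module, so the custom FlattenLayer above could be replaced with a one-line drop-in (a sketch under that version assumption):

# Equivalent model using the built-in nn.Flatten (PyTorch >= 1.2).
net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(num_inputs, num_hiddens),
    nn.ReLU(),
    nn.Linear(num_hiddens, num_outputs),
)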
Model Prediction
from IPython import display
from matplotlib import pyplot as plt

def get_fashion_mnist_labels(labels):
    text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
                   'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
    return [text_labels[int(i)] for i in labels]

def use_svg_display():
    # Display plots as vector graphics
    display.set_matplotlib_formats('svg')

def show_fashion_mnist(images, labels):
    use_svg_display()
    # The underscore denotes a variable we ignore (do not use)
    _, figs = plt.subplots(1, len(images), figsize=(12, 12))
    for f, img, lbl in zip(figs, images, labels):
        f.imshow(img.view((28, 28)).numpy())
        f.set_title(lbl)
        f.axes.get_xaxis().set_visible(False)
        f.axes.get_yaxis().set_visible(False)
    plt.show()
print(net)
X, y = next(iter(test_iter))
true_labels = get_fashion_mnist_labels(y.numpy())
pred_labels = get_fashion_mnist_labels(net(X).argmax(dim=1).numpy())
titles = [true + '\n' + pred for true, pred in zip(true_labels, pred_labels)]
show_fashion_mnist(X[0:9], titles[0:9])