总的来说,Pytorch主要提供了两个主要特征:
本文中,会通过一个包含ReLu激活函数的全连接神经网络来作为示例,来解决一个由监督问题。为了简化起见,该网络仅仅包含一个隐藏层,训练数据 (X , y)来自随机产生的数据。这个有监督学习的目标是为了最小化网络输出与真实输出之间的欧式距离。
numpy作为一个科学计算套件想必大家都很熟悉。其主要特征是能够将运算扩展到矩阵形式,并可以通过并行等技术,加速矩阵的运算速度。通过利用numpy中的array,能够实现像matlab中的大规模矩阵快速计算。所不一样的是,matlab正版软件动辄上千美元,而python numpy作为开源套件大大促进了开发者的研究和交流。
Numpy本身不包含任何与深度学习有关的技术,例如前向传播、计算图、梯度更新等。但完全可以利用numpy构造一个简单的全连接神经网络:
# -*- coding: utf-8 -*-
import numpy as np
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)
# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)
learning_rate = 1e-6
for t in range(500):
# Forward pass: compute predicted y
h = x.dot(w1)
h_relu = np.maximum(h, 0)
y_pred = h_relu.dot(w2)
# Compute and print loss
loss = np.square(y_pred - y).sum()
print(t, loss)
# Backprop to compute gradients of w1 and w2 with respect to loss
grad_y_pred = 2.0 * (y_pred - y)
grad_w2 = h_relu.T.dot(grad_y_pred)
grad_h_relu = grad_y_pred.dot(w2.T)
grad_h = grad_h_relu.copy()
grad_h[h < 0] = 0
grad_w1 = x.T.dot(grad_h)
# Update weights
w1 -= learning_rate * grad_w1
w2 -= learning_rate * grad_w2
网络输出与真实输出的距离代价如下:
看起来效果似乎很不错。但是现代许多技术中需要使用GPU大量并行计算,提高速度,很遗憾,numpy不能在支持在一些GPU运算平台如CUDA上并行计算。
Pytorch中最为重要的一个技术,Tensor,便解决了该问题。
大家可以把Pytorch中的Tensor理解为可以在GPU上计算的Numpy array。相比array,tensor提供了其他的一些与深度学习息息相关的操作,如:tensor可以追踪一个计算图和梯度。
上面用numpy手写的全连接神经网络可以用tensor重写为:
# -*- coding: utf-8 -*-
import torch
#此处不同,需要在运算前,指定运算的设备, CPU or GPU?
dtype = torch.float
device = torch.device("cpu") # 在CPU上运行
# device = torch.device("cuda:0") # 在GPU上运行
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype) #定义好tensor后,需要指定设备
y = torch.randn(N, D_out, device=device, dtype=dtype)
# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)
learning_rate = 1e-6
for t in range(500):
# Forward pass: compute predicted y
h = x.mm(w1)
h_relu = h.clamp(min=0)
y_pred = h_relu.mm(w2)
# Compute and print loss
loss = (y_pred - y).pow(2).sum().item()
if t % 100 == 99:
print(t, loss)
# Backprop to compute gradients of w1 and w2 with respect to loss
grad_y_pred = 2.0 * (y_pred - y)
grad_w2 = h_relu.t().mm(grad_y_pred)
grad_h_relu = grad_y_pred.mm(w2.t())
grad_h = grad_h_relu.clone()
grad_h[h < 0] = 0
grad_w1 = x.t().mm(grad_h)
# Update weights using gradient descent
w1 -= learning_rate * grad_w1
w2 -= learning_rate * grad_w2
上面的例子中,我们手算出了神经网络反向传播的梯度,这对简单的网络当然可以,但当神经网络稍微复杂一些,手动计算梯度的复杂程度就大大增加。
幸运的是pytorch提供给了我们自动计算梯度的套件。
autograd是pytorch中的一个套件 ,当使用autograd时,神经网络的前向传播会定义一个计算图(computational graph),该图中,node是tensor,edge是函数(produce output from input);在这个graph中进行反向传播可以很容易计算梯度。
具体的,如果x是一个tensor,也是graph中的一个node,如果有:x.requires_grad=True,那么x.grad将会是另一个tensor,其中包含了求导后的梯度值
我们使用autograd的套件重写上述代码:
# -*- coding: utf-8 -*-
import torch
dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold input and outputs.
# Setting requires_grad=False indicates that we do not need to compute gradients
# with respect to these Tensors during the backward pass.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)
# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
# w1 w2是我们想更新的网络参数,反向求导是loss对w1 w2求导,所以对谁求导或者说想要更新
# 哪个tensor,那么这个tensor的requires_grad=True
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)
learning_rate = 1e-6
for t in range(500):
# Forward pass: compute predicted y using operations on Tensors; these
# are exactly the same operations we used to compute the forward pass using
# Tensors, but we do not need to keep references to intermediate values since
# we are not implementing the backward pass by hand.
# 我们可以通过autograd的机制直接反向求导,不用在保留中间值,所以上面代码前向传播的三步
# 可以直接合起来
y_pred = x.mm(w1).clamp(min=0).mm(w2)
# Compute and print loss using operations on Tensors.
# Now loss is a Tensor of shape (1,)
# loss.item() gets the scalar value held in the loss.
# 注意的是,使用autograd机制时,需要保证待求导的tensor(这里是loss)维度为1,
# 也就是其必须是标量
loss = (y_pred - y).pow(2).sum()
if t % 100 == 99:
print(t, loss.item())
# Use autograd to compute the backward pass. This call will compute the
# gradient of loss with respect to all Tensors with requires_grad=True.
# After this call w1.grad and w2.grad will be Tensors holding the gradient
# of the loss with respect to w1 and w2 respectively.
# 调用autograd机制中的backward方法,就可以自动完成反向求导
# w1.grad w2.grad两个tensor中分别存储了loss对w1和w2的梯度
loss.backward()
# Manually update weights using gradient descent. Wrap in torch.no_grad()
# because weights have requires_grad=True, but we don't need to track this
# in autograd.
# An alternative way is to operate on weight.data and weight.grad.data.
# Recall that tensor.data gives a tensor that shares the storage with
# tensor, but doesn't track history.
# You can also use torch.optim.SGD to achieve this.
with torch.no_grad():
w1 -= learning_rate * w1.grad
w2 -= learning_rate * w2.grad
# Manually zero the gradients after updating weights
w1.grad.zero_()
w2.grad.zero_()
上面可以看出,autograd机制主要提供了两个函数,forward和backward
forward从输入tensor计算得到一个输出tensor;backward接受输出tensor相对于某个标量值的梯度,然后进一步计算出输入tensor相对于该同一个标量的梯度
在pytorch中我们可以通过继承torch.autograd.Function来定制一个自己的自动微分算子
本节中,通过定制一个自动微分算子来非线性实现ReLU
# -*- coding: utf-8 -*-
import torch
class MyReLU(torch.autograd.Function):
"""
We can implement our own custom autograd Functions by subclassing
torch.autograd.Function and implementing the forward and backward passes
which operate on Tensors.
"""
@staticmethod
def forward(ctx, input):
"""
In the forward pass we receive a Tensor containing the input and return
a Tensor containing the output. ctx is a context object that can be used
to stash information for backward computation. You can cache arbitrary
objects for use in the backward pass using the ctx.save_for_backward method.
"""
# ctx可以用来存储一些与反向计算有关的信息
ctx.save_for_backward(input)
return input.clamp(min=0)
@staticmethod
def backward(ctx, grad_output):
"""
In the backward pass we receive a Tensor containing the gradient of the loss
with respect to the output, and we need to compute the gradient of the loss
with respect to the input.
"""
input, = ctx.saved_tensors
grad_input = grad_output.clone()
grad_input[input < 0] = 0
return grad_input
dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold input and outputs.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)
# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)
learning_rate = 1e-6
for t in range(500):
# To apply our Function, we use Function.apply method. We alias this as 'relu'.
relu = MyReLU.apply
# Forward pass: compute predicted y using operations; we compute
# ReLU using our custom autograd operation.
y_pred = relu(x.mm(w1)).mm(w2)
# Compute and print loss
loss = (y_pred - y).pow(2).sum()
if t % 100 == 99:
print(t, loss.item())
# Use autograd to compute the backward pass.
loss.backward()
# Update weights using gradient descent
with torch.no_grad():
w1 -= learning_rate * w1.grad
w2 -= learning_rate * w2.grad
# Manually zero the gradients after updating weights
w1.grad.zero_()
w2.grad.zero_()
虽然计算图可以定义许多复杂运算算子,自动求导,但是对于大规模神经网络,这种仅仅用计算图求解的方法还是太简陋。
和tensorflow与keras中相似,为了解决这个问题,pytorch考虑将计算图中的一些计算打包进layer里,layer里有一些待学习的参数,这些参数将会在后面基于梯度信息更新。这个layer在pytorch里通过nnmodule实现。
nn module中包含了一些必要的layer,从input计算出output;除此,还包含一些有用的loss function可以直接调用。接下来,我们将调用nn module重写上面的模型:
# -*- coding: utf-8 -*-
import torch
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
torch.nn.Linear(D_in, H),
torch.nn.ReLU(),
torch.nn.Linear(H, D_out),
)
# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function.
loss_fn = torch.nn.MSELoss(reduction='sum')
learning_rate = 1e-4
for t in range(500):
# Forward pass: compute predicted y by passing x to the model. Module objects
# override the __call__ operator so you can call them like functions. When
# doing so you pass a Tensor of input data to the Module and it produces
# a Tensor of output data.
y_pred = model(x)
# Compute and print loss. We pass Tensors containing the predicted and true
# values of y, and the loss function returns a Tensor containing the
# loss.
loss = loss_fn(y_pred, y)
if t % 100 == 99:
print(t, loss.item())
# Zero the gradients before running the backward pass.
model.zero_grad()
# Backward pass: compute gradient of the loss with respect to all the learnable
# parameters of the model. Internally, the parameters of each Module are stored
# in Tensors with requires_grad=True, so this call will compute gradients for
# all learnable parameters in the model.
# 和上面通过构建计算图训练网络不同,这里nn module初始化已经将待学习的参数的
# requires_grad=True
loss.backward()
# Update the weights using gradient descent. Each parameter is a Tensor, so
# we can access its gradients like we did before.
with torch.no_grad():
for param in model.parameters():
param -= learning_rate * param.grad
直到目前为止,我们在计算出参数的更新梯度后,仍然是手动地对参数进行梯度下降更新。这对一些简单的网络,且使用SGD等优化方法的模型仍然能起作用。但是当模型较复杂,况且考虑到当下更多使用诸如Adarad、RMSProp等更加高级的优化方法,手动进行梯度更新不太可能。
所幸,Pytorch提供了optim package,该包里包含了许多定义好的优化器(optimizer),可以在autograd自动获得更新梯度的基础上,进一步使用optim自动对参数进行更新。
本节将在上节定义的nn module基础上,使用optim进行自动更新梯度:
# -*- coding: utf-8 -*-
import torch
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
torch.nn.Linear(D_in, H),
torch.nn.ReLU(),
torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')
# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algoriths. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
# Forward pass: compute predicted y by passing x to the model.
y_pred = model(x)
# Compute and print loss.
loss = loss_fn(y_pred, y)
if t % 100 == 99:
print(t, loss.item())
# Before the backward pass, use the optimizer object to zero all of the
# gradients for the variables it will update (which are the learnable
# weights of the model). This is because by default, gradients are
# accumulated in buffers( i.e, not overwritten) whenever .backward()
# is called. Checkout docs of torch.autograd.backward for more details.
# 在反向传播前,一定要用optimizer对象中的zero_grad方法,对参数初始化;
# 因为只要backward方法被调用,梯度会在缓存中累计
optimizer.zero_grad()
# Backward pass: compute gradient of the loss with respect to model
# parameters
loss.backward()
# Calling the step function on an Optimizer makes an update to its
# parameters
optimizer.step()
这样以来,一个全连接神经网络中最重要的三个环节:前向计算(forward)、反向传播(backward)、梯度更新(update grad)分别由pytorch中的三个package实现:nn moudule、 backward()、optimizer.step()整个代码的编写变得非常轻量
许多研究中,我们想用搭建的模型可能比较复杂,不像sequential这样简单,可能是多条线各种结构。pytorch允许通过继承nn module来重新定制自己的模型。
在此,我们通过继承nn Module的方法重新自定义上述代码中的Sequential模型:
-*- coding: utf-8 -*-
import torch
class TwoLayerNet(torch.nn.Module):
def __init__(self, D_in, H, D_out):
"""
In the constructor we instantiate two nn.Linear modules and assign them as
member variables.
"""
# 初始化时,需要定义layer的形式,并将其属性化
super(TwoLayerNet, self).__init__()
self.linear1 = torch.nn.Linear(D_in, H)
self.linear2 = torch.nn.Linear(H, D_out)
def forward(self, x):
"""
In the forward function we accept a Tensor of input data and we must return
a Tensor of output data. We can use Modules defined in the constructor as
well as arbitrary operators on Tensors.
"""
h_relu = self.linear1(x).clamp(min=0)
y_pred = self.linear2(h_relu)
return y_pred
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)
# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
# Forward pass: Compute predicted y by passing x to the model
y_pred = model(x)
# Compute and print loss
loss = criterion(y_pred, y)
if t % 100 == 99:
print(t, loss.item())
# Zero gradients, perform a backward pass, and update the weights.
optimizer.zero_grad()
loss.backward()
optimizer.step()
和tensorflow、keras、caffe等框架很大的一个不同是,pytorch支持动态的计算图
一般而言,tensorflow、keras、caffe等框架在我们定义好模型(计算图结构后),就不允许我们再进行更改,或者调整了。相反,pytorch允许计算图调整。除此,为了支持动态计算图,pytroch还有权重共享等机制。
作为动态图和权重共享的示例,我们实现了一个非常奇怪的模型:一个完全连接的ReLU网络,该网络在每个前向传递中选择1到4之间的随机数,并使用那么多隐藏层,多次重复使用相同的权重计算最里面的隐藏层。
对于此模型,我们可以使用常规的Python流控制来实现循环,并且可以通过在定义前向传递时简单地多次重复使用同一模块来实现最内层之间的权重共享。
# -*- coding: utf-8 -*-
import random
import torch
class DynamicNet(torch.nn.Module):
def __init__(self, D_in, H, D_out):
"""
In the constructor we construct three nn.Linear instances that we will use
in the forward pass.
"""
super(DynamicNet, self).__init__()
self.input_linear = torch.nn.Linear(D_in, H)
self.middle_linear = torch.nn.Linear(H, H)
self.output_linear = torch.nn.Linear(H, D_out)
def forward(self, x):
"""
For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
and reuse the middle_linear Module that many times to compute hidden layer
representations.
Since each forward pass builds a dynamic computation graph, we can use normal
Python control-flow operators like loops or conditional statements when
defining the forward pass of the model.
Here we also see that it is perfectly safe to reuse the same Module many
times when defining a computational graph. This is a big improvement from Lua
Torch, where each Module could be used only once.
"""
h_relu = self.input_linear(x).clamp(min=0)
for _ in range(random.randint(0, 3)):
h_relu = self.middle_linear(h_relu).clamp(min=0)
y_pred = self.output_linear(h_relu)
return y_pred
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)
# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for t in range(500):
# Forward pass: Compute predicted y by passing x to the model
y_pred = model(x)
# Compute and print loss
loss = criterion(y_pred, y)
if t % 100 == 99:
print(t, loss.item())
# Zero gradients, perform a backward pass, and update the weights.
optimizer.zero_grad()
loss.backward()
optimizer.step()