You’ve written a lot of code in this assignment to provide a whole host of neural network functionality. Dropout, Batch Norm, and 2D convolutions are some of the workhorses of deep learning in computer vision. You’ve also worked hard to make your code efficient and vectorized.
For the last part of this assignment, though, we’re going to leave behind your beautiful codebase and instead migrate to one of two popular deep learning frameworks: in this instance, PyTorch.
PyTorch is a system for executing dynamic computational graphs over Tensor objects that behave similarly to numpy ndarrays. It comes with a powerful automatic differentiation engine that removes the need for manual back-propagation.
One of our former instructors, Justin Johnson, made an excellent tutorial for PyTorch.
You can also find the detailed API doc here. If you have other questions that are not addressed by the API docs, the PyTorch forum is a much better place to ask than StackOverflow.
This assignment has 5 parts. You will learn PyTorch on three different levels of abstraction, which will help you understand it better and prepare you for the final project.
- Barebone PyTorch: work directly with low-level Tensors and autograd.
- nn.Module: define arbitrary neural network architectures.
- nn.Sequential: define a linear feed-forward network very conveniently.

Here is a table of comparison:
API | Flexibility | Convenience |
Barebone | High | Low |
nn.Module | High | Medium |
nn.Sequential | Low | High |
You can manually switch to a GPU device on Colab by clicking Runtime -> Change runtime type and selecting GPU under Hardware Accelerator. You should do this before running the following cells to import packages, since the kernel gets restarted upon switching runtimes.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.utils.data import sampler
import torchvision.datasets as dset
import torchvision.transforms as T
import numpy as np
USE_GPU = True
dtype = torch.float32 # We will be using float throughout this tutorial.
if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
# Constant to control how frequently we print train loss.
print_every = 100
print('using device:', device)
using device: cuda
Now, let’s load the CIFAR-10 dataset. This might take a couple minutes the first time you do it, but the files should stay cached after that.
In previous parts of the assignment we had to write our own code to download the CIFAR-10 dataset, preprocess it, and iterate through it in minibatches; PyTorch provides convenient tools to automate this process for us.
NUM_TRAIN = 49000
# The torchvision.transforms package provides tools for preprocessing data
# and for performing data augmentation; here we set up a transform to
# preprocess the data by subtracting the mean RGB value and dividing by the
# standard deviation of each RGB value; we've hardcoded the mean and std.
transform = T.Compose([
    T.ToTensor(),  # convert PIL images to Tensors before normalizing
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
])
# We set up a Dataset object for each split (train / val / test); Datasets load
# training examples one at a time, so we wrap each Dataset in a DataLoader which
# iterates through the Dataset and forms minibatches. We divide the CIFAR-10
# training set into train and val sets by passing a Sampler object to the
# DataLoader telling how it should sample from the underlying Dataset.
cifar10_train = dset.CIFAR10('./cs231n/datasets', train=True, download=True,
                             transform=transform)
# dset is the alias we imported for torchvision.datasets; its dataset classes
# subclass torch.utils.data.Dataset.
loader_train = DataLoader(cifar10_train, batch_size=64,
                          sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN)))

cifar10_val = dset.CIFAR10('./cs231n/datasets', train=True, download=True,
                           transform=transform)
loader_val = DataLoader(cifar10_val, batch_size=64,
                        sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN, 50000)))

cifar10_test = dset.CIFAR10('./cs231n/datasets', train=False, download=True,
                            transform=transform)
loader_test = DataLoader(cifar10_test, batch_size=64)
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./cs231n/datasets/cifar-10-python.tar.gz
100%|██████████| 170498071/170498071 [00:10<00:00, 16169515.21it/s]
Extracting ./cs231n/datasets/cifar-10-python.tar.gz to ./cs231n/datasets
Files already downloaded and verified
Files already downloaded and verified
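To make the Dataset / DataLoader / Sampler split described in the comments above more concrete, here is a minimal sketch on a toy dataset. The names `toy` and `NUM_TOY_TRAIN` are hypothetical and only serve the illustration; the real loaders above work on CIFAR-10.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, sampler

# Toy dataset: 10 examples whose single feature equals their index.
data = torch.arange(10, dtype=torch.float32).unsqueeze(1)
labels = torch.zeros(10, dtype=torch.long)
toy = TensorDataset(data, labels)

NUM_TOY_TRAIN = 8  # hypothetical split point, playing the role of NUM_TRAIN

# Both loaders wrap the same Dataset; the Sampler decides which indices each sees.
train_loader = DataLoader(toy, batch_size=4,
                          sampler=sampler.SubsetRandomSampler(range(NUM_TOY_TRAIN)))
val_loader = DataLoader(toy, batch_size=4,
                        sampler=sampler.SubsetRandomSampler(range(NUM_TOY_TRAIN, 10)))

# Collect the indices visited by each loader (order is random, so we sort).
train_seen = sorted(int(v) for x, _ in train_loader for v in x.flatten())
val_seen = sorted(int(v) for x, _ in val_loader for v in x.flatten())
print(train_seen, val_seen)
```

The two loaders never overlap, which is how the train and val splits are carved out of the single CIFAR-10 training set.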
PyTorch ships with high-level APIs to help us define model architectures conveniently, which we will cover in Part II of this tutorial. In this section, we will start with the barebone PyTorch elements to understand the autograd engine better. After this exercise, you will come to appreciate the high-level model API more.
We will start with a simple fully-connected ReLU network with two hidden layers and no biases for CIFAR classification.
This implementation computes the forward pass using operations on PyTorch Tensors, and uses PyTorch autograd to compute gradients. It is important that you understand every line, because you will write a harder version after the example.
When we create a PyTorch Tensor with requires_grad=True, then operations involving that Tensor will not just compute values; they will also build up a computational graph in the background, allowing us to easily backpropagate through the graph to compute gradients of a downstream loss with respect to some Tensors. Concretely, if x is a Tensor with x.requires_grad == True, then after backpropagation x.grad will be another Tensor holding the gradient of the scalar loss at the end with respect to x.
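As a small illustration of the paragraph above, the sketch below uses a made-up loss (the sum of squares of x, chosen only because its gradient 2 * x is easy to check by hand).

```python
import torch

# A leaf Tensor that autograd should track.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Operations on x are recorded in the graph; the result is a scalar loss.
loss = (x ** 2).sum()

# Backpropagate: fills x.grad with d(loss)/dx, which is 2 * x here.
loss.backward()
print(x.grad)
```

Note that x.grad has the same shape as x.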
A PyTorch Tensor is conceptually similar to a numpy array: it is an n-dimensional grid of numbers, and like numpy PyTorch provides many functions to efficiently operate on Tensors. As a simple example, we provide a flatten function below which reshapes image data for use in a fully-connected neural network.
Recall that image data is typically stored in a Tensor of shape N x C x H x W, where:
- N is the number of datapoints
- C is the number of channels
- H is the height of the intermediate feature map in pixels
- W is the width of the intermediate feature map in pixels
This is the right way to represent the data when we are doing something like a 2D convolution, which needs spatial understanding of where the intermediate features are relative to each other. When we use fully connected affine layers to process the image, however, we want each datapoint to be represented by a single vector; it’s no longer useful to segregate the different channels, rows, and columns of the data. So, we use a “flatten” operation to collapse the C x H x W values per representation into a single long vector. The flatten function below first reads in the batch size N from a given batch of data, and then returns a “view” of that data. “View” is analogous to numpy’s “reshape” method: it reshapes x’s dimensions to be N x ??, where ?? is allowed to be anything (in this case, it will be C x H x W, but we don’t need to specify that explicitly).
def flatten(x):
    N = x.shape[0]  # read in N, the batch size; the other dimensions are inferred
    return x.view(N, -1)  # "flatten" the C * H * W values into a single vector per image

def test_flatten():
    x = torch.arange(12).view(2, 1, 3, 2)
    print('Before flattening: ', x)
    print('After flattening: ', flatten(x))

test_flatten()
Before flattening: tensor([[[[ 0, 1],
[ 2, 3],
[ 4, 5]]],
[[[ 6, 7],
[ 8, 9],
[10, 11]]]])
After flattening: tensor([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11]])
Here we define a function two_layer_fc which performs the forward pass of a two-layer fully-connected ReLU network on a batch of image data. After defining the forward pass we check that it doesn’t crash and that it produces outputs of the right shape by running zeros through the network.
You don’t have to write any code here, but it’s important that you read and understand the implementation.
import torch.nn.functional as F # useful stateless functions
def two_layer_fc(x, params):
    """
    A fully-connected neural network; the architecture is:
    fully connected layer -> ReLU -> fully connected layer.
    Note that this function only defines the forward pass;
    PyTorch will take care of the backward pass for us.

    The input to the network will be a minibatch of data, of shape
    (N, d1, ..., dM) where d1 * ... * dM = D. The hidden layer will have H units,
    and the output layer will produce scores for C classes.

    Inputs:
    - x: A PyTorch Tensor of shape (N, d1, ..., dM) giving a minibatch of
      input data.
    - params: A list [w1, w2] of PyTorch Tensors giving weights for the network;
      w1 has shape (D, H) and w2 has shape (H, C).

    Returns:
    - scores: A PyTorch Tensor of shape (N, C) giving classification scores for
      the input data x.
    """
    # first we flatten the image
    x = flatten(x)  # shape: [batch_size, C x H x W]

    w1, w2 = params

    # Forward pass: compute predicted y using operations on Tensors. Since w1 and
    # w2 have requires_grad=True, operations involving these Tensors will cause
    # PyTorch to build a computational graph, allowing automatic computation of
    # gradients. Since we are no longer implementing the backward pass by hand we
    # don't need to keep references to intermediate values.
    # you can also use `.clamp(min=0)`, equivalent to F.relu()
    x = F.relu(x.mm(w1))
    # mm is short for torch.mm(), i.e. matrix multiplication. It multiplies two
    # 2-D tensors (matrices) and returns a new tensor.
    x = x.mm(w2)
    return x


def two_layer_fc_test():
    hidden_layer_size = 42
    x = torch.zeros((64, 50), dtype=dtype)  # minibatch size 64, feature dimension 50
    w1 = torch.zeros((50, hidden_layer_size), dtype=dtype)
    w2 = torch.zeros((hidden_layer_size, 10), dtype=dtype)
    scores = two_layer_fc(x, [w1, w2])
    print(scores.size())  # you should see [64, 10]

two_layer_fc_test()
torch.Size([64, 10])
Here you will complete the implementation of the function three_layer_convnet, which will perform the forward pass of a three-layer convolutional network. Like above, we can immediately test our implementation by passing zeros through the network. The network should have the following architecture:
1. A convolutional layer (with bias) with channel_1 filters, each with shape KW1 x KH1, and zero-padding of two
2. ReLU nonlinearity
3. A convolutional layer (with bias) with channel_2 filters, each with shape KW2 x KH2, and zero-padding of one
4. ReLU nonlinearity
5. Fully-connected layer with bias, producing scores for C classes.
Note that we have no softmax activation here after our fully-connected layer: this is because PyTorch’s cross entropy loss performs a softmax activation for you, and bundling that step in makes computation more efficient.
HINT: For convolutions: http://pytorch.org/docs/stable/nn.html#torch.nn.functional.conv2d; pay attention to the shapes of convolutional filters!
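To reason about the filter shapes and the size of the fully-connected layer, it can help to work out the spatial output size of each convolution. The sketch below uses the standard formula for stride 1; the helper name `conv_output_size` is hypothetical, and the 5x5 / 3x3 kernels and 9 output channels are taken from the test cell further down.

```python
def conv_output_size(size, kernel, padding=0, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# 32x32 CIFAR images through a 5x5 conv with padding 2, then a 3x3 conv with padding 1.
h = conv_output_size(32, kernel=5, padding=2)
h = conv_output_size(h, kernel=3, padding=1)

channel_2 = 9                 # number of filters in the second conv layer (from the test cell)
fc_in = channel_2 * h * h     # number of features entering the fully-connected layer
print(h, fc_in)
```

With these paddings the spatial size is preserved, so the flattened input to the fully-connected layer has channel_2 * 32 * 32 features.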
def three_layer_convnet(x, params):
    """
    Performs the forward pass of a three-layer convolutional network with the
    architecture defined above.

    Inputs:
    - x: A PyTorch Tensor of shape (N, 3, H, W) giving a minibatch of images
    - params: A list of PyTorch Tensors giving the weights and biases for the
      network; should contain the following:
      - conv_w1: PyTorch Tensor of shape (channel_1, 3, KH1, KW1) giving weights
        for the first convolutional layer
      - conv_b1: PyTorch Tensor of shape (channel_1,) giving biases for the first
        convolutional layer
      - conv_w2: PyTorch Tensor of shape (channel_2, channel_1, KH2, KW2) giving
        weights for the second convolutional layer
      - conv_b2: PyTorch Tensor of shape (channel_2,) giving biases for the second
        convolutional layer
      - fc_w: PyTorch Tensor giving weights for the fully-connected layer. Can you
        figure out what the shape should be?
      - fc_b: PyTorch Tensor giving biases for the fully-connected layer. Can you
        figure out what the shape should be?

    Returns:
    - scores: PyTorch Tensor of shape (N, C) giving classification scores for x
    """
    conv_w1, conv_b1, conv_w2, conv_b2, fc_w, fc_b = params
    scores = None
    # TODO: Implement the forward pass for the three-layer ConvNet. #
    x1 = F.relu(F.conv2d(x, conv_w1, conv_b1, padding=2))
    x2 = F.relu(F.conv2d(x1, conv_w2, conv_b2, padding=1))
    scores = flatten(x2).mm(fc_w) + fc_b
    return scores
After defining the forward pass of the ConvNet above, run the following cell to test your implementation.
When you run this function, scores should have shape (64, 10).
def three_layer_convnet_test():
    x = torch.zeros((64, 3, 32, 32), dtype=dtype)  # minibatch size 64, image size [3, 32, 32]

    conv_w1 = torch.zeros((6, 3, 5, 5), dtype=dtype)  # [out_channel, in_channel, kernel_H, kernel_W]
    conv_b1 = torch.zeros((6,))  # out_channel
    conv_w2 = torch.zeros((9, 6, 3, 3), dtype=dtype)  # [out_channel, in_channel, kernel_H, kernel_W]
    conv_b2 = torch.zeros((9,))  # out_channel

    # you must calculate the shape of the tensor after two conv layers, before the fully-connected layer
    fc_w = torch.zeros((9 * 32 * 32, 10))
    fc_b = torch.zeros(10)

    scores = three_layer_convnet(x, [conv_w1, conv_b1, conv_w2, conv_b2, fc_w, fc_b])
    print(scores.size())  # you should see [64, 10]

three_layer_convnet_test()
torch.Size([64, 10])
Let’s write a couple utility methods to initialize the weight matrices for our models.
- random_weight(shape) initializes a weight tensor with the Kaiming normalization method.
- zero_weight(shape) initializes a weight tensor with all zeros. Useful for instantiating bias parameters.
The random_weight function uses the Kaiming normal initialization method, described in:
He et al, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, ICCV 2015, https://arxiv.org/abs/1502.01852
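As a quick sanity check on the sqrt(2 / fan_in) scaling, the sketch below computes fan_in and the resulting standard deviation for the two kinds of weight shapes used in this notebook. The helper names `fan_in` and `kaiming_std` are hypothetical and are not part of PyTorch.

```python
import math

def fan_in(shape):
    """Number of inputs feeding each output unit for a weight of this shape."""
    if len(shape) == 2:          # FC weight: (D, H)
        return shape[0]
    return math.prod(shape[1:])  # conv weight: (out_channel, in_channel, kH, kW)

def kaiming_std(shape):
    """Standard deviation used by Kaiming normal initialization."""
    return math.sqrt(2.0 / fan_in(shape))

print(fan_in((3 * 32 * 32, 4000)), kaiming_std((3 * 32 * 32, 4000)))
print(fan_in((32, 3, 5, 5)), kaiming_std((32, 3, 5, 5)))
```

A larger fan_in gives a smaller standard deviation, which keeps the scale of the activations roughly constant from layer to layer.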
def random_weight(shape):
    """
    Create random Tensors for weights; setting requires_grad=True means that we
    want to compute gradients for these Tensors during the backward pass.
    We use Kaiming normalization: sqrt(2 / fan_in)
    """
    if len(shape) == 2:  # FC weight, e.g. fc_w
        fan_in = shape[0]
    else:
        # conv weight [out_channel, in_channel, kH, kW]: fan_in is the product of
        # all dimensions after the first (in_channel * kH * kW), i.e. the number
        # of inputs feeding each output unit.
        fan_in = np.prod(shape[1:])
    # randn is standard normal distribution generator.
    w = torch.randn(shape, device=device, dtype=dtype) * np.sqrt(2. / fan_in)
    w.requires_grad = True
    return w

def zero_weight(shape):
    return torch.zeros(shape, device=device, dtype=dtype, requires_grad=True)

# create a weight of shape [3 x 5]
# you should see the type `torch.cuda.FloatTensor` if you use GPU.
# Otherwise it should be `torch.FloatTensor`
random_weight((3, 5))
tensor([[ 0.2397, -0.8578, 1.0392, 0.0155, -0.2978],
[ 0.6780, -0.7439, -0.3429, -0.6997, 0.2470],
[ 0.4724, -0.2693, 0.9230, -0.6088, 0.1044]], device='cuda:0',
When training the model we will use the following function to check the accuracy of our model on the training or validation sets.
When checking accuracy we don’t need to compute any gradients; as a result we don’t need PyTorch to build a computational graph for us when we compute scores. To prevent a graph from being built we scope our computation under a torch.no_grad() context manager.
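The effect of the context manager can be seen on a tiny example. This is only a sketch with a made-up computation; the accuracy check below applies the same idea to the full forward pass.

```python
import torch

w = torch.ones(3, requires_grad=True)

# Outside no_grad: the result is attached to the computational graph.
tracked = (w * 2).sum()

# Inside no_grad: the same computation produces a plain value with no graph.
with torch.no_grad():
    untracked = (w * 2).sum()

print(tracked.requires_grad, untracked.requires_grad)
```

Skipping graph construction saves memory and time when we only need the forward values.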
def check_accuracy_part2(loader, model_fn, params):
    """
    Check the accuracy of a classification model.

    Inputs:
    - loader: A DataLoader for the data split we want to check
    - model_fn: A function that performs the forward pass of the model,
      with the signature scores = model_fn(x, params)
    - params: List of PyTorch Tensors giving parameters of the model

    Returns: Nothing, but prints the accuracy of the model
    """
    # The val loader is built on the training split (loader.dataset.train is
    # True), so split is 'val' in that case and 'test' otherwise.
    split = 'val' if loader.dataset.train else 'test'
    print('Checking accuracy on the %s set' % split)
    num_correct, num_samples = 0, 0
    with torch.no_grad():
        for x, y in loader:
            x = x.to(device=device, dtype=dtype)  # move to device, e.g. GPU
            y = y.to(device=device, dtype=torch.int64)
            scores = model_fn(x, params)
            # scores.max(1) takes the maximum over each row and returns both the
            # maximum values and their column indices (the predicted classes);
            # preds is a 1-D tensor of shape (batch_size,).
            _, preds = scores.max(1)
            num_correct += (preds == y).sum()
            num_samples += preds.size(0)
        acc = float(num_correct) / num_samples
        print('Got %d / %d correct (%.2f%%)' % (num_correct, num_samples, 100 * acc))
We can now set up a basic training loop to train our network. We will train the model using stochastic gradient descent without momentum. We will use torch.nn.functional.cross_entropy to compute the loss; you can read about it here.
The training loop takes as input the neural network function, a list of initialized parameters ([w1, w2] in our example), and learning rate.
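Since the network itself has no softmax layer, it may help to see that the cross-entropy loss applies a (log-)softmax internally. The sketch below compares it with an explicit log-softmax followed by the negative log-likelihood loss; the random scores and labels are made up for the illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
scores = torch.randn(4, 10)        # raw class scores for 4 examples, 10 classes
y = torch.tensor([1, 0, 9, 3])     # made-up ground-truth labels

# cross_entropy takes raw scores and applies log-softmax internally.
loss_bundled = F.cross_entropy(scores, y)

# The same loss computed in two explicit steps.
loss_manual = F.nll_loss(F.log_softmax(scores, dim=1), y)

print(loss_bundled.item(), loss_manual.item())
```

This is why the forward functions in this notebook return raw scores rather than probabilities.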
def train_part2(model_fn, params, learning_rate):
    """
    Train a model on CIFAR-10.

    Inputs:
    - model_fn: A Python function that performs the forward pass of the model.
      It should have the signature scores = model_fn(x, params) where x is a
      PyTorch Tensor of image data, params is a list of PyTorch Tensors giving
      model weights, and scores is a PyTorch Tensor of shape (N, C) giving
      scores for the elements in x.
    - params: List of PyTorch Tensors giving weights for the model
    - learning_rate: Python scalar giving the learning rate to use for SGD

    Returns: Nothing
    """
    for t, (x, y) in enumerate(loader_train):
        # Move the data to the proper device (GPU or CPU)
        x = x.to(device=device, dtype=dtype)
        y = y.to(device=device, dtype=torch.long)

        # Forward pass: compute scores and loss
        scores = model_fn(x, params)
        loss = F.cross_entropy(scores, y)

        # Backward pass: PyTorch figures out which Tensors in the computational
        # graph have requires_grad=True and uses backpropagation to compute the
        # gradient of the loss with respect to these Tensors, and stores the
        # gradients in the .grad attribute of each Tensor.
        loss.backward()

        # Update parameters. We don't want to backpropagate through the
        # parameter updates, so we scope the updates under a torch.no_grad()
        # context manager to prevent a computational graph from being built.
        with torch.no_grad():
            for w in params:
                w -= learning_rate * w.grad

                # Manually zero the gradients after running the backward pass
                w.grad.zero_()

        if t % print_every == 0:
            print('Iteration %d, loss = %.4f' % (t, loss.item()))
            check_accuracy_part2(loader_val, model_fn, params)
Now we are ready to run the training loop. We need to explicitly allocate tensors for the fully connected weights, w1 and w2.
Each minibatch of CIFAR has 64 examples, so the tensor shape is [64, 3, 32, 32].
After flattening, x should have shape [64, 3 * 32 * 32]. Its second dimension, 3 * 32 * 32, will be the size of the first dimension of w1. The second dimension of w1 is the hidden layer size, which will also be the first dimension of w2.
Finally, the output of the network is a 10-dimensional vector of scores for the 10 classes; the softmax inside the cross-entropy loss turns these scores into a probability distribution.
You don’t need to tune any hyperparameters but you should see accuracies above 40% after training for one epoch.
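The shape bookkeeping described above can be traced on dummy data. This is a sketch only: the hidden size H = 7 is a small hypothetical value chosen to keep the example cheap, and the weights are zeros rather than properly initialized.

```python
import torch

N, H = 64, 7                         # minibatch size, hypothetical hidden size
x = torch.zeros(N, 3, 32, 32)        # one CIFAR-shaped minibatch
w1 = torch.zeros(3 * 32 * 32, H)     # first dim matches the flattened image
w2 = torch.zeros(H, 10)              # first dim matches the hidden size

x_flat = x.view(N, -1)               # [64, 3072]
hidden = x_flat.mm(w1).clamp(min=0)  # [64, H], ReLU via clamp
scores = hidden.mm(w2)               # [64, 10] raw class scores
probs = torch.softmax(scores, dim=1) # probabilities, if we want them explicitly

print(x_flat.shape, hidden.shape, scores.shape)
```

Each row of probs sums to one, whereas the raw scores are unconstrained.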
hidden_layer_size = 4000
learning_rate = 1e-2
w1 = random_weight((3 * 32 * 32, hidden_layer_size))
w2 = random_weight((hidden_layer_size, 10))
train_part2(two_layer_fc, [w1, w2], learning_rate)
Iteration 0, loss = 3.4006
Checking accuracy on the val set
Got 127 / 1000 correct (12.70%)
Iteration 100, loss = 1.7955
Checking accuracy on the val set
Got 345 / 1000 correct (34.50%)
Iteration 200, loss = 2.3046
Checking accuracy on the val set
Got 378 / 1000 correct (37.80%)
Iteration 300, loss = 1.9541
Checking accuracy on the val set
Got 400 / 1000 correct (40.00%)
Iteration 400, loss = 2.0212
Checking accuracy on the val set
Got 390 / 1000 correct (39.00%)
Iteration 500, loss = 1.5006
Checking accuracy on the val set
Got 422 / 1000 correct (42.20%)
Iteration 600, loss = 1.7669
Checking accuracy on the val set
Got 425 / 1000 correct (42.50%)
Iteration 700, loss = 1.8033
Checking accuracy on the val set
Got 423 / 1000 correct (42.30%)
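Below you should use the functions defined above to train a three-layer convolutional network on CIFAR. The network should have the following architecture:
1. Convolutional layer (with bias) with 32 5x5 filters, with zero-padding of 2
2. ReLU
3. Convolutional layer (with bias) with 16 3x3 filters, with zero-padding of 1
4. ReLU
5. Fully-connected layer (with bias) to compute scores for 10 classes
You should initialize your weight matrices using the random_weight function defined above, and you should initialize your bias vectors using the zero_weight function above.
You don’t need to tune any hyperparameters, but if everything works correctly you should achieve an accuracy above 42% after one epoch.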
learning_rate = 3e-3
channel_1 = 32
channel_2 = 16
conv_w1 = None
conv_b1 = None
conv_w2 = None
conv_b2 = None
fc_w = None
fc_b = None
# TODO: Initialize the parameters of a three-layer ConvNet. #
conv_w1 = random_weight((channel_1, 3, 5, 5))
conv_b1 = zero_weight((channel_1, ))
conv_w2 = random_weight((channel_2, channel_1, 3, 3))
conv_b2 = zero_weight((channel_2, ))
fc_w = random_weight((channel_2 * 32 * 32, 10))
fc_b = zero_weight((10, ))
params = [conv_w1, conv_b1, conv_w2, conv_b2, fc_w, fc_b]
train_part2(three_layer_convnet, params, learning_rate)
Iteration 0, loss = 3.2408
Checking accuracy on the val set
Got 116 / 1000 correct (11.60%)
Iteration 100, loss = 1.7337
Checking accuracy on the val set
Got 339 / 1000 correct (33.90%)
Iteration 200, loss = 1.9163
Checking accuracy on the val set
Got 389 / 1000 correct (38.90%)
Iteration 300, loss = 2.0088
Checking accuracy on the val set
Got 417 / 1000 correct (41.70%)
Iteration 400, loss = 1.7278
Checking accuracy on the val set
Got 415 / 1000 correct (41.50%)
Iteration 500, loss = 1.6017
Checking accuracy on the val set
Got 428 / 1000 correct (42.80%)
Iteration 600, loss = 1.4811
Checking accuracy on the val set
Got 447 / 1000 correct (44.70%)
Iteration 700, loss = 1.6551
Checking accuracy on the val set
Got 463 / 1000 correct (46.30%)
Barebone PyTorch requires that we track all the parameter tensors by hand. This is fine for small networks with a few tensors, but it would be extremely inconvenient and error-prone to track tens or hundreds of tensors in larger networks.
PyTorch provides the nn.Module API for you to define arbitrary network architectures, while tracking all learnable parameters for you. In Part II, we implemented SGD ourselves. PyTorch also provides the torch.optim package that implements all the common optimizers, such as RMSProp, Adagrad, and Adam. It even supports approximate second-order methods like L-BFGS! You can refer to the doc for the exact specifications of each optimizer.
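To connect the hand-written update from Part II with torch.optim, the sketch below performs one plain SGD step both ways on a made-up loss (the sum of squares of the weights) and checks that the results agree.

```python
import torch
import torch.optim as optim

lr = 0.1
w_manual = torch.tensor([1.0, -2.0], requires_grad=True)
w_optim = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = optim.SGD([w_optim], lr=lr)

# Manual update, as in the Part II training loop.
loss = (w_manual ** 2).sum()
loss.backward()
with torch.no_grad():
    w_manual -= lr * w_manual.grad
w_manual.grad.zero_()

# The same step expressed with an Optimizer object.
optimizer.zero_grad()
loss = (w_optim ** 2).sum()
loss.backward()
optimizer.step()

print(w_manual, w_optim)
```

Swapping in a different optimizer then only changes the line that constructs it.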
To use the Module API, follow the steps below:
1. Subclass nn.Module. Give your network class an intuitive name like TwoLayerFC.
2. In the constructor __init__(), define all the layers you need as class attributes. Layer objects like nn.Linear and nn.Conv2d are themselves nn.Module subclasses and contain learnable parameters, so that you don’t have to instantiate the raw tensors yourself. nn.Module will track these internal parameters for you. Refer to the doc to learn more about the dozens of builtin layers. Warning: don’t forget to call super().__init__() first!
3. In the forward() method, define the connectivity of your network. You should use the attributes defined in __init__ as function calls that take a tensor as input and output the “transformed” tensor. Do not create any new layers with learnable parameters in forward()! All of them must be declared upfront in __init__.
After you define your Module subclass, you can instantiate it as an object and call it just like the NN forward function in part II.
Here is a concrete example of a 2-layer fully connected network:
class TwoLayerFC(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        # assign layer objects to class attributes
        self.fc1 = nn.Linear(input_size, hidden_size)
        # nn.init package contains convenient initialization methods
        # http://pytorch.org/docs/master/nn.html#torch-nn-init
        nn.init.kaiming_normal_(self.fc1.weight)
        self.fc2 = nn.Linear(hidden_size, num_classes)
        nn.init.kaiming_normal_(self.fc2.weight)

    def forward(self, x):
        # forward always defines connectivity
        x = flatten(x)
        scores = self.fc2(F.relu(self.fc1(x)))
        return scores

def test_TwoLayerFC():
    input_size = 50
    x = torch.zeros((64, input_size), dtype=dtype)  # minibatch size 64, feature dimension 50
    model = TwoLayerFC(input_size, 42, 10)
    scores = model(x)
    print(scores.size())  # you should see [64, 10]

test_TwoLayerFC()
torch.Size([64, 10])
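To see the parameter tracking mentioned in the steps above, the sketch below defines a tiny, hypothetical module (`TinyNet`, not part of the assignment) and lists what nn.Module has registered.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 3)
        self.fc2 = nn.Linear(3, 2)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = TinyNet()

# Layers assigned as attributes are registered automatically.
names = [name for name, _ in model.named_parameters()]
num_params = sum(p.numel() for p in model.parameters())
print(names, num_params)
```

This is the same collection that model.parameters() hands to an optimizer, so nothing has to be tracked by hand.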
It’s your turn to implement a 3-layer ConvNet followed by a fully connected layer. The network architecture should be the same as in Part II:
1. Convolutional layer with channel_1 5x5 filters with zero-padding of 2
2. ReLU
3. Convolutional layer with channel_2 3x3 filters with zero-padding of 1
4. ReLU
5. Fully-connected layer to num_classes classes
You should initialize the weight matrices of the model using the Kaiming normal initialization method.
HINT: http://pytorch.org/docs/stable/nn.html#conv2d
After you implement the three-layer ConvNet, the test_ThreeLayerConvNet function will run your implementation; it should print torch.Size([64, 10]) for the shape of the output scores.
class ThreeLayerConvNet(nn.Module):
    def __init__(self, in_channel, channel_1, channel_2, num_classes):
        super().__init__()
        # TODO: Set up the layers you need for a three-layer ConvNet with the #
        # architecture defined above. #
        self.conv1 = nn.Conv2d(in_channel, channel_1, (5, 5), padding=2)
        self.conv2 = nn.Conv2d(channel_1, channel_2, (3, 3), padding=1)
        # padding keeps the 32x32 spatial size, so the FC layer sees channel_2 * 32 * 32 features
        self.fc = nn.Linear(channel_2 * 32 * 32, num_classes)

    def forward(self, x):
        scores = None
        # TODO: Implement the forward function for a 3-layer ConvNet. you #
        # should use the layers you defined in __init__ and specify the #
        # connectivity of those layers in forward() #
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        scores = self.fc(flatten(x))
        return scores

def test_ThreeLayerConvNet():
    x = torch.zeros((64, 3, 32, 32), dtype=dtype)  # minibatch size 64, image size [3, 32, 32]
    model = ThreeLayerConvNet(in_channel=3, channel_1=12, channel_2=8, num_classes=10)
    scores = model(x)
    print(scores.size())  # you should see [64, 10]

test_ThreeLayerConvNet()
torch.Size([64, 10])
Given the validation or test set, we can check the classification accuracy of a neural network.
This version is slightly different from the one in part II. You don’t manually pass in the parameters anymore.
def check_accuracy_part34(loader, model):
    if loader.dataset.train:
        print('Checking accuracy on validation set')
    else:
        print('Checking accuracy on test set')
    num_correct = 0
    num_samples = 0
    model.eval()  # set model to evaluation mode
    with torch.no_grad():
        for x, y in loader:
            x = x.to(device=device, dtype=dtype)  # move to device, e.g. GPU
            y = y.to(device=device, dtype=torch.long)
            scores = model(x)
            _, preds = scores.max(1)
            num_correct += (preds == y).sum()
            num_samples += preds.size(0)
        acc = float(num_correct) / num_samples
        print('Got %d / %d correct (%.2f)' % (num_correct, num_samples, 100 * acc))
We also use a slightly different training loop. Rather than updating the values of the weights ourselves, we use an Optimizer object from the torch.optim package, which abstracts the notion of an optimization algorithm and provides implementations of most of the algorithms commonly used to optimize neural networks.
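One detail of the loop is worth isolating: gradients accumulate across backward passes, which is why they are zeroed on every iteration. The sketch below shows this on a single made-up weight.

```python
import torch

w = torch.tensor([3.0], requires_grad=True)

# First backward pass: d(2w)/dw = 2.
(w * 2).sum().backward()
first = w.grad.clone()

# Second backward pass without zeroing: the new gradient is added to the old one.
(w * 2).sum().backward()
accumulated = w.grad.clone()

# Zeroing first gives the fresh gradient again.
w.grad.zero_()
(w * 2).sum().backward()
after_zero = w.grad.clone()

print(first, accumulated, after_zero)
```

An Optimizer object does the same zeroing for all of its parameters through its zero_grad() method.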
def train_part34(model, optimizer, epochs=1):
    """
    Train a model on CIFAR-10 using the PyTorch Module API.

    Inputs:
    - model: A PyTorch Module giving the model to train.
    - optimizer: An Optimizer object we will use to train the model
    - epochs: (Optional) A Python integer giving the number of epochs to train for

    Returns: Nothing, but prints model accuracies during training.
    """
    model = model.to(device=device)  # move the model parameters to CPU/GPU
    for e in range(epochs):
        for t, (x, y) in enumerate(loader_train):
            model.train()  # put model to training mode
            x = x.to(device=device, dtype=dtype)  # move to device, e.g. GPU
            y = y.to(device=device, dtype=torch.long)

            scores = model(x)
            loss = F.cross_entropy(scores, y)

            # Zero out all of the gradients for the variables which the optimizer
            # will update.
            optimizer.zero_grad()

            # This is the backwards pass: compute the gradient of the loss with
            # respect to each parameter of the model.
            loss.backward()

            # Actually update the parameters of the model using the gradients
            # computed by the backwards pass.
            optimizer.step()

            if t % print_every == 0:
                print('Iteration %d, loss = %.4f' % (t, loss.item()))
                check_accuracy_part34(loader_val, model)
                print()
Now we are ready to run the training loop. In contrast to part II, we don’t explicitly allocate parameter tensors anymore.
Simply pass the input size, hidden layer size, and number of classes (i.e. output size) to the constructor of TwoLayerFC.
You also need to define an optimizer that tracks all the learnable parameters inside TwoLayerFC.
You don’t need to tune any hyperparameters, but you should see model accuracies above 40% after training for one epoch.
hidden_layer_size = 4000
learning_rate = 1e-2
model = TwoLayerFC(3 * 32 * 32, hidden_layer_size, 10)
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
train_part34(model, optimizer)
Iteration 0, loss = 3.5774
Checking accuracy on validation set
Got 125 / 1000 correct (12.50)
Iteration 100, loss = 2.7529
Checking accuracy on validation set
Got 340 / 1000 correct (34.00)
Iteration 200, loss = 2.0865
Checking accuracy on validation set
Got 392 / 1000 correct (39.20)
Iteration 300, loss = 1.7863
Checking accuracy on validation set
Got 430 / 1000 correct (43.00)
Iteration 400, loss = 1.9395
Checking accuracy on validation set
Got 423 / 1000 correct (42.30)
Iteration 500, loss = 1.4940
Checking accuracy on validation set
Got 374 / 1000 correct (37.40)
Iteration 600, loss = 1.6123
Checking accuracy on validation set
Got 431 / 1000 correct (43.10)
Iteration 700, loss = 1.8174
Checking accuracy on validation set
Got 448 / 1000 correct (44.80)
You should now use the Module API to train a three-layer ConvNet on CIFAR. This should look very similar to training the two-layer network! You don’t need to tune any hyperparameters, but you should achieve accuracy above 45% after training for one epoch.
You should train the model using stochastic gradient descent without momentum.
learning_rate = 3e-3
channel_1 = 32
channel_2 = 16
model = None
optimizer = None
# TODO: Instantiate your ThreeLayerConvNet model and a corresponding optimizer #
model = ThreeLayerConvNet(3, channel_1, channel_2, 10)
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
train_part34(model, optimizer)
Iteration 0, loss = 2.7838
Checking accuracy on validation set
Got 112 / 1000 correct (11.20)
Iteration 100, loss = 1.9773
Checking accuracy on validation set
Got 362 / 1000 correct (36.20)
Iteration 200, loss = 1.7640
Checking accuracy on validation set
Got 425 / 1000 correct (42.50)
Iteration 300, loss = 1.5294
Checking accuracy on validation set
Got 458 / 1000 correct (45.80)
Iteration 400, loss = 1.8433
Checking accuracy on validation set
Got 451 / 1000 correct (45.10)
Iteration 500, loss = 1.2423
Checking accuracy on validation set
Got 476 / 1000 correct (47.60)
Iteration 600, loss = 1.6279
Checking accuracy on validation set
Got 480 / 1000 correct (48.00)
Iteration 700, loss = 1.6463
Checking accuracy on validation set
Got 502 / 1000 correct (50.20)
Part III introduced the PyTorch Module API, which allows you to define arbitrary learnable layers and their connectivity.
For simple models like a stack of feed forward layers, you still need to go through 3 steps: subclass nn.Module, assign layers to class attributes in __init__, and call each layer one by one in forward(). Is there a more convenient way?
Fortunately, PyTorch provides a container Module called nn.Sequential, which merges the above steps into one. It is not as flexible as nn.Module, because you cannot specify a more complex topology than a feed-forward stack, but it’s good enough for many use cases.
Let’s see how to rewrite our two-layer fully connected network example with nn.Sequential, and train it using the training loop defined above.
Again, you don’t need to tune any hyperparameters here, but you should achieve above 40% accuracy after one epoch of training.
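As an aside, recent PyTorch releases also ship a built-in nn.Flatten layer that flattens everything after the batch dimension by default, so a hand-written wrapper around flatten is not strictly required there. The sketch below assumes such a version is installed and uses a small hypothetical hidden size of 16 to keep it cheap.

```python
import torch
import torch.nn as nn

# A feed-forward stack: each layer's output feeds the next layer's input.
model = nn.Sequential(
    nn.Flatten(),                 # [N, 3, 32, 32] -> [N, 3072]
    nn.Linear(3 * 32 * 32, 16),   # hypothetical small hidden layer
    nn.ReLU(),
    nn.Linear(16, 10),
)

x = torch.zeros(64, 3, 32, 32)
scores = model(x)
print(scores.shape)
```

The custom wrapper used in this notebook behaves the same way for image batches and works on older versions too.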
# We need to wrap `flatten` function in a module in order to stack it
# in nn.Sequential
class Flatten(nn.Module):
    def forward(self, x):
        return flatten(x)

hidden_layer_size = 4000
learning_rate = 1e-2

model = nn.Sequential(
    Flatten(),
    nn.Linear(3 * 32 * 32, hidden_layer_size),
    nn.ReLU(),
    nn.Linear(hidden_layer_size, 10),
)

# you can use Nesterov momentum in optim.SGD
optimizer = optim.SGD(model.parameters(), lr=learning_rate,
                      momentum=0.9, nesterov=True)

train_part34(model, optimizer)
Iteration 0, loss = 2.3417
Checking accuracy on validation set
Got 156 / 1000 correct (15.60)
Iteration 100, loss = 1.7241
Checking accuracy on validation set
Got 369 / 1000 correct (36.90)
Iteration 200, loss = 2.3021
Checking accuracy on validation set
Got 386 / 1000 correct (38.60)
Iteration 300, loss = 1.4864
Checking accuracy on validation set
Got 418 / 1000 correct (41.80)
Iteration 400, loss = 1.7475
Checking accuracy on validation set
Got 433 / 1000 correct (43.30)
Iteration 500, loss = 1.6403
Checking accuracy on validation set
Got 420 / 1000 correct (42.00)
Iteration 600, loss = 1.8132
Checking accuracy on validation set
Got 457 / 1000 correct (45.70)
Iteration 700, loss = 1.9169
Checking accuracy on validation set
Got 459 / 1000 correct (45.90)
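Here you should use nn.Sequential to define and train a three-layer ConvNet with the same architecture we used in Part III:
1. Convolutional layer (with bias) with 32 5x5 filters, with zero-padding of 2
2. ReLU
3. Convolutional layer (with bias) with 16 3x3 filters, with zero-padding of 1
4. ReLU
5. Fully-connected layer (with bias) to compute scores for 10 classes
You can use the default PyTorch weight initialization.
You should optimize your model using stochastic gradient descent with Nesterov momentum 0.9.
Again, you don’t need to tune any hyperparameters but you should see accuracy above 55% after one epoch of training.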
channel_1 = 32
channel_2 = 16
learning_rate = 1e-2

model = None
optimizer = None

# TODO: Rewrite the 2-layer ConvNet with bias from Part III with the
# Sequential API.
model = nn.Sequential(
    nn.Conv2d(3, channel_1, (5, 5), padding=2),
    nn.ReLU(),
    nn.Conv2d(channel_1, channel_2, (3, 3), padding=1),
    nn.ReLU(),
    Flatten(),  # (N, channel_2, 32, 32) -> (N, channel_2 * 32 * 32)
    nn.Linear(channel_2 * 32 * 32, 10),
)
optimizer = optim.SGD(model.parameters(), lr=learning_rate,
                      momentum=0.9, nesterov=True)

train_part34(model, optimizer)
Iteration 0, loss = 2.3297
Checking accuracy on validation set
Got 131 / 1000 correct (13.10)
Iteration 100, loss = 1.7009
Checking accuracy on validation set
Got 479 / 1000 correct (47.90)
Iteration 200, loss = 1.4049
Checking accuracy on validation set
Got 495 / 1000 correct (49.50)
Iteration 300, loss = 1.3381
Checking accuracy on validation set
Got 515 / 1000 correct (51.50)
Iteration 400, loss = 1.5696
Checking accuracy on validation set
Got 535 / 1000 correct (53.50)
Iteration 500, loss = 1.1625
Checking accuracy on validation set
Got 557 / 1000 correct (55.70)
Iteration 600, loss = 1.1777
Checking accuracy on validation set
Got 583 / 1000 correct (58.30)
Iteration 700, loss = 1.3879
Checking accuracy on validation set
Got 589 / 1000 correct (58.90)
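The input size of the final nn.Linear, channel_2 * 32 * 32, follows from the standard convolution output-size formula, out = (in + 2 * padding - kernel) / stride + 1. Here is a small sketch of that arithmetic. Note that conv_out_size is a hypothetical helper written for this illustration (it is not a PyTorch function), and stride 1 is assumed because that is the nn.Conv2d default.

```python
def conv_out_size(in_size, kernel, padding=0, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (in_size + 2 * padding - kernel) // stride + 1

channel_2 = 16
size = 32                                         # CIFAR-10 images are 32x32
size = conv_out_size(size, kernel=5, padding=2)   # first conv: still 32
size = conv_out_size(size, kernel=3, padding=1)   # second conv: still 32

fc_in_features = channel_2 * size * size
print(size, fc_in_features)  # 32 16384
```

Both convolutions use "same"-style padding, so the spatial size stays at 32 and only the channel count changes.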
In this section, you can experiment with whatever ConvNet architecture you’d like on CIFAR-10.
Now it’s your job to experiment with architectures, hyperparameters, loss functions, and optimizers to train a model that achieves at least 70% accuracy on the CIFAR-10 validation set within 10 epochs. You can use the check_accuracy and train functions from above. You can use either the nn.Module or the nn.Sequential API.
Describe what you did at the end of this notebook.
Refer to the official PyTorch API documentation for each component. One note: what we call “spatial batch norm” in class is called “BatchNorm2d” in PyTorch.
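To make the naming concrete, here is a minimal sketch of a conv, batch norm, ReLU stack built with nn.BatchNorm2d. The channel counts and batch size are arbitrary choices for illustration, not a recommended architecture.

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),   # num_features must equal the conv's out_channels
    nn.ReLU(),
)

x = torch.randn(8, 3, 32, 32)   # fake minibatch of 8 images
out = block(x)                  # modules start in training mode
print(out.shape)                # torch.Size([8, 32, 32, 32])

# BatchNorm2d learns one scale and one shift per channel.
bn = block[1]
print(bn.weight.shape, bn.bias.shape)  # torch.Size([32]) torch.Size([32])
```

Because batch norm behaves differently during training and evaluation, remember that the model's mode matters when you check accuracy.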
For each network architecture that you try, you should tune the learning rate and other hyperparameters. When doing this there are a couple important things to keep in mind:
If you are feeling adventurous there are many other features you can implement to try and improve your performance. You are not required to implement any of these, but don’t miss the fun if you have time!
# TODO:
# Experiment with any architectures, optimizers, and hyperparameters.
# Achieve AT LEAST 70% accuracy on the *validation set* within 10 epochs.
#
# Note that you can use the check_accuracy_part34 function to evaluate on
# either the test set or the validation set, by passing either loader_test or
# loader_val as its first argument. You should not touch the test set until
# you have finished your architecture and hyperparameter tuning, and only run
# the test set once at the end to report a final value.
model = None
optimizer = None

learning_rate = 1e-2
# NOTE: the nn.ReLU() and Flatten() layers below are an assumption; the
# listing as published shows only the conv, pooling, and linear layers.
model = nn.Sequential(
    nn.Conv2d(3, 32, (3, 3), padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 32, (3, 3), padding=1),
    nn.ReLU(),
    nn.MaxPool2d((2, 2)),  # 32x32 -> 16x16
    nn.Conv2d(32, 64, (3, 3), padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 64, (3, 3), padding=1),
    nn.ReLU(),
    nn.MaxPool2d((2, 2)),  # 16x16 -> 8x8
    nn.Conv2d(64, 128, (3, 3), padding=1),
    nn.ReLU(),
    nn.Conv2d(128, 128, (3, 3), padding=1),
    nn.ReLU(),
    nn.Conv2d(128, 128, (3, 3), padding=1),
    nn.ReLU(),
    nn.MaxPool2d((2, 2)),  # 8x8 -> 4x4
    Flatten(),
    nn.Linear(128 * 4 * 4, 512),
    nn.ReLU(),
    nn.Linear(512, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
optimizer = optim.SGD(model.parameters(), lr=learning_rate,
                      momentum=0.9, nesterov=True)

# You should get at least 70% accuracy.
# You may modify the number of epochs to any number below 15.
train_part34(model, optimizer, epochs=10)
Iteration 0, loss = 2.2939
Checking accuracy on validation set
Got 78 / 1000 correct (7.80)
Iteration 100, loss = 1.4764
Checking accuracy on validation set
Got 376 / 1000 correct (37.60)
Iteration 200, loss = 1.2769
Checking accuracy on validation set
Got 495 / 1000 correct (49.50)
Iteration 300, loss = 1.3469
Checking accuracy on validation set
Got 576 / 1000 correct (57.60)
Iteration 400, loss = 1.1578
Checking accuracy on validation set
Got 608 / 1000 correct (60.80)
Iteration 500, loss = 0.9947
Checking accuracy on validation set
Got 528 / 1000 correct (52.80)
Iteration 600, loss = 1.0122
Checking accuracy on validation set
Got 653 / 1000 correct (65.30)
Iteration 700, loss = 1.0121
Checking accuracy on validation set
Got 621 / 1000 correct (62.10)
Iteration 0, loss = 0.7112
Checking accuracy on validation set
Got 640 / 1000 correct (64.00)
Iteration 100, loss = 0.9337
Checking accuracy on validation set
Got 602 / 1000 correct (60.20)
Iteration 200, loss = 0.7894
Checking accuracy on validation set
Got 647 / 1000 correct (64.70)
Iteration 300, loss = 0.9193
Checking accuracy on validation set
Got 723 / 1000 correct (72.30)
Iteration 400, loss = 0.6172
Checking accuracy on validation set
Got 716 / 1000 correct (71.60)
Iteration 500, loss = 0.9520
Checking accuracy on validation set
Got 707 / 1000 correct (70.70)
Iteration 600, loss = 0.7301
Checking accuracy on validation set
Got 714 / 1000 correct (71.40)
Iteration 700, loss = 0.7232
Checking accuracy on validation set
Got 745 / 1000 correct (74.50)
Iteration 0, loss = 0.6495
Checking accuracy on validation set
Got 711 / 1000 correct (71.10)
Iteration 100, loss = 0.5902
Checking accuracy on validation set
Got 717 / 1000 correct (71.70)
Iteration 200, loss = 0.5553
Checking accuracy on validation set
Got 745 / 1000 correct (74.50)
Iteration 300, loss = 0.6477
Checking accuracy on validation set
Got 757 / 1000 correct (75.70)
Iteration 400, loss = 0.7282
Checking accuracy on validation set
Got 710 / 1000 correct (71.00)
Iteration 500, loss = 0.6967
Checking accuracy on validation set
Got 769 / 1000 correct (76.90)
Iteration 600, loss = 0.6092
Checking accuracy on validation set
Got 760 / 1000 correct (76.00)
Iteration 700, loss = 0.5791
Checking accuracy on validation set
Got 773 / 1000 correct (77.30)
Iteration 0, loss = 0.6160
Checking accuracy on validation set
Got 769 / 1000 correct (76.90)
Iteration 100, loss = 0.6843
Checking accuracy on validation set
Got 785 / 1000 correct (78.50)
Iteration 200, loss = 0.5082
Checking accuracy on validation set
Got 781 / 1000 correct (78.10)
Iteration 300, loss = 0.5353
Checking accuracy on validation set
Got 797 / 1000 correct (79.70)
Iteration 400, loss = 0.3607
Checking accuracy on validation set
Got 785 / 1000 correct (78.50)
Iteration 500, loss = 0.4162
Checking accuracy on validation set
Got 783 / 1000 correct (78.30)
Iteration 600, loss = 0.5460
Checking accuracy on validation set
Got 790 / 1000 correct (79.00)
Iteration 700, loss = 0.5572
Checking accuracy on validation set
Got 804 / 1000 correct (80.40)
Iteration 0, loss = 0.2772
Checking accuracy on validation set
Got 792 / 1000 correct (79.20)
Iteration 100, loss = 0.4721
Checking accuracy on validation set
Got 739 / 1000 correct (73.90)
Iteration 200, loss = 0.5043
Checking accuracy on validation set
Got 786 / 1000 correct (78.60)
Iteration 300, loss = 0.3167
Checking accuracy on validation set
Got 804 / 1000 correct (80.40)
Iteration 400, loss = 0.3734
Checking accuracy on validation set
Got 799 / 1000 correct (79.90)
Iteration 500, loss = 0.3836
Checking accuracy on validation set
Got 805 / 1000 correct (80.50)
Iteration 600, loss = 0.5413
Checking accuracy on validation set
Got 812 / 1000 correct (81.20)
Iteration 700, loss = 0.4147
Checking accuracy on validation set
Got 806 / 1000 correct (80.60)
Iteration 0, loss = 0.3038
Checking accuracy on validation set
Got 812 / 1000 correct (81.20)
Iteration 100, loss = 0.4533
Checking accuracy on validation set
Got 813 / 1000 correct (81.30)
Iteration 200, loss = 0.3536
Checking accuracy on validation set
Got 816 / 1000 correct (81.60)
Iteration 300, loss = 0.4305
Checking accuracy on validation set
Got 826 / 1000 correct (82.60)
Iteration 400, loss = 0.3989
Checking accuracy on validation set
Got 815 / 1000 correct (81.50)
Iteration 500, loss = 0.3402
Checking accuracy on validation set
Got 807 / 1000 correct (80.70)
Iteration 600, loss = 0.2214
Checking accuracy on validation set
Got 797 / 1000 correct (79.70)
Iteration 700, loss = 0.4604
Checking accuracy on validation set
Got 823 / 1000 correct (82.30)
Iteration 0, loss = 0.4529
Checking accuracy on validation set
Got 814 / 1000 correct (81.40)
Iteration 100, loss = 0.2152
Checking accuracy on validation set
Got 806 / 1000 correct (80.60)
Iteration 200, loss = 0.1658
Checking accuracy on validation set
Got 815 / 1000 correct (81.50)
Iteration 300, loss = 0.2976
Checking accuracy on validation set
Got 828 / 1000 correct (82.80)
Iteration 400, loss = 0.3042
Checking accuracy on validation set
Got 806 / 1000 correct (80.60)
Iteration 500, loss = 0.2392
Checking accuracy on validation set
Got 827 / 1000 correct (82.70)
Iteration 600, loss = 0.4960
Checking accuracy on validation set
Got 785 / 1000 correct (78.50)
Iteration 700, loss = 0.3522
Checking accuracy on validation set
Got 841 / 1000 correct (84.10)
Iteration 0, loss = 0.2608
Checking accuracy on validation set
Got 824 / 1000 correct (82.40)
Iteration 100, loss = 0.3661
Checking accuracy on validation set
Got 814 / 1000 correct (81.40)
Iteration 200, loss = 0.2136
Checking accuracy on validation set
Got 790 / 1000 correct (79.00)
Iteration 300, loss = 0.1856
Checking accuracy on validation set
Got 810 / 1000 correct (81.00)
Iteration 400, loss = 0.1881
Checking accuracy on validation set
Got 832 / 1000 correct (83.20)
Iteration 500, loss = 0.3742
Checking accuracy on validation set
Got 826 / 1000 correct (82.60)
Iteration 600, loss = 0.3237
Checking accuracy on validation set
Got 827 / 1000 correct (82.70)
Iteration 700, loss = 0.5204
Checking accuracy on validation set
Got 836 / 1000 correct (83.60)
Iteration 0, loss = 0.3380
Checking accuracy on validation set
Got 824 / 1000 correct (82.40)
Iteration 100, loss = 0.0432
Checking accuracy on validation set
Got 832 / 1000 correct (83.20)
Iteration 200, loss = 0.1119
Checking accuracy on validation set
Got 822 / 1000 correct (82.20)
Iteration 300, loss = 0.4075
Checking accuracy on validation set
Got 831 / 1000 correct (83.10)
Iteration 400, loss = 0.1174
Checking accuracy on validation set
Got 845 / 1000 correct (84.50)
Iteration 500, loss = 0.2131
Checking accuracy on validation set
Got 818 / 1000 correct (81.80)
Iteration 600, loss = 0.2047
Checking accuracy on validation set
Got 827 / 1000 correct (82.70)
Iteration 700, loss = 0.3477
Checking accuracy on validation set
Got 829 / 1000 correct (82.90)
Iteration 0, loss = 0.1140
Checking accuracy on validation set
Got 839 / 1000 correct (83.90)
Iteration 100, loss = 0.0698
Checking accuracy on validation set
Got 818 / 1000 correct (81.80)
Iteration 200, loss = 0.1086
Checking accuracy on validation set
Got 840 / 1000 correct (84.00)
Iteration 300, loss = 0.1484
Checking accuracy on validation set
Got 828 / 1000 correct (82.80)
Iteration 400, loss = 0.0591
Checking accuracy on validation set
Got 834 / 1000 correct (83.40)
Iteration 500, loss = 0.1721
Checking accuracy on validation set
Got 824 / 1000 correct (82.40)
Iteration 600, loss = 0.2057
Checking accuracy on validation set
Got 833 / 1000 correct (83.30)
Iteration 700, loss = 0.0819
Checking accuracy on validation set
Got 841 / 1000 correct (84.10)
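The in_features of the first fully connected layer above, 128 * 4 * 4, comes from the three 2x2 max-pooling layers: each one halves the spatial size, so 32 becomes 16, then 8, then 4, while the padded 3x3 convolutions leave it unchanged. One way to check this kind of arithmetic is to push a dummy input through the layers one at a time. This is a sketch using a shortened stack (one conv per stage, no nonlinearities), not the model trained above.

```python
import torch
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(3, 32, (3, 3), padding=1),
    nn.MaxPool2d((2, 2)),
    nn.Conv2d(32, 64, (3, 3), padding=1),
    nn.MaxPool2d((2, 2)),
    nn.Conv2d(64, 128, (3, 3), padding=1),
    nn.MaxPool2d((2, 2)),
)

x = torch.zeros(1, 3, 32, 32)   # one dummy CIFAR-10-sized image
shapes = []
with torch.no_grad():
    for layer in features:      # nn.Sequential is iterable over its children
        x = layer(x)
        shapes.append(tuple(x.shape))
        print(type(layer).__name__, tuple(x.shape))

flat_features = x[0].numel()    # elements per example after flattening
print(flat_features)            # 2048, i.e. 128 * 4 * 4
```

If you change the number of pooling layers or the padding, rerunning a trace like this tells you what to pass as in_features to the first nn.Linear.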
In the cell below you should write an explanation of what you did, any additional features that you implemented, and/or any graphs that you made in the process of training and evaluating your network.
Now that we’ve gotten a result we’re happy with, we test our final model (which you should store in best_model) on the test set. Think about how this compares to your validation set accuracy.
best_model = model
check_accuracy_part34(loader_test, best_model)
Checking accuracy on test set
Got 8286 / 10000 correct (82.86)
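When comparing the two numbers, it helps to remember that the validation set has only 1,000 examples, so its accuracy estimate is noisy. Below is a rough sketch of that reasoning. It treats each prediction as an independent Bernoulli trial (an assumption made here for illustration) and uses the final validation reading (841 / 1000) and the test result (8286 / 10000) from the output above. The helper accuracy_std_error is hypothetical and not part of the assignment code.

```python
import math

def accuracy_std_error(num_correct, num_samples):
    """Binomial standard error of an accuracy estimate."""
    p = num_correct / num_samples
    return math.sqrt(p * (1.0 - p) / num_samples)

val_acc, val_se = 841 / 1000, accuracy_std_error(841, 1000)
test_acc, test_se = 8286 / 10000, accuracy_std_error(8286, 10000)

print(f"val:  {val_acc:.4f} +/- {val_se:.4f}")
print(f"test: {test_acc:.4f} +/- {test_se:.4f}")
print(f"gap:  {val_acc - test_acc:.4f}")
```

Under this assumption the validation standard error is a little over one percentage point, about the same size as the gap between the two accuracies, so a difference of this magnitude is not surprising.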