Today is day three of my attempt to run the MNIST handwritten-digit dataset with the PyTorch framework; the focus this time is on training the network. This blog mainly records my learning path and gathers the study material in one place.
Note: the code below was written for Python 2.7.
Day 1 (building the LeNet network): https://blog.csdn.net/qq_36627158/article/details/108098147
Day 2 (loading the MNIST dataset): https://blog.csdn.net/qq_36627158/article/details/108119048
Day 3 (training the model): https://blog.csdn.net/qq_36627158/article/details/108163693
Day 4 (testing on a single image): https://blog.csdn.net/qq_36627158/article/details/108183655
Thanks to 凯神 for providing the code and for the patient guidance!
from lenet import Net
import torch
import torch.optim as optim
import torch.nn.functional as F
import matplotlib.pyplot as plt
from mnist_load import testset_loader, trainset_loader
LEARNING_RATE = 0.001
MOMENTUM = 0.9
EPOCH = 5
if torch.cuda.is_available():
    device = torch.device('cuda')
    print 'cuda'
else:
    device = torch.device('cpu')
    print 'cpu'

mnist_model = Net().to(device)

optimizer = optim.SGD(
    mnist_model.parameters(),
    lr=LEARNING_RATE,
    momentum=MOMENTUM
)
# save_model
def save_checkpoint(checkpoint_path, model, optimizer):
    # state_dict: a Python dictionary object that:
    #   - for a model, maps each layer to its parameter tensor;
    #   - for an optimizer, contains info about the optimizer's states and hyperparameters used.
    state = {
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict()
    }
    torch.save(state, checkpoint_path)
    print 'model saved to ', checkpoint_path
# train
def mnist_train(epoch, save_interval):
    iteration = 0
    loss_plt = []
    for ep in range(epoch):
        mnist_model.train()  # set training mode (mnist_test() below switches the model to eval mode)
        for batch_idx, batch_data in enumerate(trainset_loader):
            images, labels = batch_data
            images = images.to(device)
            labels = labels.to(device)
            optimizer.zero_grad()
            output = mnist_model(images)
            loss = F.cross_entropy(output, labels)
            loss_plt.append(loss.item())  # store the scalar value, not the graph-attached tensor
            loss.backward()
            optimizer.step()
            print 'Train Epoch:', ep+1, '\tBatch: ', batch_idx+1, '/', len(trainset_loader), '\tLoss: ', loss.item()
            # different from before: saving model checkpoints
            if iteration % save_interval == 0 and iteration > 0:
                save_checkpoint('module/pytorch-mnist-batchsize-128-%i.pth' % iteration, mnist_model, optimizer)
            iteration += 1
        mnist_test()
    # save the final model
    save_checkpoint('module/pytorch-mnist-batch-128-%i.pth' % iteration, mnist_model, optimizer)
    plt.plot(loss_plt, label='loss')
    plt.legend()
    plt.show()
# test
def mnist_test():
    mnist_model.eval()  # set evaluation mode
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for images, labels in testset_loader:
            images = images.to(device)
            labels = labels.to(device)
            output = mnist_model(images)
            # sum the per-sample losses so that dividing by the dataset size gives a true average
            test_loss += F.cross_entropy(output, labels, reduction='sum').item()
            pred = output.max(1, keepdim=True)[1]  # get the index of the max log-probability
            correct += pred.eq(labels.view_as(pred)).sum().item()
    test_loss /= len(testset_loader.dataset)
    print '\nTest set: Average loss:', test_loss, '\tAccuracy:', (100. * correct / len(testset_loader.dataset)), '%\n'
if __name__ == '__main__':
    mnist_train(EPOCH, save_interval=1000)
At first I thought my machine was just too low-spec (not enough memory) to load many images per batch, so I kept changing the BATCH_SIZE hyperparameter. After I kept lowering BATCH_SIZE and the errors still would not go away, I realized it probably wasn't a memory-capacity problem.
After looking it up, it turned out to be an issue with the number of workers used to load the data (batches).
https://blog.csdn.net/breeze210/article/details/99679048
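A hedged sketch of the fix: the parameter to change is the DataLoader's num_workers, not BATCH_SIZE. The dataset/transform lines below are only my guess at what mnist_load.py from day two roughly looks like, not the actual code.
# Hypothetical sketch of the day-two DataLoader setup -- not the real mnist_load.py.
import torch
from torchvision import datasets, transforms

trainset = datasets.MNIST('data', train=True, download=True,
                          transform=transforms.ToTensor())
# num_workers controls how many worker processes load batches;
# num_workers=0 loads everything in the main process and sidesteps this kind of error.
trainset_loader = torch.utils.data.DataLoader(
    trainset, batch_size=128, shuffle=True, num_workers=0)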
Well, it turns out that when Python writes a file, it does not automatically create the folders in the path if they don't exist. Noted!
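A small sketch of the workaround: create the 'module' folder (the one used in the checkpoint paths above) before torch.save tries to write into it.
# Sketch: make sure the folder in the checkpoint path exists before saving.
import os

if not os.path.isdir('module'):
    os.makedirs('module')  # create the missing folder(s); torch.save will not do this for you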
凯神's explanation: MOMENTUM (momentum) is a parameter used when updating the model weights in stochastic gradient descent.
https://www.lizenghai.com/archives/29512.html
https://pytorch.org/docs/stable/optim.html
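As I understand it, the classic momentum update that optim.SGD is built on looks roughly like the scalar sketch below (the exact formula PyTorch uses is in the optim docs linked above); LR and MOM match the LEARNING_RATE and MOMENTUM constants in the code.
# Minimal scalar sketch of SGD with momentum (my understanding, not the PyTorch source):
#   v <- momentum * v + gradient
#   w <- w - lr * v
LR, MOM = 0.001, 0.9

def sgd_momentum_step(w, grad, velocity):
    velocity = MOM * velocity + grad   # running "velocity" built up from past gradients
    w = w - LR * velocity              # step the weight along the velocity
    return w, velocity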
All the tensors created when the data is first read in are copied to the specified device, and every computation after that runs on that device.
https://www.jb51.net/article/178049.htm
https://blog.csdn.net/qq_39463274/article/details/105295272?utm_medium=distribute.pc_relevant_t0.none-task-blog-BlogCommendFromMachineLearnPai2-1.channel_param&depth_1-utm_source=distribute.pc_relevant_t0.none-task-blog-BlogCommendFromMachineLearnPai2-1.channel_param
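A quick sketch of that idea: the model's parameters and the input tensors have to end up on the same device before the forward pass.
# Sketch: .to(device) returns a copy of the tensor on the target device.
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x = torch.randn(2, 1, 28, 28)   # a fake batch of MNIST-sized images, created on the CPU
x = x.to(device)                # copy it to the chosen device (no-op if it is already there)
print(x.device)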
if __name__ == "__main__"
https://blog.csdn.net/yjk13703623757/article/details/77918633
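A tiny standalone example of the guard: the guarded code runs only when the file is executed directly, not when it is imported by another script.
# Sketch: a file called demo.py (hypothetical name)
def main():
    print('running as a script')

if __name__ == '__main__':
    main()   # executed by `python demo.py`, skipped by `import demo`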
Checkpoints are a way to save the current state of your experiment so that you can pick up from where you left off.
https://www.cnblogs.com/Arborday/p/9740253.html
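To make the note concrete, here is a possible counterpart to save_checkpoint above (my own sketch, not code from this post) that restores the saved states so training can resume.
# Hypothetical load_checkpoint, mirroring save_checkpoint above.
def load_checkpoint(checkpoint_path, model, optimizer):
    state = torch.load(checkpoint_path)
    model.load_state_dict(state['model'])          # restore the layer parameters
    optimizer.load_state_dict(state['optimizer'])  # restore optimizer state (e.g. momentum buffers)
    print('model loaded from ' + checkpoint_path)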
optimizer.zero_grad(): because gradients are computed and accumulated automatically during the later backward pass, we clear them at the start of each iteration so that the previous iteration's gradients do not affect the current iteration's gradients.
https://blog.csdn.net/scut_salmon/article/details/82414730
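A small experiment that shows the accumulation (plain autograd, nothing from the MNIST code):
# Sketch: backward() adds into .grad, so without zeroing, gradients pile up.
import torch

w = torch.ones(1, requires_grad=True)
(w * 2).sum().backward()
print(w.grad)     # tensor([2.])
(w * 2).sum().backward()
print(w.grad)     # tensor([4.])  -- the second gradient was added to the first
w.grad.zero_()    # this is what optimizer.zero_grad() does for every parameter
print(w.grad)     # tensor([0.])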
At first I was a bit confused about what these two functions each do. A video analogy cleared it up for me:
In gradient descent (the linear-regression example), the weight update formula is w_new = w_old - lr * gradient.
loss.backward() is the part that computes the gradient.
optimizer.step() is the part that uses that gradient to compute w_new = w_old - lr * gradient.
https://v.qq.com/x/page/t0554h33liw.html
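The same analogy in runnable form (a toy sketch, not part of the MNIST code): backward() fills in each parameter's .grad, and step() applies the update rule to every parameter.
# Sketch: loss = 3 * w, so the gradient is 3 and SGD moves w by -lr * 3.
import torch
import torch.optim as optim

w = torch.tensor([1.0], requires_grad=True)
opt = optim.SGD([w], lr=0.1)

loss = (3 * w).sum()
opt.zero_grad()
loss.backward()   # computes the gradient:                   w.grad == tensor([3.])
opt.step()        # applies w_new = w_old - lr * gradient:   w == tensor([0.7])
print(w)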
Use both. They do different things, and have different scopes.
with torch.no_grad(): disables the tracking of gradients in autograd.
model.eval(): changes the forward() behaviour of the module it is called on, e.g. it disables dropout and makes batch norm use its running (population) statistics.
https://www.cnblogs.com/shiwanghualuo/p/11789018.html
https://blog.csdn.net/songyu0120/article/details/103884586?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-2.channel_param&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-2.channel_param
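A toy check of the difference (the Dropout layer is only an illustrative example; the LeNet used here may not even have one):
# Sketch: eval() changes layer behaviour, no_grad() only stops building the autograd graph.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.randn(1, 4)

net.eval()              # dropout becomes a no-op; batch norm would use running stats
with torch.no_grad():   # nothing inside this block tracks gradients
    y = net(x)
print(y.requires_grad)  # False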