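A train script is mostly about setting and tuning parameters.
The parameters to think about, and to set in the train script, are batch_size, device, dataloader, loss_function, optimizer, and epoch. Training proceeds epoch by epoch, and each epoch runs through all of the data in the training set.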
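The script is written in the following order:
(Optional) Write a parse_args() function that uses argparse to collect every parameter that needs to be passed in, and hand the result to the main function.
Inside main, set things up in this order:
transforms, dataset, dataloader, and batch_size, since the dataloader depends on all of them.
device, loss_function, optimizer, and the model, followed by transfer learning, i.e. loading pretrained weights.
The train.py from an AlexNet project serves as the example below: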
'''
Training script.
Trains the model and checks accuracy on the validation split after every epoch;
no other evaluation metrics are computed.
'''
# --- add path
import os, sys
project_path = os.path.dirname(__file__)
root_path = os.path.dirname(project_path)
sys.path.append(project_path)
# ---
import torch
import torch.nn as nn
from torchvision import transforms, datasets, utils
import torch.optim as optim
from model import AlexNet
import json, time
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import numpy as np


def parse_args():
    """get args"""
    import argparse


def generate_json(train_dataset: object = None):
    """generate class indices"""


def see_pic(test_dataset, train_dataset):
    """to see pics"""
def main(args=None):
    # args is accepted so that the main(args) call at the bottom works; this example does not use it
    # paths
    project_path = os.path.dirname(__file__)
    root_path = os.path.dirname(project_path)
    weight_path = os.path.join(root_path, "_weight", "AlexNet_2.pth")
    # batch size
    batch_size = 32
    # data loaders
    train_data_loader = None
    test_data_loader = None
    data_transform = {
        "train": transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
        ]),
        "test": transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
        ])
    }
    train_dataset = datasets.ImageFolder(
        root=os.path.join(root_path, "_data", "flower_data2", "train"),
        transform=data_transform["train"],
    )
    train_data_loader = DataLoader(
        dataset=train_dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=0,
    )
    test_dataset = datasets.ImageFolder(
        root=os.path.join(root_path, "_data", "flower_data2", "val"),
        transform=data_transform["test"],
    )
    test_data_loader = DataLoader(
        dataset=test_dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=0,
    )
    # generate the json file
    generate_json(train_dataset=train_dataset)
    # training setup
    net = AlexNet(num_classes=2, init_weights=True)  # model
    loss_function = nn.CrossEntropyLoss()  # loss function
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    net.to(device)
    optimizer = optim.Adam(net.parameters(), lr=0.0002)
    # transfer learning: load pretrained weights
    try:
        # map_location lets weights saved on a GPU load on a CPU-only machine
        net.load_state_dict(torch.load(weight_path, map_location=device))
    except Exception as e:
        print("load weight failed: {}".format(e))
        # make sure the weight directory exists so torch.save() can write to it later
        try:
            os.makedirs(os.path.join(root_path, "_weight"))
        except FileExistsError:
            print("already has path:{}".format(os.path.join(root_path, "_weight")))
        else:
            print("mkdir path")
    else:
        print("load weight successfully")
    # start training
    best_acc = 0.0
    for epoch in range(10):
        # train
        net.train()
        running_loss = 0.0
        t1 = time.perf_counter()
        for step, data in enumerate(train_data_loader, start=0):
            images, labels = data
            # move the batch to the device
            images = images.to(device)
            labels = labels.to(device)
            optimizer.zero_grad()  # clear the gradients from the previous step
            outputs = net(images)  # forward pass
            loss = loss_function(outputs, labels)  # compare predictions with labels
            loss.backward()  # backpropagate the error
            optimizer.step()  # update the parameters registered with the optimizer
            # print statistics
            running_loss += loss.item()  # accumulate the loss
            # cosmetic: print a progress bar
            rate = (step + 1) / len(train_data_loader)  # len(train_data_loader) = ceil(len(dataset) / batch_size) = ceil(3306 / 32) = 104
            a = "*" * int(rate * 50)
            b = "." * int((1 - rate) * 50)
            print("\rtrain loss: {:^3.0f}%[{}->{}]{:.3f}".format(int(rate * 100), a, b, loss.item()), end="")
        print()
        print("train time used about {} s".format(time.perf_counter() - t1))
        # test
        net.eval()  # evaluation mode: changes layers such as dropout; gradients are switched off by torch.no_grad() below
        acc = 0.0  # number of correct predictions in this epoch
        with torch.no_grad():
            for data_test in test_data_loader:
                test_images, test_labels = data_test
                outputs = net(test_images.to(device))
                predict_y = torch.max(outputs, dim=1)[1]
                acc += (predict_y == test_labels.to(device)).sum().item()
        accurate_test = acc / len(test_dataset)
        if accurate_test > best_acc:
            best_acc = accurate_test
            torch.save(net.state_dict(), weight_path)
        # average the loss over the number of batches, not over the last step index
        print("[epoch %d] train_loss: %.3f test_accuracy: %.3f" %
              (epoch + 1, running_loss / len(train_data_loader), accurate_test))
if __name__ == "__main__":
    args = parse_args()
    main(args)
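In the DataLoader stage we choose suitable Transforms and pass them to the Dataset, then hand the Dataset and the batch size to the DataLoader, which draws batch_size samples from the Dataset on every iteration. What matters most is choosing Transforms that suit the data and configuring the DataLoader sensibly.
In the DataLoader example below, the Transforms come from get_transform(), whose definition is not shown here: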
# dataloader
device = torch.device(args.device if torch.cuda.is_available() else "cpu")
batch_size = args.batch_size
num_workers = min([os.cpu_count(), batch_size if batch_size > 1 else 0, 8])
# segmentation num_classes + background
num_classes = args.num_classes + 1
# mean and std computed with compute_mean_std.py
mean = (0.709, 0.381, 0.224)
std = (0.127, 0.079, 0.043)
train_dataset = DriveDataset(args.data_path,
                             train=True,
                             transforms=get_transform(train=True, mean=mean, std=std))
val_dataset = DriveDataset(args.data_path,
                           train=False,
                           transforms=get_transform(train=False, mean=mean, std=std))
train_loader = torch.utils.data.DataLoader(train_dataset,
                                           batch_size=batch_size,
                                           num_workers=num_workers,
                                           shuffle=True,
                                           pin_memory=True,
                                           collate_fn=train_dataset.collate_fn)
val_loader = torch.utils.data.DataLoader(val_dataset,
                                         batch_size=1,
                                         num_workers=num_workers,
                                         pin_memory=True,
                                         collate_fn=val_dataset.collate_fn)
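To make the data flow concrete, here is a minimal sketch with a toy dataset standing in for DriveDataset. ToyDataset and its collate_fn are hypothetical names invented for illustration; the real DriveDataset is not shown in this post and may batch its samples differently.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Hypothetical stand-in for DriveDataset: 10 random 3x8x8 images with 8x8 masks."""

    def __init__(self, n=10):
        self.images = torch.rand(n, 3, 8, 8)
        self.masks = torch.randint(0, 2, (n, 8, 8))

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return self.images[idx], self.masks[idx]

    @staticmethod
    def collate_fn(batch):
        # batch is a list of (image, mask) pairs; stack each field into one tensor
        images, masks = zip(*batch)
        return torch.stack(images), torch.stack(masks)


dataset = ToyDataset()
loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=0,
                    collate_fn=dataset.collate_fn)

# Collect the shape of every batch produced in one pass over the data.
batch_shapes = [(tuple(images.shape), tuple(masks.shape)) for images, masks in loader]
print(len(loader))   # number of batches per epoch
print(batch_shapes)
```

With 10 samples and batch_size=4 the last batch holds only 2 samples, which is why a custom collate_fn has to cope with a variable batch length.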
This step specifies the model and the optimizer, and loads pretrained weights (transfer learning).
# model
model = create_model(num_classes=num_classes)
model.to(device)
# only parameters that still require gradients are handed to the optimizer
params_to_optimize = [p for p in model.parameters() if p.requires_grad]
# optimizer
optimizer = torch.optim.SGD(
    params_to_optimize,
    lr=args.lr, momentum=args.momentum, weight_decay=args.weight_decay
)
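The requires_grad filter matters mostly for transfer learning, where part of the network is frozen. The sketch below assumes a tiny two-layer model as a stand-in for create_model() (which is not shown in this post) and freezes its first layer; the layer sizes and hyperparameters are made up.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for create_model(): two linear layers.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Freeze the first layer, as one might do with pretrained backbone weights.
for p in model[0].parameters():
    p.requires_grad = False

# Only the parameters that are still trainable go to the optimizer.
params_to_optimize = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params_to_optimize, lr=0.01, momentum=0.9, weight_decay=1e-4)

total_params = len(list(model.parameters()))
print(total_params, len(params_to_optimize))
```

Here the model has four parameter tensors (two weights, two biases) and only the two belonging to the last layer are optimized.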
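We use torch.save() and torch.load() to save and load the training state. What gets stored is the dictionary returned by model.state_dict(), which holds the trained weights; after loading, model.load_state_dict() puts those weights back into the model, as follows: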
# save a .pth file
# decide what goes into the file
save_file = {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),  # optimizer state
             "lr_scheduler": lr_scheduler.state_dict(),
             "epoch": epoch,
             "args": args}
torch.save(save_file, "save_weights/best_model.pth")
# load a .pth file
# read the values back out of the .pth file
checkpoint = torch.load(args.resume, map_location='cpu')  # args.resume="save_weights/best_model.pth"; map_location='cpu' loads the tensors onto the CPU
model.load_state_dict(checkpoint['model'])  # look up by key; entries saved via .state_dict() are restored via .load_state_dict()
optimizer.load_state_dict(checkpoint['optimizer'])
lr_scheduler.load_state_dict(checkpoint['lr_scheduler'])
args.start_epoch = checkpoint['epoch'] + 1  # plain values that were not saved via .state_dict() are used directly
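A self-contained round trip may help check the pattern. This is a sketch under assumptions: a toy nn.Linear replaces the real model, the file goes to a temporary directory, and only state dicts plus an integer are stored. Recent PyTorch releases default torch.load to weights_only=True, which, as far as I know, refuses arbitrary pickled objects such as the argparse Namespace saved under "args" above, so that entry is left out here.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # toy model, not the network from this post
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Save: state dicts for stateful objects, plain values as they are.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "toy_checkpoint.pth")
torch.save({"model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "epoch": 3}, checkpoint_path)

# Load into freshly built objects, as a resumed run would.
restored_model = nn.Linear(4, 2)
restored_optimizer = torch.optim.SGD(restored_model.parameters(), lr=0.01, momentum=0.9)
checkpoint = torch.load(checkpoint_path, map_location="cpu")
restored_model.load_state_dict(checkpoint["model"])
restored_optimizer.load_state_dict(checkpoint["optimizer"])
start_epoch = checkpoint["epoch"] + 1  # resume from the epoch after the saved one

print(start_epoch)
print(torch.equal(model.weight, restored_model.weight))
```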
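Training happens inside the epoch loop. Each epoch trains the model, evaluates it, and saves the results, and the elapsed training time is recorded.
The complete training code is: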
# requires: import datetime, time
# file that records information from training and validation
results_file = "/home/yingmuzhi/unet/results{}.txt".format(datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
best_dice = 0.
start_time = time.time()
for epoch in range(args.start_epoch, args.epochs):
    mean_loss, lr = train_one_epoch(model, optimizer, train_loader, device, epoch, num_classes,
                                    lr_scheduler=lr_scheduler, print_freq=args.print_freq, scaler=scaler)
    confmat, dice = evaluate(model, val_loader, device=device, num_classes=num_classes)
    val_info = str(confmat)
    print(val_info)
    print(f"dice coefficient: {dice:.3f}")
    # write into txt
    with open(results_file, "a") as f:
        # record train_loss, lr and the validation metrics for each epoch
        train_info = f"[epoch: {epoch}]\n" \
                     f"train_loss: {mean_loss:.4f}\n" \
                     f"lr: {lr:.6f}\n" \
                     f"dice coefficient: {dice:.3f}\n"
        f.write(train_info + val_info + "\n\n")
    if args.save_best is True:
        if best_dice < dice:
            best_dice = dice
        else:
            continue  # no improvement: skip saving for this epoch
    save_file = {"model": model.state_dict(),
                 "optimizer": optimizer.state_dict(),
                 "lr_scheduler": lr_scheduler.state_dict(),
                 "epoch": epoch,
                 "args": args}
    if args.amp:
        save_file["scaler"] = scaler.state_dict()
    if args.save_best is True:
        torch.save(save_file, "save_weights/best_model.pth")
    else:
        torch.save(save_file, "save_weights/model_{}.pth".format(epoch))
total_time = time.time() - start_time
total_time_str = str(datetime.timedelta(seconds=int(total_time)))
print("training time {}".format(total_time_str))
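The save-best branch is easy to misread because of the continue, so here is a small pure-Python sketch of the same decision logic. The function name epochs_that_save and the dice scores are made up for illustration; nothing is actually written to disk.

```python
import datetime


def epochs_that_save(dice_per_epoch, save_best=True):
    """Return the epochs at which the loop above would write a checkpoint."""
    best_dice = 0.0
    saved = []
    for epoch, dice in enumerate(dice_per_epoch):
        if save_best:
            if best_dice < dice:
                best_dice = dice
            else:
                continue  # no improvement: skip the save for this epoch
        saved.append(epoch)
    return saved


dice_per_epoch = [0.50, 0.40, 0.70, 0.70, 0.80]  # made-up validation scores
best_only = epochs_that_save(dice_per_epoch, save_best=True)
every_epoch = epochs_that_save(dice_per_epoch, save_best=False)
elapsed = str(datetime.timedelta(seconds=3725))  # same formatting as total_time_str

print(best_only)    # [0, 2, 4]
print(every_epoch)  # [0, 1, 2, 3, 4]
print(elapsed)      # 1:02:05
```

Because the comparison is strict, an epoch that only ties the best score (epoch 3 here) does not overwrite the saved checkpoint.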
Use argparse to wrap up the parameters the script needs.
See https://blog.csdn.net/qq_43369406/article/details/127787799
An example parse_args function:
def parse_args():
    import argparse
    parser = argparse.ArgumentParser(description="pytorch unet training")
    parser.add_argument("--data-path", default="./", help="DRIVE root")
    # exclude background
    parser.add_argument("--num-classes", default=1, type=int)
    parser.add_argument("--device", default="cuda", help="training device")
    parser.add_argument("-b", "--batch-size", default=4, type=int)
    parser.add_argument("--epochs", default=200, type=int, metavar="N",
                        help="number of total epochs to train")
    parser.add_argument('--lr', default=0.01, type=float, help='initial learning rate')
    parser.add_argument('--momentum', default=0.9, type=float, metavar='M',
                        help='momentum')
    parser.add_argument('--wd', '--weight-decay', default=1e-4, type=float,
                        metavar='W', help='weight decay (default: 1e-4)',
                        dest='weight_decay')
    parser.add_argument('--print-freq', default=1, type=int, help='print frequency')
    parser.add_argument('--resume', default='', help='resume from checkpoint')
    parser.add_argument('--start-epoch', default=0, type=int, metavar='N',
                        help='start epoch')
    # NOTE: with type=bool, any non-empty value (even "False") is parsed as True
    parser.add_argument('--save-best', default=True, type=bool, help='only save best dice weights')
    # Mixed precision training parameters
    parser.add_argument("--amp", default=False, type=bool,
                        help="Use torch.cuda.amp for mixed precision training")
    args = parser.parse_args()
    return args


if __name__ == '__main__':
    args = parse_args()
    print(args.data_path)
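One caveat about the type=bool arguments in parse_args: argparse calls bool() on the raw string, and bool("False") is True, so such a flag cannot be switched off from the command line. The sketch below contrasts that with two standard alternatives. The option name --naive-save-best is made up for the demonstration, and BooleanOptionalAction needs Python 3.9 or newer.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="boolean flag sketch")
    # Pitfall: type=bool calls bool() on the raw string, and bool("False") is True.
    parser.add_argument("--naive-save-best", default=True, type=bool)
    # Alternative: a paired --save-best / --no-save-best switch (Python 3.9+).
    parser.add_argument("--save-best", default=True, action=argparse.BooleanOptionalAction)
    # Alternative: a plain on/off switch that defaults to False.
    parser.add_argument("--amp", action="store_true")
    return parser


parser = build_parser()
naive = parser.parse_args(["--naive-save-best", "False"])
fixed = parser.parse_args(["--no-save-best", "--amp"])
default = parser.parse_args([])

print(naive.naive_save_best)            # True, despite passing "False"
print(fixed.save_best, fixed.amp)       # False True
print(default.save_best, default.amp)   # True False
```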
DataLoader takes a dataset object and produces a DataLoader object. Its signature is:
torch.utils.data.DataLoader(dataset, batch_size=1,
shuffle=None, sampler=None, batch_sampler=None, num_workers=0,
collate_fn=None, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, multiprocessing_context=None, generator=None,
*, prefetch_factor=2, persistent_workers=False, pin_memory_device='')
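All we really need to know is that DataLoader takes a dataset object and produces a DataLoader object. The arguments we usually set are:
dataset: the dataset object to draw samples from.
batch_size: the number of images loaded in each iteration (one step, not one epoch). Its upper limit is set by the GPU memory of the hardware; a larger batch gives a less noisy gradient estimate, although it does not always train better.
shuffle: whether to shuffle the data.
num_workers: the number of worker subprocesses used to load data. It can be raised on Linux; on Windows, set it to 0, in which case loading happens in the main process.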
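The relation between dataset size, batch_size, and the number of iterations per epoch is simple arithmetic. The helper below is a hypothetical illustration (batches_per_epoch is not a PyTorch function); it reproduces the value that len() of a DataLoader reports, using the 3306-image training set from the AlexNet example.

```python
import math


def batches_per_epoch(num_samples, batch_size, drop_last=False):
    """Number of batches one pass over the dataset yields."""
    if drop_last:
        # the incomplete final batch is discarded
        return num_samples // batch_size
    # the final batch may be smaller than batch_size
    return math.ceil(num_samples / batch_size)


with_last = batches_per_epoch(3306, 32)
without_last = batches_per_epoch(3306, 32, drop_last=True)
print(with_last)     # 104
print(without_last)  # 103
```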
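iter() and next() are Python built-ins. iter() takes an iterable object (note that a list is iterable but is not itself an iterator) and returns an iterator over it. Its signature, where object is the iterable, is:
iter(object[, sentinel])
next() returns the next element from an iterator.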
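A short example of the two built-ins on a plain list, which behaves the same way a DataLoader does when it is iterated. The batch names are placeholders.

```python
batches = ["batch0", "batch1", "batch2"]  # a list is iterable, but not an iterator

iterator = iter(batches)      # ask the iterable for an iterator
first = next(iterator)        # "batch0"
second = next(iterator)       # "batch1"
remaining = list(iterator)    # whatever has not been consumed yet

# next() with a default avoids StopIteration once the iterator is exhausted
after_end = next(iterator, None)

print(first, second, remaining, after_end)
```

The same idiom, next(iter(dataloader)), is a common way to pull a single batch out of a DataLoader for inspection.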
The loss function is CrossEntropyLoss, whose signature is:
torch.nn.CrossEntropyLoss(weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean', label_smoothing=0.0)
The defaults are usually fine, so no arguments need to be passed.
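To show what the default settings compute, here is a plain-Python sketch of cross-entropy for class-index targets: the negative log of the softmax probability of the correct class, averaged over the batch as with reduction='mean'. It is an illustration of the formula, not PyTorch's implementation, and the logits are made up.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log(softmax(logits)[target])."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[target]


def mean_cross_entropy(batch_logits, targets):
    """Average the per-sample losses, mirroring reduction='mean'."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)


batch_logits = [[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]]  # raw scores, no softmax applied
targets = [0, 2]                                    # class indices
batch_loss = mean_cross_entropy(batch_logits, targets)
print(round(batch_loss, 4))
```

For the second sample all three classes are equally likely, so its loss is log(3), the value an untrained 3-class model would be expected to start near.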
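The optimizer is Adam, whose signature is:
torch.optim.Adam(params, lr=0.001,
betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False, *, foreach=None, maximize=False, capturable=False)
The params and lr arguments are the ones to set. params usually receives all of the network's parameters via net.parameters(), a method inherited from torch.nn.Module, and lr sets the initial learning rate (0.001 by default).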
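enumerate() wraps an iterable (such as a list, tuple, or string) so that each item comes paired with its index; every step yields an (index, item) tuple. It is mostly used in for loops, for example:
>>> seq = ['one', 'two', 'three']
>>> for i, element in enumerate(seq):
...     print(i, element)
...
0 one
1 two
2 three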
A DataLoader is an iterable, so passing it to enumerate() yields one batch per iteration. For this dataset, each batch is a 4-D tensor (images) and a 1-D tensor (labels).
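Before each batch of images is processed, optimizer.zero_grad() must be called to clear the gradients. The loss is accumulated across the batches of an epoch and reset to zero at the start of the next epoch. net.parameters() is updated on every batch, and therefore throughout every epoch.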
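# done once for every batch
optimizer.zero_grad()  # clear the gradients left over from the previous batch
outputs = net(inputs)  # forward pass: compute the predictions
loss = loss_function(outputs, labels)  # calling the loss module computes the loss between predictions and labels
loss.backward()  # backpropagation: compute the gradient of the loss with respect to each parameter
optimizer.step()  # update the parameters in net using those gradients
running_loss += loss.item()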
loss is a 0-dimensional tensor such as tensor(1.1); item() converts it to a plain Python number.
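The per-batch steps can be tried on a toy problem. The sketch below assumes a single nn.Linear layer and random data in place of the real network and images; the sizes are arbitrary.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
net = nn.Linear(4, 3)                    # toy model: 4 features -> 3 classes
loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=0.001)

inputs = torch.randn(8, 4)               # one batch of 8 samples
labels = torch.randint(0, 3, (8,))       # class indices in [0, 3)

weight_before = net.weight.detach().clone()
running_loss = 0.0

optimizer.zero_grad()                    # clear old gradients
outputs = net(inputs)                    # forward pass, shape [8, 3]
loss = loss_function(outputs, labels)    # 0-dimensional tensor
loss.backward()                          # fill .grad on every parameter
optimizer.step()                         # apply the Adam update
running_loss += loss.item()              # plain Python float

print(loss.dim(), type(loss.item()).__name__)
print(torch.equal(weight_before, net.weight))  # False once the weights have moved
```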
When accuracy is computed on the validation set, gradient tracking is switched off with torch.no_grad().
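The outputs produced by the network form a [batch, num_classes] tensor, here [10000, 10]. torch.max(outputs, dim=1) takes the maximum along dimension 1, that is, over the 10 class scores of each sample, and torch.max(outputs, dim=1)[1] returns the index (0-9) of that maximum, which is assigned to predict_y. So predict_y is a tensor with 10000 elements.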
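predict_y and test_label are both tensors of torch.Size([10000]). A tensor can be indexed like a list, for example predict_y[0], while a dimension of its torch.Size is read with tensor.size(0). As a special case, a 0-dimensional tensor such as tensor(1) is converted to a plain number with .item().
(predict_y == test_label).sum() counts the positions where predict_y and test_label agree and returns a 0-dimensional tensor such as tensor(12); .item() extracts its value. test_label.size(0) gives the size of test_label along its first dimension (10000). Dividing the two gives the accuracy; note that accuracy and precision are different metrics.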
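The same computation on a batch small enough to check by hand. The scores and labels are made up: 4 samples and 3 classes instead of 10000 and 10.

```python
import torch

# scores for 4 samples over 3 classes, shape [batch, num_classes]
outputs = torch.tensor([[0.1, 2.0, 0.3],
                        [1.5, 0.2, 0.1],
                        [0.0, 0.4, 0.9],
                        [0.7, 0.6, 0.2]])
test_label = torch.tensor([1, 0, 1, 0])

# torch.max along dim=1 returns (values, indices); the indices are the predicted classes
max_values, predict_y = torch.max(outputs, dim=1)
correct = (predict_y == test_label).sum().item()   # 0-dim tensor -> Python int
accuracy = correct / test_label.size(0)

print(predict_y.tolist(), correct, accuracy)       # [1, 0, 2, 0] 3 0.75
```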
Add the following to launch.json:
"purpose":["debug-in-terminal"]