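Background
I started with Tensorflow, got hooked on keras, and then left that comfort zone for pytorch. The root cause was the Tianchi Xuelang AI manufacturing data competition: with almost the same network model and parameters and a similar data preprocessing pipeline, the resulting scores differed so much that I could not accept it, so I switched to pytorch. I now use keras only for some NLP projects (after all, I have accumulated a few "heirloom models" there).
Note: this project uses the traffic sign dataset as its example; download traffic-sign if you need it. Full code: pytorch-image-classification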
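Update: second update on October 22, 2018, version 0.1.1
Changes:
- Data augmentation moved from pytorch's built-in transforms to custom ones, which makes later changes for multi-channel models easier and also lets us use opencv's powerful library for preprocessing (pytorch's own data reading uses PIL).
- Output is printed through a logger and updated dynamically.
- The best model is checked and saved every half epoch.
- Version used: pytorch 0.4.0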
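0. Image classification framework structure
After studying the theory (machine learning, deep learning, convolutional neural networks, structuring machine learning projects), getting hands-on with a real project is often the bottleneck; only when you can apply what you have learned flexibly can you claim to have learned it. I took all of those courses in the first semester of my first year of graduate school, but it was only after I started an internship in the second semester that I gradually became able to complete projects on my own and even enter some data competitions.
When I use pytorch, I divide a project into seven parts: data loading, model definition, evaluation metric definition, training procedure, validation procedure, test procedure, and parameter definition.
The files are organized as follows: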
==============================================================
- checkpoints/
    - bestmodels/
- dataset/
    - aug.py
    - dataloader.py
- logs/
- models/
    - pretrained_models/
    - model.py
- submit/
- config.py
- main.py
- utils.py
==============================================================
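- checkpoints/: stores the models saved during training (bestmodels/ holds the model that performs best on the validation set);
- models/: stores custom models; if you do not want to use the network models that ship with pytorch, add your own here (remember to add an __init__.py file);
- submit/: the prediction output, i.e. the result file the competition asks you to submit, usually in csv format;
- logs/: stores the training logs (.txt files);
- dataset/: contains aug.py and dataloader.py, which implement data augmentation and data loading;
- config.py: the parameter definition file; it defines, as a parameter class, everything that needs to be set or changed in advance, such as data paths, learning rate, and number of training epochs;
- model.py: defines model loading; optional, but I like to keep it separate because it makes fine-tuning easier;
- utils.py: defines some commonly used evaluation metrics, such as mAP, Accuracy, and loss;
- main.py: the main file, containing the training, testing, and validation procedures.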
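1. Parameter definitions: config.py
There are many ways to define parameters: some people set them directly in the main file, some prefer the argparse module, and others use a json file. None of these feels concise enough to me, so I prefer a separate config.py containing a Python class whose class attributes are the parameters, as shown below: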
class DefaultConfigs(object):
    #1.string parameters
    train_data = "../data/train/"
    test_data = ""
    val_data = "../data/val/"
    model_name = "resnet50"
    weights = "./checkpoints/"
    best_models = weights + "best_model/"
    submit = "./submit/"
    logs = "./logs/"
    gpus = "1"
    #2.numeric parameters
    epochs = 40
    batch_size = 4
    img_height = 224
    img_weight = 224  # image width; the name is kept as-is because dataloader.py reads config.img_weight
    num_classes = 62
    seed = 888
    lr = 1e-3
    lr_decay = 1e-4
    weight_decay = 1e-4

config = DefaultConfigs()
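One convenience of a class-based config is that single values can be overridden at run time without editing the file. The following is a minimal sketch of that idea, not part of the original project: `override` is a hypothetical helper, and the class is a trimmed-down copy of `DefaultConfigs` so the example runs on its own. Unknown option names are rejected so that typos surface early.

```python
class DefaultConfigs(object):
    # trimmed-down copy of the config class, for illustration only
    model_name = "resnet50"
    epochs = 40
    batch_size = 4
    lr = 1e-3


def override(cfg, **kwargs):
    """Hypothetical helper: set attributes on cfg, refusing unknown names."""
    for key, value in kwargs.items():
        if not hasattr(cfg, key):
            raise AttributeError("unknown config option: %s" % key)
        setattr(cfg, key, value)
    return cfg


# override two values on the instance; the class defaults stay untouched
config = override(DefaultConfigs(), lr=1e-4, batch_size=8)
print(config.lr, config.batch_size, config.epochs)
```

Because the overrides are set on the instance, other code that reads `DefaultConfigs.lr` directly still sees the default.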
2. Data loading: dataloader.py
pytorch offers two ways to read data. In the first, images of different classes are separated into folders, as in the traffic sign dataset:
- train/
    - 00000/
        - 01153_00000.png
        - 01153_00001.png
    - 00001/
        - 00025_00000.png
        - 00025_00001.png
- 00000/
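torchvision's ImageFolder, which reads this kind of layout, assigns labels by sorting the sub-folder names and numbering them from 0. The stdlib-only sketch below mimics that mapping on a temporary directory; `folder_to_index` is a hypothetical name used for illustration (it is not a torchvision function), and the folder names are made up.

```python
import os
import tempfile


def folder_to_index(root):
    """Map each class folder under root to an index, sorted by folder name."""
    classes = sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )
    return {name: idx for idx, name in enumerate(classes)}


with tempfile.TemporaryDirectory() as root:
    # build a tiny layout with non-consecutive class folders
    for name in ["00010", "00000", "00002"]:
        os.makedirs(os.path.join(root, name))
    mapping = folder_to_index(root)

print(mapping)
```

Note that the folder `00010` gets index 2, not 10; a loader that derives labels with `int(folder_name)` would assign a different label to the same folder.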
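This layout can be read directly with torchvision.datasets.ImageFolder:
import torch
import torchvision

train_data = torchvision.datasets.ImageFolder(
    "/data2/dockspace_zcj/traffic-sign/train/",  # path to the image folders
    transform = None  # put your augmentation here; it must end with ToTensor() so batches can be stacked
)
data_loader = torch.utils.data.DataLoader(train_data,
                                          batch_size=20,
                                          shuffle=True
                                          )
# During training you only need to iterate over data_loader;
# see the main file for how it is used.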
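aug.py
There is too much code to show here; see GitHub for details. The commonly used augmentation methods are all listed in this file; to add more, follow the existing examples.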
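The custom transforms operate on the numpy arrays that opencv returns rather than on PIL images. As a rough sketch of the calling convention only, a container and a random horizontal flip could look like the code below. Only the names `Compose` and `RandomHflip` come from the project; the class bodies and the `p` parameter are assumptions, and the real implementations in the repository may differ.

```python
import random

import numpy as np


class Compose(object):
    """Apply a list of callables to an image, one after another."""

    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, img):
        for t in self.transforms:
            img = t(img)
        return img


class RandomHflip(object):
    """Flip an H x W x C image left-right with probability p (assumed 0.5)."""

    def __init__(self, p=0.5):
        self.p = p

    def __call__(self, img):
        if random.random() < self.p:
            # reverse the width axis; copy so the result is contiguous
            return img[:, ::-1, :].copy()
        return img


# toy 2 x 3 image with 3 channels; p=1.0 forces the flip
img = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)
flipped = Compose([RandomHflip(p=1.0)])(img)
print(flipped[0, 0], img[0, 2])  # first column after the flip vs. last column before
```

Keeping every transform a plain callable on arrays is what makes it easy to mix in opencv operations or handle inputs with more than three channels.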
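Since the augmentation is custom, I load the data by subclassing torch.utils.data.Dataset: create a Python class for data loading and apply the data augmentation in its __getitem__(self, index) method. The code is as follows: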
from torch.utils.data import Dataset
from torchvision import transforms as T
from config import config
from PIL import Image
from dataset.aug import *
from itertools import chain
from glob import glob
from tqdm import tqdm
import random
import numpy as np
import pandas as pd
import os
import cv2
import torch
#1.set random seed
random.seed(config.seed)
np.random.seed(config.seed)
torch.manual_seed(config.seed)
torch.cuda.manual_seed_all(config.seed)
#2.define dataset
class ChaojieDataset(Dataset):
    def __init__(self,label_list,transforms=None,train=True,test=False):
        self.test = test
        self.train = train
        imgs = []
        if self.test:
            for index,row in label_list.iterrows():
                imgs.append((row["filename"]))
            self.imgs = imgs
        else:
            for index,row in label_list.iterrows():
                imgs.append((row["filename"],row["label"]))
            self.imgs = imgs
        if transforms is None:
            if self.test or not train:
                self.transforms = Compose([
                    Resize((config.img_weight,config.img_height)),
                    Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
                ])
            else:
                self.transforms = Compose([
                    Resize((config.img_weight,config.img_height)),
                    FixRandomRotate(bound='Random'),
                    RandomHflip(),
                    RandomVflip(),
                    Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
                ])
        else:
            self.transforms = transforms
    def __getitem__(self,index):
        if self.test:
            filename = self.imgs[index]
            img = cv2.imread(filename)
            img = cv2.cvtColor(img,cv2.COLOR_BGR2RGB)
            img = self.transforms(img)
            return torch.from_numpy(img).float(),filename
        else:
            filename,label = self.imgs[index]
            img = cv2.imread(filename)
            img = cv2.cvtColor(img,cv2.COLOR_BGR2RGB)
            img = self.transforms(img)
            return torch.from_numpy(img).float(),label
    def __len__(self):
        return len(self.imgs)
def collate_fn(batch):
    imgs = []
    label = []
    for sample in batch:
        imgs.append(sample[0])
        label.append(sample[1])
    return torch.stack(imgs, 0), \
           label
def get_files(root,mode):
    #for test
    if mode == "test":
        files = []
        for img in os.listdir(root):
            files.append(root + img)
        files = pd.DataFrame({"filename":files})
        return files
    elif mode in ("train","val"):
        #for train and val
        all_data_path,labels = [],[]
        image_folders = list(map(lambda x:root+x,os.listdir(root)))
        all_images = list(chain.from_iterable(list(map(lambda x:glob(x+"/*.png"),image_folders))))
        print("loading train dataset")
        for file in tqdm(all_images):
            all_data_path.append(file)
            #the label is the name of the class folder
            labels.append(int(file.split("/")[-2]))
        all_files = pd.DataFrame({"filename":all_data_path,"label":labels})
        return all_files
    else:
        print("check the mode please!")
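Note: get_files(root, mode) returns a pandas DataFrame so that, for datasets that come without a validation set, the training set can easily be split at random; its main benefit is that the random split can be kept balanced across classes.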
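A balanced (stratified) split keeps each class's share roughly the same in the training and validation parts. The sketch below shows the idea with the standard library only; `stratified_split`, `val_ratio`, and the sample file names are assumptions for illustration. In practice sklearn's `train_test_split(..., stratify=labels)` does the same job on the DataFrame returned by get_files.

```python
import random
from collections import defaultdict


def stratified_split(filenames, labels, val_ratio=0.1, seed=888):
    """Split (filename, label) pairs so every class keeps about val_ratio in val."""
    by_label = defaultdict(list)
    for name, label in zip(filenames, labels):
        by_label[label].append(name)

    rng = random.Random(seed)
    train, val = [], []
    for label in sorted(by_label):
        names = by_label[label]
        rng.shuffle(names)
        # take at least one validation sample from every class
        n_val = max(1, int(round(len(names) * val_ratio)))
        val += [(n, label) for n in names[:n_val]]
        train += [(n, label) for n in names[n_val:]]
    return train, val


# made-up file list: 20 images of class 0 and 10 images of class 1
files = ["img_%03d.png" % i for i in range(30)]
labels = [0] * 20 + [1] * 10
train, val = stratified_split(files, labels, val_ratio=0.2)
print(len(train), len(val))
```

With a plain random split, a rare class can end up entirely in one part; splitting per class avoids that.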
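3. Model definition: model.py
I created this file for two reasons: too much code in the main file looks bloated, and it also makes modifying the model for fine-tuning harder. Taking resnet101 as an example: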
import torchvision
import torch.nn.functional as F
from torch import nn
from config import config
def get_net():
    #return MyModel(torchvision.models.resnet101(pretrained = True))
    model = torchvision.models.resnet101(pretrained = True)
    model.avgpool = nn.AdaptiveAvgPool2d(1)
    model.fc = nn.Linear(2048,config.num_classes)
    return model
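4. Evaluation metrics: utils.py
Unlike keras, pytorch does not ship model evaluation metrics ready-made. In keras you only need to pass metrics=["acc"] when compiling the model, and other metrics can be added the same way. You can also use the torchnet module, but I found it lacked the metrics I wanted, so I defined my own; the difference is small.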
class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self):
        self.reset()
    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0
    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count
def accuracy(y_pred, y_actual, topk=(1, )):
    """Computes the precision@k for the specified values of k"""
    maxk = max(topk)
    batch_size = y_actual.size(0)
    _, pred = y_pred.topk(maxk, 1, True, True)
    pred = pred.t()
    correct = pred.eq(y_actual.view(1, -1).expand_as(pred))
    res = []
    for k in topk:
        # reshape (not view) because the slice of the transposed tensor may be non-contiguous;
        # keepdim=True keeps a 1-element tensor so callers can index the result with [0]
        correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
        res.append(correct_k.mul_(100.0 / batch_size))
    return res
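To make the top-k logic concrete: a sample counts as correct at k when its true label is among the k highest-scoring classes. The pure-Python reference below computes the same percentage on plain lists; `topk_accuracy` and the toy scores are illustrative assumptions, not part of utils.py.

```python
def topk_accuracy(scores, targets, k=1):
    """Percentage of samples whose target is among the k highest scores."""
    hits = 0
    for row, target in zip(scores, targets):
        # class indices ordered from highest to lowest score
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(targets)


# toy batch: 4 samples, 3 classes
scores = [
    [0.1, 0.7, 0.2],  # best class 1, target 1: hit at k=1
    [0.5, 0.3, 0.2],  # best class 0, target 1: hit only at k=2
    [0.2, 0.3, 0.5],  # best class 2, target 0: miss at k=1 and k=2
    [0.6, 0.3, 0.1],  # best class 0, target 0: hit at k=1
]
targets = [1, 1, 0, 0]
top1 = topk_accuracy(scores, targets, k=1)
top2 = topk_accuracy(scores, targets, k=2)
print(top1, top2)
```

Top-2 can never be lower than top-1 on the same batch, which is a quick sanity check when reading the training log.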
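5. Main file: main.py
The reason we have to define the training, validation, and test functions ourselves is that pytorch does not provide them ready-made. The details are commented in the code; feel free to contact me if you have questions.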
# -*- coding: utf-8 -*-
# @Time : 2018/7/31 09:41
# @Author : Spytensor
# @File : main.py
#====================================================
# Model training / validation / prediction
#====================================================
import os
import random
import time
import json
import torch
import torchvision
import numpy as np
import pandas as pd
import warnings
from datetime import datetime
from torch import nn,optim
from config import config
from collections import OrderedDict
from torch.autograd import Variable
from torch.utils.data import DataLoader
from dataset.dataloader import *
from sklearn.model_selection import train_test_split,StratifiedKFold
from timeit import default_timer as timer
from models.model import *
from utils import *
#1. set random.seed and cudnn performance
random.seed(config.seed)
np.random.seed(config.seed)
torch.manual_seed(config.seed)
torch.cuda.manual_seed_all(config.seed)
os.environ["CUDA_VISIBLE_DEVICES"] = config.gpus
torch.backends.cudnn.benchmark = True
warnings.filterwarnings('ignore')
#2. evaluate func
def evaluate(val_loader,model,criterion):
    #2.1 define meters
    losses = AverageMeter()
    top1 = AverageMeter()
    top2 = AverageMeter()
    #2.2 switch to evaluate mode and confirm model has been transferred to cuda
    model.cuda()
    model.eval()
    with torch.no_grad():
        for i,(input,target) in enumerate(val_loader):
            input = Variable(input).cuda()
            target = Variable(torch.from_numpy(np.array(target)).long()).cuda()
            #2.2.1 compute output
            output = model(input)
            loss = criterion(output,target)
            #2.2.2 measure accuracy and record loss
            precision1,precision2 = accuracy(output,target,topk=(1,2))
            losses.update(loss.item(),input.size(0))
            top1.update(precision1[0],input.size(0))
            top2.update(precision2[0],input.size(0))
    return [losses.avg,top1.avg,top2.avg]
#3. test model on public dataset and save the probability matrix
def test(test_loader,model,folds):
    #3.1 confirm the model converted to cuda
    csv_map = OrderedDict({"filename":[],"probability":[]})
    model.cuda()
    model.eval()
    for i,(input,filepath) in enumerate(tqdm(test_loader)):
        #3.2 change everything to cuda and get only basename
        filepath = [os.path.basename(x) for x in filepath]
        with torch.no_grad():
            image_var = Variable(input).cuda()
            #3.3.output
            #print(filepath)
            #print(input,input.shape)
            y_pred = model(image_var)
            print(y_pred.shape)
            smax = nn.Softmax(1)
            smax_out = smax(y_pred)
        #3.4 save probability to csv files
        csv_map["filename"].extend(filepath)
        for output in smax_out:
            prob = ";".join([str(i) for i in output.data.tolist()])
            csv_map["probability"].append(prob)
    result = pd.DataFrame(csv_map)
    result["probability"] = result["probability"].map(lambda x : [float(i) for i in x.split(";")])
    result.to_csv("./submit/{}_submission.csv" .format(config.model_name + "_" + str(folds)),index=False,header = None)
#4. more details to build main function
def main():
    fold = 0
    #4.1 mkdirs
    if not os.path.exists(config.submit):
        os.mkdir(config.submit)
    if not os.path.exists(config.weights):
        os.mkdir(config.weights)
    if not os.path.exists(config.best_models):
        os.mkdir(config.best_models)
    if not os.path.exists(config.logs):
        os.mkdir(config.logs)
    if not os.path.exists(config.weights + config.model_name + os.sep +str(fold) + os.sep):
        os.makedirs(config.weights + config.model_name + os.sep +str(fold) + os.sep)
    if not os.path.exists(config.best_models + config.model_name + os.sep +str(fold) + os.sep):
        os.makedirs(config.best_models + config.model_name + os.sep +str(fold) + os.sep)
    #4.2 get model and optimizer
    model = get_net()
    model = torch.nn.DataParallel(model)
    model.cuda()
    optimizer = optim.SGD(model.parameters(),lr = config.lr,momentum=0.9,weight_decay=config.weight_decay)
    #optimizer = optim.Adam(model.parameters(),lr = config.lr,amsgrad=True,weight_decay=config.weight_decay)
    criterion = nn.CrossEntropyLoss().cuda()
    log = Logger()
    log.open(config.logs + "log_train.txt",mode="a")
    log.write("\n------------------------------------ [START %s] %s\n\n" % (datetime.now().strftime('%Y-%m-%d %H:%M:%S'), '-' * 40))
    #4.3 some parameters for K-fold and restart model
    start_epoch = 0
    best_precision1 = 0
    resume = False
    #4.4 restart the training process
    if resume:
        checkpoint = torch.load(config.best_models + str(fold) + "/model_best.pth.tar")
        start_epoch = checkpoint["epoch"]
        fold = checkpoint["fold"]
        best_precision1 = checkpoint["best_precision1"]
        model.load_state_dict(checkpoint["state_dict"])
        optimizer.load_state_dict(checkpoint["optimizer"])
    #4.5 get files and split for K-fold dataset
    #4.5.1 read files
    train_data_list = get_files(config.train_data,"train")
    val_data_list = get_files(config.val_data,"val")
    #test_files = get_files(config.test_data,"test")
    """
    #if no validation set is provided, split the training set here
    #4.5.2 split
    split_fold = StratifiedKFold(n_splits=3)
    folds_indexes = split_fold.split(X=origin_files["filename"],y=origin_files["label"])
    folds_indexes = np.array(list(folds_indexes))
    fold_index = folds_indexes[fold]
    #4.5.3 using fold index to split for train data and val data
    train_data_list = pd.concat([origin_files["filename"][fold_index[0]],origin_files["label"][fold_index[0]]],axis=1)
    val_data_list = pd.concat([origin_files["filename"][fold_index[1]],origin_files["label"][fold_index[1]]],axis=1)
    """
    #train_data_list,val_data_list = train_test_split(origin_files,test_size = 0.1,stratify=origin_files["label"])
    #4.5.4 load dataset
    train_dataloader = DataLoader(ChaojieDataset(train_data_list),batch_size=config.batch_size,shuffle=True,collate_fn=collate_fn,pin_memory=True)
    val_dataloader = DataLoader(ChaojieDataset(val_data_list,train=False),batch_size=config.batch_size * 2,shuffle=True,collate_fn=collate_fn,pin_memory=False)
    #test_dataloader = DataLoader(ChaojieDataset(test_files,test=True),batch_size=1,shuffle=False,pin_memory=False)
    #scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer,"max",verbose=1,patience=3)
    scheduler = optim.lr_scheduler.StepLR(optimizer,step_size = 5,gamma=0.1)
    #4.5.5.1 define metrics
    train_losses = AverageMeter()
    train_top1 = AverageMeter()
    train_top2 = AverageMeter()
    valid_loss = [np.inf,0,0]
    model.train()
    #logs
    log.write('** start training here! **\n')
    log.write(' |------------ VALID -------------|----------- TRAIN -------------| \n')
    log.write('lr iter epoch | loss top-1 top-2 | loss top-1 top-2 | time \n')
    log.write('----------------------------------------------------------------------------------------------------\n')
    #4.5.5 train
    start = timer()
    for epoch in range(start_epoch,config.epochs):
        scheduler.step(epoch)
        #4.5.5.2 train
        for iter,(input,target) in enumerate(train_dataloader):
            lr = get_learning_rate(optimizer)
            #evaluate every half epoch
            if iter == len(train_dataloader) // 2:
                valid_loss = evaluate(val_dataloader,model,criterion)
                is_best = valid_loss[1] > best_precision1
                best_precision1 = max(valid_loss[1],best_precision1)
                save_checkpoint({
                    "epoch":epoch + 1,
                    "model_name":config.model_name,
                    "state_dict":model.state_dict(),
                    "best_precision1":best_precision1,
                    "optimizer":optimizer.state_dict(),
                    "fold":fold,
                    "valid_loss":valid_loss,
                },is_best,fold)
                #adjust learning rate
                #scheduler.step(valid_loss[1])
                print("\r",end="",flush=True)
                log.write('%0.8f %5.1f %6.1f | %0.3f %0.3f %0.3f | %0.3f %0.3f %0.3f | %s' % (\
                    lr, iter/len(train_dataloader) + epoch, epoch,
                    valid_loss[0], valid_loss[1], valid_loss[2],
                    train_losses.avg, train_top1.avg, train_top2.avg,
                    time_to_str((timer() - start),'min'))
                )
                log.write('\n')
                time.sleep(0.01)
            #4.5.5 switch to continue train process
            #scheduler.step(epoch)
            model.train()
            input = Variable(input).cuda()
            target = Variable(torch.from_numpy(np.array(target)).long()).cuda()
            output = model(input)
            loss = criterion(output,target)
            precision1_train,precision2_train = accuracy(output,target,topk=(1,2))
            train_losses.update(loss.item(),input.size(0))
            train_top1.update(precision1_train[0],input.size(0))
            train_top2.update(precision2_train[0],input.size(0))
            #backward
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            lr = get_learning_rate(optimizer)
            print('\r',end='',flush=True)
            print('%0.8f %5.1f %6.1f | %0.3f %0.3f %0.3f | %0.3f %0.3f %0.3f | %s' % (\
                lr, iter/len(train_dataloader) + epoch, epoch,
                valid_loss[0], valid_loss[1], valid_loss[2],
                train_losses.avg, train_top1.avg, train_top2.avg,
                time_to_str((timer() - start),'min'))
            , end='',flush=True)
    # best_model = torch.load(config.best_models + os.sep+ str(fold) + 'model_best.pth.tar')
    # model.load_state_dict(best_model["state_dict"])
    # test(test_dataloader,model,fold)
if __name__ =="__main__":
    main()
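main.py calls several helpers that come in through `from utils import *` but are not listed in this post: `Logger`, `get_learning_rate`, `time_to_str`, and `save_checkpoint`. The sketch below guesses at the first three from how they are called and from the log format; the bodies are assumptions, not the repository's code, and `FakeOptimizer` is a stand-in so the example runs without pytorch. `save_checkpoint` is left out because it depends on `torch.save`.

```python
import os
import sys
import tempfile


def time_to_str(seconds, mode="min"):
    """Format elapsed seconds as '0 hr 01 min' (or '1 min 05 sec' for mode='sec')."""
    seconds = int(seconds)
    if mode == "min":
        minutes = seconds // 60
        return "%d hr %02d min" % (minutes // 60, minutes % 60)
    if mode == "sec":
        return "%d min %02d sec" % (seconds // 60, seconds % 60)
    raise ValueError("mode must be 'min' or 'sec'")


def get_learning_rate(optimizer):
    """Return the learning rate of the first parameter group."""
    return optimizer.param_groups[0]["lr"]


class Logger(object):
    """Write every message to the terminal and, once opened, to a log file."""

    def __init__(self):
        self.terminal = sys.stdout
        self.file = None

    def open(self, path, mode="a"):
        self.file = open(path, mode)

    def write(self, message):
        self.terminal.write(message)
        self.terminal.flush()
        if self.file is not None:
            self.file.write(message)
            self.file.flush()


class FakeOptimizer(object):
    # stand-in exposing the same param_groups layout as a torch optimizer
    param_groups = [{"lr": 1e-3}]


log_path = os.path.join(tempfile.mkdtemp(), "log_train.txt")
log = Logger()
log.open(log_path, mode="a")
log.write("%0.8f | %s\n" % (get_learning_rate(FakeOptimizer()), time_to_str(75)))
log.file.close()
```

Reading the rate from `param_groups` rather than from the config means the logged value reflects whatever the scheduler has set for the current epoch.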
6. Training results
------------------------------------ [START 2018-10-22 19:47:48] ----------------------------------------
loading train dataset
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4572/4572 [00:00<00:00, 589769.58it/s]
loading train dataset
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2520/2520 [00:00<00:00, 603496.98it/s]
** start training here! **
|------------ VALID -------------|----------- TRAIN -------------|
lr iter epoch | loss top-1 top-2 | loss top-1 top-2 | time
----------------------------------------------------------------------------------------------------
0.00010000 0.5 0.0 | 0.578 82.063 91.706 | 1.661 63.354 72.242 | 0 hr 01 min
0.00010000 1.5 1.0 | 0.254 93.532 96.270 | 0.936 78.442 85.356 | 0 hr 04 min
0.00010000 2.5 2.0 | 0.226 94.563 97.619 | 0.691 83.567 89.771 | 0 hr 06 min
0.00010000 3.5 3.0 | 0.186 91.944 97.976 | 0.551 86.738 92.206 | 0 hr 09 min
0.00010000 4.5 4.0 | 0.214 95.357 99.087 | 0.461 88.771 93.700 | 0 hr 11 min
0.00010000 5.5 5.0 | 0.111 97.222 99.246 | 0.399 90.161 94.699 | 0 hr 14 min
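7. Summary
Whichever framework you use, the best one is the one you are comfortable with. In my experience `pytorch` is still less polished than `keras` and `tensorflow` and has some hard-to-understand `bug`s, so it is best to run the code with the matching version; this whole project uses `pytorch 0.4.0`. Finally, a disclaimer: I put this article together while working on an image classification problem, by combining several pieces of existing code and adding a few modules for my own needs; the code I referred to is listed in the references.
In addition, the full code is here: [pytorch-image-classification](https://github.com/spytensor/pytorch-image-classification)!
8. References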
- [pytorch-classification](https://github.com/bearpaw/pytorch-classification)
- [pytorch-best-practice](https://github.com/chenyuntc/pytorch-best-practice)