Making Full Use of GPU Resources: Accelerating Custom-Dataset Preprocessing with DALI

Accelerating data preprocessing with DALI

The main contents of this article:

  1. Introduction to the DALI library and installation
  2. Data preprocessing with DALI
  3. GPU-accelerated training

1. Introduction to the DALI Library and Installation

1.1 The Usual Way of Loading Datasets

I have recently been doing some training work in PyTorch on a dataset of about 110,000 images. Since the dataset is not in one of the officially supported formats, I wrote a custom PyTorch Dataset implementation. PyTorch's data-loading pattern is fairly standardized by now; the dataset I am working with combines two images and one label per sample. The custom dataset class looks like this:

from PIL import Image
import torch
from torch.utils.data import Dataset
import numpy as np
from torchvision import transforms

class MyDataSet(Dataset):
    def __init__(self, img_name: list, labels: list, img_path: str, seg_img_path: str, transform=None):
        self.img_name = img_name
        self.img_path = img_path
        self.seg_img_path = seg_img_path
        self.img_labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.img_name)

    def __getitem__(self, item):
        if self.img_labels[item] == 0:
            image_path = self.img_path + r'accident_img/' + str(self.img_name[item]) + '.png'
            seg_img_path = self.seg_img_path + r'accident_seg/' + str(self.img_name[item]) + '.png'
        elif self.img_labels[item] == 1:
            image_path = self.img_path + r'not_accident_img/' + str(self.img_name[item]) + '.png'
            seg_img_path = self.seg_img_path + r'no_accident_seg/' + str(self.img_name[item]) + '.png'
        img = Image.open(image_path)
        seg_img = Image.open(seg_img_path)
        if img.mode != 'RGB':
            raise ValueError("image: {} isn't RGB mode.".format(image_path))

        tran_Tensor = transforms.ToTensor()  # tensor conversion

        img = tran_Tensor(img)
        seg_img = torch.from_numpy(np.array(seg_img))
        seg_img = torch.unsqueeze(seg_img, 0)

        # concatenate RGB image and segmentation mask into one 4-channel tensor
        # so a single transform is applied to both consistently
        fuse_img = torch.cat([img, seg_img], dim=0)
        img_label = self.img_labels[item]

        if self.transform is not None:
            fuse_img = self.transform(fuse_img)
        # split the transformed tensor back into image (3 channels) and mask (1 channel)
        image, seg_img = fuse_img[:3], torch.unsqueeze(fuse_img[-1], 0)
        return image, seg_img, img_label

    @staticmethod
    def collate_fn(batch):
        # see the official default_collate implementation for reference:
        # https://github.com/pytorch/pytorch/blob/67b7e751e6b5931a9f45274653f4f653a4e6cdf6/torch/utils/data/_utils/collate.py
        img, seg_img, img_label = tuple(zip(*batch))

        images = torch.stack(img, dim=0)
        seg_img = torch.stack(seg_img, dim=0)

        img_label = torch.as_tensor(img_label)
        return images, seg_img, img_label
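The fuse-then-split trick above (stack the mask as a fourth channel so one transform acts on image and mask together, then split afterwards) is easy to verify on shapes alone. A NumPy analogue of the same idea, with an arbitrary 8x8 size just for illustration:

```python
import numpy as np

h, w = 8, 8
img = np.random.rand(3, h, w).astype(np.float32)  # RGB image, CHW layout
seg = np.random.rand(1, h, w).astype(np.float32)  # 1-channel segmentation mask

# fuse: one (4, h, w) tensor, so a geometric transform hits both consistently
fuse = np.concatenate([img, seg], axis=0)

# split back after the (here omitted) transform
image, seg_back = fuse[:3], fuse[-1][np.newaxis]
```

Because no transform is applied in this sketch, the split halves must equal the originals exactly.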

The preprocessing steps are then combined with transforms.Compose, and the dataset is consumed through a DataLoader:

"""
图像预处理
"""
data_transform = {
    "train": transforms.Compose([transforms.RandomRotation(degrees=10, expand=True),
                                 transforms.RandomHorizontalFlip(),
                                 transforms.RandomResizedCrop(256),
                                 # transforms.ToTensor(),
                                 transforms.Normalize([0.485, 0.456, 0.406, 0], [0.229, 0.224, 0.225, 1])]),
    "val": transforms.Compose([transforms.Resize(256),
                               # transforms.ToTensor(),
                               transforms.Normalize([0.485, 0.456, 0.406, 0], [0.229, 0.224, 0.225, 1])])}
"""
数据加载
"""
train_loader = torch.utils.data.DataLoader(train_dataset,
                                               batch_size=batch_size,
                                               shuffle=True,
                                               pin_memory=True,
                                               num_workers=4,
                                               collate_fn=train_dataset.collate_fn)
val_loader = torch.utils.data.DataLoader(val_dataset,
                                          batch_size=batch_size,
                                          shuffle=True,
                                          pin_memory=True,
                                          num_workers=4,
                                          collate_fn=train_dataset.collate_fn)

There is a problem with this approach, though. When PyTorch preprocesses a dataset, the transforms normally run on the CPU, and the data is only moved to the GPU during the training and validation steps. No matter how high you set num_workers, a reasonably fast GPU will end up waiting for data to arrive. With this setup, one epoch over my 110,000 images took roughly 20 minutes (admittedly, other parts of the pipeline may not have been fully optimized either). After trying NVIDIA's DALI library the results were genuinely good, so I am writing this post to document the process.
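Before reaching for DALI, it is worth measuring how much of each iteration actually blocks on `next(loader)` versus the training step itself. A minimal, framework-free sketch; the simulated loader and step below are hypothetical stand-ins, with sleeps in place of real I/O and GPU work:

```python
import time

def data_wait_fraction(loader, train_step, n_batches=10):
    """Fraction of wall time the loop spends waiting for data."""
    data_time = 0.0
    step_time = 0.0
    it = iter(loader)
    for _ in range(n_batches):
        t0 = time.perf_counter()
        batch = next(it)       # blocks until the loader has a batch ready
        t1 = time.perf_counter()
        train_step(batch)      # stands in for forward/backward/optimizer
        t2 = time.perf_counter()
        data_time += t1 - t0
        step_time += t2 - t1
    return data_time / (data_time + step_time)

def slow_loader():
    # simulated loader: each batch takes ~20 ms to produce
    while True:
        time.sleep(0.02)
        yield "batch"

# simulated training step takes ~10 ms, so roughly two thirds of the
# loop is spent waiting on data: a data-bound pipeline
frac = data_wait_fraction(slow_loader(), lambda b: time.sleep(0.01))
```

If the measured fraction is large, a faster GPU or more num_workers will not help much; the preprocessing itself has to move.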

1.2 A Brief Introduction to DALI and Installation

DALI's main role here is to move image preprocessing onto the GPU, so the GPU is already fully used during data preprocessing and loading (at the cost of some extra GPU memory). Rather than going into too much detail, I will mainly share an installation link that will get you using the library quickly:

Release DALI v1.6.0 · NVIDIA/DALI · GitHub
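For pip-based installs, the DALI documentation uses CUDA-versioned wheels from NVIDIA's package index; the package name must match your CUDA version (e.g. nvidia-dali-cuda102 for CUDA 10.2). For a CUDA 11 build the command has this form:

```shell
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-cuda110
```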

2. Data Preprocessing with DALI

The DALI code for my custom dataset is as follows:

import random
import torch
from nvidia.dali.plugin.pytorch import DALIGenericIterator
from nvidia.dali.pipeline import Pipeline
import nvidia.dali.ops as ops
import nvidia.dali.types as types
import numpy as np
from nvidia import dali
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
from PIL import Image
from torchvision import transforms


class DataSource(object):
    def __init__(self, img_name: list, labels: list, img_path: str, seg_img_path: str, shuffle=True, batch_size=64):
        self.batch_size = batch_size
        self.img_name = img_name
        self.img_path = img_path
        self.seg_img_path = seg_img_path
        self.img_labels = labels
        self.paths = list(zip(img_name, labels))
        if shuffle:
            random.shuffle(self.paths)

    def __iter__(self):
        self.i = 0
        return self

    def __next__(self):
        imgs = []
        seg_imgs = []
        labels = []

        if self.i >= len(self.paths):
            self.__iter__()  # reset so the next epoch starts clean
            raise StopIteration

        for _ in range(self.batch_size):
            img_name, label = self.paths[self.i % len(self.paths)]
            if label == 0:
                image_path = self.img_path + r'accident_img/' + str(img_name) + '.png'
                seg_img_path = self.seg_img_path + r'accident_seg/' + str(img_name) + '.png'
            elif label == 1:
                image_path = self.img_path + r'not_accident_img/' + str(img_name) + '.png'
                seg_img_path = self.seg_img_path + r'no_accident_seg/' + str(img_name) + '.png'
            """
            DALI-style reading: feed the raw encoded bytes and let the GPU decode
            """
            with open(image_path, 'rb') as img_file:
                imgs.append(np.frombuffer(img_file.read(), dtype=np.uint8))
            with open(seg_img_path, 'rb') as img_seg_file:
                seg_imgs.append(np.frombuffer(img_seg_file.read(), dtype=np.uint8))
            labels.append(np.array([label]))
            """
            Alternative: decode on the CPU ourselves (slower)
            """
            # img = Image.open(image_path)
            # seg_img = Image.open(seg_img_path)
            # imgs.append(np.array(img))
            # seg_imgs.append(np.array(seg_img))
            # labels.append(np.array([label]))
            self.i += 1

        return (imgs, seg_imgs, labels)

    def __len__(self):
        return len(self.paths)

    next = __next__
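The contract DataSource implements for DALI's external source is: return one fixed-size batch per __next__ call, raise StopIteration at the epoch boundary, and pad the final batch by wrapping around with modulo indexing. That contract can be sketched and tested without any DALI dependency (the class and names here are illustrative, not DALI API):

```python
import random

class BatchSource:
    """Minimal external-source-style iterator: fixed-size batches,
    one StopIteration per epoch, modulo wrap-around padding."""
    def __init__(self, samples, batch_size, shuffle=True):
        self.samples = list(samples)
        self.batch_size = batch_size
        if shuffle:
            random.shuffle(self.samples)

    def __iter__(self):
        self.i = 0
        return self

    def __next__(self):
        if self.i >= len(self.samples):
            self.__iter__()  # reset for the next epoch
            raise StopIteration
        # wrap around so the last batch is padded from the start of the list
        batch = [self.samples[(self.i + k) % len(self.samples)]
                 for k in range(self.batch_size)]
        self.i += self.batch_size
        return batch

    def __len__(self):
        return len(self.samples)

source = BatchSource(range(10), batch_size=4, shuffle=False)
epoch = list(source)  # 10 samples, batch_size 4 -> 3 batches; last one wraps: [8, 9, 0, 1]
```

Because __iter__ resets the cursor before StopIteration is raised, the same object can be iterated again for the next epoch, which is what the pipeline's iter_setup relies on.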

class SourcePipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, external_data, modeltype):
        super(SourcePipeline, self).__init__(batch_size,
                                             num_threads,
                                             device_id,
                                             seed=12,
                                             exec_async=True,
                                             exec_pipelined=True,
                                             prefetch_queue_depth=2)
        self.input_data = ops.ExternalSource(num_outputs=3)
        self.external_data = external_data
        self.model_type = modeltype
        self.iterator = iter(self.external_data)
        self.decode = ops.decoders.Image(device="mixed", output_type=types.RGB)
        self.cat = ops.Cat(device="gpu", axis=2)
        self.tran = ops.Transpose(device="gpu", perm=[2, 0, 1])
        self.crop = ops.RandomResizedCrop(device="gpu", size=256, random_area=[0.08, 1.25])
        self.resize = ops.Resize(device='gpu', resize_x=256, resize_y=256)
        # random horizontal flip: CoinFlip draws a per-sample 0/1 that is fed
        # to CropMirrorNormalize as its `mirror` argument
        self.coin = ops.random.CoinFlip(probability=0.5)
        self.cmnp = ops.CropMirrorNormalize(device="gpu",
                                            output_dtype=types.FLOAT,
                                            output_layout=types.NCHW,
                                            mean=[0.485 * 255, 0.456 * 255, 0.406 * 255, 0],
                                            std=[0.229 * 255, 0.224 * 255, 0.225 * 255, 1])

    def define_graph(self):
        self.img, self.img_seg, self.labels = self.input_data()

        # An earlier attempt fed already-decoded arrays instead of encoded
        # bytes (images = self.img; img_seg = self.img_seg[:, :, dali.newaxis];
        # then fn.cat + self.tran), but the results were not good.

        """
        Decode the images; reduce seg_img to a single channel
        """
        image = self.decode(self.img)
        img_seg = self.decode(self.img_seg)
        img_seg = img_seg[:, :, 0:1]

        """
        Concatenate along the channel axis so both share one preprocessing path
        """
        fuse_img = self.cat(image, img_seg)

        if self.model_type == 'train':
            """
            4-channel image: Normalize, HWC --> CHW, random flip, random crop.
            Note that define_graph runs only once, so a Python-level
            random.random() here would fix the flip decision for the entire
            pipeline; the CoinFlip op draws a fresh 0/1 per sample instead.
            """
            fuse_img = self.cmnp(fuse_img, mirror=self.coin())
            fuse_img = self.crop(fuse_img)
        if self.model_type == 'val':
            fuse_img = self.cmnp(fuse_img)
            fuse_img = self.resize(fuse_img)

        """
        Split the fused tensor back into its sources
        """
        image = fuse_img[0:3]
        img_seg = fuse_img[-1]
        img_seg = img_seg[dali.newaxis]

        """
        Label handling
        """
        label = self.labels[0]

        return (image, img_seg, label)

    def iter_setup(self):
        try:
            image, seg_img, labels = next(self.iterator)
            self.feed_input(self.img, image)
            self.feed_input(self.img_seg, seg_img)
            self.feed_input(self.labels, labels)
        except StopIteration:
            self.iterator = iter(self.external_data)
            raise StopIteration


class CustomDALIGenericIterator(DALIGenericIterator):
    def __init__(self, length, pipelines, output_map, **argw):
        self._len = length  # length of the loader, in batches
        super().__init__(pipelines, output_map, **argw)

    def __next__(self):
        batch = super().__next__()
        return self.parse_batch(batch)

    def __len__(self):
        return self._len

    def parse_batch(self, batch):
        # batch is a list with one dict per pipeline, keyed by output_map
        img = batch[0]['imgs']
        seg_img = batch[0]['seg_imgs']
        label = batch[0]["labels"]  # bs * 1

        return {"image": img, "seg_image": seg_img, "labels": label}
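Stripped of DALI itself, parse_batch is a plain dictionary re-mapping: DALIGenericIterator yields a list with one dict per pipeline, keyed by the names given in output_map, and we rename those keys for the training loop. The placeholder tensors below are stand-ins for real GPU tensors:

```python
def parse_batch(batch):
    # batch: one entry per pipeline; each entry maps output_map names to tensors
    out = batch[0]
    return {"image": out["imgs"], "seg_image": out["seg_imgs"], "labels": out["labels"]}

# placeholder values in place of actual DALI tensors
fake_batch = [{"imgs": "img-tensor", "seg_imgs": "seg-tensor", "labels": [0, 1]}]
parsed = parse_batch(fake_batch)
```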

These classes are then used from the main function. The relevant code first defines a data-loading function:

import math
import pandas as pd
from nvidia.dali.plugin.pytorch import LastBatchPolicy

def Data_preprocessed(batch_size=64, num_threads=12):
    """
    Parameters + data-source definition
    """
    train_data_path = r'csv_dataset/train.csv'
    val_data_path = r'csv_dataset/val.csv'
    img_path = r'../BSVI_imgs/'
    seg_img_path = r'../seg_dataset/'
    train_data = pd.read_csv(train_data_path)
    val_data = pd.read_csv(val_data_path)
    train_img_name, train_label = train_data.iloc[:, 0].tolist(), train_data.iloc[:, -1].tolist()
    val_img_name, val_label = val_data.iloc[:, 0].tolist(), val_data.iloc[:, -1].tolist()
    """
    Data loading
    """
    train_eii = DataSource(batch_size=batch_size, img_name=train_img_name, labels=train_label, img_path=img_path,
                           seg_img_path=seg_img_path, shuffle=True)
    train_pipe = SourcePipeline(batch_size=batch_size, num_threads=num_threads, device_id=0, external_data=train_eii, modeltype='train')
    # with LastBatchPolicy.PARTIAL the smaller last batch is kept, so the
    # number of batches per epoch is the ceiling of samples / batch_size
    train_iter = CustomDALIGenericIterator(math.ceil(len(train_eii) / batch_size), pipelines=[train_pipe],
                                           output_map=["imgs", "seg_imgs", "labels"],
                                           last_batch_padded=True,
                                           last_batch_policy=LastBatchPolicy.PARTIAL,
                                           auto_reset=True)

    val_eii = DataSource(batch_size=batch_size, img_name=val_img_name, labels=val_label, img_path=img_path,
                         seg_img_path=seg_img_path, shuffle=True)
    val_pipe = SourcePipeline(batch_size=batch_size, num_threads=num_threads, device_id=0, external_data=val_eii, modeltype='val')
    val_iter = CustomDALIGenericIterator(math.ceil(len(val_eii) / batch_size), pipelines=[val_pipe],
                                         output_map=["imgs", "seg_imgs", "labels"],
                                         last_batch_padded=True,
                                         last_batch_policy=LastBatchPolicy.PARTIAL,
                                         auto_reset=True)

    train_loader = train_iter
    val_loader = val_iter
    return train_loader, val_loader, len(train_eii), len(val_eii)

It is used in the main function as follows:

"""
 训练\验证数据读取
 """
 train_loader, val_loader, train_number, val_number = Data_preprocessed(batch_size=128,num_threads=12)

 train_batch_number = int(train_loader._len)
 val_batch_number = int(val_loader._len)
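Since the iterator length is derived from len(dataset) / batch_size, it pays to be explicit about rounding: with LastBatchPolicy.PARTIAL the incomplete last batch is kept, so the number of batches per epoch is the ceiling of that ratio. A quick sanity check, using 110,000 as a stand-in figure for the dataset size mentioned earlier:

```python
import math

def batches_per_epoch(n_samples, batch_size, partial_last_batch=True):
    """Number of iterator steps per epoch."""
    if partial_last_batch:            # PARTIAL keeps the incomplete tail batch
        return math.ceil(n_samples / batch_size)
    return n_samples // batch_size    # DROP discards the incomplete tail

train_batches = batches_per_epoch(110_000, 128)  # -> 860
```

Plain integer division would report 859 and silently skip the tail samples every epoch.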

From here on, the data behaves just like a regular PyTorch loader. I followed the post below to align the DALI iterator with a standard dataloader, and after that it is the usual training and validation routine:
大力(DALI)出奇迹,一文看懂 Pytorch 使用 NVIDA.DALI 加载自定义数据 dataloader

3. GPU-Accelerated Training

DALI is admittedly a bit complex, so I recommend reading the official documentation, which makes it easier to understand:
DALI official documentation
After testing, this approach cut one training epoch from about 20 minutes down to just over 4 minutes, a speedup of nearly 5x.
The most involved part is building the Pipeline; looking at the demos helps a lot, and there are some good examples on GitHub as well.

References

  1. DALI official documentation
  2. 大力(DALI)出奇迹,一文看懂 Pytorch 使用 NVIDA.DALI 加载自定义数据 dataloader
  3. nvidia.dali:深度学习加速神器!
