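DataLoader is a key data-loading interface in PyTorch, and almost every PyTorch training script uses it. Its job is to take a custom Dataset and, according to options such as the batch size and whether to shuffle, assemble the samples into Tensors that each hold one batch. In older PyTorch versions the batch then had to be wrapped in a Variable before being fed to the model; since PyTorch 0.4, Tensors can be passed to the model directly.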
Data loading in PyTorch mainly involves three classes, and the overall process has four steps:
1. Dataset
2. DataLoader
3. DataLoaderIter
4. Loop over the DataLoader to train
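I. Pseudocode for the process
from torch.utils.data import DataLoader

dataset = MyDataset()  # MyDataset: a user-defined Dataset subclass
dataloader = DataLoader(dataset)
num_epochs = 100
for epoch in range(num_epochs):
    for img, label in dataloader:
        ...  # forward pass, loss, backward pass, optimizer step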
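The first three can be seen as layers of wrapping: Dataset is the data set itself, DataLoader wraps the Dataset, and internally DataLoaderIter does the actual iteration. This post only takes a brief look at the DataLoader class.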
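The layering can be made visible with a small example. This is only a sketch: SquaresDataset is a hypothetical toy dataset invented for illustration, and the explicit iter() and next() calls spell out what a for loop over the DataLoader does implicitly.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample i is the pair (i, i * i)."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        x = torch.tensor([float(index)])
        y = torch.tensor([float(index * index)])
        return x, y


dataset = SquaresDataset(10)                    # layer 1: the Dataset
dataloader = DataLoader(dataset, batch_size=4)  # layer 2: the DataLoader
batch_iter = iter(dataloader)                   # layer 3: the iterator object

# Pull the first batch by hand.
first_x, first_y = next(batch_iter)

# A for loop over the DataLoader creates a fresh iterator and calls next() for us.
batch_shapes = [tuple(x.shape) for x, _ in dataloader]
print(batch_shapes)
```

With 10 samples and batch_size=4, the loop yields two full batches and one smaller final batch.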
II. Constructor
class torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
    batch_sampler=None, num_workers=0, collate_fn=default_collate,
    pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None)
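III. DataLoader parameters
- dataset (Dataset): the dataset to load data from.
- batch_size (int, optional): the number of samples loaded per batch. Default: 1.
- shuffle (bool, optional): whether to reshuffle the data at the start of every epoch. Default: False.
- sampler (Sampler, optional): the strategy for drawing samples from the dataset. If it is specified, shuffle must be False.
- batch_sampler (Sampler, optional): like sampler, but returns the indices of one whole batch at a time. It is mutually exclusive with batch_size, shuffle, sampler, and drop_last: if batch_sampler is set, those four must be left at their defaults.
- num_workers (int, optional): the number of subprocesses used for data loading. 0 means the data is loaded in the main process. Default: 0.
- collate_fn (callable, optional): the function that merges a list of samples into a mini-batch.
- pin_memory (bool, optional): if True, the data loader copies tensors into CUDA pinned memory before returning them.
- drop_last (bool, optional): what to do with a final incomplete batch. For example, with batch_size set to 64 and 100 samples per epoch, True discards the last 36 samples of the epoch; False (the default) keeps them, so the last batch is simply smaller.
- timeout (numeric, optional): if positive, the time in seconds to wait for one batch from the worker processes; if no batch arrives within that time, the loader raises a RuntimeError instead of waiting further. Must be non-negative. Default: 0.
- worker_init_fn (callable, optional): the initialization function for each worker. If it is not None, it is called in each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading. Default: None.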
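The drop_last arithmetic can be checked with a few lines of plain Python. The helper batch_sizes below is hypothetical and not part of PyTorch; it is a sketch that mirrors the behavior described for drop_last, using the same 100-sample, batch_size=64 example.

```python
def batch_sizes(num_samples, batch_size, drop_last):
    """Return the size of each batch in one epoch (hypothetical helper)."""
    full, remainder = divmod(num_samples, batch_size)
    sizes = [batch_size] * full
    if remainder and not drop_last:
        sizes.append(remainder)  # keep the smaller final batch
    return sizes


kept = batch_sizes(100, 64, drop_last=False)
dropped = batch_sizes(100, 64, drop_last=True)
print(kept, dropped)
```

With drop_last=False the epoch has one batch of 64 and one of 36; with drop_last=True only the batch of 64 remains.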
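The DataLoader's processing logic is as follows: it first fetches individual samples through the Dataset's __getitem__ method, gathers one batch worth of them into a list, and then passes that list to the function given as collate_fn, which merges it into the final batch.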
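That fetch-then-collate logic can be sketched in plain Python. This is a simplified illustration, not PyTorch's actual implementation: fetch_batch and transpose_collate are hypothetical names, and a plain list stands in for the Dataset because it also supports indexing.

```python
def transpose_collate(samples):
    """Toy collate function: turn [(x0, y0), (x1, y1), ...] into ([x0, x1, ...], [y0, y1, ...])."""
    xs, ys = zip(*samples)
    return list(xs), list(ys)


def fetch_batch(dataset, indices, collate_fn):
    """Fetch each sample by index, then merge the list with collate_fn."""
    samples = [dataset[i] for i in indices]  # each lookup calls dataset.__getitem__
    return collate_fn(samples)


toy_dataset = [(i, i % 2) for i in range(6)]  # stand-in dataset of (value, label) pairs
xs, ys = fetch_batch(toy_dataset, [3, 0, 5], transpose_collate)
print(xs, ys)
```

The indices here play the role of one batch produced by the batch sampler; PyTorch's default collate function additionally stacks the samples into Tensors.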
IV. Source code
class DataLoader(object):
    r"""
    Data loader. Combines a dataset and a sampler, and provides
    single- or multi-process iterators over the dataset.

    Arguments:
        dataset (Dataset): dataset from which to load the data.
        batch_size (int, optional): how many samples per batch to load
            (default: 1).
        shuffle (bool, optional): set to ``True`` to have the data reshuffled
            at every epoch (default: False).
        sampler (Sampler, optional): defines the strategy to draw samples from
            the dataset. If specified, ``shuffle`` must be False.
        batch_sampler (Sampler, optional): like sampler, but returns a batch of
            indices at a time. Mutually exclusive with batch_size, shuffle,
            sampler, and drop_last.
        num_workers (int, optional): how many subprocesses to use for data
            loading. 0 means that the data will be loaded in the main process.
            (default: 0)
        collate_fn (callable, optional): merges a list of samples to form a mini-batch.
        pin_memory (bool, optional): If ``True``, the data loader will copy tensors
            into CUDA pinned memory before returning them.
        drop_last (bool, optional): set to ``True`` to drop the last incomplete batch,
            if the dataset size is not divisible by the batch size. If ``False`` and
            the size of dataset is not divisible by the batch size, then the last batch
            will be smaller. (default: False)
        timeout (numeric, optional): if positive, the timeout value for collecting a batch
            from workers. Should always be non-negative. (default: 0)
        worker_init_fn (callable, optional): If not None, this will be called on each
            worker subprocess with the worker id (an int in ``[0, num_workers - 1]``) as
            input, after seeding and before data loading. (default: None)

    .. note:: By default, each worker will have its PyTorch seed set to
              ``base_seed + worker_id``, where ``base_seed`` is a long generated
              by main process using its RNG. However, seeds for other libraries
              may be duplicated upon initializing workers (e.g., NumPy), causing
              each worker to return identical random numbers. (See
              :ref:`dataloader-workers-random-seed` section in FAQ.) You may
              use ``torch.initial_seed()`` to access the PyTorch seed for each
              worker in :attr:`worker_init_fn`, and use it to set other seeds
              before data loading.

    .. warning:: If ``spawn`` start method is used, :attr:`worker_init_fn` cannot be an
                 unpicklable object, e.g., a lambda function.
    """

    __initialized = False

    def __init__(self, dataset, batch_size=1, shuffle=False, sampler=None, batch_sampler=None,
                 num_workers=0, collate_fn=default_collate, pin_memory=False, drop_last=False,
                 timeout=0, worker_init_fn=None):
        self.dataset = dataset
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.collate_fn = collate_fn
        self.pin_memory = pin_memory
        self.drop_last = drop_last
        self.timeout = timeout
        self.worker_init_fn = worker_init_fn

        if timeout < 0:
            raise ValueError('timeout option should be non-negative')

        if batch_sampler is not None:
            if batch_size > 1 or shuffle or sampler is not None or drop_last:
                raise ValueError('batch_sampler option is mutually exclusive '
                                 'with batch_size, shuffle, sampler, and '
                                 'drop_last')
            self.batch_size = None
            self.drop_last = None

        if sampler is not None and shuffle:
            raise ValueError('sampler option is mutually exclusive with '
                             'shuffle')

        if self.num_workers < 0:
            raise ValueError('num_workers option cannot be negative; '
                             'use num_workers=0 to disable multiprocessing.')

        if batch_sampler is None:
            if sampler is None:
                if shuffle:
                    sampler = RandomSampler(dataset)  # shuffles the indices; this also uses a ready-made class
                else:
                    sampler = SequentialSampler(dataset)
            batch_sampler = BatchSampler(sampler, batch_size, drop_last)

        self.sampler = sampler
        self.batch_sampler = batch_sampler
        self.__initialized = True

    def __setattr__(self, attr, val):
        if self.__initialized and attr in ('batch_size', 'sampler', 'drop_last'):
            raise ValueError('{} attribute should not be set after {} is '
                             'initialized'.format(attr, self.__class__.__name__))

        super(DataLoader, self).__setattr__(attr, val)

    def __iter__(self):
        return _DataLoaderIter(self)

    def __len__(self):
        return len(self.batch_sampler)
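The constructor builds its batch_sampler by wrapping a SequentialSampler or a RandomSampler in a BatchSampler. The sketch below composes the same classes by hand to show what they produce. A plain list stands in for the dataset, on the assumption that these samplers only need its length.

```python
from torch.utils.data import BatchSampler, RandomSampler, SequentialSampler

data = list(range(10))  # stand-in dataset; only len(data) matters here

# shuffle=False path: indices in order, grouped into batches of 4.
sequential = BatchSampler(SequentialSampler(data), batch_size=4, drop_last=False)
sequential_batches = [list(batch) for batch in sequential]
print(sequential_batches)

# shuffle=True path: a random permutation of the indices, incomplete batch dropped.
shuffled = BatchSampler(RandomSampler(data), batch_size=4, drop_last=True)
shuffled_batches = [list(batch) for batch in shuffled]
print(shuffled_batches)
```

len() of a BatchSampler is the number of batches per epoch, which is exactly what DataLoader.__len__ returns.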
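V. Usage
from torch.utils.data import DataLoader

train_data = trainset()  # the dataset; trainset is a user-defined Dataset subclass
trainloader = DataLoader(train_data, batch_size=4, shuffle=True)  # wrap the dataset and set the loading options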
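For a version that runs end to end, the sketch below swaps in a TensorDataset of random numbers for the user-defined trainset and trains a tiny linear model for two epochs. The data, model, learning rate, and epoch count are all assumptions chosen for illustration.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Random stand-in data: 10 samples with 3 features and 1 target each.
features = torch.randn(10, 3)
targets = torch.randn(10, 1)
train_data = TensorDataset(features, targets)
trainloader = DataLoader(train_data, batch_size=4, shuffle=True)

model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

seen_batch_sizes = []
for epoch in range(2):
    for inputs, labels in trainloader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        optimizer.step()
        seen_batch_sizes.append(inputs.shape[0])

print(seen_batch_sizes)
```

Shuffling changes which samples land in each batch, but not the batch sizes: every epoch has two batches of 4 and a final batch of 2.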