Dataset/TensorDataset/DataLoader
DataLoader
CLASS torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=False, sampler=None, batch_sampler=None, num_workers=0, collate_fn=<function default_collate>, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None)
Parameters:
*dataset – dataset from which to load the data.
*batch_size – how many samples per batch to load (default: `1`).
*shuffle – set to `True` to have the data reshuffled at every epoch (default: `False`).
*sampler – defines the strategy to draw samples from the dataset. If specified, `shuffle` must be `False`.
*batch_sampler – like sampler, but returns a batch of indices at a time. Mutually exclusive with `batch_size`, `shuffle`, `sampler`, and `drop_last`.
*num_workers – how many subprocesses to use for data loading. `0` means that the data will be loaded in the main process. (default: `0`)
*collate_fn – merges a list of samples to form a mini-batch.
*pin_memory – If `True`, the data loader will copy tensors into CUDA pinned memory before returning them.
*drop_last – set to `True` to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If `False` and the size of dataset is not divisible by the batch size, then the last batch will be smaller. (default: `False`)
*timeout – if positive, the timeout value for collecting a batch from workers. Should always be non-negative. (default: `0`)
*worker_init_fn – If not `None`, this will be called on each worker subprocess with the worker id (an int in `[0, num_workers - 1]`) as input, after seeding and before data loading. (default: `None`)
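To make `batch_size`, `shuffle`, and `drop_last` concrete, here is a minimal sketch. The ten-sample toy data and the variable names are assumptions made for illustration, and `TensorDataset` is used only as a convenient container for that toy data.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten toy samples (illustrative data): features 0..9, labels are the features doubled.
features = torch.arange(10, dtype=torch.float32)
labels = features * 2
toy_set = TensorDataset(features, labels)

# Default behaviour: the short final batch is kept.
keep_loader = DataLoader(toy_set, batch_size=3, shuffle=False, drop_last=False)
# drop_last=True discards the final batch because it holds fewer than 3 samples.
drop_loader = DataLoader(toy_set, batch_size=3, shuffle=False, drop_last=True)

keep_sizes = [len(x) for x, _ in keep_loader]
drop_sizes = [len(x) for x, _ in drop_loader]
print(keep_sizes)  # [3, 3, 3, 1]
print(drop_sizes)  # [3, 3, 3]

# shuffle=True changes the order of an epoch, not which samples it contains.
shuffled = DataLoader(toy_set, batch_size=3, shuffle=True)
seen = sorted(v.item() for x, _ in shuffled for v in x)
print(seen == features.tolist())  # True
```

With 10 samples and a batch size of 3, the last batch holds a single sample unless `drop_last=True` removes it.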
DataLoader combines a dataset with a sampler and provides single-process or multi-process iteration over the whole dataset.
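The way a sampler plugs into a DataLoader can be sketched as follows. This is only an illustration: the choice of `SubsetRandomSampler` and the list of even indices are assumptions, not something the original text prescribes.

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

values = torch.arange(10, dtype=torch.float32)
toy_set = TensorDataset(values)

# Illustrative choice: draw only the even-indexed samples, in random order.
even_indices = [0, 2, 4, 6, 8]
sampler = SubsetRandomSampler(even_indices)

# shuffle stays at its default (False) because the sampler already decides the order.
loader = DataLoader(toy_set, batch_size=2, sampler=sampler)

drawn = []
for (batch,) in loader:  # a one-tensor dataset yields one-element batches
    drawn.extend(batch.tolist())
print(sorted(drawn))  # [0.0, 2.0, 4.0, 6.0, 8.0]
```

The dataset answers "what is sample i", the sampler answers "which i comes next", and the DataLoader groups the drawn samples into batches.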
Dataset
CLASS torch.utils.data.Dataset
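Dataset is an abstract class in PyTorch. All datasets should subclass it and override `__len__` and `__getitem__`, where `__getitem__` supports integer indexing from 0 to len(dataset) - 1.
Example:
We generate a dataset (x, y) where y = 5x + x sin(x) + noise.
The code is as follows: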
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


# Define Mydataset as a subclass of Dataset and override __getitem__ and __len__.
class Mydataset(Dataset):
    def __init__(self, num):
        super(Mydataset, self).__init__()
        self.num = num  # number of points (samples) to generate

        def linear_f(x):
            # y = 5*x + x*sin(x) + noise
            y = 5 * x + np.sin(x) * x + np.random.normal(
                0, scale=1, size=x.size)
            return y

        self.x_train = np.linspace(0, 50, num=self.num)  # num evenly spaced points in [0, 50]
        self.y_train = linear_f(self.x_train)
        self.x_train = torch.Tensor(self.x_train)  # convert to tensors
        self.y_train = torch.Tensor(self.y_train)

    # Indexing: return the (x, y) pair stored at the given index.
    def __getitem__(self, index):
        return self.x_train[index], self.y_train[index]

    # Return the number of samples in the dataset, which is num.
    def __len__(self):
        return self.num


if __name__ == "__main__":
    num = 30
    myset = Mydataset(num=num)
    myloader = DataLoader(dataset=myset, batch_size=1, shuffle=False)
    for data in myloader:
        print(data)
Output (the y values change from run to run because of the random noise):
[tensor([0.]), tensor([0.6236])]
[tensor([1.7241]), tensor([10.0654])]
[tensor([3.4483]), tensor([16.2440])]
[tensor([5.1724]), tensor([20.8587])]
[tensor([6.8966]), tensor([37.4495])]
[tensor([8.6207]), tensor([49.7567])]
[tensor([10.3448]), tensor([43.4765])]
[tensor([12.0690]), tensor([56.8027])]
[tensor([13.7931]), tensor([81.5497])]
[tensor([15.5172]), tensor([79.9304])]
[tensor([17.2414]), tensor([69.9289])]
[tensor([18.9655]), tensor([95.5326])]
[tensor([20.6897]), tensor([123.2667])]
[tensor([22.4138]), tensor([102.8235])]
[tensor([24.1379]), tensor([99.7624])]
[tensor([25.8621]), tensor([146.6746])]
[tensor([27.5862]), tensor([155.5347])]
[tensor([29.3103]), tensor([122.1721])]
[tensor([31.0345]), tensor([144.0730])]
[tensor([32.7586]), tensor([194.8411])]
[tensor([34.4828]), tensor([175.9601])]
[tensor([36.2069]), tensor([144.5954])]
[tensor([37.9310]), tensor([198.9595])]
[tensor([39.6552]), tensor([233.5584])]
[tensor([41.3793]), tensor([185.2545])]
[tensor([43.1034]), tensor([182.1095])]
[tensor([44.8276]), tensor([258.0302])]
[tensor([46.5517]), tensor([258.3729])]
[tensor([48.2759]), tensor([199.5736])]
[tensor([50.]), tensor([235.5305])]
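The output above uses `batch_size=1`, so every tensor holds one element. As a rough sketch of what a larger batch size does, the snippet below stacks samples into batches. `CurveDataset` is a hypothetical stand-in for the dataset class above, and the noise term is left out (an assumption) so that the result is reproducible.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class CurveDataset(Dataset):  # hypothetical name, same idea as the class above
    def __init__(self, num):
        x = np.linspace(0, 50, num=num)
        y = 5 * x + x * np.sin(x)  # noise omitted on purpose for reproducibility
        self.x = torch.tensor(x, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.float32)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, index):
        return self.x[index], self.y[index]


dataset = CurveDataset(num=30)
loader = DataLoader(dataset, batch_size=8, shuffle=False)

# Each batch is a pair of 1-D tensors; only the last one is shorter.
shapes = [tuple(x_batch.shape) for x_batch, y_batch in loader]
print(shapes)  # [(8,), (8,), (8,), (6,)]
```

The default collate function stacks the individual scalars returned by `__getitem__` along a new first dimension, which is why 30 samples become three batches of 8 and one of 6.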
TensorDataset
CLASS torch.utils.data.TensorDataset(*tensors)
Parameters: *tensors – tensors that have the same size of the first dimension.
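A dataset that wraps tensors: pass in tensors whose first dimensions have the same size, and each sample is retrieved by indexing along that first dimension.
Example: the same (x, y) dataset as in the previous example.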
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
import torch


def linear_f(x):
    # y = 5*x + x*sin(x) + noise
    y = 5 * x + np.sin(x) * x + np.random.normal(
        0, scale=1, size=x.size)
    return y


if __name__ == "__main__":
    num = 30
    x_train = np.linspace(0, 50, num=num)  # num evenly spaced points in [0, 50]
    y_train = linear_f(x_train)
    x_train = torch.Tensor(x_train)  # convert to tensors
    y_train = torch.Tensor(y_train)
    myset = TensorDataset(x_train, y_train)
    myloader = DataLoader(dataset=myset, batch_size=1, shuffle=False)
    for data in myloader:
        print(data)
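The result should be similar to the previous example; the x values are identical and the y values differ only through the random noise.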
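To see what "indexing along the first dimension" means without a DataLoader, a TensorDataset can be indexed directly. The small tensors below are made-up illustration data, not taken from the example above.

```python
import torch
from torch.utils.data import TensorDataset

# Illustrative tensors: both have 4 entries along the first dimension.
x = torch.tensor([0.0, 1.0, 2.0, 3.0])
y = torch.tensor([[0.0, 0.0], [1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

pairs = TensorDataset(x, y)

print(len(pairs))  # 4, the size of the shared first dimension

# Indexing returns a tuple holding the i-th slice of every wrapped tensor.
sample_x, sample_y = pairs[2]
print(sample_x.item(), sample_y.tolist())  # 2.0 [2.0, 20.0]
```

The tensors may differ in their remaining dimensions; only the first dimension has to agree.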