For my current project I only plan to use the data part of pytorchvideo. The pytorch-lightning wrapper doesn't play well with my existing code, so everything outside the data part stays plain PyTorch, and this code analysis is likewise based on the plain PyTorch + pytorchvideo.data combination.
There isn't much to say about this part, except one thing to watch: video_sampler takes a class, not an instance.
The video sampler decides which video to pick from the available paths; the clip sampler decides which clip to take from the chosen video.
self._video_random_generator = None
if video_sampler == torch.utils.data.RandomSampler:
    self._video_random_generator = torch.Generator()
    self._video_sampler = video_sampler(
        self._labeled_videos, generator=self._video_random_generator
    )
else:
    self._video_sampler = video_sampler(self._labeled_videos)
This code makes it clear that the dataset really expects RandomSampler as the video sampler, because using anything else is awkward. For example, I once had imbalanced samples and wanted WeightedRandomSampler, but there is no way to pass weights through this interface, so the only option seemed to be modifying the source.
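One workaround that avoids patching the library: since the dataset only ever calls video_sampler(self._labeled_videos), a small wrapper class can ignore that positional argument and build a WeightedRandomSampler from precomputed per-video weights. This is a minimal sketch; make_weighted_video_sampler and weights are my own names, not part of pytorchvideo.

import torch
from torch.utils.data import Sampler, WeightedRandomSampler

def make_weighted_video_sampler(weights):
    # Returns a *class* (which is what the dataset expects), closed over
    # precomputed per-video weights. Hypothetical helper, not pytorchvideo API.
    class _WeightedVideoSampler(Sampler):
        def __init__(self, data_source, generator=None):
            # `data_source` is the labeled-videos list the dataset passes in;
            # only its length is needed here.
            self._inner = WeightedRandomSampler(
                weights, num_samples=len(data_source), generator=generator
            )

        def __iter__(self):
            return iter(self._inner)

        def __len__(self):
            return len(self._inner)

    return _WeightedVideoSampler

Passing make_weighted_video_sampler(weights) as video_sampler then goes through the else branch above, which only supplies the labeled-videos list.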
What puzzles me about this code is DDP. Previously with DDP I always used DistributedSampler and called its set_epoch at every epoch; that guarantees that within one epoch different ranks get non-overlapping ids, and the data order still changes from epoch to epoch (the epoch number is folded into the random seed). Can this dataset be used under DDP as-is? Later it uses MultiProcessSampler so that multiple workers sample without overlap.
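For reference, the DistributedSampler pattern I mean looks roughly like this (a minimal sketch for an ordinary map-style dataset; dataset and num_epochs are placeholders):

import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(dataset)  # shards indices across ranks
loader = DataLoader(dataset, batch_size=4, sampler=sampler)

for epoch in range(num_epochs):
    # Folding the epoch into the shuffle seed reshuffles every epoch,
    # while each rank still sees a disjoint shard.
    sampler.set_epoch(epoch)
    for batch in loader:
        ...

Back to pytorchvideo: the dataset sets up its sampler iterator lazily, once the DataLoader workers exist: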
if not self._video_sampler_iter:
    # Setup MultiProcessSampler here - after PyTorch DataLoader workers are spawned.
    self._video_sampler_iter = iter(MultiProcessSampler(self._video_sampler))
This sets up video_sampler_iter. MultiProcessSampler handles the problem of multiple DataLoader workers fetching data at the same time: it splits the video ids across num_workers.
MultiProcessSampler
import itertools
import logging

import numpy as np
import torch

logger = logging.getLogger(__name__)


class MultiProcessSampler(torch.utils.data.Sampler):
    """
    MultiProcessSampler splits sample indices from a PyTorch Sampler evenly across
    workers spawned by a PyTorch DataLoader.
    """

    def __init__(self, sampler: torch.utils.data.Sampler) -> None:
        self._sampler = sampler

    def __iter__(self):
        """
        Returns:
            Iterator for underlying PyTorch Sampler indices split by worker id.
        """
        worker_info = torch.utils.data.get_worker_info()
        if worker_info is not None and worker_info.num_workers != 0:
            # Split sampler indexes by worker.
            video_indexes = range(len(self._sampler))
            worker_splits = np.array_split(video_indexes, worker_info.num_workers)
            worker_id = worker_info.id
            worker_split = worker_splits[worker_id]
            if len(worker_split) == 0:
                logger.warning(
                    f"More data workers({worker_info.num_workers}) than videos"
                    f"({len(self._sampler)}). For optimal use of processes "
                    "reduce num_workers."
                )
                return iter(())

            iter_start = worker_split[0]
            iter_end = worker_split[-1] + 1
            worker_sampler = itertools.islice(iter(self._sampler), iter_start, iter_end)
        else:
            # If no worker processes found, we return the full sampler.
            worker_sampler = iter(self._sampler)

        return worker_sampler
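To see how the split behaves, here is the array_split step in isolation, with the 63 videos and 8 workers used in the experiment below:

import numpy as np

splits = np.array_split(range(63), 8)
print([len(s) for s in splits])  # [8, 8, 8, 8, 8, 8, 8, 7]
print(splits[0])                 # [0 1 2 3 4 5 6 7] -> positions for worker 0

Note the split is over positions in the sampler's output, not over video ids themselves. As far as I can tell it only yields disjoint ids because every worker process inherits the same generator state, so each worker's iter(self._sampler) walks the same permutation; that seems to be why _video_random_generator is created in the main process.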
Honestly, I still can't quite work out the relationship between DataLoader workers and the ranks in DDP, or whether this setup produces correct randomness under DDP, so I ran an experiment.
Experiment setup:
Number of parallel GPUs: 2
batch_size: 4
num_workers: 8
Number of epochs: 3
Number of videos: 63
I logged the video ids seen on each GPU; below are the ids for the first two epochs. As the figure shows, the data on the two ranks overlaps; it is not split in half the way DistributedSampler would split it.
That is a real nuisance. Since, as emphasized earlier, video_sampler receives a class rather than an instance, you have to rewrite LabeledVideoDataset yourself to expose the sampler. Of course, if you don't mind duplicate reads within an epoch you can leave it alone; just remember that one epoch on two GPUs actually runs through the data twice. The same goes for validation: each GPU runs the full validation set once.
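If the duplicate reads do matter, one workaround that needs no library changes is to shard the video list per rank before constructing the dataset. A rough sketch, assuming DDP is already initialized and all_labeled_paths is the full list of (path, label_dict) tuples; note that unlike DistributedSampler with set_epoch, this split is static across epochs:

import torch
import torch.distributed as dist
from pytorchvideo.data.clip_sampling import make_clip_sampler
from pytorchvideo.data.labeled_video_dataset import LabeledVideoDataset

rank, world = dist.get_rank(), dist.get_world_size()
# Round-robin split: each rank sees a disjoint subset of the videos.
shard = all_labeled_paths[rank::world]

dataset = LabeledVideoDataset(
    shard,
    clip_sampler=make_clip_sampler("random", 2.0),
    video_sampler=torch.utils.data.RandomSampler,  # still a class, not an instance
    decode_audio=False,
)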
Once the MultiProcessSampler is set up, data fetching begins, with up to 10 attempts per sample.
First it tries to fetch the video's information from the video id: decode the video from its path, then collect the label and the rest of info_dict. If we are still consuming clips of the previously loaded video, it reads straight from self._loaded_video_label instead.
if self._loaded_video_label:
    video, info_dict, video_index = self._loaded_video_label
else:
    video_index = next(self._video_sampler_iter)
    try:
        video_path, info_dict = self._labeled_videos[video_index]
        video = self.video_path_handler.video_from_path(
            video_path,
            decode_audio=self._decode_audio,
            decoder=self._decoder,
        )
        self._loaded_video_label = (video, info_dict, video_index)
    except Exception:
        # On a decode failure the outer retry loop (10 attempts) tries again;
        # logging details omitted in this excerpt.
        continue
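The video_from_path call ultimately builds an EncodedVideo, which can also be used on its own; a quick sketch with a made-up path:

from pytorchvideo.data.encoded_video import EncodedVideo

video = EncodedVideo.from_path("some_video.mp4", decode_audio=False, decoder="pyav")
print(video.duration)            # duration in seconds
clip = video.get_clip(0.0, 2.0)  # {"video": (C, T, H, W) tensor, "audio": ...}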
Then the clip information is obtained from the clip_sampler:
(
    clip_start,
    clip_end,
    clip_index,
    aug_index,
    is_last_clip,
) = self._clip_sampler(
    self._next_clip_start_time, video.duration, info_dict
)
Four clip samplers are provided; pick whichever fits, or write your own. I won't go into detail.
if sampling_type == "uniform":
return UniformClipSampler(*args)
elif sampling_type == "random":
return RandomClipSampler(*args)
elif sampling_type == "constant_clips_per_video":
return ConstantClipsPerVideoSampler(*args)
elif sampling_type == "random_multi":
return RandomMultiClipSampler(*args)
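For example, building one through the factory and calling it the same way the dataset does (a quick sketch; 2.0 is the clip duration in seconds):

from pytorchvideo.data.clip_sampling import make_clip_sampler

clip_sampler = make_clip_sampler("random", 2.0)  # -> RandomClipSampler(2.0)
# Same signature the dataset uses: (last clip end time, video duration, info_dict)
clip_info = clip_sampler(0.0, 10.0, {})
# -> (clip_start, clip_end, clip_index, aug_index, is_last_clip);
#    for RandomClipSampler, is_last_clip is always True.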
The loaded data is packed into a dict and returned:
frames = self._loaded_clip["video"]
audio_samples = self._loaded_clip["audio"]
sample_dict = {
    "video": frames,
    "video_name": video.name,
    "video_index": video_index,
    "clip_index": clip_index,
    "aug_index": aug_index,
    **info_dict,
    **({"audio": audio_samples} if audio_samples is not None else {}),
}
if self._transform is not None:
    sample_dict = self._transform(sample_dict)

    # User can force dataset to continue by returning None in transform.
    if sample_dict is None:
        continue

return sample_dict
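Putting it all together, a minimal end-to-end sketch of the plain PyTorch + pytorchvideo.data combination (labeled_video_paths is assumed to be a list of (path, {"label": ...}) tuples; since LabeledVideoDataset is an IterableDataset, don't pass sampler or shuffle to the DataLoader):

import torch
from torch.utils.data import DataLoader
from pytorchvideo.data.clip_sampling import make_clip_sampler
from pytorchvideo.data.labeled_video_dataset import LabeledVideoDataset

dataset = LabeledVideoDataset(
    labeled_video_paths,
    clip_sampler=make_clip_sampler("random", 2.0),
    video_sampler=torch.utils.data.RandomSampler,  # the class itself
    decode_audio=False,
)
loader = DataLoader(dataset, batch_size=4, num_workers=8)

for batch in loader:
    clips = batch["video"]   # (B, C, T, H, W) after default collation
    labels = batch["label"]  # comes through info_dict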