Environment
Reference
Video Datasets & Loading
Loading the UCF101 Dataset
Loading the HMDB51 Dataset
Loading the Kinetics-400 Dataset
Video I/O
torchvision.io.read_video()
torchvision.io.read_video_timestamps()
torchvision.io.write_video()
class torchvision.io.VideoReader(path, stream='video')
Video Transforms
ToTensorVideo()
NormalizeVideo()
RandomHorizontalFlipVideo()
CenterCropVideo()
RandomCropVideo()
RandomResizedCropVideo()
Example
Video Classification Models
Example
Open the Anaconda Prompt and run:
conda install -c continuumcrew anaconda-navigator=1.5.1
conda update --all
Open the Anaconda Prompt, activate the target environment, and run:
pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio===0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
import torchvision.datasets as datasets

# UCF101 needs both the extracted videos and the official train/test split files
data = datasets.UCF101(
    root='path/UCF-101',   # directory containing the per-class video folders
    annotation_path='path/UCF101TrainTestSplits-RecognitionTask/ucfTrainTestlist',   # split files
    frames_per_clip=16,    # number of frames in each returned clip
    num_workers=0
)
print(data)
Returns:
- video (Tensor[T, H, W, C]): the `T` video frames
- audio (Tensor[K, L]): the audio frames, where `K` is the number of channels and `L` is the number of points
- label (int): class of the video clip
Note:
Cause & fix: https://stackoverflow.com/questions/61522539/i-cant-import-the-ucf-101-dataset-torchvision-list-index-out-of-range-error
Cause: the video paths in trainlist01/02/03.txt and testlist01/02/03.txt look like ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c01.avi, i.e. they use forward slashes ( / ) instead of the backslashes ( \ ) that Windows paths require.
I used the first of the proposed fixes: replace every / in trainlist01/02/03.txt and testlist01/02/03.txt with \ (a small script for this is sketched below).
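A minimal sketch of that replacement, assuming the split files live in the annotation folder passed to UCF101 above (the path is a placeholder):

import os

split_dir = 'path/UCF101TrainTestSplits-RecognitionTask/ucfTrainTestlist'  # placeholder path
for name in os.listdir(split_dir):
    if not name.endswith('.txt'):
        continue
    file_path = os.path.join(split_dir, name)
    with open(file_path, 'r') as f:
        content = f.read()
    with open(file_path, 'w') as f:
        f.write(content.replace('/', '\\'))   # swap to Windows-style backslashes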
Parameters:
root (string) – Root directory of the HMDB51 Dataset.
annotation_path (str) – Path to the folder containing the split files.
frames_per_clip (int) – Number of frames in a clip.
step_between_clips (int) – Number of frames between each clip.
fold (int, optional) – Which fold to use. Should be between 1 and 3.
train (bool, optional) – If True, creates a dataset from the train split, otherwise from the test split.
transform (callable, optional) – A function/transform that takes in a TxHxWxC video and returns a transformed version.
Returns:
- video (Tensor[T, H, W, C]): the `T` video frames
- audio (Tensor[K, L]): the audio frames, where `K` is the number of channels and `L` is the number of points
- label (int): class of the video clip
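HMDB51 loads much like the UCF101 example above; a minimal sketch, assuming the videos and the official split files have been extracted (both paths below are placeholders):

import torchvision.datasets as datasets

data = datasets.HMDB51(
    root='path/HMDB51',                                # placeholder: per-class video folders
    annotation_path='path/testTrainMulti_7030_splits', # placeholder: official split files
    frames_per_clip=16,
    fold=1,
    train=True
)
print(data)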
Parameters:
root (string) – Root directory of the Kinetics-400 Dataset.
frames_per_clip (int) – number of frames in a clip
step_between_clips (int) – number of frames between each clip
transform (callable, optional) – A function/transform that takes in a TxHxWxC video and returns a transformed version.
Returns:
- video (Tensor[T, H, W, C]): the `T` video frames
- audio (Tensor[K, L]): the audio frames, where `K` is the number of channels and `L` is the number of points
- label (int): class of the video clip
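Kinetics-400 has no annotation files; the class label is inferred from the folder a video sits in. A minimal sketch, assuming the videos have already been downloaded into one folder per class (the root path is a placeholder, and I'm assuming .mp4 files here since the default extensions are ('avi',)):

import torchvision.datasets as datasets

data = datasets.Kinetics400(
    root='path/kinetics400/train',   # placeholder: one sub-folder per action class
    frames_per_clip=16,
    step_between_clips=16,           # non-overlapping clips
    extensions=('mp4',)              # Kinetics videos are usually .mp4
)
print(data)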
Official docs:
Source: https://pytorch.org/docs/stable/_modules/torchvision/io/video.html#read_video
Parameters
filename (str) – path to the video file
start_pts (int if pts_unit = 'pts', float / Fraction if pts_unit = 'sec', optional) – the start presentation time of the video
end_pts (int if pts_unit = 'pts', float / Fraction if pts_unit = 'sec', optional) – the end presentation time
pts_unit (str, optional) – unit in which start_pts and end_pts values will be interpreted, either ‘pts’ or ‘sec’. Defaults to ‘pts’.
Returns
vframes (Tensor[T, H, W, C]) – the T video frames
aframes (Tensor[K, L]) – the audio frames, where K is the number of channels and L is the number of points
info (Dict) – metadata for the video and audio. Can contain the fields video_fps (float) and audio_fps (int)
Background: what is a timestamp, and what is a pts?
https://blog.csdn.net/tanningzhong/article/details/105564589
Timestamp units
We mentioned sampling rates earlier, and they are large numbers: standard AAC audio is sampled at 44 kHz, and video timestamps conventionally use a 90000 Hz clock. So the unit for measuring time is no longer a real-world unit like a second or millisecond; instead, the unit becomes one sample period, and the timestamp's actual value is a count of samples. When we need to play back or synchronize, we convert the timestamp to real time using the sampling rate.
In one sentence: a timestamp is a sample count, not real time. A timestamp of 160 does not mean 160 seconds or 160 milliseconds; it means 160 samples. To convert to real time you must know the sampling rate. At 8000 Hz, one second is divided into 8000 parts, so 160 samples take 160 * (1/8000) = 20 milliseconds.
Timestamp increments
A timestamp increment is the timestamp difference between one video frame and the next, or one audio frame and the next. Likewise, it is a difference in sample counts rather than in real time, and it also needs the sampling rate to be converted to real time.
So when computing timestamps for video and audio, be clear about what the frame rate and sampling rate are.
For video at 25 fps with a 90000 Hz clock, one frame covers 90000 / 25 = 3600 samples, so the per-frame timestamp increment is 3600; in real time that is 3600 * (1/90000) = 0.04 s = 40 ms, consistent with 1/25 = 0.04 s = 40 ms.
For AAC audio, one frame holds 1024 samples at a 44 kHz sampling rate, so one frame plays for 1024 * (1/44100) ≈ 0.0232 s ≈ 23.22 ms.
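A quick sketch of these conversions (the numbers are the same worked figures as above):

from fractions import Fraction

video_clock, video_fps = 90000, 25
pts_per_frame = video_clock // video_fps            # 3600 samples per video frame
print(float(Fraction(pts_per_frame, video_clock)))  # 0.04 s = 40 ms per frame

aac_frame_samples, audio_rate = 1024, 44100
print(aac_frame_samples / audio_rate)               # ≈ 0.02322 s per AAC frame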
An Example makes these two concepts more intuitive:
import torchvision.io as io
vframes, aframes, info = io.read_video(
    filename='path/v_ApplyEyeMakeup_g01_c01.avi',
    pts_unit='pts',
    end_pts=3
)
print(vframes.shape)
print(info)
# output:
# torch.Size([3, 240, 320, 3])
# {'video_fps': 25.0, 'audio_fps': 44100}
# --------------------------------------------------------------------
import torchvision.io as io
vframes, aframes, info = io.read_video(
    filename='path/v_ApplyEyeMakeup_g01_c01.avi',
    pts_unit='sec',
    end_pts=3
)
print(vframes.shape)
print(info)
# output:
# torch.Size([75, 240, 320, 3])
# {'video_fps': 25.0, 'audio_fps': 44100}
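With pts_unit='pts', end_pts=3 is interpreted as the third presentation timestamp, so only 3 frames come back; with pts_unit='sec' it means 3 seconds, which at 25 fps is 75 frames.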
Source: https://pytorch.org/docs/stable/_modules/torchvision/io/video.html#read_video_timestamps
Parameters
filename (str) – path to the video file
pts_unit (str, optional) – unit in which timestamp values will be returned either ‘pts’ or ‘sec’. Defaults to ‘pts’.
Returns
pts (List[int] if pts_unit = 'pts', List[Fraction] if pts_unit = 'sec') – presentation timestamps for each one of the frames in the video
video_fps (float, optional) – the frame rate for the video
Example:
import torchvision.io as io
v_pts, v_fps = io.read_video_timestamps(
    filename='path/v_ApplyEyeMakeup_g01_c01.avi',
    pts_unit='pts'
)
print(v_pts)
print(v_fps)
# output
# [1, 2, 3, 4, 5, ..., 162, 163, 164]   (164 per-frame timestamps; full list truncated here)
# 25.0
# ---------------------------------------------------------------------------
import torchvision.io as io
v_pts, v_fps = io.read_video_timestamps(
    filename='path/v_ApplyEyeMakeup_g01_c01.avi',
    pts_unit='sec'
)
print(v_pts)
print(v_fps)
# output
# [Fraction(1, 25), Fraction(2, 25), Fraction(3, 25), Fraction(4, 25), Fraction(1, 5), ..., Fraction(163, 25), Fraction(164, 25)]   (full list truncated here)
# 25.0
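With pts_unit='sec' the timestamps come back as exact Fractions: frame i of this 25 fps clip starts at i/25 seconds.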
Source: https://pytorch.org/docs/stable/_modules/torchvision/io/video.html#write_video
Parameters
filename (str) – path where the video will be saved
video_array (Tensor[T, H, W, C]) – tensor containing the individual frames, as a uint8 tensor in [T, H, W, C] format
fps (Number) – frames per second
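A round-trip sketch, reusing a clip read with read_video above (the output path is a placeholder; write_video expects uint8 frames in [T, H, W, C]):

import torchvision.io as io

vframes, aframes, info = io.read_video(
    filename='path/v_ApplyEyeMakeup_g01_c01.avi',
    pts_unit='sec'
)
# write the frames back out at the original frame rate
io.write_video('path/copy.mp4', vframes, fps=info['video_fps'])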
Official docs: https://pytorch.org/docs/stable/torchvision/io.html#fine-grained-video-api
Fine-grained video-reading API. Supports frame-by-frame reading of various streams from a single video container.
Parameters
path (string) – Path to the video file in supported format
stream (string, optional) – descriptor of the required stream, followed by the stream id, in the format {stream_type}:{stream_id}. Defaults to "video:0". Currently available options include ['video', 'audio']
Note: I ran into an error when using it. The reason is that VideoReader is still in beta. Some people online say it works after installing ffmpeg, but neither a system-level nor a conda install helped in my case, so I'll wait until it is officially released...
Reference:
Common functions:
Returns:
a dictionary with fields data and pts, containing the decoded frame and the corresponding timestamp
Returns:
dictionary containing duration and frame rate for every stream
Parameters
time_s (float) – seek time in seconds
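For reference, a usage sketch of the documented API, assuming a build where VideoReader actually works (I could not verify this myself, per the note above; the path is a placeholder):

import torchvision.io as io

reader = io.VideoReader('path/v_ApplyEyeMakeup_g01_c01.avi', stream='video')
print(reader.get_metadata())   # duration and frame rate for every stream

reader.seek(2.0)               # jump to t = 2 s
frame = next(reader)           # dict with fields 'data' (decoded frame) and 'pts' (timestamp)
print(frame['pts'])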
Official source:
I haven't found official documentation for these yet, but the comments in the source make them clear enough.
The video-related transform functions the official code provides (in the second link) are as follows:
Convert tensor data type from uint8 to float, divide value by 255.0 and permute the dimensions of clip tensor.
Similar to ToTensor() for images, but mind the dimension order!
Args:
clip (torch.tensor, dtype=torch.uint8): Size is (T, H, W, C)
Return:
clip (torch.tensor, dtype=torch.float): Size is (C, T, H, W)
Normalize the video clip by mean subtraction and division by standard deviation.
Consistent with Normalize() for images, except that images usually use the ImageNet mean and std, whereas video uses the Kinetics-400 values: mean = [0.43216, 0.394666, 0.37645] and std = [0.22803, 0.22145, 0.216989] (source: https://pytorch.org/docs/stable/torchvision/models.html#video-classification). A short sketch follows the argument list below.
Args:
mean (3-tuple): pixel RGB mean
std (3-tuple): pixel RGB standard deviation
inplace (boolean): whether do in-place normalization
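A minimal sketch with the Kinetics-400 statistics quoted above (the clip here is a made-up dummy tensor; NormalizeVideo expects a float (C, T, H, W) input, e.g. the output of ToTensorVideo()):

import torch
import torchvision.transforms._transforms_video as v_transform

normalize = v_transform.NormalizeVideo(
    mean=[0.43216, 0.394666, 0.37645],   # Kinetics-400 channel means
    std=[0.22803, 0.22145, 0.216989],    # Kinetics-400 channel stds
)
clip = torch.rand(3, 16, 112, 112)       # dummy clip: (C, T, H, W), values in [0, 1]
print(normalize(clip).shape)             # torch.Size([3, 16, 112, 112])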
Flip the video clip along the horizontal direction with a given probability.
Understandably, there is no vertical-flip counterpart for video.
Args:
p (float): probability of the clip being flipped. Default value is 0.5
Args:
clip (torch.tensor): Video clip to be cropped. Size is (C, T, H, W)
crop_size: int / tuple
Returns:
torch.tensor: central cropping of video clip. Size is (C, T, crop_size, crop_size)
Args:
clip (torch.tensor): Video clip to be cropped. Size is (C, T, H, W)
size: int / tuple
Returns:
torch.tensor: randomly cropped/resized video clip.
Args:
clip (torch.tensor): Video clip to be cropped. Size is (C, T, H, W)
scale: (0.08, 1.0) by default
ratio: (3.0 / 4.0, 4.0 / 3.0) by default
interpolation_mode: "bilinear" by default
Returns:
torch.tensor: randomly cropped/resized video clip.
import torchvision.transforms as transform
import torchvision.transforms._transforms_video as v_transform
import torchvision.io as io

vframes, aframes, info = io.read_video(
    filename='path/v_ApplyEyeMakeup_g01_c01.avi',
    pts_unit='pts',
)
trans = transform.Compose([
    v_transform.ToTensorVideo(),              # (T, H, W, C) uint8 -> (C, T, H, W) float in [0, 1]
    v_transform.RandomHorizontalFlipVideo(),
    v_transform.RandomResizedCropVideo(112),
])
out = trans(vframes)   # apply once; each call redraws the random flip/crop parameters
print(vframes.shape)
print(out)
print(out.shape)
# output:
# original video clip tensor shape: torch.Size([164, 240, 320, 3])
# transformed video clip tensor shape: torch.Size([3, 164, 112, 112])
Official docs: https://pytorch.org/docs/stable/torchvision/models.html#video-classification
Source: https://pytorch.org/docs/stable/_modules/torchvision/models/video/resnet.html
Models:
ResNet 3D 18
ResNet MC 18
ResNet (2+1)D
I haven't worked with these models in much detail, but the docs helpfully point to the corresponding paper: https://arxiv.org/abs/1711.11248.
Parameters
pretrained (bool) – If True, returns a model pre-trained on Kinetics-400
progress (bool) – If True, displays a progress bar of the download to stderr
Returns
Network
import torchvision.models.video as v_model
model = v_model.r3d_18(pretrained=True)
print(model)
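A forward-pass sketch (the clip below is a made-up dummy tensor; the pretrained model expects (B, C, T, H, W) input and outputs one score per Kinetics-400 class):

import torch
import torchvision.models.video as v_model

model = v_model.r3d_18(pretrained=True)
model.eval()

clip = torch.rand(1, 3, 16, 112, 112)   # dummy batch: (B, C, T, H, W)
with torch.no_grad():
    scores = model(clip)
print(scores.shape)                      # torch.Size([1, 400]), one score per Kinetics-400 class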