A First Look at PyTorch 1.7 Video (Video Datasets, Video I/O, Video Classification Models, Video Transforms)

Table of Contents

Environment

References

Video Datasets & Loading

Loading the UCF101 Dataset

Loading the HMDB51 Dataset

Loading the Kinetics-400 Dataset

Video I/O

torchvision.io.read_video()

torchvision.io.read_video_timestamps()

torchvision.io.write_video()

class torchvision.io.VideoReader(path, stream='video')

Video Transforms

ToTensorVideo()

NormalizeVideo()

RandomHorizontalFlipVideo()

CenterCropVideo()

RandomCropVideo()

RandomResizedCropVideo()

Example

Video Classification Models

Example

 


 

Environment

  • Win 10
  • Anaconda Navigator
  • PyCharm
  • cuda 10.1
  • torch 1.7.1
  • torchvision 0.8.2
  • Python 3.8

 


 

References

  • Upgrading Anaconda Navigator: https://www.cnblogs.com/developerchen/p/8879516.html

Open the Anaconda Prompt and run the following commands:

conda install -c continuumcrew anaconda-navigator=1.5.1

conda update --all

  • Installing torch 1.7.1: https://pytorch.org/get-started/locally/

Open the Anaconda Prompt, activate the target environment, and run:

pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio===0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

  • PyTorch 1.7.1 official documentation: https://pytorch.org/docs/stable/index.html

 


 

Video Datasets & Loading

  • UCF101:https://pytorch.org/docs/stable/torchvision/datasets.html#ucf101
  • HMDB51:https://pytorch.org/docs/stable/torchvision/datasets.html#hmdb51
  • Kinetics400:https://pytorch.org/docs/stable/torchvision/datasets.html#kinetics-400
  • ......

 

Loading the UCF101 Dataset

import torchvision.datasets as datasets

data = datasets.UCF101(
    root='path/UCF-101',               # directory containing the extracted UCF-101 videos
    annotation_path='path/UCF101TrainTestSplits-RecognitionTask/ucfTrainTestlist',  # train/test split files
    frames_per_clip=16,                # number of frames in each clip
    num_workers=0                      # required on Windows 10 (see the notes below)
)

print(data)

Return values

  • video (Tensor[T, H, W, C]): the `T` video frames
  • audio (Tensor[K, L]): the audio frames, where `K` is the number of channels and `L` is the number of points
  • label (int): class of the video clip
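
Since the dataset returns one clip per index, you can inspect a single sample directly. A minimal sketch building on the data object created above; the exact shapes depend on your videos and on frames_per_clip:

# Index one clip: UCF101.__getitem__ returns (video, audio, label).
video, audio, label = data[0]

print(video.shape)   # (T, H, W, C) uint8 frames, e.g. torch.Size([16, 240, 320, 3])
print(audio.shape)   # (K, L) audio samples; may be empty if the clip has no audio stream
print(label)         # integer class index
print(len(data))     # number of clips (not the number of videos)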

 

Notes:

  • On Windows 10 you must pass num_workers=0 when running this code, otherwise the following error is raised:

(screenshot of the error message omitted)

 

  • You also need to install the PyAV library: pip install av

 

  • Windows paths use backslashes ("\"), so when loading the UCF101 dataset the following error is raised:

(screenshot of the error message omitted)

Cause & solution: https://stackoverflow.com/questions/61522539/i-cant-import-the-ucf-101-dataset-torchvision-list-index-out-of-range-error

Cause: the video paths in trainlist01/02/03.txt and testlist01/02/03.txt look like ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c01.avi, i.e. they use forward slashes, while Windows paths expect backslashes.

I used the first of the suggested solutions: replace every / in trainlist01/02/03.txt and testlist01/02/03.txt with \, as in the small script sketched below.
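
Rather than editing the files by hand, the replacement can be scripted. A sketch only; annotation_dir is a placeholder for wherever your split files actually live:

import glob
import os

annotation_dir = 'path/UCF101TrainTestSplits-RecognitionTask/ucfTrainTestlist'  # placeholder path

for txt in glob.glob(os.path.join(annotation_dir, '*.txt')):
    with open(txt, 'r') as f:
        content = f.read()
    # Replace the forward slashes with the backslashes Windows expects.
    with open(txt, 'w') as f:
        f.write(content.replace('/', '\\'))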

 

 

Loading the HMDB51 Dataset

Parameters

  • root (string) – Root directory of the HMDB51 Dataset.

  • annotation_path (str) – Path to the folder containing the split files.

  • frames_per_clip (int) – Number of frames in a clip.

  • step_between_clips (int) – Number of frames between each clip.

  • fold (int, optional) – Which fold to use. Should be between 1 and 3.

  • train (bool, optional) – If True, creates a dataset from the train split, otherwise from the test split.

  • transform (callable, optional) – A function/transform that takes in a TxHxWxC video and returns a transformed version.

Return values

  • video (Tensor[T, H, W, C]): the `T` video frames
  • audio (Tensor[K, L]): the audio frames, where `K` is the number of channels and `L` is the number of points
  • label (int): class of the video clip
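
Loading HMDB51 looks much like UCF101. A sketch with placeholder paths; the num_workers=0 and PyAV notes from the UCF101 section apply here as well:

import torchvision.datasets as datasets

data = datasets.HMDB51(
    root='path/hmdb51',                                  # one sub-folder of videos per class
    annotation_path='path/testTrainMulti_7030_splits',   # folder containing the split .txt files
    frames_per_clip=16,
    step_between_clips=16,
    fold=1,
    train=True,
    num_workers=0
)

video, audio, label = data[0]
print(video.shape, label)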

 

 

Loading the Kinetics-400 Dataset

Parameters

  • root (string) – Root directory of the Kinetics-400 Dataset.

  • frames_per_clip (int) – number of frames in a clip

  • step_between_clips (int) – number of frames between each clip

  • transform (callable, optional) – A function/transform that takes in a TxHxWxC video and returns a transformed version.

Return values

  • video (Tensor[T, H, W, C]): the `T` video frames
  • audio (Tensor[K, L]): the audio frames, where `K` is the number of channels and `L` is the number of points
  • label (int): class of the video clip
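
A sketch for Kinetics-400. torchvision does not download the videos, so they are assumed to already be on disk with one sub-folder per class under root; the paths and the extensions value below are placeholders for your own layout:

import torchvision.datasets as datasets

data = datasets.Kinetics400(
    root='path/kinetics400/train',   # expects root/<class_name>/<video file>
    frames_per_clip=16,
    step_between_clips=16,
    extensions=('mp4',),             # match the file type of your videos
    num_workers=0
)

video, audio, label = data[0]
print(video.shape, label)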

 


 

Video I/O

Official documentation

  • https://pytorch.org/docs/stable/torchvision/io.html?highlight=video
  • https://pytorch.org/docs/stable/torchvision/io.html#fine-grained-video-api

 

torchvision.io.read_video()

Source: https://pytorch.org/docs/stable/_modules/torchvision/io/video.html#read_video

Parameters

  • filename (str) – path to the video file

  • start_pts (int if pts_unit='pts', float / Fraction if pts_unit='sec', optional) – the start presentation time of the video

  • end_pts (int if pts_unit='pts', float / Fraction if pts_unit='sec', optional) – the end presentation time

  • pts_unit (str, optional) – unit in which start_pts and end_pts values will be interpreted, either 'pts' or 'sec'. Defaults to 'pts'.

 

Returns

  • vframes (Tensor[T, H, W, C]) – the T video frames

  • aframes (Tensor[K, L]) – the audio frames, where K is the number of channels and L is the number of points

  • info (Dict) – metadata for the video and audio. Can contain the fields video_fps (float) and audio_fps (int)

Background: what is a timestamp, and what is pts?

https://blog.csdn.net/tanningzhong/article/details/105564589

  • Timestamp unit

Sample rates were mentioned above, and they feel like very large numbers: standard AAC audio is sampled at 44 kHz, and video timestamps are conventionally defined on a 90 000 Hz clock. So the unit used to measure time is no longer a real-world unit such as seconds or milliseconds; the unit becomes one sampling interval, and the raw timestamp value simply counts samples. For playback or synchronization, the timestamp is converted back to real time using the sample rate.

In short, a timestamp is not a real time but a number of samples. A timestamp of 160 does not mean 160 seconds or 160 milliseconds; it means 160 samples. To convert it to real time you must know the sample rate: at 8000 Hz, one second is divided into 8000 parts, so 160 samples take 160 * (1/8000) = 20 ms.

  • Timestamp increment

This is the timestamp difference between two video frames, or between two audio frames. Again it is a difference in sample counts rather than in real time, and it has to be converted to real time using the sample rate.

So when computing video or audio timestamps, you must know both the frame rate and the sample rate.

For video at 25 fps on a 90 000 Hz clock, each frame covers 90000 / 25 = 3600 ticks, so the timestamp increment between frames is 3600. Converted to real time that is 3600 * (1/90000) = 0.04 s = 40 ms, which agrees with 1/25 = 0.04 s = 40 ms.

For AAC audio, one frame contains 1024 samples, so at a 44.1 kHz sample rate one frame lasts 1024 * (1/44100) ≈ 0.0232 s ≈ 23.2 ms.
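
These calculations are easy to reproduce in a couple of lines (purely illustrative):

from fractions import Fraction

video_fps, video_clock = 25, 90000                        # frame rate and 90 kHz timestamp clock
pts_per_frame = video_clock // video_fps                  # 3600 clock ticks between frames
print(float(pts_per_frame * Fraction(1, video_clock)))    # 0.04 s = 40 ms per frame

aac_frame, audio_rate = 1024, 44100                       # samples per AAC frame, sample rate
print(aac_frame / audio_rate)                             # ~0.0232 s = 23.2 ms per audio frame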

An example makes the two concepts more concrete:

import torchvision.io as io


vframes, aframes, info = io.read_video(
    filename='path/v_ApplyEyeMakeup_g01_c01.avi',
    pts_unit='pts',
    end_pts=3
)

print(vframes.shape)
print(info)


# output:
# torch.Size([3, 240, 320, 3])
# {'video_fps': 25.0, 'audio_fps': 44100}
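# with pts_unit='pts', end_pts=3 keeps only the frames whose presentation timestamp is <= 3, hence just 3 frames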


# --------------------------------------------------------------------


import torchvision.io as io


vframes, aframes, info = io.read_video(
    filename='path/v_ApplyEyeMakeup_g01_c01.avi',
    pts_unit='sec',
    end_pts=3
)

print(vframes.shape)
print(info)


# output:
# torch.Size([75, 240, 320, 3])
# {'video_fps': 25.0, 'audio_fps': 44100}
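# with pts_unit='sec', end_pts=3 keeps the first 3 seconds of video: 25 fps * 3 s = 75 frames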

 

torchvision.io.read_video_timestamps()

Source: https://pytorch.org/docs/stable/_modules/torchvision/io/video.html#read_video_timestamps

Parameters

  • filename (str) – path to the video file

  • pts_unit (str, optional) – unit in which timestamp values will be returned, either 'pts' or 'sec'. Defaults to 'pts'.

 

Returns

  • pts (List[int] if pts_unit='pts', List[Fraction] if pts_unit='sec') – presentation timestamps for each one of the frames in the video.

  • video_fps (float, optional) – the frame rate for the video

Example

import torchvision.io as io

v_pts, v_fps = io.read_video_timestamps(
    filename='path/v_ApplyEyeMakeup_g01_c01.avi',
    pts_unit='pts'
)

print(v_pts)
print(v_fps)


# output
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164]
# 25.0



# ---------------------------------------------------------------------------




import torchvision.io as io

v_pts, v_fps = io.read_video_timestamps(
    filename='path/v_ApplyEyeMakeup_g01_c01.avi',
    pts_unit='sec'
)

print(v_pts)
print(v_fps)


# output
# [Fraction(1, 25), Fraction(2, 25), Fraction(3, 25), Fraction(4, 25), Fraction(1, 5), Fraction(6, 25), Fraction(7, 25), Fraction(8, 25), Fraction(9, 25), Fraction(2, 5), Fraction(11, 25), Fraction(12, 25), Fraction(13, 25), Fraction(14, 25), Fraction(3, 5), Fraction(16, 25), Fraction(17, 25), Fraction(18, 25), Fraction(19, 25), Fraction(4, 5), Fraction(21, 25), Fraction(22, 25), Fraction(23, 25), Fraction(24, 25), Fraction(1, 1), Fraction(26, 25), Fraction(27, 25), Fraction(28, 25), Fraction(29, 25), Fraction(6, 5), Fraction(31, 25), Fraction(32, 25), Fraction(33, 25), Fraction(34, 25), Fraction(7, 5), Fraction(36, 25), Fraction(37, 25), Fraction(38, 25), Fraction(39, 25), Fraction(8, 5), Fraction(41, 25), Fraction(42, 25), Fraction(43, 25), Fraction(44, 25), Fraction(9, 5), Fraction(46, 25), Fraction(47, 25), Fraction(48, 25), Fraction(49, 25), Fraction(2, 1), Fraction(51, 25), Fraction(52, 25), Fraction(53, 25), Fraction(54, 25), Fraction(11, 5), Fraction(56, 25), Fraction(57, 25), Fraction(58, 25), Fraction(59, 25), Fraction(12, 5), Fraction(61, 25), Fraction(62, 25), Fraction(63, 25), Fraction(64, 25), Fraction(13, 5), Fraction(66, 25), Fraction(67, 25), Fraction(68, 25), Fraction(69, 25), Fraction(14, 5), Fraction(71, 25), Fraction(72, 25), Fraction(73, 25), Fraction(74, 25), Fraction(3, 1), Fraction(76, 25), Fraction(77, 25), Fraction(78, 25), Fraction(79, 25), Fraction(16, 5), Fraction(81, 25), Fraction(82, 25), Fraction(83, 25), Fraction(84, 25), Fraction(17, 5), Fraction(86, 25), Fraction(87, 25), Fraction(88, 25), Fraction(89, 25), Fraction(18, 5), Fraction(91, 25), Fraction(92, 25), Fraction(93, 25), Fraction(94, 25), Fraction(19, 5), Fraction(96, 25), Fraction(97, 25), Fraction(98, 25), Fraction(99, 25), Fraction(4, 1), Fraction(101, 25), Fraction(102, 25), Fraction(103, 25), Fraction(104, 25), Fraction(21, 5), Fraction(106, 25), Fraction(107, 25), Fraction(108, 25), Fraction(109, 25), Fraction(22, 5), Fraction(111, 25), Fraction(112, 25), Fraction(113, 25), Fraction(114, 25), Fraction(23, 5), Fraction(116, 25), Fraction(117, 25), Fraction(118, 25), Fraction(119, 25), Fraction(24, 5), Fraction(121, 25), Fraction(122, 25), Fraction(123, 25), Fraction(124, 25), Fraction(5, 1), Fraction(126, 25), Fraction(127, 25), Fraction(128, 25), Fraction(129, 25), Fraction(26, 5), Fraction(131, 25), Fraction(132, 25), Fraction(133, 25), Fraction(134, 25), Fraction(27, 5), Fraction(136, 25), Fraction(137, 25), Fraction(138, 25), Fraction(139, 25), Fraction(28, 5), Fraction(141, 25), Fraction(142, 25), Fraction(143, 25), Fraction(144, 25), Fraction(29, 5), Fraction(146, 25), Fraction(147, 25), Fraction(148, 25), Fraction(149, 25), Fraction(6, 1), Fraction(151, 25), Fraction(152, 25), Fraction(153, 25), Fraction(154, 25), Fraction(31, 5), Fraction(156, 25), Fraction(157, 25), Fraction(158, 25), Fraction(159, 25), Fraction(32, 5), Fraction(161, 25), Fraction(162, 25), Fraction(163, 25), Fraction(164, 25)]
# 25.0

 

torchvision.io.write_video()

Source: https://pytorch.org/docs/stable/_modules/torchvision/io/video.html#write_video

Parameters

  • filename (str) – path where the video will be saved

  • video_array (Tensor[T, H, W, C]) – tensor containing the individual frames, as a uint8 tensor in [T, H, W, C] format

  • fps (Number) – frames per second
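
A minimal read/write round trip looks like this (a sketch; the output filename is a placeholder):

import torchvision.io as io

# Read a clip, then write its frames back out at the original frame rate.
vframes, aframes, info = io.read_video(
    filename='path/v_ApplyEyeMakeup_g01_c01.avi',
    pts_unit='sec'
)

io.write_video(
    filename='path/out_clip.mp4',    # placeholder output path
    video_array=vframes,             # uint8 tensor in [T, H, W, C] format
    fps=info['video_fps']
)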

 

class torchvision.io.VideoReader(path, stream='video')

Official documentation: https://pytorch.org/docs/stable/torchvision/io.html#fine-grained-video-api

Fine-grained video-reading API. Supports frame-by-frame reading of various streams from a single video container.

Parameters

  • path (string) – Path to the video file in supported format

  • stream (string, optional) – descriptor of the required stream, followed by the stream id, in the format {stream_type}:{stream_id}. Defaults to "video:0". Currently available options include ['video', 'audio']

Note: when I tried it, I got the error referenced below ("RuntimeError: Not compiled with video_reader support"). VideoReader is still in beta; some people online say installing ffmpeg fixes it, but installing it either system-wide or via conda made no difference for me, so I will wait until it is officially released.

References

  • "RuntimeError: Not compiled with video_reader support" raises when I use the new fine-grained VideoReader API.    https://github.com/pytorch/vision/issues/2934#issuecomment-718834813
  • Official note on the error (the API is still in beta): https://github.com/pytorch/vision/releases/tag/v0.8.0
  • Installing ffmpeg via conda: conda install ffmpeg
  • Installing ffmpeg on Windows: https://www.zhihu.com/question/288655694/answer/1605692761

Common methods

  • __next__() :Decodes and returns the next frame of the current stream

Returns:

a dictionary with fields data and pts containing decoded frame and corresponding timestamp

  • get_metadata():Returns video metadata

Returns:

dictionary containing duration and frame rate for every stream

  • seek(time_s: float):Seek within current stream.

Parameters

time_s (float) – seek time in seconds
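
For completeness, the intended usage pattern looks roughly like this. It is a sketch based only on the documented API above; as noted, I could not run it myself because of the "Not compiled with video_reader support" error:

import torchvision.io as io

reader = io.VideoReader('path/v_ApplyEyeMakeup_g01_c01.avi', stream='video')
print(reader.get_metadata())          # duration and frame rate for each stream

reader.seek(1.0)                      # jump to t = 1 s in the current stream
frames = []
for frame in reader:                  # each item is a dict with 'data' (decoded frame) and 'pts' (seconds)
    if frame['pts'] > 2.0:            # keep roughly one second of video
        break
    frames.append(frame['data'])

print(len(frames))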

 


 

Video Transforms

Official source code

  • https://github.com/pytorch/vision/blob/master/torchvision/transforms/_functional_video.py (slightly lower level than the next link)
  • https://github.com/pytorch/vision/blob/master/torchvision/transforms/_transforms_video.py (a thin wrapper around the previous link)

I have not found official documentation for these yet, but the docstrings in the source make them clear enough.

 

The video-related transforms provided in the second link are:

  • RandomCropVideo
  • RandomResizedCropVideo
  • CenterCropVideo
  • NormalizeVideo
  • ToTensorVideo
  • RandomHorizontalFlipVideo

 

ToTensorVideo()

Convert tensor data type from uint8 to float, divide value by 255.0 and permute the dimensions of clip tensor.

Similar to ToTensor() for images, but note the order of the dimensions!

Args:
            clip (torch.tensor, dtype=torch.uint8): Size is (T, H, W, C)


Return:
            clip (torch.tensor, dtype=torch.float): Size is (C, T, H, W)

 

NormalizeVideo()

Normalize the video clip by mean subtraction and division by standard deviation.

It matches the Normalize() transform for images. Images usually use the ImageNet mean and std, whereas videos use the Kinetics-400 statistics: mean = [0.43216, 0.394666, 0.37645] and std = [0.22803, 0.22145, 0.216989] (source: https://pytorch.org/docs/stable/torchvision/models.html#video-classification).

 Args:
        mean (3-tuple): pixel RGB mean
        std (3-tuple): pixel RGB standard deviation
        inplace (boolean): whether to do in-place normalization
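
A short sketch combining it with ToTensorVideo and the Kinetics-400 statistics; the resulting trans is applied to the (T, H, W, C) uint8 frames returned by read_video:

import torchvision.transforms as transform
import torchvision.transforms._transforms_video as v_transform

kinetics_mean = [0.43216, 0.394666, 0.37645]
kinetics_std = [0.22803, 0.22145, 0.216989]

trans = transform.Compose([
    v_transform.ToTensorVideo(),      # (T, H, W, C) uint8 -> (C, T, H, W) float in [0, 1]
    v_transform.NormalizeVideo(mean=kinetics_mean, std=kinetics_std),
])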

 

RandomHorizontalFlipVideo()

Flip the video clip along the horizontal direction with a given probability.

It makes sense that there is no vertical-flip transform for video.

Args:
        p (float): probability of the clip being flipped. Default value is 0.5

 

CenterCropVideo()

Args:
            clip (torch.tensor): Video clip to be cropped. Size is (C, T, H, W)

            crop_size: int / tuple


Returns:
            torch.tensor: central cropping of video clip. Size is (C, T, crop_size, crop_size)

 

RandomCropVideo()

Args:
            clip (torch.tensor): Video clip to be cropped. Size is (C, T, H, W)

            size: int / tuple


Returns:
            torch.tensor: randomly cropped/resized video clip.

 

RandomResizedCropVideo()

Args:
            clip (torch.tensor): Video clip to be cropped. Size is (C, T, H, W)

            scale: default (0.08, 1.0)

            ratio: default (3.0 / 4.0, 4.0 / 3.0)

            interpolation_mode: default "bilinear"


Returns:
            torch.tensor: randomly cropped/resized video clip.

 

Example

import torchvision.transforms as transform
import torchvision.transforms._transforms_video as v_transform
import torchvision.io as io


vframes, aframes, info = io.read_video(
    filename='path/v_ApplyEyeMakeup_g01_c01.avi',
    pts_unit='pts',
)

trans = transform.Compose([
    v_transform.ToTensorVideo(),               # (T, H, W, C) uint8 -> (C, T, H, W) float
    v_transform.RandomHorizontalFlipVideo(),
    v_transform.RandomResizedCropVideo(112),
])

out = trans(vframes)                           # apply the pipeline once and reuse the result

print(vframes.shape)
print(out)
print(out.shape)


# output:
# original video clip tensor shape:           torch.Size([164, 240, 320, 3])
# video clip tensor shape after transforms:   torch.Size([3, 164, 112, 112])

 


 

Video Classification Models

Official documentation: https://pytorch.org/docs/stable/torchvision/models.html#video-classification

Source: https://pytorch.org/docs/stable/_modules/torchvision/models/video/resnet.html

Models

  • ResNet 3D 18

  • ResNet MC 18

  • ResNet (2+1)D

I have not worked with these models in much detail; the documentation helpfully links the corresponding paper: https://arxiv.org/abs/1711.11248.

Parameters

  • pretrained (bool) – If True, returns a model pre-trained on Kinetics-400

  • progress (bool) – If True, displays a progress bar of the download to stderr

 

Returns

        Network

 

Example

import torchvision.models.video as v_model

model = v_model.r3d_18(pretrained=True)

print(model)
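
Putting the pieces together, a clip can be pushed through the pretrained model like this. A sketch only: for a proper evaluation you would resize the frames to the resolution the model was trained on and map the predicted index to a Kinetics-400 class name, neither of which is shown here.

import torch
import torchvision.io as io
import torchvision.models.video as v_model
import torchvision.transforms as transform
import torchvision.transforms._transforms_video as v_transform

vframes, _, _ = io.read_video('path/v_ApplyEyeMakeup_g01_c01.avi', pts_unit='sec')

trans = transform.Compose([
    v_transform.ToTensorVideo(),
    v_transform.CenterCropVideo(112),
    v_transform.NormalizeVideo(mean=[0.43216, 0.394666, 0.37645],
                               std=[0.22803, 0.22145, 0.216989]),
])

clip = trans(vframes[:16]).unsqueeze(0)   # take a 16-frame clip -> (1, C, T, H, W)

model = v_model.r3d_18(pretrained=True)
model.eval()
with torch.no_grad():
    scores = model(clip)                  # (1, 400) Kinetics-400 class scores

print(scores.argmax(dim=1))               # predicted class index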

 
