Introduction to the C3D Network

1. Model Overview

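The C3D model is widely used for 3D vision tasks. Its structure resembles that of a common 2D convolutional network. The main difference is that C3D uses 3D operations such as 3D convolution, whereas a 2D ConvNet uses 2D operations throughout. For more detail on C3D, see the original paper, "Learning Spatiotemporal Features with 3D Convolutional Networks".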

Illustration of 3D convolution:
[Figure: 3D convolution]
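To make the 3D operation concrete, here is a minimal NumPy sketch of a single-channel 3D convolution. The kernel slides along the frame axis as well as along height and width, so the output keeps a temporal dimension. The function name `conv3d_valid` and the toy sizes are assumptions made for illustration. Unlike C3D, this sketch uses no padding, so every dimension shrinks by 2.

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive single-channel 3D cross-correlation, stride 1, no padding."""
    d, h, w = volume.shape
    kd, kh, kw = kernel.shape
    out = np.zeros((d - kd + 1, h - kh + 1, w - kw + 1), dtype=np.float64)
    for t in range(out.shape[0]):          # slide over time (frames)
        for y in range(out.shape[1]):      # slide over height
            for x in range(out.shape[2]):  # slide over width
                patch = volume[t:t + kd, y:y + kh, x:x + kw]
                out[t, y, x] = np.sum(patch * kernel)
    return out

clip = np.ones((16, 8, 8))    # toy clip: 16 frames of 8 x 8 pixels, one channel
kernel = np.ones((3, 3, 3))   # one 3 x 3 x 3 kernel
features = conv3d_valid(clip, kernel)
print(features.shape)         # (14, 6, 6): the time axis is still present
```

A 2D convolution applied to the same clip would collapse or ignore the frame axis. Here each output value mixes information from three consecutive frames.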
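Deep learning's success on images has produced several strong pretrained feature extractors. The extracted features are usually the activations of the network's last fully connected layers, and they transfer well to other tasks. Deep features trained on images, however, do not encode motion and are therefore poorly suited to video. The C3D paper proposes a deep 3D ConvNet that learns spatiotemporal features. 3D convolution was not first proposed in that paper, but by combining a large-scale supervised training dataset with a deep network architecture, the paper achieves good results on a range of video analysis tasks. C3D features implicitly encode the object, scene, and action information in a video, so they perform well without fine-tuning for a specific task. They also have the four properties the paper asks of a good video descriptor: generic, compact, efficient to compute, and simple to implement.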

2. Model Structure

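The C3D network has 8 convolution layers, 5 max-pooling layers, and 2 fully connected layers, followed by a softmax output layer. All 3D convolution kernels are 3 × 3 × 3 with stride 1 in both the spatial and temporal dimensions. The 3D pooling layers are denoted pool1 to pool5. All pooling kernels are 2 × 2 × 2, except pool1, which is 1 × 2 × 2. Each fully connected layer has 4096 output units.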
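The effect of the five pooling layers on a clip can be traced with a few lines of plain Python. This is only a sketch under stated assumptions: the convolutions are assumed to preserve size, each pooling stride is assumed to equal its kernel, and sizes are rounded up at every pool. None of these details is taken from the backbone code, so other input sizes may give different results in the real implementation. The 512 output channels of the last convolution follow the class code listed later in this article.

```python
import math

# (depth, height, width) pooling kernels for pool1..pool5.
# Assumption: the stride equals the kernel size in every dimension.
POOL_KERNELS = [(1, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 2)]

def pooled_shapes(in_d=16, in_h=112, in_w=112):
    """Trace (D, H, W) through pool1..pool5, assuming ceil rounding."""
    shape = (in_d, in_h, in_w)
    trace = [shape]
    for kernel in POOL_KERNELS:
        shape = tuple(math.ceil(s / k) for s, k in zip(shape, kernel))
        trace.append(shape)
    return trace

trace = pooled_shapes()
names = ["input", "pool1", "pool2", "pool3", "pool4", "pool5"]
for name, shape in zip(names, trace):
    print(name, shape)

d, h, w = trace[-1]
print("flattened features:", 512 * d * h * w)  # 8192 for a 16 x 112 x 112 clip
```

Because pool1 leaves the temporal axis untouched, the 16 input frames are only merged from pool2 onward, which keeps temporal information alive in the early layers.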

Network structure code

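C3D helper functions:
Data preprocessing:
VideoResize: resizes the input video.
VideoRescale: rescales the input video frames with the given rescale factor and shift: output = image * rescale + shift.
VideoRandomCrop: crops the given video sequence (t x h x w x c) at a random location.
VideoRandomHorizontalFlip: flips every frame of the video with the given probability.
VideoReOrder: rearranges the order of the data dimensions.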
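The operators above can be illustrated with a minimal NumPy sketch. The functions below are hypothetical stand-ins written for this example and are not the project's operators. Their names, signatures, and the synthetic 128 x 171 clip are assumptions, and resizing is left out because it needs an interpolation routine.

```python
import numpy as np

def rescale(video, scale=1 / 255.0, shift=0.0):
    """output = image * scale + shift, applied to every frame."""
    return video.astype(np.float32) * scale + shift

def random_crop(video, size, rng):
    """Crop the same random window from every frame of a (t, h, w, c) clip."""
    _, h, w, _ = video.shape
    crop_h, crop_w = size
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return video[:, top:top + crop_h, left:left + crop_w, :]

def random_horizontal_flip(video, prob, rng):
    """With probability `prob`, mirror all frames along the width axis."""
    if rng.random() < prob:
        return video[:, :, ::-1, :]
    return video

def reorder(video, order):
    """Permute the axes, e.g. (t, h, w, c) -> (c, t, h, w)."""
    return np.transpose(video, order)

rng = np.random.default_rng(0)
# Synthetic clip: 16 frames, 128 x 171 pixels, 3 channels (assumed sizes).
clip = rng.integers(0, 256, size=(16, 128, 171, 3), dtype=np.uint8)
clip = rescale(clip)
clip = random_crop(clip, (112, 112), rng)
clip = random_horizontal_flip(clip, 0.5, rng)
clip = reorder(clip, (3, 0, 1, 2))
print(clip.shape)  # (3, 16, 112, 112)
```

The final reorder puts channels first, which matches the (C, D, H, W) layout of a single sample in the network input.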

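Parameters:
in_d: Depth of the input data, i.e. the number of video frames. Default: 16.
in_h: Height of the input frames. Default: 112.
in_w: Width of the input frames. Default: 112.
in_channel (int): Number of channels of the input data. Default: 3.
kernel_size (Union[int, Tuple[int]]): Kernel size of every conv3d layer in C3D. Default: (3, 3, 3).
head_channel (Tuple[int]): Sizes of the two fully connected layers. Default: (4096, 4096).
num_classes (int): Number of classes, i.e. the size of the classification score for each sample (CLASSES_out). Default: 400.
keep_prob (Tuple[float]): Keep probabilities for dropout in the multi-dense-layer head, one per dense layer. Default: (0.5, 0.5, 1.0).
pretrained (bool): If True, a pretrained model is created and its weights are loaded from the network. If False, a C3D model is created with uniform initialization of weights and biases.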
Inputs:
x (Tensor): Tensor of shape (N, C_in, D_in, H_in, W_in).

Outputs:
Tensor of shape (N, CLASSES_out).

""" C3D network."""

import math
from typing import Tuple, Union

from mindspore import nn
from src.models.layers.dropout_dense import DropoutDense
from src.models.layers.c3d_backbone import C3DBackbone
from src.utils.class_factory import ClassFactory, ModuleType

__all__ = ['C3D']


@ClassFactory.register(ModuleType.MODEL)
class C3D(nn.Cell):
    """
    TODO: introduction c3d network.

    Args:
        in_d: Depth of input data, it can be considered as frame number of a video. Default: 16.
        in_h: Height of input frames. Default: 112.
        in_w: Width of input frames. Default: 112.
        in_channel(int): Number of channel of input data. Default: 3.
        kernel_size(Union[int, Tuple[int]]): Kernel size for every conv3d layer in C3D.
            Default: (3, 3, 3).
        head_channel(Tuple[int]): Hidden size of multi-dense-layer head. Default: (4096, 4096).
        num_classes(int): Number of classes, it is the size of classification score for every sample,
            i.e. :math:`CLASSES_{out}`. Default: 400.
        keep_prob(Tuple[float]): Probability of dropout for multi-dense-layer head, the number of probabilities equals
            the number of dense layers. Default: (0.5, 0.5, 1.0).
        pretrained(bool): If `True`, it will create a pretrained model, the pretrained model will be loaded
            from network. If `False`, it will create a c3d model with uniform initialization for weight and bias.

    Inputs:
        - **x** (Tensor) - Tensor of shape :math:`(N, C_{in}, D_{in}, H_{in}, W_{in})`.

    Outputs:
        Tensor of shape :math:`(N, CLASSES_{out})`.

    Supported Platforms:
        ``GPU``

    Examples:
        >>> import numpy as np
        >>> import mindspore as ms
        >>> from msvideo.models import C3D
        >>>
        >>> net = C3D(16, 128, 128)
        >>> x = ms.Tensor(np.ones([1, 3, 16, 128, 128]), ms.float32)
        >>> output = net(x)
        >>> print(output.shape)
        (1, 400)

    About c3d:

    TODO: c3d introduction.

    Citation:

    .. code-block::

        TODO: c3d Citation.
    """

    def __init__(self,
                 in_d: int = 16,
                 in_h: int = 112,
                 in_w: int = 112,
                 in_channel: int = 3,
                 kernel_size: Union[int, Tuple[int]] = (3, 3, 3),
                 head_channel: Union[int, Tuple[int]] = (4096, 4096),
                 num_classes: int = 400,
                 keep_prob: Union[float, Tuple[float]] = (0.5, 0.5, 1.0)):
        super().__init__()
        last_d = math.ceil(in_d / 16)
        last_h = math.ceil((math.ceil(in_h / 16) + 1) / 2)
        last_w = math.ceil((math.ceil(in_w / 16) + 1) / 2)
        backbone_output_channel = 512 * last_d * last_h * last_w

        # backbone
        self.backbone = C3DBackbone(in_channel=in_channel,
                                    kernel_size=kernel_size)
        # flatten
        self.flatten = nn.Flatten()

        # classifier
        activations = ('relu', 'relu', None)
        if isinstance(head_channel, int):
            head_channel = (head_channel,)
        if isinstance(keep_prob, float):
            keep_prob = (keep_prob,)
        head_channel = list(head_channel)
        head_channel.insert(0, backbone_output_channel)
        head_channel.append(num_classes)
        dense_layers = []
        for i in range(len(head_channel)-1):
            dense_layers.append(DropoutDense(head_channel[i],
                                             head_channel[i+1],
                                             activation=activations[i],
                                             keep_prob=keep_prob[i]))
        self.classifier = nn.SequentialCell(dense_layers)

    def construct(self, x):
        x = self.backbone(x)
        x = self.flatten(x)
        x = self.classifier(x)
        return x
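The way the classifier head is assembled in `__init__` can be checked without MindSpore. The sketch below repeats the same size arithmetic in plain Python and lists the resulting dense layers. The helper name `head_layout` is hypothetical and exists only for this illustration.

```python
import math

def head_layout(in_d=16, in_h=112, in_w=112,
                head_channel=(4096, 4096), num_classes=400,
                keep_prob=(0.5, 0.5, 1.0)):
    """Return (in_features, out_features, activation, keep_prob) per dense layer."""
    # Same size arithmetic as in C3D.__init__ above.
    last_d = math.ceil(in_d / 16)
    last_h = math.ceil((math.ceil(in_h / 16) + 1) / 2)
    last_w = math.ceil((math.ceil(in_w / 16) + 1) / 2)
    # Flattened backbone output, then the hidden sizes, then the class scores.
    sizes = [512 * last_d * last_h * last_w, *head_channel, num_classes]
    activations = ('relu', 'relu', None)
    return [(sizes[i], sizes[i + 1], activations[i], keep_prob[i])
            for i in range(len(sizes) - 1)]

for layer in head_layout():
    print(layer)
# (8192, 4096, 'relu', 0.5)
# (4096, 4096, 'relu', 0.5)
# (4096, 400, None, 1.0)
```

With the 128 x 128 input used in the docstring example, the first layer instead receives 512 * 1 * 5 * 5 = 12800 features, which is why the head depends on `in_d`, `in_h`, and `in_w`.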

Link to the complete runnable project:
C3D implementation based on MindSpore
