Model Structure of the R(2+1)D Action Recognition Model

  • R(2+1)D code link and paper
    • R(2+1)D model structure
    • SpatioTemporalConv module structure
    • SpatioTemporalResLayer module structure
    • SpatioTemporalResBlock module structure

R(2+1)D code link and paper

Code: https://github.com/jfzhang95/pytorch-video-recognition
Paper: "A Closer Look at Spatiotemporal Convolutions for Action Recognition"

It is best to read this post side by side with the code.

R(2+1)D model structure

R(2+1)D model structure diagram, with the following settings:
block_type=SpatioTemporalResBlock
layer_size=[2,2,2,2]

Flow from input to output:

  • inputs: (N,3,16,112,112)
  • SpatioTemporalConv(3,64,(1,7,7),stride=(1,2,2),padding=(0,3,3),first_conv=True), outputs: (N,64,16,56,56)
  • SpatioTemporalResLayer(64,64,3,layer_size[0],block_type=block_type), outputs: (N,64,16,56,56)
  • SpatioTemporalResLayer(64,128,3,layer_size[1],block_type=block_type,downsample=True), outputs: (N,128,8,28,28)
  • SpatioTemporalResLayer(128,256,3,layer_size[2],block_type=block_type,downsample=True), outputs: (N,256,4,14,14)
  • SpatioTemporalResLayer(256,512,3,layer_size[3],block_type=block_type,downsample=True), outputs: (N,512,2,7,7)
  • AdaptiveAvgPool3d, outputs: (N,512,1,1,1)
  • View(-1,512), outputs: (N,512)
  • Softmax, then Max Index, outputs: (N,1)
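The shapes in the structure above can be checked by hand with the usual convolution output-size rule, floor((size + 2*padding - kernel) / stride) + 1. The following is a minimal sketch in plain Python that only does this arithmetic; it does not build the network. The helper names `conv_out` and `trace_shapes` are made up for illustration, and the sketch assumes that every downsampling layer uses kernel 3, padding 1, and stride 2 on all three axes, and that the first convolution leaves the time axis at its input length.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of one convolution axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(depth=16, height=112, width=112):
    """Return (stage name, (C, D, H, W)) after each stage, batch axis omitted."""
    shapes = [("inputs", (3, depth, height, width))]

    # First conv: 7x7 spatial kernel with stride 2 and padding 3;
    # the time axis is assumed to keep its length.
    d = depth
    h = conv_out(height, 7, 2, 3)
    w = conv_out(width, 7, 2, 3)
    shapes.append(("SpatioTemporalConv", (64, d, h, w)))

    # Four residual layers; every layer except the first is built with downsample=True.
    stages = [(64, False), (128, True), (256, True), (512, True)]
    for index, (channels, downsample) in enumerate(stages, start=1):
        if downsample:
            # Assumed: kernel 3, padding 1, stride 2 on D, H and W.
            d, h, w = (conv_out(size, 3, 2, 1) for size in (d, h, w))
        shapes.append((f"SpatioTemporalResLayer {index}", (channels, d, h, w)))

    # Adaptive average pooling collapses D, H and W to 1.
    shapes.append(("AdaptiveAvgPool3d", (512, 1, 1, 1)))
    return shapes


if __name__ == "__main__":
    for name, shape in trace_shapes():
        print(name, shape)
```

Running the script prints one line per stage, which makes it easy to compare against the outputs listed above or to try a different clip length or resolution.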

SpatioTemporalConv module structure

Input parameters of SpatioTemporalConv: (in_channels,out_channels,kernel_size,stride=1,padding=0,bias=False,first_conv=False)

Flow from input to output:

  • Inputs
  • Spatial_conv: Conv3d(in_channels,intermed_channels,spatial_kernel_size,spatial_stride,spatial_padding,bias)
  • BatchNorm3d
  • Relu
  • Temporal_conv: Conv3d(intermed_channels,out_channels,temporal_kernel_size,temporal_stride,temporal_padding,bias)
  • BatchNorm3d
  • Relu
  • outputs

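In the code, intermed_channels=45 when first_conv=True. Otherwise:

intermed_channels=(kernel_size[0] * kernel_size[1] * kernel_size[2] * in_channels * out_channels)/(kernel_size[1] * kernel_size[2] * in_channels + kernel_size[0] * out_channels)

with temporal_kernel_size=(3,1,1) and temporal_padding=(1,0,0).

In R3D the Conv3d kernel has size t×t×t. R(2+1)D replaces it with a Spatial_conv whose kernel is 1×t×t, followed by a Temporal_conv whose kernel is t×1×1. The kernel size, stride, and padding are split between the two convolutions as follows: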

spatial_kernel_size=(1,kernel_size[1],kernel_size[2])
spatial_stride=(1,stride[1],stride[2])
spatial_padding=(0,padding[1],padding[2])

temporal_kernel_size=(kernel_size[0],1,1)
temporal_stride=(stride[0],1,1)
temporal_padding=(padding[0],0,0)
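The parameter split and the intermed_channels formula above can be sketched as a small pure-Python helper. This is an illustration, not the repository's code: the names `triple` and `decompose` are hypothetical, rounding the channel count down to an integer is an assumption (a channel count has to be whole), and the sketch assumes that the (3,1,1) temporal kernel with (1,0,0) temporal padding is what applies when first_conv=True.

```python
def triple(value):
    """Expand an int to a 3-tuple; leave an existing 3-sequence as a tuple."""
    return (value,) * 3 if isinstance(value, int) else tuple(value)


def decompose(in_channels, out_channels, kernel_size, stride=1, padding=0,
              first_conv=False):
    """Split one 3D convolution into spatial and temporal settings."""
    k, s, p = triple(kernel_size), triple(stride), triple(padding)

    # The spatial conv only acts on the last two axes.
    spatial = {
        "kernel_size": (1, k[1], k[2]),
        "stride": (1, s[1], s[2]),
        "padding": (0, p[1], p[2]),
    }
    # The temporal conv only acts on the first (time) axis.
    temporal = {
        "kernel_size": (k[0], 1, 1),
        "stride": (s[0], 1, 1),
        "padding": (p[0], 0, 0),
    }

    if first_conv:
        intermed_channels = 45
        # Assumption: the first conv uses a fixed 3x1x1 temporal kernel with padding 1.
        temporal["kernel_size"] = (3, 1, 1)
        temporal["padding"] = (1, 0, 0)
    else:
        numerator = k[0] * k[1] * k[2] * in_channels * out_channels
        denominator = k[1] * k[2] * in_channels + k[0] * out_channels
        # Assumption: round down so the channel count is an integer.
        intermed_channels = numerator // denominator

    return {"intermed_channels": intermed_channels,
            "spatial": spatial, "temporal": temporal}
```

For example, `decompose(64, 64, 3, padding=1)` gives intermed_channels = 144, which keeps the parameter count of the two factored convolutions close to that of a single 3×3×3 convolution with 64 input and 64 output channels.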

SpatioTemporalResLayer module structure

Input parameters of SpatioTemporalResLayer: (in_channels,out_channels,kernel_size,layer_size,block_type=SpatioTemporalResBlock,downsample=False)

Flow from input to output:

  • Inputs
  • block_type(in_channels,out_channels,kernel_size,downsample)
  • block_type(out_channels,out_channels,kernel_size), repeated layer_size-1 times
  • outputs
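To make the layer layout concrete, the sketch below lists the arguments each block would receive inside one SpatioTemporalResLayer. It is plain Python with a hypothetical helper name, `res_layer_blocks`, and it only enumerates configurations rather than constructing modules. Only the first block changes the channel count and may downsample; the remaining layer_size-1 blocks map out_channels to out_channels.

```python
def res_layer_blocks(in_channels, out_channels, kernel_size, layer_size,
                     downsample=False):
    """Return one (in_channels, out_channels, kernel_size, downsample) tuple per block."""
    # First block: changes the channel count and optionally downsamples.
    blocks = [(in_channels, out_channels, kernel_size, downsample)]
    # Remaining blocks: same channel count in and out, never downsample.
    blocks += [(out_channels, out_channels, kernel_size, False)] * (layer_size - 1)
    return blocks


if __name__ == "__main__":
    layer_size = [2, 2, 2, 2]
    layers = [(64, 64, False), (64, 128, True), (128, 256, True), (256, 512, True)]
    for size, (c_in, c_out, down) in zip(layer_size, layers):
        print(res_layer_blocks(c_in, c_out, 3, size, downsample=down))
```

With layer_size=[2,2,2,2] this gives eight residual blocks in total, three of which downsample.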

SpatioTemporalResBlock module structure

This module is a residual block.

Input parameters of SpatioTemporalResBlock: (in_channels,out_channels,kernel_size,stride=1,downsample=False)
padding=kernel_size//2
When downsample=True, the input Tensor is downsampled: an input of shape (N,C,D,X,Y) produces an output of shape (N,out_channels,D/2,X/2,Y/2). The structure in this case is:

Main branch:

  • Inputs: x, shape (N,C,D,X,Y)
  • SpatioTemporalConv(in_channels,out_channels,kernel_size,padding,stride=2), outputs: (N,out_channels,D/2,X/2,Y/2)
  • BatchNorm3d
  • Relu
  • SpatioTemporalConv(out_channels,out_channels,kernel_size,padding), outputs: (N,out_channels,D/2,X/2,Y/2)
  • BatchNorm3d
  • Relu, output: res

Shortcut branch:

  • Inputs: x, shape (N,C,D,X,Y)
  • SpatioTemporalConv(in_channels,out_channels,1,stride=2), outputs: (N,out_channels,D/2,X/2,Y/2)
  • BatchNorm3d, output: x

Merge:

  • outputs: Relu(res+x), shape (N,out_channels,D/2,X/2,Y/2)

When downsample=False, the input Tensor is not downsampled and the output has the same shape as the input. The structure in this case is:

  • Inputs: x, shape (N,C,D,X,Y)
  • SpatioTemporalConv(in_channels,out_channels,kernel_size,padding), outputs: (N,out_channels,D,X,Y)
  • BatchNorm3d
  • Relu
  • SpatioTemporalConv(out_channels,out_channels,kernel_size,padding), outputs: (N,out_channels,D,X,Y)
  • BatchNorm3d
  • Relu, output: res
  • outputs: Relu(res+x), shape (N,out_channels,D,X,Y)
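The residual sum res+x only works if both branches produce the same shape. The sketch below checks this with plain shape arithmetic for both the downsample=True and downsample=False cases. The helper names `conv_out` and `res_block_shapes` are hypothetical, and the sketch assumes padding=kernel_size//2, a stride of 2 on all three axes when downsampling, and a kernel-size-1, stride-2 convolution on the shortcut.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of one convolution axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def res_block_shapes(shape, out_channels, kernel_size=3, downsample=False):
    """shape is (C, D, X, Y); return (main branch shape, shortcut shape)."""
    channels, *dims = shape
    padding = kernel_size // 2
    stride = 2 if downsample else 1

    # Main branch: first conv applies the stride, second conv uses stride 1.
    main_dims = []
    for size in dims:
        size = conv_out(size, kernel_size, stride, padding)
        size = conv_out(size, kernel_size, 1, padding)
        main_dims.append(size)

    if downsample:
        # Shortcut: kernel size 1, stride 2, no padding, maps to out_channels.
        shortcut = (out_channels, *[conv_out(size, 1, 2, 0) for size in dims])
    else:
        # Shortcut: x is added unchanged.
        shortcut = (channels, *dims)

    return (out_channels, *main_dims), shortcut


if __name__ == "__main__":
    print(res_block_shapes((64, 16, 56, 56), 128, downsample=True))
    print(res_block_shapes((64, 16, 56, 56), 64, downsample=False))
```

In the downsample=False case the shortcut is the unchanged input, so the two shapes match only when in_channels equals out_channels. This is consistent with the layer structure, where every block that changes the channel count is also created with downsample=True.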
