Action Recognition Paper Notes | TSM | TSM: Temporal Shift Module for Efficient Video Understanding


Lin, Ji, Chuang Gan, and Song Han. “TSM: Temporal Shift Module for Efficient Video Understanding.” Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 2019.

Contents

  • Action Recognition Paper Notes | TSM | TSM: Temporal Shift Module for Efficient Video Understanding
    • Motivations
    • Solutions
    • Experiments
    • English Expression
    • Advantages and Drawbacks
    • Code
      • Data Preparation
        • Kinetics 400

Motivations

  1. The Temporal Shift Module (TSM) achieves the performance of 3D CNNs while keeping the complexity of 2D CNNs.

  2. Shift, a hardware-friendly primitive, has also been exploited for compact 2D CNN design in image recognition tasks:

    Chen, Weijie, et al. “All you need is a few shifts: Designing efficient convolutional neural networks for image classification.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.

    Wu, Bichen, et al. “Shift: A zero flop, zero parameter alternative to spatial convolutions.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.

    [figure: spatial shift operation from the papers above]

    He, Yihui, et al. “Addressnet: Shift-based primitives for efficient convolutional neural networks.” 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2019.

    [figure: AddressNet shift-based primitives]

Solutions

[figure: illustration of the Temporal Shift Module]

  • partial shift: only a fraction of the channels is shifted along the temporal dimension, so data movement stays small (a minimal code sketch follows this list).

    [figure: partial shift]

  • residual shift: the shift is placed inside the residual branch rather than on the main path (in-place shift), so the identity path still carries the original spatial activation.

    [figure: residual TSM vs. in-place TSM]

  • Online video understanding

    1/8 of the feature maps (red) of each residual block are cached in memory and shifted into the next frame's activation, making the shift uni-directional.

    [figure: online TSM with cached features]
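Below is a minimal PyTorch sketch of the bi-directional partial shift described above, written from the paper's description rather than copied from the released code; the function name temporal_shift and the [N*T, C, H, W] layout are my assumptions.

import torch

def temporal_shift(x, n_segment, fold_div=8):
    # x: [N*T, C, H, W] activations from a 2D backbone; fold_div=8 means
    # 1/8 of the channels shift forward in time, 1/8 shift backward,
    # and the remaining 3/4 stay in place (partial shift).
    nt, c, h, w = x.size()
    x = x.view(nt // n_segment, n_segment, c, h, w)        # [N, T, C, H, W]
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # frame t receives channels from t+1
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # frame t receives channels from t-1
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # untouched channels
    return out.view(nt, c, h, w)

In the residual variant, this shift is applied to the input of the convolution branch inside each residual block, so the identity path still carries the unshifted activation.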

Experiments

  • Something-Something-V1 SOTA

[figure: Something-Something-V1 comparison table]

Compared to 2D CNNs, TSN & TRN (late temporal fusion) have low FLOPs and few parameters, but lack temporal modeling.

Zhou, Bolei, et al. “Temporal relational reasoning in videos.” Proceedings of the European Conference on Computer Vision (ECCV). 2018.

Compared to early-2D + late-3D designs: ECO (mid-level temporal fusion).

Compared to Non-local I3D + GCN (all-level temporal fusion): the GCN branch needs a Region Proposal Network [34] trained on the MSCOCO object detection dataset [30], which makes the comparison unfair.

Optical flow is extracted with the TVL1 algorithm [55] implemented in OpenCV with CUDA.

  • Cost vs. Accuracy

[figure: accuracy vs. computation cost]

  • Online Recognition with TSM (a caching sketch follows this list)

    [figure: online recognition results]
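A minimal sketch of the online, uni-directional shift with a per-layer cache, again written from the description in this note; online_shift and its calling convention are my own assumptions, not the authors' released code.

import torch

def online_shift(x, cache, fold_div=8):
    # x: [N, C, H, W] activation of the current frame at one TSM layer.
    # cache: the first C // fold_div channels saved from the previous frame
    # (zeros for the very first frame).
    fold = x.size(1) // fold_div
    new_cache = x[:, :fold].clone()     # keep 1/8 of the channels for the next frame
    out = x.clone()
    out[:, :fold] = cache               # replace them with the cached features of frame t-1
    return out, new_cache

At each time step the cache returned for frame t-1 is fed back in, so only 1/8 of each layer's activation has to be kept in memory between frames.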

English Expression

  1. There are works that trade off between temporal modeling and computation, such as post-hoc fusion [13, 9, 58, 7] and mid-level temporal fusion [61, 53, 46].
  2. Data movement increases the memory footprint and the inference latency on hardware. Worse still, this effect is exacerbated in video understanding networks due to the large activation size (5D tensors).

Advantages and Drawbacks

  1. The shift operation adds no computational cost: while keeping 2D convolutions, it exchanges part of the channels within the same temporal window; the shift kernel is simply a translation filter.

  2. The experiments on different mobile devices are also thorough.

  3. Channels shifted out of the window are discarded and the vacated positions are zero-padded; what if useful channels of the current frame get shifted away?

  4. Shift on 2D convolutions is just a special case of depthwise convolution; compared with a fully connected layer it saves the same order of magnitude of parameters without losing much accuracy. Compared with an ordinary depthwise convolution, shift has the extra advantage that its runtime does not depend on kernel size, whereas the parameter count of a normal depthwise convolution grows with the square of the kernel side length (the shift here only uses a 3x3 kernel). Hikvision's paper "All You Need Is a Few Shifts" proposes a sparsified shift operation (the loss function penalizes useless shifts); it visualizes all shift operations in a block and counts the proportion of shifts in each direction, and the no-shift (identity) case is the most common. A small example of shift-as-depthwise-convolution is given after this list.

    Difference between 2D shift and temporal shift: a 2D shift moves content spatially within one feature map, using a translation kernel; a temporal shift moves one whole channel, i.e. an entire feature map, into the same-layer channel at another time step.

  5. I feel the figure in the paper is misleading: $y_t$ is not the final prediction, and the paper never explains what $y_t$ is. The figure below shows the correct picture: a prediction is produced every N frames.

    [figure: online TSM producing one prediction every N frames]
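To make point 4 concrete, here is a small self-contained example (my own illustration, not from the paper) showing that a 2D spatial shift is exactly a depthwise convolution whose per-channel 3x3 kernel is one-hot:

import torch
import torch.nn.functional as F

c = 4
x = torch.randn(1, c, 8, 8)

# one-hot 3x3 kernels: channel 0 shifts content down, channel 1 shifts it up,
# the remaining channels keep the identity (centre) position
weight = torch.zeros(c, 1, 3, 3)
weight[0, 0, 0, 1] = 1.0   # read the pixel above  -> content moves down
weight[1, 0, 2, 1] = 1.0   # read the pixel below  -> content moves up
weight[2:, 0, 1, 1] = 1.0  # centre -> unchanged

shifted = F.conv2d(x, weight, padding=1, groups=c)  # depthwise: one kernel per channel
assert torch.equal(shifted[0, 2], x[0, 2])          # identity channels are untouched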

Code

Data Preparation

Kinetics 400

Directory layout: class_name / video_name (youtube_id_time_start_duration.mp4)
Original video: --07WQ2iBlw_1_10.mp4, i.e. youtube_id_time_start_duration.mp4

  1. Convert videos to frames
    python vid2img_kinetics.py <video path> (video/class_name/*.mp4) <image path> (images/)
cmd = 'ffmpeg -i \"{}\" -threads 1 -vf scale=-1:331 -q:v 0 \"{}/img_%05d.jpg\"'.format(video_file_path, dst_directory_path)

The short side is resized to 331.
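A minimal sketch of what this frame-extraction step does, assuming the video/class_name/*.mp4 layout above; the directory walk and argument handling are simplified guesses, and only the ffmpeg command is taken from the script.

import os
import subprocess
import sys

def extract_frames(video_dir, image_dir):
    for class_name in sorted(os.listdir(video_dir)):
        class_path = os.path.join(video_dir, class_name)
        if not os.path.isdir(class_path):
            continue
        for video_file in sorted(os.listdir(class_path)):
            name, ext = os.path.splitext(video_file)
            if ext != '.mp4':
                continue
            video_file_path = os.path.join(class_path, video_file)
            dst_directory_path = os.path.join(image_dir, class_name, name)
            os.makedirs(dst_directory_path, exist_ok=True)
            # same ffmpeg call as above: rescale so the height (the short side
            # for landscape videos) is 331 and dump frames as img_00001.jpg, ...
            cmd = 'ffmpeg -i \"{}\" -threads 1 -vf scale=-1:331 -q:v 0 \"{}/img_%05d.jpg\"'.format(video_file_path, dst_directory_path)
            subprocess.call(cmd, shell=True)

if __name__ == '__main__':
    extract_frames(sys.argv[1], sys.argv[2])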

  2. Generate the new labels
    gen_label_kinetics.py

Input label: kinetics_val.csv

label,youtube_id,time_start,time_end,split
javelin throw,--07WQ2iBlw,1.0,11.0,validate

Spaces in the label are replaced with underscores; double quotes and single quotes are also cleaned up.
Output label: val_videofolder.txt

path,num,label
javelin_throw/--07WQ2iBlw_1.0,

img_dir: …/data/kinetics/images256/javelin_throw/--07WQ2iBlw_1.0

Folder --07WQ2iBlw_1.0 == youtubeID_starttime
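A minimal sketch of what gen_label_kinetics.py has to do according to the notes above; the label-to-index mapping, the hard-coded paths, and the exact output separator are assumptions, not necessarily the script's real behavior.

import csv
import os

img_root = '../data/kinetics/images256'        # where the extracted frames live (assumed path)
categories = sorted(os.listdir(img_root))      # class folders, e.g. javelin_throw
label_index = {name: i for i, name in enumerate(categories)}

with open('kinetics_val.csv') as f, open('val_videofolder.txt', 'w') as out:
    for row in csv.DictReader(f):
        # label: spaces -> underscores, quotes stripped, as described above
        label = row['label'].replace(' ', '_').replace('"', '').replace("'", '')
        # folder name: youtubeID_starttime, e.g. --07WQ2iBlw_1.0
        folder = '{}_{}'.format(row['youtube_id'], float(row['time_start']))
        rel_path = os.path.join(label, folder)
        frame_dir = os.path.join(img_root, rel_path)
        if not os.path.isdir(frame_dir):
            continue                           # skip videos that were never downloaded/extracted
        num_frames = len(os.listdir(frame_dir))
        out.write('{} {} {}\n'.format(rel_path, num_frames, label_index[label]))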

你可能感兴趣的:(行为识别,视频理解)