2021-11-12 Spatial Temporal Transformer Network for Skeleton-based Action Recognition

Author and Department
Chiara et al., Politecnico di Milano, Italy; published at ICPR, 2020.

The paper has released code, but my reproduction is not yet correct; I will keep following up on it.

Table of Contents

  • Abstract
  • Summary
  • Research Objective(s)/Motivation
    • Contribution
  • Background / Problem Statement (Introduction)
    • Problem Statement
  • Method(s)
    • Spatial Self-Attention (SSA)
    • Temporal Self-Attention (TSA)
    • Two-Stream Spatial Temporal Transformer Network
  • Experiments
    • Ablation Study

Abstract

The abstract breaks down into four parts: 1. background, 2. motivation, 3. method, 4. conclusion.

  • Background: Skeleton data has been demonstrated to be robust to illumination changes, etc. Nevertheless, an effective encoding of the latent information underlying the 3D skeleton is still an open problem.

  • Motivation: I think partly riding the Transformer hype. In addition, existing methods ignore the correlations between joint pairs.

  • Method: Spatial-Temporal Transformer network (ST-TR)

    • Spatial Self-Attention module (SSA): captures intra-frame interactions between different body parts;

    • Temporal Self-Attention module (TSA): models inter-frame correlations.

  • Conclusion: a two-stream network that outperforms state-of-the-art models on both NTU RGB+D 60 and NTU RGB+D 120.

Summary

To be filled in last, after finishing the notes: an overview of the article's content, to read first when revisiting these notes. Note: the summary must come from your own thinking, described in your own words. Never just Ctrl+C the original text.

Research Objective(s)/Motivation

The authors' goal is to extract adaptive low-level features and model the interactions in human actions via the Spatial Self-Attention module (SSA) and the Temporal Self-Attention module (TSA).

Contribution

  • The authors propose a novel two-stream Transformer-based model (covering both the temporal and spatial dimensions)

  • Spatial Self-Attention (SSA) & Temporal Self-Attention (TSA)

    • The SSA module dynamically builds links between skeleton joints; it captures the relations between body parts that are relevant to the action, rather than strictly following the natural structure of the human joints.

    • TSA studies the dynamics of joints over time.

Background / Problem Statement (Introduction)

Problem Statement

  1. The topology of the graph representing the human body is fixed for all layers and actions, preventing the extraction of rich representations.
  2. Spatial and temporal convolutions are both based on 2D convolution, so they are limited to features from local neighborhoods;
  3. There are correlations between body joints that are not linked in the human skeleton.

Method(s)

[Figure 1: the SSA and TSA attention modules]

Spatial Self-Attention (SSA)

As shown in Fig. 1(a), first calculate $\mathbf{q}_i^t\in\mathbb{R}^{d_q}$, $\mathbf{k}_i^t\in\mathbb{R}^{d_k}$ and $\mathbf{v}_i^t\in\mathbb{R}^{d_v}$; then compute a query-key dot product to obtain the weights $a_{i,j}^t\in\mathbb{R}$ (each weight represents the strength of the correlation between the two joints).
A weighted sum is then computed to obtain a new embedding for node $i$ at time $t$ (the $\sum$ produces the node's new embedding):
$$a_{i,j}^t=\mathbf{q}_i^t\cdot{\mathbf{k}_j^t}^{\top},\ \forall t\in T,\qquad \mathbf{z}_i^t=\sum_j \mathrm{softmax}_j\!\left(\frac{a_{i,j}^t}{\sqrt{d_k}}\right)\mathbf{v}_j^t \tag{1}$$

Multi-head self-attention repeats the embedding-extraction process $H$ times, each time with a different set of learned parameters, yielding node embeddings $\mathbf{z}_{i_1}^t,\dots,\mathbf{z}_{i_H}^t$, all referring to node $i$ at time $t$. These are combined as $\mathrm{concat}(\mathbf{z}_{i_1}^t,\dots,\mathbf{z}_{i_H}^t)\cdot W_O$ and constitute the output features of SSA.

In short, this step aggregates, in space, each node's features with those of all the other nodes.

Therefore, as shown in Fig. 1(a), the relations between nodes (the $a_{i,j}^t$ scores) are predicted dynamically: the relational structure is not fixed across actions but adapts to each sample. The SSA operation is similar to graph convolution on a fully connected graph, except that the kernel values (the $a_{i,j}^t$ scores) are predicted dynamically from the skeleton action.
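To make Eq. (1) concrete, below is a minimal PyTorch sketch of a multi-head SSA layer. This is my own illustrative reconstruction, not the authors' released code: the module name, layer sizes, and the (N, T, V, C) tensor layout are all assumptions.

```python
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    """Minimal sketch of SSA (Eq. 1): multi-head self-attention over the
    V joints within each frame. Names and sizes are illustrative assumptions."""
    def __init__(self, in_channels, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.qkv = nn.Linear(in_channels, 3 * d_model)  # q, k, v in one projection
        self.w_o = nn.Linear(d_model, d_model)          # W_O applied after concat

    def forward(self, x):
        # x: (N, T, V, C) = batch, frames, joints, channels
        N, T, V, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split heads -> (N, T, H, V, d_k); attention runs over the joint axis V
        split = lambda y: y.view(N, T, V, self.h, self.d_k).transpose(2, 3)
        q, k, v = split(q), split(k), split(v)
        # a_{i,j}^t = q_i^t . k_j^t, computed for every frame t in parallel
        scores = (q @ k.transpose(-2, -1)) / self.d_k ** 0.5  # (N, T, H, V, V)
        z = scores.softmax(dim=-1) @ v                        # weighted sum over j
        z = z.transpose(2, 3).reshape(N, T, V, self.h * self.d_k)
        return self.w_o(z)  # concat of all heads, multiplied by W_O
```

Each frame gets its own V × V attention map per head, which is exactly the sample-adaptive, fully connected structure described above.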

Temporal Self-Attention (TSA)

$$a_{i,j}^v=\mathbf{q}_i^v\cdot{\mathbf{k}_j^v}^{\top},\ \forall v\in V,\qquad \mathbf{z}_i^v=\sum_j \mathrm{softmax}_j\!\left(\frac{a_{i,j}^v}{\sqrt{d_k}}\right)\mathbf{v}_j^v \tag{2}$$
Here $i^v$ and $j^v$ denote joint $v$ at times $i$ and $j$, respectively; everything else works as in SSA.
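Since Eq. (2) is Eq. (1) with the roles of time and joints exchanged, TSA can be sketched by reusing the SSA module above on axis-swapped input. The axis-swap trick is my shortcut, not necessarily how the authors implement it:

```python
class TemporalSelfAttention(nn.Module):
    """Sketch of TSA (Eq. 2): the same attention, but computed along the
    time axis for each joint, by swapping the T and V axes."""
    def __init__(self, in_channels, d_model, num_heads):
        super().__init__()
        self.attn = SpatialSelfAttention(in_channels, d_model, num_heads)

    def forward(self, x):
        # x: (N, T, V, C) -> (N, V, T, C), so attention runs over frames
        z = self.attn(x.transpose(1, 2))
        return z.transpose(1, 2)  # back to (N, T, V, d_model)
```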

Two-Stream Spatial Temporal Transformer Network

Now that we have SSA and TSA, the next step is to combine them.

[Figure 2: the two-stream ST-TR architecture]

The authors replace the GCN and the TCN of ST-GCN with SSA and TSA, respectively.

Spatial Transformer Stream (S-TR)
$$\mathrm{S\text{-}TR}(x)=\mathrm{Conv}_{2D(1\times K_t)}(\mathrm{SSA}(x))$$
Following the original Transformer structure, Batch Normalization layers and skip connections are used.
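A hedged sketch of one S-TR block built from the SSA module above: attention over joints, then a 2D convolution along the time axis only, with Batch Normalization and a skip connection. Kernel size and channel handling are my assumptions:

```python
class STransformerBlock(nn.Module):
    """Sketch of one S-TR layer: SSA over joints, then Conv2D(1 x K_t)
    along time, with BatchNorm and a residual connection."""
    def __init__(self, channels, num_heads=8, k_t=9):
        super().__init__()
        self.ssa = SpatialSelfAttention(channels, channels, num_heads)
        self.tcn = nn.Conv2d(channels, channels, kernel_size=(k_t, 1),
                             padding=(k_t // 2, 0))  # convolve over T only
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        # x: (N, T, V, C)
        z = self.ssa(x).permute(0, 3, 1, 2)           # -> (N, C, T, V) for Conv2d
        z = self.bn(self.tcn(z)).permute(0, 2, 3, 1)  # -> (N, T, V, C)
        return torch.relu(x + z)                      # skip connection
```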

Temporal Transformer Stream (T-TR)

$$\mathrm{T\text{-}TR}(x)=\mathrm{TSA}(\mathrm{GCN}(x))$$
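Correspondingly, a T-TR block runs a graph convolution over the skeleton adjacency before TSA, and the two streams are fused at the score level. Again a sketch under my own assumptions: `adj` is a normalized V × V adjacency matrix, the one-hop GCN is a stand-in for the paper's graph convolution, and summing class scores stands in for the paper's fusion:

```python
class TTransformerBlock(nn.Module):
    """Sketch of one T-TR layer: T-TR(x) = TSA(GCN(x)), with a plain
    one-hop graph convolution standing in for the paper's GCN."""
    def __init__(self, channels, adj, num_heads=8):
        super().__init__()
        self.register_buffer("adj", adj)  # normalized (V, V) skeleton adjacency
        self.gcn = nn.Linear(channels, channels)
        self.tsa = TemporalSelfAttention(channels, channels, num_heads)

    def forward(self, x):
        # x: (N, T, V, C); aggregate one-hop neighbors, then attend over time
        z = torch.relu(self.gcn(torch.einsum("uv,ntvc->ntuc", self.adj, x)))
        return torch.relu(x + self.tsa(z))


class TwoStreamSTTR(nn.Module):
    """Toy two-stream wrapper: each stream produces class scores, and the
    final prediction sums them (fusion scheme assumed)."""
    def __init__(self, s_stream, t_stream, channels, num_classes):
        super().__init__()
        self.s_stream, self.t_stream = s_stream, t_stream
        self.fc_s = nn.Linear(channels, num_classes)
        self.fc_t = nn.Linear(channels, num_classes)

    def forward(self, x):
        # global average pool over frames and joints in each stream
        zs = self.s_stream(x).mean(dim=(1, 2))
        zt = self.t_stream(x).mean(dim=(1, 2))
        return self.fc_s(zs) + self.fc_t(zt)
```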

Experiments

Datasets: NTU RGB+D 60 and NTU RGB+D 120.

Ablation Study

[Table: ablation results on NTU RGB+D]

The S-TR stream achieves slightly better performance (+0.4%) than the T-TR stream. The reason: the SSA in S-TR attends over only 25 joints, whereas correlation along the temporal dimension has to span a large number of frames. The number of parameters is also reduced.

[Table: per-action accuracy comparison of S-TR and T-TR]

Actions such as “playing with phone”, “typing”, and “cross hands” benefit most from S-TR, while actions with long-range temporal dependencies or involving two people, such as “hugging”, “point finger”, and “pat on back”, benefit most from T-TR.


Next step: analyze the code, since the reproduction currently has problems and is still being adjusted.
