The human skeleton is intuitively represented as a sparse graph, with joints as nodes and the natural connections between them as edges.
Formulation of a general part-based graph convolutional network (PB-GCN).
Use of geometric and motion features in place of 3D joint locations at each vertex.
That is, geometric information (relative joint coordinates) and motion information (temporal displacements) are used.
Exceeding the state-of-the-art on challenging benchmark datasets NTURGB+D and HDM05.
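The geometric and motion features listed above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `geometric_and_motion_features`, the choice of joint 0 as the reference joint, and the zero padding of the last frame are all hypothetical and not taken from the paper.

```python
import numpy as np

def geometric_and_motion_features(joints, root=0):
    """joints: array of shape (T, V, 3) holding 3D joint locations per frame.

    Returns (relative, displacement), both of shape (T, V, 3).
    """
    joints = np.asarray(joints, dtype=float)
    # Geometric feature: coordinates of every joint relative to a reference joint.
    # Using joint `root` as the reference is an assumption of this sketch.
    relative = joints - joints[:, root:root + 1, :]
    # Motion feature: displacement of each joint between consecutive frames.
    # The last frame has no successor, so it is left as zeros (assumption).
    displacement = np.zeros_like(joints)
    displacement[:-1] = joints[1:] - joints[:-1]
    return relative, displacement

# Toy S-video: 3 frames, 2 joints; the whole skeleton moves one unit along x per frame.
toy = np.zeros((3, 2, 3))
toy[:, 1, 0] = 1.0                          # joint 1 sits one unit right of joint 0
toy[:, :, 0] += np.arange(3)[:, None]       # translate both joints along x over time
rel, disp = geometric_and_motion_features(toy)
```

In this toy clip the relative coordinates stay constant over time while the displacements capture the translation, which is the kind of separation between pose and motion that these features are meant to provide.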
$$
Y(v_i) = \sum_{v_j \in N_k(v_i)} W(L(v_j))\, X(v_j)
$$
Rewriting the neighborhood $N_k(v_i)$ in terms of the adjacency matrix $A$, and reducing the neighborhood size from $k$ to 1, gives the following expression:
$$
Y(v_i) = \sum_j A^{norm}(i, j)\, W(L(v_j))\, X(v_j)
$$
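As a rough illustration of the adjacency-matrix form, the sketch below applies one graph convolution to a toy chain of three joints. The notes do not say how $A^{norm}$ is computed, so the symmetric normalization $D^{-1/2}(A + I)D^{-1/2}$ used here is an assumption, as is the use of a single shared weight matrix (one label for all neighbors). The names `normalize_adjacency` and `graph_conv` are hypothetical.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} (an assumed choice)."""
    A_hat = A + np.eye(A.shape[0])           # self-loops so each vertex sees itself
    deg = A_hat.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def graph_conv(A_norm, X, W):
    """Y(v_i) = sum_j A_norm(i, j) W X(v_j), with one W shared by all neighbors."""
    return A_norm @ X @ W

# Toy chain of three joints: 0 - 1 - 2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.eye(3)                  # one-hot input features, C_in = 3
W = np.ones((3, 2))            # maps C_in = 3 to C_out = 2
A_norm = normalize_adjacency(A)
Y = graph_conv(A_norm, X, W)
```

Because joints 0 and 2 are not directly connected, `A_norm[0, 2]` is zero and joint 2 contributes nothing to the output at joint 0.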
In general, a part-based graph can be constructed as a combination of subgraphs where each subgraph has certain properties that define it.
We consider scenarios in which the partitions can share vertices or have edges connecting them.
That is, a graph is partitioned into subgraphs, and different subgraphs may share vertices or have edges connecting them.
$$
G = \bigcup_{p \in \{1,\dots,n\}} P_p \;\mid\; P_p = (V_p, \varepsilon_p)
$$
We consider left and right parts of hands and legs together in order to be agnostic to laterality [31] (handedness / footedness) of the human when performing an action.
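That is, the effect of laterality is removed (waving with the left hand and waving with the right hand are both waving).
We divide the upper and lower components of the appendicular skeleton into left and right (shown in Figure 1(d)), resulting in six parts.
Subgraphs can be connected in two ways: through shared vertices or through connecting edges. Vertex sharing is used here.
To cover all natural connections between joints in the skeleton graph, we include an overlap of at least one joint between two adjacent parts.
That is, every pair of adjacent subgraphs has at least one node in common.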
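The overlap requirement can be checked mechanically. The sketch below uses a hypothetical 9-joint skeleton and a toy three-part partition (neither is the paper's actual joint assignment) to show that sharing one joint between adjacent parts is what allows every natural connection to fall inside some part.

```python
# Hypothetical 9-joint skeleton; indices and names are illustrative only.
edges = {(0, 1), (1, 2),          # spine: hip(0) - chest(1) - head(2)
         (1, 3), (3, 4),          # left arm
         (1, 5), (5, 6),          # right arm
         (0, 7), (0, 8)}          # left and right leg

# Toy partition with left and right limbs grouped together.
parts = {
    "axial": {0, 1, 2},
    "arms":  {1, 3, 4, 5, 6},     # shares joint 1 with the axial part
    "legs":  {0, 7, 8},           # shares joint 0 with the axial part
}

def part_edges(part, edges):
    """Edges of the skeleton whose two endpoints both lie in `part`."""
    return {(a, b) for (a, b) in edges if a in part and b in part}

def covers_all_edges(parts, edges):
    """True if every natural connection falls inside at least one part."""
    covered = set()
    for part in parts.values():
        covered |= part_edges(part, edges)
    return covered == edges

def shared_joints(parts, p1, p2):
    """Joints that two parts have in common."""
    return parts[p1] & parts[p2]

ok = covers_all_edges(parts, edges)
```

If the shared joint is removed from the arms part, the two shoulder connections `(1, 3)` and `(1, 5)` no longer belong to any part, which is the situation the overlap is meant to avoid.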
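Unlike the single-graph convolution given above (Eq. 2), a graph partitioned into subgraphs requires a new convolution formula.
Several concepts also need to be redefined.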
Graph convolutions over a part identify the properties of that subgraph, and an aggregation across subgraphs learns the relations between them.
For a part-based graph, convolutions for each part are performed separately and the results are combined using an aggregation function $F_{agg}$.
That is, convolution is first performed within each subgraph (first-order neighborhood), and the aggregation function $F_{agg}$ then combines the results to capture the relations between subgraphs.
The formulas are as follows:
$$
Y_p(v_i) = \sum_{v_j \in N_{kp}(v_i)} W_p(L_p(v_j))\, X_p(v_j), \quad p \in \{1,\dots,n\}
$$
Edge-sharing form:
$$
Y(v_i) = F_{agg}(Y_{p1}(v_i), Y_{p2}(v_j)) \mid (v_i, v_j) \in \varepsilon(p1, p2),\ (p1, p2) \in \{1,\dots,n\} \times \{1,\dots,n\}
$$
Vertex-sharing form:
$$
Y(v_i) = F_{agg}(Y_{p1}(v_i), Y_{p2}(v_i)) \mid (p1, p2) \in \{1,\dots,n\} \times \{1,\dots,n\}
$$
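To make the vertex-sharing form concrete, the sketch below convolves two overlapping parts of a toy chain separately and then combines them. These notes do not fix a particular $F_{agg}$, so the plain sum in `aggregate_sum` is an assumption, as are the row normalization and the function names.

```python
import numpy as np

def part_conv(A, part, X, W_p):
    """Convolve within one part; vertices and edges outside `part` are ignored."""
    idx = sorted(part)
    A_p = A[np.ix_(idx, idx)] + np.eye(len(idx))     # 1-neighborhood incl. the root
    A_p = A_p / A_p.sum(axis=1, keepdims=True)       # row normalization (assumed)
    Y_p = np.zeros((A.shape[0], W_p.shape[1]))       # zeros for vertices not in the part
    Y_p[idx] = A_p @ X[idx] @ W_p
    return Y_p

def aggregate_sum(part_outputs):
    """F_agg as a plain sum: a shared vertex receives a term from every part it is in."""
    return np.sum(part_outputs, axis=0)

# Toy chain 0 - 1 - 2, split into two parts that share vertex 1.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.array([[1.0], [2.0], [3.0]])
W = np.array([[1.0]])                                # identity weights for readability
Y1 = part_conv(A, {0, 1}, X, W)
Y2 = part_conv(A, {1, 2}, X, W)
Y = aggregate_sum([Y1, Y2])
```

Only the shared vertex 1 receives a contribution from both parts; vertices 0 and 2 keep the output of the single part they belong to.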
The S-videos are represented as spatio-temporal graphs.
That is, an S-video is in essence a spatio-temporal graph.
We spatially convolve each partition independently for each frame, aggregate them at each frame and perform temporal convolution on the temporal dimension of the aggregated graph.
That is, the procedure has roughly two stages (spatial, then temporal), or three steps in detail: per-part spatial convolution, per-frame aggregation across parts, and temporal convolution on the aggregated graph.
For each vertex, we use a 1-neighborhood ($k = 1$) for the spatial dimension ($N_1$), as the skeleton graph is not very large, and a $\tau$-neighborhood ($k = \tau$) for the temporal dimension ($N_\tau$). $N_\tau$ is not part-specific.
The spatial and temporal neighborhoods are defined as follows:
$$
N_{1p}(v_i) = \{ v_j \mid d(v_i, v_j) \le 1,\ v_i, v_j \in V_p \}
$$
$$
N_\tau(v_{it_a}) = \{ v_{it_b} \mid d(v_{it_a}, v_{it_b}) \le \left|\frac{\tau}{2}\right| \}
$$
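The two neighborhoods can be enumerated directly. In the sketch below, the bound $|\tau/2|$ is read as the integer half-width `tau // 2`, which is an assumption that makes the temporal window contain exactly $\tau$ frames for odd $\tau$. Clipping the window at the start and end of the clip is also an assumption, and both function names are hypothetical.

```python
import numpy as np

def spatial_neighborhood(A, part, i):
    """N_1p(v_i): vertices of `part` within graph distance 1 of v_i (v_i included)."""
    if i not in part:
        return set()
    return {j for j in part if j == i or A[i, j] > 0}

def temporal_neighborhood(t_a, tau, num_frames):
    """Frames t_b of the same joint with |t_b - t_a| <= tau // 2, clipped to the clip."""
    half = tau // 2
    return list(range(max(0, t_a - half), min(num_frames, t_a + half + 1)))

# Toy chain 0 - 1 - 2 - 3 with a part that leaves out vertex 3.
A = np.zeros((4, 4))
for a, b in [(0, 1), (1, 2), (2, 3)]:
    A[a, b] = A[b, a] = 1.0
part = {0, 1, 2}

n_spatial = spatial_neighborhood(A, part, 2)
n_temporal = temporal_neighborhood(5, 9, 100)
```

Vertex 3 is adjacent to vertex 2 in the full skeleton but is excluded from `n_spatial` because it lies outside the part, while the temporal window does not depend on the part at all.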
For ordering vertices in the receptive fields (or neighborhoods), we use a single label spatially ($L_S : V \to \{0\}$) to weigh vertices in $N_{1p}$ of each vertex equally and $\tau$ labels temporally ($L_T : V \to \{0,\dots,\tau-1\}$) to weigh vertices across frames in $N_\tau$ differently.
That is, with respect to a root node, all vertices in its spatial neighborhood share the same label (0), whereas vertices in its temporal neighborhood receive different labels.
The formulas are as follows:
$$
L_S(v_{jt}) = \{ 0 \mid v_{jt} \in N_{1p}(v_{it}) \}
$$
$$
L_T(v_{it_b}) = \{ ((t_b - t_a) + \left|\frac{\tau}{2}\right|) \mid v_{it_b} \in N_\tau(v_{it_a}) \}
$$
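A short sketch of the two labeling functions shows how the temporal offset is shifted into the range $\{0, \dots, \tau - 1\}$. As an assumption, $|\tau/2|$ is again read as `tau // 2` with odd $\tau$; the function names are hypothetical.

```python
def spatial_label(_vertex):
    """L_S: every vertex in the spatial neighborhood gets the same label."""
    return 0

def temporal_label(t_b, t_a, tau):
    """L_T: frame offset shifted by tau // 2 so that labels start at 0."""
    half = tau // 2
    if abs(t_b - t_a) > half:
        raise ValueError("frame t_b is outside the temporal neighborhood of t_a")
    return (t_b - t_a) + half

tau = 9
# Labels of the nine frames centered on frame 10.
labels = [temporal_label(t_b, 10, tau) for t_b in range(6, 15)]
```

Each frame in the window gets its own label and therefore its own temporal weight, whereas all spatial neighbors share a single weight.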
$$
Z_p(v_{jt}) = W_p(L_S(v_{jt}))\, X_p(v_{jt})
$$
$$
Y_p(v_{it}) = \sum_{v_{jt} \in N_{1p}(v_{it})} A_p(i, j)\, Z_p(v_{jt}) \mid p \in \{1,\dots,4\}
$$
$$
Y_S(v_{it}) = F_{agg}(\{Y_1(v_{it}), \dots, Y_n(v_{it})\})
$$
$$
Y_T(v_{it_a}) = \sum_{v_{it_b} \in N_\tau(v_{it_a})} W_T(L_T(v_{it_b}))\, Y_S(v_{it_b})
$$
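The four equations above can be strung together into one layer. The following is a minimal sketch under several assumptions that are not taken from the paper: $F_{agg}$ is a plain sum, $A_p$ is row-normalized with self-loops, frames outside the clip are skipped, and the function name `pb_gcn_layer` and all tensor shapes are hypothetical.

```python
import numpy as np

def pb_gcn_layer(X, A, parts, W_parts, W_T):
    """X: (T, V, C_in); A: (V, V); parts: list of joint-index lists;
    W_parts: one (C_in, C_mid) matrix per part; W_T: (tau, C_mid, C_out)."""
    T, V, _ = X.shape
    tau = W_T.shape[0]
    half = tau // 2
    Y_S = np.zeros((T, V, W_parts[0].shape[1]))
    for idx, W_p in zip(parts, W_parts):
        A_p = A[np.ix_(idx, idx)] + np.eye(len(idx))
        A_p = A_p / A_p.sum(axis=1, keepdims=True)    # assumed normalization
        Z_p = X[:, idx, :] @ W_p                      # Z_p: single spatial label
        Y_S[:, idx, :] += A_p @ Z_p                   # Y_p, summed into Y_S (F_agg)
    Y_T = np.zeros((T, V, W_T.shape[2]))
    for t_a in range(T):
        for t_b in range(max(0, t_a - half), min(T, t_a + half + 1)):
            label = (t_b - t_a) + half                # temporal label L_T
            Y_T[t_a] += Y_S[t_b] @ W_T[label]
    return Y_T

# Toy input: 4 frames, chain 0 - 1 - 2, two parts sharing vertex 1, tau = 3.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.ones((4, 3, 1))
parts = [[0, 1], [1, 2]]
W_parts = [np.ones((1, 1)), np.ones((1, 1))]
W_T = np.ones((3, 1, 1))
Y = pb_gcn_layer(X, A, parts, W_parts, W_T)
```

With all-ones inputs and weights, the shared vertex accumulates twice the value of the other two after aggregation, and frames in the interior of the clip sum over three frames while the first and last frames sum over only two.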