1 Background and Motivation

Human pose estimation is the problem of localizing human joints. The difficulties are:

  • strong articulations (all sorts of joint configurations)
  • small and barely visible joints
  • occlusions and the need to capture the context

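The current mainstream approach is part-based models, but these are a bit like the blind men and the elephant: inefficient, and prone to generalizing from a partial view. Reasoning about pose in a holistic manner seems more sensible, yet existing holistic methods have had limited success in real-world settings.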

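In this paper, the authors ride the popularity of Deep Neural Networks (DNNs), which perform well on classification and localization vision tasks, and use a DNN to do holistic human pose estimation by regression. The advantages are: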

  • capture the full context of each body joint
  • simpler than graphical-model methods (no need to design a model topology and interactions between joints)

2 Advantages / Contributions

The first work to use Deep Neural Networks (DNNs) for human pose estimation. It regresses joint coordinates directly and, combined with a cascade, obtains SOTA results on 4 academic datasets.

3 Method

$(x, \overrightarrow{y})$: $x$ is the image data and $\overrightarrow{y}$ is the ground-truth (GT) pose vector

$\overrightarrow{y} = (\ldots, \overrightarrow{y}_i^T, \ldots)^T$, $i \in \{1, \ldots, k\}$

$\overrightarrow{y}_i$ contains the x and y coordinates of the $i^{th}$ joint

The person bounding box (which may be the whole image) is denoted $b = (b_c, b_w, b_h)$: center, width, and height

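The keypoint coordinates $\overrightarrow{y}_i$, normalized with respect to the bounding box, are

$N(\overrightarrow{y}_i; b) = \begin{pmatrix} 1/b_w & 0 \\ 0 & 1/b_h \end{pmatrix} (\overrightarrow{y}_i - b_c)$

That is, the bbox center is subtracted from the x and y coordinates, then the x coordinate is divided by the bbox width and the y coordinate by the bbox height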


$N(\overrightarrow{y}; b) = (\ldots, N(\overrightarrow{y}_i; b)^T, \ldots)^T$

This can be abbreviated as $N(x; b)$, or simply $N(\cdot)$
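The normalization and its inverse can be sketched in a few lines. This is a minimal illustration, assuming a box is stored as `(center, width, height)`; the names `normalize`, `denormalize`, `Point`, and `Box` are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[Point, float, float]  # (center b_c, width b_w, height b_h)


def normalize(y: List[Point], b: Box) -> List[Point]:
    """N(y; b): subtract the box center, then divide by the box width/height."""
    (cx, cy), bw, bh = b
    return [((px - cx) / bw, (py - cy) / bh) for px, py in y]


def denormalize(y_norm: List[Point], b: Box) -> List[Point]:
    """N^{-1}(y; b): multiply by the box width/height, then add the center."""
    (cx, cy), bw, bh = b
    return [(px * bw + cx, py * bh + cy) for px, py in y_norm]


# Hypothetical example: a 200 x 400 box centered at (100, 200).
box = ((100.0, 200.0), 200.0, 400.0)
joints = [(100.0, 200.0), (150.0, 300.0), (0.0, 0.0)]
normalized = normalize(joints, box)
print(normalized)
print(denormalize(normalized, box))
```

A joint at the box center maps to the origin, and joints inside the box land in roughly [-0.5, 0.5] on each axis.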

3.1 Pose Estimation as DNN-based Regression

The network is AlexNet (see 【Keras-AlexNet】CIFAR-10). Its output is $\psi(x; \theta) \in \mathbb{R}^{2k}$, where the $2k$ values are the coordinates of the $k$ keypoints


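Equation (2) works as follows: the normalized image data is fed in, AlexNet predicts the keypoint coordinates, and the prediction is de-normalized to map it back onto the original image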


"Free layer" refers to a parameter-free layer, namely LRN (local response normalization layer) or P (pooling layer)


$C(55 \times 55 \times 96) \to LRN \to P \to C(27 \times 27 \times 256) \to LRN \to P \to C(13 \times 13 \times 384) \to C(13 \times 13 \times 384) \to C(13 \times 13 \times 256) \to P \to F(4096) \to F(4096)$

where C is a convolutional layer and F is a fully connected layer



The loss is the L2 distance between the prediction and the true pose vector
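As a rough illustration of an L2 loss on a pose vector, the sketch below sums the squared Euclidean distance over joints for a single training example. The squared form and the per-example scope are assumptions, and `l2_pose_loss` is a hypothetical helper name.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def l2_pose_loss(pred: List[Point], target: List[Point]) -> float:
    """Sum over joints of the squared Euclidean distance (assumed form)."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of joints")
    return sum((px - tx) ** 2 + (py - ty) ** 2
               for (px, py), (tx, ty) in zip(pred, target))


# Hypothetical normalized coordinates for k = 2 joints.
pred = [(0.0, 0.0), (0.5, -0.5)]
target = [(0.0, 0.5), (0.25, -0.5)]
loss = l2_pose_loss(pred, target)
print(loss)
```

In training, this quantity would be accumulated over all normalized examples and minimized with respect to the network parameters.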

The cascade idea is borrowed from "Deep convolutional network cascade for facial point detection"

3.2 Cascade of Pose Regressors

Fig 2

The input to each subsequent stage is a sub-image of the previous stage's input

subsequent pose regressors see higher resolution images and thus learn features for finer scales which ultimately leads to higher precision

All stages use the same network architecture $\psi$ but different parameters $\theta$. The regressor is written $\psi(x; \theta_s)$, where $s \in \{1, \ldots, S\}$ indexes the stage; in the experiments $S$ is 3

Stage 1:

$\overrightarrow{y^1} \leftarrow N^{-1}(\psi(N(x; b^0); \theta_1); b^0)$

with initial bounding box $b^0$

Stage $s$:

$\overrightarrow{y_i^s} \leftarrow \overrightarrow{y_i^{(s-1)}} + N^{-1}(\psi_i(N(x; b_i^{(s-1)}); \theta_s); b_i^{(s-1)})$

where $b_i^s$ is updated as follows

$b_i^s \leftarrow (\overrightarrow{y_i^s}, \sigma\, diam(\overrightarrow{y^s}))$

That is, a sub-image centered at $\overrightarrow{y_i^s}$ with side length $\sigma\, diam(\overrightarrow{y^s})$ is cropped as the input to the next stage. Here $\sigma$ is a scaling factor and $diam$ is the diameter of the pose (depends on the concrete pose definition and dataset); in this paper it is defined as the distance between a shoulder and hip from opposing sides
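One refinement stage can be sketched as below. This is only an illustration under stated assumptions: `regress` is a hypothetical stand-in for the trained network $\psi_i$, it is assumed to return a displacement in box-normalized units (so de-normalizing it only rescales by the box size), and the joint indices passed to `diam` are made up for the toy pose.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Box = Tuple[Point, float, float]  # (center, width, height)


def diam(pose: List[Point], shoulder: int, hip: int) -> float:
    """Pose diameter: distance between a shoulder and the opposite-side hip."""
    (sx, sy), (hx, hy) = pose[shoulder], pose[hip]
    return math.hypot(sx - hx, sy - hy)


def joint_boxes(pose: List[Point], sigma: float,
                shoulder: int, hip: int) -> List[Box]:
    """b_i = (y_i, sigma * diam(y)): a square box centered on each joint."""
    side = sigma * diam(pose, shoulder, hip)
    return [(joint, side, side) for joint in pose]


def refine(pose: List[Point], boxes: List[Box],
           regress: Callable[[int, Box], Point]) -> List[Point]:
    """One stage s >= 2: add the rescaled displacement to every joint.

    Assumption: regress(i, box) returns a displacement in box-normalized
    units, so undoing the normalization is a multiplication by box size.
    """
    refined = []
    for i, ((x, y), box) in enumerate(zip(pose, boxes)):
        dx, dy = regress(i, box)
        _, bw, bh = box
        refined.append((x + dx * bw, y + dy * bh))
    return refined


def stub_regressor(i: int, box: Box) -> Point:
    """Placeholder for a trained network; always predicts the same shift."""
    return (0.1, -0.2)


# Hypothetical 2-joint pose: index 0 is a shoulder, index 1 the opposite hip.
pose = [(0.0, 0.0), (3.0, 4.0)]                           # diam = 5
boxes = joint_boxes(pose, sigma=2.0, shoulder=0, hip=1)   # side = 10
new_pose = refine(pose, boxes, stub_regressor)
print(new_pose)
```

Repeating `joint_boxes` and `refine` with a separate regressor per stage gives the cascade; a real implementation would also crop the image by each box before calling the network.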

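When training the cascade stages, simulated predictions are used: the GT is perturbed by a displacement sampled from a Gaussian distribution, and the perturbed location stands in for the previous stage's prediction. The variance of the Gaussian is computed from $\overrightarrow{y_i^{(s-1)}} - \overrightarrow{y_i^s}$ over the training set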

Expressed formally, with simulated predictions the dataset $D$ becomes $D_A^s$
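The augmentation idea can be sketched as follows: fit a Gaussian to displacements observed on the training set, then perturb each GT joint with samples from it. Treating the two axes as independent is a simplifying assumption, and `fit_displacements` and `simulate_predictions` are hypothetical names.

```python
import random
import statistics
from typing import List, Tuple

Point = Tuple[float, float]


def fit_displacements(displacements: List[Point]) -> Tuple[Point, Point]:
    """Per-axis mean and (population) standard deviation of displacements."""
    xs = [d[0] for d in displacements]
    ys = [d[1] for d in displacements]
    mean = (statistics.fmean(xs), statistics.fmean(ys))
    std = (statistics.pstdev(xs), statistics.pstdev(ys))
    return mean, std


def simulate_predictions(gt_joint: Point, mean: Point, std: Point,
                         n: int, rng: random.Random) -> List[Point]:
    """Perturb one GT joint n times with per-axis Gaussian noise."""
    return [(gt_joint[0] + rng.gauss(mean[0], std[0]),
             gt_joint[1] + rng.gauss(mean[1], std[1]))
            for _ in range(n)]


# Hypothetical displacements observed for one joint on the training set.
observed = [(1.0, -1.0), (3.0, 1.0)]
mean, std = fit_displacements(observed)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = simulate_predictions((50.0, 80.0), mean, std, n=5, rng=rng)
print(mean, std, len(samples))
```

Each sampled location would define a box for cropping, so one annotated image yields several training examples for the later stages.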

4 Experiments

4.1 Datasets

1)Frames Labeled In Cinema (FLIC)

Official site: https://bensapp.github.io/flic-dataset.html

The person is roughly boxed by enlarging a face detector's output, and keypoint detection is then run on that box

4000 training and 1000 test images obtained from popular Hollywood movies


See 姿态估计数据集可视化【附代码】 (pose estimation dataset visualization, with code)

2)Leeds Sports Pose Dataset(LSP)




11000 training and 1000 test images,14 joints

See 姿态估计数据集可视化【附代码】 (pose estimation dataset visualization, with code)

Right ankle
Right knee
Right hip
Left hip
Left knee
Left ankle
Right wrist
Right elbow
Right shoulder
Left shoulder
Left elbow
Left wrist
Neck
Head top

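These are the standard 12 points plus the neck and head top. When visualizing, the neck is connected to the midpoint of the left and right hips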
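The joint list and the neck-to-hip-center connection can be written down as a small lookup. This is a sketch that assumes 0-based indices following the order listed above; `LSP_JOINTS`, `INDEX`, and `torso_segment` are hypothetical names.

```python
from typing import List, Tuple

Point = Tuple[float, float]

# Joint names in the order listed above (0-based indexing is an assumption).
LSP_JOINTS = [
    "Right ankle", "Right knee", "Right hip", "Left hip", "Left knee",
    "Left ankle", "Right wrist", "Right elbow", "Right shoulder",
    "Left shoulder", "Left elbow", "Left wrist", "Neck", "Head top",
]
INDEX = {name: i for i, name in enumerate(LSP_JOINTS)}


def torso_segment(pose: List[Point]) -> Tuple[Point, Point]:
    """Return (neck, hip midpoint), the two ends of the torso line."""
    rx, ry = pose[INDEX["Right hip"]]
    lx, ly = pose[INDEX["Left hip"]]
    return pose[INDEX["Neck"]], ((rx + lx) / 2, (ry + ly) / 2)


# Hypothetical pose: only the three joints used here get real coordinates.
pose = [(0.0, 0.0)] * len(LSP_JOINTS)
pose[INDEX["Right hip"]] = (10.0, 100.0)
pose[INDEX["Left hip"]] = (30.0, 100.0)
pose[INDEX["Neck"]] = (20.0, 40.0)
segment = torso_segment(pose)
print(segment)
```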

3)ImageParse dataset

4)Buffy dataset


Percentage of Correct Parts(PCP)
Percentage of Detected Joints(PDJ)
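For PDJ, a joint counts as detected when the distance between the predicted and true location is within a given fraction of the torso diameter. The sketch below evaluates one joint type over several images; the function name `pdj` and the sample numbers are illustrative assumptions.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def pdj(preds: List[Point], gts: List[Point],
        torso_diams: List[float], threshold: float) -> float:
    """Fraction of joints with error <= threshold * torso diameter."""
    hits = sum(
        math.dist(p, g) <= threshold * d
        for p, g, d in zip(preds, gts, torso_diams)
    )
    return hits / len(gts)


# Hypothetical data: one joint type over 4 images, errors 5, 18, 30, 10 px.
preds = [(0.0, 0.0)] * 4
gts = [(3.0, 4.0), (0.0, 18.0), (0.0, 30.0), (6.0, 8.0)]
torso = [100.0] * 4
pdj_at_020 = pdj(preds, gts, torso, threshold=0.2)
pdj_at_015 = pdj(preds, gts, torso, threshold=0.15)
print(pdj_at_020, pdj_at_015)
```

Sweeping `threshold` over a range of normalized distances produces the detection-rate curves that PDJ plots usually show.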

4.2 Results and Discussion

1)PCP metric on LSP dataset

Percentage of Correct Parts(PCP)


2)PDJ metric on FLIC and LSP dataset

Percentage of Detected Joints(PDJ)


Two joints, compared against 4 other methods; the DeepPose results are from stage 2

Four joints, compared one-on-one with the method of reference [13]

3)Effects of cascade-based refinement

The gain from the cascade stages falls mainly in the [0.15, 0.2] interval

Although the cascaded stages look at higher resolution inputs, they have more limited context


4)Cross-dataset Generalization

Trained on the LSP dataset, tested on the ImageParse dataset

Trained on the FLIC dataset, tested on the Buffy dataset

5 Conclusion(own) / Future work

  • In most of the cases, when the estimated pose is not precise, it still has a correct shape

  • Further, we show that using a generic convolutional neural network, which was originally designed for classification tasks, can be applied to the different task of localization.

  • In future, we plan to investigate novel architectures which could be potentially better tailored towards localization problems in general, and in pose estimation in particular
