3D Human Pose Estimation Using Convolutional Neural Networks with 2D Pose Information (2016)

Keywords: human pose estimation, convolutional neural network, 2D-3D joint optimization

Abstract

We tackle the 3D human pose estimation task with end-to-end learning using CNNs. Relative 3D positions between one joint and the other joints are learned via CNNs.

Two contributions:
(1) We added 2D pose information to estimate a 3D pose from an image by concatenating the 2D pose estimation result with the features from the image.
(2) We found that more accurate 3D poses are obtained by combining information on relative positions with respect to multiple joints, instead of just one root joint.

Introduction

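The introduction gives an overview of human pose estimation, moving from 2D CNN-based methods to 3D ones and extending the advantages of CNNs to 3D. It summarizes the shortcomings of the CNN approaches in 【5】【6】【7】 and describes the benefit of adding 2D information (from 2D pose information, undesirable 3D joint positions which generate unnatural human poses may be discarded).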

Framework

We propose a simple yet powerful 3D human pose estimation framework based on the regression of joint positions using CNNs. We introduce two strategies to improve the regression results from the baseline CNNs.

(1) Not only the image features but also the 2D joint classification results are used as input features for 3D pose estimation. This scheme successfully incorporates the correlation between 2D and 3D poses.
(2) Rather than estimating relative positions with respect to just one root joint, we estimate them with respect to multiple joints. This scheme effectively reduces the error of the joints that are far from the root joint.

Related Work

This section mainly reviews CNN-based 2D and 3D human pose estimation (see the original paper for details).
The method proposed in this paper aims to provide an end-to-end learning framework to estimate the 3D structure of a human body from a single image. Similar to 【5】, 3D and 2D pose information are jointly learned in a single CNN. Unlike the previous works, we directly propagate the 2D classification results to the 3D pose regressors inside the CNN.

3D-2D Joint Estimation of Human Body Using CNN

The key idea of our method is to train a CNN which performs 3D pose estimation using both image features from the input image and 2D pose information retrieved from the same CNN. In other words, the proposed CNN is trained for both 2D joint classification and 3D joint regression tasks simultaneously.

Structure of the Baseline CNN

[Figure 1: structure of the baseline CNN (image not reproduced)]
The CNN used in this experiment consists of five convolutional layers, three pooling layers, two parallel sets of two fully connected layers, and loss layers for 2D and 3D pose estimation tasks. The CNN accepts a 225 × 225 sized image as an input. The sizes and the numbers of filters as well as the strides are specified in Figure 1. The filter sizes of convolutional and pooling layers are the same as those of ZFnet 【21】, but we reduced the number of feature maps to make the network smaller.
We divide the input image into $N_g \times N_g$ grids and treat each grid as a separate class, which results in $N_g^2$ classes per joint. The target probability for the $i$th grid $g_i$ of the $j$th joint is inversely proportional to the distance from the ground truth position.
$$\hat p_j(g_i)=\frac{d^{-1}(\hat y_j,c_i)\,I(g_i)}{\sum_{k=1}^{N_g^2}d^{-1}(\hat y_j,c_k)\,I(g_k)} \quad (1)$$
where $d^{-1}(x,y)$ is the inverse of the Euclidean distance between the points $x$ and $y$ in the 2D pixel space, $\hat y_j$ is the ground truth position of the $j$th joint in the image, and $c_i$ is the center of the grid $g_i$.
$I(g_i)$ is an indicator function that is equal to 1 if the grid $g_i$ is one of the four nearest neighbors.
$$I(g_i)=\begin{cases}1 & \text{if } d(\hat y_j,c_i)<\omega_g\\ 0 & \text{otherwise,}\end{cases} \quad (2)$$
where $\omega_g$ is the width of a grid. Hence, a higher probability is assigned to grids closer to the ground truth joint position, and $\hat p_j(g_i)$ is normalized so that the sum of the class probabilities is equal to 1. Finally, the objective of the 2D classification task for the $j$th joint is to minimize the following cross entropy loss function.
$$L_{2D}(j)=-\sum_{i=1}^{N_g^2}\hat p_j(g_i)\log p_j(g_i), \quad (3)$$
where $p_j(g_i)$ is the probability that comes from the softmax output of the CNN.
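The target distribution of Eq. (1)-(2) and the loss of Eq. (3) can be sketched in a few lines of NumPy. This is only an illustration and not the authors' code: the function names, the grid count of 15 used in the usage note, and the small `eps` terms that guard against division by zero and `log(0)` are assumptions added here.

```python
import numpy as np

def grid_centers(image_size, n_grid):
    """Return the (n_grid**2, 2) cell centers in pixels and the cell width."""
    width = image_size / n_grid
    coords = (np.arange(n_grid) + 0.5) * width
    xs, ys = np.meshgrid(coords, coords)
    return np.stack([xs.ravel(), ys.ravel()], axis=1), width

def target_probs(joint_xy, centers, width, eps=1e-6):
    """Eq. (1)-(2): normalized inverse-distance weights over nearby cells.

    Assumes the joint lies inside the image, so at least one cell center
    is closer than one grid width.
    """
    dist = np.linalg.norm(centers - joint_xy, axis=1)
    indicator = dist < width             # Eq. (2)
    weights = indicator / (dist + eps)   # d^{-1} * I; eps is an added safeguard
    return weights / weights.sum()       # normalization of Eq. (1)

def cross_entropy_2d(target, predicted, eps=1e-12):
    """Eq. (3): cross entropy between the target and the softmax output."""
    return -np.sum(target * np.log(predicted + eps))
```

For example, `centers, width = grid_centers(225, 15)` followed by `target_probs(np.array([100.0, 60.0]), centers, width)` gives a vector of 225 class probabilities that sums to 1 and is non-zero only for the few cells around the joint.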
Estimating the 3D positions of joints is formulated as a regression task. Since the search space is much larger than in the 2D case, it is undesirable to solve 3D pose estimation as a classification task. The 3D loss function is designed as the square of the Euclidean distance between the prediction and the ground truth. We estimate the 3D position of each joint relative to the root node. The loss function for the $j$th joint when the root node is the $r$th joint becomes
$$L_{3D}(j,r)=\left\|R_j-(\hat J_j-\hat J_r)\right\|^2 \quad (4)$$
where $R_j$ is the predicted relative 3D position of the $j$th joint from the root node, $\hat J_j$ is the ground truth 3D position of the $j$th joint, and $\hat J_r$ is that of the root node. The overall cost function of the CNN combines (3) and (4) with weights,
$$L_{all}=\lambda_{2D}\sum_{j=1}^{N_j}L_{2D}(j)+\lambda_{3D}\sum_{j\neq r}^{N_j}L_{3D}(j,r) \quad (5)$$
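As a rough sketch of Eq. (4) and Eq. (5), the snippet below computes the root-relative squared error and combines it with precomputed 2D losses. The function names and the default weights of 1.0 are placeholders chosen for illustration; these notes do not give the values of the weights used in the paper.

```python
import numpy as np

def loss_3d(pred_rel, gt_joints, root):
    """Sum of Eq. (4) over all joints j != root.

    pred_rel:  (N_j, 3) predicted positions relative to the root joint.
    gt_joints: (N_j, 3) ground truth 3D joint positions.
    root:      index r of the root joint.
    """
    gt_rel = gt_joints - gt_joints[root]                 # J_j - J_r
    sq_err = np.sum((pred_rel - gt_rel) ** 2, axis=1)    # squared Euclidean distance
    mask = np.arange(len(gt_joints)) != root             # drop the j == r term
    return sq_err[mask].sum()

def total_loss(losses_2d, pred_rel, gt_joints, root, lam_2d=1.0, lam_3d=1.0):
    """Eq. (5): weighted sum of the per-joint 2D losses and the 3D loss.

    lam_2d and lam_3d are placeholder weights, not values from the paper.
    """
    return lam_2d * np.sum(losses_2d) + lam_3d * loss_3d(pred_rel, gt_joints, root)
```

A perfect prediction, `pred_rel = gt_joints - gt_joints[root]`, gives a 3D loss of zero, which is a quick sanity check when wiring this into a training loop.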

3D Joint Regression with 2D Classification Features

[Figure 2: structure of the proposed network after the last pooling layer (image not reproduced)]
See the original paper for details. The joint locations in an image are usually a strong cue for guessing the 3D pose. To exploit the 2D classification result for 3D pose estimation, we concatenate the softmax outputs of the 2D classification task with the outputs of the fully connected layers in the 3D loss part. The proposed structure after the last pooling layer is shown in Figure 2. First, the 2D classification result is concatenated (the probs 2D layer in Figure 2) and passed through a fully connected layer (fc probs 2D). Then, the feature vectors from the 2D and 3D parts are concatenated (fc 2D-3D), and the result is used for the 3D pose estimation task. Note that the error from the fc probs 2D layer is not back-propagated to the probs 2D layer, to ensure that the layers used for 2D classification are trained only by the 2D loss part. 【3】 repeatedly uses the 2D classification result as an input by concatenating it with feature maps from the CNN. In contrast, we simply vectorize the softmax result to produce an $N_g \times N_g \times N_j$ feature vector rather than convolving the probability map with features in the convolutional layers.
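The data flow described above can be illustrated with a shape-level NumPy sketch. All sizes (15 grids per side, 17 joints, 64 and 128 features), the random stand-in activations, and the ReLU in `dense` are assumptions made for the example; only the layer names follow Figure 2.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dense(x, w, b):
    """Fully connected layer followed by ReLU (the activation is an assumption)."""
    return np.maximum(x @ w + b, 0.0)

N_G, N_J = 15, 17               # hypothetical grid count per side and joint count
FEAT_PROBS, FEAT_3D = 64, 128   # hypothetical feature sizes

# Stand-in for the 2D branch: one softmax over N_g^2 grid classes per joint.
logits_2d = rng.normal(size=(N_J, N_G * N_G))
probs_2d = softmax(logits_2d, axis=1)

# "probs 2D": vectorize the per-joint softmax outputs into one long vector.
# In a real training framework a stop-gradient would be applied here so that
# the 3D loss does not update the 2D classification layers.
probs_vec = probs_2d.reshape(-1)            # length N_g * N_g * N_j

# "fc probs 2D": fully connected layer on top of the vectorized probabilities.
w = rng.normal(scale=0.01, size=(probs_vec.size, FEAT_PROBS))
b = np.zeros(FEAT_PROBS)
fc_probs_2d = dense(probs_vec, w, b)

# "fc 2D-3D": concatenate with the features of the 3D branch (random stand-in).
fc_3d = rng.normal(size=FEAT_3D)
fc_2d_3d = np.concatenate([fc_probs_2d, fc_3d])
```

The concatenated vector `fc_2d_3d` is what the 3D regression layers would consume; no convolution over the probability maps is involved.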

Multiple 3D Pose Regression from Different Root Nodes

[Figure 3: image not reproduced; panel (b) visualizes the selected root nodes]

This section reviews the baseline framework and its drawback, then the remedy proposed in 【5】 and its own drawback, which motivates the method of this paper: we estimate the relative position over multiple joints. The baseline framework computes the position of every joint relative to a single root joint, so accuracy drops for joints far from the root. 【5】 instead computes the position of each joint relative to its parent joint, but then the errors of intermediate joints accumulate. Let $N_r$ be the number of selected root nodes. The experiments use $N_r=6$ so that most joints are either a root node or a neighbor of one, as visualized in Figure 3(b). The six 3D regression losses are shown in Figure 2. The overall loss is
$$L_{all}=\lambda_{2D}\sum_{j=1}^{N_j}L_{2D}(j)+\lambda_{3D}\sum_{r\in R}\sum_{j\neq r}^{N_j}L_{3D}(j,r) \quad (6)$$
where $R$ is the set containing the joint indices that are used as root nodes. When the 3D losses share the same fully connected layers, the trained model outputs the same pose estimation results across all joints. To break this symmetry, we use separate fully connected layers for each 3D loss (the fc2 layers in Figure 2).
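The double sum over root nodes in Eq. (6) might be computed as follows, assuming each root has its own regression head whose output is stacked along the first axis. The function name and array layout are illustrative choices, not taken from the paper.

```python
import numpy as np

def multi_root_loss_3d(pred_rel, gt_joints, roots):
    """3D part of Eq. (6): sum over r in R and j != r of L_3D(j, r).

    pred_rel:  (N_r, N_j, 3); pred_rel[k] holds the positions relative to
               roots[k], produced by that root's own regression head.
    gt_joints: (N_j, 3) ground truth 3D joint positions.
    roots:     sequence of N_r joint indices (the set R).
    """
    total = 0.0
    for k, r in enumerate(roots):
        gt_rel = gt_joints - gt_joints[r]
        sq_err = np.sum((pred_rel[k] - gt_rel) ** 2, axis=1)
        sq_err[r] = 0.0          # the j == r term is excluded from the sum
        total += sq_err.sum()
    return total
```

Multiplying the result by $\lambda_{3D}$ and adding the weighted 2D losses gives the full objective.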
At test time, all the pose estimation results are translated so that the mean of each pose becomes zero. The final prediction is generated by averaging the translated results. In other words, the 3D position of the $j$th joint $X_j$ is calculated as
$$X_j=\frac{\sum_{r\in R}X_j^{(r)}}{N_r} \quad (7)$$
where $X_j^{(r)}$ is the predicted 3D position of the $j$th joint when the $r$th joint is set to a root node.
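A minimal sketch of this test-time step is given below, assuming the per-root predictions are stacked into one array and the row for each root joint itself is filled with zeros (its position relative to itself). The function name and array layout are illustrative.

```python
import numpy as np

def average_poses(pred_rel):
    """Zero-center each per-root pose, then average them as in Eq. (7).

    pred_rel: (N_r, N_j, 3); pred_rel[k] is the pose expressed relative to
              the k-th root node.
    Returns the (N_j, 3) averaged, mean-centered pose.
    """
    centered = pred_rel - pred_rel.mean(axis=1, keepdims=True)  # mean of each pose -> 0
    return centered.mean(axis=0)                                # average over roots
```

The centering step matters because poses predicted from different roots differ by a translation; with perfect predictions, every centered pose coincides with the mean-centered ground truth, so the average does as well.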

Implementation Details

See the original paper for details.

Conclusions

We expect that the performance can be further improved by incorporating temporal information into the CNN by applying the concepts of recurrent neural networks or 3D convolution [26]. Also, an efficient aligning method for multiple regression results may boost the accuracy of pose estimation.
