

Jamie Shotton      Andrew Fitzgibbon       Mat Cook       Toby Sharp      Mark Finocchio      Richard Moore      Alex Kipman      Andrew Blake

Microsoft Research Cambridge & Xbox Incubation








We propose a new method to quickly and accurately predict 3D positions of body joints from a single depth image, using no temporal information. We take an object recognition approach, designing an intermediate body parts representation that maps the difficult pose estimation problem into a simpler per-pixel classification problem. Our large and highly varied training dataset allows the classifier to estimate body parts invariant to pose, body shape, clothing, etc. Finally we generate confidence-scored 3D proposals of several body joints by reprojecting the classification result and finding local modes.
The system runs at 200 frames per second on consumer hardware. Our evaluation shows high accuracy on both synthetic and real test sets, and investigates the effect of several training parameters. We achieve state-of-the-art accuracy in our comparison with related work and demonstrate improved generalization over exact whole-skeleton nearest neighbor matching.

Human body tracking has applications in areas such as human-computer interaction, telepresence, and health care, and it has advanced considerably in recent years thanks to improvements in real-time depth cameras. Kinect brought this capability to consumer hardware at interactive rates.
Other systems achieve high speeds by tracking from frame to frame, but they struggle to re-initialize quickly after losing track and are therefore not robust. The approach presented in the paper uses per-frame initialization and recovery to avoid the obstacles that these systems face.
This per-frame algorithm recognizes pose in parts: it divides the human body into parts, classifies each pixel of the depth image as belonging to one of those parts, and then detects candidate 3D positions for the skeletal joints. Figure I below shows the three main stages of the algorithm, from the input depth image to the final 3D joint proposals.
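The last stage of this pipeline, going from per-pixel body part probabilities to 3D joint proposals, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the per-pixel classifier output is already available as a probability map per part, a simple pinhole camera with a hypothetical focal length, and a Gaussian-kernel mean shift as the local mode finder. The function names, the pixel weighting (probability times squared depth), the bandwidth value, and the confidence score are all illustrative choices.

```python
import math


def backproject(u, v, z, focal, cx, cy):
    """Pinhole back-projection of pixel (u, v) at depth z (meters) into camera space."""
    return ((u - cx) * z / focal, (v - cy) * z / focal, z)


def mean_shift_mode(points, weights, start, bandwidth, iters=20):
    """Climb to a local mode of a weighted Gaussian kernel density, starting at `start`."""
    mode = start
    for _ in range(iters):
        num = [0.0, 0.0, 0.0]
        den = 0.0
        for p, w in zip(points, weights):
            d2 = sum((a - b) ** 2 for a, b in zip(p, mode))
            k = w * math.exp(-d2 / (bandwidth ** 2))
            den += k
            for i in range(3):
                num[i] += k * p[i]
        if den == 0.0:
            break
        mode = tuple(n / den for n in num)
    return mode


def joint_proposals(depth, part_probs, focal, bandwidth=0.1):
    """Turn per-pixel part probabilities into one 3D proposal per body part.

    depth:      2D list of depths in meters (0 marks background).
    part_probs: dict mapping a part name to a 2D list of per-pixel probabilities
                (assumed to come from a per-pixel classifier, not shown here).
    Returns a dict mapping part name to (x, y, z, confidence).
    """
    h, w = len(depth), len(depth[0])
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0  # assumed principal point: image center
    proposals = {}
    for part, probs in part_probs.items():
        points, weights = [], []
        for v in range(h):
            for u in range(w):
                z, p = depth[v][u], probs[v][u]
                if z > 0 and p > 0:
                    points.append(backproject(u, v, z, focal, cx, cy))
                    # Assumed weighting: probability times squared depth, so that
                    # distant pixels (covering more surface area) count for more.
                    weights.append(p * z * z)
        if not points:
            continue
        # Start the mode search from the highest-weighted pixel of this part.
        start = max(zip(weights, points))[1]
        mode = mean_shift_mode(points, weights, start, bandwidth)
        # Illustrative confidence: total weight of the pixels voting for this part.
        proposals[part] = mode + (sum(weights),)
    return proposals


if __name__ == "__main__":
    depth = [[2.0] * 3 for _ in range(3)]
    probs = {"head": [[0, 0, 0], [0, 1.0, 0], [0, 0, 0]]}
    print(joint_proposals(depth, probs, focal=2.0))
```

This sketch returns a single proposal per part, whereas a fuller version would start the mode search from several pixels and keep every distinct mode as a separate confidence-scored proposal.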
