This work introduces a novel convolutional network architecture for the task of human pose estimation.
Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body.
We show how repeated bottom-up, top-down processing used in conjunction with intermediate supervision is critical to improving the performance of the network.
We refer to the architecture as a “stacked hourglass” network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions. State-of-the-art results are achieved on the FLIC and MPII benchmarks, outcompeting all recent methods.
A key step toward understanding people in images and video is accurate pose estimation.
Given a single RGB image, we wish to determine the precise pixel location of important keypoints of the body.
Achieving an understanding of a person’s posture and limb articulation is useful for higher level tasks like action recognition, and also serves as a fundamental tool in fields such as human computer interaction and animation.
As a well established problem in vision, pose estimation has plagued researchers with a variety of formidable challenges over the years.
A good pose estimation system must be robust to occlusion and severe deformation, successful on rare and novel poses, and invariant to changes in appearance due to factors like clothing and lighting.
Early work tackles such difficulties using robust image features and sophisticated structured prediction [1–9]: the former is used to produce local interpretations, whereas the latter is used to infer a globally consistent pose.
This conventional pipeline, however, has been greatly reshaped by convolutional neural networks (ConvNets) [10–14], a main driver behind an explosive rise in performance across many computer vision tasks.
Recent pose estimation systems [15–20] have universally adopted ConvNets as their main building block, largely replacing hand-crafted features and graphical models; this strategy has yielded drastic improvements on standard benchmarks [1, 21, 22].
We continue along this trajectory and introduce a novel “stacked hourglass” network design for predicting human pose.
The network captures and consolidates information across all scales of the image.
We refer to the design as an hourglass based on our visualization of the steps of pooling and subsequent upsampling used to get the final output of the network.
Like many convolutional approaches that produce pixel-wise outputs, the hourglass network pools down to a very low resolution, then upsamples and combines features across multiple resolutions [15, 23]. On the other hand, the hourglass differs from prior designs primarily in its more symmetric topology.
We expand on a single hourglass by consecutively placing multiple hourglass modules together end-to-end.
This allows for repeated bottom-up, top-down inference across scales.
In conjunction with the use of intermediate supervision, repeated bidirectional inference is critical to the network’s final performance.
The final network architecture achieves a significant improvement on the state-of-the-art for two standard pose estimation benchmarks (FLIC [1] and MPII Human Pose [21]).
On MPII there is over a 2% average accuracy improvement across all joints, with as much as a 4-5% improvement on more difficult joints like the knees and ankles.
With the introduction of “DeepPose” by Toshev et al. [24], research on human pose estimation began the shift from classic approaches [1–9] to deep networks.
Toshev et al. use their network to directly regress the x,y coordinates of joints.
The work by Tompson et al. [15] instead generates heatmaps by running an image through multiple resolution banks in parallel to simultaneously capture features at a variety of scales.
Our network design largely builds off of their work, exploring how to capture information across scales and adapting their method for combining features across different resolutions.
A critical feature of the method proposed by Tompson et al. [15] is the joint use of a ConvNet and a graphical model.
Their graphical model learns typical spatial relationships between joints.
Others have recently tackled this in similar ways [17, 20, 25] with variations on how to approach unary score generation and pairwise comparison of adjacent joints.
Chen et al. [25] cluster detections into typical orientations so that when their classifier makes predictions, additional information is available indicating the likely location of a neighboring joint.
We achieve superior performance without the use of a graphical model or any explicit modeling of the human body.
There are several examples of methods making successive predictions for pose estimation. Carreira et al. [19] use what they refer to as Iterative Error Feedback.
A set of predictions is included with the input, and each pass through the network further refines these predictions.
Their method requires multi-stage training and the weights are shared across each iteration.
Wei et al. [18] build on the work of multi-stage pose machines [26] but now with the use of ConvNets for feature extraction.
Given our use of intermediate supervision, our work is similar in spirit to these methods, but our building block (the hourglass module) is different.
Hu & Ramanan [27] have an architecture more similar to ours that can also be used for multiple stages of predictions, but their model ties weights in the bottom-up and top-down portions of computation as well as across iterations.
Tompson et al. build on their work in [15] with a cascade to refine predictions.
This serves to increase efficiency and reduce memory usage of their method while improving localization performance in the high precision range [16].
One consideration is that for many failure cases a refinement of position within a local window would not offer much improvement, since error cases often consist of either occluded or misattributed limbs.
For both situations, any further evaluation at a local scale will not improve the prediction.
There are variations to the pose estimation problem which include the use of additional features such as depth or motion cues.
Also, there is the more challenging task of simultaneous annotation of multiple people [17, 31].
In addition, there is work like that of Oliveira et al. [32] that performs human part segmentation based on fully convolutional networks [23].
Our work focuses solely on the task of keypoint localization of a single person’s pose from an RGB image.
Our hourglass module before stacking is closely connected to fully convolutional networks [23] and other designs that process spatial information at multiple scales for dense prediction [15, 33–41].
Xie et al. [33] give a summary of typical architectures. Our hourglass module differs from these designs mainly in its more symmetric distribution of capacity between bottom-up processing (from high resolutions to low resolutions) and top-down processing (from low resolutions to high resolutions).
For example, fully convolutional networks [23] and holistically-nested architectures [33] are both heavy in bottom-up processing but light in their top-down processing, which consists only of a (weighted) merging of predictions across multiple scales.
Fully convolutional networks are also trained in multiple stages.
The hourglass module before stacking is also related to conv-deconv and encoder-decoder architectures [42–45].
Noh et al. [42] use the conv-deconv architecture to do semantic segmentation, and Rematas et al. [44] use it to predict reflectance maps of objects.
Zhao et al. [43] develop a unified framework for supervised, unsupervised and semi-supervised learning by adding a reconstruction loss.
Yang et al. [46] employ an encoder-decoder architecture without skip connections for image generation. Rasmus et al. [47] propose a denoising auto-encoder with special, “modulated” skip connections for unsupervised/semi-supervised feature learning.
The symmetric topology of these networks is similar, but the nature of the operations is quite different in that we do not use unpooling or deconv layers.
Instead, we rely on simple nearest neighbor upsampling and skip connections for top-down processing.
Another major difference of our work is that we perform repeated bottom-up, top-down inference by stacking multiple hourglasses.
The design of the hourglass is motivated by the need to capture information at every scale.
While local evidence is essential for identifying features like faces and hands, a final pose estimate requires a coherent understanding of the full body.
The person’s orientation, the arrangement of their limbs, and the relationships of adjacent joints are among the many cues that are best recognized at different scales in the image.
The hourglass is a simple, minimal design that has the capacity to capture all of these features and bring them together to output pixel-wise predictions.
The network must have some mechanism to effectively process and consolidate features across scales. Some approaches tackle this with the use of separate pipelines that process the image independently at multiple resolutions and combine features later on in the network [15, 18].
Instead, we choose to use a single pipeline with skip layers to preserve spatial information at each resolution.
The network reaches its lowest resolution at 4x4 pixels, allowing smaller spatial filters to be applied that compare features across the entire space of the image.
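The path from the highest to the lowest resolution can be checked with a few lines of arithmetic. The sketch below assumes that every max pooling step halves the side length and takes 64x64 as the highest resolution of the hourglass; the helper name `resolution_schedule` is hypothetical.

```python
def resolution_schedule(highest: int, lowest: int) -> list[int]:
    """Side lengths visited on the way down, assuming each pooling step halves them."""
    if lowest < 1 or highest < lowest:
        raise ValueError("expected highest >= lowest >= 1")
    sizes = [highest]
    while sizes[-1] > lowest:
        if sizes[-1] % 2:
            raise ValueError("side length must be even to be halved")
        sizes.append(sizes[-1] // 2)
    if sizes[-1] != lowest:
        raise ValueError("lowest is not reachable by repeated halving")
    return sizes


schedule = resolution_schedule(64, 4)   # [64, 32, 16, 8, 4]
num_poolings = len(schedule) - 1        # 4 pooling steps inside the hourglass
print(schedule, num_poolings)
```

At 4x4, a single 3x3 filter already spans most of the feature map, which is what lets small filters compare features from across the whole image.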
The hourglass is set up as follows: Convolutional and max pooling layers are used to process features down to a very low resolution.
At each max pooling step, the network branches off and applies more convolutions at the original pre-pooled resolution.
After reaching the lowest resolution, the network begins the top-down sequence of upsampling and combination of features across scales.
To bring together information across two adjacent resolutions, we follow the process described by Tompson et al. [15] and do nearest neighbor upsampling of the lower resolution followed by an elementwise addition of the two sets of features.
The topology of the hourglass is symmetric, so for every layer present on the way down there is a corresponding layer going up.
After reaching the output resolution of the network, two consecutive rounds of 1x1 convolutions are applied to produce the final network predictions.
The output of the network is a set of heatmaps where for a given heatmap the network predicts the probability of a joint’s presence at each and every pixel.
The full module (excluding the final 1x1 layers) is illustrated in Figure 3.
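The recursive structure described above can be sketched in plain Python. This is a minimal illustration under several assumptions rather than the actual implementation: feature maps are single-channel nested lists, the learned layers are replaced by a hypothetical identity placeholder `layer`, and a depth of 3 on an 8x8 input stands in for the real network's pooling down to 4x4.

```python
Grid = list[list[float]]


def layer(x: Grid) -> Grid:
    """Placeholder for a learned block (assumption: identity, no weights)."""
    return [row[:] for row in x]


def max_pool_2x2(x: Grid) -> Grid:
    """2x2 max pooling with stride 2; expects even height and width."""
    return [[max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
             for j in range(0, len(x[0]), 2)]
            for i in range(0, len(x), 2)]


def upsample_nearest_2x(x: Grid) -> Grid:
    """Nearest neighbor upsampling: every value becomes a 2x2 block."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(wide[:])
    return out


def add(a: Grid, b: Grid) -> Grid:
    """Elementwise addition of two feature maps of equal size."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def hourglass(x: Grid, depth: int) -> Grid:
    skip = layer(x)                       # branch kept at the pre-pooled resolution
    low = layer(max_pool_2x2(x))          # bottom-up: pool, then process
    low = hourglass(low, depth - 1) if depth > 1 else layer(low)
    low = layer(low)                      # top-down processing before upsampling
    return add(skip, upsample_nearest_2x(low))


x = [[float(i * 8 + j) for j in range(8)] for i in range(8)]
y = hourglass(x, depth=3)                 # 8 -> 4 -> 2 -> 1, then back up to 8
print(len(y), len(y[0]))
```

Because every pooling step is mirrored by an upsampling step and a skip branch, the output has the same spatial size as the input.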
While maintaining the overall hourglass shape, there is still some flexibility in the specific implementation of layers.
Different choices can have a moderate impact on the final performance and training of the network. We explore several options for layer design in our network.
Recent work has shown the value of reduction steps with 1x1 convolutions, as well as the benefits of using consecutive smaller filters to capture a larger spatial context [12, 14].
For example, one can replace a 5x5 filter with two separate 3x3 filters. We tested our overall network design, swapping in different layer modules based off of these insights.
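The trade-off behind replacing a 5x5 filter with two 3x3 filters can be made concrete by counting weights. The sketch below assumes stride-1 convolutions, ignores biases, and uses 256 input and output channels purely for illustration; the helper names are hypothetical.

```python
def conv_weights(kernel: int, in_ch: int, out_ch: int) -> int:
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_ch * out_ch


def stacked_receptive_field(kernels: list[int]) -> int:
    """Receptive field of consecutive stride-1 convolutions."""
    return 1 + sum(k - 1 for k in kernels)


channels = 256  # illustrative channel count
one_5x5 = conv_weights(5, channels, channels)
two_3x3 = 2 * conv_weights(3, channels, channels)

print(stacked_receptive_field([5]), one_5x5)     # 5 1638400
print(stacked_receptive_field([3, 3]), two_3x3)  # 5 1179648
```

Both options see a 5x5 window, but the pair of 3x3 filters needs 18/25 of the weights and adds an extra nonlinearity between the two layers.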
We experienced an increase in network performance after switching from standard convolutional layers with large filters and no reduction steps to newer methods like the residual learning modules presented by He et al. [14] and “Inception”-based designs [12].
After the initial performance improvement with these types of designs, various additional explorations and modifications to the layers did little to further boost performance or training time.
Our final design makes extensive use of residual modules. Filters greater than 3x3 are never used, and the bottlenecking restricts the total number of parameters at each layer, curtailing total memory usage.
The module used in our network is shown in Figure 4. To put this into the context of the full network design, each box in Figure 3 represents a single residual module.
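To give a rough sense of how bottlenecking restricts the parameter count, the sketch below compares a plain 3x3 convolution with a bottlenecked branch of the form 1x1, 3x3, 1x1. The bottleneck width of 128 channels is an assumption made for this example and is not stated in the text above; biases and normalization parameters are ignored.

```python
def conv_weights(kernel: int, in_ch: int, out_ch: int) -> int:
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_ch * out_ch


def bottleneck_weights(channels: int, mid: int) -> int:
    """Weights in a 1x1 reduce, 3x3, 1x1 expand branch of a residual module."""
    return (conv_weights(1, channels, mid)
            + conv_weights(3, mid, mid)
            + conv_weights(1, mid, channels))


channels = 256   # features output by each residual module
mid = 128        # assumed bottleneck width, for illustration only

plain = conv_weights(3, channels, channels)
bottlenecked = bottleneck_weights(channels, mid)
print(plain, bottlenecked)
```

Under these assumptions the bottlenecked branch uses 212,992 weights against 589,824 for a single full-width 3x3 convolution.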
Operating at the full input resolution of 256x256 requires a significant amount of GPU memory, so the highest resolution of the hourglass (and thus the final output resolution) is 64x64.
This does not affect the network’s ability to produce precise joint predictions. The full network starts with a 7x7 convolutional layer with stride 2, followed by a residual module and a round of max pooling to bring the resolution down from 256 to 64.
Two subsequent residual modules precede the hourglass shown in Figure 3. Across the entire hourglass all residual modules output 256 features.
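The reduction from 256 to 64 follows from the standard output-size formula for strided layers. In the sketch below, the padding of 3 for the 7x7 convolution and the 2x2 window for the max pooling are assumptions chosen so that each step halves the resolution exactly; residual modules are assumed to leave the resolution unchanged.

```python
def out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


after_conv = out_size(256, kernel=7, stride=2, padding=3)        # 7x7 conv, stride 2
after_pool = out_size(after_conv, kernel=2, stride=2, padding=0)  # 2x2 max pooling
print(after_conv, after_pool)  # 128 64
```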
We take our network architecture further by stacking multiple hourglasses end-to-end, feeding the output of one as input into the next.
This provides the network with a mechanism for repeated bottom-up, top-down inference allowing for reevaluation of initial estimates and features across the whole image.
The key to this approach is the prediction of intermediate heatmaps upon which we can apply a loss.
Predictions are generated after passing through each hourglass where the network has had an opportunity to process features at both local and global contexts.
Subsequent hourglass modules allow these high level features to be processed again to further evaluate and reassess higher order spatial relationships.
This is similar to other pose estimation methods that have demonstrated strong performance with multiple iterative stages and intermediate supervision [18, 19, 30].
Consider the limits of applying intermediate supervision with only the use of a single hourglass module. What would be an appropriate place in the pipeline to generate an initial set of predictions?
Most higher order features are present only at lower resolutions except at the very end when upsampling occurs.
If supervision is provided after the network does upsampling then there is no way for these features to be reevaluated relative to each other in a larger global context.
If we want the network to best refine predictions, these predictions cannot be exclusively evaluated at a local scale.
The relationship to other joint predictions as well as the general context and understanding of the full image is crucial.
Applying supervision earlier in the pipeline before pooling is a possibility, but at this point the features at a given pixel are the result of processing a relatively local receptive field and are thus ignorant of critical global cues.
Repeated bottom-up, top-down inference with stacked hourglasses alleviates these concerns.
Local and global cues are integrated within each hourglass module, and asking the network to produce early predictions requires it to have a high-level understanding of the image while only partway through the full network.
Subsequent stages of bottom-up, top-down processing allow for a deeper reconsideration of these features.
This approach for going back and forth between scales is particularly important because preserving the spatial location of features is essential to do the final localization step.
The precise position of a joint is an indispensable cue for other decisions being made by the network. With a structured problem like pose estimation, the output is an interplay of many different features that should come together to form a coherent understanding of the scene.
Contradicting evidence and anatomic impossibility are big giveaways that somewhere along the line a mistake was made, and by going back and forth the network can maintain precise local information while considering and then reconsidering the overall coherence of the features.
We reintegrate intermediate predictions back into the feature space by mapping them to a larger number of channels with an additional 1x1 convolution.
These are added back to the intermediate features from the hourglass along with the features output from the previous hourglass stage (visualized in Figure 4).
The resulting output serves directly as the input for the following hourglass module which generates another set of predictions.
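The reintegration step can be sketched as follows. A 1x1 convolution is a per-pixel linear map over channels, so it is written here as a small matrix product at each pixel. The function names, the tiny tensor sizes, and the omission of biases are assumptions for illustration only.

```python
def conv1x1(x, weight):
    """x: [H][W][C_in], weight: [C_out][C_in]; returns [H][W][C_out] (no bias)."""
    return [[[sum(w * v for w, v in zip(w_row, pixel)) for w_row in weight]
             for pixel in row]
            for row in x]


def add_maps(*maps):
    """Elementwise sum of feature maps that share the same [H][W][C] shape."""
    first = maps[0]
    return [[[sum(m[i][j][c] for m in maps) for c in range(len(first[i][j]))]
             for j in range(len(first[i]))]
            for i in range(len(first))]


def next_stage_input(prev_features, hg_features, heatmaps, remap_weight):
    """Map heatmaps back to feature channels and add them to both feature sets."""
    remapped = conv1x1(heatmaps, remap_weight)
    return add_maps(prev_features, hg_features, remapped)


# Toy sizes: 2x2 pixels, 2 joints (heatmap channels), 3 feature channels.
prev_features = [[[1.0] * 3 for _ in range(2)] for _ in range(2)]
hg_features = [[[2.0] * 3 for _ in range(2)] for _ in range(2)]
heatmaps = [[[1.0, 0.0], [0.0, 1.0]],
            [[0.0, 0.0], [1.0, 1.0]]]
remap_weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 2 channels -> 3 channels

stage_input = next_stage_input(prev_features, hg_features, heatmaps, remap_weight)
print(stage_input[0][0], stage_input[1][1])  # [4.0, 3.0, 4.0] [4.0, 4.0, 5.0]
```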
In the final network design, eight hourglasses are used. It is important to note that weights are not shared across hourglass modules, and a loss is applied to the predictions of all hourglasses using the same ground truth.
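Intermediate supervision can be sketched as one loss term per hourglass, all computed against the same target heatmap. Summing the per-stage terms with equal weight is an assumption of this example; the text only states that a loss is applied to every stage's predictions.

```python
def mse(pred, target):
    """Mean squared error between two heatmaps given as nested lists."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
            count += 1
    return total / count


def stacked_loss(stage_predictions, target):
    """Sum of per-stage losses; every stage is compared with the same target."""
    return sum(mse(pred, target) for pred in stage_predictions)


target = [[0.0, 1.0], [0.0, 0.0]]
stage_predictions = [
    [[0.5, 0.5], [0.0, 0.0]],   # early, uncertain prediction
    [[0.0, 1.0], [0.0, 0.0]],   # later prediction matching the target
]
total = stacked_loss(stage_predictions, target)
print(total)  # 0.125
```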
The details for the loss and ground truth are described below.
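The same technique as Tompson et al. [15] is used for supervision. A Mean Squared Error (MSE) loss is applied, comparing the predicted heatmap to a ground-truth heatmap consisting of a 2D Gaussian (with a standard deviation of 1 px) centered on the joint location. To improve performance at high precision thresholds, the prediction is offset by a quarter of a pixel in the direction of its next highest neighbor before transforming back to the original coordinate space of the image. In MPII Human Pose, some joints do not have a corresponding ground truth annotation. In these cases the joint is either truncated or severely occluded, so for supervision a ground truth heatmap of all zeros is provided.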
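The target construction and the quarter-pixel decoding can be sketched as below. Two details are assumptions of this example: the Gaussian is left unnormalized with a peak value of 1, and "next highest neighbor" is interpreted per axis by comparing the two pixels on either side of the maximum. All function names are hypothetical.

```python
import math


def gaussian_heatmap(height, width, cx, cy, sigma=1.0):
    """2D Gaussian centered on the joint at (cx, cy); peak value 1 (assumption)."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]


def zero_heatmap(height, width):
    """Target used when a joint has no ground truth annotation."""
    return [[0.0] * width for _ in range(height)]


def mse(pred, target):
    """Mean squared error over all pixels."""
    diffs = [(p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)


def sign(value):
    return (value > 0) - (value < 0)


def decode(heatmap):
    """Arg-max location, shifted a quarter pixel toward the higher neighbor per axis."""
    height, width = len(heatmap), len(heatmap[0])
    y, x = max(((i, j) for i in range(height) for j in range(width)),
               key=lambda ij: heatmap[ij[0]][ij[1]])
    fx, fy = float(x), float(y)
    if 0 < x < width - 1:
        fx += 0.25 * sign(heatmap[y][x + 1] - heatmap[y][x - 1])
    if 0 < y < height - 1:
        fy += 0.25 * sign(heatmap[y + 1][x] - heatmap[y - 1][x])
    return fx, fy


target = gaussian_heatmap(8, 8, cx=3.0, cy=4.0)
prediction = gaussian_heatmap(8, 8, cx=3.3, cy=4.0)  # stand-in for a network output
loss = mse(prediction, target)
joint_xy = decode(prediction)
print(loss, joint_xy)
```

In this toy case the peak of the stand-in prediction lies between pixel columns 3 and 4, and the decoded location moves from column 3 to 3.25, closer to the true sub-pixel position of 3.3.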
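Evaluation is done using the standard Percentage of Correct Keypoints (PCK) metric, which reports the percentage of detections that fall within a normalized distance of the ground truth. For FLIC, distance is normalized by torso size, and for MPII, by a fraction of the head size (referred to as PCKh). Results can be seen in Figure 6 and Table 1. Our results on FLIC are very competitive, reaching 99% PCK@0.2 accuracy on the elbow and 97% on the wrist. It is important to note that these results are observer-centric, which is consistent with how others have evaluated their output on FLIC.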
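A minimal sketch of the PCK computation is given below. The function name, the use of a per-sample reference length (torso size for FLIC, head size for PCKh), and the inclusive comparison at the threshold are assumptions of this example; the sample coordinates are made up to exercise the code.

```python
import math


def pck(predictions, ground_truth, reference_sizes, threshold):
    """Percentage of predictions within threshold * reference size of the ground truth."""
    correct = 0
    for (px, py), (gx, gy), size in zip(predictions, ground_truth, reference_sizes):
        if math.hypot(px - gx, py - gy) <= threshold * size:
            correct += 1
    return 100.0 * correct / len(predictions)


# Made-up joint locations for four samples with a reference length of 20 px each.
predictions = [(10, 10), (20, 20), (30, 30), (40, 40)]
ground_truth = [(10, 12), (20, 35), (31, 30), (40, 40)]
reference_sizes = [20, 20, 20, 20]

score = pck(predictions, ground_truth, reference_sizes, threshold=0.2)
print(score)  # 75.0
```

With a threshold of 0.2 and a reference length of 20 px, a prediction counts as correct when it lies within 4 px of the annotation, which holds for three of the four samples here.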
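MPII: We achieve state-of-the-art results across all joints on the MPII Human Pose dataset. All numbers can be seen in Table 2 along with PCK curves in Figure 7. On difficult joints like the wrist, elbows, knees, and ankles we improve upon the most recent state-of-the-art results by an average of 3.5% (PCKh@0.5), with the average error rate dropping from 16.3% to 12.8%. The final elbow accuracy is 91.2% and the wrist accuracy is 87.1%. Example predictions made by the network on MPII can be seen in Figure 5.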
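To see the effect of these choices, we first compare a two-stacked network with four residual modules at each stage in the hourglass, and a single hourglass but with eight residual modules instead. They are referred to as HG-Stacked and HG respectively in Figure 8. A modest improvement in training can be seen when using the stacked design despite having approximately the same number of layers and parameters.
Next, we consider the impact of intermediate supervision. For the two-stack network we follow the procedure described in the paper to apply supervision. Applying this same idea with a single hourglass is nontrivial, since higher order global features are present only at lower resolutions and the features across scales are not combined until late in the pipeline. We explore applying supervision at various points in the network, for example either before or after pooling and at various resolutions. The best performing method is shown as HG-Int in Figure 8, with intermediate supervision applied after upsampling at the next two highest resolutions before the final output resolution. This supervision does offer an improvement in performance, but not enough to surpass the improvement when stacking is included (HG-Stacked-Int).
In Figure 9 we compare the validation accuracy of 2-, 4-, and 8-stack models that share approximately the same number of parameters, and include the accuracy of their intermediate predictions. There is a modest improvement in final performance for each successive increase in stacking, from 87.4% to 87.8% to 88.1%. The effect is more notable at intermediate stages. For example, halfway through each network the corresponding accuracies of the intermediate predictions are 84.6%, 86.5%, and 87.1%. Note that the accuracy halfway through the 8-stack network is just short of the final accuracy of the 2-stack network.
It is interesting to observe the mistakes made early and corrected later on by the network. A few examples are visualized in Figure 9. Common mistakes show up, like a mix-up of other people’s joints or a misattribution of left and right. For the running figure, it is apparent from the final heatmap that the decision between left and right is still a bit ambiguous for the network. Given the appearance of the image, the confusion is justified. One case worth noting is the middle example, where the network initially activates on the visible wrists in the image. Upon further processing the heatmap does not activate at all on the original locations, instead choosing a reasonable position for the occluded wrist.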