Learning to Reconstruct 3D Human Pose and Shape via Model-fitting in the Loop: Paper Study Notes (1)


Model-based human pose estimation is currently approached through two different paradigms.
Optimization-based methods fit a parametric body model to 2D observations in an iterative manner, leading to accurate image-model alignments, but they are often slow and sensitive to the initialization.
In contrast, regression-based methods, which use a deep network to directly estimate the model parameters from pixels, tend to provide reasonable, but not pixel-accurate, results while requiring huge amounts of supervision.

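Vocabulary: iterative (repeating in successive steps); alignment (lining up, matching); pixel (picture element); supervision (training signal, oversight); estimate (approximate value, to approximate)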

In this work, instead of investigating which approach is better, our key insight is that the two paradigms can form a strong collaboration. A reasonable, directly regressed estimate from the network can initialize the iterative optimization, making the fitting faster and more accurate. Similarly, a pixel-accurate fit from iterative optimization can act as strong supervision for the network. This is the core of our proposed approach, SPIN (SMPL oPtimization IN the loop).

Vocabulary: investigate (to study, to examine); collaboration (working together); optimization (finding the best solution under given conditions)

The deep network initializes an iterative optimization routine that fits the body model to 2D joints within the training loop, and the fitted estimate is subsequently used to supervise the network. Our approach is self-improving by nature, since better network estimates can lead the optimization to better solutions, while more accurate optimization fits provide better supervision for the network. We demonstrate the effectiveness of our approach in different settings, where 3D ground truth is scarce or not available, and we consistently outperform the state-of-the-art model-based pose estimation approaches by significant margins.

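Summary: the network's estimate initializes an iterative optimization that fits the body model to 2D joints inside the training loop, and the fitted result then supervises the network. Each side improves the other: better network estimates lead the optimization to better solutions, and more accurate fits give the network better supervision. The method remains effective when 3D ground truth is scarce or unavailable, and it consistently outperforms state-of-the-art model-based pose estimation methods by significant margins.
Vocabulary: demonstrate (to show, to prove); consistently (every time, without exception); outperform (to do better than)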


With the emergence of deep learning architectures, the dilemma between regression-based and optimization-based approaches for many computer vision problems has been more relevant than ever. Should we regress the relative camera pose, or use bundle adjustment?
Is it more appropriate to regress the parameters of a face model, or fit the model to facial landmarks? These types of questions are ubiquitous within our community. Among others, 3D model-based human pose estimation has initiated similar discussions, since both optimization-based and regression-based approaches have had significant success recently. However, one can argue that both paradigms have weak and strong points. Rather than arguing over which paradigm is better, if we aim to push the field forward, we need to consider ways for collaboration between the two.

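Summary: with the rise of deep learning, the choice between regression-based and optimization-based methods in computer vision has become more pressing than ever, for example whether to regress the relative camera pose or to use bundle adjustment. (Bundle adjustment is an important optimization technique in visual odometry; relative camera pose estimation is a common task in computer vision.)
Vocabulary: emergence (rise, appearance); dilemma (difficult choice); relevant (related, pertinent); parameter; facial landmarks (key points on a face); ubiquitous (present everywhere); paradigm (model, standard way of approaching a problem); regress (literally to go back; here, to predict a value by regression)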

Although 3D model-based human pose estimation is a very challenging and highly ambiguous problem, there have been fundamental works that attempt to address it. Optimization-based methods are well explored and understood. Given a parametric model of the human body, e.g., SMPL, an iterative fitting approach attempts to estimate the body pose and shape that best explain the 2D observations, most typically 2D joint locations. Since we explicitly optimize for the agreement of the model with image features, we typically get a good fit, but the optimization tends to be very slow and is quite sensitive to the choice of initialization.
On the other hand, recent deep learning advances have shifted the spotlight towards purely regression-based methods, which use deep networks to regress the parameters of the model directly from images. In theory, this is a very promising direction, since the deep regressor can take all pixel values into consideration instead of relying only on a sparse set of 2D locations. Unfortunately, this type of one-shot prediction might lead to mediocre image-model alignment, while at the same time a large amount of data is necessary to properly train the network. So naturally, there is a long list of arguments for and against each method.

Vocabulary: ambiguous (open to more than one interpretation); explicitly (clearly, directly); shift the spotlight (to move attention toward something); mediocre (ordinary, not very good)
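To make the idea of iterative fitting concrete, here is a minimal sketch in Python. It is a toy under stated assumptions, not SMPLify: instead of SMPL pose and shape it fits only a weak-perspective camera (one scale and a 2D translation) to 2D joints by plain gradient descent on the reprojection error. The joint coordinates, the function names `reprojection_error` and `fit_camera`, the learning rate, and the tolerance are all made up for illustration. It also shows why a good starting point helps: the run that starts near the solution needs fewer steps.

```python
import numpy as np

def reprojection_error(joints_3d, joints_2d, scale, trans):
    """Mean squared distance between projected model joints and 2D detections."""
    projected = scale * joints_3d[:, :2] + trans   # weak perspective: drop depth
    return float(np.mean(np.sum((projected - joints_2d) ** 2, axis=1)))

def fit_camera(joints_3d, joints_2d, scale, trans, lr=0.1, tol=1e-10, max_steps=1000):
    """Gradient descent on the reprojection error; returns the fit and steps used."""
    trans = np.asarray(trans, dtype=float).copy()
    xy = joints_3d[:, :2]
    n = len(xy)
    for step in range(max_steps):
        if reprojection_error(joints_3d, joints_2d, scale, trans) < tol:
            return scale, trans, step
        residual = scale * xy + trans - joints_2d          # shape (n, 2)
        scale -= lr * 2.0 * np.sum(residual * xy) / n      # d(error)/d(scale)
        trans -= lr * 2.0 * residual.sum(axis=0) / n       # d(error)/d(trans)
    return scale, trans, max_steps

# Hypothetical 3D joints and the 2D "detections" they project to
# (true scale 2.0, true translation (0.5, -0.3)).
joints_3d = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                      [-1.0, 0.0, 0.0], [0.0, -1.0, 0.5]])
joints_2d = 2.0 * joints_3d[:, :2] + np.array([0.5, -0.3])

# Same optimizer, two initializations: one far from the solution, one close to it.
scale_far, trans_far, steps_far = fit_camera(joints_3d, joints_2d, 1.0, (0.0, 0.0))
scale_near, trans_near, steps_near = fit_camera(joints_3d, joints_2d, 1.9, (0.45, -0.27))
print(steps_far, steps_near)
```

This toy problem is convex, so both runs reach the same answer and only the step count differs. Fitting a real articulated body model is non-convex, which is where the sensitivity to initialization described above comes from.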

In this work, we advocate that instead of arguing over one paradigm or the other, we should embrace the strengths and the weaknesses of each method and use them in a tight collaboration during training. In our approach, a deep network is used to regress the parameters of the SMPL parametric model. These regressed values initialize the iterative fitting routine that aligns the model to the image given the 2D keypoints. Subsequently, the parameters of the fitted model are used as supervision for the network, closing the loop between the regression and the optimization method. This is the core of our approach, SPIN, which fits the model within the training loop and uses it as a privileged form of supervision for the neural network (Figure 2).

Vocabulary: embrace (to accept willingly); align (to line up, to bring into agreement)
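The regress, fit, supervise loop described above can be sketched on a toy problem. Everything in this sketch is an assumption made for illustration and none of it comes from the paper's code: the "body model" is a random linear map `A` from parameters to keypoints, the "network" is a linear regressor `W` on made-up feature vectors `X`, and the fitting routine is plain gradient descent. The point is only the data flow: the regressor never sees the true parameters, yet it improves because the fitted parameters supervise it.

```python
import numpy as np

rng = np.random.default_rng(0)
P, K, D, N = 3, 8, 5, 64            # parameters, keypoint coordinates, feature size, samples

A = rng.normal(size=(K, P))          # hypothetical linear "body model": keypoints = A @ theta
W_true = rng.normal(size=(P, D))     # hidden mapping, used only to synthesize data
X = rng.normal(size=(N, D))          # stand-in for image features
theta_true = X @ W_true.T            # never shown to the regressor
Y = theta_true @ A.T                 # observed 2D keypoints, shape (N, K)

fit_lr = 0.5 / np.linalg.norm(A, 2) ** 2   # step size that keeps the fitting stable

def fit_to_keypoints(theta_init, keypoints, steps=20):
    """Iterative fitting: gradient descent on the squared keypoint error."""
    theta = theta_init.copy()
    for _ in range(steps):
        residual = theta @ A.T - keypoints
        theta -= fit_lr * 2.0 * residual @ A
    return theta

def param_error(W):
    """Mean squared error against the hidden parameters (evaluation only)."""
    return float(np.mean((X @ W.T - theta_true) ** 2))

def train(epochs=200, lr=0.2):
    W = np.zeros((P, D))             # untrained regressor: same output for every input
    for _ in range(epochs):
        theta_reg = X @ W.T                           # 1. regress parameters
        theta_opt = fit_to_keypoints(theta_reg, Y)    # 2. fit, initialized by the regression
        grad = 2.0 * (theta_reg - theta_opt).T @ X / N
        W -= lr * grad                                # 3. fitted parameters supervise the regressor
    return W

error_before = param_error(np.zeros((P, D)))
error_after = param_error(train())
print(error_before, error_after)
```

Because the toy model is linear and the keypoints determine the parameters uniquely, the fit always moves toward the truth. With a real body model and 2D keypoints alone the fit can be ambiguous, which is why the paper stresses strong priors and a good initialization.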

A critical characteristic of our proposed approach is that it is self-improving by nature. In the early training stages, the network will produce results close to the mean pose, meaning that the iterative fitting will be prone to errors. As more examples are provided to the network as supervision by the iterative fitting module, it will learn to produce more meaningful shapes that will also lead the optimization to more accurate model fits.


Moreover, since the iterative fitting requires only 2D keypoints to fit the model, our network can be trained even when no image with corresponding 3D ground truth is available, since the 3D supervision will be provided by the optimization module.
Finally, and most crucially in terms of performance, our network is trained with explicit 3D supervision, in the form of model parameters and full shape, instead of the weaker 2D reprojection errors used in previous works. This privileged form of supervision turns out to be very important for improving the regression performance. Our approach is benchmarked in different settings and on a variety of indoor and in-the-wild datasets, and it outperforms state-of-the-art model-based approaches by a significant margin.

Vocabulary: crucially (in a critically important way); benchmarked (evaluated against standard reference datasets)

2. Related work

Recent works have made significant advances in the frontier of skeleton-based 3D human pose estimation from single images, with many approaches achieving impressive results. Although this line of work has boosted interest in 3D human pose estimation, here we will focus our review on model-based pose estimation. Approaches in this category consider a parametric model of the human body, like SMPL or SCAPE, and the goal is to estimate the full body 3D pose and shape.

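Vocabulary: skeleton (the set of body joints and the bones connecting them); frontier (border; here, the leading edge of a field); boost (to increase, to strengthen)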

Optimization-based methods: Optimization-based approaches used to be the leading paradigm for model-based human pose estimation. Early work in the area attempted to estimate the parameters of the SCAPE model using silhouettes or keypoints, and often some manual user intervention was needed. Recently, the first fully automatic approach, SMPLify, was introduced by Bogo et al. Using an off-the-shelf keypoint detector, SMPLify fits SMPL to 2D keypoint detections, using strong priors to guide the optimization.
Beyond SMPLify, different updates to the standard pipeline have investigated incorporating silhouette cues or multiple views into the fitting procedure, or even handling multiple people. More recently, works have demonstrated fits for more expressive models in the multi-view as well as the single-view setting. In this work, we exploit the particular effectiveness of optimization-based approaches to produce pixel-accurate fittings, but instead of using them to produce good predictions at test time, our goal is to leverage them to supply direct supervision for a neural network.

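(Note: optimization-based approaches are used to produce a fitted model, which is compared with the regressed shape output by the CNN; the model initialized by the network is refined repeatedly, and the optimization method is SMPLify.)
Vocabulary: silhouette (outline of a shape); manual (done by hand); intervention (stepping in to influence a process); off-the-shelf (ready-made); detector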

Regression-based methods: On the other end of the spectrum, recent works rely exclusively on regression to address the problem of 3D human pose and shape estimation. In most cases, given a single RGB image, a deep network is used to regress the model parameters. Considering the lack of images with full 3D shape ground truth, the majority of these works have focused on alternative supervision signals to train the deep networks. Most of them rely heavily on 2D annotations including 2D keypoints, silhouettes, or parts segmentation. This information can be used as input, as an intermediate representation, or as supervision, by enforcing different reprojection losses. Although these constraints are very useful, they provide only weak supervision for the network. Instead, we argue that strong model-based supervision, i.e., direct supervision on the model parameters and/or the output mesh, is crucial to improve performance. Although this type of ground truth is rarely available, we use a fitting routine in the training loop to provide the strong supervision signal to train the network.

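Summary (regression-based methods): recent work relies entirely on regression, typically a deep network that maps a single RGB image to the model parameters. Because images with full 3D shape ground truth are scarce, most of these works train with alternative signals built from 2D annotations (keypoints, silhouettes, part segmentation), used as input, as an intermediate representation, or as supervision through reprojection losses. These constraints are useful but supervise the network only weakly. The authors argue that direct supervision on the model parameters and/or the output mesh is crucial for performance, and they obtain it by running a fitting routine inside the training loop.
Vocabulary: reprojection loss (error between projected model points and their 2D observations) [4]; ground truth (the true reference value, the correct answer) [5]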
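A small numerical example may help show why a 2D reprojection loss is weaker supervision than a direct 3D loss. The sketch below is illustrative only: it assumes an orthographic projection that simply drops depth, uses three made-up joints, and compares 3D joints rather than SMPL parameters or a mesh. A prediction with every depth value flipped matches the 2D keypoints perfectly, so the reprojection loss cannot tell it apart from the correct pose, while the 3D loss can.

```python
import numpy as np

def project_orthographic(joints_3d):
    """Assumed camera model: keep x and y, discard depth."""
    return joints_3d[:, :2]

def reprojection_loss(pred_3d, keypoints_2d):
    """Weak supervision: compares only the 2D projections."""
    return float(np.mean((project_orthographic(pred_3d) - keypoints_2d) ** 2))

def joints_3d_loss(pred_3d, target_3d):
    """Strong supervision: compares the full 3D coordinates."""
    return float(np.mean((pred_3d - target_3d) ** 2))

# Hypothetical ground-truth joints and the 2D keypoints they project to.
target_3d = np.array([[0.0, 0.0, 0.0], [0.0, 0.5, 0.2], [0.3, 1.0, -0.1]])
keypoints_2d = project_orthographic(target_3d)

# A wrong prediction: same x and y, but every depth value is flipped.
flipped_3d = target_3d.copy()
flipped_3d[:, 2] *= -1.0

loss_2d = reprojection_loss(flipped_3d, keypoints_2d)   # blind to the depth error
loss_3d = joints_3d_loss(flipped_3d, target_3d)         # penalizes the depth error
print(loss_2d, loss_3d)
```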

Iterative fitting meets direct regression: Ideas of using regression approaches to improve fitting and vice versa have also been considered before in the literature. Early optimization methods required a good initial estimate which could be obtained by a discriminative approach.
Lassner et al. [18] used SMPLify to get good model fits, which could later be used for regression tasks (e.g., part segmentation or landmark detection).
Rogez et al. [29] also employed 3D pose pseudo annotations for training.
Pavlakos et al. [27] used an initial prediction from their network to initialize and anchor the SMPLify optimization routine.
Varol et al. [38] proposed an extension of SMPLify to fit SMPL on the regressed volumetric representation of their network.
Although previous works have also considered the benefits of these two approaches, in our work we propose a much tighter collaboration by incorporating the fitting method within the training loop, in a self-improving manner, to harness better supervision for the network.

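Note: Rogez et al. (reference [29]) also trained with 3D pose pseudo annotations (I am not sure what these are).
Summary: earlier works also considered the benefits of both approaches, but this work proposes a much tighter coupling by placing the fitting method inside the training loop, in a self-improving manner, so that the network receives better supervision.
Vocabulary: vice versa (the other way around); discriminative approach (supervised learning methods can be divided into generative approaches and discriminative approaches, which produce generative models and discriminative models respectively)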

To put our approach in a larger context, the idea of combining direct regression networks with different optimization routines has also emerged in different settings. Training a network jointly with a graphical model has been proposed by Tompson et al. [36] in the context of 2D human pose estimation. Similarly, for segmentation, it is popular to use a CRF on top of the segmentation network [7], while unrolling the CRF optimization to train the network jointly with the optimization has also been investigated [30, 44]. These ideas have also translated to 3D, where Paschalidou et al. [25] unroll the MRF optimization to train it jointly with a network for depth regression. Although we draw inspiration from these works, our motivation is different: instead of unrolling the optimization or doing a simple post-processing, we leverage the iterative fitting to provide strong supervision to the network.




[1]: https://www.seas.upenn.edu/~nkolot/projects/spin/
[2]: https://www.cnblogs.com/xiaomingtalent/p/5158520.html
[3]: https://blog.csdn.net/qq_39521554/article/details/80653463
[5]: https://blog.csdn.net/FrankieHello/article/details/80486167
