Learning to Reconstruct 3D Human Pose and Shape via Model-fitting in the Loop: Paper Study Notes (1)


Abstract

Original text
Model-based human pose estimation is currently approached through two different paradigms.
Optimization-based methods fit a parametric body model to 2D observations in an iterative manner, leading to accurate image-model alignments, but are often slow and sensitive to the initialization.
In contrast, regression-based methods, that use a deep network to directly estimate the model parameters from pixels, tend to provide reasonable, but not pixel accurate, results while requiring huge amounts of supervision.

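Paraphrase: Model-based human pose estimation currently follows two different paradigms. Optimization-based methods iteratively fit a parametric body model to 2D observations; they achieve accurate image-model alignment but are usually slow and sensitive to initialization. Regression-based methods, by contrast, use a deep network to estimate the model parameters directly from pixels; they tend to give reasonable but not pixel-accurate results, and they need a large amount of supervision.
Vocabulary: iterative (repeating step by step); alignment (lining up, registration); pixel (picture element); supervision (training signal derived from labels); estimate (an approximate value, or to compute one)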

In this work, instead of investigating which approach is better, our key insight is that the two paradigms can form a strong collaboration. A reasonable, directly regressed estimate from the network can initialize the iterative optimization making the fitting faster and more accurate. Similarly, a pixel accurate fit from iterative optimization can act as strong supervision for the network. This is the core of our proposed approach SPIN (SMPL oPtimization IN the loop).

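Paraphrase: Rather than asking which approach is better, the key insight of this work is that the two paradigms can work together. A reasonable estimate regressed directly by the network can initialize the iterative optimization, which makes the fitting faster and more accurate. In turn, the pixel-accurate fit produced by the iterative optimization can serve as strong supervision for the network. This is the core of the proposed SPIN method.
Vocabulary: investigate (study, examine); collaboration (working together); optimization (searching for the best solution under given conditions)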

The deep network initializes an iterative optimization routine that fits the body model to 2D joints within the training loop, and the fitted estimate is subsequently used to supervise the network. Our approach is self-improving by nature, since better network estimates can lead the optimization to better solutions, while more accurate optimization fits provide better supervision for the network. We demonstrate the effectiveness of our approach in different settings, where 3D ground truth is scarce, or not available, and we consistently outperform the state-of-the-art model-based pose estimation approaches by significant margins.

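Paraphrase: The deep network initializes an iterative optimization routine that, inside the training loop, fits the body model to the 2D joints; the fitted estimate is then used to supervise the network. The method is self-improving by nature: better network estimates lead the optimization to better solutions, and more accurate fits give the network better supervision. The authors show that the method is effective in settings where 3D ground truth is scarce or unavailable, and that it consistently outperforms state-of-the-art model-based pose estimation methods by significant margins.
Vocabulary: demonstrate (show, prove); consistently (reliably, every time); outperform (do better than)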

1. Introduction

With the emergence of deep learning architectures, the dilemma between regression-based and optimization-based approaches for many computer vision problems has been more relevant than ever. Should we regress the relative camera pose, or use bundle adjustment?
Is it more appropriate to regress the parameters of a face model, or fit the model to facial landmarks? These types of questions are ubiquitous within our community. Among others, 3D model-based human pose estimation has initiated similar discussions, since both optimization-based and regression-based approaches have had significant success recently. However, one can argue that both paradigms have weak and strong points. Based on that, instead of arguing over which paradigm is better, if we aim to push the field forward, we need to consider ways for collaboration between the two.

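Paraphrase: With the rise of deep learning, the choice between regression-based and optimization-based approaches in computer vision has become more relevant than ever. Should we regress the relative camera pose, or use bundle adjustment? (Note: bundle adjustment is an important optimization technique in visual odometry; the relative camera pose is the rotation and translation between two camera views, a quantity used throughout computer vision.)
Is it more appropriate to regress the parameters of a face model, or to fit the model to facial landmarks? Questions of this kind come up everywhere in the community. 3D model-based human pose estimation has sparked similar discussions, because both optimization-based and regression-based methods have had notable success recently. Both paradigms have strengths and weaknesses, so instead of arguing over which one is better, pushing the field forward requires finding ways for the two to work together.
Vocabulary: emergence (rise, appearance); dilemma (a difficult choice); relevant (pertinent); parameter; facial landmarks (facial keypoints); ubiquitous (present everywhere); paradigm (a model or general way of approaching a problem); regress (literally to move backward; here, to predict a continuous value by regression)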

Although 3D model-based human pose estimation is a very challenging and highly ambiguous problem, there have been fundamental works that attempt to address it.
Optimization-based methods are pretty well explored and understood. Given a parametric model of the human body, e.g., SMPL, an iterative fitting approach attempts to estimate the body pose and shape that best explains 2D observations, most typically 2D joint locations. Since we explicitly optimize for the agreement of the model with image features, we typically get a good fit, but the optimization tends to be very slow and is quite sensitive to the choice of the initialization. On the other hand, recent deep learning advances have shifted the spotlight towards purely regression-based methods, using deep networks to regress the parameters of the model directly from images. In theory, this is a very promising direction, since the deep regressor can take all pixel values into consideration, instead of relying only on a sparse set of 2D locations. Unfortunately, this type of one-shot prediction might lead to mediocre image-model alignment, while at the same time a large amount of data is necessary to properly train the network. So naturally, there is a large list of arguments in favor and against each method.

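Paraphrase: Although 3D model-based human pose estimation is a very challenging and highly ambiguous problem, there is already fundamental work that tries to solve it.
Optimization-based methods are well explored and well understood. Given a parametric model of the human body, such as SMPL, an iterative fitting method estimates the body pose and shape that best explain the 2D observations, most commonly the 2D joint locations. Because the agreement between the model and the image features is optimized explicitly, the fit is usually good, but the optimization tends to be very slow and quite sensitive to the choice of initialization. On the other hand, recent advances in deep learning have moved attention towards purely regression-based methods, which use a deep network to regress the model parameters directly from the image. In theory this is a very promising direction, because the deep regressor can take every pixel value into account instead of relying only on a sparse set of 2D locations. Unfortunately, this kind of one-shot prediction can give mediocre image-model alignment, and a large amount of data is needed to train the network properly. So, naturally, there are many arguments for and against each method.
Vocabulary: ambiguous (having more than one possible interpretation); explicitly (directly, clearly stated); shift the spotlight (move attention onto something); mediocre (ordinary, unremarkable)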
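To make the iterative fitting described above a bit more concrete, here is a minimal sketch. It is not SMPLify and not the authors' code: `forward` is a hypothetical two-angle planar chain standing in for SMPL, and `fit` is plain gradient descent with numerical gradients, chosen only to keep the example short. It minimizes the squared distance between the model's joints and the given 2D keypoints, starting from an initial guess.

```python
import math

def forward(theta):
    """Toy stand-in for a body model: a planar chain with two joint angles.
    Returns the 2D positions of the 'elbow' and 'wrist' (unit-length segments)."""
    t1, t2 = theta
    elbow = (math.cos(t1), math.sin(t1))
    wrist = (elbow[0] + math.cos(t1 + t2), elbow[1] + math.sin(t1 + t2))
    return [elbow, wrist]

def reprojection_error(theta, keypoints_2d):
    """Sum of squared distances between model joints and observed 2D keypoints."""
    return sum((px - kx) ** 2 + (py - ky) ** 2
               for (px, py), (kx, ky) in zip(forward(theta), keypoints_2d))

def fit(theta_init, keypoints_2d, steps=500, lr=0.1, eps=1e-6):
    """Iteratively refine theta by gradient descent (central-difference gradients)."""
    theta = list(theta_init)
    for _ in range(steps):
        grad = []
        for i in range(len(theta)):
            plus, minus = list(theta), list(theta)
            plus[i] += eps
            minus[i] -= eps
            grad.append((reprojection_error(plus, keypoints_2d)
                         - reprojection_error(minus, keypoints_2d)) / (2 * eps))
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

# Observed keypoints, generated here from a known pose so the fit can be checked.
keypoints = forward((0.8, -0.5))
theta_init = (0.3, 0.0)
theta_fit = fit(theta_init, keypoints)
print(reprojection_error(theta_init, keypoints),
      reprojection_error(theta_fit, keypoints))
```

The loop needs hundreds of model evaluations for a single sample, which hints at why this family of methods is slow, and the result depends on where `theta_init` starts.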

In this work, we advocate that instead of arguing over one paradigm or the other we should embrace the strengths and the weaknesses of each method and use them in a tight collaboration during training. In our approach, a deep network is used to regress the parameters of the SMPL parametric model. These regressed values initialize the iterative fitting routine that aligns the model to the image given the 2D keypoints. Subsequently, the parameters of the fitted model are used as supervision for the network, closing the loop between the regression and the optimization method. This is the core of our approach, SPIN, that fits the model within the training loop, and uses it as a privileged form of supervision for the neural network. (Figure 2)

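Paraphrase: Instead of arguing for one paradigm or the other, the authors propose accepting the strengths and weaknesses of each method and using them in tight collaboration during training. In their approach, a deep network regresses the parameters of the SMPL model. These regressed values initialize the iterative fitting routine, which aligns the model to the image given the 2D keypoints. The parameters of the fitted model are then used to supervise the network, which closes the loop between regression and optimization (I believe this corresponds to the ||Θreg - Θopt|| term in Figure 2). This is the core of SPIN: it fits the model inside the training loop and uses the fit as a privileged form of supervision for the neural network.
Vocabulary: embrace (accept willingly); align (bring into line, register)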
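The loop described above (regress, fit starting from the regressed values, supervise with the fitted parameters) can be sketched on a toy problem. Everything here is an assumption made for illustration, not the SPIN implementation: the "body model" has a single angle, the "network" is a linear function of a scalar feature, and the only loss is the squared difference between regressed and fitted parameters.

```python
import math

def keypoint(theta):
    """Toy one-angle 'body model': the 2D position of a single limb endpoint."""
    return (math.cos(theta), math.sin(theta))

def fit(theta_init, target_2d, steps=50, lr=0.2):
    """Iterative fitting: gradient descent on the squared 2D keypoint error."""
    theta = theta_init
    for _ in range(steps):
        x, y = keypoint(theta)
        # d/dtheta of (x - tx)^2 + (y - ty)^2, using dx = -sin(theta), dy = cos(theta)
        grad = 2 * (x - target_2d[0]) * (-y) + 2 * (y - target_2d[1]) * x
        theta -= lr * grad
    return theta

# Training set: an "image feature" x and the 2D keypoint seen in that image.
# The true angle 0.5 + x only synthesizes the keypoints; the network never sees it.
features = [0.0, 0.25, 0.5, 0.75, 1.0]
keypoints_2d = [keypoint(0.5 + x) for x in features]

w, b = 0.0, 0.0      # linear "network": theta_reg = w * x + b
lr_net = 0.1
for epoch in range(300):
    for x, kp in zip(features, keypoints_2d):
        theta_reg = w * x + b               # 1. regress the parameters
        theta_opt = fit(theta_reg, kp)      # 2. fitting initialized by the network
        residual = theta_reg - theta_opt    # 3. fitted parameters act as the target
        w -= lr_net * 2 * residual * x      # gradient of residual**2, theta_opt held fixed
        b -= lr_net * 2 * residual

print(w, b)
```

Only 2D keypoints are given to the loop, yet the network ends up supervised in parameter space, because the fitting step turns each 2D observation into a parameter target.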

A critical characteristic of our proposed approach is that it is self-improving by nature. In the early training stages, the network will produce results close to the mean pose, meaning that the iterative fitting will be prone to make errors. As more examples are provided to the network as supervision by the iterative fitting module, it will learn to produce more meaningful shapes that will also lead the optimization to more accurate model fits.

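Paraphrase: A key property of the proposed method is that it is self-improving by nature. Early in training the network produces results close to the mean pose, so the iterative fitting is prone to errors. As the fitting module supplies more examples to the network as supervision, the network learns to produce more meaningful shapes, which in turn lead the optimization to more accurate model fits.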

Moreover, since the iterative fitting requires only 2D keypoints to fit the model, our network can be trained even when no image with corresponding 3D ground truth is available, since the 3D supervision will be provided by the optimization module.
Finally, and most crucially in terms of performance, our network is trained with explicit 3D supervision, in the form of model parameters and full shape instead of weaker 2D reprojection errors as in previous works. This privileged form of supervision turns out to be very important to improve the regression performance. Our approach is benchmarked in different settings and in a variety of indoor and in-the-wild datasets and it outperforms state-of-the-art model-based approaches by a significant margin.

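Paraphrase: Moreover, because the iterative fitting needs only 2D keypoints, the network can be trained even when no images with corresponding 3D ground truth are available; the optimization module supplies the 3D supervision. Finally, and most importantly for performance, the network is trained with explicit 3D supervision in the form of model parameters and the full shape, rather than with the weaker 2D reprojection errors used in earlier work. This privileged form of supervision turns out to matter a great deal for regression performance. The method is benchmarked in different settings on a variety of indoor and in-the-wild datasets, and it outperforms state-of-the-art model-based methods by a significant margin.
Vocabulary: crucially (of decisive importance); benchmark (evaluate against a standard)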

2. Related work

Recent works have made significant advances in the frontier of skeleton-based 3D human pose estimation from single images, with many approaches achieving impressive results. Although this line of work has boosted the interest for 3D human pose estimation, here we will focus our review on model-based pose estimation. Approaches in this category consider a parametric model of the human body, like SMPL or SCAPE, and the goal is to estimate the full body 3D pose and shape.

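Paraphrase: Recent work has made significant progress in skeleton-based 3D human pose estimation from single images, and many methods achieve impressive results. Although that line of work has increased interest in 3D human pose estimation, the review here focuses on model-based pose estimation. Methods in this category use a parametric model of the human body, such as SMPL or SCAPE, and aim to estimate the full-body 3D pose and shape.
Vocabulary: skeleton (the set of body joints and bones); frontier (boundary, leading edge of a field); boost (increase)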

Optimization-based methods: Optimization-based approaches used to be the leading paradigm for model-based human pose estimation. Early work in the area attempted to estimate the parameters of the SCAPE model using silhouettes or keypoints and often there was some manual user intervention needed. Recently, the first fully automatic approach, SMPLify, was introduced by Bogo et al. Using an off-the-shelf keypoint detector, SMPLify fits SMPL to 2D keypoint detections, using strong priors to guide the optimization.
Beyond SMPLify, different updates to the standard pipeline have investigated incorporating in the fitting procedure, silhouette cues, multiple views, or even handle multiple people. More recently, works have demonstrated fits for more expressive models in the multi-view, as well as the single view setting. In this work, we exploit the particular effectiveness of optimization-based approaches to produce pixel-accurate fittings, but instead of using them to produce good predictions at test time, our goal is to leverage them to supply direct supervision for a neural network.

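Paraphrase: Optimization-based methods: optimization-based approaches used to be the leading paradigm for model-based human pose estimation. Early work tried to estimate the parameters of the SCAPE model from silhouettes or keypoints, and often needed some manual user intervention. More recently, Bogo et al. introduced SMPLify, the first fully automatic approach. Using an off-the-shelf keypoint detector, SMPLify fits SMPL to 2D keypoint detections and uses strong priors to guide the optimization.
Beyond SMPLify, various updates to the standard pipeline [3] have explored adding silhouette cues or multiple views to the fitting procedure, or even handling multiple people. More recently, fits for more expressive models have been demonstrated in both the multi-view and the single-view setting. In this work, the authors exploit the ability of optimization-based approaches to produce pixel-accurate fits, but their goal is not to use them for good predictions at test time; it is to use them to supply direct supervision for a neural network.
(My reading: the optimization-based method, here SMPLify, produces a fitted model that is compared against the shape regressed by the CNN and serves as the network's target; the fit itself keeps being refined from the network's initialization.)
Vocabulary: silhouette (outline); manual (done by hand); intervention (stepping in, interference); off-the-shelf (ready-made); detector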

Regression-based methods: On the other end of the spectrum, recent works rely exclusively on regression to address the problem of 3D human pose and shape estimation. In most cases, given a single RGB image, a deep network is used to regress the model parameters. Considering the lack of images with full 3D shape ground truth, the majority of these works have focused on alternative supervision signals to train the deep networks. Most of them rely heavily on 2D annotations including 2D keypoints, silhouettes, or parts segmentation. This information can be used as input, intermediate representation, or as supervision, by enforcing different reprojection losses. Although these constraints are very useful, they are providing weak supervision for the network. Instead, we argue that strong model-based supervision, i.e., direct supervision on the model parameters and/or output mesh is crucial to improve performance. Although this type of ground truth is rarely available, we use a fitting routine in the training loop to provide the strong supervision signal to train the network.

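Paraphrase: Regression-based methods: at the other end of the spectrum, recent work relies entirely on regression to estimate 3D human pose and shape. In most cases a deep network regresses the model parameters from a single RGB image. Because images with full 3D shape ground truth are lacking, most of this work has focused on alternative supervision signals for training the network. Most of them depend heavily on 2D annotations, including 2D keypoints, silhouettes, or part segmentation. This information can be used as input, as an intermediate representation, or as supervision by enforcing different reprojection losses. These constraints are useful, but they supervise the network only weakly. The authors argue instead that strong model-based supervision, that is, direct supervision on the model parameters and/or the output mesh, is crucial for better performance. Such ground truth is rarely available, so they use a fitting routine inside the training loop to provide the strong supervision signal.
Vocabulary: reprojection loss (error between the projected model points and the 2D annotations) [4]; ground truth (the reference annotation treated as the correct answer) [5]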
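A small worked example of why a reprojection loss is weak supervision while a loss on the 3D output is strong. This is only a sketch under my own assumptions: an orthographic camera that drops the depth coordinate, and a hand-made two-joint "pose" rather than a real mesh. Two poses that differ only in depth get a 2D loss of exactly zero, so a network trained on that loss alone cannot tell them apart.

```python
def project(joints_3d):
    """Orthographic projection (an assumption for this example): drop the depth z."""
    return [(x, y) for x, y, _ in joints_3d]

def squared_error(a, b):
    """Sum of squared coordinate differences between two lists of points."""
    return sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))

# Reference pose: a limb pointing towards the camera-facing side (+z).
reference = [(0.0, 0.0, 0.0), (0.3, 0.0, 0.4)]
# Predicted pose: same image-plane position, but the limb points away (-z).
predicted = [(0.0, 0.0, 0.0), (0.3, 0.0, -0.4)]

loss_2d = squared_error(project(predicted), project(reference))
loss_3d = squared_error(predicted, reference)
print(loss_2d, loss_3d)
```

Here `loss_2d` is zero while `loss_3d` is about 0.64, so only the 3D term penalizes the wrong depth.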

Iterative fitting meets direct regression: Ideas of using regression approaches to improve fitting and vice versa have also been considered before in the literature. Early optimization methods required a good initial estimate which could be obtained by a discriminative approach.
Lassner et al. [18] used SMPLify to get good model fits, which could be later used for regression tasks (e.g., part segmentation or landmark detection).
Rogez et al. [29] also employed 3D pose pseudo annotations for training.
Pavlakos et al. [27] used an initial prediction from their network to initialize and anchor the SMPLify optimization routine.
Varol et al. [38] proposed an extension of SMPLify to fit SMPL on the regressed volumetric representation of their network.
Although previous works have also considered the benefits of these two approaches, in our work we propose a much tighter collaboration by incorporating the fitting method within the training loop, in a self-improving manner, to harness better supervision for the network.

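Paraphrase: Iterative fitting meets direct regression: the idea of using regression to improve fitting, and vice versa, has been considered in the literature before. Early optimization methods needed a good initial estimate, which could be obtained with a discriminative approach.
Lassner et al. [18] used SMPLify to obtain good model fits, which could later be used for regression tasks (for example, part segmentation or landmark detection).
Rogez et al. [29] also trained with 3D pose pseudo annotations (I am not sure what these are; presumably automatically generated 3D pose labels used in place of manual annotation).
Pavlakos et al. [27] used an initial prediction from their network to initialize and anchor the SMPLify optimization routine.
Varol et al. [38] proposed an extension of SMPLify that fits SMPL to the volumetric representation regressed by their network. (I may have misread this one.)
Although earlier work also considered the benefits of the two approaches, this work proposes a much tighter collaboration: the fitting method is placed inside the training loop, in a self-improving manner, to obtain better supervision for the network.
Vocabulary: vice versa (the other way round); discriminative approach (supervised learning methods can be divided into generative approaches and discriminative approaches, which produce generative models and discriminative models respectively)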

To put our approach in a larger context, the idea of combining direct regression networks with different optimization routines has also emerged in different settings. Training a network jointly with a graphical model has been proposed by Tompson et al. [36] in the context of 2D human pose estimation. Similarly, for segmentation, it is popular to use a CRF on top of the segmentation network [7], while unrolling the CRF optimization to train the network jointly with the optimization has also been investigated [30, 44]. These ideas have also translated to 3D, where Paschalidou et al. [25] unroll the MRF optimization to train it jointly with a network for depth regression. Although we draw inspiration from these works, our motivation is different since instead of unrolling the optimization, or doing a simple post-processing, we leverage the iterative fitting to provide strong supervision to the network.

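Paraphrase: To place the approach in a larger context, the idea of combining direct regression networks with various optimization routines has also appeared in other settings. In 2D human pose estimation, Tompson et al. [36] proposed training a network jointly with a graphical model. Similarly, in segmentation it is popular to put a CRF on top of the segmentation network [7], and unrolling the CRF optimization so that the network is trained jointly with it has also been studied [30, 44]. These ideas have carried over to 3D, where Paschalidou et al. [25] unroll the MRF optimization and train it jointly with a depth regression network. (This part was machine-translated; I have not read the cited papers yet and do not fully understand it.) Although the authors draw inspiration from these works, their motivation is different: instead of unrolling the optimization or applying it as simple post-processing, they use the iterative fitting to provide strong supervision to the network.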

3. Closing remarks

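I do not read many papers, so the translation is imprecise in many places, and machine translation does not handle some of the technical terms accurately. These notes are for reference and study only. The paper itself presents its content and results very clearly, so I will not repeat them here.
The references below explain some of the technical terms and link to the paper.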

References:
[1]: https://www.seas.upenn.edu/~nkolot/projects/spin/
[2]: https://www.cnblogs.com/xiaomingtalent/p/5158520.html
[3]: https://blog.csdn.net/qq_39521554/article/details/80653463
[4]: https://blog.csdn.net/Yong_Qi2015/article/details/53404506?locationNum=5&fps=1
[5]: https://blog.csdn.net/FrankieHello/article/details/80486167
