Total Capture: A 3D Deformation Model for Tracking Faces, Hands, and Bodies (Translation, Part 2)

[Total Capture: A 3D Deformation Model for Tracking Faces, Hands, and Bodies (Translation, Part 1)](https://www.jianshu.com/p/002ca9fd4287)

4. Motion Capture with Frankenstein Model


We fit the Frankenstein model to data to capture total body motion, including the major limbs, the face, and the fingers. Our motion capture method relies heavily on fitting mesh correspondences to 3D keypoints, which are obtained by triangulating 2D keypoint detections across multiple camera views. To capture shape information we also use point clouds generated by multiview stereo reconstruction. Model fitting is performed within an optimization framework that minimizes the distance between corresponded model joints and surface points and the 3D keypoint detections, together with an iterative closest point (ICP) term to the 3D point cloud.

4.1. 3D Measurements


We incorporate two types of measurements in our framework, as shown in Fig. 3: (1) corresponded 3D keypoints, which map to known joints or surface points on the mesh models (see Fig. 2), and (2) uncorresponded 3D points from multiview stereo reconstruction, which we match using ICP.

3D Body, Face, and Hand Keypoints: We use the OpenPose detector [25] in each available view, which produces 2D keypoints on the body with the method of [14], and hand and face keypoints using the method of [41]. 3D body skeletons are obtained from the 2D detections using the method of [28], which uses the known camera calibration parameters for reconstruction. The 3D hand keypoints are obtained by triangulating the 2D hand pose detections, following the method of [41], and similarly for the facial keypoints. Note that subsets of 3D keypoints can be entirely missing if there are not enough 2D detections for triangulation, which can happen in challenging scenes with inter-occlusions or motion blur.
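The 3D keypoints above come from triangulating the per-view 2D detections with the known camera calibration. The snippet below is a minimal sketch of linear (DLT) triangulation for a single keypoint; the toy camera matrices and the plain unweighted formulation are illustrative assumptions, not the exact reconstruction procedure of [28] or [41] (which also handle detection confidences and outliers).

```python
import numpy as np

def triangulate_point(proj_mats, points_2d):
    """Linear (DLT) triangulation of one 3D point from several calibrated views.

    proj_mats : list of 3x4 camera projection matrices.
    points_2d : list of (x, y) pixel detections, one per view.
    """
    rows = []
    for P, (x, y) in zip(proj_mats, points_2d):
        # Each view contributes two linear equations in the homogeneous point X.
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # de-homogenize


if __name__ == "__main__":
    # Two toy cameras observing the point (0.1, -0.2, 4.0).
    K = np.array([[1000.0, 0.0, 320.0], [0.0, 1000.0, 240.0], [0.0, 0.0, 1.0]])
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])
    X_true = np.array([0.1, -0.2, 4.0, 1.0])

    def project(P, X):
        x = P @ X
        return (x[0] / x[2], x[1] / x[2])

    detections = [project(P1, X_true), project(P2, X_true)]
    print(triangulate_point([P1, P2], detections))  # approx. [0.1, -0.2, 4.0]
```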


Figure 3: 3D measurements and Frankenstein fitting result.


3D Feet Keypoints: An important cue missing from the OpenPose detector is landmarks on the feet. For motion capture, these are essential to prevent footskate and to accurately determine the orientation of the feet. We therefore train a keypoint detector for the tip of the big toe, the tip of the little toe, and the ball of the foot. We annotate these 3 keypoints per foot in around 5000 person instances of the COCO dataset, and use the neural network architecture presented by [54] with a bounding box around the feet determined by the 3D body detections.¹

3D Point Clouds: We use the commercial software Capturing Reality to obtain 3D point clouds from the multiview images, with associated point normals.


4.2. Objective Function


We initially fit every frame in the sequence independently. For clarity, we drop the time index from the notation and describe the process for a single frame, which optimizes the following cost function:


E(θ^U, φ^U, t^U) = E_keypoints + E_icp + E_seam + E_prior    (11)

Anatomical Keypoint Cost: The term E_keypoints matches 3D keypoint detections that are in direct correspondence with our mesh models. These include joints (or end effectors) of the body and hands, and also points corresponding to the surface of the mesh (e.g., facial keypoints and the tips of the fingers and toes). Both types of correspondence are expressed as linear combinations of vertices via a regression matrix J ∈ R^(C×N^U), where C denotes the number of correspondences and N^U is the number of vertices in the model. Let D denote the set of available detections in a particular frame. The cost is then:

E_keypoints = λ_keypoints Σ_{i∈D} || J_i V^U − y_i ||²    (12)

where J_i indexes a row of the correspondence regression matrix and represents an interpolated position using a small number of vertices, y_i is the corresponding 3D detection, and λ_keypoints is a relative weight for this term.
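As a rough illustration of Eq. (12), the sketch below evaluates the keypoint cost given a sparse regression matrix J, the posed model vertices, and the subset of detections actually available in a frame. The dense-matrix representation and variable names are assumptions; in the paper this residual is built inside the Ceres solver.

```python
import numpy as np

def keypoint_cost(J, vertices, detections, available, lam=1.0):
    """E_keypoints = lam * sum over available detections of ||J_i V - y_i||^2.

    J          : (C, N) regression matrix; row i interpolates a few vertices.
    vertices   : (N, 3) posed model vertices V.
    detections : (C, 3) triangulated 3D keypoints y_i.
    available  : (C,) boolean mask, False where a keypoint could not be triangulated.
    """
    model_keypoints = J @ vertices                          # (C, 3), one per correspondence
    residuals = model_keypoints[available] - detections[available]
    return lam * np.sum(residuals ** 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_vertices, num_keypoints = 100, 5
    V = rng.normal(size=(num_vertices, 3))
    J = np.zeros((num_keypoints, num_vertices))
    for i in range(num_keypoints):                          # each keypoint averages 3 vertices
        J[i, rng.choice(num_vertices, size=3, replace=False)] = 1.0 / 3.0
    y = J @ V + 0.01 * rng.normal(size=(num_keypoints, 3))  # noisy "detections"
    available = np.array([True, True, True, False, True])   # one missing keypoint
    print(keypoint_cost(J, V, y, available))
```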

¹ More details are provided in the supplementary material.

ICP Cost: The 3D point cloud measurements are not a priori in correspondence with the model meshes. We therefore establish their correspondence to the mesh using Iterative Closest Point (ICP) during each solver iteration. We find the closest 3D point in the point cloud to each of the mesh vertices,

x_j = argmin_{x∈P} || x − v_j ||,

where x_j is the closest point-cloud point to vertex j and v_j is a vertex² in V^U of the Frankenstein model. To ensure that this is a correct correspondence, we use thresholds on the distance and the normals during the correspondence search.

Finally, for each vertex j we compute the point-to-plane residual, i.e., the distance along the normal direction,

E_icp = λ_icp Σ_j | n(x_j)^T (x_j − v_j) |,

where n(x_j) represents the point's normal and λ_icp is a relative weight for this term.
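For concreteness, a minimal sketch of one ICP correspondence pass as described above: nearest point-cloud neighbors per vertex, gated by distance and normal-agreement thresholds, then the point-to-plane residual. The KD-tree query and the specific threshold values are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_point_to_plane(vertices, vertex_normals, cloud_pts, cloud_normals,
                       max_dist=0.05, min_normal_dot=0.5, lam=1.0):
    """One correspondence search plus the point-to-plane cost of the ICP term.

    vertices, vertex_normals : (N, 3) model vertices and their normals.
    cloud_pts, cloud_normals : (M, 3) point cloud with per-point normals.
    """
    tree = cKDTree(cloud_pts)
    dists, idx = tree.query(vertices)                 # closest cloud point per vertex
    matched_pts = cloud_pts[idx]
    matched_nrm = cloud_normals[idx]
    # Reject correspondences that are too far away or whose normals disagree.
    keep = (dists < max_dist) & (np.sum(matched_nrm * vertex_normals, axis=1) > min_normal_dot)
    # Point-to-plane residual: signed distance along the cloud point's normal.
    diff = matched_pts[keep] - vertices[keep]
    residuals = np.sum(matched_nrm[keep] * diff, axis=1)
    return lam * np.sum(np.abs(residuals)), keep


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    xy = rng.uniform(-1.0, 1.0, size=(2000, 2))
    cloud = np.column_stack([xy, 0.002 * rng.normal(size=len(xy))])   # noisy plane z ~ 0
    cloud_n = np.tile([0.0, 0.0, 1.0], (len(cloud), 1))
    verts = np.column_stack([rng.uniform(-1.0, 1.0, size=(50, 2)),
                             0.01 * rng.normal(size=50)])             # vertices near the plane
    verts_n = np.tile([0.0, 0.0, 1.0], (50, 1))
    cost, kept = icp_point_to_plane(verts, verts_n, cloud, cloud_n)
    print(round(cost, 4), int(kept.sum()))
```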

Seam Constraints: The part models composing the Frankenstein model are rigidly linked by the skeletal hierarchy. However, the independent surface parameterizations of the part models may introduce discontinuities at the boundary between parts (e.g., a fat arm on a thin wrist). To avoid this artifact, we encourage the vertices around the seams to remain close by penalizing the differences between the last two rings of vertices around the seam of each part and the corresponding closest points on the body model in the rest pose, expressed as barycentric coordinates (see the supplementary materials for details).

Prior Cost: Depending on the number of measurements available in a particular frame, the set of parameters of M^U may not be determined uniquely (e.g., the width of the fingers). More importantly, the 3D point clouds are noisy and cannot be fully explained by the model, because hair and clothing are not captured by the SMPL and FaceWarehouse meshes; this can produce erroneous correspondences during ICP. Additionally, the joint locations of the models are not necessarily consistent with the annotation criteria used to train the 2D detectors. We are therefore forced to set priors over the model parameters to avoid overfitting to these sources of noise,

E_prior = E^B_prior + E^F_prior + E^H_prior,

with one term each for the body, face, and hand parts. The prior for each part is defined by corresponding shape and pose priors, for which we use zero-mean standard normal priors for each parameter, except for the scaling factors, which are encouraged to be close to 1. Details and relative weights can be found in the supplementary materials.
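A small sketch of the per-part prior described above: pose and shape coefficients are pulled toward zero and scale factors toward one. The squared-penalty form and the weight values are assumptions; the actual weights are listed in the paper's supplementary materials.

```python
import numpy as np

def part_prior(pose, shape, scales, w_pose=1.0, w_shape=1.0, w_scale=1.0):
    """Zero-mean normal priors on pose and shape; scale factors encouraged toward 1."""
    return (w_pose * np.sum(pose ** 2)
            + w_shape * np.sum(shape ** 2)
            + w_scale * np.sum((scales - 1.0) ** 2))


if __name__ == "__main__":
    pose = 0.05 * np.ones(62 * 3)     # small joint-angle deviations from the rest pose
    shape = 0.1 * np.ones(30)         # small shape coefficients
    scales = np.array([1.02, 0.98, 1.00])
    print(part_prior(pose, shape, scales))
```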

² We do not consider some parts (around the hands and face), as the depth sensor resolution is too low to improve the estimate. These parts are defined as a mask.


Figure 4: Regressing detection target positions. (Left) The template model is aligned with the target object. (Middle) The torso joints of the template model (magenta) show a discrepancy with the joint definitions of the 3D keypoint detections (cyan). (Right) The newly regressed target locations (green) are more consistent with the 3D keypoint detections.

4.3. Optimization Procedure


The complete model is highly nonlinear, and due to the limited degrees of freedom of the skeletal joints, the optimization can get stuck in bad local minima. Therefore, instead of optimizing the complete model initially, we fit the model in phases, starting with a subset of measurements and strong priors that are relaxed as optimization progresses.


Model fitting is performed on each frame independently. To initialize the overall translation and rotation, we use four keypoints on the torso (the left and right shoulders and hips), without the ICP term and with a strong weight on the priors. Once the torso parts are approximately aligned, we use all available keypoints of all body parts, with a small weight on the priors. The results at this stage already provide reasonable motion capture but do not accurately capture the shape (i.e., silhouette) of the subject. Finally, the entire optimization is performed, including the ICP term to find correspondences with the 3D point cloud. We run the final optimization two times, finding new correspondences each time. For the optimization we use Levenberg-Marquardt with the Ceres Solver library [1].
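The staged schedule above can be mimicked on a toy problem. The sketch below fits a 2D rigid alignment in two phases with scipy's Levenberg-Marquardt solver standing in for Ceres: first a torso-only subset under a strong prior, then all keypoints under a weak prior. The toy problem, the subset choice, and the weights are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import least_squares

def fit_rigid_2d(model_pts, target_pts, torso_idx, all_idx):
    """Two-phase fit of (tx, ty, angle): torso subset + strong prior, then all points."""

    def residuals(params, idx, prior_weight):
        tx, ty, ang = params
        R = np.array([[np.cos(ang), -np.sin(ang)],
                      [np.sin(ang),  np.cos(ang)]])
        pred = model_pts[idx] @ R.T + np.array([tx, ty])
        data_res = (pred - target_pts[idx]).ravel()
        prior_res = prior_weight * params             # pull parameters toward zero
        return np.concatenate([data_res, prior_res])

    x0 = np.zeros(3)
    # Phase 1: coarse global alignment from the torso keypoints, strong prior.
    x1 = least_squares(residuals, x0, args=(torso_idx, 1.0), method="lm").x
    # Phase 2: refinement with all keypoints, weak prior.
    return least_squares(residuals, x1, args=(all_idx, 0.01), method="lm").x


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    model = rng.normal(size=(10, 2))
    ang = 0.4
    R = np.array([[np.cos(ang), -np.sin(ang)], [np.sin(ang), np.cos(ang)]])
    target = model @ R.T + np.array([0.3, -0.2]) + 0.01 * rng.normal(size=(10, 2))
    torso = np.arange(4)                              # stand-in for shoulders and hips
    print(fit_rigid_2d(model, target, torso, np.arange(10)))   # ~ [0.3, -0.2, 0.4]
```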

5. Creating Adam


We derive a new model, which we call Adam, enabling total body motion capture with a simpler parameterization than the part-based Frankenstein model. In particular, this new model has a single joint hierarchy and a common parameterization for all shape degrees of freedom, tying together the face, hand, and body shapes and avoiding the need for seam constraints. To build the model, it is necessary to reconstruct the shape and motion of all body parts (face, body, and hands) from diverse subjects, from which the model can learn the variations. To do this, we leverage our Frankenstein model and apply it to a dataset of 70 subjects, each performing a short range-of-motion sequence in a multiview camera system. We select 5 frames for each person in different poses and use the reconstruction results to build Adam. From this data, both the joint locations and the linear shape blendshapes are learned. Because we derive the model from clothed people, the blendshapes also explain some of the variation due to clothing.

5.1. Regressing Detection Targets


There exists a discrepancy between the joint locations of the body model (the SMPL model, in our case) and the locations of the keypoint detections (i.e., a model joint vs. a detection joint), as shown in Fig. 4. This mainly affects the shoulder and hip joints, which are hard to annotate precisely. This difference has the effect of pulling the Frankenstein model towards a bad fit even while achieving a low keypoint cost E_keypoints. We alleviate this problem by computing the relative location of the 3D detections with respect to the fitted mesh vertices, leveraging the reconstructed data of the 70 subjects. This allows us to define new targets for the keypoint detection cost that, on average, better match the location of the 3D detections with respect to the mesh model, as shown in Fig. 4. In particular, given the fitting results for the 70 identities, we approximate each target 3D keypoint location as a function of the final fitted mesh vertices, following the procedure of [33] to find a sparse linear combination of vertices that approximates the position of the target 3D keypoint. Note that we do not change the joint locations used in the skeleton hierarchy during LBS deformation, only the regression matrices J in Eq. (12).
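To illustrate the regression of detection targets, the sketch below finds, for one keypoint, non-negative vertex weights that sum to one and reproduce the detected 3D position across the fitted training meshes. Non-negative least squares restricted to a small vertex neighborhood is used here as a crude stand-in for the sparse regression of [33]; the function names and the neighborhood heuristic are assumptions.

```python
import numpy as np
from scipy.optimize import nnls

def regress_keypoint_weights(fitted_verts, detections, n_candidates=20):
    """Find non-negative vertex weights w with sum(w) = 1 so that w @ V^(s) ~ y^(s).

    fitted_verts : (S, N, 3) fitted mesh vertices for S training frames/subjects.
    detections   : (S, 3) corresponding triangulated 3D keypoints.
    """
    S, N, _ = fitted_verts.shape
    # Candidate vertices: those closest to the detection on average (crude sparsity).
    mean_d = np.mean(np.linalg.norm(fitted_verts - detections[:, None, :], axis=2), axis=0)
    cand = np.argsort(mean_d)[:n_candidates]
    # Stack one linear system over all subjects and coordinates.
    A = fitted_verts[:, cand, :].transpose(0, 2, 1).reshape(S * 3, len(cand))
    b = detections.reshape(S * 3)
    # Softly enforce sum(w) = 1 with a heavily weighted extra row.
    A = np.vstack([A, 100.0 * np.ones((1, len(cand)))])
    b = np.concatenate([b, [100.0]])
    w_cand, _ = nnls(A, b)
    w = np.zeros(N)
    w[cand] = w_cand
    return w   # a new row of the regression matrix J in Eq. (12)


if __name__ == "__main__":
    rng = np.random.default_rng(3)
    S, N = 70, 200
    verts = rng.normal(size=(S, N, 3))
    true_w = np.zeros(N); true_w[[5, 17, 42]] = [0.5, 0.3, 0.2]
    dets = np.einsum("n,snk->sk", true_w, verts)
    w = regress_keypoint_weights(verts, dets)
    print(np.nonzero(w > 1e-3)[0], np.round(w[w > 1e-3], 2))
```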

5.2. Fitting Clothes and Hair


The SMPL model captures the shape variability of human bodies, but does not account for clothing or hair. Similarly, the FaceWarehouse template mesh was not designed to model hair. However, for the natural interactions that we are most interested in capturing, people wear everyday clothing and sport typical hairstyles. To learn a new set of linear blendshapes that better capture the rough geometry of clothed people and jointly model the face, we need to reconstruct accurate geometry from the source data. For this purpose, we reconstruct the geometry that lies outside the shape space of the part models, starting from the Frankenstein fitting results on the 70 subjects.

For each vertex v_i in the Frankenstein model, we write

ṽ_i = v_i + Δ_i n_i,

where Δ_i is a scalar displacement meant to compensate for the discrepancy between the Frankenstein model vertices and the 3D point cloud, along the normal direction n_i at each vertex. We pose the problem as a linear system,

[ W N ; L ] Δ = [ W (P − V^U) ; 0 ],

where Δ ∈ R^(N^U) contains the stacked per-vertex displacements, V^U are the vertices of the Frankenstein model, P are the corresponding point cloud points, N contains the mesh vertex normals, and L ∈ R^(N^U × N^U) is the Laplace-Beltrami operator, which regularizes the deformation. We also use a weight matrix W to avoid large deformations where the 3D point cloud has lower resolution than the original mesh, such as the details of the face and hands.
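As a rough illustration of the displacement solve, the sketch below estimates per-vertex scalar offsets along the normals on a toy chain "mesh", with a graph Laplacian standing in for the Laplace-Beltrami operator and a diagonal weight matrix W. The chain topology, the weights, and the regularization strength are illustrative assumptions.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

def fit_normal_displacements(verts, normals, targets, adjacency, weights, lam=1.0):
    """Solve min_d ||W (V + N d - P)||^2 + ||lam * L d||^2 for one scalar d per vertex."""
    n = len(verts)
    # Graph Laplacian as a stand-in for the Laplace-Beltrami operator L.
    A = sp.csr_matrix(adjacency, dtype=float)
    L = sp.diags(np.asarray(A.sum(axis=1)).ravel()) - A
    # N maps the scalar displacements to stacked 3D offsets along the vertex normals.
    rows = np.arange(3 * n)
    cols = np.repeat(np.arange(n), 3)
    Nmat = sp.csr_matrix((normals.ravel(), (rows, cols)), shape=(3 * n, n))
    W3 = sp.diags(np.repeat(weights, 3))
    # Stacked least-squares system  [W N; lam L] d = [W (P - V); 0].
    M = sp.vstack([W3 @ Nmat, lam * L]).tocsr()
    b = np.concatenate([np.repeat(weights, 3) * (targets - verts).ravel(), np.zeros(n)])
    return lsqr(M, b)[0]


if __name__ == "__main__":
    n = 50
    verts = np.column_stack([np.linspace(0.0, 1.0, n), np.zeros(n), np.zeros(n)])
    normals = np.tile([0.0, 0.0, 1.0], (n, 1))
    bump = 0.1 * np.exp(-((np.linspace(0.0, 1.0, n) - 0.5) ** 2) / 0.02)
    targets = verts + normals * bump[:, None]          # "point cloud" with a bump in z
    adjacency = np.zeros((n, n))
    for i in range(n - 1):                             # chain topology as a toy mesh
        adjacency[i, i + 1] = adjacency[i + 1, i] = 1.0
    d = fit_normal_displacements(verts, normals, targets, adjacency, np.ones(n), lam=0.5)
    print(np.round(d[22:28], 3))                       # recovered, slightly smoothed bump
```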

5.3. Building the Shape Deformation Space


After the displacement fitting, we warp each frame's surface to the rest pose by applying the inverse of the LBS transform. With the fitted surfaces warped to this canonical pose, we perform PCA to build a joint linear shape space that captures shape variations across the entire body. As in Section 3.3, we separate the expression basis for the face and retain the expression basis from the FaceWarehouse model, as our MVS point clouds are of too low resolution to fit facial expressions.
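A minimal sketch of building the joint shape basis by PCA over the unposed (rest-pose) vertex sets; the number of components, the scaling of the basis vectors, and the toy data are assumptions.

```python
import numpy as np

def build_shape_basis(rest_pose_verts, num_coeffs=20):
    """PCA over unposed meshes: returns the mean shape and a blendshape basis.

    rest_pose_verts : (S, N, 3) fitted surfaces warped back to the rest pose.
    Returns mean (N, 3) and basis (num_coeffs, N, 3) so that a shape is
    approximately  mean + sum_k phi_k * basis[k].
    """
    S, N, _ = rest_pose_verts.shape
    X = rest_pose_verts.reshape(S, N * 3)
    mean = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    # Scale components by their singular values so coefficients are roughly unit normal.
    basis = (s[:num_coeffs, None] * Vt[:num_coeffs]) / np.sqrt(S)
    return mean.reshape(N, 3), basis.reshape(num_coeffs, N, 3)


if __name__ == "__main__":
    rng = np.random.default_rng(4)
    S, N = 350, 500                      # e.g. 70 subjects x 5 frames
    data = rng.normal(size=(S, N, 3))
    mean, basis = build_shape_basis(data, num_coeffs=5)
    print(mean.shape, basis.shape)
```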

This model can now express shape variation for all parts, including the body, hands, and face, and it also includes the deformation of hair and clothing. That is, this single shape space substitutes for the separate shape parameters of the body (φ^B), the face (φ^F), and the hands (φ^LH, φ^RH). Adam is parameterized, like the Frankenstein model, by a pose θ^T, shape coefficients φ^T, and a global translation t^T. As in SMPL, the vertices of the template mesh are first displaced by a set of blendshapes in the rest pose,

v̂_i^T = v_{0,i}^T + Σ_{k=1}^{K^T} s_i^k φ_k^T,

where s_i^k is the i-th vertex of the k-th blendshape, φ_k^T is the k-th shape coefficient of φ^T ∈ R^(K^T), K^T is the number of identity coefficients, and v_{0,i}^T is the i-th vertex of the mean shape. However, these blendshapes now capture variation across the face, hands, and body. The vertices are then posed using LBS as in Eq. (6). We define the joints and weights for LBS following the part models, as further explained in the supplementary material.
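For concreteness, a small sketch of the rest-pose blendshape displacement in the equation above; LBS posing (Eq. (6)) would then be applied to the resulting vertices. The array shapes and names are assumptions.

```python
import numpy as np

def adam_rest_pose_vertices(mean_verts, shape_basis, phi):
    """v_hat_i = v0_i + sum_k phi_k * s_i^k, applied to all vertices at once.

    mean_verts  : (N, 3) mean shape v0.
    shape_basis : (K, N, 3) identity blendshapes covering face, hands, and body.
    phi         : (K,) shape coefficients phi^T.
    """
    return mean_verts + np.tensordot(phi, shape_basis, axes=1)


if __name__ == "__main__":
    rng = np.random.default_rng(5)
    N, K = 1000, 30
    v0 = rng.normal(size=(N, 3))
    basis = 0.01 * rng.normal(size=(K, N, 3))
    phi = rng.normal(size=K)
    print(adam_rest_pose_vertices(v0, basis, phi).shape)   # (N, 3), ready for LBS posing
```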

5.4. Tracking with Adam


The cost function used to capture total body motion with the Adam model is similar to Eq. (11), but without the seam term:

E(θ^T, φ^T, t^T) = E_keypoints + E_icp + E_prior.    (18)

However, Adam is much easier to use than Frankenstein, because it has only a single set of shape and pose parameters for all parts. Conceptually, it is based on the SMPL model parameterization, but with additional joints for the hands and facial expression blendshapes.

Figure 5: (Top) Visualization of silhouettes from different methods against the ground truth. The ground truth is drawn on the red channel and the rendered silhouette mask from each model is drawn on the green channel, so correctly overlapping regions appear yellow. (Bottom) Silhouette accuracy compared to the ground-truth silhouette.

Table 1: Accuracy of Silhouettes from different models



Optical Flow Propagation: While fitting each frame independently has benefits (it does not suffer from error accumulation, and frames can be fit in parallel), it typically produces jittery motion. To reduce this jitter, we use optical flow to propagate the initial per-frame fit to neighboring frames and find a smoother solution. More concretely, given the fitting result at frame t, we propagate this mesh to frames t−1 and t+1 using the optical flow at each vertex, which is triangulated into 3D using the method of [27]. Therefore, each vertex has at most three candidate positions: the original mesh vertex, and the forward- and backward-propagated vertices (subject to a forward-backward consistency check). Given these propagated meshes, we reoptimize the model parameters by using all propagated mesh vertices as additional keypoints to find a compromise mesh. We run this process multiple times (3, in our case) to further reduce jitter and fill in frames with missing detections.
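A rough sketch of the per-vertex candidate selection used in the propagation step: flow-propagated positions from the neighboring frames are kept only if their forward-backward round-trip error is small, and the accepted candidates then serve as extra keypoint targets for the reoptimization. The threshold, the precomputed round-trip errors, and the simple averaging shown at the end (in place of the actual parameter reoptimization) are illustrative assumptions.

```python
import numpy as np

def select_candidates(v_fit, v_from_prev, err_prev, v_from_next, err_next, thresh=0.01):
    """Per-vertex candidate positions for the reoptimization step.

    v_fit       : (N, 3) vertices fitted at the current frame.
    v_from_prev : (N, 3) vertices propagated from the previous frame by optical flow.
    err_prev    : (N,) forward-backward round-trip error of that propagation.
    v_from_next, err_next : analogous for the next frame.
    """
    keep_prev = err_prev < thresh
    keep_next = err_next < thresh
    candidates = []
    for i in range(len(v_fit)):
        cands = [v_fit[i]]
        if keep_prev[i]:
            cands.append(v_from_prev[i])
        if keep_next[i]:
            cands.append(v_from_next[i])
        candidates.append(np.stack(cands))
    return candidates


if __name__ == "__main__":
    rng = np.random.default_rng(6)
    v_fit = rng.normal(size=(4, 3))
    v_prev = v_fit + 0.005
    v_next = v_fit - 0.005
    err_prev = np.full(4, 0.002)                      # consistent propagation, accepted
    err_next = np.array([0.002, 0.05, 0.002, 0.05])   # two vertices fail the check
    cands = select_candidates(v_fit, v_prev, err_prev, v_next, err_next)
    # The paper reoptimizes the model against these extra targets; as a stand-in,
    # print the mean of the accepted candidates per vertex.
    print([np.round(c.mean(axis=0), 3) for c in cands])
```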

6. Results


We perform total motion capture using our two models, Frankenstein and Adam, on various challenging sequences.


image

Figure 6: Total body reconstruction results on various human body motions. For each example scene, the fitting results from three different models are shown in different colors (pink for SMPL [33], silver for Frankenstein, and gold for Adam).

For the experiments, we use datasets captured in the CMU Panoptic Studio [26]. We use 140 VGA cameras to reconstruct the 3D body keypoints, 480 VGA cameras for the feet, and 31 HD cameras for the face and hand keypoints and the 3D point clouds. We compare the fits produced by our models with the body-only SMPL model [33].

6.1. Quantitative Evaluation


We evaluate how well each model can match a moving person by measuring the overlap with the ground-truth silhouette across 5 different viewpoints for a 10-second range-of-motion sequence. To obtain the ground-truth silhouette, we run a background subtraction algorithm using a Gaussian model for the background of each pixel, with post-processing by morphological transforms to remove noise. As the evaluation metric, we compute the percentage of overlap relative to the union of the ground-truth silhouettes and the rendered foreground masks after fitting each model. Here, we compare the fitting results of 3 different models: SMPL, our Frankenstein model, and our Adam model. Example results are shown in Figure 5, with quantitative results in Fig. 5 and Table 1. We first compare the accuracy of the SMPL and Frankenstein models using only 3D keypoints as measurement cues. The major source of improvement of Frankenstein over SMPL is the articulated hand model (by construction, the body is almost identical), as seen in Fig. 5 (a). Including the ICP term as an additional cue provides better accuracy. Finally, in the comparison between our two models, the two show similar performance. Ideally we would expect Adam to outperform Frankenstein because it has more expressive power for hair and clothing, and it does show better performance for certain body shapes (frames 50-75 in Fig. 5). However, Adam sometimes produces artifacts that lower accuracy: it tends to generate thinner legs, mainly due to poor 3D point cloud reconstructions in the source data on which Adam is trained. Still, Adam is simpler for the purpose of total body motion capture and has the potential to improve once a larger-scale dataset is available with a more optimized capture setup.
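A minimal sketch of the silhouette accuracy metric described above: the ratio of intersection to union between the ground-truth silhouette and a rendered model mask. The rendering and background subtraction steps are outside the scope of the snippet.

```python
import numpy as np

def silhouette_iou(gt_mask, rendered_mask):
    """Percentage of overlap: intersection over union for two boolean masks."""
    gt = gt_mask.astype(bool)
    rd = rendered_mask.astype(bool)
    union = np.logical_or(gt, rd).sum()
    if union == 0:
        return 1.0
    return np.logical_and(gt, rd).sum() / union


if __name__ == "__main__":
    gt = np.zeros((240, 320), dtype=bool); gt[60:180, 100:220] = True
    rd = np.zeros((240, 320), dtype=bool); rd[70:190, 110:230] = True
    print(f"overlap = {100 * silhouette_iou(gt, rd):.1f}%")
```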

6.2. Qualitative Results


We run our method on sequences where face and hand motions emerge naturally together with body motions. The sequences include the short range-of-motion captures of the 70 people used to build Adam, social interactions among multiple people, a furniture-building sequence with dexterous hand motions, musical performances such as cello and guitar, and commonly observed daily motions such as keyboard typing.

Most of these sequences have rarely been demonstrated in previous markerless motion capture methods, since capturing the subtle details is key to achieving this goal. Example results are shown in Figure 6. Here, we also qualitatively compare our models (silver for Frankenstein, gold for Adam) with the SMPL model (pink) [33]. It should be noted that the total body motion capture results based on our models produce much better realism for the scene by capturing the subtle details of the hands and faces. Our results are best seen in the accompanying videos.

7. Discussion


We present the first markerless method to capture total body motion, including facial expression, coarse body motion from the torso and limbs, and hand gestures, at a distance. To achieve this, we present two types of models that can express motion in each of these parts. Our reconstructions are compelling and realistic, even when using only sparse 3D keypoint detections to drive the models.

References


[1] S. Agarwal, K. Mierle, and others. Ceres solver. http://ceres-solver.org.

[2] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis. SCAPE: shape completion and animation of people. In TOG, 2005.

[3] A. Baak, M. Müller, G. Bharaj, H.-P. Seidel, and C. Theobalt. A data-driven approach for real-time full body pose reconstruction from a depth camera. In Consumer Depth Cameras for Computer Vision. Springer, 2013.

[4] L. Ballan, A. Taneja, J. Gall, L. Van Gool, and M. Pollefeys. Motion capture of hands in action using discriminative salient points. In ECCV, 2012.

[5] T. Beeler, B. Bickel, P. Beardsley, B. Sumner, and M. Gross. High-quality single-shot capture of facial geometry. In TOG, 2010.

[6] T. Beeler, F. Hahn, D. Bradley, B. Bickel, P. Beardsley, C. Gotsman, R. Sumner, and M. Gross. High-quality passive facial performance capture using anchor frames. In TOG, 2011.

[7] R. Birdwhistell. Kinesics and context: Essays on body motion communication. In University of Pennsylvania Press, Philadelphia, 1970.

[8] F. Bogo, A. Kanazawa, C. Lassner, P. V. Gehler, J. Romero, and M. J. Black. Keep it SMPL: automatic estimation of 3D human pose and shape from a single image. In CoRR, 2016.

[9] D. Bradley, W. Heidrich, T. Popa, and A. Sheffer. High resolution passive facial performance capture. In TOG, 2010.

[10] C. Bregler, J. Malik, and K. Pullen. Twist based acquisition and tracking of animal and human kinematics. In IJCV, 2004.

[11] T. Brox, B. Rosenhahn, J. Gall, and D. Cremers. Combined region and motion-based 3D tracking of rigid and articulated objects. In TPAMI, 2010.

[12] C. Cao, D. Bradley, K. Zhou, and T. Beeler. Real-time high-fidelity facial performance capture. In TOG, 2015.

[13] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou. FaceWarehouse: A 3D facial expression database for visual computing. In TVCG, 2014.

[14] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.

[15] K. M. Cheung, S. Baker, and T. Kanade. Shape-from-silhouette across time part I: Theory and algorithms. In IJCV, 2005.

[16] S. Corazza, L. Mündermann, E. Gambaretto, G. Ferrigno, and T. P. Andriacchi. Markerless motion capture through visual hull, articulated ICP and subject specific model generation. In IJCV, 2010.

[17] E. de Aguiar, C. Stoll, C. Theobalt, N. Ahmed, H.-P. Seidel, and S. Thrun. Performance capture from sparse multi-view video. In SIGGRAPH, 2008.

[18] F. De la Torre, W.-S. Chu, X. Xiong, F. Vicente, X. Ding, and J. F. Cohn. IntraFace. In FG, 2015.

[19] A. Elhayek, E. Aguiar, A. Jain, J. Tompson, L. Pishchulin, M. Andriluka, C. Bregler, B. Schiele, and C. Theobalt. Efficient convnet-based marker-less motion capture in general scenes with a low number of cameras. In CVPR, 2015.

[20] Y. Furukawa and J. Ponce. Dense 3D motion capture from synchronized video streams. In CVPR, 2008.

[21] J. Gall, C. Stoll, E. De Aguiar, C. Theobalt, B. Rosenhahn, and H.-P. Seidel. Motion capture using joint skeleton tracking and surface estimation. In CVPR, 2009.

[22] P. Garrido, L. Valgaerts, C. Wu, and C. Theobalt. Reconstructing detailed dynamic face geometry from monocular video. In TOG, 2013.

[23] D. Gavrila and L. Davis. Tracking of humans in action: A 3-D model-based approach. In ARPA Image Understanding Workshop, 1996.

[24] A. Ghosh, G. Fyffe, B. Tunwattanapong, J. Busch, X. Yu, and P. Debevec. Multiview face capture using polarized spherical gradient illumination. In TOG, 2011.

[25] G. Hidalgo, Z. Cao, T. Simon, S.-E. Wei, H. Joo, and Y. Sheikh. OpenPose. https://github.com/CMU-Perceptual-Computing-Lab/openpose.

[26] H. Joo, H. Liu, L. Tan, L. Gui, B. Nabbe, I. Matthews, T. Kanade, S. Nobuhara, and Y. Sheikh. Panoptic Studio: A massively multiview system for social motion capture. In ICCV, 2015.

[27] H. Joo, H. S. Park, and Y. Sheikh. Map visibility estimation for large-scale dynamic 3D reconstruction. In CVPR, 2014.

[28] H. Joo, T. Simon, X. Li, H. Liu, L. Tan, L. Gui, S. Banerjee, T. Godisart, B. Nabbe, I. Matthews, et al. Panoptic Studio: A massively multiview system for social interaction capture. In TPAMI, 2017.

[29] R. Kehl and L. V. Gool. Markerless tracking of complex human motions from multiple views. In CVIU, 2006.

[30] C. Keskin, F. Kıraç, Y. E. Kara, and L. Akarun. Hand pose estimation and hand shape classification using multi-layered randomized decision forests. In ECCV, 2012.

[31] H. Li, J. Yu, Y. Ye, and C. Bregler. Realtime facial animation with on-the-fly correctives. In TOG, 2013.

[32] Y. Liu, J. Gall, C. Stoll, Q. Dai, H.-P. Seidel, and C. Theobalt. Markerless motion capture of multiple characters using multiview image segmentation. In TPAMI, 2013.

[33] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. In TOG, 2015.

[34] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H. Seidel, W. Xu, D. Casas, and C. Theobalt. VNect: Real-time 3D human pose estimation with a single RGB camera. In TOG, 2017.

[35] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.

[36] I. Oikonomidis, N. Kyriazis, and A. A. Argyros. Tracking the articulated motion of two strongly interacting hands. In CVPR, 2012.

[37] G. Pons-Moll, J. Romero, N. Mahmood, and M. J. Black. Dyna: A model of dynamic human shape in motion. In TOG, 2015.

[38] J. Romero, D. Tzionas, and M. J. Black. Embodied hands: Modeling and capturing hands and bodies together. In TOG, 2017.

[39] T. Sharp, C. Keskin, D. Robertson, J. Taylor, J. Shotton, D. Kim, C. Rhemann, I. Leichter, A. Vinnikov, Y. Wei, et al. Accurate, robust, and flexible real-time hand tracking. In CHI, 2015.

[40] J. Shotton, A. Fitzgibbon, M. Cook, and T. Sharp. Real-time human pose recognition in parts from single depth images. In CVPR, 2011.

[41] T. Simon, H. Joo, I. Matthews, and Y. Sheikh. Hand keypoint detection in single images using multiview bootstrapping. In CVPR, 2017.

[42] S. Sridhar, F. Mueller, A. Oulasvirta, and C. Theobalt. Fast and robust hand tracking using detection-guided optimization. In CVPR, 2015.

[43] S. Sridhar, A. Oulasvirta, and C. Theobalt. Interactive markerless articulated hand motion tracking using RGB and depth data. In ICCV, 2013.

[44] C. Stoll, N. Hasler, J. Gall, H.-P. Seidel, and C. Theobalt. Fast articulated motion tracking using a sums of Gaussians body model. In ICCV, 2011.

[45] X. Sun, Y. Wei, S. Liang, X. Tang, and J. Sun. Cascaded hand pose regression. In CVPR, 2015.

[46] D. Tang, H. Jin Chang, A. Tejani, and T.-K. Kim. Latent regression forest: Structured estimation of 3D articulated hand posture. In CVPR, 2014.

[47] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner. Face2Face: Real-time face capture and reenactment of RGB videos. In CVPR, 2016.

[48] D. Tome, C. Russell, and L. Agapito. Lifting from the deep: Convolutional 3D pose estimation from a single image. In CVPR, 2017.

[49] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS, 2014.

[50] D. Tzionas, L. Ballan, A. Srikantha, P. Aponte, M. Pollefeys, and J. Gall. Capturing hands in action using discriminative salient points and physics simulation. In IJCV, 2016.

[51] L. Valgaerts, C. Wu, A. Bruhn, H.-P. Seidel, and C. Theobalt. Lightweight binocular facial performance capture under uncontrolled lighting. In TOG, 2012.

[52] D. Vlasic, I. Baran, W. Matusik, and J. Popović. Articulated mesh animation from multi-view silhouettes. In TOG, 2008.

[53] C. Wan, A. Yao, and L. Van Gool. Direction matters: hand pose estimation from local surface normals. In arXiv preprint arXiv:1604.02657, 2016.

[54] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.

[55] H. Woltring. New possibilities for human motion studies by real-time light spot position measurement. In Biotelemetry, 1973.

[56] C. Wu, D. Bradley, M. Gross, and T. Beeler. An anatomically-constrained local deformation model for monocular face capture. In TOG, 2016.

[57] C. Xu and L. Cheng. Efficient hand pose estimation from a single depth image. In ICCV, 2013.

[58] Q. Ye, S. Yuan, and T.-K. Kim. Spatial attention deep net with partial PSO for hierarchical hybrid hand pose estimation. In ECCV, 2016.

[59] W. Zhao, J. Chai, and Y.-Q. Xu. Combining marker-based mocap and RGB-D camera for acquiring high-fidelity hand motion data. In ACM SIGGRAPH/Eurographics Symposium on Computer Animation, 2012.

[60] X. Zhou, S. Leonardos, X. Hu, and K. Daniilidis. 3D shape estimation from 2D landmarks: A convex relaxation approach. In CVPR, 2015.

[61] X. Zhou, X. Sun, W. Zhang, S. Liang, and Y. Wei. Deep kinematic pose regression. In ECCV Workshop on Geometry Meets Deep Learning, 2016.

Article source: http://tongtianta.site/paper/1129
Editor: Lornatang
Proofreader: Lornatang
