Total Capture: A 3D Deformation Model for Tracking Faces, Hands, and Bodies
Paper: http://arxiv.org/pdf/1801.01615v1.pdf
Abstract
We present a unified deformation model for the markerless capture of multiple scales of human movement, including facial expressions, body motion, and hand gestures. An initial model is generated by locally stitching together models of the individual parts of the human body, which we refer to as the “Frankenstein” model. This model enables the full expression of part movements, including face and hands by a single seamless model. Using a large-scale capture of people wearing everyday clothes, we optimize the Frankenstein model to create “Adam”. Adam is a calibrated model that shares the same skeleton hierarchy as the initial model but can express hair and clothing geometry, making it directly usable for fitting people as they normally appear in everyday life. Finally, we demonstrate the use of these models for total motion tracking, simultaneously capturing the large-scale body movements and the subtle face and hand motion of a social group of people.
1. Introduction
Social communication is a key function of human motion [7]. We communicate tremendous amounts of information with the subtlest movements. Between a group of interacting individuals, gestures such as a gentle shrug of the shoulders, a quick turn of the head, or an uneasy shifting of weight from foot to foot, all transmit critical information about the attention, emotion, and intention to observers. Notably, these social signals are usually transmitted by the organized motion of the whole body: with facial expressions, hand gestures, and body posture. These rich signals layer upon goal-directed activity in constructing the behavior of humans, and are therefore crucial for the machine perception of human activity.
∗Website: http://www.cs.cmu.edu/~hanbyulj/totalcapture
However, there are no existing systems that can track, without markers, the human body, face, and hands simultaneously. Current markerless motion capture systems focus on a particular scale or a particular part. Each area has its own preferred capture configuration: (1) torso and limb motions are captured in a sufficiently large working volume where people can freely move [17, 21, 44, 19]; (2) facial motion is captured at close range, mostly frontal, and assuming little global head motion [5, 24, 6, 9, 51]; (3) finger motion is also captured at very close distances from hands, where the hand regions are dominant in the sensor measurements [36, 49, 42, 50]. These configurations make it difficult to analyze these gestures in the context of social communication.
In this paper, we present a novel approach to capture the motion of the principal body parts for multiple interacting people (see Fig. 1). The fundamental difficulty of such capture is caused by the scale differences of each part. For example, the torso and limbs are relatively large and necessitate coverage over a sufficiently large working volume, while fingers and faces, due to their smaller feature size, require close distance capture with high resolution and frontal imaging. With off-the-shelf cameras, the resolution for face and hand parts will be limited in a room-scale, multi-person capture setup.
To overcome this sensing challenge, we use two general approaches: (1) we leverage keypoint detection (e.g., faces [18], bodies [54, 14, 35], and hands [41]) in multiple views to obtain 3D keypoints, which is robust to multiple people and object interactions; (2) to compensate for the limited sensor resolution, we present a novel generative body deformation model, which has the ability to express the motion of each of the principal body parts. In particular, we describe a procedure to build an initial body model, named “Frankenstein”, by seamlessly consolidating available part template models [33, 13] into a single skeleton hierarchy. We optimize this initialization using a capture of 70 people, and learn a new deformation model, named “Adam”, capable of additionally capturing variations in hair and clothing, with a simplified parameterization. We present a method to capture the total body motion of multiple people with the 3D deformable model. Finally, we demonstrate the performance of our method on various sequences of social behavior and person-object interactions, where the combination of face, limb, and finger motion emerges naturally.
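To make the first of the two approaches above concrete, the sketch below shows a minimal direct linear transform (DLT) triangulation of one detected keypoint across calibrated views; the function name and the plain least-squares formulation are our own assumptions for illustration, not the authors' implementation (which would additionally weight detections by confidence and handle outliers).

```python
import numpy as np

def triangulate_keypoint(projection_matrices, points_2d):
    """Linearly triangulate one keypoint observed in several calibrated views.

    projection_matrices: list of 3x4 camera matrices P = K [R | t].
    points_2d: list of (x, y) detections of the same keypoint, one per view.
    Returns the 3D point minimizing the algebraic reprojection error.
    """
    rows = []
    for P, (x, y) in zip(projection_matrices, points_2d):
        # Each view gives two linear constraints on the homogeneous point X:
        #   x * (P[2] @ X) - (P[0] @ X) = 0
        #   y * (P[2] @ X) - (P[1] @ X) = 0
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.stack(rows)
    # Solution: right singular vector of A with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```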
2. Related Work
Motion capture systems based on tracking retroreflective markers [55] are the most widely used motion capture technology due to their high accuracy. Markerless motion capture methods [23, 17, 21, 44] have been explored over the past two decades to achieve the same goal without markers, but they tend to implicitly admit that their performance is inferior by treating the output of marker-based methods as a ground truth or an upper bound. However, over the last few years, we have witnessed great advances in keypoint detection from images (e.g., faces [18], bodies [54, 14, 35], and hands [41]), which can provide reliable anatomical landmark measurements for markerless motion capture methods [19, 28, 41], while the performance of marker-based methods has remained relatively unchanged, with major disadvantages including: (1) the need for sparse marker placement for reliable tracking, which limits the spatial resolution of motion measurements, and (2) the inability to automatically handle occluded markers, which requires expensive manual clean-up. In particular, capturing high-fidelity hand motion is still challenging for marker-based motion capture systems due to the severe self-occlusions of hands [59], whereas learning-based detectors implicitly handle occlusions by inferring the occluded parts, with uncertainty, using a prior learnt from a large-scale dataset [41]. Our method shows that the markerless motion capture approach potentially begins to outperform the marker-based counterpart by leveraging learning-based image measurements. As evidence, we demonstrate total body motion capture, which has not been demonstrated by existing marker-based methods. In this section, we review the markerless motion capture approaches most relevant to our method.
Markerless motion capture largely focuses on the motion of the torso and limbs. The standard pipeline is based on a multiview camera setup and tracking with a 3D template model [32, 23, 15, 10, 29, 16, 52, 11, 44, 17, 20, 19]. In this approach, motion capture is performed by aligning the 3D template model to the measurements, which distinguish the various approaches and may include color, texture, silhouettes, point clouds, and landmarks. A parallel track of related work therefore focuses on capturing and improving body models for tracking, for which a highly controlled multiview capture system—specialized for single-person capture—is used to build precise models. With the introduction of commodity depth sensors, single-view depth-based body motion capture became a popular direction [3, 40]. A recent collection of approaches aims to reconstruct 3D skeletons directly from monocular images, either by fitting 2D keypoint detections with a prior on human pose [60, 8] or by moving even closer to direct regression methods [61, 34, 48].
Facial scanning and performance capture have advanced greatly over the last decade. There exist multiview-based methods showing excellent performance in high-quality facial scanning [5, 24] and facial motion capture [6, 9, 51]. Recently, lightweight systems based on a single camera have shown compelling performance by leveraging a morphable 3D face model on 2D measurements [22, 18, 31, 47, 13, 12, 56]. Hand motion capture is mostly led by methods based on a single depth sensor [36, 46, 49, 30, 57, 45, 53, 43, 39, 42, 50, 58], with a few exceptions based on multi-view systems [4, 43, 38]. In this work, we take the latter approach and use the method of [41], which introduced a hand keypoint detector for RGB images that is directly applicable in multiview systems to reconstruct 3D hand joints.
As a way to reduce the parameter space and overcome the complexity of the problems, generative 3D template models have been proposed in each field, for example the methods of [2, 33, 37] in body motion capture, the method of [13] for facial motion capture, and very recently, the combined body+hands model of Romero et al. [38]. A generative model with expressive power for total body motion has not been introduced.
Figure 2: Part models and a unified Frankenstein model. (a) The body model [33]; (b) the face model [13]; and (c) a hand rig, where red dots have corresponding 3D keypoints reconstructed from detectors in (a-c). (d) Face and hand models (gray meshes) aligned to the body model (the blue wireframe mesh); and (e) the seamless Frankenstein model.
3. Frankenstein Model
The motivation for building the Frankenstein body model is to leverage existing part models—SMPL [33] for the body, FaceWarehouse [13] for the face, and an artist-defined hand rig—each of which captures shape and motion details at an appropriate scale for the corresponding part. This choice is not driven merely by the free availability of the component models: note that due to the trade-off between image resolution and field of view of today’s 3D scanning systems, scans used to build detailed face models will generally be captured using a different system than that used for the rest of the body. For our model, we merge all transform bones into a single skeletal hierarchy but keep the native parameterization of each component part to express identity and motion variations, as explained below. As the final output, the Frankenstein model produces motion parameters capturing the total body motion of humans, and generates a seamless mesh by blending the vertices of the component meshes.
3.1. Stitching Part Models
The Frankenstein model maps its motion parameters θ^U and shape parameters φ^U to a mesh,

V^U = M^U(θ^U, φ^U),

where V^U is a seamless mesh expressing the motion and shape of the target subject.

The motion and shape parameters of the model are a union of the part models’ parameters:

θ^U = {θ^B, θ^F, θ^LH, θ^RH},
φ^U = {φ^B, φ^F, φ^LH, φ^RH},

where the superscripts represent each part model: B for the body model, F for the face model, LH for the left hand model, and RH for the right hand model. Each of the component part models maps from a subset of the above parameters to a set of vertices, respectively, V^B ∈ R^(N^B × 3), V^F ∈ R^(N^F × 3), V^LH ∈ R^(N^H × 3), and V^RH ∈ R^(N^H × 3), where N^B, N^F, and N^H denote the number of vertices of each mesh part.

In the Frankenstein model, all parts are rigidly linked by a single skeletal hierarchy. This unification is achieved by substituting the hands and face branches of the SMPL body skeleton with the corresponding skeletal hierarchies of the detailed part models. All parameters of the Frankenstein model are jointly optimized for motion tracking and identity fitting. The parameterization of each of the part models is detailed in the following sections.
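As a rough illustration of the stitching described in this subsection, the sketch below holds the union of part parameters and stacks the part vertex sets into one mesh. All names here (FrankensteinParams, assemble_vertices, the part keys) are hypothetical and only meant to mirror the B/F/LH/RH notation, not the authors' code.

```python
from dataclasses import dataclass, field
from typing import Dict

import numpy as np

PARTS = ("B", "F", "LH", "RH")  # body, face, left hand, right hand


@dataclass
class FrankensteinParams:
    """Union of the part models' motion (theta) and shape (phi) parameters."""
    theta: Dict[str, np.ndarray] = field(default_factory=dict)  # e.g. theta["LH"]
    phi: Dict[str, np.ndarray] = field(default_factory=dict)    # e.g. phi["B"]


def assemble_vertices(part_vertices: Dict[str, np.ndarray]) -> np.ndarray:
    """Stack per-part vertex arrays (N_part x 3) into a single mesh V^U.

    A faithful implementation would also blend duplicated vertices along the
    wrist and neck seams to obtain a seamless surface; here we only stack.
    """
    return np.concatenate([part_vertices[p] for p in PARTS], axis=0)
```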
3.2. Body Model
For the body, we use the SMPL model [33] with minor modifications. In this section, we summarize the salient aspects of the model in our notation. The body model, M^B(θ^B, φ^B), is defined as follows:

V^B = M^B(θ^B, φ^B),

where the posed vertices V^B are obtained by linear blend skinning with a transformation matrix in SE(3) for each of the J joints.
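The surviving text above refers to posing the template by linear blend skinning with one rigid transform per joint. Below is a minimal sketch of standard linear blend skinning (v_i = Σ_j w_ij T_j v̂_i); the vectorized NumPy formulation is an assumption for illustration, not the paper's code.

```python
import numpy as np

def linear_blend_skinning(vertices, weights, joint_transforms):
    """Pose template vertices with linear blend skinning (LBS).

    vertices:         (N, 3) rest-pose vertex positions.
    weights:          (N, J) skinning weights; each row sums to 1.
    joint_transforms: (J, 4, 4) rigid world transforms T_j (elements of SE(3)).
    Returns (N, 3) posed vertex positions.
    """
    n = vertices.shape[0]
    homo = np.concatenate([vertices, np.ones((n, 1))], axis=1)   # (N, 4)
    # Transform every vertex by every joint: per_joint[j, n] = T_j @ homo[n]
    per_joint = np.einsum('jab,nb->jna', joint_transforms, homo)
    # Blend the per-joint results with the skinning weights.
    blended = np.einsum('nj,jna->na', weights, per_joint)
    return blended[:, :3]
```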
3.3. Face Model
As a face model, we build a generative PCA model from the FaceWarehouse dataset [13]. Specifically, the face part model, M^F(θ^F, φ^F), is defined as follows:

V^F = M^F(θ^F, φ^F),

where the shape (identity) parameters φ^F and motion (expression) parameters θ^F linearly deform the mean face via the PCA bases. A transformation in SE(3) is then required to compensate for displacements in the root location of the face joint due to body shape changes in the body model. Finally, each face vertex position is given by applying this rigid transformation to the deformed face vertices.
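Putting the surviving pieces of this subsection together, a hedged sketch of such an additive PCA face model (mean face plus identity and expression bases, followed by the rigid transform that places the face at the body's face joint) might look as follows; the variable names and the purely linear formulation are our own assumptions.

```python
import numpy as np

def face_model(mean_face, identity_basis, expression_basis, phi_f, theta_f, R, t):
    """Evaluate a PCA-style face part model.

    mean_face:        (N_F, 3) mean face vertices.
    identity_basis:   (K_id, N_F, 3) identity (shape) basis.
    expression_basis: (K_exp, N_F, 3) expression (motion) basis.
    phi_f:            (K_id,) identity coefficients phi^F.
    theta_f:          (K_exp,) expression coefficients theta^F.
    R, t:             3x3 rotation and 3-vector translation compensating for the
                      root location of the face joint on the body.
    Returns (N_F, 3) face vertices in the body's coordinate frame.
    """
    verts = (mean_face
             + np.tensordot(phi_f, identity_basis, axes=1)
             + np.tensordot(theta_f, expression_basis, axes=1))
    return verts @ R.T + t
```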
3.4. Hand Model
We use an artist-rigged hand mesh. Our hand model has

Article sourced from http://tongtianta.site/paper/1129
Editor: Lornatang
Proofreader: Lornatang