Paper: Can 3D Pose be Learned from 2D Projections Alone?
Venue: ECCV 2018
Code: PyTorch
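Recent methods split 3D pose estimation into two steps: (1) estimate the 2D landmark locations (which correspond to skeleton joints); (2) estimate the 3D pose from those landmarks. Under this scheme, a suitable 2D pose estimator first produces the 2D pose, which is then fed into a standard 2D-to-3D lifting algorithm to recover the 3D pose.
Predicting a 3D skeleton from a single set of 2D landmarks has infinitely many solutions, but not all of them are physically plausible, so priors are needed to constrain the solution. These priors are usually derived from 3D ground truth, and because 3D capture systems are complex, such priors are limited. The paper argues that unsupervised algorithms (such as GANs) can help overcome the limitations of capturing 3D data. Without using any additional cues such as video, multi-view cameras, or depth images, the paper addresses the basic problem of lifting 2D image coordinates into 3D space.
Difference: unlike previous methods, the paper does not explicitly learn a prior from 3D data or from 2D-3D correspondences. The system generates 3D skeletons purely by observing 2D poses.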
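If a 3D skeleton is distorted, its 2D projection viewed from another angle will most likely look distorted as well, whereas an accurately estimated 3D pose yields 2D projections that match the real 2D pose distribution no matter which direction it is projected in. This property is used to learn a prior for lifting 2D projections to 3D.
The quality of a 3D pose estimate can be judged from two angles: (a) when the estimated 3D pose is projected onto the original camera, how close its 2D projection is to the detected 2D landmarks; (b) when the 3D pose is projected onto a random camera, the distribution of its 2D projections should match the distribution of real 2D landmarks.
Implementation: a GAN can learn a distribution without explicit supervision, and the method uses 2D poses to indirectly learn a latent distribution (the 3D pose prior). Given a 2D pose, the generator hypothesizes the relative depths of the joints to obtain a 3D human skeleton. Random 2D projections of the generated 3D skeleton are then fed to the discriminator together with real 2D pose samples (see Figure 2). During training, the 2D poses fed to the generator and to the discriminator do not need to correspond. The discriminator learns a prior from the 2D projections and drives the generator to eventually produce realistic 3D skeletons.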
Main contributions:
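To stay consistent with GAN naming conventions, the paper calls the 3D pose estimation network the generator. For simplicity, we work in the camera coordinate system, where a camera with unit focal length sits at the origin (0,0,0) of the world coordinate system. Let $\mathbf{x}_i = (x_i, y_i)$, $i = 1, \dots, N$, denote the $N$ 2D pose landmarks, with the root joint (the midpoint between the hip joints) at the origin. The 2D input is then $\mathbf{x} = [\mathbf{x}_1, \dots, \mathbf{x}_N]$. For numerical stability, the distance from the top of the head to the root joint is kept at roughly one unit.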
Generator: the network outputs a depth offset $o_i$ for each joint $\mathbf{x}_i$:
$$G_{\theta_G}(\mathbf{x}_i) = o_i$$
where $\theta_G$ are parameters of the generator learned during training. The depth of each point is defined as:
$$z_i = \max(0, d + o_i) + 1$$
where $d$ denotes the distance between the camera and the 3D skeleton. In practice we use $d = 10$ units.
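Back Projection Layer: this layer takes the 2D joints $\mathbf{x}_i$ and the predicted depths $z_i$ and computes the 3D points $\mathbf{X}_i = [z_i x_i, z_i y_i, z_i]$. Note that the paper uses exact perspective projection rather than an approximation such as orthographic or paraperspective projection.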
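The depth rule and the back projection layer can be sketched in a few lines of plain Python. This is only an illustration of the two formulas above, not the authors' code: the function names, the three example joints, and the example offsets are made up, and a unit-focal-length camera at the origin is assumed as in the text.

```python
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]

D = 10.0  # camera-to-skeleton distance d stated in the text


def depths_from_offsets(offsets: Sequence[float], d: float = D) -> List[float]:
    """Apply z_i = max(0, d + o_i) + 1 to every predicted offset."""
    return [max(0.0, d + o) + 1.0 for o in offsets]


def back_project(pose_2d: Sequence[Point2D], depths: Sequence[float]) -> List[Point3D]:
    """Lift each 2D joint to X_i = [z_i * x_i, z_i * y_i, z_i]."""
    return [(z * x, z * y, z) for (x, y), z in zip(pose_2d, depths)]


def project(points_3d: Sequence[Point3D]) -> List[Point2D]:
    """Perspective projection with unit focal length: (X / Z, Y / Z)."""
    return [(X / Z, Y / Z) for X, Y, Z in points_3d]


# Hypothetical inputs: three normalized 2D joints and made-up generator offsets.
pose_2d = [(0.0, 0.0), (0.02, -0.1), (-0.03, 0.05)]
offsets = [0.0, 0.4, -0.2]

depths = depths_from_offsets(offsets)
skeleton = back_project(pose_2d, depths)
reprojected = project(skeleton)
```

Because the lifted point is the 2D joint scaled by its depth, projecting `skeleton` back onto the original camera reproduces `pose_2d` up to floating-point error. The `max(0, ...) + 1` clamp keeps every depth at least 1, so the division in `project` is always safe.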
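Random Projection Layer: the predicted 3D skeleton is projected to a 2D pose using a randomly generated camera orientation, and the result is fed to the discriminator. For simplicity, the paper randomly rotates the 3D points and applies perspective projection to obtain the fake 2D projection.
Note: perspective projection has an inherent ambiguity. Doubling both the size of the 3D skeleton and its distance from the camera yields the same 2D projection. A generator that predicts absolute 3D coordinates therefore has an extra degree of freedom between predicted size and distance for every training sample in a batch. This can cause large variations in the generator's outputs and gradient magnitudes within a batch and lead to convergence problems during training.
Solution: the paper removes this ambiguity by predicting depth offsets relative to a constant depth $d$ and rotating around that depth, which gives stable training.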
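The ambiguity can be made concrete with a short worked step. It assumes the unit-focal-length camera at the origin used throughout, and the scale factor $s > 0$ is a symbol introduced here for illustration only. A 3D point $(X, Y, Z)$ projects to $(X/Z, Y/Z)$, so scaling the whole skeleton and its distance from the camera by $s$ gives

$$\left(\frac{sX}{sZ}, \frac{sY}{sZ}\right) = \left(\frac{X}{Z}, \frac{Y}{Z}\right)$$

Every value of $s$ therefore explains the same 2D pose equally well. Anchoring all depths to the constant $d$ removes this free scale.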
Discriminator: The discriminator $D$ is defined as a neural network that consumes either the fake 2D pose $p$ (randomly projected from the generated 3D skeleton) or a real 2D pose $r$ (some projection, via camera or synthetic view, of a real 3D skeleton) and classifies them as either fake (target probability of 0) or real (target probability of 1), respectively.
$$D_{\theta_D}(\mathbf{u}) \rightarrow [0, 1]$$
where $\theta_D$ are parameters of the discriminator learned during training and $\mathbf{u}$ denotes a 2D pose. Note that for any training sample $\mathbf{x}$, we do not require $r$ to be the same as $\mathbf{x}$ or any of its multi-view correspondences. During learning we utilize a standard GAN loss defined as:
$$\min_{G}\max_{D} V(D, G) = \mathbb{E}(\log(D(r))) + \mathbb{E}(\log(1 - D(p)))$$
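As a small illustration of the objective, the value function can be estimated from a batch of discriminator outputs. This is a sketch under assumptions, not the training code: the function name and the sample probabilities are hypothetical, and the `eps` clamp is an addition here to avoid `log(0)`, not something the text specifies.

```python
import math
from typing import Sequence


def gan_value(d_real: Sequence[float], d_fake: Sequence[float], eps: float = 1e-12) -> float:
    """Batch estimate of V(D, G) = E[log D(r)] + E[log(1 - D(p))].

    d_real holds D(r) for real 2D poses, d_fake holds D(p) for projected fakes.
    eps is an assumed numerical safeguard against log(0).
    """
    real_term = sum(math.log(max(p, eps)) for p in d_real) / len(d_real)
    fake_term = sum(math.log(max(1.0 - p, eps)) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# Hypothetical discriminator outputs.
undecided = gan_value([0.5, 0.5], [0.5, 0.5])   # D cannot tell real from fake
confident = gan_value([0.9, 0.95], [0.1, 0.05])  # D separates them well
```

The discriminator tries to increase this value, so `confident` is larger than `undecided`. The generator tries to decrease it by making projected poses that the discriminator scores as real.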
For training, the 2D pose landmarks are normalized by centering each 2D pose on the root joint and scaling the pixel coordinates so that the average head-to-root distance over the training data is $1/d$ units in 2D. Although we can fit the entire data in GPU memory, we use a batch size of 32,768. We use the Adam optimizer with a starting learning rate of 0.0002 for both generator and discriminator networks. We varied the batch size between 8,192 and 65,536 in experiments but it did not have any significant effect on the performance. Training time on 8 TitanX GPUs is 0.4 seconds per batch.
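The normalization step can be sketched as follows. This is a minimal illustration, assuming each pose is a list of (x, y) pixel coordinates and that the root and head joints sit at fixed list positions. The indices `root_index` and `head_index`, the function names, and the two toy poses are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]


def center_on_root(pose: Sequence[Point2D], root_index: int) -> List[Point2D]:
    """Translate a pose so that its root joint lies at the origin."""
    rx, ry = pose[root_index]
    return [(x - rx, y - ry) for x, y in pose]


def normalize_dataset(
    poses: Sequence[Sequence[Point2D]],
    d: float = 10.0,
    root_index: int = 0,  # assumed position of the root joint
    head_index: int = 1,  # assumed position of the head-top joint
) -> List[List[Point2D]]:
    """Center every pose, then apply one global scale so that the mean
    head-to-root distance over the whole dataset equals 1 / d."""
    centered = [center_on_root(p, root_index) for p in poses]
    mean_dist = sum(math.hypot(*p[head_index]) for p in centered) / len(centered)
    scale = (1.0 / d) / mean_dist
    return [[(x * scale, y * scale) for x, y in p] for p in centered]


# Two toy poses in pixels: [root, head].
raw = [
    [(100.0, 200.0), (100.0, 100.0)],  # head-to-root distance 100 px
    [(50.0, 50.0), (50.0, 110.0)],     # head-to-root distance 60 px
]
normalized = normalize_dataset(raw)
```

A single global scale is used rather than one scale per pose, because the text asks for the average distance over the training data to be $1/d$, so individual poses keep their relative sizes.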
Generator Architecture: The generator accepts a 28-dimensional input representing 14 2D joint locations. Inputs are connected to a fully connected layer to expand the dimensionality to 1024 and then fed into subsequent residual blocks. Similar to [26], a residual block is composed of a pair of fully connected layers, each with 1024 neurons followed by batch normalization and ReLU (see Figure 3). The final output is reduced through a fully connected layer to produce 14-dimensional depth offsets (one for each pose joint). A total of 4 residual blocks are employed in the generator.
Discriminator Architecture: Similar to the generator, the discriminator also takes 28 inputs representing 14 2D joint locations, either from the real 2D pose dataset or the fake 2D pose projected from the hypothesized 3D skeleton. This goes through a fully connected layer of size 1024 to feed the subsequent 3 residual blocks as defined above. Finally, the output of the discriminator is a 2-class softmax layer denoting the probability of the input being real or fake.
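To get a feel for the size of the two networks, the layer widths above can be turned into a rough parameter count. This is back-of-the-envelope arithmetic under assumptions the text does not state: every fully connected layer has a bias, each of the two layers in a residual block has its own batch normalization with a learnable scale and shift, and running statistics are not counted. The helper names are hypothetical.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer (bias assumed)."""
    return n_in * n_out + n_out


def batchnorm_params(width: int) -> int:
    """Learnable scale and shift of one batch-norm layer (assumed affine)."""
    return 2 * width


def residual_block_params(width: int = 1024) -> int:
    """Two fully connected layers, each followed by batch normalization."""
    return 2 * (linear_params(width, width) + batchnorm_params(width))


def network_params(n_in: int, n_out: int, n_blocks: int, width: int = 1024) -> int:
    """Input layer, n_blocks residual blocks, and output layer."""
    return (
        linear_params(n_in, width)
        + n_blocks * residual_block_params(width)
        + linear_params(width, n_out)
    )


generator_params = network_params(n_in=28, n_out=14, n_blocks=4)
discriminator_params = network_params(n_in=28, n_out=2, n_blocks=3)
```

Under these assumptions the generator comes to about 8.5 million parameters and the discriminator to about 6.3 million, with nearly all of them in the 1024-by-1024 layers inside the residual blocks.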
Random Rotations: The random projection layer creates a random rotation by sampling an elevation angle $\varphi$ randomly from [0, 20] degrees and an azimuth angle $\theta$ from [0, 360] degrees. These angles were chosen as a heuristic to roughly emulate probable viewpoints that most "in the wild" images would have.
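The angle sampling and the random projection layer can be sketched together. Several details here are assumptions rather than facts from the text: the angles are drawn uniformly, the azimuth is applied about the y axis before the elevation about the x axis, and the pivot of the rotation is the point (0, 0, d), inferred from the remark that the skeleton is rotated around the constant depth. All function names and the example skeleton are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


def sample_angles(rng: random.Random) -> Tuple[float, float]:
    """Draw elevation from [0, 20] and azimuth from [0, 360] degrees (uniform
    sampling is an assumption) and return both in radians."""
    elevation = math.radians(rng.uniform(0.0, 20.0))
    azimuth = math.radians(rng.uniform(0.0, 360.0))
    return elevation, azimuth


def rotate_about_center(
    points: Sequence[Point3D], elevation: float, azimuth: float, d: float = 10.0
) -> List[Point3D]:
    """Rotate points about the assumed pivot (0, 0, d): azimuth about the
    y axis first, then elevation about the x axis."""
    ce, se = math.cos(elevation), math.sin(elevation)
    ca, sa = math.cos(azimuth), math.sin(azimuth)
    rotated = []
    for X, Y, Z in points:
        x, y, z = X, Y, Z - d          # move the pivot to the origin
        x1 = ca * x + sa * z           # azimuth rotation (about y)
        z1 = -sa * x + ca * z
        y2 = ce * y - se * z1          # elevation rotation (about x)
        z2 = se * y + ce * z1
        rotated.append((x1, y2, z2 + d))  # move the pivot back
    return rotated


def project(points: Sequence[Point3D]) -> List[Point2D]:
    """Perspective projection with unit focal length."""
    return [(X / Z, Y / Z) for X, Y, Z in points]


rng = random.Random(0)  # seeded only to make the sketch repeatable
skeleton = [(0.5, -1.0, 11.0), (0.0, 0.0, 10.5), (-0.4, 0.8, 10.8)]
elevation, azimuth = sample_angles(rng)
fake_pose = project(rotate_about_center(skeleton, elevation, azimuth))
```

Rotating about (0, 0, d) rather than about the camera keeps the skeleton at roughly the same distance from the camera, so the projected pose stays at a scale comparable to the normalized real poses.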