Third, we demonstrate the effectiveness of semantic parsing in generation, which makes the completed results look more plausible and consistent with the surrounding contexts.
2. Related Work
Image completion. Image completion has been studied in numerous contexts, e.g., inpainting, texture synthesis, and sparse signal recovery.
As a thorough literature review is beyond the scope of this paper, we discuss only the most representative methods to put our work in proper context.
An early inpainting method [4] exploits a diffusion equation to iteratively propagate low-level features from known regions to unknown areas along the mask boundaries.
While it performs well on inpainting, it is limited to dealing with small and homogeneous regions.
Another method has been developed to further improve inpainting results by introducing texture synthesis [5].
In [29], a patch prior is learned to restore images with missing pixels. Recently, Ren et al. [20] learn a convolutional network for inpainting.
The performance of image completion is significantly improved by an efficient patch matching algorithm [2] for nonparametric texture synthesis.
While it performs well when similar patches can be found, it is likely to fail when the source image does not contain a sufficient amount of data to fill in the unknown regions.
We note this typically occurs in object completion, as each part is likely to be unique and no plausible patches for the missing region can be found.
Although this problem can be alleviated by using an external database [9], the ensuing issue is the need to learn a high-level representation of one specific object class for patch matching.
Wright et al. [27] cast image completion as the task of recovering sparse signals from inputs.
By solving a sparse linear system, an image can be recovered from a corrupted input.
However, this algorithm requires the images to be highly structured (i.e., data points are assumed to lie in a low-dimensional subspace), e.g., well-aligned face images.
In contrast, our algorithm is able to perform object completion without strict constraints.
Image generation. Vincent et al. [24] introduce denoising autoencoders that learn to reconstruct clean signals from corrupted inputs.
In [7], Dosovitskiy et al. demonstrate that an object image can be reconstructed by inverting deep convolutional network features (e.g., VGG [21]) through a decoder network.
Kingma et al. [11] propose variational autoencoders (VAEs), which regularize encoders by imposing a prior over the latent units such that images can be generated by sampling from or interpolating latent units.
However, the images generated by a VAE are usually blurry due to its training objective based on pixel-wise Gaussian likelihood.
Larsen et al. [12] improve a VAE by adding a discriminator for adversarial training, which stems from generative adversarial networks (GANs) [8], and demonstrate that more realistic images can be generated.
Closest to this work is the method proposed by Pathak et al. [17], which applies an autoencoder and integrates learning visual representations with image completion.
However, this approach emphasizes unsupervised learning of representations more than image completion.
In essence, this is a chicken-and-egg problem.
Despite the promising results on object detection, it is still not entirely clear whether image completion can provide sufficient supervision signals for learning high-level features.
On the other hand, semantic labels or segmentations are likely to be useful for improving the completion results, especially on a certain object category.
With the goal of achieving high-quality image completion, we propose to use an additional semantic parsing network to regularize the generative networks.
Our model deals with severe image corruption (large region with missing pixels), and develops a combined reconstruction, adversarial and parsing loss for face completion.
3. Proposed Algorithm
In this section, we describe the proposed model for object completion.
Given a masked image, our goal is to synthesize the missing contents that are both semantically consistent with the whole object and visually realistic.
Figure 2 shows the proposed network that consists of one generator, two discriminators, and a parsing network.
3.1. Generator
The generator G is designed as an autoencoder to construct new contents given input images with missing regions.
The masked (or corrupted) input, along with the filled noise, is first mapped to hidden representations through the encoder.
Unlike the original GAN model [8], which directly starts from a noise vector, the hidden representations obtained from the encoder capture more variations and relationships between unknown and known regions, which are then fed into the decoder for generating contents.
We use the architecture from “conv1” to “pool3” of the VGG-19 network [21], stack two more convolution layers and one more pooling layer on top, and add a fully-connected layer after that to form the encoder. The decoder is symmetric to the encoder, with unpooling layers.
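To make this concrete, the following PyTorch sketch outlines such an encoder-decoder. The layer counts for “conv1” through “pool3” follow VGG-19; the width of the two extra convolution layers (512), the size of the fully-connected bottleneck (4000), and the use of nearest-neighbor upsampling in place of unpooling are our illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of the generator autoencoder, assuming 128x128x3 inputs.
import torch
import torch.nn as nn

def conv_block(cin, cout, n):
    # n conv3x3+ReLU layers followed by 2x2 max pooling, as in VGG-19.
    layers = []
    for i in range(n):
        layers += [nn.Conv2d(cin if i == 0 else cout, cout, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2))
    return layers

class Generator(nn.Module):
    def __init__(self, bottleneck=4000):  # bottleneck size is an assumption
        super().__init__()
        self.encoder = nn.Sequential(
            *conv_block(3, 64, 2),     # conv1_x + pool1 -> 64x64
            *conv_block(64, 128, 2),   # conv2_x + pool2 -> 32x32
            *conv_block(128, 256, 4),  # conv3_x + pool3 -> 16x16
            *conv_block(256, 512, 2),  # two extra convs + one extra pool -> 8x8
        )
        self.fc = nn.Linear(512 * 8 * 8, bottleneck)    # fully-connected layer
        self.defc = nn.Linear(bottleneck, 512 * 8 * 8)
        # Decoder mirrors the encoder; unpooling is approximated here with
        # nearest-neighbor upsampling to keep the sketch short.
        def up(cin, cout):
            return [nn.Upsample(scale_factor=2, mode="nearest"),
                    nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True)]
        self.decoder = nn.Sequential(
            *up(512, 256), *up(256, 128), *up(128, 64),
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        # x: masked 128x128 input with noise filled into the hole.
        h = self.encoder(x)
        h = self.defc(self.fc(h.flatten(1))).view(-1, 512, 8, 8)
        return self.decoder(h)
```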
3.2. Discriminator
The generator can be trained to fill the masked region or missing pixels with small reconstruction errors.
However, it does not ensure that the filled region is visually realistic and coherent.
As shown in Figure 3(c), the generated pixels are quite blurry and only capture the coarse shape of missing face components.
To encourage more photo-realistic results, we adopt a discriminator D that serves as a binary classifier to distinguish between real and fake images.
The goal of this discriminator is to help improve the quality of synthesized results such that they are realistic enough to fool the trained discriminator.
We first propose a local D for the missing region which determines whether the synthesized contents in the missing region are real or not.
Compared with Figure 3(c), the network with the local D (shown in Figure 3(d)) begins to help generate details of missing contents with sharper boundaries.
It encourages the generated object parts to be semantically valid.
However, its limitations are also obvious due to the locality.
First, the local loss can neither regularize the global structure of a face, nor guarantee the statistical consistency within and outside the masked regions.
Second, while the newly generated pixels are conditioned on their surrounding contexts, a local D can hardly exert a direct impact outside the masked regions during backpropagation, due to the unpooling structure of the decoder.
Consequently, the inconsistency of pixel values along region boundaries is obvious.
Therefore, we introduce another global D to determine the faithfulness of an entire image.
The fundamental idea is that the newly generated contents should not only be realistic, but also consistent with the surrounding contexts.
As shown in Figure 3(e), the network with the additional global D greatly alleviates the inconsistency issue and further enforces the generated contents to be more realistic.
We note that the architectures of the two discriminators are similar to [19].
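A minimal sketch of the two binary discriminators follows, assuming a DCGAN-style stack of strided convolutions as in [19]; the exact depths and widths are illustrative. The local D judges the 64x64 masked region while the global D judges the whole 128x128 image.

```python
# Two real/fake classifiers sharing one construction helper.
import torch
import torch.nn as nn

def dcgan_d(in_size, ch=64):
    layers, c, size = [], 3, in_size
    while size > 4:                        # strided convs downsample to 4x4
        layers += [nn.Conv2d(c, ch, 4, stride=2, padding=1),
                   nn.LeakyReLU(0.2, inplace=True)]
        c, ch, size = ch, ch * 2, size // 2
    layers += [nn.Conv2d(c, 1, 4), nn.Sigmoid()]   # real/fake probability
    return nn.Sequential(*layers)

local_D  = dcgan_d(64)    # judges the synthesized 64x64 region
global_D = dcgan_d(128)   # judges the entire completed face

patch = torch.randn(8, 3, 64, 64)
full  = torch.randn(8, 3, 128, 128)
print(local_D(patch).view(8, -1).shape, global_D(full).view(8, -1).shape)
```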
3.3. Semantic Regularization
With a generator and two discriminators, our model can be regarded as a variation of the original GAN [8] model that is conditioned on contexts (e.g., non-mask regions).
However, as a bottleneck, the GAN model tends to generate independent facial components that are likely not suited to the original subjects with respect to facial expressions and part shapes, as shown in Figure 3(e).
The top one has large, unnatural eyes and the bottom one contains two asymmetric eyes.
Furthermore, we find the global D is not effective in ensuring the consistency of fine details in the generated image.
For example, if only one eye is masked, the generated eye does not fit well with the other, unmasked one.
We show two more examples in Figure 4(c), where the generated eye is obviously asymmetric to the unmasked one, although the generated eye itself is already realistic.
Both cases indicate that more regularization is needed to encourage the generated faces to have high-level distributions similar to those of real faces.
Therefore we introduce a semantic parsing network to further enhance the harmony of the generated contents and existing pixels.
The parsing network is an autoencoder which bears some resemblance to the semantic segmentation method [28].
The parsing result of the generated image is compared with that of the original image.
As such, the generator is forced to learn where to generate features with more natural shape and size.
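A minimal sketch of how such a parsing loss could be computed, assuming parsing_net is a frozen, pretrained network with 11 per-pixel outputs; penalizing the completed image's parsing against hard labels from the original via cross-entropy is one plausible reading, not necessarily the paper's exact formulation.

```python
# Semantic regularization: parse the completed face and compare it with
# the parsing of the original (unmasked) face.
import torch
import torch.nn.functional as F

def parsing_loss(parsing_net, completed, original):
    with torch.no_grad():
        # Ground-truth parsing: argmax over the 11 label maps of the original.
        target = parsing_net(original).argmax(dim=1)     # (N, H, W)
    logits = parsing_net(completed)                       # (N, 11, H, W)
    return F.cross_entropy(logits, target)
```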
In Figure 3(e)-(f) and Figure 4(c)-(d), we show the images generated by models without and with the semantic regularization.
3.4. Training Neural Networks
To effectively train our network, we use the curriculum strategy [3] by gradually increasing the difficulty level and network scale.
The training process is scheduled in three stages.
First, we train the network using the reconstruction loss to obtain blurry contents.
Second, we fine-tune the network with the local adversarial loss. The global adversarial loss and semantic regularization are incorporated at the last stage, as shown in Figure 3.
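The staged objective could be organized as in the sketch below; the loss weights (lam_*) and the mask-cropping helper are illustrative assumptions, not values or code from the paper.

```python
# Three-stage curriculum: reconstruction, + local adversarial,
# + global adversarial and semantic regularization.
import torch
import torch.nn.functional as F

def crop_mask_region(img, box):
    # box = (top, left, size): location of the square training mask.
    t, l, s = box
    return img[:, :, t:t + s, l:l + s]

def generator_loss(stage, completed, original, box,
                   local_D=None, global_D=None, parse_loss_fn=None,
                   lam_local=0.3, lam_global=0.3, lam_parse=0.1):
    # Stage 1: reconstruction (L2) loss only; produces blurry contents.
    loss = F.mse_loss(completed, original)
    if stage >= 2:
        # Stage 2: add the local adversarial loss on the masked region.
        patch = crop_mask_region(completed, box)
        loss = loss + lam_local * (-torch.log(local_D(patch) + 1e-8)).mean()
    if stage >= 3:
        # Stage 3: add the global adversarial loss and the semantic
        # regularization (e.g., the parsing_loss sketched in Sec. 3.3).
        loss = loss + lam_global * (-torch.log(global_D(completed) + 1e-8)).mean()
        loss = loss + lam_parse * parse_loss_fn(completed, original)
    return loss
```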
Each stage prepares features for the next one to improve, and hence greatly increases the effectiveness and efficiency of network training. For example, in Figure 3, the reconstruction stage (c) restores the rough shape of the missing eye, although the contents are blurry.
The local adversarial stage (d) then generates more details to make the eye region visually realistic, and the global adversarial stage (e) refines the whole image to ensure that the appearance is consistent around the boundary of the mask.
The semantic regularization (f) finally enforces further consistency between components and makes the generated result closer to the actual face.
When training with the adversarial loss, we use a method similar to [19], in particular to avoid the case where the discriminator is too strong at the beginning of the training process.
4.1. Datasets
We use the CelebA [15] dataset to learn and evaluate our model.
It consists of 202,599 face images; each face image is cropped, roughly aligned by the position of the two eyes, and rescaled to 128×128×3 pixels.
We follow the standard split with 162,770 images for training, 19,867 for validation and 19,962 for testing.
We set the mask size to 64×64 for training to guarantee that at least one essential facial component is missing.
If the mask only covers smooth regions with a small mask size, it will not drive the model to learn semantic representations.
To avoid overfitting, we perform data augmentation that includes flipping, shifting, rotation (+/- 15 degrees) and scaling.
During the training process, the size of the mask is fixed but the position is randomly selected.
As such, the model is forced to learn the whole object in a holistic manner instead of only a certain part.
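For illustration, the following NumPy sketch samples such a fixed-size square mask at a random position and fills the hole with noise before the image is fed to the generator; the function names and the uniform-noise fill are our assumptions.

```python
# Random fixed-size square mask inside a 128x128 face image.
import numpy as np

def random_square_mask(img_size=128, mask_size=64, rng=np.random):
    top  = rng.randint(0, img_size - mask_size + 1)
    left = rng.randint(0, img_size - mask_size + 1)
    mask = np.zeros((img_size, img_size), dtype=bool)
    mask[top:top + mask_size, left:left + mask_size] = True
    return mask, (top, left, mask_size)

def corrupt(img, mask, rng=np.random):
    # img: (H, W, 3) float array in [0, 1]; masked pixels replaced by noise.
    out = img.copy()
    out[mask] = rng.uniform(0.0, 1.0, size=(mask.sum(), 3))
    return out
```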
4.2. Face Parsing
Since face images in the CelebA [15] dataset do not have segment labels, we use the Helen face dataset [13] to train a face parsing network for regularization.
The Helen dataset consists of 2,330 images and each face has 11 segment labels covering every main component of the face (e.g., hair, eyebrows, eyes), labelled by [22].
We first roughly crop the face in each image to a size of 128×128 and then feed it into the parsing network to predict the label of each pixel.
Our parsing network bears some resemblance to the semantic segmentation method [28]; we mainly modify its last layer to produce 11 outputs.
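As an illustration of such a modification, the sketch below builds a generic segmentation network with an 11-channel output using torchvision's FCN; this stands in for, and is not, the actual network derived from [28].

```python
# A segmentation network with 11 per-pixel outputs (one per face label).
import torch
from torchvision.models.segmentation import fcn_resnet50

parsing_net = fcn_resnet50(weights=None, num_classes=11).eval()
x = torch.randn(1, 3, 128, 128)          # a roughly cropped 128x128 face
with torch.no_grad():
    logits = parsing_net(x)["out"]       # (1, 11, 128, 128) per-pixel logits
labels = logits.argmax(dim=1)            # predicted label for each pixel
```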
We use the standard training/testing split and obtain a parsing model that achieves an F-score of 0.851 over all facial components on the Helen test dataset, compared to a corresponding F-score of 0.854 by the state-of-the-art multi-objective based model [14].
This model can be further improved with more careful hyperparameter tuning but is currently sufficient to improve the quality of face completion.
Several parsing results on the Helen test images are presented in Figure 5.
Once the parsing network is trained, it remains fixed in our generation framework.
We first use the network on the CelebA training set to obtain the parsing results of the originally unmasked faces as the ground truth, and compare them with the parsing of generated faces during training.
The parsing loss is eventually back-propagated to the generator to regularize face completion.
We show some parsing results on the CelebA dataset in Figure 5.
The proposed semantic regularization can be regarded as measuring the distance in a feature space where sensitivity to local image statistics can be achieved [6].
4.3. Face Completion
Qualitative results. Figure 6 shows our face completion results on the CelebA test dataset.
In each test image, the mask covers at least one key facial component.
The third column of each panel shows that our completion results are visually realistic and pleasing.
Note that during testing, the mask is not restricted to a 64×64 square; however, the total number of masked pixels is suggested to be no more than 64×64 pixels.
We show typical examples with one big mask covering at least two face components (e.g., eyes, mouths, eyebrows, hair, noses) in the first two rows.
We specifically present more results on eye regions since they better reflect how realistic the newly generated faces are under the proposed algorithm.
Overall, the algorithm can successfully complete images with faces in side views, or faces partially or completely corrupted by masks of different shapes and sizes.
In the third row, we present a few examples where real occlusion (e.g., wearing glasses) occurs.
As it is sometimes subjective whether a region in the image is occluded, we give users the option to assign the occluded regions by drawing masks.
The results clearly show that our model is able to restore the partially masked eyeglasses, or remove the whole eyeglasses or just the frames by filling in realistic eyes and eyebrows.
In the last row, we present examples with multiple, randomly drawn masks, which are closer to real-world application scenarios.
Figure 7 presents completion results where different key parts (e.g., eyes, nose, and mouth) of the same input face image are masked.
It shows that our completion results are consistent and realistic regardless of the mask shapes and locations.
Quantitative results.
In addition to the visual results, we also perform quantitative evaluation using three metrics on the CelebA test dataset (19,962 images).
The first one is the peak signal-to-noise ratio (PSNR), which directly measures the difference in pixel values.
The second one is the structural similarity index (SSIM) that estimates the holistic similarity between two images.
Lastly, we use the identity distance measured by the OpenFace toolbox [1] to determine the high-level semantic similarity of two faces.
These three metrics are computed between the completion results obtained by different methods and the original face images.
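A minimal sketch of the per-image evaluation follows. PSNR and SSIM are computed with scikit-image; the identity distance is shown only as a squared L2 distance between precomputed face embeddings, since extracting those embeddings with the OpenFace [1] pipeline depends on the local installation.

```python
# Per-image quantitative evaluation: PSNR, SSIM, and identity distance.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(original, completed):
    # original, completed: (H, W, 3) float arrays in [0, 1].
    psnr = peak_signal_noise_ratio(original, completed, data_range=1.0)
    ssim = structural_similarity(original, completed, channel_axis=-1,
                                 data_range=1.0)
    return psnr, ssim

def identity_distance(feat_a, feat_b):
    # Squared L2 distance between face embeddings (e.g., 128-d OpenFace
    # vectors), the convention OpenFace itself uses for face comparison.
    return float(np.sum((feat_a - feat_b) ** 2))
```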
Figure 6. Face completion results on the CelebA [15] test dataset. In each panel, from left to right: original images, masked inputs, our completion results.
Figure 7. Face part completion. In each panel, left: masked input; right: our completion result.
The results are shown in Tables 1-3. Specifically, the step-wise contribution of each component is shown from the 2nd to the 5th column of each table, where M1-M5 correspond to five different settings of our own model in Figure 3 and O1-O6 are six different masks for evaluation, as shown in Figure 8.
Figure 8. Simulated face occlusions occurring in real scenarios, with different masks O1-O6. From left to right: left half, right half, two eyes, left eye, right eye, and lower half.
We then compare our model with the Context Encoder [17] (CE).
Since the CE model is originally not trained for faces, we retrain the CE model on the CelebA dataset for fair comparisons.
As the evaluated masks O1-O6 are not in the image center, we use the inpaintRandom version of their code and mask 25% of the pixels in each image.
Finally we also replace the non-mask region of the output with original pixels.
The comparison between our model (M4) and CE in the 5th and 6th columns shows that our model performs generally better than the CE model, especially on large masks (e.g., O1-O3, O6).
In the last column, we show that Poisson blending [18] can further improve the performance.
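A minimal sketch of such a post-process using OpenCV's seamlessClone, which implements Poisson image editing; treating the completed image as the source patch cloned over the original is our assumption about how [18] is applied here.

```python
# Poisson blending of the generator output into the original image.
import cv2
import numpy as np

def poisson_blend(original_bgr, completed_bgr, mask_bool):
    # seamlessClone expects an 8-bit mask, 255 inside the filled region.
    mask = mask_bool.astype(np.uint8) * 255
    ys, xs = np.nonzero(mask_bool)
    center = (int(xs.mean()), int(ys.mean()))   # center of the masked region
    return cv2.seamlessClone(completed_bgr, original_bgr, mask,
                             center, cv2.NORMAL_CLONE)
```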
Note that we obtain relatively higher PSNR and SSIM values when using the reconstruction loss (M1) only, but this does not imply better qualitative results, as shown in Figure 3(c).
These two metrics simply favor smooth and blurry results.
We note that the model M1 performs poorly as it hardly recovers anything and is unlikely to preserve the identity well, as shown in Table 3.
Although the mask size is fixed at 64×64 during training, we test different sizes, ranging from 16 to 80 with a step of 8, to evaluate the generalization ability of our model.
Figure 9 shows quantitative results.
The performance of the proposed model gradually drops with the increasing mask size, which is expected as the larger mask size indicates more uncertainties in pixel values.
But generally our model performs well for smaller mask sizes (smaller than 64).
We observe a local minimum around the medium size (e.g., 32). This is because a medium-sized mask is most likely to occlude only part of a component (e.g., half an eye).
We find in experiments that generating a part of a component is more difficult than synthesizing new pixels for the whole component.
Qualitative results for different mask sizes are presented in Figure 6.
Traversing in latent space.
The missing region, although semantically constrained by the remaining pixels in an image, accommodates different plausible appearances as shown in Figure 10.
We observe that when the mask is filled with different noise, all the generated contents are semantically realistic and consistent, but their appearances vary.
This is different from the context encoder [17], where the mask is filled with zero values and thus the model only renders a single completion result.
It should be noted that under different input noise, the variations of our generated contents are unlikely to be as large as those in the original GAN [8, 19] model which is able to generate completely different faces.
This is mainly due to the constraints from the contexts (i.e., non-mask regions).
For example, in the second row of Figure 10, with only one eyebrow masked, the generated eyebrow is restricted to have a similar shape and size to, and a reasonable position relative to, the other eyebrow.
Therefore, the variations in the appearance of the generated eyebrow are mainly reflected in some details, such as the shade of the eyebrow.
4.4. Face Recognition
The identity distance in Table 3 partly reveals the network's ability to preserve identity information.
In order to test to what extent the face identity can be preserved across its different examples, we evaluate our completion results on the task of face recognition.
Note that this task simulates occluded face recognition, which is still an open problem in computer vision.
Given a probe face example, the goal of recognition is to find an example from the gallery set that belongs to the same identity.
We randomly split the CelebA [15] test dataset into the gallery and probe set, to make sure that each identity has roughly the same amount of images in each set.
Finally, we obtain the gallery and probe sets with roughly 10,000 images each, covering about 1,000 identities.
We apply six masking types (O1-O6) to each probe image, as shown in Figure 8.
The probe images are new faces restored by the generator. These six masking types, to some extent, simulate the occlusions that may occur in real scenarios.
For example, masking two eyes mainly corresponds to occlusion by glasses, and masking the lower half of the face matches the case of wearing a scarf.
Each completed probe image is matched against those in the gallery, and top ranked matches can be analyzed to measure recognition performance.
We use the OpenFace [1] toolbox to find the top K nearest matches based on the identity distance, and report the average top-K recognition accuracy over all probe images in Figure 11.
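A minimal sketch of this top-K protocol over precomputed embeddings; the function and variable names are our own.

```python
# Top-K recognition accuracy: a probe counts as correct if any of its K
# nearest gallery faces (by identity distance) shares its identity.
import numpy as np

def top_k_accuracy(probe_feats, probe_ids, gallery_feats, gallery_ids, k=5):
    correct = 0
    for f, pid in zip(probe_feats, probe_ids):
        d = np.sum((gallery_feats - f) ** 2, axis=1)   # identity distances
        nearest = np.argsort(d)[:k]                    # K closest gallery faces
        correct += int(pid in set(gallery_ids[i] for i in nearest))
    return correct / len(probe_ids)
```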
We carry out experiments with four variations of the probe image: the original one, and the completed ones obtained by simply filling in random noise, by our reconstruction-based model M1, and by our final model M5.
The recognition performance using original probe faces is regarded as the upper bound. Figure 11 shows that using probes completed by our model M5 (green) achieves the performance closest to the upper bound (blue).
Although there is still a large gap between the performance of our M5-based recognition and the upper bound, especially when the mask is large (e.g., O1, O2), the proposed algorithm achieves significant improvement with the completion results compared with either noise filling or the reconstruction loss (Lr).
We consider identity-preserving completion an interesting direction to pursue.