The author of this paper published Generative Image Inpainting with Contextual Attention at CVPR 2018; reading the two papers together helps to trace how the author's thinking on this problem evolved over the course of a year. The network structure here is exactly the same as in the 2018 CVPR paper; only the basic convolution and the GAN training scheme are changed.
[paper] : Free-Form Image Inpainting with Gated Convolution (2019 ICCV)
Generative Image Inpainting with Contextual Attention (2018 CVPR)
[github] : An open source framework for generative image inpainting task, with the support of Contextual Attention (CVPR 2018) and Gated Convolution (ICCV 2019 Oral).
Abstract
We present a generative image inpainting system to complete images with free-form mask and guidance.
This one sentence states what the paper is about.
The system is based on gated convolutions learned from millions of images without additional labelling efforts. The proposed gated convolution solves the issue of vanilla convolution that treats all input pixels as valid ones, generalizes partial convolution by providing a learnable dynamic feature selection mechanism for each channel at each spatial location across all layers. Moreover, as free-form masks may appear anywhere in images with any shape, global and local GANs designed for a single rectangular mask are not applicable. Thus, we also present a patch-based GAN loss, named SN-PatchGAN, by applying spectral-normalized discriminator on dense image patches. SN-PatchGAN is simple in formulation, fast and stable in training.
The abstract is organized as follows:
It first gives the paper's first 1) contribution, 2) mechanism, and 3) problem solved: 1) gated convolution; 2) providing a learnable dynamic feature selection mechanism; 3) solving the problem of vanilla convolution, which treats all input pixels as equally valid.
It then gives the second contribution in the same 1) contribution, 2) mechanism, 3) problem pattern: 1) spectral-normalized PatchGAN (SN-PatchGAN); 2) applying a spectral-normalized discriminator on dense image patches; 3) addressing the fact that, since free-form masks may appear anywhere in an image with any shape, global and local GANs designed for a single rectangular mask are not applicable.
Results on automatic image inpainting and user-guided extension demonstrate that our system generates higher-quality and more flexible results than previous methods. Our system helps user quickly remove distracting objects, modify image layouts, clear watermarks and edit faces.
Finally, the experimental conclusions and applications.
Image inpainting (a.k.a. image completion or image hole-filling) is a task of synthesizing alternative contents in missing regions such that the modification is visually realistic and semantically correct. It allows to remove distracting objects or retouch undesired regions in photos. It can also be extended to tasks including image/video un-cropping, rotation, stitching, re-targeting, re-composition, compression, super-resolution, harmonization and many others.
First paragraph: what image inpainting is.
In computer vision, two broad approaches to image inpainting exist: patch matching using low-level image features and feed-forward generative models with deep convolutional networks. The former approach [3, 8, 9] can synthesize plausible stationary textures, but usually makes critical failures in non-stationary cases like complicated scenes, faces and objects. The latter approach [15, 49, 45, 46, 38, 37, 48, 26, 52, 33, 35, 19] can exploit semantics learned from large scale datasets to synthesize contents in nonstationary images in an end-to-end fashion.
However, deep generative models based on vanilla convolutions are naturally ill-fitted for image hole-filling because the spatially shared convolutional filters treat all input pixels or features as same valid ones. For hole-filling, the input to each layer is composed of valid pixels/features outside holes and invalid ones in masked regions. Vanilla convolutions apply the same filters on all valid, invalid and mixed (for example, the ones on hole boundary) pixels/features, leading to visual artifacts such as color discrepancy, blurriness and obvious edge responses surrounding holes when tested on free-form masks [15, 49].
Second and third paragraphs: the two traditional families of methods and their respective strengths and weaknesses:
patch matching: strength: it can synthesize plausible stationary textures; weakness: it usually fails badly in non-stationary cases such as complicated scenes, faces and objects.
feed-forward generative models: strength: they can exploit semantics learned from large-scale datasets to synthesize content in non-stationary images in an end-to-end fashion; weakness: models based on vanilla convolution are naturally ill-suited to image hole-filling, because the spatially shared convolutional filters treat all input pixels or features as equally valid, which leads to visual artifacts such as color discrepancy, blurriness and obvious edge responses around the holes.
To address this limitation, recently partial convolution [23] is proposed where the convolution is masked and normalized to be conditioned only on valid pixels. It is then followed by a rule-based mask-update step to update valid locations for next layer. Partial convolution categorizes all input locations to be either invalid or valid, and multiplies a zero-or-one mask to inputs throughout all layers. The mask can also be viewed as a single un-learnable feature gating channel. However, this assumption has several limitations. First, considering the input spatial locations across different layers of a network, they may include (1) valid pixels in input image, (2) masked pixels in input image, (3) neurons with receptive field covering no valid pixel of input image, (4) neurons with receptive field covering different number of valid pixels of input image (these valid image pixels may also have different relative locations), and (5) synthesized pixels in deep layers. Heuristically categorizing all locations to be either invalid or valid ignores this important information. Second, if we extend to user-guided image inpainting where users provide sparse sketch inside the mask, should these pixel locations be considered as valid or invalid? How to properly update the mask for next layer? Third, for partial convolution the “invalid” pixels will progressively disappear layer by layer and the rule-based mask will be all ones in deep layers. However, to synthesize pixels in hole these deep layers may also need the information of whether current locations are inside or outside the hole. The partial convolution with all-ones mask cannot provide such information. We will show that if we allow the network to learn the mask automatically, the mask may have different values based on whether current locations are masked or not in input image, even in deep layers.
Fourth paragraph: the strengths and weaknesses of the most recent algorithm, the one most closely related to this paper (its weaknesses are exactly what this paper sets out to fix).
partial convolution:
1) What partial convolution is: see my post MyDLNote: Partial Conv. or the original paper, partial convolution [2018 ECCV].
2) Its weaknesses:
First: considering the input spatial locations passed across the different layers of a network, they may include (1) valid pixels of the input image; (2) masked pixels of the input image; (3) neurons whose receptive field covers no valid pixel of the input image; (4) neurons whose receptive field covers different numbers of valid pixels of the input image (and these valid pixels may also sit at different relative locations); and (5) synthesized pixels in deep layers. Heuristically classifying every location as either invalid or valid throws this important information away.
Second: if we extend to user-guided inpainting, where the user provides a sparse sketch inside the mask, should those pixel locations be treated as valid or invalid? How should the mask be properly updated for the next layer?
Third: with partial convolution, the "invalid" pixels disappear progressively layer by layer, and the rule-based mask becomes all ones in the deep layers. However, to synthesize pixels inside the hole, these deep layers may still need to know whether the current location is inside or outside the hole, and a partial convolution with an all-ones mask cannot provide that information. The paper shows that if the network is allowed to learn the mask automatically, the mask can take different values depending on whether the current location was masked in the input image, even in deep layers.
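To make the mechanism being criticized concrete, here is a minimal PyTorch sketch of the partial convolution idea (a simplified re-implementation of [23] for illustration only; the official layer also handles the bias term separately):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    """Simplified partial convolution [23]: the output is conditioned only
    on valid (mask == 1) pixels, followed by the rule-based mask update."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding,
                              bias=False)
        # Fixed all-ones kernel, used only to count valid pixels per window.
        self.register_buffer("ones",
                             torch.ones(1, 1, kernel_size, kernel_size))
        self.stride, self.padding = stride, padding

    def forward(self, x, mask):
        # Zero out the holes so the convolution only sees valid pixels.
        out = self.conv(x * mask)
        with torch.no_grad():
            # Number of valid input pixels under each sliding window.
            valid = F.conv2d(mask, self.ones,
                             stride=self.stride, padding=self.padding)
        # Re-normalize by the fraction of valid pixels; zero where none.
        out = out * (self.ones.numel() / valid.clamp(min=1.0))
        out = out * (valid > 0).float()
        # Rule-based mask update: a location becomes "valid" once its window
        # sees any valid pixel, hence the mask turns all-ones in deep layers.
        return out, (valid > 0).float()
```

The last line makes the third criticism visible: after enough layers `valid > 0` holds everywhere, so the updated mask degenerates to all ones and no longer carries any hole information.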
We propose gated convolution for free-form image inpainting. It learns a dynamic feature gating mechanism for each channel and each spatial location (for example, inside or outside masks, RGB channels or user-guidance channels). Specifically we consider the formulation where the input feature is firstly used to compute gating values $g = \sigma(w_g x)$ ($\sigma$ is sigmoid function, $w_g$ is learnable parameter). The final output is a multiplication of learned feature and gating values $y = \phi(wx) \odot g$, where $\phi$ can be any activation function. Gated convolution is easy to implement and performs significantly better when (1) the masks have arbitrary shapes and (2) the inputs are no longer simply RGB channels with a mask but also have conditional inputs like sparse sketch. For network architectures, we stack gated convolution to form an encoder-decoder network following [49]. Our inpainting network also integrates contextual attention module within same refinement network [49] to better capture long-range dependencies.
Fifth paragraph: the paper's most important tool, gated convolution, plus the network architecture.
1) Gated convolution
Gated convolution has two branches: a gating branch and a main (feature) branch.
Gating branch: one convolution layer + a sigmoid: $\mathrm{Gating}_{y,x} = \sum\sum W_g \cdot I$
Main branch: one convolution layer + an activation function: $\mathrm{Feature}_{y,x} = \sum\sum W_f \cdot I$
Combining the two branches gives the gated convolution, an element-wise product over all elements (spatial and channel), somewhat like mixed attention without the residual connection: $O_{y,x} = \phi(\mathrm{Feature}_{y,x}) \odot \sigma(\mathrm{Gating}_{y,x})$
2) Network architecture
The gated convolutions are stacked in an encoder-decoder fashion.
The overall network structure is similar to the author's 2018 CVPR architecture; the (dilated) convolutions of the original network are simply replaced by gated (dilated) convolutions.
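Since the layer is so simple, it is worth writing out. Below is a minimal PyTorch sketch of one gated convolution layer following the formulas above (the official implementation is not PyTorch; the class and argument names here are my own):

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Gated convolution: O = phi(Feature) * sigmoid(Gating), where both
    branches are plain convolutions applied to the same input."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1,
                 padding=1, dilation=1):
        super().__init__()
        # Main (feature) branch and gating branch share the same geometry.
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                                 padding, dilation)
        self.gating = nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                                padding, dilation)
        self.phi = nn.ELU()  # phi can be any activation; ELU is used here

    def forward(self, x):
        # A soft, learnable gate in [0, 1] per channel and per spatial
        # location replaces the hard 0/1 mask of partial convolution.
        return self.phi(self.feature(x)) * torch.sigmoid(self.gating(x))
```

In practice the two branches can be fused into a single convolution with twice the output channels that is then split in half; the two-branch form above simply mirrors the equations more directly.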
Without compromise of performance, we also significantly simplify training objectives as two terms: a pixelwise reconstruction loss and an adversarial loss. The modification is mainly designed for free-form image inpainting. As the holes may appear anywhere in images with any shape, global and local GANs [15] designed for a single rectangular mask are not applicable. Instead, we propose a variant of generative adversarial networks, named SN-PatchGAN, motivated by global and local GANs [15], MarkovianGANs [21], perceptual loss [17] and recent work on spectral-normalized GANs [24]. The discriminator of SN-PatchGAN directly computes hinge loss on each point of the output map with format $\mathbb{R}^{h \times w \times c}$, formulating $h \times w \times c$ number of GANs focusing on different locations and different semantics (represented in different channels). SN-PatchGAN is simple in formulation, fast and stable in training and produces high-quality inpainting results.
Sixth paragraph: the paper's second major contribution, SN-PatchGAN.
SN-PatchGAN has two key points:
1) The discriminator directly computes a hinge loss at every point of its output map, which has shape $\mathbb{R}^{h \times w \times c}$; that is, the output is a 3D tensor, amounting to $h \times w \times c$ GAN critics that focus on different locations and different semantics (represented in different channels).
See Figure 3 and this is clear at a glance: the earliest discriminators output a single scalar, PatchGAN later used a 2D tensor output, and this paper proposes a 3D output. It is an interesting innovation, because it combines the principle of a GAN with the principle of feature matching.
2) Spectral normalization is applied.
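As an illustration, here is a minimal PyTorch sketch of such a discriminator. The six-layer, kernel-5, stride-2 layout follows the paper's description, but the exact channel widths are my assumption and should be checked against the official code:

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def sn_conv(in_ch, out_ch):
    # Spectral normalization constrains the discriminator's Lipschitz
    # constant, which stabilizes GAN training.
    return spectral_norm(
        nn.Conv2d(in_ch, out_ch, kernel_size=5, stride=2, padding=2))

class SNPatchGANDiscriminator(nn.Module):
    """Fully-convolutional discriminator whose output is a 3D map in
    R^{h x w x c}: every point acts as one GAN critic."""
    def __init__(self, in_ch=5):  # e.g. RGB + mask + sketch channels
        super().__init__()
        widths = [64, 128, 256, 256, 256, 256]  # assumed channel widths
        layers, prev = [], in_ch
        for w in widths:
            layers += [sn_conv(prev, w), nn.LeakyReLU(0.2, inplace=True)]
            prev = w
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # No global pooling or fully-connected head: the raw h x w x c
        # map is returned and the hinge loss is applied at every point.
        return self.net(x)
```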
For practical image inpainting tools, enabling user interactivity is crucial because there could exist many plausible solutions for filling a hole in an image. To this end, we present an extension to allow user sketch as guided input. Comparison to other methods is summarized in Table 1. Our main contributions are as follows: (1) We introduce gated convolution to learn a dynamic feature selection mechanism for each channel at each spatial location across all layers, significantly improving the color consistency and inpainting quality of free-form masks and inputs. (2) We present a more practical patch-based GAN discriminator, SN-PatchGAN, for free-form image inpainting. It is simple, fast and produces high-quality inpainting results. (3) We extend our inpainting model to an interactive one, enabling user sketch as guidance to obtain more user-desired inpainting results. (4) Our proposed inpainting system achieves higher-quality free-form inpainting than previous state of the arts on benchmark datasets including Places2 natural scenes and CelebA-HQ faces. We show that the proposed system helps user quickly remove distracting objects, modify image layouts, clear watermarks and edit faces in images.
Table 1: Comparison of different approaches including PatchMatch [3], Global&Local [15], ContextAttention [49], PartialConv [23] and our approach. The comparison of image inpainting is based on four dimensions: Semantic Understanding, Non-Local Algorithm, Free-Form Masks and User-Guided Option.
Seventh paragraph: summary of the paper's contributions.
First, the method supports user interaction, since many plausible solutions may exist for filling a given hole in an image. A feature comparison with other methods is summarized in Table 1.
The main contributions are as follows:
(1) Gated convolution is introduced to learn a dynamic feature selection mechanism for each channel at each spatial location across all layers, significantly improving color consistency and inpainting quality for free-form masks and inputs.
(2) A more practical patch-based GAN discriminator, SN-PatchGAN, is presented for free-form image inpainting. It is simple, fast, and produces high-quality inpainting results.
(3) The inpainting model is extended to an interactive one, using user sketches as guidance to obtain results closer to what the user intends.
(4) The proposed inpainting system achieves higher-quality free-form inpainting than the previous state of the art on benchmark datasets including Places2 natural scenes and CelebA-HQ faces. The paper shows that the system helps users quickly remove distracting objects, modify image layouts, clear watermarks and edit faces in images.
The method itself can be explained with just a few figures, so instead of translating the original sentence by sentence, the figures below tell the whole story.
Block diagram of gated convolution
Figure 2: Illustration of partial convolution (left) and gated convolution (right).
This figure gives a direct, side-by-side comparison of partial convolution and gated convolution.
In formulas:
$$\mathrm{Gating}_{y,x} = \sum\sum W_g \cdot I$$
$$\mathrm{Feature}_{y,x} = \sum\sum W_f \cdot I$$
$$O_{y,x} = \phi(\mathrm{Feature}_{y,x}) \odot \sigma(\mathrm{Gating}_{y,x})$$
Two points about SN-PatchGAN:
1) 3D output: first, the paper targets free-form inpainting, where multiple holes of any shape may appear at any location (so spatially, a PatchGAN is needed); second, in the discriminator different channels express different features (so along the channel dimension a PatchGAN is needed as well).
2) Spectral normalization is adopted, which makes GAN training more stable and faster to converge.
Finally, the author states that the final loss is the L1 reconstruction loss plus the SN-PatchGAN loss, weighted 1 : 1.
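A sketch of that final objective, using the hinge formulation of SN-PatchGAN (the helper names are mine):

```python
import torch.nn.functional as F

def discriminator_loss(real_logits, fake_logits):
    # Hinge loss, applied at every point of the h x w x c output map.
    return (F.relu(1.0 - real_logits).mean()
            + F.relu(1.0 + fake_logits).mean())

def generator_loss(fake_logits, completed, target):
    # Final objective: pixel-wise L1 reconstruction + SN-PatchGAN loss,
    # combined at the 1 : 1 ratio stated in the paper.
    l1 = F.l1_loss(completed, target)
    adv = -fake_logits.mean()
    return l1 + adv
```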
There is not much to say about the network architecture: it runs in two stages, first a coarse result and then a refined result.
Figure 3: Overview of our framework with gated convolution and SN-PatchGAN for free-form image inpainting.
Worth singling out here is the Contextual Attention module; see Generative Image Inpainting with Contextual Attention for the details.
Illustration of the contextual attention layer. Firstly we use convolution to compute matching score of foreground patches with background patches (as convolutional filters). Then we apply softmax to compare and get attention score for each pixel. Finally we reconstruct foreground patches with background patches by performing deconvolution on attention score. The contextual attention layer is differentiable and fully-convolutional.
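The following heavily simplified, batch-size-1 PyTorch sketch mirrors the three steps in that caption (convolution for matching, softmax for attention, deconvolution for reconstruction). The real layer additionally excludes hole patches from the background set, propagates attention scores for coherency, and matches at a downscaled resolution:

```python
import torch
import torch.nn.functional as F

def contextual_attention(foreground, background, mask,
                         patch=3, softmax_scale=10.0):
    """foreground/background: (1, C, H, W) features; mask: (1, 1, H, W),
    1 inside the hole. Returns hole features reconstructed from background."""
    c = background.shape[1]
    # 1. Extract all background patches and reshape them into L filters
    #    of shape (C, patch, patch).
    patches = F.unfold(background, patch, padding=patch // 2)  # (1, C*p*p, L)
    filters = patches.transpose(1, 2).reshape(-1, c, patch, patch)

    # 2. Matching as convolution: cosine similarity of each foreground
    #    location with every (L2-normalized) background patch.
    norm = filters.flatten(1).norm(dim=1).clamp(min=1e-4)
    scores = F.conv2d(foreground, filters / norm[:, None, None, None],
                      padding=patch // 2)                      # (1, L, H, W)

    # 3. Softmax over the L background patches -> per-pixel attention;
    #    only hole pixels need to borrow content.
    attn = F.softmax(scores * softmax_scale, dim=1) * mask

    # 4. Reconstruction as deconvolution: paste back the raw (un-normalized)
    #    background patches weighted by attention; overlaps average out.
    return F.conv_transpose2d(attn, filters, padding=patch // 2) / patch ** 2
```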
The supplementary material contains a rather interesting analysis: a visualization of the features at different layers of the network, shown in the figure below:
Figure 4: Comparisons of gated convolution and partial convolution with visualization and interpretation of learned gating values. We first show our inpainting network architecture based on [4] by replacing all convolutions with gated convolutions in the 1st row. Note that for simplicity, the following refinement network in [4] is ignored in the figure. With same settings, we train two models based on gated convolution and partial convolution separately. We then directly visualize intermediate un-normalized gating values in the 2nd row. The values differ mainly based on three parts: background, mask and sketch. In the 3rd row, we provide an interpretation based on which part(s) have higher gating values. Interestingly we also find that for some channels (e.g. channel-31 of the layer after dilated convolution), the learned gating values are based on foreground/background semantic segmentation. For comparison, we also visualize the un-learnable fixed binary mask M of partial convolution in the 4th row.