Let $X_s$ denote the source image, which provides the identity, and $X_t$ the target image, which provides the attributes (pose, expression, scene lighting, and background).
FaceShifter has two stages. In stage 1, the Adaptive Embedding Integration Network (AEI-Net) generates a high-fidelity face swapping result $\hat{Y}_{s,t}$. In stage 2, the Heuristic Error Acknowledging Network (HEAR-Net) handles facial occlusions and produces a more refined result $Y_{s,t}$.
As shown in Figure 3(a), the stage 1 network consists of three parts.
Identity Encoder
The latest face recognition model (from reference [13]) is used, and the last feature vector generated before the final FC layer is taken as the identity embedding.
Multi-level Attributes Encoder
Face attributes, such as pose, expression, lighting, and background, require more spatial information than identity.
To preserve this attribute information, multi-level feature maps are used as the attribute embedding (previous work compresses the attributes into a single vector).
Specifically, $X_t$ is fed into a U-Net-like network, and the feature map of every decoder layer is collected to form $\bm{z}_{att}(X_t)$:
$$\bm{z}_{att}(X_t)=\left\{ \bm{z}_{att}^1(X_t), \bm{z}_{att}^2(X_t), \cdots, \bm{z}_{att}^n(X_t) \right\} \qquad(1)$$
where $\bm{z}_{att}^k(X_t)$ is the feature map output by the $k$-th layer of the U-Net decoder.
Notably, the Multi-level Attributes Encoder needs no attribute annotations; it learns to extract attribute information through self-supervised training.
With the attribute embedding defined, we require the face swapping result $\hat{Y}_{s,t}$ to have the same attribute embedding as the target image $X_t$.
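The multi-level collection in Eq. (1) can be sketched as below. This is a toy NumPy stand-in and not the network from the paper: `down` and `up` are hypothetical pooling and upsampling helpers used in place of learned convolutions, the skip connection is a plain channel concatenation, and `depth` is an arbitrary choice.

```python
import numpy as np

def down(x):
    """Halve the spatial size with 2x2 average pooling (stand-in for a strided conv)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def up(x):
    """Double the spatial size with nearest-neighbour upsampling."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def attribute_embedding(x_t, depth=3):
    """Return the list [z_att^1, ..., z_att^n] of decoder feature maps for x_t."""
    # Encoder pass: keep every intermediate map for the skip connections.
    skips = [x_t]
    for _ in range(depth):
        skips.append(down(skips[-1]))
    # Decoder pass: start at the bottleneck and collect one map per level.
    h = skips.pop()
    z_att = [h]
    while skips:
        h = np.concatenate([up(h), skips.pop()], axis=0)  # U-Net style skip
        z_att.append(h)
    return z_att

x_t = np.random.default_rng(0).random((3, 32, 32))
z_att = attribute_embedding(x_t)
shapes = [z.shape for z in z_att]
```

Each entry of `z_att` has a different resolution, which is what lets the embedding keep spatial detail that a single vector would lose.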
Adaptive Attentional Denormalization Generator
This step integrates the two embeddings $\bm{z}_{id}(X_s)$ and $\bm{z}_{att}(X_t)$ to generate the face swapping result $\hat{Y}_{s,t}$.
Previous work uses feature concatenation, which produces blurry results. We therefore propose Adaptive Attentional Denormalization (AAD), which integrates the embeddings in an adaptive fashion.
Let $\bm{h}_{in}^k$ denote the input of the AAD layer. First, instance normalization is applied to $\bm{h}_{in}^k$:
$$\bar{\bm{h}}_k=\frac{\bm{h}_{in}^k-\bm{\mu}^k}{\bm{\sigma}^k} \qquad(2)$$
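A minimal NumPy sketch of the normalization in Eq. (2), assuming a single activation of shape `(C, H, W)` with statistics taken per channel over the spatial positions. The small `eps` term is an assumption added for numerical stability and does not appear in the equation.

```python
import numpy as np

def instance_norm(h_in, eps=1e-5):
    """Normalize each channel of a (C, H, W) activation over its spatial positions."""
    mu = h_in.mean(axis=(1, 2), keepdims=True)
    sigma = h_in.std(axis=(1, 2), keepdims=True)
    return (h_in - mu) / (sigma + eps)  # eps is an assumption, not part of Eq. (2)

h_in = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(4, 8, 8))
h_bar = instance_norm(h_in)
```

After this step every channel of `h_bar` has approximately zero mean and unit standard deviation, regardless of the scale of `h_in`.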
Step 1: attributes embedding integration
The AAD layer receives $\bm{z}_{att}^k\in\mathbb{R}^{C_{att}^k\times H^k\times W^k}$ as input and applies convolutions to it to obtain $\gamma_{att}^k, \beta_{att}^k\in\mathbb{R}^{C^k\times H^k\times W^k}$.
$\gamma_{att}^k$ and $\beta_{att}^k$ are then used to denormalize the normalized $\bar{\bm{h}}_k$, giving the attribute activation $\bm{A}^k$:
$$\bm{A}^k=\gamma_{att}^k\otimes\bar{\bm{h}}_k+\beta_{att}^k \qquad(3)$$
Step 2: identity embedding integration
The identity embedding $\bm{z}_{id}(X_s)$ is extracted from $X_s$ and passed through FC layers to obtain $\gamma_{id}^k, \beta_{id}^k\in\mathbb{R}^{C^k}$.
The normalized $\bar{\bm{h}}_k$ is denormalized in the same way, giving the identity activation $\bm{I}^k$:
$$\bm{I}^k=\gamma_{id}^k\otimes\bar{\bm{h}}_k+\beta_{id}^k \qquad(4)$$
Step 3: adaptive attention mask
A conv + sigmoid operation on $\bar{\bm{h}}_k$ learns an attentional mask $\bm{M}^k$, which is then used to combine $\bm{A}^k$ and $\bm{I}^k$:
$$\bm{h}_{out}^k=\left( 1-\bm{M}^k \right)\otimes\bm{A}^k+\bm{M}^k\otimes\bm{I}^k \qquad(5)$$
Figure 3(c) illustrates the three steps above. Stacking several AAD layers yields the AAD ResBlk shown in Figure 3(b).
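The three steps can be put together in one forward pass. The NumPy sketch below rests on several assumptions that the text does not fix: the convolutions are 1x1 (implemented as per-pixel channel mixing), the mask has a single channel, and the parameter dictionary `p` with its key names is hypothetical, filled with random weights purely to check shapes.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1x1 convolution on a (C_in, H, W) map: a per-pixel linear map over channels."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aad_layer(h_in, z_att, z_id, p, eps=1e-5):
    """One AAD layer: h_in (C,H,W), z_att (C_att,H,W), z_id (D,) -> h_out (C,H,W), mask."""
    # Eq. (2): instance normalization per channel.
    mu = h_in.mean(axis=(1, 2), keepdims=True)
    sigma = h_in.std(axis=(1, 2), keepdims=True)
    h_bar = (h_in - mu) / (sigma + eps)
    # Eq. (3): spatially varying gamma/beta computed from the attribute map.
    a = conv1x1(z_att, p["w_ga"], p["b_ga"]) * h_bar + conv1x1(z_att, p["w_ba"], p["b_ba"])
    # Eq. (4): per-channel gamma/beta from the identity vector, broadcast over H x W.
    gamma_id = (p["w_gi"] @ z_id + p["b_gi"])[:, None, None]
    beta_id = (p["w_bi"] @ z_id + p["b_bi"])[:, None, None]
    i = gamma_id * h_bar + beta_id
    # Eq. (5): a mask in (0, 1) blends the attribute and identity activations.
    m = sigmoid(conv1x1(h_bar, p["w_m"], p["b_m"]))
    return (1.0 - m) * a + m * i, m

rng = np.random.default_rng(0)
C, C_att, D, H, W = 4, 6, 16, 8, 8
p = {
    "w_ga": rng.normal(size=(C, C_att)), "b_ga": np.zeros(C),
    "w_ba": rng.normal(size=(C, C_att)), "b_ba": np.zeros(C),
    "w_gi": rng.normal(size=(C, D)), "b_gi": np.zeros(C),
    "w_bi": rng.normal(size=(C, D)), "b_bi": np.zeros(C),
    "w_m": rng.normal(size=(1, C)), "b_m": np.zeros(1),
}
h_out, m = aad_layer(rng.normal(size=(C, H, W)), rng.normal(size=(C_att, H, W)),
                     rng.normal(size=D), p)
```

Note the difference in shape between the two modulations: the attribute parameters vary per pixel, while the identity parameters are one scalar per channel. Where the mask is close to 1 the output follows the identity activation, and where it is close to 0 it follows the attribute activation.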
Training Losses
First, a multi-scale discriminator is set up, giving the adversarial loss $\mathcal{L}_{adv}$.
Next, the identity preservation loss $\mathcal{L}_{id}$ is defined as:
$$\mathcal{L}_{id}=1-\cos\left( \bm{z}_{id}\left( \hat{Y}_{s,t} \right), \bm{z}_{id}\left( X_s \right) \right) \qquad(6)$$
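Eq. (6) is one minus the cosine similarity of two embedding vectors. A short NumPy sketch, with made-up vectors standing in for the outputs of the identity encoder:

```python
import numpy as np

def identity_loss(z_swap, z_src):
    """1 - cosine similarity between two identity embeddings, as in Eq. (6)."""
    cos = np.dot(z_swap, z_src) / (np.linalg.norm(z_swap) * np.linalg.norm(z_src))
    return 1.0 - cos

z_src = np.array([1.0, 2.0, 3.0])
same = identity_loss(2.0 * z_src, z_src)                          # parallel vectors
orth = identity_loss(np.array([1.0, 0.0]), np.array([0.0, 1.0]))  # orthogonal vectors
```

The loss is close to 0 when the two embeddings point in the same direction, whatever their magnitudes, and equals 1 when they are orthogonal.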
Then the attributes preservation loss $\mathcal{L}_{att}$ is defined as:
$$\mathcal{L}_{att}=\frac{1}{2}\sum_{k=1}^{n}\left\| \bm{z}_{att}^k\left( \hat{Y}_{s,t} \right) - \bm{z}_{att}^k\left( X_t \right) \right\|_2^2 \qquad(7)$$
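The attributes preservation loss above sums squared differences over all levels of the embedding. A small sketch, assuming each embedding is a Python list of NumPy arrays with matching shapes per level (the toy arrays are placeholders, not real features):

```python
import numpy as np

def attribute_loss(z_att_swap, z_att_tgt):
    """Half the sum of squared L2 distances between per-level feature maps."""
    return 0.5 * sum(float(np.sum((a - b) ** 2)) for a, b in zip(z_att_swap, z_att_tgt))

tgt = [np.zeros((2, 4, 4)), np.zeros((3, 8, 8))]
swap = [np.ones((2, 4, 4)), np.zeros((3, 8, 8))]  # differs only at the first level
loss = attribute_loss(swap, tgt)
```

Here only the first level contributes: 32 entries each differ by 1, so the loss is 0.5 × 32 = 16.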
During training, $X_t=X_s$ is used for 80% of the samples, and the reconstruction loss $\mathcal{L}_{rec}$ is defined as:
$$\mathcal{L}_{rec}=\begin{cases} \frac{1}{2}\left\| \hat{Y}_{s,t}-X_t \right\|_2^2 & \text{if}\ X_t=X_s\\ 0 & \text{otherwise} \end{cases} \qquad(8)$$
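The piecewise definition in Eq. (8) amounts to switching the pixel loss on only for same-image pairs. In the sketch below, `same_pair` is a hypothetical flag that the data loader would supply when the source and target are the same image.

```python
import numpy as np

def reconstruction_loss(y_hat, x_t, same_pair):
    """Half the squared L2 pixel distance when X_t = X_s, and 0 otherwise."""
    if not same_pair:
        return 0.0
    return 0.5 * float(np.sum((y_hat - x_t) ** 2))

x_t = np.full((3, 2, 2), 0.5)
y_hat = x_t + 0.1
active = reconstruction_loss(y_hat, x_t, same_pair=True)
inactive = reconstruction_loss(y_hat, x_t, same_pair=False)
```

For a swap between different images there is no ground truth to compare against, which is why the term is zero in that case.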
The full loss for AEI-Net is:
$$\mathcal{L}_{{\rm AEI\text{-}Net}}=\mathcal{L}_{adv}+\lambda_{att}\mathcal{L}_{att}+\lambda_{id}\mathcal{L}_{id}+\lambda_{rec}\mathcal{L}_{rec} \qquad(9)$$
where $\lambda_{att}=\lambda_{rec}=10$ and $\lambda_{id}=5$.
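As a quick arithmetic check of the weighting in Eq. (9), the sketch below combines four scalar loss values with the weights stated above. The individual loss values are invented placeholders, not measurements.

```python
def aei_net_loss(l_adv, l_att, l_id, l_rec, lam_att=10.0, lam_id=5.0, lam_rec=10.0):
    """Weighted sum of the four AEI-Net loss terms, as in Eq. (9)."""
    return l_adv + lam_att * l_att + lam_id * l_id + lam_rec * l_rec

# Placeholder values: 0.7 + 10 * 0.02 + 5 * 0.3 + 10 * 0.01
total = aei_net_loss(l_adv=0.7, l_att=0.02, l_id=0.3, l_rec=0.01)
```

With these placeholder numbers the identity term contributes 1.5 of the total 2.5, which shows how the weights rescale terms of different raw magnitude.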
The images generated by stage 1 preserve the target attributes well, but they fail to preserve occlusions present in $X_t$.
Previous work trains an additional face segmentation network. The drawbacks are that it requires occlusion annotations and generalizes poorly to unseen types of occlusion.
In an experiment, take $X_t$ to be an image containing an occlusion and set $X_s=X_t$, so that the reconstruction is $\hat{Y}_{tt}={\rm AEI\text{-}Net}(X_t, X_t)$. We observe that the occlusion, which should have been reconstructed, disappears in $\hat{Y}_{tt}$. Comparing $\hat{Y}_{tt}$ with $X_t$ therefore reveals which regions of the image are occluded.
The heuristic error is defined as:
$$\Delta Y_t=X_t-{\rm AEI\text{-}Net}(X_t, X_t) \qquad(10)$$
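The idea behind Eq. (10) can be illustrated with a toy example. `toy_aei_net` below is a hypothetical stand-in for the trained network: it reproduces the target everywhere except in a fixed patch, mimicking the observation that occlusions vanish in the self-reconstruction.

```python
import numpy as np

def heuristic_error(x_t, aei_net):
    """Eq. (10): the target image minus its self-reconstruction."""
    return x_t - aei_net(x_t, x_t)

def toy_aei_net(x_s, x_t):
    """Hypothetical stand-in: copies x_t but wipes a 2x2 'occluded' patch."""
    y = x_t.copy()
    y[:, 2:4, 2:4] = 0.0
    return y

x_t = np.ones((3, 6, 6))
delta = heuristic_error(x_t, toy_aei_net)
occluded = np.abs(delta).sum(axis=0) > 0  # pixels where the reconstruction failed
```

The error is nonzero only inside the wiped patch, so it acts as a free occlusion cue that requires no segmentation labels.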
As shown in Figure 4(b), HEAR-Net is essentially a U-Net. It takes $\Delta Y_t$ and $\hat{Y}_{s,t}$ as input and outputs the final face swapping result $Y_{s,t}$:
$$Y_{s,t}={\rm HEAR\text{-}Net}\left( \hat{Y}_{s,t}, \Delta Y_t \right) \qquad(11)$$
HEAR-Net is trained with three loss terms.
The first is the identity preservation loss $\mathcal{L}_{id}'$:
$$\mathcal{L}_{id}'=1-\cos\left( \bm{z}_{id}\left( Y_{s,t} \right), \bm{z}_{id}\left( X_s \right) \right) \qquad(12)$$
The second is the change loss $\mathcal{L}_{chg}'$:
$$\mathcal{L}_{chg}'=\left| \hat{Y}_{s,t}-Y_{s,t} \right| \qquad(13)$$
The third is the reconstruction loss $\mathcal{L}_{rec}'$:
$$\mathcal{L}_{rec}'=\begin{cases} \frac{1}{2}\left\| Y_{s,t}-X_t \right\|_2^2 & \text{if}\ X_t=X_s\\ 0 & \text{otherwise} \end{cases} \qquad(14)$$
The overall loss is the sum of the three:
$$\mathcal{L}_{{\rm HEAR\text{-}Net}}=\mathcal{L}_{rec}'+\mathcal{L}_{id}'+\mathcal{L}_{chg}' \qquad(15)$$
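The three HEAR-Net terms can be combined as in the sketch below. Two choices are assumptions: the change loss is summed over all pixels (the equation leaves the reduction implicit), and `same_pair` is a hypothetical flag marking pairs with the same source and target. The identity embeddings are passed in directly, since the encoder itself is outside the scope of this sketch.

```python
import numpy as np

def hear_net_loss(y, y_hat, x_t, z_id_y, z_id_src, same_pair):
    """Unweighted sum of the identity, change, and reconstruction terms, as in Eq. (15)."""
    # Eq. (12): identity preservation on the refined result.
    cos = np.dot(z_id_y, z_id_src) / (np.linalg.norm(z_id_y) * np.linalg.norm(z_id_src))
    l_id = 1.0 - cos
    # Eq. (13): L1 distance keeps the refined result close to the stage 1 output.
    l_chg = float(np.abs(y_hat - y).sum())  # sum reduction is an assumption
    # Eq. (14): pixel reconstruction, active only when source and target coincide.
    l_rec = 0.5 * float(np.sum((y - x_t) ** 2)) if same_pair else 0.0
    return l_rec + l_id + l_chg

y = np.zeros((3, 2, 2))
y_hat = np.full((3, 2, 2), 0.5)
x_t = np.ones((3, 2, 2))
z = np.array([1.0, 0.0])
with_rec = hear_net_loss(y, y_hat, x_t, z, z, same_pair=True)
without_rec = hear_net_loss(y, y_hat, x_t, z, z, same_pair=False)
```

In this toy case the identity term is 0, the change term is 12 × 0.5 = 6, and the reconstruction term adds 0.5 × 12 = 6 only for the same-image pair.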