Study Notes: Coarse-to-Fine Auto-Encoder Networks (CFAN) for Real-Time Face Alignment

The network architecture designed in this paper is used in the SeetaFace face recognition engine.

The authors propose a Coarse-to-Fine Auto-encoder Network (CFAN), which cascades several Stacked Auto-encoder Networks (SANs).

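1. First, a low-resolution version of the whole detected face is taken as input, so that the first SAN can predict the landmarks quickly and with reasonable accuracy (the global SAN).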

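2. The remaining SANs then refine the result step by step: each takes as input local features extracted around the current landmarks (the output of the previous SAN) at progressively higher resolutions (the local SANs).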
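The two-stage scheme above can be sketched as a short inference loop. This is a minimal sketch, not the authors' code: `global_san`, `local_sans`, `extract_features` and `downsample` are hypothetical callables standing in for the trained networks and the feature extractor, and shapes are plain NumPy vectors.

```python
import numpy as np

def cfan_predict(image, global_san, local_sans, extract_features, downsample):
    """Coarse-to-fine inference (sketch with hypothetical callables).

    global_san(low_res_face)      -> initial shape S0 (flat coordinate vector)
    local_sans[j](features)       -> predicted increment Delta S for stage j
    extract_features(img, S, j)   -> concatenated shape-indexed features
    downsample(img)               -> low-resolution whole-face image
    """
    shape = global_san(downsample(image))          # coarse estimate from the whole face
    for stage, local_san in enumerate(local_sans):
        phi = extract_features(image, shape, stage)  # features around current landmarks
        shape = shape + local_san(phi)               # refine by adding the regressed Delta S
    return shape

# Toy usage with dummy stages: the start shape is 0, the two stages add 1.0 and 0.5.
image = np.zeros((100, 100))
shape = cfan_predict(
    image,
    global_san=lambda low: np.zeros(4),
    local_sans=[lambda f: np.full(4, 1.0), lambda f: np.full(4, 0.5)],
    extract_features=lambda img, s, k: s,
    downsample=lambda img: img[::2, ::2],
)
```

Each local stage only has to predict a correction to the previous estimate, which is what keeps its task small.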

In the local SANs, SIFT features are extracted around each landmark.


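Each SAN attempts to learn the nonlinear mapping from the face image at a particular scale to the face shape, starting from the shape predicted by the previous SAN.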

Using global features as the input of the first SAN avoids the error that comes from initializing with the mean shape.

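After the first SAN produces the shape estimate $S_0$, the subsequent SANs (called local SANs) improve the shape by progressively regressing the deviation $\Delta S$ between the current landmark positions and the ground-truth positions.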
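For training, each local SAN needs regression targets. A minimal sketch, assuming shapes are stored one sample per row: the target is the residual Delta S = S_g - S_{j-1}, and the refined shapes S_{j-1} + predicted Delta S are what the next stage sees. The function names are hypothetical.

```python
import numpy as np

def stage_targets(current_shapes, gt_shapes):
    """Regression targets Delta S = S_g - S_{j-1} for one local SAN (rows = samples)."""
    return gt_shapes - current_shapes

def apply_stage(current_shapes, predicted_deltas):
    """S_j = S_{j-1} + predicted Delta S; these shapes feed the next stage."""
    return current_shapes + predicted_deltas

# Two toy samples, each a single landmark (x, y).
S_prev = np.array([[10.0, 20.0], [30.0, 40.0]])
S_gt = np.array([[12.0, 19.0], [29.0, 43.0]])
delta = stage_targets(S_prev, S_gt)
# A perfect regressor would land exactly on the ground truth.
S_next = apply_stage(S_prev, delta)
```

In practice the stage's prediction is imperfect, so the next stage is trained on the residual that remains.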

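To capture finer variations, each later stage uses shape-indexed features extracted around the current shape at a higher resolution, with a smaller search step and a smaller search region.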
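Why a higher resolution implies a smaller search region follows from simple arithmetic: if every local SAN looks at a patch of fixed pixel size, that patch covers a smaller fraction of the face as the face image gets larger. The 16-pixel patch and the 80/120/160-pixel face sizes below are hypothetical values chosen only for illustration, not the paper's settings.

```python
def patch_coverage(patch_px, face_px):
    """Fraction of the face width spanned by a patch of patch_px pixels."""
    return patch_px / face_px

# Hypothetical schedule: fixed 16-pixel patches, face resized to ever larger sizes.
face_sizes = [80, 120, 160]
coverage = [patch_coverage(16, size) for size in face_sizes]
for size, frac in zip(face_sizes, coverage):
    print(f"{size}px face: patch spans {frac:.0%} of the face width")
```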

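The shape-indexed features of all facial landmarks are concatenated, so that all landmarks are updated simultaneously. This keeps the result reasonable even under partial occlusion.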
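The shape-indexed feature Φ(S) can be sketched as: cut a patch around every current landmark and concatenate all of them into one vector, so a single regressor sees every landmark at once. This is a simplified stand-in: it uses raw-intensity patches instead of the SIFT descriptors the paper extracts, and the function name and `half` parameter are hypothetical.

```python
import numpy as np

def shape_indexed_features(image, shape, half=4):
    """Concatenate a (2*half) x (2*half) intensity patch around every landmark.

    shape: (n_landmarks, 2) array of (x, y) pixel coordinates.
    Raw patches stand in for the SIFT descriptors used in the paper.
    """
    h, w = image.shape
    padded = np.pad(image, half, mode="edge")  # keeps patches near the border in bounds
    pts = np.clip(np.rint(shape).astype(int), 0, [w - 1, h - 1])
    feats = []
    for x, y in pts:
        # Original pixel (y, x) sits at (y + half, x + half) in the padded image,
        # so this slice covers rows y-half .. y+half-1 and cols x-half .. x+half-1.
        patch = padded[y:y + 2 * half, x:x + 2 * half]
        feats.append(patch.ravel())
    return np.concatenate(feats)  # one joint vector -> all landmarks updated together

img = np.arange(100, dtype=float).reshape(10, 10)
S = np.array([[2.0, 3.0], [7.0, 5.0]])
phi = shape_indexed_features(img, S, half=2)  # 2 landmarks x (4 x 4) patch = 32 values
```

Swapping in a real descriptor would only change what is appended per landmark; the concatenation step stays the same.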


In general, existing alignment methods fall into holistic-feature-based methods [7,21,14,34,19,6] and local-feature-based methods [8,10,15,23,9,25,35,32,31,2,28,11].




Fig. 1. Overview of our Coarse-to-Fine Auto-encoder Networks (CFAN) for real-time face alignment. $H_1$, $H_2$ are hidden layers. Through function $F_\Phi$, the joint local features $\Phi(S_i)$ are extracted around facial landmarks of current shape $S_i$.

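With this progressive, resolution-varying strategy, the search space of each SAN, in other words the difficulty of its task, is kept under control, which makes each SAN more tractable.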

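On a desktop with an Intel i7-3770 CPU (3.4 GHz), the authors' method (implemented in Matlab) takes about 23 ms per image to predict 68 facial landmarks, not counting face detection time.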
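Converting the reported per-image time into throughput:

```latex
\frac{1000\ \text{ms/s}}{23\ \text{ms/image}} \approx 43.5\ \text{images/s}
```

This is above the 25 to 30 frames per second of typical video.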

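Let $x \in \mathbb{R}^d$ be a face image with $d$ pixels, and let $S_g(x) \in \mathbb{R}^p$ denote the ground-truth landmark coordinates, stacked into a $p$-dimensional vector. Facial landmark detection amounts to learning a mapping function $F$ from the image to the face shape:

$$F : x \mapsto S$$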
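In practice the shape is handled as a flat coordinate vector. A minimal sketch, assuming the interleaved layout (x1, y1, x2, y2, ...), which is an assumption rather than something these notes specify: with 68 landmarks this gives p = 136, and a flattened 50×50 face gives d = 2500. The helper names are hypothetical.

```python
import numpy as np

def to_shape_vector(landmarks):
    """(n, 2) array of (x, y) coordinates -> flat vector (x1, y1, x2, y2, ...) of length p = 2n."""
    return np.asarray(landmarks, dtype=float).reshape(-1)

def from_shape_vector(s):
    """Inverse of to_shape_vector: length-p vector -> (p/2, 2) array."""
    return np.asarray(s, dtype=float).reshape(-1, 2)

pts = np.random.default_rng(0).uniform(0, 50, size=(68, 2))  # toy landmarks
s = to_shape_vector(pts)              # p = 136
x = np.zeros((50, 50)).reshape(-1)    # flattened 50x50 face image, d = 2500
```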

In general, $F$ is complex and nonlinear.

To model this mapping, a deep neural network with $k$ layers is built by stacking auto-encoders; it maps the image to the corresponding shape.

Specifically, face alignment is formulated as minimizing the following objective:

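$$F^* = \arg\min_{F} \left\| S_g(x) - f_k\big(f_{k-1}(\cdots f_1(x))\big) \right\|_2^2 + \alpha \sum_{i=1}^{k} \|W_i\|_F^2,$$

$$f_i(a_{i-1}) = \sigma(W_i a_{i-1} + b_i) \triangleq a_i,\quad i = 1, \dots, k-1, \qquad f_k(a_{k-1}) = W_k a_{k-1} + b_k, \qquad a_0 = x.$$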

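Here $F = \{f_1, f_2, \dots, f_k\}$, where $f_i$ is the mapping function of the $i$-th layer of the deep network, $\sigma$ is the sigmoid function, and $a_i$ is the feature representation at layer $i$.
However, the sigmoid function outputs values in $[0, 1]$, which does not match the range of landmark coordinates, so the last layer $f_k$ is a linear regression that produces the shape estimate $S_0$.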

To prevent overfitting, a regularization term (weight decay) is added to the objective to keep the magnitude of the weights small.
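The objective can be sketched directly from these definitions. A minimal NumPy sketch for a single training sample (summing over the training set is assumed in practice): sigmoid hidden layers, a linear output layer, and a weight-decay penalty using the α = 0.001 reported in these notes. The layer sizes in the usage lines are toy values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def san_forward(x, weights, biases):
    """Sigmoid hidden layers f_1..f_{k-1}, then a linear output layer f_k."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = sigmoid(W @ a + b)              # a_i = sigma(W_i a_{i-1} + b_i)
    return weights[-1] @ a + biases[-1]     # linear regression layer -> shape estimate

def san_objective(x, s_gt, weights, biases, alpha=1e-3):
    """Squared shape error plus weight decay alpha * sum_i ||W_i||_F^2."""
    residual = s_gt - san_forward(x, weights, biases)
    decay = sum(np.sum(W ** 2) for W in weights)
    return float(residual @ residual + alpha * decay)

rng = np.random.default_rng(0)
sizes = [16, 8, 4, 6]   # toy widths: input d = 16, two hidden layers, p = 6
Ws = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes, sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
x, s_gt = rng.uniform(size=16), rng.uniform(size=6)
loss = san_objective(x, s_gt, Ws, bs)
```

The linear last layer is what lets the output leave the sigmoid's [0, 1] range and reach pixel coordinates.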


$F$ contains a large number of parameters, so optimizing it directly can easily get stuck in a poor local minimum.

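For better optimization, the first $k-1$ layers are initialized by unsupervised pre-training, and the $k$-th layer is initialized randomly. The whole network is then fine-tuned in a supervised manner.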

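For the $i$-th layer, pre-training is performed by optimizing

$$\{f_i^*, g_i^*\} = \arg\min_{f_i, g_i} \left\| a_{i-1} - g_i\big(f_i(a_{i-1})\big) \right\|_2^2 + \alpha\left(\|W_i\|_F^2 + \|W_i'\|_F^2\right),$$

where

$$f_i(a_{i-1}) = \sigma(W_i a_{i-1} + b_i) \triangleq a_i, \qquad g_i(a_i) = \sigma(W_i' a_i + b_i').$$

Here $g_i$ is the decoder that reconstructs the input of layer $i$. The output of each hidden layer serves as the input of the next layer; for the first layer, $a_0 = x$.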
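Greedy layer-wise pre-training can be sketched with a tiny NumPy auto-encoder: each layer learns to reconstruct its own input a_{i-1}, and its hidden codes a_i become the training data for the next layer. This is a minimal sketch assuming a sigmoid decoder, full-batch gradient descent and a loss averaged over samples; the learning rate, epoch count and toy data are hypothetical, and the authors' optimizer is not specified in these notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_layer(A, n_hidden, alpha=1e-3, lr=0.5, epochs=200, seed=0):
    """Fit one sigmoid auto-encoder to the columns of A (the inputs a_{i-1}).

    Minimizes mean ||a - g(f(a))||^2 + alpha * (||W||_F^2 + ||W'||_F^2)
    by full-batch gradient descent. Returns the encoder (W, b), the hidden
    codes a_i = f(a_{i-1}) and the per-epoch reconstruction error.
    """
    rng = np.random.default_rng(seed)
    d, n = A.shape
    W, b = rng.normal(0, 0.1, (n_hidden, d)), np.zeros((n_hidden, 1))  # encoder f
    Wd, bd = rng.normal(0, 0.1, (d, n_hidden)), np.zeros((d, 1))       # decoder g
    errors = []
    for _ in range(epochs):
        H = sigmoid(W @ A + b)                    # hidden codes
        R = sigmoid(Wd @ H + bd)                  # reconstruction of A
        errors.append(np.mean(np.sum((R - A) ** 2, axis=0)))
        dR = 2.0 * (R - A) / n * R * (1 - R)      # gradient w.r.t. decoder pre-activation
        dH = (Wd.T @ dR) * H * (1 - H)            # gradient w.r.t. encoder pre-activation
        Wd -= lr * (dR @ H.T + 2 * alpha * Wd)
        bd -= lr * dR.sum(axis=1, keepdims=True)
        W -= lr * (dH @ A.T + 2 * alpha * W)
        b -= lr * dH.sum(axis=1, keepdims=True)
    return W, b, sigmoid(W @ A + b), errors

# Toy data: 40 "images" with d = 25 pixels stored as columns, so a_0 = x.
X = np.random.default_rng(1).uniform(0, 1, (25, 40))
W1, b1, A1, err1 = pretrain_layer(X, 16)   # layer 1 learns to reconstruct x
W2, b2, A2, err2 = pretrain_layer(A1, 9)   # layer 2 learns to reconstruct a_1
```

Only the encoders (W1, b1), (W2, b2) are kept to initialize the deep network; the decoders are discarded before supervised fine-tuning.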


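A local feature around one landmark only captures information from that landmark and ignores its correlation with the others. Therefore, the shape-indexed local features of all landmarks are concatenated together as the input.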



When the current positions are still far from the ground truth, the earlier local SANs need larger search steps to approach it.


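The global SAN has four layers (three hidden layers followed by a linear regression layer) and learns the nonlinear mapping from the whole 50×50-pixel face image to the face shape. The hidden layers have 1600, 900 and 400 units, respectively.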

Each local SAN has 1296, 784 and 400 hidden units per layer, respectively.

In both the global and the local SANs, the weight-decay coefficient is α = 0.001.
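The layer widths above determine the model size. A minimal sketch counting the weights and biases of the global SAN as a fully connected network, assuming a 50×50 grayscale input (d = 2500) and a linear output of 68 × 2 = 136 coordinates, consistent with the 68 landmarks mentioned earlier. The helper name is hypothetical.

```python
def param_count(layer_sizes):
    """Number of weights plus biases in a fully connected net with these layer widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Global SAN: 2500 inputs -> hidden 1600, 900, 400 -> linear layer with 136 outputs.
global_sizes = [50 * 50, 1600, 900, 400, 68 * 2]
n_global = param_count(global_sizes)
print(f"{n_global:,} parameters")  # 5,857,436 parameters
```

Most of the parameters sit in the first layer (2500 × 1600 weights). Stored as float32, the whole network takes roughly 23.4 MB.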



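Summary: The authors adopt a divide-and-conquer strategy. The global features of the whole face image are fed into the global SAN to obtain a fairly accurate initial shape. Then the shape-indexed features of all facial landmarks are concatenated as the input of the local SANs. The local SANs share the same structure but operate at different resolutions; each one extracts SIFT features around the facial landmarks and searches to minimize the deviation between the current landmark positions (where the shape-indexed features are extracted) and the ground-truth positions. Through this coarse-to-fine process, faces are aligned well. CFAN outperforms both SDM and DCNN, and the nonlinear regression yields a lower regression error.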


1. 300 faces in-the-wild challenge, http://ibug.doc.ic.ac.uk/resources/300-W/
2. Asthana, A., Zafeiriou, S., Cheng, S., Pantic, M.: Robust discriminative response
map fitting with constrained local models. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3444–3451 (2013)
3. Belhumeur, P.N., Jacobs, D.W., Kriegman, D.J., Kumar, N.: Localizing parts of
faces using a consensus of exemplars. In: IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pp. 545–552 (2011)
4. Bengio, Y.: Learning deep architectures for
AI. Foundations and Trends in Machine Learning 2(1), 1–127 (2009)
5. Burgos-Artizzu, X.P., Perona, P., Dollár, P.: Robust face landmark estimation
under occlusion. In: IEEE International Conference on Computer Vision, ICCV
(2013)
6. Cao, X., Wei, Y., Wen, F., Sun, J.: Face alignment by explicit shape regression.
In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp.
2887–2894 (2012)
7. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 23(6), 681–685
(2001)
8. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active shape models: their training and application. Computer Vision and Image Understanding
(CVIU) 61(1), 38–59 (1995)
9. Cristinacce, D., Cootes, T.F.: Feature detection and tracking with constrained
local models. In: British Machine Vision Conference (BMVC), vol. 17, pp. 929–938
(2006)
10. Cristinacce, D., Cootes, T.F.: Boosted regression active shape models. In: British
Machine Vision Conference (BMVC), pp. 1–10 (2007)
11. Dantone, M., Gall, J., Fanelli, G., Van Gool, L.: Real-time facial feature detection
using conditional regression forests. In: IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pp. 2578–2585 (2012)
12. Dollár, P., Welinder, P., Perona, P.: Cascaded pose regression. In: IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), pp. 1078–1085 (2010)
13. Grangier, D., Bottou, L., Collobert, R.: Deep convolutional networks for scene parsing. In: International Conference on Machine Learning Workshops, vol. 3 (2009)
14. Gross, R., Matthews, I., Baker, S.: Generic vs. person specific active appearance
models. Image and Vision Computing (IVC) 23(12), 1080–1093 (2005)
15. Gu, L., Kanade, T.: A generative shape regularization model for robust face alignment. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS,
vol. 5302, pp. 413–426. Springer, Heidelberg (2008)
16. Jesorsky, O., Kirchberg, K.J., Frischholz, R.W.: Robust face detection using
the Hausdorff distance. In: International Conference on Audio-and Video-based
Biometric Person Authentication (AVBPA), pp. 90–95 (2001)
17. Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems
(NIPS), pp. 1106–1114 (2012)
18. Le, V., Brandt, J., Lin, Z., Bourdev, L., Huang, T.S.: Interactive facial feature
localization. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C.
(eds.) ECCV 2012, Part III. LNCS, vol. 7574, pp. 679–692. Springer, Heidelberg
(2012)


19. Liu, X.: Discriminative face alignment. IEEE Transactions on Pattern Analysis
and Machine Intelligence (TPAMI) 31(11), 1941–1954 (2009)
20. Luo, P., Wang, X., Tang, X.: Hierarchical face parsing via deep learning. In: IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2480–2487
(2012)
21. Matthews, I., Baker, S.: Active appearance models revisited. International Journal
of Computer Vision (IJCV) 60(2), 135–164 (2004)
22. Messer, K., Matas, J., Kittler, J., Luettin, J., Maitre, G.: XM2VTSDB: The extended
M2VTS database. In: International Conference on Audio and Video-based Biometric
Person Authentication (AVBPA), vol. 964, pp. 965–966 (1999)
23. Milborrow, S., Nicolls, F.: Locating facial features with an extended active shape
model. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part IV. LNCS,
vol. 5305, pp. 504–513. Springer, Heidelberg (2008)
24. Sagonas, C., Tzimiropoulos, G., Zafeiriou, S., Pantic, M.: A semi-automatic
methodology for facial landmark annotation. In: IEEE Conference on Computer
Vision and Pattern Recognition Workshops (CVPRW), pp. 896–903 (2013)
25. Saragih, J.M., Lucey, S., Cohn, J.F.: Face alignment through subspace constrained
mean-shifts. In: IEEE International Conference on Computer Vision (ICCV), pp.
1034–1041 (2009)
26. Sun, Y., Wang, X., Tang, X.: Deep convolutional network cascade for facial point
detection. In: IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 3476–3483 (2013)
27. Toshev, A., Szegedy, C.: DeepPose: Human pose estimation via deep neural
networks. In: IEEE Conference on Computer Vision and Pattern Recognition,
CVPR (2014)
28. Valstar, M., Martinez, B., Binefa, X., Pantic, M.: Facial point detection using
boosted regression and graph models. In: IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pp. 2729–2736 (2010)
29. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple
features. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
vol. 1, p. I–511 (2001)
30. Wu, Y., Wang, Z., Ji, Q.: Facial feature tracking under varying facial expressions
and face poses based on restricted Boltzmann machines. In: IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pp. 3452–3459 (2013)
31. Xiong, X., De la Torre, F.: Supervised descent method and its applications to face
alignment. In: IEEE Conference on Computer Vision and Pattern Recognition,
CVPR (2013)
32. Yu, X., Huang, J., Zhang, S., Yan, W., Metaxas, D.N.: Pose-free facial landmark
fitting via optimized part mixtures and cascaded deformable shape model. In: IEEE
International Conference on Computer Vision, ICCV (2013)
33. Zhao, X., Kim, T.K., Luo, W.: Unified face analysis by iterative multi-output
random forests. In: IEEE Conference on Computer Vision and Pattern Recognition,
CVPR (2014)
34. Zhao, X., Shan, S., Chai, X., Chen, X.: Locality-constrained active appearance
model. In: Asian Conference on Computer Vision (ACCV), pp. 636–647 (2013)
35. Zhu, X., Ramanan, D.: Face detection, pose estimation, and landmark localization
in the wild. In: IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 2879–2886 (2012)




