Audio-Driven Lip Animation: CodeTalker

Published at: CVPR 2023

Application:

From the paper: "Over the discrete motion space, a temporal autoregressive model is employed to sequentially synthesize facial motions from the input speech signal, which guarantees lip-sync as well as plausible facial expressions." In short: speech signal → lip sync & facial expressions.
Repository: https://github.com/Doubiiu/CodeTalker

Method:

[Figure 1]
Cross-modal decoder: embedding block + multi-layer transformer decoder + pre-trained codebook + VQ-VAE decoder.
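As a rough illustration of how these four components might be wired together, here is a simplified sketch with made-up class names, dimensions, and stand-in layers; it is not the code from the CodeTalker repo, and the autoregressive causal masking is omitted for brevity:

```python
import torch
import torch.nn as nn

class CrossModalDecoderSketch(nn.Module):
    """Illustrative composition: embedding block -> multi-layer transformer
    decoder (cross-attending to audio features) -> codebook lookup ->
    VQ-VAE decoder. Dimensions and layer counts are placeholders."""
    def __init__(self, vert_dim=15069, d_model=64, n_heads=4, n_layers=6,
                 codebook_size=256):
        super().__init__()
        self.embed = nn.Linear(vert_dim, d_model)              # embedding block
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)  # multi-layer transformer decoder
        self.codebook = nn.Embedding(codebook_size, d_model)   # pre-trained codebook (frozen at stage 2)
        self.vq_decoder = nn.Linear(d_model, vert_dim)         # stand-in for the VQ-VAE decoder

    def forward(self, past_motion, audio_feat):
        # past_motion: [B, T, vert_dim], audio_feat: [B, T_a, d_model]
        h = self.decoder(self.embed(past_motion), audio_feat)  # cross-attention to audio
        # quantize: replace each feature with its nearest codebook entry
        B, T, D = h.shape
        idx = torch.cdist(h.reshape(B * T, D), self.codebook.weight).argmin(-1)
        z_q = self.codebook(idx).reshape(B, T, D)
        return self.vq_decoder(z_q)                            # predicted vertex offsets

# example call with dummy tensors: 69 motion frames, 49 audio frames
out = CrossModalDecoderSketch()(torch.zeros(1, 69, 15069), torch.zeros(1, 49, 64))
```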
Training has two stages: 1) Discrete Facial Motion Space
VQ-VAE: encoder E + decoder D + context-rich codebook Z
[Figure 2]
[Figure 3]
Input: x = x - template, where x is the head-mesh vertices, shape [1, 69, 15069]
template: the neutral template head mesh, shape [1, 15069]
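To make the stage-1 data flow concrete, here is a minimal sketch of training a VQ-VAE on template-subtracted vertex sequences. The module choices (plain linear layers for E and D) and hyperparameters are my own simplifications, not the repo's transformer-based implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionVQVAE(nn.Module):
    """Toy VQ-VAE over facial motions: encoder E, codebook Z, decoder D."""
    def __init__(self, vert_dim=15069, latent_dim=64, codebook_size=256):
        super().__init__()
        self.encoder = nn.Linear(vert_dim, latent_dim)           # E
        self.codebook = nn.Embedding(codebook_size, latent_dim)  # context-rich codebook Z
        self.decoder = nn.Linear(latent_dim, vert_dim)           # D

    def forward(self, motion):
        z_e = self.encoder(motion)                               # [B, T, latent_dim]
        # nearest-neighbour quantization against the codebook
        dist = torch.cdist(z_e.flatten(0, 1), self.codebook.weight)
        idx = dist.argmin(-1).view(motion.shape[:2])
        z_q = self.codebook(idx)
        # straight-through estimator so gradients reach the encoder
        z_q_st = z_e + (z_q - z_e).detach()
        recon = self.decoder(z_q_st)
        # reconstruction + codebook + commitment losses (standard VQ-VAE objective)
        loss = F.mse_loss(recon, motion) \
             + F.mse_loss(z_q, z_e.detach()) \
             + 0.25 * F.mse_loss(z_e, z_q.detach())
        return recon, loss

# shapes follow the note above: vertices [1, 69, 15069], template [1, 15069]
vertices = torch.randn(1, 69, 15069)
template = torch.randn(1, 15069)
motion = vertices - template.unsqueeze(1)   # x = x - template
recon, loss = MotionVQVAE()(motion)
loss.backward()
```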
2)Speech-Driven Motion Synthesis
This stage consists of: audio feature extractor (TCN) + transformer encoder + embedding block + multi-layer transformer decoder. The audio feature extractor (TCN) and transformer encoder come from the pre-trained wav2vec 2.0 model; the codebook and VQ-VAE decoder trained in stage 1 are frozen, and only the embedding block and multi-layer transformer decoder are trained. A sketch of this setup follows.
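A hedged sketch of the stage-2 setup: load a pre-trained wav2vec 2.0 as the audio feature extractor + transformer encoder, freeze the stage-1 codebook and decoder, and optimize only the embedding block and transformer decoder. All modules other than wav2vec 2.0 are simplified stand-ins (linear layers, small decoder) rather than the repo's actual classes:

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

# Pre-trained wav2vec 2.0 = audio feature extractor (TCN) + transformer encoder.
wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

# Stand-ins for the stage-1 codebook / VQ-VAE decoder and the stage-2
# embedding block + multi-layer transformer decoder.
codebook = nn.Embedding(256, 64)
vq_decoder = nn.Linear(64, 15069)
embed_block = nn.Linear(15069, 64)
layer = nn.TransformerDecoderLayer(d_model=64, nhead=4, batch_first=True)
motion_decoder = nn.TransformerDecoder(layer, num_layers=2)
audio_proj = nn.Linear(wav2vec.config.hidden_size, 64)  # map 768-dim audio features to d_model

# Freeze what stage 1 produced; only the stage-2 modules receive gradients.
for module in (codebook, vq_decoder):
    for p in module.parameters():
        p.requires_grad = False

trainable = (list(embed_block.parameters()) + list(motion_decoder.parameters())
             + list(audio_proj.parameters()))
optimizer = torch.optim.Adam(trainable, lr=1e-4)

# One illustrative step with dummy data (1 s of 16 kHz audio, 69 motion frames).
waveform = torch.randn(1, 16000)
audio_feat = audio_proj(wav2vec(waveform).last_hidden_state)  # [1, T_a, 64]
past_motion = torch.zeros(1, 69, 15069)
h = motion_decoder(embed_block(past_motion), audio_feat)      # cross-attend to audio
idx = torch.cdist(h.flatten(0, 1), codebook.weight).argmin(-1).view(1, 69)
z_q = h + (codebook(idx) - h).detach()                        # straight-through codebook lookup
pred_motion = vq_decoder(z_q)
loss = nn.functional.mse_loss(pred_motion, torch.zeros_like(pred_motion))
loss.backward()
optimizer.step()
```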

Evaluation metrics:

1) Lip vertex error: compute the maximal L2 error over all lip vertices for each frame, then average over all frames.
2) Upper-face dynamics deviation: measures the variation of facial dynamics of a motion sequence in comparison with that of the ground truth. (A code sketch of both metrics follows below.)
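A minimal NumPy sketch of both metrics. The lip/upper-face vertex index lists and the exact definition of the "dynamics" statistic (here: per-vertex std over time of the motion magnitude relative to the template) are assumptions; check the official evaluation code for the precise formulation:

```python
import numpy as np

def lip_vertex_error(pred, gt, lip_idx):
    """Maximal L2 error over lip vertices per frame, averaged over frames.
    pred, gt: [T, V, 3] vertex sequences; lip_idx: indices of lip vertices."""
    err = np.linalg.norm(pred[:, lip_idx] - gt[:, lip_idx], axis=-1)  # [T, |lip|]
    return err.max(axis=1).mean()

def upper_face_dynamics_deviation(pred, gt, template, upper_idx):
    """Assumed FDD variant: per upper-face vertex, take the std over time of the
    L2 norm of its offset from the template, then average the GT-minus-prediction
    difference over those vertices."""
    def dyn(seq):
        motion = np.linalg.norm(seq - template[None], axis=-1)  # [T, V]
        return motion.std(axis=0)                               # [V]
    return (dyn(gt)[upper_idx] - dyn(pred)[upper_idx]).mean()
```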

Driving results:

[Video: FaceTalk__audio]

Different speakers saying the same sentence produce slightly different driven lip shapes.
[Figure 4]
Replacing wav2vec2-base-960h with chinese-wav2vec2-base, the lip-sync result on the same audio clip is roughly the same.
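Swapping the audio encoder amounts to loading a different wav2vec 2.0 checkpoint; both are "base"-sized models with 768-dim hidden states, so the downstream projection can stay unchanged. The HuggingFace repo ID below is an assumption (chinese-wav2vec2-base is published by TencentGameMate; verify the exact name before use):

```python
from transformers import Wav2Vec2Model

# assumed checkpoint ID, instead of "facebook/wav2vec2-base-960h"
wav2vec = Wav2Vec2Model.from_pretrained("TencentGameMate/chinese-wav2vec2-base")
print(wav2vec.config.hidden_size)  # 768, same as wav2vec2-base-960h
```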
