VITS2 Has Arrived

  Paper: VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design

  Demo: https://vits-2.github.io/demo/

  Paper link: https://arxiv.org/abs/2307.16430

Problems that still remain:

  1. intermittent unnaturalness

  2. low efficiency of the duration predictor

  3. complex input format to alleviate the limitations of alignment and duration modeling (use of blank token)

  4. insufficient speaker similarity in the multi-speaker model

  5. slow training

  6. strong dependence on phoneme conversion
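To make the "use of blank token" item concrete: the complex input format refers to padding the phoneme sequence with a blank symbol before, between, and after every token. Below is a minimal sketch of that preprocessing step. The function name `intersperse`, the blank id `0`, and the sample phoneme ids are assumptions for illustration, not something this post specifies.

```python
def intersperse(tokens, blank_id=0):
    """Insert blank_id before, between, and after every token (illustrative helper)."""
    # Allocate 2 * len + 1 slots filled with the blank id ...
    result = [blank_id] * (len(tokens) * 2 + 1)
    # ... then place the real tokens at the odd positions.
    result[1::2] = tokens
    return result


# Hypothetical phoneme ids, chosen only for the example.
phoneme_ids = [5, 12, 7]
print(intersperse(phoneme_ids))  # [0, 5, 0, 12, 0, 7, 0]
```

The padded sequence is roughly twice as long as the original, which is part of why this input format counts as a limitation rather than a feature.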

Proposed methods:

  1. a stochastic duration predictor trained through adversarial learning

  2. normalizing flows improved by utilizing the transformer block

  3. a speaker-conditioned text encoder to model multiple speakers’ characteristics better.
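As a rough illustration of the first item, the sketch below shows what an adversarial objective for a duration predictor can look like. It assumes a least-squares GAN loss plus an auxiliary mean-squared-error term, which is my reading of the paper rather than something this post states. All function names are hypothetical, and discriminator scores and durations are plain Python floats rather than tensors.

```python
def discriminator_loss(scores_real, scores_fake):
    """Least-squares loss: push scores on real durations to 1, on predicted ones to 0."""
    pairs = zip(scores_real, scores_fake)
    return sum((r - 1.0) ** 2 + f ** 2 for r, f in pairs) / len(scores_real)


def generator_adv_loss(scores_fake):
    """Least-squares loss for the duration predictor: make predicted durations score 1."""
    return sum((f - 1.0) ** 2 for f in scores_fake) / len(scores_fake)


def mse_loss(predicted, target):
    """Mean squared error between predicted and reference durations."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


# Toy values: the discriminator is fooled half-way, durations are off by one frame.
adv = generator_adv_loss([0.5, 0.5])
rec = mse_loss([3.0, 5.0], [4.0, 4.0])
print(adv, rec)  # 0.25 1.0
```

In an actual model the scores would come from a discriminator network conditioned on the text encoder output, and the two duration-predictor terms would be combined and optimized jointly with the rest of the model.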
