LightGrad-TTS from Tsinghua University, with a Streaming Implementation

[Figure 1]

Paper link:

https://arxiv.org/abs/2308.16569

Code repository:

https://github.com/thuhcsi/LightGrad

Dataset support:

Training scripts are provided for the BZNSYP (Mandarin) and LJSpeech (English) datasets.

[Figure 2]

The paper identifies two problems with Grad-TTS:

  1. DPMs are not lightweight enough for resource-constrained devices.

  2. DPMs require many denoising steps in inference, which increases latency.

Proposed solutions:

  1. To reduce model parameters, the regular convolutions in the diffusion decoder are replaced with depthwise separable convolutions (see the sketch after this list).

  2. To accelerate inference, a training-free fast sampling technique for DPMs (DPM-Solver) is adopted (a minimal sampling sketch follows the results below).

  3. Streaming inference is also implemented in LightGrad to reduce latency further.
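
As a rough illustration of item 1, the sketch below contrasts a regular 1-D convolution with a depthwise separable one of the same shape. This is only a minimal example: the channel count (256) and kernel size (3) are arbitrary assumptions, not LightGrad's actual decoder hyperparameters.

```python
import torch.nn as nn

class DepthwiseSeparableConv1d(nn.Module):
    """Depthwise (per-channel) convolution followed by a 1x1 pointwise mix."""
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv1d(in_channels, in_channels, kernel_size,
                                   padding=kernel_size // 2, groups=in_channels)
        self.pointwise = nn.Conv1d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Parameter comparison against a regular convolution of the same shape.
regular = nn.Conv1d(256, 256, 3, padding=1)
separable = DepthwiseSeparableConv1d(256, 256, 3)
print(sum(p.numel() for p in regular.parameters()))    # 196,864
print(sum(p.numel() for p in separable.parameters()))  # 66,816
```

Factoring a k-tap convolution over C channels into a depthwise filter plus a pointwise mix shrinks the weight count roughly from k·C² to C·(k + C), which is the source of the parameter savings this substitution targets.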

[Figure 3]

Compared with Grad-TTS, LightGrad achieves a 62.2% reduction in parameters and a 65.7% reduction in latency, while preserving comparable speech quality on both Chinese Mandarin and English with 4 denoising steps.
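
Item 2 above refers to DPM-Solver, a training-free fast sampler for diffusion models. Below is a minimal sketch of a first-order DPM-Solver loop for a noise-prediction network under a VP noise schedule; the `alpha_sigma` schedule constants, the `eps_model(x, t, cond)` interface, and the zero-mean prior are illustrative assumptions rather than LightGrad's released code (Grad-TTS-style models actually start from a text-conditional prior mean).

```python
import torch

def alpha_sigma(t, beta_min=0.05, beta_max=20.0):
    """VP-SDE schedule: mean coefficient alpha_t and noise level sigma_t."""
    log_alpha = -0.25 * t ** 2 * (beta_max - beta_min) - 0.5 * t * beta_min
    alpha = torch.exp(log_alpha)
    sigma = torch.sqrt(1.0 - alpha ** 2)
    return alpha, sigma

@torch.no_grad()
def dpm_solver_1(eps_model, x, cond, n_steps=4, t_start=1.0, t_end=1e-3):
    """Integrate from t_start down to t_end in n_steps first-order updates."""
    ts = torch.linspace(t_start, t_end, n_steps + 1)
    for i in range(n_steps):
        s, t = ts[i], ts[i + 1]                                # current -> next time
        a_s, sig_s = alpha_sigma(s)
        a_t, sig_t = alpha_sigma(t)
        h = torch.log(a_t / sig_t) - torch.log(a_s / sig_s)   # log-SNR step
        t_vec = torch.full((x.shape[0],), s.item())
        eps = eps_model(x, t_vec, cond)                        # predicted noise at time s
        x = (a_t / a_s) * x - sig_t * torch.expm1(h) * eps
    return x
```

With only 4 such updates the sampler traverses the whole reverse process, which is how the number of denoising steps is cut without retraining the model.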

LightGrad's streaming scheme (based on a paper from Samsung):

Paper link:

https://arxiv.org/abs/2111.09052

Implementation details:

  1. The decoder input is chopped into chunks at phoneme boundaries so that each chunk covers several consecutive phonemes, and chunk lengths are limited to a predefined range.

  2. To incorporate context information into the decoder, the last phoneme of the previous chunk and the first phoneme of the following chunk are padded to the head and tail of the current chunk.

  3. Then, the decoder generates a mel-spectrogram for each padded chunk.

  4. After this, the mel-spectrogram frames corresponding to the padded phonemes are removed, reversing the changes to each chunk (see the sketch after this list).
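
A minimal sketch of steps 1-4 is shown below. The `phoneme_frames` list (per-phoneme mel-frame counts from the duration predictor) and the `decode_chunk` callable (standing in for the diffusion decoder run on a span of phonemes) are hypothetical names for illustration, not LightGrad's actual interfaces.

```python
def make_chunks(n_phonemes, min_len, max_len):
    """Greedily group consecutive phonemes into chunks whose sizes stay
    within [min_len, max_len]; real boundary selection may differ."""
    chunks, start = [], 0
    while start < n_phonemes:
        remaining = n_phonemes - start
        if remaining <= max_len:
            size = remaining                 # last chunk takes what is left
        elif remaining < max_len + min_len:
            size = remaining - min_len       # avoid a too-short final chunk
        else:
            size = max_len
        chunks.append(range(start, start + size))
        start += size
    return chunks

def streaming_decode(phoneme_frames, decode_chunk, min_len=4, max_len=10):
    """Yield the mel-spectrogram piece by piece as each chunk is decoded."""
    n = len(phoneme_frames)
    for chunk in make_chunks(n, min_len, max_len):
        # Pad with the last phoneme of the previous chunk and the first
        # phoneme of the following chunk to give the decoder context.
        left = chunk.start - 1 if chunk.start > 0 else None
        right = chunk.stop if chunk.stop < n else None
        lo = chunk.start if left is None else left
        hi = chunk.stop if right is None else right + 1
        mel = decode_chunk(range(lo, hi))    # (n_mels, frames) for the padded span
        # Drop the frames that belong to the padded context phonemes.
        head = 0 if left is None else phoneme_frames[left]
        tail = 0 if right is None else phoneme_frames[right]
        yield mel[:, head: mel.shape[1] - tail]
```

Because each yielded chunk can be passed to the vocoder immediately, audio playback can start before the whole utterance has been decoded, which is where the additional latency reduction comes from.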

[Figure 4]
