Building a mixed-lingual neural TTS system with only monolingual data
Paper: https://arxiv.org/pdf/1904.06063v1.pdf
Audio samples: https://angelkeepmoving.github.io/mixed-lingual-tts/index.htm
Abstract
The paper targets encoder-decoder TTS frameworks (e.g. Tacotron, Deep Voice). The main question is whether a model trained only on monolingual corpora (Mandarin) can synthesize mixed-lingual (Mandarin-English) speech. Synthesis quality is judged on two aspects: voice consistency within a mixed-lingual utterance (no odd prosody or timbre shifts) and the naturalness and intelligibility of the pronunciation (correct phrasing, clear articulation). The paper builds such a cross-lingual synthesis model and presents experiments and analysis.
1 Introduction
1.1 Related work
Earlier approaches to multilingual synthesis mainly include:
(1) Training a model on bilingual recordings from a single speaker.
(2) Sharing HMM states, with the state mapping likewise learned from bilingual corpora.
(3) Using a unified language encoding.
(4) Using a unified phoneme set.
1.2 Main contributions
The main contributions of the paper are:
(1) Analyzing how the network shapes the phoneme representations.
(2) The effect of the speaker embedding on voice consistency in mixed-lingual sentences.
(3) The effect of the phoneme embedding on the synthesized speech.
(4) Whether monolingual corpora alone can train a multilingual model.
2 Multilingual TTS model
2.1 Multi-speaker average model
Training a multilingual synthesis model directly from a single monolingual corpus is difficult, so the paper first trains a base (average) model on a mixture of Mandarin and English corpora. When this mixed corpus is used, speaker embeddings are assigned by a lookup table; in the phoneme-embedding analysis, however, no speaker embedding is used.
2.2 Speaker embedding
Speaker embeddings are obtained by table lookup (like a database query: one embedding entry per speaker) and are trained jointly with the encoder-decoder network (presumably in the same way as in expressive Tacotron). Because where the speaker embedding is injected affects the voice consistency of the final synthesized speech (whether a mixed Mandarin-English utterance sounds like a single speaker), the paper compares two strategies: injecting it at the encoder output, called SE-ENC, and injecting it at the decoder input, called SE-DEC (similar setups have been used in Baidu's and Google's related work). Later experiments show SE-DEC works better.
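A minimal sketch of the two injection points, assuming a Tacotron-like encoder-decoder; the module and tensor names below are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    """Lookup-table speaker embedding, broadcast over time and concatenated
    either onto the encoder outputs (SE-ENC) or onto the decoder inputs (SE-DEC)."""
    def __init__(self, num_speakers, spk_dim=64):
        super().__init__()
        self.table = nn.Embedding(num_speakers, spk_dim)  # one row per speaker

    def forward(self, features, speaker_id):
        # features: (batch, time, dim) -- encoder outputs for SE-ENC,
        # or decoder input frames for SE-DEC
        spk = self.table(speaker_id)                       # (batch, spk_dim)
        spk = spk.unsqueeze(1).expand(-1, features.size(1), -1)
        return torch.cat([features, spk], dim=-1)          # condition by concatenation
```

In the SE-ENC variant this module would wrap the encoder outputs before attention; in SE-DEC it would wrap the frames fed to the decoder. The embedding table is trained jointly with the rest of the network.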
2.3 Phoneme embedding
The paper also studies how the phoneme embedding affects the naturalness of the final synthesized speech. Figure 1 shows the raw phoneme embeddings, where lowercase symbols are Mandarin phonemes and uppercase symbols are English phonemes; the Mandarin and English phonemes are scattered and show no clear separation. Figure 2 shows the phoneme embeddings after passing through the encoder, which cluster with a clear separation. The authors conjecture this is caused by the encoder output being influenced by the acoustic feedback in the decoder, as well as by alignment errors.
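Figures 1 and 2 are presumably produced by a 2-D projection of the embeddings; here is a sketch of that kind of analysis with scikit-learn's t-SNE, assuming the embeddings have already been exported as arrays (the function and variable names are mine):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_phoneme_space(phoneme_vectors, is_english, title):
    # phoneme_vectors: (num_phonemes, dim) -- raw embedding table, or encoder outputs
    # is_english: boolean mask, True for English phonemes (uppercase symbols)
    points = TSNE(n_components=2, perplexity=15, init="pca",
                  random_state=0).fit_transform(phoneme_vectors)
    plt.scatter(points[~is_english, 0], points[~is_english, 1], label="Mandarin")
    plt.scatter(points[is_english, 0], points[is_english, 1], label="English")
    plt.legend(); plt.title(title); plt.show()
```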
2.4 Phoneme-informed attention
Based on the analysis in 2.3, the paper studies two ways of letting the phoneme embedding inform the attention. The first weights the phoneme embeddings with an attention-like mechanism; the basic equations are given in the paper.
The other adds a residual connection, adding the phoneme attention context onto the encoder output (experiments show the residual variant works better); see the corresponding figure in the paper.
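Since the equations did not survive into these notes, the following is only a rough sketch of how the two variants could be wired under my own reading: the decoder query attends over the phoneme embedding sequence, and the resulting phoneme context is either used as an extra weighted context (the weighting variant) or added residually onto the encoder outputs (the residual variant). All names and dimensions are assumptions, not the paper's notation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhonemeInformedAttention(nn.Module):
    def __init__(self, enc_dim, phn_dim, query_dim, attn_dim=128, residual=True):
        super().__init__()
        self.residual = residual
        self.q = nn.Linear(query_dim, attn_dim, bias=False)
        self.k = nn.Linear(phn_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)
        self.proj = nn.Linear(phn_dim, enc_dim)  # only used on the residual path

    def forward(self, query, phn_emb, enc_out):
        # query: (B, query_dim) decoder state; phn_emb: (B, T, phn_dim) phoneme
        # embeddings; enc_out: (B, T, enc_dim) encoder outputs
        score = self.v(torch.tanh(self.q(query).unsqueeze(1) + self.k(phn_emb)))
        alpha = F.softmax(score, dim=1)              # (B, T, 1) attention weights
        phn_ctx = (alpha * phn_emb).sum(dim=1)       # phoneme context vector
        if self.residual:
            # residual variant: add the (projected) phoneme context onto the encoder outputs
            return enc_out + self.proj(phn_ctx).unsqueeze(1)
        # weighting variant: return the weighted phoneme context to be used
        # alongside the regular encoder attention context
        return phn_ctx
```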
3 Experiments and analysis
3.1 Experimental setup
Only female speakers are used: 35 Mandarin female speakers and 35 English female speakers. Each Mandarin speaker contributes roughly 500 utterances, 17,197 utterances in total (about 17 hours); the English data are the 35 female speakers from VCTK, 14,464 utterances in total (about 8 hours). I think the data are somewhat imbalanced, since the Mandarin and English corpora differ quite a bit in size, but TTS corpora are genuinely expensive and there is essentially no high-quality (studio-grade) open-source English corpus.
All audio is sampled at 24 kHz; 80-dimensional mel spectrograms and 1024-dimensional linear spectrograms are extracted, and Griffin-Lim is used as the vocoder.
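A sketch of this feature extraction with librosa; the FFT, hop, and window sizes are my own assumptions (the notes only specify 24 kHz audio, 80 mel bins, a ~1024-dimensional linear spectrogram, and Griffin-Lim), and n_fft=2048 gives 1025 linear bins, close to the reported 1024:

```python
import numpy as np
import librosa

SR, N_FFT, HOP, WIN = 24000, 2048, 300, 1200   # assumed STFT settings

def extract_features(wav_path):
    y, _ = librosa.load(wav_path, sr=SR)
    linear = np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP, win_length=WIN))
    mel = librosa.feature.melspectrogram(S=linear ** 2, sr=SR, n_mels=80)
    return mel, linear                          # (80, frames), (1025, frames)

def griffin_lim_reconstruct(linear):
    # Phase recovery from a (predicted) magnitude spectrogram
    return librosa.griffinlim(linear, n_iter=60, hop_length=HOP, win_length=WIN)
```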
3.2 Experimental analysis
Evaluation uses AB preference tests: given a pair of sentences A and B, listeners judge whether A is better, B is better, or they sound about the same. There are 18 raters, with 30 sentences per system (30 each for A and B).
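As a small illustration of how such AB preference scores are tallied (a helper of my own, not from the paper):

```python
from collections import Counter

def ab_preference(judgments):
    """judgments: iterable of 'A', 'B', or 'same', one per (rater, sentence pair)."""
    counts = Counter(judgments)
    total = sum(counts.values())
    return {k: 100.0 * counts[k] / total for k in ("A", "B", "same")}

# e.g. 18 raters x 30 sentence pairs = 540 judgments per comparison
print(ab_preference(["A", "B", "same", "A", "A", "B"]))
```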
3.2.1 SE-ENC vs. SE-DEC
3.2.2 SE-DEC vs. retrained AVM
SE-DEC is compared against a retrained AVM (train the average model first, then continue training on the target speaker's own data).
3.2.3 Choice of training data
The effect of different training sets on the final model is compared using three datasets: (1) Mandarin: CORPUS-MAN; (2) English: CORPUS-ENG; (3) mixed: CORPUS-MIX.
3.2.4 Use of the phoneme embedding
One variant of using the phoneme embedding is the weighting method, giving the network SE-DEC-PECV; the other is the residual method, giving SE-DEC-RES.
In addition, SE-DEC-RES is compared with plain SE-DEC, i.e. without the residual connection.
4 Conclusions
The main takeaways are:
(1) A multi-speaker average model helps the final model, and it is even better when the average model's training data includes the target speaker's data.
(2) The speaker embedding is important for voice consistency, and where it is injected also matters: placing it at the decoder input works better than at the encoder output.
(3) Although monolingual corpora can train a cross-lingual synthesis model, having bilingual corpora is still better.
(4) Phoneme attention information benefits both the consistency and the naturalness of the synthesized speech.
In addition, a model trained on a single language shows obvious transition artifacts when synthesizing mixed-lingual sentences, and the quality is poor.