Tacotron2: Paper and Code Explained
Paper Reading: Tacotron2
Tacotron2 Model Explained
The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.
In short, a recurrent seq2seq model maps character embeddings to mel-scale spectrograms, and a modified WaveNet then serves as the vocoder that generates the time-domain waveform.
Our model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech.
To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the conditioning input to WaveNet instead of linguistic, duration, and F0 features.
The ablations include the effect of conditioning WaveNet on mel spectrograms rather than on linguistic, duration, and F0 features.
We further show that using this compact acoustic intermediate representation allows for a significant reduction in the size of the WaveNet architecture.
Over time, different techniques have dominated the field.
Concatenative synthesis with unit selection, the process of stitching small units of pre-recorded waveforms together [2, 3] was the state-of-the-art for many years.
Statistical parametric speech synthesis [4, 5, 6, 7], which directly generates smooth trajectories of speech features to be synthesized by a vocoder, followed, solving many of the issues that concatenative synthesis had with boundary artifacts.
However, the audio produced by these systems often sounds muffled and unnatural compared to human speech.
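Different TTS techniques have dominated at different times, but both concatenative synthesis and SPSS share one problem: the audio they produce often sounds muffled and unnatural compared with human speech.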
WaveNet [8], a generative model of time domain waveforms, produces audio quality that begins to rival that of real human speech and is already used in some complete TTS systems [9, 10, 11].
The inputs to WaveNet (linguistic features, predicted log fundamental frequency (F0), and phoneme durations), however, require significant domain expertise to produce, involving elaborate text-analysis systems as well as a robust lexicon (pronunciation guide).
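WaveNet takes inputs such as linguistic features, predicted log fundamental frequency (F0), and phoneme durations. Producing these inputs requires substantial domain expertise, including elaborate text-analysis systems and a robust lexicon (pronunciation guide).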
Tacotron [12], a sequence-to-sequence architecture [13] for producing magnitude spectrograms from a sequence of characters, simplifies the traditional speech synthesis pipeline by replacing the production of these linguistic and acoustic features with a single neural network trained from data alone.
To vocode the resulting magnitude spectrograms, Tacotron uses the Griffin-Lim algorithm [14] for phase estimation, followed by an inverse short-time Fourier transform.
In this paper, we describe a unified, entirely neural approach to speech synthesis that combines the best of the previous approaches: a sequence-to-sequence Tacotron-style model [12] that generates mel spectrograms, followed by a modified WaveNet vocoder [10, 15].
Tacotron2 still uses a seq2seq Tacotron-style model to produce mel spectrograms, which are then fed to a modified WaveNet vocoder to generate the waveform.
Trained directly on normalized character sequences and corresponding speech waveforms, our model learns to synthesize natural sounding speech that is difficult to distinguish from real human speech.
Our proposed system consists of two components, shown in Figure 1:
A recurrent sequence-to-sequence feature prediction network with attention which predicts a sequence of mel spectrogram frames from an input character sequence.
A modified version of WaveNet which generates time-domain waveform samples conditioned on the predicted mel spectrogram frames.
In this work we choose a low-level acoustic representation: mel-frequency spectrograms, to bridge the two components.
Introduction to mel-frequency spectrograms
A mel-frequency spectrogram is related to the linear-frequency spectrogram, i.e., the short-time Fourier transform (STFT) magnitude.
It is obtained by applying a nonlinear transform to the frequency axis of the STFT, inspired by measured responses from the human auditory system, and summarizes the frequency content with fewer dimensions.
Advantages of using mel-frequency spectrograms:
Using a representation that is easily computed from time-domain waveforms allows us to train the two components separately.
This representation is also smoother than waveform samples and is easier to train using a squared error loss because it is invariant to phase within each frame.
While linear spectrograms discard phase information (and are therefore lossy), algorithms such as Griffin-Lim are capable of estimating this discarded information, which enables time-domain conversion via the inverse short-time Fourier transform.
Mel spectrograms discard even more information, presenting a challenging inverse problem.
However, in comparison to the linguistic and acoustic features used in WaveNet, the mel spectrogram is a simpler, lower-level acoustic representation of audio signals.
It should therefore be straightforward for a similar WaveNet model conditioned on mel spectrograms to generate audio, essentially as a neural vocoder.
Indeed, we will show that it is possible to generate high quality audio from mel spectrograms using a modified WaveNet architecture.
As in Tacotron, mel spectrograms are computed through a short-time Fourier transform (STFT) using a 50 ms frame size, 12.5 ms frame hop, and a Hann window function.
We experimented with a 5 ms frame hop to match the frequency of the conditioning inputs in the original WaveNet, but the corresponding increase in temporal resolution resulted in significantly more pronunciation issues.
We transform the STFT magnitude to the mel scale using an 80 channel mel filterbank spanning 125 Hz to 7.6 kHz, followed by log dynamic range compression.
Prior to log compression, the filterbank output magnitudes are clipped to a minimum value of 0.01 in order to limit dynamic range in the logarithmic domain.
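The feature extraction described above can be sketched in NumPy. This is a minimal illustration, not the authors' code: the 24 kHz sample rate, the FFT size of 2048, the HTK-style mel formula, and the triangular filter shape are all assumptions, since the text only fixes the frame size, hop, window, channel count, frequency range, and clipping floor.

```python
import numpy as np

SAMPLE_RATE = 24000                      # assumption: same rate as the 24 kHz vocoder output
FRAME_LEN = SAMPLE_RATE * 50 // 1000     # 50 ms frame   -> 1200 samples
HOP_LEN = SAMPLE_RATE * 125 // 10000     # 12.5 ms hop   -> 300 samples
N_FFT = 2048                             # assumption: next power of two above the frame length
N_MELS, F_MIN, F_MAX = 80, 125.0, 7600.0

def hz_to_mel(f):
    # HTK-style mel formula (assumption: the text does not name a variant).
    return 2595.0 * np.log10(1.0 + np.asarray(f) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m) / 2595.0) - 1.0)

def mel_filterbank():
    # Triangular filters whose edges are evenly spaced on the mel scale.
    fft_freqs = np.linspace(0.0, SAMPLE_RATE / 2, N_FFT // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(F_MIN), hz_to_mel(F_MAX), N_MELS + 2))
    fb = np.zeros((N_MELS, fft_freqs.size))
    for i in range(N_MELS):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel_spectrogram(wav):
    # Slice the waveform into Hann-windowed frames, take the STFT magnitude,
    # project onto the mel filterbank, clip at 0.01, then log-compress.
    n_frames = 1 + (len(wav) - FRAME_LEN) // HOP_LEN
    window = np.hanning(FRAME_LEN)
    frames = np.stack([wav[t * HOP_LEN: t * HOP_LEN + FRAME_LEN] * window
                       for t in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, n=N_FFT, axis=1))   # [n_frames, 1025]
    mel = magnitude @ mel_filterbank().T                        # [n_frames, 80]
    return np.log(np.maximum(mel, 0.01))
```

With these assumed settings, one second of audio gives 77 frames of 80 mel channels, and no value falls below log(0.01).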
The network is composed of an encoder and a decoder with attention.
The encoder converts a character sequence into a hidden feature representation which the decoder consumes to predict a spectrogram.
1. Encoder
Input characters are represented using a learned 512-dimensional character embedding, which are passed through a stack of 3 convolutional layers each containing 512 filters with shape 5 × 1, i.e., where each filter spans 5 characters, followed by batch normalization [18] and ReLU activations.
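The input has shape
[batch_size, char_seq_length]
A 512-dimensional Character Embedding maps each character to a 512-dimensional vector, so the output has shape
[batch_size, char_seq_length, 512]
The 3 convolutional layers, each with 512 filters, keep this shape (assuming padding that preserves the sequence length):
[batch_size, char_seq_length, 512]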
As in Tacotron, these convolutional layers model longer-term context (e.g., N-grams) in the input character sequence.
The output of the final convolutional layer is passed into a single bi-directional [19] LSTM [20] layer containing 512 units (256 in each direction) to generate the encoded features.
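- The output above is fed to a single-layer BiLSTM, which generates the encoded features. The hidden size is 256 per direction, and because the LSTM is bidirectional, the final output shape is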
[batch_size, char_seq_length, 512]
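To make the shape bookkeeping concrete, here is a rough NumPy sketch of the encoder's forward pass. It is a shape trace under assumptions rather than a faithful implementation: batch normalization is omitted, the weights are random, `VOCAB` is a hypothetical alphabet size, and zero "same" padding is assumed so that the convolutions keep the sequence length.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, KERNEL, HIDDEN = 40, 512, 5, 256   # VOCAB is a hypothetical alphabet size

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1d_same(x, w):
    # x: [batch, time, in_ch], w: [kernel, in_ch, out_ch]; zero padding keeps `time` unchanged.
    pad = w.shape[0] // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (0, 0)))
    return sum(xp[:, k:k + x.shape[1]] @ w[k] for k in range(w.shape[0]))

def lstm(x, wx, wh, b):
    # Plain LSTM unrolled over time; x: [batch, time, in] -> [batch, time, hidden].
    batch, time, _ = x.shape
    h = np.zeros((batch, wh.shape[0]))
    c = np.zeros_like(h)
    outputs = []
    for t in range(time):
        i, f, g, o = np.split(x[:, t] @ wx + h @ wh + b, 4, axis=1)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs, axis=1)

def init_params():
    def lstm_params():
        return (rng.normal(0, 0.05, (DIM, 4 * HIDDEN)),
                rng.normal(0, 0.05, (HIDDEN, 4 * HIDDEN)),
                np.zeros(4 * HIDDEN))
    return {
        "embedding": rng.normal(0, 0.05, (VOCAB, DIM)),
        "convs": [rng.normal(0, 0.02, (KERNEL, DIM, DIM)) for _ in range(3)],
        "lstm_fwd": lstm_params(),
        "lstm_bwd": lstm_params(),
    }

def encoder(char_ids, params):
    x = params["embedding"][char_ids]                     # [B, T]      -> [B, T, 512]
    for w in params["convs"]:
        x = np.maximum(conv1d_same(x, w), 0.0)            # conv + ReLU (batch norm omitted)
    fwd = lstm(x, *params["lstm_fwd"])                    # [B, T, 256]
    bwd = lstm(x[:, ::-1], *params["lstm_bwd"])[:, ::-1]  # reversed pass, flipped back
    return np.concatenate([fwd, bwd], axis=-1)            # [B, T, 512]

params = init_params()
encoded = encoder(rng.integers(0, VOCAB, size=(2, 7)), params)
print(encoded.shape)  # (2, 7, 512)
```

The two 256-unit directions are concatenated along the feature axis, which is where the final 512 comes from.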
2. Location sensitive attention
The encoder output is consumed by an attention network which summarizes the full encoded sequence as a fixed-length context vector for each decoder output step.
We use the location-sensitive attention from [21], which extends the additive attention mechanism [22] to use cumulative attention weights from previous decoder time steps as an additional feature.
This encourages the model to move forward consistently through the input, mitigating potential failure modes where some subsequences are repeated or ignored by the decoder.
Attention probabilities are computed after projecting inputs and location features to 128-dimensional hidden representations.
Location features are computed using 32 1-D convolution filters of length 31.
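One step of location-sensitive attention might look like the following NumPy sketch. The 128-dimensional projection and the 32 filters of length 31 come from the text; the 1024-dimensional query (taken to be the decoder LSTM state), the random weights, the zero padding, and the absence of bias terms are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
ENC_DIM, QUERY_DIM, ATT_DIM = 512, 1024, 128   # QUERY_DIM is an assumption
N_FILTERS, FILTER_LEN = 32, 31

params = {
    "W_query": rng.normal(0, 0.05, (QUERY_DIM, ATT_DIM)),
    "W_memory": rng.normal(0, 0.05, (ENC_DIM, ATT_DIM)),
    "conv": rng.normal(0, 0.05, (FILTER_LEN, N_FILTERS)),
    "W_location": rng.normal(0, 0.05, (N_FILTERS, ATT_DIM)),
    "v": rng.normal(0, 0.05, (ATT_DIM,)),
}

def location_features(cum_weights):
    # Slide 32 filters of length 31 over the cumulative attention weights: [T] -> [T, 32].
    pad = FILTER_LEN // 2
    padded = np.pad(cum_weights, (pad, pad))
    windows = np.stack([padded[t:t + FILTER_LEN] for t in range(len(cum_weights))])
    return windows @ params["conv"]

def attention_step(query, memory, cum_weights):
    # query: [1024] decoder state, memory: [T, 512] encoder outputs, cum_weights: [T].
    f = location_features(cum_weights)
    hidden = np.tanh(query @ params["W_query"]
                     + memory @ params["W_memory"]
                     + f @ params["W_location"])      # [T, 128]
    energies = hidden @ params["v"]                   # [T]
    weights = np.exp(energies - energies.max())
    weights /= weights.sum()                          # softmax over encoder steps
    context = weights @ memory                        # fixed-length context vector [512]
    return context, weights, cum_weights + weights
```

The third return value is the updated cumulative weight vector, which is fed back in at the next decoder step so the model can see which inputs it has already attended to.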
3. Decoder
The decoder is an autoregressive recurrent neural network which predicts a mel spectrogram from the encoded input sequence one frame at a time.
The prediction from the previous time step is first passed through a small pre-net containing 2 fully connected layers of 256 hidden ReLU units.
We found that the pre-net acting as an information bottleneck was essential for learning attention.
The prenet output and attention context vector are concatenated and passed through a stack of 2 unidirectional LSTM layers with 1024 units.
The PreNet output and the attention context vector are concatenated and fed to a 2-layer unidirectional LSTM with 1024 units.
post-net + Linear Projection
The concatenation of the LSTM output and the attention context vector is projected through a linear transform to predict the target spectrogram frame.
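The LSTM output is concatenated with the attention context vector once more and passed through a linear projection to predict the target spectrogram frame.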
Finally, the predicted mel spectrogram is passed through a 5-layer convolutional post-net which predicts a residual to add to the prediction to improve the overall reconstruction.
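Finally, the predicted spectrogram frame passes through a 5-layer convolutional post-net, and its output is added to the output of the Linear Projection (a residual connection) to form the final output.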
Each post-net layer is comprised of 512 filters with shape 5 × 1 with batch normalization, followed by tanh activations on all but the final layer.
We minimize the summed mean squared error (MSE) from before and after the post-net to aid convergence.
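The summed loss can be written in a few lines. This is only a sketch of the arithmetic; the function and argument names are hypothetical, and averaging over all frames and channels is an assumption about how the MSE is reduced.

```python
import numpy as np

def summed_mel_loss(mel_before, postnet_residual, mel_target):
    # The post-net predicts a residual that is added to the linear-projection output.
    mel_after = mel_before + postnet_residual
    mse_before = np.mean((mel_before - mel_target) ** 2)
    mse_after = np.mean((mel_after - mel_target) ** 2)
    # Both terms are summed, so the pre-post-net prediction is supervised directly too.
    return mse_before + mse_after
```

Supervising the prediction before the post-net as well means the decoder still has to produce a usable spectrogram without relying on the post-net to repair it.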
Linear Projection + sigmoid
In parallel to spectrogram frame prediction, the concatenation of decoder LSTM output and the attention context is projected down to a scalar and passed through a sigmoid activation to predict the probability that the output sequence has completed.
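In parallel, the concatenated LSTM output and attention context vector are projected to a scalar and passed to a sigmoid activation, which predicts the probability that the output sequence is complete.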
This “stop token” prediction is used during inference to allow the model to dynamically determine when to terminate generation instead of always generating for a fixed duration.
At inference, this "stop token" lets the model decide when to stop instead of always generating for a fixed length.
Specifically, generation completes at the first frame for which this probability exceeds a threshold of 0.5.
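The stopping rule can be sketched as a simple loop. Here `decoder_step` is a hypothetical callable standing in for the whole decoder (pre-net, attention, LSTMs, projections); the all-zero first input frame, the `max_frames` safety cap, and keeping the frame on which the threshold is crossed are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate(decoder_step, n_mels=80, max_frames=1000, threshold=0.5):
    # decoder_step(prev_frame) -> (next_frame, stop_logit); hypothetical interface.
    frames = []
    prev = np.zeros(n_mels)              # assumed all-zero "go" frame
    for _ in range(max_frames):          # safety cap in case the stop token never fires
        frame, stop_logit = decoder_step(prev)
        frames.append(frame)
        if sigmoid(stop_logit) > threshold:
            break                        # first frame whose stop probability exceeds 0.5
        prev = frame                     # autoregression: feed the prediction back in
    return np.stack(frames)
```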
The convolutional layers in the network are regularized using dropout [25] with probability 0.5, and LSTM layers are regularized using zoneout [26] with probability 0.1.
In order to introduce output variation at inference time, dropout with probability 0.5 is applied only to layers in the pre-net of the autoregressive decoder.
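A pre-net with dropout left on at inference could be sketched as follows. The layer sizes follow the text (2 fully connected layers of 256 ReLU units, dropout probability 0.5); the inverted-dropout rescaling and the way weights are passed in are assumptions.

```python
import numpy as np

def prenet(prev_frame, weights, rng, drop_prob=0.5):
    # weights: list of (W, b) pairs for the two fully connected layers.
    x = prev_frame
    for w, b in weights:
        x = np.maximum(x @ w + b, 0.0)                 # fully connected + ReLU
        keep = rng.random(x.shape) >= drop_prob        # fresh mask on every call
        x = x * keep / (1.0 - drop_prob)               # applied in training AND inference
    return x
```

Because a new mask is drawn on every call, two runs on the same input frame give different pre-net outputs, which is the source of output variation mentioned above.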
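- Encoder part: uses embedding + 3Conv_layers + Bi-directional_LSTM
- attention: location sensitive attention
- Pre_net: … activation function
- The final … are added to obtain …
- Multi-frame prediction: although Tacotron2 does not predict multiple frames per step, the implementation principle is similar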
We use a modified version of the WaveNet architecture from [8] to invert the mel spectrogram feature representation into time-domain waveform samples.
Dilated convolution
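As in the original architecture, there are 30 dilated convolution layers, grouped into 3 dilation cycles, i.e., the dilation rate of layer k (k = 0 . . . 29) is 2^(k mod 10).
As in the original WaveNet, the 30 dilated convolution layers form 3 dilation cycles of 10 layers each, so the dilation rate restarts at 1 at layers 0, 10, and 20.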
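As a quick check of the layer layout, the dilation rates can be enumerated directly, assuming the usual WaveNet pattern in which the rate doubles within each 10-layer cycle, i.e. 2^(k mod 10). The receptive-field helper takes the kernel size as a parameter because the text above does not state it.

```python
def dilation_rates(n_layers=30, cycle=10):
    # Rate doubles within a cycle and resets to 1 at the start of the next one.
    return [2 ** (k % cycle) for k in range(n_layers)]

def receptive_field(dilations, kernel_size):
    # Each layer widens the receptive field by (kernel_size - 1) * dilation samples.
    return 1 + (kernel_size - 1) * sum(dilations)

rates = dilation_rates()
print(rates[:10])   # [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
print(rates[10])    # 1
```

For example, a hypothetical kernel size of 3 would give a receptive field of 1 + 2 × 3 × 1023 = 6139 samples.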
To work with the 12.5 ms frame hop of the spectrogram frames, only 2 upsampling layers are used in the conditioning stack instead of 3 layers.
Instead of predicting discretized buckets with a softmax layer, we follow PixelCNN++ [27] and Parallel WaveNet [28] and use a 10 component mixture of logistic distributions (MoL) to generate 16-bit samples at 24 kHz.
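Instead of predicting discretized buckets with a softmax layer, the model follows PixelCNN++ [27] and Parallel WaveNet [28] and uses a 10-component mixture of logistic distributions (MoL) to generate 16-bit speech samples at 24 kHz.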
To compute the logistic mixture distribution, the WaveNet stack output is passed through a ReLU activation followed by a linear projection to predict parameters (mean, log scale, mixture weight) for each mixture component.
The loss is computed as the negative log-likelihood of the ground truth sample.
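The negative log-likelihood under a mixture of logistics might be computed as in the sketch below. It assumes a PixelCNN++-style discretization, where the probability of a sample is the logistic CDF mass falling inside its quantization bin, and that samples are scaled to [-1, 1]. Special handling of the two edge bins is left out, and the probability floor of 1e-12 is an arbitrary numerical safeguard.

```python
import numpy as np

def sigmoid(x):
    # Numerically stable logistic function.
    return np.exp(-np.logaddexp(0.0, -x))

def mol_nll(x, means, log_scales, logit_weights, n_levels=2 ** 16):
    # x: scalar target in [-1, 1]; the other arguments have shape [n_components].
    half_bin = 1.0 / (n_levels - 1)                  # half the width of one 16-bit bin
    inv_scale = np.exp(-log_scales)
    cdf_hi = sigmoid((x - means + half_bin) * inv_scale)
    cdf_lo = sigmoid((x - means - half_bin) * inv_scale)
    log_bin_prob = np.log(np.maximum(cdf_hi - cdf_lo, 1e-12))
    # Log-softmax over the mixture weights, then log-sum-exp over components.
    log_weights = logit_weights - np.logaddexp.reduce(logit_weights)
    return -np.logaddexp.reduce(log_weights + log_bin_prob)
```

With 10 components, the linear projection mentioned above would output 30 numbers per sample: a mean, a log scale, and a mixture weight logit for each component.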
Our training process involves first training the feature prediction network on its own, followed by training a modified WaveNet independently on the outputs generated by the first network.
1. Training the feature prediction network (Encoder+Attention+Decoder)
To train the feature prediction network, we apply the standard maximum-likelihood training procedure (feeding in the correct output instead of the predicted output on the decoder side, also referred to as teacher-forcing) with a batch size of 64 on a single GPU.
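To train the feature prediction network, we use a batch size of 64 on a single GPU with the standard maximum-likelihood procedure: the decoder is fed the correct output rather than its own prediction, which is also known as teacher-forcing.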
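Teacher forcing amounts to shifting the ground-truth spectrogram by one frame to build the decoder inputs. The sketch below assumes an all-zero first frame, which the text does not specify.

```python
import numpy as np

def teacher_forced_inputs(mel_target):
    # mel_target: [n_frames, n_mels] ground-truth spectrogram.
    go_frame = np.zeros((1, mel_target.shape[1]))       # assumed all-zero start frame
    # Input at step t is the TRUE frame t-1, not the model's own prediction.
    return np.concatenate([go_frame, mel_target[:-1]], axis=0)
```

The inputs have the same number of frames as the targets, so the decoder produces exactly one prediction per ground-truth frame.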
We use the Adam optimizer [29] with β1 = 0.9, β2 = 0.999, ε = 10^−6 and a learning rate of 10^−3 exponentially decaying to 10^−5 starting after 50,000 iterations.
We use the Adam optimizer [29] with β1 = 0.9, β2 = 0.999, ε = 10^−6; the learning rate starts at 10^−3 and, after 50,000 iterations, decays exponentially to 10^−5.
We also apply L2 regularization with weight 10^−6.
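A learning-rate schedule of this shape could look like the sketch below. The start point (50,000 iterations) and the two endpoints (10^−3 and 10^−5) come from the text; `decay_steps` and `decay_rate` are hypothetical values, since the text does not say how fast the rate decays.

```python
def learning_rate(step, init_lr=1e-3, final_lr=1e-5,
                  start_decay=50_000, decay_steps=50_000, decay_rate=0.5):
    # decay_steps and decay_rate are illustrative assumptions.
    if step <= start_decay:
        return init_lr                                   # constant for the first 50,000 steps
    lr = init_lr * decay_rate ** ((step - start_decay) / decay_steps)
    return max(lr, final_lr)                             # never drop below the floor
```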
2. Training the modified WaveNet
We then train our modified WaveNet on the ground truth-aligned predictions of the feature prediction network.
That is, the prediction network is run in teacher-forcing mode, where each predicted frame is conditioned on the encoded input sequence and the corresponding previous frame in the ground truth spectrogram.
This ensures that each predicted frame exactly aligns with the target waveform samples.
We train with a batch size of 128 distributed across 32 GPUs with synchronous updates, using the Adam optimizer with β1 = 0.9, β2 = 0.999, ε = 10^−8 and a fixed learning rate of 10^−4.
During training, we use the Adam optimizer with β1 = 0.9, β2 = 0.999, ε = 10^−8 and a fixed learning rate of 10^−4; each batch of 128 is distributed across 32 GPUs with synchronous updates.
It helps quality to average model weights over recent updates.
Therefore we maintain an exponentially-weighted moving average of the network parameters over update steps with a decay of 0.9999 – this version is used for inference (see also [29]).
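So we keep an exponentially weighted moving average of the network parameters with a decay of 0.9999 as they are updated, and use this averaged version for inference (see [29]).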
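The averaging step itself is small. In this sketch the parameters are held in a plain dictionary of arrays, which is an assumption about the data layout rather than anything stated in the text.

```python
import numpy as np

def ema_update(shadow, params, decay=0.9999):
    # shadow: running average of the parameters; params: values after the latest update.
    return {name: decay * shadow[name] + (1.0 - decay) * params[name]
            for name in params}
```

After every optimizer step the shadow copy is refreshed with `shadow = ema_update(shadow, params)`, and the shadow copy is the one used at inference time.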
To speed up convergence, we scale the waveform targets by a factor of 127.5 which brings the initial outputs of the mixture of logistics layer closer to the eventual distributions.
3. Training dataset
We train all models on an internal US English dataset [12], which contains 24.6 hours of speech from a single professional female speaker.
All text in our datasets is spelled out, e.g., “16” is written as “sixteen”, i.e., our models are all trained on normalized text.
When generating speech in inference mode, the ground truth targets are not known.
Therefore, the predicted outputs from the previous step are fed in during decoding, in contrast to the teacher-forcing configuration used for training.
1. Evaluation set
We randomly selected 100 fixed examples from the test set of our internal dataset as the evaluation set.
Note that while instances in the evaluation set never appear in the training set, there are some recurring patterns and common words between the two sets.
Since all the systems we compare are trained on the same data, relative comparisons are still meaningful.
2. MOS scores
Audio generated on this set is sent to a human rating service similar to Amazon’s Mechanical Turk where each sample is rated by at least 8 raters on a scale from 1 to 5 with 0.5 point increments, from which a subjective mean opinion score (MOS) is calculated.
In other words, every sample gets at least 8 independent ratings between 1 and 5 in 0.5-point steps, and the MOS is computed from these ratings.
Each evaluation is conducted independently from each other, so the outputs of two different models are not directly compared when raters assign a score to them.
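A MOS and its uncertainty can be computed from raw ratings as sketched below. The text does not say how the ± intervals are derived, so the 95% normal-approximation confidence interval used here is an assumption.

```python
import numpy as np

def mos_with_ci(ratings, z=1.96):
    # ratings: all individual scores (1 to 5 in 0.5 steps) collected for one system.
    r = np.asarray(ratings, dtype=float)
    mean = r.mean()
    # Half-width of a normal-approximation 95% interval (assumed method).
    half_width = z * r.std(ddof=1) / np.sqrt(r.size)
    return mean, half_width
```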
1. MOS comparison
2. Side-by-side evaluation of synthesized audio against ground truth
For each pair of utterances, raters are asked to give a score ranging from -3 (synthesized much worse than ground truth) to 3 (synthesized much better than ground truth).
Raters score each pair on a scale of [-3, 3]: -3 means the synthesized speech is much worse than the ground truth, and 3 means it is much better. Result:
The overall mean score of −0.270 ± 0.155 shows that raters have a small but statistically significant preference towards ground truth over our results.
The comments from raters indicate that occasional mispronunciation by our system is the primary reason for this preference.
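The result of −0.270 ± 0.155 indicates that raters slightly prefer the ground-truth speech over the synthesized speech.
3. Evaluating MOS on the test set from Appendix E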
We ran a separate rating experiment on the custom 100-sentence test set from Appendix E of [11], obtaining a MOS of 4.354.
These results show that while our system is able to reliably attend to the entire input, there is still room for improvement in prosody modeling.
4. Testing generalization to out-of-domain text
Finally, we evaluate samples generated from 37 news headlines to test the generalization ability of our system to out-of-domain text.
On this task, our model receives a MOS of 4.148 ± 0.124 while WaveNet conditioned on linguistic features receives a MOS of 4.137 ± 0.128.
Examination of rater comments shows that our neural system tends to generate speech that feels more natural and human-like, but it sometimes runs into pronunciation difficulties, e.g., when handling names.
This result points to a challenge for end-to-end approaches – they require training on data that cover intended usage.
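The two parts of the Tacotron2 model are trained separately, but training the modified WaveNet requires the features predicted by the Spectrogram Prediction Network.
In the ablation study, WaveNet was instead trained on mel spectrograms extracted from the ground truth.
Training on ground truth and then synthesizing from predicted features gives worse results.
Reason: the predicted spectrograms are over-smoothed and less detailed than the ground truth. A network trained on ground truth never learns to generate high-quality waveforms from over-smoothed features.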
Since it is not possible to use the information of predicted future frames before they have been decoded, we use a convolutional post-processing network to incorporate past and future frames after decoding to improve the feature predictions.
To answer this question, we compared our model with and without the post-net, and found that without it, our model only obtains a MOS score of 4.429 ± 0.071, compared to 4.526 ± 0.066 with it, meaning that empirically the post-net is still an important part of the network design.
A defining feature of WaveNet is its use of dilated convolution.
This paper describes Tacotron 2, a fully neural TTS system that combines a sequence-to-sequence recurrent network with attention to predict mel spectrograms with a modified WaveNet vocoder.
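In short, Tacotron 2 is a fully neural TTS system that pairs an attention-based sequence-to-sequence recurrent network for mel spectrogram prediction with a modified WaveNet vocoder.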
The resulting system synthesizes speech with Tacotron-level prosody and WaveNet-level audio quality.
This system can be trained directly from data without relying on complex feature engineering, and achieves state-of-the-art sound quality close to that of natural human speech.