To build a neural network system, you must prepare linguistic features as the system input and acoustic features as the system output. Please prepare your data as described in this section.
Neural networks take vectors as input, so the alphabetic representation of the linguistic features needs to be vectorised.
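As a minimal illustration (not the toolkit's actual front-end code; the phone inventory and helper name are hypothetical), a categorical linguistic feature such as a phone identity can be vectorised with one-hot encoding:

    import numpy as np

    # Hypothetical phone inventory; in practice this comes from your question set.
    PHONE_SET = ['sil', 'p', 'l', 'i', 'a']

    def one_hot_phone(phone):
        """Vectorise a phone identity as a binary (one-hot) vector."""
        vec = np.zeros(len(PHONE_SET), dtype=np.float32)
        vec[PHONE_SET.index(phone)] = 1.0
        return vec

    print(one_hot_phone('p'))  # [0. 1. 0. 0. 0.]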
The "composed" style:
dmgc refers to the dimensionality of the MCC features together with their delta and delta-delta features. If dmgc is set to 60, only the static features are used. Please set a file extension for each feature type, for example:
- [Extensions] mgc_ext : .mgc
- [Extensions] bap_ext : .bap
- [Extensions] lf0_ext : .lf0
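Since the acoustic feature files are headerless float32 binaries (see the ListDataProvider notes below), they can be inspected directly with numpy. A sketch, in which the file name and the dimensionality are assumptions:

    import numpy as np

    dmgc = 60  # static-only MGC dimensionality, matching the setting above

    # 'example.mgc' is a hypothetical feature file produced by your vocoder.
    data = np.fromfile('example.mgc', dtype=np.float32)
    frames = data.reshape(-1, dmgc)  # one row per frame
    print('%d frames of dimension %d' % (frames.shape[0], dmgc))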
The open-source WORLD vocoder is also supported; a version modified for SPSS (statistical parametric speech synthesis) can be found in the repository.
If you prefer your own vocoder, try mapping each of its features onto one of the supported feature types by giving it a matching nickname.
Several example recipes for standard neural network architectures are provided with the system; they are described below:
You can define your own architecture by choosing the hidden units for each hidden layer, as in the sketch below. For the supported hidden layer types, please see the Models section.
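For instance, a six-layer hybrid architecture could be described by two parallel lists, one entry per hidden layer. This is an illustrative sketch: the option names mirror the DeepRecurrentNetwork parameters documented below, and the layer-type strings are assumptions.

    # Five feed-forward TANH layers at the bottom with one LSTM layer on top.
    hidden_layer_size = [1024, 1024, 1024, 1024, 1024, 512]
    hidden_layer_type = ['TANH', 'TANH', 'TANH', 'TANH', 'TANH', 'LSTM']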
A sample configuration file can be found in the './recipes/dnn' directory. Use 'submit.sh ./run_lstm.py ./recipes/dnn/feed_foward_dnn.conf' to build a feed-forward neural network. Please modify the configuration file to suit your own working environment (e.g., the data paths).
A sample configuration file is provided at './recipes/dnn/hybrid_lstm.conf'. Follow the same steps as in the deep feed-forward neural network example.
Sample configuration files are provided in the './recipes/blstm' directory: 'blstm.conf' builds multiple bidirectional LSTM layers, while 'hybrid_blstm.conf' builds a hybrid architecture, i.e. several feed-forward layers at the bottom with one BLSTM layer on top.
This example recipe accompanies the paper by Wu & King (ICASSP 2016). Several variants of the LSTM are provided; please use the corresponding configuration file for each experiment.
The following details are not included in the docstrings in the source code.
models.deep_rnn.DeepRecurrentNetwork(n_in, hidden_layer_size, n_out, L1_reg, L2_reg, hidden_layer_type, output_type='LINEAR')
This class is used to assemble various neural network architectures, from a basic feed-forward network to bidirectional gated recurrent networks and hybrid architectures. "Hybrid" refers to a combined feed-forward and recurrent architecture.
__init__(n_in, hidden_layer_size, n_out, L1_reg, L2_reg, hidden_layer_type, output_type='LINEAR')
This function initialises a neural network.
Parameters:
build_finetune_functions(train_shared_xy, valid_shared_xy)
This function builds the fine-tuning functions and updates the gradients.
Parameters:
Returns: the fine-tuning functions for training and development
parameter_prediction(test_set_x)
This function performs prediction.
Parameters: test_set_x (python array variable) – the input features of one test sentence
Returns: the predicted features
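Putting the three methods together, training and prediction might look like the sketch below. Only the class and method names come from this page; the dimensionalities, the layer-type strings, and the assumption that build_finetune_functions returns a (train, validation) pair are illustrative.

    import numpy as np
    import theano
    from models.deep_rnn import DeepRecurrentNetwork

    net = DeepRecurrentNetwork(n_in=425, hidden_layer_size=[512, 512], n_out=187,
                               L1_reg=0.0, L2_reg=1e-5,
                               hidden_layer_type=['TANH', 'LSTM'],
                               output_type='LINEAR')

    # Theano shared variables holding input/output features (dummy data here).
    train_xy = (theano.shared(np.zeros((1000, 425), dtype=np.float32)),
                theano.shared(np.zeros((1000, 187), dtype=np.float32)))
    valid_xy = (theano.shared(np.zeros((200, 425), dtype=np.float32)),
                theano.shared(np.zeros((200, 187), dtype=np.float32)))

    # Assumed to return the fine-tuning functions for training and development.
    train_fn, valid_fn = net.build_finetune_functions(train_xy, valid_xy)

    # Predict acoustic features for one (dummy) test sentence of 100 frames.
    predicted = net.parameter_prediction(np.zeros((100, 425), dtype=np.float32))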
The following details are not included in the docstrings in the source code.
layers.gating.VanillaRNN(rng, x, n_in, n_h)
This class implements a standard recurrent neural network: h_t = f(W^{hx} x_t + W^{hh} h_{t-1} + b_h)
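Written out in numpy, one step of this recurrence looks as follows (an illustrative re-implementation of the formula, not the Theano code in layers.gating):

    import numpy as np

    def vanilla_rnn_step(x_t, h_tm1, W_hx, W_hh, b_h):
        """One step of h_t = f(W^{hx} x_t + W^{hh} h_{t-1} + b_h), with f = tanh."""
        return np.tanh(np.dot(W_hx, x_t) + np.dot(W_hh, h_tm1) + b_h)

    n_in, n_h = 4, 3
    rng = np.random.RandomState(0)
    W_hx, W_hh, b_h = rng.randn(n_h, n_in), rng.randn(n_h, n_h), np.zeros(n_h)

    h_t = np.zeros(n_h)
    for x_t in rng.randn(5, n_in):  # a five-frame input sequence
        h_t = vanilla_rnn_step(x_t, h_t, W_hx, W_hh, b_h)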
__init__(rng, x, n_in, n_h)
Initialise a standard RNN hidden unit.
Parameters:
recurrent_as_activation_function(Wix, h_tm1)
Implement the recurrent unit as an activation function. This function is called by self.__init__().
Parameters:
Returns: h_t, the hidden activation of the current time step
layers.gating.LstmBase(rng, x, n_in, n_h)
This class provides the base class for all long short-term memory (LSTM) related classes. Several variants of the LSTM are investigated in (Wu & King, ICASSP 2016): Zhizheng Wu, Simon King, "Investigating gated recurrent neural networks for speech synthesis", ICASSP 2016.
__init__(rng, x, n_in, n_h)
Initialise all the components in an LSTM block, including the input gate, output gate, forget gate, and peephole connections
Parameters:
lstm_as_activation_function()
A generic recurrent activation function for variants of the LSTM architecture. The function is called by self.recurrent_fn().
recurrent_fn(Wix, Wfx, Wcx, Wox, h_tm1, c_tm1=None)
This implements a generic recurrent function, called by self.__init__().
Parameters:
Returns: h_t, the hidden activation of the current time step, and c_t, the cell-memory activation of the current time step
layers.gating.VanillaLstm(rng, x, n_in, n_h)
This class implements the standard (vanilla) LSTM block, inheriting from layers.gating.LstmBase.
__init__(rng, x, n_in, n_h)
Initialise a vanilla LSTM block
Parameters:
lstm_as_activation_function(Wix, Wfx, Wcx, Wox, h_tm1, c_tm1)
This function treats the LSTM block as an activation function, and implements the standard LSTM activation function. The meaning of each input and output parameter can be found in layers.gating.LstmBase.recurrent_fn()
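For reference, the standard LSTM activation can be sketched in numpy as below. The names Wix..Wox and h_tm1/c_tm1 follow the signature above (the precomputed input projections W*x_t and the previous hidden/cell states); the recurrent weights U_* are local to this sketch, and the peephole terms are omitted for brevity.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def lstm_step(Wix, Wfx, Wcx, Wox, h_tm1, c_tm1, U_i, U_f, U_c, U_o):
        """One standard LSTM step (peephole connections omitted)."""
        i_t = sigmoid(Wix + np.dot(U_i, h_tm1))                      # input gate
        f_t = sigmoid(Wfx + np.dot(U_f, h_tm1))                      # forget gate
        c_t = f_t * c_tm1 + i_t * np.tanh(Wcx + np.dot(U_c, h_tm1))  # cell memory
        o_t = sigmoid(Wox + np.dot(U_o, h_tm1))                      # output gate
        h_t = o_t * np.tanh(c_t)                                     # hidden activation
        return h_t, c_t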
layers.gating.LstmNFG(rng, x, n_in, n_h)
This class implements an LSTM block without the forget gate, inheriting from layers.gating.LstmBase.
__init__(rng, x, n_in, n_h)
Initialise an LSTM without the forget gate
Parameters:
lstm_as_activation_function(Wix, Wfx, Wcx, Wox, h_tm1, c_tm1)
This function treats the LSTM block as an activation function, and implements the LSTM (without the forget gate) activation function. The meaning of each input and output parameter can be found in layers.gating.LstmBase.recurrent_fn()
layers.gating.LstmNIG(rng, x, n_in, n_h)
This class implements an LSTM block without the input gate, inheriting from layers.gating.LstmBase.
__init__(rng, x, n_in, n_h)
Initialise an LSTM without the input gate
Parameters:
lstm_as_activation_function(Wix, Wfx, Wcx, Wox, h_tm1, c_tm1)
This function treats the LSTM block as an activation function, and implements the LSTM (without the input gate) activation function. The meaning of each input and output parameter can be found in layers.gating.LstmBase.recurrent_fn()
layers.gating.LstmNOG(rng, x, n_in, n_h)
This class implements an LSTM block without the output gate, inheriting from layers.gating.LstmBase.
__init__(rng, x, n_in, n_h)
Initialise an LSTM without the output gate
Parameters:
lstm_as_activation_function(Wix, Wfx, Wcx, Wox, h_tm1, c_tm1)
This function treats the LSTM block as an activation function, and implements the LSTM (without the output gate) activation function.The meaning of each input and output parameters can be found inlayers.gating.LstmBase.recurrent_fn()
layers.gating.LstmNoPeepholes(rng, x, n_in, n_h)
This class implements an LSTM block without peephole connections, inheriting from layers.gating.LstmBase.
__init__(rng, x, n_in, n_h)
Initialise an LSTM without peephole connections
Parameters:
lstm_as_activation_function(Wix, Wfx, Wcx, Wox, h_tm1, c_tm1)
This function treats the LSTM block as an activation function, and implements the LSTM (without peephole connections) activation function. The meaning of each input and output parameter can be found in layers.gating.LstmBase.recurrent_fn()
layers.gating.SimplifiedLstm(rng, x, n_in, n_h)
This class implements a simplified LSTM block that keeps only the forget gate, inheriting from layers.gating.LstmBase.
__init__(rng, x, n_in, n_h)
Initialise a simplified LSTM with only the forget gate
Parameters:
lstm_as_activation_function(Wix, Wfx, Wcx, Wox, h_tm1, c_tm1)
This function treats the LSTM block as an activation function, and implements the simplified LSTM activation function. The meaning of each input and output parameter can be found in layers.gating.LstmBase.recurrent_fn()
layers.gating.GatedRecurrentUnit(rng, x, n_in, n_h)
This class implements a gated recurrent unit (GRU), as proposed in Cho et al. 2014 (http://arxiv.org/pdf/1406.1078.pdf).
__init__(rng, x, n_in, n_h)
Initialise a gated recurrent unit
Parameters:
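The GRU update of Cho et al. (2014) can be sketched in numpy as follows (an illustration with local weight names, not the class's Theano code):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def gru_step(x_t, h_tm1, W_r, U_r, W_z, U_z, W_h, U_h):
        """One GRU step following Cho et al. (2014)."""
        r_t = sigmoid(np.dot(W_r, x_t) + np.dot(U_r, h_tm1))            # reset gate
        z_t = sigmoid(np.dot(W_z, x_t) + np.dot(U_z, h_tm1))            # update gate
        h_tilde = np.tanh(np.dot(W_h, x_t) + np.dot(U_h, r_t * h_tm1))  # candidate
        return z_t * h_tm1 + (1.0 - z_t) * h_tilde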
io_funcs.binary_io.BinaryIOCollection
utils.providers.ListDataProvider(x_file_list, y_file_list, n_ins=0, n_outs=0, buffer_size=500000, sequential=False, shuffle=False)
This class provides an interface to load data into CPU/GPU memory, either utterance by utterance or block by block.
In speech synthesis, we usually cannot load all the training/evaluation data into RAM, so we proceed in three steps: load one block of data into a buffer, train on the buffered data, and repeat until the whole file list has been consumed.
Utterance-by-utterance loading is used for sequential training; block-by-block loading is used when the order of frames does not matter.
This class assumes binary data in float32 precision without any header (e.g., an HTK header).
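Because the files are headerless float32, the byte size of each file must be a whole multiple of 4 * dimension, which allows a quick sanity check over a file list. A sketch (the helper name is hypothetical):

    import os

    def check_frame_count(file_name, dim):
        """Verify a headerless float32 file holds whole dim-dimensional frames."""
        n_bytes = os.path.getsize(file_name)
        assert n_bytes % (4 * dim) == 0, '%s: not %d-dimensional float32' % (file_name, dim)
        return n_bytes // (4 * dim)  # number of frames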
__init__(x_file_list, y_file_list, n_ins=0, n_outs=0, buffer_size=500000, sequential=False, shuffle=False)
Initialise a data provider.
Parameters:
load_next_partition()
Load one block of data; the number of frames equals the buffer size set at initialisation.
load_next_utterance()
Load one utterance of data. This method is called when data is loaded utterance by utterance (e.g., sequential training).
Make the data shared for the Theano implementation. To understand why the data is made shared, please refer to the Theano documentation: http://deeplearning.net/software/theano/library/compile/shared.html
Parameters:
Returns: the shared dataset – data_set
reset()
When all the files in the file list have been used for DNN training, reset the data provider to start a new epoch.
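A typical epoch loop over the provider might then look like this sketch. The constructor and the methods used are the ones documented above; the file lists, the dimensionalities, and the end-of-list test is_finish() are assumptions that may differ in the actual implementation.

    from utils.providers import ListDataProvider

    # Hypothetical file lists of equal length, one entry per utterance.
    x_file_list = ['train_0001.lab.bin', 'train_0002.lab.bin']
    y_file_list = ['train_0001.cmp', 'train_0002.cmp']

    provider = ListDataProvider(x_file_list, y_file_list, n_ins=425, n_outs=187,
                                buffer_size=500000, sequential=False, shuffle=True)

    for epoch in range(10):
        while not provider.is_finish():                  # assumed end-of-list test
            shared_xy = provider.load_next_partition()   # return value assumed
            # ... run the fine-tuning function on this partition ...
        provider.reset()                                 # start the next epoch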
frontend.label_normalisation.HTSLabelNormalisation(question_file_name=None, subphone_feats='full', continuous_flag=True)
This class converts HTS-format labels into continuous or binary values, stored in binary format with float32 precision.
QS: the same as the questions used in HTS.
CQS: a new question type defined in this system. Here is an example question: CQS C-Syl-Tone {_(\d+)+}. The regular expression is used to extract continuous values.
The HTS labels should contain time-alignment information. Here is an example of an HTS label:
3050000 3100000 xx~#-p+l=i:1_4/A/0_0_0/B/1-1-4:1-1&1-4#1-3$1-4>0-1<0-1|i/C/1+1+3/D/0_0/E/content+1:1+3&1+2#0+1/F/content_1/G/0_0/H/4=3:1=1&L-L%/I/0_0/J/4+3-1[2]
3100000 3150000 xx~#-p+l=i:1_4/A/0_0_0/B/1-1-4:1-1&1-4#1-3$1-4>0-1<0-1|i/C/1+1+3/D/0_0/E/content+1:1+3&1+2#0+1/F/content_1/G/0_0/H/4=3:1=1&L-L%/I/0_0/J/4+3-1[3]
3150000 3250000 xx~#-p+l=i:1_4/A/0_0_0/B/1-1-4:1-1&1-4#1-3$1-4>0-1<0-1|i/C/1+1+3/D/0_0/E/content+1:1+3&1+2#0+1/F/content_1/G/0_0/H/4=3:1=1&L-L%/I/0_0/J/4+3-1[4]
3250000 3350000 xx~#-p+l=i:1_4/A/0_0_0/B/1-1-4:1-1&1-4#1-3$1-4>0-1<0-1|i/C/1+1+3/D/0_0/E/content+1:1+3&1+2#0+1/F/content_1/G/0_0/H/4=3:1=1&L-L%/I/0_0/J/4+3-1[5]
3350000 3900000 xx~#-p+l=i:1_4/A/0_0_0/B/1-1-4:1-1&1-4#1-3$1-4>0-1<0-1|i/C/1+1+3/D/0_0/E/content+1:1+3&1+2#0+1/F/content_1/G/0_0/H/4=3:1=1&L-L%/I/0_0/J/4+3-1[6]
3050000 3100000 are the start and end times; [2], [3], [4], [5], [6] are the HMM state indices.
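The timing fields and the state index can be recovered from such a line with ordinary string handling, as in this minimal sketch (the line is the first label line of the example above):

    import re

    line = ('3050000 3100000 xx~#-p+l=i:1_4/A/0_0_0/B/1-1-4:1-1&1-4#1-3$1-4>0-1'
            '<0-1|i/C/1+1+3/D/0_0/E/content+1:1+3&1+2#0+1/F/content_1/G/0_0'
            '/H/4=3:1=1&L-L%/I/0_0/J/4+3-1[2]')

    start_time, end_time, label = line.split(None, 2)
    state_index = int(re.search(r'\[(\d+)\]$', label).group(1))
    print(start_time, end_time, state_index)  # 3050000 3100000 2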
wildcards2regex(question, convert_number_pattern=False)
Convert an HTK-style question into a regular expression for searching the labels. If convert_number_pattern is True, the following sequences are kept unescaped to extract continuous values:
(\d+) – handles digits without a decimal point; ([\d\.]+) – handles digits with and without a decimal point
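To illustrate the difference between the two patterns, here is a small sketch with hand-written regular expressions and a hypothetical label fragment (not a call to wildcards2regex itself):

    import re

    label = '...C-Syl-Tone_3/D/...'   # hypothetical fragment, tone value 3

    # (\d+) captures an integer value after '_':
    print(re.search(r'_(\d+)', label).group(1))               # '3'

    # ([\d\.]+) also captures values with a decimal point, e.g. '_3.5':
    print(re.search(r'_([\d\.]+)', '..._3.5/...').group(1))   # '3.5'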