Original English article: [link]
My translation follows:
1. Introduction
First, a standard GMM-HMM acoustic model must be trained.
The monophone model is trained with the GMM-HMM system from utterance-level transcriptions, i.e. it learns a mapping between labels and audio.
The triphone model is then trained with the GMM-HMM system using phoneme-to-audio alignments.
The DNN is therefore strictly dependent on the quality of the GMM-HMM system: if the GMM-HMM is poor, the DNN will not turn out much better (no matter how many epochs you run, which cost function you choose, or how clever your learning rate schedule is); conversely, if the GMM-HMM is of high quality, the DNN will improve on it substantially.
A neural network is a classifier that assigns new observations (here, acoustic features) to one of a set of classes. The DNN's input nodes are typically 39-dimensional MFCC features, and its output nodes correspond to the labels (e.g. 900 output nodes <-> 900 context-dependent triphone states, i.e. decision-tree leaves). In other words, the acoustic features are used to train both the GMM-HMM and the decision tree, and these two pieces are what define the input and output layers of the acoustic model.
The size of the hidden layers is not constrained by the GMM-HMM structure or by the dimensionality of the acoustic features; it is up to the researcher or developer building the model.
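To make the input/output picture concrete, here is a purely illustrative sketch in the nnet2 config style used later in this post, assuming the 39-dimensional MFCC input and 900 decision-tree leaves mentioned above; the 300-unit hidden layer and all numeric values are made up for illustration, and the real config for this recipe appears in section 3:
AffineComponent input-dim=39 output-dim=300 learning-rate=0.02 param-stddev=0.057 bias-stddev=0.5
TanhComponent dim=300
AffineComponent input-dim=300 output-dim=900 learning-rate=0.02 param-stddev=0.057 bias-stddev=0.5
SoftmaxComponent dim=900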
Once the dimensions of the DNN's input and output nodes are fixed, we can run the phoneme-to-audio alignment and train the neural network.
Audio feature frames are fed into the input layer, and the network assigns a phoneme label to each frame. For any given frame we already have a gold-standard label (the phoneme label obtained from the GMM-HMM alignments), so we can compare the label the network outputs with the true one and, using a loss function and backpropagation, iterate over all frames to learn suitable weights and biases for each layer.
Note that, unlike GMM-HMM training, which uses the EM algorithm to iteratively realign the transcriptions to the audio frames, DNN training requires no such realignment step.
In the end, the goal is a DNN that assigns the correct phoneme label to each input audio frame.
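As a concrete illustration of those gold-standard frame labels (hypothetical paths, assuming a finished alignment directory such as exp/tri1_ali), the per-frame targets the DNN trains against can be dumped with standard Kaldi tools:
# per-frame pdf-ids (the classes the DNN actually predicts) for a few utterances
ali-to-pdf exp/tri1_ali/final.mdl "ark:gunzip -c exp/tri1_ali/ali.1.gz |" ark,t:- | head -n 3
# per-frame phone-ids instead, as a human-readable sanity check
ali-to-phones --per-frame=true exp/tri1_ali/final.mdl "ark:gunzip -c exp/tri1_ali/ali.1.gz |" ark,t:- | head -n 3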
2. Training a DNN
The main ingredients, all produced during standard GMM-HMM training, are the following:
- a training data directory containing feats.scp (created by a prepare_data.sh script in an s5/local directory)
- a lang directory (created by prepare_lang.sh)
- alignments from a GMM-HMM system (created by align_si.sh)
- MFCC features (created by the make_mfcc.sh script)
A sketch of the commands that typically produce these is shown after the directory listings below.
The train directory structure is as follows:
In my project the train directory is located at data/train.
train/
├── feats.scp
└── split4
├── 1
│ └── feats.scp
├── 2
│ └── feats.scp
├── 3
│ └── feats.scp
└── 4
└── feats.scp
The lang directory structure is as follows:
In my project the lang directory is located at data/lang.
lang/
└── topo
The alignment directory structure is as follows:
In my project the alignment directory is located at exp/tri1_ali, exp/tri2_ali, etc.
triphones_aligned/
├── ali.1.gz
├── ali.2.gz
├── ali.3.gz
├── ali.4.gz
├── final.mdl
├── num_jobs
└── tree
The mfcc directory structure is as follows:
In my project the mfcc directory is located at exp/make_mfcc.
mfcc/
├── raw_mfcc_train.1.ark
├── raw_mfcc_train.1.scp
├── raw_mfcc_train.2.ark
├── raw_mfcc_train.2.scp
├── raw_mfcc_train.3.ark
├── raw_mfcc_train.3.scp
├── raw_mfcc_train.4.ark
└── raw_mfcc_train.4.scp
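For reference, here is a rough sketch of the commands that typically produce the four directories above in a standard s5 recipe. The paths and job counts are assumptions matching my layout, and the dictionary directory and OOV symbol passed to prepare_lang.sh are placeholders; your prepare_data.sh and dictionary setup will differ:
# lang directory (dict dir and OOV symbol are placeholders)
utils/prepare_lang.sh data/local/dict "<UNK>" data/local/lang data/lang
# MFCC features -> data/train/feats.scp and exp/make_mfcc/raw_mfcc_train.*.{ark,scp}
steps/make_mfcc.sh --nj 4 --cmd run.pl data/train exp/make_mfcc/log exp/make_mfcc
# split the data dir into 4 jobs -> data/train/split4/{1,2,3,4}
utils/split_data.sh data/train 4
# phoneme-to-audio alignments from an existing triphone system -> exp/tri1_ali
steps/align_si.sh --nj 4 --cmd run.pl data/train data/lang exp/tri1 exp/tri1_ali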
To keep the framework easy to follow, other fiddly details have been stripped out.
The top-level script is run_nnet2.sh, whose content looks roughly like the following:
#!/bin/bash
# Joshua Meyer 2017
# This script is based off the run_nnet2_baseline.sh script from the wsj eg
# This is very much a toy example, intended to be for learning the ropes of
# nnet2 training and testing in Kaldi. You will not get state-of-the-art
# results.
# The default parameters here are in general low, to make training and
# testing faster on a CPU.
stage=1
experiment_dir=experiment/nnet2/nnet2_simple
num_threads=4
minibatch_size=128
unknown_phone=SPOKEN_NOISE # having these explicit is just something I did when
silence_phone=SIL # I was debugging, they are now required by decode_simple.sh
. ./path.sh
. ./utils/parse_options.sh
Continuing inside run_nnet2.sh, the training and testing stages call steps/nnet2/train_simple.sh (a stripped-down script in the style of steps/nnet2/train_pnorm_fast.sh) and steps/nnet2/decode_simple.sh. They look like the following:
(tip: in my project this code lives in run_nnet2.sh)
if [ $stage -le 1 ]; then
echo ""
echo "######################"
echo "### BEGIN TRAINING ###"
echo "######################"
mkdir -p $experiment_dir
steps/nnet2/train_simple.sh \
--stage -10 \
--num-threads "$num_threads" \
--feat-type raw \
--splice-width 4 \
--lda_dim 65 \
--num-hidden-layers 2 \
--hidden-layer-dim 50 \
--add-layers-period 5 \
--num-epochs 10 \
--iters-per-epoch 2 \
--initial-learning-rate 0.02 \
--final-learning-rate 0.004 \
--minibatch-size "$minibatch_size" \
data/train \
data/lang \
experiment/triphones_aligned \
$experiment_dir \
|| exit 1;
echo ""
echo "####################"
echo "### END TRAINING ###"
echo "####################"
if [ $stage -le 2 ]; then
echo ""
echo "#####################"
echo "### BEGIN TESTING ###"
echo "#####################"
steps/nnet2/decode_simple.sh \
--num-threads "$num_threads" \
--beam 8 \
--max-active 500 \
--lattice-beam 3 \
experiment/triphones/graph \
data/test \
$experiment_dir/final.mdl \
$unknown_phone \
$silence_phone \
$experiment_dir/decode \
|| exit 1;
for x in ${experiment_dir}/decode*; do
[ -d $x ] && grep WER $x/wer_* | \
utils/best_wer.sh > nnet2_simple_wer.txt;
done
echo ""
echo "###################"
echo "### END TESTING ###"
echo "###################"
fi
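A typical invocation from the top of the s5 directory might look like this (the option names come from the variables parsed by utils/parse_options.sh above; the values are just examples):
./run_nnet2.sh --stage 1 --num-threads 4 --minibatch-size 128
# after the testing stage, the best WER ends up in nnet2_simple_wer.txt
cat nnet2_simple_wer.txt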
3. The main script
First, some of the default parameter settings at the top of the training script (steps/nnet2/train_simple.sh, which follows the structure of Kaldi's train_pnorm_fast.sh):
#!/bin/bash
# Copyright 2012-2014 Johns Hopkins University (Author: Daniel Povey).
# 2013 Xiaohui Zhang
# 2013 Guoguo Chen
# 2014 Vimal Manohar
# Apache 2.0.
#
# Begin configuration section.
cmd=run.pl
stage=-4
num_epochs=15 # Number of epochs of training
initial_learning_rate=0.04
final_learning_rate=0.004
bias_stddev=0.5
hidden_layer_dim=0
add_layers_period=2 # by default, add new layers every 2 iterations.
num_hidden_layers=3
minibatch_size=128 # by default use a smallish minibatch size for neural net
# training; this controls instability which would otherwise
# be a problem with multi-threaded update.
num_threads=4 # Number of jobs to run in parallel.
splice_width=4 # meaning +- 4 frames on each side for second LDA
lda_dim=40
feat_type=raw # raw, untransformed features (probably MFCC or PLP)
iters_per_epoch=5
. ./path.sh || exit 1; # make sure we have a path.sh script
. ./utils/parse_options.sh || exit 1;
Once the command-line options above have been parsed, the script checks for the files produced by GMM-HMM training that DNN training depends on:
data_dir=$1
lang_dir=$2
ali_dir=$3
exp_dir=$4
# Check some files from our GMM-HMM system
for f in \
$data_dir/feats.scp \
$lang_dir/topo \
$ali_dir/ali.1.gz \
$ali_dir/final.mdl \
$ali_dir/tree \
$ali_dir/num_jobs;
do [ ! -f $f ] && echo "$0: no such file $f" && exit 1;
done
Once these files have been verified, the next step is to pull the relevant "parameter information" out of them:
# Set number of leaves
num_leaves=`tree-info $ali_dir/tree 2>/dev/null | grep num-pdfs | awk '{print $2}'` || exit 1;
# set up some dirs and parameter definition files
nj=`cat $ali_dir/num_jobs` || exit 1;
echo $nj > $exp_dir/num_jobs
cp $ali_dir/tree $exp_dir/tree
mkdir -p $exp_dir/log
The snippet above defines a handful of variables, creates two files, tree (copied from the GMM-HMM system) and num_jobs, and makes an empty log directory. The resulting directory structure is:
experiment/nnet2/
└── nnet2_simple
├── log
├── num_jobs
└── tree
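As an aside, the num_leaves line relies on the summary that tree-info prints; its output looks roughly like this (values are illustrative, chosen to match the model dumped later in this post):
tree-info exp/tri1_ali/tree
# num-pdfs 1759
# context-width 3
# central-position 1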
Next comes the data preparation that precedes training: the script steps/nnet2/get_lda_simple.sh (called below) estimates an LDA-like feature transform, and the resulting transformation matrix is later applied to the spliced features at the DNN input.
if [ $stage -le -5 ]; then
echo ""
echo "###############################"
echo "### BEGIN GET LDA TRANSFORM ###"
echo "###############################"
steps/nnet2/get_lda_simple.sh \
--cmd "$cmd" \
--lda-dim $lda_dim \
--feat-type $feat_type \
--splice-width $splice_width \
$data_dir \
$lang_dir \
$ali_dir \
$exp_dir \
|| exit 1;
# these files should have been written by get_lda.sh
feat_dim=$(cat $exp_dir/feat_dim) || exit 1;
lda_dim=$(cat $exp_dir/lda_dim) || exit 1;
lda_mat=$exp_dir/lda.mat || exit;
echo ""
echo "#############################"
echo "### END GET LDA TRANSFORM ###"
echo "#############################"
fi
The script above writes out the LDA transform matrix. When the neural network is initialized, this matrix becomes the DNN's FixedAffineComponent, placed right after the frame splicing at the input layer. In other words, once we have the LDA transform it is applied to every input, and because it is a Fixed component the LDA transform matrix is never updated by back-propagation. The output produced looks like this:
experiment/nnet2/
└── nnet2_simple
├── feat_dim
├── lda.1.acc
├── lda.2.acc
├── lda.3.acc
├── lda.4.acc
├── lda.acc
├── lda_dim
├── lda.mat
├── log
│ ├── lda_acc.1.log
│ ├── lda_acc.2.log
│ ├── lda_acc.3.log
│ ├── lda_acc.4.log
│ ├── lda_est.log
│ └── lda_sum.log
├── num_jobs
└── tree
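If you want to sanity-check the transform itself (a hypothetical command, assuming the layout above), the matrix can be printed as text to inspect its dimensions:
# print the LDA-like transform as text (40 rows here, matching lda_dim)
copy-matrix --binary=false experiment/nnet2/nnet2_simple/lda.mat - | head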
For simplicity, the following script only takes the training data and formats it into training examples; it does not hold out separate subsets for validation and diagnostics the way the standard scripts do.
if [ $stage -le -4 ]; then
echo ""
echo "###################################"
echo "### BEGIN GET TRAINING EXAMPLES ###"
echo "###################################"
steps/nnet2/get_egs_simple.sh \
--cmd "$cmd" \
--feat-type $feat_type \
--splice-width $splice_width \
--num-jobs-nnet $num_threads \
--iters-per-epoch $iters_per_epoch \
$data_dir \
$ali_dir \
$exp_dir \
|| exit 1;
# this is the path to the new egs dir that was just created
egs_dir=$exp_dir/egs
echo ""
echo "#################################"
echo "### END GET TRAINING EXAMPLES ###"
echo "#################################"
fi
Running the script above creates a new egs directory; the structure now looks like this:
experiment/nnet2/
└── nnet2_simple
├── egs
│ ├── egs.1.0.ark
│ ├── egs.1.1.ark
│ ├── egs.2.0.ark
│ ├── egs.2.1.ark
│ ├── egs.3.0.ark
│ ├── egs.3.1.ark
│ ├── egs.4.0.ark
│ ├── egs.4.1.ark
│ ├── iters_per_epoch
│ └── num_jobs_nnet
├── feat_dim
├── lda.1.acc
├── lda.2.acc
├── lda.3.acc
├── lda.4.acc
├── lda.acc
├── lda_dim
├── lda.mat
├── log
│ ├── get_egs.1.log
│ ├── get_egs.2.log
│ ├── get_egs.3.log
│ ├── get_egs.4.log
│ ├── lda_acc.1.log
│ ├── lda_acc.2.log
│ ├── lda_acc.3.log
│ ├── lda_acc.4.log
│ ├── lda_est.log
│ ├── lda_sum.log
│ ├── shuffle.0.1.log
│ ├── shuffle.0.2.log
│ ├── shuffle.0.3.log
│ ├── shuffle.0.4.log
│ ├── shuffle.1.1.log
│ ├── shuffle.1.2.log
│ ├── shuffle.1.3.log
│ ├── shuffle.1.4.log
│ ├── split_egs.1.log
│ ├── split_egs.2.log
│ ├── split_egs.3.log
│ └── split_egs.4.log
├── num_jobs
└── tree
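Each egs.*.ark file holds frame-level training examples: a spliced chunk of features together with the pdf-id label from the alignment. To peek at a few of them (hypothetical path, matching the layout above):
nnet-copy-egs ark:experiment/nnet2/nnet2_simple/egs/egs.1.0.ark ark,t:- | head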
Just as a topo file configures GMM-HMM training, before initializing the neural network we need to specify its size and structure. The script writes this information to a config file in the experiment directory, $exp_dir/nnet.config, whose contents look like the following:
SpliceComponent input-dim=$feat_dim left-context=$splice_width right-context=$splice_width
FixedAffineComponent matrix=$lda_mat
AffineComponent input-dim=$lda_dim output-dim=$hidden_layer_dim learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev
TanhComponent dim=$hidden_layer_dim
AffineComponent input-dim=$hidden_layer_dim output-dim=$num_leaves learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev
SoftmaxComponent dim=$num_leaves
The components are described as follows:
SpliceComponent defines the size of the window of feature-frame-splicing to perform.
FixedAffineComponent is our LDA-like transform created by get_lda_simple.sh.
AffineComponent is the standard Wx+b affine transform found in neural nets. This first AffineComponent represents the weights and biases between the input layer and the first hidden layer.
TanhComponent is the standard tanh nonlinearity.
AffineComponent is the standard Wx+b affine transform found in neural nets. This second AffineComponent represents the weights and biases between the hidden layer and the output layer.
SoftmaxComponent is the final nonlinearity that produces properly normalized probabilities at the output.
In more detail (including a few related component types from Kaldi's nnet2 documentation that this simple recipe does not use):
SpliceComponent: defines the window size for feature-frame splicing (with the center frame as the pivot and four frames on each side, nine frames are concatenated to form the input; a common setup uses 40-dimensional MFCC+splice+LDA+MLLT+fMLLR features, and a splicing width of 4 generally works best).
FixedAffineComponent: an LDA-like decorrelating transform; because it is a Fixed component it is not updated during training.
AffineComponent: the standard weight matrix plus bias, trained with ordinary stochastic gradient descent using a global learning rate.
AffineComponentPreconditionedOnline: a refinement of AffineComponent that, in addition to the global learning rate, uses matrix-valued learning rates to precondition the gradient descent; see dnn2_preconditioning in the Kaldi docs.
PnormComponent: the nonlinearity used in p-norm networks; a more conventional network, like the one here, uses TanhComponent instead.
NormalizeComponent: used to stabilize the training of p-norm networks. It is a fixed, non-trainable nonlinearity that acts not on individual activations but on the whole vector of activations for a single frame, renormalizing it.
SoftmaxComponent: the final nonlinearity, which produces properly normalized probabilities at the output.
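For comparison, in an actual p-norm setup the TanhComponent in the config above would be replaced by a PnormComponent plus NormalizeComponent pair, roughly like this (the dimensions are illustrative only, and this is not used in the simple recipe here):
AffineComponent input-dim=$lda_dim output-dim=800 learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev
PnormComponent input-dim=800 output-dim=160 p=2
NormalizeComponent dim=160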
The nnet.config above initializes the DNN with a single hidden layer. A second config file, hidden.config, defines the components that get inserted each time a new hidden layer is added during training:
AffineComponent input-dim=$hidden_layer_dim output-dim=$hidden_layer_dim learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev
TanhComponent dim=$hidden_layer_dim
Once again, an affine transform is followed directly by a non-linearity. With both config files in place, the acoustic model is initialized with nnet-am-init, which combines the tree, the topology, and the freshly initialized network:
$cmd $exp_dir/log/nnet_init.log \
nnet-am-init \
$ali_dir/tree \
$lang_dir/topo \
"nnet-init $exp_dir/nnet.config -|" \
$exp_dir/0.mdl \
|| exit 1;
Next, let's "check in" and see which files have been produced so far:
nnet2/
└── nnet2_simple
├── 0.mdl
├── hidden.config
├── log
│ └── nnet_init.log
└── nnet.config
Running nnet-am-info on 0.mdl shows the structure of the freshly initialized network:
num-components 6
num-updatable-components 2
left-context 4
right-context 4
input-dim 13
output-dim 1759
parameter-dim 181759
component 0 : SpliceComponent, input-dim=13, output-dim=117, context=-4 -3 -2 -1 0 1 2 3 4
component 1 : FixedAffineComponent, input-dim=117, output-dim=40, linear-params-stddev=0.0146923, bias-params-stddev=2.91086
component 2 : AffineComponent, input-dim=40, output-dim=100, linear-params-stddev=0.100784, bias-params-stddev=0.49376, learning-rate=0.02
component 3 : TanhComponent, input-dim=100, output-dim=100
component 4 : AffineComponent, input-dim=100, output-dim=1759, linear-params-stddev=0, bias-params-stddev=0, learning-rate=0.02
component 5 : SoftmaxComponent, input-dim=1759, output-dim=1759
prior dimension: 0
Note that the prior dimension is still 0: the priors have not yet been set. The next step, nnet-train-transitions, trains the HMM transition probabilities and computes the pdf priors from the training alignments (during decoding these priors are divided out of the network's posteriors):
$cmd $exp_dir/log/train_trans.log \
nnet-train-transitions \
$exp_dir/0.mdl \
"ark:gunzip -c $ali_dir/ali.*.gz|" \
$exp_dir/0.mdl \
|| exit 1;
Running nnet-am-info on 0.mdl again shows that the priors have now been filled in:
nnet-am-info 0.mdl
num-components 6
num-updatable-components 2
left-context 4
right-context 4
input-dim 13
output-dim 1759
parameter-dim 181759
component 0 : SpliceComponent, input-dim=13, output-dim=117, context=-4 -3 -2 -1 0 1 2 3 4
component 1 : FixedAffineComponent, input-dim=117, output-dim=40, linear-params-stddev=0.0146923, bias-params-stddev=2.91086
component 2 : AffineComponent, input-dim=40, output-dim=100, linear-params-stddev=0.100784, bias-params-stddev=0.49376, learning-rate=0.02
component 3 : TanhComponent, input-dim=100, output-dim=100
component 4 : AffineComponent, input-dim=100, output-dim=1759, linear-params-stddev=0, bias-params-stddev=0, learning-rate=0.02
component 5 : SoftmaxComponent, input-dim=1759, output-dim=1759
prior dimension: 1759, prior sum: 1, prior min: 1.68406e-05
Next comes the main training loop, in which the network parameters are updated via backpropagation:
if [ $stage -le -2 ]; then
echo ""
echo "#################################"
echo "### BEGIN TRAINING NEURAL NET ###"
echo "#################################"
# get some info on iterations and number of models we're training
iters_per_epoch=`cat $egs_dir/iters_per_epoch` || exit 1;
num_jobs_nnet=`cat $egs_dir/num_jobs_nnet` || exit 1;
num_tot_iters=$[$num_epochs * $iters_per_epoch]
echo "Will train for $num_epochs epochs = $num_tot_iters iterations"
# Main training loop
x=0
while [ $x -lt $num_tot_iters ]; do
echo "Training neural net (pass $x)"
# IF *not* first iteration \
# AND we still have layers to add \
# AND its the right time to add a layer
if [ $x -gt 0 ] \
&& [ $x -le $[($num_hidden_layers-1)*$add_layers_period] ] \
&& [ $[($x-1) % $add_layers_period] -eq 0 ];
then
echo "Adding new hidden layer"
mdl="nnet-init --srand=$x $exp_dir/hidden.config - |"
mdl="$mdl nnet-insert $exp_dir/$x.mdl - - |"
else
# otherwise just use the past model
mdl=$exp_dir/$x.mdl
fi
# Shuffle examples and train nets with SGD
$cmd JOB=1:$num_jobs_nnet $exp_dir/log/train.$x.JOB.log \
nnet-shuffle-egs \
--srand=$x \
ark:$egs_dir/egs.JOB.$[$x%$iters_per_epoch].ark \
ark:- \| \
nnet-train-parallel \
--num-threads=$num_threads \
--minibatch-size=$minibatch_size \
--srand=$x \
"$mdl" \
ark:- \
$exp_dir/$[$x+1].JOB.mdl \
|| exit 1;
# Get a list of all the nnets which were run on different jobs
nnets_list=
for n in `seq 1 $num_jobs_nnet`; do
nnets_list="$nnets_list $exp_dir/$[$x+1].$n.mdl"
done
learning_rate=`perl -e '($x,$n,$i,$f)=@ARGV; print ($x >= $n ? $f : $i*exp($x*log($f/$i)/$n));' $[$x+1] $num_tot_iters $initial_learning_rate $final_learning_rate`;
# Average all SGD-trained models for this iteration
$cmd $exp_dir/log/average.$x.log \
nnet-am-average \
$nnets_list - \| \
nnet-am-copy \
--learning-rate=$learning_rate \
- \
$exp_dir/$[$x+1].mdl \
|| exit 1;
# on to the next model
x=$[$x+1]
done;
# copy and rename final model as final.mdl
cp $exp_dir/$x.mdl $exp_dir/final.mdl
echo ""
echo "################################"
echo "### DONE TRAINING NEURAL NET ###"
echo "################################"
fi
In the loop above, the actual training happens in nnet-train-parallel, which runs SGD on the shuffled examples.
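Finally, the learning rate passed to nnet-am-copy each iteration follows a geometric schedule from initial_learning_rate down to final_learning_rate. The perl one-liner above can be run standalone to see the schedule; with the toy values from run_nnet2.sh (10 epochs x 2 iterations per epoch = 20 iterations, 0.02 down to 0.004) it produces roughly 0.02, 0.0134, 0.0089, 0.0060, 0.004:
for x in 0 5 10 15 20; do
  perl -e '($x,$n,$i,$f)=@ARGV; print ($x >= $n ? $f : $i*exp($x*log($f/$i)/$n)); print "\n";' $x 20 0.02 0.004
done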