First, install Kaldi following the official documentation.
Then download http://kaldi-asr.org/models/m2 and extract it into egs/cvte, so that the directory kaldi/egs/cvte/s5 exists.
The following describes how to add new audio files and run a recognition test on them.
Store the audio files as follows:
data/wav/chat001
├── 001.wav
└── 002.wav
For recording audio files, see the companion notes on common speech-processing tools. Check the audio format with sox:
$ sox --info data/wav/chat001/001.wav
Input File : 'data/wav/chat001/001.wav'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:06.25 = 100000 samples ~ 468.75 CDDA sectors
File Size : 200k
Bit Rate : 256k
Sample Encoding: 16-bit Signed Integer PCM
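The format shown above (mono, 16 kHz, 16-bit signed PCM) is what this walkthrough assumes. If sox is not at hand, the same check can be done programmatically; below is a minimal sketch using Python's standard-library wave module (the function name is my own, not part of Kaldi):

```python
import wave

def check_kaldi_wav(path):
    """Return True if the WAV file matches the format used in this
    walkthrough: mono, 16 kHz sample rate, 16-bit signed PCM."""
    with wave.open(path, "rb") as w:
        return (w.getnchannels() == 1
                and w.getframerate() == 16000
                and w.getsampwidth() == 2)  # 2 bytes per sample = 16-bit
```

Files that fail the check can be converted with sox before being listed in wav.scp.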
Next, create a Kaldi data directory for the new test set:

cd egs/cvte/s5
data/chat001/test
├── conf
│ └── fbank.conf
├── frame_shift
├── spk2utt
├── text
├── utt2spk
└── wav.scp
Here, the conf directory and the frame_shift file are copied from data/fbank/test.
wav.scp is the list of audio files:
CHAT001_20200801_001 data/wav/chat001/001.wav
CHAT001_20200801_002 data/wav/chat001/002.wav
The separator between the first and second columns is a tab; do not replace it with four spaces. The same applies to the other files below.
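Rather than typing wav.scp by hand, the lines can be generated from the directory listing. A sketch (the function name is my own; the ID scheme mirrors the example above):

```python
import os

def make_wav_scp(wav_dir, prefix):
    """Build wav.scp lines: "<utt-id>\t<path>", one per .wav file,
    sorted by filename so the Kaldi sorting requirement is met."""
    lines = []
    for name in sorted(os.listdir(wav_dir)):
        if name.endswith(".wav"):
            utt_id = f"{prefix}_{os.path.splitext(name)[0]}"
            # columns separated by a tab, as noted above
            lines.append(f"{utt_id}\t{os.path.join(wav_dir, name)}")
    return lines
```

For the layout above, make_wav_scp("data/wav/chat001", "CHAT001_20200801") yields exactly the two lines shown.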
text holds the transcript of each audio file:
CHAT001_20200801_001 上海 浦东机场 入境 防 输入 全 闭环 管理
CHAT001_20200801_002 北京 地铁 宣武门 站 综合 改造 新增 换乘 通道
In text, the second column onward is a sequence of space-separated words; the vocabulary is exp/chain/tdnn/graph/words.txt.
The corresponding phone set is exp/chain/tdnn/graph/phones.txt.
The mapping between words and phones is exp/chain/tdnn/graph/phones/align_lexicon.int; in that file, numbers such as 149 and 133 are phone IDs, defined in exp/chain/tdnn/graph/phones.txt.
spk2utt and utt2spk describe the mapping between speakers and utterances.
$ cat data/chat001/test/utt2spk
CHAT001_20200801_001 CHAT001_20200801_001
CHAT001_20200801_002 CHAT001_20200801_002
$ cat data/chat001/test/spk2utt
CHAT001_20200801_001 CHAT001_20200801_001
CHAT001_20200801_002 CHAT001_20200801_002
Here, the utterance ID doubles as the speaker ID. In Kaldi, "speaker" is a loose concept; ideally each distinct speaker would get an ID of their own.
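Kaldi ships utils/utt2spk_to_spk2utt.pl to derive spk2utt from utt2spk; the mapping is just an inversion that groups utterances per speaker, sketched here in Python for illustration:

```python
from collections import defaultdict

def utt2spk_to_spk2utt(utt2spk_lines):
    """Invert utt2spk lines ("<utt> <spk>") into spk2utt lines
    ("<spk> <utt1> <utt2> ..."), sorted by speaker ID."""
    spk2utt = defaultdict(list)
    for line in utt2spk_lines:
        utt, spk = line.split()
        spk2utt[spk].append(utt)
    return [f"{spk} {' '.join(utts)}" for spk, utts in sorted(spk2utt.items())]
```

With one utterance per "speaker", as above, spk2utt and utt2spk come out identical, which matches the two listings shown.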
utils/validate_data_dir.sh data/chat001/test
utils/fix_data_dir.sh data/chat001/test

fix_data_dir.sh automatically repairs common problems, such as sorting the files.
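One of the things validated here is that every data-dir file is sorted on its first column in C-locale (byte) order; conceptually (an illustrative check, not Kaldi's actual code — for ASCII IDs Python's string order matches byte order):

```python
def is_c_sorted(lines):
    """True if the first-column keys of the given data-dir lines
    are in plain byte order, as Kaldi requires."""
    keys = [line.split()[0] for line in lines]
    return keys == sorted(keys)
```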
Create the run script kaldi/egs/cvte/s5/run_chat001.sh:
#!/bin/bash
. ./cmd.sh
. ./path.sh

# step 1: generate fbank features
obj_dir=data/chat001
for x in test; do
  rm -rf fbank/$x
  mkdir -p fbank/$x
  # compute fbank without pitch
  steps/make_fbank.sh --nj 1 --cmd "run.pl" $obj_dir/$x exp/make_fbank/$x fbank/$x || exit 1;
  # compute cmvn
  steps/compute_cmvn_stats.sh $obj_dir/$x exp/fbank_cmvn/$x fbank/$x || exit 1;
done

# step 2: offline decoding
test_data=data/chat001/test
dir=exp/chain/tdnn
steps/nnet3/decode.sh --acwt 1.0 --post-decode-acwt 10.0 \
  --nj 1 --num-threads 1 \
  --cmd "$decode_cmd" --iter final \
  --frames-per-chunk 50 \
  $dir/graph $test_data $dir/decode_chat001_test
# Note: the model was trained with "apply-cmvn-online", so you can modify
# the corresponding code in steps/nnet3/decode.sh to obtain the best performance.
# Running steps/nnet3/decode.sh as-is also works well, just slightly worse
# than the "apply-cmvn-online" method.
The script runs in two steps: feature extraction and offline decoding. After it finishes, the data directory contains:
$ tree data/chat001/test
data/chat001/test
├── cmvn.scp
├── conf
│ └── fbank.conf
├── feats.scp
├── frame_shift
├── spk2utt
├── split1
│ └── 1
│ ├── cmvn.scp
│ ├── feats.scp
│ ├── spk2utt
│ ├── text
│ ├── utt2dur
│ ├── utt2num_frames
│ ├── utt2spk
│ └── wav.scp
├── text
├── utt2dur
├── utt2num_frames
├── utt2spk
└── wav.scp
feats.scp, utt2dur, and utt2num_frames are generated by make_fbank.sh, which also writes further files under fbank/test.
cmvn.scp is the normalization statistics file, generated by steps/compute_cmvn_stats.sh.
The splitN folders appear when a dataset is processed in parallel: the data is split into N parts, the jobs run concurrently, and the results are merged afterwards.
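The splitN directories are created by utils/split_data.sh, which divides the utterance list into roughly equal chunks, one per parallel job. The idea can be sketched as follows (an illustrative sketch, not Kaldi's actual code):

```python
def split_utts(utt_ids, nj):
    """Split a sorted utterance-ID list into nj roughly equal,
    contiguous chunks, one per parallel decoding job."""
    n = len(utt_ids)
    # boundary indices: chunk i covers utt_ids[bounds[i]:bounds[i+1]]
    bounds = [round(i * n / nj) for i in range(nj + 1)]
    return [utt_ids[bounds[i]:bounds[i + 1]] for i in range(nj)]
```

With --nj 1, as in the run script above, there is a single chunk, hence the lone split1/1 directory.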
The fbank/test directory:
fbank/test
├── cmvn_test.ark
├── cmvn_test.scp
├── raw_fbank_test.1.ark
└── raw_fbank_test.1.scp
The exp/make_fbank directory:
exp/make_fbank
└── test
├── make_fbank_test.1.log
└── wav.1.scp
The decoding step is:

steps/nnet3/decode.sh --acwt 1.0 --post-decode-acwt 10.0 \
--nj 1 --num-threads 1 \
--cmd "$decode_cmd" --iter final \
--frames-per-chunk 50 \
$dir/graph $test_data $dir/decode_chat001_test
Decoding also computes the error rate, and n-best output can be enabled. The full log is written to exp/chain/tdnn/decode_chat001_test/log/decode.1.log; the best score is:

$ cat exp/chain/tdnn/decode_chat001_test/scoring_kaldi/best_cer
%WER 2.94 [ 1 / 34, 0 ins, 0 del, 1 sub ] exp/chain/tdnn/decode_chat001_test/cer_7_0.0
This test covers 34 characters in total; the edit distance between reference and hypothesis is 0 insertions, 0 deletions, and 1 substitution (防 was recognized as 房, visible in the decoding output below).
Note also that the word 闭环 is not in the pronunciation lexicon, so it comes out as the two separate characters 闭 and 环; at the character level this is not an error, so the result can be considered accurate.
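The 2.94% figure is easy to reproduce: character error rate is the Levenshtein distance between the reference and hypothesis character sequences, divided by the reference length. A self-contained sketch (Kaldi's own scoring is done by its compute-wer tool; this is only for illustration):

```python
def cer(ref, hyp):
    """Character error rate: Levenshtein distance between the two
    character sequences divided by the reference length."""
    r, h = list(ref), list(hyp)
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(h) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + 1,                           # deletion
                dp[i][j - 1] + 1,                           # insertion
                dp[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # substitution or match
            )
    return dp[-1][-1] / len(r)
```

Joining the two reference transcripts into one 34-character string and applying the single 防→房 substitution gives round(100 * cer(ref, hyp), 2) == 2.94, matching the best_cer line.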
The scoring and n-best outputs are under exp/chain/tdnn/decode_chat001_test/scoring_kaldi.
The command executed during the decoding stage (echoed in decode.1.log) is:
# nnet3-latgen-faster --frame-subsampling-factor=3 --frames-per-chunk=50 --extra-left-context=0 --extra-right-context=0 --extra-left-context-initial=-1 --extra-right-context-final=-1 --minimize=false --max-active=7000 --min-active=200 --beam=15.0 --lattice-beam=8.0 --acoustic-scale=1.0 --allow-partial=true --word-symbol-table=exp/chain/tdnn/graph/words.txt exp/chain/tdnn/final.mdl exp/chain/tdnn/graph/HCLG.fst "ark,s,cs:apply-cmvn --norm-means=true --norm-vars=false --utt2spk=ark:data/chat001/test/split1/1/utt2spk scp:data/chat001/test/split1/1/cmvn.scp scp:data/chat001/test/split1/1/feats.scp ark:- |" "ark:|lattice-scale --acoustic-scale=10.0 ark:- ark:- | gzip -c >exp/chain/tdnn/decode_chat001_test/lat.1.gz"
# Started at Sat Aug 1 16:21:14 CST 2020
#
nnet3-latgen-faster --frame-subsampling-factor=3 --frames-per-chunk=50 --extra-left-context=0 --extra-right-context=0 --extra-left-context-initial=-1 --extra-right-context-final=-1 --minimize=false --max-active=7000 --min-active=200 --beam=15.0 --lattice-beam=8.0 --acoustic-scale=1.0 --allow-partial=true --word-symbol-table=exp/chain/tdnn/graph/words.txt exp/chain/tdnn/final.mdl exp/chain/tdnn/graph/HCLG.fst 'ark,s,cs:apply-cmvn --norm-means=true --norm-vars=false --utt2spk=ark:data/chat001/test/split1/1/utt2spk scp:data/chat001/test/split1/1/cmvn.scp scp:data/chat001/test/split1/1/feats.scp ark:- |' 'ark:|lattice-scale --acoustic-scale=10.0 ark:- ark:- | gzip -c >exp/chain/tdnn/decode_chat001_test/lat.1.gz'
LOG (nnet3-latgen-faster[5.5.765-f88d5]:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 1 orphan nodes.
LOG (nnet3-latgen-faster[5.5.765-f88d5]:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 2 orphan components.
LOG (nnet3-latgen-faster[5.5.765-f88d5]:Collapse():nnet-utils.cc:1472) Added 1 components, removed 2
lattice-scale --acoustic-scale=10.0 ark:- ark:-
apply-cmvn --norm-means=true --norm-vars=false --utt2spk=ark:data/chat001/test/split1/1/utt2spk scp:data/chat001/test/split1/1/cmvn.scp scp:data/chat001/test/split1/1/feats.scp ark:-
LOG (nnet3-latgen-faster[5.5.765-f88d5]:CheckAndFixConfigs():nnet-am-decodable-simple.cc:294) Increasing --frames-per-chunk from 50 to 51 to make it a multiple of --frame-subsampling-factor=3
CHAT001_20200801_001 上海 浦东机场 入境 房 输入 全 闭 环 管理
LOG (nnet3-latgen-faster[5.5.765-f88d5]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:375) Log-like per frame for utterance CHAT001_20200801_001 is 2.19918 over 208 frames.
LOG (apply-cmvn[5.5.765-f88d5]:main():apply-cmvn.cc:162) Applied cepstral mean normalization to 2 utterances, errors on 0
CHAT001_20200801_002 北京 地铁 宣武门 站 综合 改造 新增 换乘 通道
LOG (nnet3-latgen-faster[5.5.765-f88d5]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:375) Log-like per frame for utterance CHAT001_20200801_002 is 2.19511 over 333 frames.
LOG (nnet3-latgen-faster[5.5.765-f88d5]:main():nnet3-latgen-faster.cc:256) Time taken 10.9386s: real-time factor assuming 100 frames/sec is 0.673972
LOG (nnet3-latgen-faster[5.5.765-f88d5]:main():nnet3-latgen-faster.cc:259) Done 2 utterances, failed for 0
LOG (nnet3-latgen-faster[5.5.765-f88d5]:main():nnet3-latgen-faster.cc:261) Overall log-likelihood per frame is 2.19668 over 541 frames.
LOG (nnet3-latgen-faster[5.5.765-f88d5]:~CachingOptimizingCompiler():nnet-optimize.cc:710) 0.00447 seconds taken in nnet3 compilation total (breakdown: 0.00219 compilation, 0.00168 optimization, 0 shortcut expansion, 0.000385 checking, 1.1e-05 computing indexes, 0.000209 misc.) + 0 I/O.
LOG (lattice-scale[5.5.765-f88d5]:main():lattice-scale.cc:107) Done 2 lattices.
# Accounting: time=53 threads=1
# Ended (code 0) at Sat Aug 1 16:22:07 CST 2020, elapsed time 53 seconds
Let's look at the argument list in detail:
nnet3-latgen-faster \
--frame-subsampling-factor=3 \
--frames-per-chunk=50 \
--extra-left-context=0 \
--extra-right-context=0 \
--extra-left-context-initial=-1 \
--extra-right-context-final=-1 \
--minimize=false \
--max-active=7000 \
--min-active=200 \
--beam=15.0 \
--lattice-beam=8.0 \
--acoustic-scale=1.0 \
--allow-partial=true \
--word-symbol-table=exp/chain/tdnn/graph/words.txt \
exp/chain/tdnn/final.mdl \
exp/chain/tdnn/graph/HCLG.fst \
"ark,s,cs:apply-cmvn --norm-means=true --norm-vars=false --utt2spk=ark:data/chat001/test/split1/1/utt2spk scp:data/chat001/test/split1/1/cmvn.scp scp:data/chat001/test/split1/1/feats.scp ark:- |" \
"ark:|lattice-scale --acoustic-scale=10.0 ark:- ark:- | gzip -c >exp/chain/tdnn/decode_chat001_test/lat.1.gz"
The nnet3-latgen-faster command is built on the LatticeFasterDecoder, with acoustic scores coming from an nnet3 model.
There are also the similar commands nnet3-latgen-faster-parallel and nnet3-latgen-faster-batch.
The help output of nnet3-latgen-faster:
Generate lattices using nnet3 neural net model.
Usage: nnet3-latgen-faster [options] <nnet-in> <fst-in|fsts-rspecifier> <features-rspecifier> <lattice-wspecifier> [ <words-wspecifier> [<alignments-wspecifier>] ]
See also: nnet3-latgen-faster-parallel, nnet3-latgen-faster-batch
Options:
--acoustic-scale : Scaling factor for acoustic log-likelihoods (caution: is a no-op if set in the program nnet3-compute (float, default = 0.1)
--allow-partial : If true, produce output even if end state was not reached. (bool, default = false)
--beam : Decoding beam. Larger->slower, more accurate. (float, default = 16)
--beam-delta : Increment used in decoding-- this parameter is obscure and relates to a speedup in the way the max-active constraint is applied. Larger is more accurate. (float, default = 0.5)
--computation.debug : If true, turn on debug for the neural net computation (very verbose!) Will be turned on regardless if --verbose >= 5 (bool, default = false)
--debug-computation : If true, turn on debug for the actual computation (very verbose!) (bool, default = false)
--delta : Tolerance used in determinization (float, default = 0.000976562)
--determinize-lattice : If true, determinize the lattice (lattice-determinization, keeping only best pdf-sequence for each word-sequence). (bool, default = true)
--extra-left-context : Number of frames of additional left-context to add on top of the neural net's inherent left context (may be useful in recurrent setups (int, default = 0)
--extra-left-context-initial : If >= 0, overrides the --extra-left-context value at the start of an utterance. (int, default = -1)
--extra-right-context : Number of frames of additional right-context to add on top of the neural net's inherent right context (may be useful in recurrent setups (int, default = 0)
--extra-right-context-final : If >= 0, overrides the --extra-right-context value at the end of an utterance. (int, default = -1)
--frame-subsampling-factor : Required if the frame-rate of the output (e.g. in 'chain' models) is less than the frame-rate of the original alignment. (int, default = 1)
--frames-per-chunk : Number of frames in each chunk that is separately evaluated by the neural net. Measured before any subsampling, if the --frame-subsampling-factor options is used (i.e. counts input frames (int, default = 50)
--hash-ratio : Setting used in decoder to control hash behavior (float, default = 2)
--ivectors : Rspecifier for iVectors as vectors (i.e. not estimated online); per utterance by default, or per speaker if you provide the --utt2spk option. (string, default = "")
--lattice-beam : Lattice generation beam. Larger->slower, and deeper lattices (float, default = 10)
--max-active : Decoder max active states. Larger->slower; more accurate (int, default = 2147483647)
--max-mem : Maximum approximate memory usage in determinization (real usage might be many times this). (int, default = 50000000)
--min-active : Decoder minimum #active states. (int, default = 200)
--minimize : If true, push and minimize after determinization. (bool, default = false)
--online-ivector-period : Number of frames between iVectors in matrices supplied to the --online-ivectors option (int, default = 0)
--online-ivectors : Rspecifier for iVectors estimated online, as matrices. If you supply this, you must set the --online-ivector-period option. (string, default = "")
--optimization.allocate-from-other : Instead of deleting a matrix of a given size and then allocating a matrix of the same size, allow re-use of that memory (bool, default = true)
--optimization.allow-left-merge : Set to false to disable left-merging of variables in remove-assignments (obscure option) (bool, default = true)
--optimization.allow-right-merge : Set to false to disable right-merging of variables in remove-assignments (obscure option) (bool, default = true)
--optimization.backprop-in-place : Set to false to disable optimization that allows in-place backprop (bool, default = true)
--optimization.consolidate-model-update : Set to false to disable optimization that consolidates the model-update phase of backprop (e.g. for recurrent architectures (bool, default = true)
--optimization.convert-addition : Set to false to disable the optimization that converts Add commands into Copy commands wherever possible. (bool, default = true)
--optimization.extend-matrices : This optimization can reduce memory requirements for TDNNs when applied together with --convert-addition=true (bool, default = true)
--optimization.initialize-undefined : Set to false to disable optimization that avoids redundant zeroing (bool, default = true)
--optimization.max-deriv-time : You can set this to the maximum t value that you want derivatives to be computed at when updating the model. This is an optimization that saves time in the backprop phase for recurrent frameworks (int, default = 2147483647)
--optimization.max-deriv-time-relative : An alternative mechanism for setting the --max-deriv-time, suitable for situations where the length of the egs is variable. If set, it is equivalent to setting the --max-deriv-time to this value plus the largest 't' value in any 'output' node of the computation request. (int, default = 2147483647)
--optimization.memory-compression-level : This is only relevant to training, not decoding. Set this to 0,1,2; higher levels are more aggressive at reducing memory by compressing quantities needed for backprop, potentially at the expense of speed and the accuracy of derivatives. 0 means no compression at all; 1 means compression that shouldn't affect results at all. (int, default = 1)
--optimization.min-deriv-time : You can set this to the minimum t value that you want derivatives to be computed at when updating the model. This is an optimization that saves time in the backprop phase for recurrent frameworks (int, default = -2147483648)
--optimization.move-sizing-commands : Set to false to disable optimization that moves matrix allocation and deallocation commands to conserve memory. (bool, default = true)
--optimization.optimize : Set this to false to turn off all optimizations (bool, default = true)
--optimization.optimize-row-ops : Set to false to disable certain optimizations that act on operations of type *Row*. (bool, default = true)
--optimization.propagate-in-place : Set to false to disable optimization that allows in-place propagation (bool, default = true)
--optimization.remove-assignments : Set to false to disable optimization that removes redundant assignments (bool, default = true)
--optimization.snip-row-ops : Set this to false to disable an optimization that reduces the size of certain per-row operations (bool, default = true)
--optimization.split-row-ops : Set to false to disable an optimization that may replace some operations of type kCopyRowsMulti or kAddRowsMulti with up to two simpler operations. (bool, default = true)
--phone-determinize : If true, do an initial pass of determinization on both phones and words (see also --word-determinize) (bool, default = true)
--prune-interval : Interval (in frames) at which to prune tokens (int, default = 25)
--utt2spk : Rspecifier for utt2spk option used to get ivectors per speaker (string, default = "")
--word-determinize : If true, do a second pass of determinization on words only (see also --phone-determinize) (bool, default = true)
--word-symbol-table : Symbol table for words [for debug output] (string, default = "")
Standard options:
--config : Configuration file to read (this option may be repeated) (string, default = "")
--help : Print out usage message (bool, default = false)
--print-args : Print the command line arguments (to stderr) (bool, default = true)
--verbose : Verbose level (higher->more logging) (int, default = 0)