WFST-Based Hotword Customization for Speech Recognition

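I recently studied Kaldi's "Support for grammars and graphs with on-the-fly parts". The English documentation is fairly hard going, and many of its concepts are much easier to grasp with accompanying figures, so I wrote this article for anyone who may find it useful.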

Background

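In practical applications of speech recognition, we often need to customize the recognition results, a problem sometimes called the "hotword" problem. For example, we may want the speech recognition engine to accurately recognize the names in a user's contact list. In such cases, a customization scheme for the recognizer is a good choice.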

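WFST-based speech recognition decoders are generally implemented in one of two ways: dynamic decoding or static decoding. Kaldi's hotword customization combines two techniques: dynamic decoding and a class-based language model. Since I mainly care about L and G, I only read the L, G, and LG parts of the Kaldi documentation in detail.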

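The general idea in the Kaldi documentation is as follows. First, build the top-level HCLG.fst (abbreviated top-HCLG.fst). This HCLG is essentially the same as a conventional HCLG, except that some of its input labels are user-defined symbols (for example, the input label on some arcs is #nonterm:contact_list, which can stand for the contact class). Second, build the sub-level HCLG.fst (abbreviated sub-HCLG.fst), which expands the user-defined label of top-HCLG.fst (for example, it expands #nonterm:contact_list into concrete names such as Xiao Ming and Xiao Li). This graph is small, so when new names are added, the name list is extended and sub-HCLG.fst is simply rebuilt. Finally, when the decoder encounters one of the special input labels in top-HCLG.fst during decoding (here, #nonterm:contact_list), it knows that it must dynamically load the corresponding sub-HCLG.fst and continue decoding the corresponding content inside it.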
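To make the top-level / sub-level split concrete, here is a toy sketch in Python. It is not Kaldi's decoder: plain word sequences stand in for the HCLG graphs, and the names TOP_LEVEL, SUB_GRAPHS, expand and all_sentences are hypothetical. It only illustrates that the class symbol is resolved by jumping into a separate, small structure that can be rebuilt on its own.

```python
from itertools import product

# Hypothetical toy data: word sequences stand in for full HCLG graphs.
TOP_LEVEL = [
    ["CALL", "#nonterm:contact_list"],
    ["SEND", "A", "MESSAGE", "TO", "#nonterm:contact_list"],
]
SUB_GRAPHS = {"#nonterm:contact_list": [["XIAOMING"], ["XIAOLI"]]}


def expand(sentence, sub_graphs):
    """Yield every word sequence obtained by replacing each class symbol
    with one of the paths of its sub-graph."""
    choices = []
    for token in sentence:
        if token in sub_graphs:
            choices.append(sub_graphs[token])  # "enter" the sub-graph
        else:
            choices.append([[token]])          # ordinary word
    for combo in product(*choices):
        yield [word for part in combo for word in part]


def all_sentences(top_level, sub_graphs):
    """List every sentence accepted by the top level plus its sub-graphs."""
    return [" ".join(s) for sent in top_level for s in expand(sent, sub_graphs)]
```

Adding a contact only touches the small SUB_GRAPHS entry; TOP_LEVEL stays untouched, which mirrors why only sub-HCLG.fst needs rebuilding.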

Terminology

Original text: Let us define the set of "left-context phones" as the set of phones that can end a word, plus the optional silence, plus the special symbol #nonterm_bos.

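Explanation: the Kaldi documentation defines the term "left-context phones", a set made up of three parts: 1) the phones that can end a word (for example, HELLO is pronounced HH AH0 L OW1 in the pronunciation lexicon, so its last phone OW1 belongs to this set); 2) the optional silence phone; 3) the special symbol #nonterm_bos. (As an aside, when I first read this English sentence I could not even work out how to parse it.)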
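The definition can be sketched in a few lines of Python. This is a minimal illustration under my own assumptions, not Kaldi's code: the lexicon is a hypothetical dict from words to pronunciations, the optional silence phone is assumed to be named SIL, and word-position suffixes are ignored.

```python
def left_context_phones(lexicon, optional_silence="SIL"):
    """Collect the left-context phones from a toy lexicon.

    lexicon: dict mapping word -> list of pronunciations,
             each pronunciation being a list of phones.
    """
    # 1) every phone that can end a word
    phones = {pron[-1] for prons in lexicon.values() for pron in prons if pron}
    # 2) the optional silence (the name "SIL" is an assumption)
    phones.add(optional_silence)
    # 3) the special symbol
    phones.add("#nonterm_bos")
    return sorted(phones)


# Hypothetical toy lexicon
LEXICON = {
    "HELLO": [["HH", "AH0", "L", "OW1"]],
    "A": [["AH0"], ["EY1"]],
}
```

With this toy lexicon, OW1 (from HELLO) and both AH0 and EY1 (from A) end up in the set, while HH and L do not, because neither ever ends a word.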

Main Content

I. Building G.fst

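1. The top-level G.fst: it is the same as a conventional G.fst, except that every position where a customized word would appear is replaced with the corresponding class symbol. For example, the positions of actual contact names are replaced with #nonterm:contact_list here. When the decoder reaches such a position, it knows it must enter the corresponding non-top-level sub-G.fst.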

2. The non-top-level G.fst: every G.fst that is not top-level must begin with #nonterm_begin and end with #nonterm_end, with the actual customized words enumerated in between.
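The two kinds of G.fst described above can be sketched as OpenFst text-format listings (one arc per line: source state, destination state, input label, output label; a line with a single state marks it final). The helper names top_level_g and sub_level_g are hypothetical, weights are omitted, and putting the same symbol on both the input and output side of the special arcs is an assumption of this sketch; check Kaldi's own demo scripts for the exact convention.

```python
def top_level_g(prefix_words, nonterminal):
    """Linear acceptor: the prefix words followed by the class symbol."""
    lines, state = [], 0
    for word in list(prefix_words) + [nonterminal]:
        lines.append(f"{state}\t{state + 1}\t{word}\t{word}")
        state += 1
    lines.append(str(state))  # final state
    return "\n".join(lines)


def sub_level_g(entries):
    """#nonterm_begin, then one entry (a list of words) chosen in
    parallel, then #nonterm_end."""
    lines = ["0\t1\t#nonterm_begin\t#nonterm_begin"]
    next_free = 4  # states 0..3 are reserved: start, fan-out, fan-in, final
    for words in entries:
        src = 1
        for i, word in enumerate(words):
            is_last = i == len(words) - 1
            dst = 2 if is_last else next_free
            if not is_last:
                next_free += 1
            lines.append(f"{src}\t{dst}\t{word}\t{word}")
            src = dst
    lines.append("2\t3\t#nonterm_end\t#nonterm_end")
    lines.append("3")  # final state
    return "\n".join(lines)
```

For example, top_level_g(["CALL"], "#nonterm:contact_list") describes "CALL <contact>", and sub_level_g([["XIAOMING"], ["XIAOLI"]]) enumerates the two contacts between #nonterm_begin and #nonterm_end. Only the second listing changes when the contact list changes.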

II. Building L.fst

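L.fst is built by the script utils/lang/make_lexicon_fst.py or utils/lang/make_lexicon_fst_silprob.py, which is called from prepare_lang.sh. When the options --left-context-phones and --nonterminals are supplied, the script follows the hotword-customization path for building L.fst; otherwise it follows the ordinary L.fst path.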

1. The conventional L.fst is shown in the figure below:

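2. The L.fst for hotword customization must contain not only the conventional L.fst but also some special arcs. The original text in the Kaldi documentation reads as follows (it can be hard to make sense of at first):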

The lexicon needs to include, in addition to the normal things:

A sequence starting at the start state and ending at the loop-state, with olabel #nonterm_begin and ilabels consisting of, #nonterm_begin followed by all possible left-context phones (and #nonterm_bos) in parallel.

An arc from the loop-state to a final state, with ilabel and olabel equal to #nonterm_end.

For each user-defined nonterminal (e.g. #nonterm:foo) and for #nonterm_begin, a loop beginning and ending at the loop-state that starts with the user-defined nonterminal, e.g. #nonterm:foo, on the ilabel and olabel, and then has all left-context-phones on the ilabel only.

The L.fst for hotword customization is shown in the figure below:

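If this is still unclear, do not worry; I will go through it in detail. Compared with the conventional L.fst, the L.fst for hotword customization contains everything the conventional L.fst contains, plus the three additional kinds of arcs described in the Kaldi text above.

The details are as follows: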

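1) The part enclosed by the irregular red outline corresponds to the ordinary L.fst; see the figure of the conventional L.fst. (Note: the path from state 0 to state 1 in this figure corresponds to the path from state 0 to state 1 in the conventional L.fst figure.)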

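2) The arc #nonterm_begin:#nonterm_begin, underlined in red, together with the part boxed by the green square that follows it, corresponds to rule 1. This is a path from the start state to the loop-state: first an arc from start state 0 to intermediate state 3 with input label #nonterm_begin and output label #nonterm_begin, then arcs from intermediate state 3 to the loop-state whose input labels are all the left-context-phones. (Note: a real system has many left-context-phones, so these arcs would be very dense; to keep the figure readable, only three left-context-phones are shown here.) The original English sentence translates as: the input label sequence consists of #nonterm_begin followed by each of the left-context-phones, and the output label sequence consists of #nonterm_begin only.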

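Note: the left-context-phones consist of three parts: 1) the set of phones that can end a word (that is, word-final phones, which in Kaldi are the phones ending in _E (END) or _S (Single, a one-phone word)); 2) the optional silence; 3) the special symbol #nonterm_bos. These are the contents of data/lang_nosp_grammar1/phones/left_context_phones.txt.2.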

3) The arc #nonterm_end:#nonterm_end, underlined in blue, corresponds to rule 2: an arc from loop state 1 to final state 4 with input label #nonterm_end and output label #nonterm_end.

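4) The arc #nonterm:contact_list:#nonterm:contact_list, underlined in yellow, corresponds to rule 3. For each user-defined nonterminal (#nonterm:contact_list in this figure), there must be a path that starts at the loop-state and ends at the loop-state. The arc leaving the loop-state has input label #nonterm:contact_list and output label #nonterm:contact_list, and the arcs returning to the loop-state have all the left-context-phones as their input labels.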

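Note: this L.fst figure is not drawn very well. The AH0_S:A/0.69315 arc in the middle is an ordinary path, not a self-loop. It merely looks like a self-loop because the pronunciation of the word A consists of the single phone AH0_S.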
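The three rules can be written down as a short Python sketch that emits only the extra arcs, as (source, destination, ilabel, olabel) tuples. This is my own illustration, not the code in make_lexicon_fst.py: the function name extra_lexicon_arcs is hypothetical, weights and silence handling are left out, and the state numbers are chosen so that the first two rules match the figure (start state 0, loop-state 1, intermediate state 3, final state 4). Following the quoted Kaldi text, the loop of rule 3 is also added for #nonterm_begin.

```python
def extra_lexicon_arcs(left_context_phones, nonterminals,
                       start=0, loop=1, first_free=3):
    """Return (arcs, final_state) for the three extra rules."""
    arcs = []
    next_state = first_free

    # Rule 1: start -> intermediate on #nonterm_begin (ilabel and olabel),
    # then every left-context phone in parallel, on the ilabel only.
    mid = next_state
    next_state += 1
    arcs.append((start, mid, "#nonterm_begin", "#nonterm_begin"))
    for phone in left_context_phones:
        arcs.append((mid, loop, phone, "<eps>"))

    # Rule 2: loop-state -> final state on #nonterm_end.
    final = next_state
    next_state += 1
    arcs.append((loop, final, "#nonterm_end", "#nonterm_end"))

    # Rule 3: for each user-defined nonterminal and for #nonterm_begin,
    # a loop that leaves the loop-state on the nonterminal and returns
    # on any left-context phone (ilabel only).
    for nonterm in list(nonterminals) + ["#nonterm_begin"]:
        mid = next_state
        next_state += 1
        arcs.append((loop, mid, nonterm, nonterm))
        for phone in left_context_phones:
            arcs.append((mid, loop, phone, "<eps>"))

    return arcs, final
```

Calling extra_lexicon_arcs(["OW1_E", "SIL", "#nonterm_bos"], ["#nonterm:contact_list"]) gives one begin arc into state 3 with three arcs fanning back into the loop-state, one end arc into state 4, and one such loop per nonterminal. With a realistic phone set, each fan contains as many arcs as there are left-context phones, which is why the real graph looks so dense.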

III. Building LG.fst

1. The top-level LG.fst

2. The non-top-level LG.fst

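IV. I am less concerned with the remaining H and C parts of HCLG.fst construction and did not read them closely, so they are not covered in detail here.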

V. Afterword

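If you use CTC with pinyin modeling units, you only need to build LG.fst. Customization can then be done with CTC on top of LG.fst in the same way, although the actual code is considerably harder to implement.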

References:

1.https://kaldi-asr.org/doc/grammar.html

2.https://www.jianshu.com/p/389cb3c6231b