Speech Recognition with CMUSphinx, Part 2: Training an Acoustic Model

This chapter is a branch linked from https://blog.csdn.net/xj853663557/article/details/84671223.

The original English text of this chapter is at https://cmusphinx.github.io/wiki/tutorialam/

Contents



  • Introduction
    • When you need to train
    • When you don’t need to train
  • Data preparation
  • Compilation of the required packages
  • Setting up the training scripts
    • Setting up the format of database audio
    • Configuring file paths
    • Configuring model type and model parameters
    • Configuring sound feature parameters
    • Configuring parallel jobs to speedup the training
    • Configuring decoding parameters
  • Training
  • Training internals
    • Transformation matrix training (advanced)
    • MMIE training (advanced)
  • Testing
  • Using the model
  • Troubleshooting

Introduction

The CMUSphinx project comes with several high-quality acoustic models. There are US English acoustic models for microphone and broadcast speech as well as a model for speech over a telephone. You can also use French or Chinese models trained on a huge amount of acoustic data. Those models were carefully optimized to achieve the best recognition performance and work well for almost all applications. We put years of experience into making them perfect. Most command-and-control applications and even some large vocabulary applications could just use default models directly.


Besides models, CMUSphinx provides some approaches for adaptation which should suffice for most cases when more accuracy is required. Adaptation is known to work well when you are using different recording environments (close-distance or far microphone or telephone channel), or when a slightly different accent (UK English or Indian English) or even another language is present. Adaptation, for example, works well if you need to quickly add support for some new language just by mapping a phoneset of an acoustic model to a target phoneset with the dictionary.


There are, however, applications where the current models won’t work. Such examples are handwriting recognition or dictation support for another language. In these cases, you will need to train your own model, and this tutorial demonstrates how to do the training for the CMUSphinx speech recognition engine. Before starting with the training, make sure you are familiar with the concepts and have prepared the language model. Be sure that you indeed need to train the model and that you have the resources to do that.



When you need to train

You need to train an acoustic model if:

  • You want to create an acoustic model for a new language or dialect
  • OR you need a specialized model for a small vocabulary application
  • AND you have plenty of data to train on:
    • 1 hour of recording for command and control for a single speaker
    • 5 hours of recordings of 200 speakers for command and control for many speakers
    • 10 hours of recordings for single speaker dictation
    • 50 hours of recordings of 200 speakers for many speakers dictation
  • AND you have knowledge on the phonetic structure of the language
  • AND you have time to train the model and optimize the parameters (~1 month)



When you don’t need to train

You don’t need to train an acoustic model if:

  • You need to improve accuracy – do acoustic model adaptation instead
  • You do not have enough data – do acoustic model adaptation instead
  • You do not have enough time
  • You do not have enough experience

Please note that the amounts of data listed here are required to train a model. If you have significantly less data than listed you can not expect to train a good model. For example, you cannot train a model with 1 minute of speech data.


Data preparation

The trainer learns the parameters for the models of the sound units using a set of sample speech signals. This is called a training database. A selection of ready-to-use databases is also available.

The database contains information that is required to extract statistics from the speech in the form of the acoustic model.

The trainer needs to be told which sound units you want it to learn the parameters of, and at least the sequence in which they occur in every speech signal in your training database. This information is provided to the trainer through a file called the transcript file. It contains the sequence of words and non-speech sounds in the exact same order as they occurred in a speech signal, followed by a tag which can be used to associate this sequence with the corresponding speech signal.


Thus, in addition to the speech signals and transcription file, the trainer needs to access two dictionaries: one in which legitimate words in the language are mapped to sequences of sound units (or sub-word units), and another one in which non-speech sounds are mapped to corresponding non-speech or speech-like sound units. We will refer to the former as the language phonetic dictionary. The latter is called the filler dictionary. The trainer then looks into the dictionaries to derive the sequence of sound units that are associated with each signal and transcription.


After training, it’s mandatory to run the decoder to check the training results. The decoder takes a model, the test part of the database and the reference transcriptions, and estimates the quality (WER) of the model. During the testing stage we use the language model with the description of the possible order of words in the language.


To set up a training, you first need to design a training database or download an existing one. For example, you can purchase a database from LDC. You’ll have to convert it into a proper format.

A database should be a good representation of what speech you are going to recognize. For example, if you are going to recognize telephone speech it is preferred to use telephone recordings. If you want to work with mobile speech, you had better find mobile recordings. Speech is significantly different across various recording channels. Broadcast news is different from a phone call. Speech decoded from mp3 is significantly different from a microphone recording. However, if you do not have enough speech recorded in the required conditions you should definitely use any other speech you have. For example you can use broadcast recordings. Sometimes it makes sense to stream through the telephone codec to equalize the audio. It’s often possible to add noise to training data too.


The database should have recordings from enough speakers, a variety of recording conditions, enough acoustic variations and all possible linguistic sentences. As mentioned before, the size of the database depends on the complexity of the task you want to handle.

A database should have the two parts mentioned above: a training part and a test part. Usually the test part is about 1/10th of the total data size, but we don’t recommend you to have more than 4 hours of recordings as test data.

Good approaches to obtain a database for a new language are:

  • Manually segmenting audio recordings with existing transcription (podcasts, news, etc.)
  • Recording your friends and family and colleagues
  • Setting up an automated collection on Voxforge


You have to design database prompts and post-process the results to ensure that the audio actually corresponds to the prompts. The file structure for the database is the following: 

├─ etc
│  ├─ your_db.dic                 (Phonetic dictionary)
│  ├─ your_db.phone               (Phoneset file)
│  ├─ your_db.lm.DMP              (Language model)
│  ├─ your_db.filler              (List of fillers)
│  ├─ your_db_train.fileids       (List of files for training)
│  ├─ your_db_train.transcription (Transcription for training)
│  ├─ your_db_test.fileids        (List of files for testing)
│  └─ your_db_test.transcription  (Transcription for testing)
└─ wav
   ├─ speaker_1
   │   └─ file_1.wav              (Recording of speech utterance)
   └─ speaker_2
      └─ file_2.wav

Let’s go through the files and describe their format and the way to prepare them: 

*.fileids: The your_db_train.fileids and your_db_test.fileids files are text files which list the names of the recordings (utterance ids) one by one, for example:

speaker_1/file_1
speaker_2/file_2

The *.fileids file contains the paths in the file system relative to the wav directory. Note that a *.fileids file should not include audio file extensions in its content, but rather just the names.

*.transcription: The your_db_train.transcription and your_db_test.transcription files are text files listing the transcription for each audio file:

<s> hello world </s> (file_1)
<s> foo bar </s> (file_2)

It’s important that each line starts with <s> and ends with </s> followed by an id in parentheses. Also note that the parentheses contain only the file name, without the speaker_n directory. It’s critical to have an exact match between the *.fileids file and the *.transcription file: the number of lines in both should be identical, and the last part of the file id (speaker_1/file_1) and the utterance id file_1 must be the same on each line.

Below is an example of an incorrect *.fileids file for the above transcription file. If you follow it, you will get an error as discussed here:

speaker_2/file_2
speaker_1/file_1
// Bad! Do not create *.fileids files like this!

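Because this pairing is easy to get wrong, it can be worth checking it mechanically before training. Below is a minimal sketch of such a check in Python (a hypothetical helper, not part of sphinxtrain; the file names passed on the command line are just examples):

# check_db.py - verify that a .fileids file and a .transcription file line up
# (hypothetical helper, not part of sphinxtrain)
import re
import sys

fileids_path, trans_path = sys.argv[1], sys.argv[2]  # e.g. etc/your_db_train.fileids etc/your_db_train.transcription

fileids = [line.strip() for line in open(fileids_path) if line.strip()]
trans = [line.strip() for line in open(trans_path) if line.strip()]

if len(fileids) != len(trans):
    print("line counts differ: %d fileids vs %d transcriptions" % (len(fileids), len(trans)))

for fid, line in zip(fileids, trans):
    match = re.search(r"\(([^)]*)\)\s*$", line)  # utterance id in parentheses at the end of the line
    utt = match.group(1) if match else "<missing>"
    if utt != fid.split("/")[-1]:  # must equal the last part of the file id
        print("mismatch: fileids has %r, transcription has (%s)" % (fid, utt))

Run it as python check_db.py etc/your_db_train.fileids etc/your_db_train.transcription; any output indicates something to fix before training.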

Speech recordings (*.wav files): Your audio recordings should contain training audio that matches the audio you want to recognize in the end. In case of a mismatch you could experience a significant drop in accuracy. This means that if you want to recognize continuous speech, your training database should contain recordings of continuous speech. For continuous speech the optimal length of audio recordings is between 5 and 30 seconds; very long recordings make training much harder. If you are going to recognize short isolated commands, your training database should also contain files with short isolated commands. It is better to design the database to recognize continuous speech from the beginning, though, and not spend your time on commands; in the end you speak continuously anyway. The amount of silence at the beginning and the end of each utterance should not exceed 200 ms.

Recording files must be in MS WAV format with a specific sample rate: 16 kHz, 16 bit, mono for desktop applications and 8 kHz, 16 bit, mono for telephone applications. It’s critical that the audio files have this specific format. Sphinxtrain does support some variety of sample rates, but by default it is configured to train from 16 kHz, 16 bit, mono files in MS WAV format.

So, please make sure that your recordings have a sampling rate of 16 kHz (or 8 kHz if you train a telephone model), in mono with a single channel!

If you train from an 8 kHz model you need to make sure you configured the feature extraction properly. Please note that you cannot upsample your audio, that means you can not train 16 kHz model with 8 kHz data.

A mismatch of the audio format is the most common training problem – make sure you eliminated this source of problems.
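If your source material is not already in this format, it can usually be converted with a standard tool such as sox. A typical conversion to 16 kHz, 16 bit, mono WAV might look like the following (the file names are placeholders, and remember that upsampling 8 kHz material is not an option):

# resample to 16 kHz, convert to mono, 16 bit; only downsample, never upsample 8 kHz data
sox source_recording.wav -r 16000 -c 1 -b 16 wav/speaker_1/file_1.wav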


Phonetic Dictionary (your_db.dic): should have one line per word, with the word followed by its phonetic transcription:

HELLO HH AH L OW
WORLD W AO R L D

If you need to find a phonetic dictionary, have a look on Wikipedia or read a book on phonetics. If you are using an existing phonetic dictionary do not use case-sensitive variants like “e” and “E”. Instead, all your phones must be different even in the case-insensitive variation. Sphinxtrain doesn’t support some special characters like “*” or “/” and supports most of others like “+”, “-“ or “:”. However, to be safe we recommend you to use alphanumeric-only phone-set. 

Replace special characters in the phone-set, like colons, dashes or tildes, with something alphanumeric. For example, replace “a~” with “aa” to make it alphanumeric only. Nowadays, even cell phones have gigabytes of memory on board. There is no sense in trying to save space with cryptic special characters.


There is one very important thing here. For a large vocabulary database, the phonetic representation is more or less known; it consists of the simple phones described in any phonetics book. If you don’t have a phonetics book, you can just use the word’s spelling and it will also give you very good results:

ONE O N E
TWO T W O
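Such a spelling-based dictionary can also be generated mechanically from a plain word list, for example with a small script along these lines (a minimal sketch; wordlist.txt and your_db.dic are assumed file names):

# make_spelling_dict.py - build a "use the spelling" dictionary from a word list
words = {w.strip().upper() for w in open("wordlist.txt") if w.strip()}
with open("your_db.dic", "w") as out:
    for word in sorted(words):
        out.write("%s %s\n" % (word, " ".join(word)))  # e.g. "ONE O N E"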

For a small vocabulary, CMUSphinx is different from other toolkits. It’s often recommended to train word-based models for small vocabulary databases like digits. Yet, this only makes sense if your HMMs can have variable length.

CMUSphinx does not support word models. Instead, you need to use a word-dependent phone dictionary:

ONE W_ONE AH_ONE N_ONE
TWO T_TWO UH_TWO
NINE N_NINE AY_NINE N_END_NINE

This is actually equivalent to word-based models and sometimes even gives better accuracy. Do not use word-based models with CMUSphinx!


Phoneset file (your_db.phone): should have one phone per line. The number of phones should match the phones used in the dictionary plus the special SIL phone for silence:

AH
AX
DH
IX
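To keep the phone set and the dictionary in sync, the phone list can be derived from the dictionary itself, for example (a minimal sketch; file names are assumptions):

# make_phone_list.py - collect every phone used in the dictionary, plus SIL
phones = {"SIL"}
for line in open("your_db.dic"):
    parts = line.split()
    if len(parts) > 1:
        phones.update(parts[1:])  # everything after the word is a phone symbol
with open("your_db.phone", "w") as out:
    out.write("\n".join(sorted(phones)) + "\n")

Note that if your filler dictionary introduces extra phones such as +um+, they need to appear in your_db.phone as well.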

Language model file (your_db.lm.DMP): should be in ARPA format or in DMP format. Find out more about language models in the Building a language model chapter.

Filler dictionary (your_db.filler): contains filler phones (non-linguistic sounds not covered by the language model, such as breath, “hmm” or laughter). It can contain just silences:

<s> SIL
</s> SIL
<sil> SIL

It can also contain filler phones if they are present in the database transcriptions:

+um+ ++um++
+noise+ ++noise++

A sample database for training, the an4 database, is available in NIST Sphere audio (.sph) format. You can use this database in the following sections. If you want to play with a larger example, download the TED-LIUM English acoustic database. It contains about 200 hours of audio recordings at present.


Compilation of the required packages

The following packages are required for training:

  • sphinxbase-5prealpha
  • sphinxtrain-5prealpha
  • pocketsphinx-5prealpha

The following external packages are also required:

 

  • perl, for example ActivePerl on Windows
  • python, for example ActivePython on Windows

In addition, if you download the packages with a .gz suffix, you will need gunzip or an equivalent tool to unpack them.

Install the perl and python packages somewhere in your executable path, if they are not already there.

We recommend that you train on Linux: this way you’ll be able to use all the features of sphinxtrain. You can also use a Windows system for training, in that case we recommend to use ActivePerl.

For further download instructions, see the download page.

 

Basically you need to put everything into a single root folder, unpack the archives, and run configure, make and make install in each package folder. Put the database folder into this root folder as well. By the time you finish this, you will have a tutorial directory with the following contents:

└─ tutorial
   ├─ an4
   ├─ an4_sphere.tar.gz
   ├─ sphinxtrain
   ├─ sphinxtrain-5prealpha.tar.gz
   ├─ pocketsphinx
   ├─ pocketsphinx-5prealpha.tar.gz
   ├─ sphinxbase
   └─ sphinxbase-5prealpha.tar.gz
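The configure/make/make install sequence mentioned above would then be run in each package folder, roughly as follows (a sketch; release tarballs ship a configure script, while source checkouts use autogen.sh instead):

cd sphinxbase
./configure
make
sudo make install
cd ..
# repeat the same steps for sphinxtrain and pocketsphinx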

You will need to install the software as an administrator (root). After you have installed the software you may need to update the system configuration so that the system can find the dynamic libraries, e.g.:

export PATH=/usr/local/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/lib
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig

 

If you don’t want to install into your system path, you may install the packages in your home folder. In that case you can append the following option to the autogen.sh script or to the configure script:

--prefix=/home/user/local

Obviously, the folder can be an arbitrary folder, just remember to update the environment configuration after modifying its name. If your binaries fail to load dynamic libraries with an error message like failed to open libsphinx.so.0 no such file or directory, it means that you didn’t configure the environment properly.
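For example, with the prefix above the environment settings shown earlier would point into your home folder instead:

export PATH=/home/user/local/bin:$PATH
export LD_LIBRARY_PATH=/home/user/local/lib
export PKG_CONFIG_PATH=/home/user/local/lib/pkgconfig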

There is not much to add for this section: training on Linux is recommended, install the required packages, and pay attention to the path settings.

Setting up the training scripts

To start the training, change to the database folder and run the following commands:

On Linux:

sphinxtrain -t an4 setup

On Windows:

python ../sphinxtrain/scripts/sphinxtrain -t an4 setup

Do not forget to replace an4 with your task name.

This will copy all the required configuration files into the etc/ subfolder of your database folder and will prepare the database for training. The directory structure after the setup will look like this:

├─ etc
└─ wav

In the process of the training other data folders will be created, so that your database directory should look like this:

├─ etc
├─ feat
├─ logdir
├─ model_parameters
├─ model_architecture
├─ result
└─ wav

After this basic setup, we need to edit the configuration files in the etc/ folder. There are many variables but to get started we need to change only a few. First of all, find the file etc/sphinx_train.cfg.


Setting up the format of database audio

In etc/sphinx_train.cfg you should see the following configurations:

$CFG_WAVFILES_DIR = "$CFG_BASE_DIR/wav";
$CFG_WAVFILE_EXTENSION = 'sph';
$CFG_WAVFILE_TYPE = 'nist'; # one of nist, mswav, raw

If you recorded audio in WAV format, change sph to wav here and nist to mswav:

$CFG_WAVFILES_DIR = "$CFG_BASE_DIR/wav";
$CFG_WAVFILE_EXTENSION = 'wav';
$CFG_WAVFILE_TYPE = 'mswav'; # one of nist, mswav, raw


Configuring file paths

Search for the following lines in your etc/sphinx_train.cfg file:

# Variables used in main training of models
$CFG_DICTIONARY     = "$CFG_LIST_DIR/$CFG_DB_NAME.dic";
$CFG_RAWPHONEFILE   = "$CFG_LIST_DIR/$CFG_DB_NAME.phone";
$CFG_FILLERDICT     = "$CFG_LIST_DIR/$CFG_DB_NAME.filler";
$CFG_LISTOFFILES    = "$CFG_LIST_DIR/${CFG_DB_NAME}_train.fileids";
$CFG_TRANSCRIPTFILE = "$CFG_LIST_DIR/${CFG_DB_NAME}_train.transcription";

These values would be already set if you set up the file structure like described earlier, but make sure that your files are really named this way. 

The $CFG_LIST_DIR variable is the etc/ directory of your project. The $CFG_DB_NAME variable is the name of your project itself.


Configuring model type and model parameters

To select the acoustic model type see the Acoustic Model Types article.

$CFG_HMM_TYPE = '.cont.'; # Sphinx4, Pocketsphinx
#$CFG_HMM_TYPE  = '.semi.'; # PocketSphinx only
#$CFG_HMM_TYPE  = '.ptm.'; # Sphinx4, Pocketsphinx, faster model

Just uncomment what you need. For resource-efficient applications use semi-continuous models, for best accuracy use continuous models. By default we use PTM models which provide a nice balance between accuracy and speed.

$CFG_FINAL_NUM_DENSITIES = 8;

 If you are training continuous models for a large vocabulary and have more than 100 hours of data, put 32 here. It can be any power of 2: 4, 8, 16, 32, 64.

If you are training a semi-continuous or PTM model, use 256 Gaussians.
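Putting these two settings together, a PTM configuration following the advice above could look like this in etc/sphinx_train.cfg (illustrative values, not the shipped defaults):

$CFG_HMM_TYPE = '.ptm.'; # Sphinx4, Pocketsphinx, faster model
$CFG_FINAL_NUM_DENSITIES = 256;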


# Number of tied states (senones) to create in decision-tree clustering
$CFG_N_TIED_STATES = 1000;

This value is the number of senones to train in a model. The more senones a model has, the more precisely it discriminates the sounds. On the other hand, if you have too many senones, the model will not be generic enough to recognize yet unseen speech. That means that the WER will be higher on unseen data. That’s why it is important to not overtrain the models. In case there are too many unseen senones, the warnings will be generated in the norm log on stage 50 below:

ERROR: "gauden.c", line 1700: Variance (mgau= 948, feat= 0, density=3,
component=38) is less then 0. Most probably the number of senones is too
high for such a small training database. Use smaller $CFG_N_TIED_STATES.


The approximate number of senones and the number of densities for a continuous model is provided in the table below:

Vocabulary | Audio in database (hours) | Senones | Densities | Example
20         | 5                         | 200     | 8         | Tidigits digits recognition
100        | 20                        | 2000    | 8         | RM1 command and control
5000       | 30                        | 4000    | 16        | WSJ1 5k small dictation
20000      | 80                        | 4000    | 32        | WSJ1 20k big dictation
60000      | 200                       | 6000    | 16        | HUB4 broadcast news
60000      | 2000                      | 12000   | 64        | Fisher rich telephone transcription

For semi-continuous and PTM models use a fixed number of 256 densities.

Of course you also need to understand that only senones that are present in the transcription can be trained. It means that if your transcription isn’t generic enough, e.g. if it’s the same single word spoken by 10,000 speakers 10,000 times, you still have just a few senones no matter how many hours of speech you recorded. In that case you just need a few senones in the model, not thousands of them.

Though it might seem that diversity could improve the model, that’s not the case. Diverse speech requires some artificial speech prompts and that decreases the speech naturalness. Artificial models don’t help in real life decoding. In order to build the best database you need to try to reproduce the real environment as much as possible. It’s even better to collect more speech to try to optimize the database size.

It’s important to remember, that optimal numbers depend on your database. To train a model properly, you need to experiment with different values and try to select the ones which result in the best WER for a development set. You can experiment with the number of senones and the number of Gaussian mixtures at least. Sometimes it’s also worth to experiment with the phoneset or the number of estimation iterations.


Configuring sound feature parameters

The default for sound files used in Sphinx is a rate of 16 thousand samples per second (16 kHz). If this is the case, the etc/feat.params file will be automatically generated with the recommended values.

If you are using sound files with a sampling rate of 8 kHz (telephone audio), you need to change some values in etc/sphinx_train.cfg. The lower sampling rate also means a change in the sound frequency ranges and the number of filters that are used to recognize speech. Recommended values are:

# Feature extraction parameters
$CFG_WAVFILE_SRATE = 8000.0;
$CFG_NUM_FILT = 31; # For wideband speech it's 40, for telephone 8khz reasonable value is 31
$CFG_LO_FILT = 200; # For telephone 8kHz speech value is 200
$CFG_HI_FILT = 3500; # For telephone 8kHz speech value is 3500


Configuring parallel jobs to speed up the training

If you are on a multicore machine or in a PBS cluster you can run the training in parallel. The following options should do the trick:

# Queue::POSIX for multiple CPUs on a local machine
# Queue::PBS to use a PBS/TORQUE queue
$CFG_QUEUE_TYPE = "Queue";

Change the type to “Queue::POSIX” to run on multicore. Then change the number of parallel processes to run:

# How many parts to run Forward-Backward estimation in
$CFG_NPART = 1;
$DEC_CFG_NPART = 1; #  Define how many pieces to split decode in

If you are running on an 8-core machine start around 10 parts to fully load the CPU during training.
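For example, on the 8-core machine mentioned above the relevant lines of etc/sphinx_train.cfg could be set as follows (illustrative values):

$CFG_QUEUE_TYPE = "Queue::POSIX"; # use multiple CPUs on the local machine
$CFG_NPART = 10;                  # split Forward-Backward estimation into 10 parts
$DEC_CFG_NPART = 10;              # split decoding into 10 parts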


Configuring decoding parameters

Open etc/sphinx_train.cfg and make sure the following configurations are set:

$DEC_CFG_DICTIONARY     = "$CFG_BASE_DIR/etc/$CFG_DB_NAME.dic";
$DEC_CFG_FILLERDICT     = "$CFG_BASE_DIR/etc/$CFG_DB_NAME.filler";
$DEC_CFG_LISTOFFILES    = "$CFG_BASE_DIR/etc/${CFG_DB_NAME}_test.fileids";
$DEC_CFG_TRANSCRIPTFILE = "$CFG_BASE_DIR/etc/${CFG_DB_NAME}_test.transcription";
$DEC_CFG_RESULT_DIR     = "$CFG_BASE_DIR/result";

# These variables are used by the decoder and have to be defined by the user.
# They may affect the decoder output.

$DEC_CFG_LANGUAGEMODEL  = "$CFG_BASE_DIR/etc/${CFG_DB_NAME}.lm.DMP";

If you are training with an4 please make sure that you changed ${CFG_DB_NAME}.lm.DMP to an4.ug.lm.DMP since the name of the language model is different in an4 database:

$DEC_CFG_LANGUAGEMODEL  = "$CFG_BASE_DIR/etc/an4.ug.lm.DMP";

If everything is OK, you can proceed to training.

In my case I am training with an4, so I need to make this change to $DEC_CFG_LANGUAGEMODEL.

Training

First of all, go to the database directory:

cd an4

To train, just run the following commands:

On Linux:

sphinxtrain run

On Windows:

python ../sphinxtrain/scripts/sphinxtrain run

and it will go through all the required stages. It will take a few minutes to train. On large databases, training could take up to a month. 

The most important stage is the first one which checks that everything is configured correctly and your input data is consistent.

Do not ignore the errors reported on the first 00.verify_all step!

The typical output during training will look like:

Baum Welch starting for 2 Gaussian(s), iteration: 3 (1 of 1)
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
Normalization for iteration: 3
Current Overall Likelihood Per Frame = 30.6558644286942
Convergence Ratio = 0.633864444461992
Baum Welch starting for 2 Gaussian(s), iteration: 4 (1 of 1)
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
Normalization for iteration: 4

These scripts process all required steps to train the model. After they finished, the training is complete.


Training internals

Testing

It’s critical to test the quality of the trained database in order to select the best parameters, understand how your application performs, and optimize performance. To do that, a test decoding step is needed. The decoding is now the last stage of the training process.

You can restart decoding with the following command:

sphinxtrain -s decode run

This command will start a decoding process using the acoustic model you trained and the language model you configured in the etc/sphinx_train.cfg file. 

MODULE: DECODE Decoding using models previously trained
Decoding 130 segments starting at 0 (part 1 of 1)
0%


When the recognition job is complete, the script computes the recognition Word Error Rate (WER) and the Sentence Error Rate (SER). The lower those rates the better is your recognition. For a typical 10-hours task the WER should be around 10%. For a large task, it could be like 30%.

On an4 data you should get something like:

SENTENCE ERROR: 70.8% (92/130)   WORD ERROR RATE: 30.3% (233/773)


Translator’s note: my own result on the an4 data was:

SENTENCE ERROR: 46.2% (60/130)   WORD ERROR RATE: 15.5% (119/773)

You can find exact details of the decoding, like the alignment with a reference transcription, speed and the result for each file, in the result folder which will be created after decoding. Let’s have a look into the file an4.align:

p   I   T      t   s   b   u   r   g   H      (MMXG-CEN5-MMXG-B)
p   R   EIGHTY t   s   b   u   r   g   EIGHT  (MMXG-CEN5-MMXG-B)
Words: 10 Correct: 7 Errors: 3 Percent correct = 70.00% Error = 30.00% Accuracy = 70.00%
Insertions: 0 Deletions: 0 Substitutions: 3
october twenty four nineteen seventy  (MMXG-CEN8-MMXG-B)
october twenty four nineteen seventy  (MMXG-CEN8-MMXG-B)
Words: 5 Correct: 5 Errors: 0 Percent correct = 100.00% Error = 0.00% Accuracy = 100.00%
Insertions: 0 Deletions: 0 Substitutions: 0
TOTAL Words: 773 Correct: 587 Errors: 234
TOTAL Percent correct = 75.94% Error = 30.27% Accuracy = 69.73%
TOTAL Insertions: 48 Deletions: 15 Substitutions: 171
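These totals are consistent with the usual definitions: the error rate counts insertions, deletions and substitutions against the number of reference words, while accuracy additionally charges insertions against the correct words:

Errors = Insertions + Deletions + Substitutions = 48 + 15 + 171 = 234
Error rate = 234 / 773 = 30.27%
Percent correct = Correct / Words = 587 / 773 = 75.94%
Accuracy = (Correct - Insertions) / Words = (587 - 48) / 773 = 69.73%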

For a description of the WER see our Basic concepts of speech chapter.


Using the model

After training, the acoustic model is located in

model_parameters/<your_db_name>.cd_cont_<number_of_senones>

or in

model_parameters/<your_db_name>.cd_semi_<number_of_senones>

You need only that folder. The model should have the following files:

mdef
feat.params
mixture_weights
means
noisedict
transition_matrices
variances

depending on the type of the model you trained. To use the model in PocketSphinx, simply point to it with the -hmm option:

pocketsphinx_continuous -hmm <your_new_model_folder> -lm <your_lm> -dict <your_dict>
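For the an4 setup used in this tutorial the command could look roughly like this (the exact model folder name depends on your configuration, for example on the number of senones, and test.wav is a placeholder for a recording to decode):

pocketsphinx_continuous -hmm model_parameters/an4.cd_cont_1000 -lm etc/an4.ug.lm.DMP -dict etc/an4.dic -infile test.wav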

To use the trained model in Sphinx4, you need to specify the path in the Configuration object:

configuration.setAcousticModelPath("file:model_parameters/db.cd_cont_200");

If the model is in the resources you can reference it with resource:URL:

configuration.setAcousticModelPath("resource:/com/example/db.cd_cont_200");

See the Sphinx4 tutorial for details.


Troubleshooting

Troubleshooting is not rocket science. For all issues you may blame yourself. You are most likely the reason of failure. Carefully read the messages in the logdir folder that contains a detailed log for each performed action. In addition, messages are copied to the your_project_name.html file, which you can open and read in a browser.

There are many well-working, proven methods to solve issues. For example, try to reduce the training set to see in which half the problem appears.


Here are some common problems:

  • WARNING: this phone (something) appears in the dictionary (dictionary file name), but not in the phone list (phone file name).

    Your dictionary either contains a mistake, or you have left out a phone symbol in the phone file. You may have to delete any comment lines from your dictionary file.

  • WARNING: This word (word) has duplicate entries in (dictionary file name). Check for duplicates.

    You may have to sort your dictionary file lines to find them. Perhaps a word is defined in both upper and lower case forms.

  • WARNING: This word: word was in the transcript file, but is not in the dictionary (transcript line) Do cases match?

    Make sure that all the words in the transcript are in the dictionary, and that they have matching cases when they appear. Also, words in the transcript may be misspelled, run together or be a number or symbol that is not in the dictionary. If the dictionary file is not perfectly sorted, some entries might be skipped while looking for words. If you hand-edited the dictionary file, be sure that each entry is in the proper format.

    You may have specified phones in the phone list that are not represented in the words in the transcript. The trainer expects to find examples of each phone at least once.

  • WARNING: CTL file, audio file name.mfc, does not exist, or is empty.

    The .mfc files are the feature files converted from the input audio files in stage 000.comp_feats. Did you skip this step? Did you add new audio files without converting them? The training process expects a feature file to be there, but it isn’t.

  • Very low recognition accuracy

    This might happen if there is a mismatch in the audio files and the parameters of training, or between the training and the testing.

  • ERROR: “backward.c”, line 430: Failed to align audio to transcript: final state of the search is not reached.

    Sometimes audio in your database doesn’t match the transcription properly. For example the transcription file has the line “Hello world” but in audio actually “Hello hello world” is pronounced. The training process usually detects that and emits this message in the logs. If there are too many of such errors it most likely means you misconfigured something, e.g. you had a mismatch between audio and the text caused by transcription reordering. If there are few errors, you can ignore them. You might want to edit the transcription file to put the exact word which was pronounced. In the case above you need to edit the transcription file and put “Hello hello world” on the corresponding line. You might want to filter such prompts because they affect the quality of the acoustic model. In that case you need to enable the forced alignment stage during training. To do that edit the following line in sphinx_train.cfg:

    $CFG_FORCEDALIGN = 'yes';
    

    and run the training again. It will execute stages 10 and 11 and will filter your database.

  • Can’t open */*-1-1.match word_align.pl failed with error code 65280

    This error occurs because the decoder did not run properly after the training. First check if the correct executable is present in your PATH. The executable should be pocketsphinx_batch if the decoding script being used is psdecode.pl, as set by the $DEC_CFG_SCRIPT variable in sphinx_train.cfg. On Linux run:

    which pocketsphinx_batch
    

    and see if it is located as expected. If it is not, you need to set the PATH variable properly. Similarly on Windows, run:

    where pocketsphinx_batch
    

    If the path to the decoding executable is set properly, read the log files in logdir/decode/ to find out other reasons for the error.

  • To ask for help

    If you want to ask for help about training, try to provide the training folder or at least the logdir. Pack the files into an archive and upload it to a public file sharing resource. Then post the link to the resource. Remember: the more information you provide the faster you will solve the problem.

