Speech Recognition with CMUSphinx, Part 2: Training an Acoustic Model

This chapter is a branch linked from https://blog.csdn.net/xj853663557/article/details/84671223.

The original English text of this chapter is at https://cmusphinx.github.io/wiki/tutorialam/

Contents



  • Introduction
    • When you need to train
    • When you don’t need to train
  • Data preparation
  • Compilation of the required packages
  • Setting up the training scripts
    • Setting up the format of database audio
    • Configuring file paths
    • Configuring model type and model parameters
    • Configuring sound feature parameters
    • Configuring parallel jobs to speedup the training
    • Configuring decoding parameters
  • Training
  • Training internals
    • Transformation matrix training (advanced)
    • MMIE training (advanced)
  • Testing
  • Using the model
  • Troubleshooting

Introduction

The CMUSphinx project comes with several high-quality acoustic models. There are US English acoustic models for microphone and broadcast speech as well as a model for speech over a telephone. You can also use French or Chinese models trained on a huge amount of acoustic data. Those models were carefully optimized to achieve the best recognition performance and work well for almost all applications. We put years of experience into making them perfect. Most command-and-control applications and even some large vocabulary applications could just use default models directly.


Besides models, CMUSphinx provides some approaches for adaptation which should suffice for most cases when more accuracy is required. Adaptation is known to work well when you are using different recording environments (close-distance or far microphone or telephone channel), or when a slightly different accent (UK English or Indian English) or even another language is present. Adaptation, for example, works well if you need to quickly add support for some new language just by mapping a phoneset of an acoustic model to a target phoneset with the dictionary.


There are, however, applications where the current models won’t work. Such examples are handwriting recognition or dictation support for another language. In these cases, you will need to train your own model, and this tutorial demonstrates how to do the training for the CMUSphinx speech recognition engine. Before starting with the training, make sure you are familiar with the concepts and have prepared the language model. Be sure that you indeed need to train the model and that you have the resources to do that.



When you need to train

You need to train an acoustic model if:

  • You want to create an acoustic model for a new language or dialect
  • OR you need a specialized model for a small vocabulary application
  • AND you have plenty of data to train on:
    • 1 hour of recording for command and control for a single speaker
    • 5 hours of recordings of 200 speakers for command and control for many speakers
    • 10 hours of recordings for single speaker dictation
    • 50 hours of recordings of 200 speakers for many speakers dictation
  • AND you have knowledge on the phonetic structure of the language
  • AND you have time to train the model and optimize the parameters (~1 month)



When you don’t need to train

You don’t need to train an acoustic model if:

  • You need to improve accuracy – do acoustic model adaptation instead
  • You do not have enough data – do acoustic model adaptation instead
  • You do not have enough time
  • You do not have enough experience

Please note that the amounts of data listed here are required to train a model. If you have significantly less data than listed you can not expect to train a good model. For example, you cannot train a model with 1 minute of speech data.


Data preparation

The trainer learns the parameters for the models of the sound units using a set of sample speech signals. This is called a training database. A selection of ready-to-use databases is also available.

The database contains information that is required to extract statistics from the speech in the form of the acoustic model.

The trainer needs to be told which sound units you want it to learn the parameters of, and at least the sequence in which they occur in every speech signal in your training database. This information is provided to the trainer through a file called the transcript file. It contains the sequence of words and non-speech sounds in the exact same order as they occurred in a speech signal, followed by a tag which can be used to associate this sequence with the corresponding speech signal.


Thus, in addition to the speech signals and transcription file, the trainer needs to access two dictionaries: one in which legitimate words in the language are mapped to sequences of sound units (or sub-word units), and another one in which non-speech sounds are mapped to corresponding non-speech or speech-like sound units. We will refer to the former as the language phonetic dictionary. The latter is called the filler dictionary. The trainer then looks into the dictionaries to derive the sequence of sound units that are associated with each signal and transcription.


After training, it’s mandatory to run the decoder to check the training results. The decoder takes a model, the test part of the database and the reference transcriptions, and estimates the quality (WER) of the model. During the testing stage we use the language model with the description of the possible order of words in the language.


To set up a training, you first need to design a training database or download an existing one. For example, you can purchase a database from LDC. You’ll have to convert it into a proper format.

A database should be a good representation of what speech you are going to recognize. For example, if you are going to recognize telephone speech it is preferred to use telephone recordings. If you want to work with mobile speech, you had better find mobile recordings. Speech is significantly different across various recording channels. Broadcast news is different from a phone call. Speech decoded from mp3 is significantly different from a microphone recording. However, if you do not have enough speech recorded in the required conditions you should definitely use any other speech you have. For example you can use broadcast recordings. Sometimes it makes sense to stream through the telephone codec to equalize the audio. It’s often possible to add noise to training data too.


The database should have recordings from enough speakers, a variety of recording conditions, enough acoustic variations and all possible linguistic sentences. As mentioned before, the size of the database depends on the complexity of the task you want to handle.

A database should have the two parts mentioned above: a training part and a test part. Usually the test part is about 1/10th of the total data size, but we don’t recommend you to have more than 4 hours of recordings as test data.

Good approaches to obtain a database for a new language are:

  • Manually segmenting audio recordings with existing transcription (podcasts, news, etc.)
  • Recording your friends and family and colleagues
  • Setting up an automated collection on Voxforge


You have to design database prompts and post-process the results to ensure that the audio actually corresponds to the prompts. The file structure for the database is the following: 

├─ etc
│  ├─ your_db.dic                 (Phonetic dictionary)
│  ├─ your_db.phone               (Phoneset file)
│  ├─ your_db.lm.DMP              (Language model)
│  ├─ your_db.filler              (List of fillers)
│  ├─ your_db_train.fileids       (List of files for training)
│  ├─ your_db_train.transcription (Transcription for training)
│  ├─ your_db_test.fileids        (List of files for testing)
│  └─ your_db_test.transcription  (Transcription for testing)
└─ wav
   ├─ speaker_1
   │   └─ file_1.wav              (Recording of speech utterance)
   └─ speaker_2
      └─ file_2.wav

Let’s go through the files and describe their format and the way to prepare them: 

*.fileids: The your_db_train.fileids and your_db_test.fileids files are text files which list the names of the recordings (utterance ids) one by one, for example:

speaker_1/file_1
speaker_2/file_2

The *.fileids file contains the paths in the file system relative to the wav directory. Note that a *.fileids file should not include audio file extensions in its content, but rather just the names.

*.transcription: The your_db_train.transcription and your_db_test.transcription files are text files listing the transcription for each audio file:

<s> hello world </s> (file_1)
<s> foo bar </s> (file_2)

It’s important that each line starts with <s> and ends with </s> followed by an id in parentheses. Also note that the parentheses contain only the file name, without the speaker_n directory. It’s critical to have an exact match between the *.fileids file and the *.transcription file: the number of lines in both should be identical, and the last part of the file id (speaker_1/file_1) and the utterance id file_1 must be the same on each line.

Below is an example of an incorrect *.fileids file for the above transcription file. If you follow it, you will get an error as discussed here:

speaker_2/file_2
speaker_1/file_1
// Bad! Do not create *.fileids files like this!

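Because this pairing is easy to get wrong, it can be worth checking it mechanically before training. Below is a minimal sketch of such a check in Python (a hypothetical helper, not part of sphinxtrain; the file names passed on the command line are just examples):

# check_db.py - verify that a .fileids file and a .transcription file line up
# (hypothetical helper, not part of sphinxtrain)
import re
import sys

fileids_path, trans_path = sys.argv[1], sys.argv[2]  # e.g. etc/your_db_train.fileids etc/your_db_train.transcription

fileids = [line.strip() for line in open(fileids_path) if line.strip()]
trans = [line.strip() for line in open(trans_path) if line.strip()]

if len(fileids) != len(trans):
    print("line counts differ: %d fileids vs %d transcriptions" % (len(fileids), len(trans)))

for fid, line in zip(fileids, trans):
    match = re.search(r"\(([^)]*)\)\s*$", line)  # utterance id in parentheses at the end of the line
    utt = match.group(1) if match else "<missing>"
    if utt != fid.split("/")[-1]:  # must equal the last part of the file id
        print("mismatch: fileids has %r, transcription has (%s)" % (fid, utt))

Run it as python check_db.py etc/your_db_train.fileids etc/your_db_train.transcription; any output indicates something to fix before training.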

Speech recordings (*.wav files): Your audio recordings should contain training audio that matches the audio you want to recognize in the end. In case of a mismatch you could experience a significant drop in accuracy. This means that if you want to recognize continuous speech, your training database should contain recordings of continuous speech. For continuous speech the optimal length of audio recordings is between 5 and 30 seconds; very long recordings make training much harder. If you are going to recognize short isolated commands, your training database should also contain files with short isolated commands. It is better to design the database to recognize continuous speech from the beginning, though, and not spend your time on commands; in the end you speak continuously anyway. The amount of silence at the beginning and the end of each utterance should not exceed 200 ms.

Recording files must be in MS WAV format with a specific sample rate: 16 kHz, 16 bit, mono for desktop applications and 8 kHz, 16 bit, mono for telephone applications. It’s critical that the audio files have this specific format. Sphinxtrain does support some variety of sample rates, but by default it is configured to train from 16 kHz, 16 bit, mono files in MS WAV format.

So, please make sure that your recordings have a sampling rate of 16 kHz (or 8 kHz if you train a telephone model), in mono with a single channel!

If you train from an 8 kHz model you need to make sure you configured the feature extraction properly. Please note that you cannot upsample your audio, that means you can not train 16 kHz model with 8 kHz data.

A mismatch of the audio format is the most common training problem – make sure you eliminated this source of problems.
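If your source material is not already in this format, it can usually be converted with a standard tool such as sox. A typical conversion to 16 kHz, 16 bit, mono WAV might look like the following (the file names are placeholders, and remember that upsampling 8 kHz material is not an option):

# resample to 16 kHz, convert to mono, 16 bit; only downsample, never upsample 8 kHz data
sox source_recording.wav -r 16000 -c 1 -b 16 wav/speaker_1/file_1.wav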


Phonetic Dictionary (your_db.dic): should have one line per word, with the word followed by its phonetic transcription:

HELLO HH AH L OW
WORLD W AO R L D

If you need to find a phonetic dictionary, have a look on Wikipedia or read a book on phonetics. If you are using an existing phonetic dictionary do not use case-sensitive variants like “e” and “E”. Instead, all your phones must be different even in the case-insensitive variation. Sphinxtrain doesn’t support some special characters like “*” or “/” and supports most of others like “+”, “-“ or “:”. However, to be safe we recommend you to use alphanumeric-only phone-set. 

Replace special characters in the phone-set, like colons, dashes or tildes, with something alphanumeric. For example, replace “a~” with “aa” to make it alphanumeric only. Nowadays, even cell phones have gigabytes of memory on board. There is no sense in trying to save space with cryptic special characters.


There is one very important thing here. For a large vocabulary database, the phonetic representation is more or less known; it consists of the simple phones described in any phonetics book. If you don’t have a phonetics book, you can just use the word’s spelling and it will also give you very good results:

ONE O N E
TWO T W O
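Such a spelling-based dictionary can also be generated mechanically from a plain word list, for example with a small script along these lines (a minimal sketch; wordlist.txt and your_db.dic are assumed file names):

# make_spelling_dict.py - build a "use the spelling" dictionary from a word list
words = {w.strip().upper() for w in open("wordlist.txt") if w.strip()}
with open("your_db.dic", "w") as out:
    for word in sorted(words):
        out.write("%s %s\n" % (word, " ".join(word)))  # e.g. "ONE O N E"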

For a small vocabulary, CMUSphinx is different from other toolkits. It’s often recommended to train word-based models for small vocabulary databases like digits. Yet, this only makes sense if your HMMs can have variable length.

CMUSphinx does not support word models. Instead, you need to use a word-dependent phone dictionary:

ONE W_ONE AH_ONE N_ONE
TWO T_TWO UH_TWO
NINE N_NINE AY_NINE N_END_NINE

This is actually equivalent to word-based models and sometimes even gives better accuracy. Do not use word-based models with CMUSphinx!


Phoneset file (your_db.phone): should have one phone per line. The number of phones should match the phones used in the dictionary plus the special SIL phone for silence:

AH
AX
DH
IX
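To keep the phone set and the dictionary in sync, the phone list can be derived from the dictionary itself, for example (a minimal sketch; file names are assumptions):

# make_phone_list.py - collect every phone used in the dictionary, plus SIL
phones = {"SIL"}
for line in open("your_db.dic"):
    parts = line.split()
    if len(parts) > 1:
        phones.update(parts[1:])  # everything after the word is a phone symbol
with open("your_db.phone", "w") as out:
    out.write("\n".join(sorted(phones)) + "\n")

Note that if your filler dictionary introduces extra phones such as +um+, they need to appear in your_db.phone as well.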

Language model file (your_db.lm.DMP): should be in ARPA format or in DMP format. Find out more about language models in the Building a language model chapter.

Filler dictionary (your_db.filler): contains filler phones (non-linguistic sounds not covered by the language model, such as breath, “hmm” or laughter). It can contain just silences:

<s> SIL
</s> SIL
<sil> SIL

It can also contain filler phones if they are present in the database transcriptions:

+um+ ++um++
+noise+ ++noise++

A sample database for training, the an4 database, is available in NIST Sphere audio (.sph) format. You can use this database in the following sections. If you want to play with a larger example, download the TED-LIUM English acoustic database. It contains about 200 hours of audio recordings at present.


Compilation of the required packages

The following packages are required for training:

  • sphinxbase-5prealpha
  • sphinxtrain-5prealpha
  • pocketsphinx-5prealpha

The following external packages are also required:

 

  • perl, for example ActivePerl on Windows
  • python, for example ActivePython on Windows

In addition, if you download the packages with a .gz suffix, you will need gunzip or an equivalent tool to unpack them.

Install the perl and python packages somewhere in your executable path, if they are not already there.

We recommend that you train on Linux: this way you’ll be able to use all the features of sphinxtrain. You can also use a Windows system for training, in that case we recommend to use ActivePerl.

For further download instructions, see the download page.

 

Basically you need to put everything into a single root folder, unpack the archives, and run configure, make and make install in each package folder. Put the database folder into this root folder as well. By the time you finish this, you will have a tutorial directory with the following contents:

└─ tutorial
   ├─ an4
   ├─ an4_sphere.tar.gz
   ├─ sphinxtrain
   ├─ sphinxtrain-5prealpha.tar.gz
   ├─ pocketsphinx
   ├─ pocketsphinx-5prealpha.tar.gz
   ├─ sphinxbase
   └─ sphinxbase-5prealpha.tar.gz
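The configure/make/make install sequence mentioned above would then be run in each package folder, roughly as follows (a sketch; release tarballs ship a configure script, while source checkouts use autogen.sh instead):

cd sphinxbase
./configure
make
sudo make install
cd ..
# repeat the same steps for sphinxtrain and pocketsphinx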

You will need to install the software as an administrator (root). After you have installed the software you may need to update the system configuration so that the system can find the dynamic libraries, e.g.:

export PATH=/usr/local/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/lib
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig

 

If you don’t want to install into your system path, you may install the packages in your home folder. In that case you can append the following option to the autogen.sh script or to the configure script:

--prefix=/home/user/local

Obviously, the folder can be an arbitrary folder, just remember to update the environment configuration after modifying its name. If your binaries fail to load dynamic libraries with an error message like failed to open libsphinx.so.0 no such file or directory, it means that you didn’t configure the environment properly.
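For example, with the prefix above the environment settings shown earlier would point into your home folder instead:

export PATH=/home/user/local/bin:$PATH
export LD_LIBRARY_PATH=/home/user/local/lib
export PKG_CONFIG_PATH=/home/user/local/lib/pkgconfig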

There is not much to add for this section: training on Linux is recommended, install the required packages, and pay attention to the path settings.

Setting up the training scripts

To start the training, change to the database folder and run the following commands:

On Linux:

sphinxtrain -t an4 setup

On Windows:

python ../sphinxtrain/scripts/sphinxtrain -t an4 setup

Do not forget to replace an4 with your task name.

This will copy all the required configuration files into the etc/ subfolder of your database folder and will prepare the database for training. The directory structure after the setup will look like this:

├─ etc
└─ wav

In the process of the training other data folders will be created, so that your database directory should look like this:

├─ etc
├─ feat
├─ logdir
├─ model_parameters
├─ model_architecture
├─ result
└─ wav

After this basic setup, we need to edit the configuration files in the etc/ folder. There are many variables but to get started we need to change only a few. First of all, find the file etc/sphinx_train.cfg.


Setting up the format of database audio

In etc/sphinx_train.cfg you should see the following configurations:

$CFG_WAVFILES_DIR = "$CFG_BASE_DIR/wav";
$CFG_WAVFILE_EXTENSION = 'sph';
$CFG_WAVFILE_TYPE = 'nist'; # one of nist, mswav, raw

If you recorded audio in WAV format, change sph to wav here and nist to mswav:

$CFG_WAVFILES_DIR = "$CFG_BASE_DIR/wav";
$CFG_WAVFILE_EXTENSION = 'wav';
$CFG_WAVFILE_TYPE = 'mswav'; # one of nist, mswav, raw


Configuring file paths

Search for the following lines in your etc/sphinx_train.cfg file:

# Variables used in main training of models
$CFG_DICTIONARY     = "$CFG_LIST_DIR/$CFG_DB_NAME.dic";
$CFG_RAWPHONEFILE   = "$CFG_LIST_DIR/$CFG_DB_NAME.phone";
$CFG_FILLERDICT     = "$CFG_LIST_DIR/$CFG_DB_NAME.filler";
$CFG_LISTOFFILES    = "$CFG_LIST_DIR/${CFG_DB_NAME}_train.fileids";
$CFG_TRANSCRIPTFILE = "$CFG_LIST_DIR/${CFG_DB_NAME}_train.transcription";

These values would be already set if you set up the file structure like described earlier, but make sure that your files are really named this way. 

The $CFG_LIST_DIR variable is the etc/ directory of your project. The $CFG_DB_NAME variable is the name of your project itself.


Configuring model type and model parameters

To select the acoustic model type see the Acoustic Model Types article.

$CFG_HMM_TYPE = '.cont.'; # Sphinx4, Pocketsphinx
#$CFG_HMM_TYPE  = '.semi.'; # PocketSphinx only
#$CFG_HMM_TYPE  = '.ptm.'; # Sphinx4, Pocketsphinx, faster model

Just uncomment what you need. For resource-efficient applications use semi-continuous models, for best accuracy use continuous models. By default we use PTM models which provide a nice balance between accuracy and speed.

$CFG_FINAL_NUM_DENSITIES = 8;

 If you are training continuous models for a large vocabulary and have more than 100 hours of data, put 32 here. It can be any power of 2: 4, 8, 16, 32, 64.

If you are training a semi-continuous or PTM model, use 256 Gaussians.
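Putting these two settings together, a PTM configuration following the advice above could look like this in etc/sphinx_train.cfg (illustrative values, not the shipped defaults):

$CFG_HMM_TYPE = '.ptm.'; # Sphinx4, Pocketsphinx, faster model
$CFG_FINAL_NUM_DENSITIES = 256;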


# Number of tied states (senones) to create in decision-tree clustering
$CFG_N_TIED_STATES = 1000;

This value is the number of senones to train in a model. The more senones a model has, the more precisely it discriminates the sounds. On the other hand, if you have too many senones, the model will not be generic enough to recognize yet unseen speech. That means that the WER will be higher on unseen data. That’s why it is important to not overtrain the models. In case there are too many unseen senones, the warnings will be generated in the norm log on stage 50 below:

ERROR: "gauden.c", line 1700: Variance (mgau= 948, feat= 0, density=3,
component=38) is less then 0. Most probably the number of senones is too
high for such a small training database. Use smaller $CFG_N_TIED_STATES.


The approximate number of senones and the number of densities for a continuous model is provided in the table below:

Vocabulary | Audio in database (hours) | Senones | Densities | Example
20         | 5                         | 200     | 8         | Tidigits digits recognition
100        | 20                        | 2000    | 8         | RM1 command and control
5000       | 30                        | 4000    | 16        | WSJ1 5k small dictation
20000      | 80                        | 4000    | 32        | WSJ1 20k big dictation
60000      | 200                       | 6000    | 16        | HUB4 broadcast news
60000      | 2000                      | 12000   | 64        | Fisher rich telephone transcription

For semi-continuous and PTM models use a fixed number of 256 densities.

Of course you also need to understand that only senones that are present in the transcription can be trained. It means that if your transcription isn’t generic enough, e.g. if it’s the same single word spoken by 10,000 speakers 10,000 times, you still have just a few senones no matter how many hours of speech you recorded. In that case you just need a few senones in the model, not thousands of them.

Though it might seem that diversity could improve the model, that’s not the case. Diverse speech requires some artificial speech prompts and that decreases the speech naturalness. Artificial models don’t help in real life decoding. In order to build the best database you need to try to reproduce the real environment as much as possible. It’s even better to collect more speech to try to optimize the database size.

It’s important to remember, that optimal numbers depend on your database. To train a model properly, you need to experiment with different values and try to select the ones which result in the best WER for a development set. You can experiment with the number of senones and the number of Gaussian mixtures at least. Sometimes it’s also worth to experiment with the phoneset or the number of estimation iterations.


Configuring sound feature parameters

The default for sound files used in Sphinx is a rate of 16 thousand samples per second (16 kHz). If this is the case, the etc/feat.params file will be automatically generated with the recommended values.

If you are using sound files with a sampling rate of 8 kHz (telephone audio), you need to change some values in etc/sphinx_train.cfg. The lower sampling rate also means a change in the sound frequency ranges and the number of filters that are used to recognize speech. Recommended values are:

# Feature extraction parameters
$CFG_WAVFILE_SRATE = 8000.0;
$CFG_NUM_FILT = 31; # For wideband speech it's 40, for telephone 8khz reasonable value is 31
$CFG_LO_FILT = 200; # For telephone 8kHz speech value is 200
$CFG_HI_FILT = 3500; # For telephone 8kHz speech value is 3500


Configuring parallel jobs to speed up the training

If you are on a multicore machine or in a PBS cluster you can run the training in parallel. The following options should do the trick:

# Queue::POSIX for multiple CPUs on a local machine
# Queue::PBS to use a PBS/TORQUE queue
$CFG_QUEUE_TYPE = "Queue";

Change the type to “Queue::POSIX” to run on multicore. Then change the number of parallel processes to run:

# How many parts to run Forward-Backward estimation in
$CFG_NPART = 1;
$DEC_CFG_NPART = 1; #  Define how many pieces to split decode in

If you are running on an 8-core machine start around 10 parts to fully load the CPU during training.
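For example, on the 8-core machine mentioned above the relevant lines of etc/sphinx_train.cfg could be set as follows (illustrative values):

$CFG_QUEUE_TYPE = "Queue::POSIX"; # use multiple CPUs on the local machine
$CFG_NPART = 10;                  # split Forward-Backward estimation into 10 parts
$DEC_CFG_NPART = 10;              # split decoding into 10 parts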


Configuring decoding parameters

Open etc/sphinx_train.cfg and make sure the following configurations are set:

$DEC_CFG_DICTIONARY     = "$CFG_BASE_DIR/etc/$CFG_DB_NAME.dic";
$DEC_CFG_FILLERDICT     = "$CFG_BASE_DIR/etc/$CFG_DB_NAME.filler";
$DEC_CFG_LISTOFFILES    = "$CFG_BASE_DIR/etc/${CFG_DB_NAME}_test.fileids";
$DEC_CFG_TRANSCRIPTFILE = "$CFG_BASE_DIR/etc/${CFG_DB_NAME}_test.transcription";
$DEC_CFG_RESULT_DIR     = "$CFG_BASE_DIR/result";

# These variables are used by the decoder and have to be defined by the user.
# They may affect the decoder output.

$DEC_CFG_LANGUAGEMODEL  = "$CFG_BASE_DIR/etc/${CFG_DB_NAME}.lm.DMP";

If you are training with an4 please make sure that you changed ${CFG_DB_NAME}.lm.DMP to an4.ug.lm.DMP since the name of the language model is different in an4 database:

$DEC_CFG_LANGUAGEMODEL  = "$CFG_BASE_DIR/etc/an4.ug.lm.DMP";

If everything is OK, you can proceed to training.

In my case I am training with an4, so I need to make this change to $DEC_CFG_LANGUAGEMODEL.

Training

First of all, go to the database directory:

cd an4

To train, just run the following commands:

On Linux:

sphinxtrain run

On Windows:

python ../sphinxtrain/scripts/sphinxtrain run

and it will go through all the required stages. It will take a few minutes to train. On large databases, training could take up to a month. 

The most important stage is the first one which checks that everything is configured correctly and your input data is consistent.

Do not ignore the errors reported on the first 00.verify_all step!

The typical output during training will look like:

Baum Welch starting for 2 Gaussian(s), iteration: 3 (1 of 1)
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
Normalization for iteration: 3
Current Overall Likelihood Per Frame = 30.6558644286942
Convergence Ratio = 0.633864444461992
Baum Welch starting for 2 Gaussian(s), iteration: 4 (1 of 1)
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
Normalization for iteration: 4

These scripts process all required steps to train the model. After they finished, the training is complete.


Training internals

Testing

It’s critical to test the quality of the trained database in order to select the best parameters, understand how your application performs, and optimize performance. To do that, a test decoding step is needed. The decoding is now the last stage of the training process.

You can restart decoding with the following command:

sphinxtrain -s decode run

This command will start a decoding process using the acoustic model you trained and the language model you configured in the etc/sphinx_train.cfg file. 

MODULE: DECODE Decoding using models previously trained
Decoding 130 segments starting at 0 (part 1 of 1)
0%


When the recognition job is complete, the script computes the recognition Word Error Rate (WER) and the Sentence Error Rate (SER). The lower those rates the better is your recognition. For a typical 10-hours task the WER should be around 10%. For a large task, it could be like 30%.

On an4 data you should get something like:

SENTENCE ERROR: 70.8% (92/130)   WORD ERROR RATE: 30.3% (233/773)


Translator’s note: my own result on the an4 data was:

SENTENCE ERROR: 46.2% (60/130)   WORD ERROR RATE: 15.5% (119/773)

You can find exact details of the decoding, like the alignment with a reference transcription, speed and the result for each file, in the result folder which will be created after decoding. Let’s have a look into the file an4.align:

p   I   T      t   s   b   u   r   g   H      (MMXG-CEN5-MMXG-B)
p   R   EIGHTY t   s   b   u   r   g   EIGHT  (MMXG-CEN5-MMXG-B)
Words: 10 Correct: 7 Errors: 3 Percent correct = 70.00% Error = 30.00% Accuracy = 70.00%
Insertions: 0 Deletions: 0 Substitutions: 3
october twenty four nineteen seventy  (MMXG-CEN8-MMXG-B)
october twenty four nineteen seventy  (MMXG-CEN8-MMXG-B)
Words: 5 Correct: 5 Errors: 0 Percent correct = 100.00% Error = 0.00% Accuracy = 100.00%
Insertions: 0 Deletions: 0 Substitutions: 0
TOTAL Words: 773 Correct: 587 Errors: 234
TOTAL Percent correct = 75.94% Error = 30.27% Accuracy = 69.73%
TOTAL Insertions: 48 Deletions: 15 Substitutions: 171
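These totals are consistent with the usual definitions: the error rate counts insertions, deletions and substitutions against the number of reference words, while accuracy additionally charges insertions against the correct words:

Errors = Insertions + Deletions + Substitutions = 48 + 15 + 171 = 234
Error rate = 234 / 773 = 30.27%
Percent correct = Correct / Words = 587 / 773 = 75.94%
Accuracy = (Correct - Insertions) / Words = (587 - 48) / 773 = 69.73%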

For a description of the WER see our Basic concepts of speech chapter.


Using the model

After training, the acoustic model is located in

model_parameters/<your_db_name>.cd_cont_<number_of_senones>

or in

model_parameters/<your_db_name>.cd_semi_<number_of_senones>

You need only that folder. The model should have the following files:

mdef
feat.params
mixture_weights
means
noisedict
transition_matrices
variances

depending on the type of the model you trained. To use the model in PocketSphinx, simply point to it with the -hmm option:

pocketsphinx_continuous -hmm <your_new_model_folder> -lm <your_lm> -dict <your_dict>
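For the an4 setup used in this tutorial the command could look roughly like this (the exact model folder name depends on your configuration, for example on the number of senones, and test.wav is a placeholder for a recording to decode):

pocketsphinx_continuous -hmm model_parameters/an4.cd_cont_1000 -lm etc/an4.ug.lm.DMP -dict etc/an4.dic -infile test.wav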

To use the trained model in Sphinx4, you need to specify the path in the Configuration object:

configuration.setAcousticModelPath("file:model_parameters/db.cd_cont_200");

If the model is in the resources you can reference it with resource:URL:

configuration.setAcousticModelPath("resource:/com/example/db.cd_cont_200");

See the Sphinx4 tutorial for details.


Troubleshooting

Troubleshooting is not rocket science. For all issues you may blame yourself. You are most likely the reason of failure. Carefully read the messages in the logdir folder that contains a detailed log for each performed action. In addition, messages are copied to the your_project_name.html file, which you can open and read in a browser.

There are many well-working, proven methods to solve issues. For example, try to reduce the training set to see in which half the problem appears.


Here are some common problems:

  • WARNING: this phone (something) appears in the dictionary (dictionary file name), but not in the phone list (phone file name).

    Your dictionary either contains a mistake, or you have left out a phone symbol in the phone file. You may have to delete any comment lines from your dictionary file.

  • WARNING: This word (word) has duplicate entries in (dictionary file name). Check for duplicates.

    You may have to sort your dictionary file lines to find them. Perhaps a word is defined in both upper and lower case forms.

  • WARNING: This word: word was in the transcript file, but is not in the dictionary (transcript line) Do cases match?

    Make sure that all the words in the transcript are in the dictionary, and that they have matching cases when they appear. Also, words in the transcript may be misspelled, run together or be a number or symbol that is not in the dictionary. If the dictionary file is not perfectly sorted, some entries might be skipped while looking for words. If you hand-edited the dictionary file, be sure that each entry is in the proper format.

    You may have specified phones in the phone list that are not represented in the words in the transcript. The trainer expects to find examples of each phone at least once.

  • WARNING: CTL file, audio file name.mfc, does not exist, or is empty.

    The .mfc files are the feature files converted from the input audio files in stage 000.comp_feats. Did you skip this step? Did you add new audio files without converting them? The training process expects a feature file to be there, but it isn’t.

  • Very low recognition accuracy

    This might happen if there is a mismatch in the audio files and the parameters of training, or between the training and the testing.

  • ERROR: “backward.c”, line 430: Failed to align audio to transcript: final state of the search is not reached.

    Sometimes audio in your database doesn’t match the transcription properly. For example the transcription file has the line “Hello world” but in audio actually “Hello hello world” is pronounced. The training process usually detects that and emits this message in the logs. If there are too many of such errors it most likely means you misconfigured something, e.g. you had a mismatch between audio and the text caused by transcription reordering. If there are few errors, you can ignore them. You might want to edit the transcription file to put the exact word which was pronounced. In the case above you need to edit the transcription file and put “Hello hello world” on the corresponding line. You might want to filter such prompts because they affect the quality of the acoustic model. In that case you need to enable the forced alignment stage during training. To do that edit the following line in sphinx_train.cfg:

    $CFG_FORCEDALIGN = 'yes';
    

    and run the training again. It will execute stages 10 and 11 and will filter your database.

  • Can’t open */*-1-1.match word_align.pl failed with error code 65280

    This error occurs because the decoder did not run properly after the training. First check if the correct executable is present in your PATH. The executable should be pocketsphinx_batch if the decoding script being used is psdecode.pl, as set by the $DEC_CFG_SCRIPT variable in sphinx_train.cfg. On Linux run:

    which pocketsphinx_batch
    

    and see if it is located as expected. If it is not, you need to set the PATH variable properly. Similarly on Windows, run:

    where pocketsphinx_batch
    

    If the path to the decoding executable is set properly, read the log files in logdir/decode/ to find out other reasons for the error.

  • To ask for help

    If you want to ask for help about training, try to provide the training folder or at least the logdir. Pack the files into an archive and upload it to a public file sharing resource. Then post the link to the resource. Remember: the more information you provide the faster you will solve the problem.

