


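1. Kaldi basics (1): data preparation for speaker recognition - monsieurliaxiamen's blog - CSDN Blog
2. Summary of adapting sre10/v1 in Kaldi for speaker recognition on the TIMIT dataset - zjm750617105's column - CSDN Blog
3. Writing a speaker test script for the Tsinghua speech dataset under Kaldi - 破晓's column - CSDN Blog
4. Voiceprint recognition in kaldi - Programmer Sought
5. Voiceprint recognition in Kaldi - yutouwd's blog - CSDN Blog
6. [Data preprocessing] Converting TIMIT corpus WAV files - JJJanepp - cnblogs
7. Extracting MFCC features with Kaldi - 长虹剑's column - CSDN Blog
8. Format conversion of TIMIT data (SPHERE2WAV (RIFF)) - MengWang - cnblogs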




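TIMIT's wav files are not real (RIFF) wav files; they are in SPHERE format and have to be converted with sph2pipe, a tool that ships with Kaldi.

For example, to use the TIMIT file test/dr1/mdab0_si1039.wav, first run:

fname=test_dr1_mdab0_si1039.wav  # the file has been renamed
sph2pipe -f wav "$fname" >file.wav  # convert the format first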


How can this be done in batch? The post "Format conversion of TIMIT data (SPHERE2WAV (RIFF))" by MengWang on cnblogs is especially useful. Its basic command is:


cd '/home/dream/Research/kaldi-master/tools/sph2pipe_v2.5'


./sph2pipe -f wav ./wav_test/SA1.WAV ./wav_test/SA1_tr.WAV
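
To apply the same command to a whole directory tree, a loop along the following lines could be used. This is only a sketch: the function name convert_timit_tree, the SPH2PIPE variable and the directory names are my own assumptions, and it assumes the output directory lies outside the input directory.

```shell
#!/usr/bin/env bash
# Sketch: convert every SPHERE file under a TIMIT tree to RIFF wav,
# mirroring the directory layout under a separate output directory.
# convert_timit_tree and SPH2PIPE are illustrative names, not part of Kaldi.

SPH2PIPE=${SPH2PIPE:-sph2pipe}   # path to the sph2pipe binary

convert_timit_tree() {
  local src=$1 dst=$2
  local f rel out
  # -iname matches both .wav and .WAV
  find "$src" -type f -iname '*.wav' | sort | while IFS= read -r f; do
    rel=${f#"$src"/}                  # path relative to the source root
    out=$dst/$rel
    mkdir -p "$(dirname "$out")"
    "$SPH2PIPE" -f wav "$f" "$out" || echo "conversion failed: $f" >&2
  done
}

# Usage (placeholder paths):
# convert_timit_tree /path/to/TIMIT /path/to/TIMIT_riff
```

Writing to a separate tree keeps the original SPHERE files intact, so the conversion can be rerun if something goes wrong.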






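  • If there are only 100, 1,000 or 10,000 samples, the data is usually split into a 70% training set and a 30% test set. A split of 60% training, 20% validation and 20% test is also reasonable.
  • If the dataset is large, on the order of millions of samples, the validation and test sets should take up less than 20% and 10% of the total, respectively.
  • For research work a standard dataset is usually used, so there is no need to think about how to split the data.
  • For a first attempt, there is no need to split the data very strictly.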
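
As a rough illustration of the 60/20/20 rule above, the sketch below shuffles a list file and cuts it into three parts. The function name split_list and the output file names are assumptions of mine, and shuf from GNU coreutils is assumed to be available. For speaker recognition the list would normally contain speaker ids rather than utterances, so that the sets do not share speakers.

```shell
#!/usr/bin/env bash
# Sketch: random train/dev/test split of a list file by percentage.
# split_list is an illustrative helper, not a Kaldi script.

# split_list LIST TRAIN_PCT DEV_PCT OUTDIR   (the rest goes to test)
split_list() {
  local list=$1 train_pct=$2 dev_pct=$3 out=$4
  local n n_train n_dev
  mkdir -p "$out"
  n=$(wc -l < "$list")
  n_train=$(( n * train_pct / 100 ))
  n_dev=$(( n * dev_pct / 100 ))
  shuf "$list" > "$out/all.shuf"                 # random order
  head -n "$n_train" "$out/all.shuf" > "$out/train.list"
  tail -n +"$(( n_train + 1 ))" "$out/all.shuf" | head -n "$n_dev" > "$out/dev.list"
  tail -n +"$(( n_train + n_dev + 1 ))" "$out/all.shuf" > "$out/test.list"
}

# Usage (placeholder names):
# split_list speakers.list 60 20 splits
```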



Preparing the training set: spk2utt, utt2spk and wav.scp


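  • spk2utt maps each speaker id (spkid) to the ids of that speaker's utterances (uttid). A speaker usually has many utterances, so each line has the form spkid uttid1 uttid2 ..., with exactly one speaker id per line. The uttids within a line must be in the order produced by the sort command, and the lines must be sorted by spkid in the same way; otherwise Kaldi reports an error when validate_data_dir.sh is run.
  • utt2spk maps a single utterance id (uttid) to its speaker, so every line is a one-to-one pair. utt2spk can be generated from spk2utt with Kaldi's own scripts, or with a script of your own.

In the other direction, spk2utt can be obtained from utt2spk with the Kaldi command utils/utt2spk_to_spk2utt.pl utt2spk >spk2utt. The script reads whatever path it is given, so the files do not have to be placed in the utils directory; it is enough to run the command from the recipe directory, where utils is available.

  • wav.scp maps each utterance id (uttid) to the full path of its audio file, again one audio file per line. Depending on the audio format of the dataset, a format-conversion command may have to be added. If the original audio is already in wav format, each line only needs uttid path.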
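
The three files can be produced together by one small script. The sketch below is a minimal example under stated assumptions: the audio has already been converted to RIFF wav, each file sits in a directory named after its speaker (as in TIMIT's dr*/speaker/utt.wav layout), and the function name make_data_dir is my own. Utterance ids are prefixed with the speaker id so that sorting by uttid and by spkid agree.

```shell
#!/usr/bin/env bash
# Sketch: build wav.scp, utt2spk and spk2utt from a tree of wav files.
# make_data_dir is an illustrative helper, not a Kaldi script.

# make_data_dir WAV_ROOT DATA_DIR
make_data_dir() {
  local src=$1 out=$2
  local f spk utt uttid
  mkdir -p "$out"
  : > "$out/wav.scp"
  : > "$out/utt2spk"
  find "$src" -type f -iname '*.wav' | while IFS= read -r f; do
    spk=$(basename "$(dirname "$f")")   # parent directory = speaker id
    utt=$(basename "$f")
    utt=${utt%.*}                       # file name without extension
    uttid=${spk}_${utt}
    echo "$uttid $f" >> "$out/wav.scp"
    echo "$uttid $spk" >> "$out/utt2spk"
  done
  # Kaldi expects C-locale sort order.
  LC_ALL=C sort -o "$out/wav.scp" "$out/wav.scp"
  LC_ALL=C sort -o "$out/utt2spk" "$out/utt2spk"
  # Group utterances per speaker; utils/utt2spk_to_spk2utt.pl serves the same purpose.
  awk '{ utts[$2] = utts[$2] " " $1 } END { for (s in utts) print s utts[s] }' \
    "$out/utt2spk" | LC_ALL=C sort > "$out/spk2utt"
}

# Usage (placeholder paths; pass an absolute WAV_ROOT so wav.scp holds full paths):
# make_data_dir /path/to/TIMIT_riff/train data/train
```

If the audio is left in SPHERE format instead, the wav.scp entry becomes a piped command of the form uttid sph2pipe -f wav path |, which Kaldi runs when it reads the file.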





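The enrollment set, like the training set, consists of spk2utt, utt2spk and wav.scp. The files follow the same format as for the training set, so the details are not repeated here.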



Format of the trials file: uttid spkid target/nontarget, for example:

FADG0_SI649.WAV FADG0 target
FADG0_SI649.WAV FAKS0 nontarget
FADG0_SI649.WAV FASW0 nontarget
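
A trials list in this layout can be generated from the utt2spk of the test utterances by pairing every utterance with every speaker. The sketch below is one way to do it; the function name make_trials is an assumption of mine, and it is not the script used by any Kaldi recipe.

```shell
#!/usr/bin/env bash
# Sketch: write one trial per (utterance, speaker) pair.
# A pair is "target" when the speaker is the utterance's true speaker.
# make_trials is an illustrative helper, not a Kaldi script.

# make_trials UTT2SPK > trials
make_trials() {
  local utt2spk=$1
  awk '
    {
      utt[NR] = $1
      spk_of[NR] = $2
      if (!($2 in seen)) { seen[$2] = 1; spks[++n] = $2 }
    }
    END {
      for (i = 1; i <= NR; i++)
        for (j = 1; j <= n; j++)
          print utt[i], spks[j], (spks[j] == spk_of[i] ? "target" : "nontarget")
    }' "$utt2spk"
}

# Usage (placeholder paths):
# make_trials data/test/utt2spk > data/test/trials
```

With S speakers and U utterances this writes S x U lines, of which exactly U are target trials.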


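Only three data directories need to be prepared: train, enroll and test. In the recipe, the SRE dataset is used to train the PLDA model because most of the training data is out of domain. In your case, you can train the PLDA model directly on the training data, since it is in-domain. train should contain a set of speakers that does not overlap with enroll and test, and the utterances of the remaining speakers should be split between enroll and test. Although enroll and test share speakers, they should not contain utterances from the same recording.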



I'm not familiar with using TIMIT for speaker recognition, so I'm not sure how the evaluation is set up. It sounds like you might have only evaluation data and nothing to train your models with. Hopefully someone who has used TIMIT for this purpose can comment more. If you don't have any training data, you could try using the Librispeech corpus (look at the recipe in egs/ for more info).

You need at least the following datasets:

  • Training data. This is used to train the UBM, i-vector extractor and PLDA model. It should be non-overlapping with the other datasets. In the sre10 recipe, it corresponds to the "train" and "sre" data. The "sre" data is just a subset of "train" used to train the PLDA model, but it doesn't have to be that way in general.

  • Enrollment data. This is a subset of the evaluation data in which you know the identity of the speaker in the recording. Using the models created in the previous step, i-vectors are generated from this data. If you have multiple enrollment recordings per speaker, you might average their i-vectors to get speaker-level representations. In the sre10 recipe, this dataset is called "sre10_train."

  • Test data. This is also part of the evaluation data, and consists of recordings for which you don't know the identity of the speaker. These are compared (using the PLDA model or cosine distance) with the i-vectors created from the enrollment data. This dataset is called "sre10_test" in the recipe. The set of comparisons is defined by the "trials" file.


The run.sh in egs/aishell/v1 covers the entire voiceprint recognition process. It is best to copy the commands from run.sh into another script and run them one at a time, so that errors can be found and fixed promptly.

1) Data preparation.

2) Extract MFCC features, perform endpoint detection (VAD), and check the data directories, sorting any files that do not meet the requirements.

3) Train the UBM and the ivector extractor. Note that the script that trains the ivector extractor runs many jobs in parallel by default, which takes up a lot of memory and can exhaust it. This has to be changed in train_ivector_extractor.sh: by default it runs nj × num_threads × num_processes at the same time. With 16 GB of memory, I changed these three parameters to 2 to get it to run. Two hyperparameters can also be modified: the UBM dimension and the ivector dimension. The UBM dimension is changed directly in run.sh: the argument after data/train in the train_diag_ubm.sh call is the UBM dimension (the number of Gaussians), 1024 by default. To change the ivector dimension, modify ivector_dim in train_ivector_extractor.sh; the default is 400.

4) Extract the ivectors of the training set and use them to train the PLDA model used for scoring.

5) Next, the test set is divided into an enrollment set and an evaluation set. This is mainly done by the script local/split_data_enroll_eval.py. The script first stores each spk and its corresponding utts in dictutt, then randomly shuffles the utt order of each spk and distributes the utts into enroll (enrollment set) and eval (evaluation set). In the penultimate line of the program, if(i<3): the utt is written to enroll, otherwise it is written to eval. So the split between the enrollment set and the evaluation set can be changed by changing this value.

6) After utt2spk has been re-created, the trials have to be generated. This is done by local/produce_trials.py. The trials file lists the enrolled speakers together with the utterances that need to be scored against them, in the trials format shown earlier.
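
The split described in step 5 can be sketched in shell as follows. This is not the recipe's Python script, only an illustration of the same idea: shuffle each speaker's utterances, send the first N to enroll and the rest to eval. The function name split_enroll_eval and the output layout are assumptions of mine, and shuf from GNU coreutils is assumed to be available.

```shell
#!/usr/bin/env bash
# Sketch: per-speaker random split into enrollment and evaluation sets.
# split_enroll_eval is an illustrative helper, not the recipe's script.

# split_enroll_eval SPK2UTT N_ENROLL OUTDIR
split_enroll_eval() {
  local spk2utt=$1 n_enroll=$2 out=$3
  local spk utts i u
  mkdir -p "$out/enroll" "$out/eval"
  : > "$out/enroll/utt2spk"
  : > "$out/eval/utt2spk"
  while read -r spk utts; do
    i=0
    # $utts is left unquoted on purpose so it splits into one uttid per word
    for u in $(printf '%s\n' $utts | shuf); do
      if [ "$i" -lt "$n_enroll" ]; then
        echo "$u $spk" >> "$out/enroll/utt2spk"
      else
        echo "$u $spk" >> "$out/eval/utt2spk"
      fi
      i=$(( i + 1 ))
    done
  done < "$spk2utt"
  LC_ALL=C sort -o "$out/enroll/utt2spk" "$out/enroll/utt2spk"
  LC_ALL=C sort -o "$out/eval/utt2spk" "$out/eval/utt2spk"
}

# Usage (placeholder paths), with 3 enrollment utterances per speaker:
# split_enroll_eval data/test/spk2utt 3 data/test
```

The second argument plays the role of the constant in if(i<3): raising it moves more utterances from eval to enroll.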


We first look at how the data is partitioned in the AISHELL and TIMIT databases. AISHELL has 400 speakers in total, divided by default into train, dev and test sets: 340 speakers in train, 40 in dev and 20 in test. The recipe uses train as the training set and test as the test set; dev is not used. Each AISHELL speaker has about 300 utterances; each utterance is one sentence of about 2 to 6 s. The TIMIT database has 630 speakers, divided into train and test, with 462 speakers in the training set and 168 in the test set. Each speaker has 10 utterances of about 2 to 4 s each. Here TIMIT's original partition is used directly, with 462 speakers for training and 168 for testing.

After understanding the differences between the two databases and the overall voiceprint recognition process, we can begin to adapt the recipe. Not much needs to change: mainly the data preparation stage and the generation of trials. First, for data preparation, we can write our own TIMIT preparation script (for example timit_data_prep.sh) modeled on aishell_data_prep.sh. This stage generates three files, utt2spk, spk2utt and wav.scp, in the formats described earlier.


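Next, the script checks whether the wav files it found add up to 141924, and then generates wav.scp, utt2spk and spk2utt, plus transcripts.txt, which is only needed for speech recognition. The file count is specific to AISHELL, so this check has to be adapted or removed for TIMIT. Find the parts of the script that deal with transcripts.txt and delete them.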

With the data prepared, voiceprint recognition can be run following the process above. One thing to note about the trials: if a speaker has only two or three utterances, the proportion assigned to the enroll and eval sets has to be modified. Since every speaker in TIMIT has 10 utterances, it is fine to leave it unchanged. Here 3 utterances are used for enrollment and the remaining 7 for evaluation.

The final error rate was about 4.5%. Although this is an acceptable result, it is still much worse than the 0.18% error rate on AISHELL. Possible reasons: first, there is less speech for training. TIMIT has 462 training speakers but only 10 utterances each, whereas each of the 340 AISHELL training speakers has about 300 utterances. Second, the TIMIT test set has 168 speakers, far more than the 20 speakers in the AISHELL test set. Moreover, the default UBM order and ivector dimension in the AISHELL recipe are very high, which may also contribute to the higher error rate. To reduce the error rate further, try reducing the dimensions of the UBM and the ivector. After I reduced both, the error rate eventually reached 1.53%.


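  • If you run commands with sudo, the newly generated files end up locked (they are owned by root); unlock them with sudo chmod 777 file.
  • Step-by-step debugging: sh -x script.sh (or bash -x script.sh for scripts that rely on bash features). After modifying a script, if you do not want to rerun the parts that already work, comment out multiple lines with a BLOCK here-document.
  • The thread settings are in train_ivector_extractor.sh. Even with them set to 2 as described above, it would not run for me. Only after lowering the UBM dimension to 600 and the ivector dimension to 400 did the program run reasonably fast, and it still took 1 h. I suspect they could be even smaller, because the utts are very short.
  • If text is not used, this can be set at the beginning of one of the sub-scripts instead of hunting for the exact places to comment out. I have forgotten which script it was; you can find it by running the recipe and checking where it fails.
  • The only thing that has to be prepared in advance is the dataset. The TIMIT data format and its folder layout are both a headache, so it is best to write a script to handle them.
  • In aishell_data_prep.sh, comment out everything related to dev and transcripts.txt.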
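
The BLOCK trick mentioned in the list above relies on a here-document fed to the no-op command ":", so everything up to the closing BLOCK line is read as input and never executed. A minimal sketch, with made-up stage names:

```shell
#!/usr/bin/env bash
# Sketch: skip part of a script by wrapping it in a quoted here-document.
# The stage messages are placeholders for real recipe commands.

run_stages() {
  echo "stage 1 runs"
: <<'BLOCK'
  echo "stage 2 is skipped"
BLOCK
  echo "stage 3 runs"
}

run_stages
```

Quoting the delimiter ('BLOCK') stops the shell from expanding variables or command substitutions inside the skipped text, and the closing BLOCK must start at the beginning of its line.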
