Table of Contents

- Introduction
- Preface
- Dataset: CN-Celeb
- Prerequisite skills
- The CN-Celeb dataset
  - data
  - eval
    - enroll
    - test
- Converting flac to wav
- wav.scp
- utt2spk
- spk2utt
Is there anyone else who has just started with Kaldi and, like me, found plenty of theory but fell flat the moment they tried it hands-on? As the saying goes, "the first step is always the hardest", and data preparation is where every Kaldi experiment begins.

I skipped Kaldi and went straight to learning asv-subtools, but asv-subtools contains no data-preparation code, and there is very little material about it online. However, Kaldi and subtools prepare data in exactly the same way, so I turned to Kaldi to learn how the data is processed.

The Kaldi project does include data-preparation code; for example, stage 0 and stage 1 of /kaldi/egs/cnceleb/v2/run.sh. But I could never get it to run successfully: the CN-Celeb1 and CN-Celeb2 archives I downloaded did not match the script, with a few list files missing, and as a newcomer that put a lot of pressure on me. Success felt close at hand, yet so far away.

So I decided not to limit myself to generated scripts: you should thoroughly understand what needs to be done and how, and write the scripts yourself. That is the right approach, and with it you can handle any new dataset with confidence. In this article I describe in detail the problems I ran into and how I solved them, partly as a record for myself, and hopefully as a help to you.
I am working with the CN-Celeb dataset. CN-Celeb2 is entirely training data, while CN-Celeb1 is split into train, enroll, and test, so CN-Celeb1 is used for the demonstration here.

I want to stress the saying "to do good work, one must first sharpen one's tools": to do data preparation well, you must first master your tools. We need a solid understanding of shell commands, not just reading them but actually using them. I learned this the hard way: as a shell beginner, the scripts were heavy going when I was teaching myself asv-subtools. So take the study of basic tools seriously.

Here I recommend a course, the Bilibili tutorial "B站讲的最好的Linux Shell脚本教程,通俗易懂层层深入". The teacher explains things very well, it is all hands-on practice, quick to pick up and easy to understand. I also recommend a collection of shell commands where you can look up the basics: linux命令大全.
I downloaded the CN-Celeb1 dataset from the official site; it arrives as an archive. After extraction we get a folder. Entering it, we see several files; the ones to pay attention to are the data and eval folders:
```
jyt522@xju-aslp3:/student/temp/jyt522/data/CN-Celeb_flac$ ls
1911.01799.pdf  data  dev  eval  README.TXT
```
Entering data:
```
jyt522@xju-aslp3:/student/temp/jyt522/data/CN-Celeb_flac$ cd data; ls
id00000  id00001  id00002  id00003  id00004  id00005  id00006  id00007
id00008  id00009  id00010  id00011  id00012  id00013  id00014  id00015
...
id00992  id00993  id00994  id00995  id00996  id00997  id00998  id00999
(listing truncated: one id00xxx directory per speaker)
```
As you can see, each id00xxx corresponds to one speaker, and each id is a folder containing all of that speaker's utterances. Let's enter id00000 as an example:
```
jyt522@xju-aslp3:/student/temp/jyt522/data/CN-Celeb_flac/data/id00000$ ls
singing-01-001.flac  singing-01-002.flac  singing-01-003.flac
singing-01-004.flac  singing-01-005.flac  singing-01-006.flac
```
As you can see, id00000 has 6 segments.
Entering the eval folder:
```
jyt522@xju-aslp3:/student/temp/jyt522/data/CN-Celeb_flac/eval$ ls
enroll  lists  README.TXT  test
```
The two folders we need to pay attention to are enroll and test. Entering the enroll folder:
```
jyt522@xju-aslp3:/student/temp/jyt522/data/CN-Celeb_flac/eval/enroll$ ls
id00800-enroll.flac  id00801-enroll.flac  id00802-enroll.flac  id00803-enroll.flac
id00804-enroll.flac  id00805-enroll.flac  id00806-enroll.flac  id00807-enroll.flac
...
id00997-enroll.flac  id00998-enroll.flac  id00999-enroll.flac
(listing truncated: one id00xxx-enroll.flac file per evaluation speaker)
```
As you can see, the enroll folder contains a number of utterances. The test folder is similar; looking at a truncated listing, test also contains a number of segments.

Now that we know the directory structure of the raw data, we discover a new problem: every file is *.flac. That is not what we want; the audio files we need are in wav format. So the next step is to convert flac to wav. I wrote a script for exactly this, flac_to_wav1.sh:
```shell
#!/bin/bash
# Decode every CN-Celeb flac file to wav, preserving the directory layout.
in_dir=/student/temp/jyt522/data/CN-Celeb_flac
out_dir=/student/temp/jyt522/data/CN-Celeb1

# data/: one sub-directory per speaker
for i in "$in_dir"/data/*; do
    speaker=$(basename "$i")
    mkdir -p "$out_dir/data/$speaker"
    for j in "$i"/*; do
        utt_temp=$(basename "$j")
        utt=${utt_temp/%flac/wav}   # replace the trailing "flac" with "wav"
        flac -d "$j" -o "$out_dir/data/$speaker/$utt"
    done
done

# eval/enroll and eval/test: flat directories of utterances
for i in "$in_dir"/eval/enroll/*; do
    utt_temp=$(basename "$i")
    utt=${utt_temp/%flac/wav}
    mkdir -p "$out_dir/eval/enroll"
    flac -d "$i" -o "$out_dir/eval/enroll/$utt"
done

for i in "$in_dir"/eval/test/*; do
    utt_temp=$(basename "$i")
    utt=${utt_temp/%flac/wav}
    mkdir -p "$out_dir/eval/test"
    flac -d "$i" -o "$out_dir/eval/test/$utt"
done
```
The idea behind the script is simple: it relies on the flac command, where `flac -d input.flac -o output.wav` decodes a flac file into a wav file. The result is a directory named CN-Celeb1 in which all the audio is in the wav format we want, and, just as important, the directory structure is identical to the original. Exactly what we need.
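As a quick sanity check (a sketch assuming my paths; adjust them to yours), you can compare the number of source flac files with the number of decoded wav files:

```shell
# these two counts should match if every flac file was decoded
find /student/temp/jyt522/data/CN-Celeb_flac -iname "*.flac" | wc -l
find /student/temp/jyt522/data/CN-Celeb1 -iname "*.wav" | wc -l
```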
With that groundwork done, let's get to the main topic: data preparation itself. First, what is data preparation? Taken literally, it means organizing and arranging the data so it is ready to use. What does "ready" look like? Different projects have different requirements. Taking the asv-subtools recipe subtools/recipe/cnsrc/sv/run-cnsrc_sv.sh as the entry script, we can see that we need to prepare three files: wav.scp, utt2spk, and spk2utt. Below we discuss in detail what each of these files means and how to generate it.
wav.scp

File format:

```
[utt_id] [wav_path]
```
When I first saw this file, I was baffled, and you may well be too. So let's start from the file's format and see what it really is. As shown above, wav.scp has two columns, [utt_id] and [wav_path]. [utt_id] is short for utterance identity; [wav_path] is the absolute path to the wav file. Put simply, wav.scp is just a list of entries of the form [name] [location]. For example:
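The example image from the original post is missing here; based on the naming scheme used in the generation script below, an excerpt from the train set's wav.scp would look like this (the paths are from my setup):

```
id00000-singing-01-001 /student/temp/jyt522/data/CN-Celeb1/data/id00000/singing-01-001.wav
id00000-singing-01-002 /student/temp/jyt522/data/CN-Celeb1/data/id00000/singing-01-002.wav
```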
Next we only need to think about how to generate this file. Naturally we do it with a shell script; the commands you need to know are:

- awk
- sed

Both are explained very well and in detail in the video course recommended above. Below is the script I used to generate wav.scp:
```shell
#!/bin/bash
# Generate wav.scp for the train, enroll and test sets.
in_dir=/student/temp/jyt522/data/CN-Celeb1
out_dir=/student/temp/jyt522/data

# eval_enroll
mkdir -p $out_dir/eval_enroll
cd $out_dir/eval_enroll
find $in_dir/eval/enroll -iname "*.wav" > wav.scp.temp
# utt_id = file name without the .wav extension
awk -F '/' '{print $NF}' wav.scp.temp | sed 's|\.wav$||' > wav_id
paste -d' ' wav_id wav.scp.temp > wav.scp
rm -f wav_id wav.scp.temp

# eval_test
mkdir -p $out_dir/eval_test
cd $out_dir/eval_test
find $in_dir/eval/test -iname "*.wav" > wav.scp.temp
awk -F '/' '{print $NF}' wav.scp.temp | sed 's|\.wav$||' > wav_id
paste -d' ' wav_id wav.scp.temp > wav.scp
rm -f wav_id wav.scp.temp

# cnceleb1_train: prefix the speaker directory to keep utt_ids unique
mkdir -p $out_dir/cnceleb1_train
cd $out_dir/cnceleb1_train
find $in_dir/data -iname "*.wav" > wav.scp.temp
awk -F '/' '{printf("%s-%s\n",$(NF-1),$NF)}' wav.scp.temp | sed 's|\.wav$||' > wav_id
paste -d' ' wav_id wav.scp.temp > wav.scp
rm -f wav_id wav.scp.temp
```
The idea is simple; let me explain:

1. First use find to locate every file ending in .wav and write the paths into wav.scp.temp.
2. Pipe each line through awk to pick out the fields we need, then through sed to strip the .wav suffix, producing wav_id.
3. Paste wav_id and wav.scp.temp together into wav.scp.

This generates a wav.scp file in each of the three folders cnceleb1_train, eval_enroll, and eval_test.
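It is worth sanity-checking the result before moving on. A small sketch (assuming wav.scp is in the current directory): count the lines and look for duplicate utterance ids, which Kaldi would reject:

```shell
# line count of wav.scp
wc -l < wav.scp
# duplicate utt_ids; prints nothing if every id is unique
awk '{print $1}' wav.scp | sort | uniq -d
```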
utt2spk

File format:

```
[utt_id] [spk_id]
```

The utt2spk file has two columns, [utt_id] and [spk_id]. [utt_id] is short for utterance identity, and [spk_id] is short for speaker identity. For example:
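The example image is again missing; judging from the naming scheme, an excerpt from the train set's utt2spk would look like:

```
id00000-singing-01-001 id00000
id00000-singing-01-002 id00000
```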
The only thing you need to care about is this file's structure; next comes how to generate it. Below is the script I used to generate utt2spk:
```shell
#!/bin/bash
# Generate utt2spk for the train, enroll and test sets.
in_dir=/student/temp/jyt522/data/CN-Celeb1
out_dir=/student/temp/jyt522/data

# eval_enroll: files look like id00800-enroll.wav,
# so the speaker id is the part before "-enroll"
cd $out_dir/eval_enroll
find $in_dir/eval/enroll -iname "*.wav" > wav.scp.temp
awk -F '/' '{print $NF}' wav.scp.temp | sed 's|\.wav$||' > wav_id
awk -F '-' '{print $(NF-1)}' wav_id > spk_id
paste -d' ' wav_id spk_id > utt2spk
rm -f wav_id wav.scp.temp spk_id

# eval_test: the speaker id is the first "-"-separated field
cd $out_dir/eval_test
find $in_dir/eval/test -iname "*.wav" > wav.scp.temp
awk -F '/' '{print $NF}' wav.scp.temp | sed 's|\.wav$||' > wav_id
awk -F '-' '{print $1}' wav_id > spk_id
paste -d' ' wav_id spk_id > utt2spk
rm -f wav_id wav.scp.temp spk_id

# cnceleb1_train: utt_id = <speaker>-<file>, spk_id = the speaker directory name
cd $out_dir/cnceleb1_train
find $in_dir/data -iname "*.wav" > wav.scp.temp
awk -F '/' '{printf("%s-%s\n",$(NF-1),$NF)}' wav.scp.temp | sed 's|\.wav$||' > wav_id
awk -F '/' '{print $(NF-1)}' wav.scp.temp > spk_id
paste -d' ' wav_id spk_id > utt2spk
rm -f wav_id wav.scp.temp spk_id
```
For each set the script does the following:

1. find the paths of all .wav files and store them in wav.scp.temp.
2. Run each line of wav.scp.temp through awk and then sed to produce wav_id.
3. Run wav.scp.temp through awk again to produce spk_id.
4. Paste wav_id and spk_id together.

This generates a utt2spk file in each of the three folders cnceleb1_train, eval_enroll, and eval_test.
spk2utt

File format:

```
[spk_id] [utt_id]...
```

Does this look familiar? It should: it is very similar to the utt2spk format introduced above, just with the order reversed, so I won't repeat what the fields mean. Note that there may be several [utt_id]s on one line, because each speaker has many utterances; this is where it differs from utt2spk. For example:
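The example image is missing; an excerpt would look like this (one line per speaker, utterances separated by spaces):

```
id00000 id00000-singing-01-001 id00000-singing-01-002 id00000-singing-01-003
```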
For the actual generation I used the script utt2spk_to_spk2utt.pl that ships with Kaldi. This tool is very convenient: once utt2spk exists, it produces the matching spk2utt automatically. You only need to set up the environment first; here is how:

- Symlink the two tool directories:

```shell
ln -s /student/temp/jyt522/kaldi/egs/wsj/s5/steps steps
ln -s /student/temp/jyt522/kaldi/egs/wsj/s5/utils utils
```

- Set the environment variable:

```shell
export kaldi_wsj=/student/temp/jyt522/kaldi/egs/wsj/s5/
```

- Copy path.sh into the current directory:

```shell
cp /student/temp/jyt522/kaldi/egs/wsj/s5/path.sh ./
```

- Edit the first line of path.sh to point at your Kaldi root:

```shell
export KALDI_ROOT=/student/temp/jyt522/kaldi
```

- Source path.sh. Note the leading dot and space, which run it in the current shell:

```shell
. path.sh
```

- Generate the spk2utt file:

```shell
./utils/utt2spk_to_spk2utt.pl utt2spk > spk2utt
```

This generates a spk2utt file in each of the three folders cnceleb1_train, eval_enroll, and eval_test.
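Under the hood the conversion is just grouping column 1 of utt2spk by column 2. A rough pure-awk equivalent (an illustrative sketch, not the actual Perl script):

```shell
# collect the utterance ids of each speaker, then print one line per speaker
awk '{utts[$2] = utts[$2] " " $1} END {for (s in utts) print s utts[s]}' utt2spk | sort > spk2utt
```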
At this point we have successfully generated the three files under each dataset. But the data-processing steps are not finished: so far only the CN-Celeb1 data has been processed; the CN-Celeb2 data needs the same treatment, and the two then need to be combined. There is also one more important issue: our files may not be sorted in the order Kaldi expects, and that too needs fixing. For both of these problems, Kaldi ships the tools combine_data.sh and fix_data_dir.sh.
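A sketch of how these two tools are typically invoked (the directory names here are illustrative, and this assumes the utils symlink set up earlier):

```
utils/combine_data.sh data/train_combined data/cnceleb1_train data/cnceleb2_train
utils/fix_data_dir.sh data/train_combined
```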
Finally, I also recommend a video on Kaldi data preparation; I learned a lot from it as well.