Corpus Phonetics Tutorial
Eleanor Chodroff
Kaldi
Forced Alignment
Once acoustic models have been created, Kaldi can also perform forced alignment on audio accompanied by a word-level transcript. Note that the Montreal Forced Aligner is a forced alignment system based on Kaldi-trained acoustic models for American English (other languages are in development). If you only need an alignment of American English speech, it may be worth checking out their system (see also the Penn Forced Aligner and FAVE).
Otherwise, if the audio to be aligned is the same as the audio used in the acoustic models, then the alignments can be extracted directly from the alignment files. If you have new audio and transcripts, then the transcript files will need to be updated before alignment.
The full procedure will convert output from the model alignment into Praat TextGrids containing the phone-level transcript.
If the data to be aligned is the same as the training data, skip to step 4. Otherwise, you'll need to update the transcript files and audio file specifications in steps 1 through 3.
1. Create a new data directory
Create a directory in mycorpus/data to house the new versions of text, segments, wav.scp, utt2spk, and spk2utt. See the page on data/train for details on how to create these files.
cd mycorpus/data
mkdir alignme
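For reference, here is roughly what one line of each file looks like; the speaker and file names below are invented for illustration:
text:     spk1_file1_utt1 HELLO WORLD
segments: spk1_file1_utt1 spk1_file1 0.53 2.10
wav.scp:  spk1_file1 /path/to/mycorpus/audio/spk1_file1.wav
utt2spk:  spk1_file1_utt1 spk1
spk2utt:  spk1 spk1_file1_utt1 spk1_file1_utt2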
2. Extract MFCC features
Revisit the page on MFCC feature extraction for reference. You'll need to replace data/train with the new directory, data/alignme.
cd mycorpus
mfccdir=mfcc
for x in data/alignme; do
  steps/make_mfcc.sh --cmd "$train_cmd" --nj 16 $x exp/make_mfcc/$x $mfccdir
  utils/fix_data_dir.sh data/alignme
  steps/compute_cmvn_stats.sh $x exp/make_mfcc/$x $mfccdir
  utils/fix_data_dir.sh data/alignme
done
3. Align data
Revisit the page on triphone training and alignment for reference. Select the acoustic model and corresponding alignment process you'd like to use. You'll need to replace data/train with the new directory, data/alignme. As an example:
cd mycorpus
steps/align_si.sh --cmd "$train_cmd" data/alignme data/lang exp/tri4a exp/tri4a_alignme || exit 1;
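If the model you chose was built with speaker-adaptive training (fMLLR), the matching Kaldi script is steps/align_fmllr.sh; a sketch under the same directory assumptions as above:
cd mycorpus
steps/align_fmllr.sh --cmd "$train_cmd" data/alignme data/lang exp/tri4a exp/tri4a_alignme || exit 1;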
4. Obtain CTM output
CTM stands for time-marked conversation file; it contains a time-aligned, phone-level transcription of the utterances. Its format is:
utt_id channel_num start_time duration phone_id
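For example, a hypothetical line might read:
spk1_file1_utt1 1 0.25 0.10 5
i.e., phone ID 5 begins 0.25 seconds into the utterance and lasts 0.10 seconds.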
To obtain these, you will need to decide which acoustic models to use. The following code will extract the CTM output from the alignment files in the directory tri4a_alignme, using the acoustic models in tri4a:
cd mycorpus
for i in exp/tri4a_alignme/ali.*.gz;
do src/bin/ali-to-phones --ctm-output exp/tri4a/final.mdl ark:"gunzip -c $i|" - > ${i%.gz}.ctm;
done;
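If you'd like to spot-check a CTM file with phone symbols rather than numeric IDs, Kaldi's utils/int2sym.pl can map the fifth field through phones.txt (a quick sanity check only; the conversion proper is handled in step 6):
cd mycorpus
utils/int2sym.pl -f 5 data/lang/phones.txt exp/tri4a_alignme/ali.1.ctm | head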
5. Concatenate CTM files
cd mycorpus/exp/tri4a_alignme
cat *.ctm > merged_alignment.txt
6. Convert time marks and phone IDs
The CTM output reports times relative to the utterance, as opposed to the file. You will need the segments file located in either data/train or data/alignme to convert the utterance times into file times.
The output also reports the phone ID, as opposed to the phone itself. You will need the phones.txt file located in data/lang to convert the phone IDs into phone symbols.
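id2phone.R performs both conversions for you; as a minimal sketch of the time arithmetic alone, assuming segments lines of the form utt_id file_id seg_start seg_end and the directory names used above:
# prints file_id, utt_id, absolute start, absolute end, phone_id
awk 'NR==FNR {file[$1]=$2; seg[$1]=$3; next}
  {printf "%s %s %.2f %.2f %s\n", file[$1], $1, seg[$1]+$3, seg[$1]+$3+$4, $5}' \
  data/alignme/segments exp/tri4a_alignme/merged_alignment.txt | head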
id2phone.R
After obtaining the segments and phones.txt files, run id2phone.R to convert phone IDs to phones characters and map utterance times to file times. You will need to modify the file locations and possibly the regular expression to obtain the filename from the utterance name. Recall that the CTM output lists the utterance ID whereas the segments file lists the file ID. (If you named things logically, the file ID should be a subset of the utterance ID.)
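For example, if your utterance IDs follow a pattern like spk1_file1_utt3 (hypothetical naming), a regular expression that strips the final _utt3 recovers the file ID:
echo "spk1_file1_utt3" | sed 's/_utt[0-9]*$//'    # prints spk1_file1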
id2phone.R returns a modified version of merged_alignment.txt, called final_ali.txt.
7. Split alignments by file
splitAlignments.py
final_ali.txt contains the phone-level transcripts of all files together. Running splitAlignments.py splits this into one transcript file per audio file. You will need to modify the location of final_ali.txt in this script.
python splitAlignments.py
8. Obtain word alignments
First, we'll need to use the [B I E S] suffixes on the phones to group phones together into word-level units: _B marks a word-initial phone, _I a word-internal phone, _E a word-final phone, and _S a single-phone word, so a sequence like k_B ae_I t_E spans exactly one word. Run phons2pron.py to complete this step. Note that this script assumes utf-8 character encoding; if necessary, update it to reflect the character encoding that best matches your files.
Second, we'll need to match the phone pronunciation to the corresponding lexical entry using lexicon.txt (entries there consist of a word followed by its phones, e.g. "cat k ae t"). Run pron2words.py to complete this step.
9. Create Praat TextGrids
Praat requires that a text file have a header. Once we append the header, we can convert these text files into TextGrids. The following code requires a text file containing the header:
file_utt file id ali startinutt dur phone start_utt end_utt start end
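One way to create this file, matching the header path used below:
echo "file_utt file id ali startinutt dur phone start_utt end_utt start end" > /Users/Eleanor/Desktop/header.txt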
It also requires a tmp directory for processing. I put this on my Desktop.
cd ~/Desktop
mkdir tmp
header="/Users/Eleanor/Desktop/header.txt"
cd mycorpus/forcedalignment
for i in *.txt;
do
cat "$header" "$i" > /Users/Eleanor/Desktop/tmp/xx.$i
mv /Users/Eleanor/Desktop/tmp/xx.$i "$i"
done
createtextgrid.praat will read in the new phone transcripts and corresponding audio files to create a TextGrid for each file. You will need to modify the locations of the phone transcripts and audio files.
createWordTextGrids.praat
stackTextGrids.praat