Forced Alignment¶
This tutorial covers the usage of abkhazia to do phone-level forced alignment on your own corpus of annotated audio files.
Prerequisites¶
Here’s what you need to have before being able to follow this tutorial:
A set of audio files encoded as 16 kHz, 16-bit PCM WAV on which to run the alignment
On these audio files, a set of segments corresponding to utterances. For each utterance, you’ll need a phonemic transcription (an easy way to get these is by using Phonemizer)
It’s also recommended (yet optional) to have some kind of reference file identifying the speaker of each of your phonemized utterances.
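Before going further, it can save time to verify that your audio files actually match the expected encoding. The sketch below uses Python’s standard wave module; the helper name check_wav is hypothetical, not part of abkhazia:

```python
import wave

def check_wav(path):
    """Return True if the file is a 16 kHz, 16-bit PCM WAV,
    the encoding this tutorial assumes for input audio."""
    with wave.open(path, "rb") as w:
        return w.getframerate() == 16000 and w.getsampwidth() == 2
```

Running this over your corpus before validation can catch resampling problems early.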
Corpus format¶
The corpus format is the same as the one specified in abkhazia_format, except that two corpus files have a more specific format, namely text.txt and lexicon.txt. Here, text.txt is composed of the phonemic transcription of each utterance:
<utterance-id> <pho1> <pho2> ... <phoN>
and lexicon.txt is just a “phony” file containing each phoneme mapped to itself:
<pho1> <pho1>
<pho2> <pho2>
<pho3> <pho3>
...
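Since lexicon.txt must list exactly the phonemes used in text.txt, it can be generated rather than written by hand. A minimal sketch, assuming text.txt lines follow the format above (make_lexicon is a hypothetical helper, not part of abkhazia):

```python
def make_lexicon(text_lines):
    """Collect every phoneme used in text.txt and map it to itself,
    producing the lines of the "phony" lexicon.txt."""
    phones = set()
    for line in text_lines:
        # each line: <utterance-id> <pho1> <pho2> ... <phoN>
        phones.update(line.split()[1:])
    return ["%s %s" % (p, p) for p in sorted(phones)]
```

For example, make_lexicon(["utt1 a b", "utt2 b c"]) returns ["a a", "b b", "c c"].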
Doing the Forced Alignment¶
Once you’ve gathered all the required files (cited above) in a corpus/
folder (the name is obviously arbitrary), you’ll want to validate the corpus to check that it conforms to Kaldi’s input format. Luckily, abkhazia does that for us:
abkhazia validate corpus/
Then, we’ll compute the language model (actually, here, a phonetic model) for your dataset.
Note that even though we set the model level (option -l) to “word”, it still works fine here since all words are phonemes:
abkhazia language corpus/ -l word -n 3 -v
We’ll now extract the MFCC features from the wav files:
abkhazia features mfcc corpus/ --cmvn
Then, using the language model and the extracted MFCCs, compute a triphone HMM-GMM acoustic model:
abkhazia acoustic monophone -v corpus/ --force --recipe
abkhazia acoustic triphone -v corpus/
If you specified the speaker for each utterance, you can adapt your model per speaker:
abkhazia acoustic triphone-sa -v corpus/
And then, at last, we can compute the forced phonetic alignments:
abkhazia align corpus -a corpus/triphone-sa # if you computed the speaker-adapted triphones
abkhazia align corpus -a corpus/triphone # if you didn't
If everything went right, you should be able to find your alignment in corpus/align/alignments.txt. The file will have the following row structure:
<utt_id> <pho_start> <pho_end> <pho_name> <pho_symbol>
...
Note that the phoneme’s start and end time markers (in seconds) are relative to the utterance in which they were contained, not to the entire audio file.
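If you need phone boundaries relative to the whole audio file, you can shift them by each utterance’s start time. A sketch, assuming you have a mapping from utterance id to its start time in seconds (e.g. read from your segments file); to_file_times is a hypothetical helper, not part of abkhazia:

```python
def to_file_times(align_lines, utt_starts):
    """Shift phone boundaries from utterance-relative times to
    audio-file times, given each utterance's start offset in seconds."""
    out = []
    for line in align_lines:
        # each line: <utt_id> <pho_start> <pho_end> <rest of the fields...>
        utt_id, start, end, *rest = line.split()
        offset = utt_starts[utt_id]
        out.append((utt_id, float(start) + offset, float(end) + offset, *rest))
    return out
```

For example, with an utterance u1 starting at 2.0 s in its file, a phone aligned at 0.0–0.5 s maps to 2.0–2.5 s.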