This tutorial covers the usage of abkhazia to do phone-level forced alignment on your own corpus of annotated audio files.
Here’s what you need to have before being able to follow this tutorial:
A set of audio files on which to run the alignment, encoded as 16 kHz, 16 bit PCM WAV
On these audio files, a set of segments corresponding to utterances. For each utterance, you’ll need a phonemic transcription (an easy way to get these is to use Phonemizer)
It’s also recommended (though optional) to have some kind of reference file identifying the speaker of each of your phonemized utterances.
The corpus format is the same as the one specified in abkhazia_format, except that two corpus files have a more specific format:
text.txt is composed of the phonemic transcription of each utterance:
<utterance-id> <pho1> <pho2> ... <phoN>
lexicon.txt is just a “phony” file containing each phoneme mapped to itself:
<pho1> <pho1>
<pho2> <pho2>
<pho3> <pho3>
...
Doing the Forced Alignment
Once you’ve gathered all the required files (cited above) in a
corpus/ folder (the name is
obviously arbitrary), you’ll want to validate the corpus to check that it conforms to Kaldi’s
input format. Luckily, abkhazia does that for us:
abkhazia validate corpus/
Then, we’ll compute the language model (actually here a phonetic model) for your dataset.
Note that even though we set the model level (option
-l) to “word”, this still
works fine here since all the words are phonemes:
abkhazia language corpus/ -l word -n 3 -v
We’ll now extract the MFCC features from the wav files:
abkhazia features mfcc corpus/ --cmvn
Then, using the language model and the extracted MFCCs, compute a triphone HMM-GMM acoustic model:
abkhazia acoustic monophone -v corpus/ --force --recipe
abkhazia acoustic triphone -v corpus/
If you specified the speaker for each utterance, you can adapt your model per speaker:
abkhazia acoustic triphone-sa -v corpus/
Then, at last, we can compute the forced phonetic alignments:
abkhazia align corpus -a corpus/triphone-sa  # if you computed the speaker-adapted triphones
abkhazia align corpus -a corpus/triphone     # if you didn't
If everything went right, you should be able to find your alignment in
corpus/align/alignments.txt. The file will have the following row structure:
<utt_id> <pho_start> <pho_end> <pho_name> <pho_symbol> ...
Note that the phoneme’s start and end time markers (in seconds) are relative to the utterance in which they were contained, not to the entire audio file.
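If you need absolute times, you can shift each phone by the start time of its utterance in the source wav. The sketch below assumes you have that offset available for each utterance (e.g. from your segments file); the row layout is the one described above, and the sample rows and offsets are made up.

```python
def parse_alignments(lines, utt_starts):
    """Yield (utt_id, abs_start, abs_end, phone) tuples.

    `lines` are rows of alignments.txt in the form
    "<utt_id> <pho_start> <pho_end> <pho_name> ...";
    `utt_starts` maps each utterance ID to its start time
    (in seconds) within the source wav file.
    """
    for line in lines:
        utt_id, start, end, phone = line.split()[:4]
        offset = utt_starts.get(utt_id, 0.0)
        yield utt_id, offset + float(start), offset + float(end), phone

# Made-up example: an utterance starting at 4.50 s in its wav file
rows = ["utt001 0.00 0.12 h", "utt001 0.12 0.31 @"]
starts = {"utt001": 4.50}
for utt, s, e, p in parse_alignments(rows, starts):
    print(utt, s, e, p)
```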