================
Forced Alignment
================

This tutorial covers the usage of abkhazia to do phone-level forced
alignment on your own corpus of annotated audio files.

Prerequisites
=============

Here's what you need to have before being able to follow this tutorial:

- A set of audio files encoded in 16000 Hz WAV 16-bit PCM on which to
  run the alignment
- On these audio files, a set of segments corresponding to
  utterances. For each utterance, you'll need to have a phonemic
  transcription (an easy way to get these is by using `Phonemizer `_ )

It's also recommended (yet optional) to have some kind of reference
file where you can identify the speaker of each of your phonemized
utterances.

Corpus format
=============

The corpus format is the same as the one specified in
:ref:`abkhazia_format`, except that two corpus files have a more
specific format, namely ``text.txt`` and ``lexicon.txt``. Here,
``text.txt`` is composed of your phonemic transcription of each
utterance::

    ...

and ``lexicon.txt`` is just a "phony" file containing phonemes mapped
to themselves::

    ...

Doing the Forced Alignment
==========================

Once you've gathered all the required files (cited above) in a
``corpus/`` folder (the name is obviously arbitrary), you'll want to
validate the corpus to check that it conforms to Kaldi's input
format. Abkhazia luckily does that for us::

    abkhazia validate corpus/

Then, we'll compute the language model (actually here a phonetic
model) for your dataset.
Note that even though we set the model level (option ``-l``) to
"word", it still works fine here since all words are phonemes::

    abkhazia language corpus/ -l word -n 3 -v

We'll now extract the MFCC features from the wav files::

    abkhazia features mfcc corpus/ --cmvn

Then, using the language model and the extracted MFCCs, compute a
triphone HMM-GMM acoustic model::

    abkhazia acoustic monophone -v corpus/ --force --recipe
    abkhazia acoustic triphone -v corpus/

If you specified the speaker for each utterance, you can adapt your
model per speaker::

    abkhazia acoustic triphone-sa -v corpus/

And then, at last, we can compute the forced phonetic alignments::

    abkhazia align corpus -a corpus/triphone-sa # if you computed the speaker-adapted triphones
    abkhazia align corpus -a corpus/triphone # if you didn't

If everything went right, you should be able to find your alignment in
``corpus/align/alignments.txt``. The file will have the following row
structure::

    ...

**Note that the phoneme's start and end time markers (in seconds) are
relative to the utterance in which they were contained, not to the
entire audio file.**
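
Because the time markers are utterance-relative, you may want to shift
them if you need positions within the original audio file. The snippet
below is a minimal sketch of that arithmetic and is not part of
abkhazia. It assumes you have already parsed the alignment rows into
``(utt_id, start, end, phone)`` tuples and that you know the onset (in
seconds) of each utterance in its audio file; the function name
``to_file_time`` and both data structures are hypothetical and chosen
for illustration only.

```python
from typing import Dict, Iterable, List, Tuple

# (utt_id, start, end, phone): a hypothetical in-memory form of one
# alignment row, with times in seconds relative to the utterance.
Alignment = Tuple[str, float, float, str]


def to_file_time(
    alignments: Iterable[Alignment],
    utt_onsets: Dict[str, float],
) -> List[Alignment]:
    """Shift utterance-relative phone times by each utterance's onset."""
    shifted = []
    for utt_id, start, end, phone in alignments:
        onset = utt_onsets[utt_id]  # raises KeyError for an unknown utterance
        shifted.append((utt_id, onset + start, onset + end, phone))
    return shifted


# Made-up example: utterance "utt1" starts 12.5 s into its audio file.
rows = [("utt1", 0.0, 0.25, "a"), ("utt1", 0.25, 0.5, "b")]
absolute = to_file_time(rows, {"utt1": 12.5})
```

If each of your utterances spans a whole audio file, every onset is
zero and the two time scales coincide, so no shifting is needed.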