# Evaluation Metrics

All of our metrics assume a time-aligned transcription, where $$T_{i,j}$$ is the (phoneme) transcription corresponding to the speech fragment designated by the pair of indices $$\langle i,j \rangle$$ (i.e., the speech fragment between frames i and j). If the left or right edge of the fragment contains part of a phoneme, that phoneme is included in the transcription if the overlapping part corresponds to more than 30 ms or more than 50% of the phoneme's duration.
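
The boundary rule can be illustrated with a minimal Python sketch. It is not the evaluation toolkit's own code: the function name `transcription`, the `(start_ms, end_ms, label)` alignment format, and the use of integer milliseconds instead of frame indices are all assumptions made for this example.

```python
def transcription(alignment, start_ms, end_ms, min_overlap_ms=30, min_fraction=0.5):
    """Return the phoneme labels for the fragment [start_ms, end_ms].

    `alignment` is a hypothetical list of (start_ms, end_ms, label) tuples,
    sorted by time. A phoneme is kept if its overlap with the fragment
    exceeds 30 ms or exceeds 50% of the phoneme's own duration.
    """
    labels = []
    for p_start, p_end, label in alignment:
        duration = p_end - p_start
        overlap = min(end_ms, p_end) - max(start_ms, p_start)
        if duration <= 0 or overlap <= 0:
            continue  # empty phoneme, or no overlap with the fragment
        if overlap > min_overlap_ms or overlap > min_fraction * duration:
            labels.append(label)
    return labels


# Toy alignment: "k" (0-100 ms), "ae" (100-180 ms), "t" (180-300 ms).
alignment = [(0, 100, "k"), (100, 180, "ae"), (180, 300, "t")]

# The fragment covers only 20 ms of "k" and 20 ms of "t", so both are dropped.
print(transcription(alignment, 80, 200))  # ['ae']

# Moving the left edge to 60 ms covers 40 ms of "k", which exceeds 30 ms.
print(transcription(alignment, 60, 200))  # ['k', 'ae']
```

A phoneme lying entirely inside the fragment always passes the 50% test, so only the two edge phonemes are ever affected by the thresholds.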