Slicing features
To compute phoneme- or triphone-based ABX, we need phone-level alignments. Those are described in item files, like the following:
```
#file onset offset #phone prev-phone next-phone speaker
6295-244435-0009 0.2925 0.4725 IH L NG 6295
6295-244435-0009 0.3725 0.5325 NG IH K 6295
6295-244435-0009 0.4325 0.5725 K NG AH 6295
...
```
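As a rough sketch (the function name and tuple layout below are illustrative, not from any particular library), such an item file can be parsed line by line:

```python
def parse_item_lines(lines):
    """Parse item-file lines into tuples:
    (file_id, onset, offset, phone, prev_phone, next_phone, speaker)."""
    entries = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the header line and blank lines
        fid, onset, offset, phone, prev_p, next_p, spk = line.split()
        entries.append(
            (fid, float(onset), float(offset), phone, prev_p, next_p, spk)
        )
    return entries
```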
We compute the representations using the full audio file, and we then slice to only get the frames that correspond to the unit of interest. Since the frames are downsampled, there is a decision to make on exactly which frame to keep and which to remove.
Let \(t_\text{on}, t_\text{off}\) be the start and end times of the triphone or phoneme considered, with \(t_\text{on} < t_\text{off}\). These correspond to the columns “onset” and “offset” of the item file.
Let \(\Delta t\) be the constant time step between consecutive features, for example 20 ms. The discrete times associated with the features are \(t_i = \frac{\Delta t}{2} + \Delta t \times i\).
The set of indices to slice, \(I\), is

\[ I = \left\{\, i \in \mathbb{N} \;\middle|\; t_\text{on} \le t_i \le t_\text{off} \,\right\} \]

We have, for any \(i \in \mathbb{N}\),

\[ t_\text{on} \le t_i \le t_\text{off} \iff \frac{t_\text{on}}{\Delta t} - \frac{1}{2} \le i \le \frac{t_\text{off}}{\Delta t} - \frac{1}{2} \]

so the beginning and end indices (both included) are:

\[ i_\text{begin} = \left\lceil \frac{t_\text{on}}{\Delta t} - \frac{1}{2} \right\rceil, \qquad i_\text{end} = \left\lfloor \frac{t_\text{off}}{\Delta t} - \frac{1}{2} \right\rfloor \]
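The index formulas above translate directly into code; a minimal sketch (the function name is illustrative):

```python
import math

def slice_indices(onset, offset, step):
    """Return the first and last frame indices (both included) whose
    frame-center time step/2 + step*i falls inside [onset, offset]."""
    i_begin = math.ceil(onset / step - 0.5)
    i_end = math.floor(offset / step - 0.5)
    return i_begin, i_end

# Example with a 20 ms step and the first line of the item file above:
i_begin, i_end = slice_indices(0.2925, 0.4725, 0.02)
# The slice keeping both endpoints is then features[i_begin : i_end + 1].
```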
In libri-light, because the features were sliced as features[start:end] (an exclusive upper bound), the last kept index was

\[ i_\text{end} - 1 = \left\lfloor \frac{t_\text{off}}{\Delta t} - \frac{1}{2} \right\rfloor - 1 \]

(see here).
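A toy illustration of this off-by-one (the feature values and index bounds here are made up for the example; Python slices exclude their upper bound):

```python
# Stand-in for one frame of features per index.
features = list(range(30))

# Inclusive indices as given by the formulas, e.g. i_begin = 15, i_end = 23.
i_begin, i_end = 15, 23

inclusive = features[i_begin : i_end + 1]  # keeps indices 15..23 (9 frames)
exclusive = features[i_begin : i_end]      # keeps indices 15..22: the last
                                           # kept index is i_end - 1
```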