Introduction to speech features

Implemented models

Note

All the models ported from Kaldi use pykaldi; see [Can2018].

  • The following feature extraction models are implemented in shennong; the detailed documentation is available here:

    Features          Implementation
    ----------------  ---------------------------
    Spectrogram       from Kaldi
    Filterbank        from Kaldi
    MFCC              from Kaldi
    PLP               from Kaldi
    RASTA-PLP         from rastapy, after labrosa
    Bottleneck        from BUTspeech
    One Hot Vectors   shennong
    Pitch             from Kaldi
    Energy            from Kaldi
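As an illustration of what the simplest of these features contain, here is a minimal numpy sketch of a log power spectrogram (framing, windowing, FFT). This is a conceptual example, not shennong's Kaldi-backed implementation; frame and shift durations follow the usual 25 ms / 10 ms convention:

```python
import numpy as np

def spectrogram(signal, sample_rate, frame_ms=25, shift_ms=10):
    """Log power spectrogram: one frame per row, one frequency bin per column."""
    frame = int(sample_rate * frame_ms / 1000)   # samples per frame
    shift = int(sample_rate * shift_ms / 1000)   # samples between frame starts
    nframes = 1 + (len(signal) - frame) // shift
    window = np.hanning(frame)
    frames = np.stack([
        signal[i * shift:i * shift + frame] * window for i in range(nframes)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + 1e-10)  # small floor avoids log(0)

# one second of noise at 16 kHz -> 98 frames of 201 frequency bins
feats = spectrogram(np.random.randn(16000), 16000)
```

The other models (filterbank, MFCC, PLP, ...) build further transforms on top of this time-frequency representation.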

  • The following post-processing pipelines are implemented in shennong; the detailed documentation is available here:

    Post-processing                      Implementation
    -----------------------------------  --------------
    Delta / Delta-delta                  from Kaldi
    Mean Variance Normalization (CMVN)   from Kaldi
    Voice Activity Detection             from Kaldi
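To make the first two post-processors concrete, here is a minimal numpy sketch of delta computation (a regression over neighbouring frames, in the Kaldi style) and mean-variance normalization. Again this is an illustrative simplification, not shennong's implementation:

```python
import numpy as np

def delta(feats, window=2):
    """Delta features: regression over +/- `window` neighbouring frames."""
    denom = 2 * sum(i * i for i in range(1, window + 1))
    padded = np.pad(feats, ((window, window), (0, 0)), mode='edge')
    return sum(i * (padded[window + i:len(feats) + window + i]
                    - padded[window - i:len(feats) + window - i])
               for i in range(1, window + 1)) / denom

def cmvn(feats):
    """Mean-variance normalization: zero mean, unit variance per dimension."""
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-10)

feats = np.random.randn(100, 13)             # e.g. 100 frames of 13 MFCCs
full = np.hstack([feats, delta(feats), delta(delta(feats))])
normed = cmvn(full)                          # shape (100, 39)
```

Stacking the raw features with their deltas and delta-deltas, then normalizing, is exactly the kind of pipeline evaluated in the comparison below.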

  • Here is an illustration of the features (without post-processing) computed on the example wav file provided as shennong/test/data/test.wav. This picture can be reproduced with the example script shennong/examples/plot_features.py:

    _images/features.png

Features comparison

This section details a phone discrimination task based on the features available in shennong. It reproduces track 1 of the Zero Speech Challenge 2015, using the same datasets and setup. The recipe to replicate this experiment is available at shennong/examples/features_abx.

  • Setup:

    • Two languages are tested: English and Xitsonga.

    • The considered feature extraction algorithms are:

      • bottleneck

      • filterbanks

      • MFCC

      • PLP

      • RASTA PLP

      • spectrogram

    • Each is tested with 3 distinct parameter sets:

      • only: just the raw features,

      • nocmvn: raw features with delta, delta-delta and pitch,

      • full: raw features with CMVN normalization by speaker, with delta, delta-delta and pitch.

    • The considered ABX tasks are the same as in the ZRC2015 track 1, namely a phone discrimination task within and across speakers.

    • This gives us 2 corpora * 2 tasks * 6 features * 3 parameter sets = 72 scores.
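The decision behind each ABX score can be sketched as follows: given a token A from one phone category, a token B from another, and a token X from A's category, the features succeed when X is closer to A than to B; the error rate is the percentage of triplets where this fails. The actual evaluation uses DTW-based distances over frame sequences; the sketch below simplifies to fixed-size vectors and cosine distance, and the category data is toy data, not the challenge corpora:

```python
import numpy as np

def cosine_distance(a, b):
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def abx_error_rate(triplets):
    """Percentage of (A, B, X) triplets where X, drawn from A's phone
    category, is NOT closer to A than to B."""
    errors = sum(
        cosine_distance(x, a) >= cosine_distance(x, b)
        for a, b, x in triplets)
    return 100 * errors / len(triplets)

# toy example: two well-separated "phone categories" in feature space
rng = np.random.default_rng(0)
cat1 = rng.normal(loc=(1, 0), scale=0.1, size=(10, 2))
cat2 = rng.normal(loc=(0, 1), scale=0.1, size=(10, 2))
triplets = [(cat1[i], cat2[i], cat1[i + 1]) for i in range(9)]
error = abx_error_rate(triplets)  # near 0 for well-separated categories
```

Good features put tokens of the same phone close together, which drives this error rate down; that is what the tables below measure.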

Note

The results below are ABX error rates on phone discrimination (given in %).

  • Results on English:

    features      across                  within
                  only   nocmvn  full     only   nocmvn  full
    -----------   ----   ------  ----     ----   ------  ----
    bottleneck    12.5   12.5    12.5     8.5    8.5     8.6
    filterbank    24.9   22.1    26.5     12.8   11.6    18.2
    mfcc          27.2   26.4    24.0     13.0   12.5    12.4
    plp           28.0   26.6    23.8     12.5   12.4    12.0
    rastaplp      26.8   30.0    22.7     18.1   23.0    13.1
    spectrogram   30.3   27.9    29.7     16.7   15.2    20.2

  • Results on Xitsonga:

    features      across                  within
                  only   nocmvn  full     only   nocmvn  full
    -----------   ----   ------  ----     ----   ------  ----
    bottleneck    9.5    9.6     9.6      6.9    7.0     7.3
    filterbank    28.1   25.1    21.5     13.8   11.7    15.2
    mfcc          33.6   32.8    26.0     17.1   16.2    14.6
    plp           33.5   31.2    26.2     16.2   14.6    14.0
    rastaplp      27.1   25.6    21.3     19.5   20.1    12.6
    spectrogram   34.6   32.0    26.5     19.2   16.8    19.2

  • Comparison with the ZRC2015 baseline (on MFCC only), see [Versteegh2015]:

                    English           Xitsonga
                    across   within   across   within
    -------------   ------   ------   ------   ------
    ZRC2015         28.1     15.6     33.8     19.1
    shennong-only   27.2     13.0     33.6     17.1
    shennong-full   24.0     12.4     26.0     14.6


Versteegh2015

Maarten Versteegh, Roland Thiollière, Thomas Schatz, Xuan-Nga Cao, Xavier Anguera, Aren Jansen, and Emmanuel Dupoux. "The Zero Resource Speech Challenge 2015". In INTERSPEECH 2015. 2015.

Can2018

Dogan Can, Victor R. Martinez, Pavlos Papadopoulos, and Shrikanth S. Narayanan. "PyKaldi: A Python Wrapper for Kaldi". In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.