Command line interface


Shennong provides the speech-features program, available from the command line.

  • It is a wrapper around the features extraction pipeline module.

  • The program is largely self-documented: the documentation below is simply the output of speech-features --help

Speech features extraction pipeline from raw wav files

The general extraction pipeline is as follows:

   <input-config>     |--> features --> CMVN --> delta -->|
       and         -->|     (VTLN)                        |--> <output-file>
<input-utterances>    |---------------> pitch ----------->|

Simple example

Features extraction basically involves three steps:

  1. Configure an extraction pipeline. For example, this defines a full pipeline for MFCC extraction (with CMVN, but without delta, pitch or VTLN) and writes it to the file config.yaml:

    speech-features config mfcc --no-pitch --no-delta --no-vtln -o config.yaml

    You can then edit the file config.yaml to modify the parameters.

  2. Define a list of utterances on which to extract features (along with optional speaker or timestamp specifications). For example, you can write an utterances.txt file with the following content (see below for details on the format):

    utterance1 /path/to/wav1.wav speaker1
    utterance2 /path/to/wav2.wav speaker1
    utterance3 /path/to/wav3.wav speaker2

  3. Apply the configured pipeline to the defined utterances. For example, this computes the features using 4 parallel subprocesses and saves them to a file in the numpy .npz format:

    speech-features extract --njobs 4 config.yaml utterances.txt features.npz
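
The three steps above can also be scripted. The following is a minimal Python sketch, not part of Shennong itself: the helper names write_utterances, build_commands and run_pipeline are hypothetical, and the only command-line arguments used are the ones shown in steps 1 and 3. It assumes speech-features is installed and on the PATH when run_pipeline is called.

```python
import subprocess
from pathlib import Path


def write_utterances(path, utterances):
    """Write one '<utterance-id> <wav-file> <speaker-id>' line per utterance."""
    lines = [" ".join(fields) for fields in utterances]
    Path(path).write_text("\n".join(lines) + "\n")


def build_commands(config="config.yaml", utterances="utterances.txt",
                   output="features.npz", njobs=4):
    """Return the two speech-features invocations of steps 1 and 3."""
    configure = ["speech-features", "config", "mfcc",
                 "--no-pitch", "--no-delta", "--no-vtln", "-o", config]
    extract = ["speech-features", "extract", "--njobs", str(njobs),
               config, utterances, output]
    return configure, extract


def run_pipeline(utterances):
    """Run steps 1 to 3 in order; stops at the first failing command."""
    write_utterances("utterances.txt", utterances)
    for command in build_commands():
        subprocess.run(command, check=True)
```

Calling run_pipeline([("utterance1", "/path/to/wav1.wav", "speaker1")]) would write utterances.txt and then run both commands with their default file names. Note that this skips the manual editing of config.yaml mentioned in step 1.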

Definition of <input-config>

The <input-config> is a configuration file in YAML format defining all the parameters of the extraction pipeline, including main features extraction (spectrogram, mfcc, plp, rastaplp, filterbank or bottleneck features) and post-processing (CMVN, delta and pitch extraction).

You can generate a configuration template using speech-features config. It writes a YAML file with default parameters that you can then edit. See speech-features config --help for a description of the available options.

Definition of <input-utterances>

The <input-utterances> is a text file indexing the utterances on which to apply the extraction pipeline. Each line of the file defines a single utterance (or sentence, or speech fragment) and can have one of the following formats:

  1. <wav-file>

    The simplest format, with a wav file per line. Each wav is considered as a single utterance. Each wav file must be unique.

  2. <utterance-id> <wav-file>

    Gives a name to each utterance. Identifiers must be unique.

  3. <utterance-id> <wav-file> <speaker-id>

    Specifies a speaker for each utterance. This is required if you are using per-speaker CMVN.

  4. <utterance-id> <wav-file> <tstart> <tstop>

    Each wav contains several utterances. The utterance boundaries are defined by the start and stop timestamps within the wav file (given in seconds).

  5. <utterance-id> <wav-file> <speaker-id> <tstart> <tstop>

    Combination of formats 3 and 4: several utterances per wav, with speaker identification.
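
Since the five formats differ in their number of fields, a line can be classified by counting its fields. The following Python sketch illustrates the idea; it is not Shennong's own parser. The names Utterance, parse_utterance_line and parse_utterances are hypothetical, and the sketch assumes whitespace-separated fields (so no spaces in paths) and leaves the utterance identifier unset for format 1.

```python
from typing import NamedTuple, Optional


class Utterance(NamedTuple):
    utterance_id: Optional[str]      # None for format 1 (assumption)
    wav_file: str
    speaker_id: Optional[str] = None
    tstart: Optional[float] = None   # seconds
    tstop: Optional[float] = None    # seconds


def parse_utterance_line(line):
    """Parse one line, choosing among formats 1 to 5 by field count."""
    fields = line.split()
    if len(fields) == 1:
        return Utterance(None, fields[0])
    if len(fields) == 2:
        return Utterance(fields[0], fields[1])
    if len(fields) == 3:
        return Utterance(fields[0], fields[1], speaker_id=fields[2])
    if len(fields) == 4:
        return Utterance(fields[0], fields[1],
                         tstart=float(fields[2]), tstop=float(fields[3]))
    if len(fields) == 5:
        return Utterance(fields[0], fields[1], fields[2],
                         float(fields[3]), float(fields[4]))
    raise ValueError(f"expected 1 to 5 fields, got {len(fields)}: {line!r}")


def parse_utterances(text):
    """Parse all non-empty lines and reject duplicated identifiers."""
    utterances, seen = [], set()
    for line in text.splitlines():
        if not line.strip():
            continue
        utt = parse_utterance_line(line)
        # Format 1 has no identifier: the wav file itself must be unique.
        key = utt.utterance_id if utt.utterance_id is not None else utt.wav_file
        if key in seen:
            raise ValueError(f"duplicated utterance: {key}")
        seen.add(key)
        utterances.append(utt)
    return utterances
```

For instance, parse_utterance_line("utterance1 /path/to/wav1.wav speaker1") yields an Utterance with speaker_id set to "speaker1" and no timestamps.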

Definition of <output-file>

The <output-file> will store the extracted features. The underlying format is a dictionary of utterances. Each utterance's features are stored as a matrix of shape [nframes x ndims], along with timestamps and metadata.
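
To make this layout concrete, here is a small Python sketch of such a dictionary. It is an illustration only and does not reproduce Shennong's own classes: the UtteranceFeatures name and its fields are hypothetical, the matrix is a plain list of rows, and one timestamp per frame is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class UtteranceFeatures:
    data: list                     # nframes rows, each with ndims values
    times: list                    # one timestamp per frame (assumption)
    properties: dict = field(default_factory=dict)  # free-form metadata

    def shape(self):
        """Return (nframes, ndims) of the features matrix."""
        nframes = len(self.data)
        ndims = len(self.data[0]) if nframes else 0
        return nframes, ndims

    def is_valid(self):
        """Check that rows have equal length and each frame has a timestamp."""
        nframes, ndims = self.shape()
        return (len(self.times) == nframes
                and all(len(row) == ndims for row in self.data))


# A dictionary of utterances, indexed by utterance identifier.
features = {
    "utterance1": UtteranceFeatures(
        data=[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
        times=[0.00, 0.01, 0.02],
        properties={"speaker": "speaker1"}),
}
```

Here features["utterance1"].shape() gives 3 frames of 2 dimensions each.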

Several file formats are supported. The format is guessed from the file extension given on the command line:

  File format    Use case

                 First choice, fast and efficient
                 Second choice, standard numpy format
                 Very fast, standard Python format
                 Compatibility with Matlab
                 Compatibility with Kaldi
                 Very slow, for manual introspection only

More information on file formats is available in the online documentation, at