Command line interface¶
Shennong provides the speech-features program, available from the command line.
It is a wrapper on the feature extraction pipeline module.
The program is largely self-documented; the documentation below simply reproduces the content of its --help messages.
Speech features extraction pipeline from raw wav files
The general extraction pipeline is as follows:

<input-config> ----->|--> features --> CMVN --> delta -->|
                     |--> (VTLN) ----------------------->|--> <output-file>
<input-utterances> ->|--> pitch ------------------------>|
Feature extraction involves three steps:
Configure an extraction pipeline. For example, the following defines a full pipeline for MFCC extraction (with CMVN, but without delta, pitch or VTLN) and writes it to the file config.yaml:
speech-features config mfcc --no-pitch --no-delta --no-vtln -o config.yaml
You can then edit the file config.yaml to modify the parameters.
Define a list of utterances on which to extract features (along with optional speaker or timestamp specifications). For example, you can write an utterances.txt file with the following content (see below for details on the format):
utterance1 /path/to/wav1.wav speaker1
utterance2 /path/to/wav2.wav speaker1
utterance3 /path/to/wav3.wav speaker2
Apply the configured pipeline to the defined utterances. For example, this computes the features using 4 parallel subprocesses and saves them to a file in the numpy .npz format:
speech-features extract --njobs 4 config.yaml utterances.txt features.npz
Definition of <input-config>¶
<input-config> is a configuration file in YAML format defining
all the parameters of the extraction pipeline, including main features
extraction (spectrogram, mfcc, plp, rastaplp, filterbank or bottleneck
features) and post-processing (CMVN, delta and pitch extraction).
You can generate a configuration template using speech-features config. It will write a YAML file with default parameters that you
can edit. See speech-features config --help for a description of the available parameters.
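The generated template can also be edited programmatically rather than by hand. A minimal sketch with PyYAML, where the key names (mfcc, num_ceps, frame_shift) are illustrative assumptions standing in for whatever the real template contains:

```python
import yaml  # PyYAML

# Stand-in for a generated config.yaml; the real key names come from
# `speech-features config` and may differ (these are assumptions).
template = """
mfcc:
  num_ceps: 13
  frame_shift: 0.01
"""

config = yaml.safe_load(template)
config["mfcc"]["num_ceps"] = 20  # tweak a parameter

# Serialize back to YAML, ready to be written to config.yaml.
print(yaml.safe_dump(config))
```

The same load/edit/dump cycle applies to any parameter in the template.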
Definition of <input-utterances>¶
<input-utterances> is a text file indexing the utterances on
which to apply the extraction pipeline. Each line of the file defines
a single utterance (or sentence, or speech fragment) and can have one
of the following formats:
1. <wav-file>

   The simplest format, with one wav file per line. Each wav file is considered as a single utterance. Each wav file must be unique.

2. <utterance-id> <wav-file>

   Give a name to each utterance; identifiers must be unique.

3. <utterance-id> <wav-file> <speaker-id>

   Specify a speaker for each utterance. This is required if you are using CMVN normalization per speaker.

4. <utterance-id> <wav-file> <tstart> <tstop>

   Each wav file contains several utterances; the utterance boundaries are defined by the start and stop timestamps within the wav file (given in seconds).

5. <utterance-id> <wav-file> <speaker-id> <tstart> <tstop>

   Combination of formats 3 and 4: several utterances per wav file, with speaker identification.
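Each format can be recognized from the number of whitespace-separated fields on a line. As an illustration, here is a standalone sketch of such a parser in Python (not shennong's own code, just a plain reading of the formats above):

```python
# Sketch of a parser for the <input-utterances> formats described above.
# Not shennong's own parser: this is a plain illustration of the field layout.

def parse_utterance(line):
    """Return a dict describing one utterance line."""
    fields = line.split()
    if len(fields) == 1:       # <wav-file>
        return {"wav": fields[0]}
    if len(fields) == 2:       # <utterance-id> <wav-file>
        return {"id": fields[0], "wav": fields[1]}
    if len(fields) == 3:       # <utterance-id> <wav-file> <speaker-id>
        return {"id": fields[0], "wav": fields[1], "speaker": fields[2]}
    if len(fields) == 4:       # <utterance-id> <wav-file> <tstart> <tstop>
        return {"id": fields[0], "wav": fields[1],
                "tstart": float(fields[2]), "tstop": float(fields[3])}
    if len(fields) == 5:       # <utterance-id> <wav-file> <speaker-id> <tstart> <tstop>
        return {"id": fields[0], "wav": fields[1], "speaker": fields[2],
                "tstart": float(fields[3]), "tstop": float(fields[4])}
    raise ValueError(f"unsupported utterance format: {line!r}")

utt = parse_utterance("utterance1 /path/to/wav1.wav speaker1")
print(utt)  # {'id': 'utterance1', 'wav': '/path/to/wav1.wav', 'speaker': 'speaker1'}
```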
Definition of <output-file>¶
<output-file> will store the extracted features. The underlying
format is a dictionary of utterances. Each utterance's features are
stored as a matrix [nframes * ndims], along with timestamps and properties.
Several file formats are supported; the format is guessed from the file extension specified on the command line:

- .h5f: first choice, fast and efficient (h5features format)
- .npz: second choice, standard numpy format
- .pkl: very fast, standard Python pickle format
- .mat: compatibility with Matlab
- .ark: compatibility with Kaldi
- .json: very slow, for manual introspection only
More info on file formats is available in the online documentation, at https://coml.lscp.ens.fr/shennong/python/features.html#save-load-features.
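As an illustration of the dictionary-of-matrices layout described above, the following sketch writes and reads such a structure with numpy's .npz format. The utterance keys, matrix shapes, and file name are toy values, and the real files written by speech-features also carry timestamps and properties not shown here:

```python
import numpy as np

# Toy stand-in for extracted features: one [nframes x ndims] matrix
# per utterance (keys and shapes are illustrative, not shennong's output).
features = {
    "utterance1": np.zeros((120, 13)),  # 120 frames, 13 dims
    "utterance2": np.ones((80, 13)),
}

# Save the dictionary: each key becomes an array in the archive.
np.savez("features_demo.npz", **features)

# Load it back and inspect the stored matrices.
loaded = np.load("features_demo.npz")
print(sorted(loaded.files))        # ['utterance1', 'utterance2']
print(loaded["utterance1"].shape)  # (120, 13)
```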