Command line interface¶
Shennong provides the speech-features program, available from the command line.
It is a wrapper on the feature extraction pipeline module.
The program is largely self-documented; the documentation below simply reproduces the content of its --help messages.
Speech features extraction pipeline from raw wav files
The general extraction pipeline is as follows:

<input-config> ----->|--> features --> CMVN --> delta -->|
                     |--> (VTLN) ----------------------->|--> <output-file>
<input-utterances> ->|--> pitch ------------------------>|
Feature extraction involves three steps:
Configure an extraction pipeline. For example, the following defines a full pipeline for MFCC extraction (with CMVN, but without delta, pitch or VTLN) and writes it to the file config.yaml:
speech-features config mfcc --no-pitch --no-delta --no-vtln -o config.yaml
You can then edit the file config.yaml to modify the parameters.
Define a list of utterances on which to extract features (along with optional speaker or timestamp specifications). For example, you can write an utterances.txt file with the following content (see below for details on the format):
utterance1 /path/to/wav1.wav speaker1
utterance2 /path/to/wav2.wav speaker1
utterance3 /path/to/wav3.wav speaker2
Apply the configured pipeline to the defined utterances. For example, this computes the features using 4 parallel subprocesses and saves them to a file in the numpy .npz format:
speech-features extract --njobs 4 config.yaml utterances.txt features.npz
Definition of <input-config>¶
<input-config> is a configuration file in YAML format defining
all the parameters of the extraction pipeline, including main features
extraction (spectrogram, mfcc, plp, rastaplp, filterbank or bottleneck
features) and post-processing (CMVN, delta and pitch extraction).
You can generate a configuration template using speech-features config. It will write a YAML file with default parameters that you
can edit. See speech-features config --help for a description of the available parameters.
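The generated template can also be edited programmatically rather than by hand. A minimal sketch with PyYAML, where the key names (mfcc, num_ceps, frame_shift) are illustrative assumptions standing in for whatever the real template contains:

```python
import yaml  # PyYAML

# Stand-in for a generated config.yaml; the real key names come from
# `speech-features config` and may differ (these are assumptions).
template = """
mfcc:
  num_ceps: 13
  frame_shift: 0.01
"""

config = yaml.safe_load(template)
config["mfcc"]["num_ceps"] = 20  # tweak a parameter

# Serialize back to YAML, ready to be written to config.yaml.
print(yaml.safe_dump(config))
```

The same load/edit/dump cycle applies to any parameter in the template.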
Definition of <input-utterances>¶
<input-utterances> is a text file indexing the utterances on
which to apply the extraction pipeline. Each line of the file defines
a single utterance (or sentence, or speech fragment) and can have one
of the following formats:
1. <wav-file>

   The simplest format, with one wav file per line. Each wav file is considered as a single utterance. Each wav file must be unique.

2. <utterance-id> <wav-file>

   Give a name to each utterance; identifiers must be unique.

3. <utterance-id> <wav-file> <speaker-id>

   Specify a speaker for each utterance. This is required if you are using CMVN normalization per speaker.

4. <utterance-id> <wav-file> <tstart> <tstop>

   Each wav file contains several utterances; the utterance boundaries are defined by the start and stop timestamps within the wav file (given in seconds).

5. <utterance-id> <wav-file> <speaker-id> <tstart> <tstop>

   Combination of formats 3 and 4: several utterances per wav file, with speaker identification.
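Each format can be recognized from the number of whitespace-separated fields on a line. As an illustration, here is a standalone sketch of such a parser in Python (not shennong's own code, just a plain reading of the formats above):

```python
# Sketch of a parser for the <input-utterances> formats described above.
# Not shennong's own parser: this is a plain illustration of the field layout.

def parse_utterance(line):
    """Return a dict describing one utterance line."""
    fields = line.split()
    if len(fields) == 1:       # <wav-file>
        return {"wav": fields[0]}
    if len(fields) == 2:       # <utterance-id> <wav-file>
        return {"id": fields[0], "wav": fields[1]}
    if len(fields) == 3:       # <utterance-id> <wav-file> <speaker-id>
        return {"id": fields[0], "wav": fields[1], "speaker": fields[2]}
    if len(fields) == 4:       # <utterance-id> <wav-file> <tstart> <tstop>
        return {"id": fields[0], "wav": fields[1],
                "tstart": float(fields[2]), "tstop": float(fields[3])}
    if len(fields) == 5:       # <utterance-id> <wav-file> <speaker-id> <tstart> <tstop>
        return {"id": fields[0], "wav": fields[1], "speaker": fields[2],
                "tstart": float(fields[3]), "tstop": float(fields[4])}
    raise ValueError(f"unsupported utterance format: {line!r}")

utt = parse_utterance("utterance1 /path/to/wav1.wav speaker1")
print(utt)  # {'id': 'utterance1', 'wav': '/path/to/wav1.wav', 'speaker': 'speaker1'}
```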
Definition of <output-file>¶
<output-file> will store the extracted features. The underlying
format is a dictionary of utterances. Each utterance's features are
stored as a matrix [nframes * ndims], along with timestamps and properties.
Several file formats are supported; the format is guessed from the file extension specified on the command line:

- .h5f: first choice, fast and efficient (h5features format)
- .npz: second choice, standard numpy format
- .pkl: very fast, standard Python pickle format
- .mat: compatibility with Matlab
- .ark: compatibility with Kaldi
- .json: very slow, for manual introspection only
More info on file formats is available in the online documentation, at https://coml.lscp.ens.fr/shennong/python/features.html#save-load-features.
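As an illustration of the dictionary-of-matrices layout described above, the following sketch writes and reads such a structure with numpy's .npz format. The utterance keys, matrix shapes, and file name are toy values, and the real files written by speech-features also carry timestamps and properties not shown here:

```python
import numpy as np

# Toy stand-in for extracted features: one [nframes x ndims] matrix
# per utterance (keys and shapes are illustrative, not shennong's output).
features = {
    "utterance1": np.zeros((120, 13)),  # 120 frames, 13 dims
    "utterance2": np.ones((80, 13)),
}

# Save the dictionary: each key becomes an array in the archive.
np.savez("features_demo.npz", **features)

# Load it back and inspect the stored matrices.
loaded = np.load("features_demo.npz")
print(sorted(loaded.files))        # ['utterance1', 'utterance2']
print(loaded["utterance1"].shape)  # (120, 13)
```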