Command line interface¶
Note
Shennong provides the speech-features program, available from the command line. It is a wrapper around the features extraction pipeline module. The program is largely self-documented; the documentation below is simply the content of speech-features --help.
Speech features extraction pipeline from raw audio files
The general extraction pipeline is as follows:

    <input-config>       |--> features --> CMVN --> delta -->|
          and         -->|          (VTLN)                   |--> <output-file>
    <input-utterances>   |---------------> pitch ----------->|
Simple example¶
Features extraction basically involves three steps:

1. Configure an extraction pipeline. For example, this defines a full pipeline for MFCC extraction (with CMVN, but without delta, pitch nor VTLN) and writes it to the file config.yaml:

       speech-features config mfcc --no-pitch --no-delta --no-vtln -o config.yaml

   You can then edit the file config.yaml to modify the parameters.

2. Define a list of utterances on which to extract features (along with optional speaker or timestamp specifications). For example, you can write an utterances.txt file with the following content (see below for details on the format):

       utterance1 /path/to/audio1.wav speaker1
       utterance2 /path/to/audio2.wav speaker1
       utterance3 /path/to/audio2.wav speaker2

3. Apply the configured pipeline to the defined utterances. For example, this computes the features using 4 parallel subprocesses and saves them to a file in the numpy .npz format:

       speech-features extract --njobs 4 config.yaml utterances.txt features.npz
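The utterances file from step 2 is plain text, so it can easily be generated programmatically. The following helper is an illustrative sketch (not part of Shennong): it writes one line per (audio file, speaker) pair in the three-column format shown above.

```python
# Illustrative helper (not part of Shennong): generate an utterances file
# in the "<utterance-id> <audio-file> <speaker-id>" format from a list of
# (audio file, speaker) pairs.
from pathlib import Path


def write_utterances(pairs, output="utterances.txt"):
    """Write one '<utterance-id> <audio-file> <speaker-id>' line per pair."""
    lines = [
        f"utterance{i} {audio} {speaker}"
        for i, (audio, speaker) in enumerate(pairs, start=1)
    ]
    Path(output).write_text("\n".join(lines) + "\n")
    return lines


lines = write_utterances(
    [("/path/to/audio1.wav", "speaker1"),
     ("/path/to/audio2.wav", "speaker1")])
```

The generated file can then be passed directly to speech-features extract as the <input-utterances> argument.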
Definition of <input-config>¶
The <input-config> is a configuration file in YAML format defining all the parameters of the extraction pipeline, including the main features extraction (spectrogram, mfcc, plp, rastaplp, filterbank or bottleneck features) and post-processing (CMVN, delta and pitch extraction).

You can generate a configuration template using speech-features config. It writes a YAML file with default parameters that you can edit. See speech-features config --help for a description of the available options.
Definition of <input-utterances>¶
The <input-utterances> is a text file indexing the utterances on which to apply the extraction pipeline. Each line of the file defines a single utterance (or sentence, or speech fragment) and can have one of the following formats:

1. <utterance-id> <audio-file>

   The simplest format. Gives a name to each utterance; identifiers must be unique. Each entire audio file is considered a single utterance.

2. <utterance-id> <audio-file> <speaker-id>

   Specifies a speaker for each utterance. This is required if you are using per-speaker CMVN or VTLN normalization.

3. <utterance-id> <audio-file> <tstart> <tstop>

   Each audio file contains several utterances; the utterance boundaries are defined by the start and stop timestamps within the audio file (given in seconds).

4. <utterance-id> <audio-file> <speaker-id> <tstart> <tstop>

   Combination of formats 2 and 3: several utterances per audio file, with speaker identification.
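Because the four formats differ only in their number of fields, a line can be classified unambiguously by splitting on whitespace. The sketch below is illustrative (not Shennong's actual parser) and shows how each format maps to its fields:

```python
# Illustrative sketch (not Shennong's actual parser): classify a line of
# <input-utterances> into one of the four supported formats, based solely
# on the number of whitespace-separated fields.
def parse_utterance(line):
    fields = line.split()
    if len(fields) == 2:  # <utterance-id> <audio-file>
        utt, audio = fields
        return {"utterance": utt, "audio": audio}
    if len(fields) == 3:  # ... <speaker-id>
        utt, audio, speaker = fields
        return {"utterance": utt, "audio": audio, "speaker": speaker}
    if len(fields) == 4:  # ... <tstart> <tstop>
        utt, audio, tstart, tstop = fields
        return {"utterance": utt, "audio": audio,
                "tstart": float(tstart), "tstop": float(tstop)}
    if len(fields) == 5:  # ... <speaker-id> <tstart> <tstop>
        utt, audio, speaker, tstart, tstop = fields
        return {"utterance": utt, "audio": audio, "speaker": speaker,
                "tstart": float(tstart), "tstop": float(tstop)}
    raise ValueError(f"invalid utterance line: {line!r}")


entry = parse_utterance("utterance1 /path/to/audio1.wav speaker1 1.2 3.4")
```

Note that this field-count rule is why the formats cannot be mixed arbitrarily: a 4-field line is always interpreted as timestamps, never as two extra identifiers.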
Definition of <output-file>¶
The <output-file> will store the extracted features. The underlying format is a dictionary of utterances. Each utterance's features are stored as a [nframes * ndims] matrix, along with timestamps and metadata.
Several file formats are supported; the format is guessed from the file extension given on the command line:

    File format   Extension   Use case
    -----------   ---------   --------------------------------------------
    pickle        .pkl        Very fast, standard Python format
    h5features    .h5f        Fast and efficient for very big datasets
    numpy         .npz        Standard numpy format
    matlab        .mat        Compatibility with Matlab
    kaldi         .ark        Compatibility with Kaldi
    CSV           <folder>    Very slow, raw text, one utterance per file
More information on the file formats is available in the online documentation.
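To make the "dictionary of utterances" layout concrete, the sketch below round-trips such a structure through the pickle (.pkl) format listed above. The key names and nesting are assumptions for illustration only, not Shennong's exact on-disk schema:

```python
# Illustrative sketch of the output layout: a dictionary mapping each
# utterance to a [nframes * ndims] feature matrix plus per-frame timestamps.
# The key names ("features", "times") are assumptions for illustration,
# not Shennong's exact on-disk schema.
import pickle

features = {
    "utterance1": {
        "features": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # 3 frames * 2 dims
        "times": [0.0, 0.01, 0.02],                         # timestamps (s)
    },
}

# Round-trip through the pickle (.pkl) format listed in the table above.
blob = pickle.dumps(features)
restored = pickle.loads(blob)
```

Whatever the chosen extension, the same logical structure is preserved: one entry per utterance-id, each carrying its feature matrix and timestamps.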