Provides the Utterance and Utterances classes

An utterance corresponds to a sentence, or a speech segment, that is processed individually by an extraction pipeline. An utterance is defined in one of the following formats:

  • 2-tuple: <utterance-id> <audio-file>

  • 3-tuple: <utterance-id> <audio-file> <speaker-id>

  • 4-tuple: <utterance-id> <audio-file> <tstart> <tstop>

  • 5-tuple: <utterance-id> <audio-file> <speaker-id> <tstart> <tstop>
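Because each format has a distinct number of fields, the field count alone is enough to tell them apart. The sketch below illustrates this by mapping a whitespace-separated definition to named fields. The helper `parse_utterance` and the field names are hypothetical and not part of shennong.

```python
# Field names for each supported number of fields (2 to 5).
FIELDS_BY_LENGTH = {
    2: ("name", "audio_file"),
    3: ("name", "audio_file", "speaker"),
    4: ("name", "audio_file", "tstart", "tstop"),
    5: ("name", "audio_file", "speaker", "tstart", "tstop"),
}


def parse_utterance(line):
    """Split an utterance definition into a dict of named fields (hypothetical helper)."""
    values = line.split()
    try:
        names = FIELDS_BY_LENGTH[len(values)]
    except KeyError:
        raise ValueError(f"expected 2 to 5 fields, got {len(values)}: {line!r}") from None
    fields = dict(zip(names, values))
    # Onset and offset times are converted to numbers.
    for key in ("tstart", "tstop"):
        if key in fields:
            fields[key] = float(fields[key])
    return fields
```

For example, `parse_utterance("utt1 a.wav spk1 0.5 2.0")` yields the five named fields, with `tstart=0.5` and `tstop=2.0`.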


Most shennong components (processors and post-processors) work directly on individual audio files. Utterances are used when training a VtlnProcessor or when extracting features with a shennong.pipeline.

shennong.utterances.VALID_FORMATS = {1: '<utterance-id> <audio-file>', 2: '<utterance-id> <audio-file> <speaker-id>', 3: '<utterance-id> <audio-file> <tstart> <tstop>', 4: '<utterance-id> <audio-file> <speaker-id> <tstart> <tstop>'}

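The valid formats for an utterance, as detailed above. The keys are format codes (1 to 4), one less than the number of fields in the corresponding tuple (2 to 5).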

class shennong.utterances.Utterance(*args)

Bases: object

Manage a single utterance

The Utterance class manages individual utterances and gives access to their components: name, speaker and corresponding audio segment. The utterance must be defined in one of the formats listed above.


Parameters:

*args – There must be 2, 3, 4 or 5 arguments. The number of arguments defines the utterance format and the meaning of each positional argument (see VALID_FORMATS)


Raises:

ValueError – If the number of arguments is not 2, 3, 4 or 5, or if the utterance cannot be created from them (for instance, if the audio file is not readable)
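The argument checks can be sketched as follows. This is a sketch under assumptions, not shennong's code: `check_utterance_args` is a hypothetical helper, it does not verify that the audio file is readable, and requiring `0 <= tstart < tstop` is our own guess at what makes a segment valid.

```python
def check_utterance_args(*args):
    """Return the format code (1 to 4) of an utterance definition, or raise ValueError.

    Hypothetical helper: the real class also checks the audio file, skipped here.
    """
    if not 2 <= len(args) <= 5:
        raise ValueError(f"an utterance needs 2, 3, 4 or 5 arguments, got {len(args)}")
    # The format code is the number of arguments minus one (see VALID_FORMATS).
    code = len(args) - 1
    if code in (3, 4):
        # The last two arguments are the onset and offset times;
        # float() itself raises ValueError on non-numeric strings.
        tstart, tstop = float(args[-2]), float(args[-1])
        if not 0 <= tstart < tstop:
            raise ValueError(f"invalid segment: tstart={tstart}, tstop={tstop}")
    return code
```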

property format

The utterance format code

property name

The utterance name, or <utterance-id>

property audio_file

The audio file attached to the utterance

property speaker

The utterance speaker, or None if there is no speaker information

property tstart

The utterance onset time in the audio file, or None

property tstop

The utterance offset time in the audio file, or None

property duration

The utterance duration in seconds
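For the segmented formats, the duration follows directly from the onset and offset times. For the whole-file formats, the sketch below assumes the duration is that of the whole audio file and reads it with the standard `wave` module, so it only handles WAV files. The helper `utterance_duration` is hypothetical and does not reflect how shennong reads audio.

```python
import wave


def utterance_duration(audio_file, tstart=None, tstop=None):
    """Duration in seconds of an utterance (hypothetical helper, WAV files only)."""
    if tstart is not None and tstop is not None:
        # Segmented formats (4- and 5-tuples): the segment length.
        return tstop - tstart
    # Whole-file formats (2- and 3-tuples): assumed to be the file duration.
    with wave.open(audio_file, "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```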


Returns the utterance’s Audio data

class shennong.utterances.Utterances(utterances)

Bases: object

Manage a collection of utterances

The Utterances class manages a collection of utterances and allows iterating over them by name or by speaker, as well as generating sub-utterances that fit a given duration.

The following conditions apply:

  • All utterances in the collection must have the same format

  • All utterances must have a unique name


Parameters:

utterances (list of Utterance or list of tuples) – The utterances to be stored


Raises:

ValueError – If the utterances cannot be created because of the above conditions, or because one of the utterances is not valid
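The two collection conditions can be checked as below. This is a minimal sketch assuming utterances are given as plain tuples rather than Utterance instances; `check_collection` is a hypothetical helper and does not validate the individual utterances.

```python
from collections import Counter


def check_collection(utterances):
    """Raise ValueError unless the utterance tuples meet both conditions (hypothetical helper)."""
    # Same format: every tuple has the same number of fields.
    lengths = sorted({len(utt) for utt in utterances})
    if len(lengths) > 1:
        raise ValueError(f"utterances have mixed formats (tuple lengths {lengths})")
    # Unique names: each <utterance-id> appears only once.
    counts = Counter(utt[0] for utt in utterances)
    duplicates = sorted(name for name, count in counts.items() if count > 1)
    if duplicates:
        raise ValueError(f"duplicate utterance names: {duplicates}")
```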

classmethod load(filename)

Returns utterances loaded from a file

All the lines in the file must conform to the same utterance format.


Parameters:

filename (str) – The file to load


Raises:

ValueError – If the file is not found, if the utterances do not all have the same format, if the <utterance-id> are not all unique, or if some utterances are not valid (for instance, if an audio file is not found).
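A plain-text file with one utterance per line can be read along these lines. This sketch assumes blank lines are ignored and keeps every field as a string; `load_utterances` is a hypothetical helper that does not check audio files.

```python
import os


def load_utterances(filename):
    """Read utterance definitions from a text file, one per line (hypothetical helper)."""
    if not os.path.isfile(filename):
        raise ValueError(f"file not found: {filename}")
    with open(filename, encoding="utf-8") as stream:
        # Assumption: blank lines are skipped.
        utterances = [tuple(line.split()) for line in stream if line.strip()]
    if len({len(utt) for utt in utterances}) > 1:
        raise ValueError("all the utterances must have the same format")
    if len({utt[0] for utt in utterances}) != len(utterances):
        raise ValueError("all the <utterance-id> must be unique")
    return utterances
```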


Writes the utterances to a file


Parameters:

filename (str) – The filename to write
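Writing is the inverse of loading: one space-separated line per utterance. The sketch below assumes this same plain-text layout; `save_utterances` is a hypothetical helper operating on tuples.

```python
def save_utterances(utterances, filename):
    """Write utterance tuples to a text file, one space-separated line each (hypothetical helper)."""
    with open(filename, "w", encoding="utf-8") as stream:
        for utt in utterances:
            stream.write(" ".join(str(field) for field in utt) + "\n")
```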

format(type=int)

Returns the utterances format


Parameters:

type (int or str, optional) – When int, returns the format code; when str, returns its string representation. Defaults to int.


Raises:

ValueError – If type is not int or str
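Because all utterances in a collection share one format, the format can be read off any single utterance. This sketch reuses the VALID_FORMATS mapping quoted above and assumes tuples as input; `collection_format` is a hypothetical helper whose `type` argument mirrors the documented signature (and so shadows the builtin inside the function).

```python
VALID_FORMATS = {
    1: "<utterance-id> <audio-file>",
    2: "<utterance-id> <audio-file> <speaker-id>",
    3: "<utterance-id> <audio-file> <tstart> <tstop>",
    4: "<utterance-id> <audio-file> <speaker-id> <tstart> <tstop>",
}


def collection_format(utterances, type=int):
    """Return the format code or its string representation (hypothetical helper)."""
    # All utterances share the same format, so the first one is representative.
    code = len(utterances[0]) - 1
    if type is int:
        return code
    if type is str:
        return VALID_FORMATS[code]
    raise ValueError(f"type must be int or str, got {type!r}")
```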


Returns True if there is speaker information, False otherwise


Returns a dictionary of utterances indexed by speaker

The returned dictionary has speakers as keys and lists of Utterance as values.


Raises:

ValueError – If there is no speaker information
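Speaker information is only present in the 3- and 5-tuple formats, where it is the third field. The sketch below shows the speaker check and the grouping on plain tuples; `has_speakers` and `by_speaker` are hypothetical helpers, not shennong methods.

```python
from collections import defaultdict


def has_speakers(utterances):
    """True if the utterance tuples carry a <speaker-id> (hypothetical helper)."""
    return bool(utterances) and len(utterances[0]) in (3, 5)


def by_speaker(utterances):
    """Group utterance tuples into {speaker: [utterances]} (hypothetical helper)."""
    if not has_speakers(utterances):
        raise ValueError("utterances have no speaker information")
    groups = defaultdict(list)
    for utt in utterances:
        # The <speaker-id> is the third field in both speaker formats.
        groups[utt[2]].append(utt)
    return dict(groups)
```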


Returns a dictionary of utterances indexed by name

The returned dictionary has utterance names as keys and Utterance instances as values.
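Since names are unique within a collection, indexing by name is a plain dictionary comprehension with no risk of overwriting. A minimal sketch on tuples; `by_name` is a hypothetical helper.

```python
def by_name(utterances):
    """Index utterance tuples by their <utterance-id> (hypothetical helper)."""
    # Names are unique in a collection, so no entry is overwritten.
    return {utt[0]: utt for utt in utterances}
```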


Returns the total duration of the utterances in seconds
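For the segmented formats, the total duration is the sum of the segment lengths. The sketch below covers only those formats (whole-file formats would need the audio file durations); `total_duration` is a hypothetical helper.

```python
def total_duration(utterances):
    """Sum the durations, in seconds, of segmented utterance tuples (hypothetical helper)."""
    # Assumes the 4- or 5-tuple formats, whose last two fields are tstart and tstop.
    return sum(float(utt[-1]) - float(utt[-2]) for utt in utterances)
```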

fit_to_duration(duration, truncate=False, shuffle=False)

Returns a subset of utterances, keeping duration seconds per speaker

Parameters:

  • duration (float) – The duration to keep per speaker, in seconds

  • truncate (bool, optional) – When True, truncate the total duration to the available one if a speaker does not have enough data. When False, raise an error if the requested duration cannot be reached for a speaker. Defaults to False.

  • shuffle (bool, optional) – When True, shuffle the utterances before extracting segments. When False, take them in order. Defaults to False.


Returns:

utterances (Utterances) – The utterance segments fitting the given duration for each speaker


Raises:

ValueError – If the utterances are not defined by speakers, if duration is not strictly positive, or, when truncate is False, if a speaker does not have enough data to build segments.
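The selection can be approximated by a greedy pass over each speaker's utterances. This is our own sketch, not shennong's implementation: it accepts only plain (name, audio_file, speaker, tstart, tstop) tuples, cuts the last utterance taken for a speaker so the total matches `duration`, and adds a hypothetical `seed` parameter (absent from the documented signature) to make shuffling reproducible.

```python
import random
from collections import defaultdict


def fit_to_duration(utterances, duration, truncate=False, shuffle=False, seed=None):
    """Keep `duration` seconds of speech per speaker (hypothetical greedy sketch)."""
    if any(len(utt) != 5 for utt in utterances):
        raise ValueError("this sketch requires the 5-tuple format (speaker and times)")
    if duration <= 0:
        raise ValueError(f"duration must be strictly positive, got {duration}")
    per_speaker = defaultdict(list)
    for utt in utterances:
        per_speaker[utt[2]].append(utt)

    rng = random.Random(seed)  # `seed` is a hypothetical addition
    fitted = []
    for speaker, utts in per_speaker.items():
        if shuffle:
            utts = rng.sample(utts, len(utts))  # shuffled copy
        remaining = duration
        for name, audio_file, spk, tstart, tstop in utts:
            if remaining <= 0:
                break
            length = tstop - tstart
            if length <= remaining:
                fitted.append((name, audio_file, spk, tstart, tstop))
                remaining -= length
            else:
                # Cut the utterance so that it fills exactly the remaining time.
                fitted.append((name, audio_file, spk, tstart, tstart + remaining))
                remaining = 0.0
        if remaining > 0 and not truncate:
            raise ValueError(f"speaker {speaker} lacks {remaining}s of the requested {duration}s")
    return fitted
```

With `truncate=True`, a speaker with too little speech simply keeps everything available; with the default `truncate=False`, the same situation raises ValueError, matching the parameter descriptions above.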