Utterances

Provides the Utterance and Utterances classes

An utterance corresponds to a sentence, or a speech segment, that is processed individually by an extraction pipeline. An utterance is defined in one of the following formats:

  • 2-tuple: <utterance-id> <audio-file>

  • 3-tuple: <utterance-id> <audio-file> <speaker-id>

  • 4-tuple: <utterance-id> <audio-file> <tstart> <tstop>

  • 5-tuple: <utterance-id> <audio-file> <speaker-id> <tstart> <tstop>
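As an illustrative sketch (not shennong's own code), the format of an utterance can be inferred from the number of whitespace-separated fields. The helper name `parse_utterance_line` and the returned dictionary layout are hypothetical; the field names mirror the formats listed above, and the sketch assumes that paths contain no whitespace.

```python
def parse_utterance_line(line):
    """Split one utterance definition into named fields (hypothetical helper)."""
    # assumption: fields are separated by whitespace, paths contain no spaces
    fields = line.split()
    if len(fields) not in (2, 3, 4, 5):
        raise ValueError(f'expected 2, 3, 4 or 5 fields, got {len(fields)}')

    utt = {'name': fields[0], 'audio_file': fields[1],
           'speaker': None, 'tstart': None, 'tstop': None}

    # 3 or 5 fields: the third field is the speaker identifier
    if len(fields) in (3, 5):
        utt['speaker'] = fields[2]

    # 4 or 5 fields: the last two fields are the onset and offset times
    if len(fields) >= 4:
        utt['tstart'] = float(fields[-2])
        utt['tstop'] = float(fields[-1])

    return utt


example = parse_utterance_line('utt1 /path/to/audio.wav spk1 0.5 2.0')
```

Here `example['speaker']` is `'spk1'` and the timestamps are converted to floats, while a 2-field line leaves speaker and timestamps as None.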

Note

Most shennong components (processors and post-processors) work directly on individual audio files. Utterances are used when training a VtlnProcessor or when extracting features with shennong.pipeline.

shennong.utterances.VALID_FORMATS = {1: '<utterance-id> <audio-file>', 2: '<utterance-id> <audio-file> <speaker-id>', 3: '<utterance-id> <audio-file> <tstart> <tstop>', 4: '<utterance-id> <audio-file> <speaker-id> <tstart> <tstop>'}

The valid formats for an utterance, as detailed above

class shennong.utterances.Utterance(*args)[source]

Bases: object

Manage a single utterance

The Utterance class manages an individual utterance and gives access to its components: name, speaker and corresponding audio segment. The utterance must follow one of the formats defined above.

Parameters

*args – The number of arguments must be 2, 3, 4 or 5. It defines the utterance format and the meaning of each positional argument (see VALID_FORMATS)

Raises

ValueError – If the number of arguments is not 2, 3, 4 or 5, or if the utterance cannot be created from them (for instance when the audio file is not readable)

property format

The utterance format code

property name

The utterance name, or <utterance-id>

property audio_file

The audio file attached to the utterance

property speaker

The utterance speaker, or None if there is no speaker information

property tstart

The utterance onset time in the audio file, or None

property tstop

The utterance offset time in the audio file, or None

property duration

The utterance duration in seconds

load_audio()[source]

Returns the utterance’s Audio data

class shennong.utterances.Utterances(utterances)[source]

Bases: object

Manages a collection of Utterance.

The Utterances class manages a collection of utterances. It allows iterating over the utterances by name or by speaker, and generating sub-utterances fitted to a particular duration.

The following conditions apply:

  • All utterances in the collection must have the same format

  • All utterances must have a unique name

Parameters

utterances (list of Utterance or list of tuples) – The utterances to be stored

Raises

ValueError – If the utterances cannot be created because of the above conditions, or because one of the utterances is not valid
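The two conditions above can be checked on plain tuples with a few lines of Python. This is a minimal sketch under stated assumptions, not the library's implementation: `check_collection` is a hypothetical helper, and the format is approximated by the number of fields in each tuple.

```python
def check_collection(utterances):
    """Validate a list of utterance tuples (hypothetical helper)."""
    # same format: every tuple must have the same number of fields
    lengths = {len(utt) for utt in utterances}
    if len(lengths) > 1:
        raise ValueError('utterances do not all have the same format')

    # unique names: the first field of each tuple is the <utterance-id>
    names = [utt[0] for utt in utterances]
    if len(names) != len(set(names)):
        raise ValueError('utterance names are not unique')


# passes silently: same format, distinct names
check_collection([('utt1', 'a.wav', 'spk1'), ('utt2', 'b.wav', 'spk2')])
```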

classmethod load(filename)[source]

Returns utterances loaded from a file

All the lines in the file must conform to the same utterance format.

Parameters

filename (str) – The file to load

Raises

ValueError – If the file is not found, if the utterances do not all have the same format, if the <utterance-id> are not all unique, or if some utterances are not valid (audio file not found, for instance).

save(filename)[source]

Writes the utterances to a file

Parameters

filename (str) – The filename to write

format(type=<class 'int'>)[source]

Returns the utterances format

Parameters

type (optional, int or str) – When int, returns the format code; when str, returns its string representation

Raises

ValueError – If type is not int or str

has_speakers()[source]

Returns True if there is speaker information, False otherwise

by_speaker()[source]

Returns a dictionary of utterances indexed by speaker

The returned dictionary has speakers as keys and lists of Utterance as values.

Raises

ValueError – If there is no speaker information
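The shape of the returned dictionary can be illustrated on plain tuples. This is a sketch only: `group_by_speaker` is a hypothetical helper working on (name, audio file, speaker) tuples, whereas the actual method returns lists of Utterance instances.

```python
from collections import defaultdict


def group_by_speaker(utterances):
    """Group (name, audio_file, speaker) tuples by speaker (hypothetical helper)."""
    groups = defaultdict(list)
    for utt in utterances:
        # the third field is assumed to be the speaker identifier
        groups[utt[2]].append(utt)
    return dict(groups)


utts = [('u1', 'a.wav', 'spk1'), ('u2', 'b.wav', 'spk2'), ('u3', 'c.wav', 'spk1')]
groups = group_by_speaker(utts)
```

In this sketch `groups['spk1']` holds the tuples for `u1` and `u3`, in their original order.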

by_name()[source]

Returns a dictionary of utterances indexed by name

The returned dictionary has utterance names as keys and Utterance instances as values.

duration()[source]

Returns the total duration of the utterances in seconds

fit_to_duration(duration, truncate=False, shuffle=False)[source]

Returns a subset of utterances, keeping duration seconds per speaker

Parameters

  • duration (float) – The duration to keep per speaker, in seconds

  • truncate (bool, optional) – When True, truncate the total duration to what is available if there is not enough data. When False, raise an error if the duration cannot be returned for a speaker. Defaults to False.

  • shuffle (bool, optional) – When True, shuffle the utterances before extracting segments. When False, take them in order. Defaults to False.

Returns

utterances (Utterances) – The utterance segments fitting the given duration for each speaker

Raises

ValueError – If the utterances are not defined by speakers, if duration is not strictly positive or, when truncate is False, if a speaker does not have enough data to build segments.
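The behavior described by the parameters can be sketched in pure Python. This is one plausible reading of the documentation, not the library's algorithm: the function below is a hypothetical stand-in that works on (name, speaker, tstart, tstop) tuples, keeps utterances in order until the target is reached and shortens the last one. The real method may select and cut segments differently.

```python
import random


def fit_to_duration(utterances, duration, truncate=False, shuffle=False):
    """Keep `duration` seconds per speaker (illustrative sketch only).

    `utterances` is a list of (name, speaker, tstart, tstop) tuples,
    a hypothetical layout chosen for this example.
    """
    if duration <= 0:
        raise ValueError('duration must be strictly positive')

    # group the utterances by speaker, preserving their order
    by_speaker = {}
    for utt in utterances:
        by_speaker.setdefault(utt[1], []).append(utt)

    kept = []
    for speaker, utts in by_speaker.items():
        available = sum(tstop - tstart for _, _, tstart, tstop in utts)
        if available < duration and not truncate:
            raise ValueError(
                f'speaker {speaker}: only {available}s of data available')

        if shuffle:
            utts = random.sample(utts, len(utts))

        # accumulate segments until the requested duration is reached,
        # shortening the last segment if needed
        remaining = duration
        for name, spk, tstart, tstop in utts:
            if remaining <= 0:
                break
            length = min(tstop - tstart, remaining)
            kept.append((name, spk, tstart, tstart + length))
            remaining -= length

    return kept


utts = [('u1', 's1', 0.0, 2.0), ('u2', 's1', 0.0, 2.0), ('u3', 's2', 0.0, 1.0)]
segments = fit_to_duration(utts, 3.0, truncate=True)
```

With `truncate=True`, speaker `s2` contributes only the single second it has. With the default `truncate=False`, the same call raises a ValueError because `s2` lacks data.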