Bottleneck

Extraction of bottleneck features from a speech signal

Audio --> BottleneckProcessor --> Features

This module provides the class BottleneckProcessor which computes stacked bottleneck features from audio signals (see [Silnova2018] and [Fer2017]). It is an adaptation of the original code released at [bottleneck-site]. Features are extracted with one of three provided pre-trained neural networks:

  • FisherMono: Trained on Fisher English (parts 1 and 2 datasets, about 2000 hours of clean telephone speech) with 120 phoneme states as output classes (40 phonemes, 3 states per phoneme).

  • FisherTri: Trained on the same datasets as FisherMono, with 2423 triphones as output classes.

  • BabelMulti: Trained on 17 languages from the IARPA [BABEL-project], with 3096 output classes (3 phoneme states per language, stacked together).

Examples

Compute bottleneck features on some speech using the multilingual network (BabelMulti):

>>> from shennong.audio import Audio
>>> from shennong.features.processor.bottleneck import BottleneckProcessor
>>> audio = Audio.load('./test/data/test.wav')
>>> processor = BottleneckProcessor(weights='BabelMulti')
>>> features = processor.process(audio)
>>> features.shape
(140, 80)

References

bottleneck-site

https://speech.fit.vutbr.cz/software/but-phonexia-bottleneck-feature-extractor

BABEL-project

https://www.iarpa.gov/index.php/research-programs/babel

Silnova2018

Anna Silnova, Pavel Matejka, Ondrej Glembek, Oldrich Plchot, Ondrej Novotny, Frantisek Grezl, Petr Schwarz, Lukas Burget, Jan “Honza” Cernocky, “BUT/Phonexia Bottleneck Feature Extractor”, Submitted to Odyssey: The Speaker and Language Recognition Workshop 2018

Fer2017

Fér Radek, Matějka Pavel, Grézl František, Plchot Oldřich, Veselý Karel and Černocký Jan. Multilingually Trained Bottleneck Features in Spoken Language Recognition. Computer Speech and Language. Amsterdam: Elsevier Science, 2017, vol. 2017, no. 46, pp. 252-267.

class shennong.features.processor.bottleneck.BottleneckProcessor(weights='BabelMulti', dither=0.1)[source]

Bases: shennong.features.processor.base.FeaturesProcessor

Bottleneck features from a pre-trained neural network

Parameters

weights ('BabelMulti', 'FisherMono' or 'FisherTri') – The pretrained weights to use for feature extraction

dither (float, optional) – Amount of dithering applied to the signal, 0.0 means no dither. Default to 0.1.

Raises
  • ValueError – If the weights are invalid

  • RuntimeError – If the weights file cannot be found (meaning shennong is not correctly installed on your system)

property name

Name of the processor

property dither

Amount of dithering

0.0 means no dither
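
The exact dithering formula used by shennong is not documented here; in general, dithering adds low-amplitude random noise to the waveform before feature extraction, which avoids numerical issues on digitally silent or constant signals. A minimal sketch of the idea (the noise distribution and scaling are assumptions for illustration, not shennong's implementation):

```python
import random

def apply_dither(samples, dither=0.1, seed=0):
    """Add uniform noise of amplitude `dither` to a list of samples.

    With dither=0.0 the signal is returned unchanged, matching the
    documented behaviour ("0.0 means no dither").
    """
    if dither == 0.0:
        return list(samples)
    rng = random.Random(seed)  # seeded for reproducibility
    return [s + dither * rng.uniform(-1.0, 1.0) for s in samples]
```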

property weights

The name of the pretrained weights used to extract the features

Must be ‘BabelMulti’, ‘FisherMono’ or ‘FisherTri’.

property ndims

The dimension of extracted frames

Cannot be tuned because the underlying neural networks are trained with this parameter.

property sample_rate

Processing sample frequency in Hertz

Cannot be tuned because the underlying neural networks are trained with this parameter.

property frame_length

The length of extracted frames (in seconds)

Cannot be tuned because the underlying neural networks are trained with this parameter.

property frame_shift

The time shift between two consecutive frames (in seconds)

Cannot be tuned because the underlying neural networks are trained with this parameter.
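
With the 25 ms frame length and 10 ms frame shift reported by process, a signal yields roughly 100 frames per second. A rough sketch of the frame count, working in integer milliseconds (the exact edge handling inside shennong may differ):

```python
def expected_nframes(duration_ms, frame_length_ms=25, frame_shift_ms=10):
    """Number of complete frames fitting in a signal of `duration_ms`."""
    if duration_ms < frame_length_ms:
        return 0  # the signal is shorter than a single frame
    return 1 + (duration_ms - frame_length_ms) // frame_shift_ms
```

One second of audio thus gives 98 complete frames, close to the 100 frames per second figure.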

get_params(deep=True)

Get parameters for this processor.

Parameters

deep (boolean, optional) – If True, will return the parameters for this processor and contained subobjects that are processors. Default to True.

Returns

params (mapping of string to any) – Parameter names mapped to their values.

get_properties()

Return the processors properties as a dictionary

process_all(signals, njobs=None)

Returns features processed from several input signals

This function processes the features in parallel jobs.

Parameters
  • signals (dict of Audio) – A dictionary of input audio signals to process features on, where the keys are item names and the values are audio signals.

  • njobs (int, optional) – The number of parallel jobs to run in background. Default to the number of CPU cores available on the machine.

Returns

features (FeaturesCollection) – The computed features on each input signal. The keys of output features are the keys of the input signals.

Raises

ValueError – If the njobs parameter is <= 0
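
process_all behaves like calling process on every item of the input dict, with the work spread over parallel jobs. The pattern can be sketched as follows, with a stand-in worker instead of a real BottleneckProcessor (which needs audio data and the pretrained weights); shennong's actual parallelization machinery may differ:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def process_one(signal):
    # stand-in for BottleneckProcessor.process: just return the
    # length of the "signal" so the example is self-contained
    return len(signal)

def process_all(signals, njobs=None):
    """Map process_one over a dict of signals in parallel jobs."""
    if njobs is None:
        njobs = os.cpu_count()  # documented default: number of CPU cores
    if njobs <= 0:
        raise ValueError('njobs must be a positive integer')
    keys = list(signals)
    with ThreadPoolExecutor(max_workers=njobs) as pool:
        results = list(pool.map(process_one, (signals[k] for k in keys)))
    # the keys of the output mirror the keys of the input signals
    return dict(zip(keys, results))
```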

set_params(**params)

Set the parameters of this processor.

Returns

self

Raises

ValueError – If any given parameter in params is invalid for the processor.
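
get_params and set_params follow the scikit-learn estimator convention: the parameters they expose are exactly the constructor arguments. A minimal sketch of that convention, mirroring the documented BottleneckProcessor interface (this is not shennong's code):

```python
class SketchProcessor:
    """Scikit-learn-style parameter handling, for illustration only."""

    def __init__(self, weights='BabelMulti', dither=0.1):
        self.weights = weights
        self.dither = dither

    def get_params(self, deep=True):
        # parameter names mapped to their current values
        return {'weights': self.weights, 'dither': self.dither}

    def set_params(self, **params):
        # reject any name that is not a known parameter
        known = self.get_params()
        for name, value in params.items():
            if name not in known:
                raise ValueError(f'invalid parameter: {name}')
            setattr(self, name, value)
        return self
```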

classmethod available_weights()[source]

Return the pretrained weights files as a dict (name -> file)

Returns

weight_files (dict) – A mapping ‘weights name’ -> ‘weights file’, where the files are absolute paths to compressed numpy arrays (.npz format). The ‘weights name’ is either BabelMulti, FisherMono or FisherTri.

Raises

RuntimeError – If the directory shennong/share/bottleneck is not found, or if all the weights files are missing in it.

process(signal)[source]

Computes bottleneck features on an audio signal

Use a pre-trained neural network to extract bottleneck features. Features have a frame shift of 10 ms and frame length of 25 ms.

Parameters

signal (Audio, shape = [nsamples, 1]) – The input audio signal to compute the features on, must be mono. The signal is up/down-sampled at 8 kHz during processing.

Returns

features (Features, shape = [nframes, 80]) – The computed bottleneck features have as many rows as there are frames (this depends on the signal duration; expect about 100 frames per second), each frame having 80 dimensions.

Raises

RuntimeError – If no speech is detected on the signal during the voice activity detection preprocessing step.
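
Since the networks were trained on 8 kHz telephone speech, any input signal is first resampled to 8 kHz. A naive linear-interpolation resampler illustrating the idea (shennong's actual resampling is certainly higher quality, e.g. with proper low-pass filtering):

```python
def resample(samples, rate_in, rate_out=8000):
    """Linearly interpolate `samples` from rate_in to rate_out Hz."""
    n_out = int(len(samples) * rate_out / rate_in)
    out = []
    for i in range(n_out):
        pos = i * rate_in / rate_out  # fractional position in the input
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```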