Bottleneck

Extraction of bottleneck features from a speech signal

Audio -> BottleneckProcessor -> Features

This module provides the class BottleneckProcessor, which computes stacked bottleneck features from audio signals (see [Silnova2018] and [Fer2017]). It is an adaptation of the original code released at [bottleneck-site]. Features are extracted with one of the three provided pre-trained neural networks:

  • FisherMono: Trained on Fisher English (parts 1 and 2, about 2000 hours of clean telephone speech) with 120 phoneme states as output classes (40 phonemes, 3 states for each phoneme).

  • FisherTri: Trained on the same datasets as FisherMono, with 2423 triphones as output classes.

  • BabelMulti: Trained on 17 languages from the IARPA [BABEL-project], with 3096 output classes (3 phoneme states for each language, stacked together).

Examples

Compute bottleneck features on some speech using the multilingual network (BabelMulti):

>>> from shennong.audio import Audio
>>> from shennong.processor.bottleneck import BottleneckProcessor
>>> audio = Audio.load('./test/data/test.wav')
>>> processor = BottleneckProcessor(weights='BabelMulti')
>>> features = processor.process(audio)
>>> features.shape
(140, 80)

References

bottleneck-site

https://speech.fit.vutbr.cz/software/but-phonexia-bottleneck-feature-extractor

BABEL-project

https://www.iarpa.gov/index.php/research-programs/babel

Silnova2018

Anna Silnova, Pavel Matejka, Ondrej Glembek, Oldrich Plchot, Ondrej Novotny, Frantisek Grezl, Petr Schwarz, Lukas Burget, Jan “Honza” Cernocky, “BUT/Phonexia Bottleneck Feature Extractor”, Submitted to Odyssey: The Speaker and Language Recognition Workshop 2018

Fer2017

Fér Radek, Matějka Pavel, Grézl František, Plchot Oldřich, Veselý Karel and Černocký Jan. Multilingually Trained Bottleneck Features in Spoken Language Recognition. Computer Speech and Language. Amsterdam: Elsevier Science, 2017, vol. 2017, no. 46, pp. 252-267.

class shennong.processor.bottleneck.BottleneckProcessor(weights='BabelMulti', dither=0.1)[source]

Bases: shennong.processor.base.FeaturesProcessor

Bottleneck features from a pre-trained neural network

Parameters

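  • weights ('BabelMulti', 'FisherMono' or 'FisherTri') – The pretrained weights to use for feature extraction. Defaults to 'BabelMulti'.

  • dither (float, optional) – Amount of dithering, 0.0 means no dither. Defaults to 0.1.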

Raises
  • ValueError – If the weights are invalid

  • RuntimeError – If the weights file cannot be found (meaning shennong is not correctly installed on your system)
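The check on the weights name can be illustrated with a small stand-alone sketch. It does not import shennong: VALID_WEIGHTS and check_weights are hypothetical names introduced here for illustration, and the error message is an assumption rather than the library's own wording.

```python
# Hypothetical stand-in for the validation described above; not shennong code.
VALID_WEIGHTS = ('BabelMulti', 'FisherMono', 'FisherTri')


def check_weights(name):
    """Return `name` if it is one of the three documented weights names.

    Raises ValueError otherwise, mirroring the documented behaviour.
    """
    if name not in VALID_WEIGHTS:
        raise ValueError(
            'invalid weights "{}", choose in {}'.format(
                name, ', '.join(VALID_WEIGHTS)))
    return name


check_weights('FisherTri')       # accepted, returns 'FisherTri'
try:
    check_weights('NotAModel')   # rejected
except ValueError as err:
    print(err)
```

Validating the name up front keeps a typo from surfacing later as a missing-file error.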

property name

Name of the processor

property dither

Amount of dithering

0.0 means no dither

property weights

The name of the pretrained weights used to extract the features

Must be ‘BabelMulti’, ‘FisherMono’ or ‘FisherTri’.

property ndims

The dimension of extracted frames

Cannot be tuned because the underlying neural networks are trained with this parameter.

property sample_rate

Processing sample frequency in Hertz

Cannot be tuned because the underlying neural networks are trained with this parameter.

property frame_length

The length of extracted frames (in seconds)

Cannot be tuned because the underlying neural networks are trained with this parameter.

property frame_shift

The time shift between two consecutive frames (in seconds)

Cannot be tuned because the underlying neural networks are trained with this parameter.

get_params(deep=True)

Get parameters for this processor.

Parameters

deep (boolean, optional) – If True, return the parameters of this processor and of the contained subobjects that are processors. Defaults to True.

Returns

params (mapping of string to any) – Parameter names mapped to their values.

get_properties(**kwargs)

Return the processor’s properties as a dictionary

property log

Processor logger

process_all(utterances, njobs=None, **kwargs)

Returns features processed from several input utterances

This function processes the features in parallel jobs.

Parameters
  • utterances (Utterances) – The utterances to compute features on.

  • njobs (int, optional) – The number of parallel jobs to run in the background. Defaults to the number of CPU cores available on the machine.

  • **kwargs (dict, optional) – Extra arguments to be forwarded to the process method. Keys must be the same as for utterances.

Returns

features (FeaturesCollection) – The computed features on each input signal. The keys of output features are the keys of the input utterances.

Raises

ValueError – If the njobs parameter is <= 0 or if an entry is missing in the optional kwargs.
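The contract described above (one result per input key, per-utterance extra arguments, ValueError on a bad njobs or a missing entry) can be sketched without shennong. This is not the library's implementation: process_all_sketch is a hypothetical name, utterances are modelled as a plain dict, and a thread pool stands in for whatever parallel backend the library actually uses.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def process_all_sketch(process, utterances, njobs=None, **kwargs):
    """Apply `process` to each utterance in parallel, keeping the input keys.

    `utterances` is a plain dict {name: signal}. Each extra keyword argument
    is a dict with the same keys, whose values are forwarded to `process`.
    """
    if njobs is None:
        njobs = os.cpu_count() or 1
    if njobs <= 0:
        raise ValueError('njobs must be strictly positive')

    # Every optional argument must provide one entry per utterance.
    for name, values in kwargs.items():
        missing = set(utterances) - set(values)
        if missing:
            raise ValueError(
                'missing entries in {}: {}'.format(name, sorted(missing)))

    def job(key):
        extra = {name: values[key] for name, values in kwargs.items()}
        return key, process(utterances[key], **extra)

    with ThreadPoolExecutor(max_workers=njobs) as pool:
        return dict(pool.map(job, utterances))


# Toy "processor": the length of each fake signal.
lengths = process_all_sketch(len, {'utt1': [0, 1, 2], 'utt2': [3, 4]}, njobs=2)
```

Here lengths maps 'utt1' to 3 and 'utt2' to 2, so the output keys are those of the input.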

set_logger(level, formatter='%(levelname)s - %(name)s - %(message)s')

Change level and/or format of the processor’s logger

Parameters
  • level (str) – The minimum log level handled by the logger (any message below this level will be ignored). Must be ‘debug’, ‘info’, ‘warning’ or ‘error’.

  • formatter (str, optional) – A string to format the log messages, see https://docs.python.org/3/library/logging.html#formatter-objects. By default, displays level, name and message. Use ‘%(asctime)s - %(levelname)s - %(name)s - %(message)s’ to also display the time.
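To see what the default format string produces, it can be applied to a hand-built record with the standard logging module alone. The logger name 'bottleneck' and the message are made up for this illustration. The processor's actual logger name may differ.

```python
import logging

DEFAULT_FORMAT = '%(levelname)s - %(name)s - %(message)s'

# A hand-built record, so the example does not depend on any configured logger.
record = logging.LogRecord(
    name='bottleneck', level=logging.INFO, pathname='example.py', lineno=1,
    msg='loaded %s', args=('BabelMulti',), exc_info=None)

line = logging.Formatter(DEFAULT_FORMAT).format(record)
print(line)  # INFO - bottleneck - loaded BabelMulti
```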

set_params(**params)

Set the parameters of this processor.

Returns

self

Raises

ValueError – If any given parameter in params is invalid for the processor.
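The get_params / set_params pair follows a familiar round-trip pattern: read the parameters as a mapping, change some, and rebuild an equivalent object from the mapping. The sketch below shows that pattern on a hypothetical ToyProcessor class that only stores the two constructor parameters. It is a stand-in for illustration, not the library's base class.

```python
class ToyProcessor:
    """Hypothetical stand-in holding the two documented constructor parameters."""

    def __init__(self, weights='BabelMulti', dither=0.1):
        self.weights = weights
        self.dither = dither

    def get_params(self, deep=True):
        # `deep` is kept for signature compatibility; there are no subobjects.
        return {'weights': self.weights, 'dither': self.dither}

    def set_params(self, **params):
        valid = self.get_params()
        for key, value in params.items():
            if key not in valid:
                raise ValueError('invalid parameter "{}"'.format(key))
            setattr(self, key, value)
        return self  # returning self allows chained calls


proc = ToyProcessor().set_params(dither=0.0)
clone = ToyProcessor(**proc.get_params())
```

Because set_params returns self, configuration and use can be chained on one line.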

classmethod available_weights()[source]

Return the pretrained weights files as a dict (name -> file)

Returns

weight_files (dict) – A mapping ‘weights name’ -> ‘weights file’, where the files are absolute paths to compressed numpy arrays (.npz format). The ‘weights name’ is either BabelMulti, FisherMono or FisherTri.

Raises

RuntimeError – If the directory shennong/share/bottleneck is not found, or if all the weights files are missing in it.

process(signal)[source]

Computes bottleneck features on an audio signal

Use a pre-trained neural network to extract bottleneck features. Features have a frame shift of 10 ms and frame length of 25 ms.

Parameters

signal (Audio, shape = [nsamples, 1]) – The input audio signal to compute the features on, must be mono. The signal is resampled to 8 kHz during processing.

Returns

features (Features, shape = [nframes, 80]) – The computed bottleneck features have as many rows as there are frames (this depends on the signal duration, expect about 100 frames per second), each frame with 80 dimensions.
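The "about 100 frames per second" figure follows from the 10 ms frame shift. A rough estimate of nframes can be sketched as below, assuming only complete 25 ms frames are kept and no padding is applied at the edges. That framing convention and the name approx_nframes are assumptions made for illustration, so the extractor's exact count may differ by a few frames.

```python
SAMPLE_RATE = 8000     # Hz, processing rate of the networks
FRAME_LENGTH = 0.025   # seconds
FRAME_SHIFT = 0.010    # seconds


def approx_nframes(duration):
    """Approximate number of frames for a signal of `duration` seconds.

    Assumes only complete frames are kept (no padding at the edges), which
    may not match the extractor exactly.
    """
    nsamples = int(round(duration * SAMPLE_RATE))
    frame_len = int(round(FRAME_LENGTH * SAMPLE_RATE))   # 200 samples
    shift = int(round(FRAME_SHIFT * SAMPLE_RATE))        # 80 samples
    if nsamples < frame_len:
        return 0
    return 1 + (nsamples - frame_len) // shift


shape = (approx_nframes(1.0), 80)
print(shape)  # (98, 80)
```

Under this convention one second of audio gives 98 frames, close to the nominal 100 per second.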

Raises

RuntimeError – If no speech is detected on the signal during the voice activity detection preprocessing step.