Bottleneck¶
Extraction of bottleneck features from a speech signal
This module provides the class
BottleneckProcessor
which computes stacked bottleneck features from audio signals (see
[Silnova2018] and [Fer2017]). This is an adaptation of the original
code released on [bottleneck-site]. Features are extracted from one
of the three provided pre-trained neural networks:
FisherMono: Trained on Fisher English (parts 1 and 2 datasets, about 2000 hours of clean telephone speech) with 120 phoneme states as output classes (40 phonemes, 3 states per phoneme).
FisherTri: Trained on the same datasets as FisherMono, with 2423 triphones as output classes.
BabelMulti: Trained on 17 languages from the IARPA [BABEL-project], with 3096 output classes (3 phoneme states per language, stacked together).
Examples
Compute bottleneck features on some speech using the multilingual network (BabelMulti):
>>> from shennong.audio import Audio
>>> from shennong.processor.bottleneck import BottleneckProcessor
>>> audio = Audio.load('./test/data/test.wav')
>>> processor = BottleneckProcessor(weights='BabelMulti')
>>> features = processor.process(audio)
>>> features.shape
(140, 80)
References
- bottleneck-site
https://speech.fit.vutbr.cz/software/but-phonexia-bottleneck-feature-extractor
- BABEL-project
- Silnova2018
Anna Silnova, Pavel Matejka, Ondrej Glembek, Oldrich Plchot, Ondrej Novotny, Frantisek Grezl, Petr Schwarz, Lukas Burget, Jan “Honza” Cernocky, “BUT/Phonexia Bottleneck Feature Extractor”, Submitted to Odyssey: The Speaker and Language Recognition Workshop 2018
- Fer2017
Fér Radek, Matějka Pavel, Grézl František, Plchot Oldřich, Veselý Karel and Černocký Jan. Multilingually Trained Bottleneck Features in Spoken Language Recognition. Computer Speech and Language. Amsterdam: Elsevier Science, 2017, vol. 2017, no. 46, pp. 252-267.
-
class
shennong.processor.bottleneck.
BottleneckProcessor
(weights='BabelMulti', dither=0.1)[source]¶ Bases:
shennong.processor.base.FeaturesProcessor
Bottleneck features from a pre-trained neural network
- Parameters
weights ('BabelMulti', 'FisherMono' or 'FisherTri') – The pretrained weights to use for features extraction
dither (float, optional) – Amount of dithering applied to the input signal, default to 0.1. Use 0.0 for no dither
- Raises
ValueError – If the weights are invalid
RuntimeError – If the weights file cannot be found (meaning shennong is not correctly installed on your system)
-
property
name
¶ Name of the processor
-
property
dither
¶ Amount of dithering
0.0 means no dither
-
property
weights
¶ The name of the pretrained weights used to extract the features
Must be ‘BabelMulti’, ‘FisherMono’ or ‘FisherTri’.
-
property
ndims
¶ The dimension of extracted frames
Cannot be tuned because the underlying neural networks are trained with this parameter.
-
property
sample_rate
¶ Processing sample frequency in Hertz
Cannot be tuned because the underlying neural networks are trained with this parameter.
-
property
frame_length
¶ The length of extracted frames (in seconds)
Cannot be tuned because the underlying neural networks are trained with this parameter.
-
property
frame_shift
¶ The time shift between two consecutive frames (in seconds)
Cannot be tuned because the underlying neural networks are trained with this parameter.
-
get_params
(deep=True)¶ Get parameters for this processor.
- Parameters
deep (boolean, optional) – If True, will return the parameters for this processor and contained subobjects that are processors. Defaults to True.
- Returns
params (mapping of string to any) – Parameter names mapped to their values.
-
get_properties
(**kwargs)¶ Return the processors properties as a dictionary
-
property
log
¶ Processor logger
-
process_all
(utterances, njobs=None, **kwargs)¶ Returns features processed from several input utterances
This function processes the features in parallel jobs.
- Parameters
utterances (Utterances) – The utterances on which to process features.
njobs (int, optional) – The number of parallel jobs to run in the background. Defaults to the number of CPU cores available on the machine.
**kwargs (dict, optional) – Extra arguments to be forwarded to the process method. Keys must be the same as for utterances.
- Returns
features (
FeaturesCollection
) – The computed features on each input signal. The keys of the output features are the keys of the input utterances.
- Raises
ValueError – If the njobs parameter is <= 0 or if an entry is missing in the optional kwargs.
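process_all maps a per-utterance computation over parallel jobs and keys the output by utterance name. A stdlib-only sketch of that pattern (the per-utterance work below is a placeholder standing in for the real network forward pass, not the shennong implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def fake_process(signal):
    # placeholder for BottleneckProcessor.process: returns a frame
    # count instead of real (nframes, 80) features
    return len(signal) // 80

def process_all(utterances, njobs=4):
    """Map fake_process over parallel jobs, keyed by utterance name."""
    if njobs <= 0:
        raise ValueError('njobs must be > 0')
    with ThreadPoolExecutor(max_workers=njobs) as pool:
        results = pool.map(fake_process, utterances.values())
    # the keys of the output are the keys of the input utterances
    return dict(zip(utterances.keys(), results))

utterances = {'utt1': [0.0] * 800, 'utt2': [0.0] * 1600}
print(process_all(utterances))  # {'utt1': 10, 'utt2': 20}
```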
-
set_logger
(level, formatter='%(levelname)s - %(name)s - %(message)s')¶ Change level and/or format of the processor’s logger
- Parameters
level (str) – The minimum log level handled by the logger (any message above this level will be ignored). Must be ‘debug’, ‘info’, ‘warning’ or ‘error’.
formatter (str, optional) – A string to format the log messages, see https://docs.python.org/3/library/logging.html#formatter-objects. By default displays the level, name and message. Use ‘%(asctime)s - %(levelname)s - %(name)s - %(message)s’ to display time, level, name and message.
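The formatter strings follow the standard Python logging syntax. A minimal stdlib-only illustration of the default format documented above (the logger name is arbitrary, not tied to the shennong processor):

```python
import logging

# a logger using the documented default format
logger = logging.getLogger('bottleneck')
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    '%(levelname)s - %(name)s - %(message)s'))
logger.addHandler(handler)
logger.setLevel(logging.WARNING)  # messages below 'warning' are ignored

logger.info('ignored')            # below the minimum level, not shown
logger.warning('features ready')  # prints: WARNING - bottleneck - features ready
```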
-
set_params
(**params)¶ Set the parameters of this processor.
- Returns
self
- Raises
ValueError – If any given parameter in
params
is invalid for the processor.
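get_params and set_params follow the scikit-learn estimator convention, so a processor's configuration can be saved and restored as a plain dict. A sketch of that round-trip with a hypothetical stand-in class (DummyProcessor is illustrative, not the shennong implementation):

```python
class DummyProcessor:
    """Illustrative stand-in following the get/set params convention."""
    def __init__(self, weights='BabelMulti', dither=0.1):
        self.weights = weights
        self.dither = dither

    def get_params(self, deep=True):
        # parameter names mapped to their current values
        return {'weights': self.weights, 'dither': self.dither}

    def set_params(self, **params):
        for name, value in params.items():
            if name not in self.get_params():
                raise ValueError(f'invalid parameter: {name}')
            setattr(self, name, value)
        return self  # returns self, as documented

proc = DummyProcessor()
saved = proc.get_params()             # {'weights': 'BabelMulti', 'dither': 0.1}
proc.set_params(weights='FisherTri')  # update a parameter in place
restored = DummyProcessor(**saved)    # rebuild from the saved dict
```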
-
classmethod
available_weights
()[source]¶ Return the pretrained weights files as a dict (name -> file)
- Returns
weight_files (dict) – A mapping ‘weights name’ -> ‘weights file’, where the files are absolute paths to compressed numpy arrays (.npz format). The ‘weights name’ is either BabelMulti, FisherMono or FisherTri.
- Raises
RuntimeError – If the directory shennong/share/bottleneck is not found, or if all the weights files are missing in it.
-
process
(signal)[source]¶ Computes bottleneck features on an audio signal
Use a pre-trained neural network to extract bottleneck features. Features have a frame shift of 10 ms and frame length of 25 ms.
- Parameters
signal (Audio, shape = [nsamples, 1]) – The input audio signal to compute the features on, must be mono. The signal is up/down-sampled at 8 kHz during processing.
- Returns
features (Features, shape = [nframes, 80]) – The computed bottleneck features have as many rows as there are frames (depending on the signal duration, expect about 100 frames per second), each frame with 80 dimensions.
- Raises
RuntimeError – If no speech is detected on the signal during the voice activity detection preprocessing step.
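As a sanity check, the number of output frames can be estimated from the documented 25 ms frame length and 10 ms frame shift (the function below is illustrative arithmetic, not part of the shennong API):

```python
def expected_nframes(duration, frame_length=0.025, frame_shift=0.010):
    """Number of full analysis frames fitting in `duration` seconds."""
    if duration < frame_length:
        return 0
    return 1 + int((duration - frame_length) / frame_shift)

# a signal of about 1.42 s yields 140 frames, consistent with the
# (140, 80) shape in the example above
print(expected_nframes(1.42))  # -> 140
```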