Voice Activity Detection¶

Compute Voice Activity Detection (VAD) on features log-energy

Features –> VadPostProcessor –> Features

Compute voice-activity detection for speech features using the Kaldi implementation see [kaldi-vad]: The output is, for each input frame, 1 if we judge the frame as voiced, 0 otherwise. There are no continuity constraints.

This method is a very simple energy-based method which only looks at the first coefficient of the input features, which is assumed to be a log-energy or something similar. If working from the raw signal, extract the energy using EnergyProcessor.

A cutoff is set, we use a formula of the general type:

$\textrm{cutoff} = 5.0 + 0.5 * (\textrm{average log} - \textrm{energy}),$

and for each frame the decision is based on the proportion of frames in a context window around the current frame, which are above this cutoff.

Note

This code is geared toward speaker-id applications and is not suitable for automatic speech recognition (ASR) because it makes independent decisions for each frame without imposing any notion of continuity.

Examples

>>> import numpy as np
>>> from shennong.audio import Audio
>>> from shennong.processor.mfcc import MfccProcessor
>>> from shennong.postprocessor.vad import VadPostProcessor
>>> audio = Audio.load('./test/data/test.wav')
>>> mfcc = MfccProcessor().process(audio)

Computes the voice activity detection on the extracted MFCCs:

>>> processor = VadPostProcessor()
>>> vad = processor.process(mfcc)

For each frames of the MFCCs, vad is 1 if detected as a voiced frame, 0 otherwise:

>>> nframes = mfcc.shape[0]
>>> vad.shape == (nframes, 1)
True
>>> nvoiced = sum(vad.data[vad.data == 1])
>>> print('{} voiced frames out of {}'.format(nvoiced, nframes))
119 voiced frames out of 140

References

kaldi-vad: https://kaldi-asr.org/doc/voice-activity-detection_8h.html

class shennong.postprocessor.vad.VadPostProcessor(energy_threshold=5.0, energy_mean_scale=0.5, frames_context=0, proportion_threshold=0.6)[source]¶

Bases: shennong.postprocessor.base.FeaturesPostProcessor

Computes VAD on speech features

property name¶: Name of the processor

property energy_threshold¶

Constant term in energy threshold for MFCC0 for VAD

See also energy_mean_scale()

get_params(deep=True)¶

Get parameters for this processor.

Parameters: deep (boolean, optional) – If True, will return the parameters for this processor and contained subobjects that are processors. Default to True.
Returns: params (mapping of string to any) – Parameter names mapped to their values.

get_properties(features)¶: Return the processors properties as a dictionary

property log¶: Processor logger

process_all(utterances, njobs=None, **kwargs)¶

Returns features processed from several input utterances

This function processes the features in parallel jobs.

Parameters

utterances (:class`~shennong.uttterances.Utterances`) – The utterances on which to process features on.
njobs (int, optional) – The number of parallel jobs to run in background. Default to the number of CPU cores available on the machine.
**kwargs (dict, optional) – Extra arguments to be forwarded to the process method. Keys must be the same as for utterances.

Returns

features (FeaturesCollection) – The computed features on each input signal. The keys of output features are the keys of the input utterances.

Raises

ValueError – If the njobs parameter is <= 0 or if an entry is missing in optioanl kwargs.

set_logger(level, formatter='%(levelname)s - %(name)s - %(message)s')¶

Change level and/or format of the processor’s logger

Parameters

level (str) – The minimum log level handled by the logger (any message above this level will be ignored). Must be ‘debug’, ‘info’, ‘warning’ or ‘error’.
formatter (str, optional) – A string to format the log messages, see https://docs.python.org/3/library/logging.html#formatter-objects. By default display level and message. Use ‘%(asctime)s - %(levelname)s - %(name)s - %(message)s’ to display time, level, name and message.

set_params(**params)¶

Set the parameters of this processor.

Returns: self
Raises: ValueError – If any given parameter in params is invalid for the processor.

property energy_mean_scale¶

Scale factor of the mean log-energy

If this is set to s, to get the actual threshold we let m be the mean log-energy of the file, and use s*m + energy_threshold(). Must be greater or equal to 0.

property frames_context¶

Number of frames of context on each side of central frame

The size of the window for which energy is monitored is 2 * frames_context + 1. Must be greater or equal to 0.

property proportion_threshold¶

Proportion of frames beyond the energy threshold

Parameter controlling the proportion of frames within the window that need to have more energy than the threshold. Must be in ]0, 1[.

property ndims¶: Dimension of the output features frames

process(features)[source]¶

Computes voice activity detection (VAD) on the input features

Parameters: features (Features, shape = [n,m]) – The speech features on which to look for voiced frames. The first coefficient must be a log-energy (or equivalent). Works well with MfccProcessor and PlpProcessor.
Returns: vad (Features, shape = [n,1]) – The output vad features are of dtype uint8 and contain 1 for voiced frames or 0 for unvoiced frames.