Voice Activity Detection

Compute voice activity detection (VAD) from the log-energy of features

Features –> VadPostProcessor –> Features

Compute voice activity detection for speech features using the Kaldi implementation (see [kaldi-vad]). The output is, for each input frame, 1 if the frame is judged voiced and 0 otherwise. There are no continuity constraints.

This is a very simple energy-based method that only looks at the first coefficient of the input features, which is assumed to be a log-energy or something similar. If working from the raw signal, extract the energy using EnergyProcessor.

A cutoff is set using a formula of the general type:

\textrm{cutoff} = 5.0 + 0.5 * (\textrm{average log-energy}),

and for each frame the decision is based on the proportion of frames, within a context window around the current frame, that are above this cutoff.
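The procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the Kaldi or shennong code: the function name energy_vad is hypothetical, the input is a plain list of per-frame log-energies, and two details are assumptions (the window is truncated at the edges of the signal, and a frame is voiced when the count of frames above the cutoff is at least the required proportion of the window).

```python
def energy_vad(log_energy, energy_threshold=5.0, energy_mean_scale=0.5,
               frames_context=0, proportion_threshold=0.6):
    """Return one 0/1 decision per frame (illustrative sketch)."""
    nframes = len(log_energy)
    if nframes == 0:
        return []

    # cutoff = constant term + scale * mean log-energy of the whole input
    cutoff = energy_threshold + energy_mean_scale * sum(log_energy) / nframes

    decisions = []
    for t in range(nframes):
        # Context window around frame t, truncated at the edges (assumption).
        start = max(0, t - frames_context)
        stop = min(nframes, t + frames_context + 1)
        window = log_energy[start:stop]

        # Count the frames of the window whose energy exceeds the cutoff.
        above = sum(1 for energy in window if energy > cutoff)
        voiced = above >= proportion_threshold * len(window)
        decisions.append(1 if voiced else 0)
    return decisions


energies = [0.0, 20.0, 0.0, 20.0, 20.0, 20.0, 0.0, 0.0]
print(energy_vad(energies))                    # independent frames
print(energy_vad(energies, frames_context=1))  # decisions use neighbours
```

With frames_context=0 each frame is compared to the cutoff on its own. A larger context makes isolated high-energy frames less likely to be marked as voiced.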


This code is geared toward speaker-id applications and is not suitable for automatic speech recognition (ASR) because it makes independent decisions for each frame without imposing any notion of continuity.


>>> import numpy as np
>>> from shennong.audio import Audio
>>> from shennong.features.processor.mfcc import MfccProcessor
>>> from shennong.features.postprocessor.vad import VadPostProcessor
>>> audio = Audio.load('./test/data/test.wav')
>>> mfcc = MfccProcessor().process(audio)

Compute the voice activity detection on the extracted MFCCs:

>>> processor = VadPostProcessor()
>>> vad = processor.process(mfcc)

For each frame of the MFCCs, vad is 1 if the frame is detected as voiced, 0 otherwise:

>>> nframes = mfcc.shape[0]
>>> vad.shape == (nframes, 1)
True
>>> nvoiced = int(vad.data.sum())
>>> print('{} voiced frames out of {}'.format(nvoiced, nframes))
119 voiced frames out of 140




class shennong.features.postprocessor.vad.VadPostProcessor(energy_threshold=5.0, energy_mean_scale=0.5, frames_context=0, proportion_threshold=0.6)[source]

Bases: shennong.features.postprocessor.base.FeaturesPostProcessor

Computes VAD on speech features

property name

Name of the processor

property energy_threshold

Constant term in energy threshold for MFCC0 for VAD

See also energy_mean_scale()


Get parameters for this processor.


deep (boolean, optional) – If True, will return the parameters for this processor and contained subobjects that are processors. Defaults to True.


params (mapping of string to any) – Parameter names mapped to their values.


Return the processor's properties as a dictionary

process_all(signals, njobs=None)

Returns features processed from several input signals

This function processes the features in parallel jobs.

  • signals (dict of Audio) – A dictionary of input audio signals to process features on, where the keys are item names and values are audio signals.

  • njobs (int, optional) – The number of parallel jobs to run in background. Defaults to the number of CPU cores available on the machine.


features (FeaturesCollection) – The computed features on each input signal. The keys of output features are the keys of the input signals.


ValueError – If the njobs parameter is <= 0
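The behaviour documented for process_all (one job per item, results keyed like the inputs, a non-positive njobs rejected) can be sketched with the standard library. This is not shennong's implementation: parallel_map_dict is a hypothetical helper, and the use of a thread pool is an assumption made for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def parallel_map_dict(func, items, njobs=None):
    """Apply func to every value of items in parallel (hypothetical helper)."""
    if njobs is None:
        # Default to the number of CPU cores (at least one).
        njobs = os.cpu_count() or 1
    if njobs <= 0:
        raise ValueError(
            'njobs must be strictly positive, got {}'.format(njobs))

    names = list(items)
    with ThreadPoolExecutor(max_workers=njobs) as pool:
        # pool.map preserves input order, so names and results stay aligned.
        results = list(pool.map(func, (items[name] for name in names)))
    return dict(zip(names, results))
```

Here func plays the role of the per-signal processing step, and the returned dictionary reuses the input keys, as the output features do.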


Set the parameters of this processor.




ValueError – If any given parameter in params is invalid for the processor.
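The parameter protocol described here (a mapping of parameter names to values, and a ValueError on unknown names) resembles the scikit-learn convention. A minimal sketch under that assumption follows. ToyProcessor is a hypothetical class, not the shennong one, and the method names get_params and set_params are assumed from that convention.

```python
class ToyProcessor:
    """Hypothetical processor used only to illustrate the protocol."""

    def __init__(self, energy_threshold=5.0, frames_context=0):
        self.energy_threshold = energy_threshold
        self.frames_context = frames_context

    def get_params(self):
        # Parameter names mapped to their current values.
        return {'energy_threshold': self.energy_threshold,
                'frames_context': self.frames_context}

    def set_params(self, **params):
        valid = self.get_params()
        for name in params:
            if name not in valid:
                raise ValueError('invalid parameter: {}'.format(name))
        # Assign only once every name has been checked.
        for name, value in params.items():
            setattr(self, name, value)
        return self
```

Checking all names before assigning anything keeps the object unchanged when one of the given parameters is invalid.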

property energy_mean_scale

Scale factor of the mean log-energy

If this is set to s, to get the actual threshold we let m be the mean log-energy of the file, and use s*m + energy_threshold(). Must be greater than or equal to 0.
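As a worked example of the formula s*m + energy_threshold, the snippet below computes the threshold for two made-up recordings. The helper name vad_cutoff and the log-energy values are illustrative assumptions.

```python
def vad_cutoff(log_energy, energy_threshold=5.0, energy_mean_scale=0.5):
    """Threshold s*m + energy_threshold, with m the mean log-energy."""
    mean_log_energy = sum(log_energy) / len(log_energy)
    return energy_mean_scale * mean_log_energy + energy_threshold


quiet = [2.0, 4.0, 6.0]     # mean log-energy 4.0
loud = [12.0, 14.0, 16.0]   # mean log-energy 14.0

print(vad_cutoff(quiet))                         # 7.0
print(vad_cutoff(loud))                          # 12.0
print(vad_cutoff(loud, energy_mean_scale=0.0))   # 5.0
```

A positive scale makes the threshold follow the overall level of the recording, while a scale of 0 reduces it to the constant term.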

property frames_context

Number of frames of context on each side of central frame

The size of the window for which energy is monitored is 2 * frames_context + 1. Must be greater than or equal to 0.

property proportion_threshold

Proportion of frames beyond the energy threshold

Parameter controlling the proportion of frames within the window that need to have more energy than the threshold. Must be in the open interval ]0, 1[ (strictly between 0 and 1).
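To illustrate how frames_context and proportion_threshold interact, the sketch below decides a single frame from a list of 0/1 flags telling which frames exceed the energy threshold. The helper window_decision is hypothetical. Truncating the window at the edges and using "greater than or equal" for the comparison are assumptions.

```python
def window_decision(above_cutoff, t, frames_context=0,
                    proportion_threshold=0.6):
    """Decide frame t from 0/1 flags of frames above the threshold."""
    # Window of up to 2 * frames_context + 1 frames centred on t.
    start = max(0, t - frames_context)
    stop = min(len(above_cutoff), t + frames_context + 1)
    window = above_cutoff[start:stop]

    # Voiced when enough frames of the window are above the threshold.
    return 1 if sum(window) >= proportion_threshold * len(window) else 0


flags = [0, 1, 0, 1, 1, 1, 0, 0]
print(window_decision(flags, 3, frames_context=2))  # 4 of 5 frames above: 1
print(window_decision(flags, 0, frames_context=2))  # 1 of 3 frames above: 0
print(window_decision(flags, 7, frames_context=2))  # 1 of 3 frames above: 0
```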

property ndims

Dimension of the output feature frames


process(features)

Computes voice activity detection (VAD) on the input features


features (Features, shape = [n,m]) – The speech features on which to look for voiced frames. The first coefficient must be a log-energy (or equivalent). Works well with MfccProcessor and PlpProcessor.


vad (Features, shape = [n,1]) – The output vad features are of dtype uint8 and contain 1 for voiced frames or 0 for unvoiced frames.
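A common use of such an [n,1] output is to keep only the voiced frames of the [n,m] input. The sketch below shows the idea on plain Python lists standing in for the Features data. keep_voiced is a hypothetical helper, not part of shennong.

```python
def keep_voiced(frames, vad):
    """Return the frames whose VAD flag is 1.

    frames is a list of n feature vectors, vad a list of n
    single-element rows holding 0 or 1.
    """
    if len(frames) != len(vad):
        raise ValueError('frames and vad must have the same number of rows')
    return [frame for frame, (flag,) in zip(frames, vad) if flag == 1]


frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
vad = [[1], [0], [1]]
print(keep_voiced(frames, vad))  # [[1.0, 2.0], [5.0, 6.0]]
```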