Voice Activity Detection
Compute Voice Activity Detection (VAD) on the features log-energy.
Compute voice activity detection for speech features using the Kaldi implementation (see [kaldi-vad]). The output is, for each input frame, 1 if the frame is judged as voiced, 0 otherwise. There are no continuity constraints.
This is a very simple energy-based method which only looks at the first coefficient of the input features, which is assumed to be a log-energy or something similar. If working from the raw signal, extract the energy using EnergyProcessor.
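For instance, starting from the raw waveform instead of MFCCs, a minimal sketch could look like this (assuming EnergyProcessor is importable from shennong.processor.energy and outputs a single log-energy coefficient per frame):
>>> from shennong.audio import Audio
>>> from shennong.processor.energy import EnergyProcessor
>>> from shennong.postprocessor.vad import VadPostProcessor
>>> audio = Audio.load('./test/data/test.wav')
>>> energy = EnergyProcessor().process(audio)
>>> vad = VadPostProcessor().process(energy)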
A cutoff is set using a formula of the general type:
cutoff = energy_threshold + energy_mean_scale * m,
where m is the mean log-energy of the file, and for each frame the decision is based on the proportion of frames in a context window around the current frame which are above this cutoff.
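The decision rule can be sketched as follows. This is a rough NumPy illustration of the logic described above, not the library's actual implementation; parameter names mirror the properties documented below, and edge handling may differ:

import numpy as np

def vad_sketch(log_energy, energy_threshold=5.0, energy_mean_scale=0.5,
               frames_context=0, proportion_threshold=0.6):
    # cutoff = energy_threshold + energy_mean_scale * mean log-energy of the file
    cutoff = energy_threshold + energy_mean_scale * log_energy.mean()
    above = log_energy > cutoff
    nframes = len(log_energy)
    vad = np.zeros(nframes, dtype=np.uint8)
    for t in range(nframes):
        # context window of 2 * frames_context + 1 frames centered on frame t
        start = max(0, t - frames_context)
        stop = min(nframes, t + frames_context + 1)
        # voiced if enough frames in the window are above the cutoff
        if above[start:stop].mean() >= proportion_threshold:
            vad[t] = 1
    return vad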
Note
This code is geared toward speaker-id applications and is not suitable for automatic speech recognition (ASR) because it makes independent decisions for each frame without imposing any notion of continuity.
Examples
>>> import numpy as np
>>> from shennong.audio import Audio
>>> from shennong.processor.mfcc import MfccProcessor
>>> from shennong.postprocessor.vad import VadPostProcessor
>>> audio = Audio.load('./test/data/test.wav')
>>> mfcc = MfccProcessor().process(audio)
Compute the voice activity detection on the extracted MFCCs:
>>> processor = VadPostProcessor()
>>> vad = processor.process(mfcc)
For each frame of the MFCCs, vad is 1 if the frame is detected as voiced, 0 otherwise:
>>> nframes = mfcc.shape[0]
>>> vad.shape == (nframes, 1)
True
>>> nvoiced = sum(vad.data[vad.data == 1])
>>> print('{} voiced frames out of {}'.format(nvoiced, nframes))
119 voiced frames out of 140
References
[kaldi-vad] https://kaldi-asr.org/doc/voice-activity-detection_8h.html
- class shennong.postprocessor.vad.VadPostProcessor(energy_threshold=5.0, energy_mean_scale=0.5, frames_context=0, proportion_threshold=0.6)
  Bases: shennong.postprocessor.base.FeaturesPostProcessor
  Computes VAD on speech features
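For illustration, the detection can be tuned at construction time (the values below are arbitrary, see the parameter descriptions that follow):
>>> processor = VadPostProcessor(energy_threshold=7.0, frames_context=2)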
- property name
  Name of the processor
- property energy_threshold
  Constant term in energy threshold for MFCC0 for VAD
  See also energy_mean_scale()
- get_params(deep=True)
  Get parameters for this processor.
  - Parameters
    deep (boolean, optional) – If True, will return the parameters for this processor and contained subobjects that are processors. Default to True.
  - Returns
    params (mapping of string to any) – Parameter names mapped to their values.
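For example, assuming the exposed parameters are exactly the constructor arguments, one would expect something like:
>>> processor = VadPostProcessor()
>>> sorted(processor.get_params().keys())
['energy_mean_scale', 'energy_threshold', 'frames_context', 'proportion_threshold']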
- get_properties(features)
  Return the processor's properties as a dictionary
- property log
  Processor logger
- process_all(utterances, njobs=None, **kwargs)
  Returns features processed from several input utterances
  This function processes the features in parallel jobs.
  - Parameters
    utterances (Utterances) – The utterances on which to process features.
    njobs (int, optional) – The number of parallel jobs to run in background. Default to the number of CPU cores available on the machine.
    **kwargs (dict, optional) – Extra arguments to be forwarded to the process method. Keys must be the same as for utterances.
  - Returns
    features (FeaturesCollection) – The computed features on each input signal. The keys of the output features are the keys of the input utterances.
  - Raises
    ValueError – If the njobs parameter is <= 0 or if an entry is missing in optional kwargs.
- set_logger(level, formatter='%(levelname)s - %(name)s - %(message)s')
  Change level and/or format of the processor's logger
  - Parameters
    level (str) – The minimum log level handled by the logger (any message below this level will be ignored). Must be 'debug', 'info', 'warning' or 'error'.
    formatter (str, optional) – A string to format the log messages, see https://docs.python.org/3/library/logging.html#formatter-objects. By default displays level and message. Use '%(asctime)s - %(levelname)s - %(name)s - %(message)s' to display time, level, name and message.
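For instance, to make the processor more verbose using one of the levels listed above (a minimal sketch):
>>> processor = VadPostProcessor()
>>> processor.set_logger('debug')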
- set_params(**params)
  Set the parameters of this processor.
  - Returns
    self
  - Raises
    ValueError – If any given parameter in params is invalid for the processor.
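A small sketch with arbitrary values (since the method returns self, the result is reassigned to avoid echoing it in an interactive session):
>>> processor = VadPostProcessor()
>>> processor = processor.set_params(energy_threshold=7.0, proportion_threshold=0.8)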
- property energy_mean_scale
  Scale factor of the mean log-energy
  If this is set to s, to get the actual threshold we let m be the mean log-energy of the file, and use s*m + energy_threshold(). Must be greater or equal to 0.
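As a worked example with the defaults energy_threshold=5.0 and energy_mean_scale=0.5: for a file whose mean log-energy is m = 10 (an arbitrary illustrative value), the cutoff is 0.5 * 10 + 5.0 = 10.0.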
- property frames_context
  Number of frames of context on each side of the central frame
  The size of the window for which energy is monitored is 2 * frames_context + 1. Must be greater or equal to 0.
- property proportion_threshold
  Proportion of frames beyond the energy threshold
  Parameter controlling the proportion of frames within the window that need to have more energy than the threshold. Must be in ]0, 1[.
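For instance, with frames_context=2 the window spans 2 * 2 + 1 = 5 frames; with proportion_threshold=0.6, at least 0.6 * 5 = 3 of those frames must exceed the energy cutoff for the central frame to be marked as voiced (ignoring edge effects at the file boundaries).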
- property ndims
  Dimension of the output features frames
- process(features)
  Computes voice activity detection (VAD) on the input features
  - Parameters
    features (Features, shape = [n, m]) – The speech features on which to look for voiced frames. The first coefficient must be a log-energy (or equivalent). Works well with MfccProcessor and PlpProcessor.
  - Returns
    vad (Features, shape = [n, 1]) – The output vad features are of dtype uint8 and contain 1 for voiced frames or 0 for unvoiced frames.
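As a further sketch, the same post-processing should apply to PLP features, reusing the audio loaded in the examples above (assuming PlpProcessor is importable from shennong.processor.plp):
>>> from shennong.processor.plp import PlpProcessor
>>> plp = PlpProcessor().process(audio)
>>> vad = VadPostProcessor().process(plp)
>>> vad.shape == (plp.shape[0], 1)
True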