Extraction of spectrogram from audio signals

Extract spectrogram (log of the power spectrum) from an audio signal. Uses the Kaldi implementation (see [kaldi-spec]):

Audio ---> SpectrogramProcessor ---> Features


>>> from shennong.audio import Audio
>>> from shennong.features.processor.spectrogram import SpectrogramProcessor
>>> audio = Audio.load('./test/data/test.wav')

Initialize the spectrogram processor with some options and compute the features:

>>> processor = SpectrogramProcessor(sample_rate=audio.sample_rate)
>>> processor.window_type = 'hanning'
>>> spect = processor.process(audio)
>>> spect.shape
(140, 257)
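To make "log of the power spectrum" concrete, here is a minimal pure-Python sketch that computes it for a single frame with a naive DFT. It is an illustration, not the Kaldi code path: it skips dithering, pre-emphasis and windowing, and does not reproduce how Kaldi treats the energy term. The function name and the `floor` argument are hypothetical.

```python
import cmath
import math


def log_power_spectrum(frame, floor=1e-10):
    """Log power spectrum of one frame via a naive DFT (illustration only).

    `floor` is a hypothetical guard that keeps the logarithm finite.
    """
    n = len(frame)
    spectrum = []
    # A real signal has n // 2 + 1 non-redundant frequency bins
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(frame)
        )
        power = abs(coeff) ** 2
        spectrum.append(math.log(max(power, floor)))
    return spectrum


# A cosine completing 2 cycles over 16 samples concentrates its power in bin 2
frame = [math.cos(2 * math.pi * 2 * t / 16) for t in range(16)]
spectrum = log_power_spectrum(frame)
```

Each row of the array returned by process() plays the role of one such `spectrum` list, computed on one analysis frame.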




class shennong.features.processor.spectrogram.SpectrogramProcessor(sample_rate=16000, frame_shift=0.01, frame_length=0.025, dither=1.0, preemph_coeff=0.97, remove_dc_offset=True, window_type='povey', round_to_power_of_two=True, blackman_coeff=0.42, snip_edges=True, energy_floor=0.0, raw_energy=True)[source]

Bases: shennong.features.processor.base.FramesProcessor


property name

Name of the processor

property ndims

Dimension of the output features frames

property blackman_coeff

Constant coefficient for generalized Blackman window

Used only if window_type is ‘blackman’

property dither

Amount of dithering

0.0 means no dither

property energy_floor
property frame_length

Frame length in seconds

property frame_shift

Frame shift in seconds


Get parameters for this processor.


deep (boolean, optional) – If True, will return the parameters for this processor and contained subobjects that are processors. Defaults to True.


params (mapping of string to any) – Parameter names mapped to their values.


Return the processor's properties as a dictionary

property preemph_coeff

Coefficient for use in signal preemphasis
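Pre-emphasis is a first-order high-pass filter, y[n] = x[n] - coeff * x[n-1]. The sketch below shows the idea on a plain list of samples. The function name is hypothetical, and the handling of the first sample (treated as its own predecessor) is an assumption about the boundary convention, not a statement about the actual implementation.

```python
def preemphasize(frame, coeff=0.97):
    """Return a pre-emphasized copy of `frame` (illustration only)."""
    if not frame:
        return []
    # Assumption: the first sample has no predecessor, so it is
    # used as its own predecessor
    out = [frame[0] - coeff * frame[0]]
    out.extend(frame[i] - coeff * frame[i - 1] for i in range(1, len(frame)))
    return out
```

With coeff=0.0 the signal is returned unchanged; values close to 1.0 attenuate low frequencies more strongly.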

process_all(signals, njobs=None)

Returns features processed from several input signals

This function processes the features in parallel jobs.

  • signals (dict of Audio) – A dictionary of input audio signals to process features on, where the keys are item names and the values are audio signals.

  • njobs (int, optional) – The number of parallel jobs to run in the background. Defaults to the number of CPU cores available on the machine.


features (FeaturesCollection) – The computed features on each input signal. The keys of output features are the keys of the input signals.


ValueError – If the njobs parameter is <= 0

property remove_dc_offset

If True, subtract mean from waveform on each frame
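As a minimal sketch of what removing the DC offset means for one frame (the function name is hypothetical):

```python
def remove_dc(frame):
    """Subtract the frame mean so the samples are centered on zero."""
    mean = sum(frame) / len(frame)
    return [x - mean for x in frame]
```

The operation is applied frame by frame, so each frame is centered independently of the others.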

property round_to_power_of_two

If true, round window size to power of two

This is done by zero-padding input to FFT
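The padding explains the 257 columns in the example at the top of this page: a 25 ms frame at 16 kHz is 400 samples, which is padded to 512, and a real FFT of size 512 yields 512 / 2 + 1 = 257 bins. The sketch below reproduces that arithmetic. The function names are hypothetical, and the rule "padded size / 2 + 1" is assumed to also hold when no rounding is requested.

```python
def padded_window_size(sample_rate=16000, frame_length=0.025,
                       round_to_power_of_two=True):
    """Window size in samples, optionally rounded up to a power of two."""
    window_size = int(round(sample_rate * frame_length))
    if not round_to_power_of_two:
        return window_size
    padded = 1
    while padded < window_size:
        padded *= 2
    return padded


def spectrogram_ndims(sample_rate=16000, frame_length=0.025,
                      round_to_power_of_two=True):
    """Number of frequency bins per frame (assumed: padded size / 2 + 1)."""
    size = padded_window_size(sample_rate, frame_length, round_to_power_of_two)
    return size // 2 + 1
```

With the default options `spectrogram_ndims()` gives 257, matching `spect.shape[1]` in the example.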

property sample_rate

Waveform sample frequency in Hertz

Must match the sample rate of the signal passed to process()


Set the parameters of this processor.




ValueError – If any given parameter in params is invalid for the processor.

property snip_edges

If true, output only frames that completely fit in the file

When True the number of frames depends on the frame_length. If False, the number of frames depends only on the frame_shift, and we reflect the data at the ends.
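The frame count can be sketched as follows. The formulas follow the usual Kaldi framing convention as I understand it and should be read as an assumption, not as the exact code; the function name is hypothetical.

```python
def num_frames(nsamples, sample_rate=16000, frame_shift=0.01,
               frame_length=0.025, snip_edges=True):
    """Number of frames extracted from a signal of `nsamples` samples."""
    shift = int(round(sample_rate * frame_shift))
    length = int(round(sample_rate * frame_length))
    if snip_edges:
        # Only frames that fit entirely within the signal are kept
        if nsamples < length:
            return 0
        return 1 + (nsamples - length) // shift
    # The signal is reflected at the ends, so only the shift matters
    return (nsamples + shift // 2) // shift
```

For one second of audio at 16 kHz this gives 98 frames with snip_edges=True and 100 frames with snip_edges=False.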


Returns the time labels for the rows given by process()

property window_type

Type of window

Must be ‘hamming’, ‘hanning’, ‘povey’, ‘rectangular’ or ‘blackman’
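For reference, the window shapes can be sketched in pure Python. The formulas below are the ones commonly given for Kaldi's windowing (the 'povey' window being a Hann window raised to the power 0.85); treat them as an assumption, not a copy of the implementation. The function name is hypothetical and `size` must be at least 2.

```python
import math


def window(window_type, size, blackman_coeff=0.42):
    """Return the window coefficients as a list of `size` floats."""
    a = 2.0 * math.pi / (size - 1)
    values = []
    for i in range(size):
        c = math.cos(a * i)
        if window_type == 'hanning':
            v = 0.5 - 0.5 * c
        elif window_type == 'hamming':
            v = 0.54 - 0.46 * c
        elif window_type == 'povey':
            # Like hanning, raised to the power 0.85
            v = (0.5 - 0.5 * c) ** 0.85
        elif window_type == 'rectangular':
            v = 1.0
        elif window_type == 'blackman':
            v = (blackman_coeff - 0.5 * c
                 + (0.5 - blackman_coeff) * math.cos(2 * a * i))
        else:
            raise ValueError('unknown window type: {}'.format(window_type))
        values.append(v)
    return values
```

The 'hanning' and 'povey' windows go to zero at both edges, while 'hamming' stays at 0.08 there; blackman_coeff only affects the 'blackman' window.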

property raw_energy
process(signal, vtln_warp=1.0)[source]

Compute spectrogram with the specified options

Do an optional feature-level vocal tract length normalization (VTLN) when vtln_warp != 1.0.

  • signal (Audio, shape = [nsamples, 1]) – The input audio signal to compute the features on, must be mono

  • vtln_warp (float, optional) – The VTLN warping factor to apply when computing features. Defaults to 1.0, meaning no warping.


features (Features, shape = [nframes, ndims]) – The computed features, output will have as many rows as there are frames (depends on the specified options frame_shift and frame_length).


ValueError – If the input signal has more than one channel (i.e. is not mono), or if sample_rate != signal.sample_rate.