Pitch estimation using Kaldi

Provides classes to extract pitch from an audio (speech) signal

This module provides the classes PitchProcessor and PitchPostProcessor, which respectively compute the pitch from raw speech and turn it into suitable features: together they produce pitch and probability-of-voicing estimates for use as features in automatic speech recognition systems.

Uses the Kaldi implementation of pitch extraction and postprocessing (see [Ghahremani2014] and [kaldi-pitch]).

Audio -> PitchProcessor -> PitchPostProcessor -> Features

Examples

>>> from shennong.audio import Audio
>>> from shennong.processor.pitch import (PitchProcessor, PitchPostProcessor)
>>> audio = Audio.load('./test/data/test.wav')

Initialize a pitch processor with some options. Options can be specified at construction, or after:

>>> processor = PitchProcessor(frame_shift=0.01, frame_length=0.025)
>>> processor.sample_rate = audio.sample_rate
>>> processor.min_f0 = 20
>>> processor.max_f0 = 500

Options can also be passed as a dictionary:

>>> options = {
...     'sample_rate': audio.sample_rate,
...     'frame_shift': 0.01, 'frame_length': 0.025,
...     'min_f0': 20, 'max_f0': 500}
>>> processor = PitchProcessor(**options)

Compute the pitch with the specified options. The output is an instance of Features:

>>> pitch = processor.process(audio)
>>> type(pitch)
<class 'shennong.features.Features'>
>>> pitch.shape
(140, 2)

The pitch post-processor works in the same way. It takes the raw pitch as input and outputs features usable by speech processing tools:

>>> postprocessor = PitchPostProcessor()  # use default options
>>> postpitch = postprocessor.process(pitch)
>>> postpitch.shape
(140, 3)

References

Ghahremani2014: A Pitch Extraction Algorithm Tuned for Automatic Speech Recognition, Pegah Ghahremani, Bagher BabaAli, Daniel Povey, Korbinian Riedhammer, Jan Trmal and Sanjeev Khudanpur, ICASSP 2014
kaldi-pitch: http://kaldi-asr.org/doc/pitch-functions_8h.html

class shennong.processor.pitch.PitchProcessor(sample_rate=16000, frame_shift=0.01, frame_length=0.025, min_f0=50, max_f0=400, soft_min_f0=10, penalty_factor=0.1, lowpass_cutoff=1000, resample_freq=4000, delta_pitch=0.005, nccf_ballast=7000, lowpass_filter_width=1, upsample_filter_width=5)

Bases: shennong.processor.base.FeaturesProcessor

Extracts the (NCCF, pitch) per frame from a speech signal

The output will have as many rows as there are frames, and two columns corresponding to (NCCF, pitch). NCCF is the Normalized Cross Correlation Function.

property name: Name of the processor

property sample_rate

Waveform sample frequency in Hertz

Must match the sample rate of the signal passed to process

property frame_shift: Frame shift in seconds

property frame_length: Frame length in seconds

property min_f0: Minimum F0 to search for in Hertz

property max_f0: Maximum F0 to search for in Hertz

property soft_min_f0

Minimum F0 to search for, applied in a soft way, in Hertz

Must not exceed min_f0

property penalty_factor: Cost factor for F0 change

property lowpass_cutoff: Cutoff frequency for low-pass filter, in Hertz

property resample_freq

Frequency that we down-sample the signal to, in Hertz

Must be more than twice lowpass_cutoff
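
The two constraints stated in the property descriptions (soft_min_f0 must not exceed min_f0, and resample_freq must be more than twice lowpass_cutoff) can be checked before building a processor. The following is a minimal sketch only: the helper check_pitch_options is hypothetical and not part of shennong, and it encodes nothing beyond these two documented constraints.

```python
def check_pitch_options(options):
    """Return the list of violated constraints (empty if options are consistent).

    Hypothetical helper, not part of shennong: it only encodes the two
    constraints stated in the property descriptions.
    """
    # default values copied from the PitchProcessor signature
    defaults = {
        'min_f0': 50, 'soft_min_f0': 10,
        'lowpass_cutoff': 1000, 'resample_freq': 4000}
    opts = {**defaults, **options}

    errors = []
    if opts['soft_min_f0'] > opts['min_f0']:
        errors.append('soft_min_f0 must not exceed min_f0')
    if opts['resample_freq'] <= 2 * opts['lowpass_cutoff']:
        errors.append('resample_freq must be more than twice lowpass_cutoff')
    return errors
```

With the default values the returned list is empty. Lowering resample_freq to 2000 while keeping lowpass_cutoff at 1000 reports the second constraint.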

property delta_pitch: Smallest relative change in pitch that the algorithm measures

property nccf_ballast

Increasing this factor reduces NCCF for quiet frames

This helps ensure pitch continuity in unvoiced regions

get_params(deep=True)

Get parameters for this processor.

Parameters: deep (boolean, optional) – If True, will return the parameters for this processor and contained subobjects that are processors. Defaults to True.
Returns: params (mapping of string to any) – Parameter names mapped to their values.

get_properties(): Return the processor's properties as a dictionary

property log: Processor logger

property lowpass_filter_width

Integer that determines the filter width of the low-pass filter

Larger values give a sharper filter

process_all(signals, njobs=None)

Returns features processed from several input signals

This function processes the features in parallel jobs.

Parameters

signals (dict of Audio) – A dictionary of input audio signals to process features on, where the keys are item names and values are audio signals.
njobs (int, optional) – The number of parallel jobs to run in background. Defaults to the number of CPU cores available on the machine.

Returns

features (FeaturesCollection) – The computed features on each input signal. The keys of output features are the keys of the input signals.

Raises

ValueError – If the njobs parameter is <= 0
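
The contract described for process_all (a dictionary in, a result per key out, ValueError on a non-positive njobs) can be illustrated with the standard library. This is a rough sketch of the documented behaviour, not shennong's implementation: the function process_all_sketch is hypothetical, it uses threads for simplicity, and it returns a plain dict rather than a FeaturesCollection.

```python
from concurrent.futures import ThreadPoolExecutor
import os


def process_all_sketch(process, signals, njobs=None):
    """Apply `process` to each value of `signals`, keeping the keys.

    Hypothetical stand-in for the documented contract, not shennong's code.
    """
    if njobs is None:
        njobs = os.cpu_count() or 1  # default: number of CPU cores
    if njobs <= 0:
        raise ValueError('njobs must be strictly positive')

    names = list(signals)
    with ThreadPoolExecutor(max_workers=njobs) as pool:
        # map() yields results in the order of its inputs
        results = list(pool.map(process, (signals[n] for n in names)))
    return dict(zip(names, results))
```

For instance, process_all_sketch(len, {'a': [1, 2, 3], 'b': [4]}, njobs=2) returns a dictionary with the same keys 'a' and 'b'.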

set_logger(level, formatter='%(levelname)s - %(name)s - %(message)s')

Change level and/or format of the processor’s logger

Parameters

level (str) – The minimum log level handled by the logger (any message below this level will be ignored). Must be ‘debug’, ‘info’, ‘warning’ or ‘error’.
formatter (str, optional) – A string to format the log messages, see https://docs.python.org/3/library/logging.html#formatter-objects. By default displays level, name and message. Use ‘%(asctime)s - %(levelname)s - %(name)s - %(message)s’ to display time, level, name and message.
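
The effect of the two format strings can be previewed with Python's standard logging module, independently of shennong. In this small sketch the logger name 'pitch', the file name and the message are illustrative values only.

```python
import logging

# the two format strings quoted in the set_logger documentation
default_format = '%(levelname)s - %(name)s - %(message)s'
timed_format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s'

# a hand-built record; the name, path and message are illustrative only
record = logging.LogRecord(
    name='pitch', level=logging.INFO, pathname='example.py', lineno=0,
    msg='processing done', args=(), exc_info=None)

default_line = logging.Formatter(default_format).format(record)
timed_line = logging.Formatter(timed_format).format(record)

print(default_line)  # INFO - pitch - processing done
print(timed_line)    # the same line, prefixed with a timestamp
```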

set_params(**params)

Set the parameters of this processor.

Returns: self
Raises: ValueError – If any given parameter in params is invalid for the processor.

property upsample_filter_width: Integer that determines filter width when upsampling NCCF

property ndims: Dimension of the output feature frames

times(nframes): Returns the time labels for the rows given by the process method

process(signal)

Extracts the (NCCF, pitch) from a given speech signal

Parameters: signal (Audio) – The speech signal on which to estimate the pitch. The signal’s sample rate must match the sample rate specified in the PitchProcessor options.
Returns: raw_pitch_features (Features, shape = [nframes, 2]) – The output array has as many rows as there are frames (depends on the specified options frame_shift and frame_length), and two columns corresponding to (NCCF, pitch).
Raises: ValueError – If the input signal has more than one channel (i.e. is not mono), or if sample_rate != signal.sample_rate.
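
The number of output rows depends on frame_shift and frame_length. As a rough sketch, assuming that only frames fitting entirely inside the signal are kept (this convention is an assumption, it is not stated in this documentation), the frame count can be estimated as follows. The helper expected_nframes is hypothetical and not part of shennong.

```python
def expected_nframes(nsamples, sample_rate=16000,
                     frame_shift=0.01, frame_length=0.025):
    """Estimate the number of frames, keeping only complete frames.

    Assumption: frames that would run past the end of the signal are dropped.
    """
    shift = round(frame_shift * sample_rate)    # samples between frame starts
    length = round(frame_length * sample_rate)  # samples per frame
    if nsamples < length:
        return 0
    return 1 + (nsamples - length) // shift
```

Under this assumption and with the default options, one second of audio at 16 kHz (16000 samples) gives 98 frames.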

class shennong.processor.pitch.PitchPostProcessor(pitch_scale=2.0, pov_scale=2.0, pov_offset=0.0, delta_pitch_scale=10.0, delta_pitch_noise_stddev=0.005, normalization_left_context=75, normalization_right_context=75, delta_window=2, delay=0, add_pov_feature=True, add_normalized_log_pitch=True, add_delta_pitch=True, add_raw_log_pitch=False)

Bases: shennong.postprocessor.base.FeaturesPostProcessor

Processes the raw (NCCF, pitch) computed by the PitchProcessor

Turns the raw pitch quantities into usable features. By default it outputs three-dimensional features (POV-feature, mean-subtracted-log-pitch, delta-of-raw-pitch), but this is configurable in the options. The output has as many rows as the input, i.e. one row per frame. The number of columns is the number of feature types requested (3 by default, 4 at most). The four parameters add_pov_feature, add_normalized_log_pitch, add_delta_pitch and add_raw_log_pitch determine which features are created; by default the first three are.

POV stands for Probability of Voicing.

get_params(deep=True)

Get parameters for this processor.

Parameters: deep (boolean, optional) – If True, will return the parameters for this processor and contained subobjects that are processors. Defaults to True.
Returns: params (mapping of string to any) – Parameter names mapped to their values.

property log: Processor logger

process_all(signals, njobs=None)

Returns features processed from several input signals

This function processes the features in parallel jobs.

Parameters

signals (dict of Audio) – A dictionary of input audio signals to process features on, where the keys are item names and values are audio signals.
njobs (int, optional) – The number of parallel jobs to run in background. Defaults to the number of CPU cores available on the machine.

Returns

features (FeaturesCollection) – The computed features on each input signal. The keys of output features are the keys of the input signals.

Raises

ValueError – If the njobs parameter is <= 0

set_logger(level, formatter='%(levelname)s - %(name)s - %(message)s')

Change level and/or format of the processor’s logger

Parameters

level (str) – The minimum log level handled by the logger (any message below this level will be ignored). Must be ‘debug’, ‘info’, ‘warning’ or ‘error’.
formatter (str, optional) – A string to format the log messages, see https://docs.python.org/3/library/logging.html#formatter-objects. By default displays level, name and message. Use ‘%(asctime)s - %(levelname)s - %(name)s - %(message)s’ to display time, level, name and message.

set_params(**params)

Set the parameters of this processor.

Returns: self
Raises: ValueError – If any given parameter in params is invalid for the processor.

property name: Name of the processor

property pitch_scale: Scaling factor for the final normalized log-pitch value

property pov_scale: Scaling factor for the final probability of voicing feature

property pov_offset

This can be used to add an offset to the POV feature

Intended for use in Kaldi’s online decoding as a substitute for CMN (cepstral mean normalization)

property delta_pitch_scale: Term to scale the final delta log-pitch feature

property delta_pitch_noise_stddev

Standard deviation for noise we add to the delta log-pitch

The noise is added before scaling. Its standard deviation should be about the same as the delta_pitch option of the PitchProcessor. The purpose is to get rid of peaks in the delta-pitch caused by discretization of pitch values.

property normalization_left_context: Left-context (in frames) for moving window normalization

property normalization_right_context: Right-context (in frames) for moving window normalization

property delta_window: Number of frames on each side of central frame

property delay: Number of frames by which the pitch information is delayed

property add_pov_feature: If true, the warped NCCF is added to output features

property add_normalized_log_pitch

If true, the normalized log-pitch is added to output features

Normalization is done with POV-weighted mean subtraction over a 1.5 second window.
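
The 1.5 second figure is consistent with the default contexts and a 10 ms frame shift. A quick check, assuming the window spans normalization_left_context + normalization_right_context frame shifts (the central frame is ignored here for simplicity, so the result is approximate):

```python
# default PitchPostProcessor contexts, in frames
normalization_left_context = 75
normalization_right_context = 75
# default PitchProcessor frame shift, in seconds
frame_shift = 0.01

window_frames = normalization_left_context + normalization_right_context
window_seconds = window_frames * frame_shift
print(round(window_seconds, 3))  # 1.5
```

With a different frame_shift or different contexts, the window duration changes accordingly.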

property add_delta_pitch: If true, time derivative of log-pitch is added to output features

property add_raw_log_pitch: If true, the raw log-pitch is added to output features

property ndims: Dimension of the output feature frames

get_properties(features): Return the processor's properties as a dictionary

process(raw_pitch)

Post-processes raw pitch data as specified by the options

Parameters: raw_pitch (Features, shape = [n, 2]) – The pitch as extracted by the PitchProcessor.process method
Returns: pitch (Features, shape = [n, 1, 2, 3 or 4]) – The post-processed pitch usable as speech features. The output columns are ‘pov_feature’, ‘normalized_log_pitch’, ‘delta_pitch’ and ‘raw_log_pitch’, in that order, if their respective options are set to True.
Raises: ValueError – If raw_pitch does not have exactly two columns, or if all of the following options are False: ‘add_pov_feature’, ‘add_normalized_log_pitch’, ‘add_delta_pitch’ and ‘add_raw_log_pitch’ (at least one of them must be True).
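
The mapping from the four add_* options to the output columns can be summarized in a few lines. This is a minimal sketch of the documented behaviour only: the helper output_columns is hypothetical and not part of shennong, and its error message is illustrative.

```python
def output_columns(add_pov_feature=True, add_normalized_log_pitch=True,
                   add_delta_pitch=True, add_raw_log_pitch=False):
    """List the output column names, in order, for the given options.

    Hypothetical helper mirroring the documented behaviour of process.
    """
    flags = [
        ('pov_feature', add_pov_feature),
        ('normalized_log_pitch', add_normalized_log_pitch),
        ('delta_pitch', add_delta_pitch),
        ('raw_log_pitch', add_raw_log_pitch)]
    columns = [name for name, enabled in flags if enabled]
    if not columns:
        # at least one feature type must be requested
        raise ValueError('at least one of the add_* options must be True')
    return columns
```

The length of the returned list corresponds to the number of output columns: 3 with the default options, 4 when add_raw_log_pitch is also set to True.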