Pitch estimation using Kaldi

Provides classes to extract pitch from an audio (speech) signal

This module provides the classes PitchProcessor and PitchPostProcessor, which respectively compute the pitch from raw speech and turn it into suitable features. Together they produce pitch and probability-of-voicing estimates for use as features in automatic speech recognition systems.

Uses the Kaldi implementation of pitch extraction and postprocessing (see [Ghahremani2014] and [kaldi-pitch]).

Audio ---> PitchProcessor ---> PitchPostProcessor ---> Features

Examples

>>> from shennong.audio import Audio
>>> from shennong.processor.pitch import (PitchProcessor, PitchPostProcessor)
>>> audio = Audio.load('./test/data/test.wav')

Initialize a pitch processor with some options. Options can be specified at construction, or after:

>>> processor = PitchProcessor(frame_shift=0.01, frame_length=0.025)
>>> processor.sample_rate = audio.sample_rate
>>> processor.min_f0 = 20
>>> processor.max_f0 = 500

Options can also be passed as a dictionary:

>>> options = {
...     'sample_rate': audio.sample_rate,
...     'frame_shift': 0.01, 'frame_length': 0.025,
...     'min_f0': 20, 'max_f0': 500}
>>> processor = PitchProcessor(**options)

Compute the pitch with the specified options. The output is an instance of Features:

>>> pitch = processor.process(audio)
>>> type(pitch)
<class 'shennong.features.Features'>
>>> pitch.shape
(140, 2)

The pitch post-processor works in the same way. Its input is the raw pitch, and its output is features usable by speech processing tools:

>>> postprocessor = PitchPostProcessor()  # use default options
>>> postpitch = postprocessor.process(pitch)
>>> postpitch.shape
(140, 3)

References

Ghahremani2014

A Pitch Extraction Algorithm Tuned for Automatic Speech Recognition, Pegah Ghahremani, Bagher BabaAli, Daniel Povey, Korbinian Riedhammer, Jan Trmal and Sanjeev Khudanpur, ICASSP 2014

kaldi-pitch

http://kaldi-asr.org/doc/pitch-functions_8h.html

class shennong.processor.pitch.PitchProcessor(sample_rate=16000, frame_shift=0.01, frame_length=0.025, min_f0=50, max_f0=400, soft_min_f0=10, penalty_factor=0.1, lowpass_cutoff=1000, resample_freq=4000, delta_pitch=0.005, nccf_ballast=7000, lowpass_filter_width=1, upsample_filter_width=5)[source]

Bases: shennong.processor.base.FeaturesProcessor

Extracts the (NCCF, pitch) per frame from a speech signal

The output will have as many rows as there are frames, and two columns corresponding to (NCCF, pitch). NCCF is the Normalized Cross Correlation Function.

property name

Name of the processor

property sample_rate

Waveform sample frequency in Hertz

Must match the sample rate of the signal specified in process

property frame_shift

Frame shift in seconds

property frame_length

Frame length in seconds
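As a rough illustration of how frame_shift and frame_length determine the number of output rows, the sketch below counts the frames that fit entirely inside a signal. It assumes no padding at the signal edges, and the helper count_frames is hypothetical (it is not part of shennong), so the count returned by process may differ slightly.

```python
def count_frames(num_samples, sample_rate=16000,
                 frame_shift=0.01, frame_length=0.025):
    """Count the frames fully contained in a signal (hypothetical helper)."""
    # Convert durations in seconds to a number of samples.
    shift = int(round(sample_rate * frame_shift))
    length = int(round(sample_rate * frame_length))
    if num_samples < length:
        return 0
    # One frame at offset 0, then one more per complete shift.
    return 1 + (num_samples - length) // shift


# One second of audio sampled at 16 kHz, with the default framing options.
nframes = count_frames(16000)
```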

property min_f0

Minimum F0 to search for in Hertz

property max_f0

Maximum F0 to search for in Hertz

property soft_min_f0

Minimum F0 to search, applied in a soft way, in Hertz

Must not exceed min_f0

property penalty_factor

Cost factor for F0 change

property lowpass_cutoff

Cutoff frequency for low-pass filter, in Hertz

property resample_freq

Frequency that we down-sample the signal to, in Hertz

Must be more than twice lowpass_cutoff
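The two constraints stated for soft_min_f0 and resample_freq can be checked before building a processor. The sketch below is only an illustration: check_pitch_options is a hypothetical helper, and this documentation does not say when or how shennong itself validates these options.

```python
def check_pitch_options(min_f0=50, soft_min_f0=10,
                        lowpass_cutoff=1000, resample_freq=4000):
    """Raise ValueError if a documented constraint is violated.

    Hypothetical helper, not part of shennong.
    """
    if soft_min_f0 > min_f0:
        raise ValueError(
            'soft_min_f0 ({}) must not exceed min_f0 ({})'.format(
                soft_min_f0, min_f0))
    if resample_freq <= 2 * lowpass_cutoff:
        raise ValueError(
            'resample_freq ({}) must be more than twice '
            'lowpass_cutoff ({})'.format(resample_freq, lowpass_cutoff))
    return True


# The default values listed in the class signature satisfy both constraints.
defaults_ok = check_pitch_options()
```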

property delta_pitch

Smallest relative change in pitch that the algorithm measures

property nccf_ballast

Increasing this factor reduces NCCF for quiet frames

This helps ensure pitch continuity in unvoiced regions
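To see why a ballast term lowers the NCCF of quiet frames, consider the simplified sketch below. It assumes the ballast is a constant added to the energy product under the square root; Kaldi additionally scales the ballast relative to the signal, so this shows the mechanism and does not reproduce the actual implementation. The function nccf and the test signals are illustrative only.

```python
import math


def nccf(frame, lag, ballast=0.0):
    """Simplified normalized cross-correlation at a given lag (sketch)."""
    n = len(frame) - lag
    a = frame[:n]
    b = frame[lag:lag + n]
    inner = sum(x * y for x, y in zip(a, b))
    energy = sum(x * x for x in a) * sum(y * y for y in b)
    # The ballast inflates the denominator; its relative effect is
    # largest when the frame energy is small.
    denom = math.sqrt(energy + ballast)
    return inner / denom if denom > 0 else 0.0


period = 20  # in samples
loud = [math.sin(2 * math.pi * t / period) for t in range(200)]
quiet = [0.01 * x for x in loud]

# Without ballast both frames correlate almost perfectly at lag == period.
loud_plain, quiet_plain = nccf(loud, period), nccf(quiet, period)
# With ballast the quiet frame is pulled towards zero, the loud one barely.
loud_ballast, quiet_ballast = nccf(loud, period, 1.0), nccf(quiet, period, 1.0)
```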

get_params(deep=True)

Get parameters for this processor.

Parameters

deep (boolean, optional) – If True, will return the parameters for this processor and contained subobjects that are processors. Defaults to True.

Returns

params (mapping of string to any) – Parameter names mapped to their values.

get_properties()

Return the processor's properties as a dictionary

property log

Processor logger

property lowpass_filter_width

Integer that determines filter width of lowpass filter

Larger values give a sharper filter

process_all(signals, njobs=None)

Returns features processed from several input signals

This function processes the features in parallel jobs.

Parameters
  • signals (dict of Audio) – A dictionary of input audio signals to process features on, where the keys are item names and the values are audio signals.

  • njobs (int, optional) – The number of parallel jobs to run in background. Defaults to the number of CPU cores available on the machine.

Returns

features (FeaturesCollection) – The computed features on each input signal. The keys of output features are the keys of the input signals.

Raises

ValueError – If the njobs parameter is <= 0
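The contract described above (same keys in and out, parallel execution, ValueError for a non-positive njobs) can be sketched with the standard library. This is a stand-in using threads and a plain dict: the real method returns a FeaturesCollection, and its parallel backend is not described here. The function and argument names are assumptions made for the illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def process_all(process, signals, njobs=None):
    """Apply `process` to each value of `signals`, keeping the keys (sketch)."""
    if njobs is None:
        # Default to the number of CPU cores available on the machine.
        njobs = os.cpu_count() or 1
    if njobs <= 0:
        raise ValueError(
            'njobs must be strictly positive, it is {}'.format(njobs))
    names = list(signals)
    with ThreadPoolExecutor(max_workers=njobs) as pool:
        # pool.map preserves the input order, so names and results align.
        results = list(pool.map(process, (signals[name] for name in names)))
    return dict(zip(names, results))


# `len` stands in for a processor's process method.
features = process_all(len, {'item1': [1, 2, 3], 'item2': [4]}, njobs=2)
```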

set_logger(level, formatter='%(levelname)s - %(name)s - %(message)s')

Change level and/or format of the processor’s logger

Parameters
  • level (str) – The minimum log level handled by the logger (any message below this level will be ignored). Must be ‘debug’, ‘info’, ‘warning’ or ‘error’.

  • formatter (str, optional) – A string to format the log messages, see https://docs.python.org/3/library/logging.html#formatter-objects. By default displays level, name and message. Use ‘%(asctime)s - %(levelname)s - %(name)s - %(message)s’ to display time, level, name and message.
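The level and formatter arguments map directly onto Python's standard logging module. The sketch below shows that mapping with a hypothetical make_logger helper; it is not shennong's implementation of set_logger, only an illustration of how a minimum level and a format string behave.

```python
import logging


def make_logger(name, level,
                formatter='%(levelname)s - %(name)s - %(message)s'):
    """Build a logger with a minimum level and a format (hypothetical helper)."""
    levels = {'debug': logging.DEBUG, 'info': logging.INFO,
              'warning': logging.WARNING, 'error': logging.ERROR}
    if level not in levels:
        raise ValueError('invalid log level: {}'.format(level))
    logger = logging.getLogger(name)
    logger.setLevel(levels[level])
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(formatter))
    logger.handlers = [handler]
    logger.propagate = False
    return logger


log = make_logger('pitch-example', 'warning')
log.info('dropped, because info is below the warning level')
log.warning('written to stderr with level, name and message')
```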

set_params(**params)

Set the parameters of this processor.

Returns

self

Raises

ValueError – If any given parameter in params is invalid for the processor.
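Because set_params returns self, calls can be chained, and an unknown parameter name raises ValueError. The toy class below mimics that documented behavior so it can run without shennong installed; ToyProcessor is purely illustrative and is not the library's code.

```python
class ToyProcessor:
    """Toy stand-in mimicking get_params and set_params (not shennong code)."""

    def __init__(self, min_f0=50, max_f0=400):
        self.min_f0 = min_f0
        self.max_f0 = max_f0

    def get_params(self, deep=True):
        # `deep` is accepted for signature compatibility; this toy class
        # has no nested processors.
        return {'min_f0': self.min_f0, 'max_f0': self.max_f0}

    def set_params(self, **params):
        valid = self.get_params()
        for name, value in params.items():
            if name not in valid:
                raise ValueError('invalid parameter: {}'.format(name))
            setattr(self, name, value)
        return self


# Returning self allows configuration and use in a single expression.
params = ToyProcessor().set_params(min_f0=20, max_f0=500).get_params()
```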

property upsample_filter_width

Integer that determines filter width when upsampling NCCF

property ndims

Dimension of the output features frames

times(nframes)[source]

Returns the time label for the rows given by the process method

process(signal)[source]

Extracts the (NCCF, pitch) from a given speech signal

Parameters

signal (Audio) – The speech signal on which to estimate the pitch. The signal’s sample rate must match the sample rate specified in the PitchProcessor options.

Returns

raw_pitch_features (Features, shape = [nframes, 2]) – The output array has as many rows as there are frames (depends on the specified options frame_shift and frame_length), and two columns corresponding to (NCCF, pitch).

Raises

ValueError – If the input signal has more than one channel (i.e. is not mono), or if sample_rate != signal.sample_rate.

class shennong.processor.pitch.PitchPostProcessor(pitch_scale=2.0, pov_scale=2.0, pov_offset=0.0, delta_pitch_scale=10.0, delta_pitch_noise_stddev=0.005, normalization_left_context=75, normalization_right_context=75, delta_window=2, delay=0, add_pov_feature=True, add_normalized_log_pitch=True, add_delta_pitch=True, add_raw_log_pitch=False)[source]

Bases: shennong.postprocessor.base.FeaturesPostProcessor

Processes the raw (NCCF, pitch) computed by the PitchProcessor

Turns the raw pitch quantities into usable features. By default it outputs three-dimensional features (POV-feature, mean-subtracted-log-pitch, delta-of-raw-pitch), but this is configurable in the options. The output has as many rows as the input, i.e. one per frame. The number of columns is the number of feature types requested (3 by default, 4 at most). The four parameters add_pov_feature, add_normalized_log_pitch, add_delta_pitch and add_raw_log_pitch determine which features are created; by default the first three are.

POV stands for Probability of Voicing.

get_params(deep=True)

Get parameters for this processor.

Parameters

deep (boolean, optional) – If True, will return the parameters for this processor and contained subobjects that are processors. Defaults to True.

Returns

params (mapping of string to any) – Parameter names mapped to their values.

property log

Processor logger

process_all(signals, njobs=None)

Returns features processed from several input signals

This function processes the features in parallel jobs.

Parameters
  • signals (dict of Audio) – A dictionary of input audio signals to process features on, where the keys are item names and the values are audio signals.

  • njobs (int, optional) – The number of parallel jobs to run in background. Defaults to the number of CPU cores available on the machine.

Returns

features (FeaturesCollection) – The computed features on each input signal. The keys of output features are the keys of the input signals.

Raises

ValueError – If the njobs parameter is <= 0

set_logger(level, formatter='%(levelname)s - %(name)s - %(message)s')

Change level and/or format of the processor’s logger

Parameters
  • level (str) – The minimum log level handled by the logger (any message below this level will be ignored). Must be ‘debug’, ‘info’, ‘warning’ or ‘error’.

  • formatter (str, optional) – A string to format the log messages, see https://docs.python.org/3/library/logging.html#formatter-objects. By default displays level, name and message. Use ‘%(asctime)s - %(levelname)s - %(name)s - %(message)s’ to display time, level, name and message.

set_params(**params)

Set the parameters of this processor.

Returns

self

Raises

ValueError – If any given parameter in params is invalid for the processor.

property name

Name of the processor

property pitch_scale

Scaling factor for the final normalized log-pitch value

property pov_scale

Scaling factor for final probability of voicing feature

property pov_offset

This can be used to add an offset to the POV feature

Intended for use in Kaldi’s online decoding as a substitute for CMN (cepstral mean normalization)

property delta_pitch_scale

Term to scale the final delta log-pitch feature

property delta_pitch_noise_stddev

Standard deviation for noise we add to the delta log-pitch

The noise is added before scaling. Its standard deviation should be about the same as the delta_pitch option used for pitch extraction. The purpose is to get rid of peaks in the delta-pitch caused by discretization of pitch values.

property normalization_left_context

Left-context (in frames) for moving window normalization

property normalization_right_context

Right-context (in frames) for moving window normalization

property delta_window

Number of frames on each side of central frame
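To make delta_window concrete, the sketch below computes a regression-style delta using delta_window frames on each side of the central frame, repeating the first and last frames at the edges. This is the common delta-feature formula and is an assumption here; it is not taken from the shennong or Kaldi source, and the function delta is hypothetical.

```python
def delta(values, window=2):
    """Regression-style time derivative over a symmetric window (sketch)."""
    if window < 1:
        raise ValueError('window must be at least 1')
    n = len(values)
    # Normalization term: sum of squared offsets, e.g. 10 for window=2.
    norm = sum(k * k for k in range(-window, window + 1))
    out = []
    for t in range(n):
        acc = 0.0
        for k in range(-window, window + 1):
            # Clamp the index so edge frames are repeated.
            idx = min(max(t + k, 0), n - 1)
            acc += k * values[idx]
        out.append(acc / norm)
    return out


# A linear ramp has a slope of 1 per frame away from the edges.
slopes = delta([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], window=2)
```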

property delay

Number of frames by which the pitch information is delayed

property add_pov_feature

If true, the warped NCCF is added to output features

property add_normalized_log_pitch

If true, the normalized log-pitch is added to output features

Normalization is done with POV-weighted mean subtraction over a 1.5 second window.
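With the default contexts of 75 frames on each side and a 10 ms frame shift, the window spans 151 frames, which is about 1.5 seconds. The sketch below illustrates POV-weighted mean subtraction over such a moving window. It is a simplified reading of the description above: the function name, the handling of a zero total weight and the direct use of POV values as weights are all assumptions.

```python
def normalize_log_pitch(log_pitch, pov, left_context=75,
                        right_context=75, pitch_scale=2.0):
    """Subtract a POV-weighted moving average from the log-pitch (sketch)."""
    n = len(log_pitch)
    out = []
    for t in range(n):
        # The window is truncated at the edges of the signal.
        start = max(0, t - left_context)
        stop = min(n, t + right_context + 1)
        weights = pov[start:stop]
        total = sum(weights)
        if total > 0:
            mean = sum(
                w * x for w, x in zip(weights, log_pitch[start:stop])) / total
        else:
            mean = 0.0  # assumption: no voiced evidence in the window
        out.append(pitch_scale * (log_pitch[t] - mean))
    return out


# The last frame has zero POV, so it does not contribute to any mean.
normalized = normalize_log_pitch(
    [5.0, 5.0, 6.0], [1.0, 1.0, 0.0], left_context=1, right_context=1)
```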

property add_delta_pitch

If true, time derivative of log-pitch is added to output features

property add_raw_log_pitch

If true, the raw log-pitch is added to output features

property ndims

Dimension of the output features frames

get_properties(features)[source]

Return the processor's properties as a dictionary

process(raw_pitch)[source]

Post-process raw pitch data as specified by the options

Parameters

raw_pitch (Features, shape = [n, 2]) – The pitch as extracted by the PitchProcessor.process method

Returns

pitch (Features, shape = [n, 1, 2, 3 or 4]) – The post-processed pitch usable as speech features. The output columns are ‘pov_feature’, ‘normalized_log_pitch’, ‘delta_pitch’ and ‘raw_log_pitch’, in that order, each present only if its respective option is set to True.

Raises

ValueError – If raw_pitch does not have exactly two columns, or if all of the following options are False: ‘add_pov_feature’, ‘add_normalized_log_pitch’, ‘add_delta_pitch’ and ‘add_raw_log_pitch’ (at least one of them must be True).
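The relation between the four add_* options and the output columns can be summarized by the sketch below. The helper output_columns is hypothetical; it only restates the column order and the error condition documented for process, without calling shennong.

```python
def output_columns(add_pov_feature=True, add_normalized_log_pitch=True,
                   add_delta_pitch=True, add_raw_log_pitch=False):
    """List the output columns selected by the options (hypothetical helper)."""
    flags = [
        ('pov_feature', add_pov_feature),
        ('normalized_log_pitch', add_normalized_log_pitch),
        ('delta_pitch', add_delta_pitch),
        ('raw_log_pitch', add_raw_log_pitch)]
    # The documented order is preserved; disabled features are skipped.
    columns = [name for name, enabled in flags if enabled]
    if not columns:
        raise ValueError('at least one of the add_* options must be True')
    return columns


# The default options select the first three features.
default_columns = output_columns()
```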