Pitch estimation using Kaldi
Provides classes to extract pitch from an audio (speech) signal.
This module provides the classes KaldiPitchProcessor and KaldiPitchPostProcessor, which respectively compute the pitch from raw speech and turn it into suitable features: together they produce pitch and probability-of-voicing estimates for use as features in automatic speech recognition systems.
Uses the Kaldi implementation of pitch extraction and postprocessing (see [Ghahremani2014] and [kaldi-pitch]).
Examples
>>> from shennong.audio import Audio
>>> from shennong.processor import (
... KaldiPitchProcessor, KaldiPitchPostProcessor)
>>> audio = Audio.load('./test/data/test.wav')
Initialize a pitch processor with some options. Options can be specified at construction, or after:
>>> processor = KaldiPitchProcessor(frame_shift=0.01, frame_length=0.025)
>>> processor.sample_rate = audio.sample_rate
>>> processor.min_f0 = 20
>>> processor.max_f0 = 500
Options can also be passed as a dictionary:
>>> options = {
... 'sample_rate': audio.sample_rate,
... 'frame_shift': 0.01, 'frame_length': 0.025,
... 'min_f0': 20, 'max_f0': 500}
>>> processor = KaldiPitchProcessor(**options)
Compute the pitch with the specified options. The output is an instance of Features:
>>> pitch = processor.process(audio)
>>> type(pitch)
<class 'shennong.features.Features'>
>>> pitch.shape
(140, 2)
The pitch post-processor works in the same way. It takes the raw pitch as input and outputs features usable by speech processing tools:
>>> postprocessor = KaldiPitchPostProcessor() # use default options
>>> postpitch = postprocessor.process(pitch)
>>> postpitch.shape
(140, 3)
References
- Ghahremani2014
A Pitch Extraction Algorithm Tuned for Automatic Speech Recognition, Pegah Ghahremani, Bagher BabaAli, Daniel Povey, Korbinian Riedhammer, Jan Trmal and Sanjeev Khudanpur, ICASSP 2014
- kaldi-pitch
-
class
shennong.processor.pitch_kaldi.
KaldiPitchProcessor
(sample_rate=16000, frame_shift=0.01, frame_length=0.025, min_f0=50, max_f0=400, soft_min_f0=10, penalty_factor=0.1, lowpass_cutoff=1000, resample_freq=4000, delta_pitch=0.005, nccf_ballast=7000, lowpass_filter_width=1, upsample_filter_width=5)[source]¶ Bases:
shennong.processor.base.FeaturesProcessor
Extracts the (NCCF, pitch) per frame from a speech signal
The output will have as many rows as there are frames, and two columns corresponding to (NCCF, pitch). NCCF is the Normalized Cross Correlation Function.
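As a rough illustration of what the NCCF measures, the sketch below computes a normalized cross-correlation between a window of samples and the same window shifted by a lag. It is a simplified stand-in and not Kaldi's implementation: the function name `nccf`, its `window` argument and the way `ballast` is added to the denominator are assumptions made for this example.

```python
import math


def nccf(samples, lag, window, ballast=0.0):
    """Normalized cross-correlation of a window with its lagged copy.

    Simplified illustration only: `ballast` is a plain additive term in
    the denominator (an assumption, Kaldi scales it differently).
    """
    reference = samples[:window]
    shifted = samples[lag:lag + window]
    if len(reference) < window or len(shifted) < window:
        raise ValueError('not enough samples for this lag and window')

    dot = sum(x * y for x, y in zip(reference, shifted))
    energy = sum(x * x for x in reference) * sum(y * y for y in shifted)
    denominator = math.sqrt(energy + ballast)
    return dot / denominator if denominator > 0 else 0.0


# A sine wave with a period of 20 samples correlates strongly with
# itself shifted by one period, and negatively at half a period.
wave = [math.sin(2 * math.pi * i / 20) for i in range(100)]
at_period = nccf(wave, lag=20, window=40)
at_half_period = nccf(wave, lag=10, window=40)
with_ballast = nccf(wave, lag=20, window=40, ballast=1000.0)
```

In this sketch a larger ballast lowers the value obtained for a given frame, and the effect is strongest when the frame energy is small, which is the behaviour described for the nccf_ballast option.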
-
property
name
¶ Name of the processor
-
property
sample_rate
¶ Waveform sample frequency in Hertz
Must match the sample rate of the signal specified in process
-
property
frame_shift
¶ Frame shift in seconds
-
property
frame_length
¶ Frame length in seconds
-
property
min_f0
¶ Minimum F0 to search for in Hertz
-
property
max_f0
¶ Maximum F0 to search for in Hertz
-
property
soft_min_f0
¶ Minimum F0 to search, applied in a soft way, in Hertz
Must not exceed min_f0
-
property
penalty_factor
¶ Cost factor for F0 change
-
property
lowpass_cutoff
¶ Cutoff frequency for low-pass filter, in Hertz
-
property
resample_freq
¶ Frequency that we down-sample the signal to, in Hertz
Must be more than twice lowpass_cutoff
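The two constraints stated so far (soft_min_f0 must not exceed min_f0, and resample_freq must be more than twice lowpass_cutoff) can be checked before building a processor. The helper below is a hypothetical sketch, not part of shennong; it only uses the option names and default values from the class signature.

```python
def check_pitch_options(min_f0=50, soft_min_f0=10,
                        lowpass_cutoff=1000, resample_freq=4000):
    """Raise ValueError if one of the documented constraints is violated.

    Hypothetical helper written for this example, not a shennong function.
    """
    if soft_min_f0 > min_f0:
        raise ValueError(
            'soft_min_f0 ({}) must not exceed min_f0 ({})'.format(
                soft_min_f0, min_f0))
    if resample_freq <= 2 * lowpass_cutoff:
        raise ValueError(
            'resample_freq ({}) must be more than twice '
            'lowpass_cutoff ({})'.format(resample_freq, lowpass_cutoff))
    return True


# The default values satisfy both constraints.
defaults_ok = check_pitch_options()
```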
-
property
delta_pitch
¶ Smallest relative change in pitch that the algorithm measures
-
property
nccf_ballast
¶ Increasing this factor reduces NCCF for quiet frames
This helps ensure pitch continuity in unvoiced regions
-
get_params
(deep=True)¶ Get parameters for this processor.
- Parameters
deep (boolean, optional) – If True, will return the parameters for this processor and contained subobjects that are processors. Defaults to True.
- Returns
params (mapping of string to any) – Parameter names mapped to their values.
-
get_properties
(**kwargs)¶ Return the processor's properties as a dictionary
-
property
log
¶ Processor logger
-
property
lowpass_filter_width
¶ Integer that determines filter width of lowpass filter
A larger value gives a sharper filter
-
process_all
(utterances, njobs=None, **kwargs)¶ Returns features processed from several input utterances
This function processes the features in parallel jobs.
- Parameters
utterances (Utterances) – The utterances on which to process features.
njobs (int, optional) – The number of parallel jobs to run in background. Defaults to the number of CPU cores available on the machine.
**kwargs (dict, optional) – Extra arguments to be forwarded to the process method. Keys must be the same as for utterances.
- Returns
features (FeaturesCollection) – The computed features on each input signal. The keys of output features are the keys of the input utterances.
- Raises
ValueError – If the njobs parameter is <= 0 or if an entry is missing in optional kwargs.
-
set_logger
(level, formatter='%(levelname)s - %(name)s - %(message)s')¶ Change level and/or format of the processor’s logger
- Parameters
level (str) – The minimum log level handled by the logger (any message below this level will be ignored). Must be ‘debug’, ‘info’, ‘warning’ or ‘error’.
formatter (str, optional) – A string to format the log messages, see https://docs.python.org/3/library/logging.html#formatter-objects. By default, displays level, name and message. Use ‘%(asctime)s - %(levelname)s - %(name)s - %(message)s’ to display time, level, name and message.
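How a level and a formatter string interact can be seen with Python's standard logging module alone. The sketch below does not use shennong: the logger name 'demo' and the messages are invented for the example, and only the format string is taken from the default shown in the signature.

```python
import io
import logging

# Collect the log output in memory so it can be inspected afterwards.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter('%(levelname)s - %(name)s - %(message)s'))

logger = logging.getLogger('demo')  # hypothetical logger name
logger.propagate = False  # keep the output out of the root logger
logger.addHandler(handler)
logger.setLevel(logging.WARNING)

logger.info('not displayed')  # below the minimum level, ignored
logger.warning('low signal energy')  # at the minimum level, displayed

output = stream.getvalue()
```

Here `output` holds a single line, `WARNING - demo - low signal energy`, because the info message falls below the configured level.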
-
set_params
(**params)¶ Set the parameters of this processor.
- Returns
self
- Raises
ValueError – If any given parameter in params is invalid for the processor.
-
property
upsample_filter_width
¶ Integer that determines filter width when upsampling NCCF
-
property
ndims
¶ Dimension of the output features frames
-
process
(signal)[source]¶ Extracts the (NCCF, pitch) from a given speech signal
- Parameters
signal (Audio) – The speech signal on which to estimate the pitch. The signal’s sample rate must match the sample rate specified in the KaldiPitchProcessor options.
- Returns
raw_pitch_features (Features, shape = [nframes, 2]) – The output array has as many rows as there are frames (depends on the specified options frame_shift and frame_length), and two columns corresponding to (NCCF, pitch).
- Raises
ValueError – If the input signal has more than one channel (i.e. is not mono). If sample_rate != signal.sample_rate.
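To give an idea of how frame_shift and frame_length determine the number of output rows, here is a rough estimate that assumes only frames lying entirely inside the signal are kept. This is an assumption made for the sketch; the exact count returned by Kaldi may differ by a few frames, and `estimate_nframes` is a hypothetical helper, not a shennong function.

```python
def estimate_nframes(nsamples, sample_rate=16000,
                     frame_shift=0.01, frame_length=0.025):
    """Count the frames that fit entirely within `nsamples` samples.

    Rough estimate under the assumption stated above, not the exact
    frame count computed by Kaldi.
    """
    shift = int(round(frame_shift * sample_rate))  # samples between frames
    length = int(round(frame_length * sample_rate))  # samples per frame
    if nsamples < length:
        return 0
    return 1 + (nsamples - length) // shift


# One second of audio sampled at 16 kHz, with the default options.
nframes = estimate_nframes(16000)
```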
-
class
shennong.processor.pitch_kaldi.
KaldiPitchPostProcessor
(pitch_scale=2.0, pov_scale=2.0, pov_offset=0.0, delta_pitch_scale=10.0, delta_pitch_noise_stddev=0.005, normalization_left_context=75, normalization_right_context=75, delta_window=2, delay=0, add_pov_feature=True, add_normalized_log_pitch=True, add_delta_pitch=True, add_raw_log_pitch=False)[source]¶ Bases:
shennong.postprocessor.base.FeaturesPostProcessor
Processes the raw (NCCF, pitch) computed by the KaldiPitchProcessor
Turns the raw pitch quantities into usable features. By default it outputs three-dimensional features, (POV-feature, mean-subtracted-log-pitch, delta-of-raw-pitch), but this is configurable in the options. The output has as many rows as the input, i.e. one per frame. The number of columns is the number of feature types requested (3 by default, 4 at most). The four parameters add_pov_feature, add_normalized_log_pitch, add_delta_pitch and add_raw_log_pitch determine which features are created; by default the first three are.
POV stands for Probability of Voicing.
-
get_params
(deep=True)¶ Get parameters for this processor.
- Parameters
deep (boolean, optional) – If True, will return the parameters for this processor and contained subobjects that are processors. Defaults to True.
- Returns
params (mapping of string to any) – Parameter names mapped to their values.
-
property
log
¶ Processor logger
-
process_all
(utterances, njobs=None, **kwargs)¶ Returns features processed from several input utterances
This function processes the features in parallel jobs.
- Parameters
utterances (Utterances) – The utterances on which to process features.
njobs (int, optional) – The number of parallel jobs to run in background. Defaults to the number of CPU cores available on the machine.
**kwargs (dict, optional) – Extra arguments to be forwarded to the process method. Keys must be the same as for utterances.
- Returns
features (FeaturesCollection) – The computed features on each input signal. The keys of output features are the keys of the input utterances.
- Raises
ValueError – If the njobs parameter is <= 0 or if an entry is missing in optional kwargs.
-
set_logger
(level, formatter='%(levelname)s - %(name)s - %(message)s')¶ Change level and/or format of the processor’s logger
- Parameters
level (str) – The minimum log level handled by the logger (any message below this level will be ignored). Must be ‘debug’, ‘info’, ‘warning’ or ‘error’.
formatter (str, optional) – A string to format the log messages, see https://docs.python.org/3/library/logging.html#formatter-objects. By default, displays level, name and message. Use ‘%(asctime)s - %(levelname)s - %(name)s - %(message)s’ to display time, level, name and message.
-
set_params
(**params)¶ Set the parameters of this processor.
- Returns
self
- Raises
ValueError – If any given parameter in params is invalid for the processor.
-
property
name
¶ Name of the processor
-
property
pitch_scale
¶ Scaling factor for the final normalized log-pitch value
-
property
pov_scale
¶ Scaling factor for final probability of voicing feature
-
property
pov_offset
¶ This can be used to add an offset to the POV feature
Intended for use in Kaldi’s online decoding as a substitute for CMN (cepstral mean normalization)
-
property
delta_pitch_scale
¶ Term to scale the final delta log-pitch feature
-
property
delta_pitch_noise_stddev
¶ Standard deviation for noise we add to the delta log-pitch
The noise is added before scaling. Its standard deviation should be about the same as the delta_pitch option of KaldiPitchProcessor. The purpose is to remove peaks in the delta-pitch caused by the discretization of pitch values.
-
property
normalization_left_context
¶ Left-context (in frames) for moving window normalization
-
property
normalization_right_context
¶ Right-context (in frames) for moving window normalization
-
property
delta_window
¶ Number of frames on each side of the central frame used to compute the delta pitch
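To show what a window of frames on each side of the central frame means for a time derivative, the sketch below uses the common regression formula for delta features, with edge values repeated at the boundaries. That formula and the boundary handling are assumptions made for the example; the Kaldi implementation may differ in its details, and `delta` is a hypothetical helper.

```python
def delta(values, window=2):
    """Regression-based time derivative using `window` frames per side.

    Illustrative sketch: frames outside the sequence are replaced by the
    first or last value (an assumption about boundary handling).
    """
    if window < 1:
        raise ValueError('window must be at least 1')
    norm = 2 * sum(n * n for n in range(1, window + 1))
    last = len(values) - 1
    result = []
    for t in range(len(values)):
        acc = 0.0
        for n in range(1, window + 1):
            ahead = values[min(t + n, last)]
            behind = values[max(t - n, 0)]
            acc += n * (ahead - behind)
        result.append(acc / norm)
    return result


# A ramp rising by 1 per frame has a slope of 1 away from the edges.
ramp = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
slopes = delta(ramp, window=2)
```

With window=2, each output value depends on five frames: the central one and two on each side.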
-
property
delay
¶ Number of frames by which the pitch information is delayed
-
property
add_pov_feature
¶ If true, the warped NCCF is added to output features
-
property
add_normalized_log_pitch
¶ If true, the normalized log-pitch is added to output features
Normalization is done with POV-weighted mean subtraction over a 1.5 second window.
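The 1.5 second figure can be related to the default options with a little arithmetic. The sketch below assumes the window spans the left context, the current frame and the right context, and that the pitch was extracted with a frame shift of 0.01 second; `normalization_window_seconds` is a hypothetical helper, not a shennong function.

```python
def normalization_window_seconds(left_context=75, right_context=75,
                                 frame_shift=0.01):
    """Duration of the moving normalization window, in seconds.

    Assumes the window covers the left context, the current frame and
    the right context.
    """
    nframes = left_context + 1 + right_context
    return nframes * frame_shift


# 75 + 1 + 75 = 151 frames of 10 ms each, about 1.5 second.
duration = normalization_window_seconds()
```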
-
property
add_delta_pitch
¶ If true, time derivative of log-pitch is added to output features
-
property
add_raw_log_pitch
¶ If true, the log of the raw pitch is added to output features
-
property
ndims
¶ Dimension of the output features frames
-
process
(raw_pitch)[source]¶ Post-process raw pitch data as specified by the options
- Parameters
raw_pitch (Features, shape = [n, 2]) – The pitch as extracted by the KaldiPitchProcessor.process method
- Returns
pitch (Features, shape = [n, 1, 2, 3 or 4]) – The post-processed pitch usable as speech features. The output columns are ‘pov_feature’, ‘normalized_log_pitch’, ‘delta_pitch’ and ‘raw_log_pitch’, in that order, if their respective options are set to True.
- Raises
ValueError – If raw_pitch does not have exactly two columns, or if all the following options are False: ‘add_pov_feature’, ‘add_normalized_log_pitch’, ‘add_delta_pitch’ and ‘add_raw_log_pitch’ (at least one of them must be True).
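The relation between the four add_* options and the output columns can be summarized by the following sketch. It only mirrors the rules stated above (column order, at least one option enabled) and does not call shennong; `output_columns` is a hypothetical helper.

```python
def output_columns(add_pov_feature=True, add_normalized_log_pitch=True,
                   add_delta_pitch=True, add_raw_log_pitch=False):
    """Return the names of the output columns, in the documented order.

    Hypothetical helper mirroring the rules stated for process().
    """
    selected = [
        ('pov_feature', add_pov_feature),
        ('normalized_log_pitch', add_normalized_log_pitch),
        ('delta_pitch', add_delta_pitch),
        ('raw_log_pitch', add_raw_log_pitch)]
    columns = [name for name, enabled in selected if enabled]
    if not columns:
        raise ValueError('at least one of the add_* options must be True')
    return columns


default_columns = output_columns()  # three columns by default
all_columns = output_columns(add_raw_log_pitch=True)  # four at most
```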
-