Pitch estimation using CREPE
Provides classes to extract pitch from an audio (speech) signal
using the CREPE model (see [Kim2018]). Integrates the CREPE package
(see [crepe-repo]) into shennong API and provides postprocessing
to turn the raw pitch into usable features, using
PitchPostProcessor.
The maximum value of the output of the neural network is used as a heuristic estimate of the voicing probability (POV).
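As a rough illustration of this heuristic, the sketch below takes the largest activation of each frame as its voicing probability. It assumes the network output for one frame is a list of activations, one per pitch bin. The function name and the toy values are illustrative and are not part of shennong or CREPE.

```python
def voicing_probability(activations):
    """Return, for each frame, its largest activation as a POV estimate."""
    return [max(frame) for frame in activations]


# Two toy frames with 4 pitch bins each (the real model has many more bins).
activations = [
    [0.05, 0.10, 0.90, 0.20],  # one clear peak: likely voiced
    [0.10, 0.12, 0.08, 0.11],  # flat and low: likely unvoiced
]
pov = voicing_probability(activations)
print(pov)  # [0.9, 0.12]
```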
Examples
>>> from shennong.audio import Audio
>>> from shennong.processor.crepepitch import (
... CrepePitchProcessor, CrepePitchPostProcessor)
>>> audio = Audio.load('./test/data/test.wav')
Initialize a pitch processor with some options. Options can be specified at construction, or after:
>>> processor = CrepePitchProcessor(
... model_capacity='tiny', frame_shift=0.01)
Compute the pitch with the specified options. The output is an
instance of Features:
>>> pitch = processor.process(audio)
>>> type(pitch)
<class 'shennong.features.Features'>
>>> pitch.shape
(140, 2)
The pitch post-processor works in the same way. It takes the raw pitch as input and outputs features usable by speech processing tools:
>>> postprocessor = CrepePitchPostProcessor() # use default options
>>> postpitch = postprocessor.process(pitch)
>>> postpitch.shape
(140, 3)
References
- Kim2018
CREPE: A Convolutional Representation for Pitch Estimation Jong Wook Kim, Justin Salamon, Peter Li, Juan Pablo Bello. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018. https://arxiv.org/abs/1802.06182
- crepe-repo
-
shennong.processor.crepepitch.predict_voicing(confidence)[source]¶ Find the Viterbi path for voiced versus unvoiced frames.
Adapted from https://github.com/sannawag/crepe.
- Parameters
confidence (np.ndarray [shape=(N,)]) – voicing confidence array, i.e. the confidence in the presence of a pitch
- Returns
voicing_states (np.ndarray [shape=(N,)]) – HMM predictions for each frame's state, 0 if unvoiced, 1 if voiced
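The idea of decoding a voiced/unvoiced path can be sketched with a plain two-state Viterbi decoder. This is not the shennong implementation. The emission model (confidence for the voiced state, 1 - confidence for the unvoiced state), the uniform initial distribution and the `p_stay` self-transition probability are all assumptions chosen for illustration.

```python
import math


def viterbi_voicing(confidence, p_stay=0.99):
    """Decode a 2-state path (0=unvoiced, 1=voiced) from voicing confidences.

    Assumed model: P(obs | voiced) = c, P(obs | unvoiced) = 1 - c, and both
    states keep their value with probability p_stay.
    """
    if not confidence:
        return []
    eps = 1e-12  # avoids log(0)
    stay, switch = math.log(p_stay), math.log(1 - p_stay)
    log_trans = [[stay, switch], [switch, stay]]

    def log_emit(c):
        return [math.log(max(1 - c, eps)), math.log(max(c, eps))]

    # Forward pass: best log-score of any path ending in each state.
    score = [math.log(0.5) + e for e in log_emit(confidence[0])]
    backpointers = []
    for c in confidence[1:]:
        emit = log_emit(c)
        new_score, pointers = [], []
        for state in (0, 1):
            candidates = [score[prev] + log_trans[prev][state]
                          for prev in (0, 1)]
            best = 0 if candidates[0] >= candidates[1] else 1
            pointers.append(best)
            new_score.append(candidates[best] + emit[state])
        score = new_score
        backpointers.append(pointers)

    # Backward pass: follow the stored pointers from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


confidence = [0.9, 0.8, 0.4, 0.9, 0.1, 0.05, 0.1]
states = viterbi_voicing(confidence)
print(states)  # [1, 1, 1, 1, 0, 0, 0]
```

Note that the third frame (confidence 0.4) stays voiced: under these assumed transition probabilities, switching state twice costs more than one weak observation.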
-
class
shennong.processor.crepepitch.CrepePitchProcessor(model_capacity='full', viterbi=True, center=True, frame_shift=0.01, frame_length=0.025)[source]¶ Bases: shennong.processor.base.FeaturesProcessor
Extracts the (POV, pitch) per frame from a speech signal
This processor uses the pre-trained CREPE model.
The output will have as many rows as there are frames, and two columns corresponding to (POV, pitch). POV is the Probability of Voicing.
-
property
name¶ Name of the processor
-
property
model_capacity¶ String specifying the model capacity to use
Must be ‘tiny’, ‘small’, ‘medium’, ‘large’ or ‘full’
-
property
viterbi¶ Whether to apply Viterbi smoothing to the estimated pitch curve
-
property
center¶ Whether to center the window on the current frame
-
property
frame_shift¶ Frame shift in seconds for running pitch estimation
-
property
frame_length¶ Frame length in seconds
-
get_params(deep=True)¶ Get parameters for this processor.
- Parameters
deep (boolean, optional) – If True, will return the parameters for this processor and contained subobjects that are processors. Defaults to True.
- Returns
params (mapping of string to any) – Parameter names mapped to their values.
-
get_properties()¶ Return the processor's properties as a dictionary
-
property
log¶ Processor logger
-
process_all(signals, njobs=None)¶ Returns features processed from several input signals
This function processes the features in parallel jobs.
- Parameters
signals (dict of Audio) – A dictionary of input audio signals to process features on, where the keys are item names and values are audio signals.
njobs (int, optional) – The number of parallel jobs to run in background. Defaults to the number of CPU cores available on the machine.
- Returns
features (FeaturesCollection) – The computed features on each input signal. The keys of output features are the keys of the input signals.
- Raises
ValueError – If the njobs parameter is <= 0
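The contract described here (one output per input key, a default job count taken from the machine, and a ValueError for a non-positive njobs) can be sketched with the standard library. This is only an illustration: the helper below is hypothetical, takes the processing function as an explicit argument, and uses a thread pool, which is an assumption and not how shennong runs its jobs.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def process_all(process, signals, njobs=None):
    """Apply `process` to every value of `signals`, keeping the keys."""
    if njobs is None:
        njobs = os.cpu_count() or 1  # default: number of available cores
    if njobs <= 0:
        raise ValueError(f'njobs must be strictly positive, it is {njobs}')
    with ThreadPoolExecutor(max_workers=njobs) as pool:
        # map() yields results in input order, so keys and values line up.
        results = pool.map(process, signals.values())
        return dict(zip(signals.keys(), results))


# Toy stand-in: the "signals" are sample lists, the "processor" counts them.
signals = {'utt1': [0.1, 0.2, 0.3], 'utt2': [0.0, 0.5]}
features = process_all(len, signals, njobs=2)
print(features)  # {'utt1': 3, 'utt2': 2}
```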
-
set_logger(level, formatter='%(levelname)s - %(name)s - %(message)s')¶ Change level and/or format of the processor’s logger
- Parameters
level (str) – The minimum log level handled by the logger (any message below this level will be ignored). Must be ‘debug’, ‘info’, ‘warning’ or ‘error’.
formatter (str, optional) – A string to format the log messages, see https://docs.python.org/3/library/logging.html#formatter-objects. By default displays level, name and message. Use ‘%(asctime)s - %(levelname)s - %(name)s - %(message)s’ to display time, level, name and message.
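The effect of a level and a format string can be checked with Python's standard logging module alone. The sketch below does not call shennong; the logger name is hypothetical and the in-memory stream is only there to make the output easy to inspect.

```python
import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter('%(levelname)s - %(name)s - %(message)s'))

logger = logging.getLogger('demo-processor')  # hypothetical logger name
logger.propagate = False  # keep the demo isolated from the root logger
logger.addHandler(handler)
logger.setLevel(logging.WARNING)  # plays the role of level='warning'

logger.info('dropped: less severe than the minimum level')
logger.warning('kept: at the minimum level')
print(stream.getvalue(), end='')
# WARNING - demo-processor - kept: at the minimum level
```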
-
set_params(**params)¶ Set the parameters of this processor.
- Returns
self
- Raises
ValueError – If any given parameter in params is invalid for the processor.
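A minimal sketch of this get/set contract, using a hypothetical toy class and not the shennong base class: unknown names raise ValueError, and returning self allows chained calls.

```python
class ToyProcessor:
    """Hypothetical processor with two parameters, for illustration only."""

    def __init__(self, frame_shift=0.01, frame_length=0.025):
        self.frame_shift = frame_shift
        self.frame_length = frame_length

    def get_params(self):
        return {'frame_shift': self.frame_shift,
                'frame_length': self.frame_length}

    def set_params(self, **params):
        valid = self.get_params()
        for name, value in params.items():
            if name not in valid:
                raise ValueError(f'invalid parameter {name!r}')
            setattr(self, name, value)
        return self  # returning self allows chained calls


processor = ToyProcessor().set_params(frame_shift=0.02)
print(processor.get_params())
# {'frame_shift': 0.02, 'frame_length': 0.025}
```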
-
property
sample_rate¶ CREPE operates at 16kHz
-
property
ndims¶ Dimension of the output features frames
-
process(audio)[source]¶ Extracts the (POV, pitch) from a given speech audio using CREPE.
- Parameters
audio (Audio) – The speech signal on which to estimate the pitch. Will be transparently resampled at 16kHz if needed.
- Returns
raw_pitch_features (Features, shape = [nframes, 2]) – The output array has two columns corresponding to (POV, pitch). The output from the crepe module is reshaped to match the specified options frame_shift and frame_length.
- Raises
ValueError – If the input signal has more than one channel (i.e. is not mono).
-
class
shennong.processor.crepepitch.CrepePitchPostProcessor(pitch_scale=2.0, delta_pitch_scale=10.0, delta_pitch_noise_stddev=0.005, normalization_left_context=75, normalization_right_context=75, delta_window=2, delay=0, add_pov_feature=True, add_normalized_log_pitch=True, add_delta_pitch=True, add_raw_log_pitch=False)[source]¶ Bases: shennong.processor.pitch.PitchPostProcessor
Processes the raw (POV, pitch) computed by the CrepePitchProcessor
Turns the raw pitch quantities into usable features. Converts the POV into NCCF usable by PitchPostProcessor, then removes the pitch at frames with the worst POV (according to the pov_threshold or the proportion_voiced option) and replaces it with interpolated values, and finally sends this (NCCF, pitch) pair to shennong.processor.pitch.PitchPostProcessor.process().
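The interpolation step can be pictured with the sketch below, which replaces the pitch of low-POV frames by a linear interpolation between the nearest reliable neighbours. It is an assumption-laden illustration, not the shennong code: the function name, the 0.5 default threshold and the edge handling (holding the nearest reliable value) are all hypothetical, and the POV to NCCF conversion is not covered.

```python
def interpolate_unreliable(pov, pitch, pov_threshold=0.5):
    """Linearly interpolate pitch at frames whose POV is below the threshold."""
    reliable = [i for i, p in enumerate(pov) if p >= pov_threshold]
    if not reliable:
        raise ValueError('no frame reaches the POV threshold')
    result = list(pitch)
    for i in range(len(pitch)):
        if pov[i] >= pov_threshold:
            continue  # reliable frame: keep the estimated pitch
        left = max((j for j in reliable if j < i), default=None)
        right = min((j for j in reliable if j > i), default=None)
        if left is None:
            result[i] = pitch[right]  # before the first reliable frame
        elif right is None:
            result[i] = pitch[left]  # after the last reliable frame
        else:
            weight = (i - left) / (right - left)
            result[i] = pitch[left] + weight * (pitch[right] - pitch[left])
    return result


pov = [0.9, 0.2, 0.1, 0.3, 0.8, 0.3]
pitch = [100.0, 400.0, 50.0, 70.0, 140.0, 20.0]
smoothed = interpolate_unreliable(pov, pitch)
print(smoothed)  # [100.0, 110.0, 120.0, 130.0, 140.0, 140.0]
```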
property
add_delta_pitch¶ If true, time derivative of log-pitch is added to output features
-
property
add_normalized_log_pitch¶ If true, the normalized log-pitch is added to output features
Normalization is done with POV-weighted mean subtraction over a 1.5 second window.
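A weighted moving-mean subtraction of this kind might look like the sketch below. With the default contexts of 75 frames on each side and a 10 ms frame shift, the window spans 151 frames, or about 1.5 seconds. Using the POV directly as the weight is an assumption made for illustration; the actual weighting function may differ.

```python
def normalize_log_pitch(log_pitch, pov, left_context=75, right_context=75):
    """Subtract from each frame the POV-weighted mean of its moving window."""
    normalized = []
    for t in range(len(log_pitch)):
        start = max(0, t - left_context)
        stop = min(len(log_pitch), t + right_context + 1)
        weights = pov[start:stop]
        total = sum(weights)
        if total > 0:
            mean = sum(w * x for w, x
                       in zip(weights, log_pitch[start:stop])) / total
        else:
            mean = 0.0  # no voiced evidence in the window: leave unchanged
        normalized.append(log_pitch[t] - mean)
    return normalized


# Tiny example with a context of one frame on each side. The last frame has
# a POV of 0, so it never contributes to any window mean.
log_pitch = [4.0, 6.0, 9.0]
pov = [1.0, 1.0, 0.0]
result = normalize_log_pitch(log_pitch, pov, left_context=1, right_context=1)
print(result)  # [-1.0, 1.0, 3.0]
```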
-
property
add_pov_feature¶ If true, the warped NCCF is added to output features
-
property
add_raw_log_pitch¶ If true, the raw log-pitch is added to output features
-
property
delay¶ Number of frames by which the pitch information is delayed
-
property
delta_pitch_noise_stddev¶ Standard deviation for noise we add to the delta log-pitch
The noise is added before scaling. Should be about the same as delta-pitch option to pitch creation. The purpose is to get rid of peaks in the delta-pitch caused by discretization of pitch values.
-
property
delta_pitch_scale¶ Term to scale the final delta log-pitch feature
-
property
delta_window¶ Number of frames on each side of central frame
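To show what a window of frames on each side of the central frame means for a time derivative, here is a sketch of the common regression-based delta formula, d[t] = Σ n·(x[t+n] − x[t−n]) / (2·Σ n²) for n = 1..window. Whether shennong uses exactly this formula and this edge handling (repeating the first and last frames) is an assumption.

```python
def delta(values, window=2):
    """Regression-based time derivative over `window` frames on each side."""
    denom = 2 * sum(n * n for n in range(1, window + 1))
    last = len(values) - 1
    out = []
    for t in range(len(values)):
        acc = 0.0
        for n in range(1, window + 1):
            ahead = values[min(t + n, last)]  # repeat the last frame at the end
            behind = values[max(t - n, 0)]  # repeat the first frame at the start
            acc += n * (ahead - behind)
        out.append(acc / denom)
    return out


# On a ramp with slope 1, interior frames get a delta of exactly 1.0,
# while frames near the edges are damped by the repeated boundary values.
ramp = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
deltas = delta(ramp, window=2)
print(deltas[2], deltas[3])  # 1.0 1.0
```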
-
get_params(deep=True)¶ Get parameters for this processor.
- Parameters
deep (boolean, optional) – If True, will return the parameters for this processor and contained subobjects that are processors. Defaults to True.
- Returns
params (mapping of string to any) – Parameter names mapped to their values.
-
property
log¶ Processor logger
-
property
ndims¶ Dimension of the output features frames
-
property
normalization_left_context¶ Left-context (in frames) for moving window normalization
-
property
normalization_right_context¶ Right-context (in frames) for moving window normalization
-
property
pitch_scale¶ Scaling factor for the final normalized log-pitch value
-
property
pov_offset¶ This can be used to add an offset to the POV feature
Intended for use in Kaldi’s online decoding as a substitute for CMN (cepstral mean normalization)
-
property
pov_scale¶ Scaling factor for final probability of voicing feature
-
process_all(signals, njobs=None)¶ Returns features processed from several input signals
This function processes the features in parallel jobs.
- Parameters
signals (dict of Audio) – A dictionary of input audio signals to process features on, where the keys are item names and values are audio signals.
njobs (int, optional) – The number of parallel jobs to run in background. Defaults to the number of CPU cores available on the machine.
- Returns
features (FeaturesCollection) – The computed features on each input signal. The keys of output features are the keys of the input signals.
- Raises
ValueError – If the njobs parameter is <= 0
-
set_logger(level, formatter='%(levelname)s - %(name)s - %(message)s')¶ Change level and/or format of the processor’s logger
- Parameters
level (str) – The minimum log level handled by the logger (any message below this level will be ignored). Must be ‘debug’, ‘info’, ‘warning’ or ‘error’.
formatter (str, optional) – A string to format the log messages, see https://docs.python.org/3/library/logging.html#formatter-objects. By default displays level, name and message. Use ‘%(asctime)s - %(levelname)s - %(name)s - %(message)s’ to display time, level, name and message.
-
set_params(**params)¶ Set the parameters of this processor.
- Returns
self
- Raises
ValueError – If any given parameter in params is invalid for the processor.
-
property
name¶ Name of the processor
-
process(crepe_pitch)[source]¶ Post-process raw pitch data as specified by the options
- Parameters
crepe_pitch (Features, shape = [n, 2]) – The pitch as extracted by the CrepePitchProcessor.process method
- Returns
pitch (Features, shape = [n, 1, 2, 3 or 4]) – The post-processed pitch usable as speech features. The output columns are ‘pov_feature’, ‘normalized_log_pitch’, ‘delta_pitch’ and ‘raw_log_pitch’, in that order, if their respective options are set to True.
- Raises
ValueError – If after interpolation some pitch values are not positive. If crepe_pitch does not have exactly two columns. If all the following options are False: ‘add_pov_feature’, ‘add_normalized_log_pitch’, ‘add_delta_pitch’ and ‘add_raw_log_pitch’ (at least one of them must be True).
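The relation between the add_* options and the output columns can be summarized by the small sketch below. The helper is hypothetical and only mirrors the documented column order, defaults and error condition; it does not compute any feature.

```python
def output_columns(add_pov_feature=True, add_normalized_log_pitch=True,
                   add_delta_pitch=True, add_raw_log_pitch=False):
    """List the output column names, in order, for the given options."""
    options = [
        ('pov_feature', add_pov_feature),
        ('normalized_log_pitch', add_normalized_log_pitch),
        ('delta_pitch', add_delta_pitch),
        ('raw_log_pitch', add_raw_log_pitch),
    ]
    columns = [name for name, enabled in options if enabled]
    if not columns:
        raise ValueError('at least one of the add_* options must be True')
    return columns


default_columns = output_columns()
print(default_columns)
# ['pov_feature', 'normalized_log_pitch', 'delta_pitch']
```

With the default options three columns are kept, which matches the (140, 3) shape shown in the example at the top of this page.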
-
property