Pitch estimation using CREPE

Provides classes to extract pitch from an audio (speech) signal using the CREPE model (see [Kim2018]). Integrates the CREPE package (see [crepe-repo]) into the shennong API and provides post-processing to turn the raw pitch into usable features, using PitchPostProcessor.

The maximum value of the output of the neural network is used as a heuristic estimate of the voicing probability (POV).

Examples

>>> from shennong.audio import Audio
>>> from shennong.processor import (
...     CrepePitchProcessor, CrepePitchPostProcessor)
>>> audio = Audio.load('./test/data/test.wav')

Initialize a pitch processor with some options. Options can be specified at construction, or after:

>>> processor = CrepePitchProcessor(
...   model_capacity='tiny', frame_shift=0.01)

Compute the pitch with the specified options, the output is an instance of Features:

>>> pitch = processor.process(audio)
>>> type(pitch)
<class 'shennong.features.Features'>
>>> pitch.shape
(140, 2)

The pitch post-processor works in the same way: its input is the raw pitch and its output is a set of features usable by speech processing tools:

>>> postprocessor = CrepePitchPostProcessor()  # use default options
>>> postpitch = postprocessor.process(pitch)
>>> postpitch.shape
(140, 3)

References

Kim2018

CREPE: A Convolutional Representation for Pitch Estimation. Jong Wook Kim, Justin Salamon, Peter Li, Juan Pablo Bello. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018. https://arxiv.org/abs/1802.06182

crepe-repo

https://github.com/marl/crepe

shennong.processor.pitch_crepe.predict_voicing(confidence)[source]

Find the Viterbi path for voiced versus unvoiced frames.

Adapted from https://github.com/sannawag/crepe.

Parameters

confidence (np.ndarray [shape=(N,)]) – voicing confidence array, i.e. the confidence in the presence of a pitch

Returns

voicing_states (np.ndarray [shape=(N,)]) – HMM prediction for each frame's state: 0 if unvoiced, 1 if voiced
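The HMM parameters used by crepe are internal to that package; as a rough illustration only, a two-state Viterbi decoder over a voicing confidence curve can be sketched as follows (the switch_prob transition probability is an assumption for this sketch, not crepe's actual value):

```python
import numpy as np

def predict_voicing_sketch(confidence, switch_prob=0.01):
    """Toy two-state (unvoiced=0, voiced=1) Viterbi smoothing of a confidence curve."""
    # Emission: P(voiced) ~ confidence, P(unvoiced) ~ 1 - confidence
    emission = np.stack([1.0 - confidence, confidence], axis=1)  # (N, 2)
    # Transitions strongly favour staying in the same state
    transition = np.array([[1 - switch_prob, switch_prob],
                           [switch_prob, 1 - switch_prob]])
    n = len(confidence)
    log_emit = np.log(emission + 1e-12)
    log_trans = np.log(transition)

    # Viterbi recursion in log-space
    delta = np.zeros((n, 2))
    psi = np.zeros((n, 2), dtype=int)
    delta[0] = np.log(0.5) + log_emit[0]
    for t in range(1, n):
        scores = delta[t - 1][:, None] + log_trans  # (prev_state, cur_state)
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = np.max(scores, axis=0) + log_emit[t]

    # Backtrack the best path
    states = np.zeros(n, dtype=int)
    states[-1] = np.argmax(delta[-1])
    for t in range(n - 2, -1, -1):
        states[t] = psi[t + 1, states[t + 1]]
    return states
```

With a low switch probability, a single low-confidence frame surrounded by high-confidence frames stays classified as voiced, which is the smoothing effect the Viterbi path provides.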

class shennong.processor.pitch_crepe.CrepePitchProcessor(model_capacity='full', viterbi=True, center=True, frame_shift=0.01, frame_length=0.025)[source]

Bases: shennong.processor.base.FeaturesProcessor

Extracts the (POV, pitch) per frame from a speech signal

This processor uses the pre-trained CREPE model. The output will have as many rows as there are frames, and two columns corresponding to (POV, pitch). POV is the Probability of Voicing.

property name

Name of the processor

property model_capacity

String specifying the model capacity to use

Must be ‘tiny’, ‘small’, ‘medium’, ‘large’ or ‘full’. Sets the model’s capacity multiplier: 4 (tiny), 8 (small), 16 (medium), 24 (large) or 32 (full). ‘full’ uses the model size specified in [Kim2018]; the others use a reduced number of filters in each convolutional layer, resulting in a smaller model that is faster to evaluate, at the cost of slightly reduced pitch estimation accuracy.

property viterbi

Whether to apply viterbi smoothing to the estimated pitch curve

property center

Whether to center the window on the current frame.

When True, the output frame t is centered at audio[t * hop_length]. When False, the frame begins at audio[t * hop_length].
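As a toy illustration of the two conventions, assuming the 16 kHz operating rate and the default frame_shift and frame_length (in practice a negative start index under center=True would be handled by padding the signal):

```python
sample_rate = 16000
frame_shift, frame_length = 0.01, 0.025       # seconds (the defaults)
hop = int(frame_shift * sample_rate)          # 160 samples between frames
win = int(frame_length * sample_rate)         # 400 samples per window

def window_bounds(t, center=True):
    """(start, stop) sample indices of frame t under each convention."""
    start = t * hop - win // 2 if center else t * hop
    return start, start + win
```

For frame t=1, `window_bounds(1, center=True)` gives (-40, 360), a window centered at sample 160, while `window_bounds(1, center=False)` gives (160, 560), a window beginning at sample 160.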

property frame_shift

Frame shift in seconds for running pitch estimation

property frame_length

Frame length in seconds

get_params(deep=True)

Get parameters for this processor.

Parameters

deep (boolean, optional) – If True, will return the parameters for this processor and contained subobjects that are processors. Defaults to True.

Returns

params (mapping of string to any) – Parameter names mapped to their values.
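The get_params/set_params pair follows the scikit-learn convention. The contract can be sketched with a minimal stand-in class (SketchProcessor is hypothetical, not part of shennong):

```python
class SketchProcessor:
    """Minimal stand-in illustrating the get_params/set_params contract."""
    def __init__(self, model_capacity='full', frame_shift=0.01):
        self.model_capacity = model_capacity
        self.frame_shift = frame_shift

    def get_params(self):
        # parameter names mapped to their current values
        return {'model_capacity': self.model_capacity,
                'frame_shift': self.frame_shift}

    def set_params(self, **params):
        # reject unknown parameters, update known ones, return self
        for name, value in params.items():
            if name not in self.get_params():
                raise ValueError(f'invalid parameter: {name}')
            setattr(self, name, value)
        return self
```

Returning self from set_params allows chained calls such as `SketchProcessor().set_params(frame_shift=0.02).get_params()`.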

get_properties(**kwargs)

Return the processors properties as a dictionary

property log

Processor logger

process_all(utterances, njobs=None, **kwargs)

Returns features processed from several input utterances

This function processes the features in parallel jobs.

Parameters
  • utterances (Utterances) – The utterances on which to process features.

  • njobs (int, optional) – The number of parallel jobs to run in the background. Defaults to the number of CPU cores available on the machine.

  • **kwargs (dict, optional) – Extra arguments to be forwarded to the process method. Keys must be the same as for utterances.

Returns

features (FeaturesCollection) – The computed features on each input signal. The keys of output features are the keys of the input utterances.

Raises

ValueError – If the njobs parameter is <= 0 or if an entry is missing in optional kwargs.

property sample_rate

Processing sample rate in Hz: CREPE operates at 16 kHz

set_logger(level, formatter='%(levelname)s - %(name)s - %(message)s')

Change level and/or format of the processor’s logger

Parameters
  • level (str) – The minimum log level handled by the logger (any message above this level will be ignored). Must be ‘debug’, ‘info’, ‘warning’ or ‘error’.

  • formatter (str, optional) – A string to format the log messages, see https://docs.python.org/3/library/logging.html#formatter-objects. By default display level and message. Use ‘%(asctime)s - %(levelname)s - %(name)s - %(message)s’ to display time, level, name and message.

set_params(**params)

Set the parameters of this processor.

Returns

self

Raises

ValueError – If any given parameter in params is invalid for the processor.

property ndims

Dimension of the output features frames

times(nframes)[source]

Returns the time label for the rows given by process()

process(audio)[source]

Extracts the (POV, pitch) from a given speech audio using CREPE.

Parameters

audio (Audio) – The speech signal on which to estimate the pitch. Will be transparently resampled at 16kHz if needed.

Returns

raw_pitch_features (Features, shape = [nframes, 2]) – The output array has two columns corresponding to (POV, pitch). The output from the crepe module is reshaped to match the specified options frame_shift and frame_length.

Raises

ValueError – If the input signal has more than one channel (i.e. is not mono).

class shennong.processor.pitch_crepe.CrepePitchPostProcessor(pitch_scale=2.0, delta_pitch_scale=10.0, delta_pitch_noise_stddev=0.005, normalization_left_context=75, normalization_right_context=75, delta_window=2, delay=0, add_pov_feature=True, add_normalized_log_pitch=True, add_delta_pitch=True, add_raw_log_pitch=False)[source]

Bases: shennong.processor.pitch_kaldi.KaldiPitchPostProcessor

Processes the raw (POV, pitch) computed by the CrepePitchProcessor

Turns the raw pitch quantities into usable features. Converts the POV into an NCCF usable by PitchPostProcessor, then removes the pitch at frames with the worst POV (according to the pov_threshold or the proportion_voiced option), replaces them with interpolated values, and finally sends the resulting (NCCF, pitch) pair to shennong.processor.pitch.PitchPostProcessor.process().
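The POV-to-NCCF warping and the exact thresholding logic are internal to shennong; the interpolation step alone can be sketched with numpy (the 0.5 threshold below is an arbitrary assumption for the sketch, not a shennong default):

```python
import numpy as np

def interpolate_low_pov(pov, pitch, threshold=0.5):
    """Replace pitch at low-POV frames by linear interpolation from reliable frames."""
    pitch = pitch.copy()
    good = pov >= threshold
    if good.any():
        idx = np.arange(len(pitch))
        # interpolate the unreliable frames from the reliable ones
        pitch[~good] = np.interp(idx[~good], idx[good], pitch[good])
    return pitch
```

For example, a spurious 500 Hz estimate at a low-POV frame sitting between reliable 100 Hz and 200 Hz frames is replaced by the interpolated value 150 Hz.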

property add_delta_pitch

If true, time derivative of log-pitch is added to output features

property add_normalized_log_pitch

If true, the normalized log-pitch is added to output features

Normalization is done with POV-weighted mean subtraction over 1.5 second window.
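With the default 10 ms frame shift, the default left and right contexts of 75 frames give roughly this 1.5 second window. A minimal numpy sketch of POV-weighted moving mean subtraction (a simplification of the Kaldi implementation, which this processor delegates to):

```python
import numpy as np

def normalized_log_pitch(log_pitch, pov, left=75, right=75):
    """Subtract a POV-weighted moving average from the log-pitch."""
    n = len(log_pitch)
    out = np.empty(n)
    for t in range(n):
        lo, hi = max(0, t - left), min(n, t + right + 1)
        # frames with higher voicing probability weigh more in the local mean
        out[t] = log_pitch[t] - np.average(log_pitch[lo:hi], weights=pov[lo:hi])
    return out
```

A constant log-pitch curve normalizes to zero everywhere, whatever the POV weights, since the weighted local mean then equals the value itself.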

property add_pov_feature

If true, the warped NCCF is added to output features

property add_raw_log_pitch

If true, the raw log-pitch is added to output features

property delay

Number of frames by which the pitch information is delayed

property delta_pitch_noise_stddev

Standard deviation for noise we add to the delta log-pitch

The stddev is added before scaling. Should be about the same as delta-pitch option to pitch creation. The purpose is to get rid of peaks in the delta-pitch caused by discretization of pitch values.

property delta_pitch_scale

Term to scale the final delta log-pitch feature

property delta_window

Number of frames on each side of central frame
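A common way to compute such deltas is a linear regression over the central frame and its window neighbours; a minimal numpy sketch, assuming Kaldi-style regression coefficients and simple edge padding:

```python
import numpy as np

def delta(feat, window=2):
    """Delta features by linear regression over +-window neighbouring frames."""
    # regression normalizer: 2 * sum of squared offsets
    denom = 2 * sum(i * i for i in range(1, window + 1))
    # repeat edge values so every frame has a full window
    padded = np.pad(feat, (window, window), mode='edge')
    return np.array([
        sum(i * (padded[t + window + i] - padded[t + window - i])
            for i in range(1, window + 1)) / denom
        for t in range(len(feat))])
```

On a linearly increasing curve the interior delta values come out exactly equal to the slope, which is the sanity check for the regression coefficients.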

get_params(deep=True)

Get parameters for this processor.

Parameters

deep (boolean, optional) – If True, will return the parameters for this processor and contained subobjects that are processors. Defaults to True.

Returns

params (mapping of string to any) – Parameter names mapped to their values.

property log

Processor logger

property ndims

Dimension of the output features frames

property normalization_left_context

Left-context (in frames) for moving window normalization

property normalization_right_context

Right-context (in frames) for moving window normalization

property pitch_scale

Scaling factor for the final normalized log-pitch value

property pov_offset

This can be used to add an offset to the POV feature

Intended for use in Kaldi’s online decoding as a substitute for CMN (cepstral mean normalization)

property pov_scale

Scaling factor for final probability of voicing feature

process_all(utterances, njobs=None, **kwargs)

Returns features processed from several input utterances

This function processes the features in parallel jobs.

Parameters
  • utterances (Utterances) – The utterances on which to process features.

  • njobs (int, optional) – The number of parallel jobs to run in the background. Defaults to the number of CPU cores available on the machine.

  • **kwargs (dict, optional) – Extra arguments to be forwarded to the process method. Keys must be the same as for utterances.

Returns

features (FeaturesCollection) – The computed features on each input signal. The keys of output features are the keys of the input utterances.

Raises

ValueError – If the njobs parameter is <= 0 or if an entry is missing in optional kwargs.

set_logger(level, formatter='%(levelname)s - %(name)s - %(message)s')

Change level and/or format of the processor’s logger

Parameters
  • level (str) – The minimum log level handled by the logger (any message above this level will be ignored). Must be ‘debug’, ‘info’, ‘warning’ or ‘error’.

  • formatter (str, optional) – A string to format the log messages, see https://docs.python.org/3/library/logging.html#formatter-objects. By default display level and message. Use ‘%(asctime)s - %(levelname)s - %(name)s - %(message)s’ to display time, level, name and message.

set_params(**params)

Set the parameters of this processor.

Returns

self

Raises

ValueError – If any given parameter in params is invalid for the processor.

property name

Name of the processor

get_properties(features)[source]

Return the processors properties as a dictionary

process(crepe_pitch)[source]

Post process a raw pitch data as specified by the options

Parameters

crepe_pitch (Features, shape = [n, 2]) – The pitch as extracted by the CrepePitchProcessor.process method

Returns

pitch (Features, shape = [n, 1 to 4]) – The post-processed pitch usable as speech features. The output columns are ‘pov_feature’, ‘normalized_log_pitch’, ‘delta_pitch’ and ‘raw_log_pitch’, in that order, if their respective options are set to True.

Raises

ValueError – If after interpolation some pitch values are not positive. If raw_pitch has not exactly two columns. If all the following options are False: ‘add_pov_feature’, ‘add_normalized_log_pitch’, ‘add_delta_pitch’ and ‘add_raw_log_pitch’ (at least one of them must be True).
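The number of output columns is simply the count of enabled add_* options; a hypothetical helper (not part of the shennong API) makes the column-count and Raises conditions concrete:

```python
def output_ndims(add_pov_feature=True, add_normalized_log_pitch=True,
                 add_delta_pitch=True, add_raw_log_pitch=False):
    """One output column per enabled feature, in the documented column order."""
    ndims = sum((add_pov_feature, add_normalized_log_pitch,
                 add_delta_pitch, add_raw_log_pitch))
    if ndims == 0:
        raise ValueError('at least one feature must be enabled')
    return ndims
```

With the default options this gives 3 columns, consistent with the (140, 3) shape shown in the examples above; enabling add_raw_log_pitch as well gives 4.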