Pitch estimation using CREPE

Provides classes to extract pitch from an audio (speech) signal using the CREPE model (see [Kim2018]). Integrates the CREPE package (see [crepe-repo]) into the shennong API and provides post-processing to turn the raw pitch into usable features, using PitchPostProcessor.

The maximum value of the output of the neural network is used as a heuristic estimate of the voicing probability (POV).

Examples

>>> from shennong.audio import Audio
>>> from shennong.processor import (
...     CrepePitchProcessor, CrepePitchPostProcessor)
>>> audio = Audio.load('./test/data/test.wav')

Initialize a pitch processor with some options. Options can be specified at construction, or after:

>>> processor = CrepePitchProcessor(
...   model_capacity='tiny', frame_shift=0.01)

Compute the pitch with the specified options, the output is an instance of Features:

>>> pitch = processor.process(audio)
>>> type(pitch)
<class 'shennong.features.Features'>
>>> pitch.shape
(140, 2)

The pitch post-processor works in the same way: its input is the raw pitch and its output is a set of features usable by speech processing tools:

>>> postprocessor = CrepePitchPostProcessor()  # use default options
>>> postpitch = postprocessor.process(pitch)
>>> postpitch.shape
(140, 3)

References

Kim2018

CREPE: A Convolutional Representation for Pitch Estimation. Jong Wook Kim, Justin Salamon, Peter Li, Juan Pablo Bello. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018. https://arxiv.org/abs/1802.06182

crepe-repo

https://github.com/marl/crepe

shennong.processor.pitch_crepe.predict_voicing(confidence)[source]

Find the Viterbi path for voiced versus unvoiced frames.

Adapted from https://github.com/sannawag/crepe.

Parameters

confidence (np.ndarray [shape=(N,)]) – voicing confidence array, i.e. the confidence in the presence of a pitch

Returns

voicing_states (np.ndarray [shape=(N,)]) – HMM prediction for each frame's state: 0 if unvoiced, 1 if voiced
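The HMM parameters used by crepe are internal to that package; as a rough illustration only, a two-state Viterbi decoder over a voicing confidence curve can be sketched as follows (the switch_prob transition probability is an assumption for this sketch, not crepe's actual value):

```python
import numpy as np

def predict_voicing_sketch(confidence, switch_prob=0.01):
    """Toy two-state (unvoiced=0, voiced=1) Viterbi smoothing of a confidence curve."""
    # Emission: P(voiced) ~ confidence, P(unvoiced) ~ 1 - confidence
    emission = np.stack([1.0 - confidence, confidence], axis=1)  # (N, 2)
    # Transitions strongly favour staying in the same state
    transition = np.array([[1 - switch_prob, switch_prob],
                           [switch_prob, 1 - switch_prob]])
    n = len(confidence)
    log_emit = np.log(emission + 1e-12)
    log_trans = np.log(transition)

    # Viterbi recursion in log-space
    delta = np.zeros((n, 2))
    psi = np.zeros((n, 2), dtype=int)
    delta[0] = np.log(0.5) + log_emit[0]
    for t in range(1, n):
        scores = delta[t - 1][:, None] + log_trans  # (prev_state, cur_state)
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = np.max(scores, axis=0) + log_emit[t]

    # Backtrack the best path
    states = np.zeros(n, dtype=int)
    states[-1] = np.argmax(delta[-1])
    for t in range(n - 2, -1, -1):
        states[t] = psi[t + 1, states[t + 1]]
    return states
```

With a low switch probability, a single low-confidence frame surrounded by high-confidence frames stays classified as voiced, which is the smoothing effect the Viterbi path provides.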

class shennong.processor.pitch_crepe.CrepePitchProcessor(model_capacity='full', viterbi=True, center=True, frame_shift=0.01, frame_length=0.025)[source]

Bases: shennong.processor.base.FeaturesProcessor

Extracts the (POV, pitch) per frame from a speech signal

This processor uses the pre-trained CREPE model. The output will have as many rows as there are frames, and two columns corresponding to (POV, pitch). POV is the Probability of Voicing.

property name

Name of the processor

property model_capacity

String specifying the model capacity to use

Must be ‘tiny’, ‘small’, ‘medium’, ‘large’ or ‘full’. Sets the model’s capacity multiplier: 4 (tiny), 8 (small), 16 (medium), 24 (large) or 32 (full). ‘full’ uses the model size specified in [Kim2018]; the others use a reduced number of filters in each convolutional layer, resulting in a smaller model that is faster to evaluate, at the cost of slightly reduced pitch estimation accuracy.

property viterbi

Whether to apply viterbi smoothing to the estimated pitch curve

property center

Whether to center the window on the current frame.

When True, the output frame t is centered at audio[t * hop_length]. When False, the frame begins at audio[t * hop_length].
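As a toy illustration of the two conventions, assuming the 16 kHz operating rate and the default frame_shift and frame_length (in practice a negative start index under center=True would be handled by padding the signal):

```python
sample_rate = 16000
frame_shift, frame_length = 0.01, 0.025       # seconds (the defaults)
hop = int(frame_shift * sample_rate)          # 160 samples between frames
win = int(frame_length * sample_rate)         # 400 samples per window

def window_bounds(t, center=True):
    """(start, stop) sample indices of frame t under each convention."""
    start = t * hop - win // 2 if center else t * hop
    return start, start + win
```

For frame t=1, `window_bounds(1, center=True)` gives (-40, 360), a window centered at sample 160, while `window_bounds(1, center=False)` gives (160, 560), a window beginning at sample 160.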

property frame_shift

Frame shift in seconds for running pitch estimation

property frame_length

Frame length in seconds

get_params(deep=True)

Get parameters for this processor.

Parameters

deep (boolean, optional) – If True, will return the parameters for this processor and contained subobjects that are processors. Defaults to True.

Returns

params (mapping of string to any) – Parameter names mapped to their values.
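The get_params/set_params pair follows the scikit-learn convention. The contract can be sketched with a minimal stand-in class (SketchProcessor is hypothetical, not part of shennong):

```python
class SketchProcessor:
    """Minimal stand-in illustrating the get_params/set_params contract."""
    def __init__(self, model_capacity='full', frame_shift=0.01):
        self.model_capacity = model_capacity
        self.frame_shift = frame_shift

    def get_params(self):
        # parameter names mapped to their current values
        return {'model_capacity': self.model_capacity,
                'frame_shift': self.frame_shift}

    def set_params(self, **params):
        # reject unknown parameters, update known ones, return self
        for name, value in params.items():
            if name not in self.get_params():
                raise ValueError(f'invalid parameter: {name}')
            setattr(self, name, value)
        return self
```

Returning self from set_params allows chained calls such as `SketchProcessor().set_params(frame_shift=0.02).get_params()`.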

get_properties(**kwargs)

Return the processors properties as a dictionary

property log

Processor logger

process_all(utterances, njobs=None, **kwargs)

Returns features processed from several input utterances

This function processes the features in parallel jobs.

Parameters
  • utterances (Utterances) – The utterances on which to process features.

  • njobs (int, optional) – The number of parallel jobs to run in the background. Defaults to the number of CPU cores available on the machine.

  • **kwargs (dict, optional) – Extra arguments to be forwarded to the process method. Keys must be the same as for utterances.

Returns

features (FeaturesCollection) – The computed features on each input signal. The keys of output features are the keys of the input utterances.

Raises

ValueError – If the njobs parameter is <= 0 or if an entry is missing in optional kwargs.

property sample_rate

Processing sample rate in Hz: CREPE operates at 16 kHz

set_logger(level, formatter='%(levelname)s - %(name)s - %(message)s')

Change level and/or format of the processor’s logger

Parameters
  • level (str) – The minimum log level handled by the logger (any message above this level will be ignored). Must be ‘debug’, ‘info’, ‘warning’ or ‘error’.

  • formatter (str, optional) – A string to format the log messages, see https://docs.python.org/3/library/logging.html#formatter-objects. By default display level and message. Use ‘%(asctime)s - %(levelname)s - %(name)s - %(message)s’ to display time, level, name and message.

set_params(**params)

Set the parameters of this processor.

Returns

self

Raises

ValueError – If any given parameter in params is invalid for the processor.

property ndims

Dimension of the output features frames

times(nframes)[source]

Returns the time label for the rows given by process()

process(audio)[source]

Extracts the (POV, pitch) from a given speech audio using CREPE.

Parameters

audio (Audio) – The speech signal on which to estimate the pitch. Will be transparently resampled at 16kHz if needed.

Returns

raw_pitch_features (Features, shape = [nframes, 2]) – The output array has two columns corresponding to (POV, pitch). The output from the crepe module is reshaped to match the specified options frame_shift and frame_length.

Raises

ValueError – If the input signal has more than one channel (i.e. is not mono).

class shennong.processor.pitch_crepe.CrepePitchPostProcessor(pitch_scale=2.0, delta_pitch_scale=10.0, delta_pitch_noise_stddev=0.005, normalization_left_context=75, normalization_right_context=75, delta_window=2, delay=0, add_pov_feature=True, add_normalized_log_pitch=True, add_delta_pitch=True, add_raw_log_pitch=False)[source]

Bases: shennong.processor.pitch_kaldi.KaldiPitchPostProcessor

Processes the raw (POV, pitch) computed by the CrepePitchProcessor

Turns the raw pitch quantities into usable features. Converts the POV into an NCCF usable by PitchPostProcessor, then removes the pitch at frames with the worst POV (according to the pov_threshold or the proportion_voiced option), replaces them with interpolated values, and finally sends the resulting (NCCF, pitch) pair to shennong.processor.pitch.PitchPostProcessor.process().
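The POV-to-NCCF warping and the exact thresholding logic are internal to shennong; the interpolation step alone can be sketched with numpy (the 0.5 threshold below is an arbitrary assumption for the sketch, not a shennong default):

```python
import numpy as np

def interpolate_low_pov(pov, pitch, threshold=0.5):
    """Replace pitch at low-POV frames by linear interpolation from reliable frames."""
    pitch = pitch.copy()
    good = pov >= threshold
    if good.any():
        idx = np.arange(len(pitch))
        # interpolate the unreliable frames from the reliable ones
        pitch[~good] = np.interp(idx[~good], idx[good], pitch[good])
    return pitch
```

For example, a spurious 500 Hz estimate at a low-POV frame sitting between reliable 100 Hz and 200 Hz frames is replaced by the interpolated value 150 Hz.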

property add_delta_pitch

If true, time derivative of log-pitch is added to output features

property add_normalized_log_pitch

If true, the normalized log-pitch is added to output features

Normalization is done with POV-weighted mean subtraction over 1.5 second window.
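With the default 10 ms frame shift, the default left and right contexts of 75 frames give roughly this 1.5 second window. A minimal numpy sketch of POV-weighted moving mean subtraction (a simplification of the Kaldi implementation, which this processor delegates to):

```python
import numpy as np

def normalized_log_pitch(log_pitch, pov, left=75, right=75):
    """Subtract a POV-weighted moving average from the log-pitch."""
    n = len(log_pitch)
    out = np.empty(n)
    for t in range(n):
        lo, hi = max(0, t - left), min(n, t + right + 1)
        # frames with higher voicing probability weigh more in the local mean
        out[t] = log_pitch[t] - np.average(log_pitch[lo:hi], weights=pov[lo:hi])
    return out
```

A constant log-pitch curve normalizes to zero everywhere, whatever the POV weights, since the weighted local mean then equals the value itself.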

property add_pov_feature

If true, the warped NCCF is added to output features

property add_raw_log_pitch

If true, the raw log-pitch is added to output features

property delay

Number of frames by which the pitch information is delayed

property delta_pitch_noise_stddev

Standard deviation for noise we add to the delta log-pitch

The stddev is added before scaling. Should be about the same as delta-pitch option to pitch creation. The purpose is to get rid of peaks in the delta-pitch caused by discretization of pitch values.

property delta_pitch_scale

Term to scale the final delta log-pitch feature

property delta_window

Number of frames on each side of central frame
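A common way to compute such deltas is a linear regression over the central frame and its window neighbours; a minimal numpy sketch, assuming Kaldi-style regression coefficients and simple edge padding:

```python
import numpy as np

def delta(feat, window=2):
    """Delta features by linear regression over +-window neighbouring frames."""
    # regression normalizer: 2 * sum of squared offsets
    denom = 2 * sum(i * i for i in range(1, window + 1))
    # repeat edge values so every frame has a full window
    padded = np.pad(feat, (window, window), mode='edge')
    return np.array([
        sum(i * (padded[t + window + i] - padded[t + window - i])
            for i in range(1, window + 1)) / denom
        for t in range(len(feat))])
```

On a linearly increasing curve the interior delta values come out exactly equal to the slope, which is the sanity check for the regression coefficients.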

get_params(deep=True)

Get parameters for this processor.

Parameters

deep (boolean, optional) – If True, will return the parameters for this processor and contained subobjects that are processors. Defaults to True.

Returns

params (mapping of string to any) – Parameter names mapped to their values.

property log

Processor logger

property ndims

Dimension of the output features frames

property normalization_left_context

Left-context (in frames) for moving window normalization

property normalization_right_context

Right-context (in frames) for moving window normalization

property pitch_scale

Scaling factor for the final normalized log-pitch value

property pov_offset

This can be used to add an offset to the POV feature

Intended for use in Kaldi’s online decoding as a substitute for CMN (cepstral mean normalization)

property pov_scale

Scaling factor for final probability of voicing feature

process_all(utterances, njobs=None, **kwargs)

Returns features processed from several input utterances

This function processes the features in parallel jobs.

Parameters
  • utterances (Utterances) – The utterances on which to process features.

  • njobs (int, optional) – The number of parallel jobs to run in the background. Defaults to the number of CPU cores available on the machine.

  • **kwargs (dict, optional) – Extra arguments to be forwarded to the process method. Keys must be the same as for utterances.

Returns

features (FeaturesCollection) – The computed features on each input signal. The keys of output features are the keys of the input utterances.

Raises

ValueError – If the njobs parameter is <= 0 or if an entry is missing in optional kwargs.

set_logger(level, formatter='%(levelname)s - %(name)s - %(message)s')

Change level and/or format of the processor’s logger

Parameters
  • level (str) – The minimum log level handled by the logger (any message above this level will be ignored). Must be ‘debug’, ‘info’, ‘warning’ or ‘error’.

  • formatter (str, optional) – A string to format the log messages, see https://docs.python.org/3/library/logging.html#formatter-objects. By default display level and message. Use ‘%(asctime)s - %(levelname)s - %(name)s - %(message)s’ to display time, level, name and message.

set_params(**params)

Set the parameters of this processor.

Returns

self

Raises

ValueError – If any given parameter in params is invalid for the processor.

property name

Name of the processor

get_properties(features)[source]

Return the processors properties as a dictionary

process(crepe_pitch)[source]

Post process a raw pitch data as specified by the options

Parameters

crepe_pitch (Features, shape = [n, 2]) – The pitch as extracted by the CrepePitchProcessor.process method

Returns

pitch (Features, shape = [n, 1 to 4]) – The post-processed pitch usable as speech features. The output columns are ‘pov_feature’, ‘normalized_log_pitch’, ‘delta_pitch’ and ‘raw_log_pitch’, in that order, if their respective options are set to True.

Raises

ValueError – If after interpolation some pitch values are not positive. If raw_pitch has not exactly two columns. If all the following options are False: ‘add_pov_feature’, ‘add_normalized_log_pitch’, ‘add_delta_pitch’ and ‘add_raw_log_pitch’ (at least one of them must be True).
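The number of output columns is simply the count of enabled add_* options; a hypothetical helper (not part of the shennong API) makes the column-count and Raises conditions concrete:

```python
def output_ndims(add_pov_feature=True, add_normalized_log_pitch=True,
                 add_delta_pitch=True, add_raw_log_pitch=False):
    """One output column per enabled feature, in the documented column order."""
    ndims = sum((add_pov_feature, add_normalized_log_pitch,
                 add_delta_pitch, add_raw_log_pitch))
    if ndims == 0:
        raise ValueError('at least one feature must be enabled')
    return ndims
```

With the default options this gives 3 columns, consistent with the (140, 3) shape shown in the examples above; enabling add_raw_log_pitch as well gives 4.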