Pitch estimation using CREPE¶
Provides classes to extract pitch from an audio (speech) signal using the CREPE model (see [Kim2018]). Integrates the CREPE package (see [crepe-repo]) into the shennong API and provides post-processing to turn the raw pitch into usable features, using CrepePitchPostProcessor.
The maximum value of the output of the neural network is used as a heuristic estimate of the voicing probability (POV).
Examples
>>> from shennong.audio import Audio
>>> from shennong.processor import (
... CrepePitchProcessor, CrepePitchPostProcessor)
>>> audio = Audio.load('./test/data/test.wav')
Initialize a pitch processor with some options. Options can be specified at construction, or after:
>>> processor = CrepePitchProcessor(
... model_capacity='tiny', frame_shift=0.01)
Compute the pitch with the specified options, the output is an
instance of Features
:
>>> pitch = processor.process(audio)
>>> type(pitch)
<class 'shennong.features.Features'>
>>> pitch.shape
(140, 2)
The pitch post-processor works in the same way: its input is the raw pitch and its output is a set of features usable by speech processing tools:
>>> postprocessor = CrepePitchPostProcessor() # use default options
>>> postpitch = postprocessor.process(pitch)
>>> postpitch.shape
(140, 3)
References
- Kim2018
  CREPE: A Convolutional Representation for Pitch Estimation. Jong Wook Kim, Justin Salamon, Peter Li, Juan Pablo Bello. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018. https://arxiv.org/abs/1802.06182
- crepe-repo
  https://github.com/marl/crepe
-
shennong.processor.pitch_crepe.predict_voicing(confidence)[source]¶ Find the Viterbi path for voiced versus unvoiced frames.
Adapted from https://github.com/sannawag/crepe.
- Parameters
confidence (np.ndarray [shape=(N,)]) – voicing confidence array, i.e. the confidence in the presence of a pitch
- Returns
voicing_states (np.ndarray [shape=(N,)]) – HMM predictions for each frame's state: 0 if unvoiced, 1 if voiced
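The idea behind predict_voicing can be sketched as a two-state HMM decoded with the Viterbi algorithm: the per-frame confidence acts as the emission probability of the voiced state, and a small switching probability penalizes rapid voiced/unvoiced flips. The sketch below illustrates this mechanism only; the `switch_prob` value is an assumption, not the transition model used by the actual implementation.

```python
import numpy as np

def viterbi_voicing(confidence, switch_prob=0.01):
    """Decode voiced (1) / unvoiced (0) states from a confidence curve with a
    2-state HMM, penalizing state switches (illustrative sketch only)."""
    n = len(confidence)
    # emission log-probabilities: row 0 = unvoiced, row 1 = voiced
    emit = np.log(np.stack([1 - confidence, confidence]) + 1e-12)  # (2, n)
    trans = np.log(np.array([[1 - switch_prob, switch_prob],
                             [switch_prob, 1 - switch_prob]]))
    score = emit[:, 0] + np.log(0.5)          # uniform initial distribution
    back = np.zeros((2, n), dtype=int)
    for t in range(1, n):
        cand = score[:, None] + trans         # cand[prev, cur]
        back[:, t] = cand.argmax(axis=0)      # best predecessor per state
        score = cand.max(axis=0) + emit[:, t]
    states = np.zeros(n, dtype=int)
    states[-1] = score.argmax()
    for t in range(n - 1, 0, -1):             # backtrack the best path
        states[t - 1] = back[states[t], t]
    return states
```

A single low-confidence frame surrounded by confident voiced frames stays voiced, because one emission term cannot outweigh two switch penalties: `viterbi_voicing(np.array([0.9, 0.4, 0.9]))` yields `[1, 1, 1]`.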
-
class shennong.processor.pitch_crepe.CrepePitchProcessor(model_capacity='full', viterbi=True, center=True, frame_shift=0.01, frame_length=0.025)[source]¶ Bases: shennong.processor.base.FeaturesProcessor
Extracts the (POV, pitch) per frame from a speech signal
This processor uses the pre-trained CREPE model. The output will have as many rows as there are frames, and two columns corresponding to (POV, pitch). POV is the Probability of Voicing.
-
property
name
¶ Name of the processor
-
property
model_capacity
¶ String specifying the model capacity to use
Must be ‘tiny’, ‘small’, ‘medium’, ‘large’ or ‘full’. Sets the model’s capacity multiplier: 4 (tiny), 8 (small), 16 (medium), 24 (large) or 32 (full). ‘full’ uses the model size specified in [Kim2018]; the others use a reduced number of filters in each convolutional layer, giving a smaller model that is faster to evaluate at the cost of slightly reduced pitch estimation accuracy.
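The capacity names map to fixed filter multipliers, as listed above. A minimal sketch of validating the option (the helper name is ours, for illustration; the multiplier values come from the docstring):

```python
# capacity name -> multiplier on the number of filters per convolutional layer
CAPACITY_MULTIPLIER = {'tiny': 4, 'small': 8, 'medium': 16, 'large': 24, 'full': 32}

def capacity_multiplier(model_capacity):
    """Return the filter multiplier for a capacity name, rejecting unknown ones."""
    try:
        return CAPACITY_MULTIPLIER[model_capacity]
    except KeyError:
        raise ValueError(
            f"model_capacity must be in {sorted(CAPACITY_MULTIPLIER)}, "
            f"got '{model_capacity}'") from None
```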
-
property
viterbi
¶ Whether to apply Viterbi smoothing to the estimated pitch curve
-
property
center
¶ Whether to center the window on the current frame.
When True, the output frame is centered at audio[t * hop_length]. When False, the frame begins at audio[t * hop_length].
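A small numeric sketch of the two conventions, in samples at CREPE's 16 kHz operating rate (the variable names are ours, for illustration):

```python
sample_rate = 16000                      # CREPE's fixed operating rate
frame_shift, frame_length = 0.01, 0.025  # processor defaults, in seconds
hop = int(frame_shift * sample_rate)     # 160 samples between successive frames
win = int(frame_length * sample_rate)    # 400 samples per analysis window
t = 5                                    # an arbitrary frame index
start_if_centered = t * hop - win // 2   # center=True: window centered at audio[t * hop]
start_if_left = t * hop                  # center=False: window begins at audio[t * hop]
```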
-
property
frame_shift
¶ Frame shift in seconds for running pitch estimation
-
property
frame_length
¶ Frame length in seconds
-
get_params
(deep=True)¶ Get parameters for this processor.
- Parameters
deep (boolean, optional) – If True, will return the parameters for this processor and contained subobjects that are processors. Default to True.
- Returns
params (mapping of string to any) – Parameter names mapped to their values.
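get_params and set_params follow the scikit-learn convention: the dict returned by get_params can reconfigure an equivalent processor. Sketched below with a hypothetical stand-in class, since the real processors require the shennong package:

```python
class TinyProcessor:
    """Stand-in illustrating the get_params/set_params round trip."""

    def __init__(self, frame_shift=0.01, frame_length=0.025):
        self.frame_shift = frame_shift
        self.frame_length = frame_length

    def get_params(self, deep=True):
        # parameter names mapped to their current values
        return {'frame_shift': self.frame_shift,
                'frame_length': self.frame_length}

    def set_params(self, **params):
        # reject names that are not parameters of this processor
        for name, value in params.items():
            if name not in self.get_params():
                raise ValueError(f'invalid parameter {name!r} for processor')
            setattr(self, name, value)
        return self
```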
-
get_properties
(**kwargs)¶ Return the processor's properties as a dictionary
-
property
log
¶ Processor logger
-
process_all
(utterances, njobs=None, **kwargs)¶ Returns features processed from several input utterances
This function processes the features in parallel jobs.
- Parameters
utterances (shennong.utterances.Utterances) – The utterances on which to compute the features.
njobs (int, optional) – The number of parallel jobs to run in background. Default to the number of CPU cores available on the machine.
**kwargs (dict, optional) – Extra arguments to be forwarded to the process method. Keys must be the same as for utterances.
- Returns
features (FeaturesCollection) – The computed features on each input signal. The keys of the output features are the keys of the input utterances.
- Raises
ValueError – If the njobs parameter is <= 0 or if an entry is missing in optional kwargs.
-
property
sample_rate
¶ CREPE operates at 16kHz
-
set_logger
(level, formatter='%(levelname)s - %(name)s - %(message)s')¶ Change level and/or format of the processor’s logger
- Parameters
level (str) – The minimum log level handled by the logger (any message above this level will be ignored). Must be ‘debug’, ‘info’, ‘warning’ or ‘error’.
formatter (str, optional) – A string to format the log messages, see https://docs.python.org/3/library/logging.html#formatter-objects. By default display level and message. Use ‘%(asctime)s - %(levelname)s - %(name)s - %(message)s’ to display time, level, name and message.
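What the formatter string controls can be seen by wiring up a logger by hand with the default format used by set_logger (the logger name 'pitch_crepe' is just for illustration):

```python
import logging

logger = logging.getLogger('pitch_crepe')
logger.setLevel(logging.INFO)              # messages below INFO are ignored
handler = logging.StreamHandler()
# the default format of set_logger: level, name, then the message
handler.setFormatter(
    logging.Formatter('%(levelname)s - %(name)s - %(message)s'))
logger.addHandler(handler)
logger.info('pitch estimation done')       # emits: INFO - pitch_crepe - pitch estimation done
```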
-
set_params
(**params)¶ Set the parameters of this processor.
- Returns
self
- Raises
ValueError – If any given parameter in
params
is invalid for the processor.
-
property
ndims
¶ Dimension of the output features frames
-
process
(audio)[source]¶ Extracts the (POV, pitch) from a given speech audio using CREPE.
- Parameters
audio (Audio) – The speech signal on which to estimate the pitch. Will be transparently resampled at 16kHz if needed.
- Returns
raw_pitch_features (Features, shape = [nframes, 2]) – The output array has two columns corresponding to (POV, pitch). The output from the crepe module is reshaped to match the specified options frame_shift and frame_length.
- Raises
ValueError – If the input signal has more than one channel (i.e. is not mono).
-
class shennong.processor.pitch_crepe.CrepePitchPostProcessor(pitch_scale=2.0, delta_pitch_scale=10.0, delta_pitch_noise_stddev=0.005, normalization_left_context=75, normalization_right_context=75, delta_window=2, delay=0, add_pov_feature=True, add_normalized_log_pitch=True, add_delta_pitch=True, add_raw_log_pitch=False)[source]¶ Bases: shennong.processor.pitch_kaldi.KaldiPitchPostProcessor
Processes the raw (POV, pitch) computed by the CrepePitchProcessor
Turns the raw pitch quantities into usable features. Converts the POV into an NCCF usable by PitchPostProcessor, then removes the pitch at frames with the worst POV (according to the pov_threshold or the proportion_voiced option), replaces them with interpolated values, and finally sends this (NCCF, pitch) pair to shennong.processor.pitch.PitchPostProcessor.process().
-
property
add_delta_pitch
¶ If true, time derivative of log-pitch is added to output features
-
property
add_normalized_log_pitch
¶ If true, the normalized log-pitch is added to output features
Normalization is done with POV-weighted mean subtraction over 1.5 second window.
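With the default contexts of 75 frames on each side and a 10 ms frame shift, the window spans (75 + 1 + 75) × 0.01 = 1.51 s, hence the "1.5 second window" above. The mechanism can be sketched as a POV-weighted moving-mean subtraction (an illustration of the idea only, not the Kaldi/shennong implementation):

```python
import numpy as np

def normalize_log_pitch(log_pitch, pov, left=75, right=75):
    """Subtract a POV-weighted local mean from each log-pitch value (sketch)."""
    out = np.empty_like(log_pitch)
    for t in range(len(log_pitch)):
        lo, hi = max(0, t - left), min(len(log_pitch), t + right + 1)
        # frames with a high probability of voicing dominate the local mean
        weights = pov[lo:hi] + 1e-12
        out[t] = log_pitch[t] - np.average(log_pitch[lo:hi], weights=weights)
    return out
```

On a constant log-pitch track the normalized output is zero everywhere, since every local weighted mean equals the constant itself.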
-
property
add_pov_feature
¶ If true, the warped NCCF is added to output features
-
property
add_raw_log_pitch
¶ If true, the raw log-pitch (with no normalization) is added to output features
-
property
delay
¶ Number of frames by which the pitch information is delayed
-
property
delta_pitch_noise_stddev
¶ Standard deviation for noise we add to the delta log-pitch
The stddev is added before scaling. Should be about the same as delta-pitch option to pitch creation. The purpose is to get rid of peaks in the delta-pitch caused by discretization of pitch values.
-
property
delta_pitch_scale
¶ Term to scale the final delta log-pitch feature
-
property
delta_window
¶ Number of frames on each side of central frame
-
get_params
(deep=True)¶ Get parameters for this processor.
- Parameters
deep (boolean, optional) – If True, will return the parameters for this processor and contained subobjects that are processors. Default to True.
- Returns
params (mapping of string to any) – Parameter names mapped to their values.
-
property
log
¶ Processor logger
-
property
ndims
¶ Dimension of the output features frames
-
property
normalization_left_context
¶ Left-context (in frames) for moving window normalization
-
property
normalization_right_context
¶ Right-context (in frames) for moving window normalization
-
property
pitch_scale
¶ Scaling factor for the final normalized log-pitch value
-
property
pov_offset
¶ This can be used to add an offset to the POV feature
Intended for use in Kaldi’s online decoding as a substitute for CMV (cepstral mean normalization)
-
property
pov_scale
¶ Scaling factor for final probability of voicing feature
-
process_all
(utterances, njobs=None, **kwargs)¶ Returns features processed from several input utterances
This function processes the features in parallel jobs.
- Parameters
utterances (shennong.utterances.Utterances) – The utterances on which to compute the features.
njobs (int, optional) – The number of parallel jobs to run in background. Default to the number of CPU cores available on the machine.
**kwargs (dict, optional) – Extra arguments to be forwarded to the process method. Keys must be the same as for utterances.
- Returns
features (FeaturesCollection) – The computed features on each input signal. The keys of the output features are the keys of the input utterances.
- Raises
ValueError – If the njobs parameter is <= 0 or if an entry is missing in optional kwargs.
-
set_logger
(level, formatter='%(levelname)s - %(name)s - %(message)s')¶ Change level and/or format of the processor’s logger
- Parameters
level (str) – The minimum log level handled by the logger (any message above this level will be ignored). Must be ‘debug’, ‘info’, ‘warning’ or ‘error’.
formatter (str, optional) – A string to format the log messages, see https://docs.python.org/3/library/logging.html#formatter-objects. By default display level and message. Use ‘%(asctime)s - %(levelname)s - %(name)s - %(message)s’ to display time, level, name and message.
-
set_params
(**params)¶ Set the parameters of this processor.
- Returns
self
- Raises
ValueError – If any given parameter in
params
is invalid for the processor.
-
property
name
¶ Name of the processor
-
process
(crepe_pitch)[source]¶ Post-processes raw pitch data as specified by the options
- Parameters
crepe_pitch (Features, shape = [n, 2]) – The pitch as extracted by the CrepePitchProcessor.process method
- Returns
pitch (Features, shape = [n, 1], [n, 2], [n, 3] or [n, 4]) – The post-processed pitch usable as speech features. The output columns are ‘pov_feature’, ‘normalized_log_pitch’, ‘delta_pitch’ and ‘raw_log_pitch’, in that order, if their respective options are set to True.
- Raises
ValueError – If after interpolation some pitch values are not positive. If raw_pitch has not exactly two columns. If all the following options are False: ‘add_pov_feature’, ‘add_normalized_log_pitch’, ‘add_delta_pitch’ and ‘add_raw_log_pitch’ (at least one of them must be True).
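The output dimension follows directly from the four boolean options: one column per enabled feature, in a fixed order. A small sketch of that rule (the helper name is ours, for illustration):

```python
def output_columns(add_pov_feature=True, add_normalized_log_pitch=True,
                   add_delta_pitch=True, add_raw_log_pitch=False):
    """Names of the post-processed feature columns, in their fixed order."""
    flags = [('pov_feature', add_pov_feature),
             ('normalized_log_pitch', add_normalized_log_pitch),
             ('delta_pitch', add_delta_pitch),
             ('raw_log_pitch', add_raw_log_pitch)]
    columns = [name for name, enabled in flags if enabled]
    if not columns:
        # mirrors the documented constraint: at least one option must be True
        raise ValueError('at least one add_* option must be True')
    return columns
```

With the defaults this gives three columns, matching the (140, 3) shape in the example at the top of the page.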