VTLN

Extraction of VTLN warp factors from utterances.

Uses the Kaldi implementation of Linear Vocal Tract Length Normalization (see [kaldi-lvtln]).

Examples

>>> from shennong import Utterances
>>> from shennong.processor.vtln import VtlnProcessor
>>> wav = './test/data/test.wav'
>>> utterances = Utterances(
...     [('utt1', wav, 'spk1', 0, 1), ('utt2', wav, 'spk1', 1, 1.4)])

Initialize the VTLN model. Options can be given at construction or set afterwards as properties:

>>> vtln = VtlnProcessor(min_warp=0.95, max_warp=1.05, ubm={'num_gauss': 4})
>>> vtln.num_iters = 10

Compute the warps for each utterance. If the by_speaker property is True and speaker information is provided with the utterances, the warps are computed per speaker, and all utterances of the same speaker are mapped to the same warp factor.

>>> warps = vtln.process(utterances)

Each warp can then be passed to the process() method of MfccProcessor, FilterbankProcessor, PlpProcessor or SpectrogramProcessor to warp the corresponding features.

The features can also be warped directly via the pipeline.

>>> from shennong.pipeline import get_default_config, extract_features
>>> config = get_default_config('mfcc', with_vtln='simple')
>>> config['vtln']['ubm']['num_gauss'] = 4
>>> warped_features = extract_features(config, utterances)

References

kaldi-lvtln: https://kaldi-asr.org/doc/transform.html#transform_lvtln

class shennong.processor.vtln.VtlnProcessor(num_iters=15, min_warp=0.85, max_warp=1.25, warp_step=0.01, logdet_scale=0.0, norm_type='offset', subsample=5, features=None, ubm=None, by_speaker=True)

Bases: shennong.base.BaseProcessor

VTLN model

property name: Processor name

property num_iters: Number of iterations of training

property min_warp: Minimum warp considered

get_params(deep=True)

Get parameters for this processor.

Parameters: deep (boolean, optional) – If True, return the parameters of this processor and of the contained subobjects that are processors. Defaults to True.
Returns: params (mapping of string to any) – Parameter names mapped to their values.

property log: Processor logger

property max_warp: Maximum warp considered

set_logger(level, formatter='%(levelname)s - %(name)s - %(message)s')

Change level and/or format of the processor’s logger

Parameters

level (str) – The minimum log level handled by the logger (any message below this level will be ignored). Must be ‘debug’, ‘info’, ‘warning’ or ‘error’.
formatter (str, optional) – A string to format the log messages, see https://docs.python.org/3/library/logging.html#formatter-objects. By default display level and message. Use ‘%(asctime)s - %(levelname)s - %(name)s - %(message)s’ to display time, level, name and message.
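
The effect of the level and formatter arguments can be reproduced with the standard logging module alone. The following is a minimal sketch, not shennong's own code: the logger name 'vtln-demo' is arbitrary, and the messages are captured in a string buffer to show which ones a logger set to 'warning' lets through with the default format string.

```python
import io
import logging

# Stand-in logger: the name 'vtln-demo' is arbitrary, not shennong's own.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter('%(levelname)s - %(name)s - %(message)s'))

logger = logging.getLogger('vtln-demo')
logger.addHandler(handler)
logger.setLevel(logging.WARNING)  # minimum level handled by the logger
logger.propagate = False  # keep the demo output out of the root logger

logger.info('dropped: less severe than the minimum level')
logger.warning('kept: at the minimum level')

output = stream.getvalue()
print(output, end='')
```

Only the warning is written to the buffer, formatted as level, logger name and message.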

set_params(**params)

Set the parameters of this processor.

Returns: self
Raises: ValueError – If any given parameter in params is invalid for the processor.
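
The contract described for get_params() and set_params() (a mapping of names to values, an update that returns self, a ValueError on unknown names) can be illustrated with a toy class. This is only a sketch under that reading of the documentation: ToyProcessor is hypothetical and is not shennong.base.BaseProcessor.

```python
class ToyProcessor:
    """Toy stand-in for a processor; not shennong's BaseProcessor."""

    def __init__(self, num_iters=15, min_warp=0.85, max_warp=1.25):
        self.num_iters = num_iters
        self.min_warp = min_warp
        self.max_warp = max_warp

    def get_params(self):
        """Return the parameter names mapped to their current values."""
        return {
            'num_iters': self.num_iters,
            'min_warp': self.min_warp,
            'max_warp': self.max_warp}

    def set_params(self, **params):
        """Update parameters, rejecting unknown names, and return self."""
        valid = self.get_params()
        for name in params:
            if name not in valid:
                raise ValueError(f'invalid parameter: {name}')
        for name, value in params.items():
            setattr(self, name, value)
        return self


# returning self allows the call to be chained after construction
proc = ToyProcessor().set_params(num_iters=10, min_warp=0.95)
print(proc.get_params())

try:
    proc.set_params(unknown_option=1)
except ValueError as err:
    print(err)
```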

property warp_step: Warp step
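
Together, min_warp, max_warp and warp_step define a grid of candidate warp factors, and a rank in that grid is what class_idx refers to further down. The sketch below assumes the grid is laid out as in Kaldi's LVTLN training recipe (number of classes equal to 1 + round((max_warp - min_warp) / warp_step)); the helper warp_grid is hypothetical and not part of shennong.

```python
def warp_grid(min_warp=0.85, max_warp=1.25, warp_step=0.01):
    """Return the candidate warps and the rank of the identity warp (1.0).

    Hypothetical helper, not part of shennong: it assumes the grid is
    built as in Kaldi's LVTLN training recipe.
    """
    if not min_warp <= 1.0 <= max_warp:
        raise ValueError('the interval must contain the identity warp 1.0')
    num_classes = 1 + round((max_warp - min_warp) / warp_step)
    default_class = round((1.0 - min_warp) / warp_step)
    # rounding removes floating-point noise such as 0.9999999999999999
    warps = [round(min_warp + i * warp_step, 10) for i in range(num_classes)]
    return warps, default_class


warps, default_class = warp_grid()
print(len(warps), default_class, warps[default_class])
```

Under this assumption the default settings give 41 candidate warps with the identity warp at rank 15, while min_warp=0.95 and max_warp=1.05 give 11 candidates with the identity at rank 5.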

property logdet_scale: Scale on log-determinant term in auxiliary function

property norm_type: Type of fMLLR applied (offset, none or diag)

property subsample: When computing base LVTLN transforms, use only every n-th frame (a speedup)

property by_speaker: Compute the warps for each speaker, or each utterance

property features: Features extraction configuration

property ubm: Diagonal UBM-GMM configuration

classmethod load(path): Load the LVTLN from a binary file

classmethod load_warps(path): Load precomputed warps

save(path): Save the LVTLN to a binary file

save_warps(path): Save the computed warps

compute_mapping_transform(feats_untransformed, feats_transformed, class_idx, warp, weights=None)

Set one of the transforms in the LVTLN to the minimum-squared-error solution for mapping feats_untransformed to feats_transformed. Posteriors may optionally be used to downweight or remove silence.

Adapted from [kaldi-train-lvtln-special]

Parameters

feats_untransformed (FeaturesCollection) – Collection of original features.
feats_transformed (FeaturesCollection) – Collection of warped features.
class_idx (int) – Rank of warp considered.
warp (float) – Warp considered.
weights (dict[str, ndarray], optional) – For each features item in the collection, an array of weights to apply to its frames. Unweighted by default.

Raises

ValueError – If the features have inconsistent dimensions, or if the size of the posteriors does not match the size of the features.

References

kaldi-train-lvtln-special: https://kaldi-asr.org/doc/gmm-train-lvtln-special_8cc.html
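
The minimum-squared-error objective behind compute_mapping_transform() can be illustrated with a weighted least-squares fit of an affine transform. This is a sketch of the objective only, written with NumPy on synthetic data; it is not how Kaldi or shennong accumulate their statistics, and the function weighted_affine_fit is hypothetical.

```python
import numpy as np


def weighted_affine_fit(x, y, weights=None):
    """Return W of shape (dim, dim + 1) minimizing
    sum_t w_t * ||W [x_t; 1] - y_t||^2 over the frames t."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError('features have inconsistent dimensions')
    if weights is None:
        weights = np.ones(x.shape[0])
    weights = np.asarray(weights, dtype=float)
    if weights.shape[0] != x.shape[0]:
        raise ValueError('one weight per frame is required')

    # append a constant 1 to each frame so that the offset is estimated too
    x_ext = np.hstack([x, np.ones((x.shape[0], 1))])
    # scaling rows by sqrt(w) turns the weighted problem into ordinary
    # least squares
    sqrt_w = np.sqrt(weights)[:, None]
    solution, *_ = np.linalg.lstsq(sqrt_w * x_ext, sqrt_w * y, rcond=None)
    return solution.T


# synthetic data: "warped" frames are an exact affine map of the originals
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 3))
true_w = np.array([[1.1, 0.0, 0.0, 0.2],
                   [0.0, 0.9, 0.1, -0.3],
                   [0.0, 0.0, 1.0, 0.0]])
warped = frames @ true_w[:, :3].T + true_w[:, 3]
frame_weights = rng.uniform(0.0, 1.0, size=200)  # e.g. low weight on silence

estimated = weighted_affine_fit(frames, warped, frame_weights)
print(np.allclose(estimated, true_w))
```

Because the synthetic targets are noise-free, the fit recovers the transform regardless of the weights; with real features the weights decide which frames dominate the solution.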

estimate(ubm, feats_collection, posteriors, utt2speak=None)

Estimate linear-VTLN transforms, either per utterance or for the supplied set of speakers (utt2speak option). Reads posteriors indicating Gaussian indexes in the UBM.

Adapted from [kaldi-global-est-lvtln-trans]

Parameters

ubm (DiagUbmProcessor) – The Universal Background Model.
feats_collection (FeaturesCollection) – The untransformed features.
posteriors (dict[str, list[list[tuple[int, float]]]]) – The posteriors indicating Gaussian indexes in the UBM.
utt2speak (dict[str, str], optional) – If provided, map each utterance to a speaker.

References

kaldi-global-est-lvtln-trans: https://kaldi-asr.org/doc/gmm-global-est-lvtln-trans_8cc.html
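
The nested type of posteriors is easier to read on a small example. The sketch below builds made-up posteriors with the documented layout (one list per frame, each holding pairs of Gaussian index and posterior), groups the utterances with utt2speak, and sums the posterior mass per Gaussian. The helper names and all values are hypothetical; the actual estimation in Kaldi accumulates richer statistics than these counts.

```python
from collections import defaultdict

# one entry per utterance, one inner list per frame, and each pair is
# (Gaussian index in the UBM, posterior probability); values are made up
posteriors = {
    'utt1': [[(0, 0.9), (2, 0.1)], [(2, 1.0)]],
    'utt2': [[(1, 0.6), (0, 0.4)]],
    'utt3': [[(3, 1.0)], [(3, 0.7), (1, 0.3)]],
}
utt2speak = {'utt1': 'spk1', 'utt2': 'spk1', 'utt3': 'spk2'}


def group_utterances(posteriors, utt2speak=None):
    """Group utterance names by speaker, or one group per utterance."""
    groups = defaultdict(list)
    for utt in posteriors:
        key = utt if utt2speak is None else utt2speak[utt]
        groups[key].append(utt)
    return dict(groups)


def occupancy(posteriors, utterances):
    """Sum the posterior mass of each Gaussian over the given utterances."""
    counts = defaultdict(float)
    for utt in utterances:
        for frame in posteriors[utt]:
            for gauss_idx, post in frame:
                counts[gauss_idx] += post
    return dict(counts)


by_speaker = group_utterances(posteriors, utt2speak)
spk1_counts = occupancy(posteriors, by_speaker['spk1'])
print(by_speaker)
print(spk1_counts)
```

Without utt2speak, each utterance forms its own group, which corresponds to the per-utterance case.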

process(utterances, ubm=None, group_by='utterance', njobs=1)

Compute the VTLN warp factors for the given utterances.

If the by_speaker property is set to True before the call to process(), the warps are computed on a per-speaker basis (i.e. all utterances of the same speaker share one warp). If by_speaker is False, the warps are computed on a per-utterance basis.

Parameters

utterances (Utterances) – The list of utterances to train the VTLN on.
ubm (DiagUbmProcessor, optional) – If provided, uses this UBM instead of computing a new one.
group_by (str, optional) – Must be ‘utterance’ or ‘speaker’.
njobs (int, optional) – Number of threads to use for computation, defaults to 1.

Returns

warps (dict[str, float]) – Warps computed for each speaker or utterance, according to group_by. When computed by speaker, all utterances of a speaker share the same warp.
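
When warps are computed by speaker, every utterance of a speaker ends up with that speaker's warp. A minimal sketch of this expansion follows, assuming a speaker-to-warp mapping and an utterance-to-speaker mapping are at hand; the helper expand_speaker_warps and the warp values are hypothetical and not part of shennong.

```python
def expand_speaker_warps(speaker_warps, utt2speak):
    """Map each utterance to the warp of its speaker (hypothetical helper)."""
    missing = set(utt2speak.values()) - set(speaker_warps)
    if missing:
        raise ValueError(f'no warp for speakers: {sorted(missing)}')
    return {utt: speaker_warps[spk] for utt, spk in utt2speak.items()}


# made-up warp factors, for illustration only
speaker_warps = {'spk1': 0.97, 'spk2': 1.04}
utt2speak = {'utt1': 'spk1', 'utt2': 'spk1', 'utt3': 'spk2'}

utterance_warps = expand_speaker_warps(speaker_warps, utt2speak)
print(utterance_warps)
```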