Extraction of VTLN warp factors from utterances.

Uses the Kaldi implementation of Linear Vocal Tract Length Normalization (see [kaldi-lvtln]).


>>> from shennong import Utterances
>>> from shennong.processor.vtln import VtlnProcessor
>>> wav = './test/data/test.wav'
>>> utterances = Utterances(
...     [('utt1', wav, 'spk1', 0, 1), ('utt2', wav, 'spk1', 1, 1.4)])

Initialize the VTLN model. Other options can be specified at construction, or after:

>>> vtln = VtlnProcessor(min_warp=0.95, max_warp=1.05, ubm={'num_gauss': 4})
>>> vtln.num_iters = 10

Compute the warp factors for each utterance. If the by_speaker property is set to True and speaker information is provided with the utterances, the warps are computed per speaker, and each utterance from the same speaker is mapped to the same warp factor.

>>> warps = vtln.process(utterances)
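The per-speaker behaviour can be illustrated with plain dictionaries (a sketch with made-up warp values, not actual output of the processor):

```python
# With by_speaker=True, every utterance of a speaker receives that
# speaker's single warp factor. Hypothetical values for illustration:
speaker_warps = {'spk1': 0.97, 'spk2': 1.03}
utt2spk = {'utt1': 'spk1', 'utt2': 'spk1', 'utt3': 'spk2'}

# expand the speaker-level warps into utterance-level warps
utt_warps = {utt: speaker_warps[spk] for utt, spk in utt2spk.items()}
# utt1 and utt2 share spk1's warp, utt3 gets spk2's warp
```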

Those warps can be passed individually to the process() method of MfccProcessor, FilterbankProcessor, PlpProcessor and SpectrogramProcessor to warp the corresponding features.
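For instance, a warp can be applied when computing MFCCs (a sketch; it assumes a vtln_warp parameter of MfccProcessor.process() and audio loading via shennong.audio.Audio):

>>> from shennong.audio import Audio
>>> from shennong.processor.mfcc import MfccProcessor
>>> audio = Audio.load(wav)
>>> mfcc = MfccProcessor().process(audio, vtln_warp=warps['utt1'])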

The features can also be warped directly via the pipeline.

>>> from shennong.pipeline import get_default_config, extract_features
>>> config = get_default_config('mfcc', with_vtln='simple')
>>> config['vtln']['ubm']['num_gauss'] = 4
>>> warped_features = extract_features(config, utterances)




class shennong.processor.vtln.VtlnProcessor(num_iters=15, min_warp=0.85, max_warp=1.25, warp_step=0.01, logdet_scale=0.0, norm_type='offset', subsample=5, features=None, ubm=None, by_speaker=True)[source]

Bases: shennong.base.BaseProcessor

VTLN model

property name

Processor name

property num_iters

Number of iterations of training

property min_warp

Minimum warp considered


get_params(deep=True)

Get parameters for this processor.


deep (boolean, optional) – If True, will return the parameters for this processor and contained subobjects that are processors. Defaults to True.


params (mapping of string to any) – Parameter names mapped to their values.

property log

Processor logger

property max_warp

Maximum warp considered

set_logger(level, formatter='%(levelname)s - %(name)s - %(message)s')

Change level and/or format of the processor’s logger

  • level (str) – The minimum log level handled by the logger (any message above this level will be ignored). Must be ‘debug’, ‘info’, ‘warning’ or ‘error’.

  • formatter (str, optional) – A string to format the log messages, see https://docs.python.org/3/library/logging.html#formatter-objects. By default display level and message. Use ‘%(asctime)s - %(levelname)s - %(name)s - %(message)s’ to display time, level, name and message.
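A short usage sketch (assuming a default-constructed processor):

>>> vtln = VtlnProcessor()
>>> vtln.set_logger('debug')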


set_params(**params)

Set the parameters of this processor.




ValueError – If any given parameter in params is invalid for the processor.

property warp_step

Warp step

property logdet_scale

Scale on log-determinant term in auxiliary function

property norm_type

Type of fMLLR applied (offset, none or diag)

property subsample

When computing base LVTLN transforms, use every n frames (a speedup)

property by_speaker

Compute the warps for each speaker, or each utterance

property features

Features extraction configuration

property ubm

Diagonal UBM-GMM configuration

classmethod load(path)[source]

Load the LVTLN from a binary file

classmethod load_warps(path)[source]

Load precomputed warps


save(path)[source]

Save the LVTLN to a binary file


save_warps(warps, path)[source]

Save the computed warps

compute_mapping_transform(feats_untransformed, feats_transformed, class_idx, warp, weights=None)[source]

Set one of the transforms in lvtln to the minimum-squared-error solution for mapping feats_untransformed to feats_transformed; posteriors may optionally be used to downweight or remove silence.

Adapted from [kaldi-train-lvtln-special]

  • feats_untransformed (FeaturesCollection) – Collection of original features.

  • feats_transformed (FeaturesCollection) – Collection of warped features.

  • class_idx (int) – Rank of warp considered.

  • warp (float) – Warp considered.

  • weights (dict[str, ndarray], optional) – For each features item in the collection, an array of weights to apply to the feature frames. Unweighted by default.


ValueError – If the features have inconsistent dimensions, or if the size of the posteriors does not correspond to the size of the features.




estimate(ubm, feats_collection, posteriors, utt2speak=None)[source]

Estimate linear-VTLN transforms, either per utterance or for the supplied set of speakers (utt2speak option). Reads posteriors indicating Gaussian indexes in the UBM.

Adapted from [kaldi-global-est-lvtln-trans]

  • ubm (DiagUbmProcessor) – The Universal Background Model.

  • feats_collection (FeaturesCollection) – The untransformed features.

  • posteriors (dict[str, list[list[tuple[int, float]]]]) – The posteriors indicating Gaussian indexes in the UBM.

  • utt2speak (dict[str, str], optional) – If provided, map each utterance to a speaker.




process(utterances, ubm=None, group_by='utterance', njobs=1)[source]

Compute the VTLN warp factors for the given utterances.

If the by_speaker option is set to True before the call to process(), the warps are computed on a per-speaker basis (i.e. each utterance of the same speaker gets an identical warp). If by_speaker is False, the warps are computed on a per-utterance basis.

  • utterances (Utterances) – The list of utterances to train the VTLN on.

  • ubm (DiagUbmProcessor, optional) – If provided, uses this UBM instead of computing a new one.

  • group_by (str, optional) – Must be ‘utterance’ or ‘speaker’.

  • njobs (int, optional) – Number of threads to use for computation. Defaults to 1.


warps (dict[str, float]) – Warps computed for each speaker or utterance, according to group_by. If grouped by speaker, all utterances of a speaker share the same warp.
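For example, to obtain one warp per speaker (a sketch reusing the utterances defined above):

>>> warps = vtln.process(utterances, group_by='speaker')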