VTLN

Extraction of VTLN warp factors from utterances.

Uses the Kaldi implementation of Linear Vocal Tract Length Normalization (see [kaldi-lvtln]).

Examples

>>> from shennong import Utterances
>>> from shennong.processor.vtln import VtlnProcessor
>>> wav = './test/data/test.wav'
>>> utterances = Utterances(
...     [('utt1', wav, 'spk1', 0, 1), ('utt2', wav, 'spk1', 1, 1.4)])

Initialize the VTLN model. Options can be specified at construction, or set afterwards as properties:

>>> vtln = VtlnProcessor(min_warp=0.95, max_warp=1.05, ubm={'num_gauss': 4})
>>> vtln.num_iters = 10

The process() method returns the computed warp for each utterance. If the by_speaker property is set to True and speaker information is provided with the utterances, the warps are computed per speaker, and every utterance from the same speaker is mapped to the same warp factor.

>>> warps = vtln.process(utterances)

Those warps can be passed individually to the process() method of MfccProcessor, FilterbankProcessor, PlpProcessor and SpectrogramProcessor to warp the corresponding features.

The features can also be warped directly via the pipeline:

>>> from shennong.pipeline import get_default_config, extract_features
>>> config = get_default_config('mfcc', with_vtln='simple')
>>> config['vtln']['ubm']['num_gauss'] = 4
>>> warped_features = extract_features(config, utterances)

References

kaldi-lvtln

https://kaldi-asr.org/doc/transform.html#transform_lvtln

class shennong.processor.vtln.VtlnProcessor(num_iters=15, min_warp=0.85, max_warp=1.25, warp_step=0.01, logdet_scale=0.0, norm_type='offset', subsample=5, features=None, ubm=None, by_speaker=True)

Bases: shennong.base.BaseProcessor

VTLN model

property name

Processor name

property num_iters

Number of iterations of training

property min_warp

Minimum warp considered

get_params(deep=True)

Get parameters for this processor.

Parameters

deep (boolean, optional) – If True, return the parameters for this processor and for any contained subobjects that are processors. Defaults to True.

Returns

params (mapping of string to any) – Parameter names mapped to their values.

property log

Processor logger

property max_warp

Maximum warp considered

set_logger(level, formatter='%(levelname)s - %(name)s - %(message)s')

Change level and/or format of the processor’s logger

Parameters
  • level (str) – The minimum log level handled by the logger (any message below this level will be ignored). Must be ‘debug’, ‘info’, ‘warning’ or ‘error’.

  • formatter (str, optional) – A string to format the log messages, see https://docs.python.org/3/library/logging.html#formatter-objects. By default, displays level, name and message. Use ‘%(asctime)s - %(levelname)s - %(name)s - %(message)s’ to display time, level, name and message.
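The two format strings mentioned above are ordinary format strings from Python's standard logging module. The following sketch shows what they render to and how a minimum level filters messages. It uses only the standard library; the logger name 'vtln' and the helper functions are hypothetical and are not part of shennong.

```python
import logging

# The two format strings quoted in the parameter description.
DEFAULT_FORMAT = '%(levelname)s - %(name)s - %(message)s'
TIMED_FORMAT = '%(asctime)s - %(levelname)s - %(name)s - %(message)s'


def render(fmt, name, level, message):
    """Format a single log record with the given format string."""
    record = logging.LogRecord(
        name=name, level=level, pathname='example.py', lineno=0,
        msg=message, args=None, exc_info=None)
    return logging.Formatter(fmt).format(record)


def is_handled(message_level, minimum_level):
    """True if a message at message_level passes the minimum level.

    Levels are given as lowercase names ('debug', 'info', ...), as in
    the documentation above, and mapped to the logging constants.
    """
    as_int = lambda name: getattr(logging, name.upper())
    return as_int(message_level) >= as_int(minimum_level)


# 'vtln' is a made-up logger name used for illustration only.
line = render(DEFAULT_FORMAT, 'vtln', logging.INFO, 'training started')
```

With a minimum level of 'warning', is_handled('info', 'warning') is False and is_handled('error', 'warning') is True, which matches the standard ordering debug < info < warning < error.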

set_params(**params)

Set the parameters of this processor.

Returns

self

Raises

ValueError – If any given parameter in params is invalid for the processor.

property warp_step

Warp step
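Together, min_warp, max_warp and warp_step define a discrete grid of candidate warp factors, one per LVTLN class. The sketch below enumerates that grid and locates the class closest to the identity warp (1.0). The rounding is modelled on the arithmetic of Kaldi's LVTLN training script; whether shennong uses exactly the same rounding is an assumption, and warp_grid is a hypothetical helper, not a shennong function.

```python
def warp_grid(min_warp=0.85, max_warp=1.25, warp_step=0.01):
    """Return (warps, default_class) for a grid of candidate warps.

    warps[i] is the warp factor of class i, and default_class is the
    index of the class whose warp is closest to 1.0 (no warping).
    """
    if warp_step <= 0:
        raise ValueError('warp_step must be positive')
    if min_warp > max_warp:
        raise ValueError('min_warp must not exceed max_warp')

    # Assumed rounding: add 0.5 before truncating so that floating
    # point noise in the division does not drop or add a class.
    num_classes = int(1.5 + (max_warp - min_warp) / warp_step)
    default_class = int(0.5 + (1.0 - min_warp) / warp_step)

    warps = [min_warp + i * warp_step for i in range(num_classes)]
    return warps, default_class


# Grid for the constructor defaults shown in the class signature.
warps, default_class = warp_grid()
```

Under these assumptions the default parameters give 41 classes with class 15 as the identity warp, and the example values min_warp=0.95, max_warp=1.05 give 11 classes.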

property logdet_scale

Scale on log-determinant term in auxiliary function

property norm_type

Type of fMLLR applied (offset, none or diag)

property subsample

When computing base LVTLN transforms, use only every n-th frame (a speedup)
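Subsampling of this kind can be pictured as a strided slice over the frames. This is a minimal sketch with a hypothetical helper; that the selection starts at the first frame is an assumption, not something stated here.

```python
def subsample_frames(frames, subsample=5):
    """Keep one frame out of every `subsample` frames.

    Assumes the selection starts at frame 0.
    """
    if subsample < 1:
        raise ValueError('subsample must be at least 1')
    return frames[::subsample]


# Twelve dummy frames, identified by their index.
kept = subsample_frames(list(range(12)), subsample=5)
```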

property by_speaker

If True, compute one warp per speaker; otherwise, compute one warp per utterance

property features

Features extraction configuration

property ubm

Diagonal UBM-GMM configuration

classmethod load(path)

Load the LVTLN from a binary file

classmethod load_warps(path)

Load precomputed warps

save(path)

Save the LVTLN to a binary file

save_warps(path)

Save the computed warps

compute_mapping_transform(feats_untransformed, feats_transformed, class_idx, warp, weights=None)

Set one of the transforms in the LVTLN to the minimum-squared-error solution mapping feats_untransformed to feats_transformed. Posteriors may optionally be used to downweight or remove silence.

Adapted from [kaldi-train-lvtln-special]

Parameters
  • feats_untransformed (FeaturesCollection) – Collection of original features.

  • feats_transformed (FeaturesCollection) – Collection of warped features.

  • class_idx (int) – Rank of warp considered.

  • warp (float) – Warp considered.

  • weights (dict[str, ndarrays], optional) – For each features in the collection, an array of weights to apply on the features frames. Unweighted by default.
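To illustrate what a weighted minimum-squared-error mapping means, the sketch below solves the one-dimensional case in pure Python: it finds a and b minimizing the weighted squared error between a * x + b and y. This is a simplified illustration only; the actual method estimates a full transform over feature vectors, and weighted_affine_fit is a hypothetical helper.

```python
def weighted_affine_fit(x, y, weights=None):
    """Return (a, b) minimizing sum_i w_i * (a * x_i + b - y_i) ** 2.

    x plays the role of untransformed values, y of transformed values,
    and weights of per-frame weights (all ones when omitted).
    """
    if weights is None:
        weights = [1.0] * len(x)
    if not len(x) == len(y) == len(weights):
        raise ValueError('x, y and weights must have the same length')

    total = sum(weights)
    mean_x = sum(w * xi for w, xi in zip(weights, x)) / total
    mean_y = sum(w * yi for w, yi in zip(weights, y)) / total

    # Weighted variance of x and covariance of (x, y), unnormalized.
    var = sum(w * (xi - mean_x) ** 2 for w, xi in zip(weights, x))
    cov = sum(w * (xi - mean_x) * (yi - mean_y)
              for w, xi, yi in zip(weights, x, y))

    a = cov / var
    b = mean_y - a * mean_x
    return a, b


# The last frame is an outlier; a zero weight removes its influence.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 100.0]
a, b = weighted_affine_fit(x, y, weights=[1.0, 1.0, 1.0, 0.0])
```

Setting a frame's weight to zero excludes it from the fit, which is how weights can remove silence frames; intermediate weights merely reduce a frame's influence.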

Raises

ValueError – If the features have inconsistent dimensions, or if the size of the posteriors does not correspond to the size of the features.

References

kaldi-train-lvtln-special

https://kaldi-asr.org/doc/gmm-train-lvtln-special_8cc.html

estimate(ubm, feats_collection, posteriors, utt2speak=None)

Estimate linear-VTLN transforms, either per utterance or for the supplied set of speakers (utt2speak option). Reads posteriors indicating Gaussian indexes in the UBM.

Adapted from [kaldi-global-est-lvtln-trans]

Parameters
  • ubm (DiagUbmProcessor) – The Universal Background Model.

  • feats_collection (FeaturesCollection) – The untransformed features.

  • posteriors (dict[str, list[list[tuple[int, float]]]]) – The posteriors indicating Gaussian indexes in the UBM.

  • utt2speak (dict[str, str], optional) – If provided, map each utterance to a speaker.
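The nested type of posteriors can be hard to read at a glance: each utterance maps to a list of frames, and each frame is a list of (Gaussian index, posterior) pairs. The sketch below builds a toy instance, validates it, and shows how utt2speak changes the grouping. All names and values are made up for illustration, and both helpers are hypothetical.

```python
from collections import defaultdict

# Toy posteriors: utterance -> frames -> (Gaussian index, posterior).
posteriors = {
    'utt1': [[(0, 0.7), (3, 0.3)], [(1, 1.0)]],
    'utt2': [[(2, 0.6), (0, 0.4)]],
}
utt2speak = {'utt1': 'spk1', 'utt2': 'spk1'}


def check_posteriors(posteriors, num_gauss):
    """Raise ValueError if an index or a posterior is out of range."""
    for utt, frames in posteriors.items():
        for t, frame in enumerate(frames):
            for index, value in frame:
                if not 0 <= index < num_gauss:
                    raise ValueError(f'{utt}, frame {t}: bad index {index}')
                if value < 0:
                    raise ValueError(f'{utt}, frame {t}: negative posterior')


def group_utterances(utterances, utt2speak=None):
    """Group utterance names by speaker, or one group per utterance."""
    groups = defaultdict(list)
    for utt in utterances:
        key = utt if utt2speak is None else utt2speak[utt]
        groups[key].append(utt)
    return dict(groups)


check_posteriors(posteriors, num_gauss=4)
per_speaker = group_utterances(posteriors, utt2speak)
per_utterance = group_utterances(posteriors)
```

With utt2speak, both utterances fall into the single group 'spk1', so one transform would be estimated from their pooled statistics; without it, each utterance forms its own group.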

References

kaldi-global-est-lvtln-trans

https://kaldi-asr.org/doc/gmm-global-est-lvtln-trans_8cc.html

process(utterances, ubm=None, group_by='utterance', njobs=1)

Compute the VTLN warp factors for the given utterances.

If the by_speaker option is set to True before the call to process(), the warps are computed on a per-speaker basis (i.e. all utterances of the same speaker have an identical warp). If by_speaker is False, the warps are computed on a per-utterance basis.

Parameters
  • utterances (Utterances) – The list of utterances to train the VTLN on.

  • ubm (DiagUbmProcessor, optional) – If provided, uses this UBM instead of computing a new one.

  • group_by (str, optional) – Must be ‘utterance’ or ‘speaker’.

  • njobs (int, optional) – Number of threads to use for computation, defaults to 1.

Returns

warps (dict[str, float]) – Warps computed for each speaker or utterance, according to group_by. When grouped by speaker, all utterances of a speaker share the same warp.
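The per-speaker case can be pictured as expanding a speaker-level mapping into an utterance-level one. This is a minimal sketch of that idea, not the library's internal code; warps_by_utterance is a hypothetical helper and the warp values are invented.

```python
def warps_by_utterance(speaker_warps, utt2speak):
    """Map each utterance to the warp factor of its speaker."""
    missing = sorted(set(utt2speak.values()) - set(speaker_warps))
    if missing:
        raise ValueError(f'no warp for speakers: {missing}')
    return {utt: speaker_warps[spk] for utt, spk in utt2speak.items()}


# Illustrative values only, not the output of a real VTLN training.
utt2speak = {'utt1': 'spk1', 'utt2': 'spk1', 'utt3': 'spk2'}
speaker_warps = {'spk1': 1.02, 'spk2': 0.97}
utt_warps = warps_by_utterance(speaker_warps, utt2speak)
```

Here 'utt1' and 'utt2' end up with the same warp because they share a speaker, while 'utt3' gets its own.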