VTLN
Extraction of VTLN warp factors from utterances.
Uses the Kaldi implementation of Linear Vocal Tract Length Normalization (see [kaldi-lvtln]).
Examples
>>> from shennong import Utterances
>>> from shennong.processor.vtln import VtlnProcessor
>>> wav = './test/data/test.wav'
>>> utterances = Utterances(
...     [('utt1', wav, 'spk1', 0, 1), ('utt2', wav, 'spk1', 1, 1.4)])
Initialize the VTLN model. Other options can be specified at construction, or after:
>>> vtln = VtlnProcessor(min_warp=0.95, max_warp=1.05, ubm={'num_gauss': 4})
>>> vtln.num_iters = 10
The process() method returns the computed warp for each utterance. If the by_speaker
property is set to True and speaker information is provided with the utterances,
the warps are computed per speaker, and each utterance
from the same speaker is mapped to the same warp factor.
>>> warps = vtln.process(utterances)
Those warps can be passed individually to the process() method of
MfccProcessor, FilterbankProcessor, PlpProcessor and SpectrogramProcessor
to warp the corresponding features.
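For instance, when the warps are returned per speaker, resolving the factor to pass for each utterance is a plain dictionary lookup. A sketch with made-up warp values; the `utt2spk` mapping is an assumed input, not part of the shennong API:

```python
# hypothetical warps as returned by process() when grouped by speaker
# (illustrative values, not real output)
warps = {'spk1': 0.97, 'spk2': 1.03}
utt2spk = {'utt1': 'spk1', 'utt2': 'spk1', 'utt3': 'spk2'}

# resolve the warp factor to use for each utterance's feature extraction
utt_warp = {utt: warps[spk] for utt, spk in utt2spk.items()}
# {'utt1': 0.97, 'utt2': 0.97, 'utt3': 1.03}
```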
The features can also be warped directly via the pipeline.
>>> from shennong.pipeline import get_default_config, extract_features
>>> config = get_default_config('mfcc', with_vtln='simple')
>>> config['vtln']['ubm']['num_gauss'] = 4
>>> warped_features = extract_features(config, utterances)
References
- class shennong.processor.vtln.VtlnProcessor(num_iters=15, min_warp=0.85, max_warp=1.25, warp_step=0.01, logdet_scale=0.0, norm_type='offset', subsample=5, features=None, ubm=None, by_speaker=True)
  Bases: shennong.base.BaseProcessor
  VTLN model
- property name – Processor name
- property num_iters – Number of iterations of training
- property min_warp – Minimum warp considered
- get_params(deep=True) – Get parameters for this processor.
  - Parameters: deep (boolean, optional) – If True, will return the parameters for this processor and contained subobjects that are processors. Default to True.
  - Returns: params (mapping of string to any) – Parameter names mapped to their values.
- property log – Processor logger
- property max_warp – Maximum warp considered
- set_logger(level, formatter='%(levelname)s - %(name)s - %(message)s') – Change the level and/or format of the processor's logger.
  - Parameters: level (str) – The minimum log level handled by the logger (any message below this level will be ignored). Must be 'debug', 'info', 'warning' or 'error'.
    formatter (str, optional) – A string to format the log messages, see https://docs.python.org/3/library/logging.html#formatter-objects. By default displays level and message. Use '%(asctime)s - %(levelname)s - %(name)s - %(message)s' to display time, level, name and message.
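The formatter string follows the standard logging module syntax; a standalone sketch of the default format, using a hypothetical logger name rather than shennong's own logger:

```python
import logging

# build a logger with the same default format as set_logger()
logger = logging.getLogger('vtln-demo')  # hypothetical name
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('%(levelname)s - %(name)s - %(message)s'))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info('training started')  # prints: INFO - vtln-demo - training started
logger.debug('ignored message')  # below the INFO level, not printed
```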
- set_params(**params) – Set the parameters of this processor.
  - Returns: self
  - Raises: ValueError – If any given parameter in params is invalid for the processor.
- property warp_step – Warp step
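Together, min_warp, max_warp and warp_step define the grid of candidate warp factors the training searches over. A rough sketch of that grid with the default values (the exact enumeration inside Kaldi may differ):

```python
# default search range of VtlnProcessor
min_warp, max_warp, warp_step = 0.85, 1.25, 0.01

# inclusive grid of candidate warp factors
num_steps = round((max_warp - min_warp) / warp_step) + 1
candidates = [round(min_warp + i * warp_step, 4) for i in range(num_steps)]
# 41 candidates: 0.85, 0.86, ..., 1.25
```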
- property logdet_scale – Scale on the log-determinant term in the auxiliary function
- property norm_type – Type of fMLLR applied ('offset', 'none' or 'diag')
- property subsample – When computing base LVTLN transforms, use one frame in every n (a speedup)
- property by_speaker – Compute the warps for each speaker, or each utterance
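When warps are computed per speaker, utterances are first grouped by their speaker label; conceptually, the grouping amounts to inverting the utterance-to-speaker mapping. A sketch, with `utt2spk` as an assumed input:

```python
from collections import defaultdict

# assumed utterance-to-speaker mapping, as given in the utterances
utt2spk = {'utt1': 'spk1', 'utt2': 'spk1', 'utt3': 'spk2'}

# invert it: all utterances sharing a speaker share one warp estimate
spk2utt = defaultdict(list)
for utt, spk in utt2spk.items():
    spk2utt[spk].append(utt)
# {'spk1': ['utt1', 'utt2'], 'spk2': ['utt3']}
```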
- property features – Features extraction configuration
- property ubm – Diagonal UBM-GMM configuration
- compute_mapping_transform(feats_untransformed, feats_transformed, class_idx, warp, weights=None) – Set one of the LVTLN transforms to the minimum-squared-error solution mapping feats_untransformed to feats_transformed; posteriors may optionally be used to downweight or remove silence.
  Adapted from [kaldi-train-lvtln-special]
  - Parameters: feats_untransformed (FeaturesCollection) – Collection of original features.
    feats_transformed (FeaturesCollection) – Collection of warped features.
    class_idx (int) – Rank of the warp considered.
    warp (float, optional) – Warp considered.
    weights (dict[str, ndarray], optional) – For each features in the collection, an array of weights to apply on the features frames. Unweighted by default.
  - Raises: ValueError – If the features have inconsistent dimensions, or if the size of the posteriors does not match the size of the features.
References
- estimate(ubm, feats_collection, posteriors, utt2speak=None) – Estimate linear-VTLN transforms, either per utterance or for the supplied set of speakers (utt2speak option). Reads posteriors indicating Gaussian indexes in the UBM.
  Adapted from [kaldi-global-est-lvtln-trans]
  - Parameters: ubm (DiagUbmProcessor) – The Universal Background Model.
    feats_collection (FeaturesCollection) – The untransformed features.
    posteriors (dict[str, list[list[tuple[int, float]]]]) – The posteriors indicating Gaussian indexes in the UBM.
    utt2speak (dict[str, str], optional) – If provided, map each utterance to a speaker.
References
- process(utterances, ubm=None, group_by='utterance', njobs=1) – Compute the VTLN warp factors for the given utterances.
  If the by_speaker option is set to True before the call to process(), the warps are computed on a per-speaker basis (i.e. each utterance of the same speaker has an identical warp). If by_speaker is False, the warps are computed on a per-utterance basis.
  - Parameters: utterances (Utterances) – The list of utterances to train the VTLN on.
    ubm (DiagUbmProcessor, optional) – If provided, uses this UBM instead of computing a new one.
    group_by (str, optional) – Must be 'utterance' or 'speaker'.
    njobs (int, optional) – Number of threads to use for computation, default to 1.
  - Returns: warps (dict[str, float]) – Warps computed for each speaker or utterance, according to group_by. If grouped by speaker, the same warp applies to all utterances of that speaker.