VTLN
Extraction of VTLN warp factors from utterances.
Uses the Kaldi implementation of Linear Vocal Tract Length Normalization (see [kaldi-lvtln]).
Examples
>>> from shennong import Utterances
>>> from shennong.processor.vtln import VtlnProcessor
>>> wav = './test/data/test.wav'
>>> utterances = Utterances(
... [('utt1', wav, 'spk1', 0, 1), ('utt2', wav, 'spk1', 1, 1.4)])
Initialize the VTLN model. Other options can be specified at construction, or after:
>>> vtln = VtlnProcessor(min_warp=0.95, max_warp=1.05, ubm={'num_gauss': 4})
>>> vtln.num_iters = 10
The process() method returns the computed warp factor for each utterance. If the by_speaker property is set to True and speaker information is provided with the utterances, the warps are computed per speaker, and each utterance from the same speaker is mapped to the same warp factor.
>>> warps = vtln.process(utterances)
Those warps can be passed individually to the process() method of MfccProcessor, FilterbankProcessor, PlpProcessor and SpectrogramProcessor to warp the corresponding features.
The features can also be warped directly via the pipeline.
>>> from shennong.pipeline import get_default_config, extract_features
>>> config = get_default_config('mfcc', with_vtln='simple')
>>> config['vtln']['ubm']['num_gauss'] = 4
>>> warped_features = extract_features(config, utterances)
References
class
shennong.processor.vtln.
VtlnProcessor
(num_iters=15, min_warp=0.85, max_warp=1.25, warp_step=0.01, logdet_scale=0.0, norm_type='offset', subsample=5, features=None, ubm=None, by_speaker=True)[source]¶ Bases:
shennong.base.BaseProcessor
VTLN model
property name
    Processor name
property num_iters
    Number of iterations of training
property min_warp
    Minimum warp considered
get_params(deep=True)
    Get parameters for this processor.
    Parameters:
        deep (bool, optional) – If True, also return the parameters of contained subobjects that are processors. Default to True.
    Returns:
        params (mapping of string to any) – Parameter names mapped to their values.
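The get_params/set_params pair follows the scikit-learn estimator convention: parameters exposed at construction can be inspected and updated by name. A toy sketch of the pattern (illustrative only, not the shennong implementation):

```python
class ToyProcessor:
    """Minimal sketch of the get_params/set_params convention."""

    def __init__(self, num_iters=15, min_warp=0.85):
        self.num_iters = num_iters
        self.min_warp = min_warp

    def get_params(self, deep=True):
        # map parameter names to their current values
        return {'num_iters': self.num_iters, 'min_warp': self.min_warp}

    def set_params(self, **params):
        for name, value in params.items():
            if name not in self.get_params():
                raise ValueError(f'invalid parameter {name}')
            setattr(self, name, value)
        return self


proc = ToyProcessor().set_params(num_iters=10)
print(proc.get_params())  # {'num_iters': 10, 'min_warp': 0.85}
```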
property log
    Processor logger
property max_warp
    Maximum warp considered
set_logger(level, formatter='%(levelname)s - %(name)s - %(message)s')
    Change the level and/or format of the processor's logger.
    Parameters:
        level (str) – The minimum log level handled by the logger (any message below this level will be ignored). Must be 'debug', 'info', 'warning' or 'error'.
        formatter (str, optional) – A string to format the log messages, see https://docs.python.org/3/library/logging.html#formatter-objects. By default displays level, name and message. Use '%(asctime)s - %(levelname)s - %(name)s - %(message)s' to also display the time.
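The formatter string follows the standard Python logging format. The same behavior can be reproduced with the stdlib logging module alone (independent of shennong):

```python
import logging

# attach a handler using the same default formatter string as set_logger
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter('%(levelname)s - %(name)s - %(message)s'))
logger = logging.getLogger('vtln-demo')
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info('training started')  # displayed: INFO - vtln-demo - training started
logger.debug('ignored')          # below the 'info' level, not displayed
```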
set_params(**params)
    Set the parameters of this processor.
    Returns:
        self
    Raises:
        ValueError – If any given parameter in params is invalid for the processor.
property warp_step
    Warp step
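Together, min_warp, max_warp and warp_step define the grid of candidate warp factors the model searches over. A sketch of that grid with the default values, computed here in plain Python (shennong builds it internally):

```python
min_warp, max_warp, warp_step = 0.85, 1.25, 0.01

# number of candidate warps, both endpoints included
num_classes = round((max_warp - min_warp) / warp_step) + 1
warps = [round(min_warp + i * warp_step, 3) for i in range(num_classes)]

print(warps[0], warps[-1], len(warps))  # 0.85 1.25 41
```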
property logdet_scale
    Scale on log-determinant term in auxiliary function
property norm_type
    Type of fMLLR applied ('offset', 'none' or 'diag')
property subsample
    When computing base LVTLN transforms, use only every nth frame (a speedup)
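Subsampling simply keeps one frame out of every n when accumulating statistics for the base LVTLN transforms. With plain list slicing (illustrative only; the frame values are a stand-in for real feature frames):

```python
n = 5  # the subsample property (default value)
frames = list(range(23))  # stand-in for the frames of an utterance

# keep every nth frame
kept = frames[::n]
print(kept)  # [0, 5, 10, 15, 20]
```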
property by_speaker
    Compute the warps for each speaker, or for each utterance
property features
    Features extraction configuration
property ubm
    Diagonal UBM-GMM configuration
compute_mapping_transform(feats_untransformed, feats_transformed, class_idx, warp, weights=None)
    Set one of the transforms in lvtln to the minimum-squared-error solution mapping feats_untransformed to feats_transformed; posteriors may optionally be used to downweight or remove silence.
    Adapted from [kaldi-train-lvtln-special]
    Parameters:
        feats_untransformed (FeaturesCollection) – Collection of original features.
        feats_transformed (FeaturesCollection) – Collection of warped features.
        class_idx (int) – Rank of the warp considered.
        warp (float, optional) – Warp considered.
        weights (dict[str, ndarray], optional) – For each features in the collection, an array of weights to apply on the feature frames. Unweighted by default.
    Raises:
        ValueError – If the features have inconsistent dimensions, or if the size of the posteriors does not match the size of the features.
estimate
(ubm, feats_collection, posteriors, utt2speak=None)[source]¶ Estimate linear-VTLN transforms, either per utterance or for the supplied set of speakers (
utt2speak
option). Reads posteriors indicating Gaussian indexes in the UBM.Adapted from [kaldi-global-est-lvtln-trans]
- Parameters
ubm (DiagUbmProcessor) – The Universal Background Model.
feats_collection (FeaturesCollection) – The untransformed features.
posteriors (dict[str, list[list[tuple[int, float]]]]) – The posteriors indicating Gaussian indexes in the UBM.
utt2speak (dict[str, str], optional) – If provided, map each utterance to a speaker.
process
(utterances, ubm=None, group_by='utterance', njobs=1)[source]¶ Compute the VTLN warp factors for the given utterances.
If the
by_speaker
option is set to True before the call toprocess()
, the warps are computed on per speaker basis (i.e. each utterance of the same speaker has an identical warp). Ifper_speaker
is False, the warps are computed on a per-utterance basis.- Parameters
utterances (
Utterances
) – The list of utterances to train the VTLN on.ubm (DiagUbmProcessor, optional) – If provided, uses this UBM instead of computing a new one.
group_by (str, optional) – Must be ‘utterance’ or ‘speaker’.
njobs (int, optional) – Number of threads to use for computation, default to 1.
- Returns
warps (dict[str, float]) – Warps computed for each speaker or utterance, according to
group_by
. If by speaker: same warp for all utterances of this speaker.
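When the returned dictionary is keyed by speaker, it can be expanded back to per-utterance warps through the utterance-to-speaker mapping. A sketch with plain dictionaries (the names utt2spk and warps_by_speaker are illustrative, not shennong objects):

```python
# hypothetical speaker labels and per-speaker warps, for illustration only
utt2spk = {'utt1': 'spk1', 'utt2': 'spk1', 'utt3': 'spk2'}
warps_by_speaker = {'spk1': 0.96, 'spk2': 1.02}

# every utterance inherits the warp of its speaker
warps_by_utterance = {
    utt: warps_by_speaker[spk] for utt, spk in utt2spk.items()}
print(warps_by_utterance)
# {'utt1': 0.96, 'utt2': 0.96, 'utt3': 1.02}
```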