Introduction to speech features¶

Implemented models¶

Note

All the models implemented from Kaldi are using pykaldi, see [Can2018].

The following features extraction models are implemented in shennong, the detailed documentation is available here:

Features	Implementation
Spectrogram	from Kaldi
Filterbank	from Kaldi
MFCC	from Kaldi
PLP (optional Rasta filtering)	Kaldi and rastapy
Bottleneck	from BUTspeech
One Hot Vectors	shennong
Pitch	from Kaldi and CREPE
Energy	from Kaldi

The following post-processing pipelines are implemented in shennong, the detailed documentation is available here:

Post-processing	Implementation
Delta / Delta-delta	from Kaldi
Mean Variance Normalization (CMVN)	from Kaldi
Voice Activity Detection	from Kaldi
Vocal Tract Length Normalization (VTLN)	from Kaldi

Here is an illustration of the features (without post-processing) computed on the example wav file provided as shennong/test/data/test.wav, this picture can be reproduced using the example script shennong/examples/plot_features.py:

Features comparison¶

This section details a phone discrimination task based on the features available in shennong. It reproduces the track 1 of the Zero Speech Challenge 2015 using the same datasets and setup. The recipe to replicate this experiment is available at shennong/examples/features_abx.

Setup:
- Two languages are tested:
  - English (Buckeye corpus, 12 speakers for a duration of 10:34:44)
  - Xitsonga (NCHLT corpus, 24 speakers for a duration of 4:24:37)
- The considered features extraction algorithms are:
  - Spectrogram
  - Mel Filterbanks
  - MFCC
  - PLP (with and with Rasta filtering)
  - Bottleneck
- Each is tested with 3 distinct parameters sets, with and without VTLN:
  - raw: the raw features only,
  - +∆/F0: raw features with delta, delta-delta and Kaldi pitch estimates,
  - +CMVN: raw features with CMVN normalization by speaker, with delta, delta-delta and Kaldi pitch estimates.
- The considered ABX tasks are the same as in the ZRC2015 track1, namely a phone discrimination task within and across speakers.

Note

The results are ABX error rates on phone discrimination (given in %, random score is 50%).

Results on English across speakers:

features	without VTLN			with VTLN
features	raw	+∆/F0	+CMVN	raw	+∆/F0	+CMVN
spectrogram	30.3	27.9	29.7
filterbank	24.9	22.1	26.5	23.2	20.7	25.4
mfcc	27.2	26.4	24.0	23.4	22.7	20.0
plp	28.0	26.6	23.8	24.7	23.5	19.7
rastaplp	28.5	26.8	25.3	24.6	23.6	21.3
bottleneck	12.5	12.5	12.5

Results on English within speakers:

features	without VTLN			with VTLN
features	raw	+∆/F0	+CMVN	raw	+∆/F0	+CMVN
spectrogram	16.7	15.2	20.2
filterbank	12.8	11.6	18.2	12.6	11.4	18.1
mfcc	13.0	12.5	12.4	12.8	12.3	12.0
plp	12.5	12.4	11.9	12.5	12.4	11.9
rastaplp	14.3	14.2	12.5	14.2	14.1	12.5
bottleneck	8.5	8.5	8.6

Results on Xitsonga across speakers:

features	without VTLN			with VTLN
features	raw	+∆/F0	+CMVN	raw	+∆/F0	+CMVN
spectrogram	34.6	32.0	26.5
filterbank	28.1	25.1	21.5	26.9	24.0	20.7
mfcc	33.6	32.8	26.0	31.4	30.6	22.7
plp	33.5	31.2	26.2	31.7	29.5	22.2
rastaplp	27.9	25.2	23.9	25.0	22.8	21.7
bottleneck	9.5	9.6	9.6

Results on Xitsonga within speakers:

features	without VTLN			with VTLN
features	raw	+∆/F0	+CMVN	raw	+∆/F0	+CMVN
spectrogram	19.2	16.8	19.2
filterbank	13.8	11.7	15.2	13.6	11.4	15.2
mfcc	17.1	16.2	14.6	17.5	16.5	14.6
plp	16.2	14.6	14.0	16.2	14.7	14.2
rastaplp	13.7	12.5	12.3	13.5	12.2	12.0
bottleneck	6.9	7.0	7.3

Comparison with the ZRC2015 baseline (on MFCC only), see [Versteegh2015]:

	English		Xitsonga
	across	within	across	within
ZRC2015	28.1	15.6	33.8	19.1
shennong raw	27.2	13.0	33.6	17.1
shennong +CMVN	24.0	12.4	26.0	14.6
ZRC2015 VTLN	24.0	14.6
shennong VTLN raw	23.4	12.8	31.4	17.5
shennong VTLN +CMVN	20.0	12.0	22.7	14.2

Versteegh2015: The zero resource speech challenge 2015, Maarten Versteegh, Roland Thiollière, Thomas Schatz, Xuan-Nga Cao, Xavier Anguera, Aren Jansen, and Emmanuel Dupoux. In INTERSPEECH-2015. 2015.
Can2018: PyKaldi: A Python Wrapper for Kaldi, Dogan Can and Victor R. Martinez and Pavlos Papadopoulos and Shrikanth S. Narayanan, in IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), 2018.