Dataset description

CogSci-2019: Unsupervised speech and human perception dataset

The CogSci dataset consists of 112 cross-language English and French triplets of stimuli, all of which are CVC triphones. The aim was to test the perceived similarity of French and English sounds. Six speakers read the stimuli to build the dataset: two were female bilinguals (American English and French), and four were male (two native speakers of North American English and two native speakers of French). 63 English-speaking and 55 French-speaking participants were tested. The purpose of this dataset is to assess how well French speakers and English speakers discriminate between different vowel sounds (the second phone of each stimulus) in English and in French, so the stimuli in a triplet differ only in the second phone.

For more details, refer to this paper and its associated code on GitHub.

If you use this dataset, please cite the associated paper:

@inproceedings{millet2019comparing,
    title       = {Comparing unsupervised speech learning directly to human performance in speech perception},
    author      = {Millet, Juliette and Jurov, Nika and Dunbar, Ewan},
    booktitle   = {CogSci 2019-41st Annual Meeting of Cognitive Science Society},
    year        = {2019}
}

Zero Resource Speech Challenge 2017 dataset

The Zero Resource Speech dataset provides a cleaned subset of triphone stimuli taken from the one-second French and one-second English test sets of the 2017 Zero Resource Speech Challenge. The cleaned subset consists of 5202 triplets (2214 from English), covering 461 distinct centre phone contrasts (212 English, 249 French) in a total of 201 distinct contexts (118 English, 83 French). Most phone comparisons appear in three contexts each (a total of 47 English contrasts appear in one, two, or four contexts). The speakers used (15 English, 18 French) have, in our assessment, pronunciations close to standard American English or Metropolitan French. 93 French-speaking and 91 American English-speaking participants performed an ABX test on these French and English triplets (approximately 185 triplets per person).

Each answer is scored from -3 to 3, depending on the participant's certainty and on whether the response was correct.

  • If the participant was completely sure of the response and it was correct, the score is 3 (-3 if it was incorrect).

  • If the participant had a few doubts about the response and it was correct, the score is 2 (-2 if it was incorrect).

  • Finally, if the participant was unsure of the response and it was correct, the score is 1 (-1 if it was incorrect).
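
The scoring rule above can be sketched in a few lines of Python. This is only an illustration: the function name `abx_score` and the certainty labels (`"unsure"`, `"few_doubts"`, `"sure"`) are assumptions made for this example and are not taken from the dataset files or the associated code.

```python
# Hypothetical certainty labels mapped to the magnitude of the score.
# The labels are assumptions for illustration, not names used in the dataset.
CERTAINTY_MAGNITUDE = {"unsure": 1, "few_doubts": 2, "sure": 3}


def abx_score(certainty: str, is_correct: bool) -> int:
    """Return the signed score for one ABX answer.

    The magnitude (1 to 3) encodes the certainty, and the sign encodes
    whether the response was correct (positive) or incorrect (negative).
    """
    if certainty not in CERTAINTY_MAGNITUDE:
        raise ValueError(f"unknown certainty level: {certainty!r}")
    magnitude = CERTAINTY_MAGNITUDE[certainty]
    return magnitude if is_correct else -magnitude


if __name__ == "__main__":
    for level in CERTAINTY_MAGNITUDE:
        print(level, abx_score(level, True), abx_score(level, False))
```

Under this reading of the rule, a score of 0 never occurs: every answer carries both a direction and a certainty.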

We used the human results of each participant group on the triplets in their native language in this article; the code linked to the ZRS challenge is available on GitHub. We used the results of the English-speaking participants in this article and its associated GitHub repository.

If you use this dataset, please cite one of our papers:

@article{millet2020perceptimatic,
  title    = {The Perceptimatic English Benchmark for Speech Perception Models},
  author   = {Millet, Juliette and Dunbar, Ewan},
  journal  = {arXiv preprint arXiv:2005.03418},
  year     = {2020}
}

or

@article{millet2020perceptimatic,
  title   = {Perceptimatic: A human speech perception benchmark for unsupervised subword modelling},
  author  = {Millet, Juliette and Dunbar, Ewan},
  journal = {arXiv preprint arXiv:2010.05961},
  year    = {2020}
}

Pilot July 2018 dataset

The Pilot July 2018 dataset is a set of ABX triphone stimuli built from the TIMIT dataset. All the triplets are in American English; however, TIMIT distinguishes 8 dialect regions (DR): New England (DR1), Northern (DR2), North Midland (DR3), South Midland (DR4), Southern (DR5), New York City (DR6), Western (DR7) and Army Brat (DR8). Each ABX triplet contains two triphones from the same dialect (A and B) and a third one from another dialect (X).
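
The dialect constraint on these triplets can be expressed as a small check. The sketch below is only illustrative: the `Stimulus` record and the `is_valid_pilot_triplet` function are hypothetical names introduced here, and they do not reflect how the dataset files or the associated code actually store the stimuli.

```python
from typing import NamedTuple

# TIMIT dialect regions (DR = Dialect Region).
TIMIT_DIALECTS = {
    "DR1": "New England",
    "DR2": "Northern",
    "DR3": "North Midland",
    "DR4": "South Midland",
    "DR5": "Southern",
    "DR6": "New York City",
    "DR7": "Western",
    "DR8": "Army Brat",
}


class Stimulus(NamedTuple):
    """Hypothetical record for one triphone stimulus."""
    triphone: str  # illustrative phone labels, e.g. "b ih t"
    dialect: str   # one of the TIMIT_DIALECTS keys


def is_valid_pilot_triplet(a: Stimulus, b: Stimulus, x: Stimulus) -> bool:
    """Check that A and B share a dialect and X comes from another one."""
    for stim in (a, b, x):
        if stim.dialect not in TIMIT_DIALECTS:
            raise ValueError(f"unknown dialect region: {stim.dialect!r}")
    return a.dialect == b.dialect and x.dialect != a.dialect
```

A check of this kind only covers the dialect layout of a triplet; it says nothing about which phone contrast the triplet tests.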

In the experiments, about 185 triplets were tested on 12 native French speakers and 26 native English speakers. The purpose is to compare the phone discrimination of native French speakers with that of native English speakers. It focuses on the contrast between the centre phones of the stimuli (phones are either vowels or consonants). The code for these studies is available here.

Pilot August 2018 dataset

The Pilot August 2018 dataset is composed of 144 English triplets uttered by two native English speakers, one man and one woman. For each triplet, one speaker uttered two stimuli (A and B) and the other uttered the last one. Experiments were carried out on 109 people: 51 French subjects and 58 English subjects. Vowel and consonant contrasts are studied on either the first phone or the second phone of the stimulus. The code for these studies is available here.

WorldVowels dataset

This dataset is composed of French, English, Brazilian Portuguese, Turkish, Estonian and German stimuli (some German stimuli are taken from the OLLO database). French-speaking and English-speaking participants performed an ABX test on these multilingual triplets (approximately 185 triplets per person). Only vowel contrasts are tested.

Each answer is scored from -3 to 3, depending on the participant's certainty and on whether the response was correct.

  • If the participant was completely sure of the response and it was correct, the score is 3 (-3 if it was incorrect).

  • If the participant had a few doubts about the response and it was correct, the score is 2 (-2 if it was incorrect).

  • Finally, if the participant was unsure of the response and it was correct, the score is 1 (-1 if it was incorrect).
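
When working with these signed scores, it can help to split each one back into correctness and certainty, for example to compute a plain accuracy. The following is a minimal sketch under the assumption that scores are stored as non-zero integers between -3 and 3; the function names `decode_score` and `accuracy` are hypothetical and not part of the dataset or its associated code.

```python
from statistics import mean
from typing import Iterable, Tuple


def decode_score(score: int) -> Tuple[bool, int]:
    """Split a signed score into (is_correct, certainty).

    The sign gives correctness and the absolute value (1 to 3) gives the
    certainty. A score of 0 is rejected, assuming it never occurs.
    """
    if score == 0 or abs(score) > 3:
        raise ValueError(f"score out of range: {score}")
    return score > 0, abs(score)


def accuracy(scores: Iterable[int]) -> float:
    """Proportion of correct answers, ignoring certainty."""
    return mean(1.0 if decode_score(s)[0] else 0.0 for s in scores)


if __name__ == "__main__":
    answers = [3, -1, 2, -3]  # made-up scores, for illustration only
    print([decode_score(s) for s in answers])
    print(accuracy(answers))
```

Discarding the certainty in this way is a simplification; analyses that use the full -3 to 3 scale keep more of the information the participants provided.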