Puddle Segmenter
Note
wordseg.algos.puddle in Python, wordseg-puddle in bash.
Puddle word segmentation algorithm
Implementation of the puddle philosophy developed by P. Monaghan.
See “Monaghan, P., & Christiansen, M. H. (2010). Words in puddles of sound: modelling psycholinguistic effects in speech segmentation. Journal of Child Language, 37(03), 545-564.”
The algorithm has two modes of operation:
Segmentation and online learning on the same text: Specify <input-text> only; <input-text> must be in phonologized form. The PUDDLE model is updated line by line, so segmentation performance is better toward the end of the text. Use the --nfolds and --njobs options to run the segmentation on several folds in parallel.
Training and segmentation on separate files: Specify <input-text> and --train-file <training-file>. Both texts must be in phonologized form. The PUDDLE model is trained offline on <training-file> before <input-text> is segmented. In this mode the --nfolds and --njobs options are not valid.
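The difference between the two modes can be illustrated with a deliberately simplified toy. This is a sketch only, not the wordseg implementation: the `toy_segment` function and its matching rule are hypothetical, and it ignores the boundary constraint and the frequency filter. It keeps just the core idea that unmatched stretches of an utterance are stored as words and reused to split later utterances.

```python
from collections import Counter


def toy_segment(phones, lexicon, learn=True):
    """Split one utterance (a list of phonemes) using words in `lexicon`.

    Toy rule (an assumption, not wordseg's): scan left to right and cut out
    the shortest known word starting at each position; unmatched stretches
    become words of their own.
    """
    words, start, i = [], 0, 0
    while i < len(phones):
        # End index of the shortest lexicon entry starting at i, if any.
        end = next(
            (j for j in range(i + 1, len(phones) + 1)
             if tuple(phones[i:j]) in lexicon),
            None,
        )
        if end is None:
            i += 1
            continue
        if i > start:  # flush the unmatched stretch before the known word
            words.append(tuple(phones[start:i]))
        words.append(tuple(phones[i:end]))
        i = start = end
    if start < len(phones):  # trailing stretch, or the whole utterance
        words.append(tuple(phones[start:]))
    if learn:
        lexicon.update(words)
    return " ".join("".join(w) for w in words)


corpus = ["k i t i d o g", "k i t i", "d o g", "k i t i d o g"]

# Mode 1: segment and learn on the same text, one utterance at a time.
online_lexicon = Counter()
online = [toy_segment(u.split(), online_lexicon) for u in corpus]

# Mode 2: train on a separate text first, then segment without learning.
offline_lexicon = Counter()
for utt in ["k i t i", "d o g"]:
    toy_segment(utt.split(), offline_lexicon)
offline = [toy_segment(u.split(), offline_lexicon, learn=False)
           for u in corpus[:1]]

print(online)   # ['kitidog', 'kiti', 'dog', 'kiti dog']
print(offline)  # ['kiti dog']
```

In the online run the first utterance is left whole because nothing has been learned yet, while the same utterance at the end is split into two words. With offline training the utterance is split on its first occurrence.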
class wordseg.algos.puddle.Puddle(window=2, by_frequency=False, log=<RootLogger root (WARNING)>)
Bases: object
Train and segment text with a PUDDLE model.
Implementation of a PUDDLE model with train() and segment() methods.
- Parameters
window (int, optional) – Number of phonemes to be taken into account for the boundary constraint. Defaults to 2.
by_frequency (bool, optional) – When True, choose the word candidates by filtering them by frequency. Defaults to False.
log (logging.Logger, optional) – The logger instance where to send messages.
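As a rough illustration of what a boundary constraint over `window` phonemes could look like, the sketch below accepts a candidate word only if the phonemes just before it match the end of a known word and the phonemes just after it match the start of a known word. This is one plausible reading, assumed for illustration; the helper names `edges` and `boundary_ok` are hypothetical and the details in wordseg may differ.

```python
def edges(word, window=2):
    """Return the first and last `window` phonemes of a word as tuples."""
    return tuple(word[:window]), tuple(word[-window:])


def boundary_ok(phones, i, j, beginnings, endings, window=2):
    """Check whether phones[i:j] sits between plausible word edges.

    Utterance edges always count as valid boundaries (an assumption).
    """
    left_ok = i == 0 or tuple(phones[max(0, i - window):i]) in endings
    right_ok = j == len(phones) or tuple(phones[j:j + window]) in beginnings
    return left_ok and right_ok


# Edges collected from two hypothetical known words.
beginnings, endings = set(), set()
for known in (["k", "i", "t", "i"], ["d", "o", "g"]):
    first, last = edges(known)
    beginnings.add(first)
    endings.add(last)

utterance = "k i t i d o g".split()
print(boundary_ok(utterance, 4, 7, beginnings, endings))  # True
print(boundary_ok(utterance, 1, 3, beginnings, endings))  # False
```

The first candidate ("d o g") is preceded by "t i", a known word ending, and reaches the end of the utterance. The second ("i t") is preceded by a lone "k", which is not a known ending, so it is rejected.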
wordseg.algos.puddle.segment(text, train_text=None, window=2, by_frequency=False, nfolds=5, njobs=1, log=<RootLogger root (WARNING)>)
Returns a word-segmented version of text using the puddle algorithm.
- Parameters
text (sequence of str) – A sequence of lines with syllable (or phoneme) boundaries marked by spaces and no word boundaries. Each line in the sequence corresponds to a single and complete utterance.
train_text (sequence of str) – The list of utterances to train the model on. If None (default) the model is trained online during segmentation. When train_text is specified, the options nfolds and njobs are ignored.
window (int, optional) – Number of phonemes to be taken into account for the boundary constraint. Defaults to 2.
by_frequency (bool, optional) – When True, choose the word candidates by filtering them by frequency. Defaults to False.
nfolds (int, optional) – The number of folds to segment the text on. This option is ignored if a train_text is provided.
njobs (int, optional) – The number of subprocesses to run in parallel. The folds are independent of each other and can be computed in parallel. Requesting a number of jobs greater than nfolds has no effect. This option is ignored if a train_text is provided.
log (logging.Logger, optional) – The logger instance where to send messages.
- Returns
The utterances from text with estimated word boundaries.
- Return type
generator
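A minimal sketch of preparing input in the expected format: one string per utterance, phonemes separated by spaces, no word boundaries. The `to_unsegmented` helper and the sample data are hypothetical, and the call to segment() is shown only as a comment because it requires wordseg to be installed.

```python
def to_unsegmented(gold_utterances):
    """Flatten utterances given as lists of words (each a list of phonemes)
    into space-separated phoneme strings without word boundaries."""
    return [
        " ".join(phone for word in utt for phone in word)
        for utt in gold_utterances
    ]


gold = [
    [["k", "i", "t", "i"], ["d", "o", "g"]],
    [["d", "o", "g"]],
]
text = to_unsegmented(gold)
print(text)  # ['k i t i d o g', 'd o g']

# With wordseg installed, the documented function returns a generator, so
# wrapping the call in list() materializes the result (not executed here;
# a real corpus would be much larger than this two-line sample):
#   from wordseg.algos.puddle import segment
#   segmented = list(segment(text))
```

Because the return value is a generator, it can only be iterated once; store it in a list if the segmented utterances are needed more than once.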
See also