Puddle Segmenter

Note

wordseg.algos.puddle in Python, wordseg-puddle in bash.

Puddle word segmentation algorithm

Implementation of the PUDDLE philosophy developed by P. Monaghan.

See “Monaghan, P., & Christiansen, M. H. (2010). Words in puddles of sound: modelling psycholinguistic effects in speech segmentation. Journal of child language, 37(03), 545-564.”

The algorithm has two modes of operation:

  • Segmentation and online learning on the same text: Specify <input-text> only; it must be in phonologized form. The PUDDLE model is updated line by line, so segmentation quality improves as more of the text has been processed. Use the --nfolds and --njobs options to run the segmentation in several folds in parallel.

  • Training and segmentation on separate files: Specify <input-text> and --train-file <training-file>. Both texts must be in phonologized form. The PUDDLE model is trained offline on <training-file> before <input-text> is segmented. In this mode the --nfolds and --njobs options are not valid.
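
The difference between the two modes can be illustrated with a toy learner. The sketch below is not the wordseg implementation: ToyPuddle is a hypothetical class that keeps only the core idea (store an unanalysable chunk as a word, then split later utterances around known words). It assumes a left-to-right, longest-match-first search and ignores the window and by_frequency options.

```python
from collections import Counter


class ToyPuddle:
    """Hypothetical, simplified PUDDLE-like learner (not wordseg's Puddle)."""

    def __init__(self):
        # Maps a word (tuple of phonemes) to how often it has been seen.
        self.lexicon = Counter()

    def _split(self, phones, update):
        if not phones:
            return []
        n = len(phones)
        # Assumption: scan left to right, trying the longest candidate first.
        for start in range(n):
            for end in range(n, start, -1):
                word = tuple(phones[start:end])
                if word in self.lexicon:
                    if update:
                        self.lexicon[word] += 1
                    # Segment what remains on each side of the known word.
                    left = self._split(phones[:start], update)
                    right = self._split(phones[end:], update)
                    return left + [word] + right
        # No known word inside: treat the whole chunk as a single word.
        word = tuple(phones)
        if update:
            self.lexicon[word] += 1
        return [word]

    def train(self, text):
        # Offline training: learn from every utterance, discard the output.
        for _ in self.segment(text, update_model=True):
            pass

    def segment(self, text, update_model=True):
        for utterance in text:
            words = self._split(utterance.split(), update_model)
            yield " ".join("".join(word) for word in words)


# Mode 1: segmentation and online learning on the same text.
online = ToyPuddle()
online_out = list(online.segment(["d o g", "th e d o g", "th e k a t"]))
print(online_out)  # ['dog', 'the dog', 'the kat']

# Mode 2: train on one text, then segment another with a frozen model.
offline = ToyPuddle()
offline.train(["d o g", "k a t"])
offline_out = list(
    offline.segment(["th e d o g", "th e k a t"], update_model=False))
print(offline_out)  # ['the dog', 'the kat']
```

In the online run the first utterance comes out unsegmented because the lexicon is still empty. Later utterances benefit from what has been stored, which is why results improve toward the end of the text.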

class wordseg.algos.puddle.Puddle(window=2, by_frequency=False, log=<RootLogger root (WARNING)>)

Bases: object

Train and segment text with a PUDDLE model

Implementation of a PUDDLE model with train() and segment() methods.

Parameters
  • window (int, optional) – Number of phonemes to be taken into account for the boundary constraint. Defaults to 2.

  • by_frequency (bool, optional) – When True, choose the word candidates by filtering them by frequency. Defaults to False.

  • log (logging.Logger, optional) – The logger instance to send messages to.

segment(text, update_model=True)

Segments a text using the trained PUDDLE model

text must be a sequence of strings, each one considered as an utterance.

If update_model is True, the model is trained online during segmentation. Otherwise it stays constant.

Yields the segmented utterances.

train(text)

Train a PUDDLE model from text

text must be a sequence of strings, each one considered as an utterance.
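
A minimal usage sketch of the class, assuming the wordseg package is installed and relying only on the constructor and the train() and segment() signatures documented above. The function names segment_offline and segment_online and the sample utterances are illustrative and not part of wordseg.

```python
# Illustrative data: one utterance per string, phonemes separated by spaces,
# no word boundaries.
TRAIN_TEXT = ["d o g", "k a t", "dh ax d o g"]
TEST_TEXT = ["dh ax k a t", "d o g"]


def segment_offline(train_text, test_text, window=2):
    """Train on train_text, then segment test_text with a constant model."""
    # Imported here so the sketch only needs wordseg when it is called.
    from wordseg.algos.puddle import Puddle

    model = Puddle(window=window)
    model.train(train_text)
    # update_model=False keeps the trained model constant during segmentation.
    return list(model.segment(test_text, update_model=False))


def segment_online(text, window=2):
    """Segment text while updating the model after each utterance."""
    from wordseg.algos.puddle import Puddle

    model = Puddle(window=window)
    # segment() yields the utterances lazily; list() consumes the generator.
    return list(model.segment(text, update_model=True))
```

Calling segment_offline(TRAIN_TEXT, TEST_TEXT) would return the segmented test utterances as a list of strings. The exact segmentation depends on the training data and on the window setting.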

wordseg.algos.puddle.segment(text, train_text=None, window=2, by_frequency=False, nfolds=5, njobs=1, log=<RootLogger root (WARNING)>)

Returns a word-segmented version of text using the PUDDLE algorithm

Parameters
  • text (sequence of str) – A sequence of lines with syllable (or phoneme) boundaries marked by spaces and no word boundaries. Each line in the sequence corresponds to a single and complete utterance.

  • train_text (sequence of str) – The list of utterances to train the model on. If None (default) the model is trained online during segmentation. When train_text is specified, the options nfolds and njobs are ignored.

  • window (int, optional) – Number of phonemes to be taken into account for the boundary constraint. Defaults to 2.

  • by_frequency (bool, optional) – When True, choose the word candidates by filtering them by frequency. Defaults to False.

  • nfolds (int, optional) – The number of folds to segment the text on. This option is ignored if a train_text is provided.

  • njobs (int, optional) – The number of subprocesses to run in parallel. The folds are independent of each other and can be computed in parallel. Requesting more jobs than nfolds has no effect. This option is ignored if a train_text is provided.

  • log (logging.Logger, optional) – The logger instance to send messages to.

Returns

The utterances from text with estimated word boundaries.

Return type

generator
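
A short sketch of how the function might be called, assuming wordseg is installed and using only the parameters listed above. as_phone_sequence and segment_in_folds are hypothetical helpers, not part of wordseg. The first one treats each character as one phoneme, which only holds for one-character-per-phoneme transcriptions.

```python
def as_phone_sequence(utterance):
    """Drop word boundaries and separate the symbols with spaces.

    Hypothetical helper: assumes one character per phoneme.
    """
    return " ".join(utterance.replace(" ", ""))


def segment_in_folds(utterances, nfolds=5, njobs=1):
    """Segment utterances online, in nfolds folds on up to njobs processes."""
    # Imported here so the sketch only needs wordseg when it is called.
    from wordseg.algos.puddle import segment

    text = [as_phone_sequence(utt) for utt in utterances]
    # train_text is left to None, so the model is trained online and the
    # nfolds and njobs options are taken into account.
    return list(segment(text, nfolds=nfolds, njobs=njobs))


print(as_phone_sequence("the dog"))  # t h e d o g
```

Because segment() returns a generator, the result is wrapped in list() to obtain all segmented utterances at once.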