Diphone Based Segmenter

Note

wordseg.algos.dibs in Python, wordseg-dibs in Bash.

Diphone based segmentation algorithm

A DiBS model assigns each phrase-medial diphone a value between 0 and 1 inclusive: the model's estimated probability that a word boundary falls between the two phones.

What sets DiBS apart from the other segmentation algorithms in wordseg is that it requires a small training set with word boundaries (i.e. in phonologized form, not in prepared form). There are two ways to provide it:

  • Train and segment on the same text: Specify <input-text> only; <input-text> must be in phonologized form. The algorithm uses the provided word boundaries to train the model, then removes them to generate the text to segment.

  • Train and segment on different texts: Specify <input-text> AND --train-file <training-file>. Here <input-text> must be in prepared form (without word boundaries) whereas <training-file> must contain word boundaries.

For details, see Daland, R., Pierrehumbert, J.B., “Learning diphone-based segmentation”. Cognitive science 35(1), 119-155 (2011).

class wordseg.algos.dibs.AbstractSegmenter(summary, pwb=None, threshold=0.5, log=<RootLogger root (WARNING)>)[source]

Bases: object

An interface for DiBS segmentation

Subclasses must implement the init_diphones() method.

Parameters
  • summary (CorpusSummary) – Diphone statistics computed on a training text.

  • pwb (float, optional) – Probability of a word boundary. If not specified, it is estimated from the training text as (nwords - nlines)/(nphones - nlines). When defined, it must be in [0, 1].

  • threshold (float, optional) – Threshold on word boundary probabilities. If a diphone has a word boundary probability greater than this threshold, a word boundary is added. Must be in [0, 1]. The optimal threshold is 0.5 (default).

  • log (logging.Logger, optional) – The logger to send messages to.

Raises

ValueError: – If threshold or pwb is not a float in [0, 1].
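The default pwb estimate can be checked by hand. An utterance of n phones has n - 1 phrase-medial positions, and an utterance of w words has w - 1 word boundaries among them, so summing over all lines gives the ratio (nwords - nlines)/(nphones - nlines). The sketch below only illustrates that arithmetic; the function name estimate_pwb and the toy counts are made up for this example and are not part of wordseg.

```python
def estimate_pwb(nwords, nlines, nphones):
    """Estimate the prior probability of a word boundary.

    Hypothetical helper: computes (nwords - nlines) / (nphones - nlines),
    the share of phrase-medial positions that carry a word boundary.
    """
    positions = nphones - nlines  # phrase-medial positions in the corpus
    if positions <= 0:
        raise ValueError('the corpus has no phrase-medial position')
    return (nwords - nlines) / positions


# Toy corpus (invented numbers): 2 utterances, 5 words, 14 phones.
# 5 - 2 = 3 boundaries over 14 - 2 = 12 medial positions.
pwb = estimate_pwb(nwords=5, nlines=2, nphones=14)
print(pwb)  # 0.25
```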

abstract init_diphones()[source]

Initializes diphone probabilities from the summary

segment(utterance)[source]

Estimates word boundaries based on diphone probabilities

Parameters

utterance (str) – The utterance to segment. It must be a sequence of phones or syllables separated by spaces.

Returns

The segmented utterance, with phone separation removed and spaces at estimated word boundaries.
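The thresholding step described above can be sketched in a few lines. This is a standalone illustration, not the wordseg implementation: the function segment_utterance, the boundary_prob dictionary and its values are all hypothetical, and treating an unseen diphone as probability 0.0 is an assumption made only for this example.

```python
def segment_utterance(utterance, boundary_prob, threshold=0.5):
    """Insert a space wherever a diphone's boundary probability exceeds
    the threshold, and glue the remaining tokens together.

    boundary_prob maps (left, right) token pairs to probabilities.
    Assumption for this sketch: unseen diphones count as 0.0.
    """
    tokens = utterance.split()
    if not tokens:
        return ''
    pieces = [tokens[0]]
    for left, right in zip(tokens, tokens[1:]):
        # strictly greater than the threshold, as in the parameter description
        if boundary_prob.get((left, right), 0.0) > threshold:
            pieces.append(' ')
        pieces.append(right)
    return ''.join(pieces)


# Invented probabilities for a toy utterance
probs = {
    ('dh', 'e'): 0.05,
    ('e', 'k'): 0.9,
    ('k', 'a'): 0.1,
    ('a', 't'): 0.2,
}
result = segment_utterance('dh e k a t', probs)
print(result)  # dhe kat
```

Raising the threshold makes the segmenter more conservative: with threshold=0.95 the same input comes back as a single word.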

class wordseg.algos.dibs.CorpusSummary(text, separator=<wordseg.separator.Separator object>, level='phone', log=<RootLogger root (WARNING)>)[source]

Bases: object

Compute statistics on a phonemized corpus

This is the “training” step of DiBS. It computes some statistics on phones (and diphones) on a tokenized training text.

Parameters
  • text (sequence of str) – The input text must be tokenized at phone and word levels (syllable boundaries are ignored if any)

  • separator (Separator, optional) – Token separation in the input text

  • level ('phone' or 'syllable', optional) – The token level to train the model on. Defaults to ‘phone’.

  • log (logging.Logger, optional) – Where to send log messages

summary

Basic stats on the entire text: ‘nlines’, ‘nwords’ and ‘nphones’

Type

Counter

lexicon

Word count on the entire text

Type

Counter

phrase_initial

The phones at first position in an utterance

Type

Counter

phrase_final

The phones at last position in an utterance

Type

Counter

internal_diphones

The count of within-word diphones

Type

Counter

spanning_diphones

The count of across-word diphones

Type

Counter

diphones

The count of all diphones, sum of internal and spanning diphones.

Type

Counter

Raises

ValueError if a line in the text does not contain a word separator.
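To make the attributes above concrete, here is a rough standalone sketch of the same kind of counting. It does not reproduce CorpusSummary: it uses the standard library's collections.Counter, takes utterances that are already parsed into lists of words (each a list of phones) rather than separator-delimited strings, and the function name summarize is invented for the example.

```python
from collections import Counter


def summarize(utterances):
    """Count phones and diphones in pre-parsed utterances.

    Each utterance is a list of words, each word a non-empty list of
    phones. Returns a dict whose keys mirror the attributes listed above.
    """
    stats = Counter()
    phrase_initial, phrase_final = Counter(), Counter()
    internal, spanning = Counter(), Counter()

    for words in utterances:
        if not words:
            continue  # skip empty utterances
        phones = [phone for word in words for phone in word]
        stats['nlines'] += 1
        stats['nwords'] += len(words)
        stats['nphones'] += len(phones)
        phrase_initial[phones[0]] += 1
        phrase_final[phones[-1]] += 1

        # diphones fully inside a word
        for word in words:
            for pair in zip(word, word[1:]):
                internal[pair] += 1
        # diphones straddling a word boundary
        for prev, nxt in zip(words, words[1:]):
            spanning[(prev[-1], nxt[0])] += 1

    return {
        'summary': stats,
        'phrase_initial': phrase_initial,
        'phrase_final': phrase_final,
        'internal_diphones': internal,
        'spanning_diphones': spanning,
        'diphones': internal + spanning,
    }


# Toy corpus: "dhe kat" and "e kat"
corpus = [
    [['dh', 'e'], ['k', 'a', 't']],
    [['e'], ['k', 'a', 't']],
]
model = summarize(corpus)
print(model['summary'])            # nlines=2, nwords=4, nphones=9
print(model['spanning_diphones'])  # ('e', 'k') seen twice
```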

class wordseg.algos.dibs.Counter[source]

Bases: dict

A Counter is a (key -> count) dictionary for counting elements

Update the counter with the increment(key, value) method. If an element is absent from the dictionary, its count defaults to 0.

Examples

>>> c = Counter()
>>> c['a']
0
>>> c['a'] = 10
>>> c['a']
10
>>> c.increment('a')
>>> c['a']
11
>>> c.increment('a', 9)
>>> c['a']
20

increment(key, value=1)[source]

class wordseg.algos.dibs.GoldSegmenter(summary, pwb=None, threshold=0.5, log=<RootLogger root (WARNING)>)[source]

Bases: wordseg.algos.dibs.AbstractSegmenter

init_diphones()[source]

Initializes diphone probabilities from the summary
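This page does not spell out how the gold segmenter derives its probabilities. One plausible reading, given here as an assumption rather than as the library's code, is the relative frequency with which a diphone spans a word boundary in the training text, i.e. spanning count divided by total count. The function name gold_boundary_probs and the counts below are invented for the illustration.

```python
def gold_boundary_probs(spanning, diphones):
    """Relative frequency of a word boundary inside each diphone.

    Assumption for this sketch: P(boundary | diphone) is the count of
    across-word occurrences divided by the count of all occurrences.
    """
    return {
        diphone: spanning.get(diphone, 0) / total
        for diphone, total in diphones.items()
        if total > 0
    }


# Invented counts: ('e', 'k') spans a boundary in 3 of its 4 occurrences
spanning = {('e', 'k'): 3, ('a', 't'): 1}
diphones = {('e', 'k'): 4, ('a', 't'): 4, ('k', 'a'): 5}
gold = gold_boundary_probs(spanning, diphones)
print(gold[('e', 'k')])  # 0.75
```

Under this reading, the pwb parameter plays no role, which is consistent with the note elsewhere on this page that pwb is not used in the ‘gold’ segmentation type.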

class wordseg.algos.dibs.LexicalSegmenter(summary, pwb=None, threshold=0.5, log=<RootLogger root (WARNING)>)[source]

Bases: wordseg.algos.dibs.AbstractSegmenter

init_diphones()[source]

Initializes diphone probabilities from the summary

class wordseg.algos.dibs.PhrasalSegmenter(summary, pwb=None, threshold=0.5, log=<RootLogger root (WARNING)>)[source]

Bases: wordseg.algos.dibs.AbstractSegmenter

init_diphones()[source]

Initializes diphone probabilities from the summary

wordseg.algos.dibs.segment(test_text, trained_model, type='phrasal', threshold=0.5, pwb=None, log=<RootLogger root (WARNING)>)[source]

Segment a corpus from a trained DiBS model

This function is a simple wrapper around the Segmenter classes, namely GoldSegmenter, PhrasalSegmenter and LexicalSegmenter.

Parameters
  • test_text (sequence of str) – The input text to segment is a sequence (list or generator) of utterances. Each utterance is composed of space-separated tokens (can be phones or syllables).

  • trained_model (CorpusSummary) – The trained DiBS model used for segmentation of test_text.

  • type (str, optional) – The type of DiBS segmenter to use, must be ‘gold’, ‘phrasal’ or ‘lexical’. Default is ‘phrasal’.

  • threshold (float, optional) – Threshold on word boundary probabilities. If a diphone has a word boundary probability greater than this threshold, a word boundary is added. Must be in [0, 1]. The optimal threshold is 0.5 (default).

  • pwb (float, optional) – Probability of a word boundary. If not specified, it is estimated from the training text as (nwords - nlines)/(nphones - nlines). This option is not used in the ‘gold’ segmentation type. When defined, it must be in [0, 1].

  • log (logging.Logger, optional) – The logger to send messages to.

Yields

utterance (str) – The current utterance segmented (with estimated word boundaries)

Raises

ValueError: – If type is not ‘gold’, ‘phrasal’ or ‘lexical’, or if threshold or pwb is not a float in [0, 1].