Diphone Based Segmenter

Note

wordseg.algos.dibs in Python, wordseg-dibs in bash.

Diphone based segmentation algorithm

A DiBS model assigns each phrase-medial diphone a value between 0 and 1 inclusive: the probability, according to the model, that a word boundary falls between the two phones.

Unlike the other segmentation algorithms in wordseg, DiBS requires a small training set with word boundaries (i.e. in phonologized form, not in prepared form). The user has two choices:

Train and segment on the same text: Specify <input-text> only; <input-text> must be in phonologized form. The algorithm uses the provided word boundaries to train the model, then removes them to generate the text to segment.
Train and segment on different texts: Specify <input-text> AND --train-file <training-file>. Here <input-text> must be in prepared form (without word boundaries) whereas <training-file> must contain word boundaries.
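As a rough illustration of the two forms, the sketch below derives a prepared utterance from a phonologized one. It assumes phones are separated by spaces and that word and syllable boundaries are marked with `;eword` and `;esyll`; the helper name `to_prepared` is hypothetical and is not part of wordseg.

```python
def to_prepared(utterance, word_sep=";eword", syll_sep=";esyll"):
    """Strip boundary markers, keeping only space-separated phones.

    Illustrative helper (not a wordseg function). The marker strings
    are assumptions about how the training text is annotated.
    """
    for marker in (word_sep, syll_sep):
        utterance = utterance.replace(marker, " ")
    # Collapse the runs of whitespace left behind by the removed markers.
    return " ".join(utterance.split())


phonologized = "dh ax ;eword d ao g ;eword"
print(to_prepared(phonologized))  # dh ax d ao g
```

The phonologized line is what the training step needs; the prepared line is what the segmenter receives.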

For details, see Daland, R., Pierrehumbert, J.B., “Learning diphone-based segmentation”. Cognitive science 35(1), 119-155 (2011).

class wordseg.algos.dibs.AbstractSegmenter(summary, pwb=None, threshold=0.5, log=<RootLogger root (WARNING)>)

Bases: object

An interface for DiBS segmentation

Subclasses must implement the init_diphones() method.

Parameters

summary (CorpusSummary) – Diphone statistics computed on a training text
pwb (float, optional) – Probability of word boundary. If not specified, it is estimated from the training text as (nwords - nlines)/(nphones - nlines). When defined, must be in [0, 1].
threshold (float, optional) – Threshold on word boundary probabilities. If a diphone has a word boundary probability greater than this threshold, a word boundary is added. Must be in [0, 1]. The optimal threshold is 0.5 (default).
log (logging.Logger, optional) – The log instance where to send messages.
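The default estimate of pwb can be read as "word boundaries per diphone position": an utterance of w words and p phones contains w - 1 word boundaries among p - 1 positions between adjacent phones, and summing over all lines gives the formula above. A minimal sketch, with a hypothetical function name:

```python
def estimate_pwb(nwords, nlines, nphones):
    """Estimate the probability of a word boundary at a diphone position.

    Illustrative only: mirrors the documented formula
    (nwords - nlines) / (nphones - nlines).
    """
    if nphones <= nlines:
        raise ValueError("need at least one diphone position in the text")
    return (nwords - nlines) / (nphones - nlines)


# Two utterances totalling 5 words and 14 phones:
# 3 word boundaries among 12 diphone positions.
print(estimate_pwb(nwords=5, nlines=2, nphones=14))  # 0.25
```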

Raises

ValueError: – If threshold or pwb is not a float in [0, 1].

abstract init_diphones(): Initializes diphone probabilities from the summary

segment(utterance)

Estimates word boundaries based on diphone probabilities

Parameters

utterance (str) – The utterance to segment; must be a sequence of phones or syllables separated by spaces.

Returns

The segmented utterance, with phone separation removed and
spaces at estimated word boundaries.
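The thresholding step described above can be sketched as follows. This is not the library's implementation: the function name and the plain dict of boundary probabilities are hypothetical, and treating unseen diphones as having probability 0 is an assumption made for the example.

```python
def segment_utterance(utterance, boundary_prob, threshold=0.5):
    """Insert a space wherever a diphone's boundary probability
    is strictly greater than the threshold.

    boundary_prob maps (left, right) token pairs to probabilities.
    Unseen diphones default to 0.0 (an assumption of this sketch).
    """
    tokens = utterance.split()
    if not tokens:
        return ""
    out = [tokens[0]]
    for left, right in zip(tokens, tokens[1:]):
        if boundary_prob.get((left, right), 0.0) > threshold:
            out.append(" ")
        out.append(right)
    # Token separation is dropped; only estimated word boundaries remain.
    return "".join(out)


probs = {("dh", "ax"): 0.1, ("ax", "d"): 0.9}
print(segment_utterance("dh ax d ao g", probs))  # dhax daog
```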

class wordseg.algos.dibs.CorpusSummary(text, separator=<wordseg.separator.Separator object>, level='phone', log=<RootLogger root (WARNING)>)

Bases: object

Compute statistics on a phonemized corpus

This is the “training” step of DiBS. It computes some statistics on phones (and diphones) on a tokenized training text.

Parameters

text (sequence of str) – The input text must be tokenized at phone and word levels (syllable boundaries, if any, are ignored)
separator (Separator, optional) – Token separation in the input text
level ('phone' or 'syllable', optional) – The token level to train the model on. Defaults to ‘phone’.
log (logging.Logger, optional) – Where to send log messages

summary

Basic stats on the entire text: ‘nlines’, ‘nwords’ and ‘nphones’

Type: Counter

lexicon

Word count on the entire text

Type: Counter

phrase_initial

The phones at first position in an utterance

Type: Counter

phrase_final

The phones at last position in an utterance

Type: Counter

internal_diphones

The count of within-word diphones

Type: Counter

spanning_diphones

The count of diphones that span a word boundary

Type: Counter

diphones

The count of all diphones, sum of internal and spanning diphones.

Type: Counter

Raises: ValueError – If a line in the text does not contain a word separator.
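To make the attributes above concrete, here is a small sketch that gathers the same kinds of counts from a phonologized text. It is not the CorpusSummary implementation: it uses the standard library's collections.Counter as a stand-in, assumes `;eword` marks word boundaries with space-separated phones, and the function name `summarize` is hypothetical.

```python
from collections import Counter


def summarize(text, word_sep=";eword"):
    """Count basic stats and diphones in a phonologized text (sketch)."""
    stats, internal, spanning = Counter(), Counter(), Counter()
    initial, final = Counter(), Counter()
    for line in text:
        if word_sep not in line:
            raise ValueError("line has no word separator: " + repr(line))
        words = [w.split() for w in line.split(word_sep) if w.strip()]
        phones = [p for w in words for p in w]
        if not phones:
            continue
        stats["nlines"] += 1
        stats["nwords"] += len(words)
        stats["nphones"] += len(phones)
        initial[phones[0]] += 1
        final[phones[-1]] += 1
        for w in words:  # diphones inside a word
            internal.update(zip(w, w[1:]))
        for prev, nxt in zip(words, words[1:]):  # diphones across a boundary
            spanning[(prev[-1], nxt[0])] += 1
    return stats, internal, spanning, initial, final


stats, internal, spanning, initial, final = summarize(
    ["dh ax ;eword d ao g ;eword"])
print(stats["nwords"], stats["nphones"])  # 2 5
print(spanning)  # Counter({('ax', 'd'): 1})
```

The count of all diphones is then the element-wise sum `internal + spanning`.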

class wordseg.algos.dibs.Counter

Bases: dict

A Counter is a (key -> count) dictionary for counting elements

Update the counter with the increment(key, count) method. If an element is absent from the dictionary, its count defaults to 0.

Examples

>>> c = Counter()
>>> c['a']
0
>>> c['a'] = 10
>>> c['a']
10
>>> c.increment('a')
>>> c['a']
11
>>> c.increment('a', 9)
>>> c['a']
20

increment(key, value=1)
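The behaviour shown in the examples can be approximated with a small dict subclass. This is a minimal sketch rather than the library's source; the class name `SketchCounter` is hypothetical.

```python
class SketchCounter(dict):
    """A (key -> count) dict whose missing keys read as 0 (sketch)."""

    def __missing__(self, key):
        # Called by dict.__getitem__ for absent keys; nothing is stored.
        return 0

    def increment(self, key, value=1):
        self[key] = self[key] + value


c = SketchCounter()
c.increment("a")
c.increment("a", 9)
print(c["a"], c["b"])  # 10 0
```

Reading an absent key does not insert it here, so `len(c)` still reflects only the keys that were actually counted.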

class wordseg.algos.dibs.GoldSegmenter(summary, pwb=None, threshold=0.5, log=<RootLogger root (WARNING)>)

Bases: wordseg.algos.dibs.AbstractSegmenter

init_diphones(): Initializes diphone probabilities from the summary
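This page does not spell out how each subclass fills in the diphone probabilities. For the gold variant, one plausible reading (an assumption, in the spirit of the supervised baseline in Daland & Pierrehumbert 2011) is the observed fraction of a diphone's occurrences that span a word boundary. The function name and the plain dicts below are hypothetical.

```python
def gold_boundary_probs(internal, spanning):
    """Boundary probability per diphone as spanning / (internal + spanning).

    Sketch under the assumption stated above; not taken from wordseg.
    """
    probs = {}
    for diphone in set(internal) | set(spanning):
        n_span = spanning.get(diphone, 0)
        total = internal.get(diphone, 0) + n_span
        probs[diphone] = n_span / total
    return probs


internal = {("a", "b"): 3}
spanning = {("a", "b"): 1, ("b", "a"): 2}
probs = gold_boundary_probs(internal, spanning)
print(probs[("a", "b")], probs[("b", "a")])  # 0.25 1.0
```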

class wordseg.algos.dibs.LexicalSegmenter(summary, pwb=None, threshold=0.5, log=<RootLogger root (WARNING)>)

Bases: wordseg.algos.dibs.AbstractSegmenter

init_diphones(): Initializes diphone probabilities from the summary

class wordseg.algos.dibs.PhrasalSegmenter(summary, pwb=None, threshold=0.5, log=<RootLogger root (WARNING)>)

Bases: wordseg.algos.dibs.AbstractSegmenter

init_diphones(): Initializes diphone probabilities from the summary

wordseg.algos.dibs.segment(test_text, trained_model, type='phrasal', threshold=0.5, pwb=None, log=<RootLogger root (WARNING)>)

Segment a corpus from a trained DiBS model

This function is a simple wrapper around the Segmenter classes, namely GoldSegmenter, PhrasalSegmenter and LexicalSegmenter.

Parameters

test_text (sequence of str) – The input text to segment is a sequence (list or generator) of utterances. Each utterance is composed of space-separated tokens (phones or syllables).
trained_model (CorpusSummary) – The trained DiBS model used for segmentation of test_text.
type (str, optional) – The type of DiBS segmenter to use, must be ‘gold’, ‘phrasal’ or ‘lexical’. Default is ‘phrasal’.
threshold (float, optional) – Threshold on word boundary probabilities. If a diphone has a word boundary probability greater than this threshold, a word boundary is added. Must be in [0, 1]. The optimal threshold is 0.5 (default).
pwb (float, optional) – Probability of word boundary. If not specified, it is estimated from the training text as (nwords - nlines)/(nphones - nlines). This option is not used in the ‘gold’ segmentation type. When defined, must be in [0, 1].
log (logging.Logger, optional) – The log instance where to send messages.

Yields

utterance (str) – The current utterance segmented (with estimated word boundaries)

Raises

ValueError: – If type is not ‘gold’, ‘phrasal’ or ‘lexical’, or if threshold or pwb is not a float in [0, 1].