Diphone Based Segmenter
Note
wordseg.algos.dibs in Python, wordseg-dibs in bash.
Diphone based segmentation algorithm
A DiBS model assigns each phrase-medial diphone a value between 0 and 1 inclusive, representing the probability that a word boundary falls between the two units of that diphone.
Unlike the other segmentation algorithms in wordseg, DiBS requires a small training set that contains word boundaries (i.e. text in phonologized form, not in prepared form). The user has two choices:
Train and segment on the same text: Specify <input-text> only; <input-text> must be in phonologized form. The algorithm uses the provided word boundaries to train the model, then removes them to generate the text to segment.
Train and segment on different texts: Specify <input-text> AND --train-file <training-file>. Here <input-text> must be in prepared form (without word boundaries), whereas <training-file> must contain word boundaries.
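The first mode can be pictured with a small sketch. It assumes the phonologized notation in which units are separated by spaces and each word ends with `;eword`; that marker and the helper name `strip_word_boundaries` are assumptions made for illustration and are not taken from the text above.

```python
def strip_word_boundaries(utterance, word_sep=";eword"):
    """Remove word-boundary markers, keeping units separated by single spaces.

    `word_sep` is an assumed marker; adapt it to the separator actually used.
    """
    without_markers = utterance.replace(word_sep, " ")
    # Collapse the whitespace runs left behind by the removed markers.
    return " ".join(without_markers.split())


# Toy training text in (assumed) phonologized form, one utterance per item.
train_text = ["dh ax ;eword d ao g ;eword", "k ae t ;eword"]

# The text to segment is the same text with the word boundaries removed.
test_text = [strip_word_boundaries(u) for u in train_text]
print(test_text)  # ['dh ax d ao g', 'k ae t']
```

In the second mode no stripping is needed, since the text to segment is already supplied without word boundaries.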
For details, see Daland, R., Pierrehumbert, J.B., “Learning diphone-based segmentation”. Cognitive science 35(1), 119-155 (2011).
-
class wordseg.algos.dibs.AbstractSegmenter(summary, pwb=None, threshold=0.5, log=<RootLogger root (WARNING)>)
Bases: object
An interface for DiBS segmentation
Subclasses must implement the _init_diphones() method.
- Parameters
summary (CorpusSummary) – Diphone statistics computed on a training text.
pwb (float, optional) – Probability of a word boundary. If not specified, it is estimated from the training text as (nwords - nlines)/(nphones - nlines). When defined, it must be in [0, 1].
threshold (float, optional) – Threshold on word boundary probabilities. If a diphone has a word boundary probability greater than this threshold, a word boundary is added. Must be in [0, 1]. The optimal threshold is 0.5 (default).
log (logging.Logger, optional) – The log instance where to send messages.
- Raises
ValueError – If threshold or pwb is not a float in [0, 1].
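The default estimate of pwb can be worked through on a toy corpus. The sketch below is not the library's code: the `;eword` marker and the function name `estimate_pwb` are assumptions. It only applies the formula stated above, which counts phrase-medial word boundaries (nwords - nlines) over phrase-medial positions (nphones - nlines).

```python
def estimate_pwb(train_text, word_sep=";eword"):
    """Estimate the word-boundary probability as (nwords - nlines)/(nphones - nlines).

    `word_sep` is an assumed marker for word ends in the training text.
    """
    nlines = nwords = nphones = 0
    for line in train_text:
        # Split the line into words, then each word into its phones.
        words = [w.split() for w in line.split(word_sep) if w.strip()]
        nlines += 1
        nwords += len(words)
        nphones += sum(len(w) for w in words)
    # Note: undefined (division by zero) if every line holds a single phone.
    return (nwords - nlines) / (nphones - nlines)


train_text = ["dh ax ;eword d ao g ;eword", "k ae t ;eword"]
pwb = estimate_pwb(train_text)  # 3 words, 2 lines, 8 phones -> 1/6
print(round(pwb, 4))  # 0.1667
```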
-
segment(utterance)
Estimates word boundaries based on diphone probabilities.
- Parameters
utterance (str) – The utterance to segment. It must be a sequence of phones or syllables separated by spaces.
- Returns
The segmented utterance, with phone separation removed and
spaces at estimated word boundaries.
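The thresholding step described above can be sketched as follows. This is a simplified illustration, not the method's actual implementation: the `boundary_prob` dictionary, the function name and the choice to treat unseen diphones as probability 0 are all assumptions.

```python
def segment_utterance(utterance, boundary_prob, threshold=0.5):
    """Insert a space wherever a diphone's boundary probability exceeds threshold.

    `boundary_prob` maps (left, right) unit pairs to a probability in [0, 1];
    it is a hypothetical stand-in for a trained model.
    """
    units = utterance.split()
    if not units:
        return ""
    pieces = [units[0]]
    for left, right in zip(units, units[1:]):
        # Assumption: diphones missing from the table get probability 0.
        if boundary_prob.get((left, right), 0.0) > threshold:
            pieces.append(" ")
        pieces.append(right)
    # Unit separation is dropped; spaces remain only at estimated boundaries.
    return "".join(pieces)


probs = {("dh", "ax"): 0.1, ("ax", "d"): 0.9}
print(segment_utterance("dh ax d ao g", probs))  # dhax daog
```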
-
class wordseg.algos.dibs.CorpusSummary(text, separator=<wordseg.separator.Separator object>, level='phone', log=<RootLogger root (WARNING)>)
Bases: object
Compute statistics on a phonemized corpus
This is the “training” step of DiBS. It computes some statistics on phones (and diphones) on a tokenized training text.
- Parameters
text (sequence of str) – The input text. It must be tokenized at phone and word levels (syllable boundaries, if any, are ignored).
separator (Separator, optional) – Token separation in the input text.
level ('phone' or 'syllable', optional) – The token level to train the model on. Defaults to ‘phone’.
log (logging.Logger, optional) – Where to send log messages
- Raises
ValueError – If a line in the text does not contain a word separator.
-
class wordseg.algos.dibs.Counter
Bases: dict
A Counter is a (key -> count) dictionary for counting elements
Update the counter with the increment(key, count) method. If an element is absent from the dictionary, its count defaults to 0.
Examples
>>> c = Counter()
>>> c['a']
0
>>> c['a'] = 10
>>> c['a']
10
>>> c.increment('a')
>>> c['a']
11
>>> c.increment('a', 9)
>>> c['a']
20
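One way to obtain the behaviour shown in these examples is a dict subclass that defines `__missing__`. The class below is a hypothetical re-implementation named `SketchCounter` to avoid confusion with the real class; it is not the library's source.

```python
class SketchCounter(dict):
    """A (key -> count) dict where absent keys read as 0 (illustrative only)."""

    def __missing__(self, key):
        # Called by dict.__getitem__ for absent keys; the key is not stored.
        return 0

    def increment(self, key, count=1):
        # Reading self[key] yields 0 for a new key, so this also initializes it.
        self[key] = self[key] + count


c = SketchCounter()
c.increment("a")
c.increment("a", 9)
print(c["a"], c["b"])  # 10 0
```

The standard library's collections.Counter also returns 0 for missing keys, although it has no increment method.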
-
class wordseg.algos.dibs.GoldSegmenter(summary, pwb=None, threshold=0.5, log=<RootLogger root (WARNING)>)
-
class wordseg.algos.dibs.LexicalSegmenter(summary, pwb=None, threshold=0.5, log=<RootLogger root (WARNING)>)
-
class wordseg.algos.dibs.PhrasalSegmenter(summary, pwb=None, threshold=0.5, log=<RootLogger root (WARNING)>)
-
wordseg.algos.dibs.segment(test_text, trained_model, type='phrasal', threshold=0.5, pwb=None, log=<RootLogger root (WARNING)>)
Segment a corpus from a trained DiBS model
This function is a simple wrapper around the Segmenter classes, namely GoldSegmenter, PhrasalSegmenter and LexicalSegmenter.
- Parameters
test_text (sequence of str) – The input text to segment, as a sequence (list or generator) of utterances. Each utterance is composed of space-separated tokens (phones or syllables).
trained_model (CorpusSummary) – The trained DiBS model used for segmentation of test_text.
type (str, optional) – The type of DiBS segmenter to use, must be ‘gold’, ‘phrasal’ or ‘lexical’. Default is ‘phrasal’.
threshold (float, optional) – Threshold on word boundary probabilities. If a diphone has a word boundary probability greater than this threshold, a word boundary is added. Must be in [0, 1]. The optimal threshold is 0.5 (default).
pwb (float, optional) – Probability of a word boundary. If not specified, it is estimated from the training text as (nwords - nlines)/(nphones - nlines). This option is not used with the ‘gold’ segmentation type. When defined, it must be in [0, 1].
log (logging.Logger, optional) – The log instance where to send messages.
- Yields
utterance (str) – The current utterance segmented (with estimated word boundaries)
- Raises
ValueError – If type is not ‘gold’, ‘phrasal’ or ‘lexical’, or if threshold or pwb is not a float in [0, 1].
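Putting the pieces together, training and segmentation might be chained as in the sketch below. It relies only on the signatures documented on this page; the wrapper name `segment_with_dibs` is hypothetical, and the import path `wordseg.algos.dibs` is taken from the class names above. Because segment() yields utterances one at a time, the result is collected into a list.

```python
def segment_with_dibs(train_text, test_text, dibs_type="phrasal", threshold=0.5):
    """Train a DiBS model on train_text and segment test_text (illustrative wrapper).

    train_text: utterances with word boundaries (phonologized form).
    test_text: utterances without word boundaries (prepared form).
    """
    # Imported lazily so this sketch can be read and loaded without wordseg installed.
    from wordseg.algos import dibs

    # Training step: compute phone and diphone statistics on the training text.
    model = dibs.CorpusSummary(train_text)

    # Segmentation step: segment() is a generator, so materialize its output.
    return list(
        dibs.segment(test_text, model, type=dibs_type, threshold=threshold)
    )
```

Passing the same text in both forms corresponds to the "train and segment on the same text" mode, while passing two different corpora corresponds to using a separate training file.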