Random Baseline

Note

wordseg.algos.baseline in Python, wordseg-baseline in Bash.

Baseline algorithm for word segmentation

This algorithm randomly adds word boundaries after the input tokens with a given probability.

wordseg.algos.baseline.segment(text, probability=0.5, log=<RootLogger root (WARNING)>)

Random word segmentation given a boundary probability

Given a probability p, the probability P(t_i) of adding a word boundary after each token t_i is:

P(t_i) = P(X < p), X \sim \mathcal{U}(0, 1).
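
Because X is uniform on (0, 1), the rule above reduces to a simple coin flip with bias p. As a short worked step, assuming each token gets its own independent draw of X (the symbols B_n and n are introduced here for illustration only):

```latex
% For p in [0, 1] and X ~ U(0, 1):
P(t_i) = P(X < p) = p
% Let B_n be the number of boundaries drawn over n tokens.
% With independent draws, B_n follows a binomial distribution:
B_n \sim \mathrm{Binomial}(n, p), \qquad \mathbb{E}[B_n] = n p
```

For example, with p = 0.5 and an utterance of 10 tokens, about 5 boundaries are expected on average.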

Parameters
  • text (sequence) – The input utterances to segment. Tokens are assumed to be space-separated.

  • probability (float, optional) – The probability of appending a word boundary after each token. Defaults to 0.5.

  • log (logging.Logger) – Where to send log messages

Yields

segmented_text (generator) – The randomly segmented utterances.

Raises

ValueError – if the probability is not a float in [0, 1].
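
The behaviour described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the wordseg source: the function name random_segment and the rng parameter are hypothetical, and the output format (tokens of a word concatenated, words separated by a single space) is an assumption.

```python
import random


def random_segment(utterances, probability=0.5, rng=None):
    """Yield each utterance with word boundaries drawn at random.

    Hypothetical re-implementation for illustration only; it is not
    the code behind wordseg.algos.baseline.segment.
    """
    # Mirror the documented contract: a float in [0, 1] is required.
    # The check runs on first iteration, since this is a generator.
    if not isinstance(probability, float) or not 0.0 <= probability <= 1.0:
        raise ValueError('probability must be a float in [0, 1]')

    # rng is an assumed extra parameter, useful for reproducible runs.
    rng = rng or random.Random()

    for utterance in utterances:
        pieces = []
        for token in utterance.split():
            pieces.append(token)
            # X ~ U(0, 1): a boundary follows the token when X < p.
            if rng.random() < probability:
                pieces.append(' ')
        # Assumed output format: drop a boundary left at the very end.
        yield ''.join(pieces).strip()


if __name__ == '__main__':
    text = ['h e l o', 'w o r l d']
    # Output varies from run to run unless a seeded rng is passed.
    for line in random_segment(text, probability=0.5):
        print(line)
```

In this sketch the two extremes are deterministic: a probability of 0.0 merges all tokens of an utterance into one word, and a probability of 1.0 puts a boundary after every token.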

wordseg.algos.baseline.segment_oracle(text, oracle_text, oracle_separator=<wordseg.separator.Separator object>, oracle_level='phone', log=<RootLogger root (WARNING)>)

Random oracle word segmentation

The word boundary probability p is estimated from an oracle text as the ratio nwords / nphones or nwords / nsyllables, according to oracle_level. The segmentation is then delegated to segment(text, p).

Parameters
  • text (sequence of str) – The input utterances to segment. Tokens are assumed to be space-separated.

  • oracle_text (sequence of str) – The text on which to estimate the probability of word boundary. Must be tokenized at the word level and at least at the phone or syllable level (according to oracle_level).

  • oracle_separator (Separator, optional) – Token separation in the oracle text.

  • oracle_level (str, optional) – The level to consider when estimating p. Must be ‘phone’ or ‘syllable’; defaults to ‘phone’.

  • log (logging.Logger) – Where to send log messages

Yields

segmented_text (generator) – The randomly segmented utterances.
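
The estimation of p from an oracle text, as the ratio of words to phones or syllables, can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: the helper name estimate_boundary_probability is hypothetical, and the separators used here (';eword' between words, a space between phones) are assumed for the example and should be matched to the Separator actually in use.

```python
def estimate_boundary_probability(oracle_utterances,
                                  word_sep=';eword', unit_sep=' '):
    """Return nwords / nunits counted over the oracle utterances.

    Hypothetical helper for illustration; unit_sep marks the phone
    (or syllable) level and word_sep the word level. Both default
    values are assumptions.
    """
    n_words = 0
    n_units = 0
    for utterance in oracle_utterances:
        for word in utterance.split(word_sep):
            # Keep only non-empty units (phones or syllables).
            units = [u for u in word.split(unit_sep) if u]
            if units:
                n_words += 1
                n_units += len(units)

    if n_units == 0:
        raise ValueError('the oracle text contains no units')
    return n_words / n_units


if __name__ == '__main__':
    oracle = ['dh ax ;eword k ae t ;eword', 'd ao g ;eword']
    # 3 words over 8 phones
    print(estimate_boundary_probability(oracle))  # 0.375
```

The resulting value would then play the role of the probability argument of segment, so that the random segmentation produces, on average, words of the same length as those in the oracle text.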