Random Baseline

Note

wordseg.algos.baseline in Python, wordseg-baseline in bash.

Baseline algorithm for word segmentation

This algorithm randomly adds word boundaries after the input tokens with a given probability.

wordseg.algos.baseline.segment(text, probability=0.5, log=<RootLogger root (WARNING)>)

Random word segmentation given a boundary probability

Given a probability $p$, the probability $P(t_i)$ of adding a word boundary after each token $t_i$ is:

$P(t_i) = P(X < p), X \sim \mathcal{U}(0, 1).$

Parameters

text (sequence) – The input utterances to segment; tokens are assumed to be space separated.
probability (float, optional) – The probability to append a word boundary after each token.
log (logging.Logger) – Where to send log messages

Yields

segmented_text (generator) – The randomly segmented utterances.

Raises

ValueError – if the probability is not a float in [0, 1].
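The behaviour described above can be illustrated with a minimal, self-contained sketch. It is not the wordseg implementation: the function name random_segment and its rng parameter are hypothetical, chosen only to make the example reproducible. For each space-separated token it draws $X \sim \mathcal{U}(0, 1)$ and appends a word boundary when $X < p$.

```python
import random


def random_segment(utterance, probability=0.5, rng=None):
    """Illustrative sketch only (hypothetical helper, not part of wordseg).

    Joins the space-separated tokens of `utterance`, inserting a word
    boundary (a space) after a token with the given probability.
    """
    # Mirror the documented error for probabilities outside [0, 1].
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be a float in [0, 1]")

    # `rng` is an assumption of this sketch, used to allow seeding.
    rng = rng or random.Random()

    pieces = []
    for token in utterance.split():
        pieces.append(token)
        # Draw X ~ U(0, 1) and add a boundary when X < p.
        if rng.random() < probability:
            pieces.append(" ")

    # Drop the boundary that may follow the last token.
    return "".join(pieces).strip()
```

With a probability of 0.0 no boundary is ever added, so all tokens are merged into one word; with 1.0 a boundary follows every token, so each token becomes its own word.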

wordseg.algos.baseline.segment_oracle(text, oracle_text, oracle_separator=<wordseg.separator.Separator object>, oracle_level='phone', log=<RootLogger root (WARNING)>)

Random oracle word segmentation

The probability of a word boundary $p$ is estimated from an oracle text as the ratio nwords / (nphones or nsyllables), according to oracle_level. The segmentation is then delegated to segment(text, $p$).

Parameters

text (sequence of str) – The input utterances to segment; tokens are assumed to be space separated.
oracle_text (sequence of str) – The text on which to estimate the probability of a word boundary. Must be tokenized at the word level and at least at the phone or syllable level (according to oracle_level).
oracle_separator (Separator, optional) – Token separation in the oracle text.
oracle_level (str, optional) – The level to consider when estimating $p$; must be ‘phone’ or ‘syllable’, defaults to ‘phone’.
log (logging.Logger) – Where to send log messages

Yields

segmented_text (generator) – The randomly segmented utterances.
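The oracle estimate of $p$ described above can be sketched as follows. This is an illustration under stated assumptions, not the library code: the function name estimate_boundary_probability is hypothetical, and the separator strings passed as unit_sep and word_sep are example values for this sketch rather than a statement of what the Separator object contains.

```python
def estimate_boundary_probability(oracle_text, unit_sep=" ", word_sep=";eword"):
    """Illustrative sketch only (hypothetical helper, not part of wordseg).

    Returns nwords / nunits, where units are phones or syllables
    depending on which separator is passed as `unit_sep`.
    """
    nwords = 0
    nunits = 0
    for utterance in oracle_text:
        for word in utterance.split(word_sep):
            # Ignore the empty strings produced by leading or trailing separators.
            units = [u for u in word.split(unit_sep) if u]
            if units:
                nwords += 1
                nunits += len(units)

    if nunits == 0:
        raise ValueError("the oracle text contains no units")
    return nwords / nunits
```

For an oracle utterance made of two words of five phones each, the estimate is 2 / 10 = 0.2, and that value would then be used as the boundary probability for the random segmentation.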