Random Baseline¶
Note
wordseg.algo.baseline in python, wordseg-baseline in bash.
Baseline algorithm for word segmentation
This algorithm randomly adds word boundaries after the input tokens with a given probability.
-
wordseg.algos.baseline.segment(text, probability=0.5, log=<RootLogger root (WARNING)>)[source]¶ Random word segmentation given a boundary probability
Given a probability
, the probability
to
add a word boundary after each token
is:
- Parameters
text (sequence) – The input utterances to segment, tokens are assumed to be space separated.
probability (float, optional) – The probability to append a word boundary after each token.
log (logging.Logger) – Where to send log messages
- Yields
segmented_text (generator) – The randomly segmented utterances.
- Raises
ValueError – if the probability is not a float in [0, 1].
-
wordseg.algos.baseline.segment_oracle(text, oracle_text, oracle_separator=<wordseg.separator.Separator object>, oracle_level='phone', log=<RootLogger root (WARNING)>)[source]¶ Random oracle word segmentation
The probability of word boundary
is estimated from an
oracle text as the ration nwords / (nphones or nsyllables), according tooracle_level. The segmentation is then delegated to the segment(text,
) method is called.- Parameters
text (sequence of str) – The input utterances to segment, tokens are assumed to be space separated.
oracle_text (sequence of str) – The text on which to estimate the probaility of word boundary. Must be tokenized at word and at least phone or syllable levels (according to
oracle_level).oracle_separator (Separator, optional) – Token separation in the oracle text.
oracle_level (str, optional) – The level to consider when estimating
, must be
‘phone’ or ‘syllable’, default to ‘phone’.log (logging.Logger) – Where to send log messages
- Yields
segmented_text (generator) – The randomly segmented utterances.