Data Preparation

Note

wordseg.prepare in Python, wordseg-prep in Bash.

Prepare an input text for word segmentation

  • The input text must be in a phonologized form (a sequence of phone, syllable or word tokens, as specified by the token separator).

  • The input text is checked for errors in formatting (presence of punctuation, missing separators, etc…).

  • The output text contains space-separated phones (or syllables, according to the unit option).

  • The program fails on the first error encountered, or ignores errors if the tolerant option is used.

wordseg.prepare.check_utterance(utterance, separator=<wordseg.separator.Separator object>, check_punctuation=True)

Ensures an utterance is in a valid phonological form

Parameters
  • utterance (str) – The utterance to be checked

  • separator (Separator, optional) – The token separators used in the utterance

  • check_punctuation (bool, optional) – When True (default), forbid any punctuation character in the utterance and raise ValueError if any punctuation is found. When False, do not check punctuation.

Returns

True if no error is detected; raises ValueError otherwise

Return type

bool

Raises

ValueError – If one of the following errors is detected:

  • utterance is empty or is not a string

  • utterance contains any punctuation character (once the separators are removed), only if check_punctuation is True

  • utterance begins with a separator

  • utterance does not end with a word separator

  • utterance contains syllable tokens but a word does not end with a syllable separator
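
The checks listed above can be illustrated with a standard-library sketch. This is not the wordseg implementation: the function name sketch_check is hypothetical, and the separator strings (a space for phones, ';esyll' for syllables, ';eword' for words) are assumptions made for the example rather than values read from a Separator object.

```python
import re
import string

# Assumed separator strings for this sketch (not read from wordseg).
PHONE, SYLLABLE, WORD = " ", ";esyll", ";eword"

# ASCII punctuation, the same character set as the regex shown on this page.
PUNCTUATION = re.compile("[" + re.escape(string.punctuation) + "]")


def sketch_check(utterance, check_punctuation=True):
    """Hypothetical re-implementation of the documented checks."""
    if not isinstance(utterance, str) or not utterance:
        raise ValueError("utterance is empty or is not a string")

    if any(utterance.startswith(sep) for sep in (PHONE, SYLLABLE, WORD)):
        raise ValueError("utterance begins with a separator")

    if not utterance.strip().endswith(WORD):
        raise ValueError("utterance does not end with a word separator")

    if check_punctuation:
        # Separators may themselves contain punctuation (the ';' here),
        # so they are removed before searching.
        bare = utterance
        for sep in (SYLLABLE, WORD, PHONE):
            bare = bare.replace(sep, "")
        if PUNCTUATION.search(bare):
            raise ValueError("utterance contains punctuation")

    if SYLLABLE in utterance:
        # Every non-empty word must end with a syllable separator.
        words = [w.strip() for w in utterance.split(WORD) if w.strip()]
        if not all(w.endswith(SYLLABLE) for w in words):
            raise ValueError("a word does not end with a syllable separator")

    return True
```

Under these assumptions, sketch_check("hh ax l ;esyll ow ;esyll ;eword") returns True, while "hh ax, l ;eword" raises ValueError unless check_punctuation is False.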

wordseg.prepare.gold(text, separator=<wordseg.separator.Separator object>)

Returns a gold text from a phonologized one

The returned gold text is the ground-truth segmentation. It has phone and syllable separators removed and word separators replaced by a single space ‘ ‘. It is used to evaluate the output of segmentation algorithms.

Parameters
  • text (sequence) – The input text to be prepared for segmentation. Each element of the sequence is assumed to be a single and complete utterance in valid phonological form.

  • separator (Separator, optional) – Token separation in the text

Returns

gold_text – Gold utterances with separators removed and words separated by spaces. The returned text is the gold version, against which the algorithms are evaluated.

Return type

generator
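
As a rough illustration of the transformation described for gold, the following standard-library sketch removes phone and syllable separators and turns word separators into single spaces. The name sketch_gold and the separator strings are assumptions made for this example; it is not the library's code and performs no format checking.

```python
# Assumed separator strings for this sketch (not read from wordseg).
PHONE, SYLLABLE, WORD = " ", ";esyll", ";eword"


def sketch_gold(text):
    """Yield one gold utterance per input utterance (hypothetical helper)."""
    for utterance in text:
        # Remove the sub-word separators first.
        gold = utterance.replace(SYLLABLE, "").replace(PHONE, "")
        # Word separators become single spaces; drop the trailing one.
        yield gold.replace(WORD, " ").strip()


utterances = ["hh ax l ;esyll ow ;esyll ;eword w er l d ;esyll ;eword"]
print(list(sketch_gold(utterances)))  # ['hhaxlow werld']
```

Like the documented function, the sketch is a generator, so the input is consumed lazily, one utterance at a time.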

wordseg.prepare.prepare(text, separator=<wordseg.separator.Separator object>, unit='phone', check_punctuation=True, tolerant=False, log=<RootLogger root (WARNING)>)

Prepares a text in phonological form for word segmentation

The returned text is ready to be segmented. It consists of a sequence of phonological symbols (phones or syllables, depending on unit) separated by spaces.

The function removes the word separators from all the lines in text and replaces the boundaries at the level given by unit with a space. If unit is ‘phone’, the syllable separators are removed; conversely, if unit is ‘syllable’, the phone separators are discarded.

Parameters
  • text (sequence) – The input text to be prepared for segmentation. Each element of the sequence is assumed to be a single and complete utterance in valid phonological form.

  • separator (Separator, optional) – Token separation in the text

  • unit (str, optional) – The unit representation level to prepare the text at, must be ‘syllable’ or ‘phone’.

  • check_punctuation (bool, optional) – When True (default), forbid any punctuation character in the utterance and raise ValueError if any punctuation is found. When False, do not check punctuation.

  • tolerant (bool, optional) – If False, raise ValueError on the first format error detected in the text. If True, badly formatted utterances are filtered out of the output and a warning is issued.

  • log (logging.Logger, optional) – The logger instance where to send messages.

Returns

prepared_text – Utterances from the text with separators removed, prepared for segmentation at the syllable or phone representation level (units separated by spaces).

Return type

generator

Raises

ValueError – On the first format error encountered in text (see the prepare.check_utterance function), only if tolerant is False.
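
The interplay of unit and tolerant can be sketched with the standard library alone. This is an illustration under stated assumptions, not the wordseg implementation: sketch_prepare is a hypothetical name, the separator strings are assumed, and the format check is reduced to "is a string ending with a word separator" in place of the full check_utterance logic.

```python
import logging

# Assumed separator strings for this sketch (not read from wordseg).
PHONE, SYLLABLE, WORD = " ", ";esyll", ";eword"


def sketch_prepare(text, unit="phone", tolerant=False, log=logging.getLogger()):
    """Yield space-separated units for each valid utterance (hypothetical)."""
    if unit not in ("phone", "syllable"):
        raise ValueError("unit must be 'phone' or 'syllable'")

    # Keep boundaries at the requested level, discard the other level.
    keep, drop = (PHONE, SYLLABLE) if unit == "phone" else (SYLLABLE, PHONE)

    for utterance in text:
        # Simplified stand-in for the full format check.
        valid = isinstance(utterance, str) and utterance.strip().endswith(WORD)
        if not valid:
            if tolerant:
                log.warning("skipping badly formatted utterance: %r", utterance)
                continue
            raise ValueError("badly formatted utterance: %r" % (utterance,))

        units = utterance.replace(WORD, "").replace(drop, "").split(keep)
        yield " ".join(u for u in units if u)


text = ["hh ax l ;esyll ow ;esyll ;eword w er l d ;esyll ;eword", "oops"]
print(list(sketch_prepare(text, tolerant=True)))
# ['hh ax l ow w er l d']
print(list(sketch_prepare(text[:1], unit="syllable")))
# ['hhaxl ow werld']
```

Because the sketch is a generator, as the documented return type suggests for the real function, a ValueError in non-tolerant mode surfaces only when the offending utterance is reached during iteration.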

wordseg.prepare.punctuation_re = re.compile('[!"\\#\\$%\\&\'\\(\\)\\*\\+,\\-\\./:;<=>\\?@\\[\\\\\\]\\^_`\\{\\|\\}\\~]')

A regular expression matching all the punctuation characters
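
The character class above lists the same characters as Python's string.punctuation, so an equivalent pattern can be built with the standard library. The sketch below assumes that equivalence and uses a local name, punct, instead of importing wordseg. It also shows that only ASCII punctuation is covered.

```python
import re
import string

# Equivalent character class built from the standard library.
punct = re.compile("[" + re.escape(string.punctuation) + "]")

print(punct.findall("hh ax, l ow!"))      # [',', '!']
print(punct.search("hh ax l ow"))         # None
# Non-ASCII punctuation such as '¿' or '…' is not in the class.
print(punct.search("¿hh ax l ow…"))       # None
# Separator strings like ';esyll' contain ';' and would match.
print(punct.findall("hh ax ;esyll"))      # [';']
```

The last call illustrates why the separators are removed before punctuation is checked.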