Data Preparation
Note
This module is available as wordseg.prepare in Python and as wordseg-prep in bash.
Prepare an input text for word segmentation
The input text must be in a phonologized form (a sequence of phone, syllable or word tokens, as delimited by the token separator).
The input text is checked for formatting errors (presence of punctuation, missing separators, etc.).
The output text contains space-separated phones (or syllables, according to the unit option).
The program fails on the first error encountered, or skips the faulty utterances if the tolerant option is used.
wordseg.prepare.check_utterance(utterance, separator=<wordseg.separator.Separator object>, check_punctuation=True)
Ensures an utterance is in a valid phonological form
- Parameters
utterance (str) – The utterance to be checked
separator (Separator, optional) – The token separators used in the utterance
check_punctuation (bool, optional) – When True (default), forbid any punctuation character in the utterance and raise ValueError if any punctuation is found. When False, do not check punctuation.
- Returns
True if no error detected, raises otherwise
- Return type
bool
- Raises
ValueError – If one of the following errors is detected:
* utterance is empty or is not a string
* utterance contains any punctuation character (once the separators are removed), only if check_punctuation is True
* utterance begins with a separator
* utterance does not end with a word separator
* utterance contains syllable tokens but a word does not end with a syllable separator
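The checks listed above can be restated as a small standard-library sketch. This is not the wordseg implementation: the function name `sketch_check_utterance` is hypothetical, and the separator strings (`' '`, `';esyll'`, `';eword'`) are assumed for illustration, since the defaults of `Separator` are not stated in this section.

```python
import re
import string

# Separator strings assumed for this sketch only; the real defaults of
# wordseg.separator.Separator are not given in this section.
PHONE, SYLLABLE, WORD = ' ', ';esyll', ';eword'

# Character class covering the ASCII punctuation characters.
PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))


def sketch_check_utterance(utterance, check_punctuation=True):
    """Illustrative restatement of the documented checks (not wordseg code)."""
    if not isinstance(utterance, str) or not utterance.strip():
        raise ValueError('utterance is empty or is not a string')

    if check_punctuation:
        # Remove the separators first, since they may contain punctuation.
        stripped = utterance
        for sep in (SYLLABLE, WORD, PHONE):
            stripped = stripped.replace(sep, '')
        if PUNCTUATION.search(stripped):
            raise ValueError('punctuation found in utterance')

    if any(utterance.startswith(sep) for sep in (PHONE, SYLLABLE, WORD)):
        raise ValueError('utterance begins with a separator')

    if not utterance.rstrip(PHONE).endswith(WORD):
        raise ValueError('utterance does not end with a word separator')

    if SYLLABLE in utterance:
        # Every non-empty word must close its last syllable.
        for word in utterance.split(WORD):
            word = word.strip(PHONE)
            if word and not word.endswith(SYLLABLE):
                raise ValueError('a word does not end with a syllable separator')

    return True
```

Under these assumptions, `sketch_check_utterance('h e ;esyll l o ;esyll ;eword')` returns True, while `'h e l l o'` is rejected because it lacks a final word separator.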
wordseg.prepare.gold(text, separator=<wordseg.separator.Separator object>)
Returns a gold text from a phonologized one
The returned gold text is the ground-truth segmentation. It has phone and syllable separators removed and word separators replaced by a single space ‘ ‘. It is used to evaluate the output of segmentation algorithms.
- Parameters
text (sequence) – The input text to be prepared for segmentation. Each element of the sequence is assumed to be a single and complete utterance in valid phonological form.
separator (Separator, optional) – Token separation in the text
- Returns
gold_text – Gold utterances with separators removed and words separated by spaces. The returned text is the gold version, against which the algorithms are evaluated.
- Return type
generator
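To make the gold transformation concrete, here is a minimal standard-library sketch of the behaviour described above. The name `sketch_gold` and the separator strings are assumptions made for illustration; this is not the library's own code, and it performs no validation of the input.

```python
def sketch_gold(text, phone=' ', syllable=';esyll', word=';eword'):
    """Yield one gold utterance per input utterance (illustrative only).

    The separator strings are assumed values, not confirmed defaults.
    """
    for utterance in text:
        words = []
        for token in utterance.split(word):
            # Drop syllable and phone separators inside each word.
            token = token.replace(syllable, '').replace(phone, '')
            if token:
                words.append(token)
        # Word separators become single spaces.
        yield ' '.join(words)


text = ['hh ax l ;esyll ow ;esyll ;eword w er l d ;esyll ;eword']
print(list(sketch_gold(text)))  # ['hhaxlow werld']
```

Like the documented function, the sketch is a generator, so the result is wrapped in `list` before printing.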
wordseg.prepare.prepare(text, separator=<wordseg.separator.Separator object>, unit='phone', check_punctuation=True, tolerant=False, log=<RootLogger root (WARNING)>)
Prepares a text in phonological form for word segmentation
The returned text is ready to be segmented. It consists of a sequence of phonological symbols (phones or syllables, depending on unit) separated by spaces.
The function removes the word separators from every line in text and replaces each boundary at the level given by unit with a space. If unit is ‘phone’, the syllable separators are removed; conversely, if unit is ‘syllable’, the phone separators are discarded.
- Parameters
text (sequence) – The input text to be prepared for segmentation. Each element of the sequence is assumed to be a single and complete utterance in valid phonological form.
separator (Separator, optional) – Token separation in the text
unit (str, optional) – The unit representation level to prepare the text at, must be ‘syllable’ or ‘phone’.
check_punctuation (bool, optional) – When True (default), forbid any punctuation character in the utterance and raise ValueError if any punctuation is found. When False, do not check punctuation.
tolerant (bool, optional) – If False, raise ValueError on the first format error detected in the text. If True, the badly formatted utterances are filtered out from the output and a warning is issued.
log (logging.Logger, optional) – The logger instance where to send messages.
- Returns
prepared_text – Utterances from the text with separators removed, prepared for segmentation at a syllable or phone representation level (units separated by spaces).
- Return type
generator
- Raises
ValueError – On the first format error encountered in text (see the prepare.check_utterance function), only if tolerant is False.
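The separator handling described for prepare can be sketched with the standard library alone. The name `sketch_prepare` and the separator strings are assumptions for illustration, and the sketch leaves out the validation, check_punctuation, tolerant and log behaviour documented above.

```python
def sketch_prepare(text, unit='phone', phone=' ', syllable=';esyll',
                   word=';eword'):
    """Yield utterances as space-separated units (illustrative only).

    The separator strings are assumed values, not confirmed defaults.
    """
    if unit not in ('phone', 'syllable'):
        raise ValueError("unit must be 'phone' or 'syllable'")

    # Keep boundaries at the requested level, drop the other level.
    keep, drop = (phone, syllable) if unit == 'phone' else (syllable, phone)

    for utterance in text:
        line = utterance.replace(word, '')  # word boundaries are hidden
        line = line.replace(drop, '')       # discard the unused level
        line = line.replace(keep, ' ')      # unit boundaries become spaces
        yield ' '.join(line.split())        # normalize whitespace


text = ['hh ax l ;esyll ow ;esyll ;eword w er l d ;esyll ;eword']
print(list(sketch_prepare(text)))                   # ['hh ax l ow w er l d']
print(list(sketch_prepare(text, unit='syllable')))  # ['hhaxl ow werld']
```

Because the word separators are removed, the output no longer shows where words begin and end, which is the information a segmentation algorithm is then asked to recover.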
wordseg.prepare.punctuation_re = re.compile('[!"\\#\\$%\\&\'\\(\\)\\*\\+,\\-\\./:;<=>\\?@\\[\\\\\\]\\^_`\\{\\|\\}\\~]')
A regular expression matching all the punctuation characters
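The character class above appears to cover the same 32 ASCII characters as Python's `string.punctuation`. As a hedged illustration, an equivalent pattern can be built with the standard library; the names `punctuation` and `strip_punctuation` below are hypothetical and not part of wordseg.

```python
import re
import string

# Build a character class from the ASCII punctuation characters.
# re.escape protects the characters that are special inside a class.
punctuation = re.compile('[%s]' % re.escape(string.punctuation))


def strip_punctuation(utterance):
    """Return the utterance with every punctuation character removed."""
    return punctuation.sub('', utterance)


print(bool(punctuation.search('hello, world!')))  # True
print(strip_punctuation('hello, world!'))         # hello world
```

A cleaning step of this kind is one way to pre-process raw text before it is checked with check_punctuation left at True.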