Token Separation
Manage token separation at phone, syllable and word levels
-
class wordseg.separator.Separator(phone=' ', syllable=';esyll', word=';eword')
Token separation at phone, syllable and word levels
A Separator has three entries (phone, syllable and word) that define the token separators for each of these levels within an utterance. A token separator can be a string or None. Entries that are not None must all be different.
The following characters are forbidden in separators: !#$%&'*+-.^`|~:\"
-
forbidden_chars = '!#$%&\'*+-.^`|~:\\"'
Characters forbidden in separators
They interfere with regular expression processing.
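The two constraints on separators (defined entries must differ, and none may contain a forbidden character) can be illustrated with a minimal sketch. The helper `check_separators` is hypothetical and is not part of wordseg; it only mirrors the rules documented above.

```python
# Hypothetical helper, NOT part of wordseg: it mirrors the documented rules.
FORBIDDEN_CHARS = '!#$%&\'*+-.^`|~:\\"'


def check_separators(phone=' ', syllable=';esyll', word=';eword'):
    """Return the separators as a dict, or raise ValueError if invalid."""
    separators = {'phone': phone, 'syllable': syllable, 'word': word}
    defined = [v for v in separators.values() if v is not None]

    # Separators that are not None must all be different.
    if len(set(defined)) != len(defined):
        raise ValueError('separators must be all different')

    # No defined separator may contain a forbidden character.
    for name, value in separators.items():
        if value is not None and any(c in FORBIDDEN_CHARS for c in value):
            raise ValueError(
                'forbidden character in {} separator: {!r}'.format(name, value))

    return separators
```

Under these assumptions, `check_separators(syllable=None)` is accepted, while `check_separators(phone='.')` and `check_separators(phone=';eword')` both raise ValueError.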
-
iterate(type='value')
Yields the phone, syllable and word token separators, in that order
- Parameters
type (str, optional) – Type of separator representation to return, must be ‘value’ or ‘pair’.
- Yields
token (str or tuple) – In the form token_value if type is ‘value’. In the form (token_name, token_value) if type is ‘pair’.
- Raises
ValueError – If the type is not ‘value’ or ‘pair’.
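The difference between the ‘value’ and ‘pair’ representations can be sketched as follows. `iterate_separators` is a hypothetical stand-in working on a plain dict, not the library's code, and it assumes that undefined (None) separators are yielded as they are.

```python
def iterate_separators(separators, type='value'):
    """Yield separators in phone, syllable, word order (hypothetical helper).

    The parameter is named `type` only to mirror the documented signature.
    """
    if type not in ('value', 'pair'):
        raise ValueError("type must be 'value' or 'pair', it is {!r}".format(type))

    for name in ('phone', 'syllable', 'word'):
        value = separators[name]
        # 'value' yields the separator alone, 'pair' yields (name, separator).
        yield value if type == 'value' else (name, value)


seps = {'phone': ' ', 'syllable': ';esyll', 'word': ';eword'}
values = list(iterate_separators(seps))
pairs = list(iterate_separators(seps, type='pair'))
```

Because this sketch is a generator, the ValueError is raised when iteration starts, not when the function is called.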
-
remove(utterance, level=None)
Returns the utterance with separators removed
- Parameters
utterance (str) – The string to remove the separators from
level (str, optional) – If specified (must be ‘phone’, ‘syllable’ or ‘word’), remove only the separators of the given level. Otherwise, remove all the separators.
- Returns
The utterance with the specified separators removed. Multiple spaces are removed as well.
- Raises
ValueError – If the level is specified and is not ‘phone’, ‘syllable’ or ‘word’.
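A minimal sketch of the removal logic, written against a plain dict of separators. `remove_separators` is a hypothetical helper, not the library's implementation; it assumes separators are deleted (replaced by the empty string) and that runs of spaces are then collapsed to a single space.

```python
import re

LEVELS = ('phone', 'syllable', 'word')


def remove_separators(utterance, separators, level=None):
    """Remove separators of one level, or of all levels (hypothetical helper)."""
    if level is not None and level not in LEVELS:
        raise ValueError('level must be one of {}'.format(LEVELS))

    names = LEVELS if level is None else (level,)
    for name in names:
        sep = separators.get(name)
        if sep:  # skip undefined (None) separators
            utterance = utterance.replace(sep, '')

    # Assumption: "multiple spaces are removed" means runs collapse to one.
    return re.sub(' +', ' ', utterance)


seps = {'phone': ' ', 'syllable': ';esyll', 'word': ';eword'}
utt = 'a b ;esyll c ;esyll ;eword'
no_syllable = remove_separators(utt, seps, level='syllable')
no_separator = remove_separators(utt, seps)
```

Here `no_syllable` keeps the phone and word boundaries, while `no_separator` is the bare string of phones.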
-
split(utterance, level, keep_boundaries=False)
Split the utterance at a given token level
This method is sensitive to whether the utterance is stripped or not. It may output empty tokens.
- Parameters
utterance (str) – The string to split in tokens.
level (str) – Token level to split the string with. Must be ‘phone’, ‘syllable’ or ‘word’.
keep_boundaries (bool, optional) – If False (default), remove all the separators for all levels from the returned sub-utterances.
- Returns
tokens – The tokens extracted from the utterance, may include empty tokens.
- Return type
generator
- Raises
ValueError – If the level is not ‘phone’, ‘syllable’ or ‘word’.
See also
tokenize()
a higher-level method to split an utterance
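Why an unstripped utterance produces empty tokens can be seen in a small sketch. `split_utterance` is a hypothetical helper built on `str.split`, not the library's code; raising ValueError when the requested level has no separator is a choice made for this sketch only.

```python
LEVELS = ('phone', 'syllable', 'word')


def split_utterance(utterance, separators, level, keep_boundaries=False):
    """Yield the tokens of `utterance` at `level` (hypothetical helper)."""
    if level not in LEVELS:
        raise ValueError('level must be one of {}'.format(LEVELS))
    sep = separators[level]
    if not sep:
        # Sketch-only choice: refuse to split on an undefined separator.
        raise ValueError('no separator defined for level {}'.format(level))

    for token in utterance.split(sep):
        if not keep_boundaries:
            # Drop the separators of every defined level from the token.
            for other in separators.values():
                if other:
                    token = token.replace(other, '')
        yield token


seps = {'phone': ' ', 'syllable': None, 'word': ';eword'}
utt = 'a b ;eword c ;eword'  # ends with a word separator
bare = list(split_utterance(utt, seps, 'word'))
raw = list(split_utterance(utt, seps, 'word', keep_boundaries=True))
```

The trailing `;eword` yields a final empty token in both lists, and with `keep_boundaries=True` the tokens keep their surrounding phone separators.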
-
strip(utterance, level=None)
Removes leading and ending separators of an utterance
- Parameters
utterance (str) – The utterance to be stripped.
level (str, optional) – Specify the level boundaries to strip. If not specified remove all the boundaries. If specified, must be ‘phone’, ‘syllable’ or ‘word’.
- Returns
The stripped utterance
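Stripping can be sketched as repeatedly trimming any separator found at either end of the utterance. `strip_separators` is a hypothetical helper, not the library's implementation, and the exact behaviour of the real method on repeated or mixed separators is not confirmed by this page.

```python
LEVELS = ('phone', 'syllable', 'word')


def strip_separators(utterance, separators, level=None):
    """Trim leading and ending separators (hypothetical helper)."""
    if level is not None and level not in LEVELS:
        raise ValueError('level must be one of {}'.format(LEVELS))

    names = LEVELS if level is None else (level,)
    seps = [separators[n] for n in names if separators.get(n)]

    # Loop until no separator remains at either end of the utterance.
    changed = True
    while changed:
        changed = False
        for sep in seps:
            if utterance.startswith(sep):
                utterance = utterance[len(sep):]
                changed = True
            if utterance.endswith(sep):
                utterance = utterance[:-len(sep)]
                changed = True
    return utterance


seps = {'phone': ' ', 'syllable': None, 'word': ';eword'}
utt = 'a b ;eword c ;eword'
all_stripped = strip_separators(utt, seps)
word_stripped = strip_separators(utt, seps, level='word')
```

With `level='word'` only the final `;eword` is trimmed, so the phone separator before it remains at the end of the result.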
-
tokenize(utterance, level=None, keep_boundaries=True)
Return the tokens in utterance at the given level
Iterates over phones, syllables or words within a given utterance; other levels are ignored.
- Parameters
utterance (str) – The utterance to be tokenized.
level (str, optional) – The level to tokenize the utterance at, must be ‘phone’, ‘syllable’ or ‘word’. If not specified, tokenize at all the defined levels and return a nested list.
keep_boundaries (bool, optional) – When True (default) preserve the sublevel token boundaries in the output. When False all token boundaries are removed.
- Returns
token – The successive phones, syllables or words tokenized from the utterance. In the returned nested list, the levels from outer to inner are words, syllables and phones. Empty tokens are ignored and tokens are stripped.
- Return type
list of (list of (list of)) str
- Raises
ValueError – If the level is not ‘phone’, ‘syllable’ or ‘word’.
Examples
>>> from wordseg.separator import Separator
>>> s = Separator(phone=' ', syllable=None, word=';eword')
>>> t = 'j uː ;eword n oʊ ;eword dʒ ʌ s t ;eword'
>>> list(s.tokenize(t, level='word'))
['j uː', 'n oʊ', 'dʒ ʌ s t']
>>> list(s.tokenize(t, level='word', keep_boundaries=False))
['juː', 'noʊ', 'dʒʌst']
>>> list(s.tokenize(t, level='phone'))
['j', 'uː', 'n', 'oʊ', 'dʒ', 'ʌ', 's', 't']
>>> list(s.tokenize(t))
[['j', 'uː'], ['n', 'oʊ'], ['dʒ', 'ʌ', 's', 't']]
-
upper_levels(level)
Lists the defined levels above the given one
- Parameters
level (str) – Must be ‘phone’, ‘syllable’ or ‘word’.
- Raises
ValueError – If the level is not defined in the separator.
Examples
>>> from wordseg.separator import Separator
>>> s = Separator(phone='p', syllable='s', word='w')
>>> s.upper_levels('phone')
['syllable', 'word']
>>> s.upper_levels('word')
[]
>>> s = Separator(phone='p', syllable=None, word='w')
>>> s.upper_levels('phone')
['word']
-