Token Separation

Manage token separation at phone, syllable and word levels

class wordseg.separator.Separator(phone=' ', syllable=';esyll', word=';eword')

Token separation at phone, syllable and word levels

A Separator is made of three entries, phone, syllable and word, which define the token separators for each of these levels within an utterance. A token separator can be a string or None. The entries that are not None must all be different.

The following characters are forbidden in separators: !#$%&'*+-.^`|~:\"


Raises ValueError if level is not defined in the separator


Raises ValueError if sep contains a forbidden character

forbidden_chars = '!#$%&\'*+-.^`|~:\\"'

Characters forbidden in separators

They interfere with regular expression processing
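
As an illustration of this check, here is a minimal sketch that does not import wordseg. The helper names `find_forbidden` and `assert_valid_separator` are hypothetical and only mirror the behaviour described above; the last two lines show why an unescaped character such as '.' is a problem in a regular expression.

```python
import re

# Copied from the forbidden_chars attribute documented above.
FORBIDDEN_CHARS = '!#$%&\'*+-.^`|~:\\"'


def find_forbidden(sep):
    """Return the forbidden characters found in `sep` (hypothetical helper)."""
    return [c for c in sep if c in FORBIDDEN_CHARS]


def assert_valid_separator(sep):
    """Raise ValueError if `sep` contains a forbidden character.

    None is accepted because a level may be left undefined.
    """
    if sep is None:
        return
    bad = find_forbidden(sep)
    if bad:
        raise ValueError(
            'separator {!r} contains forbidden characters: {}'.format(
                sep, ''.join(bad)))


assert_valid_separator(';eword')  # accepted: ';' is not forbidden
assert_valid_separator(None)      # accepted: undefined level

# Used unescaped in a pattern, '.' matches any character.
print(re.split('.', 'a.b'))             # ['', '', '', '']
print(re.split(re.escape('.'), 'a.b'))  # ['a', 'b']
```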


Iterates over the phone, syllable and word tokens, in that order


type (str, optional) – Type of separator representation to return, must be ‘value’ or ‘pair’.


token (str or tuple) – In the form token_value if type is ‘value’. In the form (token_name, token_value) if type is ‘pair’.


ValueError – If the type is not ‘value’ or ‘pair’.
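
The two representations can be sketched as follows. This is a standalone illustration, not the library's code: the function `iterate_levels` and the plain dict used to hold the separators are assumptions, and the sketch yields undefined (None) levels as they are.

```python
def iterate_levels(separators, type='value'):
    """Iterate over the phone, syllable and word entries, in that order.

    `separators` is a dict mapping level names to separator strings
    (a hypothetical stand-in for a Separator instance).
    """
    if type not in ('value', 'pair'):
        raise ValueError(
            "type must be 'value' or 'pair', it is {!r}".format(type))

    names = ('phone', 'syllable', 'word')
    if type == 'value':
        return (separators[name] for name in names)
    return ((name, separators[name]) for name in names)


seps = {'phone': ' ', 'syllable': ';esyll', 'word': ';eword'}
print(list(iterate_levels(seps)))
# [' ', ';esyll', ';eword']
print(list(iterate_levels(seps, type='pair')))
# [('phone', ' '), ('syllable', ';esyll'), ('word', ';eword')]
```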


The list of defined token levels from inner to outer
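
A minimal sketch of this ordering, assuming that a level counts as defined when its separator is not None. The constant `LEVEL_ORDER` and the function `defined_levels` are hypothetical names used for illustration only.

```python
# Token levels from the innermost to the outermost one.
LEVEL_ORDER = ('phone', 'syllable', 'word')


def defined_levels(separators):
    """Return the levels whose separator is not None, inner to outer."""
    return [level for level in LEVEL_ORDER
            if separators.get(level) is not None]


print(defined_levels({'phone': ' ', 'syllable': ';esyll', 'word': ';eword'}))
# ['phone', 'syllable', 'word']
print(defined_levels({'phone': ' ', 'syllable': None, 'word': ';eword'}))
# ['phone', 'word']
```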

remove(utterance, level=None)

Returns the utterance with separators removed

  • utterance (str) – The string to remove the separators from

  • level (str, optional) – If specified (must be ‘phone’, ‘syllable’ or ‘word’), remove only the separators of the given level. Else remove all the separators.


  • The utterance with the specified separators removed. Multiple spaces are removed as well.


ValueError – If the level is specified and is not ‘phone’, ‘syllable’ or ‘word’.
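
The behaviour described for remove can be approximated with plain string operations. The sketch below is an assumption about one way to do it, not the library's implementation: `remove_separators` is a hypothetical function, and details such as the handling of a trailing space may differ in wordseg.

```python
import re

# Default separators taken from the class signature above.
SEPARATORS = {'phone': ' ', 'syllable': ';esyll', 'word': ';eword'}


def remove_separators(utterance, separators, level=None):
    """Remove the separators of `level`, or of all levels if it is None."""
    if level is not None and level not in separators:
        raise ValueError('unknown level {!r}'.format(level))

    targets = [level] if level is not None else list(separators)
    # Longest separators first, so that a short one (here the space)
    # is only removed after the longer ones have been handled.
    seps = sorted(
        (separators[t] for t in targets if separators[t] is not None),
        key=len, reverse=True)

    for sep in seps:
        utterance = utterance.replace(sep, '')

    # Collapse the runs of spaces left behind by the removal.
    return re.sub(' {2,}', ' ', utterance)


utt = 'a b ;eword c d ;eword'
print(repr(remove_separators(utt, SEPARATORS, level='word')))  # 'a b c d '
print(repr(remove_separators(utt, SEPARATORS)))                # 'abcd'
```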

split(utterance, level, keep_boundaries=False)

Split the utterance at a given token level

This method is sensitive to whether the utterance has been stripped or not. It may output empty tokens.

  • utterance (str) – The string to split in tokens.

  • level (str) – Token level to split the string with. Must be ‘phone’, ‘syllable’ or ‘word’.

  • keep_boundaries (bool, optional) – If False (default), remove all the separators for all levels from the returned sub-utterances.


tokens – The tokens extracted from the utterance, may include empty tokens.

Return type



ValueError – If the level is not ‘phone’, ‘syllable’ or ‘word’.

See also


a higher-level method to split an utterance
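
To see where empty tokens come from, the following standalone sketch splits a string with Python's built-in `str.split`. It only illustrates the effect of a trailing separator; the helpers `split_on` and `drop_empty` are hypothetical and are not part of wordseg.

```python
def split_on(utterance, sep):
    """Split `utterance` on every occurrence of `sep`, keeping empty tokens."""
    return utterance.split(sep)


def drop_empty(tokens):
    """Strip surrounding spaces and discard the tokens that end up empty."""
    return [t.strip() for t in tokens if t.strip()]


# A trailing word separator produces a final empty token.
tokens = split_on('a b ;eword c d ;eword', ';eword')
print(tokens)              # ['a b ', ' c d ', '']

# Without the trailing separator there is no empty token.
print(split_on('a b ;eword c d', ';eword'))  # ['a b ', ' c d']

print(drop_empty(tokens))  # ['a b', 'c d']
```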

strip(utterance, level=None)

Removes leading and trailing separators from an utterance

  • utterance (str) – The utterance to be stripped.

  • level (str, optional) – Specify the level boundaries to strip. If not specified remove all the boundaries. If specified, must be ‘phone’, ‘syllable’ or ‘word’.


Return type

The stripped utterance
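
Stripping can be pictured as repeatedly removing any separator found at either end of the string. The sketch below is a hypothetical illustration under that assumption (`strip_separators` is not a wordseg function), and the library may implement it differently.

```python
def strip_separators(utterance, seps):
    """Remove leading and trailing occurrences of the separators in `seps`."""
    seps = [s for s in seps if s]  # ignore undefined (None) levels
    changed = True
    while changed:
        changed = False
        for sep in seps:
            if utterance.startswith(sep):
                utterance = utterance[len(sep):]
                changed = True
            if utterance.endswith(sep):
                utterance = utterance[:-len(sep)]
                changed = True
    return utterance


utt = ' ;eword a b ;eword '
# All levels: spaces and word separators are removed at both ends.
print(repr(strip_separators(utt, [' ', ';esyll', ';eword'])))  # 'a b'
# Word level only: the surrounding spaces are not separators here.
print(repr(strip_separators(';eword a b ;eword', [';eword'])))  # ' a b '
```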

tokenize(utterance, level=None, keep_boundaries=True)

Return the tokens in utterance at the given level

Iterates over phones, syllables or words within a given utterance, other levels being ignored.

  • utterance (str) – The utterance to be tokenized.

  • level (str, optional) – The level to tokenize the utterance at, must be ‘phone’, ‘syllable’ or ‘word’. If not specified, tokenize at all the defined levels and return a nested list.

  • keep_boundaries (bool, optional) – When True (default) preserve the sublevel token boundaries in the output. When False all token boundaries are removed.


token – The successive phones, syllables or words tokenized from the utterance. From outer to inner levels in the returned nested list are words, syllables and phones. Empty tokens are ignored, tokens are stripped.

Return type

list of (list of (list of)) str


ValueError – If the level is not ‘phone’, ‘syllable’ or ‘word’.


>>> from wordseg.separator import Separator
>>> s = Separator(phone=' ', syllable=None, word=';eword')
>>> t = 'j uː ;eword n oʊ ;eword dʒ ʌ s t ;eword'
>>> list(s.tokenize(t, level='word'))
['j uː', 'n oʊ', 'dʒ ʌ s t']
>>> list(s.tokenize(t, level='word', keep_boundaries=False))
['juː', 'noʊ', 'dʒʌst']
>>> list(s.tokenize(t, level='phone'))
['j', 'uː', 'n', 'oʊ', 'dʒ', 'ʌ', 's', 't']
>>> list(s.tokenize(t))
[['j', 'uː'], ['n', 'oʊ'], ['dʒ', 'ʌ', 's', 't']]
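
The nested result of the last call can be reproduced with a short recursive sketch that does not depend on wordseg. The function `tokenize_nested` is hypothetical; it takes the defined separators ordered from outer to inner and ignores empty tokens, as described above.

```python
def tokenize_nested(utterance, seps):
    """Split `utterance` recursively on `seps`, ordered outer to inner.

    Returns a stripped string when no separator is left, else a list
    with one entry per non-empty token at the current level.
    """
    if not seps:
        return utterance.strip()

    outer, inner = seps[0], seps[1:]
    parts = [p for p in utterance.split(outer) if p.strip()]
    return [tokenize_nested(p, inner) for p in parts]


t = 'j uː ;eword n oʊ ;eword dʒ ʌ s t ;eword'
# Word separator first, then phone separator (syllable level undefined).
print(tokenize_nested(t, [';eword', ' ']))
# [['j', 'uː'], ['n', 'oʊ'], ['dʒ', 'ʌ', 's', 't']]
```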

Lists the defined levels above the given one


level (str) – Must be ‘phone’, ‘syllable’ or ‘word’.


ValueError – When level is not defined in the separator.


>>> from wordseg.separator import Separator
>>> s = Separator(phone='p', syllable='s', word='w')
>>> s.upper_levels('phone')
['syllable', 'word']
>>> s.upper_levels('word')
[]
>>> s = Separator(phone='p', syllable=None, word='w')
>>> s.upper_levels('phone')
['word']