Token Separation

Manage token separation at phone, syllable and word levels

class wordseg.separator.Separator(phone=' ', syllable=';esyll', word=';eword')

Token separation at phone, syllable and word levels

A Separator is made of three entries (phone, syllable and word) that define the token separators for each of these levels within an utterance. A token separator can be a string or None. Entries that are not None must all be different from one another.

The following characters are forbidden in separators: !#$%&'*+-.^`|~:\"
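The two constraints above (distinct separators, no forbidden characters) can be checked in a few lines of plain Python. The sketch below is not the wordseg implementation: `validate_separators` is a hypothetical helper, and the forbidden set is copied from the `forbidden_chars` attribute documented for this class.

```python
# Characters copied from the documented Separator.forbidden_chars attribute.
FORBIDDEN_CHARS = '!#$%&\'*+-.^`|~:\\"'


def validate_separators(phone=' ', syllable=';esyll', word=';eword'):
    """Hypothetical helper mirroring the documented constraints."""
    seps = {'phone': phone, 'syllable': syllable, 'word': word}
    # None means the level is not defined, so it is left out of the checks.
    defined = {k: v for k, v in seps.items() if v is not None}

    # Separators that are not None must all be different.
    if len(set(defined.values())) != len(defined):
        raise ValueError(
            'separators must all be different: {}'.format(defined))

    # No separator may contain a forbidden character.
    for level, sep in defined.items():
        bad = [c for c in sep if c in FORBIDDEN_CHARS]
        if bad:
            raise ValueError(
                'forbidden character(s) {} in {} separator'.format(bad, level))

    return defined
```

With the default values the helper returns all three levels; passing `syllable=None` simply leaves that level undefined.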

check_level(level): Raises ValueError if level is not defined in the separator

check_separator(sep): Raises ValueError if sep contains a forbidden character

forbidden_chars = '!#$%&\'*+-.^`|~:\\"'

Characters forbidden in separators

They interfere with regular expression processing

iterate(type='value')

Yields the phone, syllable and word token separators, in that order

Parameters: type (str, optional) – Type of separator representation to return, must be ‘value’ or ‘pair’.
Yields: token (str or tuple) – In the form token_value if type is ‘value’. In the form (token_name, token_value) if type is ‘pair’.
Raises: ValueError – If the type is not ‘value’ or ‘pair’.
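The difference between the two representations can be sketched with a small generator. This is an illustration under stated assumptions, not the library code: `iterate_separators` is a hypothetical function, and the separators are passed in as a plain dict. How undefined (None) levels are handled is not stated above, so the example only uses fully defined separators.

```python
def iterate_separators(separators, type='value'):
    """Hypothetical stand-in for the documented iterate() behaviour.

    `separators` maps the level names to their separator strings.
    """
    if type not in ('value', 'pair'):
        raise ValueError(
            "type must be 'value' or 'pair', it is {}".format(type))

    # Levels are visited from inner to outer: phone, syllable, word.
    for name in ('phone', 'syllable', 'word'):
        value = separators[name]
        yield value if type == 'value' else (name, value)


seps = {'phone': ' ', 'syllable': ';esyll', 'word': ';eword'}
values = list(iterate_separators(seps))
pairs = list(iterate_separators(seps, type='pair'))
```

Because the function is a generator, the ValueError for an invalid `type` is only raised once iteration starts.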

levels(): The list of defined token levels from inner to outer

remove(utterance, level=None)

Returns the utterance with separators removed

Parameters

utterance (str) – The string to remove the separators from
level (str, optional) – If specified (must be ‘phone’, ‘syllable’ or ‘word’), remove only the separators of the given level. Else remove all the separators.

Returns

The utterance with specified separators removed. Multiple spaces are removed as well.

Raises

ValueError – If the level is specified and is not ‘phone’, ‘syllable’ or ‘word’.
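A minimal sketch of this behaviour, assuming that "multiple spaces are removed" means runs of spaces are collapsed to a single space. `remove_separators` and the `SEPARATORS` dict are hypothetical names used for illustration; the real method may differ in details such as the handling of leading or trailing spaces.

```python
import re

# Hypothetical separator table using the documented default values.
SEPARATORS = {'phone': ' ', 'syllable': ';esyll', 'word': ';eword'}


def remove_separators(utterance, level=None, separators=SEPARATORS):
    """Remove the separators of one level, or of all levels if level is None."""
    if level is not None and level not in separators:
        raise ValueError('unknown level: {}'.format(level))

    # Process outer levels first so that removing the phone separator (a
    # space by default) cannot break a longer separator containing it.
    levels = [level] if level is not None else ['word', 'syllable', 'phone']
    for name in levels:
        sep = separators[name]
        if sep is not None:
            utterance = utterance.replace(sep, '')

    # Assumption: runs of spaces left behind are collapsed to one space.
    return re.sub(' +', ' ', utterance)


text = 'a b ;eword c d ;eword'
only_words_removed = remove_separators(text, level='word')
all_removed = remove_separators(text)
```

Removing only the word separators keeps the phone boundaries (spaces) in place, while removing all levels concatenates the phones.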

split(utterance, level, keep_boundaries=False)

Split the utterance at a given token level

This method is sensitive to whether the utterance is stripped or not. It may output empty tokens.

Parameters

utterance (str) – The string to split in tokens.
level (str) – Token level to split the string with. Must be ‘phone’, ‘syllable’ or ‘word’.
keep_boundaries (bool, optional) – If False (default), remove all the separators for all levels from the returned sub-utterances.

Returns

tokens – The tokens extracted from the utterance, may include empty tokens.

Return type

generator

Raises

ValueError – If the level is not ‘phone’, ‘syllable’ or ‘word’.

See also

tokenize(): a higher-level method to split an utterance
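Why stripping matters can be illustrated with Python's built-in `str.split`, used here only as a stand-in; it is not claimed to be how `split()` is implemented. An utterance that ends with a word separator produces a trailing empty token, which disappears once that separator is stripped.

```python
word_sep = ';eword'
utterance = 'a b ;eword c d ;eword'

# Splitting the raw utterance leaves an empty token after the last separator.
raw_tokens = utterance.split(word_sep)

# Removing the trailing separator first avoids the empty token.
stripped = utterance
if stripped.endswith(word_sep):
    stripped = stripped[:-len(word_sep)]
stripped_tokens = stripped.split(word_sep)
```

Here `raw_tokens` has three entries, the last one empty, while `stripped_tokens` has two.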

strip(utterance, level=None)

Removes leading and trailing separators of an utterance

Parameters

utterance (str) – The utterance to be stripped.
level (str, optional) – Specify the level boundaries to strip. If not specified remove all the boundaries. If specified, must be ‘phone’, ‘syllable’ or ‘word’.

Returns

The stripped utterance.

Return type

str
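One way to picture the stripping of all levels is a loop that peels separators off both ends until none remain. The function below is a hypothetical sketch written for this illustration, with the default separator values assumed; it is not the wordseg implementation.

```python
def strip_separators(utterance, separators=(' ', ';esyll', ';eword')):
    """Remove leading and trailing separators, whatever their order."""
    seps = [s for s in separators if s]  # ignore undefined levels
    changed = True
    while changed:
        changed = False
        for sep in seps:
            if utterance.startswith(sep):
                utterance = utterance[len(sep):]
                changed = True
            if utterance.endswith(sep):
                utterance = utterance[:-len(sep)]
                changed = True
    return utterance


result = strip_separators(' ;eword a b ;esyll ;eword ')
```

Separators inside the utterance are left untouched; only those at the two ends are removed.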

tokenize(utterance, level=None, keep_boundaries=True)

Return the tokens in utterance at the given level

Iterates on phones, syllables or words within a given utterance, other levels being ignored.

Parameters

utterance (str) – The utterance to be tokenized.
level (str, optional) – The level to tokenize the utterance at, must be ‘phone’, ‘syllable’ or ‘word’. If not specified, tokenize at all the defined levels and return a nested list.
keep_boundaries (bool, optional) – When True (default) preserve the sublevel token boundaries in the output. When False all token boundaries are removed.

Returns

token – The successive phones, syllables or words tokenized from the utterance. When no level is specified, the returned nested list is organized from outer to inner levels: words, then syllables, then phones. Empty tokens are ignored and tokens are stripped.

Return type

list of (list of (list of)) str

Raises

ValueError – If the level is not ‘phone’, ‘syllable’ or ‘word’.

Examples

>>> from wordseg.separator import Separator
>>> s = Separator(phone=' ', syllable=None, word=';eword')
>>> t = 'j uː ;eword n oʊ ;eword dʒ ʌ s t ;eword'
>>> list(s.tokenize(t, level='word'))
['j uː', 'n oʊ', 'dʒ ʌ s t']
>>> list(s.tokenize(t, level='word', keep_boundaries=False))
['juː', 'noʊ', 'dʒʌst']
>>> list(s.tokenize(t, level='phone'))
['j', 'uː', 'n', 'oʊ', 'dʒ', 'ʌ', 's', 't']
>>> list(s.tokenize(t))
[['j', 'uː'], ['n', 'oʊ'], ['dʒ', 'ʌ', 's', 't']]
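The examples above leave the syllable level undefined. To show what the fully nested result looks like when all three levels are defined, here is a standalone sketch built on `str.split`. `tokenize_nested` is a hypothetical function that approximates the documented output shape (words, then syllables, then phones, with empty tokens dropped); it does not call wordseg.

```python
def tokenize_nested(utterance, phone=' ', syllable=';esyll', word=';eword'):
    """Return a list of words, each a list of syllables, each a list of phones."""
    words = []
    for word_text in utterance.split(word):
        syllables = []
        for syllable_text in word_text.split(syllable):
            # Drop the empty strings produced by leading or trailing spaces.
            phones = [p for p in syllable_text.split(phone) if p]
            if phones:
                syllables.append(phones)
        if syllables:  # ignore empty words
            words.append(syllables)
    return words


text = 'h e ;esyll l o ;esyll ;eword w o r l d ;esyll ;eword'
nested = tokenize_nested(text)
```

The outer list holds two words, the first made of two syllables and the second of one.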

upper_levels(level)

Lists the defined levels higher than the given one

Parameters: level (str) – Must be ‘phone’, ‘syllable’ or ‘word’.
Raises: ValueError – when level is not defined in the separator.

Examples

>>> from wordseg.separator import Separator
>>> s = Separator(phone='p', syllable='s', word='w')
>>> s.upper_levels('phone')
['syllable', 'word']
>>> s.upper_levels('word')
[]
>>> s = Separator(phone='p', syllable=None, word='w')
>>> s.upper_levels('phone')
['word']