Data Folding

Note

Available as wordseg.folding in Python; not available in bash.

Folding and unfolding texts for use in iterative word segmenters

Iterative algorithms pass through the input text only once, learning the model online. Only the end of the text is therefore relevant when evaluating the algorithm. To use the whole input for evaluation, the folding module creates “folded” versions of a text for use with iterative text-based algorithms.

Let “A B C” be a text made of three blocks A, B and C having roughly the same number of lines. Folding that text into 3 folds generates a list of 3 versions: [“A1 B1 C1”, “C2 A2 B2”, “B3 C3 A3”]. The algorithm is run over the 3 versions, and the last block of each output is kept. These blocks are then unfolded to recover the text in its original order, “A3 B2 C1”.
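The block rotation described above can be illustrated with a small sketch. It does not use wordseg: rotate_blocks is a hypothetical helper, and the rotation order is inferred from the “A B C” example rather than taken from the library's code.

```python
def rotate_blocks(blocks, nfolds):
    """Return nfolds orderings of blocks, each rotated one step further."""
    orderings = []
    for k in range(nfolds):
        # k = 0 keeps the original order; each further fold moves the
        # current last block to the front
        shift = k % len(blocks)
        split = len(blocks) - shift
        orderings.append(blocks[split:] + blocks[:split])
    return orderings


blocks = ["A", "B", "C"]
orderings = rotate_blocks(blocks, 3)
print(orderings)  # [['A', 'B', 'C'], ['C', 'A', 'B'], ['B', 'C', 'A']]

# the last block of each ordering, read in reverse fold order,
# restores the original sequence A B C
last_blocks = [ordering[-1] for ordering in orderings]
print(last_blocks[::-1])  # ['A', 'B', 'C']
```

Every block ends up last in exactly one ordering, which is why the whole text can be evaluated from the ends of the folds.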

wordseg.folding.boundaries(text, nfolds)

Returns nfolds boundaries as a list of line indices in text

Parameters
  • text (list) – The input text as a list of utterances.

  • nfolds (int) – The number of fold boundaries to compute.

Returns

boundaries – The list of indices in text corresponding to the computed fold boundaries.

Return type

list

Raises

ValueError – If the text does not have enough lines to build the requested nfolds, or if nfolds is not strictly positive.
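As a rough illustration of what such boundaries could look like, here is a minimal sketch that assumes equal-size blocks obtained by integer division. sketch_boundaries is a hypothetical stand-in, not the wordseg implementation, and the library may place boundaries differently.

```python
def sketch_boundaries(text, nfolds):
    """Hypothetical equal-size fold boundaries (not wordseg's actual code)."""
    if nfolds < 1:
        raise ValueError("nfolds must be strictly positive")
    if len(text) < nfolds:
        raise ValueError("not enough lines to build the requested folds")

    fold_size = len(text) // nfolds
    # with this scheme, any remainder lines fall into the last block
    return [i * fold_size for i in range(nfolds)]


text = ["utt %d" % i for i in range(7)]
print(sketch_boundaries(text, 3))  # [0, 2, 4]
```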

wordseg.folding.fold(text, nfolds, fold_boundaries=None)

Create nfolds versions of an input text

To support the unfold operation, this function also builds the index of the beginning of the last block in each fold.

Parameters
  • text (list) – The input text as a list of utterances.

  • nfolds (int) – The number of folds to build on text.

  • fold_boundaries (list, optional) – An increasing list of length nfolds with the start index of each fold in text. By default, use the boundaries() function.

Returns

  • folds (list) – a list of folded versions of the text

  • index (list) – a list of index positions for text unfolding
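The relation between the folds and the index can be sketched as follows. sketch_fold is a hypothetical helper written for illustration only. It assumes that fold_boundaries has exactly nfolds entries and that fold k rotates the blocks k steps, as in the “A B C” example. The real function may order its folds differently.

```python
def sketch_fold(text, nfolds, fold_boundaries):
    """Hypothetical fold: rotate blocks, record where the last block starts."""
    # cut the text into blocks delimited by the boundaries
    ends = fold_boundaries[1:] + [len(text)]
    blocks = [text[i:j] for i, j in zip(fold_boundaries, ends)]

    folds, index = [], []
    for k in range(nfolds):
        split = len(blocks) - k
        rotated = blocks[split:] + blocks[:split]
        fold = [line for block in rotated for line in block]
        folds.append(fold)
        # start position of the last block within this fold
        index.append(len(fold) - len(rotated[-1]))
    return folds, index


text = ["a1", "a2", "b1", "b2", "c1", "c2", "c3"]
folds, index = sketch_fold(text, 3, [0, 2, 4])
print(folds[1])  # ['c1', 'c2', 'c3', 'a1', 'a2', 'b1', 'b2']
print(index)     # [4, 5, 5]
```

The index values differ between folds because the blocks need not have the same length.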

wordseg.folding.unfold(folds, index)

Concatenate the last block of each fold to form the unfolded text

This is the reverse operation of the fold() function.

Parameters
  • folds – As returned by the fold() function.

  • index – As returned by the fold() function.

Returns

unfolded_text – The unfolded utterances composed of ending blocks of the input folds.

Return type

list of str
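The unfolding step can be sketched in the same spirit. sketch_unfold is a hypothetical helper, not the library function, and the folds below are hand-written stand-ins for segmenter outputs, where the suffix marks which fold produced the line. The reversal assumes the rotation order of the “A B C” example.

```python
def sketch_unfold(folds, index):
    """Hypothetical unfold: keep each fold's last block, restore text order."""
    last_blocks = [fold[i:] for fold, i in zip(folds, index)]

    # with the rotation assumed here, fold k ends with block n-1-k,
    # so the last blocks are concatenated in reverse fold order
    unfolded = []
    for block in reversed(last_blocks):
        unfolded.extend(block)
    return unfolded


# stand-ins for the outputs of a segmenter run on three folds of "a b c"
folds = [
    ["a:1", "b:1", "c:1"],
    ["c:2", "a:2", "b:2"],
    ["b:3", "c:3", "a:3"],
]
index = [2, 2, 2]
print(sketch_unfold(folds, index))  # ['a:3', 'b:2', 'c:1']
```

The result mirrors the “A3 B2 C1” example: each line of the unfolded text comes from the end of a fold, where the online model has seen the most data.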