Data Folding
Note
wordseg.folding is available in Python only, not in bash.
Folding and unfolding texts for use in iterative word segmenters
Iterative algorithms pass through the input text only once and learn their model online. Because the model has seen little data when it processes the first lines, only the end of the text is relevant for evaluating the algorithm. To use the whole input for evaluation, the folding module creates “folded” versions of a text to be used in iterative text-based algorithms.
Let “A B C” be a text made of three blocks A, B and C having roughly the same number of lines. Folding that text in 3 generates a list of 3 versions, [“A1 B1 C1”, “C2 A2 B2”, “B3 C3 A3”], where the digit identifies the version. The algorithm is run over the 3 versions, and their outputs are then unfolded by concatenating the last block of each version, which recovers the text in its original order: “A3 B2 C1”.
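The rotation pattern in this example can be sketched in a few lines of Python. This is an illustration only, not the wordseg implementation: `rotations` is a hypothetical helper, and each block is reduced to a single label.

```python
def rotations(blocks):
    """Return len(blocks) versions of `blocks`, each rotated one step further to the right."""
    n = len(blocks)
    return [blocks[n - k:] + blocks[:n - k] for k in range(n)]


folds = rotations(["A", "B", "C"])
print(folds)  # [['A', 'B', 'C'], ['C', 'A', 'B'], ['B', 'C', 'A']]

# Every block ends exactly one version, so the last blocks cover the whole text.
last_blocks = [fold[-1] for fold in folds]

# Reading the last blocks from the final version back to the first restores the order.
restored = last_blocks[::-1]
print(restored)  # ['A', 'B', 'C']
```

Since each block is processed last in exactly one version, every line of the text is segmented once by a model that has already seen the rest of the input.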
wordseg.folding.boundaries(text, nfolds)
Returns nfolds boundaries as a list of line indices in text.
- Parameters
text (list) – The input text as a list of utterances.
nfolds (int) – The number of fold boundaries to compute.
- Returns
boundaries – The list of indices in text corresponding to the computed fold boundaries.
- Return type
list
- Raises
ValueError – If the text does not have enough lines to build the requested nfolds, or if nfolds is not strictly positive.
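As a rough sketch of what such a function could compute, the hypothetical `even_boundaries` below places the boundaries at multiples of `len(text) // nfolds`. That even spacing is an assumption made for illustration; the library may distribute the lines differently.

```python
def even_boundaries(text, nfolds):
    """Return `nfolds` start indices splitting `text` into blocks of similar size.

    Hypothetical stand-in for illustration, not the wordseg implementation.
    """
    if nfolds < 1:
        raise ValueError("nfolds must be strictly positive")
    if len(text) < nfolds:
        raise ValueError("not enough lines to build {} folds".format(nfolds))

    block_size = len(text) // nfolds
    # The last block absorbs the remainder when len(text) is not a multiple of nfolds.
    return [i * block_size for i in range(nfolds)]


utterances = ["a1", "a2", "b1", "b2", "c1", "c2", "c3"]
print(even_boundaries(utterances, 3))  # [0, 2, 4]
```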
wordseg.folding.fold(text, nfolds, fold_boundaries=None)
Create nfolds versions of an input text.
To support the unfold operation, this function also builds the index of the beginning of the last block in each fold.
- Parameters
text (list) – The input text as a list of utterances.
nfolds (int) – The number of folds to build on text.
fold_boundaries (list, optional) – An increasing list of length nfolds with the start index of each fold in text. By default, the boundaries are computed with the boundaries() function.
- Returns
folds (list) – a list of folded versions of the text
index (list) – for each fold, the index at which its last block begins, used for text unfolding
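To make the two return values concrete, here is a minimal sketch of folding with explicit boundaries. `fold_sketch` is a hypothetical name, and the rotation order is assumed to follow the “A B C” example from the overview; it is not the library's code.

```python
def fold_sketch(text, fold_boundaries):
    """Return (folds, index) for `text` cut at the given start indices."""
    # Cut the text into consecutive blocks, one per boundary.
    ends = list(fold_boundaries[1:]) + [len(text)]
    blocks = [text[start:end] for start, end in zip(fold_boundaries, ends)]

    folds, index = [], []
    n = len(blocks)
    for k in range(n):
        # Rotate the blocks k steps to the right: A B C, then C A B, then B C A.
        rotated = blocks[n - k:] + blocks[:n - k]
        fold = [line for block in rotated for line in block]
        folds.append(fold)
        # Position in this fold where its last block starts.
        index.append(len(fold) - len(rotated[-1]))
    return folds, index


text = ["a1", "a2", "b1", "b2", "c1", "c2", "c3"]
folds, index = fold_sketch(text, [0, 2, 4])
print(folds[1])  # ['c1', 'c2', 'c3', 'a1', 'a2', 'b1', 'b2']
print(index)     # [4, 5, 5]
```

In this sketch the blocks have unequal sizes, which is why the index differs from one fold to the next: it depends on the length of whichever block ends that fold.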
wordseg.folding.unfold(folds, index)
Concatenate the last block of each fold to form the unfolded text.
This is the reverse operation of the fold() function.
- Parameters
folds – As output by the fold() function.
index – As output by the fold() function.
- Returns
unfolded_text – The unfolded utterances composed of ending blocks of the input folds.
- Return type
list of str
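A minimal sketch of the unfold step follows. The hypothetical `unfold_sketch` assumes the fold order of the “A B C” example (the first fold ends with the last block of the text), so it reads the ending blocks from the last fold back to the first. The `-1`, `-2` and `-3` suffixes stand in for the output of the algorithm on each fold.

```python
def unfold_sketch(folds, index):
    """Concatenate the ending block of each fold, restoring the original block order."""
    # Keep only the last block of each fold, starting at the recorded position.
    tails = [fold[start:] for fold, start in zip(folds, index)]
    # Fold 1 ends with block C, fold 2 with B, fold 3 with A: read them in reverse.
    return [line for tail in reversed(tails) for line in tail]


folds = [
    ["a-1", "b-1", "c-1"],  # A1 B1 C1
    ["c-2", "a-2", "b-2"],  # C2 A2 B2
    ["b-3", "c-3", "a-3"],  # B3 C3 A3
]
index = [2, 2, 2]
print(unfold_sketch(folds, index))  # ['a-3', 'b-2', 'c-1']
```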