Segmentation Evaluation

Note

Use wordseg.evaluate from Python and wordseg-eval from bash.

Word segmentation evaluation

Evaluates a segmented text against its gold version: outputs the precision, recall and f-score at type, token and boundary levels. Boundary scores are reported in two variants, depending on whether utterance edges (the beginning and end of each utterance) are counted or not.

The evaluation optionally computes the adjusted Rand index (requires the prepared text to be provided) and a summary of which words were segmented correctly or incorrectly (requires an output JSON file to be specified).

class wordseg.evaluate.BoundaryEvaluation

Bases: wordseg.evaluate.TokenEvaluation

Evaluation of boundary f-score, precision and recall

Includes first and last boundary of an utterance

static get_boundary_positions(stringpos): Returns the positions of boundaries

update_lists(text, gold): Updates the evaluation for a sequence of utterances

class wordseg.evaluate.BoundaryNoEdgeEvaluation

Bases: wordseg.evaluate.BoundaryEvaluation

Evaluation of boundary f-score, precision and recall

Excludes first and last boundary of an utterance

static get_boundary_positions(stringpos): Returns the positions of boundaries
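The difference between the two boundary evaluations can be illustrated with a minimal, self-contained sketch. It does not use wordseg: `boundary_positions` and `scores` are hypothetical helpers written for this illustration, and returning 0.0 when a score is undefined is an assumption, not necessarily what the library does.

```python
def boundary_positions(utterance, include_edges=True):
    """Return boundary offsets of a space separated utterance.

    Offsets are character indices in the utterance once spaces are removed.
    """
    positions = set()
    offset = 0
    for word in utterance.split():
        positions.add(offset)   # boundary before the word
        offset += len(word)
        positions.add(offset)   # boundary after the word
    if not include_edges:
        # Drop the utterance-initial and utterance-final boundaries.
        positions -= {0, offset}
    return positions


def scores(found, gold):
    """Precision, recall and f-score of `found` against `gold` (two sets)."""
    correct = len(found & gold)
    precision = correct / len(found) if found else 0.0  # assumed convention
    recall = correct / len(gold) if gold else 0.0       # assumed convention
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore


text, gold = "thedog bites", "the dog bites"
print(scores(boundary_positions(text), boundary_positions(gold)))
print(scores(boundary_positions(text, False), boundary_positions(gold, False)))
```

On this pair, counting the edges raises recall from 0.5 to 0.75, because the two edge boundaries are always found.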

class wordseg.evaluate.SegmentationSummary

Bases: object

Computes a summary of the segmentation errors

The errors can be oversegmentations, undersegmentations or missegmentations. Correct segmentations are also reported.

summarize(text, gold)

Computes segmentation errors on a whole text

Calls summarize_utterance() on each utterance of gold and text.

Parameters

text (list of str) – The list of utterances for the segmented text (to be evaluated)
gold (list of str) – The list of utterances for the gold text

Raises

ValueError – If text and gold do not have the same number of utterances, or if summarize_utterance() raises a ValueError.

summarize_utterance(text, gold)

Computes segmentation errors on a single utterance

This method returns no result but updates the internal summary, accessible using to_dict().

Parameters

text (str) – A segmented utterance
gold (str) – A gold utterance

Raises

ValueError – If text and gold are mismatched, i.e. they do not contain the same sequence of letters once all spaces are removed.
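The mismatch condition can be sketched as a simple comparison after stripping spaces. `check_same_letters` is a hypothetical helper for illustration only. It assumes words are separated by plain spaces and is not the library's implementation.

```python
def check_same_letters(text, gold):
    """Raise ValueError unless both utterances hold the same letters.

    Only the segmentation (the placement of spaces) is allowed to differ.
    """
    if text.replace(' ', '') != gold.replace(' ', ''):
        raise ValueError(
            'mismatch in utterances: "{}" != "{}"'.format(text, gold))


check_same_letters("thedog bites", "the dog bites")  # passes silently
```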

to_dict()

Exports the summary as a dictionary

Returns: summary – A dictionary with the complete summary in the following entries: ‘over’, ‘under’, ‘mis’, ‘correct’. In each entry, the words are sorted by decreasing frequency, then alphabetically for equal frequencies.
Return type: dict
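The ordering described above (decreasing frequency, then alphabetical) can be reproduced with a two-part sort key. This is a sketch only: `sort_by_frequency` is a hypothetical helper, and representing each entry as a word-to-count mapping is an assumption about the summary's layout.

```python
from collections import Counter


def sort_by_frequency(counts):
    """Order a word -> count mapping by decreasing count, then by word."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return dict(ordered)  # dicts preserve insertion order (Python 3.7+)


counts = Counter(["dog", "the", "cat", "the", "dog", "a"])
print(sort_by_frequency(counts))
```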

class wordseg.evaluate.TokenEvaluation

Bases: object

Evaluation of token f-score, precision and recall

exact_match(): Returns the number of exact matches

fscore(): Returns the token f-score

precision(): Returns the token precision

recall(): Returns the token recall

update(test_set, gold_set): Updates the evaluation for a single utterance

update_lists(test, gold): Updates the evaluation for a sequence of utterances
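As a rough illustration of token-level scoring, the sketch below treats each word as a (start, stop) span over the unsegmented utterance and counts a token as correct only when both ends match the gold. It is self-contained and does not call wordseg. `token_spans` and `token_scores` are hypothetical names, and the counting scheme is an assumption about how such scores are usually defined rather than a description of this class's internals.

```python
def token_spans(utterance):
    """Return the set of (start, stop) spans of the words in an utterance."""
    spans = set()
    start = 0
    for word in utterance.split():
        stop = start + len(word)
        spans.add((start, stop))
        start = stop
    return spans


def token_scores(text, gold):
    """Token precision, recall and f-score accumulated over utterances."""
    correct = n_text = n_gold = 0
    for text_utt, gold_utt in zip(text, gold):
        found, expected = token_spans(text_utt), token_spans(gold_utt)
        correct += len(found & expected)  # tokens with both edges right
        n_text += len(found)
        n_gold += len(expected)
    precision = correct / n_text if n_text else 0.0
    recall = correct / n_gold if n_gold else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore


print(token_scores(["thedog bites", "a cat"], ["the dog bites", "a cat"]))
```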

class wordseg.evaluate.TypeEvaluation

Bases: wordseg.evaluate.TokenEvaluation

Evaluation of type f-score, precision and recall

static lexicon_check(textlex, goldlex): Compares the hypothesis and gold lexicons

update_lists(text, gold): Updates the evaluation for a sequence of utterances
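Type-level scoring compares lexicons (sets of distinct words) rather than word occurrences. The following sketch shows the idea under that assumption. `lexicon` and `type_scores` are hypothetical helpers and are not part of wordseg.

```python
def lexicon(utterances):
    """Return the set of distinct words found in space separated utterances."""
    return {word for utt in utterances for word in utt.split()}


def type_scores(text, gold):
    """Type precision, recall and f-score from the two lexicons."""
    found, expected = lexicon(text), lexicon(gold)
    correct = len(found & expected)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(expected) if expected else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore


print(type_scores(["the dog", "thecat"], ["the dog", "the cat"]))
```

In this example "the" counts once in the gold lexicon although it occurs twice in the gold text, so a frequent word weighs no more than a rare one.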

wordseg.evaluate.compute_class_labels(words, units)

Compute class labels to be used for cluster similarity measures

Each word is considered a class, and each unit is mapped to the word it belongs to. This function is used as a preprocessing step for the Adjusted Rand Index.

Parameters

words (list of str) – Utterances made of space separated words.
units (list of str) – Utterances made of space separated atomic units (phonemes or syllables).

Returns

class_labels – Each unit mapped to the word it belongs to (with words coded as integers)

Return type

numpy array of int

Raises

ValueError – If words and units do not match

Examples

>>> from wordseg.evaluate import compute_class_labels
>>> words = ['hello world', 'python']
>>> units = ['h el lo wo r ld', 'py th on']
>>> compute_class_labels(words, units)
array([0, 0, 0, 1, 1, 1, 2, 2, 2])
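To make the mapping concrete, here is a pure-Python sketch that reproduces the labels of the example above. It is not the library's code: `class_labels_sketch` is a hypothetical name, it returns a plain list instead of a numpy array, and giving every word occurrence a fresh integer label is an assumption (the example has no repeated word, so it cannot confirm this).

```python
def class_labels_sketch(words, units):
    """Map each unit to the integer label of the word that contains it."""
    if len(words) != len(units):
        raise ValueError('words and units have different lengths')
    labels = []
    label = 0
    for utt_words, utt_units in zip(words, units):
        pending = utt_units.split()
        index = 0
        for word in utt_words.split():
            covered = ''
            # Consume units until they span the whole word.
            while len(covered) < len(word):
                if index >= len(pending):
                    raise ValueError('not enough units for "{}"'.format(word))
                covered += pending[index]
                index += 1
                labels.append(label)
            if covered != word:
                raise ValueError('units do not align with "{}"'.format(word))
            label += 1  # assumed: one new label per word occurrence
        if index != len(pending):
            raise ValueError('units left over in "{}"'.format(utt_units))
    return labels


print(class_labels_sketch(['hello world', 'python'],
                          ['h el lo wo r ld', 'py th on']))
```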

wordseg.evaluate.evaluate(text, gold, units=None)

Scores a segmented text against its gold version

Parameters

text (sequence of str) – A sequence of utterances made of space separated words.
gold (sequence of str) – A sequence of utterances made of space separated words.
units (sequence of str, optional) – A sequence of utterances made of space separated atomic units (phonemes or syllables). When specified, the function also computes the adjusted Rand index.

Returns

scores – A dictionary with the following entries in that fixed order:

'type_fscore'
'type_precision'
'type_recall'
'token_fscore'
'token_precision'
'token_recall'
'boundary_all_fscore'
'boundary_all_precision'
'boundary_all_recall'
'boundary_noedge_fscore'
'boundary_noedge_precision'
'boundary_noedge_recall'

If units is specified, this additional entry is added:

'adjusted_rand_index'

Return type

ordered dict

Raises

ValueError – If gold and text differ in size or in tokens

wordseg.evaluate.read_data(text, separator=<wordseg.separator.Separator object>)

Load text data for evaluation

Parameters

text (list of str) – The list of utterances to read for the evaluation.
separator (Separator, optional) – Separators to tokenize the text with, defaults to space separated words.

Returns

(words, positions, lexicon) – where words are the input utterances with word separators removed, positions stores the start/stop index of each word for each utterance, and lexicon is the list of words.

Return type

three lists
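The shape of the three returned lists can be sketched for the default case of space separated words. `read_utterances` is a hypothetical stand-in written for this illustration. In particular, returning the lexicon as a sorted list of distinct words and storing positions as (start, stop) tuples are assumptions, not documented behavior.

```python
def read_utterances(text):
    """Return (words, positions, lexicon) for space separated utterances."""
    words, positions, seen = [], [], set()
    for utterance in text:
        tokens = utterance.split()
        seen.update(tokens)
        spans, start = [], 0
        for token in tokens:
            stop = start + len(token)
            spans.append((start, stop))  # indices in the joined utterance
            start = stop
        words.append(''.join(tokens))    # utterance with separators removed
        positions.append(spans)
    return words, positions, sorted(seen)  # assumed: distinct, sorted words


words, positions, lexicon = read_utterances(["the dog", "the cat"])
print(words, positions, lexicon)
```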

wordseg.evaluate.summary(text, gold)

Computes the summary of segmentation errors

This function is a simple wrapper around SegmentationSummary

Parameters

text (list of str) – The list of utterances for the segmented text (to be evaluated)
gold (list of str) – The list of utterances for the gold text

Returns

summary – A dictionary with the complete summary in the following entries: ‘over’, ‘under’, ‘mis’, ‘correct’.

Return type

dict

Raises

ValueError – If text and gold do not match, or something went wrong during the summary computation.