Adaptor Grammar

Note

wordseg.algos.ag in Python, wordseg-ag in bash.

Learn parse trees from a grammar (Adaptor Grammar)

This algorithm adds word boundaries after adapting a grammar.

wordseg.algos.ag.DEFAULT_ARGS = '-E -d 0 -a 0.0001 -b 10000 -e 1 -f 1 -g 100 -h 0.01 -R -1 -P -x 10'

Default Adaptor Grammar parameters

class wordseg.algos.ag.ParseCounter(nutts)[source]

Bases: object

Count the most frequent utterances in a sequence of parses

most_common()[source]
update(parse)[source]
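
The counting idea can be illustrated with a minimal sketch. This is not the wordseg implementation: it assumes each parse is a list holding one segmented utterance per position, and that the most frequent segmentation is kept at each position. The class name MajorityCounter is hypothetical.

```python
from collections import Counter


class MajorityCounter:
    """Hypothetical stand-in illustrating per-utterance vote counting."""

    def __init__(self, nutts):
        # One counter per utterance position.
        self.counters = [Counter() for _ in range(nutts)]

    def update(self, parse):
        # Assumption: a parse holds exactly one segmentation per utterance.
        if len(parse) != len(self.counters):
            raise ValueError('parse length does not match nutts')
        for counter, utt in zip(self.counters, parse):
            counter[utt] += 1

    def most_common(self):
        # Keep the most frequent segmentation seen at each position.
        return [c.most_common(1)[0][0] for c in self.counters]


counter = MajorityCounter(2)
for parse in (['a b', 'cd'], ['ab', 'cd'], ['a b', 'c d']):
    counter.update(parse)
result = counter.most_common()
```

Here the first utterance is segmented as 'a b' in two of the three parses and the second as 'cd' in two of three, so those are the segmentations retained.
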
wordseg.algos.ag.build_colloc0_grammar(phones)[source]

Builds a Colloc0 grammar from a list of phones

Parameters

phones (list of str) – The list of existing phones in the grammar

Returns

grammar – The generated grammar as a string. To save it to disk, write it to a file, e.g. open(file, ‘w’).write(grammar).

Return type

str

wordseg.algos.ag.check_grammar(grammar_file, category)[source]

Return True if the grammar is valid for that category

Raise a RuntimeError if the category is not a parent in the grammar, or if the grammar file is not found or not readable.

wordseg.algos.ag.get_grammar_files()[source]

Returns a list of example grammar files bundled with wordseg

Grammar files have the .lt extension and are stored in the directory wordseg/data/ag.

Raises
  • RuntimeError – If the configuration directory is not found or if it contains no grammar files.

  • pkg_resources.DistributionNotFound – If ‘wordseg’ is not correctly installed

wordseg.algos.ag.is_parent_in_grammar(grammar_file, parent)[source]

Returns True if the parent is in the grammar

Parents are the first word of each line in the grammar file.
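
The rule stated above (a parent is the first word of a line) can be sketched as follows. The helper name parents_of and the sample rules are assumptions made for illustration; the actual syntax of the bundled .lt files may differ.

```python
def parents_of(grammar_lines):
    """Return the set of first words of the non-empty lines."""
    parents = set()
    for line in grammar_lines:
        words = line.split()
        if words:  # skip blank lines
            parents.add(words[0])
    return parents


# Illustrative rules only, not copied from a real grammar file.
rules = [
    'Colloc0 --> Phonemes',
    'Phonemes --> Phoneme',
    '',
    'Phoneme --> a',
]
parents = parents_of(rules)
has_colloc0 = 'Colloc0' in parents
has_word = 'Word' in parents
```

With these sample rules, 'Colloc0' is a parent while 'Word' is not, which is the kind of check that decides whether a category can be used for segmentation.
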

wordseg.algos.ag.postprocess(parse_counter, output_file, ignore_first_parses, log)[source]
wordseg.algos.ag.segment(text, train_text=None, grammar_file=None, category='Colloc0', args='-E -d 0 -a 0.0001 -b 10000 -e 1 -f 1 -g 100 -h 0.01 -R -1 -P -x 10', save_grammar_to=None, ignore_first_parses=0, nruns=8, njobs=1, tempdir='/tmp', log=<RootLogger root (WARNING)>)[source]

Segment a text using the Adaptor Grammar algorithm

The algorithm is run nruns times (8 by default), using up to njobs parallel subprocesses, and the results are collapsed. Each run uses a different random seed.

Parameters
  • text (sequence of str) – The list of utterances to segment using the model learned from train_text.

  • train_text (sequence, optional) – The list of utterances to train the model on. If None, the model is trained directly on text.

  • grammar_file (str, optional) – The path to the grammar file to use for segmentation. If not specified, a Colloc0 grammar is generated from the input text.

  • category (str, optional) – The category to segment the text with. It must be an existing parent in the grammar (i.e. the category must be present in the left column of the grammar file). Defaults to ‘Colloc0’.

  • args (str, optional) – Command line options to run the AG program with. Use ‘wordseg-ag --help’ for a complete list of available options.

  • save_grammar_to (str, optional) – If defined, the output file in which to save the grammar used for segmentation. This is useful to keep track of the grammar when it is auto-generated (i.e. when grammar_file is None).

  • ignore_first_parses (int, optional) – Ignore the first parses from the algorithm output. If negative, keep only the last ones (e.g. -1 keeps only the last one, -2 the last two).

  • nruns (int, optional) – The number of runs to execute; the parses they output are then collapsed. The default of 8 comes from the original recipe provided by M. Johnson.

  • njobs (int, optional) – The number of parallel subprocesses to run

  • tempdir (str, optional) – A directory where to store temporary data

  • log (logging.Logger, optional) – A logger where to send log messages
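
The sign convention of ignore_first_parses described in the list above maps naturally onto a Python slice. The following is a minimal sketch of that convention only; select_parses is a hypothetical helper, not a function of wordseg, and the parses are placeholder strings.

```python
def select_parses(parses, ignore_first_parses=0):
    """Hypothetical helper: drop the first parses, or keep only the last."""
    # A plain slice covers all three cases: 0 keeps everything, a positive
    # n drops the first n parses, a negative n keeps the last abs(n).
    return parses[ignore_first_parses:]


parses = ['p0', 'p1', 'p2', 'p3']
keep_all = select_parses(parses)
drop_first = select_parses(parses, 1)
last_one = select_parses(parses, -1)
last_two = select_parses(parses, -2)
```
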

Returns

segmented – The test utterances with estimated word boundaries

Return type

list

Raises

RuntimeError – If one of the AG subprocesses fails or returns an error code, or if the category is not found in the grammar.

wordseg.algos.ag.yield_parses(lines, ignore_firsts=0)[source]

Yields parse trees, ignoring the first ones

From the raw output of AG, this function yields the successive parse trees, ignoring the first ones.

Parameters
  • lines (sequence) – The parse trees as output by the AG program. The trees are separated by an empty line.

  • ignore_firsts (int, optional) – The first trees are computed during the first iterations of AG and are usually less accurate. They can be ignored with this argument (defaults to 0).

Yields

tree (list) – The list of lines composing a full parse tree of the input text. Each line is an utterance in PTB format.
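
Splitting blank-line separated output into trees can be sketched as below. This is an assumption-laden illustration, not the wordseg code: split_trees is a hypothetical name, the input lines are placeholders rather than real PTB-format parses, and only non-negative values of ignore_firsts are handled.

```python
def split_trees(lines, ignore_firsts=0):
    """Hypothetical sketch: yield trees separated by empty lines."""
    count = 0  # number of complete trees seen so far
    tree = []
    for line in lines:
        line = line.strip()
        if line:
            tree.append(line)
        elif tree:
            # An empty line closes the current tree.
            if count >= ignore_firsts:
                yield tree
            count += 1
            tree = []
    # The last tree may not be followed by an empty line.
    if tree and count >= ignore_firsts:
        yield tree


# Placeholder lines: two trees of two utterances each.
lines = ['t1 u1', 't1 u2', '', 't2 u1', 't2 u2']
all_trees = list(split_trees(lines))
later_trees = list(split_trees(lines, ignore_firsts=1))
```

Writing it as a generator keeps only one tree in memory at a time, which matters when the raw output holds many iterations over a large corpus.
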