Adaptor Grammar
Note
wordseg.algos.ag in Python, wordseg-ag in bash.
Learn parse trees from a grammar (Adaptor Grammar)
This algorithm adds word boundaries after adapting a grammar.
-
wordseg.algos.ag.DEFAULT_ARGS = '-E -d 0 -a 0.0001 -b 10000 -e 1 -f 1 -g 100 -h 0.01 -R -1 -P -x 10'
Default Adaptor Grammar parameters
-
class wordseg.algos.ag.ParseCounter(nutts)
Bases: object
Count the most frequent utterances in a sequence of parses
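The idea of keeping the most frequent utterances across several parses can be sketched as follows. This is a minimal illustration and not the actual ParseCounter implementation: the function name most_frequent_per_utterance and the representation of a parse as a list of segmented strings are assumptions made for the example.

```python
from collections import Counter


def most_frequent_per_utterance(parses, nutts):
    """Return, for each utterance index, its most frequent segmentation.

    `parses` is an iterable of parses; each parse is assumed to be a list
    of `nutts` segmented utterances (one string per utterance).
    """
    # One counter per utterance position (hypothetical layout).
    counters = [Counter() for _ in range(nutts)]
    for parse in parses:
        if len(parse) != nutts:
            raise ValueError(f"expected {nutts} utterances, got {len(parse)}")
        for counter, utterance in zip(counters, parse):
            counter[utterance] += 1
    # most_common(1) returns a list holding one (utterance, count) pair.
    return [counter.most_common(1)[0][0] for counter in counters]


# Three toy parses of the same two utterances.
parses = [
    ["the dog", "a cat"],
    ["thedog", "a cat"],
    ["the dog", "acat"],
]
collapsed = most_frequent_per_utterance(parses, nutts=2)
print(collapsed)
```

Each position is counted independently, so the collapsed result may combine segmentations that come from different parses.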
-
wordseg.algos.ag.build_colloc0_grammar(phones)
Builds a Colloc0 grammar from a list of phones
- Parameters
phones (list of str) – The list of existing phones in the grammar
- Returns
grammar – The generated grammar as a string. To save it to disk, write it to a file opened in write mode, e.g. open(file, 'w').write(grammar).
- Return type
str
-
wordseg.algos.ag.check_grammar(grammar_file, category)
Return True if the grammar is valid for that category
Raise a RuntimeError if the category is not a parent in the grammar, or if the grammar file is not found or not readable.
-
wordseg.algos.ag.get_grammar_files()
Returns a list of example grammar files bundled with wordseg
Grammar files have the .lt extension and are stored in the directory wordseg/data/ag.
- Raises
RuntimeError – If the configuration directory is not found or if there are no grammar files in it.
pkg_resources.DistributionNotFound – If 'wordseg' is not correctly installed
-
wordseg.algos.ag.is_parent_in_grammar(grammar_file, parent)
Returns True if the parent is in the grammar
Parents are the first word of each line in the grammar file.
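The "first word of each line" rule can be sketched with the standard library alone. The helper names parents_in_grammar and check_category are hypothetical, and the toy grammar below (including its `-->` rule separator) is an assumed illustration of a Colloc0-style grammar, not a file shipped with wordseg.

```python
def parents_in_grammar(lines):
    """Return the set of parents: the first word of each non-empty line."""
    parents = set()
    for line in lines:
        words = line.split()
        if words:  # skip blank lines
            parents.add(words[0])
    return parents


def check_category(lines, category):
    """Return True if `category` is a parent, else raise RuntimeError."""
    if category not in parents_in_grammar(lines):
        raise RuntimeError(
            f"category {category!r} is not a parent in the grammar")
    return True


# Toy grammar, assumed format: "<parent> --> <children...>" on each line.
grammar = """\
Sentence --> Colloc0s
Colloc0s --> Colloc0
Colloc0s --> Colloc0s Colloc0
Colloc0 --> Phonemes
Phonemes --> Phoneme
Phonemes --> Phonemes Phoneme
Phoneme --> a
Phoneme --> b
""".splitlines()

print(sorted(parents_in_grammar(grammar)))
print(check_category(grammar, "Colloc0"))
```

Unlike the functions documented here, this sketch works on lines already held in memory rather than on a grammar file path.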
-
wordseg.algos.ag.segment(text, train_text=None, grammar_file=None, category='Colloc0', args='-E -d 0 -a 0.0001 -b 10000 -e 1 -f 1 -g 100 -h 0.01 -R -1 -P -x 10', save_grammar_to=None, ignore_first_parses=0, nruns=8, njobs=1, tempdir='/tmp', log=<RootLogger root (WARNING)>)
Segment a text using the Adaptor Grammar algorithm
The algorithm is run nruns times (8 by default), using up to njobs parallel subprocesses, and the results are collapsed. Each run uses a different random seed.
- Parameters
text (sequence of str) – The list of utterances to segment using the model learned from train_text.
train_text (sequence, optional) – The list of utterances to train the model on. If None, the model is trained directly on text.
grammar_file (str, optional) – The path to the grammar file to use for segmentation. If not specified, a Colloc0 grammar is generated from the input text.
category (str, optional) – The category to segment the text with. It must be an existing parent in the grammar (i.e. it must be present in the left column of the grammar file). Defaults to 'Colloc0'.
args (str, optional) – Command line options to run the AG program with. Use 'wordseg-ag --help' for a complete list of available options.
save_grammar_to (str, optional) – If defined, the output file in which to save the grammar used for segmentation. This is useful to keep track of the grammar when it is auto-generated (i.e. when grammar_file is None).
ignore_first_parses (int, optional) – Ignore the first parses from the algorithm output. If negative, keep only the last ones (e.g. -1 keeps only the last one, -2 the last two).
nruns (int, optional) – The number of runs to execute, whose output parses are then collapsed. The default of 8 comes from the original recipe provided by M. Johnson.
njobs (int, optional) – The number of parallel subprocesses to run
tempdir (str, optional) – A directory where to store temporary data
log (logging.Logger, optional) – A logger where to send log messages
- Returns
segmented – The test utterances with estimated word boundaries
- Return type
list
- Raises
RuntimeError – If one of the AG subprocesses fails or returns an error code, or if the category is not found in the grammar.
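A call to segment could look like the sketch below, which relies only on the signature documented above. The wrapper name segment_with_ag is hypothetical, and the expected input (one utterance per string, with phones separated by spaces) is an assumption about how the text has been prepared.

```python
def segment_with_ag(utterances, nruns=8, njobs=1):
    """Segment `utterances` with an auto-generated Colloc0 grammar.

    Hypothetical convenience wrapper: it only forwards to the documented
    wordseg.algos.ag.segment function with its default grammar settings.
    """
    # Imported lazily so this sketch can be loaded without wordseg installed.
    from wordseg.algos import ag

    # grammar_file is left to None, so a Colloc0 grammar is generated
    # from the input text, as described for the grammar_file parameter.
    return list(ag.segment(utterances, nruns=nruns, njobs=njobs))
```

Calling the wrapper, for instance as segment_with_ag(["hh ax l ow", "w er l d"], njobs=4), requires a working wordseg installation. A RuntimeError from the AG subprocesses propagates to the caller unchanged.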
-
wordseg.algos.ag.yield_parses(lines, ignore_firsts=0)
Yields parse trees, ignoring the first ones
From the raw output of AG, this function yields the successive trees, ignoring the first ones.
- Parameters
lines (sequence) – The parse trees as output by the AG program; the trees are separated by an empty line.
ignore_firsts (int, optional) – The first trees are computed during the first iterations of AG and are usually less accurate. They can be ignored with this argument (defaults to 0).
- Yields
tree (list) – The list of lines composing a full parse tree of the input text. Each line is an utterance in PTB (Penn Treebank) format.
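The grouping of lines into trees can be sketched as below. This is an illustration under stated assumptions and not the wordseg implementation: the name split_parses is hypothetical, the bracketed lines are toy data, and the handling of negative values follows the description of ignore_first_parses given for segment, which is assumed to apply here as well.

```python
def split_parses(lines, ignore_firsts=0):
    """Yield parse trees (lists of lines) separated by empty lines."""
    trees, current = [], []
    for line in lines:
        line = line.strip()
        if line:
            current.append(line)
        elif current:
            # An empty line closes the tree being built.
            trees.append(current)
            current = []
    if current:  # the last tree may lack a trailing empty line
        trees.append(current)

    # A single slice covers both cases: trees[n:] skips the first n trees
    # when n >= 0, and trees[-k:] keeps only the last k trees.
    yield from trees[ignore_firsts:]


# Three toy trees of two utterances each (bracketed, PTB-like lines).
lines = [
    "(S (W a))", "(S (W b))", "",
    "(S (W a))", "(S (W b c))", "",
    "(S (W a b))", "(S (W b c))", "",
]
print(list(split_parses(lines, ignore_firsts=1)))
print(list(split_parses(lines, ignore_firsts=-1)))
```

The sketch collects all trees in memory so that negative values can be honoured. A purely streaming version would be possible when only non-negative values need to be supported.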