Bayesian Segmenter

Note

wordseg.algos.dpseg in Python, wordseg-dpseg in bash.

Bayesian word segmentation algorithm

See Goldwater, Griffiths, Johnson (2010) and Phillips & Pearl (2014).

  1. Uses a hierarchical Pitman-Yor (PY) process rather than a hierarchical Dirichlet process (HDP) model. The HDP model can be recovered by setting the PY parameters appropriately: set --a1 and --a2 to 0; --b1 and --b2 then correspond to the HDP parameters.

  2. Implements several different estimation procedures, including the original Gibbs sampler (flip sampler), a sentence-based Gibbs sampler that uses dynamic programming (tree sampler), and a similar dynamic programming algorithm that chooses the best segmentation of each utterance rather than a sample. The latter two algorithms can be run in either batch mode or online mode. In online mode, they can also be set to “forget” parts of the previous analysis. This is described in more detail below.

  3. Supports separate training and testing files. If you provide an evaluation file, the program first runs its full training procedure (i.e., using whichever algorithm for however many iterations, annealing, etc.). It then freezes the lexicon in whatever state it is in and makes a single pass through the evaluation data, segmenting each sentence according to the probabilities computed from the frozen lexicon. No new words or counts are added to the lexicon during evaluation. Evaluation can be set either to sample segmentations or to choose the maximum-probability segmentation for each utterance. Scores are printed at the end of the complete run, based on the evaluation data if provided and on the training data otherwise.
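
The relation between the PY and HDP settings in point 1 can be illustrated with the standard Chinese restaurant process seating rule. The sketch below is not dpseg's code: the function name is hypothetical, and it assumes that the a parameters play the role of the Pitman-Yor discount and the b parameters that of the concentration. With a discount of 0, the probabilities reduce to those of a Dirichlet process.

```python
def pitman_yor_predictive(counts, a, b):
    """Seating probabilities for the next customer in a Pitman-Yor CRP.

    counts: number of customers at each existing table
    a: discount, 0 <= a < 1 (assumed to correspond to --a1/--a2)
    b: concentration, b > -a (assumed to correspond to --b1/--b2)
    Returns (probabilities of joining each existing table,
             probability of opening a new table).
    """
    n = sum(counts)
    k = len(counts)
    denom = n + b
    existing = [(c - a) / denom for c in counts]
    new = (b + a * k) / denom
    return existing, new


counts = [3, 1]

# Pitman-Yor with a non-zero discount
py_existing, py_new = pitman_yor_predictive(counts, a=0.5, b=1.0)

# Discount 0: the Dirichlet process rule, n_k / (n + b) and b / (n + b)
dp_existing, dp_new = pitman_yor_predictive(counts, a=0.0, b=1.0)
```

In both cases the probabilities sum to 1. The discount moves probability mass from the existing tables to the new one, which is what gives the PY process its heavier tail.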

class wordseg.algos.dpseg.UnicodeGenerator(start=3001)[source]

Bases: object

Iterates on unicode characters

Excludes the space characters. Used to build a (unit -> char) mapping. The actual dpseg implementation requires that all units (phones or syllables) are encoded as a unicode char.

Parameters

start (int) – The first unicode character to be generated

Notes

This class is a simplified Perl-to-Python transcription of the original script create-unicode-dict-flexible.pl.

Examples

Basic usage: mapping a list of strings to Unicode characters.

>>> units = ['unit1', 'unit2', 'unit3']
>>> unicode_gen = UnicodeGenerator()
>>> unicode_mapping = {unit: unicode_gen() for unit in units}
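
To make the (unit -> char) encoding concrete, here is a minimal self-contained sketch. `SketchUnicodeGenerator` is a hypothetical stand-in, not the wordseg class: it assumes the generator simply returns successive code points from `start` and skips whitespace. The example encodes an utterance of phones as one character per unit and then decodes it back.

```python
class SketchUnicodeGenerator:
    """Hypothetical stand-in for UnicodeGenerator (assumed behavior)."""

    def __init__(self, start=3001):
        self.index = start

    def __call__(self):
        # Return the next code point as a character, skipping whitespace
        while True:
            char = chr(self.index)
            self.index += 1
            if not char.isspace():
                return char


utterance = ['dh', 'ax', 'd', 'ao', 'g']  # illustrative phone units

# Build the unit -> char mapping, one fresh character per distinct unit
gen = SketchUnicodeGenerator()
mapping = {}
for unit in utterance:
    if unit not in mapping:
        mapping[unit] = gen()

# Encode: each unit becomes exactly one character
encoded = ''.join(mapping[unit] for unit in utterance)

# Decode: invert the mapping to recover the original units
reverse = {char: unit for unit, char in mapping.items()}
decoded = [reverse[char] for char in encoded]
```

Because every unit maps to a single character, the encoded string has as many characters as the utterance has units, and decoding is a plain character-by-character lookup.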

wordseg.algos.dpseg.segment(text, nfolds=5, njobs=1, args='--ngram 1 --a1 0 --b1 1', log=<RootLogger root (WARNING)>, binary='/home/gitlab-runner/.conda/envs/wordseg-ci/lib/python3.7/site-packages/wordseg-0.8-py3.7-linux-x86_64.egg/bin/dpseg')[source]

Run the ‘dpseg’ binary on nfolds folds