Release notes

This section details the changes over wordseg releases. This project started as a complete rewrite of the word segmentation pipeline in the CDSwordSeg project.


Version numbers follow the semantic versioning principles.

not yet released

  • Simplified installation procedure using the standard script.

  • New documentation URL:

  • In documentation: added citation information (use DOI from zenodo).

  • Added a --train-file/-T option in wordseg-ag, wordseg-dibs, wordseg-puddle and wordseg-tp allowing to train the model on a text different from the one being segmented.

  • In wordseg-dibs

    • Due to the new option --train-file/-T, the short name for the option --threshold has been renamed from -T to -U.

    • The training text is now optional and, if not provided by the user, the input text will be used both for training and testing (to be consistent with other algorithms).

  • In wordseg-ag:

    • Improved error message: when tprob == 0 on double precision, indicates to try recompile with quadruple precision instead.

    • Optionally save the used grammar to a file using the option --save-grammar-to <grammar-file>.

    • From command-line the short name for --tstart changed from -T to -U. The -T short name is now used for the --train-file option.

    • Removed warnings with regular expressions on python-3.7.

  • In wordseg-dpseg, removed a warning with regular expressions on python-3.7.

  • in wordseg-puddle, added an option --by-frequency to choose words based on their frequencies.

  • in cluster tools:

    • improved duration and error reporting.

    • adaptations to new --train-file options.


  • New evaluation metrics in wordseg-eval:

    • adjusted rand index:

      This requires the prepared text to be computed (whereas the other metrics only rely on segmented and gold texts), so it is implemented as an option --rand-index <prep-file> in wordseg-eval.

      An easiest implementation would have been to change the specifications of wordseg-eval to take the prepared text instead of the gold one, but we prefered the optional --rand-index for backward compatibility.

    • segmentation errors summary:

      Detailed report of segmentation errors, may be undersegmentation, oversegmentation or missegmentation. Implemented as the option --summary <json-file> in wordseg-eval.

  • In wordseg-dibs, renamed baseline algorithm to gold, so as to avoid confusion with wordseg-baseline. See #48.

  • tools/ renamed tools/ and new tools/wordseg-{slurm, bash}.sh to submit jobs on SLURM based clusters and locally using bash.

  • Bugfix in tools/wordseg-{sge, slurm, bash}.sh: wordseg-dibs is correctly handled (was a problem with the train file). Those tools now included full pipeline, including statistics and text preparation.


  • Added tools/, a script to schedule a list of segmentation jobs to a cluster running Sun Grid Engine and the qsub scheduler.

  • Added example phonological rules and updated contributong guide in documentation.

  • In wordseg-prep ignore empty lines in both gold and segmented texts.

  • In wordseg-syll the syllabification is improved: syllabification of words with no vowel, better error messages (see #35, #36).

  • In wordseg-tp add of the mutual information dependancy measure. In the bash command, the argument --probability {forward,backward} is replaced by --dependency {ftp,btp,mi} (maintained for backward compatibility). See #40.

  • In wordseg-ag:

    • niteration is now 2000 by default (was 100),

    • improved log of iterations with -vv,

    • refactored postprocessing code:

      • parallelized

      • constant memory usage (was linear wrt niterations*nutts)

      • tree to words conversion in C++ instead of Python

      • temporary parses file is now gziped (gains a factor of 20 in disk usage)

      • new –temdir option to specify another path for tempfile (default is /tmp)

      • detection of incomplete parses (if any issues a warning)

      • better comments in code, more unit tests


  • Improved documentation and algorithms description.

  • Docker image now uses python-3.6 from anaconda,

  • New tests to ensure replication of scores from CDSWordSeg to wordseg for puddle, tp, dibs and dpseg.

  • In wordseg-ag the <grammar> and <segment-category> parameters are now optional. When omitted a default colloc0 grammar is generated from the input text.

  • In wordseg-dpseg

    • fixed forwarding of some arguments from Python to C++,

    • implementation of dpseg bugfix when single char on first line of a fold,

    • use the original random number generator to replicate exactly CDSWordSeg.

    • fixed default ngram to bigram (was already bigram but documented as unigram).

  • In wordseg-dibs

    • fixed bug when loading train text at syllable level (new –unit* option)

    • safer use of train text (ensure there are word separators in it, ignore empty lines).

  • In wordseg-eval

    • when called from bash, the scores are now displayed in a fixed order. New test to ensure bash and python calls to wordseg lead to identical results. See #31.

    • distinction between edge/no edge in boundary scoring. See #21.

  • In wordseg-stats the scores are now displayed in a fixed order.

  • In wordseg-syll

    • the --tolerant option allows to ignore utterances where the syllabification failed (the default is to exit the program on the first error). See #36.


  • Documentation improved, installation guide for working with docker.

  • Removed dependancies to numpy and pandas.

  • Tests are now done on a subpart of the CHILDES corpus (was Buckeye, under restrictive licence).

  • Simplified output in wordseg-stats, removed redundancy, renamed ‘uniques’ to ‘hapaxes’. See #18.

  • Bugfix in wordseg-tp -t relative when the last utterance of a text is made of a single phone. See #25.

  • Bugfix in wordseg-dpseg when loading parameters from a configuration file

  • In wordseg-ag:

    • Bugfix when compiling adaptor grammar on MacOS (removed pstream.h from AG). See #15.

    • Replaced std::tr1::unordered_{map,set} by std::unordered_{map,set}, removed useless code (custom allocator).


  • Features

    • New methods for basic statistics and normalized segmentation entropy in wordseg-stats

    • New forward/backward option in wordseg-tp.

    • New command wordseg-baseline that produces a random segmentation given the probability of a word boundary. If an oracle text is provided, the probability of word boundary is estimated from that text.

    • New command wordseg-syll estimates syllable boundaries on a text using the maximal onset principle. Exemples of onsets and vowels files for syllabifications are given in the directory data/syllabification.

    • Support for punctuation in input of wordseg-prep with the --punctuation option (#10).

    • For citation purposes a DOI is now automatically attached to each wordseg release.

    • Improved documentation.

  • Bugfixes

    • wordseg-dibs has been debugged (#16).

    • wordseg-ag has been debugged.

    • The following characters are now forbidden in separators, they interfer with regular expression matching:

    • Type scoring is now correctly implemented in wordseg-eval (#10, #14).


  • Implementation of Adaptor Grammar as wordseg-ag,

  • Installation now relies on cmake (was python setuptools),

  • Improvements in tests and documentation,

  • Various bugfixes.


  • First public release, adaptation from Alex Cristia’s CDSWordSeg.

  • Four algorithms (tp, puddle, dpseg, dibs).

  • Segmentation prepocessing and evaluation.

  • Unit tests and documentation.

  • On the original implementation, we applied the following changes:

    • conversion to C++11 standard,

    • replaced tr1/unsorted_map and mt19937 by the standard library,

    • code cleanup, removed useless functions and code,

    • complete rewrite of the build process (Makefile, link on boost).