Release notes
This section details the changes over wordseg releases. This project started as a complete rewrite of the word segmentation pipeline in the CDSwordSeg project.
Note
Version numbers follow the semantic versioning principles.
Not yet released
Simplified installation procedure using the standard setup.py script.
New documentation URL: https://docs.cognitive-ml.fr/wordseg.
In documentation: added citation information (use DOI from zenodo).
Added a --train-file/-T option in wordseg-ag, wordseg-dibs, wordseg-puddle and wordseg-tp, allowing the model to be trained on a text different from the one being segmented.
In wordseg-dibs:
Due to the new option --train-file/-T, the short name for the option --threshold has been renamed from -T to -U.
The training text is now optional and, if not provided by the user, the input text will be used both for training and testing (to be consistent with other algorithms).
In wordseg-ag:
Improved error message: when tprob == 0 in double precision, the message now suggests recompiling with quadruple precision.
Optionally save the used grammar to a file using the option --save-grammar-to <grammar-file>.
From the command line, the short name for --tstart changed from -T to -U. The -T short name is now used for the --train-file option.
Removed warnings with regular expressions on python-3.7.
In wordseg-dpseg, removed a warning with regular expressions on python-3.7.
In wordseg-puddle, added an option --by-frequency to choose words based on their frequencies.
In cluster tools:
improved duration and error reporting.
adaptations to new --train-file options.
wordseg-0.7.1
New evaluation metrics in wordseg-eval:
adjusted rand index:
This requires the prepared text to be computed (whereas the other metrics only rely on segmented and gold texts), so it is implemented as an option --rand-index <prep-file> in wordseg-eval.
An easier implementation would have been to change the specifications of wordseg-eval to take the prepared text instead of the gold one, but we preferred the optional --rand-index for backward compatibility.
segmentation errors summary:
Detailed report of segmentation errors, which may be undersegmentation, oversegmentation or missegmentation. Implemented as the option --summary <json-file> in wordseg-eval.
In wordseg-dibs, renamed baseline algorithm to gold, so as to avoid confusion with wordseg-baseline. See #48.
tools/wordseg-qsub.sh renamed to tools/wordseg-sge.sh, and new tools/wordseg-{slurm, bash}.sh to submit jobs on SLURM-based clusters and locally using bash.
Bugfix in tools/wordseg-{sge, slurm, bash}.sh: wordseg-dibs is now correctly handled (was a problem with the train file). These tools now include the full pipeline, including statistics and text preparation.
wordseg-0.7
Added tools/wordseg-qsub.sh, a script to schedule a list of segmentation jobs to a cluster running Sun Grid Engine and the qsub scheduler.
Added example phonological rules and updated contributing guide in documentation.
In wordseg-prep ignore empty lines in both gold and segmented texts.
In wordseg-syll the syllabification is improved: syllabification of words with no vowel, better error messages (see #35, #36).
In wordseg-tp, added the mutual information dependency measure. In the bash command, the argument --probability {forward,backward} is replaced by --dependency {ftp,btp,mi} (the former is maintained for backward compatibility). See #40.
In wordseg-ag:
niteration is now 2000 by default (was 100),
improved log of iterations with -vv,
refactored postprocessing code:
parallelized
constant memory usage (was linear wrt niterations*nutts)
tree to words conversion in C++ instead of Python
temporary parses file is now gzipped (gains a factor of 20 in disk usage)
new --tempdir option to specify another path for temporary files (default is /tmp)
detection of incomplete parses (a warning is issued if any are found)
better comments in code, more unit tests
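The three dependency measures named above for wordseg-tp (ftp, btp, mi) can be illustrated with a small sketch. It uses the textbook definitions of forward transitional probability, backward transitional probability and pointwise mutual information over adjacent units. The function name dependency_scores and the normalization used for mi are assumptions made for illustration and may differ from what wordseg computes internally.

```python
import math
from collections import Counter


def dependency_scores(utterances, measure="ftp"):
    """Score each adjacent pair of units (hypothetical helper, not wordseg's API).

    Each utterance is a string of units separated by spaces.
    """
    unigrams = Counter()
    bigrams = Counter()
    for utterance in utterances:
        units = utterance.split()
        unigrams.update(units)
        bigrams.update(zip(units, units[1:]))

    n_unigrams = sum(unigrams.values())
    n_bigrams = sum(bigrams.values())

    scores = {}
    for (a, b), count in bigrams.items():
        if measure == "ftp":
            # forward transitional probability: P(b | a)
            scores[a, b] = count / unigrams[a]
        elif measure == "btp":
            # backward transitional probability: P(a | b)
            scores[a, b] = count / unigrams[b]
        elif measure == "mi":
            # pointwise mutual information: log2 P(a, b) / (P(a) P(b))
            p_ab = count / n_bigrams
            p_a = unigrams[a] / n_unigrams
            p_b = unigrams[b] / n_unigrams
            scores[a, b] = math.log2(p_ab / (p_a * p_b))
        else:
            raise ValueError("measure must be ftp, btp or mi")
    return scores
```

A low score between two units suggests a word boundary between them, whichever measure is chosen; only the way the dependency is quantified changes.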
wordseg-0.6.2
Improved documentation and algorithms description.
Docker image now uses python-3.6 from anaconda,
New tests to ensure replication of scores from CDSWordSeg to wordseg for puddle, tp, dibs and dpseg.
In wordseg-ag the <grammar> and <segment-category> parameters are now optional. When omitted, a default colloc0 grammar is generated from the input text.
In wordseg-dpseg
fixed forwarding of some arguments from Python to C++,
implementation of dpseg bugfix when single char on first line of a fold,
use the original random number generator to replicate exactly CDSWordSeg.
fixed default ngram to bigram (was already bigram but documented as unigram).
In wordseg-dibs
fixed bug when loading train text at syllable level (new --unit option)
safer use of train text (ensure there are word separators in it, ignore empty lines).
In wordseg-eval
when called from bash, the scores are now displayed in a fixed order. New test to ensure bash and python calls to wordseg lead to identical results. See #31.
distinction between edge/no edge in boundary scoring. See #21.
In wordseg-stats the scores are now displayed in a fixed order.
In wordseg-syll
the --tolerant option allows utterances where the syllabification failed to be ignored (the default is to exit the program on the first error). See #36.
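The difference between the default and tolerant behaviours of wordseg-syll can be sketched as follows. This is a simplified illustration and not wordseg's implementation: the function names syllabify_word and syllabify_text are hypothetical, each utterance is treated as a single word given as a list of phones, and the maximal onset rule is reduced to its core (the longest legal onset before each vowel is attached to the following syllable).

```python
def syllabify_word(phones, vowels, onsets):
    """Split a list of phones into syllables (hypothetical helper).

    Consonants between two vowels are divided so that the following
    syllable receives the longest cluster listed in `onsets`.
    """
    vowel_positions = [i for i, p in enumerate(phones) if p in vowels]
    if not vowel_positions:
        raise ValueError("no vowel in word: " + "".join(phones))

    starts = [0]  # index where each syllable begins
    for prev, nxt in zip(vowel_positions, vowel_positions[1:]):
        cluster = phones[prev + 1:nxt]
        # try the whole cluster first, then shorter and shorter suffixes;
        # the empty suffix (k == len(cluster)) always succeeds
        for k in range(len(cluster) + 1):
            if k == len(cluster) or "".join(cluster[k:]) in onsets:
                starts.append(prev + 1 + k)
                break

    ends = starts[1:] + [len(phones)]
    return [phones[s:e] for s, e in zip(starts, ends)]


def syllabify_text(utterances, vowels, onsets, tolerant=False):
    """Syllabify every utterance; skip failures only when tolerant is True."""
    result = []
    for phones in utterances:
        try:
            result.append(syllabify_word(phones, vowels, onsets))
        except ValueError:
            if not tolerant:
                raise  # default behaviour: stop on the first error
    return result
```

With vowels {a, i} and onsets {t, r, tr, s}, the phones of "astra" split into "as" and "tra", because "str" is not a listed onset but "tr" is.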
wordseg-0.6.1
Documentation improved, installation guide for working with docker.
Removed dependencies on numpy and pandas.
Tests are now done on a subpart of the CHILDES corpus (was Buckeye, under restrictive licence).
Simplified output in wordseg-stats, removed redundancy, renamed ‘uniques’ to ‘hapaxes’. See #18.
Bugfix in wordseg-tp -t relative when the last utterance of a text is made of a single phone. See #25.
Bugfix in wordseg-dpseg when loading parameters from a configuration file.
In wordseg-ag:
Bugfix when compiling adaptor grammar on MacOS (removed pstream.h from AG). See #15.
Replaced std::tr1::unordered_{map,set} by std::unordered_{map,set}, removed useless code (custom allocator).
wordseg-0.6
Features
New methods for basic statistics and normalized segmentation entropy in wordseg-stats
New forward/backward option in wordseg-tp.
New command wordseg-baseline that produces a random segmentation given the probability of a word boundary. If an oracle text is provided, the probability of word boundary is estimated from that text.
New command wordseg-syll estimates syllable boundaries on a text using the maximal onset principle. Examples of onsets and vowels files for syllabification are given in the directory data/syllabification.
Support for punctuation in input of wordseg-prep with the --punctuation option (#10).
For citation purposes a DOI is now automatically attached to each wordseg release.
Improved documentation.
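The random baseline described above for wordseg-baseline can be sketched in a few lines. This is an illustration under stated assumptions, not the project's code: the function names are hypothetical, the input utterance is a string of units separated by spaces, and the oracle probability is estimated as the number of words divided by the number of units, which may not be the exact estimator wordseg uses.

```python
import random


def random_segmentation(utterance, probability, rng):
    """Insert a word boundary after each unit with the given probability.

    Hypothetical helper: `utterance` is a string of space-separated units,
    and the returned string has words separated by spaces.
    """
    words, current = [], []
    for unit in utterance.split():
        current.append(unit)
        if rng.random() < probability:
            words.append("".join(current))
            current = []
    if current:  # the end of the utterance always closes the last word
        words.append("".join(current))
    return " ".join(words)


def boundary_probability(oracle):
    """Estimate the boundary probability from an oracle text.

    Assumption: `oracle` is a list of utterances, each a list of words,
    each word a list of units. Every word ends with one boundary.
    """
    n_words = sum(len(utterance) for utterance in oracle)
    n_units = sum(len(word) for utterance in oracle for word in utterance)
    return n_words / n_units


rng = random.Random(0)  # fixed seed so that runs are reproducible
oracle = [[["D", "@"], ["k", "a", "t"]]]
segmented = random_segmentation("D @ k a t", boundary_probability(oracle), rng)
```

A probability of 1 yields one word per unit and a probability of 0 yields one word per utterance, which gives two useful reference points when comparing against real algorithms.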
Bugfixes
wordseg-dibs has been debugged (#16).
wordseg-ag has been debugged.
The following characters are now forbidden in separators because they interfere with regular expression matching:
!#$%&'*+-.^`|~:\\\"
Type scoring is now correctly implemented in wordseg-eval (#10, #14).
wordseg-0.5
Implementation of Adaptor Grammar as wordseg-ag,
Installation now relies on cmake (was python setuptools),
Improvements in tests and documentation,
Various bugfixes.
wordseg-0.4.1
First public release, adaptation from Alex Cristia’s CDSWordSeg.
Four algorithms (tp, puddle, dpseg, dibs).
Segmentation preprocessing and evaluation.
Unit tests and documentation.
On the original implementation, we applied the following changes:
conversion to C++11 standard,
replaced tr1/unordered_map and mt19937 by the standard library,
code cleanup, removed useless functions and code,
complete rewrite of the build process (Makefile, link on boost).