The wordseg package consists of a collection of command-line programs and a Python library, both installed by following the instructions below.


Before going further, please clone the repository from GitHub and move to its root directory:

git clone ./wordseg
cd ./wordseg


The package is implemented in Python and C++ and requires additional software to work:

  • Python 3 (Python 2 is no longer supported),

  • a C++ compiler supporting the C++11 standard,

  • cmake for configuration and build (see here),

  • the boost program options C++ library for option parsing (see here),

  • the joblib Python package for parallel processing (see here),

  • the scikit-learn Python package for statistical analysis (see here).
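
Before installing, it can help to confirm that the command-line prerequisites are reachable from your shell. The following is a minimal sketch, not part of wordseg: check_tool is a hypothetical helper, and c++ is assumed to be the name of your compiler driver (it may be g++ or clang++ on your system). It does not verify C++11 support, the boost library or the Python packages.

```shell
# check_tool is a hypothetical helper, not a wordseg command.
# It prints whether a program is on the PATH and returns 0 if found, 1 otherwise.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Count the missing tools. "c++" is an assumed compiler name; adapt it if needed.
missing=0
for tool in python3 c++ cmake; do
    check_tool "$tool" || missing=$((missing + 1))
done
echo "$missing tool(s) missing"
```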

Installation of the wordseg package

There are three options:

  • System-wide installation: This is the recommended installation if you want to use wordseg on your personal computer (and you do not want to modify/contribute to the code).

  • Installation in a virtual environment: This is the recommended installation if you are not an administrator of your machine, if you are working in a multi-user environment (e.g. a computing cluster) or if you are developing with wordseg.

  • Installation in docker: This is the recommended installation if you are working on Windows or Mac, or in a cloud infrastructure, or if you want a reproducible environment.

System-wide installation

  • Install the required dependencies:

    • on Ubuntu/Debian:

      sudo apt-get install python3 python3-pip cmake libboost-program-options-dev
    • on Mac OSX:

      brew install python3 boost cmake
  • Finally compile and install wordseg:

    [sudo] make install


If you plan to modify the wordseg code, use make develop instead of make install.

Installation in a virtual environment


If you have already followed the instructions under System-wide installation, skip this section and go directly to Optional: Run tests to check your installation.

This installation process is based on the conda package manager and can be performed on any Linux, Mac OS or Windows system supported by conda. You can use virtualenv instead if you prefer.

  • First install conda from here.

  • Create a new Python 3 virtual environment named wordseg and install the required dependencies:

    conda env create -f environment.yml
  • Activate your virtual environment:

    conda activate wordseg
  • Install the wordseg package:

    make install


Do not forget to activate your virtual environment before using wordseg:

conda activate wordseg
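
A script that calls wordseg can guard against a forgotten activation. This is only a sketch: it relies on the CONDA_DEFAULT_ENV variable, which conda sets to the name of the active environment, and in_conda_env is a hypothetical helper name.

```shell
# in_conda_env is a hypothetical helper, not part of wordseg or conda.
# It succeeds only when the active conda environment has the given name,
# based on the CONDA_DEFAULT_ENV variable that conda sets on activation.
in_conda_env() {
    [ "${CONDA_DEFAULT_ENV:-}" = "$1" ]
}

if in_conda_env wordseg; then
    echo "wordseg environment is active"
else
    echo "please run: conda activate wordseg"
fi
```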

Installation in docker

We provide a Dockerfile to build a docker image of wordseg that can be run on Linux, Mac and Windows.

  • First install docker for your OS.

  • Build the wordseg image:

    [sudo] docker build -t wordseg .
  • Now you can run wordseg from within a docker container.

    For example, run an interactive bash session in docker, mapping a data directory on your local host to /data in docker:

    [sudo] docker run -v $PWD/test/data/:/data -it wordseg /bin/bash
    # you are now in the docker machine, run wordseg as usual
    root@1d32398b8c8e:/wordseg# head -5 /data/tagged.txt | wordseg-prep | wordseg-dpseg --nfolds 1
    yuw kuhdiytihtwihdhaxspuwn
    yuw hhaev t axkaht dhaet kaorn tuw
    aen d baxnaenax
    ehmehm teystiy kaorn


On Mac, use wordseg-ag and wordseg-dpseg within docker. For example, if you already have a wordseg installation on your computer, you can use it for all but the ag and dpseg algorithms, and run those two from docker. Here we use the local wordseg-prep along with the docker wordseg-dpseg:

user@host:~/dev/wordseg$ head -5 $PWD/test/data/tagged.txt | wordseg-prep | docker run -i wordseg wordseg-dpseg --nfolds 1
yuw kuhdiytihtwihdhaxspuwn
yuw hhaev t axkaht dhaet kaorn tuw
aen d baxnaenax
ehmehm teystiy kaorn
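
If you often mix a local installation with the docker image, a small shell function can shorten the pipeline above. This is a sketch, not something shipped with wordseg: wordseg_in_docker is a hypothetical name, and the image is assumed to be tagged wordseg as in the build step.

```shell
# wordseg_in_docker is a hypothetical wrapper, not shipped with wordseg.
# It runs the given command in the image tagged "wordseg"; -i keeps stdin
# open so that data can be piped in from the host.
wordseg_in_docker() {
    docker run -i wordseg "$@"
}

# Usage, with the local wordseg-prep and the docker wordseg-dpseg:
#   head -5 test/data/tagged.txt | wordseg-prep | wordseg_in_docker wordseg-dpseg --nfolds 1
```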

Optional: Run tests to check your installation

We recommend that you always run the test suite, because it lets you make sure that all dependencies and all subparts of the package have been correctly installed. Simply run:

make test


If all the tests pass, you have successfully installed wordseg and can skip the rest of this section. If some tests fail, the package’s capabilities may be reduced.

  • The tests are located in ./test and are executed by pytest. If some tests fail, you may want to rerun them with the command pytest -v ./test to get more detailed output.

  • pytest supports a lot of options. For example, to stop the execution at the first failure, use pytest -x. To execute a single test case, use pytest ./test/
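
The options above can be combined. The sketch below wraps one such combination in a shell function; rerun_tests is a hypothetical name, and any extra arguments are handed to pytest unchanged.

```shell
# rerun_tests is a hypothetical helper, not part of wordseg.
# It reruns the suite in ./test verbosely (-v) and stops at the first
# failure (-x); extra arguments are passed to pytest unchanged.
rerun_tests() {
    pytest -v -x ./test "$@"
}

# Usage, from the root of the repository:
#   rerun_tests
```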

Optional: Build the documentation

To build the HTML documentation (the one you are currently reading), first install some dependencies. On Ubuntu/Debian:

sudo apt-get install texlive texlive-latex-extra dvipng
[sudo] pip install sphinx sphinx_rtd_theme numpydoc

Then run:

make html

The documentation is then built; its homepage is build/doc/html/index.html.