.. _tutorial:

Tutorial
========

This tutorial demonstrates the use of the **wordseg** tools on a
concrete example: comparing three algorithms on a 1000 utterances
English corpus.

* As **wordseg** can be used either in bash or python, this tutorial
  shows both; choose whichever you prefer!

* We will use the TP, dibs and puddle algorithms, which are the three
  fastest ones.

* For the tutorial, work in a new directory you can delete
  afterward::

    mkdir -p ./wordseg_tutorial
    cd ./wordseg_tutorial


Input text
----------

You can find sample texts in the ``test/data`` directory. The file
``orthographic.txt`` contains the text utterances:

| you could eat it with a spoon
| you have to cut that corn too
| and banana
| good cheese

And the file ``tagged.txt`` contains the same utterances represented
as lists of phones, with syllable boundaries tagged as *;esyll* and
word boundaries tagged as *;eword*:

| y uw ;esyll ;eword k uh d ;esyll ;eword iy t ;esyll ;eword ih t ;esyll ;eword w ih dh ;esyll ;eword ax ;esyll ;eword s p uw n ;esyll ;eword
| y uw ;esyll ;eword hh ae v ;esyll ;eword t ax ;esyll ;eword k ah t ;esyll ;eword dh ae t ;esyll ;eword k ao r n ;esyll ;eword t uw ;esyll ;eword
| ae n d ;esyll ;eword b ax ;esyll n ae ;esyll n ax ;esyll ;eword
| g uh d ;esyll ;eword ch iy z ;esyll ;eword


Bash tutorial
-------------

The following script is located in ``../doc/tutorial.sh``. It takes an
input text file (in phonological form) as argument:

.. literalinclude:: tutorial.sh
   :language: bash

From the tutorial directory, we can execute the script and display
the result in a table with::

    ../doc/tutorial.sh ../test/data/tagged.txt | column -t


Python tutorial
---------------

The following script is located in ``../doc/tutorial.py``. It
implements exactly the same process as part 1 of the bash script:

.. literalinclude:: tutorial.py
   :language: python

We can execute it the same way::

    ../doc/tutorial.py ../test/data/tagged.txt | column -t


Expected output
---------------

The bash and python tutorials give the same result. It should look
like the following (note that the evaluation table below still needs
to be updated)::

    * Statistics

    {
        "phones": {
            "tokens": 6199,
            "hapaxes": 0,
            "types": 39
        },
        "corpus": {
            "nutts_single_word": 35,
            "nutts": 301,
            "entropy": 0.014991768574252533,
            "mattr": 0.9218384697130766
        },
        "syllables": {
            "tokens": 2451,
            "hapaxes": 264,
            "types": 607
        },
        "words": {
            "tokens": 1892,
            "hapaxes": 276,
            "types": 548
        }
    }

    * Evaluation

    score                       baseline   tp         puddle     dpseg      ag         dibs
    --------------------------  ---------  ---------  ---------  ---------  ---------  ---------
    token_precision             0.06654    0.3325     0.2617     0.3312     0.4851     0.7084
    token_recall                0.1147     0.4059     0.05338    0.4408     0.6195     0.6226
    token_fscore                0.08422    0.3655     0.08867    0.3782     0.5441     0.6627
    type_precision              0.1097     0.2344     0.1058     0.4081     0.4087     0.4916
    type_recall                 0.2172     0.3631     0.05657    0.3485     0.4288     0.6387
    type_fscore                 0.1457     0.2849     0.07372    0.376      0.4185     0.5556
    boundary_all_precision      0.4043     0.6542     0.9884     0.6662     0.7376     0.9353
    boundary_all_recall         0.6566     0.7788     0.3096     0.8564     0.9138     0.8377
    boundary_all_fscore         0.5004     0.7111     0.4715     0.7494     0.8163     0.8838
    boundary_noedge_precision   0.2831     0.5505     0.9059     0.5756     0.6629     0.9068
    boundary_noedge_recall      0.5267     0.6952     0.0484     0.802      0.8812     0.7762
    boundary_noedge_fscore      0.3683     0.6144     0.09189    0.6702     0.7566     0.8364
    adjusted_rand_index         0.3974     0.6354     0.185      0.6352     0.686      0.7916
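
If you want a starting point for your own experiments, here is a
minimal sketch of the same prepare / segment / evaluate pipeline
written against the Python API. It is not a copy of ``tutorial.py``:
it assumes the module layout documented for **wordseg**
(``wordseg.prepare``, ``wordseg.evaluate``, ``wordseg.algos``), it
only runs the TP and puddle algorithms (dibs needs an extra training
step, see ``tutorial.py``), and exact signatures may vary between
versions, so treat ``tutorial.py`` above as the reference::

    import sys

    from wordseg.prepare import prepare, gold   # tagged text -> prepared/gold text
    from wordseg.evaluate import evaluate       # segmentation scores
    from wordseg.algos import tp, puddle        # two of the algorithms used above

    # read the tagged input text (one utterance per line, phones
    # separated by spaces, with ;esyll and ;eword boundary tags)
    text = open(sys.argv[1], 'r').readlines()

    # strip the boundary tags to build the segmentation input, and
    # build the gold (word-separated) version for evaluation
    prepared = list(prepare(text))
    gold_words = list(gold(text))

    # segment the prepared text with two algorithms
    segmented_tp = list(tp.segment(prepared))
    segmented_puddle = list(puddle.segment(prepared))

    # score each segmentation against the gold text
    for name, segmented in [('tp', segmented_tp), ('puddle', segmented_puddle)]:
        for score, value in evaluate(segmented, gold_words).items():
            print(name, score, value)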