Tutorial

This tutorial demonstrates the use of the wordseg tools on a concrete example: comparing several word segmentation algorithms on a 1000-utterance English corpus.

  • As wordseg can be used either from bash or from Python, this tutorial shows both: choose your favorite!

  • We will focus on the TP, dibs and puddle algorithms, which are the three fastest; the scripts below also run baseline, dpseg and ag for comparison.

  • For the tutorial, work in a new directory you can delete afterward:

    mkdir -p ./wordseg_tutorial
    cd ./wordseg_tutorial
    

Input text

You can find sample texts in the test/data directory. The file orthographic.txt contains the text utterances:

you could eat it with a spoon
you have to cut that corn too
and banana
good cheese

The file tagged.txt contains the same utterances represented as sequences of phones, with syllable boundaries tagged as ;esyll and word boundaries tagged as ;eword:

y uw ;esyll ;eword k uh d ;esyll ;eword iy t ;esyll ;eword ih t ;esyll ;eword w ih dh ;esyll ;eword ax ;esyll ;eword s p uw n ;esyll ;eword
y uw ;esyll ;eword hh ae v ;esyll ;eword t ax ;esyll ;eword k ah t ;esyll ;eword dh ae t ;esyll ;eword k ao r n ;esyll ;eword t uw ;esyll ;eword
ae n d ;esyll ;eword b ax ;esyll n ae ;esyll n ax ;esyll ;eword
g uh d ;esyll ;eword ch iy z ;esyll ;eword
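
To make the relation between the tagged text and the segmentation task explicit, here is a minimal Python sketch using the prepare and gold functions detailed in the Python tutorial below; the outputs shown in the comments are indicative, assuming wordseg's default separators:

from wordseg.prepare import prepare, gold

# the first utterance of tagged.txt
tagged = ['y uw ;esyll ;eword k uh d ;esyll ;eword iy t ;esyll ;eword '
          'ih t ;esyll ;eword w ih dh ;esyll ;eword ax ;esyll ;eword '
          's p uw n ;esyll ;eword']

# input to the segmenters: space-separated phones, boundaries removed
print(list(prepare(tagged))[0])
# -> y uw k uh d iy t ih t w ih dh ax s p uw n

# gold version used for evaluation: space-separated words
print(list(gold(tagged))[0])
# -> yuw kuhd iyt iht wihdh ax spuwn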

Bash tutorial

The following script is located in ../doc/tutorial.sh and takes an input text file (in phonological form) as its argument:

#!/bin/bash

# prepare the input for segmentation and generate the gold text
cat "$1" | wordseg-prep -u phone --gold gold.txt > prepared.txt

# compute statistics on the tokenized input text
cat "$1" | wordseg-stats --json > stats.json

# display the statistics computed on the input text
echo "STATISTICS"
echo "=========="
echo

cat stats.json

echo
echo "TUTORIAL PART 1 (no training)"
echo "============================="


# segment the prepared text with different algorithms (we show a few
# of their options; use --help to list them all)
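# (-P is the probability to place a word boundary, -d ftp selects
# forward transitional probabilities, -t relative a relative threshold,
# -w the window size, -f the number of folds, -r the random seed, and
# --nruns/--njobs/--niterations the number of runs, parallel jobs and
# iterations of the adaptor grammar)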
cat prepared.txt | wordseg-baseline -P 0.5 > segmented.baseline.txt
cat prepared.txt | wordseg-tp -d ftp -t relative > segmented.tp.txt
cat prepared.txt | wordseg-puddle -w 2 > segmented.puddle.txt
cat prepared.txt | wordseg-dpseg -f 1 -r 1 > segmented.dpseg.txt
cat prepared.txt | wordseg-ag --nruns 4 --njobs 4 --niterations 10 > segmented.ag.txt

# dibs must be provided with word boundaries to do some preliminary training.
# Boundaries are then removed to generate the text to segment (as with
# wordseg-prep).
cat "$1" | wordseg-dibs -t gold > segmented.dibs.txt

# evaluate them against the gold file
for algo in baseline tp puddle dpseg dibs ag
do
    cat segmented.$algo.txt | wordseg-eval gold.txt -r prepared.txt > eval.$algo.txt
done


# concatenate the evaluations in a table
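# (each eval file contains 13 "score value" lines: we print the full
# line from the baseline file, then only the value, i.e. the second
# field, from the other files, and let column -t align the result)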

echo
(
    echo "score baseline tp puddle dpseg ag dibs"
    echo "------------------ ------- ------- ------- ------- ------- -------"
    for i in $(seq 1 13)
    do
        awk -v i=$i 'NR==i {printf $0}; END {printf " "}' eval.baseline.txt
        awk -v i=$i 'NR==i {printf $2}; END {printf " "}' eval.tp.txt
        awk -v i=$i 'NR==i {printf $2}; END {printf " "}' eval.puddle.txt
        awk -v i=$i 'NR==i {printf $2}; END {printf " "}' eval.dpseg.txt
        awk -v i=$i 'NR==i {printf $2}; END {printf " "}' eval.ag.txt
        awk -v i=$i 'NR==i {print $2}' eval.dibs.txt
    done
) | column -t


# repeat the whole process, but train on the first 80% of the file
echo
echo
echo "TUTORIAL PART 2 (train on 80% of data)"
echo "======================================"


# split the file into 80/20
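# (csplit cuts the file at the given line number: xx00 receives the
# lines before it, i.e. the first 80%, and xx01 the rest)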
csplit --quiet "$1" $(( $(wc -l < "$1") * 8 / 10 + 1 ))
mv xx00 train_tagged.txt
mv xx01 test_tagged.txt

# prepare the input for segmentation and generate the gold text for test only
cat train_tagged.txt | wordseg-prep -u phone --gold gold_train.txt > prepared_train.txt
cat test_tagged.txt | wordseg-prep -u phone --gold gold_test.txt > prepared_test.txt

# segment the prepared test text with the algorithms supporting the
# train/test protocol (-T gives the training file)
cat prepared_test.txt | wordseg-tp -d ftp -t relative -T prepared_train.txt > segmented.tp.tt.txt
cat prepared_test.txt | wordseg-puddle -w 2 -T prepared_train.txt > segmented.puddle.tt.txt
cat prepared_test.txt | wordseg-ag --nruns 4 --njobs 4 --niterations 10 -T prepared_train.txt > segmented.ag.tt.txt

# dibs takes its training data in gold format (word boundaries kept),
# so we pass the tagged train file to -T
cat prepared_test.txt | wordseg-dibs -t gold -T train_tagged.txt > segmented.dibs.tt.txt

# evaluate them against the gold file
for algo in tp puddle dibs ag
do
    cat segmented.$algo.tt.txt | wordseg-eval gold_test.txt -r prepared_test.txt > eval.$algo.tt.txt
done

# concatenate the evaluations in a table
echo
(
    echo "score tp puddle ag dibs"
    echo "------------------ ------- ------- ------- -------"
    for i in $(seq 1 13)
    do
        awk -v i=$i 'NR==i {printf $0}; END {printf " "}' eval.tp.tt.txt
        awk -v i=$i 'NR==i {printf $2}; END {printf " "}' eval.puddle.tt.txt
        awk -v i=$i 'NR==i {printf $2}; END {printf " "}' eval.ag.tt.txt
        awk -v i=$i 'NR==i {print $2}' eval.dibs.tt.txt
    done
) | column -t

From the tutorial directory, we can execute the script and display the result in a table with ../doc/tutorial.sh ../test/data/tagged.txt | column -t.

Python tutorial

The following script is located in ../doc/tutorial.py. It implements exactly the same process as the bash script (part 1 only):

#!/usr/bin/env python

import json
import sys

from wordseg.evaluate import evaluate
from wordseg.prepare import prepare, gold
from wordseg.algos import tp, puddle, dpseg, baseline, dibs, ag
from wordseg.statistics import CorpusStatistics
from wordseg.separator import Separator


# load the input text file
text = open(sys.argv[1], 'r').readlines()

# compute some statistics on the input text (text tokenized at phone
# and word levels)
separator = Separator(phone=' ', syllable=';esyll', word=';eword')
stats = CorpusStatistics(text, separator).describe_all()

# display the computed statistics
sys.stdout.write(
    '* Statistics\n\n' +
    json.dumps(stats, indent=4) + '\n')

# prepare the input for segmentation
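# (prepare() uses wordseg's default separators, ;esyll and ;eword, as
# found in the tagged input, and outputs space-separated phones)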
prepared = list(prepare(text))

# generate the gold text (named gold_text so as not to shadow the
# gold() function imported above)
gold_text = list(gold(text))

# segment the prepared text with different algorithms
segmented_baseline = baseline.segment(prepared, probability=0.5)
segmented_tp = tp.segment(prepared, threshold='relative')
segmented_puddle = puddle.segment(prepared, njobs=4, window=2)
segmented_dpseg = dpseg.segment(prepared, nfolds=1, args='--randseed 1')
segmented_ag = ag.segment(prepared, nruns=4, njobs=4, args='-n 10')

# we must provide a trained model to dibs (with stats on diphones)
model_dibs = dibs.CorpusSummary(text)
segmented_dibs = dibs.segment(prepared, model_dibs)

# evaluate them against the gold file
eval_baseline = evaluate(segmented_baseline, gold_text, units=prepared)
eval_tp = evaluate(segmented_tp, gold_text, units=prepared)
eval_puddle = evaluate(segmented_puddle, gold_text, units=prepared)
eval_dpseg = evaluate(segmented_dpseg, gold_text, units=prepared)
eval_ag = evaluate(segmented_ag, gold_text, units=prepared)
eval_dibs = evaluate(segmented_dibs, gold_text, units=prepared)


# a little function to display a score with 4-digit precision
def display(score):
    if score is None:
        return 'None'
    else:
        return '%.4g' % score


# concatenate the evaluations in a table and display them
header = ['score', 'baseline', 'tp', 'puddle', 'dpseg', 'ag', 'dibs']
pattern = ('{:<26}' + '{:<9}' * (len(header) - 1))
sys.stdout.write(
    '\n* Evaluation\n\n' +
    pattern.format(*header) + '\n' +
    pattern.format(*['-'*25] + ['-'*8] * (len(header) - 1)) + '\n')

for score in eval_tp.keys():
    line = pattern.format(*[
        score,
        display(eval_baseline[score]),
        display(eval_tp[score]),
        display(eval_puddle[score]),
        display(eval_dpseg[score]),
        display(eval_ag[score]),
        display(eval_dibs[score])])
    sys.stdout.write(line + '\n')

We can execute it using ../doc/tutorial.py ../test/data/tagged.txt | column -t.
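
Part 2 of the bash tutorial (training on 80% of the data) has no counterpart in ../doc/tutorial.py. Here is a minimal sketch of the train/test protocol in Python, shown for dibs only since its training interface (dibs.CorpusSummary) appears above; the training options of tp, puddle and ag are not covered here. The sketch reuses text, prepare, gold, dibs and evaluate from the script:

# split the corpus into 80% train / 20% test, as csplit does in bash
n_train = int(len(text) * 0.8)
train_text, test_text = text[:n_train], text[n_train:]

# prepare the test utterances and generate their gold version
prepared_test = list(prepare(test_text))
gold_test = list(gold(test_text))

# train the dibs model on the train part, then segment the test part
model_dibs = dibs.CorpusSummary(train_text)
segmented_dibs = dibs.segment(prepared_test, model_dibs)

# evaluate against the test gold
eval_dibs = evaluate(segmented_dibs, gold_test, units=prepared_test)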

Expected output

The bash and Python scripts give the same result; it should look something like this:

* Statistics

{
    "phones": {
        "tokens": 6199,
        "hapaxes": 0,
        "types": 39
    },
    "corpus": {
        "nutts_single_word": 35,
        "nutts": 301,
        "entropy": 0.014991768574252533,
        "mattr": 0.9218384697130766
    },
    "syllables": {
        "tokens": 2451,
        "hapaxes": 264,
        "types": 607
    },
    "words": {
        "tokens": 1892,
        "hapaxes": 276,
        "types": 548
    }
}

* Evaluation

score                      baseline  tp        puddle    dpseg     ag        dibs
-------------------------- --------- --------- --------- --------- --------- ---------
token_precision            0.06654   0.3325    0.2617    0.3312    0.4851    0.7084
token_recall               0.1147    0.4059    0.05338   0.4408    0.6195    0.6226
token_fscore               0.08422   0.3655    0.08867   0.3782    0.5441    0.6627
type_precision             0.1097    0.2344    0.1058    0.4081    0.4087    0.4916
type_recall                0.2172    0.3631    0.05657   0.3485    0.4288    0.6387
type_fscore                0.1457    0.2849    0.07372   0.376     0.4185    0.5556
boundary_all_precision     0.4043    0.6542    0.9884    0.6662    0.7376    0.9353
boundary_all_recall        0.6566    0.7788    0.3096    0.8564    0.9138    0.8377
boundary_all_fscore        0.5004    0.7111    0.4715    0.7494    0.8163    0.8838
boundary_noedge_precision  0.2831    0.5505    0.9059    0.5756    0.6629    0.9068
boundary_noedge_recall     0.5267    0.6952    0.0484    0.802     0.8812    0.7762
boundary_noedge_fscore     0.3683    0.6144    0.09189   0.6702    0.7566    0.8364
adjusted_rand_index        0.3974    0.6354    0.185     0.6352    0.686     0.7916