Tutorial¶
This tutorial demonstrates the use of the wordseg tools on a concrete example: comparing several segmentation algorithms on a 1000-utterance English corpus.
As wordseg can be used either from bash or from Python, this tutorial shows both: choose your favorite!
We will use the baseline, TP, puddle, dpseg, AG and dibs algorithms (TP, puddle and dibs being the fastest of them).
For the tutorial, work in a new directory you can delete afterward:
mkdir -p ./wordseg_tutorial
cd ./wordseg_tutorial
Input text¶
You can find a sample text in test/data.
The file orthographic.txt contains the text utterances:
you could eat it with a spoon
you have to cut that corn too
and banana
good cheese
The file tagged.txt contains the same utterances represented as lists of phones, with syllable boundaries tagged as ;esyll and word boundaries tagged as ;eword:
y uw ;esyll ;eword k uh d ;esyll ;eword iy t ;esyll ;eword ih t ;esyll ;eword w ih dh ;esyll ;eword ax ;esyll ;eword s p uw n ;esyll ;eword
y uw ;esyll ;eword hh ae v ;esyll ;eword t ax ;esyll ;eword k ah t ;esyll ;eword dh ae t ;esyll ;eword k ao r n ;esyll ;eword t uw ;esyll ;eword
ae n d ;esyll ;eword b ax ;esyll n ae ;esyll n ax ;esyll ;eword
g uh d ;esyll ;eword ch iy z ;esyll ;eword
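To see how these two representations relate, here is a minimal sketch in plain Python (illustrative only, not part of the wordseg API) deriving, from a single tagged utterance, the bare phone sequence given to the segmenters and the gold text used for evaluation:

# illustrative only: derive the "prepared" and "gold" forms of a single
# tagged utterance (wordseg-prep does this for a whole file)
tagged = 'g uh d ;esyll ;eword ch iy z ;esyll ;eword'

# prepared input: the phone sequence with all tags removed
phones = ' '.join(t for t in tagged.split() if t not in (';esyll', ';eword'))
print(phones)   # g uh d ch iy z

# gold text: phones joined within each word, words separated by spaces
words = [w.replace(';esyll', '').replace(' ', '')
         for w in tagged.split(';eword') if w.strip()]
print(' '.join(words))   # guhd chiyz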
Bash tutorial¶
The following script is located in ../doc/tutorial.sh and takes an input text file (in phonological form) as argument:
#!/bin/bash
# prepare the input for segmentation and generate the gold text
cat $1 | wordseg-prep -u phone --gold gold.txt > prepared.txt
# compute statistics on the tokenized input text
cat $1 | wordseg-stats --json > stats.json
# display the statistics computed on the input text
echo "STATISTICS"
echo "=========="
echo
cat stats.json
echo
echo "TUTORIAL PART 1 (no training)"
echo "============================="
# segment the prepared text with different algorithms (we show a few
# options for each; use --help to list them all)
cat prepared.txt | wordseg-baseline -P 0.5 > segmented.baseline.txt
cat prepared.txt | wordseg-tp -d ftp -t relative > segmented.tp.txt
cat prepared.txt | wordseg-puddle -w 2 > segmented.puddle.txt
cat prepared.txt | wordseg-dpseg -f 1 -r 1 > segmented.dpseg.txt
cat prepared.txt | wordseg-ag --nruns 4 --njobs 4 --niterations 10 > segmented.ag.txt
# dibs must be provided with word boundaries to do some preliminary training.
# Boundaries are then removed to generate the text to segment (as with
# wordseg-prep).
cat $1 | wordseg-dibs -t gold > segmented.dibs.txt
# evaluate them against the gold file
for algo in baseline tp puddle dpseg dibs ag
do
cat segmented.$algo.txt | wordseg-eval gold.txt -r prepared.txt > eval.$algo.txt
done
# concatenate the evaluations in a table
echo
(
echo "score baseline tp puddle dpseg ag dibs"
echo "------------------ ------- ------- ------- ------- ------- -------"
for i in $(seq 1 13)
do
awk -v i=$i 'NR==i {printf $0}; END {printf " "}' eval.baseline.txt
awk -v i=$i 'NR==i {printf $2}; END {printf " "}' eval.tp.txt
awk -v i=$i 'NR==i {printf $2}; END {printf " "}' eval.puddle.txt
awk -v i=$i 'NR==i {printf $2}; END {printf " "}' eval.dpseg.txt
awk -v i=$i 'NR==i {printf $2}; END {printf " "}' eval.ag.txt
awk -v i=$i 'NR==i {print $2}' eval.dibs.txt
done
) | column -t
## REPEAT THE WHOLE PROCESS, but training on the first 80% of the file
echo
echo
echo "TUTORIAL PART 2 (train on 80% of data)"
echo "======================================"
# split the file into 80/20
csplit --quiet $1 $(( $(wc -l < $1 ) * 8 / 10 + 1))
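# e.g. for a 300-line input file: 300 * 8 / 10 + 1 = 241, so csplit cuts
# before line 241, putting lines 1-240 (80%) in xx00 and the rest in xx01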
mv xx00 train_tagged.txt
mv xx01 test_tagged.txt
# prepare the input for segmentation and generate the gold text for test only
cat train_tagged.txt | wordseg-prep -u phone --gold gold_train.txt > prepared_train.txt
cat test_tagged.txt | wordseg-prep -u phone --gold gold_test.txt > prepared_test.txt
# segment the prepared text with different algorithms (note: train/test
# mode is implemented only for the following algorithms)
cat prepared_test.txt | wordseg-tp -d ftp -t relative -T prepared_train.txt > segmented.tp.tt.txt
cat prepared_test.txt | wordseg-puddle -w 2 -T prepared_train.txt > segmented.puddle.tt.txt
cat prepared_test.txt | wordseg-ag --nruns 4 --njobs 4 --niterations 10 -T prepared_train.txt > segmented.ag.tt.txt
# dibs estimates its parameters from a training file in tagged (gold)
# format, given with -T, and then segments the prepared test file
cat prepared_test.txt | wordseg-dibs -t gold -T train_tagged.txt > segmented.dibs.tt.txt
# evaluate them against the gold file
for algo in tp puddle dibs ag
do
cat segmented.$algo.tt.txt | wordseg-eval gold_test.txt -r prepared_test.txt > eval.$algo.tt.txt
done
# concatenate the evaluations in a table
echo
(
echo "score tp puddle ag dibs"
echo "------------------ ------- ------- ------- -------"
for i in $(seq 1 13)
do
awk -v i=$i 'NR==i {printf $0}; END {printf " "}' eval.tp.tt.txt
awk -v i=$i 'NR==i {printf $2}; END {printf " "}' eval.puddle.tt.txt
awk -v i=$i 'NR==i {printf $2}; END {printf " "}' eval.ag.tt.txt
awk -v i=$i 'NR==i {print $2}' eval.dibs.tt.txt
done
) | column -t
From the tutorial directory, we can execute the script and display the result in a table with ../doc/tutorial.sh ../test/data/tagged.txt | column -t.
Python tutorial¶
The following script is located in ../doc/tutorial.py. It implements the same process as the bash script (part 1 only):
#!/usr/bin/env python
import json
import sys
from wordseg.evaluate import evaluate
from wordseg.prepare import prepare, gold
from wordseg.algos import tp, puddle, dpseg, baseline, dibs, ag
from wordseg.statistics import CorpusStatistics
from wordseg.separator import Separator
# load the input text file
text = open(sys.argv[1], 'r').readlines()
# compute some statistics on the input text (text tokenized at phone
# and word levels)
separator = Separator(phone=' ', syllable=';esyll', word=';eword')
stats = CorpusStatistics(text, separator).describe_all()
# display the computed statistics
sys.stdout.write(
'* Statistics\n\n' +
json.dumps(stats, indent=4) + '\n')
# prepare the input for segmentation
prepared = list(prepare(text))
# generate the gold text (named gold_text to avoid shadowing the gold function)
gold_text = list(gold(text))
# segment the prepared text with different algorithms
segmented_baseline = baseline.segment(prepared, probability=0.2)
segmented_tp = tp.segment(prepared, threshold='relative')
segmented_puddle = puddle.segment(prepared, njobs=4, window=2)
segmented_dpseg = dpseg.segment(prepared, nfolds=1, args='--randseed 1')
segmented_ag = ag.segment(prepared, nruns=4, njobs=4, args='-n 10')
# we must provide a trained model to dibs (with stats on diphones)
model_dibs = dibs.CorpusSummary(text)
segmented_dibs = dibs.segment(prepared, model_dibs)
# evaluate them against the gold file
eval_baseline = evaluate(segmented_baseline, gold_text, units=prepared)
eval_tp = evaluate(segmented_tp, gold_text, units=prepared)
eval_puddle = evaluate(segmented_puddle, gold_text, units=prepared)
eval_dpseg = evaluate(segmented_dpseg, gold_text, units=prepared)
eval_ag = evaluate(segmented_ag, gold_text, units=prepared)
eval_dibs = evaluate(segmented_dibs, gold_text, units=prepared)
# a little function to display scores with 4-digit precision
def display(score):
if score is None:
return 'None'
else:
return '%.4g' % score
# concatenate the evaluations in a table and display them
header = ['score', 'baseline', 'tp', 'puddle', 'dpseg', 'ag', 'dibs']
pattern = ('{:<26}' + '{:<9}' * (len(header) - 1))
sys.stdout.write(
'\n* Evaluation\n\n' +
pattern.format(*header) + '\n' +
pattern.format(*['-'*25] + ['-'*8] * (len(header) - 1)) + '\n')
for score in eval_tp.keys():
line = pattern.format(*[
score,
display(eval_baseline[score]),
display(eval_tp[score]),
display(eval_puddle[score]),
display(eval_dpseg[score]),
display(eval_ag[score]),
display(eval_dibs[score])])
sys.stdout.write(line + '\n')
We can execute it using ../doc/tutorial.py ../test/data/tagged.txt | column -t.
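Part 2 of the bash tutorial (training on the first 80% of the corpus) can be reproduced in Python as well. Here is a minimal sketch of the train/test split, using only the prepare and gold functions already shown; the resulting training text is then passed to the algorithms that support it through their training parameter, the Python counterpart of the command line -T option (see each algorithm's API documentation for its exact name):

#!/usr/bin/env python
import sys
from wordseg.prepare import prepare, gold

# load the input text file (tagged format)
text = open(sys.argv[1], 'r').readlines()

# take the first 80% of the utterances for training, the rest for testing
n_train = int(len(text) * 0.8)
train_text, test_text = text[:n_train], text[n_train:]

# prepare the input for segmentation, generate the gold text for test only
prepared_train = list(prepare(train_text))
prepared_test = list(prepare(test_text))
gold_test = list(gold(test_text))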
Expected output¶
The bash and python scripts give the same results; the output should be something like this:
* Statistics
{
"phones": {
"tokens": 6199,
"hapaxes": 0,
"types": 39
},
"corpus": {
"nutts_single_word": 35,
"nutts": 301,
"entropy": 0.014991768574252533,
"mattr": 0.9218384697130766
},
"syllables": {
"tokens": 2451,
"hapaxes": 264,
"types": 607
},
"words": {
"tokens": 1892,
"hapaxes": 276,
"types": 548
}
}
* Evaluation
score baseline tp puddle dpseg ag dibs
-------------------------- --------- --------- --------- --------- --------- ---------
token_precision 0.06654 0.3325 0.2617 0.3312 0.4851 0.7084
token_recall 0.1147 0.4059 0.05338 0.4408 0.6195 0.6226
token_fscore 0.08422 0.3655 0.08867 0.3782 0.5441 0.6627
type_precision 0.1097 0.2344 0.1058 0.4081 0.4087 0.4916
type_recall 0.2172 0.3631 0.05657 0.3485 0.4288 0.6387
type_fscore 0.1457 0.2849 0.07372 0.376 0.4185 0.5556
boundary_all_precision 0.4043 0.6542 0.9884 0.6662 0.7376 0.9353
boundary_all_recall 0.6566 0.7788 0.3096 0.8564 0.9138 0.8377
boundary_all_fscore 0.5004 0.7111 0.4715 0.7494 0.8163 0.8838
boundary_noedge_precision 0.2831 0.5505 0.9059 0.5756 0.6629 0.9068
boundary_noedge_recall 0.5267 0.6952 0.0484 0.802 0.8812 0.7762
boundary_noedge_fscore 0.3683 0.6144 0.09189 0.6702 0.7566 0.8364
adjusted_rand_index 0.3974 0.6354 0.185 0.6352 0.686 0.7916