Below is a brief tutorial on the OpenGrm SFST library based on a running example. We use the command-line SFST utilities throughout; we could instead have used the corresponding C++-level utilities in a program (see Available Operations).
Example Data and Models
For our running example, we will use the text of Oscar Wilde's The Importance of Being Earnest, which has been upper-cased and had punctuation removed. The first 850 sentences, earnest_train.txt, are used as training data and the remaining 838 sentences, earnest_test.txt, are used as test data. We also provide an OpenFst-style symbol table, earnest_train_1000.syms, which contains the 1000 most frequent words in the training set plus the <epsilon> token for the empty string and the <unk> token for the unknown word. These training and test sets can be compiled into OpenFst archives with:
$ farcompilestrings --unknown_symbol="<unk>" -keep_symbols -symbols=earnest_train_1000.syms earnest_train.txt >earnest_train.far
$ farcompilestrings --unknown_symbol="<unk>" -keep_symbols -symbols=earnest_train_1000.syms earnest_test.txt >earnest_test.far
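To sanity-check the compiled archives, the standard OpenFst FAR tools can be used to list their contents; for example (this optional step assumes your OpenFst build includes the far extension binaries farinfo and farprintstrings):
$ farinfo earnest_train.far
$ farprintstrings earnest_train.far | head -3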
Using the OpenGrm NGram utilities, we build bigram Katz LMs from the training data, earnest_train.mod, and from the test data, earnest_test.mod. From earnest_train.mod we also create a relative-entropy-pruned LM with a target of 2000 n-grams, earnest_train.pru:
$ ngramcount -order=2 earnest_train.far >earnest_train.cnts
$ ngrammake earnest_train.cnts >earnest_train.mod
$ ngramcount -order=2 earnest_test.far >earnest_test.cnts
$ ngrammake earnest_test.cnts >earnest_test.mod
$ ngramshrink --method="relative_entropy" --target_number_of_ngrams=2000 earnest_train.mod >earnest_train.pru
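The resulting models can be inspected with ngraminfo, which reports (among other things) the model order and the number of n-grams; this optional check assumes the OpenGrm NGram tools are on your path:
$ ngraminfo earnest_train.mod
$ ngraminfo earnest_train.pru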
Perplexity
We can compute the perplexity of these models relative to the test set using the SFST operation sfstperplexity as follows (ngramperplexity returns the same values but is restricted to n-gram input):
$ sfstperplexity --unknown_label=1001 -phi_label=0 earnest_train.mod earnest_test.far
# of sources 838
cross entropy/source 49.92315
perplexity/symbol 73.4146
# of OOVs 1363
$ sfstperplexity --unknown_label=1001 -phi_label=0 earnest_train.mod earnest_train.far
# of sources 850
cross entropy/source 37.8284
perplexity/symbol 26.19
# of OOVs 568
$ sfstperplexity --unknown_label=1001 -phi_label=0 earnest_train.pru earnest_test.far
# of sources 838
cross entropy/source 51.5088
perplexity/symbol 84.1474
# of OOVs 1363
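As noted above, ngramperplexity returns the same values on n-gram inputs; a minimal invocation (OOV-handling flags omitted here; see the ngramperplexity documentation) might look like:
$ ngramperplexity earnest_train.mod earnest_test.far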
We can also compute the self-perplexity of each model (the model evaluated against itself as the source) and the cross-perplexities between the two models:
$ sfstperplexity -phi_label=0 earnest_train.mod
# of sources 1
self entropy/source 55.9322
perplexity/symbol 72.518
$ sfstperplexity -phi_label=0 earnest_train.pru
# of sources 1
self entropy/source 56.3617
perplexity/symbol 84.8603
$ sfstperplexity --unknown_label=1001 -phi_label=0 earnest_train.pru earnest_train.mod
# of sources 1
cross entropy/source 59.2278
perplexity/symbol 93.3396
# of OOVs 196
$ sfstperplexity --unknown_label=1001 -phi_label=0 earnest_train.mod earnest_train.pru
# of sources 1
cross entropy/source 57.839
perplexity/symbol 95.336
# of OOVs 23
Sampling
We can sample from these SFSTs as follows:
$ sfstrandgen -phi_label=0 earnest_train.mod | fstprint --acceptor
0 1 YES
1 2 BUT
2 3 I
3 4 SHALL
4 5 PROBABLY
5 6 NEVER
6 7 <epsilon>
7 8 THAT
8 9 NAME
9 10 <epsilon>
10 11 YOUR
11 12 COUSIN
12 13 CECILY
13
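The <epsilon> arcs in the sample mark backoff transitions taken during generation. If you want the sample as a plain word sequence, one option is to remove them with OpenFst's fstrmepsilon before printing, e.g.:
$ sfstrandgen -phi_label=0 earnest_train.mod | fstrmepsilon | fstprint --acceptor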
Approximation
Idempotency
The following steps show that SFST approximation is effectively idempotent: approximating an SFST onto its own topology yields a model with the same perplexity as the source.
$ sfstapprox -phi_label=0 earnest_train.mod earnest_train.mod >earnest_train.approx
$ sfstperplexity --unknown_label=1001 -phi_label=0 earnest_train.approx earnest_test.far
# of sources 838
cross entropy/source 49.9231
perplexity/symbol 73.4145
# of OOVs 1363
An alternative, equivalent way to perform this approximation is to break it into two steps, where the counting and normalization are done separately.
$ sfstcount -phi_label=0 earnest_train.mod earnest_train.mod >earnest_train.approx_cnts
$ sfstnormalize -method=kl_min -phi_label=0 earnest_train.approx_cnts >earnest_train.approx2
$ sfstperplexity --unknown_label=1001 -phi_label=0 earnest_train.approx2 earnest_test.far
# of sources 838
cross entropy/source 49.9231
perplexity/symbol 73.4145
# of OOVs 1363
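Since both procedures target the same topology, the one-step and two-step results should coincide. One way to verify this (assuming the two runs produce structurally identical FSTs) is OpenFst's fstequal, which compares states, arcs, and weights up to a small tolerance:
$ fstequal --delta=1e-5 earnest_train.approx earnest_train.approx2 && echo "equivalent"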
Test Set-Based Target Topology
If we use the test-set bigram as the target topology, we would expect a better test-set perplexity, since the topology then includes all the relevant bigrams.
$ sfstapprox -phi_label=0 earnest_train.mod earnest_test.mod >earnest_train.approx_test
$ sfstperplexity --unknown_label=1001 -phi_label=0 earnest_train.approx_test earnest_test.far
# of sources 838
cross entropy/source 49.3171
perplexity/symbol 69.6842
# of OOVs 1363
Pruned Target Topology
If we use the pruned training-set bigram as the target topology, we get a different weighting from the greedy relative-entropy pruning performed above. This approximation seeks the minimum-KL-divergence solution on that topology and, as shown below, gives a somewhat better test-set perplexity (83.1204) than the 84.1474 obtained above.
$ sfstapprox -phi_label=0 earnest_train.mod earnest_train.pru >earnest_train.pru.approx
$ sfstperplexity --unknown_label=1001 -phi_label=0 earnest_train.pru.approx earnest_test.far
# of sources 838
cross entropy/source 51.3661
perplexity/symbol 83.1204
# of OOVs 1363