OpenGrm NGram Library Quick Tour
This tour is organized around the stages of n-gram model creation, modification and use:
- model format and textual I/O (ngramread and ngramprint)
- n-gram counting (ngramcount)
- n-gram model parameter estimation (ngrammake)
- n-gram model merging and pruning (ngrammerge and ngramshrink)
- n-gram model application (ngramapply)
Model Format
All n-gram models produced by these utilities, including those with unnormalized counts, are represented as cyclic weighted finite-state transducers, encoded using the OpenFst library. An n-gram is a sequence of k symbols: w1 ... wk. Let N be the set of n-grams in the model.
- There is a unigram state in every model, representing the empty string.
- Every proper prefix of every n-gram in N has an associated state in the model.
- The state associated with an n-gram w1 ... wk has a backoff transition (labeled with 〈epsilon〉) to the state associated with its suffix w2 ... wk.
- An n-gram w1 ... wk is represented as a transition, labeled with wk, from the state associated with its prefix w1 ... wk-1 to a destination state defined as follows:
- If w1 ... wk is a proper prefix of an n-gram in the model, then the destination of the transition is the state associated with w1 ... wk.
- Otherwise, the destination of the transition is the state associated with the suffix w2 ... wk.
- Start and end of the sequence are not represented via transitions in the automaton or symbols in the symbol table. Rather:
- The start state of the automaton encodes the "start of sequence" n-gram prefix (commonly denoted 〈s〉).
- The end of the sequence (often denoted 〈/s〉) is included in the model through state final weights, i.e., for a state associated with an n-gram prefix w1 ... wk, the final weight of that state represents the weight of the n-gram w1 ... wk 〈/s〉.
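The state and backoff rules above can be made concrete with a small sketch. This is purely illustrative (the function names are not library APIs): given a toy set of n-grams, it derives the model's states and each state's backoff target as described.

```python
# Hypothetical sketch (not part of the library) of the topology described
# above: every proper prefix of an n-gram has a state, plus the unigram
# state for the empty string; each state backs off to its suffix's state.
def model_states(ngrams):
    """Return the model's states: the unigram state () plus every proper
    prefix of every n-gram in the model."""
    states = {()}  # the unigram state, representing the empty string
    for ng in ngrams:
        for i in range(1, len(ng)):
            states.add(ng[:i])
    return states

def backoff_target(state):
    """The state for w1 ... wk backs off (via epsilon) to the state
    associated with its suffix w2 ... wk."""
    return state[1:]

ngrams = {("a",), ("b",), ("a", "b"), ("a", "b", "b")}
states = model_states(ngrams)
# ("a", "b") is a proper prefix of ("a", "b", "b"), so it gets its own
# state, which backs off to the state for ("b",).
```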
Textual I/O
ngramread is a command line utility for reading in text files and producing FSTs appropriate for use by other functions and utilities. It has flags for specifying the format of the text input, currently one of three options:
- By default, each line in the text is read as a white-space delimited sequence of symbols, one string per line. Each string is encoded as a linear automaton, and the resulting set of automata are concatenated into a single archive. The final automaton in the archive holds the symbol table for the archive. The archive can then be used by ngramcount to count n-grams.
- By using the flag --counts, the text file is read as a sorted list of n-grams with their counts, in the format:
w1 ... wk cnt
where w1 ... wk are the k words of the n-gram and cnt is the (float) count of that n-gram. The n-grams in the list must be lexicographically ordered. An n-gram count automaton is built from the input.
- By using the flag --ARPA, the file is read as an n-gram model in the well-known ARPA format. An n-gram model automaton is built from the input.
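For concreteness, a fragment of a --counts input file might look like the following (a hypothetical example with invented counts, lexicographically ordered):

```
A 666
A HAND 1
A HAND BAG 1
HAND 1
HAND BAG 1
```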
By default, ngramread constructs a symbol table on the fly, consisting of 〈epsilon〉 and every symbol observed in the text. With the flag --symbols=filename you can instead provide a fixed symbol table, in the standard OpenFst format. All symbols in the input text not found in the provided symbol table will be mapped to an OOV symbol, which is 〈unk〉 by default. The flag --OOV_symbol can be used to specify the OOV symbol in the provided symbol table if it is not 〈unk〉. For reading n-gram counts and ARPA format models, the tokens 〈s〉 and 〈/s〉 are taken to represent start-of-sequence and end-of-sequence, respectively. Neither of these symbols is used in our automaton format (see above).
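The OOV mapping just described can be sketched in a couple of lines (an illustration of the behavior, not the library's code):

```python
# Sketch of the OOV mapping behavior: tokens absent from a fixed symbol
# table are replaced by the OOV symbol (<unk> by default).
def map_oov(tokens, symbol_table, oov_symbol="<unk>"):
    return [t if t in symbol_table else oov_symbol for t in tokens]

symbols = {"<epsilon>", "A", "HAND", "BAG"}
print(map_oov(["A", "HANDBAG"], symbols))  # ['A', '<unk>']
```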
For example, the text of Oscar Wilde's Importance of Being Earnest, using the suitably normalized copy found here, can be converted into a concatenated archive of automata (similar to the finite-state archive format) with:
$ ngramread earnest.txt >earnest.cat
ngramprint is a command line utility for reading in n-gram models and producing text files. Since both raw counts and normalized models are encoded with the same automaton structure, either can be used as input. There are multiple options for output.
- By default, only n-grams are printed (without backoff 〈epsilon〉 transitions), in the same format as discussed above for reading in n-gram counts: w1 ... wk score, where the score will be either the n-gram count or the n-gram probability, depending on whether the model has been normalized. By default, scores are converted from the internal negative log representation to real semiring counts or probabilities.
- By using the flag --ARPA, the n-gram model is printed in the well-known ARPA format.
- By using the flag --backoff, backoff 〈epsilon〉 transitions are printed along with the n-grams.
- By using the flag --negativelogs, scores are shown as negative logs, rather than being converted to the real semiring.
- By using the flag --integers, scores are converted to the real semiring and rounded to integers.
For writing n-gram counts and ARPA format models, the tokens 〈s〉 and 〈/s〉 are used to represent start-of-sequence and end-of-sequence, respectively. Neither of these symbols is used in our automaton format (see above).
For example, using the example 5-gram model created below, the following prints out a portion of it in ARPA format:
$ ngramprint --ARPA earnest.mod | head -15
\data\
ngram 1=2306
ngram 2=10319
ngram 3=14796
ngram 4=15218
ngram 5=14170
\1-grams:
-99 <s> -0.9399067
-1.064551 </s>
-3.337681 MORNING -0.3590219
-2.990894 ROOM -0.4771213
-1.857355 IN -0.6232494
-2.87695 ALGERNON -0.4771213
N-gram Counting
ngramcount is a command line utility for counting n-grams from an input corpus, as prepared by ngramread. It produces an n-gram model in the FST format described above. Transitions and final costs are weighted with the negative log count of the associated n-gram. The flag --order sets the maximum length of n-gram to count; all n-grams of length less than or equal to the specified order that are observed in the input corpus will be counted. By default, the order is set to 3 (trigram model).
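The counting behavior can be sketched as follows (illustrative only; the actual tool builds the count FST described above, with start and end of sequence handled via the start state and final weights rather than explicit tokens):

```python
from collections import Counter

def count_ngrams(sentences, order=3):
    """Count every n-gram of length <= order observed in the corpus."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for k in range(1, order + 1):
            for i in range(len(tokens) - k + 1):
                counts[tuple(tokens[i:i + k])] += 1
    return counts

counts = count_ngrams(["A HAND BAG"], order=2)
# counts now holds ("A",), ("HAND",), ("BAG",), ("A", "HAND"), ("HAND", "BAG")
```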
The 1-gram through 5-gram counts for the earnest.cat finite-state archive file created above can be created with:
$ ngramcount --order=5 earnest.cat >earnest.cnts
N-gram Model Parameter Estimation
ngrammake is a command line utility for normalizing and smoothing an n-gram model. It takes as input the FST produced by ngramcount, which contains raw, unnormalized counts.
The 5-gram counts in earnest.cnts created above can be converted into an n-gram model with:
$ ngrammake earnest.cnts >earnest.mod
Here is a random path generated through this FST:
$ fstrandgen --select=log_prob earnest.mod | fstprint | cut -f3 | tr '\n' ' '
I <epsilon> WOULD STRONGLY <epsilon> ADVISE YOU MR WORTHING TO TRY <epsilon> AND <epsilon> ACQUIRE <epsilon> SOME RELATIONS AS <epsilon> <epsilon> <epsilon> FAR AS THE PIANO IS CONCERNED <epsilon> SENTIMENT <epsilon> IS MY FORTE <epsilon>
(An epsilon transition is emitted for each backoff.) Note that the model is encoded as a backoff model, so the epsilons have a particular semantics; random generation with the general fstrandgen utility is therefore not exact. See the random generation options under ngramapply below.
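The particular semantics of those backoff epsilons is the standard backoff recursion: use the probability of the full n-gram if it is in the model, otherwise multiply the history's backoff weight by the probability under the backed-off history. A hypothetical sketch (probs and bow are illustrative names, not library structures):

```python
def backoff_prob(w, hist, probs, bow):
    """Probability of w after history hist under a backoff model:
    use the full n-gram if present, otherwise scale by the backoff
    weight of hist and recurse on the backed-off history hist[1:]."""
    ngram = hist + (w,)
    if ngram in probs:
        return probs[ngram]
    if not hist:
        raise KeyError(f"{w} not in the model")
    return bow.get(hist, 1.0) * backoff_prob(w, hist[1:], probs, bow)

probs = {("a",): 0.5, ("b",): 0.5, ("a", "b"): 0.9}
bow = {("b",): 0.4}
print(backoff_prob("b", ("a",), probs, bow))  # bigram present: 0.9
print(backoff_prob("a", ("b",), probs, bow))  # backed off: 0.4 * 0.5
```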
N-gram Model Merging and Pruning
ngrammerge is a command line utility for merging two n-gram models (either unnormalized counts or smoothed, normalized models) into a single model. ngramshrink is a command line utility for pruning n-gram models.
The following shrinks the 5-gram model created above, using relative entropy pruning, to roughly 1/10 its original size:
$ ngramshrink --method=relative_entropy --theta=.00015 earnest.mod >earnest.pru
A random path generated through this FST is:
$ fstrandgen --select=log_prob earnest.pru | fstprint | cut -f3 | tr '\n' ' '
I THINK <epsilon> BE ABLE TO <epsilon> DIARY GWENDOLEN WONDERFUL SECRETS MONEY <epsilon> YOU <epsilon>
See discussion of random generation below.
N-gram Model Application
ngramapply is a command line utility for applying n-gram models. It can be called to apply a model to a concatenated archive of automata, or to randomly sample strings from the model correctly. Prior to random generation, ngramapply converts the model from the backoff representation back to a mixture representation, so that the transitions have the correct semantics for taking a simple random path through the automaton. To sample randomly from a given model, use the flag --samples as follows:
$ ngramapply --samples=1 earnest.mod
IT IS SIMPLY A VERY INEXCUSABLE MANNER
To see the backoff arcs when randomly generating, use the flag --show_backoff as follows:
$ ngramapply --show_backoff --samples=1 earnest.mod
YOUR BROTHER WAS I BELIEVE UNMARRIED WAS <epsilon> <epsilon> HE NOT <epsilon>
The following calculates the perplexity of two strings (A HAND BAG and BAG HAND A) using the example 5-gram model generated above:
$ echo -e "A HAND BAG\nBAG HAND A" | ngramread | ngramapply --v=1 earnest.mod -
A HAND BAG
ngram -logprob
N-gram probability found (base10)
p( A | <s> ) = [2gram] 1.87984
p( HAND | A ...) = [2gram] 2.56724
p( BAG | HAND ...) = [3gram] 0.0457417
p( </s> | BAG ...) = [4gram] 0.507622
1 sentences, 3 words, 0 OOVs
logprob(base 10)= -5.00044; perplexity (base 10)= 17.7873
BAG HAND A
ngram -logprob
N-gram probability found (base10)
p( BAG | <s> ) = [1gram] 4.02771
p( HAND | BAG ...) = [1gram] 3.35968
p( A | HAND ...) = [1gram] 2.51843
p( </s> | A ...) = [1gram] 1.53325
1 sentences, 3 words, 0 OOVs
logprob(base 10)= -11.4391; perplexity (base 10)= 724.048
2 sentences, 6 words, 0 OOVs
logprob(base 10)= -16.4395; perplexity (base 10)= 113.485
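The perplexity figures in the output above follow directly from the total base-10 log probability: each sentence contributes its word count plus one end-of-sequence (〈/s〉) event, and perplexity is 10 raised to the negated per-event average. A quick check (the helper below is illustrative, not part of the toolkit):

```python
def perplexity(logprob_base10, words, sentences):
    """Perplexity from a total base-10 log probability; each sentence
    adds one </s> event on top of its word count."""
    events = words + sentences
    return 10 ** (-logprob_base10 / events)

# Values taken from the ngramapply output above; results agree with the
# reported perplexities up to rounding of the printed log probabilities.
print(perplexity(-5.00044, 3, 1))    # close to 17.7873
print(perplexity(-11.4391, 3, 1))    # close to 724.048
print(perplexity(-16.4395, 6, 2))    # close to 113.485
```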
ngramapply will soon have more general model application methods (using fstcompose with a phi matcher).
--
BrianRoark - 12 Nov 2010