OpenGrm NGram Library Quick Tour
This tour is organized around the stages of n-gram model creation, modification and use:
- model format and textual I/O (ngramread and ngramprint)
- n-gram counting (ngramcount)
- n-gram model parameter estimation (ngrammake)
- n-gram model merging and pruning (ngrammerge and ngramshrink)
- n-gram model application (ngramapply)
Model Format
All n-gram models produced by these utilities, including those with unnormalized counts, are represented as cyclic weighted finite-state transducers, encoded using the OpenFst library. An n-gram is a sequence of k symbols: w1 ... wk. Let N be the set of n-grams in the model.
- There is a unigram state in every model, representing the empty string.
- Every proper prefix of every n-gram in N has an associated state in the model.
- The state associated with an n-gram w1 ... wk has a backoff transition (labeled with 〈epsilon〉) to the state associated with its suffix w2 ... wk.
- An n-gram w1 ... wk is represented as a transition, labeled with wk, from the state associated with its prefix w1 ... wk-1 to a destination state defined as follows:
- If w1 ... wk is a proper prefix of an n-gram in the model, then the destination of the transition is the state associated with w1 ... wk.
- Otherwise, the destination of the transition is the state associated with the suffix w2 ... wk.
- Start and end of the sequence are not represented via transitions in the automaton or symbols in the symbol table. Rather:
- The start state of the automaton encodes the "start of sequence" n-gram prefix (commonly denoted 〈s〉).
- The end of the sequence (often denoted 〈/s〉) is included in the model through state final weights, i.e., for a state associated with an n-gram prefix w1 ... wk, the final weight of that state represents the weight of the n-gram w1 ... wk 〈/s〉.
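The state and backoff rules above can be made concrete with a small sketch. This is purely illustrative (the function names are not library APIs): given a toy set of n-grams, it derives the model's states and each state's backoff target as described.

```python
# Hypothetical sketch (not part of the library) of the topology described
# above: every proper prefix of an n-gram has a state, plus the unigram
# state for the empty string; each state backs off to its suffix's state.
def model_states(ngrams):
    """Return the model's states: the unigram state () plus every proper
    prefix of every n-gram in the model."""
    states = {()}  # the unigram state, representing the empty string
    for ng in ngrams:
        for i in range(1, len(ng)):
            states.add(ng[:i])
    return states

def backoff_target(state):
    """The state for w1 ... wk backs off (via epsilon) to the state
    associated with its suffix w2 ... wk."""
    return state[1:]

ngrams = {("a",), ("b",), ("a", "b"), ("a", "b", "b")}
states = model_states(ngrams)
# ("a", "b") is a proper prefix of ("a", "b", "b"), so it gets its own
# state, which backs off to the state for ("b",).
```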
Textual I/O
ngramread is a command line utility for reading in text files and producing FSTs appropriate for use by other functions and utilities. It has flags for specifying the format of the text input, currently one of three options:
- By default, each line in the text is read as a white-space delimited sequence of symbols, one string per line. Each string is encoded as a linear automaton, and the resulting set of automata are concatenated into a single archive. The final automaton in the archive holds the symbol table for the archive. The archive can then be used by ngramcount to count n-grams.
- By using the flag --counts, the text file is read as a sorted list of n-grams with their counts, in the format:
w1 ... wk cnt
where w1 ... wk are the k words of the n-gram and cnt is the (float) count of that n-gram. The n-grams in the list must be lexicographically ordered. An n-gram count automaton is built from the input.
- By using the flag --ARPA, the file is read as an n-gram model in the well-known ARPA format. An n-gram model automaton is built from the input.
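For concreteness, a fragment of a --counts input file might look like the following (a hypothetical example with invented counts, lexicographically ordered):

```
A 666
A HAND 1
A HAND BAG 1
HAND 1
HAND BAG 1
```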
By default, ngramread constructs a symbol table on the fly, consisting of 〈epsilon〉 and every symbol observed in the text. With the flag --symbols=filename you can instead provide a fixed symbol table, in the standard OpenFst format. All symbols in the input text not found in the provided symbol table will be mapped to an OOV symbol, which is 〈unk〉 by default. The flag --OOV_symbol can be used to specify the OOV symbol in the provided symbol table if it is not 〈unk〉. For reading n-gram counts and ARPA format models, the tokens 〈s〉 and 〈/s〉 are taken to represent start-of-sequence and end-of-sequence, respectively. Neither of these symbols is used in our automaton format (see above).
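The OOV mapping just described can be sketched in a couple of lines (an illustration of the behavior, not the library's code):

```python
# Sketch of the OOV mapping behavior: tokens absent from a fixed symbol
# table are replaced by the OOV symbol (<unk> by default).
def map_oov(tokens, symbol_table, oov_symbol="<unk>"):
    return [t if t in symbol_table else oov_symbol for t in tokens]

symbols = {"<epsilon>", "A", "HAND", "BAG"}
print(map_oov(["A", "HANDBAG"], symbols))  # ['A', '<unk>']
```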
For example, the text of Oscar Wilde's Importance of Being Earnest, using the suitably normalized copy found here, can be converted into a concatenated archive of automata (similar to the finite-state archive format) with:
$ ngramread earnest.txt >earnest.cat
ngramprint is a command line utility for reading in n-gram models and producing text files. Since both raw counts and normalized models are encoded with the same automaton structure, either can be used as input. There are multiple options for output.
- By default, only n-grams are printed (without backoff 〈epsilon〉 transitions), in the same format as discussed above for reading in n-gram counts: w1 ... wk score, where the score will be either the n-gram count or the n-gram probability, depending on whether the model has been normalized. By default, scores are converted from the internal negative log representation to real semiring counts or probabilities.
- By using the flag --ARPA, the n-gram model is printed in the well-known ARPA format.
- By using the flag --backoff, backoff 〈epsilon〉 transitions are printed along with the n-grams.
- By using the flag --negativelogs, scores are shown as negative logs, rather than being converted to the real semiring.
- By using the flag --integers, scores are converted to the real semiring and rounded to integers.
For writing n-gram counts and ARPA format models, the tokens 〈s〉 and 〈/s〉 are used to represent start-of-sequence and end-of-sequence, respectively. Neither of these symbols is used in our automaton format (see above).
For example, using the example 5-gram model created below, the following prints out a portion of it in ARPA format:
$ ngramprint --ARPA earnest.mod | head -15
\data\
ngram 1=2306
ngram 2=10319
ngram 3=14796
ngram 4=15218
ngram 5=14170
\1-grams:
-99 <s> -0.9399067
-1.064551 </s>
-3.337681 MORNING -0.3590219
-2.990894 ROOM -0.4771213
-1.857355 IN -0.6232494
-2.87695 ALGERNON -0.4771213
N-gram Counting
ngramcount is a command line utility for counting n-grams from an input corpus, as prepared by ngramread. It produces an n-gram model in the FST format described above. Transitions and final costs are weighted with the negative log count of the associated n-gram. The flag --order sets the maximum length of n-gram to count; all n-grams of length less than or equal to the specified order that are observed in the input corpus will be counted. By default, the order is set to 3 (trigram model).
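The counting behavior can be sketched as follows (illustrative only; the actual tool builds the count FST described above, with start and end of sequence handled via the start state and final weights rather than explicit tokens):

```python
from collections import Counter

def count_ngrams(sentences, order=3):
    """Count every n-gram of length <= order observed in the corpus."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for k in range(1, order + 1):
            for i in range(len(tokens) - k + 1):
                counts[tuple(tokens[i:i + k])] += 1
    return counts

counts = count_ngrams(["A HAND BAG"], order=2)
# counts now holds ("A",), ("HAND",), ("BAG",), ("A", "HAND"), ("HAND", "BAG")
```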
The 1-gram through 5-gram counts for the earnest.cat finite-state archive file created above can be created with:
$ ngramcount --order=5 earnest.cat >earnest.cnts
N-gram Model Parameter Estimation
ngrammake is a command line utility for normalizing and smoothing an n-gram model. It takes as input the FST produced by ngramcount, which contains raw, unnormalized counts.
The 5-gram counts in earnest.cnts created above can be converted into an n-gram model with:
$ ngrammake earnest.cnts >earnest.mod
Here is a random path generated through this FST:
$ fstrandgen --select=log_prob earnest.mod | fstprint | cut -f3 | tr '\n' ' '
I <epsilon> WOULD STRONGLY <epsilon> ADVISE YOU MR WORTHING TO TRY <epsilon> AND <epsilon> ACQUIRE <epsilon> SOME RELATIONS AS <epsilon> <epsilon> <epsilon> FAR AS THE PIANO IS CONCERNED <epsilon> SENTIMENT <epsilon> IS MY FORTE <epsilon>
(An epsilon transition is emitted for each backoff.) Note that the model is encoded as a backoff model, so the epsilons have a particular semantics; random generation with the general fstrandgen utility is therefore not exact. See the random generation options under ngramapply below.
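The particular semantics of those backoff epsilons is the standard backoff recursion: use the probability of the full n-gram if it is in the model, otherwise multiply the history's backoff weight by the probability under the backed-off history. A hypothetical sketch (probs and bow are illustrative names, not library structures):

```python
def backoff_prob(w, hist, probs, bow):
    """Probability of w after history hist under a backoff model:
    use the full n-gram if present, otherwise scale by the backoff
    weight of hist and recurse on the backed-off history hist[1:]."""
    ngram = hist + (w,)
    if ngram in probs:
        return probs[ngram]
    if not hist:
        raise KeyError(f"{w} not in the model")
    return bow.get(hist, 1.0) * backoff_prob(w, hist[1:], probs, bow)

probs = {("a",): 0.5, ("b",): 0.5, ("a", "b"): 0.9}
bow = {("b",): 0.4}
print(backoff_prob("b", ("a",), probs, bow))  # bigram present: 0.9
print(backoff_prob("a", ("b",), probs, bow))  # backed off: 0.4 * 0.5
```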
N-gram Model Merging and Pruning
ngrammerge is a command line utility for merging two n-gram models (either unnormalized counts or smoothed, normalized models) into a single model. ngramshrink is a command line utility for pruning n-gram models.
The following shrinks the 5-gram model created above, using relative entropy pruning, to roughly 1/10 its original size:
$ ngramshrink --method=relative_entropy --theta=.00015 earnest.mod >earnest.pru
A random path generated through this FST is:
$ fstrandgen --select=log_prob earnest.pru | fstprint | cut -f3 | tr '\n' ' '
I THINK <epsilon> BE ABLE TO <epsilon> DIARY GWENDOLEN WONDERFUL SECRETS MONEY <epsilon> YOU <epsilon>
See discussion of random generation below.
N-gram Model Application
ngramapply is a command line utility for applying n-gram models. It can be called to apply a model to a concatenated archive of automata, or to randomly sample strings from the model correctly. Prior to random generation, ngramapply converts the model from the backoff representation back to a mixture representation, so that the transitions have the correct semantics for taking a simple random path through the automaton. To sample randomly from a given model, use the flag --samples as follows:
$ ngramapply --samples=1 earnest.mod
IT IS SIMPLY A VERY INEXCUSABLE MANNER
To see the backoff arcs when randomly generating, use the flag --show_backoff as follows:
$ ngramapply --show_backoff --samples=1 earnest.mod
YOUR BROTHER WAS I BELIEVE UNMARRIED WAS <epsilon> <epsilon> HE NOT <epsilon>
The following calculates the perplexity of two strings (A HAND BAG and BAG HAND A) using the example 5-gram model generated above:
$ echo -e "A HAND BAG\nBAG HAND A" | ngramread | ngramapply --v=1 earnest.mod -
A HAND BAG
ngram -logprob
N-gram probability found (base10)
p( A | <s> ) = [2gram] 1.87984
p( HAND | A ...) = [2gram] 2.56724
p( BAG | HAND ...) = [3gram] 0.0457417
p( </s> | BAG ...) = [4gram] 0.507622
1 sentences, 3 words, 0 OOVs
logprob(base 10)= -5.00044; perplexity (base 10)= 17.7873
BAG HAND A
ngram -logprob
N-gram probability found (base10)
p( BAG | <s> ) = [1gram] 4.02771
p( HAND | BAG ...) = [1gram] 3.35968
p( A | HAND ...) = [1gram] 2.51843
p( </s> | A ...) = [1gram] 1.53325
1 sentences, 3 words, 0 OOVs
logprob(base 10)= -11.4391; perplexity (base 10)= 724.048
2 sentences, 6 words, 0 OOVs
logprob(base 10)= -16.4395; perplexity (base 10)= 113.485
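The perplexity figures in the output above follow directly from the total base-10 log probability: each sentence contributes its word count plus one end-of-sequence (〈/s〉) event, and perplexity is 10 raised to the negated per-event average. A quick check (the helper below is illustrative, not part of the toolkit):

```python
def perplexity(logprob_base10, words, sentences):
    """Perplexity from a total base-10 log probability; each sentence
    adds one </s> event on top of its word count."""
    events = words + sentences
    return 10 ** (-logprob_base10 / events)

# Values taken from the ngramapply output above; results agree with the
# reported perplexities up to rounding of the printed log probabilities.
print(perplexity(-5.00044, 3, 1))    # close to 17.7873
print(perplexity(-11.4391, 3, 1))    # close to 724.048
print(perplexity(-16.4395, 6, 2))    # close to 113.485
```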
ngramapply will soon have more general model application methods (using fstcompose with a phi matcher).
--
BrianRoark - 12 Nov 2010