This tour is organized around the stages of n-gram model creation, modification and use:

- corpus I/O (
`ngramsymbols`

,`farcompilestrings`

and`farprintstrings`

) - n-gram model format
- n-gram counting (
`ngramcount`

) - n-gram model parameter estimation (
`ngrammake`

) - n-gram model merging, pruning and constraining (
`ngrammerge`

,`ngramshrink`

and`ngrammarginalize`

) - model I/O (
`ngramread`

,`ngramprint`

and`ngraminfo`

) - n-gram model sampling, application and evaluation (
`ngramrandgen`

,`ngramapply`

and`ngramperplexity`

)

For additional details, follow the links to each operation's full documentation found in each section and in tthe summary table of available operations below.

Text corpora are represented as binary finite-state archives, with one automaton per sentence. This provides efficient later processing by the NGram Library utilities and allows if desired more general probabilistic input (e.g. weighted DAGs or *lattices*).

The first step is to generate an OpenFst-style symbol table for the text tokens in input corpus. This
can be done with the command-line utility ngramsymbols. For example, the symbols in the text of Oscar Wilde's *Importance of Being Earnest*, using the suitably normalized copy found here, can be extracted with:

$ ngramsymbols earnest.txt earnest.syms

If multiple corpora, e.g. for a separate training set and a test set, are to be processed together, the same symbol table should be used throughout. This can
be accomplished by concatenating the corpora when passed to `ngramsymbols`

, eliminating out-of-vocabulary symbols. By default, `ngramsymbols`

creates symbol table entries for *<epsilon>* and an out-of-vocabulary token *<UNK>*. The identity of these labels can be changed using flags. A flag can then be passed to farcompilestrings to specify the out-of-vocabulary label, so that words not in the symbol table will get mapped to that index.

Given a symbol table, a text corpus can be converted to a binary FAR archive with:

$ farcompilestrings --fst_type=compact --symbols=earnest.syms --keep_symbols earnest.txt earnest.far

and can be printed with:

$ farprintstrings earnest.far > earnest.txt

All n-gram models produced by the utilities here, including those with unnormalized counts, have a cyclic weighted finite-state transducer (FST) format, encoded using the OpenFst library. For the precise details of the n-gram format, see here.

The model is normally stored in a general-purpose, mutable ( VectorFst) format, which is convenient for the various processing steps described below.This can be converted to a more compact (but immutable) format specifically for n-gram models ( NGramFst) when the desired final model is generated.

ngramcount is a command line utility for counting n-grams from an input corpus, represented in FAR format. It produces an n-gram model in the FST format described above. Transitions and final costs are weighted with the negative log count of the associated n-gram. By using the switch *--order* the maximum length n-gram to count can be chosen. All n-grams observed in the input corpus of length less than or equal to the specified order will be counted. By default, the order is set to 3 (trigram model).

The 1-gram through 5-gram counts for the `earnest.far`

finite-state archive file created above can be created with:

$ ngramcount --order=5 earnest.far earnest.cnts

ngrammake is a command line utility for normalizing and smoothing an n-gram model. It takes as input the FST produced by `ngramcount`

(which contains raw, unnormalized counts).

The 5-gram counts in `earnest.cnts`

created above can be converted into a n-gram model with:

$ ngrammake earnest.cnts earnest.mod

Flags to ngrammake specify the smoothing (e.g. Katz, Knesser-Ney, etc) used with the default being Katz.

Here is a generated sesntence from the language model (using `ngramrandgen`

, which is described below):

$ ngramrandgen earnest.mod | farprintstrings I <epsilon> WOULD STRONGLY <epsilon> ADVISE YOU MR WORTHING TO TRY <epsilon> AND <epsilon> ACQUIRE <epsilon> SOME RELATIONS AS <epsilon> <epsilon> <epsilon> FAR AS THE PIANO IS CONCERNED <epsilon> SENTIMENT <epsilon> IS MY FORTE <epsilon>

(An epsilon transition is emitted for each backoff.)

ngrammerge is a command line utility for merging two n-gram models into a single model -- either unnormalized counts or smoothed, normalized models. For example, suppose we split our corpus up into two parts, earnest.aa and earnest.ab, and derive 5-gram counts from each independently using *ngramcount* as shown above. We can then merge the counts to get the same counts as derived above from the full corpus (earnest.cnts):

$ ngrammerge earnest.aa.cnts earnest.ab.cnts earnest.merged.cnts $ fstequal earnest.cnts earnest.merged.cnts

Note that, unlike our example merging unnormalized counts above, merging two smoothed models that have been built from half a corpus each will result in a different model than one built from the corpus as a whole, due to the smoothing and mixing. Each of the two model or count FSTs can be weighted, using the *--alpha* switch for the first input FST, and the *--beta* switch for the second input FST.

ngramshrink is a command line utility for pruning n-gram models.

The following command shrinks the 5-gram model created above using entropy pruning to roughly 1/10 the original size:

$ ngramshrink --method=relative_entropy --theta=.00015 earnest.mod earnest.pru

A random sentence generated through this LM is:

$ ngramrandgen earnest.pru | farprintstrings I THINK <epsilon> BE ABLE TO <epsilon> DIARY GWENDOLEN WONDERFUL SECRETS MONEY <epsilon> YOU <epsilon>

ngrammarginalize is a command line utility for re-estimating smoothed n-gram models using marginalization constraints similar to Kneser-Ney smoothing.

The following imposes marginalization constraints on the 5-gram model created above:

$ ngrammarginalize earnest.mod earnest.marg.mod

This functionality is available in version 1.1.0 and higher. Note that this algorithm may need to be run for several iterations, using the *--iterations* switch. See full operation documentation for further considerations and references.

ngramprint is a command line utility for reading in n-gram models and producing text files. Both raw counts and normalized models are encoded with the same automaton structure, so either can be accessed for this function. There are multiple options for output. For example, using the example 5-gram model created below, the following prints out a portion of it in ARPA format:

$ ngramprint --ARPA earnest.mod earnest.ARPA $ head -15 earnest.ARPA \data\ ngram 1=2306 ngram 2=10319 ngram 3=14796 ngram 4=15218 ngram 5=14170 \1-grams: -99 <s> -0.9399067 -1.064551 </s> -3.337681 MORNING -0.3590219 -2.990894 ROOM -0.4771213 -1.857355 IN -0.6232494 -2.87695 ALGERNON -0.4771213

ngramread is a command line utility for reading in textual representations of n-gram models and producing FSTs appropriate for use by other functions and utilities. It has several options for input. For example,

$ ngramread --ARPA earnest.ARPA earnest.mod

generates a n-gram model in FST format from the ARPA n-gram language model specification.

ngraminfo is a command-line utility that prints out various information about an n-gram language model in FST format.

$ ngraminfo earnest.mod # of states 39076 # of ngram arcs 51618 # of backoff arcs 39075 initial state 1 unigram state 0 # of final states 5190 ngram order 5 # of 1-grams 2305 # of 2-grams 10319 # of 3-grams 14796 # of 4-grams 15218 # of 5-grams 14170 well-formed y normalized y

ngramrandgen is a command line utility for sampling from n-gram models.

$ ngramrandgen --max_sents=1 earnest.mod | farprintstrings IT IS SIMPLY A VERY INEXCUSABLE MANNER

ngramapply is a command line utility for applying n-gram models. It can be called to apply a model to a concatenated archive of automata:

$ ngramapply earnest.mod earnest.far | farprintstrings --print_weight

The result is a FAR weighted by the n-gram model.

ngramperplexity can be used to evaluate an n-gram model. For example, the following
calculates the perplexity of two strings (*a hand bag* and *bag hand a*) from the example 5-gram model generated above:

echo -e "A HAND BAG\nBAG HAND A" |\ farcompilestrings --generate_keys=1 -symbols=earnest.syms --keep_symbols |\ ngramperplexity --v=1 earnest.mod - A HAND BAG ngram -logprob N-gram probability found (base10) p( A | <s> ) = [2gram] 1.87984 p( HAND | A ...) = [2gram] 2.56724 p( BAG | HAND ...) = [3gram] 0.0457417 p( </s> | BAG ...) = [4gram] 0.507622 1 sentences, 3 words, 0 OOVs logprob(base 10)= -5.00044; perplexity (base 10)= 17.7873 BAG HAND A ngram -logprob N-gram probability found (base10) p( BAG | <s> ) = [1gram] 4.02771 p( HAND | BAG ...) = [1gram] 3.35968 p( A | HAND ...) = [1gram] 2.51843 p( </s> | A ...) = [1gram] 1.53325 1 sentences, 3 words, 0 OOVs logprob(base 10)= -11.4391; perplexity (base 10)= 724.048 2 sentences, 6 words, 0 OOVs logprob(base 10)= -16.4395; perplexity (base 10)= 113.485

The OpenGrm NGram library is a C++ library. Users can call the available operations from that level rather than from the command line if desired. From C++, include `<ngram/ngram.h>`

in the installation include directory and link to `libfst.so`

, `libfar.so`

, and `libngram.so`

in the installation library directory. This assumes you've installed OpenFst (with `--enable-far=yes`

). (You may instead use just those include files for
the classes and functions that you will need.) All classes and functions are in the `ngram`

namespace.

As mentioned earlier, each n-gram model, including those with unnormalized counts, is represented as a weighted FST. Each of the n-gram operation classes holds the FST in the common base class `NGramModel`

. A partial description of this class follows:

template <class Arc> class NGramModel { public: typedef typename Arc::StateId StateId; // Construct an NGramModel object, consisting of the FST and some // information about the states under the assumption that the FST is // a model. explicit NGramModel(const Fst<Arc> &infst); // Returns highest n-gram order. int HiOrder() const; // Returns order of a given state. int StateOrder(StateId state) const; // Returns the unigram state. StateId UnigramState() const; // Validates model has a well-formed n-gram topology bool CheckTopology() const; // Validates that states are fully normalized (probabilities sum to 1.0) bool CheckNormalization() const; // Gets a const reference to the internal (expanded) FST. const Fst<Arc> &GetFst() const; private: const Fst<Arc> &fst_; };

From this class is derived NGramCount for counting, NGramMake for parameter estimation/smoothing, NGramShrink for
model pruning, NGramMerge for model interpolation/merging (among others). `NGramMake`

and `NGramShrink`

are further sub-classed for each specific smoothing and pruning method. For example, `NGramMake`

has methods (some abstract) common to most/all parameter estimation/smoothing techniques while `NGramKatz`

has the specific implementations for that method.

Click on operation name for additional information.

Operation | Usage | Description |
---|---|---|

NGramApply | ngramapply [--bo_arc_type] ngram.fst [in.far [out.far]] | Intersect n-gram model with fst archive |

NGramCount | ngramcount [--order] [in.far [out.fst]] | count n-grams from fst archive |

NGramCounter(order); | --- n-gram counter | |

NGramInfo | ngraminfo [in.mod] | print various information about an n-gram model |

NGramMake | ngrammake [--method] [--backoff] [--bins] [--witten_bell_k] [--discount_D] [in.fst [out.fst]] | n-gram model smoothing and normalization |

NGramAbsolute(&CountFst); | --- Absolute Discount smoothing | |

NGramKatz(&CountFst); | --- Katz smoothing | |

NGramKneserNey(&CountFst); | --- Kneser Ney smoothing | |

NGramUnsmoothed(&CountFst); | --- no smoothing | |

NGramWittenBell(&CountFst); | --- Witten-Bell smoothing | |

NGramMarginal | ngrammarginalize [--iterations] [--max_bo_updates] [--output_each_iteration] [--steady_state_file] [in.mod [out.mod]] | impose marginalization constraints on input model |

NGramMarginal(&M); | --- n-gram marginalization constraint class | |

NGramMerge | ngrammerge [--alpha] [--beta] [--use_smoothing] [--normalize] in1.fst in2.fst [out.fst] | merge two count or model FSTs |

NGramMerge(&M1, &M2, alpha, beta); | --- n-gram merge class | |

NGramPerplexity | ngramperplexity [--OOV_symbol] [--OOV_class_size] [--OOV_probability] ngram.fst [in.far [out.txt]] | calculate perplexity of input corpus from model |

NGramPrint | ngramprint [--ARPA] [--backoff] [--integers] [--negativelogs] [in.fst [out.txt]] | print n-gram model to text file |

NGramRandgen | ngramrandgen [--max_sents] [--max_length] [--seed] [in.mod [out.far]] | randomly sample sentences from an n-gram model |

NGramRead | ngramread [--ARPA] [--epsilon_symbol] [--OOV_symbol] [in.txt [out.fst]] | read n-gram counts or model from file |

NGramShrink | ngramshrink [--method=count,relative_entropy,seymore] [-count_pattern] [-theta] [in.mod [out.mod]] | n-gram model pruning |

NGramCountPrune(&M, count_pattern); | --- count-based model pruning | |

NGramRelativeEntropy(&M, theta); | --- relative-entropy-based model pruning | |

NGramSeymoreShrink(&M, theta); | --- Seymore/Rosenfeld-based model pruning | |

NGramSymbols | ngramsymbols [--epsilon_symbol] [--OOV_symbol] [in.txt [out.txt]] | create symbol table from corpus |

The shell script `ngramtdisttrain.sh`

is provided to run some common OpenGrm NGram pipelines of commands and to provide some rudimentary distributed computation support.
For example:

$ ngramdisttrain.sh --itype=text_sents --otype=pruned_lm --ifile=in.txt --ofile=lm.fst --symbols=in.syms --order=5 --smooth_method=katz --shrink_method=relative_entropy --theta=.00015

will read a text corpus in the format accepted by `farcompilestrings`

and output a backoff 5-gram LM pruned with a relative entropy threshold of .00015. See `ngramdisttrain.sh --help`

for available options
and values and see here for a discussion of the distributed computation support.

I | Attachment | History | Action | Size | Date | Who | Comment |
---|---|---|---|---|---|---|---|

txt | earnest.txt | r1 | manage | 89.0 K | 2010-11-05 - 02:14 | MichaelRiley |

Topic revision: r35 - 2022-08-06 - KyleGorman

Copyright © 2008-2023 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.

Ideas, requests, problems regarding TWiki? Send feedback

Ideas, requests, problems regarding TWiki? Send feedback