
OpenGrm NGram Library Quick Tour
This tour is organized around the stages of ngram model creation, modification and use.
For additional details, follow the links to each operation's full documentation found in each section and in the summary table of available operations below.

Corpus I/O

Text corpora are represented as binary finite-state archives (FARs), with one automaton per sentence. This provides efficient later processing by the NGram Library utilities and allows, if desired, more general probabilistic input (e.g. weighted DAGs or lattices).

The first step is to generate an OpenFst-style symbol table for the text tokens in the input corpus. This can be done with the command-line utility ngramsymbols. For example, the symbols in the text of Oscar Wilde's The Importance of Being Earnest, using the suitably normalized copy found here, can be extracted with:
$ ngramsymbols earnest.txt earnest.syms
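The resulting symbol table is a plain mapping from tokens to integer ids, with <epsilon> conventionally reserved as id 0. As a rough illustration only (ngramsymbols additionally handles reserved labels such as the OOV symbol), the core of what it produces can be approximated with standard Unix tools:

```shell
# Toy approximation of a symbol table: unique tokens, one integer id each,
# with <epsilon> reserved as id 0. Not a substitute for ngramsymbols.
printf 'A HAND BAG\n' |
  tr ' ' '\n' | sort -u |
  awk 'BEGIN{print "<epsilon>\t0"} {print $0 "\t" NR}'
```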
If multiple corpora, e.g. a separate training set and test set, are to be processed together, the same symbol table should be used throughout. This can be accomplished by concatenating the corpora when passed to ngramsymbols.
Given a symbol table, a text corpus can be converted to a binary FAR archive with:  
$ farcompilestrings --fst_type=compact --symbols=earnest.syms --keep_symbols earnest.txt earnest.far
and can be printed with:
$ farprintstrings earnest.far >earnest.txt

Model Format

All ngram models produced by the utilities here, including those with unnormalized counts, have a cyclic weighted finite-state transducer (FST) format, encoded using the OpenFst library. For the precise details of the ngram format, see here. The model is normally stored in a general-purpose, mutable (VectorFst) format, which is convenient for the various processing steps described below. This can be converted to a more compact (but immutable) format specific to ngram models (NGramFst) when the desired final model is generated.

Ngram Counting

ngramcount is a command-line utility for counting ngrams from an input corpus, represented in FAR format. It produces an ngram model in the FST format described above. Transitions and final costs are weighted with the negative log count of the associated ngram. The --order switch sets the maximum length ngram to count; all ngrams observed in the input corpus of length less than or equal to the specified order will be counted. By default, the order is set to 3 (trigram model).
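To make the counting step concrete, here is a toy sketch with standard Unix tools that tallies all 1-grams and 2-grams in a single sentence; ngramcount performs the analogous tallying over every automaton in the FAR, storing each count as a negative log arc weight:

```shell
# Count all n-grams up to order 2 in one sentence (toy illustration only).
printf 'A HAND BAG\n' |
  awk '{for (i = 1; i <= NF; i++) {
          c[$i]++                          # 1-grams
          if (i < NF) c[$i " " $(i+1)]++   # 2-grams
        }}
       END {for (k in c) print c[k], k}' | sort
```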
The 1-gram through 5-gram counts for the corpus can be obtained with:
$ ngramcount --order=5 earnest.far earnest.cnts
Ngram Model Parameter Estimation
ngrammake is a command-line utility for normalizing and smoothing an ngram model. It takes as input the FST produced by ngramcount.
The 5-gram counts in earnest.cnts can be converted into a smoothed model with:
$ ngrammake earnest.cnts earnest.mod
Flags to ngrammake specify the smoothing method (e.g. Katz, Kneser-Ney), with the default being Katz.
Here is a generated sentence from the language model (using ngramrandgen):
$ ngramrandgen earnest.mod | farprintstrings
I <epsilon> WOULD STRONGLY <epsilon> ADVISE YOU MR WORTHING TO TRY <epsilon> AND <epsilon> ACQUIRE <epsilon> SOME RELATIONS AS <epsilon> <epsilon> <epsilon> FAR AS THE PIANO IS CONCERNED <epsilon> SENTIMENT <epsilon> IS MY FORTE <epsilon>

(An epsilon transition is emitted for each backoff.)

Ngram Model Merging, Pruning and Constraining

ngrammerge is a command-line utility for merging two ngram models into a single model, either unnormalized counts or smoothed, normalized models. For example, suppose we split our corpus into two parts, earnest.aa and earnest.ab, and derive 5-gram counts from each independently using ngramcount as shown above. We can then merge the counts to get the same counts as derived above from the full corpus (earnest.cnts):
$ ngrammerge earnest.aa.cnts earnest.ab.cnts earnest.merged.cnts
$ fstequal earnest.cnts earnest.merged.cnts
Note that, unlike our example of merging unnormalized counts above, merging two smoothed models that have been built from half a corpus each will result in a different model than one built from the corpus as a whole, due to the smoothing and mixing. Each of the two model or count FSTs can be weighted, using the --alpha switch for the first input FST and the --beta switch for the second.
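The count-merging case is easy to see with a toy example: with equal weights, merging simply sums each n-gram's counts across the two inputs (the --alpha/--beta switches generalize this to weighted combinations). A sketch with made-up counts, not values from the corpus:

```shell
# Toy illustration of count merging: n-gram counts from two corpus halves
# are summed per n-gram (equal-weight case).
{ printf 'A 3\nB 1\n'; printf 'A 1\nC 2\n'; } |
  awk '{c[$1] += $2} END {for (k in c) print k, c[k]}' | sort
```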
ngramshrink is a command-line utility for pruning ngram models. The following command shrinks the 5-gram model created above using relative entropy pruning to roughly 1/10 of its original size:
$ ngramshrink --method=relative_entropy --theta=.00015 earnest.mod earnest.pru
A random sentence generated through this LM is:
$ ngramrandgen earnest.pru | farprintstrings
I THINK <epsilon> BE ABLE TO <epsilon> DIARY GWENDOLEN WONDERFUL SECRETS MONEY <epsilon> YOU <epsilon>
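ngramshrink also supports simpler pruning criteria than relative entropy, such as count-based pruning, which just drops n-grams whose counts fall below a threshold. That simpler criterion can be sketched with toy data (entropy pruning instead scores each n-gram by how much removing it perturbs the model's distribution):

```shell
# Toy sketch of count-based pruning: keep n-grams (here "n-gram count"
# lines with made-up counts) whose count meets the threshold, drop the rest.
printf 'THE CAT 5\nTHE DOG 1\n' | awk '$NF >= 2'
```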
ngrammarginalize is a command-line utility for re-estimating smoothed ngram models using marginalization constraints similar to Kneser-Ney smoothing. The following imposes marginalization constraints on the 5-gram model created above:
$ ngrammarginalize earnest.mod earnest.marg.mod
This functionality is available in version 1.1.0 and higher. Note that this algorithm may need to be run for several iterations, using the --iterations switch. See the full operation documentation for further considerations and references.

Ngram Model Reading, Printing and Info

ngramprint is a command-line utility for reading in ngram models and producing text files. Both raw counts and normalized models are encoded with the same automaton structure, so either can be accessed for this function. There are multiple options for output. For example, using the 5-gram model created above, the following prints out a portion of it in ARPA format:
$ ngramprint --ARPA earnest.mod earnest.ARPA
$ head -15 earnest.ARPA
\data\
ngram 1=2306
ngram 2=10319
ngram 3=14796
ngram 4=15218
ngram 5=14170
\1-grams:
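The \data\ section shown above is straightforward to consume with standard tools; for example, extracting the number of n-grams per order from the header lines:

```shell
# Parse the \data\ header of an ARPA-format model (excerpt as above),
# printing "order count" pairs.
printf '\\data\\\nngram 1=2306\nngram 2=10319\n' |
  awk '/^ngram/ {split($2, a, "="); print a[1], a[2]}'
```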
ngramread is a command-line utility for reading in textual representations of ngram models and producing FSTs appropriate for use by other functions and utilities. It has several options for input. For example,
$ ngramread --ARPA earnest.ARPA earnest.mod
generates an ngram model in FST format from the ARPA ngram language model specification.

ngraminfo is a command-line utility that prints out various information about an ngram language model in FST format.
$ ngraminfo earnest.mod
# of states                     39076
# of ngram arcs                 51618
# of backoff arcs               39075
initial state                   1
unigram state                   0
# of final states               5190
ngram order                     5
# of 1-grams                    2305
# of 2-grams                    10319
# of 3-grams                    14796
# of 4-grams                    15218
# of 5-grams                    14170
well-formed                     y
normalized                      y

Ngram Model Sampling, Application and Evaluation

ngramrandgen is a command-line utility for sampling from ngram models.
$ ngramrandgen --max_sents=1 earnest.mod | farprintstrings
IT IS SIMPLY A VERY INEXCUSABLE MANNER
ngramapply is a command-line utility for applying ngram models. It can be called to apply a model to a concatenated archive of automata:
$ ngramapply earnest.mod earnest.far | farprintstrings --print_weight
The result is a FAR weighted by the ngram model.
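Conceptually, the weight assigned to each string is the total negative log probability of its words in context. A toy sketch of that arithmetic, with made-up per-word -log10 probabilities (not values from earnest.mod):

```shell
# Toy scoring: a sentence's weight is the sum of the -log10 probabilities
# of its words in context (all values here are made up).
awk 'BEGIN {
  nlp["A"] = 0.5; nlp["HAND"] = 1.2; nlp["BAG"] = 0.9
  printf "%.1f\n", nlp["A"] + nlp["HAND"] + nlp["BAG"]
}'
```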
ngramperplexity can be used to evaluate an ngram model. For example, the following calculates the perplexity of two strings (A HAND BAG and BAG HAND A) with the example 5-gram model generated above:

$ echo -e "A HAND BAG\nBAG HAND A" \
  | farcompilestrings --generate_keys=1 --symbols=earnest.syms --keep_symbols \
  | ngramperplexity --v=1 earnest.mod
A HAND BAG
                                ngram  -logprob
        N-gram probability      found  (base10)
        p( A ...
...
BAG HAND A
                                ngram  -logprob
        N-gram probability      found  (base10)
        p( BAG ...
...
2 sentences, 6 words, 0 OOVs
logprob(base 10)= -16.4395;  perplexity (base 10)= 113.485
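The summary line is consistent with the standard definition: perplexity is 10 raised to the negative average log probability per event, where the event count includes one end-of-sentence event per sentence (6 words + 2 sentences = 8). Checking the arithmetic:

```shell
# Perplexity from the reported totals:
# 10^(-logprob / (words + sentences)) = 10^(16.4395 / 8) ~= 113.485
awk 'BEGIN {printf "%.3f\n", 10 ^ (16.4395 / 8)}'
```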
Using the C++ Library
The OpenGrm NGram library is a C++ library. Users can call the available operations from that level rather than from the command line if desired. From C++, include the library's headers (installed under ngram/) and link against the OpenGrm NGram library.
As mentioned earlier, each ngram model, including those with unnormalized counts, is represented as a weighted FST. Each of the ngram operation classes holds the FST in the common base class NGramModel:
template <class Arc>
class NGramModel {
 public:
  typedef typename Arc::StateId StateId;

  // Construct an NGramModel object, consisting of the FST and some
  // information about the states under the assumption that the FST is
  // a model.
  explicit NGramModel(const Fst<Arc> &infst);

  // Returns highest ngram order.
  int HiOrder() const;

  // Returns order of a given state.
  int StateOrder(StateId state) const;

  // Returns the unigram state.
  StateId UnigramState() const;

  // Validates model has a well-formed ngram topology.
  bool CheckTopology() const;

  // Validates that states are fully normalized (probabilities sum to 1.0).
  bool CheckNormalization() const;

  // Gets a const reference to the internal (expanded) FST.
  const Fst<Arc> &GetFst() const;

 private:
  const Fst<Arc> &fst_;
};
From this class are derived NGramCount for counting, NGramMake for parameter estimation/smoothing, NGramShrink for model pruning, and NGramMerge for model interpolation/merging (among others).

Available Operations

Click on an operation name for additional information.
Convenience Script  
The shell script ngramdisttrain.sh is provided to run some common OpenGrm NGram pipelines of commands and to provide some rudimentary distributed computation support.
For example:
 
$ ngramdisttrain.sh --itype=text_sents --otype=pruned_lm --ifile=in.txt --ofile=lm.fst --symbols=in.syms --order=5 --smooth_method=katz --shrink_method=relative_entropy --theta=.00015
will read a text corpus in the format accepted by farcompilestrings and output a backoff 5-gram LM pruned with a relative entropy threshold of .00015. See ngramdisttrain.sh --help for available options
and values and see here for a discussion of the distributed computation support.
