
OpenGrm NGram Library Quick Tour

This tour is organized around the stages of n-gram model creation, modification and use:

  • textual I/O (ngramsymbols, farcompilestrings, farprintstrings, ngramread and ngramprint)
  • n-gram model format
  • n-gram counting (ngramcount)
  • n-gram model parameter estimation (ngrammake)
  • n-gram model merging and pruning (ngrammerge and ngramshrink)
  • n-gram model sampling, application and evaluation (ngramrandgen, ngramapply, ngramperplexity)

For additional details, follow the links to each operation's full documentation found in each section and in the summary table of available operations below.

Textual I/O

Text corpora are represented as binary finite-state archives, with one automaton per sentence. This allows efficient later processing by the NGram Library utilities and, if desired, more general probabilistic input (e.g., weighted DAGs or lattices).

The first step is to generate an OpenFst-style symbol table for the text tokens in the input corpus. This can be done with the command-line utility ngramsymbols. For example, the symbols in the text of Oscar Wilde's The Importance of Being Earnest, using the suitably normalized copy found here, can be extracted with:

$ ngramsymbols <earnest.txt >earnest.syms

If multiple corpora, e.g. a separate training set and test set, are to be processed together, the same symbol table should be used throughout. This can be accomplished by concatenating the corpora before passing them to ngramsymbols, which eliminates out-of-vocabulary symbols. Alternatively, flags can be passed to both ngramsymbols and farcompilestrings to specify an out-of-vocabulary label.
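
For instance, with a hypothetical held-out file earnest-test.txt, a single symbol table covering both sets can be built with:

$ cat earnest.txt earnest-test.txt | ngramsymbols >earnest.syms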


Given a symbol table, a text corpus can be converted to a binary FAR archive with:

$ farcompilestrings -symbols=earnest.syms earnest.txt >earnest.far

and can be printed with:

$ farprintstrings earnest.far >earnest.txt


ngramread is a command line utility for reading in textual representations of n-gram models and producing FSTs appropriate for use by other functions and utilities. It has several options for input. For example,

$ ngramread --ARPA earnest.ARPA >earnest.mod

generates an n-gram model in FST format from the ARPA n-gram language model specification.

ngramprint is a command line utility for reading in n-gram models and producing text files. Both raw counts and normalized models are encoded with the same automaton structure, so either can be printed with this utility. There are multiple options for output. For example, using the example 5-gram model created below, the following prints out a portion of it in ARPA format:

$ ngramprint --ARPA earnest.mod | head -15
\data\
ngram 1=2306
ngram 2=10319
ngram 3=14796
ngram 4=15218
ngram 5=14170

\1-grams:
-99   <s>   -0.9399067
-1.064551   </s>
-3.337681   MORNING   -0.3590219
-2.990894   ROOM   -0.4771213
-1.857355   IN   -0.6232494
-2.87695   ALGERNON   -0.4771213

ngraminfo is a command-line utility that prints out various information about an n-gram language model in FST format.

$ ngraminfo earnest.mod
# of states                                     42641
# of ngram arcs                                 56809
# of backoff arcs                               42640
initial state                                   1
unigram state                                   0
# of final states                               5190
ngram order                                     5
# of 1-grams                                    2306
# of 2-grams                                    10319
# of 3-grams                                    14796
# of 4-grams                                    15218
# of 5-grams                                    14170
well-formed                                     y
normalized                                      y

Model Format

All n-gram models produced by the utilities here, including those with unnormalized counts, have a cyclic weighted finite-state transducer format, encoded using the OpenFst library. For the precise details of the n-gram format, see here.

N-gram Counting

ngramcount is a command line utility for counting n-grams from an input corpus, represented in FAR format. It produces an n-gram model in the FST format described above, with transitions and final costs weighted by the negative log count of the associated n-gram. The --order flag sets the maximum n-gram length to count; all n-grams observed in the input corpus of length less than or equal to that order are counted. By default, the order is 3 (a trigram model).

The 1-gram through 5-gram counts for the earnest.far finite-state archive created above can be computed with:

$ ngramcount -order=5 earnest.far >earnest.cnts

N-gram Model Parameter Estimation

ngrammake is a command line utility for normalizing and smoothing an n-gram model. It takes as input the FST produced by ngramcount (which contains raw, unnormalized counts).

The 5-gram counts in earnest.cnts created above can be converted into an n-gram model with:

$ ngrammake earnest.cnts >earnest.mod

Flags to ngrammake specify the smoothing method (e.g., Katz or Kneser-Ney), with the default being Witten-Bell.
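
For example, a Katz-smoothed model could be built along the following lines (here the smoothing method is assumed to be selected with ngrammake's --method flag):

$ ngrammake --method=katz earnest.cnts >earnest.katz.mod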

Here is a sentence generated from the language model:

$ ngramrandgen earnest.mod | farprintstrings
I <epsilon> WOULD STRONGLY <epsilon> ADVISE YOU MR WORTHING TO TRY <epsilon> AND <epsilon> ACQUIRE <epsilon> SOME RELATIONS AS <epsilon> <epsilon> <epsilon> FAR AS THE PIANO IS CONCERNED <epsilon> SENTIMENT <epsilon> IS MY FORTE <epsilon>  

(An epsilon transition is emitted for each backoff.)

N-gram Model Merging and Pruning

ngrammerge is a command line utility for merging two n-gram models, either unnormalized counts or smoothed, normalized models, into a single model.
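
For example, counts from this corpus and from a second, hypothetical count file other.cnts could be merged along these lines:

$ ngrammerge earnest.cnts other.cnts >merged.cnts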


ngramshrink is a command line utility for pruning n-gram models.

The following shrinks the 5-gram model created above, using relative entropy pruning, to roughly 1/10 of its original size:

$ ngramshrink -method=relative_entropy -theta=.00015 earnest.mod >earnest.pru

A random sentence generated through this LM is:

$ ngramrandgen earnest.pru | farprintstrings
I THINK <epsilon> BE ABLE TO <epsilon> DIARY GWENDOLEN WONDERFUL SECRETS MONEY <epsilon> YOU <epsilon>  

N-gram Model Sampling, Application and Evaluation

ngramrandgen is a command line utility for sampling from n-gram models.

$ ngramrandgen [--npaths=1] earnest.mod | farprintstrings
IT IS SIMPLY A VERY INEXCUSABLE MANNER


ngramapply is a command line utility for applying n-gram models. It can be used to apply a model to each automaton in a FAR archive:

$ ngramapply earnest.mod earnest.far | farprintstrings -print_weight

The result is a FAR weighted by the n-gram model.


ngramperplexity can be used to evaluate an n-gram model. For example, the following calculates the perplexity of two strings (a hand bag and bag hand a) under the example 5-gram model generated above:

$ echo -e "A HAND BAG\nBAG HAND A" | farcompilestrings -generate_keys=1 -symbols=earnest.syms -keep_symbols=1 | ngramperplexity --v=1 earnest.mod -
A HAND BAG
                                                ngram  -logprob
        N-gram probability                      found  (base10)
        p( A | <s> )                         = [2gram]  1.87984
        p( HAND | A ...)                     = [2gram]  2.56724
        p( BAG | HAND ...)                   = [3gram]  0.0457417
        p( </s> | BAG ...)                   = [4gram]  0.507622
1 sentences, 3 words, 0 OOVs
logprob(base 10)= -5.00044;  perplexity (base 10)= 17.7873

BAG HAND A
                                                ngram  -logprob
        N-gram probability                      found  (base10)
        p( BAG | <s> )                       = [1gram]  4.02771
        p( HAND | BAG ...)                   = [1gram]  3.35968
        p( A | HAND ...)                     = [1gram]  2.51843
        p( </s> | A ...)                     = [1gram]  1.53325
1 sentences, 3 words, 0 OOVs
logprob(base 10)= -11.4391;  perplexity (base 10)= 724.048

2 sentences, 6 words, 0 OOVs
logprob(base 10)= -16.4395;  perplexity (base 10)= 113.485

Using the C++ Library

The OpenGrm NGram library is a C++ library. If desired, the available operations can be called from C++ rather than from the command line. To do so, include <ngram/ngram.h> from the installation's include directory and link against libfst.so, libfar.so, and libngram.so in the installation's library directory. This assumes you have installed OpenFst (with --enable-far=yes). (You may instead include just the headers for the classes and functions that you need.) All classes and functions are in the ngram namespace.
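
For example, a program prog.cc using the library might be compiled and linked along these lines (the program name and the install paths are illustrative):

$ g++ -I/usr/local/include prog.cc -L/usr/local/lib -lngram -lfar -lfst -o prog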

Available Operations

Click on an operation name for additional information.

Operation        Usage                                                                Description
NGramApply
NGramCount
NGramInfo        ngraminfo [in.mod]                                                   print various information about an n-gram model
NGramMake
NGramMerge
NGramPerplexity
NGramPrint
NGramRandgen     ngramrandgen [--npath] [--seed] [--max_length] [in.mod [out.far]]    randomly sample sentences from an n-gram model
NGramRead
NGramShrink      ngramshrink [--method=count,relative_entropy,seymore] [-count_pattern] [-theta] [in.mod [out.mod]]   n-gram model pruning
                 NGramCountPrune(&M, count_pattern);                                  count-based model pruning
                 NGramRelativeEntropy(&M, theta);                                     relative-entropy-based model pruning
                 NGramSeymoreShrink(&M, theta);                                       Seymore/Rosenfeld-based model pruning
NGramSymbols
