Difference: NGramQuickTour (1 vs. 34)

Revision 34 2019-06-14 - KyleGorman

Line: 1 to 1
 
META TOPICPARENT name="NGramLibrary"

OpenGrm NGram Library Quick Tour

Line: 24 to 24
can be done with the command-line utility ngramsymbols. For example, the symbols in the text of Oscar Wilde's The Importance of Being Earnest (using the suitably normalized copy found here) can be extracted with:
Changed:
<
<
$ ngramsymbols <earnest.txt >earnest.syms
>
>
$ ngramsymbols earnest.txt earnest.syms
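In spirit, ngramsymbols just assigns a distinct integer to every whitespace-delimited token it sees. A toy Python sketch of that step (the <epsilon> entry at id 0 mirrors the usual OpenFst symbol-table convention; the helper name is ours, not part of the library):

```python
# Toy stand-in for ngramsymbols: assign a distinct integer id to every
# whitespace-delimited token in the corpus. Id 0 is reserved for <epsilon>,
# following the usual OpenFst symbol-table convention (an assumption here).
def make_symbol_table(lines):
    table = {"<epsilon>": 0}
    for line in lines:
        for token in line.split():
            if token not in table:
                table[token] = len(table)
    return table

corpus = ["HOW ARE YOU", "HOW ARE THEY"]
syms = make_symbol_table(corpus)
```

Reusing one such table across training and test data is what keeps the integer ids consistent between corpora.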
 

If multiple corpora, e.g. for a separate training set and a test set, are to be processed together, the same symbol table should be used throughout. This can

Line: 35 to 35
 Given a symbol table, a text corpus can be converted to a binary FAR archive with:
Changed:
<
<
$ farcompilestrings -symbols=earnest.syms -keep_symbols=1 earnest.txt >earnest.far
>
>
$ farcompilestrings --fst_type=compact --symbols=earnest.syms --keep_symbols earnest.txt earnest.far
 

and can be printed with:

Line: 58 to 58
The 1-gram through 5-gram counts for the earnest.far finite-state archive created above can be computed with:
Changed:
<
<
$ ngramcount -order=5 earnest.far >earnest.cnts
>
>
$ ngramcount --order=5 earnest.far earnest.cnts
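Conceptually, the counting step tallies every n-gram of length 1 through the requested order over each sentence, with boundary padding. A toy Python sketch of the idea (OpenGrm actually stores the counts as a weighted FST, which this ignores; the padding symbols shown are illustrative):

```python
from collections import Counter

# Toy n-gram counter: tally every n-gram of length 1..order over each
# sentence, padded with <s>/</s> boundary markers. OpenGrm encodes the same
# information as a weighted FST rather than a flat table.
def count_ngrams(sentences, order):
    counts = Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

counts = count_ngrams(["HOW ARE YOU", "HOW ARE THEY"], order=2)
```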
 

Line: 69 to 69
The 5-gram counts in earnest.cnts created above can be converted into an n-gram model with:
Changed:
<
<
$ ngrammake earnest.cnts >earnest.mod
>
>
$ ngrammake earnest.cnts earnest.mod
 

Flags to ngrammake specify the smoothing method (e.g., Katz, Kneser-Ney, etc.) used, with the default being Katz.
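To give a feel for what smoothing does, here is a toy Witten-Bell estimator for bigrams, one of the methods ngrammake offers. This illustrates the formula only, not the FST-based implementation, and the helper name is hypothetical:

```python
from collections import Counter

# Toy Witten-Bell smoothing for bigrams: history h reserves probability mass
# T(h) / (c(h) + T(h)) for unseen continuations, where T(h) is the number of
# distinct words observed after h; the reserved mass is spread according to
# the unigram distribution. A simplification of what ngrammake computes.
def witten_bell_bigrams(sentences):
    bigrams, unigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())

    def prob(w, h):
        c_h = sum(c for (a, _), c in bigrams.items() if a == h)
        t_h = len({b for (a, b) in bigrams if a == h})
        p_uni = unigrams[w] / total
        return (bigrams[(h, w)] + t_h * p_uni) / (c_h + t_h)

    return prob

prob = witten_bell_bigrams(["HOW ARE YOU", "HOW ARE THEY"])
```

Because the reserved mass is redistributed over the whole vocabulary, the smoothed probabilities for any seen history still sum to one.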

Line: 89 to 89
 ngrammerge is a command line utility for merging two n-gram models into a single model -- either unnormalized counts or smoothed, normalized models. For example, suppose we split our corpus up into two parts, earnest.aa and earnest.ab, and derive 5-gram counts from each independently using ngramcount as shown above. We can then merge the counts to get the same counts as derived above from the full corpus (earnest.cnts):
Changed:
<
<
$ ngrammerge earnest.aa.cnts earnest.ab.cnts >earnest.merged.cnts
>
>
$ ngrammerge earnest.aa.cnts earnest.ab.cnts earnest.merged.cnts
 $ fstequal earnest.cnts earnest.merged.cnts
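In the count case, merging amounts to (weighted) addition, which is why the merged counts pass the fstequal check above. A toy dictionary-based sketch (the alpha/beta arguments mimic the --alpha/--beta mixing flags; the exact scaling semantics are simplified here):

```python
from collections import Counter

# Toy analogue of ngrammerge on raw counts: merging is (weighted) addition.
# alpha/beta loosely mirror the --alpha/--beta flags; with the defaults of
# 1.0, summing the two half-corpus counts reproduces the full-corpus counts.
def merge_counts(c1, c2, alpha=1.0, beta=1.0):
    merged = Counter()
    for ng, c in c1.items():
        merged[ng] += alpha * c
    for ng, c in c2.items():
        merged[ng] += beta * c
    return merged

half_a = Counter({("HOW", "ARE"): 1, ("ARE", "YOU"): 1})
half_b = Counter({("HOW", "ARE"): 1, ("ARE", "THEY"): 1})
full = merge_counts(half_a, half_b)
```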
Line: 102 to 102
 The following command shrinks the 5-gram model created above using entropy pruning to roughly 1/10 the original size:
Changed:
<
<
$ ngramshrink -method=relative_entropy -theta=.00015 earnest.mod >earnest.pru
>
>
$ ngramshrink --method=relative_entropy --theta=.00015 earnest.mod earnest.pru
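The intuition behind the relative-entropy method: an n-gram is kept only if its estimate differs enough, weighted by how often it occurs, from what the backoff distribution would already predict. The toy selection rule below sketches that criterion only; the real algorithm also recomputes backoff weights, and all numbers here are invented:

```python
import math

# Toy relative-entropy selection rule: score each bigram by its
# frequency-weighted log-ratio against the unigram (backoff) estimate and
# keep it only when the score exceeds theta. ngramshrink additionally
# renormalizes the backoff weights after pruning.
def prune(bigram_probs, unigram_probs, history_probs, theta):
    kept = {}
    for (h, w), p in bigram_probs.items():
        score = history_probs[h] * p * math.log(p / unigram_probs[w])
        if score > theta:
            kept[(h, w)] = p
    return kept

bigram_probs = {("ARE", "YOU"): 0.5, ("ARE", "THEY"): 0.11}
unigram_probs = {"YOU": 0.1, "THEY": 0.1}
history_probs = {"ARE": 0.2}
kept = prune(bigram_probs, unigram_probs, history_probs, theta=0.01)
```

Here ("ARE", "YOU") is far from its backoff estimate and survives, while ("ARE", "THEY") is nearly recoverable from backoff and is pruned.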
 

A random sentence generated through this LM is:

Line: 119 to 119
 The following imposes marginalization constraints on the 5-gram model created above:
Changed:
<
<
$ ngrammarginalize earnest.mod >earnest.marg.mod
>
>
$ ngrammarginalize earnest.mod earnest.marg.mod
 

This functionality is available in version 1.1.0 and higher. Note that this algorithm may need to be run for several iterations, using the --iterations switch. See full operation documentation for further considerations and references.

Line: 130 to 130
ngramprint is a command line utility for reading in n-gram models and producing text files. Both raw counts and normalized models are encoded with the same automaton structure, so either can be accessed for this function. There are multiple options for output. For example, using the example 5-gram model created above, the following prints out a portion of it in ARPA format:
Changed:
<
<
$ ngramprint --ARPA earnest.mod >earnest.ARPA
>
>
$ ngramprint --ARPA earnest.mod earnest.ARPA
$ head -15 earnest.ARPA
\data\
ngram 1=2306
Line: 152 to 152
 It has several options for input. For example,
Changed:
<
<
$ ngramread --ARPA earnest.ARPA >earnest.mod
>
>
$ ngramread --ARPA earnest.ARPA earnest.mod
 

generates an n-gram model in FST format from the ARPA n-gram language model specification.
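An ARPA file lists, per order, lines of base-10 log probability, the n-gram, and an optional backoff weight. A minimal toy reader for the probability lines (it skips the \data\ header and all the validation that ngramread performs):

```python
# Minimal sketch of an ARPA-format reader: each "\N-grams:" section lists
# lines of the form "log10prob w1 ... wN [log10backoff]". Real ARPA files
# also carry a \data\ header with per-order counts; this toy skips checks.
def read_arpa(lines):
    model, order = {}, None
    for line in lines:
        line = line.strip()
        if line.endswith("-grams:"):
            order = int(line[1:].split("-")[0])
        elif order and line and not line.startswith("\\"):
            parts = line.split()
            logprob = float(parts[0])
            words = tuple(parts[1:1 + order])
            model[words] = logprob
        elif line == "\\end\\":
            order = None
    return model

arpa = [
    "\\data\\", "ngram 1=2", "",
    "\\1-grams:",
    "-0.30103 YOU -0.5",
    "-0.30103 THEY",
    "", "\\end\\",
]
model = read_arpa(arpa)
```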

Line: 191 to 191
 ngramapply is a command line utility for applying n-gram models. It can be called to apply a model to a concatenated archive of automata:
Changed:
<
<
$ ngramapply earnest.mod earnest.far | farprintstrings -print_weight
>
>
$ ngramapply earnest.mod earnest.far | farprintstrings --print_weight
 

The result is a FAR weighted by the n-gram model.

Line: 202 to 202
 calculates the perplexity of two strings (a hand bag and bag hand a) from the example 5-gram model generated above:
echo -e "A HAND BAG\nBAG HAND A" |\
Changed:
<
<
farcompilestrings -generate_keys=1 -symbols=earnest.syms --keep_symbols=1 |
>
>
farcompilestrings --generate_keys=1 --symbols=earnest.syms --keep_symbols |
  ngramperplexity --v=1 earnest.mod -
A HAND BAG
ngram  -logprob
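The perplexity that ngramperplexity reports can be recomputed by hand from per-word log probabilities: it is 10 raised to the negative average base-10 log probability, counting the sentence-end event. A sketch with invented numbers (not taken from earnest.mod):

```python
# Perplexity from per-word base-10 log probabilities: the geometric-mean
# inverse probability, 10 ** (-sum(logprobs) / N). The inputs here are
# invented for illustration, not output from the earnest model.
def perplexity(logprobs_base10):
    return 10 ** (-sum(logprobs_base10) / len(logprobs_base10))

# e.g. 4 events (3 words plus </s>), each with probability 0.1:
ppl = perplexity([-1.0, -1.0, -1.0, -1.0])  # 10.0
```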
Line: 304 to 304
 

Convenience Script (work in progress, under construction)

Changed:
<
<
The shell script ngram.sh is provided to run some common OpenGrm NGram pipelines of commands and to provide some rudimentary distributed computation support.
>
>
The shell script ngramdisttrain.sh is provided to run some common OpenGrm NGram pipelines of commands and to provide some rudimentary distributed computation support.
 For example:


Changed:
<
<
$ ngram.sh --itype=text_sents --otype=pruned_lm --ifile=in.txt --ofile=lm.fst --symbols=in.syms --order=5 --smooth_method=katz --shrink_method=relative_entropy --theta=.00015
>
>
$ ngramdisttrain.sh --itype=text_sents --otype=pruned_lm --ifile=in.txt --ofile=lm.fst --symbols=in.syms --order=5 --smooth_method=katz --shrink_method=relative_entropy --theta=.00015
 
Changed:
<
<
will read a text corpus in the format accepted by farcompilestrings and output a backoff 5-gram LM pruned with a relative entropy threshold of .00015. See ngram.sh --help for available options
>
>
will read a text corpus in the format accepted by farcompilestrings and output a backoff 5-gram LM pruned with a relative entropy threshold of .00015. See ngramdisttrain.sh --help for available options
 and values and see here for a discussion of the distributed computation support.

META FILEATTACHMENT attachment="earnest.txt" attr="" comment="" date="1288923241" name="earnest.txt" path="earnest.txt" size="91184" stream="earnest.txt" tmpFilename="/var/tmp/CGItemp7285" user="MichaelRiley" version="1"

Revision 33 2018-05-09 - MichaelRiley

Line: 1 to 1
 
META TOPICPARENT name="NGramLibrary"

OpenGrm NGram Library Quick Tour

Revision 32 2017-06-19 - BrianRoark

Line: 1 to 1
 
META TOPICPARENT name="NGramLibrary"

OpenGrm NGram Library Quick Tour

Line: 72 to 72
 $ ngrammake earnest.cnts >earnest.mod
Changed:
<
<
Flags to ngrammake specify the smoothing (e.g. Katz, Knesser-Ney, etc) used with the default being Witten-Bell.
>
>
Flags to ngrammake specify the smoothing method (e.g., Katz, Kneser-Ney, etc.) used, with the default being Katz.
Here is a generated sentence from the language model (using ngramrandgen, which is described below):

Revision 31 2016-07-20 - JesseRosenstock

Line: 1 to 1
 
META TOPICPARENT name="NGramLibrary"

OpenGrm NGram Library Quick Tour

Line: 238 to 238
 As mentioned earlier, each n-gram model, including those with unnormalized counts, is represented as a weighted FST. Each of the n-gram operation classes holds the FST in the common base class NGramModel. A partial description of this class follows:


Added:
>
>
template <class Arc>
class NGramModel {
 public:
Changed:
<
<
typedef int StateId;
>
>
typedef typename Arc::StateId StateId;
 
Changed:
<
<
// Construct an n-gram model container holding the input FST, whose ownership
// is retained by the caller.
NGramModel(StdMutableFst *fst);
>
>
// Construct an NGramModel object, consisting of the FST and some
// information about the states under the assumption that the FST is
// a model.
explicit NGramModel(const Fst<Arc> &infst);
// Returns highest n-gram order.
int HiOrder() const;
Line: 259 to 261
  bool CheckNormalization() const;

// Gets a const reference to the internal (expanded) FST.

Changed:
<
<
StdExpandedFst &GetFst() const;
// Gets a pointer to the internal (mutable) FST.
StdMutableFst *GetMutableFst() const;
>
>
const Fst<Arc> &GetFst() const;
  private:
Changed:
<
<
StdMutableFst *fst_;
>
>
const Fst<Arc> &fst_;
 };

Revision 30 2013-08-07 - MichaelRiley

Line: 1 to 1
 
META TOPICPARENT name="NGramLibrary"

OpenGrm NGram Library Quick Tour

Line: 311 to 311
 $ ngram.sh --itype=text_sents --otype=pruned_lm --ifile=in.txt --ofile=lm.fst --symbols=in.syms --order=5 --smooth_method=katz --shrink_method=relative_entropy --theta=.00015
Changed:
<
<
will read a text corpus in the format accepted by farcompilestrings and output an order 5 Katz backoff n-gram FST pruned with a relative entropy threshold of .00015. See 'ngram.sh --help" for available options
>
>
will read a text corpus in the format accepted by farcompilestrings and output a backoff 5-gram LM pruned with a relative entropy threshold of .00015. See ngram.sh --help for available options
 and values and see here for a discussion of the distributed computation support.

META FILEATTACHMENT attachment="earnest.txt" attr="" comment="" date="1288923241" name="earnest.txt" path="earnest.txt" size="91184" stream="earnest.txt" tmpFilename="/var/tmp/CGItemp7285" user="MichaelRiley" version="1"

Revision 29 2013-08-07 - MichaelRiley

Line: 1 to 1
 
META TOPICPARENT name="NGramLibrary"

OpenGrm NGram Library Quick Tour

Line: 301 to 301
 
  NGramSeymoreShrink(&M, theta); --- Seymore/Rosenfeld-based model pruning
NGramSymbols ngramsymbols [--epsilon_symbol] [--OOV_symbol] [in.txt [out.txt]] create symbol table from corpus
Added:
>
>

Convenience Script Work in progress, under construction

The shell script ngram.sh is provided to run some common OpenGrm NGram pipelines of commands and to provide some rudimentary distributed computation support. For example:

$ ngram.sh --itype=text_sents --otype=pruned_lm --ifile=in.txt --ofile=lm.fst --symbols=in.syms --order=5 --smooth_method=katz --shrink_method=relative_entropy --theta=.00015

will read a text corpus in the format accepted by farcompilestrings and output an order 5 Katz backoff n-gram FST pruned with a relative entropy threshold of .00015. See 'ngram.sh --help" for available options and values and see here for a discussion of the distributed computation support.

 
META FILEATTACHMENT attachment="earnest.txt" attr="" comment="" date="1288923241" name="earnest.txt" path="earnest.txt" size="91184" stream="earnest.txt" tmpFilename="/var/tmp/CGItemp7285" user="MichaelRiley" version="1"
META TOPICMOVED by="MichaelRiley" date="1296787886" from="GRM.GrmQuickTour" to="GRM.NGramQuickTour"

Revision 28 2013-07-18 - BrianRoark

Line: 1 to 1
 
META TOPICPARENT name="NGramLibrary"

OpenGrm NGram Library Quick Tour

Line: 10 to 10
 
  • n-gram model format
  • n-gram counting (ngramcount)
  • n-gram model parameter estimation (ngrammake)
Changed:
<
<
  • n-gram model merging and pruning (ngrammerge and ngramshrink)
>
>
  • n-gram model merging, pruning and constraining (ngrammerge, ngramshrink and ngrammarginalize)
 
  • model I/O (ngramread, ngramprint and ngraminfo)
  • n-gram model sampling, application and evaluation (ngramrandgen, ngramapply and ngramperplexity)
Line: 84 to 84
 (An epsilon transition is emitted for each backoff.)

Changed:
<
<

N-gram Model Merging and Pruning

>
>

N-gram Model Merging, Pruning and Constraining

  ngrammerge is a command line utility for merging two n-gram models into a single model -- either unnormalized counts or smoothed, normalized models. For example, suppose we split our corpus up into two parts, earnest.aa and earnest.ab, and derive 5-gram counts from each independently using ngramcount as shown above. We can then merge the counts to get the same counts as derived above from the full corpus (earnest.cnts):
Line: 99 to 99
  ngramshrink is a command line utility for pruning n-gram models.
Changed:
<
<
This following shrinks the 5-gram model created above using entropy pruning to roughly 1/10 the original size:
>
>
The following command shrinks the 5-gram model created above using entropy pruning to roughly 1/10 the original size:
 
$ ngramshrink -method=relative_entropy -theta=.00015 earnest.mod >earnest.pru
Line: 112 to 112
 I THINK BE ABLE TO DIARY GWENDOLEN WONDERFUL SECRETS MONEY YOU
Added:
>
>

ngrammarginalize is a command line utility for re-estimating smoothed n-gram models using marginalization constraints similar to Kneser-Ney smoothing.

The following imposes marginalization constraints on the 5-gram model created above:

$ ngrammarginalize earnest.mod >earnest.marg.mod

This functionality is available in version 1.1.0 and higher. Note that this algorithm may need to be run for several iterations, using the --iterations switch. See full operation documentation for further considerations and references.

 

N-gram Model Reading, Printing and Info

Line: 275 to 287
 
  NGramKneserNey(&CountFst); --- Kneser Ney smoothing
  NGramUnsmoothed(&CountFst); --- no smoothing
  NGramWittenBell(&CountFst); --- Witten-Bell smoothing
Added:
>
>
NGramMarginal ngrammarginalize [--iterations] [--max_bo_updates] [--output_each_iteration] [--steady_state_file] [in.mod [out.mod]] impose marginalization constraints on input model
  NGramMarginal(&M); --- n-gram marginalization constraint class
 
NGramMerge ngrammerge [--alpha] [--beta] [--use_smoothing] [--normalize] in1.fst in2.fst [out.fst] merge two count or model FSTs
  NGramMerge(&M1, &M2, alpha, beta); --- n-gram merge class
NGramPerplexity ngramperplexity [--OOV_symbol] [--OOV_class_size] [--OOV_probability] ngram.fst [in.far [out.txt]] calculate perplexity of input corpus from model

Revision 27 2012-06-01 - MichaelRiley

Line: 1 to 1
 
META TOPICPARENT name="NGramLibrary"

OpenGrm NGram Library Quick Tour

Line: 48 to 48
 

Model Format

All n-gram models produced by the utilities here, including those with unnormalized counts, have a cyclic weighted finite-state transducer (FST) format, encoded using the OpenFst library. For the precise details of the n-gram format, see here.
Changed:
<
<
The model is normally stored in a general-purpose, mutable (VectorFst) format., which is convenient for the various processing steps described below.This can be converted to a much more compact (but immutable) format specifically for n-gram models (NGramFst) when the desired final model is generated.
>
>
The model is normally stored in a general-purpose, mutable (VectorFst) format, which is convenient for the various processing steps described below.This can be converted to a more compact (but immutable) format specifically for n-gram models (NGramFst) when the desired final model is generated.
 

N-gram Counting

Revision 26 2012-06-01 - MichaelRiley

Line: 1 to 1
 
META TOPICPARENT name="NGramLibrary"

OpenGrm NGram Library Quick Tour

Line: 48 to 48
 

Model Format

All n-gram models produced by the utilities here, including those with unnormalized counts, have a cyclic weighted finite-state transducer (FST) format, encoded using the OpenFst library. For the precise details of the n-gram format, see here.
Added:
>
>
The model is normally stored in a general-purpose, mutable (VectorFst) format., which is convenient for the various processing steps described below.This can be converted to a much more compact (but immutable) format specifically for n-gram models (NGramFst) when the desired final model is generated.
 

N-gram Counting

Revision 25 2012-03-22 - BrianRoark

Line: 1 to 1
 
META TOPICPARENT name="NGramLibrary"

OpenGrm NGram Library Quick Tour

Line: 187 to 187
 ngramperplexity can be used to evaluate an n-gram model. For example, the following calculates the perplexity of two strings (a hand bag and bag hand a) from the example 5-gram model generated above:
Changed:
<
<
echo -e "A HAND BAG\nBAG HAND A" | ngramread | ngramperplexity --v=1 earnest.mod -
>
>
echo -e "A HAND BAG\nBAG HAND A" | farcompilestrings -generate_keys=1 -symbols=earnest.syms --keep_symbols=1 | ngramperplexity --v=1 earnest.mod -
A HAND BAG
ngram  -logprob
N-gram probability found (base10)

Revision 24 2012-03-04 - BrianRoark

Line: 1 to 1
 
META TOPICPARENT name="NGramLibrary"

OpenGrm NGram Library Quick Tour

Line: 147 to 147
 
$ ngraminfo earnest.mod
Changed:
<
<
# of states        42641
# of ngram arcs    56809
# of backoff arcs  42640
>
>
# of states        39076
# of ngram arcs    51618
# of backoff arcs  39075
initial state      1
unigram state      0
# of final states  5190
ngram order        5
Changed:
<
<
# of 1-grams       2306
>
>
# of 1-grams       2305
# of 2-grams       10319
# of 3-grams       14796
# of 4-grams       15218
Line: 169 to 169
 ngramrandgen is a command line utility for sampling from n-gram models.
Changed:
<
<
$ ngramrandgen [--max_sents=1] earnest.mod | farprintstrings
>
>
$ ngramrandgen --max_sents=1 earnest.mod | farprintstrings
 IT IS SIMPLY A VERY INEXCUSABLE MANNER

Revision 23 2012-03-04 - BrianRoark

Line: 1 to 1
 
META TOPICPARENT name="NGramLibrary"

OpenGrm NGram Library Quick Tour

Line: 276 to 276
 
NGramPerplexity ngramperplexity [--OOV_symbol] [--OOV_class_size] [--OOV_probability] ngram.fst [in.far [out.txt]] calculate perplexity of input corpus from model
NGramPrint ngramprint [--ARPA] [--backoff] [--integers] [--negativelogs] [in.fst [out.txt]] print n-gram model to text file
NGramRandgen ngramrandgen [--max_sents] [--max_length] [--seed] [in.mod [out.far]] randomly sample sentences from an n-gram model
Changed:
<
<
NGramRead    
>
>
NGramRead ngramread [--ARPA] [--epsilon_symbol] [--OOV_symbol] [in.txt [out.fst]] read n-gram counts or model from file
 
NGramShrink ngramshrink [--method=count,relative_entropy,seymore] [-count_pattern] [-theta] [in.mod [out.mod]] n-gram model pruning
  NGramCountPrune(&M, count_pattern); --- count-based model pruning
  NGramRelativeEntropy(&M, theta); --- relative-entropy-based model pruning

Revision 22 2012-03-03 - BrianRoark

Line: 1 to 1
 
META TOPICPARENT name="NGramLibrary"

OpenGrm NGram Library Quick Tour

Line: 35 to 35
 Given a symbol table, a text corpus can be converted to a binary FAR archive with:
Changed:
<
<
$ farcompilestrings -symbols=earnest.syms earnest.txt >earnest.far
>
>
$ farcompilestrings -symbols=earnest.syms -keep_symbols=1 earnest.txt >earnest.far
 

and can be printed with:

Revision 21 2012-03-02 - MichaelRiley

Line: 1 to 1
 
META TOPICPARENT name="NGramLibrary"

OpenGrm NGram Library Quick Tour

Revision 20 2012-02-22 - BrianRoark

Line: 1 to 1
 
META TOPICPARENT name="NGramLibrary"

OpenGrm NGram Library Quick Tour

Line: 261 to 261
 Click on operation name for additional information.

Operation Usage Description
Changed:
<
<
NGramApply    
>
>
NGramApply ngramapply [--bo_arc_type] ngram.fst [in.far [out.far]] Intersect n-gram model with fst archive
 
NGramCount ngramcount [--order] [in.far [out.fst]] count n-grams from fst archive
  NGramCounter(order); --- n-gram counter
NGramInfo ngraminfo [in.mod] print various information about an n-gram model
Line: 274 to 274
 
NGramMerge ngrammerge [--alpha] [--beta] [--use_smoothing] [--normalize] in1.fst in2.fst [out.fst] merge two count or model FSTs
  NGramMerge(&M1, &M2, alpha, beta); --- n-gram merge class
NGramPerplexity ngramperplexity [--OOV_symbol] [--OOV_class_size] [--OOV_probability] ngram.fst [in.far [out.txt]] calculate perplexity of input corpus from model
Changed:
<
<
NGramPrint    
>
>
NGramPrint ngramprint [--ARPA] [--backoff] [--integers] [--negativelogs] [in.fst [out.txt]] print n-gram model to text file
 
NGramRandgen ngramrandgen [--max_sents] [--max_length] [--seed] [in.mod [out.far]] randomly sample sentences from an n-gram model
NGramRead    
NGramShrink ngramshrink [--method=count,relative_entropy,seymore] [-count_pattern] [-theta] [in.mod [out.mod]] n-gram model pruning

Revision 19 2011-12-16 - BrianRoark

Line: 1 to 1
 
META TOPICPARENT name="NGramLibrary"

OpenGrm NGram Library Quick Tour

Line: 266 to 266
 
  NGramCounter(order); --- n-gram counter
NGramInfo ngraminfo [in.mod] print various information about an n-gram model
NGramMake ngrammake [--method] [--backoff] [--bins] [--witten_bell_k] [--discount_D] [in.fst [out.fst]] n-gram model smoothing and normalization
Changed:
<
<
  NGramAbsolute(&M); --- Absolute Discount smoothing
  NGramKatz(&M); --- Katz smoothing
  NGramKneserNey(&M); --- Kneser Ney smoothing
  NGramUnsmoothed(&M); --- no smoothing
  NGramWittenBell(&M); --- Witten-Bell smoothing
>
>
  NGramAbsolute(&CountFst); --- Absolute Discount smoothing
  NGramKatz(&CountFst); --- Katz smoothing
  NGramKneserNey(&CountFst); --- Kneser Ney smoothing
  NGramUnsmoothed(&CountFst); --- no smoothing
  NGramWittenBell(&CountFst); --- Witten-Bell smoothing
 
NGramMerge ngrammerge [--alpha] [--beta] [--use_smoothing] [--normalize] in1.fst in2.fst [out.fst] merge two count or model FSTs
  NGramMerge(&M1, &M2, alpha, beta); --- n-gram merge class
Changed:
<
<
NGramPerplexity    
>
>
NGramPerplexity ngramperplexity [--OOV_symbol] [--OOV_class_size] [--OOV_probability] ngram.fst [in.far [out.txt]] calculate perplexity of input corpus from model
 
NGramPrint    
NGramRandgen ngramrandgen [--max_sents] [--max_length] [--seed] [in.mod [out.far]] randomly sample sentences from an n-gram model
NGramRead    

Revision 18 2011-12-15 - BrianRoark

Line: 1 to 1
 
META TOPICPARENT name="NGramLibrary"

OpenGrm NGram Library Quick Tour

Line: 262 to 262
 
Operation Usage Description
NGramApply    
Changed:
<
<
NGramCount    
>
>
NGramCount ngramcount [--order] [in.far [out.fst]] count n-grams from fst archive
  NGramCounter(order); --- n-gram counter
 
NGramInfo ngraminfo [in.mod] print various information about an n-gram model
Changed:
<
<
NGramMake    
NGramMerge    
>
>
NGramMake ngrammake [--method] [--backoff] [--bins] [--witten_bell_k] [--discount_D] [in.fst [out.fst]] n-gram model smoothing and normalization
  NGramAbsolute(&M); --- Absolute Discount smoothing
  NGramKatz(&M); --- Katz smoothing
  NGramKneserNey(&M); --- Kneser Ney smoothing
  NGramUnsmoothed(&M); --- no smoothing
  NGramWittenBell(&M); --- Witten-Bell smoothing
NGramMerge ngrammerge [--alpha] [--beta] [--use_smoothing] [--normalize] in1.fst in2.fst [out.fst] merge two count or model FSTs
  NGramMerge(&M1, &M2, alpha, beta); --- n-gram merge class
 
NGramPerplexity    
NGramPrint    
NGramRandgen ngramrandgen [--max_sents] [--max_length] [--seed] [in.mod [out.far]] randomly sample sentences from an n-gram model
Line: 274 to 281
 
  NGramCountPrune(&M, count_pattern); --- count-based model pruning
  NGramRelativeEntropy(&M, theta); --- relative-entropy-based model pruning
  NGramSeymoreShrink(&M, theta); --- Seymore/Rosenfeld-based model pruning
Changed:
<
<
NGramSymbols    

>
>
NGramSymbols ngramsymbols [--epsilon_symbol] [--OOV_symbol] [in.txt [out.txt]] create symbol table from corpus
 
META FILEATTACHMENT attachment="earnest.txt" attr="" comment="" date="1288923241" name="earnest.txt" path="earnest.txt" size="91184" stream="earnest.txt" tmpFilename="/var/tmp/CGItemp7285" user="MichaelRiley" version="1"
META TOPICMOVED by="MichaelRiley" date="1296787886" from="GRM.GrmQuickTour" to="GRM.NGramQuickTour"

Revision 17 2011-12-14 - MichaelRiley

Line: 1 to 1
 
META TOPICPARENT name="NGramLibrary"

OpenGrm NGram Library Quick Tour

Added:
>
>
 This tour is organized around the stages of n-gram model creation, modification and use:

  • corpus I/O (ngramsymbols, farcompilestrings and farprintstrings)
Line: 44 to 46
 

Model Format

Changed:
<
<
All n-gram models produced by the utilities here, including those with unnormalized counts, have a cyclic weighted finite-state transducer format, encoded using the OpenFst library. For the precise details of the n-gram format, see here.
>
>
All n-gram models produced by the utilities here, including those with unnormalized counts, have a cyclic weighted finite-state transducer (FST) format, encoded using the OpenFst library. For the precise details of the n-gram format, see here.
 

N-gram Counting

Line: 167 to 169
 ngramrandgen is a command line utility for sampling from n-gram models.
Changed:
<
<
$ ngramrandgen [--npaths=1] earnest.mod | farprintstrings
>
>
$ ngramrandgen [--max_sents=1] earnest.mod | farprintstrings
 IT IS SIMPLY A VERY INEXCUSABLE MANNER
Line: 217 to 219
 The OpenGrm NGram library is a C++ library. Users can call the available operations from that level rather than from the command line if desired. From C++, include <ngram/ngram.h> in the installation include directory and link to libfst.so, libfar.so, and libngram.so in the installation library directory. This assumes you've installed OpenFst (with --enable-far=yes). (You may instead use just those include files for the classes and functions that you will need.) All classes and functions are in the ngram namespace.
Added:
>
>
As mentioned earlier, each n-gram model, including those with unnormalized counts, is represented as a weighted FST. Each of the n-gram operation classes holds the FST in the common base class NGramModel. A partial description of this class follows:

class NGramModel {
public:
   typedef int StateId; 
 
   // Construct an n-gram model container holding the input FST, whose ownership
   // is retained by the caller. 
   NGramModel(StdMutableFst *fst);  

   // Returns highest n-gram order. 
   int HiOrder() const;    
   // Returns order of a given state.  
   int StateOrder(StateId state) const;   
   // Returns the unigram state. 
   StateId UnigramState() const;  
  
   // Validates model has a well-formed n-gram topology 
   bool CheckTopology() const;   
   // Validates that states are fully normalized (probabilities sum to 1.0) 
   bool CheckNormalization() const;   

   // Gets a const reference to the internal (expanded) FST. 
   StdExpandedFst &GetFst() const;   
   // Gets a pointer to the internal (mutable) FST. 
   StdMutableFst *GetMutableFst() const; 
     
private:
   StdMutableFst *fst_;
};

From this class is derived NGramCount for counting, NGramMake for parameter estimation/smoothing, NGramShrink for model pruning, NGramMerge for model interpolation/merging (among others). NGramMake and NGramShrink are further sub-classed for each specific smoothing and pruning method. For example, NGramMake has methods (some abstract) common to most/all parameter estimation/smoothing techniques while NGramKatz has the specific implementations for that method.

 

Available Operations

Line: 231 to 268
 
NGramMerge    
NGramPerplexity    
NGramPrint    
Changed:
<
<
NGramRandgen ngramrandgen [--npath] [--seed] [--max_length] [in.mod [out.far]] randomly sample sentences from an n-gram model
>
>
NGramRandgen ngramrandgen [--max_sents] [--max_length] [--seed] [in.mod [out.far]] randomly sample sentences from an n-gram model
 
NGramRead    
NGramShrink ngramshrink [--method=count,relative_entropy,seymore] [-count_pattern] [-theta] [in.mod [out.mod]] n-gram model pruning
  NGramCountPrune(&M, count_pattern); --- count-based model pruning

Revision 16 2011-12-13 - BrianRoark

Line: 1 to 1
 
META TOPICPARENT name="NGramLibrary"

OpenGrm NGram Library Quick Tour

Line: 82 to 82
 

N-gram Model Merging and Pruning

Changed:
<
<
ngrammerge is a command line utility for merging two n-gram models into a single model -- either unnormalized counts or smoothed, normalized models. For example, suppose we split our corpus up into two parts, earnest.aa and earnest.ab, e.g., by using split:
>
>
ngrammerge is a command line utility for merging two n-gram models into a single model -- either unnormalized counts or smoothed, normalized models. For example, suppose we split our corpus up into two parts, earnest.aa and earnest.ab, and derive 5-gram counts from each independently using ngramcount as shown above. We can then merge the counts to get the same counts as derived above from the full corpus (earnest.cnts):
 
Deleted:
<
<
$ split -844 earnest.txt earnest.

If we count each half independently, we can then merge the counts to get the same counts as derived above from the full corpus (earnest.cnts):

$ farcompilestrings -symbols=earnest.syms earnest.aa >earnest.aa.far
$ ngramcount -order=5 earnest.aa.far >earnest.aa.cnts
$ farcompilestrings -symbols=earnest.syms earnest.ab >earnest.ab.far
$ ngramcount -order=5 earnest.ab.far >earnest.ab.cnts
$ ngrammerge earnest.aa.cnts earnest.ab.cnts >earnest.merged.cnts
$ fstequal earnest.cnts earnest.merged.cnts
Changed:
<
<
Note that, unlike our example merging unnormalized counts above, merging two smoothed models that have been built from half a corpus each will result in a different model than one built from the corpus as a whole, due to the smoothing and mixing.

Each of the two model or count FSTs can be weighted, using the --alpha switch for the first input FST, and the --beta switch for the second input FST. These weights are interpreted in the real semiring and both default to one, meaning that by default the original counts or probabilities are not scaled. To triple the contribution of the first model and double the contribution of the second:

$ ngrammerge --alpha=3 --beta=2 earnest.aa.mod earnest.ab.mod >earnest.merged.mod
>
>
Note that, unlike our example merging unnormalized counts above, merging two smoothed models that have been built from half a corpus each will result in a different model than one built from the corpus as a whole, due to the smoothing and mixing. Each of the two model or count FSTs can be weighted, using the --alpha switch for the first input FST, and the --beta switch for the second input FST.
 
Line: 130 to 114
 ngramprint is a command line utility for reading in n-gram models and producing text files. Both raw counts and normalized models are encoded with the same automaton structure, so either can be accessed for this function. There are multiple options for output. For example, using the example 5-gram model created below, the following prints out a portion of it in ARPA format:
Changed:
<
<
$ ngramprint --ARPA earnest.mod | head -15
>
>
$ ngramprint --ARPA earnest.mod >earnest.ARPA
$ head -15 earnest.ARPA
\data\
ngram 1=2306
ngram 2=10319

Revision 15 2011-12-13 - BrianRoark

Line: 1 to 1
 
META TOPICPARENT name="NGramLibrary"

OpenGrm NGram Library Quick Tour

Line: 10 to 10
 
  • n-gram model parameter estimation (ngrammake)
  • n-gram model merging and pruning (ngrammerge and ngramshrink)
  • model I/O (ngramread, ngramprint and ngraminfo)
Changed:
<
<
  • n-gram model sampling, application and evaluation (ngramrandgen, ngramapply, ngramperplexity)
>
>
  • n-gram model sampling, application and evaluation (ngramrandgen, ngramapply and ngramperplexity)
For additional details, follow the links to each operation's full documentation found in each section and in the summary table of available operations below.
Line: 82 to 82
 

N-gram Model Merging and Pruning

Changed:
<
<
ngrammerge is a command line utility for merging two n-gram models into a single model -- either unnormalized counts or smoothed, normalized models.
>
>
ngrammerge is a command line utility for merging two n-gram models into a single model -- either unnormalized counts or smoothed, normalized models. For example, suppose we split our corpus up into two parts, earnest.aa and earnest.ab, e.g., by using split:

$ split -844 earnest.txt earnest.

If we count each half independently, we can then merge the counts to get the same counts as derived above from the full corpus (earnest.cnts):

$ farcompilestrings -symbols=earnest.syms earnest.aa >earnest.aa.far
$ ngramcount -order=5 earnest.aa.far >earnest.aa.cnts
$ farcompilestrings -symbols=earnest.syms earnest.ab >earnest.ab.far
$ ngramcount -order=5 earnest.ab.far >earnest.ab.cnts
$ ngrammerge earnest.aa.cnts earnest.ab.cnts >earnest.merged.cnts
$ fstequal earnest.cnts earnest.merged.cnts

Note that, unlike our example merging unnormalized counts above, merging two smoothed models that have been built from half a corpus each will result in a different model than one built from the corpus as a whole, due to the smoothing and mixing.

Each of the two model or count FSTs can be weighted, using the --alpha switch for the first input FST, and the --beta switch for the second input FST. These weights are interpreted in the real semiring and both default to one, meaning that by default the original counts or probabilities are not scaled. To triple the contribution of the first model and double the contribution of the second:

$ ngrammerge --alpha=3 --beta=2 earnest.aa.mod earnest.ab.mod >earnest.merged.mod
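The effect of the --alpha and --beta weights can be illustrated outside the library. The following is a minimal conceptual sketch in plain Python (not the OpenGrm API), with counts held as plain real values rather than the negative logs the FSTs store:

```python
def merge_counts(counts_a, counts_b, alpha=1.0, beta=1.0):
    """Merge two n-gram count dictionaries in the real semiring:
    merged count = alpha * count_a + beta * count_b."""
    merged = {}
    for ngram in set(counts_a) | set(counts_b):
        merged[ngram] = (alpha * counts_a.get(ngram, 0.0)
                         + beta * counts_b.get(ngram, 0.0))
    return merged

# Hypothetical counts from two halves of a corpus.
half_a = {("A",): 3.0, ("A", "HAND"): 1.0}
half_b = {("A",): 2.0, ("HAND",): 1.0}

# Default weights reproduce the counts of the full corpus.
full = merge_counts(half_a, half_b)

# Tripling the first and doubling the second, as with --alpha=3 --beta=2:
weighted = merge_counts(half_a, half_b, alpha=3.0, beta=2.0)
```

With the default weights, the merged counts are simply the sums, which is why merging the two halves' counts matches counting the whole corpus.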
 
Line: 104 to 127
 

N-gram Model Reading, Printing and Info

Deleted:
<
<
ngramread is a command line utility for reading in textual representations of n-gram models and producing FSTs appropriate for use by other functions and utilities. It has several options for input. For example,

$ ngramread --ARPA earnest.ARPA >earnest.mod

generates an n-gram model in FST format from the ARPA n-gram language model specification.

 ngramprint is a command line utility for reading in n-gram models and producing text files. Both raw counts and normalized models are encoded with the same automaton structure, so either can be accessed for this function. There are multiple options for output. For example, using the example 5-gram model created below, the following prints out a portion of it in ARPA format:
Line: 133 to 147
 -2.87695 ALGERNON -0.4771213
Added:
>
>
ngramread is a command line utility for reading in textual representations of n-gram models and producing FSTs appropriate for use by other functions and utilities. It has several options for input. For example,

$ ngramread --ARPA earnest.ARPA >earnest.mod

generates an n-gram model in FST format from the ARPA n-gram language model specification.

 ngraminfo is a command-line utility that prints out various information about an n-gram language model in FST format.

Revision 14 - 2011-12-12 - BrianRoark

Line: 1 to 1
 
META TOPICPARENT name="NGramLibrary"

OpenGrm NGram Library Quick Tour

This tour is organized around the stages of n-gram model creation, modification and use:

Changed:
<
<
  • textual I/O (ngramsymbols, farcompilestrings, farprintstrings, ngramread and ngramprint)
>
>
  • corpus I/O (ngramsymbols, farcompilestrings and farprintstrings)
 
  • n-gram model format
  • n-gram counting (ngramcount)
  • n-gram model parameter estimation (ngrammake)
  • n-gram model merging and pruning (ngrammerge and ngramshrink)
Added:
>
>
  • model I/O (ngramread, ngramprint and ngraminfo)
 
  • n-gram model sampling, application and evaluation (ngramrandgen, ngramapply, ngramperplexity)

For additional details, follow the links to each operation's full documentation found in each section and in the summary table of available operations below.

Changed:
<
<

Textual I/O

>
>

Corpus I/O

 Text corpora are represented as binary finite-state archives, with one automaton per sentence. This provides efficient later processing by the NGram Library utilities and allows, if desired, more general probabilistic input (e.g. weighted DAGs or lattices).
Changed:
<
<
The first step is to generate an OpenFst-style symbol table for the text tokens in the input corpus. This
>
>
The first step is to generate an OpenFst-style symbol table for the text tokens in the input corpus. This
 can be done with the command-line utility ngramsymbols. For example, the symbols in the text of Oscar Wilde's Importance of Being Earnest, using the suitably normalized copy found here, can be extracted with:
Line: 25 to 26
 

If multiple corpora, e.g. for a separate training set and a test set, are to be processed together, the same symbol table should be used throughout. This can

Changed:
<
<
be accomplished by concatenating the corpora when passed to ngramsymbols, eliminating out-of-vocabulary symbols. Alternatively, flags can be passed to both ngramsymbols and farprintstrings to specify an out-of-vocabulary label.
>
>
be accomplished by concatenating the corpora when passed to ngramsymbols, eliminating out-of-vocabulary symbols. By default, ngramsymbols creates symbol table entries for <epsilon> and an out-of-vocabulary token <unk>. The identity of these labels can be changed using flags. A flag can then be passed to farcompilestrings to specify the out-of-vocabulary label, so that words not in the symbol table will get mapped to that index.
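The shape of the symbol table ngramsymbols produces can be sketched conceptually. This is plain Python, not the tool itself; the reservation of index 0 for <epsilon> and an early index for <unk> mirrors the defaults described above, and the exact index of <unk> in the real tool's output is an assumption here:

```python
def make_symbol_table(corpus_lines, epsilon="<epsilon>", oov="<unk>"):
    """Build an OpenFst-style symbol-to-index mapping: epsilon at
    index 0, the out-of-vocabulary token next, then corpus tokens
    in order of first appearance."""
    symbols = [epsilon, oov]
    seen = set(symbols)
    for line in corpus_lines:
        for token in line.split():
            if token not in seen:
                seen.add(token)
                symbols.append(token)
    return {sym: i for i, sym in enumerate(symbols)}

table = make_symbol_table(["A HAND BAG", "A VERY GOOD HAND"])
```

Words absent from this table are exactly those that farcompilestrings would map to the out-of-vocabulary index.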
 
Line: 42 to 42
 $ farprintstrings earnest.far >earnest.txt
Deleted:
<
<

ngramread is a command line utility for reading in textual representations of n-gram models and producing FSTs appropriate for use by other functions and utilities. It has several options for input. For example,

$ ngramread --ARPA earnest.ARPA >earnest.mod

generates an n-gram model in FST format from the ARPA n-gram language model specification.

ngramprint is a command line utility for reading in n-gram models and producing text files. Both raw counts and normalized models are encoded with the same automaton structure, so either can be accessed for this function. There are multiple options for output. For example, using the example 5-gram model created below, the following prints out a portion of it in ARPA format:

$ ngramprint --ARPA earnest.mod | head -15
\data\
ngram 1=2306
ngram 2=10319
ngram 3=14796
ngram 4=15218
ngram 5=14170

\1-grams:
-99   <s>   -0.9399067
-1.064551   </s>
-3.337681   MORNING   -0.3590219
-2.990894   ROOM   -0.4771213
-1.857355   IN   -0.6232494
-2.87695   ALGERNON   -0.4771213

ngraminfo is a command-line utility that prints out various information about an n-gram language model in FST format.

$ ngraminfo earnest.mod
# of states                                     42641
# of ngram arcs                                 56809
# of backoff arcs                               42640
initial state                                   1
unigram state                                   0
# of final states                               5190
ngram order                                     5
# of 1-grams                                    2306
# of 2-grams                                    10319
# of 3-grams                                    14796
# of 4-grams                                    15218
# of 5-grams                                    14170
well-formed                                     y
normalized                                      y
 

Model Format

All n-gram models produced by the utilities here, including those with unnormalized counts, have a cyclic weighted finite-state transducer format, encoded using the OpenFst library. For the precise details of the n-gram format, see here.
Line: 121 to 70
  Flags to ngrammake specify the smoothing (e.g. Katz, Kneser-Ney, etc.) used, with the default being Witten-Bell.
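The default Witten-Bell estimate can be sketched for a single history. This is a conceptual illustration of the formula, not the library's implementation: each observed continuation w of history h gets c(h,w) / (c(h) + D), where D is the number of distinct continuations, and the leftover D / (c(h) + D) is reserved as backoff mass for unseen continuations:

```python
def witten_bell(history_counts):
    """Witten-Bell estimate for one history: probabilities for observed
    continuations plus the probability mass reserved for backoff."""
    total = sum(history_counts.values())
    distinct = len(history_counts)
    probs = {w: c / (total + distinct) for w, c in history_counts.items()}
    backoff_mass = distinct / (total + distinct)
    return probs, backoff_mass

# Hypothetical counts for a single history with two continuations:
probs, backoff = witten_bell({"HAND": 3, "VERY": 1})
# HAND gets 3/6, VERY gets 1/6, and 2/6 is held back for backoff.
```

The observed-continuation probabilities and the backoff mass sum to one, which is the normalization property ngrammake guarantees.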
Changed:
<
<
Here is a generated sentence from the language model:
>
>
Here is a generated sentence from the language model (using ngramrandgen, which is described below):
 
$ ngramrandgen earnest.mod | farprintstrings
Line: 152 to 101
 I THINK BE ABLE TO DIARY GWENDOLEN WONDERFUL SECRETS MONEY YOU
Added:
>
>

N-gram Model Reading, Printing and Info

ngramread is a command line utility for reading in textual representations of n-gram models and producing FSTs appropriate for use by other functions and utilities. It has several options for input. For example,

$ ngramread --ARPA earnest.ARPA >earnest.mod

generates a n-gram model in FST format from the ARPA n-gram language model specification.

ngramprint is a command line utility for reading in n-gram models and producing text files. Both raw counts and normalized models are encoded with the same automaton structure, so either can be accessed for this function. There are multiple options for output. For example, using the example 5-gram model created below, the following prints out a portion of it in ARPA format:

$ ngramprint --ARPA earnest.mod | head -15
\data\
ngram 1=2306
ngram 2=10319
ngram 3=14796
ngram 4=15218
ngram 5=14170

\1-grams:
-99   <s>   -0.9399067
-1.064551   </s>
-3.337681   MORNING   -0.3590219
-2.990894   ROOM   -0.4771213
-1.857355   IN   -0.6232494
-2.87695   ALGERNON   -0.4771213
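The three columns of a \1-grams section (log10 probability, word, log10 backoff weight) are easy to pick apart. A minimal sketch of a parser for just that section, not a full ARPA reader (the tab-separated layout is an assumption; the parser also accepts plain whitespace):

```python
def parse_arpa_unigrams(lines):
    """Parse the \\1-grams section of ARPA-format text into
    {word: (log10_prob, log10_backoff)}; the backoff column is
    absent for words like </s> and defaults to 0.0."""
    grams, in_section = {}, False
    for line in lines:
        line = line.strip()
        if line == "\\1-grams:":
            in_section = True
            continue
        if in_section:
            if not line or line.startswith("\\"):
                break  # end of the 1-gram section
            parts = line.split("\t") if "\t" in line else line.split()
            logp, word = float(parts[0]), parts[1]
            bow = float(parts[2]) if len(parts) > 2 else 0.0
            grams[word] = (logp, bow)
    return grams

sample = [
    "\\1-grams:",
    "-99\t<s>\t-0.9399067",
    "-1.064551\t</s>",
    "-2.87695\tALGERNON\t-0.4771213",
    "",
]
unigrams = parse_arpa_unigrams(sample)
```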

ngraminfo is a command-line utility that prints out various information about an n-gram language model in FST format.

$ ngraminfo earnest.mod
# of states                                     42641
# of ngram arcs                                 56809
# of backoff arcs                               42640
initial state                                   1
unigram state                                   0
# of final states                               5190
ngram order                                     5
# of 1-grams                                    2306
# of 2-grams                                    10319
# of 3-grams                                    14796
# of 4-grams                                    15218
# of 5-grams                                    14170
well-formed                                     y
normalized                                      y
 

N-gram Model Sampling, Application and Evaluation

Revision 13 - 2011-12-10 - MichaelRiley

Line: 1 to 1
 
META TOPICPARENT name="NGramLibrary"
Changed:
<
<

OpenGrm NGram Library Quick Tour

>
>

OpenGrm NGram Library Quick Tour

  This tour is organized around the stages of n-gram model creation, modification and use:
Line: 11 to 11
 
  • n-gram model merging and pruning (ngrammerge and ngramshrink)
  • n-gram model sampling, application and evaluation (ngramrandgen, ngramapply, ngramperplexity)
Changed:
<
<
For additional details, see the summary table of available operations below that has links to each operation's full documentation.
>
>
For additional details, follow the links to each operation's full documentation found in each section and in the summary table of available operations below.
 

Textual I/O

Line: 73 to 73
 -2.87695 ALGERNON -0.4771213
Changed:
<
<
ngraminfo is a command-line utility that prints out various information about an n-gram language model in FST format.
>
>
ngraminfo is a command-line utility that prints out various information about an n-gram language model in FST format.
 
$ ngraminfo earnest.mod
# of states                                     42641
Changed:
<
<
ngram arcs                                      56809
backoff arcs                                    42640
>
>
# of ngram arcs                                 56809
# of backoff arcs                               42640
 initial state                                  1
unigram state                                   0
# of final states                               5190
Line: 202 to 202
 
Added:
>
>

Using the C++ Library

The OpenGrm NGram library is a C++ library. Users can call the available operations from that level rather than from the command line if desired. From C++, include <ngram/ngram.h> in the installation include directory and link to libfst.so, libfar.so, and libngram.so in the installation library directory. This assumes you've installed OpenFst (with --enable-far=yes). (You may instead use just those include files for the classes and functions that you will need.) All classes and functions are in the ngram namespace.

 

Available Operations

Line: 210 to 217
 
Operation Usage Description
NGramApply    
NGramCount    
Changed:
<
<
NGramInfo ngraminfo [in.mod] print various information about an n-gram model
>
>
NGramInfo ngraminfo [in.mod] print various information about an n-gram model
 
NGramMake    
NGramMerge    
NGramPerplexity    
NGramPrint    
NGramRandgen ngramrandgen [--npath] [--seed] [--max_length] [in.mod [out.far]] randomly sample sentences from an n-gram model
NGramRead    
Changed:
<
<
NGramShrink NgramCountPrune(&M, count_pattern); count-based model pruning
  NGramRelativeEntropy(&M, theta); relative-entropy-based model pruning
  NGramSeymoreShrink(&M, theta); Seymore/Rosenfeld-based model pruning
  ngramshrink [--method=count,relative_entropy,seymore] [-count_pattern] [-theta] [in.mod [out.mod]]  
>
>
NGramShrink ngramshrink [--method=count,relative_entropy,seymore] [-count_pattern] [-theta] [in.mod [out.mod]] n-gram model pruning
  NGramCountPrune(&M, count_pattern); --- count-based model pruning
  NGramRelativeEntropy(&M, theta); --- relative-entropy-based model pruning
  NGramSeymoreShrink(&M, theta); --- Seymore/Rosenfeld-based model pruning
 
NGramSymbols    
Changed:
<
<
-- BrianRoark - 12 Nov 2010
>
>
 
META FILEATTACHMENT attachment="earnest.txt" attr="" comment="" date="1288923241" name="earnest.txt" path="earnest.txt" size="91184" stream="earnest.txt" tmpFilename="/var/tmp/CGItemp7285" user="MichaelRiley" version="1"
META TOPICMOVED by="MichaelRiley" date="1296787886" from="GRM.GrmQuickTour" to="GRM.NGramQuickTour"

Revision 12 - 2011-12-09 - MichaelRiley

Line: 1 to 1
 
META TOPICPARENT name="NGramLibrary"

OpenGrm NGram Library Quick Tour

This tour is organized around the stages of n-gram model creation, modification and use:

Changed:
<
<
  • model format and textual I/O (ngramread and ngramprint)
>
>
  • textual I/O (ngramsymbols, farcompilestrings, farprintstrings, ngramread and ngramprint)
  • n-gram model format
 
  • n-gram counting (ngramcount)
  • n-gram model parameter estimation (ngrammake)
  • n-gram model merging and pruning (ngrammerge and ngramshrink)
Changed:
<
<
  • n-gram model application (ngramapply)
>
>
  • n-gram model sampling, application and evaluation (ngramrandgen, ngramapply, ngramperplexity)
 
Changed:
<
<

Model Format

All n-gram models produced by these utilities, including those with unnormalized counts, have a cyclic weighted finite-state transducer format, encoded using the OpenFst library. An n-gram is a sequence of k symbols: w1 ... wk. Let N be the set of n-grams in the model.

  • There is a unigram state in every model, representing the empty string.
  • Every proper prefix of every n-gram in N has an associated state in the model.
  • The state associated with an n-gram w1 ... wk has a backoff transition (labeled with ⟨epsilon⟩) to the state associated with its suffix w2 ... wk.
  • An n-gram w1 ... wk is represented as a transition, labeled with wk, from the state associated with its prefix w1 ... wk-1 to a destination state defined as follows:
    • If w1 ... wk is a proper prefix of an n-gram in the model, then the destination of the transition is the state associated with w1 ... wk
    • Otherwise, the destination of the transition is the state associated with the suffix w2 ... wk.
  • Start and end of the sequence are not represented via transitions in the automaton or symbols in the symbol table. Rather
    • The start state of the automaton encodes the "start of sequence" n-gram prefix (commonly denoted ⟨s⟩).
    • The end of the sequence (often denoted ⟨/s⟩) is included in the model through state final weights, i.e., for a state associated with an n-gram prefix w1 ... wk, the final weight of that state represents the weight of the n-gram w1 ... wk ⟨/s⟩.
>
>
For additional details, see the summary table of available operations below that has links to each operation's full documentation.
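The state topology spelled out in the model-format bullets above can be sketched directly. A conceptual illustration in plain Python (not the library's construction): states are the proper prefixes of the n-grams (plus the empty-history unigram state), and each n-gram arc's destination follows the prefix/suffix rule. One simplification is hedged in a comment: the bullets assume the immediate suffix w2 ... wk is always a state, which holds for models built from real counts, so the sketch falls back to the longest suffix that is a state:

```python
def model_states_and_arcs(ngrams):
    """Derive the state set (the unigram state plus a state per proper
    prefix of each n-gram) and the destination of each n-gram arc."""
    states = {()}  # the unigram state, for the empty history
    for g in ngrams:
        for k in range(1, len(g)):
            states.add(g[:k])

    def destination(g):
        # Longest suffix of g (g itself first) that is a state; the
        # bullets' simplified rule assumes the immediate suffix
        # w2..wk is a state, as it is in models built from real counts.
        while g not in states:
            g = g[1:]
        return g

    return states, {g: destination(g) for g in ngrams}

ngrams = {("A",), ("A", "HAND"), ("A", "HAND", "BAG"), ("HAND", "BAG")}
states, arcs = model_states_and_arcs(ngrams)
```

Here ("A", "HAND") is a proper prefix of the trigram, so its arc points to its own state; the trigram, having no continuation in this toy set, falls all the way back to the unigram state.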
 

Textual I/O

Added:
>
>
Text corpora are represented as binary finite-state archives, with one automaton per sentence. This provides efficient later processing by the NGram Library utilities and allows, if desired, more general probabilistic input (e.g. weighted DAGs or lattices).
 
Changed:
<
<
ngramread is a command line utility for reading in text files and producing FSTs appropriate for use by other functions and utilities. It has flags for specifying the format of the text input, currently one of three options:
>
>
The first step is to generate an OpenFst-style symbol table for the text tokens in input corpus. This can be done with the command-line utility ngramsymbols. For example, the symbols in the text of Oscar Wilde's Importance of Being Earnest, using the suitably normalized copy found here, can be extracted with:
 
Changed:
<
<
  • By default, each line in the text is read as a white-space delimited sequence of symbols, with one string per line. Each string is encoded as a linear automaton, and the resulting set of automata are concatenated into a single archive. The final automaton in the archive holds the symbol table for the archive. The archive can then be used by ngramcount to count n-grams.
  • By using the flag --counts, the text file is read as a sorted list of n-grams with their count. The format is:
    w1 ... wk cnt
    where w1 ... wk are the k words of the n-gram and cnt is the (float) count of that n-gram. The n-grams in the list must be lexicographically ordered. An n-gram count automaton is built from the input.
  • By using the flag --ARPA, the file is read as an n-gram model in the well-known ARPA format. An n-gram model automaton is built from the input.
>
>
$ ngramsymbols <earnest.txt >earnest.syms
 
Changed:
<
<
By default, ngramread constructs a symbol table on the fly, consisting of ⟨epsilon⟩ and every observed symbol in the text. With the flag --symbols=filename you can provide a fixed symbol table, in the standard OpenFst format. All symbols in the input text not found in the provided symbol table will be mapped to an OOV symbol, which is ⟨unk⟩ by default. The flag --OOV_symbol can be used to specify the OOV symbol in the provided symbol table if it is not ⟨unk⟩. For reading n-gram counts and ARPA format models, tokens ⟨s⟩ and ⟨/s⟩ are taken to represent start-of-sequence and end-of-sequence, respectively. Neither of these symbols is used in our automaton format (see above).
>
>
If multiple corpora, e.g. for a separate training set and a test set, are to be processed together, the same symbol table should be used throughout. This can be accomplished by concatenating the corpora when passed to ngramsymbols, eliminating out-of-vocabulary symbols. Alternatively, flags can be passed to both ngramsymbols and farprintstrings to specify an out-of-vocabulary label.


 
Changed:
<
<
For example, the text of Oscar Wilde's Importance of Being Earnest, using the suitably normalized copy found here, can be converted into a concatenated archive of automata (similar to the finite-state archive format) with:
>
>
Given a symbol table, a text corpus can be converted to a binary FAR archive with:
 
Changed:
<
<
$ ngramread earnest.txt >earnest.cat
>
>
$ farcompilestrings -symbols=earnest.syms earnest.txt >earnest.far

and can be printed with:

$ farprintstrings earnest.far >earnest.txt
 


Changed:
<
<
ngramprint is a command line utility for reading in n-gram models and producing text files. Since both raw counts and normalized models are encoded with the same automaton structure, either can be accessed for this function. There are multiple options for output.
>
>
ngramread is a command line utility for reading in textual representations of n-gram models and producing FSTs appropriate for use by other functions and utilities. It has several options for input. For example,
 
Changed:
<
<
  • By default, only n-grams are printed (without backoff ⟨epsilon⟩ transitions), in the same format as discussed above for reading in n-gram counts: w1 ... wk score, where the score will be either the n-gram count or the n-gram probability, depending on whether the model has been normalized. By default, scores are converted from the internal negative log representation to real semiring counts or probabilities.
  • By using the flag --ARPA, the n-gram model is printed in the well-known ARPA format.
  • By using the flag --backoff, backoff ⟨epsilon⟩ transitions are printed along with the n-grams.
  • By using the flag --negativelogs, scores are shown as negative logs, rather than being converted to the real semiring.
  • By using the flag --integers, scores are converted to the real semiring and rounded to integers.
>
>
$ ngramread --ARPA earnest.ARPA >earnest.mod
 
Changed:
<
<
For writing n-gram counts and ARPA format models, tokens ⟨s⟩ and ⟨/s⟩ are used to represent start-of-sequence and end-of-sequence, respectively. Neither of these symbols is used in our automaton format (see above).
>
>
generates an n-gram model in FST format from the ARPA n-gram language model specification.
 
Changed:
<
<
For example, using the example 5-gram model created below, the following prints out a portion of it in ARPA format:
>
>
ngramprint is a command line utility for reading in n-gram models and producing text files. Both raw counts and normalized models are encoded with the same automaton structure, so either can be accessed for this function. There are multiple options for output. For example, using the example 5-gram model created below, the following prints out a portion of it in ARPA format:
 
$ ngramprint --ARPA earnest.mod | head -15
Line: 77 to 73
 -2.87695 ALGERNON -0.4771213
Added:
>
>
ngraminfo is a command-line utility that prints out various information about an n-gram language model in FST format.

$ ngraminfo earnest.mod
# of states                                     42641
ngram arcs                                      56809
backoff arcs                                    42640
initial state                                   1
unigram state                                   0
# of final states                               5190
ngram order                                     5
# of 1-grams                                    2306
# of 2-grams                                    10319
# of 3-grams                                    14796
# of 4-grams                                    15218
# of 5-grams                                    14170
well-formed                                     y
normalized                                      y

Model Format

All n-gram models produced by the utilities here, including those with unnormalized counts, have a cyclic weighted finite-state transducer format, encoded using the OpenFst library. For the precise details of the n-gram format, see here.
 

N-gram Counting

Changed:
<
<
ngramcount is a command line utility for counting n-grams from an input corpus, as prepared by ngramread. It produces an n-gram model in the FST format described above. Transitions and final costs are weighted with the negative log count of the associated n-gram. By using the switch --order the maximum length n-gram to count can be chosen. All n-grams observed in the input corpus of length less than or equal to the specified order will be counted. By default, the order is set to 3 (trigram model).
>
>
ngramcount is a command line utility for counting n-grams from an input corpus, represented in FAR format. It produces an n-gram model in the FST format described above. Transitions and final costs are weighted with the negative log count of the associated n-gram. By using the switch --order the maximum length n-gram to count can be chosen. All n-grams observed in the input corpus of length less than or equal to the specified order will be counted. By default, the order is set to 3 (trigram model).
 
Changed:
<
<
The 1-gram through 5-gram counts for the earnest.cat finite-state archive file created above can be created with:
>
>
The 1-gram through 5-gram counts for the earnest.far finite-state archive file created above can be created with:
 
Changed:
<
<
$ ngramcount -order=5 earnest.cat >earnest.cnts
>
>
$ ngramcount -order=5 earnest.far >earnest.cnts
 

N-gram Model Parameter Estimation

Changed:
<
<
ngrammake is a command line utility for normalizing and smoothing an n-gram model. It takes as input the FST produced by ngramcount (which contains raw, unnormalized counts).
>
>
ngrammake is a command line utility for normalizing and smoothing an n-gram model. It takes as input the FST produced by ngramcount (which contains raw, unnormalized counts).
 The 5-gram counts in earnest.cnts created above can be converted into an n-gram model with:
Line: 99 to 119
 $ ngrammake earnest.cnts >earnest.mod
Changed:
<
<
Here is a random path generated through this FST:
>
>
Flags to ngrammake specify the smoothing (e.g. Katz, Kneser-Ney, etc.) used, with the default being Witten-Bell.

Here is a generated sentence from the language model:

 
Changed:
<
<
$ fstrandgen --select=log_prob earnest.mod | fstprint | cut -f3 | tr '\n' ' '
>
>
$ ngramrandgen earnest.mod | farprintstrings
 I WOULD STRONGLY ADVISE YOU MR WORTHING TO TRY AND ACQUIRE SOME RELATIONS AS FAR AS THE PIANO IS CONCERNED SENTIMENT IS MY FORTE
Changed:
<
<
(An epsilon transition is emitted for each backoff.) Note that the model is encoded as a backoff model, so the epsilons have particular semantics; as a result, this random generation using the general fstrandgen is not exact. See the random generation utilities under ngramapply below.
>
>
(An epsilon transition is emitted for each backoff.)
 

N-gram Model Merging and Pruning

Changed:
<
<
ngrammerge is a command line utility for merging two n-gram models into a single model -- either unnormalized counts or smoothed, normalized models.
>
>
ngrammerge is a command line utility for merging two n-gram models into a single model -- either unnormalized counts or smoothed, normalized models.
 
Changed:
<
<
ngramshrink is a command line utility for pruning n-gram models.
>
>
ngramshrink is a command line utility for pruning n-gram models.
  This following shrinks the 5-gram model created above using entropy pruning to roughly 1/10 the original size:
Line: 123 to 145
 $ ngramshrink -method=relative_entropy -theta=.00015 earnest.mod >earnest.pru
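The relative-entropy criterion used above is involved, but the simplest of ngramshrink's methods, count-based pruning (--method=count), can be sketched conceptually. This is plain Python, not the library's algorithm, and the per-order threshold dictionary stands in loosely for the tool's --count_pattern flag:

```python
def count_prune(counts, min_count_by_order):
    """Drop n-grams whose count falls below the threshold for their
    order, then restore any proper prefixes that surviving n-grams
    still need as backoff states."""
    kept = {g: c for g, c in counts.items()
            if c >= min_count_by_order.get(len(g), 0)}
    # Re-add proper prefixes of kept n-grams so their states survive.
    for g in list(kept):
        for k in range(1, len(g)):
            kept.setdefault(g[:k], counts[g[:k]])
    return kept

counts = {("A",): 5, ("HAND",): 1, ("A", "HAND"): 1, ("A", "HAND", "BAG"): 1}
# Require count >= 2 for bigrams and trigrams; unigrams are kept.
pruned = count_prune(counts, {2: 2, 3: 2})
```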
Changed:
<
<
A random path generated through this FST is:
>
>
A random sentence generated through this LM is:
 
Changed:
<
<
$ fstrandgen --select=log_prob earnest.pru | fstprint | cut -f3 | tr '\n' ' '
>
>
$ ngramrandgen earnest.pru | farprintstrings
 I THINK BE ABLE TO DIARY GWENDOLEN WONDERFUL SECRETS MONEY YOU
Changed:
<
<
See discussion of random generation below.
>
>

N-gram Model Sampling, Application and Evaluation

 
Changed:
<
<

N-gram Model Application

ngramapply is a command line utility for applying n-gram models. It can be called to apply a model to a concatenated archive of automata, or to sample correctly at random from a mixture model.

Prior to randomly generating from the model, ngramapply will convert from a backoff representation back to a mixture representation, so that the transitions have the correct semantics for taking a simple random path through the automaton. To correctly randomly generate from a given model, use the flag --samples as follows:

>
>
ngramrandgen is a command line utility for sampling from n-gram models.
 
Changed:
<
<
$ ngramapply --samples=1 earnest.mod
>
>
$ ngramrandgen [--npaths=1] earnest.mod | farprintstrings
 IT IS SIMPLY A VERY INEXCUSABLE MANNER
Changed:
<
<
To see the backoff arcs when randomly generating, use the flag --show_backoff as follows:
>
>

ngramapply is a command line utility for applying n-gram models. It can be called to apply a model to a concatenated archive of automata:
 
Changed:
<
<
$ ngramapply --show_backoff --samples=1 earnest.mod
YOUR BROTHER WAS I BELIEVE UNMARRIED WAS HE NOT
>
>
$ ngramapply earnest.mod earnest.far | farprintstrings -print_weight
 
Changed:
<
<
The following calculates the perplexity of two strings (a hand bag and bag hand a) from the example 5-gram model generated above:
>
>
The result is a FAR weighted by the n-gram model.
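The weight ngramapply assigns to each word follows the backoff recursion: use the longest matching n-gram, adding the backoff weight of each history that has to be abandoned along the way. A minimal sketch in plain Python with log10 weights (a toy dictionary model with hypothetical values, not the library's FST composition):

```python
def log_prob(model, history, word):
    """Score one word: use the longest matching n-gram, accumulating
    the (log10) backoff weight of each history we back off from."""
    logp = 0.0
    while history and history + (word,) not in model["ngrams"]:
        logp += model["backoff"].get(history, 0.0)
        history = history[1:]
    return logp + model["ngrams"].get(history + (word,), float("-inf"))

# Toy model in log10, with hypothetical values:
model = {
    "ngrams": {("A",): -1.0, ("HAND",): -1.3, ("A", "HAND"): -0.5},
    "backoff": {("A",): -0.2},
}
```

For example, scoring "A" after the history ("A",) finds no bigram ("A", "A"), so it pays the backoff weight -0.2 and uses the unigram score -1.0.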


ngramperplexity can be used to evaluate an n-gram model. For example, the following calculates the perplexity of two strings (a hand bag and bag hand a) from the example 5-gram model generated above:

 
Changed:
<
<
echo -e "A HAND BAG\nBAG HAND A" | ngramread | ngramapply --v=1 earnest.mod -
>
>
echo -e "A HAND BAG\nBAG HAND A" | ngramread | ngramperplexity --v=1 earnest.mod -
A HAND BAG
                                        ngram  -logprob
        N-gram probability              found  (base10)
Line: 180 to 202
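The perplexity ngramperplexity reports is derived from the per-word log10 probabilities it prints: negate their average and raise 10 to that power, counting the end-of-sentence </s> term as a word. A minimal sketch with hypothetical scores (not output from the actual model):

```python
def perplexity(log10_probs):
    """Perplexity from per-word log10 probabilities, with the </s>
    end-of-sentence term counted as a word:
    10 ** (-average log10 probability)."""
    return 10 ** (-sum(log10_probs) / len(log10_probs))

# Hypothetical per-word scores for "A HAND BAG </s>":
ppl = perplexity([-1.0, -2.0, -1.5, -0.5])
```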
 
Deleted:
<
<
ngramapply will have more general model application methods (using fstcompose with a phi matcher) soon.
 

Available Operations

Click on operation name for additional information.

Operation Usage Description
Changed:
<
<
NGramRead ArcMap(&A, mapper); transforms arcs in an FST
  ArcMap(A, &B, mapper);  
  ArcMapFst<InArc, OutArc, ArcMapper>(A, mapper);  
  fstmap [--delta=$d] [--map=$type] [--weight=$w] in.fst out.fst  
>
>
NGramApply    
NGramCount    
NGramInfo ngraminfo [in.mod] print various information about an n-gram model
NGramMake    
NGramMerge    
NGramPerplexity    
NGramPrint    
NGramRandgen ngramrandgen [--npath] [--seed] [--max_length] [in.mod [out.far]] randomly sample sentences from an n-gram model
NGramRead    
NGramShrink NgramCountPrune(&M, count_pattern); count-based model pruning
  NGramRelativeEntropy(&M, theta); relative-entropy-based model pruning
  NGramSeymoreShrink(&M, theta); Seymore/Rosenfeld-based model pruning
  ngramshrink [--method=count,relative_entropy,seymore] [-count_pattern] [-theta] [in.mod [out.mod]]  
NGramSymbols    
  -- BrianRoark - 12 Nov 2010

Revision 11 - 2011-12-08 - MichaelRiley

Line: 1 to 1
 
META TOPICPARENT name="NGramLibrary"

OpenGrm NGram Library Quick Tour

Line: 25 to 25
 
    • The start state of the automaton encodes the "start of sequence" n-gram prefix (commonly denoted ⟨s⟩).
    • The end of the sequence (often denoted ⟨/s⟩) is included in the model through state final weights, i.e., for a state associated with an n-gram prefix w1 ... wk, the final weight of that state represents the weight of the n-gram w1 ... wk ⟨/s⟩.
Changed:
<
<
>
>
 

Textual I/O

ngramread is a command line utility for reading in text files and producing FSTs appropriate for use by other functions and utilities. It has flags for specifying the format of the text input, currently one of three options:

Line: 77 to 77
 -2.87695 ALGERNON -0.4771213
Changed:
<
<
#Counting
>
>
 

N-gram Counting

ngramcount is a command line utility for counting n-grams from an input corpus, as prepared by ngramread. It produces an n-gram model in the FST format described above. Transitions and final costs are weighted with the negative log count of the associated n-gram. By using the switch --order the maximum length n-gram to count can be chosen. All n-grams observed in the input corpus of length less than or equal to the specified order will be counted. By default, the order is set to 3 (trigram model).

Line: 132 to 132
  See discussion of random generation below.
Changed:
<
<
#Application
>
>
 

N-gram Model Application

ngramapply is a command line utility for applying n-gram models. It can be called to apply a model to a concatenated archive of automata, or to correctly randomly sample from a mixture model.

Revision 10 - 2011-12-08 - MichaelRiley

Line: 1 to 1
 
META TOPICPARENT name="NGramLibrary"

OpenGrm NGram Library Quick Tour

Line: 10 to 10
 
  • n-gram model merging and pruning (ngrammerge and ngramshrink)
  • n-gram model application (ngramapply)
Added:
>
>
 

Model Format

All n-gram models produced by these utilities, including those with unnormalized counts, have a cyclic weighted finite-state transducer format, encoded using the OpenFst library. An n-gram is a sequence of k symbols: w1 ... wk. Let N be the set of n-grams in the model.

Line: 24 to 25
 
    • The start state of the automaton encodes the "start of sequence" n-gram prefix (commonly denoted ⟨s⟩).
    • The end of the sequence (often denoted ⟨/s⟩) is included in the model through state final weights, i.e., for a state associated with an n-gram prefix w1 ... wk, the final weight of that state represents the weight of the n-gram w1 ... wk ⟨/s⟩.
Added:
>
>
 

Textual I/O

ngramread is a command line utility for reading in text files and producing FSTs appropriate for use by other functions and utilities. It has flags for specifying the format of the text input, currently one of three options:

Line: 75 to 77
 -2.87695 ALGERNON -0.4771213
Added:
>
>
#Counting
 

N-gram Counting

ngramcount is a command line utility for counting n-grams from an input corpus, as prepared by ngramread. It produces an n-gram model in the FST format described above. Transitions and final costs are weighted with the negative log count of the associated n-gram. By using the switch --order the maximum length n-gram to count can be chosen. All n-grams observed in the input corpus of length less than or equal to the specified order will be counted. By default, the order is set to 3 (trigram model).

Line: 85 to 88
 $ ngramcount -order=5 earnest.cat >earnest.cnts
Added:
>
>
 

N-gram Model Parameter Estimation

ngrammake is a command line utility for normalizing and smoothing an n-gram model. It takes as input the FST produced by ngramcount (which contains raw, unnormalized counts).

Line: 104 to 108
  (An epsilon transition is emitted for each backoff.) Note that the model is encoded as a backoff model, so that the epsilons have a particular semantics, such that this random generation using general fstrandgen is not exact. See random generation utilities under ngramapply below.
Added:
>
>
 

N-gram Model Merging and Pruning

ngrammerge is a command line utility for merging two n-gram models into a single model -- either unnormalized counts or smoothed, normalized models.
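For the unnormalized-counts case, merging amounts to adding the counts of matching n-grams. Since the FSTs store negative-log counts, that addition is a log-add; a plain-Python sketch of just this case (ngrammerge also supports weighted merging of normalized models):

```python
import math

def merge_counts(a, b):
    """Merge two count models keyed by n-gram tuple. The FSTs store
    negative-log counts, so merging matching n-grams is a log-add:
    -log(exp(-x) + exp(-y)). N-grams present in only one model are
    copied through unchanged."""
    merged = {}
    for ng in set(a) | set(b):
        if ng in a and ng in b:
            merged[ng] = -math.log(math.exp(-a[ng]) + math.exp(-b[ng]))
        else:
            merged[ng] = a.get(ng, b.get(ng))
    return merged

# Counts of 2 and 3, stored as negative logs, merge to a count of 5.
m = merge_counts({("A",): -math.log(2.0)}, {("A",): -math.log(3.0)})
```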

Line: 127 to 132
  See discussion of random generation below.
Added:
>
>
#Application
 

N-gram Model Application

ngramapply is a command line utility for applying n-gram models. It can apply a model to a concatenated archive of automata, or sample correctly at random from a mixture model.

Line: 176 to 182
  ngramapply will have more general model application methods (using fstcompose with a phi matcher) soon.
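The semantics the model application realizes is the standard backoff recursion: use the explicit n-gram probability when one exists, otherwise multiply in the history's backoff weight and retry with the history shortened by one word (the epsilon arc). A toy dictionary-based sketch of that recursion, with hypothetical probabilities (ngramapply does this with the automaton itself, and phi matching avoids following epsilons when an explicit arc exists):

```python
def backoff_prob(word, history, probs, bow):
    """Backoff lookup as an n-gram model encodes it. `probs` maps
    (history_tuple, word) -> probability; `bow` maps history_tuple ->
    backoff weight. Each failed lookup multiplies in the backoff
    weight and drops the oldest history word."""
    weight = 1.0
    h = tuple(history)
    while True:
        if (h, word) in probs:
            return weight * probs[(h, word)]
        if not h:
            raise KeyError(word)
        weight *= bow.get(h, 1.0)
        h = h[1:]

# Hypothetical toy model: explicit bigram HAND -> BAG, unigram BAG.
probs = {((), "BAG"): 0.2, (("HAND",), "BAG"): 0.5}
bow = {("A", "HAND"): 0.4, ("HAND",): 0.7}
p = backoff_prob("BAG", ("A", "HAND"), probs, bow)  # 0.4 * 0.5
```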
Added:
>
>

Available Operations

Click on operation name for additional information.

Operation Usage Description
NGramRead ArcMap(&A, mapper); transforms arcs in an FST
  ArcMap(A, &B, mapper);  
  ArcMapFst<InArc, OutArc, ArcMapper>(A, mapper);  
  fstmap [--delta=$d] [--map=$type] [--weight=$w] in.fst out.fst  
 -- BrianRoark - 12 Nov 2010

META FILEATTACHMENT attachment="earnest.txt" attr="" comment="" date="1288923241" name="earnest.txt" path="earnest.txt" size="91184" stream="earnest.txt" tmpFilename="/var/tmp/CGItemp7285" user="MichaelRiley" version="1"

Revision 92011-11-04 - BrianRoark

Line: 1 to 1
 
META TOPICPARENT name="NGramLibrary"

OpenGrm NGram Library Quick Tour

Line: 31 to 31
 
  • By default, each line in the text is read as a white-space delimited sequence of symbols, with one string per line. Each string is encoded as a linear automaton, and the resulting set of automata are concatenated into a single archive. The final automaton in the archive holds the symbol table for the archive. The archive can then be used by ngramcount to count n-grams.
  • By using the flag --counts, the text file is read as a sorted list of n-grams with their count. The format is:
    w1 ... wk cnt
Changed:
<
<
where w1 ... wk are the k words of the n-gram and cnt is the (float) count of that n-gram. The list must be consistently ordered, so that (1) any proper prefix of an n-gram is listed before that n-gram, and (2) if word v comes before word w somewhere, it must do so everywhere. An n-gram count automaton is built from the input.
>
>
where w1 ... wk are the k words of the n-gram and cnt is the (float) count of that n-gram. The n-grams in the list must be lexicographically ordered. An n-gram count automaton is built from the input.
 
  • By using the flag --ARPA, the file is read as an n-gram model in the well-known ARPA format. An n-gram model automaton is built from the input.
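The prefix-ordering constraint on textual counts files can be checked mechanically. A toy validator (illustration only; it checks only the prefix constraint, not the consistent word-order constraint, and ngramread does its own checking):

```python
def prefixes_listed_first(lines):
    """Check constraint (1) on a textual counts file: every proper
    prefix of an n-gram that appears in the file must be listed
    before that n-gram. Each line is 'w1 ... wk cnt'."""
    seen = set()
    ngrams = [tuple(line.split()[:-1]) for line in lines]
    present = set(ngrams)
    for ng in ngrams:
        for k in range(1, len(ng)):
            if ng[:k] in present and ng[:k] not in seen:
                return False
        seen.add(ng)
    return True

ok = prefixes_listed_first(["A 3", "A HAND 2", "A HAND BAG 1"])
bad = prefixes_listed_first(["A HAND 2", "A 3"])
```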

By default, ngramread constructs a symbol table on the fly, consisting of ⟨epsilon⟩ and every observed symbol in the text. With the flag --symbols=filename you can provide the filename to provide a fixed symbol table, in the standard OpenFst format. All symbols in the input text not found in the provided symbol table will be mapped to an OOV symbol, which is ⟨unk⟩ by default. The flag --OOV_symbol can be used to specify the OOV symbol in the provided symbol table if it is not ⟨unk⟩. For reading n-gram counts and ARPA format models, tokens ⟨s⟩ and ⟨/s⟩ are taken to represent start-of-sequence and end-of-sequence, respectively. Neither of these symbols are used in our automaton format (see above).
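The OOV handling described above is a simple token-by-token mapping. A minimal sketch of the behavior, with a hypothetical symbol table (not the OpenFst symbol-table data structure):

```python
def map_oov(tokens, symbol_table, oov="<unk>"):
    """Mimic the symbol handling when a fixed symbol table is
    supplied: any token not in the table is mapped to the OOV symbol
    (<unk> by default, or whatever --OOV_symbol names)."""
    return [t if t in symbol_table else oov for t in tokens]

syms = {"<epsilon>", "A", "HAND", "BAG"}
mapped = map_oov("A HAND GRENADE".split(), syms)
```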

Revision 82011-10-11 - MartinJansche

Line: 1 to 1
 
META TOPICPARENT name="NGramLibrary"

OpenGrm NGram Library Quick Tour

Line: 115 to 115
 The following shrinks the 5-gram model created above using entropy pruning to roughly 1/10 the original size:
Changed:
<
<
$ ngramshrink -relative_entropy -theta=.00015 earnest.mod >earnest.pru
>
>
$ ngramshrink -method=relative_entropy -theta=.00015 earnest.mod >earnest.pru
 

A random path generated through this FST is:

Revision 72011-02-04 - MichaelRiley

Line: 1 to 1
Changed:
<
<

OpenGrm Quick Tour

>
>
META TOPICPARENT name="NGramLibrary"

OpenGrm NGram Library Quick Tour

  This tour is organized around the stages of n-gram model creation, modification and use:
Line: 178 to 179
 -- BrianRoark - 12 Nov 2010

META FILEATTACHMENT attachment="earnest.txt" attr="" comment="" date="1288923241" name="earnest.txt" path="earnest.txt" size="91184" stream="earnest.txt" tmpFilename="/var/tmp/CGItemp7285" user="MichaelRiley" version="1"
Added:
>
>
META TOPICMOVED by="MichaelRiley" date="1296787886" from="GRM.GrmQuickTour" to="GRM.NGramQuickTour"

Revision 62010-11-12 - BrianRoark

Line: 1 to 1
 

OpenGrm Quick Tour

This tour is organized around the stages of n-gram model creation, modification and use:

Line: 101 to 101
 I WOULD STRONGLY ADVISE YOU MR WORTHING TO TRY AND ACQUIRE SOME RELATIONS AS FAR AS THE PIANO IS CONCERNED SENTIMENT IS MY FORTE
Changed:
<
<
(An epsilon transition is emitted for each backoff.) Note that the model is encoded as a backoff model, so that the epsilons have a particular semantics, hence this random generation using general fstrandgen is not exact. See random generation utilities under ngramapply below.
>
>
(An epsilon transition is emitted for each backoff.) Note that the model is encoded as a backoff model, so that the epsilons have a particular semantics, such that this random generation using general fstrandgen is not exact. See random generation utilities under ngramapply below.
 

N-gram Model Merging and Pruning

Revision 52010-11-12 - BrianRoark

Line: 1 to 1
 

OpenGrm Quick Tour

This tour is organized around the stages of n-gram model creation, modification and use:

Line: 33 to 33
  where w1 ... wk are the k words of the n-gram and cnt is the (float) count of that n-gram. The list must be consistently ordered, so that (1) any proper prefix of an n-gram is listed before that n-gram, and (2) if word v comes before word w somewhere, it must do so everywhere. An n-gram count automaton is built from the input.
  • By using the flag --ARPA, the file is read as an n-gram model in the well-known ARPA format. An n-gram model automaton is built from the input.
Changed:
<
<
By default, ngramread= constructs a symbol table on the fly, consisting of ⟨epsilon⟩ and every observed symbol in the text. With the flag --symbols=filename you can provide the filename to provide a fixed symbol table, in the standard OpenFst format. All symbols in the input text not found in the provided symbol table will be mapped to an OOV symbol, which is ⟨unk⟩ by default. The flag --OOV_symbol can be used to specify the OOV symbol in the provided symbol table if it is not ⟨unk⟩. For reading n-gram counts and ARPA format models, tokens ⟨s⟩ and ⟨/s⟩ are taken to represent start-of-sequence and end-of-sequence, respectively. Neither of these symbols are used in our automaton format (see above).
>
>
By default, ngramread constructs a symbol table on the fly, consisting of ⟨epsilon⟩ and every observed symbol in the text. With the flag --symbols=filename you can provide the filename to provide a fixed symbol table, in the standard OpenFst format. All symbols in the input text not found in the provided symbol table will be mapped to an OOV symbol, which is ⟨unk⟩ by default. The flag --OOV_symbol can be used to specify the OOV symbol in the provided symbol table if it is not ⟨unk⟩. For reading n-gram counts and ARPA format models, tokens ⟨s⟩ and ⟨/s⟩ are taken to represent start-of-sequence and end-of-sequence, respectively. Neither of these symbols are used in our automaton format (see above).
 
Changed:
<
<
For example, the text of Oscar Wilde's Importance of Being Earnest, using the suitably normalized copy found here, can be converted into a finite-state archive format with:
>
>
For example, the text of Oscar Wilde's Importance of Being Earnest, using the suitably normalized copy found here, can be converted into a concatenated archive of automata (similar to the finite-state archive format) with:
 
Changed:
<
<
$ ngramread earnest.txt >earnest.far
>
>
$ ngramread earnest.txt >earnest.cat
 
Deleted:
<
<
 

ngramprint is a command line utility for reading in n-gram models and producing text files. Since both raw counts and normalized models are encoded with the same automaton structure, either can be accessed for this function. There are multiple options for output.

Line: 78 to 78
  ngramcount is a command line utility for counting n-grams from an input corpus, as prepared by ngramread=. It produces an n-gram model in the FST format described above. Transitions and final costs are weighted with the negative log count of the associated n-gram. By using the switch --order the maximum length n-gram to count can be chosen. All n-grams observed in the input corpus of length less than or equal to the specified order will be counted. By default, the order is set to 3 (trigram model).
Changed:
<
<
The 1-gram through 5-gram counts for the earnest.far finite-state archive file created above can be created with:
>
>
The 1-gram through 5-gram counts for the earnest.cat finite-state archive file created above can be created with:
 
Changed:
<
<
$ ngramcount -order=5 earnest.far >earnest.cnts
>
>
$ ngramcount -order=5 earnest.cat >earnest.cnts
 

N-gram Model Parameter Estimation

Line: 101 to 101
 I WOULD STRONGLY ADVISE YOU MR WORTHING TO TRY AND ACQUIRE SOME RELATIONS AS FAR AS THE PIANO IS CONCERNED SENTIMENT IS MY FORTE
Changed:
<
<
(An epsilon transition us emitted for each backoff.)
>
>
(An epsilon transition is emitted for each backoff.) Note that the model is encoded as a backoff model, so that the epsilons have a particular semantics, hence this random generation using general fstrandgen is not exact. See random generation utilities under ngramapply below.
 

N-gram Model Merging and Pruning

Line: 124 to 124
 I THINK BE ABLE TO DIARY GWENDOLEN WONDERFUL SECRETS MONEY YOU
Added:
>
>
See discussion of random generation below.
 

N-gram Model Application

Changed:
<
<
ngramapply is a command line utilty for applying n-gram models.
>
>
ngramapply is a command line utility for applying n-gram models. It can be called to apply a model to a concatenated archive of automata, or to correctly randomly sample from a mixture model.

Prior to randomly generating from the model, ngramapply will convert from a backoff representation back to a mixture representation, so that the transitions have the correct semantics for taking a simple random path through the automaton. To correctly randomly generate from a given model, use the flag --samples as follows:

$ ngramapply --samples=1 earnest.mod
IT IS SIMPLY A VERY INEXCUSABLE MANNER

To see the backoff arcs when randomly generating, use the flag --show_backoff as follows:

$ ngramapply --show_backoff --samples=1 earnest.mod
YOUR BROTHER WAS I BELIEVE UNMARRIED WAS <epsilon> <epsilon> HE NOT <epsilon>
 
Changed:
<
<
The following applies the string a hand bag, compiled into an FST, to the example 5-gram model generated above:

# Extract FST symbol table from the n-gram model
$ fstprint --save_isymbols=earnest.syms handbag.fst >/dev/null
# Compile text string into an FST
$ fstcompile --isymbols=earnest.syms --keep_isymbols --acceptor <<EOF >handbag.fst
0 1 A
1 2 HAND
2 3 BAG
EOF
# Apply the n-gram model to the string FST
$ ngramapply --verbose earnest.mod handbag.fst >applied.fst
 HAND BAG
>
>
The following calculates the perplexity of two strings (a hand bag and bag hand a) from the example 5-gram model generated above:
echo -e "A HAND BAG\nBAG HAND A" | ngramread | ngramapply --v=1 earnest.mod -
A HAND BAG
                                                ngram  -logprob
        N-gram probability                      found  (base10)
        p( A | <s> )                         = [2gram]  1.87984
Line: 151 to 158
 1 sentences, 3 words, 0 OOVs logprob(base 10)= -5.00044; perplexity (base 10)= 17.7873
Changed:
<
<
A HAND BAG
>
>
BAG HAND A
                                                ngram  -logprob
        N-gram probability                      found  (base10)
Changed:
<
<
        p( A | <s> )                         = [2gram]  1.87984
        p( HAND | A ...)                     = [2gram]  2.56724
        p( BAG | HAND ...)                   = [3gram]  0.0457417
        p( </s> | BAG ...)                   = [4gram]  0.507622
>
>
        p( BAG | <s> )                       = [1gram]  4.02771
        p( HAND | BAG ...)                   = [1gram]  3.35968
        p( A | HAND ...)                     = [1gram]  2.51843
        p( </s> | A ...)                     = [1gram]  1.53325
 1 sentences, 3 words, 0 OOVs
Changed:
<
<
logprob(base 10)= -5.00044; perplexity (base 10)= 17.7873
>
>
logprob(base 10)= -11.4391; perplexity (base 10)= 724.048
  2 sentences, 6 words, 0 OOVs
Changed:
<
<
logprob(base 10)= -10.0009; perplexity (base 10)= 17.7873
>
>
logprob(base 10)= -16.4395; perplexity (base 10)= 113.485
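The perplexity figures follow directly from the log probabilities: base-10 perplexity is 10 raised to the negative log probability divided by the number of scored events, where each sentence contributes its words plus one end-of-sequence event. A quick plain-Python check against the numbers above:

```python
def perplexity(logprob10, words, sentences):
    """Base-10 perplexity. Each sentence's </s> counts as a scored
    event, so the exponent divides by words + sentences."""
    return 10 ** (-logprob10 / (words + sentences))

# Figures from the ngramapply runs above:
pp1 = perplexity(-5.00044, 3, 1)    # A HAND BAG
pp2 = perplexity(-11.4391, 3, 1)    # BAG HAND A
ppc = perplexity(-16.4395, 6, 2)    # both sentences combined
```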
 
Changed:
<
<
-- BrianRoark - 05 Oct 2010
>
>
ngramapply will have more general model application methods (using fstcompose with a phi matcher) soon.

-- BrianRoark - 12 Nov 2010

 
META FILEATTACHMENT attachment="earnest.txt" attr="" comment="" date="1288923241" name="earnest.txt" path="earnest.txt" size="91184" stream="earnest.txt" tmpFilename="/var/tmp/CGItemp7285" user="MichaelRiley" version="1"

Revision 42010-11-05 - MichaelRiley

Line: 1 to 1
 

OpenGrm Quick Tour

This tour is organized around the stages of n-gram model creation, modification and use:

Changed:
<
<
  • Model format and I/O (ngramread and ngramprint)
  • n-gram counting (ngramcount)
  • n-gram model parameter estimation (ngrammake)
  • n-gram model merging and pruning (ngrammerge and ngramshrink)
  • n-gram model application (ngramapply)
>
>
  • model format and textual I/O (ngramread and ngramprint)
  • n-gram counting (ngramcount)
  • n-gram model parameter estimation (ngrammake)
  • n-gram model merging and pruning (ngrammerge and ngramshrink)
  • n-gram model application (ngramapply)
 
Changed:
<
<

Model format

>
>

Model Format

  All n-gram models produced by these utilities, including those with unnormalized counts, have a cyclic weighted finite-state transducer format, encoded using the OpenFst library. An n-gram is a sequence of k symbols: w1 ... wk. Let N be the set of n-grams in the model.
Line: 23 to 23
 
    • The start state of the automaton encodes the "start of sequence" n-gram prefix (commonly denoted ⟨s⟩).
    • The end of the sequence (often denoted ⟨/s⟩) is included in the model through state final weights, i.e., for a state associated with an n-gram prefix w1 ... wk, the final weight of that state represents the weight of the n-gram w1 ... wk ⟨/s⟩.
Changed:
<
<

ngramread

>
>

Textual I/O

 
Changed:
<
<
ngramread is a command line utility for reading in text files and producing FSTs appropriate for use by other functions and utilities. It has flags for specifying the format of the text input, currently one of three options:
>
>
ngramread is a command line utility for reading in text files and producing FSTs appropriate for use by other functions and utilities. It has flags for specifying the format of the text input, currently one of three options:
 
Changed:
<
<
  • By default, each line in the text is read as a white-space delimited sequence of symbols, with one string per line. Each string is encoded as a linear automaton, and the resulting set of automata are concatenated into a single archive. The final automaton in the archive holds the symbol table for the archive. The archive can then be used by ngramcount to count n-grams.
>
>
* By default, each line in the text is read as a white-space delimited sequence of symbols, with one string per line. Each string is encoded as a linear automaton, and the resulting set of automata are concatenated into a single archive. The final automaton in the archive holds the symbol table for the archive. The archive can then be used by ngramcount to count n-grams.
 
  • By using the flag --counts, the text file is read as a sorted list of n-grams with their count. The format is:
    w1 ... wk cnt
    where w1 ... wk are the k words of the n-gram and cnt is the (float) count of that n-gram. The list must be consistently ordered, so that (1) any proper prefix of an n-gram is listed before that n-gram, and (2) if word v comes before word w somewhere, it must do so everywhere. An n-gram count automaton is built from the input.
  • By using the flag --ARPA, the file is read as an n-gram model in the well-known ARPA format. An n-gram model automaton is built from the input.
Changed:
<
<
By default, ngramread constructs a symbol table on the fly, consisting of ⟨epsilon⟩ and every observed symbol in the text. With the flag --symbols=filename you can provide the filename to provide a fixed symbol table, in the standard OpenFst format. All symbols in the input text not found in the provided symbol table will be mapped to an OOV symbol, which is ⟨unk⟩ by default. The flag --OOV_symbol can be used to specify the OOV symbol in the provided symbol table if it is not ⟨unk⟩. For reading n-gram counts and ARPA format models, tokens ⟨s⟩ and ⟨/s⟩ are taken to represent start-of-sequence and end-of-sequence, respectively. Neither of these symbols are used in our automaton format (see above).
>
>
By default, ngramread= constructs a symbol table on the fly, consisting of ⟨epsilon⟩ and every observed symbol in the text. With the flag --symbols=filename you can provide the filename to provide a fixed symbol table, in the standard OpenFst format. All symbols in the input text not found in the provided symbol table will be mapped to an OOV symbol, which is ⟨unk⟩ by default. The flag --OOV_symbol can be used to specify the OOV symbol in the provided symbol table if it is not ⟨unk⟩. For reading n-gram counts and ARPA format models, tokens ⟨s⟩ and ⟨/s⟩ are taken to represent start-of-sequence and end-of-sequence, respectively. Neither of these symbols are used in our automaton format (see above).
 
Changed:
<
<

ngramprint

>
>
For example, the text of Oscar Wilde's Importance of Being Earnest, using the suitably normalized copy found here, can be converted into a finite-state archive format with:
 
Changed:
<
<
ngramprint is a command line utility for reading in n-gram models and producing text files. Since both raw counts and normalized models are encoded with the same automaton structure, either can be accessed for this function. There are multiple options for output.
>
>
$ ngramread earnest.txt >earnest.far


ngramprint is a command line utility for reading in n-gram models and producing text files. Since both raw counts and normalized models are encoded with the same automaton structure, either can be accessed for this function. There are multiple options for output.

 
  • By default, only n-grams are printed (without backoff ⟨epsilon⟩ transitions), in the same format as discussed above for reading in n-gram counts: w1 ... wk score, where the score will be either the n-gram count or the n-gram probability, depending on whether the model has been normalized. By default, scores are converted from the internal negative log representation to real semiring counts or probabilities.
  • By using the flag --ARPA, the n-gram model is printed in the well-known ARPA format.
Line: 47 to 54
  For writing n-gram counts and ARPA format models, tokens ⟨s⟩ and ⟨/s⟩ are used to represent start-of-sequence and end-of-sequence, respectively. Neither of these symbols are used in our automaton format (see above).
Changed:
<
<

ngramcount

ngramcount is a command line utility for counting n-grams from an input corpus, as prepared by ngramread. It produces an n-gram model in the FST format described above. Transitions and final costs are weighted with the negative log count of the associated n-gram. By using the switch --order the maximum length n-gram to count can be chosen. All n-grams observed in the input corpus of length less than or equal to the specified order will be counted. By default, the order is set to 3 (trigram model).

ngrammake

>
>
For example, using the example 5-gram model created below, the following prints out a portion of it in ARPA format:
 
Changed:
<
<
ngrammake is a command line utility for normalizing and smoothing an n-gram model. It takes as input the FST produced by ngramcount (which contains raw, unnormalized counts).
>
>
$ ngramprint --ARPA earnest.mod | head -15
 
Changed:
<
<

ngrammerge

>
>
\data\
ngram 1=2306
ngram 2=10319
ngram 3=14796
ngram 4=15218
ngram 5=14170

\1-grams:
-99       <s>       -0.9399067
-1.064551 </s>
-3.337681 MORNING   -0.3590219
-2.990894 ROOM      -0.4771213
-1.857355 IN        -0.6232494
-2.87695  ALGERNON  -0.4771213

N-gram Counting

ngramcount is a command line utility for counting n-grams from an input corpus, as prepared by ngramread=. It produces an n-gram model in the FST format described above. Transitions and final costs are weighted with the negative log count of the associated n-gram. By using the switch --order the maximum length n-gram to count can be chosen. All n-grams observed in the input corpus of length less than or equal to the specified order will be counted. By default, the order is set to 3 (trigram model).

The 1-gram through 5-gram counts for the earnest.far finite-state archive file created above can be created with:

$ ngramcount -order=5 earnest.far >earnest.cnts

N-gram Model Parameter Estimation

ngrammake is a command line utility for normalizing and smoothing an n-gram model. It takes as input the FST produced by ngramcount (which contains raw, unnormalized counts).

The 5-gram counts in earnest.cnts created above can be converted into an n-gram model with:

$ ngrammake earnest.cnts >earnest.mod

Here is a random path generated through this FST:

$ fstrandgen --select=log_prob earnest.mod | fstprint | cut -f3 | tr '\n' ' '
I <epsilon> WOULD STRONGLY <epsilon> ADVISE YOU MR WORTHING TO TRY <epsilon> AND <epsilon> ACQUIRE <epsilon> SOME RELATIONS AS <epsilon> <epsilon> <epsilon> FAR AS THE PIANO IS CONCERNED <epsilon> SENTIMENT <epsilon> IS MY FORTE <epsilon>  

(An epsilon transition us emitted for each backoff.)

N-gram Model Merging and Pruning

ngrammerge is a command line utility for merging two n-gram models into a single model -- either unnormalized counts or smoothed, normalized models.


ngramshrink is a command line utility for pruning n-gram models.

The following shrinks the 5-gram model created above using entropy pruning to roughly 1/10 the original size:

$ ngramshrink -relative_entropy -theta=.00015 earnest.mod >earnest.pru

A random path generated through this FST is:

$ fstrandgen --select=log_prob earnest.pru | fstprint | cut -f3 | tr '\n' ' '
I THINK <epsilon> BE ABLE TO <epsilon> DIARY GWENDOLEN WONDERFUL SECRETS MONEY <epsilon> YOU <epsilon>  
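One constraint pruning must respect is structural: a pruned model still needs a state for every proper prefix of every surviving n-gram. A much simpler stand-in for the relative-entropy criterion, a hypothetical count-threshold pruner, shows that constraint (ngramshrink instead scores each n-gram by how much removing it perturbs the model distribution):

```python
def prune_by_count(ngram_counts, min_count=2):
    """Toy pruner: drop n-grams below a count threshold, but keep any
    n-gram that is a proper prefix of a surviving n-gram, since the
    model needs every prefix state. Illustration only; not the
    relative-entropy criterion ngramshrink implements."""
    kept = {ng for ng, c in ngram_counts.items() if c >= min_count}
    # Restore prefixes required by surviving n-grams.
    for ng in list(kept):
        for k in range(1, len(ng)):
            if ng[:k] in ngram_counts:
                kept.add(ng[:k])
    return {ng: ngram_counts[ng] for ng in kept}

pruned = prune_by_count({("A",): 5, ("A", "HAND"): 1, ("B",): 1})
```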

N-gram Model Application

ngramapply is a command line utilty for applying n-gram models.

The following applies the string a hand bag, compiled into an FST, to the example 5-gram model generated above:


# Extract FST symbol table from the n-gram model
$ fstprint --save_isymbols=earnest.syms handbag.fst >/dev/null
# Compile text string into an FST
$ fstcompile --isymbols=earnest.syms --keep_isymbols --acceptor <<EOF >handbag.fst
0 1 A
1 2 HAND
2 3 BAG
EOF
# Apply the n-gram model to the string FST
$ ngramapply --verbose earnest.mod handbag.fst >applied.fst
 HAND BAG
                                                ngram  -logprob
        N-gram probability                      found  (base10)
        p( A | <s> )                         = [2gram]  1.87984
        p( HAND | A ...)                     = [2gram]  2.56724
        p( BAG | HAND ...)                   = [3gram]  0.0457417
        p( </s> | BAG ...)                   = [4gram]  0.507622
1 sentences, 3 words, 0 OOVs
logprob(base 10)= -5.00044;  perplexity (base 10)= 17.7873

A HAND BAG
                                                ngram  -logprob
        N-gram probability                      found  (base10)
        p( A | <s> )                         = [2gram]  1.87984
        p( HAND | A ...)                     = [2gram]  2.56724
        p( BAG | HAND ...)                   = [3gram]  0.0457417
        p( </s> | BAG ...)                   = [4gram]  0.507622
1 sentences, 3 words, 0 OOVs
logprob(base 10)= -5.00044;  perplexity (base 10)= 17.7873
 
Changed:
<
<
ngrammerge is a command line utility for merging two n-gram models into a single model -- either unnormalized counts or smoothed, normalized models.
>
>
2 sentences, 6 words, 0 OOVs
logprob(base 10)= -10.0009;  perplexity (base 10)= 17.7873
 
Changed:
<
<

ngramshrink

ngramshrink is a command line utility for pruning n-gram models.

ngramapply

ngramapply is a command line utilty for applying n-gram models.

>
>
  -- BrianRoark - 05 Oct 2010
Added:
>
>
META FILEATTACHMENT attachment="earnest.txt" attr="" comment="" date="1288923241" name="earnest.txt" path="earnest.txt" size="91184" stream="earnest.txt" tmpFilename="/var/tmp/CGItemp7285" user="MichaelRiley" version="1"

Revision 32010-10-06 - BrianRoark

Line: 1 to 1
 

OpenGrm Quick Tour

This tour is organized around the stages of n-gram model creation, modification and use:

Line: 15 to 15
 
  • There is a unigram state in every model, representing the empty string.
  • Every proper prefix of every n-gram in N has an associated state in the model.
Changed:
<
<
  • The state associated with an n-gram w1 ... wk of length k has a backoff transition (labeled with ⟨epsilon⟩) to the state associated with its suffix of length k-1.
  • An n-gram consisting of k symbols is represented as a transition from the state associated with its prefix of length k-1 to a destination state defined as follows:
    • If the n-gram is a proper prefix of another n-gram in the model, then the destination of the transition is the state associated with the n-gram
    • Otherwise, the destination of the transition is the state associated with the suffix of the n-gram of length k-1.
>
>
  • The state associated with an n-gram w1 ... wk has a backoff transition (labeled with ⟨epsilon⟩) to the state associated with its suffix w2 ... wk.
  • An n-gram w1 ... wk is represented as a transition, labeled with wk, from the state associated with its prefix w1 ... wk-1 to a destination state defined as follows:
    • If w1 ... wk is a proper prefix of an n-gram in the model, then the destination of the transition is the state associated with w1 ... wk
    • Otherwise, the destination of the transition is the state associated with the suffix w2 ... wk.
  • Start and end of the sequence are not represented via transitions in the automaton or symbols in the symbol table. Rather
    • The start state of the automaton encodes the "start of sequence" n-gram prefix (commonly denoted ⟨s⟩).
    • The end of the sequence (often denoted ⟨/s⟩) is included in the model through state final weights, i.e., for a state associated with an n-gram prefix w1 ... wk, the final weight of that state represents the weight of the n-gram w1 ... wk ⟨/s⟩.
 

ngramread

Line: 30 to 33
  where w1 ... wk are the k words of the n-gram and cnt is the (float) count of that n-gram. The list must be consistently ordered, so that (1) any proper prefix of an n-gram is listed before that n-gram, and (2) if word v comes before word w somewhere, it must do so everywhere. An n-gram count automaton is built from the input.
  • By using the flag --ARPA, the file is read as an n-gram model in the well-known ARPA format. An n-gram model automaton is built from the input.
Changed:
<
<
By default, ngramread constructs a symbol table on the fly, consisting of ⟨epsilon⟩ and every observed symbol in the text. With the flag --symbols=filename you can provide the filename to provide a fixed symbol table, in the standard OpenFst format. All symbols in the input text not found in the provided symbol table will be mapped to an OOV symbol, which is ⟨unk⟩ by default. The flag --OOV_symbol can be used to specify the OOV symbol in the provided symbol table if it is not ⟨unk⟩.
>
>
By default, ngramread constructs a symbol table on the fly, consisting of ⟨epsilon⟩ and every observed symbol in the text. With the flag --symbols=filename you can provide the filename to provide a fixed symbol table, in the standard OpenFst format. All symbols in the input text not found in the provided symbol table will be mapped to an OOV symbol, which is ⟨unk⟩ by default. The flag --OOV_symbol can be used to specify the OOV symbol in the provided symbol table if it is not ⟨unk⟩. For reading n-gram counts and ARPA format models, tokens ⟨s⟩ and ⟨/s⟩ are taken to represent start-of-sequence and end-of-sequence, respectively. Neither of these symbols are used in our automaton format (see above).
 

ngramprint

Line: 42 to 45
 
  • By using the flag --negativelogs, scores are shown as negative logs, rather than being converted to the real semiring.
  • By using the flag --integers, scores are converted to the real semiring and rounded to integers.
Added:
>
>
For writing n-gram counts and ARPA format models, tokens ⟨s⟩ and ⟨/s⟩ are used to represent start-of-sequence and end-of-sequence, respectively. Neither of these symbols are used in our automaton format (see above).

ngramcount

ngramcount is a command line utility for counting n-grams from an input corpus, as prepared by ngramread. It produces an n-gram model in the FST format described above. Transitions and final costs are weighted with the negative log count of the associated n-gram. By using the switch --order the maximum length n-gram to count can be chosen. All n-grams observed in the input corpus of length less than or equal to the specified order will be counted. By default, the order is set to 3 (trigram model).

ngrammake

ngrammake is a command line utility for normalizing and smoothing an n-gram model. It takes as input the FST produced by ngramcount (which contains raw, unnormalized counts).

ngrammerge

ngrammerge is a command line utility for merging two n-gram models into a single model -- either unnormalized counts or smoothed, normalized models.

ngramshrink

ngramshrink is a command line utility for pruning n-gram models.

ngramapply

ngramapply is a command line utilty for applying n-gram models.

 -- BrianRoark - 05 Oct 2010

Revision 22010-10-06 - BrianRoark

Line: 1 to 1
 

OpenGrm Quick Tour

This tour is organized around the stages of n-gram model creation, modification and use:

Changed:
<
<
  • I/O (ngramread and ngramprint)
>
>
  • Model format and I/O (ngramread and ngramprint)
 
  • n-gram counting (ngramcount)
  • n-gram model parameter estimation (ngrammake)
  • n-gram model merging and pruning (ngrammerge and ngramshrink)
  • n-gram model application (ngramapply)
Added:
>
>

Model format

All n-gram models produced by these utilities, including those with unnormalized counts, have a cyclic weighted finite-state transducer format, encoded using the OpenFst library. An n-gram is a sequence of k symbols: w1 ... wk. Let N be the set of n-grams in the model.

  • There is a unigram state in every model, representing the empty string.
  • Every proper prefix of every n-gram in N has an associated state in the model.
  • The state associated with an n-gram w1 ... wk of length k has a backoff transition (labeled with ⟨epsilon⟩) to the state associated with its suffix of length k-1.
  • An n-gram consisting of k symbols is represented as a transition from the state associated with its prefix of length k-1 to a destination state defined as follows:
    • If the n-gram is a proper prefix of another n-gram in the model, then the destination of the transition is the state associated with the n-gram
    • Otherwise, the destination of the transition is the state associated with the suffix of the n-gram of length k-1.
 

ngramread

ngramread is a command line utility for reading in text files and producing FSTs appropriate for use by other functions and utilities. It has flags for specifying the format of the text input, currently one of three options:

  • By default, each line in the text is read as a white-space delimited sequence of symbols, with one string per line. Each string is encoded as a linear automaton, and the resulting set of automata are concatenated into a single archive. The final automaton in the archive holds the symbol table for the archive. The archive can then be used by ngramcount to count n-grams.
  • By using the flag --counts, the text file is read as a sorted list of n-grams with their count. The format is:
    w1 ... wk cnt
    where w1 ... wk are the k words of the n-gram and cnt is the (float) count of that n-gram. The list must be consistently ordered, so that (1) any proper prefix of an n-gram is listed before that n-gram, and (2) if word v comes before word w somewhere, it must do so everywhere. An n-gram count automaton is built from the input.
 
  • By using the flag --ARPA, the file is read as an n-gram model in the well-known ARPA format. An n-gram model automaton is built from the input.
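The first ordering constraint on --counts input can be sketched as a small validator. This is a hypothetical helper for illustration, not an OpenGrm tool; the function name is an assumption.

```python
# Sketch of a validator (hypothetical helper, not an OpenGrm tool) for the
# first --counts ordering constraint: any proper prefix of an n-gram that
# appears in the list must be listed before that n-gram.

def prefixes_listed_first(lines):
    """lines: 'w1 ... wk cnt' strings, one n-gram per line."""
    position = {}
    for i, line in enumerate(lines):
        *words, _cnt = line.split()  # last field is the (float) count
        position[tuple(words)] = i
    return all(
        position[ng[:j]] < i
        for ng, i in position.items()
        for j in range(1, len(ng))
        if ng[:j] in position)
```

For instance, a list containing "tea cup 1" before "tea 3" violates the constraint, since the unigram "tea" is a proper prefix of the bigram.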
By default, ngramread constructs a symbol table on the fly, consisting of ⟨epsilon⟩ and every symbol observed in the text. With the flag --symbols=filename, you can instead provide a fixed symbol table in the standard OpenFst format. All symbols in the input text not found in the provided symbol table will be mapped to an OOV symbol, which is ⟨unk⟩ by default. The flag --OOV_symbol can be used to specify the OOV symbol in the provided symbol table if it is not ⟨unk⟩.
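The OOV mapping described above amounts to the following one-liner. This is a sketch of the behavior, not the OpenGrm implementation; the function name and the plain-string "<unk>" spelling are assumptions for the example.

```python
# Sketch of the OOV mapping (not the OpenGrm implementation): input tokens
# absent from a fixed symbol table are replaced by the OOV symbol,
# "<unk>" by default.

def map_oov(tokens, symbol_table, oov_symbol="<unk>"):
    return [t if t in symbol_table else oov_symbol for t in tokens]
```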

ngramprint

ngramprint is a command line utility for reading in n-gram models and producing text files. Since both raw counts and normalized models are encoded with the same automaton structure, either can serve as input to this utility. There are several options for output.

  • By default, only n-grams are printed (without backoff ⟨epsilon⟩ transitions), in the same format as discussed above for reading in n-gram counts: w1 ... wk score, where the score will be either the n-gram count or the n-gram probability, depending on whether the model has been normalized. By default, scores are converted from the internal negative log representation to real semiring counts or probabilities.
  • By using the flag --ARPA, the n-gram model is printed in the well-known ARPA format.
  • By using the flag --backoff, backoff ⟨epsilon⟩ transitions are printed along with the n-grams.
  • By using the flag --negativelogs, scores are shown as negative logs, rather than being converted to the real semiring.
  • By using the flag --integers, scores are converted to the real semiring and rounded to integers.
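The score conversions above can be sketched as follows, assuming the negative logs are negative *natural* logs (as in the OpenFst log semiring); the function names are assumptions for the example, not OpenGrm code.

```python
import math

# Sketch of ngramprint's score conversions, assuming negative natural logs
# (as in the OpenFst log semiring); not OpenGrm code.

def neg_log_to_real(score):
    """Default behavior: internal negative log -> real count/probability."""
    return math.exp(-score)

def real_to_neg_log(value):
    """Inverse conversion: real count/probability -> negative log."""
    return -math.log(value)
```

Under this assumption, a stored score of 0.0 prints as 1.0 in the real semiring, and --integers would round the converted value.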
  -- BrianRoark - 05 Oct 2010

Revision 1 2010-10-05 - BrianRoark

Line: 1 to 1
Added:
>
>

OpenGrm Quick Tour

This tour is organized around the stages of n-gram model creation, modification and use:

  • I/O (ngramread and ngramprint)
  • n-gram counting (ngramcount)
  • n-gram model parameter estimation (ngrammake)
  • n-gram model merging and pruning (ngrammerge and ngramshrink)
  • n-gram model application (ngramapply)


 