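This operation produces a smoothed, normalized language model from an input ngram count FST. It smooths the model in one of six ways:
witten_bell: Witten-Bell smoothing, using the witten_bell_k parameter.
absolute: absolute discounting, using the bins and discount_D parameters.
katz: Katz backoff smoothing, using the bins parameter.
kneser_ney: Kneser-Ney smoothing.
presmoothed: normalizes at each state based on the ngram count of the history.
unsmoothed: normalizes the model without smoothing.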
See Chen and Goodman (1998) for a discussion of these smoothing methods.
All of the smoothing methods can be used to build either a mixture model (in which higher order ngram distributions are interpolated with lower order ngram distributions) or a backoff model (selected with the backoff option, in which lower order ngram distributions are used only if the higher order ngram was unobserved in the corpus). Although some of the methods are typically associated with one of the two (e.g., Katz with backoff), in this library any method can be used with either. Note that mixture models are converted to a backoff topology by pre-summing the mixtures and placing the mixed probability on the highest order transition.
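The difference between the two model types can be illustrated with a small standalone sketch. It does not use the library; the struct, the function names, the toy counts, and the discount of 0.5 are all hypothetical, and absolute discounting is used only because it is the simplest to write down. It assumes every observed count is larger than the discount.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>

// Toy statistics for a single history h. All names and values here are
// hypothetical and are not part of the library.
struct ToyHistory {
  std::map<std::string, double> counts;  // c(h, w), observed words only
  std::map<std::string, double> lower;   // lower order distribution P(w)
  double discount;                       // absolute discount D
};

// Total count of the history, c(h).
double HistoryCount(const ToyHistory &h) {
  double total = 0.0;
  for (const auto &kv : h.counts) total += kv.second;
  assert(total > 0.0);
  return total;
}

// Discounted higher order estimate: max(c(h, w) - D, 0) / c(h).
double Discounted(const ToyHistory &h, const std::string &w) {
  auto it = h.counts.find(w);
  if (it == h.counts.end()) return 0.0;
  return std::max(it->second - h.discount, 0.0) / HistoryCount(h);
}

// Probability mass freed by discounting and handed to the lower order.
double ReservedMass(const ToyHistory &h) {
  double kept = 0.0;
  for (const auto &kv : h.counts) kept += Discounted(h, kv.first);
  return 1.0 - kept;
}

// Mixture: every word receives its discounted estimate plus a share of
// the lower order distribution.
double MixtureProb(const ToyHistory &h, const std::string &w) {
  return Discounted(h, w) + ReservedMass(h) * h.lower.at(w);
}

// Backoff: observed words use only the discounted estimate; unobserved
// words use the lower order distribution, rescaled so the total is 1.
double BackoffProb(const ToyHistory &h, const std::string &w) {
  if (h.counts.count(w) > 0) return Discounted(h, w);
  double seen_lower = 0.0;
  for (const auto &kv : h.counts) seen_lower += h.lower.at(kv.first);
  return ReservedMass(h) / (1.0 - seen_lower) * h.lower.at(w);
}

// Example: after history h, "b" was seen 3 times, "c" once, "a" never.
ToyHistory MakeExample() {
  return ToyHistory{{{"b", 3.0}, {"c", 1.0}},
                    {{"a", 0.5}, {"b", 0.3}, {"c", 0.2}},
                    0.5};
}
```

With the example data, the mixture estimates are 0.7 for "b", 0.175 for "c", and 0.125 for "a", while the backoff estimates are 0.625, 0.125, and 0.25. In a backoff topology for the mixture case, the pre-summed values 0.7 and 0.175 would sit on the highest order transitions, and the unobserved word would be reached through the lower order with weight 0.25.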
If the bins option is left as the default (1), then the number of bins for the discounting methods (katz, absolute, kneser_ney) is set to a method-appropriate default (5 for katz, 1 for absolute).
The C++ classes are all derived from the base class NGramMake.
ngrammake [options] [in.fst [out.fst]]
  method: type = string, one of: witten_bell (default), absolute, katz, kneser_ney, presmoothed, unsmoothed
  backoff: type = bool, default = false
  bins: type = int64, default = 1
  witten_bell_k: type = double, default = 1.0
  discount_D: type = double, default = 1.0

class NGramAbsolute ngram(StdMutableFst *countfst); 

class NGramKatz ngram(StdMutableFst *countfst); 

class NGramKneserNey ngram(StdMutableFst *countfst); 

class NGramUnsmoothed ngram(StdMutableFst *countfst); 

class NGramWittenBell ngram(StdMutableFst *countfst); 
The presmoothed method normalizes at each state based on the ngram count of the history, which is only appropriate under specialized circumstances, such as when the counts have been derived from strings with backoff transitions indicated.
 MichaelRiley  09 Dec 2011