---+ NGramMake

---++ Description

This operation produces a smoothed, normalized language model from an input n-gram count FST. It smooths the model in one of six ways:

   * _witten_bell_: smooths using Witten-Bell (cite), with a hyperparameter _k_, as presented in Carpenter (2005).
   * _absolute_: smooths based on Absolute Discounting (cite), using the =bins= and =discount= parameters.
   * _katz_: smooths based on Katz Backoff (cite), using the =bins= parameter.
   * _kneser_ney_: smooths based on Kneser-Ney (cite), a variant of Absolute Discounting.
   * _presmoothed_: normalizes at each state based on the n-gram count of the history.
   * _unsmoothed_: normalizes the model but provides no smoothing.

See Chen and Goodman (1998) for a discussion of these smoothing methods.

All of the smoothing methods can be used to build either a mixture model (in which higher-order n-gram distributions are interpolated with lower-order n-gram distributions) or a backoff model (using the _--backoff_ option, in which lower-order n-gram distributions are used only if the higher-order n-gram was unobserved in the corpus). Even though some of the methods are typically used with one style of smoothing (e.g., Katz with backoff), in this library any of them can be used with either. Note that mixture models are converted to a backoff topology by pre-summing the mixtures and placing the mixed probability on the highest-order transition.

If the _--bins_ option is left at its default (-1), then the number of bins for the discounting methods (=katz=, =absolute=, =kneser_ney=) is set to a method-appropriate default (5 for =katz=, 1 for =absolute=).

The C++ classes are all derived from the base class =NGramMake=.
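To make the mixture-versus-backoff distinction concrete, here is an illustrative sketch (plain Python, not the library's implementation) of interpolated absolute discounting on a toy bigram model. The corpus, the variable names, and the single fixed discount =D= are all hypothetical; the one fixed discount stands in for the library's =bins=/=discount= parameterization:

```python
# Illustrative sketch only -- NOT the OpenGrm NGram implementation.
# Interpolated ("mixture") absolute discounting for a toy bigram model:
# a fixed discount D is subtracted from each observed bigram count, and
# the freed probability mass is redistributed via the unigram distribution.
from collections import Counter

corpus = "a b a b b a a b a".split()   # hypothetical toy corpus
D = 0.5                                # plays the role of the discount parameter

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(unigrams.values())

def p_unigram(w):
    return unigrams[w] / total

def p_mixture(w, h):
    """Discounted bigram estimate, interpolated with the lower-order
    (unigram) distribution weighted by the reserved mass alpha."""
    c_h = sum(c for (a, _), c in bigrams.items() if a == h)
    n_types = sum(1 for (a, _) in bigrams if a == h)
    alpha = D * n_types / c_h          # mass freed by discounting
    disc = max(bigrams[(h, w)] - D, 0.0) / c_h
    return disc + alpha * p_unigram(w)

# The smoothed conditional distribution still sums to one:
print(sum(p_mixture(w, "a") for w in unigrams))
```

In a backoff model, by contrast, the lower-order term would be used only for n-grams with zero count, with =alpha= renormalized over just those continuations; the library performs this construction on the count FST directly rather than on raw token lists.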
---++ Usage

|<verbatim>
ngrammake [--options] [in.fst [out.fst]]
  --method: type = string, one of: witten_bell (default) | absolute | katz | kneser_ney | presmoothed | unsmoothed
  --backoff: type = bool, default = false
  --bins: type = int64, default = -1
  --witten_bell_k: type = double, default = 1.0
  --discount_D: type = double, default = 1.0
</verbatim>|

|<verbatim>
class NGramAbsolute ngram(StdMutableFst *countfst);
</verbatim>|

|<verbatim>
class NGramKatz ngram(StdMutableFst *countfst);
</verbatim>|

|<verbatim>
class NGramKneserNey ngram(StdMutableFst *countfst);
</verbatim>|

|<verbatim>
class NGramUnsmoothed ngram(StdMutableFst *countfst);
</verbatim>|

|<verbatim>
class NGramWittenBell ngram(StdMutableFst *countfst);
</verbatim>|

---++ Examples

---++ Caveats

The _presmoothed_ method normalizes at each state based on the n-gram count of the history, which is appropriate only under specialized circumstances, such as when the counts have been derived from strings with backoff transitions indicated.

---++ References

-- Main.MichaelRiley - 09 Dec 2011
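As a usage sketch for the command line above (the input and output file names here are hypothetical; the flags are exactly those listed under Usage):

```shell
# Hypothetical file names; flags as documented in the Usage section.
# Interpolated (mixture) Katz model with the default number of bins:
ngrammake --method=katz in.fst out.fst

# Backoff Kneser-Ney model:
ngrammake --method=kneser_ney --backoff in.fst out.fst
```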
Topic revision: r4 - 2011-12-15 - BrianRoark