The following gives the encoding of all n-gram models produced by the utilities here, including those with unnormalized counts, as a cyclic weighted finite-state transducer in OpenFst format.

An n-gram is a sequence of *k* symbols: *w _{1} ... w_{k}*. Let

- There is a
*unigram*state in every model, representing the empty string. - Every proper prefix of every n-gram in
*N*has an associated state in the model. - The state associated with an n-gram
*w*has a backoff transition (labeled with_{1}... w_{k}*<epsilon>*) to the state associated with its suffix*w*._{2}... w_{k} - An n-gram
*w*is represented as a transition, labeled with_{1}... w_{k}*w*, from the state associated with its prefix_{k}*w*to a destination state defined as follows:_{1}... w_{k-1}- If
*w*is a proper prefix of an n-gram in the model, then the destination of the transition is the state associated with_{1}... w_{k}*w*_{1}... w_{k} - Otherwise, the destination of the transition is the state associated with the suffix
*w*._{2}... w_{k}

- If
- Start and end of the sequence are not represented via transitions in the automaton or symbols in the symbol table. Rather
- The start state of the automaton encodes the "start of sequence" n-gram prefix (commonly denoted
*<s>*). - The end of the sequence (often denoted
*</s>*) is included in the model through state final weights, i.e., for a state associated with an n-gram prefix*w*, the final weight of that state represents the weight of the n-gram_{1}... w_{k}*w*._{1}... w_{k}</s>

- The start state of the automaton encodes the "start of sequence" n-gram prefix (commonly denoted

Topic revision: r3 - 2012-03-04 - BrianRoark

Copyright © 2008-2021 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.

Ideas, requests, problems regarding TWiki? Send feedback

Ideas, requests, problems regarding TWiki? Send feedback