
OpenGrm Quick Tour

This tour is organized around the stages of n-gram model creation, modification and use:

  • Model format and I/O (ngramread and ngramprint)
  • n-gram counting (ngramcount)
  • n-gram model parameter estimation (ngrammake)
  • n-gram model merging and pruning (ngrammerge and ngramshrink)
  • n-gram model application (ngramapply)

Model format

All n-gram models produced by these utilities, including those with unnormalized counts, have a cyclic weighted finite-state transducer format, encoded using the OpenFst library. An n-gram is a sequence of k symbols: w1 ... wk. Let N be the set of n-grams in the model.

  • There is a unigram state in every model, representing the empty string.
  • Every proper prefix of every n-gram in N has an associated state in the model.
  • The state associated with an n-gram w1 ... wk of length k has a backoff transition (labeled with ⟨epsilon⟩) to the state associated with its suffix of length k-1.
  • An n-gram consisting of k symbols is represented as a transition from the state associated with its prefix of length k-1 to a destination state defined as follows:
    • If the n-gram is a proper prefix of another n-gram in the model, then the destination of the transition is the state associated with the n-gram.
    • Otherwise, the destination of the transition is the state associated with the suffix of the n-gram of length k-1.
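
The topology rules above can be illustrated with a small Python sketch. This is not OpenGrm code: the function name build_topology and the tuple-based state representation are hypothetical, weights are omitted, and the sketch assumes the n-gram set is closed enough that every suffix state it refers to actually exists.

```python
def build_topology(ngrams):
    """Derive states, backoff links and arcs from a set of n-grams.

    Hypothetical illustration only: each n-gram is a tuple of symbols and
    each state is named by the n-gram (history) it is associated with.
    """
    ngrams = set(ngrams)

    # The unigram state () plus one state per proper prefix of every n-gram.
    states = {()}
    for gram in ngrams:
        for k in range(1, len(gram)):
            states.add(gram[:k])

    # The state for w1 ... wk backs off to the state for its suffix of
    # length k-1 (assumed here to exist in the model).
    backoff = {state: state[1:] for state in states if state}

    # Each n-gram is an arc leaving the state of its prefix of length k-1.
    arcs = {}
    for gram in ngrams:
        source = gram[:-1]
        # If the n-gram is itself a proper prefix of another n-gram it has
        # its own state; otherwise the arc goes to the suffix state.
        destination = gram if gram in states else gram[1:]
        arcs[gram] = (source, destination)

    return states, backoff, arcs


example = [("a",), ("b",), ("a", "b"), ("b", "a"),
           ("a", "b", "a"), ("b", "a", "b")]
states, backoff, arcs = build_topology(example)
for gram in sorted(arcs):
    print(gram, "->", arcs[gram])
```

In this example the trigram a b a leaves the state for a b and, since it is not a prefix of any longer n-gram, arrives at the state for its suffix b a.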

ngramread

ngramread is a command line utility for reading in text files and producing FSTs appropriate for use by other functions and utilities. It has flags for specifying the format of the text input, currently one of three options:

  • By default, each line in the text is read as a whitespace-delimited sequence of symbols, with one string per line. Each string is encoded as a linear automaton, and the resulting automata are concatenated into a single archive. The final automaton in the archive holds the symbol table for the archive. The archive can then be used by ngramcount to count n-grams.
  • With the flag --counts, the text file is read as a sorted list of n-grams with their counts. The format is:
    w1 ... wk cnt
    where w1 ... wk are the k words of the n-gram and cnt is the (float) count of that n-gram. The list must be consistently ordered, so that (1) any proper prefix of an n-gram is listed before that n-gram, and (2) if word v comes before word w somewhere, it must do so everywhere. An n-gram count automaton is built from the input.
  • By using the flag --ARPA, the file is read as an n-gram model in the well-known ARPA format. An n-gram model automaton is built from the input.
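
As a rough illustration of the --counts input format, the following Python sketch parses lines of the form w1 ... wk cnt and checks ordering condition (1), that every proper prefix of an n-gram appears before it. The function parse_counts is hypothetical and is not how ngramread is implemented; condition (2), the consistent word order, is not checked here.

```python
def parse_counts(lines):
    """Parse 'w1 ... wk cnt' lines into (ngram, count) pairs.

    Hypothetical helper: raises ValueError if a line is malformed or if a
    proper prefix of an n-gram has not been listed earlier in the input.
    """
    seen = set()
    parsed = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < 2:
            raise ValueError(f"line {line_number}: expected 'w1 ... wk cnt'")
        *words, count_text = fields
        ngram = tuple(words)
        count = float(count_text)  # counts are floats in this format
        for k in range(1, len(ngram)):
            if ngram[:k] not in seen:
                raise ValueError(
                    f"line {line_number}: prefix {ngram[:k]} not listed "
                    f"before {ngram}")
        seen.add(ngram)
        parsed.append((ngram, count))
    return parsed


counts = parse_counts(["a 2", "a b 1.5", "b 1"])
print(counts)
```

A list in which a b is given before a would be rejected by this check.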

By default, ngramread constructs a symbol table on the fly, consisting of ⟨epsilon⟩ and every symbol observed in the text. The flag --symbols=filename supplies a fixed symbol table instead, in the standard OpenFst format. Any symbol in the input text that is not found in the provided symbol table is mapped to an OOV symbol, which is ⟨unk⟩ by default. If the provided symbol table uses a different OOV symbol, specify it with the flag --OOV_symbol.
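
The two behaviors described above can be sketched in Python. Both functions are hypothetical illustrations rather than OpenGrm code, and they make several assumptions: symbols are spelled <epsilon> and <unk> in ASCII, ⟨epsilon⟩ takes id 0 as is conventional in OpenFst, and new symbols are numbered in order of first appearance (the numbering ngramread actually uses may differ).

```python
def build_symbol_table(lines, epsilon="<epsilon>"):
    """Build a symbol -> id mapping on the fly from whitespace-split text.

    Hypothetical sketch: epsilon gets id 0, other symbols get the next free
    id the first time they are observed.
    """
    table = {epsilon: 0}
    for line in lines:
        for symbol in line.split():
            table.setdefault(symbol, len(table))
    return table


def map_to_fixed_table(line, table, oov_symbol="<unk>"):
    """Replace every symbol missing from a fixed table with the OOV symbol."""
    if oov_symbol not in table:
        raise KeyError(f"OOV symbol {oov_symbol!r} is not in the symbol table")
    return [s if s in table else oov_symbol for s in line.split()]


table = build_symbol_table(["the cat sat", "the dog sat"])
print(table)

fixed = {"<epsilon>": 0, "the": 1, "cat": 2, "<unk>": 3}
print(map_to_fixed_table("the dog sat", fixed))
```

The oov_symbol parameter plays the role that --OOV_symbol plays on the command line: it names the entry in the fixed table that unknown symbols are mapped to.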

ngramprint

ngramprint is a command line utility for reading in n-gram models and producing text files. Since raw counts and normalized models are encoded with the same automaton structure, ngramprint can print either. There are several output options:

  • By default, only n-grams are printed (without backoff ⟨epsilon⟩ transitions), in the same format as discussed above for reading in n-gram counts: w1 ... wk score, where the score will be either the n-gram count or the n-gram probability, depending on whether the model has been normalized. By default, scores are converted from the internal negative log representation to real semiring counts or probabilities.
  • By using the flag --ARPA, the n-gram model is printed in the well-known ARPA format.
  • By using the flag --backoff, backoff ⟨epsilon⟩ transitions are printed along with the n-grams.
  • By using the flag --negativelogs, scores are shown as negative logs, rather than being converted to the real semiring.
  • By using the flag --integers, scores are converted to the real semiring and rounded to integers.
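
The score conversions mentioned in this list amount to simple arithmetic, sketched below in Python. The function names are hypothetical, the sketch assumes the internal scores are negative natural logarithms, and the rounding rule that ngramprint itself applies for --integers is not specified here, so Python's built-in round is used only as a stand-in.

```python
import math


def to_real(negative_log_score):
    """Convert a negative natural-log score to a real count or probability."""
    return math.exp(-negative_log_score)


def to_integer(negative_log_score):
    """Convert to the real semiring, then round to the nearest integer."""
    return round(to_real(negative_log_score))


# A count of 3 is stored internally as -log(3); a probability of 0.25
# is stored as -log(0.25) = log(4).
stored_count = -math.log(3.0)
stored_probability = -math.log(0.25)

print(to_real(stored_count))
print(to_real(stored_probability))
print(to_integer(stored_count))
```

With --negativelogs the stored values would be printed unchanged, without the exponentiation step.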

-- BrianRoark - 05 Oct 2010

Topic revision: r2 - 2010-10-06 - BrianRoark
 