It has flags for specifying the format of the text input, currently one of two options:

  • By default or with the flag --counts, the text file is read as a sorted list of n-grams with their counts. The format is:
    w1 ... wk cnt
    where w1 ... wk are the k words of the n-gram and cnt is the (float) count of that n-gram. The n-grams in the list must be lexicographically ordered. An n-gram count automaton is built from the input.
  • By using the flag --ARPA, the file is read as an n-gram model in the well-known ARPA format. An n-gram model automaton is built from the input.
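
As an illustration of the count format described above, the following minimal Python sketch turns a dictionary of n-gram counts into sorted lines of the form w1 ... wk cnt. The helper name format_count_lines is hypothetical and not part of the library, and the ordering used here (Python's comparison of word tuples) is an assumption about what "lexicographically ordered" means for ngramread.

```python
def format_count_lines(counts):
    """Format n-gram counts as 'w1 ... wk cnt' lines, one n-gram per line.

    `counts` maps a tuple of words to a numeric count. Lines are sorted by
    comparing the word tuples, which is an assumed reading of the
    lexicographic ordering requirement, not a confirmed one.
    """
    lines = []
    for ngram, cnt in sorted(counts.items()):
        # Counts are written as floats, matching the "(float) count" wording.
        lines.append(f"{' '.join(ngram)} {float(cnt)}")
    return lines


example_counts = {
    ("b", "a"): 1,
    ("a",): 2,
    ("a", "b"): 1,
    ("b",): 1,
}
count_lines = format_count_lines(example_counts)
print("\n".join(count_lines))
```

Writing these lines to a text file would give input of the shape the default mode expects, under the assumptions stated above.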

By default, ngramread constructs a symbol table on the fly, consisting of <epsilon> and every observed symbol in the text. With the flag --symbols=filename you can supply a fixed symbol table in the standard OpenFst format. Any symbol in the input text that is not found in the provided symbol table is mapped to an OOV symbol, which is <unk> by default. The flag --OOV_symbol can be used to specify the OOV symbol in the provided symbol table if it is not <unk>. The tokens <s> and </s> are taken to represent start-of-sequence and end-of-sequence, respectively. Neither of these symbols is used in our automaton format (see above).
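
The OOV behaviour with a fixed symbol table can be pictured with a small Python sketch. It is only an illustration of the mapping described above, not ngramread's implementation: the function names load_symbols and map_oov are hypothetical, and the symbol table is assumed to be in OpenFst's text form, with one symbol and its integer id per line.

```python
def load_symbols(text):
    """Parse a text symbol table: one 'symbol id' pair per line (assumed layout)."""
    table = {}
    for line in text.splitlines():
        if line.strip():
            symbol, index = line.split()
            table[symbol] = int(index)
    return table


def map_oov(tokens, symbols, oov_symbol="<unk>"):
    """Replace every token missing from `symbols` with the OOV symbol."""
    if oov_symbol not in symbols:
        # The OOV symbol itself is expected to be in the provided table.
        raise ValueError(f"OOV symbol {oov_symbol!r} not in symbol table")
    return [tok if tok in symbols else oov_symbol for tok in tokens]


symbol_text = "<epsilon> 0\nthe 1\ncat 2\n<unk> 3\n"
symbols = load_symbols(symbol_text)
mapped = map_oov(["the", "dog", "cat"], symbols)
print(mapped)
```

Passing a different oov_symbol argument plays the role that --OOV_symbol plays on the command line, again only as an analogy.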



-- MichaelRiley - 09 Dec 2011

Topic revision: r2 - 2011-12-10 - MichaelRiley