NGramRead

Description

ngramread has flags for specifying the format of the text input, currently one of two options:

By default, the text file is read as a sorted list of n-grams with their counts. The format is:
w_1 ... w_k cnt
where w_1 ... w_k are the k words of the n-gram and cnt is the (float) count of that n-gram. The n-grams in the list must be lexicographically ordered. An n-gram count automaton is built from the input.
By using the flag --ARPA, the file is read as an n-gram model in the well-known ARPA format. An n-gram model automaton is built from the input.
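
As a rough illustration of the default count-list input, the sketch below writes a tiny file and checks that its lines are sorted. The file name counts.txt and the counts themselves are hypothetical, and byte-order sorting of whole lines is only an assumed stand-in for the lexicographic order ngramread expects. The ngramread call is guarded so the snippet still runs where the tool is not installed.

```shell
# Hypothetical count list: the words of each n-gram, then its count.
cat > counts.txt <<'EOF'
a 3
a b 2
b 1
b a 1
EOF

# Check the ordering before reading (assumption: plain byte order).
LC_ALL=C sort -c counts.txt && echo "sorted"

# Build the n-gram count automaton if ngramread is available.
if command -v ngramread >/dev/null 2>&1; then
  ngramread counts.txt > counts.fst
fi
```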

By default, ngramread constructs a symbol table on the fly, consisting of <epsilon> and every symbol observed in the text. The flag --symbols=filename instead supplies a fixed symbol table, in the standard OpenFst format. Any symbol in the input text that is not found in the provided symbol table is mapped to an OOV symbol, which is <UNK> by default. If the OOV symbol in the provided symbol table is not <UNK>, specify it with the flag --OOV_symbol. The tokens <s> and </s> are taken to represent start-of-sequence and end-of-sequence, respectively. Neither of these symbols is used in our automaton format. For the precise details of the n-gram format, see here.
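
The OOV mapping described above can be sketched without ngramread itself. The snippet below assumes the symbol table is given in OpenFst's text form (one symbol and its integer id per line) and uses awk to emulate the mapping on a hypothetical count list; the file names and the awk emulation are illustrative assumptions, not part of the tool. The final, guarded call uses only the flags named above.

```shell
# Hypothetical fixed symbol table (assumed text form: symbol, then id).
cat > syms.txt <<'EOF'
<epsilon> 0
<UNK> 1
a 2
b 3
EOF

# Hypothetical count list containing a word ("c") missing from the table.
cat > oov_counts.txt <<'EOF'
a 2
b 1
c 1
EOF

# Emulation only: replace unknown words (every field but the count) by <UNK>.
awk 'NR == FNR { known[$1]; next }
     { for (i = 1; i < NF; i++) if (!($i in known)) $i = "<UNK>"; print }' \
    syms.txt oov_counts.txt > mapped.txt
cat mapped.txt

# The real tool, if installed, with the fixed table and explicit OOV symbol.
if command -v ngramread >/dev/null 2>&1; then
  ngramread --symbols=syms.txt --OOV_symbol="<UNK>" oov_counts.txt > oov_counts.fst
fi
```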

Usage

ngramread [--options] [in.txt [out.fst]]
  --ARPA: type = bool, default = false
  --epsilon_symbol: type = string, default = <epsilon>
  --OOV_symbol: type = string, default = <UNK>

Examples

$ ngramread --ARPA in.ARPA-format.txt >out.mod
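
For readers who have not seen an ARPA file, the sketch below writes a minimal one and then reads it as in the example above. The file name model.arpa and the log10 values are hypothetical placeholders, not a properly normalized model, and the ngramread call is guarded in case the tool is not installed.

```shell
# Minimal ARPA-format file: header with n-gram counts, one section per order.
# Each entry is: log10 probability, the n-gram, optional backoff weight.
cat > model.arpa <<'EOF'
\data\
ngram 1=3
ngram 2=2

\1-grams:
-0.4771 </s>
-0.4771 <s> -0.3010
-0.4771 a -0.3010

\2-grams:
-0.3010 <s> a
-0.3010 a </s>

\end\
EOF

# Show the declared number of n-grams per order.
grep '^ngram ' model.arpa

# Build the n-gram model automaton if ngramread is available.
if command -v ngramread >/dev/null 2>&1; then
  ngramread --ARPA model.arpa > model.mod
fi
```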

Caveats

Topic revision: r5 - 2022-08-06 - KyleGorman