Description
ngramread has flags for specifying the format of the text input, currently one of two options:
- By default or with the flag --counts, the text file is read as a sorted list of n-grams with their counts. The format is:
w1 ... wk cnt
where w1 ... wk are the k words of the n-gram and cnt is the floating-point count of that n-gram. The n-grams in the list must be lexicographically ordered. An n-gram count automaton is built from the input.
- By using the flag --ARPA, the file is read as an n-gram model in the well-known ARPA format. An n-gram model automaton is built from the input.
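The counts format above can be illustrated with a small, hedged Python sketch. It is not part of ngramread; the helper names `parse_count_line` and `read_counts` are hypothetical, and the ordering check assumes that "lexicographically ordered" means plain string comparison of the word sequences, which may differ from the ordering ngramread actually enforces.

```python
from typing import Iterable, List, Tuple


def parse_count_line(line: str) -> Tuple[List[str], float]:
    """Split one 'w1 ... wk cnt' line into (words, count)."""
    fields = line.split()
    if len(fields) < 2:
        raise ValueError(f"expected at least one word and a count: {line!r}")
    *words, cnt = fields  # the last field is the count
    return words, float(cnt)


def read_counts(lines: Iterable[str]) -> List[Tuple[List[str], float]]:
    """Parse count lines and check that the n-grams are in sorted order.

    Assumption: order is checked by comparing word lists as Python
    strings; this is only an illustration of the requirement.
    """
    entries = []
    previous = None
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        words, cnt = parse_count_line(line)
        if previous is not None and words < previous:
            raise ValueError(f"n-gram out of order: {' '.join(words)}")
        entries.append((words, cnt))
        previous = words
    return entries


if __name__ == "__main__":
    sample = ["a 3", "a b 2", "b 1"]
    for words, cnt in read_counts(sample):
        print(words, cnt)
```

A check of this kind can be handy for validating a counts file before passing it to ngramread, since unsorted input violates the stated format.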
By default, ngramread constructs a symbol table on the fly, consisting of <epsilon> and every observed symbol in the text. With the flag --symbols=filename, you can supply a fixed symbol table in the standard OpenFst format. Any symbol in the input text that is not found in the provided symbol table is mapped to an OOV symbol, which is <unk> by default. The flag --OOV_symbol specifies the OOV symbol in the provided symbol table if it is not <unk>. The tokens <s> and </s> are taken to represent start-of-sequence and end-of-sequence, respectively. Neither of these symbols is used in our automaton format (see above).
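The symbol handling described here can be sketched in Python as follows. This is an illustration only, not ngramread's implementation: the function names `build_symbol_table` and `map_to_ids` are hypothetical, the special symbols are written in plain ASCII as `<epsilon>` and `<unk>`, and the id assignment (with `<epsilon>` at 0, following the usual OpenFst convention) is an assumption about how an on-the-fly table might be numbered.

```python
from typing import Dict, Iterable, List


def build_symbol_table(tokens: Iterable[str]) -> Dict[str, int]:
    """Assign ids on the fly, reserving id 0 for <epsilon>."""
    table = {"<epsilon>": 0}
    for tok in tokens:
        if tok not in table:
            table[tok] = len(table)  # next unused id
    return table


def map_to_ids(tokens: Iterable[str], table: Dict[str, int],
               oov_symbol: str = "<unk>") -> List[int]:
    """Map tokens to ids using a fixed table; unknown tokens get the OOV id."""
    if oov_symbol not in table:
        raise KeyError(f"OOV symbol {oov_symbol!r} is not in the symbol table")
    oov_id = table[oov_symbol]
    return [table.get(tok, oov_id) for tok in tokens]


if __name__ == "__main__":
    # On-the-fly table: every observed symbol receives its own id.
    print(build_symbol_table("a b a c".split()))
    # Fixed table: 'z' is not listed, so it maps to the id of <unk>.
    fixed = {"<epsilon>": 0, "a": 1, "<unk>": 2}
    print(map_to_ids(["a", "z"], fixed))
```

The second function mirrors the role of --OOV_symbol: passing a different `oov_symbol` argument selects another entry of the fixed table as the target for unseen tokens.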
Usage
Caveats
--
MichaelRiley - 09 Dec 2011