Difference: NGramRead (1 vs. 4)

Revision 4 2012-03-04 - BrianRoark

Line: 1 to 1
 
META TOPICPARENT name="NGramQuickTour"

NGramRead

Line: 6 to 6
  It has flags for specifying the format of the text input, currently one of two options:

Changed:
<
<
  • By default or with the flag --counts, the text file is read as a sorted list of n-grams with their counts. The format is:
>
>
  • By default, the text file is read as a sorted list of n-grams with their counts. The format is:
  w1 ... wk cnt
where w1 ... wk are the k words of the n-gram and cnt is the (float) count of that n-gram. The n-grams in the list must be lexicographically ordered. An n-gram count automaton is built from the input.
  • By using the flag --ARPA, the file is read as an n-gram model in the well-known ARPA format. An n-gram model automaton is built from the input.
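To make the `w1 ... wk cnt` count format concrete, here is a minimal Python sketch that produces such lines from a few whitespace-tokenized sentences. The helper name `ngram_count_lines` is hypothetical and not part of the library. The sketch assumes that "lexicographically ordered" means ordering by token sequence under Python's default string comparison, which may not match the ordering the tool expects for every character set, and it does not add any start or end tokens.

```python
from collections import Counter


def ngram_count_lines(sentences, order=2):
    """Return 'w1 ... wk cnt' lines for all n-grams up to `order`.

    Hypothetical helper for illustration only; not part of ngramread.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for k in range(1, order + 1):
            for i in range(len(tokens) - k + 1):
                counts[tuple(tokens[i:i + k])] += 1
    # Tuples of strings compare element by element, so sorting the keys
    # orders the n-grams lexicographically by token sequence (assumption:
    # this is the ordering meant by the text above).
    return [
        " ".join(ngram) + " " + str(float(cnt))
        for ngram, cnt in sorted(counts.items())
    ]


lines = ngram_count_lines(["a b a", "b a"], order=2)
print("\n".join(lines))
```

Each printed line is one n-gram followed by its float count, for example `a b 1.0` for the bigram "a b".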
Changed:
<
<
By default, ngramread constructs a symbol table on the fly, consisting of ⟨epsilon⟩ and every observed symbol in the text. With the flag --symbols=filename you can instead supply a fixed symbol table in the standard OpenFst format. All symbols in the input text that are not found in the provided symbol table are mapped to an OOV symbol, which is ⟨unk⟩ by default. The flag --OOV_symbol can be used to specify the OOV symbol in the provided symbol table if it is not ⟨unk⟩. The tokens ⟨s⟩ and ⟨/s⟩ are taken to represent start-of-sequence and end-of-sequence, respectively. Neither of these symbols is used in our automaton format (see above).

Usage

>
>
By default, ngramread constructs a symbol table on the fly, consisting of <epsilon> and every observed symbol in the text. With the flag --symbols=filename you can instead supply a fixed symbol table in the standard OpenFst format. All symbols in the input text that are not found in the provided symbol table are mapped to an OOV symbol, which is <unk> by default. The flag --OOV_symbol can be used to specify the OOV symbol in the provided symbol table if it is not <unk>. The tokens <s> and </s> are taken to represent start-of-sequence and end-of-sequence, respectively. Neither of these symbols is used in our automaton format. For the precise details of the n-gram format, see here.
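The symbol-table behavior described above can be sketched in a few lines of Python. This is an illustration of the idea, not the tool's implementation: the function names are hypothetical, and the assumption that IDs are assigned in order of first appearance (with the epsilon symbol at 0) is mine. The text layout in `symbol_table_text` follows the OpenFst text symbol table convention of one symbol and its integer ID per line.

```python
def build_symbol_table(tokens, epsilon="<epsilon>"):
    """Build a symbol table on the fly: epsilon first, then each new token.

    Hypothetical helper; the ID assignment order is an assumption.
    """
    table = {epsilon: 0}
    for tok in tokens:
        if tok not in table:
            table[tok] = len(table)
    return table


def map_oov(tokens, table, oov="<unk>"):
    """Replace every token missing from a fixed table with the OOV symbol."""
    if oov not in table:
        raise KeyError(f"OOV symbol {oov!r} is not in the symbol table")
    return [tok if tok in table else oov for tok in tokens]


def symbol_table_text(table):
    """Render the table as 'symbol<TAB>id' lines, ordered by ID."""
    pairs = sorted(table.items(), key=lambda kv: kv[1])
    return "\n".join(f"{sym}\t{idx}" for sym, idx in pairs)


# A fixed table that contains the default OOV symbol and two words.
fixed = build_symbol_table(["<unk>", "a", "b"])
mapped = map_oov("a c b".split(), fixed)
print(mapped)
print(symbol_table_text(fixed))
```

Here the token `c` is absent from the fixed table, so it is mapped to `<unk>`; passing a different `oov` argument mirrors the role of --OOV_symbol.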
 
Changed:
<
<

Examples

>
>

Usage

 
Changed:
<
<

Caveats

>
>
ngramread [--options] [in.txt [out.fst]]
  --ARPA: type = bool, default = false
  --epsilon_symbol: type = string, default = <epsilon>
  --OOV_symbol: type = string, default = <unk>
 
Added:
>
>

Examples

 
Added:
>
>
$ ngramread --ARPA in.ARPA-format.txt >out.mod
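For readers who have no ARPA file at hand, the following Python sketch writes a tiny unigram-only file in the general ARPA layout (a `\data\` header, an `ngram 1=N` count line, a `\1-grams:` section with base-10 log probabilities, and a closing `\end\`). The helper name and the toy probabilities are made up for illustration, the `-99` entry for `<s>` follows a convention used by some toolkits, and whether ngramread accepts this exact file is an assumption that has not been checked against the tool.

```python
import math


def arpa_unigram_text(probs):
    """Return the text of a unigram-only ARPA file.

    `probs` maps each word (including </s>) to its probability.
    Hypothetical helper with toy values; not produced by any real model.
    """
    lines = ["\\data\\", f"ngram 1={len(probs) + 1}", "", "\\1-grams:"]
    # <s> is never predicted, so some toolkits give it a log prob of -99.
    lines.append("-99\t<s>")
    for word in sorted(probs):
        lines.append(f"{math.log10(probs[word]):.6f}\t{word}")
    lines += ["", "\\end\\"]
    return "\n".join(lines) + "\n"


toy_probs = {"</s>": 0.25, "a": 0.5, "b": 0.25}
arpa_text = arpa_unigram_text(toy_probs)
print(arpa_text, end="")
```

Saving `arpa_text` under the file name used in the example above would give an input of the kind that the --ARPA flag is described as reading.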
 
Deleted:
<
<
-- MichaelRiley - 09 Dec 2011
 \ No newline at end of file
Added:
>
>

Caveats

Revision 3 2011-12-13 - MichaelRiley

Line: 1 to 1
 
META TOPICPARENT name="NGramQuickTour"

NGramRead

Line: 14 to 14
 By default, ngramread constructs a symbol table on the fly, consisting of ⟨epsilon⟩ and every observed symbol in the text. With the flag --symbols=filename you can instead supply a fixed symbol table in the standard OpenFst format. All symbols in the input text that are not found in the provided symbol table are mapped to an OOV symbol, which is ⟨unk⟩ by default. The flag --OOV_symbol can be used to specify the OOV symbol in the provided symbol table if it is not ⟨unk⟩. The tokens ⟨s⟩ and ⟨/s⟩ are taken to represent start-of-sequence and end-of-sequence, respectively. Neither of these symbols is used in our automaton format (see above).

Usage

Added:
>
>

Examples

 

Caveats

Revision 2 2011-12-10 - MichaelRiley

Line: 1 to 1
 
META TOPICPARENT name="NGramQuickTour"

NGramRead

Line: 14 to 14
 By default, ngramread constructs a symbol table on the fly, consisting of ⟨epsilon⟩ and every observed symbol in the text. With the flag --symbols=filename you can instead supply a fixed symbol table in the standard OpenFst format. All symbols in the input text that are not found in the provided symbol table are mapped to an OOV symbol, which is ⟨unk⟩ by default. The flag --OOV_symbol can be used to specify the OOV symbol in the provided symbol table if it is not ⟨unk⟩. The tokens ⟨s⟩ and ⟨/s⟩ are taken to represent start-of-sequence and end-of-sequence, respectively. Neither of these symbols is used in our automaton format (see above).

Usage

Deleted:
<
<

Complexity

 

Caveats

Deleted:
<
<

References

 

-- MichaelRiley - 09 Dec 2011

Revision 1 2011-12-09 - MichaelRiley

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="NGramQuickTour"

NGramRead

Description

It has flags for specifying the format of the text input, currently one of two options:

  • By default or with the flag --counts, the text file is read as a sorted list of n-grams with their counts. The format is:
    w1 ... wk cnt
    where w1 ... wk are the k words of the n-gram and cnt is the (float) count of that n-gram. The n-grams in the list must be lexicographically ordered. An n-gram count automaton is built from the input.
  • By using the flag --ARPA, the file is read as an n-gram model in the well-known ARPA format. An n-gram model automaton is built from the input.

By default, ngramread constructs a symbol table on the fly, consisting of ⟨epsilon⟩ and every observed symbol in the text. With the flag --symbols=filename you can instead supply a fixed symbol table in the standard OpenFst format. All symbols in the input text that are not found in the provided symbol table are mapped to an OOV symbol, which is ⟨unk⟩ by default. The flag --OOV_symbol can be used to specify the OOV symbol in the provided symbol table if it is not ⟨unk⟩. The tokens ⟨s⟩ and ⟨/s⟩ are taken to represent start-of-sequence and end-of-sequence, respectively. Neither of these symbols is used in our automaton format (see above).

Usage

Complexity

Caveats

References

-- MichaelRiley - 09 Dec 2011

 