String (de)compilation

This directory contains functions useful for mapping strings into FSAs (compilation) and for mapping string FSTs onto strings (printing).

String token modes

This library provides three string compilation/printing "modes" controlled by the enum fst::TokenType.

For string compilation, C++ users should use fst::StringCompile in stringcompile.h. For string printing, C++ users should use fst::StringPrint in stringprint.h.

A compilation or printing mode is said to be safe when the Label template parameter has enough precision to losslessly encode all possible labels generated by that mode. If a mode is not safe for the given Label, silent conversion may occur.

BYTE and UTF8 modes follow special conventions for interpreting the ASCII square bracket characters [ and ], which are also described below.

BYTE mode

BYTE mode is a sensible default for most users.

When compiling with BYTE mode, one arc is generated for each byte in the input string and that byte is used as both input and output label for that arc.

When printing with BYTE mode, each arc label is converted to a byte in the output bytestring.

Since this mode maps between Label and bytes, this mode is safe (modulo bracket parsing) when the Label template parameter has at least 8 bits of (positive) integer precision.

UTF8 mode

When compiling with UTF8 mode, the input string is assumed to be encoded UTF-8. Each arc is then labeled with a Unicode codepoint. This may produce a somewhat smaller machine than BYTE mode since multiple bytes in the input may correspond to single arcs in the resulting machine in this mode. However, if the input is ASCII, both BYTE and UTF8 modes will produce equivalent FSAs (though the former will be slightly faster).

When printing with UTF8 mode, each arc is interpreted as a Unicode codepoint and the result is encoded as a UTF-8 bytestring.

Since Unicode guarantees no more than 232 unique codepoints, this mode is safe (modulo bracket parsing) when the Label template parameter has at least 32 bits of (positive) integer precision. However, in the common case where all codepoints in the input fall within the Basic Multilingual Plane (BMP), no conversions will occur if the Label template parameter has at least 16 bits of (positive) integer precision. And, in the common case that all codepoints in the input are ASCII, no conversions will occur if the Label template parameter has at least 8 bits of (positive) integer precision.

SYMBOL mode

When compiling with SYMBOL mode, the input string is parsed into tokens separated by space character, and then each input is looked up in a provided fst::SymbolTable. Compilation fails if any token is not present in the symbol table. When printing with SYMBOL mode, each arc's olabel (output-side label) is looked up in a provided fst::SymbolTable and each symbol string is written to the output string, once again separated by a space character. If the input is a transducer rather than an acceptor, it is implicitly reinterpreted as the acceptor given by its output projection. Printing fails if the FSA is not a string FSA (i.e., is not fst::kString).

Since fst::SymbolTable is keyed with int64, this mode is safe when the Label template parameter has as much precision as an int64. However, no conversions will occur if the Label template parameter has at least as much precision as every key in the the symbol table.

Square bracket parsing

BYTE and UTF8 modes follow special conventions for parsing (unescaped) ASCII square bracket characters [ and ]. When a string is enclosed by a set of (unescaped) square brackets, we refer to it as a bracketed span. The delimiting brackets are subsequentely ignored and the bracketed span is interpreted as follows:

  1. If the bracketed span is empty, an error results.
  2. If the bracketed span contains one or more space characters, each space-delimited token is treated as a "generated" symbol and added to the symbol table attached to the resulting FSA (at a suitably high index so as to avoid any collisions with BMP codepoints).
  3. Otherwise, the bracketed span is run through stroll. If strtoll can parse the entire span as a (octal, hexidecimal, or decimal) integer, the span is labeled using that integer, cast to the Label template type. b. Otherwise, the span is treated as a "generated" symbol and stored in a global symbol table in a private use area.

Note that bracket parsing thus has the same safety guarantees as SYMBOL mode. However, no conversions will occur during bracket parsing so long as:

  • bracketed integers are within the range of the Label template parameter, and
  • bracketed "generated" symbols are assigned keys within the range of the Label template parameter.

When compiling with BYTE or UTF8 mode, one may use a backslash to "escape" a bracket literal; one can use fst::EscapeString to do this automatically. When printing, no special treatment of bracket literals is necessary, as non-literal brackets are lost during the compilation process.

The following pairs of input strings and corresponding labels illustrate how brackets are parsed in BYTE or UTF8 mode:

  • "[0146][0x6f][111]": {102, 111, 111}
  • "[foo bar baz]": {1048577, 1048578, 1048579}
  • "foo[foo foo]": {102, 111, 111, 1048577, 1048577}
  • "foo[foo][foo]": {102, 111, 111, 1048577, 1048577}
  • "foo\[bar\]": {102, 111, 111, 91, 98, 97, 114, 93}

One can use fst::GeneratedSymbols to retrieve a view of the generated symbol table.

String maps

fst::StringMapCompile compiles an FST from a vector of vector of strings. The first element in each inner vector is interpreted as the input string. The optional second element is interpreted as the output string for the transduction; if not specified it defaults to the value of the first element. An optional third element is interpreted as a weight for the transduction; if not specified it defaults to semiring One. These elements are used to form a cross-product transducer using the user-specified compilation mode (as above). The resulting FST represents the union of all pairs of weighted cross-products.

fst::StringFileCompile is quite similar to fst::StringMapCompile, except that it reads pairs of input strings directly from a one-to-three-column TSV (tab-separated values) file.

Both functions use a prefix-tree construction, so the resulting FST may be significantly more compact than would be generated by repeatedly computing the union of many transducers. In the case that all inputs are strings rather than string pairs, the resulting transducer is an epsilon-free acceptor.

Edit | Attach | Watch | Print version | History: r9 < r8 < r7 < r6 < r5 | Backlinks | Raw View | WYSIWYG | More topic actions
Topic revision: r9 - 2022-09-25 - KyleGorman
 
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback