String (de)compilation
This directory contains functions useful for mapping strings into FSAs (
compilation) and for mapping string FSTs onto strings (
printing).
String token modes
This library provides three string compilation/printing "modes" controlled by the enum
fst::TokenType
.
For string compilation, C++ users should use
fst::StringCompile
in
stringcompile.h
. For string printing, C++ users should use
fst::StringPrint
in
stringprint.h
.
A compilation or printing mode is said to be
safe when the
Label
template parameter has enough precision to losslessly encode all possible labels generated by that mode. If a mode is not safe for the given
Label
, silent conversion may occur.
BYTE
and
UTF8
modes follow special conventions for interpreting the ASCII square bracket characters [ and ], which are also described below.
BYTE
mode
BYTE
mode is a sensible default for most users.
When compiling with
BYTE
mode, one arc is generated for each byte in the input string and that byte is used as both input and output label for that arc.
When printing with
BYTE
mode, each arc label is converted to a byte in the output bytestring.
Since this mode maps between
Label
and bytes, this mode is safe (modulo bracket parsing) when the
Label
template parameter has at least 8 bits of (positive) integer precision.
UTF8
mode
When compiling with
UTF8
mode, the input string is assumed to be encoded UTF-8. Each arc is then labeled with a Unicode codepoint. This may produce a somewhat smaller machine than
BYTE
mode since multiple bytes in the input may correspond to single arcs in the resulting machine in this mode. However, if the input is ASCII, both
BYTE
and
UTF8
modes will produce equivalent FSAs (though the former will be slightly faster).
When printing with
UTF8
mode, each arc is interpreted as a Unicode codepoint and the result is encoded as a UTF-8 bytestring.
Since Unicode guarantees no more than 2
32 unique codepoints, this mode is safe (modulo bracket parsing) when the Label template parameter has at least 32 bits of (positive) integer precision. However, in the common case where all codepoints in the input fall within the
Basic Multilingual Plane (BMP), no conversions will occur if the Label template parameter has at least 16 bits of (positive) integer precision. And, in the common case that all codepoints in the input are ASCII, no conversions will occur if the Label template parameter has at least 8 bits of (positive) integer precision.
SYMBOL
mode
When compiling with
SYMBOL
mode, the input string is parsed into tokens separated by space character, and then each input is looked up in a provided
fst::SymbolTable
. Compilation fails if any token is not present in the symbol table. When printing with
SYMBOL
mode, each arc's olabel (output-side label) is looked up in a provided
fst::SymbolTable
and each symbol string is written to the output string, once again separated by a space character. If the input is a transducer rather than an acceptor, it is implicitly reinterpreted as the acceptor given by its output projection. Printing fails if the FSA is not a string FSA (i.e., is not
fst::kString
).
Since
fst::SymbolTable
is keyed with
int64
, this mode is safe when the
Label
template parameter has as much precision as an
int64
. However, no conversions will occur if the Label template parameter has at least as much precision as every key in the the symbol table.
Square bracket parsing
BYTE
and
UTF8
modes follow special conventions for parsing (unescaped) ASCII square bracket characters [ and ]. When a string is enclosed by a set of (unescaped) square brackets, we refer to it as a
bracketed span. The delimiting brackets are subsequentely ignored and the bracketed span is interpreted as follows:
- If the bracketed span is empty, an error results.
- If the bracketed span contains one or more space characters, each space-delimited token is treated as a "generated" symbol and added to the symbol table attached to the resulting FSA (at a suitably high index so as to avoid any collisions with BMP codepoints).
- Otherwise, the bracketed span is run through stroll. If strtoll can parse the entire span as a (octal, hexidecimal, or decimal) integer, the span is labeled using that integer, cast to the
Label
template type. b. Otherwise, the span is treated as a "generated" symbol and stored in a global symbol table in a private use area.
Note that bracket parsing thus has the same safety guarantees as
SYMBOL
mode. However, no conversions will occur during bracket parsing so long as:
- bracketed integers are within the range of the Label template parameter, and
- bracketed "generated" symbols are assigned keys within the range of the
Label
template parameter.
When compiling with
BYTE
or
UTF8
mode, one may use a backslash to "escape" a bracket literal; one can use
fst::EscapeString
to do this automatically. When printing, no special treatment of bracket literals is necessary, as non-literal brackets are lost during the compilation process.
The following pairs of input strings and corresponding labels illustrate how brackets are parsed in
BYTE
or
UTF8
mode:
- "[0146][0x6f][111]": {102, 111, 111}
- "[foo bar baz]": {1048577, 1048578, 1048579}
- "foo[foo foo]": {102, 111, 111, 1048577, 1048577}
- "foo[foo][foo]": {102, 111, 111, 1048577, 1048577}
- "foo\[bar\]": {102, 111, 111, 91, 98, 97, 114, 93}
One can use
fst::GeneratedSymbols
to retrieve a view of the generated symbol table.
String maps
fst::StringMapCompile
compiles an FST from a vector of vector of strings. The first element in each inner vector is interpreted as the input string. The optional second element is interpreted as the output string for the transduction; if not specified it defaults to the value of the first element. An optional third element is interpreted as a weight for the transduction; if not specified it defaults to semiring One. These elements are used to form a cross-product transducer using the user-specified compilation mode (as above). The resulting FST represents the union of all pairs of weighted cross-products.
fst::StringFileCompile
is quite similar to
fst::StringMapCompile
, except that it reads pairs of input strings directly from a one-to-three-column TSV (tab-separated values) file.
Both functions use a prefix-tree construction, so the resulting FST may be significantly more compact than would be generated by repeatedly computing the union of many transducers. In the case that all inputs are strings rather than string pairs, the resulting transducer is an epsilon-free acceptor.