String (de)compilation
This directory contains functions useful for mapping strings into FSAs
(
compilation) and for mapping string FSTs onto strings (
printing).
String token modes
This library provides three string compilation/printing "modes" controlled by
the enum
fst::StringTokenType
.
For string compilation, C++ users should use
fst::CompileString
in
stringcompile.h
. For string printing, C++ users should use
fst::PrintString
in
stringprint.h
.
A compilation or printing mode is said to be
safe when the
Label
template
parameter has enough precision to losslessly encode all possible labels
generated by that mode. If a mode is not safe for the given
Label
, silent
conversion may occur.
BYTE
and
UTF8
modes follow special conventions for interpreting the ASCII
square bracket characters
[
and
]
, which are also described below.
BYTE
mode
BYTE
mode is a sensible default for most users.
When compiling with
BYTE
mode, one arc is generated for each byte in the input
string and that byte is used as both input and output label for that arc. A
symbol table containing all bytes (in a printable format) is automatically
attached to the resulting FSA.
When printing with
BYTE
mode, each arc label is converted to a byte in the
output bytestring.
Since this mode maps between
Labels and bytes, this mode is safe (modulo
bracket parsing) when the
Label
template parameter has at least 8 bits of
(positive) integer precision.
UTF8
mode
When compiling with
UTF8
mode, the input string is assumed to be encoded
UTF-8. Each arc is then labeled with a Unicode codepoint. This may produce a
somewhat smaller machine than
BYTE
mode since multiple bytes in the input may
correspond to single arcs in the resulting machine in this mode. However, if the
input is ASCII, both
BYTE
and
UTF8
modes will produce equivalent FSAs
(though the former will be slightly faster). A symbol table containing all bytes
(in a printable format) as well as any non-ASCII Unicode codepoints present in
the input string is attached to the resulting FSA.
When printing with
UTF8
mode, each arc is interpreted as a Unicode codepoint
and the result is encoded as a UTF-8 bytestring.
Since Unicode guarantees no more than 2
32 unique codepoints, this mode is
safe (modulo bracket parsing) when the
Label
template parameter has at least
32 bits of (positive) integer precision. However, in the common case where all
codepoints in the input fall within the
Basic Multilingual Plane (BMP),
no conversions will occur if the
Label
template parameter has at least 16 bits of (positive) integer precision. And, in the common case that all codepoints in
the input are ASCII, no conversions will occur if the
Label
template parameter
has at least 8 bits of (positive) integer precision.
SYMBOL
mode
When compiling with
SYMBOL
mode, the input string is parsed into tokens
separated by space character, and then each input is looked up in a provided
fst::SymbolTable
. The symbol table is then assigned to the resulting FSA.
Compilation fails if any token is not present in the symbol table. When printing
with
SYMBOL
mode, each arc's
olabel
(output-side label) is looked up in a
provided
fst::SymbolTable
and each symbol string is written to the output
string, once again separated by a space character. If the input is a transducer
rather than an acceptor, it is implicitly reinterpreted as the acceptor given by
its output projection. Printing fails if the FSA is not a string FSA (i.e., is
not
fst::kString
).
Since
fst::SymbolTable
is keyed with
int64s, this mode is safe when the
Label
template parameter has as much precision as an
int64
. However, no
conversions will occur if the
Label
template parameter has at least as much
precision as every key in the the symbol table.
Square bracket parsing
BYTE
and
UTF8
modes follow special conventions for parsing (unescaped) ASCII
square bracket characters
[
and
]
. When a string is enclosed by a set of
(unescaped) square brackets, we refer to it as a
bracketed span. The
delimiting brackets are subsequentely ignored and the bracketed span is
interpreted as follows:
- If the bracketed span is empty, an error results.
- If the bracketed span contains one or more space characters, each space-delimited token is treated as a "generated" symbol and added to the symbol table attached to the resulting FSA (at a suitably high index so as to avoid any collisions with BMP codepoints).
- Otherwise, the bracketed span is run through
stroll
. If strtoll
can parse the entire span as a (octal, hexidecimal, or decimal) integer, the span is labeled using that integer, cast to the Label
template type. b. Otherwise, the span is treated as a "generated" symbol and added to the symbol table attached to the resulting FSA (at a suitably high index so as to avoid any collisons with BMP codepoints).
Note that bracket parsing thus has the same safety guarantees as
SYMBOL
mode.
However, no conversions will occur during bracket parsing so long as:
- bracketed integers are within the range of the
Label
template parameter, and
- bracketed "generated" symbols are assigned keys within the range of the
Label
template parameter.
When compiling with
BYTE
or
UTF8
mode, one may use a backslash to "escape" a
bracket literal. When printing, no special treatment of bracket literals is
necessary, as non-literal brackets are lost during the compilation process.
The following pairs of input strings and corresponding labels illustrate how
brackets are parsed in
BYTE
or
UTF8
mode:
-
"[0146][0x6f][111]"
: {102, 111, 111}
-
"[foo bar baz]"
: {1048577, 1048578, 1048579}
-
"foo[foo foo]"
: {102, 111, 111, 1048577, 1048577}
-
"foo[foo][foo]"
: {102, 111, 111, 1048577, 1048577}
-
"foo\[bar\]"
: {102, 111, 111, 91, 98, 97, 114, 93}
String maps
fst::CompileStringMap
compiles an FST from a vector of pairs of input
strings. Each input pair of strings is used to form a cross-product transducer
(see
here) using the user-specified
compilation mode (as above). The resulting FST represents the union of all pairs
of cross-products.
fst::CompileStringFile
is quite similar to
fst::CompileStringMap
,
except that it reads pairs of input strings directly from a two-column TSV
(tab-separated values) file.
Both functions take separate specifications for the input- and output-side
compilation mode. By specifying separate input and output compilation modes, one
can build a transducer that acts as a converter between modes.
Both functions use a prefix-tree construction, so the resulting FST may be
significantly more compact than would be generated by repeated invocations of
fst::Union
.