(r2) PyniniStringDoc < GRM

GRM Web>PyniniStringDoc (revision 2) (raw view)~~EditAttach~~
---+ String (de)compilation

This directory contains functions useful for mapping strings into FSAs
(_compilation_) and for mapping string FSTs onto strings (_printing_).

---++ String token modes

This library provides three string compilation/printing "modes" controlled by
the enum =fst::StringTokenType=.

For string compilation, C++ users should use =fst::CompileString= in
=stringcompile.h=. For string printing, C++ users should use
=fst::PrintString= in =stringprint.h=.

A compilation or printing mode is said to be _safe_ when the =Label= template
parameter has enough precision to losslessly encode all possible labels
generated by that mode. If a mode is not safe for the given =Label=, silent
conversion may occur.

=BYTE= and =UTF8= modes follow special conventions for interpreting the ASCII
square bracket characters =[= and =]=, which are also described below.

---+++ =BYTE= mode

=BYTE= mode is a sensible default for most users.

When compiling with =BYTE= mode, one arc is generated for each byte in the input
string and that byte is used as both input and output label for that arc. A
symbol table containing all bytes (in a printable format) is automatically
attached to the resulting FSA.

When printing with =BYTE= mode, each arc label is converted to a byte in the
output bytestring.

Since this mode maps between <tt>Label</tt>s and bytes, this mode is safe (modulo
bracket parsing) when the =Label= template parameter has at least 8 bits of
(positive) integer precision.

---+++ =UTF8= mode

When compiling with =UTF8= mode, the input string is assumed to be encoded
UTF-8. Each arc is then labeled with a Unicode codepoint. This may produce a
somewhat smaller machine than =BYTE= mode since multiple bytes in the input may
correspond to single arcs in the resulting machine in this mode. However, if the
input is ASCII, both =BYTE= and =UTF8= modes will produce equivalent FSAs
(though the former will be slightly faster). A symbol table containing all bytes
(in a printable format) as well as any non-ASCII Unicode codepoints present in
the input string is attached to the resulting FSA.

When printing with =UTF8= mode, each arc is interpreted as a Unicode codepoint
and the result is encoded as a UTF-8 bytestring.

Since Unicode guarantees no more than 2<sup>32</sup> unique codepoints, this mode is
safe (modulo bracket parsing) when the =Label= template parameter has at least
32 bits of (positive) integer precision. However, in the common case where all
codepoints in the input fall within the 
[[https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane][Basic Multilingual Plane (BMP)]],
no conversions will occur if the =Label= template parameter has at least 16 bits of (positive) integer precision. And, in the common case that all codepoints in
the input are ASCII, no conversions will occur if the =Label= template parameter
has at least 8 bits of (positive) integer precision.

---+++ =SYMBOL= mode

When compiling with =SYMBOL= mode, the input string is parsed into tokens
separated by space character, and then each input is looked up in a provided
=fst::SymbolTable=. The symbol table is then assigned to the resulting FSA.
Compilation fails if any token is not present in the symbol table. When printing
with =SYMBOL= mode, each arc's =olabel= (output-side label) is looked up in a
provided =fst::SymbolTable= and each symbol string is written to the output
string, once again separated by a space character. If the input is a transducer
rather than an acceptor, it is implicitly reinterpreted as the acceptor given by
its output projection. Printing fails if the FSA is not a string FSA (i.e., is
not =fst::kString=).

Since =fst::SymbolTable= is keyed with <tt>int64</tt>s, this mode is safe when the
=Label= template parameter has as much precision as an =int64=. However, no
conversions will occur if the =Label= template parameter has at least as much
precision as every key in the the symbol table.

---+++ Square bracket parsing

=BYTE= and =UTF8= modes follow special conventions for parsing (unescaped) ASCII
square bracket characters =[= and =]=. When a string is enclosed by a set of
(unescaped) square brackets, we refer to it as a _bracketed span_. The
delimiting brackets are subsequentely ignored and the bracketed span is
interpreted as follows:

   1.  If the bracketed span is empty, an error results.
   2.  If the bracketed span contains one or more space characters, each
   space-delimited token is treated as a "generated" symbol and added to the
   symbol table attached to the resulting FSA (at a suitably high index so as
   to avoid any collisions with BMP codepoints).
   3.  Otherwise, the bracketed span is run through =stroll=. If =strtoll= can parse the entire span as a (octal, hexidecimal, or decimal)
   integer, the span is labeled using that integer, cast to the =Label=
   template type. b. Otherwise, the span is treated as a "generated" symbol and
   added to the symbol table attached to the resulting FSA (at a suitably high
   index so as to avoid any collisons with BMP codepoints).

Note that bracket parsing thus has the same safety guarantees as =SYMBOL= mode.
However, no conversions will occur during bracket parsing so long as:

   *   bracketed integers are within the range of the =Label= template parameter,
    and
   *   bracketed "generated" symbols are assigned keys within the range of the
    =Label= template parameter.

When compiling with =BYTE= or =UTF8= mode, one may use a backslash to "escape" a
bracket literal. When printing, no special treatment of bracket literals is
necessary, as non-literal brackets are lost during the compilation process.

The following pairs of input strings and corresponding labels illustrate how
brackets are parsed in =BYTE= or =UTF8= mode:

   *   ="[0146][0x6f][111]"=: ={102, 111, 111}=
   *   ="[foo bar baz]"=: ={1048577, 1048578, 1048579}=
   *   ="foo[foo foo]"=: ={102, 111, 111, 1048577, 1048577}=
   *   ="foo[foo][foo]"=: ={102, 111, 111, 1048577, 1048577}=
   *   ="foo\[bar\]"=: ={102, 111, 111, 91, 98, 97, 114, 93}=

---++ String maps

=fst::CompileStringMap= compiles an FST from a vector of pairs of input
strings. Each input pair of strings is used to form a cross-product transducer
(see [[PyniniOperatorsDoc#CrossProductDoc][here]]) using the user-specified
compilation mode (as above). The resulting FST represents the union of all pairs
of cross-products.

=fst::CompileStringFile= is quite similar to =fst::CompileStringMap=,
except that it reads pairs of input strings directly from a two-column TSV
(tab-separated values) file.

Both functions take separate specifications for the input- and output-side
compilation mode. By specifying separate input and output compilation modes, one
can build a transducer that acts as a converter between modes.

Both functions use a prefix-tree construction, so the resulting FST may be
significantly more compact than would be generated by repeated invocations of
=fst::Union=.
Topic revision: r2 - 2017-07-05 - KyleGorman