TWiki
>
GRM Web
>
PyniniStringDoc
(revision 2) (raw view)
Edit
Attach
---+ String (de)compilation This directory contains functions useful for mapping strings into FSAs (_compilation_) and for mapping string FSTs onto strings (_printing_). ---++ String token modes This library provides three string compilation/printing "modes" controlled by the enum =fst::StringTokenType=. For string compilation, C++ users should use =fst::CompileString= in =stringcompile.h=. For string printing, C++ users should use =fst::PrintString= in =stringprint.h=. A compilation or printing mode is said to be _safe_ when the =Label= template parameter has enough precision to losslessly encode all possible labels generated by that mode. If a mode is not safe for the given =Label=, silent conversion may occur. =BYTE= and =UTF8= modes follow special conventions for interpreting the ASCII square bracket characters =[= and =]=, which are also described below. ---+++ =BYTE= mode =BYTE= mode is a sensible default for most users. When compiling with =BYTE= mode, one arc is generated for each byte in the input string and that byte is used as both input and output label for that arc. A symbol table containing all bytes (in a printable format) is automatically attached to the resulting FSA. When printing with =BYTE= mode, each arc label is converted to a byte in the output bytestring. Since this mode maps between <tt>Label</tt>s and bytes, this mode is safe (modulo bracket parsing) when the =Label= template parameter has at least 8 bits of (positive) integer precision. ---+++ =UTF8= mode When compiling with =UTF8= mode, the input string is assumed to be encoded UTF-8. Each arc is then labeled with a Unicode codepoint. This may produce a somewhat smaller machine than =BYTE= mode since multiple bytes in the input may correspond to single arcs in the resulting machine in this mode. However, if the input is ASCII, both =BYTE= and =UTF8= modes will produce equivalent FSAs (though the former will be slightly faster). A symbol table containing all bytes (in a printable format) as well as any non-ASCII Unicode codepoints present in the input string is attached to the resulting FSA. When printing with =UTF8= mode, each arc is interpreted as a Unicode codepoint and the result is encoded as a UTF-8 bytestring. Since Unicode guarantees no more than 2<sup>32</sup> unique codepoints, this mode is safe (modulo bracket parsing) when the =Label= template parameter has at least 32 bits of (positive) integer precision. However, in the common case where all codepoints in the input fall within the [[https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane][Basic Multilingual Plane (BMP)]], no conversions will occur if the =Label= template parameter has at least 16 bits of (positive) integer precision. And, in the common case that all codepoints in the input are ASCII, no conversions will occur if the =Label= template parameter has at least 8 bits of (positive) integer precision. ---+++ =SYMBOL= mode When compiling with =SYMBOL= mode, the input string is parsed into tokens separated by space character, and then each input is looked up in a provided =fst::SymbolTable=. The symbol table is then assigned to the resulting FSA. Compilation fails if any token is not present in the symbol table. When printing with =SYMBOL= mode, each arc's =olabel= (output-side label) is looked up in a provided =fst::SymbolTable= and each symbol string is written to the output string, once again separated by a space character. If the input is a transducer rather than an acceptor, it is implicitly reinterpreted as the acceptor given by its output projection. Printing fails if the FSA is not a string FSA (i.e., is not =fst::kString=). Since =fst::SymbolTable= is keyed with <tt>int64</tt>s, this mode is safe when the =Label= template parameter has as much precision as an =int64=. However, no conversions will occur if the =Label= template parameter has at least as much precision as every key in the the symbol table. ---+++ Square bracket parsing =BYTE= and =UTF8= modes follow special conventions for parsing (unescaped) ASCII square bracket characters =[= and =]=. When a string is enclosed by a set of (unescaped) square brackets, we refer to it as a _bracketed span_. The delimiting brackets are subsequentely ignored and the bracketed span is interpreted as follows: 1. If the bracketed span is empty, an error results. 2. If the bracketed span contains one or more space characters, each space-delimited token is treated as a "generated" symbol and added to the symbol table attached to the resulting FSA (at a suitably high index so as to avoid any collisions with BMP codepoints). 3. Otherwise, the bracketed span is run through =stroll=. If =strtoll= can parse the entire span as a (octal, hexidecimal, or decimal) integer, the span is labeled using that integer, cast to the =Label= template type. b. Otherwise, the span is treated as a "generated" symbol and added to the symbol table attached to the resulting FSA (at a suitably high index so as to avoid any collisons with BMP codepoints). Note that bracket parsing thus has the same safety guarantees as =SYMBOL= mode. However, no conversions will occur during bracket parsing so long as: * bracketed integers are within the range of the =Label= template parameter, and * bracketed "generated" symbols are assigned keys within the range of the =Label= template parameter. When compiling with =BYTE= or =UTF8= mode, one may use a backslash to "escape" a bracket literal. When printing, no special treatment of bracket literals is necessary, as non-literal brackets are lost during the compilation process. The following pairs of input strings and corresponding labels illustrate how brackets are parsed in =BYTE= or =UTF8= mode: * ="[0146][0x6f][111]"=: ={102, 111, 111}= * ="[foo bar baz]"=: ={1048577, 1048578, 1048579}= * ="foo[foo foo]"=: ={102, 111, 111, 1048577, 1048577}= * ="foo[foo][foo]"=: ={102, 111, 111, 1048577, 1048577}= * ="foo\[bar\]"=: ={102, 111, 111, 91, 98, 97, 114, 93}= ---++ String maps =fst::CompileStringMap= compiles an FST from a vector of pairs of input strings. Each input pair of strings is used to form a cross-product transducer (see [[PyniniOperatorsDoc#CrossProductDoc][here]]) using the user-specified compilation mode (as above). The resulting FST represents the union of all pairs of cross-products. =fst::CompileStringFile= is quite similar to =fst::CompileStringMap=, except that it reads pairs of input strings directly from a two-column TSV (tab-separated values) file. Both functions take separate specifications for the input- and output-side compilation mode. By specifying separate input and output compilation modes, one can build a transducer that acts as a converter between modes. Both functions use a prefix-tree construction, so the resulting FST may be significantly more compact than would be generated by repeated invocations of =fst::Union=.
Edit
|
Attach
|
Watch
|
P
rint version
|
H
istory
:
r9
|
r4
<
r3
<
r2
<
r1
|
B
acklinks
|
V
iew topic
|
Raw edit
|
More topic actions...
Topic revision: r2 - 2017-07-05
-
KyleGorman
GRM
Log In
or
Register
GRM Web
Create New Topic
Index
Search
Changes
Notifications
Statistics
Preferences
Webs
Contrib
FST
Forum
GRM
Kernel
Main
Sandbox
TWiki
Main
Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki?
Send feedback