FST optimization
There are several ways to "optimize" a weighted finite-state transducer (WFST).
The C++ template function
fst::Optimize
underlies Pynini's
optimize
instance method and Thrax's
Optimize
function. But what exactly does this make
more optimal [sic]? How does it accomplish that? And what, if anything, can we
assert about the resulting transducer?
What makes an WFST suboptimal...
Tolstoy wrote that "All happy families are alike; each unhappy family is unhappy
in its own way", and that's true of WFSTs too.
- WFST operations (including composition, concatenation, and union) introduce epsilon-transitions, arcs with epsilon input and output labels (e.g., all but two arcs here). Epsilon-removal removes these transitions.
- A WFST is deterministic if for each state, there is never more than one arc with a given label leaving that state (e.g., state 1 here). Determinization makes an WFST deterministic. While this can actually increase the size of the WFST, it is a prerequisite for the next algorithm.
- A WFST is minimal if it has the minimum number of states possible for expressing its relation. Minimization modifies a deterministic WFST's topology so that it becomes minimal (e.g., the examples here).
...what we can do about it...
fst::Optimize
checks whether the WFST is known to be epsilon-free; if not,
epsilon-removal is performed. Then, arc-sum mapping is performed. Then, it
attempts to perform minimization, determinizing along the way, if necessary, as
follows. If the WFST is (known to be) an acyclic acceptor but not (known to be)
deterministic, we determinize it, then minimize it. This works because any
acyclic transducer (over a
zero-sum-free semiring) is determinizable (Mohri
2009).
However, if the WFST is a transducer, it may be non-functional (i.e., it
describes a one-to-many relation); non-functional transducers can be
determinized, but this is quite inefficient. And, if the WFST has weighted
cycles (whether or not it's an acceptor or a transducer), it may lack the
twins property, in which case the determinization algorithm will never
terminate. However, in such a case, it is still usually possible to reduce the
number of arcs and states in the WFST (Allauzen et al. 2004): the WFST is
determinized and minimized
as if it were an acceptor (and if the WFST has
weighted cycles,
as if it were an unweighted acceptor). (Alternatively, we
could have checked for the twins property and/or for non-functionalism, but both
checks are themselves quite expensive, so we have chosen not to.) In that case,
determinization and minimization may introduce
multi-arcs, arcs with the same
source state, destination state, input label, and output-label. We therefore perform
arc-sum mapping, which merges such arcs using ⊕ to combine their weights,
making the WFST
multi-arc-free.
...and what we can assert about an optimized transducer
Once we have performed optimization on a WFST, it is free of multi-arcs and
unnecessary epsilon-epsilon arcs. And, if the WFST was an acyclic acceptor, it
is guaranteed to also be deterministic and minimal. We
cannot be sure that it is
smaller (in terms of the number of states and arcs) than the input, though it usually
is in practice.
Other behind-the-scenes optimizations
When computing the
difference of two
WFSTs, the right-hand side must be epsilon-free and deterministic. If it is not known
to have these properties, Thrax and Pynini perform epsilon-removal and attempt
determinization.
References
Allauzen, C., Mohri, M., Riley, M., and Roark, B. 2004. A generalized
construction of integrated speech recognition transducers. In
ICASSP, 761-764.
Mohri, M. 2009. Weighted automata algorithms. In M. Droste, W. Kuich, and H.
Vogler,
Handbook of weighted automata, pages 213-254. New York: Springer.