Description
This operation merges two n-gram language models or two n-gram count FSTs. It provides options for weighting the two input FSTs and for using smoothing while merging.
Usage
ngrammerge [--options] in1.fst in2.fst [out.fst]
--alpha: type = double, default = 1.0, weight for in1.fst in real semiring
--beta: type = double, default = 1.0, weight for in2.fst in real semiring
--normalize: type = bool, default = false, whether to normalize the resulting model
--use_smoothing: type = bool, default = false, whether to use model smoothing when merging
--fixedorder: type = bool, default = false, whether to merge in the given argument order
class NGramMerge(StdMutableFst *infst1, StdMutableFst *infst2, double alpha, double beta);
In addition to the simple C++ usage above, optional arguments permit passing non-default values for various parameters, similar to the command-line version.
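For illustration, here is a minimal sketch of this usage, assuming only the constructor documented above plus the standard OpenFst read/write calls. Whether the merge is performed by the constructor itself or by a separate method call may vary across library versions, so check ngram/ngram-merge.h in your installation:

#include <fst/fstlib.h>
#include <ngram/ngram-merge.h>

int main() {
  // Read the two input count FSTs using the standard OpenFst API.
  fst::StdMutableFst *in1 = fst::StdMutableFst::Read("earnest.aa.cnts", true);
  fst::StdMutableFst *in2 = fst::StdMutableFst::Read("earnest.ab.cnts", true);
  if (!in1 || !in2) return 1;

  // Merge with unit weights, i.e., the command-line defaults
  // --alpha=1.0 --beta=1.0 (constructor signature as documented above).
  ngram::NGramMerge merge(in1, in2, 1.0, 1.0);

  // Assumption: the merged result is left in the first input FST.
  in1->Write("earnest.merged.cnts");

  delete in1;
  delete in2;
  return 0;
}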
Examples
Suppose we split our corpus into two parts, earnest.aa and earnest.ab, e.g., by using split (-844 is the obsolete spelling of -l 844, i.e., 844 lines per output file):
$ split -844 earnest.txt earnest.
If we count each half independently, we can then merge the counts to get the same counts as derived above from the full corpus (earnest.cnts):
$ farcompilestrings -symbols=earnest.syms -keep_symbols=1 earnest.aa >earnest.aa.far
$ ngramcount -order=5 earnest.aa.far >earnest.aa.cnts
$ farcompilestrings -symbols=earnest.syms -keep_symbols=1 earnest.ab >earnest.ab.far
$ ngramcount -order=5 earnest.ab.far >earnest.ab.cnts
$ ngrammerge earnest.aa.cnts earnest.ab.cnts >earnest.merged.cnts
$ fstequal earnest.cnts earnest.merged.cnts
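If the two FSTs are equal, fstequal prints nothing and exits with status 0, so the check above succeeds silently.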
Note that, unlike our example above of merging unnormalized counts, merging two smoothed models that have each been built from half of the corpus results in a different model than one built from the corpus as a whole, due to the smoothing and mixing.
Each of the two model or count FSTs can be weighted, using the --alpha switch for the first input FST and the --beta switch for the second input FST. These weights are interpreted in the real semiring and both default to one, meaning that by default the original counts or probabilities are not scaled. For an n-gram w1 ... wk, the default count merging approach yields
C(w1 ... wk) = alpha * C1(w1 ... wk) + beta * C2(w1 ... wk)
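For example, to double the contribution of the first half's counts before merging (the output filename here is illustrative):
$ ngrammerge --alpha=2 --beta=1 earnest.aa.cnts earnest.ab.cnts >earnest.weighted.cnts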
To merge two smoothed models, the --use_smoothing=true option provides non-zero probability from each input language model to any in-vocabulary n-gram, and the --normalize=true option ensures that the resulting model is fully normalized. For example, to produce a merged model that weights the contribution of the first model by a factor of 3 and the contribution of the second model by a factor of 2:
$ ngrammerge --use_smoothing --normalize --alpha=3 --beta=2 earnest.aa.mod earnest.ab.mod >earnest.merged.mod