When I load a pre-trained language model into C++ and iterate over its arcs, I see that there are no arcs labeled with the given OOV symbol, although the symbol is present in the symbol table.
Are OOV words handled somewhere other than the FST itself, or is the absence of these arcs likely due to a quirk of my particular language model?
Any insight or ideas would be greatly appreciated. Thanks!
Hi,
I built a 3-gram language model on a few English words.
My C++ program receives a stream of characters, one at a time.
I would like to use the 3-gram model to score each upcoming character given its history. What example can I start from? Is it possible to avoid converting to FAR strings each time?
I tried an order-1 n-gram count on this simple text:
Goose is hehe
Goose is hehe
Goose is
Goose
But I don't understand the resulting count FST:
0 -1.3863
0 0 Goose Goose -1.3863
0 0 is is -1.0986
0 0 hehe hehe -0.69315
The documentation says "Transitions and final costs are weighted with the negative log count of the associated n-gram", but I can't make sense of these numbers. Can someone help me out? Thanks!
Hi,
The counts are stored as negative natural logs (base e), so -0.69315 is -log(2), -1.0986 is -log(3), and -1.3863 is -log(4). The count of each word is kept on an arc of a single-state machine (since this is order 1), and the final cost encodes the end-of-string count (the end of string occurred four times in your example). You printed this using fstprint, but you can also try ngramprint, which in this case yields:
<s> 4
Goose 4
is 3
hehe 2
</s> 4
where <s> and </s> are the implicit beginning-of-string and end-of-string events. These are implicit because we don't actually use those symbols to encode them in the FST.
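If it helps, here is a tiny arithmetic sketch (not library code; the weights are just the ones from your fstprint output) showing how to recover the raw counts:

#include <cmath>
#include <cstdio>

// A count FST stores -log(count) as its weight, so the raw count is exp(-weight).
double WeightToCount(double weight) { return std::exp(-weight); }

int main() {
  std::printf("%g\n", WeightToCount(-1.3863));   // ~4: Goose, and the final cost
  std::printf("%g\n", WeightToCount(-1.0986));   // ~3: is
  std::printf("%g\n", WeightToCount(-0.69315));  // ~2: hehe
  return 0;
}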
Hope that clears it up for you. If not, try the link to the page on 'precise details' in the 'Model Format' section of the quick tour.
best,
brian
Hi,
I tried to merge two LMs with the following command line:
./tools/opengrm-ngram-1.0.3/bin/ngrammerge --alpha=0.2 --beta=0.8 --normalize --use_smoothing A.fst B.fst AB.mrg.fst
But I get an error:
FATAL: NGramModel: input not deterministic
A.fst is a normal FST LM trained with SRILM; B.fst is a class-expanded LM built with the fstreplace command.
I also tried to convert the FST LM into an ARPA LM with this command line:
./tools/opengrm-ngram-1.0.3/bin/ngramprint --ARPA B.fst > B.arpa
But I got the same error.
Hi,
you've introduced non-determinism into the n-gram models via your class-replacement modification. The ngrammerge (and ngramprint) commands are simple operations that expect a standard n-gram topology, hence the error messages. For more complex model topologies of the sort you have, you'll have to write your own model-merge function that does the right thing when presented with non-determinism. The base library functions don't handle these complex cases, but the code should give you some indication of how to approach such a model mixture. Such is the benefit of open source!
brian
Error when converting an LM generated by HTK into FST format
Hi,
I tried to convert an ARPA LM generated by the HTK tool into FST format. The command is:
./tools/opengrm-ngram-1.0.3/bin/ngramread --ARPA test.arpa > test.lm.fst
But I get an error:
Hi,
it appears that you have n-grams ending in your stop symbol (probably </s>) that have backoff weights, i.e., the ARPA format has an n-gram that looks like:
-1.583633 XYZ </s> -0.30103
But </s> means end-of-string, which we encode as final cost, not an arc leading to a new state. Hence there is no state where that backoff cost would be used. (Think of it this way: what's the next word you predict after </s>? In the standard semantics of </s>, it is the last term predicted, so nothing comes afterwards.) Do you also have n-grams that start with </s>?
So one fix to your ARPA file is simply to remove the backoff weight from n-grams that end in </s>, along the lines of the sketch below.
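For illustration, a minimal stand-alone filter could look like this (a sketch, not part of the toolkit; it assumes whitespace-separated ARPA fields and rejoins the kept fields with tabs):

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Reads an ARPA model on stdin and writes it to stdout, dropping the trailing
// backoff weight from any n-gram whose final word is </s>.
int main() {
  std::string line;
  while (std::getline(std::cin, line)) {
    std::istringstream strm(line);
    std::vector<std::string> fields;
    std::string field;
    while (strm >> field) fields.push_back(field);
    // An n-gram line with a backoff looks like: logprob w1 ... wN backoff.
    if (fields.size() >= 3 && fields[fields.size() - 2] == "</s>") {
      fields.pop_back();  // drop the backoff weight
      line = fields[0];
      for (size_t i = 1; i < fields.size(); ++i) line += "\t" + fields[i];
    }
    std::cout << line << "\n";
  }
  return 0;
}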
hope that helps,
brian
Hi, Brian
Thank you! I got it.
Another case: there are some n-grams that start with </s> in my HTK LM. I think this is a bug in the HTK tool, but it is my only option for training a class-based LM with automatic class clustering from large amounts of plain data. How do I fix it? Is it reasonable to simply remove these n-grams?
Thanks,
Huanliang Wang
yes, you might try just removing those n-grams. In the ARPA format, you'll also have to adjust the n-gram counts listed at the top of the file to match the number of n-grams you actually have at each order.
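For illustration, with made-up numbers: if the header of your ARPA file reads

\data\
ngram 1=10
ngram 2=50
ngram 3=120

and you delete two bigrams and three trigrams that start with </s>, those header lines must become ngram 2=48 and ngram 3=117 so that they still match the body of the file.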
Hi, Brian
Thank you! I got it.
Could you give me an example of how to use fstreplace to replace a nonterminal label in an FST with another FST?
Thanks,
Huanliang Wang
Hi,
I'm currently playing around with a test example, and I noticed that after ngrammake, if I call fstinfo (not ngraminfo) on the resulting language model, fstinfo complains that the model is ill-formed. This is due to transitions (typically on epsilons) that have "Infinity" weight, which does not seem to be supported by OpenFst. Is that working as intended? A related problem: if I later call fstshortestpath to get, e.g., the n most likely sentences from the model, the result contains not only "Infinity" weights but also "BadNumber", which might be a consequence of the infinite values.
Thanks,
Roland
Hi Roland,
yes, under certain circumstances, some states in the model end up with infinite backoff cost, i.e., zero probability of backoff. In many cases this is, in fact, the correct weight to assign to backoff. For example, with a very small vocabulary and many observations, you might have a bigram state that has observations for every symbol in the vocabulary, hence no probability mass should be given to backoff. Still, this does cause some problems with OpenFst. In the next version (due to be released in the next month or so) we will by default have a minimum backoff probability of some very small epsilon (i.e., very large negative log probability). As a workaround in the meantime, I would suggest using fstprint to print the model to text, then use sed or perl or whatever to replace Infinity with some very large cost -- I think SRILM uses 99 in such cases, which would work fine.
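For example, something along these lines (an untested sketch; if your model carries symbol tables, you will also need the appropriate --isymbols/--osymbols flags on fstprint and fstcompile):

fstprint lm.fst | sed 's/Infinity/99/g' | fstcompile > lm.fixed.fst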
hope that helps,
brian
If I may add another quick question: when running fstshortestpath on the n-gram count language model (i.e., after ngramcount but before ngrammake), I was expecting to get the most frequent n-gram, but instead the algorithm never seems to terminate. Any idea why that is? I thought that shortest path over the tropical semiring should always terminate.
Thanks,
Roland
The n-gram count FST contains arcs weighted with negative log counts. Since the counts can be greater than one, the negative log counts can be less than zero; for example, a symbol counted four times gets weight -log 4 ≈ -1.386, so every traversal of that arc lowers the total path cost. Hence the shortest path is an infinite string repeating the most frequent symbol: each symbol emission shortens the path, hence non-termination.
brian
Hi. I maintain several voice-recognition-related packages, including openfst, for the Fedora Linux distribution. I am working on an OpenGrm NGram package. My first attempt at building version 1.0.3 (with GCC 4.7.2 and glibc 2.15) failed:
In file included from ngramrandgen.cc:32:0:
./../include/ngram/ngram-randgen.h:55:48: error: there are no arguments to 'getpid' that depend on a template parameter, so a declaration of 'getpid' must be available [-fpermissive]
./../include/ngram/ngram-randgen.h:55:48: note: (if you use '-fpermissive', G++ will accept your code, but allowing the use of an undeclared name is deprecated)
ngramrandgen.cc:39:1: error: 'getpid' was not declared in this scope
ngramrandgen.cc:39:1: error: 'getpid' was not declared in this scope
It appears that an explicit #include <unistd.h> is needed in ngram-randgen.h. That header was probably pulled in through some other header in previous versions of either gcc or glibc.
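For reference, the fix is a one-liner added alongside the other includes near the top of src/include/ngram/ngram-randgen.h:

#include <unistd.h>  // declares getpid(), no longer pulled in transitively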
I was wondering what the expected result is when feeding a lattice, rather than a string/sentence, to the ngramperplexity utility? Is this supported? It seems to report the perplexity of an arbitrary path through the lattice.
Hi Josef,
ngramperplexity reports the perplexity of the path through the lattice that you get by taking the first arc out of each state that you reach. (Note that this is what you want for strings encoded as single-path automata.) Not sure what the preferred functionality should be for general lattices. It could make sense to show a warning or an error there, but at this point the onus is on the user to ensure that what is being scored is the same as what you get from farcompilestrings: unweighted, single-path automata. If you have an idea of what the preferred functionality would be for non-string lattices, email me.
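For reference, the kind of input ngramperplexity expects is the FAR you get from compiling plain text, something like this (a sketch; lm.syms, text.txt, and lm.mod are placeholder names, and flag spellings follow the quick tour but may differ across versions):

farcompilestrings --symbols=lm.syms --keep_symbols=1 text.txt > text.far
ngramperplexity --v=1 lm.mod text.far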
brian
Hi,
I do not want to print my FST and run ngramapply in bash before reading the new FST back into C++.
Is there a way to use the ngramapply functionality directly in C++?
Thanks
Hi Markus,
there is no single method; rather there are several ways to perform composition with the model, depending on how you want to interpret the backoff arcs. The most straightforward way to do this in your own code is to look at src/bin/ngramapply.cc and use the composition method for the particular kind of backoff arc, e.g., ngram.FailLMCompose() when interpreting the backoff as a failure transition. In other words, write your own ngramapply method based on inspection of the ngramapply code.
Hope that helps,
brian
Hi,
thanks, yes, I think that should work.
I am using failure arcs and my LM FST already exists, so I do not need to build an LM FST from strings or an ARPA LM.
I first just need to read the LM FST from disk:
#include <ngram/ngram.h>
// Read is a static method; it returns a newly allocated FST (or nullptr on failure).
fst::StdMutableFst *fstforNGram = fst::StdMutableFst::Read($MYNGRAMFST);
ngram::NGramModel ngram(fstforNGram);
// that seems not to work, as: undefined reference to `ngram::NGramModel::InitModel()'
Once the LM is read, I could then just add:
ngram.FailLMCompose(*lattice, &cfst, kSpecialLabel);
and the composed FST should be ready, right?
Thanks for helping
yes, but I have a problem reading the LM FST in C++:
// Read is a static method; it returns a newly allocated FST (or nullptr on failure).
fst::StdMutableFst *fstforNGram = fst::StdMutableFst::Read($MYNGRAMFST);
Up to that point it works.
ngram::NGramModel ngram(fstforNGram);
That seems not to work; I get: undefined reference to `ngram::NGramModel::InitModel()'
Thanks
Hi, I have been using OpenGrm with my Grapheme-to-Phoneme conversion tools for a while now and recently added some functionality to output weighted alignment lattices in .far format.
It is my understanding that these weighted lattices can currently be utilized only with Witten-Bell smoothing; is this correct?
Is there any plan to support fractional counts with Kneser-Ney smoothing, for instance along the lines of "Correlated Bigram LSA for Unsupervised Language Model Adaptation" (Tam and Schultz), or would I be best advised to implement this myself?
Hi Josef,
Witten-Bell generalizes straightforwardly to fractional counts, as you point out. No immediate plans for new versions of other smoothing methods along those lines, so if that's something that you need urgently, you would need to implement it.
brian
Hi Luke,
this is basically a floating point precision issue: the system is trying to subtract two approximately equal numbers (while calculating backoff weights). The new version of the library coming out in a month or so has much improved floating point precision, which will help. In the meantime, you can get this to work by modifying a constant value in src/include/ngram/ngram-model.h, which will allow these two numbers to be judged approximately equal. Look for:
static const double kNormEps = 0.000001;
near the top of that file. Change to 0.0001, then recompile.
This sort of problem usually comes up when you train a model with a relatively small vocabulary (like a phone or POS-tag model) and a relatively large corpus. The n-gram counts end up not following Good-Turing assumptions about what the distribution should look like (hence the odd discount values). In those cases, you're probably better off with Witten-Bell smoothing with --witten_bell_k=15 or something like that, or even trying an unsmoothed model.
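For example, something like this (a sketch; counts.fst is a placeholder for your ngramcount output):

ngrammake --method=witten_bell --witten_bell_k=15 counts.fst > lm.fst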
And stay tuned for the next release, which deals more gracefully with some of these small vocabulary scenarios.
Brian
I generated an ngram model from a .arpa file with the following command:
ngramread --ARPA lm.arpa > lm.model
ngramread does not complain, but both ngraminfo and loading the model from C++ code produce the following error:
FATAL: NGramModel: bad ngram model topology
How can I troubleshoot the problem?
Hi,
that error is coming from a sanity check that verifies that every state in the language model (other than the start and unigram states) is reached by exactly one 'ascending' arc, i.e., an arc that goes from a lower-order to a higher-order state. ARPA format models can diverge from this by, for example, having 'holes' (e.g., bigrams pruned away but trigrams with that bigram as a suffix retained). But ngramread should plug all of those. Maybe duplication? I'll email you about this.
Benoit found a case where certain 'holes' from a pruned ARPA model were not being filled appropriately in the conversion. The sanity-check routines run on loading the model ensured that this anomaly was caught (causing the errors he mentioned), and we were able to find the cases where this was occurring and update the code. The updated conversion functions will be in the forthcoming update release of the library, within the next month or two. In the meantime, if anyone encounters this problem, let me know and I can provide a workaround.
-- CyrilAllauzen - 09 Aug 2012