Is it possible to do lookahead in the Thrax grm files? For example, require at least one digit, one lowercase, and one uppercase as in regex below:
( (?=.*\d) (?=.*[a-z]) (?=.*[A-Z]) .{6,20} )
Thanks
I'm not sure what you are trying to do, but you may just want to use a CDRewrite rule, which allows you to change one regexp to another in the context of two other regexps that are not considered part of the first two regexps.
Regex lookahead is not implemented per se. But as far as I can tell, CDRewrite provides all of the functionality that regexp lookahead is used for in PCREs. If you want to detect a regular expression in the context of another regular expression, and know that you have detected it, an easy way is to write a CDRewrite rule that inserts some marker after (or before) the first regular expression whenever it occurs in the context of the second regexp. This gives you all the functionality that the PCRE lookahead would give you.
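To make the point about lookahead concrete: each lookahead in the pattern from the question is a "contains" condition, and each such condition is itself a regular language, so the overall set is just the intersection of four lookahead-free patterns. The sketch below checks that equivalence in Python rather than in Thrax; the function names are made up for illustration, and it says nothing about how best to write the constraint in a grm file.

```python
import re

# The lookahead pattern from the question, applied to the whole string.
LOOKAHEAD = re.compile(r"(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{6,20}")

# The same conditions written as separate lookahead-free patterns.
HAS_DIGIT = re.compile(r".*\d.*")
HAS_LOWER = re.compile(r".*[a-z].*")
HAS_UPPER = re.compile(r".*[A-Z].*")
LENGTH_OK = re.compile(r".{6,20}")


def matches_with_lookahead(s):
    """True if the whole string matches the lookahead pattern."""
    return LOOKAHEAD.fullmatch(s) is not None


def matches_by_intersection(s):
    """True if the whole string matches every lookahead-free pattern."""
    patterns = (HAS_DIGIT, HAS_LOWER, HAS_UPPER, LENGTH_OK)
    return all(p.fullmatch(s) is not None for p in patterns)
```

Both functions should agree on any input, which is the sense in which lookahead adds convenience but no extra expressive power for this kind of constraint.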
Hi all,
I am new to Thrax and OpenFst and I would appreciate it a lot if you could help me with the following issue. I need to use my own symbol table with a PDT or to be able to extract the symbol table in a non-binary format. So far I was not able to do so as the fst extracted from my far has an empty symbol table.
Let me show you how I worked:
1. I created a grammar that covers the digits zero to nine, and I took the symbol table from, let's say, another FST.
numbers_en_US.grm
# Numbers simple grammar for en-US.
# Covers numbers 0 to 9
my_symbol_table=SymbolTable['numbers.txt'];
export PARENS = ("[<s>]" : "[</s>]");
space = " " ;
units = Optimize
[
("zero".my_symbol_table) |
("one".my_symbol_table) |
("two".my_symbol_table) |
("three".my_symbol_table) |
("four".my_symbol_table) |
("five".my_symbol_table) |
("six".my_symbol_table) |
("seven".my_symbol_table) |
("eight".my_symbol_table) |
("nine".my_symbol_table)
];
export NUMBERS = ("[<s>]" (units space)* units "[</s>]")* ;
numbers.txt
eight 0
extra1 1
extra2 2
<eps> 3
five 4
four 5
nine 6
one 7
</s> 8
<s> 9
seven 10
six 11
three 12
two 13
zero 14
2. Then I compiled my grammar, extracted the fst from the far and checked the fst info:
$ fstinfo NUMBERS
fst type vector
arc type standard
input symbol table none
output symbol table none
# of states 12
# of arcs 32
initial state 11
...
3. So as the symbol table is empty, when I test, it is impossible to get rewrites:
$ thraxrewrite-tester --far=numbers_en_US.far --rules=NUMBERS\$PARENS --output_mode=numbers.txt
Input string: one
Rewrite failed.
$ thraxrewrite-tester --far=numbers_en_US.far --rules=NUMBERS\$PARENS
Input string: one
Rewrite failed.
So, any ideas on how to use my symbol table? Or even how to get the internal symbol table in a non-binary format?
Thanks,
Sofia
The symbols generated for the PARENS will be in the FST named *StringFstSymbolTable, which you will see if you do a farextract on the far.
But it looks as if you are assuming two symbol tables here, one being your own, the other being the one that will be generated for those extended labels. I think what you want to do is something like this:
export PARENS = ("<s>".my_symbol_table : "</s>".my_symbol_table);
Then you need to run the compiler with the --save_symbols flag. Finally you will need to use the --input_mode and probably the --output_mode flags to thraxrewrite-tester with the argument being your symbol table.
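As a quick sanity check before recompiling, it can help to confirm that every word used in the grammar actually has an entry in the text symbol table. The following is a minimal Python sketch, assuming the table is plain text with one "symbol id" pair per line as in numbers.txt above; the helper names are hypothetical and not part of Thrax or OpenFst.

```python
# Contents of numbers.txt as posted above.
NUMBERS_TXT = """\
eight 0
extra1 1
extra2 2
<eps> 3
five 4
four 5
nine 6
one 7
</s> 8
<s> 9
seven 10
six 11
three 12
two 13
zero 14
"""


def parse_symbol_table(text):
    """Parse 'symbol id' lines into a dict; blank lines are skipped."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        table[parts[0]] = int(parts[1])
    return table


def missing_symbols(table, words):
    """Return the words that have no entry in the table, in input order."""
    return [w for w in words if w not in table]


def symbols_with_label(table, label):
    """Return every symbol that is mapped to the given integer label."""
    return [s for s, l in table.items() if l == label]
```

For the posted table, `symbols_with_label(parse_symbol_table(NUMBERS_TXT), 0)` returns the word "eight". OpenFst treats label 0 as epsilon, so a table that assigns 0 to an ordinary word rather than to `<eps>` may be worth double-checking.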
If that still doesn't work, can you send me (rws@google.com) the complete set of files needed to build your target, and I will have a look.
--R
Hi Richard, I followed your advice, but the .far I get with my symbol table is completely different from the one without it. That is expected, but "initial state 0" worries me, for example. I will send you my set of files so you can get an idea.
Hi,
I downloaded openfst 1.4.1 and opengrm-ngram 1.2.1, but the latter won't compile on openSUSE 13.1.
./configure says "configure: error: fst/extensions/far/far.h header not found"
However, I find this file at /home/roger/sphinx/openfst-1.4.1/src/include/fst/extensions/far/far.h
Compilation and installation of openfst were successful (as far as I can tell so far).
Do I need to add this path or header file somewhere?
Thanks
Roger
Hi all, nice to meet you!
Let me introduce myself, as I am new here. My name is Alexis and I am a computational linguist and software developer. I was very excited to discover the Thrax framework, and after a short investigation I decided this was my thing. I immediately started digging into it, but unfortunately I was not able to find "real-world" examples of usage, which would have simplified my task.
However, I just kept going on. I have been working for Yandex and developing a rule-based system for generating Russian phonetic transcriptions (in the context of speech synthesis). My company has been very generous and allowed me to open source the rules I wrote.
I probably do not even use half of the power of Thrax, but I managed to write a working rule-based system just by sticking to the basics. I thought this could be useful for someone else (as it would have been for me at the beginning), which is why I am posting about the rules here. Please take into account that this was my first try with Thrax and that I could probably have written the rules in a much better way if I had had more knowledge.
In case someone is interested, you will find them here: https://github.com/wilpert/RusPhonetizer/tree/master/grammars
Thrax was a wonderfully powerful and easy to use framework for my work, something I did not experience before. I am utterly thankful to the authors for their amazing achievement. And to Yandex for allowing me to share my work.
Thanks to you all and be happy
Alexis
Hi Alexis:
Glad it has proved useful to you. Yeah, there are various toy examples around, but not many "real world" examples that I know of that are public, at least not yet.
I'll be happy to take a look sometime at your grammars and send along suggestions if I have any.
Richard Sproat
Hi Richard,
yes, it would be great if you would find any time to have a look at my grammars, any feedback would be terribly appreciated!
Thanks again for the software,
Alexis
I am trying to compile Thrax in an Ubuntu VM using VirtualBox. I have gcc 4.8.2 installed and compiled openfst with far and pdt enabled and in shared mode. I have 1 GB of RAM dedicated to the VM. If I try ./configure --enable-shared, it fails because I run out of memory. If I try just ./configure and then make, everything seems to compile OK until I get an internal compiler error:
/bin/bash ../../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I./../include -std=c++0x -MT loader.lo -MD -MP -MF .deps/loader.Tpo -c -o loader.lo `test -f 'walker/loader.cc' || echo './'`walker/loader.cc
libtool: compile: g++ -DHAVE_CONFIG_H -I./../include -std=c++0x -MT loader.lo -MD -MP -MF .deps/loader.Tpo -c walker/loader.cc -fPIC -DPIC -o .libs/loader.o
g++: internal compiler error: Killed (program cc1plus)
Try commenting out the lines that refer to Log64Arc in src/include/thrax/function.h, viz
function.h:70:extern Registry<Function<fst::Log64Arc>* > kLog64ArcRegistry;
function.h:87: typedef name<fst::LogArc> Log64Arc ## name;
function.h:88: REGISTER_LOGARC_FUNCTION(Log64Arc ## name)
(Obviously be careful in that #define REGISTER_GRM_FUNCTION to leave the continuation "\"s all happy.)
The downside is you won't get log64 arcs. The upside is it should be smaller. The fact that it's running out of memory in compiling the loader makes me suspect that may be the problem because for each of the different arc types, all of the templated classes have to be expanded. This should reduce the size, therefore. If that still doesn't work, remove log arcs too. You won't likely be using them. Indeed, for precisely these sorts of issues I have been thinking of disabling those in future versions.
I did that and also had to comment out similar lines in src/lib/walker/evaluator-specialization.cc (lines 35 and 49-53).
I also tried taking out LogArc and all its mentions in function.h and evaluator-specialization.cc, but I still get an internal compiler error.
Ok thanks.
So the question is why you aren't getting that by inheritance. This is the first time I've seen this problem and I have no idea where it has suddenly broken.
Hi Richard (etc.), using Thrax 1.1.0 (and with OpenFst 1.3.4 already installed), compilation fails while making the file `ast/identifier-node.cc` due to an issue in the `include/thrax/compat/utils.h` header. Here's the error:
/bin/sh ../../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I./../include -g -O2 -MT identifier-node.lo -MD -MP -MF .deps/identifier-node.Tpo -c -o identifier-node.lo `test -f 'ast/identifier-node.cc' || echo './'`ast/identifier-node.cc
libtool: compile: g++ -DHAVE_CONFIG_H -I./../include -g -O2 -MT identifier-node.lo -MD -MP -MF .deps/identifier-node.Tpo -c ast/identifier-node.cc -fno-common -DPIC -o .libs/identifier-node.o
In file included from ast/identifier-node.cc:22:
./../include/thrax/compat/utils.h:119:8: error: field has incomplete type
'char []'
char buf[];
^
I presume this is because buf[] doesn't have a length defined (nor is it initialized with a string), and when I change the line to
char buf[1024];
compilation goes through. (I'm not sure this is a sensible default; I spent no time trying to understand what this code is doing.)
I'd include a patch but it's one line.
Kyle
Just remove that line: that variable is not used. Apparently it's a holdover from some earlier implementation, and I just forgot to update it. I'll fix this in the next release.
Hi,
I am currently using Thrax to extend some features of an alignment tool I wrote for my g2p system.
The basic idea is that the user can specify some alignment correspondence rules and optional default penalties, and then these can be incorporated into the EM training process.
At present I have kind of hacked the functionality of the thraxcompiler command tool to read in the grammar, and then return the desired FST+symbol table to the alignment program.
EDIT: Maybe it makes more sense to just provide a couple of snippets:
GetFstFromGrammar
sy = SymbolTable['simple.syms'];
zero = "0".sy : "zero".sy;
units = ( "these're".sy : ( "these're".sy | "[these]" | "[these]" "are".sy ) );
split = ( "[these]" "are".sy : "these're".sy );
sigma = "<sigma>".sy : "<sigma>".sy;
abc = ( "a".sy "b c".sy : "a b b".sy );
export RULES = Optimize[ sigma* ( units | zero | abc ) sigma* ];
Here the 'sigma' is used in combination with a specialized 1-state alignment transducer that relies on RHO and SIGMA matchers.
Is there an alternative or recommended way to do this? It would be great if I could either specify the symbol table just once at the beginning, or automatically infer/generate the whole symbol table and return it. Even better would be to modify the grammar from my C++ application, to simplify what the user is responsible for doing.
I went through the FAQ but did not notice any answers to these questions.
Thanks for your time.
UPDATE:
I solved this by creating some bindings with pybindgen and then writing a generator that interprets a simplified version of the Thrax grammar, then expands it to the verbose version with the extra quotes, symfile suffixes, etc.
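The generator itself is not shown in this thread, so the following is only a guess at the general idea: a minimal Python sketch that takes a rule in a hypothetical simplified syntax (bare whitespace-separated tokens, with `:`, `|`, `(`, `)` and `*` passed through as operators) and emits the verbose form with quotes and the symbol table suffix. The function name, the simplified syntax, and the default table name `sy` are all assumptions made for illustration.

```python
# Tokens that are copied through unchanged (hypothetical simplified syntax).
OPERATORS = {":", "|", "(", ")", "*"}


def expand_rule(name, body, table="sy"):
    """Expand bare tokens in `body` to quoted strings with a table suffix.

    Multi-word symbols such as "b c" and tokens that need escaping are
    not handled by this sketch.
    """
    out = []
    for tok in body.split():
        if tok in OPERATORS:
            out.append(tok)
        else:
            out.append('"%s".%s' % (tok, table))
    return "%s = %s;" % (name, " ".join(out))
```

For example, `expand_rule("zero", "0 : zero")` produces the same text as the `zero` line in the grammar snippet above, so the user only has to name the symbol table once.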
Yes (openfst 1.3.4 compiled with --enable-far and some other enable options), Thrax compiled successfully, but compilation fails while making the file `batch_test.c` (extracted from export.tgz). Can you give me some advice?
I'd like to but first I need to understand what is going on. I can't reproduce your error (apparently) and I don't know what batch_test.c is since it's not part of the Thrax distribution. Is this your own code? If so then I need to see EXACTLY what you are doing, including probably your sending me a directory with all of the additional code.
If this is part of the Thrax distribution then please tell me where it is because I can't find it (nor do I remember such a file).
Thank you for your reply.
in this page:
http://openfst.cs.nyu.edu/twiki/bin/view/Contrib/ThraxContrib,
you can see
Projects using the OpenGrm Thrax tools:
export.tgz: Grammars and software developed as part of a text normalization class taught at the Center for Spoken Language Understanding, Fall 2011. URL for the course: http://www.cslu.ogi.edu/~sproatr/Courses/TextNorm/
I downloaded "export.tgz".
There is a file called batch_tester.cc in the batch_tester directory (extracted from export.tgz).
Ok that helps. Yes, I did write that, but it wasn't obvious from your query that this is what you were referring to. Please in future give all necessary information when reporting a bug.
In the meantime I will have a look. I do not know off the top of my head what the problem is.
Ok it's the usual nonsense about ordering of shared object libraries. If you do things in this order it should work:
g++ -g -O2 -o batch_tester batch_tester.o -L/usr/local/lib/fst -lm -ldl -lfst -lthrax -Wl,--rpath -Wl,/usr/local/lib/fst -Wl,--rpath -Wl,/usr/local/lib/fst -Wl,--rpath -Wl,/usr/local/lib /usr/local/lib/fst/libfstfar.so
Evidently there is a bug in the configuration of the distribution that was not causing problems before, but is now. I will look into that, but in the meantime, please try linking manually as above.
So far I find thrax a very neat piece of software but I have two questions...
Can I somehow use probability semiring as weights, because it seems Thrax only allows specifying log and tropical semirings? How about the other ones... Or should I somehow postprocess the generated far file?
Another question: I tried to use "fstdraw" on a far file, but got: ERROR: FstHeader::Read: Bad FST header: example.far
Is this a version mismatch?
Sorry, I missed the earlier comment -- for some reason I didn't get email about it.
Unfortunately the restriction to Log and Tropical is due to a similar restriction in the fst library: the real semiring does not come predefined. The best suggestion would be to use Tropical and then just do the obvious e^-cost conversion.
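The e^-cost conversion mentioned above is a one-liner in either direction. Here is a small Python sketch of it, assuming the weights are negative natural-log probabilities; the function names are made up for illustration.

```python
import math


def cost_to_prob(cost):
    """Convert a negative-log cost into a probability: p = e^(-cost)."""
    return math.exp(-cost)


def prob_to_cost(prob):
    """Convert a probability into a negative-log cost: cost = -ln(p)."""
    if prob == 0.0:
        return math.inf  # an impossible event has infinite cost
    return -math.log(prob)


def path_prob(costs):
    """Probability of one path: costs add where probabilities multiply."""
    return cost_to_prob(sum(costs))
```

Note that this only reinterprets the weights along a single path. In the Tropical semiring, alternative paths are combined by taking the minimum cost, so the result is a best-path score rather than a sum of probabilities over all paths.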