Philipp Koehn | 1 Aug 08:57 2007

Re: Question about ttable-limit parameters in moses.ini

Hi,

it means that the second ttable (if there is one) has no limit:
all its translation options are used. This is typically the setup
when, for instance, the first translation table maps words and the
second maps POS tags.
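As a quick illustration, here is a minimal sketch (not part of Moses itself) of how such a [ttable-limit] stanza can be read from a moses.ini-style file, treating 0 as "no limit":

```python
# Sketch: collect the values listed under [ttable-limit] in a
# moses.ini-style config. In Moses, 0 means "no limit" (all
# translation options for that ttable are loaded).
def read_ttable_limits(lines):
    limits, in_section = [], False
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            in_section = (line == "[ttable-limit]")
        elif in_section and line and not line.startswith("#"):
            limits.append(int(line))
    return limits

ini = ["# 0 = all elements loaded", "[ttable-limit]", "20", "0"]
print(read_ttable_limits(ini))  # -> [20, 0]
```

With the values from the example below, the first ttable keeps its 20 best options per phrase and the second is unrestricted.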

-phi

On 7/31/07, Jorge Civera Saiz <jorcisai@...> wrote:
> Dear MOSES experts,
>
> I have a short question for you. What is the meaning of the value "0" in the
> file moses.ini (see example below)?
>
> # limit on how many phrase translations e for each phrase f are loaded
> # 0 = all elements loaded
> [ttable-limit]
> 20
> 0
>
> Thanks in advance.
>
> Best regards,
>
> Jorge

Chris Manley | 2 Aug 16:08 2007

Mark-Up Language for Moses?


Hello: I am relatively new to SMT, so forgive my ignorance if the answer is
obvious to most, but in reviewing the technology over the past several
months, has there been any serious discussion regarding the creation of a meta
markup language for data supported by Moses, to assist in improving the
system's capabilities in the syntactic area?

Thanks in advance for your response(s).

-Chris

menor bangget | 3 Aug 03:37 2007

factored training with POS tags

Hello all,

I have some questions that may be silly, but it can't be
helped, so here we go....

  1. I have read the step-by-step instructions to build baseline system on www.statmt.org\wmt07\baseline.html; and it looks easier to do than one on www.statmt.org\moses about factored training.
  • Then I wonder, what's the major difference between the two?
  • Isn't the baseline system good enough to do the translation?
  • Is it true that the baseline system only uses phrase-based translation?


  2. I'm interested in doing an experiment using POS tags for English-Indonesian. But I don't actually know what to do.
  • Can anybody tell me how to do this, so I can follow along easily? I mean the pre-processing, how to get POS-tagged data, etc.

As far as I know, there isn't any POS tagger for Indonesian.
And I've read about TreeTagger, which is language independent
and was said to be easily adaptable to other languages
if a lexicon and a manually tagged training corpus are available.
The problem is I don't know how to get a manually tagged training corpus.
  • Do I really have to tag my whole corpus by hand?
  • How about the tagset? Can anybody suggest a way to do this, or maybe give better advice?
  • Just in case I can't get an Indonesian POS-tagged corpus, can I train my system using only an English tagged corpus?
Best regards.

-Amri

Lefteris Avramidis | 3 Aug 04:28 2007

Re: factored training with POS tags

menor bangget wrote:
Hello all, I have some questions that may be silly, but it can't be helped, so here we go....
  1. I have read the step-by-step instructions to build baseline system on www.statmt.org\wmt07\baseline.html; and it looks easier to follow than the one on www.statmt.org\moses about factored training.
  • Then I wonder, what's the major difference between the two?
I guess the fact that the WMT07 baseline instructions were written more recently and in a more descriptive way. I prefer them over the www.statmt.org\moses ones as well.
  • Isn't the baseline system good enough to do the translation?
It depends on what you mean by "good enough"! I'd say that the baseline translation quality depends on the similarity of the languages you are translating. Most people are happier after adjusting a lot of the decoding/training parameters, or adding some factors as well.
  • Is it true that the baseline system only uses phrase-based translation?
The baseline is the simplest model you can get, and yes, it uses phrase-based translation.

  2. I'm interested in doing an experiment using POS tags for English-Indonesian. But I don't actually know what to do.
  • Can anybody tell me how to do this, so I can follow along easily? I mean the pre-processing, how to get POS-tagged data, etc.
You have to place the factors in the format word|factor in both your training and your test set. Then, follow the instructions given here: http://www.statmt.org/moses/?n=FactoredTraining.FactoredTraining

Practically, this means that you have to decide whether you are using your factors for translation purposes or, say, reordering purposes, and specify the parameters accordingly.

I have run an experiment with POS tags only on the English side, and my factor parameters were
-input-factor-max 1 -alignment-factors 0-0 -translation-factors 0,1-0 -reordering-factors 0-0 -generation-factors 0-0 -decoding-steps t0,g0 

This means that I mapped the translation probabilities as
sourceword+POSfactor->targetword, while all the other parameters just return the same value they get.
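For anyone preparing such an experiment, combining words with POS tags into Moses' word|factor input format can be sketched as follows (a minimal illustration; producing the tags and handling files line by line is up to you):

```python
# Sketch: merge a tokenized sentence with its POS tags into the
# word|factor format that factored training expects. Assumes one
# tag per token; raises if the counts disagree.
def add_factors(word_line, tag_line):
    words, tags = word_line.split(), tag_line.split()
    if len(words) != len(tags):
        raise ValueError("token/tag count mismatch")
    return " ".join(w + "|" + t for w, t in zip(words, tags))

print(add_factors("resumption of the session", "NN IN DT NN"))
# -> resumption|NN of|IN the|DT session|NN
```

Applied line by line over a tokenized corpus and its tag file, this produces the factored training input described in the instructions linked above.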


    As far as I know, there isn't any POS tagger for Indonesian. And I've read about TreeTagger, which is language independent and was said to be easily adaptable to other languages if a lexicon and a manually tagged training corpus are available. The problem is I don't know how to get a manually tagged training corpus.
    • Do I really have to tag all my corpus by hand?
    I can't say much about the system you mentioned above. I suppose you have to study a bit on how to construct a POS tagger for the desired language. Then you will have to manually tag as much of your POS-tagger training corpus as is needed to build the POS tagger. Then you can use the POS tagger you constructed to automatically tag your translation training data.
    POS-tagger training data and translation training data don't need to be the same (or have the same size).

      • How about the tagset? Can anybody give me a way to do this, or  maybe better advice?
      • Just in case I can't get Indonesian POS tagged corpus, can I train my system only using English tagged corpus?
      Yes, but better results are not guaranteed. Try it, though.
        Best regards. -Amri

        _______________________________________________
        Moses-support mailing list
        Moses-support-3s7WtUTddSA@public.gmane.org
        http://mailman.mit.edu/mailman/listinfo/moses-support

        Philipp Koehn | 3 Aug 08:33 2007

        Re: Mark-Up Language for Moses?

        Hi Chris,
        
        I am not quite sure what you are trying to propose here.
        The way we address syntactic annotation is
        the factored model, and the way we use markup
        language is to override the decoder's choices for
        specific phrases.
        
        -phi
        
        
        
        Stig Alvestad | 13 Aug 10:55 2007

        Lexical tables with lots of NULL

        Hi !

         

        I recently completed my second training run of a new model, and discovered some differences compared to the first time. When running the training script train-factored-phrase-model, I got lots of messages: first there were many "alignment point out of range ..." warnings, then lots of "Use of uninitialized value in scalar ...", and finally warnings like the ones below (I'm translating from English to Norwegian, using 3-gram LMs).

         

        WARNING: sentence 1200 has alignment point (4, 0) out of bounds (4, 4)

        E: annen operasjon p?? urethra

        F: other operation on urethra

        WARNING: sentence 1201 has alignment point (10, 0) out of bounds (8, 8)

        E: perkutan drenasje av pseudocyste eller abscess i pancreas

        F: percutaneous drainage of pseudocyst or abscess of pancreas

        WARNING: sentence 1202 has alignment point (10, 0) out of bounds (8, 7)

        E: lukking av endeenterostomi med anastomose til colon

        F: closure of terminal enterostomy with anastomosis to colon

        WARNING: sentence 1209 has alignment point (5, 0) out of bounds (5, 4)

        E: andre spesifiserte kvinnelige kj??nnsorganer

        F: other specified female genital organs

         

        I didn't get any errors; it all terminated nicely. But looking at the lex files in the newly constructed model, almost all entries (ca. 90%) in the two lex files are like this: "NULL educational 1.0000000", "NULL plumbing 1.0000000", "NULL reformere 1.0000000", "NULL renskrivning 1.0000000", "beordre NULL 0.0000058", "skyfle NULL 0.0000058". When I did the first training, my lex files had almost no such entries with NULL, so the difference is huge.

         

        The only thing I did differently this time was to use several LMs when running the training script, but that should be OK. The data is sentence-aligned, but considerably larger than the first time.

         

        Below is the command I used to execute the training script with parameters:

         

        bin/moses-scripts/scripts-20070717-1342/training/train-factored-phrase-model.perl -model-dir /home/stig/wsDirMoses/model/2opptrening13aug/ -scripts-root-dir /home/stig/wsDirMoses/bin/moses-scripts/scripts-20070717-1342 -root-dir /home/stig/wsDirMoses -corpus /home/stig/wsDirMoses/corpus/opptrening12aug/alleKodeverk.lowercased -f en -e no -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:3:/home/stig/wsDirMoses/lm/opptrening12aug/generell.lm:0 -lm 0:3:/home/stig/wsDirMoses/lm/opptrening12aug/icd10.lm:0 -lm 0:3:/home/stig/wsDirMoses/lm/opptrening12aug/ICFtitler.lm:0 -lm 0:3:/home/stig/wsDirMoses/lm/opptrening12aug/ncsp.lm:0

         

        If anyone has any idea why I get so many NULLs in my lexical tables, and why I see all those messages during training, I'd be happy to know.

         

        Stig Alvestad

        Philipp Koehn | 14 Aug 06:44 2007
        Picon
        Picon

        Re: Lexical tables with lots of NULL

        Hi,
        
        this is indeed odd, and should be examined more closely.
        
        This already seems strange:
        > WARNING: sentence 1200 has alignment point (4, 0) out of bounds (4, 4)
        >
        > E: annen operasjon p?? urethra
        >
        > F: other operation on urethra
        
        How is it possible that there is an alignment between the 5th and the 1st
        word (note: 4 and 0 is computer-science counting), if each sentence only
        has 4 words?
        
        What do the GIZA++ alignments look like for this sentence?
        
        Why is this alignment point created?
        
        I wonder, if a surplus of space-like characters causes some havoc here.
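One way to hunt for such characters is a small check like the following (a sketch; the suspect list is illustrative, not exhaustive):

```python
# Sketch: flag "space-like" characters that survive tokenization and
# can shift word counts -- e.g. no-break spaces pasted in from Windows
# editors. Extend SUSPECT as needed; this list is only illustrative.
SUSPECT = {"\u00a0": "NO-BREAK SPACE", "\u200b": "ZERO WIDTH SPACE", "\t": "TAB"}

def find_odd_whitespace(line):
    return [(i, SUSPECT[ch]) for i, ch in enumerate(line) if ch in SUSPECT]

print(find_odd_whitespace("annen operasjon p\u00a0 urethra"))
# -> [(17, 'NO-BREAK SPACE')]
```

Run over the training corpus, any non-empty result points at a line whose token count may not match what GIZA++ sees.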
        
        -phi
        
        
        Stig Alvestad | 14 Aug 16:44 2007

        Re: Lexical tables with lots of NULL

        
        I have to admit I'm uncertain where the GIZA++ alignments can be found. Could you point me in the right
        direction, and I'll look them up.
        
        After you mentioned your suspicion of unwanted extra characters, I've been checking the file encodings
        and file formats of the input files extra carefully. Since I'm making the input files in Windows and
        copying them to my Linux hard drive, there are many things that can go wrong. I've used TextPad (in
        Windows) to convert the files into UTF-8 encoding and Unix file format, and Vim (in Linux) to verify that
        those properties still hold before training. But I still got the same results.
        
        Just to sum up my understanding of the requirements for the input files used for training:
        -character encoding: UTF-8
        -file format: Unix
        -corpus-file organization: one sentence per line, no trailing spaces
        -no blank lines
        -everything lowercased
        -sentences max 100 words (I've used max 40)
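Those requirements can be checked mechanically; a minimal sketch, assuming the list above is the full set of constraints:

```python
# Sketch: sanity-check corpus lines against the requirements above:
# no blank lines, no leading/trailing whitespace, everything
# lowercased, at most max_words words per sentence.
def check_corpus(lines, max_words=100):
    problems = []
    for i, raw in enumerate(lines, 1):
        sent = raw.rstrip("\n")
        if not sent.strip():
            problems.append((i, "blank line"))
            continue
        if sent != sent.strip():
            problems.append((i, "leading/trailing whitespace"))
        if sent != sent.lower():
            problems.append((i, "uppercase characters"))
        if len(sent.split()) > max_words:
            problems.append((i, "too many words"))
    return problems

print(check_corpus(["cannabinose\n", "Uspesifisert trakom \n"]))
# -> [(2, 'leading/trailing whitespace'), (2, 'uppercase characters')]
```

An empty result means the file passes these particular checks; it says nothing about encoding, which still has to be verified separately.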
        
        Stig
        
        
        
        Chris Callison-Burch | 14 Aug 20:57 2007

        Re: Lexical tables with lots of NULL

        Hi Stig,
        
        The GIZA++ alignment files that you're looking for are the ones with
        the .A3.final.gz extension; they are located in the giza.fr-en/ and
        giza.en-fr/ subdirectories that are created by the
        train-factored-models.perl script.
        
        They look like this:
        
        % zcat giza.en-fr/en-fr.A3.final.gz | head -3
        # Sentence pair (1) source length 4 target length 4 alignment score : 0.000408665
        resumption of the session
        NULL ({ }) reprise ({ 1 }) de ({ 2 }) la ({ 3 }) session ({ 4 })
        
        % zcat giza.fr-en/fr-en.A3.final.gz | head -3
        # Sentence pair (1) source length 4 target length 4 alignment score : 0.00500671
        reprise de la session
        NULL ({ }) resumption ({ 1 }) of ({ 2 }) the ({ 3 }) session ({ 4 })
        
        where the numbers in the double parentheses correspond to the other
        language's word indices. You should be able to see whether you're having
        problems with space characters as words by inspecting these files.
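To automate that inspection, the ({ ... }) groups can be pulled apart with a small parser (a sketch, assuming the line layout shown above):

```python
import re

# Sketch: parse a GIZA++ A3-style alignment line such as
#   NULL ({ }) reprise ({ 1 }) de ({ 2 }) ...
# into (word, [aligned indices]) pairs, so out-of-range indices or
# suspicious empty "words" are easy to spot programmatically.
def parse_a3(line):
    pairs = re.findall(r"(\S+) \(\{ ([\d ]*)\}\)", line)
    return [(w, [int(i) for i in idx.split()]) for w, idx in pairs]

line = "NULL ({ }) reprise ({ 1 }) de ({ 2 }) la ({ 3 }) session ({ 4 })"
print(parse_a3(line))
# -> [('NULL', []), ('reprise', [1]), ('de', [2]), ('la', [3]), ('session', [4])]
```

Comparing the maximum index against the stated source/target lengths then flags exactly the out-of-bounds points the training warnings complain about.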
        
        If you want a better way of converting files to UTF-8, you should
        check out the "iconv" program, which is available in most Linux
        distributions. Example usage for converting from Windows Latin-1
        encoding to UTF-8:
        
        iconv -f Latin1 -t utf8 inputfile > outputfile
        
        Hope this helps.
        
        Chris C-B
        
        
        
        Stig Alvestad | 15 Aug 10:37 2007

        Re: Lexical tables with lots of NULL

        
        I took a look at the GIZA++ alignment files, and it was no wonder I got lots of warnings about alignment.
        First of all: I ran the training script yesterday (14th Aug), but not all files in giza.no-en and
        giza.en-no were updated (I just show giza.en-no, but it was the same in giza.no-en):
        
        -en-no.A3.final.gz	last modified 18th of July (first time I trained with Moses)
        -en-no.cooc		last modified 14th of August
        -en-no.gizacfg	last modified 18th of July
        
        As could be expected, the contents of the final.gz files corresponded to the training files I used the
        first time (18th July) and NOT the input files I had used yesterday.
        
        I moved the giza.en-no and giza.no-en folders out of my working directory, hoping that this would motivate
        GIZA++ to create new and updated files the next time.
        
        When running the training process this time, I got the same messages during training and lex files still full
        of NULLs, although GIZA++ created new folders with updated alignment info. But when comparing the contents of
        A3.final.gz with the training corpus and the warnings, I realized that this did not fix my problem. See the
        two examples below:
        
        Line 1232 - 1233, in training corpus Norwegian:
        cannabinose
        uspesifisert trakom
        
        Line 1232 - 1233, in training corpus English:
        cannabinosis
        trachoma , unspecified
        
        Warnings during training:
        WARNING: sentence 1232 has alignment point (8, 0) out of bounds (1, 1)
        E: cannabinose
        F: cannabinosis
        WARNING: sentence 1233 has alignment point (1, 2) out of bounds (3, 2)
        E: uspesifisert trakom
        F: trachoma , unspecified
        
        In giza.no-en/no-en.A3.final.gz
        # Sentence pair (1232) source length 10 target length 10 alignment score : 5.7549e-14
        og fra ikke gå ved ! på dislokasjon i styrke 
        NULL ({ }) from ({ 2 }) out ({ 3 }) disorders ({ 4 5 }) a ({ }) deep ({ 6 7 8 10 }) and ({ 9 }) into ({ }) or ({ }) of ({ }) in ({
        1 }) 
        
        # Sentence pair (1233) source length 6 target length 4 alignment score : 0.000104882
        gjøre fremre emboli fra 
        NULL ({ }) one ({ 1 }) embolism ({ 2 3 }) from ({ 4 }) out ({ }) disorders ({ }) congenital ({ }) 
        
        In giza.en-no/en-no.A3.final.gz
        # Sentence pair (1232) source length 10 target length 10 alignment score : 5.75612e-15
        from out disorders a deep and into or of in 
        NULL ({ }) og ({ 9 10 }) fra ({ 1 }) ikke ({ 2 }) gå ({ }) ved ({ }) ! ({ }) på ({ }) dislokasjon ({ 3 }) i ({ 6 }) styrke ({
        4 5 7 8 }) 
        
        # Sentence pair (1233) source length 4 target length 6 alignment score : 1.4427e-06
        one embolism from out disorders congenital 
        NULL ({ }) gjøre ({ 1 }) fremre ({ 2 }) emboli ({ 4 5 6 }) fra ({ 3 }) 
        
        The sentences in the alignment files are completely different from those in the input files. I've searched
        my training files, but these sentences don't occur there at all. And the sentences in the alignment files
        don't make any sense; it seems to me that they have been pieced together randomly.
        
        I'll go through the process of making the input files once more, and try the iconv program suggested by Chris
        for the conversion. If that doesn't work, I feel a reinstall of Moses could be worth a try.
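Before reinstalling, it may be worth guarding against the stale-output problem directly; a sketch (the paths shown are hypothetical examples):

```python
import os

# Sketch: detect the failure mode described above -- GIZA++ output
# left over from an earlier run that is older than the current
# training corpus, so a rerun silently reuses stale alignments.
def is_stale(output_path, corpus_path):
    return os.path.getmtime(output_path) < os.path.getmtime(corpus_path)

# e.g. is_stale("giza.en-no/en-no.A3.final.gz",
#               "corpus/alleKodeverk.lowercased.en")
```

If this returns True for any .A3.final.gz, deleting (or moving away) the giza.* directories before retraining forces fresh alignments.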
        
        Thanks to Chris and Philipp for the help so far.
        
        Stig
        
        
        
