Paltonio Daun Fraga | 17 Aug 17:46 2009

ANOTHER APPROACH

Dear Sirs,

I have seen your Snowball project, which covers many human languages.

I would like to propose another approach, based on the GETA project (now GETALP) in Grenoble, France.

It is based on dictionaries of radicals (stems), ambiguous or not, with classifications (regular, irregular) that invoke subgrammars recursively, and on a non-deterministic finite automaton (Hopcroft & Ullman), plus endings (prefixes or suffixes) for word decomposition (segmentation) carrying morpho-syntactic information such as person, tense, number, and so on.

With this you need only one algorithm (ATEF at GETA; other variants exist, such as SYGMART, by Jacques Chauché, at www.sygtext.fr).

The GETA model for machine translation (Bernard Vauquois) was based on several levels of processing.
The lowest level is a morpho-syntactic analyser based on ATEF (or ANMOR, which I programmed in FORTRAN); its output is a decorated tree (arborescence décorée) carrying morphological, syntactic and semantic features that guide ambiguity resolution.

The next level is a syntactic analyser based on a tree transducer (CETA, then ARBOR, and lately ROBRA, a recursive tree pattern matcher and transformer guided by recursive subgrammars, roughly equivalent to a Turing machine but with a reduced risk of non-termination). Its output is a tree structure holding the deep analysis of the text.
ROBRA is due to Christian Boitet and dates from the 1970s.

The highest level is a transfer level that maps the structures and lexical units to the target language(s), followed by a phrase synthesiser and, once more, a deterministic automaton, called SYGMOR, which recomposes the words according to the output language model.
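As a very rough illustration of the radical-dictionary-plus-endings idea: the word is split at every point where a known stem meets a known ending, and all splits are kept, as a non-deterministic automaton would. This is only a toy sketch; the stem set, suffix table and class name are invented here and bear no relation to GETA's real ATEF data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class AffixSegmenter {
    // Hypothetical toy dictionaries (Portuguese-flavoured):
    static final Set<String> STEMS = Set.of("cant", "com", "fal");
    static final Map<String, String> SUFFIXES = Map.of(
        "o",    "1st person singular, present",
        "amos", "1st person plural, present",
        "ar",   "infinitive");

    // Return every stem + suffix split licensed by the dictionaries.
    // Like a non-deterministic automaton, every path is kept; choosing
    // among ambiguous analyses is left to the higher (syntactic) levels.
    static List<String> analyses(String word) {
        List<String> out = new ArrayList<>();
        for (int i = 1; i < word.length(); i++) {
            String stem = word.substring(0, i);
            String end = word.substring(i);
            if (STEMS.contains(stem) && SUFFIXES.containsKey(end))
                out.add(stem + " + -" + end + " (" + SUFFIXES.get(end) + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyses("cantamos"));
    }
}
```

The "decorations" here are just the feature strings attached to each suffix; ATEF additionally lets subgrammars accept or reject each candidate path.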

I will send you more details soon.
Yours sincerely,

Paltonio Fraga
FFLCH-USP, São Paulo, BR


_______________________________________________
Snowball-discuss mailing list
Snowball-discuss <at> lists.tartarus.org
http://lists.tartarus.org/mailman/listinfo/snowball-discuss
Paltonio Daun Fraga | 17 Aug 17:58 2009

Re: ANOTHER APPROACH

Here is a link to a simplified article:


 http://paltonio.sites.uol.com.br/AN-PORTU.HTM


Also look for BOITET, Christian, at http://www.mt-archive.info/authors-B.htm

--
May God guide us. May Jesus protect us and enlighten us.
Leonardo Borges | 19 Aug 22:31 2009

Doubt about the portuguese stem

Hello guys,

I am currently evaluating Sphinx as an option for my projects and, since I am Brazilian, I wanted to try the Portuguese stemmer you provide.

I compiled Sphinx with the libstemmer option and everything went fine.

One of my documents contains the phrase: "Então, vamos começar a usar libstemmer"

The following searches return the correct document:
"Então", "Entã", "então", "entã"
which is great, but if I search for:
"Entao"
it returns nothing.

Since I haven't dug into the algorithm: is this the expected behaviour? If so, is the way to accomplish what I want to remove the accents myself? Or perhaps you have other suggestions?

Thanks a lot,
Leonardo Borges
www.leonardoborges.com

Grant Ingersoll | 20 Aug 15:39 2009

Re: Doubt about the portuguese stem

I did the following in Lucene, using the Snowball Portuguese stemmer:

PortugueseStemmer stemmer = new PortugueseStemmer();
stemmer.setCurrent("então");
stemmer.stem();
System.out.println("Stem: " + stemmer.getCurrent());

The output is:
Stem: entã

I can't speak for Sphinx, but in Lucene/Solr (http://lucene.apache.org/solr) the common step (either before or after stemming) is to strip accents, so that entã becomes enta. That lets both users who know the correct accents and those who don't get matches. In short, the problem isn't in Snowball; it is either in your setup of Sphinx, or in Sphinx itself if it doesn't let you strip accents.
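For illustration, accent stripping on its own can be done with the JDK's Unicode normalizer. This is a standalone sketch, independent of whatever folding filters Solr or Sphinx provide:

```java
import java.text.Normalizer;

public class StripAccents {
    static String stripAccents(String s) {
        // NFD decomposes "ã" into "a" plus a combining tilde (U+0303);
        // removing the combining marks (\p{M}) then leaves plain "a".
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                         .replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("entã"));  // prints "enta"
    }
}
```

Applied after stemming, "então" stems to "entã", which folds to "enta", so the queries "então", "entao", "entã" and "enta" all meet at the same index term.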

Hope this helps,
Grant

--------------------------
Grant Ingersoll

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:

Leonardo Borges | 20 Aug 15:43 2009

Re: Doubt about the portuguese stem

Grant,

Thanks for the feedback.

Yeah, I just wanted to confirm that this is how the algorithm is supposed to work. I'll have to find a way to strip accents using Sphinx.

Regards,
Leonardo Borges
www.leonardoborges.com

Martin Porter | 21 Aug 12:00 2009

Re: Doubt about the portuguese stem


Leonardo,

Hello. Yes, this point often comes up (well, not that often; perhaps once
every two years). You can search the gmane archive of Snowball-discuss,
looking for "andrew green". In summary, here is a note I wrote on 19 May 2007
on the subject of accents in Spanish and the problem of their omission:

The occasion when I came across this problem before was news data in
Spanish where the placing of accents was very untrustworthy. There is a
variant of the Snowball German stemmer in which umlaut is represented by
following e, but there are no variants for the Romance language
stemmers.

I'm not sure what the deal is for Portuguese, but Spanish is as you
describe it. In French, the application of accents is quite rigorously
applied, except that they can be omitted when the text is entirely in
upper case. (But is that stylistic feature less prevalent than it was a
century ago? I'm not sure ...) Anyway, keeping accents in place with
French does not seem to be problematic.

Italian presents an interesting case. They use acute and grave, but not
by any consistent rule. There are different schemes for how acute/grave
is applied, which varies (or used to vary) among publishing houses. This
is why the Italian stemmer begins with the strange operation of
replacing all acutes with graves. A critical ending is then -o+accent,
but even if the accent is absent, -o is a similar ending, and will be
removed by the same rule (compare porto`, he carried, with porto, I
carry). The result is that the Italian stemmer does not behave very
differently on texts with all accents stripped.

--- end of quote

The answer for Spanish is to extend the ending list to include unaccented
forms; again, see the correspondence with "ignacio perez":

I have just done a simple test in which the line of suffixes,

'{a'}ramos' 'i{e'}ramos' 'i{e'}semos' '{a'}semos'

is additionally preceded by the line

'aramos' 'ieramos' 'iesemos' 'asemos'

and this works fine, your word "tomaramos" splitting as "tom-aramos".

So I can suggest that as an approach: supplement the algorithm with
extra endings, corresponding to the accented forms but with the accent
removed. I suggest you build it up bit by bit, and test it out as each
new ending, or set of endings, is included.

---- end of quote
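Outside Snowball, the supplementing trick can be sketched as a plain longest-match-first ending list. This is only a toy illustration with an invented class name; the real Spanish stemmer also requires the ending to lie in the R2 region of the word, a check skipped here:

```java
public class UnaccentedEndings {
    // The accented endings from the Spanish stemmer's rule, plus the
    // same endings with the accents removed, as suggested above.
    // Longer, accented forms are listed first so they match first.
    static final String[] ENDINGS = {
        "áramos", "iéramos", "iésemos", "ásemos",
        "aramos", "ieramos", "iesemos", "asemos",
    };

    static String strip(String word) {
        for (String e : ENDINGS)
            if (word.endsWith(e) && word.length() > e.length())
                return word.substring(0, word.length() - e.length());
        return word;  // no listed ending: leave the word alone
    }

    public static void main(String[] args) {
        System.out.println(strip("tomaramos"));  // prints "tom"
        System.out.println(strip("tomáramos"));  // prints "tom"
    }
}
```

With the unaccented line added, "tomaramos" and "tomáramos" conflate to the same stem, mirroring the "tom-aramos" split described in the note.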

I think Portuguese is more difficult, because the 'tilde' indicates a
consonant value, and this is reflected in the stemming algorithm. It seems
to me okay to stem sa~o to sa~ (forgive the limitations of my keyboard), but
not to stem sa~o to sa, or to stem sao to sa. Of course, English speakers
typically write 'Sao Paulo', but that is unfortunate, and the result of
them, like me, not having the accented 'a' on the keyboard. I do not know
the answer to this: any advice would be helpful,

Martin
