Richard R. Liu | 1 May 2012 14:06

Re: Multi language full text: Which stemming language should be used?

@Manoj M,

By "multi language" which of these do you mean:
A.  Each document in the corpus is in a single language, but not all documents are in the same language.
B.  Passages (e.g., paragraphs) in a single document may be in different languages.

At any rate, the whole point of stemming is conflation, i.e., treating different forms of the same word
as a single term. Stemming therefore depends on the language, because the language dictates how the
different forms of a word are built.  So, assuming you have some way of determining the language of a word
and you use the correct stemmer to stem it, you will have to store not only the stem but also its language:
just as two languages may have words that are spelled the same, two stems in two different languages may
also be spelled the same.
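
To illustrate why the language has to be stored with the stem, here is a minimal sketch in Python. The two stemmers are toy suffix strippers I made up for the example, not the real Snowball algorithms, and the index layout is only one possible choice:

```python
from collections import defaultdict


def make_suffix_stemmer(suffixes):
    """Build a toy stemmer (NOT a real Snowball stemmer) that strips one suffix."""
    def stem(word):
        for suffix in suffixes:
            # Keep at least three characters so short words are left alone.
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)]
        return word
    return stem


# Hypothetical stand-ins for per-language stemmers.
STEMMERS = {
    "english": make_suffix_stemmer(("ing", "es", "s")),
    "german": make_suffix_stemmer(("en", "e", "n")),
}


def add_document(index, doc_id, language, text):
    """Index each word under the key (language, stem), not under the stem alone."""
    stem = STEMMERS[language]
    for word in text.lower().split():
        index[(language, stem(word))].add(doc_id)


index = defaultdict(set)
add_document(index, "doc-en", "english", "wrapping gifts")   # "gifts" = presents
add_document(index, "doc-de", "german", "tödliche gifte")    # "Gifte" = poisons

# Both documents produce the stem "gift", but the language keeps them apart.
print(index[("english", "gift")])
print(index[("german", "gift")])
```

If the index were keyed on the stem alone, a search for English "gift" would also return the German document about poisons.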

With regard to determining the language, if you are dealing with A (above), you might have some metadata
with this information; however, you will have to judge how dependable it is.  If you are dealing with B,
there are some heuristics that operate on the sentence level.  You could apply stop word lists in all the
possible languages to the sentence, then select the language that matches the most (unique) stop words.
Or you could apply the stemmers and select the language whose stemmer changes the most words.
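
The stop word heuristic could be sketched roughly as follows. The stop word lists here are tiny, made-up samples just for illustration; real lists would be much longer:

```python
import re

# Tiny illustrative stop word lists (assumed for this example only).
STOP_WORDS = {
    "english": {"the", "and", "of", "is", "in", "to", "a"},
    "german": {"der", "die", "das", "und", "ist", "in", "zu", "ein"},
    "french": {"le", "la", "les", "et", "est", "de", "un", "une"},
}


def guess_language(sentence, stop_words=STOP_WORDS):
    """Return the language whose list matches the most unique stop words.

    Returns None if no stop word from any list occurs in the sentence.
    Ties go to whichever language comes first in the dictionary.
    """
    words = set(re.findall(r"\w+", sentence.lower()))
    scores = {lang: len(words & stops) for lang, stops in stop_words.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_language("The cat is in the garden"))
print(guess_language("Der Hund ist in dem Haus und schläft"))
```

Note that "in" is a stop word in both English and German, so a single match proves little; the heuristic only becomes useful when a sentence contains several stop words.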

On the query side, you also have the problem of determining the language -- presumably just one! -- of the
query terms.  In this case, if you resort to one of the heuristics described above, it would probably have to
be the one based on stemmers, since people rarely use stop words in queries.
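
For the query side, the stemmer-based heuristic might look something like the sketch below. Again, the stemmers are toy suffix strippers I invented as stand-ins, and "most stemming" is interpreted here as "number of query terms the stemmer changes", which is my assumption:

```python
def make_suffix_stemmer(suffixes):
    """Build a toy stemmer (NOT a real Snowball stemmer) that strips one suffix."""
    def stem(word):
        for suffix in suffixes:
            # Keep at least three characters so short words are left alone.
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)]
        return word
    return stem


# Hypothetical stand-ins for per-language stemmers.
STEMMERS = {
    "english": make_suffix_stemmer(("ing", "ed", "ly", "s")),
    "german": make_suffix_stemmer(("ungen", "ung", "en", "er")),
}


def guess_query_language(query, stemmers=STEMMERS):
    """Return the language whose stemmer changes the most query terms.

    Returns None if no stemmer changes any term.
    """
    words = query.lower().split()
    scores = {
        lang: sum(1 for word in words if stem(word) != word)
        for lang, stem in stemmers.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_query_language("running shoes reviews"))
print(guess_query_language("wohnungen mieten"))
```

With real stemmers the scores will be much closer together, since a stemmer will happily strip suffixes from words in the wrong language, so for short queries the result should be treated as a guess.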

Regards,
Richard

On May 1, 2012, at 13:00 , snowball-discuss-request <at> lists.tartarus.org wrote:

> Send Snowball-discuss mailing list submissions to
> 	snowball-discuss <at> lists.tartarus.org
> 

Gregor Kališnik | 4 May 2012 16:52

Slovene stemmer

Hi.

I've seen the Slovenian stemmer mentioned on the mailing list, and I'm wondering
what its current state is.

I'm also wondering what steps are needed to make it available to the Python wrapper
(PyStemmer). I've managed to get the stemwords CLI to use it.

Thank you for your answers.

Best regards,
Gregor Kališnik
_______________________________________________
Snowball-discuss mailing list
Snowball-discuss <at> lists.tartarus.org
http://lists.tartarus.org/mailman/listinfo/snowball-discuss

Martin Porter | 7 May 2012 17:10

New stemmers

We have new contributed stemmers for Irish Gaelic and Czech from Jimmy O'Regan.

Martin Porter | 7 May 2012 17:44

Re: Slovene stemmer

See

http://article.gmane.org/gmane.comp.search.snowball/1047

which gives the current position. We never took the contributed
stemmer on board as we were unable to resolve certain questions, such
as how it differs from the published Popovic/Willett stemmer. I'd
always hoped someone would send us the Popovic/Willett stemmer written in
Snowball, but this has not happened!

Martin
