Wolfgang Klinger | 22 Oct 2009 14:50

german stemmer / Kürbisse


*hiya!*

I use the Sphinx search engine and have problems with
libstemmer_de.

I have text that includes the German word "Kürbis".
The plural of "Kürbis" is "Kürbisse". Now if I search for "Kürbisse"
I would expect results for "Kürbis" too (that's why I use libstemmer).

Obviously libstemmer_de produces "kurbiss" as the stemmed form instead
of "kurbis", and therefore I get no results.
Is that a known problem? How can I solve it?

tia, kind regards
Wolfgang
Martin Porter | 26 Oct 2009 11:46

Re: german_stemmer -isse


Wolfgang,

The problem is of course with nouns ending -is in German. wildnis has (I
take it) plural wildnisse, which the algorithm stems to wildnis and so on.
Have you any idea how many words there might be in this category? I've
looked through the snowball German vocab, and very many words ending -is are
foreign words (thesis, paris ...) where (I believe?) the -isse plural rule
will not apply. Kürbis is not so very common a word, though suitably
seasonal with October 31 approaching!

Usually the -sse ending is correctly stemmed to -ss, but looking through the
vocabulary, it is clear that -isse would often be better stemmed to -is. Do
you have a view on that?

http://snowball.tartarus.org/algorithms/german/diffs.txt

I'm a bit rusty on German, as you see. I'm rather preoccupied with other
work at the moment, but could find time to look at this problem with a
German in a couple of weeks. Meanwhile, at your end, you'll have to endure
the mis-stemming for the time being,

Martin


Bernardo Brandão | 27 Oct 2009 19:21

Portuguese Stemmer

Hi Guys,

I was investigating Lucene and came upon the SnowBall package.  I was testing the PortugueseStemmer to see how applicable the SnowBallAnalyzer would be in our portal (using Lucene).  I tested stemming the Portuguese word for airplane, which is “avião”; its plural is “aviões”.  Apparently the PortugueseStemmer does not stem the two words to the same form (“aviões” is stemmed to “aviõ” and “avião” is stemmed to “aviã”).

I downloaded the Lucene contrib source files and noticed that the PortugueseStemmer has the following comments: “This file was generated automatically by the Snowball to Java compiler” and “Generated class implementing code defined by a snowball script”.

Is there any way you guys can improve this stemmer?

 

Thanks,

Bernardo Brandão - bernardo <at> lumis.com.br
Arquiteto de Software - Produto


Lumis Tecnologia da Informação
Tel [21] 3094-7500
www.lumis.com.br

 

_______________________________________________
Snowball-discuss mailing list
Snowball-discuss <at> lists.tartarus.org
http://lists.tartarus.org/mailman/listinfo/snowball-discuss
Grant Ingersoll | 28 Oct 2009 02:04

Re: german stemmer / Kürbisse

Hi Wolfgang,

I can't speak for Sphinx, but I think in general with stemming you  
will always run into these kinds of situations where words that you  
think should stem a particular way don't.  The approach we take in  
Lucene/Solr is to either have a protected word list or another  
TokenFilter (in our chain) that handles the exceptions that we deem  
important.  YMMV with other search engines.
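
A minimal sketch of that idea in Python, for illustration only. It is not
Lucene's or Sphinx's API: the OVERRIDES table, toy_stem() and normalize()
are hypothetical names, and toy_stem() is a stand-in that merely strips a
final "e", not the Snowball German algorithm.

```python
# Hypothetical override table: surface form -> form we want in the index.
OVERRIDES = {"kürbisse": "kürbis"}


def toy_stem(token: str) -> str:
    """Stand-in for a real stemmer: strips one trailing 'e' (assumption)."""
    return token[:-1] if token.endswith("e") else token


def normalize(token: str) -> str:
    """Apply the exception table first, then hand the token to the stemmer."""
    token = token.lower()
    token = OVERRIDES.get(token, token)
    return toy_stem(token)


# Without the table the toy stemmer leaves a doubled "s"; with it, the
# singular and the plural end up as the same index term.
print(toy_stem("kürbisse"), normalize("Kürbisse"), normalize("Kürbis"))
```

The same function has to run at index time and at query time, otherwise the
two sides will still disagree.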

-Grant

Martin Porter | 28 Oct 2009 17:32

Re: german_stemmer -isse


Wolfgang,

Okay, I will set some time aside, I hope in November, and look into this. I
think from your examples and my scanning of the vocabulary that -isse might
benefit from special treatment.

I will keep you informed of progress,

Martin 

>
>I can give some more examples:
>
>Hindernisse -> Hindernis
>Vorkommnisse -> Vorkommnis
>Ereignisse -> Ereignis
>Geschehnisse -> Geschehnis
>Gefängnisse -> Gefängnis
>Bisse -> Biß
>
>(plural -> singular)
>
>and two where this doesn't apply (these words are proper nouns):
>
>Hornisse (hornet)
>Prämisse (premise)
>
>
>I solved my problem in the meantime with a wordlist where "Kürbisse"
>is replaced by "Kürbis" before indexing
>
>
>hth,
>kind regards
>Wolfgang
>
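
One way the "special treatment" of -isse could look, sketched in Python
under stated assumptions. This is not the rule in the Snowball German
stemmer; strip_isse() and ISSE_EXCEPTIONS are hypothetical names, and the
exception set contains only the two counter-examples quoted above.

```python
# Words ending in -isse that are already singular (from the quoted examples).
ISSE_EXCEPTIONS = {"hornisse", "prämisse"}


def strip_isse(word: str) -> str:
    """Map a plural in -isse to -is unless the word is a listed exception."""
    w = word.lower()
    if w.endswith("isse") and w not in ISSE_EXCEPTIONS:
        return w[:-2]  # drop the final "se": "ereignisse" -> "ereignis"
    return w


for plural in ("Hindernisse", "Ereignisse", "Kürbisse", "Hornisse"):
    print(plural, "->", strip_isse(plural))
```

The sketch does not handle Bisse -> Biß, which changes the spelling of the
stem, and a real rule would need a far longer exception list or a tighter
condition on the suffix.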
Martin Porter | 28 Oct 2009 17:50

Re: Portuguese Stemmer


Bernardo,

Yes, the stemmers can be improved, among other things by building in
"exception lists" at various critical places. If you look at the snowball
site the Porter2 stemmer has been improved upon in this way. We've not done
this for the other languages, however, because it requires an intimate
language knowledge only available to native speakers.

Actually I wouldn't worry about your example too much. It is, I take it, an
irregular plural, and that is usually not what worries people in the English
stemmer. The English stemmer does not try to rectify irregular plurals
(woman/women) or irregular past tense forms (sing/sang/sung). No one
complains in practice. What upsets people is the Porter stemmer stemming
"arsenal" and "arsenic" to the same form. (The first is, among other things,
a popular football club.) So that is the sort of thing one tries to rectify
in an exception list.
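
A small Python sketch of such an exception list, for illustration only.
naive_stem() is a hypothetical stand-in that strips "al" or "ic"; it is not
the Porter algorithm and only reproduces the arsenal/arsenic collision.
PROTECTED and stem_with_exceptions() are likewise assumed names.

```python
# Words that must keep their own form instead of being conflated.
PROTECTED = {"arsenal", "arsenic"}


def naive_stem(word: str) -> str:
    """Toy suffix stripper (assumption): removes a final 'al' or 'ic'."""
    for suffix in ("al", "ic"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def stem_with_exceptions(word: str) -> str:
    """Consult the exception list before falling back to the stemmer."""
    w = word.lower()
    return w if w in PROTECTED else naive_stem(w)


# The bare stemmer conflates the two words; the exception list keeps them apart.
print(naive_stem("arsenal") == naive_stem("arsenic"))
print(stem_with_exceptions("arsenal"), stem_with_exceptions("arsenic"))
```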

It would be useful to have French2, Portuguese2 stemmers, and so on,
corresponding to the English2 stemmer -- taken a step further by native
speakers familiar with the complaints of an audience using the stemmers as
part of an IR system, but we have not really managed to establish such a
community, at least, not yet.

Martin

Martin Porter | 28 Oct 2009 17:54

Re: german stemmer / Kürbisse

At 09:04 PM 10/27/2009 -0400, Grant Ingersoll wrote:
>Hi Wolfgang,
>
>I can't speak for Sphinx, but I think in general with stemming you  
>will always run into these kinds of situations where words that you  
>think should stem a particular way don't.  

That is certainly true, and it may apply here (I assumed when first reading
Wolfgang's email that it did). But I had the feeling looking at the test
vocabulary that the -isse ending might benefit from a special rule --- we
shall see.

Martin
