Vili Lehdonvirta | 1 Dec 08:52 2003

Re: Finnish stemmer: some suggestions and some doubts

Hi Martin,

Thanks for your prompt reply. I may well be able to work with the stemmer.
I'll get in contact with Kalervo. I'll also need to ask the Nutch people
about their stemming plans. One big issue, though, is the availability of
a Java implementation, as Nutch is a Java project. Without that, it
might not be worthwhile for me to look into it.

But first, tomorrow I leave for a holiday trip for two weeks. Well
deserved, I believe. :)

Regards,

-- 
	Vili Lehdonvirta
	vili.lehdonvirta <at> hut.fi
	+65 94367590
Olly Betts | 1 Dec 11:50 2003

Re: Finnish stemmer: some suggestions and some doubts

On Sun, Nov 30, 2003 at 11:47:32AM +0000, Martin Porter wrote:
> A word of warning: the Java generated Finnish stemmer does not work because
> the stemmer uses a Snowball feature not used by the other stemmers, on which
> the Java code generator was tested. Neither Richard Boulton nor I currently
> have the resources to fix this.

What is the feature?

Xapian has just gained JNI wrappers to allow use from Java.  This
includes wrappers for our C++ wrappers for the snowball stemmers, so if
you want to use the Finnish stemmer from Java that would be an option.

Cheers,
    Olly
Martin Porter | 1 Dec 14:52 2003

Re: Finnish stemmer: some suggestions and some doubts


>What is the feature?

It is the use of routine names after strings in amongs. To quote from the
manual:

In its most general form, each string  'Sij'  may be optionally followed by
a routine name, 

    among( (C)
           'S11' R11 'S12' R12 ... (C1)
           'S21' R21 'S22' R22 ... (C2)
           ...
           'Sn1' Rn1 'Sn2' Rn2 ... (Cn)
         )

So here each  Rij  is either a routine name or is null. If null, it is
equivalent to a routine which simply returns signal t.
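
The behaviour described above can be sketched in Python. This is only a rough model, not the Snowball implementation: it matches suffixes of a whole word (as in backward mode), ignores the bracketed commands (C), (C1) and so on, and assumes that when a routine fails the next longest matching string is tried. The names `among`, `longer_than_four` and `ENTRIES` are invented for illustration.

```python
from typing import Callable, Optional, Sequence, Tuple

# Each entry pairs a string with an optional routine; None stands for "no routine".
Entry = Tuple[str, Optional[Callable[[str], bool]]]


def among(word: str, entries: Sequence[Entry]) -> Optional[str]:
    """Return the longest listed suffix of `word` whose routine succeeds, else None."""
    # Try longer strings first, so the longest acceptable match wins.
    for suffix, routine in sorted(entries, key=lambda e: len(e[0]), reverse=True):
        if word.endswith(suffix):
            # A missing routine behaves like one that simply signals t (true).
            if routine is None or routine(word):
                return suffix
    return None


def longer_than_four(word: str) -> bool:
    """Hypothetical routine: succeed only for words longer than four letters."""
    return len(word) > 4


# Hypothetical among list: 'ed' is guarded by a routine, 'd' is not.
ENTRIES = [("ed", longer_than_four), ("d", None)]

if __name__ == "__main__":
    for w in ("walked", "bed", "cat"):
        print(w, "->", among(w, ENTRIES))
```

Here "bed" ends in 'ed', but the routine attached to 'ed' fails, so the shorter string 'd' is selected instead.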
Martin Porter | 2 Dec 00:03 2003

Re: Finnish stemmer: some suggestions and some doubts


Vili,

I notice there is an embarrassing ambiguity in my earlier email. When I said
"I communicated with Kalervo Jarvelin of Tampere university while doing the
work (kalervo.jarvelin <at> yta.fi), who was most helpful, but not after its
completion," I meant of course that I did not communicate with him after its
completion, not that he was unhelpful after its completion!

Martin
Peter Kolloch | 2 Dec 13:36 2003

English stemmer

Hi,

First, thanks for your work.

I recognized a systematic weakness in your English stemmer. It does not 
reduce superlatives. For example "greatest" and "brightest" just remain 
unchanged.

Unfortunately, just blindly removing "est" probably does not work that
well. Grammatically this does not make sense for "test", "guest",
"zest". Yet there might be no clashes with other stems...

Regards,
Peter

[I am currently not subscribed to this list, so it would be nice if you
could send a copy of your answer directly to my address.]
Martin Porter | 2 Dec 23:59 2003

Re: English stemmer


Yes, -est is not removed, there being too many words from which its removal
would be incorrect: behest, attest, request and so on. Removing -est from
longer words only is not really satisfactory, since the English comparative
and superlative endings are only added to short adjectives anyway: "curiouser
and curiouser" is not, of course, correct English.

Martin

>I recognized a systematic weakness in your English stemmer. It does not 
>reduce superlatives. For example "greatest" and "brightest" just remain 
>unchanged.
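
The problem discussed in this exchange can be illustrated with a small Python sketch. It is not part of the English stemmer; `strip_est` and its `min_stem_len` parameter are invented here purely to show what naive -est removal, with and without a length threshold, does to the words mentioned in the two messages.

```python
def strip_est(word: str, min_stem_len: int = 0) -> str:
    """Remove a trailing 'est' if at least `min_stem_len` letters would remain."""
    if word.endswith("est") and len(word) - 3 >= min_stem_len:
        return word[:-3]
    return word


# Words taken from the two messages above.
WORDS = ["greatest", "brightest", "behest", "attest", "request",
         "test", "guest", "zest"]

if __name__ == "__main__":
    for w in WORDS:
        # Show plain removal next to removal restricted to longer words.
        print(w, "->", strip_est(w), "|", strip_est(w, min_stem_len=4))
```

Plain removal damages every non-superlative in the list, and a length threshold of four still turns "request" into "requ".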
Gary Davidson | 7 Dec 23:59 2003

German Stemmer


I have implemented 6 of the stemmers using NLucene on a .NET e-commerce site
in Europe. Currently my clients are very happy with stemming, with one
exception. The German words "Papiere" and "Papieres" do not stem to the same
root "papiere" as I thought they would. Is there some reason for this? As
I am a programmer and not a linguist I am at a loss.

TIA

Gary Davidson

Gary Davidson | 8 Dec 14:08 2003

Re: German Stemmer


I am sorry I was not clear, "Papieres" was a typo.

Thanks, I believe I found the problem elsewhere.

Gary Davidson

>From: martin.porter <at> grapeshot.co.uk (Martin Porter)
>To: "Gary Davidson" <sql_1 <at> hotmail.com>, 
>snowball-discuss <at> lists.tartarus.org
>Subject: Re: [Snowball-discuss] German Stemmer
>Date: Mon, 8 Dec 2003 12:12:56 GMT
>
>
> >The German words "Papiere" and "Papieres" do not stem to the same
> >root "papiere" as I thought they would. Is there some reason for this? As
> >I am a programmer and not a linguist I am at a loss.
> >
> >TIA
> >
> >Gary Davidson
>
>
>Gary,
>
>No, they don't stem to papiere, but they do stem to the same form (papi),
>which means that they are conflated by the stemming process.
>
>Stemming is not exact in the way you seem to expect - see the introductory
>document on the snowball site -

Martin Porter | 8 Dec 13:12 2003

Re: German Stemmer


>The German words "Papiere" and "Papieres" do not stem to the same 
>root "papiere" as I thought they would. Is there some reason for this? As 
>I am a programmer and not a linguist I am at a loss.
>
>TIA
>
>Gary Davidson

Gary,

No, they don't stem to papiere, but they do stem to the same form (papi),
which means that they are conflated by the stemming process.

Stemming is not exact in the way you seem to expect - see the introductory
document on the snowball site -

http://snowball.tartarus.org/texts/introduction.html

Martin
Fred Fung | 11 Dec 16:52 2003

Snowball French stemming

Good Day,
 
I am using OpenFTS 0.35 with the Snowball French stemming algorithm and the Snowball wrapper downloaded from the snowball.tartarus.org site. With this algorithm, I came across the following inconsistency:

The word "française" is stemmed to "français". This is fine. However, when I stemmed the word "français", it became "franc".

I looked at the sample French vocabulary and its stemmed equivalents posted on the Snowball site under the French link, and "français" is indeed stemmed to "franc".

But here is the problem: I am using this stemming algorithm in conjunction with the text search package OpenFTS 0.35. When I use the French stemming algorithm to convert a piece of text containing the word "française" into its indexing equivalent, and later search the table for the word "français", I would expect this text to be considered a match as well. But that is not the case (I have tried it), since "français" is stemmed (using the same stemming algorithm) to "franc" before the search starts, and so never matches the stem "français" stored in the tsvector field.

Is this something one has to live with when using this French stemming algorithm? If not, is there any way to work around the problem I mentioned here?
 
Thanks.
 
 
Fred 
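
The mismatch described in this message can be reproduced with a toy Python model. This is not OpenFTS and not the Snowball French stemmer: `REPORTED_STEMS` hard-codes only the two mappings reported above, and `build_index` and `search` are hypothetical helpers that stem tokens at index time and at query time, which is the setup the message describes.

```python
# Stand-in for the French stemmer: only the two mappings reported in the message.
REPORTED_STEMS = {"française": "français", "français": "franc"}


def stem(word: str) -> str:
    """Return the reported stem, or the word unchanged if none was reported."""
    return REPORTED_STEMS.get(word, word)


def build_index(docs: dict) -> dict:
    """Map each stemmed token to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index.setdefault(stem(token), set()).add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Stem the query word the same way and look it up in the index."""
    return index.get(stem(query.lower()), set())


DOCS = {1: "la langue française"}
INDEX = build_index(DOCS)

if __name__ == "__main__":
    print(search(INDEX, "française"))  # query stem is "français", as indexed
    print(search(INDEX, "français"))   # query stem is "franc", not in the index
```

The document is indexed under "français", so a query for "française" finds it while a query for "français" is reduced to "franc" and finds nothing.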
