Blake Madden | 7 Dec 2004 18:44

Undoubling in Dutch stemmer

Martin,

Regarding this feature in the Dutch stemmer:

"Define undoubling the ending as removing the last
letter if the word ends kk, dd or tt."

A colleague of mine mentioned that it is wise to
undouble "nn", "mm", and "ff" as well for Dutch.  Just
off the top of your head, does this seem reasonable and
would you want to make this change?  Otherwise, should
I prompt him for more info (e.g. example text)?
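
For concreteness, the quoted rule and the proposed extension can be sketched as below. This is a Python illustration, not Snowball; the names `undouble`, `CURRENT_PAIRS` and `PROPOSED_PAIRS` are made up for the example, and the inputs are assumed to be words with a suffix such as "en" already removed.

```python
# Endings named in the quoted rule, and the same list with the
# three pairs suggested in this message added (an assumption about
# how the extension would be expressed).
CURRENT_PAIRS = ("kk", "dd", "tt")
PROPOSED_PAIRS = CURRENT_PAIRS + ("nn", "mm", "ff")


def undouble(word, pairs=CURRENT_PAIRS):
    """Remove the last letter if the word ends in one of the given pairs."""
    return word[:-1] if word.endswith(pairs) else word


print(undouble("bakk"))                  # bak
print(undouble("mann"))                  # mann (unchanged by the current rule)
print(undouble("mann", PROPOSED_PAIRS))  # man
```

Whether the extra pairs help overall is exactly the open question; the sketch only shows what would change mechanically.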

Thanks,
Blake

=====
**********************************************************************
Blake Madden
Localization Manager
Mobile: 937-369-8564
E-mail: madden_blake <at> yahoo.com
**********************************************************************

		

Martin Porter | 9 Dec 2004 09:31

Re: Undoubling in Dutch stemmer


I don't recall the details now, but I think I went through consonant by
consonant trying the effect of undoubling. That was my approach in
developing the stemmers generally. So if a plausible rule is not included it
is because, on balance, it did not seem to lead to an improvement. Of course
any rule might be reassessed, and I will store this one for the next time I
look at the Dutch stemmer.

You may have noticed that there is not much work being done on the stemmers
at the moment ...

Martin

Edwin de Jonge | 9 Dec 2004 15:08

RE: Undoubling in Dutch stemmer

Hi Martin/Blake,

I raised the same point Blake made, quite some time ago. I didn't have
the time to make a proposal for improvement of the snowball script for
Dutch.
I'll try to do this next week. Being a native speaker of Dutch I find
the "kk", "dd", "tt" undoubling rather arbitrary.
Double consonants in Dutch are mainly used to make the previous vowel
short (with the exception that at the end of a word no double consonants
are used).

The idea (not in Snowball syntax):
One of the problems with the current Dutch stemmer is that short and long
vowel words are stemmed to the same stem.
E.g.: "makken" becomes "mak", and "maken" also becomes "mak" ("maak" would
be more natural).
If the undoubling of consonants is generalized to all consonants, then
the stemmer should adjust for this effect by doubling vowels.

So the following should be done:
1) modify the undouble procedure to do the following:
 If ending in a double consonant
	remove one of the consonants.         // generalisation of undouble rule
 else
	if ending CVC (consonant among('a''e''o''u') consonant)
		double vowel                  // make vowel long by doubling it
		if ending among ('v', 'z')    // transform 'v' and 'z' into 'f' and 's' ('huizen' -> 'huis', 'leven' -> 'leef'
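
As a rough illustration of step 1, here is a Python sketch (not Snowball) of the part of the procedure that is visible above. The function name `undouble_generalised` and the vowel set are assumptions, the input is assumed to be a word with a suffix such as "en" already removed, and the nesting of the v/z step follows the indentation as shown. The rest of the proposal is cut off in the archive and is not reconstructed here.

```python
VOWELS = set("aeiouyè")            # assumed vowel set for Dutch
LONGABLE = set("aeou")             # vowels the proposal doubles
DEVOICE = {"v": "f", "z": "s"}     # final v -> f, z -> s


def is_consonant(ch):
    return ch.isalpha() and ch not in VOWELS


def undouble_generalised(stem):
    """Sketch of the proposed rule applied to a suffix-stripped word."""
    # Case 1: any doubled final consonant loses one letter.
    if len(stem) >= 2 and stem[-1] == stem[-2] and is_consonant(stem[-1]):
        return stem[:-1]
    # Case 2: consonant + a/e/o/u + consonant: double the vowel,
    # then turn a final v or z into f or s.
    if (len(stem) >= 3 and is_consonant(stem[-3])
            and stem[-2] in LONGABLE and is_consonant(stem[-1])):
        last = DEVOICE.get(stem[-1], stem[-1])
        return stem[:-2] + stem[-2] * 2 + last
    return stem


print(undouble_generalised("makk"))  # mak   (from "makken")
print(undouble_generalised("mak"))   # maak  (from "maken")
print(undouble_generalised("lev"))   # leef  (from "leven")
```

Note that "huiz" (from "huizen") does not match the CVC test in this sketch because of the "ui" digraph, so the 'huis' example presumably relies on a part of the proposal that is not shown.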

Diana Maynard | 9 Dec 2004 22:02

evaluation of Snowball stemmers

Apologies if the answer to this question can be found somewhere obvious, but
I've hunted around and can't find it, and I can't see a way to search the list
archives.
Has anyone evaluated the Snowball stemmers on the various languages, and if
so, what were the results?
Many thanks
Diana

-------------------------------
Dr Diana Maynard
Research Associate
NLP group
University of Sheffield, UK

Olly Betts | 10 Dec 2004 01:50

Re: evaluation of Snowball stemmers

On Thu, Dec 09, 2004 at 09:02:03PM +0000, Diana Maynard wrote:
> Apologies if the answer to this question can be found somewhere obvious, but
> I've hunted around and can't find it, and I can't see a way to search the 
> list archives.

You can search the gmane archive, though it may not have the oldest
articles unless somebody has imported the list archives (it has 650 or so):

http://search.gmane.org/search.php?group=gmane.comp.search.snowball

> Has anyone evaluated the Snowball stemmers on the various languages, and if
> so, what were the results?

I can't help here I'm afraid.

Cheers,
    Olly

Martin Porter | 10 Dec 2004 09:04

Re: evaluation of Snowball stemmers


Diana,

I have not carefully monitored the use of the stemmers in evaluation work,
although I think it is fairly extensive. (Of course the stemmers are often
used in IR experiments even when stemming itself is not the subject of
evaluation.) But see this paper: 

Stephen Tomlinson (2003) Lexical and algorithmic stemming compared for 9
European languages with Hummingbird SearchServer(TM) at CLEF 2003. In Carol
Peters, editor, Working notes for the CLEF 2003 Workshop 21-22 August,
Trondheim, Norway.

http://www.stephent.com/ir/papers/clef03.html 

Tomlinson (2003) compares the Snowball stemmers with a commercial lexical
stemming (lemmatization) system. Of the nine languages tested, six gave
differences that were not statistically significant, two did better under
the lemmatization system, and one better under Snowball - I think I got that
right: you can verify it by looking at the paper.  

Given the simplicity and cheapness of the Snowball stemmers compared with a
full lemmatization system I think this is a good result for Snowball. 

Unfortunately I have not been able to find out much about the Hummingbird
system, either from Tomlinson's paper or elsewhere.

Martin

Fred Gey | 10 Dec 2004 19:06

RE: evaluation of Snowball stemmers

Hi,

We utilized the Russian Snowball stemmer for our Russian IR experiments in
CLEF 2003.

After the workshop we ran some stemmer/no-stemmer experiments. The results
were remarkable (I did not test statistical significance): for
Title-Description (shorter queries) average precision went from 0.259 to
0.334 (29% improvement), for Title-Description-Narrative (longer queries),
average precision went from 0.236 to 0.367 (56% improvement).

These experiments were performed post-workshop, so they are not included in
the notebook paper online, but they are in the final paper “UC Berkeley at
CLEF-2003 – Russian Language Experiments and Domain-Specific Retrieval” in
the book just released by Springer: Comparative Evaluation of Multilingual
Information Access Systems, 4th Workshop of the Cross-Language Evaluation
Forum, CLEF 2003, Trondheim, Norway, August 21-22, 2003, Revised Selected
Papers. Series: Lecture Notes in Computer Science, Vol. 3237.

There were some encoding issues.  The Snowball Russian stemmer only works on
KOI-8 encoding, so we converted the entire CLEF Russian collection from
UTF-8 to KOI-8 using the unix iconv utility.
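
For readers without iconv at hand, the same kind of conversion can be sketched in Python. This is only an illustration: it assumes "KOI-8" means the KOI8-R variant (Python's `koi8_r` codec), and the function name `utf8_to_koi8r` is made up for the example. The exact iconv invocation used for the collection is not given in this message.

```python
def utf8_to_koi8r(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as KOI8-R (assumed to be the KOI-8 variant meant).

    Raises UnicodeEncodeError if the text contains characters that
    KOI8-R cannot represent.
    """
    return data.decode("utf-8").encode("koi8_r")


sample = "стеммер"                      # 7 Cyrillic letters
converted = utf8_to_koi8r(sample.encode("utf-8"))
print(len(sample.encode("utf-8")))      # 14 (two bytes per letter in UTF-8)
print(len(converted))                   # 7  (one byte per letter in KOI8-R)
```

Characters outside the KOI8-R repertoire would need a policy (for example an `errors="replace"` argument to `encode`) before converting a whole collection.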

 

Fred

Fredric C Gey, PhD
Data Archivist and Assistant Director
UC Data Archive & Technical Assistance (UC DATA)
University of California, Berkeley

Interests: Cross-language Information Retrieval
           Social Science Databases
web page: http://ucdata.berkeley.edu/gey.html

_______________________________________________
Snowball-discuss mailing list
Snowball-discuss <at> lists.tartarus.org
http://lists.tartarus.org/mailman/listinfo/snowball-discuss

Martin Porter | 10 Dec 2004 23:26

RE: evaluation of Snowball stemmers


Fred,

Do you mean you got a 29%/56% average precision improvement when you
switched stemming off? Anything is possible, but this does surprise me: I
would have expected Russian, with its highly (and regularly) inflected
vocabulary to do quite well under stemming.

If you look at the paper at 

http://clef.isti.cnr.it/2004/working_notes/WorkingNotes2004/16.pdf

(Mono- and Crosslingual Retrieval Experiments at the University of
Hildesheim  - René Hackl, Thomas Mandl and  Christa Womser-Hacker) the
evidence, for Finnish, points the other way ("the snowball stemmer works
very well"). Their Russian experiments were not unfortunately taken to
conclusion, but I feel much more confidence myself in the snowball Russian
stemmer than the snowball Finnish stemmer.

On the other hand I have had verbal notice (which I did not entirely trust!)
of the Finnish stemmer doing badly in some other tests.

I should point out that although the version of the stemmer you picked up
works for KOI-8, Snowball is designed to make switching to other character
codes as easy as possible. See the notes at

http://snowball.tartarus.org/codesets/guide.html

Martin

 
Oleg Bartunov | 10 Dec 2004 23:35

RE: evaluation of Snowball stemmers

Here's a paper which compares several stemmers (including Snowball) on a
Russian corpus.
http://company.yandex.ru/articles/iseg-las-vegas.html

My own experience is good.

 	Oleg

 	Regards,
 		Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg <at> sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Fred Gey | 10 Dec 2004 23:52

RE: evaluation of Snowball stemmers

Sorry for the word order reversal -- should have been:

After the workshop we ran some no-stemmer/stemmer experiments. The results
were remarkable (I did not test statistical significance): for
Title-Description (shorter queries) average precision went from 0.259 to
0.334 (29% improvement), for Title-Description-Narrative (longer queries),
average precision went from 0.236 to 0.367 (56% improvement).
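
The percentages are relative improvements over the unstemmed baseline, which can be checked with a few lines of Python. The function name `relative_improvement` and the labels "TD" and "TDN" are shorthand introduced for this check only.

```python
def relative_improvement(before, after):
    """Percentage change of `after` relative to `before`."""
    return (after - before) / before * 100.0


# (label, average precision without stemming, with stemming)
runs = [("TD", 0.259, 0.334), ("TDN", 0.236, 0.367)]
for label, before, after in runs:
    print(f"{label}: {relative_improvement(before, after):.0f}%")
# TD: 29%
# TDN: 56%
```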

Fred
