Hi,
We used the Russian Snowball stemmer for our Russian IR experiments
in CLEF 2003.
After the workshop we ran some stemmer/no-stemmer experiments. The
results were remarkable (I did not test statistical significance): for
Title-Description (shorter queries), average precision went from 0.259 to 0.334
(29% improvement); for Title-Description-Narrative (longer queries), average
precision went from 0.236 to 0.367 (56% improvement). These experiments were performed
post-workshop, so are not included in the notebook paper online, but are in the
final paper “UC Berkeley at CLEF-2003 – Russian Language Experiments and
Domain-Specific Retrieval” in the book just released by Springer: Comparative
Evaluation of Multilingual Information Access Systems, 4th Workshop of the
Cross-Language Evaluation Forum, CLEF 2003, Trondheim, Norway, August 21-22, 2003,
Revised Selected Papers. Series: Lecture Notes in Computer Science, Vol. 3237.
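The improvement percentages quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name and the dictionary layout are illustrative, and the four average-precision values are the ones given in this message.

```python
def relative_improvement(before: float, after: float) -> float:
    """Return the relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Average precision figures quoted in this message: (from, to).
runs = {
    "Title-Description": (0.259, 0.334),
    "Title-Description-Narrative": (0.236, 0.367),
}

for name, (before, after) in runs.items():
    # Rounded to a whole percent, as in the message.
    print(f"{name}: {relative_improvement(before, after):.0f}% improvement")
```

Rounded to whole percentages, this gives the 29% and 56% figures stated above.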
There were some encoding issues. The Snowball Russian stemmer only works
on KOI-8 encoding, so we converted the entire CLEF Russian collection from
UTF-8 to KOI-8 using the Unix iconv utility.
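The conversion itself was done with iconv, and the exact options used are not recorded here. As a rough Python equivalent, the sketch below re-encodes UTF-8 text as KOI8-R. It assumes the KOI8-R variant of KOI-8 (Python's `koi8_r` codec), and the function names and file paths are hypothetical. Characters with no KOI8-R equivalent are replaced with `?` rather than raising an error, which is a design choice of this sketch, not something the message specifies.

```python
def utf8_to_koi8r(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as KOI8-R; unmappable characters become '?'."""
    return data.decode("utf-8").encode("koi8_r", errors="replace")


def convert_file(src_path: str, dst_path: str) -> None:
    """Stream a UTF-8 text file into a KOI8-R file, line by line."""
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, "r", encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="koi8_r", errors="replace", newline="") as dst:
        for line in src:
            dst.write(line)


# Small in-memory check: the text survives a round trip through KOI8-R.
sample = "русский стеммер"
koi8_bytes = utf8_to_koi8r(sample.encode("utf-8"))
print(koi8_bytes.decode("koi8_r") == sample)
```

For a whole collection, `convert_file` would be called once per document file; KOI8-R is a single-byte encoding, so each Cyrillic letter takes one byte in the output instead of two in UTF-8.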
Fred
Fredric C Gey, PhD
Data Archivist and Assistant Director
UC Data Archive & Technical Assistance (UC DATA)
University of California, Berkeley
Interests: Cross-language Information Retrieval
Social Science Databases
web page: http://ucdata.berkeley.edu/gey.html
-----Original Message-----
From: snowball-discuss-bounces <at> lists.tartarus.org
[mailto:snowball-discuss-bounces <at> lists.tartarus.org] On Behalf Of Martin Porter
Sent: Friday, December 10, 2004 12:05 AM
To: Diana Maynard; snowball-discuss <at> lists.tartarus.org
Subject: Re: [Snowball-discuss] evaluation of Snowball stemmers
Diana,
I have not carefully monitored the use of the stemmers in evaluation work,
although I think it is fairly extensive. (Of course the stemmers are often
used in IR experiments even when stemming itself is not the subject of
evaluation.) But see this paper:
Stephen Tomlinson (2003) Lexical and algorithmic stemming compared for 9
European languages with Hummingbird SearchServer(TM) at CLEF 2003. In Carol
Peters, editor, Working notes for the CLEF 2003 Workshop 21-22 August,
Trondheim, Norway.
http://www.stephent.com/ir/papers/clef03.html
Tomlinson (2003) compares the Snowball stemmers with a commercial lexical
stemming (lemmatization) system. Of the nine languages tested, six gave
differences that were not statistically significant, two did better under
the lemmatization system, and one better under Snowball - I think I got that
right: you can verify it by looking at the paper.
Given the simplicity and cheapness of the Snowball stemmers compared with a
full lemmatization system I think this is a good result for Snowball.
Unfortunately I have not been able to find out much about the Hummingbird
system, either from Tomlinson's paper or elsewhere.
Martin