Karl Wettin | 7 Jan 09:30 2009

Re: Why not abstract Snowball Java?

FYI, I recently committed a SnowballProgram.java to Apache Lucene,
patched as shown below. This reflection-free access to the programs
saves us quite a bit of resources. It would make a lot of sense to
apply this patch to the tartarus trunk, but we can keep our own
branch over at Lucene, no problem.

http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/snowball/src/java/org/tartarus/snowball/SnowballProgram.java
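
To illustrate the point about reflection, here is a minimal sketch and not the committed file. SketchProgram, ToyStemmer and StemCallDemo are hypothetical names made up for this example, and the toy rule (strip one trailing "s") is not a real Snowball algorithm. It only shows that once stem() is declared abstract on the base class, a caller can invoke it directly instead of looking it up through java.lang.reflect.

```java
import java.lang.reflect.Method;

// Hypothetical stand-in for a stemmer base class; NOT the real SnowballProgram.
abstract class SketchProgram {
    protected final StringBuilder current = new StringBuilder();

    public void setCurrent(String word) {
        current.setLength(0);
        current.append(word);
    }

    public String getCurrent() {
        return current.toString();
    }

    // Declaring stem() here is what makes the direct call below possible.
    public abstract boolean stem();
}

// Toy stemmer, purely illustrative: strips a single trailing "s".
class ToyStemmer extends SketchProgram {
    @Override
    public boolean stem() {
        int n = current.length();
        if (n > 1 && current.charAt(n - 1) == 's') {
            current.setLength(n - 1);
            return true;
        }
        return false;
    }
}

class StemCallDemo {
    // Direct call: an ordinary virtual method invocation, no lookup needed.
    static String stemDirect(SketchProgram program, String word) {
        program.setCurrent(word);
        program.stem();
        return program.getCurrent();
    }

    // Reflective call: what a caller has to do when the base type does not
    // declare stem(). Each method is looked up by name and invoked indirectly.
    static String stemReflective(Object program, String word) {
        try {
            Method setCurrent = program.getClass().getMethod("setCurrent", String.class);
            Method stem = program.getClass().getMethod("stem");
            Method getCurrent = program.getClass().getMethod("getCurrent");
            setCurrent.invoke(program, word);
            stem.invoke(program);
            return (String) getCurrent.invoke(program);
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(stemDirect(new ToyStemmer(), "stems"));     // prints "stem"
        System.out.println(stemReflective(new ToyStemmer(), "stems")); // prints "stem"
    }
}
```

Both paths give the same result; the difference is only in how much work the caller and the JVM do per word.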

     karl

On 18 Jan 2008 at 18:15, Karl Wettin wrote:

>
> On 18 Jan 2008 at 13:44, Karl Wettin wrote:
>
>> How is it that the snowball program is not abstract and has the  
>> abstract method stem()?
>
>
> It is so much more expensive to do the now-required reflection to stem.
>
> Index: java/org/tartarus/snowball/SnowballProgram.java
> ===================================================================
> --- java/org/tartarus/snowball/SnowballProgram.java	(revision 500)
> +++ java/org/tartarus/snowball/SnowballProgram.java	(working copy)
> @@ -2,13 +2,15 @@
> package org.tartarus.snowball;
> import java.lang.reflect.InvocationTargetException;
>
> -public class SnowballProgram {
[...]

Karl Wettin | 7 Jan 10:44 2009

Re: Swedish stems need patching

I've been looking into this. Stemming by -an and -ans suffix removal
(removing n or ns) is rather damaging. Altogether there are 1500
Swedish words in singular form that end in an or ans, and some 50 of
them are problematic when this suffix is removed. It would of course
be possible to keep track of all these exceptions that should not be
stemmed.

A few of the problematic words: afrikan, amerikan, banan, glan, glans,
gan, gans, dans (brygdans, gammeldans, etc.), dan, fina, finans,
krans, kran, koran, roman, romans, samman, seans (actually
problematic with the current stemmer too), svan, svans, tran, trans,
usans, vacklan, vakans
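
The exception-list idea could look roughly like the sketch below. It is only an illustration under stated assumptions: AnAnsRuleSketch and strip are hypothetical names, the exception set is a subset of the words listed above, "flickan"/"flickans" are my own test words, and the rule is applied in isolation rather than inside the real Swedish stemmer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of an an/ans rule guarded by an exception list.
// It is not part of the Snowball Swedish stemmer.
class AnAnsRuleSketch {
    // Subset of the problematic words from the list above. Compounds such as
    // "gammeldans" need their own entries because matching is exact.
    static final Set<String> EXCEPTIONS = new HashSet<>(Arrays.asList(
            "afrikan", "amerikan", "banan", "glans", "dans", "gammeldans",
            "finans", "krans", "kran", "koran", "roman", "romans",
            "svan", "svans", "tran", "trans", "vakans"));

    // Removes "ns" from an "-ans" ending or "n" from an "-an" ending,
    // unless the whole word is a known exception.
    static String strip(String word) {
        if (EXCEPTIONS.contains(word)) {
            return word;
        }
        if (word.endsWith("ans")) {
            return word.substring(0, word.length() - 2);
        }
        if (word.endsWith("an")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(strip("flickans")); // prints "flicka"
        System.out.println(strip("flickan"));  // prints "flicka"
        System.out.println(strip("banan"));    // prints "banan" (exception)
    }
}
```

A hash set keeps the lookup cheap, but the list has to be maintained by hand, which is the cost of this approach.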

There are probably more words that I have not seen, as I only stemmed
using the rule as described above. That is, there might be lots of
words that, when stemmed using other rules in the Swedish stemmer,
turn out not to be that unique and become problematic in combination
with the an/ans rule. I'll try to come up with something later on;
for now I'm "happy" to know that the an/ans suffix stemming cannot be
applied without further work.

        karl

On 29 Jan 2008 at 17:54, Martin Porter wrote:

>
>
> Janko,
>
> Either -an, -ans was overlooked when the stemmer was written, or the
[...]

Martin Porter | 26 Jan 10:22 2009

Re: Slovenian stemmer


Bostjan,

I'm sorry not to have replied to your email sooner. I have been very busy
this past week.

You can see where the stemmer you attached comes from if you go to

http://news.gmane.org/gmane.comp.search.snowball

and type "slovene" in the search box at the bottom. This gives the email
history. The stemmer was done by Bostjan Jerko, and I wanted various points
clarified before adding it in as one of the snowball set, but I eventually
lost touch with Bostjan and the issues were not finally resolved. (I see you
are also "Bostjan", which adds to the confusion.)

Something I wanted resolved was how the stemmer differed from the one
described by Willett and Popovic in,

Popovic M and Willett P (1990) Processing of documents and queries in a
Slovene language free text retrieval system. Literary and Linguistic
Computing, 5: 182-190. 

(I urge you to look at the Willett/Popovic work.)

My changes to Jerko's stemmer did not, I recall, change its functionality,
so the version you have could be used as a test.

It seems to me you have three choices,

[...]

