Richard Boulton | 3 Mar 2008 21:12

Re: Extending the Java compiler

Sebastiano Vigna wrote:
> We propose that you integrate our changes into your distribution, so 
> that you can distribute a much faster version and we can avoid 
> releasing a forked compiler for MG4J.

I'm pleased to say I've just committed various fixes supplied by you to 
the snowball HEAD.  However, the patch I received didn't quite work (it 
referenced a non-existent AbstractSnowballTermProcessor class, for a 
start), so what I've committed is slightly different to the contents of 
the patch.  For reference, the changes I committed are at 
http://svn.tartarus.org/trunk/snowball/?rev=502&root=Snowball&view=rev

Rather than add a SnowballStemmer class which is a near duplicate of 
SnowballProgram (i.e., identical except for the added stem() method), I 
added a SnowballStemmer class which inherits from SnowballProgram.  I 
think this makes sense, but let me know if it is a problem.  The 
generated stemming algorithms inherit from SnowballStemmer, since they 
all have a stem() method.

> MG4J contains a high-performance implementation of strings called 
> MutableString. We want to avoid StringBuffer/String whenever possible, 
> and use MutableString instead. To avoid dependency on MutableString, 
> however, and the inherent slowness of StringBuffer (which is 
> synchronised) at the same time, we propose to compile by default for 
> Java 1.5 using StringBuilder instead (the Java replacement for 
> StringBuffer).

This change is now applied.  However, in order to allow either 
StringBuffer or StringBuilder to be used, I had to add overloaded 
versions of SnowballProgram.slice_to() and assign_to().  (The other 

Supheakmungkol SARIN | 27 Mar 2008 07:06

Installing problem

Dear all,

I tried to install PyStemmer 1.0.1 on my Ubuntu Linux 7.10 machine, but the installation failed with the following error:

running install
running build
running build_ext
building 'Stemmer' extension
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -O2 -Wall -Wstrict-prototypes -fPIC -Isrc -Ilibstemmer_c/include -I/usr/include/python2.5 -c libstemmer_c/src_c/stem_ISO_8859_1_danish.c -o build/temp.linux-i686-2.5/libstemmer_c/src_c/stem_ISO_8859_1_danish.o
In file included from /usr/lib/gcc/i486-linux-gnu/4.1.3/include/syslimits.h:7,
                 from /usr/lib/gcc/i486-linux-gnu/4.1.3/include/limits.h:11,
                 from libstemmer_c/src_c/../runtime/header.h:2,
                 from libstemmer_c/src_c/stem_ISO_8859_1_danish.c:4:
/usr/lib/gcc/i486-linux-gnu/4.1.3/include/limits.h:122:61: error: limits.h: No such file or directory
error: command 'gcc' failed with exit status 1

Has anyone experienced this before?

Any comments or suggestions would be highly appreciated.

Thank you.

Best regards,

Supheakmungkol

_______________________________________________
Snowball-discuss mailing list
Snowball-discuss <at> lists.tartarus.org
http://lists.tartarus.org/mailman/listinfo/snowball-discuss
Amit Aggarwal | 28 Mar 2008 02:04

Query Expansion with Snowball?

Hello dear Snowball community,
 
I am looking for a query expansion tool in order to make searching better for my site. One way of doing this is to stem the word and then conflate the stem to get all variations of the word. For example,
 
user query: swimming
1. stem: swim
2. conflation: swim, swims, swimming, swam
 
I am curious if you think Snowball fits the bill for this job. I know it can do step 1, but has anybody out there tried to do step 2?
 
If anybody knows of anything else that might be useful, please do share!
 
Thanks.
Amit
 
Martin Porter | 28 Mar 2008 12:22

Re: Query Expansion with Snowball?


Amit,

Yes, this is very much what stemming (and Snowball) is useful for. The
Snowball English stemmer does not conflate 'swam' with 'swim', but that
is all discussed in the introductory account.

Obviously there are many ways of using a stemmer to fit in with an IR
model. One way is to keep a relation of all words with stemmed forms,
(w, s), and expand by applying the projection w->s->w.

swimming -> swim -> swim, swims, swimming
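
The projection above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: toy_stem is a hypothetical stand-in for a real stemmer (it is not the Snowball English algorithm), and build_index and expand are made-up helper names.

```python
from collections import defaultdict


def toy_stem(word):
    # Hypothetical stand-in for a real stemmer; NOT the Snowball algorithm.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Undouble a trailing doubled consonant: "swimm" -> "swim".
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word


def build_index(vocabulary, stem):
    # Build the relation (w, s) as a mapping s -> sorted list of all w.
    index = defaultdict(set)
    for word in vocabulary:
        index[stem(word)].add(word)
    return {s: sorted(words) for s, words in index.items()}


def expand(query, index, stem):
    # Apply the projection w -> s -> w; fall back to the query itself.
    return index.get(stem(query), [query])


vocabulary = ["swim", "swims", "swimming", "swam"]
index = build_index(vocabulary, toy_stem)
print(expand("swimming", index, toy_stem))
```

In practice the vocabulary would be the set of terms in the document collection, and the stem function would be whichever stemmer was used at indexing time. As with the real English stemmer, 'swam' ends up under its own stem here.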

Martin

On Thu, 2008-03-27 at 18:04 -0700, Amit Aggarwal wrote:
> Hello dear Snowball community,
>  
> I am looking for a query expansion tool in order to make searching
> better for my site. One way of doing this is to stem the word and then
> conflate the stem to get all variations of the word. For example,
>  
> user query: swimming
> 1. stem: swim
> 2. conflation: swim, swims, swimming, swam
>  
> I am curious if you think Snowball fits the bill for this job. I know
> it can do step 1, but has anybody out there tried to do step 2? 
>  
> If anybody knows of anything else that might be useful, please do
> share! 
>  
> Thanks.
> Amit
ranapratap.syamala | 31 Mar 2008 18:15

Dutch stemmer: "heden" and "rheden"

Hi,
 
I was looking at the Dutch stemmer and came across a couple of terms from the sample vocabulary provided on the website (http://snowball.tartarus.org/algorithms/dutch/diffs.txt) that stem to themselves:
 
"heden" and "rheden".
 
But when I looked at the rules, it seems that Step 1(b) should apply and the words should be stemmed to "hed" and "rhed" respectively.
 
h   e   d   e   n
            |<->|  R1 (satisfies the R1 adjustment for the German stemmer that the region before R1 should contain at least 3 letters)
 
According to Step 1(b),
  (b) en, ene: delete if in R1 and preceded by a valid en-ending, and then undouble the ending
 
(A valid en-ending is defined as a non-vowel, and not gem.)
 
According to this rule, the "en" suffix should be deleted from the term, since it lies within R1 and is preceded by a valid en-ending, giving the stem "hed".
 
Similarly
 
r   h   e   d   e   n
                |<->|  R1 (satisfies the R1 adjustment for the German stemmer that the region before R1 should contain at least 3 letters)
 
should be stemmed to "rhed"
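 
The reading above can be written out as a small Python sketch. It models only rule (b) as quoted and the R1 adjustment as described; it is not the full Dutch stemmer (the other Step 1 rules and any preprocessing are left out), so its output shows this interpretation rather than what the real algorithm does. The vowel set, the undoubling of kk/dd/tt, and all function names are assumptions made for illustration.

```python
VOWELS = set("aeiouyè")  # assumed Dutch vowel set


def r1_start(word):
    # R1 begins after the first non-vowel that follows a vowel, then is
    # moved right if needed so that at least 3 letters precede it.
    start = len(word)
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            start = i + 1
            break
    return min(max(start, 3), len(word))


def valid_en_ending(stem):
    # "A non-vowel, and not gem", as quoted from the rule.
    return bool(stem) and stem[-1] not in VOWELS and not stem.endswith("gem")


def apply_rule_1b(word):
    # Rule (b) in isolation: delete "en"/"ene" if in R1 and preceded by a
    # valid en-ending, then undouble the ending (assumed: kk, dd, tt).
    for suffix in ("ene", "en"):
        cut = len(word) - len(suffix)
        if word.endswith(suffix) and cut >= r1_start(word):
            stem = word[:cut]
            if valid_en_ending(stem):
                if stem.endswith(("kk", "dd", "tt")):
                    stem = stem[:-1]
                return stem
            break
    return word


for w in ("heden", "rheden"):
    print(w, "R1 =", repr(w[r1_start(w):]), "->", apply_rule_1b(w))
```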
 
I am wondering whether I am missing something or misinterpreting the rule.
 
Thanks
Rana
 
 
 
