Yulia Kuksin | 2 Apr 2003 00:49

java source codes

dear Martin,
I am a linguist and working on cross-language information retrieval.
Thank you very much for your site (and your work). I was lucky to find all
stemmers and stopword lists I need. I downloaded the tarball for java
and compiled the source files from the /snowball subdirectory. But when I try
to compile the stemmers from /snowball/ext I get the prompt back, as if the
compilation was successful, but no .class files. I tried on Linux at my
department and at home (Windows) and failed equally. I am a poor programmer,
but a good linguist, and shall be thankful for help and glad if I can be
helpful in any way.
yours,
Yulia Kuksin
Richard Boulton | 2 Apr 2003 01:09

Re: java source codes

On Tue, 2003-04-01 at 23:49, Yulia Kuksin wrote:
> dear Martin,
> I am a linguist and working on cross-language information retrieval.
> Thank you very much for your site (and your work). I was lucky to find all
> stemmers and stopword lists I need. I downloaded the tarball for java
> and compiled the source files from the /snowball subdirectory. But when I try
> to compile the stemmers from /snowball/ext I get the prompt back, as if the
> compilation was successful, but no .class files. I tried on Linux at my
> department and at home (Windows) and failed equally. I am a poor programmer,
> but a good linguist, and shall be thankful for help and glad if I can be
> helpful in any way.
> yours,
> Yulia Kuksin

I shall answer this, since I wrote the Java stuff.

I'm not sure why you're having problems, so I'll just describe how I
would compile (on linux).  If you try the same commands and it doesn't
work, it would be useful to know a bit about your environment: in
particular, what version of Java, and what java compiler you are using.

Having downloaded the tarball "snowball_java.tgz":

$ tar zxf snowball_java.tgz
$ javac net/sf/snowball/Among.java
$ javac net/sf/snowball/SnowballProgram.java
$ javac net/sf/snowball/TestApp.java
$ javac net/sf/snowball/ext/*.java
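
One way to confirm the symptom reported above (a silent compile that leaves no .class files) is to list the sources that have no matching class file after running the commands. The following is a minimal Python sketch, not part of the Snowball distribution: the function name missing_class_files is hypothetical, and it assumes each X.java is expected to produce an X.class in the same directory.

```python
from pathlib import Path
from typing import List


def missing_class_files(source_dir: str) -> List[str]:
    """Return the names of .java files in source_dir with no sibling .class file.

    Assumption: javac was run without -d, so X.java should yield X.class
    next to it. Files that only define differently named classes would be
    reported here even though they compiled.
    """
    src = Path(source_dir)
    return sorted(
        java.name
        for java in src.glob("*.java")
        if not java.with_suffix(".class").exists()
    )


if __name__ == "__main__":
    # Directory taken from the javac commands above; adjust to where the
    # tarball was unpacked.
    for name in missing_class_files("net/sf/snowball/ext"):
        print("no class file for", name)
```

An empty result means every source in the directory has a class file beside it. If sources are listed, the javac version and the directory the commands were run from are the first things worth checking.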

Chris Cleveland | 2 Apr 2003 18:36

Java source

Speaking of Java source code, where is the latest version of snowball_java.tgz? I can't seem to find it on
the Tartarus site.
Olly Betts | 2 Apr 2003 18:39

Re: Java source

On Wed, Apr 02, 2003 at 10:36:38AM -0600, Chris Cleveland wrote:
> Speaking of Java source code, where is the latest version of
> snowball_java.tgz? I can't seem to find it on the Tartarus site.

http://snowball.tartarus.org/snowball_java.tgzo

Not sure where that's linked to from (I found it using locate from my
account on the server) - couldn't see an obvious link.

Cheers,
    Olly
Olly Betts | 2 Apr 2003 19:33

Re: Java source

On Wed, Apr 02, 2003 at 05:39:50PM +0100, Olly Betts wrote:
> On Wed, Apr 02, 2003 at 10:36:38AM -0600, Chris Cleveland wrote:
> > Speaking of Java source code, where is the latest version of
> > snowball_java.tgz? I can't seem to find it on the Tartarus site.
> 
> http://snowball.tartarus.org/snowball_java.tgzo

Umm, there's a surplus "o" there of course:

http://snowball.tartarus.org/snowball_java.tgz

> Not sure where that's linked to from (I found it using locate from my
> account on the server) - couldn't see an obvious link.

Found it, the link is from:

http://snowball.tartarus.org/q/use.html

The HTML there says:

  To run Java, download the

  <a HREF=" http://snowball.tartarus.org/snowball_java.tgz">tarball</a> at
  , which will unpack into an
  appropriate directory structure.

The leading space in the HREF might cause a problem for some browsers,
and the text would read better with "at" removed.

Cheers,

Martin Porter | 2 Apr 2003 23:55

Re: Java source


Okay, I'll look into all this. The java stuff has always been a bit more
obscure than the C stuff, I think, and needs to be made equally prominent.
And we'll fix the dodgy link (is that my work or yours, Richard?).

>  <a HREF=" http://snowball.tartarus.org/snowball_java.tgz">tarball</a> at
>  , which will unpack into an
>  appropriate directory structure.
>
>The leading space in the HREF might cause a problem for some browsers,
>and the text would read better with "at" removed.
>
>Cheers,
>    Olly
>
Martin Porter | 11 Apr 2003 13:02

Re: Porter Stemmer Question


Jason,

Thanks for your email. I think you will need to adapt my work for your
needs, but you might find the following useful:

The Porter stemmer behaves less well than the Porter2 stemmer here, but to
recover the suffix in the Porter stemmer, do the following:

If W stems to S, put a mark in W after the longest common prefix of S and W,
e.g.

    W = rat|ing
    S = rate

then -

advance the mark if it precedes 'y' (alimon|y -> alimony|)
advance the mark if it separates a double consonant (cut|ting -> cutt|ing)
other than double l before y (thoughtful|ly is left alone).
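
The marking rules above can be sketched in Python. This is a minimal sketch under stated assumptions, not Martin's own code: the function name split_suffix is hypothetical, the stem S is passed in rather than computed, the two "advance" rules are treated as alternatives applied at most once, a "double consonant" is taken to mean the same non-vowel letter on both sides of the mark, and "double l before y" is read as "ll" immediately followed by "y".

```python
from typing import Tuple

VOWELS = set("aeiou")


def split_suffix(word: str, stem: str) -> Tuple[str, str]:
    """Split word W into (head, ending) given its stem S."""
    # Place the mark after the longest common prefix of S and W.
    mark = 0
    limit = min(len(word), len(stem))
    while mark < limit and word[mark] == stem[mark]:
        mark += 1

    if mark < len(word) and word[mark] == "y":
        # Advance the mark if it precedes 'y' (alimon|y -> alimony|).
        mark += 1
    elif (
        0 < mark < len(word)
        and word[mark - 1] == word[mark]
        and word[mark] not in VOWELS
    ):
        # The mark separates a double consonant (cut|ting -> cutt|ing),
        # except double l before y (thoughtful|ly is left alone).
        double_l_before_y = word[mark] == "l" and word[mark + 1:mark + 2] == "y"
        if not double_l_before_y:
            mark += 1

    return word[:mark], word[mark:]


if __name__ == "__main__":
    # The stems here are illustrative inputs chosen to match the marks shown
    # above; they are not output from a stemmer run.
    for w, s in [("rating", "rate"), ("cutting", "cut"),
                 ("thoughtfully", "thoughtful"), ("alimony", "alimoni")]:
        head, ending = split_suffix(w, s)
        print(f"{head}|{ending}")
```

Collecting the second element of each result over a vocabulary gives a list of endings of the kind shown next.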

With my sample vocab, you then get these endings,

ability
able
ableness
ables
ably
al
alities

Benjamin Franz | 14 Apr 2003 23:12

Lingua::Stem

I'm the maintainer of the Perl 'Lingua::Stem' module. I've released a new
version that 'wrappers' the Snowball based Perl stemmers (along with some
non-Snowball based versions) into the standardized Lingua::Stem API.

Something I noticed today while looking for anyone who might be using the
Lingua::Stem Perl module was that last year it was mentioned as being a
poor performer here on the Snowball-Discuss list. After examining the
benchmark code used (see
<URL:http://www.snowball.tartarus.org/archives/snowball-discuss/0193.html>)
I discovered the main reason it performed poorly in the tests was it was
being used in its absolutely lowest performing mode (one word at a time,
no stem caching). I thought it would be worth re-doing the benchmark using
its faster modes. So, here is the benchmark code redone to take full
advantage of Lingua::Stem's performance features:

#!/usr/bin/perl

use Benchmark;
use Lingua::Stem qw (:all :caching);

my @word = grep chomp, <>;

#################################################
# Preload word list so we have identical runs
my @word_list = ();
my $s = 100;
for (1..$s) {
  my $result;
  my $w = @word[rand(scalar(@word))];
  push (@word_list,$w);

Allan Fields | 15 Apr 2003 04:13

Re: Benchmarking Lingua::Stem

Hi,
On Mon, Apr 14, 2003 at 09:04:19AM -0700, Benjamin Franz wrote:
> On Mon, 14 Apr 2003, Benjamin Franz wrote:
> 
> > While doing a little Googling to find out if/how people are using Perl 
> > modules I've written, I ran across the Snowball-discuss list benchmarking 
> > of Lingua::Stem last year 
> > <URL:http://www.snowball.tartarus.org/archives/snowball-discuss/0193.html>
> > 
> > I feel that your criticism of the performance of Lingua::Stem is
> > mis-placed. Your benchmark used it in its _lowest_ performance mode (one
> > word at a time, caching disabled). If you process _all_ the words in one
> > pass you will find its performance _exceeds_ its competition - by quite a
> > large margin. That is _without_ even turning on the stem caching system 
> > (which can multiply the performance several times on large stemming 
> > operations).

Yes, that's true..

I guess this should probably be followed up on the list as a point of
clarification.  When I did those benchmarks, I did them assuming there would
be only one word per call, and many (multiple) calls of the subroutines, so
it followed that subroutine overhead was as much an issue as the
implementation of the algorithm itself.  I remember looking at the batch
features, but deciding to benchmark it the way I was expecting to call it.
In hindsight, that probably wasn't the fairest benchmarking strategy (and
it was a quick benchmark.)
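
The two effects discussed here, per-call overhead versus batch processing and the benefit of stem caching, can be illustrated with a small Python sketch. It is not Lingua::Stem's implementation or API: the names toy_stem and CachingStemmer are hypothetical, and toy_stem is a deliberately crude stand-in for a real stemmer.

```python
from typing import Callable, Dict, Iterable, List


def toy_stem(word: str) -> str:
    """Hypothetical stand-in stemmer: strips a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


class CachingStemmer:
    """Stems a whole batch per call and remembers stems already computed."""

    def __init__(self, stem_fn: Callable[[str], str]) -> None:
        self._stem_fn = stem_fn
        self._cache: Dict[str, str] = {}
        self.misses = 0  # number of times the underlying stemmer actually ran

    def stem_words(self, words: Iterable[str]) -> List[str]:
        stems = []
        for word in words:
            key = word.lower()
            if key not in self._cache:
                self._cache[key] = self._stem_fn(key)
                self.misses += 1
            stems.append(self._cache[key])
        return stems


if __name__ == "__main__":
    stemmer = CachingStemmer(toy_stem)
    print(stemmer.stem_words(["rating", "rating", "cats", "Cats"]))
    print("stemmer invoked", stemmer.misses, "times for 4 words")
```

Passing the whole word list in one call pays the call overhead once, and repeated words cost only a dictionary lookup. A benchmark that calls a stemmer once per word with caching off measures neither effect, which is the point being conceded above.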

As for my other point: although your stemmer was one of the most featureful,
I just hadn't understood why some of the program structure used symbolic

Oleg Bartunov | 15 Apr 2003 04:46

Re: Lingua::Stem

Did you try Lingua::Stem::Snowball?
It's not a pure-Perl wrapper but uses XS, so it should be much faster.
Also, it doesn't add "additional errors" :-) I recall Martin
was worried about the many errors in home-made stemmers that claim to be
the Porter stemmer.

	Oleg
On Mon, 14 Apr 2003, Benjamin Franz wrote:

> I'm the maintainer of the Perl 'Lingua::Stem' module. I've released a new
> version that 'wrappers' the Snowball based Perl stemmers (along with some
> non-Snowball based versions) into the standardized Lingua::Stem API.
>
> Something I noticed today while looking for anyone who might be using the
> Lingua::Stem Perl module was that last year it was mentioned as being a
> poor performer here on the Snowball-Discuss list. After examining the
> benchmark code used (see
> <URL:http://www.snowball.tartarus.org/archives/snowball-discuss/0193.html>)
> I discovered the main reason it performed poorly in the tests was it was
> being used in its absolutely lowest performing mode (one word at a time,
> no stem caching). I thought it would be worth re-doing the benchmark using
> its faster modes. So, here is the benchmark code redone to take full
> advantage of Lingua::Stem's performance features:
>
> #!/usr/bin/perl
>
> use Benchmark;
> use Lingua::Stem qw (:all :caching);
>
> my @word = grep chomp, <>;

