Martin Porter | 2 Nov 2007 10:25

Re: Problem with spanish stemmer

On Wed, 2007-10-31 at 12:46 -0300, Ignacio Perez wrote:
> My problem was that I was attempting to modify the generated code ... 

Ah, it is not the first time someone has got into difficulties there.
Perhaps we should include a note in a prominent position on the site
explaining that modifying the generated code is almost impossible. The
reason is that each 'among' gives rise to a delicately constructed table
which ceases to work as soon as a small change is made to its contents.
The advantage of these tables is that they make ending lookup very
efficient. The disadvantage is that they can only be changed by going
back to the snowball source and regenerating them.
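
A simplified illustration of this kind of fragility follows. It is only a
sketch and not the table layout the Snowball compiler actually emits: it
assumes a table of endings kept in sorted order and searched by binary
search. The class name, the endings and the hand edit are all made up for
the example.

```java
import java.util.Arrays;

// Illustrative only: NOT the code or table format generated by Snowball.
final class SortedEndingTable {
    // The lookup below is only correct while the entries stay in
    // ascending String.compareTo order.
    static final String[] ENDINGS = { "ando", "ar", "er", "iendo", "ir" };

    // Binary search: fast, but it silently relies on the sort order.
    static boolean isKnownEnding(String[] table, String ending) {
        return Arrays.binarySearch(table, ending) >= 0;
    }

    public static void main(String[] args) {
        // Intact table: the ending is found.
        System.out.println(isKnownEnding(ENDINGS, "iendo")); // prints true

        // Hand edit that overwrites "ar" with "yendo" and so breaks the order.
        String[] edited = { "ando", "yendo", "er", "iendo", "ir" };

        // "yendo" is in the array, yet the search walks past it.
        System.out.println(isKnownEnding(edited, "yendo")); // prints false
    }
}
```

The edited table still compiles and still finds most endings, which is why
this sort of breakage is easy to miss. Regenerating from the snowball
source avoids having to maintain such invariants by hand.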

Martin 
Olly Betts | 2 Nov 2007 20:25

Re: Problem with spanish stemmer

On Fri, Nov 02, 2007 at 09:25:41AM +0000, Martin Porter wrote:
> Perhaps we should include a note in a prominent position on the site
> explaining that modifying the generated code is almost impossible.

Perhaps the comment at the top of the generated file would be the best
place to explain this?

Cheers,
    Olly
Martin Porter | 3 Nov 2007 01:10

Re: Problem with spanish stemmer

On Fri, 2007-11-02 at 19:25 +0000, Olly Betts wrote:
> On Fri, Nov 02, 2007 at 09:25:41AM +0000, Martin Porter wrote:
> > Perhaps we should include a note in a prominent position on the site
> > explaining that modifying the generated code is almost impossible.
> 
> Perhaps the comment at the top of the generated file would be the best
> place to explain this?
> 
> Cheers,
>     Olly
> 

Yes, a good plan!
Mike Wertheim | 16 Nov 2007 07:38

memory use in java version

I'm working on a java web application that will be doing stemming frequently.

I need to decide how many SnowballProgram java objects to keep in
memory.  Currently, I have 14 of these objects in memory -- one for
each language.  When the code needs to stem a word, it gets the
SnowballProgram object for the desired language, synchronizes on that
object, calls the object's "stem" method, and then leaves the
synchronized block.

I'm concerned that this synchronization may become a performance
bottleneck, so I'm considering having a larger number of
SnowballProgram objects in memory.

I'd like to find out how much memory a SnowballProgram object keeps
around between invocations of the "stem" method.  If that's too
general of a question, then I'd like to know how much memory an
englishStemmer keeps in memory after having calculated the stems for
1000 different words.  (Does the SnowballProgram keep an internal
cache of already-stemmed words?)

If anyone else has worked on a similar java web app, I'd appreciate
hearing about what choices you made regarding memory usage, threads,
caching, etc.

Thanks!
Mike
Richard Boulton | 16 Nov 2007 10:09

Re: memory use in java version

Mike Wertheim wrote:
> I'd like to find out how much memory a SnowballProgram object keeps
> around between invocations of the "stem" method.  If that's too
> general of a question, then I'd like to know how much memory an
> englishStemmer keeps in memory after having calculated the stems for
> 1000 different words.  (Does the SnowballProgram keep an internal
> cache of already-stemmed words?)

I can't really help with how much memory it uses - but there is no 
internal cache at all in the snowball code.

-- 
Richard
Martin Porter | 16 Nov 2007 15:52

Re: memory use in java version

On Fri, 2007-11-16 at 09:09 +0000, Richard Boulton wrote:

> 
> I can't really help with how much memory it uses - but there is no 
> internal cache at all in the snowball code.
> 

Mike,

Yes, I would have thought memory use is not an issue. There is the
memory required for the C or java software, but almost nothing else.
Since it's all thread safe, I guess you can throw as many Java
SnowballProgram objects into the pot as you wish without fear of it
spilling over. 
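
One way to use many stemmer objects without a synchronized block is to
give each thread its own instance per language. The sketch below assumes
that a stemmer object holds the word it is working on while it runs, so a
single instance should not be shared by two threads at once. The Stemmer
interface and ToyStemmer class are hypothetical stand-ins, not the real
SnowballProgram classes, and the toy rule is not a real stemming rule.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Keeps one stemmer per (thread, language) pair, created on first use.
final class PerThreadStemmers {
    private final Map<String, Supplier<Stemmer>> factories;
    // Each thread sees its own private map of language -> stemmer.
    private final ThreadLocal<Map<String, Stemmer>> perThread =
            ThreadLocal.withInitial(HashMap::new);

    PerThreadStemmers(Map<String, Supplier<Stemmer>> factories) {
        this.factories = factories;
    }

    String stem(String language, String word) {
        Stemmer stemmer = perThread.get().computeIfAbsent(language, lang -> {
            Supplier<Stemmer> factory = factories.get(lang);
            if (factory == null) {
                throw new IllegalArgumentException("no stemmer for " + lang);
            }
            return factory.get();
        });
        return stemmer.stem(word); // no lock needed: the instance is thread-private
    }

    public static void main(String[] args) {
        Map<String, Supplier<Stemmer>> factories = new HashMap<>();
        factories.put("english", ToyStemmer::new);
        PerThreadStemmers stemmers = new PerThreadStemmers(factories);
        System.out.println(stemmers.stem("english", "stemmers"));
    }
}

// Hypothetical stand-in for a generated stemmer class.
interface Stemmer {
    String stem(String word);
}

// Toy implementation with per-instance scratch state; not a real stemmer.
final class ToyStemmer implements Stemmer {
    private final StringBuilder current = new StringBuilder();

    public String stem(String word) {
        current.setLength(0);
        current.append(word);
        int n = current.length();
        if (n > 1 && current.charAt(n - 1) == 's') {
            current.setLength(n - 1); // drop a single trailing "s"
        }
        return current.toString();
    }
}
```

With this arrangement the number of live stemmer objects is at most the
number of threads times the number of languages actually used, and since
nothing is cached between calls, each one stays small.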

Similar remarks for use with C,

Martin  
Sebastiano Vigna | 20 Nov 2007 11:52

Extending the Java compiler

Dear Martin & Richard,
we are writing to you after doing some work on the Java version of  
libstemmer. We are interested mainly in two areas:

0) efficiency;
1) easy integration with our search engine, MG4J.

For the first issue, we have reworked the Java code, fixing a number  
of problems (instance instead of class variables, etc.), and at present  
we stem about three times faster than before. We have also  
substituted character arrays for strings wherever possible,  
and made other routine Java optimisations.

We propose that you integrate our changes into your distribution, so  
that you can distribute a much faster version and we can avoid the  
problem of releasing a forked compiler for MG4J.

MG4J contains a high-performance implementation of strings called  
MutableString. We want to avoid StringBuffer/String whenever  
possible, and use MutableString instead. To avoid dependency on  
MutableString, however, and the inherent slowness of StringBuffer  
(which is synchronised) at the same time, we propose to compile by  
default for Java 1.5 using StringBuilder instead (the Java  
replacement for StringBuffer). The user will be able, however, to  
supply their own mutable string buffer class, provided it sports the  
StringBuilder methods used by Snowball. Thus, people will be able to  
supply java.lang.StringBuffer, to compile for Java <1.5, or  
it.unimi.dsi.mg4j.util.MutableString, to compile for MG4J. We would  
then integrate in MG4J a customised SnowballProgram, which would  
interface with the stemmers generated by the compiler.
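
The idea of a swappable buffer class can be sketched as follows. Note that
this uses a different mechanism from the one proposed above: the proposal
substitutes a class name when the stemmers are compiled, whereas this
sketch uses a small Java interface and adapter classes. WordBuffer,
BuilderBuffer, LegacyBuffer and the toy suffix rule are all hypothetical
names invented for the example; none of them come from Snowball or MG4J.

```java
// A toy stemming step written only against the WordBuffer methods.
final class SuffixStripper {
    // Not a real Snowball rule: drops one trailing "s" from longer words.
    static void dropTrailingS(WordBuffer word) {
        int n = word.length();
        if (n > 3 && word.charAt(n - 1) == 's') {
            word.setLength(n - 1);
        }
    }

    public static void main(String[] args) {
        WordBuffer fast = new BuilderBuffer("stems");
        WordBuffer legacy = new LegacyBuffer("stems");
        dropTrailingS(fast);
        dropTrailingS(legacy);
        System.out.println(fast + " " + legacy);
    }
}

// Hypothetical interface naming the few StringBuilder-style methods used.
interface WordBuffer {
    int length();
    char charAt(int index);
    void setLength(int newLength);
}

// Adapter over the unsynchronised java.lang.StringBuilder (Java 1.5+).
final class BuilderBuffer implements WordBuffer {
    private final StringBuilder chars;
    BuilderBuffer(String initial) { chars = new StringBuilder(initial); }
    public int length() { return chars.length(); }
    public char charAt(int index) { return chars.charAt(index); }
    public void setLength(int newLength) { chars.setLength(newLength); }
    @Override public String toString() { return chars.toString(); }
}

// Adapter over the synchronised java.lang.StringBuffer, for older Java.
final class LegacyBuffer implements WordBuffer {
    private final StringBuffer chars;
    LegacyBuffer(String initial) { chars = new StringBuffer(initial); }
    public int length() { return chars.length(); }
    public char charAt(int index) { return chars.charAt(index); }
    public void setLength(int newLength) { chars.setLength(newLength); }
    @Override public String toString() { return chars.toString(); }
}
```

An interface adds a virtual call per operation, which may be part of the
reason to prefer substituting the concrete class at generation time when
speed is the main goal.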