Tolun ARDAHANLI | 2 May 2007 23:16

hi i need your help...

hi to everybody;

I need some help with the "extended Boolean model". Does anybody know where I can find the algorithm and some example C code for it?

thanks for all... sincerely...


Tolun ARDAHANLI
Debian GNU/Linux... forever...
(old MCSE/MCDBA)
new Linux user...;-)...

_______________________________________________
Snowball-discuss mailing list
Snowball-discuss <at> lists.tartarus.org
http://lists.tartarus.org/mailman/listinfo/snowball-discuss
Oleg Bartunov | 4 May 2007 13:23

Snowball API versioning

Hi there,

I'm asking about API versioning to let third-party products track
snowball changes. We use snowball stemmer in our full-text search engine for 
PostgreSQL. Currently, it's an extension module, but we expect it will become
the built-in core FTS in the next release, which should happen in two months.
There were several (we noticed 2) API changes this year. It's a headache!
I suggest following a standard major.minor versioning scheme.

 	Regards,
 		Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg <at> sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Richard Boulton | 4 May 2007 15:42

Re: Snowball API versioning

Oleg Bartunov wrote:
> Hi there,
> 
> I'm asking about API versioning to let third-party products track
> snowball changes. We use snowball stemmer in our full-text search engine
> for PostgreSQL. Currently, it's extension module but we expect it will
> became built-in core FTS in the next release, which should happen in two
> months.
> There were several (we noticed 2) API changes this year. It's headache !
> I suggest to follow standard versioning scheme major.minor.

Some kind of versioning would indeed be a good idea, but it's not clear 
to me what the API changes you're referring to are: as far as I can see, 
these are the ways in which the code accessible from snowball changes:

1. Internal changes to the compiler, resulting in the generated stemmer 
code being different, but behaving the same.
2. New features being added to the snowball language, but old .sbl files 
will still produce equivalent output.
3. Changes to the definition of the snowball language, resulting in .sbl 
files no longer producing equivalent output.
4. Changes to a snowball script, such that it produces different output.
5. Changes to the libstemmer interface (i.e., the libstemmer.h file, for C).

IIRC, there have been several changes of type 1, but none of 2 or 3 in 
recent months/years.  There have been no changes to libstemmer.h since 
August 2005.

Therefore, I suspect you're talking about changes of type 4.  I would 
like to add versioning to the stemming algorithms at some point, such 
that each change to an algorithm increments the version number, but 
haven't had time to do this yet.

Also, I would like to modify the libstemmer interface such that the 
current version of a stemming algorithm can be obtained, and also such 
that a particular version of a stemming algorithm can be requested.  It 
would also be possible to compile a version of libstemmer such that 
several old versions of a particular stemmer were available.  This would 
allow a database to store the stemmer version number which was used to 
index with, so that searches can use the same stemmer version.  However, 
a newly created database would simply use the latest stemmer version. 
Again, I simply haven't had time to do this yet.
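
A minimal sketch of the idea above, assuming a hypothetical registry keyed by
algorithm name and version number. None of these names exist in libstemmer,
and the two "stemmers" are toy functions used only to make the versions differ:

```python
class StemmerRegistry:
    """Hypothetical registry of stemmer versions (not part of libstemmer)."""

    def __init__(self):
        self._versions = {}  # algorithm name -> {version number: function}

    def register(self, algorithm, version, func):
        self._versions.setdefault(algorithm, {})[version] = func

    def latest_version(self, algorithm):
        return max(self._versions[algorithm])

    def get(self, algorithm, version=None):
        # With no version given, fall back to the newest one available.
        if version is None:
            version = self.latest_version(algorithm)
        return self._versions[algorithm][version]


def toy_stem_v1(word):
    # Toy rule for illustration only: drop a final "s".
    return word[:-1] if word.endswith("s") else word


def toy_stem_v2(word):
    # Toy rule for illustration only: drop a final "os", else behave like v1.
    return word[:-2] if word.endswith("os") else toy_stem_v1(word)


registry = StemmerRegistry()
registry.register("toy", 1, toy_stem_v1)
registry.register("toy", 2, toy_stem_v2)


def new_database(algorithm):
    # A newly created database records the latest stemmer version.
    return {"algorithm": algorithm,
            "stemmer_version": registry.latest_version(algorithm)}


def stem_for(db, word):
    # Searches use the version the database was indexed with.
    return registry.get(db["algorithm"], db["stemmer_version"])(word)
```

A database indexed with version 1 keeps getting version 1 stems at search
time, while `new_database("toy")` picks up version 2.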

For now, I recommend that you simply take a copy of libstemmer into your 
distribution, and update that static version of libstemmer as 
appropriate when you make new releases of your distribution.

I don't think that a major.minor versioning scheme would be appropriate 
here, but maybe you are thinking of something different to me (in which 
case, please enlighten me).

-- 
Richard
Andrew Green | 19 May 2007 07:18

Spanish stemmer with accents stripped before stemming

Hello!

We're trying to fine-tune an application we're building that searches
mainly Spanish content.  We've been working with Lucene and the included
Spanish snowball stemmer.

We are removing accents before stemming because misplaced-accent errors
are extremely common--we cannot expect users to enter correctly accented
terms in their queries.

Our approach so far has been to simply replace accented characters in
the Spanish stemmer Java code with unaccented ones. This has produced a
stemmer that works--to a degree--but some simple terms are no longer
stemmed correctly. Specifically, regular plural masculine nouns
("libros" or "datos") are not stemmed at all.

Still, after reading the explanation of the Spanish algorithm [1] I
can't see why this would be the case--or why the removal of accented
characters from both the code and stemmer input would affect the
algorithm's effectiveness at all.

You can see our code here [2] and here [3].

Any suggestions?

Thanks in advance,
Andrew Green

[1] http://snowball.tartarus.org/algorithms/spanish/stemmer.html
[2] http://200.67.231.185/svn/pescador/trunk/java_source/net/sf/snowball/ext/Spanish2Stemmer.java
[3] http://200.67.231.185/svn/pescador/trunk/java_source/org/apache/lucene/analysis/snowball/SnowballAnalyzerWithoutAccents.java
Martin Porter | 21 May 2007 13:49

Re: Spanish stemmer with accents stripped before stemming


I find I can't connect to 200.67.231.185, so I'm not too sure what's
going on here. Obviously, for us it is a bit easier to look at the problem
from the snowball angle, rather than think about the generated java
after it's been put inside lucene! As far as the snowball script is
concerned, I believe you could strip out accents from the source,
eliminate the duplicate strings in the amongs(..) that would result, and
recompile, getting the effect you want.
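
A rough sketch of that preprocessing step, assuming the among strings have
been pulled out into a plain list. The suffix list in the usage note is
illustrative, not copied from the Spanish stem.sbl, and the helper names are
made up. Only the acute accent is removed here, so "ñ" and "ü" are left alone;
that choice is an assumption, not something settled in this thread:

```python
import unicodedata

ACUTE = "\u0301"  # combining acute accent


def strip_acute(text):
    # Decompose, drop combining acutes, then recompose what is left.
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(ACUTE, ""))


def dedupe_among(strings):
    # Strip accents from each string and keep the first copy of each result,
    # preserving the original order.
    seen = set()
    result = []
    for s in strings:
        plain = strip_acute(s)
        if plain not in seen:
            seen.add(plain)
            result.append(plain)
    return result
```

For example, `dedupe_among(["iéndo", "ándo", "ár", "ando", "iendo", "ar"])`
leaves three strings, "iendo", "ando" and "ar". The same `strip_acute` would
be applied to the stemmer input.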

(Incidentally, I have hit this problem with Spanish stemming before, but
it was a long while ago -- before the development of snowball.)

Also, I'm not familiar with the generated Java output. I don't know
if Richard Boulton (who wrote the Java code generator) has anything more
to add at this stage?

Martin
Andrew Green | 23 May 2007 01:09

Re: Spanish stemmer with accents stripped before stemming

Hi, Martin,

Thank you very much for your replies...

>Obviously to us, it a bit easier to look at the problem
> from the snowball angle, rather than think about the generated java
> after it's been put inside lucene! As far as the snowball script is
> concerned, I believe you could strip out accents from the source,
> eliminate the duplicate strings in the amongs(..) that would result, and
> recompile, getting the effect you want.

OK... We just tried that, and it works very well so far! Thanks a bunch.

> (Incidentally, I have hit this problem with Spanish stemming before, but
> it was a long while ago -- before the development of snowball.)

Accents are the boogeyman of day-to-day written Spanish usage and it's
hard to imagine an effective search engine that obliges users to type
them correctly.

> Thinking about it further, this will not work, since the strings are
> placed in tables which would need to be fully reorganised if any of
> the
> characters in the strings were readjusted. (it is the way a snowball
> 'among' is implemented).
> 
> The only way to do this is to modify the stem.sbl file for Spanish,
> regenerate the java code with the snowball compiler (which you can
> download) and replace the old java with the new in your application.

Ah... So now we know why our previous attempt failed.

It occurs to me that perhaps it would be a good idea to modify
Snowball's Spanish stemmer to accept both accented and accent-stripped
input.

Greetings,
Andrew Green

P.S. Our little server was down during the weekend--sorry--it's back
online again--though now a commit error to the repository has made the
relevant files difficult to access--though now they seem less relevant,
I suppose.
Martin Porter | 24 May 2007 11:29

Re: Spanish stemmer with accents stripped before stemming


Andrew,

> It occurs to me that perhaps it would be a good idea to modify
> Snowball's Spanish stemmer to accept both accented and accent-stripped
> input.

I think that is a good point. As you say, "accents are the bogeyman of
day-to-day written Spanish usage and it's hard to imagine an effective
search engine that obliges users to type them correctly."

The occasion when I came across this problem before was news data in
Spanish where the placing of accents was very untrustworthy. There is a
variant of the Snowball German stemmer in which an umlaut is represented
by a following e, but there are no variants for the Romance language
stemmers.

I'm not sure what the deal is for Portuguese, but Spanish is as you
describe it. In French, accents are applied quite rigorously,
except that they can be omitted when the text is entirely in
upper case. (But is that stylistic feature less prevalent than it was a
century ago? I'm not sure ...) Anyway, keeping accents in place with
French does not seem to be problematic.

Italian presents an interesting case. They use acute and grave, but not
by any consistent rule. There are different schemes for how acute/grave
is applied, which varies (or used to vary) among publishing houses. This
is why the Italian stemmer begins with the strange operation of
replacing all acutes with graves. A critical ending is then -o+accent,
but even if the accent is absent, -o is a similar ending, and will be
removed by the same rule (compare porto`, he carried, with porto, I
carry). The result is that the Italian stemmer does not behave very
differently on texts with all accents stripped.
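
The acute-to-grave step can be sketched as follows. This is only an
illustration of that one normalisation, not the Italian stemmer itself, and
the function name is made up:

```python
import unicodedata

# Map each acute-accented vowel to its grave-accented counterpart.
ACUTE_TO_GRAVE = str.maketrans("áéíóú", "àèìòù")


def acutes_to_graves(text):
    # Recompose first so that "e" + combining acute is handled as well.
    return unicodedata.normalize("NFC", text).translate(ACUTE_TO_GRAVE)
```

With this, "perché" and "perchè" both come out as "perchè", so later rules
only have to deal with one accent.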

We'll keep your suggestion in mind as a Snowball development.

Martin
