Michael Schlenker | 1 Aug 2002 14:26

Tcl binding for Snowball stemmers

Hi all,

I just built a small Tcl binding for the Snowball stemmers.

You can build a Tcl extension (.dll / .so) with it to use the ANSI C
stemmers in Tcl. At the moment only four are included (porter1, porter2,
german, french), but others can be added easily.

I've uploaded it to patches at www.sf.net/projects/snowball, so if you
want to take a look...

Michael Schlenker

------------
A little toy app for the binding (run it with wish or tclkit):

#!/bin/sh
# \
exec wish "$0" "$@"

package require Tk
package require Tclsnowball

proc do_stem {word} {
        set ::result1 [::snowball::stem porter $word]
        set ::result2 [::snowball::stem porter2 $word]
        set ::result3 [::snowball::stem german $word]
        set ::result4 [::snowball::stem french $word]
}


Martin Porter | 1 Aug 2002 23:15

Re: Tcl binding for Snowball stemmers


Thanks Michael, that sounds interesting. I will take a look.

Some binding extensions to the stemmers have been incorporated on the
Snowball website - I don't know how you feel about that ...

Martin

-------------------------------------------------------
This sf.net email is sponsored by:ThinkGeek
Welcome to geek heaven.
http://thinkgeek.com/sf
Martin Porter | 3 Aug 2002 10:22

Re: About russian stemmer


Andrey,

You are right! Some links are missing. I wonder how that can have happened
... I will restore the links and make sure you can download tarball.tgz.

I'll try and do it today (I need to use another machine)

More later -

Martin

At 01:28 AM 8/3/02 +0400, Andrey Brindeew wrote:
>Hi, Martin!
>
>Some links on the following page are broken:
>http://snowball.sourceforge.net/russian/stemmer.html
>I would very much like to get the file 'tarball.tgz' from this page.
>Can you mail this file to me?
>
>-- 
>WBR, Andrey Brindeew.
>"No one person can understand Perl culture completely"
>(C) Larry Wall.
>


Martin Porter | 3 Aug 2002 18:04

Re: About russian stemmer


Andrey,

I think everything you immediately need is now in place, although certain
files are still missing. We'll restore those shortly,

Martin

Martin Porter | 3 Aug 2002 18:08

Re: About russian stemmer


Richard,

The separate *.tgz files in the language directories are still not back in
place, and I think the Makefile on Ixion must have failed. Could you
investigate? I can of course transmit everything from Norwich, but only want
to do that if we give up on the Makefile completely, which would be a pity.

Martin

Martin Porter | 3 Aug 2002 21:52

Re: About russian stemmer


Richard,

But of course I did a cvs delete on the *.tgz files some time ago, which I
suppose will be reflected on Ixion. I remember discussing it with you, and
you explained it would be okay, since they'd be created by the Makefile. But
that was the Makefile at sourceforge perhaps. Anyway, I'll do a cvs add on
those files tomorrow, and a cvs update. Then at least they'll go back in.

I am getting confused thinking about this ... 

Martin

Martin Porter | 17 Aug 2002 11:51

(no subject)


Enea,

I think there are two issues. First, -ito may be taken as a past participle
ending of a verb when it is simply part of the stem. 'abito' and 'soprabito'
are examples here, and these are what the introductory paper at

    http://snowball.sourceforge.net/texts/introduction.html

calls mis-stemming errors. Second, -ito really is a past participle ending,
but the past participle form has come to acquire a meaning distinct from the
usual verb form. These are called over-stemming errors.

When developing the Italian stemmer I was very conscious of the problem of
over-stemming with past participles. For example 'bandire' is 'to banish',
and the p.p. form 'bandito' can mean 'banished'. But as often it has the
meaning of 'outlaw' - someone who has been banished. We have the same word
in English of course: 'bandit'. The p.p. has come to acquire a precise
meaning of its own. This happens in all the Romance languages, but is
especially noteworthy in Italian. 

One of your examples is of this type. 'marito' is from 'marire', although I
suppose 'sposare/sposarsi' is the usual verb nowadays. 

'prestito' is not the p.p. of 'prestare', but the two forms are connected in
an obvious way. Less obviously 'spirare' and 'spirito' are connected
(because 'spirit' used to be thought of as 'breath'). Your other examples,
'solito' etc., are simple mis-stemmings.
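
The two kinds of error described above can be illustrated with a toy suffix
stripper. This is only a sketch: the function name strip_ito is hypothetical,
and the rule (remove a final "ito" and nothing else) is deliberately naive. It
is not the Snowball Italian stemmer, which restricts suffix removal to regions
of the word.

```c
#include <string.h>

/* Deliberately naive illustration, NOT the Snowball Italian algorithm.
   Copies word into out and removes a final "ito" if there is one.
   Returns 1 if the ending was removed, 0 if not, -1 if out is too small. */
int strip_ito(const char *word, char *out, size_t out_size) {
    size_t n = strlen(word);
    if (n + 1 > out_size) return -1;
    memcpy(out, word, n + 1);               /* copy including the terminator */
    if (n > 3 && strcmp(word + n - 3, "ito") == 0) {
        out[n - 3] = '\0';                  /* cut the word before "ito" */
        return 1;
    }
    return 0;
}
```

With this rule 'abito' becomes 'ab', where -ito was part of the stem (a
mis-stemming), and 'bandito' becomes 'band', which joins it to the forms of
'bandire' even where the meaning is 'outlaw' (an over-stemming). A purely
orthographic rule cannot tell the two cases apart.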

Does it matter? Not necessarily. See the discussion in the introductory

Martin Porter | 17 Aug 2002 13:36

Re: Error in Spanish stemming algorithm


Mabel,

Could you send or copy any future Snowball related emails to

    snowball-discuss@lists.sourceforge.net

please? 

To answer your specific question, the definitions of R1 and RV are
independent of each other, and start with the whole word. Another way of
looking at RV is to see four cases:

    VC...V|...
    CC...V|...
    VV...C|...
    CV.|...   

If the word begins VC (vowel consonant) RV ends after the next V, if it
begins CC RV ends after the next V, if it begins VV RV ends after the next
C, and if it begins CV RV ends after the third letter. Or RV is the whole
word if none of these patterns apply.

Apart from that I can't really help with your Visual Basic implementation
until you give me an example word which it stems differently from the
Snowball algorithm. If you will do that, I can send you back a picture of
how the regions are defined and how the word is treated by the different steps.

It is possible that the given Spanish algorithm is incorrect: we found
errors in the Russian algorithm definition when it came to be recoded in

Martin Porter | 18 Aug 2002 06:20

Re: Error in Spanish stemming algorithm


Mabel,

Well, I should have said,

If the word begins VC (vowel consonant), RV begins after the next V, if it
begins CC, RV begins after the next V, if it begins VV, RV begins after the
next C, and if it begins CV, RV begins after the third letter. Or RV is the
whole word if none of these patterns apply.

Easy to make these small slips.

If you don't sort it out, send the VB program and I'll have a look at it.

Martin

Martin Porter | 27 Aug 2002 21:08

Re: Compile Errors ....


Dear Art,

(Snowball messages should be sent to snowball-discuss@lists.sourceforge.net)

I can hardly criticise you for using Borland 5.1 when I see that my own
version is Borland 4.02! It is a while since I've used the Borland system,
but I recall a clear distinction between warning and error messages, and
would have thought its complaints about operator ambiguities ought to fall
into the former category.

Have there been versions of C where the binding powers of operators were
different from the contemporary ANSI C definitions? I find it odd how
compilers so often attach importance to this stuff.

Anyway, the fully bracketed expressions are of course

  if (min == (1<<16)) error(a, 16);
  for (i = 12; i >= 0; i -= 4) write_hexdigit(g, (ch >> i) & 0xf);
  bit_is_set(symbol * b, int i) { return b[i/8] & (1 << (i%8)); }
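
A small sketch of why the brackets change nothing under ANSI C precedence:
each pair below has one function without brackets and one fully bracketed, and
a conforming compiler should give the same result for both. The function names
are hypothetical, the typedef of symbol as unsigned char is an assumption made
to keep the example self-contained, and int is assumed to be wider than 16
bits so that 1<<16 does not overflow.

```c
/* Assumption: symbol is a byte type here; the typedef in the Snowball
   sources may differ. */
typedef unsigned char symbol;

/* '/' and '%' bind tighter than '<<', which binds tighter than '&'. */
int bit_is_set_plain(const symbol *b, int i) {
    return b[i/8] & 1 << i%8;
}
int bit_is_set_bracketed(const symbol *b, int i) {
    return b[i/8] & (1 << (i%8));
}

/* '<<' binds tighter than '=='. */
int is_64k_plain(int min)     { return min == 1<<16; }
int is_64k_bracketed(int min) { return min == (1<<16); }

/* '>>' binds tighter than '&'. */
int hexdigit_plain(int ch, int i)     { return ch >> i & 0xf; }
int hexdigit_bracketed(int ch, int i) { return (ch >> i) & 0xf; }
```

If a compiler gives different results for the two members of a pair, it is
not following the ANSI C operator table; if it only complains, the bracketed
form is the one to keep.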

Tell us how you get on. It will be more of a problem if you hit the same
thing in the C output generated by Snowball.

Martin

At 10:23 AM 8/27/02 -0600, Art Pollard wrote:
>
>Martin:
>