Olly Betts | 2 Jun 2006 15:34

Tclsnowball

I stumbled across these Tcl wrappers for snowball, by Michael Schlenker:

http://wiki.tcl.tk/8699

Perhaps they should be listed here:

http://snowball.tartarus.org/projects.php

Cheers,
    Olly
Richard Boulton | 4 Jun 2006 07:38

Re: Tclsnowball

On Fri, Jun 02, 2006 at 02:34:06PM +0100, Olly Betts wrote:
> I stumbled across these Tcl wrappers for snowball, by Michael Schlenker:
> 
> http://wiki.tcl.tk/8699

Thanks - I've added a link to projects.php
Erwin Glockner | 7 Jun 2006 00:06

romanian stemmer

Hello everyone,

my name is Erwín Glockner, and I'm a student of computational linguistics in 
Heidelberg, Germany. Together with my fellow students Doina Gliga and 
Marina Stegarescu, I have started writing a Romanian stemmer in Snowball.
We plan to finish the stemmer by the end of this month, and we would be 
happy if it were accepted as part of the Snowball distribution.
There is still some work to do, e.g. evaluating the stemmer, making a 
stopword list, adding Unicode support, etc. Once that is finished we will send 
you the stemmer with the corresponding files, but I couldn't find any 
email address to which it should be sent.
Could someone please tell me the address(es)?

With kind regards,
E. Glockner, D. Gliga, M. Stegarescu.
richard | 7 Jun 2006 02:44

Re: romanian stemmer

On Wed, Jun 07, 2006 at 12:06:30AM +0200, Erwin Glockner wrote:
> my name is Erwín Glockner, and I'm a student of computational linguistics in 
> Heidelberg, Germany. Together with my fellow students Doina Gliga and 
> Marina Stegarescu, I have started writing a Romanian stemmer in Snowball.
> We plan to finish the stemmer by the end of this month, and we would be 
> happy if it were accepted as part of the Snowball distribution.

Excellent news, we're always happy to receive new algorithms.  I assume this
means you are happy to license the stemmer under the same terms as the existing
Snowball software (i.e. the BSD license; see
http://snowball.tartarus.org/license.php).

> There is still some work to do, e.g. evaluating the stemmer, making a 
> stopword list, adding Unicode support, etc. Once that is finished we will send 
> you the stemmer with the corresponding files, but I couldn't find any 
> email address to which it should be sent.
> Could someone please tell me the address(es)?

If you wish, you can send it directly to me (richard <at> lemurconsulting.com is the
best address to use).  Martin may chip in asking for a copy to be sent to him,
too, but I'll leave that up to him.  You could also try sending it directly to
this mailing list, though I fear it may be set to strip out attachments so that
could be futile.

We prefer to keep as much discussion as possible on this mailing list, so we'll
probably CC any follow-up messages to this list.

Thanks for the interest.


Martin Porter | 7 Jun 2006 10:01

Re: romanian stemmer


Yes, excellent news.

Copy me in on anything you send. 

Martin
Richard Boulton | 12 Jun 2006 18:14

Python bindings

We've had discussions at various points over the last few years about
Python bindings (in fact, the first submission to the list on the subject
was in 2001).  Until yesterday, we hosted links to a couple of
implementations of Python bindings for snowball on our website, but didn't
support either of them.  Having tried them out at the weekend, I found that
I couldn't get one to compile, and wasn't happy with the interface provided
by the other.  I therefore put together a lightweight wrapper on the
libstemmer library (using the Pyrex wrapper generation tool, to make this
an easier job).

I'm calling this PyStemmer, which is the same name as the original wrapper,
written by Andreas Jung, but I don't feel this is a problem since the
original wrapper hasn't been modified in 5 years, and has a very similar
interface.

Version 1.0 is now available on the snowball website (on the "wrappers"
page), at:

http://snowball.tartarus.org/wrappers/PyStemmer-1.0.tar.gz

-- 
Richard
Andreas Jung | 12 Jun 2006 18:23

Re: Python bindings


--On 12 June 2006 17:14:48 +0100 Richard Boulton 
<richard <at> lemurconsulting.com> wrote:

> I'm calling this PyStemmer, which is the same name as the original
> wrapper, written by Andreas Jung, but I don't feel this is a problem
> since the original wrapper hasn't been modified in 5 years, and has a
> very similar interface.
>

Yes, the original PyStemmer code is really old. However, the most up-to-date
version of my Python stemmer code is part of TextIndexNG3 for Zope. The 
extension modules are available from SourceForge:

http://svn.sourceforge.net/viewcvs.cgi/textindexng/extension_modules/trunk/

-aj

_______________________________________________
Snowball-discuss mailing list
Snowball-discuss <at> lists.tartarus.org
http://lists.tartarus.org/mailman/listinfo/snowball-discuss
Andres Hohendahl | 23 Jun 2006 18:45

Suggestion of Improvement

Hi,

I am an NLP researcher at the Engineering University of Buenos Aires. I have been working on the basic ideas behind some popular spelling and stemming algorithms, used mainly for spelling correction in the GNU/free ASPELL and ISPELL projects, which are linked to OpenOffice and other free word-processing software.

From them I obtained a useful text-form description (*.aff and *.dic files) of the stem and morphological transformations of many languages, written all over the world and maintained by many NLP groups.

That is where I saw a similarity with your project.

For a small company, I programmed a library in C# under .NET that can efficiently apply cumulative stemming in both the direct and the reverse direction.

BTW, stemming deals with natural grammatical changes in words (pluralization, gender change, adjectivization, nominalization, superlatives, diminutives, etc.).

My algorithm keeps track of these changes while stemming or inflecting, so a useful side effect is that it yields the grammatical part of speech (POS) of the stemmed words.

Wouldn't it be interesting to merge them all together?

My modest algorithm is very efficient and does not need to be compiled. It builds efficient in-memory structures for handling endings (suffixes) and prefixes using specialized tries. This is as efficient as compiling, or even more so, because you save the compilation step, so your rules can easily be upgraded at runtime (by reloading them, of course).
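
A minimal Python sketch of the kind of suffix trie described above, for illustration only. It is not the C# library mentioned in this message: the class name SuffixTrie, the load_rules helper and the three toy rules are assumptions invented here. The trie is keyed on reversed suffixes, the longest matching rule wins, and each rule carries a grammatical tag so that the tag comes back alongside the stem.

```python
class SuffixTrie:
    """Trie keyed on reversed suffixes, so lookup walks a word from its end."""

    def __init__(self):
        self.root = {}

    def add_rule(self, suffix, replacement, tag):
        node = self.root
        for ch in reversed(suffix):
            node = node.setdefault(ch, {})
        # The key None marks the end of a rule and stores its payload.
        node[None] = (len(suffix), replacement, tag)

    def strip(self, word):
        """Apply the longest matching rule; return (stem, tag) or (word, None)."""
        node = self.root
        best = None
        for ch in reversed(word):
            node = node.get(ch)
            if node is None:
                break
            if None in node:
                best = node[None]  # a longer match overrides a shorter one
        if best is None:
            return word, None
        length, replacement, tag = best
        return word[:len(word) - length] + replacement, tag


def load_rules(rules):
    """Build a fresh trie from (suffix, replacement, tag) triples."""
    trie = SuffixTrie()
    for suffix, replacement, tag in rules:
        trie.add_rule(suffix, replacement, tag)
    return trie


# Hypothetical toy rules, not taken from any real *.aff file.
RULES = [("s", "", "plural"), ("ies", "y", "plural"), ("est", "", "superlative")]
trie = load_rules(RULES)

print(trie.strip("ponies"))
print(trie.strip("tallest"))
print(trie.strip("cat"))
```

In this sketch, reloading rules at runtime amounts to calling load_rules again with a new rule list and swapping in the resulting trie; no compilation step is involved.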

 

Is there a good description of the Snowball meta-language that I could work from, perhaps to build something new and good that combines both worlds?

Greetings from Argentina,

Andrés Hohendahl

 

Martin Porter | 24 Jun 2006 11:42

Re: modifying Java English stemmer to accept new exceptions


Brian,

I can't comment on the Java code, since Richard Boulton wrote the code
generator for it and undertakes its support (I don't know whether Richard
cares to comment?), but I would advise modifying the Snowball scripts and
generating the Java afresh. In fact, you pretty much have to do that,
since the 'among's compile into optimized table structures that you can't
really modify by hand (add an extra item and the whole table changes...)

I did at one time collect all the main irregularities of English verbs, with
a view to doing what you are now attempting, although I never put a new
stemmer based on it into service. You have to be very careful: the past of
'see' is 'saw', but a saw is also a cutting tool, and so on. So you might
find the table below, which I put together many years ago, useful. You can
find these tables in dictionaries and grammars, but they are rarely
complete, and are often cluttered with archaic forms that are no longer useful.

Martin

 
Base   Past    Past part. Verb list
----------------------------------------------------------------
SMELL  SMELT         (r) burn learn spell smell spill spoil dwell(a)
BEND   BENT              bend build lend send spend rend(a) gird(a)
HIT    HIT           (*) bet burst cast cost(r) cut hit hurt let put
                         quit rid set shed shut slit split spread
                         thrust upset wet(r)
SEW    SEWED   SEWN      sew sow show hew(a) mow(r) saw(r) strew(r)
                         shave(r)
BEAT   BEAT    BEATEN    beat
DRINK  DRANK   DRUNK     begin drink ring shrink sing sink spring stink
                         swim
WIN    WON               cling dig fling sling spin stick sting string
                         swing win wring slink
SIT    SAT               sit spit
BLEED  BLED              bleed breed feed lead meet read speed
GET    GOT               get
HANG   HUNG              hang
FIND   FOUND             bind find grind wind
LIGHT  LIT               light slide
SHINE  SHONE             shine
FIGHT  FOUGHT            fight
STRIKE STRUCK            strike
HOLD   HELD              hold
SHOOT  SHOT              shoot
COME   CAME    COME      come become
RUN    RAN     RUN       run
KEEP   KEPT              creep keep leap sweep sleep weep
SELL   SOLD              sell tell
FLEE   FLED              flee
HEAR   HEARD             hear
SAY    SAID              say
SHOE   SHOD              shoe
MEAN   MEANT             deal dream feel kneel lean mean
BUY    BOUGHT            buy
LEAVE  LEFT              leave bereave(a)
LOSE   LOST              lose
RIDE   RODE    RIDDEN    drive ride rise arise strive write smite(a)
STRIDE STRODE  -         stride
FLY    FLEW    FLOWN     fly
STEAL  STOLE   STOLEN    freeze speak steal weave
BREAK  BROKE   BROKEN    break wake awake
FORGET FORGOT  FORGOTTEN forget tread
BEAR   BORE    BORNE     bear tear swear wear
LIE    LAY     LAIN      lie
BITE   BIT     BITTEN    bite hide
CHOOSE CHOSE   CHOSEN    choose
SEE    SAW     SEEN      see
EAT    ATE     EATEN     eat
FORBID FORBADE FORBIDDEN forbid forgive give bid(a)
TAKE   TOOK    TAKEN     forsake(a) shake take
FALL   FELL    FALLEN    fall
DRAW   DREW    DRAWN     draw
GROW   GREW    GROWN     blow grow know throw
SLAY   SLEW    SLAIN     slay(a)
SWELL  SWELLED SWOLLEN   swell(r)
SHEAR  SHEARED SHORN     shear(r)
MAKE   MADE              make
BRING  BROUGHT           bring think
TEACH  TAUGHT            teach beseech(a) seek(a)
CATCH  CAUGHT            catch
STAND  STOOD             stand understand
GO     WENT    GONE      go
DO     DID     DONE      do

Verbs marked (r) also have regular forms. Verbs marked (a) are archaic. Verbs
marked (*) are irregular, but not in a way that causes difficulties to a
stemming algorithm.

The past participle (pp) of `hang' is `hanged' or `hung', depending on the
sense. `lie' is irregular when it means `lying down', regular when it means
`telling falsehoods'. `stride' has no pp in normal use.

We are left with 135 verbs with irregularities in the past or pp forms:

 arise awake bear beat become begin bend bind bite bleed blow
 break breed bring build burn buy catch choose cling come creep
 deal dig do draw dream drink drive eat fall feed feel fight find
 flee fling fly forbid forget forgive freeze get give go grind
 grow hang hear hide hold keep kneel know lead lean leap learn
 leave lend lie light lose make mean meet mow read ride ring rise
 run saw say see sell send sew shake shave shear shine shoe shoot
 show shrink sing sink sit sleep slide sling slink smell sow
 speak speed spell spend spill spin spit spoil spring stand steal
 stick sting stink strew stride strike string strive swear sweep
 swell swim swing take teach tear tell think throw tread
 understand wake wear weave weep win wind wring write

plus these 20 invariant forms:

 bet burst cast cost cut hit hurt let put quit rid set shed shut
 slit split spread thrust upset wet

and these 11 archaic forms, which may or may not be included:

 bereave beseech bid dwell forsake gird hew rend seek slay smite
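
One way to use a table like the one above is as an exception lookup consulted before the ordinary stemmer runs. The Python sketch below is only an illustration under that assumption, not what the Java code generator produces. EXCEPTIONS holds a handful of forms derived from the table's paradigms, AMBIGUOUS and normalise are hypothetical names, and toy_stem is a crude stand-in for a real stemmer. A form such as `saw' is left alone unless the caller opts in, reflecting the caution about `see' and `saw'.

```python
# Irregular past and pp forms mapped to their base verb.
# Only a few entries are shown; a full table would cover all 135 verbs.
EXCEPTIONS = {
    "drank": "drink", "drunk": "drink",
    "sang": "sing", "sung": "sing",
    "took": "take", "taken": "take",
    "went": "go", "gone": "go",
    "saw": "see", "seen": "see",
}

# Forms that are also unrelated words (a saw is a cutting tool).
AMBIGUOUS = {"saw"}


def toy_stem(word):
    """Stand-in for a real stemmer (hypothetical; not the Snowball algorithm)."""
    w = word.lower()
    for suffix in ("ing", "ed", "s"):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w


def normalise(word, stem, allow_ambiguous=False):
    """Look the word up in the exception table first, then fall back to stem()."""
    w = word.lower()
    if w in EXCEPTIONS and (allow_ambiguous or w not in AMBIGUOUS):
        return EXCEPTIONS[w]
    return stem(word)


print(normalise("drank", toy_stem))
print(normalise("saw", toy_stem))
print(normalise("saw", toy_stem, allow_ambiguous=True))
print(normalise("walked", toy_stem))
```

Keeping the ambiguous forms in a separate set makes the trade-off explicit: conflating `saw' with `see' helps for verbs but hurts for the noun, and the table alone cannot decide between them.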
Martin Porter | 26 Jun 2006 10:15

Re: Suggestion of Improvement


Andrés,

Your work sounds very interesting. I am not familiar with ASPELL and ISPELL
... do you have any web links to your own work and its background that would
give us more of an idea of the approach you have taken?

Martin
