Tclsnowball
2006-06-02 13:34:06 GMT
I stumbled across these Tcl wrappers for snowball, by Michael Schlenker:
http://wiki.tcl.tk/8699
Perhaps they should be listed here:
http://snowball.tartarus.org/projects.php
Cheers,
Olly
On Fri, Jun 02, 2006 at 02:34:06PM +0100, Olly Betts wrote:
> I stumbled across these Tcl wrappers for snowball, by Michael Schlenker:
>
> http://wiki.tcl.tk/8699

Thanks - I've added a link to projects.php
Hello everyone,
my name is Erwín Glockner, I'm a student of computational linguistics in Heidelberg, Germany. Together with my fellow students Doina Gliga and Marina Stegarescu, I have started to write a Romanian stemmer in Snowball. We plan to finish the stemmer by the end of this month. We would be happy if the stemmer were accepted as part of the Snowball distribution.
There is still some work to do, e.g. evaluating the stemmer, making a stopword list, Unicode support, etc. After finishing this we will send you our stemmer with the corresponding files, but I couldn't find any email address to which the stemmer should be sent. Could someone please tell me the address(es)?
With kind regards,
E. Glockner, D. Gliga, M. Stegarescu.
On Wed, Jun 07, 2006 at 12:06:30AM +0200, Erwin Glockner wrote:
> my name is Erwín Glockner, I'm a student of computational linguistics in
> Heidelberg, Germany. Together with my fellow students Doina Gliga and
> Marina Stegarescu we started to write a romanian stemmer in Snowball.
> We planned to finish the stemmer until end of this month. We would be
> happy if the stemmer would be accepted as part of the Snowball-distribution.

Excellent news, we're always happy to receive new algorithms. I assume this means you are happy to license the stemmer under the same terms as the existing snowball software (i.e., the BSD license, see http://snowball.tartarus.org/license.php).

> There is still some work to do, e.g. evaluating the stemmer, making a
> stopwords-list, unicode support, etc. After finishing this we will send
> you our stemmer with the corresponding files, but I couldn't find any
> email adress to whom the stemmer should be sent to.
> Could please someone tell me the address(es)?

If you wish, you can send it directly to me (richard <at> lemurconsulting.com is the best address to use). Martin may chip in asking for a copy to be sent to him, too, but I'll leave that up to him. You could also try sending it directly to this mailing list, though I fear it may be set to strip out attachments, so that could be futile.
We prefer to keep as much discussion as possible on this mailing list, so we'll probably CC any follow-up messages to this list.
Thanks for the interest.
We've had discussions at various points over the last few years about Python bindings (in fact, the first submission to the list on the subject was in 2001). Until yesterday, we hosted links to a couple of implementations of Python bindings for snowball on our website, but didn't support either of them.
Having tried them out at the weekend, I found that I couldn't get one to compile, and wasn't happy with the interface provided by the other. I therefore put together a lightweight wrapper on the libstemmer library (using the Pyrex wrapper generation tool, to make this an easier job).
I'm calling this PyStemmer, which is the same name as the original wrapper, written by Andreas Jung, but I don't feel this is a problem since the original wrapper hasn't been modified in 5 years, and has a very similar interface.
Version 1.0 is now available on the snowball website (on the "wrappers" page), at:
http://snowball.tartarus.org/wrappers/PyStemmer-1.0.tar.gz
--
Richard
--On 12 June 2006 17:14:48 +0100 Richard Boulton <richard <at> lemurconsulting.com> wrote:
> I'm calling this PyStemmer, which is the same name as the original
> wrapper, written by Andreas Jung, but I don't feel this is a problem
> since the original wrapper hasn't been modified in 5 years, and has a
> very similar interface.
>

Yes, the original PyStemmer code is really old. However, the most up-to-date version of my Python stemmer code is part of TextIndexNG3 for Zope. The extension modules are available from Sourceforge:
<http://svn.sourceforge.net/viewcvs.cgi/textindexng/extension_modules/trunk/>
-aj
_______________________________________________ Snowball-discuss mailing list Snowball-discuss <at> lists.tartarus.org http://lists.tartarus.org/mailman/listinfo/snowball-discuss
Hi,
As an NLP researcher at the Engineering University of Buenos Aires, I have been working on the basic ideas behind some popular spelling and stemming algorithms, mainly those used for spell-correction in the free ASPELL and ISPELL projects, which are linked to OpenOffice and other free word-processing software.
From them I got a useful description, in text form (*.aff and *.dic files), of the stemming and morphological transformations of many languages, written and maintained by NLP groups all over the world.
That is where I saw a similarity with your project.
For a small company, I programmed a library in C# under .NET that efficiently applies cumulative stemming in both the direct and the reverse direction.
BTW, stemming deals with the natural grammatical changes in words (pluralization, gender change, adjectivization, nominalization, superlatives and diminutives, etc.).
My algorithm keeps track of these while stemming or inflecting, so a good side effect is that it yields the grammatical POS of the stemmed words.
Wouldn't it be interesting to merge them all together?
My modest algorithm is very efficient and does not need to be compiled: it builds memory structures to handle endings (suffixes) and prefixes using specialized tries. This is as efficient as compiling, or even more so, because you save a step (no compilation), so your rules can easily be upgraded at runtime (by reloading them, of course).
Is there a good description of the Snowball meta-language that I could work from, to perhaps build something new and good between both worlds?
Greetings from Argentina
Andrés Hohendahl
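The suffix-trie idea described in this message can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Andrés's C# library: the class names `Node` and `SuffixTrie`, the `(suffix, replacement, tag)` rule format, and the toy English rules are all invented for the example. It shows rules being loaded into a trie keyed on the reversed suffix, a grammatical tag coming back as a side effect of stemming, and the rule set being swapped at runtime without a compilation step.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    children: Dict[str, "Node"] = field(default_factory=dict)
    # (replacement, tag) if a rule ends at this node, otherwise None
    rule: Optional[Tuple[str, str]] = None


class SuffixTrie:
    """Hypothetical rule store: suffix rules in a trie keyed on the reversed suffix."""

    def __init__(self) -> None:
        self.root = Node()

    def load(self, rules: List[Tuple[str, str, str]]) -> None:
        """Discard the current rules and build a fresh trie (a runtime reload)."""
        self.root = Node()
        for suffix, replacement, tag in rules:
            node = self.root
            for ch in reversed(suffix):
                node = node.children.setdefault(ch, Node())
            node.rule = (replacement, tag)

    def stem(self, word: str) -> Tuple[str, Optional[str]]:
        """Apply the longest matching suffix rule; return (stem, tag)."""
        node = self.root
        best = None  # (suffix length, replacement, tag) of the longest match so far
        for depth, ch in enumerate(reversed(word), start=1):
            node = node.children.get(ch)
            if node is None:
                break
            # Never strip the whole word: require at least one remaining letter.
            if node.rule is not None and depth < len(word):
                best = (depth, node.rule[0], node.rule[1])
        if best is None:
            return word, None
        length, replacement, tag = best
        return word[:-length] + replacement, tag


# Toy rules, invented for illustration only.
trie = SuffixTrie()
trie.load([("ies", "y", "plural"), ("s", "", "plural"), ("est", "", "superlative")])
print(trie.stem("ponies"))
print(trie.stem("fastest"))
```

Calling `load` again with a different rule list replaces the whole trie, which is the "reload at runtime" property the message describes.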
Brian,
I can't comment on the Java code, since Richard Boulton wrote the
codegenerator for it, and undertakes support (I don't know if Richard cares
to comment?), but I would advise modifying the Snowball scripts and
codegenerating the Java afresh. In fact, you pretty well have to do that,
since the 'among's compile into optimized table structures that you can't
really modify by hand (add an extra item and the whole table changes...)
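To illustrate why hand-editing such a table is fragile, here is a rough Python sketch of a suffix table loosely modelled on a generated 'among' table. It is not the actual Snowball code generator: the function `build_table`, the `Entry` type, and the sort order are assumptions made for illustration, and the field names `substring_i` and `result` are used only to suggest the kind of cross-references such a table holds.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    suffix: str
    substring_i: int  # index of the longest other entry that ends this suffix, or -1
    result: int       # 1-based position of the suffix in the source list


def build_table(suffixes: List[str]) -> List[Entry]:
    """Sort suffixes by their reversed spelling and link each to its longest sub-suffix."""
    # Sorting on the reversed string puts a shorter suffix before any longer
    # suffix that ends with it, so links always point backwards in the table.
    order = sorted(range(len(suffixes)), key=lambda i: suffixes[i][::-1])
    ordered = [suffixes[i] for i in order]
    table = []
    for pos, s in enumerate(ordered):
        link = -1
        for j in range(pos):
            t = ordered[j]
            if s != t and s.endswith(t):
                if link == -1 or len(t) > len(ordered[link]):
                    link = j
        table.append(Entry(s, link, order[pos] + 1))
    return table


before = build_table(["s", "es", "ies"])
after = build_table(["s", "es", "ies", "as"])  # one extra item in the source
for entry in before:
    print("before", entry)
for entry in after:
    print("after ", entry)
```

In this sketch, adding the single suffix "as" moves "es" and "ies" to new positions and changes the link stored for "ies", which is the effect described above: one extra item and the whole table changes.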
I did at one time collect all the main irregularities of English verbs, with
a view to doing what you are now attempting, although I never put a new
stemmer based on it into service. You have to be very careful: the past of
'see' is 'saw', but a saw is also a cutting tool, and so on. So you might
find the table below, which I put together many years ago, useful. You can
find these tables in dictionaries and grammars, but they are rarely
complete, and are often cluttered with archaic forms that are no longer useful.
Martin
Paradigm form               verb list
------------------------------------------------------------------------
SMELL SMELT (r)             burn learn spell smell spill spoil dwell(a)
BEND BENT                   bend build lend send spend rend(a) gird(a)
HIT HIT (*)                 bet burst cast cost(r) cut hit hurt let put
                            quit rid set shed shut slit split spread
                            thrust upset wet(r)
SEW SEWED SEWN              sew sow show hew(a) mow(r) saw(r) strew(r)
                            shave(r)
BEAT BEAT BEATEN            beat
DRINK DRANK DRUNK           begin drink ring shrink sing sink spring stink
                            swim
WIN WON                     cling dig fling sling spin stick sting string
                            swing win wring slink
SIT SAT                     sit spit
BLEED BLED                  bleed breed feed lead meet read speed
GET GOT                     get
HANG HUNG                   hang
FIND FOUND                  bind find grind wind
LIGHT LIT                   light slide
SHINE SHONE                 shine
FIGHT FOUGHT                fight
STRIKE STRUCK               strike
HOLD HELD                   hold
SHOOT SHOT                  shoot
COME CAME COME              come become
RUN RAN RUN                 run
KEEP KEPT                   creep keep leap sweep sleep weep
SELL SOLD                   sell tell
FLEE FLED                   flee
HEAR HEARD                  hear
SAY SAID                    say
SHOE SHOD                   shoe
MEAN MEANT                  deal dream feel kneel lean mean
BUY BOUGHT                  buy
LEAVE LEFT                  leave bereave(a)
LOSE LOST                   lose
RIDE RODE RIDDEN            drive ride rise arise strive write smite(a)
STRIDE STRODE -             stride
FLY FLEW FLOWN              fly
STEAL STOLE STOLEN          freeze speak steal weave
BREAK BROKE BROKEN          break wake awake
FORGET FORGOT FORGOTTEN     forget tread
BEAR BORE BORNE             bear tear swear wear
LIE LAY LAIN                lie
BITE BIT BITTEN             bite hide
CHOOSE CHOSE CHOSEN         choose
SEE SAW SEEN                see
EAT ATE EATEN               eat
FORBID FORBADE FORBIDDEN    forbid forgive give bid(a)
TAKE TOOK TAKEN             forsake(a) shake take
FALL FELL FALLEN            fall
DRAW DREW DRAWN             draw
GROW GREW GROWN             blow grow know throw
SLAY SLEW SLAIN             slay(a)
SWELL SWELLED SWOLLEN       swell(r)
SHEAR SHEARED SHORN         shear(r)
MAKE MADE                   make
BRING BROUGHT               bring think
TEACH TAUGHT                teach beseech(a) seek(a)
CATCH CAUGHT                catch
STAND STOOD                 stand understand
GO WENT GONE                go
DO DID DONE                 do
Verbs marked (r) also have regular forms. Verbs marked (a) are archaic. Verbs
marked (*) are irregular, but not in a way that causes difficulties to a
stemming algorithm.
The pp of `hang' is `hanged' or `hung', depending on the sense. `lie' is
irregular when it means `lying down', regular when it means `telling
falsehoods'. `stride' has no pp in normal use.
We are left with 135 verbs with irregularities in the past or pp forms:
arise awake bear beat become begin bend bind bite bleed blow
break breed bring build burn buy catch choose cling come creep
deal dig do draw dream drink drive eat fall feed feel fight find
flee fling fly forbid forget forgive freeze get give go grind
grow hang hear hide hold keep kneel know lead lean leap learn
leave lend lie light lose make mean meet mow read ride ring rise
run saw say see sell send sew shake shave shear shine shoe shoot
show shrink sing sink sit sleep slide sling slink smell sow
speak speed spell spend spill spin spit spoil spring stand steal
stick sting stink strew stride strike string strive swear sweep
swell swim swing take teach tear tell think throw tread
understand wake wear weave weep win wind wring write
plus these 20 invariant forms:
bet burst cast cost cut hit hurt let put quit rid set shed shut
slit split spread thrust upset wet
and these 11 archaic forms, which might/might not be included:
bereave beseech bid dwell forsake gird hew rend seek slay smite
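As a rough illustration of how such a table might be used, the Python sketch below builds an exception lookup from a handful of the paradigms listed above and skips forms that are also words in their own right. It is a sketch under assumptions, not a stemmer that was ever put into service: the names `PARADIGMS`, `AMBIGUOUS`, `build_exceptions` and `normalise` are invented here, and the ambiguous set is a hand-picked example ('saw' as the cutting tool, 'found' as in founding a company) that a real stemmer would need to extend with care.

```python
from typing import Dict, Iterable, Set, Tuple

Paradigm = Tuple[str, str, str]  # (base, past, past participle)

# A few entries taken from the paradigm table; not the full list.
PARADIGMS = [
    ("drink", "drank", "drunk"),
    ("sing", "sang", "sung"),
    ("keep", "kept", "kept"),
    ("see", "saw", "seen"),
    ("find", "found", "found"),
]

# Assumption: irregular forms that are also independent words are left alone.
AMBIGUOUS: Set[str] = {"saw", "found"}


def build_exceptions(paradigms: Iterable[Paradigm], skip: Set[str]) -> Dict[str, str]:
    """Map each irregular past or participle form to its base, omitting skipped forms."""
    table: Dict[str, str] = {}
    for base, past, participle in paradigms:
        for form in (past, participle):
            if form not in skip:
                table[form] = base
    return table


EXCEPTIONS = build_exceptions(PARADIGMS, AMBIGUOUS)


def normalise(word: str) -> str:
    """Return the base form for a known irregular form, else the word unchanged."""
    return EXCEPTIONS.get(word, word)


for w in ["drank", "seen", "saw", "kept", "walked"]:
    print(w, "->", normalise(w))
```

A lookup like this would run before the regular suffix-stripping rules, so that words it leaves unchanged still go through the ordinary stemmer.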
Andrés,
Your work sounds very interesting. I am not familiar with ASPELL and ISPELL ... do you have any web links to your own work and its background that would give us more of an idea of the approach you have taken?
Martin