Weongyo Jeong | 6 Jan 15:37 2006

Hello, I just wrote a Python wrapper for Snowball.

Hello.  First, sorry for my limited English.

Some days ago, I wrote a Python wrapper for the Snowball library for my own
work. The package includes the Python code too.

I know that http://sourceforge.net/projects/pystemmer/ exists, but I
can't get it to run in my environment, and its code is old: it was last
updated in 2002.

To compile and install it, do the following:

$ tar xvvzf PySnowballStemmer-0.0.1.tar.gz
$ cd PySnowballStemmer-0.0.1
$ python setup.py build
$ su - root
Password:
# python setup.py install
# exit
$ python
Python 2.4.2 (#2, Sep 30 2005, 22:23:39)
[GCC 4.0.2 20050808 (prerelease) (Ubuntu 4.0.1-4ubuntu8)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import SnowballStemmer
>>> a = SnowballStemmer.SnowballStemmer ()
>>> a.new ("english", "UTF_8")
>>> a.stem_str ("simple")
'simpl'
>>> a.delete ()

You can download the file from my homepage.

Martin Porter | 9 Jan 11:23 2006

Small changes to English stemmer


There have been two small changes to the English (Porter2) stemming algorithm.
The first is that the Rule

    ied ies
        replace by ie if preceded by just one letter, otherwise by i

has been changed to

    ied ies
        replace by i if preceded by more than one letter, otherwise by ie

There is a corresponding change in the Snowball script, from

            'ied' 'ies'
                   ((next atlimit <-'ie') or <-'i')

to

            'ied' 'ies'
                   ((hop 2 <-'i') or <-'ie')

This ONLY affects the two 'words' ied and ies. Formerly they stemmed to i, now
they stem to ie.
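
As an illustration only, the old and new wording of the rule can be sketched in Python. This is not the Snowball implementation: the function names are hypothetical, and the sketch covers just this one rule rather than the whole English stemmer.

```python
def replace_ied_ies_old(word):
    """Old rule: 'ie' if preceded by just one letter, otherwise 'i'."""
    for suffix in ("ied", "ies"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return stem + ("ie" if len(stem) == 1 else "i")
    return word


def replace_ied_ies_new(word):
    """New rule: 'i' if preceded by more than one letter, otherwise 'ie'."""
    for suffix in ("ied", "ies"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return stem + ("i" if len(stem) > 1 else "ie")
    return word


# The two versions differ only when nothing precedes the suffix.
for w in ("ties", "cries", "ied", "ies"):
    print(w, replace_ied_ies_old(w), replace_ied_ies_new(w))
```

The two functions agree whenever at least one letter precedes the suffix, which is why only the bare strings ied and ies change.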

The second is that the line,

    do ( ['y'] v <-'Y' set Y_found)

which did not match the Rule

Set initial y ... to Y,

Richard Boulton | 10 Jan 16:59 2006

Re: Hello, I just wrote a Python wrapper for Snowball.

On Tue, 2005-12-20 at 01:21 +1100, Weongyo Jeong wrote:
> Today, I wrote a Python wrapper for the Snowball library for my own work. I
> attached a file which includes all of the sources, including the Python
> code.

I've added a link to this from our projects page.  It will be visible in
an hour or two.

-- 
Richard

Tolkin, Steve | 13 Jan 16:42 2006

RE: Small changes to English stemmer

1. I don't understand what problem the first change (for ied and ies) is
intended to solve.

I think nowadays the most likely usage of "ied" is "improvised explosive
device".
Stemming this to "ie" is no better than, and perhaps worse than,
producing "i".
Perhaps the best treatment is to leave it alone, as "ied", so it will
conflate with "ieds".

The most likely use of "ie" (after "i.e." written without the periods)
is for Internet Explorer.
But this will be rarely spelled ies.  The most likely usage of "ies" is
as an acronym.  Google finds 16 million hits and the first 100 are all
acronyms.  So again perhaps just leave it alone.

2. The most frequent use of a leading Y as vowel is in proper names,
e.g., Yvonne (13 M hits) and Yvette (5 M).  But I do not think these are
affected by the second change, still producing:
yvonne -> yvonn
yvette -> yvett

Hopefully helpfully yours,
Steve
---
Steven Tolkin 
There is nothing so practical as a good theory.  Comments are by me, not
Fidelity Investments, its subsidiaries or affiliates.


Martin Porter | 13 Jan 16:58 2006

New page


We've added a page

http://snowball.tartarus.org/otherlangs/index.html

to include Snowball stemming algorithms coded up in other languages, and you
can find a version of the Russian stemmer there, coded in php5 by Dennis
Kreminsky.

Martin

Patrick Mézard | 21 Jan 19:09 2006

Problem with PySnowballStemmer

Hello,
First, thank you, Weongyo Jeong, for providing updated Python bindings; I
was definitely looking for them.

However, I cannot get them to work with UTF-8 input:
"""
# -*- coding: iso-8859-1 -*-
import SnowballStemmer

# (Snowball encoding name, Python codec name) pairs to test
encodings = [
    ('UTF_8', 'utf8'),
    ('ISO_8859_1', 'iso-8859-1'),
]

for sn_enc, py_enc in encodings:
    # Create the stemmer, then initialise it in a separate step
    s = SnowballStemmer.SnowballStemmer()
    s.new('french', sn_enc)
    # This is a 'latin small letter e acute' at the end of the word.
    u = unicode('pitié', 'iso-8859-1').encode(py_enc)
    print sn_enc, ':', repr(u), '=>', repr(s.stem_str(u))
"""

outputs:
"""
UTF_8 : 'piti\xc3\xa9' => 'piti\xc3'
ISO_8859_1 : 'piti\xe9' => 'pit'
"""
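
A quick way to check the UTF_8 result shown above, independently of the stemmer, is to try decoding the returned bytes. The sketch below uses only the Python 3 standard library; `is_valid_utf8` is a hypothetical helper name, and the byte strings are copied from the reported output rather than produced by PySnowballStemmer.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


word = "piti\u00e9"                 # 'pitié'
utf8_input = word.encode("utf-8")   # b'piti\xc3\xa9'
reported_stem = b"piti\xc3"         # stem reported for the UTF_8 case

print(is_valid_utf8(utf8_input))           # the input itself is well formed
print(is_valid_utf8(reported_stem))        # the reported stem is not
print(reported_stem == utf8_input[:-1])    # it is the input minus its last byte
```

The reported stem is the input with only the second byte of the two-byte 'é' sequence removed, which would be consistent with the word being processed byte by byte. That reading is an assumption and is not confirmed in this thread.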

The UTF-8 version returns an invalid UTF-8 sequence. I am completely new
to Snowball and I have just seen the announcement saying that Unicode
support was added last year. Until now I failed to find reliable

karl wettin | 24 Jan 19:05 2006

German suffix stripping not complete

Hello list,

I'm wondering if there is a good reason for the German stemmer not to
strip the final s from words ending in 'os':
Autos, kinos, echos, büros, silos, pianos, etc.

Here are some words you can consider.

Albatros, apropos, chaos, epos, kosmos, gros, rigoros, grandios, los, haarlos.

All the ones I can think of would be pretty much OK if suffix-stripped.

Martin Porter | 30 Jan 11:05 2006

Re: German suffix stripping not complete


Karl,

I'm sure the reason it was not done is that the group is so small. What you
would certainly need to do is to check for the ending -los, and not remove
the -s in that case. If you take the sample vocabulary provided with German,
you then get the following residual list,

        ambros amos autos bartholomaios büros chaos credos fotos
        haemorrheos heros hos infos jethros jos lebensmittelembargos
        migros moos mythos pharaos platos salomos studios theophrastos
        wahlbüros wos

25 words in all. -s could be removed with benefit or without harm from all,
or almost all, of these words. There is some overlap here with your own word
list.
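
The check described above can be sketched in Python. This is only an illustration of the idea, not part of the Snowball German stemmer, and `strip_s_after_o` is a hypothetical name.

```python
def strip_s_after_o(word):
    """Drop a final 's' from words ending in 'os', except those ending in 'los'."""
    if word.endswith("os") and not word.endswith("los"):
        return word[:-1]
    return word


for w in ("autos", "kinos", "büros", "chaos", "haarlos", "los", "silos"):
    print(w, "->", strip_s_after_o(w))
```

With such a simple check, "silos" is left untouched because it also ends in -los, so the exception protects some plurals along with the -los adjectives.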

Thank you for pointing this out. I will review the German algorithm at some
point in the future, and possibly incorporate your suggestion.

Martin

>Hello list,
>
>I'm wondering if there is a good reason for the German stemmer not to
>suffix strip the s in words ending on 'os'.
>Autos, kinos, echos, bu"ros, silos, pianos, et.c.
>
>Here are some words you can consider.
>