Martin Porter | 6 Dec 2001 13:06

Re: Unicode representation (formerly: Regarding a unicode version of Snowball)


James Aylett was asking whether the Unicode issue for Snowball is one of
internal representation, and the answer I think is that it is not.
Snowball's internal representation currently is one byte per character, and
you can define your own coding scheme. The question is really for the API:
if people want to hand over strings for stemming in a Unicode form, how
exactly will these strings be encoded? I imagine Snowball can be extended to
any kind of encoding. It's clearly silly making it handle strings in UTF-16
form if everyone is using strings in UTF-8 form and vice versa.

Of course perhaps people will want to hand over strings in all sorts of
forms, in which case you need to convert before processing, but I wasn't
expecting that to be necessary.

In any case there is perhaps no need to continue the discussion at present:
there is no immediate demand for Unicode support.

(Generally, it seems to me that Unicode is fine if you are based in the
Latin alphabet and want to make occasional excursions into exotic
characters, and also appropriate for Chinese and Japanese, but to represent,
say, Cyrillic in Unicode in UTF-8 form seems terribly clumsy. Do Oleg and
Andrei have a view on this?)
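
To put rough numbers on the size concern for Cyrillic, here is a small sketch (an illustration for comparison, not part of Snowball). The example word and the choice of KOI8-R as the one-byte baseline are assumptions made for the demonstration.

```python
# Compare the encoded size of a short Russian word under three encodings.
# The word is an arbitrary example: 8 Cyrillic letters.
word = "стемминг"

sizes = {
    "koi8_r": len(word.encode("koi8_r")),        # legacy one-byte Cyrillic encoding
    "utf-8": len(word.encode("utf-8")),          # two bytes per Cyrillic letter
    "utf-16-le": len(word.encode("utf-16-le")),  # two bytes per BMP character
}

for name, size in sizes.items():
    print(f"{name:10s} {size} bytes")
```

Each Cyrillic letter takes two bytes in UTF-8, the same as in UTF-16 and twice what a one-byte-per-character scheme needs, which is presumably the clumsiness in question.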

Martin

_______________________________________________
Snowball-discuss mailing list
Snowball-discuss <at> lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/snowball-discuss


Chris Cleveland | 4 Dec 2001 18:10

Java version

I've started to look into creating a Java version of the Snowball stemmers. If anyone else is working on
this, please contact me.

--
Chris Cleveland
Dieselpoint, Inc.
2000 North Racine Avenue, Suite 4100
Chicago, Illinois  60614
773.528.1700 x116 voice, 773.528.8862 fax
http://dieselpoint.com
ccleveland <at> dieselpoint.com


_____________________________________________________________________
VirusChecked by the Incepta Group plc
_____________________________________________________________________

Richard Boulton | 5 Dec 2001 13:02

Re: Java version

On Tue, 2001-12-04 at 17:10, Chris Cleveland wrote:
> I've started to look into creating a Java version of the Snowball
> stemmers. If anyone else is working on this, please contact me.

I had a brief look at the feasibility of doing this a while back, though
I never got as far as writing anything.  I presume the way you plan to
do this is to modify the Snowball-to-C converter so that it outputs
Java: that certainly seems like the best approach to me.

As far as I could tell, you should only need to modify the file
"generator.c".  Many parts of it are going to be the same for Java as
for C, so my conclusion was that it shouldn't be too big a job.  The C
generator uses some constructs like "goto" which help make writing
generated code easy: however, there are no such constructs in Snowball
itself so that shouldn't prove a problem.

The code in generator.c is a little hard to follow (some of the function
names are a little on the short side ('w()')), but it's not too bad. 
I'm sure we can clear up any confusions if you post questions to the
list.

Let us know how you get on: it would be good to integrate the ability to
write output in multiple languages into Snowball.

-- 
Richard


Martin Porter | 5 Dec 2001 18:38

Re: Java version


The problem surely is finding another way of doing the goto's. I suggested
to Richard using exceptions - I hope that idea isn't just nonsense.
Unfortunately I don't really have time to tackle Java generation at the moment. 

I believe generator.c is not hard to understand (w() is a basic write
function and has a short name because it's used everywhere). Perhaps more
tricky is understanding the syntax tree that generator.c generates its code
from. I could do something towards documenting it if that were useful.

Martin


James Aylett | 5 Dec 2001 19:02

Re: Java version

On Wed, Dec 05, 2001 at 10:38:57AM -0700, Martin Porter wrote:

> The problem surely is finding another way of doing the goto's. I
> suggested to Richard using exceptions - I hope that idea isn't just
> nonsense.

That will have a performance hit unless it's genuinely an exceptional
case. (Well, it'll have a hit then anyway, but that's not the point.)

Anyone implementing this should look into labelled breaks, which
can be used more safely than goto in many of the same situations. Not
having looked at the Snowball code generator, I can't be certain this
will solve all uses of goto, but unless Snowball is outputting really
hideous code it should do :-)

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                            zap.tartarus.org
  james <at> tartarus.org                                    www.footlights.org


the Tolkin family | 5 Dec 2001 04:53

99% of English words ending -sis and -xis are not plurals

A while ago I said I might suggest more fundamental changes to
the approach used in the Porter2 stemmer. 
Here is another one, probably my last.
(You probably can also improve handling of f -> v, e.g. life, self, etc.)
 
There are over 800 words that end with -sis and of these only the 12
listed below, about 1%, are plurals. 
Almost all the rest are singular words, whose plural ends with -ses.
The words that are plurals are generally quite uncommon.  Here they are:
brindisis chaprassis dalasis kolbasis kolbassis lassis
pachisis parchesis parchisis reversis sannyasis tsotsis
So instead of the current rule, which simply removes the final -s, I propose
the following rule, which changes -sis to -ses, with a few exceptions. 
(We generally want to conflate singular and plural.  But there are too
many -ses words to go in the usual direction from plural to singular. 
So this rule goes in the other direction.)
This must be run before the current rule 1, so I'll call it rule 0.5a. 
I express this in pseudocode.
 
if word is theses then stem is thesis && stop
if word ends with sis {
  if word is sis then stem is sis && stop
  if word is psis then stem is psi && stop
  if word is thesis then stem is thesis && stop
  change final sis to ses
}
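
For anyone who wants to try rule 0.5a against a word list, here is a minimal Python sketch of it. The function name rule_0_5a and the (word, stop) return convention are inventions for this illustration, not part of Snowball; "stop" stands in for the "&& stop" of the pseudocode. The exceptions are looked up before the -sis test so that theses, which does not end in -sis, is still caught.

```python
# Exceptions taken from the proposal: word -> stem.
SIS_EXCEPTIONS = {
    "sis": "sis",
    "psis": "psi",
    "thesis": "thesis",
    "theses": "thesis",
}


def rule_0_5a(word):
    """Return (new_word, stop); stop=True means no later rule should run."""
    # Exceptions are checked first because "theses" does not end in -sis.
    if word in SIS_EXCEPTIONS:
        return SIS_EXCEPTIONS[word], True
    # General case: rewrite the singular -sis to the plural -ses, so that
    # the usual plural handling can then treat both forms alike.
    if word.endswith("sis"):
        return word[:-3] + "ses", False
    return word, False


for w in ("analysis", "thesis", "theses", "psis", "house"):
    print(w, "->", rule_0_5a(w))
```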
 
I put special handling for thesis and theses because otherwise these
would become "these".  Certainly thesis is a likely search term.
(Another possible stem for thesis and theses might be "thes".)
 
(The rule above could be written so that -sis must occur in the R1
or R2 region.   That would remove the special cases for sis and psis,
but would cause the need to add several others.)
 
The true plurals listed above are no longer handled correctly, but those words
are rare and many other plurals are not handled correctly today, so I do not bother
to fix them.  Perhaps one could special-case lassis -> lassi to avoid a clash with lass.
 
Another possible special case is "basis".  The rule above conflates it with bases,
which is its plural, but that causes it to also conflate with base. One might want
to add another special case: if word is basis then stem is basis && stop
This rule causes a few conflations that might not be as desirable as possible,
e.g. ellipsis and ellipses, synapsis and synapses, phasis and phases,
and whosis and whose. 
These could also be worth adding to the list of special cases.
But I have tried to have as few as possible.
 
An analogous rule applies to -xis.  Again, almost all of the about 60 words
ending with -xis are not plural. 
The rule 0.5b below captures this, and the few exceptions.
 
if word ends with xis {
  if word is xis then stem is xi && stop
  if word is maxis then stem is maxi && stop
  if word is taxis then stem is taxi && stop
  change final xis to xes
}
 
Here axis gets conflated with axes (its plural) but also with axe.  That seems
acceptable.  (There is a singular word taxis, with plural taxes, but both those
strings are far more common in their usual meaning.  We do not want to
conflate taxis with tax.)
 
Misc.
I have written these as 2 separate rules but a performance tweak might test if
the word ends with -is first.
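
The tweak of testing for -is first could be sketched like this in Python, folding rules 0.5a and 0.5b into one function. The name rule_0_5 is made up for the illustration, and the sketch returns only the rewritten word (it does not model "&& stop"). Note that theses does not end in -is, so with this pre-test it would need a separate check; it is left out here.

```python
# Exceptions from rules 0.5a and 0.5b that end in -is: word -> stem.
IS_EXCEPTIONS = {
    "sis": "sis", "psis": "psi", "thesis": "thesis",  # rule 0.5a
    "xis": "xi", "maxis": "maxi", "taxis": "taxi",    # rule 0.5b
}


def rule_0_5(word):
    """Apply rules 0.5a and 0.5b, with one cheap -is test up front."""
    if not word.endswith("is"):
        return word  # most words leave here after a single comparison
    if word in IS_EXCEPTIONS:
        return IS_EXCEPTIONS[word]
    if word.endswith(("sis", "xis")):
        return word[:-2] + "es"  # -sis -> -ses, -xis -> -xes
    return word  # other -is words (e.g. tennis) are untouched


for w in ("axis", "taxis", "phototaxis", "analysis", "tennis", "house"):
    print(w, "->", rule_0_5(w))
```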
 
On a completely separate topic, the word "lens" is another one
that should be special cased to return "lens" as its stem, so that
it conflates with lenses (and so it does not conflate with the
common computer science abbreviation for length.)
 
References:
This analysis is based on the very large list of words known as YAWL (Yet Another
and elsewhere.
 
Hopefully helpfully yours,
Steve
--
Steven Tolkin          steve.tolkin <at> fmr.com      617-563-0516
Fidelity Investments   82 Devonshire St. V1D     Boston MA 02109
There is nothing so practical as a good theory.  Comments are by me,
not Fidelity Investments, its subsidiaries or affiliates.

Martin Porter | 6 Dec 2001 13:06

Re: 99% of English words ending -sis and -xis are not plurals


Steve,

Well, I am still reeling after the suggestion that -ive should not be
removed, :-) , which indeed is not a bad suggestion.

I'm less sure about the -sis -xis idea, although it is very interesting. It
is really a rule for handling Greek plurals, and English is unusual in
accepting many plural forms from other languages: beaux (French),
cognoscenti (Italian), cacti (Latin), hypotheses (Greek), seraphim (Hebrew).
(It is a phenomenon quite easy to explain historically however.) It has
occurred to me that one might be able to work out the language of a word by
digram analysis - or something similar - and stem accordingly. So hypnotic
is "obviously" Greek, and stems to hypnos, chateaux is "obviously" French
and stems to chateau. Greek -sis endings are therefore part of a general
problem.

Remember in any case that an English stemmer is going to regard -ses endings
as normal plural forms, and remove -s, abuses, bookcases, houses etc. The
Porter stemmer removes -es from longer words, so the problem reduces to
removing -is from analysis etc. The Lovins stemmer (which is more concerned
with "scientific" vocabulary) does that, but also respells final -yt as -ys
so that analysis, analyses, analytic conflate.
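
The respelling idea can be shown with a toy sketch in Python. This is not the Lovins algorithm: the three-entry ending list, the minimum stem length and the name toy_stem are assumptions chosen only to show how a final -yt to -ys respelling brings the three forms together.

```python
# Toy ending list, just enough for the example (hypothetical, not Lovins's).
ENDINGS = ("is", "es", "ic")


def toy_stem(word):
    """Strip one ending, then respell a final -yt as -ys."""
    for ending in ENDINGS:
        # Keep at least three letters of stem (an arbitrary safeguard).
        if word.endswith(ending) and len(word) > len(ending) + 2:
            word = word[: -len(ending)]
            break
    if word.endswith("yt"):
        word = word[:-2] + "ys"  # analyt -> analys
    return word


for w in ("analysis", "analyses", "analytic"):
    print(w, "->", toy_stem(w))
```

Without the respelling step, analytic would end up as analyt and stay apart from the other two.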

The question is, how important is it in practice? One should not be too
influenced by something like YAWL, which is more an aid to Scrabble players
than a practical list of words for contemporary English (although if someone
put down "chaprassis" and claimed a triple word score I'd be most upset!). My
sample vocabularies only instance two successful conflations with a rule
like this: hypothesis/hypotheses and parenthesis/parentheses. I realise
there are more in the language as a whole (oasis/oases for example), but the
point is that a rule is hardly worth adding if it only affects one word in
20,000. 

The truth is words like bases (as a plural of basis), hypnoses, ellipses are
not used very much. We tend to avoid forming plurals when the plural is
dubious, and use a different construction. Everyone says "CVs" because they
don't know the plural of "curriculum vitae", and avoid trying to pluralise
words like chassis, chablis, cyclops, Mrs ...   

(As a general feature of English, exotic plurals are declining. Hippos for
hippopotami, cactuses for cacti, eskimos for esquimaux etc. Americans say
syllabi, but that sounds strange in England. Dice has become the singular
form of what was once die. Perhaps one day the plural will be dices.
Ignorance of foreign languages must help here. News broadcasters use
paparazzi in the singular without thinking it strange.) 

Martin


Tolkin, Steve | 6 Dec 2001 15:05

RE: 99% of English words ending -sis and -xis are not plurals

Summary:
I think this new rule would easily justify itself even if the 
only words affected were: thesis osmosis analysis emphasis 
diagnosis prognosis synthesis and hypothesis.

Details:
I completely agree with your general line of argument, that the
benefits of any stemming rule must clearly exceed its "costs",
and that this principle applies even more strongly when the rule
is a change to an existing rule.  

I have been thinking about how to formalize this notion of the 
effectiveness of a stemming rule, but have not worked it out yet.  
The basic idea is to count the number of true positives, 
false positives, and false negatives, assign each category
a score, e.g. +1 for true positive, -2 for false positive, etc.
Then further assign each word (and/or stem) a weight based on 
its expected frequency in the corpus of documents (and/or queries).
Define effectiveness as some formula that combines the scores and 
weights.  Then two rules can be compared.
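
One possible shape for such a formula, as a Python sketch only: a weighted sum of per-word scores. The +1 and -2 come from the paragraph above; the -1 for a false negative, the linear form, the function name and all the word weights below are assumptions invented for the illustration, not measured frequencies.

```python
# Score per outcome category. "fn" is an assumed value; the text gives only
# +1 for a true positive and -2 for a false positive.
SCORES = {"tp": 1.0, "fp": -2.0, "fn": -1.0}


def effectiveness(outcomes, weights):
    """Sum score * weight over words.

    outcomes maps word -> "tp" | "fp" | "fn"
    weights  maps word -> expected frequency (defaults to 1.0 if missing)
    """
    return sum(SCORES[kind] * weights.get(word, 1.0)
               for word, kind in outcomes.items())


# Made-up weights and outcomes, purely to show two rules being compared.
weights = {"analysis": 5.0, "hypothesis": 2.0, "basis": 3.0}
current_rule = {"analysis": "fn", "hypothesis": "fn"}
proposed_rule = {"analysis": "tp", "hypothesis": "tp", "basis": "fp"}

print("current: ", effectiveness(current_rule, weights))
print("proposed:", effectiveness(proposed_rule, weights))
```

With these invented numbers the proposed rule comes out ahead even after being charged for conflating basis with base, but the result obviously depends on the weights chosen.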

In the absence of this formula we can still see the benefit
of my proposed rule by inspection.

The existing -sis -> -si rule fails on almost all of the -sis words.
My proposed new rule succeeds on almost all of the -sis words.
Similarly for the proposed -xis to -xes rule.

Here is an example of the benefit of the new rule.
The following is from the current Porter2:
analyse -> analys
analysed -> analys
analyser -> analys
analysers -> analys
analyses -> analys
analysing -> analys
analysis -> analysi

All these words conflate -- except for analysis itself!
That is quite bad.  
A similar pattern is seen for the other -sis words.

I think this new rule would easily justify itself even if the 
only words affected were: thesis osmosis analysis emphasis 
diagnosis prognosis synthesis hypothesis .

Of the 810 -sis words I think the following are reasonably common.
(Ordered by increasing length and then alphabetically.)

basis oasis crisis miosis stasis thesis chassis genesis kinesis
meiosis mitosis nemesis osmosis analysis dialysis dieresis ellipsis
emphasis hypnosis narcosis necrosis neurosis synopsis catalysis
catharsis cirrhosis coreopsis diaeresis diagnosis halitosis oogenesis
pertussis prognosis psoriasis psychosis sclerosis scoliosis silicosis
symbiosis synthesis hypothesis hysteresis thrombosis

Few of the singular -xis words are common: perhaps axis, praxis,
cathexis and prophylaxis are the most common.
But there are two other reasons to make that change.
First, of these 58 words only xis, maxis, and taxis are
plurals -- all the rest are singular words whose plural ends in -xes.
Thus we greatly increase accuracy.
Second, almost all of these are technical terms.  While
I personally have not searched on e.g. phototaxis or
tropotaxis I expect that scientists do want to search
on one of the 30+ words that end with -taxis.

 
Hopefully helpfully yours,
Steve
-- 
Steven Tolkin          steve.tolkin <at> fmr.com      617-563-0516 
Fidelity Investments   82 Devonshire St. V1D     Boston MA 02109
There is nothing so practical as a good theory.  Comments are by me, 
not Fidelity Investments, its subsidiaries or affiliates.



Andreas Jung | 15 Dec 2001 21:40

Python bindings for Snowball

Hi everyone,

I found Snowball so cool that I have written Python bindings for Snowball
- with support for all nine Snowball stemmers !
I will release the bindings soon on Sourceforge. If anyone is interested
in a pre-release, let me know.

Andreas

    ---------------------------------------------------------------------
   -    Andreas Jung                            Zope Corporation       -
  -   EMail: andreas <at> zope.com                http://www.zope.com      -
 -  "Python Powered"                       http://www.python.org     - 
  -   "Makers of Zope"                       http://www.zope.org      - 
   -                  "Life is a fulltime occupation"                  -
    ---------------------------------------------------------------------

Richard Boulton | 15 Dec 2001 23:35

Re: Python bindings for Snowball

On Sat, 2001-12-15 at 20:40, Andreas Jung wrote:
> Hi everyone,
> 
> I found Snowball so cool that I have written Python bindings for Snowball
> - with support for all nine Snowball stemmers !

I count twelve stemmers now...

> I will release the bindings soon on Sourceforge. If anyone is interested
> in a pre-release, let me know.

Excellent!  I've just done a bit of work on making a simple and
consistent interface for C, consisting of three functions:

struct sb_stemmer *
sb_stemmer_create(const char * language);

void
sb_stemmer_release(struct sb_stemmer * stemmer);

const char *
sb_stemmer_stem(struct sb_stemmer * stemmer,
                const char * word, int size);

(This code is in the libstemmer directory in the CVS - the Makefile
needs tweaking so that it compiles the code as a library, but other than
that the work is done.)

I'd like to collaborate, to ensure that the bindings for all languages
are as consistent as possible.  Definitely interested in a pre-release.

I'd also like to try and keep all snowball related material easy to find
- so ideally I'd like to see the bindings being added to the website. 
I'd be very happy to help set this up.

-- 
Richard
