A while ago I said I might suggest more
fundamental changes to
the approach used in the Porter2 stemmer.
Here is another one, probably my last.
(You can probably also improve the handling of
f -> v, e.g. life, self,
etc.)
There are over 800 words that end with -sis
and of these only 11,
about 1%, are
plurals.
Almost all the rest are singular words, whose
plural ends with -ses.
The words that are plurals are generally quite
uncommon. Here they are:
brindisis chaprassis dalasis kolbasis kolbassis
lassis
pachisis parchesis parchisis reversis sannyasis
tsotsis
So instead of the current rule, which simply
removes the final -s, I propose
the following rule, which changes -sis to -ses,
with a few exceptions.
(We generally want to conflate singular and
plural. But there are too
many -ses words to go in the usual direction from
plural to singular.
So this rule goes in the other
direction.)
This must be run before the current rule 1,
so I'll call it rule 0.5a.
I express this in pseudocode.
if word is theses then stem is thesis && stop
if word ends with sis {
    if word is sis then stem is sis && stop
    if word is psis then stem is psi && stop
    if word is thesis then stem is thesis && stop
    change final sis to ses
}
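Rule 0.5a could be sketched in Python roughly as follows. This is only an illustration of the pseudocode, not code from any Snowball release; the names SIS_EXCEPTIONS and rule_0_5a are hypothetical, and the returned flag stands in for "stop".

```python
from typing import Tuple

# Hypothetical table of whole-word special cases for rule 0.5a.
# "theses" does not end in -sis, so it is matched as a whole word
# before the suffix test rather than inside it.
SIS_EXCEPTIONS = {
    "sis": "sis",
    "psis": "psi",
    "thesis": "thesis",
    "theses": "thesis",
}


def rule_0_5a(word: str) -> Tuple[str, bool]:
    """Return (word, stop); stop=True means the result is the final stem."""
    if word in SIS_EXCEPTIONS:
        return SIS_EXCEPTIONS[word], True
    if word.endswith("sis"):
        # Go from singular to plural; the later rules then treat the
        # result like any other -ses word.
        return word[:-3] + "ses", False
    return word, False
```

For example, rule_0_5a("analysis") gives ("analyses", False), while rule_0_5a("thesis") gives ("thesis", True).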
I put special handling for thesis and theses
because otherwise these
would become "these". Certainly thesis is a
likely search term.
(Another possible stem for thesis and theses might
be "thes".)
(The rule above could be written so that -sis must occur in the R1
or R2 region. That would remove the special cases for sis and
psis,
but would cause the need to add several others.)
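To illustrate the R1 variant, here is a rough Python sketch. It assumes a simplified R1 (the region after the first non-vowel that follows a vowel, with a, e, i, o, u, y as vowels) and ignores Porter2's special handling of y and of a few prefixes. The names r1_start and sis_in_r1 are hypothetical.

```python
VOWELS = set("aeiouy")


def r1_start(word: str) -> int:
    """Index where the simplified R1 begins (len(word) if R1 is empty)."""
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def sis_in_r1(word: str) -> bool:
    """True if the word ends with -sis and that suffix lies wholly in R1."""
    return word.endswith("sis") and len(word) - 3 >= r1_start(word)
```

Under this simplified definition sis_in_r1 is false for sis and psis, so they need no special case, and true for analysis. It is also false for short words such as thesis, basis and crisis, which is the kind of case that would then have to be added back by hand.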
The 11 true plurals above are no longer handled correctly, but those
words
are rare and many other plurals are not handled
correctly today, so I do not bother
to fix them. Perhaps one could special-case lassis
-> lassi to avoid a clash with lass.
Another possible special case is "basis". The
rule above conflates it with bases,
which is its plural, but that causes it to also
conflate with base. One might want
to add another special case: if word is basis then
stem is basis && stop
This rule causes a few conflations that might not
be as desirable as possible,
e.g. ellipsis and ellipses, synapsis and synapses,
phasis and phases,
and whosis and whose.
These could also be worth adding to the list
of special cases.
But I have tried to have as few as
possible.
An analogous rule applies to -xis. Again,
almost all of the roughly 60 words
ending with -xis are
not plural.
The rule 0.5b below captures this, and the few
exceptions.
if word ends with xis {
    if word is xis then stem is xi && stop
    if word is maxis then stem is maxi && stop
    if word is taxis then stem is taxi && stop
    change final xis to xes
}
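Rule 0.5b could be sketched in Python in the same way. Again this is only an illustration of the pseudocode; the names XIS_EXCEPTIONS and rule_0_5b are hypothetical, and the returned flag stands in for "stop".

```python
from typing import Tuple

# Hypothetical table of whole-word special cases for rule 0.5b:
# these are true plurals, so they stem to their singular.
XIS_EXCEPTIONS = {
    "xis": "xi",
    "maxis": "maxi",
    "taxis": "taxi",
}


def rule_0_5b(word: str) -> Tuple[str, bool]:
    """Return (word, stop); stop=True means the result is the final stem."""
    if word in XIS_EXCEPTIONS:
        return XIS_EXCEPTIONS[word], True
    if word.endswith("xis"):
        # Singular to plural, e.g. axis -> axes.
        return word[:-3] + "xes", False
    return word, False
```

For example, rule_0_5b("axis") gives ("axes", False), while rule_0_5b("taxis") gives ("taxi", True).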
Here axis gets conflated with axes (its plural) but also with axe.
That seems
acceptable. (There is a singular word taxis, with plural taxes, but
both those
strings are far more common in their usual meaning. We do not want to
conflate taxis with tax.)
Misc.
I have written these as 2 separate rules but a
performance tweak might test if
the word ends
with -is first.
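One way the combined test might look, as a Python sketch: check for -is once, then branch on the preceding letter. The names IS_EXCEPTIONS and rule_0_5 are hypothetical, and whether this is actually faster would need measuring in the real implementation.

```python
from typing import Tuple

# Hypothetical merged table of the special cases from rules 0.5a and 0.5b.
IS_EXCEPTIONS = {
    "sis": "sis", "psis": "psi", "thesis": "thesis",
    "xis": "xi", "maxis": "maxi", "taxis": "taxi",
}


def rule_0_5(word: str) -> Tuple[str, bool]:
    """Return (word, stop); stop=True means the result is the final stem."""
    if word == "theses":  # does not end in -is, so it is tested separately
        return "thesis", True
    if not word.endswith("is"):
        return word, False  # cheap early exit for most words
    if word in IS_EXCEPTIONS:
        return IS_EXCEPTIONS[word], True
    if len(word) >= 3 and word[-3] in "sx":
        # -sis -> -ses and -xis -> -xes: only the final "is" changes.
        return word[:-2] + "es", False
    return word, False
```

Words such as "this" or "is" pass the -is test but are left unchanged because the preceding letter is neither s nor x.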
On a completely separate topic, "lens" is
another word
that should be special-cased to return "lens" as
its stem, so that
it conflates with lenses (and so that it does not
conflate with "len", the
common computer science abbreviation for
length).
References:
This analysis is based on the very large list of
words known as YAWL (Yet Another
and elsewhere.
Hopefully helpfully yours,
Steve
--
Steven Tolkin
steve.tolkin <at> fmr.com
617-563-0516
Fidelity Investments 82 Devonshire St.
V1D Boston MA 02109
There is nothing so practical as
a good theory. Comments are by me,
not Fidelity Investments, its
subsidiaries or affiliates.