Felipe Moyano | 10 Jun 2005 19:14

Spanish word stemmer

Hello,
My name is Felipe Moyano and I am a Spanish student at the University of
Udine. We are doing a small research project on the Spanish word stemmer in
Java. We have tested it and found some possible errors. These errors
have been classified in a table that I am sending you in sufix_research.xls. Can
you tell me if these errors can be resolved? Are they really errors, or were they
introduced deliberately to improve search efficiency? I have also looked at the
source files. What do the int attributes substring and result mean in the Among
class? (-1,1) or (-1,-1) or (16,2) ....

Thank you very much

Attachment (sufix_research.xls): application/octet-stream, 20 KiB
_______________________________________________
Snowball-discuss mailing list
Snowball-discuss <at> lists.tartarus.org
http://lists.tartarus.org/mailman/listinfo/snowball-discuss
Martin Porter | 10 Jun 2005 21:40

Re: Spanish word stemmer


Felipe

Unfortunately I have no means of reading .xls files. Would it be possible
for you to send the same information in a plain text form?

Martin
Martin Porter | 12 Jun 2005 16:29

Re: Spanish word stemmer


Felipe

Thank you for the xls file in text form. It raises a number of interesting
points, and I will send an answer in the next few days. Possibly I will try
some extensions to the algorithm, and this may take me a bit longer, since I
would want to look at the possibility of similar adjustments in French,
Portuguese and Italian.

More soon,

Martin
Martin Porter | 14 Jun 2005 18:07

Re: Spanish word stemmer


Felipe,

Thank you for the suffix list. It has proved interesting to work through it. The
general answer to your question is that the non-inclusion of these suffixes
(apart from ante/antes -- see below) is intentional, and also is best for the
algorithm.

You must remember that if a word ends with X, and X is a suffix in the language,

a) X may significantly alter the meaning of a word. In this case it should not,
in an IR context, be removed.

b) X may not be a true suffix, but merely form the end of the stem.

c) X may be rare in the language, and hardly therefore worth removing.

d) X may be removable, but not worth removing because it leads to no further
conflations.
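
Point (d) can be checked mechanically against a vocabulary. The following is only a rough sketch in Python, not part of the Snowball stemmers: the function `count_new_conflations`, the helper `toy_stem` and the four-word vocabulary are all hypothetical, and `toy_stem` is a deliberately crude stand-in for a real stemmer. It counts how many words ending in X would, once X is stripped, coincide with the stem of some other word.

```python
def toy_stem(word):
    # Hypothetical baseline stemmer: strips one final vowel and nothing else.
    return word[:-1] if word.endswith(("a", "e", "o")) else word


def count_new_conflations(vocabulary, suffix, baseline_stem):
    """Count words ending in `suffix` whose remainder, after stripping the
    suffix, equals the baseline stem of a word that does not end in it."""
    # Stems already produced for words that do not carry the suffix.
    existing = {baseline_stem(w) for w in vocabulary if not w.endswith(suffix)}
    merged = 0
    for word in vocabulary:
        # Require a non-empty remainder after removing the suffix.
        if word.endswith(suffix) and len(word) > len(suffix):
            candidate = word[: -len(suffix)]
            if candidate in existing:
                merged += 1
    return merged


# Hypothetical mini-vocabulary, for illustration only.
vocab = ["consultorio", "consulta", "accesorio", "absolutorio"]
print(count_new_conflations(vocab, "orio", toy_stem))
```

A count like this only addresses (d). It says nothing about (a): a suffix can produce conflations and still change the meaning too much to be removed.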

Most of the suffixes you instance exhibit one or more of these features. Here is
your list:

SUFFIX
 EXAMPLES      STEMMER     RIGHT

orio/a/os/as                        <--- changes meaning too much
 absolutorio   absolutori  absolut
 accesorio     accesori    acces
 consultorio   consultori  consult


Martin Porter | 16 Jun 2005 10:05

Re: Spanish word stemmer


Following a communication from Felipe Moyano I have added ending detection
for -ANT, -ANCE suffixes to the Spanish, Portuguese and Italian stemmers. To
see the changes made, look for '// Note 1' in the .sbl scripts themselves.
The stemmer definitions, vocabularies, and introductory page on the Romance
stemmers have been updated accordingly.

(The reason for the earlier omission is that these stemmers were derived from
the structure of the French stemmer, where the noun-making derivational suffix
-ant does not require separate treatment, since it gets removed as a verb
suffix. It was therefore overlooked as a separate derivational ending.)
