Peter Stahl | 13 Jun 16:05 2010

Error in the vocabulary for Italian stemmer?

Hi everyone,

my name is Peter and I'm a student of computational linguistics from Bochum, Germany. While porting your
stemmers to pure Python, I have been comparing the output of my code with the testing
vocabulary provided on your site. Most recently, I checked the Italian stemmer. My
implementation produces the same results as yours, with the exception of nine words. They all end in
either 'ici' or 'ice'. In my opinion, the stemmed forms of those words in the corresponding diffs.txt file
are wrong according to the steps of the algorithm as described. These are the words in question:

giocatrici      giocatr      (giocatric)
mediatrice      mediatr      (mediatric)
pagatrice       pagatr       (pagatric)
portatrice      portatr      (portatric)
portatrici      portatr      (portatric)
ricreatrice     ricreatr     (ricreatric)
roccatrici      roccatr      (roccatric)
salvatrice      salvatr      (salvatric)
sfruttatrici    sfruttatr    (sfruttatric)

The first column shows the unstemmed forms, the second shows the stemmed forms of my implementation, the
third shows the stemmed forms of your testing vocabulary.
Let us take the word 'giocatrici' as an example:

R1 = 'atrici'
R2 = 'rici'
RV = 'catrici'

According to step 1, the suffixes 'ici' or 'ice', respectively, are supposed to be deleted if one of these is
in R2. This is definitely the case here. So the word 'giocatrici' is stemmed to 'giocatr' and not to
'giocatric'. 
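Is this an error in your testing vocabulary or in the description of the Italian stemmer? Or did I miss
anything?

Thanks and best regards,
Peter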
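
The regions R1, R2 and RV quoted in this message can be recomputed with a few lines of Python. This is only a minimal sketch, not the Snowball implementation: the function names are made up for illustration, the vowel set and region rules follow one reading of the published Italian algorithm description, and the Italian prelude (accent normalisation and the qu/i/u handling) is left out because it does not change this particular word.

```python
# Assumed Italian vowel set, taken from the published algorithm description.
VOWELS = set("aeiouàèìòù")


def _after_first_vowel_consonant(s):
    """Region after the first non-vowel that follows a vowel, or '' if none."""
    for i in range(1, len(s)):
        if s[i] not in VOWELS and s[i - 1] in VOWELS:
            return s[i + 1:]
    return ""


def r1_r2(word):
    """Return (R1, R2); R2 is the same rule applied again inside R1."""
    r1 = _after_first_vowel_consonant(word)
    return r1, _after_first_vowel_consonant(r1)


def rv(word):
    """RV as described for the Italian (and Spanish-style) stemmers."""
    if len(word) < 2:
        return ""
    if word[1] not in VOWELS:
        # Second letter is a consonant: region after the next vowel.
        for i in range(2, len(word)):
            if word[i] in VOWELS:
                return word[i + 1:]
        return ""
    if word[0] in VOWELS:
        # First two letters are vowels: region after the next consonant.
        for i in range(2, len(word)):
            if word[i] not in VOWELS:
                return word[i + 1:]
        return ""
    # Consonant-vowel case: region after the third letter.
    return word[3:]


print(r1_r2("giocatrici"), rv("giocatrici"))
```

A suffix is "in R2" when it is no longer than R2, since both are tails of the same word. By that test 'ici' fits inside 'rici' but 'atrici' does not.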

Peter Stahl | 13 Jun 18:34 2010

Error in the vocabulary for Portuguese stemmer?

Hi,

I'm sorry for writing to you again, but I suppose there are some errors in the vocabulary for Portuguese,
too. This time it is about the following four words:

adquirira     adquirira     (adquir)
conseguira    conseguira    (consegu)
explodira     explodira     (explod)
itabira       itabira       (itab)

Again, this concerns step 1. It says that the suffix 'ira' or 'iras' is replaced by 'ir' if it is in RV
and preceded by 'e'. That last condition does not hold for any of these four words, so the suffixes
should not be removed. What do you think?

Best regards,
Peter

Martin Porter | 14 Jun 13:02 2010

Re: Error in the vocabulary for Italian stemmer?


Peter, 

The point is that the search is for the longest suffix, and when no action
is taken for that suffix, the algorithm doesn't then move to a shorter
suffix for which action can be taken.

So giocatrici (=comedienne, I assume) ends -ici as well as -atrici. The
longer, -atrici, is taken, and there is no action (because it is not in R2).
The -ici ending is not considered, even though it is in R2.

In other words, you search for the longest suffix and see if it is removable;
you don't search for the longest suffix which is removable.

I guess the same point lies behind your second email,

Martin
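
The difference between the two readings can be sketched in Python as follows. This is an illustration under stated assumptions, not the Snowball code: the function names are hypothetical, only four of the step 1 suffixes are listed, and R2 is passed in by hand as 'rici' (the value given in the original message).

```python
# Small subset of the Italian step 1 suffixes, enough for this example.
SUFFIXES = ("atrici", "atrice", "ici", "ice")


def strip_longest_match(word, r2, suffixes=SUFFIXES):
    """Intended reading: find the longest matching suffix, then test it once."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix):
            if len(suffix) <= len(r2):      # suffix lies inside R2
                return word[:-len(suffix)]
            return word                     # no action, and no fallback
    return word


def strip_longest_removable(word, r2, suffixes=SUFFIXES):
    """Mistaken reading: keep trying shorter suffixes until one is removable."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(suffix) <= len(r2):
            return word[:-len(suffix)]
    return word


print(strip_longest_match("giocatrici", "rici"))      # giocatrici
print(strip_longest_removable("giocatrici", "rici"))  # giocatr
```

With the intended reading, step 1 leaves the word untouched and later steps produce the 'giocatric' found in the testing vocabulary; the mistaken reading yields 'giocatr' at step 1.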


Peter Stahl | 14 Jun 18:00 2010

Fwd: Error in the vocabulary for Italian stemmer?



Begin forwarded message:

From: Peter Stahl <pemistahl <at> googlemail.com>
Date: 14 June 2010 17:56:13 CEST
Subject: Re: [Snowball-discuss] Error in the vocabulary for Italian stemmer?

Hi Martin,

thanks for your reply. I think I understand what you mean. The funny thing is that I implemented my wrong
interpretation of your algorithm for every language and got no mismatches against your testing
vocabulary, except for Italian and Portuguese. I wrote my implementations from the descriptions
provided on your site, without looking at your code. For people who want to work the same way, it
would be good if the descriptions made it a bit clearer that one should not search for the longest
suffix that can be deleted, as this might be a source of misunderstandings.

Best regards,
Peter  



At 04:05 PM 6/13/2010 +0200, Peter Stahl wrote:

Hi everyone,

my name is Peter and I'm a student of computational linguistics from
Bochum, Germany. During my attempt to port your stemmers to pure Python, I
always compare the stemmed results of my code to those in the testing
vocabulary provided on your site. At last, I compared the results of the
Italian stemmer. My implementation produces the same results as yours, but
with the exception of nine words. They all end with either 'ici' or 'ice'.
In my opinion, the stemmed forms of those words in the appropriate diffs.txt
file are wrong according to the mentioned steps of the algorithm. It is all
about the following words:

giocatrici      giocatr      (giocatric)
mediatrice      mediatr      (mediatric)
pagatrice       pagatr       (pagatric)
portatrice      portatr      (portatric)
portatrici      portatr      (portatric)
ricreatrice     ricreatr     (ricreatric)
roccatrici      roccatr      (roccatric)
salvatrice      salvatr      (salvatric)
sfruttatrici    sfruttatr    (sfruttatric)

The first column shows the unstemmed forms, the second shows the stemmed
forms of my implementation, the third shows the stemmed forms of your
testing vocabulary.
Let us take the word 'giocatrici' as an example:

R1 = 'atrici'
R2 = 'rici'
RV = 'catrici'

According to step 1, the suffixes 'ici' or 'ice', respectively, are
supposed to be deleted if one of these is in R2. This is definitely the case
here. So the word 'giocatrici' is stemmed to 'giocatr' and not to 'giocatric'.
Is this an error in your testing vocabulary or in the description of the
Italian stemmer? Or did I miss anything?


Thanks and best regards,
Peter



_______________________________________________
Snowball-discuss mailing list
Snowball-discuss <at> lists.tartarus.org
http://lists.tartarus.org/mailman/listinfo/snowball-discuss

James Aylett | 14 Jun 19:27 2010

Re: Fwd: Error in the vocabulary for Italian stemmer?

On 14 Jun 2010, at 17:00, Peter Stahl wrote:

> if you could make it a bit clearer in the descriptions that one should not search for the longest suffix
> that can be deleted, as this might be a source for misunderstandings.

Peter, is there anything beyond the behaviour of among(…) that you mean? IMHO this is made fairly clear
in 7g of the manual.

James

-- 
 James Aylett
 talktorex.co.uk - xapian.org - devfort.com

Martin Porter | 16 Jun 11:09 2010

Re: Error in the vocabulary for Italian stemmer?


> For people who want to do it the same way it would be good, if you could
> make it a bit clearer in the descriptions that one should not search for the
> longest suffix that can be deleted, as this might be a source for
> misunderstandings.

I think the descriptions, if carefully read, are clear on this point, but
the important lesson here is that the snowball system does achieve what the
original Porter stemmer description did not, namely, it results in exact
definitions of the algorithms, since errors in recoding are detectable and
correctable. Incidentally, that particular error (searching for the longest
suffix that can be deleted rather than the longest suffix, and then seeing
if it is deletable) was built into an early encoding of the Porter stemmer
which was standardly used for many years, and lies behind the note in the
description of Snowball at http://snowball.tartarus.org/texts/introduction.html,

"A good test is to type in agreement. It should stem to agreement — the same
word. If it stems to agreem there is an error."
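
The 'agreement' test can be reproduced with a small Python sketch of the relevant part of Porter step 4. This is a simplification for illustration only: it lists just three of the step 4 suffixes, treats only a, e, i, o, u as vowels (ignoring Porter's special handling of 'y'), and the function names are hypothetical.

```python
# Three of the Porter step 4 suffixes; each is removed only if m > 1.
STEP4_SUFFIXES = ("ement", "ment", "ent")


def measure(stem):
    """Porter's m: the number of vowel-to-consonant transitions in the stem."""
    m = 0
    prev_is_vowel = False
    for ch in stem:
        is_vowel = ch in "aeiou"    # simplification: 'y' is ignored
        if prev_is_vowel and not is_vowel:
            m += 1
        prev_is_vowel = is_vowel
    return m


def step4_longest_match(word):
    """Intended reading: test only the longest matching suffix."""
    for suffix in sorted(STEP4_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return stem if measure(stem) > 1 else word
    return word


def step4_longest_removable(word):
    """The old error: fall back to shorter suffixes until one is removable."""
    for suffix in sorted(STEP4_SUFFIXES, key=len, reverse=True):
        stem = word[:-len(suffix)]
        if word.endswith(suffix) and measure(stem) > 1:
            return stem
    return word


print(step4_longest_match("agreement"))      # agreement
print(step4_longest_removable("agreement"))  # agreem
```

For 'agreement' the longest match is 'ement', whose stem 'agr' has m = 1, so nothing is removed. The faulty variant carries on to 'ent', whose stem 'agreem' has m = 2, and strips it.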

Peter Stahl | 16 Jun 17:02 2010

Re: Error in the vocabulary for Italian stemmer?

Hi again,

now I have a different problem, this time with the Romanian stemmer, specifically step 4 of the
description. There are a lot of short words to which only step 4 can apply. However, according to the
results in your testing vocabulary they are not stemmed at all, and I don't see why. Can you please
explain this?

Let's take the word 'ambe', for example. R1 is 'be'; R2 and RV are empty strings. Only step 4 can apply here.
The longest matching suffix among those of step 4 is the vowel 'e', and it is in R1. But your results say
that the word should not be stemmed to 'amb'. Why?

Thank you very much. (I hope I'm not getting on your nerves. But I really don't know how to solve this.)

Best regards,
Peter 

Martin Porter | 17 Jun 09:43 2010

Re: Error in the vocabulary for Italian stemmer?


Peter,

This time you have found something ...

R1 in Step 4 of the Romanian algorithm definition should be RV. (RV is used
in the corresponding routine vowel_suffix of the snowball script.) I will
correct the page on the website soon.

Sorry about that. I guess it proves that the Romanian algorithm has not
previously been recoded following the algorithm definition.

I'm interested that you seem to be recoding all the algorithms in Python. If
they are BSD licensed, might you make them available to us?

Martin
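
The effect of the R1 versus RV correction on 'ambe' can be checked with a short Python sketch. It is an illustration under assumptions rather than the Snowball code: the function names are hypothetical, and the Romanian vowel set, the RV rule and the step 4 suffix list are taken from one reading of the published algorithm description.

```python
# Assumed Romanian vowels and step 4 (final vowel) suffixes.
VOWELS = set("aeiouăâî")
FINAL_VOWEL_SUFFIXES = ("ie", "a", "e", "i", "ă")


def r1(word):
    """Region after the first non-vowel that follows a vowel."""
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return word[i + 1:]
    return ""


def rv(word):
    """RV, using the same three-case rule as the Spanish and Italian stemmers."""
    if len(word) < 2:
        return ""
    if word[1] not in VOWELS:
        # Second letter is a consonant: region after the next vowel.
        for i in range(2, len(word)):
            if word[i] in VOWELS:
                return word[i + 1:]
        return ""
    if word[0] in VOWELS:
        # First two letters are vowels: region after the next consonant.
        for i in range(2, len(word)):
            if word[i] not in VOWELS:
                return word[i + 1:]
        return ""
    return word[3:]  # consonant-vowel case: region after the third letter


def step4(word, region):
    """Delete the longest matching final-vowel suffix if it lies in `region`."""
    for suffix in sorted(FINAL_VOWEL_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            return word[:-len(suffix)] if len(suffix) <= len(region) else word
    return word


print(step4("ambe", r1("ambe")))  # amb  (description as originally published)
print(step4("ambe", rv("ambe")))  # ambe (with the correction to RV)
```

Because RV is empty for 'ambe', the corrected step 4 leaves the word alone, which matches the testing vocabulary.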

Peter Stahl | 18 Jun 16:26 2010

Re: Error in the vocabulary for Italian stemmer?

Hi Martin,

thanks, my port of the Romanian stemmer is now working properly. Do you know the Natural Language Toolkit
(http://www.nltk.org)? It's a collection of Python modules for natural language
processing, and anyone can contribute to it. As NLTK has helped me a lot during my
studies, I wanted to give something back by porting your stemmers to Python. I started a discussion about it
here: http://groups.google.com/group/nltk-dev/browse_thread/thread/6098341c1ac4b7a3

If the NLTK people don't mind, I'm happy to make the stemmers available to you, too. I'm going to ask
them about it.

Have a nice weekend,
Peter

Olly Betts | 18 Jun 17:03 2010

Re: Fwd: Error in the vocabulary for Italian stemmer?

On Mon, Jun 14, 2010 at 06:00:56PM +0200, Peter Stahl wrote:
> The funny thing is that I implemented my wrong interpretation of your
> algorithm for every language and I didn't get any errors according to your
> testing vocabulary, except for Italian and Portuguese.

It's a little unfortunate if the test vocabulary doesn't catch an
implementation error, particularly one which has been made in multiple
independent implementations.

Do you have the incorrect implementations still?  I'd be interested to
try to add coverage for this.

Cheers,
    Olly
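
One way to look for such coverage gaps, sketched in Python, is to run a reference stemmer and a deliberately wrong variant over a word list and keep only the words on which they differ. Everything here is hypothetical: `disagreements` is an illustrative helper, and the two toy stemmers merely stand in for a real reference implementation and a real faulty port.

```python
def disagreements(words, reference_stem, candidate_stem):
    """Return (word, reference result, candidate result) where the two differ."""
    rows = []
    for word in words:
        expected = reference_stem(word)
        actual = candidate_stem(word)
        if expected != actual:
            rows.append((word, expected, actual))
    return rows


# Toy stand-ins, not real stemmers: the reference drops a final 'i',
# the candidate additionally strips a whole 'ici' ending.
def toy_reference(word):
    return word[:-1] if word.endswith("i") else word


def toy_candidate(word):
    return word[:-3] if word.endswith("ici") else toy_reference(word)


for row in disagreements(["giocatrici", "porti", "casa"], toy_reference, toy_candidate):
    print(row)
```

Words reported by such a comparison are candidates for the test vocabulary; a language for which the list comes back empty is one where the current vocabulary cannot detect the error.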
