13 Jun 2010 16:05
Error in the vocabulary for Italian stemmer?
Peter Stahl <pemistahl <at> googlemail.com>
2010-06-13 14:05:41 GMT
Hi everyone,

my name is Peter and I'm a student of computational linguistics from Bochum, Germany. While porting your stemmers to pure Python, I always compare the stemmed results of my code to those in the testing vocabulary provided on your site. Most recently, I compared the results of the Italian stemmer. My implementation produces the same results as yours, with the exception of nine words. They all end in either 'ici' or 'ice'. In my opinion, the stemmed forms of those words in the corresponding diffs.txt file are wrong according to the steps of the algorithm as described.

These are the words in question:

giocatrici      giocatr      (giocatric)
mediatrice      mediatr      (mediatric)
pagatrice       pagatr       (pagatric)
portatrice      portatr      (portatric)
portatrici      portatr      (portatric)
ricreatrice     ricreatr     (ricreatric)
roccatrici      roccatr      (roccatric)
salvatrice      salvatr      (salvatric)
sfruttatrici    sfruttatr    (sfruttatric)

The first column shows the unstemmed forms, the second the stemmed forms produced by my implementation, and the third the stemmed forms from your testing vocabulary.

Let us take the word 'giocatrici' as an example:

R1 = 'atrici'
R2 = 'rici'
RV = 'catrici'

According to step 1, the suffix 'ici' or 'ice' is supposed to be deleted if it is in R2. This is definitely the case here. So the word 'giocatrici' is stemmed to 'giocatr' and not to 'giocatric'.
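To make the region values quoted for 'giocatrici' easier to check, here is a minimal Python sketch. It assumes the usual Snowball definitions: R1 is the region after the first non-vowel that follows a vowel, R2 is the same rule applied inside R1, and RV follows the rule the Italian stemmer shares with the Spanish one. The function names are hypothetical, the stemmer's prelude (accent normalisation and u/i marking) is skipped, and this is only the region check, not a full stemmer.

```python
VOWELS = set("aeiouàèìòù")  # vowel set assumed from the Italian stemmer description


def _after_vowel_consonant(word, start):
    """Index just past the first non-vowel that follows a vowel, searching from start."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)  # region is empty if no such position exists


def rv_start(word):
    """Start index of RV (hypothetical helper, Spanish/Italian rule)."""
    n = len(word)
    if n < 2:
        return n
    if word[1] not in VOWELS:
        # Second letter is a consonant: RV starts after the next vowel.
        for i in range(2, n):
            if word[i] in VOWELS:
                return i + 1
        return n
    if word[0] in VOWELS:
        # First two letters are vowels: RV starts after the next consonant.
        for i in range(2, n):
            if word[i] not in VOWELS:
                return i + 1
        return n
    # Consonant-vowel case: RV starts after the third letter.
    return min(3, n)


def regions(word):
    """Return the strings (R1, R2, RV) for a lowercase word."""
    r1 = _after_vowel_consonant(word, 0)
    r2 = _after_vowel_consonant(word, r1)
    return word[r1:], word[r2:], word[rv_start(word):]


def suffix_in_r2(word, suffix):
    """True if word ends with suffix and the whole suffix lies inside R2."""
    r1 = _after_vowel_consonant(word, 0)
    r2 = _after_vowel_consonant(word, r1)
    return word.endswith(suffix) and len(word) - len(suffix) >= r2


print(regions("giocatrici"))
print(suffix_in_r2("giocatrici", "ici"))
```

For 'giocatrici' this gives the same R1, R2 and RV as listed above, and it confirms that the three letters 'ici' fall inside R2. It says nothing about which suffix from the step 1 list the stemmer actually selects.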