Re: Portuguese Stemmer
Martin Porter <martin <at> porterloo.wanadoo.co.uk>
2009-10-28 16:50:55 GMT
Yes, the stemmers can be improved, among other things by building in
"exception lists" at various critical places. If you look at the snowball
site the Porter2 stemmer has been improved upon in this way. We've not done
this for the other languages, however, because it requires an intimate
language knowledge only available to native speakers.
Actually I wouldn't worry about your example too much. It is, I take it, an
irregular plural, and that is usually not what worries people in the English
stemmer. The English stemmer does not try to rectify irregular plurals
(woman/women) or irregular past tense forms (sing/sang/sung). No one
complains in practice. What upsets people is the Porter stemmer stemming
"arsenal" and "arsenic" to the same form. (The first is, among other things,
a popular football club.) So that is the sort of thing one tries to rectify
in an exception list.
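
A minimal sketch of the exception-list idea, in Python. Everything here is an illustrative assumption: toy_stem is a hypothetical stand-in for a real algorithmic stemmer (it is not the Porter or Porter2 rule set), and the EXCEPTIONS entries are made up for the example rather than taken from the Snowball sources. The lookup runs first, and the suffix rules apply only to words that are not listed.

```python
# Hypothetical illustration: neither the suffix rules nor the exception
# entries below are taken from the Snowball sources.

def toy_stem(word):
    """Stand-in for an algorithmic stemmer: strip the first matching suffix."""
    for suffix in ("ic", "al", "s"):
        # Keep at least three characters of stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Words whose stem is fixed by hand, bypassing the suffix rules.
EXCEPTIONS = {
    "arsenal": "arsenal",
    "arsenic": "arsenic",
}

def stem(word, exceptions=EXCEPTIONS, fallback=toy_stem):
    """Consult the exception list first, then fall back to the rules."""
    word = word.lower()
    if word in exceptions:
        return exceptions[word]
    return fallback(word)

if __name__ == "__main__":
    for w in ("arsenal", "arsenic", "cats"):
        print(w, toy_stem(w), stem(w))
```

With the toy rules alone, "arsenal" and "arsenic" collapse to one form; with the lookup in front they stay distinct, while unlisted words still go through the rules.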
It would be useful to have French2, Portuguese2 stemmers, and so on,
corresponding to the English2 stemmer -- taken a step further by native
speakers familiar with the complaints of an audience using the stemmers as
part of an IR system, but we have not really managed to establish such a
community, at least, not yet.
>I was investigating Lucene and came upon the SnowBall package. I was
>testing the PortugueseStemmer to see how applicable the SnowBallAnalyzer
>would be in our portal (using Lucene). I tested stemming the word airplane
>in Portuguese, which is avião and its plural is aviões. Apparently the
>PortugueseStemmer will not stem the two words to be the same (aviões
>stemmed to aviõ and avião stemmed to aviã).