Martin Porter | 1 Feb 2003 12:25

Re: French stemmer, Snowball project

Fred,

Your comments have been very useful, and I've made changes to the French
stemmer page accordingly. I will soon add your name, and the names of the
other email helpers, to the 'credits' section of the Snowball site.

> ....  Maybe it is just personal because I 
>am involved in information retrieval but: would it be a good idea to 
>explain the consequences of this particular rule in the description of 
>the algorithm?  ....

The difficulty of course is trying to keep the algorithm descriptions down
to a manageable size. All the rules in an algorithm might be given an
explanation along these lines. But I am aware that there is a gap here. 

Martin

Martin Porter | 1 Feb 2003 12:25

Re: R: italian stemming

At 01:51 PM 9/3/02 +0200, enea wrote:
>I encountered another problem with your stemmer.
>israele is stemmed to israel
>israeliano is stemmed to israel (ano is verb suffix)
>israeliani, israeliane and israeliana are stemmed to israelian
>I wonder if I should remove ano from the verb suffixes and add ano, ani, ana
>and ane as standard suffixes since this problem is frequent (eg.
>italiano, italiani; partigiano, partigiani; gabbiano, gabbiani; indiano,
>indiani; isolano, isolani; romano, romani...). What do you think?
>Regards,
>
>Enea
>

Enea,

Yes, -ano is problematical. It is interesting that the corresponding French
ending -ent (3rd person plural present indicative) is also problematical,
and is in fact not removed in the French stemmer. Of course -ent is slightly
broader than -ano, since it is the ending for all three classes of verb
conjugation.

One possibility is not to remove -ano at all. You might look at that.

Your idea of removing -ana, -ani, -ane often crops up in stemmer design, and
is quite sensible. As adjectival endings, you can think of them as

    -ano + -a
    -ano + -i
    -ano + -e

Oleg Bartunov | 1 Feb 2003 12:25

russian stemmer

Martin,

The current Russian stemmer doesn't seem to treat adjective endings like
'nogo', 'nomu', 'nyi' ...., so
velosipednogo (bicycle) -> velosipedn~ogo
velosipednyi -> velosipedn~yi
 while it would be better to have
velosipednogo -> velosiped~nogo
velosipednyi ->  velosiped~nyi

I'm not a linguist, so I don't know how to properly distinguish
'nogo' from 'ogo' etc. Probably there are some grammar rules.

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg <at> sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Martin Porter | 1 Feb 2003 12:25

Re: russian stemmer

Oleg,

I've had a look at -n-ogo, -n-yi etc endings through the Russian vocabulary,
and feel that I would need to take linguistic advice before I could make any
progress with -n- removal.  

As you may recall, I did the Russian stemmer with a linguist, Pat Miles, who
lives some 60 miles away, and is not really a computer user. Also, Pat
charges for his work, which is a further inconvenience to me! I'd rather try
to get free linguistic help now through the open source community. Is there
anyone you know in Russia who might experiment a bit further with the
Snowball stemmer to see if they could make improvements here?

Martin

>The current Russian stemmer doesn't seem to treat adjective endings like
>'nogo', 'nomu', 'nyi' ...., so
>velosipednogo (bicycle) -> velosipedn~ogo
>velosipednyi -> velosipedn~yi
> while it would be better to have
>velosipednogo -> velosiped~nogo
>velosipednyi ->  velosiped~nyi
>
>I'm not a linguist, so I don't know how to properly distinguish
>'nogo' from 'ogo' etc. Probably there are some grammar rules.

Svetlana Pereyaslavets | 1 Feb 2003 12:25

RE: Snowball-discuss digest, Vol 1 #5 - 1 msg

Dear Martin,
I am not a linguist, but I am a native Russian speaker. May I try to give some
explanation of this suffix?
Free, but hopefully helpful :-)
It is very common in Russian adjectives and adverbs that follow this
construction:

*****basic*construction********
  prefix-root  - "other optional" suffix - n (Oleg's question) - <adjective
ending> ( = yi/iy/oy.... through all genders and declensions)
*******************************
The rule has the following options:
1.  prefix-root  - n - <adjective ending>

	1.1. The root itself ends in -n-
	In this case we will encounter -nn- after stripping the adjective ending,
	and we SHOULD REMOVE one -n- (that is the suffix).
	Such words usually don't have prefixes (so they can easily be compared to
	the dictionary).
	Example: kon-n-yi (adjective from "kon'" = horse)

	1.2. The root ends in any other letter
	We SHOULD REMOVE the -n- (that is the suffix).
	Example: ruch-n-oy (adjective from "ruka" = hand).

2.  prefix-root  - "other optional" suffix - n - <adjective ending>

	2.1. Other optional suffix = -an- or -yan-
	-a- or -ya- SHOULD BE REMOVED TOGETHER with the suffix -n-.

Oleg Bartunov | 1 Feb 2003 12:25

Re: RE: Snowball-discuss digest, Vol 1 #5 - 1 msg

Svetlana, you're my hero! I'd like to test the Russian stemmer
(the current version is available at http://intra.astronet.ru/db/lingua/snowball/).
I'm working on the development of the Russian Scientific Network (www.nature.ru),
and we have found that stemming is very important for scientific corpora.

	Oleg
On Mon, 9 Sep 2002, Svetlana Pereyaslavets wrote:

> Dear Martin,
> I am not a linguist, but I am a native Russian speaker. May I try to give some
> explanation of this suffix?
> Free, but hopefully helpful :-)
> It is very common in Russian adjectives and adverbs that follow this
> construction:
>
> *****basic*construction********
>   prefix-root  - "other optional" suffix - n (Oleg's question) - <adjective
> ending> ( = yi/iy/oy.... through all genders and declensions)
> *******************************
> The rule has the following options:
> 1.  prefix-root  - n - <adjective ending>
>
> 	1.1. The root itself ends in -n-
> 	In this case we will encounter -nn- after stripping the adjective ending,
> 	and we SHOULD REMOVE one -n- (that is the suffix).
> 	Such words usually don't have prefixes (so they can easily be compared to
> 	the dictionary).
> 	Example: kon-n-yi (adjective from "kon'" = horse)
>
> 	1.2. The root ends in any other letter

Martin Porter | 1 Feb 2003 12:25

Re: RE: Snowball-discuss digest, Vol 1 #5 - 1 msg

Svetlana,

This information looks very good. Give me a little time to absorb it and I
will respond to the group.

Martin

Martin Porter | 1 Feb 2003 12:25

French stemmer, Snowball project

[This message to Fred Brault ought to be posted on snowball-discuss, as it
gives the background to some recent changes to the Snowball site. It was
originally sent 1 Sept 02 - Martin]

Fred,

I've looked at the algorithm alongside your comments, and can now report back.

Things aren't as bad as I at first thought ...

A) The rule for marking u and i/y as consonants is not explained at all
well, and will have to be improved. The point is that it works from left to
right along the word. So in risquiez (to quote one of your examples), the
first i is a vowel because it is between consonants, u is a consonant because
it is after q, and the following i is a vowel because it is after U (u as a
consonant) and before e. So the word becomes risqUiez, and not, as you have
it, risquIez.

If you think about it, the i should not be treated as a consonant in this case.

This accounts for most of the differences you noted.
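
To make the left-to-right point concrete, here is a minimal Python sketch of
the marking pass. It assumes the rule as stated in the published French
stemmer description (u or i between two vowels, y next to a vowel, and u
after q are marked as consonants by putting them in upper case). The names
mark_consonants and VOWELS are illustrative only and are not part of Snowball.

```python
# Assumed French vowel set; upper-case marks (U, I, Y) are deliberately absent,
# so a letter already marked as a consonant no longer counts as a vowel.
VOWELS = set("aeiouyâàëéêèïîôûù")

def mark_consonants(word):
    """Upper-case u, i, y where they act as consonants, scanning left to right."""
    chars = list(word.lower())
    for k, ch in enumerate(chars):
        # Neighbours are read from the partly marked word, not the original.
        prev = chars[k - 1] if k > 0 else ""
        nxt = chars[k + 1] if k + 1 < len(chars) else ""
        prev_v = prev in VOWELS
        next_v = nxt in VOWELS
        if ch == "u" and (prev == "q" or (prev_v and next_v)):
            chars[k] = "U"
        elif ch == "i" and prev_v and next_v:
            chars[k] = "I"
        elif ch == "y" and (prev_v or next_v):
            chars[k] = "Y"
    return "".join(chars)

print(mark_consonants("risquiez"))  # risqUiez
```

Because the u has already become U by the time the second i is examined,
that i is no longer preceded by a vowel and stays in lower case.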

B) There is a clear slip in the description of the algorithm, which you have
spotted. For 'eus' before 'ement', it should read: "delete if in R2, or
replace by 'eux' if in R1". I will put that right. ('ement' is tricky
because it is a single form for two endings - see the table of endings for
the Romance languages.)

This explains the 'pieusement' example. I realise the algorithm is failing
to equate pieuse with pieux, but that is a short-stem problem (see below).
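
The corrected 'eus' rule can be sketched in Python as follows. This is a
simplified illustration, not the stemmer itself: it assumes the usual
Snowball definition of R1 and R2 (R1 is the region after the first non-vowel
following a vowel, and R2 is the same construction applied within R1), it
removes 'ement' without checking RV, and it ignores the other cases of the
'ement' rule. The function names are hypothetical.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")  # assumed French vowel set

def region_start(word, start=0):
    """Index just past the first non-vowel that follows a vowel, from start on."""
    for k in range(start + 1, len(word)):
        if word[k] not in VOWELS and word[k - 1] in VOWELS:
            return k + 1
    return len(word)  # the region is empty

def eus_before_ement(word):
    """Remove 'ement', then treat a preceding 'eus' by the corrected rule."""
    if not word.endswith("ement"):
        return word
    r1 = region_start(word)      # start of R1
    r2 = region_start(word, r1)  # start of R2
    stem = word[:-len("ement")]
    if stem.endswith("eus"):
        pos = len(stem) - len("eus")
        if pos >= r2:            # 'eus' lies in R2: delete it
            return stem[:pos]
        if pos >= r1:            # 'eus' lies in R1 only: replace by 'eux'
            return stem[:pos] + "eux"
    return stem                  # 'eus' in neither region: leave it alone

for w in ("pieusement", "heureusement", "courageusement"):
    print(w, "->", eus_before_ement(w))
```

In 'pieusement' the 'eus' begins before R1 starts, so neither branch applies
and the result is 'pieus', which is the short-stem behaviour described above.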


Martin Porter | 1 Feb 2003 12:25

(no subject)

Svetlana,

I have looked through your note, and tried out -n- removal by modifying the
stemmer. I can't get a satisfactory result, and suspect this is one of the
things I tried and rejected when first developing it, although I can't be
too certain of this. (Although later revised, the Russian stemmer was
developed more than ten years ago, so I forget certain details.) 

The problem I find is that -n- is too frequently removed erroneously. In
your email you refer to "a dictionary" a couple of times, but it is
important to remember of course that the Snowball stemmers are purely
algorithmic and do not use dictionaries.

For example, vzyskanie (penalty) is a noun, with noun ending -ie, although
-ie is equally an adjectival ending. Removing -n- after removing -ie is
over-stemming, and the result then fails to conflate with vzyskan+X, where X
is a valid noun ending that is not also an adjectival ending. There are many
cases like this. One could try to compensate by stripping off all final n's,
but that is liable to result in many false conflations.
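
A toy Python sketch may make the problem easier to see. It is not the Russian
stemmer: the ending lists are a tiny transliterated subset chosen for
illustration, and the names toy_stem, ADJ_ENDINGS and NOUN_ENDINGS are
hypothetical. The only point is that -ie appears in both lists while -iya
appears only among the noun endings.

```python
# Toy, transliterated subsets of the endings (assumptions for illustration).
ADJ_ENDINGS = ("ogo", "omu", "yi", "iy", "oy", "ie")
NOUN_ENDINGS = ("iya", "iyu", "iem", "ie")

def strip_longest(word, endings):
    """Strip the longest matching ending; report whether anything was removed."""
    for e in sorted(endings, key=len, reverse=True):
        if word.endswith(e) and len(word) > len(e):
            return word[:-len(e)], True
    return word, False

def toy_stem(word, remove_n):
    """Try adjectival endings first; optionally remove a final n after them."""
    stem, was_adj = strip_longest(word, ADJ_ENDINGS)
    if was_adj:
        if remove_n and stem.endswith("n"):
            stem = stem[:-1]  # the proposed -n- removal
        return stem
    stem, _ = strip_longest(word, NOUN_ENDINGS)
    return stem

for flag in (False, True):
    print(flag, toy_stem("vzyskanie", flag), toy_stem("vzyskaniya", flag))
```

Without -n- removal both forms reduce to vzyskan. With it, vzyskanie becomes
vzyska while vzyskaniya stays vzyskan, so the two no longer conflate, although
adjectives such as ruchnoy and konnyi are handled as intended.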

Perhaps with your knowledge of the language you could make it work. If so,
Snowball would be ideal for your experiments.

Martin

P.S. It occurs to me you must know Eibe Frank.

Martin Porter | 1 Feb 2003 12:25

New site

[New Snowball site provisionally at
http://www.tartarus.org/~richard/snowball/index.php]

Okay, I've put in new credits (mentioning Oleg B., Andrei A, Steve T., Fred
B. and James A.), removed as many refs to sourceforge as I can, and await
Richard's decision on the final destination of the snowball site.

Richard, there are various refs to sourceforge on the *.php pages. There is
also the sourceforge logo to eliminate.

I find cvs update sometimes does a lot of transferring of data - I can't see
why. Otherwise it is much niftier than before.

Martin

