Santosh Pai | 6 Oct 2007 10:58

Tweak stemmer: (say the German stemmer)

Hi,

 

Using the snowball stemmer is providing me with great results.

I had a question though…

 

For some languages (say German, French), the stemmer replaces the special language-specific characters with their corresponding English equivalents.

For example, in German:

 

ä → a
ö → o
ü → u
Ä → A
Ö → O
Ü → U
ß → ss

 

Is there a way to preserve the special characters in the stems?

 

Thanks

 

 

Santosh Pai
6th Floor, Windsor Manor,
Baner Road, Pune, INDIA
www.quagnito.com
E-Mail: santosh <at> quagnito.com
Tel.: +91 20 27292039
Mobile: +91 98501 60015

 

_______________________________________________
Snowball-discuss mailing list
Snowball-discuss <at> lists.tartarus.org
http://lists.tartarus.org/mailman/listinfo/snowball-discuss
Martin Porter | 11 Oct 2007 12:22

Re: Tweak stemmer: (say the German stemmer)


Santosh,

(I hope I understand your email correctly. Anyway, here goes with an
answer ...)

The stemmers do not replace the language-specific characters with
'English' equivalents as a standard operation, but will occasionally
remove accents as part of the stemming process. For example, in German,
removal of the umlaut in the last syllable will help conflate singular and
plural forms. The German eszett (ß, double s) is similarly usefully replaced
by 'ss'.

If you feel that the normalisation is incorrect, the solution is to
modify the snowball scripts and recompile.

Incidentally, we have also had the opposite suggestion to your own: that,
in Spanish for example, it would be better for the stemmer to remove
ALL accents, since their use is not very consistent in the written
language.

Italian acute/grave usage is very variable, and the Snowball Italian
stemmer maps both to the same form. 

But the general approach in the Snowball stemmers is to leave accents
alone unless there is a clear reason for removing or altering them.

Martin
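
A minimal Python sketch of the kind of folding discussed above, for illustration only. It is not the German stemmer's code: the name `fold_german` and the whole-word mapping are assumptions of this sketch, whereas the stemmer itself only alters these characters where its rules call for it.

```python
# Illustrative only: fold German umlauts and eszett so that forms which
# differ just in these characters end up with the same key.
# fold_german is a hypothetical helper, not part of Snowball.
GERMAN_FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})


def fold_german(word: str) -> str:
    """Lower-case the word, then replace ä/ö/ü/ß by a/o/u/ss."""
    return word.lower().translate(GERMAN_FOLD)


if __name__ == "__main__":
    for w in ("Apfel", "Äpfel", "Straße"):
        print(w, "->", fold_german(w))
```

Folding of this sort is what lets a singular such as "Apfel" and a plural such as "Äpfel" share one key. To keep the original characters in the stems, the change belongs in the Snowball script, as described above.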

Martin Porter | 18 Oct 2007 13:56

Re: question about an optimization


Kai

You must understand that the C code is generated from a snowball script,
and that no attempt is made to alter the generated code to improve
performance. (The maintenance involved would be too great.)

Could you please direct comments on snowball to
snowball-discuss <at> lists.tartarus.org. 

Martin

On Thu, 2007-10-18 at 17:40 +0800, kai wang wrote:
> Hi,
> In step 2 (verb suffixes) of the Portuguese stemmer, every suffix ends
> with one of "a e i m o r s u á". This can be tested with a single
> statement, so that the step returns at once if the word does not end
> with one of these letters:
>
> if (text.length() < 2 ||
>     (!((2925090 >> (text[text.length() - 1] & 0x1f)) & 1) &&
>      text[text.length() - 1] != 'á'))
>     return;
>
> Why is this not used in the code? Is it because the operation would not
> improve performance?
>
> kai
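
The constant in the quoted test can be checked with a short Python sketch. It rebuilds the bitmask from the letters "a e i m o r s u" (one bit per letter, indexed by the low five bits of the character code) and applies the same test. The function names are hypothetical, and the range check on the character code is an addition that is not in the quoted line.

```python
# Sketch of the bitmask test from the quoted message (names are hypothetical).
FINAL_LETTERS = "aeimorsu"  # letters listed in the message, without 'á'


def build_mask(letters: str) -> int:
    """Set bit (code & 0x1f) for every letter in `letters`."""
    mask = 0
    for ch in letters:
        mask |= 1 << (ord(ch) & 0x1F)
    return mask


MASK = build_mask(FINAL_LETTERS)


def may_end_verb_suffix(text: str) -> bool:
    """Return True if the last letter is one of a e i m o r s u á."""
    if len(text) < 2:
        return False
    last = text[-1]
    if last == "á":
        return True
    code = ord(last)
    # Added safeguard (not in the quoted code): the low five bits are only
    # meaningful for codes 0x60..0x7f, which contain the letters 'a'..'z'.
    if code >> 5 != 3:
        return False
    return bool((MASK >> (code & 0x1F)) & 1)


if __name__ == "__main__":
    print(MASK)
    for w in ("falar", "falam", "feliz", "está"):
        print(w, may_end_verb_suffix(w))
```

The mask built this way equals the constant 2925090 used in the message.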
Ignacio Perez | 22 Oct 2007 20:52

Spanish Stemmer question

My name is Ignacio Perez and I'm both a linguistics student at Buenos Aires University (Argentina) and a Java programmer. I'm now working with the Spanish stemmer and I have found a couple of problems with the following rule:

            'idad'
            'idades'
            (
                R2 delete
                try (
                    [substring] among(
                        'abil'
                        'ic'
                        'iv'   (R2 delete)
                    )
                )
            )

The first (and easiest to solve) problem I find here is that we should add 'ibil' next to 'abil', because the suffixes '-ibilidad' and '-abilidad' are the same suffix for different verb conjugations (the first for "-er" and "-ir" verbs, the second for "-ar" verbs). This would solve problems for words like "inteligibilidad" or "sensibilidad".

The second problem I ran into is that 'abil' (and, if added, also 'ibil') is only removed when it is in R2. This leaves out words like "posibilidad" (very common) or "ambilidad". I thought the rule might instead be something like: remove 'abil' when it is in R2, or when it is in R1 except when preceded by "inh".

Kind regards.

Ignacio Perez
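
To make the R1/R2 point concrete, here is a rough Python sketch that computes the two regions with the usual Snowball definition (R1 starts after the first non-vowel that follows a vowel; R2 is the same rule applied inside R1). The vowel set and the helper names are assumptions of this sketch, not code taken from the stemmer.

```python
# Sketch: locate a word ending relative to R1 and R2.
# VOWELS and all function names are assumptions made for this illustration.
VOWELS = set("aeiouáéíóúü")


def region_start(word: str, start: int = 0) -> int:
    """Index just after the first non-vowel following a vowel, searching
    from `start`; len(word) if there is no such position."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def r1_r2(word: str) -> tuple[int, int]:
    """Start offsets of R1 and R2."""
    r1 = region_start(word)
    return r1, region_start(word, r1)


def locate(word: str, ending: str) -> dict:
    """Report whether `ending` starts inside R1 and inside R2."""
    if not word.endswith(ending):
        raise ValueError(f"{word!r} does not end with {ending!r}")
    r1, r2 = r1_r2(word)
    pos = len(word) - len(ending)
    return {"in_R1": pos >= r1, "in_R2": pos >= r2}


if __name__ == "__main__":
    print(locate("posibilidad", "idad"))
    print(locate("posibilidad", "ibilidad"))
    print(locate("inhabilidad", "abilidad"))
```

Under these definitions 'idad' in "posibilidad" lies in R2, while the 'ibil' before it starts in R1 but not in R2, which is the situation described above. In "inhabilidad" the 'abil' is likewise in R1 only, and removing it there would leave just "inh".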
Martin Porter | 24 Oct 2007 10:41

Re: Spanish Stemmer question


Ignacio,

I will look into this at some point.

The stemmers were developed by working with large sample vocabularies,
and you then often find that a rule that seems plausible has to be
withdrawn because there are too many cases where it is causing errors.
In any event, a new rule must be tested against a large vocabulary, and
the words affected, for better or worse, checked and assessed.  

I'll do this with your two suggestions when I come back to working on
snowball. My guess is that your first rule could certainly be added in;
the second may lead to errors as well as improvements.

Are you intending to include these extras in your version of the
stemmer? If so, keep us informed.

Thanks for your interest. 

Martin

Ignacio Perez | 30 Oct 2007 00:29

Problem with spanish stemmer

I'm working with the Spanish stemmer and I'm having a problem with the verb suffixes. The input I'm stemming is not orthographically perfect, so I cannot rely on the accents for stemming. I thought, then, that I could remove all accents from my input and from the stemmer (for most verb suffixes this is not a problem, since "iéramos", "íamos", "ábamos", "áramos", etc. are surely suffixes even when they are written as "ieramos", "iamos", "abamos", "aramos"; there is no ambiguity). Surprisingly (for me) the stemmer did not behave as I expected, and words like "tomaramos" were split as "tomar-amos".
Evidently I don't understand the behaviour of the stemmer, and these accents matter more to it than I assumed.

So, how can I use the stemmer making it not accent-sensitive?

Thanks a lot

Ignacio

Martin Porter | 31 Oct 2007 10:29

Re: Problem with spanish stemmer


Ignacio,

What you have outlined should work, and I would have to look at your
approach in some detail to see where the problem lies. (Something,
incidentally, that I do not currently have the time to do!)

I have just done a simple test in which the line of suffixes,

'{a'}ramos' 'i{e'}ramos' 'i{e'}semos' '{a'}semos'

is additionally preceded by the line

'aramos' 'ieramos' 'iesemos' 'asemos'

and this works fine, your word "tomaramos" splitting as "tom-aramos".

So I can suggest that as an approach: supplement the algorithm with
extra endings, corresponding to the accented forms but with the accent
removed. I suggest you build it up bit by bit, and test it out as each
new ending, or set of endings, is included.

This problem has arisen before. See for example the email of Andrew
Green 19 May 2007.

Martin
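
The effect described above can be sketched in Python as a longest-suffix match. This is a simplification under stated assumptions: it ignores the stemmer's region conditions, the helper names are hypothetical, and the short ending 'amos' is assumed to be among the verb suffixes (it is the ending that the "tomar-amos" split points to).

```python
# Sketch: why adding unaccented endings changes the split of "tomaramos".
# Helper names are hypothetical; region checks of the real stemmer are omitted.
from typing import Optional

ACCENTED = ["áramos", "iéramos", "iésemos", "ásemos"]
UNACCENTED = ["aramos", "ieramos", "iesemos", "asemos"]  # the extra line
SHORT = ["amos"]  # assumed shorter verb ending


def longest_suffix(word: str, suffixes: list[str]) -> Optional[str]:
    """Return the longest entry of `suffixes` that `word` ends with."""
    matches = [s for s in suffixes if word.endswith(s)]
    return max(matches, key=len) if matches else None


def split(word: str, suffixes: list[str]) -> str:
    """Show the word as stem-suffix using the longest matching suffix."""
    suffix = longest_suffix(word, suffixes)
    if suffix is None:
        return word
    return f"{word[:-len(suffix)]}-{suffix}"


if __name__ == "__main__":
    print(split("tomaramos", ACCENTED + SHORT))
    print(split("tomaramos", UNACCENTED + ACCENTED + SHORT))
    print(split("tomáramos", UNACCENTED + ACCENTED + SHORT))
```

With only the accented endings available, the longest ending that matches "tomaramos" is 'amos'; once 'aramos' is listed as well, the longer ending wins and accented input still matches its own form.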

Ignacio Perez | 31 Oct 2007 16:46

Re: Problem with spanish stemmer

Thanks a lot to both of you. My problem was that I was attempting to modify the generated code instead of recompiling the stemmer.

Sorry, and thanks again

Ignacio


