
Snowball and Perl

Hi all,

I am trying to add snowball stemmers for Spanish, Italian and English
(Catalan and Basque
would be welcome but I see they are not available) to a Perl
application.

Is there any way to compile Snowball code to Perl instead of C?
Could you give me any advice or workaround to achieve that aim?

Thanks in advance

  Juan Fco Fernandez
Oleg Bartunov | 1 Oct 2003 12:44

Re: Snowball and Perl

On Wed, 1 Oct 2003, Juan Francisco Fernandez Carrasco wrote:

> Hi all,
>
> I am trying to add snowball stemmers for Spanish, Italian and English
> (Catalan and Basque
> would be welcome but I see they are not available) to a Perl
> application.
>
> Is there any way to compile Snowball code to Perl instead of C?
> Could you give me any advice or workaround to achieve that aim?

The latest version of the Perl wrapper is available from
http://openfts.sourceforge.net/contributions.shtml
It's a real wrapper around the C code, not a reimplementation in Perl,
so it's fast and doesn't introduce any additional errors.

	Regards,

		Oleg

>
> Thanks in advance
>
>   Juan Fco Fernandez
>
>
> _______________________________________________
> Snowball-discuss mailing list
> Snowball-discuss <at> lists.tartarus.org

Olly Betts | 1 Oct 2003 13:53

Re: Snowball and Perl

On Wed, Oct 01, 2003 at 12:19:40PM +0200, Juan Francisco Fernandez Carrasco wrote:
> I am trying to add snowball stemmers for Spanish, Italian and English
> (Catalan and Basque
> would be welcome but I see they are not available) to a Perl
> application.

If you'd like to help out with Catalan and Basque stemmers, see:

http://snowball.tartarus.org/texts/howtohelp.html

> Is there any way to compile Snowball code to Perl instead of C?

I have a partially written Snowball backend which would generate Perl
code instead of C, but I've not had enough time to work on it much.
It's probably not too far from working though if someone wants to
try and get it working.

> Could you give me any advice or workaround to achieve that aim?

The other approaches are to use XS (or similar) wrappers to call the C
version from Perl, or to use a separate Perl implementation.  Oleg
provided a link to Perl wrappers (and a couple of reasons for preferring
them).

Cheers,
    Olly

Re: Snowball and Perl

Olly Betts wrote:

> On Wed, Oct 01, 2003 at 12:19:40PM +0200, Juan Francisco Fernandez Carrasco wrote:
> > I am trying to add snowball stemmers for Spanish, Italian and English
> > (Catalan and Basque
> > would be welcome but I see they are not available) to a Perl
> > application.
>
> If you'd like to help out with Catalan and Basque stemmers, see:
>
> http://snowball.tartarus.org/texts/howtohelp.html


I would like to, but I lack the linguistic knowledge to do so.
I could try it for Catalan if I knew what the important points of a
stemmer are, but I have no idea about that. As for Basque,
well, it is beyond my skills.

 

> > Is there any way to compile Snowball code to Perl instead of C?
>
> I have a partially written Snowball backend which would generate Perl
> code instead of C, but I've not had enough time to work on it much.
> It's probably not too far from working though if someone wants to
> try and get it working.
>
> > Could you give me any advice or workaround to achieve that aim?
>
> The other approaches are to use XS (or similar) wrappers to call the C
> version from Perl, or to use a separate Perl implementation.  Oleg
> provided a link to Perl wrappers (and a couple of reasons for preferring
> them).
 

Thanks all for your help. I've found the wrappers and I am trying to get them up and
running.

Best,

  Juan Fra
 

 
> Cheers,
>     Olly

_______________________________________________
Snowball-discuss mailing list
Snowball-discuss <at> lists.tartarus.org
http://lists.tartarus.org/mailman/listinfo/snowball-discuss

Steve Jones | 1 Oct 2003 12:38

-IZE versus -ISE

Hi,
 
I was wondering if anyone had overcome (or tried to!) the -IZE/-ISE problem: where 'categorize' and 'categorise' are both valid spellings of the same word. I noticed that the algorithm only takes into account the -IZE version.
 
I am aware the English language is a complex beastie and you would have to differentiate 'prize' from 'prise', but I thought someone might have had some success in this area...?
 
Regards,
 
/steve
 
Teodor Sigaev | 1 Oct 2003 12:55

Re: Snowball and Perl

http://www.snowball.tartarus.org/wrappers/guide.html

Perl wrapper for the C version of Snowball

Juan Francisco Fernandez Carrasco wrote:
> Hi all,
> 
> I am trying to add snowball stemmers for Spanish, Italian and English
> (Catalan and Basque
> would be welcome but I see they are not available) to a Perl
> application.
> 
> Is there any way to compile Snowball code to Perl instead of C?
> Could you give me any advice or workaround to achieve that aim?
> 
> Thanks in advance
> 
>   Juan Fco Fernandez
> 
> 
> _______________________________________________
> Snowball-discuss mailing list
> Snowball-discuss <at> lists.tartarus.org
> http://lists.tartarus.org/mailman/listinfo/snowball-discuss

-- 
Teodor Sigaev                                  E-mail: teodor <at> sigaev.ru
Martin Porter | 2 Oct 2003 18:51

Re: -IZE versus -ISE


Steve,

This often comes up. Type "ise ize" into the search box on the Snowball main
page, and look at the previous discussions.

Martin
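
One way to approach the -ISE/-IZE question outside the stemmer itself is to normalise spellings before stemming. The sketch below is only an illustration and is not part of any Snowball stemmer: the function name normalise_ise and the exception list are hypothetical, the list is nowhere near complete, and inflected forms such as "categorised" are not handled.

```python
# Hypothetical pre-stemming step: rewrite a trailing -ise as -ize.
# Words where -ise is NOT a spelling variant of -ize are kept as they are.
# The exception list is illustrative only and far from complete.
ISE_EXCEPTIONS = {
    "prise", "rise", "wise", "advise", "surprise", "exercise",
    "promise", "noise", "raise", "praise", "precise", "revise",
}


def normalise_ise(word):
    """Return `word` with a final -ise changed to -ize, unless excepted.

    Assumes the word has already been lowercased.
    """
    if word.endswith("ise") and word not in ISE_EXCEPTIONS:
        return word[:-3] + "ize"
    return word
```

With this in place, normalise_ise("categorise") and normalise_ise("categorize") both give "categorize", while "prise" is left alone and so stays distinct from "prize". The hard part, as the question suggests, is building a reliable exception list.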
Boštjan Jerko | 6 Oct 2003 08:28

two results

Hello !

Is it possible to get two stems for one word?
In Slovene it is possible to stem a word (e.g. "želodec") in two ways ("želodc", "želodč").

B.
Richard Boulton | 6 Oct 2003 13:44

Re: two results

Boštjan Jerko wrote:
> Is it possible to get two stems for one word?
> In Slovene it is possible to stem a word (e.g. "želodec") in two ways ("želodc", "želodč").

No, all current stemming algorithms return one stemmed version of each 
input word.

I'm not sure what it would mean to have two possible stemmed forms - 
stemming is a normalising process, used to determine if differing 
versions of a word share a common root.

Why do you think having multiple stemmed forms might be necessary?

My thought is that a possible situation where this might be useful would 
be as follows:

We have two stemmed words, "A" and "B", with quite distinct meanings.
There exists at least one word "A_" which should stem to "A".
There exists at least one word "B_" which should stem to "B".

However, there is a word "X" which can have two different meanings.  One 
of those meanings is a form of "A", the other is a form of "B".  In 
order to reflect this, without tying "A" and "B" together, "X" should 
stem to both "A" and "B".

Is this the situation you have?

I don't know of any concrete examples of this situation (Martin may be 
able to give one), but the way I would expect it to be solved is to 
choose which stemmed form is more frequently the correct stemmed form of 
"X", and to use that always.  Alternatively, if neither form is 
significantly more frequent, "X" could be left in an unstemmed form.

It would require a good deal of work to allow most search engines to 
deal with a stemming algorithm that returned multiple possibilities.
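
The policy described above (always use the more frequent stem, or leave the word unstemmed when neither form dominates) could be sketched roughly as follows. This is an illustration under stated assumptions, not an existing Snowball feature: the function name resolve_stem, the frequency counts and the dominance threshold are all hypothetical.

```python
def resolve_stem(word, candidates, dominance=2.0):
    """Choose a single stem for `word`.

    `candidates` maps each possible stem to how often it is the correct
    stem for `word` (hypothetical counts, e.g. from a tagged corpus).
    The most frequent stem wins if it is at least `dominance` times as
    frequent as the runner-up; otherwise the word is left unstemmed.
    """
    if not candidates:
        return word
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) == 1:
        return ranked[0][0]
    best, best_count = ranked[0]
    second_count = ranked[1][1]
    if best_count >= dominance * second_count:
        return best
    return word
```

For example, resolve_stem("X", {"A": 90, "B": 10}) returns "A", whereas resolve_stem("X", {"A": 55, "B": 45}) returns "X" unchanged because neither stem is clearly more frequent.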

-- 
Richard
Martin Porter | 6 Oct 2003 14:29

Re: two results


Jerko

Continuing Richard's discussion, an example in English would be "routing",
which might stem either to "route" (a verb meaning "to direct"), or "rout"
(a verb meaning "to defeat in battle"). This example is given elsewhere on
the Snowball website. In English, examples like this are very rare.

Richard is right in saying the stemmers on the Snowball site reduce all
words to just one form, but other approaches are possible. For example the
Willett/Schinke Latin stemmer reduces each word to two forms, in the hope
that at least one of the forms is correct, and I have implemented it in
Snowball. 

So it is certainly possible, although it doesn't fit the design scheme of
the stemmers currently distributed on the Snowball site.
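
To illustrate what a stemmer that returns two forms might mean for a search index, here is a minimal sketch. It is neither the Schinke algorithm nor Snowball code: the lookup table AMBIGUOUS, the placeholder single_stem and the function build_index are hypothetical names, and the only real data is the "routing" example mentioned above.

```python
from collections import defaultdict

# Hypothetical table of words that have more than one plausible stem.
AMBIGUOUS = {"routing": ["route", "rout"]}


def single_stem(word):
    """Stand-in for a real one-result stemmer: strips a final 's' only."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def stems(word):
    """Return every candidate stem for `word` (usually exactly one)."""
    return AMBIGUOUS.get(word, [single_stem(word)])


def build_index(docs):
    """Map each stem to the ids of documents containing a matching word.

    `docs` maps a document id to its text. An ambiguous word is indexed
    under all of its stems, so a query for either stem finds the document.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            for stem in stems(word):
                index[stem].add(doc_id)
    return index
```

Indexing a document that contains "routing" places it under both "route" and "rout". The cost is the one Richard points out: the index and query code have to cope with a list of stems per word instead of a single stem.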

Martin

P.S. Do you know about

M. Popovic and P. Willett. The Effectiveness of Stemming for Natural
Language Access to Slovene Textual Data. Journal of the American Society for
Information Science, 43(5):384-390, 1992

(I have not myself read this paper, as far as I can recall.)
