Martin Porter | 2 May 2011 11:13

Re: less than zero (enriching Italian stemmer)


Adriano,

Yes, you alter the .sbl file, compile the snowball compiler, and then use it
to translate the .sbl file into C or Java. If you're a software beginner I
suggest you get local help if you can. This will be easier than trying to
take you through the steps from the snowball-discuss board.

I'm not quite sure where Python comes into all this ...

Martin

At 10:19 AM 4/29/2011 -0400, adriano allora wrote:
>Hi Martin!
>
>Wow, you've just opened a new world to me! It's my first C script and, you
>see, I can be stupid, but I'm never a coward when I see something
>completely... °___°
>
>No. OK, seriously: I downloaded a gzip archive named snowball_web_and_code
>which contains all the source code for snowball.
>I opened it and saw several very interesting things (for instance adding
>some stopwords, but there's no need to do everything now: there is time for
>further improvements), so: thank you for this.
>But I beg your pardon: now I'm not sure what I have to do.
>1) Can I simply change the files stem_ISO_8859.sbl and stem_MS_DOS_LATIN.sbl
>in the directory named algorithms/italian? If not, where is the source
>file I have to change in order to add morphemes to the Italian algorithm?
>2) After changing the algorithm, what exactly do I have to do? It's reasonable
>to assume that compiling it (gcc -O -o Snowball compiler/*.c) will not

Tiago | 17 May 2011 21:07

Making accented letters equivalent to their unaccented letters

Hello


I'm using libstemmer with Sphinx, in an attempt to get better search results, since my site is in Portuguese.

But I'm having a problem: words such as "última" and "ultima" should return the same objects when used as search terms.
I was looking at libstemmer_pt and I would like to know if it's possible to make them the same symbol.
If it is, can someone give me an example, please?


-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 Tiago 
_______________________________________________
Snowball-discuss mailing list
Snowball-discuss <at> lists.tartarus.org
http://lists.tartarus.org/mailman/listinfo/snowball-discuss
Olly Betts | 17 May 2011 22:29

Re: Making accented letters equivalent to their unaccented letters

On Tue, May 17, 2011 at 04:07:36PM -0300, Tiago wrote:
> But I'm having a problem: words such as "última" and "ultima" should
> return the same objects when used as search terms.
> I was looking at libstemmer_pt and I would like to know if it's possible
> to make them the same symbol.
> If it is, can someone give me an example, please?

I think libunac is the most popular way to strip accents - you can apply
it after stemming to give what you want.

Unfortunately the homepage seems to be down:

http://www.senga.org/unac/

The best link I can offer instead is the Debian source package page:

http://packages.debian.org/source/sid/unac

The unac_1.8.0.orig.tar.gz link is the original source code.
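
As a rough illustration of the "strip accents after stemming" idea, here is a minimal Python sketch. It does not use libunac (which is a C library); it assumes that Unicode NFD decomposition from Python's standard unicodedata module is an acceptable stand-in, and strip_accents is a hypothetical helper name chosen for this example.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining marks (accents, cedillas) removed.

    Assumption: canonical decomposition (NFD) followed by dropping
    combining characters is good enough for the search use case.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Both spellings collapse to the same search term once accents are stripped.
print(strip_accents("última"))
print(strip_accents("ultima"))
```

The same function would be applied to the stemmer's output at both indexing time and query time, so that accented and unaccented spellings end up as the same term.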

Cheers,
    Olly
Martin Porter | 17 May 2011 23:06

Re: Making accented letters equivalent to their unaccented letters


The need to treat Spanish/Portuguese accents as discardable has come up more
than once in the past. The complication is that the endings themselves may
carry accents. To find out more, you can dig back into the gmane archives
with this starting point,

http://search.gmane.org/?query=%22andrew+green%22&group=gmane.comp.search.snowball

The complete archive is at

http://news.gmane.org/gmane.comp.search.snowball
Tiago | 18 May 2011 22:08

Snowball: ISO-8859 to UTF-8

Hello


I'm using libstemmer here, and I'm trying to modify the Portuguese file.
I downloaded snowball and checked the sources; there were only ISO-8859 and MS-DOS sources there.
In libstemmer there are ISO-8859 and UTF-8 C versions.

Is there some way to output a UTF-8 C version of the iso8859.sbl file?

I'm trying to remove some verb conjugations that are messing with some words in the plural.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 Tiago Scolari
 tscolari <at> gmail.com
Martin Porter | 19 May 2011 11:12

Re: Snowball: ISO-8859 to UTF-8


Tiago,

Richard Boulton set up the libstemmer approach, so he can best advise, but
the ISO-8859 source for Portuguese (and several of the other languages) is
used for both the ISO-8859 and the UTF-8 compilations, since the character
codes -- a-acute etc -- are the same in ISO-8859-1 and Unicode. You just
alter the compilation options given to the snowball compiler.
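
To make the "same codes" point concrete, here is a small Python sketch (an illustration only, not part of snowball or libstemmer). It checks that each accented letter has the same numeric code in ISO-8859-1 and in Unicode, while its UTF-8 byte sequence differs. The helper name and the sample letters are chosen just for this example.

```python
def codes_match(ch: str) -> bool:
    """True if ch has the same numeric code in ISO-8859-1 and Unicode."""
    latin1_byte = ch.encode("iso-8859-1")[0]  # single byte in ISO-8859-1
    return latin1_byte == ord(ch)             # ord() gives the Unicode code point


# Sample of accented letters used in Portuguese (illustrative selection).
sample = "áâãàçéêíóôõúü"

for ch in sample:
    # Same code in both character sets, but UTF-8 needs two bytes to store it.
    print(ch, hex(ord(ch)), codes_match(ch), ch.encode("utf-8"))
```

This is why one source file can serve both targets: the codes written in the source stay valid, and only the way the generated stemmer reads characters from its input has to change.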

Martin

>I downloaded snowball and was checking the sources, there was only a
>ISO-8859 and MS-DOS source there.
Christophe Maillard | 31 May 2011 14:49

Snowball Analyzer for Greek

Dear Sir/Madam,

I tried the Snowball Analyzer in English, and that worked perfectly fine. Now I would like to try it for Greek, but when I compile the sources I get the following exception:

    java.lang.ClassNotFoundException: org.tartarus.snowball.ext.GreekStemmer

What's strange is that the file org\apache\lucene\analysis\el\GreekStemmer.class exists, but the file org\tartarus\snowball\ext\GreekStemmer.class does not. I tried to copy the existing file into this directory (and re-JARed), but I had another exception.

I'm using Lucene 3.1.

Do you have any idea of how I can resolve that problem?

Thank you for your answer,

Best regards,

Christophe
