xiao shibin | 9 May 07:44 2004

Unicode version of snowball

>May 2002 - Unicode support added
 
where can I download the unicode version?
 
thanks,
 
xiao shibin
Martin Porter | 9 May 13:06 2004

Re: Unicode version of snowball

At 13:44 09/05/2004 +0800, xiao shibin wrote:
>>May 2002 - Unicode support added 
>
>where can I download the unicode version?
>
>thanks,
>
>xiao shib

Just download the whole thing and use the -w[idechars] option when
compiling. If you put "unicode" in the snowball front-page search box you
can see the emails that were passed around when 16 bit character support was
being added, which provides useful background.

Martin
xiao shibin | 10 May 06:05 2004

Re: Unicode version of snowball

Hi Martin,

With your help, I can now compile the stemmer to process UCS-2 encoded Unicode.
But my Russian text is encoded in UTF-8, and I don't want to convert the UTF-8 data to UCS-2.
Could you tell me how to modify the stemmer?
Or is there a finished version of Snowball that supports UTF-8?

Searching from the Snowball front page, I found "cyrillic letters in utf-8". Do I need to make any other modifications?

stringdef a decimal '45264' 
stringdef b decimal '45520' 
stringdef v decimal '45776' 
stringdef g decimal '46032' 
stringdef d decimal '46288' 
stringdef e decimal '46544' 
stringdef zh decimal '46800' 
stringdef z decimal '47056' 
stringdef i decimal '47312' 
stringdef i` decimal '47568' 
stringdef k decimal '47824' 
stringdef l decimal '48080' 
stringdef m decimal '48336' 
stringdef n decimal '48592' 
stringdef o decimal '48848' 
stringdef p decimal '49104' 
stringdef r decimal '32977' 
stringdef s decimal '33233' 
stringdef t decimal '33489' 
stringdef u decimal '33745' 
stringdef f decimal '34001' 
stringdef kh decimal '34257' 
stringdef ts decimal '34513' 
stringdef ch decimal '34769' 
stringdef sh decimal '35025' 
stringdef shch decimal '35281' 
stringdef " decimal '36049' 
stringdef y decimal '35793' 
stringdef ' decimal '35537' 
stringdef e` decimal '36305' 
stringdef iu decimal '36561' 
stringdef ia decimal '36817' 
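
The decimal values in this list appear to be the two UTF-8 bytes of each Cyrillic letter packed into one 16-bit number, with the first byte in the low half. For example, U+0430 encodes as D0 B0, and 0xB0D0 is 45264. A minimal Java sketch of that arithmetic follows. The class and method names are illustrative and not part of Snowball, and the packing order is inferred from the numbers rather than stated in the thread.

```java
import java.nio.charset.StandardCharsets;

class Utf8PackedValue {
    // Pack the two UTF-8 bytes of a character into one integer:
    // first byte in the low 8 bits, second byte in the high 8 bits.
    static int packed(char c) {
        byte[] utf8 = String.valueOf(c).getBytes(StandardCharsets.UTF_8);
        if (utf8.length != 2) {
            throw new IllegalArgumentException("expected a two-byte UTF-8 sequence");
        }
        return (utf8[0] & 0xFF) | ((utf8[1] & 0xFF) << 8);
    }

    public static void main(String[] args) {
        System.out.println(packed('\u0430')); // Cyrillic a: prints 45264
        System.out.println(packed('\u0440')); // Cyrillic r: prints 32977
    }
}
```

Running the same calculation over the other letters gives a way to check each entry of the list against its code point.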




thanks for your help.

xiao shibin

xiao shibin | 10 May 10:25 2004

Re: Unicode version of snowball

Hi Martin,

I downloaded the tar gzipped file containing the Snowball sources and compiled the 'p' directory, which
produced 'snowball'.
I ran snowball with the -w[idechars] option on stem.sbl, downloaded from
http://www.snowball.tartarus.org/russian/stem.sbl, with the commented-out part modified.
I then copied stem.c and stem.h to the 'q' directory, modified api.h, and compiled.

But the stemming result is wrong. Attached are the UCS-2 Russian input file and output file.

thanks for your help.

xiao shibin


Martin Porter | 12 May 16:37 2004

Re: Unicode version of snowball


Xiao,

It looks to me as if you are using my driver.c, which forces lower case on a
string - mapping A-Z to a-z - as a tidy-up process. This is done byte by
byte rather than 16-bit item by 16-bit item. Perhaps that is the problem. If
so, I apologise for the Unicode side of things not being better developed
than it is, but it should be possible to get it working with what you have.
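
To illustrate why byte-by-byte lowercasing can damage 16-bit text, here is a small Java sketch (not the driver.c code itself; the class and method names are made up for the illustration). Cyrillic letters from U+0441 to U+044F have a low byte in the range 0x41-0x4F, which a byte-oriented A-Z test mistakes for Latin capitals.

```java
import java.nio.charset.StandardCharsets;

class ByteWiseLowercase {
    // Map A-Z to a-z one byte at a time, ignoring 16-bit boundaries.
    static byte[] lowerByBytes(byte[] in) {
        byte[] out = in.clone();
        for (int i = 0; i < out.length; i++) {
            if (out[i] >= 'A' && out[i] <= 'Z') {
                out[i] += 'a' - 'A';
            }
        }
        return out;
    }

    // Map A-Z to a-z one 16-bit item at a time.
    static String lowerByUnits(String in) {
        char[] out = in.toCharArray();
        for (int i = 0; i < out.length; i++) {
            if (out[i] >= 'A' && out[i] <= 'Z') {
                out[i] += 'a' - 'A';
            }
        }
        return new String(out);
    }

    public static void main(String[] args) {
        String word = "\u0441\u0442\u043e\u043b"; // Russian "stol" (table)
        byte[] ucs2 = word.getBytes(StandardCharsets.UTF_16LE);
        String damaged = new String(lowerByBytes(ucs2), StandardCharsets.UTF_16LE);
        System.out.println(damaged.equals(word));            // false
        System.out.println(lowerByUnits(word).equals(word)); // true
    }
}
```

In this example the first two letters change from U+0441 and U+0442 to U+0461 and U+0462, so the stemmer would no longer see the intended word.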

At the moment I'm not sure what you have done. If you are still stuck, send
me a tgz file of the .c and .h sources you are using, and I will test it out
with your small input and output file samples,

Martin
Blake Madden | 12 May 19:41 2004

Finnish vowels

In the Finnish stemmer, it states this:

Finnish includes the following accented forms,
ä   ö
The following letters are vowels:
a   e   i   o   u   y   ä   ö

However, I believe that 'Å' (A with a volle) is also in their alphabet 
(albeit uncommonly used).  Please refer to 
http://en.wikipedia.org/wiki/Finnish_alphabet.

Thanks,
Blake Madden

Blake Madden | 12 May 19:59 2004

Finnish stemmer diff file

In the Finnish stemmer's diff file (the text file that shows a list of 
Finnish words and respective stemmed equivalents), there are a few entries 
that have uppercased 'Ä's in them.  This can be somewhat confusing given 
that the stemmers are meant to only work with lowercased text.  Here is one 
example:

edelliseltÄ                   edelliseltÄ
edelliseltä                   edellis

This gives the impression that there is something special about 'Ä', like it 
is a special consonant.  It looks here like "edelliseltÄ" and "edelliseltä" 
are entirely different words.  However, this is not exactly the case.  In 
reality, "edelliseltÄ" was not stemmed correctly because it was not 
lowercased first.  Like I said, this could just be a little confusing.

Thanks,
Blake

Martin Porter | 13 May 11:59 2004

Re: Unicode version of snowball


Xiao,

I am glad you sorted it out.

I will need to return to Snowball at some point, when I have extended some
other software to handle Unicode, so that might be a good time to address
the UTF8 encoding business. I do not think it will be until next year however.

It would be possible to put UTF-8 Unicode straight in, if Snowball had not
introduced these things called 'classes'. For example, the line

define v 'aeiou{a'}{e'}{i'}{o'}{u'}{u"}'

defines v as one of the bytes in the succeeding string. If instead one had

define v as among ('a' 'e' 'i' 'o' 'u' 
                   '{a'}' '{e'}' '{i'}' '{o'}' '{u'}' '{u"}')

it would not matter whether the macros {a'} etc were made up of 1, 2 or 3
bytes. The Snowball scripts could be rewritten to avoid the use of classes.

In retrospect, one can say that the idea of character classes was a
mistake: the speed gain they offer should instead have been made an
optimisation feature of among expressions of a certain shape.
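
The difference can be sketched in Java as follows. This is only an illustration of matching whole alternatives of varying byte length, in the spirit of 'among'; it is not how the Snowball compiler implements it, and the class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AmongSketch {
    // Encode each alternative as UTF-8 bytes; alternatives may differ in length.
    static List<byte[]> alternatives(String... items) {
        List<byte[]> out = new ArrayList<>();
        for (String s : items) {
            out.add(s.getBytes(StandardCharsets.UTF_8));
        }
        return out;
    }

    // Return the byte length of the alternative found at pos, or 0 if none matches.
    static int matchAt(byte[] text, int pos, List<byte[]> alts) {
        for (byte[] alt : alts) {
            int end = pos + alt.length;
            if (end <= text.length
                    && Arrays.equals(Arrays.copyOfRange(text, pos, end), alt)) {
                return alt.length;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        // a e i o u plus a-acute and u-diaeresis, written as escapes
        List<byte[]> v = alternatives("a", "e", "i", "o", "u", "\u00e1", "\u00fc");
        byte[] text = "m\u00e1s".getBytes(StandardCharsets.UTF_8);
        System.out.println(matchAt(text, 0, v)); // 0: 'm' is not in v
        System.out.println(matchAt(text, 1, v)); // 2: a-acute is two bytes in UTF-8
    }
}
```

A single-byte class test cannot express the second case, because the accented vowel is no longer one byte once the text is UTF-8.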

Martin

>Martin,
>
>Thanks for your help. I can now compile Snowball to stem UCS2-based
>Russian text.
>
>It would be better if Snowball could stem UTF8-based text. Do you have any
>plans to modify Snowball to support stemming UTF8-based text?
>
>Xiao Shibin
>TRS Ltd.
>
Martin Porter | 13 May 12:09 2004

Re: Finnish vowels


Hi Blake!

The point is that Finland was under Swedish control for a century or more (I
do not recall the exact period) and there was a huge influx of Swedish words
into the language. But the Swedish a-ring is not strictly part of Finnish
morphology just as French accents are not part of English, although they
exist in many loan words. A-ring could be added as a vowel to the Finnish
stemmer, but would not significantly affect results.

I left A-ring in the vocabulary samples because my policy has not been to
doctor the vocabularies by removing spelling errors and so on.

>In the Finnish stemmer, it states this:
>
>Finnish includes the following accented forms,
>ä   ö
>The following letters are vowels:
>a   e   i   o   u   y   ä   ö
>
>However, I believe that 'Å' (A with a volle) is also in their alphabet 
>(albeit uncommonly used).  Please refer to 
>http://en.wikipedia.org/wiki/Finnish_alphabet.
>
>Thanks,
>Blake Madden
Martin Porter | 17 May 10:49 2004

Re: Possible memory leak in Snowballs Java stemmer


Wolfram,

Thank you for your carefully researched email. The Java generator is the
work of Richard Boulton and we must refer it to him. I hope we'll get an
answer back to you shortly. (I know he is rather busy at the moment.)

I am posting this to snowball-discuss <at> lists.tartarus.org. Please post
subsequent replies to that address,

Martin

At 16:43 14/05/2004 +0200, Wolfram Esser wrote:
>Hello Mr. Porter!
>
>I really appreciate your work in the stemming field. I'm using
>the German stemmer extensively to find typing mistakes in large
>electronic encyclopedias.
>
>I am using the Java stemming engine which is provided by this link:
>    http://snowball.tartarus.org/snowball_java.tgz
>Which is  - to my knowledge - the current version of the Java stemmer.
>
>_*Problem:*_
>When stemming about 500,000 words and generating a Java HashMap which
>maps all the stems to their corresponding words, I get OutOfMemory
>exceptions - even with about 700MB of Java heap space and with about 1GB
>of machine RAM. This is strange, because the raw data needed should be
>something like 6MB+6MB+small(X), about 20-30 MB of RAM.
>
>_*Analysis:*_
>According to your Java TestApp (delivered with the above archive), after
>calling stem() one has to use SnowBallProgram.getCurrent() to get the
>stem of the stemmed word. This method does the following:
>   public String getCurrent()
>    {
>        return current.toString();
>    }
>
>So it converts the StringBuffer current to a String - but:
>StringBuffer.toString() does a so-called "lazy copy" - it does NOT
>create a fresh new String which is returned, but instead it creates a
>"hollow" String object, where it points the data buffer to the existing
>StringBuffer. So the StringBuffer and the new String point to the exact
>same memory block.
>
>So when the StringBuffer has allocated 2MB (with only 10 bytes used, which is
>OK for a StringBuffer!), then the new String also points to a 2MB memory
>block with only 10 bytes used.
>Java remembers this fact, and when the StringBuffer changes its value,
>the actual copy of the memory is made - but too late! The String
>object occupies 2MB - and always will - even if only 10 bytes contain
>useful characters!
>
>As people call SnowBall's getCurrent() method often, they almost
>always get String objects that occupy a lot of useless memory. This is
>O.K. if they only use these Strings (as in your
>TestApp) to do a System.out.println() and discard them afterwards. Then
>the memory will be freed by Java's garbage collector. But when you keep
>references to those Strings (e.g. as keys in a HashMap, as in my
>case!), the machine's memory runs out lightning fast! Actually I could only
>store about 300,000 stems in my HashMap, which occupied 600MB of RAM at
>that time!
>
>So, reusing StringBuffers is actually a use case which maybe was not
>intended by the developers of the StringBuffer class.
>
>
>_*Solution:*_
>
>Either the user or your library can do something like this:
>    String myStem = new String( germanStemmer.getCurrent());
>
>
>or (which I would prefer): rewrite the getCurrent method like one of the
>following (this prevents library users from using the library in a possibly
>dangerous way):
>
>   public String getCurrent()
>    {
>        return new String(current);
>    }
>
>or
>   public String getCurrent()
>    {
>        return current.substring(0);
>    }
>
>in both cases only the actual amount of occupied characters is stored in 
>the new String object.
>
>
>
>I don't know who is actually responsible for the Java part of Snowball. But
>I'm sure you can forward this email to him/her.
>I would really like to hear from your team whether you could reproduce my
>problem and find the solution helpful.
>Or did I overlook some other (memory-saving) means of getting the
>desired stem?
>
>Anyway: Thank you for your great work
>and greetings from Germany:
>            Wolfram
>
>
>
>-- 
>
>---
>
>     o    Wolfram Esser (Dipl.-Inform.), Lehrstuhl fuer Informatik II
>    / \       Universitaet Wuerzburg, Am Hubland, D-97074 Wuerzburg
>infoII o      Phone: +49 (0)931-888-6614   Fax: +49 (0)931-888-6603
>  / \         mailto:esser <at> informatik.uni-wuerzburg.de
> o   o        http://www2.informatik.uni-wuerzburg.de/staff/wolfram/
>
>
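
The caller-side workaround described in the quoted message can be sketched as below, assuming a stemmer that reuses a single StringBuffer. ToyStemmer is a hypothetical stand-in written for this example, not the generated Snowball class, and its "stemming" rule is deliberately trivial. Newer Java releases are understood to copy the characters in toString() already, so the explicit copy mainly matters on the older runtimes discussed here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StemIndex {
    // Hypothetical stand-in for a generated stemmer: it reuses one StringBuffer
    // and "stems" by dropping a trailing 's'. Not the real Snowball class.
    static class ToyStemmer {
        private final StringBuffer current = new StringBuffer();

        void setCurrent(String word) {
            current.setLength(0);
            current.append(word);
        }

        void stem() {
            int n = current.length();
            if (n > 1 && current.charAt(n - 1) == 's') {
                current.setLength(n - 1);
            }
        }

        String getCurrent() {
            return current.toString();
        }
    }

    // Build a map from each stem to the words that produced it.
    static Map<String, List<String>> index(List<String> words) {
        ToyStemmer stemmer = new ToyStemmer();
        Map<String, List<String>> byStem = new HashMap<>();
        for (String w : words) {
            stemmer.setCurrent(w);
            stemmer.stem();
            // Defensive copy so the stored key never shares storage with the buffer.
            String stem = new String(stemmer.getCurrent());
            byStem.computeIfAbsent(stem, k -> new ArrayList<>()).add(w);
        }
        return byStem;
    }

    public static void main(String[] args) {
        System.out.println(index(Arrays.asList("cat", "cats", "dog")));
    }
}
```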
