9 May 2004 07:44
9 May 2004 13:06
Re: Unicode version of snowball
Martin Porter <martin.porter <at> grapeshot.co.uk>
2004-05-09 11:06:29 GMT
At 13:44 09/05/2004 +0800, xiao shibin wrote:
>>May 2002 - Unicode support added
>
>where can I download the unicode version?
>
>thanks,
>
>xiao shib

Just download the whole thing and use the -w[idechars] option when
compiling. If you put "unicode" in the snowball front-page search box you
can see the emails that were passed around when 16 bit character support was
being added, which provides useful background.

Martin
10 May 2004 06:05
Re: Unicode version of snowball
xiao shibin <xiao.shibin <at> trs.com.cn>
2004-05-10 04:05:49 GMT
Hi Martin,

With your help, I can now compile the stemmer to process UCS-2 Unicode.

But my Russian text is encoded in UTF-8, and I don't want to convert the
UTF-8 data to UCS-2. Could you tell me how to modify the stemmer? Or is
there a finished version of Snowball that supports UTF-8?

Searching from the Snowball front page, I found "cyrillic letters in utf-8".
Do I need to make any other modifications?

stringdef a decimal '45264'
stringdef b decimal '45520'
stringdef v decimal '45776'
stringdef g decimal '46032'
stringdef d decimal '46288'
stringdef e decimal '46544'
stringdef zh decimal '46800'
stringdef z decimal '47056'
stringdef i decimal '47312'
stringdef i` decimal '47568'
stringdef k decimal '47824'
stringdef l decimal '48080'
stringdef m decimal '48336'
stringdef n decimal '48592'
stringdef o decimal '48848'
stringdef p decimal '49104'
stringdef r decimal '32977'
stringdef s decimal '33233'
stringdef t decimal '33489'
stringdef u decimal '33745'
stringdef f decimal '34001'
stringdef kh decimal '34257'
stringdef ts decimal '34513'
stringdef ch decimal '34769'
stringdef sh decimal '35025'
stringdef shch decimal '35281'
stringdef " decimal '36049'
stringdef y decimal '35793'
stringdef ' decimal '35537'
stringdef e` decimal '36305'
stringdef iu decimal '36561'
stringdef ia decimal '36817'

Thanks for your help.

xiao shibin

----- Original Message -----
From: "Martin Porter" <martin.porter <at> grapeshot.co.uk>
To: "xiao shibin" <xiao.shibin <at> trs.com.cn>; <snowball-discuss <at> lists.tartarus.org>
Sent: Sunday, May 09, 2004 7:06 PM
Subject: Re: [Snowball-discuss] Unicode version of snowball

> At 13:44 09/05/2004 +0800, xiao shibin wrote:
> >>May 2002 - Unicode support added
> >
> >where can I download the unicode version?
> >
> >thanks,
> >
> >xiao shib
>
> Just download the whole thing and use the -w[idechars] option when
> compiling. If you put "unicode" in the snowball front-page search box you
> can see the emails that were passed around when 16 bit character support was
> being added, which provides useful background.
>
> Martin
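[Editorial note] The decimal values in the stringdef table above coincide with the two UTF-8 bytes of each Cyrillic letter packed into one 16-bit number, first byte in the low-order position. The Java sketch below only reproduces that arithmetic; the class and method names are hypothetical, and how Snowball itself consumes the values is not shown in this thread.

```java
import java.nio.charset.StandardCharsets;

class Utf8StringdefValues {
    // Pack the UTF-8 bytes of a one-letter string into a single integer,
    // with the first byte in the low-order position (little-endian).
    static int packedValue(String letter) {
        byte[] utf8 = letter.getBytes(StandardCharsets.UTF_8);
        int value = 0;
        for (int i = utf8.length - 1; i >= 0; i--) {
            value = (value << 8) | (utf8[i] & 0xFF);
        }
        return value;
    }

    public static void main(String[] args) {
        // U+0430 CYRILLIC SMALL LETTER A: UTF-8 bytes D0 B0
        System.out.println(packedValue("\u0430")); // prints 45264
        // U+0440 CYRILLIC SMALL LETTER ER: UTF-8 bytes D1 80
        System.out.println(packedValue("\u0440")); // prints 32977
    }
}
```

The two printed values match the entries for a and r in the table, which also explains the jump from 49104 (p) down to 32977 (r): the lead byte changes from D0 to D1 at that point in the alphabet.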
10 May 2004 10:25
Re: Unicode version of snowball
xiao shibin <xiao.shibin <at> trs.com.cn>
2004-05-10 08:25:57 GMT
Hi Martin,

I downloaded the tar gzipped file containing the Snowball sources, compiled
the 'p' directory to produce 'snowball', and ran snowball with the
-w[idechars] option. stem.sbl was downloaded from
http://www.snowball.tartarus.org/russian/stem.sbl, and I modified the
commented-out part.

I then copied stem.c and stem.h to the 'q' directory, modified api.h, and
compiled.

But the stemming result is wrong. Attached are the UCS-2 Russian input file
and output file.

Thanks for your help.

xiao shibin

----- Original Message -----
From: "Martin Porter" <martin.porter <at> grapeshot.co.uk>
To: "xiao shibin" <xiao.shibin <at> trs.com.cn>; <snowball-discuss <at> lists.tartarus.org>
Sent: Sunday, May 09, 2004 7:06 PM
Subject: Re: [Snowball-discuss] Unicode version of snowball

> At 13:44 09/05/2004 +0800, xiao shibin wrote:
> >>May 2002 - Unicode support added
> >
> >where can I download the unicode version?
> >
> >thanks,
> >
> >xiao shib
>
> Just download the whole thing and use the -w[idechars] option when
> compiling. If you put "unicode" in the snowball front-page search box you
> can see the emails that were passed around when 16 bit character support was
> being added, which provides useful background.
>
> Martin
12 May 2004 16:37
Re: Unicode version of snowball
Martin Porter <martin.porter <at> grapeshot.co.uk>
2004-05-12 14:37:34 GMT
Xiao,

It looks to me as if you are using my driver.c, which forces lower case on a
string - mapping A-Z to a-z - as a tidying-up process. This is done byte by
byte rather than 16-bit item by 16-bit item.

Perhaps that is the problem. If so, I apologise for the Unicode side of
things not being better developed than it is, but it should be possible to
get it working with what you have. At the moment I'm not sure what you have
done.

If you are still stuck, send me a tgz file of the .c and .h sources you are
using, and I will test it out with your small input and output file samples,

Martin
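[Editorial note] A minimal Java sketch of why byte-by-byte lowercasing can damage 16-bit text. It is not driver.c; the class and method names are hypothetical, and it assumes the text is stored as little-endian 16-bit units. Any Cyrillic letter whose low byte falls in the range 0x41-0x5A (the ASCII codes of A-Z) is altered by the byte-wise version.

```java
import java.nio.charset.StandardCharsets;

class LowercaseWidth {
    // Maps A-Z to a-z one byte at a time, ignoring 16-bit boundaries.
    static String lowerByteWise(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_16LE);
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] >= 'A' && bytes[i] <= 'Z') {
                bytes[i] += 'a' - 'A';
            }
        }
        return new String(bytes, StandardCharsets.UTF_16LE);
    }

    // Maps A-Z to a-z one 16-bit unit at a time.
    static String lowerUnitWise(String s) {
        char[] units = s.toCharArray();
        for (int i = 0; i < units.length; i++) {
            if (units[i] >= 'A' && units[i] <= 'Z') {
                units[i] += 'a' - 'A';
            }
        }
        return new String(units);
    }

    public static void main(String[] args) {
        String es = "\u0441"; // CYRILLIC SMALL LETTER ES, bytes 41 04
        // The low byte 0x41 looks like 'A' and becomes 0x61: U+0441 turns into U+0461.
        System.out.println(Integer.toHexString(lowerByteWise(es).charAt(0))); // prints 461
        System.out.println(Integer.toHexString(lowerUnitWise(es).charAt(0))); // prints 441
    }
}
```

Under these assumptions the letters U+0441 to U+044F (the second half of the lowercase Russian alphabet) would all be corrupted by the byte-wise loop, while the 16-bit loop leaves them alone.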
12 May 2004 19:41
Finnish vowels
Blake Madden <madden_blake <at> hotmail.com>
2004-05-12 17:41:52 GMT
In the Finnish stemmer, it states this:

Finnish includes the following accented forms,
ä ö
The following letters are vowels:
a e i o u y ä ö

However, I believe that 'Å' (A with a volle) is also in their alphabet
(albeit uncommonly used). Please refer to
http://en.wikipedia.org/wiki/Finnish_alphabet.

Thanks,
Blake Madden
12 May 2004 19:59
Finnish stemmer diff file
Blake Madden <madden_blake <at> hotmail.com>
2004-05-12 17:59:06 GMT
In the Finnish stemmer's diff file (the text file that lists Finnish words
with their stemmed equivalents), there are a few entries that contain an
uppercase 'Ä'. This can be somewhat confusing, given that the stemmers are
meant to work only with lowercased text. Here is one example:

edelliseltÄ edelliseltÄ
edelliseltä edellis

This gives the impression that there is something special about 'Ä', as if
it were a special consonant, and that "edelliseltÄ" and "edelliseltä" are
entirely different words. That is not the case: "edelliseltÄ" was not
stemmed correctly because it was not lowercased first.

As I said, this could just be a little confusing.

Thanks,
Blake
13 May 2004 11:59
Re: Unicode version of snowball
Martin Porter <martin.porter <at> grapeshot.co.uk>
2004-05-13 09:59:00 GMT
Xiao,
I am glad you sorted it out.
I will need to return to Snowball at some point, when I have extended some
other software to handle Unicode, so that might be a good time to address
the UTF8 encoding business. I do not think it will be until next year however.
It would be possible to put UTF-8 Unicode straight in, if Snowball had not
introduced these things called 'classes'. For example, the line
define v 'aeiou{a'}{e'}{i'}{o'}{u'}{u"}'
defines v as one of the bytes in the succeeding string. If instead one had
define v as among ('a' 'e' 'i' 'o' 'u'
'{a'}' '{e'}' '{i'}' '{o'}' '{u'}' '{u"}')
it would not matter whether the macros {a'} etc were made up of 1, 2 or 3
bytes. The Snowball scripts could be rewritten to avoid the use of classes.
Retrospectively, one can say that the idea of character classes is a
mistake: the increased speed by which they are implemented should be made an
optimisation feature of among expressions of a certain shape.
Martin
>Martin,
>
>Thanks for your help, I can compile the snowball for stemming UCS2-based
>russian text.
>
>It is better that snowball can stem UTF8-based text, Could you have any
>plan to modify the snowball to support stemming UTF8-based text?
>
>Xiao Shibin
>TRS Ltd.
>
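[Editorial note] Martin's contrast between a character class and an among of strings can be illustrated outside Snowball. The Java sketch below is not Snowball's implementation; the class and method names are hypothetical. classMatch treats the vowel definition as a set of single bytes, as the 'define v' line does, while amongMatch compares whole encoded strings, so it does not matter whether an accented vowel occupies one byte (ISO-8859-1) or two (UTF-8).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ClassVersusAmong {
    // a e i o u plus the accented forms a' e' i' o' u' and u" from the example.
    static final String[] VOWELS = {
        "a", "e", "i", "o", "u",
        "\u00e1", "\u00e9", "\u00ed", "\u00f3", "\u00fa", "\u00fc"
    };

    // Class-style test: is the single byte at pos one of the bytes
    // that make up the vowel definition in the given encoding?
    static boolean classMatch(byte[] text, int pos, Charset cs) {
        for (String v : VOWELS) {
            for (byte b : v.getBytes(cs)) {
                if (b == text[pos]) return true;
            }
        }
        return false;
    }

    // Among-style test: returns the length in bytes of the vowel string
    // found at pos, or 0 if none of the vowel strings occurs there.
    static int amongMatch(byte[] text, int pos, Charset cs) {
        for (String v : VOWELS) {
            byte[] enc = v.getBytes(cs);
            if (pos + enc.length <= text.length
                    && Arrays.equals(enc, Arrays.copyOfRange(text, pos, pos + enc.length))) {
                return enc.length;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        // n with tilde is not a vowel, but in UTF-8 it shares its lead byte C3 with e-acute.
        byte[] enye = "\u00f1".getBytes(StandardCharsets.UTF_8);
        System.out.println(classMatch(enye, 0, StandardCharsets.UTF_8)); // true (a false positive)
        System.out.println(amongMatch(enye, 0, StandardCharsets.UTF_8)); // 0
    }
}
```

With single-byte ISO-8859-1 input both tests agree; the difference only appears once a letter is encoded as more than one byte.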
13 May 2004 12:09
Re: Finnish vowels
Martin Porter <martin.porter <at> grapeshot.co.uk>
2004-05-13 10:09:55 GMT
Hi Blake!

The point is that Finland was under Swedish control for a century or more (I
do not recall the exact period) and there was a huge influx of Swedish words
into the language. But the Swedish a-ring is not strictly part of Finnish
morphology just as French accents are not part of English, although they
exist in many loan words.

A-ring could be added as a vowel to the Finnish stemmer, but would not
significantly affect results.

I left A-ring in the vocabulary samples because my policy has not been to
doctor the vocabularies by removing spelling errors and so on.

>In the Finnish stemmer, it states this:
>
>Finnish includes the following accented forms,
>ä ö
>The following letters are vowels:
>a e i o u y ä ö
>
>However, I believe that 'Å' (A with a volle) is also in their alphabet
>(albeit uncommonly used). Please refer to
>http://en.wikipedia.org/wiki/Finnish_alphabet.
>
>Thanks,
>Blake Madden
17 May 2004 10:49
Re: Possible memory leak in Snowballs Java stemmer
Martin Porter <martin.porter <at> grapeshot.co.uk>
2004-05-17 08:49:35 GMT
Wolfram,

Thank you for your carefully researched email. The Java generator is the
work of Richard Boulton and we must refer it to him. I hope we'll get an
answer back to you shortly. (I know he is rather busy at the moment)

I am posting this to snowball-discuss <at> lists.tartarus.org. Please post
subsequent replies to that address,

Martin

At 16:43 14/05/2004 +0200, Wolfram Esser wrote:
>Hello Mr. Porter!
>
>I really appreciate your work in the stemming field.... I'm using
>the German stemmer extensively to find typing mistakes in large
>electronic encyclopedias.
>
>I am using the Java stemming engine which is provided by this link:
> http://snowball.tartarus.org/snowball_java.tgz
>Which is - to my knowledge - the current version of the Java stemmer.
>
>_*Problem:*_
>When stemming about 500,000 words and generating a Java hashmap which
>maps all the stems to their corresponding words, I get OutOfMemory
>exceptions - even with about 700MB of Java heap space and with about 1GB
>of machine RAM. This is strange, because the raw data needed must be
>something like 6MB+6MB+small(X) about 20-30 MB of RAM.
>
>_*Analysis:*_
>According to your Java TestApp (delivered with the above archive), after
>calling stem() one has to use SnowBallProgram.getCurrent() to get the
>stem of the stemmed word. This method does the following:
>    public String getCurrent()
>    {
>        return current.toString();
>    }
>
>So it converts the StringBuffer current to a String - but:
>StringBuffer.toString() does a so-called "lazy copy" - it does NOT
>create a fresh new String which is returned, but instead it creates a
>"hollow" String object, where it points the data buffer to the existing
>StringBuffer. So the StringBuffer and the new String point to the exact
>same memory block.
>
>So when the StringBuffer has allocated 2MB (and only 10 bytes used, which
>is OK for a StringBuffer!), then the new String also points to a 2MB
>memory block with only 10 bytes used.
>Java memorizes this fact and when the StringBuffer changes its value,
>then the actual copy of the memory is done - but too late! The String
>object occupies 2MB - and always will - even if only 10 bytes contain
>useful characters!
>
>As people are calling SnowBall's getCurrent() method often - they almost
>always get String objects that occupy a lot of useless memory. This is
>O.K. if they only use these Strings, for example (like in your
>TestApp), to do a System.out.println() and discard them afterwards. Then
>the memory will be freed by Java's garbage collector. But when you keep
>references to those Strings (e.g. as keys in a HashMap, like in my
>case!), the machine's memory runs out lightning fast! Actually I could
>only store about 300,000 stems in my HashMap, which occupied 600MB of RAM
>at that time!
>
>So, reusing StringBuffers is actually a use case which maybe was not
>intended by the developers of the StringBuffer class.
>
>
>_*Solution:*_
>
>Either the user or your library can do something like this
>    String myStem = new String( germanStemmer.getCurrent());
>
>
>or (which I would prefer): rewrite the getCurrent method like one of the
>following (this prevents lib users from using the library in a maybe
>dangerous way):
>
>    public String getCurrent()
>    {
>        return new String(current);
>    }
>
>or
>    public String getCurrent()
>    {
>        return current.substring(0);
>    }
>
>in both cases only the actual amount of occupied characters is stored in
>the new String object.
>
>
>
>I don't know who is actually caring for the Java part of Snowball. But
>I'm sure you can forward this email to him/her.
>I really would like to hear from your team, if you could reproduce my
>problem and find the solution helpful.
>Or did I overlook some other (memory saving) means of getting the
>desired stem?
>
>Anyway: Thank you for your great work
>and greetings from Germany:
>    Wolfram
>
>
>
>--
>
>---
>
>    o     Wolfram Esser (Dipl.-Inform.), Lehrstuhl fuer Informatik II
>   / \    Universitaet Wuerzburg, Am Hubland, D-97074 Wuerzburg
>infoII o  Phone: +49 (0)931-888-6614 Fax: +49 (0)931-888-6603
>   / \    mailto:esser <at> informatik.uni-wuerzburg.de
>  o   o   http://www2.informatik.uni-wuerzburg.de/staff/wolfram/
>
>
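[Editorial note] A minimal Java sketch of the defensive-copy variant Wolfram proposes. It is not the Snowball Java runtime; the class ReusedBufferStemmer, its setCurrent method, and the buffer size are hypothetical stand-ins for a stemmer that reuses one large StringBuffer. The buffer-sharing behaviour he describes applies to the JDK releases of that period; as far as I know, later JDK releases copy the characters in StringBuffer.toString() itself, in which case the explicit copy is merely harmless.

```java
import java.util.HashMap;
import java.util.Map;

class ReusedBufferStemmer {
    // One large buffer reused for every word (hypothetical capacity).
    private final StringBuffer current = new StringBuffer(1 << 20);

    void setCurrent(String word) {
        current.setLength(0);
        current.append(word);
    }

    // Returns a String holding only the characters currently in use,
    // independent of the reused buffer.
    String getCurrent() {
        return new String(current);
    }

    public static void main(String[] args) {
        ReusedBufferStemmer stemmer = new ReusedBufferStemmer();
        Map<String, String> stemToWord = new HashMap<>();
        for (String word : new String[] {"haus", "baum"}) {
            stemmer.setCurrent(word); // a real stemmer would also call stem() here
            stemToWord.put(stemmer.getCurrent(), word);
        }
        System.out.println(stemToWord.size()); // prints 2
    }
}
```

Keeping the copy inside getCurrent() follows Wolfram's preference: callers who store stems as HashMap keys cannot accidentally retain the large buffer.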