Btw, here are the Cyrillic letters in UTF-8:
stringdef a decimal '45264'
stringdef b decimal '45520'
stringdef v decimal '45776'
stringdef g decimal '46032'
stringdef d decimal '46288'
stringdef e decimal '46544'
stringdef zh decimal '46800'
stringdef z decimal '47056'
stringdef i decimal '47312'
stringdef i` decimal '47568'
stringdef k decimal '47824'
stringdef l decimal '48080'
stringdef m decimal '48336'
stringdef n decimal '48592'
stringdef o decimal '48848'
stringdef p decimal '49104'
stringdef r decimal '32977'
stringdef s decimal '33233'
stringdef t decimal '33489'
stringdef u decimal '33745'
stringdef f decimal '34001'
stringdef kh decimal '34257'
stringdef ts decimal '34513'
stringdef ch decimal '34769'
stringdef sh decimal '35025'
stringdef shch decimal '35281'
stringdef " decimal '35537'
stringdef y decimal '35793'
stringdef ' decimal '36049'
stringdef e` decimal '36305'
stringdef iu decimal '36561'
stringdef ia decimal '36817'
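The decimals look odd until you notice the pattern: each one appears to be the letter's two UTF-8 bytes read as a single little-endian 16-bit number, which is presumably how they fit into a 2-byte-character Snowball build. That reading is inferred from the numbers, not stated anywhere in the post. A small Python sketch (the function name `packed_utf8` is made up here) that reproduces a few of the entries:

```python
def packed_utf8(ch: str) -> int:
    """Pack the two UTF-8 bytes of ch into one little-endian 16-bit value."""
    data = ch.encode("utf-8")
    if len(data) != 2:
        raise ValueError(f"{ch!r} is not a 2-byte UTF-8 character")
    return int.from_bytes(data, "little")


# Cyrillic letters paired with the stringdef names used in the list above.
for name, letter in [("a", "а"), ("p", "п"), ("r", "р"), ("ia", "я")]:
    print(f"stringdef {name} decimal '{packed_utf8(letter)}'")
```

The printed lines match the entries for a, p, r and ia. The jump from 49104 (p) down to 32977 (r) is where the lead byte changes from 0xD0 to 0xD1.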
Martin Porter wrote:
Return-Path: <bu <at> lucky.net>
Delivered-To: martin_porter <at> SoftHome.net
Date: Wed, 04 Jun 2003 10:08:30 +0300
From: Eugen Bushuev <bu <at> lucky.net>
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.2)
Gecko/20021120 Netscape/7.01
X-Accept-Language: ru, en-us
To: Martin Porter <martin_porter <at> SoftHome.net>
CC: Oleg Bartunov <oleg <at> sai.msu.su>
Subject: Re: [Snowball-discuss] 8-bit and 16-bit characters support
References: <courier.3EDD8B20.00003D70 <at> softhome.net>
X-Verify-Sender: verified
Hi.
I was the one who raised this question. You can get a bit of Russian UTF-8
text at http://ox.carrier.kiev.ua/~bu/test/fetch/novosti_utf8.html.
About your advice: I can't find anything like the "goto v[owel]" directives
in either the Russian or the English .sbl. I tried adding the current
character size to the SN_env structure and replacing sizeof(symbol) with
z->sizeOfChar in all the memory allocation procedures. I also tried playing
with how z->c is incremented, but that got me nowhere, since I hardly
understand how it works.
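One reason a fixed character size is hard to retrofit: in UTF-8 the width varies from character to character, so a cursor cannot advance by a constant. Below is a rough Python sketch of stepping a byte cursor one character at a time. It only illustrates the idea and is not the Snowball runtime; `next_char` is a hypothetical helper, not a function or field of SN_env.

```python
def next_char(data: bytes, c: int) -> int:
    """Return the byte offset just past the UTF-8 character starting at c."""
    if c >= len(data):
        raise IndexError("cursor is already at the end")
    c += 1
    # Continuation bytes have the bit pattern 10xxxxxx; skip over them.
    while c < len(data) and (data[c] & 0xC0) == 0x80:
        c += 1
    return c


text = "stem стем".encode("utf-8")
c, starts = 0, []
while c < len(text):
    starts.append(c)  # offset where each character begins
    c = next_char(text, c)
print(starts)
```

The offsets advance by 1 across the Latin letters and the space, then by 2 across the Cyrillic letters.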
I'm trying to figure this out because I need tsearch to work with UTF-8.
UTF-8 is used because Postgres uses it as "Unicode", and besides that I
need to process data in several languages, at least English, Russian
and Ukrainian.
And, btw, why 2 or 3 bytes per character? I thought English text uses
1 byte per character and Russian 2 bytes...
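For what it's worth, the byte count depends on the code point: ASCII letters take 1 byte, Cyrillic letters (Russian and Ukrainian alike) take 2, and characters further up the Basic Multilingual Plane, such as the euro sign, take 3. A quick Python check, with the helper name `utf8_widths` chosen only for illustration:

```python
def utf8_widths(text: str) -> list:
    """Number of UTF-8 bytes used by each character of text."""
    return [len(ch.encode("utf-8")) for ch in text]


for sample in ["word", "слово", "їжак", "€"]:
    print(sample, utf8_widths(sample))
```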
Martin Porter wrote:
Oleg,
No, Snowball is set up for either 1-byte or 2-byte character use, but it has
occurred to me that implementing the stemmers on UTF-8 data may not be so
difficult, even with no changes to the Snowball compiler.
If you treat UTF-8 data as a pure byte stream (so that one UTF-8 character
corresponds to 2 or 3 bytes), the stemmers almost work. The thing that goes
wrong is the single-character test for characters in a certain class. So one
would have to replace
    goto vowel // vowel defined by: define vowel '...'
by
    goto among ('a' 'e' 'i' 'o' 'u')
or more precisely
    goto among ('[a]' '[e]' ... )
where [a] etc. are macros defining the vowels as UTF-8 encoded byte sequences.
Perhaps that is how all the stemmers should have been written.
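A rough illustration of the problem, in Python rather than Snowball. The vowel list is assumed to be the usual Russian set а е и о у ы э ю я, and both function names are invented for this sketch. A per-byte class test fires on the lead byte that vowels share with consonants, while matching whole byte sequences, in the spirit of `among`, finds the real vowel:

```python
VOWELS = "аеиоуыэюя"  # assumed vowel set, for illustration only
VOWEL_SEQS = [v.encode("utf-8") for v in VOWELS]
VOWEL_BYTES = {b for seq in VOWEL_SEQS for b in seq}


def goto_vowel_bytewise(data: bytes, c: int = 0) -> int:
    """Naive version: treat each byte as a character and test set membership."""
    while c < len(data):
        if data[c] in VOWEL_BYTES:
            return c
        c += 1
    return -1


def goto_vowel_among(data: bytes, c: int = 0) -> int:
    """Match whole UTF-8 byte sequences; return the offset where one starts."""
    while c < len(data):
        if any(data.startswith(seq, c) for seq in VOWEL_SEQS):
            return c
        c += 1
    return -1


word = "стол".encode("utf-8")
print(goto_vowel_bytewise(word))
print(goto_vowel_among(word))
```

The bytewise version stops at offset 0, on the 0xD1 lead byte of the consonant "с". The sequence version stops at offset 4, where "о" actually begins. Scanning one byte at a time is safe in the second function because a UTF-8 lead byte never appears as a continuation byte.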
Can you point me to some plain text somewhere on the web that gives a bit of
Russian in UTF-8 encoded Unicode? I might play around with this idea.
Martin
_______________________________________________
Snowball-discuss mailing list
Snowball-discuss <at> lists.tartarus.org
http://lists.tartarus.org/mailman/listinfo/snowball-discuss
--
Best regards, E. Bushuev.