Oleg Bartunov | 3 Jun 2003 17:31

8-bit and 16-bit characters support

Martin,

Someone asked me about simultaneous support of 1-byte and 2-byte
characters. It seems Snowball doesn't work correctly with
mixed UTF-8 text (Russian + English).

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg <at> sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
Martin Porter | 4 Jun 2003 08:01

Re: 8-bit and 16-bit characters support


Oleg,

No, Snowball is either set up for 1 byte character use, or 2 byte character
use, but it has occurred to me that implementing the stemmers on utf-8 data
may not be so difficult, even with no changes to the Snowball compiler.

If you treat utf-8 data as a pure byte stream of characters (so one utf-8
character corresponds to 2 or 3 bytes) the stemmers almost work, but the
thing that goes wrong is the single character tests for characters in a
certain class. So one would have to replace

    goto vowel  // vowel defined by 'define vowel '...'

by

    goto among ('a' 'e' 'i' 'o' 'u')

or more precisely

    goto among ('[a]' '[e]' ... )

where [a] etc. are macros defining the vowels as utf-8 encoded byte sequences.

Perhaps that is how all the stemmers should have been written.
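
A minimal sketch in Python (not Snowball) of the problem described above. Scanned as a raw byte stream, a one-symbol class test only ever sees a single byte of a multi-byte letter, whereas comparing whole byte sequences, which is roughly what an 'among' over utf-8 encoded strings would do, still matches. The Russian vowel list and all function names are assumptions made for illustration.

```python
# Illustrative only: models the two kinds of test on a UTF-8 byte stream.
RUSSIAN_VOWELS = "аеиоуыэюя"  # assumed vowel set for this example

# One-symbol class test: a set of single byte values.
# Only bytes below 0x80 represent a whole character on their own.
ENGLISH_VOWEL_BYTES = {ord(c) for c in "aeiou"}

# 'among'-style test: a list of complete UTF-8 byte sequences.
VOWEL_SEQUENCES = [v.encode("utf-8") for v in "aeiou" + RUSSIAN_VOWELS]


def byte_in_class(data: bytes, pos: int) -> bool:
    """Single-byte grouping test, as applied to a raw byte stream."""
    return data[pos] in ENGLISH_VOWEL_BYTES


def among_at(data: bytes, pos: int) -> int:
    """Return the length of the vowel sequence starting at pos, or 0."""
    for seq in VOWEL_SEQUENCES:
        if data.startswith(seq, pos):
            return len(seq)
    return 0


def find_first_vowel(data: bytes) -> int:
    """Rough analogue of 'goto among (...)': byte offset of the first vowel, or -1."""
    for pos in range(len(data)):
        if among_at(data, pos):
            return pos
    return -1


if __name__ == "__main__":
    word = "стол".encode("utf-8")            # every letter takes 2 bytes
    pos = find_first_vowel(word)
    print(pos, among_at(word, pos), byte_in_class(word, pos))
```

Scanning byte by byte is safe here because a complete UTF-8 sequence cannot match starting at a continuation byte.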

Can you point me to some plain text somewhere on the web that gives a bit of
Russian in utf-8 encoded Unicode? I might play around with this idea.

Martin

Martin Porter | 4 Jun 2003 10:19

Re: 8-bit and 16-bit characters support

>Return-Path: <bu <at> lucky.net>
>Delivered-To: martin_porter <at> SoftHome.net
>Date: Wed, 04 Jun 2003 10:08:30 +0300
>From: Eugen Bushuev <bu <at> lucky.net>
>User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.2) Gecko/20021120 Netscape/7.01
>X-Accept-Language: ru, en-us
>To: Martin Porter <martin_porter <at> SoftHome.net>
>CC: Oleg Bartunov <oleg <at> sai.msu.su>
>Subject: Re: [Snowball-discuss] 8-bit and 16-bit characters support
>References: <courier.3EDD8B20.00003D70 <at> softhome.net>
>X-Verify-Sender: verified
>
>Hi.
>This question was raised by me. You can get a bit of Russian UTF-8 text at
>http://ox.carrier.kiev.ua/~bu/test/fetch/novosti_utf8.html.
>
>About your advice: I can't find anything similar to "goto v[owel]"
>directives in either the Russian or the English .sbl. I tried to add the
>current character size to the SN_env structure and replace sizeof(symbol)
>with z->sizeOfChar in all memory allocation procedures. I also tried to
>play with incrementing z->c, but it got me nowhere since I hardly
>understand how it works.
>
>I'm trying to make this work because I need tsearch to work with UTF-8.
>UTF-8 is used because postgres uses it as "Unicode", and besides this I
>need to process data in several languages, at least English, Russian
>and Ukrainian.
>
>And, btw, why 2 and 3 characters? I thought that English text uses 1
>byte and Russian 2 bytes...

Martin Porter | 4 Jun 2003 10:20

Re: 8-bit and 16-bit characters support


Eugen,

I was really thinking aloud. I would need to rewrite the snowball scripts to
use 'among's rather than character groups. 'goto vowel' was just a way of
illustrating the problem.

The way to make it work with utf-8 encoded data is to put the unicode
Russian characters into 2 byte form before calling Snowball, and then repack
as utf-8 afterwards. Tedious, I know.
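
The unpack, stem, repack round trip can be sketched in Python as follows. This is only an illustration under stated assumptions: `stem` stands for any routine that works on a list of 16-bit symbols, and `drop_last_symbol` is a placeholder, not a real stemmer and not part of Snowball.

```python
from typing import Callable, List


def utf8_to_symbols(data: bytes) -> List[int]:
    """Decode UTF-8 into 16-bit values, one per character."""
    symbols = [ord(ch) for ch in data.decode("utf-8")]
    if any(s > 0xFFFF for s in symbols):
        raise ValueError("character does not fit in a 2-byte symbol")
    return symbols


def symbols_to_utf8(symbols: List[int]) -> bytes:
    """Repack 16-bit values as UTF-8."""
    return "".join(chr(s) for s in symbols).encode("utf-8")


def stem_utf8(data: bytes, stem: Callable[[List[int]], List[int]]) -> bytes:
    """Unpack UTF-8, run a 2-byte-symbol stemmer, repack as UTF-8."""
    return symbols_to_utf8(stem(utf8_to_symbols(data)))


def drop_last_symbol(symbols: List[int]) -> List[int]:
    """Placeholder standing in for a real stemmer (assumption)."""
    return symbols[:-1]


if __name__ == "__main__":
    result = stem_utf8("книги".encode("utf-8"), drop_last_symbol)
    print(result.decode("utf-8"))
```

Because the placeholder removes a whole symbol, the repacked output is always valid UTF-8, which would not be guaranteed if a stemmer trimmed single bytes.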

I said 2 or 3 byte characters because in utf-8, a character value above 127
packs into either 2 or 3 bytes. Is that not so?

I will look at the http address you sent.

Martin
James Aylett | 4 Jun 2003 11:04

Re: 8-bit and 16-bit characters support

On Wed, Jun 04, 2003 at 02:20:04AM -0600, Martin Porter wrote:

> I said 2 or 3 byte characters because in utf-8, a character value above 127
> packs into either 2 or 3 bytes. Is that not so?

It actually packs into up to six octets. Unicode originally had a
16-bit code point space (UTF-8 for that always fits into at most three
octets). Eventually they ran out, and extended the codepoint space. In
UTF-16 this is accessed using 'surrogate pairs' (which are fairly
vile, IMHO :-), but UTF-8 will just encode them directly using more
octets.

Actually, Unicode 4.0 has codepoints U+0000 to U+10FFFF [Unicode
website], but UTF-8 will encode up to U+7FFFFFFF [UTF-8 RFC 2279];
four octets of UTF-8 will suffice for the current codepoint range of
Unicode.
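
A small Python sketch of those ranges. The limits are the per-length maxima tabulated in RFC 2279; the function names are made up for the example. Python's own encoder stops at U+10FFFF, so the five- and six-octet forms can only be computed from the table, not cross-checked.

```python
# Largest code point that fits in 1..6 octets, per the RFC 2279 table.
RFC2279_LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]


def utf8_octets(code_point: int) -> int:
    """Number of UTF-8 octets needed for a code point under RFC 2279."""
    if code_point < 0:
        raise ValueError("negative code point")
    for octets, limit in enumerate(RFC2279_LIMITS, start=1):
        if code_point <= limit:
            return octets
    raise ValueError("beyond the range RFC 2279 can encode")


def utf16_units(code_point: int) -> int:
    """Number of 16-bit units in UTF-16 (2 means a surrogate pair)."""
    return len(chr(code_point).encode("utf-16-le")) // 2


if __name__ == "__main__":
    for cp in (0x41, 0x0430, 0xFFFF, 0x10000, 0x10FFFF):
        # Cross-check the table against Python's encoder where possible.
        assert utf8_octets(cp) == len(chr(cp).encode("utf-8"))
        print(f"U+{cp:04X}: {utf8_octets(cp)} octets, {utf16_units(cp)} UTF-16 units")
```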

James

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james <at> tartarus.org                               uncertaintydivision.org
Martin Porter | 4 Jun 2003 12:57

Re: 8-bit and 16-bit characters support


>It actually packs into up to six octets.

Yes, I was vaguely aware that it 'went on forever'. But I think for all the
characters that might be useful in a stemming context, i.e. in Snowball, the
original 64K holds everything (Polish L, Hungarian double acute etc etc).
And the large numbers are for Chinese ideograms and other exotica.

The reason the ranges matter is that Snowball uses bitmaps for character
sets, of size N-n bits where n and N are the smallest and largest code
values in the character set. N for Cyrillic is not too high, and in any case
N-n for Cyrillic is only about 33. But there is a general assumption that N
will not be so large that the bitmaps explode in size.
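
A rough Python sketch of such a bitmap, to show why the spread between n and N matters. It is not the Snowball runtime: the layout (one bit per code value from n to N, packed into bytes) and the names are assumptions made for illustration, and the vowel list is just sample data.

```python
from typing import Tuple

Grouping = Tuple[int, int, bytearray]  # (smallest code n, largest code N, bitmap)


def make_grouping(chars: str) -> Grouping:
    """Build a bitmap with one bit per code value in the range n..N."""
    codes = [ord(c) for c in chars]
    lo, hi = min(codes), max(codes)
    bitmap = bytearray((hi - lo) // 8 + 1)
    for code in codes:
        offset = code - lo
        bitmap[offset >> 3] |= 1 << (offset & 7)
    return lo, hi, bitmap


def in_grouping(group: Grouping, code: int) -> bool:
    """Test membership with a range check followed by a single bit lookup."""
    lo, hi, bitmap = group
    if code < lo or code > hi:
        return False
    offset = code - lo
    return bool(bitmap[offset >> 3] & (1 << (offset & 7)))


if __name__ == "__main__":
    vowels = make_grouping("аеиоуыэюя")
    print(len(vowels[2]), in_grouping(vowels, ord("о")), in_grouping(vowels, ord("к")))
    # A set that spans a wide range needs a far larger bitmap.
    print(len(make_grouping("a\uffff")[2]))
```

The Cyrillic vowels fit in a handful of bytes, while a set containing both 'a' and a value near the top of the 16-bit range needs several kilobytes for the same two members.
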
Eugen Bushuev | 4 Jun 2003 14:39

Re: 8-bit and 16-bit characters support

btw, cyrillic letters in utf-8:

stringdef a    decimal '45264'
stringdef b    decimal '45520'
stringdef v    decimal '45776'
stringdef g    decimal '46032'
stringdef d    decimal '46288'
stringdef e    decimal '46544'
stringdef zh   decimal '46800'
stringdef z    decimal '47056'
stringdef i    decimal '47312'
stringdef i`   decimal '47568'
stringdef k    decimal '47824'
stringdef l    decimal '48080'
stringdef m    decimal '48336'
stringdef n    decimal '48592'
stringdef o    decimal '48848'
stringdef p    decimal '49104'
stringdef r    decimal '32977'
stringdef s    decimal '33233'
stringdef t    decimal '33489'
stringdef u    decimal '33745'
stringdef f    decimal '34001'
stringdef kh   decimal '34257'
stringdef ts   decimal '34513'
stringdef ch   decimal '34769'
stringdef sh   decimal '35025'
stringdef shch decimal '35281'
stringdef "    decimal '36049'
stringdef y    decimal '35793'
stringdef '    decimal '35537'
stringdef e`   decimal '36305'
stringdef iu   decimal '36561'
stringdef ia   decimal '36817'
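
The decimal values above look like the two utf-8 bytes of each letter read as one little-endian 16-bit number: 'а' is D0 B0 in utf-8, and 0xB0D0 is 45264. That reading is inferred from the numbers rather than stated in the message, and the helper name below is made up. A short Python sketch that reproduces a few entries:

```python
def packed_utf8_value(ch: str) -> int:
    """Pack the two UTF-8 bytes of a character into a 16-bit value,
    first byte in the low-order position (little-endian)."""
    data = ch.encode("utf-8")
    if len(data) != 2:
        raise ValueError("expected a character that takes two UTF-8 bytes")
    return int.from_bytes(data, "little")


if __name__ == "__main__":
    for name, ch in (("a", "а"), ("p", "п"), ("r", "р"), ("ia", "я")):
        print(f"stringdef {name:<4} decimal '{packed_utf8_value(ch)}'")
```

The jump from 49104 for 'p' down to 32977 for 'r' is where the utf-8 lead byte changes from D0 to D1.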




--
Regards, E. Bushuev.

Sven Neumann | 5 Jun 2003 11:26

Can regular expressions be used to implement Porter Stemmer

Something less deep I have been wondering about is whether Snowball supports
regular expressions. I went to a lot of trouble porting Perl 5
compatible regexes to my own programming language, and they seemed the
easiest way to implement Porter. But Snowball doesn't support them, even
though it is designed specifically for stemmers. Is there a specific reason
for this, and if not, are there samples of a regex Porter stemmer?

Thank you,

Sven Neumann

PS: Please do not use html formatted mail, as it mixes badly in the
replies.
Dmitry Serebrennikov | 5 Jun 2003 03:16

So which source tree is the right one?

Greetings Snowballers,

I just checked out the CVS tree but then realized that there are 
actually two source trees to choose from...

Judging by the fact that the latest changes were made to the snowball module
in CVS rather than the website one, I guess this is where the best stuff
is found. But since it seems that the tree is being actively moved and
established, maybe it's not yet ready? It certainly looks much nicer! :)

Also, from the looks of it, I may be able to help here and there with 
the new tree. I already added a language flag to the sample program so 
it does not require recompilation and now I am finding that I'd like to 
make a few changes to the make file to make it build the Java stemmers...

I guess I am looking to find out what's happening with the tree and what
the plans are, so that I can get production-ready code for my use and
also steer my own efforts in a way that would be helpful to you all.

Thanks.
Dmitry.
Richard Boulton | 5 Jun 2003 16:26

Re: So which source tree is the right one?

On Thu, 2003-06-05 at 02:16, Dmitry Serebrennikov wrote:
> I just checked out the CVS tree but then realized that there are 
> actually two source trees to choose from...
>
> Judging by the fact that the latest changes were made to the snowball module
> in CVS rather than the website one, I guess this is where the best stuff
> is found. But since it seems that the tree is being actively moved and
> established, maybe it's not yet ready? It certainly looks much nicer! :)

Absolutely correct.  For now, you should probably use the code from
"website".

I'm setting up the "snowball" module to contain the source files for the
compiler and algorithms, a straightforward build system, and a standard
interface for the stemming algorithms.  However, I haven't yet finished
the work (and have been putting in about 1 afternoon per month, so it's
not going too quickly).

Martin is still developing in the "website" module, so the very latest
versions of the algorithms are likely to be found there - though
currently I believe that those in "snowball" are exactly consistent.

Once I'm happy with the structure in "snowball", the plan is to check it
with Martin and, assuming he approves, to make it the standard.

> Also, from the looks of it, I may be able to help here and there with 
> the new tree. I already added a language flag to the sample program so 
> it does not require recompilation and now I am finding that I'd like to 
> make a few changes to the make file to make it build the Java stemmers...

I'm interested in patches - feel free to send them directly to me.
(Note - this only applies to patches for the "snowball" module.  Any
patches to the algorithms or the website should be sent to this list, as
always.)

> I guess I am looking to find out what's happening with the tree and what
> the plans are, so that I can get production-ready code for my use and
> also steer my own efforts in a way that would be helpful to you all.

Hope the above helps.

-- 
Richard
