4 Jan 2005 11:35
Re: a simple algorithm problem
Martin Porter <martin.porter <at> grapeshot.co.uk>
2005-01-04 10:35:13 GMT
Ayhan,

Thank you for the sample Snowball script. (I would have defined the utf-8 characters through stringdefs, following section 3 of the Snowball manual.)

You ask if there is any way around 'size' being wrong when utf-8 characters are included. This is all part of the problem of character representation. The best solution would be an additional compilation mode for utf-8, but of course that would mean yet another elaboration of Snowball itself ...

Right now there is 8-bit working, and 16-bit working. 16-bit working fits in well with the Java codegenerator scheme, and can be used with the ANSI C codegenerator, although it has given rise to confusion on at least one occasion.

In Snowball the concept of 'character' only turns up in a few contexts:

a) hop N - to hop forward N characters
b) next = hop 1
c) goto C
d) gopast C - where you keep doing a 'next' until C is successful
e) size - counts the number of characters

and in 'groupings'. In retrospect, I occasionally wish groupings were not in the language. Instead of

A) define vowel 'aeiou'

one could have [...]
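To illustrate why 'size' comes out wrong under 8-bit working, here is a minimal Python sketch, not Snowball code and not how the Snowball runtime is implemented. It assumes the word is held as raw UTF-8 bytes, and the helper names utf8_size and utf8_hop are hypothetical. It contrasts the byte count with a character count obtained by skipping UTF-8 continuation bytes (those of the form 10xxxxxx), and shows a character-wise 'hop' on the same data.

```python
def utf8_size(data: bytes) -> int:
    """Count characters by ignoring UTF-8 continuation bytes (10xxxxxx)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def utf8_hop(data: bytes, cursor: int, n: int) -> int:
    """Move the byte cursor forward by n characters.

    Returns the new byte offset, or -1 if fewer than n characters remain
    (a stand-in for the hop failing).
    """
    for _ in range(n):
        if cursor >= len(data):
            return -1
        cursor += 1  # step over the lead byte
        # step over any continuation bytes belonging to the same character
        while cursor < len(data) and (data[cursor] & 0xC0) == 0x80:
            cursor += 1
    return cursor


word = "çöp".encode("utf-8")   # two 2-byte characters followed by one ASCII byte
byte_count = len(word)          # what a byte-oriented 'size' would report
char_count = utf8_size(word)    # what a character-oriented 'size' should report
after_one_hop = utf8_hop(word, 0, 1)
```

For this word the byte count is 5 while the character count is 3, and hopping one character from the start lands on byte offset 2 rather than 1. A utf-8 compilation mode would presumably need 'hop', 'next' and 'size' to behave in this character-wise fashion, but that is an assumption about the design and not something stated in the thread.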
> >What's the character encoding of snowball scripts at the moment?
> The scripts themselves are in ASCII, and ASCII assumptions are made in the
> Snowball compiler.
So when you were talking about strings being in UTF-8, were you
talking about input and output only? I wasn't aware the concept of
'string' applied to anything other than things in the Snowball
language itself ... or do you mean that strings would be stored
internally to a running snowball stemmer in UTF-8?
Cheers,
James