Martin Porter | 4 Jan 2005 11:35

Re: a simple algorithm problem


Ayhan,

Thank you for the sample Snowball script. (I would have defined the utf-8
characters through stringdefs, following section 3 of the Snowball manual).

You ask if there is any way around 'size' being wrong when utf-8 characters
are included. This is all part of the problem of character representation.
The best solution would be an additional compilation mode for utf-8, but of
course that would mean yet another elaboration of Snowball itself ... 
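
To illustrate why 'size' goes wrong, here is a minimal sketch in C of a character count that is aware of utf-8. The name utf8_size is hypothetical and not part of the Snowball runtime, and the sketch assumes the input is well-formed utf-8.

```c
#include <assert.h>
#include <stddef.h>

/* Count the characters (code points) in a utf-8 buffer of n bytes.
   Continuation bytes have the bit pattern 10xxxxxx, so every byte that
   does not match that pattern starts a new character.
   Hypothetical helper; assumes well-formed utf-8. */
size_t utf8_size(const unsigned char *s, size_t n)
{
    size_t count = 0;
    size_t i;

    assert(s != NULL);
    for (i = 0; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) count++;
    }
    return count;
}
```

For a word such as "k\xC3\xB6y" (4 bytes), a byte-based 'size' reports 4, while this count reports 3.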

Right now there is 8-bit working, and 16-bit working. 16-bit working fits in
well with the Java codegenerator scheme, and can be used with the ANSI C
codegenerator, although it has given rise to confusion on at least one
occasion.    

In Snowball the concept of 'character' only turns up in a few contexts:

a) hop N      - to hop forward N characters
b) next       = hop 1
c) goto C
d) gopast C   - where you keep doing a 'next' until C is successful
e) size       - counts the number of characters

and in 'groupings'. In retrospect, I occasionally wish groupings were not in
the language. Instead of

A) define vowel 'aeiou'

one could have

Olly Betts | 5 Jan 2005 01:55

Re: a simple algorithm problem

On Tue, Jan 04, 2005 at 10:35:13AM +0000, Martin Porter wrote:
> In retrospect, I occasionally wish groupings were not in
> the language. Instead of
> 
> A) define vowel 'aeiou'
> 
> one could have
> 
> B) define vowel as among('a' 'e' 'i' 'o' 'u')
> 
> (A) is implemented as a bitmap, and (B) as a fast table lookup, and (A) is
> faster than (B), but optimisation in the codegenerator could turn (B) into a
> bitmap as well.
>
> There are other differences however: (B) needs to be defined
> in a 'forward' or 'backward' context; non-vowel is a neat test that works
> with style (A) but not (B).

Could (A) simply be handled internally as a shorthand for defining form (B) in
both forward and backward context, without changing the meaning of existing
code?  As you say, the code generator can optimise this to give the same
generated code as at present (although it may not be worth the complications of
using a bitmap for multi-byte utf-8 characters).

The manual defines "non-vowel" as the same as "(not vowel next)".  Wouldn't
that work for the (B) version too?  In which case just extending where "non"
can be used solves that issue.
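
The reading of "non-vowel" as "(not vowel next)" can be sketched in C for the 8-bit case. This is an illustration only: vowel_at and non_vowel are hypothetical names, not functions from the generated code or the runtime library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical grouping test: 1 if ch is one of 'aeiou'. */
static int vowel_at(int ch)
{
    return ch == 'a' || ch == 'e' || ch == 'i' || ch == 'o' || ch == 'u';
}

/* Sketch of "non-vowel" read as "(not vowel next)", one byte per character.
   On success the cursor moves on by one character and 1 is returned.
   On failure the cursor is left unchanged and 0 is returned. */
int non_vowel(const char *s, int limit, int *cursor)
{
    assert(s != NULL && cursor != NULL);
    /* 'not vowel': fails if a vowel is present at the cursor */
    if (*cursor < limit && vowel_at((unsigned char)s[*cursor])) return 0;
    /* 'next': fails at the limit, since there is no character to hop over */
    if (*cursor >= limit) return 0;
    (*cursor)++;
    return 1;
}
```

Written this way, nothing in the test depends on how the vowel test itself is implemented, whether as a grouping or as an among.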

> If groupings were NOT in the language, you could reduce the difference
> between utf-8 and single character working to the definition of a couple of

Martin Porter | 6 Jan 2005 10:12

Re: a simple algorithm problem


Olly,

I would answer 'yes' to all the points made in your last email. Thanks for
reporting the error in the Snowball manual. It will be fixed in the next cvs commit.

There are various ways to proceed ... 

In the 2-byte character version of Snowball (standard for the Java
codegenerator) you can define characters as decimal or hex numbers in the
range 257 to 64K. These characters can go into character tables, which are
implemented as bitmaps. Of course, working with 256 characters, the bitmap
never exceeds 32 bytes -- and will frequently be less, since the bitmap is
truncated at both ends by removing runs of zeros.

Working with 64K characters, a bitmap might go up to 8K in size, which is
not an intolerable overhead. In practice they are much smaller, since the
codes we need in the stemmers do not have high Unicode values.
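
A minimal sketch in C of a bitmap truncated at both ends, as described above. The struct and function names are assumptions made for the illustration and may differ from what the codegenerator actually emits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical grouping: the bitmap only covers code points min..max,
   which is the effect of removing the runs of zeros at both ends. */
struct grouping {
    const unsigned char *bits;
    int min;
    int max;
};

/* Returns 1 if ch is in the grouping, 0 otherwise. */
int in_grouping(const struct grouping *g, int ch)
{
    int offset;

    assert(g != NULL);
    if (ch < g->min || ch > g->max) return 0;
    offset = ch - g->min;
    return (g->bits[offset >> 3] >> (offset & 7)) & 1;
}

/* 'aeiou': min is 'a' (97), max is 'u' (117), so 21 bits fit in 3 bytes
   instead of the full 32. Bits set at offsets 0, 4, 8, 14 and 20. */
static const unsigned char vowel_bits[] = { 0x11, 0x41, 0x10 };
static const struct grouping vowel = { vowel_bits, 'a', 'u' };
```

With 64K characters the same layout applies: the table size is set by the distance between the lowest and highest code in the grouping, not by the size of the character set.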

So one idea is to declare 'utf8' in the Snowball script, allowing character
defs in the range 0-64K, as in the 2-byte character version. Characters
could be written with their Unicode values, and encoded in utf-8 form in
strings.
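
For illustration, encoding a Unicode value in the range 0-64K into utf-8 form might look like the following C sketch. utf8_encode is a hypothetical name, and the sketch does not reject the surrogate range D800-DFFF, which a real compiler mode would probably want to do.

```c
#include <assert.h>
#include <stddef.h>

/* Encode a code point in the range 0..0xFFFF as utf-8.
   'out' must have room for 3 bytes. Returns the number of bytes written.
   Hypothetical helper; surrogates (0xD800..0xDFFF) are not checked. */
int utf8_encode(unsigned cp, unsigned char *out)
{
    assert(out != NULL && cp <= 0xFFFF);
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    /* 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

So a character written in the script by its Unicode value F6 (hex) would appear in strings as the two bytes C3 B6.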

Looking again at the cursor movement issues:

If I've got my thinking right: in goto and gopast, cursor movement can be
done one byte at a time (cursor++; or cursor--;). If expression C is made
up of well-formed utf-8 strings, 'goto C' must either fail or end on a
valid character boundary.
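
A small C sketch of that argument, for the simple case where C is a single literal string. The names goto_literal and is_char_boundary are hypothetical. The cursor advances one byte at a time, and because a well-formed utf-8 literal starts with a byte that is not a continuation byte, a match in a well-formed buffer can only start on a character boundary.

```c
#include <assert.h>
#include <string.h>

/* Sketch of 'goto C' for a literal utf-8 string 'lit'.
   Moves forward one byte at a time from 'cursor'. Returns the byte offset
   of the first match, or -1 if the limit is reached without a match. */
long goto_literal(const char *buf, long limit, long cursor, const char *lit)
{
    size_t len;

    assert(buf != NULL && lit != NULL);
    len = strlen(lit);
    for (; cursor + (long)len <= limit; cursor++) {
        if (memcmp(buf + cursor, lit, len) == 0) return cursor;
    }
    return -1;
}

/* A position is a character boundary if the byte there is not a
   utf-8 continuation byte (10xxxxxx). */
int is_char_boundary(const char *buf, long pos)
{
    return ((unsigned char)buf[pos] & 0xC0) != 0x80;
}
```

Starting the search from the middle of a character is harmless here: the continuation byte under the cursor cannot match the first byte of the literal, so the loop just steps past it.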

James Aylett | 6 Jan 2005 10:29

Re: a simple algorithm problem

On Thu, Jan 06, 2005 at 09:12:34AM +0000, Martin Porter wrote:

> So one idea is to declare 'utf8' in the Snowball script, allowing character
> defs in the range 0-64K, as in the 2-byte character version. Characters
> could be written with their Unicode values.

Presumably this still restricts Snowball to code points in the BMP? Or
does it just restrict it to recognising and doing things with
characters at code points in the BMP, passing through any others?
There's not a huge amount outside it yet, so this may not matter at
all.

> and encoded in utf-8 form in strings.

What's the character encoding of snowball scripts at the moment? It
isn't touched upon in the manual, so I'm guessing at present it's
expected to be ASCII or similar.

Cheers,
James

--

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james <at> tartarus.org                               uncertaintydivision.org

Martin Porter | 6 Jan 2005 11:20

Re: a simple algorithm problem


>Presumably this still restricts Snowball to code points in the BMP? Or
>does it just restrict it to recognising and doing things with
>characters at code points in the BMP, passing through any others?

It would be the latter. Since stemming is applicable to a system of
languages, all of whose characters are, I would assert, in the BMP, I do
not think that is a problem.

>What's the character encoding of snowball scripts at the moment?

The scripts themselves are in ASCII, and ASCII assumptions are made in the
Snowball compiler.

James Aylett | 6 Jan 2005 12:18

Re: a simple algorithm problem

On Thu, Jan 06, 2005 at 10:20:43AM +0000, Martin Porter wrote:

> >Presumably this still restricts Snowball to code points in the BMP? Or
> >does it just restrict it to recognising and doing things with
> >characters at code points in the BMP, passing through any others?
> 
> It would be the latter. Since stemming is applicable to a system of
> languages, all of whose characters are, I would assert, in the BMP, I do
> not think that is a problem.

I'd agree that, right now, acting on code points outside the BMP
shouldn't be needed. Providing other characters are passed through, it
will cope happily with anything I can reasonably think of throwing at
it :-)

> >What's the character encoding of snowball scripts at the moment?
> The scripts themselves are in ASCII, and ASCII assumptions are made in the
> Snowball compiler. 

So when you were talking about strings being in UTF-8, were you
talking about input and output only? I wasn't aware the concept of
'string' applied to anything other than things in the Snowball
language itself ... or do you mean that strings would be stored
internally to a running snowball stemmer in UTF-8?

Cheers,
James


Olly Betts | 6 Jan 2005 17:12

Re: a simple algorithm problem

On Thu, Jan 06, 2005 at 09:12:34AM +0000, Martin Porter wrote:
> Working with 64K characters, a bitmap might go up to 8K in size, which is
> not an intolerable overhead. In practice they are much smaller, since the
> codes we need in the stemmers do not have high Unicode values.

I don't really see this as a big problem.  You can always just dump out
a C switch statement and let the C compiler's code generator decide on
the best way to represent it.
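
A sketch of what such generated C might look like. The function name and the particular code points are made up for the illustration and are not taken from any existing stemmer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical grouping test emitted as a switch on the character code.
   The C compiler is then free to compile it as a jump table, a bitmap
   test or a chain of comparisons. */
int is_vowel_cp(int ch)
{
    assert(ch >= 0 && ch <= 0xFFFF);   /* codes in the 0-64K range only */
    switch (ch) {
    case 'a': case 'e': case 'i': case 'o': case 'u':
    case 0x00F6:    /* o with diaeresis */
    case 0x00FC:    /* u with diaeresis */
        return 1;
    default:
        return 0;
    }
}
```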

> So one idea is to declare 'utf8' in the Snowball script, allowing character
> defs in the range 0-64K, as in the 2-byte character version. Characters
> could be written with their Unicode values, and encoded in utf-8 form in
> strings.

I doubt most of the stemmers care about characters beyond a 2 byte utf-8
representation.  So the answer is likely to be the same for any sequence
with the same first 2 bytes, and you don't even need to decode the character
fully.  Look at the first byte to see if this is a one byte, two byte,
or longer sequence, and have tables for the first two and a fixed answer
for the third.
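
Roughly, in C, that scheme could look like the sketch below. The struct layout and names are assumptions for the sake of the example. The two byte case is decoded here to index a single table, and anything that is not a one or two byte sequence (including a stray continuation byte) gets the fixed answer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical utf-8 grouping with one table per sequence length. */
struct utf8_grouping {
    const unsigned char *one_byte;  /* 16 byte bitmap for codes 0..0x7F   */
    const unsigned char *two_byte;  /* 256 byte bitmap for codes 0..0x7FF */
    int longer;                     /* fixed answer for longer sequences  */
};

static int test_bit(const unsigned char *bits, unsigned idx)
{
    return (bits[idx >> 3] >> (idx & 7)) & 1;
}

/* Test the character starting at p, with 'avail' bytes left in the buffer. */
int in_grouping_utf8(const struct utf8_grouping *g,
                     const unsigned char *p, size_t avail)
{
    assert(g != NULL && p != NULL);
    if (avail == 0) return 0;
    if (p[0] < 0x80)                             /* one byte sequence */
        return test_bit(g->one_byte, p[0]);
    if ((p[0] & 0xE0) == 0xC0 && avail >= 2) {   /* two byte sequence */
        unsigned cp = ((unsigned)(p[0] & 0x1F) << 6) | (p[1] & 0x3F);
        return test_bit(g->two_byte, cp);
    }
    return g->longer;                            /* three bytes or more */
}
```

The two byte table could be truncated at both ends in the same way as the existing bitmaps, which would make it much smaller than 256 bytes in practice.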

> Looking again at the cursor movement issues:
> 
> If I've got my thinking right: in goto and gopast, cursor movement can be
> done one byte at a time (cursor++; or cursor--;). If expression C is made
> up of well-formed utf-8 strings, 'goto C' must either fail or end on a
> valid character boundary.

That's true.


A. Tordai | 11 Jan 2005 13:47

Hungarian stemmer

Hello,

I was wondering if someone is already working on a Hungarian stemmer in 
Snowball. I'm an AI student and my thesis would be on the subject of 
Hungarian stemmers. It would be nice to know if I can continue 
developing one in Snowball.

Anna

Martin Porter | 11 Jan 2005 15:21

Re: Hungarian stemmer


Anna,

We know of no work being done in Snowball for Hungarian. A stemmer would be
of great interest. Tell us how you get on!

Martin Porter

Blake Madden | 11 Jan 2005 16:49

Re: Hungarian stemmer

Anna,

I know that Hungarian is quite different, but most of
the stemmers that I have seen for other Eastern
European languages (such as Polish and Czech)
are dictionary based.

Blake

--- Martin Porter <martin.porter <at> grapeshot.co.uk>
wrote:

> 
> Anna,
> 
> We know of no work being done in Snowball for
> Hungarian. A stemmer would be
> of great interest. Tell us how you get on!
> 
> Martin Porter

