Ralf Juengling | 3 Mar 01:57 2007

string overhaul

Hi,

The representation of string objects by the interpreter
and by the compiler is different at the moment. For
the interpreter it's an object of builtin class STRING
that maintains a pointer to a C-style (null-terminated)
string; the compiler stores strings in ubyte storages
(with a redundant \0 terminator to be compatible with C
functions).

It's ugly because the string data is being copied back
and forth each time the line of execution crosses the
boundary between interpreter and compiled code (it seems).
I would like to unify the representation and use storages
on both sides. I would introduce a new storage class
"character", say, which internally is equivalent to
ubyte (a "wide-character" class could be added later),
so that strings are still distinguishable from ubyte
storages. These character storages would be read-only,
but it's conceivable to allow some string functions to
work with ubyte storages as well, so that we effectively
would have mutable and immutable strings.
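
To make the proposal above concrete, here is a minimal sketch in C of
a storage that carries a kind tag, so that a read-only "character"
storage stays distinguishable from a mutable ubyte storage while both
share the same byte layout. All names (storage, storage_kind,
ST_CHARACTER, storage_set, ...) are hypothetical and chosen for
illustration; they are not Lush's actual storage types or API.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical element kinds; NOT Lush's real storage classes. */
typedef enum { ST_UBYTE, ST_CHARACTER } storage_kind;

typedef struct {
    storage_kind   kind;  /* tells strings apart from raw byte storages */
    size_t         size;  /* number of bytes, excluding the terminator */
    unsigned char *data;  /* always followed by '\0' for C functions */
} storage;

/* Create a storage holding a copy of the first n bytes of src. */
storage *storage_new(storage_kind kind, const void *src, size_t n)
{
    storage *st = malloc(sizeof *st);
    if (!st) return NULL;
    st->data = malloc(n + 1);
    if (!st->data) { free(st); return NULL; }
    if (n > 0) memcpy(st->data, src, n);
    st->data[n] = '\0';  /* redundant terminator, as in the compiled code */
    st->kind = kind;
    st->size = n;
    return st;
}

/* Writes succeed only on ubyte storages; character storages are read-only. */
int storage_set(storage *st, size_t i, unsigned char c)
{
    if (st->kind != ST_UBYTE || i >= st->size) return -1;
    st->data[i] = c;
    return 0;
}

/* A "string function" that accepts either kind of storage. */
size_t storage_count(const storage *st, unsigned char c)
{
    size_t hits = 0;
    for (size_t i = 0; i < st->size; i++)
        if (st->data[i] == c) hits++;
    return hits;
}

void storage_free(storage *st)
{
    if (st) { free(st->data); free(st); }
}
```

With a layout like this, both sides of the interpreter/compiler
boundary could pass the same storage pointer instead of copying the
bytes; whether that fits Lush's memory management is an open question.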

I haven't really thought that much about it and was
wondering if you would go about this differently.
Any thoughts?

Ralf


Raymond Martin | 4 Mar 15:43 2007

Re: string overhaul

Hi,

> The representation of string objects by the interpreter
> and by the compiler is different at the moment. For
> the interpreter it's an object of builtin class STRING
> that maintains a pointer to a C-style (null-terminated)
> string; the compiler stores strings in ubyte storages
> (with a redundant \0 terminator to be compatible with C
> functions).

Last year Leon made a start at adding Unicode support, so I would say that
if anything is changed at the lower level in terms of strings,
internationalization support should be treated as a main concern and
dealt with at the same time. It should not be left until later if possible.

> 
> It's ugly because the string data is being copied back
> and forth each time the line of execution crosses the
> boundary between interpreter and compiled code (it seems).
> I would like to unify the representation and use storages
> on both sides. I would introduce a new storage class
> "character", say, which internally is equivalent to
> ubyte (a "wide-character" class could be added later),
> so that strings are still distinguishable from ubyte
> storages. These character storages would be read-only,
> but it's conceivable to allow some string functions to
> work with ubyte storages as well, so that we effectively
> would have mutable and immutable strings.


Ralf Juengling | 12 Mar 02:08 2007

Re: string overhaul


On Sun, 4 Mar 2007, Raymond Martin wrote:

> Last year Leon made a start at adding Unicode support, so I would say that
> if anything is changed at the lower level in terms of strings,
> internationalization support should be treated as a main concern and
> dealt with at the same time. It should not be left until later if possible.
>
> A suggestion I have made previously is to use the m17n library for strings
> (http://www.m17n.org). It is included in major Linux distros. There is also
> UIMA from IBM, but it is very large.

Thanks, I took a cursory look at it (which is to say that I don't really
know it); it certainly looks interesting.

I'm still trying to get a grasp of this whole issue, please correct me
where I'm wrong.

When we read a text file in lush, with 'read-lines' say, then it is 
assumed that the text is in ASCII. ASCII encodes only 128 characters,
and in a way that the "code points" (integers) fit in a C char. That's
why we can go with the conventional C representation of character
strings (arrays of chars).
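
As a small illustration of the ASCII assumption described above, the
following C sketch checks whether a buffer contains only 7-bit code
points (0..127). The function name is_ascii is hypothetical and is
not part of Lush.

```c
#include <stddef.h>

/* Return 1 if every byte in buf[0..n) is a 7-bit ASCII code point,
 * i.e. in the range 0..127; return 0 otherwise. */
int is_ascii(const unsigned char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (buf[i] > 127) return 0;
    return 1;
}
```

Any byte above 127 (for example the two bytes 0xC3 0xA9 that encode
"é" in UTF-8) means the text is not plain ASCII, and treating each
char as one character is no longer safe.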

Both the interpreter and the compiler use this C string representation,
both also assume/make sure that the character strings are null-terminated.
They only differ in what objects they use as "handles" for C strings (the
interpreter uses objects of builtin type 'string_class' (in header.h),
the compiler also uses a dedicated type but in the code it produces,
(Continue reading)

Ralf Juengling | 12 Mar 17:15 2007

Re: mailing list archives

On Sun, 11 Mar 2007, Raymond Martin wrote:

> Unfortunately, the mailing list archives seem to not
> be working properly at this moment.

Scary. Both archives (lush-devel and lush-users) seem inaccessible
now.

ralf

Raymond Martin | 15 Mar 20:18 2007

Re: string overhaul

Ralf,

On 11 March 2007 21:08:17 you wrote:

> I'm still trying to get a grasp of this whole issue, please correct me
> where I'm wrong.
> 
> When we read a text file in lush, with 'read-lines' say, then it is 
> assumed that the text is in ASCII. ASCII encodes only 128 characters,
> and in a way that the "code points" (integers) fit in a C char. That's
> why we can go with the conventional C representation of character
> strings (arrays of chars).

That depends on the locale set in the shell. If it is set to something like
en_US, then ASCII will be used. If it is set to a UTF-8 locale, then wide
characters will be used internally after conversion.
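
For reference, a minimal sketch of how a C program on a POSIX system
can pick up the shell's locale and convert a multibyte string to wide
characters. It uses only standard calls (setlocale, nl_langinfo,
mbstowcs); the wrapper names environment_codeset and to_wide are
hypothetical, and this does not claim to show what Lush itself does.

```c
#include <langinfo.h>   /* nl_langinfo, CODESET (POSIX) */
#include <locale.h>     /* setlocale */
#include <stdlib.h>     /* mbstowcs */
#include <wchar.h>      /* wchar_t */

/* Adopt the character-type locale from the environment (LANG, LC_ALL,
 * LC_CTYPE) and return the name of its character encoding. */
const char *environment_codeset(void)
{
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}

/* Convert the multibyte string mb to wide characters under the current
 * locale. Writes at most cap wide characters to out and returns the
 * number converted (excluding the terminator), or (size_t)-1 if mb
 * contains an invalid multibyte sequence. */
size_t to_wide(const char *mb, wchar_t *out, size_t cap)
{
    return mbstowcs(out, mb, cap);
}
```

The name returned by nl_langinfo(CODESET) depends on the platform and
the environment, so code that cares about the encoding should compare
it explicitly rather than assume ASCII.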

> 
> Both the interpreter and the compiler use this C string representation,
> both also assume/make sure that the character strings are null-terminated.
> They only differ in what objects they use as "handles" for C strings (the
> interpreter uses objects of builtin type 'string_class' (in header.h),
> the compiler also uses a dedicated type but in the code it produces,
> storage objects are used as "handles". I'm not quite sure why it's
> necessary but unlike array data, string data is being copied when the
> interpreter passes a string to a compiled (dh) function, or when a
> compiled function returns a character string to the interpreter. In
> my last email I said I would like to get rid of that copying and thought,
> in order to do that we'd need to unify the way interpreter and compiler
> "handle" character strings. But I'm not even sure anymore that this
(Continue reading)
