Tiago Tresoldi | 1 Dec 04:17 2002

unicode in lua

One of the first things I did in Lua was to write a small and not very well coded UTF-8 library for an
almost-dead machine translation program (which was written in Python and may some day become a
C/Lua combination).
I have been wondering whether to release it, because not only am I sure that it is far from perfect,
I also know that I could do something better myself... Anyway, I finally decided. You can download it at
http://traduki.sf.net/unicode.lua and use it for anything you want, but be advised that it is far
from perfect.
Private comments on this (but please don't blame me too much :) ) are welcome.

Tiago

Ignacio Castaño | 1 Dec 05:32 2002

Re: tolua survey


Some people mentioned that they would like tolua to accept any valid
header file as input, while many others said that they were not using
tolua because they preferred their hand-tuned bindings.

I'm not really interested in full C/C++ compliance. Lua is very different
from C++, and you don't usually want to translate class interfaces directly,
but to change them so that they fit the Lua programming style better. For
example, where in C++ you have a protected variable with two methods to
access it (accessor/mutator), in Lua you would like to access the variable
by name directly and let the binding code map the reads to the accessor and
the writes to the mutator.

Another example is the use of prefixes: many libraries have a prefix for all
of their functions and classes (SDL_, wx, etc). In Lua you would probably put
them in a module, but doing this in tolua would result in redundant function
names:

SDL.SDL_Open
wx.wxApp

While you can rename the functions, you cannot do so with classes (why?),
and doing it manually is sometimes a long task.

Those little things is what keeps me away from lua. toLua should be more
flexible, and instead of just translating C++ to lua, it should allow you to
say: "this is what I want in lua, and this is what I have in C++, now
generate the binding code". But maybe that's just not possible and I'm just
dreaming awake.


Ignacio Castaño | 1 Dec 06:29 2002

Re: tolua survey

Ignacio Castaño wrote:
> Those little things is what keeps me away from *lua*. toLua should be more

I meant 'toLua', lua is awesome of course ;-)

Ignacio Castaño
castanyo <at> yahoo.es


Björn De Meyer | 1 Dec 14:27 2002

Re: lua for unicode

John Belmonte wrote:
> 
> For Japanese, I think in the 50's there was a movement to reduce the
> number "essential characters", likely with the goal of improving the
> literacy rate.  A set of 1,850 characters was adopted by law and
> publications now limit themselves to that set except for proper names,
> as you say.

Yes, it's true that there are 2000 Jouyou Kanji,
which means "everyday use Chinese characters".
If you know these characters, you are not illiterate.
However, that doesn't mean you have a high degree of
literacy! Those 2000 characters are a legally enforced
minimum knowledge. Official publications and newspapers
will limit themselves to these everyday use Kanji.
However, Japanese literature does not limit itself
in this way.

 
> If there is a group of Japanese not happy with the unicode situation,
> aren't they free to press for additions to the code set, and to lobby
> their government to produce a complete and freely licensed font to
> preserve their heritage?

The problem is that due to chauvinism, and the way Microsoft has handled
things, those groups are wary of Unicode. They see it as a Western
attempt to butcher their language, and deeply mistrust the Unicode
Consortium. Like I said before, they even developed their own
Japanese-centric OS, TRON, that can display the whole range of
80000 Kanji and then some. Some of the Japanese free font uyou

Enrico Colombini | 1 Dec 10:45 2002

Re: tolua survey

At 05.32 01/12/02 +0100, Ignacio Castaño <castanyo <at> yahoo.es> wrote:
>Those little things is what keeps me away from lua. toLua should be 
>more flexible, and instead of just translating C++ to lua, it should 
>allow you to say: "this is what I want in lua, and this is what I 
>have in C++, now generate the binding code".

That would be interesting. In any case, I would like tolua to leave the
original C/C++ header unchanged: having to fine-tune a copy of the header
by hand at every version change can be a nightmare, especially if the
header has been written by somebody else.

A possible solution would be for tolua to parse the original header and a
separate "tolua_interface_directives" file, describing how to
interpret/convert the information from the header. This approach would
allow both simple conversions and more elaborate Lua-C(++) interfaces, such
as the one Ignacio wishes for.

  Enrico

Changhan Lee | 2 Dec 09:45 2002

Re: lua for unicode

Oh, sorry, I was confused about Microsoft's wide characters and the UTF-8 format.
I thought it was impossible for Lua to read a file stored as 2-byte wide characters,
and I mistakenly believed that all Unicode text is a 2-byte array.

I found that it's possible to save a Lua file in UTF-8 format, because all Lua keywords and tokens are ASCII characters,
and just reading a string from a Lua table without any other string operation was OK.
I used the MultiByteToWideChar/WideCharToMultiByte Win32 functions to read the string from the Lua table, and
it worked fine.
(Am I using these functions correctly? First I converted the UTF-8 multibyte string to wide characters, and then
converted the wide-character string to ANSI. I know this solution can't display characters from different
languages at the same time.)

but i'm still confusing, UTF-8 format is not wide character version, and it's unicode, 
and microsoft's wide character isn't unicode? 
UTF-8 is multi-byte and microsoft's wide character is 2-byte array..

I want to help with UTF-8 support in Lua, but I'm still a newbie in this field.
And checking whether a char is greater than 127 in every string loop doesn't look good for performance...
(string operations look easier with the wide character version)
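
For what it's worth, the per-byte test is cheap. A minimal sketch in C, assuming the
input is already valid UTF-8 (the helper name utf8_count is made up for illustration
and is not part of Lua): every byte of the form 10xxxxxx continues a character, so
counting characters is one mask and one compare per byte.

```c
#include <assert.h>
#include <stddef.h>

/* Count the characters (code points) in a NUL-terminated UTF-8 string.
   Assumption: the input is valid UTF-8; no validation is attempted.
   utf8_count is a hypothetical helper, not a Lua function. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* Continuation bytes look like 10xxxxxx; any other byte
           starts a new character, so only those are counted. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

With this, utf8_count("hello") gives 5, and a string holding two 3-byte Korean
syllables gives 2 even though strlen would report 6.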

And finally, I'm Korean, currently working on Korean and Japanese support in my game, and sorry for my poor
English :)

Roberto Ierusalimschy | 2 Dec 11:11 2002

Re: lua for unicode

> As for the core Lua lib, it's mostly string width independent
> I believe, although a quick search across the source produces a 
> number of calls to things like strlen in routines like 
> lua_pushstring, lua_dostring, luaO_chunkid, luaS_new, and 
> luaV_strcomp.

lua_pushstring is auxiliary (as you pointed out); the same for
lua_dostring (the "official" luaL_loadbuffer gets a size). luaS_new
and luaO_chunkid may need some rework.

> What is this call luaV_strcomp used for?

It does string-order comparison:  "hi" <= "hello". Yes, this one breaks
an external Unicode system. Suggestions?

> I also see the use of strcpy in a number of places. I was 
> under the impression Lua was string width independent but maybe
> I was wrong.

We try to make it string width independent, but it is difficult to
ensure that. We will try to improve that (maybe after 5.0 beta).

-- Roberto

lua+Steven.Murdoch | 2 Dec 13:12 2002

Re: lua for unicode

> but i'm still confusing, UTF-8 format is not wide character version, and it's unicode, 
> and microsoft's wide character isn't unicode? 
> UTF-8 is multi-byte and microsoft's wide character is 2-byte array..

Unicode is a character set which maps integer numbers to characters. In theory 
there is no limit to the size of Unicode since it does not define any 
representation. However, it is extremely unlikely that any characters will be 
mapped to numbers greater than 2^31 (2,147,483,648), and there are proposals to 
limit this to 2^21 (2,097,152).

Initially Unicode was limited to 2^16 positions (65,536), but this was found 
to be inadequate. The first 2^16 characters of Unicode are known as the Basic 
Multilingual Plane (BMP), which is intended to be enough to represent all living 
languages; however, as other messages have suggested, it does not contain 
historical characters. This space is not yet full, so further characters may 
be added in the future.

Unicode doesn't specify a way for the integers representing characters to be 
encoded, so there are a number of options. Windows was designed when Unicode 
characters were only 16 bits long, so it encodes each one as two bytes and can 
therefore only represent the BMP. This is what Microsoft's wide character 
representation is (sometimes called UCS-2).

UTF-8 is a variable length encoding which can represent the whole Unicode 
codespace (1-3 bytes for the BMP, 1-4 bytes for 21 bit Unicode, 1-6 bytes for 
31 bit Unicode). It has several features that are good for backwards 
compatibility. For details see: http://www.cl.cam.ac.uk/~mgk25/unicode.html
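
To make the byte counts concrete, here is a rough sketch of an encoder covering the 
1-4 byte forms (up to 21 bits). The name utf8_encode and its signature are invented 
for this example, and it does not reject the surrogate range or handle the 5 and 6 
byte forms.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one code point of up to 21 bits as UTF-8.
   Writes 1-4 bytes to buf (which must have room for 4) and returns the
   count, or 0 if cp needs more than 21 bits.
   Hypothetical helper for illustration; surrogates are not rejected. */
size_t utf8_encode(unsigned long cp, unsigned char *buf)
{
    assert(buf != NULL);
    if (cp < 0x80UL) {                 /* 7 bits: 0xxxxxxx (plain ASCII) */
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {                /* 11 bits: 110xxxxx 10xxxxxx */
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {              /* 16 bits: the rest of the BMP */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x200000UL) {             /* 21 bits: four bytes */
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                          /* beyond 21 bits: not handled here */
}
```

For example, U+0041 comes out as the single byte 41, U+20AC (the euro sign, inside 
the BMP) as E2 82 AC, and U+10000 (the first character outside the BMP) as 
F0 90 80 80.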

UTF-16 is another encoding which can represent the whole Unicode codespace as 
2 or 4 bytes per character.
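
As an illustration of the "2 or 4 bytes" rule, a small sketch follows. The function 
name utf16_encode is made up here; note that the surrogate-pair scheme it uses only 
reaches U+10FFFF, so larger values are rejected.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one code point as UTF-16 code units (16 bits each).
   Returns 1 unit (2 bytes) for the BMP, 2 units (4 bytes) for code points
   up to U+10FFFF, and 0 for anything larger.
   Hypothetical helper; it does not reject input in the range D800-DFFF. */
size_t utf16_encode(unsigned long cp, unsigned short *out)
{
    assert(out != NULL);
    if (cp < 0x10000UL) {              /* BMP: stored directly */
        out[0] = (unsigned short)cp;
        return 1;
    }
    if (cp > 0x10FFFFUL)               /* not representable in UTF-16 */
        return 0;
    cp -= 0x10000UL;                   /* 20 bits remain */
    out[0] = (unsigned short)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (unsigned short)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

U+20AC stays a single unit 20AC, while U+1D11E becomes the pair D834 DD1E.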

Roberto Ierusalimschy | 2 Dec 13:16 2002

Re: lua for unicode

> As for the core Lua lib, it's mostly string width independent
> I believe, although a quick search across the source produces a 
> number of calls to things like strlen in routines like 
> lua_pushstring, lua_dostring, luaO_chunkid, luaS_new, and 
> luaV_strcomp.

Just a remark: for UTF-8, the way Lua uses strlen, strcat, strcpy, etc.
is OK, as UTF-8 strings cannot contain zeros. The only dependency in the
core is `strcoll' (used in luaV_strcmp to order strings).
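
One possible direction, offered only as a sketch and not as what the Lua core does: 
if code point order is acceptable in place of locale-aware collation, a plain 
byte-wise comparison is enough, because valid UTF-8 sorts byte by byte in the same 
order as its code points. The name utf8_cmp is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Compare two UTF-8 strings of known length, byte by byte.
   Returns <0, 0 or >0 like strcmp. For valid UTF-8 this matches code
   point order; it is NOT locale-aware collation as strcoll would give.
   Hypothetical helper, not part of the Lua sources. */
int utf8_cmp(const char *a, size_t alen, const char *b, size_t blen)
{
    size_t n = alen < blen ? alen : blen;
    int r;
    assert(a != NULL && b != NULL);
    r = memcmp(a, b, n);               /* bytes are compared as unsigned */
    if (r != 0)
        return r;
    /* Common prefix is equal: the shorter string sorts first. */
    return (alen > blen) - (alen < blen);
}
```

Taking explicit lengths instead of relying on a terminator also keeps the comparison 
correct for strings that contain zero bytes.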

-- Roberto

lua+Steven.Murdoch | 2 Dec 15:23 2002

Re: lua for unicode

> Just a remark: for UTF-8, the way Lua uses strlen, strcat, strcpy, etc.
> is OK, as UTF-8 strings cannot contain zeros.

The null character ('\0' in C) is represented in UTF-8 as a single zero 
byte. Will Lua handle this case correctly, since I think it permits strings 
with '\0' in them? However, as you state, UTF-8 does not permit a zero byte in 
any other circumstance.
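
To show the concern in isolation, here is a tiny sketch (has_embedded_zero is an 
invented name): any code path that finds the end of a string with strlen stops at the 
first zero byte, whereas a path that carries an explicit length sees the whole buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Report whether a buffer of known length contains a zero byte.
   If it does, strlen() on the same buffer would report a shorter length
   than len, so strlen-based routines would silently truncate it.
   Hypothetical helper for illustration only. */
int has_embedded_zero(const char *buf, size_t len)
{
    assert(buf != NULL);
    return memchr(buf, 0, len) != NULL;
}
```

For the five bytes "ab\0cd", has_embedded_zero returns 1 and strlen reports 2, so 
only interfaces that take a size, such as the luaL_loadbuffer mentioned earlier in 
the thread, would see all five.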

Steven Murdoch.

