Markus Kuhn | 1 Nov 2000 10:32

Re: Transliteration in wcrtomb()

Marcin 'Qrczak' Kowalczyk wrote on 2000-10-31 22:15 UTC:
> Mon, 30 Oct 2000 21:45:20 +0000, Markus Kuhn <Markus.Kuhn <at> cl.cam.ac.uk> writes:
> 
> > In my eyes, "ü" -> "ue" is just as much a valid and useful multibyte
> > encoding as UTF-8.
> 
> It is possible to apply transliteration when iconv did not do it,
> but it's impossible to undo transliteration made by iconv. So please
> don't force users of iconv to have transliteration.

You probably lost the context of what I was talking about. I was never
talking about iconv() or any other function not defined in ISO C99. I
was only talking about locale-dependent multibyte encoding as done by
wcrtomb() and all the other functions that are built on top of it
(printf("%ls"), wprintf(), etc.). Whether these functions should do
transliteration or not should in my opinion be the user's choice, via
selecting a locale that does or does not include transliteration, as
desired. iconv() is *NOT* locale dependent and its semantics are not
defined in relation to wcrtomb() in any way, and therefore it is
completely irrelevant here.
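
To make the locale dependence of wcrtomb() concrete, here is a minimal sketch. The helper name encode_in_locale is hypothetical and not part of any library; it only combines the standard setlocale() and wcrtomb() calls. It does not show transliteration, which is the behaviour being proposed in this thread rather than something the sketch can assume.

```c
#include <limits.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: encode one wide character using the multibyte
 * encoding of the named locale. Stores at most MB_LEN_MAX bytes in out
 * and returns the byte count, or (size_t)-1 if the locale cannot be
 * selected or cannot represent wc.
 * Note: this changes the LC_CTYPE locale of the whole process. */
size_t encode_in_locale(const char *locale_name, wchar_t wc,
                        char out[MB_LEN_MAX])
{
    mbstate_t state;

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return (size_t)-1;               /* locale not available */

    memset(&state, 0, sizeof state);     /* initial conversion state */
    return wcrtomb(out, wc, &state);     /* result depends on the locale */
}
```

The same wide character can therefore produce different byte sequences, or an error, depending only on which locale name is passed in. Which locale names exist is system specific.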

> When I want transliteration, I can easily do it myself (I've done it
> in Haskell); but when I need to know whether text can be unambiguously
> converted, I want to be able to get an error in other cases.

Yes, of course, iconv() will do all this and more for you and I never
ever said that this was a bad idea.

> > My proposal gives the programmer far more control and at the same
> > time far less special code that has to be added to applications.

Marcin 'Qrczak' Kowalczyk | 1 Nov 2000 16:40

Re: Transliteration in wcrtomb()

Wed, 01 Nov 2000 09:32:55 +0000, Markus Kuhn <Markus.Kuhn <at> cl.cam.ac.uk> writes:

> The user (not programmer!) does this by picking the right locale:

Does it work with iconv and nl_langinfo(CODESET), i.e. does
nl_langinfo(CODESET) include this information? I hope so.
I like transliteration if it's selectable by the user.

> I think, the external representation selectable by the user should
> include transliteration and it should be done in wcrtomb() if the
> user wants it.

As it has been pointed out, wcrtomb is not appropriate for conversion
from Unicode (or other known charset), because wchar_t can be any
encoding. Please don't tell me that iconv is no longer appropriate
for the conversion between Unicode and the local charset :-)
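
The combination of iconv and nl_langinfo(CODESET) mentioned above could look roughly like the following sketch. The helper name utf8_to_locale_charset is hypothetical; the calls it uses (iconv_open, iconv, iconv_close, nl_langinfo) are the POSIX ones. It assumes the caller has already selected a locale with setlocale(), and it does not answer whether the codeset name carries any transliteration information.

```c
#include <iconv.h>
#include <langinfo.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert the NUL-terminated UTF-8 string src into
 * the character set of the current LC_CTYPE locale. Returns the number
 * of bytes written to dst (which is NUL-terminated), or (size_t)-1 on
 * any error, including characters that iconv() refuses to convert. */
size_t utf8_to_locale_charset(const char *src, char *dst, size_t dstsize)
{
    iconv_t cd;
    char *in = (char *)src;     /* iconv() takes char ** for its input */
    char *out = dst;
    size_t inleft = strlen(src);
    size_t outleft;
    size_t rc;

    if (dstsize == 0)
        return (size_t)-1;
    outleft = dstsize - 1;      /* keep room for the terminating NUL */

    cd = iconv_open(nl_langinfo(CODESET), "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;      /* conversion not supported */

    rc = iconv(cd, &in, &inleft, &out, &outleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &out, &outleft);  /* flush shift state */
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    *out = '\0';
    return (size_t)(out - dst);
}
```

Because an unconvertible character makes iconv() fail here, the caller gets the error Marcin asks for and can decide separately whether to transliterate.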

-- 
 __("<  Marcin Kowalczyk * qrczak <at> knm.org.pl http://qrczak.ids.net.pl/
 \__/
  ^^                      SYGNATURA ZASTĘPCZA
QRCZAK

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/

Markus Kuhn | 1 Nov 2000 18:33

Re: wchar_t is Unicode

Marcin 'Qrczak' Kowalczyk wrote on 2000-11-01 15:40 UTC:
> As it has been pointed out, wcrtomb is not appropriate for conversion
> from Unicode (or other known charset), because wchar_t can be any
> encoding.

Just insert into your source code the three magic lines

  #ifndef __STDC_ISO_10646__
  #error "Error: wchar_t is not Unicode!"
  #endif

and after that, your environment will be miraculously guaranteed to have
only a Unicode wchar_t encoding. Portable programming can be that easy!

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


Michael Holzt | 2 Nov 2000 14:45

Article about Unicode on Linux/Unix

I'm about to write an article about Unicode on Linux/Unix for a German
magazine. The article is targeted toward the end user, so it will include
things like 'What is Unicode?' and 'Why should I care?' with few
programming details (maybe a few Perl snippets).

I wonder if you have any suggestions about topics which I shouldn't miss
in the article.

-- 
Greetings
Michael

etrapani | 2 Nov 2000 14:12

iconv output utf-8 -> utf-16, which one is wrong?


With the very same file, the iconv output is different under FreeBSD and
Linux (both on Intel PIII).  I don't want to find the cause of the
difference (version numbers would be helpful if that were the case);
rather, I want to know which output is the right one according to specs
or common sense or even both :).

Linux:

$ cat t.txt | iconv -f utf-8 -t utf-16 | hexdump
0000000 0501 0d01 1901 1701 2f01 6101 7301 6b01
0000010 1e20 1c20 7e01 0a00                    
0000018

FreeBSD:

$ cat t.txt | iconv -f utf-8 -t utf-16 | hexdump
0000000 fffe 0501 0d01 1901 1701 2f01 6101 7301
0000010 6b01 1e20 1c20 7e01 0a00               
000001a

Wprint (a PostScript filter for Netscape/Mozilla printing output) is
now, under FreeBSD, sending the "fffe" as a valid character because it
does not expect it.  Although it is easy to just skip it if it is
present, I would like to know if it should be present at all.

Edmund GRIMLEY EVANS | 2 Nov 2000 14:28

Re: iconv output utf-8 -> utf-16, which one is wrong?

etrapani <at> unesco.org.uy <etrapani <at> unesco.org.uy>:

> Wprint (a postscript filter for Netscape/Mozilla printing output) is
> now, under FreeBSD sending the "fffe" as a valid character because it
> does not expect it.  Although it is easy to just skip it if it is
> present I would like to know if it should be present at all.

U+FEFF is the BOM (Byte Order Mark) or ZERO WIDTH NO-BREAK SPACE. It
can in some circumstances be useful to have this at the beginning of a
file or datastream to distinguish big-endian UTF-16 from little-endian
UTF-16 (and from UTF-8, etc), however, it can also be harmful, so I
don't think iconv should be generating or interpreting BOMs by
default.
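
As an illustration of how an encoded U+FEFF distinguishes the encodings named above, here is a small sketch. The names bom_kind and detect_bom are hypothetical; only the byte patterns are fixed by the encodings themselves.

```c
#include <stddef.h>

/* Byte order signatures produced by encoding U+FEFF. */
enum bom_kind { BOM_NONE, BOM_UTF16_BE, BOM_UTF16_LE, BOM_UTF8 };

/* Hypothetical helper: classify the first bytes of a buffer.
 * UTF-32 signatures are not handled in this sketch. */
enum bom_kind detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;        /* U+FEFF encoded as UTF-8 */
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;    /* high byte first */
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;    /* low byte first */
    return BOM_NONE;            /* byte order must be known some other way */
}
```

A consumer that does not expect a BOM, like the print filter in the original question, would see BOM_NONE as the only unproblematic case; the others require skipping two or three bytes first.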

Should iconv perhaps have command-line arguments --bom-in and
--bom-out or something similar?

Edmund

Karlsson Kent - keka | 2 Nov 2000 14:54

RE: iconv output utf-8 -> utf-16, which one is wrong?


> -----Original Message-----
> From: etrapani <at> unesco.org.uy [mailto:etrapani <at> unesco.org.uy]
> Linux:
> 
> $ cat t.txt | iconv -f utf-8 -t utf-16 | hexdump
> 0000000 0501 0d01 1901 1701 2f01 6101 7301 6b01
> 0000010 1e20 1c20 7e01 0a00                    
> 0000018

This is UTF-16LE (little-endian serialisation of UTF-16).
It does *not* conform to 10646 (which only allows for
big-endian serialisations) but does conform to Unicode.

An initial U+FEFF in UTF-16LE (or UTF-16BE) is interpreted
as a character (ZWNBSP) and must be kept.

> FreeBSD:
> 
> $ cat t.txt | iconv -f utf-8 -t utf-16 | hexdump
> 0000000 fffe 0501 0d01 1901 1701 2f01 6101 7301
> 0000010 6b01 1e20 1c20 7e01 0a00               
> 000001a

This is UTF-16[with-byte-order-mark; little-endian],
assuming that there was no U+FEFF in the beginning of
the source file (if there was, this would be UTF-16LE).

The (optional if big-endian) byte-order-mark is to be
removed after detecting the byte order.

Mark Leisher | 2 Nov 2000 15:38

Re: iconv output utf-8 -> utf-16, which one is wrong?


    etrapani> With the very same file, the iconv output is different under
    etrapani> FreeBSD and Linux (both on Intel PIII).  I don't want to find
    etrapani> the cause of the difference (version numbers would be helpful
    etrapani> if that were the case), rather I want to know which output is
    etrapani> the right one according to specs or common sense or even both :).

Assuming the characters are being shown byte-swapped, the FreeBSD output is
correct. Byte Order Marks (BOMs) should be produced for all UTF-16 text.

This short segment of code shows how to determine if the text needs to be byte
swapped.

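  FILE *in;              /* assumed to be open on the UTF-16 text */
  int byte_swap = 0;
  unsigned short bom;    /* assumed to be 16 bits wide */

  /* Read the first code unit in host byte order. */
  if (fread(&bom, sizeof bom, 1, in) == 1)
    /* U+FEFF read with the wrong byte order shows up as 0xFFFE. */
    byte_swap = (bom == 0xfffe);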
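
A variant that does not depend on the byte order of the host can work on raw bytes instead. The following is only a sketch: the helper name utf16_bytes_to_units is hypothetical, and treating text without a BOM as big-endian is an assumption of this example, not something the code above does.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: turn UTF-16 bytes into 16-bit code units without
 * relying on the host byte order. A leading BOM selects the byte order
 * and is dropped; without one, big-endian is assumed (an assumption of
 * this sketch). A trailing odd byte is ignored. Returns the number of
 * code units stored in out, which needs room for len / 2 elements. */
size_t utf16_bytes_to_units(const unsigned char *buf, size_t len,
                            uint16_t *out)
{
    int little_endian = 0;
    size_t i = 0, n = 0;

    if (len >= 2) {
        if (buf[0] == 0xFF && buf[1] == 0xFE) {
            little_endian = 1;  /* BOM stored low byte first */
            i = 2;
        } else if (buf[0] == 0xFE && buf[1] == 0xFF) {
            i = 2;              /* BOM stored high byte first */
        }
    }
    for (; i + 1 < len; i += 2) {
        out[n++] = little_endian
            ? (uint16_t)(buf[i] | (buf[i + 1] << 8))
            : (uint16_t)((buf[i] << 8) | buf[i + 1]);
    }
    return n;
}
```

Composing each code unit from individual bytes avoids the question of how wide unsigned short is and which byte order fread() happens to deliver.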
-----------------------------------------------------------------------------
Mark Leisher
Computing Research Lab            Cinema, radio, television, magazines are a
New Mexico State University       school of inattention: people look without
Box 30001, Dept. 3CRL             seeing, listen without hearing.
Las Cruces, NM  88003                            -- Robert Bresson

Mark Leisher | 2 Nov 2000 15:47

Re: iconv output utf-8 -> utf-16, which one is wrong?


    Edmund> U+FEFF is the BOM (Byte Order Mark) or ZERO WIDTH NO-BREAK
    Edmund> SPACE. It can in some circumstances be useful to have this at the
    Edmund> beginning of a file or datastream to distinguish big-endian UTF-16
    Edmund> from little-endian UTF-16 (and from UTF-8, etc), however, it can
    Edmund> also be harmful, so I don't think iconv should be generating or
    Edmund> interpreting BOMs by default.

Without a BOM, there really is no way to tell if the text has been
byte-swapped.  We run into this all the time with text generated on Solaris
and Linux (on a little endian machine), and depend heavily on the existence of
the BOM to make the text readable.

There are times when it just gets in the way, like applications that don't
know about the BOM.

    Edmund> Should iconv perhaps have command-line arguments --bom-in and
    Edmund> --bom-out or something similar?

Maybe a single command line parameter to explicitly avoid generating a BOM,
but I think one should be generated by default.
-----------------------------------------------------------------------------
Mark Leisher
Computing Research Lab            Cinema, radio, television, magazines are a
New Mexico State University       school of inattention: people look without
Box 30001, Dept. 3CRL             seeing, listen without hearing.
Las Cruces, NM  88003                            -- Robert Bresson

Bruno Haible | 2 Nov 2000 18:08

Re: iconv output utf-8 -> utf-16, which one is wrong?

etrapani <at> unesco.org.uy writes:
> 
> With the very same file, the iconv output is different under FreeBSD and
> Linux (both on Intel PIII).

> Linux:
> 
> $ cat t.txt | iconv -f utf-8 -t utf-16 | hexdump
> 0000000 0501 0d01 1901 1701 2f01 6101 7301 6b01
> 0000010 1e20 1c20 7e01 0a00                    
> 0000018

This is big-endian UTF-16, without byte order mark. (Yes, Kent.
The "hexdump" utility on x86 systems displays 16-bit little-endian
words. Next time, please use
    hexdump -e '"%06.6_ax  " 16/1 "%02X "' -e '"  " 16/1 "%_p" "\n"'
instead of "hexdump".)
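
The effect of the 16-bit grouping can be reproduced by hand. In this sketch both function names are hypothetical; format_le_words imitates only the pairing and swapping that a default hexdump shows on a little-endian host, not its offsets or layout.

```c
#include <stddef.h>
#include <stdio.h>

/* Format bytes in file order, two hex digits each, separated by spaces.
 * out must hold at least 3 * len + 1 characters. */
void format_bytes(const unsigned char *buf, size_t len, char *out)
{
    size_t i;

    out[0] = '\0';
    for (i = 0; i < len; i++)
        out += sprintf(out, i ? " %02x" : "%02x", buf[i]);
}

/* Format the same bytes as 16-bit little-endian words, so the second
 * byte of each pair is printed first. A trailing odd byte is ignored.
 * out must hold at least 5 * (len / 2) + 1 characters. */
void format_le_words(const unsigned char *buf, size_t len, char *out)
{
    size_t i;

    out[0] = '\0';
    for (i = 0; i + 1 < len; i += 2)
        out += sprintf(out, i ? " %02x%02x" : "%02x%02x",
                       buf[i + 1], buf[i]);
}
```

Fed the file bytes FE FF 01 05, the word view prints them in the swapped order seen in the dumps quoted in this thread, which is why a byte-wise dump is the safer way to judge byte order.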

> FreeBSD:
> 
> $ cat t.txt | iconv -f utf-8 -t utf-16 | hexdump
> 0000000 fffe 0501 0d01 1901 1701 2f01 6101 7301
> 0000010 6b01 1e20 1c20 7e01 0a00               
> 000001a

This is big-endian UTF-16, with byte order mark.

And with glibc 2.1.96 you get:

$ cat t.txt | /glibc22/bin/iconv -f utf-8 -t utf-16 | hexdump

