Florian Weimer | 1 May 2000 19:25

Re: wcwidth() implementation

  Markus Kuhn <Markus.Kuhn <at> cl.cam.ac.uk> writes:

> Attached is my public domain implementation of the wcwidth() and
> wcswidth() functions. I hope you will find it useful for inclusion into
> glibc, xterm, etc. The function wcwidth() distinguishes between normal,
> wide, and combining characters, and wcswidth() can be used to predict
> how many columns a string sent to a terminal emulator such as xterm will
> occupy on the screen.

Hmm.  Your implementation restricts wide characters mainly to the
East Asian ranges.  But there are many other characters that can hardly
be displayed with a normal-width glyph, for example:

        ∰   U+2230   VOLUME INTEGRAL
        ⒛   U+249B   NUMBER TWENTY FULL STOP
        ⒨   U+24A8   PARENTHESIZED LATIN SMALL LETTER M
        ﬄ   U+FB04   LATIN SMALL LIGATURE FFL

I think wcwidth() could even return 3 for these characters and many
more. ;)
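
For reference, here is a minimal sketch of how the two functions relate,
assuming a POSIX C library that provides wcwidth(). The helper name
columns_needed() is hypothetical and is not taken from Markus's code; it
only mirrors what wcswidth() is described as doing above.

```c
#define _XOPEN_SOURCE 700   /* must come first: exposes wcwidth() in <wchar.h> */
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/*
 * Hypothetical helper: sum the per-character widths reported by the
 * C library's wcwidth() for a NUL-terminated wide string.
 * Returns -1 as soon as a character is non-printable in the current
 * locale, which is also how wcswidth() reports that case.
 */
int columns_needed(const wchar_t *s)
{
    assert(s != NULL);
    int total = 0;
    for (; *s != L'\0'; s++) {
        /* 0 = combining, 1 = normal, 2 = wide, -1 = non-printable */
        int w = wcwidth(*s);
        if (w < 0)
            return -1;
        total += w;
    }
    return total;
}
```

What the library reports for a character such as U+2230 depends on its
width tables and on the current locale, so the sketch makes no claim
about the values for the examples listed above.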
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/

Edmund GRIMLEY EVANS | 2 May 2000 15:32

Re: what shall we do about iconv?

About a month ago I wrote here that "there seems to be no sensible way
of implementing a function that converts data while reading it from a
stream and knows at the end how many non-reversible conversions
occurred". I tried contacting the Open Group about this, and I
received some replies from Andrew Josey.

The Group thinks that the specification is clear enough: iconv()
should return -1 whenever one of the conditions EILSEQ, E2BIG, EINVAL
or EBADF occurs. Application code is already reliant on this
behaviour, so it cannot be changed. Apparently the problem I pointed
out is real, but it would have to be solved using an alternate API
instead of iconv. I don't know whether anyone is likely to take any
concrete steps towards defining such an API, but I feel happier now
that I know what the situation is.

Just thought I'd register this for the archives ...

Edmund

Bruno Haible | 2 May 2000 16:01

Re: what shall we do about iconv?

Edmund GRIMLEY EVANS writes:
> About a month ago I wrote here that "there seems to be no sensible way
> of implementing a function that converts data while reading it from a
> stream and knows at the end how many non-reversible conversions
> occurred". I tried contacting the Open Group about this, and I
> received some replies from Andrew Josey.

Thanks for telling us; my clarification request about the same issue
has not been answered so far.

> The Group thinks that the specification is clear enough: iconv()
> should return -1 whenever one of the conditions EILSEQ, E2BIG, EINVAL
> or EBADF occurs. Application code is already reliant on this
> behaviour

Now this is clear. Fine.

What an application can do:

a. It can convert one character at a time. If it does this, it can decide
   itself about possible default behaviour, transliteration, special colouring
   of each misconverted character, or whatever.

b. Often, if EILSEQ or EINVAL occurs, the entire conversion is aborted, and
   it does not matter how many non-reversible character conversions were
   already made. So all the application has to protect against is E2BIG,
   and it can do so by doing the conversion into a temporary buffer first.

   So it has to do the conversion twice. But that's life in C. When you
   call strdup, it also has to scan the source string twice: once to
[...]
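
Option (b) above can be sketched roughly as follows. This is only an
illustration under stated assumptions: the helper name convert_whole()
and its parameters are hypothetical, and restarting from the beginning
whenever E2BIG occurs is one possible reading of "doing the conversion
into a temporary buffer first". Because the final pass is a single
successful iconv() call, its return value is the complete count of
non-reversible conversions.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: convert the whole input into a malloc'd
 * temporary buffer. On E2BIG the buffer is doubled and the conversion
 * is redone from the start. On success the caller owns the returned
 * buffer; on EILSEQ, EINVAL or allocation failure NULL is returned.
 */
char *convert_whole(const char *to, const char *from,
                    const char *in, size_t inlen,
                    size_t *outlen, size_t *irreversible)
{
    assert(in != NULL && outlen != NULL && irreversible != NULL);

    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return NULL;

    size_t cap = inlen ? inlen : 1;   /* deliberately small first guess */
    char *buf = NULL;
    for (;;) {
        char *grown = realloc(buf, cap);
        if (grown == NULL)
            break;
        buf = grown;

        char *src = (char *)in;       /* iconv() does not modify the input */
        size_t srcleft = inlen;
        char *dst = buf;
        size_t dstleft = cap;

        iconv(cd, NULL, NULL, NULL, NULL);   /* reset to the initial state */
        size_t n = iconv(cd, &src, &srcleft, &dst, &dstleft);
        if (n != (size_t)-1) {
            /* flush any pending shift sequence into the output */
            size_t m = iconv(cd, NULL, NULL, &dst, &dstleft);
            if (m != (size_t)-1) {
                *outlen = (size_t)(dst - buf);
                *irreversible = n + m;
                iconv_close(cd);
                return buf;
            }
        }
        if (errno != E2BIG)
            break;                    /* EILSEQ or EINVAL: abort entirely */
        cap *= 2;                     /* too small: redo the conversion */
    }
    free(buf);
    iconv_close(cd);
    return NULL;
}
```

A caller would copy the result to its real destination only after this
function has succeeded, so a partial conversion never reaches the output.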

Bruno Haible | 2 May 2000 16:57

Updated HOWTO


An updated version of the Linux Unicode-HOWTO is at
   ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html

- No longer recommends setting LANGUAGE=xx.UTF-8; back to LANGUAGE=xx.
  Instead, recommends the new gettext library, which converts messages on
  the fly. No more need to convert all the .mo files to UTF-8.
- Mentions Otfried Cheong's oc-unicode package instead of Mule-UCS.
- Mentions Edmund Grimley Evans' framebuffer console terminal emulator.
- Mentions Pango.
- Mentions Python's, Javascript's, Tcl's, Perl's Unicode support.
- libutf8 is in version 0.7.1.
- libiconv is in version 1.3.
- Remove dead "Multilingual Emacs and Unicode" URL.

Bruno

etrapani | 3 May 2000 22:55

wprint 1.01

From freshmeat
(http://www.freshmeat.net/appindex/2000/04/26/956785619.html):

WorldPrint is a filter for Netscape's postscript output that uses
TrueType fonts to allow the printing of pages written in Unicode, Big5,
SJIS, the ISO-8859* charsets (and maybe others). This does not require
Netscape to be able to render the full text on screen.

I thought you might be interested.

Bye, Eduardo.

PILCH Hartmut | 4 May 2000 08:05

Re: Printing and non-latin1 pages

> Proposals?  It would be neat to have a comment in the postscript file
> stating what was the original encoding of the page.  That way I could
> automatically convert the file without having the user specify the
> encoding.  That is, if we get the original strings, if we get UTF8 or
> UTF16 in the postscript file there would be no need for that.

Sounds great.

But I must be missing something here, because I just installed the Wadalab
Japanese Postscript fonts in the /usr/lib/ghostscript/ path, as is usually
done in Japanese distributions, and everything works without any
postprocessing filter.

We have the t1utils and ttf2pfb for perfectly converting TTF to PFB (type
1 postscript, compressed), and it should not be too difficult to create
some Unicode PFB files, and even CID files, which work with Ghostscript
>=5.5.  Ken Lunde has made CID versions of the Wadalab fonts available.

Even now, EUC-JP, SJIS and Latin-1 all print correctly under my
Netscape 4.6 without postprocessing, and once everything is unified to
UCS, things should be even more straightforward.

-phm


Bruno Haible | 4 May 2000 20:27

how to test the new glibc-2.2's UTF-8 locales


The new multibyte locales in the upcoming glibc-2.2 are now pretty much
working. For the adventurous among you who would like to discover how
neat it is, I'm appending a recipe for installing a new glibc snapshot
without shooting yourself in the foot.

What works: (I hope I missed nothing, Ulrich!)
  - Locales with multibyte encodings can be created.
  - iconv is now much more reliable.
  - The wc* and mb* family of functions, including fwprintf and fwscanf.
  - FILE streams: fopen("filename","r/w,ccs=ENCODING"), fpos_t now includes
    an mbstate_t.
  - strcoll, wcscoll have been completely rewritten.
  - nl_langinfo(CODESET) works.
  - gettext automatically converts the translations to the current locale's
    encoding.
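
As a small illustration of the nl_langinfo(CODESET) item, a program can
ask the C library which encoding the current locale uses. This is a
sketch only: the helper names is_utf8_codeset() and current_codeset()
are hypothetical, and the accepted spellings "UTF-8" and "UTF8" are an
assumption about what a given system reports.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdbool.h>
#include <strings.h>

/* Hypothetical helper: accept the common spellings of the codeset name. */
bool is_utf8_codeset(const char *name)
{
    assert(name != NULL);
    return strcasecmp(name, "UTF-8") == 0 || strcasecmp(name, "UTF8") == 0;
}

/*
 * Hypothetical helper: adopt the locale named by the environment
 * (LC_ALL, LC_CTYPE, LANG) and return its codeset name. If the
 * environment names an unavailable locale, setlocale() fails and the
 * previous locale stays in effect.
 */
const char *current_codeset(void)
{
    setlocale(LC_ALL, "");
    return nl_langinfo(CODESET);
}
```

An application could call is_utf8_codeset(current_codeset()) once at
startup to decide whether its output needs converting.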

What is still missing:
  - Transliteration of accented and special punctuation characters during
    a conversion UTF-8 -> ISO-8859-* or UTF-8 -> ASCII.
  - The wide character properties (wcwidth, iswupper, etc.) are very
    different from the tables of the Unicode consortium.
  - regexp is multibyte aware, but still only handles ISO-8859-1 characters
    correctly.
  - There is no UTF-7 support in iconv.

Bruno

Installation instructions
=========================
[...]

Robert Brady | 4 May 2000 20:30

Bengali

Are there any readers of the Bengali script on this mailing list? Does
anyone know anyone who does?

-- 
Robert


M.K.Laha | 5 May 2000 05:59

Need info on Linux & Bengali

> Date:  Thu, 04 May 00 13:49PM CDT  
> From:  Robert Brady <rwb197 <at> ecs.soton.ac.uk>  
> To:  linux-utf8 <at> nl.linux.org  
> Subject:  Bengali  
>  
> Are there any readers of the Bengali script on this mailing list? Does
> anyone know anyone who does?
> 
> -- 
> Robert

Hi!

I read and write in the Bengali script. I would love to be able to
do that using Linux. My text processor is groff. I'd appreciate any
directions as to how I could augment groff to do Bengali. I am
prepared to help with the implementation, too.

- Manas Laha

Ulrich Drepper | 4 May 2000 20:57

Re: how to test the new glibc-2.2's UTF-8 locales

Bruno Haible <haible <at> ilog.fr> writes:

>   - The wide character properties (wcwidth, iswupper, etc.) are very
>     different from the tables of the Unicode consortium.

This is an issue of the locale data files since the information is
coming from these files.

>   - There is no UTF-7 support in iconv.

I currently don't think that promoting this ill-designed encoding is
useful.  Let it die.  Don't tell anybody that it ever existed.

> - Sources:
>   - glibc CVS sources, instructions are at http://sourceware.cygnus.com/glibc/
>     (remember to use "cvs -z 9" to save network bandwidth)

Please use -z3.  The actual differences in the amount of data are
minimal but you help the server.

-- 
---------------.      drepper at gnu.org  ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \    ,-------------------'   \  Sunnyvale, CA 94089 USA
Red Hat          `--' drepper at redhat.com   `------------------------