Doug Ewell | 2 Jul 00:58

Solving the UTF-8 problem

This is intentionally cross-posted to LTRU and ietf-languages, since it 
deals with both implementation policy and proposed changes to RFC 4646bis 
and 4645bis.

CE Whitehead <cewcathar at hotmail dot com> wrote on ietf-languages:

> I want to update the 1694acad comments field to include a transliteration 
> into Basic Latin (also--perhaps???--to fix the inconsistency as 4eme is 
> missing the accent grave on the e!!):
>
> Comments: 17th century French, as catalogued in the "Dictionnaire de 
> l'acad&#xE9;mie fran&#xE7;oise" ("l'academie francoise"), 4&#xE8;me (4eme) 
> ed. 1694; frequently includes elements of Middle French, as this is a 
> transitional period.

I really, really don't like the direction this is headed.  Ultimately we 
will find ourselves having to provide duplicate Description and Comments 
content for every non-ASCII character in the Language Subtag Registry, 
removing most of the advantage of being able to represent non-ASCII in the 
first place.

What are we going to do when the ISO 639-3 code list is finalized and we 
have to deal with adding the following pairs of languages, whose names 
differ only by diacritical marks?

aru  Arua
arx  Aruá

bfa  Bari
mot  Barí
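
To make the collision concrete, here is a minimal Python sketch (not part of any registry tooling; the function names are illustrative assumptions) showing that a naive ASCII fallback, which drops diacritical marks, makes the names in each pair above indistinguishable:

```python
import unicodedata


def strip_diacritics(name: str) -> str:
    """Return name with combining marks removed (a lossy ASCII-style fallback)."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def find_collisions(entries):
    """Group codes whose names become identical once diacritics are stripped."""
    seen = {}
    for code, name in entries:
        seen.setdefault(strip_diacritics(name), []).append(code)
    return {name: codes for name, codes in seen.items() if len(codes) > 1}


# Code/name pairs quoted from the message above.
pairs = [("aru", "Arua"), ("arx", "Aruá"), ("bfa", "Bari"), ("mot", "Barí")]

collisions = find_collisions(pairs)
print(collisions)  # {'Arua': ['aru', 'arx'], 'Bari': ['bfa', 'mot']}
```

Any transliterated Description would therefore need some extra distinguishing text, which is the duplication this message objects to.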

John Cowan | 2 Jul 01:28

Re: [Ltru] Solving the UTF-8 problem

Doug Ewell scripsit:

> 1.  UTF-8 doesn't play well with e-mail, which is invaluable for 
> discussing changes on the ietf-languages list and sending the changes to 
> IANA (stated by several).

I note for the record that your message arrived encoded in Latin-1
but tagged "utf-8".

-- 
We pledge allegiance to the penguin             John Cowan
and to the intellectual property regime         cowan <at> ccil.org
for which he stands, one world under            http://www.ccil.org/~cowan
Linux, with free music and open source
software for all.               --Julian Dibbell on Brazil, edited
Doug Ewell | 2 Jul 01:50

Re: [Ltru] Solving the UTF-8 problem

John Cowan <cowan at ccil dot org> wrote:

>> 1.  UTF-8 doesn't play well with e-mail, which is invaluable for 
>> discussing changes on the ietf-languages list and sending the changes to 
>> IANA (stated by several).
>
> I note for the record that your message arrived encoded in Latin-1 but 
> tagged "utf-8".

(For reference: Microsoft Outlook Express 6.00.2900.2180.)

When I saved the sent message as a text file and then opened it in a hex 
editor, I saw the UTF-8 sequences.  So it may have been changed somewhere 
along the way.  Maybe I should have included a non-1252 character somewhere.
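
The hex-editor check can also be done programmatically. The following is a rough Python sketch, assuming the raw bytes of the saved message are available; the function name and labels are illustrative, not taken from any mail tool:

```python
def classify_bytes(raw: bytes) -> str:
    """Roughly classify raw bytes as 'ascii', 'utf-8', or 'not-utf-8'."""
    if raw.isascii():
        return "ascii"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # A lone byte such as 0xE1 (Latin-1 "á") is not a valid UTF-8 sequence.
        return "not-utf-8"
    return "utf-8"


as_utf8 = "Aruá".encode("utf-8")      # bytes 41 72 75 c3 a1
as_latin1 = "Aruá".encode("latin-1")  # bytes 41 72 75 e1

print(classify_bytes(as_utf8))    # utf-8
print(classify_bytes(as_latin1))  # not-utf-8
```

A result of "not-utf-8" on a message labelled utf-8 would match the symptom described above, though it cannot say where along the way the bytes were changed.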

In any case, I don't dispute that e-mail support for UTF-8, whether in 
clients or gateways, is unreliable.  I suggested an approach to get around 
that problem.

--
Doug Ewell  *  Fullerton, California, USA  *  RFC 4645  *  UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages

Re: Solving the UTF-8 problem

On Sun, Jul 01, 2007 at 03:58:48PM -0700,
 Doug Ewell <dewell <at> roadrunner.com> wrote 
 a message of 161 lines which said:

> Another possibility is to have IANA post an official version of the
> Registry in one encoding, such as UTF-8, and additional, unofficial
> versions in other encodings, such as Latin-1 or hex NCRs.

Why not? Currently, we do exactly the opposite: IANA publishes the
official registry in hex NCR
(http://www.iana.org/assignments/language-subtag-registry) and
langtag.net publishes an unofficial version in UTF-8
(http://www.langtag.net/registries/language-subtag-registry.utf8).
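
As an illustration only (this is not the program actually used by IANA or langtag.net, and the function names are assumptions), converting between the two forms takes a few lines of Python. The sketch handles only hexadecimal NCRs of the form `&#xHHHH;`:

```python
import re

# Matches hexadecimal numeric character references such as "&#xE9;".
_HEX_NCR = re.compile(r"&#x([0-9A-Fa-f]+);")


def ncr_to_text(s: str) -> str:
    """Replace each hex NCR with the character it denotes."""
    return _HEX_NCR.sub(lambda m: chr(int(m.group(1), 16)), s)


def text_to_ncr(s: str) -> str:
    """Replace each non-ASCII character with an uppercase hex NCR."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in s)


sample = "l'acad&#xE9;mie fran&#xE7;oise"
decoded = ncr_to_text(sample)
print(decoded)               # l'académie françoise
print(text_to_ncr(decoded))  # l'acad&#xE9;mie fran&#xE7;oise
```

The round trip is exact here only because the sample uses uppercase hex digits with no leading zeros; input in another NCR style would come back normalized rather than byte-identical.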

> Potential problems with this approach are unintentional mismatches
> between the versions (I caught one of these problems for the ISO
> 639-3 people recently)

I do not get it. If the unofficial version is produced by a program,
how can a mismatch exist (unless there is a bug in the program)? 

And if the unofficial version is done by hand, should we tell ISO
639-3 that computers are better than people for boring and repetitive
tasks? 
CE Whitehead | 2 Jul 17:03

Solving the UTF-8 problem; was Language Tag Modification 1694acad;

Hi, I'm unsure whether people who have only Thai Windows (do they exist?? 
I thought so) can see all Latin-1 characters properly.

Also, I think adding a comment to ASCII transliterations that overlap 
would not be too much trouble.

(But this is needed only if there are people anywhere on earth whose 
operating systems cannot display these characters.)

Finally, I do not think it is too much work to provide a few 
transliterations of non-ASCII characters, given all the other stuff that 
goes into language subtag entries.

And I do not mind whether it is the official registry in ASCII and the 
unofficial in UTF-8 or the other way around.

Thanks.

(I've put all this again below for people who like the context.)

--C. E. Whitehead

Doug Ewell dewell at roadrunner.com
Mon Jul 2 00:58:48 CEST 2007

>What are we going to do when the ISO 639-3 code list is finalized and we 
>have to deal with adding the following pairs of languages, whose names 
>differ only by diacritical marks?

>aru  Arua

CE Whitehead | 2 Jul 17:09

Archive Entry for baku1926

Hi, I tried to do an archive entry for baku1926, but as I never heard back 
from Reshat as to what references he wanted included in it, I did a rough 
entry and then a collection of extra resources; I leave it up to Doug and 
the list what--if anything--to do to complete the archive.

--C. E. Whitehead

(NOTE:  Sorry, Notepad puts an extra character sequence in UTF-8 files, and 
for some reason I was unable to edit the Russian characters in the Yahoo 
file manager, so I had to leave the Notepad character sequence there!!!  I 
used an escape / in the closing tag; this may or may not hide the 
unnecessary characters from view.)
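
Assuming the extra sequence mentioned in the note above is the UTF-8 byte order mark (bytes EF BB BF) that Notepad writes at the start of UTF-8 files, a short Python sketch can remove it. The function name is illustrative:

```python
import codecs


def strip_utf8_bom(raw: bytes) -> bytes:
    """Return raw without a leading UTF-8 byte order mark, if one is present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


with_bom = codecs.BOM_UTF8 + "Баку".encode("utf-8")
print(strip_utf8_bom(with_bom).decode("utf-8"))  # Баку

# Alternatively, the "utf-8-sig" codec skips the mark while decoding.
print(with_bom.decode("utf-8-sig"))  # Баку
```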

Archived entry form at:
http://www.geocities.com/quaiouestenglish/ietftemp/archiveentrybaku1926.html

* * *
Additional resources listed at:

http://www.geocities.com/quaiouestenglish/ietftemp/additionalresourcesbaku1926.html

& also below:

ADDITIONAL RESOURCES which Reshat has not confirmed
(from discussion):

ГРАНДЕ, Б.  (1934). «Унификация алфавитов».  In 

Re: Solving the UTF-8 problem

On Mon, Jul 02, 2007 at 08:19:17AM -0700,
 Doug Ewell <dewell <at> roadrunner.com> wrote 
 a message of 26 lines which said:

> 1. The human who runs the conversion program forgets to run it at
> the appropriate time.
> 
> 2. The human runs it against an outdated source file.
> 
> 3. The program runs correctly, but the new file is not copied or
> FTP'd to the appropriate location.
> 
> etc.  Notice that these scenarios tend to involve human error.

OK, I do not have a silver bullet to eliminate *all* errors but our
field certainly has long experience with automated processes. For
instance, issue 1) should be addressed by using a program, not a
human, to run the conversion (humans forget, software schedulers do
not).

Yes, cron can crash, the disk can be full, programs have bugs, etc,
but more 21st century techniques, involving software, could certainly
help to address some of these issues.
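
As a rough sketch of what such automation could look like (the file names, function names, and the NCR-to-UTF-8 conversion step are all assumptions for illustration, not an existing tool), a scheduled job could regenerate the derived file and report whether the published copy was stale:

```python
import re
from pathlib import Path

_HEX_NCR = re.compile(r"&#x([0-9A-Fa-f]+);")


def convert(text: str) -> str:
    """Hypothetical conversion step: expand hex NCRs into real characters."""
    return _HEX_NCR.sub(lambda m: chr(int(m.group(1), 16)), text)


def derived_is_current(source: Path, derived: Path) -> bool:
    """True if derived exists and equals a fresh conversion of source."""
    if not derived.exists():
        return False
    expected = convert(source.read_text(encoding="ascii"))
    return derived.read_text(encoding="utf-8") == expected


def refresh(source: Path, derived: Path) -> bool:
    """Rewrite derived if it is missing or stale; return True if rewritten."""
    if derived_is_current(source, derived):
        return False
    derived.write_text(convert(source.read_text(encoding="ascii")),
                       encoding="utf-8")
    return True
```

Run from a scheduler, `refresh` covers scenarios 1 and 2 because it always converts the current source; scenario 3 would still need a separate check that the copy on the server matches the local one.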
Doug Ewell | 2 Jul 17:19

Re: Solving the UTF-8 problem

(LTRU removed from recipient list)

Stephane Bortzmeyer <bortzmeyer at nic dot fr> wrote:

>> Potential problems with this approach are unintentional mismatches 
>> between the versions (I caught one of these problems for the ISO 639-3 
>> people recently)
>
> I do not get it. If the unofficial version is produced by a program, how 
> can a mismatch exist (unless there is a bug in the program)?

1. The human who runs the conversion program forgets to run it at the 
appropriate time.

2. The human runs it against an outdated source file.

3. The program runs correctly, but the new file is not copied or FTP'd to 
the appropriate location.

etc.  Notice that these scenarios tend to involve human error.

--
Doug Ewell  *  Fullerton, California, USA  *  RFC 4645  *  UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages

Re: Solving the UTF-8 problem

On Sun, Jul 01, 2007 at 03:58:48PM -0700,
 Doug Ewell <dewell <at> roadrunner.com> wrote 
 a message of 161 lines which said:

> 3.  UTF-8 can't be read on some, especially older, computer systems
> (Frank Ellermann, months ago, and CE Whitehead).
> 
> With the continuing adoption of Unicode by OS and software vendors,
> I really can't get behind this argument.

Sorry but UTF-8 adoption is far from ubiquitous. Many tools still have
problems with UTF-8. I discovered today that ht://Dig, one of the two
most common free search engines, has no UTF-8 support at all (see
http://www.htdig.org/FAQ.html#q4.27 and
http://www.htdig.org/FAQ.html#q4.10), which is quite sad for a Web
search engine (and, yes, the explanations they give are wrong, too).

Another common example is the PostScript tool a2ps.

> It simply isn't appropriate to "dumb down" all computerized text to
> match the least capable systems that might be running somewhere.

I understand the reasoning and, yes, switching the registry to UTF-8
might be one more signal to software developers that they really
should upgrade, but do not claim that everything is done yet.

So, I basically agree that UTF-8 for the registry is better but I do
not want to see bold sentences like "Anyone but Frank Ellermann can
run a full UTF-8 environment by now". This is not true.

Michael Everson | 2 Jul 22:41

Re: Solving the UTF-8 problem

At 22:15 +0200 2007-07-02, Stephane Bortzmeyer wrote:

>So, I basically agree that UTF-8 for the registry is better but I do
>not want to see bold sentences like "Anyone but Frank Ellermann can
>run a full UTF-8 environment by now". This is not true.

It seems to me, frankly, that anyone who wants to use THIS registry 
can manage UTF-8.
-- 
Michael Everson * http://www.evertype.com
