Tim Bray | 1 Feb 07:18

Re: New draft (Was: I-D ACTION:draft-klensin-unicode-escapes-00.txt

On 1/31/07, John C Klensin <john-ietf <at> jck.com> wrote:

> While I think I agree with you about your second proposed
> paragraph above ("New protocols..."), I think my instructions
> with this document are to keep it narrow and to focus on escapes,
> not on general advice to protocol designers about Unicode or
> internationalization more broadly.   So I don't want to go so
> far as to make specific (or even specific-sounding) suggestions.

I hadn't read 2277 in years; having done so, I think that it says what
I was trying to say quite effectively.  De facto, this spec is really
only usable for text (in the 2277 sense) when internationalizing
existing protocols.

> This effort, and some others, have convinced me that we are
> getting closer to the time at which RFC 2277/ BCP 18 needs to be
> reopened, reviewed, and updated, but this document isn't the
> right place to do it, at least IMO.

Really? Having just re-read that, I found little to disagree with or
want to change.  My pain point is 2223, but everyone knows that. -Tim

John C Klensin | 1 Feb 17:08

Re: New draft (Was: I-D ACTION:draft-klensin-unicode-escapes-00.txt


--On Wednesday, 31 January, 2007 22:18 -0800 Tim Bray
<tbray <at> textuality.com> wrote:

> On 1/31/07, John C Klensin <john-ietf <at> jck.com> wrote:
> 
>> While I think I agree with you about your second proposed
>> paragraph above ("New protocols..."), I think my instructions
>> with this document are to keep it narrow and to focus on
>> escapes, not on general advice to protocol designers about
>> Unicode or internationalization more broadly.   So I don't
>> want to go so far as to make specific (or even
>> specific-sounding) suggestions.
> 
> I hadn't read 2277 in years; having done so, I think that it
> says what
> I was trying to say quite effectively.  De facto, this spec is
> really
> only usable for text (in the 2277 sense) when
> internationalizing existing protocols.

That is more or less what the text says now... watch for -02
probably sometime next week.

>> This effort, and some others, have convinced me that we are
>> getting closer to the time at which RFC 2277/ BCP 18 needs to
>> be reopened, reviewed, and updated, but this document isn't
>> the right place to do it, at least IMO.
> 
> Really? Having just re-read that, I found little to disagree

Clive D.W. Feather | 2 Feb 12:46

Re: New draft (Was: I-D ACTION:draft-klensin-unicode-escapes-00.txt

>> In section 5.2 it is said HTML uses the &#xNNNN; form and that
>> this form has a clear terminator. This is not really false but
>> HTML allows to omit the terminator if it is not needed, for
>> example <p>Bj&#xf6rn</p> is also valid. I would suggest to
>> mention only XML or note that HTML's mechanism is similar to
>> that of XML.

If you check the HTML specification (section 5.3), it says that SGML
allows the semicolon to be omitted in certain contexts, but the authors
"strongly suggest" not doing that.

In particular, the example
    <p>Bj&#xf6rn</p>
is not valid because the reference falls in the middle of a word. A permitted
case would be:
    <p>Bj&#xf6</p>
where the tag-open symbol < ends the reference.
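
The role of the terminator can be sketched in a few lines of Python. This is
only an illustration, assuming a strict reader that insists on the semicolon;
the name decode_strict is hypothetical and comes from no specification.

```python
import re

# Strict form: "&#x", one or more hex digits, and a mandatory ";".
STRICT_REF = re.compile(r"&#x([0-9A-Fa-f]+);")

def decode_strict(text: str) -> str:
    """Replace only semicolon-terminated hexadecimal character references."""
    return STRICT_REF.sub(lambda m: chr(int(m.group(1), 16)), text)

# With the terminator the digits are unambiguous.
terminated = decode_strict("<p>Bj&#xf6;rn</p>")

# Without it, this strict reader leaves the text untouched. A lenient
# reader faces a real ambiguity in a string such as "&#xf6ad": it could
# mean U+00F6 followed by "ad", or the single code point U+F6AD.
unterminated = decode_strict("<p>Bj&#xf6rn</p>")
```

Requiring the semicolon keeps the decoder independent of whatever character
happens to follow the reference.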

-- 
Clive D.W. Feather  | Work:  <clive <at> demon.net>   | Tel:    +44 20 8495 6138
Internet Expert     | Home:  <clive <at> davros.org>  | Fax:    +44 870 051 9937
Demon Internet      | WWW: http://www.davros.org | Mobile: +44 7973 377646
THUS plc            |                            |

Clive D.W. Feather | 2 Feb 12:38

Re: New draft (Was: I-D ACTION:draft-klensin-unicode-escapes-00.txt

John C Klensin said:
> I've just submitted draft-klensin-unicode-escapes-01.txt and
> assume it will show up in the posting directory today or
> tomorrow.  

Some comments for you.

* In 1.1, rather than saying that Unicode occupies "two or more octets",
wouldn't it be better to say "21 bits - rather than the 7 bits of ASCII -"?

* Somewhere in the last two paragraphs of 1.1 you should be talking about
mini-languages (e.g. Cosmogol) as well as protocols and UIs.

* In 3, you're inconsistent between "U+NNN[N[N]]" and "NNN...". Indeed,
shouldn't the former actually be "U+[[N]N]NNNN"? (Note both the order and
the number of Ns.) I would suggest that better wording might be:

    ... U+NN syntax for code point references specified in the Unicode
    Standard, where NN is between four and six hexadecimal digits.

* In 4, second bullet, "string terminators" should be "string delimiters".

* In 5.2, you've said "generally considered ugly and awkward" but I'm not
aware of anyone else who's made that complaint.

* In 6 you need to copy in all the security stuff from Unicode; the stuff
that says that you must use shortest-form UTF-8 (so not using %xC1.81 for
'A') because of the problems of filters and firewalls not spotting longer
forms.
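
The shortest-form point can be checked with a small Python sketch. It assumes
nothing beyond Python's built-in strict UTF-8 decoder; the helper name
is_valid_utf8 is hypothetical. The two-octet sequence C1 81 carries the same
seven payload bits as the single octet 41 ('A'), which is exactly the kind of
longer form a filter might fail to spot.

```python
# Overlong (non-shortest) two-octet form of U+0041 'A':
# 110_00001 10_000001 carries the payload bits 1000001 = 0x41.
OVERLONG_A = bytes([0xC1, 0x81])

def is_valid_utf8(data: bytes) -> bool:
    """Return True only if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")  # the strict decoder rejects overlong forms
        return True
    except UnicodeDecodeError:
        return False

shortest_ok = is_valid_utf8(b"A")        # single octet 0x41
overlong_ok = is_valid_utf8(OVERLONG_A)  # rejected
```

A filter that inspects raw octets for 0x41 would miss the two-octet form, so
the decoder has to refuse it outright.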


Re: New draft (Was: I-D ACTION:draft-klensin-unicode-escapes-00.txt

On Fri, Feb 02, 2007 at 11:38:53AM +0000,
 Clive D.W. Feather <clive <at> demon.net> wrote 
 a message of 35 lines which said:

> * Somewhere in the last two paragraphs of 1.1 you should be talking
> about mini-languages (e.g. Cosmogol) as well as protocols and UIs.

This is clearly a difficult issue since RFC 2277 is clear on:

* text carried by a protocol (i18n is necessary)
* protocol elements (i18n is optional)

but does not mention formats, mini-languages and so on. RFC 4234 is a
good example of a format whose i18n rules are unclear.

Frank Ellermann | 2 Feb 14:30

I-D.klensin-unicode-escapes (was: New Draft)

Clive D.W. Feather wrote:

> In 3, you're inconsistent between "U+NNN[N[N]]" and "NNN...". Indeed,
> shouldn't the former actually be "U+[[N]N]NNNN"? (Note both the order
> and the number of Ns.)

+1

>     ... U+NN syntax for code point references specified in the Unicode
>     Standard, where NN is between four and six hexadecimal digits.

No, folks could misinterpret U+NN as "anything up to 6 digits".
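
To make the digit-count question concrete, here is a minimal Python sketch of
the "four to six hexadecimal digits" reading. The function names are
hypothetical, and the choice to accept only uppercase digits is an assumption
made for the example, not something the draft states.

```python
import re

# "U+" followed by four to six uppercase hex digits; range checked separately.
U_NOTATION = re.compile(r"U\+([0-9A-F]{4,6})")

def format_code_point(cp: int) -> str:
    """Render a code point as U+NNNN, padding to at least four digits."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return f"U+{cp:04X}"

def parse_code_point(ref: str) -> int:
    """Parse a four- to six-digit reference, rejecting short forms like U+F6."""
    m = U_NOTATION.fullmatch(ref)
    if m is None:
        raise ValueError(f"not a U+ reference: {ref!r}")
    cp = int(m.group(1), 16)
    if cp > 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return cp
```

Under this reading U+00F6 and U+1D11E are accepted, while U+F6 is rejected
rather than silently padded.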

> In 5.2, you've said "generally considered ugly and awkward" but I'm
> not aware of anyone else who's made that complaint.

+1  Obviously John hates it, that would justify "often".  Others don't
like backslash-U for various reasons, not only ugly and awkward, also
confusing (due to various conventions), unclear (lack of delimiter),
and a royal PITA in conjunction with <quoted-string>, when it results
in multiple backslashes.  "Harmful" is worse than "ugly and awkward".

> In 6 you need to copy in all the security stuff from Unicode

IMO not "all", folks are supposed to know RFC 3629, it's a STD.  So far
all attacks on the "three steps" model fortunately failed, STD is STD.

Frank


Frank Ellermann | 2 Feb 14:05

I-D.klensin-unicode-escapes (was: New Draft)

Clive D.W. Feather wrote:

> If you check the HTML specification (section 5.3), it says that SGML
> allows the semicolon to be omitted in certain contexts, but the authors
> "strongly suggest" not doing that.

Yes, never ever mention that HTML exists, it's horrible.  The [Charmod]
bible requires no (SGML) nonsense in http://www.w3.org/TR/charmod/#C044

The I-D should IMO adopt and cite [Charmod] C042 up to C048 verbatim.

A few other conformance criteria in [Charmod] might be also interesting:
http://www.w3.org/TR/charmod/#C070  Don't exclude arbitrary code points
http://www.w3.org/TR/charmod/#C077  Don't allow anything above U+10FFFF
http://www.w3.org/TR/charmod/#C078  Don't (ab)use surrogates
http://www.w3.org/TR/charmod/#C079  Don't (ab)use non-characters

http://www.w3.org/TR/charmod/#C015  n/a (covered by the better RFC 2277)
http://www.w3.org/TR/charmod/#C016  n/a (covered by the better RFC 2277)
http://www.w3.org/TR/charmod/#C017  Stick to working encoding rules
http://www.w3.org/TR/charmod/#C018  n/a (covered by the better RFC 2277)

http://www.w3.org/TR/charmod/#C049  n/a (for the I-D US-ASCII is given)
http://www.w3.org/TR/charmod/#C026  n/a (covered by the better RFC 2277)

Etc.  The "better RFC 2277" idea is a single default UTF-8, instead of a
choice between UTF-8, UTF-16, UTF-16LE, UTF-16BE, UTF-32, UTF-32LE, and
UTF-32BE in [Charmod], let alone hypothetical UTF-32 "2143" or "3412".

Probably the I-D should mention that one famous exception from its rule

Frank Ellermann | 2 Feb 15:07

ABNF (was: New draft)

Stephane Bortzmeyer wrote:

> RFC 4234 is a good example of a format whose i18n rules are unclear.

| NOTE:
|
|     ABNF strings are case-insensitive and the character set for these
|     strings is us-ascii.

That's clear.  A tricky part could be <name> in chapter 2.2, because...

|     rulename       =  ALPHA *(ALPHA / DIGIT / "-")

...in chapter 4 could be interpreted as different from <name>.  Now I've
found a typo in 4234 chapter 2.4:

   although Appendix A (Core) provides definitions for a 7-bit US-ASCII
   environment as has been common to much of the Internet.

It's Appendix B, not Appendix A (Acknowledgements).  Chapter 4 is based
on Appendix B, a <rulename> is clearly ASCII.  IMO RFC 4234 is fine, its
LWSP is an exception, FWS as in RFC 2822 (excl. obs-FWS) would be better.

Unlike an utterly dubious variant in RFCs 2068, 2069, 2616, 2617, and 2831,
where nobody sees the potential damage caused by <LWS> hidden in a #rule.
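
As a rough illustration of why <rulename> is ASCII-only, the production can be
transcribed into a Python regular expression. This is a sketch, assuming ALPHA
and DIGIT mean what the core rules say (ASCII letters and ASCII digits); the
helper name is_rulename is hypothetical.

```python
import re

# rulename = ALPHA *(ALPHA / DIGIT / "-"), with ALPHA = A-Z / a-z and
# DIGIT = 0-9 as in the core rules.
RULENAME = re.compile(r"[A-Za-z][A-Za-z0-9-]*")

def is_rulename(name: str) -> bool:
    """True if name matches the <rulename> production."""
    return RULENAME.fullmatch(name) is not None
```

With this transcription "quoted-string" is a rulename, while a name starting
with a digit or containing a non-ASCII letter is not.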

Frank

Clive D.W. Feather | 2 Feb 19:25

Re: New draft (Was: I-D ACTION:draft-klensin-unicode-escapes-00.txt

John C Klensin said:
> 	* I have not touched the ABNF associated with the \u /
> 	\U case.  I have inserted an explicit placeholder but,
> 	as discussed on this list, I think we need to figure out
> 	what we want to do and then go back and adjust the
> 	metalanguage productions.   In particular, there has
> 	been one strong suggestion, with which I agree, that we
> 	not take the obvious approach of substituting %x5C.75
> 	for "\u", since the intent is a character string
> 	abstraction (independent of the implementation character
> 	set) rather than specific octets.

I asked Paul Overell about this and got the following answer (in part):

>        "By separating external encoding from the syntax, it is intended
>        that alternate encoding environments can be used for the same
>        syntax."
>
> So although "\" means us-ascii %x5C, the same ABNF may still be used to
> specify the syntax of strings expressed in a different character set by
> specifying the mapping between %x5C and the encoding used, but that is
> outside the scope of ABNF.

In other words, you write the ABNF as if the target encoding were ASCII, but
then state somewhere in the document that other encodings may be used and
the ABNF is meant to represent the abstract characters, not specific octet
values.

Or else we either drop ABNF (wrong, I think) or state explicitly that the
notation is "ABNF except for case-sensitivity".
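
The case-sensitivity problem can be shown with a small Python sketch. It
models the two ABNF spellings with regular expressions, which is an assumption
made for illustration only; the names QUOTED_LITERAL, NUMERIC_FORM and accepts
are hypothetical.

```python
import re

# Quoted literal "\u": ABNF quoted strings are case-insensitive,
# so this form also accepts "\U".
QUOTED_LITERAL = re.compile(re.escape("\\u"), re.IGNORECASE)

# Numeric form %x5C.75: exactly these two values, read here as ASCII
# backslash followed by lowercase "u".
NUMERIC_FORM = re.compile(re.escape(chr(0x5C) + chr(0x75)))

def accepts(pattern: re.Pattern, text: str) -> bool:
    """True if the whole of text matches the given form."""
    return pattern.fullmatch(text) is not None

lower, upper = "\\u", "\\U"
results = {
    "quoted/lower": accepts(QUOTED_LITERAL, lower),
    "quoted/upper": accepts(QUOTED_LITERAL, upper),
    "numeric/lower": accepts(NUMERIC_FORM, lower),
    "numeric/upper": accepts(NUMERIC_FORM, upper),
}
```

Only the numeric form tells \u and \U apart, which is why the quoted form
cannot be used unchanged if the two escapes are to mean different things.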

Clive D.W. Feather | 2 Feb 19:47

Re: I-D.klensin-unicode-escapes (was: New Draft)

Frank Ellermann said:
> The I-D should IMO adopt and cite [Charmod] C042 up to C048 verbatim.

C042 would require &#x1234; rather than allowing us to invent \u'1234'.

> A few other conformance criteria in [Charmod] might be also interesting:
> http://www.w3.org/TR/charmod/#C070  Don't exclude arbitrary code points
> http://www.w3.org/TR/charmod/#C077  Don't allow anything above U+10FFFF
> http://www.w3.org/TR/charmod/#C078  Don't (ab)use surrogates
> http://www.w3.org/TR/charmod/#C079  Don't (ab)use non-characters

Those are worth including, I think.
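
For what it is worth, the last three of those restrictions fit in one small
check. This is a sketch in Python, not text proposed for the draft; the
function name is_escapable_code_point is hypothetical, and the ranges are the
surrogate and noncharacter ranges defined by the Unicode Standard.

```python
def is_escapable_code_point(cp: int) -> bool:
    """Hypothetical check combining the range, surrogate and
    noncharacter restrictions."""
    if not 0 <= cp <= 0x10FFFF:      # nothing above U+10FFFF
        return False
    if 0xD800 <= cp <= 0xDFFF:       # surrogate code points
        return False
    if 0xFDD0 <= cp <= 0xFDEF:       # noncharacters U+FDD0..U+FDEF
        return False
    if (cp & 0xFFFE) == 0xFFFE:      # U+nFFFE and U+nFFFF in every plane
        return False
    return True                      # everything else stays allowed
```

Everything not caught by one of the tests is accepted, in keeping with the
first criterion about not excluding arbitrary code points.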

-- 
Clive D.W. Feather  | Work:  <clive <at> demon.net>   | Tel:    +44 20 8495 6138
Internet Expert     | Home:  <clive <at> davros.org>  | Fax:    +44 870 051 9937
Demon Internet      | WWW: http://www.davros.org | Mobile: +44 7973 377646
THUS plc            |                            |

