Simon Josefsson | 8 Oct 15:58

BOM in draft-hoffman-stringprep-07

(If there is a more appropriate mailing list for stringprep, let me know.)

I see that draft-hoffman-stringprep-07 maps U+FEFF (ZWNBSP/BOM) to
nothing (older versions did too) and also prohibits the character in
the output.  Why?

U+FEFF used as a byte order mark has its uses in some Unicode
transformation formats (UTF-16 and UTF-32), and I don't see stringprep
requiring the use of UTF-8, where I would agree that it makes sense to
prohibit it.

My interpretation is that any protocol that uses UTF-16 and UTF-32 in
byte order independent mode will thus have to modify the stringprep
tables, to have BOM signatures work, before they can use stringprep.
Only UTF-8 and byte order tagged UTF-16 and UTF-32 can use stringprep
tables as is.  Is this a correct interpretation?  In that case, the
text in section 5 should make it more clear that it is allowed to use
partial tables (right now it says all or some of the tables may be
used), and also note that byte order independent UTF-16 and UTF-32
applications need to remove U+FEFF from the tables.  An alternate
solution would be to only define stringprep for UTF-8.

Thanks.

Kent Karlsson | 8 Oct 18:08

RE: BOM in draft-hoffman-stringprep-07


The double use of U+FEFF as BOM and ZWNBSP has been disunified.
U+FEFF is now (from Unicode 3.2) used only as BOM, though it
retains the name ZWNBSP (spelled out in full, of course...).  The
function of "true" ZWNBSP has been taken over by U+2060 WORD JOINER.

A BOM is not part of any text, only of some byte serialisation of that text.
Stringprep is applied to text, not to the (byte) serialisation of the text,
which is a lower level, regardless of whether UTF-8, UTF-16, UTF-32 or
something else is used as the processing code.  After stringprep, the text
can be serialised again, so (byte-oriented) protocols can use a BOM with
(byte-serialised) UTF-16 or UTF-32 (or even UTF-8) if desired.
(Though I would prefer fixing the byte order, where possible, over ever
using a BOM.)

So as long as it is ok to remove WORD JOINER (which is what non-BOM
uses of U+FEFF should be turned into), ZWNBSP should be removed too.

	/Kent Karlsson
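
The layering described in this message can be sketched in Python. This is only an illustration under stated assumptions: it uses the standard-library `stringprep` module, which implements the tables of the published RFC 3454 rather than draft -07, and the sample string and variable names are made up for the example.

```python
import codecs
import stringprep

# Hypothetical wire data: big-endian UTF-16 with a leading BOM, plus a
# stray U+FEFF inside the text itself.
wire = codecs.BOM_UTF16_BE + "a\ufeffb".encode("utf-16-be")

# Deserialise: the "utf-16" codec consumes the leading BOM to pick the byte
# order, so the BOM never becomes part of the text. The inner U+FEFF stays.
text = wire.decode("utf-16")

# Text-level step: drop every character in table B.1 ("commonly mapped to
# nothing"), which contains both U+FEFF and U+2060 WORD JOINER.
mapped = "".join(ch for ch in text if not stringprep.in_table_b1(ch))

# Reserialise: at this lower level the protocol is free to emit a BOM again.
rewire = mapped.encode("utf-16")

print(repr(text), repr(mapped), rewire[:2] == codecs.BOM_UTF16)
```

The mapping step only ever sees decoded text, so removing U+FEFF there does not stop a byte-oriented protocol from writing a BOM when it serialises the result.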

> -----Original Message-----
> From: owner-idn <at> ops.ietf.org 
> [mailto:owner-idn <at> ops.ietf.org]On Behalf Of
> Simon Josefsson
> Sent: den 8 oktober 2002 15:59
> To: idn <at> ops.ietf.org
> Subject: [idn] BOM in draft-hoffman-stringprep-07
> 
> 
> (If there is a more appropriate mailing list for stringprep, 

Soobok Lee | 13 Oct 08:56

length restrictions on IDN label

[ When I read the IDNA draft today, I still couldn't find
  an answer in it to the following question about IDN label length.
  If the following issue is already addressed in the draft, please correct me. ]

 I have a punycode label of length 63 octets:
  L1: zq--o39AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

 L2=ToUnicode(L1) produces: U+AC00 x 56 times ( Hangul "KA" repeated 56 times)

  L2:
U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 
U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 
U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 
U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 
U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 
U+AC00 U+AC00 U+AC00 U+AC00 U+AC00 U+AC00

 But this L2 can be encoded in various unicode/legacy encodings into
  various lengths of octets:

  UTF8 : 3 x 56 = 168 octets
  UCS2 : 2 x 56 = 112 octets
  UCS4 : 4 x 56 = 224 octets
  KSX1001/EUC-KR : 2 x 56 = 112 octets 

 These encodings produce labels longer than 63 octets.
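
The arithmetic above can be checked with a short Python sketch. Assumptions: UCS2 and UCS4 are approximated by Python's `utf-16-be` and `utf-32-be` codecs (equivalent for this BMP-only string), the Punycode comes from Python's built-in `punycode` codec, and the 4-octet ACE prefix is the `zq--` used in this posting.

```python
label = "\uac00" * 56  # U+AC00 repeated 56 times, as in L2 above

# Octet length of the same code point sequence in each encoding.
octets = {
    name: len(label.encode(codec))
    for name, codec in [
        ("UTF8", "utf-8"),
        ("UCS2", "utf-16-be"),   # no BOM; same as UCS-2 for BMP characters
        ("UCS4", "utf-32-be"),   # no BOM
        ("EUC-KR", "euc_kr"),
    ]
}
for name, n in octets.items():
    print(f"{name}: {n} octets")

# ACE form: prefix (4 octets, "zq--" in this posting) plus the Punycode part.
ace_len = len("zq--") + len(label.encode("punycode"))
print("ACE:", ace_len, "octets")
```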

 Moreover, each ACE label of a valid (<256 octets) ACE-form FQDN IDN may be
  converted into a valid UTF8 label of under 63 octets, while the cumulative sum
  of the lengths of the UTF8 labels of the FQDN IDN may exceed 256 octets 

JFC (Jefsey) Morfin | 13 Oct 11:18

ITU position.

This will probably lead to the major controversy I tried to avoid in coming 
in here.
Are you sure you do not want to discuss babelization before we must fight it?
jfc

cf. http://www.itu.int/newsroom/pp02/Highlights/1010.html

ITU stakes claim on Internet names

The Working Group of the Plenary approved a resolution on the management
of internationalized multilingual domain names. This resolution aims at
promoting the role of Member States in the internationalization of domain
names and addresses of their respective languages. This will be increasingly
important in the coming years as a majority of Internet users is expected
to prefer to conduct online activities in their own language. It is also
important because the current Domain Names System (DNS) mapping does not
reflect the growing language needs of all. The resolution stresses that
the registration and allocation of Internet domain names and addresses
should reflect the geographical and functional nature of the Internet with
an equitable balance of interests of all stakeholders. It also emphasizes
that the non-discriminatory access to Internet domain names and addresses,
and more generally to the Internet, should be available to all citizens
and that the management of Internet Domain Names and addresses is of
concern to both governments and the private sector. It also reiterates
the need to fully maintain country code numbering plans and addresses as
in ITU-T recommendation E.164 which defines the international public
telecommunication numbering plan. Under this resolution, ITU will be
providing assistance to its Member States to promote the use of their
languages for domain names and addresses and will cooperate with the
World Intellectual Property Organization whose role includes protection

Soobok Lee | 13 Oct 17:30

another question about IDN label length


 L1 is a utf8-encoded IDN label.

 L2=nameprep(L1), L2 is still in utf8.

 A1=ToASCII(L2)

 Function length(X) returns the octet length of the string X
 in unknown encoding and charset.

 My question in my previous posting was:
   in the case of length(A1)<=63 and length(L2)>63,
    is L2 a valid IDN label conforming to DNS label length
     restrictions?

 My second question is:

   if length(L2) <= 63 < length(L1),
     is L1 a valid IDN label?

   if length(L1) <= 63 < length(L2),
     is L2 a valid IDN label?

 I can repeat similar questions  by changing utf8 to
   other charset encodings, which will further complicate
   the problem.
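
That length(L1) and length(L2) can differ in either direction can be seen with a small Python sketch. Assumptions: it relies on `encodings.idna.nameprep` from the Python standard library, which follows the published nameprep profile rather than the draft under discussion, and the two sample characters are chosen only for illustration.

```python
from encodings.idna import nameprep


def utf8_len(s: str) -> int:
    """Octet length of s when encoded as UTF-8."""
    return len(s.encode("utf-8"))


# U+FF21 FULLWIDTH LATIN CAPITAL LETTER A shrinks under nameprep;
# U+2167 ROMAN NUMERAL EIGHT grows.
for l1 in ["\uff21", "\u2167"]:
    l2 = nameprep(l1)
    print(f"U+{ord(l1):04X}: length(L1)={utf8_len(l1)}, "
          f"length(L2)={utf8_len(l2)}, L2={l2!r}")
```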

 Some applications may exchange non-ASCII-form IDNs instead 
 of ACE-form IDNs after negotiation. A non-ASCII-form IDN
 may be used, stored, transferred, and processed as 

Paul Hoffman / IMC | 13 Oct 18:39

Re: length restrictions on IDN label

At 3:56 PM +0900 10/13/02, Soobok Lee wrote:
>[ When i read IDNA draft today, I still can't find
>   the answer from it for the following question about IDN label length.
>  If the following issue is already addressed in the draft, please 
>correct me. ]

It is indeed covered in the draft. The input to IDNA is code points, 
not encoded characters. As you point out, different encodings give 
different lengths for the same string. The only lengths that matter 
are those that are already in STD 13.

>  Many internet applications impose/assumes  the 63-octets-limit of 
>label lengths.
>  IF this assumption is violated, the label will be regarded as invalid
>  labels, and produce unpredictable errors by some implementations.

Which Internet applications are you speaking of? Which encodings are 
they using? As you pointed out, different encodings give different 
lengths. Thus, no sensible application could assume a 63-octet length 
if it deals with different encodings.

>  From an implementer's point of view, a more precise specification is needed
>  about whether an IDN label/FQDN has *NEW* length restrictions in various
>  char encodings, if IDNA tries to extend the repertoire of allowable
>  characters.

It seems likely that most implementers can understand that they must 
continue to follow the same rules that they always have for the 
length of domain names and labels.


Soobok Lee | 14 Oct 03:54

Re: length restrictions on IDN label

On Sun, Oct 13, 2002 at 09:39:42AM -0700, Paul Hoffman / IMC wrote:
> At 3:56 PM +0900 10/13/02, Soobok Lee wrote:
> >[ When i read IDNA draft today, I still can't find
> >  the answer from it for the following question about IDN label length.
> > If the following issue is already addressed in the draft, please 
> >correct me. ]
> 
> It is indeed covered in the draft. The input to IDNA is code points, 
> not encoded characters. As you point out, different encodings give 
> different lengths for the same string. The only lengths that matter 
> are those that are already in STD 13.
> 
> > Many internet applications impose/assumes  the 63-octets-limit of 
> >label lengths.
> > IF this assumption is violated, the label will be regarded as invalid
> > labels, and produce unpredictable errors by some implementations.
> 
> Which Internet applications are you speaking of? Which encodings are 
> they using? As you pointed out, different encodings give different 
> lengths. Thus, no sensible application could assume a 63-octet length 
> if it deals with different encodings.

UTF8, EUC-KR, etc. are all ASCII-compatible encodings/charsets.
Applications don't need to give up or modify the old 63-octet restriction for
LDH labels even in UTF8 or EUC-KR, because those encodings produce
the same octet string as pure ASCII encoding does. That is,
in those ASCII-compatible encodings of LDH chars, the number of code points and
the number of octets are equal, while they are not equal in encodings of
non-LDH chars like Hangul and CJK letters (the octet length is doubled or tripled).
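
A minimal Python sketch of this point, assuming the built-in `ascii`, `utf-8` and `euc_kr` codecs; the helper name `octets_per_encoding` and both sample labels are hypothetical.

```python
def octets_per_encoding(label, codec_names=("ascii", "utf-8", "euc_kr")):
    """Return {codec: octet length}, or None where the label cannot be encoded."""
    lengths = {}
    for codec in codec_names:
        try:
            lengths[codec] = len(label.encode(codec))
        except UnicodeEncodeError:
            lengths[codec] = None
    return lengths


ldh = "example-1"        # letters, digits, hyphen only: 9 code points
hangul = "\ud55c\uae00"  # two Hangul syllables: 2 code points

print(len(ldh), octets_per_encoding(ldh))
print(len(hangul), octets_per_encoding(hangul))
```

For the LDH label every codec gives the same count as the number of code points; for the Hangul label the octet count is two or three times the number of code points and depends on the codec.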


Paul Hoffman / IMC | 14 Oct 04:16

Re: length restrictions on IDN label

At 10:54 AM +0900 10/14/02, Soobok Lee wrote:
>On Sun, Oct 13, 2002 at 09:39:42AM -0700, Paul Hoffman / IMC wrote:
>> It seems likely that most implementers can understand that they must
>> continue to follow the same rules that they always have for the
>> length of domain names and labels.
>
>The unit of length restriction matters: # of code points or # of octets ?
>That should be made clearer. RFC1035 uses "octets", not a 
>character/code point.

Exactly right. It seems like you have answered your own question!

In case I'm missing something, where in any of the IDN documents does 
it indicate that the length restrictions are in number of code points 
instead of number of octets?

--Paul Hoffman, Director
--Internet Mail Consortium

Soobok Lee | 14 Oct 04:43

Re: length restrictions on IDN label


 Then,
  is U+AC00 x 56 times (in my previous posting) a valid label
  conforming to RFC1035?
   And is its equivalent ACE label (of 63 octets) a valid label?

   Are UTF8-encoded IDN labels not governed by the RFC1035 length restrictions?
   Does IDNA contain brand new length restrictions for 8bit labels which
   obsolete RFC1035?

   Soobok Lee

Paul Hoffman / IMC wrote:

> At 10:54 AM +0900 10/14/02, Soobok Lee wrote:
>
>> On Sun, Oct 13, 2002 at 09:39:42AM -0700, Paul Hoffman / IMC wrote:
>>> It seems likely that most implementers can understand that they must
>>> continue to follow the same rules that they always have for the
>>> length of domain names and labels.
>>
>> The unit of length restriction matters: # of code points or # of
>> octets ?
>> That should be made clearer. RFC1035 uses "octets", not a
>> character/code point.
>
> Exactly right. It seems like you have answered your own question!

Paul Hoffman / IMC | 14 Oct 04:47

Re: length restrictions on IDN label

At 11:43 AM +0900 10/14/02, Soobok Lee wrote:
>Then,
>  U+AC00 x 56 times (in my previous posting)  is a valid label 
>conforming to RFC1035 ?
>   and its equivalent ACE label (of  63 octets ) is a valid label ?

If it follows the rules for ToASCII, yes.

>   UTF8-encoded IDN labels are not governed by RFC1035 length restrictions ?

There is no such thing. IDN labels are always encoded in ASCII 
following the rules of STD 13, just as it says in the draft.

>   IDNA contains  brand new length restrictions for 8bit labels 
>which obsoletes RFC1035 ?

No. Where in the draft do you see such "brand new length restrictions"?

The current goal of the WG is to fix any unclear statements in the 
document. We can't do that if you don't say exactly what text you 
find unclear.

--Paul Hoffman, Director
--Internet Mail Consortium
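
The rule stated in this reply, that the only length that counts is the octet length of the ASCII form, can be sketched in Python. Assumptions: it uses `encodings.idna.ToASCII` from the Python standard library, which implements the published IDNA with the `xn--` prefix rather than the `zq--` prefix of the drafts (both are 4 octets), and the helper name `fits_in_dns_label` is hypothetical.

```python
from encodings.idna import ToASCII


def fits_in_dns_label(label: str) -> bool:
    """True if the label's ASCII (ACE) form is 1..63 octets long."""
    try:
        ace = ToASCII(label)  # raises UnicodeError for empty or over-long results
    except UnicodeError:
        return False
    return len(ace) <= 63


print(fits_in_dns_label("\uac00" * 56))  # ACE form is exactly 63 octets
print(fits_in_dns_label("\uac00" * 57))  # ACE form would be 64 octets
```

The 168-octet UTF8 form of the 56-syllable label never enters the check; only the ACE form is measured against the 63-octet limit.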

