Mark Davis | 1 Apr 18:08

Error in IDN examples for testing

One of our engineers took a look at the IDN samples for testing and found a
problem in the data. Apparently some of the UTF-8 is illegal (it uses two
three-byte sequences for a supplementary character).

Mark

========

Hi Mark,
The following are the offending members of the structure in Appendix A of
draft-josefsson-idn-test-vectors. The data claiming to be UTF-8 is actually
CESU-8.

struct stringprep
{
char *comment;
char *in;
char *out;
char *profile;
int flags;
int rc;
}
strprep[] =
{
..........
{
"Surrogate code U+DF42",
"\xED\xBD\x82", NULL, "Nameprep", 0,
STRINGPREP_CONTAINS_PROHIBITED
},
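
To make the UTF-8 versus CESU-8 distinction concrete, here is a minimal
sketch in Python 3. The helper names cesu8_encode and is_legal_utf8 are
hypothetical and are not part of the draft or of any library; the sketch
only assumes that Python's strict "utf-8" decoder rejects encoded
surrogates, which is what makes the quoted bytes illegal as UTF-8.

```python
def cesu8_encode(text):
    """Encode text as CESU-8: a supplementary character becomes a surrogate
    pair, and each surrogate is written as its own three-byte sequence."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            cp -= 0x10000
            pair = (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF))
            for surrogate in pair:
                # "surrogatepass" lets Python emit the otherwise illegal bytes.
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


def is_legal_utf8(data):
    """Return True if data decodes under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


supplementary = "\U00010000"
print(supplementary.encode("utf-8").hex())         # four-byte UTF-8 form
print(cesu8_encode(supplementary).hex())           # two three-byte sequences
print(is_legal_utf8(cesu8_encode(supplementary)))  # strict decoder rejects it
print(is_legal_utf8(b"\xED\xBD\x82"))              # bytes from the entry above
```

A strict decoder is therefore enough to flag both the CESU-8 pairs and the
lone encoded surrogate in the quoted entry.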

Martin Duerst | 7 Apr 21:07

Challenge: longest UTF-8 with valid domain name

Here is a little challenge that might end up in Simon's test suite
and be otherwise valuable for testing:

What is the longest (in terms of bytes) internationalized
domain name (when encoded as UTF-8)? Obviously because there
are characters that are ignored by nameprep, we would have to
ask for the longest one after ToUnicode.

The problem applies both to single labels as well as to
FQDNs.

Regards,    Martin.

Adam M. Costello | 7 Apr 23:42

Re: Challenge: longest UTF-8 with valid domain name

Martin Duerst <duerst <at> w3.org> wrote:

> What is the longest (in terms of bytes) internationalized domain name
> (when encoded as UTF-8)?
>
> Obviously because there are characters that are ignored by nameprep,
> we would have to ask for the longest one after ToUnicode.

You mean the longest after nameprep.  The output of ToUnicode is not
necessarily nameprepped (because ToUnicode leaves its input untouched if
the input is not an ACE).

> The problem applies both to single labels as well as to FQDNs.

I think the longest nameprepped label is 224 bytes in UTF-8.  Take any
code point in the range 10000..55931 (hex) and repeat it 56 times.  The
ACE form will be 63 characters.  The nameprepped non-ACE form in UTF-8
will be 56*4 = 224 bytes.

As for the longest IDN, I think that would have four labels, three
of which are maximal, and one of which is just shy of maximal (62
characters in the ACE).  With the four length bytes, that hits the DNS
limit of 255 bytes.

The longest UTF-8 representation of that name (with nameprepped labels)
would use ideographic full stops for dots, and would include the
trailing dot.  So it would be (3*56+55)*4 + 4*3 = 904 bytes.

AMC
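
The arithmetic above can be checked with a short Python 3 sketch. It uses
only the standard "punycode" codec and deliberately skips Nameprep, so the
choice of U+20000 as the repeated code point is an assumption made for
illustration (any code point in the stated range should behave the same
way). The helper name ace_length is hypothetical.

```python
ACE_PREFIX = "xn--"
IDEOGRAPHIC_FULL_STOP = "\u3002"   # three bytes in UTF-8


def ace_length(label):
    """Length of the ACE form: the prefix plus the Punycode-encoded label."""
    return len(ACE_PREFIX) + len(label.encode("punycode"))


cp = "\U00020000"                  # assumed sample from the range 10000..55931
maximal = cp * 56                  # label whose ACE form is 63 characters
shy = cp * 55                      # label whose ACE form is 62 characters

print(ace_length(maximal), len(maximal.encode("utf-8")))
print(ace_length(shy), len(shy.encode("utf-8")))

# Four labels separated by ideographic full stops, plus the trailing dot.
labels = [maximal, maximal, maximal, shy]
name = IDEOGRAPHIC_FULL_STOP.join(labels) + IDEOGRAPHIC_FULL_STOP

# Label bytes plus one length byte per label, as counted in the message.
print(sum(ace_length(label) + 1 for label in labels))
print(len(name.encode("utf-8")))
```

A 57th repetition pushes the ACE form to 64 characters, which is why 56 is
the maximum for a single label.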


jim | 8 Apr 00:38

Question about a ToUnicode step


Hello,

I'm trying to understand the purpose behind some steps in 
ToUnicode. From the IDNA spec:

1. If all code points in the sequence are in the ASCII range (0..7F)
   then skip to step 3.

2. Perform the steps specified in [NAMEPREP] and fail if there is an
   error. (If step 3 of ToAscii is also performed here, it will not
   affect the overall behavior of ToUnicode, but it is not
   necessary.) The AllowUnassigned flag is used in [NAMEPREP].

3. Verify that the sequence begins with the ACE prefix, and save a
   copy of the sequence.

I'm curious about steps 1 & 2. I don't understand why nameprep
is being applied to ASCII domain labels. This seems to be pointless 
in situations where the input string is 8-bit ASCII. Wouldn't
a simple dns character compatibility check in place of steps 1 
and 2 suffice?

Regards,
Jim Mathies

Adam M. Costello | 8 Apr 01:06

Re: Question about a ToUnicode step

jim <at> mathies.com wrote:

> I'm trying to understand the purpose behind some steps in 
> ToUnicode.  From the IDNA spec:
> 
> 
> 1. If all code points in the sequence are in the ASCII range (0..7F)
>    then skip to step 3.
> 
> 2. Perform the steps specified in [NAMEPREP] and fail if there is an
>    error. (If step 3 of ToAscii is also performed here, it will not
>    affect the overall behavior of ToUnicode, but it is not
>    necessary.) The AllowUnassigned flag is used in [NAMEPREP].
> 
> 3. Verify that the sequence begins with the ACE prefix, and save a
>    copy of the sequence.
> 
> 
> I'm curious about steps 1 & 2.  I don't understand why nameprep
> is being applied to ASCII domain labels.

Nameprep is *not* being applied to ASCII labels.  That's what step 1
does, it prevents nameprep from being applied to ASCII labels.

> Wouldn't a simple dns character compatibility check in place of steps
> 1 and 2 suffice?

ToUnicode is not only intended to be applied to DNS-compatible labels.
It is intended to be applied to any internationalized label.
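
As an illustration of the point about step 1, here is a rough Python 3
sketch of the three quoted steps. It leans on encodings.idna.nameprep from
the Python standard library, which is one particular Nameprep
implementation and assumes query strings; the function name
to_unicode_steps_1_to_3 is hypothetical, and the remaining ToUnicode steps
are omitted.

```python
import encodings.idna

ACE_PREFIX = "xn--"


def to_unicode_steps_1_to_3(label):
    """Return (label_after_step_2, has_ace_prefix) for the quoted steps."""
    # Step 1: an all-ASCII label skips Nameprep entirely.
    if not all(ord(ch) <= 0x7F for ch in label):
        # Step 2: Nameprep; raises UnicodeError on prohibited input.
        label = encodings.idna.nameprep(label)
    # Step 3: check for the ACE prefix, ignoring case.
    has_ace_prefix = label[:len(ACE_PREFIX)].lower() == ACE_PREFIX
    return label, has_ace_prefix


print(to_unicode_steps_1_to_3("Example"))        # ASCII: left untouched
print(to_unicode_steps_1_to_3("B\u00fccher"))    # non-ASCII: nameprepped
print(to_unicode_steps_1_to_3("xn--bcher-kva"))  # ASCII with the ACE prefix
```

Note that the ASCII label keeps its capital letter, because Nameprep (and
its case folding) is never reached for it.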


Adam M. Costello | 24 Apr 22:45

ToUnicode output can be longer than input

The IDNA spec contains an incidental statement that was intended to be
helpful, in section 4.2:

    The ToUnicode output never contains more code points than its input.

Oops, that's not true, because Nameprep can cause strings to expand.
For example, consider the input:

x n - - fi fi - a ffl u e n t - s o u ffl - viii - u i c

The spaces are not really there, they just indicate the clusters, which
represent single code points (ligatures and roman numerals: U+FB01,
U+FB04, U+2177).  That's 24 code points.

ToUnicode would apply Nameprep (which expands the ligatures and roman
numerals to their ASCII equivalents), then apply the Punycode decoder,
yielding:

fifi-affluent-soufflé-viii

(For the Latin-1 impaired, the non-ASCII character is e with an acute
accent.)  That's 26 code points.  26 > 24.

So the statement needs to be removed or altered if/when the RFC is
revised.  It would be correct to say that the Punycode decoder cannot
output more code points than it inputs, but Nameprep can, and therefore
ToUnicode can.

AMC
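
The expansion can be reproduced with the ToUnicode implementation that
ships in Python 3's encodings.idna module. This is only a sketch using one
implementation, not a statement about every IDNA library; the variable
names are made up for the example.

```python
import encodings.idna

# The 24-code-point input: ligatures U+FB01 and U+FB04, roman numeral U+2177.
label_in = "xn--\uFB01\uFB01-a\uFB04uent-sou\uFB04-\u2177-uic"

# Nameprep alone already expands the compatibility characters to ASCII.
prepared = encodings.idna.nameprep(label_in)

# Full ToUnicode: Nameprep, then Punycode decoding of the ACE label.
label_out = encodings.idna.ToUnicode(label_in)

print(len(label_in), repr(label_in))
print(len(prepared), prepared)
print(len(label_out), label_out)

# The Punycode decoder by itself does not expand its input.
punycode_in = prepared[len("xn--"):].encode("ascii")
print(len(punycode_in), len(punycode_in.decode("punycode")))
```

The growth comes entirely from the Nameprep step; the Punycode decoding
step shortens the string again, but not by enough to get back to 24.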


Edmon Chung | 25 Apr 15:20

Re: ToUnicode output can be longer than input

Hi Adam,

----- Original Message -----
From: "Adam M. Costello" <idn.amc+0 <at> nicemice.net.RemoveThisWord>
> For example, consider the input:
>
> x n - - fi fi - a ffl u e n t - s o u ffl - viii - u i c
>
> The spaces are not really there, they just indicate the clusters, which
> represent single code points (ligatures and roman numerals: U+FB01,
> U+FB04, U+2177).  That's 24 code points.

If I counted it correctly, there are 33 "codepoints" in the above ACE
string. (I agree with your assessment, however; please see below. But the
example doesn't seem to illustrate your point...)

> The IDNA spec contains an incidental statement that was intended to be
> helpful, in section 4.2:
>
>     The ToUnicode output never contains more code points than its input.
>
> Oops, that's not true, because Nameprep can cause strings to expand.

I can understand this possibility.
Basically, if the length of the Unicode composition for one or more
characters in the string is longer than the ACE composition, and the total
excess for all the characters within the string is more than 4 (compensating
for the "xn--"), then the ToUnicode output will be longer than the input.

> So the statement needs to be removed or altered if/when the RFC is

Adam M. Costello | 25 Apr 21:30

Re: ToUnicode output can be longer than input

Edmon Chung <edmon <at> neteka.com> wrote:

> > x n - - fi fi - a ffl u e n t - s o u ffl - viii - u i c
> >
> > The spaces are not really there, they just indicate the clusters, which
> > represent single code points (ligatures and roman numerals: U+FB01,
> > U+FB04, U+2177).  That's 24 code points.
> 
> If I counted it correctly, there are 33 "codepoints" in the above ACE
> string.

fi represents one code point (U+FB01), ffl represents one code point
(U+FB04), and viii represents one code point (U+2177).  Now if you count
again, you should count 24.  I'm trying to describe a non-ASCII ACE
string containing 24 code points, some of which are ASCII and some of
which are compatibility characters.

AMC

Erik Nordmark | 27 Apr 01:21

Re: ToUnicode output can be longer than input

> So the statement needs to be removed or altered if/when the RFC is
> revised.  It would be correct to say that the Punycode decoder cannot
> output more code points than it inputs, but Nameprep can, and therefore
> ToUnicode can.

The RFC editor maintains an errata page. It would make sense to send them
email asking them to add this to their page.

  Erik

Martin v. Löwis | 27 Apr 10:10

Re: implementations list

Marc Blanchet <Marc.Blanchet <at> viagenie.qc.ca> writes:

> I would like to rework the http://www.i-d-n.net (which is very outdated, my
> apologies) to reflect the status of the idn work and to start collecting
> information on implementations.

Python 2.3 implements IDNA. Here is the record in the format you are
requesting.

Name: Python 2.3b1
Purpose: library
Programming language: Python
Url: http://www.python.org/2.3/
     http://www.python.org/dev/doc/devel/lib/module-encodings.idna.html
Description: Unicode strings are transparently accepted as host names
 in the socket, ftplib, httplib, and urllib libraries. For conversions
 from ACE, the "idna" codec is provided. The implementation assumes
 query strings, and UseSTD3ASCIIRules is false. Along with the IDNA
 implementation comes a "punycode" codec and a stringprep module.

Regards,
Martin
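
For readers who want to try the codecs named in the record, here is a
small sketch. It assumes a current Python 3 interpreter rather than the
2.3b1 release described above, so string literals and return types differ
from what that release would show; the sample name "bücher.example" is an
assumption chosen for illustration.

```python
import stringprep

name = "b\u00fccher.example"

# The "idna" codec converts each label to and from its ACE form.
ace = name.encode("idna")
print(ace)
print(ace.decode("idna"))

# The "punycode" codec works on a single label and adds no ACE prefix.
print("b\u00fccher".encode("punycode"))

# The stringprep module exposes the RFC 3454 tables as predicates;
# table B.1 lists characters that are commonly mapped to nothing.
print(stringprep.in_table_b1("\u00ad"))
```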

