Simon Josefsson | 4 Jan 2011 21:23

IDNA2008 test vectors

Has anyone produced test vectors for IDNA2008 that they could share?

/Simon
Mark Davis ☕ | 5 Jan 2011 00:43

Re: IDNA2008 test vectors

For the next version of UTS46, the UTC is introducing additional fields in its test files that indicate whether the strings are valid in IDNA2008, so you could take a look at those (currently in draft state).

http://www.unicode.org/review/#pri177


You would skip the lines starting with T, since those are only relevant for transitional implementations. If you use the files and find any issues, let me know and I can funnel the feedback back to the UTC.
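
As a rough illustration of skipping the transitional-only lines, the sketch below filters a test file in Python. It assumes a layout the message does not spell out: semicolon-separated fields, a type code (such as T) in the first field, and "#" starting a comment. The function name and the sample lines are hypothetical, not taken from the draft files.

```python
from typing import Iterable, Iterator, List


def nontransitional_cases(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield the fields of every test line whose type is not 'T'.

    Assumed layout (not confirmed by the message): fields separated
    by ';', type code in the first field, '#' starts a comment.
    """
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue  # blank or comment-only line
        fields = [field.strip() for field in line.split(";")]
        if fields[0] == "T":
            continue  # only relevant for transitional implementations
        yield fields


# Hypothetical sample lines, made up for illustration only.
sample = [
    "# a comment line",
    "T; label-one; label-one",
    "N; label-two; label-two",
    "B; label-three; label-three  # trailing comment",
    "",
]

for case in nontransitional_cases(sample):
    print(case)
```

With a real file, the same function could be fed an open file object instead of the sample list.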

Mark

— Il meglio è l’inimico del bene —


On Tue, Jan 4, 2011 at 12:23, Simon Josefsson <simon <at> josefsson.org> wrote:
Has anyone produced test vectors for IDNA2008 that they could share?

/Simon
_______________________________________________
Idna-update mailing list
Idna-update <at> alvestrand.no
http://www.alvestrand.no/mailman/listinfo/idna-update

Yoshiro YONEYA | 5 Jan 2011 02:35

Re: IDNA2008 test vectors

Hi, Simon,

idnkit-2, which is available from <http://jprs.co.jp/idn/index-e.html>,
includes many test vectors in its test suite.  Please refer to the
idnkit-2.0/test/*/*.def files.

If you are going to produce an IDNA2008 implementation, I'd like to run
interoperability tests together.

Regards,

-- 
Yoshiro YONEYA <yoshiro.yoneya <at> jprs.co.jp>

On Tue, 04 Jan 2011 21:23:21 +0100 Simon Josefsson <simon <at> josefsson.org> wrote:

> Has anyone produced test vectors for IDNA2008 that they could share?
> 
> /Simon
Yoshiro YONEYA | 5 Jan 2011 07:18

Hyphen Restrictions

Hi, all,

I need clarification of RFC5891 section 4.2.3.1, which says:

4.2.3.1.  Hyphen Restrictions

   The Unicode string MUST NOT contain "--" (two consecutive hyphens) in
   the third and fourth character positions and MUST NOT start or end
   with a "-" (hyphen).

My question is what "the third and fourth character positions" means.
Does it mean third and fourth octet from the beginning of the string?
For example:
  beginning of the string
    |
    v 1   2   3   4   5 <-- position of octet
    +---+---+---+---+---+
    | a | b | - | - | c |
    +---+---+---+---+---+
              ^   ^
              |   |
      two consecutive hyphens

Or does it mean third and fourth character from the beginning of the string?
For example:
  beginning of the string
    |
    v 1   2   3   4   5 <-- position of character
    +---+---+---+---+---+
    |<A>|<B>| - | - |<C>| here <A>, <B> and <C> stand for non-ASCII
    +---+---+---+---+---+ (multi-octet) characters
              ^   ^
              |   |
      two consecutive hyphens

My understanding is that this restriction exists to preserve future ACE
prefixes, so I expect the answer to my question is the former.  Is that right?
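
The two readings can give different answers for the same string. A minimal sketch in Python, assuming the octets in question are those of a UTF-8 encoding (the RFC text quoted above does not name an encoding) and using made-up function names:

```python
def hyphens_at_3_4_by_character(label: str) -> bool:
    """True if the 3rd and 4th characters (code points) are both '-'."""
    return label[2:4] == "--"


def hyphens_at_3_4_by_octet(label: str) -> bool:
    """True if the 3rd and 4th octets of the UTF-8 form are both '-'.

    UTF-8 is an assumption made for this illustration.
    """
    return label.encode("utf-8")[2:4] == b"--"


examples = [
    "ab--c",             # ASCII only: both readings agree
    "\u00e9\u00fc--c",   # two 2-octet characters, then "--"
    "\u00e9--c",         # one 2-octet character, then "--"
]

for label in examples:
    print(
        ascii(label),
        hyphens_at_3_4_by_character(label),
        hyphens_at_3_4_by_octet(label),
    )
```

For an all-ASCII string the two checks always agree; they diverge as soon as a multi-octet character precedes the hyphens.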

Regards,

-- 
Yoshiro YONEYA <yoshiro.yoneya <at> jprs.co.jp>
Andrew Sullivan | 5 Jan 2011 07:28

Re: Hyphen Restrictions

On Wed, Jan 05, 2011 at 03:18:20PM +0900, Yoshiro YONEYA wrote:
> Or does it mean third and fourth character from the beginning of the string?
> For example:
>   beginning of the string
>     |
>     v 1   2   3   4   5 <-- position of character
>     +---+---+---+---+---+
>     |<A>|<B>| - | - |<C>| here <A>, <B> and <C> stands for non-ASCII (multi- 
>     +---+---+---+---+---+ octets) character
>               ^   ^
>               |   |
>       two consecutive hyphens

I believe the intention is this one.  The target is "the Unicode
string".  At one point in the development of IDNA2008, I think this
was called a "putative U-label", if I recall correctly.  The idea was
that you had an inbound Unicode string that was supposed to be a
U-label, but you didn't know yet.

That this is the correct interpretation is suggested by section 4.4,
which talks about converting the whole thing to an A-label by doing
the Punycode conversion.  That suggests that previous "labels" in 4.x
were only ever putative U-labels or else they were A-labels.

The above is merely my interpretation; I hold no special authority.

A
-- 
Andrew Sullivan
ajs <at> shinkuro.com
Shinkuro, Inc.
John C Klensin | 5 Jan 2011 07:46

Re: Hyphen Restrictions


--On Wednesday, January 05, 2011 15:18 +0900 Yoshiro YONEYA
<yoshiro.yoneya <at> jprs.co.jp> wrote:

> Hi, all,
> 
> I need clarification of RFC5891 section 4.2.3.1, which says:
> 
> 4.2.3.1.  Hyphen Restrictions
> 
>    The Unicode string MUST NOT contain "--" (two consecutive
> hyphens) in    the third and fourth character positions and
> MUST NOT start or end    with a "-" (hyphen).
> 
> My question is that what "the third and fourth character
> positions" means. Does it mean third and fourth octet from the
> beginning of the string? For example:
>   beginning of the string
>     |
>     v 1   2   3   4   5 <-- position of octet
>     +---+---+---+---+---+
>     | a | b | - | - | c |
>     +---+---+---+---+---+
>               ^   ^
>               |   |
>       two consecutive hyphens
>...
> 
> My understanding for this restrictions is to preserve future
> ACE prefix,  so I expect the answer for my question is former
> one.  Is that right?

Yes

   john
Yoshiro YONEYA | 5 Jan 2011 08:05

Re: Hyphen Restrictions

Dear Andrew and John,

Thank you for your quick responses.  I'm clear now.  The reason I
raised this question is that the two possible interpretations would cause
interoperability problems between implementations.  And if the answer
were the latter, it would have a big impact on IDNA-aware registries.

Regards,

-- 
Yoshiro YONEYA <yoshiro.yoneya <at> jprs.co.jp>

On Wed, 05 Jan 2011 01:46:36 -0500 John C Klensin <klensin <at> jck.com> wrote:

> 
> 
> --On Wednesday, January 05, 2011 15:18 +0900 Yoshiro YONEYA
> <yoshiro.yoneya <at> jprs.co.jp> wrote:
> 
> > Hi, all,
> > 
> > I need clarification of RFC5891 section 4.2.3.1, which says:
> > 
> > 4.2.3.1.  Hyphen Restrictions
> > 
> >    The Unicode string MUST NOT contain "--" (two consecutive
> > hyphens) in    the third and fourth character positions and
> > MUST NOT start or end    with a "-" (hyphen).
> > 
> > My question is that what "the third and fourth character
> > positions" means. Does it mean third and fourth octet from the
> > beginning of the string? For example:
> >   beginning of the string
> >     |
> >     v 1   2   3   4   5 <-- position of octet
> >     +---+---+---+---+---+
> >     | a | b | - | - | c |
> >     +---+---+---+---+---+
> >               ^   ^
> >               |   |
> >       two consecutive hyphens
> >...
> > 
> > My understanding for this restrictions is to preserve future
> > ACE prefix,  so I expect the answer for my question is former
> > one.  Is that right?
> 
> Yes
> 
>    john
Adam M. Costello | 5 Jan 2011 09:03

Re: Hyphen Restrictions

Yoshiro YONEYA <yoshiro.yoneya <at> jprs.co.jp> wrote:

> Dear Andrew and John,
> 
> Thank you for your quick response.  I'm clear now.

You are?  But didn't Andrew and John disagree?  Andrew said it means 3rd
& 4th characters, while John said it means 3rd & 4th octets.

AMC
Nicolas Williams | 5 Jan 2011 10:25

Re: Hyphen Restrictions

On Wed, Jan 05, 2011 at 08:03:43AM +0000, Adam M. Costello wrote:
> Yoshiro YONEYA <yoshiro.yoneya <at> jprs.co.jp> wrote:
> 
> > Dear Andrew and John,
> > 
> > Thank you for your quick response.  I'm clear now.
> 
> You are?  But didn't Andrew and John disagree?  Andrew said it means 3rd
> & 4th characters, while John said it means 3rd & 4th octets.

The RFC says "characters", and I think it probably says that for a
reason.  It'd be nice if the RFC stated that reason.

My guess: the way Punycode works, there's no way for a Punycoded
string to start with an ACE prefix if it doesn't have hyphens in the third
and fourth characters.  However, a quick test seems to indicate that
either that requirement is off by one or my guess is wrong:

% idn --quiet foó--bar
xn--fo--bar-m0a
% idn --quiet xnó--bar
xn--xn--bar-m0a
% idn --quiet xñ--bar 
xn--x--bar-wwa
% 
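
The same experiment can be repeated without the idn tool. A minimal sketch, assuming Python's built-in "punycode" codec is an acceptable stand-in for the Punycode step (it performs no IDNA mapping or validation); the helper name ace is made up for illustration:

```python
def ace(label: str) -> str:
    """Prefix the raw Punycode form of a label with 'xn--'.

    No IDNA mapping or validity checks are applied; this is only
    the encoding step.
    """
    return "xn--" + label.encode("punycode").decode("ascii")


for label in ("fo\u00f3--bar", "xn\u00f3--bar", "x\u00f1--bar"):
    encoded = ace(label)
    payload = encoded[4:]  # the part after the ACE prefix
    print(
        encoded,
        "Unicode has '--' at 3-4:", label[2:4] == "--",
        "payload has '--' at 3-4:", payload[2:4] == "--",
    )
```

In the first two cases the hyphens sit at character positions 4 and 5 of the Unicode string yet land at positions 3 and 4 of the encoded payload, because Punycode moves the non-ASCII character out of the leading run of ASCII characters. In the third case the opposite happens: the Unicode string has the hyphens at positions 3 and 4, and the payload has them at 2 and 3.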

Nico
Simon Josefsson | 5 Jan 2011 11:06

Combining mark vs combining character?

Hi,

I need a clarification regarding this paragraph in section 4.2.3.2 of
RFC 5891:

   The Unicode string MUST NOT begin with a combining mark or combining
   character (see The Unicode Standard, Section 2.11 [Unicode] for an
   exact definition).

And this in section 5.4:

   Putative U-labels with any of the following characteristics MUST be
   rejected prior to DNS lookup:
...
   o  Labels whose first character is a combining mark (see The Unicode
      Standard, Section 2.11 [Unicode]).

The reference to [Unicode] is not normative, which would be a problem
for any implementer.

Section 2.11 of Unicode 5.0 discusses "combining character" but
not "combining mark".

There is a section 7.9 in Unicode 5.0 called "Combining Marks".

One section that discusses both Combining Marks and Combining Characters
is section 3.11 on "Canonical Ordering Behaviour".

There is one section 3.6 on "Combination" that gives the precise
definition of a "Combining character":

   Combining character: A character with the General Category of
   Combining Mark (M).

Is this the intended definition of Combining character by RFC 5891?
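
If that definition is the intended one, the check is easy to express. A minimal sketch in Python, assuming "General Category M" means the categories Mn, Mc and Me as reported by the standard unicodedata module (whose Unicode version depends on the Python build); the function name is hypothetical:

```python
import unicodedata


def starts_with_combining_mark(label: str) -> bool:
    """True if the first character has General Category Mn, Mc or Me.

    Assumes the section 3.6 definition quoted above is the one the
    RFC intends, which is exactly the open question here.
    """
    if not label:
        return False
    return unicodedata.category(label[0]).startswith("M")


print(starts_with_combining_mark("\u0301abc"))  # U+0301 COMBINING ACUTE ACCENT first
print(starts_with_combining_mark("a\u0301bc"))  # mark present, but not first
print(starts_with_combining_mark("abc"))
```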

Questions:

1) Does RFC 5891 refer to "combining mark" and "combining character" as
the same thing?

2) Is there a significant difference between the requirement in 4.2.3.2
and 5.4?  The latter section only mentions "combining mark" and not
"combining character".

3) What is the precise definition of a "combining mark"?

/Simon
