4 Jan 2011 21:23
5 Jan 2011 00:43
Re: IDNA2008 test vectors
Mark Davis ☕ <mark <at> macchiato.com>
2011-01-04 23:43:20 GMT
For the next version of UTS46, the UTC is introducing additional fields in its test files that indicate whether the strings are valid in IDNA2008, so you could take a look at those (currently in draft state).
http://www.unicode.org/review/#pri177
The draft test file is at http://www.unicode.org/Public/idna/6.0.1/IdnaTest.txt
You would skip the lines starting with T, since those are only relevant for transitional implementations. If you use the files and find any issues, let me know and I can funnel the feedback back to the UTC.
Mark
— Il meglio è l’inimico del bene —
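The filtering step described above (skip the lines starting with T) could be sketched in Python roughly as follows. This is only a minimal sketch, not a UTS46 test harness. It assumes the usual Unicode data-file conventions ("#" starts a comment, ";" separates fields) and that the first field carries the test type. The function name non_transitional_records is made up for illustration, and the meaning of the remaining fields should be taken from the header of the draft file itself.

```python
from typing import Iterable, List


def non_transitional_records(lines: Iterable[str]) -> List[List[str]]:
    """Return the fields of every test line whose type is not "T".

    Assumption: '#' starts a comment and ';' separates fields, as in
    other Unicode data files. The field layout is not interpreted here.
    """
    records = []
    for line in lines:
        # Drop any trailing comment, then surrounding whitespace.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue  # blank line or comment-only line
        fields = [field.strip() for field in data.split(";")]
        # "T" lines are only relevant for transitional implementations.
        if fields[0] == "T":
            continue
        records.append(fields)
    return records
```

To run it over a downloaded copy, something like `non_transitional_records(open("IdnaTest.txt", encoding="utf-8"))` should work, assuming the file is saved locally under that name.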
On Tue, Jan 4, 2011 at 12:23, Simon Josefsson <simon <at> josefsson.org> wrote:
Has anyone produced test vectors for IDNA2008 that they could share?
/Simon
_______________________________________________
Idna-update mailing list
Idna-update <at> alvestrand.no
http://www.alvestrand.no/mailman/listinfo/idna-update
5 Jan 2011 02:35
Re: IDNA2008 test vectors
Yoshiro YONEYA <yoshiro.yoneya <at> jprs.co.jp>
2011-01-05 01:35:19 GMT
Hi, Simon,

idnkit-2, which is available from <http://jprs.co.jp/idn/index-e.html>,
includes many test vectors in its test suite. Please refer to the
idnkit-2.0/test/*/*.def files.

If you are going to produce an IDNA2008 implementation, I'd like to run
an interoperability test together.

Regards,
--
Yoshiro YONEYA <yoshiro.yoneya <at> jprs.co.jp>

On Tue, 04 Jan 2011 21:23:21 +0100 Simon Josefsson <simon <at> josefsson.org> wrote:
> Has anyone produced test vectors for IDNA2008 that they could share?
>
> /Simon
> _______________________________________________
> Idna-update mailing list
> Idna-update <at> alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/idna-update
5 Jan 2011 07:18
Hyphen Restrictions
Yoshiro YONEYA <yoshiro.yoneya <at> jprs.co.jp>
2011-01-05 06:18:20 GMT
Hi, all,
I need clarification of RFC5891 section 4.2.3.1, which says:
4.2.3.1. Hyphen Restrictions
The Unicode string MUST NOT contain "--" (two consecutive hyphens) in
the third and fourth character positions and MUST NOT start or end
with a "-" (hyphen).
My question is what "the third and fourth character positions" means.
Does it mean the third and fourth octets from the beginning of the string?
For example:
beginning of the string
|
v 1   2   3   4   5   <-- position of octet
+---+---+---+---+---+
| a | b | - | - | c |
+---+---+---+---+---+
          ^   ^
          |   |
   two consecutive hyphens
Or does it mean the third and fourth characters from the beginning of the string?
For example:
beginning of the string
|
v 1   2   3   4   5   <-- position of character
+---+---+---+---+---+
|<A>|<B>| - | - |<C>|   here <A>, <B> and <C> stand for non-ASCII
+---+---+---+---+---+   (multi-octet) characters
          ^   ^
          |   |
   two consecutive hyphens
My understanding is that this restriction exists to preserve future ACE
prefixes, so I expect the answer to my question is the former one. Is that right?
Regards,
--
Yoshiro YONEYA <yoshiro.yoneya <at> jprs.co.jp>
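The two readings asked about in this message can be contrasted with a small Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, and UTF-8 is assumed for the octet reading (the message says only "octet" and names no encoding). Python strings are indexed by code point, so slicing the string gives the character reading and slicing the encoded bytes gives the octet reading.

```python
def has_hyphens_at_3_4_chars(label: str) -> bool:
    """True if the 3rd and 4th code points of the Unicode string are both '-'."""
    return label[2:4] == "--"


def has_hyphens_at_3_4_octets(label: str, encoding: str = "utf-8") -> bool:
    """True if the 3rd and 4th octets of the encoded string are both '-'.

    Assumption: UTF-8 is used as the octet encoding.
    """
    return label.encode(encoding)[2:4] == b"--"


# For an all-ASCII string the two readings agree.
print(has_hyphens_at_3_4_chars("ab--c"), has_hyphens_at_3_4_octets("ab--c"))

# With non-ASCII characters in front they can differ: each of the first
# two characters below takes two octets in UTF-8, so octets 3 and 4
# belong to the second character rather than to the hyphens.
label = "\u00e9\u00f1--\u00fc"
print(has_hyphens_at_3_4_chars(label), has_hyphens_at_3_4_octets(label))
```

The second example is the case where the two interpretations give different answers for the same label.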
5 Jan 2011 07:28
Re: Hyphen Restrictions
Andrew Sullivan <ajs <at> shinkuro.com>
2011-01-05 06:28:34 GMT
On Wed, Jan 05, 2011 at 03:18:20PM +0900, Yoshiro YONEYA wrote:
> Or does it mean third and fourth character from the beginning of the string?
> For example:
> beginning of the string
> |
> v 1   2   3   4   5   <-- position of character
> +---+---+---+---+---+
> |<A>|<B>| - | - |<C>|   here <A>, <B> and <C> stands for non-ASCII (multi-
> +---+---+---+---+---+   octets) character
>           ^   ^
>           |   |
>    two consecutive hyphens

I believe the intention is this one. The target is "the Unicode
string". At one point in the development of IDNA2008, I think this was
called a "putative U-label", if I recall correctly. The idea was that
you had an inbound Unicode string that was supposed to be a U-label, but
you didn't know yet.

That this is the correct interpretation is suggested by section 4.4,
which talks about converting the whole thing to an A-label by doing the
Punycode conversion. That suggests that previous "labels" in 4.x were
only ever putative U-labels or else they were A-labels.

The above is merely my interpretation; I hold no special authority.

A

--
Andrew Sullivan
ajs <at> shinkuro.com
Shinkuro, Inc.
5 Jan 2011 07:46
Re: Hyphen Restrictions
John C Klensin <klensin <at> jck.com>
2011-01-05 06:46:36 GMT
--On Wednesday, January 05, 2011 15:18 +0900 Yoshiro YONEYA
<yoshiro.yoneya <at> jprs.co.jp> wrote:

> Hi, all,
>
> I need clarification of RFC5891 section 4.2.3.1, which says:
>
> 4.2.3.1. Hyphen Restrictions
>
> The Unicode string MUST NOT contain "--" (two consecutive
> hyphens) in the third and fourth character positions and
> MUST NOT start or end with a "-" (hyphen).
>
> My question is that what "the third and fourth character
> positions" means. Does it mean third and fourth octet from the
> beginning of the string? For example:
> beginning of the string
> |
> v 1   2   3   4   5   <-- position of octet
> +---+---+---+---+---+
> | a | b | - | - | c |
> +---+---+---+---+---+
>           ^   ^
>           |   |
>    two consecutive hyphens
>...
>
> My understanding for this restrictions is to preserve future
> ACE prefix, so I expect the answer for my question is former
> one. Is that right?

Yes

john
5 Jan 2011 08:05
Re: Hyphen Restrictions
Yoshiro YONEYA <yoshiro.yoneya <at> jprs.co.jp>
2011-01-05 07:05:12 GMT
Dear Andrew and John,

Thank you for your quick response. I'm clear now.

The reason I raised this question is that two possible interpretations
would cause interoperability problems between implementations. And if
the answer were the latter, it would have a big impact on IDNA-aware
registries.

Regards,
--
Yoshiro YONEYA <yoshiro.yoneya <at> jprs.co.jp>

On Wed, 05 Jan 2011 01:46:36 -0500 John C Klensin <klensin <at> jck.com> wrote:
>
>
> --On Wednesday, January 05, 2011 15:18 +0900 Yoshiro YONEYA
> <yoshiro.yoneya <at> jprs.co.jp> wrote:
>
> > Hi, all,
> >
> > I need clarification of RFC5891 section 4.2.3.1, which says:
> >
> > 4.2.3.1. Hyphen Restrictions
> >
> > The Unicode string MUST NOT contain "--" (two consecutive
> > hyphens) in the third and fourth character positions and
> > MUST NOT start or end with a "-" (hyphen).
> >
> > My question is that what "the third and fourth character
> > positions" means. Does it mean third and fourth octet from the
> > beginning of the string? For example:
> > beginning of the string
> > |
> > v 1   2   3   4   5   <-- position of octet
> > +---+---+---+---+---+
> > | a | b | - | - | c |
> > +---+---+---+---+---+
> >           ^   ^
> >           |   |
> >    two consecutive hyphens
> >...
> >
> > My understanding for this restrictions is to preserve future
> > ACE prefix, so I expect the answer for my question is former
> > one. Is that right?
>
> Yes
>
> john
>
>
>
>
> _______________________________________________
> Idna-update mailing list
> Idna-update <at> alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/idna-update
5 Jan 2011 09:03
Re: Hyphen Restrictions
Adam M. Costello <idna-update.amc+0+ <at> nicemice.net.RemoveThisWord>
2011-01-05 08:03:43 GMT
Yoshiro YONEYA <yoshiro.yoneya <at> jprs.co.jp> wrote:

> Dear Andrew and John,
>
> Thank you for your quick response. I'm clear now.

You are? But didn't Andrew and John disagree? Andrew said it means 3rd
& 4th characters, while John said it means 3rd & 4th octets.

AMC
5 Jan 2011 10:25
Re: Hyphen Restrictions
Nicolas Williams <Nicolas.Williams <at> oracle.com>
2011-01-05 09:25:39 GMT
On Wed, Jan 05, 2011 at 08:03:43AM +0000, Adam M. Costello wrote:
> Yoshiro YONEYA <yoshiro.yoneya <at> jprs.co.jp> wrote:
>
> > Dear Andrew and John,
> >
> > Thank you for your quick response. I'm clear now.
>
> You are? But didn't Andrew and John disagree? Andrew said it means 3rd
> & 4th characters, while John said it means 3rd & 4th octets.

The RFC says "characters", and I think it probably says that for a
reason. It'd be nice if the RFC stated that reason.

My guess: the way Punycode works there's no way for a Punycoded string
to start with an ACE prefix if it doesn't have hyphens in the 3rd and
4th characters.

However, a quick test seems to indicate that either that requirement is
off by one or my guess is wrong:

% idn --quiet foó--bar
xn--fo--bar-m0a
% idn --quiet xnó--bar
xn--xn--bar-m0a
% idn --quiet xñ--bar
xn--x--bar-wwa
%

Nico
--
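For readers without the idn tool, the same experiment can probably be reproduced with Python's built-in "punycode" codec. This is a sketch under assumptions, not a replacement for idn: the helper name to_ace is hypothetical, and it performs no Nameprep, mapping or validation, only the raw Punycode step plus the "xn--" prefix.

```python
def to_ace(label: str) -> str:
    """Punycode-encode a label and prepend the "xn--" ACE prefix.

    Assumption: the input is already lowercase and needs no mapping;
    no IDNA validation of any kind is applied here.
    """
    return "xn--" + label.encode("punycode").decode("ascii")


# The three labels from the test above, written with escapes so the
# example does not depend on the source file encoding.
for label in ("fo\u00f3--bar", "xn\u00f3--bar", "x\u00f1--bar"):
    print(label, "->", to_ace(label))
```

The second label is the interesting one: its hyphens sit at character positions 4 and 5, so it passes a check on positions 3 and 4, yet the Punycode part of the result still begins with "xn--".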
5 Jan 2011 11:06
Combining mark vs combining character?
Simon Josefsson <simon <at> josefsson.org>
2011-01-05 10:06:40 GMT
Hi,
I need a clarification regarding this paragraph in section 4.2.3.2 of
RFC 5891:
The Unicode string MUST NOT begin with a combining mark or combining
character (see The Unicode Standard, Section 2.11 [Unicode] for an
exact definition).
And this in section 5.4:
Putative U-labels with any of the following characteristics MUST be
rejected prior to DNS lookup:
...
o Labels whose first character is a combining mark (see The Unicode
Standard, Section 2.11 [Unicode]).
The reference to [Unicode] is not normative, which would be a problem
for any implementer.
Section 2.11 of Unicode 5.0 discusses "combining character" but
not "combining mark".
There is a section 7.9 in Unicode 5.0 called "Combining Marks".
One section that discusses both Combining Marks and Combining Characters
is section 3.11 on "Canonical Ordering Behaviour".
Section 3.6 on "Combination" gives the precise
definition of a "Combining character":
Combining character: A character with the General Category of
Combining Mark (M).
Is this the intended definition of Combining character by RFC 5891?
Questions:
1) Does RFC 5891 refer to "combining mark" and "combining character" as
the same thing?
2) Is there a significant difference between the requirement in 4.2.3.2
and 5.4? The latter section only mentions "combining mark" and not
"combining character".
3) What is the precise definition of a "combining mark"?
/Simon
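If the Section 3.6 definition quoted in this message (General Category of Combining Mark, M) is the intended one, the check on the first character could be sketched in Python as below. Whether that is what RFC 5891 actually intends is exactly the open question here, so treat this as an assumption. The function name is hypothetical, and the result depends on the Unicode version shipped with the Python unicodedata module.

```python
import unicodedata


def begins_with_combining_mark(label: str) -> bool:
    """True if the first code point has General Category M (Mn, Mc or Me).

    Assumption: "combining mark" and "combining character" both mean
    General Category M, per the Unicode Section 3.6 definition.
    """
    if not label:
        return False
    return unicodedata.category(label[0]).startswith("M")


# U+0301 COMBINING ACUTE ACCENT has category Mn, so the first label is rejected.
print(begins_with_combining_mark("\u0301abc"))
print(begins_with_combining_mark("abc"))
```

Under this reading the requirements in 4.2.3.2 and 5.4 would reduce to the same test.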