Erik van der Poel | 8 Apr 03:55

Re: space-like unicode char

Soobok Lee wrote:
> U+1160 is a space-like char and even stringprep/nameprep does not
> filter it out because the char is not for punctuational purpose.

U+1160 is HANGUL JUNGSEONG FILLER, and it is used to transform 
nonstandard syllables into standard ones (see Unicode 3.0, section 3.11; 
note that RFC 3454 refers to Unicode 3.2.0). However, this 
transformation is one of the additional transformations not considered 
part of Unicode normalization (3.2.0's UAX #15 Annex 10). So this 
character is not generated by Stringprep/Nameprep.

However, it is not prohibited either, so it may occur in the input to 
(and output from) Stringprep/Nameprep. I read some of the sections on 
Hangul in the Unicode book and Web site, but I did not see any rules 
regarding repeated occurrences of U+1160 (as you had in your example, 
not quoted above). I also did not see any rules about what to do when a 
filler is not followed by a Hangul jamo. It would be nice to have these 
rules in Unicode or in Stringprep.
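
The "not prohibited" claim can be checked mechanically. The following is a minimal sketch, not a full Nameprep implementation: it assumes Python's standard stringprep module (which exposes the RFC 3454 tables) and the Unicode 3.2.0 database that Python ships as unicodedata.ucd_3_2_0. The helper name survives_nameprep is hypothetical, and the bidi checks are left out because they apply to whole strings rather than to single characters.

```python
import stringprep
from unicodedata import ucd_3_2_0 as ucd  # Unicode 3.2.0, as used by RFC 3454

FILLER = "\u1160"  # HANGUL JUNGSEONG FILLER

# RFC 3454 tables that the Nameprep profile (RFC 3491) prohibits in output.
PROHIBITED_TABLES = (
    stringprep.in_table_c12, stringprep.in_table_c22,
    stringprep.in_table_c3, stringprep.in_table_c4,
    stringprep.in_table_c5, stringprep.in_table_c6,
    stringprep.in_table_c7, stringprep.in_table_c8,
    stringprep.in_table_c9,
)


def survives_nameprep(ch):
    """Hypothetical helper: True if one character passes the per-character
    Nameprep steps (map, normalize, prohibit, unassigned) unchanged."""
    if stringprep.in_table_b1(ch):          # B.1: mapped to nothing
        return False
    if stringprep.map_table_b2(ch) != ch:   # B.2: case folding
        return False
    if ucd.normalize("NFKC", ch) != ch:     # normalization changes it
        return False
    if any(table(ch) for table in PROHIBITED_TABLES):
        return False
    return not stringprep.in_table_a1(ch)   # A.1: unassigned in 3.2


if __name__ == "__main__":
    for ch in (FILLER, "\u00a0", "\u200b"):
        print("U+%04X" % ord(ch), survives_nameprep(ch))
```

Under these assumptions U+1160 comes through untouched, while NO-BREAK SPACE (changed by NFKC) and ZERO WIDTH SPACE (mapped to nothing) do not.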

I tried U+1160 followed by a Latin character in MSIE with i-Nav and in 
Firefox with IDN turned on, and it was displayed as a wide space. It is 
unfortunate that both implementations chose to display it as a space 
instead of deleting it.

Erik

Soobok Lee | 8 Apr 09:05

Re: space-like unicode char

Erik van der Poel wrote:

> Soobok Lee wrote:
>
>> U+1160 is a space-like char and even stringprep/nameprep does not
>> filter it out because the char is not for punctuational purpose.
>
>
> U+1160 is HANGUL JUNGSEONG FILLER and it is used to transform
> nonstandard syllables into standard ones (Unicode 3.0 section 3.11
> (RFC 3454 refers to Unicode 3.2.0)). However, this transformation is
> one of the additional transformations not considered part of Unicode
> normalization (3.2.0's UAX #15 Annex 10). 

Exactly. U+1160 is not "touched" by Unicode normalization (NFC).

> So this character is not generated by Stringprep/Nameprep. However, it
> is not prohibited either, so it may occur in the input to (and output
> from) Stringprep/Nameprep.

Yes, it may occur.

> I read some of the sections on Hangul in the Unicode book and Web
> site, but I did not see any rules regarding repeated occurrences of
> U+1160 (as you had in your example, not quoted above). I also did not
> see any rules about what to do when a filler is not followed by a
> Hangul jamo. It would be nice to have these rules in Unicode or in
> Stringprep.

The U+1160 problem was raised 3.5 years ago (you can look into this

Soobok Lee | 8 Apr 09:21

combining marks and space-like unicode char


>>I tried U+1160 followed by a Latin character in MSIE with i-Nav and in
>>Firefox with IDN turned on, and it was displayed as a wide space. It
>>is unfortunate that both implementations chose to display it as a
>>space instead of deleting it.
>
>Yes. Plugins MUST filter out U+1160 from validated ToUnicode()ed
>labels, whether or not IDNA requires that.
>
>Soobok
>
I will add this: in the standard Hangul writing system, U+1160 is
meaningful only in some contexts (next to at least one jamo char).
But is a standalone U+1160 illegal? No, it is NOT illegal.

So blind filtering of U+1160 is a mistake. Plugins' filtering should be
context-sensitive.
That is why it would complicate stringprep if it were included in
stringprep. :-)
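
A context-sensitive check of the kind described above could be sketched as follows. This is only an illustration under a stated assumption: it treats a U+1160 as acceptable when the character immediately before or after it is a conjoining jamo (U+1100..U+11FF, not counting U+1160 itself). Neither Unicode nor Stringprep defines such a rule, and the function names are hypothetical.

```python
FILLER = "\u1160"  # HANGUL JUNGSEONG FILLER


def is_jamo(ch):
    """True for one conjoining jamo (U+1100..U+11FF) other than U+1160.

    Assumption: a neighbouring filler does not count as context, so runs
    of repeated fillers are not self-justifying."""
    return len(ch) == 1 and 0x1100 <= ord(ch) <= 0x11FF and ch != FILLER


def stray_fillers(label):
    """Return the indexes of every U+1160 that has no jamo neighbour."""
    stray = []
    for i, ch in enumerate(label):
        if ch != FILLER:
            continue
        before = label[i - 1] if i > 0 else ""
        after = label[i + 1] if i + 1 < len(label) else ""
        if not (is_jamo(before) or is_jamo(after)):
            stray.append(i)
    return stray


if __name__ == "__main__":
    print(stray_fillers("\u1100\u1160"))    # filler after a choseong
    print(stray_fillers("\u1160a"))         # filler before a Latin letter
    print(stray_fillers("a\u1160\u1160b"))  # repeated fillers, no jamo
```

A plugin could then refuse to display, or fall back to the ACE form of, any label for which stray_fillers() returns a non-empty list, instead of deleting every U+1160 blindly.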

We can find similar problems with the "combining diacritical marks"
(U+03xx). What if a label consists of a single combining accent or
dot-above character, without any preceding letter? It will combine
with the preceding dot delimiter, and that will produce a confusing
appearance (it looks like a colon, which is a protocol delimiter).
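
Labels of this kind are easy to detect. The sketch below is an illustration, not part of IDNA or Nameprep: it splits a name on the dots that RFC 3490 treats as label separators and reports every label whose first character is a combining mark (general category Mn, Mc or Me). The function name is hypothetical.

```python
import re
import unicodedata

# Dots that IDNA (RFC 3490) treats as label separators.
IDNA_DOTS = re.compile("[.\u3002\uff0e\uff61]")


def labels_with_leading_mark(name):
    """Return the labels of `name` that start with a combining mark."""
    return [
        label
        for label in IDNA_DOTS.split(name)
        if label and unicodedata.category(label[0]).startswith("M")
    ]


if __name__ == "__main__":
    # U+0307 COMBINING DOT ABOVE would attach to the preceding "."
    print(labels_with_leading_mark("www.\u0307abc.com"))
    print(labels_with_leading_mark("www.example.com"))
```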

Erik van der Poel | 8 Apr 21:36

Re: space-like unicode char

Soobok Lee wrote:
> The U+1160 problem was raised 3.5 years ago (you can look into this
> huge idn-list archive by keyword search for 1160 or filler),
> along with some additional Hangul jamo problems. I submitted a draft
> (you may find it at www.i-d-n.net)
> to filter out these invalid char sequences, but the draft was
> discarded. Someone argued that such filtering *complicates*
> stringprep algorithms with context-sensitive filtering/prohibiting, and
> that the problem is up to UTC/NFC, not the IETF. Of course, I couldn't
> accept that.

The i-d-n.net name no longer takes you to a real site, but I believe I 
found your draft here:

http://www.watersprings.org/pub/id/draft-ietf-idn-hangeulchar-00.txt

I agree that the U+1160 issues would complicate a spec, and I can see 
why the IETF decided not to include them in the RFCs, but now that we 
have seen that a number of implementations display this character in a 
potentially dangerous way, we should reconsider the specs.

Unicode may not be able to address these issues in the normalization 
spec since they have promised not to make any incompatible changes. 
Unicode might be able to address the issues in other normative or 
informative parts of their book or documents, and the IETF might just 
want to refer to those parts of Unicode.

Alternatively, the IETF can write up its own specifications or 
recommendations. It's not immediately clear to me whether U+1160 ought 
to be addressed in Stringprep or Nameprep. As we have seen, Stringprep 
is used in various protocols, including SASLprep, which is for user 

