tedd | 15 Feb 17:08

homograph attacks

Hi people:

You all knew this was going to happen.

    http://www.p&#1072;ypal.com

You might find --

    http://www.shmoo.com/idn/homograph.txt

-- an interesting read.

tedd
--

-- 
--------------------------------------------------------------------------------
http://sperling.com/

Martin v. Löwis | 15 Feb 20:02

Re: homograph attacks

tedd wrote:
> You all knew this was going to happen.
> 
>    http://www.p&#1072;ypal.com

Indeed. However, I am somewhat disheartened that this could
happen. IMO, Verisign should never have registered that
domain - the registrar should have provided a language for
the label, that language should have been "Russian" (or
else &#1072; should not have been allowed), and this combination
of Cyrillic and Latin letters should not be allowed for the
Russian language.

Regards,
Martin
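
To see the mixed-script label for oneself, one can expand the numeric character reference and look up each character's Unicode name. The following is a minimal sketch using only Python's standard library; the helper name `describe_label` is made up for illustration and does not come from the thread.

```python
import html
import unicodedata


def describe_label(label: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode name) for each character of a label."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in label
    ]


# The posted label, with the numeric character reference written out in full.
label = html.unescape("p&#1072;ypal")

for code_point, name in describe_label(label):
    print(code_point, name)
```

The second character is reported as a Cyrillic letter while the rest are Latin, which is exactly the combination discussed above.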

Krall, Gary | 15 Feb 20:07

RE: homograph attacks

All:

And an interesting follow-up by Paul Hoffman:

http://LookIt.proper.com/archives/000302.html#000302

Gary.

-----Original Message-----
From: owner-idn <at> ops.ietf.org [mailto:owner-idn <at> ops.ietf.org]On Behalf Of
tedd
Sent: Tuesday, February 15, 2005 8:08 AM
To: idn <at> ops.ietf.org
Cc: ericj <at> shmoo.com
Subject: [idn] homograph attacks

Hi people:

You all knew this was going to happen.

    http://www.p&#1072;ypal.com

You might find --

    http://www.shmoo.com/idn/homograph.txt

-- an interesting read.

tedd
--

-- 

Edmon Chung | 15 Feb 20:23

Re: homograph attacks

I think the measure to disable IDN is, as a Cantonese saying goes, "cutting off one's toes to avoid sand worms"...

You are right that we knew all along. In some ways, it may be good that it finally did happen and raised some
concern.  Now, the right attitude is to address it rather than shy away from it.

Here is a list I put together way back of the characters in the Latin, Greek, Cyrillic and Armenian
scripts (based on Unicode) that create potential "homographs":

a 0061 03B1 0430    
b 0062 03B2 0432    
c 0063 0441 0441    
Đ 0111 0256     
e 0065 03B5 0435    
ə 0259 04D9     
h 0068 03B7 043D 0570 029C  
i 0069 03B9 0456 04C0 0131 0456 0130
j 006A 0458 03F3 0458   
k 006B 03BA     
ĸ 0138 043A     
m 006D 03BC 043C    
ɯ 026F 0561     
ɰ 0270 057A     
n 006E 03BD 0578    
o 006F 03BF 043E 
p 0070 03C1 0440 
s 0073 0455 057F 
t 0074 03C4 0442 
u 0075 057D  
x 0078 03C7 0445 
y 0079 03C5 04AF 0443
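
As a rough illustration of how such a table could be put to work, the sketch below folds each listed look-alike back to its Latin letter and compares the resulting "skeletons". Only a handful of rows from the table above are copied in; the names `HOMOGRAPH_ROWS`, `skeleton` and `collides` are hypothetical, and this is not a description of any registry's actual procedure.

```python
# A few rows copied from the table above: Latin letter -> look-alike code points.
HOMOGRAPH_ROWS = {
    "a": (0x03B1, 0x0430),
    "e": (0x03B5, 0x0435),
    "o": (0x03BF, 0x043E),
    "p": (0x03C1, 0x0440),
    "s": (0x0455, 0x057F),
    "x": (0x03C7, 0x0445),
    "y": (0x03C5, 0x04AF, 0x0443),
}

# Invert the rows: look-alike character -> Latin letter.
FOLD = {chr(cp): latin for latin, cps in HOMOGRAPH_ROWS.items() for cp in cps}


def skeleton(label: str) -> str:
    """Replace every listed look-alike with its Latin counterpart."""
    return "".join(FOLD.get(ch, ch) for ch in label)


def collides(candidate: str, existing: str) -> bool:
    """True if two different labels become identical after folding."""
    return candidate != existing and skeleton(candidate) == skeleton(existing)


print(collides("p\u0430ypal", "paypal"))
```

A check like this only catches the pairs that are in the table, so its usefulness depends entirely on how complete the table is.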

Dirk-Willem van Gulik

Re: homograph attacks


On Tue, 15 Feb 2005, "Martin v. Löwis" wrote:

> >    http://www.p&#1072;ypal.com
>
> Indeed. However, I am somewhat disheartened that this could
> happen. IMO, Verisign should never have registered that
> domain - the registrar should have provided a language for
> the label, that language should have been "Russian" (or
> else &#1072; should not have been allowed), and this combination
> of Cyrillic and Latin letters should not be allowed for the
> Russian language.

Which is easier said than done - there are thousands of homonyms in
Unicode (depending on the font, sometimes significantly more), and even
in the easy Western European languages you may have a grave accent dropping
off in some fonts/cases with lowercase chars. And whether it is a true
homonym may differ from language to language, and even depends on whether
this is uppercase or not.

Even in an easy language like Dutch - would you see the difference between
www.ijselmeer.nl and www.ĳselmeer.nl ('i'+'j' or just Unicode 0133 (or
0132 for uppercase))?

And beyond that - to a lot of readers the URLs www.langorse.co.uk and
www.1angorse.co.uk will appear identical. (Try Courier New or Gill Sans.)

Plus, it is not uncommon in some Asian company names/logos to see essentially
two or even three "scripts" combined.


Kane, Pat | 15 Feb 20:30

RE: homograph attacks

VeriSign does prevent domains with the Russian language tag from commingling
A-Z with the Cyrillic characters.  It does permit 0-9 and the dash to be
used.  This filter also applies to other Cyrillic based languages such as
Belarusian, Ukrainian, Serbian, Macedonian and Bulgarian.  

There are other languages listed in ISO 639-2 that today use a
combination of Latin and Cyrillic. They were originally Latin-based (Tajik
used Arabic script before becoming Latin-based), migrated to Cyrillic during
the Soviet era, and are now migrating back to Latin.  As I understand it,
not being a native speaker, it is common to use Latin and Cyrillic
characters together in Tajik.  Granted, there are not a lot of Tajik
registrations in com and net, but this is just the point of an IDN.

Pat Kane
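
The Russian-tag rule described above (Cyrillic letters plus 0-9 and the dash, no A-Z) could be sketched roughly as follows. This is not VeriSign's implementation; the function name is hypothetical, and because the Python standard library has no script lookup, the sketch assumes a character is Cyrillic when its Unicode name starts with "CYRILLIC".

```python
import unicodedata


def allowed_under_russian_tag(label: str) -> bool:
    """Accept Cyrillic letters, digits 0-9 and the hyphen; reject the rest.

    Assumption: a character counts as Cyrillic if its Unicode name starts
    with "CYRILLIC". This approximates the script property.
    """
    for ch in label:
        if ch in "0123456789-":
            continue  # digits and the dash are permitted alongside Cyrillic
        if unicodedata.name(ch, "").startswith("CYRILLIC"):
            continue
        return False  # Latin A-Z or anything else is refused
    return True


print(allowed_under_russian_tag("p\u0430ypal"))
print(allowed_under_russian_tag("\u043f\u0440\u0438\u043c\u0435\u0440-1"))
```

Under this rule the mixed label is refused, while an all-Cyrillic label with a digit and a dash passes.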

-----Original Message-----
From: owner-idn <at> ops.ietf.org [mailto:owner-idn <at> ops.ietf.org] On Behalf Of
"Martin v. Löwis"
Sent: Tuesday, February 15, 2005 2:02 PM
To: tedd
Cc: idn <at> ops.ietf.org; ericj <at> shmoo.com
Subject: Re: [idn] homograph attacks

tedd wrote:
> You all knew this was going to happen.
> 
>    http://www.p&#1072;ypal.com

Indeed. However, I am somewhat disheartened that this could
happen. IMO, Verisign should never have registered that

Martin v. Löwis | 15 Feb 21:14

Re: homograph attacks

Dirk-Willem van Gulik wrote:
>>Indeed. However, I am somewhat disheartened that this could
>>happen. IMO, Verisign should never have registered that
>>domain - the registrar should have provided a language for
>>the label, that language should have been "Russian" (or
>>else &#1072; should not have been allowed), and this combination
>>of Cyrillic and Latin letters should not be allowed for the
>>Russian language.
> 
> 
> Which is easier said than done - there are thousands of homonyms in
> Unicode (depending on the font, sometimes significantly more), and even
> in the easy Western European languages you may have a grave accent dropping
> off in some fonts/cases with lowercase chars.

Please read carefully. I'm not at all suggesting that homo*graphs*
should be used in considering whether registration is allowed.

I'm suggesting that Verisign does what they say they do: define
"language packs" for each language, see

http://verisign.com/products-services/naming-and-directory-services/naming-services/internationalized-domain-names/page_001382.html

> Even in an easy language like Dutch - would you see the difference between
> www.ijselmeer.nl and www.ĳselmeer.nl ('i'+'j' or just Unicode 0133 (or
> 0132 for uppercase))?

By design, IDNA normalizes U+0133 to "ij", so whether you have "&#x0133;"
or "ij" in the URL, IDNA will resolve it to the same machine in
all cases.
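
This can be checked with a short sketch. It relies on two things in the Python standard library: `unicodedata.normalize` for NFKC, and the built-in "idna" codec, which implements IDNA 2003 and applies Nameprep before encoding. The label is the one from the quoted message; the variable names are just for illustration.

```python
import unicodedata

ligature_label = "\u0133selmeer"  # starts with U+0133 LATIN SMALL LIGATURE IJ
plain_label = "ijselmeer"         # starts with the two letters i and j

# Nameprep includes NFKC normalization, which expands the ligature.
normalized = unicodedata.normalize("NFKC", ligature_label)
print(normalized == plain_label)

# The "idna" codec runs Nameprep first, so both spellings encode identically.
print(ligature_label.encode("idna") == plain_label.encode("idna"))
```

Because the ligature disappears during preparation, the two spellings cannot be registered as different names.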

Martin v. Löwis | 15 Feb 21:29

Re: homograph attacks

Kane, Pat wrote:
> VeriSign does prevent domains with the Russian language tag from commingling
> A-Z with the Cyrillic characters.  It does permit 0-9 and the dash to be
> used.  This filter also applies to other Cyrillic based languages such as
> Belarusian, Ukrainian, Serbian, Macedonian and Bulgarian.  
> 
> There are other languages that are listed within ISO 639-2 that today use a
> combination of Latin and Cyrillic as they were originally Latin based (Tajik
> was Arabic prior to being Latin based), migrated to Cyrillic during the
> Soviet era and today are migrating back to Latin.

Thanks for the clarification. Is this information publicly available
somehow? On

http://www.verisign.com/static/002533.pdf

I can find the language code list (which shows that indeed TGK and RUS
might be treated differently); I wonder whether you somehow list the
constraints implemented for each tag. How did the applicant know that
he would have to use Tajik in order to get a Cyrillic letter into an
otherwise Latin label?

As for the Tajik writing system: why is it then necessary to allow
mixed scripts? Wouldn't the Tajik users be satisfied if you could
either register all-Latin or all-Cyrillic labels (perhaps allowing
all-Arabic as well)?

Regards,
Martin


JFC (Jefsey) Morfin | 16 Feb 01:43

RE: homograph attacks

Dear Pat,
I have several questions here.

1. Where do you maintain an ASCII list of your language tags? Should it not 
be hosted on the IANA server and common to all the gTLDs?
2. Is there a list of the permitted Unicode code points per language? For 
example, I am interested in the French and Ukrainian sets.
3. Did you decide them by yourself, or did you gather a group of linguistic 
authorities to assist you? This would be very interesting.
4. Would there not be a way to register IDNs using their "xn--" version? 
It would simplify international management by resellers.

Thank you for your assistance.
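
On question 4, converting between a Unicode label and its "xn--" form is mechanical, as the sketch below suggests. It uses Python's built-in "idna" codec (IDNA 2003); the helper names `to_ace` and `from_ace` are hypothetical, and nothing here reflects how any registry actually accepts registrations.

```python
def to_ace(label: str) -> str:
    """Convert a Unicode label to its ASCII-compatible ("xn--") form."""
    return label.encode("idna").decode("ascii")


def from_ace(ace_label: str) -> str:
    """Convert an "xn--" label back to Unicode."""
    return ace_label.encode("ascii").decode("idna")


mixed = "p\u0430ypal"  # Latin letters plus U+0430 CYRILLIC SMALL LETTER A
ace = to_ace(mixed)

print(ace)
print(from_ace(ace) == mixed)
```

A plain ASCII label passes through unchanged, so the "xn--" prefix itself signals that a label contains non-ASCII characters.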

At 20:30 15/02/2005, Kane, Pat wrote:
>VeriSign does prevent domains with the Russian language tag from commingling
>A-Z with the Cyrillic characters.  It does permit 0-9 and the dash to be
>used.  This filter also applies to other Cyrillic based languages such as
>Belarusian, Ukrainian, Serbian, Macedonian and Bulgarian.
>
>There are other languages that are listed within ISO 639-2 that today use a
>combination of Latin and Cyrillic as they were originally Latin based (Tajik
>was Arabic prior to being Latin based), migrated to Cyrillic during the
>Soviet era and today are migrating back to Latin.  It is common to use Latin
>and Cyrillic characters in Tajik, from what I understand not being a native
>speaker.  Granted there are not a lot of registrations in com net that are
>Tajik, but this is just the point of an IDN.
>
>Pat Kane
>

Michel Suignard | 16 Feb 07:12

RE: homograph attacks

No language used in the former Soviet Union should require a mix of Latin and Cyrillic in a single DNS label.
Unicode contains many Latin homographs in the Cyrillic block exactly for that reason: to avoid mixing the
two scripts in a single word. It is unfortunate that the exact visual match is now haunting us. However, it
should not be used as a rationale to accept registration of mixed Cyrillic/Latin labels by TLD registries.

To answer another message in this thread, there is no definitive answer about which Unicode characters are
allowed for a given language. But in all languages that have a reasonable concept of 'words', you should
never need to allow mixed scripts in a word, at least in the context of an IDN label. There are exceptions to
this rule, like in South and East Asia (Japanese comes to mind), but these languages can be detected
reasonably using the Unicode script property.

Michel 
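
A single-script rule with an exception for Japanese might look something like the sketch below. Two things are assumptions rather than anything stated in the thread: the Python standard library does not expose the Unicode script property, so character-name prefixes stand in for it, and the `ALLOWED_MIXES` policy is purely hypothetical.

```python
import unicodedata

# Assumed name prefixes standing in for the Unicode script property.
SCRIPT_PREFIXES = {
    "LATIN": "Latin",
    "CYRILLIC": "Cyrillic",
    "GREEK": "Greek",
    "ARMENIAN": "Armenian",
    "HIRAGANA": "Hiragana",
    "KATAKANA": "Katakana",
    "CJK UNIFIED IDEOGRAPH": "Han",
}

# Script combinations tolerated within one label (hypothetical policy).
ALLOWED_MIXES = [{"Han", "Hiragana", "Katakana"}]


def scripts_in(label: str) -> set[str]:
    """Collect the scripts used in a label, ignoring digits and the hyphen."""
    found = set()
    for ch in label:
        if ch in "0123456789-":
            continue  # treated as script-neutral
        name = unicodedata.name(ch, "")
        for prefix, script in SCRIPT_PREFIXES.items():
            if name.startswith(prefix):
                found.add(script)
                break
        else:
            found.add("Unknown")
    return found


def is_acceptable(label: str) -> bool:
    """Allow one script per label, or a whitelisted combination."""
    scripts = scripts_in(label)
    if "Unknown" in scripts:
        return False
    return len(scripts) <= 1 or any(scripts <= mix for mix in ALLOWED_MIXES)


print(is_acceptable("p\u0430ypal"))
print(is_acceptable("\u65e5\u672c\u8a9e\u306e\u30c6\u30b9\u30c8"))
```

A production check would use real script data (for example from the Unicode Character Database) instead of name prefixes.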

-----Original Message-----
From: owner-idn <at> ops.ietf.org [mailto:owner-idn <at> ops.ietf.org] On Behalf Of Kane, Pat

VeriSign does prevent domains with the Russian language tag from commingling A-Z with the Cyrillic
characters.  It does permit 0-9 and the dash to be used.  This filter also applies to other Cyrillic based
languages such as Belarusian, Ukrainian, Serbian, Macedonian and Bulgarian.  

There are other languages that are listed within ISO 639-2 that today use a combination of Latin and
Cyrillic as they were originally Latin based (Tajik was Arabic prior to being Latin based), migrated to
Cyrillic during the Soviet era and today are migrating back to Latin.  It is common to use Latin and Cyrillic
characters in Tajik, from what I understand not being a native speaker.  Granted there are not a lot of
registrations in com net that are Tajik, but this is just the point of an IDN.

Pat Kane

