"Martin J. Dürst" | 1 Oct 12:47 2012

New charset registry entry for iso-8859-11, anybody?

Dear Charset Experts,

Behind the scenes, there have been some discussions about adding an 
entry for ISO-8859-11, Latin/Thai.

However, Wikipedia (http://en.wikipedia.org/wiki/ISO/IEC_8859-11) says 
the following:

 >>>>
ISO-8859-11 is not a registered IANA charset name despite following the 
normal pattern for IANA charsets based on the ISO 8859 series. However, 
the close equivalent TIS-620 (which lacks the non-breaking space) is 
registered with IANA, and can without problems be used for ISO/IEC 
8859-11, since the no-break space has a code which was unallocated in 
TIS-620.
 >>>>

I would like to get your feedback on the following alternative proposals:

1) Leave everything as is.

2) Add an alias "ISO-8859-11" to the TIS-620 entry (acknowledging 
current practice and ignoring the official difference at 0xA0 (*)).

3) Add a new entry of the form:

Name: ISO-8859-11 (preferred MIME name)
MIBenum: [TBD]
Source: ISO/IEC 8859-11:2001
Alias: csISOLatinThai
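As a quick sanity check on the claim that the two tables differ only at 0xA0, one can diff Python's bundled codecs (a sketch; it assumes Python's 'iso8859_11' and 'tis_620' codecs faithfully implement the respective standards):

```python
# Diff the iso8859_11 and tis_620 decoding tables byte by byte.
# Assumption: Python's bundled codecs match the published standards.
def decodes_to(raw, codec):
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None  # byte unassigned in this charset

diffs = [b for b in range(0x100)
         if decodes_to(bytes([b]), 'iso8859_11')
         != decodes_to(bytes([b]), 'tis_620')]
print([hex(b) for b in diffs])  # per the standards, only 0xa0 should differ
```

If the printed list is exactly ['0xa0'], then option 2) loses nothing except the official status of the no-break space.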
[...]

Anne van Kesteren | 17 Apr 11:38 2012

Encoding Standard (mostly complete)

Hi,

Apart from big5, all encoders and decoders are now defined.

http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html

Feedback is much appreciated.

(big5 is somewhat complicated unfortunately. See  
http://annevankesteren.nl/2012/04/big5 for more details.)

Kind regards,

-- 
Anne van Kesteren
http://annevankesteren.nl/

Anne van Kesteren | 20 Dec 11:59 2011

Encodings and the web

Hi,

When doing research into encodings as implemented by popular user agents I
have found the current standards lacking. In particular:

    * More encodings in the registry than needed for the web
    * Error handling for encodings is undefined (can lead to XSS exploits,
      also gives interoperability problems)
    * Often encodings are implemented differently from the standard

A year ago I did some research into encodings[1], and in more detail for
single-octet encodings[2], and I have now taken that further by starting
to define a standard[3] for encodings as they are to be implemented by
user agents. The current scope is roughly defining the encodings, their
labels and names, and how to match a label.

The goal is to unify encoding handling across user agents for the web so
legacy pages can be interpreted "correctly" (i.e. as expected by users).

If you are interested in helping out testing (and reverse engineering)
multi-octet encodings please let me know. Any other input is much
appreciated as well.
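For what it's worth, the "how you match a label" part can be sketched in a few lines (the table below is a tiny illustrative subset made up for the example, not the spec's actual table):

```python
# Sketch of label matching: strip ASCII whitespace, then compare
# ASCII-case-insensitively. LABELS is a small illustrative subset;
# the real label table lives in the Encoding Standard itself.
LABELS = {
    'utf-8': 'UTF-8',
    'unicode-1-1-utf-8': 'UTF-8',
    'latin1': 'windows-1252',
    'iso-8859-1': 'windows-1252',
}
ASCII_WHITESPACE = '\t\n\x0c\r '

def get_encoding(label):
    label = label.strip(ASCII_WHITESPACE)
    # ASCII-case-insensitive: lower only A-Z, leave other characters alone
    label = ''.join(chr(ord(c) + 0x20) if 'A' <= c <= 'Z' else c
                    for c in label)
    return LABELS.get(label)

print(get_encoding('  Latin1\n'))  # windows-1252
```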

Kind regards,

[1]<http://wiki.whatwg.org/wiki/Web_Encodings>
[2]<http://annevankesteren.nl/2010/12/encodings-labels-tested>
[3]<http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html>

-- 
Anne van Kesteren
http://annevankesteren.nl/

Leif Halvard Silli | 20 Dec 00:53 2011

Registration of new charset 'unicode'

Charset name:
     'unicode'

Charset aliases:
      The 'unicode' spec defines 'utf-16' as its alias, but this of
      course conflicts with 'utf-16' as defined in the IANA registry.

Suitability for use in MIME text:
      The 'unicode' charset has the same MIME text media issue as utf-16.
  [1] http://tools.ietf.org/rfc/rfc2781.txt

Published specification(s):
      Microsoft's 'Character Set Recognition' document,[2]
      together with the 'Code Page Identifiers' document.[3]
  [2] http://msdn.microsoft.com/en-us/library/aa752010(v=VS.85).aspx
  [3] http://msdn.microsoft.com/en-us/library/dd317756(v=VS.85).aspx

ISO 10646 equivalency table:
      The 'unicode' charset represents codepage 1200, whose definition
      is 'Unicode UTF-16, little endian byte order (BMP of ISO
      10646);'.[3]

Additional information:
    * Byte order mark (BOM): The 'unicode' charset specifications do 
not explain whether the BOM is required or recommended. However, 
without it, products may fail to determine the encoding. Also, the 
BOM allows products that do not support 'unicode' to perceive the 
encoding as 'utf-16'. Hence it is not surprising that products 
that label documents with the 'unicode' label tend to include the BOM, 
and the BOM in 'unicode'-encoded documents should be seen as strongly 
[...]

Shawn Steele | 15 Dec 19:50 2011

Are charset names supposed to be case sensitive?

Are charset names supposed to be case sensitive?

 

-Shawn

 

 

http://blogs.msdn.com/shawnste

 

Leif Halvard Silli | 15 Dec 12:19 2011

Registration of new charset 'unicode'

Charset name:
      unicode

Charset aliases:
      No aliases. (This is a willful violation of the spec upon
      which the registration of 'unicode' is based, see the NB!)

Suitability for use in MIME text:
      The 'unicode' charset labels the little-endian 'subset' of
      'UTF-16' and thus shares the same issue: It does 'not encode
      line endings in the way required for MIME "text" media'.
  [1] http://tools.ietf.org/rfc/rfc2781.txt

Published specification(s):
      The 'unicode' charset label covers 'codepage 1200':
  [2] http://msdn.microsoft.com/en-us/library/aa752010(v=VS.85).aspx
      Codepage 1200 covers a little-endian representation of UTF-16,
      including the BOM: 'Unicode UTF-16, little endian byte order
      (BMP of ISO 10646);'.
  [3] http://msdn.microsoft.com/en-us/library/dd317756(v=VS.85).aspx
      The reference to 'Unicode UTF-16' is taken to mean that the
      BOM MUST be present.

ISO 10646 equivalency table:
      The 'unicode' charset is equivalent to the BMP.[1][2] 

Additional information:
      The 'unicode' charset can be understood as the little-endian
  'subset' of 'UTF-16'. Thus, like 'UTF-16', it includes the BOM: If
  the resource doesn't contain a BOM, then it isn't 'unicode'-encoded.
  Applications generating resources with the 'unicode' label
  (example: <META content="text/html; charset=unicode" 
  http-equiv=Content-Type>) are known to insert the BOM. When parsing
  e.g. media of MIME type 'text/html', Internet Explorer is known
  NOT to pick 'unicode' (or any other of the 16-bit UTF variants) 
  as the encoding unless there is a BOM. (Minor exception for 
  'text/html': if the HTTP Content-Type: header contains 'unicode'
  in the charset parameter, then IE renders the 'text/html' resource 
  fine even without a BOM - but only as long as the resource isn't 
  loaded from cache.)
     NB! Alias: At the time of this registration, the spec upon which 
  the registration of the 'unicode' and the 'unicodeFFFE' charsets is
  based defines 'utf-16' (lowercase) as an alias for 'unicode'.[2]
  This is incompatible with the registered semantics of (uppercase) 
  'UTF-16' (RFC 2781), as it causes implementations - such as Internet
  Explorer (IE) - to interpret 'utf-16' (irrespective of case) to mean
  'little-endian'. Usually, because a BOM takes precedence (the BOM is
  a MUST for 'unicode', 'unicodeFFFE' and 'UTF-16' alike), the problem is
  solved by the BOM. But otherwise, unless implementations adhere to 
  the 'unicode' registration and thus reject 'utf-16' as an alias for
  'unicode', big-endian MIME text resources that are labelled as 
  'UTF-16' risk being mis-rendered (causing 'mojibake').

Intended usage:
      LIMITED USE. It is used by a large community of Microsoft product 
users, but is also supported, across different platforms, by products 
that want to be compatible. By 'compatible' is meant e.g. tools, such 
as editors, in need of determining the encoding or advice about the 
best charset label. In that regard: Any resource that can be validly 
labeled as 'unicode' could also validly (and probably ought to) be 
labelled as 'UTF-16'. Another example is the encoding sniffing 
algorithm of HTML5, which in certain circumstances requires a charset 
label that names 'a UTF-16 encoding' (such as 'unicode') to be 
interpreted as if the value instead were 'UTF-8'.

Person & email address to contact for further information:
      Leif Halvard Silli, xn--mlform-iua <at> xn--mlform-iua.no
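A sketch of the BOM rule described above (the helper name is hypothetical; this illustrates the registration's intent, not any shipped implementation):

```python
import codecs

# Hypothetical helper: for a resource labelled 'unicode'/'unicodeFFFE'
# the BOM is a MUST, and it - not the label - decides the byte order.
def decode_bom_utf16(data):
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):].decode('utf-16-le')
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[len(codecs.BOM_UTF16_BE):].decode('utf-16-be')
    raise ValueError("no BOM: not valid 'unicode'/'unicodeFFFE' data")

print(decode_bom_utf16(codecs.BOM_UTF16_LE + 'Hi'.encode('utf-16-le')))
```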

Leif Halvard Silli | 15 Dec 12:19 2011

Registration of new charset 'unicodeFFFE'

Charset name: 
      unicodeFFFE

Charset aliases:
      No aliases.

Suitability for use in MIME text:
      The 'unicodeFFFE' charset labels the big-endian 'subset' of
      'UTF-16' and thus shares the same issue: It does 'not encode
      line endings in the way required for MIME "text" media'. 
  [1] http://tools.ietf.org/rfc/rfc2781.txt

Published specification(s):
      The 'unicodeFFFE' charset label covers 'codepage 1201':
  [2] http://msdn.microsoft.com/en-us/library/aa752010(v=VS.85).aspx
      Codepage 1201 covers a big-endian representation of
      'UTF-16', including the BOM: 'Unicode UTF-16, big endian byte
      order; available only to managed applications'.
  [3] http://msdn.microsoft.com/en-us/library/dd317756(v=VS.85).aspx
      The reference to 'Unicode UTF-16' is taken to mean that the
      BOM MUST be present.

ISO 10646 equivalency table:
      The 'unicodeFFFE' charset (codepage 1201) is the big-endian 
  equivalent to 'unicode' (codepage 1200), which in turn represents
  'BMP of ISO 10646'.[2] Thus 'unicodeFFFE' is equivalent to the BMP.

Additional information: 
      The 'unicodeFFFE' charset can be understood as the big-endian 
  'subset' of 'UTF-16'. Thus, like 'UTF-16'-encoded resources, 
  'unicodeFFFE'-encoded resources include the BOM: If the resource 
  doesn't contain a BOM, then it isn't 'unicodeFFFE'-encoded. 
  Applications generating resources with the 'unicodeFFFE' label
  (example: <META content="text/html; charset=unicodeFFFE" 
  http-equiv=Content-Type>) are known to insert the BOM. When parsing
  e.g. media of MIME type 'text/html', Internet Explorer is known
  NOT to pick 'unicodeFFFE' (or any other of the 16-bit UTF variants) 
  as the encoding unless there is a BOM. (Minor exception for 
  'text/html': if the HTTP Content-Type: header contains 'unicodeFFFE'
  in the charset parameter, then IE renders the 'text/html' resource 
  fine even without a BOM - but only as long as the resource isn't 
  loaded from cache.)
     NB! Alias: At the time of this registration, the spec upon which
  the registration of the 'unicodeFFFE' and the 'unicode' charsets is
  based defines 'utf-16' (lowercase) as an alias for 'unicode'.[2] 
  This is incompatible with the registered semantics of (uppercase) 
  'UTF-16' (RFC 2781), as it causes implementations - such as Internet
  Explorer (IE) - to interpret 'utf-16' (irrespective of case) to mean
  'little-endian'. Usually, because a BOM takes precedence (the BOM is
  a MUST for 'unicode', 'unicodeFFFE' and 'UTF-16' alike), the problem is
  solved by the BOM. But otherwise, unless implementations adhere to 
  the 'unicode' registration and thus reject 'utf-16' as an alias for
  'unicode', big-endian MIME text resources that are labelled as 
  'UTF-16' risk being mis-rendered (causing 'mojibake').

Intended usage:
      LIMITED USE. It is used by a large community of Microsoft product 
users, but is also supported, across different platforms, by products 
that want to be compatible. By 'compatible' is meant e.g. tools, such 
as editors, in need of determining the encoding or advice about the 
best charset label. In that regard: Any resource that can be validly 
labeled as 'unicodeFFFE' could also validly (and probably ought to) be 
labelled as 'UTF-16'. Another example is the encoding sniffing 
algorithm of HTML5, which in certain circumstances requires a charset 
label that names 'a UTF-16 encoding' (such as 'unicodeFFFE') to be 
interpreted as if the value instead were 'UTF-8'.

Person & email address to contact for further information: 
      Leif Halvard Silli, xn--mlform-iua <at> xn--mlform-iua.no

Leif Halvard Silli | 15 Dec 07:53 2011

How to register 'unicode'/'unicodeFFFE' ?

Hi! I am ready to submit - and have prepared - two registrations for 
the 'unicode' and the 'unicodeFFFE' charset. The two charsets are 
variants of 'UTF-16', and they only differ from each others with regard 
to their endianness. Each charset includes the BOM. The registrations 
are based on Microsoft's specifications:

http://msdn.microsoft.com/en-us/library/aa752010(v=VS.85).aspx
http://msdn.microsoft.com/en-us/library/dd317756(v=VS.85).aspx

The purpose of the registrations would be to 'document existing 
practice in a large community' and should thus "be explicitly marked as 
being of limited or specialized use and should only be used in Internet 
messages with prior bilateral agreement".

http://tools.ietf.org/html/rfc2978#section-2.5

In an ideal world, 'unicode'/'unicodeFFFE' would not be necessary to 
register: We have 'UTF-16', for which the endianness can be signalled 
via the BOM. Thus one can switch the endianness freely, without having 
to relabel. For the 'unicode' and 'unicodeFFFE' charsets, by contrast, 
if one changes the endianness, then one must also switch to the other 
(or: another) charset label. This is a major reason not to use these 
charsets.

So far so good: Because both charsets include the BOM, the BOM takes 
precedence - in particular if the name of the label is not supported by 
the implementation. Opera and Firefox are in that category, for 
example. And even Microsoft seems to be in that league, as e.g. IE has 
no problems handling a little-endian file which includes the 
'unicodeFFFE' label, as long as the document *also* contains the BOM.

However, Microsoft's spec includes one additional detail which is not 
only impractical but also dangerous: 'utf-16' (formally in lowercase) 
is seen by the Microsoft spec as an alias for 'unicode' - the 
little-endian charset variant. This is of course incompatible with the 
'UTF-16' charset, and so the registrations I have prepared reject this 
detail. However, for applications that implement the current Microsoft 
specification (such as IE), this nevertheless means that if your 
'text/html' document is big-endian, but without the BOM, and you 
then send 'UTF-16' via the HTTP Content-Type: charset parameter, then 
you can be certain that IE treats the document as little-endian, with 
'mojibake' as the result.

  (You probably would not like to send 'UTF-16' via HTTP Content-Type, 
though - except as a 'back-up' solution in addition to the BOM, because 
IE does not seem to cache encoding information sent this way. And so 
your document would be misinterpreted if you used the back button. For 
XML, by contrast, the situation seems better than for 'text/html' 
- perhaps because XML defaults to either UTF-8 or UTF-16.)
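The mis-rendering is easy to reproduce: decoding a BOM-less big-endian stream as little-endian byte-swaps every code unit (a small sketch; the sample text is mine):

```python
text = 'Ab'
be = text.encode('utf-16-be')   # big-endian UTF-16, deliberately no BOM
# What an implementation that takes 'utf-16' to mean little-endian sees:
wrong = be.decode('utf-16-le')
print(wrong == text)            # False - every code unit is byte-swapped
```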

As I pondered over this, I first considered that 'unicode' and 
'unicodeFFFE' had to be registered as aliases for 'UTF-16'. However, 
the fact that each charset supports only one 'subset' of UTF-16 (which 
is a single charset/encoding) meant that there had to be two charsets, 
if the Microsoft reality is taken as the basis.

That said: We should support reality, and not Microsoft reality. And in 
that regard: Because both charsets include the BOM, it is simple to 
treat them as aliases for 'UTF-16' - it is only when you create an 
invalid UTF-16 encoding (that is: you omit the BOM) that legacy IE 
risks acting up. (IE always considers the BOM before anything else - 
even before HTTP Content-Type, it seems.)

So there are actually two possibilities here: EITHER to update the 
'UTF-16' registration to also cover 'unicode' and 'unicodeFFFE' - then 
we would also pretty much automatically discourage their use as there 
would be a clear recommendation in place to use the preferred name -  
'UTF-16' - instead. OR, the other option: To register them as two 
separate charsets.

Making the two labels into aliases of 'UTF-16' would - formally - give 
them a more prominent status than registering them independently for 
'limited use'. To register them as aliases would be to *not* base them 
on 'Microsoft reality'. Such a thing could perhaps make Microsoft align 
itself more with 'UTF-16' as it is registered? Another problem with 
registering them as independent charsets would be that it is less clear 
how non-Microsoft products should handle them. Does anyone know if IE10 
behaves any differently w.r.t. UTF-16? Is there a direction towards the 
standard?

To update the UTF-16 registration seems simple - only a matter of 
adding the aliases: 
<http://www.iana.org/assignments/charset-reg/UTF-16>. So I have started 
to, again, consider that the best option.

I had planned to send the registrations now, but I would like to gather 
some responses first. However, if the expert reviewers would like, I 
could post the registrations that I have prepared ASAP - often it is 
better to have something concrete to look at. (That being said, I have 
covered very many of the issues in this message ...)

With regards,
Leif H Silli

Shawn Steele | 22 Sep 17:27 2011

RE: Common/Limited Use

That's why a definition would've been good :)  

I'm not sure there's much point to add the information "now".  Either your app's stuck with some legacy
restriction requiring code page support or you should be using Unicode.  So either some code page(s) you
already know about on this list are important to you, or they aren't.  Whether we call them "common,
limited, or obsolete" doesn't change that.

It might be good if http://www.iana.org/assignments/character-sets and
http://www.iana.org/assignments/charset-reg/index.html were updated to include a note to the
effect that UTF-8 or UTF-16 may be preferred in many cases, something like:

"Many character sets are limited in their scope and many others have inconsistent implementations
between vendors.  To enable the broadest set of characters in a manner that is most consistent between
vendors, implementers should consider using UTF-8 or UTF-16."

-Shawn

 
http://blogs.msdn.com/shawnste



________________________________________
From: "Martin J. Dürst" [duerst <at> it.aoyama.ac.jp]
Sent: Wednesday, September 21, 2011 9:09 PM
To: Shawn Steele
Cc: Ira McDonald; ietf-charsets <at> mail.apps.ietf.org; Makoto Murata (eb2m-mrt <at> asahi-net.or.jp)
Subject: Re: Big5 / CP950

On 2011/09/22 3:05, Shawn Steele wrote:
> I saw the “one of…”, but they aren’t defined in the RFC?  Your spirit
> of Limited Use sounds about right for big5 though.

I agree.

For some more comments, please see below.

> Thanks,
> Shawn
>
> From: Ira McDonald [mailto:blueroofmusic <at> gmail.com]
> Sent: Wednesday, September 21, 2011 9:41 AM
> To: Shawn Steele; Ira McDonald
> Cc: "Martin J. Dürst"; ietf-charsets <at> mail.apps.ietf.org; Makoto Murata (eb2m-mrt <at> asahi-net.or.jp)
> Subject: Re: Big5 / CP950
>
> Hi Shawn,
>
> RFC 2978 section 5 'Charset Registration Template'
>
>       "Intended usage:
>
>       (One of COMMON, LIMITED USE or OBSOLETE)"
>
> The spirit of LIMITED USE has been to discourage the use
> of legacy charsets that are particularly problematic - Big5.
>
> Not sure if OBSOLETE has ever been used.

I haven't checked, but I guess these were not introduced when the
charset registry was created, but with a later update.

I assume the distinction between COMMON and LIMITED USE was originally
intended as some kind of advice to implementers: If it's COMMON, then
make sure it's supported, if it's LIMITED USE, you may not need it. But
I don't think that has ever really worked.


> Martin - searching for this made me realize that the
> plaintext IANA Charset Registry at
>
>    ftp://ftp.iana.org/assignments/character-sets
>
> contains 257 entries - they don't include the Intended
> Usage field.
>
> I suggest we work w/ IANA to change the plaintext
> registry.

Assuming somebody has lots of spare time, that would indeed be a good
idea. Assuming that everybody's time is rather limited, it may have to
wait. There are quite a few other things in the registry that might
benefit from clearing up, but the critical mass may not be reached yet.

> In most cases this data is long lost (if ever submitted)
> because the directory
>
>    ftp://ftp.iana.org/assignments/charset-reg
>
> contains only 55 entries.

Lots of stuff was taken from http://tools.ietf.org/html/rfc1345 (and
some other places). There's no need to keep that kind of information in
separate templates.

Regards,    Martin.

> Cheers,
> - Ira
>
> Ira McDonald (Musician / Software Architect)
> Chair - Linux Foundation Open Printing WG
> Co-Chair - IEEE-ISTO PWG IPP WG
> Chair - TCG Embedded Systems Hardcopy SWG
> IETF Designated Expert - IPP&  Printer MIB
> Blue Roof Music/High North Inc
> http://sites.google.com/site/blueroofmusic

> http://sites.google.com/site/highnorthinc

> mailto:blueroofmusic <at> gmail.com
> Christmas through April:
>    579 Park Place  Saline, MI  48176
>    734-944-0094
> May to Christmas:
>    PO Box 221  Grand Marais, MI 49839
>    906-494-2434
>
Shawn Steele | 19 Sep 19:23 2011

Big5 / CP950

Murata-san has asked us to update the big5 entry similarly to what we did for shift_jis, pointing out that Big5 has vendor-specific variations as well.  E.g., add something like this:

 

Several vendor specific charsets that derive from Big5 often use
the Big5 name instead of a more specific vendor charset name.
Windows Code Page 950 is one example; Big5-HKSCS, Big5+ and
several font specific variations are others.

 

However, I don’t see a big5 entry in the charset registry, only the entry in the Character Sets table.  There is an entry for Big5-HKSCS (which probably fits the definition of one of the Big5 variants above), but HKSCS is a variant.

 

Am I missing something?  Should a new entry for big5 be created (pointing to something like http://unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT)?  Other suggestions?
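Should a new big5 entry end up pointing at a mapping table like BIG5.TXT, note that those unicode.org files are plain text with '0xXXXX<tab>0xYYYY' pairs and '#' comments, so a loader is short (a sketch under that file-format assumption):

```python
def load_mapping(lines):
    """Build {source_code: unicode_codepoint} from mapping-file lines."""
    table = {}
    for line in lines:
        line = line.split('#', 1)[0].strip()  # drop comments
        if not line:
            continue
        src, dst = line.split()[:2]           # e.g. '0xA140', '0x3000'
        table[int(src, 16)] = int(dst, 16)
    return table

sample = ["# Big5 sample", "0xA140\t0x3000\t# IDEOGRAPHIC SPACE"]
print(load_mapping(sample))  # maps 0xA140 -> 0x3000
```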

 

Thanks,

 

-Shawn

 

 

http://blogs.msdn.com/shawnste

 

Shawn Steele | 10 Nov 23:52 2010

shift_jis / windows-31J

A dozen years ago windows-31J was created because people noticed that there were lots of different flavors of shift_jis floating around.  Uniquely identifying them may have made sense; however, the windows-31J term has never really been widely adopted for the windows code page 932 behavior.

 

So I’d like to propose the following updates, loosely based on discussion about variants some time ago.  I’d be happy to accept other suggestions that help users discover that some text is tagged with the less-specific shift_jis name rather than the more specific vendor charset name.

 

Name: Windows-31J
MIBenum: 2024
Source: Windows Japanese.  A variant of Shift_JIS to include
        NEC special characters (Row 13), NEC selection of IBM
        extensions (Rows 89 to 92), and IBM extensions (Rows
        115 to 119).  The CCS's are JIS X0201:1997,
        JIS X0208:1997, and these extensions.  This charset
        can be used for the top-level media type "text", but
        it is of limited or specialized use (see RFC2278).
        PCL Symbol Set id: 19K.  Windows-31J text is commonly
        declared with the shift_jis name of the parent charset.
Alias: csWindows31J
Alias: shift_jis+cp932
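The practical gap between the two entries can be illustrated with Python's codecs (assuming its 'cp932' and 'shift_jis' codecs track the vendor and JIS definitions respectively): NEC special characters decode under code page 932 but are rejected by strict Shift_JIS.

```python
# 0x8740 is an NEC special character (Row 13): the circled digit one.
nec = b'\x87\x40'
print(nec.decode('cp932'))        # decodes under Windows code page 932

try:
    nec.decode('shift_jis')       # strict JIS-based codec
except UnicodeDecodeError:
    print('not valid strict Shift_JIS')
```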

 

Name: Shift_JIS  (preferred MIME name)
MIBenum: 17
Source: This charset is an extension of csHalfWidthKatakana by
        adding graphic characters in JIS X 0208.  The CCS's are
        JIS X0201:1997 and JIS X0208:1997.  The
        complete definition is shown in Appendix 1 of JIS
        X0208:1997.
        This charset can be used for the top-level media type "text".
        Several vendor specific charsets that derive from shift_jis
        often use the shift_jis name instead of a more specific
        vendor charset name.
Alias: MS_Kanji
Alias: csShiftJIS

 

 

 

 

- Shawn

 

 

http://blogs.msdn.com/shawnste


 

