Etan Wexler | 3 Dec 00:47 2003

Re: UTF-8 signature / BOM in CSS


Richard Ishida wrote to <mailto:www-international <at> w3.org>, 
<mailto:w3c-css-wg <at> w3.org>, and <mailto:w3c-i18n-ig <at> w3.org> on 2 
December 2003 in "RE: UTF-8 signature / BOM in CSS" 
(<mid:005301c3b8e4$1d862250$6501a8c0 <at> w3c40upc3ma3j2>):

> I wonder whether CSS can introduce a change to CSS2.1 at this stage to
> clarify that the BOM - particularly any UTF-8 signature - should not be
> considered part of the following text.

I'd like to see such a revision made.

CSS specifications should mandate a preparation phase for CSS 
consumption. In this phase, a CSS engine would strip an initial BOM, if 
present, and strip all noncharacters. After this phase, a clean stream 
of Unicode characters gets passed to the tokenizer; parsing proceeds as 
specified in the grammar.
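
A minimal sketch of such a preparation phase, assuming the style sheet has already been decoded into a Python string. The function names are hypothetical; the noncharacter test follows the Unicode definition (U+FDD0..U+FDEF plus every code point whose low 16 bits are FFFE or FFFF).

```python
def is_noncharacter(cp: int) -> bool:
    """True for the 66 Unicode noncharacters."""
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in each of the 17 planes.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def prepare(text: str) -> str:
    """Strip one initial U+FEFF, then drop every noncharacter."""
    if text.startswith("\ufeff"):
        text = text[1:]
    # A U+FEFF that is not at the very start is left for the tokenizer.
    return "".join(ch for ch in text if not is_noncharacter(ord(ch)))


clean = prepare("\ufeffp { color: red }")
```

Only the first U+FEFF is removed here; whether later occurrences should survive is a separate question that this sketch does not settle.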

By the way, what UTF-8 signatures exist besides U+FEFF?

-- 
Etan Wexler.
(Sorry about the character munging in my original message. And sorry 
about using my unsubscribed address, thus splitting the thread. I'm 
reconnecting with www-style.)

Tex Texin | 3 Dec 05:26 2003

Re: UTF-8 signature / BOM in CSS


Etan,

I am not sure I would agree with stripping non-characters. I would
rather reject documents with junk in them than silently clean them up.

In the case of the UTF-8 BOM, I would not object to simply stripping it,
but it does seem odd to not make use of the information about the
document's encoding and odder still to not use the information about
endian-ness in a UTF-16 encoded document. Also stripping it in the case
of UTF-16 would eliminate useful information from a CSS document.

To answer your question about other BOMs, they are all based on U+FEFF,
but they exist for UTF-16, UTF-32, and SCSU (Unicode compression).
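
As a rough illustration, the serialized forms of U+FEFF can be told apart by their leading bytes. The sketch below uses the BOM constants from Python's standard codecs module plus the SCSU signature (0E FE FF) written as a literal; the function name and the table order are assumptions made for the example.

```python
import codecs

# Longest signatures first: the UTF-32LE BOM (FF FE 00 00) begins with
# the UTF-16LE BOM (FF FE), so it must be tested before it.
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),   # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "UTF-32LE"),   # FF FE 00 00
    (codecs.BOM_UTF8, "UTF-8"),          # EF BB BF
    (b"\x0e\xfe\xff", "SCSU"),           # 0E FE FF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),   # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),   # FF FE
]


def sniff_signature(data: bytes):
    """Return (scheme name, signature length), or (None, 0) if no BOM."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name, len(signature)
    return None, 0
```

Note that FF FE 00 00 is ambiguous in principle: it could also be a UTF-16LE BOM followed by U+0000. This sketch simply prefers the UTF-32 reading.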

I have a list with more detail here:
http://www.i18nguy.com/unicode/c-unicode.html#BOM

and the Unicode Consortium has a FAQ on UTF-8 and the BOM at:
http://www.unicode.org/faq/utf_bom.html

tex

Etan Wexler wrote:
> 
> Richard Ishida wrote to <mailto:www-international <at> w3.org>,
> <mailto:w3c-css-wg <at> w3.org>, and <mailto:w3c-i18n-ig <at> w3.org> on 2
> December 2003 in "RE: UTF-8 signature / BOM in CSS"
> (<mid:005301c3b8e4$1d862250$6501a8c0 <at> w3c40upc3ma3j2>):

Etan Wexler | 5 Dec 22:30 2003

Re: UTF-8 signature / BOM in CSS


Tex Texin wrote to <mailto:www-international <at> w3.org>,
<mailto:w3c-css-wg <at> w3.org>, <mailto:w3c-i18n-ig <at> w3.org>, and 
<mailto:www-style <at> w3.org> on 2 December 2003 in "Re: UTF-8 signature / 
BOM in CSS" (<mid:3FCD6609.7C5A8F4F <at> i18nguy.com>):

> I am not sure I would agree with stripping non-characters. I would
> rather reject documents with junk in them than silently clean them up.

I used to be of the junk-rejection mentality. Ian Hickson, time, and 
probably some brain-altering medication have convinced me of the case 
for parsing at all costs.

> In the case of the UTF-8 BOM, I would not object to simply stripping it,
> but it does seem odd to not make use of the information about the
> document's encoding and odder still to not use the information about
> endian-ness in a UTF-16 encoded document.

I assumed that the CSS engine would make use of out-of-band information 
to indicate the detected encoding scheme. Or the CSS engine would 
internally convert style sheets' encodings to a single chosen encoding 
(say, UTF-16BE). Or the CSS engine would parse bytes into 
encoding-independent character objects. The CSS engine would then pass 
these character objects to the tokenizer, with the original encoding 
scheme becoming irrelevant to understanding the CSS.
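
A small sketch of that last option, assuming Python strings stand in for "encoding-independent character objects". The helper name is hypothetical. Python's "utf-8" and "utf-16-be" codecs leave a leading U+FEFF in place, so the BOM reaches the next stage as an ordinary code point whatever the byte encoding was.

```python
def to_characters(data: bytes, scheme: str) -> str:
    """Decode bytes into a sequence of Unicode code points."""
    return data.decode(scheme)


sheet = "\ufeffp { color: red }"
from_utf8 = to_characters(sheet.encode("utf-8"), "utf-8")
from_utf16 = to_characters(sheet.encode("utf-16-be"), "utf-16-be")

# Different bytes on the wire, identical characters for the tokenizer.
same_characters = from_utf8 == from_utf16
```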

> Also stripping it in the case
> of UTF-16 would eliminate useful information from a CSS document.


Tex Texin | 6 Dec 03:55 2003

Re: UTF-8 signature / BOM in CSS


Etan,

I would be happy for either the brain-altering meds, or some
justification.

I went the other way. I used to be for being tolerant on reading and
strict on writing and now I would prefer strict everywhere. Being
tolerant disguises and perpetuates problems, introduces security risks,
and leads to unpredictable behavior. It also causes users to think the
technology is mysterious and unpredictable rather than being
decipherable and manageable. And the benefit? I can't think of one.
I'd be happy to understand how being accepting is beneficial.

Regards,

Tex

Etan Wexler wrote:
> 
> Tex Texin wrote to <mailto:www-international <at> w3.org>,
> <mailto:w3c-css-wg <at> w3.org>, <mailto:w3c-i18n-ig <at> w3.org>, and
> <mailto:www-style <at> w3.org> on 2 December 2003 in "Re: UTF-8 signature /
> BOM in CSS" (<mid:3FCD6609.7C5A8F4F <at> i18nguy.com>):
> 
> > I am not sure I would agree with stripping non-characters. I would
> > rather reject documents with junk in them than silently clean them up.
> 
> I used to be of the junk-rejection mentality. Ian Hickson, time, and
> probably some brain-altering medication have convinced me of the case

Chris Lilley | 6 Dec 16:48 2003

Re: UTF-8 signature / BOM in CSS


On Friday, December 5, 2003, 10:30:37 PM, Etan wrote:

EW> Tex Texin wrote to <mailto:www-international <at> w3.org>,
EW> <mailto:w3c-css-wg <at> w3.org>, <mailto:w3c-i18n-ig <at> w3.org>, and 
EW> <mailto:www-style <at> w3.org> on 2 December 2003 in "Re: UTF-8 signature /
EW> BOM in CSS" (<mid:3FCD6609.7C5A8F4F <at> i18nguy.com>):

>> I am not sure I would agree with stripping non-characters. I would
>> rather reject documents with junk in them than silently clean them up.

EW> I used to be of the junk-rejection mentality. Ian Hickson, time, and
EW> probably some brain-altering medication have convinced me of the case
EW> for parsing at all costs.

Probably the influence of too much HTML.

I refer you to the TAG Architecture document
http://www.w3.org/TR/webarch/#error-handling

Principle: Error recovery

  Silent recovery from error is harmful.

>> In the case of the UTF-8 BOM, I would not object to simply stripping it,

The BOM is not an error. Nor is it a character, invalid or otherwise.

Invalid characters are errors

François Yergeau | 6 Dec 20:56 2003

Re: UTF-8 signature / BOM in CSS


Chris Lilley wrote:
> Almost correct. There are various byte sequences, all of which encode
> U+FEFF, which is a byte order mark and not a character.

That's one way to see it, but another way is to consider it a character 
and to bring it squarely in the grammar of a language, like I proposed 
recently for CSS:

  EncodingDecl = [BOM][@charset=<foobar>]

with the additional constraint that EncodingDecl must occur at the start 
of the stylesheet.
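
For illustration only, the production can be sketched as a regular expression anchored at the start of the decoded style sheet. This assumes the CSS 2.1 surface syntax `@charset "name";` in place of the informal `@charset=<foobar>` above; the function name and the return shape are hypothetical.

```python
import re

# Optional U+FEFF, then an optional @charset rule, only at the very start.
ENCODING_DECL = re.compile(
    r'\A(?P<bom>\ufeff)?(?:@charset "(?P<charset>[^"]*)";)?'
)


def parse_encoding_decl(text: str):
    """Return (has_bom, charset or None, rest of the style sheet)."""
    # Every part of the pattern is optional, so match() always succeeds.
    m = ENCODING_DECL.match(text)
    return m.group("bom") is not None, m.group("charset"), text[m.end():]
```

Because the match is anchored with `\A`, a U+FEFF or an @charset rule appearing anywhere else is left in the remaining text and is not treated as part of the declaration.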

The BOM is a pretty mysterious beast for many, with a somewhat fuzzy 
status, and the above has the advantage of making it and its role 
explicit, instead of living in some strange layer somewhere between 
byte sequences and character sequences.

-- 
François

Etan Wexler | 7 Dec 04:07 2003

Parsing everything


Tex Texin wrote to <mailto:www-international <at> w3.org>, 
<mailto:w3c-css-wg <at> w3.org>, <mailto:w3c-i18n-ig <at> w3.org>, and 
<mailto:www-style <at> w3.org> on 5 December 2003 in "Re: UTF-8 signature / 
BOM in CSS" (<mid:3FD1453D.3252B143 <at> i18nguy.com>):

> I would be happy for either the brain-altering meds

   http://www.fluoxetine.com/
   http://www.abilify.com/

Cheers. (And goodbye, refractory depression!)

> or some justification.

I won't repeat Ian Hickson's arguments or ask him to do so. The 
discussion is archived. What I consider Ian's salient points are in the 
following messages.

   http://lists.w3.org/Archives/Public/www-style/2003Mar/0028.html
   http://lists.w3.org/Archives/Public/www-style/2003Feb/0199.html
   http://lists.w3.org/Archives/Public/www-style/2003Feb/0254.html
   http://lists.w3.org/Archives/Public/www-style/2003Feb/0278.html

In summary:

Junk happens. Declaring it unparsable does not eliminate it and 
probably won't significantly reduce it.

If we don't specify the error handling for every case, agents will vary 

Etan Wexler | 7 Dec 04:08 2003

Parsing everything


Chris Lilley wrote to <mailto:www-international <at> w3.org>, 
<mailto:w3c-css-wg <at> w3.org>, <mailto:w3c-i18n-ig <at> w3.org>, and 
<mailto:www-style <at> w3.org> on 6 December 2003 in "Re: UTF-8 signature / 
BOM in CSS" (<mid:862788409.20031206164822 <at> w3.org>):

> Etan wrote:
>
> EW> [...] convinced me of the case
> EW> for parsing at all costs.
>
> Probably the influence of too much HTML.

Maybe so, but not in the way that I think that you're thinking. I 
didn't mean that agents should parse in any way they choose, but that 
the specification should prescribe rules for parsing everything. My 
experience with HTML teaches me that underspecifying error handling is 
a mistake.

> I refer you to the TAG Architecture document
> http://www.w3.org/TR/webarch/#error-handling
>
> Principle: Error recovery
>
>   Silent recovery from error is harmful.

And immediately following this statement in the same document:

     Good practice: Specify error handling


Etan Wexler | 7 Dec 04:08 2003

Re: UTF-8 signature / BOM in CSS


Chris Lilley wrote to <mailto:www-international <at> w3.org>, 
<mailto:w3c-css-wg <at> w3.org>, <mailto:w3c-i18n-ig <at> w3.org>, and 
<mailto:www-style <at> w3.org> on 6 December 2003 in "Re: UTF-8 signature / 
BOM in CSS" (<mid:862788409.20031206164822 <at> w3.org>):

> EW> I assumed that the CSS engine would make use of out-of-band
> EW> information to indicate the detected encoding scheme.
>
> Please check the definition of that out of band information, [in]
> particular what it says about when a BOM must be present.

Perhaps I was unclear. I did not mean that the CSS engine would 
propagate the "charset" parameter's value unmodified. What I had in 
mind is as follows.

The CSS engine retrieves a style sheet. It could be from HTTP, the 
local file system, FTP, SMTP + MIME, a database, or any source, really. 
The CSS engine detects an encoding scheme according to the prescribed 
or accepted best practice. Factors that could determine the detection 
include a "charset" parameter, a byte-order mark (U+FEFF), a database 
schema, a file name extension, and the native byte order of the local 
machine. Once the encoding scheme is detected, it is noted for further 
use. The encoding scheme will never be noted as UTF-16 or UTF-32.
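
A sketch of the "noting" step under simplifying assumptions: only the "charset" label and a leading BOM are consulted, and a missing BOM falls back to big-endian. The function name is hypothetical, and the other factors listed above (database schema, file name extension, native byte order) are left out.

```python
import codecs


def note_scheme(label: str, data: bytes) -> str:
    """Resolve a charset label to a concrete encoding scheme name."""
    name = label.upper()
    if name == "UTF-16":
        # Never note plain "UTF-16": pick a byte order from the BOM.
        if data.startswith(codecs.BOM_UTF16_LE):
            return "UTF-16LE"
        return "UTF-16BE"  # BOM FE FF, or no BOM (assumed big-endian)
    if name == "UTF-32":
        if data.startswith(codecs.BOM_UTF32_LE):
            return "UTF-32LE"
        return "UTF-32BE"  # BOM 00 00 FE FF, or no BOM (assumed big-endian)
    return name
```

With this, `note_scheme("UTF-16", data)` always yields either "UTF-16BE" or "UTF-16LE", so later stages never see the ambiguous label.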

There is no encoding scheme UTF-16. There is a "charset" value in the 
IANA registry called "UTF-16", but UTF-16 is an encoding form. Any 
serialized UTF-16 document is either big-endian or little-endian. 
Nevertheless, the "UTF-16" label is allowed and in use, so we resort to 

Etan Wexler | 7 Dec 04:08 2003

Re: UTF-8 signature / BOM in CSS


François Yergeau wrote to <mailto:www-international <at> w3.org>, 
<mailto:w3c-css-wg <at> w3.org>, <mailto:w3c-i18n-ig <at> w3.org>, and 
<mailto:www-style <at> w3.org> on 6 December 2003 in "Re: UTF-8 signature / 
BOM in CSS" (<mid:3FD23453.6000009 <at> yergeau.com>):

> [...] another way is to consider [the BOM] a character and to bring it 
> squarely in the grammar of a language, like I proposed recently for 
> CSS:
>
>  EncodingDecl = [BOM][@charset=<foobar>]
>
> with the additional constraint that EncodingDecl must occur at the 
> start of the stylesheet.

Is the BOM to be considered an identifier character? That's possible. 
Then an identifier consisting solely of one U+FEFF would be allowed at 
the beginning of a style sheet. But the codepoint U+FEFF could just as 
well be tokenized as its own type and grouped with "S" (space tokens) 
and comments as a separator of other tokens. This latter approach is 
not backwards compatible in a formal sense, but how many existing 
Cascading Style Sheets make use of U+FEFF in identifiers? About zero, 
I'd guess.
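
The two options can be compared with a toy tokenizer. This is a sketch, not the CSS grammar: it knows only space, identifier, and single-character delimiter tokens, it omits escapes and leading hyphens, and the function name and the `bom_is_separator` flag are hypothetical.

```python
import re


def tokenize(text: str, bom_is_separator: bool):
    """Split text into (token type, lexeme) pairs."""
    if bom_is_separator:
        # U+FEFF joins the space characters and leaves the identifier set.
        space = r" \t\r\n\f\ufeff"
        nonascii = r"\u0080-\ufefe\uff00-\U0010ffff"
    else:
        # U+FEFF is just another non-ASCII identifier character.
        space = r" \t\r\n\f"
        nonascii = r"\u0080-\U0010ffff"
    pattern = re.compile(
        rf"(?P<S>[{space}]+)"
        rf"|(?P<IDENT>[A-Za-z_{nonascii}][A-Za-z0-9_\-{nonascii}]*)"
        rf"|(?P<DELIM>.)",
        re.DOTALL,
    )
    return [(m.lastgroup, m.group()) for m in pattern.finditer(text)]


as_identifier = tokenize("\ufeffp{}", bom_is_separator=False)
as_separator = tokenize("\ufeffp{}", bom_is_separator=True)
```

In the first case the selector becomes the identifier U+FEFF followed by "p", which matches no "p" element; in the second the BOM is tokenized as space and the selector is plain "p".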

-- 
Etan Wexler.

