Jacob Palme | 12 Mar 11:31

Charset mandatory in unix/linux


The charset parameter is mandatory in the MIME content-type
attribute. However, such a parameter is not mandatory in
Unix or Linux. This is causing more and more problems, when
people have a mixture of files with different charsets,
which you easily get when you download files from the
Internet or receive them via e-mail.

Would it be possible to get the people responsible for the
file systems in Unix and Linux to add a mandatory charset
attribute to all text files? Best is probably to add a
generalized property list to files, so that also other
properties than charset can be added in the future.

The advantage would be that programs which transport files
across the Internet, such as e-mail, ftp and http, would
more often use the correct charset and not munge the files
by giving them an incorrect charset. The commonly occurring
problem with incorrect charset would be reduced. Also local
programs such as text editors would benefit from knowing
the charset of a file.
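
The munging described above can be illustrated with a short Python
sketch. The sample string is an assumption chosen for illustration;
the point is only that the same bytes yield different text depending
on which charset the receiver assumes.

```python
# The same bytes read under two different charset assumptions.
text = "Malmö"
data = text.encode("utf-8")             # bytes as stored in the file

as_utf8 = data.decode("utf-8")          # correct label
as_latin1 = data.decode("iso-8859-1")   # wrong label: every byte becomes one character

print(as_utf8)
print(as_latin1)
```

With the wrong label the two bytes that encode "ö" are shown as two
separate characters, which is the typical symptom of a mislabelled
file.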

(Mac OS earlier had a very good feature: you could add to
every file a property list called the "resource fork".
This still works in Mac OS X, but is less and less often
used, since Unix, on which Mac OS X is based, does not have
this facility. In Mac OS X the resource fork is stored in a
separate file whose file name starts with ".", in the same
directory as the file described.)

Perry E. Metzger | 12 Mar 19:15

Re: Charset mandatory in unix/linux


Jacob Palme <jpalme <at> dsv.su.se> writes:
> Would it be possible to get the people responsible for the
> file systems in Unix and Linux to add a mandatory charset
> attribute to all text files?

Not very likely at all I would say. The idea of "file attributes" is
pretty alien to Unix in general, and there is no central control over
all the various scattered forms of Unix, which makes it even harder.

Perry

ned+ietf-822 | 12 Mar 16:10

Re: Charset mandatory in unix/linux


(cc'ing the ietf-types list since this doesn't seem like an appropriate topic
for ietf-822)

> The charset parameter is mandatory in the MIME content-type
> attribute.

Actually, I don't know of a single case where this is true. All media type
parameters are either type or subtype specific, so there is no general rule
that applies to all charset parameters. Nevertheless, the charset parameters
that attach to the text top-level type are optional, as is the charset
parameter on application/xml. And making the parameter optional doesn't even
imply that there's a default. For example, in the case of XML the allowed
charsets for unlabelled material are intentionally limited so they can be
determined by inspection.

> However, such a parameter is not mandatory in
> Unix or Linux.

I could say the same thing about media types. File extensions or type codes are
commonly used to determine the media type. This is a huge problem that has led
to serious security glitches as well as poor user experiences.

> This is causing more and more problems, when
> people have a mixture of files with different charsets,
> which you easily get when you download files from the
> Internet or receive them via e-mail.

The reality is it is causing fewer and fewer problems as things gradually shift
towards Unicode-based charsets and away from the vast array of less capable

Martin Duerst | 13 Mar 07:31

Re: Charset mandatory in unix/linux

At 00:10 06/03/13, Ned Freed wrote:
 >(cc'ing the ietf-types list since this doesn't seem like an appropriate topic
 >for ietf-822)
 >
 >> The charset parameter is mandatory in the MIME content-type
 >> attribute.
 >
 >Actually, I don't know of a single case where this is true.

I don't, either.

 >All media type
 >parameters are either type or subtype specific, so there is no general rule
 >that applies to all charset parameters. Nevertheless, the charset parameters
 >that attach to the text top-level type are optional, as is the charset
 >parameter on application/xml. And making the parameter optional doesn't even
 >imply that there's a default. For example, in the case of XML the allowed
 >charsets for unlabelled material are intentionally limited so they can be
 >determined by inspection.

Almost correct, but wrong: In the case of application/xml, if there
is no 'charset' parameter on the mime type, information inside the
XML document is used to determine the character encoding according
to a clearly defined bootstrap algorithm. If you start a document
with
     <?xml version='1.0' encoding='foobar'?>
then it's in the "foobar" encoding. That doesn't mean that your
parser will be able to understand the "foobar" encoding; XML
parsers are only required to understand UTF-8 and UTF-16.
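
A rough Python sketch of that bootstrap, for illustration only. It is
deliberately simplified and is not the full procedure from the XML
specification: it assumes the document either starts with a UTF-8 or
UTF-16 byte order mark or is in an ASCII-compatible encoding, and it
ignores UTF-32 and BOM-less UTF-16. The function name
sniff_xml_encoding is made up for this example.

```python
import re

# Matches an XML declaration at the very start and captures the encoding name.
_DECL = re.compile(
    rb"^<\?xml[^>]*?\bencoding\s*=\s*(['\"])([A-Za-z][A-Za-z0-9._-]*)\1"
)

def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document that has no external charset label."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"                      # UTF-8 byte order mark
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"                     # UTF-16 byte order mark, either endianness
    match = _DECL.match(data)
    if match:
        return match.group(2).decode("ascii").lower()
    return "utf-8"                          # no BOM, no declaration: XML default

print(sniff_xml_encoding(b"<?xml version='1.0' encoding='foobar'?><doc/>"))
```

Note that the function only reports the declared name; whether a
given parser can actually decode that encoding is a separate
question.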


Arnt Gulbrandsen | 13 Mar 14:04

Re: Charset mandatory in unix/linux


Martin Duerst writes:
> At 00:10 06/03/13, Ned Freed wrote:
> >(cc'ing the ietf-types list since this doesn't seem like an 
> appropriate topic
> >for ietf-822)
> >
> >> The charset parameter is mandatory in the MIME content-type attribute.
> >
> >Actually, I don't know of a single case where this is true.
>
> I don't, either.

Huh.

So I looked closely, and I find that you're right. How about stressing 
this a little more in the successor to RFC 2045? Ned? Section 5.2 could 
state clearly that «Content-Type: text/plain» does NOT mean the same as 
a missing Content-Type header.

Arnt

Ned Freed | 13 Mar 15:17

Re: Charset mandatory in unix/linux


> Martin Duerst writes:
> > At 00:10 06/03/13, Ned Freed wrote:
> > >(cc'ing the ietf-types list since this doesn't seem like an
> > appropriate topic
> > >for ietf-822)
> > >
> > >> The charset parameter is mandatory in the MIME content-type attribute.
> > >
> > >Actually, I don't know of a single case where this is true.
> >
> > I don't, either.

> Huh.

> So I looked closely, and I find that you're right. How about stressing
> this a little more in the successor to RFC 2045?

RFC 2045 is about message structure and only deals with the aspects
of media types that interact with message structure. RFC 2046 is where
media type behavior is defined, and section 4.1.2 of that document is
quite clear about the behavior of the charset parameter.

> Ned? Section 5.2 could
> state clearly that «Content-Type: text/plain» does NOT mean the same as
> a missing Content-Type header.

Since it does mean that in email, I fail to see why such a change would be (a)
Correct or (b) A good idea.


Arnt Gulbrandsen | 13 Mar 15:45

Re: Charset mandatory in unix/linux


Ned Freed writes:
> Since it does mean that in email, I fail to see why such a change 
> would be (a) Correct or (b) A good idea.

If «content-type: text/plain; charset=us-ascii», «content-type: 
text/plain» and a missing content-type field are equivalent, I would 
say that encoding the character set is mandatory in email, just as 
Jacob Palme wrote in the first message. A message sender can't avoid 
specifying the character set, even by leaving out the charset 
parameter.

Arnt

Tony Finch | 13 Mar 16:03

Re: Charset mandatory in unix/linux


On Sun, 12 Mar 2006, Jacob Palme wrote:
>
> The charset parameter is mandatory in the MIME content-type
> attribute. However, such a parameter is not mandatory in
> Unix or Linux. This is causing more and more problems, when
> people have a mixture of files with different charsets,
> which you easily get when you download files from the
> Internet or receive them via e-mail.

Unix doesn't really support fine-grained switching between locales, and
therefore it doesn't have good support for switching between charsets
either. Files are untyped, and the way they are treated depends on
context. The context is principally defined by the program the file is fed
to, but the interpretation can also be changed by the locale settings in
the environment. Switching locales within a program is not well supported.
The problem is bigger than charsets: for example, the Unix locale API also
doesn't have good support for manipulating dates in multiple timezones.

So if you are going to solve your problem, you'll have to re-do the locale
API as well as defining how to use extended attributes to store charset
information. This is a problem for POSIX, not the IETF.
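
For the extended-attribute half of that, a minimal Python sketch of
what storing and reading a charset label could look like on Linux.
The attribute name "user.charset" and both function names are
assumptions made for this example, not something POSIX defines, and
the calls fail on filesystems that do not support user attributes.

```python
import os

ATTR = "user.charset"  # hypothetical attribute name, not a POSIX standard

def tag_charset(path: str, charset: str) -> bool:
    """Try to record the charset on the file; return False if unsupported."""
    if not hasattr(os, "setxattr"):        # os.setxattr exists only on Linux
        return False
    try:
        os.setxattr(path, ATTR, charset.encode("ascii"))
        return True
    except OSError:                        # e.g. filesystem without user xattrs
        return False

def read_charset(path: str):
    """Return the recorded charset, or None if the file carries no label."""
    if not hasattr(os, "getxattr"):
        return None
    try:
        return os.getxattr(path, ATTR).decode("ascii")
    except OSError:                        # attribute missing or unsupported
        return None
```

Nothing here makes programs honour the label; every tool that reads
or writes the file would still have to be taught to consult it.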

Tony.
-- 
f.a.n.finch  <dot <at> dotat.at>  http://dotat.at/
MULL OF KINTYRE TO ARDNAMURCHAN POINT: SOUTH OR SOUTHEAST 6 TO GALE 8,
OCCASIONALLY SEVERE GALE 9 NEAR EXPOSED HEADLANDS, DECREASING 5 OR 6
OVERNIGHT.


Ned Freed | 13 Mar 17:26

Re: Charset mandatory in unix/linux


> Ned Freed writes:
> > Since it does mean that in email, I fail to see why such a change
> > would be (a) Correct or (b) A good idea.

> If «content-type: text/plain; charset=us-ascii», «content-type:
> text/plain» and a missing content-type field are equivalent, I would
> say that encoding the character set is mandatory in email, just as
> Jacob Palme wrote in the first message. A message sender can't avoid
> specifying the character set, even by leaving out the charset
> parameter.

It is quite true that a user of text/plain in email has no choice but to
specify the charset, either implicitly by omitting the parameter, in which case
it defaults to US-ASCII, or explicitly by including it. But that's not what
Jacob said. He said:

> The charset parameter is mandatory in the MIME content-type
> attribute.

This is quite clearly incorrect. The _parameter_ is not mandatory;
_specification of the charset_ by one means or another is. (More precisely, it
is unavoidable due to the way the defaults are set up. Referring to mechanisms
involving defaults as mandatory is not good use of terminology IMO.)
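
The distinction can be sketched in Python with the standard email
package. The helper name effective_charset is made up for this
example, and it covers only text/plain in email: the parameter may be
absent, yet a charset still results from the default.

```python
from email import message_from_string

def effective_charset(raw: str) -> str:
    """Charset that applies to a text/plain message, explicit or defaulted."""
    msg = message_from_string(raw)
    if msg.get_content_type() != "text/plain":
        raise ValueError("this sketch only covers text/plain")
    explicit = msg.get_content_charset()   # None when there is no charset parameter
    return explicit or "us-ascii"          # default for text/plain in email

samples = [
    "Subject: a\n\nhello\n",                                   # no Content-Type field
    "Content-Type: text/plain\n\nhello\n",                     # no charset parameter
    "Content-Type: text/plain; charset=us-ascii\n\nhello\n",   # explicit parameter
]
results = [effective_charset(s) for s in samples]
print(results)
```

All three sample messages end up with the same charset, although
only the last one carries the parameter.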

Now, perhaps Jacob meant to say that the charset of plain text has to be
specified in some way. I'm not a mind reader and cannot divine intent. All I
can do is respond to what he did say, which was incorrect.

				Ned

Bruce Lilly | 26 Mar 22:15

Re: Charset mandatory in unix/linux


On Sun March 12 2006 10:10, ned+ietf-822 <at> mrochek.com wrote:
> 
> (cc'ing the ietf-types list since this doesn't seem like an appropriate topic
> for ietf-822)
[this response to types, cc to 822, Reply-To set to types]

[Jacob Palme wrote] 
> > The charset parameter is mandatory in the MIME content-type
> > attribute.
> 
> Actually, I don't know of a single case where this is true. All media type
> parameters are either type or subtype specific, so there is no general rule
> that applies to all charset parameters. Nevertheless, the charset parameters
> that attach to the text top-level type are optional, as is the charset
> parameter on application/xml.

What about text/directory and application/shf+xml (RFCs 2425 and
4194 respectively)?

RFC 2425:

5.3.  Required parameters

   Required parameters: charset

RFC 4194:

9.  IANA Considerations
