Bruce Lilly | 13 Dec 18:25

Re: Interpretation of RFC 2047


Charles Lindsey wrote:

> Would it not be a sensible rule to say that you should decode any occurrence
> of =?<charset>?[BQ]?...?= (subject to the 76 character limit) in any
> header provided:
>     (a) it was immediately preceded by '(' or by CFWS
>     (b) it was immediately followed by ')' or by CFWS
>     (c) it was not contained within a quoted-string

(d) it was not part of a MIME parameter (RFC 2047 expressly forbids 2047
     encoding in MIME parameters; RFC 2231 provides a mechanism for parameters
     and also extends 2047 to include language tags)

... and more (see below)

> Actually, there is a parsing required, because an encoded word in an
> unstructured header must have LWS (i.e. CFWS) on either side of it, whereas
> it can also have '(' and ')' immediately next to it in a structured header.

That's not accurate: first, LWS and CFWS are different; "(a) =?se2?q?x?="
(quotes for legibility only) is legal, whereas " (a)=?se2?q?x?=" is not;
both have CFWS immediately before what looks like an encoded-word, but
only the former has LWS immediately before an encoded-word. And there
are many issues with parentheses; ")=?se2?q?x?=(" in a structured
header which contains no other parentheses does not contain an
encoded-word.
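
A minimal sketch of the LWS requirement for an unstructured field, assuming the field body has already been unfolded. The pattern and the function name `encoded_words_unstructured` are illustrative only and do not come from any RFC or library; the 75-character length limit and charset validation are deliberately left out.

```python
import re

# Something that looks like an encoded-word, accepted only when it is
# bounded on both sides by a space, a tab, or the edge of the field body.
DELIMITED = re.compile(
    r"(?<![^ \t])=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=(?![^ \t])"
)

def encoded_words_unstructured(body):
    """Return candidate encoded-words that are delimited by LWS."""
    return [m.group(0) for m in DELIMITED.finditer(body)]

print(encoded_words_unstructured("(a) =?se2?q?x?="))   # LWS before the candidate
print(encoded_words_unstructured(" (a)=?se2?q?x?="))   # ")" before the candidate
```

The first body yields one candidate; the second yields none, because the character immediately before "=?" is ")" rather than white space.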

Other areas that immediately come to mind are:
1. RFC 2557 Content-Location, which permits URIs, which in turn (RFC 2396)
    permit parentheses.  That's in a structured field, but a URI, not a
    comment. [there are issues with 2557 and CFWS vs. the URIs, and these
    have been discussed on the MHTML list]
2. RFC 2533 "filters" have more nested parentheses than a technical paper
    at a LISP convention.  They're not comments and they appear in
    structured MIME extension headers (RFC 3297 Content-Alternative,
    RFC 2912 Content-Features).
3. URIs can also appear in other MIME extension headers; IIRC one of the
    RFCs provides for a URI in a parameter.
4. URIs also appear in headers which are not MIME extension headers, e.g.
    many of the List- headers.

I'm not certain, but I don't believe that the filter syntax permits anything
resembling 2047 encoding.  URIs probably do, but again, I haven't checked
thoroughly.

Misinterpreting something as encoded when it is in fact not an encoded-word
can have consequences.  Even if not changed in the protocol, but only for
display, there could be problems e.g. with cut-and-paste of URIs.

Strictly speaking, one can only decode if one knows the relevant header
syntax.  Display is a relatively minor issue, subject to the above
caveat.  But transformations by gateways may result in fouling up content
beyond all recognition unless the header syntax is known.  Ideally,
gateways shouldn't decode encoded-words -- if they're left in encoded
form there is no chance that they'll be garbled, which is the likely
outcome unless strict syntax of headers is known and applied rigorously.

Bruce Lilly | 13 Dec 19:00

The sad state of MIME non-compliance


Recently I had occasion to send a message consisting of
introductory text plus two pages of content in two formats
(text/plain and application/pdf). The obvious MIME structure
would be

multipart/mixed
    text/plain (introductory text)
    multipart/alternative
       multipart/related
          text/plain (page 1 text)
          text/plain (page 2 text)
       multipart/related
          application/pdf (page 1)
          application/pdf (page 2)
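
For anyone wanting to reproduce a test message with this shape, here is a rough sketch using Python's standard email package. The payloads are placeholders (not real PDF data), and `build_message` and `outline` are names made up for this example.

```python
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def build_message():
    """Build the nested structure described above, with dummy payloads."""
    text_pages = MIMEMultipart("related")
    text_pages.attach(MIMEText("page 1 text", "plain"))
    text_pages.attach(MIMEText("page 2 text", "plain"))

    pdf_pages = MIMEMultipart("related")
    pdf_pages.attach(MIMEApplication(b"placeholder for page 1", "pdf"))
    pdf_pages.attach(MIMEApplication(b"placeholder for page 2", "pdf"))

    alternatives = MIMEMultipart("alternative")
    alternatives.attach(text_pages)
    alternatives.attach(pdf_pages)

    root = MIMEMultipart("mixed")
    root.attach(MIMEText("introductory text", "plain"))
    root.attach(alternatives)
    return root

def outline(part, depth=0):
    """Return one indented line per MIME part, depth first."""
    lines = ["    " * depth + part.get_content_type()]
    if part.is_multipart():
        for child in part.get_payload():
            lines.extend(outline(child, depth + 1))
    return lines

print("\n".join(outline(build_message())))
```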

Sadly, many of the MUAs currently on the market fail to display
such a message properly.  Of course, non-MIME text-only UAs will
display the message body.  Kmail 1.4.3 seems to do a reasonable
job of displaying the message correctly.  But Netscape/Mozilla,
MS Outlook [Express], Eudora, and probably others do not. Most
of these display only the introductory text.

A sample message has been submitted with Mozilla bug report
#184869, viewable at http://bugzilla.mozilla.org/show_bug.cgi?id=184869

It appears in most cases that it is the nested MIME multipart
structure that is not being correctly parsed; changing individual
media types seems to have little or no effect.

I suspect that part of the problem is what RFC 2822 has remedied
w.r.t. non-MIME messages, viz. the standards are spread out over
multiple documents, which makes it difficult for developers to
refer to *the* standard.  Worse, there are errors in some of the
documents, and the errata page at www.rfc-editor.org is not
exactly well-publicized (IMO, it should be mentioned in
every RFC).  Still worse, some of the RFCs contradict others (e.g.
while RFC 2046 states that the domain of an entity in the absence
of a Content-Transfer-Encoding header defaults to 7bit, RFC
2425 [5.5] claims that it is 8bit).  2822 also clarified the syntax
relative to 822.

It would be nice if the MIME documents were similarly brought up to
date with current ABNF, retrieval of the text which has been lost
in troff-comment-land these many years, and consolidation into a
small set of consistent documents.

Bruce Lilly | 13 Dec 19:19

Re: RFC2231 encoding in parameters.


Valdis.Kletnieks <at> vt.edu wrote:


> 1) Are any MUAs "in the wild" currently actually using the 2047-style encoding
> in parameters rather than the 2231 syntax? If so, who are they, and who wants
> to send the authors a note? ;)

As of a couple of months ago, Netscape/Mozilla did so by default.  There is
an obscure, undocumented configuration parameter that can be placed in the
configuration file to change that to use 2231 encoding, but (again, as of a
couple of months ago) it did not work:
1. 8-bit file names were first 2047-encoded, then that encoded via the 2231
    mechanism
2. instead of a single language tag, a language list was used under some
    circumstances.

I think a bug report was submitted, but a quick search at bugzilla today
didn't show anything relevant.

With the obscure configuration parameter in place, Mozilla 1.2a produced the
following (which, as the filename is us-ascii, does not show problem #1):

    User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2a) Gecko/20020910
    Content-Type: application/excel; name*=ISO-8859-1'en-us, en, fr, ru, ja'membership.xls
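
For comparison, a small sketch of what a single-language RFC 2231 parameter value looks like, using `email.utils.encode_rfc2231` from the Python standard library. The 8-bit file name "Jürgen.xls" and the helper `rfc2231_name_parameter` are made up for illustration; this is not what any particular MUA emits.

```python
from email.utils import encode_rfc2231

def rfc2231_name_parameter(filename, charset, language):
    """Format a name* parameter as charset'language'percent-encoded-value."""
    return "name*=" + encode_rfc2231(filename, charset, language)

# One charset, one language tag, and the 8-bit octet percent-encoded
# directly (no RFC 2047 encoded-word underneath).
print(rfc2231_name_parameter("J\u00fcrgen.xls", "iso-8859-1", "en-us"))
```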

Re: The sad state of MIME non-compliance


On Fri, 2002-12-13 at 19:00, Bruce Lilly wrote:
> multipart/mixed
>     text/plain (introductory text)
>     multipart/alternative
>        multipart/related
>           text/plain (page 1 text)
>           text/plain (page 2 text)
>        multipart/related
>           application/pdf (page 1)
>           application/pdf (page 2)
>
> Sadly, many of the MUAs currently on the market fail to display
> such a message properly.  Of course, non-MIME text-only UAs will
> display the message body.  Kmail 1.4.3 seems to do a reasonable
> job of displaying the message correctly.  But Netscape/Mozilla,
> MS Outlook [Express], Eudora, and probably others do not. Most
> of these display only the introductory text.

Just to add a data point: evolution (1.2) does fine, displaying the two
pdf files (and not displaying the two text parts).

cheers
-- vbi

--
this email is protected by a digital signature: http://fortytwo.ch/gpg
NOTE: keyserver bugs! get my key here: https://fortytwo.ch/gpg/92082481

Charles Lindsey | 17 Dec 11:46

Re: Interpretation of RFC 2047


In <3DFA180A.7030705 <at> alex.blilly.com> Bruce Lilly <blilly <at> erols.com> writes:

>Charles Lindsey wrote:

>> Would it not be a sensible rule to say that you should decode any occurrence
>> of =?<charset>?[BQ]?...?= (subject to the 76 character limit) in any
>> header provided:
>>     (a) it was immediately preceded by '(' or by CFWS
>>     (b) it was immediately followed by ')' or by CFWS
>>     (c) it was not contained within a quoted-string

>(d) it was not part of a MIME parameter (RFC 2047 expressly forbids 2047
>     encoding in MIME parameters; RFC 2231 provides a mechanism for parameters
>     and also extends 2047 to include language tags)

>... and more (see below)

>> Actually, there is a parsing required, because an encoded word in an
>> unstructured header must have LWS (i.e. CFWS) on either side of it, whereas
>> it can also have '(' and ')' immediately next to it in a structured header.

I think a reasonable heuristic, which would nearly always do the "right
thing" would be:

NOT to decode anything within properly matched "...", <...> or [...] or
which follows a ';' which looks like the start of some MIME parameters.
And otherwise decode anything enclosed by WS or within properly matched
and nested (...).

>Other areas that immediately come to mind are:
>1. RFC 2557 Content-Location, which permits URIs, which in turn (RFC 2396)
>    permit parentheses.  That's in a structured field, but a URI, not a
>    comment. [there are issues with 2557 and CFWS vs. the URIs, and these
>    have been discussed on the MHTML list]

But URIs are not supposed to contain 8bit stuff, so the question should
not arise. And if IRIs should ever get into the standard, then there is a
special downgrading to URI built in (yes, that is Yet Another Encoding for
us to have to worry about :-( ). Mind you, if the URIs are enclosed within
<...>, then my rule above would cover them.

>I'm not certain, but I don't believe that the filter syntax permits anything
>resembling 2047 encoding.  URIs probably do, but again, I haven't checked
>thoroughly.

We are talking only of headers which the agent does not already know how
to parse, so presumably it is not going to do more than display them. But
displaying is probably more useful than leaving them alone as far as being
helpful to the reader is concerned.

>Strictly speaking, one can only decode if one knows the relevant header
>syntax.  Display is a relatively minor issue, subject to the above
>caveat.  But transformations by gateways may result in fouling up content
>beyond all recognition unless the header syntax is known.  Ideally,
>gateways shouldn't decode encoded-words -- if they're left in encoded
>form there is no chance that they'll be garbled, which is the likely
>outcome unless strict syntax of headers is known and applied rigorously.

Agreed.

But there is a more interesting question, which is what agents that create
unrecognized headers with 8bit stuff in them could usefully do. I.e. a
user tries to create a Foobar: header with such stuff in it. This could be
a problem in news to mail gatewaying. Treating all such headers as
unstructured is possible, but might not do the right thing. Trying to
recognise comments might be better (not within "...", <...> or [...]
though).

-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl <at> clw.cs.man.ac.uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5

Bruce Lilly | 17 Dec 19:27

Re: Interpretation of RFC 2047


Charles Lindsey wrote:

> I think a reasonable heuristic, which would nearly always do the "right
> thing" would be:
> 
> NOT to decode anything within properly matched "...", <...> or [...] or
> which follows a ';' which looks like the start of some MIME parameters.
> And otherwise decode anything enclosed by WS or within properly matched
> and nested (...).

As header field contents are defined by a grammar, attempts to
decode using only regular expressions (as opposed to a parser
which accepts the defined grammar) are doomed to failure. Failures
include both false positives and false negatives, as illustrated
below.

> But there is a more interesting question, which is what agents that create
> unrecognized headers with 8bit stuff in them could usefully do. I.e. a
> user tries to create a Foobar: header with such stuff in it. This could be
> a problem in news to mail gatewaying. Treating all such headers as
> unstructured is possible, but might not do the right thing. Trying to
> recognise comments might be better (not within "...", <...> or [...]
> though).

One cannot recognise a comment unless the header field syntax is known.

    Content-Features: (& (Type="text/plain") (charset=US-ASCII) )

contains no comments.

    Foobar: (& (Type="text/plain") (charset=US-ASCII) )

might or might not contain comments depending on the definition of
the Foobar header field.  I submit that

    Foobar: file:(=?us-ascii?q?=3D?=)

does not contain a comment. It does have matched parentheses.  It does
not contain an RFC 2047 encoded-word and does not encode any 8-bit
characters.  It does contain a syntactically valid absolute URI.

    Foobar: http://users.erols.com/blilly/mailparse/(=?us-ascii?q?=3D?=)

does not contain a comment. It does have matched parentheses.  It does
not contain an RFC 2047 encoded-word and does not encode any 8-bit
characters.  It contains a valid absolute URI with a query.  You are
welcome to try the URI; it does work (though the query is ignored).

Either could just as well be a Content-Location header. A simple regular
expression matching heuristic would attempt to decode both.  If thus
inappropriately decoded, they would yield

    Foobar: file:=
    Foobar: http://users.erols.com/blilly/mailparse/=

which are clearly not what was intended.  You are welcome to try the
last one as a URI, you will get a 404 not found error.  N.B. you might
(in general) instead have stumbled upon a valid URI which was different
from the intended one.
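
To make the failure concrete, here is a rough sketch of such a heuristic: a substitution that decodes anything shaped like an encoded-word with no knowledge of the field syntax. The pattern and the names `decode_match` and `naive_decode` are assumptions for this illustration; it is not the algorithm of any particular MUA or library.

```python
import base64
import re

# Anything shaped like =?charset?B-or-Q?text?= is treated as an encoded-word.
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")

def decode_match(m):
    charset, encoding, text = m.group(1), m.group(2).upper(), m.group(3)
    if encoding == "B":
        raw = base64.b64decode(text)
    else:
        # Q encoding: "_" stands for space, "=XX" is one hex-encoded octet.
        raw = re.sub(
            rb"=([0-9A-Fa-f]{2})",
            lambda h: bytes([int(h.group(1), 16)]),
            text.replace("_", " ").encode("ascii"),
        )
    return raw.decode(charset, errors="replace")

def naive_decode(value):
    """Decode every lookalike, ignoring what kind of field this is."""
    return ENCODED_WORD.sub(decode_match, value)

uri = "http://users.erols.com/blilly/mailparse/(=?us-ascii?q?=3D?=)"
print(naive_decode(uri))  # no longer the URI that was sent
```

The substitution succeeds without complaint, which is exactly the problem: nothing in the pattern can tell that the text was part of a URI.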

Treating an unrecognized header field as unstructured in the above
examples would not mangle the URIs, for display or otherwise.  If the
example were instead

    Foobar: =?us-ascii?q?-3D?=

treating the header as unstructured may result in decoding for display.
Gateways should not attempt to transform unrecognized header fields; it
is unknown whether or not the above example really contains an encoded-word.
If an unrecognized header field has content which is forbidden in the
destination network, the header could be elided. If the content is not
forbidden, the unrecognized header field should be passed unaltered.  A
network which would forbid RFC 2047 encoded-word content would be rather
unusual, to say the least.  A gateway should never decode RFC 2047
encoded-words in header fields, as the decoded word may have octets or
combinations of octets which are illegal in header fields (e.g. NUL, DEL,
8-bit-set, lone CR).  Such decoding might be acceptable if both of the
following conditions apply:
1. the destination network uses content in some format other than RFC
    2822 header fields (otherwise, there's no need for transformation).
2. it is guaranteed that a reverse transformation from the destination
    network to Internet mail is possible and produces content equivalent
    to the original (i.e. equivalent to leaving the header field unaltered),
    or that no reverse gateway will attempt to regenerate the header field
    (i.e. equivalent to eliding the header in the forward gateway
    transformation).
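
The kind of octets meant above can be checked mechanically. A small sketch, assuming the decoded encoded-word is available as raw bytes; `has_illegal_octets` is a hypothetical helper name, and the list of problem octets is only the one given in this message (NUL, DEL, 8-bit-set, lone CR).

```python
import re

# A CR that is not immediately followed by LF.
LONE_CR = re.compile(rb"\r(?!\n)")

def has_illegal_octets(raw):
    """True if decoded bytes contain NUL, DEL, an 8-bit octet, or a lone CR."""
    return (
        b"\x00" in raw
        or b"\x7f" in raw
        or any(octet > 0x7F for octet in raw)
        or LONE_CR.search(raw) is not None
    )

print(has_illegal_octets(b"J\xfcrgen"))  # decoded form of =?iso-8859-1?Q?J=FCrgen?=
print(has_illegal_octets(b"plain text"))
```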

Excluding content after a semicolon would fail to decode the RFC 2047
encoded-word in the header

    To: empty-list:;, =?iso-8859-1?Q?J=FCrgen?= j <at> foo.com

Excluding content bracketed in <> would also be an error. Consider RFCs
2368 and 2369 (not to be confused with 2396, which is also applicable) and:

    List-Owner: <mailto:%3D%3fiso-8859-1%3FQ%3fJ%3dFCrgen%3F%3d%20j <at> foo.com?Subject=list>

That does contain an RFC2047 encoded-word within the <>. Decoding the
content within <> first requires decoding URI encoding to obtain the
same mailbox as specified in the To header example above.
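
The two layers can be peeled off in order, as in this sketch. It assumes the archive's " <at> " stands for "@" in the original address, and it uses only `urllib.parse.unquote` and `email.header` from the Python standard library; `mailbox_from_mailto` is a name invented for the example.

```python
from email.header import decode_header, make_header
from urllib.parse import unquote

# "@" is assumed here; the archived copy shows " <at> " instead.
uri = "mailto:%3D%3fiso-8859-1%3FQ%3fJ%3dFCrgen%3F%3d%20j@foo.com?Subject=list"

def mailbox_from_mailto(uri):
    """Strip the scheme and the ?header part, then undo the URI encoding."""
    address_part = uri[len("mailto:"):].partition("?")[0]
    return unquote(address_part)

mailbox = mailbox_from_mailto(uri)                   # URI layer removed
display = str(make_header(decode_header(mailbox)))   # RFC 2047 layer removed

print(mailbox)
print(display)
```

Only after the first step does the RFC 2047 encoded-word become visible at all, so a scan of the raw header field for "=?" finds nothing.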

Further note that List-Owner provides for specifying additional header
fields (as with Subject in the example above), and of course
Content-Location and the hypothetical Foobar are not excluded.

Other examples could be given, but the above show that it is necessary
to fully parse header field content in order to determine whether or
not there is an encoded-word; use of regular expressions (or the
equivalent) is inadequate.

Alan Barrett | 18 Dec 10:57

Re: Interpretation of RFC 2047


On Tue, 17 Dec 2002, Bruce Lilly wrote:
> One cannot recognise a comment unless the header field syntax is known.

One can recognise a comment from lexical analysis alone.  This was true
in RFC 822, and should still be true in RFC 2822 unless something went
wrong.

>    Content-Features: (& (Type="text/plain") (charset=US-ASCII) )
>
> contains no comments.

RFC 822 was absolutely clear that it contains a comment.  By my reading
of RFC 2822 section 3.2.3, it still contains a comment.
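
As a rough illustration of what a purely lexical scan reports, the sketch below tracks nested parentheses, quoted-pairs and quoted-strings in the RFC 822 style. It says nothing about what the field definition intends, and `lexical_comments` is a name made up for this example.

```python
def lexical_comments(body):
    """Return top-level (...) spans found by lexical rules alone."""
    comments = []
    depth = 0
    start = 0
    in_quotes = False
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            i += 2              # quoted-pair: skip the escaped character
            continue
        if depth == 0 and ch == '"':
            in_quotes = not in_quotes   # quotes only matter outside comments
        elif not in_quotes:
            if ch == "(":
                if depth == 0:
                    start = i
                depth += 1
            elif ch == ")" and depth > 0:
                depth -= 1
                if depth == 0:
                    comments.append(body[start:i + 1])
        i += 1
    return comments

print(lexical_comments(' (& (Type="text/plain") (charset=US-ASCII) )'))
```

Applied to the field body above, the scan reports the whole parenthesized expression as a single comment.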

RFC 2912 suggests that the above Content-Features header field contains
no comments.  But RFC 2912 was published before RFC 2822, so cannot use
any sophistry about RFC 2822 perhaps having unintentionally changed the
definition of a comment.  Instead, RFC 2912 claims to depend on RFC 822,
where the definition of a comment is absolutely clear, so RFC 2912 would
have had no excuse at all for trying to modify it.

RFC 822 and 2822 did not deliberately leave open the possibility for
future header fields to redefine the comment syntax.  RFC 2912 does not
even discuss the fact that it attempts to redefine the comment syntax.
This is a fatal flaw in RFC 2912, and it's somewhat surprising that it
was not noticed before.

>    Foobar: (& (Type="text/plain") (charset=US-ASCII) )
> 
> might or might not contain comments depending on the definition of
> the Foobar header field.  I submit that
> 
>    Foobar: file:(=?us-ascii?q?=3D?=)
> 
> does not contain a comment.  It does have matched parentheses.  It
> does not contain an RFC 2047 encoded-word and does not encode any
> 8-bit characters.  It does contain a syntactically valid absolute URI.

I submit that the RFC 2822 section 3.2.3 definition of a comment was
intended to apply to all header fields, including those defined in RFC
2822's future; that it was a mistake for RFC 2912's "Content-Features:"
header field to contain stuff that looks like a comment but is not
intended as a comment; and that it would be a mistake for the definition
of the Foobar: header field to try to say that the above example does
not contain a comment.

> Other examples could be given, but the above show that it is necessary
> to fully parse header field content in order to determine whether
> or not there is an encoded-word; use of regular expressions (or the
> equivalent) is inadequate.

I agree on this point.  However, lexical analysis plus some guessing
will often be good enough.

--apb (Alan Barrett)

Keith Moore | 18 Dec 15:35

Re: Interpretation of RFC 2047


> One can recognise a comment from lexical analysis alone.  

comments are only valid in structured fields.  so in order to
recognize a comment you have to know the set of structured fields.
and it is (perhaps unfortunately) the case that some fields
have a syntax that uses parentheses as other than comment delimiters.
if I'm not mistaken this has been the case ever since rfc 987,
which used constructs like (a) in order to encode things like @
in PrintableString fields.

> I submit that the RFC 2822 section 3.2.3 definition of a comment was
> intended to apply to all header fields

I don't think so - that would break too many things already in existence.

Keith

Alan Barrett | 18 Dec 16:37

Re: Interpretation of RFC 2047


On Wed, 18 Dec 2002, Keith Moore wrote:
> > One can recognise a comment from lexical analysis alone.  
> 
> comments are only valid in structured fields.  so in order to
> recognize a comment you have to know the set of structured fields.

Yes, that's true (both in RFC 822 section 3.1.3 and RFC 2822 section
2.2.1).

Does anybody claim that the RFC 2912 "Content-Features" field is an
unstructured field?

> and it is (perhaps unfortunately) the case that some fields have
> a syntax that uses parentheses as other than comment delimiters.
> if I'm not mistaken this has been the case ever since rfc 987,
> which used constructs like (a) in order to encode things like @ in
> PrintableString fields.

By my reading of RFC 987, if a PrintableString is used in a context
where something like unquoted "(a)" could be misinterpreted as a
comment, then the entire PrintableString must be further encoded in an
RFC 822 quoted-string.  See the second paragraph on page 58 of RFC 987,
where it says "word may be encoded as 822.atom (which has a restricted
character set) or as 822.quoted-string, which can handle all ASCII
characters."

> > I submit that the RFC 2822 section 3.2.3 definition of a comment was
> > intended to apply to all header fields
>
> I don't think so - that would break too many things already in
> existence.

OK, all structured header fields.  The entire lexical analyser described
in RFC 822 section 3.1.4, and RFC 2822 section 3.2 (read in conjunction
with section 2.2.2), seems to be intended to apply to all structured
header fields.  I have always assumed that this included any structured
fields that might be defined in the future.

Apart from RFC 2912 Content-Features, what else violates this assumption?

--apb (Alan Barrett)

Alan Barrett | 18 Dec 17:08

Re: Interpretation of RFC 2047


On Wed, 18 Dec 2002, Keith Moore wrote:

> > By my reading of RFC 987, if a PrintableString is used in a context
> > where something like unquoted "(a)" could be misinterpreted as a
> > comment, then the entire PrintableString must be further encoded in an
> > RFC 822 quoted-string.
>
> perhaps.  but this was not done in practice.

OK.  I managed to avoid encountering such cases.

--apb (Alan Barrett)
