interpretation of whitespace inside obs-phrase
Jay Freeman (saurik) <saurik <at> saurik.com>
2012-08-25 05:25:16 GMT
Hello. I am working on an implementation of an e-mail parsing library, and am therefore getting intimate with RFC 5322.
I am trying to understand how to interpret some of the semantics of whitespace while parsing
addresses, and I have come across a specific situation where I do not understand the RFC. I was
hoping that someone might be able to offer their expertise.
(I will now apologize profusely if this is a misuse of this mailing list. I found a couple of previous questions
by someone working on a Ruby e-mail library while going through the archives, and so figured that it
was at least not entirely frowned upon to ask such questions here.)
In this case, the specific examples I am working with are as follows. The part I am concerned with is the
display name (although, depending on the answer to this issue, I may be forced to reevaluate other things I
currently believe I understand).
1: |Jay "Freeman"|
2: |Jay R. Freeman|
For reference, here are the higher-level (non-character class) rules from RFC 5322 that are important
for these two parses.
   display-name    =   phrase
   phrase          =   1*word / obs-phrase
   obs-phrase      =   word *(word / "." / CFWS)
   word            =   atom / quoted-string
   atom            =   [CFWS] 1*atext [CFWS]
   quoted-string   =   [CFWS]
                       DQUOTE *([FWS] qcontent) [FWS] DQUOTE
                       [CFWS]
Now, one problem I run into is that the grammar is somewhat ambiguous with regard to the placement of CFWS.
I will, however, use a greedy expansion for the purposes of interpreting these examples.
In the first case, I end up with a phrase, and in the second an obs-phrase. (I have included two different
versions of the second parse, as technically both are greedy but try the alternatives of the alternation
in the obs-phrase rule in a different order.)
1: atom:|Jay | quoted-string:|"Freeman"|
2: atom:|Jay | atom:|R| "." atom:| Freeman|
2: atom:|Jay | atom:|R| "." CFWS atom:|Freeman|
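The greedy expansion can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: plain spaces and tabs stand in for CFWS (no comments, no line folding), and the names `tokenize`, `TOKEN`, `ATEXT` and `WS` are made up for this sketch rather than taken from any library. Because the word alternatives are tried before a bare CFWS, it reproduces parse 1 and the first version of parse 2.

```python
import re

ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
WS = r"[ \t]*"  # stands in for [CFWS]; comments and folding are not handled

# Alternatives are tried in order, so a word greedily takes the
# whitespace on both of its sides before a bare CFWS can match.
TOKEN = re.compile(
    rf"(?P<atom>{WS}{ATEXT}+{WS})"
    rf'|(?P<quoted>{WS}"(?:[^"\\]|\\.)*"{WS})'
    rf"|(?P<dot>\.)"
    rf"|(?P<cfws>[ \t]+)"
)

def tokenize(phrase):
    """Split a display name into (kind, raw_text) pairs, greedily."""
    tokens, pos = [], 0
    while pos < len(phrase):
        m = TOKEN.match(phrase, pos)
        if m is None:
            raise ValueError(f"cannot parse at offset {pos}: {phrase[pos:]!r}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

if __name__ == "__main__":
    for example in ('Jay "Freeman"', "Jay R. Freeman"):
        print(tokenize(example))
```

Moving the `cfws` alternative ahead of `atom` would give a parse closer to the second version of parse 2, which is exactly the ambiguity in question.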
At this point, the standard is clear that the whitespace surrounding the atoms and the quotation marks
surrounding the quoted string (in addition to any whitespace outside of those, although this example has
none) are semantically not part of the values.
   Both atom and dot-atom are interpreted as a single unit, comprising
   the string of characters that make it up. Semantically, the optional
   comments and FWS surrounding the rest of the characters are not part
   of the atom; the atom is only the run of atext characters in an atom,
   or the atext and "." characters in a dot-atom.

   Semantically, neither the optional CFWS outside of the quote
   characters nor the quote characters themselves are part of the
   quoted-string; the quoted-string is what is contained between the two
   quote characters. As stated earlier, the "\" in any quoted-pair and
   the CRLF in any FWS/CFWS that appears within the quoted-string are
   semantically "invisible" and therefore not part of the quoted-string
   either.
In these cases, I would then presume, I would end up with the display names |JayFreeman| and
|JayR.Freeman|. In the first case I find this perfectly reasonable. In the second case, however, I'm
somewhat confused by the lack of resulting whitespace.
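To make that reading concrete, here is a small Python sketch of it. It assumes whitespace is the only CFWS present (comments and folding are not handled), and `stripped_display_name` and `LEXEME` are hypothetical names used only for this illustration. Each word contributes just its semantic value, and nothing is put back between the words.

```python
import re

# Simplified lexer: a quoted-string (quotes excluded from the group),
# a run of atext, or a lone period. Whitespace between them is skipped.
LEXEME = re.compile(
    r'"(?P<quoted>(?:[^"\\]|\\.)*)"'
    r"|(?P<atom>[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+)"
    r"|(?P<dot>\.)"
)

def stripped_display_name(phrase):
    """Concatenate the semantic values of the words, dropping all CFWS."""
    parts = []
    for m in LEXEME.finditer(phrase):
        if m.lastgroup == "quoted":
            # The quotes and the backslash of each quoted-pair are "invisible".
            parts.append(re.sub(r"\\(.)", r"\1", m.group("quoted")))
        else:
            parts.append(m.group())
    return "".join(parts)

if __name__ == "__main__":
    print(stripped_display_name('Jay "Freeman"'))   # JayFreeman
    print(stripped_display_name("Jay R. Freeman"))  # JayR.Freeman
```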
   Note: The "period" (or "full stop") character (".") in obs-phrase
   is not a form that was allowed in earlier versions of this or any
   other specification. Period (nor any other character from
   specials) was not allowed in phrase because it introduced a
   parsing difficulty distinguishing between phrases and portions of
   an addr-spec (see section 4.4). It appears here because the
   period character is currently used in many messages in the
   display-name portion of addresses, especially for initials in
   names, and therefore must be interpreted properly.
Given this description, I would have assumed that the purpose of this expansion is to support clients that
don't feel they need to provide quotation marks around names. I feel somewhat vindicated in this
understanding due to RFC5536 2.1.
   o  Articles are conformant if they use the <obs-phrase> construct
      (use of a phrase like "John Q. Public" without the use of quotes,
      see Section 4.1 of [RFC5322]), but agents MUST NOT generate
      productions of such syntax.
However, without the whitespace--which I am required to ignore due to the rules on how atoms are
parsed--this seems to not be a useful obs- exception. Am I fundamentally misunderstanding this
situation? Is whitespace in these contexts actually preserved?
(If whitespace is preserved, how does one handle the whitespace between the display name and the
addr-spec, or whitespace between other random atoms in the specification, or whitespace preceding the
display name after the "To:"?)
For completeness, I can also come up with a second way to interpret the second example, which is
|JayR. Freeman|, as neither of the above semantic rules indicates that the CFWS in the obs-phrase is to be
semantically ignored, and it should thereby become a space.
   Runs of FWS, comment, or CFWS that occur between lexical tokens in a
   structured header field are semantically interpreted as a single
   space character.
That said, I am not certain whether these are technically even "lexical tokens in a structured header", as my reading
of other sections of the specification indicates that the structured header itself has only a single
token, an addr-list, and inside of that token there must be explicit rules regarding how whitespace is parsed.
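If that rule were taken to apply to every gap between the words of a phrase (a broader reading than the |JayR. Freeman| one above, and an assumption rather than something the RFC clearly states), the result could be sketched as follows. As before, only spaces and tabs are treated as CFWS, and `spaced_display_name` and `LEXEME` are hypothetical names for this illustration.

```python
import re

# Simplified lexer: a quoted-string (quotes excluded from the group),
# a run of atext, or a lone period. Anything else is treated as a gap.
LEXEME = re.compile(
    r'"(?P<quoted>(?:[^"\\]|\\.)*)"'
    r"|(?P<atom>[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+)"
    r"|(?P<dot>\.)"
)

def spaced_display_name(phrase):
    """Join word values, turning each run of whitespace between them into one space."""
    out, prev_end = [], None
    for m in LEXEME.finditer(phrase):
        if prev_end is not None and m.start() > prev_end:
            out.append(" ")  # a gap of any length collapses to a single space
        if m.lastgroup == "quoted":
            out.append(re.sub(r"\\(.)", r"\1", m.group("quoted")))
        else:
            out.append(m.group())
        prev_end = m.end()
    return "".join(out)

if __name__ == "__main__":
    print(spaced_display_name('Jay "Freeman"'))        # Jay Freeman
    print(spaced_display_name("Jay R. Freeman"))       # Jay R. Freeman
    print(spaced_display_name('"Jay"   R   Freeman'))  # Jay R Freeman
```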
Finally, for comparison, I have attempted to parse this using a few implementations to see what they do.
(BTW, if this is interesting, I'm happy to do more work and test actual clients.) I have added an additional
test, 3:|"Jay" R Freeman|, to stress multiple spaces.
Java JavaMail:
1:|Jay "Freeman"|
2:|Jay R. Freeman|
3:|"Jay" R Freeman|
Java MIME4J:
1:|Jay Freeman|
2:|Jay R. Freeman|
3:|Jay R Freeman|
3:|Jay R Freeman| (Lenient)
Python email.utils:
1:|Jay Freeman|
2:|Jay R. Freeman|
3:|Jay R Freeman|
Ruby TMail:
1:|Jay Freeman|
2:|Jay R.Freeman|
3:|Jay R Freeman|
PHP mailparse_rfc822_parse_addresses:
1:|Jay Freeman|
2:|Jay R.Freeman|
3:|Jay R Freeman|
PHP imap_rfc822_parse_adrlist:
1:|Jay Freeman|
2:|Jay R. Freeman|
3:|Jay R Freeman|
Of these results, there is actually very little similarity :(. MIME4J, when using its "lenient" parser,
returns the same results as Python's email.utils, and Ruby's TMail returns the same results as PHP's
mailparse extension.
(Incidentally, I believe that the reason the PHP imap extension returns different results from the PHP
mailparse extension is that the imap extension is calling out to c-client, whereas the mailparse
extension was coded in-house.)
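For anyone who wants to repeat the Python email.utils test, something along these lines should do. The address `saurik@example.com` is a placeholder added so that `parseaddr` has a complete mailbox to work with, the run of spaces in the third sample is an assumption about what "multiple spaces" looked like, and `display_names` is just a helper name for this sketch. Output may vary between Python versions, so none is shown.

```python
from email.utils import parseaddr

# The address is a placeholder; only the display name matters here.
SAMPLES = [
    'Jay "Freeman" <saurik@example.com>',
    'Jay R. Freeman <saurik@example.com>',
    '"Jay"   R    Freeman <saurik@example.com>',
]

def display_names(samples):
    """Return the display name parseaddr() reports for each sample."""
    return [parseaddr(s)[0] for s in samples]

if __name__ == "__main__":
    for sample, name in zip(SAMPLES, display_names(SAMPLES)):
        print(f"{sample!r} -> |{name}|")
```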
So, again, if anyone here is willing to help me understand what the correct behavior here is, I would be most
appreciative. ;P
Sincerely,
Jay Freeman (saurik)
saurik <at> saurik.com
_______________________________________________
ietf-822 mailing list
ietf-822 <at> ietf.org
https://www.ietf.org/mailman/listinfo/ietf-822