Re: [Need advice] When to decode '+' to ' '?
Mike Brown <mike <at> skew.org>
2007-07-21 19:40:00 GMT
I think you're confusing general percent-encoding in URIs with the rules for
producing application/x-www-form-urlencoded data. They're related, but
distinct.
In any case, sender and receiver must agree; if you (the receiver) know the
data is of the application/x-www-form-urlencoded media type, then you should
not blindly apply the modern, general RFC 3986 percent-decoding rules to
interpret it. You must decode it using the reverse of the encoding process.
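To illustrate the difference, here is a minimal Python sketch. It assumes the
standard library's urllib.parse functions are acceptable stand-ins for
"general percent-decoding" (unquote) and "form decoding" (unquote_plus); the
sample string is made up for illustration.

```python
from urllib.parse import unquote, unquote_plus

# Made-up sample: the form value "1+1 = 2" after form-urlencoding.
raw = "1%2B1+%3D+2"

# General percent-decoding leaves a literal "+" untouched.
general = unquote(raw)

# Form decoding first turns "+" into a space, then percent-decodes.
form = unquote_plus(raw)

print(repr(general))  # '1+1+=+2'
print(repr(form))     # '1+1 = 2'
```

Only the second result matches what the sender actually typed, which is why
the receiver has to know which encoding was applied.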
As described in the HTML specs, such data is divided into "&"-separated
"name=value" pairs, uses "+" instead of "%20" for space, has had newlines
normalized to "%0D%0A", and has had "non-alphanumeric"/"reserved" characters
percent-encoded. This section of the specs predates HTML becoming
Unicode-friendly, so there is a great deal of ambiguity in exactly which
characters are percent-encoded and how, but in practice, implementations
generally align with RFC 3986 when deciding which characters to encode.
So, to encode a set of name-value pairs (character data from an HTML form):
1. In each name and value, encode each CR, LF, or CR+LF as "%0D%0A".
2. In each name and value, encode each space as "+", and percent-encode any
other character that would be ambiguous in a URI, especially "+", "&", and
"=".
3. Insert "=" between each name and value, and "&" between each pair.
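The three steps above can be sketched in Python roughly as follows. This is
only an illustration, not a reference implementation: form_urlencode is a
hypothetical helper name, UTF-8 is an assumed character encoding (the HTML
specs of that era leave this ambiguous), and urllib.parse.quote_plus is used
as a convenient approximation of "percent-encode anything ambiguous".

```python
import re
from urllib.parse import quote_plus


def form_urlencode(pairs, encoding="utf-8"):
    """Hypothetical helper: encode (name, value) pairs as form data."""

    def encode_component(text):
        # Step 1: normalize CR+LF, lone CR, and lone LF to CR+LF, which
        # comes out as "%0D%0A" once percent-encoded below.
        text = re.sub(r"\r\n|\r|\n", "\r\n", text)
        # Step 2: space becomes "+"; "+", "&", "=" and other reserved
        # characters are percent-encoded (safe="" keeps "/" from slipping
        # through unencoded).
        return quote_plus(text, safe="", encoding=encoding)

    # Step 3: "=" between name and value, "&" between pairs.
    return "&".join(
        f"{encode_component(name)}={encode_component(value)}"
        for name, value in pairs
    )


print(form_urlencode([("a b", "1+1=2"), ("note", "x\ny&z")]))
# a+b=1%2B1%3D2&note=x%0D%0Ay%26z
```

Normalizing newlines before percent-encoding avoids accidentally encoding the
"%" of an already-inserted "%0D%0A" a second time.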
To decode: