And Clover | 1 Dec 2009 01:44

Re: Move to bless Graham's WSGI 1.1 as official spec

Graham Dumpleton wrote:

> Answering my own question, it is actually obvious that it has to be
> called (1, 0). This is because wsgiref in Python 3.X already calls it
> (1, 0) and we don't have much choice but to be in agreement with that.

wsgiref.simple_server in Python 3 to date is not something that anyone 
should worry about being compatible with. It is a 2to3 hack that cannot 
meaningfully claim to represent wsgi version anything.

Careless use of urllib.parse.unquote causes 3.0's simple_server not to 
work at all, and 3.1's to mangle the path by treating it as UTF-8 
instead of ISO-8859-1, as 'WSGI 1.1' proposed and mod_wsgi (and even 
mod_cgi via wsgiref.CGIHandler) delivered.
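
The difference between the two decodings can be seen with a small sketch. It only illustrates how `urllib.parse.unquote` treats a non-UTF-8 byte in current Python 3; the path `/caf%E9` is a made-up example, not taken from wsgiref itself.

```python
from urllib.parse import unquote, unquote_to_bytes

# Hypothetical request path: e-acute sent as the single ISO-8859-1 byte 0xE9.
raw_path = "/caf%E9"

# Default behaviour: escapes are decoded as UTF-8 with errors="replace",
# so the lone 0xE9 byte turns into U+FFFD and the original byte is lost.
as_utf8 = unquote(raw_path)

# ISO-8859-1 maps every byte to the code point with the same value,
# so nothing is lost and the bytes can be recovered later.
as_latin1 = unquote(raw_path, encoding="latin-1")

# The undecoded bytes, for comparison.
raw_bytes = unquote_to_bytes(raw_path)
```

With the ISO-8859-1 variant, `as_latin1.encode("latin-1")` gives back `raw_bytes`; with the default, the byte cannot be recovered.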

Yes, I'm always going on about Unicode paths. I'm fed up of shipping 
apps with a page-long deployment note about fixing them. It pains me 
that in so many years both this and "What do we do about Python 3?" 
still haven't been addressed.

mod_wsgi 3.0 already has more traction than wsgiref 3.1 and I would 
prefer not to see more farcical reverse-progress at this point.

For what it's worth, here are my responses on the issues of this thread.
But at this point I really just want a BDFL to come and do it, whatever
it is. A new WSGI, whatever the version number, is massively overdue.

 >> 1. The 'readline()' function of 'wsgi.input' may optionally take a
 >> size hint.


James Y Knight | 1 Dec 2009 02:41

Re: Move to bless Graham's WSGI 1.1 as official spec

On Nov 29, 2009, at 12:40 AM, James Y Knight wrote:
> The next step here is clearly for someone to redraft the changes as a diff against PEP 333. If you do not have
> any interest in being that person, please make that clear, so someone else can step up to do so.

Okay, not sensing any other volunteers here...I guess it's all me.

The intention of this spec update is to be compatible with existing middleware/applications when running
on Python 2.X. Apps/middleware running on python 3.X require changes in any case, and this specification
will tell them exactly what to expect. That Python 3.X middleware and WSGI adapters will have to deal with
both bytestrings and unicode strings in many parts of the API (output status code, output headers, output
response iterable/write callback) will add some complexity, but that's life.

Any WSGI implementations on Python 3.X claiming compliance to WSGI 1.0 are most likely broken, and their
behavior cannot be relied upon. Too bad about wsgiref.

As self-appointed author, I am going to take a stand and say that both the python3-related string-type
specifications, and the additional requirements except #3 (read() with no-args) and #4 (file_wrapper
looking at Content-Length), will be included.

And it will be called WSGI 1.1.

Back to the list of "extra requirements":

#1: (readline with an arg) must be included, despite the potential for breakage. That ship has already
sailed, the breakage has already occurred, it's already required. Disagreement here really is of no consequence.

#2: (wsgi.input() must return EOF at EOF): I do not believe this will break any middleware. It will require some
changes in some WSGI adapter implementations, but that's acceptable. If you have a real-life example of
middleware that would break here, show it. So this will be included.
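
As a rough sketch of what requirements #1 and #2 could mean for an adapter, the wrapper below caps reads at the declared Content-Length and then returns an empty bytestring. The class name `LimitedInput` and its constructor arguments are hypothetical, invented for illustration; nothing here comes from the draft spec or from any real server.

```python
import io


class LimitedInput:
    """Hypothetical wsgi.input wrapper: stops at Content-Length, then signals EOF."""

    def __init__(self, stream, content_length):
        self._stream = stream
        self._remaining = content_length

    def _clamp(self, size):
        # Never read past the end of the request body.
        if size is None or size < 0 or size > self._remaining:
            return self._remaining
        return size

    def read(self, size):
        data = self._stream.read(self._clamp(size))
        self._remaining -= len(data)
        return data  # b"" once the body is exhausted

    def readline(self, size=-1):
        # The optional size hint from requirement #1.
        data = self._stream.readline(self._clamp(size))
        self._remaining -= len(data)
        return data


# Assumed scenario: an 8-byte body followed by bytes of the next request.
socket_like = io.BytesIO(b"a=1\nb=2\nNEXT REQUEST")
body = LimitedInput(socket_like, 8)
first = body.readline(2)
second = body.readline()
rest = body.read(100)
at_eof = body.read(100)
```

Here `read()` deliberately requires a size argument, since the no-argument form (#3) is set aside in this message.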


Manlio Perillo | 3 Dec 2009 11:55

Re: Move to bless Graham's WSGI 1.1 as official spec

James Y Knight ha scritto:
> I move to bless mod_wsgi's definition of WSGI 1.1 [1]
> [...]
> 
> [1] http://code.google.com/p/modwsgi/wiki/SupportForPython3X

Hi.

Just a few questions.

It is true that HTTP headers can be encoded assuming latin-1; and they
can be encoded using PEP 383.

However, what about URIs (that is, for PATH_INFO and the like)?
For URIs (if I remember correctly) the suggested encoding is UTF-8, so
URLs should be decoded using

  url.decode('utf-8', 'surrogateescape')

Is this correct?
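
For reference, this is roughly how the PEP 383 `surrogateescape` handler behaves on Python 3, where the input is a bytes object. The sample bytes are made up: a stray ISO-8859-1 byte followed by a valid UTF-8 Euro sign.

```python
# Hypothetical path bytes: 0xE9 is not valid UTF-8 on its own,
# while E2 82 AC is the UTF-8 encoding of the Euro sign.
raw = b"/caf\xe9/\xe2\x82\xac"

# Valid UTF-8 decodes normally; the undecodable byte 0xE9 is smuggled
# through as the lone surrogate U+DCE9 instead of raising an error.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", "surrogateescape")
```

Note that the resulting string cannot be encoded with the default strict handler, so every consumer has to know that surrogateescape was used.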

Now another question.
Let's consider the `wsgiref.util.application_uri` function

def application_uri(environ):
    url = environ['wsgi.url_scheme']+'://'
    from urllib.parse import quote

    if environ.get('HTTP_HOST'):
        url += environ['HTTP_HOST']

Manlio Perillo | 3 Dec 2009 15:49

HTTP headers encoding

Hi.

I'm doing some tests to try to understand how HTTP headers are encoded
by browsers.

I have written a simple WSGI application that asks for authentication
credentials, then prints them on the terminal and returns the data as
the response, as raw bytes
http://paste.pocoo.org/show/154633/

Then I used some browsers to try to send a username with non-ASCII
characters.

When I try with simple characters in the iso-8859-1 charset, things
work well; the data is encoded using this charset.

However, when I try to use some extraneous character, like Euro, there are
problems.

Firefox (Iceweasel 3.0.14, Linux Debian Squeeze) sends me a
'\xac'

I don't know where \xac comes from, but it is the last byte in the utf-8
encoded Euro: '\xe2\x82\xac'

Internet Explorer 6.0 sends me a
'\x80'
and this is the Euro character encoded using cp1252 (and I suspect
that it always uses this encoding, instead of iso-8859-1).
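
The byte values mentioned above can be checked directly. This is only a sanity check of the encodings themselves, not a reproduction of what either browser does internally.

```python
euro = "\u20ac"  # the Euro sign

utf8_bytes = euro.encode("utf-8")     # three bytes, the last one is 0xAC
cp1252_bytes = euro.encode("cp1252")  # a single byte, 0x80

# ISO-8859-1 has no Euro sign at all, so a strict encode fails.
try:
    euro.encode("iso-8859-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False
```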


Manlio Perillo | 3 Dec 2009 17:09

Re: HTTP headers encoding

Manlio Perillo ha scritto:
> Hi.
> 
> I'm doing some tests to try to understand how HTTP headers are encoded
> by browsers.
> 
> I have written a simple WSGI application that asks for authentication
> credentials, then prints them on the terminal and returns the data as
> the response, as raw bytes
> http://paste.pocoo.org/show/154633/
> 

I'm now testing using HTTP Digest Authentication.
The application is here:
http://paste.pocoo.org/show/154667/

It uses my wsgix framework
http://hg.mperillo.ath.cx/wsgix/
since I don't want to rewrite the entire Digest Authentication handling.

As the user name I use the string "àè€".
The results are:

- Firefox does not send any request, and instead it shows me the returned
  response body "Authentication required".

  This is quite strange.

- Internet Explorer 6 encodes the username using cp1252, as always.


And Clover | 3 Dec 2009 19:35

Re: Move to bless Graham's WSGI 1.1 as official spec

Manlio Perillo wrote:

> However what about URI (that is, for PATH_INFO and the like)?
> For URIs (if I remember correctly) the suggested encoding is UTF-8, so
> URLs should be decoded using

>   url.decode('utf-8', 'surrogateescape')

> Is this correct?

The currently-discussed proposal is ISO-8859-1, allowing the real bytes 
to be trivially extracted. This is consistent with the other headers and 
would be my preferred approach.
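
A minimal sketch of what "trivially extracted" could look like in an application, assuming the server decoded the path bytes as ISO-8859-1 as proposed. The helper names `path_info_bytes` and `path_info_text` are hypothetical, as is the sample environ.

```python
def path_info_bytes(environ):
    """Recover the bytes the server saw, assuming it decoded them as ISO-8859-1."""
    return environ.get("PATH_INFO", "").encode("iso-8859-1")


def path_info_text(environ, encoding="utf-8"):
    """Re-decode the path in whatever encoding the application expects."""
    return path_info_bytes(environ).decode(encoding, "replace")


# Assumed environ: the client sent the UTF-8 bytes C3 A9 for e-acute,
# and the server exposed them as two ISO-8859-1 characters.
environ = {"PATH_INFO": "/caf\xc3\xa9"}
raw = path_info_bytes(environ)
text = path_info_text(environ)
```

The application, not the server, picks the real encoding here, which is the point of treating ISO-8859-1 as a lossless carrier for bytes.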

Python 3.1's wsgiref.simple_server, on the other hand, blindly uses 
urllib.parse.unquote, which defaults to UTF-8 without surrogateescape, 
mangling any non-UTF-8 input.

I don't really care whether UTF-8+surrogateescape or ISO-8859-1 encoding 
is blessed. But *something* needs to be blessed. An encoding, an 
alternative undecoded path_info, both, something else... just *something*.

> Let's consider the `wsgiref.util.application_uri` function
> There is a potential problem, here, with the quote function.

Yes. wsgiref is broken in Python 3.1. Not quite as broken as it was in 
3.0, but still broken. Until we can come to a Pronouncement on what WSGI 
*is* in Python 3, it is meaningless anyway.
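
One way the quote problem can show up, assuming the issue is that `urllib.parse.quote` encodes its input as UTF-8 by default: if the environ value was decoded as ISO-8859-1, re-quoting it with the default produces different bytes from the ones the client sent. The path below is a made-up example.

```python
from urllib.parse import quote

# Assumed environ value: the client sent the single byte 0xE9 and the
# server decoded it as ISO-8859-1.
path = "/caf\xe9"

# Default: the character is re-encoded as UTF-8 before quoting,
# so the reconstructed URL no longer matches the request.
default_quoted = quote(path)

# Quoting with the same encoding the server used round-trips the byte.
latin1_quoted = quote(path, encoding="latin-1")
```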

> Cookie data SHOULD be transparent to the server/gateway; however WSGI is

Manlio Perillo | 3 Dec 2009 19:52

Re: Move to bless Graham's WSGI 1.1 as official spec

And Clover ha scritto:
> [...]
>> Cookie data SHOULD be transparent to the server/gateway; however WSGI is
>> going to assume that data is encoded in latin-1.
> 
> Yeah. This is no big deal because non-ASCII characters in cookies are
> already broken everywhere(*). Given this and other limitations on what
> characters can go in cookies, they are habitually encoded using ad-hoc
> mechanisms handled by the application (typically a round of URL-encoding).
> 
> *: in particular:
> 
> - Opera and Chrome send non-ASCII cookie characters in UTF-8.
> - IE encodes using the system codepage (which can never be UTF-8),
>   mangling any characters that don't fit in the codepage through the
>   traditional Windows 'similar replacement character' scheme.
> - Mozilla uses the low byte of each UTF-16 code point (so ISO-8859-1
>   gets through but everything else is mangled)
> - Safari refuses to send any cookie containing non-ASCII characters.
> 

Thanks for this summary.
I think it should go in a wiki or in a separate document (like
rationale) to the WSGI spec.

However, this should never happen with cookies, since cookie data is
opaque to the browser, and it MUST send it "as is".

What you describe happens with other headers containing TEXT.
And now I understand that strange behaviour of Firefox with non latin-1

James Y Knight | 3 Dec 2009 20:00

Re: Move to bless Graham's WSGI 1.1 as official spec

On Dec 3, 2009, at 1:35 PM, And Clover wrote:
> Manlio Perillo wrote:
> 
>> However what about URI (that is, for PATH_INFO and the like)?
>> For URIs (if I remember correctly) the suggested encoding is UTF-8, so
>> URLs should be decoded using
> 
>>  url.decode('utf-8', 'surrogateescape')
> 
>> Is this correct?
> 
> The currently-discussed proposal is ISO-8859-1, allowing the real bytes to be trivially extracted.
> This is consistent with the other headers and would be my preferred approach.

Right, for WSGI 1.1 on Python 3.x, 8859-1 strings is the plan. Other, more ideologically pure options can be
discussed for an incompatible revision of WSGI (e.g. the hypothetical 2.0).

BTW: I hope to have a first draft of the changes by Monday. (But don't beat up on me if it's delayed; I am working
on it.)

James
_______________________________________________
Web-SIG mailing list
Web-SIG@...
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/gcpw-web-sig%40m.gmane.org

And Clover | 3 Dec 2009 20:11

Re: HTTP headers encoding

Manlio Perillo wrote:

> I have written a simple WSGI application that asks authentication
> credentials

Ho ho! This is another area that is Completely Broken Everywhere. It's 
actually a similar situation to the cookies:

- Opera and Chrome send non-ASCII cookie characters in UTF-8.
- IE encodes using the system codepage (which can never be UTF-8),
   mangling any characters that don't fit in the codepage through the
   traditional Windows 'similar replacement character' scheme.
- Mozilla uses the low byte of each UTF-16 code point (so ISO-8859-1
   gets through but everything else is mangled)
- Safari uses ISO-8859-1, and refuses to send any cookie containing
   characters outside the 8859-1 repertoire.
- Konqueror uses ISO-8859-1, and replaces any non-8859-1 character
   with a question mark.
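
The "low byte of each UTF-16 code unit" behaviour can be imitated in a few lines. This is only a model of the described truncation, not Mozilla's actual code, and `low_bytes` is a hypothetical helper name.

```python
def low_bytes(text):
    """Keep only the low byte of each UTF-16 code unit (model of the truncation)."""
    utf16 = text.encode("utf-16-le")  # little-endian: low byte comes first
    return utf16[0::2]


# ISO-8859-1 characters survive; the Euro sign (U+20AC) collapses to 0xAC.
mangled = low_bytes("\u00e0\u00e8\u20ac")
euro_only = low_bytes("\u20ac")
```

Under this model the Euro sign becomes the single byte 0xAC, which would account for the '\xac' seen from Firefox earlier in the thread.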

The HTTP standard has nothing to say about the encoding in use *inside* 
the base64-encoded Authorization byte-string token. It's anyone's guess, 
and every browser has guessed differently. (Safari here is at least 
slightly better than its behaviour with the cookies.)
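
Since the encoding inside the token is unspecified, a server can only guess. The sketch below shows one possible guessing strategy for Basic credentials; the function name `decode_basic_credentials` and the order of candidate encodings are assumptions for illustration, not a recommendation from any standard.

```python
import base64


def decode_basic_credentials(header_value,
                             encodings=("utf-8", "cp1252", "iso-8859-1")):
    """Hypothetical helper: try candidate encodings in order, return the first that fits."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic Authorization header")
    raw = base64.b64decode(token)
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
        username, _, password = text.partition(":")
        return username, password, encoding
    raise ValueError("credentials do not match any candidate encoding")


# Assumed example: a client that encoded the credentials as cp1252.
token = base64.b64encode("\u00e0\u00e8\u20ac:secret".encode("cp1252"))
result = decode_basic_credentials("Basic " + token.decode("ascii"))
```

A guess like this can still be wrong: many byte sequences are valid in more than one single-byte encoding, so the result should be treated as a best effort.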

 > (and I suspect that [IE] always use this encoding, instead of
 > iso-8859-1).

It will certainly never send ISO-8859-1, but what it does send is locale 
dependent. Type an e-acute in your username on a Western machine and 
it'll send one byte sequence; type the same thing on an Eastern European 

Henry Precheur | 3 Dec 2009 20:25

Re: Move to bless Graham's WSGI 1.1 as official spec

On Thu, Dec 03, 2009 at 07:35:14PM +0100, And Clover wrote:
> >I don't know what the HTTP/Cookie spec says about this.
> 
> The traditional interpretation of RFC2616 is that headers are ISO-8859-1.
> 
> You will notice that no browser correctly follows this.

RFCs 2109 and 2965 say that a cookie's value can be anything:

> The VALUE is opaque to the user agent and may be anything the origin
> server chooses to send, possibly in a server-selected printable ASCII
> encoding.

Theoretically you could put something like: 'foo\n\0bar' in a cookie.

Also a cookie can include comments which have to be encoded using ...
UTF-8:

> Comment=value
>   OPTIONAL.  Because cookies can be used to derive or store
>   private information about a user, the value of the Comment
>   attribute allows an origin server to document how it intends to
>   use the cookie.  The user can inspect the information to decide
>   whether to initiate or continue a session with this cookie.
>   Characters in value MUST be in UTF-8 encoding.

-- 
  Henry Prêcheur

