Armin Ronacher | 4 May 17:02 2009

Re: Python 3.0 and WSGI 1.0.

Hello everybody,

I recently started looking at supporting Python 3 in one of my libraries
(Werkzeug), mainly because the MoinMoin project, which uses the library in
question, is considering using Python 3.  Right now Werkzeug treats HTTP as
Unicode aware in the sense that everything that carries text data is decoded
from and encoded into a known encoding.

This is partially against the specification and not entirely correct, but it
works the best on modern browsers and is also what Django and Paste are doing.

Basically, the incoming request data is .decode(encoding)d (usually utf-8)
before being passed to the user code, and unicode data is encoded back into the
same encoding before it's sent to the server.

Now why is the current behavior of Python 3 a problem here?  The encode, decode
hack from above is obviously a solution for these kinds of applications, albeit
not a good one.  Interfaces like mod_wsgi already have the data as a bytestring
and would decode it from latin1 just so that the application can encode it back
and decode it as utf-8.  Not only is this slow, it also means that the code
does not survive a run through 2to3.
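
To make the round trip concrete, here is a minimal sketch in Python 3.  It
assumes a server that decodes the raw path bytes as latin1 before handing them
to the application; the helper names server_decode and app_recode are made up
for illustration and are not part of mod_wsgi or Werkzeug.

```python
# Hypothetical helpers, for illustration only.

def server_decode(raw_path: bytes) -> str:
    # What a latin1-based server would do: every byte maps to exactly one
    # code point, so this step never fails and never loses information.
    return raw_path.decode("latin-1")


def app_recode(path: str, encoding: str = "utf-8") -> str:
    # What the application then has to do: undo the latin1 decode to get the
    # original bytes back, then decode them with the encoding it expects.
    return path.encode("latin-1").decode(encoding)


raw = "/caf\u00e9".encode("utf-8")    # bytes as sent by a utf-8 browser
from_server = server_decode(raw)      # mojibake: '/cafÃ©'
recovered = app_recode(from_server)   # the intended text: '/café'
```

Every request pays for one extra encode and one extra decode, which is the
overhead described above.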

Now you could argue that the libraries were wrong in the first place and should
support unicode strings that were encoded from latin1 and decoded, but it seems
that very few libraries support that.

Now which strings carry data that could contain non-ascii characters from a
source with an unknown encoding?  Right now these are the following:

  * PATH_INFO

Graham Dumpleton | 5 May 02:21 2009

Re: Python 3.0 and WSGI 1.0.

2009/5/5 Armin Ronacher <armin.ronacher@...>:
> Hello everybody,
>
> I recently started looking at supporting Python 3 in one of my libraries
> (Werkzeug), mainly because the MoinMoin project, which uses the library in
> question, is considering using Python 3.  Right now Werkzeug treats HTTP as
> Unicode aware in the sense that everything that carries text data is decoded
> from and encoded into a known encoding.
>
> This is partially against the specification and not entirely correct, but it
> works the best on modern browsers and is also what Django and Paste are doing.
>
> Basically, the incoming request data is .decode(encoding)d (usually utf-8)
> before being passed to the user code, and unicode data is encoded back into the
> same encoding before it's sent to the server.
>
> Now why is the current behavior of Python 3 a problem here?  The encode, decode
> hack from above is obviously a solution for these kinds of applications, albeit
> not a good one.  Interfaces like mod_wsgi already have the data as a bytestring
> and would decode it from latin1 just so that the application can encode it back
> and decode it as utf-8.  Not only is this slow, it also means that the code
> does not survive a run through 2to3.
>
> Now you could argue that the libraries were wrong in the first place and should
> support unicode strings that were encoded from latin1 and decoded, but it seems
> that very few libraries support that.
>
> Now which strings carry data that could contain non-ascii characters from a
> source with an unknown encoding?  Right now these are the following:
>

Graham Dumpleton | 5 May 12:04 2009

Re: Python 3.0 and WSGI 1.0.

2009/5/5 Armin Ronacher <armin.ronacher@...>:
> Hi,
>
> Graham Dumpleton wrote:
>> I can't see that I have any choice but to pass such settings through as
>> strings, else it would more than likely cause problems for applications.
>> Problem is it isn't clear what encoding stuff can be in Apache
>> configuration. At the moment latin-1 is assumed.
>
> Because that information does not have a specified encoding I can see
> nothing wrong with passing it as bytestrings.  I would
> have no problem passing *all* values as bytestrings.

At what point does that become an inconvenience though? I guess that
is my concern, because if one has to do too many manual conversions in
an application, people will start to complain it becomes unwieldy to
use. In other words, you make it easier or more logical for
frameworks, but do you end up putting more burden on applications for
stuff outside those core values?

So, for those core CGI values which the framework is going to modify
even before an application sees them, then fine. Is the framework also
going to set the rules as to what encoding is used for other values in
the WSGI environment and convert them per that encoding when an
application requests them, or is the application always going to have
to deal with them as bytes?

As I keep saying, you guys who write the frameworks and applications
are going to know better than I, I am just challenging the notions as
a way of making people think about it so the end result is what is the
most logical thing to do. ;-)

Robert Brewer | 5 May 16:55 2009

Re: Python 3.0 and WSGI 1.0.

Graham Dumpleton wrote:
> 2009/5/5 Armin Ronacher <armin.ronacher-GGlT2RywCWtWk0Htik3J/w@public.gmane.org>:
>> Graham Dumpleton wrote:
>>> I can't see that I have any choice but to pass such settings through as
>>> strings, else it would more than likely cause problems for applications.
>>> Problem is it isn't clear what encoding stuff can be in Apache
>>> configuration. At the moment latin-1 is assumed.
>> Because that information does not have a specified encoding I can see
>> nothing wrong with passing it as bytestrings.  I would
>> have no problem passing *all* values as bytestrings.
>
> At what point does that become an inconvenience though? I guess that
> is my concern, because if one has to do too many manual conversions in
> an application, people will start to complain it becomes unwieldy to
> use. In other words, you make it easier or more logical for
> frameworks, but do you end up putting more burden on applications for
> stuff outside those core values?
>
> So, for those core CGI values which the framework is going to modify
> even before an application sees them, then fine. Is the framework also
> going to set the rules as to what encoding is used for other values in
> the WSGI environment and convert them per that encoding when an
> application requests them, or is the application always going to have
> to deal with them as bytes?
>
> As I keep saying, you guys who write the frameworks and applications
> are going to know better than I, I am just challenging the notions as
> a way of making people think about it so the end result is what is the
> most logical thing to do. ;-)

In short: it's pretty easy for a framework to default to utf-8 for
everything, yet give application developers ways to override that. See,
for example, the cherrypy.tools.encoding Tool in our python3
branch--it's moved from running "sometime" after the page handler, to
wrapping the page handler so all page handlers emit bytes. That makes it
possible for everyone to use unicode strings everywhere, yet still allow
some to specify exact bytes as necessary. In shorter: don't worry about
that part, we've got it covered. ;)
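
One way to picture the "default to utf-8, but allow overrides" approach is a
decorator around the page handler.  The following is only a sketch: emit_bytes
and the three handlers are hypothetical names, and this is not the actual
cherrypy.tools.encoding implementation.

```python
import functools


def emit_bytes(encoding="utf-8"):
    """Hypothetical decorator: wrap a page handler so it always returns bytes."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(*args, **kwargs):
            body = handler(*args, **kwargs)
            if isinstance(body, bytes):
                # The handler chose its own exact bytes; pass them through.
                return body
            # Otherwise encode the unicode string with the configured encoding.
            return body.encode(encoding)
        return wrapper
    return decorator


@emit_bytes()                      # framework default: utf-8
def index():
    return "caf\u00e9"


@emit_bytes(encoding="latin-1")    # per-handler override
def legacy():
    return "caf\u00e9"


@emit_bytes()                      # exact bytes are left alone
def image():
    return b"\x89PNG"
```

Handlers can use unicode strings everywhere, while anything downstream of the
wrapper only ever sees bytes.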


Robert Brewer
fumanchu-Q+9y+cpEbCIdnm+yROfE0A@public.gmane.org


_______________________________________________
Web-SIG mailing list
Web-SIG@...
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/gcpw-web-sig%40m.gmane.org
Ian Bicking | 5 May 19:01 2009

Re: Python 3.0 and WSGI 1.0.

Philip Jenvey brought this to my attention:

  http://www.python.org/dev/peps/pep-0383/

It's a UTF8 encoding and decoding scheme that encodes illegal bytes in such a way that you can decode to get the original bytes object, and thus transcode to another encoding.  It's intended for cases exactly like WSGI.

--
Ian Bicking  |  http://blog.ianbicking.org

Graham Dumpleton | 6 May 05:14 2009

Re: Python 3.0 and WSGI 1.0.

2009/5/6 Ian Bicking <ianb@...>:
> Philip Jenvey brought this to my attention:
>
>   http://www.python.org/dev/peps/pep-0383/
>
> It's a UTF8 encoding and decoding scheme that encodes illegal bytes in such
> a way that you can decode to get the original bytes object, and thus
> transcode to another encoding.  It's intended for cases exactly like WSGI.

Care to explain then how that would in practice be used while I try
and reread it a few times to try and understand it myself? :-)

Graham

Ian Bicking | 6 May 05:27 2009

Re: Python 3.0 and WSGI 1.0.

On Tue, May 5, 2009 at 10:14 PM, Graham Dumpleton <graham.dumpleton-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> 2009/5/6 Ian Bicking <ianb <at> colorstudy.com>:
>> Philip Jenvey brought this to my attention:
>>
>>   http://www.python.org/dev/peps/pep-0383/
>>
>> It's a UTF8 encoding and decoding scheme that encodes illegal bytes in such
>> a way that you can decode to get the original bytes object, and thus
>> transcode to another encoding.  It's intended for cases exactly like WSGI.
>
> Care to explain then how that would in practice be used while I try
> and reread it a few times to try and understand it myself? :-)

I don't particularly know, except I think you'd do things like:

environ['PATH_INFO'] = urllib.unquote(http_byte_path).decode('utf8', 'python-escape')

Then if the encoding was wrong, you could transcode like:

environ['PATH_INFO'] = environ['PATH_INFO'].encode('utf8', 'python-escape').decode('latin1', 'python-escape')

Note that you need to know the encoding that was used (utf8 in this case) and that python-escape was used.  It has been suggested that the server should put the encoding it used into the environment.  When transcoding this should also be updated.

It's not clear what python-escape is going to do, I don't think that's been determined.  Probably it'll put \x00 or something in the unicode string to mark raw bytes.
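
For reference, in the accepted version of PEP 383 the error handler is named
surrogateescape (available from Python 3.1), and undecodable bytes are carried
through as lone surrogate code points rather than \x00.  A minimal sketch of
the transcoding idea under that assumption; the sample path is made up:

```python
from urllib.parse import unquote_to_bytes

# Made-up request path whose percent-escape is latin1, not utf-8.
http_byte_path = b"/caf%E9"
raw = unquote_to_bytes(http_byte_path)

# Server side: decode as utf-8, escaping bytes that are not valid utf-8
# instead of raising UnicodeDecodeError.
path = raw.decode("utf-8", "surrogateescape")

# Application side: the server's guess was wrong, so recover the exact
# original bytes and decode them with the encoding that was really used.
original = path.encode("utf-8", "surrogateescape")
fixed = original.decode("latin-1")
```

The second decode needs no error handler when the target is latin1, because
every byte value is valid there.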

--
Ian Bicking  |  http://blog.ianbicking.org
Graham Dumpleton | 8 May 13:34 2009

Re: Python 3.0 and WSGI 1.0.

2009/5/5 Graham Dumpleton <graham.dumpleton@...>
>>> Now, if we are going to start using bytes for request headers, there
>>> is the other question of response data.
>>>
>>> The original proposal in amendments was that application should
>>> provide bytes, but that WSGI adapter must accept either bytes or
>>> strings, with strings interpreted as latin-1.
>>>
>>> Is there sense in being more strict in this case?
>>>
>>> In Python 2.X some WSGI adapters only allow Python 2.X strings (ie.,
>>> bytes) and reject unicode strings. Others will convert unicode
>>> strings, but rather than use latin-1, apply the default Python
>>> encoding. Thus, there is no consistency.
>>
>> I think most will assert-reject unicode types and in -O just ignore them
>> and fail in some way.  I haven't seen any of those doing a
>> unicode->string conversion by encoding which btw is disallowed by the
>> PEP anyways.
>
> A CGI/WSGI bridge, if no explicit checks are made to disallow stuff
> other than strings, will usually attempt to write to sys.stdout
> whatever you give it. Thus unicode strings can be written and
> presumably default encoding is applied.
>
> >>> sys.stdout.write(u"abcd\n")
> abcd
>
> One can even write buffers.
>
> >>> sys.stdout.write(buffer("abcd\n"))
> abcd

Robert, do you have any comments on restricting response content to
bytes and not allowing a fallback to conversion per latin-1?

I heard that in the CherryPy WSGI server you are only allowing bytes.
What is your rationale for that at the moment?
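
The two policies in question can be sketched as a single helper with a
switch.  to_wire and its strict flag are hypothetical names, not part of any
existing WSGI adapter:

```python
def to_wire(chunk, strict=True):
    """Hypothetical adapter helper: turn one response chunk into bytes."""
    if isinstance(chunk, bytes):
        return chunk
    if strict:
        # Strict policy: anything other than bytes is an application error.
        raise TypeError(
            "response body must be bytes, got %s" % type(chunk).__name__
        )
    # Lenient policy: treat str as latin-1.  This raises UnicodeEncodeError
    # for any character above U+00FF.
    return chunk.encode("latin-1")
```

With strict=True the adapter rejects str outright; with strict=False it
accepts str but only within the latin-1 range.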

Graham

Robert Brewer | 8 May 17:07 2009

Re: Python 3.0 and WSGI 1.0.

Graham Dumpleton wrote:
> Robert, do you have any comments on restricting response content to
> bytes and not allowing a fallback to conversion per latin-1?
>
> I heard that in the CherryPy WSGI server you are only allowing bytes.
> What is your rationale for that at the moment?


In Python 2.x, one could easily mix unicode strings and byte strings in
the same interface, because they mostly supported the same operations.
Not so in Python 3.x--byte strings are missing everything from
capitalize() to zfill() [1]. I feel that choosing one type or the other
is required in order to avoid mountains of if-statements in middleware
(and lots of 'pass' statements if bytes are found).

I decided that that single type should be byte strings because I want
WSGI middleware and applications to be able to choose what encoding
their output is. Passing unicode to the server would require some
out-of-band method of telling the server which encoding to use per
response, which seemed unacceptable.

The down side, already alluded to, is that middleware cannot then call
e.g. response.capitalize() or any of a number of other methods without
first decoding the response. And it cannot do that reliably unless
(again) the encoding which was used to produce bytes is communicated
down the stack out of band.

The python3 branch of CherryPy is by no means complete. I'd be happy to
explore emitting unicode if we could decide on a method whereby apps
could inform the server which encoding they want. Middleware which
transcoded the response would need a means of overriding that. But of
course, that opens a whole new can of worms if something goes wrong,
because application authors want control over the error response; if the
server is encoding the response, and an error occurs, there would have
to be a way to pass control back up the stack to...what? whichever
component last set the encoding? That road starts to get complicated
very quickly.

If some middleware needs to treat the response as unicode, I'd rather
emit bytes and somehow return the encoding as part of the response.
Perhaps WSGI 2's mythical "return (status, headers, body-iterable,
encoding)". Middleware could then decode/transcode as desired. I can't
think of a downside to that, other than some lost cycles spent
de/encoding, but perhaps there are some I don't yet foresee.
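
A rough sketch of what such middleware could look like, assuming the
hypothetical 4-tuple return value above.  Nothing here is an existing WSGI
API; transcode_response is a made-up name, and the headers are left untouched
for brevity (real middleware would also have to fix Content-Length and the
charset parameter).

```python
import codecs


def transcode_response(response, target="utf-8"):
    """Hypothetical middleware step for a (status, headers, body, encoding) tuple."""
    status, headers, body, encoding = response
    if encoding is None or encoding.lower() == target.lower():
        # Nothing to do: unknown encoding, or already in the target encoding.
        return response

    def recoded():
        # Incremental decoding copes with a multi-byte character that is
        # split across two chunks of the body iterable.
        decoder = codecs.getincrementaldecoder(encoding)()
        for chunk in body:
            text = decoder.decode(chunk)
            if text:
                yield text.encode(target)
        tail = decoder.decode(b"", final=True)
        if tail:
            yield tail.encode(target)

    return status, headers, recoded(), target
```

Because the encoding travels with the response, each layer can decode,
inspect, and re-encode without any out-of-band agreement.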


Robert Brewer
fumanchu-Q+9y+cpEbCIdnm+yROfE0A@public.gmane.org

[1] See http://docs.python.org/dev/py3k/library/stdtypes.html#string-methods

P.J. Eby | 8 May 17:58 2009

Re: Python 3.0 and WSGI 1.0.

At 08:07 AM 5/8/2009 -0700, Robert Brewer wrote:
>I decided that that single type should be byte strings because I want
>WSGI middleware and applications to be able to choose what encoding
>their output is. Passing unicode to the server would require some
>out-of-band method of telling the server which encoding to use per
>response, which seemed unacceptable.

I find the above baffling, since PEP 333 explicitly states that when 
using unicode types, they're not actually supposed to *be* unicode 
--  they're just bytes decoded with latin-1.

So, the server doesn't need to know "what encoding to use" -- it's 
latin-1, plain and simple.  (And it's an error for an application to 
produce a unicode string that can't be encoded as latin-1.)

To be even more specific: an application that produces strings can 
"choose what encoding to use" by encoding in it, then decoding those 
bytes via latin-1.  (This is more or less what Jython and IronPython 
users are doing already, I believe.)
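
A minimal sketch of that encode-then-decode convention in Python 3.  The
function names app_body and server_write are illustrative assumptions, not
taken from PEP 333 or from any server:

```python
def app_body(text: str, encoding: str = "utf-8") -> str:
    # Application side: pick the real encoding, then disguise the resulting
    # bytes as a latin-1 string, one code point per byte.
    return text.encode(encoding).decode("latin-1")


def server_write(body: str) -> bytes:
    # Server side: always latin-1.  This fails with UnicodeEncodeError if
    # the application produced any code point above U+00FF.
    return body.encode("latin-1")


# The euro sign is not representable in latin-1, yet it survives the trip
# because the server only ever sees the utf-8 bytes in disguise.
wire = server_write(app_body("\u20ac5"))
```

The bytes on the wire are exactly the utf-8 encoding the application chose,
so the server never needs to be told which encoding was used.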
