Re: Passing UTF-8 bytestrings to lxml
Stefan Behnel <stefan_ml <at> behnel.de>
2008-08-04 14:07:33 GMT
Hi,
John J Lee wrote:
> Apologies in advance if this is the wrong list -- I'm suggesting a change
> to lxml, so I guess this is the right place...
We only have one mailing list, so this is definitely the right place.
> Looking at the code, it seems that changing function _utf8 in
> apihelpers.pxi to accept UTF-8 encoded bytestrings (see patch below) would
> be sufficient to make lxml accept UTF-8 encoded bytestrings. Indeed, that
> seems to work.
The internal encoding used by libxml2 is UTF-8, so I don't expect any
problems when you pass in UTF-8 directly - as long as you can make sure
that it's really a valid UTF-8 byte sequence.
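
One way to make sure of that before the bytes go anywhere near the library is a strict decode on the caller's side. This is a minimal sketch using only the Python standard library; the helper name decode_utf8_strict is made up for illustration and is not part of lxml:

```python
def decode_utf8_strict(data: bytes) -> str:
    """Return the decoded text, or raise UnicodeDecodeError if `data`
    is not a well-formed UTF-8 byte sequence.

    Hypothetical helper for illustration only, not an lxml function.
    """
    return data.decode("utf-8", errors="strict")


def is_valid_utf8(data: bytes) -> bool:
    """Report whether `data` decodes cleanly as UTF-8."""
    try:
        decode_utf8_strict(data)
    except UnicodeDecodeError:
        return False
    return True


valid = "café".encode("utf-8")   # ends with the two-byte sequence for "é"
truncated = valid[:-1]           # cuts that sequence in half
```

With these definitions, is_valid_utf8(valid) is true and is_valid_utf8(truncated) is false, since the last byte is left without its continuation byte.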
> 2. Should lxml be changed in this way? If it's considered important to
> avoid accidentally passing non-ASCII bytestrings to lxml
I consider that important, yes. The support for ASCII byte strings is a
pure convenience as ASCII names are extremely common in XML *and* they are
compatible with unicode strings in Python 2.x. Allowing anything other
than ASCII here would open the door to all sorts of hard-to-track-down
encoding problems, as you would no longer get an exception when you
accidentally pass ISO encoded non-ASCII strings, for example.
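
To illustrate the kind of problem I mean, here is a rough sketch in plain Python. It does not use lxml at all, and require_ascii is a hypothetical stand-in for an ASCII-only check, not the actual _utf8 code:

```python
def require_ascii(name: bytes) -> str:
    """Accept a byte string only if it is pure ASCII.

    Hypothetical stand-in for an ASCII-only check, for illustration only.
    Raises UnicodeDecodeError on any byte >= 0x80.
    """
    return name.decode("ascii")


# Text that was encoded as ISO-8859-1 by mistake. These two characters
# produce exactly the bytes that UTF-8 uses for a single "é".
latin1_bytes = "Ã©".encode("iso-8859-1")

# A UTF-8-accepting check lets the bytes through and silently
# changes what the text means.
misread = latin1_bytes.decode("utf-8")


def ascii_check_rejects(data: bytes) -> bool:
    """Return True if the ASCII-only check raises for `data`."""
    try:
        require_ascii(data)
    except UnicodeDecodeError:
        return True
    return False
```

Here misread is the single character "é" although the original text had two characters, and no error was raised along the way. The ASCII-only check refuses the same bytes immediately, which is the behaviour I want to keep.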
Note that when lxml runs under Python 3, it will not allow you to pass
byte strings into the API at all (except for parsing, obviously).