Ashish Kulkarni | 12 Sep 09:15 2006

Static win32 builds for lxml 1.0.4

Hello,

I finally got the VC++ Toolkit 2003 compiler, and have made static builds for lxml 1.0.4. They are available at:

http://puggy.symonds.net/~ashish/downloads/

Regards,
ashish
Nicola Larosa | 12 Sep 12:46 2006

Unicode munging in element tag and text

Hi all, thanks for a great library. :-)

I found a rather peculiar behavior in Unicode object handling for element
tag and text. It looks like they get converted to a plain string if they
only contain ASCII chars, but not always. ElementTree instead always keeps
them as Unicode objects.

>>> from lxml.etree import Element as lxElem
>>> from elementtree.ElementTree import Element as etElem

1) Let's first build an element from a Unicode object with ASCII chars;
only ElementTree keeps it as Unicode:

>>> lx = lxElem(u'ascii')
>>> et = etElem(u'ascii')
>>> lx.tag
'ascii'
>>> et.tag
u'ascii'

while when the Unicode object contains non-ASCII chars, both libraries
correctly keep it as Unicode:

>>> lx = lxElem(u'mòrèthànàscìì')
>>> et = etElem(u'mòrèthànàscìì')
>>> lx.tag
u'm\xf2r\xe8th\xe0n\xe0sc\xec\xec'
>>> et.tag
u'm\xf2r\xe8th\xe0n\xe0sc\xec\xec'


Fredrik Lundh | 12 Sep 13:27 2006

Re: Unicode munging in element tag and text

Nicola Larosa wrote:

> This inconsistent behavior does not seem intentional. In my opinion, in the
> cases 1) and 2) lxml should work as it already does in the case 3), and as
> ElementTree always does.

in Python 2.X, Unicode strings are compatible with 8-bit ASCII-only 
strings, so the lxml.etree behaviour is perfectly acceptable.  I see no 
reason to force an implementation that doesn't use Python objects for 
its internal storage to keep track of the original type.

(especially not since the Unicode string type will disappear in Python 
3.0; all strings will be able to hold Unicode data).

</F>
Stefan Behnel | 12 Sep 18:23 2006

Re: Unicode munging in element tag and text

Hi Nicola,

Nicola Larosa wrote:
> This inconsistent behavior does not seem intentional. In my opinion, in the
> cases 1) and 2) lxml should work as it already does in the case 3), and as
> ElementTree always does.

At least under Python 2.x, lxml.etree will continue to return unicode or plain
strings depending on their content. Internally, everything is stored as UTF-8,
so this is for performance reasons as we can avoid unicode conversion for
plain ASCII strings (which are very common, just think of numeric data, dates,
etc.).
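
A minimal sketch of the rule described above, not lxml's actual
implementation: `utf8_to_python` is a hypothetical helper name, and its input
is assumed to be a UTF-8 encoded byte string like the ones stored internally.
ASCII-only data is handed back unchanged, and anything else is decoded into a
Unicode object.

```python
def utf8_to_python(utf8_bytes):
    """Hypothetical helper (not part of lxml): mimic the str-or-unicode rule."""
    # If every byte is below 128 the data is plain ASCII, so the byte string
    # can be returned as-is and the decoding step is skipped.
    if all(byte < 128 for byte in bytearray(utf8_bytes)):
        return utf8_bytes
    # Anything else has to be decoded from UTF-8 into a Unicode object.
    return utf8_bytes.decode('utf-8')


tag_ascii = utf8_to_python(b'ascii')
tag_other = utf8_to_python(u'm\xf2r\xe8'.encode('utf-8'))
```

Under Python 2, the first call returns a plain `str` and the second a
`unicode` object, which matches what Nicola observed for element tags.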

This may change in Python 3.x, but then, there may be more to change, so
that's not in our scope for now.

Stefan
Nicola Larosa | 12 Sep 19:00 2006

Re: Unicode munging in element tag and text

Stefan Behnel wrote:
> At least under Python 2.x, lxml.etree will continue to return unicode or plain
> strings depending on their content. Internally, everything is stored as UTF-8,
> so this is for performance reasons as we can avoid unicode conversion for
> plain ASCII strings (which are very common, just think of numeric data, dates,
> etc.).

Any benchmarks supporting this decision?

-- 
Nicola Larosa - http://www.tekNico.net/

There is more money being spent on breast implants and Viagra today than
on Alzheimer's research. This means that by 2040, there should be a large
elderly population with perky boobs and huge erections and absolutely no
recollection of what to do with them. -- David Icke, April 2006
Fredrik Lundh | 12 Sep 19:19 2006

Re: Unicode munging in element tag and text

Nicola Larosa wrote:

> Any benchmarks supporting this decision?

Are you trying to use the "premature optimization is evil" argument 
against people who've spent more time than anyone else on optimizing 
Python's string subsystem? ;-)

</F>
Nicola Larosa | 13 Sep 00:28 2006

Re: Unicode munging in element tag and text

Fredrik Lundh wrote:
> Are you trying to use the "premature optimization is evil" argument 
> against people who've spent more time than anyone else on optimizing 
> Python's string subsystem? ;-)

I, for one, welcome our new Iceland Sprint overlords. ;-P

-- 
Nicola Larosa - http://www.tekNico.net/

Many software developers have become hostage to the development
frameworks that they utilise. In turn, many frameworks have made
session state a fundamental building block of web development
because it permits sloppy design. -- Alan Dean, April 2006
Stefan Behnel | 12 Sep 21:59 2006

Re: Static win32 builds for lxml 1.0.4

Hi Ashish,

Ashish Kulkarni wrote:
> I finally got the VC++ Toolkit 2003 compiler, and have made static builds
> for lxml 1.0.4.

Great, I uploaded them. Thanks for contributing!

Stefan
