John Krukoff | 1 Aug 2006 05:13

Copying an ElementTree doesn't work.

Can someone explain to me why, when an ElementTree is copied, its root
element isn't copied?

>>> import lxml.etree as etree
>>> import copy
>>> root = etree.XML( '<a/>' )
>>> tree = copy.copy( etree.ElementTree( root ) )
>>> tree.getroot( ) is None
True

I get the same behaviour with deepcopy as well. Am I just supposed to
always be using Elements and not ElementTrees? I'm running lxml
1.0.2 on Python 2.4.3, if that matters.
John Krukoff | 1 Aug 2006 05:33

Segfault in lxml during element copy

I've been working on an XML based middleware system written in python  
and lxml, and I've started experiencing a segfault problem with lxml  
just as it's being rolled out to the rest of the team. Embarrassing,  
you know?

It looks like a double free problem, as the crash is always accompanied
by a glibc message that looks like this:
*** glibc detected *** free(): invalid pointer: 0x0813e1a4 ***

I've tried to come up with a stripped down test case to repeat the
problem, but have been unable to reproduce it except in the full
application. It's not absolutely consistent: I have to run the same
request 3 or 4 times before it crashes, but it always does, even though
those 3 or 4 calls generate identical output from identical input.

I've tracked down the line it crashes at, and it's a simple copy  
called on an XML element:
copied = copy.copy( element )

If I remove it, and operate on the source xml directly instead of  
copying it (it's really just a safety mechanism), it still crashes,  
just in more random locations.

I'm running lxml 1.0.2 on Python 2.4.3, with libxml2 2.6.26 and
libxslt 1.1.17, if it matters. The problem is reproducible on a
coworker's machine, also running lxml 1.0.2 with slightly different
minor revisions of the XML libraries.
Stefan Behnel | 1 Aug 2006 07:25

Re: Copying an ElementTree doesn't work.

Hi John,

John Krukoff wrote:
> Can someone explain to me why, when an ElementTree is copied, its root
> element isn't copied?
> 
>>>> import lxml.etree as etree
>>>> import copy
>>>> root = etree.XML( '<a/>' )
>>>> tree = copy.copy( etree.ElementTree( root ) )
>>>> tree.getroot( ) is None
> True
> 
> I get the same behaviour with deepcopy as well. Am I just supposed to
> always be using Elements and not ElementTrees? I'm running lxml
> 1.0.2 on Python 2.4.3, if that matters.

Copying ElementTrees is not currently implemented. The only reason to
implement it would be to avoid problems when people try it; there is no real
gain. I do not even see why you would want to copy an ElementTree.

As ElementTrees are immutable, the above is not different from this:

tree = etree.ElementTree(root)

I'll add __copy__ and __deepcopy__, though, so that the above problem will
disappear. So, thanks for reporting this.

Stefan
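
Until copying an ElementTree is supported directly, one way to get an independent tree is to deep-copy the root element and wrap the copy in a new ElementTree. The following is only a minimal sketch: the helper name copy_tree is hypothetical and not part of lxml, and the import falls back to the standard library's ElementTree when lxml is not installed.

```python
import copy

try:
    from lxml import etree
except ImportError:
    # Fallback so the sketch also runs without lxml installed.
    import xml.etree.ElementTree as etree


def copy_tree(tree):
    """Hypothetical helper: return a new ElementTree around a deep copy of the root."""
    return etree.ElementTree(copy.deepcopy(tree.getroot()))


original = etree.ElementTree(etree.XML("<a><b/></a>"))
copied = copy_tree(original)

# A destructive change to the copy leaves the original untouched.
copied_root = copied.getroot()
copied_root.remove(copied_root[0])
```

After this runs, the original root should still have its child element while the copied root has none.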

Stefan Behnel | 1 Aug 2006 07:56

Re: Segfault in lxml during element copy

Hi John,

John Krukoff wrote:
> I've been working on an XML based middleware system written in python  
> and lxml, and I've started experiencing a segfault problem with lxml  
> just as it's being rolled out to the rest of the team. Embarrassing,  
> you know?

Sorry for that.

> It looks like a double free problem, as the crash is always accompanied
> by a glibc message that looks like this:
> *** glibc detected *** free(): invalid pointer: 0x0813e1a4 ***

*May* be a double free problem, yes.

> I've tried to come up with a stripped down test case to repeat the  
> problem, but have been unable to reproduce it except in the full  
> application. It's not absolutely consistent, I'll have to run the same  
> request 3 or 4 times before it crashes, but it always does, even while  
> generating identical output from identical input for those 3 or 4 calls.
> 
> I've tracked down the line it crashes at, and it's a simple copy  
> called on an XML element:
> copied = copy.copy( element )

?? You mean, you get the above error ('free(): invalid pointer') when you call
this? Then I have no idea where that bug could come from. At least, it can't
really be copy() that triggers it...


Stefan Behnel | 1 Aug 2006 09:36

Re: An intriguing behaviour of xpath in lxml

Hi Agustin,

Agustín Villena wrote:
> I already know that xpath(".") in the document node works, but is 
> beyond my understanding why xpath("/") is not implemented.

Well, what would you expect it to return? The XPath spec says:

"""
/ selects the document root (which is always the parent of the document element)
"""

The document element is returned by "/*", so it's the root element of the
document in ElementTree. The "document root" itself is not available in the
tree model provided by lxml.

It /could/ be a possibility to deliberately diverge from the spec here and
return the root element instead.

So, maybe you can enlighten us with your use case, so that we can decide what
implementation would fit here.

Stefan
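
For reference, the "/*" expression mentioned in this message can be tried as follows. This is a small sketch that assumes lxml is installed; the sample document is made up for illustration, and the behaviour of xpath("/") is deliberately not shown because it is the open question of this thread.

```python
from lxml import etree

# Made-up sample document for illustration.
root = etree.XML("<a><b/></a>")
tree = etree.ElementTree(root)

# "/*" selects the document element, i.e. the root element of the tree.
document_elements = tree.xpath("/*")
document_element = document_elements[0]
```

The result list should contain exactly one element, the same `<a>` element that getroot() returns.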
Stefan Behnel | 1 Aug 2006 10:01

Re: Copying an ElementTree doesn't work.

Hi John,

John Krukoff wrote:
> Quoting Stefan Behnel <behnel_ml <at> gkec.informatik.tu-darmstadt.de>:
>> John Krukoff wrote:
>>> Can someone explain to me why, when an ElementTree is copied, its root
>>> element isn't copied?
>>>
>>>>>> import lxml.etree as etree
>>>>>> import copy
>>>>>> root = etree.XML( '<a/>' )
>>>>>> tree = copy.copy( etree.ElementTree( root ) )
>>>>>> tree.getroot( ) is None
>>> True
>>
>> As ElementTrees are immutable, the above is not different from this:
>>
>> tree = etree.ElementTree(root)
>>
>> I'll add __copy__ and __deepcopy__, though, so that the above problem
>> will disappear. So, thanks for reporting this.
> 
> For what it's worth, the use case is that I have an element tree that I
> want to copy multiple times, before performing destructive changes to
> the copies. Currently, copying the contents of an element tree to
> another element tree is kind of clunky:
> 
>>>> original = etree.ElementTree( etree.XML( '<a/>' ) )
>>>> copied = etree.ElementTree( copy.copy( original.getroot( ) ) )
> 

Stefan Behnel | 1 Aug 2006 10:17

Re: Segfault in lxml during element copy

Hi John,

John Krukoff wrote:
> Thanks for the response. Yeah, I know just how vague an error report
> this is. I was really hoping I was hitting something that someone else
> had already encountered. I've already wasted a day trying to strip the
> program down to just the lxml operations, and haven't been able to come
> up with a reduced set of the program that still causes the crash.

Try to think about the main operations you apply to trees. Do you move
elements between trees? What happens to the source tree? Does the crash go
away if you keep a reference to it? (maybe in a set or list)

Do you keep cyclic references between objects that reference elements, i.e. is
the Python cyclic garbage collector involved in cleaning up XML trees?

If you use XSLT, can you reproduce the crash if you build the result tree (or
a simpler one) by hand? Do you use XPath calls or extension functions? Are
they required to trigger the crash?

These kinds of bugs are mostly related to garbage collection and Python
reference counting, so try to concentrate on code that results in freeing
references to elements and trees.

There is also a tool we commonly use to debug memory handling in lxml.etree.
It's called "valgrind". doc/valgrind.txt contains a command line that allows
you to run lxml with it. This gives you a stack trace when problems occur or
when the program crashes that *might* give us a hint on what happened. In case
you want to try, you can send me the output in private e-mail (preferably
bzip2-ed or gzipped) so that I can take a look at it.

Martijn Faassen | 1 Aug 2006 11:13

Re: lxml - exslt - regexp:match()

Stefan Behnel wrote:
[snip]
> For comparison, I now implemented the examples from the page as unit tests,
> which sadly showed that Python's regexps are incompatible with what EXSLT
> requires. The Python RE "([a-z])+ " does not match "test " as in EXSLT, only
> the last "t" is returned for the group by re.findall(). So we can't claim
> compatibility with EXSLT at this point. -- Note, though, that I never really
> said it was compatible, it just builds on Python's re module. I still think
> that's enough for a Python XML library.

If it's not compatible, I think it should be invoked differently than in 
the EXSLT way. This way someone dropping in an EXSLT stylesheet with 
regexes doesn't have a half-working stylesheet but a completely and 
clearly failing stylesheet: lxml doesn't support the regexes. In 
addition, the path forward to getting the stylesheet working is clear: 
use the Python-based and deliberately incompatible regex facility 
instead, and rewrite the regexes.

Regards,

Martijn
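
The group behaviour of Python's re module that this thread refers to can be reproduced with the standard library alone. A short sketch, using the pattern and input quoted above:

```python
import re

pattern = re.compile(r"([a-z])+ ")

# The pattern as a whole does match "test ".
match = pattern.search("test ")
whole_match = match.group(0)

# A repeated group only keeps its last repetition.
last_repetition = match.group(1)

# With a single group in the pattern, findall() returns the group
# contents rather than the whole match.
found = pattern.findall("test ")
```

Here whole_match is "test ", while both last_repetition and the single entry in found are just "t".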
Martijn Faassen | 1 Aug 2006 11:15

Re: An intriguing behaviour of xpath in lxml

Stefan Behnel wrote:
> Hi Agustin,
> 
> Agustín Villena wrote:
>> I already know that xpath(".") in the document node works, but is 
>> beyond my understanding why xpath("/") is not implemented.
> 
> Well, what would you expect it to return? The XPath spec says:
> 
> """ / selects the document root (which is always the parent of the
> document element) """
> 
> The document element is returned by "/*", so it's the root element of
> the document in ElementTree. The "document root" itself is not
> available in the tree model provided by lxml.
> 
> It /could/ be a possibility to deliberately diverge from the spec
> here and return the root element instead.

What about returning a root ElementTree? Then again, that is not the
parent of the document element at present in our tree model, right? Or
is it? Changing the getparent() behavior will have consequences we need 
to consider carefully.

> So, maybe you can enlighten us with your use case, so that we can
> decide what implementation would fit here.

Yes, that would indeed be helpful.

Regards,

Stefan Behnel | 1 Aug 2006 11:47

Re: lxml - exslt - regexp:match()

Hi Martijn,

Martijn Faassen wrote:
> Stefan Behnel wrote:
> [snip]
>> For comparison, I now implemented the examples from the page as unit
>> tests,
>> which sadly showed that Python's regexps are incompatible with what EXSLT
>> requires. The Python RE "([a-z])+ " does not match "test " as in
>> EXSLT, only
>> the last "t" is returned for the group by re.findall(). So we can't claim
>> compatibility with EXSLT at this point. -- Note, though, that I never
>> really
>> said it was compatible, it just builds on Python's re module. I still
>> think
>> that's enough for a Python XML library.
> 
> If it's not compatible, I think it should be invoked differently than in
> the EXSLT way. This way someone dropping in an EXSLT stylesheet with
> regexes doesn't have a half-working stylesheet but a completely and
> clearly failing stylesheet: lxml doesn't support the regexes. In
> addition, the path forward to getting the stylesheet working is clear:
> use the Python-based and deliberately incompatible regex facility
> instead, and rewrite the regexes.

Hmmm, I feel invited to disagree here. I reread the EXSLT spec on this topic
and it does not contain any RE syntax specification and is rather unclear
about what is required for compliance. It says this in the introduction of the
RE module:


