Stefan Behnel | 1 May 12:15 2008

lxml 2.0.5 released

Hi all,

lxml 2.0.5 is on PyPI. This is a bug-fix-only release of the stable 2.0 series.

Have fun,
Stefan

2.0.5 (2008-05-01)
Bugs fixed

    * Resolving to a filename in custom resolvers didn't work.
    * lxml did not honour libxslt's second error state "STOPPED", which let
      some XSLT errors pass silently.
    * Memory leak in Schematron with libxml2 >= 2.6.31.
Alex Klizhentas | 1 May 20:14 2008

Custom Elements question

Hi All,
Got a question:

I've extended the ElementBase object using the approach described in the tutorial, but SubElement does not work as desired:

class NodeBase(etree.ElementBase):
    def append(self, child):
        print("aaa")
        return etree.ElementBase.append(self, child)

etree.SubElement(root,"child") #no "aaa" printed

OK, but when taking your code to the module:

def SubElement(parent, tag, attrib={}, **extra):
    attrib = attrib.copy()
    attrib.update(extra)
    element = parent.makeelement(tag, attrib)
    parent.append(element)
    return element

SubElement(root,"child") # "aaa" is here!

and overriding
    def makeelement(self, tag, attrib):
        return Node(tag, attrib)

in the NodeBase just does not help,

Any advice will be appreciated,
Alex
_______________________________________________
lxml-dev mailing list
lxml-dev <at> codespeak.net
http://codespeak.net/mailman/listinfo/lxml-dev
Stefan Behnel | 1 May 20:28 2008

Re: Custom Elements question

Hi,

Alex Klizhentas wrote:
> I've extended the ElementBase object using the approach described in the
> tutorial, but SubElement does not work as desired:
> 
> class NodeBase(etree.ElementBase):
>     def append(self, child):
>         print("aaa")
>         return etree.ElementBase.append(self, child)
> 
> etree.SubElement(root,"child") #no "aaa" printed

That's because SubElement() does not call .append().

> OK, but when taking your code to the module:
> 
> def SubElement(parent, tag, attrib={}, **extra):
>     attrib = attrib.copy()
>     attrib.update(extra)
>     element = parent.makeelement(tag, attrib)
>     parent.append(element)
>     return element
> 
> SubElement(root,"child") # "aaa" is here!

As expected, as you call .append() explicitly here.

> and overriding
>     def makeelement(self, tag, attrib):
>         return Node(tag, attrib)
> 
> in the NodeBase just does not help,

SubElement() does not call .makeelement() either. It's implemented in plain C.
Could you explain a bit why you want to do this and how your .append() differs
from the normal append code?

Stefan
Alex Klizhentas | 1 May 21:11 2008

Re: Custom Elements question

Thanks for the comments,

The idea behind this is to allow the XML tree to notify observers when its contents are changed: a node is added, removed or moved.

That's why I'm going to override the ElementBase methods so that they notify observers when certain actions are performed.

Everything works fine except for this useful SubElement function, which did not work as expected. Now you've clarified things.

Thanks
Alex

--
Regards,
Alex
Stefan Behnel | 2 May 08:49 2008

Re: Custom Elements question


Alex Klizhentas wrote:
>> Alex Klizhentas wrote:
>>> I've extended the ElementBase object using the approach described in the
>>> tutorial, but SubElement does not work as desired:
>>>
>>> class NodeBase(etree.ElementBase):
>>>     def append(self, child):
>>>         print("aaa")
>>>         return etree.ElementBase.append(self, child)
>>>
>>> etree.SubElement(root,"child") #no "aaa" printed
>> That's because SubElement() does not call .append().
>>
>>
>>> OK, but when taking your code to the module:
>>>
>>> def SubElement(parent, tag, attrib={}, **extra):
>>>     attrib = attrib.copy()
>>>     attrib.update(extra)
>>>     element = parent.makeelement(tag, attrib)
>>>     parent.append(element)
>>>     return element
>
> The idea behind this is to allow the XML tree to notify observers when its
> contents are changed: a node is added, removed or moved.
>
> That's why I'm going to override the ElementBase methods so that they
> notify observers when certain actions are performed.
>
> Everything works fine except for this useful SubElement function, which did
> not work as expected. Now you've clarified things.

Ah, sure. Then it's best to use a pure Python implementation of SubElement
instead, as the one above.
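
A minimal sketch of that approach, in case it helps. The module-level `events` list is a hypothetical stand-in for real observers, and the parser setup with `ElementDefaultClassLookup` is just one way to get `NodeBase` instances; adapt both to your code. It shows that etree.SubElement() bypasses the overridden .append(), while the pure Python SubElement() goes through it:

```python
from lxml import etree

# Hypothetical stand-in for real observers: a module-level event log.
events = []


class NodeBase(etree.ElementBase):
    def append(self, child):
        # Notify first, then delegate to the normal append code.
        events.append(("append", self.tag, child.tag))
        return etree.ElementBase.append(self, child)


# Make this parser use NodeBase for all elements it creates.
parser = etree.XMLParser()
parser.set_element_class_lookup(
    etree.ElementDefaultClassLookup(element=NodeBase))


def SubElement(parent, tag, attrib={}, **extra):
    # Pure Python variant: calls parent.append(), so overrides are honoured.
    attrib = attrib.copy()
    attrib.update(extra)
    element = parent.makeelement(tag, attrib)
    parent.append(element)
    return element


root = parser.makeelement("root")
etree.SubElement(root, "silent")  # C implementation, NodeBase.append() not called
SubElement(root, "noisy")         # Python implementation, NodeBase.append() called
print(events)
```

Keeping the observer state outside the element instance is deliberate here, since lxml may recreate the Python proxy objects for elements.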

Stefan
Stefan Behnel | 2 May 16:30 2008

Re: Custom Elements question

Hi,

another bit of reasoning here.

Stefan Behnel wrote:
> Alex Klizhentas wrote:
>> I've extended the ElementBase object using the approach described in the
>> tutorial, but SubElement does not work as desired:
>>
>> class NodeBase(etree.ElementBase):
>>     def append(self, child):
>>         print("aaa")
>>         return etree.ElementBase.append(self, child)
>>
>> etree.SubElement(root,"child") #no "aaa" printed
> 
> That's because SubElement() does not call .append().
[...]
> SubElement() does not call .makeelement() either. It's implemented in plain C.

One important reason is that this allows lxml.etree to append the new libxml2
node at the C level *before* the decision is taken which Python class should
be used to represent it. This might have an impact on the class lookup if it
considers the parental relation when taking its decision (lxml.objectify does
that, for example).

But that's the only difference I can see between etree.SubElement() and your
Python implementation. And you could even work around it by doing something
like this:

def SubElement(parent, tag, attrib={}, **extra):
     attrib = attrib.copy()
     attrib.update(extra)
     element = parent.makeelement(tag, attrib)
     parent.append(element)
     del element
     return parent[-1]

However, you might want to avoid that if you know you won't need it, e.g. when
using the "namespace" or "default" lookup scheme.

Stefan
Stefan Behnel | 2 May 19:16 2008

threading fixed :)

Hi,

there has been a long-standing issue in the threading support in lxml,
combined with the per-thread string hash table we use for libxml2.

Here is a simple example of a sure crasher:

-------------------------------
import threading
import lxml.etree as et

xml = "<root><threadtag/></root>"

main_root = et.XML("<root/>")

def run_thread():
    thread_root = et.XML(xml)
    main_root.append(thread_root[0])
    del thread_root # deletes the document

thread = threading.Thread(target=run_thread)

thread.start()
thread.join()

print(et.tostring(main_root))
-------------------------------

This crashes, because the thread parses the XML fragment into its own
dictionary and stores the tag name "threadtag" there. Then it appends the
"threadtag" element to a tree in the main program, which uses a different
dict. When it deletes the "thread_root", the document will be deleted as well,
and the (ref-counted) thread dictionary that contains the string "threadtag"
will be freed when the thread terminates. The main program then crashes when
it accesses the no longer available tag name in the corrupted document.

The solution I came up with today is actually quite simple. We have to
traverse the subtree anyway to update the document references and to fix the
namespace declarations. So it's only one step more to also fix the name
pointers by looking them up in the target dictionary and re-assigning the
names. This is only required when we really have two different dicts, which is
easy to decide. So there isn't even a performance impact if you only use a
single thread or if you do not move subtrees between threads. And the added
overhead when you need this is really small.
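
The idea can be modelled in a few lines of plain Python. This is only a toy sketch, not the C code in lxml: the `Node` class, the `adopt_subtree()` helper and the plain dicts standing in for the libxml2 string dictionaries are all made up for illustration.

```python
class Node:
    """Toy stand-in for a libxml2 node whose name lives in a string dictionary."""

    def __init__(self, name, names, children=()):
        self.names = names                        # dictionary that owns the name
        self.name = names.setdefault(name, name)  # interned name entry
        self.children = list(children)


def adopt_subtree(node, target_names):
    """Re-point every name in the subtree at an entry of target_names."""
    if node.names is target_names:
        return  # same dictionary: nothing to fix, no extra traversal
    stack = [node]
    while stack:
        current = stack.pop()
        # Look the name up in the target dictionary and re-assign it.
        current.name = target_names.setdefault(current.name, current.name)
        current.names = target_names
        stack.extend(current.children)


main_names, thread_names = {}, {}
main_root = Node("root", main_names)
thread_root = Node("root", thread_names, [Node("threadtag", thread_names)])

# Move the subtree from the "thread" document into the "main" document.
moved = thread_root.children.pop(0)
adopt_subtree(moved, main_names)
main_root.children.append(moved)

thread_names.clear()  # the thread's dictionary goes away
print(main_root.children[0].name)
```

After the move, the name of the moved node is owned by the main dictionary, so freeing the thread's dictionary no longer affects it.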

I will release a new beta of 2.1 soon that will have this change, and it would
be very helpful if people who currently use threaded code that exchanges (i.e.
deep copies) tree fragments between threads could check if this works for them
(i.e. if code that crashes under 2.0 if you remove the deep copying works
under 2.1). If it proves to fix the problem, I will backport it to 2.0 also.
Read: the more feedback I get, the faster this will be fixed in 2.0. :)

Stefan
Alex Klizhentas | 2 May 19:21 2008

Re: Custom Elements question

Thanks Stefan,

All the nodes in that tree should have the same type, that's why the default class lookup scheme for parser works fine.

BTW, I have one more question. To set the xml:id I use the following helper:

def xml_id(v):
    # helper function to create name space attributes
    return {'{http://www.w3.org/XML/1998/namespace}id': v}

and the following construct:

N.child1("text",xml_id("some_id"))

following the examples from the site.

to get the id I use:

class NodeBase(etree.ElementBase):
    ...   
    def get_node_id(self,id):
        searched = self.find(".//*[@{http://www.w3.org/XML/1998/namespace}id='%s']" % (id,))
        if searched is None:
            raise NodeNotFoundError(id)
        return searched


I have two questions:

1. Which way is faster to get an element by ID? Should I use find() or xpath() for better performance?
2. Is there a way to set xml:id using the xml prefix?

Thanks,
Alex

--
Regards,
Alex
Stefan Behnel | 2 May 19:42 2008

Re: Custom Elements question

Hi,

Alex Klizhentas wrote:
> I have one more question. To set the xml:id I use the following helper:
> 
> def xml_id(v):
>     # helper function to create name space attributes
>     return {'{http://www.w3.org/XML/1998/namespace}id': v}
> 
> and the following construct:
> 
> N.child1("text",xml_id("some_id"))
> 
> following the examples from the site.
> 
> to get the id I use:
> 
> class NodeBase(etree.ElementBase):
>     ...
>     def get_node_id(self,id):
>         searched = self.find(
>             ".//*[@{http://www.w3.org/XML/1998/namespace}id='%s']" % (id,))
>         if searched is None:
>             raise NodeNotFoundError(id)
>         return searched
> 
> I have two questions:
> 
> 1. Which way is faster to get an element by ID? Should I use find() or
> xpath() for better performance?

timeit will tell you that. But it really depends on the data. element.find()
stops short after the first hit, so that's probably faster on average if the
document is large. OTOH, XPath() is implemented in C and could easily beat the
Python code behind find("...@attr...") for smaller documents...

Try this:

    find_id = etree.ETXPath(
        ".//*[@{http://www.w3.org/XML/1998/namespace}id=$id]")
    ...
    def get_node_id(self, id):
        el = find_id(self, id=id)  # returns a list of matching elements

> 2. Is there a way to set xml:id using the xml prefix?

No, but if you know you run single-threaded, you can reuse the attrib dict and
just change the value. That's faster than recreating it each time.
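
For completeness, here is a self-contained sketch of the ETXPath variant. The sample document, the `get_node_by_id()` helper name and the `NodeNotFoundError` class are assumptions made for illustration; only `etree.ETXPath` and the `$id` variable come from the snippet above.

```python
from lxml import etree

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Compiled once at import time; $id is an XPath variable bound per call.
find_id = etree.ETXPath(".//*[@%s = $id]" % XML_ID)


class NodeNotFoundError(LookupError):
    """Hypothetical error type, mirroring the one used in the question."""


def get_node_by_id(root, node_id):
    # The compiled XPath returns a list of all matching elements.
    matches = find_id(root, id=node_id)
    if not matches:
        raise NodeNotFoundError(node_id)
    return matches[0]


# Hypothetical sample document.
root = etree.XML('<root><a xml:id="x1"/><b><c xml:id="x2"/></b></root>')
print(get_node_by_id(root, "x2").tag)
```

Passing the ID as an XPath variable also avoids quoting problems that string formatting into the expression can run into.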

Stefan
Stefan Behnel | 2 May 20:48 2008

lxml 2.1beta2 released

Hi all,

I'm happy to announce the release of lxml 2.1 beta2. It features a couple of
enhancements and fixes over the first beta. The main improvement is the much
more robust threading support, which makes it a lot easier to move subtrees
back and forth between threads. It is described in more detail here:

http://permalink.gmane.org/gmane.comp.python.lxml.devel/3571

Please report back on the list (preferably in reply to the above thread) if
you notice a difference to lxml 2.0 with your code.

Have fun,
Stefan

2.1beta2 (2008-05-02)

Features added

    * All parse functions in lxml.html take a parser keyword argument.
    * lxml.html has a new parser class XHTMLParser and a module attribute
      xhtml_parser that provide XML parsers that are pre-configured for the
      lxml.html package.

Bugs fixed

    * Moving a subtree from a document created in one thread into a document
      of another thread could crash when the rest of the source document is
      deleted while the subtree is still in use.
    * Passing an nsmap when creating an Element will no longer strip
      redundantly defined namespace URIs. This prevented the definition of
      more than one prefix for a namespace on the same Element.

Other changes

    * If the default namespace is redundantly defined with a prefix on the
      same Element, the prefix will now be preferred for subelements and
      attributes. This allows users to work around a problem in libxml2 where
      attributes from the default namespace could serialise without a prefix
      even when they appear on an Element with a different namespace (i.e.
      they would end up in the wrong namespace).
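
A small sketch of the nsmap fix listed above, assuming a made-up namespace URI. With this change, both prefixes should be declared on the Element even though they map to the same namespace:

```python
from lxml import etree

NS = "http://example.com/ns"  # hypothetical namespace URI

# Two prefixes mapped to the same namespace URI on one Element.
root = etree.Element("root", nsmap={"a": NS, "b": NS})

serialised = etree.tostring(root)
print(serialised)
print(root.nsmap)
```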
