Picon

Re: Memory leak when parsing XML files in sequence?

Hi,

> Christian Heimes, 30.10.2011 14:58:
>> I did some additional debugging with valgrind and found the code segment
>> that causes the memory leak. Well, it's not a real memory leak but a
>> feature. ;)

>> It's in libxml2's SAX2.c in the function xmlSAX2StartElementNs():>>
>>            * when validating, the ID registration is done at the attribute
>>            * validation level. Otherwise we have to do specific handling here.
>>            */
>>           if (xmlStrEqual(fullname, BAD_CAST "xml:id")) {
>>               /*
>>                * Add the xml:id value
>>                *
>>                * Open issue: normalization of the value.
>>                */
>>               if (xmlValidateNCName(value, 1) != 0) {
>>                   xmlErrValid(ctxt, XML_DTD_XMLID_VALUE,
>>                         "xml:id : attribute value %s is not an NCName\n",
>>                               (const char *) value, NULL);
>>               }
>>               xmlAddID(&ctxt->vctxt, ctxt->myDoc, value, ret);
>>           } else if (xmlIsID(ctxt->myDoc, ctxt->node, ret))
>>               xmlAddID(&ctxt->vctxt, ctxt->myDoc, value, ret);
>>           else if (xmlIsRef(ctxt->myDoc, ctxt->node, ret))
>>               xmlAddRef(&ctxt->vctxt, ctxt->myDoc, value, ret);
>>       }
>>
>>
(Continue reading)

Simon Sapin | 3 Nov 17:48 2011
Picon

lxml.cssselect much slower on 2.3.1

Hi,

Thanks to everyone working on lxml. It helped a lot in developing 
WeasyPrint (http://weasyprint.org)

lxml 2.3.1 changes how the descendant selector is implemented in cssselect:

https://github.com/lxml/lxml/commit/503d276c79bbe74db7d210814cfd28132fbe2c92#src/lxml/cssselect.py

This changes makes matching affected selectors much slower on big 
documents. On my machine, the following test gives 0.15 seconds with 2.3 
but 45 seconds with 2.3.1:

import time
from lxml import html, cssselect

document = html.parse('http://www.w3.org/TR/html5/Overview.html')
match = cssselect.CSSSelector('dl dt')
t1 = time.time()
match(document)
t2 = time.time()
print t2 - t1

(You may want to download the file and parse that if you want to run the 
test repeatedly. It’s 4.7 MB of HTML.)

Matching all selectors founds in the document’s stylesheets (as 
WeasyPrint does to render it in PDF) takes about one minute in 2.3. I 
don’t know about 2.3.1 because I gave up and killed it after one hour.

(Continue reading)

Stefan Behnel | 3 Nov 20:26 2011
Picon

Re: lxml.cssselect much slower on 2.3.1

Simon Sapin, 03.11.2011 17:48:
> Thanks to everyone working on lxml. It helped a lot in developing
> WeasyPrint (http://weasyprint.org)
>
> lxml 2.3.1 changes how the descendant selector is implemented in cssselect:
>
> https://github.com/lxml/lxml/commit/503d276c79bbe74db7d210814cfd28132fbe2c92#src/lxml/cssselect.py
>
> This changes makes matching affected selectors much slower on big
> documents. On my machine, the following test gives 0.15 seconds with 2.3
> but 45 seconds with 2.3.1:
>
> import time
> from lxml import html, cssselect
>
> document = html.parse('http://www.w3.org/TR/html5/Overview.html')
> match = cssselect.CSSSelector('dl dt')
> t1 = time.time()
> match(document)
> t2 = time.time()
> print t2 - t1
>
> (You may want to download the file and parse that if you want to run the
> test repeatedly. It’s 4.7 MB of HTML.)

Thanks for the report, I can reproduce this. I get 7 seconds for the latest 
version and 0.1 seconds for the version preceding the above commit. That is 
a rather large regression.

This seems to be related:
(Continue reading)

Stefan Behnel | 3 Nov 21:14 2011
Picon

Re: Memory leak when parsing XML files in sequence?

Maarten van Gompel (proycon), 03.11.2011 11:36:
>> Christian Heimes, 30.10.2011 14:58:
>>> I did some additional debugging with valgrind and found the code segment
>>> that causes the memory leak. Well, it's not a real memory leak but a
>>> feature. ;)
>
>>> It's in libxml2's SAX2.c in the function xmlSAX2StartElementNs():>>
>>>             * when validating, the ID registration is done at the attribute
>>>             * validation level. Otherwise we have to do specific handling here.
>>>             */
>>>            if (xmlStrEqual(fullname, BAD_CAST "xml:id")) {
>>>                /*
>>>                 * Add the xml:id value
>>>                 *
>>>                 * Open issue: normalization of the value.
>>>                 */
>>>                if (xmlValidateNCName(value, 1) != 0) {
>>>                    xmlErrValid(ctxt, XML_DTD_XMLID_VALUE,
>>>                          "xml:id : attribute value %s is not an NCName\n",
>>>                                (const char *) value, NULL);
>>>                }
>>>                xmlAddID(&ctxt->vctxt, ctxt->myDoc, value, ret);
>>>            } else if (xmlIsID(ctxt->myDoc, ctxt->node, ret))
>>>                xmlAddID(&ctxt->vctxt, ctxt->myDoc, value, ret);
>>>            else if (xmlIsRef(ctxt->myDoc, ctxt->node, ret))
>>>                xmlAddRef(&ctxt->vctxt, ctxt->myDoc, value, ret);
>>>        }
>>>
>>>
>>> libxml2 keeps a reference when it finds a xml:id attribute. I don't see
(Continue reading)

Stefan Behnel | 3 Nov 22:06 2011
Picon

Re: lxml.cssselect much slower on 2.3.1

Stefan Behnel, 03.11.2011 20:26:
> Simon Sapin, 03.11.2011 17:48:
>> By reading the changelog I don’t understand why the change was made or
>> what effects it has (other than performance). When is A//B different or
>> preffered to A/descendant::B ?
>
> Here's the corresponding ticket:
>
> https://bugs.launchpad.net/lxml/+bug/657968

One addition here: the original code converted "dl dt" into

     descendant-or-self::dl/descendant::dt

This is correct as long as "dt" is a concrete tag name. However, the 
selector expression "dl *:last-child" is in the same way incorrectly 
converted into

     descendant-or-self::dl/descendant::*[position() = last()]

Here, position() refers to the position in the list of descendants, not in 
the list of children. The correct translation would be

     descendant-or-self::dl/descendant-or-self::*/*[position() = last()]

It appears that the translation of "dl dt" into

     descendant-or-self::dl/descendant-or-self::*/dt

runs equally fast as the original version, so I think this is the way to go.
(Continue reading)

Stefan Behnel | 3 Nov 22:14 2011
Picon

Re: lxml.cssselect much slower on 2.3.1

Stefan Behnel, 03.11.2011 22:06:
> Stefan Behnel, 03.11.2011 20:26:
>> Simon Sapin, 03.11.2011 17:48:
>>> By reading the changelog I don’t understand why the change was made or
>>> what effects it has (other than performance). When is A//B different or
>>> preffered to A/descendant::B ?
>>
>> Here's the corresponding ticket:
>>
>> https://bugs.launchpad.net/lxml/+bug/657968
>
> One addition here: the original code converted "dl dt" into
>
>       descendant-or-self::dl/descendant::dt
>
> This is correct as long as "dt" is a concrete tag name. However, the
> selector expression "dl *:last-child" is in the same way incorrectly
> converted into
>
>       descendant-or-self::dl/descendant::*[position() = last()]
>
> Here, position() refers to the position in the list of descendants, not in
> the list of children. The correct translation would be
>
>       descendant-or-self::dl/descendant-or-self::*/*[position() = last()]
>
> It appears that the translation of "dl dt" into
>
>       descendant-or-self::dl/descendant-or-self::*/dt
>
(Continue reading)

Simon Sapin | 3 Nov 22:40 2011
Picon

Re: lxml.cssselect much slower on 2.3.1

Le 03/11/2011 22:14, Stefan Behnel a écrit :
> Here's the obvious patch against 2.3.1, please give it a try.

Hi Stefan,

Thanks for your quick and detailed response.

Here are the results of the same test as in my first message, but on a 
different machine:

lxml 2.3: 0.089 seconds
lxml 2.3.1: 11 seconds
lxml 2.3.1 with this patch: 0.69 seconds

So while not as fast as 2.3, this is much more acceptable than 2.3.1.

Do these three XPath expressions have the same meaning or are there 
cases where they do not give the same results? Which are correct for CSS?

Regards,
--

-- 
Simon
_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml <at> lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml
Stefan Behnel | 3 Nov 23:02 2011
Picon

Re: lxml.cssselect much slower on 2.3.1

Simon Sapin, 03.11.2011 22:40:
> Le 03/11/2011 22:14, Stefan Behnel a écrit :
>> Here's the obvious patch against 2.3.1, please give it a try.
>
> Thanks for your quick and detailed response.
>
> Here are the results of the same test as in my first message, but on a
> different machine:
>
> lxml 2.3: 0.089 seconds
> lxml 2.3.1: 11 seconds
> lxml 2.3.1 with this patch: 0.69 seconds
>
> So while not as fast as 2.3, this is much more acceptable than 2.3.1.
>
> Do these three XPath expressions have the same meaning or are there
> cases where they do not give the same results? Which are correct for CSS?

Only the latter two give the correct result in all cases. The regression 
test suite also is a bit better in 2.3.1, so you'll notice when they break. 
See the doctests in test_css.txt and test_css_select.txt.

It may still be possible to optimise the "simple" cases (those without 
conditions, or more specifically, those that do not require the position() 
XPath function in a condition) into using the faster "descendant::" axis, 
so that "dt dl" would be as fast as before, whereas "dt *:last-child" and 
maybe also things like "dt .classname" would get a performance hit (the 
latter could be optimised because, although using a condition, it does not 
depend on the position() XPath function).

(Continue reading)

Simon Sapin | 4 Nov 00:22 2011
Picon

Re: lxml.cssselect much slower on 2.3.1

Le 03/11/2011 23:02, Stefan Behnel a écrit :
 > Only the latter two give the correct result in all cases. The regression
 > test suite also is a bit better in 2.3.1, so you'll notice when they 
break.
 > See the doctests in test_css.txt and test_css_select.txt.
 >
 > It may still be possible to optimise the "simple" cases (those without
 > conditions, or more specifically, those that do not require the 
position()
 > XPath function in a condition) into using the faster "descendant::" axis,
 > so that "dt dl" would be as fast as before, whereas "dt *:last-child" and
 > maybe also things like "dt .classname" would get a performance hit (the
 > latter could be optimised because, although using a condition, it 
does not
 > depend on the position() XPath function).
 >
 > However, this requires more time than I can currently invest.
 >
 > I take patches.:)

Results vary a lot depending on the selector, but translation 1 (in 2.3) 
can be orders of magnitude faster than translation 3 (you latest patch), 
while 3 can be orders of magnitude faster than 2 (2.3.1)

However if I understand right (pleas correct me if I’m wrong), 1 is 
wrong with some selectors. So your patch can be safely committed?

Additionally, 1 could be use in some cases when it is safe, be deciding 
so is more difficult. I’m investigate.

(Continue reading)

Simon Sapin | 4 Nov 00:33 2011
Picon

Incorrect XPath translation of some CSS selectors

Hi,

Please consider the following session:

 >>> import lxml.cssselect
 >>> lxml.cssselect.css_to_xpath('a >#b')
u"descendant-or-self::a/*[ <at> id = 'b']"
 >>> lxml.cssselect.css_to_xpath('a >* #b')
u"descendant-or-self::a/*//*[ <at> id = 'b']"
 >>> lxml.cssselect.css_to_xpath('a > #b')
u"descendant-or-self::a/*//*[ <at> id = 'b']"

The first and second translations are correct, but the third is not. The 
third CSS selector is equivalent to the first, not the second as the 
cssselect parser seems to think.

The same bug happens with a class instead of an ID, and maybe in other 
cases.

I can adapt the existing test suite so that is fails where it should 
not, if it helps.

PS: should bug discussion go here on the mailing list or on Github’s or 
Launchpad’s bug-tracker?

Regards,
--

-- 
Simon Sapin
_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
(Continue reading)


Gmane