Sidnei da Silva | 1 Jun 05:14 2007

Crash on Win32 under heavy stress

Hi there,

We're doing some stress testing before rolling some code based on lxml
into production and we've been able to reproduce a crash when reusing
the same XSLT object repeatedly. I will dump some information here in
the hope that someone can shed some light on it before I go and compile
debug versions of everything and the kitchen sink.

The test is composed of a rather small XML file, and a rather big and
complex XSLT with several <xsl:import /> <xsl:include /> etc.

We fire up 10 threads. Each one has its own parsed XML file and XSLT;
they are not shared across threads.

Each thread then runs a loop, applying the XSLT to the XML file and
serializing the result to a string.

With fewer than 1000 iterations the crash almost never happens. At
about 50000 iterations, the crash is pretty much guaranteed to happen.
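
For readers who want to try something similar, here is a minimal sketch of the kind of test described above. It is not the original test: the XML and XSLT sources are tiny hypothetical stand-ins (the real stylesheet was large and used imports and includes), and the names `worker` and `run_stress` are made up for illustration.

```python
import threading

from lxml import etree

# Hypothetical stand-ins for the real inputs, which are not shown in the post.
XML_SOURCE = b"<doc><item>1</item><item>2</item></doc>"
XSLT_SOURCE = b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/doc">
    <out><xsl:value-of select="count(item)"/></out>
  </xsl:template>
</xsl:stylesheet>
"""


def worker(iterations, results, index):
    # Each thread parses its own document and stylesheet; nothing is shared.
    doc = etree.fromstring(XML_SOURCE)
    transform = etree.XSLT(etree.fromstring(XSLT_SOURCE))
    last = None
    for _ in range(iterations):
        # Reuse the same XSLT object and serialize the result to a string.
        last = str(transform(doc))
    results[index] = last


def run_stress(thread_count=10, iterations=100):
    results = [None] * thread_count
    threads = [
        threading.Thread(target=worker, args=(iterations, results, i))
        for i in range(thread_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Raising `iterations` towards the numbers quoted above is what would turn this into an actual stress test; the default here is kept small so the sketch finishes quickly.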

There doesn't seem to be any memory leak or anything, memory usage is
quite stable.

This is using the 1.2.1 release.

This is a static build on Win32, against:

libxml2-2.6.26.win32
libxslt-1.1.17.win32
zlib-1.2.3.win32
iconv-1.9.2.win32

Stefan Behnel | 1 Jun 06:36 2007

Re: Crash on Win32 under heavy stress

Hi Sidnei,

I'll be away on vacation starting today, so I regret I can't provide too much
critical help at the moment.

So, just a quick note here.

Sidnei da Silva wrote:
> This is using the 1.2.1 release.

For a quick test, try with 1.3beta or the current trunk. No guarantee, though,
especially the trunk is not necessarily in a perfect production state.

> This is a static build on Win32, against:
> 
> libxml2-2.6.26.win32
> libxslt-1.1.17.win32
> zlib-1.2.3.win32
> iconv-1.9.2.win32

Ok, first thing I'd personally try is the latest libxml2 and libxslt. The guys
over there keep fixing bugs (even really old ones in recent versions), so
things tend to get better over time. For example, I get a reproducible XPath
crasher in the HTML module Ian is working on with libxml2 2.6.27. It's gone
with 2.6.28. libxslt is a good bet here, too.

Sorry if this doesn't help, but trying other versions is as much as I can
propose at the moment, especially if it's urgent.

Ah, one last thing: I know, you're not testing threads for fun but for
[...]

Ian Bicking | 1 Jun 06:43 2007

Re: Crash on Win32 under heavy stress

Stefan Behnel wrote:
> Ok, first thing I'd personally try is the latest libxml2 and libxslt. The guys
> over there keep fixing bugs (even really old ones in recent versions), so
> things tend to get better over time. For example, I get a reproducible XPath
> crasher in the HTML module Ian is working on with libxml2 2.6.27. It's gone
> with 2.6.28. libxslt is a good bet here, too.

Which XPath is that?  I'd rather avoid it if I can, for those that might 
have 2.6.27.

-- 
Ian Bicking | ianb <at> colorstudy.com | http://blog.ianbicking.org
             | Write code, do good | http://topp.openplans.org/careers

Ian Bicking | 1 Jun 06:46 2007

Re: html branch

Stefan Behnel wrote:
>>>>>> lxml.[html.]clean: clean Javascript and other problem code from HTML
>>>>> That rather looks like an HtmlElement method to me: "cleanup(...)", and the
>>>>> clean_html() function would fit right into the top-level of the
>>>>> lxml.html module.
>>>> The long signature of the function made me reluctant to do this.  Any
>>>> function with that many parameters feels non-authoritative to me.  And I
>>>> would encourage people to actually write their own clean function with
>>>> the parameter defaults that are appropriate for their domain (e.g.,
>>>> clean_untrusted_comment, clean_wysiwyg_submission, etc).  I just guessed
>>>> reasonable defaults for those keyword arguments.
>>> Ah, ok, good point. Still, I would like to keep the number of modules low.
>>> lxml.html should be as close to "one point for solving your HTML needs" as
>>> possible.
>> OK.  *Actually* putting them all in one module would make the module
>> feel too big to me.  I could import them all into __init__.py.  That
>> might make the import unnecessarily slow, I'm not sure.
> 
> Avoiding imports tends to be not worth the effort. It already takes a while to
> import etree, so importing some more Python modules doesn't add much.
> 
> 
>> For some reason I've never used lazy-loading functions, though the
>> implementation seems obvious enough; just something like:
>>
>> def clean(*args, **kw):
>>     from lxml.html import clean
[...]

Sidnei da Silva | 1 Jun 14:31 2007

Re: Crash on Win32 under heavy stress

Hi Stefan,

On 6/1/07, Stefan Behnel <stefan_ml <at> behnel.de> wrote:
> Sidnei da Silva wrote:
> > This is using the 1.2.1 release.
>
> For a quick test, try with 1.3beta or the current trunk. No guarantee, though,
> especially the trunk is not necessarily in a perfect production state.

I wish I could compile trunk, but I haven't figured out a resolution
to that issue with MSVC. :(

> > This is a static build on Win32, against:
> >
> > libxml2-2.6.26.win32
> > libxslt-1.1.17.win32
> > zlib-1.2.3.win32
> > iconv-1.9.2.win32
>
> Ok, first thing I'd personally try is the latest libxml2 and libxslt. The guys
> over there keep fixing bugs (even really old ones in recent versions), so
> things tend to get better over time. For example, I get a reproducible XPath
> crasher in the HTML module Ian is working on with libxml2 2.6.27. It's gone
> with 2.6.28. libxslt is a good bet here, too.

I've updated to libxml2-2.6.27 and libxslt-1.1.19 and it hasn't
crashed so far. I've mailed Igor Zlatkovic to see if he will be
building binaries for the latest libxml2/libxslt anytime soon.

> Sorry if this doesn't help, but trying other versions is as much as I can
[...]

Itamar Shtull-Trauring | 1 Jun 19:57 2007

Network downloading of schemas should be off by default?

Right now, AFAICT, it is on by default in lxml.etree.XMLParser. Network
queries by library code are a bad idea: they are unexpected behavior,
a potential security risk, and a guaranteed performance problem.
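
As a minimal sketch of opting out explicitly, assuming the installed lxml accepts the `no_network` keyword on `XMLParser` (recent releases document it and list `True` as the default), a parser can be created like this. The helper name `parse_offline` is hypothetical.

```python
from lxml import etree


def parse_offline(data):
    # no_network=True asks the parser not to fetch related files
    # (DTDs, external entities, ...) over the network.
    parser = etree.XMLParser(no_network=True)
    return etree.fromstring(data, parser)


root = parse_offline(b"<root><child/></root>")
print(root.tag)
```

Passing the parser explicitly keeps the behaviour independent of whatever default a given lxml version ships with.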

-- 
Itamar Shtull-Trauring
http://itamarst.org

Fred Drake | 1 Jun 20:13 2007

Re: Network downloading of schemas should be off by default?

On 6/1/07, Itamar Shtull-Trauring <itamar <at> itamarst.org> wrote:
> Right now, AFAICT, it is on by default in lxml.etree.XMLParser. Network
> queries by library code are a bad idea: they are unexpected behavior,
> a potential security risk, and a guaranteed performance problem.

I actually like the way the SAX interface handles this; you provide
something that resolves references however you want, and it uses that.
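
To illustrate the SAX approach with Python's standard library (not lxml), here is a rough sketch of a resolver that records every request and answers it with empty local content, so nothing is fetched. The class `RecordingResolver`, the helper `parse_with_resolver`, and the sample document are hypothetical; the sketch also assumes a Python version where external general entities must be switched on explicitly before the resolver is consulted.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource


class RecordingResolver(EntityResolver):
    """Hypothetical resolver: logs each request and serves empty content."""

    def __init__(self):
        self.requested = []

    def resolveEntity(self, publicId, systemId):
        self.requested.append(systemId)
        # Hand back a local, empty byte stream instead of opening the URL.
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(b""))
        return source


# Sample document with one external entity (the URL is never contacted).
DOCUMENT = b"""<?xml version="1.0"?>
<!DOCTYPE doc [
  <!ENTITY ext SYSTEM "http://example.invalid/ext.xml">
]>
<doc>&ext;</doc>
"""


def parse_with_resolver(data):
    resolver = RecordingResolver()
    parser = xml.sax.make_parser()
    # Newer Python versions skip external general entities unless asked.
    parser.setFeature(feature_external_ges, True)
    parser.setContentHandler(ContentHandler())
    parser.setEntityResolver(resolver)
    parser.parse(io.BytesIO(data))
    return resolver.requested
```

The point of the design is that the application, not the library, decides what a reference resolves to: a local cache, a fixed file, or nothing at all.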

  -Fred

-- 
Fred L. Drake, Jr.    <fdrake at gmail.com>
"Chaos is the score upon which reality is written." --Henry Miller

David Pratt | 2 Jun 18:08 2007

XMLSchema validation with XMLSchema.xsd failing

Can someone advise whether XMLSchema is currently able to validate XML
schemas? I am on Mac OS X 10.4.9, using lxml 1.3 beta with the Mac's default
libxml2. In my initial attempt, I get the following errors (after
catching the exception and printing its log). I am validating against
XMLSchema.xsd, as you can see. The error occurs before my next step, which
would validate the schema I am actually interested in.

Here is my session:

>>> import lxml.etree
>>> from am.xmlschema import xmlschema_path
>>> xmlschema_doc = lxml.etree.parse(xmlschema_path)
>>> try:
...     xmlschema = lxml.etree.XMLSchema(xmlschema_doc)
... except Exception, e:
...     print e.error_log
...
/Users/davidpratt/Desktop/xmlschemademo/dev/am.xmlschema/src/am/xmlschema/XMLSchema.xsd:655:ERROR:SCHEMASP:SCHEMAP_REDEFINED_ELEMENT: Element 'element': A global element declaration with the name 'element' does already exist.
/Users/davidpratt/Desktop/xmlschemademo/dev/am.xmlschema/src/am/xmlschema/XMLSchema.xsd:864:ERROR:SCHEMASP:SCHEMAP_REDEFINED_ELEMENT: Element 'element': A global element declaration with the name 'group' does already exist.
>>>
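
For comparison, a minimal sketch of the same pattern in current Python syntax, using a tiny hypothetical schema rather than XMLSchema.xsd. It only shows how a schema that fails to compile surfaces its `error_log`; it does not reproduce or explain the errors above. The helper `load_schema` is a made-up name.

```python
from lxml import etree

# Hypothetical minimal schema, not the XMLSchema.xsd from the session above.
SCHEMA_SOURCE = b"""\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="greeting" type="xs:string"/>
</xs:schema>
"""

# Deliberately broken: the same global element is declared twice.
BROKEN_SCHEMA_SOURCE = b"""\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="greeting" type="xs:string"/>
  <xs:element name="greeting" type="xs:string"/>
</xs:schema>
"""


def load_schema(source):
    """Return (schema, messages); schema is None if it fails to compile."""
    try:
        return etree.XMLSchema(etree.fromstring(source)), []
    except etree.XMLSchemaParseError as exc:
        # error_log holds one entry per problem reported while compiling.
        return None, [entry.message for entry in exc.error_log]


schema, _ = load_schema(SCHEMA_SOURCE)
broken_schema, messages = load_schema(BROKEN_SCHEMA_SOURCE)
for message in messages:
    print(message)
```

Catching `XMLSchemaParseError` instead of a bare `Exception` keeps unrelated failures, such as a missing file, from being mistaken for schema problems.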

Many thanks

Regards

Stefan Behnel | 2 Jun 22:28 2007

Re: Crash on Win32 under heavy stress

Hi Sidnei,

Sidnei da Silva wrote:
> On 6/1/07, Stefan Behnel <stefan_ml <at> behnel.de> wrote:
>> Sidnei da Silva wrote:
>> > This is using the 1.2.1 release.
>>
>> For a quick test, try with 1.3beta or the current trunk. No guarantee, though,
>> especially the trunk is not necessarily in a perfect production state.
> 
> I wish I could compile trunk, but I haven't figured out a resolution
> to that issue with MSVC. :(

Ever tried MinGW? I read in a couple of posts that it works pretty well with
Python modules - not sure about Pyrex and setuptools, though.

> I've updated to libxml2-2.6.27 and libxslt-1.1.19 and it hasn't
> crashed so far. I've mailed Igor Zlatkovic to see if he will be
> building binaries for the latest libxml2/libxslt anytime soon.

Well, if 2.6.27 works for your code, then that's fine. I just said there seem
to be certain XPath expressions that make it crash. If you don't hit that
problem, you should be on the safe side.

>> Sorry if this doesn't help, but trying other versions is as much as I can
>> propose at the moment, especially if it's urgent.
> 
> That was good advice, and it seems to have solved the immediate issue.
> Thank you a lot!

Stefan Behnel | 2 Jun 22:40 2007

Re: Crash on Win32 under heavy stress

Hi Ian,

Ian Bicking wrote:
> Stefan Behnel wrote:
>> I get a reproducible XPath
>> crasher in the HTML module Ian is working on with libxml2 2.6.27. It's
>> gone with 2.6.28. libxslt is a good bet here, too.
> 
> Which XPath is that?  I'd rather avoid it if I can, for those that might
> have 2.6.27.

It happens in the "clean" doctests, in one of the ".xpath()" method calls. One
thing to try might be using XPath() instead. However, looking at the way
clean() is implemented, I would rather rewrite a few places to use the already
existing loop over getiterator() rather than XPath. Most likely, users will
end up requiring the loop to run anyway, so it would be best to use it for
more rather than additionally parsing and running a C loop on different XPath
expressions.
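
Roughly, the two options look like this. It is only a sketch and not the actual clean() code: the element names, the sample markup, and the helper names are hypothetical, and it uses iter(), which newer lxml versions prefer over getiterator().

```python
from lxml import etree

# Option 1: compile the expression once with XPath() and reuse it,
# instead of passing a string to the .xpath() method each time.
find_scripts = etree.XPath("//script")


def count_scripts(root):
    return len(find_scripts(root))


# Option 2: a single Python loop over the tree that collects everything
# of interest in one pass, rather than one XPath query per element kind.
def collect_in_one_pass(root, wanted=("script", "a")):
    found = {name: [] for name in wanted}
    for element in root.iter():
        # Comments and processing instructions have non-string tags.
        if isinstance(element.tag, str) and element.tag in found:
            found[element.tag].append(element)
    return found


SAMPLE = (
    b"<html><body><script>x()</script>"
    b"<a href='#'>one</a><p><a href='/x'>two</a></p></body></html>"
)
root = etree.fromstring(SAMPLE)
print(count_scripts(root), len(collect_in_one_pass(root)["a"]))
```

If the loop has to run anyway, folding more checks into it avoids parsing and evaluating a separate expression for each one.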

Stefan
