Aymeric Augustin | 2 Jan 20:57

Should lxml.etree.iterparse support gzipped files?

Hello,

lxml.etree.parse is able to load gzipped XML files directly, but  
lxml.etree.iterparse is not. See below for an interactive session  
demonstrating the problem on debian stable. Is it the expected  
behavior, or is it a bug?

The documentation does not address this point; it says only:
> lxml can parse from a local file, an HTTP URL or an FTP URL. It  
> also auto-detects and reads gzip-compressed XML files (.gz).

Context: I'm handling hundreds of GB-sized files. It would be nice to  
store them gzipped and have lxml decompress them on the fly, without  
any specific Python code.

Thanks!

% python
Python 2.5.2 (r252:60911, Jan  4 2009, 21:59:32)
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
 >>> import gzip, sys
 >>> from lxml import etree
 >>> print etree.__version__
2.1.1

Let's create a gzipped XML file:

 >>> gzip.open('test.xml.gz', 'wb').write('<a><b /></a>')

[...]

Stefan Behnel | 3 Jan 08:25

Re: Should lxml.etree.iterparse support gzipped files?


Aymeric Augustin, 02.01.2010 20:57:
> lxml.etree.parse is able to load gzipped XML files directly, but  
> lxml.etree.iterparse is not.
 > [...]
 > The documentation does not address this point; it says only:
 >> lxml can parse from a local file, an HTTP URL or an FTP URL. It
 >> also auto-detects and reads gzip-compressed XML files (.gz).

Right, there should be a note in the iterparse docs also. The input support 
in iterparse() is a lot simpler than that. It doesn't support URLs either.

Due to the inner workings of iterparse, all of this isn't trivial to add, 
as lxml would have to detect and apply the correct reading mechanism itself 
(e.g. by building up a decompression step for libxml2 manually). Even 
detecting the compression would require opening the file and reading from 
it first. Now imagine named pipes and system streams, which you cannot just 
reopen afterwards...

It might be possible to detect GzipFile objects and bypass them, but that 
would already be a difference to the normal parse() behaviour.

> Context: I'm handling hundreds of GB-sized files. It would be nice to  
> store them gzipped and have lxml decompress them on the fly, without  
> any specific Python code.

The way to do that is currently by passing through the gzip module. You can 
also try using a pipe to an externally started gzip process. I frequently 
use this on 64-bit multicore Sun machines where the system provided gzip is 
incredibly fast, much faster than Python's gzip module.
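
A minimal sketch of the gzip-module route mentioned above, written for Python 3 rather than the Python 2.5 session shown earlier. The function name `count_elements` is made up for illustration, and the stdlib fallback import is only there so the snippet also runs where lxml is not installed.

```python
import gzip

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:  # stdlib fallback with a compatible iterparse()
    import xml.etree.ElementTree as etree


def count_elements(path, tag):
    """Stream a gzipped XML file and count the elements named `tag`."""
    count = 0
    # gzip.open() returns a file-like object that decompresses on read,
    # so iterparse() only ever sees plain XML bytes.
    with gzip.open(path, "rb") as stream:
        for _event, element in etree.iterparse(stream, events=("end",)):
            if element.tag == tag:
                count += 1
            # Drop the content of finished elements to limit memory use.
            element.clear()
    return count
```

For the file created in the first message, `count_elements('test.xml.gz', 'b')` would be the call to try.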

Aymeric Augustin | 3 Jan 11:00

Re: Should lxml.etree.iterparse support gzipped files?

On 3 janv. 10, at 08:25, Stefan Behnel wrote:
> Right, there should be a note in the iterparse docs also. The input  
> support in iterparse() is a lot simpler than that. It doesn't  
> support URLs either.
>
> Due to the inner workings of iterparse, all of this isn't trivial  
> to add, as lxml would have to detect and apply the correct reading  
> mechanism itself (e.g. by building up a decompression step for  
> libxml2 manually). Even detecting the compression would require  
> opening the file and reading from it first. Now imagine named pipes  
> and system streams, which you cannot just reopen afterwards...

OK, thanks for the explanation.

>> Context: I'm handling hundreds of GB-sized files. It would be nice  
>> to  store them gzipped and have lxml decompress them on the fly,  
>> without  any specific Python code.
>
> The way to do that is currently by passing through the gzip module.  
> You can also try using a pipe to an externally started gzip  
> process. I frequently use this on 64-bit multicore Sun machines  
> where the system provided gzip is incredibly fast, much faster  
> than Python's gzip module.

I tried "zcat myfile.xml.gz" to a pipe, and etree.iterparse from the  
pipe. On a Debian with a Core 2 Duo, it's faster by 10% than using  
the gzip module. The performance gain comes from gunzipping and  
parsing in parallel; the overall resource consumption (user + system)  
is nearly identical.
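
The pipe setup described here could look roughly like the following Python 3 sketch. It assumes a `gzip` executable on the PATH (`gzip -dc` behaves like `zcat`); the function name `count_elements_via_pipe` is hypothetical, and the stdlib fallback import only keeps the snippet runnable without lxml.

```python
import subprocess

try:
    from lxml import etree
except ImportError:  # stdlib fallback with a compatible iterparse()
    import xml.etree.ElementTree as etree


def count_elements_via_pipe(path, tag):
    """Count elements named `tag`, decompressing in a separate gzip process."""
    # The child process decompresses while this process parses,
    # so the two steps can run in parallel on a multicore machine.
    proc = subprocess.Popen(["gzip", "-dc", path], stdout=subprocess.PIPE)
    count = 0
    try:
        for _event, element in etree.iterparse(proc.stdout, events=("end",)):
            if element.tag == tag:
                count += 1
            element.clear()
    finally:
        proc.stdout.close()
        returncode = proc.wait()
    if returncode != 0:
        raise RuntimeError("gzip exited with status %d" % returncode)
    return count
```

Timing this against the gzip-module version on the same file is the only reliable way to tell which one wins on a given machine.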


Stefan Behnel | 3 Jan 15:14

Re: Should lxml.etree.iterparse support gzipped files?


Aymeric Augustin, 03.01.2010 11:00:
> I tried "zcat myfile.xml.gz" to a pipe, and etree.iterparse from the 
> pipe. On a Debian with a Core 2 Duo, it's faster by 10% than using the 
> gzip module. The performance gain comes from gunzipping and parsing in 
> parallel; the overall resource consumption (user + system) is nearly 
> identical.
> 
> So for now I'll stick with the gzip module.

Sounds reasonable. You can also try to adjust the gzip buffer size and see 
if that reduces the overhead.
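
One possible reading of "adjust the gzip buffer size" in Python 3 is to wrap the `GzipFile` in an `io.BufferedReader` with an explicit `buffer_size`. This is an assumption about what is meant, not a documented lxml feature; the name `iter_end_events` and the 1 MiB default are illustrative, and whether it helps at all depends on the Python version and the data.

```python
import gzip
import io

try:
    from lxml import etree
except ImportError:  # stdlib fallback with a compatible iterparse()
    import xml.etree.ElementTree as etree


def iter_end_events(path, buffer_size=1024 * 1024):
    """Yield (event, element) pairs from a gzipped XML file."""
    # BufferedReader pulls decompressed data from the GzipFile in
    # chunks of `buffer_size` bytes; closing it closes the GzipFile too.
    with io.BufferedReader(gzip.open(path, "rb"),
                           buffer_size=buffer_size) as stream:
        for event, element in etree.iterparse(stream, events=("end",)):
            yield event, element
```

Trying a few values of `buffer_size` and measuring wall-clock time is the practical way to see if the overhead changes.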

Stefan

Andreas Jung | 6 Jan 14:33

'text', 'tail' handling with nested markup


Hi there,

given a structure like

<p>
  foo
  <q>blather</q>
  bar
  <q>blather</q>
  fox
</p>

How can I parse the content of <p> into a flat list like

['foo', <q_node>, 'bar', <q_node>, 'fox']

?

Andreas

-- 
ZOPYX Limited              \ zopyx group
Charlottenstr. 37/1         \ The full-service network for your
D-72070 Tübingen             \ Python, Zope and Plone projects
www.zopyx.com, info <at> zopyx.com \ www.zopyxgroup.com
------------------------------------------------------------------------
E-Publishing, Python, Zope & Plone development, Consulting


Stefan Behnel | 7 Jan 14:15

Re: 'text', 'tail' handling with nested markup


Andreas Jung, 06.01.2010 14:33:
> given a structure like
> 
> <p>
>   foo
>   <q>blather</q>
>   bar
>   <q>blather</q>
>   fox
> </p>
> 
> How can I parse the content of <p> into a flat list like
> 
> ['foo', <q_node>, 'bar', <q_node>, 'fox']

  for p in root.iter(tag='p'):
      flat_list = []
      if p.text:
          flat_list.append(p.text)
      for el in p:
          flat_list.append(el)
          if el.tail:
              flat_list.append(el.tail)
      print(flat_list)
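
A self-contained variant of the loop above, as a sketch: it additionally strips the surrounding whitespace so that the text chunks come out as 'foo' rather than '\n  foo\n  '. The helper name `flatten` is made up, the code targets Python 3, and the stdlib fallback import only keeps it runnable without lxml.

```python
try:
    from lxml import etree
except ImportError:  # stdlib fallback with the same text/tail model
    import xml.etree.ElementTree as etree


def flatten(parent):
    """Return stripped text chunks and child elements in document order."""
    items = []
    # Text before the first child lives on the parent's .text ...
    text = (parent.text or "").strip()
    if text:
        items.append(text)
    for child in parent:
        items.append(child)
        # ... and text after a child lives on that child's .tail.
        tail = (child.tail or "").strip()
        if tail:
            items.append(tail)
    return items


root = etree.fromstring(
    "<p>\n  foo\n  <q>blather</q>\n  bar\n  <q>blather</q>\n  fox\n</p>")
flat_list = flatten(root)
# Show elements by tag name so the result is easy to read.
print([item if isinstance(item, str) else "<%s>" % item.tag
       for item in flat_list])
# prints ['foo', '<q>', 'bar', '<q>', 'fox']
```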

Stefan

Dirk Rothe | 7 Jan 22:38

Re: 'text', 'tail' handling with nested markup

On Thu, 07 Jan 2010 14:15:41 +0100, Stefan Behnel <stefan_ml <at> behnel.de>  
wrote:

>
> Andreas Jung, 06.01.2010 14:33:
>> given a structure like
>>
>> <p>
>>   foo
>>   <q>blather</q>
>>   bar
>>   <q>blather</q>
>>   fox
>> </p>
>>
>> How can I parse the content of <p> into a flat list like
>>
>> ['foo', <q_node>, 'bar', <q_node>, 'fox']
>
>   for p in root.iter(tag='p'):
>       flat_list = []
>       if p.text:
>           flat_list.append(p.text)
>       for el in p:
>           flat_list.append(el)
>           if el.tail:
>               flat_list.append(el.tail)
>       print(flat_list)

Another variant with xpath:
[...]

Yannick Gingras | 10 Jan 21:16

Re: Looking for performance tips for soupparser

On December 31, 2009, Stefan Behnel wrote:
> > Would any of you have some tips to share on speeding things up with
> > soupparser?  How hard would it be to make elements conform to the
> > pickling protocol?
> 
> I'd use the normal HTML parser instead, and only fall back to using the 
> soupparser when things go really wrong (whatever that means in your case).
> 
> Another thing you can do (assuming that caching is helpful in your case),
> is to parse the documents using soupparser and serialise them into the 
> cache. Then parse them from the cache using the normal HTML parser 
> (preferably with "recover=False") when you need them. A serialise-parse 
> cycle is several times faster than a new parser run of BeautifulSoup, so if 
> you need the documents multiple times, this will speed things up.

I implemented both ideas and it resulted in at least a 10-fold speedup.
Thanks a lot!
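
The two ideas quoted above might be combined along these lines. This is a rough sketch, not the code used in this thread: the helper names (`parse_html`, `to_cache`, `from_cache`) are hypothetical, and treating any `etree.LxmlError` as "things going really wrong" is an assumption that would need adapting per application.

```python
from lxml import etree, html


def parse_html(data):
    """Parse with the regular HTML parser, falling back to soupparser."""
    try:
        return html.fromstring(data)
    except etree.LxmlError:
        # Imported lazily: soupparser needs BeautifulSoup and is slower.
        from lxml.html import soupparser
        return soupparser.fromstring(data)


def to_cache(root):
    """Serialise a parsed tree into bytes that can be stored in a cache."""
    return html.tostring(root)


def from_cache(serialized, recover=False):
    """Re-parse cached markup with the regular (fast) HTML parser."""
    parser = html.HTMLParser(recover=recover)
    return html.fromstring(serialized, parser=parser)
```

A document would then be parsed once with `parse_html()`, stored via `to_cache()`, and loaded with `from_cache()` on later runs.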

-- 
Yannick Gingras
http://ygingras.net
http://confoo.ca -- track coordinator
http://montrealpython.org -- lead organizer

Peter Baker | 11 Jan 22:04

Output of xsl:message when terminate is not yes

I've just recently been getting acquainted with lxml. Thanks to the
developers: it's great!

There's just one thing (so far) that I haven't been able to do. In an
XSLT transformation, I can't figure out what's going on with
xsl:message when there is no terminate="yes" attribute. Command-line
processors like xsltproc print the message to stderr. With libxslt you
can capture it with a callback function. But I can't figure out how to
display such messages with lxml.

For just a little background, my app often needs to do very long
transforms: up to a half hour, though a minute is more typical. It
seems important to be able to spit out warnings and messages to mark
the progress of the transformation. So reading an error log after the
transform is over is not what I want.

Sorry if the answer is an obvious one. It's hard to search a forum's
archive for "message," and the other keywords I've been able to think
of haven't given me the answer in either the documentation or the
forum.

Thanks in advance,
Peter Baker

Stefan Behnel | 12 Jan 09:05

Re: Output of xsl:message when terminate is not yes

Hi,

Peter Baker, 11.01.2010 22:04:
> I've just recently been getting acquainted with lxml. Thanks to the
> developers: it's great!

Thanks :)

> There's just one thing (so far) that I haven't been able to do. In an
> XSLT transformation, I can't figure out what's going on with
> xsl:message when there is no terminate="yes" attribute. Command-line
> processors like xsltproc print the message to stderr. With libxslt you
> can capture it with a callback function. But I can't figure out how to
> display such messages with lxml.
> 
> For just a little background, my app often needs to do very long
> transforms: up to a half hour, though a minute is more typical. It
> seems important to be able to spit out warnings and messages to mark
> the progress of the transformation. So reading an error log after the
> transform is over is not what I want.

I never tried that, but you should be able to read the error_log also 
during the transformation (you obviously need a reference to the running 
XSLT object to do that).

You shouldn't read the log from a separate thread, though. I'm not sure if 
that works, but if it does, I would consider it a bug (I'll have to check 
that).
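
For reference, a minimal sketch of where xsl:message output ends up in lxml: on the `error_log` of the XSLT object. The stylesheet and message text are invented for illustration, and the log is read here after the run, so this shows only the mechanism, not a way to get live progress output.

```python
from lxml import etree

stylesheet = etree.XML(b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <xsl:message>processing root</xsl:message>
    <done/>
  </xsl:template>
</xsl:stylesheet>""")

transform = etree.XSLT(stylesheet)
result = transform(etree.XML("<doc/>"))

# Each log entry carries the text of one xsl:message in .message.
messages = [entry.message for entry in transform.error_log]
for message in messages:
    print(message)
```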

A different way is to use a dedicated extension element to export the 
[...]