Chris Wj | 1 Aug 13:48

Splitting an xml file.

I'm looking for the best way to split an xml file with many children into multiple files with the same parent tags but individual children.

Example:

Turn this...

file1.xml
<parent>
a whole bunch of stuff...
    <child>
        child 1 stuff...
    </child>
    <child>
        child 2 stuff...
    </child>
</parent>

Into 2 files...

file1a.xml
<parent>
a whole bunch of stuff...
    <child>
        child 1 stuff...
    </child>
</parent>

file1b.xml
<parent>
a whole bunch of stuff...
    <child>
        child 2 stuff...
    </child>
</parent>

Should I use lxml.etree to find the line numbers and then just use file operations? You guys think that is most efficient?

_______________________________________________
lxml-dev mailing list
lxml-dev <at> codespeak.net
http://codespeak.net/mailman/listinfo/lxml-dev
Alexander Shigin | 4 Aug 21:22

Re: Splitting an xml file.

On Sat, 01/08/2009 at 07:48 -0400, Chris Wj wrote:
> I'm looking for the best way to split an xml file with many children
> into multiple files with the same parent tags but individual children.
...
> Should I use lxml.etree to find the line numbers and then just use
> file operations? You guys think that is most efficient?

I think the simplest way to split the file is to remove all 'child'
elements from the parsed document and then serialize the document once
per child, with only that child re-attached.

In [1]: import codecs
In [2]: from lxml import etree
In [3]: parsed = etree.parse('q.xml')
In [4]: root = parsed.getroot()
In [5]: childs = parsed.findall('child')
In [6]: for child in childs: root.remove(child)
In [7]: for num, child in enumerate(childs):
    ...     root.append(child)
    ...     f = codecs.open('file1%s.xml' % num, 'w', encoding='utf-8')
    ...     f.write(etree.tounicode(parsed))
    ...     f.close()
    ...     root.remove(child)

This solution handles tail text and other nodes after the 'child'
nodes incorrectly. I don't know whether that matters to you, but the
following XML will be split the wrong way.
  <parent>
      <child>....</child>
      1234
      <child>....</child>
      <some-tag>....</some-tag>
  </parent>
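
One way around the tail-text problem, sketched under assumptions rather than taken from the thread: copy the whole tree once per 'child' and, in each copy, remove the other 'child' elements while moving their tail text onto the preceding node. The helper names remove_keep_tail and split_children are made up for this illustration, and the fallback import is only there so the sketch also runs where lxml is not installed.

```python
import copy

try:
    from lxml import etree
except ImportError:  # assumption: the stdlib API is close enough for this sketch
    import xml.etree.ElementTree as etree


def remove_keep_tail(parent, elem):
    """Remove elem from parent, but keep the text that followed it."""
    tail = elem.tail
    if tail:
        siblings = list(parent)
        idx = siblings.index(elem)
        if idx == 0:
            # No earlier sibling: the tail becomes part of the parent's text.
            parent.text = (parent.text or "") + tail
        else:
            # Otherwise append it to the previous sibling's tail.
            prev = siblings[idx - 1]
            prev.tail = (prev.tail or "") + tail
    parent.remove(elem)


def split_children(root, tag="child"):
    """Return one deep copy of root per `tag` child, each keeping only that child."""
    count = len(root.findall(tag))
    results = []
    for keep in range(count):
        clone = copy.deepcopy(root)
        for i, elem in enumerate(clone.findall(tag)):
            if i != keep:
                remove_keep_tail(clone, elem)
        results.append(clone)
    return results
```

Each returned element can then be serialized with etree.tostring() and written to its own file. Copying the full tree per child costs memory, so this is only reasonable for documents that fit comfortably in RAM.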

If you need to split big XML files, it's much better to use the SAX
interface. But a SAX reader/writer is much harder to implement.

Chris Wj | 4 Aug 22:16

Re: Splitting an xml file.

What about xslt, can I use that to accomplish the task?

kristian kvilekval | 5 Aug 00:44

Key error on del attribute?


We need to delete an attribute on an Element node,
but we are getting a strange exception.

> a=etree.Element('a', z='1', x='2')
> a.attrib['x']
  '2'
> del a.attrib['x']
> del a.attrib['x']
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (3059, 0))

We could add a call to has_key, but we would expect a simple KeyError
exception to be raised.
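
For reference, a minimal sketch of two ways to guard the delete, assuming attrib behaves like a dictionary here (it is a plain dict in the standard library and a dict-like object in lxml). The variable names are illustrative only, and the fallback import is just so the snippet runs without lxml.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib ElementTree behaves the same for this case
    import xml.etree.ElementTree as etree

a = etree.Element('a', z='1', x='2')
del a.attrib['x']

# Option 1: catch the KeyError that a second delete is expected to raise.
try:
    del a.attrib['x']
    raised = False
except KeyError:
    raised = True

# Option 2: pop() with a default never raises for a missing key.
removed = a.attrib.pop('z', None)   # attribute existed, its value is returned
missing = a.attrib.pop('x', None)   # attribute already gone, default is returned
```

A membership test ('x' in a.attrib) is a third option that avoids has_key altogether.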
John Lovell | 5 Aug 01:23

Re: Key error on del attribute?

On Ubuntu 9.04 I get a KeyError thrown.  Can you provide a list of versions like the below?

python:            2.6.2
lxml.etree:        (2, 1, 5, 0)
libxml used:       (2, 6, 32)
libxml compiled:   (2, 6, 32)
libxslt used:      (1, 1, 24)
libxslt compiled:  (1, 1, 24)

This should help...
http://codespeak.net/lxml/2.0/FAQ.html#i-think-i-have-found-a-bug-in-lxml-what-should-i-do

Good luck,

John W. Lovell
Web Applications Engineer
Northwest Educational Service District
1601 R Avenue
Anacortes, WA 98221
(360) 299-4086
jlovell <at> nwesd.org

www.nwesd.org
Together We Can ...

kristian kvilekval | 5 Aug 01:34

Re: Key error on del attribute?

On Tue, 2009-08-04 at 16:16 -0700, John Lovell wrote:
> On Ubuntu 9.04 I get a KeyError thrown.  Can you provide a list of versions like the below?
> 
> python:            2.6.2
> lxml.etree:        (2, 1, 5, 0)
> libxml used:       (2, 6, 32)
> libxml compiled:   (2, 6, 32)
> libxslt used:      (1, 1, 24)
> libxslt compiled:  (1, 1, 24)

Bizarre... you're right, it works in python.
It's the error parsing in ipython that runs into trouble.
Not sure whether the bug is in ipython or lxml, but no matter.

lxml.etree:        (2, 1, 5, 0)
libxml used:       (2, 6, 32)
libxml compiled:   (2, 6, 32)
libxslt used:      (1, 1, 24)
libxslt compiled:  (1, 1, 24)

--------------------------------------------------------------------
Python 2.5.2 (r252:60911, Jan  4 2009, 21:59:32) 
Type "copyright", "credits" or "license" for more information.

IPython 0.8.4 -- An enhanced Interactive Python.

In [3]: a=etree.Element('a', z='1', x='2')

In [4]: del a.attrib['x']

In [5]: del a.attrib['x']
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (3059, 0))

-------------------------------------------------

$ python
Python 2.5.2 (r252:60911, Jan  4 2009, 21:59:32) 
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a=etree.Element('a', z='1', x='2')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'etree' is not defined
>>> from lxml import etree
>>> a=etree.Element('a', z='1', x='2')
>>> del a.attrib['x']
>>> del a.attrib['x']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 1857, in lxml.etree._Attrib.__delitem__
(src/lxml/lxml.etree.c:18787)
  File "apihelpers.pxi", line 435, in lxml.etree._delAttribute
(src/lxml/lxml.etree.c:31747)
KeyError: 'x'

> This should help...
> http://codespeak.net/lxml/2.0/FAQ.html#i-think-i-have-found-a-bug-in-lxml-what-should-i-do
> 

Thanks,
Piet van Oostrum | 5 Aug 04:33

Re: Key error on del attribute?

With IPython 0.9.1 on Python 2.6.2 it just works:

/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/ipython-0.9.1-py2.6.egg/IPython/Magic.py:38:
DeprecationWarning: the sets module is deprecated
  from sets import Set
Python 2.6.2 (r262:71600, Apr 16 2009, 09:17:39) 
Type "copyright", "credits" or "license" for more information.

IPython 0.9.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object'. ?object also works, ?? prints more.

In [1]: from lxml import etree

In [2]: a=etree.Element('a', z='1', x='2')

In [3]: del a.attrib['x']

In [4]: del a.attrib['x']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)

/Users/piet/Mail/<ipython console> in <module>()

/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/lxml-2.2.2-py2.6-macosx-10.3-fat.egg/lxml/etree.so
in lxml.etree._Attrib.__delitem__
(src/lxml/lxml.etree.c:42562)()

/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/lxml-2.2.2-py2.6-macosx-10.3-fat.egg/lxml/etree.so
in lxml.etree._delAttribute (src/lxml/lxml.etree.c:13933)()

KeyError: 'x'

-- 
Piet van Oostrum <piet <at> cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: piet <at> vanoostrum.org
bl8cki | 5 Aug 14:21

iterating xpath?

I was searching the API and found things like iterfind, but it seems
that this works with ElementPath.
I would like to do something like iterxpath. Is there any way to achieve this?

Thanks a lot!
Mary Lei | 5 Aug 18:48

lxml2.2 doctype missing

I noticed that the XHTML converted from
the parse tree is missing its doctype.
I am using lxml 2.2.

Is this bug still not fixed in lxml 2.2?

-- 
Mary Lei

Software Testing
IPAC-NExScl

Rm: KS-233
MS: 220-6
Phone: 395-1998
Alexander Shigin | 5 Aug 20:26

Re: Splitting an xml file.

On Tue, 04/08/2009 at 16:16 -0400, Chris Wj wrote:
> What about xslt, can I use that to accomplish the task?

I've never used XSLT to produce multiple output files. I've only
briefly reviewed the XSLT specification and couldn't find a way to do it.

You can use an XSLT parameter instead and produce a different output
on each run by changing the 'keep' param.

For example, you can use xsltproc with the example file q.xslt.
$ xsltproc --param keep 2 q.xslt q.xml
=== q.xslt ===
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml"/>

<xsl:param name="keep"/>

<xsl:template match="child">
    <xsl:if test="position()=$keep">
        <xsl:copy-of select="."/>
    </xsl:if>
</xsl:template>

<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>

</xsl:stylesheet>
======

This solution has another issue: I don't know how to find out the
position numbers in advance. The following XML has 'child' elements at
positions 2 and 4.
  <parent>
      1234
      <child>....</child>
      <child>....</child>
      <some-tag>....</some-tag>
  </parent>
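
The position numbers could be computed up front from the parsed tree. The following is a rough sketch, not something from the thread: it assumes position() in the stylesheet counts the parent's attributes first and then every child node in document order, text nodes included, which is what the select="@*|node()" in the identity template suggests. The function name child_positions is hypothetical, and the fallback import is only so the snippet runs without lxml.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib ElementTree is close enough for this sketch
    import xml.etree.ElementTree as etree


def child_positions(parent, tag="child"):
    """Return the 1-based node positions of `tag` children under parent."""
    positions = []
    # Attributes selected by @* come before the child nodes.
    pos = len(parent.attrib)
    # Leading text (even whitespace only) is one text node.
    if parent.text:
        pos += 1
    for node in parent:
        pos += 1
        if node.tag == tag:
            positions.append(pos)
        # Text following an element is another node in the list.
        if node.tail:
            pos += 1
    return positions


xml = """<parent>
    1234
    <child>....</child>
    <child>....</child>
    <some-tag>....</some-tag>
</parent>"""
positions = child_positions(etree.fromstring(xml))
```

Each number in positions is what would be passed as --param keep. Another option, also only a suggestion, is to change the stylesheet test to count(preceding-sibling::child) + 1 = $keep, which numbers the 'child' elements 1, 2, 3 and so on regardless of the text nodes around them.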
