Alan Evangelista | 28 Aug 20:42 2014

Unexpected output when using xpath() twice

When I call the xpath() method on an element that was itself returned by xpath(), I get unexpected output.

 >>> from lxml import etree
 >>> xml_systems = etree.fromstring("<systems><system name='gekko1'/><system name='gekko2'/></systems>")
# get all nodes with name 'system' which are children of root node 'systems'
 >>> xml_system_list = xml_systems.xpath("/systems/system")
 >>> etree.tostring(xml_system_list[0])
'<system name="gekko1"/>'
 >>> etree.tostring(xml_system_list[1])
'<system name="gekko2"/>'
# get 'name' attribute of all 'system's nodes in the hierarchy
 >>> xml_system_list[0].xpath("//system/@name")
['gekko1', 'gekko2']
# get 'name' attribute of root node 'system'
 >>> xml_system_list[0].xpath("/system/@name")
[]

I expected the string 'gekko1' as the output of the last two commands.
Maybe I have not understood the API correctly? What should
I do to get the behavior I expect?

FYI if I convert xml_system_list[0] to string and convert it back to XML
using tostring() and fromstring(), I get the expected output.
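
A possible explanation, offered as a sketch rather than a confirmed answer: in lxml, an XPath expression that starts with "/" or "//" is evaluated from the root of the document the element belongs to, not from the element xpath() is called on. So "/system/@name" looks for a root element named 'system' and finds none, while a relative expression such as "@name" is evaluated against the element itself. After a tostring()/fromstring() round trip the element is the root of its own document, which would explain why the absolute path matches then. The variable names below are illustrative.

```python
from lxml import etree

xml_systems = etree.fromstring(
    "<systems><system name='gekko1'/><system name='gekko2'/></systems>")
first_system = xml_systems.xpath("/systems/system")[0]

# Relative path: evaluated against first_system itself.
relative = first_system.xpath("@name")

# Absolute path: evaluated from the document root (<systems>),
# no matter which element xpath() is called on.
absolute = first_system.xpath("/systems/system/@name")

# Round trip through a string: <system> becomes the root of a new
# document, so the absolute path /system/@name now matches.
reparsed = etree.fromstring(etree.tostring(first_system))
after_round_trip = reparsed.xpath("/system/@name")

print(relative, absolute, after_round_trip)
```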

Regards,
Alan Evangelista

_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/

Stefan Behnel | 28 Aug 17:01 2014

lxml 3.3.6 released

Hi all,

I just released lxml 3.3.6. This is a bug-fix-only release for the stable
lxml 3.3 series that fixes a couple of crashes.

The documentation is here: http://lxml.de/

Download:  http://lxml.de/files/lxml-3.3.6.tgz

Signature: http://lxml.de/files/lxml-3.3.6.tgz.asc

Changelog: http://lxml.de/3.3/changes-3.3.6.html

Github:
https://github.com/lxml/lxml/commit/4c8e222b6704b78381bdcaa5f6d3abf1d041d0b4

This release was built using Cython 0.20.1.

If you are interested in commercial support or customisations for the lxml
package, please contact me directly.

Have fun,

Stefan

3.3.6 (2014-08-28)
==================

Bugs fixed
----------

Will McGugan | 22 Aug 18:49 2014

position attribute of XMLSyntaxError seems wrong

Hi,

I'm catching XMLSyntaxErrors in my app and displaying information about the error, in particular the line and column taken from the 'position' attribute of the exception. The line is fine, but the column doesn't seem to correspond to the point where I would consider the error to have occurred.

Here's a small Python 2 script that shows the problem:

    xml = b"""<test>
        <tag foo="some text" namespace:attr="value" />
    </test>
    
    """

    from lxml import etree
    import io
    lines = xml.splitlines()
    try:
        root = etree.parse(io.BytesIO(xml)).getroot()
    except Exception as e:
        line, col = e.position
        print lines[line - 1]
        print " " * (col - 1) + '^'


I get the following output from that:

    <tag foo="some text" namespace:attr="value" />
                           ^
Any help would be appreciated...
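
A hedged note on where the numbers come from: lxml passes on the line and column that libxml2 records when it detects the error, and that is commonly the parser's read position at that moment rather than the start of the offending token. The sketch below (Python 3) uses a deliberately simple mismatched-tag error chosen for illustration, not the namespace error from the script above. It prints every entry of the exception's error_log, which can help to relate the reported column to the input.

```python
import io
from lxml import etree

# Illustrative input (an assumption, not the document from the post):
# the closing tag on line 2 does not match the opening tag.
xml = b"<test>\n    <tag>text</wrong>\n</test>\n"

entries = []
try:
    etree.parse(io.BytesIO(xml))
except etree.XMLSyntaxError as e:
    # e.position is the (line, column) of the error that was raised;
    # e.error_log keeps the messages libxml2 reported while parsing.
    entries = [(entry.line, entry.column, entry.message)
               for entry in e.error_log]
    print("position:", e.position)
    for line, column, message in entries:
        print("line %d, column %d: %s" % (line, column, message))
```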

Will McGugan

--
Will McGugan
http://www.willmcgugan.com
_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml@lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml
Martin Mueller | 21 Aug 22:36 2014

a question about a loop within a loop


I would be grateful for advice on the following:

I iterate over a sequence of sibling elements with the typical code

for element in tree.iter(tei + 'w', tei + 'c'):
    pass  # do this or that

Within that sequence there are shorter sequences (between two and seven
elements) that begin with an element <w part="I"/> and end with an element
<w part="F"/>. In between there may or may not be one or more elements of
the type <w part="M"/>. Since most of the cases involve sequences of two or
three elements, I've dealt with them using code like "if the next-but-one
element has a part='F' attribute".

That works for the simple cases, but it would be much better if I could
break out of the current iteration, isolate the sequence that goes from <w
part="I"/> to <w part="F"/>, iterate over it, and integrate the result (a
single <w> element) back into the tree. But I don't know how to write code
that would

1. start at a known point and make that the point of departure for a
sequence that can be iterated over
2. gather the elements that follow it until I come to the unknown future
point that is defined by part="F"

And I don't know whether that would be an lxml or a more general Python
procedure. 
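
One way to read the two steps above is as a plain Python loop that collects elements into a list from part="I" up to part="F" and then merges the list. The sketch below is a minimal illustration under several assumptions: the parts of a word are direct siblings under one parent, the elements carry no namespace (the real TEI data would need the tei prefix), and merging means joining the text of the parts. The function name merge_split_words is hypothetical.

```python
from lxml import etree

def merge_split_words(parent):
    """Replace each run <w part="I">, [<w part="M">...], <w part="F">
    among the children of parent by one <w> holding the joined text."""
    run = []  # elements of the run currently being collected
    for el in list(parent):  # iterate over a copy: the tree changes below
        part = el.get("part")
        if part == "I":
            run = [el]  # known starting point
        elif run and part in ("M", "F"):
            run.append(el)
            if part == "F":  # end point reached: merge the run
                first = run[0]
                first.text = "".join(e.text or "" for e in run)
                first.tail = el.tail  # keep what followed the last part
                del first.attrib["part"]
                for extra in run[1:]:
                    parent.remove(extra)
                run = []

root = etree.fromstring(
    '<s><w>a</w><w part="I">be</w><w part="M">au</w>'
    '<w part="F">ty</w><w>is</w></s>')
merge_split_words(root)
print(etree.tostring(root))
```

Elements without a part attribute that sit inside a run are left where they are and end up after the merged word; whether that is the desired behaviour depends on the data.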

Martin Mueller
Professor emeritus of English and Classics
Northwestern University

_________________________________________________________________
Elmar Bartel | 19 Aug 12:13 2014

"Inventing" XML elements - bug?

Hello Everyone,

This is my first posting to this list, so please excuse me if anything does
not meet the standards.

I use lxml.html.fromstring() to parse html.

The parser tries to do its best to make something reasonable,
even when the input is broken. Normally this works fine and the parser does
not "invent" elements, i.e. the resulting tree does not contain elements
that were never present in the input.
But inventing elements is what I observe when the input is of the following kind:

	<html><body> .... </body></html

Note the missing '>' at the end of the input!
Whether "inventing" elements is a bug in the case of invalid input
is debatable, but what if the number of elements is nearly doubled?

Please consider the following script which illustrates the effect:
It creates inside the <body> element a sequence of <img> elements
and checks after parsing the number of elements reported by
iterlinks():

=======================================================================
import sys
import lxml.html
import lxml.etree

parser= lxml.html.HTMLParser()

failCount= 0
for imageCount in range(1,20):
    # Produce some simple HTML document with some <img> elements
    content= '<html>\n<body>\n%s</body>\n</html>' % (
        '\n'.join(['<img src="verysmall-icon-%d.png" align="right">' % i
                   for i in range(imageCount)])
    )
    # Parse this and assert the number of links found.
    # (this works always)
    html= lxml.html.fromstring(content, parser=parser)
    imagesFound= len([x for x in html.iterlinks()])
    assert(imagesFound == imageCount)

    # Now remove the last '>' of the closing '</html>' tag.
    # After some tries, the parser "reuses" some of its
    # parsed tree fragments and appends them to the tree.
    # These fragments may even come from completely different
    # parsed documents.
    content=content[:-1]
    html= lxml.html.fromstring(content, parser=parser)
    imagesFound= len([x for x in html.iterlinks()])
    if imageCount != imagesFound:
        print 'Input:\n%s\n%s\n%s' % ('-'*40, content, '-'*40)
        print 'FAILURE: found %d img elements when only %d were present' % (imagesFound, imageCount)
        break

versionFmt= "%-25s %s"
print
print versionFmt % ('Python', sys.version_info)
for vers in (
  'LXML_VERSION',
  'LIBXML_VERSION',
  'LIBXML_COMPILED_VERSION',
  'LIBXSLT_VERSION',
  'LIBXSLT_COMPILED_VERSION',
):
    print versionFmt % (vers, getattr(lxml.etree, vers))
=======================================================================

On my machine (Ubuntu 12.04) the output is:
=======================================================================
Input:
----------------------------------------
<html>
<body>
<img src="verysmall-icon-0.png" align="right">
<img src="verysmall-icon-1.png" align="right">
<img src="verysmall-icon-2.png" align="right">
<img src="verysmall-icon-3.png" align="right">
<img src="verysmall-icon-4.png" align="right">
<img src="verysmall-icon-5.png" align="right">
<img src="verysmall-icon-6.png" align="right">
<img src="verysmall-icon-7.png" align="right">
<img src="verysmall-icon-8.png" align="right">
<img src="verysmall-icon-9.png" align="right">
<img src="verysmall-icon-10.png" align="right"></body>
</html
----------------------------------------
FAILURE: found 20 img elements when only 11 were present

Python                    sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0)
LXML_VERSION              (3, 3, 5, 0)
LIBXML_VERSION            (2, 7, 8)
LIBXML_COMPILED_VERSION   (2, 7, 8)
LIBXSLT_VERSION           (1, 1, 26)
LIBXSLT_COMPILED_VERSION  (1, 1, 26)
=======================================================================

On a different machine (Solaris 10 ;-)

=======================================================================
Input:
----------------------------------------
<html>
<body>
<img src="verysmall-icon-0.png" align="right">
<img src="verysmall-icon-1.png" align="right">
<img src="verysmall-icon-2.png" align="right">
<img src="verysmall-icon-3.png" align="right">
<img src="verysmall-icon-4.png" align="right">
<img src="verysmall-icon-5.png" align="right">
<img src="verysmall-icon-6.png" align="right">
<img src="verysmall-icon-7.png" align="right">
<img src="verysmall-icon-8.png" align="right">
<img src="verysmall-icon-9.png" align="right">
<img src="verysmall-icon-10.png" align="right"></body>
</html
----------------------------------------
FAILURE: found 20 img elements when only 11 were present

Python                    sys.version_info(major=2, minor=7, micro=1, releaselevel='final', serial=0)
LXML_VERSION              (2, 3, 5, 0)
LIBXML_VERSION            (2, 9, 0)
LIBXML_COMPILED_VERSION   (2, 6, 23)
LIBXSLT_VERSION           (1, 1, 28)
LIBXSLT_COMPILED_VERSION  (1, 1, 24)
=======================================================================

I discovered this behaviour when crawling a web site.
I do this multi-threaded, and the links reported by iterlinks()
returned 404 when the crawler tried to fetch them.
The reason was iterlinks(): it was running on a tree built from
a web page with a missing '>' at the end. The parser produced
a tree with a lot of fragments coming from other parsed pages...
You can imagine what happens then.

Yours,
Elmar.
-- 
LEO GmbH          | Elmar Bartel                 | 
Mühlweg 2b        | Phone: +49 (0)8104-90950141  | No signature here.
D-82054 Sauerlach | Fax:   +49 (0)8104-90950290  |
Germany           | Email: elmar@leo.org         |

Register Gericht: Amtsgericht München, HRB161107
Geschäftsführer:  Hans Riethmayer, Elmar Bartel
_________________________________________________________________
varun bhatnagar | 18 Aug 17:28 2014

lxml error :: '--' or ending '-' not allowed in comment

Hi,

Today I was trying to compile my xsl file using lxml and I got the following error:

lxml.etree.XSLTApplyError: xsl:comment : '--' or ending '-' not allowed in comment

Could you please tell me what went wrong here?
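
For context: the XML specification does not allow the string '--' inside a comment, and a comment may not end with '-'. The message therefore suggests that the text produced inside an xsl:comment instruction contains such a sequence. One possible way out, sketched here under the assumption that the comment text can be cleaned in Python before it reaches the stylesheet (for example when it is passed in as a parameter), is a small helper. The name comment_safe is hypothetical.

```python
from lxml import etree

def comment_safe(text):
    """Return text that is legal as the content of an XML comment."""
    while "--" in text:        # break up every run of hyphens
        text = text.replace("--", "- -")
    if text.endswith("-"):     # a comment must not end with '-'
        text += " "
    return text

cleaned = comment_safe("generated -- do not edit -")
comment = etree.Comment(cleaned)
print(etree.tostring(comment))
```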


Thanks,
BR,
Varun
_________________________________________________________________
varun bhatnagar | 12 Aug 14:08 2014

Comparing xslt attributes and putting them in sequential order

Hi,

I have two files and I am trying to merge them.
I want to compare the values of the "version" attribute within the same file: if the values are different, I want to copy the elements in sequential order, but if they are the same, those attributes should keep the same value.

Any kind of help is appreciated. :)
I have pasted my files below:

File1.xml
<?xml version="1.0" encoding="UTF-8"?>
<config>
  <version>
     <input00 version ="1"/>
     <name00 name ="abc"/>
  </version>
  <version>
     <input00 version ="2"/>
     <name00 name ="def"/>
  </version>
</config>


File2.xml
<?xml version="1.0" encoding="UTF-8"?>
<config>
  <version>
     <input00 version ="1"/>
     <name00 name ="xyz"/>
  </version>
    <version>
     <input00 version ="2"/>
     <name00 name ="pqr"/>
  </version>
    <version>
     <input00 version ="2"/>
     <name00 name ="pqr1"/>
  </version>
    <version>
     <input00 version ="3"/>
     <name00 name ="uvw"/>
  </version>
   <version>
     <input00 version ="3"/>
     <name00 name ="uvw1"/>
  </version>
    <version>
     <input00 version ="4"/>
     <name00 name ="lmn"/>
  </version>
</config>


Output.xml
<config>
  <version>
     <input00 version ="1"/>
     <name00 name ="abc"/>
  </version>
    <version>
     <input00 version ="2"/>
     <name00 name ="def"/>
  </version>
    <version>
     <input00 version ="3"/>
     <name00 name ="xyz"/>
  </version>
    <version>
     <input00 version ="4"/>
     <name00 name ="pqr"/>
  </version>
   <version>
     <input00 version ="4"/>
     <name00 name ="pqr1"/>
  </version>
    <version>
     <input00 version ="5"/>
     <name00 name ="uvw"/>
  </version>
    <version>
     <input00 version ="5"/>
     <name00 name ="uvw1"/>
  </version>
    <version>
     <input00 version ="6"/>
     <name00 name ="lmn"/>
  </version>
</config>
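
Reading the expected output, the rule appears to be: keep File1 as it is, then append the blocks of File2 with each version number shifted by the highest version number in File1, so that equal numbers stay equal (1, 2, 2, 3, 3, 4 becomes 3, 4, 4, 5, 5, 6). That reading is an assumption. A minimal Python sketch with lxml, using shortened inline documents and a hypothetical function name, could look like this:

```python
from lxml import etree

def merge_configs(first, second):
    """Append the <version> blocks of second to first, shifting numbers."""
    # Highest version number already used in the first document.
    offset = max(int(el.get("version")) for el in first.iter("input00"))
    for block in second.findall("version"):
        inp = block.find("input00")
        inp.set("version", str(int(inp.get("version")) + offset))
        first.append(block)  # lxml moves the block out of `second`
    return first

file1 = etree.fromstring(
    '<config><version><input00 version="1"/><name00 name="abc"/></version>'
    '<version><input00 version="2"/><name00 name="def"/></version></config>')
file2 = etree.fromstring(
    '<config><version><input00 version="1"/><name00 name="xyz"/></version>'
    '<version><input00 version="2"/><name00 name="pqr"/></version>'
    '<version><input00 version="2"/><name00 name="pqr1"/></version></config>')

merged = merge_configs(file1, file2)
versions = [el.get("version") for el in merged.iter("input00")]
names = [el.get("name") for el in merged.iter("name00")]
print(versions, names)
```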

Thanks,
BR,
Varun

_________________________________________________________________
varun bhatnagar | 11 Aug 10:47 2014

Need help with merging two common tags of multiple xmls

Hi,

I am still stuck with this problem while trying to merge two common tags in multiple xmls.
Every xml file has 4 sections: 
1) systemInfo
2) systemInitialization
3) Procedure
4) systemWrapup

I have written rules for merging systemInfo, systemInitialization and systemWrapup sequentially. As each of them appears just once in my xmls, there is no problem. I am stuck with the Procedure tag: it may appear more than once, and I have to merge the Procedure elements sequentially by taking them from both files and modifying their attributes.

How can I achieve the output?

File1.xml

<?xml version="1.0" encoding="ASCII"?>
  <System saSysName="AppSysCyborg">
  
    <systemInfo>
      <systemPeriod sysExpectedTime="600000000"/>
      <configurationBase sysBase="0" />
    </systemInfo>
    
    <systemInitialization>
      <addToSys>
        ..........................
        ..........................
      </addToSys>
    </systemInitialization>
    
    <Procedure procName="Proc1" procLevel="1">
      <outageInfo>
        <procedurePeriod sysPeriod="100000000"/>
      </outageInfo>
      <sysAction>
        <immCCB ccbFlags="0">
          <create objectClassName="sysApplication">
            <attr name="sysAppl" type="hvm">
              <val>1</val>
            </attr>
          </create>
        </immCCB>
      </sysAction>
    </Procedure>
    
    <Procedure sysSmfProcedure="Proc2" sysExecLevel="2">
      <outageInfo>
        <procedurePeriod sysPeriod="100000000"/>
      </outageInfo>
      <sysAction>
        <sysCB ccbFlags="0">
          <create objectClassName="sysApplication">
            <attr name="sysAppl" type="hvm">
              <val>1</val>
            </attr>
          </create>
        </sysCB>
      </sysAction>
     </Procedure>
        
    <systemWrapup>
        ..........................
        ..........................
    </systemWrapup>
  </System>


File-2.xml

<?xml version="1.0" encoding="ASCII"?>
  <System saSysName="Term-1">
  
    <systemInfo>
      <systemPeriod sysExpectedTime="600000000"/>
      <configurationBase sysBase="0" />
    </systemInfo>
    
    <systemInitialization>
      <addToSys>
         ..........................
      </addToSys>
    </systemInitialization>
    
    <Procedure procName="Proc1" procLevel="1">
      <outageInfo>
        <procedurePeriod sysPeriod="100000000"/>
      </outageInfo>
      <sysAction>
        <immCCB ccbFlags="0">
          <create objectClassName="sysApp"/>
        </immCCB>
      </sysAction>
      <systemInstall/>
    </Procedure>

    <Procedure procName="Proc2" procLevel="2">
      <outageInfo>
        <procedurePeriod sysPeriod="100000000"/>
      </outageInfo>
   </Procedure>  
   
    <Procedure procName="Proc3" procLevel="3">
      <outageInfo>
        <procedurePeriod sysPeriod="100000000"/>
      </outageInfo>
    </Procedure> 
         
    <systemWrapup>
         ..........................
        .......................... 
    </systemWrapup>
  </System>


Output.xml

<?xml version="1.0" encoding="ASCII"?>
  <System saSysName="AppSysCyborg">
  
    <systemInfo>
      <systemPeriod sysExpectedTime="600000000"/>
      <configurationBase sysBase="0" />
    </systemInfo>
    
    <systemInitialization>
      <addToSys>
        ......  
        ..........
      </addToSys>
      
      <addToSys>
        ..........................
        ..........................
        ..........................
      </addToSys>
    <sysInit>
        ..........................
        ..........................
    </sysInit>
    
    <sysInit>
        ..........................
        ..........................
    </sysInit>    
    </systemInitialization>
    
    <Procedure procName="Proc1" procLevel="1">
      <outageInfo>
        <procedurePeriod sysPeriod="100000000"/>
      </outageInfo>
      <sysAction>
        <immCCB ccbFlags="0">
          <create objectClassName="sysApplication">
            <attr name="sysAppl" type="hvm">
              <val>1</val>
            </attr>
          </create>
        </immCCB>
      </sysAction>
    </Procedure>
    
    <Procedure procName="Proc2" sysExecLevel="2">
      <outageInfo>
        <procedurePeriod sysPeriod="100000000"/>
      </outageInfo>
      <sysAction>
        <sysCB ccbFlags="0">
          <create objectClassName="sysApplication">
            <attr name="sysAppl" type="hvm">
              <val>1</val>
            </attr>
          </create>
        </sysCB>
      </sysAction>
    </Procedure>  
      
    <Procedure procName="Proc1" procLevel="3">
      <outageInfo>
        <procedurePeriod sysPeriod="100000000"/>
      </outageInfo>
      <sysAction>
        <immCCB ccbFlags="0">
          <create objectClassName="sysApp"/>
        </immCCB>
      </sysAction>
      <systemInstall/>
    </Procedure>
    
    <Procedure procName="Proc2" procLevel="4">
      <outageInfo>
        <procedurePeriod sysPeriod="100000000"/>
      </outageInfo>
      <sysAction>
        <sysCB ccbFlags="0">
          <create objectClassName="sysApplication">
            <attr name="sysAppl" type="hvm">
              <val>1</val>
            </attr>
          </create>
        </sysCB>
      </sysAction>
    </Procedure>

    <Procedure procName="Proc3" procLevel="3">
      <outageInfo>
        <procedurePeriod sysPeriod="100000000"/>
      </outageInfo>
    </Procedure>  
    
    
    <Procedure procName="Proc4" procLevel="5">
      <outageInfo>
        <procedurePeriod sysPeriod="100000000"/>
      </outageInfo>
    </Procedure>   

    <systemWrapup>
         ..........................
        ..........................
    </systemWrapup>
  </System>        
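
The Procedure part could be handled outside XSLT with a few lines of Python. The sketch below rests on assumptions: the Procedure elements of the second file are moved directly behind the last Procedure of the first file (so they stay in front of systemWrapup), and procLevel is then renumbered 1, 2, 3, ... over the combined sequence. procName is left untouched because the intended naming is not clear from the sample output. The function name and the shortened documents are illustrative only.

```python
from lxml import etree

def merge_procedures(base, other):
    """Move the <Procedure> elements of other behind those of base and
    renumber procLevel over the combined sequence."""
    anchor = base.findall("Procedure")[-1]  # assumes base has at least one
    for proc in other.findall("Procedure"):
        anchor.addnext(proc)  # insert directly after the previous Procedure
        anchor = proc
    for level, proc in enumerate(base.findall("Procedure"), start=1):
        proc.set("procLevel", str(level))
    return base

base = etree.fromstring(
    '<System><systemInfo/><Procedure procName="Proc1" procLevel="1"/>'
    '<Procedure procName="Proc2" procLevel="2"/><systemWrapup/></System>')
other = etree.fromstring(
    '<System><systemInfo/><Procedure procName="Proc1" procLevel="1"/>'
    '<Procedure procName="Proc2" procLevel="2"/>'
    '<Procedure procName="Proc3" procLevel="3"/><systemWrapup/></System>')

merge_procedures(base, other)
layout = [(el.tag, el.get("procLevel")) for el in base]
print(layout)
```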
    
Thanks,
Varun
_________________________________________________________________
Charlie Clark | 10 Aug 14:58 2014

Nested generators with xmlfile

Hiya,

I'm working on the openpyxl project and thanks to Stefan's extensive help  
have been able to make good use of lxml in the library. I'm currently  
working on using xmlfile for serialisation instead of  
saxutils.XMLGenerator - xmlfile is noticeably faster and easier to use as  
long as the source is synchronous but I'm scratching my head at the moment  
on how to nest generators for asynchronous use.

Worksheets in openpyxl are rows of cells. As each row can hold up to  
around 16,000 cells I want to be able to serialise each cell immediately.  
But when working asynchronously (some use cases have worksheets created in  
parallel) I can't write all rows out at once.

Following the example in the documentation I've managed to create a  
generator for rows:

     def _write_row(self):
         with xmlfile(self._fileobj_content_name) as xf:
             with xf.element("sheetData"):
                 try:
                     while True:
                         r = (yield)
                         xf.write(r)
                 except GeneratorExit:
                     pass

     writer = ws._write_row()
     next(writer)  # advance the generator
     for c in cells:
         writer.send(c)

But this requires building what might be a potentially quite large row  
object in memory. What I really want is two generators: one for an  
individual row and one for the sequence of rows but I can't figure out how  
to do this with context managers. I tried adding a switch based on the  
element sent to the generator:

if r.tag == 'row':
	with xf…

But embedding the context manager within the switch makes it impossible to
send child objects to it: the branch is only entered for the row element
itself, so no cells ever reach the code block, and the context manager is
already closed in the other cases.

Any ideas on how I can approach this?
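
One possible approach, given as a sketch and not as the way openpyxl or lxml intend it: keep a single generator, but give it a small protocol with two nested loops, so that the row's context manager stays open between send() calls. The protocol below is an assumption made for the example: send a dict of row attributes to open a row, then cell elements, then None to close the row. GeneratorExit is caught at each yield so that close() lets the open elements finish normally.

```python
import io
from lxml import etree

def sheet_writer(out):
    """Coroutine: send a dict of row attributes to open a row,
    then cell elements, then None to close the row."""
    with etree.xmlfile(out) as xf:
        with xf.element("sheetData"):
            closing = False
            while not closing:
                try:
                    row_attrs = yield      # wait for the next row
                except GeneratorExit:
                    break
                with xf.element("row", row_attrs):
                    while True:
                        try:
                            cell = yield   # wait for a cell or None
                        except GeneratorExit:
                            closing = True
                            break
                        if cell is None:
                            break
                        xf.write(cell)     # no row object is built up

buf = io.BytesIO()
writer = sheet_writer(buf)
next(writer)                               # advance to the first yield
writer.send({"r": "1"})
writer.send(etree.Element("c", r="A1"))
writer.send(etree.Element("c", r="B1"))
writer.send(None)                          # end of row 1
writer.send({"r": "2"})
writer.send(etree.Element("c", r="A2"))
writer.close()                             # closes the open row and sheetData
print(buf.getvalue())
```

Each worksheet would get its own generator, so several worksheets can be fed in an interleaved fashion as long as every one writes to its own file object.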

Charlie
-- 
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Kronenstr. 27a
Düsseldorf
D- 40217
Tel: +49-211-600-3657
Mobile: +49-178-782-6226
_________________________________________________________________
varun bhatnagar | 7 Aug 15:59 2014

Problem while doing xslt calculations

Hi,

I am trying to do a calculation in XSLT and assign the result to an attribute, but the following stylesheet fragment does not produce the expected value when I run it with lxml.

<!-- Calculating 10% of the Expected Time and assigning it to an attribute -->
<xsl:element name="sysPeriod">
 <xsl:attribute name="sysExpectedTime">
   <xsl:value-of select="$setTimeValue + ($setTimeValue * 10 div 100)" />
 </xsl:attribute>
 </xsl:element>
</xsl:when>
<xsl:otherwise>
<xsl:element name="sysPeriod"/>
</xsl:otherwise>
</xsl:choose>

The value in $setTimeValue is 111100000000, and after the calculation the attribute should be set to 122210000000. Instead, the above expression sets it to "1.2221e+011".

Can anyone tell me why this is happening?
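
A likely explanation, stated with some caution: XPath 1.0 numbers are double-precision floats, and libxml2 (which lxml uses) is known to switch to exponent notation when it converts a large number to a string. The XSLT 1.0 function format-number() gives explicit control over the output. The sketch below wraps the expression from the post in a minimal stylesheet; the surrounding stylesheet, the parameter default and the dummy input document are assumptions made for the example.

```python
from lxml import etree

stylesheet = etree.XML(b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:param name="setTimeValue" select="111100000000"/>
  <xsl:template match="/">
    <xsl:element name="sysPeriod">
      <xsl:attribute name="sysExpectedTime">
        <xsl:value-of select="format-number($setTimeValue + ($setTimeValue * 10 div 100), '0')"/>
      </xsl:attribute>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>""")

transform = etree.XSLT(stylesheet)
result = transform(etree.XML("<dummy/>"))  # input document is not used
value = result.getroot().get("sysExpectedTime")
print(value)
```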

Thanks,
BR,
Varun
_________________________________________________________________
Gelin Yan | 4 Aug 15:13 2014

a question about static building procedure of lxml 3.3.5

Hi All

   I tried to build a static version of lxml. At first the build seemed to go smoothly, but when I imported lxml.etree I got an import error: "undefined symbol clock_gettime".

   I added librt to the link libraries in "def libraries()" of setupinfo.py, i.e.
libs = ['z', 'm', 'rt'], and built again.

  It worked.

  I want to know whether this is a bug in gcc or whether I did something wrong when building lxml statically.

My OS is ubuntu 12.04 64bit, gcc version 4.6.3
python ver 2.7.3
lxml 3.3.5

Thanks.

Regards

gelin yan
_________________________________________________________________
