Charlie Clark | 19 Sep 20:23 2014
Hi,

I'm trying to build lxml from source on Mac OS (it builds fine as a
dependency) but I seem to be hitting a wall:

Trying to build without Cython, but pre-generated 'src/lxml/lxml.etree.c'

I get this whether or not I'm using Cython (it's installed) or python
setup.py build --static-deps

libxml2 and libxslt are installed.

Charlie
-- 
Charlie Clark
Kronenstr. 27a
Düsseldorf
D- 40217
Tel: +49-211-938-5360
Mobile: +49-178-782-6226
_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml@lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml
Burak Arslan | 16 Sep 10:12 2014

missing void elements

Hello,

Where's the lxml issue tracker? I couldn't find it.

Void elements are html tags that don't need a closing tag. <br> is a
well known example:

>>> from lxml import etree, html
>>> from lxml.builder import E
>>> html.tostring(E.br())
'<br>'
>>> etree.tostring(E.br())
'<br/>'
>>>

<p> is not a void element, so:

>>> html.tostring(E.p())
'<p></p>'
>>> etree.tostring(E.p())
'<p/>'

see the full list:

http://www.w3.org/TR/html5/syntax.html#elements-0

While working on etree.htmlfile, I noticed that the following tags are
not treated as void:

embed, keygen, source, track, wbr
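
One way to check which tags a given lxml/libxml2 build serialises as void is to run the HTML5 list through html.tostring(). The following is only a sketch: the HTML5_VOID tuple was typed in from the W3C list linked above, and non_void_serialised is a made-up helper name, not part of lxml. The result depends on the installed libxml2 version, so no expected output is shown.

```python
from lxml import etree, html

# HTML5 void elements, copied by hand from the W3C HTML5 syntax section.
HTML5_VOID = ("area", "base", "br", "col", "embed", "hr", "img", "input",
              "keygen", "link", "meta", "param", "source", "track", "wbr")


def non_void_serialised(tags=HTML5_VOID):
    """Return the tags that html.tostring() writes with a closing tag."""
    found = []
    for tag in tags:
        out = html.tostring(etree.Element(tag))  # bytes, e.g. b'<br>'
        if out.endswith(("</%s>" % tag).encode("ascii")):
            found.append(tag)
    return found


print(non_void_serialised())
```

An empty list would mean that every HTML5 void element is already handled by the serialiser in use.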


Stefan Behnel | 10 Sep 19:19 2014

lxml 3.4.0 released

Hi all,

I just released the final version of lxml 3.4, with no code changes since
the last beta. This is a feature release that mostly cleans up some prior
deficiencies and speeds up parsing for documents that contain XML-IDs.

Note that this release drops support for older Python versions and now
requires Py2.6/7 or Py3.2+. It also removes support for very old versions
of libxml2 and libxslt (<=2008).

The documentation is here: http://lxml.de/

Download:  http://lxml.de/files/lxml-3.4.0.tgz

Signature: http://lxml.de/files/lxml-3.4.0.tgz.asc

Changelog: http://lxml.de/3.4/changes-3.4.0.html

Github:
https://github.com/lxml/lxml/blob/14505bc62f5f1fc9fb0ff007955f3e67ab4562bb

This release was built using Cython 0.21, but should also build fine with
0.20.x.

If you are interested in commercial support or customisations for the lxml
package, please contact me directly.

Have fun,

Stefan

Charlie Clark | 10 Sep 12:17 2014

Comparing XML requires unicode?

Hi,

last year Stefan very kindly showed me how to use LXMLOutputChecker to  
compare XML trees. This is a lifesaver if you generate a lot of XML and  
want to check it: we're using it extensively in openpyxl.

But recently I've found myself bashing my head against it repeatedly as it  
seems to work with unicode only, which means I can't simply use  
compare_xml(tostring(tree), expected) - wrapper function from
https://bitbucket.org/openpyxl/openpyxl/src/03cb2a7f046d02ec3a19cbeba4375b6d6a19db73/openpyxl/tests/helper.py?at=default#cl-68

I can work around this using a helper function or lxml's handy  
tounicode(), except that we also need to run tests assuming lxml is not  
installed.

Am I missing something simple?
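
For the case where lxml is not installed, one possible fallback is to normalise both inputs to parsed trees with the standard library and compare those, so that it no longer matters whether a test passes bytes or unicode. This is only a sketch under assumptions: compare_xml_plain and elements_equal are hypothetical names, not the openpyxl helper, and the comparison is far simpler than what LXMLOutputChecker does (no wildcards, whitespace-only text is ignored).

```python
import xml.etree.ElementTree as ET


def _as_element(xml):
    """Parse bytes or text into an Element."""
    if isinstance(xml, bytes):
        return ET.fromstring(xml)
    # Text input: encode it first so both kinds take the same path.
    # Assumes the text carries no conflicting encoding declaration.
    return ET.fromstring(xml.encode("utf-8"))


def elements_equal(a, b):
    """Recursively compare tag, attributes, stripped text and children."""
    return (a.tag == b.tag
            and a.attrib == b.attrib
            and (a.text or "").strip() == (b.text or "").strip()
            and (a.tail or "").strip() == (b.tail or "").strip()
            and len(a) == len(b)
            and all(elements_equal(x, y) for x, y in zip(a, b)))


def compare_xml_plain(generated, expected):
    """Return True if both documents describe the same tree."""
    return elements_equal(_as_element(generated), _as_element(expected))


print(compare_xml_plain(b'<a x="1" y="2"><b/></a>',
                        u'<a y="2" x="1">\n  <b></b>\n</a>'))
```

Attribute order and indentation do not affect the result, which covers the most common differences between generated and expected XML.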

Charlie
-- 
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Kronenstr. 27a
Düsseldorf
D- 40217
Tel: +49-211-600-3657
Mobile: +49-178-782-6226

Stefan Behnel | 5 Sep 15:35 2014

lxml 3.4.0 beta 1 released

Hi all,

I just released the first beta version of the upcoming lxml 3.4. This is a
feature release that mostly cleans up some prior deficiencies and speeds up
parsing for documents that contain XML-IDs. Please give it some testing
against your code.

Note that this release drops support for older Python versions and now
requires Py2.6/7 or Py3.2+. It also removes support for very old versions
of libxml2 and libxslt (<=2008).

The documentation is here: http://lxml.de/

Download:  http://lxml.de/files/lxml-3.4.0beta1.tgz

Signature: http://lxml.de/files/lxml-3.4.0beta1.tgz.asc

Changelog: http://lxml.de/3.4/changes-3.4.0beta1.html

Github:
https://github.com/lxml/lxml/commit/638b9ce006ba32e46a09101e15c93ee94649a2ae

This release was built using a pre-release version of Cython 0.21
(7a47dfdabcb9a9861480b1437f092c5f84911558). The final release is expected
to use Cython 0.21 (but should build just fine with 0.20.x).

If you are interested in commercial support or customisations for the lxml
package, please contact me directly.

Have fun,

D.H.J. Takken | 3 Sep 14:46 2014

Incremental Serialisation: Output buffering

Hello,

I am attempting to use etree.xmlfile to incrementally generate an XML
stream into a file-like object, but I ran into a buffering problem. It
looks like etree.xmlfile does not write straight into the supplied
output, but is buffering internally. I searched for a way to force
flushing the generated XML into the output, but did not find any.

Looking at the source code, I see only one call to
xmlOutputBufferFlush() and - as far as I can see - it is not called
during incremental serialisation.

Is my assumption about internal buffering correct? Is there a way to
flush the internal buffer?

Thanks a lot for helping me out!
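
For reference, the lxml 3.4 changelog lists an explicit xf.flush() method and an xmlfile(buffered=False) option. Assuming lxml 3.4 or later, a minimal sketch of incremental output with explicit flushing might look like this; the element names and the sizes list are invented for illustration.

```python
import io
from lxml import etree

out = io.BytesIO()
sizes = []  # number of bytes visible in `out` after each flush

with etree.xmlfile(out, encoding="utf-8") as xf:
    with xf.element("log"):
        for i in range(3):
            xf.write(etree.Element("entry", n=str(i)))
            xf.flush()  # push what has been serialised so far into `out`
            sizes.append(len(out.getvalue()))

print(sizes)
```

With a real socket or pipe in place of the BytesIO object, each flush() call should make the entries written so far available to the reader on the other end.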
Alan Evangelista | 28 Aug 20:42 2014

Unexpected output when using xpath() twice

When I use the xpath() method twice, I get unexpected output.

 >>> from lxml import etree
 >>> xml_systems = etree.fromstring("<systems><system name='gekko1'/><system name='gekko2'/></systems>")
# get all nodes with name 'system' which are children of root node 'systems'
 >>> xml_system_list = xml_systems.xpath("/systems/system")
 >>> etree.tostring(xml_system_list[0])
'<system name="gekko1"/>'
 >>> etree.tostring(xml_system_list[1])
'<system name="gekko2"/>'
# get 'name' attribute of all 'system's nodes in the hierarchy
 >>> xml_system_list[0].xpath("//system/@name")
['gekko1', 'gekko2']
# get 'name' attribute of root node 'system'
 >>> xml_system_list[0].xpath("/system/@name")
[]

I expected the string 'gekko1' as output of the last two commands.
Maybe I have not understood the API correctly? What should
I do to get the behavior I expect?

FYI if I convert xml_system_list[0] to string and convert it back to XML
using tostring() and fromstring(), I get the expected output.
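
For comparison, XPath expressions that start with "/" are evaluated from the root of the document that the element belongs to, not from the element itself, which would also explain why the round trip through tostring() and fromstring() changes the result: the 'system' element then becomes the root of its own document. A small sketch of the relative forms, with variable names chosen only for illustration:

```python
from lxml import etree

root = etree.fromstring(
    "<systems><system name='gekko1'/><system name='gekko2'/></systems>")
first = root.xpath("/systems/system")[0]

# Absolute path: starts at the document root, which is <systems>.
absolute = first.xpath("/system/@name")    # []

# Relative path: starts at the context node `first`.
relative = first.xpath("@name")            # ['gekko1']

# For a single attribute, the element API avoids XPath altogether.
plain = first.get("name")                  # 'gekko1'

print(absolute, relative, plain)
```

A descendant search that stays below the context node can be written with a leading dot, as in ".//system/@name".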

Regards,
Alan Evangelista


Stefan Behnel | 28 Aug 17:01 2014

lxml 3.3.6 released

Hi all,

I just released lxml 3.3.6. This is a bug-fix-only release for the stable
lxml 3.3 series that fixes a couple of crashes.

The documentation is here: http://lxml.de/

Download:  http://lxml.de/files/lxml-3.3.6.tgz

Signature: http://lxml.de/files/lxml-3.3.6.tgz.asc

Changelog: http://lxml.de/3.3/changes-3.3.6.html

Github:
https://github.com/lxml/lxml/commit/4c8e222b6704b78381bdcaa5f6d3abf1d041d0b4

This release was built using Cython 0.20.1.

If you are interested in commercial support or customisations for the lxml
package, please contact me directly.

Have fun,

Stefan

3.3.6 (2014-08-28)
==================

Bugs fixed
----------

Will McGugan | 22 Aug 18:49 2014

position attribute of XMLSyntaxError seems wrong

Hi,

I'm catching XMLSyntaxErrors in my app and displaying information regarding the error, in particular the line and column of the error from the 'position' attribute of the exception. The line is fine, but the column doesn't seem to correspond to the point where I would consider the error to have occurred.

Here's a small Python 2 script that shows the problem:

    xml = b"""<test>
        <tag foo="some text" namespace:attr="value" />
    </test>
    
    """

    from lxml import etree
    import io
    lines = xml.splitlines()
    try:
        root = etree.parse(io.BytesIO(xml)).getroot()
    except Exception as e:
        line, col = e.position
        print lines[line - 1]
        print " " * (col - 1) + '^'


I get the following output from that:

    <tag foo="some text" namespace:attr="value" />
                           ^
Any help would be appreciated...

Will McGugan

--
Will McGugan
http://www.willmcgugan.com
Martin Mueller | 21 Aug 22:36 2014

a question about a loop within a loop


I would be grateful for advice on the following:

I iterate over a sequence of sibling elements with the typical code

for element in tree.iter(tei + 'w', tei + 'c'):
    pass  # do this or that

Within that sequence there are shorter sequences (between two or seven
elements) that begin with an element <w part="I"/> and end with an element
<w part="F"/>. There may or may not be one or more elements of the type <w
part="M"/>. Since most of the cases involve sequences of two or three
elements, I've dealt with them using code like "if the next-but-one
element has a part='F' attribute".

That works for the simple cases, but it would be much better if I could
break out of the current iteration, isolate the sequence that goes from <w
part="I"/> to <w part="F"/>, iterate over it, and integrate the result (a
single <w> element) back into the tree. But I don't know how to write code
that would  

1. start at a known point and make that the point of departure for a
sequence that can be iterated over
2. gather the elements that follow it until I come to the unknown future
point that is defined by part="F"

And I don't know whether that would be an lxml or a more general Python
procedure. 
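
One general-Python way to approach this is to walk the children of the parent once, collect a run that starts at part="I", and merge it when part="F" is reached. The sketch below uses the standard library's ElementTree so that it runs without lxml; the same calls (get, remove, text, tail) also exist on lxml elements. It rests on assumptions not confirmed above: the parts of a word are adjacent siblings, the merged word is simply the concatenated text, and merge_parts is a hypothetical name. For TEI data the tag argument would be the namespaced name (tei + 'w').

```python
import xml.etree.ElementTree as ET


def merge_parts(parent, tag="w"):
    """Merge each run <w part="I"> ... <w part="F"> among parent's children."""
    run = []                       # elements of the run being collected
    for el in list(parent):        # copy, because children are removed below
        part = el.get("part")
        if el.tag == tag and part == "I":
            run = [el]             # known starting point
        elif run and el.tag == tag and part in ("M", "F"):
            run.append(el)
            if part == "F":        # end of the run: fold it into one element
                first = run[0]
                first.text = "".join(e.text or "" for e in run)
                first.tail = el.tail          # keep text that follows the run
                del first.attrib["part"]
                for e in run[1:]:
                    parent.remove(e)
                run = []
        else:
            run = []               # anything else interrupts an open run
    return parent


sample = ET.fromstring(
    '<s><w part="I">un</w><w part="M">believ</w><w part="F">able</w>'
    '<c> </c><w>story</w></s>')
merge_parts(sample)
print(ET.tostring(sample))
```

Collecting into a list first and modifying the tree only once the closing element has been seen avoids changing the tree while it is still being iterated.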

Martin Mueller
Professor emeritus of English and Classics
Northwestern University

Elmar Bartel | 19 Aug 12:13 2014

"Inventing" XML elements - bug?

Hello Everyone,

This is my first posting to this list, so please excuse me if anything
does not meet the standards.

I use lxml.html.fromstring() to parse html.

The parser tries to do its best to make something reasonable,
even when the input is broken. This works fine and the parser does not
"invent" elements. I.e. the resulting tree does not contain elements
never present in the input.
But this is what I observe when the input is of the following kind:

	<html><body> .... </body><html

Note the missing '>' at the end of the input!
Whether "inventing" elements is a bug in case of invalid input
is debatable, but what if the number of elements is nearly doubled?

Please consider the following script which illustrates the effect:
It creates inside the <body> element a sequence of <img> elements
and checks after parsing the number of elements reported by
iterlinks():

=======================================================================
import sys
import lxml.html
import lxml.etree

parser= lxml.html.HTMLParser()

failCount= 0
for imageCount in range(1,20):
    # Produce some simple HTML document with some <img> elements
    content= '<html>\n<body>\n%s</body>\n</html>' % (
        '\n'.join(['<img src="verysmall-icon-%d.png" align="right">' % i
                   for i in range(imageCount)])
    )
    # Parse this and assert the number of links found.
    # (this always works)
    html= lxml.html.fromstring(content, parser=parser)
    imagesFound= len([x for x in html.iterlinks()])
    assert imagesFound == imageCount

    # Now remove the last '>' of the closing '</html>' tag.
    # After some tries, the parser "reuses" some of its
    # parsed tree fragments and appends them to the tree.
    # These fragments may even come from completely different
    # parsed documents.
    content= content[:-1]
    html= lxml.html.fromstring(content, parser=parser)
    imagesFound= len([x for x in html.iterlinks()])
    if imageCount != imagesFound:
        print('Input:\n%s\n%s\n%s' % ('-'*40, content, '-'*40))
        print('FAILURE: found %d img elements when only %d were present' % (imagesFound, imageCount))
        break

versionFmt= "%-25s %s"
print('')
print(versionFmt % ('Python', sys.version_info))
for vers in (
  'LXML_VERSION',
  'LIBXML_VERSION',
  'LIBXML_COMPILED_VERSION',
  'LIBXSLT_VERSION',
  'LIBXSLT_COMPILED_VERSION',
):
    print(versionFmt % (vers, getattr(lxml.etree, vers)))
=======================================================================

On my machine (Ubuntu 12.04) the output is:
=======================================================================
Input:
----------------------------------------
<html>
<body>
<img src="verysmall-icon-0.png" align="right">
<img src="verysmall-icon-1.png" align="right">
<img src="verysmall-icon-2.png" align="right">
<img src="verysmall-icon-3.png" align="right">
<img src="verysmall-icon-4.png" align="right">
<img src="verysmall-icon-5.png" align="right">
<img src="verysmall-icon-6.png" align="right">
<img src="verysmall-icon-7.png" align="right">
<img src="verysmall-icon-8.png" align="right">
<img src="verysmall-icon-9.png" align="right">
<img src="verysmall-icon-10.png" align="right"></body>
</html
----------------------------------------
FAILURE: found 20 img elements when only 11 were present

Python                    sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0)
LXML_VERSION              (3, 3, 5, 0)
LIBXML_VERSION            (2, 7, 8)
LIBXML_COMPILED_VERSION   (2, 7, 8)
LIBXSLT_VERSION           (1, 1, 26)
LIBXSLT_COMPILED_VERSION  (1, 1, 26)
=======================================================================

On a different machine (Solaris 10 ;-)

=======================================================================
Input:
----------------------------------------
<html>
<body>
<img src="verysmall-icon-0.png" align="right">
<img src="verysmall-icon-1.png" align="right">
<img src="verysmall-icon-2.png" align="right">
<img src="verysmall-icon-3.png" align="right">
<img src="verysmall-icon-4.png" align="right">
<img src="verysmall-icon-5.png" align="right">
<img src="verysmall-icon-6.png" align="right">
<img src="verysmall-icon-7.png" align="right">
<img src="verysmall-icon-8.png" align="right">
<img src="verysmall-icon-9.png" align="right">
<img src="verysmall-icon-10.png" align="right"></body>
</html
----------------------------------------
FAILURE: found 20 img elements when only 11 were present

Python                    sys.version_info(major=2, minor=7, micro=1, releaselevel='final', serial=0)
LXML_VERSION              (2, 3, 5, 0)
LIBXML_VERSION            (2, 9, 0)
LIBXML_COMPILED_VERSION   (2, 6, 23)
LIBXSLT_VERSION           (1, 1, 28)
LIBXSLT_COMPILED_VERSION  (1, 1, 24)
=======================================================================

I discovered this behaviour when crawling a web site. The crawler is
multi-threaded, and the links reported by iterlinks() returned 404
when it tried to fetch them.
The cause was iterlinks() running on a tree built from a web page
with a missing '>' at the end: the parser produced a tree with a lot
of fragments coming from other parsed pages...
You can imagine what happens then.

Yours,
Elmar.
-- 
LEO GmbH          | Elmar Bartel                 | 
Mühlweg 2b        | Phone: +49 (0)8104-90950141  | No signature here.
D-82054 Sauerlach | Fax:   +49 (0)8104-90950290  |
Germany           | Email: elmar <at> leo.org         |

Register Gericht: Amtsgericht München, HRB161107
Geschäftsführer:  Hans Riethmayer, Elmar Bartel
