Stefan Behnel | 15 Apr 19:44 2015

lxml 3.4.3 released

Hi all,

I just released a conservative bug fix version of lxml, 3.4.3. Updating is
recommended as it fixes the ElementPath expression cache and thus a
previous performance regression in the find*() methods.

Note that the 3.4 release series no longer supports Python versions before
Py2.6 and Py3.2. Support for very old versions of libxml2 and libxslt
(<=2008) was also removed.

The documentation is here: http://lxml.de/

Download:  http://lxml.de/files/lxml-3.4.3.tgz

Signature: http://lxml.de/files/lxml-3.4.3.tgz.asc

Changelog: http://lxml.de/3.4/changes-3.4.3.html

Github:
https://github.com/lxml/lxml/commit/fe7dfaa133ec963a5173169993d464c324640f87

This release was built using Cython 0.21.2.

If you are interested in commercial support or customisations for the lxml
package, please contact me directly.

Have fun,

Stefan


Andreas Jung | 11 Apr 04:01 2015

Very slow XMLSchema parsing

Hi there,

I am using this code


for benchmarking the XSD parsing speed against some variants of the MODS schema.

mods-3-1.xsd and mods-3-2.xsd take about 15 seconds to parse, while the other variants
parse in about 0.3 seconds.

How can one explain this huge difference?

Andreas

----

mods-3-1.xsd

0.00209999084473

15.2129580975

--------------------------------------------------------------------------------

mods-3-2.xsd

0.00260806083679

15.2835290432

--------------------------------------------------------------------------------

mods-3-3.xsd

0.00289702415466

0.300955057144

--------------------------------------------------------------------------------

mods-3-4.xsd

0.00385713577271

0.313620090485

--------------------------------------------------------------------------------

mods-3-5.xsd

0.00278782844543

0.278451919556
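
The code referenced in the post is not preserved in this archive. As a rough sketch only, the measurement could look like the following, assuming the two figures printed per schema are the time to parse the XSD file as plain XML and the time to compile it with `etree.XMLSchema`. The helper name `time_schema` and the tiny inline schema are hypothetical stand-ins, not the poster's code.

```python
import io
import time

from lxml import etree


def time_schema(source):
    """Return (parse_seconds, compile_seconds) for one XSD source.

    `source` is a filename or a file-like object. Hypothetical helper,
    not the code from the original post.
    """
    start = time.perf_counter()
    doc = etree.parse(source)      # plain XML parsing of the XSD file
    parsed = time.perf_counter()
    etree.XMLSchema(doc)           # schema compilation
    compiled = time.perf_counter()
    return parsed - start, compiled - parsed


# Tiny stand-in schema so the sketch runs without the MODS files.
DEMO_XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="title" type="xs:string"/>
</xs:schema>"""

parse_s, compile_s = time_schema(io.BytesIO(DEMO_XSD))
print("parse: %.6f s, compile: %.6f s" % (parse_s, compile_s))
```

Timing the two phases separately shows whether the cost sits in reading the file or in schema compilation, which is also the phase where any `xs:import` or `xs:include` targets get resolved.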

_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml <at> lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml
Jens Tröger | 6 Apr 23:56 2015

XSLT extension: process_children

Hi,

I am trying to understand the process_children() function.  If in my
extension I have

    def execute(self, context, self_node, input_node, output_parent):       
        self.process_children(context, output_parent)                       

then processing works just fine (although I haven't done anything with
the output here).  However, using

    def execute(self, context, self_node, input_node, output_parent):       
        x = self.process_children(context, output_parent=None)              
        output_parent.append(x)                                             

crashes:

      x = self.process_children(context, output_parent=None)
    File "xsltext.pxi", line 111, in lxml.etree.XSLTExtension.process_children (src/lxml/lxml.etree.c:164356)
    File "readonlytree.pxi", line 550, in lxml.etree._nonRoNodeOf (src/lxml/lxml.etree.c:78766)
  TypeError: invalid argument type <class 'NoneType'>

although I had assumed the two to be synonymous.  I ask because I want
to inject my own node and dangle children off of my own node like this:

    p = lxml.etree.Element("p")
    p.attrib["style"] = "..."
    output_parent.append(p)
    self.process_children(context, p)

but that, too, fails.  What am I doing wrong here?

Thanks!
Jens

-- 
Jens Tröger
http://savage.light-speed.de/
_________________________________________________________________
John R. Frank | 4 Apr 19:13 2015

HTML alignment of Offsets and XPath

Hi lxml experts,

I have a question about character offsets and XPath that would not be 
necessary if everything in the world were XML:

tl;dr:  what is the best way to translate back and forth between a 
character (or byte) offset in the string form of the HTML and the 
XPath-plus-relative-offset in the rendered DOM?

Context:

We're working on an HTML highlighting tool that is intended to allow users 
to modify/create text selections generated by automatic named entity 
recognition algorithms.  Many parts of this are working, and a key element 
uses your wonderful lxml.html.clean.Cleaner [1]

After cleansing a page to make `clean_html`, we also generate a 
tag-stripped form that has exactly the same byte offsets by replacing tags 
with whitespace of the same byte length.  We call this `clean_visible`. 
This allows named entity recognizers, such as LingPipe, Basis Rosette, 
Stanford CoreNLP, Clear Forest, etc to recognize names of things (people, 
companies, locations, etc) in the natural language.  The resulting offsets 
then are correct for both the HTML string and the tag-stripped form. 
Some NER tools can parse HTML to do an even better job of their task, and 
many do not.
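
The `clean_visible` construction described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the streamcorpus-pipeline implementation: the tag pattern is deliberately naive, and the function name `make_clean_visible` is made up for this example.

```python
import re

# Deliberately naive tag pattern: it ignores comments, CDATA sections and
# ">" characters inside attribute values, which real code must handle.
TAG_RE = re.compile(rb"<[^>]*>")


def make_clean_visible(clean_html):
    """Replace each tag with spaces of the same byte length."""
    return TAG_RE.sub(lambda m: b" " * len(m.group(0)), clean_html)


clean_html = "<p>Zoë met <b>Bob</b>.</p>".encode("utf-8")
clean_visible = make_clean_visible(clean_html)

# Both forms have the same length, so a byte offset found in the
# tag-stripped form points at the same bytes in the HTML string.
start = clean_visible.index(b"Bob")
print(start, clean_html[start:start + 3])
```

Working on bytes rather than on decoded text keeps the offsets stable when the page contains multi-byte characters.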

The user-facing components are in a not-yet-released FOSS JavaScript 
component called "HTML highlighter" that operates *in the browser* on the 
live DOM rendered from the clean_html.  Using ideas similar to those in 
Rangy [2], it figures out xpath+offset for start and end of the user's 
selection.  This form of offset is really all that JavaScript can handle.

To make this whole thing work perfectly, we need to construct a python 
service that translates between absolute offsets in the HTML string and 
xpath+offsets in the corresponding DOM.

Can lxml help with this?

Thanks for any guidance.  (and please no flames about how the whole world 
should be in XML, because... that's not this world :-)

John

[1]  https://github.com/trec-kba/streamcorpus-pipeline/blob/master/streamcorpus_pipeline/_clean_html.py#L122-L127

[2]  https://github.com/timdown/rangy

--
______________________________
John R. Frank <jrf <at> diffeo.com>
_________________________________________________________________
Martin Mueller | 4 Apr 16:40 2015

memory allocation failure


I am puzzled about the following error message, which I got when I ran a familiar script on my new iMac with Python 3.4 and lxml 3.4.2.




Traceback (most recent call last):
  File "/Users/martinmueller/Dropbox/PycharmProjects/emd/emdFeb2015.py", line 99, in <module>
    tree = etree.parse(filename, parser)
  File "lxml.etree.pyx", line 3301, in lxml.etree.parse (src/lxml/lxml.etree.c:72453)
  File "parser.pxi", line 1791, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:105915)
  File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:106214)
  File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:105213)
  File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:100163)
  File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94286)
  File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:95722)
  File "parser.pxi", line 620, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:94789)
lxml.etree.XMLSyntaxError: Memory allocation failed, line 13323, column 18


This error occurs after running about 70 texts, each of them between 2 and 4 MB in length. The error is not a function of anything in the text that fails, because the same text is processed perfectly when processed separately. Watching memory allocation for the different processes in the Mac Activity Monitor, I see nothing unusual about the memory currently used by Python or by PyCharm, which I use.

It would seem from this diagnosis that somehow memory is used up cumulatively in lxml and crosses some threshold after a while.  Is it related to an earlier problem where the underlying libxml stores all xml:ids in batch operations?  But that led to a noticeable slowdown in operations, whereas here the processing time for each text seems a stable and linear function of its length, until suddenly it collapses. 

I’ll be grateful for any advice. 
_________________________________________________________________
Charlie Clark | 1 Apr 20:23 2015

Automatic Windows wheels

Hiya,

I guess this is primarily for Stefan but it looks like it's now possible  
to get the relevant Python wheels built automatically:

https://packaging.python.org/en/latest/appveyor.html

I wonder if this could be used by lxml? And if so, could I get a Python  
3.5 build? ;-) Python 3.5 is due to go to beta soon, so I'd like to have it  
in my tox configuration, now that I've more or less worked out how to get tox  
working with lxml on Windows.

Charlie
-- 
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Kronenstr. 27a
Düsseldorf
D- 40217
Tel: +49-211-600-3657
Mobile: +49-178-782-6226
_________________________________________________________________
Olivier Collioud | 1 Apr 15:44 2015

How to prefix embedded xhtml elements

Hi,

I'm using Python 2.7 and LXML 2.3.0 (PortablePython distrib).

My custom XSD imports the XHTML 1.0 Strict one.

All XSDs have elementFormDefault="qualified".

In my XML instances a default namespace is declared for the custom XSD elements (such elements are not prefixed). The xmlns:xhtml prefix is also declared on the root start tag.

I would like all xhtml elements to be prefixed by "xhtml:".

Instead, LXML is just changing the default namespace on each xhtml fragment root start tag.

In my code, xhtml fragment root elements are created with:
div = etree.Element('{http://...}div')

children elements are added this way:
p = etree.SubElement(div, '{http://...}p')

and then div is added to the custom parent element:
parent.append(div)

What am I missing?

The only (UGLY) workaround I found so far is to create elements this way:
div = etree.Element('xhtml_div')

and then serialize this way:
etree.tostring(root, ...).replace('<xhtml_', '<xhtml:').replace('</xhtml_', '</xhtml:')
!

Regards,
Olivier.
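
One approach that may help with the question above is to pass a prefix mapping through the `nsmap` argument when the elements are created, so that the serializer has a preferred prefix for the XHTML namespace. The following is a minimal sketch: `CUSTOM_NS` and the element names `doc` and `section` are placeholders invented for the example, and behaviour on an lxml release as old as 2.3.0 may differ.

```python
from lxml import etree

# Placeholder URI for illustration only; substitute the real one.
CUSTOM_NS = "urn:example:custom"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# The None key maps the default namespace; "xhtml" is the wanted prefix.
NSMAP = {None: CUSTOM_NS, "xhtml": XHTML_NS}

root = etree.Element("{%s}doc" % CUSTOM_NS, nsmap=NSMAP)
parent = etree.SubElement(root, "{%s}section" % CUSTOM_NS)

# Giving the fragment root its own prefix mapping lets the serializer
# write "xhtml:div" instead of redefining the default namespace.
div = etree.Element("{%s}div" % XHTML_NS, nsmap={"xhtml": XHTML_NS})
p = etree.SubElement(div, "{%s}p" % XHTML_NS)
p.text = "Hello"
parent.append(div)

serialized = etree.tostring(root, pretty_print=True)
print(serialized.decode("ascii"))
```

The prefix an element carries is fixed when the element is created, so the mapping has to be supplied at creation time rather than set afterwards.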

_________________________________________________________________
Alex Boese | 27 Mar 14:08 2015

Etree to string problematic?

I noticed a strange behavior, and can only describe it, as the code I write is fully owned by those who employ me.

I was utilizing an lxml iterator routine to go through all of the nodes in some xml documents to look for
differences. Now, because of the rigor of this computation, I had decided to convert some of these
elements to string using the etree function by the same name.

What I have observed in practice, but did not expect, is that sometimes the tostring function appends a
carriage return to the output. This seems to occur when there is whitespace between the closing tag and the
next tag, which in this case was another closing tag.

So if I had two identical documents, both pretty-printed using "xmllint --format filename", and added a carriage
return to the second just outside of a closing tag, something like this might occur: everything
between the tags would be precisely the same, but the extra carriage return after the end tag would
cause tostring to produce different output.

I'm using a version of Python 2. I could probably post the versions of lxml and libxml2 as well. Has anyone
had this experience? 

Sent from my Planet
_________________________________________________________________
Burak Arslan | 20 Mar 17:50 2015

how to make a unicode string valid for xml?

Hello,

I'm looking for a function like xml_unicode(some_unicode_string,
'ignore') that works like unicode(some_string, 'utf8', 'ignore'). Does
lxml export such a function? I looked around the source but I didn't see
any.

Best regards,
Burak
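
A function of the kind asked for above can be sketched in pure Python by filtering against the `Char` production of the XML 1.0 specification. This is not an lxml API. The name `xml_unicode` is taken from the question, the `errors` modes are an assumption about the intended behaviour, and the code is written for Python 3 strings.

```python
import re

# Complement of the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def xml_unicode(text, errors="ignore"):
    """Return `text` with characters that XML 1.0 forbids handled.

    errors="ignore" drops them, errors="replace" substitutes U+FFFD.
    Hypothetical helper, not part of lxml.
    """
    if errors == "ignore":
        return _INVALID_XML_CHARS.sub("", text)
    if errors == "replace":
        return _INVALID_XML_CHARS.sub("\uFFFD", text)
    raise ValueError("unsupported errors mode: %r" % (errors,))


print(repr(xml_unicode("ok\x00\x0bdone\n")))
```

Tab, newline and carriage return are kept because XML 1.0 allows them; the other C0 control characters are removed.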
_________________________________________________________________
Maciej Fijalkowski | 11 Mar 15:28 2015

Inclusion of lxml-cffi into lxml

Hi

What would it take to include lxml-cffi
(https://github.com/amauryfa/lxml/tree/cffi) as an official part of
lxml? It works better on PyPy (with the original lxml being slow and
prone to bugs, notably segfaulting for me).

Cheers,
fijal
_________________________________________________________________
Omar Gutiérrez | 10 Mar 20:47 2015

Why the character apostrophe is not escaped?

I was wondering why the apostrophe is not automatically escaped like other characters.

For example:

< is transformed to &lt;

and

> is transformed to &gt;

but ' is not transformed to &apos;


Python              : sys.version_info(major=2, minor=7, micro=6, releaselevel='final', serial=0)

lxml.etree          : (3, 3, 3, 0)
libxml used         : (2, 9, 1)
libxml compiled     : (2, 9, 1)
libxslt used        : (1, 1, 28)
libxslt compiled    : (1, 1, 28)
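
The behaviour described above can be reproduced with a short script. For context, XML only requires `<` and `&` to be escaped in text content, and an apostrophe needs escaping only inside an attribute value that is itself delimited by apostrophes. The element and attribute names below are made up for the example.

```python
from lxml import etree

el = etree.Element("a")
el.text = "<it's>"                 # text with an apostrophe and brackets
el.set("k", "it's \"quoted\"")     # attribute value with both quote kinds

serialized = etree.tostring(el)
print(serialized.decode("ascii"))
```

In the output the angle brackets in the text appear as entities, while the apostrophe is written literally in both the text and the attribute value.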

