Felix Fontein | 23 May 14:42 2015

Problem when writing source tag in HTML5 video tag

Hi everyone!

I'm using lxml.html to read and write an HTML5 file. I noticed that if
the file contains a <video> element with a <source> child, writing the
file back generates a </source> closing tag which shouldn't be there.

Minimal example:

---------8<---------8<---------8<---------8<---------8<---------

import lxml.html

data = """<!DOCTYPE html>
<html>
  <head>
    <title>1</title>
  </head>
  <body>
    <video>
      <source src="1.ogv" type="video/ogg">
    </video>
  </body>
</html>"""

parser = lxml.html.HTMLParser()
doc = lxml.html.document_fromstring(data, parser)
data = b'<!DOCTYPE html>\n' + lxml.html.tostring(doc)
result = data.decode('utf-8')


prakhar joshi | 23 May 08:53 2015

Parsing iframe leading text error

Hello everyone,
 
I have been using lxml and need to parse an iframe tag with leading text (http://pastie.org/10203277). When I do so, I get this error: http://pastie.org/10203259. I want the output to be identical to the input HTML. Can anyone help me with this error? I can't find the proper way in the docs. Thanks for your guidance.

Cheers!!

Prakhar Joshi
DA-IICT,Gandhinagar
_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml <at> lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml
Thomas Schraitle | 21 May 10:28 2015

Line Number of a Start Tag

Hi,

assume I have the following Python 3 code:

------------------------------
import io
from lxml import etree

source = """<?xml version="1.0"?>
<article version="5.0" xml:lang="en"
         xmlns="http://docbook.org/ns/docbook"
         xmlns:xlink="http://www.w3.org/1999/xlink">
  <title>...</title>
  <para>...</para>
</article>
"""

tree = etree.parse(io.StringIO(source))
root = tree.getroot()
print(root.sourceline)
------------------------------

When I run the above code, I get "4" as a result. This is a bit
unexpected.

It seems that root.sourceline returns the line number where the start
tag _ends_. However, I need the line number where <article> _starts_
("2" in this example).

How can I get the "starting" line number of a start tag?
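
One possible workaround, sketched with the standard library's expat bindings rather than with lxml itself: inside a callback, expat reports the position of the first character of the event, so a start-tag handler sees the line of the opening "<". The helper name `start_tag_lines` is made up for this illustration, and matching the collected lines back to lxml elements (for example by document order) is left out.

```python
import xml.parsers.expat


def start_tag_lines(xml_text):
    """Return (tag name, line number) pairs in document order."""
    lines = []
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        # During a callback, CurrentLineNumber refers to the first
        # character of the event, i.e. the "<" of the start tag.
        lines.append((name, parser.CurrentLineNumber))

    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)
    return lines


source = """<?xml version="1.0"?>
<article version="5.0" xml:lang="en"
         xmlns="http://docbook.org/ns/docbook"
         xmlns:xlink="http://www.w3.org/1999/xlink">
  <title>...</title>
  <para>...</para>
</article>
"""

for name, line in start_tag_lines(source):
    print(name, line)
```

For the document above this reports line 2 for <article>, where sourceline gave 4.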

Thanks!

-- 
Gruß/Regards,
    Thomas Schraitle
_________________________________________________________________
Tom Kralidis | 12 May 20:03 2015

cleanup namespaces and XML elements with QNames


Hi: using lxml 3.3.5, we have XML documents with elements having QName
type values.  We would like to use etree.cleanup_namespaces, but find
that afterwards downstream parsers/validators complain about undeclared
namespace prefixes.  Below is an isolated example:

from lxml import etree

nsmap = {
     'ogc': 'http://www.opengis.net/ogc',
     'ows': 'http://www.opengis.net/ows',
     'gml': 'http://www.opengis.net/gml'
}

root = etree.Element('{http://www.opengis.net/ogc}Filter', nsmap=nsmap)

typename = etree.SubElement(root, '{http://www.opengis.net/ogc}typeName')
typename.text = etree.QName('http://www.opengis.net/gml', 'Envelope')

typename2 = etree.SubElement(root, '{http://www.opengis.net/ogc}typeName')
typename2.text = etree.QName('{http://www.opengis.net/gml}Envelope')

# Serialize before and after the cleanup to compare the declarations
print(etree.tostring(root, pretty_print=True).decode())
etree.cleanup_namespaces(root)
print(etree.tostring(root, pretty_print=True).decode())

Here we would like to keep the gml namespace declaration, but it looks
like cleanup_namespaces throws out namespace declarations even if they
are referenced from element content.

Are there any workarounds we can use/implement to clean up unused
namespaces while preserving those used in element content as above?
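
For reference, newer lxml releases (3.5 and later as far as I know, so not the 3.3.5 used here) let cleanup_namespaces() take a keep_ns_prefixes argument that lists prefixes whose declarations must survive. A minimal sketch under that assumption:

```python
from lxml import etree

nsmap = {
    'ogc': 'http://www.opengis.net/ogc',
    'ows': 'http://www.opengis.net/ows',
    'gml': 'http://www.opengis.net/gml',
}

root = etree.Element('{http://www.opengis.net/ogc}Filter', nsmap=nsmap)
typename = etree.SubElement(root, '{http://www.opengis.net/ogc}typeName')
typename.text = etree.QName('http://www.opengis.net/gml', 'Envelope')

# Drop unused declarations, but keep 'gml' because element content uses it.
# keep_ns_prefixes is assumed to be available (lxml >= 3.5).
etree.cleanup_namespaces(root, keep_ns_prefixes=['gml'])

cleaned = etree.tostring(root, pretty_print=True).decode()
print(cleaned)
```

The prefixes to keep still have to be listed by hand; lxml does not inspect text content for QName values.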

Thanks in advance.

..Tom
_________________________________________________________________
Tom Kralidis | 10 May 19:01 2015

setting XML_CATALOG_FILES

Hi: using lxml 3.3.5 our project is implementing XML catalogs.

On initial testing it appears that the XML_CATALOG_FILES environment
variable needs to be set before lxml is imported, i.e.:

# e.g. 1: works
import os
os.environ['XML_CATALOG_FILES'] = '/tmp/catalog.xml'
from lxml import etree
...
# validate XML
schema = etree.XMLSchema(file=myschema)
parser = etree.XMLParser(schema=schema)
doc = etree.fromstring(postdata, parser)

# e.g. 2: does not work
import os
from lxml import etree
...
os.environ['XML_CATALOG_FILES'] = '/tmp/catalog.xml'
...
# validate XML
schema = etree.XMLSchema(file=myschema)
parser = etree.XMLParser(schema=schema)
doc = etree.fromstring(postdata, parser)

Is there any way for lxml to realize XML_CATALOG_FILES after being imported?

Thanks

..Tom
_________________________________________________________________
Aaron Storm | 1 May 20:24 2015

tree.getpath() on xml with default namespace (no prefix)

I'm having an issue with an XML file that has a default namespace (no prefix): getpath() returns '*' for every step.
Is there a workaround? I must be missing something very basic, but the documentation for getpath() doesn't mention what needs to be done in this case (http://lxml.de/xpathxslt.html#generating-xpath-expressions).


##getpath.py
from lxml import etree
import sys

xml_file = sys.argv[1]
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(xml_file, parser)
root = tree.getroot()

for child in root.iter():
    print(tree.getpath(child))


##test1.xml -- with xmlns, but no prefix -- the result is not what I'm expecting
<Test xmlns="http://www.test.org/test">
<elem>some text</elem>
</Test>

$ python getpath.py test1.xml
/*
/*/*


##test2.xml -- without xmlns -- showing the expected results
<Test>
<elem>some text</elem>
</Test>

$ python getpath.py test2.xml
/Test
/Test/elem

##test3.xml -- xmlns with prefix -- working as expected
<ns:Test xmlns:ns="http://www.test.org/test">
<ns:elem>some text</ns:elem>
</ns:Test>

$ python getpath.py test3.xml 
/ns:Test
/ns:Test/ns:elem
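
A possible workaround is to build the path by hand and pick a prefix for the default namespace. The sketch here is not part of lxml: `prefixed_path` and the prefix `t` are made-up names, and the same prefix mapping has to be passed again through `namespaces=` when the path is evaluated with xpath().

```python
from lxml import etree


def prefixed_path(element, prefixes):
    """Build an absolute XPath, mapping namespace URIs to chosen prefixes."""
    steps = []
    node = element
    while node is not None:
        qname = etree.QName(node)
        prefix = prefixes.get(qname.namespace)
        step = "%s:%s" % (prefix, qname.localname) if prefix else qname.localname
        parent = node.getparent()
        if parent is not None:
            # Add a position only when several siblings share the same tag.
            twins = list(parent.iterchildren(node.tag))
            if len(twins) > 1:
                step += "[%d]" % (twins.index(node) + 1)
        steps.append(step)
        node = parent
    return "/" + "/".join(reversed(steps))


ns = "http://www.test.org/test"
root = etree.fromstring('<Test xmlns="%s"><elem>some text</elem></Test>' % ns)

# Restrict iteration to elements so comments and PIs are skipped.
paths = [prefixed_path(el, {ns: "t"}) for el in root.iter(etree.Element)]
print(paths)
```

For test1.xml this gives /t:Test and /t:Test/t:elem, which can be evaluated with root.xpath(path, namespaces={"t": ns}).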

Cheers,
Aaron


_________________________________________________________________
Jens Tröger | 19 Apr 16:07 2015

Performance of _Element.find()

Hi,

Suppose I have a large and shallow XML tree; in my case a <book> with
several thousand <par> elements.  I also have a large number (thousands)
of paths like

  ...
  chapter[2]/par[19] 
  ...
  chapter[2]/par[538]/em 
  ...
  chapter[2]/par[1937]
  ...

I started with a loop that iterates over these paths, calls find() and
then manipulates the attributes of the found _Element instance.  That
could take several tens of seconds.  It turns out that the culprit in my
loop was the call to find(), accounting for 99% of the time.

So I tried to see what would happen if I used an index map, which is
generated in two passes:

1. Iterate over all paths, split them into their components and use
those components as dict keys.  Nest the dictionaries according to
their path component.  The value is a tuple (elem, dict()) where elem
will be filled in by the second pass, and the dictionary is for nesting.
For the above example:

  { 'chapter[2]': (None, { 'par[19]': (None, {}),
                           'par[538]': (None, { 'em': (None, {}) }),
                           'par[1937]': (None, {}),
                         }),
  }

2. Iterate over all nodes of the XML tree (xpath('//*')) and get their
path.  Then fill the above dictionary with the elem references for those
which are in that dictionary, i.e. replace the None with _Element
instance references.

The cost of building that index map is negligible.

Using this index map to find the elements in the XML tree is orders of
magnitude faster than using find() -- iterating over all path
expressions to manipulate attributes of elements went from several tens
of seconds to a fraction of a second.
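
A rough sketch of the same idea, simplified to a flat dictionary and a single recursive pass instead of nested dictionaries and two passes. It uses the standard library's ElementTree so that it runs without lxml (the element calls used here are shared by both APIs). `normalise` and `build_index` are illustrative names, and a step without a predicate is treated as `[1]`, which is the element find() would return anyway.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def normalise(path):
    """Give every step an explicit position: 'par[3]/em' -> 'par[3]/em[1]'."""
    return "/".join(s if s.endswith("]") else s + "[1]" for s in path.split("/"))


def build_index(root, paths):
    """Map each requested path to its element in one walk over the tree."""
    wanted = {}
    for p in paths:
        wanted.setdefault(normalise(p), []).append(p)
    index = {}

    def walk(node, prefix):
        seen = Counter()  # per-tag position among the children of node
        for child in node:
            seen[child.tag] += 1
            here = "%s%s[%d]" % (prefix, child.tag, seen[child.tag])
            for original in wanted.get(here, ()):
                index[original] = child
            walk(child, here + "/")

    walk(root, "")
    return index


book = ET.fromstring(
    "<book><chapter/><chapter><par/><par><em>x</em></par></chapter></book>")
paths = ["chapter[2]/par[1]", "chapter[2]/par[2]/em"]
index = build_index(book, paths)

# Each lookup is now a dictionary access instead of a find() call.
for p in paths:
    print(p, index[p].tag)
```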

I am astonished that find() is so slow.  Why is that?  Is walking down
the tree based on the path string that expensive?

Cheers,
Jens

-- 
Jens Tröger
http://savage.light-speed.de/
_________________________________________________________________
Stefan Behnel | 15 Apr 19:44 2015

lxml 3.4.3 released

Hi all,

I just released a conservative bug fix version of lxml, 3.4.3. Updating is
recommended as it fixes the ElementPath expression cache and thus a
previous performance regression in the find*() methods.

Note that the 3.4 release series no longer supports Python versions before
Py2.6 and Py3.2. Support for very old versions of libxml2 and libxslt
(<=2008) was also removed.

The documentation is here: http://lxml.de/

Download:  http://lxml.de/files/lxml-3.4.3.tgz

Signature: http://lxml.de/files/lxml-3.4.3.tgz.asc

Changelog: http://lxml.de/3.4/changes-3.4.3.html

Github:
https://github.com/lxml/lxml/commit/fe7dfaa133ec963a5173169993d464c324640f87

This release was built using Cython 0.21.2.

If you are interested in commercial support or customisations for the lxml
package, please contact me directly.

Have fun,

Stefan

3.4.3 (2015-04-15)
==================

Bugs fixed
----------

* Expression cache in ElementPath was ignored.  Fix by Changaco.

* LP#1426868: Passing a default namespace and a prefixed namespace mapping
  as nsmap into ``xmlfile.element()`` raised a ``TypeError``.

* LP#1421927: DOCTYPE system URLs were incorrectly quoted when containing
  double quotes.  Patch by Olli Pottonen.

* LP#1419354: meta-redirect URLs were incorrectly processed by
  ``iterlinks()`` if preceded by whitespace.
_________________________________________________________________
Andreas Jung | 11 Apr 04:01 2015

Very slow XMLSchema parsing

Hi there,

I am using this code


for benchmarking the XSD parsing speed against some variants of the MODS schema.

mods-3-1.xsd and mods-3-2.xsd take about 15 seconds to parse, while the other variants
parse in roughly 0.3 seconds.

How can one explain this huge difference?

Andreas

----

mods-3-1.xsd    0.00209999084473    15.2129580975
mods-3-2.xsd    0.00260806083679    15.2835290432
mods-3-3.xsd    0.00289702415466    0.300955057144
mods-3-4.xsd    0.00385713577271    0.313620090485
mods-3-5.xsd    0.00278782844543    0.278451919556

_________________________________________________________________
Jens Tröger | 6 Apr 23:56 2015

XSLT extension: process_children

Hi,

I am trying to understand the process_children() function.  If in my
extension I have

    def execute(self, context, self_node, input_node, output_parent):       
        self.process_children(context, output_parent)                       

then processing works just fine (although I haven't done anything with
the output here).  However, using

    def execute(self, context, self_node, input_node, output_parent):       
        x = self.process_children(context, output_parent=None)              
        output_parent.append(x)                                             

crashes:

      x = self.process_children(context, output_parent=None)
    File "xsltext.pxi", line 111, in lxml.etree.XSLTExtension.process_children (src/lxml/lxml.etree.c:164356)
    File "readonlytree.pxi", line 550, in lxml.etree._nonRoNodeOf (src/lxml/lxml.etree.c:78766)
  TypeError: invalid argument type <class 'NoneType'>

although I had assumed the two to be synonymous.  I ask because I want
to inject my own node and dangle children off of my own node like this:

    p = lxml.etree.Element("p")
    p.attrib["style"] = "..."
    output_parent.append(p)
    self.process_children(context, p)

but that, too, fails.  What am I doing wrong here?

Thanks!
Jens

-- 
Jens Tröger
http://savage.light-speed.de/
_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml <at> lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml
John R. Frank | 4 Apr 19:13 2015

HTML alignment of Offsets and XPath

Hi lxml experts,

I have a question about character offsets and Xpath that would not be 
necessary if everything in the world were XML:

tl;dr:  what is the best way to translate back and forth between a 
character (or byte) offset in the string form of the HTML and the 
XPath-plus-relative-offset in the rendered DOM?

Context:

We're working on an HTML highlighting tool that is intended to allow users 
to modify/create text selections generated by automatic named entity 
recognition algorithms.  Many parts of this are working, and a key element 
uses your wonderful lxml.html.clean.Cleaner [1]

After cleansing a page to make `clean_html`, we also generate a 
tag-stripped form that has exactly the same byte offsets by replacing tags 
with whitespace of the same byte length.  We call this `clean_visible`. 
This allows named entity recognizers, such as LingPipe, Basis Rosette, 
Stanford CoreNLP, Clear Forest, etc to recognize names of things (people, 
companies, locations, etc) in the natural language.  The resulting offsets 
then are correct for both the HTML string and the tag-stripped form. 
Some NER tools can parse HTML to do an even better job of their task, but 
many cannot.
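
The equal-length replacement described above could look roughly like this. It is only a sketch: `make_clean_visible` is a made-up name, and the regular expression assumes that every "<" in the cleaned HTML opens a tag, which ignores comments, scripts and other cases that a real implementation has to handle.

```python
import re

# Matches one tag in a byte string; deliberately simplistic.
TAG = re.compile(rb"<[^>]*>")


def make_clean_visible(clean_html):
    """Replace each tag with spaces of the same byte length."""
    return TAG.sub(lambda m: b" " * len(m.group(0)), clean_html)


clean_html = "<p>Gr\u00fc\u00dfe from <b>Ada Lovelace</b></p>".encode("utf-8")
clean_visible = make_clean_visible(clean_html)

# Offsets found in the stripped text point at the same bytes in the HTML.
start = clean_visible.index(b"Ada")
print(clean_html[start:start + 12])
```

Working on bytes rather than on decoded text keeps the offsets stable even when the page contains multi-byte characters.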

The user facing components are in a not-yet-released FOSS javascript 
component called "HTML highlighter" that operates *in the browser* on the 
live DOM rendered from the clean_html.  Using ideas similar to those in 
Rangy [2], it figures out xpath+offset for start and end of the user's 
selection.  This form of offset is really all that JavaScript can handle.

To make this whole thing work perfectly, we need to construct a python 
service that translates between absolute offsets in the HTML string and 
xpath+offsets in the corresponding DOM.

Can lxml help with this?

Thanks for any guidance.  (and please no flames about how the whole world 
should be in XML, because... that's not this world :-)

John

[1]  https://github.com/trec-kba/streamcorpus-pipeline/blob/master/streamcorpus_pipeline/_clean_html.py#L122-L127

[2]  https://github.com/timdown/rangy

--
______________________________
John R. Frank <jrf <at> diffeo.com>
_________________________________________________________________