John Munroe | 26 Jun 08:14 2015

Error from .itertext() “ValueError: Input object has no element: HtmlComment”

Hi

I'm trying to iterate through the text content of a subtree using elt.itertext() (v3.5.0b1 git master
branch) as follows:

import lxml.html.soupparser as soupparser
import requests

doc = requests.get("http://f10.5post.com/forums/showthread.php?t=1142017").content
tree = soupparser.fromstring(doc)

nodes = tree.getchildren()

for elt in nodes:
    for t in elt.itertext():
        print(t)

But I keep getting an error saying

 File "src/lxml/iterparse.pxi", line 248, in lxml.etree.iterwalk.__init__ (src/lxml/lxml.etree.c:134032)
 File "src/lxml/apihelpers.pxi", line 67, in lxml.etree._rootNodeOrRaise (src/lxml/lxml.etree.c:15220)
ValueError: Input object has no element: HtmlComment

Is there a way to skip all HTML comments? Also, what does this error actually mean?

Any help will be appreciated.

Thanks

John
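
One way to skip the comments, offered as a minimal sketch rather than a confirmed fix: the traceback suggests that itertext() was called on an HtmlComment node that came back among the children. Comments and processing instructions carry a non-string .tag, so they can be filtered out before iterating. The helper name iter_child_text and the sample markup are made up for illustration.

```python
from lxml import html

# Sample markup (hypothetical): a comment sits next to two paragraphs.
doc = html.fromstring(
    "<div><!-- note --><p>Hello <b>world</b></p><p>again</p></div>"
)

def iter_child_text(parent):
    """Yield text chunks from each child element, skipping comments and PIs."""
    for child in parent:
        # Real elements have a string tag; comments and PIs do not.
        if not isinstance(child.tag, str):
            continue
        for chunk in child.itertext():
            yield chunk

chunks = list(iter_child_text(doc))
print(chunks)
```

The check on .tag is used here instead of a class test so that the same loop also works for trees built with lxml.etree.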

John Munroe | 24 Jun 10:10 2015

Build error: No such file or directory: 'src/lxml/lxml.etree.c'

Hi,

I've grabbed 3.5.0beta1 from GitHub and tried building it. I'm on OS X and have libxml2 2.9.2 rather than
libxml2 2.9.1, so I'm using the following command to build:

python setup.py build --static-deps --libxml2-version=2.9.2 --without-cython

but I keep getting an error saying

clang: error: no such file or directory: 'src/lxml/lxml.etree.c'
clang: error: no input files

Indeed, the C file doesn't exist and isn't part of the distribution.

Am I missing something? I'd like to have it installed in a virtualenv (eventually).

Any help will be appreciated.

Thanks

John

_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml <at> lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml
Stefan Behnel | 21 Jun 13:41 2015

Re: Can I change maxvars?

Sam Bull wrote on 21.06.2015 at 12:12:
> On Mon, 2014-07-14 at 20:52 +0200, Stefan Behnel wrote:
>> Sam Bull, 12.07.2014 17:15:
>>> I'm trying to process some XML files, and a few of them are several
>>> thousand lines long, and with the moderately complicated XSL I'm using,
>>> I seem to be hitting recursion limits.
>>>
>>> I'm currently getting this message:
>>>         lxml.etree.XSLTApplyError: xsltApplyXSLTTemplate: A potential
>>>         infinite template recursion was detected.
>>>         You can adjust maxTemplateVars (--maxvars) in order to raise the
>>>         maximum number of variables/params (currently set to 15000).
>>>
>>> It says I can adjust the value, but doesn't explain how, nor is this
>>> value mentioned anywhere in the documentation.
>>>
>>> I've just had to change the maxdepth, which can be done with
>>> XSLT.set_global_max_depth(), but there doesn't appear to be an
>>> equivalent for maxvars. How can I change this value?
>>
>> You can't currently. The problem is, it was new in libxslt 1.1.27, and even
>> the next lxml release will still support everything back to 1.1.23, so this
>> needs a little C level hacking to support depending on the libxslt version
>> it compiles against.
>>
>> The upside is that libxslt 1.1.27 also introduced a per-context setting
>> (maxTemplateVars), i.e. you can define the value for each stylesheet run
>> rather than setting a global value. A new keyword argument for XSLT()
>> should work nicely here, e.g. "max_recursion_vars". The same applies to
>> "maxTemplateDepth" in 1.1.27, which could be set as "max_recursion_depth"

Dionyz Lazar | 16 Jun 12:24 2015

Iterparse memory problem

Hello, 

I have been using lxml (3.4.3) to parse XML feeds from vendors. For example, here is one of the smaller files, which should be publicly available: http://www.eberry.cz/editor/image/eshop_products/feed_seznam_jyxo.xml

I am using urllib3 to get the response, which should be a file-like object that I pass straight to iterparse(). This works well memory-wise, since the whole file never has to be loaded into memory (some files can be huge).

I am interested only in SHOPITEM elements, and I clear() each element once I am done with it. I then tried the tag argument of iterparse() to receive only the events for this element. When I do that, memory usage spikes and it looks as if the whole file is being held in memory.

Any ideas on what could cause this behavior? 

Regards,
Dio
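
A commonly used pattern, sketched here on a tiny in-memory feed: clear() alone leaves the emptied elements attached to the root, so already-processed siblings are also deleted from the parent. Whether this addresses the spike seen with the tag argument is an assumption. The NAME child, the sample data and the function name collect_names are invented for the example.

```python
import io
from lxml import etree

def collect_names(source, tag="SHOPITEM"):
    """Stream over `tag` elements and free each one after use."""
    names = []
    for _event, elem in etree.iterparse(source, events=("end",), tag=tag):
        # NAME is a hypothetical child element used only for this sketch.
        names.append(elem.findtext("NAME"))
        # Drop the element's own content ...
        elem.clear()
        # ... and remove processed siblings that are still attached to the parent.
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    return names

feed = io.BytesIO(
    b"<SHOP>"
    b"<SHOPITEM><NAME>a</NAME></SHOPITEM>"
    b"<SHOPITEM><NAME>b</NAME></SHOPITEM>"
    b"</SHOP>"
)
names = collect_names(feed)
print(names)
```

In real use the source would be the file-like response object instead of the BytesIO stand-in.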




Paul Keating | 9 Jun 12:55 2015

How to set up a Soap envelope

My web services people want me to enclose an xml message in the following envelope:

 

<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Header/>
  <soapenv:Body>
    … (actual message goes here) …
  </soapenv:Body>
</soapenv:Envelope>

 

I don’t know how to express this in lxml calls, because I’m a total XML novice who understands nothing about namespaces. Pointers would be welcome.

 

 

Regards

 

P
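
A minimal sketch of how such an envelope might be built with lxml.etree. In lxml, a namespaced tag is written as "{namespace-uri}localname", and the nsmap argument chooses the prefix used on output. The <message> payload is a placeholder, not part of the original request.

```python
from lxml import etree

SOAPENV = "http://schemas.xmlsoap.org/soap/envelope/"
NSMAP = {"soapenv": SOAPENV}  # prefix -> namespace URI

# Tags are given in "{uri}name" form; the prefix comes from NSMAP.
envelope = etree.Element("{%s}Envelope" % SOAPENV, nsmap=NSMAP)
etree.SubElement(envelope, "{%s}Header" % SOAPENV)
body = etree.SubElement(envelope, "{%s}Body" % SOAPENV)

# Placeholder payload; the actual message would be appended here.
body.append(etree.fromstring("<message>hello</message>"))

print(etree.tostring(envelope, pretty_print=True).decode())
```

The child elements inherit the soapenv prefix from the envelope, so the namespace only needs to be declared once on the root.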

 



Frederik Elwert | 9 Jun 12:06 2015

xmlfile and namespaces/pretty printing

Hello,

I want to write a very large XML file to disc. Since I ran into memory
issues using the regular ElementTree.write() method, I switched to using
etree.xmlfile. Generally, it works quite well, but I ran into two
issues. Here’s my test code:

----8<----

from lxml import etree

P_DATA = '{http://www.dspin.de/data}'
P_TEXT = '{http://www.dspin.de/data/textcorpus}'

with etree.xmlfile('test.xml', encoding='utf-8') as xf:
    with xf.element(P_DATA + 'D-Spin',
                    nsmap={None: 'http://www.dspin.de/data'}):
        with xf.element(P_TEXT + 'TextCorpus',
                lang='de',
                nsmap={None: 'http://www.dspin.de/data/textcorpus'}):
            element = etree.Element(P_TEXT + 'tokens',
                    nsmap={None: 'http://www.dspin.de/data/textcorpus'})
            element2 = etree.SubElement(element, P_TEXT + 'token')
            xf.write(element, pretty_print=True)

---->8----

And here’s the output:

----8<----
<D-Spin xmlns="http://www.dspin.de/data"><TextCorpus
xmlns="http://www.dspin.de/data/textcorpus" lang="de"><tokens
xmlns="http://www.dspin.de/data/textcorpus">
  <token/>
</tokens>
</TextCorpus></D-Spin>
---->8----

Now my questions are:

1. I had to add an nsmap argument to the creation of "element" in order
to prevent an "ns0:" prefix in the output. But this led to a
duplication of the declaration of the default namespace
'http://www.dspin.de/data/textcorpus' on both <TextCorpus> and <tokens>.

Since the generation of the Elements that I write to the xmlfile happens
somewhere else in the real code, it is a bit cumbersome to add nsmaps
all over the place. And even then, I have the duplicated namespace
declaration. So ideally I’d like xf.write() to be aware of the current
namespace map defined by the xf.element. Is that possible?

2. I can pass "pretty_print=True" to xf.write(), but it naturally only
affects those sub-trees. Is it possible to pretty-print the elements
generated by xf.element() as well? Maybe it would be nice to be able to
pass pretty_print to etree.xmlfile() itself?

Regards,
Frederik

-- 
Dr. Frederik Elwert

Project Manager/SeNeReKo
Postdoctoral Researcher/KHK
Centre for Religious Studies
Ruhr-University Bochum

Universitätsstr. 150
D-44780 Bochum
	
Phone +49(0)234 32-23024
Felix Fontein | 23 May 14:42 2015

Problem when writing source tag in HTML5 video tag

Hi everyone!

I'm using lxml.html to read and write an HTML5 file. I noticed that if
the HTML5 file contains a <video> element which contains a <source>
element, a </source> is generated when writing the file back, which
shouldn't be there.

Minimal example:

---------8<---------8<---------8<---------8<---------8<---------

import lxml.html

data = """<!DOCTYPE html>
<html>
  <head>
    <title>1</title>
  </head>
  <body>
    <video>
      <source src="1.ogv" type="video/ogg">
    </video>
  </body>
</html>"""

parser = lxml.html.HTMLParser()
doc = lxml.html.document_fromstring(data, parser)
data = b'<!DOCTYPE html>\n' + lxml.html.tostring(doc)
result = data.decode('utf-8')

assert result == """<!DOCTYPE html>
<html>
  <head>
    <title>1</title>
  </head>
  <body>
    <video>
      <source src="1.ogv" type="video/ogg">
    </source></video>
  </body>
</html>"""

---------8<---------8<---------8<---------8<---------8<---------

Am I doing something wrong? Or how can I get rid of </source>?

I also tried using lxml's html5parser, but lxml.html.tostring() also
produced a </source> tag (and there was a "html:" namespace in every
tag).

Thanks a lot and best regards,
Felix
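
A possible stopgap, sketched under the assumption that the stray closing tag comes from the serializer not treating <source> as a void element: strip it from the serialized string afterwards. This is plain string post-processing, not an lxml feature, and the shortened markup is only for illustration.

```python
import lxml.html

doc = lxml.html.document_fromstring(
    '<html><body><video>'
    '<source src="1.ogv" type="video/ogg">'
    '</video></body></html>'
)

# Serialize to a text string rather than bytes.
serialized = lxml.html.tostring(doc, encoding="unicode")

# <source> is a void element in HTML5, so a closing tag is never valid
# and can be dropped wherever the serializer emitted one.
cleaned = serialized.replace("</source>", "")
print(cleaned)
```

The replacement is a no-op when the serializer already omits the closing tag, so the same code can stay in place across library versions.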

-- 
Felix Fontein -- felix <at> fontein.de -- https://felix.fontein.de/
prakhar joshi | 23 May 08:53 2015

Parsing iframe leading text error

Hello everyone,
 
 I have been using lxml and need to parse an iframe tag with leading text (http://pastie.org/10203277). When I do so, I get an error (http://pastie.org/10203259). I want the output to be the same HTML as the input. Can anyone help me with this error? I can't find the proper way in the docs. Thanks for your guidance.

Cheers!!

Prakhar Joshi
DA-IICT,Gandhinagar
Thomas Schraitle | 21 May 10:28 2015

Line Number of a Start Tag

Hi,

assume I have the following Python 3 code:

------------------------------
import io
from lxml import etree

source = """<?xml version="1.0"?>
<article version="5.0" xml:lang="en"
         xmlns="http://docbook.org/ns/docbook"
         xmlns:xlink="http://www.w3.org/1999/xlink">
  <title>...</title>
  <para>...</para>
</article>
"""

tree = etree.parse(io.StringIO(source))
root = tree.getroot()
print(root.sourceline)
------------------------------

When I run the above code, I get "4" as a result. This is a bit
unexpected.

It seems that root.sourceline returns the line number where the start tag
_ends_. However, I need the line number where <article> _starts_
(in this example, "2").

How can I get the "starting" line number of a start tag?

Thanks!

-- 
Gruß/Regards,
    Thomas Schraitle
Tom Kralidis | 12 May 20:03 2015

cleanup namespaces and XML elements with QNames


Hi: using lxml 3.3.5, we have XML documents with elements having QName
type values.  We would like to use etree.cleanup_namespaces() but
find that downstream parsers/validators then complain
about undeclared namespace prefixes.  Below is an isolated example:

from lxml import etree

nsmap = {
     'ogc': 'http://www.opengis.net/ogc',
     'ows': 'http://www.opengis.net/ows',
     'gml': 'http://www.opengis.net/gml'
}

root = etree.Element('{http://www.opengis.net/ogc}Filter', nsmap=nsmap)

typename = etree.SubElement(root, '{http://www.opengis.net/ogc}typeName')
typename.text = etree.QName('http://www.opengis.net/gml', 'Envelope')

typename2 = etree.SubElement(root, '{http://www.opengis.net/ogc}typeName')
typename2.text = etree.QName('{http://www.opengis.net/gml}Envelope')

print etree.tostring(root, pretty_print=True)
etree.cleanup_namespaces(root)
print etree.tostring(root, pretty_print=True)

Here we would like the gml namespace declaration, but it looks like
cleanup_namespaces is throwing out namespace declarations even if they
apply to element content.

Are there any workarounds we can use/implement to cleanup unused namespaces
while preserving those for element content per above?

Thanks in advance.

..Tom
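
One possible approach, sketched under the assumption that a newer lxml release than the 3.3.5 mentioned above is available, one whose cleanup_namespaces() accepts a keep_ns_prefixes argument: list the prefixes that occur only in text content so that their declarations survive the cleanup.

```python
from lxml import etree

nsmap = {
    "ogc": "http://www.opengis.net/ogc",
    "ows": "http://www.opengis.net/ows",
    "gml": "http://www.opengis.net/gml",
}

root = etree.Element("{http://www.opengis.net/ogc}Filter", nsmap=nsmap)
typename = etree.SubElement(root, "{http://www.opengis.net/ogc}typeName")
# Assigning a QName makes lxml write the text as "prefix:name".
typename.text = etree.QName("http://www.opengis.net/gml", "Envelope")

# Remove unused declarations, but keep "gml", which is only used in text.
# Assumption: this keyword is not available in lxml 3.3.5.
etree.cleanup_namespaces(root, keep_ns_prefixes=["gml"])

result = etree.tostring(root, pretty_print=True).decode()
print(result)
```

The prefixes to keep have to be supplied by the caller, since lxml cannot tell from the tree alone which text values are meant as QNames.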
Tom Kralidis | 10 May 19:01 2015

setting XML_CATALOG_FILES

Hi: using lxml 3.3.5 our project is implementing XML catalogs.

On initial testing it appears that the XML_CATALOG_FILES environment
variable needs to be set before lxml is imported, i.e.:

# e.g. 1: works
import os
os.environ['XML_CATALOG_FILES'] = '/tmp/catalog.xml'
from lxml import etree
...
# validate XML
schema = etree.XMLSchema(file=myschema)
parser = etree.XMLParser(schema=schema)
doc = etree.fromstring(postdata, parser)

# e.g. 2: does not work
import os
from lxml import etree
...
os.environ['XML_CATALOG_FILES'] = '/tmp/catalog.xml'
...
# validate XML
schema = etree.XMLSchema(file=myschema)
parser = etree.XMLParser(schema=schema)
doc = etree.fromstring(postdata, parser)

Is there any way for lxml to pick up XML_CATALOG_FILES after it has been imported?

Thanks

..Tom
