Jens Tröger | 25 May 19:48 2016

Schema validation fails after replace() call?

Hello,

I'm not quite sure if I'm asking a question, or sharing an observation.

Is it possible that an lxml.etree tree validates before a replace()
call, but not after?  After the call, lxml's validation reports almost
200 copies of the same error:

  <string>:440:0:ERROR:SCHEMASV:SCHEMAV_CVC_IDC: Element 'deviation':
  No match found for key-sequence ['WtC2eoepX'] of keyref 'deviationstyle-refer'.

Looking at the actual XML I can positively confirm that the IDs and
IDREFs exist and are valid, before and after the replace() call.

The new subtree is equivalent to the old one, but there are elements in
the whole tree that refer to elements in the replaced subtree.  I
suspect that this causes the problem.

Interestingly:
 - xmllint validates the new tree when written to a file, and
 - if I serialize the entire tree (including the new replaced subtree)
   and parse it back, it validates.
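
The serialize-and-reparse workaround from the list above could be sketched roughly as follows. The schema and document here are hypothetical toy stand-ins (the real schema with its key/keyref constraints is not shown in this message), and `validate_after_roundtrip` is an illustrative helper name, not part of lxml:

```python
from lxml import etree

# Hypothetical toy schema; the poster's schema is not part of the message.
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="item" type="xs:string" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""


def validate_after_roundtrip(schema, root):
    """Serialize the tree, parse it back, and validate the fresh copy."""
    fresh = etree.fromstring(etree.tostring(root))
    return schema.validate(fresh)


schema = etree.XMLSchema(etree.fromstring(XSD))
root = etree.fromstring(b"<root><item>old</item></root>")

# Swap a subtree for an equivalent one, as described in the report.
replacement = etree.Element("item")
replacement.text = "new"
root.replace(root[0], replacement)

roundtrip_ok = validate_after_roundtrip(schema, root)
print(roundtrip_ok)
```

This only demonstrates the round trip itself; it does not reproduce the keyref failure, which would need the original schema.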

Is this intended behavior, an odd side effect, or a bug?

Thanks!
Jens

-- 
Jens Tröger

Stein Rune Risa | 6 May 20:26 2016

Unable to find element with xpath


I have an XML document that looks like this:


I've previously used lxml for parsing html documents and would like to use it for XML documents as well.

I am interested in finding all the "Lap" elements in the XML file and have written some simple python code:

    from lxml import etree

    # Read as bytes so lxml can honour the XML encoding declaration
    with open('NAME_AND_PATH_OF_XML', 'rb') as sourcefile:
        sourceXML = sourcefile.read()
    root = etree.fromstring(sourceXML)
    laps = root.xpath('//Lap')
    print(len(laps))

For some reason it cannot find "Lap" in the XML. The XML seems to be valid when I open it with a browser.

Any suggestions?
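
One common cause of this symptom is a default namespace on the root element; whether that applies here is an assumption, since the XML itself did not survive in this message. A minimal sketch with a made-up document and a made-up namespace URI shows the effect and two possible queries:

```python
from lxml import etree

# Hypothetical sample; the namespace URI and element names are invented.
xml = b"""<doc xmlns="http://example.com/ns">
  <Lap/>
  <Lap/>
</doc>"""

root = etree.fromstring(xml)

# An unprefixed name in XPath 1.0 only matches elements in no namespace.
unqualified = root.xpath('//Lap')

# Bind a prefix of your choice to the document's namespace URI.
ns = {'t': 'http://example.com/ns'}
qualified = root.xpath('//t:Lap', namespaces=ns)

# Alternatively, ignore the namespace and match on the local name.
by_local_name = root.xpath("//*[local-name() = 'Lap']")

print(len(unqualified), len(qualified), len(by_local_name))
```

Checking `root.tag` or `root.nsmap` on the real document would show whether a namespace is involved.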

Best regards
Ziggy999
_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml <at> lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml

Question regarding splitting documents

Hi folks,

I'm attaching a small sample program. My intent is to split the HTML snippet into smaller html documents using the <h4> tags as the splitting points.

Any clues?

Thanks, /PA

--
Questions are not there to be answered,
questions are there to be asked.
Georg Kreisler
Attachment (sample1.py): text/x-python, 1766 bytes
_________________________________________________________________
Heiko Nardmann | 14 Apr 16:24 2016

Q: lxml Windows wheels

I have some trouble using the offered Windows wheels; the following is
what I get from pip:

    Could not find a version that satisfies the requirement lxml (from versions: )

So I had a look at the WHEEL file inside the ZIP
lxml-3.6.0-cp32-none-win32.whl to see which requirements are stated inside.

Is it okay that 'Tag' is set to 'cp27-none-linux_x86_64' inside that
file? Might this be the reason why no version can be found? Maybe
someone can enlighten me with respect to wheels?

I have to admit that I'm completely new to wheels and quite new to
Python packaging but I wouldn't expect metadata with "Linux" inside a
Windows wheel?
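
To compare the tags in the file name with the ones recorded in the WHEEL metadata, something like the following sketch might help. It relies on the wheel file naming convention (the last three dash-separated fields are the Python, ABI and platform tags); the helper names are hypothetical, and the in-memory archive merely stands in for the real download:

```python
import io
import zipfile


def filename_tags(filename):
    """Return 'python-abi-platform' taken from a wheel file name."""
    parts = filename[:-len(".whl")].split("-")
    return "-".join(parts[-3:])


def metadata_tags(wheel_file):
    """Return every 'Tag:' value from the WHEEL file inside the archive."""
    with zipfile.ZipFile(wheel_file) as archive:
        name = next(n for n in archive.namelist()
                    if n.endswith(".dist-info/WHEEL"))
        text = archive.read(name).decode("utf-8")
    return [line.split(":", 1)[1].strip()
            for line in text.splitlines() if line.startswith("Tag:")]


# Stand-in archive with the metadata described in the message above.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("lxml-3.6.0.dist-info/WHEEL",
                     "Wheel-Version: 1.0\nTag: cp27-none-linux_x86_64\n")
buffer.seek(0)

expected = filename_tags("lxml-3.6.0-cp32-none-win32.whl")
found = metadata_tags(buffer)
print(expected, found, expected in found)
```

Note that a `cp32` tag in the file name means CPython 3.2, so such a wheel would not be offered to any other interpreter version in the first place.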

Thx in advance!

Kind regards,

  Heiko Nardmann

_________________________________________________________________

soupparser

Hello everyone!

I'm currently working on a project that involves scraping and parsing intranet resources. Until now I have used BeautifulSoup, but I decided to shave off every millisecond possible, hence the switch to lxml. Still, handling most of the encoding issues would be a pain, so naturally I wanted to benefit from bs4's encoding capabilities. To do so I have to:

    >>> from lxml.html.soupparser import fromstring

But I am presented with an ImportError: the module BeautifulSoup is not found.

Of course that's super easy to fix, but my question is: am I missing something? What's the bigger picture here, so to speak, or is it just a bug?

On my Window$ computer at the office the module is named bs4, same for the Linux and BSD computers at home.

Obviously, I am using BeautifulSoup4 (4.4.1 to be exact).
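
BeautifulSoup 3 is imported as `BeautifulSoup`, while BeautifulSoup 4 is imported as `bs4`, so it can help to check which of the two names the current interpreter can actually see. A small standard-library sketch (the function name is made up for illustration):

```python
import importlib.util


def available_soup_modules():
    """List which BeautifulSoup import names this interpreter can find."""
    candidates = ("bs4", "BeautifulSoup")
    return [name for name in candidates
            if importlib.util.find_spec(name) is not None]


print(available_soup_modules())
```

Running this with the same interpreter that raises the ImportError shows whether `bs4` is visible to it at all.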

Thank you!
Nikola
_________________________________________________________________
Rainer Hoerbe | 31 Mar 23:42 2016

cannot load Schematron stylesheet: "<string>:0:0:ERROR:XSLT:ERR_OK: unknown error"

I am able to reproduce an erroneous behavior with Python 2.7 and 3.4 + lxml 3.6.0. Processing is correct when
using xsltproc directly:

    import lxml.etree as ET
    sf = 'rules/schtron/rule04E.xsl'
    xslt = ET.fromstring(open(sf).read())
    transform = ET.XSLT(xslt)
    # at this point transform.error_log.last_error contains "<string>:0:0:ERROR:XSLT:ERR_OK: unknown error"

The stylesheet is:
https://github.com/rhoerbe/saml_schematron/blob/master/rules/schtron/rule04E.xsl
and has been generated from this schematron:
https://github.com/rhoerbe/saml_schematron/blob/master/rules/schtron/rule04E.sch

The code works with other (simple?) stylesheets. Are there any restrictions in lxml that prevent the
processing of ISO-schematron stylesheets? Or is this a bug that should be filed?
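
For anyone trying to narrow this down, a sketch of how all entries of the transformer's error log can be dumped, not just the last one. The stylesheet below is a trivial placeholder, not the Schematron-generated one from the links above:

```python
from lxml import etree

# Placeholder stylesheet; substitute the real one to investigate.
STYLESHEET = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <out><xsl:value-of select="name(*)"/></out>
  </xsl:template>
</xsl:stylesheet>"""

transform = etree.XSLT(etree.fromstring(STYLESHEET))
result = transform(etree.fromstring(b"<doc/>"))

# Each log entry carries its domain, level, type and message.
log_lines = [
    "%s:%s:%s: %s" % (entry.domain_name, entry.level_name,
                      entry.type_name, entry.message)
    for entry in transform.error_log
]

print(str(result))
print(log_lines)
```

Comparing the log before and after running the transform may show whether the entry is produced at compile time or at run time.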

Regards,
Rainer
_________________________________________________________________
Stefan Behnel | 17 Mar 15:29 2016

lxml 3.6.0 released

Hi all,

I just released lxml 3.6.0. All changes in this release came from pull
requests. Thanks to everyone who contributed. The main change is support
for PyPy 5.0, which is also the minimum supported version now. This should
generally improve the stability and performance under PyPy.

The documentation is here: http://lxml.de/

Download:  http://lxml.de/files/lxml-3.6.0.tgz

Signature: http://lxml.de/files/lxml-3.6.0.tgz.asc

Changelog: http://lxml.de/3.6/changes-3.6.0.html

Github:
https://github.com/lxml/lxml/commit/79637c7c245f6439cc4b85817b12826483e3e8d5

This release was built using Cython 0.23.4.

If you are interested in commercial support or customisations for the lxml
package, please contact me directly.

Have fun,

Stefan

3.6.0 (2016-03-17)
==================

Features added
--------------

* GH#187: Now supports (only) version 5.x and later of PyPy.
  Patch by Armin Rigo.

* GH#181: Direct support for `.rnc` files in `RelaxNG()` if `rnc2rng`
  is installed.  Patch by Dirkjan Ochtman.

Bugs fixed
----------

* GH#189: Static builds honour FTP proxy configurations when downloading
  the external libs.  Patch by Youhei Sakurai.

* GH#186: Soupparser failed to process entities in Python 3.x.
  Patch by Duncan Morris.

* GH#185: Rare encoding related `TypeError` on import was fixed.
  Patch by Petr Demin.
_________________________________________________________________
Oliver Bestwalter | 16 Mar 15:07 2016

Undocumented or unwanted change in keyword parameter?

Hello,

I don't want to report a bug (yet), as I don't know what this should be reported as - but it's definitely a problem.

My info on the system where I reproduced it:
Python              : sys.version_info(major=3, minor=4, micro=3, releaselevel='final', serial=0)
lxml.etree          : (3, 5, 0, 0)
libxml used         : (2, 9, 1)
libxml compiled     : (2, 9, 1)
libxslt used        : (1, 1, 28)
libxslt compiled    : (1, 1, 28)

In lxml==3.2.1 under Python 2.7 I instantiated it successfully like this: XMLParser([...], XMLSchema_schema=None, [...]) (note the underscore)

In lxml==3.5.0 under Python 3.4 this started failing with the error:

Traceback (most recent call last):
  File "/path/to/my/script.py", line 7, in <module>
    XMLSchema_schema=xmlSchema)
  File "src/lxml/parser.pxi", line 1437, in lxml.etree.XMLParser.__init__ (src/lxml/lxml.etree.c:120522)
TypeError: __init__() got an unexpected keyword argument 'XMLSchema_schema'

The code looks like this:

from lxml import etree

xmlSchema = etree.XMLSchema(file='/path/to/some/schema.xsd')
parser = etree.XMLParser(
    remove_blank_text=True, attribute_defaults=True,
    XMLSchema_schema=xmlSchema)


It can be fixed by just using "schema" instead of "XMLSchema_schema" as keyword argument:

 XMLParser([...], schema=xmlSchema, [...])

---

I had a look at the sources of both versions.

The affected source line in file src/lxml/lxml.etree.pyx:1437 looks the same in both versions:

def __init__([...], XMLSchema schema=None, [...]):

so I guess that somewhere between 3.2.1 and 3.5.0 the type hint stopped being baked into the keyword argument name. My question would be: is this a bug in the documentation, which fails to identify the keyword parameter as "schema", or is it an unwanted change to a parameter that should have remained "XMLSchema_schema" and accidentally turned into "schema"?
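
A self-contained sketch of the working `schema=` spelling, using an invented inline schema instead of a file on disk so that it can be run as-is:

```python
from lxml import etree

# Hypothetical inline schema standing in for /path/to/some/schema.xsd.
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="item" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

xmlSchema = etree.XMLSchema(etree.fromstring(XSD))
parser = etree.XMLParser(
    remove_blank_text=True, attribute_defaults=True,
    schema=xmlSchema)

# A valid document parses normally.
root = etree.fromstring(b"<root><item>a</item></root>", parser)

# An invalid document is rejected while parsing.
try:
    etree.fromstring(b"<root><bogus/></root>", parser)
    rejected = False
except etree.XMLSyntaxError:
    rejected = True

print(root.tag, rejected)
```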

cheers
Oliver
_________________________________________________________________
Markus Schöpflin | 16 Mar 13:14 2016

XML Schema API?

I'm using XPath queries on XML schema documents to extract certain information 
from a given schema document. This works well using lxml.

This approach falls short when the XML schema document uses xs:include. Now I 
stumbled across the XML Schema API (see 
https://www.w3.org/Submission/xmlschema-api/). Using the XML Schema API likely 
would solve all issues regarding include statements.

Is it possible to use the XML Schema API with lxml?
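
Independent of whether the XML Schema API is available, the include problem can be worked around by following xs:include by hand and running the XPath queries over every document found. A rough sketch, assuming schemaLocation values are relative file paths; the helper names and the two demo schemas are made up:

```python
import os
import tempfile

from lxml import etree

XS = {"xs": "http://www.w3.org/2001/XMLSchema"}


def load_with_includes(path, seen=None):
    """Parse a schema file plus everything it pulls in via xs:include."""
    seen = set() if seen is None else seen
    path = os.path.abspath(path)
    if path in seen:  # guard against circular includes
        return []
    seen.add(path)
    tree = etree.parse(path)
    trees = [tree]
    for location in tree.xpath("/xs:schema/xs:include/@schemaLocation",
                               namespaces=XS):
        target = os.path.join(os.path.dirname(path), location)
        trees.extend(load_with_includes(target, seen))
    return trees


def global_element_names(path):
    """Example query: names of all global elements across the includes."""
    names = []
    for tree in load_with_includes(path):
        names.extend(tree.xpath("/xs:schema/xs:element/@name",
                                namespaces=XS))
    return names


HEAD = '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'
with tempfile.TemporaryDirectory() as folder:
    main = os.path.join(folder, "main.xsd")
    with open(main, "w") as f:
        f.write(HEAD + '<xs:include schemaLocation="part.xsd"/>'
                '<xs:element name="a" type="xs:string"/></xs:schema>')
    with open(os.path.join(folder, "part.xsd"), "w") as f:
        f.write(HEAD + '<xs:element name="b" type="xs:string"/></xs:schema>')
    names = global_element_names(main)

print(names)
```

xs:import and xs:redefine are not handled here; they would need similar treatment.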

TIA, Markus

_________________________________________________________________
Christoph | 8 Mar 09:16 2016

lxml - pretty-print (?) - multiple consecutive key pairs per line ?

Hiya,
excuse the highly noobistic nature of my question... I'm not a programmer, but I can write some basic python, but lxml is all new to me.
In essence, I need to build an xml file to a very specific format. Unfortunately, that target format utilizes multiple keys (? semantics...) per line, without a line break, like:
            <key>Track ID</key><integer>971</integer>
            <key>Name</key><string>Citadel Station</string>
            <key>Artist</key><string>Disrupt</string>

I found the syntax to nest multiple keys, and to white-space them properly for .xml file usage, but I have not found a way to have multiple consecutive keys per line (without a line break / whitespace in between them).

What I have (python bits to reproduce them below):
----------------------------------------
      <key>some value3</key>
      <string>some value5</string>
----------------------------------------

What I try to achieve:
----------------------------------------
      <key>some value3</key><string>some value5</string>
----------------------------------------

If anyone has any pointers, that would be grand.
Cheerio & all the best!
c.



-----------------------------------------------------
import lxml.etree
import lxml.builder

E = lxml.builder.ElementMaker()

#dict & key syntax
DICT = E.dict
KEY = E.key

TrackID = E.key
Name = E.key
Artist = E.key

INTGR = E.integer
STRNG = E.string
DTE = E.date

the_XML = DICT(
                DICT(
                    KEY('2'),
                    DICT(
                        TrackID('some value1', INTGR('some value3')), #nested trees/keys
                        Name('some value2'), STRNG('some value4'),  #standard behaviour
                        Artist('some value3'), STRNG('some value5'), # how to print this in a single line: <key>some value3</key><string>some value5</string>
                        ),
                    KEY('4'),
                    )
        )

print(lxml.etree.tostring(the_XML, pretty_print=True, encoding='unicode'))
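
One possible way to get each key and its value onto the same line is to control the whitespace by hand through the `.text` and `.tail` attributes and to serialize without `pretty_print`. A minimal sketch; `build_dict` and its argument layout are invented for illustration:

```python
from lxml import etree


def build_dict(entries, indent="    "):
    """Build a <dict> whose key/value pairs each share one output line."""
    node = etree.Element("dict")
    last = None
    for name, value_tag, value in entries:
        key = etree.SubElement(node, "key")
        key.text = name            # no tail: the value follows directly
        val = etree.SubElement(node, value_tag)
        val.text = value
        val.tail = "\n" + indent   # line break only after the value
        last = val
    if last is not None:
        node.text = "\n" + indent  # break after the opening <dict>
        last.tail = "\n"           # closing </dict> on its own line
    return node


track = build_dict([
    ("Track ID", "integer", "971"),
    ("Name", "string", "Citadel Station"),
    ("Artist", "string", "Disrupt"),
])
xml_text = etree.tostring(track, encoding="unicode")
print(xml_text)
```

Nested dicts would need a larger indent string per level, passed in by the caller.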
_________________________________________________________________
Simon Sapin | 25 Feb 18:12 2016

Building an lxml tree from C / Rust with html5ever

Hi all,

How would you recommend creating a lxml tree from C code? Or rather from 
Rust code, given that Rust code can call C functions and manipulate C 
structs, and Rust functions can be made to have a C-compatible ABI and 
be called from C. (I don’t mind adding some Cython code to the mix if it 
makes things easier.)

I’ve seen http://lxml.de/capi.html and etreepublic.pxd, but it’s not 
clear to me if that API is supposed to be complete or if I should use 
libxml2’s API as well.

Also, strings. Rust strings are UTF-8, and libxml2 seems to also be 
using UTF-8 internally. I’d rather not have everything go through Python 
strings and back, if possible.

Any advice or links to docs or example code is appreciated.

## Background

When parsing HTML in Python, there’s among others html5lib which is 
standards-compliant but relatively slow, and lxml.html which is very 
fast but uses libxml2’s old HTML 4 parser.

Could we have the best of both? html5ever is a parser written for Servo 
per WHATWG’s HTML Standard. It’s written in Rust with performance in 
mind (e.g. by avoiding string copies as much as possible). The parser is 
separated from tree builders and tree data structures.

https://github.com/servo/html5ever
https://servo.org/

I’ve played a bit with using html5ever from Python through CFFI and 
writing tree builders in Python. Performance is somewhere in between 
html5lib and lxml.html. Even after some optimization, cProfile shows 
that more than half the time is spent in the tree builder, creating 
Python objects and going back and forth between languages.

https://github.com/SimonSapin/html5ever-python
https://github.com/SimonSapin/html5ever-python/tree/master/benchmarks

I’m guessing that part of the reason lxml.html is so fast is that it 
doesn’t create a Python object for each node during parsing, only 
on-demand during traversal. Perhaps a better benchmark would include a 
tree traversal in addition to parsing.

I think the same approach could be good for html5ever-python: create 
libxml2 nodes in Rust / C / Cython without involving much Python code or 
many Python objects, and then create a lxml.etree.ElementTree object at 
the end.

Thanks,
-- 
Simon Sapin
_________________________________________________________________