Markus Schöpflin | 10 Feb 12:29 2016

Using xinclude and XMLSchema together

I'm trying to use xinclude to construct a schema which can then be parsed by 
XMLSchema.

In other words, the following Python script should print an XML document representing a 
valid XML schema and successfully create an XML schema object:

--- include.py ---
from lxml import etree
tree = etree.parse("s1.xsd")
tree.xinclude()
print etree.tostring(tree)
schema = etree.XMLSchema(tree)
--- include.py ---

Given the following two XML schema documents, this works as expected:

--- s1.xsd ---
<?xml version="1.0"?>
<xs:schema
     xmlns:xs="http://www.w3.org/2001/XMLSchema"
     xmlns:xi="http://www.w3.org/2001/XInclude">
   <xi:include href="s2.xsd" xpointer="/1/1"/>
</xs:schema>
--- s1.xsd ---

--- s2.xsd ---
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:element name="e1">
     <xs:simpleType>
[...]

John Hagen | 7 Feb 01:01 2016

lxml Windows Wheels

It's been discussed before, but just wanted to ask if there had been any progress getting wheels developed for Windows.  It's not possible to use lxml in virtual environments on Windows, and when lxml is a dependency of another package, the user is greeted with a cryptic pip install error.

The cryptography package is a good example of wheel distribution I've seen: https://pypi.python.org/pypi/cryptography

Thanks,
John
_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml <at> lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml
Austin Platt | 25 Jan 15:52 2016

Lxml fails to parse httpbin.org example utf-8 page

Hello all,

I was doing some tests with lxml and decided to try it out on the test response pages of httpbin.org.

Lxml fails to 'correctly' parse the example UTF-8 page supplied by httpbin.org. The page can be found here: http://httpbin.org/encoding/utf8.

Here is a reproduction of the case:

    > import requests
    > r = requests.get("http://httpbin.org/encoding/utf8")
    > html = r.text
    > print(html)
    [...]

    > from lxml import etree
    > etree_parser = etree.HTMLParser(encoding='utf-8')
    > tree = etree.fromstring(html, parser=etree_parser)
    > new_html = etree.tostring(tree, method='html', encoding='utf-8')
    > print(new_html)
    [...]

The new_html is truncated after a `<` character in the `pre` tag of the original response. I presume this is because lxml attempts to interpret the `<` character as the start of an html tag.

Does lxml have any heuristics for deciding whether to interpret a lone `<` character as a text character rather than as the start of an HTML tag?

Cheers
Austin




Martin Mueller | 23 Jan 15:00 2016

remove an element without removing its tail?

I found a solution that works for this example:

float.tail = q.tail

Is that what others would do?  I still don't know what kind of animal
"tail" is. It's a string. I sort of understand that an element is a list.
So an XML document is a list with nested lists. That's easy to get. It's
harder to see how the string of tail is globbed on to a particular list.
Some element.text is the value of a list item. But how does the string
"tail" know where it belongs?

Of course, as long as it works, I don't really need to know...
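
On the question of where "tail" belongs, a minimal sketch may help. It assumes a made-up one-line document (not the TEI example) and only shows that lxml stores the text following an element's end tag on that element itself, as its .tail attribute, which is why removing the element takes the tail along.

```python
from lxml import etree

# Hypothetical sample document, chosen only to show the three text slots.
p = etree.fromstring("<p>before <hi>inside</hi> after</p>")
hi = p[0]

# Text before the first child is the parent's .text.
print(repr(p.text))
# Text inside the child is the child's .text.
print(repr(hi.text))
# Text after the child's end tag is stored on the child as .tail.
print(repr(hi.tail))

# The tail travels with its element, so removing <hi> also drops " after".
p.remove(hi)
print(etree.tostring(p))
```

So the string does not need to "know" where it belongs: it is simply an attribute of the element it follows.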

Martin Mueller | 23 Jan 00:18 2016

remove an element without removing its tail?



I am struggling with a tail problem. Here is the example;

<sp>
    <speaker>Rainoldes.</speaker>
    <p>You may learne the reason hereof in your <hi>Por‑tesse,</hi>
reformed lately by the Pope. In your olde <note n="c" place="margin">
        <hi>Portiforium seu breuiarium, ad vsum ecclesiae Sarum: in festo
S. Thomae Can‑••ariensis.</hi>
    </note>
        <hi>Portesse</hi> there was this prayer to the Popes martyr,
<hi>S. Thomas Bec‑ket of Canterbury:</hi>
        <q>
            <floatingText>
                <body>
                    <div type="version">
                        <l>Christe Iesu,</l>
                        <l>per Thomae vulnera,</l>
                        <l>Quae nos ligant.</l>
                        <l>relaxa scelera.</l>
                    </div>
                    <div type="version">
                        <l>By Thomas woundes,</l>
                        <l>O Christ Iesus,</l>
                        <l>Loose thou the sinnes▪</l>
                        <l>which do binde vs.</l>
                    </div>
                </body>
            </floatingText>
        </q>
        Or, if you will haue better ryme, with as bad reason:
        <q>
            <pb n="480" facs="tcp:15991:239"/>
            <floatingText>
                <body>
                    <div type="version">
                        <l>Tu per Thomae sanguinem</l>
                        <l>quem pro te impendit,</l>
                        <l>Fac nos Christe scandere</l>
                        <l>quo Thomas a•cendit.</l>
                    </div>
                    <div type="version">
                        <l>By the blood of Thomas</l>
                        <l>which he for thee did spend,</l>
                        <l>Make vs O Christ to clime</l>
                        <l>whether he did a••end.</l>
                    </div>
                </body>
            </floatingText>
        </q>
        <q>
            <l>Mary had a little lamb</l>
        </q>
    </p>
</sp>


In this example (and many others from a larger collection) I want to strip
<q> as unnecessary wrappers. In this particular case, I could use
strip_tags. But this wouldn't work in cases where q does not wrap a
floatingText element. In many cases, it would be simple to do something
like 

q.addprevious(floatingText)
parent = q.getparent()
parent.remove(q)

But this code removes the tail of q, which I want to keep. How do I remove
the empty q element without removing the tail as well?

MM
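
One way to approach this, sketched under assumptions: before removing the element, hand its tail to whatever precedes it (the previous sibling's tail, or the parent's text if there is no previous sibling). The helper name remove_keep_tail and the shortened sample document are hypothetical; getparent(), getprevious(), addprevious() and remove() are regular lxml element methods.

```python
from lxml import etree

def remove_keep_tail(element):
    """Remove element from its parent but leave its tail text in place."""
    parent = element.getparent()
    if parent is None:
        raise ValueError("element has no parent")
    if element.tail:
        previous = element.getprevious()
        if previous is not None:
            # Attach the tail to the preceding sibling.
            previous.tail = (previous.tail or "") + element.tail
        else:
            # No preceding sibling: the tail becomes part of the parent's text.
            parent.text = (parent.text or "") + element.tail
    parent.remove(element)

# Shortened, hypothetical stand-in for the <sp> example above.
p = etree.fromstring("<p><q><floatingText/></q> Or, if you will<q><l/></q></p>")
q = p[0]
floating = q[0]
q.addprevious(floating)   # move the child out, in front of q
remove_keep_tail(q)       # q is now empty; drop it, keep its tail
print(etree.tostring(p))
```

In this sample the tail " Or, if you will" ends up as the tail of floatingText, so the running text stays where it was.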

Martin Mueller | 22 Jan 16:54 2016

Null character problems

In running an lxml script on some 3,000 TEI files, about 2,000 ended up with a string of null characters (\x00) after the closing </TEI> element. They generate the error message "Content is not allowed in trailing section." The problem doesn't seem to be random. While I can't see the difference between the input files (all of them validate with Jing), files that don't generate the \x00 characters the first time also don't do it the second time, but files that do it the first time do it the second time.

I can remove the offending character manually, but that doesn't scale to 3,000 files.

Is there a way of checking for the characters before or after the final serialization and getting rid of them?

Martin Mueller
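
As a stopgap that scales, the serialized bytes can be checked and trimmed in plain Python. The sketch below is an assumption about what "getting rid of them" could look like; the function name strip_trailing_nulls is hypothetical, and it treats only the symptom, not whatever produces the NUL bytes.

```python
def strip_trailing_nulls(data):
    """Return data without NUL bytes in its trailing section.

    Only the final run of NULs and whitespace is touched; input without
    trailing NULs is returned unchanged.
    """
    body = data.rstrip(b"\x00 \t\r\n")
    if b"\x00" not in data[len(body):]:
        return data              # nothing to fix
    return body + b"\n"          # end the file with a single newline

sample = b"<TEI><text/></TEI>\n\x00\x00\x00"
print(strip_trailing_nulls(sample))
```

Applied to each file's contents read in binary mode, this would also report which files are affected, since the result differs from the input only for those.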

Wietse Jacobs | 21 Jan 12:18 2016

from lxml cimport etree

Hello,

I'm experimenting with cython and lxml. I would like to "cimport" lxml in my own project (parsing XBRL files), but when I build my extension I get the following error:

----------------------------------------------------------------------------------------------------
cyxbrl\cy_xbrl.pyx:5:0: 'lxml.pxd' not found

Error compiling Cython file:
------------------------------------------------------------
...
# encoding: utf-8

import os
from lxml cimport etree
^
------------------------------------------------------------

cyxbrl\cy_xbrl.pyx:5:0: 'lxml\etree.pxd' not found
Traceback (most recent call last):
  File ".\cysetup.py", line 24, in <module>
    setup(ext_modules=cythonize("cyxbrl/*.pyx"))
  File "C:\Users\Wietse\projects\pyenv35\cython\lib\site-packages\Cython\Build\Dependencies.py", line 877, in cythonize
    cythonize_one(*args)
  File "C:\Users\Wietse\projects\pyenv35\cython\lib\site-packages\Cython\Build\Dependencies.py", line 997, in cythonize_one
    raise CompileError(None, pyx_file)
Cython.Compiler.Errors.CompileError: cyxbrl\cy_xbrl.pyx
----------------------------------------------------------------------------------------------------

I've been able to build lxml on Windows from source - with some hacks in the setup.py and setupinfo.py scripts - with the following command:

python .\setup.py build_ext -i --with-cython --static

(I've disabled the automatic download of libxml and libxslt because I wanted to use my own 64 bit build).

I've set PYTHONPATH to the lxml.src directory. Everything works fine if I use a normal python import.

(This is on Windows 10 with Visual Studio 2015 Community Edition and python 3.5 - 64 bit).

Is there a way to fix this?
Thanks,
Wietse
Martin Mueller | 15 Jan 16:31 2016

a question about odd behaviour of 'break' in an lxml loop with iterchildren and itersiblings

I am puzzled by the following behaviour of lxml.  I want to group element
children with the same tag. Here is a schematic representation of my text:

<sp xml:id="sp-0645">
<speaker xml:id="spk-0645"/>
<ab xml:id="ftln-0645" n="2.1.1"/>
<ab xml:id="ftln-0646" n="2.1.2"/>
<ab xml:id="ftln-0647" n="2.1.3"/>
<ab xml:id="ftln-0648" n="2.1.4"/>
<stage/>
<ab xml:id="ftln-0703" n="2.1.59"/>
<ab xml:id="ftln-0704" n="2.1.60"/>
<stage/>
<ab xml:id="ftln-0705" n="2.1.61"/>
</sp>

I want to end up with something like :

<sp xml:id="sp-0645">
<speaker xml:id="spk-0645"/>
<p>
<lb xml:id="ftln-0645" n="2.1.1"/>
<lb xml:id="ftln-0646" n="2.1.2"/>
<lb xml:id="ftln-0647" n="2.1.3"/>
<lb xml:id="ftln-0648" n="2.1.4"/>
</p>
<stage/>
<p>
<lb xml:id="ftln-0703" n="2.1.59"/>
<lb xml:id="ftln-0704" n="2.1.60"/>
</p>
<stage/>
<p>
<lb xml:id="ftln-0705" n="2.1.61"/>
</p>
</sp>

I try to do this through a combination of iterchildren() and
itersiblings() and use the following script:

from lxml import etree

speech = 'speech.xml'
tree = etree.parse(speech)

for sp in tree.iter('sp'):
    for child in sp.iterchildren():
        if child.tag != 'ab':
            print('child', child.tag)

            # walk the following siblings until the run of <ab> ends
            for sibling in child.itersiblings():
                if sibling.tag == 'ab':
                    print('sibling', sibling.tag)
                else:
                    break

The output of this script suggests that the two loops work properly:
child speaker
sibling ab
sibling ab
sibling ab
sibling ab
child stage
sibling ab
sibling ab
child stage
sibling ab

Now I complicate things by creating a paragraph element with each child
element.  The printout shows that each of the paragraphs exists and that
after the break statement the program returns to the top child loop:

child speaker
sibling ab
sibling ab
sibling ab
sibling ab
else <Element p at 0x1014cb548>
child stage
sibling ab
sibling ab
else <Element p at 0x1014cb508>
child stage
sibling ab
elif <Element p at 0x1014cb608>

But if I now add code to the script that fills the created paragraphs with
ab elements, the program does this correctly for the first iteration but
then it exits. Here is the code and the result:

from lxml import etree

speech = 'speech.xml'
tree = etree.parse(speech)

for sp in tree.iter('sp'):
    for child in sp.iterchildren():
        if child.tag != 'ab':
            paragraph = etree.Element('p')
            print('child', child.tag)

        for sibling in child.itersiblings():
            if sibling.tag == 'ab' and sibling.getnext() is not None:
                print('sibling', sibling.tag)
                paragraph.append(sibling)
            elif sibling.tag == 'ab' and sibling.getnext() is None:
                print('sibling', sibling.tag)
                paragraph.append(sibling)
                print('elif', etree.tostring(paragraph, encoding='unicode',
                                             pretty_print=True))
            else:
                print('else', etree.tostring(paragraph, encoding='unicode',
                                             pretty_print=True))
                break

child speaker
sibling ab
sibling ab
sibling ab
sibling ab
else <p><ab xml:id="ftln-0645" n="2.1.1"/>

            <ab xml:id="ftln-0646" n="2.1.2"/>

            <ab xml:id="ftln-0647" n="2.1.3"/>

            <ab xml:id="ftln-0648" n="2.1.4"/>

            
            </p>

Why does the program execute only the first iteration of the loop and exit
completely after the 'break', when it doesn't do that in the structurally
identical but simpler versions?
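
One possible explanation, offered as an assumption rather than a confirmed diagnosis: paragraph.append(sibling) moves each ab element out of sp, so the outer iterchildren() loop ends up following sibling links inside the new p element and never reaches the next stage. A common way around this kind of problem is to iterate over a snapshot of the children. The sketch below does that; the function name wrap_runs and the shortened sample are hypothetical, and it does not rename ab to lb.

```python
from lxml import etree

def wrap_runs(parent, tag="ab", wrapper="p"):
    """Wrap each run of consecutive <tag> children of parent in a <wrapper>."""
    current = None
    # list(parent) takes a snapshot, so moving children does not
    # disturb the iteration.
    for child in list(parent):
        if child.tag == tag:
            if current is None:
                # Start a new wrapper at the position of the first <tag>.
                current = etree.Element(wrapper)
                child.addprevious(current)
            current.append(child)   # moves child into the wrapper
        else:
            current = None          # any other child ends the run

# Shortened, hypothetical stand-in for the speech above.
sp = etree.fromstring(
    "<sp><speaker/><ab n='1'/><ab n='2'/><stage/>"
    "<ab n='3'/><stage/><ab n='4'/></sp>"
)
wrap_runs(sp)
print(etree.tostring(sp, pretty_print=True, encoding="unicode"))
```

Note that append() moves an element together with its tail, so whitespace between the original ab elements ends up inside the wrappers.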

Martin Mueller | 14 Jan 17:16 2016

a question about iter

Is there a way of telling etree to iterate only over the immediate children of an element?

I know I can do it by enumerating the elements that occur as child elements. In my particular case, it would be simpler if I could exclude elements, as in "cycle through everything except 'w', 'c', and 'pc'".
But it would be more elegant to say "cycle through the elements just one level below the current element". Is that possible?  One possibility might be to use the getchildren() function, but I'm not sure whether that's the best way.
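
For illustration, a minimal sketch of both variants. Iterating over an element directly (or calling iterchildren()) visits only its immediate children, and a set of tag names can serve as the exclusion list. The sample document and the name skip are made up for this example.

```python
from lxml import etree

# Hypothetical sample: <phrase> contains a nested <w> that must not be visited.
root = etree.fromstring(
    "<s><w>a</w><c> </c><phrase><w>b</w></phrase><pc>.</pc><note/></s>"
)

# Iterating over an element yields its immediate children only.
direct = [child.tag for child in root]
print(direct)

# Same traversal, skipping the tags that are not wanted.
skip = {"w", "c", "pc"}
kept = [child.tag for child in root.iterchildren() if child.tag not in skip]
print(kept)
```

getchildren() returns the same children as a list, but the lxml documentation marks it as deprecated in favour of list(element) or plain iteration.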


Misha Penkov | 28 Dec 10:03 2015

Why are meta tag attributes removed by Cleaner?

Hi,

I'm trying to clean an HTML file that contains meta tags. I want the meta tags to be preserved as-is. Unfortunately, the cleaner removes everything except the "name" attribute of the tag. How can I prevent this behavior?

Here is some example source:

import lxml.html.clean
html = """<html>
  <head>
    <meta name="keywords" content="test">
  </head>
</html>"""


def clean_html(html):
    """Removes parts of HTML unnecessary for processing."""
    kill_tags = ["map", "base", "iframe", "select", "noscript"]
    kwargs = {"scripts": True,
              "javascript": True,
              "comments": True,
              "style": True,
              "links": True,
              "meta": False,
              "page_structure": False,
              "processing_instructions": True,
              "embedded": True,
              "frames": False,
              "forms": False,
              "annoying_tags": True,
              "kill_tags": kill_tags,
              "whitelist_tags": ["meta"]}
    cleaner = lxml.html.clean.Cleaner(**kwargs)
    cleaned = cleaner.clean_html(unicode(html))
    return cleaned

print clean_html(html)

On my system, I see this printed to standard output:

<html>
  <head>
    <meta name="keywords">
  </head>
</html>

How can I prevent the cleaner from removing the content attribute?

Cheers,
Michael
anatoly techtonik | 13 Dec 10:42 2015

Fwd: Building wheels for Windows

Hi.

Top voted issue:
https://bugs.launchpad.net/lxml/+bugs?orderby=-heat&start=0

It looks like the Makefile has a command to build wheels. Why are they
not uploaded to PyPI?

https://pypi.python.org/pypi/lxml/3.5.0
--
anatoly t.
