Markus Schöpflin | 3 Mar 10:23 2015

Replacing children via slicing

Hello,

given the following piece of code which replaces all A nodes in P with B:

 >>> tree = etree.fromstring("<P><A/><A/></P>")
 >>> l = tree.findall("A")
 >>> first, last = tree.index(l[0]), tree.index(l[-1])
 >>> tree[first : last + 1] = [ etree.Element("B") ]
 >>> etree.tostring(tree)
'<P><B/></P>'

This code works, but it feels kind of ugly.

Is there a more elegant way to replace a slice of the list of children
(actually all children with the same name, which always appear in a row)
with another list of elements while keeping the document order intact?

TIA, Markus
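
One possible tidy-up, offered as a sketch rather than an established lxml idiom: wrap the index arithmetic in a small helper. The name `replace_run` is hypothetical, and the helper relies on the assumption stated in the question that the matching children are contiguous.

```python
from lxml import etree


def replace_run(parent, tag, replacements):
    """Replace the contiguous run of `tag` children of `parent`.

    Hypothetical helper: assumes all matching children sit next to
    each other, as described in the question.
    """
    matches = parent.findall(tag)
    if not matches:
        return  # nothing to replace
    first = parent.index(matches[0])
    last = parent.index(matches[-1])
    # Slice assignment swaps the whole run for the new elements in place.
    parent[first:last + 1] = replacements


tree = etree.fromstring("<P><X/><A/><A/><Y/></P>")
replace_run(tree, "A", [etree.Element("B")])
result = etree.tostring(tree, encoding="unicode")
print(result)  # <P><X/><B/><Y/></P>
```

The siblings before and after the run keep their positions, so document order is preserved.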

_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml <at> lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml
Axel | 2 Mar 14:14 2015

html parsing - cssselect

hi,

i'm new to python and cssselect.
i'm trying to get some links from a webpage.

http://pastebin.com/w35S8dJm

i have not the slightest idea why the specified selector does not return the expected results.

the selector does return the expected results in chrome console and firebug console.

thanks in advance for your help!

axel
_________________________________________________________________
Aaron Storm | 1 Mar 17:12 2015

Re: Programmatically accessing schema

Thanks for your reply Stefan. I'll look into the threads you pointed out in more detail.

I am kinda glad that I didn't miss an obvious solution =)

My need is rather simple, so maybe I can try querying the schema.

I did read a little bit about "Abbot" from MONK -- http://quest.library.illinois.edu/monk/project/ -- per Martin's suggestion (Thanks!), but it seems like it is no longer active and it is not a direct solution to my issue. 

Thanks!
Aaron


On Sunday, March 1, 2015 7:00 AM, "lxml-request <at> lxml.de" <lxml-request <at> lxml.de> wrote:

Date: Sat, 28 Feb 2015 21:28:37 +0100
From: Stefan Behnel <stefan_ml <at> behnel.de>
To: lxml <at> lxml.de
Subject: Re: [lxml] Programmatically accessing schema
Message-ID: <54F224F5.3030200 <at> behnel.de>
Content-Type: text/plain; charset=utf-8

Aaron Storm schrieb am 28.02.2015 um 12:29:
> I went through the documentation and couldn't find any hints on this.
> And I am not sure what keywords to search on google. Is there a way to
> programmatically (or api?) query the XSD for a specific element to get
> its spec? For example, I would like to know if /MyContainer/Container2
> can be repeated. Or if /MyContainer/Container1/Item1/ text() is
> optional.

Similar questions have been discussed in the past. These are related, for
example:

http://thread.gmane.org/gmane.comp.python.lxml.devel/7318

http://thread.gmane.org/gmane.comp.python.lxml.devel/5619

In general, figuring out what an XML Schema specification allows is rather
difficult. It can be done for simple cases (it's XML, you can search in
it), but schemas can be arbitrarily complex. Sometimes there are multiple
ways to express something, and covering all cases is cumbersome.

Stefan
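
For the simple case mentioned above, the "search in the XML" approach might look like the following sketch. It only reads the `minOccurs`/`maxOccurs` attributes on a declaration and ignores everything else a schema can express (groups, extensions, substitution and so on). The helper name `occurrence` and the trimmed schema are illustrative assumptions, not part of lxml.

```python
from lxml import etree

XS = {"xs": "http://www.w3.org/2001/XMLSchema"}

# Trimmed copy of the sample.xsd from the original question.
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    elementFormDefault="qualified">
  <xs:element name="MyContainer">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Container1">
          <xs:complexType>
            <xs:sequence>
              <xs:element minOccurs="0" name="Item1" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element maxOccurs="unbounded" ref="Container2"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="Container2"/>
</xs:schema>"""


def occurrence(schema_root, xpath):
    """Return (minOccurs, maxOccurs) for the first declaration matched.

    XSD defaults both attributes to 1 when they are omitted.
    """
    decl = schema_root.xpath(xpath, namespaces=XS)[0]
    return decl.get("minOccurs", "1"), decl.get("maxOccurs", "1")


schema_root = etree.fromstring(XSD)
container2 = occurrence(schema_root, "//xs:element[@ref='Container2']")
item1 = occurrence(schema_root, "//xs:element[@name='Item1']")
print(container2)  # ('1', 'unbounded')
print(item1)       # ('0', '1')
```

Here `maxOccurs="unbounded"` says Container2 may repeat, and `minOccurs="0"` says the Item1 element itself is optional.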



_________________________________________________________________
Aaron Storm | 28 Feb 12:29 2015

Programmatically accessing schema

I went through the documentation and couldn't find any hints on this. And I am not sure what keywords to search on google.

Is there a way to programmatically (or api?) query the XSD for a specific element to get its spec?
 
For example, I would like to know if /MyContainer/Container2 can be repeated. Or if /MyContainer/Container1/Item1/ text() is optional.
 
my.xml
```xml
<?xml version="1.0" encoding="UTF-8"?>
<MyContainer xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="sample.xsd">
  <Container1>
    <Item1>String</Item1>
  </Container1>
  <Container2>
    <Item2>2014-12-17T09:30:47Z</Item2>
  </Container2>
  <Container2>
    <Item2>2015-01-17T09:30:47Z</Item2>
  </Container2>
</MyContainer>
```
 
sample.xsd
```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="MyContainer">
    <xs:annotation>
      <xs:documentation>docs</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Container1">
          <xs:complexType>
            <xs:sequence>
              <xs:element minOccurs="0" name="Item1" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element maxOccurs="unbounded" ref="Container2"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="Container2">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Item2" type="xs:dateTime"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```


cheers,
Aaron

_________________________________________________________________
Pierpaolo Da Fieno | 25 Feb 20:21 2015

Isoschematron.Schematron not working as expected

Hello everyone,
while working on an ISO Schematron validation routine, I noticed that a report, even when triggered correctly, did not produce a False result.
If I check the validation_report, the report element is triggered correctly, but the final result is True and the error_log is empty.
Finally I went back to a very simple example included in isoschematron.Schematron docstring:

 >>> from lxml import isoschematron
 >>> schematron = isoschematron.Schematron(etree.XML('''
 ... <schema xmlns="http://purl.oclc.org/dsdl/schematron" >
 ...   <pattern id="id_only_attribute">
 ...     <title>id is the only permitted attribute name</title>
 ...     <rule context="*">
 ...       <report test="@*[not(name()='id')]">Attribute
 ...         <name path="@*[not(name()='id')]"/> is forbidden<name/>
 ...       </report>
 ...     </rule>
 ...   </pattern>
 ... </schema>
 ... '''))
 >>> xml = etree.XML('''
 ... <AAA name="aaa">
 ...   <BBB id="bbb"/>
 ...   <CCC color="ccc"/>
 ... </AAA>
 ... ''')
 >>> schematron.validate(xml)
 0
 >>> xml = etree.XML('''
 ... <AAA id="aaa">
 ...   <BBB id="bbb"/>
 ...   <CCC/>
 ... </AAA>
 ... ''')
 >>> schematron.validate(xml)
 1

Now if I run the above code I always get True, even with the invalid XML input. Same situation: the validation_report is correct, but the return value is True and the error_log is empty. I'm running Python 3.4.1 with lxml 3.5dev0. Same result with lxml 3.4.2. Am I doing something horribly wrong without realizing it, or is there actually a bug here?
Best Regards
Pierpaolo Da Fieno
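
A sketch of one way to make fired reports count as validation errors. It assumes an lxml release whose `isoschematron.Schematron` accepts the `error_finder` argument and provides the `ASSERTS_AND_REPORTS` constant; on a release without them this will not run. By default only failed asserts are treated as errors, which would be consistent with the behaviour described above.

```python
from lxml import etree, isoschematron

SCHEMA = etree.XML('''
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern id="id_only_attribute">
    <title>id is the only permitted attribute name</title>
    <rule context="*">
      <report test="@*[not(name()='id')]">Attribute
        <name path="@*[not(name()='id')]"/> is forbidden<name/>
      </report>
    </rule>
  </pattern>
</schema>
''')

# Assumption: this lxml version accepts `error_finder`. With the default
# (ASSERTS_ONLY) a fired <report> does not make validate() return False.
schematron = isoschematron.Schematron(
    SCHEMA,
    error_finder=isoschematron.Schematron.ASSERTS_AND_REPORTS,
)

bad = etree.XML('<AAA name="aaa"><BBB id="bbb"/><CCC color="ccc"/></AAA>')
good = etree.XML('<AAA id="aaa"><BBB id="bbb"/><CCC/></AAA>')

bad_ok = schematron.validate(bad)
good_ok = schematron.validate(good)
print(bad_ok, good_ok)
```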
_________________________________________________________________
Charlie Clark | 25 Feb 12:45 2015

Avoiding validation errors when offline

Hi,

while travelling yesterday I got a shock when I ran my tests and they  
failed with a validation error:

xmlschema.pxi:87: in lxml.etree.XMLSchema.__init__  
(src/lxml/lxml.etree.c:174765)
     ???
E   lxml.etree.XMLSchemaParseError: Element  
'{http://www.w3.org/2001/XMLSchema}element', attribute 'ref': The QName  
value '{http://purl.org/dc/terms/}created' does not resolve to a(n)  
element declaration., line 19

My machine had been playing up a bit so I initially thought this might  
have come from a corrupt file but this was, fortunately (or perhaps  
unfortunately), not the case. The error arises, I think, because a schema  
(see below) is parsed with an online resource.

In the particular case I can probably remove the validation test as it's  
effectively redundant but I wondered what is the best way to deal with  
this. Should the test be skipped if I can somehow work out that there is no
network connection? Is there a way of caching the online schema? Can lxml  
be allowed to fail gracefully if it can't load a remote resource?

Charlie

   <xs:import namespace="http://purl.org/dc/elements/1.1/"
     schemaLocation="http://dublincore.org/schemas/xmls/qdc/2003/04/02/dc.xsd"/>
   <xs:import namespace="http://purl.org/dc/terms/"
     schemaLocation="http://dublincore.org/schemas/xmls/qdc/2003/04/02/dcterms.xsd"/>
   <xs:import id="xml" namespace="http://www.w3.org/XML/1998/namespace"/>

   <xs:element name="coreProperties" type="CT_CoreProperties"/>

   <xs:complexType name="CT_CoreProperties">
     <xs:all>
       <xs:element name="category" minOccurs="0" maxOccurs="1" type="xs:string"/>
       <xs:element name="contentStatus" minOccurs="0" maxOccurs="1" type="xs:string"/>
       <xs:element ref="dcterms:created" minOccurs="0" maxOccurs="1"/>
       <xs:element ref="dc:creator" minOccurs="0" maxOccurs="1"/>
       <xs:element ref="dc:description" minOccurs="0" maxOccurs="1"/>
       <xs:element ref="dc:identifier" minOccurs="0" maxOccurs="1"/>
       <xs:element name="keywords" minOccurs="0" maxOccurs="1" type="CT_Keywords"/>
       <xs:element ref="dc:language" minOccurs="0" maxOccurs="1"/>
       <xs:element name="lastModifiedBy" minOccurs="0" maxOccurs="1" type="xs:string"/>
       <xs:element name="lastPrinted" minOccurs="0" maxOccurs="1" type="xs:dateTime"/>
       <xs:element ref="dcterms:modified" minOccurs="0" maxOccurs="1"/>
       <xs:element name="revision" minOccurs="0" maxOccurs="1" type="xs:string"/>
       <xs:element ref="dc:subject" minOccurs="0" maxOccurs="1"/>
       <xs:element ref="dc:title" minOccurs="0" maxOccurs="1"/>
       <xs:element name="version" minOccurs="0" maxOccurs="1" type="xs:string"/>
     </xs:all>
   </xs:complexType>
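
One way to make schema loading independent of the network is a custom `etree.Resolver` that serves cached copies of the imported schemas. The following is a minimal sketch under stated assumptions: the class name `OfflineResolver` and the `LOCAL_COPIES` table are made up for illustration, and the schema text stored under the dcterms URL is a tiny stand-in, not the real Dublin Core schema. In practice the table would point at files saved while online.

```python
from lxml import etree

# Stand-in for a locally cached copy of dcterms.xsd (NOT the real schema).
LOCAL_COPIES = {
    "http://dublincore.org/schemas/xmls/qdc/2003/04/02/dcterms.xsd":
        b'<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"'
        b' targetNamespace="http://purl.org/dc/terms/">'
        b'<xs:element name="created" type="xs:string"/>'
        b'</xs:schema>',
}


class OfflineResolver(etree.Resolver):
    """Serve known schema URLs from memory instead of the network."""

    def resolve(self, url, pubid, context):
        data = LOCAL_COPIES.get(url)
        if data is None:
            return None  # fall through to lxml's default handling
        return self.resolve_string(data, context)


MAIN_XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:dcterms="http://purl.org/dc/terms/">
  <xs:import namespace="http://purl.org/dc/terms/"
    schemaLocation="http://dublincore.org/schemas/xmls/qdc/2003/04/02/dcterms.xsd"/>
  <xs:element name="coreProperties">
    <xs:complexType>
      <xs:all>
        <xs:element ref="dcterms:created" minOccurs="0" maxOccurs="1"/>
      </xs:all>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

# The resolver must be registered on the parser that reads the schema
# document; the import is then resolved through it.
parser = etree.XMLParser()
parser.resolvers.add(OfflineResolver())
schema = etree.XMLSchema(etree.fromstring(MAIN_XSD, parser))

doc = etree.fromstring(
    b'<coreProperties xmlns:dcterms="http://purl.org/dc/terms/">'
    b'<dcterms:created>2015-02-25</dcterms:created></coreProperties>'
)
is_valid = schema.validate(doc)
print(is_valid)
```

If a cached copy is not an option, wrapping the `etree.XMLSchema(...)` call in `try/except etree.XMLSchemaParseError` and skipping the test is a cruder fallback.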

-- 
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Kronenstr. 27a
Düsseldorf
D- 40217
Tel: +49-211-600-3657
Mobile: +49-178-782-6226
_________________________________________________________________
Martin Mueller | 24 Feb 19:25 2015

a 'tail' problem


I have a "tail" problem with the following XML fragment

<lg>
    <l>WHo doth desire the trump of fame, to sound vnto the Skies,</l>
    <l><pb/>Or els who seekes the holy place, where mighty Ioue he lies,</l>
    <l>He must not by deceitfull mind, nor yet by puissant strength,</l>

</lg>

I want to turn the <pb> tag from child to previous sibling of <l> and use
this code

for element in tree.iter():
	if element.tag == 'pb':
		parent = element.getparent()
		grandparent = parent.getparent()
		position = grandparent.index(parent)
		parent.text = element.tail
		grandparent.insert(position , (element))

Some of it works but it fails to get rid of the pb tail and produces this
output:

<lg>
    <l>WHo doth desire the trump of fame, to sound vnto the Skies,</l>
    <pb/>Or els who seekes the holy place, where mighty Ioue he lies,
<l>Or els who seekes the holy place, where mighty Ioue he lies,</l>
    <l>He must not by deceitfull mind, nor yet by puissant strength,</l>

</lg>

If there is a with_tail=False solution, I don't know where to stick the
command. 

Martin Mueller
Professor emeritus of English and Classics
Northwestern University
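
A possible fix, sketched on a shortened version of the fragment. lxml moves an element together with its tail, so the tail has to be handed to the parent and then cleared before the move. The sketch assumes `<pb/>` is always the first child of its `<l>`, as in the example.

```python
from lxml import etree

root = etree.fromstring(
    "<lg><l>WHo doth desire</l><l><pb/>Or els who seekes</l></lg>"
)

# Collect the matches first: moving elements while tree.iter() is still
# running can confuse the iteration.
for pb in list(root.iter("pb")):
    parent = pb.getparent()
    grandparent = parent.getparent()
    # Hand the tail text over to the parent, then clear it on <pb>,
    # because lxml moves an element together with its tail.
    parent.text = (parent.text or "") + (pb.tail or "")
    pb.tail = None
    grandparent.insert(grandparent.index(parent), pb)

result = etree.tostring(root, encoding="unicode")
print(result)  # <lg><l>WHo doth desire</l><pb/><l>Or els who seekes</l></lg>
```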

_________________________________________________________________

Is there a way to disable error logging?

Hi,

Is there any way to disable error logging in lxml completely? I'm not sure, but I think I have a memory issue related to lxml error logging. At least, when the memory usage of my application gets huge, heapy shows the following:

ipdb> hp
Partition of a set of 648522 objects. Total size = 314884560 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0   6422   1 179466752  57 179466752  57 unicode
     1 188069  29 62420432  20 241887184  77 str
     2 119316  18 13363392   4 255250576  81 lxml.etree._LogEntry
     3  97582  15  8826128   3 264076704  84 tuple
     4  11013   2  7514616   2 271591320  86 dict (no owner)
     5  24769   4  4259720   1 275851040  88 list
     6   1400   0  3499328   1 279350368  89 dict of module
     7   3487   1  3148856   1 282499224  90 type
     8   3487   1  3043048   1 285542272  91 dict of type
     9   9803   2  2744840   1 288287112  92 dict of pycparser.plyparser.Coord
ipdb> log_entries = hp[2]
ipdb> it = iter(log_entries.nodes)
ipdb> it.next()
ipdb> it.next()
ipdb> it.next()
ipdb> it.next()
ipdb> it.next()
ipdb> it.next()
ipdb> it.next()
ipdb> it.next()
ipdb> it.next()

heapy shows ~300 MB of memory usage while the total memory usage of the process was 690 MB, so I assume some of it is unreleased memory outside of the Python interpreter (C code?).

--

Best regards,
Alexander Chekunkov
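
A small sketch of where log entries live, not a confirmed fix for the memory growth above. It shows the per-parser `error_log` and the module-level `etree.clear_error_log()`, which empties lxml's global log. Whether the retained `_LogEntry` objects in the heapy dump come from the global log or from long-lived parser, validator or XSLT objects is an open assumption.

```python
from lxml import etree

parser = etree.XMLParser(recover=True)
etree.fromstring("<a><b></a>", parser)  # malformed on purpose

# Each parser keeps the log of its most recent run.
entries = list(parser.error_log)
print(len(entries), entries[0].message if entries else None)

# Clears lxml's global error log; logs held by individual parser,
# validator or XSLT objects are separate and go away with those objects.
etree.clear_error_log()
```

If the entries turn out to be held by long-lived objects, recreating those objects periodically would be one thing to try.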
_________________________________________________________________
Jens Tröger | 21 Feb 20:21 2015

XML parsing and CDATA

Hi,

The following is based on this question here:

  http://stackoverflow.com/questions/28398651/how-to-generate-markup-inside-of-xslattribute-text

What I ended up using in my XSLT stylesheet was something along these
lines:

  <xsl:attribute name="bla">
    <![CDATA[Here is <em>some</em>verbatim text with id="]]>
    <xsl:value-of select="@id"/>
    <![CDATA[" and <strong>more</strong> stuff.]]>
  </xsl:attribute>

This worked reliably.  However, this worked only after a seemingly subtle
fix to the text itself!  Initially I had something like

    id-<xsl:value-of select="@id"/>

i.e. there was some text before that <xsl:value-of>.  This worked
sometimes, but in other cases the CDATA following that text/element
combination was silently swallowed.  Removing any text around
the <xsl:value-of> seems to work in all cases though.

Unfortunately, the document is too large to send and I don't really have
the time to build a small test case which illustrates this problem.  But
it seems to me that there is a parser issue here when multiple CDATA
sections inside of the <xsl:attribute> are combined with text and another
element like <xsl:value-of>.  There was no error message, just garbled
output.

Cheers,
Jens
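
As a point of comparison, and not a diagnosis of the problem reported above, the same attribute can be built with `<xsl:text>` and escaped markup instead of CDATA sections. This makes the text boundaries explicit, since whitespace-only text between the instructions is stripped from the stylesheet. The `<item>` input and the `<out>` result element are made up for the sketch.

```python
from lxml import etree

STYLESHEET = etree.XML('''
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/item">
    <out>
      <xsl:attribute name="bla">
        <xsl:text>Here is &lt;em&gt;some&lt;/em&gt;verbatim text with id="</xsl:text>
        <xsl:value-of select="@id"/>
        <xsl:text>" and &lt;strong&gt;more&lt;/strong&gt; stuff.</xsl:text>
      </xsl:attribute>
    </out>
  </xsl:template>
</xsl:stylesheet>
''')

transform = etree.XSLT(STYLESHEET)
result = transform(etree.XML('<item id="x1"/>'))
bla = result.getroot().get("bla")
print(bla)
```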

-- 
Jens Tröger
http://savage.light-speed.de/
_________________________________________________________________
Tcll | 17 Feb 17:07 2015

egg package

Hello, I'm using a portable Python installation and do not have the means to build from source.
May I request an egg distribution for Python 2.7 x86?

thanks :)
_________________________________________________________________
Gilles Lenfant | 12 Feb 14:59 2015

Having all namespace declarations in the root element

Hi,

I built an XML document by grabbing and assembling nodes from various sources. The namespace declarations are consistent but repeated across several descendant elements, and I want to gather all of them in the root element.

I.e., when I have:

<root>
  <foo xmlns:abc="abc.namespace">
    <abc:bar attrib="value"> blah </abc:bar>
  </foo>
  <joe  xmlns:abc="abc.namespace">
    <abc:stuff> stuff </abc:stuff>
  </joe>
</root>

Then I need (once serialized with .tostring(...)):

<root  xmlns:abc="abc.namespace">
  <foo>
    <abc:bar attrib="value"> blah </abc:bar>
  </foo>
  <joe>
    <abc:stuff> stuff </abc:stuff>
  </joe>
</root>

I naively thought that merging every element's .nsmap into root.nsmap would do the job, but it appears that changes to the Element.nsmap mapping fail silently.

My code (what I would do if .nsmap were really mutable):

----

etree.cleanup_namespaces(doc)

for elem in doc.getiterator():
    if hasattr(elem, 'nsmap'):
        for k in elem.nsmap:
            doc.nsmap[k] = elem.nsmap.pop(k)

----

Unfortunately, the .nsmap attribute behaves like an immutable dict that silently ignores mutation attempts, so this code does not work as expected.

Any hint to unblock me is welcome.

Best regards

-- 

Gilles Lenfant
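
A sketch of one way to hoist the declarations, assuming an lxml release in which `etree.cleanup_namespaces()` accepts the `top_nsmap` argument (3.5 or later, as far as I know). It also assumes that no prefix is bound to two different URIs anywhere in the tree.

```python
from lxml import etree

doc = etree.fromstring(
    '<root>'
    '<foo xmlns:abc="abc.namespace"><abc:bar attrib="value"> blah </abc:bar></foo>'
    '<joe xmlns:abc="abc.namespace"><abc:stuff> stuff </abc:stuff></joe>'
    '</root>'
)

# Collect every prefixed declaration in the tree. Assumes no prefix is
# bound to two different URIs; the default namespace (None) is skipped.
top_nsmap = {}
for elem in doc.iter("*"):
    for prefix, uri in elem.nsmap.items():
        if prefix is not None:
            top_nsmap.setdefault(prefix, uri)

# Assumption: this lxml version supports the top_nsmap argument.
etree.cleanup_namespaces(doc, top_nsmap=top_nsmap)
result = etree.tostring(doc, encoding="unicode")
print(result)
```

Reading `.nsmap` is fine here; only writing to it has no effect, which is why the declarations are passed to `cleanup_namespaces()` instead.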

_________________________________________________________________