Stefan Behnel | 5 Feb 13:01

Re: Cython errors building from git on Mac OS X

Laurence Rowe, 30.01.2012 16:59:
> I'm having trouble building lxml from a git clone on Mac OS X 10.6
> with python.org python 2.7 or my own compiled python 2.6. Any idea
> what might be causing the errors below?
> 
> $ python2.7 setup.py build
> Building lxml version 2.4.dev.
> Building with Cython 0.15.1.
> [...]
> cythoning src/lxml/lxml.etree.pyx to src/lxml/lxml.etree.c
> 
> Error compiling Cython file:
> ------------------------------------------------------------
> ...
>     cdef _Element _node
>     cdef _node_to_node_function _next_element
>     cdef _MultiTagMatcher _matcher
> 
>     @cython.final
>     cdef _initTagMatcher(self, tags):
>         ^
> ------------------------------------------------------------
> 
> src/lxml/lxml.etree.pyx:2501:9: The final compiler directive is not
> allowed in function scope
> [...]

Looks like you'll need 0.16 (i.e. the latest git version) of Cython for the
latest lxml master branch. It's close to release (certainly closer than
lxml 2.4) and should be stable enough for general use.

David Roe | 6 Feb 18:48

Allowing bad characters

I have to load an XML file generated by a third-party program. They have some non-printing chars in the document and I get a “lxml.etree.XMLSyntaxError: xmlParseCharRef: invalid xmlChar value…” exception.

 

Is there any way to relax or switch off the allowed characters in lxml? If not then is there a way to plug into the character entity parsing code? I really don’t want to resort to a pre-parse regex replace.

 

Thanks


This E-Mail is sent in confidence for the addressee only. Unauthorised recipients must preserve this confidentiality and should please advise the sender immediately by telephone (+44 (0)1625 505100) and return the original E-Mail to the sender without taking a copy. Cyprotex has taken all reasonable precautions to ensure that no viruses are transmitted from Cyprotex to any third party. Cyprotex accepts no responsibility for any loss or damage resulting directly or indirectly from the use of this E-Mail or the contents.
_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml <at> lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml
Stefan Behnel | 7 Feb 08:38

Re: Allowing bad characters

David Roe, 06.02.2012 18:48:
> I have to load an XML file generated by a third-party program. They have
> some non-printing chars in the document and I get a
> "lxml.etree.XMLSyntaxError: xmlParseCharRef: invalid xmlChar value..."
> exception.

Looks like it's not giving you XML then. Could you post an example of the
character references that it cannot parse? Here's the list of allowed XML
characters:

http://www.w3.org/TR/REC-xml/#charsets

But please take care to send your next post without the legal restrictions
at the bottom of your first e-mail.

> Is there any way to relax or switch off the allowed characters in lxml?

No, it uses a standards compliant XML parser. You can pass it the "recover"
option, but that may or may not do what you want.
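
A minimal sketch of the difference between the strict default parser and one created with the "recover" option, assuming lxml is installed. The sample document is made up for illustration. What a recovering parser does with the offending reference may depend on the libxml2 version, so inspect the result rather than relying on it.

```python
from lxml import etree

# Made-up sample: &#7; (BEL) is not an allowed XML 1.0 character.
data = b"<doc>a&#7;b</doc>"

# The default parser is strict and raises on the invalid reference.
try:
    etree.fromstring(data)
    strict_error = None
except etree.XMLSyntaxError as exc:
    strict_error = str(exc)

# A recovering parser tries to carry on past the error instead.
lenient_parser = etree.XMLParser(recover=True)
root = etree.fromstring(data, lenient_parser)

print("strict parser:", strict_error)
print("recovered root:", root.tag, repr(root.text))
```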

> If not then is there a way to plug into the character entity parsing
> code?

No.

> I really don't want to resort to a pre-parse regex replace.

And you shouldn't.

> This E-Mail is sent in confidence for the addressee only.  Unauthorised
> recipients must preserve this confidentiality and should please advise
> the sender immediately by telephone (+44 (0)1625 505100) and return the
> original E-Mail to the sender without taking a copy.

Hmm, I can't find myself in the list of recipients - am I authorised to
keep a copy of your e-mail or not? Sorry for not having my phone within
reach while receiving e-mails.

Anyway, this is not the kind of comment I expect when you ask others for
help. And it's certainly not appropriate for a public mailing list.

Stefan
_________________________________________________________________
David Roe | 7 Feb 11:01

Re: Allowing bad characters


> David Roe, 06.02.2012 18:48:
> > I have to load an XML file generated by a third-party program. They
> > have some non-printing chars in the document and I get a
> > "lxml.etree.XMLSyntaxError: xmlParseCharRef: invalid xmlChar value..."
> > exception.
>
> Looks like it's not giving you XML then. Could you post an example of the character references that it cannot parse?

Here are the ones:
    &#012;&#007;

Obviously those aren't valid XML characters (hence my original post). Unfortunately, this is the reality of the file I need to load. The way I see it, I have a few options:

1) Find some way to get lxml to load those.
2) String replace them with some placeholder before parsing.
3) Roll my own XML (almost) parser.

I'd obviously rather not do (2) or (3) which is why I was asking if there is any way to load this file with lxml. I take it from your reply that (1) is not a possibility?
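
For illustration only, here is a sketch of why those two references are rejected and of what option (2) might look like. It uses only the standard library. The names `is_xml_char` and `replace_invalid_refs` are hypothetical helpers, and the ranges follow the XML 1.0 `Char` production linked earlier in the thread. Note that a plain text substitution like this also touches references inside CDATA sections and comments, which is one reason the approach is discouraged above.

```python
import re


def is_xml_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# Numeric character references: hexadecimal (&#xC;) or decimal (&#012;).
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def replace_invalid_refs(text: str, placeholder: str = "") -> str:
    """Replace references to disallowed code points with a placeholder."""
    def fix(match):
        hex_digits, dec_digits = match.groups()
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits, 10)
        # Keep references that XML allows; swap out the rest.
        return match.group(0) if is_xml_char(cp) else placeholder

    return CHAR_REF.sub(fix, text)


# &#012; is form feed (12) and &#007; is BEL (7); &#65; is a valid "A".
cleaned = replace_invalid_refs("<doc>a&#012;b&#007;c&#65;</doc>")
print(cleaned)
```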

> But please take care to send your next post without the legal restrictions at the bottom of your first e-mail.

I can't switch that off so I'm sending from another account.


> Hmm, I can't find myself in the list of recipients - am I authorised to keep a copy of your e-mail or not? Sorry for not having my phone within reach while receiving e-mails.

You are a member of the mailing list I sent it to so yes you are.


_________________________________________________________________
Laurence Rowe | 7 Feb 16:25

[Pull 30] Use regexp for cssselect ":contains()" pseudo-class

This provides:

* portability with other xslt processors
* removes the 'css'  global prefix assignment which can cause
unexpected behaviour when working with xml files using that prefix
* a slight performance improvement

Users now need to explicitly pass in the regexp namespace when calling
the ``xpath()`` method if they use the deprecated ``:contains()``
pseudo-class.
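
As a rough sketch of what passing the regexp namespace to ``xpath()`` can look like: the EXSLT regular expressions namespace is ``http://exslt.org/regular-expressions``, and lxml's XPath supports its ``test()`` function. The prefix ``re`` and the sample document are assumptions made for this example. The prefix that the translated ``:contains()`` expression actually expects is not stated in this message, so check the pull request for it.

```python
from lxml import etree

# EXSLT regular expressions namespace; the "re" prefix is an arbitrary choice.
EXSLT_RE_NS = "http://exslt.org/regular-expressions"

root = etree.fromstring("<div><p>hello world</p><p>goodbye</p></div>")

# The prefix used in the expression must be bound explicitly in the call.
hits = root.xpath(
    "//p[re:test(string(.), 'hello')]",
    namespaces={"re": EXSLT_RE_NS},
)
texts = [p.text for p in hits]
print(texts)
```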

https://github.com/lxml/lxml/pull/30

Laurence
_________________________________________________________________
Marcin Krol | 8 Feb 14:32

schema problem


Hello everyone,

I have some (legacy == unmodifiable) xml files to handle and they cause
problems with lxml. I'm not even sure if this node is correct:

<test
  xmlns="TestAutomation"
  xmlns:xs="http://www.w3.org/2001/XMLSchema-instance" 	 	
  xs:schemaLocation="TestAutomation TestAutomation.xsd" >

The problem is that when parsing a file with such a (root) node, I get this
namespace map:

root.nsmap {'xs': 'http://www.w3.org/2001/XMLSchema-instance', None:
'TestAutomation'}

This makes the tree unparseable (iterators don't work, xpath doesn't
work), specifically, None key causes this problem - even if I pass
root.nsmap to 'namespaces' in element.xpath() call, I get TypeError (if
I pass a copy of dictionary without None key, xpath and iterators
silently fail, that is, they return empty results).

OTOH, this element is impossible to delete from the namespace - that is:

del root.nsmap[None]

executes, but the "None" key is still there in the root.nsmap dictionary.

Is this a bug, or are the schema attributes on this node incorrect?

Regards,
Marcin Krol

Marcin Krol | 8 Feb 14:34

schema problem


Argh, I forgot to include info:

Python              : sys.version_info(major=2, minor=7, micro=2,
releaselevel='final', serial=0)
lxml.etree          : (2, 3, 3, 0)
libxml used         : (2, 7, 8)
libxml compiled     : (2, 7, 8)
libxslt used        : (1, 1, 26)
libxslt compiled    : (1, 1, 26)

jholg | 8 Feb 16:37

Re: schema problem

Hi,

> <test
>   xmlns="TestAutomation"
>   xmlns:xs="http://www.w3.org/2001/XMLSchema-instance" 	 	
>   xs:schemaLocation="TestAutomation TestAutomation.xsd" >
> 

Parses ok for me:

>>> root = etree.fromstring("""
... <test
...   xmlns="TestAutomation"
...   xmlns:xs="http://www.w3.org/2001/XMLSchema-instance"
...   xs:schemaLocation="TestAutomation TestAutomation.xsd" >
... <x>foobar</x>
... </test>
... """)
>>> print root
<Element {TestAutomation}test at 26bbd0>
>>>

> This makes the tree unparseable (iterators don't work, xpath doesn't
> work), specifically, None key causes this problem - even if I pass
> root.nsmap to 'namespaces' in element.xpath() call, I get TypeError (if
> I pass a copy of dictionary without None key, xpath and iterators
> silently fail, that is, they return empty results).

Iterators work ok:

>>> for elt in root.iter(): print elt
...
<Element {TestAutomation}test at 26bbd0>
<Element {TestAutomation}x at 26ba50>

>>> for elt in root.iter('test'): print elt
...
>>> for elt in root.iter('{TestAutomation}test'): print elt
...
<Element {TestAutomation}test at 26bbd0>
>>> for elt in root.iter('{TestAutomation}*'): print elt
...
<Element {TestAutomation}test at 26bbd0>
<Element {TestAutomation}x at 26ba50>
>>>

XPath works ok:

>>> root.xpath("//nsprefix:*", namespaces={'nsprefix':'TestAutomation'})
[<Element {TestAutomation}test at 26bbd0>, <Element {TestAutomation}x at 26ba50>]
>>>
>>> root.xpath("//nsprefix:test", namespaces={'nsprefix':'TestAutomation'})
[<Element {TestAutomation}test at 26bbd0>]
>>>

I suppose you've run into this:

>>> root.xpath("//test", namespaces={None:'TestAutomation'})
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "lxml.etree.pyx", line 1314, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:37234)
  File "xpath.pxi", line 244, in lxml.etree.XPathElementEvaluator.__init__ (src/lxml/lxml.etree.c:103583)
  File "xpath.pxi", line 117, in lxml.etree._XPathEvaluatorBase.__init__ (src/lxml/lxml.etree.c:102283)
  File "xpath.pxi", line 55, in lxml.etree._XPathContext.__init__ (src/lxml/lxml.etree.c:101630)
  File "extensions.pxi", line 78, in lxml.etree._BaseContext.__init__ (src/lxml/lxml.etree.c:94046)
TypeError: empty namespace prefix is not supported in XPath
>>>

?

You should be able to use a prefixed XPath expression instead.

Might be worth reading http://lxml.de/tutorial.html#namespaces for clarification.

Note that your elements live in the namespace "TestAutomation" (which happens to be used unprefixed in the XML
file). Personally, it helps me a lot to think about namespaces in Clark notation (see link above).
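
A short sketch of the prefixed approach, reusing the document from the original question and assuming lxml is installed. The prefix "ta" is an arbitrary choice for this example. The sketch also shows that each access to `nsmap` builds a new dictionary, so deleting a key from it does not change the element.

```python
from lxml import etree

XML = b"""<test
  xmlns="TestAutomation"
  xmlns:xs="http://www.w3.org/2001/XMLSchema-instance"
  xs:schemaLocation="TestAutomation TestAutomation.xsd">
<x>foobar</x>
</test>"""

root = etree.fromstring(XML)

# XPath 1.0 has no default namespace, so drop the None key and bind the
# default namespace URI to a prefix of our own ("ta" is arbitrary).
ns = {prefix: uri for prefix, uri in root.nsmap.items() if prefix is not None}
ns["ta"] = root.nsmap[None]

matches = root.xpath("//ta:x", namespaces=ns)
texts = [el.text for el in matches]

# An unprefixed name only matches elements in no namespace at all.
unprefixed = root.xpath("//x", namespaces=ns)

# nsmap is rebuilt on every access: deleting from the returned dict
# only modifies that temporary copy.
snapshot = root.nsmap
del snapshot[None]
still_mapped = None in root.nsmap

print(texts, unprefixed, still_mapped)
```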

Hth
Holger

_________________________________________________________________
Stefan Behnel | 8 Feb 19:04

Re: schema problem

Marcin Krol, 08.02.2012 14:32:
> I have some (legacy == unmodifiable) xml files to handle and they cause
> problems with lxml. I'm not even sure if this node is correct:
> 
> <test
>   xmlns="TestAutomation"
>   xmlns:xs="http://www.w3.org/2001/XMLSchema-instance" 	 	
>   xs:schemaLocation="TestAutomation TestAutomation.xsd" >
> 
> 
> The problem is that when parsing file with such (root) node, I get this
>   in namespace:
> 
> 
> root.nsmap {'xs': 'http://www.w3.org/2001/XMLSchema-instance', None:
> 'TestAutomation'}
> 
> 
> This makes the tree unparseable (iterators don't work, xpath doesn't
> work), specifically, None key causes this problem

This should help:

http://lxml.de/FAQ.html#how-can-i-specify-a-default-namespace-for-xpath-expressions

Stefan
_________________________________________________________________
Klaus Schilling | 11 Feb 20:40

External Entities in a Dictionary

Hello lxmlers,

is it possible to pass a dictionary object to an etree.parser object for the
lookup of external entities?

Klaus Schilling
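
The question is left unanswered in this part of the thread. One possible approach, sketched here under assumptions, is lxml's custom resolver mechanism: a subclass of `etree.Resolver` registered through `parser.resolvers.add()`. The class name `DictResolver`, the dictionary contents, and the sample document are all made up for illustration, and the dictionary is keyed by the system URL exactly as it appears in the document.

```python
from io import StringIO

from lxml import etree

# Hypothetical lookup table: system URL -> content to hand to the parser.
ENTITY_SOURCES = {
    "entities.dtd": '<!ENTITY greeting "Hello">',
}


class DictResolver(etree.Resolver):
    """Resolve external references from an in-memory dictionary."""

    def __init__(self, sources):
        super().__init__()
        self.sources = sources

    def resolve(self, system_url, public_id, context):
        content = self.sources.get(system_url)
        if content is None:
            # Unknown URL: let other resolvers or the default loader try.
            return None
        return self.resolve_string(content, context)


# load_dtd makes the parser fetch the external DTD; resolve_entities is
# set explicitly because its default differs between lxml versions.
parser = etree.XMLParser(load_dtd=True, resolve_entities=True)
parser.resolvers.add(DictResolver(ENTITY_SOURCES))

xml = '<!DOCTYPE doc SYSTEM "entities.dtd"><doc>&greeting;</doc>'
tree = etree.parse(StringIO(xml), parser)
text = tree.getroot().text
print(text)
```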