Dwyer, Kevin | 2 Apr 18:48

XMLSchemaParseError: Document is not XML Schema

Hello,
 
I have encountered a problem with schema object creation with lxml; the problem relates to namespace used for the root element of the schema.
 
<snip>
>>> import lxml.etree
>>> et = lxml.etree.ElementTree(file=open('c:\\temp\\MySchema', 'r'))
>>> et
<lxml.etree._ElementTree object at 0x011B8AF8>
>>> xsd = lxml.etree.XMLSchema(et)
 
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    xsd = lxml.etree.XMLSchema(et)
  File "xmlschema.pxi", line 50, in lxml.etree.XMLSchema.__init__ (src/lxml/lxml.etree.c:120919)
XMLSchemaParseError: Document is not XML Schema
</snip>
 
Looking in subversion (http://codespeak.net/svn/lxml/trunk/src/lxml/xmlschema.pxi), in the XMLSchema class I see:
 
<snip>
 
            # work around for libxml2 bug if document is not XML schema at all
            #if _LIBXML_VERSION_INT < 20624:
            c_node = root_node._c_node
            c_href = _getNs(c_node)
            if c_href is NULL or \
                   cstd.strcmp(c_href, 'http://www.w3.org/2001/XMLSchema') != 0:
                raise XMLSchemaParseError, u"Document is not XML Schema"
</snip>
The schemas that I am using use this root element:
<xsd:schema xmlns:xsd="http://www.w3.org/2000/10/XMLSchema">
If I change them to <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"> they validate.
Can you explain why the earlier namespace definition is unacceptable?
Is there a workaround?
The schemas are not built by my application, so changing them might be
an issue.

Cheers,

 

Kevin


_______________________________________________
lxml-dev mailing list
lxml-dev <at> codespeak.net
http://codespeak.net/mailman/listinfo/lxml-dev
Victor Borda | 3 Apr 07:53

problems with custom build

Hi List,


I was very excited today to find the lxml module for python. As I need to write some xml checking scripts and finding bash scripting not well suited, I have decided to give python a try. So far I like it. However, here is the situation:

1) The target platform has an older RedHat installation with glibc2.3.4, so none of the binaries for libxml or libxslt were of any use. So I had to build from source on those. Not too painful.
2) However, trying to get lxml running has been really difficult. I need help here.
3) The target machine is not connected to the internet. It is not able to remotely retrieve packages.

Questions/Steps:
0) There don't appear to be any rpm's for lxml. Is this correct?
1) Since I don't have an internet connection from this machine it means I have to build from source, don't I (ie easy_install is not an option)?
2) I have assumed that I do have to build from source so I have given it a shot. I copied over the lxml2.2 tar file, unzipped it.
3) I got setuptools-0.6c9-py2.3.egg, dropped it in that unzipped directory, and ran python ez_setup.py, which seemed to go fine.
4) Then I ran python setup.py build. The build seemed to go fine.
5) I go to run test.py and I get this error message:

[]# python test.py
Traceback (most recent call last):
  File "test.py", line 595, in ?
    exitcode = main(sys.argv)
  File "test.py", line 558, in main
    test_cases = get_test_cases(test_files, cfg, tracer=tracer)
  File "test.py", line 260, in get_test_cases
    module = import_module(file, cfg, tracer=tracer)
  File "test.py", line 203, in import_module
    mod = __import__(modname)
  File "/home/victorborda/buildstuff/lxml-2.2/src/lxml/html/__init__.py", line 12, in ?
    from lxml import etree
ImportError: /home/victorborda/buildstuff/lxml-2.2/src/lxml/etree.so: undefined symbol: xmlSchematronFree

And with that, I tried running 'make test' and got the same result. The build appeared to go fine. The contents of 
lxml-2.2/build/lib.linux-x86_64-2.3/lxml

are:
-rw-r--r--  1 root root    7637 Jun 19  2008 builder.py
-rw-r--r--  1 root root   28750 Nov 23 19:33 cssselect.py
-rw-r--r--  1 root root   18287 May 31  2008 doctestcompare.py
-rw-r--r--  1 root root    7641 Jul  9  2008 ElementInclude.py
-rw-r--r--  1 root root    6407 Feb 27 14:45 _elementpath.py
-rwxr-xr-x  1 root root 3125362 Apr  3 04:49 etree.so
drwxr-xr-x  2 root root    4096 Apr  3 04:49 html
-rw-r--r--  1 root root      21 Oct 22  2007 __init__.py
-rwxr-xr-x  1 root root  846592 Apr  3 04:49 objectify.so
-rw-r--r--  1 root root      87 Mar  2  2008 pyclasslookup.py
-rw-r--r--  1 root root    8229 May 31  2008 sax.py
-rw-r--r--  1 root root     230 May 31  2008 usedoctest.py


--
Best Regards,
Victor Borda



Stefan Behnel | 3 Apr 08:09

Re: problems with custom build

Hi,

Victor Borda wrote:
> I was very excited today to find the lxml module for python. As I need to
> write some xml checking scripts and finding bash scripting not well suited,
> I have decided to give python a try. So far I like it. However, here is the
> situation:
> 
> 1) The target platform has an older RedHat installation with glibc2.3.4, so
> none of the binaries for libxml or libxslt were of any use. So I had to
> build from source on those. Not too painful.
> 2) However, trying to get lxml running has been really difficult. I need
> help here.
> 3) The target machine is not connected to the internet. It is not able to
> remotely retrieve packages.
> 
> Questions/Steps:
> 0) There don't appear to be any rpm's for lxml. Is this correct?
> 1) Since I don't have an internet connection from this machine it means I
> have to build from source, don't I (ie easy_install is not an option)?
> 2) I have assumed that I do have to build from source so I have given it a
> shot. I copied over the lxml2.2 tar file, unzipped it.
> 3) I got setuptools-0.6c9-py2.3.egg, dropped it in that unzipped directory,
> and ran python ez_setup.py, which seemed to go fine.
> 4) Then I ran python setup.py build. The build seemed to go fine.
> 5) I go to run test.py and I get this error message:
> 
> []# python test.py
> Traceback (most recent call last):
>   File "test.py", line 595, in ?
>     exitcode = main(sys.argv)
>   File "test.py", line 558, in main
>     test_cases = get_test_cases(test_files, cfg, tracer=tracer)
>   File "test.py", line 260, in get_test_cases
>     module = import_module(file, cfg, tracer=tracer)
>   File "test.py", line 203, in import_module
>     mod = __import__(modname)
>   File "/home/victorborda/buildstuff/lxml-2.2/src/lxml/html/__init__.py",
> line 12, in ?
>     from lxml import etree
> ImportError: /home/victorborda/buildstuff/lxml-2.2/src/lxml/etree.so:
> undefined symbol: xmlSchematronFree

I assume that you have installed newer versions of libxml2 and libxslt
somewhere, but it looks like lxml can't find them at runtime. Try
compiling lxml with the "--auto-rpath" option to make it remember where
it found the libraries it was built against.

Another option is to copy the libxml2 and libxslt tar.gz archives into
"lxml-2.2/libs/" and pass

	--static-deps --libxml2-version=2.X.Y --libxslt-version=1.1.XY

to setup.py, which will then build those libs first and build lxml
statically against them.
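
Once the rebuilt module imports cleanly, a quick check along these lines can
confirm which library versions lxml was compiled against and which ones it
actually loaded at runtime. This is only a sketch in current Python syntax
(the build discussed here targets Python 2.3); an older runtime version than
the compiled one would be consistent with the undefined-symbol error quoted
above.

```python
from lxml import etree

# Versions lxml was built against vs. the ones loaded at runtime.
versions = {
    "libxml2 compiled": etree.LIBXML_COMPILED_VERSION,
    "libxml2 runtime": etree.LIBXML_VERSION,
    "libxslt compiled": etree.LIBXSLT_COMPILED_VERSION,
    "libxslt runtime": etree.LIBXSLT_VERSION,
}
for name, version in versions.items():
    print(name, ".".join(str(part) for part in version))
```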

Stefan
Kev Dwyer | 3 Apr 17:42

http://www.w3.org/2001/XMLSchema"

Hello,


This is a re-post of my earlier posting, at Stefan's request, without the corporate boilerplate
that I inadvertently sent last time.  Sorry about that.

Bug 354574 logged at Stefan's request.

 
I have encountered a problem with schema object creation with lxml; the
problem relates to namespace used for the root element of the schema.
 
<snip>
>>> import lxml.etree
>>> et = lxml.etree.ElementTree(file=open('c:\\temp\\MySchema', 'r'))
>>> et
<lxml.etree._ElementTree object at 0x011B8AF8>
>>> xsd = lxml.etree.XMLSchema(et)
 
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    xsd = lxml.etree.XMLSchema(et)
  File "xmlschema.pxi", line 50, in lxml.etree.XMLSchema.__init__
(src/lxml/lxml.etree.c:120919)
XMLSchemaParseError: Document is not XML Schema
</snip>
 
Looking in subversion
(http://codespeak.net/svn/lxml/trunk/src/lxml/xmlschema.pxi), in the
XMLSchema class I see:
 
<snip>
 
            # work around for libxml2 bug if document is not XML schema at all
            #if _LIBXML_VERSION_INT < 20624:
            c_node = root_node._c_node
            c_href = _getNs(c_node)
            if c_href is NULL or \
                   cstd.strcmp(c_href, 'http://www.w3.org/2001/XMLSchema') != 0:
                raise XMLSchemaParseError, u"Document is not XML Schema"
</snip>
The schemas that I am using use this root element:
<xsd:schema xmlns:xsd="http://www.w3.org/2000/10/XMLSchema">
If I change them to <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"> they validate. 
Can you explain why the earlier namespace definition is unacceptable?
Is there a workaround? 
The schemas are not built by my application, so changing them might be
an issue.

Cheers,

 

Kevin

Stefan Behnel | 3 Apr 21:31

http://www.w3.org/2001/XMLSchema"

Hi,

Kev Dwyer wrote:
> I have encountered a problem with schema object creation with lxml; the
> problem relates to namespace used for the root element of the schema.
> 
> <snip>
>>>> import lxml.etree
>>>> et = lxml.etree.ElementTree(file=open('c:\\temp\\MySchema', 'r'))
>>>> et
> <lxml.etree._ElementTree object at 0x011B8AF8>
>>>> xsd = lxml.etree.XMLSchema(et)
> 
> Traceback (most recent call last):
>   File "<pyshell#4>", line 1, in <module>
>     xsd = lxml.etree.XMLSchema(et)
>   File "xmlschema.pxi", line 50, in lxml.etree.XMLSchema.__init__
> (src/lxml/lxml.etree.c:120919)
> XMLSchemaParseError: Document is not XML Schema
> </snip>
> 
> Looking in subversion
> (http://codespeak.net/svn/lxml/trunk/src/lxml/xmlschema.pxi), in the
> XMLSchema class I see:
> 
> <snip>
> 
>             # work around for libxml2 bug if document is not XML schema at
> all
>             #if _LIBXML_VERSION_INT < 20624:
>             c_node = root_node._c_node
>             c_href = _getNs(c_node)
>             if c_href is NULL or \
>                    cstd.strcmp(c_href, 'http://www.w3.org/2001/XMLSchema')
> != 0:
>                 raise XMLSchemaParseError, u"Document is not XML Schema"

Thanks for pointing me to this, this is a left-over work-around for a bug
that no longer exists in more recent libxml2 versions. I'll try to figure
out when it was fixed and disable this from that point on. Note that this
will not solve your problem, though.

> The schemas that I am using use this root element:
> <xsd:schema xmlns:xsd="http://www.w3.org/2000/10/XMLSchema">

I actually had to look this up, and found a lot of documents containing
this namespace, but little information why it was changed at the time. It
appears to be part of an older specification version that happens to still
work for your schemas.

Note that libxml2 does not support this namespace at all, just like most
other validators I could find a link about.

> The schemas are not built by my application, so changing them might be
> an issue.

You can always do a string replace before passing the XML data to the
schema parser. Or, you can parse the XML tree using iterparse and fix the
namespaces while doing so, simply by overwriting the tag names. You can
pass "tag={http://www.w3.org/2000/10/XMLSchema}*" to iterparse() to make
sure it only intercepts on the interesting elements. It will still build
the complete tree for you, which you can retrieve using "it.root" at the end.
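
A minimal sketch of the iterparse() variant, assuming a small made-up schema
(the names legacy_schema, OLD_NS and NEW_NS are only for illustration). It
rewrites the element tags only; prefixes used inside attribute values such as
type="xsd:string" stay bound to the old namespace.

```python
from io import BytesIO
from lxml import etree

OLD_NS = "http://www.w3.org/2000/10/XMLSchema"
NEW_NS = "http://www.w3.org/2001/XMLSchema"

# Made-up schema, for illustration only.
legacy_schema = b"""<xsd:schema xmlns:xsd="http://www.w3.org/2000/10/XMLSchema">
  <xsd:element name="a" type="xsd:string"/>
</xsd:schema>"""

# Only elements in the old namespace trigger an event.
context = etree.iterparse(BytesIO(legacy_schema), tag="{%s}*" % OLD_NS)
for _event, element in context:
    local_name = element.tag.split("}", 1)[1]
    element.tag = "{%s}%s" % (NEW_NS, local_name)

# The complete tree is still built; its root is available afterwards.
root = context.root
print(root.tag)
```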

Note that a string replace might still be the safer way to do it, as it
also keeps any prefix mappings intact that XMLSchema may use in text
content (i.e. qualified names). To be sure that you can safely replace the
string, you can parse the XML, serialise it to UTF-8, do the replacement,
and then parse it again. Both parsing and serialising are fast, so you may
not even notice the difference.
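
The parse, serialise, replace, re-parse sequence could look roughly like
this. It is a sketch only: load_legacy_schema and the sample schema are
hypothetical, and it assumes the schema uses nothing from the older draft
beyond the namespace URI itself.

```python
from io import BytesIO
from lxml import etree

OLD_NS = b"http://www.w3.org/2000/10/XMLSchema"
NEW_NS = b"http://www.w3.org/2001/XMLSchema"

def load_legacy_schema(source):
    """Parse a schema, swap the namespace URI, and build an XMLSchema."""
    tree = etree.parse(source)
    # Serialising first gives a normalised UTF-8 byte string to work on.
    data = etree.tostring(tree, encoding="UTF-8")
    data = data.replace(OLD_NS, NEW_NS)
    return etree.XMLSchema(etree.parse(BytesIO(data)))

# Made-up schema, for illustration only.
legacy_schema = b"""<xsd:schema xmlns:xsd="http://www.w3.org/2000/10/XMLSchema">
  <xsd:element name="a" type="xsd:string"/>
</xsd:schema>"""

schema = load_legacy_schema(BytesIO(legacy_schema))
is_valid = schema.validate(etree.XML(b"<a>text</a>"))
print(is_valid)
```

Because the replacement works on the serialised bytes, the xsd prefix
declaration itself is changed, so qualified names in attribute values keep
resolving correctly.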

Does that help?

Stefan
chris hoke | 4 Apr 11:57

setting xslt output encoding with lxml

hi,
(hope this is the right list for my question)

To set the XSL output encoding I normally use <xsl:output encoding="..."/> in the stylesheet.

At least in the Java based XSLT processors it is possible to set some attributes of xsl:output from the "outside" meaning when initializing or starting the transformation. So it is possible for example to overwrite any encoding specified in <xsl:output... with a different encoding.

Is there any way to do this with LXML?

The docs only have an example to set an XSLT parameter but not an XSLT processing configuration like the output encoding.

BTW, the example on http://codespeak.net/lxml/xpathxslt.html#xslt

>>> xslt_tree = etree.XML('''\
... <xsl:stylesheet version="1.0"
...     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
...     <xsl:template match="/">
...         <foo><xsl:value-of select="$a" /></foo>
...     </xsl:template>
... </xsl:stylesheet>''')
>>> transform = etree.XSLT(xslt_tree)
>>> f = StringIO('<a><b>Text</b></a>')
>>> doc = etree.parse(f)

seems to miss an <xsl:param name="a"/> parameter. I have not checked if it works without it but I guess it would be good style to declare any incoming parameters if not for setting a default value, would it not?


thanks for any hints,
Christof

Stefan Behnel | 4 Apr 21:01

Re: setting xslt output encoding with lxml

Hi,

chris hoke wrote:
> (hope this is the right list for my question)

Yes.

> To set the XSL output encoding I normally use <xsl:output encoding="..."/>
> in the stylesheet.
> 
> At least in the Java based XSLT processors it is possible to set some
> attributes of xsl:output from the "outside" meaning when initializing or
> starting the transformation. So it is possible for example to overwrite any
> encoding specified in <xsl:output... with a different encoding.

lxml.etree does not currently support this.

> Is there any way to do this with LXML?

You can parse the stylesheet with the normal XML parser, change the
xsl:output element according to your needs, and pass the result to XSLT().

Note that you can use

	iterparse(the_file, tag="{...XSL NS...}output")

to update the element while parsing.
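
As a rough sketch of the first suggestion (parse the stylesheet, adjust
xsl:output, pass the result to XSLT()), assuming a made-up stylesheet and a
hypothetical helper named set_output_encoding:

```python
from lxml import etree

XSL_NS = "http://www.w3.org/1999/XSL/Transform"

# Made-up stylesheet, for illustration only.
stylesheet = etree.XML(b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8"/>
  <xsl:template match="/"><out>caf&#233;</out></xsl:template>
</xsl:stylesheet>""")

def set_output_encoding(stylesheet_root, encoding):
    """Set the encoding on xsl:output, creating the element if missing."""
    output = stylesheet_root.find("{%s}output" % XSL_NS)
    if output is None:
        output = etree.SubElement(stylesheet_root, "{%s}output" % XSL_NS)
    output.set("encoding", encoding)

set_output_encoding(stylesheet, "ISO-8859-1")
transform = etree.XSLT(stylesheet)
result = transform(etree.XML("<doc/>"))

# bytes() serialises the result as requested by xsl:output.
serialized = bytes(result)
print(serialized)
```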

> BTW, the example on http://codespeak.net/lxml/xpathxslt.html#xslt
> 
>>>> xslt_tree = etree.XML('''\
> ... <xsl:stylesheet version="1.0"
> ...     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
> ...     <xsl:template match="/">
> ...         <foo><xsl:value-of select="$a" /></foo>
> ...     </xsl:template>
> ... </xsl:stylesheet>''')
>>>> transform = etree.XSLT(xslt_tree)
>>>> f = StringIO('<a><b>Text</b></a>')
>>>> doc = etree.parse(f)
> 
> seems to miss an <xsl:param name="a"/> parameter. I have not checked if it
> works without it but I guess it would be good style to declare any incoming
> parameters if not for setting a default value, would it not?

Yes, thanks for catching that.

Stefan
Stefan Behnel | 4 Apr 22:03

Re: setting xslt output encoding with lxml


Stefan Behnel wrote:
> chris hoke wrote:
>> To set the XSL output encoding I normally use <xsl:output encoding="..."/>
>> in the stylesheet.
>>
>> At least in the Java based XSLT processors it is possible to set some
>> attributes of xsl:output from the "outside" meaning when initializing or
>> starting the transformation. So it is possible for example to overwrite any
>> encoding specified in <xsl:output... with a different encoding.
> 
> lxml.etree does not currently support this.

I skimmed through the libxslt source and it looks like such a feature is
not easily available. So the best way to do it is actually to copy and
modify the stylesheet document as I explained.

Stefan
Laurence Rowe | 4 Apr 22:34

Re: setting xslt output encoding with lxml

It seems that libxslt respects the last <xsl:output> tag found, so
just append your required version to the end of the stylesheet:

>>> xslt_doc = etree.XML('''<?xml version="1.0" ?>
... <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
...   <xsl:output method="xml"/>
...     <xsl:template match="/"><br /></xsl:template>
... </xsl:stylesheet>''')

>>> str(etree.XSLT(xslt_doc)(etree.XML('''<foo/>''')))
'<?xml version="1.0"?>\n<br/>\n'

>>> xslt_doc.append(etree.XML('''<xsl:output xmlns:xsl="http://www.w3.org/1999/XSL/Transform" method="html"/>'''))

>>> str(etree.XSLT(xslt_doc)(etree.XML('''<foo/>''')))
'<br>\n'
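
The same trick applied to the encoding asked about originally might look
like this. It is a sketch that relies on the last-one-wins behaviour observed
above rather than on anything documented, and the stylesheet is made up.

```python
from lxml import etree

XSL_NS = "http://www.w3.org/1999/XSL/Transform"

xslt_doc = etree.XML(b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8"/>
  <xsl:template match="/"><br/></xsl:template>
</xsl:stylesheet>""")

# Append a second xsl:output that carries only the overriding attribute.
override = etree.SubElement(xslt_doc, "{%s}output" % XSL_NS)
override.set("encoding", "ISO-8859-1")

result = etree.XSLT(xslt_doc)(etree.XML("<foo/>"))
serialized = bytes(result)
print(serialized)
```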

Laurence

Kev Dwyer | 6 Apr 11:32

http://www.w3.org/2001/XMLSchema"

Hello Stefan,

Thanks for the speedy response, and for the workaround suggestions.

All the best,

Kevin

