Fwd: iterparse() doesn't do dtd_validation
Adam Bielański <ab <at> rdprojekt.pl>
2010-07-02 15:47:22 GMT
Hi,
Can anyone help me validate an XML file against an internal DTD with
iterparse()? I can't get iterparse() to honor the dtd_validation flag. I
ended up with two execution paths: one uses etree.XMLParser and
reports all validation errors in the file, and the other uses iterparse()
and just prints every element from the XML file. The code I used is below,
and the sample XML and DTD files are attached. For reference,
I'm using lxml 2.2.4 on Python 2.6.5.
The XmlWithDTDStream class (used in the code below) behaves like a
stream: it returns the XML declaration (the first line of a well-formed
XML file), then the DTD string, and then the rest of the XML file. It
works correctly, as you can see from the fact that XMLParser reports the
expected errors and iterparse() returns the actual elements from input.xml.
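The class itself is not included in this message. As a rough illustration only, a stream wrapper with the behaviour described above might look like the sketch below. It assumes that the first line of the XML file holds only the XML declaration and that the DTD string is already a complete DOCTYPE declaration with an internal subset; the attribute names and the close() method are made up for the example.

```python
import io


class XmlWithDTDStream(object):
    """File-like object: XML declaration, then the DTD text, then the rest."""

    def __init__(self, dtd, xml_path):
        self._file = io.open(xml_path, "rb")
        # Assumption: line 1 of the file is just the <?xml ...?> declaration.
        declaration = self._file.readline()
        if not isinstance(dtd, bytes):
            dtd = dtd.encode("utf-8")
        # Bytes that must be handed out before any more file content.
        self._pending = declaration + dtd + b"\n"

    def read(self, size=-1):
        if size is None or size < 0:
            # Unbounded read: everything still pending plus the whole file.
            data = self._pending + self._file.read()
            self._pending = b""
            return data
        # Serve the pending prefix first, then top up from the file.
        head = self._pending[:size]
        self._pending = self._pending[size:]
        if len(head) < size:
            head += self._file.read(size - len(head))
        return head

    def close(self):
        self._file.close()
```

Because read() never holds more than the requested number of bytes plus the short prefix, a consumer that reads in chunks sees the DTD spliced in without the whole file being loaded at once.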
<pre><code>
from lxml import etree

# XmlWithDTDStream is defined elsewhere (not shown here).

if __name__ == "__main__":
    # Path 1: XMLParser with DTD validation enabled reports the errors.
    print "XMLParser"
    with open("internalSchema.dtd") as dtdFile:
        dtd = dtdFile.read()
    stream = XmlWithDTDStream(dtd, 'input.xml')
    parser = etree.XMLParser(dtd_validation=True, load_dtd=True)
    try:
        root = etree.XML(stream.read(32768), parser)
    except Exception, e:
        print "An exception:: ", e
    print "Error log:: ", parser.error_log

    # Path 2: iterparse() with the same flags yields elements but no errors.
    print "iterparse: "
    with open("internalSchema.dtd") as dtdFile:
        dtd = dtdFile.read()
    stream = XmlWithDTDStream(dtd, 'input.xml')
    for aTuple in etree.iterparse(stream, dtd_validation=True,
                                  load_dtd=True):
        print aTuple
</code></pre>
The result is then as follows:
<pre>
XMLParser
An exception:: No declaration for attribute badAttribute of element pName, line 23, column 26
Error log::<string>:23:26:ERROR:VALID:DTD_UNKNOWN_ATTRIBUTE: No declaration for attribute badAttribute of element pName
<string>:25:7:ERROR:VALID:DTD_UNKNOWN_ELEM: No declaration for element property--badName
<string>:26:5:ERROR:VALID:DTD_CONTENT_MODEL: Element propertyGroup content does not follow the DTD, expecting (property)+, got (property--badName )
<string>:29:29:ERROR:VALID:DTD_UNKNOWN_ELEM: No declaration for element name
iterparse:
(u'end',<Element pName at 13e6450>)
(u'end',<Element pValue at 13e64e0>)
(u'end',<Element property--badName at 13e6510>)
(u'end',<Element propertyGroup at 13e6540>)
(u'end',<Element name at 13e6570>)
(u'end',<Element attributeInstance at 13e65a0>)
(u'end',<Element attributeGroup at 13e65d0>)
(u'end',<Element rootElement at 13e63f0>)
</pre>
I hope someone can help, or confirm that this is a bug in lxml. My
issue comes from the fact that I need to validate files that are too
large to be read into memory with etree.XML(). If there is a way to
do this without iterparse(), I'd gladly learn it.
Best regards,
Adam.