Aaron Maxwell | 1 May 19:41

Ignore namespace when parsing

Hi all,

When using python lxml to parse an XML document whose root element
defines a namespace, is there some way the library can allow me to not
explicitly invoke that namespace in queries?

Consider an XML document with this content:
{{{
<?xml version="1.0" ?>
<Root xmlns="http://redsymbol.net/SomeNamespace">
  <Child1></Child1>
  <Child2></Child2>
</Root>
}}}

If I parse it like this:
{{{
from lxml import etree

def ignore_ns(path_to_file):
    # Parse the file and print each child's tag, namespace included.
    x = etree.parse(path_to_file)
    for kid in x.getroot():
        print(kid.tag)
}}}

... where path_to_file points to a file containing the above XML
document, then this output is produced:

{{{
{http://redsymbol.net/SomeNamespace}Child1
{http://redsymbol.net/SomeNamespace}Child2
}}}

John Lovell | 1 May 20:00

Re: Ignore namespace when parsing

Aaron:

It sounds to me like you could use an xpath query.

rootElement.xpath("//*[local-name() = 'Child1']")

http://codespeak.net/lxml/xpathxslt.html
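
A minimal sketch of that query, run against the sample document from the first message. It assumes lxml is installed, and the variable names are only illustrative. The predicate needs its closing bracket, and the inner and outer quotes have to differ.

```python
from lxml import etree

xml = b"""<?xml version="1.0" ?>
<Root xmlns="http://redsymbol.net/SomeNamespace">
  <Child1></Child1>
  <Child2></Child2>
</Root>"""

root = etree.fromstring(xml)

# local-name() compares only the part of the tag after the namespace,
# so the query works without naming the namespace at all.
matches = root.xpath("//*[local-name() = 'Child1']")
print([el.tag for el in matches])
```

The elements that come back still carry the namespace in their .tag; only the query ignores it.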

Good luck,

John W. Lovell
Web Applications Engineer
Northwest Educational Service District
1601 R Avenue
Anacortes, WA 98221
(360) 299-4086
jlovell <at> nwesd.org

www.nwesd.org
Together We Can ...

-----Original Message-----
From: lxml-dev-bounces <at> codespeak.net [mailto:lxml-dev-bounces <at> codespeak.net] On Behalf Of Aaron Maxwell
Sent: Friday, May 01, 2009 10:41 AM
To: lxml-dev <at> codespeak.net
Subject: [lxml-dev] Ignore namespace when parsing

Hi all,

When using python lxml to parse an XML document whose root element defines a namespace, is there some way the

Re: Ignore namespace when parsing

On Fri, 2009-05-01 at 10:41 -0700, Aaron Maxwell wrote:
> {{{
> # ns found from rootElement.nsmap as above
> rootElement.find('{' + ns + '}' + 'Child1')
> }}}
> 
> will correctly find the child element.
> 
> In this kind of situation, where I just want to parse the document and
> really don't care about the namespace, is there some way to construct
> a parser that will ignore it in a more automated way?  Is there a
> simpler, better approach, or some insight I'm missing?
> 

rootElement.xpath('//Child1')
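
One caveat: in XPath 1.0 an unprefixed name such as Child1 only matches elements that are in no namespace, so against a document with a default namespace this query comes back empty. If the namespace really does not matter, one possible workaround is to strip it from the tree after parsing. The helper name strip_namespaces below is made up for this sketch and is not part of lxml.

```python
from lxml import etree

def strip_namespaces(root):
    """Rewrite every element tag in place to its local name (hypothetical helper)."""
    for el in root.iter():
        # Comments and processing instructions have a non-string tag; skip them.
        if not isinstance(el.tag, str):
            continue
        el.tag = etree.QName(el).localname
    # Drop namespace declarations that are no longer used.
    etree.cleanup_namespaces(root)
    return root

xml = b"""<Root xmlns="http://redsymbol.net/SomeNamespace">
  <Child1></Child1>
  <Child2></Child2>
</Root>"""

root = etree.fromstring(xml)
before = root.xpath('//Child1')   # no match while the namespace is in place
strip_namespaces(root)
after = root.xpath('//Child1')    # matches once the tags are plain names
print(len(before), len(after))
```

This sketch leaves namespaced attributes untouched, and elements from different namespaces that share a local name become indistinguishable afterwards, so it only suits documents where that is acceptable.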

--

-- 
Sérgio M. B.
_______________________________________________
lxml-dev mailing list
lxml-dev <at> codespeak.net
http://codespeak.net/mailman/listinfo/lxml-dev
Aaron Maxwell | 2 May 00:30

Re: Ignore namespace when parsing

On Friday 01 May 2009 11:00:29 am John Lovell wrote:
> Aaron:
>
> It sounds to me like you could use an xpath query.
> rootElement.xpath("//*[local-name() = 'Child1']")
> http://codespeak.net/lxml/xpathxslt.html

Thanks, that does work fine.

My actual problem is somewhat more complex than the simplistic example I gave, 
however.  The structure of the XML document is more like this (lots of the 
actual document is excised):
{{{
<ItemLookupResponse 
xmlns="http://webservices.amazon.com/AWSECommerceService/2008-04-07">
  <OperationRequest>
  <Items>
    <Item>
      <ASIN>0521545668</ASIN>
      <OfferSummary>
         (snip)
      </OfferSummary>
      <Offers>
        <Offer>
          <OfferListing>
            <Price>
              <Amount>7517</Amount>
            </Price>
(snip)
}}}

Re: Ignore namespace when parsing


From http://codespeak.net/lxml/xpathxslt.html,
a simplified example:
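    from io import StringIO
    from lxml import etree

    f = StringIO('''<a:foo xmlns:a="http://codespeak.net/ns/test1"
        xmlns:b="http://codespeak.net/ns/test2">
        <b:bar>Text</b:bar>
    </a:foo> ''')

    # hparser was not defined in this snippet, so use the default parser
    doc = etree.parse(f)
    # the prefix used in the query is bound through the namespaces mapping
    r = doc.xpath('//b:bar',
         namespaces={'b': 'http://codespeak.net/ns/test2'})
    print(len(r))
    print(r[0].tag)
    print(r[0].text)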

See also the extensions documentation:
http://codespeak.net/lxml/extensions.html

I'm trying to work with namespaces too, but the documentation makes my
head spin.

In your example I don't see any prefixed elements such as <t:price>,
so it is difficult to guess.
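
The prefix does not have to appear in the document itself. For a default namespace like the one in the ItemLookupResponse example quoted in this thread, the caller can bind any prefix to the namespace URI when querying. A rough sketch, assuming a trimmed, well-formed stand-in for the snipped response; the prefix aws is arbitrary:

```python
from lxml import etree

# Trimmed stand-in for the response quoted in the thread;
# the elements that were snipped there are simply left out here.
xml = b"""<ItemLookupResponse
 xmlns="http://webservices.amazon.com/AWSECommerceService/2008-04-07">
  <Items>
    <Item>
      <ASIN>0521545668</ASIN>
      <Offers>
        <Offer>
          <OfferListing>
            <Price>
              <Amount>7517</Amount>
            </Price>
          </OfferListing>
        </Offer>
      </Offers>
    </Item>
  </Items>
</ItemLookupResponse>"""

root = etree.fromstring(xml)

# The default namespace is stored under the key None in nsmap.
ns = {'aws': root.nsmap[None]}

amounts = root.xpath('//aws:Item/aws:Offers//aws:Amount/text()', namespaces=ns)
print(amounts)

# find() and findall() accept the same mapping.
asin = root.find('.//aws:ASIN', namespaces=ns)
print(asin.text)
```

Reading the URI from nsmap means the code does not hard-code the dated namespace string, which may help if the service version in the URI changes.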

On Fri, 2009-05-01 at 15:30 -0700, Aaron Maxwell wrote:
> On Friday 01 May 2009 11:00:29 am John Lovell wrote:
> > Aaron:
> >
> > It sounds to me like you could use an xpath query.
> > rootElement.xpath("//*[local-name() = 'Child1']")

Laurence Rowe | 6 May 00:04

Re: Ignore namespace when parsing

2009/5/2 Aaron Maxwell <amax <at> redsymbol.net>:
> On Friday 01 May 2009 11:00:29 am John Lovell wrote:
>> Aaron:
>>
>> It sounds to me like you could use an xpath query.
>> rootElement.xpath("//*[local-name() = 'Child1']")
>> http://codespeak.net/lxml/xpathxslt.html
>
> Thanks, that does work fine.
>
> My actual problem is somewhat more complex than the simplistic example I gave,
> however.  The structure of the XML document is more like this (lots of the
> actual document is excised):
> {{{
> <ItemLookupResponse
> xmlns="http://webservices.amazon.com/AWSECommerceService/2008-04-07">
>  <OperationRequest>
>  <Items>
>    <Item>
>      <ASIN>0521545668</ASIN>
>      <OfferSummary>
>         (snip)
>      </OfferSummary>
>      <Offers>
>        <Offer>
>          <OfferListing>
>            <Price>
>              <Amount>7517</Amount>
>            </Price>
> (snip)

Re: Ignore namespace when parsing

On Wed, 2009-05-06 at 00:04 +0200, Laurence Rowe wrote:
> 2009/5/2 Aaron Maxwell <amax <at> redsymbol.net>:
> > On Friday 01 May 2009 11:00:29 am John Lovell wrote:
> >> Aaron:
> >>
> >> It sounds to me like you could use an xpath query.
> >> rootElement.xpath("//*[local-name() = 'Child1']")
> >> http://codespeak.net/lxml/xpathxslt.html
> >
> > Thanks, that does work fine.
> >
> > My actual problem is somewhat more complex than the simplistic example I gave,
> > however.  The structure of the XML document is more like this (lots of the
> > actual document is excised):
> > {{{
> > <ItemLookupResponse
> > xmlns="http://webservices.amazon.com/AWSECommerceService/2008-04-07">
> >  <OperationRequest>
> >  <Items>
> >    <Item>
> >      <ASIN>0521545668</ASIN>
> >      <OfferSummary>
> >         (snip)
> >      </OfferSummary>
> >      <Offers>
> >        <Offer>
> >          <OfferListing>
> >            <Price>
> >              <Amount>7517</Amount>
> >            </Price>

Mary Lei | 7 May 02:27

how to get line,col position

How can I get dtd.validate to return the
line, column number for the xhtml in error?

Here is my code to validate an xhtml doc
against the dtd using lxml:

from lxml import etree

# no need to write a temp html file
with open('CoRoTHome.html', 'r') as CoRotHomeFile:
    contents = CoRotHomeFile.read()

# the .ent files are present next to the DTD
dtd1 = etree.DTD(file='xhtml1-transitional.dtd')

etree.clear_error_log()

root1 = etree.HTML(contents)
try:
    rc = dtd1.validate(root1)
except (etree.DTDValidateError, etree.DTDError) as e:
    print("e ", e)

print("dtd errors")
error_count = len(dtd1.error_log)  # renamed so the len() builtin is not shadowed
if error_count:
    error = dtd1.error_log[0]
    print("line", error.line)
    print("column", error.column)

print(dtd1.error_log)

If I use xmllint, I got column

Stefan Behnel | 7 May 07:17

Re: how to get line,col position

Hi,

Mary Lei wrote:
> How can I get dtd.validate to return the
> line, column number for the xhtml in error?

You can't if you use the HTML parser, that's a known bug in libxml2:

http://bugzilla.gnome.org/show_bug.cgi?id=580705

Note that this bug has a patch associated to it, which you can apply to
libxml2 to get what you want.

Otherwise, for parsing XHTML you should use the XML parser anyway, which
will track line numbers correctly.
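
As a rough illustration of the XML-parser route, the sketch below validates a tiny document against an inline DTD and reads the position of each error from the error log. The DTD and the document are invented for the example; with a real XHTML file the DTD would be loaded from the local xhtml1-transitional.dtd instead.

```python
from io import StringIO
from lxml import etree

# Invented two-element DTD, standing in for a real one.
dtd = etree.DTD(StringIO("<!ELEMENT root (child)>\n<!ELEMENT child EMPTY>"))

# <bogus/> is not declared in the DTD, so validation should fail.
# etree.parse uses the XML parser by default.
doc = etree.parse(StringIO("""<root>
  <child/>
  <bogus/>
</root>"""))

is_valid = dtd.validate(doc)
print("valid:", is_valid)

# Each log entry carries the position and message of one validation error.
for entry in dtd.error_log:
    print("line", entry.line, "column", entry.column, "-", entry.message)
```

Validation here runs on the finished tree, so how much column detail is available depends on what libxml2 recorded at parse time.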

> But if I apply xmllint, it gives the same messages but with positional info:
> /home/lei/python-stuff/CoRoTHome.html:83: HTML parser error : 
> htmlParseStartTag: invalid element name
> dedicated to asteroseismology of bright stars (typically V<10mag) and
>                                                             ^
> /home/lei/python-stuff/CoRoTHome.html:23: element tr: validity error : 
> standalone: tr declared in the external subset contains white spaces nodes
> ...
> Document /home/lei/python-stuff/CoRoTHome.html does not validate against 
> xhtml1-transitional.dtd

You didn't say if you used the HTML parser or the XML parser in xmllint. In
any case, xmllint does the DTD validation at parse time, where the line
information is still available. It only gets lost when building the tree,

Mary Lei | 7 May 19:28

Re: how to get line,col position

My responses are below:

Stefan Behnel wrote:
> Hi,
> 
> Mary Lei wrote:
>> How can I get dtd.validate to return the
>> line, column number for the xhtml in error?
> 
> You can't if you use the HTML parser, that's a known bug in libxml2:
> 
> http://bugzilla.gnome.org/show_bug.cgi?id=580705
> 
> Note that this bug has a patch associated to it, which you can apply to
> libxml2 to get what you want.
Where can I locate this patch?
> 
> Otherwise, for parsing XHTML you should use the XML parser anyway, which
> will track line numbers correctly.
Using the XML parser results in an error when it tries to load the DTD
from the network:
lxml.etree.XMLSyntaxError: Attempt to load network entity 
http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
If I turn off the option, I don't get anything from the parser.

I don't really want to load it each time, so I downloaded a copy with the
entities and decided to use etree.DTD.validate to validate against it
instead. But as mentioned, this does not give the line,col info.

If I use the XMLParser, I have an issue with
lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 24, column 13
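
One possible way around both the network lookup and the undefined nbsp entity is to hand the parser a local copy of the DTD through a custom resolver. The sketch below rests on several assumptions: the class name, the system identifier mini.dtd and the small DTD are all invented, and a real setup would more likely return the downloaded xhtml1-transitional.dtd via resolve_filename.

```python
from io import StringIO
from lxml import etree

# Invented stand-in for a local DTD copy; declares the element and the entity.
MINI_DTD = '<!ELEMENT p (#PCDATA)>\n<!ENTITY nbsp "&#160;">'

class LocalDTDResolver(etree.Resolver):
    """Hypothetical resolver that serves the DTD from memory."""
    def resolve(self, url, pubid, context):
        if url and url.endswith('mini.dtd'):
            return self.resolve_string(MINI_DTD, context)
        return None  # let lxml handle everything else the default way

# load_dtd makes the parser read the DTD; no_network keeps it offline.
parser = etree.XMLParser(load_dtd=True, no_network=True)
parser.resolvers.add(LocalDTDResolver())

xml = '<!DOCTYPE p SYSTEM "mini.dtd"><p>a&nbsp;b</p>'
root = etree.parse(StringIO(xml), parser).getroot()
print(repr(root.text))
```

Because the tree comes from the XML parser, elements also keep their sourceline, which may help when reporting where a problem sits.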