matt barto | 1 Jul 2010 22:01

html.fromstring returning encoded string from a non unicoded string source

Hello,

I am trying to obtain a title from a website that contains the trademark (™) and registered (®) signs, but the Unicode behavior is not what I expect.

tree = html.fromstring("<html><body><h1>Apple&#174; - iPad&#153; with Wi-Fi - 16GB</h1></body></html>")
print tree.text_content()
--------> print(tree.text_content())
Apple® - iPad™ with Wi-Fi - 16GB

I would expect print to output "Apple&#174; - iPad&#153; with Wi-Fi - 16GB", but it seems some encoding occurred during the tree creation.  The browser is able to handle these HTML character references correctly using the "iso-8859-1" charset (even though the document says "utf-8").

Can you provide some insight into how these two HTML character references are handled and what I can do to get the expected behavior? My lxml version is 2.2.2.

Thanks in advance.

Best,
Matt
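
A note on the behavior described above: an HTML parser resolves numeric character references such as &#174; into the characters they stand for, so text_content() returns the decoded text. The following is a minimal sketch, not an lxml feature, of turning non-ASCII characters back into numeric references with the standard str.encode error handler "xmlcharrefreplace". It uses Python 3 syntax, and the title string is typed in directly rather than parsed, which is an assumption made for illustration.

```python
# Decoded title, as text_content() would return it (typed in by hand here).
title = "Apple\u00ae - iPad\u2122 with Wi-Fi - 16GB"

# Replace every non-ASCII character with a decimal character reference.
escaped = title.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(escaped)  # Apple&#174; - iPad&#8482; with Wi-Fi - 16GB
```

The trademark sign comes back as &#8482; (its Unicode code point, U+2122) rather than &#153;, because 153 is the Windows-1252 byte for that sign and not its Unicode code point.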

_______________________________________________
lxml-dev mailing list
lxml-dev <at> codespeak.net
http://codespeak.net/mailman/listinfo/lxml-dev
Eugene Van den Bulke | 2 Jul 2010 10:49

regexp

Hi,

I am experimenting with web scraping using lxml.

I have played a little with BeautifulSoup in the past and scrapy recently.

I am recoding something I did with scrapy with lxml but encounter a
problem I am not sure how to iron out.

With scrapy, hxs is an xpath selector which has a select and re method

types = hxs.select('.//a[@href]/@href').re(r'type=([A-Z]*)')

Which will return a list of the matches in href.

How would I do the same thing with lxml?

types = doc.xpath('.//a[@href]/@href') ...

Thanks a lot,

-- 
EuGeNe -- I lend my books on COlivri http://www.colivri.org/user/eugene, do you?
Stefan Behnel | 2 Jul 2010 11:42

Re: regexp

Eugene Van den Bulke, 02.07.2010 10:49:
> I am experimenting with web scraping using lxml.
>
> I have played a little with BeautifulSoup in the past and scrapy recently.
>
> I am recoding something I did with scrapy with lxml but encounter a
> problem I am not sure how to iron out.
>
> With scrapy, hxs is an xpath selector which has a select and re method
>
> types = hxs.select('.//a[@href]/@href').re(r'type=([A-Z]*)')
>
> Which will return a list of the matches in href.
>
> How would I do the same thing with lxml?
>
> types = doc.xpath('.//a[@href]/@href') ...

http://lmgtfy.com/?q=lxml+regular+expressions&l=1

;-)

Stefan
Stefan Behnel | 2 Jul 2010 12:13

Re: regexp

Eugene Van den Bulke, 02.07.2010 11:53:
 > Stefan Behnel, 02.07.2010 11:42:
 >> Eugene Van den Bulke, 02.07.2010 10:49:
 >>> I am experimenting with web scraping using lxml.
 >>>
 >>> I have played a little with BeautifulSoup in the past and scrapy
 >>> recently.
 >>>
 >>> I am recoding something I did with scrapy with lxml but encounter a
 >>> problem I am not sure how to iron out.
 >>>
 >>> With scrapy, hxs is an xpath selector which has a select and re method
 >>>
 >>> types = hxs.select('.//a[@href]/@href').re(r'type=([A-Z]*)')
 >>>
 >>> Which will return a list of the matches in href.
 >>>
 >>> How would I do the same thing with lxml?
 >>>
 >>> types = doc.xpath('.//a[@href]/@href') ...

Note that this is redundant, './/a/@href' is enough.

 >> http://lmgtfy.com/?q=lxml+regular+expressions&l=1
 >
> I did read the doc before I took the liberty to post ... I am afraid I
> just don't get it.

Personally, I wouldn't even use XPath regular expressions here. I'd rather 
do something like this:

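     from lxml import html
     import re

     # Bound method: returns the list of captured 'type=' values in a string
     parse_type_value = re.compile(r'type=([A-Z]*)').findall

     # the_file is a placeholder for a file name or file object
     root = html.parse(the_file).getroot()

     # iterlinks() yields (element, attribute, link, pos) for every link
     for el, attr, link, pos in root.iterlinks():
         if 'type=' in link:
             print el.tag, parse_type_value(link)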

Note that this will give you all links, not only those in <a> href's. If 
you really only want those, the XPath expression above will do just fine.

Stefan
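
For the XPath route mentioned above, the attribute values come back as a plain list of strings, so an ordinary compiled regular expression can be applied to them. A minimal sketch in Python 3 syntax follows; the helper name extract_types and the sample href values are hypothetical, and the list is written out by hand so the snippet runs without lxml.

```python
import re

# Same pattern as in the scrapy example.
TYPE_RE = re.compile(r'type=([A-Z]*)')

def extract_types(hrefs):
    """Return every captured 'type=' value from an iterable of href strings."""
    matches = []
    for href in hrefs:
        matches.extend(TYPE_RE.findall(href))
    return matches

# Hypothetical sample data standing in for the result of an XPath query.
hrefs = ["/list?type=BOOK&page=2", "/about", "/list?type=DVD"]
print(extract_types(hrefs))  # ['BOOK', 'DVD']
```

With a parsed document, the call would presumably look like extract_types(doc.xpath('.//a/@href')).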
Eugene Van den Bulke | 2 Jul 2010 12:18

Re: regexp

On Fri, Jul 2, 2010 at 8:13 PM, Stefan Behnel <stefan_ml <at> behnel.de> wrote:
>>>> types = doc.xpath('.//a[@href]/@href') ...
>
> Note that this is redundant, './/a/@href' is enough.

I am discovering XPath as well, as you can tell :P

> Personally, I wouldn't even use XPath regular expressions here. I'd rather
> do something like this:
>
>    from lxml import html
>    import re
>
>    parse_type_value = re.compile(r'type=([A-Z]*)').findall
>
>    root = html.parse(the_file).getroot()
>
>    for el, attr, link, pos in root.iterlinks():
>        if 'type=' in link:
>             print el.tag, parse_type_value(link)
>
> Note that this will give you all links, not only those in <a> href's. If you
> really only want those, the XPath expression above will do just fine.
>
> Stefan

Thanks !

-- 
EuGeNe -- I lend my books on COlivri http://www.colivri.org/user/eugene, do you?
Paul Girard | 2 Jul 2010 12:20

mac os x installation process

Hi dear people of lxml,

I wrote a small library in Python that uses lxml to generate graph files in 
a specific format, GEXF.
This little thing is called pygexf:
http://packages.python.org/pygexf/
http://github.com/paulgirard/pygexf

First, thanks for your great work on lxml, I am loving it!

Second, I am having serious problems installing lxml on Mac OS X.
I read that it hasn't been completely ported yet, but some workarounds 
with static libs seem to have worked for some people.

I will focus only on this method: STATIC_DEPS=true sudo easy_install lxml

I am having problems with gcc.
I have 3 different versions of gcc:
gcc-4: installed by fink (gcc42)
gcc-4.0
gcc-4.2: both from the Xcode Mac OS X tools

Note: I am pointing the gcc link at each version in turn to test all 
of them.

Here are the errors I get with the various gcc versions:

gcc-4

$ls -la /usr/bin/gcc
lrwxr-xr-x  1 root  wheel  13  2 jul 11:43 /usr/bin/gcc -> /sw/bin/gcc-4

$STATIC_DEPS=true sudo easy_install lxml
searching for lxml
[...]
Building against libxml2/libxslt in the following directory: /usr/lib
gcc: unrecognized option '-no-cpp-precomp'
cc1: error: unrecognized command line option "-mno-fused-madd"
cc1: error: unrecognized command line option "-arch"
cc1: error: unrecognized command line option "-arch"
cc1: error: unrecognized command line option "-Wno-long-double"
error: Setup script exited with error: command 'gcc' failed with exit status 1

gcc-4.2

$ STATIC_DEPS=true sudo easy_install lxml
Searching for lxml
[...]
Using build configuration of libxslt 1.1.12
Building against libxml2/libxslt in the following directory: /usr/lib
cc1: error: unrecognized command line option "-Wno-long-double"
cc1: error: unrecognized command line option "-Wno-long-double"
lipo: can't open input file: /var/tmp//ccDn5F16.out (No such file or directory)
error: Setup script exited with error: command 'gcc' failed with exit status 1

The error list for gcc-4.0 is huge.
I won't post it here, but I can if necessary.

So I am facing build problems, and 
I am not an expert in this kind of thing.

I usually develop on Linux (Ubuntu), but many users of my small lib 
(including me) are Mac users.
I don't really want to change my code to use another XML lib, so I hope 
I'll finally find a way.

If anyone can help with this issue, it would be more than great.

Thanks for reading.

Paul

PS: I couldn't try the darwin port method because I couldn't understand 
how to use it.

-- 
Paul Girard
responsable numérique médialab
paul.girard <at> sciences-po.fr
01 45 49 63 58
médialab | Sciences Po
medialab.sciences-po.fr <http://medialab.sciences-po.fr>
13 rue de l'université
75007 PARIS
Adam Bielański | 2 Jul 2010 17:47

Fwd: iterparse() doesn't do dtd_validation

Hi,

Can anyone help me validate an XML file against an internal DTD with
iterparse()? I just can't make iterparse() use the dtd_validation flag. I
ended up with two execution paths: one that uses etree.XMLParser and
finds all errors in the validated file, and another that uses iterparse
and just prints all elements from the XML file. Below is the code I used;
the sample XML and DTD files are attached. Just to make it clear:
I'm using lxml version 2.2.4 on Python 2.6.5.

The XmlWithDTDStream class (used in the code below) behaves like a
stream, returning the XML declaration (the first line of a well-formed
XML file), the DTD string, and then the rest of the XML file. It works
correctly, since you can see that XMLParser reports the expected errors
and iterparse returns the actual elements from input.xml.
<pre><code>

from lxml import etree
if __name__ == "__main__":
      print "XMLParser"
      with open("internalSchema.dtd") as dtdFile:
          dtd = dtdFile.read()
          stream = XmlWithDTDStream(dtd, 'input.xml')
          parser = etree.XMLParser(dtd_validation=True, load_dtd=True)
          try:
              root = etree.XML(stream.read(32768), parser)
          except Exception, e:
              print "An exception:: ", e
              print "Error log:: ", parser.error_log
      print "iterparse: "
      with open("internalSchema.dtd") as dtdFile:
          dtd = dtdFile.read()
          stream = XmlWithDTDStream(dtd, 'input.xml')
          for aTuple in etree.iterparse(stream, dtd_validation=True,
                                        load_dtd=True):
              print aTuple
</code></pre>

The result is then as follows:
<pre>
XMLParser
An exception::  No declaration for attribute badAttribute of element
pName, line 23, column 26
Error log::<string>:23:26:ERROR:VALID:DTD_UNKNOWN_ATTRIBUTE: No
declaration for attribute badAttribute of element pName
<string>:25:7:ERROR:VALID:DTD_UNKNOWN_ELEM: No declaration for element
property--badName
<string>:26:5:ERROR:VALID:DTD_CONTENT_MODEL: Element propertyGroup
content does not follow the DTD, expecting (property)+, got
(property--badName )
<string>:29:29:ERROR:VALID:DTD_UNKNOWN_ELEM: No declaration for element name

iterparse:
(u'end',<Element pName at 13e6450>)
(u'end',<Element pValue at 13e64e0>)
(u'end',<Element property--badName at 13e6510>)
(u'end',<Element propertyGroup at 13e6540>)
(u'end',<Element name at 13e6570>)
(u'end',<Element attributeInstance at 13e65a0>)
(u'end',<Element attributeGroup at 13e65d0>)
(u'end',<Element rootElement at 13e63f0>)
</pre>

I hope someone can assist me, or confirm that this is a bug in lxml. My
issue comes from the fact that I need to validate files that are too
large to be read into memory using etree.XML(). If there is a way I
could do it without iterparse, I'd gladly learn it.

Best regards,
      Adam.

Attachment (input.xml): text/xml, 418 bytes
Attachment (internalSchema.dtd): application/xml-dtd, 602 bytes
Stefan Behnel | 2 Jul 2010 17:57

Re: Fwd: iterparse() doesn't do dtd_validation

Adam Bielański, 02.07.2010 17:47:
> Can anyone help me with validating XML file against internal DTD with
> iterparse()? I just can't make iterparse() use dtd_validation flag.
 > [...]
 > <?xml version="1.0" encoding="utf-8"?>
 > <rootElement version="3">
 > ...

To reference a DTD, your XML document needs a DOCTYPE declaration.

http://xmlsoft.org/xmldtd.html

Once that's in the document, the "dtd_validation" flag should work.

Stefan
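
To illustrate the point about the DOCTYPE declaration, here is a minimal sketch of what a document with an internal DTD subset looks like once the declaration is in place. The helper add_internal_dtd and the DTD text are hypothetical (the attached internalSchema.dtd is not shown in the thread); only the rootElement name and the XML declaration come from the quoted document. It works on whole strings and uses Python 3 syntax, so it shows the layout only and is not a streaming solution for large files.

```python
def add_internal_dtd(xml_text, root_name, dtd_text):
    """Insert <!DOCTYPE root [ ... ]> after the XML declaration, if any."""
    doctype = "<!DOCTYPE %s [\n%s\n]>\n" % (root_name, dtd_text.strip())
    if xml_text.startswith("<?xml"):
        # Keep the XML declaration first; the DOCTYPE must follow it.
        end = xml_text.index("?>") + 2
        return xml_text[:end] + "\n" + doctype + xml_text[end:].lstrip()
    return doctype + xml_text

# Assumed sample inputs, for illustration only.
xml_text = '<?xml version="1.0" encoding="utf-8"?>\n<rootElement version="3"/>'
dtd_text = ('<!ELEMENT rootElement EMPTY>\n'
            '<!ATTLIST rootElement version CDATA #REQUIRED>')

result = add_internal_dtd(xml_text, "rootElement", dtd_text)
print(result)
```

The name after <!DOCTYPE has to match the document's root element, and the declarations sit between the square brackets.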