Ian Bicking | 1 Jul 02:48

Re: Segmentation fault in lxml.html after pickling

Stefan Behnel wrote:
> Martijn Faassen wrote:
>> I'd love it if I could somehow store lxml trees in the ZODB, and that'd
>> need pickle support. Whether it could be made to be efficient I don't
>> know - you'd not want the whole tree to be pickled as a whole in case of
>> large trees, but some form of partitioning scheme into separate pickles.
>> You're right that custom-element binding would be nice in this case, and
>> that means the pickle can't simply be the XML content unless it's
>> somehow annotated first.
>>
>> Anyway, this is a rather out there use case. I am just intrigued to
>> learn that objectify elements can be pickled.
> 
> It's just easier to do in objectify, as it has a pretty comprehensive
> setup for Element class mapping. If you want to be sure to get back
> exactly the same Element tree after pickling, you can just annotate() an
> objectify tree before pickling it.
> 
> Doing the same thing in lxml.etree would require storing some information
> about the current Element lookup, which may be a lot of information, e.g.
> for the namespace class setup. That's a parser-local setup, so we can't
> just use the setup of the default parser either but need a concrete
> context for the unpickling.
> 
> lxml.html might be considered having such a context in a similar way
> lxml.objectify has it, as it comes with its own classes and lookup scheme.

Just what would end up being pickled, do you think?  The entire document?

A first thought is that the document gets pickled, and then the element 

Mike MacCana | 1 Jul 05:13

Premature end of data in tag - but it looks well formed

Hi gents,

Firstly, thanks for lxml. It's by far the nicest tool for someone who
needs to do xpath in python without being an XML god.

I'm a first time user of lxml attempting to etree.parse a document. My
code (below) works fine on some sample text, but libxml complains about
the real data with:

etree.XMLSyntaxError: line 196: Premature end of data in tag html line 5

The data is below. Line 5 seems OK to me, but I'm new to XML coding so
maybe I'm missing something.
__________________________________
1
2
3 <?xml version="1.0" encoding="iso-8859-1"?>
4 <!DOCTYPE html PUBLIC"-//W3C//DTD XHTML 1.0
Transitional//EN""http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
5 <html xmlns="http://www.w3.org/1999/xhtml">
__________________________________

Any ideas? The full code is below.

Cheers,

Mike

#!/usr/bin/env python
import urllib, sys, lxml, StringIO

Stefan Behnel | 1 Jul 07:20

Re: namespace strangeness in lxml 1.1

Hi,

Eric Jahn wrote:
> type="{http://domain2.info}someattribute
>
> element = etree.Element(NS2 + "secondelement", nsmap=NSMAP, type = NS2 +
> "someattribute")

You are setting a namespace as attribute /value/ here, not as attribute
/name/. lxml will not modify content unless you tell it to do so. If you want
it to replace the namespace by a resolved prefix, use

    type = etree.QName(NS2 + "...")

If it's just a mistake and you wanted to set the attribute /namespace/
instead, pass

    attrib = {NS2 + "someattribute" : "somevalue"}

to the Element factory. There should also be a section on this in the tutorial
IIRC.
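A minimal sketch of the two cases in Python 3, to make the difference concrete. The namespace URI comes from the quoted example; the "ns2" prefix, the NSMAP contents and the value "somevalue" are assumptions for illustration, and the QName is applied through set() here rather than through the factory keyword.

```python
from lxml import etree

NS2 = "{http://domain2.info}"
NSMAP = {"ns2": "http://domain2.info"}  # assumed prefix mapping

# Case 1: the namespace belongs to the attribute *name*.
element = etree.Element(
    NS2 + "secondelement",
    nsmap=NSMAP,
    attrib={NS2 + "someattribute": "somevalue"},
)

# Case 2: the namespace belongs to the attribute *value*.
# Wrapping the value in QName asks lxml to resolve it to a prefixed name.
element.set("type", etree.QName(NS2 + "someattribute"))

print(etree.tostring(element).decode())
```

A plain string value would be stored untouched, Clark notation and all, which is the behaviour described above.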

Stefan

Stefan Behnel | 1 Jul 07:24

Re: Premature end of data in tag - but it looks well formed

Hi,

Mike MacCana wrote:
> Hi gents,

Are you sure you don't want advice from any girls?

> I'm a first time user of lxml attempting to etree.parse a document. My
> code (below) works fine on some sample text, but libxml complains about
> the real data with:
> 
> etree.XMLSyntaxError: line 196: Premature end of data in tag html line 5
> 
> The data is below. Line 5 seems OK to me, but I'm new to XML coding so
> maybe I'm missing something.

The problem is not in line 5 (where the html tag starts) but in line 196,
where it apparently ends. Try validating it at the W3C validator if you don't
believe lxml. ;)
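For anyone who wants to see the same class of error on a small input, here is a sketch in Python 3. The document below is made up: its html tag opens on line 1 and is never closed, so the parser only notices the problem when the data runs out.

```python
from lxml import etree

# Hypothetical truncated document: <html> is opened but never closed.
truncated = b"<html>\n  <body>\n    <p>unfinished</p>\n  </body>\n"

error = None
try:
    etree.fromstring(truncated)
except etree.XMLSyntaxError as exc:
    error = exc
    # The message names the tag that was left open and where it started;
    # exc.lineno is where the parser gave up.
    print(exc)
    print("parser stopped at line", exc.lineno)
```

The line reported for the tag is where it was opened, so the place to look is the end of the data, not the opening tag.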

Stefan

Stefan Behnel | 1 Jul 08:35

Re: Segmentation fault in lxml.html after pickling

Ian Bicking wrote:
> A first thought is that the document gets pickled, and then the element
> is an offset in that document.

That's a brilliant idea, but why so complicated? :)

pickle:
    doc = self.getroottree()
    return (tostring(doc), doc.getpath(self))

unpickle:
    doc, path = pickle_value
    return doc.xpath(path)

would do the trick. Maybe we should serialise as XML instead of HTML, so
that we don't run into any "relaxed parser" problems (I remember a not so
old libxml2 HTML serialiser bug with <embed> roundtrips, for example).
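Spelled out as runnable Python 3, the idea might look roughly like this. It is only a sketch: element_to_state and element_from_state are hypothetical helper names, not lxml API, and the document is serialised as XML and re-parsed with lxml.html on the way back.

```python
import pickle

from lxml import etree, html


def element_to_state(element):
    """Reduce an element to (serialised document, path to the element)."""
    tree = element.getroottree()
    return etree.tostring(tree), tree.getpath(element)


def element_from_state(state):
    """Re-parse the document and look the element up again by its path."""
    doc_bytes, path = state
    root = html.document_fromstring(doc_bytes)
    return root.getroottree().xpath(path)[0]


root = html.document_fromstring(
    "<html><body><p>one</p><p>two</p></body></html>")
second = root.xpath("//p")[1]

payload = pickle.dumps(element_to_state(second))
restored = element_from_state(pickle.loads(payload))
print(restored.tag, restored.text)
```

Note that every pickled element carries a copy of the whole document, which is the cost of this approach for large trees.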

> There is no return value for __setstate__, and no way to indicate a
> constructor method for creating instances.  That's dumb.  I don't like
> pickle.

:)

You don't have to use __[sg]etstate__(). You can define an external
function to do it for you, just like objectify does (search
src/lxml/lxml.objectify.pyx for "pickle"). The stupid thing is that this
function has to be registered /and/ public. It's not enough to register it
and delete it afterwards...


jholg | 1 Jul 09:37

Re: objectify.deannotate: call to etree.cleanup_namespaces in 2.1beta

Hi,

Holger Joukl wrote:
> I have a usecase where I need to deannotate an objectified tree
> and then manually set py:pytype or xsi:type attributes.
>
> However, this seems to be getting difficult with 2.1beta as deannotate
> wipes out all nsmap information with its call to cleanup_namespaces(),
> and I cannot set a namespaced attribute through <elt>.set(...)

Just to be precise: I mean a namespaced attribute value like "xsd:string".
It is easy to set a namespace-qualified attribute name using Clark notation,
as anywhere in lxml.
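A small sketch of that distinction in Python 3. The element names and the text are made up; the two namespace URIs are the standard XML Schema ones.

```python
from lxml import etree

XSI = "http://www.w3.org/2001/XMLSchema-instance"
XSD = "http://www.w3.org/2001/XMLSchema"

# Both prefixes are declared on the root element.
root = etree.Element("root", nsmap={"xsi": XSI, "xsd": XSD})
child = etree.SubElement(root, "value")
child.text = "42"

# Attribute *name* in Clark notation; lxml picks the "xsi" prefix itself.
# The *value* "xsd:string" is plain text to lxml, so it only makes sense
# while the xmlns:xsd declaration is still present in the tree.
child.set("{%s}type" % XSI, "xsd:string")

print(etree.tostring(root).decode())
```

This is why losing the nsmap matters: the attribute name can always be re-declared, but a prefixed value depends on a declaration that nothing else keeps alive.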

 

> could we make the call to cleanup_namespaces optional (defaults
> to True) in deannotate()?

I wasn't entirely sure if it was a good idea when I added it. I guess it's
best to keep it out or make it optional (default False).

I'll remove it, then.

Rationale: lxml does a good job of keeping namespace declarations clean when
adding elements to a tree anyway, so with objectify's default nsmap, the
namespace declarations for xsi:type and py:pytype are usually located at the
root element only.

Anyone who needs a really clean document can still conveniently call
etree.cleanup_namespaces() after deannotate().
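As a sketch of that two-step cleanup in Python 3 (the sample document is made up, and annotate() is only called here to have something to remove):

```python
from lxml import etree, objectify

root = objectify.fromstring("<root><a>1</a><b>text</b></root>")

# Add py:pytype annotations, then strip them again.
objectify.annotate(root)
annotated = etree.tostring(root)
objectify.deannotate(root)

# Explicit second step for anyone who wants unused namespace
# declarations removed as well.
etree.cleanup_namespaces(root)
result = etree.tostring(root)
print(result.decode())
```

Keeping the second call explicit leaves room to set xsi:type or py:pytype by hand between the two steps.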

 

Holger




_______________________________________________
lxml-dev mailing list
lxml-dev <at> codespeak.net
http://codespeak.net/mailman/listinfo/lxml-dev

jholg | 1 Jul 10:48

Re: objectify.deannotate: call to etree.cleanup_namespaces in 2.1beta


 

> could we make the call to cleanup_namespaces optional (defaults
> to True) in deannotate()?

I wasn't entirely sure if it was a good idea when I added it. I guess it's
best to keep it out or make it optional (default False).

I'll remove it, then.

Rationale: lxml does a good job of keeping namespace declarations clean when
adding elements to a tree anyway, so with objectify's default nsmap, the
namespace declarations for xsi:type and py:pytype are usually located at the
root element only.

Anyone who needs a really clean document can still conveniently call
etree.cleanup_namespaces() after deannotate().

 

Committed to trunk (revision 56199):

 

$ svn diff -r55702:56199 src/lxml/lxml.objectify.pyx
Index: src/lxml/lxml.objectify.pyx
===================================================================
--- src/lxml/lxml.objectify.pyx (revision 55702)
+++ src/lxml/lxml.objectify.pyx (revision 56199)
@@ -1752,7 +1752,6 @@
             cetree.delAttributeFromNsName(
                 c_node, _XML_SCHEMA_INSTANCE_NS, "type")
         tree.END_FOR_EACH_ELEMENT_FROM(c_node)
-    etree.cleanup_namespaces(element)
 
 
 ################################################################################




Mike MacCana | 1 Jul 11:12

A quick and simple xpath solution for nasty HTML (was Re: Premature end of data in tag - but it looks well formed)

Ladies and gentlemen,

On Tue, 2008-07-01 at 07:24 +0200, Stefan Behnel wrote:
> Hi,
> 
> Mike MacCana wrote:
> > Hi gents,
> 
> Are you sure you don't want advice from any girls?
> 
> 
> > I'm a first time user of lxml attempting to etree.parse a document. My
> > code (below) works fine on some sample text, but libxml complains about
> > the real data with:
> > 
> > etree.XMLSyntaxError: line 196: Premature end of data in tag html line 5
> > 
> > The data is below. Line 5 seems OK to me, but I'm new to XML coding so
> > maybe I'm missing something.
> 
> The problem is not in line 5 (where the html tag starts) but in line 196,
> where it apparently ends. Try validating it at the W3C validator if you
> don't believe lxml. ;)

Thanks Stefan.

I solved the crap HTML problem as follows. Hopefully the following will
be useful to anyone beginning XPath with lxml.

#!/usr/bin/env python
import urllib, sys, lxml, StringIO, lxml.html, os

from lxml import etree
from StringIO import StringIO
from lxml.html.clean import Cleaner

## Point this at your XP VM used to get to Telstra
proxies = {'http': 'http://xpvm:3128'}
url = 'http://domain.com/page'

## Function to strip non-ascii characters
## See http://en.wikipedia.org/wiki/Ascii#ASCII_printable_characters
## for list
def onlyascii(char):
	if ord(char) < 32 or ord(char) > 176:
		return ''
	else:
		return char

## Open the URL and read its contents
filehandle = urllib.urlopen(url, proxies=proxies)
html = filehandle.read()
## filter() keeps only the characters for which onlyascii() is non-empty
asciihtml = filter(onlyascii, html)

## Customer's HTML content is REALLY bad. Clean it.
## See http://codespeak.net/lxml/lxmlhtml.html#cleaning-up-html
## and 'pydoc lxml.html.clean.Cleaner'

## Clean HTML and strip a bunch of tags that are broken and that we don't
## care about.
badtags = ['img','a','div','span','h2','h1','style','title','ul','li','col']
cleaner = Cleaner(page_structure=False, links=False,
                  remove_tags=badtags)

## We can now access our cleaned content as 'cleanedcontent'
cleanedcontent = cleaner.clean_html(asciihtml)

## Save clean content to disk for debugging purposes
## (only remove the old file if there is one)
if os.path.exists('debug.html'):
	os.remove('debug.html')
outputfile = open('debug.html', 'w')
outputfile.write(cleanedcontent)
outputfile.close()

## Go parse our content, passing the recovering parser to etree.parse
cleanedcontentstringio = StringIO(cleanedcontent)
parser = etree.XMLParser(recover=True)
tree = etree.parse(cleanedcontentstringio, parser)

## XPath locations of what we're interested in (element zero is all we
## care about).
## text is the text within the tags, and strip off any whitespace
## You can find XPath locations by loading up 'debug.html' in Firefox
## with the Firebug extension
name = tree.xpath('/html/body/table/tbody/tr/td')[0].text.strip()
email = tree.xpath('/html/body/table/tbody/tr[7]/td')[0].text.strip().lower()

print name+","+email

Cheers,

Mike

________________________________________________
Mike MacCana
Technical Specialist
Australia Linux and Virtualisation Services

IBM Global Services
Level 14, 60 City Rd
Southgate Vic 3000 

Phone: +61-3-8656-2138
Fax: +61-3-8656-2423
Email: mmaccana <at> au1.ibm.com

Stefan Behnel | 1 Jul 13:51

Re: objectify.deannotate: call to etree.cleanup_namespaces in 2.1beta

jholg <at> gmx.de wrote:
>>> could we make the call to cleanup_namespaces optional (defaults
>>> to True) in deannotate()?
>>>
>>> I wasn't entirely sure if it was a good idea when I added it. I guess
>>> it's best to keep it out or make it optional (default False).
>>
>> I'll remove it, then.

Ok, thanks.

Stefan

Stefan Behnel | 1 Jul 14:03

Re: A quick and simple xpath solution for nasty HTML (was Re: Premature end of data in tag - but it looks well formed)

Hi,

Mike MacCana wrote:
> I solved the crap HTML problem as follows. Hopefully the following will
> be useful to anyone beginning XPath with lxml.

Just adding a few comments as I see fit.

> ## Function to strip non-ascii characters
> ## See http://en.wikipedia.org/wiki/Ascii#ASCII_printable_characters
> ## for list
> def onlyascii(char):
> 	if ord(char) < 32 or ord(char) > 176:
> 		return ''
> 	else:
> 		return char

Note that this will not work as expected with multi-byte encodings such as
UTF-8.
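To illustrate with a made-up input, here is a Python 3 sketch of the same byte-wise filter (same 32..176 range as the quoted function) applied to UTF-8 data, next to a variant that decodes first. The sample word is an assumption chosen to contain one two-byte character.

```python
def keep(value):
    """Same range check as the quoted onlyascii(), on an integer."""
    return 32 <= value <= 176


raw = "café".encode("utf-8")  # 'é' becomes the two bytes C3 A9

# Byte-wise filtering drops C3 but keeps A9, splitting the character.
filtered = bytes(b for b in raw if keep(b))

try:
    filtered.decode("utf-8")
    decodes_cleanly = True
except UnicodeDecodeError:
    decodes_cleanly = False

# Decoding first and filtering whole characters avoids the split.
safe = "".join(ch for ch in raw.decode("utf-8") if keep(ord(ch)))

print(filtered, decodes_cleanly, safe)
```

With a single-byte encoding such as ISO-8859-1 each byte is one character, so the byte-wise version behaves as intended there.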

> ## We can now access our cleaned content as 'cleanedcontent'
> cleanedcontent=cleaner.clean_html(asciihtml)

This will (obviously) parse the HTML into a tree internally, so it's more
efficient to pass a parsed tree directly.

> ## Go parse our content
> cleanedcontentstringio = StringIO(cleanedcontent)
> parser = etree.XMLParser(recover=True)
> tree = etree.parse(cleanedcontentstringio, parser)

I wonder why you use an XML parser here. The HTML parser will likely work
better, as it knows about self-closing HTML tags.
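A sketch of what that could look like in Python 3. The markup, the name and the email address are made up; the input has an unclosed br tag and is missing its closing body and html tags on purpose.

```python
from lxml import html

# Hypothetical sloppy page: <br> is never closed, </body></html> are missing.
page = (
    "<html><body><table>"
    "<tr><td> Jane Doe <br></td></tr>"
    "<tr><td>JANE@EXAMPLE.COM</td></tr>"
    "</table>"
)

root = html.fromstring(page)

# Relative paths avoid depending on a <tbody> that browsers insert
# but the parser does not.
name = root.xpath("//tr[1]/td")[0].text.strip()
email = root.xpath("//tr[2]/td")[0].text.strip().lower()

print(name + "," + email)
```

The same tree can then be reused for cleaning and for the XPath lookups, so the page is only parsed once.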

Stefan
