A quick and simple xpath solution for nasty HTML (was Re: Premature end of data in tag - but it looks well formed)
Mike MacCana <mmaccana <at> au1.ibm.com>
2008-07-01 09:12:58 GMT
Ladies and gentlemen,
On Tue, 2008-07-01 at 07:24 +0200, Stefan Behnel wrote:
> Hi,
>
> Mike MacCana wrote:
> > Hi gents,
>
> Are you sure you don't want advice from any girls?
>
>
> > I'm a first time user of lxml attempting to etree.parse a document. My
> > code (below) works fine on some sample text, but libxml complains about
> > the real data with:
> >
> > etree.XMLSyntaxError: line 196: Premature end of data in tag html line 5
> >
> > The data is below. Line 5 seems OK to me, but I'm new to XML coding so
> > maybe I'm missing something.
>
> The problem is not in line 5 (where the html tag starts) but in line 196,
> where it apparently ends. Try validating it at the W3C validator if you
> don't believe lxml. ;)
Thanks Stefan.
I solved the crap HTML problem with the script below. Hopefully it will
be useful to anyone beginning XPath with lxml.
#!/usr/bin/env python
import os
import urllib
from StringIO import StringIO
from lxml import etree
from lxml.html.clean import Cleaner
## Point this at your XP VM used to get to Telstra
proxies = {'http': 'http://xpvm:3128'}
url='http://domain.com/page'
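## Function to strip non-ASCII and control characters.
## Printable ASCII runs from 32 (space) to 126 (~); see
## http://en.wikipedia.org/wiki/Ascii#ASCII_printable_characters
## Tab, newline and carriage return are kept so that line breaks survive.
def onlyascii(char):
    if char in '\t\n\r' or 32 <= ord(char) <= 126:
        return char
    else:
        return ''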
## Open the URL and read its contents
filehandle = urllib.urlopen(url, proxies=proxies)
html=filehandle.read()
asciihtml=filter(onlyascii, html)
## Customer's HTML content is REALLY bad. Clean it.
## See http://codespeak.net/lxml/lxmlhtml.html#cleaning-up-html
## and 'pydoc lxml.html.clean.Cleaner'
## Clean HTML and strip a bunch of tags that are broken and that we don't
## care about.
badtags=['img','a','div','span','h2','h1','style','title','ul','li','col']
cleaner = Cleaner(page_structure=False, links=False,
                  remove_tags=badtags)
## We can now access our cleaned content as 'cleanedcontent'
cleanedcontent=cleaner.clean_html(asciihtml)
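## Save clean content to disk for debugging purposes.
## Only remove the old file if it exists, otherwise os.remove raises OSError.
if os.path.exists('debug.html'):
    os.remove('debug.html')
outputfile = open('debug.html','w')
outputfile.write(cleanedcontent)
outputfile.close()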
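## Go parse our content. recover=True tells libxml2 to carry on past
## errors such as unclosed tags. The parser has to be passed to
## etree.parse, otherwise the default strict parser is used.
cleanedcontentstringio = StringIO(cleanedcontent)
parser = etree.XMLParser(recover=True)
tree = etree.parse(cleanedcontentstringio, parser)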
## XPath locations of what we're interested in (element zero is all we
## care about). .text is the text within the tags, and strip() takes off
## any surrounding whitespace.
## You can find XPath locations by loading up 'debug.html' in Firefox
## with the Firebug extension
name = tree.xpath('/html/body/table/tbody/tr/td')[0].text.strip()
email = tree.xpath('/html/body/table/tbody/tr[7]/td')[0].text.strip().lower()
print name+","+email
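
The script above targets Python 2. For anyone on Python 3, here is a rough sketch of the same idea on a tiny inline document rather than a real URL. It is only an illustration under stated assumptions: the sample markup, the cell positions and the helper names (SAMPLE, keep_printable_ascii, extract, strict_parse_fails) are made up for this example, and the Cleaner step is left out. It shows two things: dropping bytes outside printable ASCII, and handing a parser created with recover=True to etree.parse so that a missing closing tag no longer aborts the parse.

```python
from io import BytesIO

from lxml import etree


def keep_printable_ascii(data: bytes) -> bytes:
    """Drop bytes outside printable ASCII (32..126), keeping tab, LF and CR."""
    allowed = set(range(32, 127)) | {9, 10, 13}
    return bytes(b for b in data if b in allowed)


# Hypothetical sample: the closing </html> tag is missing and a stray
# non-ASCII byte (0xA0) sits inside the first cell.
SAMPLE = (
    b"<html><body><table>\n"
    b"<tr><td> Jane\xa0 Doe </td></tr>\n"
    b"<tr><td> Jane.Doe@Example.COM </td></tr>\n"
    b"</table></body>\n"
)


def strict_parse_fails(raw: bytes) -> bool:
    """Return True if the default (strict) XML parser rejects the input."""
    try:
        etree.parse(BytesIO(keep_printable_ascii(raw)))
    except etree.XMLSyntaxError:
        return True
    return False


def extract(raw: bytes):
    """Parse leniently and pull the name and e-mail cells out via XPath."""
    cleaned = keep_printable_ascii(raw)
    # The recovering parser must be passed to etree.parse to have any effect.
    parser = etree.XMLParser(recover=True)
    tree = etree.parse(BytesIO(cleaned), parser)
    name = tree.xpath('/html/body/table/tr/td')[0].text.strip()
    email = tree.xpath('/html/body/table/tr[2]/td')[0].text.strip().lower()
    return name, email


if __name__ == "__main__":
    print("%s,%s" % extract(SAMPLE))
```

The XPath expressions here have no tbody step because the sample markup has none. Firefox adds tbody to the tree it displays, so paths copied from Firebug may need adjusting to match what is actually in the file.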
Cheers,
Mike
________________________________________________
Mike MacCana
Technical Specialist
Australia Linux and Virtualisation Services
IBM Global Services
Level 14, 60 City Rd
Southgate Vic 3000
Phone: +61-3-8656-2138
Fax: +61-3-8656-2423
Email: mmaccana <at> au1.ibm.com