Re: HTML Processing
Gloria W <strangest <at> comcast.net>
2007-05-08 02:35:50 GMT
Ar18 <at> comcast.net wrote:
> I would like to investigate (and possibly implement) the use of Python for
> processing HTML pages.
>
> The actual work would look something like this:
> * Retrieve pages from the net that are in any number of formats such as XML, XHTML, HTML, and HTML with
> major errors in it
> * Create a usable DOM for the files (considering the fact that they may have malformed HTML) OR... extract
> the stuff I need directly from the potentially malformed HTML.
> * If the DOM route is used, then I would need something to retrieve stuff from certain areas of the DOM.
> Additional features needed:
>
> I wonder, is this a good place to talk about this?
>
> I know the goal is XML, but I think this still fits. What libraries should I be looking into to do things like
> this? I would prefer to look at all the options, if possible.
>
>
I wrote an application to do just this. I found that the existing
xml.dom module had some serious bugs, had not been touched since 2004,
and had no easy way of creating and inserting subtrees in the DOM, or
working with subsets of the DOM. It looks like it was written, then
abandoned for some reason. Not sure why.
I tried to use ElementTree from effbot, but also with no success. It
is not DOM compliant, and its nesting is odd. For example, text
appearing after a <p>...</p> tag on the same line is stuffed into a
'tail' attribute of the same node, instead of being made into a sibling
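
For illustration, here is a minimal sketch of the 'tail' behaviour
described above. It assumes the xml.etree.ElementTree module from the
standard library, which exposes the same text/tail model as effbot's
ElementTree; the sample markup and the variable names are made up for
this example and are not from the application mentioned above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: text follows the closing </p> tag inside the same parent.
root = ET.fromstring("<div><p>first</p> trailing text<b>bold</b></div>")

p = root.find("p")

# Text inside the element is stored on .text ...
inner = p.text

# ... while text that follows the closing tag is stored on the same
# element's .tail, not as a separate sibling text node as in the DOM.
after = p.tail

# The parent therefore has only two children (<p> and <b>), no text node.
child_count = len(root)

# itertext() walks .text and .tail in document order, which recovers
# the character data as it appeared in the source.
flattened = "".join(root.itertext())

print(repr(inner), repr(after), child_count, repr(flattened))
```

In a DOM tree the same markup would give the <div> three children: the
<p> element, a text node, and the <b> element. With ElementTree, code
that walks the children has to check each child's tail as well, or use
itertext(), to avoid silently dropping that text.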