Stefan Behnel | 17 Apr 22:16 2006

Re: Re: HTMLParser status and issues


Paul Everitt wrote:
> Stefan Behnel wrote:
>> Paul Everitt wrote:
>>> Howdy.  I was giving the htmlparser branch a try.  In trying to compile
>>> it, I got:
>>>
>>> src/lxml/etree.c: In function '__pyx_f_5etree_10HTMLParser___init__':
>>> src/lxml/etree.c:17245: error: 'HTML_PARSE_RECOVER' undeclared (first
>>> use in this function)
>>> src/lxml/etree.c:17245: error: (Each undeclared identifier is reported
>>> only once
>>> src/lxml/etree.c:17245: error: for each function it appears in.)
>>> src/lxml/etree.c:17256: error: 'HTML_PARSE_COMPACT' undeclared (first
>>> use in this function)
>>> src/lxml/etree.c: In function 'initetree':
>>> src/lxml/etree.c:31135: error: 'HTML_PARSE_RECOVER' undeclared (first
>>> use in this function)
>>> src/lxml/etree.c:31135: error: 'HTML_PARSE_COMPACT' undeclared (first
>>> use in this function)
>>> error: command 'gcc' failed with exit status 1
>>
>>
>> Hmm, I don't see a reason for that error. My clean checkout compiles
>> nicely.
>>
>> What's your libxml2 version on MacOS? In my
>> include/libxml2/HTMLparser.h it
>> says somewhere around line 175:
> 

Stefan Behnel | 20 Apr 08:10 2006

Re: Re: HTMLParser status and issues


Stefan Behnel wrote:
> Paul Everitt wrote:
>> I got this:
>>
>> # 175 "/usr/include/libxml2/libxml/HTMLparser.h"
>> typedef enum {
>>     HTML_PARSE_NOERROR = 1<<5,
>>     HTML_PARSE_NOWARNING= 1<<6,
>>     HTML_PARSE_PEDANTIC = 1<<7,
>>     HTML_PARSE_NOBLANKS = 1<<8,
>>     HTML_PARSE_NONET = 1<<11
>> } htmlParserOption;
> 
> 
> That's not libxml2 2.6.22 then. I think your C compiler uses the Mac-OS system
> libraries instead of the libxml2 installation that your xmllint uses.

Back to this issue: the missing option HTML_PARSE_RECOVER was introduced in
libxml2 2.6.21, while Mac-OS Tiger ships with 2.6.16. However, it looks like the
HTML_PARSE_* options follow the numeric values of the XML_PARSE_* enum
exactly. So, as a work-around, we could use XML_PARSE_RECOVER to make it
compile and simply state that libxml2 2.6.21+ is required for parsing broken
HTML. That way, it would keep working with the system libraries on Mac-OS X.
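
The numeric correspondence can be checked with a small sketch. The HTML values
are the ones from the 2.6.16 header quoted above; the XML_PARSE_* values are an
assumption taken from libxml2's parser.h as I remember it, and the helper name
html_option() is purely illustrative, not part of lxml or libxml2.

```python
# Bit values of the XML parser options (assumed from libxml2's parser.h).
XML_PARSE = {
    "RECOVER": 1 << 0,
    "NOERROR": 1 << 5,
    "NOWARNING": 1 << 6,
    "PEDANTIC": 1 << 7,
    "NOBLANKS": 1 << 8,
    "NONET": 1 << 11,
    "COMPACT": 1 << 16,
}

# HTML parser options as quoted from the libxml2 2.6.16 header on Mac-OS Tiger.
HTML_PARSE_2_6_16 = {
    "NOERROR": 1 << 5,
    "NOWARNING": 1 << 6,
    "PEDANTIC": 1 << 7,
    "NOBLANKS": 1 << 8,
    "NONET": 1 << 11,
}


def html_option(name):
    """Return the HTML option value, falling back to the XML option with
    the same name when the old header does not define it."""
    return HTML_PARSE_2_6_16.get(name, XML_PARSE[name])


# Every option the old header defines has the same value as its XML twin.
mismatches = [name for name, value in HTML_PARSE_2_6_16.items()
              if XML_PARSE[name] != value]
print("mismatching options:", mismatches)
print("fallback for RECOVER:", html_option("RECOVER"))
```

Older libraries simply ignore the extra bit, which is why only the broken-HTML
tests would be expected to fail there.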

Paul, I applied the above change to the branch for now. I'd be glad if you
could check that it now compiles with the Mac-OS system libraries. Please run
the test suite. If everything works as expected, only the test case(s) for
parsing broken HTML should fail.


Stefan Behnel | 20 Apr 12:44 2006

HTML parser is in the trunk! Ready for 1.0 ?

Hi all,

I just merged the HTML parser branch into the trunk. Paul reported that the
latest branch version compiled cleanly on Mac-OS X Tiger (libxml 2.6.16) - and
it even passed all tests there, including those on broken HTML. Newer versions
of both libxml2 and libxslt are recommended, though.

Another recent update on the trunk is the support for xml:id, which is
currently available through an XMLDTDID function (XMLID was already in use by
ET and is compatible in lxml). The new functionality is now directly based on
the libxml2 ID hash table provided by the parser. This means that lxml now
supports dictionary-like access to elements having an "xml:id" attribute or
DTD-REF attributes.
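
A minimal sketch of the dictionary-like access, assuming a current lxml release
in which XMLDTDID() returns a (root, ids) pair; the sample document and its ID
values are made up for illustration.

```python
from lxml import etree

# Two elements carrying xml:id attributes; no DTD is involved here.
xml = (
    '<doc>'
    '<section xml:id="intro">Hello</section>'
    '<section xml:id="usage">World</section>'
    '</doc>'
)

# XMLDTDID parses the text and returns the root plus an ID lookup object.
root, ids = etree.XMLDTDID(xml)

intro = ids["intro"]   # dictionary-style lookup by ID value
usage = ids["usage"]
print(intro.tag, intro.text)
print(usage.tag, usage.text)
```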

I think now is the time to freeze the feature set for lxml 1.0. Expect it to be
released next month (hopefully after Pyrex 0.9.4.1).

If you think that lxml still misses something that should be in 1.0 or if you
know about any remaining (or new) bugs, report back to the list. Please start
a separate thread in that case instead of replying to this mail. Martijn and I
are happy about any comment that helps us make lxml better.

Have fun,
Stefan

Stefan Behnel | 20 Apr 19:26 2006

Custom resolvers

Hi,

since Paul kept bugging me, I created a new branch (resolver-new) and
implemented an API for the custom resolvers stuff. It should be pretty simple
to use, just create a parser and register the resolver:

parser = XMLParser()
parser.resolvers.add(my_resolver)

"my_resolver" must be of type etree.Resolver and provide a method

resolve(system_url, public_id, context)

that returns either None (== "can't resolve, ask someone else") or a
_ParserInput object. These can be built from files or strings using the
Resolver methods 'resolve_string' and 'resolve_filename'.

So, to create a custom resolver, you basically do this:

---------------
class MyResolver(lxml.etree.Resolver):
  entity = "This was an entity"
  def resolve(self, url, id, context):
    if url == 'my.dtd':
      # I can handle this
      return self.resolve_string(
                  u'<!ENTITY myentity "%s">' % self.entity, context)
    elif url.startswith('http://'):
      # the default resolver can handle this
      return super(MyResolver, self).resolve(url, id, context)
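
Since the example above is cut off in the archive, here is a separate,
self-contained sketch of the same idea. It assumes a current lxml release
rather than the 2006 branch: the parser options load_dtd and resolve_entities,
the class name EntityResolver and the test document are my own choices and do
not come from the original mail.

```python
from lxml import etree


class EntityResolver(etree.Resolver):
    entity = "This was an entity"

    def resolve(self, system_url, public_id, context):
        if system_url == "my.dtd":
            # Serve the DTD from a string instead of reading a file.
            return self.resolve_string(
                '<!ENTITY myentity "%s">' % self.entity, context)
        # None means "can't resolve, ask someone else".
        return None


# The DTD must be loaded for the resolver to be consulted at all.
parser = etree.XMLParser(load_dtd=True, resolve_entities=True)
parser.resolvers.add(EntityResolver())

root = etree.fromstring(
    '<!DOCTYPE doc SYSTEM "my.dtd"><doc>&myentity;</doc>', parser)
print(root.text)
```

Returning None for everything else leaves other URLs to whichever resolver
comes next, ending with the default one.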

Brad Clements | 20 Apr 19:46 2006

Re: Custom resolvers

On 20 Apr 2006 at 19:26, Stefan Behnel wrote:

> parser = XMLParser()
> parser.resolvers.add(my_resolver)

Great, so does this resolver only get called when this one parser is used, or is it 
global to the process (like it is with libxml2)?

> I'll see how to integrate that in other places of the API, especially
> XSLT and schemas. Anyway, this works so far. Feel free to comment on

If I create a parser, add my resolver, then load an .xslt file into that parser, I'd
expect that subsequent use of the parsed document in a transform would
continue to use my resolver, and that my resolver would not be called by other
documents or transforms.

Is that what really happens? If so, nirvana!

-- 
Brad Clements,                bkc <at> murkworks.com    (315)268-1000
http://www.murkworks.com                          
AOL-IM or SKYPE: BKClements

Stefan Behnel | 20 Apr 20:30 2006

Re: Custom resolvers


Brad Clements wrote:
> On 20 Apr 2006 at 19:26, Stefan Behnel wrote:
> 
>> parser = XMLParser()
>> parser.resolvers.add(my_resolver)
> 
> 
> Great, so does this resolver only get called when this one parser is used, or is it 
> global to the process (like it is with libxml2)?

It's currently local to a parser. I'm also looking for a module-level API, but
I'm not sure yet how to make it look pretty. In any case, the parser-level API
is likely the preferred one.

>> I'll see how to integrate that in other places of the API, especially
>> XSLT and schemas. Anyway, this works so far. Feel free to comment on
> 
> If I create a parser, add my resolver, then load an .xslt file into that parser, I'd
> expect that subsequent use of the parsed document in a transform would
> continue to use my resolver, and that my resolver would not be called by
> other documents or transforms.

So you'd want the resolvers stored at a per-document level rather than in XSLT
or RelaxNG? That would totally simplify the API. I think that's a good idea.

So, just to make that clear:

1) resolvers are only registered with parsers.


Brad Clements | 20 Apr 21:21 2006

Re: Custom resolvers

On 20 Apr 2006 at 20:30, Stefan Behnel wrote:

> > Great, so does this resolver only get called when this one parser is
> > used, or is it global to the process (like it is with libxml2)?
> 
> It's currently local to a parser. I'm looking for a module level API
> also, but I'm not sure yet how to make it look pretty. Anyway, the
> parser-level API is likely the preferred one anyway.

Is the ability to register a resolver by-parser new functionality in libxml2? 

> So you'd want the resolvers stored at a per-document level rather than
> in XSLT or RelaxNG? That would totally simplify the API. I think
> that's a good idea.

I don't know anything about RelaxNG. But with respect to XSLT, see below.

> So, just to make that clear:
> 
> 1) resolvers are only registered with parsers.

yes

> 
> 2) once a document is parsed, a reference to the parser-local
> resolvers is kept in the document to be reused in all operations where
> resolving is involved (XSLT, RelaxNG, XInclude, etc.).

yes


Stefan Behnel | 20 Apr 22:10 2006

Re: Custom resolvers


Brad Clements wrote:
> On 20 Apr 2006 at 20:30, Stefan Behnel wrote:
> 
>>> Great, so does this resolver only get called when this one parser is
>>> used, or is it global to the process (like it is with libxml2)?
>> It's currently local to a parser. I'm looking for a module level API
>> also, but I'm not sure yet how to make it look pretty. Anyway, the
>> parser-level API is likely the preferred one anyway.
> 
> Is the ability to register a resolver by-parser new functionality in libxml2?

No, lxml registers a global resolver and dispatches internally, possibly
falling back to the original default resolver.

>> Questions:
>>
>> * if you parse an XSL document with one set of resolvers and then use
>> it to transform an XML document with another set of resolvers - which
>> ones should be used during the transform?
> 
> Well hmm.. when does the xsl transform process xsl:include and xsl:import?
> I think those two statements should use the resolver assigned to the base xslt 
> document.

Includes and imports are handled at compilation time, which happens in
XSLT.__init__(). Libxslt uses a different mechanism than libxml2 here, which
(as usual) complicates things. It allows you to specify an
"xsltDocLoaderFunction" that is expected to operate in the current XSLT context.


Stefan Behnel | 21 Apr 11:17 2006

Pyrex 0.9.4.1 is out

Hi,

Pyrex 0.9.4.1 was released today. It finally compiles lxml nicely and
out-of-the-box with Python 2.4 and gcc 4.x. And it can be installed with

easy_install Pyrex

I updated the INSTALL.txt on the trunk and the 0.9.x branch to make this
version a requirement for those who want to twiddle with the source.

Stefan

Stefan Behnel | 21 Apr 13:17 2006

document('') fixed

Hi,

I played with the XSLT document loaders and found that the default loader can
apparently handle "document('')" on XSL documents read from strings as long as
they have a non-empty URL. This only makes sense when you know that libxslt
keeps a list of known documents during the transformation, so it apparently
searches that list for the URL of the requested document.

I changed the code on the trunk to create a fake URL for the case that the
document URL is empty. So, document('') should now work from any stylesheet
(if anyone wants to verify...)
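
For anyone who wants to verify, a check along these lines should do. It is a
sketch that assumes a current lxml release; the stylesheet is made up for
illustration and is parsed from a string, so it has no URL of its own.

```python
from lxml import etree

# Stylesheet parsed from a string; document('') refers back to the
# stylesheet itself and copies its own <test> element into the output.
stylesheet = etree.XML("""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <test>TEXT<xsl:copy-of select="document('')//test"/></test>
  </xsl:template>
</xsl:stylesheet>
""")

transform = etree.XSLT(stylesheet)
result = transform(etree.XML("<a/>"))
root = result.getroot()

# The outer <test> comes from the template, the inner one from document('').
print(root.tag, [child.tag for child in root])
```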

Stefan
