Tres Seaver | 1 Jul 2009 01:15

Re: Binary egg for Mac OS X


kristian kvilekval wrote:
> Hi,
> 
>     We are in critical need of an installable,
> recent lxml egg for Mac OS X 10.5.7 with Python 2.5.
> 
> Could you please post the build instructions you used to 
> create the eggs for Python 2.4?

This is what I use to build lxml on CentOS 4, which has the
"too-old-libxml2-and-libxslt" problem:

 $ wget http://pypi.python.org/packages/source/l/lxml/lxml-2.2.2.tar.gz
 ...
 $ tar xzf lxml-2.2.2.tar.gz
 $ cd lxml-2.2.2
 $ /path/to/python setup.py bdist_egg --static-deps

Tres.
--
===================================================================
Tres Seaver          +1 540-429-0999          tseaver <at> palladion.com
Palladion Software   "Excellence by Design"    http://palladion.com
Elliott Slaughter | 1 Jul 2009 02:29

Re: parsing DTDs - listing of valid elements

Please ignore my previous message; I solved my own problem by finding an XML schema for what I need to do.

Sorry for the noise.

On Tue, Jun 30, 2009 at 3:04 PM, Elliott Slaughter <elliottslaughter <at> gmail.com> wrote:
Hi,

I'm trying to get the elements in a DTD. Since these internals are not exported in the Python interface of lxml.etree, I am trying to write a Cython extension to do so, as previously suggested on this mailing list (see link below).

http://codespeak.net/pipermail/lxml-dev/2009-January/004298.html

To quote the message, "all you'd really need is the internal _c_dtd field of the DTD class, which you could cimport". I'm wondering exactly how I am supposed to do that (my attempts so far are described below). It would also be nice to know if the last attempt to do so was successful or not.

Thanks. Any help would be appreciated.



Here is what I've tried so far (on Python 2.5.4, Cython 0.11.2, Windows):

The DTD class is not declared in etreepublic.pxd, so I can't just "cimport etreepublic". The actual DTD class definition is in dtd.pxi, as stated in the message. But I can't just "include 'dtd.pxi'" because it inherits from the _Validator class in lxml.etree.pyx. And I can't "cimport lxml.etree" because there is no file lxml.etree.pxd.

I tried writing a lxml.etree.pxd file to circumvent these barriers (which was thoroughly confusing because _Validator contains an _ErrorLog which made me search through several other files...), but even when I got the entire thing to compile, it failed to load in Python:

>>> import mydtd
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pxd", line 3, in mydtd (mydtd.c:513)
    cdef class _LogEntry:
ValueError: lxml.etree._LogEntry does not appear to be the correct type object

I have attached my lxml.etree.pxd in case I made any mistakes, in the event that this method can be made to work.

--
Elliott Slaughter

"Don't worry about what anybody else is going to do. The best way to predict the future is to invent it." - Alan Kay

_______________________________________________
lxml-dev mailing list
lxml-dev <at> codespeak.net
http://codespeak.net/mailman/listinfo/lxml-dev
Stefan Behnel | 1 Jul 2009 07:37

Re: parsing DTDs - listing of valid elements

Hi,

Elliott Slaughter wrote:
> Please ignore my previous message; I solved my own problem by finding an XML
> schema for what I need to do.

Note that you can always use trang to convert a DTD to an XML Schema.

http://www.thaiopensource.com/relaxng/trang.html

If all you need is a list of allowed elements, the required logic to
extract that from the schema shouldn't be too hard to figure out. Although
I wonder if RelaxNG wouldn't be easier to work with.
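
As a rough illustration of that extraction step, here is a minimal sketch that lists the element names declared in an XML Schema. It assumes a current Python 3 and uses the standard library's xml.etree.ElementTree (lxml.etree offers the same calls). The SCHEMA text and the function name declared_element_names are made up for this example; a schema produced by trang may be structured differently.

```python
import xml.etree.ElementTree as ET

# Namespace of XML Schema itself, in ElementTree's "{uri}" notation.
XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema, standing in for whatever trang would produce.
SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="library">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="book" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="book" type="xs:string"/>
</xs:schema>
"""


def declared_element_names(schema_text):
    """Return the sorted names of all xs:element declarations."""
    root = ET.fromstring(schema_text)
    # Declarations carry a "name" attribute; references ("ref") are skipped.
    names = {el.get("name") for el in root.iter(XS + "element") if el.get("name")}
    return sorted(names)


print(declared_element_names(SCHEMA))
```

This collects every declared name regardless of nesting, so it answers "which elements exist", not "which elements are allowed where".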

Stefan
Stefan Behnel | 1 Jul 2009 08:14

Re: parsing DTDs - listing of valid elements

Hi,

Elliott Slaughter wrote:
> I'm trying to get the elements in a DTD. Since these internals are not
> exported in the Python interface of lxml.etree, I am trying to write a
> Cython extension to do so, as previously suggested on this mailing list (see
> link below).
> 
> http://codespeak.net/pipermail/lxml-dev/2009-January/004298.html
> 
> To quote the message, "all you'd really need is the internal _c_dtd field of
> the DTD class, which you could cimport". I'm wondering exactly how I am
> supposed to do that
> [...]
> Here is what I've tried so far (on Python 2.5.4, Cython 0.11.2, Windows):
> 
> The DTD class is not declared in etreepublic.pxd, so I can't just "cimport
> etreepublic". The actual DTD class definition is in dtd.pxi, as stated in
> the message. But I can't just "include 'dtd.pxi' " because it inherits from
> the _Validator class in lxml.etree.pyx . And I can't "cimport lxml.etree"
> because there is no file lxml.etree.pxd.

True. So your only chance is to write one yourself. And yes, it needs to be
called "lxml.etree.pxd".

> I tried writing a lxml.etree.pxd file to circumvent these barriers (which
> was thoroughly confusing because _Validator contains an _ErrorLog which made
> me search through several other files...),

All you should really need is this:

	cimport tree

	cdef class _Validator:
	    cdef object _error_log

	cdef class DTD(_Validator):
	    cdef tree.xmlDtd* _c_dtd

Cython needs to know the exact /layout/ of the classes that you use (at
least if they are not exported as C header files), but it doesn't need to
know the exact class types of attributes. "object" will do just fine if you
don't care.

I know that this is harder than necessary (thanks for bringing this up,
BTW), but that's just because DTD isn't an 'officially' C-exported type,
and neither are any of the other schema types.
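
For readers who only need the names of the declared elements, there is a different route that avoids Cython entirely. The sketch below is not the _c_dtd approach discussed in this thread: it uses the expat bindings from the Python standard library, whose ElementDeclHandler is called for each element declaration. It assumes a current Python 3 and a DTD given as an internal subset; DOC and dtd_element_names are names invented for this example.

```python
import xml.parsers.expat

# Hypothetical document with an internal DTD subset.
DOC = b"""<?xml version="1.0"?>
<!DOCTYPE root [
  <!ELEMENT root (item*)>
  <!ELEMENT item (#PCDATA)>
]>
<root><item>x</item></root>
"""


def dtd_element_names(xml_bytes):
    """Collect element names from the <!ELEMENT ...> declarations seen by expat."""
    names = []
    parser = xml.parsers.expat.ParserCreate()
    # Called once per element declaration; the content model is ignored here.
    parser.ElementDeclHandler = lambda name, model: names.append(name)
    parser.Parse(xml_bytes, True)
    return names


print(dtd_element_names(DOC))
```

An external DTD file would need extra handling of external entities, which this sketch does not attempt.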

Stefan
john mcginnis | 2 Jul 2009 04:08

Really dumb question

I have completed a build with easy_install. Looking at site-packages I also see a lxml directory that exists. What I don't find however is a lxml.py module anywhere.

If you have a successful build should there not be a lxml.py module somewhere?

Thanks.

JohnMc

Stefan Behnel | 2 Jul 2009 06:18

Re: Really dumb question


john mcginnis wrote:
> I have completed a build with easy_install. Looking at site-packages I also
> see a lxml directory that exists. What I don't find however is a lxml.py
> module anywhere.
> 
> If you have a successful build should there not be a lxml.py module
> somewhere?

No, lxml is a package, not a module. When you import "lxml", Python will
load "lxml/__init__.py" instead.
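
The package behaviour can be seen with a throwaway example. This is a minimal sketch for a current Python 3; "demo_pkg" is a made-up package name created in a temporary directory, standing in for the lxml directory in site-packages.

```python
import importlib
import os
import sys
import tempfile

# Build a tiny package: a directory containing only __init__.py.
with tempfile.TemporaryDirectory() as tmp:
    pkg_dir = os.path.join(tmp, "demo_pkg")
    os.mkdir(pkg_dir)
    with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
        f.write("VALUE = 42\n")

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        # There is no demo_pkg.py anywhere, yet the import succeeds.
        demo_pkg = importlib.import_module("demo_pkg")
    finally:
        sys.path.remove(tmp)

# The module object was loaded from the package's __init__.py.
loaded_from = os.path.basename(demo_pkg.__file__)
# Only packages have a __path__ attribute.
is_package = hasattr(demo_pkg, "__path__")
print(loaded_from, is_package)
```

The same check on a real installation would be to look at lxml.__file__ after "import lxml".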

Stefan
F Wolff | 3 Jul 2009 15:48

space normalisation for .text and .tail

Hallo all

On 2009-03-24 I wrote about space normalisation with reference to the
xml:space attribute, and the string() and normalize-space() functions
in XPath. I solved my problem in code, partly due to slightly changing
requirements.

Now I need to do similar magic, but need to handle the text nodes
separately, without descending into child nodes.

From the XPath document:
> The string-value of an element node is the concatenation of the
> string-values of all text node descendants of the element node in
> document order.
...which is not what I need to do in this case.

Is there a way to apply normalize-space() to a node's .text or .tail
only? Is there another way to obtain the same result? From the looks of
it, there is no reliable way that I can normalise correctly in code,
since I won't know if a newline (for example) was given as a newline or
as a character reference, and this should influence the normalisation.

Any help is appreciated.

Friedel

--
Recently on my blog:
http://translate.org.za/blogs/friedel/en/content/presentation-afrilex-alasa-2009
Geoffrey Sneddon | 4 Jul 2009 11:13

Re: lxml.html, now with ignored namespaces!


On 27 Jun 2009, at 07:23, Stefan Behnel wrote:

>> The output:
>> -----
>> <html xmlns="http://www.w3.org/TR/1999/REC-html-in-xml"
>> cs="http://something.com/cs" xml:lang="en"
>> lang="en"><head><title>Help!</title></head><body><p>My namespaces are
>> going to disappear!</p><p content="fruit">FRUIT</p></body></html>
>> -----
>
> That's because HTML parsers are not namespace aware. Namespaces are simply
> not defined for HTML. But if you get a difference on different systems, I'd
> still suspect the reason to be different libxml2 versions. There's nothing
> lxml can do about this.

It should still be outputting an attribute named "cs:content"; it
shouldn't be dropping the "cs:". As you say, there are no namespaces in
HTML, so the prefix has no special meaning and is just part of the name.
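
To illustrate what "just part of the name" means for a parser that knows nothing about namespaces, here is a minimal sketch using html.parser from the Python 3 standard library. This is neither libxml2 nor html5lib, so it says nothing about how either of them behaves; AttributeCollector is a name invented for this example.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Record (tag, attribute name, value) for every start tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs as found in the markup.
        self.seen.extend((tag, name, value) for name, value in attrs)


collector = AttributeCollector()
collector.feed("<p cs:content='fruit'>FRUIT</p>")
collector.close()
print(collector.seen)
```

Here the colon is treated like any other character in the attribute name.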

My basic advice to the OP would be to use html5lib, which is far
slower, but does cope with this fine.

--
Geoffrey Sneddon
<http://gsnedders.com/>
Stefan Behnel | 4 Jul 2009 12:03

Re: lxml.html, now with ignored namespaces!

Hi,

Geoffrey Sneddon wrote:
>>> The output:
>>> -----
>>> <html xmlns="http://www.w3.org/TR/1999/REC-html-in-xml"
>>> cs="http://something.com/cs" xml:lang="en"
>>> lang="en"><head><title>Help!</title></head><body><p>My namespaces are
>>> going to disappear!</p><p content="fruit">FRUIT</p></body></html>
>>> -----
>
> My basic advice to the OP would be to use html5lib, which is far slower,
> but does cope with this fine.

Well, as I said, it just depends on the version of libxml2 that you are using.

>>> from lxml import etree
>>> print "lxml.etree:       ", etree.LXML_VERSION
lxml.etree:        (2, 2, 2, 0)
>>> print "libxml used:      ", etree.LIBXML_VERSION
libxml used:       (2, 6, 32)

>>> from lxml.html import fromstring

>>> document = fromstring("""<!DOCTYPE html PUBLIC "-//W3C//DTD
... XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html
... xmlns="http://www.w3.org/TR/1999/REC-html-in-xml"
... xmlns:cs="http://something.com/cs" xml:lang="en"
... lang="en"><head><title>Help!</title></head><body><p>My namespaces are
... going to disappear!</p><p cs:content='fruit'>FRUIT</p></body></html>
... """)

>>> print etree.tostring(document)
<html xmlns="http://www.w3.org/TR/1999/REC-html-in-xml"
xmlns:cs="http://something.com/cs" xml:lang="en"
lang="en"><head><title>Help!</title></head><body><p>My namespaces are
going to disappear!</p><p cs:content="fruit">FRUIT</p></body></html>

Stefan
Stefan Behnel | 4 Jul 2009 13:24

Re: space normalisation for .text and .tail

Hi,

F Wolff wrote:
> On 2009-03-24 I wrote about space normalisation with reference to the
> xml:space attribute, and the string() and normalize-space() functions
> in XPath. I solved my problem in code, partly due to slightly changing
> requirements.
> 
> Now I need to do similar magic, but need to handle the text nodes
> separately, without descending into child nodes.
> 
> From the XPath document:
>> The string-value of an element node is the concatenation of the
>> string-values of all text node descendants of the element node in
>> document order.
> ...which is not what I need to do in this case.
> 
> Is there a way to apply normalize-space() to a node's .text or .tail
> only? Is there another way to obtain the same result?

Well, lxml will not allow you to modify individual text nodes that the
parser created next to each other for whatever reason (likely due to
implementation details), even if XPath allows you to get your hands on them
using "text()". The text/tail properties are as deep down as it gets.
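
Working at the text/tail level, the effect of XPath's normalize-space() can be approximated in plain Python. The following is a minimal sketch, assuming a current Python 3 and the standard library's xml.etree.ElementTree (lxml.etree exposes the same .text and .tail properties). The helper names normalize_space and normalize_text_and_tail are invented for this example, and xml:space is not taken into account.

```python
import re
import xml.etree.ElementTree as ET

# XML whitespace only: space, tab, carriage return, line feed.
_XML_WS = re.compile(r"[ \t\r\n]+")


def normalize_space(value):
    """Collapse whitespace runs to one space and strip both ends."""
    if value is None:
        return None
    return _XML_WS.sub(" ", value).strip(" ")


def normalize_text_and_tail(element):
    """Normalise only this element's own .text and .tail, not its children."""
    element.text = normalize_space(element.text)
    element.tail = normalize_space(element.tail)


root = ET.fromstring("<p>  Hello\n   <b>  bold   text </b>   world \n</p>")
bold = root[0]
normalize_text_and_tail(root)
normalize_text_and_tail(bold)
print(repr(root.text), repr(bold.text), repr(bold.tail))
```

Note that stripping each piece separately also removes the space between "Hello" and the following child, which may or may not be what the application wants.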

> From the looks of
> it, there is no reliable way that I can normalise correctly in code,
> since I won't know if a newline (for example) was given as a newline or
> as a character reference, and this should influence the normalisation.

Why is that? XML parsers will always replace character references by their
Unicode character value, and there is no way XPath could see them. If you
need that information for your algorithm, you will have to parse the XML
byte stream yourself. Neither the XML infoset nor the XPath data model
provide this.
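
A quick check makes this visible. The sketch below assumes a current Python 3 and uses the standard library's xml.etree.ElementTree; the variable names are made up for this example. It parses the same content once with a literal newline and once with a character reference.

```python
import xml.etree.ElementTree as ET

# Same element content, written two different ways in the source text.
with_literal = ET.fromstring("<a>x\ny</a>")
with_reference = ET.fromstring("<a>x&#10;y</a>")

# After parsing, both carry the identical newline character.
texts_match = with_literal.text == with_reference.text
print(repr(with_literal.text), repr(with_reference.text), texts_match)
```

Once parsed, nothing in the tree records which spelling the source used.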

Stefan
