F Wolff | 3 Jul 15:47

space normalisation for .text and .tail

Hello all

On 2009-03-24 I wrote about space normalisation with reference to the
xml:space attribute, and the string() and normalize-space() functions
in XPath. I solved my problem in code, partly due to slightly changing
requirements.

Now I need to do similar magic, but need to handle the text nodes
separately, without descending into child nodes.

From the XPath document:
> The string-value of an element node is the concatenation of the
> string-values of all text node descendants of the element node in
> document order.
...which is not what I need to do in this case.

Is there a way to apply normalize-space() to a node's .text or .tail
only? Is there another way to obtain the same result? From the looks of
it, there is no reliable way to normalise correctly in code, since I
won't know whether a newline (for example) was given as a literal
newline or as a character reference, and this should influence the
normalisation.
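
One possible way to limit the normalisation to a single text node is to
pass that string into XPath as a variable, so that normalize-space()
never sees the child elements. This is only a minimal sketch: the helper
name normalize() and the sample document are made up for illustration,
and because it runs after parsing, it cannot tell a literal newline from
a character reference.

```python
from lxml import etree

# Compile once; $s is an XPath variable supplied at call time.
_normalize_space = etree.XPath("normalize-space($s)")


def normalize(node, value):
    """Apply XPath normalize-space() to one string only (hypothetical helper)."""
    # .text and .tail may be None, which XPath variables cannot represent.
    return _normalize_space(node, s=value or "")


root = etree.fromstring("<a>  some\n  text <b>child   text</b>  the\ttail  </a>")
child = root[0]

text = normalize(root, root.text)    # only the text before <b>
tail = normalize(child, child.tail)  # only the text after </b>
print(repr(text), repr(tail))
```

The child element's own content ("child   text") is left untouched,
which is the difference from calling normalize-space() on the element.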

Any help is appreciated.

Friedel

--
Recently on my blog:
http://translate.org.za/blogs/friedel/en/content/presentation-afrilex-alasa-2009

john mcginnis | 2 Jul 04:05

Really dumb question

I have completed a build with easy_install. Looking at site-packages, I see that an lxml directory exists. What I don't find, however, is an lxml.py module anywhere.

If the build was successful, should there not be an lxml.py module somewhere?
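
For context, lxml is installed as a package (a directory holding compiled extension modules such as etree), so no single lxml.py file is expected. A quick way to check what Python actually finds is sketched below; the helper name describe() is made up, and the sketch assumes Python 3.4 or later rather than the interpreter used in this thread.

```python
import importlib.util


def describe(name):
    """Report whether an importable name is a package directory or a single module."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return "%s: not importable" % name
    # Packages have a search path for submodules; plain modules do not.
    kind = "package" if spec.submodule_search_locations is not None else "module"
    return "%s: %s at %s" % (name, kind, spec.origin)


print(describe("lxml"))
```

If the build worked, the line printed for lxml should say "package" and point into the lxml directory under site-packages.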

Thanks.

JohnMc

_______________________________________________
lxml-dev mailing list
lxml-dev <at> codespeak.net
http://codespeak.net/mailman/listinfo/lxml-dev
Elliott Slaughter | 1 Jul 00:00

parsing DTDs - listing of valid elements

Hi,

I'm trying to get the elements in a DTD. Since these internals are not exported in the Python interface of lxml.etree, I am trying to write a Cython extension to do so, as previously suggested on this mailing list (see link below).

http://codespeak.net/pipermail/lxml-dev/2009-January/004298.html

To quote the message, "all you'd really need is the internal _c_dtd field of the DTD class, which you could cimport". I'm wondering exactly how I am supposed to do that (my attempts so far are described below). It would also be nice to know if the last attempt to do so was successful or not.

Thanks. Any help would be appreciated.



Here is what I've tried so far (on Python 2.5.4, Cython 0.11.2, Windows):

The DTD class is not declared in etreepublic.pxd, so I can't just "cimport etreepublic". The actual DTD class definition is in dtd.pxi, as stated in the message. But I can't just "include 'dtd.pxi'" because it inherits from the _Validator class in lxml.etree.pyx. And I can't "cimport lxml.etree" because there is no file lxml.etree.pxd.

I tried writing an lxml.etree.pxd file to circumvent these barriers (which was thoroughly confusing because _Validator contains an _ErrorLog, which made me search through several other files...), but even when I got the entire thing to compile, it failed to load in Python:

>>> import mydtd
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pxd", line 3, in mydtd (mydtd.c:513)
    cdef class _LogEntry:
ValueError: lxml.etree._LogEntry does not appear to be the correct type object

I have attached my lxml.etree.pxd in case I made any mistakes, in the event that this method can be made to work.
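
This does not answer the cimport question. As a possible alternative, and assuming a more recent lxml release than the one discussed in this thread, the DTD class itself offers an iterelements() method at the Python level, which would make the Cython extension unnecessary. The inline DTD below is a made-up stand-in.

```python
from io import StringIO

from lxml import etree

# A tiny stand-in DTD; a file name or file object of a real DTD works the same way.
dtd = etree.DTD(StringIO(
    "<!ELEMENT root (child*)>\n"
    "<!ELEMENT child (#PCDATA)>\n"
))

# Each declaration object carries the declared element name.
names = sorted(decl.name for decl in dtd.iterelements())
print(names)
```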

--
Elliott Slaughter

"Don't worry about what anybody else is going to do. The best way to predict the future is to invent it." - Alan Kay

Attachment (lxml.etree.pxd): application/octet-stream, 603 bytes
Elliott Slaughter | 30 Jun 01:10

Writing external modules in Cython

Hi,

On http://codespeak.net/lxml/capi.html#writing-external-modules-in-cython , I believe there is a typo in the second example section:

DefaultElementClassLookup

should be

etree.ElementDefaultClassLookup

Which is the way it is written on http://codespeak.net/lxml/element_classes.html#setting-up-a-class-lookup-scheme .
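
The spelling can be checked from plain Python with a short sketch. The class name MyElement and the sample document are illustrative only and do not come from the documentation page.

```python
from lxml import etree


class MyElement(etree.ElementBase):
    """Illustrative custom element class."""


# The lookup class is spelled ElementDefaultClassLookup in lxml.etree.
lookup = etree.ElementDefaultClassLookup(element=MyElement)

parser = etree.XMLParser()
parser.set_element_class_lookup(lookup)

root = etree.fromstring("<a><b/></a>", parser)
print(type(root).__name__)
```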

Thanks.

--
Elliott Slaughter

"Don't worry about what anybody else is going to do. The best way to predict the future is to invent it." - Alan Kay

Thomas Weigel | 28 Jun 23:09

Re: lxml.html, now with ignored namespaces!

Stefan Behnel wrote:
> That's an XHTML document, for which the XML parser would be the right tool.

Sadly, not every page will be an XHTML document. Nor will every page be 
created by someone like me, an individual who loves XHTML and 
strictness. I apologize for giving the impression that my users might be 
sane and decent.

> If you have XHTML documents that contain unterminated <br> tags, they are
> not well-formed, and thus simply not XML, i.e. not XHTML.

I will have HTML 4 Loose and HTML5 documents that contain unterminated 
<br> tags, among others.

> Obviously, the best way to deal with this kind of problem is fixing the
> input documents.

Sadly, not possible. I mean, it would be nice. It surely would. But no.

>> -----
>> <html xmlns="http://www.w3.org/TR/1999/REC-html-in-xml" 
>> cs="http://something.com/cs" xml:lang="en" 
>> lang="en"><head><title>Help!</title></head><body><p>My namespaces are 
>> going to disappear!</p><p content="fruit">FRUIT</p></body></html>
>> -----
> 
> That's because HTML parsers are not namespace aware. Namespaces are simply
> not defined for HTML. But if you get a difference on different systems, I'd
> still suspect the reason to be different libxml2 versions. There's nothing
> lxml can do about this.

Yes, I gathered that from the previous reply. There's not much I can do 
about it, either, since I won't be in control of the specific libxml2 
installation.

Currently, I have a small unit test built in that checks whether the 
parser eliminates namespaces. If it does, I replace all "cs:something" 
attributes with "cs_something" attributes.
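
A rough sketch of what such a check and rewrite could look like follows. 
It is not the actual code from this project: the function names are made 
up, and the regular expression works on the raw text, so it would also 
rewrite a matching pattern that happens to occur outside a tag.

```python
import re

import lxml.html

PREFIX = "cs"  # the custom prefix used in the documents


def parser_keeps_prefix(prefix=PREFIX):
    """Probe whether the installed HTML parser keeps 'prefix:name' attributes."""
    probe = lxml.html.fromstring('<p %s:content="fruit">FRUIT</p>' % prefix)
    return any(name.startswith(prefix + ":") for name in probe.attrib.keys())


def flatten_prefix(html, prefix=PREFIX):
    """Rewrite prefix:name= attributes to prefix_name= before parsing."""
    pattern = re.compile(r"(\s)%s:([\w.-]+)(\s*=)" % re.escape(prefix))
    return pattern.sub(r"\g<1>%s_\g<2>\g<3>" % prefix, html)


source = "<p cs:content='fruit'>FRUIT</p>"
if not parser_keeps_prefix():
    # Only rewrite when this libxml2 build would drop the prefix.
    source = flatten_prefix(source)
document = lxml.html.fromstring(source)
```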

It's far from ideal, but it at least works.

Again, thank you.

Thomas Weigel
Francesco | 26 Jun 13:20

XML file and XPath

What is the best way to load an XML file and then query it using XPath while 
keeping the same encoding?
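
One way to approach this, sketched under the assumption that "keeping 
the same encoding" means writing the document back out in the encoding 
it declared: read the encoding from the parsed tree and reuse it when 
serializing. The sample document is made up, and BytesIO stands in for 
a real file.

```python
from io import BytesIO

from lxml import etree

# Stand-in for a file on disk, encoded as ISO-8859-1.
data = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<root><item>caf\u00e9</item></root>"
).encode("iso-8859-1")

tree = etree.parse(BytesIO(data))
encoding = tree.docinfo.encoding          # encoding declared by the document

items = tree.xpath("/root/item/text()")   # results are decoded text strings

# Serialize back to bytes in the original encoding.
out = etree.tostring(tree, encoding=encoding, xml_declaration=True)
```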

Thanks,

Francesco
Hervé Cauwelier | 24 Jun 19:23

parsing and serializing XML fragments

Hi, I'm trying to load fragments of XML to inject them in an existing
document tree.

They look like this:

  <table:table table:name="%s" table:style-name="%s"/>

(It's OpenDocument format.)

Converting the fragment to the "{uri}name" syntax is not an option since
I must remain agnostic to the XML parser.

I would expect the XML() function to take an "nsmap" argument, like the
xpath() method on elements, or parts of the API for subclassing elements.

For now I have another template, a complete document with namespace
declaration, and I inject my fragment using string formatting. Lxml will
parse it and I extract the first child element. (I was using this
technique with the libxml2 Python wrapper and I have seen it in the
lxml.html source code for loading HTML fragments.)
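
The wrapper technique described above can be sketched as follows. The
helper name parse_fragment() and the wrapper element are made up for
illustration; the table namespace URI is the one defined by the
OpenDocument specification.

```python
from lxml import etree

# Prefix-to-URI map; extend it with the other OpenDocument prefixes as needed.
NSMAP = {"table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0"}


def parse_fragment(fragment, nsmap=NSMAP):
    """Parse a prefixed fragment by wrapping it in an element that declares the prefixes."""
    decls = " ".join('xmlns:%s="%s"' % item for item in sorted(nsmap.items()))
    wrapper = "<wrapper %s>%s</wrapper>" % (decls, fragment)
    # The fragment is the first (and only) child of the wrapper.
    return etree.fromstring(wrapper)[0]


element = parse_fragment('<table:table table:name="t1" table:style-name="s1"/>')
print(element.tag)
```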

I have looked at custom elements and other resolving methods, but lxml
raised a namespace error before any of my "print" statements were reached.

Another issue is saving the element back to its snippet form, for unit
test validation. Lxml will produce a valid document with namespace
declaration. Is there a way either to serialize without the namespace
declaration or to remove it while keeping the prefixes?
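
One crude option, sketched here as an assumption rather than a
recommended API, is to serialize normally and then strip the xmlns
declarations as plain text. The helper name to_snippet() is made up, and
the result is no longer well-formed XML on its own, so it only suits
string comparison in tests.

```python
from lxml import etree

TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"


def to_snippet(element, nsmap):
    """Serialize one element, then remove the known xmlns declarations textually."""
    text = etree.tostring(element, encoding="unicode", with_tail=False)
    for prefix, uri in nsmap.items():
        text = text.replace(' xmlns:%s="%s"' % (prefix, uri), "")
    return text


doc = etree.fromstring(
    '<wrapper xmlns:table="%s"><table:table table:name="t1"/></wrapper>' % TABLE_NS
)
snippet = to_snippet(doc[0], {"table": TABLE_NS})
print(snippet)
```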

Thanks in advance
Francesco | 24 Jun 18:23

XPath return values to file?

How can I save the results of an XPath call to a file with the right encoding?

My XPath data is in a list, and I have problems saving it to a file.
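
A minimal sketch of one way to do this, assuming the XPath results are text strings and that UTF-8 is the wanted output encoding (both are assumptions, as are the sample document and file name): open the file with an explicit encoding and write the values one per line.

```python
import os
import tempfile

from lxml import etree

root = etree.fromstring("<root><item>caf\u00e9</item><item>na\u00efve</item></root>")
results = root.xpath("/root/item/text()")   # a list of text strings

path = os.path.join(tempfile.mkdtemp(), "results.txt")

# The encoding is chosen when opening the file, not per string.
with open(path, "w", encoding="utf-8") as handle:
    for value in results:
        handle.write(value + "\n")
```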

Thanks,

Francesco
Francesco | 24 Jun 12:29

clean_html

I have written the following code:

>>> from lxml.html.clean import clean_html
>>> html = "»"
>>> print clean_html(html)
<p>Â»</p>

I am wondering why I have an extra character (Â) in my output.
What should I do to avoid that?
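
An extra Â of this kind is consistent with UTF-8 bytes being read as Latin-1, which the short sketch below reproduces without lxml. Whether that is what happens inside clean_html here is an assumption; if it is, passing an already decoded unicode string instead of a byte string would be one thing to try.

```python
original = "\u00bb"                      # the character », U+00BB
utf8_bytes = original.encode("utf-8")    # two bytes: b'\xc2\xbb'
misread = utf8_bytes.decode("latin-1")   # each byte becomes one character

print(misread)
```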

Thanks,

Francesco

David Antliff | 23 Jun 11:37

Re: Using, or building, lxml in Windows with Cygwin

On Wed, Jun 17, 2009 at 18:35, Stefan Behnel<stefan_ml <at> behnel.de> wrote:
> David Antliff wrote:
>> I am trying a slightly different approach - compiling entirely within
>> Cygwin, using Cygwin's gcc.
>
> Right, that should be best anyway.
>
>
>> What I have done is unpacked lxml-2.2.1.tgz.gz into lxml-2.2.1, then
>> inside that directory I try:
>>
>> $ python setup.py build --static-deps
>>
>> This proceeds to download libxml2 and libxslt, unpack them, and build
>> them. But it runs into numerous problems related to include/library
>> paths.
>
> ... which you may be able to fix using appropriate CFLAGS/LDFLAGS.
[snip]
> I don't know if you install the libiconv developer package on your machine
> (I hope that exists in Cygwin), because building against the shared lib
> should work just fine here. Maybe you need to point gcc to the right
> include and/or lib directory. I wonder why it didn't add "-liconv"
> automatically...
>
> You can also try to go the ugly route and add "/usr/lib/libiconv.a" to your
> CFLAGS, but I think you'll be happier with the shared lib.
>
>
>> I wonder why setup.py didn't automatically download libiconv when it
>> downloaded libxml2 and libxslt... hmm
>
> Because the static build was designed for MacOS-X, where only those two
> libraries are a problem. The libiconv is binary compatible enough across
> versions not to pose any major problems. So it's best to build dynamically
> against libiconv.
>
> That said, it shouldn't be too hard to add code to also download libiconv
> and build it. I would be happy to receive a patch that accepts an optional
> list of library names for the --static-deps option, as in
> "--static-deps=libxml2,libxslt,iconv", and would then download and build
> all requested libraries. Although I doubt that it would make sense (or even
> work) to pass only "--static-deps=iconv", so maybe a new option
> "--build-iconv" would be better.

Hi Stefan,

I must admit I don't understand how setup.py, setupinfo.py and
distutils all fit together.  By watching the output of 'python
setup.py build --static-deps' it's fairly clear to me what is needed
to fix each step, but I can't work out where one sets extra CFLAGS or
LDFLAGS in setup.py so that I can continue the process beyond each
error.

What would also help is if I knew what command I was meant to be
using. The documentation suggests all sorts of options and I'm really
not sure what I'm doing. Here's what I want to end up with:
  - a statically compiled python 'egg' of lxml that I can simply
distribute with a python script and use in Cygwin.

Here's what I currently have:
 - lxml-2.2.1 tar.gz unpacked
 - Cygwin *without* libxml2 or libxslt installed (I don't want my
users to have to install these via Cygwin setup.exe).
 - gcc, ld, etc are all present

As per a previous email, I tried:
$ python setup.py build --static-deps

But this has problems finding libiconv. You suggested I could use
-liconv to fix this, but I can't work out where to actually put
this in setup.py or setupinfo.py.
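
One option that avoids editing setup.py at all is to pass the flags
through the environment. This is a sketch under the assumption that the
build honours the CFLAGS and LDFLAGS environment variables, as
distutils-based builds on Unix-like systems generally do; the helper
name build_lxml() and the include and library paths are illustrative
only.

```python
import os
import subprocess
import sys


def build_lxml(source_dir, run=False):
    """Compose (and optionally run) the build with extra compiler and linker flags."""
    env = dict(os.environ)
    # Illustrative values: adjust the paths to the local Cygwin layout.
    env["CFLAGS"] = (env.get("CFLAGS", "") + " -I/usr/include").strip()
    env["LDFLAGS"] = (env.get("LDFLAGS", "") + " -L/usr/lib -liconv").strip()
    command = [sys.executable, "setup.py", "build", "--static-deps"]
    if run:
        # Runs inside the unpacked source tree with the modified environment.
        subprocess.check_call(command, cwd=source_dir, env=env)
    return command, env


command, env = build_lxml("lxml-2.2.1")
print(" ".join(command[1:]), env["CFLAGS"], env["LDFLAGS"])
```

Calling build_lxml("lxml-2.2.1", run=True) would start the actual
build; with the default run=False the sketch only shows what would be
executed.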

$ ls /usr/lib/libic*
/usr/lib/libiconv.a  /usr/lib/libiconv.dll.a  /usr/lib/libiconv.la

However that's on my own machine - it turns out that nobody else has
Cygwin's libiconv installed, so I'd like to *statically* link in
libiconv too. Dynamically linking won't work in my case.

I think the biggest problem I have is that having to run "python
setup.py build" every time clears out everything. It would be good if
I could get setup.py to simply print all the command lines it intends
to use without actually running them...

Any assistance you can give me would be greatly appreciated.

Regards,

-- David.
Thomas Weigel | 23 Jun 02:53

lxml.html, now with ignored namespaces!

I am using lxml to parse HTML documents, which include a custom 
namespace (for example, "<p cs:content='fruit'>FRUIT</p>").

In lxml 2.2.0, on Windows, this worked just fine, and elements could be 
processed based on this data.

In lxml 2.2.2, on Linux, this fails. The above example becomes "<p 
content='fruit'>FRUIT</p>" as soon as it is parsed by lxml.html (or 
lxml.etree.HTMLParser()).

I don't know if this is caused by the switch to Linux, or the upgrade to 
2.2.2. I don't have control over the installation, so I can't switch to 
2.2.2 under Windows, or 2.2.0 under Linux to check.

I did find this reference (the only one I could find) to 
the HTML parser ignoring namespaces:
http://codespeak.net/lxml/lxmlhtml.html#running-html-doctests

...however, it wasn't doing that before, and it seems odd that this is 
only mentioned in the doctests section.

Is there a way to work around this? Are custom namespaces simply not 
possible in lxml's HTML?

Notes:

1. The XML parser will not work. Some documents will have legal HTML 
that breaks an XML parser, like "<br>".

2. Here is the sample code:

-----
 >>> import lxml.html as parser
 >>> document = parser.fromstring("""<!DOCTYPE html PUBLIC "-//W3C//DTD 
XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html 
xmlns="http://www.w3.org/TR/1999/REC-html-in-xml" 
xmlns:cs="http://something.com/cs" xml:lang="en" 
lang="en"><head><title>Help!</title></head><body><p>My namespaces are 
going to disappear!</p><p cs:content='fruit'>FRUIT</p></body></html>""")
 >>> print parser.tostring(document)
-----

The output:
-----
<html xmlns="http://www.w3.org/TR/1999/REC-html-in-xml" 
cs="http://something.com/cs" xml:lang="en" 
lang="en"><head><title>Help!</title></head><body><p>My namespaces are 
going to disappear!</p><p content="fruit">FRUIT</p></body></html>
-----

Thomas Weigel
