Stefan Behnel | 3 Dec 2011 09:41

Re: Attribute encoding.

Evgeny Turnaev, 28.11.2011 12:01:
> Is there any reason why attributes of an Element are returned as a
> bytestring if they only contain ASCII?

Yes. Partly for ElementTree compatibility and partly because it's faster 
and more memory friendly under Python 2.x.

Also note that it's not just attribute names and values. All string values 
work this way in lxml.

> In my application I need it to be unicode always.

In Python 3, lxml will always give you Unicode strings. In Python 2, ASCII 
encoded byte strings are compatible with the equivalent Unicode strings (as 
long as the platform default encoding is ASCII-compatible, which is 
"normally" the case), so you will rarely notice the difference in your code.

> Is there a way to force lxml to return element attributes as unicode?

No.

> What is the preferred way of getting attributes as unicode?

If you really need a unicode string in Py2, you can do "unicode(value)" or 
"u'' + value".

Stefan
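
A minimal sketch of that conversion, assuming the value is either an ASCII byte string (Python 2) or already a Unicode string (Python 3). The helper name to_text is hypothetical and not part of lxml; it is written so that it runs unchanged on both Python versions:

```python
from lxml import etree


def to_text(value):
    """Return `value` as a Unicode string (hypothetical helper, not lxml API)."""
    if isinstance(value, bytes):
        # Python 2 case: lxml handed back a plain ASCII byte string.
        return value.decode("ascii")
    # Already Unicode (always the case on Python 3).
    return value


root = etree.XML('<a name="ascii-only"/>')
name = to_text(root.get("name"))
print(repr(name))
```

Decoding explicitly as ASCII avoids depending on the platform default encoding mentioned above.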
_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml <at> lxml.de

Kristian Kvilekval | 7 Dec 2011 01:15

Truth and elements


I am seeing strange behavior on truth testing. It appears to depend on
whether a node has children or not.

In [46]: a = etree.XML('<a><b/><b/></a>')

In [47]: a or None
Out[47]: <Element a at 0x3b0acd0>

In [48]: a[0] or None

In [49]: a[0]
Out[49]: <Element b at 0x3b0ac80>

In [50]: a = etree.XML('<a><b><c/></b><b/></a>')

In [51]: a or None
Out[51]: <Element a at 0x3b0aa00>

In [52]: a[0] or None
Out[52]: <Element b at 0x3b0aeb0>

I am trying to get the first child of an expression with 

(len(a) and a[0]) or None

I am also seeing a future warning.
The behavior of this method will change in future versions. Use specific
'len(elem)' or 'elem is not None' test instead.

Is this related and will what I am trying work in the future?


Jens Quade | 7 Dec 2011 01:28

Re: Truth and elements


On 07.12.2011, at 01:15, Kristian Kvilekval wrote:

> 
> I am seeing strange behavior on truth testing. It appears to depend on
> whether a node has children or not.

This works as documented. For truth testing, elements work like a list of their subelements.
Because of that, empty elements are false. This is not very intuitive, and may change in the future.

If you use boolean logic directly on elements you get a warning because of this issue.

>>> from lxml.etree import XML

>>> a = XML('<a/>')
>>> len(a)
0
>>> bool(a)
__main__:1: FutureWarning: The behavior of this method will change in future versions. Use specific
'len(elem)' or 'elem is not None' test instead.
False
>>> a is not None
True
>>> 

I would probably use

>>> a = XML('<a/>')
>>> a[0] if len(a) else None 
>>> 
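
One reason the "(len(a) and a[0]) or None" idiom from the question misbehaves: when the first child itself has no children, a[0] is falsy, so the "or None" branch wins even though the child exists. A small sketch of the explicit test, wrapped in a helper whose name first_child is only illustrative:

```python
from lxml import etree


def first_child(elem):
    """Return the first child of `elem`, or None if it has no children."""
    # len(elem) is the number of children; unlike bool(elem) it does not
    # trigger the FutureWarning.
    return elem[0] if len(elem) else None


a = etree.XML('<a><b/><b/></a>')
print(first_child(a).tag)    # b
print(first_child(a[0]))     # None
```

Comparing the result with "is not None" afterwards keeps the check independent of whether the child has children of its own.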

Simon Sapin | 12 Dec 2011 12:31

Implementing :root in cssselect, is XPath's count() buggy?

Hi,

I was thinking of implementing the :root pseudo-class in lxml.cssselect, 
which works by translating CSS selectors to XPath.

It is straightforward to find the root element with XPath:

     any_element.xpath('/*')

However, the current architecture of cssselect makes it difficult to 
implement :root that way. Instead, pseudo-classes are expected to 
produce an XPath condition like [position() = 1].

Though not optimal, I think that translating :root to [count(..) = 0] 
*should* work (only the root element has no parent). This is not the 
case: the condition is always false.

Example on an HTML document with a <html> root and a <body>:

The '..' part works fine to select the parent, and the (root) <html> 
element has no parent:

 >>> map(r.xpath, ['//html', '//body', '//body/..', '//html/..'])
[[<Element html at 0x16f2950>],
  [<Element body at 0x1962410>],
  [<Element html at 0x16f2950>],
  []]

The count() function looks fine too, including when the result is zero 
or for counting parents. Though it seems weird that results are 

Simon Sapin | 14 Dec 2011 14:06

Implemented :root

Hi,

count(..) always returns 1 even on the root element where I would expect 
0, but count(parent::*) works.
Also, not(node-set) is equivalent to count(node-set) = 0:

     >>> r.xpath('//html[not(parent::*)]')
     [<Element html at 0x1cda110>]
     >>> r.xpath('//body[not(parent::*)]')
     []

This can be used to implement the :root pseudo-class in cssselect. A 
patch is here:

https://github.com/lxml/lxml/pull/22/files#diff-2
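
A likely explanation for the difference, per XPath 1.0: ".." abbreviates parent::node(), and the parent of the root element is the document node, so count(..) is 1 there. parent::* only matches element nodes, so it is empty for the root element. The sketch below illustrates this on a small made-up document; it is not the code from the patch:

```python
from lxml import etree

root = etree.XML('<html><body/></html>')
body = root[0]

# ".." is parent::node(): it also matches the document node that sits
# above the root element. parent::* matches element parents only.
print(root.xpath('count(..)'))
print(root.xpath('count(parent::*)'))
print(body.xpath('count(parent::*)'))

# The condition a :root translation can use: elements without an element parent.
roots = root.xpath('//*[not(parent::*)]')
print([el.tag for el in roots])
```

This would also explain why '//html/..' returns an empty list in lxml: the document node is selected but is not exposed as an element in the result.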

-- 
Simon Sapin

Duane Kaufman | 14 Dec 2011 22:02

Minimal example of lxml writing an XML file and then reading and validating it against an XSD

Hi,

My name is Duane Kaufman, and I am new to this list (please be gentle :)

I am new to XML (not to Python), and I am wrestling with trying to get
lxml to perform a task for me (I am on Windows XP, Python 2.7, lxml 2.3).

I want to:

1) Use lxml to create an XML file
2) (Manually) create an XML Schema (XSD) file for the created XML file
3) Use lxml to read the XML file, validating it against the XSD file
from 2)
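
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the poster's script: the element names, the inline schema and the helper build_message are all made up, and everything is kept in memory instead of being written to files (etree.parse() on file names would be used for the on-disk variant).

```python
from lxml import etree

# Step 2: a hand-written schema (hypothetical; one integer <value> under <root>).
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="value" type="xs:integer"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""


def build_message(value):
    """Step 1: create the XML document and return it as serialised bytes."""
    root = etree.Element("root")
    etree.SubElement(root, "value").text = str(value)
    return etree.tostring(root, xml_declaration=True, encoding="UTF-8")


# Step 3: read the document back and validate it against the schema.
schema = etree.XMLSchema(etree.XML(XSD))
good = etree.fromstring(build_message(42))
bad = etree.fromstring(build_message("not a number"))

print(schema.validate(good))
print(schema.validate(bad))
print(schema.error_log)  # describes why the last validation failed
```

schema.assertValid(doc) raises an exception instead of returning False, which may be more convenient in a script.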

I have tried:
MyXMLWriter.py:
#-------------------------
ScriptRootDir = r'H:\My Documents\Manufacturing\my_python\XML_Test\XMLSchema_test'

def main():
    global ScriptRootDir
    # Script to test the use of package lxml to pass XML messages containing data
    # This script will write out an XML file which will be a message with data
    from lxml import objectify
    from lxml import etree
    import os
    root = objectify.Element("root")

Ricky Wong | 15 Dec 2011 02:03

Using lxml to read epub/docbook files. Re-orders doctype/xml tags

This might be a simple question (also posted this on stackoverflow here - http://bit.ly/uHcIB9).

Noticed how after lxml reads in the doc it reverses the doctype and xml tags when it prints it back out.

Example, imagine I have a file test.html with content,

<?xml version="1.0" encoding="UTF-8" standalone="no"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><title>Components of the SDK</title><link rel="stylesheet" href="core.css" type="text/css"/><meta name="generator" content="DocBook XSL Stylesheets V1.74.0"/></head><body></body></html>

And in the python prompt you could enter.
>>> import lxml.html
>>> t = lxml.html.parse('test.html')
>>> lxml.html.etree.tostring(t)
'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">\n<?xml version="1.0" encoding="UTF-8" standalone="no"??><html xmlns="http://www.w3.org/1999/xhtml"><head><title>Components of the SDK</title><link rel="stylesheet" href="core.css" type="text/css"/><meta name="generator" content="DocBook XSL Stylesheets V1.74.0"/></head><body/></html>'

Stefan Behnel | 15 Dec 2011 08:58

Re: Using lxml to read epub/docbook files. Re-orders doctype/xml tags

Ricky Wong, 15.12.2011 02:03:
> This might be a simple question (also posted this on stackoverflow here -
> http://bit.ly/uHcIB9).
>
> Noticed how after lxml reads in the doc it reverses the doctype and xml
> tags when it prints it back out.

No, it doesn't.

> Example, imagine I have a file test.html with content,
> *<?xml version="1.0" encoding="UTF-8" standalone="no"?><!DOCTYPE html
> PUBLIC "-//W3C//DTD XHTML 1.1//EN" "
> http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">*<html xmlns="
> http://www.w3.org/1999/xhtml"><head><title>Components of the
> SDK</title><link rel="stylesheet" href="core.css" type="text/css"/><meta
> name="generator" content="DocBook XSL Stylesheets
> V1.74.0"/></head><body></body></html>
>
> And in the python prompt you could enter.
> >>> import lxml.html
> >>> t = lxml.html.parse('test.html')

Note that you are using the HTML parser to parse XML here.

> >>> lxml.html.etree.tostring(t)

You should use "etree.tostring()" if you want XML serialisation, not go 
through lxml.html. "etree" is not part of the module API of lxml.html.

> '*<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "
> http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">\n<?xml version="1.0"
> encoding="UTF-8" standalone="no"??>*<html xmlns="
> http://www.w3.org/1999/xhtml"><head><title>Components of the
> SDK</title><link rel="stylesheet" href="core.css" type="text/css"/><meta
> name="generator" content="DocBook XSL Stylesheets
> V1.74.0"/></head><body/></html>'

What happens here is that the HTML parser does not recognise the XML 
declaration (and why should it?) and parses it as a normal processing 
instruction that precedes the <html> root element. When serialising, it 
writes the content back out in the correct way: first the DOCTYPE, then any 
processing instructions that precede the root element, then the root 
element itself.

So the bug is in your code, not in lxml. Use the XML parser to parse XML 
documents. Note that there is also an XHTML-ish parser in lxml.html that 
you can use.
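
For comparison, a small sketch of the same round trip through the XML parser, using a shortened version of the document from the question (the shortening is mine). The default XML parser does not load the external DTD, so nothing is fetched:

```python
from lxml import etree

DOC = (
    b'<?xml version="1.0" encoding="UTF-8" standalone="no"?>'
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" '
    b'"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">'
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b'<head><title>Components of the SDK</title></head><body></body></html>'
)

root = etree.fromstring(DOC)      # XML parser: the declaration is understood
tree = root.getroottree()         # the tree keeps the DOCTYPE information

# Ask for an XML declaration explicitly; tostring() omits it by default.
out = etree.tostring(tree, xml_declaration=True, encoding="UTF-8")
print(out.decode("utf-8"))
```

Here the XML declaration comes first, followed by the DOCTYPE and the root element, because the declaration was never stored as a processing instruction.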

Stefan

Ricky Wong | 15 Dec 2011 09:13

Re: Using lxml to read epub/docbook files. Re-orders doctype/xml tags

On Wed, Dec 14, 2011 at 11:58 PM, Stefan Behnel <stefan_ml <at> behnel.de> wrote:
> Ricky Wong, 15.12.2011 02:03:
>> This might be a simple question (also posted this on stackoverflow here -
>> http://bit.ly/uHcIB9).
>>
>> Noticed how after lxml reads in the doc it reverses the doctype and xml
>> tags when it prints it back out.
>
> No, it doesn't.
>
>> Example, imagine I have a file test.html with content,
>> *<?xml version="1.0" encoding="UTF-8" standalone="no"?><!DOCTYPE html
>> PUBLIC "-//W3C//DTD XHTML 1.1//EN" "
>> http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">*<html xmlns="
>> http://www.w3.org/1999/xhtml"><head><title>Components of the
>> SDK</title><link rel="stylesheet" href="core.css" type="text/css"/><meta
>> name="generator" content="DocBook XSL Stylesheets
>> V1.74.0"/></head><body></body></html>
>>
>> And in the python prompt you could enter.
>> >>> import lxml.html
>> >>> t = lxml.html.parse('test.html')
>
> Note that you are using the HTML parser to parse XML here.

Yes. There's a reason for that (mainly the input isn't well formed with entity not matching up - such as &npsb;...)

>> >>> lxml.html.etree.tostring(t)
>
> You should use "etree.tostring()" if you want XML serialisation, not go
> through lxml.html. "etree" is not part of the module API of lxml.html.

Is there a way to use the more forgiving lxml.html parser to create the tree and serialize it as an XML doc? Or ask the etree parser to be more forgiving? The alternative is to modify the input to fix the entity issue, but that seems more invasive and might cause other problems when other systems depend on the original format.

>> '*<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "
>> http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">\n<?xml version="1.0"
>> encoding="UTF-8" standalone="no"??>*<html xmlns="
>> http://www.w3.org/1999/xhtml"><head><title>Components of the
>> SDK</title><link rel="stylesheet" href="core.css" type="text/css"/><meta
>> name="generator" content="DocBook XSL Stylesheets
>> V1.74.0"/></head><body/></html>'
>
> What happens here is that the HTML parser does not recognise the XML
> declaration (and why should it?) and parses it as a normal processing
> instruction that precedes the <html> root element. When serialising, it
> writes the content back out in the correct way: first the DOCTYPE, then any
> processing instructions that precede the root element, then the root
> element itself.
>
> So the bug is in your code, not in lxml. Use the XML parser to parse XML
> documents. Note that there is also an XHTML-ish parser in lxml.html that
> you can use.

I didn't imply that there's a bug in lxml :-)

> Stefan

Stefan Behnel | 15 Dec 2011 09:18

Re: Using lxml to read epub/docbook files. Re-orders doctype/xml tags

Ricky Wong, 15.12.2011 09:13:
> On Wed, Dec 14, 2011 at 11:58 PM, Stefan Behnel wrote:
>> Ricky Wong, 15.12.2011 02:03:
>>> Example, imagine I have a file test.html with content,
>>> *<?xml version="1.0" encoding="UTF-8" standalone="no"?><!DOCTYPE html
>>> PUBLIC "-//W3C//DTD XHTML 1.1//EN" "
>>> http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">*<html xmlns="
>>> http://www.w3.org/1999/xhtml"><head><title>Components of the
>>> SDK</title><link rel="stylesheet" href="core.css" type="text/css"/><meta
>>> name="generator" content="DocBook XSL Stylesheets
>>> V1.74.0"/></head><body></body></html>
>>>
>>> And in the python prompt you could enter.
>>> >>> import lxml.html
>>> >>> t = lxml.html.parse('test.html')
>>
>> Note that you are using the HTML parser to parse XML here.
>
> Yes. There's a reason for that (mainly the input isn't well formed with
> entity not matching up - such as &npsb;...)

Ok, so what you actually want is to parse broken XHTML.

>>> >>> lxml.html.etree.tostring(t)
>>
>> You should use "etree.tostring()" if you want XML serialisation, not go
>> through lxml.html. "etree" is not part of the module API of lxml.html.
>
> Is there a way to use the more forgiving lxml.html parser to create the tree
> and serialize it as an XML doc? Or ask the etree parser to be more
> forgiving?

You should do the latter. Pass the "recover" option.

http://lxml.de/parsing.html#parser-options
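
A minimal sketch of that option, assuming the breakage is an entity that is undefined without a DTD. The input string is made up for illustration. What the recovering parser does with the offending entity is not shown here and should be checked against the real data:

```python
from lxml import etree

# Hypothetical broken XHTML: &nbsp; is not defined when no DTD is loaded.
BROKEN = (
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b'<body><p>a&nbsp;b</p></body></html>'
)

# The strict XML parser rejects the input.
try:
    etree.fromstring(BROKEN)
    strict_ok = True
except etree.XMLSyntaxError as exc:
    strict_ok = False
    print("strict parser failed:", exc)

# With recover=True the parser tries to continue past the error.
parser = etree.XMLParser(recover=True)
root = etree.fromstring(BROKEN, parser)
print(root.tag)

# XML serialisation of the recovered tree.
print(etree.tostring(root.getroottree(), xml_declaration=True, encoding="UTF-8"))
```

Because recovery can silently drop content, it is worth comparing the serialised result with the input before relying on it.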

Stefan