Stefan Behnel | 1 Aug 06:46 2012

Re: getiterator vs xpath question

Tim Arnold, 01.08.2012 00:48:
> I am misunderstanding the difference between these two code blocks. I thought they would have the same result.
> I want to find every element in the tree that has an 'id' or a 'name' attribute so I can store that attribute.
> (I'm link-checking a static html site).

Have a look at the link iterator in lxml.html. It also handles references
and links in CSS, for example.
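
A minimal sketch of that suggestion, assuming the pages can be parsed with lxml.html. The page fragment and the variable names below are made up for illustration; iterlinks() yields (element, attribute, link, pos) tuples, and the XPath expression collects the possible link targets.

```python
import lxml.html

# Hypothetical page fragment standing in for one file of the static site.
page = lxml.html.fromstring(
    '<div id="ugseldet" class="section">'
    '<a href="#ugstep">next step</a>'
    '<a name="ugstep"></a>'
    '<img src="fig1.png">'
    '</div>'
)

# Outgoing references: iterlinks() yields (element, attribute, link, pos).
links = [(el.tag, attr, link) for el, attr, link, pos in page.iterlinks()]

# Possible link targets: every element carrying an id or a name attribute.
targets = page.xpath('//*[@id or @name]')
anchors = [el.get('id') or el.get('name') for el in targets]

print(links)
print(anchors)
```

Comparing the fragment part of each link against the collected anchors is then a plain set lookup.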

> starting out:
> >>> from lxml import etree
> >>> parser = etree.HTMLParser()
> >>> tree = etree.parse('ugdet17.htm', parser=parser)
> 
> Then using getiterator(),
> 
> >>> for elem in tree.getiterator():
> ...     if elem.tag == 'div':
> ...         if elem.get('class') == 'section':
> ...             print elem.attrib
> ...
> {'class': 'section', 'id': 'ugseldet'}
> 
> And I *thought* the xpath would return the same thing.
> 
> >>> for div in tree.xpath('//div[@class="section"]'):
> ...     print div.attrib
> ...
> {'class': 'section', 'id': 'ugseldet'}
> {'class': 'section', 'id': 'ugstep'}

Butler, John (NIH/NLM) [E] | 2 Aug 14:48 2012

Escaped Non-ASCII characters in attribute values

The changelog for lxml 2.1beta3 states that “Non-ASCII characters in attribute values are no longer escaped on serialisation.”  I am using a product that is also making use of the C libraries libxml2 and libxslt.  The XML being produced has all the Non-ASCII characters in attribute values escaped even though they were initially valid UTF-8 encoded characters.  The vendor has informed me that there is nothing they can do about this as the escaping is being done in the libxslt C library.  Can someone on the lxml development team describe to me how lxml was able to resolve this issue?  Was this done by post processing strings produced by libxslt or is there some way to get libxslt not to escape Non-ASCII characters in attribute values?
 
Thanks,
John Butler
 
 
_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml@lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml
Stefan Behnel | 4 Aug 23:56 2012

Re: Escaped Non-ASCII characters in attribute values

Butler, John (NIH/NLM) [E], 02.08.2012 14:48:
> The changelog for lxml 2.1beta3  states that "Non-ASCII characters in
> attribute values are no longer escaped on serialisation."  I am using a
> product that is also making use of the C libraries libxml2 and libxslt.
> The XML being produced has all the Non-ASCII characters in attribute
> values escaped even though they were initially valid UTF-8 encoded
> characters.  The vendor has informed me that there is nothing they can
> do about this as the escaping is being done in the libxslt C library.
> Can someone on the lxml development team describe to me how lxml was
> able to resolve this issue?  Was this done by post processing strings
> produced by libxslt or is there some way to get libxslt not to escape
> Non-ASCII characters in attribute values?

All you have to do is request a non-ASCII encoding for the output. Passing
"UTF-8" into the serialiser will work just fine. For libxslt, it should
work to set the encoding in the <xsl:output> tag of your stylesheet.
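
A minimal sketch of the lxml side of this, assuming a small made-up document with one non-ASCII attribute value. It only shows the serialiser's encoding argument; the <xsl:output> route for libxslt is not exercised here.

```python
from lxml import etree

# Hypothetical document with a non-ASCII character in an attribute value.
root = etree.fromstring(u'<doc title="caf\u00e9"/>')

# The default output encoding is ASCII, so the character cannot be written
# directly and ends up as a character reference.
as_ascii = etree.tostring(root)

# With UTF-8 requested, the character is written as its two UTF-8 bytes.
as_utf8 = etree.tostring(root, encoding='UTF-8')

print(as_ascii)
print(as_utf8)
```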

Stefan
Jack Bates | 6 Aug 08:19 2012

SelectorSyntaxError elements with name E in any namespace *|E

Hi, I love lxml, thanks for the hard work

I just tried "sel = CSSSelector('*|Content')" with the latest lxml 
release (2.3.5) and got the exception at the bottom of this message

I checked the CSS specification and it discusses selectors of this 
format [1][2]

> ns|E
>     elements with name E in namespace ns
> *|E
>     elements with name E in any namespace, including those without a namespace
> |E
>     elements with name E without a namespace
> E
>     if no default namespace has been declared for selectors, this is
>     equivalent to *|E. Otherwise it is equivalent to ns|E where ns is
>     the default namespace.

I also checked the lxml bug tracker but didn't find this issue of 
elements in any namespace mentioned

Is this something that could be supported by lxml? Should I open an 
issue in the bug tracker for it?

Also, I found this thread on the mailing list which discusses support 
for elements in any namespace [3] and proposes a patch

   [1] http://www.w3.org/TR/css3-selectors/#typenmsp
   [2] http://www.w3.org/TR/css3-namespace/#css-qnames
   [3] http://thread.gmane.org/gmane.comp.python.lxml.devel/4937

> Traceback (most recent call last):
>   File "example.py", line 35, in <module>
>     sel = CSSSelector('*|Content')
>   File "/home/nottheoilrig/lxml-2.3.5/src/lxml/cssselect.py", line 51, in __init__
>     path = css_to_xpath(css)
>   File "/home/nottheoilrig/lxml-2.3.5/src/lxml/cssselect.py", line 537, in css_to_xpath
>     css_expr = parse(css_expr)
>   File "/home/nottheoilrig/lxml-2.3.5/src/lxml/cssselect.py", line 662, in parse
>     return parse_selector_group(stream)
>   File "/home/nottheoilrig/lxml-2.3.5/src/lxml/cssselect.py", line 677, in parse_selector_group
>     result.append(parse_selector(stream))
>   File "/home/nottheoilrig/lxml-2.3.5/src/lxml/cssselect.py", line 688, in parse_selector
>     result = parse_simple_selector(stream)
>   File "/home/nottheoilrig/lxml-2.3.5/src/lxml/cssselect.py", line 724, in parse_simple_selector
>     "Expected symbol, got '%s'" % next)
> lxml.cssselect.SelectorSyntaxError: Expected symbol, got '*' at [Token(u'*', 0), Token(u'|', 1), Symbol(u'Content', 2)] -> None
Simon Sapin | 6 Aug 09:15 2012

Re: SelectorSyntaxError elements with name E in any namespace *|E

On 06/08/2012 08:19, Jack Bates wrote:
> I just tried "sel = CSSSelector('*|Content')" with the latest lxml
> release (2.3.5) and got the exception at the bottom of this message

Hi Jack,

The parser in lxml.cssselect 2.3 and earlier is broken in many 
interesting ways. I recently took over maintenance of the project, made 
it a separate PyPI package and fixed a lot of stuff:

http://packages.python.org/cssselect/

In lxml 3.0, lxml.cssselect becomes a thin wrapper around the new 
cssselect. You can either use the alpha from last week, or use lxml 2.3 
with a more verbose syntax:

     import cssselect
     import lxml.etree
     translator = cssselect.GenericTranslator()  # or HTMLTranslator()
     sel = lxml.etree.XPath(translator.css_to_xpath('*|Content'))

Now, in the new cssselect the *syntax* for your selector is supported, 
but I’m not sure that the XPath translation is correct. Any suggestion 
in this area is welcome.

Related issue: https://github.com/SimonSapin/cssselect/issues/9

As a side note, the "translating to XPath" approach to CSS selectors 
seems easy at first but when trying to be spec-compliant you quickly 
find corner cases that are not obvious at all. In fact there are valid 
selectors for which I’m not sure that a correct XPath translation even 
exists:

https://github.com/SimonSapin/cssselect/issues/12

I started playing with a different approach: straightforward Python code 
to compile selectors into (element -> bool) callables that use the lxml 
API. It’s currently slower than cssselect (which profits from libxml2’s 
optimized XPath engine) but much simpler and probably more correct. Also 
very much work-in-progress: I literally typed 'git init' yesterday. 
(Suggestions welcome on this one too!)

https://github.com/SimonSapin/lselect
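
A minimal sketch of the "(element -> bool) callable" idea, not taken from lselect itself. The function names are hypothetical, only type selectors and the descendant combinator are covered, and namespaces are ignored.

```python
from lxml import etree

def compile_type(tag):
    """Return a matcher for a type selector such as 'p' (no namespaces)."""
    def match(el):
        return el.tag == tag
    return match

def compile_descendant(ancestor_match, subject_match):
    """Return a matcher for 'A B': test B first, then walk up looking for A."""
    def match(el):
        return subject_match(el) and any(
            ancestor_match(ancestor) for ancestor in el.iterancestors())
    return match

root = etree.fromstring('<div><p><em/></p><em/></div>')
p_em = compile_descendant(compile_type('p'), compile_type('em'))

# Bottom-up use: every element is tested separately.
matches = [el for el in root.iter() if p_em(el)]
print(len(matches))
```

Each compiled selector is an ordinary function, so it can be applied to a single element or used as a filter over a whole tree.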

Have fun,
-- 
Simon Sapin
Stefan Behnel | 6 Aug 09:41 2012

translating CSS selectors to Python code (was: SelectorSyntaxError elements with name E in any namespace *|E)

Simon Sapin, 06.08.2012 09:15:
> I started playing with a different approach: straightforward Python code 
> to compile selectors into (element -> bool) callables that use the lxml 
> API. It’s currently slower than cssselect (which profits from libxml2’s 
> optimized XPath engine) but much simpler and probably more correct. Also 
> very much work-in-progress: I literally typed 'git init' yesterday. 
> (Suggestions welcome on this one too!)
> 
> https://github.com/SimonSapin/lselect

Interesting. Are you doing bottom-up evaluation here? It looks like it's
designed for testing a given element, i.e. each element separately if you
are looking for matches in a subtree. Is that really an important use case?

If you can manage to reverse the selection, you should get a huge speedup
in tree searches by translating descendent selectors into
"parent.iter(tag)" (which is faster than the equivalent XPath expression in
lxml), instead of instantiating each element and testing its tag, which is
horribly slow. Similar for iterchildren(), iterancestors() etc. That's a
breach of purity, sure, but an important optimisation. Take a look at the
_elementpath.py module in lxml, which has a partial 'XPath' implementation
based on generators. Having the same thing for CSS selectors would be great.
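
A rough sketch of that top-down direction for a selector like "div em", assuming plain tag names. The helper name is hypothetical and this is not how _elementpath.py is written; it only shows the iter()-based search replacing per-element tests.

```python
from lxml import etree

def select_descendants(root, ancestor_tag, tag):
    """Yield each `tag` element that has an `ancestor_tag` ancestor, once."""
    seen = set()
    for ancestor in root.iter(ancestor_tag):
        # iterdescendants() excludes the ancestor itself.
        for el in ancestor.iterdescendants(tag):
            if el not in seen:  # nested ancestors reach the same element twice
                seen.add(el)
                yield el

root = etree.fromstring('<div><div><em/></div><em/></div>')
found = list(select_descendants(root, 'div', 'em'))
print(len(found))
```

The tag filtering happens inside lxml's iterators, so elements with other tags are skipped without being tested in Python code.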

Stefan

Simon Sapin | 6 Aug 10:19 2012

Re: translating CSS selectors to Python code

On 06/08/2012 09:41, Stefan Behnel wrote:
> Interesting. Are you doing bottom-up evaluation here?

Yes. I read a few times that web browsers match selectors right to left:

http://stackoverflow.com/questions/5797014/css-selectors-parsed-right-to-left-why
https://developer.mozilla.org/en/Writing_Efficient_CSS

I think they also loop on elements, and then find selectors that match, 
ie. the reversed nesting of a typical cssselect usage: for each selector 
find elements that match.

This is just an experiment in this direction. It also seems to make 
sense for the descendant combinator since an element often has many more 
descendants than ancestors. (Although this claim is based on nothing 
scientific.)

> It looks like it's
> designed for testing a given element, i.e. each element separately if you
> are looking for matches in a subtree. Is that really an important use case?

Not really. My main use case in WeasyPrint is to apply stylesheets to a 
whole document. I want (style rule, element) pairs but don’t care in 
which order they come.

> If you can manage to reverse the selection, you should get a huge speedup
> in tree searches by translating descendent selectors into
> "parent.iter(tag)" (which is faster than the equivalent XPath expression in
> lxml), instead of instantiating each element and testing its tag, which is
> horribly slow. Similar for iterchildren(), iterancestors() etc. That's a
> breach of purity, sure, but an important optimisation. Take a look at the
> _elementpath.py module in lxml, which has a partial 'XPath' implementation
> based on generators. Having the same thing for CSS selectors would be great.

Interesting. Of course this only works with element type selectors after 
a combinator, but that is a common case.

The problem is that it only works with "fully qualified" element types 
such as |E or NS|E. When there is no default namespace declared in the 
stylesheet (that is most of the time) the E selector actually means *|E

Can I use ancestor.iter(tag) with some tag object that means "this local 
name in any namespace"? Currently I use this bool expression. It’s 
probably not fast at all:

     return 'el.tag.rsplit("}", 1)[-1] == %r' % tag

-- 
Simon Sapin
Stefan Behnel | 6 Aug 10:43 2012

Re: translating CSS selectors to Python code

Simon Sapin, 06.08.2012 10:19:
> On 06/08/2012 09:41, Stefan Behnel wrote:
>> Interesting. Are you doing bottom-up evaluation here?
> 
> Yes. I read a few times that web browsers match selectors right to left:
> 
> http://stackoverflow.com/questions/5797014/css-selectors-parsed-right-to-left-why
> https://developer.mozilla.org/en/Writing_Efficient_CSS
> 
> I think they also loop on elements, and then find selectors that match, 
> ie. the reversed nesting of a typical cssselect usage: for each selector 
> find elements that match.

Right. Totally different use case.

> This is just an experiment in this direction. It also seems to make 
> sense for the descendant combinator since an element often has many more 
> descendants than ancestors. (Although this claim is based on nothing 
> scientific.)

Still, it's faster to search for matching descendants in O(n) than to walk
over all elements in O(n) and test all (or many) of their ancestors, which
is closer to something like O(n*log(n)).

>> It looks like it's
>> designed for testing a given element, i.e. each element separately if you
>> are looking for matches in a subtree. Is that really an important use case?
> 
> Not really. My main use case in WeasyPrint is to apply stylesheets to a 
> whole document. I want (style rule, element) pairs but don’t care in 
> which order they come.

Hmm, in that case, you also have to take precedence rules between the
separate style sections into account - those aren't trivial.

Anyway, since styles are inherited by entire subtrees, it might really be
better to walk the tree and to collect styles along the path, than to apply
each style section separately to the entire tree. You might also want to do
a bit of hashing, at least for the easy (and common) cases where selectors
end with a tag name.
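
One possible reading of the hashing remark, sketched under the assumption that it means indexing style rules by the tag name their selector ends with, so that each element is only tested against rules that can match it at all. The rule tuples below are hypothetical.

```python
from collections import defaultdict
from lxml import etree

# Hypothetical rules: (rightmost tag name, selector text, declarations).
rules = [
    ('p', 'div p', {'color': 'red'}),
    ('em', 'em', {'font-style': 'italic'}),
    ('p', 'p', {'margin': '0'}),
]

# Hash the rules by the tag name their selector ends with.
rules_by_tag = defaultdict(list)
for tag, selector, declarations in rules:
    rules_by_tag[tag].append((selector, declarations))

root = etree.fromstring('<div><p><em/></p></div>')

# One dict lookup per element gives the candidate rules; only those would
# still need a full selector match.
candidates = {
    el.tag: [selector for selector, _ in rules_by_tag.get(el.tag, [])]
    for el in root.iter()
}
print(candidates)
```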

>> If you can manage to reverse the selection, you should get a huge speedup
>> in tree searches by translating descendent selectors into
>> "parent.iter(tag)" (which is faster than the equivalent XPath expression in
>> lxml), instead of instantiating each element and testing its tag, which is
>> horribly slow. Similar for iterchildren(), iterancestors() etc. That's a
>> breach of purity, sure, but an important optimisation. Take a look at the
>> _elementpath.py module in lxml, which has a partial 'XPath' implementation
>> based on generators. Having the same thing for CSS selectors would be great.
> 
> Interesting. Of course this only works with element type selectors after 
> a combinator, but that is a common case.
> 
> The problem is that it only works with "fully qualified" element types 
> such as |E or NS|E. When there is no default namespace declared in the 
> stylesheet (that is most of the time) the E selector actually means *|E
> 
> Can I use ancestor.iter(tag) with some tag object that means "this local 
> name in any namespace"? Currently I use this bool expression. It’s 
> probably not fast at all:
> 
>      return 'el.tag.rsplit("}", 1)[-1] == %r' % tag

You can search for "{*}localname".

Stefan

Simon Sapin | 6 Aug 11:25 2012

Re: translating CSS selectors to Python code

On 06/08/2012 10:43, Stefan Behnel wrote:
> Simon Sapin, 06.08.2012 10:19:
>> On 06/08/2012 09:41, Stefan Behnel wrote:
>>> Interesting. Are you doing bottom-up evaluation here?
>>
>> Yes. I read a few times that web browsers match selectors right to left:
>>
>> http://stackoverflow.com/questions/5797014/css-selectors-parsed-right-to-left-why
>> https://developer.mozilla.org/en/Writing_Efficient_CSS
>>
>> I think they also loop on elements, and then find selectors that match,
>> ie. the reversed nesting of a typical cssselect usage: for each selector
>> find elements that match.
>
> Right. Totally different use case.

The use case differs in that browsers want to do progressive rendering 
or minimal updates when the DOM changes dynamically, while PDF 
conversion is static. Other than that, these are mostly implementation details.

>> This is just an experiment in this direction. It also seems to make
>> sense for the descendant combinator since an element often has many more
>> descendants than ancestors. (Although this claim is based on nothing
>> scientific.)
>
> Still, it's faster to search for matching descendants in O(n) than to walk
> over all elements in O(n) and test all (or many) of their ancestors, which
> is closer to something like O(n*log(n)).

There are also m selectors, but I guess this does not change the big-Os.

>> My main use case in WeasyPrint is to apply stylesheets to a
>> whole document. I want (style rule, element) pairs but don’t care in
>> which order they come.
>
> Hmm, in that case, you also have to take precedence rules between the
> separate style sections into account - those aren't trivial.

The cascade is not trivial but not very hard either. I already spent 
time on this and figured it out. The rule precedence is by 
origin/priority, selector specificity, then source order. The latter can 
be encoded as an integer if needed, that’s not a problem.
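
As a small illustration of that ordering, a sketch with made-up declarations for one property on one element. The numeric encoding of origin/priority (higher wins) and the tuple layout are assumptions for the example only.

```python
# Each entry: (origin_priority, specificity, source_order, value).
# Tuples compare element by element, which mirrors the precedence order:
# origin/priority first, then selector specificity, then source order.
declarations = [
    (1, (0, 0, 1), 0, 'black'),  # author rule, type selector
    (1, (0, 1, 0), 1, 'red'),    # author rule, class selector
    (1, (0, 1, 0), 2, 'blue'),   # same specificity, later in the source
    (0, (1, 0, 0), 3, 'grey'),   # lower-priority origin, ID selector
]

# The winning declaration is the maximum under that ordering.
winner = max(declarations, key=lambda d: d[:3])
print(winner[3])
```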

> Anyway, since styles are inherited by entire subtrees, it might really be
> better to walk the tree and to collect styles along the path, than to apply
> each style section separately to the entire tree.

Yes, by looping on elements first and then selectors I could do the 
cascade and inheritance in one pass (versus two currently). That would 
help with memory usage but I’m not sure about speed.

> You might also want to do
> a bit of hashing, at least for the easy (and common) cases where selectors
> end with a tag name.

I don’t understand, hashing what?

>>       return 'el.tag.rsplit("}", 1)[-1] == %r' % tag
>
> You can search for "{*}localname".

Great! That’s what I was missing. That would be for the .iter*() 
methods. Can I do a similar test on a single element?
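
One way such a single-element test could look, sketched with etree.QName, which splits a tag into namespace and local name. The helper name is hypothetical, and it assumes it is only called on element nodes (not comments or processing instructions).

```python
from lxml import etree

def has_local_name(el, local_name):
    """True if the element has this local name, whatever its namespace."""
    return etree.QName(el).localname == local_name

root = etree.fromstring(
    '<r xmlns:a="aa" xmlns:b="bb"><a:e/><b:e/><e/></r>')

flags = [has_local_name(child, 'e') for child in root]
print(flags, has_local_name(root, 'e'))
```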

-- 
Simon Sapin
Simon Sapin | 6 Aug 13:51 2012

Re: translating CSS selectors to Python code

On 06/08/2012 10:43, Stefan Behnel wrote:
>>       return 'el.tag.rsplit("}", 1)[-1] == %r' % tag
> You can search for "{*}localname".

Unfortunately that doesn’t seem to work:

     >>> t = lxml.etree.fromstring(
     ...   '<r xmlns:a="aa" xmlns:b="bb"><a:e/><b:e/></r>')
     >>> list(t.iter('{aa}*'))
     [<Element {aa}e at 0x273d500>]
     >>> list(t.iter('{*}e'))
     []  # I expected [{aa}e, {bb}e]

Same on lxml 2.3.5 and 3.0.alpha1
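
A fallback sketch that does not rely on the '{*}e' wildcard: iterate over all elements and compare local names in Python. The helper name is hypothetical, and this is expected to be slower than a tag-filtered iter().

```python
from lxml import etree

def iter_local(root, local_name):
    """Yield elements with the given local name in any (or no) namespace."""
    # Passing etree.Element restricts the iteration to element nodes.
    for el in root.iter(etree.Element):
        if el.tag.rsplit('}', 1)[-1] == local_name:
            yield el

t = etree.fromstring(
    '<r xmlns:a="aa" xmlns:b="bb"><a:e/><b:e/></r>')
tags = [el.tag for el in iter_local(t, 'e')]
print(tags)
```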

-- 
Simon Sapin