Robert Pierce | 1 Jun 18:21

lxml.objectify.deannotate refuses to clean nil nodes


The nil node <Fubar/> is not deannotated as I would expect in the following snippet.  I could not find a reference to this behaviour in the archives or documentation.  Is this a design feature for which there is a work around, or a bug?  I'm using lxml-2.2-py2.5-linux-i686.

Thanks!

#### CODE ####

import lxml.etree
import lxml.objectify

x = lxml.objectify.fromstring('<root><Bar/></root>')
x.Foo = ''
x.Fubar = None
lxml.objectify.deannotate(x)
lxml.etree.cleanup_namespaces(x)
print lxml.etree.tostring(x)

#### END CODE ###

<root><Bar/><Foo></Foo><Fubar xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:nil="true"/></root>

_______________________________________________
lxml-dev mailing list
lxml-dev <at> codespeak.net
http://codespeak.net/mailman/listinfo/lxml-dev
Alexis Georges | 1 Jun 22:36

Re: XML Documents & I18N (the way Cocoon does it)

Hi,

This is a bit late, but thanks for the response.

I am playing around with iterparse() and am following the advice you gave.

I have a question though: I could not find a way to consume an element
and replace it with just text. For example, <i18n:text>hello</i18n:text>,
when found in the middle of a paragraph, should be replaced by plain
text. The replace() method requires the replacement to be an element.

Is this possible?

Thanks!

Alexis Georges

On 28-Apr-09, at 1:59 PM, Stefan Behnel wrote:

> Hi,
>
> Alexis Georges wrote:
>> I am maintaining a multilingual website which works with XML, XSLT to
>> generate XHTML.
>>
>> I am working with Apache Cocoon (http://cocoon.apache.org/2.1/) using
>> (among other things) their I18NTransformer. Basically I can use
>> elements in the I18N (http://apache.org/cocoon/i18n/2.1) namespace,
>> and then tell Cocoon to apply the I18NTransformer to the document;
>> this replaces the I18N elements with a localized value (e.g. a
>> formatted date/number, a translated label/attribute, etc.).
>>
>> I have been looking at lxml a little bit to see if I could move to a
>> Python-based framework for the website. I am not quite sure how to go
>> about the I18N part though.
>>
>> Using the Babel library (http://babel.edgewall.org/) along with
>> request headers to generate localized data, I have everything I need.
>> What is missing is the "parser" for the I18N elements. All I can think
>> of right now is to implement a SAX parser, the way Cocoon does (in
>> Java).
>
> There is a SAX-like interface in lxml.etree, called "target parser".
>
> However, if your documents fit into memory, using iterparse() is a lot
> simpler (and likely not even much slower).
>
> Something like this might work:
>
>     context = etree.iterparse(
>              "somefile.xml",
>              tag = "{http://apache.org/cocoon/i18n/2.1}*")
>
>     for event, i18n_element in context:
>         new_element = get_i18n_replacement_for(i18n_element)
>         i18n_element.getparent().replace(i18n_element, new_element)
>
>     context.root.getroottree().write("newfile.xml")
>
> See here for some documentation:
>
> http://codespeak.net/lxml/parsing.html
>
> You can also achieve the same thing in XSLT, or using XPath, or ...
>
> Stefan
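
A note on the follow-up question above (replacing an <i18n:text> element with plain text): lxml keeps the text that follows an element in that element's .tail, so one way to do this is to append the replacement text plus the old tail to the previous sibling's tail (or to the parent's text when there is no previous sibling) and then remove the element. The following is only a sketch under assumptions: replace_with_text() and translate() are hypothetical helpers, not part of lxml, and the tree is parsed completely first rather than edited inside the iterparse() loop, since an element's tail may not have been read yet when its "end" event fires.

```python
from lxml import etree

I18N_NS = "http://apache.org/cocoon/i18n/2.1"


def translate(key):
    # Hypothetical stand-in for a real catalogue lookup (e.g. via Babel).
    return {"hello": "bonjour"}.get(key, key)


def replace_with_text(element, text):
    # Hypothetical helper: swap `element` for plain text, keeping its tail.
    # Assumes `element` has a parent, i.e. it is not the root element.
    parent = element.getparent()
    merged = (text or "") + (element.tail or "")
    previous = element.getprevious()
    if previous is not None:
        # Text after a sibling lives in that sibling's tail.
        previous.tail = (previous.tail or "") + merged
    else:
        # No earlier sibling: the text belongs to the parent itself.
        parent.text = (parent.text or "") + merged
    parent.remove(element)


root = etree.fromstring(
    '<p xmlns:i18n="%s">Say <i18n:text>hello</i18n:text> to <b>all</b>, '
    'then <i18n:text>hello</i18n:text> again.</p>' % I18N_NS)

# Collect the matches first so the tree is not modified while iterating.
for element in list(root.iter("{%s}text" % I18N_NS)):
    replace_with_text(element, translate(element.text))

print(etree.tostring(root, encoding="unicode"))
```

The tail is merged before the element is removed because removing an element in lxml takes its tail text along with it.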
qhlonline | 2 Jun 04:25

lxml decoding problem caused by <meta> tag specification

Hi all,

Some HTML files carry a <meta> tag whose declared charset is wrong, because the content that follows actually uses a different encoding. lxml nevertheless parses according to the declaration. In this situation it may report a decoding error, but sometimes it parses anyway and produces an incomplete DOM. For example, I have a web file with a <meta> tag declaring one charset, while the following content is encoded in GBK (which is a superset of GB2312). The result contained only part of the HTML tags. Does lxml report any warning or error in this situation? If so, what is it, and how can we deal with this kind of fault? Is there a common method?

I have also seen HTML files whose tags carry a "lang" attribute. I don't know whether this attribute is used during HTML parsing. A <meta> tag can also carry a language statement, but in the htmlCheckMeta function of the libxml2 source I did not find any processing of the http-equiv attribute value "Content-Language". Is that because "Content-Language" is not standard? Does lxml support this attribute? If so, how does it handle a content="zh-cn" declaration when the document is actually in a different language?

Yours




jholg | 2 Jun 09:59

Re: lxml.objectify.deannotate refuses to clean nil nodes

Hi,

> The nil node <Fubar/> is not deannotated as I would expect in the
> following snippet.  I could not find a reference to this behaviour in
> the archives or documentation.  Is this a design feature for which
> there is a work around, or a bug?  I'm using lxml-2.2-py2.5-linux-i686.

Design feature. Only py:pytype/xsi:type attributes get removed by deannotate():

>>> print etree.__version__
2.1.5
>>> help(objectify.deannotate)

Help on built-in function deannotate in module lxml.objectify:

deannotate(...)
    deannotate(element_or_tree, pytype=True, xsi=True)

    Recursively de-annotate the elements of an XML tree by removing 'pytype'
    and/or 'type' attributes.

    If the 'pytype' keyword argument is True (the default), 'pytype' attributes
    will be removed. If the 'xsi' keyword argument is True (the default),
    'xsi:type' attributes will be removed.

IMHO the xsi:nil concept in XML Schema pretty much corresponds to NULL values in databases, i.e. a typed
element/column may (or may not) be xsi:nil/NULL, but it does not so directly translate to the distinct
Python None object. OTOH I think mapping xsi:nil to None very much captures the meaning of xsi:nil/NULL,
because in most use cases you'd test if a value has been set (!=None) or not (==None).

Of course, you can always easily get rid of xsi:nil if you wish:

>>> for elt in root.iter(): elt.attrib.pop('{http://www.w3.org/2001/XMLSchema-instance}nil', None)

Holger
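
For reference, the one-liner above can be combined with deannotate() into a single clean-up step. This is a minimal sketch, assuming Python 3 and a current lxml; deannotate_fully() is a hypothetical helper name, not an lxml function. Later lxml releases also appear to accept an xsi_nil=True keyword in deannotate(), which may make the explicit loop unnecessary.

```python
from lxml import etree, objectify

XSI_NIL = "{http://www.w3.org/2001/XMLSchema-instance}nil"


def deannotate_fully(root):
    # Hypothetical helper: strip py:pytype/xsi:type and also xsi:nil.
    objectify.deannotate(root)
    # "*" restricts the walk to real elements (no comments or PIs).
    for element in root.iter("*"):
        element.attrib.pop(XSI_NIL, None)
    # Drop namespace declarations that are no longer used.
    etree.cleanup_namespaces(root)


root = objectify.fromstring("<root><Bar/></root>")
root.Foo = ""
root.Fubar = None          # annotated with xsi:nil="true"

deannotate_fully(root)
result = etree.tostring(root)
print(result)
```

Note that once xsi:nil is gone, <Fubar/> is just an empty element, so objectify will no longer map it to None.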
qhlonline | 2 Jun 11:04

lxml about Target Parser

Hi all,
   When using lxml with a self-defined target parser, there is a method that can be overridden: data, i.e. def data(self, data).
When can we use it? And what happens if its body is just a single line, "return"? Is there any encoding conversion?



Stefan Behnel | 2 Jun 20:12

Re: lxml about Target Parser


qhlonline wrote:
> When I used the lxml with self defined  Target Parser, There is a
> function that can be redefined-- data . def data (self, data): When can
> we use it?

when you want to receive character content from the document you parse.

> and what it will do when we simply write a single line: "return " ?

nothing? actually, a "pass" will do in that case, as will not implementing
the method (IIRC).

> Is there any encoding conversion?

You will get either ASCII encoded byte strings or unicode strings, just
like everywhere else.

BTW, it's sometimes faster to try these things out than to ask a mailing list.

Stefan
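
To illustrate the answers above, here is a small sketch of a parser target whose data() method collects the character content of a document. It assumes Python 3, where data() receives str objects; TextCollector is a hypothetical class name chosen for this example.

```python
from lxml import etree


class TextCollector:
    """Hypothetical parser target that gathers all character data."""

    def __init__(self):
        self.chunks = []

    def start(self, tag, attrib):
        pass  # element opened; not needed for this example

    def end(self, tag):
        pass  # element closed; not needed for this example

    def data(self, data):
        # Called with character content; may arrive in several chunks.
        self.chunks.append(data)

    def close(self):
        # Whatever close() returns becomes the result of the parse call.
        return "".join(self.chunks)


parser = etree.XMLParser(target=TextCollector())
text = etree.fromstring("<root>Hello <b>target</b> parser</root>", parser)
print(text)
```

With a target set, the parse call returns the value of close() rather than an element tree.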
Stefan Behnel | 2 Jun 20:29

Re: lxml decoding problem caused by <meta> tag specification

Hi,

qhlonline wrote:
> There are instances that when an HTML file has meta tags, the
> charset declared in the <meta> tag is not right, because the HTML content
> next is using a different encoding. But lxml will parse according to what
> the <meta> tag said. In this situation, it may report error information of
> error decoding, but sometimes it can parse, and generate a DOM that is
> not complete.

By default, the HTML parser will ignore errors and try to keep parsing
regardless. Pass "recover=False" if you want to get an exception instead.

Note that character decoding errors cannot always be detected, as they may
lead to valid (although unreadable) characters even when the wrong encoding
is assumed. Latin-1 is a good example, which uses a plain 8-bit encoding.
It will work perfectly well to read a UTF-8 encoded document with a Latin-1
decoder. It just won't give you readable output in most cases.
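
The Latin-1 point can be checked in plain Python, without lxml. A minimal sketch (the sample string is made up for illustration): every byte value is a valid Latin-1 character, so decoding UTF-8 bytes as Latin-1 never raises, it only yields the wrong characters.

```python
original = "Grüße"                       # made-up sample text
utf8_bytes = original.encode("utf-8")    # 7 bytes for 5 characters

# Wrong decoder, but no exception: each byte maps to some Latin-1 character.
misread = utf8_bytes.decode("latin-1")
print(repr(misread))

# The damage is reversible as long as nothing else touched the text.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired == original)
```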

> e.g. I have a web file with a <meta> charset declaration, while the
> following content is encoded with GBK (which is a superset of GB2312).
> We have got a result with only part of the HTML tags parsed out. I want
> to know if lxml has any warning or error information reported for this
> situation. What is it?

See the error_log property on the parser.

http://codespeak.net/lxml/parsing.html#error-log
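
A short sketch of reading that log after a parse, assuming Python 3. The input here is made-up, deliberately sloppy HTML; which messages libxml2 reports for it (if any) depends on the libxml2 version, so the log may well be empty.

```python
from lxml import etree

# Made-up sample input for illustration only.
data = (b"<html><body><p>unclosed <b>bold</p>"
        b"<unknown>tag</unknown></body></html>")

parser = etree.HTMLParser()              # recovers from errors by default
root = etree.fromstring(data, parser)

# error_log holds whatever was reported while parsing this document.
messages = [(entry.line, entry.message) for entry in parser.error_log]
for line, message in messages:
    print("line %d: %s" % (line, message))
```

Passing recover=False to HTMLParser(), as mentioned above, turns such problems into an exception instead of a log entry.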

> Is there any common method? I have also seen some HTML files that have
> the tag attribute "lang". I don't know whether this attribute is used in
> the HTML parsing process.

I don't think so.

> In a <meta> tag there is also a language statement, but in the
> htmlCheckMeta function of the libxml2 library source, I didn't find any
> processing of the http-equiv attribute value "Content-Language".

The "language" is not relevant to the parser. The charset is. Just think of
UTF-8, which can encode any written language that uses characters defined
in Unicode.

Stefan
Stefan Behnel | 2 Jun 21:24

Re: lxml.objectify.deannotate refuses to clean nil nodes

Hi,

Holger wrote:
>> The nil node <Fubar/> is not deannotated as I would expect in the
>> following snippet.  I could not find a reference to this behaviour in
>> the archives or documentation.  Is this a design feature for which
>> there is a work around, or a bug?  I'm using lxml-2.2-py2.5-linux-i686.
> 
> Design feature.

I'd be a little more careful with such a big word. ;)

> Only py:pytype/xsi:type attributes get removed by deannotate():
> 
>>>> print etree.__version__
> 2.1.5
>>>> help(objectify.deannotate)
> 
> Help on built-in function deannotate in module lxml.objectify:
>  
> deannotate(...)
>     deannotate(element_or_tree, pytype=True, xsi=True)
>  
>     Recursively de-annotate the elements of an XML tree by removing 'pytype'
>     and/or 'type' attributes.
>  
>     If the 'pytype' keyword argument is True (the default), 'pytype' attributes
>     will be removed. If the 'xsi' keyword argument is True (the default),
>     'xsi:type' attributes will be removed.

Yes, so it's even implicitly documented. :)

Anyway, I'm not sure it's always a good idea to leave this special case in
instead of cleaning everything up. I think if you remove it, you'd get an
empty string result, which may be surprising - but more surprising than not
getting it cleaned up? After all, deannotate() means deannotate()...

Stefan
Stefan Behnel | 2 Jun 22:13

lxml 2.2.1 released

Hi all,

I just pushed lxml 2.2.1 to PyPI as a minor maintenance release.

Changelog follows below.

This release was built with Cython 0.11.2.

Have fun,

Stefan

2.2.1 (2009-06-02)
Features added

    * Injecting default attributes into a document during XML Schema
      validation (also at parse time).
    * Pass huge_tree parser option to disable parser security restrictions
      imposed by libxml2 2.7.

Bugs fixed

    * The script for statically building libxml2 and libxslt didn't work
      in Py3.
    * XMLSchema() also passes invalid schema documents on to libxml2 for
      parsing (which could lead to a crash before release 2.6.24).
Robert Pierce | 3 Jun 02:43

Re: lxml.objectify.deannotate refuses to clean nil nodes

Thanks!  That answers my questions.  The apparent asymmetry of handling nodes was confusing, but the distinction of pytypes vs xsi makes some sense.  I would naively agree that a seemingly general purpose function like deannotate should remove everything.  Otherwise, I have to walk the tree twice: once with deannotate and once to unlink the remaining nil types.  Or recreate my own deannotate().  Not a big deal either way, though.

On Tue, Jun 2, 2009 at 12:24 PM, Stefan Behnel <stefan_ml <at> behnel.de> wrote:
Hi,

Holger wrote:
>> The nil node <Fubar/> is not deannotated as I would expect in the
>> following snippet.  I could not find a reference to this behaviour in
>> the archives or documentation.  Is this a design feature for which
>> there is a work around, or a bug?  I'm using lxml-2.2-py2.5-linux-i686.
>
> Design feature.

I'd be a little more careful with such a big word. ;)


> Only py:pytype/xsi:type attributes get removed by deannotate():
>
>>>> print etree.__version__
> 2.1.5
>>>> help(objectify.deannotate)
>
> Help on built-in function deannotate in module lxml.objectify:
>
> deannotate(...)
>     deannotate(element_or_tree, pytype=True, xsi=True)
>
>     Recursively de-annotate the elements of an XML tree by removing 'pytype'
>     and/or 'type' attributes.
>
>     If the 'pytype' keyword argument is True (the default), 'pytype' attributes
>     will be removed. If the 'xsi' keyword argument is True (the default),
>     'xsi:type' attributes will be removed.

Yes, so it's even implicitly documented. :)

Anyway, I'm not sure it's always a good idea to leave this special case in
instead of cleaning everything up. I think if you remove it, you'd get an
empty string result, which may be surprising - but more surprising than not
getting it cleaned up? After all, deannotate() means deannotate()...

Stefan

