Christian Zagrodnick | 16 May 14:20

bug: objectify removes text on replace()?

Hi,

with lxml 2.0.4 I get text removed when I replace a node. The text 
after the replaced node vanishes....

-----------------------
import lxml.objectify
import lxml.etree

xml = lxml.objectify.fromstring(
    '<foo><bar>before baz<baz/>after baz</bar><bar/></foo>')
print lxml.etree.tostring(xml, pretty_print=True)
print 50*'-'

baz = xml['bar']['baz']
xml['bar'].replace(baz, lxml.objectify.E.holler())

print lxml.etree.tostring(xml, pretty_print=True)
-----------------

Prints out:

<foo>
  <bar>before baz<baz/>after baz</bar>
  <bar/>
</foo>

--------------------------------------------------
<foo>
  <bar>before baz<holler 
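
In the ElementTree data model, the text that follows an element is stored in that element's `tail`, so it travels with the node being replaced. A minimal sketch of a workaround, assuming plain lxml.etree elements rather than objectify ones; the helper name `replace_keep_tail` is hypothetical:

```python
from lxml import etree


def replace_keep_tail(parent, old, new):
    """Replace `old` with `new` under `parent`, carrying over the trailing text."""
    # Copy the tail first: it belongs to `old` and leaves the tree with it.
    new.tail = old.tail
    parent.replace(old, new)


root = etree.fromstring('<foo><bar>before baz<baz/>after baz</bar><bar/></foo>')
bar = root.find('bar')
replace_keep_tail(bar, bar.find('baz'), etree.Element('holler'))
result = etree.tostring(root)
print(result)
```

Copying the tail before the call keeps the helper independent of how a particular lxml release treats tail text inside `replace()`.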

Viksit Gaur | 16 May 04:57

Efficient methods to build a tree out of HTML structure?

Hi all,

I was wondering - what would be the most efficient method to access all 
the elements in the DOM tree, in some order, using lxml.etree?

The methods I currently see in the docs return a class like 
ElementDepthfirstIterator or iterwalk, which have 2 issues -

1) The first has a flat representation of the tree, so I lose 
child/parent structure

2) Things like iterwalk do return "start" and "end" actions - but 
instead of first doing an iterwalk and then parsing the results, is 
there a better way to construct the tree when iterwalk itself is running?

Or perhaps there is some method I've missed completely?

Quick note on what I'm trying to do - graphically represent the DOM 
structure of a page using a library like networkX..
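
One possible approach, sketched here under assumptions: iterate over the elements once and record a (parent, child) pair for every child, so the flat iteration still yields the structure. The helper name `parent_child_edges` is hypothetical, and XPath-style paths from `getpath()` are used as node labels only so that repeated tags stay distinct.

```python
from lxml import etree


def parent_child_edges(root):
    """Return (parent_path, child_path) pairs for every element below root."""
    tree = root.getroottree()
    edges = []
    for parent in root.iter():
        if not isinstance(parent.tag, str):
            continue  # skip comments and processing instructions
        for child in parent:
            if isinstance(child.tag, str):
                edges.append((tree.getpath(parent), tree.getpath(child)))
    return edges


html = '<html><head><title>t</title></head><body><p>a</p><p>b</p></body></html>'
root = etree.fromstring(html)
edges = parent_child_edges(root)
for edge in edges:
    print(edge)
```

An edge list like this should be usable as input for a graph library (for networkX, something like `DiGraph.add_edges_from(edges)`), though that step is not shown or tested here.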

Cheers,
Viksit

roger patterson | 15 May 06:21

Re: html entities and lxml.html.ElementSoup

Hi Viksit,

What you typed was correct, except you have to note that
lxml.html.soupparser.convert_tree(soup) returns a *list* of root
elements, so you can't just do a lxml.etree.tostring() on the list.
Depending on your HTML, choosing the first element will probably work.

I have moved to the trunk now, so am working well with the new
lxml.html.soupparser.  But if you're stuck on that branch, then that
work-around worked for me.  Hope it works for you!
cheers
-Roger

2008/5/14 Viksit Gaur <viksit <at> aya.yale.edu>:
> Hi there,
>
>>Roger Patterson wrote:
>>> I'm getting an interesting situation.  When using the very cool
>>> ElementSoup add-on to lxml.html with certain source-html files that
>>> already encode entities (eg. &#163;), using the ElementSoup.parse()
>>> messes up the entities.
>
> I'm running into the same problem.
>
>> It looks like it's not the parse(), but rather the serialisation. What happens
>> is that the entity references end up in the /text/ content, which is clearly
>> wrong as it leads to re-escaping of the references on the way out.
>

Kenneth Miller | 12 May 23:17

Looking for general insight.

All,

     Any opinions on my problem would be greatly appreciated.

    I've got a large pre-defined XML schema, tons of data types etc. I
want to be able to create Python objects from the schemas and traffic
these objects in and out of some sort of a database. Could I perhaps
create these objects using lxml and extend lxml to use Zope persistence?

Regards,
Kenneth Miller

Stefan Behnel | 12 May 18:26

Re: install lxml 2.0.5 on Mac OS X Leopard - why is it so hard?

Hi,

Mike Meyer wrote:
> Apple's official position is that static linking of
> applications is unsupported. They don't provide static versions of any
> of the system libraries.
> 
> Likewise, macports doesn't provide static libraries for the libraries
> it installs, and the docs don't hint at any way to get it to do so.

Great! Now that would have been too easy anyway, wouldn't it? :-/

Thanks for the info. Now, anyone for a plan B?

Stefan

Kumar McMillan | 10 May 23:43

install lxml 2.0.5 on Mac OS X Leopard - why is it so hard?

I know this has been discussed over and over but I'm writing to see if
anyone has made a breakthrough yet.  The problem of course is that
Leopard's built-in libxml2 and libxslt are too old for lxml 2.0.  You
have to either build libxml2 from source or use a port.  There is
currently a problem with the libxml2 port, but the workaround is going
fine for me: http://trac.macports.org/ticket/15230 (I know because
postgres built just fine and I have some tests exercising psycopg2 as
well)

So after updating my libxml2 to 2.6.31 and libxslt to 1.1.23 and
updating my $PATH so that the new xml2-config and xslt-config can be
found, I can build lxml *without errors* but I see these warnings:

$ sudo easy_install lxml-2.0.5.tgz
Processing lxml-2.0.5.tgz
Running lxml-2.0.5/setup.py -q bdist_egg --dist-dir
/tmp/easy_install-3azY8e/lxml-2.0.5/egg-dist-tmp-t80esG
Building lxml version 2.0.5.
NOTE: Trying to build without Cython, pre-generated 'src/lxml/etree.c'
needs to be available.
Using build configuration of libxslt 1.1.23

ld: warning in /opt/local/lib/libxslt.dylib, file is not of required
architecture
ld: warning in /opt/local/lib/libexslt.dylib, file is not of required
architecture
ld: warning in /opt/local/lib/libxml2.dylib, file is not of required
architecture
[... and more like this ...]
...
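
Once a build does import, one way to confirm which libxml2 and libxslt it actually picked up (as opposed to the system copies) is to print the version constants that lxml.etree exposes. A small sketch; the labels in the dictionary are just for display:

```python
from lxml import etree

# Each constant is a tuple of integers, e.g. (2, 6, 31) for libxml2 2.6.31.
versions = {
    'lxml': etree.LXML_VERSION,
    'libxml2 (running)': etree.LIBXML_VERSION,
    'libxml2 (compiled against)': etree.LIBXML_COMPILED_VERSION,
    'libxslt (running)': etree.LIBXSLT_VERSION,
    'libxslt (compiled against)': etree.LIBXSLT_COMPILED_VERSION,
}
for name, version in versions.items():
    print('%-28s %s' % (name, '.'.join(str(part) for part in version)))
```

If the "running" and "compiled against" versions differ, the extension was most likely built against one copy of the library and is loading another at run time.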

Arye | 9 May 18:25

validation with multiple XSD files


Hello all,
I would like to do some schema validation and started with the instructions in:
http://codespeak.net/lxml/dev/validation.html#xmlschema


This all works great. Now I would like to extend this to an XSD file that includes many other files. In other words, I have a directory of XSD files that I would like to use. The include statement looks like this (the included file is referenced by its name):

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
    <xsd:include schemaLocation="base.xsd"/>
    <xsd:element name="Price">
       ...
       ... some types defined in "base.xsd" are used here



I am new to lxml so sorry in advance if the question does not make sense.

Regards,
Arye.
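
A minimal sketch of one way this can work, assuming the main schema is parsed from a file path so that the relative schemaLocation can be resolved against the directory holding it. The two schema files, the `Amount` type and the file name `main.xsd` are made up for illustration:

```python
import os
import tempfile

from lxml import etree

BASE_XSD = b'''<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
    <xsd:simpleType name="Amount">
        <xsd:restriction base="xsd:decimal"/>
    </xsd:simpleType>
</xsd:schema>
'''

MAIN_XSD = b'''<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
    <xsd:include schemaLocation="base.xsd"/>
    <xsd:element name="Price" type="Amount"/>
</xsd:schema>
'''

# Write both schema files into one (temporary) directory.
schema_dir = tempfile.mkdtemp()
for name, content in (('base.xsd', BASE_XSD), ('main.xsd', MAIN_XSD)):
    with open(os.path.join(schema_dir, name), 'wb') as handle:
        handle.write(content)

# Parsing from a path gives the schema document a base URL, so the relative
# schemaLocation="base.xsd" is looked up next to main.xsd.
schema = etree.XMLSchema(etree.parse(os.path.join(schema_dir, 'main.xsd')))

valid = schema.validate(etree.fromstring('<Price>9.99</Price>'))
invalid = schema.validate(etree.fromstring('<Price>cheap</Price>'))
print(valid, invalid)
```

A schema built from a string has no location of its own, so a relative include has nothing to resolve against; loading the top-level XSD from its file avoids that.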



_______________________________________________
lxml-dev mailing list
lxml-dev <at> codespeak.net
http://codespeak.net/mailman/listinfo/lxml-dev

Ben | 9 May 17:15

Re: Getting info from an XML file that has invalid character data in it (and how to specify recover option)

> Stefan wrote:
> 
> Not sure if this is just a "find-a-short-example" error, but you parse
> the filename, not the file here. This should read
> 
>    Xml   = etree.parse(XmlFileName, parser)

(LOL) This is indeed a "find-a-short-example" error - which is what you use when you are a sysadmin.  Now it
works and gets me past the invalid characters too.
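
For reference, a self-contained sketch of the corrected call, assuming a recovering parser and a made-up sample file with a stray control character (the file contents and variable names are illustrative, not the real Backup Exec report):

```python
import os
import tempfile

from lxml import etree

# Hypothetical stand-in for a report containing an invalid character (0x0B).
SAMPLE = (b'<job><error>bad \x0b data</error>'
          b'<end_time>03:00</end_time>'
          b'<engine_completion_status>ok</engine_completion_status></job>')

handle, xml_file_name = tempfile.mkstemp(suffix='.xml')
with os.fdopen(handle, 'wb') as out:
    out.write(SAMPLE)

parser = etree.XMLParser(recover=True)
# Pass the file name itself; wrapping it in StringIO would parse the
# name as if it were the XML content.
xml = etree.parse(xml_file_name, parser)
end_time = xml.findtext('.//end_time')
status = xml.findtext('.//engine_completion_status')
print(end_time, status)
os.remove(xml_file_name)
```

How much of a damaged document survives recovery depends on the underlying libxml2, so it is worth checking `parser.error_log` when a value comes back as None.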

Thanks for lxml

Ben | 9 May 16:09

Getting info from an XML file that has invalid character data in it (and how to specify recover option)

Hello

I'm writing some code to check whether our daily backups worked.   Backup Exec stores its
results in XML files.   Sometimes bad characters - or maybe binary data - end up in
these XML files and then lxml chokes:

C:\>python sb-lxml.py
Traceback (most recent call last):
  File "sb-lxml.py", line 5, in <module>
    Xml = etree.parse(XmlFileName)
  File "lxml.etree.pyx", line 2520, in lxml.etree.parse (src/lxml/lxml.etree.c:22062)
  File "parser.pxi", line 1309, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:53088)
  File "parser.pxi", line 1338, in lxml.etree._parseDocumentFromURL
(src/lxml/lxml.etree.c:53337)
  File "parser.pxi", line 1248, in lxml.etree._parseDocFromFile
(src/lxml/lxml.etree.c:52584)
  File "parser.pxi", line 828, in lxml.etree._BaseParser._parseDocFromFile
(src/lxml/lxml.etree.c:50115)
  File "parser.pxi", line 452, in lxml.etree._ParserContext._handleParseResultDoc
(src/lxml/lxml.etree.c:47023)
  File "parser.pxi", line 536, in lxml.etree._handleParseResult
(src/lxml/lxml.etree.c:47861)
  File "parser.pxi", line 478, in lxml.etree._raiseParseError
(src/lxml/lxml.etree.c:47285)
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 11, line 132, column 95

The offending line looks like this (not sure if the bad characters will make it through
the email):

</error><error>Directory not found. Can not backup directory \Data\\l Strategy - Progress
Rep.doc\\\??ā?\\VIC-ve\TT\miscellaneous and its subdirectories.

Example code to demonstrate how I use it (with lxml-2.0.5 and Python 2.5.2):
##################################
Xml = etree.parse(XmlFileName)
print Xml.findtext(".//end_time")
print Xml.findtext(".//engine_completion_status")
##############################

The code works fine unless there are invalid characters in the file. I am happy for any
suggestion, because the bit I'm interested in is always near the end of the XML file, and
there should be a way to get it reliably regardless of the gunk elsewhere in the file (or that's what I hope).

Also, I've tried the 'recover' parser option, but I'm doing something wrong, because I get
this:

C:\>python sb-lxml.py
Traceback (most recent call last):
  File "sb-lxml.py", line 9, in <module>
    print Xml.findtext(".//end_time")
  File "lxml.etree.pyx", line 1656, in lxml.etree._ElementTree.findtext
(src/lxml/lxml.etree.c:15354)
  File "lxml.etree.pyx", line 1489, in lxml.etree._ElementTree._assertHasRoot
(src/lxml/lxml.etree.c:14116)
AssertionError: ElementTree not initialized, missing root

The code I tried for the 'recover' parser option:

XmlFileName = r'c:/BEX03194.xml'
parser = etree.XMLParser(recover=True)
Xml   = etree.parse(StringIO(XmlFileName), parser)
print Xml.findtext(".//end_time")
print Xml.findtext(".//engine_completion_status")

I guess I'm just specifying the option wrong, but can't see how I should be doing it.

Any suggestion, including how to circumvent/work around the problem is most welcome.

Brad Smith | 8 May 22:27

Re: Querying valid children of an element?

To clarify what I'm doing: the goal is to have a shorthand
language (not entirely tag-based) that is easier for subject matter
experts to learn than docbook, which can then be converted into full
docbook once they've written a first draft. So, to illustrate one
aspect of it, instead of writing

<itemizedlist>
<listitem><para><command>foomaster</command> example...</para>
  <itemizedlist>
    <listitem><screen>$ foomaster [OPTIONS]</screen></listitem>
  </itemizedlist>
</listitem>
</itemizedlist>

They can write

* <command>foomaster</command> example...
** <screen>$ foomaster [OPTIONS]</screen>

As you can see, the translation process consists of not just
converting asterisks into the appropriate combination of itemizedlists
and listitems, but also protecting cdata within paras where necessary.
In the first one, the interpreter sees that <command> isn't allowed
inside <listitem>, which is its cue to try inserting a <para>.
<screen> is allowed within <listitem>, so it does not insert a para.
Making that determination is what I'm trying to find the best approach
for.

Currently I use a function like this:

def validateAppend(parent, child):
    # dtd (the DTD validator) and dbg (a debug logger) are defined elsewhere.
    parent.append(child)
    if not dtd.validate(parent):
        dbg("Appending %s to %s failed DTD validation" % (child.tag, parent.tag))
        del parent[-1]  # undo the trial append
        return False
    return True

This works but, like I said, is not terribly efficient, so I just
wanted to see if there was another method for making the
determination.
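
One alternative, along the lines of keeping the grammar in a form the program can query directly, would be a plain lookup table of allowed children. The table below is an illustrative subset made up for this example, not the real DocBook content model, and the function names are hypothetical:

```python
# Illustrative subset only; a real table would be derived from the DTD.
ALLOWED_CHILDREN = {
    'itemizedlist': {'listitem'},
    'listitem': {'para', 'screen', 'itemizedlist'},
    'para': {'command', '#text'},
    'screen': {'#text'},
}


def is_allowed(parent_tag, child_tag):
    """True if child_tag may appear directly inside parent_tag."""
    return child_tag in ALLOWED_CHILDREN.get(parent_tag, set())


def needs_para(parent_tag, child_tag):
    """True if the child is only legal under the parent when wrapped in <para>."""
    return (not is_allowed(parent_tag, child_tag)
            and is_allowed(parent_tag, 'para')
            and is_allowed('para', child_tag))


print(needs_para('listitem', 'command'))
print(needs_para('listitem', 'screen'))
```

A set lookup per append avoids re-validating the whole parent each time; full DTD validation can still be run once on the finished document.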

--Brad

On Thu, May 8, 2008 at 11:44 AM, Mike Meyer <mwm <at> mired.org> wrote:
> On Thu, 08 May 2008 09:33:17 +0200 Stefan Behnel <stefan_ml <at> behnel.de> wrote:
>
>> > I am writing a tool that translates xml tags mixed with a wiki-like
>> > shorthand into full xml. It would be helpful to be able to
>> > sanity-check the mix of explicit tags and implicit tags I'm deriving
>> > from the shorthand by querying our DTD along the lines: "Is element
>> > foo legal within element bar" Same for CDATA.
>> >
>> > Is this possible using lxml? If not, is it possible using anything
>> > else?
>>
>> You could define your grammar in a way that is easily usable for you in your
>> program and then generate a DTD from that.
>
> Are you really using DTDs, and not using that as a catchall for the
> various Schema languages?
>
> If so, then you might consider switching to a modern schema
> language. RelaxNG lets you write regular expressions for CDATA, which
> ought to work with wiki-like "tags", and I wouldn't be surprised to
> find that Schematron is Turing complete.
>
>     <mike
>

-- 
~ Second Shift: An original, serialized audio adventure ~
 http://www.secondshiftpodcast.com

Jeffrey Ollie | 8 May 21:26

Building lxml 2.0.5 on RHEL/CentOS 4

Has anyone built lxml 2.0.5 on RHEL 4 or CentOS 4?  When I submit it
to the Fedora/EPEL buildsystem I get the following error:

libxml/schematron.h: No such file or directory

I don't have direct access to a RHEL/CentOS 4 box so I can't do much
more debugging until I do get one set up.  libxml2 is at version
2.6.16 in RHEL/CentOS 4.  The full build log is here:

http://buildsys.fedoraproject.org/logs/fedora-4-epel/38964-python-lxml-2.0.5-1.el4/ppc/build.log

Jeff
