Martin Mueller | 27 Jan 16:48 2015

moving all attributes from one element to another


I want to move all attributes of one element to another element. If x is
an element and its attributes are in x.attrib, is there a single command
that says "take the attribute dictionary of x and attach it to y"?

Or do I have to use the .set method for each individual attribute?

MM
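
One possible answer, sketched under the assumption that x and y are ordinary
elements: an element's .attrib behaves like a dictionary, so its update()
method can copy everything in one call, and clear() empties the source
afterwards. The helper name move_attributes is made up for illustration, and
the fallback import is only there so the sketch also runs with the standard
library's ElementTree.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch also runs without lxml
    import xml.etree.ElementTree as etree


def move_attributes(source, target):
    """Copy every attribute of source onto target, then empty source."""
    target.attrib.update(source.attrib)  # one call instead of many .set() calls
    source.attrib.clear()                # drop them from the original element


x = etree.fromstring('<x a="1" b="2"/>')
y = etree.fromstring('<y c="3"/>')
move_attributes(x, y)
print(sorted(y.attrib.items()))
```

Attributes that already exist on the target under the same name are
overwritten, which may or may not be what is wanted.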

_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml <at> lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml
Alex Vincent | 20 Jan 18:24 2015

Possible bug with etree.XPath

Hi, everyone.  I've been searching around for a simple XPath expression validator, just to check that the XPaths we hand-write are really valid.  lxml looks like it might do well at this.

However, I did find a case where an invalid expression doesn't throw:
[ajvincent@localhost ~]$ python
Python 2.7.8 (default, Nov 10 2014, 08:19:18)
[GCC 4.9.2 20141101 (Red Hat 4.9.2-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> etree.XPath("//b[contains(.)]")
//b[contains(.)]

The contains function, as I understand it, takes exactly two arguments.
http://www.w3.org/TR/xpath/#section-String-Functions

I reproduced this with the python-lxml-3.3.6-1.fc21 package that Fedora 21 Linux provides, and on my MacBook with the py-lxml-3.4.1_0 MacPorts distribution.

Please advise:  is this a legitimate bug in lxml?  If so, I'll file in the bug tracker.
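
A rough sketch of why this may not be a bug in lxml itself: as far as I understand, etree.XPath() only compiles the expression, which checks its syntax, while the number of arguments passed to a function is checked when the expression is evaluated. A validator could therefore compile first and then run the expression against a small throwaway document. The function name check_xpath and the sample document are assumptions for illustration, and an argument-count error may only surface if the sample contains an element that the predicate is actually applied to.

```python
from lxml import etree

# Throwaway document; it contains a <b> so that predicates on //b get evaluated.
SAMPLE = etree.fromstring("<a><b>text</b></a>")


def check_xpath(expression, sample=SAMPLE):
    """Return None if the expression compiles and evaluates, else an error text."""
    try:
        compiled = etree.XPath(expression)  # compilation: syntax check
        compiled(sample)                    # evaluation: may reveal further errors
    except etree.XPathError as exc:         # base class of lxml's XPath errors
        return str(exc) or type(exc).__name__
    return None


for expr in ("//b[contains(., 'x')]", "//b[contains(.)]", "//b["):
    print(expr, "->", check_xpath(expr))
```

This only approximates validation: an expression that is fine for one document can still fail on another, for example when it uses undeclared prefixes or variables.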

--
"The first step in confirming there is a bug in someone else's work is confirming there are no bugs in your own."
-- Alexander J. Vincent, June 30, 2001
_________________________________________________________________
Gabor Toth | 7 Jan 18:35 2015

Installation Problem

Dear All,

I am trying to install lxml with pip on a Debian server, but my tmp folder is not large enough (I cannot increase it at the moment) and I am getting the following message. Could you please suggest a workaround?

Thanks,

Gabor

Downloading/unpacking lxml                                                             
  Running setup.py egg_info for package lxml                                           
    /usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'bugtrack_url'
      warnings.warn(msg)
    Building lxml version 3.4.1.
    Building without Cython.
    Using build configuration of libxslt 1.1.26
    Building against libxml2/libxslt in the following directory: /usr/lib
   
Installing collected packages: lxml
  Running setup.py install for lxml
    /usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'bugtrack_url'
      warnings.warn(msg)
    Building lxml version 3.4.1.
    Building without Cython.
    Using build configuration of libxslt 1.1.26
    Building against libxml2/libxslt in the following directory: /usr/lib
    building 'lxml.etree' extension
    gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/include/libxml2 -I/home/roysy/build/lxml/src/lxml/includes -I/usr/include/python2.7 -c src/lxml/lxml.etree.c -o build/temp.linux-x86_64-2.7/src/lxml/lxml.etree.o -w
    src/lxml/lxml.etree.c:206109:1: fatal error: error writing to /tmp/ccruZe2F.s: No space left on device
    compilation terminated.
    error: command 'gcc' failed with exit status 1
    Complete output from command /usr/bin/python -c "import setuptools;__file__='/home/roysy/build/lxml/setup.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --single-version-externally-managed --record /tmp/pip-VVbtrn-record/install-record.txt:
    /usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'bugtrack_url'
      warnings.warn(msg)
    Building lxml version 3.4.1.
    Building without Cython.
    Using build configuration of libxslt 1.1.26
    Building against libxml2/libxslt in the following directory: /usr/lib
    running install
    running build
    running build_py
    copying src/lxml/includes/lxml-version.h -> build/lib.linux-x86_64-2.7/lxml/includes
    running build_ext
    building 'lxml.etree' extension
    gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/include/libxml2 -I/home/roysy/build/lxml/src/lxml/includes -I/usr/include/python2.7 -c src/lxml/lxml.etree.c -o build/temp.linux-x86_64-2.7/src/lxml/lxml.etree.o -w
    src/lxml/lxml.etree.c:206109:1: fatal error: error writing to /tmp/ccruZe2F.s: No space left on device
    compilation terminated.
    error: command 'gcc' failed with exit status 1

----------------------------------------
Command /usr/bin/python -c "import setuptools;__file__='/home/roysy/build/lxml/setup.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --single-version-externally-managed --record /tmp/pip-VVbtrn-record/install-record.txt failed with error code 1 in /home/roysy/build/lxml
Storing complete log in /root/.pip/pip.log
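
A possible workaround, assuming the failure comes only from /tmp being too small: both gcc and Python's tempfile module consult the TMPDIR environment variable, so pointing it at a directory on a larger partition may be enough. The sketch below builds such an environment and wraps the pip call; the helper names and the directory /home/roysy/tmp are hypothetical, and the install itself is left commented out.

```python
import os
import subprocess


def pip_install_env(tmpdir):
    """Return a copy of the current environment with TMPDIR set to tmpdir."""
    env = dict(os.environ)
    env["TMPDIR"] = tmpdir  # read by gcc and by Python's tempfile module
    return env


def install(package, tmpdir):
    """Run 'pip install <package>' with temporary files redirected to tmpdir."""
    if not os.path.isdir(tmpdir):
        os.makedirs(tmpdir)
    return subprocess.call(["pip", "install", package],
                           env=pip_install_env(tmpdir))


# Example call (not executed here; the path is a made-up example):
# install("lxml", "/home/roysy/tmp")
```

The same effect can be had from the shell by setting TMPDIR for the single pip invocation.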

_________________________________________________________________
Martin Mueller | 2 Jan 18:14 2015

problem with going from linear to hierarchical structure


I have been struggling with what looks like a very simple problem but it
is beyond my feeble powers. The longer XML fragment below is a very
elaborate but flat representation of the first two lines from
Shakespeare's Comedy of Errors:

<sp><speaker>EGEON</speaker>
<l> Proceed, Solinus, to procure my fall,</l>
<l> And by the doom of death end woes and all</l>
</sp>

In this encoding the word, punctuation, and space tokens are not
hierarchically gathered into <l> elements. Instead all such tokens are
children of an <ab> element, and an empty milestone element marks the
hierarchical ordering into chunks of verse or prose. Don't ask me why this
somewhat counterintuitive encoding was chosen in the first place, but my
goal is to "unflatten" this very flat structure, which would involve (not
necessarily in that order)

1. creating an appropriate container element (<l> or <ab>) for all the
tokens between one milestone and the next (or renaming the milestone
element)
2. inserting all tokens between one milestone and the next into that
element in their current document order
3. deleting the surrounding <ab> element

The end product of these transformations would still keep the <w>, <pc>,
and <c> tokens, but would have the hierarchical structure that I listed
above.
I see the general problem of "from linear to hierarchical," but I don't
see any examples in the lxml documentation that allow me to get from here
to there. It's probably a very simple thing that I should know about, but
I don't, and I'll be grateful for any help.

<sp xml:id="sp-0001" who="#Egeon_Err">
<speaker xml:id="spk-0001">
<w xml:id="w0000410">EGEON</w>
</speaker>
<ab xml:id="ab-0001">
<lb xml:id="lb-00009"/>
<milestone unit="ftln" xml:id="ftln-0001" n="1.1.1" ana="#verse"
corresp="#w0000420 #p0000430 #c0000440 #w0000450 #p0000460 #c0000470
#w0000480 #c0000490 #w0000500 #c0000510 #w0000520 #c0000530 #w0000540
#p0000550"/>
<w xml:id="w0000420" n="1.1.1">Proceed</w>
<pc xml:id="p0000430" n="1.1.1">,</pc>
<c xml:id="c0000440" n="1.1.1"> </c>
<w xml:id="w0000450" n="1.1.1">Solinus</w>
<pc xml:id="p0000460" n="1.1.1">,</pc>
<c xml:id="c0000470" n="1.1.1"> </c>
<w xml:id="w0000480" n="1.1.1">to</w>
<c xml:id="c0000490" n="1.1.1"> </c>
<w xml:id="w0000500" n="1.1.1">procure</w>
<c xml:id="c0000510" n="1.1.1"> </c>
<w xml:id="w0000520" n="1.1.1">my</w>
<c xml:id="c0000530" n="1.1.1"> </c>
<w xml:id="w0000540" n="1.1.1">fall</w>
<pc xml:id="p0000550" n="1.1.1">,</pc>
<lb xml:id="lb-00010"/>
<milestone unit="ftln" xml:id="ftln-0002" n="1.1.2" ana="#verse"
corresp="#w0000560 #c0000570 #w0000580 #c0000590 #w0000600 #c0000610
#w0000620 #c0000630 #w0000640 #c0000650 #w0000660 #c0000670 #w0000680
#c0000690 #w0000700 #c0000710 #w0000720 #c0000730 #w0000740 #p0000750"/>
<w xml:id="w0000560" n="1.1.2">And</w>
<c xml:id="c0000570" n="1.1.2"> </c>
<w xml:id="w0000580" n="1.1.2">by</w>
<c xml:id="c0000590" n="1.1.2"> </c>
<w xml:id="w0000600" n="1.1.2">the</w>
<c xml:id="c0000610" n="1.1.2"> </c>
<w xml:id="w0000620" n="1.1.2">doom</w>
<c xml:id="c0000630" n="1.1.2"> </c>
<w xml:id="w0000640" n="1.1.2">of</w>
<c xml:id="c0000650" n="1.1.2"> </c>
<w xml:id="w0000660" n="1.1.2">death</w>
<c xml:id="c0000670" n="1.1.2"> </c>
<w xml:id="w0000680" n="1.1.2">end</w>
<c xml:id="c0000690" n="1.1.2"> </c>
<w xml:id="w0000700" n="1.1.2">woes</w>
<c xml:id="c0000710" n="1.1.2"> </c>
<w xml:id="w0000720" n="1.1.2">and</w>
<c xml:id="c0000730" n="1.1.2"> </c>
<w xml:id="w0000740" n="1.1.2">all</w>
<pc xml:id="p0000750" n="1.1.2">.</pc>
</ab>
</sp>

Martin Mueller
Professor emeritus of English and Classics
Northwestern University
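
The three steps listed in the post can be sketched as a single pass over the
children of <ab>: each milestone opens a new container, the tokens that follow
are appended to it in document order, and the <ab> itself is dropped. This is
only one possible reading of the task. It assumes a shortened, namespace-free
stand-in for the fragment, picks <l> when ana is "#verse" and <ab> otherwise,
and leaves the <lb/> markers outside the new containers; the function name
unflatten is made up.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch also runs without lxml
    import xml.etree.ElementTree as etree

# Shortened, namespace-free stand-in for the fragment in the post.
SAMPLE = (
    '<sp><speaker><w>EGEON</w></speaker><ab>'
    '<lb/><milestone unit="ftln" n="1.1.1" ana="#verse"/>'
    '<w>Proceed</w><pc>,</pc><c> </c><w>Solinus</w>'
    '<lb/><milestone unit="ftln" n="1.1.2" ana="#verse"/>'
    '<w>And</w><c> </c><w>by</w>'
    '</ab></sp>'
)


def unflatten(sp):
    """Replace each <ab> child of sp by one container element per milestone."""
    for ab in sp.findall("ab"):
        position = list(sp).index(ab)
        sp.remove(ab)  # step 3: the surrounding <ab> goes away
        container = None
        for child in list(ab):
            if child.tag == "milestone":
                # Step 1: open a new container, <l> for verse, <ab> otherwise.
                tag = "l" if child.get("ana") == "#verse" else "ab"
                container = etree.Element(tag, n=child.get("n", ""))
                sp.insert(position, container)
                position += 1
            elif container is not None and child.tag != "lb":
                # Step 2: tokens keep their document order inside the container.
                container.append(child)
            else:
                # <lb/> markers and anything before the first milestone stay outside.
                sp.insert(position, child)
                position += 1


sp = etree.fromstring(SAMPLE)
unflatten(sp)
print([el.tag for el in sp])
```

For the real files the tag names would need the TEI namespace, and the other
milestone attributes (xml:id, corresp) could be copied onto the new container
in the same place where n is set.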

_________________________________________________________________
Luis González Fernández | 26 Dec 22:37 2014

Bug with lxml 3.4.1

Hi all:

I'm working on a parser for XML files and I found a strange behaviour in
lxml, both on Linux and OS X.

When I try to use the .find() or .findall() functions, the problem seems
to occur when dealing with a tag declared inside a namespace without a
prefix.

Here is an example to reproduce the bug:

Vengeance:ioc luisgf$ python3
Python 3.4.2 (v3.4.2:ab2c023a9432, Oct  5 2014, 20:42:22)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> data = open("duqu.ioc","rb").read()
>>> root = etree.fromstring(data)
>>> print(root.nsmap)
{'xsi': 'http://www.w3.org/2001/XMLSchema-instance', None: 'http://schemas.mandiant.com/2010/ioc', 'xsd': 'http://www.w3.org/2001/XMLSchema'}
>>> print(root[6].tag)
{http://schemas.mandiant.com/2010/ioc}definition
>>> root.find('definition', root.nsmap)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 1448, in lxml.etree._Element.find (src/lxml/lxml.etree.c:51339)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/lxml/_elementpath.py", line 281, in find
    it = iterfind(elem, path, namespaces)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/lxml/_elementpath.py", line 271, in iterfind
    selector = _build_path_iterator(path, namespaces)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/lxml/_elementpath.py", line 234, in _build_path_iterator
    return _cache[(path, namespaces and tuple(sorted(namespaces.items())) or None)]
TypeError: unorderable types: NoneType() < str()
>>> 

--
Luis González Fernández
https://www.luisgf.es
PGP ID: C918B80F (DD6F BFC1 FC14 4C81 34F8 EA1E 6BCB C27F C918 B80F)
Twitter:  <at> luisgf_2001 / Jabber: luisgf <at> mijabber.es
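
A workaround sketch, assuming the error comes from the None key that root.nsmap uses for the default namespace: the traceback shows the mapping's items being sorted, and None cannot be ordered against strings on Python 3. Giving the default namespace a real prefix avoids that. The prefix "ioc", the helper name prefixed_nsmap and the tiny stand-in document are assumptions for illustration.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch also runs without lxml
    import xml.etree.ElementTree as etree

IOC_NS = "http://schemas.mandiant.com/2010/ioc"
DOC = '<ioc xmlns="%s"><definition/></ioc>' % IOC_NS


def prefixed_nsmap(nsmap, default_prefix):
    """Copy a prefix-to-URI mapping, giving the None (default) entry a prefix."""
    return {(prefix if prefix is not None else default_prefix): uri
            for prefix, uri in nsmap.items()}


root = etree.fromstring(DOC)
# Same shape as the root.nsmap printed in the post.
nsmap = {None: IOC_NS, "xsi": "http://www.w3.org/2001/XMLSchema-instance"}
namespaces = prefixed_nsmap(nsmap, "ioc")

definition = root.find("ioc:definition", namespaces)
# Equivalent lookup without any mapping, using the {uri}tag notation:
same = root.find("{%s}definition" % IOC_NS)
print(definition.tag)
```

Note that the unprefixed path 'definition' would look for an element in no namespace at all, so the prefix (or the {uri}tag form) is needed in the path as well.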
_________________________________________________________________
Jason Williams | 24 Dec 13:13 2014

Extracting titles from links

I want to be able to grab the title of articles from a webpage. I wrote my
script using the following XPath

en_tree.xpath('//a[@class="pubSectionTitle"]')

To grab from the following example XML:

<a href='/en/publications/magazines/wp20141201/ancient-city-timgad/'
class="pubSectionTitle" title="Timgad—A Buried City Reveals Its Secrets">

                        Timgad<wbr />—A Buried City Reveals Its Secrets
                     </a>

When I encounter the above example, and continue with my script (See the
code below) I only get 'Timgad', not the entire title.

Thanks for any help, I'm very inexperienced with this!



en_toc = en_tree.xpath('//a[@class="pubSectionTitle"]')

en_titles = []
for title in en_toc:
    # .text only holds the text before the first child element (<wbr />)
    entry = title.text.strip()
    en_titles.append(entry)
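
A likely explanation, with a sketch of one fix: .text only returns the text up to the first child element, here <wbr />, and the remainder is stored as the tail of that child. Joining the pieces from itertext() collects the whole title. The wrapping <div>, the helper name link_titles and the use of iterfind() are assumptions made so that the sketch is self-contained.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch also runs without lxml
    import xml.etree.ElementTree as etree

# Stand-in for the page; \u2014 is the dash that follows <wbr /> in the title.
PAGE = (
    '<div><a href="/en/publications/magazines/wp20141201/ancient-city-timgad/" '
    'class="pubSectionTitle">\n'
    '    Timgad<wbr />\u2014A Buried City Reveals Its Secrets\n'
    '</a></div>'
)


def link_titles(tree):
    """Return the full text of each pubSectionTitle link, child tails included."""
    links = tree.iterfind('.//a[@class="pubSectionTitle"]')
    return ["".join(link.itertext()).strip() for link in links]


en_tree = etree.fromstring(PAGE)
en_titles = link_titles(en_tree)
print(en_titles)
```

If the page is parsed with lxml.html instead, the elements also offer a text_content() method that serves the same purpose.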
_________________________________________________________________
Adarsh LR | 16 Dec 17:56 2014

Help me on installation please ...

Namaskar ...

I am from India. Can anyone help me with a quick installer for pip? I am not even able to install this Python script.

Anyway ... I am still working on this.

Let's go

Adarsh L Raju
_________________________________________________________________
Ken Norris | 16 Dec 08:39 2014

Trouble running code on Windows Server 2012

Hi, 

I'm having trouble running some code I wrote that uses lxml in Python on a Windows Server 2012 machine. I originally wrote the script on Mac OS X, where it works well. When I run the same script on the server, it runs without error and produces PDFs, but when I try to open them they are corrupted or show up blank. I followed some steps (installing the dependencies of lxml as described in the documentation, and resetting the PATH environment variable to make sure the dependencies resolve to the proper place), to no avail. I'll attach the script.

I'm doing research in the economics department at UBC in Vancouver, Canada. My coding skills are coming along but they're far from perfect. Any help you could provide would be great. 

Thanks, 

Ken
Attachment (Test.py): text/x-python-script, 1612 bytes
_________________________________________________________________
Stefan Behnel | 13 Dec 19:36 2014

Re: Support default_namespace parameter in write method?

Charlie Clark wrote on 07.12.2014 at 12:00:
> On .12.2014, at 10:05, Stefan Behnel wrote:
>> Not sure. It's due to an inherent difference in the tree models of libxml2
>> and ElementTree: the latter does not know about prefixes at all and handles
>> them only in the serialiser, whereas libxml2 takes them from the in-memory
>> tree. There is no override in libxml2's serialiser, so it would require
>> modifying the tree's namespaces during serialisation (i.e. modification
>> during a supposed read-only operation). I guess it's worth trying out,
>> though.
> 
> I thought there might be something like that. But would it be possible to
> allow the registration of an empty namespace (as ElementTree allows)?
> Currently this is only possible with nsmap. BTW. what are the constraints
> on nsmap? The attribute is not writable but __setitem__ does not raise an
> exception.

Yes, it's actually documented (see docstring) that changing an Element's
nsmap has no effect. The property returns a real dict that contains all
prefix-namespace mappings known in the context of this Element in the tree.
But it has no actual connection to that Element that could pass back
modifications.
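
A small sketch of what this means in practice, under the assumption that
declaring the default namespace at creation time is an acceptable substitute:
the nsmap argument of Element() accepts None as the prefix, while writing to
the nsmap property afterwards only changes a temporary dict. The namespace URI
is the example one from this thread.

```python
from lxml import etree

NS = "http://example.com/whatever"

# The default namespace (prefix None) is declared when the element is created.
root = etree.Element("{%s}root" % NS, nsmap={None: NS})
etree.SubElement(root, "{%s}child" % NS)

# The property builds a new dict on each access, so this assignment is lost.
root.nsmap["x"] = "http://example.com/other"
print("x" in root.nsmap)

serialised = etree.tostring(root)
print(serialised)
```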

I think the problem is that it was never obvious to me what the semantics
should be here. It feels wrong if modifying a mapping that includes all
ancestor namespaces changes the namespaces defined on a specific node. And
would it then update the subtree of that Element? Then reading the property
would give you access to the ancestor state and modifying it would change
the children? Strange. If you change the namespace of a prefix, would it
then have to go back to the ancestor that defined the modified prefix and
change its entire subtree? That sounds even worse. And what would happen
if you deleted a prefix that's in use?

I agree that the feature of modifying the namespace mapping is missing,
though. If someone has a good idea how this should work, I'm all eyes.

> register_namespace("", "http://example.com/whatever") # works in
> ElementTree but not in lxml

Ah, yes. Not sure the error you get was intended. It would be easy to make
this work, but then, register_namespace() changes global library state.
Globally setting up a prefix-namespace mapping is not a great idea already
(it's ok as long as the prefix is a generally accepted de-facto standard),
but setting a global default namespace seems like asking for trouble and
interference with other code. The mapping is bidirectional and if multiple
default namespace mappings are registered, one will overwrite the other and
the result will depend on who happens to be last. Rearrange some imports
and lose your pretty output? If you're lucky, you'll at least see some of
your tests fail, but I guess prefixes are rarely tested for.

Stefan

_________________________________________________________________
Martin Mueller | 11 Dec 05:25 2014

using Xpath to look for one or another element


I know how to use iterfind to select one type of element, as in

speech.iterfind('.//tei:w', namespaces =
{'tei':'http://www.tei-c.org/ns/1.0'})

which selects all the <w> descendants of the parent element. But what if I
want to find <w> OR <c> elements? As I understand from a posting on
Stack Overflow,

iterfind('.//tei:w |.//tei:c',namespaces =
{'tei':'http://www.tei-c.org/ns/1.0'})

should do the trick in regular expression mode. But it doesn't seem to
work. Am I doing something wrong or have I hit a limit of the program?
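
A sketch of one way around this: iterfind() only understands the small
ElementPath subset, which, as far as I know, has no union operator. The |
operator belongs to full XPath, which lxml exposes through the xpath() method.
The sample <sp> element below is a made-up stand-in for a real speech.

```python
from lxml import etree

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

speech = etree.fromstring(
    '<sp xmlns="http://www.tei-c.org/ns/1.0">'
    '<w>And</w><c> </c><pc>,</pc><w>by</w>'
    '</sp>'
)

# Full XPath: the union of both node sets; <pc> is not selected.
tokens = speech.xpath(".//tei:w | .//tei:c", namespaces=TEI)
texts = [token.text for token in tokens]
print(texts)
```

Unlike iterfind(), xpath() returns a list rather than an iterator, which
should not matter for a single speech.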

_________________________________________________________________
D.H.J. Takken | 5 Dec 15:31 2014

etree.xmlfile shadows exceptions thrown by output

Hello,

I use etree.xmlfile and a coroutine to generate XML into a file-like
object that does post-processing on the XML data. Now, if I use the
send() method to produce invalid XML data, the post-processor throws an
exception.

Here, I see a little problem: No matter what exception is thrown by the
post-processor, the send() method always generates a generic IO
exception. Apparently, lxml catches any exception thrown by the
file-like object and produces a generic IO exception instead. This
means that there is no way to know what exactly went wrong in the
post-processor. That information is lost.

Inside the coroutine, I *do* get the original exception from the
post-processor, but doing the error handling there is kind of ugly.

Is there a better way to solve this problem?
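
One way to keep the original error, sketched without relying on any
particular lxml behaviour: wrap the file-like object in a thin proxy that
remembers the last exception its write() raised. Whatever generic error
reaches the caller, the proxy still holds the real cause. Both class names
and the failing post-processor are hypothetical stand-ins.

```python
class RecordingWriter(object):
    """File-like proxy that remembers the exception raised by the wrapped write()."""

    def __init__(self, target):
        self.target = target
        self.error = None

    def write(self, data):
        try:
            return self.target.write(data)
        except Exception as exc:
            self.error = exc  # keep the original for later inspection
            raise


class PickyPostProcessor(object):
    """Stand-in for the post-processor in the post; rejects one marker string."""

    def write(self, data):
        if b"invalid" in data:
            raise ValueError("post-processor rejected %r" % (data,))
        return len(data)


writer = RecordingWriter(PickyPostProcessor())
writer.write(b"<ok/>")
original = None
try:
    writer.write(b"<invalid/>")
except Exception:
    # With etree.xmlfile, this is where the generic IO error would arrive.
    original = writer.error
print(repr(original))
```

The proxy would be passed to etree.xmlfile() in place of the post-processor,
and the code calling send() would consult writer.error after catching the
generic exception.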

Thanks a lot!

_________________________________________________________________
