Alexandre Delanoë | 24 Apr 07:00 2014

Font inside


Hello all,

Suppose this example:

from lxml import etree

parser = etree.HTMLParser()
tree = etree.parse(file, parser)  # `file` is a path or an open file object
docs = tree.xpath('/html/body/table')

And considering each doc structure:

<tr>
	<td>
			<span class="DocHeader">NOT WANTED DATA</span>
		
		TEXT1

			<p xmlns:scripts="urn:scripts.this"></p>
		TEXT2

		TEXT3
			<font color="red" xmlns:scripts="urn:scripts.this">RED</font>
		TEXT4.
	</td>
</tr>

for doc in docs:
	print(doc.xpath("./tr/td/text()"))

Result:

lxml.etree._Element.{base,sourceline}

Hi,

I realized libxml2 has a feature optimizing xml:base fixup for XInclude,
which impacts lxml.etree._Element base and sourceline attributes.

If a document includes another document in the same directory, an
xml:base fixup is strictly speaking not needed, as any relative URI in
the included part will resolve correctly.  And that, in the eyes of
libxml2, seems to be the primary stated purpose of xml:base.  The
optimization leads to a smaller overall document, which seems to be
considered a feature, useful for keeping the document small and
uncluttered when including many small fragments.

Now there is a secondary purpose of xml:base: identifying the original
document of any element in the expanded, XInclude-processed document.

I personally find that extremely useful, and it's actually what makes
the lxml.etree._Element base/sourceline attributes so elegant.

Except that it currently only works if the files change directory at
every XInclude level, which works around the aforementioned libxml2
optimization.

I have submitted two patches to the libxml2 mailing list. The first one
would just make it work (and disable the optimization); the second one
adds a run-time option, which would need a corresponding fix in lxml.

If you are involved in libxml2, feel free to share your opinion there,

thx,
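
The xml:base behaviour described in this post can be sketched without
libxml2. The following is only a minimal illustration, assuming a
hand-written "expanded" document and using the standard library's
xml.etree.ElementTree and urllib.parse.urljoin. The effective_bases
helper, the element names, and the example.org URLs are hypothetical,
and this is not how lxml computes _Element.base. The first chapter
carries an xml:base fixup; the second stands in for a fragment included
from the same directory, where the fixup was left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Clark-notation name that ElementTree uses for the xml:base attribute.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"


def effective_bases(elem, inherited_base):
    """Yield (element, base URI) pairs, resolving xml:base down the tree."""
    # An element without xml:base simply inherits the base of its parent.
    base = urljoin(inherited_base, elem.get(XML_BASE, ""))
    yield elem, base
    for child in elem:
        yield from effective_bases(child, base)


# Hypothetical result of XInclude processing (not produced by libxml2 here).
expanded = ET.fromstring(
    "<book>"
    '<chapter xml:base="parts/ch1.xml"><title>One</title></chapter>'
    "<chapter><title>Two</title></chapter>"
    "</book>"
)

bases = {
    elem.findtext("title"): base
    for elem, base in effective_bases(expanded, "http://example.org/docs/book.xml")
    if elem.tag == "chapter"
}
for title, base in bases.items():
    print(title, "->", base)
```

In this sketch the second chapter resolves to the including document's
URI, so its origin can no longer be told apart from the includer. That
is the loss of information the post is concerned with.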

Fabio Sangiovanni | 18 Apr 09:40 2014

lxml + pypy?

Hello list,

I'm interested in using lxml in a project, as underlying parsing
layer for BeautifulSoup. The whole project should run on pypy.
What's the current status of lxml on pypy?
Is it considered production ready?

Thanks a lot!

_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml <at> lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml
Stefan Behnel | 18 Apr 09:12 2014

lxml 3.3.5 released

Hi all,

I just released lxml 3.3.5. This is a bug-fix-only release for the stable
lxml 3.3 series that fixes a JavaScript leak in the HTML cleaner.

The documentation is here: http://lxml.de/

Download:  http://lxml.de/files/lxml-3.3.5.tgz

Signature: http://lxml.de/files/lxml-3.3.5.tgz.asc

Changelog: http://lxml.de/3.3/changes-3.3.5.html

Github:
https://github.com/lxml/lxml/commit/32337d2a820d77768de8f8b082119bf79c86ef93

This release was built using Cython 0.20.1.

If you are interested in commercial support or customisations for the lxml
package, please contact me directly.

Have fun,

Stefan

3.3.5 (2014-04-18)
==================

Bugs fixed
----------

Martin Mueller | 17 Apr 15:15 2014

lxml on Mavericks

Thank you, Marius, for your advice, which worked.

Stefan, would it make sense to add something like the following to your
documentation?

If you have several versions of Python on your machine, the plain
'pip' command will by default use the Python version that came with
your system. Use a more specific command such as

	pip3.4 install lxml

to install lxml in a particular version of Python on your machine. Look in
/usr/local/bin for available versions of the pip installer.

There is a bug in Mavericks that may prevent the installation of lxml. As
a workaround, run the command

	export CFLAGS=-Qunused-arguments

before running 

	STATIC_DEPS=true pip install lxml

For a discussion of the bug, see http://bugs.python.org/issue21244 and
https://github.com/python-imaging/Pillow/issues/527

Martin Mueller
Professor emeritus of English and Classics
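
The installation advice in this post can also be expressed from within
Python. This is a sketch under stated assumptions: it relies on pip
being runnable as a module ("python -m pip") for the interpreter in
question, and the helper names build_install_command and
build_environment are hypothetical. It only assembles the command and
the environment from the values quoted above (CFLAGS and STATIC_DEPS)
and does not run the installation.

```python
import os
import sys


def build_install_command(python_exe=sys.executable, package="lxml"):
    """Return a pip command bound to one specific Python interpreter."""
    # Running pip as a module avoids guessing which 'pip' is first on PATH.
    return [python_exe, "-m", "pip", "install", package]


def build_environment(base_env=None):
    """Return a copy of the environment with the settings from the post."""
    env = dict(os.environ if base_env is None else base_env)
    env["CFLAGS"] = "-Qunused-arguments"  # workaround quoted in the post
    env["STATIC_DEPS"] = "true"           # static build of the dependencies
    return env


command = build_install_command()
env = build_environment()
print(" ".join(command))

# To actually run the installation (needs network access and a compiler):
#     import subprocess
#     subprocess.check_call(command, env=env)
```

Passing an explicit interpreter path as python_exe, for example the
Python 3.4 binary, targets that installation instead of the one running
the script.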


Martin Mueller | 17 Apr 04:46 2014

installing lxml to run with python3.4 on OS Mavericks

I have tried to install lxml on my laptop, which runs OS X Mavericks. Not
much success, and from scanning the Web, I'm not alone.

There are actually two issues here. First, I want to install lxml, and
second, I want it to run with Python 3.4, which I successfully installed
on my computer. 

I used the install routine recommended in Building lxml in Mac OS X and
used the command 

STATIC_DEPS=true pip install lxml

From the log file I gather that everything runs more or less as expected
until it hits a glitch towards the very end, which I reproduce below. I
don't really understand it, but I note that the whole installation
process is geared towards the 2.7 version of Python that is part of the
system. I haven't found instructions on how to force pip to look for
Python 3.4. Perhaps it doesn't matter.

I'll be grateful for help. On my desktop Mac (Lion) I managed to associate
an earlier version of lxml with Python3, but I don't remember how I did
it. 

copying 
/Users/martin/build/lxml/build/tmp/libxml2/include/libxslt/xsltInternals.h
-> build/lib.macosx-10.9-intel-2.7/lxml/includes/libxslt

copying 
/Users/martin/build/lxml/build/tmp/libxml2/include/libxslt/xsltlocale.h ->
build/lib.macosx-10.9-intel-2.7/lxml/includes/libxslt

Максим Кочкин | 15 Apr 20:33 2014

lxml.html.clean vulnerability

Hi, guys.

I accidentally found a vulnerability in the clean_html function. A user can break up the URL scheme with non-printable characters (\x01-\x08). Here is a PoC.


from lxml.html.clean import clean_html

html = '''\
<html>
<body>
<a href="javascript:alert(0)">aaa</a>
<a href="javas\x01cript:alert(1)">bbb</a>
<a href="javas\x02cript:alert(1)">bbb</a>
<a href="javas\x03cript:alert(1)">bbb</a>
<a href="javas\x04cript:alert(1)">bbb</a>
<a href="javas\x05cript:alert(1)">bbb</a>
<a href="javas\x06cript:alert(1)">bbb</a>
<a href="javas\x07cript:alert(1)">bbb</a>
<a href="javas\x08cript:alert(1)">bbb</a>
<a href="javas\x09cript:alert(1)">bbb</a>
</body>
</html>'''

print(clean_html(html))


Output:

<div>
<body>
<a href="">aaa</a>
<a href="javascript:alert(1)">bbb</a>
<a href="javascript:alert(1)">bbb</a>
<a href="javascript:alert(1)">bbb</a>
<a href="javascript:alert(1)">bbb</a>
<a href="javascript:alert(1)">bbb</a>
<a href="javascript:alert(1)">bbb</a>
<a href="javascript:alert(1)">bbb</a>
<a href="javascript:alert(1)">bbb</a>
<a href="">bbb</a>
</body>
</div>


I'm not a Python programmer, so I can't give you a quick fix. I found it by black-box testing on a site that uses lxml. I'm not sure if it's a bug or whether I just got things wrong.

----
ksimka ( <at> m_ksimka)
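
The pattern in the output above (the tab variant is cleaned, the \x01-\x08 variants are not) can be reproduced with a small standard-library sketch. This is not lxml's code and not its fix; both function names are hypothetical. The assumption is that a check which only ignores whitespace misses the scheme once other control characters are mixed in, while a check that also drops control characters catches it.

```python
import re

# Assumption: the weaker check ignores only whitespace inside the URL.
_WHITESPACE = re.compile(r"\s+")
# The stricter check also drops C0 control characters, space and DEL.
_CONTROL_OR_SPACE = re.compile(r"[\x00-\x20\x7f]+")


def whitespace_only_check(url):
    """Detect a javascript: scheme after removing whitespace only."""
    return _WHITESPACE.sub("", url).lower().startswith("javascript:")


def control_aware_check(url):
    """Detect a javascript: scheme after removing control characters too."""
    return _CONTROL_OR_SPACE.sub("", url).lower().startswith("javascript:")


# The same URLs as in the PoC: \x01-\x08 are control characters, \x09 is a tab.
samples = ["javascript:alert(0)"]
samples += ["javas%script:alert(1)" % chr(code) for code in range(1, 10)]
samples.append("http://lxml.de/")

for url in samples:
    print(repr(url), whitespace_only_check(url), control_aware_check(url))
```

Which characters should be ignored depends on what the target browsers accept, so the character class in the sketch is illustrative rather than authoritative.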
_________________________________________________________________
Mark Grandi | 4 Apr 23:39 2014

XML Creation - streaming output

Hello,

I have been very happy with lxml so far, so thanks again for maintaining it for so long! However, there is a use case that lxml does not provide, and I'm not sure whether it's a limitation of libxml2: while there is a streaming parser for XML, there is no such thing for outputting / generating XML. As a result, generating a very large XML file depends entirely on having a large amount of memory, which many people (like me) don't have!

Is there some hidden API in lxml, or maybe an API in libxml2 (that hasn't been made available in lxml), that accomplishes this?

Thanks!

~mark
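
The general technique asked about here, writing XML incrementally instead of building the whole tree in memory, can be sketched with the standard library's xml.sax.saxutils.XMLGenerator. This is an illustration of the idea, not an lxml API; the write_records and generate helpers and the element names are hypothetical. lxml's own documentation also describes an incremental writer, etree.xmlfile, which is the place to look for an lxml-based answer; the sketch below sticks to the standard library so that it runs without lxml.

```python
import io
from xml.sax.saxutils import XMLGenerator


def write_records(stream, records):
    """Write one <record> element per item without building a tree."""
    gen = XMLGenerator(stream, encoding="utf-8")
    gen.startDocument()
    gen.startElement("records", {})
    for number, text in records:
        # Each record is written out immediately and can then be discarded.
        gen.startElement("record", {"id": str(number)})
        gen.characters(text)  # special characters are escaped on output
        gen.endElement("record")
    gen.endElement("records")
    gen.endDocument()


def generate(count):
    """Produce records lazily so only one is held in memory at a time."""
    for number in range(count):
        yield number, "item %d & more" % number


buffer = io.StringIO()  # stands in for a large output file
write_records(buffer, generate(3))
output = buffer.getvalue()
print(output)
```

With a real file object in place of the StringIO buffer, memory use stays roughly constant regardless of how many records the generator yields.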
_________________________________________________________________
Stefan Behnel | 4 Apr 07:22 2014

Re: Local installation issue

Ivan Pozdeev, 03.04.2014 23:54:
>> KM Here And There, 02.04.2014 23:59:
>>> Following up:
>>>
>>> Jens' note caused me to look somewhere OTHER than:
>>> https://github.com/lxml/lxml (which is where the documentation points
>>> you to).
>>>
>>> And the distribution at
>>> https://pypi.python.org/packages/source/l/lxml/lxml-3.3.3.tar.gz#md5=f2675837b4358a5ecab5fd9a783fd0e5
>>> seems to have the right stuff.
>>>
>>> I think the page at http://lxml.de/build.html is confusing in this
>>> regard and may need a bit of clarification so no one else does what I
>>> did.  Or maybe I'm just a dumb noob.
> 
>> Hmm, it's actually *very* explicit, although that also makes it a bit
>> verbose. Suggestions for improvements welcome.
> 
> 1) Can we get "source code" of these pages so we can suggest patches right off the bat?

https://github.com/lxml/lxml

Specifically, INSTALL.txt and doc/build.txt.

Pull requests welcome.

> 2) "Static linking on Windows":
> 
> 2.1) Replace the section's content with
> "run with --static-deps", possibly abbreviate the former
> content into a brief explanation of what the option does.

Right. I updated it.

> 2.2) Add the "insert http://msinttypes.googlecode.com/svn/trunk/stdint.h to VC9.0's include dir"
> fix that I mentioned in
https://mailman-mail5.webfaction.com/pipermail/lxml/2014-January/007065.html .

That shouldn't be necessary anymore.

Thanks for the comments.

Stefan

_________________________________________________________________
Stefan Behnel | 3 Apr 22:13 2014

lxml 3.3.4 released

Hi all,

I'm happy to announce the release of lxml 3.3.4. This is a bug-fix release
for the stable lxml 3.3 series that adds one little feature: full line
number support (beyond 65535) when using libxml2 2.9.x.

The documentation is here: http://lxml.de/

Download:  http://lxml.de/files/lxml-3.3.4.tgz

Signature: http://lxml.de/files/lxml-3.3.4.tgz.asc

Changelog: http://lxml.de/3.3/changes-3.3.4.html

Github:
https://github.com/lxml/lxml/commit/076efc798ee7eae048d9ee764f30e2980a7c870f

This release was built using Cython 0.20.1.

If you are interested in commercial support or customisations for the lxml
package, please contact me directly.

Have fun,

Stefan

3.3.4 (2014-04-03)
==================

Features added
--------------

* Source line numbers above 65535 are available on Elements when
  using libxml2 2.9 or later.

Bugs fixed
----------

* lxml.html.fragment_fromstring() failed for bytes input in Py3.
_________________________________________________________________
KM Here And There | 2 Apr 22:31 2014

Local installation issue

I'm trying to build a local copy of lxml on an Ubuntu x86-64 system. 
I'm not doing lxml development, just need the package.

The documentation for building on lxml.de says that *all* I need to type is:

    python setup.py build --without-cython

The results below show that there is a file missing from the
distribution.  If the file is supposed to be generated, the
documentation does not say how this is accomplished.

kevin <at> ubuntu:~/build/lxml-lxml-3.3$ python setup.py build --without-cython
Building lxml version 3.3.3.
WARNING: Trying to build without Cython, but pre-generated
'src/lxml/lxml.etree.c' is not available.
WARNING: Trying to build without Cython, but pre-generated
'src/lxml/lxml.objectify.c' is not available.
Building without Cython.
Using build configuration of libxslt 1.1.28
Building against libxml2/libxslt in the following directory:
/home/kevin/usr/local/lib
/home/kevin/usr/local/lib/python2.7/distutils/dist.py:267: UserWarning:
Unknown distribution option: 'bugtrack_url'
  warnings.warn(msg)
running build
running build_py
copying src/lxml/includes/lxml-version.h ->
build/lib.linux-i686-2.7/lxml/includes
running build_ext
building 'lxml.etree' extension
gcc -pthread -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
-I/home/kevin/usr/local/include
-I/home/kevin/usr/local/include/python2.7
-I/home/kevin/usr/local/include/libxml2
-I/home/kevin/usr/local/include/libexslt
-I/home/kevin/usr/local/include/libxslt -I/home/kevin/usr/local/include
-fPIC -I/home/kevin/usr/local/include
-I/home/kevin/usr/local/include/libxml2
-I/home/kevin/build/lxml-lxml-3.3/src/lxml/includes
-I/home/kevin/usr/local/include/python2.7 -c src/lxml/lxml.etree.c -o
build/temp.linux-i686-2.7/src/lxml/lxml.etree.o -w
gcc: error: src/lxml/lxml.etree.c: No such file or directory
gcc: fatal error: no input files
compilation terminated.
error: command 'gcc' failed with exit status 4

So it appears that the file 'src/lxml/lxml.etree.c' is missing from the
source distribution.  I've checked 3.3.3 back to 2.3 and it has NEVER
been there AFAICT.

My question is, where do I get this file or how do I generate it, or is
this a bug?

Thanks,
Kevin
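
The two warnings in the log name the pre-generated C files that a build
with --without-cython expects. As a rough sketch, a source tree can be
checked for them before building. The missing_generated_sources helper
is hypothetical and not part of lxml's setup.py; the two paths are taken
from the warnings above, and the temporary directory merely simulates a
checkout.

```python
import os
import tempfile

# Paths named in the build warnings, relative to the source directory.
GENERATED_SOURCES = ("src/lxml/lxml.etree.c", "src/lxml/lxml.objectify.c")


def missing_generated_sources(source_dir):
    """Return the pre-generated C files that are absent from source_dir."""
    return [
        relative
        for relative in GENERATED_SOURCES
        if not os.path.isfile(os.path.join(source_dir, *relative.split("/")))
    ]


# Simulate a checkout that first lacks both files, then gains one of them.
with tempfile.TemporaryDirectory() as checkout:
    os.makedirs(os.path.join(checkout, "src", "lxml"))
    before = missing_generated_sources(checkout)
    open(os.path.join(checkout, "src", "lxml", "lxml.etree.c"), "w").close()
    after = missing_generated_sources(checkout)

print("missing before:", before)
print("missing after:", after)
```

An empty result would suggest the tree can be built without Cython; a
non-empty one matches the situation in the log, where the files have to
be generated or taken from a distribution that ships them.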
