Phil Thompson | 24 Jul 18:55 2014

Does Zip Importer have to be Special?

I have an importer for use in applications that embed an interpreter 
that does a similar job to the Zip importer (except that the storage is 
a C data structure rather than a .zip file). Just like the Zip importer 
I need to import my importer and add it to sys.path_hooks. However the 
earliest opportunity I have to do this is after the Py_Initialize() call 
returns - but this is too late because some parts of the standard 
library have already needed to be imported.
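
For readers less familiar with the hook mechanism, the Python-level side of such an importer might look roughly like the sketch below. It is only an illustration, not the importer described above: a plain dict stands in for the C data structure, and the names EMBEDDED_SOURCES, PATH_MARKER and EmbeddedFinder are hypothetical. It registers the hook from ordinary Python code, so it does nothing about the timing problem itself.

```python
import importlib.util
import sys

# Hypothetical stand-in for the C data structure: module name -> source text.
EMBEDDED_SOURCES = {"embedded_demo": "VALUE = 42\n"}

# Hypothetical sys.path entry that only this hook claims.
PATH_MARKER = "<embedded-modules>"


class EmbeddedFinder:
    """Path entry finder and loader for modules held in memory."""

    def __init__(self, path):
        # A path hook signals "not mine" by raising ImportError.
        if path != PATH_MARKER:
            raise ImportError("not an embedded path entry", path=path)

    def find_spec(self, fullname, target=None):
        if fullname in EMBEDDED_SOURCES:
            return importlib.util.spec_from_loader(fullname, self)
        return None

    def create_module(self, spec):
        return None  # fall back to the default module creation

    def exec_module(self, module):
        exec(EMBEDDED_SOURCES[module.__name__], module.__dict__)


sys.path_hooks.append(EmbeddedFinder)
sys.path.append(PATH_MARKER)
sys.path_importer_cache.pop(PATH_MARKER, None)  # drop any stale cache entry

import embedded_demo  # resolved through EmbeddedFinder
```

Anything imported before the last three statements run never sees the hook, which is the situation after Py_Initialize() returns.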

My current workaround is to include a modified version of _bootstrap.py 
as a frozen module that has the necessary steps added to the end of its 
_install() function.

The Zip importer doesn't have this problem because it gets special 
treatment - the call to its equivalent code is hard-coded and happens 
exactly when needed.

What would help is a table of functions that are called at the point 
where _PyImportZip_Init() is currently called. By default the only entry 
in the table would be _PyImportZip_Init. There would be a way of 
modifying the table, similar to how either PyImport_FrozenModules or 
Inittab is handled.

...or if there is a better solution that I have missed that doesn't 
require a modified _bootstrap.py.

Thanks,
Phil
Alex Gaynor | 22 Jul 23:03 2014

[PEP466] SSLSockets, and sockets, _socketobjects oh my!

Hi all,

I've been happily working on the SSL module backports for Python2 (pursuant to
PEP466), and I've hit something of a snag:

In Python3, the SSLSocket keeps a weak reference to the underlying socket,
rather than the strong reference that Python2 uses.

Unfortunately, due to the way sockets work in Python2, this doesn't work:

On Python2, _socketobject composes around _real_socket from the _socket module,
whereas on Python3, it subclasses _socket.socket. Since you now have a Python-
level class, you can weak reference it.
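
The difference can be illustrated with a small sketch. It does not use the socket module: a bare object() stands in for a C-defined type without weak reference support, and PythonLevelSocket is a hypothetical name for a class defined in Python.

```python
import weakref


class PythonLevelSocket(object):
    """Hypothetical stand-in for a socket class defined in Python."""


# Instances of many C-defined types cannot be weakly referenced;
# a bare object() is used here as a stand-in for such a type.
try:
    weakref.ref(object())
    c_level_ok = True
except TypeError:
    c_level_ok = False

sock = PythonLevelSocket()
ref = weakref.ref(sock)  # works: the class provides a __weakref__ slot

print(c_level_ok, ref() is sock)
```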

The question is:

a) Should we backport weak referencing _socket.sockets (changing the structure
   of the module seems overly invasive, albeit completely backwards
   compatible)?
b) Does anyone know why weak references are used in the first place? The commit
   message just alludes to fixing a leak with no reference to an issue.

Anyone who's interested in the state of the branch can see it at:
github.com/alex/cpython on the backport-ssl branch. Note that many many tests
are still failing, and you'll need to apply the patch from
http://bugs.python.org/issue22023 to get it to work.

Thanks,
Alex

Victor Stinner | 22 Jul 00:26 2014

PEP 471 "scandir" accepted

Hi,

I privately asked Guido van Rossum if I could be the BDFL-delegate for
PEP 471 and he agreed. I accept the latest version of the PEP:

    http://legacy.python.org/dev/peps/pep-0471/

I consider that PEP 471 "scandir" has been discussed enough to collect
all possible options (variations of the API) and that the main flaws
have been detected. Ben Hoyt modified his PEP to list all these options,
and for each option gives advantages and drawbacks. Great job Ben :-)
Thanks to all the developers who contributed to the threads on the
python-dev mailing list!

The new version of the PEP has an optional "follow_symlinks" parameter
which is True by default. IMO this API better fits the common case,
listing the content of a single directory, and it's now simple to not
follow symlinks when implementing a recursive function like os.walk().
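
As a rough sketch of that recursive case, using os.scandir() as it later shipped in Python 3.5 (where the entry attribute is named "path"); the helper name iter_files is hypothetical:

```python
import os


def iter_files(top):
    """Yield paths of non-directory entries under top.

    Symlinked directories are not descended into; they are yielded
    like any other non-directory entry.
    """
    for entry in os.scandir(top):
        if entry.is_dir(follow_symlinks=False):
            yield from iter_files(entry.path)
        else:
            yield entry.path
```

Calling entry.is_dir() with the default follow_symlinks=True instead would descend into symlinked directories as well.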

The PEP also explicitly mentions that os.walk() will be modified to
benefit from the new os.scandir() function.

I'm happy because the final API is very close to os.path functions and
pathlib.Path methods. Python stays consistent, which is a great power
of this language!

The PEP is accepted. It's time to review the implementation ;-) The
current code can be found at:

   https://github.com/benhoyt/scandir

anatoly techtonik | 20 Jul 16:34 2014

subprocess research - max limit for piped output

I am trying to figure out what the maximum size is
for piped output in subprocess.check_output().

I hit a limit of about 500Mb, after which
Python exits with MemoryError without any
additional details.

I have only 2.76Gb of memory used out of 8Gb,
so what limit do I hit?

1. subprocess output read buffer
2. Python limit on size of variable
3. some OS limit on output pipes

Testcase attached.

C:\discovery\interface\subprocess>py dead.py
Testing size: 520Mb
..truncating to 545259520
..
Traceback (most recent call last):
  File "dead.py", line 66, in <module>
    backticks(r'type largefile')
  File "dead.py", line 36, in backticks
    output = subprocess.check_output(command, shell=True)
  File "C:\Python27\lib\subprocess.py", line 567, in check_output
    output, unused_err = process.communicate()
  File "C:\Python27\lib\subprocess.py", line 791, in communicate
    stdout = _eintr_retry_call(self.stdout.read)
  File "C:\Python27\lib\subprocess.py", line 476, in _eintr_retry_call
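
Whichever limit is being hit, one way to sidestep it is to avoid holding the whole output in a single string. The sketch below is only an assumption about what might help, not a diagnosis; iter_output is a hypothetical helper that reads the pipe in fixed-size chunks.

```python
import subprocess


def iter_output(command, chunk_size=1024 * 1024):
    """Yield the command's stdout in chunks instead of one large string."""
    proc = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)
    try:
        while True:
            chunk = proc.stdout.read(chunk_size)
            if not chunk:
                break
            yield chunk
    finally:
        proc.stdout.close()
        returncode = proc.wait()
    # Mirror check_output(): a non-zero exit status is an error.
    if returncode:
        raise subprocess.CalledProcessError(returncode, command)
```

For example, sum(len(chunk) for chunk in iter_output(r'type largefile')) measures the output size while keeping at most one chunk in memory at a time.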

Python tracker | 18 Jul 18:07 2014

Summary of Python tracker Issues


ACTIVITY SUMMARY (2014-07-11 - 2014-07-18)
Python tracker at http://bugs.python.org/

To view or respond to any of the issues listed below, click on the issue.
Do NOT respond to this message.

Issues counts and deltas:
  open    4589 ( +1)
  closed 29188 (+47)
  total  33777 (+48)

Open issues with patches: 2154 

Issues opened (36)
==================

#21044: tarfile does not handle file .name being an int
http://bugs.python.org/issue21044  reopened by zach.ware

#21946: 'python -u' yields trailing carriage return '\r'  (Python2 for
http://bugs.python.org/issue21946  reopened by haypo

#21950: import sqlite3 not running after configure --prefix=/alt/path;
http://bugs.python.org/issue21950  reopened by r.david.murray

#21958: Allow python 2.7 to compile with Visual Studio 2013
http://bugs.python.org/issue21958  opened by Zachary.Turner

#21960: Better path handling in Idle find in files

Mikhail Korobov | 16 Jul 23:44 2014

cStringIO vs io.BytesIO

Hi,

cStringIO was removed from Python 3. It seems the suggested replacement is io.BytesIO. But there is a problem: cStringIO.StringIO(b'data') didn't copy the data, while io.BytesIO(b'data') makes a copy (even if the data is not modified later).

This means io.BytesIO is not well suited to cases where you want a read-only file-like interface to existing byte strings. Isn't that one of the main io.BytesIO use cases? Wrapping bytes in cStringIO.StringIO used to be almost free, but this is not true for io.BytesIO.

So making code 3.x compatible by ditching cStringIO can cause serious performance/memory regressions. One can change the code to build the data using BytesIO (without creating bytes objects in the first place), but that is not always possible or convenient.

I believe this problem affects tornado (https://github.com/tornadoweb/tornado/issues/1110), Scrapy (this is how I became aware of this issue), NLTK (anecdotal evidence - I tried to port some hairy NLTK module to io.BytesIO, it became many times slower) and maybe pretty much every IO-related project ported to Python 3.x (django - check, werkzeug and frameworks based on it - check, requests - check - they all wrap user data in BytesIO, and this may cause slowdowns and up to 2x memory usage in Python 3.x).

Do you know if there is a workaround? Maybe there is some stdlib part that I'm missing, or a module on PyPI? It is not that hard to write your own wrapper that won't make copies (or to port [c]StringIO to 3.x), but I wonder if there is an existing solution or plans to fix it in Python itself - this BytesIO use case looks quite important.
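
A wrapper of the kind mentioned above could look roughly like this. It is a minimal sketch, not an existing stdlib or PyPI class: BytesReader is a hypothetical name, it is read-only and not seekable, and each read still copies the bytes it returns. Only the up-front copy of the whole buffer is avoided.

```python
import io


class BytesReader(io.RawIOBase):
    """Hypothetical read-only file-like view over an existing bytes object."""

    def __init__(self, data):
        super().__init__()
        self._view = memoryview(data)  # references data, does not copy it
        self._pos = 0

    def readable(self):
        return True

    def readinto(self, buffer):
        # Copy only the requested slice into the caller's buffer.
        count = min(len(buffer), len(self._view) - self._pos)
        buffer[:count] = self._view[self._pos:self._pos + count]
        self._pos += count
        return count


reader = BytesReader(b"header\nbody")
first = reader.read(7)
rest = reader.read()
```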
Ben Hoyt | 15 Jul 14:19 2014

Re: Remaining decisions on PEP 471 -- os.scandir()

> I'd *keep DirEntry.lstat() method* regardless of existence of
> .stat(*, follow_symlinks=True) method (despite the slight violation of
> DRY principle) for readability. `dir_entry.lstat().st_mode` is more
> concise than `dir_entry.stat(follow_symlinks=False).st_mode` and the
> meaning of lstat is well-established -- get (symbolic link) status [2].

The meaning of lstat() is well-established, so I don't mind this. But
I don't think it's necessary, either. My thought would be that in new
code/functions we should kind of prescribe best practices rather than
leave the options open. Yes, it's a few more characters, but
"follow_symlinks=False" is a lot clearer than "l" to describe this
behaviour, especially for non-Linux hackers.

> I suggest *renaming .full_name -> .path* due to reasons outlined in [1].
>
> [1]: https://mail.python.org/pipermail/python-dev/2014-July/135441.html

Hmmm, perhaps. You suggest .full_name implies it's the absolute path,
which isn't true. I don't mind .path, but it kind of sounds like "the
Path object associated with this entry". I think "full_name" is fine
-- it's not "abs_name".

> follow_symlinks (if added) should be *keyword-only parameter* because
> `dir_entry.is_dir(False)` is unreadable (it is not clear at a glance
> what `False` means in this case).

Agreed follow_symlinks should be a keyword-only parameter (as it is in
os.stat() in Python 3).
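
To illustrate what keyword-only means here, a small sketch with a hypothetical is_dir() helper (not the DirEntry method itself):

```python
import os
import stat


def is_dir(path, *, follow_symlinks=True):
    """Hypothetical helper; the bare * makes follow_symlinks keyword-only."""
    st = os.stat(path) if follow_symlinks else os.lstat(path)
    return stat.S_ISDIR(st.st_mode)


is_dir(os.curdir, follow_symlinks=False)  # intent is readable at the call site

try:
    is_dir(os.curdir, False)  # positional use is rejected
except TypeError:
    positional_rejected = True
else:
    positional_rejected = False
```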

> Exceptions are part of the public API. pathlib is inconsistent with
> os.path here e.g., os.path.isdir() ignores all OS errors raised by
> the stat() call but the corresponding pathlib call ignores only broken
> symlinks (non-existent entries).
>
> The cherry-picking of which stat errors to silence (implicitly) seems
> worse than either silencing the errors (like os.path.isdir does) or
> allowing them to propagate.

Hmmm, you're right there's a subtle difference here. I think the
os.path.isdir() behaviour could mask real errors, and the pathlib
behaviour is more correct. pathlib's behaviour is not implicit though
-- it's clearly documented in the docs:
https://docs.python.org/3/library/pathlib.html#pathlib.Path.is_dir

> Returning False instead of raising OSError in is_dir() method simplifies
> the usage greatly without (much) negative consequences. It is a *rare*
> case when silencing errors could be more practical.

I think is_X() *should* fail if there are permission errors or other
fatal errors. Whether or not they should fail if the file doesn't
exist (unlikely to happen anyway) or on a broken symlink is a
different question, but there's a good precedent with the existing
os/pathlib functions there.

-Ben
Ethan Furman | 14 Jul 18:16 2014

Python Job Board

has now been dead for five months.

--
~Ethan~
Tim Tisdall | 14 Jul 15:57 2014

Bluetooth 4.0 support in "socket" module

I was interested in providing patches for the socket module to add Bluetooth 4.0 support.  I couldn't find any details on how to provide contributions to the Python project, though...  Is there some online documentation with guidelines on how to contribute?  Should I just provide a patch to this mailing list?

Also, is there a method to test changes against all the different *nix variations?  Is Bluez the standard across the different *nix variations?

-Tim
Ben Hoyt | 14 Jul 02:33 2014

Remaining decisions on PEP 471 -- os.scandir()

Hi folks,

Thanks Victor, Nick, Ethan, and others for continued discussion on the
scandir PEP 471 (most recent thread starts at
https://mail.python.org/pipermail/python-dev/2014-July/135377.html).

Just an aside ... I was reminded again recently why scandir() matters:
a scandir user emailed me the other day, saying "I used scandir to
dump the contents of a network dir in under 15 seconds. 13 root dirs,
60,000 files in the structure. This will replace some old VBA code
embedded in a spreadsheet that was taking 15-20 minutes to do the
exact same thing." I asked if he could run scandir's benchmark.py on
his directory tree, and here's what it printed out:

C:\Python34\scandir-master>benchmark.py "\\my\network\directory"
Using fast C version of scandir
Priming the system's cache...
Benchmarking walks on \\my\network\directory, repeat 1/3...
Benchmarking walks on \\my\network\directory, repeat 2/3...
Benchmarking walks on \\my\network\directory, repeat 3/3...
os.walk took 8739.851s, scandir.walk took 129.500s -- 67.5x as fast

That's right -- os.walk() with scandir was almost 70x as fast as the
current version! Admittedly this is a network file system, but that's
still a real and important use case. It really pays not to throw away
information the OS gives you for free. :-)

On the recent python-dev thread, Victor especially made some well
thought out suggestions. It seems to me there's general agreement that
the basic API in PEP 471 is good (with Ethan not a fan at first, but
it seems he's on board after further discussion :-).

That said, I think there's basically one thing remaining to decide:
whether or not to have DirEntry.is_dir() and .is_file() follow
symlinks by default. I think Victor made a pretty good case that:

(a) following links is usually what you want
(b) that's the precedent set by the similar functions os.path.isdir()
and pathlib.Path.is_dir(), so to do otherwise would be confusing
(c) with the non-link-following version, if you wanted to follow links
you'd have to say something like "if (entry.is_symlink() and
os.path.isdir(entry.full_name)) or entry.is_dir()" instead of just "if
entry.is_dir()"
(d) it's error prone to have to do (c), as I found out recently when my
implementation of os.walk() with scandir had a bug due to getting this
exact test wrong

If we go with Victor's link-following .is_dir() and .is_file(), then
we probably need to add his suggestion of a follow_symlinks parameter
(defaults to True) that can be set to False. Either that or you have to
say "stat.S_ISDIR(entry.lstat().st_mode)" instead, which is a little bit
less nice.
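
For comparison, here is how the different spellings line up in os.scandir() as it later shipped in Python 3.5. That API differs slightly from the names used in this thread: the attribute is "path" rather than "full_name", and there is entry.stat(follow_symlinks=False) rather than entry.lstat(). The helper name classify is hypothetical.

```python
import os
import stat


def classify(top):
    """Map each entry name in top to four ways of asking "is it a directory?"."""
    result = {}
    for entry in os.scandir(top):
        follows = entry.is_dir()  # follows symlinks by default
        entry_only = entry.is_dir(follow_symlinks=False)
        # Stat-based spelling of the non-following test.
        via_stat = stat.S_ISDIR(entry.stat(follow_symlinks=False).st_mode)
        # Long-form link-following test from point (c).
        long_form = ((entry.is_symlink() and os.path.isdir(entry.path))
                     or entry_only)
        result[entry.name] = (follows, entry_only, via_stat, long_form)
    return result
```

The first and last values should agree for every entry, as should the second and third; the two pairs differ only for symlinks that point at directories.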

As a KISS enthusiast, I admit I'm still somewhat partial to the
DirEntry methods just returning (non-link following) info about the
*directory entry* itself. However, I can definitely see the
error-proneness of that, and the advantages given the points above. So
I guess I'm on the fence.

Given the above arguments for symlink-following is_dir()/is_file()
methods (have I missed any, Victor?), what do others think?

I'd be very keen to come to a consensus on this, so that I can make
some final updates to the PEP and see about getting it accepted and/or
implemented. :-)

-Ben
Jason R. Coombs | 13 Jul 16:04 2014

Another case for frozendict

I repeatedly run into situations where a frozendict would be useful, and every time I do, I go searching and find the (unfortunately rejected) PEP-416. I’d just like to share another case where having a frozendict in the stdlib would be useful to me.

I was interacting with a database and had a list of results from 206 queries:

>>> res = [db.cases.remove({'_id': doc['_id']}) for doc in fives]
>>> len(res)
206

I can see that the results are the same for the first two queries.

>>> res[0]
{'n': 1, 'err': None, 'ok': 1.0}
>>> res[1]
{'n': 1, 'err': None, 'ok': 1.0}

So I’d like to test to see if that’s the case, so I try to construct a ‘set’ on the results, which in theory would give me a list of unique results:

>>> set(res)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'

I can’t do that because dict is unhashable. That’s reasonable, and if I had a frozen dict, I could easily work around this limitation and accomplish what I need.

>>> set(map(frozendict, res))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'frozendict' is not defined

PEP-416 mentions a MappingProxyType, but that’s no help.

>>> res_ex = list(map(types.MappingProxyType, res))
>>> set(res_ex)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'mappingproxy'

I can achieve what I need by constructing a set on the ‘items’ of the dict.

>>> set(tuple(doc.items()) for doc in res)
{(('n', 1), ('err', None), ('ok', 1.0))}

But that syntax would be nicer if the result had the same representation as the input (mapping instead of tuple of pairs). A frozendict would have readily enabled the desirable behavior.

Although hashability is mentioned in the PEP under constraints, there are many use-cases that fall out of the ability to hash a dict, such as the one described above, which are not mentioned at all in use-cases for the PEP.

If there’s ever any interest in reviving that PEP, I’m in favor of its implementation.
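
For illustration, a minimal hashable mapping can be sketched in a few lines. This is an assumption about what such a type might look like, not the design from PEP-416; the class name FrozenDict is hypothetical, and it only works when every value is itself hashable.

```python
from collections.abc import Mapping


class FrozenDict(Mapping):
    """Hypothetical minimal immutable, hashable mapping."""

    def __init__(self, *args, **kwargs):
        self._data = dict(*args, **kwargs)
        self._hash = None

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

    def __hash__(self):
        # Order-independent; requires every value to be hashable too.
        if self._hash is None:
            self._hash = hash(frozenset(self._data.items()))
        return self._hash

    def __repr__(self):
        return "FrozenDict(%r)" % self._data


res = [{'n': 1, 'err': None, 'ok': 1.0}, {'ok': 1.0, 'n': 1, 'err': None}]
unique = set(map(FrozenDict, res))
```

Unlike the tuple-of-items workaround, the hash here does not depend on key order, so two results with the same contents collapse to one entry however their keys were inserted.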

