Sanne Korzec | 2 Sep 2009 14:34
Picon

[Cython] FW: cython and hash tables / dictionary

Hi mailing,

 

I’ve been writing a complex program in python, which I am currently scaling up. I find myself in the position now, where I run out of memory or out of time. I have been looking at alternatives like cython and ctypes. I implemented ctypes which fixes the memory problem but doubles the time problem.

 

Currently I am implementing a cython version and ran into a problem. I hope someone can help me out.

 

The main bottleneck in my code is a large dictionary / hash table which I would like to optimize. Since a dictionary is a python datatype I have no idea how to make this cython.

 

Currently I have tried to keep the ‘keys’ intact and store the values as ctypes floats, but I think it might be better to do something else. Do I need to make the entire hash table c? Or is there a more simple solution like combining the python dict with cython? If so, how do I do this?

 

Thanks in advance.

 

Additional details: I use a double dict where the key of the first dict stores another dict as value.

 

S.

<div>

<div class="Section1">

<p class="MsoNormal"><span lang="EN-US">Hi mailing,<p></p></span></p>

<p class="MsoNormal"><span lang="EN-US"><p>&nbsp;</p></span></p>

<p class="MsoNormal"><span lang="EN-US">I&rsquo;ve been writing a complex program in python,
which I am currently scaling up. I find myself in the position now, where I run
out of memory or out of time. I have been looking at alternatives like cython
and ctypes. I implemented ctypes which fixes the memory problem but doubles the
time problem.<p></p></span></p>

<p class="MsoNormal"><span lang="EN-US"><p>&nbsp;</p></span></p>

<p class="MsoNormal"><span lang="EN-US">Currently I am implementing a cython version and ran
into a problem. I hope someone can help me out.<p></p></span></p>

<p class="MsoNormal"><span lang="EN-US"><p>&nbsp;</p></span></p>

<p class="MsoNormal"><span lang="EN-US">The main bottleneck in my code is a large dictionary
/ hash table which I would like to optimize. Since a dictionary is a python
datatype I have no idea how to make this cython.<p></p></span></p>

<p class="MsoNormal"><span lang="EN-US"><p>&nbsp;</p></span></p>

<p class="MsoNormal"><span lang="EN-US">Currently I have tried to<span> keep the &lsquo;keys&rsquo; intact and</span> store
the <span>&lsquo;</span>values<span>&rsquo;</span> as ctypes floats, but
I think it might be better to do something else. Do I need to make the entire
hash table c? Or is there a more simple solution like combining the python dict
with cython?<span> If so, how do I do this?</span><p></p></span></p>

<p class="MsoNormal"><span lang="EN-US"><p>&nbsp;</p></span></p>

<p class="MsoNormal"><span lang="EN-US">Thanks in advance.<p></p></span></p>

<p class="MsoNormal"><span lang="EN-US"><p>&nbsp;</p></span></p>

<p class="MsoNormal"><span lang="EN-US">Additional details: I use a double dict where the key
of the first dict stores another dict as value.<p></p></span></p>

<p class="MsoNormal"><span lang="EN-US"><p>&nbsp;</p></span></p>

<p class="MsoNormal"><span lang="EN-US">S.<p></p></span></p>

</div>

</div>
Robert Bradshaw | 2 Sep 2009 18:01
Favicon

Re: [Cython] FW: cython and hash tables / dictionary

On Sep 2, 2009, at 5:34 AM, Sanne Korzec wrote:

> Hi mailing,
>
> I’ve been writing a complex program in python, which I am currently  
> scaling up. I find myself in the position now, where I run out of  
> memory or out of time. I have been looking at alternatives like  
> cython and ctypes. I implemented ctypes which fixes the memory  
> problem but doubles the time problem.
>
> Currently I am implementing a cython version and ran into a  
> problem. I hope someone can help me out.
>
> The main bottleneck in my code is a large dictionary / hash table  
> which I would like to optimize. Since a dictionary is a python  
> datatype I have no idea how to make this cython.
>
> Currently I have tried to keep the ‘keys’ intact and store the  
> ‘values’ as ctypes floats, but I think it might be better to do  
> something else. Do I need to make the entire hash table c? Or is  
> there a more simple solution like combining the python dict with  
> cython? If so, how do I do this?
>
> Thanks in advance.
>
> Additional details: I use a double dict where the key of the first  
> dict stores another dict as value.
>
> S.
>

The short answer is yes, to avoid using the Python dictionary (which  
can only hold Python objects), you need to write your own hashtable.  
That's not very hard though--I bet only a hundred or two lines in  
Cython would be sufficient (and very fast). You could also look into  
using an external C or C++ library, though C++ support is still a  
work in progress.

- Robert

Christopher Barker | 2 Sep 2009 18:30
Picon
Favicon

Re: [Cython] FW: cython and hash tables / dictionary

Sanne Korzec wrote:
> The main bottleneck in my code is a large dictionary / hash table which 
> I would like to optimize.

In what way do you need to optimize it? i.e. how is it used? do you have 
memory issues or speed issues? python dicts are highly optimized 
already, so you're not likely to do much better with the look-up speed.

-Chris

--

-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@...
Sebastien Binet | 2 Sep 2009 19:01
Picon
Gravatar

Re: [Cython] FW: cython and hash tables / dictionary

On Wednesday 02 September 2009 18:01:43 Robert Bradshaw wrote:
> On Sep 2, 2009, at 5:34 AM, Sanne Korzec wrote:
> > Hi mailing,
> >
> > I’ve been writing a complex program in python, which I am currently
> > scaling up. I find myself in the position now, where I run out of
> > memory or out of time. I have been looking at alternatives like
> > cython and ctypes. I implemented ctypes which fixes the memory
> > problem but doubles the time problem.
> >
> > Currently I am implementing a cython version and ran into a
> > problem. I hope someone can help me out.
> >
> > The main bottleneck in my code is a large dictionary / hash table
> > which I would like to optimize. Since a dictionary is a python
> > datatype I have no idea how to make this cython.
> >
> > Currently I have tried to keep the ‘keys’ intact and store the
> > ‘values’ as ctypes floats, but I think it might be better to do
> > something else. Do I need to make the entire hash table c? Or is
> > there a more simple solution like combining the python dict with
> > cython? If so, how do I do this?
> >
> > Thanks in advance.
> >
> > Additional details: I use a double dict where the key of the first
> > dict stores another dict as value.
> >
> > S.
>
> The short answer is yes, to avoid using the Python dictionary (which
> can only hold Python objects), you need to write your own hashtable.
> That's not very hard though--I bet only a hundred or two lines in
> Cython would be sufficient (and very fast). You could also look into
> using an external C or C++ library, though C++ support is still a
> work in progress.

I'd recommand using this C library:
http://c-algorithms.sourceforge.net/

having a cython-stl sounds nice though :)

cheers,
sebastien.
--

-- 
#########################################
# Dr. Sebastien Binet
# Laboratoire de l'Accelerateur Lineaire
# Universite Paris-Sud XI
# Batiment 200
# 91898 Orsay
#########################################
Robert Bradshaw | 2 Sep 2009 19:22
Favicon

Re: [Cython] FW: cython and hash tables / dictionary

On Wed, 2 Sep 2009, Christopher Barker wrote:

> Sanne Korzec wrote:
>> The main bottleneck in my code is a large dictionary / hash table which
>> I would like to optimize.
>
> In what way do you need to optimize it? i.e. how is it used? do you have
> memory issues or speed issues? python dicts are highly optimized
> already, so you're not likely to do much better with the look-up speed.

It could help with both memory and speed. In particular, to do a lookup in 
a Python hashtable you need to

(1) Wrap the float in a Python object
(2) call __hash__ on that new object
(3) actually do the lookup
(4) unwrap the result back into a float.

Python does have a highly optimized (3), but the overhead for the rest 
will probably overwhelm it speedwise, so I bet a simple, unwrapped 
implementation would still be quite a win.

- Robert
Dag Sverre Seljebotn | 2 Sep 2009 19:52
Picon
Picon

Re: [Cython] FW: cython and hash tables / dictionary

Robert Bradshaw wrote:
> On Wed, 2 Sep 2009, Christopher Barker wrote:
> 
>> Sanne Korzec wrote:
>>> The main bottleneck in my code is a large dictionary / hash table which
>>> I would like to optimize.
>> In what way do you need to optimize it? i.e. how is it used? do you have
>> memory issues or speed issues? python dicts are highly optimized
>> already, so you're not likely to do much better with the look-up speed.
> 
> It could help with both memory and speed. In particular, to do a lookup in 
> a Python hashtable you need to
> 
> (1) Wrap the float in a Python object
> (2) call __hash__ on that new object

I hope that isn't what actually happens :-) Floating point and hashes 
don't seem like a good idea.

(Just a note, I believe the OP was talking about floats in the values so 
we're good.)

> (3) actually do the lookup
> (4) unwrap the result back into a float.
> 
> Python does have a highly optimized (3), but the overhead for the rest 
> will probably overwhelm it speedwise, so I bet a simple, unwrapped 
> implementation would still be quite a win.
> 
> - Robert
> _______________________________________________
> Cython-dev mailing list
> Cython-dev@...
> http://codespeak.net/mailman/listinfo/cython-dev

--

-- 
Dag Sverre
Stefan Behnel | 2 Sep 2009 20:01
Picon
Favicon

Re: [Cython] FW: cython and hash tables / dictionary


Dag Sverre Seljebotn wrote:
> Robert Bradshaw wrote:
>> to do a lookup in a Python hashtable you need to
>>
>> (1) Wrap the float in a Python object
>> (2) call __hash__ on that new object
> 
> I hope that isn't what actually happens :-) Floating point and hashes 
> don't seem like a good idea.

That certainly depends on the hash function. A float value is not more than
a bunch of bits after all, just like an int or string. It even has the
advantage of being exactly a multiple of 4 bytes large, so a hash function
can actually deploy extremely efficient CPU operations.

Stefan

Stefan Behnel | 2 Sep 2009 20:08
Picon
Favicon

Re: [Cython] FW: cython and hash tables / dictionary


Sebastien Binet wrote:
> I'd recommand using this C library:
> http://c-algorithms.sourceforge.net/

Interesting. That would certainly make a nice C-level standard library.

Would you have ready-made .pxd files for this library? Or even some Cython
example code that you could post somewhere?

Stefan
Dag Sverre Seljebotn | 2 Sep 2009 21:21
Picon
Picon

Re: [Cython] FW: cython and hash tables / dictionary

Stefan Behnel wrote:
> Dag Sverre Seljebotn wrote:
>   
>> Robert Bradshaw wrote:
>>     
>>> to do a lookup in a Python hashtable you need to
>>>
>>> (1) Wrap the float in a Python object
>>> (2) call __hash__ on that new object
>>>       
>> I hope that isn't what actually happens :-) Floating point and hashes 
>> don't seem like a good idea.
>>     
>
> That certainly depends on the hash function. A float value is not more than
> a bunch of bits after all, just like an int or string. It even has the
> advantage of being exactly a multiple of 4 bytes large, so a hash function
> can actually deploy extremely efficient CPU operations.
>   
Yes, but the hash function to use would in the majority of real world 
cases depend heavily on the context. Up to what precision should two 
floats be compared for equality? Since roundoff errors will usually lead 
to slightly different values for storage and retrieval, unless you're 
really, really careful. (Always compare floats by abs(a - b) < eps and 
so on).

(I'd like to hear about actual use cases if there is any though! -- 
anything I can think of where store/retrieve would make sense is better 
represented by intervals on the real line than a single float.)

It is certainly possible to create hashes which rounds off floats in the 
right manner, but that's certainly more magic than I hope is embedded in 
Python's hash/eq -- I'd like to be able to specify my eps in such cases!

Dag Sverre
Stefan Behnel | 2 Sep 2009 21:28
Picon
Favicon

Re: [Cython] Thoughts after SciPy 09

Hi Dag,

Dag Sverre Seljebotn wrote:
> As many of you know me and Kurt attended SciPy 09. Four Cython-related
> events were held:
> 
>   - An introductory tutorial to Cython (by me)
>   - A talk about Cython for numerics (by me again)
>   - A talk on Fwrap (by Kurt)
>   - A Cython BoF
> 
> You can find links to slides and videos for the three first on
> http://conference.scipy.org.
> 
> An intensive week like that makes me reflect on what Cython is good
> about, lacking, etc. etc.
> 
> First of all, there seems to be quite a lot of interest in Cython, many
> thinks it is excellent, and many thanked me personally for our efforts.
> 
> One thing that's also very interesting to me personally is that there's 
> some talk of porting parts of NumPy over to Cython for easier Python 3 
> support.

This sounds like Cython is still heavily growing in interest, which is a
really good thing. Thanks for spreading the word so loudly.

> Beyond that, I've got a nice list of topics for further improvement. For 
> instance one thing that is very possible to fix was a real dealbreaker 
> that some complained about, and in one case stopped somebody from 
> recommending it to co-workers. It's always nice to get the "outside" 
> perspective that I get when I present Cython to lots of people.

Well, the list of requested features has been growing pretty long by now,
as has the list of things that need fixing. The Cython project is
definitely not lacking ideas.

> It seems to me that many has the impression that
> 
> a) Cython is complicated technology which takes much work
> b) A lot of effort is put into steadily improving it
> 
> BUT, I feel the reality is that
> 
> a) Core developers can implement new features or fix bugs rather quickly
> b) Relatively little time is spent in total on Cython, compared to some
> other projects

I totally second this. While the amount of developer time that is currently
available from the core developers is strictly limited by a variety of
factors, I keep getting surprised about the amount of features and
improvements that are doable with little effort. When I look at the code in
lxml, for example, I constantly notice hand written tweaks that are no
longer necessary today, simply because Cython got so much better over the
last year or so. You can really feel what it buys you to fix the code
generator instead of the code itself.

> Or put another way: Putting relatively little in can, at least at this 
> point in Cython's development, yield high returns.
> 
> Example: Profiling was a feature many at SciPy was anxious about
> getting and was asking about a lot. That's in trunk now, mainly because 
> Robert had an intercontinental flight (!!). (That admittedly might say a 
> lot more about Robert than Cython, but still.)

One problem I see here is the current release schedule. We keep getting
more and more away from the "release early, release often" principle.
Getting 0.12 out in one way or another would finally make all the great
trunk improvements available to regular users. Thing is: every release
needs a driver, so that's the first place where we could benefit from a
dedicated helping hand. I mean, Sun pays the Jython project lead, even
full-time. Guido is payed half-time for CPython evolution. I wonder if
there isn't enough commercial interest in Cython by now that could express
itself in contributed project time or financing. A developer day invested
in Cython development can easily pay off by making your own code easier to
write and/or faster to execute. Remember, we write C so you don't have to!

(maybe we should put the last two sentences on the front page ;)

> Cython can thrive without this too though! Looking at the coming 
> half-to-three-quarter year, here's what I'm guessing will happen:
> 
>   - I might get the new memoryviews from summer finished and merged with
> trunk

I wouldn't mind putting major new features off for 0.13 and releasing 0.12
sooner.

>   - Cython might run properly in Python 3 (w/ 2to3)

It almost does, except for two remaining bugs. Even the test runner works
out of the box now.

>   - Get -unstable stabilized and released (significant portion)

I'm all for making that the current priority. I even consider it mostly
stable, except for the few failing test cases (well, and for the open bugs
that lack a test case, obviously).

>   - Fwrap released

Independent of Cython's own schedule, except if there's a requirement for
better integration on Cython side (which would be fine for 0.13).

>   - Closures
>   - Better C++ support merged
>   - Perhaps some pyximport improvements

Again, fine for 0.13. While I guess that much of this done, I honestly
prefer a soon-to-be-outdated release over ever-lasting developer checkouts.

> Not bad at all! But, there's also a long list of projects we already 
> badly want to have done that we can't possibly reach now, IMO:
> 
>   - Fix the bugs, complete the test suite

Oh, well... ;)

I split up the following list into major enhancements that need real work
and will arrive as major new features:

>   - SIMD
>   - Control flow analysis!
>   - Type inference/a better pure Python mode

(note that type inference pretty much depends on control flow analysis)

and those that will continue to improve through applied spare time:

>   - Speed up compilation speed, break up compilation units/utility code
>   - Convenient debugging, line-by-line profiling
>   - Many rather low-hanging fruit CEPs which would make using Cython a 
> nicer experience
>   - Full Python semantics compatability for untyped code

Major new features (even the "we know how to do it" ones) will certainly
require more than the average spare time of the current core developers.

Stefan

Gmane