Just a short mail to summarize the things that are currently waiting for decisions / checks. First of all, I want to thank you guys for all the work you have been doing. We managed to get the whole thing up & running for production in the last month (the biggest data set in there is actually 2 billion records, though we spread that over multiple files).

The performance, reliability and stability are extremely good: we have an internal metadata system that translates report requests into internal calls and sends them to the relevant server (with automatic joins over multiple fact files, groupbys, dimension additions etc.). We moved from the in-mem Pandas based situation (which relied heavily on dataframe caching and therefore had issues with keeping the caches up to date, adding processes for hot data sets etc.) to BCOLZ, where we have a much more manageable situation (larger availability of files and many more cache-less processes which pick up files as needed). The performance is around 2-3x slower than in-mem Pandas (I do believe we can bring this near to 1-2x in time), but that in itself is really impressive already.
These things aren't in master yet though, which I understand; I just want you to think about it and let us know how you want to go on from here. Also, please realize that we do not have hard core programming backgrounds, so I'm sure there is stuff that can be improved upon too.
The building block that is really needed for everything is a carray_ext that is "cimport-able". Valentin did work of his own on this before; our version has been made separately. It works 100% but might not be optimized yet in terms of structure. Still, I hope you can see it as a 100% working version 0.1 that can be incorporated into master, also because none of the next discussions can be rolled out well without having this in place...
Francesc Elies (who made this) is on holidays back home in Barcelona atm, so if there are major questions we unfortunately cannot answer them until early January.
Based on the help from Valentin and on looking into existing Pandas functionality, we created a factorization for carrays.
Also, we made a function with which out-of-core ctables can cache the factorizations; it is not nicely integrated into any ctable metadata yet, but it works really well.
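To illustrate what factorization means here, below is a minimal sketch in plain NumPy. It is not the code from our branch: the function name factorize_chunks is hypothetical, and the chunks are ordinary NumPy arrays standing in for the chunks of a carray. It shows the idea of keeping one shared lookup table while walking through the data chunk by chunk, so that only one chunk of values is handled at a time.

```python
import numpy as np


def factorize_chunks(chunks):
    """Factorize an iterable of 1-D arrays, one chunk at a time.

    Returns (codes, uniques): uniques holds the distinct values in order
    of first appearance, and codes[i] is the position in uniques of the
    i-th input value.
    """
    lookup = {}       # value -> integer code, shared across all chunks
    code_chunks = []
    for chunk in chunks:
        codes = np.empty(len(chunk), dtype=np.int64)
        for i, value in enumerate(chunk.tolist()):
            # A new value gets the next free code; a known value keeps its code.
            codes[i] = lookup.setdefault(value, len(lookup))
        code_chunks.append(codes)
    uniques = np.array(list(lookup))  # dicts keep insertion order
    if code_chunks:
        all_codes = np.concatenate(code_chunks)
    else:
        all_codes = np.empty(0, dtype=np.int64)
    return all_codes, uniques


# Hypothetical example data: two chunks of a string column.
example_chunks = [np.array(["a", "b", "a"]), np.array(["c", "b"])]
example_codes, example_uniques = factorize_chunks(example_chunks)
```

The resulting integer codes and the list of unique values are the kind of thing that is worth caching next to the data, since they can be reused by every later groupby on the same column.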
Then there is an out-of-core approach for groupbys. The limitation here is that the aggregated result does need to be in-core (it's a temporary in-mem numpy array); this has to do with inherent limitations of bcolz (non-sequential writes), but it works really well. Bcolz files with 60 million records are processed really fast, and the memory usage is normally quite limited.
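As a rough illustration of the groupby approach described above, here is a small sketch in plain NumPy. It is an assumption-laden stand-in, not the implementation in the branch: groupby_sum_chunks is a hypothetical name, the chunks are plain NumPy arrays, and the group codes are assumed to come from an earlier factorization step. The input is read chunk by chunk, while the aggregated result lives in one in-memory array with one slot per group.

```python
import numpy as np


def groupby_sum_chunks(code_chunks, value_chunks, n_groups):
    """Sum values per group while reading the input chunk by chunk.

    code_chunks:  iterable of integer arrays with group codes in [0, n_groups)
    value_chunks: iterable of numeric arrays, aligned with code_chunks
    n_groups:     number of distinct groups
    """
    # The aggregated result is the only thing that has to fit in memory.
    totals = np.zeros(n_groups, dtype=np.float64)
    for codes, values in zip(code_chunks, value_chunks):
        # bincount adds up the weights (values) that share the same code.
        totals += np.bincount(codes, weights=values, minlength=n_groups)
    return totals


# Hypothetical example data: three groups spread over two chunks.
example_totals = groupby_sum_chunks(
    [np.array([0, 1, 0]), np.array([2, 1])],
    [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0])],
    n_groups=3,
)
```

Memory usage in this sketch depends on the chunk size and the number of groups rather than on the number of records, which matches the limited memory usage we see in practice.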
In / Not In Filters
The branch contains a workaround based on Pandas; as noted by Francesc Alted before, we need to add "in/not in" functionality to numexpr (this would actually also benefit Pandas, which uses a cython based "in set" check atm). This work still has to be started and belongs in numexpr, not here, so feel free to ignore it!
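For anyone wondering what such a filter looks like, here is a minimal sketch of a chunk-wise "in / not in" mask. Note that it uses NumPy's np.isin as a stand-in; it is neither the Pandas based workaround from the branch nor the numexpr functionality that still has to be written, and the name isin_mask_chunks is hypothetical.

```python
import numpy as np


def isin_mask_chunks(chunks, allowed, invert=False):
    """Build one boolean mask per chunk.

    With invert=False the mask marks values that are in `allowed` ("in");
    with invert=True it marks values that are not ("not in").
    """
    # np.isin does not treat a Python set as a collection, so convert first.
    allowed = np.asarray(list(allowed))
    return [np.isin(chunk, allowed, invert=invert) for chunk in chunks]


# Hypothetical example data: keep only the values 2 and 5.
example_chunks = [np.array([1, 2, 3]), np.array([4, 5])]
in_masks = isin_mask_chunks(example_chunks, {2, 5})
not_in_masks = isin_mask_chunks(example_chunks, {2, 5}, invert=True)
```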
I understand if you do not want to take the Factorize and Groupby into your maintenance because you want to keep it as small as possible; in that case we would make a separate public package which would work on top of BCOLZ (we would have to think of a fancy name for the BCOLZ Query Framework ;) We would still need the cython import though, so I hope you can really put that in your next release...
I hope this makes everything clear to everyone! As I said before, BCOLZ is very impressive, I hope we're helping you guys with this new functionality more than giving new headaches ;)