Nick Eubank | 8 Mar 20:07 2016

Best tools / idioms for cross-tabs of DataFrames?

Hi All,

Are there any tools or compact idioms people would recommend for doing things like simple cross-tabulations of DataFrame series? 

A simple tabulation is easy with `value_counts()` in pandas, but I haven't been able to find an easy way to do cross-tabs (like the following output from Stata):

. tab w_2_1 w_2_4

    No - Does | Generator - Does this
   this house |    house have ELEC
    have ELEC |        No        Yes |     Total
--------------+----------------------+----------
           No |       516          5 |       521
          Yes |       359          1 |       360
--------------+----------------------+----------
        Total |       875          6 |       881

I can do this with pivots and groupbys, but it seems inefficient...
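For concreteness, a sketch (not from the original post) of the groupby idiom alluded to above, alongside pandas' built-in pd.crosstab helper; df and the column names are placeholders for the survey data:

    import pandas as pd

    # the groupby/unstack idiom alluded to above
    df.groupby(['w_2_1', 'w_2_4']).size().unstack(fill_value=0)

    # the built-in helper; margins=True adds the Total row and column
    pd.crosstab(df['w_2_1'], df['w_2_4'], margins=True)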

Thanks!

Nick

Francesc Alted | 8 Mar 14:27 2016

[ANN] bcolz 1.0.0 RC1 released

==========================
Announcing bcolz 1.0.0 RC1
==========================

What's new
==========

Yeah, 1.0.0 is finally here.  We are not introducing any exciting new
features (just some optimizations and bug fixes), but bcolz is already 6
years old and implements most of the capabilities it was designed for,
so I decided to release a 1.0.0, meaning that the format is declared
stable and that people can be assured that future bcolz releases will be
able to read bcolz 1.0 data files (and probably much earlier ones too)
for a long while.  The format is fully described at:

https://github.com/Blosc/bcolz/blob/master/DISK_FORMAT_v1.rst

Also, a 1.0.0 release means that the bcolz 1.x series will be based on
the C-Blosc 1.x series (https://github.com/Blosc/c-blosc).  Once C-Blosc
2.x (https://github.com/Blosc/c-blosc2) is out, a new bcolz 2.x series
is expected to take advantage of the shiny new features of C-Blosc2
(more compressors, more filters, native variable-length support and the
concept of super-chunks), which should be very beneficial for the next
bcolz generation.

Important: this is a Release Candidate, so please test it as much as you
can.  If no issues appear in a week or so, I will proceed to tag and
release 1.0.0 final.  Enjoy!

For a more detailed change log, see:

https://github.com/Blosc/bcolz/blob/master/RELEASE_NOTES.rst


What it is
==========

*bcolz* provides columnar and compressed data containers that can live
either on-disk or in-memory.  Column storage allows for efficiently
querying tables with a large number of columns.  It also allows for
cheap addition and removal of columns.  In addition, bcolz objects are
compressed by default to reduce memory/disk I/O needs.  The
compression process is carried out internally by Blosc, an
extremely fast meta-compressor that is optimized for binary data. Lastly,
high-performance iterators (like ``iter()``, ``where()``) for querying
the objects are provided.

bcolz can use numexpr internally to accelerate many vector and query
operations (although it can use pure NumPy for doing so too).  numexpr
optimizes memory usage and uses several cores for doing the
computations, so it is blazing fast.  Moreover, since the carray/ctable
containers can be disk-based, it is possible to use them to seamlessly
perform out-of-core computations.
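As a quick illustration, here is a minimal usage sketch (editorial, not part of the announcement itself; assumes only NumPy and bcolz installed):

    import numpy as np
    import bcolz

    # a compressed, chunked container from a NumPy array
    a = bcolz.carray(np.arange(1000000))
    print(a)   # the repr includes sizes and the achieved compression ratio

    # a compressed table; columns are given as a tuple plus their names
    ct = bcolz.ctable((np.arange(10), np.linspace(0, 1, 10)), names=['i', 'x'])

    # numexpr-powered querying via the where() iterator
    hits = [row.i for row in ct.where('(i > 3) & (x < 0.8)')]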

bcolz has minimal dependencies (NumPy), comes with an exhaustive test
suite and fully supports both 32-bit and 64-bit platforms.  Also, it is
typically tested on both UNIX and Windows operating systems.

Together, bcolz and the Blosc compressor are finally fulfilling the
promise of accelerating memory I/O, at least for some real scenarios:

http://nbviewer.ipython.org/github/Blosc/movielens-bench/blob/master/querying-ep14.ipynb#Plots

Other users of bcolz are Visualfabriq (http://www.visualfabriq.com/),
the Blaze project (http://blaze.pydata.org/), Quantopian
(https://www.quantopian.com/) and Scikit-Allel
(https://github.com/cggh/scikit-allel), which you can read more about by
pointing your browser at the links below.

* Visualfabriq:

  * *bquery*, A query and aggregation framework for Bcolz:
  * https://github.com/visualfabriq/bquery

* Blaze:

  * Notebooks showing Blaze + Pandas + BColz interaction:
  * http://nbviewer.ipython.org/url/blaze.pydata.org/notebooks/timings-csv.ipynb
  * http://nbviewer.ipython.org/url/blaze.pydata.org/notebooks/timings-bcolz.ipynb

* Quantopian:

  * Using compressed data containers for faster backtesting at scale:
  * https://quantopian.github.io/talks/NeedForSpeed/slides.html

* Scikit-Allel

  * Provides an alternative backend to work with compressed arrays
  * https://scikit-allel.readthedocs.org/en/latest/model/bcolz.html

Installing
==========

bcolz is in the PyPI repository, so installing it is easy::

    $ pip install -U bcolz


Resources
=========

Visit the main bcolz repository at:
http://github.com/Blosc/bcolz

Manual:
http://bcolz.blosc.org

Home of Blosc compressor:
http://blosc.org

User's mail list:
bcolz@googlegroups.com
http://groups.google.com/group/bcolz

License is the new BSD:
https://github.com/Blosc/bcolz/blob/master/LICENSES/BCOLZ.txt

Release notes can be found in the Git repository:
https://github.com/Blosc/bcolz/blob/master/RELEASE_NOTES.rst

----

  **Enjoy data!**

--
Francesc Alted

DavidT | 7 Mar 04:26 2016

pandas groupby aggregate histogram bin columns

I'm trying to take a dataframe of format:

  timestamp,type,value

I want to group the data by minute and type, and bucket the values into histogram-bin-labeled columns containing the count of values for each bin, minute and type.

Here is code that attempts to do this, but it's not quite right:

import datetime
import random
import pandas as pd
import numpy as np

start_time = datetime.datetime.now()
type_list = ['a', 'b', 'c', 'd']
ten_minute_time_frame = 10 * 60 * 1000000
hist_bins = [i*20 for i in range(5)]

data = []
for i in range(500):
    d = {
        'timestamp': start_time + datetime.timedelta(microseconds=random.randint(0, ten_minute_time_frame)),
        'type': random.choice(type_list),
        'value': random.randint(0, 100),
    }
    data.append(d)

df = pd.DataFrame(data)
df['minute_stamp'] = df['timestamp'].dt.to_period('T')  # truncate each timestamp to its minute
print(df.groupby(['minute_stamp', 'type']).agg(lambda x: np.histogram(x, bins=hist_bins)))

What I'm looking for instead is output like this:

minute_stamp,type,20,40,60,80,100
2016-03-06 15:01,a,3,4,2,6,1
2016-03-06 15:02,a,1,7,4,5,4
2016-03-06 15:03,a,4,3,3,2,4
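A sketch (editorial, untested) of one route to that shape: pd.cut with bins widened to cover the full 0-100 range, then size/unstack to spread the bins into labeled columns:

    bins = [i * 20 for i in range(6)]   # [0, 20, ..., 100] so values up to 100 are covered
    df['bin'] = pd.cut(df['value'], bins=bins, labels=bins[1:], include_lowest=True)
    counts = (df.groupby(['minute_stamp', 'type', 'bin'])
                .size()
                .unstack('bin', fill_value=0))
    print(counts)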

Can someone help?

Thanks,
David

Denis Akhiyarov | 4 Mar 17:17 2016

delete original object used to construct dataframe during conversion

Is there any way to delete the original object used to construct a dataframe during the "conversion" process, so that memory doesn't grow?

For example, if a dictionary is used to construct the dataframe, each key would be deleted as soon as the corresponding column has been added to the dataframe.
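A sketch of the kind of handoff being asked about (a hypothetical helper, not an existing pandas API; each column is still copied once into the frame's internal blocks, but peak memory holds at most one extra column rather than a second full copy):

    import pandas as pd

    def dict_to_frame_incremental(data):
        # move columns one at a time so the source dict shrinks as the frame grows
        df = pd.DataFrame()
        for key in list(data):           # list() lets us pop while iterating
            df[key] = data.pop(key)      # the dict drops its reference here
        return df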

jillc | 1 Mar 22:20 2016

SciPy 2016

**SciPy 2016 Conference (Scientific Computing with Python) Announcement**

*Call for Proposals: Submit Your Tutorial and Talk Ideas by March 25, 2016 at http://scipy2016.scipy.org.*

SciPy 2016, the 15th annual Scientific Computing with Python conference, will be held July 11-17, 2016 in Austin, Texas. SciPy is a community dedicated to the advancement of scientific computing through open source Python software for mathematics, science, and engineering. The annual SciPy Conference brings together over 650 participants from industry, academia, and government to showcase their latest projects,
learn from skilled users and developers, and collaborate on code development.

The full program will consist of 2 days of tutorials (July 11-12), 3 days of talks (July 13-15), and 2
days of developer sprints (July 16-17). More info is available on the conference website at http://scipy2016.scipy.org (where you can sign up for the mailing list), or follow @scipyconf on Twitter.

We hope you’ll join us - early bird registration is open until May 22, 2016 at http://scipy2016.scipy.org/ehome/146062/332936/?&&

We encourage you to submit tutorial or talk proposals in the categories below; please also share with others who you’d like to see participate! Submit via the conference website: http://scipy2016.scipy.org.

------------------------------------------------------------------------------------------------------------------------------

*SUBMIT A SCIPY 2016 TUTORIAL PROPOSAL- DUE MARCH 21, 2016*

------------------------------------------------------------------------------------------------------------------------------

Details and submission here: http://scipy2016.scipy.org/ehome/146062/332967/?&&

These sessions provide extremely affordable access to expert training, and consistently receive fantastic feedback from participants. We're looking for submissions on topics from introductory to advanced - we'll have attendees across the gamut looking to learn. Whether you are a major contributor to a scientific Python library or an expert-level user, this is a great opportunity to share your knowledge, and stipends are available.

------------------------------------------------------------------------------------------------------------------------------

*SUBMIT A SCIPY 2016 TALK / POSTER PROPOSAL - DUE MARCH 25, 2016*

------------------------------------------------------------------------------------------------------------------------------

Details and submission here: http://scipy2016.scipy.org/ehome/146062/332968/?&&

SciPy 2016 will include 2 major topic tracks and 8 mini-symposia tracks.

Major topic tracks include:

- Python in Data Science (Big data and not so big data)

- High Performance Computing

Mini-symposia will include the applications of Python in:

  • Earth and Space Science
  • Biology and Medicine
  • Engineering
  • Social Science
  • Special Purpose Databases
  • Case Studies in Industry
  • Education
  • Reproducibility

If you have any questions or comments, feel free to contact us at: scipy-organizers-HDzwSpiosTzYtjvyW6yDsg@public.gmane.org

--------------------------------------------------------------------------

**SCIPY 2016 REGISTRATION IS OPEN**

--------------------------------------------------------------------------

Please register early. Early bird registration is open until May 22, 2016! Register at http://scipy2016.scipy.org. Plus, enter our t-shirt design contest to win a free registration. (Send a vector art file to scipy-SCgzsaguwNrby3iVrkZq2A@public.gmane.org by March 31 to enter.)


Amol Sharma | 28 Feb 20:07 2016

Reading csv from pandas having both quotechar and delimiter for a column value

Here is the content of a csv file, 'test.csv', that I am trying to read via pandas read_csv():

    "col1", "col2", "col3", "col4"
    "v1", "v2", "v3", "v4"
    "v21", "v22", "v23", "this, "creating, what to do? " problems"

This is the command I am using:

    messages = pd.read_csv('test.csv', sep=',', skipinitialspace=True)

But I am getting the following error:

    CParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 5

I want the content of column 4 in line 3 to be 'this, "creating, what to do? " problems'.

How do I read a file when a column value can contain both the quote character and the delimiter?
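One workaround, as a sketch: the embedded quotes in line 3 are unescaped, so the file is not valid CSV and no quoting option will recover it. Assuming every row uses exactly this quote-comma-space-quote layout, splitting on that delimiter by hand works:

    import pandas as pd

    with open('test.csv') as f:
        # strip the outermost quotes, then split on the exact '", "' separator
        rows = [line.rstrip('\n').strip('"').split('", "') for line in f]

    messages = pd.DataFrame(rows[1:], columns=rows[0])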


--
Thanks and Regards,
Amol Sharma

John Anderson | 26 Feb 18:37 2016

What is the best way to group by defined frequency with start/end dates?

I have a pandas dataframe with two columns, "event_id" and "date_created". What I would like to do is group by date_created at a given frequency (daily, weekly, monthly) and have the generated range span a start and end date.

So for example, if I have:

event_id   date_created
1          Feb 24, 2016
1          Feb 25, 2016
2          Feb 26, 2016
3          Feb 27, 2016
3          Feb 28, 2016
1          Feb 29, 2016
2          March 14, 2016
1          March 15, 2016

What I'd like to do is group by weeks and event_id, so within those ranges I can find the first and last date:

df.date_created.min() == Feb 24, 2016
df.date_created.max() == March 14, 2016

and then I can figure out what the first week in my range should be:

start_range = (
    first_date - timedelta(days=first_date.weekday())
).replace(hour=0, minute=0, second=0, microsecond=0)

end_range = (
    last_date + timedelta(days=6 - last_date.weekday())
).replace(hour=23, minute=59, second=59, microsecond=999999)


But once I have those, how do I tell the pandas groupby function that I want to group my events by a frequency of W-Mon, where the first week is Feb 22 and the last week is March 15, and that it should fill in empty counts for the weeks that don't have any events? With the example data I provided above, the result I want to generate is:

Week = Feb 22, 2016, event_id = 1, count = 2
Week = Feb 22, 2016, event_id = 2, count = 1
Week = Feb 22, 2016, event_id = 3, count = 2

Week = Feb 29, 2016, event_id = 1, count = 1
Week = Feb 29, 2016, event_id = 2, count = 0
Week = Feb 29, 2016, event_id = 3, count = 0

Week = March 7, 2016, event_id = 1, count = 0
Week = March 7, 2016, event_id = 2, count = 0
Week = March 7, 2016, event_id = 3, count = 0

Week = March 14, 2016, event_id = 1, count = 1
Week = March 14, 2016, event_id = 2, count = 1
Week = March 14, 2016, event_id = 3, count = 0


I realize I could write a function and apply it to the date_created column like:

    def get_week_trend(date_created):
        week_start = (
            date_created - timedelta(days=date_created.weekday())
        ).replace(hour=0, minute=0, second=0, microsecond=0)

        return week_start

    df['trend'] = df['date_created'].apply(get_week_trend)

and then group on that:

df.groupby([df.event_id, pd.TimeGrouper(key='trend', freq='W-Mon')]).size()


But this only gets the weeks that are populated and doesn't fill in the periods that have no results.
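A sketch (editorial) of one way to get the zero-filled grid, reusing the 'trend' column built above: group on it directly, then reindex against the full event_id-by-week product so missing combinations show up as 0:

    import pandas as pd

    counts = df.groupby(['event_id', 'trend']).size()

    # every Monday from the first populated week through the last one
    all_weeks = pd.date_range(df['trend'].min(), df['trend'].max(), freq='7D')
    grid = pd.MultiIndex.from_product(
        [sorted(df['event_id'].unique()), all_weeks],
        names=['event_id', 'trend'],
    )
    counts = counts.reindex(grid, fill_value=0)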

John Anderson | 25 Feb 19:59 2016

How to get the *current* week for a date instead of the *next* week

I'm trying to generate a time series grouped by a date frequency (month, day, week, etc.), and when I get the week for a date it returns the next Sunday instead of the Sunday of the week the date falls in.

For example, here is today (February 25th, 2016):

>>> date_range('2/25/2016', periods=3, freq='W-SUN')   
DatetimeIndex(['2016-02-28', '2016-03-06', '2016-03-13'], dtype='datetime64[ns]', freq='W-SUN')

Today should be in the week starting on February 21st.  February 28th is *next* week.  I see the same issue when using TimeGrouper, so I assume it's also using date_range:

>>> df.groupby([pd.TimeGrouper(key='date_created', freq='W-SUN')]).size()
date_created
2014-08-10    6
2016-02-28    2

and all the dates in the frame are:

>>> df.date_created.unique()
array(['2014-08-04T22:38:00.000000000-0700',
       '2016-02-25T08:51:00.000000000-0800',
       '2014-08-04T22:39:00.000000000-0700'], dtype='datetime64[ns]')

Am I expecting the wrong thing?  Is there a way to get the thing I want?
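One possibility, as a sketch: anchored 'W-SUN' labels point at the Sunday that *ends* each week, whereas a weekly period anchored on Saturday runs Sunday through Saturday, so its start_time is the containing week's Sunday:

>>> ts = pd.Timestamp('2016-02-25')
>>> ts.to_period('W-SAT').start_time
Timestamp('2016-02-21 00:00:00')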

Thanks!

'Dr. Leo' via PyData | 24 Feb 07:03 2016

GSoC proposal: Exporting multi-indexed DataFrames to Excel - other projects from the ecosystem?

Hi,

Some time ago I saw that this is not implemented yet. Filling this gap
should appeal to students who want to get familiar with MS Office
internals. I could not mentor this.

If there is a chance to accept topics more broadly within the pandas
ecosystem, I would happily mentor one or two relating to pandaSDMX, e.g.

https://github.com/dr-leo/pandaSDMX/issues/7

Leo


josef.pktd | 21 Feb 06:55 2016

rdata reader

I just saw this today

MIT licensed.

Is there something like this in python already?


Josef
