Nathaniel Smith | 17 Jul 03:21 2014

[ANN] patsy v0.3.0 released

Hi all,

I'm pleased to announce the v0.3.0 release of patsy. The main
highlight of this release is the addition of built-in functions to
compute natural and cyclic cubic splines and tensor spline products,
with optional constraints, all compatible with the R package 'mgcv'.
(Note that if you want to replace mgcv itself, you'll still need to
implement its penalized fitting algorithm -- these are just the
spline basis functions. But they are very useful on their own, and
they let you fit model coefficients with mgcv and then use Python to
generate predictions from that model.) We also dropped support for
Python 2.4 and 2.5, and switched to a single polyglot codebase for
py2 and py3, allowing us to distribute universal wheels.
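
For example, a quick sketch of the new spline builtins (the variable
names and df values here are just for illustration):

    import numpy as np
    from patsy import dmatrix

    data = {"x1": np.linspace(0.0, 1.0, 100),
            "x2": np.linspace(0.0, 1.0, 100) ** 2}

    # Natural cubic regression spline basis with 4 degrees of freedom
    dmatrix("cr(x1, df=4)", data)
    # Cyclic cubic spline basis
    dmatrix("cc(x2, df=4)", data)
    # Tensor product of the two marginal smooths, like mgcv's te()
    dmatrix("te(cr(x1, df=4), cc(x2, df=4))", data)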

Patsy is a Python library for describing statistical models
(especially linear models, or models that have a linear component) and
building design matrices. Patsy brings the convenience of R "formulas"
to Python.

Changes: https://patsy.readthedocs.org/en/latest/changes.html#v0-3-0

General information: https://github.com/pydata/patsy/blob/master/README

Share and enjoy,
-n

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

Tim Michelsen | 16 Jul 21:01 2014

Spyderlib: support for pandas objects added to Variable Explorer

Hi,
thanks to Carlos Cordoba & Daniel Høegh, the great Spyder IDE now has
support for pandas objects in its Variable Explorer:

* https://code.google.com/p/spyderlib/issues/detail?id=1160

* 
https://bitbucket.org/spyder-ide/spyderlib/commits/008607f1fd22665b8b32695ddf235f0866cf7e32

It should also land in the Ubuntu PPA tomorrow, after the nightly build:
https://code.launchpad.net/~pythonxy/+archive/ubuntu/pythonxy-devel

regards.


Phillip Cloud | 15 Jul 04:00 2014

Dropping numpy 1.6 support

Hi all,

We over at pandas would like to drop support for numpy 1.6 in the next release, v0.15.0. It has become a burden to support, in large part due to broken datetime functionality. All manner of ugly hacks and workarounds await the brave soul who dares venture into pandas/core/common.py. My feeling (and hopefully others') is that if you want to use modern versions of pandas, you should upgrade your numpy to something modern-ish as well. If there are folks who feel we should NOT drop numpy 1.6 support, please speak up! Thanks.


--
Best,
Phillip Cloud
John E | 14 Jul 03:11 2014

Re: Converting a big Stata program to Pandas?

Charles, completely agree that correct >> fast.  At the same time, the main reason for doing the conversion is to make the program faster.  If pandas isn't faster, the job will probably have to be done in either Fortran or NumPy, though of course that would take longer to program.


On Sunday, July 13, 2014 1:13:26 PM UTC-4, Charles Cloud wrote:
The last thing you should be worried about is speed.  With a port of this size you should start with tests that compare the output of the Stata program with the output of pandas.  Ideally you'd have this for every processing step, so that you can verify you haven't broken anything along the way.  Only when your output is correct should you worry about whether things are fast enough.  There are many ways to speed up your program, but correctness should be your main concern at this point.  I promise you'll thank your future self for writing a bunch of tests, so that you don't have to say a prayer every time the program runs.

I would start with creating some kind of wrapper that allows you to call Stata from within Python, then writing a series of tests that cover as much of the functionality as you want to preserve.  Then start porting, and every time you make a change, run the tests.  If something breaks, you'll know before writing a bunch more code that introduces even more bugs.
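
A minimal sketch of that workflow (the do-file names, the batch-mode
invocation, and the port_step1 stub here are all hypothetical):

    import subprocess
    import pandas as pd
    from pandas.util.testing import assert_frame_equal

    def run_stata(do_file):
        # Hypothetical wrapper: run a Stata do-file in batch mode.
        subprocess.check_call(["stata", "-b", "do", do_file])

    def port_step1():
        # Stand-in for the pandas port of the first processing step.
        return pd.read_stata("step1_input.dta")

    def test_step1():
        run_stata("step1.do")  # assumed to write step1_expected.dta
        expected = pd.read_stata("step1_expected.dta")
        assert_frame_equal(port_step1(), expected)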

John Eiler | 13 Jul 18:36 2014

Converting a big Stata program to Pandas?

I've been handed a fairly large Stata program (around 5,000 lines of code spread across several do-files) and asked to improve its readability, maintainability, and speed (it takes about an hour to run).  I haven't yet taken a deep dive into the code, but I was hoping someone here might be able to give some general tips.  I'm decent with Stata and can do some basic things in pandas (with a lot of help from the documentation).

The main question I have at this point is: what sort of speed differences can I expect between standard tasks in Stata and pandas?  E.g. merges, sorts, group-bys, creating new variables as functions of old ones, etc.

I thought I would be able to find some sort of Stata vs. pandas benchmarks to get a sense of this, but couldn't find anything via Google.  I do know from past experience that Python/Numba is much faster than Stata for basic tasks like generating new variables and summing over the data.

Anyway, my current plan is to take a section of the Stata code, convert it to pandas, and see what sort of speed difference I get.  Since pandas can read and write *.dta files, I think it should be pretty straightforward to replace things selectively as a first step.
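
A rough sketch of that kind of selective replacement (the file and
column names here are made up):

    import pandas as pd

    # Pick up the output of the preceding Stata step...
    df = pd.read_stata("intermediate.dta")
    # ...port just this one step to pandas...
    df["ratio"] = df["num"] / df["denom"]
    # ...and hand the result back to the remaining Stata code.
    df.to_stata("intermediate2.dta")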

It's probably early to ask this, but would buying something like MKL Optimizations from Continuum speed things up?  That is, I understand it would speed up NumPy, so would that indirectly speed up pandas as well?

Anyway, thanks in advance for any comments, suggestions, or warnings!

Jeff Reback | 11 Jul 15:31 2014

ANN: pandas 0.14.1 released

Hello,

We are proud to announce v0.14.1 of pandas, a minor release from 0.14.0. 

This release includes a small number of API changes, several new features,
enhancements, and performance improvements along with a large number of bug fixes. 

This was 1.5 months of work with 244 commits by 45 authors encompassing 306 issues.

We recommend that all users upgrade to this version.

Highlights:

  • New method select_dtypes() to select columns based on dtype (sketched below).
  • New method sem() to compute the standard error of the mean.
  • Support for dateutil timezones (see docs).
  • Support for ignoring full-line comments in the read_csv() text parser.
  • New documentation section on Options and Settings.
  • Lots of bug fixes.
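
A quick sketch of the first two highlights (made-up data):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"a": np.random.randn(5), "b": list("vwxyz")})
    numeric = df.select_dtypes(include=[np.number])  # keeps only column "a"
    err = df["a"].sem()  # standard error of the mean of column "a"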

For a full description of what's new in v0.14.1, see the Whatsnew section of the documentation below.

pandas is a Python package providing fast, flexible, and expressive data
structures designed to make working with “relational” or “labeled” data both
easy and intuitive. It aims to be the fundamental high-level building block for
doing practical, real world data analysis in Python. Additionally, it has the
broader goal of becoming the most powerful and flexible open source data
analysis / manipulation tool available in any language.


Documentation:
http://pandas.pydata.org/pandas-docs/stable/

Source tarballs and Windows binaries are available on PyPI.

The Windows binaries are courtesy of Christoph Gohlke and are built against NumPy 1.8.
Mac OS X wheels will be available soon, courtesy of Matthew Brett.

Please report any issues here:
https://github.com/pydata/pandas/issues


Thanks

The Pandas Development Team


Contributors to the 0.14.1 release

  • Andrew Rosenfeld
  • Andy Hayden
  • Benjamin Adams
  • Benjamin M. Gross
  • Brian Quistorff
  • Brian Wignall
  • bwignall
  • clham
  • Daniel Waeber
  • David Bew
  • David Stephens
  • DSM
  • dsm054
  • helger
  • immerrr
  • Jacob Schaer
  • jaimefrio
  • Jan Schulz
  • John David Reaver
  • John W. O’Brien
  • Joris Van den Bossche
  • jreback
  • Julien Danjou
  • Kevin Sheppard
  • K.-Michael Aye
  • Kyle Meyer
  • lexual
  • Matthew Brett
  • Matt Wittmann
  • Michael Mueller
  • Mortada Mehyar
  • onesandzeroes
  • Phillip Cloud
  • Rob Levy
  • rockg
  • sanguineturtle
  • Schaer, Jacob C
  • seth-p
  • sinhrks
  • Stephan Hoyer
  • Thomas Kluyver
  • Todd Jennings
  • TomAugspurger
  • unknown
  • yelite

Allie Wang | 9 Jul 20:38 2014

Cython/Pandas performance

We are using pandas in a high-performance environment and trying to get as much speed as possible.  We can make things much faster by working with the underlying numpy arrays.  However, DataFrame construction and groupby operations are slow, and it's unclear how to make them faster.  Any advice besides rewriting in C or Cython?
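
For example, dropping to the underlying arrays looks roughly like this (made-up data):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"key": np.random.randint(0, 10, 10 ** 6),
                       "val": np.random.randn(10 ** 6)})

    total = df["val"].values.sum()           # fast: a plain ndarray reduction
    means = df.groupby("key")["val"].mean()  # the slower groupby path in question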

Bryan Van de Ven | 9 Jul 17:13 2014

ANN: Bokeh 0.5 released

I am very happy to announce the release of Bokeh version 0.5! (http://continuum.io/blog/bokeh-0.5)

Bokeh is a Python library for visualizing large and real-time datasets on the web.

This release includes many new features: weekly dev releases, a new plot frame, a click tool, an "always on"
hover tool, multiple axes, log axes, minor ticks, gear and gauge glyphs, and an NPM BokehJS package.
Several usability enhancements have been made to the plotting.py interface to make it even easier to use.
The Bokeh tutorial also now includes exercises in IPython notebook form. And of course, we've made many
little bug fixes -- see the CHANGELOG for full details.
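
For instance, a minimal plot with the plotting.py interface looks
something like this (a sketch of the 0.5-era module-level API; see the
tutorial for exact details):

    from bokeh.plotting import output_file, line, show

    # Write a standalone HTML file containing a single line glyph.
    output_file("lines.html")
    line([1, 2, 3, 4], [6, 7, 2, 4], color="#1f77b4")
    show()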

The biggest news is all the long-term and architectural goals landing in Bokeh 0.5:

    * Widgets! Build apps and dashboards with Bokeh
    * Very high level bokeh.charts interface
    * Initial Abstract Rendering support for big data visualizations
    * Tighter Pandas integration
    * Simpler, easier plot embedding options

Expect dynamic, data-driven layouts (including ggplot-style auto-faceting) in upcoming releases, as
well as R language bindings, more statistical plot types in bokeh.charts, and cloud hosting for Bokeh apps.

Check out the full documentation, interactive gallery, and tutorial at

    http://bokeh.pydata.org

as well as the new Bokeh IPython notebook nbviewer index (including all the tutorials) at:

    http://nbviewer.ipython.org/github/ContinuumIO/bokeh-notebooks/blob/master/index.ipynb

If you are using Anaconda, you can install with conda:

    conda install bokeh

Alternatively, you can install with pip:

    pip install bokeh

BokehJS is also available by CDN for use in standalone JavaScript applications:

    http://cdn.pydata.org/bokeh-0.5.min.js
    http://cdn.pydata.org/bokeh-0.5.min.css

Issues, enhancement requests, and pull requests can be made on the Bokeh Github page: https://github.com/continuumio/bokeh

Questions can be directed to the Bokeh mailing list: bokeh@...

If you have interest in helping to develop Bokeh, please get involved! Special thanks to recent
contributors: Tabish Chasmawala, Samuel Colvin, Christina Doig, Tarun Gaba, Maggie Mari, Amy
Troschinetz, Ben Zaitlen.

Bryan Van de Ven
Continuum Analytics
http://continuum.io


Simon Cropper | 9 Jul 16:48 2014

BIG data... What is BIG?

Hi,

I have been exploring various projects that claim to handle BIG data,
but to be honest, most do not qualify what BIG actually means.

I remember the days when programs specified the maximum number of
records, maximum number of fields, and maximum number of tables in a
database that could be manipulated at any one time. Why don't these
kinds of specs get provided for languages and libraries anymore?

What are people's impressions of what BIG actually means when used to
describe large datasets?

To me, BIG is millions of records and multiple linked tables.

Simon


'Michael' via PyData | 9 Jul 11:49 2014

docs, pandas ecosystem, visualization, glue

I just found Glue, and it's quite amazing in its simplicity and
usefulness. I'm not sure how I bumped into it; I realize it's not so
easy to find.

I wanted to raise awareness of it because it's aware of pandas
DataFrames and, at least from my point of view, it deserves a place in
the pandas docs (ecosystem, visualization).

Anyway, highly recommended for data exploration.

Simon Cropper | 8 Jul 14:01 2014

pandas -- maximum size of records / size of dataframe

Hi,

Can someone please document, or point me to some documentation on, the
maximum number of records / size of DataFrame that can be manipulated
by pandas at one time?

My understanding is that a DataFrame is resident in memory while being
worked on, so is the limit set by available memory, or does pandas
cache sections of the DataFrame?

Any information would be appreciated.

-- 
Cheers Simon

    Simon Cropper - Open Content Creator

    Free and Open Source Software Workflow Guides
    ------------------------------------------------------------
    Introduction               http://www.fossworkflowguides.com
    GIS Packages           http://www.fossworkflowguides.com/gis
    bash / Python    http://www.fossworkflowguides.com/scripting


