15 Aug 16:59 2014

### Histogram as its own class

Johann Goetz <theodore.goetz <at> gmail.com>

2014-08-15 14:59:47 GMT

Hello,

I'm a long-time user of scipy doing mostly multivariate big-data (several terabytes) analysis in the high-energy physics realm. One thing I've found useful is to promote the histogram to its own class. Instead of creating yet another package, I have a mind to include it in the scipy.stats module and I would like some feedback, i.e. is this the right place for such an object?

I have some documentation, though not enough I would say. The classes are currently buried in my "pyhep" project, but they can easily be extracted.

https://bitbucket.org/theodoregoetz/pyhep/wiki/Home

Here are some details:

The histograms I am addressing are N-dimensional over a continuous domain (floating-point data, no gaps, though bins can have the value inf or nan if need be) along each axis. The axes need not be uniform.

There are two classes: HistogramAxis and Histogram. The axes are always floating point, but the histogram's data can be any dtype (default: np.int; a "cast" to float is done when dividing two histograms). I make use of np.histogramdd() and store the data along with the uncertainty. Many operations are supported, including adding, subtracting, multiplying, dividing, bin-merging, cutting/clipping along one or more axes, projecting along an axis, iterating over an axis, and filling from a sample with or without weights.
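
To make the idea concrete, here is a rough sketch of what such a class could look like. It is not the pyhep code: the name SketchHistogram, its attributes, and the choice of Poisson (sum of squared weights) variances with uncorrelated error propagation are all assumptions made for illustration. Only np.histogramdd() is taken from the description above.

```python
import numpy as np


class SketchHistogram:
    """Hypothetical minimal N-dimensional histogram (not the pyhep API)."""

    def __init__(self, *edges):
        # One array of bin edges per axis; the spacing need not be uniform.
        self.edges = [np.asarray(e, dtype=float) for e in edges]
        shape = tuple(len(e) - 1 for e in self.edges)
        self.data = np.zeros(shape, dtype=np.int64)
        # Assumption: the uncertainty is tracked as a per-bin variance.
        self.variance = np.zeros(shape, dtype=float)

    @property
    def uncert(self):
        return np.sqrt(self.variance)

    def fill(self, sample, weights=None):
        # Accept shape (N,) for one axis or (N, D) for D axes.
        sample = np.asarray(sample, dtype=float).reshape(-1, len(self.edges))
        if weights is None:
            counts, _ = np.histogramdd(sample, bins=self.edges)
            self.data = self.data + counts.astype(np.int64)
            self.variance = self.variance + counts  # Poisson: var = N
        else:
            w = np.asarray(weights, dtype=float)
            sumw, _ = np.histogramdd(sample, bins=self.edges, weights=w)
            sumw2, _ = np.histogramdd(sample, bins=self.edges, weights=w**2)
            self.data = self.data + sumw  # promotes the data to float
            self.variance = self.variance + sumw2

    def __add__(self, other):
        out = SketchHistogram(*self.edges)
        out.data = self.data + other.data
        out.variance = self.variance + other.variance
        return out

    def __truediv__(self, other):
        # Division casts to float; empty denominators give nan or inf bins.
        out = SketchHistogram(*self.edges)
        a = self.data.astype(float)
        b = other.data.astype(float)
        with np.errstate(divide="ignore", invalid="ignore"):
            ratio = a / b
            # Assumption: the two histograms are statistically uncorrelated.
            out.variance = ratio**2 * (self.variance / a**2
                                       + other.variance / b**2)
        out.data = ratio
        return out


# Non-uniform bin edges along a single axis.
h = SketchHistogram([0.0, 1.0, 2.0, 4.0, 8.0])
h.fill([0.5, 1.5, 1.7, 3.0])
total = h + h
ratio = h / total  # the last bin is 0/0 and becomes nan
```

Keeping the variance rather than the uncertainty itself makes addition and weighted filling simple sums, with the square root taken only on access.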

Most of the power in this package is in the fitting method of the histogram, which makes use of scipy.optimize.curve_fit(). It handles missing data (when a bin is inf or nan), can include the uncertainty in the fit, and calculates a goodness of fit.
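
A fit along those lines might look roughly like the following for a 1D histogram. This is a sketch under stated assumptions, not the pyhep fitting method: fit_histogram and gaussian are hypothetical names, the model is evaluated at bin centers, bins with non-finite content are simply masked out, and the goodness of fit is taken to be chi-square with its number of degrees of freedom.

```python
import numpy as np
from scipy.optimize import curve_fit


def gaussian(x, amplitude, mean, width):
    return amplitude * np.exp(-0.5 * ((x - mean) / width) ** 2)


def fit_histogram(edges, data, uncert, model, p0):
    """Hypothetical helper: fit `model` to bin contents at the bin centers."""
    edges = np.asarray(edges, dtype=float)
    data = np.asarray(data, dtype=float)
    uncert = np.asarray(uncert, dtype=float)
    centers = 0.5 * (edges[:-1] + edges[1:])

    # Treat inf/nan bins (and bins without a usable uncertainty) as missing.
    ok = np.isfinite(data) & np.isfinite(uncert) & (uncert > 0)

    popt, pcov = curve_fit(model, centers[ok], data[ok], p0=p0,
                           sigma=uncert[ok], absolute_sigma=True)

    # Goodness of fit: chi-square and degrees of freedom.
    resid = (data[ok] - model(centers[ok], *popt)) / uncert[ok]
    chi2 = float(np.sum(resid**2))
    ndf = int(ok.sum()) - len(popt)
    return popt, pcov, chi2, ndf


# Noise-free toy data with two "missing" bins.
edges = np.linspace(-5.0, 5.0, 21)
centers = 0.5 * (edges[:-1] + edges[1:])
data = gaussian(centers, 100.0, 0.5, 1.2)
uncert = np.sqrt(np.clip(data, 1.0, None))
data[3] = np.nan
data[16] = np.inf

popt, pcov, chi2, ndf = fit_histogram(edges, data, uncert,
                                      gaussian, p0=[80.0, 0.0, 1.0])
```

Passing absolute_sigma=True tells curve_fit() to treat the uncertainties as absolute rather than relative weights, so the returned covariance is not rescaled by the reduced chi-square.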

On top of this, I have free functions to plot 1D and 2D histograms using matplotlib, as well as functions to handle reading in large HDF5 files. These are auxiliary and may not fit into scipy directly.

Thank you all,

Johann.

_______________________________________________ SciPy-Dev mailing list SciPy-Dev <at> scipy.org http://mail.scipy.org/mailman/listinfo/scipy-dev