Julian Taylor | 18 May 2013 08:12

faster (selection based) median, 2013 edition

hi,

once again I want to bring up the median algorithm which is implemented
in terms of sorting in numpy.
median (and percentile and a couple more functions) can be more
efficiently implemented in terms of a selection algorithm. The
complexity can then be linear instead of linearithmic.

I found numerous discussions of this in the list archives [1, 2, 3] but
I did not find why those attempts failed, the threads all just seemed to
stop.
Did the previous attempts fail due to lack of time or was there a
fundamental reason blocking this change?

In the hope of the former, I went ahead and implemented a prototype of a
partition function (similar to [3] but only one argument) and
implemented median in terms of it.
partition is not like C++ partition; it's equivalent to nth_element in C++,
maybe it's better to name it nth_element?

The code is available here:
https://github.com/juliantaylor/numpy/tree/select-median

the partition interface is:
ndarray.partition(kth, axis=-1)
kth is an integer.
The array is rearranged so that the k-th element is in its final sorted
position; all elements before it are smaller or equal and all elements
after it are greater or equal, but the ordering within each side is
undefined.
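
A minimal sketch of a selection-based median, assuming the np.partition
function found in later NumPy releases (which also accepts a sequence of
kth values); the prototype interface described above may differ. The
helper name median_via_partition is hypothetical.

```python
import numpy as np

def median_via_partition(values):
    """Median of a 1-D array using selection instead of a full sort."""
    a = np.asarray(values, dtype=float)
    n = a.size
    if n == 0:
        raise ValueError("median of an empty array is undefined")
    mid = n // 2
    if n % 2:
        # Odd length: only the middle element must be in sorted position.
        return np.partition(a, mid)[mid]
    # Even length: place both middle elements, then average them.
    part = np.partition(a, [mid - 1, mid])
    return 0.5 * (part[mid - 1] + part[mid])

data = np.array([7.0, 1.0, 5.0, 3.0, 9.0, 2.0])
result = median_via_partition(data)
print(result, np.median(data))
```

Only the elements at the requested positions are guaranteed to be in
place, which is what allows the work to stay below that of a full sort.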


Joe Piccoli | 18 May 2013 07:11

Newbie trying to install NumPy

Hello,

I've been trying to install NumPy to run with Eclipse on Windows Vista. After installing (I thought) NumPy I was seeing:

ImportError: Error importing numpy: you should not try to import numpy from
        its source directory; please exit the numpy source tree, and relaunch
        your python intepreter from there.

I next tried to follow the instructions from the scipy.org website and downloaded and ran:

numpy-1.7.1-win32-superpack-python27.exe

This started up but I immediately saw the following dialog:

---------------------------
Cannot install
---------------------------
Python version 2.7 required, which was not found in the registry.
---------------------------
OK
---------------------------

The next dialog prompted for a Python installation to use but the list box was empty and it would not allow me to enter a path.

Is it absolutely necessary to build NumPy myself or is there a working installation out there? I know I'm doing something wrong but I don't know what it is. Any assistance would be greatly appreciated :).

Thanks,
Joseph A. Piccoli
joe13676 <at> comcast.net

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion <at> scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Francesc Alted | 17 May 2013 20:36

ANN: python-blosc 1.1 RC1 available for testing

================================
Announcing python-blosc 1.1 RC1
================================

What is it?
===========

python-blosc (http://blosc.pydata.org) is a Python wrapper for the
Blosc compression library.

Blosc (http://blosc.org) is a high performance compressor optimized for
binary data.  It has been designed to transmit data to the processor
cache faster than the traditional, non-compressed, direct memory fetch
approach via a memcpy() OS call.  Whether this is achieved or not
depends on the data compressibility, the number of cores in the system,
and other factors.  See a series of benchmarks conducted for many
different systems: http://blosc.org/trac/wiki/SyntheticBenchmarks.

Blosc works well for compressing numerical arrays that contain data
with relatively low entropy, like sparse data, time series, grids with
regular-spaced values, etc.

There is also a handy command line for Blosc called Bloscpack
(https://github.com/esc/bloscpack) that allows you to compress large
binary datafiles on-disk.  Although the format for Bloscpack has not
stabilized yet, it allows you to effectively use Blosc from your
favorite shell.

What is new?
============

- Added new `compress_ptr` and `decompress_ptr` functions that allow
   compressing and decompressing from/to a data pointer.  These are low
   level calls and the user must make sure that the pointer data area is
   safe.

- Since Blosc (the C library) can already be installed as a
   standalone library (via cmake), it is also possible to link
   python-blosc against a system Blosc library.

- The Python calls to Blosc are now thread-safe (another consequence of
   the recent Blosc library supporting this at the C level).

- Many checks on types and ranges of values have been added.  Most of
   the calls will now complain when passed the wrong values.

- Docstrings are much improved. Also, Sphinx-based docs are available
   now.

Many thanks to Valentin Hänel for his impressive work for this release.

For more info, you can see the release notes in:

https://github.com/FrancescAlted/python-blosc/wiki/Release-notes

More docs and examples are available in the documentation site:

http://blosc.pydata.org

Installing
==========

python-blosc is in the PyPI repository, so installing it is easy:

$ pip install -U blosc  # yes, you should omit the python- prefix

Download sources
================

The sources are managed through github services at:

http://github.com/FrancescAlted/python-blosc

Documentation
=============

There is a Sphinx-based documentation site at:

http://blosc.pydata.org/

Mailing list
============

There is an official mailing list for Blosc at:

blosc <at> googlegroups.com
http://groups.google.es/group/blosc

Licenses
========

Both Blosc and its Python wrapper are distributed using the MIT license.
See:

https://github.com/FrancescAlted/python-blosc/blob/master/LICENSES

for more details.

--
Francesc Alted
Rodrigo Botafogo | 17 May 2013 15:20

[ANN] Multidimensional Array - MDArray (0.5.0)

Although this is not directly connected to NumPy, I believe that it could be of interest to the NumPy community.  If for any reason it is improper to post this type of announcement on this list, please let me know.

I'm happy to announce a new version of MDArray...


MDArray
=======

MDArray is a multi dimensional array implemented for JRuby inspired by NumPy (www.numpy.org)
and NArray (narray.rubyforge.org) by Masahiro Tanaka.  MDArray stands on the shoulders of 
NetCDF-Java and Parallel Colt.
 
NetCDF-Java Library is a Java interface to NetCDF files, as well as to many other types of 
scientific data formats.  It is developed and distributed by Unidata (http://www.unidata.ucar.edu). 

Parallel Colt is a multithreaded version of Colt (http://acs.lbl.gov/software/colt/).  Colt provides a set of Open Source 
Libraries for High Performance Scientific and Technical Computing in Java. Scientific 
and technical computing is characterized by demanding problem sizes and a need for high 
performance at reasonably small memory footprint.

MDArray and SciRuby
===================

MDArray subscribes fully to the SciRuby Manifesto (http://sciruby.com/).  

"Ruby has for some time had no equivalent to the beautifully constructed NumPy, SciPy, 
and matplotlib libraries for Python. 

We believe that the time for a Ruby science and visualization package has come. Sometimes 
when a solution of sugar and water becomes super-saturated, from it precipitates a pure, 
delicious, and diabetes-inducing crystal of sweetness, induced by no more than the tap 
of a finger. So is occurring now, we believe, with numeric and visualization libraries for Ruby."

Main properties
===============

  + Homogeneous multidimensional array, a table of elements (usually numbers), all of the 
      same type, indexed by a tuple of positive integers;
  + Easy calculation for large numerical multi dimensional arrays;
  + Basic types are: boolean, byte, short, int, long, float, double, string, structure;
  + Based on JRuby, which allows importing Java libraries;
  + Operator: +,-,*,/,%,**, >, >=, etc.
  + Functions: abs, ceil, floor, truncate, is_zero, square, cube, fourth;
  + Binary Operators: &, |, ^, ~ (binary_ones_complement), <<, >>;
  + Ruby Math functions: acos, acosh, asin, asinh, atan, atan2, atanh, cbrt, cos, erf, exp, 
      gamma, hypot, ldexp, log, log10, log2, sin, sinh, sqrt, tan, tanh, neg;
  + Boolean operations on boolean arrays: and, or, not;
  + Fast descriptive statistics from Parallel Colt (complete list found below);
  + Easy manipulation of arrays: reshape, reduce dimension, permute, section, slice, etc.
  + Reading of two dimensional arrays from CSV files (mainly for debugging and simple 
      testing purposes);
  + StatList: a list that can grow/shrink and that can compute Parallel Colt descriptive 
      statistics. 

Descriptive statistics methods
==============================

auto_correlation, correlation, covariance, durbin_watson, frequencies, geometric_mean, 
harmonic_mean, kurtosis, lag1, max, mean, mean_deviation, median, min, moment, moment3, 
moment4, pooled_mean, pooled_variance, product, quantile, quantile_inverse, 
rank_interpolated, rms, sample_covariance, sample_kurtosis, 
sample_kurtosis_standard_error, sample_skew, sample_skew_standard_error, 
sample_standard_deviation, sample_variance, sample_weighted_variance, skew, split,  
standard_deviation, standard_error, sum, sum_of_inversions, sum_of_logarithms, 
sum_of_powers, sum_of_power_deviations, sum_of_squares, sum_of_squared_deviations, 
trimmed_mean, variance, weighted_mean, weighted_rms, weighted_sums, winsorized_mean.

Installation and download
=========================

  + Install JRuby
  + jruby -S gem install mdarray

Contributors
============

  + Contributors are welcome.

Homepages
=========



HISTORY
=======

  + 16/05/2013: Version 0.5.0: All loops transferred to Java with over 50% performance 
      improvement.  Descriptive statistics from Parallel Colt.
  + 19/04/2013: Version 0.4.3: Fixes a simple (but fatal) bug.  No new features
  + 17/04/2013: Version 0.4.2: Adds simple statistics and boolean operators
  + 05/05/2013: Version 0.4.0: Initial release

--
Rodrigo Botafogo



Phillip Feldman | 17 May 2013 00:09

numpy.nanmin, numpy.nanmax, and scipy.stats.nanmean

It seems odd that `nanmin` and `nanmax` are in NumPy, while `nanmean` is in SciPy.stats.  I'd like to propose that a `nanmean` function be added to NumPy.
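
As an illustration of what such a function computes, here is a minimal
sketch of a NaN-ignoring mean built only from masking. The name
nanmean_sketch is hypothetical and this is not the SciPy implementation;
returning NaN for an all-NaN slice is an assumption of the sketch.

```python
import numpy as np

def nanmean_sketch(a, axis=None):
    """Mean over `axis`, ignoring NaN entries (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    valid = ~np.isnan(a)
    # Replace NaNs by zero so they do not contribute to the sum.
    total = np.where(valid, a, 0.0).sum(axis=axis)
    count = valid.sum(axis=axis)
    # A slice with no valid entries gives 0/0, which is reported as NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        return total / count

x = np.array([[1.0, np.nan, 3.0],
              [np.nan, np.nan, np.nan]])
row_means = nanmean_sketch(x, axis=1)
print(row_means)
```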

Phillip
Neal Becker | 16 May 2013 20:42

RuntimeWarning: divide by zero encountered in log

Is there a way to get a traceback instead of just printing the
line that triggered the error?
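
One possible approach, sketched under the assumption that turning the
warning into an exception is acceptable: np.errstate(divide='raise')
makes the ufunc raise FloatingPointError, which carries a normal
traceback. The function risky_log is a hypothetical stand-in for the
code that triggers the warning.

```python
import traceback
import numpy as np

def risky_log(values):
    # Stand-in for the user code that computes log(0).
    return np.log(values)

tb_text = ""
try:
    # Inside this block a divide-by-zero raises instead of warning.
    with np.errstate(divide="raise"):
        risky_log(np.array([1.0, 0.0]))
except FloatingPointError:
    tb_text = traceback.format_exc()

print(tb_text)
```

np.seterr(divide='raise') applies the same setting globally, and the
standard warnings module can also escalate RuntimeWarning to an error
with warnings.simplefilter('error', RuntimeWarning).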
Julian Taylor | 16 May 2013 19:42

experiments with SSE vectorization

Hi,
I have been experimenting a bit with how applicable SSE vectorization is
to NumPy.
In principle the core of NumPy mostly deals with memory bound
operations, but it turns out on modern machines with large caches you
can still get decent speed ups.

The experiments are available on this fork:
https://github.com/juliantaylor/numpy/tree/simd-experiments
It includes a simple benchmark 'npbench.py' in the top level.
No runtime detection is used; it is only enabled on amd64 systems (which
always have SSE2).

The simd-experiments branch vectorizes the sqrt, basic math operations
and min/max reductions.
For float32 operations you get speedups around 2 (simple ops) - 4 (sqrt).
For double it is around 1.2 - 2, depending on the cpu.
My Phenom(tm) II X4 955 retains a good speedup even for very large
data sizes, but on Intel CPUs (Xeon and Core 2 Duo) you don't gain anything
if the data is larger than the L3 cache.
The vectorized version was never slower on the Phenom and Xeon.
But on a Core 2 Duo the normal addition with very large datasets got 10%
slower. This can be compensated by using aligned load operations, but
it's not implemented yet.
I'm interested in the output of npbench.py on other CPUs, so
if you want to try it please send me the output (include /proc/cpuinfo).
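
For readers without the branch at hand, the following is a rough sketch
of how per-operation timings of this kind could be collected. It is not
the actual npbench.py; the helper names time_op and bench, the operation
list, and the 2 MB working set are assumptions. A speedup figure would
come from running it against two NumPy builds and dividing the timings.

```python
import timeit
import numpy as np

def time_op(op, data, repeat=3, number=5):
    """Best observed wall-clock time in seconds for one call of op(data)."""
    timer = timeit.Timer(lambda: op(data))
    return min(timer.repeat(repeat=repeat, number=number)) / number

def bench(dtype, nbytes=2 * 1024 * 1024):
    """Time a few operations on an array occupying about nbytes."""
    n = nbytes // np.dtype(dtype).itemsize
    d = (np.random.rand(n) + 1.0).astype(dtype)  # strictly positive data
    ops = {
        "np.max(d)": np.max,
        "np.sum(d)": np.sum,
        "np.sqrt(d)": np.sqrt,
        "np.add(d, d)": lambda x: np.add(x, x),
    }
    return {name: time_op(op, d) for name, op in ops.items()}

timings = {np.dtype(t).name: bench(t) for t in (np.float32, np.float64)}
for dtype_name, result in timings.items():
    for name, seconds in result.items():
        print(f"{dtype_name:8s} {name:14s} {seconds * 1e6:10.1f} us")
```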

The code is a little rough; it can probably be cleaned up a bit by
adapting the code generator used.
Would this be something worth including in NumPy?

Further vectorization targets on my todo list are things like
std/var/mean, basically anything that has a high computation/memory
ratio. Suggestions are welcome.

Here are the detailed results for my Phenom:
float32 datasize (2MB)
operation:                         speedup
np.float32 np.max(d)                 3.04
np.float32 np.min(d)                  3.1
np.float32 np.sum(d)                 3.02
np.float32 np.prod(d)                3.04
np.float32 np.add(1, d)              1.44
np.float32 np.add(d, 1)              1.45
np.float32 np.divide(1, d)           3.41
np.float32 np.divide(d, 1)           3.41
np.float32 np.divide(d, d)           3.42
np.float32 np.add(d, d)              1.42
np.float32 np.multiply(d, d)         1.43
np.float32 np.sqrt(d)                4.26

float64 datasize (4MB)
operation:                         speedup
np.float64 np.max(d)                    2
np.float64 np.min(d)                 1.89
np.float64 np.sum(d)                 1.62
np.float64 np.prod(d)                1.63
np.float64 np.add(1, d)              1.08
np.float64 np.add(d, 1)             0.993
np.float64 np.divide(1, d)           1.83
np.float64 np.divide(d, 1)           1.74
np.float64 np.divide(d, d)            1.8
np.float64 np.add(d, d)              1.02
np.float64 np.multiply(d, d)         1.05
np.float64 np.sqrt(d)                2.22

The results for the Intel CPUs are attached.
Attachment (results.tar.gz): application/gzip, 8 KiB
Thomas Robitaille | 16 May 2013 15:19

__array_priority__ ignored if __array__ is present

Hi everyone,

(this was posted as part of another topic, but since it was unrelated,
I'm reposting as a separate thread)

I've also been having issues with __array_priority__ - the following
code behaves differently for __mul__ and __rmul__:

"""
import numpy as np

class TestClass(object):

    def __init__(self, input_array):
        self.array = input_array

    def __mul__(self, other):
        print "Called __mul__"

    def __rmul__(self, other):
        print "Called __rmul__"

    def __array_wrap__(self, out_arr, context=None):
        print "Called __array_wrap__"
        return TestClass(out_arr)

    def __array__(self):
        print "Called __array__"
        return np.array(self.array)
"""

with output:

"""
In [7]: a = TestClass([1,2,3])

In [8]: print type(np.array([1,2,3]) * a)
Called __array__
Called __array_wrap__
<class '__main__.TestClass'>

In [9]: print type(a * np.array([1,2,3]))
Called __mul__
<type 'NoneType'>
"""

Is this also an oversight? I opened a ticket for it a little while ago:

https://github.com/numpy/numpy/issues/3164

Any ideas?

Thanks!
Tom
Martin Raspaud | 16 May 2013 09:35

Strange memory consumption in numpy?

Hi all,

In the context of memory profiling an application (with the memory_profiler
module) we came across a strange behaviour in numpy, see for yourselves:

Line #    Mem usage    Increment   Line Contents
================================================
    29                             @profile
    30    23.832 MB     0.000 MB   def main():
    31    46.730 MB    22.898 MB       arr1 = np.random.rand(1000000, 3)
    32    58.180 MB    11.449 MB       arr1s = arr1.astype(np.float32)
    33    35.289 MB   -22.891 MB       del arr1
    34    35.289 MB     0.000 MB       gc.collect()
    35    58.059 MB    22.770 MB       arr2 = np.random.rand(1000000, 3)
    36    69.500 MB    11.441 MB       arr2s = arr2.astype(np.float32)
    37    69.500 MB     0.000 MB       del arr2
    38    69.500 MB     0.000 MB       gc.collect()
    39    69.500 MB     0.000 MB       arr3 = np.random.rand(1000000, 3)
    40    80.945 MB    11.445 MB       arr3s = arr3.astype(np.float32)
    41    80.945 MB     0.000 MB       del arr3
    42    80.945 MB     0.000 MB       gc.collect()
    43    80.945 MB     0.000 MB       return arr1s, arr2s, arr3s

Lines 31-34 behave as expected, but we don't understand
35-38 (why is arr2 not garbage collected?) and 39-42 (why doesn't the
np.random.rand call allocate any memory?).
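
As a cross-check of the increments in the profile, the array sizes can
be computed directly from nbytes. This small sketch only confirms the
expected footprint of each array (about 22.9 MiB for float64 and about
11.4 MiB for float32); it does not by itself explain the allocator
behaviour asked about.

```python
import numpy as np

MIB = 1024 ** 2

arr = np.random.rand(1000000, 3)      # float64: 8 bytes per element
arr_f32 = arr.astype(np.float32)      # float32: 4 bytes per element

float64_mib = arr.nbytes / MIB
float32_mib = arr_f32.nbytes / MIB
print(f"float64: {arr.nbytes} bytes = {float64_mib:.3f} MiB")
print(f"float32: {arr_f32.nbytes} bytes = {float32_mib:.3f} MiB")
```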

Can anyone give a reasonable explanation?

I attach the full script for reference.

Best regards,
Martin
Attachment (testnumpymem.py): text/x-python, 1235 bytes
Attachment (martin_raspaud.vcf): text/x-vcard, 303 bytes
Toder, Evgeny | 14 May 2013 19:26

Integer overflow in test_einsum (1.7.1)

Hello,

One of the test cases in test_einsum causes integer overflow for i2 type. The test goes like this:

>>> import numpy as np
>>> dtype = 'i2'
>>> n = 15
>>> a = np.arange(4*n, dtype=dtype).reshape(4,n)
>>> b = np.arange(n*6, dtype=dtype).reshape(n,6)
>>> c = np.arange(24, dtype=dtype).reshape(4,6)

It then calculates AxB using einsum. The problem is that the values in the last row of the result do not fit into i2:

>>> np.einsum("ij,jk", a, b, dtype='f8', casting='unsafe')
array([[  6090.,   6195.,   6300.,   6405.,   6510.,   6615.],
       [ 15540.,  15870.,  16200.,  16530.,  16860.,  17190.],
       [ 24990.,  25545.,  26100.,  26655.,  27210.,  27765.],
       [ 34440.,  35220.,  36000.,  36780.,  37560.,  38340.]])

In my build this produces different results depending on whether out or .astype is used:

>>> np.einsum("ij,jk", a, b, dtype='f8', casting='unsafe').astype(dtype)
array([[  6090,   6195,   6300,   6405,   6510,   6615],
       [ 15540,  15870,  16200,  16530,  16860,  17190],
       [ 24990,  25545,  26100,  26655,  27210,  27765],
       [-31096, -30316, -29536, -28756, -27976, -27196]], dtype=int16)

>>> np.einsum("ij,jk", a, b, out=c, dtype='f8', casting='unsafe')
array([[  6090,   6195,   6300,   6405,   6510,   6615],
       [ 15540,  15870,  16200,  16530,  16860,  17190],
       [ 24990,  25545,  26100,  26655,  27210,  27765],
       [-32768, -32768, -32768, -32768, -32768, -32768]], dtype=int16)

The test wants these (actually the same using numpy.dot) to be equal, so this difference causes it to fail. Both ways to handle overflow seem reasonable to me.

Does numpy in general assign a defined behavior to integer overflow (e.g. two’s complement)?

Is this use of integer overflow in the test intentional and is expected to work, or is my build broken?
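
To make the two outcomes concrete, here is a small sketch that computes
the exact product in 64-bit integers and then applies two's complement
wrap-around by hand. The helper wrap_int16 is hypothetical. The wrapped
values match the .astype output quoted above; the other output equals
the int16 minimum, and reading that as a saturating or
platform-dependent float-to-integer conversion is an assumption, not
something the message confirms.

```python
import numpy as np

n = 15
# Same operands as the test, held in 64-bit integers so nothing overflows.
a = np.arange(4 * n, dtype=np.int64).reshape(4, n)
b = np.arange(n * 6, dtype=np.int64).reshape(n, 6)
exact = a @ b

info = np.iinfo(np.int16)
too_big = exact > info.max   # marks entries that do not fit into i2

def wrap_int16(value):
    """Reduce an integer modulo 2**16 into the signed 16-bit range."""
    return (int(value) + 2**15) % 2**16 - 2**15

wrapped_last_row = [wrap_int16(v) for v in exact[3]]
print(exact[3])
print(wrapped_last_row)
print(info.min)
```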

 

Best regards,
Eugene


Sebastian Berg | 11 May 2013 17:41

slight MapIter change

Hey,

(this is only interesting if you know what MapIter is and actually use it)

In case anyone already uses the newly exposed mapiter (it has not been
released yet): there is a tiny change, which only affects indexes that
start with np.newaxis but otherwise just simplifies things a tiny bit. The old
block for swapping axes should be changed like this:

     if ((mit->subspace != NULL) && (mit->consec)) {
-        if (mit->iteraxes[0] > 0) {
-            PyArray_MapIterSwapAxes(mit, (PyArrayObject **)&arr, 0);
-            if (arr == NULL) {
-                return -1;
-            }
+        PyArray_MapIterSwapAxes(mit, (PyArrayObject **)&arr, 0);
+        if (arr == NULL) {
+            return -1;
         }
     }

Regards,

Sebastian
