Dr. Leo | 29 Sep 16:15 2014

Attaching metadata to dataframes

Hi,

I know: this has been debated before, and there was a PR at

https://github.com/pydata/pandas/pull/2695

Thus, if I want to tell the user that the unit is "thousand tons" or
light-years, rather than tons or inches, do I just bluntly set some attribute
like:

df.meta = {'unit': 'thd_tons'}

knowing that it will not be preserved by operations returning a new DataFrame?
Or have I missed any brilliant idea?
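
For concreteness, this is what I mean (a minimal sketch; 'meta' is just an arbitrary attribute name, not a pandas feature):

import pandas as pd

df = pd.DataFrame({'production': [1.2, 3.4]})
df.meta = {'unit': 'thd_tons'}      # plain instance attribute

df2 = df * 2                        # any operation returning a new object
print(hasattr(df2, 'meta'))         # False: the metadata is silently lost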

Leo


Seth P | 28 Sep 17:53 2014

append_to_multiple() and dropna

I'm a bit confused by the discussion in http://pandas.pydata.org/pandas-docs/stable/io.html#performance-considerations of the dropna argument to append_to_multiple():

The argument dropna will drop rows from the input DataFrame to ensure tables are synchronized. This means that if a row for one of the tables being written to is entirely np.NaN, that row will be dropped from all tables.

If dropna is False, THE USER IS RESPONSIBLE FOR SYNCHRONIZING THE TABLES. Remember that entirely np.Nan rows are not written to the HDFStore, so if you choose to call dropna=False, some tables may have more rows than others, and therefore select_as_multiple may not work or it may return unexpected results.

Wouldn't the tables risk having their indices get out of sync if dropna is True? That is, suppose in a given row all the entries in the columns of one table are NaN (so the row isn't written to that table), but the entries in some columns of the other table are not (so the row is written to that table). I would think that if dropna is False then they would necessarily remain in sync. Or am I misunderstanding how this works?
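
To make the case concrete, here is roughly the setup I have in mind (an untested sketch; the store path and table names are placeholders):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 4), columns=list('ABCD'))
df.loc[2, ['A', 'B']] = np.nan      # row 2 is entirely NaN in the first table's columns only

store = pd.HDFStore('example.h5')
store.append_to_multiple({'t1': ['A', 'B'], 't2': None},   # t1 gets A/B, t2 gets the rest
                         df, selector='t1', dropna=True)

# Is row 2 now dropped from both tables, or only missing from t1, leaving t2 one
# row longer and select_as_multiple out of sync?
print(store.select('t1').shape, store.select('t2').shape)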



Seth P | 27 Sep 18:56 2014

{Series,DataFrame,..}.astype(bool) converts NaN values to True

As mentioned in the subject, {Series,DataFrame,..}.astype(bool) converts NaN values to True. I realize that bool(NaN) is True, so there's obvious consistency there. However, my intuition, especially when using a container of bools as a mask, would be that NaN values should convert to False. Perhaps this is one of those cases where the pandas treatment of NaN should differ from numpy's?
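
A minimal illustration of the behavior in question:

import numpy as np
import pandas as pd

s = pd.Series([0.0, 1.0, np.nan])
print(s.astype(bool))
# 0    False
# 1     True
# 2     True    <- the NaN becomes True, consistent with bool(np.nan) but surprising in a mask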

Here are some related discussions, though none seems to explicitly address what the desired treatment of NaNs (or Nones) by .astype(bool) should be:
https://groups.google.com/d/msg/pydata/pOz9LCx3JF0/selM28IIbCsJ
https://github.com/pydata/pandas/issues/6528
https://github.com/pydata/pandas/pull/8151

Apologies if there have been other discussions on the topic that I've missed.

Nick Schultz | 27 Sep 02:01 2014

pd.read_csv(compression='gzip') fails with HTTP URL

I'm thinking there is a problem with pandas when trying to read a gzip'd CSV file via an HTTP URL. Loading gzip'd CSVs from the filesystem works fine, so I'm thinking it's something specific to reading over HTTP. Can anybody reproduce the outcomes below? What could be some possible workarounds?

Thanks,

Nick

uncompressed CSV:

import pandas as pd

filename = r'http://samplecsvs.s3.amazonaws.com/SalesJan2009.csv'
df = pd.read_csv(filename)
print(df.shape)

Output:
(998, 12)

with gzip'd CSV:

import pandas as pd

filename = r'http://nodestreams.com/input/people.csv.gz'
df = pd.read_csv(filename, compression='gzip')
print(df.shape)

Output:
Traceback (most recent call last):
  File "/nfs/site/home/nschultz/mydisk4/web/test.py", line 33, in <module>
    df = pd.read_csv(filename, compression='gzip')
  File "/nfs/fm/disks/fm_cse_05026/nschultz/python/lib/python3.3/site-packages/pandas-0.14.1-py3.3-linux-x86_64.egg/pandas/io/parsers.py", line 452, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/nfs/fm/disks/fm_cse_05026/nschultz/python/lib/python3.3/site-packages/pandas-0.14.1-py3.3-linux-x86_64.egg/pandas/io/parsers.py", line 234, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/nfs/fm/disks/fm_cse_05026/nschultz/python/lib/python3.3/site-packages/pandas-0.14.1-py3.3-linux-x86_64.egg/pandas/io/parsers.py", line 542, in __init__
    self._make_engine(self.engine)
  File "/nfs/fm/disks/fm_cse_05026/nschultz/python/lib/python3.3/site-packages/pandas-0.14.1-py3.3-linux-x86_64.egg/pandas/io/parsers.py", line 679, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/nfs/fm/disks/fm_cse_05026/nschultz/python/lib/python3.3/site-packages/pandas-0.14.1-py3.3-linux-x86_64.egg/pandas/io/parsers.py", line 1041, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "parser.pyx", line 485, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:4413)
  File "parser.pyx", line 600, in pandas.parser.TextReader._get_header (pandas/parser.c:5649)
  File "parser.pyx", line 791, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:7599)
  File "parser.pyx", line 1699, in pandas.parser.raise_parser_error (pandas/parser.c:19062)
pandas.parser.CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.



with gzip'd CSV (engine='python'):

import pandas as pd

filename = r'http://nodestreams.com/input/people.csv.gz'
df = pd.read_csv(filename, compression='gzip', engine='python')
print(df.shape)

Output:
Traceback (most recent call last):
  File "/nfs/site/home/nschultz/mydisk4/web/test.py", line 33, in <module>
    df = pd.read_csv(filename, compression='gzip', engine= 'python')
  File "/nfs/fm/disks/fm_cse_05026/nschultz/python/lib/python3.3/site-packages/pandas-0.14.1-py3.3-linux-x86_64.egg/pandas/io/parsers.py", line 452, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/nfs/fm/disks/fm_cse_05026/nschultz/python/lib/python3.3/site-packages/pandas-0.14.1-py3.3-linux-x86_64.egg/pandas/io/parsers.py", line 234, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/nfs/fm/disks/fm_cse_05026/nschultz/python/lib/python3.3/site-packages/pandas-0.14.1-py3.3-linux-x86_64.egg/pandas/io/parsers.py", line 542, in __init__
    self._make_engine(self.engine)
  File "/nfs/fm/disks/fm_cse_05026/nschultz/python/lib/python3.3/site-packages/pandas-0.14.1-py3.3-linux-x86_64.egg/pandas/io/parsers.py", line 685, in _make_engine
    self._engine = klass(self.f, **self.options)
  File "/nfs/fm/disks/fm_cse_05026/nschultz/python/lib/python3.3/site-packages/pandas-0.14.1-py3.3-linux-x86_64.egg/pandas/io/parsers.py", line 1373, in __init__
    self.columns, self.num_original_columns = self._infer_columns()
  File "/nfs/fm/disks/fm_cse_05026/nschultz/python/lib/python3.3/site-packages/pandas-0.14.1-py3.3-linux-x86_64.egg/pandas/io/parsers.py", line 1587, in _infer_columns
    line = self._buffered_line()
  File "/nfs/fm/disks/fm_cse_05026/nschultz/python/lib/python3.3/site-packages/pandas-0.14.1-py3.3-linux-x86_64.egg/pandas/io/parsers.py", line 1713, in _buffered_line
    return self._next_line()
  File "/nfs/fm/disks/fm_cse_05026/nschultz/python/lib/python3.3/site-packages/pandas-0.14.1-py3.3-linux-x86_64.egg/pandas/io/parsers.py", line 1738, in _next_line
    orig_line = next(self.data)
  File "/usr/intel/pkgs/python/3.3.2/lib/python3.3/gzip.py", line 393, in read1
    self._read()
  File "/usr/intel/pkgs/python/3.3.2/lib/python3.3/gzip.py", line 441, in _read
    self._read_gzip_header()
  File "/usr/intel/pkgs/python/3.3.2/lib/python3.3/gzip.py", line 285, in _read_gzip_header
    magic = self.fileobj.read(2)
  File "/usr/intel/pkgs/python/3.3.2/lib/python3.3/gzip.py", line 93, in read
    self.file.read(size-self._length+read)
TypeError: can't concat bytes to str
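
One workaround I'm considering (an untested sketch, assuming the problem is only in how the URL/gzip combination is handled and not in the CSV parsing itself): download and decompress the bytes manually, then hand pandas an in-memory buffer.

import gzip
import io
import urllib.request

import pandas as pd

url = r'http://nodestreams.com/input/people.csv.gz'
raw = urllib.request.urlopen(url).read()    # fetch the gzip'd bytes
buf = io.BytesIO(gzip.decompress(raw))      # decompress in memory
df = pd.read_csv(buf)                       # parse the plain CSV
print(df.shape)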



 

Ivan Ogassawara | 26 Sep 16:03 2014

Executing some function on IPython is slower than a normal python function

Dear all,

I'm testing some functionality of IPython and I think I'm doing something wrong.

I'm testing 3 different ways to execute some math operation.

  • 1st using @parallel.parallel(view=dview, block=True) and the map function
  • 2nd using a plain single-core Python function
  • 3rd using the client's load-balanced view

the code is here: https://stackoverflow.com/questions/26039254/executing-some-function-on-ipython-is-slower-than-a-normal-python-function
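
In short, the comparison boils down to something like this (simplified here; the function and data are placeholders, the full code is in the SO post):

from IPython import parallel

rc = parallel.Client()
dview = rc[:]                       # direct view over the 8 engines
lview = rc.load_balanced_view()     # load-balanced view

def f(x):
    return x * x

data = list(range(10000))

res_multi = dview.map_sync(f, data)        # 1st: parallel map over the direct view
res_single = list(map(f, data))            # 2nd: plain single-core map
res_lb = lview.map(f, data, block=True)    # 3rd: load-balanced map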

My result is:

True True
0.040741 secs (multicore)
0.004004 secs (singlecore)
1.286592 secs (multicore_load_balance)

Why are my multicore routines slower than my single core routine? What is wrong with this approach? What can I do to fix it?

Some environment information: Python 3.4.1, IPython 2.2.0, NumPy 1.9.0, ipcluster starting 8 engines with LocalEngineSetLauncher.


My best regards,

Jason Sachs | 26 Sep 02:45 2014

pandas Series and Index do not maintain dtype?

If I do this:

import numpy as np
import pandas as pd

t = np.arange(0, 1, 0.001)
print t.dtype
y = t * t
S = pd.Series(data=y, index=t)
print S.index.values.dtype
idx2 = pd.Index(data=t, dtype=np.float)
print idx2.values.dtype


I get:

float64
object
object

Why don't Series.index.values and Index.values have dtype float64 when the index wraps data that is a numpy array with dtype=float64?

I'm using pandas 0.12 and can't easily upgrade at the moment, so I apologize if this has been addressed in 0.13 or 0.14.
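
(In case it helps to frame the question: the underlying data itself seems recoverable as float64, so this looks like an Index dtype issue rather than data loss. A quick check on the objects defined above:)

print S.index.values.astype(np.float64).dtype      # float64
print np.asarray(S.index, dtype=np.float64).dtype  # float64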

Damian Avila | 26 Sep 05:34 2014

ANN: Bokeh 0.6.1 release

On behalf of the Bokeh team, I am very happy to announce the release of Bokeh version 0.6.1!

Bokeh is a Python library for visualizing large and realtime datasets on the web. Its goal is to provide developers (and domain experts) with the capability to easily create novel and powerful visualizations that extract insight from local or remote (possibly large) data sets, and to easily publish those visualizations to the web for others to explore and interact with.

This point release includes several bug fixes and improvements over our most recent 0.6.0 release:

* Toolbar enhancements
* bokeh-server fixes
* Improved documentation
* Button widgets
* Google Maps support on the Python side
* Code cleanup on the JS side and in the examples
* New examples

See the CHANGELOG for full details.

In upcoming releases, you should expect to see more new layout capabilities (colorbar axes, better grid plots and improved annotations), additional tools, even more widgets and more charts, R language bindings, Blaze integration and cloud hosting for Bokeh apps.

Don't forget to check out the full documentation, interactive gallery, and tutorial at


as well as the Bokeh IPython notebook nbviewer index (including all the tutorials) at:


If you are using Anaconda or miniconda, you can install with conda:

    conda install bokeh

Alternatively, you can install with pip:

    pip install bokeh

BokehJS is also available by CDN for use in standalone javascript applications:


Issues, enhancement requests, and pull requests can be made on the Bokeh Github page:


Questions can be directed to the Bokeh mailing list.

If you have interest in helping to develop Bokeh, please get involved!

Cheers,


Damián Avila

Kynn Jones | 24 Sep 16:43 2014

DataFrame constructor produces all NaN's when given columns argument

[NOTE: I originally posted this question in StackOverflow, but it got no answers, so
I'm reposting it here.  To avoid cross-posting, I deleted the original SO question.]

The purpose of this question is to find out whether the behavior illustrated below is a
bug or not, and if it is not a bug, to find out where it is explained in the
Pandas documentation.

Here's a toy example to illustrate the behavior in question.  First, I create a
simple dataframe:

    import pandas as pd
    import collections as co

    data = [['a',  1],
            ['b',  2],
            ['a',  3]]

    colnames = tuple('XY')

    df = pd.DataFrame(co.OrderedDict([(colnames[i],
                                       [row[i] for row in data])
                                      for i in range(len(colnames))]))

The dataframe `df` looks like this:

    In [577]: df
    Out[577]:
       X  Y
    0  a  1
    1  b  2
    2  a  3

Here's a 1-column dataframe produced from one of the columns in `df`:

    In [578]: df.iloc[:, [1]]
    Out[578]:
       Y
    0  1
    1  2
    2  3

But I want the single column to have a different name.  I thought I could achieve
this by passing the last dataframe above to the `DataFrame` constructor,
along with a suitable setting for its `columns` argument, but this produces a
dataframe consisting entirely of `NaN` values:

    In [579]: pd.DataFrame(df.iloc[:, [1]], columns=['Z'])
    Out[579]:
        Z
    0 NaN
    1 NaN
    2 NaN

If I omit the `columns=['Z']` argument, the contents of the resulting dataframe,
once again, correspond to those of the input dataframe:

    In [580]: pd.DataFrame(df.iloc[:, [1]])
    Out[580]:
       Y
    0  1
    1  2
    2  3

(Of course, the last dataframe does not have the desired column name, but the last
output at least shows that the first argument to the constructor is not screwed up somehow.)

To reiterate the clarification made at the beginning of this post, the purpose of this
question is to determine whether the behavior described above is a
bug or not, and if it is not, to find out where it is explained in
the Pandas documentation.

(In particular, the purpose of this question is *not* to find out how to generate
the dataframe described in the example.  I know ways to do this.)
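
One more data point that may be relevant (a quick check on the same toy frame; it only illustrates the behavior, it does not explain it): if the name passed via `columns` *does* exist in the input dataframe, the values come through unchanged, which suggests that `columns` acts as a column selector rather than a relabelling when the first argument is already a dataframe:

    pd.DataFrame(df.iloc[:, [1]], columns=['Y'])   # same values as Out[578]
    pd.DataFrame(df.iloc[:, [1]], columns=['Z'])   # all NaN, as in Out[579]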

Anthony O' Brien | 20 Sep 02:57 2014

PR #7619: DataFrame memory usage via df.info()

PR #7619 addresses issue #6852.

Specifically, the memory usage of a DataFrame is accessible via the `info` method and displays like so:

import pandas as pd
from numpy.random import randint

df = pd.DataFrame({"col1": randint(100, size=1000),
                   "col2": randint(100, size=1000),
                   "col3": randint(1000, size=1000)})

df.info(memory_usage=True)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 3 columns):
col1    1000 non-null int64
col2    1000 non-null int64
col3    1000 non-null int64
dtypes: int64(3)
memory usage: 31.2 KB

The `df.info()` method takes an argument `memory_usage` which explicitly specifies whether or not to show the memory usage (as above). The default value of `memory_usage` is None, in which case the behavior is decided by the `display.memory_usage` setting (see Options and Settings).
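
For concreteness, here is how the two ways of enabling it would look (the per-call argument is the one from the PR; the option name is the `display.memory_usage` setting described above):

import pandas as pd

# explicit, per call:
df.info(memory_usage=True)

# proposed global default via the options mechanism:
pd.set_option('display.memory_usage', True)
df.info()   # would now include the "memory usage: ..." line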

Prior to merging this new functionality, some of the pandas developers were curious to get an idea of what the community thinks the default behavior for `display.memory_usage` should be. Would you like to see the memory usage of a DataFrame shown by default (in v0.15.0+)?

Please let me know if you have thoughts as to whether the memory usage should be displayed by default or not via the df.info() method.

Fabian Braennstroem | 23 Sep 20:24 2014

read_html tree-like table structure

Hello all,

I am trying to extract data from a simulation setting file. The picture below gives a short extract.

Reading it with read_html gives no error message and results in something like the text copy at the end of this message.
My current idea is to add some grouping on different levels based on column 0.
So, e.g., there should be a group 'Continua' with its subsettings.
In the end it should be possible to filter the table by the values of the first and second columns; e.g. I would like to extract the value of "Begin" (index 12) from the "Eddy Break-up" group, which belongs to "Models".
Maybe I need to create an additional column for each subsetting level, but I am a bit clueless right now about how to structure the available information (a rough sketch of this idea follows the text copy below).

It would be nice if someone has a suggestion!
Best Regards
Fabian

Here is the direct text copy of the output:

                                     0                               1   2
0            1 V2_M03_star_fine_smooth                             NaN NaT
1                            +-1 Parts                             NaN NaT
2                         +-2 Contacts                             NaN NaT
3                    +-3 3D-CAD Models                             NaN NaT
4                          +-4 Filters                             NaN NaT
5                             +-5 Tags                             NaN NaT
6                         +-6 Continua                        Continua NaT
7                       | +-1 GasPhase                         Regions NaT
8                                | | |                      Interfaces NaT
9                       | | +-1 Models                             NaN NaT
10  | | | +-1 Cell Quality Remediation                             NaN NaT
11             | | | +-2 Eddy Break-up          Source Enabled Trigger NaT
12                             | | | |                           Begin NaT
13                             | | | |            Source Term Limiting NaT
14                             | | | |            Store Reaction Rates NaT
15                             | | | |                Reaction Control NaT
16                 | | | +-3 Gradients                 Gradient Method NaT
17                             | | | |                  Limiter Method NaT
18                             | | | |  Custom Accuracy Level Selector NaT
19                             | | | |                         Verbose NaT
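
Here is the rough sketch mentioned above (untested; it assumes the read_html result is in a frame called `table` with integer columns 0/1/2 as shown, and that every new node in column 0 looks like '| | +-2 Eddy Break-up', where the number of '|' characters plus the '+-' marker gives the nesting depth and the text after '+-<n> ' is the node name):

import re

def add_group_columns(df, label_col=0, max_levels=5):
    # walk the ASCII tree in label_col and record, for every row, which node
    # is currently active at each nesting level
    path = [None] * max_levels
    level_values = [[] for _ in range(max_levels)]
    for raw in df[label_col].astype(str):
        m = re.search(r'(?:\+-)?\d+\s+(.*)', raw)
        if m:                                        # this row introduces a new node
            depth = raw.count('|') + ('+-' in raw)   # root = 0, '+-1 Parts' = 1, ...
            path[depth] = m.group(1).strip()
            for d in range(depth + 1, max_levels):   # deeper levels are stale now
                path[d] = None
        # rows like '| | | |' introduce nothing and simply inherit the current path
        for d in range(max_levels):
            level_values[d].append(path[d])
    for d in range(max_levels):
        df['level_%d' % d] = level_values[d]
    return df

tree = add_group_columns(table.copy())
# e.g. the 'Begin' row (index 12) inside the 'Eddy Break-up' group under 'Models':
sub = tree[(tree['level_3'] == 'Models') &
           (tree['level_4'] == 'Eddy Break-up') &
           (tree[1] == 'Begin')]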



'Michael' via PyData | 23 Sep 15:59 2014

Does df.query() have some identifier for df.index to use?

I want to know if there's something like:

df.query('a > b and _index > 50')   # _index = df.index


Or, asking in another way: how can I use the index when filtering with df.query()?
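
From what I can tell from the docs on query/eval (I haven't fully verified this), an unnamed index can be referred to simply as index inside the expression, and a named index by its name:

df.query('a > b and index > 50')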


