BUG: Error when creating sparse dataframe with nan column label by artemyk · Pull Request #8822 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation21 Commits1 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

artemyk

Right now the following raises an exception:

from pandas import DataFrame, Series
from numpy import nan
nan_colname = DataFrame(Series(1.0,index=[0]),columns=[nan])
nan_colname_sparse = nan_colname.to_sparse()

This is because sparse dataframes use a dictionary to store information about columns, with the column label as the key. nan's do not equal themselves and create problems as dictionary keys. This avoids the issue by uses a dataframe to store this information.

@jreback

@artemyk

@jreback, I totally agree with you. Except that this is not allowed in DataFrame now -- you can create a DataFrame with nan columns, but not a SparseDataFrame.

However, I am working on a get_dummies() that returns a sparse dataframe instead of dense.
There is a test of get_dummies which creates a dataframe with a nan column name (see last part of test_include_na in pandas/tests). So I wrote this in order to pass that test.

I would argue that if this shouldn't be done, it should be explicitly forbidden. E.g. so that people don't write tests creating using DataFrames with nan column names.

@jreback

it can be done. just not like you are doing it. You can do it from a dict only (or set the columns explicity). Its implicity allowed, even though if you try to index with more than one nan in an index it will fail.

I know their are some tests / behavior which allow this. Its a problem. Not sure how much work it would be to either not disallow it. I agree with your sentiments.

So not allowing expansion of this (have nixed a couple of pr's that tried to do this). it is a can of worms.

@artemyk

OK, I will soon commit a pull request that fails a test due to this. Maybe we can discuss it further there. The issue can probably be avoided with a simple test for this edge case.

@jreback

@artemyk

@jreback This might be necessary after all, for the relatively normal following case:

>>> print pd.get_dummies(pd.Series([1,2,np.nan]), dummy_na=True)

    1    2   NaN
0    1    0    0
1    0    1    0
2    0    0    1

@jreback

that's an ok case
lol and see how it is constructed

@artemyk

@jreback Not sure how to get sparse get_dummies with dummy_na without this.
BTW, I realize SparseDataFrame doesn't store columns in dictionaries, but during one part of its creation, it does store column information in a dict w. column name as key (that it then passes to dict_to_manager).
I suppose another variation would be to create this dict using (True, None) as the key for np.nan-named columns, and (False, colname) as the key for non-np.nan-named columns. Not sure if this is better than using a regular DataFrame (I kind of like that this enforces that if a column name can be a valid column in a DataFrame, it works here, and if not, it should fail here also).

@jreback

the usual way to do this is to stringify the nans, e.g. use 'nan' then subsitute it out at the end. Much simpler that way (or 2 be honest prob better), if a bit tricker on access.

@artemyk

But then if one of the values is 'nan' (in string form), then np.nan's and 'nan's would get confused. I think a

def safe_key(val):
  isnan = np.isnan(val)
  return (isnan, val if not isnan else None)

would be more robust.

@artemyk

@jreback I'm wondering how to proceed with this. The basic issue is that w/o this PR, the following raises an error:

import pandas as pd
import numpy as np
pd.get_dummies(pd.Series([1,2,np.nan]), dummy_na=True).to_sparse()

What are your thoughts? Keep as is? Or merge, but using dicts instead of a dataframe to store column information? (and if so --- because it's more lightweight?)

@jreback

@artemyk I am excited you are working on this. As we need a person really interested in sparese!

I haven't had a chance to look at how best to do this. Will get back to you next week.

@artemyk

@jreback

@artemyk can you rebase this and I'll take a look......

@artemyk

@artemyk

jreback

@@ -1663,6 +1663,11 @@ def test_as_blocks(self):
self.assertEqual(list(df_blocks.keys()), ['float64'])
assert_frame_equal(df_blocks['float64'], df)
def test_nan_columnname(self):
nan_colname = DataFrame(Series(1.0,index=[0]),columns=[nan])

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue as a comment here

@jreback

pls add a release note (use this PR number as the issue number)

Support for nan columns

Fix

Trigger Travis CI

jreback fixes

Release note update

@artemyk

jreback added a commit that referenced this pull request

Apr 11, 2015

@jreback

BUG: Error when creating sparse dataframe with nan column label

@jreback

@kernc kernc mentioned this pull request

Jul 12, 2017

4 tasks

2 participants

@artemyk @jreback