BUG: Error when creating sparse dataframe with nan column label by artemyk · Pull Request #8822 · pandas-dev/pandas
Conversation
Right now the following raises an exception:
from pandas import DataFrame, Series
from numpy import nan

nan_colname = DataFrame(Series(1.0, index=[0]), columns=[nan])
nan_colname_sparse = nan_colname.to_sparse()
This is because sparse dataframes use a dictionary to store information about columns, with the column label as the key. nans do not equal themselves and create problems as dictionary keys. This avoids the issue by using a DataFrame to store this information.
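A minimal sketch (not from the PR) of why NaN labels are awkward as dict keys: NaN never compares equal to itself, so a lookup only succeeds when the very same NaN object is reused, because dicts short-circuit on identity before testing equality.

import numpy as np

d = {np.nan: "value"}        # np.nan is a single module-level float object
print(np.nan == np.nan)      # False: NaN never equals itself
print(d[np.nan])             # "value": same object, found by identity
try:
    d[float("nan")]          # a *different* NaN object
except KeyError:
    print("KeyError: a distinct NaN key is never found by equality")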
- why would you ever want to actually have a nan as a column label?
- nans cause many many issues in indexes. They are technically allowed, but highly discouraged.
- the columns are not stored in a dictionary; a SparseDataFrame is very much like a regular DataFrame internally.
- this is not allowed in a DataFrame now, why should SparseDataFrame support it?
- you would need a whole battery of tests for this to be considered, e.g. sdf[nan] will simply fail.
@jreback, I totally agree with you, except on the point that this is not allowed in DataFrame now -- you can create a DataFrame with nan columns, but not a SparseDataFrame.
However, I am working on a get_dummies() that returns a sparse dataframe instead of a dense one. There is a test of get_dummies which creates a dataframe with a nan column name (see the last part of test_include_na in pandas/tests). So I wrote this in order to pass that test.
I would argue that if this shouldn't be done, it should be explicitly forbidden, e.g. so that people don't write tests that create DataFrames with nan column names.
It can be done, just not like you are doing it. You can do it from a dict only (or set the columns explicitly). It's implicitly allowed, even though if you try to index with more than one nan in an index it will fail.
I know there are some tests / behavior which allow this. It's a problem. Not sure how much work it would be to disallow it. I agree with your sentiments.
So I'm not allowing expansion of this (I have nixed a couple of PRs that tried to do this). It is a can of worms.
OK, I will soon commit a pull request that fails a test due to this. Maybe we can discuss it further there. The issue can probably be avoided with a simple test for this edge case.
@jreback This might be necessary after all, for the relatively normal following case:
>>> print pd.get_dummies(pd.Series([1,2,np.nan]), dummy_na=True)
   1  2  NaN
0  1  0    0
1  0  1    0
2  0  0    1
that's an ok case
lol and see how it is constructed
@jreback Not sure how to get sparse get_dummies with dummy_na without this.
BTW, I realize SparseDataFrame doesn't store columns in dictionaries, but during one part of its creation it does store column information in a dict with the column name as the key (that it then passes to dict_to_manager).
I suppose another variation would be to create this dict using (True, None) as the key for np.nan-named columns, and (False, colname) as the key for non-np.nan-named columns. Not sure if this is better than using a regular DataFrame (I kind of like that this enforces that if a column name can be a valid column in a DataFrame, it works here, and if not, it should fail here also).
The usual way to do this is to stringify the nans, e.g. use 'nan', then substitute it out at the end. Much simpler that way (or, to be honest, probably better), if a bit trickier on access.
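A rough sketch of that stringify-then-substitute idea (not from the PR; the placeholder name is made up):

import numpy as np
import pandas as pd

# Replace the NaN column label with a placeholder string before going sparse;
# the placeholder then has to be translated back on access, which is the
# "trickier" part mentioned above.
df = pd.get_dummies(pd.Series([1, 2, np.nan]), dummy_na=True)
df.columns = ['__nan__' if (isinstance(c, float) and np.isnan(c)) else c
              for c in df.columns]
sdf = df.to_sparse()  # no NaN labels remain, so the dict-keyed construction path is safe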
But then if one of the values is 'nan' (in string form), np.nan's and 'nan's would get confused. I think something like
def safe_key(val):
    # Only genuine float NaNs collapse to a common key; the string 'nan' stays distinct.
    isnan = isinstance(val, float) and np.isnan(val)
    return (isnan, None if isnan else val)
would be more robust.
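For example (illustrative, assuming the safe_key helper above):

import numpy as np

print(safe_key(np.nan) == safe_key(float("nan")))  # True: any float NaN maps to (True, None)
print(safe_key("nan") == safe_key(np.nan))         # False: the string 'nan' stays separate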
@jreback I'm wondering how to proceed with this. The basic issue is that w/o this PR, the following raises an error:
import pandas as pd
import numpy as np
pd.get_dummies(pd.Series([1,2,np.nan]), dummy_na=True).to_sparse()
What are your thoughts? Keep as is? Or merge, but using dicts instead of a DataFrame to store column information (and if so, would that be because it's more lightweight)?
@artemyk I am excited you are working on this, as we need a person really interested in sparse!
I haven't had a chance to look at how best to do this. Will get back to you next week.
@artemyk can you rebase this and I'll take a look......
@@ -1663,6 +1663,11 @@ def test_as_blocks(self):
        self.assertEqual(list(df_blocks.keys()), ['float64'])
        assert_frame_equal(df_blocks['float64'], df)

    def test_nan_columnname(self):
        nan_colname = DataFrame(Series(1.0,index=[0]),columns=[nan])
add the issue as a comment here
pls add a release note (use this PR number as the issue number)
Support for nan columns
Fix
Trigger Travis CI
jreback fixes
Release note update
jreback added a commit that referenced this pull request
BUG: Error when creating sparse dataframe with nan column label
kernc mentioned this pull request