Method dropna does not work on SparseDataFrames · Issue #21172 · pandas-dev/pandas

The dropna method may return a wrong result on a SparseDataFrame. The following code

import pandas as pd

pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [float('nan'), 0]}).dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [None, 0]}).dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1], "F3": [float('nan'), 0]}).dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1]}).dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1]}).dropna(axis=1, inplace=False, how='all')

pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [float('nan'), 0]}).to_dense().dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [None, 0]}).to_dense().dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1], "F3": [float('nan'), 0]}).to_dense().dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1]}).to_dense().dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1]}).to_dense().dropna(axis=1, inplace=False, how='all')

outputs

import pandas as pd

print(pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [float('nan'), 0]}).dropna(axis=1, inplace=False, how='all'))
    F1  F2
0  NaN   0
1  NaN   1

print(pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [None, 0]}).dropna(axis=1, inplace=False, how='all'))
    F1  F2
0  NaN   0
1  NaN   1

print(pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1], "F3": [float('nan'), 0]}).dropna(axis=1, inplace=False, how='all'))
    F1  F2
0  NaN   0
1  NaN   1

print(pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1]}).dropna(axis=1, inplace=False, how='all'))
    F1
0  NaN
1  NaN

print(pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1]}).dropna(axis=1, inplace=False, how='all'))
    F1
0  NaN
1  NaN

print(pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [float('nan'), 0]}).to_dense().dropna(axis=1, inplace=False, how='all'))
   F2   F3
0   0  NaN
1   1  0.0

print(pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [None, 0]}).to_dense().dropna(axis=1, inplace=False, how='all'))
   F2   F3
0   0  NaN
1   1  0.0

print(pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1], "F3": [float('nan'), 0]}).to_dense().dropna(axis=1, inplace=False, how='all'))
   F2   F3
0   0  NaN
1   1  0.0

print(pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1]}).to_dense().dropna(axis=1, inplace=False, how='all'))
   F2
0   0
1   1

print(pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1]}).to_dense().dropna(axis=1, inplace=False, how='all'))
   F2
0   0
1   1

Problem description

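The dropna method behaves differently on a SparseDataFrame than on the equivalent dense DataFrame. In the first batch of commands the all-NaN column F1 is never dropped, and a column that holds valid data is dropped instead (F3 in the first three examples, F2 in the last two). The second batch, which calls to_dense() before dropna, shows the correct behaviour.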
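Until the sparse path is fixed, the second batch suggests a workaround: convert to a dense frame before dropping columns. The following is only a minimal sketch of that idea. The helper name drop_all_nan_columns is hypothetical (it is not part of pandas), and the sketch assumes the data is small enough that a dense copy is acceptable.

```python
import pandas as pd


def drop_all_nan_columns(frame):
    """Drop columns whose values are all missing, working on a dense copy.

    Hypothetical helper for illustration; not part of pandas.
    """
    # SparseDataFrame (pandas 0.23) exposes to_dense(). A plain DataFrame on
    # newer pandas versions may not, so fall back to the frame itself.
    dense = frame.to_dense() if hasattr(frame, "to_dense") else frame
    return dense.dropna(axis=1, how="all")


# Usage with a plain DataFrame holding the data from the first example.
frame = pd.DataFrame({"F1": [None, None], "F2": [0, 1], "F3": [float("nan"), 0]})
result = drop_all_nan_columns(frame)
print(list(result.columns))
```

The result is a dense frame, so any memory savings from the sparse representation are lost for the returned object.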

Expected Output

   F2   F3
0   0  NaN
1   1  0.0

   F2   F3
0   0  NaN
1   1  0.0

   F2   F3
0   0  NaN
1   1  0.0

   F2
0   0
1   1

   F2
0   0
1   1
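The five expected frames above keep exactly the columns that contain at least one valid value. As a rough sketch, the same expectation can be written as a table of inputs and surviving column names. It is checked here against a plain dense DataFrame so that it also runs on pandas versions without SparseDataFrame. The names cases and surviving_columns are illustrative only.

```python
import pandas as pd

nan = float("nan")

# Each entry pairs the input data with the columns expected to survive
# dropna(axis=1, how="all"), as listed in the expected output.
cases = [
    ({"F1": [None, None], "F2": [0, 1], "F3": [nan, 0]}, ["F2", "F3"]),
    ({"F1": [None, None], "F2": [0, 1], "F3": [None, 0]}, ["F2", "F3"]),
    ({"F1": [nan, nan], "F2": [0, 1], "F3": [nan, 0]}, ["F2", "F3"]),
    ({"F1": [None, None], "F2": [0, 1]}, ["F2"]),
    ({"F1": [nan, nan], "F2": [0, 1]}, ["F2"]),
]


def surviving_columns(data):
    """Return the column names left after dropping all-NaN columns."""
    return list(pd.DataFrame(data).dropna(axis=1, how="all").columns)


# Collect every case whose surviving columns differ from the expectation.
mismatches = [
    (data, expected)
    for data, expected in cases
    if surviving_columns(data) != expected
]
print("mismatches:", len(mismatches))
```

Swapping pd.DataFrame for pd.SparseDataFrame in surviving_columns on pandas 0.23.0 would turn this into a check of the sparse path itself.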

Output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64  
OS: Linux       
OS-release: 4.15.0-20-generic
machine: x86_64
processor:
byteorder: little                                                                
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8
                                                                                 
pandas: 0.23.0
pytest: 3.5.0
pip: 9.0.3
setuptools: 39.0.1
Cython: 0.28.2                                                                   
numpy: 1.14.3
scipy: 1.0.1
pyarrow: None
xarray: None                                                                     
IPython: 6.3.1
sphinx: None
patsy: None
dateutil: 2.7.2                                                                  
pytz: 2018.4
blosc: None
bottleneck: None
tables: None                                                                     
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: None                                                                   
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1       
sqlalchemy: 1.2.7
pymysql: None     
psycopg2: None    
jinja2: 2.10
s3fs: None           
fastparquet: None
pandas_gbq: None
pandas_datareader: None