BUG: Sparse incorrectly handles fill_value · Issue #12797 · pandas-dev/pandas

Sparse structures appear to conflate missing values (NaN) with fill_value. Based on the docs, I understand fill_value to be a user-specified value that is omitted from the internal sparse representation. fill_value may differ from the missing-value marker (NaN), in which case NaN should be stored and returned like any other value.
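To make the intended semantics concrete, here is a minimal pure-Python sketch of a sparse encoding in which only fill_value is omitted. It is not how pandas implements SparseArray; the helper names `sparse_encode` and `sparse_decode` are hypothetical and exist only for this illustration.

```python
import math


def sparse_encode(values, fill_value):
    """Toy encoder (hypothetical): keep only entries that differ from fill_value."""

    def is_fill(v):
        # NaN != NaN, so a NaN fill_value needs an explicit isnan check.
        if isinstance(fill_value, float) and math.isnan(fill_value):
            return isinstance(v, float) and math.isnan(v)
        return v == fill_value

    positions = [i for i, v in enumerate(values) if not is_fill(v)]
    stored = [values[i] for i in positions]
    return positions, stored, len(values)


def sparse_decode(positions, stored, length, fill_value):
    """Toy decoder (hypothetical): rebuild the dense list from the stored entries."""
    dense = [fill_value] * length
    for i, v in zip(positions, stored):
        dense[i] = v
    return dense


data = [1.0, math.nan, 0.0, 3.0, math.nan]
positions, stored, length = sparse_encode(data, fill_value=0.0)
dense = sparse_decode(positions, stored, length, fill_value=0.0)
```

With fill_value=0.0, only the single zero is dropped from storage. Both NaN entries are kept as explicit values, so they survive the round trip instead of being turned into 0.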

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

# Wrong: the 2nd and last elements should be NaN
pd.SparseArray([1, np.nan, 0, 3, np.nan], fill_value=0).to_dense()
# array([ 1.,  0.,  0.,  3.,  0.])

# Wrong: the 2nd element should be NaN
orig = pd.Series([1, np.nan, 0, 3, np.nan], index=list('ABCDE'))
sparse = orig.to_sparse(fill_value=0)
sparse.reindex(['A', 'B', 'C'])
# A    1.0
# B    0.0
# C    0.0
# dtype: float64
# BlockIndex
# Block locations: array([0], dtype=int32)
# Block lengths: array([1], dtype=int32)

Expected Output

pd.SparseArray([1, np.nan, 0, 3, np.nan], fill_value=0).to_dense()
# array([ 1.,  nan,  0.,  3.,  nan])

sparse = orig.to_sparse(fill_value=0)
sparse.reindex(['A', 'B', 'C'])
# A    1.0
# B    NaN
# C    0.0
# dtype: float64
# BlockIndex
# Block locations: array([0], dtype=int32)
# Block lengths: array([1], dtype=int32)

output of pd.show_versions()

Current master.

The fix itself looks straightforward, but it breaks some tests that rely on dubious comparisons.