ENH: Sparse int64 and bool dtype support enhancement by sinhrks · Pull Request #13849 · pandas-dev/pandas

sinhrks

Currently, sparse doesn't actually support int64 and bool dtypes. When int or bool values are passed, they are coerced to float64 unless the dtype keyword is explicitly specified.

on current master

pd.SparseArray([1, 2, 0, 0 ])
# [1.0, 2.0, 0.0, 0.0]
# Fill: nan
# IntIndex
# Indices: array([0, 1, 2, 3], dtype=int32)

pd.SparseArray([True, False, True])
# [1.0, 0.0, 1.0]
# Fill: nan
# IntIndex
# Indices: array([0, 1, 2], dtype=int32)

after this PR

The created data should have the dtype of the passed values (the same as a normal Series).

pd.SparseArray([1, 2, 0, 0 ])
# [1, 2, 0, 0]
# Fill: 0
# IntIndex
# Indices: array([0, 1], dtype=int32)

pd.SparseArray([True, False, True])
# [True, False, True]
# Fill: False
# IntIndex
# Indices: array([0, 2], dtype=int32)

Also, fill_value is automatically specified according to the following rules (because np.nan cannot appear in int or bool dtype):

Basic rule: sparse dtype must not be changed when it is converted to dense.
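
As a rough illustration of the dtype-dependent defaults in the examples above, here is a minimal pure-Python sketch. It is not the pandas implementation: `default_fill_value` and `sparse_indices` are hypothetical helper names, and the mapping (float to nan, int to 0, bool to False) is inferred from the example outputs in this description.

```python
import math

def default_fill_value(values):
    """Pick a fill value from the kind of the input values (hypothetical helper)."""
    # bool must be checked before int, because bool is a subclass of int in Python.
    if all(isinstance(v, bool) for v in values):
        return False
    if all(isinstance(v, int) for v in values):
        return 0
    return math.nan

def sparse_indices(values, fill_value):
    """Return the positions whose value differs from fill_value (NaN-aware)."""
    def is_fill(v):
        if isinstance(fill_value, float) and math.isnan(fill_value):
            return isinstance(v, float) and math.isnan(v)
        return v == fill_value
    return [i for i, v in enumerate(values) if not is_fill(v)]

ints = [1, 2, 0, 0]
bools = [True, False, True]
print(default_fill_value(ints), sparse_indices(ints, default_fill_value(ints)))     # 0 [0, 1]
print(default_fill_value(bools), sparse_indices(bools, default_fill_value(bools)))  # False [0, 2]
```

The indices match the `Indices:` lines shown for the "after this PR" output: only the positions that differ from the fill value are stored.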

@codecov-io

Current coverage is 85.27% (diff: 98.63%)

Merging #13849 into master will increase coverage by <.01%

@@            master   #13849   diff @@

  Files          139      139
  Lines        50511    50523    +12
  Methods          0        0
  Messages         0        0
  Branches         0        0

Powered by Codecov. Last update 10bf721...341585a

jreback pushed a commit that referenced this pull request

Aug 4, 2016

@sinhrks @jreback

split from #13849

Author: sinhrks sinhrks@gmail.com

Closes #13900 from sinhrks/sparse_astype and squashes the following commits:

1c669ad [sinhrks] ENH: sparse astype now supports int64 and bool

@jreback

@sinhrks getting tons of warnings compiling on windows....all the same

pandas\src\sparse.c(63861) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data
pandas\src\sparse.c(63870) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data
pandas\src\sparse.c(66180) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data
pandas\src\sparse.c(66189) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data
pandas\src\sparse.c(68499) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data

jreback pushed a commit that referenced this pull request

Aug 9, 2016

@sinhrks @jreback

@jorisvandenbossche

Sorry, not familiar with sparse. But: using object dtype, does it work enough to use it for certain cases? If yes, I would not remove it.

@sinhrks

I think object dtype can be used in some cases, but I'm not fully sure, as it is not well tested. I won't remove it ATM and will add more tests to clarify (in another PR).

#13110 should be closed. Added whatsnew.

jreback

@@ -17,6 +17,7 @@ Highlights include:
- ``.rolling()`` are now time-series aware, see :ref:`here <whatsnew_0190.enhancements.rolling_ts>`
- pandas development api, see :ref:`here <whatsnew_0190.dev_api>`
- ``PeriodIndex`` now has its own ``period`` dtype. see ref:`here <whatsnew_0190.api.perioddtype>`
- Sparse now supports other ``int`` and ``bool`` dtypes, see :ref:`here <whatsnew_0190.sparse>`

would leave out "other"

@jorisvandenbossche

Disclaimer: I have never used sparse and am not familiar with the implementation (so my apologies if this is a stupid or naive question), but I quickly looked at the PR and have the following question.

Previously, for integer and boolean Series, the 0 or False values were regarded as actual values, not as an indication of 'not a value' in the sparse series. Isn't this a big change? (I don't know how much you could use it before this PR for this to be a problem.)
Next to that, having e.g. False as the default fill_value for boolean arrays also seems a bit strange to me. I would expect that somebody who wants a boolean sparse array would want to be able to have both True and False values as actual values (e.g. something like [True, -, -, False, -, -, True]).
Of course, this is currently the case because boolean Series cannot hold anything other than True or False.

@jorisvandenbossche

OK, so probably my question should be categorized in the naive category :-)
I see that this is the same as what scipy.sparse does, so seems like a sensible default then.

jorisvandenbossche

Sparse data should have the same dtype as its dense representation. Currently,
``float64``, ``int64`` and ``bool`` dtypes are supported. Depending on the original
dtype, ``fill_value`` default changes:

Can you add a note here somewhere that support for int and bool was only added in 0.19?

@jreback

joris, your example already works: you can have any values you want as actual values (both True and False). The fill value is the missing value indicator used when densifying (it's the default).

so this is not a conceptual change at all just a change to keep dtype consistency
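
To make the distinction between stored values and the fill value concrete, here is a minimal pure-Python sketch of densifying. It is only an illustration under assumed names (`to_dense`, `sp_values`, `sp_index` are hypothetical here, not the pandas internals): the stored values may include both True and False, and the fill value is used only for positions that are not stored.

```python
def to_dense(sp_values, sp_index, fill_value, length):
    """Rebuild a dense list from stored values, their positions and a fill value."""
    dense = [fill_value] * length
    for pos, val in zip(sp_index, sp_values):
        dense[pos] = val
    return dense

# Both True and False are stored explicitly; fill_value only covers the gaps.
print(to_dense([True, False, True], [0, 3, 6], False, 7))
# [True, False, False, False, False, False, True]
```

With a boolean fill value the result stays boolean, which is the dtype consistency this PR is after.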

@jorisvandenbossche

@jreback I was looking at the to_sparse examples. The fill_value is also used when converting from dense to sparse, so the output that you see there (e.g. for pd.Series([1, 0, 0]).to_sparse()) has changed (previously that was a block length of 3, now of 1). But no problem, I understand that the actual behaviour you want has not changed.
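
The change in block length can be sketched in pure Python. This is an illustration only, assuming a block index records the start and length of each run of non-fill values; `block_index` is a hypothetical helper, not the pandas implementation.

```python
import math

def block_index(values, fill_value):
    """Return (block_starts, block_lengths) for runs of non-fill values."""
    def is_fill(v):
        if isinstance(fill_value, float) and math.isnan(fill_value):
            return isinstance(v, float) and math.isnan(v)
        return v == fill_value

    starts, lengths = [], []
    in_block = False
    for i, v in enumerate(values):
        if is_fill(v):
            in_block = False          # a fill value ends the current block
        elif in_block:
            lengths[-1] += 1          # extend the current block
        else:
            starts.append(i)          # open a new block at this position
            lengths.append(1)
            in_block = True
    return starts, lengths

# Before: ints were coerced to float64 with fill nan, so every value is stored.
print(block_index([1.0, 0.0, 0.0], math.nan))  # ([0], [3])
# After: int64 is kept with fill 0, so only the leading 1 is stored.
print(block_index([1, 0, 0], 0))               # ([0], [1])
```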

@jorisvandenbossche

@jreback Is this PR otherwise OK to merge for you, Jeff? (it's closing a lot of issues for 0.19.0 :-))

@jorisvandenbossche

@sinhrks Can you update the docstrings for SparseDataFrame, SparseSeries and SparseArray? They all still mention the fact that only floats are supported or that nan is the default fill value.

@jorisvandenbossche

@sinhrks appveyor started failing (some int dtype issues):

======================================================================
FAIL: test_append_zero (pandas.sparse.tests.test_list.TestSparseList)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\sparse\tests\test_list.py", line 64, in test_append_zero
    tm.assert_sp_array_equal(sparr, SparseArray(arr, fill_value=0))
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 1392, in assert_sp_array_equal
    assert_numpy_array_equal(left.sp_values, right.sp_values)
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 1083, in assert_numpy_array_equal
    assert_attr_equal('dtype', left, right, obj=obj)
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 878, in assert_attr_equal
    left_attr, right_attr)
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 1018, in raise_assert_detail
    raise AssertionError(msg)
AssertionError: numpy array are different
Attribute "dtype" are different
[left]:  int64
[right]: int32

@sinhrks