ENH: Sparse int64 and bool dtype support enhancement by sinhrks · Pull Request #13849 · pandas-dev/pandas (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation37 Commits1 Checks0 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
- closes Support dtypes other than float in sparse data structures #667, closes BUG: int sparse series is converted to float if sliced #8292, closes _combine_const() in pandas.sparse.frame does not have uniform method signature with pandas.core.frame #13001, closes SparseDataFrame.isnull raises an error #8276, closes BUG: SparseSeries/DataFrame non-float dtypes repr #13110
- tests added / passed
- passes
git diff upstream/master | flake8 --diff
- whatsnew entry
Currently, sparse doesn't support int64
and bool
dtypes actually. When int
or bool
values are passed, it is coerced to float64
if dtype
kw is not explicitly specified.
on current master
pd.SparseArray([1, 2, 0, 0 ])
# [1.0, 2.0, 0.0, 0.0]
# Fill: nan
# IntIndex
# Indices: array([0, 1, 2, 3], dtype=int32)
pd.SparseArray([True, False, True])
# [1.0, 0.0, 1.0]
# Fill: nan
# IntIndex
# Indices: array([0, 1, 2], dtype=int32)
after this PR
The created data should have the dtype
of passed values (as the same as normal Series
).
pd.SparseArray([1, 2, 0, 0 ])
# [1, 2, 0, 0]
# Fill: 0
# IntIndex
# Indices: array([0, 1], dtype=int32)
pd.SparseArray([True, False, True])
# [True, False, True]
# Fill: False
# IntIndex
# Indices: array([0, 2], dtype=int32)
Also, fill_value
is automatically specified according to the following rules (because np.nan
cannot appear in int
or bool
dtype):
Basic rule: sparse dtype
must not be changed when it is converted to dense.
- If
sparse_index
is specified and data has a hole (missing values):fill_value
is np.nandtype
isfloat64
orobject
(which can store bothdata
andfill_value
)
- If
sparse_index
is None (all values are provided viadata
, no missing values)- if
fill_value
is not explicitly passed, following default will be used depending on its dtype.
*float
:np.nan
*int
:0
*bool
:False
- if
Current coverage is 85.27% (diff: 98.63%)
@@ master #13849 diff @@
Files 139 139
Lines 50511 50523 +12
Methods 0 0
Messages 0 0
Branches 0 0
- Hits 43071 43083 +12
Misses 7440 7440
Partials 0 0
Powered by Codecov. Last update 10bf721...341585a
This was referenced
Aug 1, 2016
jreback pushed a commit that referenced this pull request
split from #13849
Author: sinhrks sinhrks@gmail.com
Closes #13900 from sinhrks/sparse_astype and squashes the following commits:
1c669ad [sinhrks] ENH: sparse astype now supports int64 and bool
@sinhrks getting tons of warnings compiling on windows....all the same
pandas\src\sparse.c(63861) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data
pandas\src\sparse.c(63870) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data
pandas\src\sparse.c(66180) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data
pandas\src\sparse.c(66189) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data
pandas\src\sparse.c(68499) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data
This was referenced
Aug 8, 2016
jreback pushed a commit that referenced this pull request
Sorry, not familiar with sparse. But: using object dtype, does it work enough to use it for certain cases? If yes, I would not remove it.
I think object dtype can be used in some cases, but not fully sure as it is not tested well. Not remove ATM and add more tests to clarify (on another PR).
#13110 should be closed. Added whatsnew.
@@ -17,6 +17,7 @@ Highlights include: |
---|
- ``.rolling()`` are now time-series aware, see :ref:`here <whatsnew_0190.enhancements.rolling_ts>` |
- pandas development api, see :ref:`here <whatsnew_0190.dev_api>` |
- ``PeriodIndex`` now has its own ``period`` dtype. see ref:`here <whatsnew_0190.api.perioddtype>` |
- Sparse now supports other ``int`` and ``bool`` dtypes, see :ref:`here <whatsnew_0190.sparse>` |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would leave out other
Disclaimer: I never used sparse or am familiar with the implementation (so my excuses if it is a stupid or naive question), but I quickly looked at the PR and have the following question.
Previously, for integer and boolean serieses, the 0 or False values were regarded as actual values, not an indication of 'not a value' in the sparse series. Isn't this a big change? (I don't know how much you could use it before this PR to be a problem)
Next to that, having eg False for boolean arrays as the default fill_value also seems a bit strange to me. I would expect that somebody who wants a boolean sparse array, would want to be able to have both True and False values as actual values? (eg something like [True, -, -, False, -, -, True]
)?
Of course this is currently because boolean serieses cannot have anything else as True or False.
OK, so probably my question should be categorized in the naive category :-)
I see that this is the same as what scipy.sparse
does, so seems like a sensible default then.
Sparse data should have the same dtype as its dense representation. Currently, |
``float64``, ``int64`` and ``bool`` dtypes are supported. Depending on the original |
dtype, ``fill_value`` default changes: |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a note here somewhere that for int and bool this was only added from 0.19 ?
joris your example already works you can have any values u want as actual values (both True and False); the fill value is for the missing value indicator when I need to densify (it's the default)
so this is not a conceptual change at all just a change to keep dtype consistency
@jreback I was looking at the to_sparse
examples. So the fill_value is also used to convert from dense to sparse. So the output what you see there (eg in case of pd.Series([1, 0, 0]).to_sparse()
) has changed (previously that was a block length of 3, now of 1). But no problem, I understand that the actual behaviour you want has not changed.
@jreback This PR for the rest OK to merge for you, Jeff? (it's closing a lot of issues for 0.19.0 :-))
@sinhrks Can you update the docstrings for SparseDataFrame, SparseSeries and SparseArray? They all still mention the fact that only floats are supported or that nan is the default fill value.
@sinhrks appveyor started failing (some int dtype issues):
======================================================================
FAIL: test_append_zero (pandas.sparse.tests.test_list.TestSparseList)
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\sparse\tests\test_list.py", line 64, in test_append_zero
tm.assert_sp_array_equal(sparr, SparseArray(arr, fill_value=0))
File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 1392, in assert_sp_array_equal
assert_numpy_array_equal(left.sp_values, right.sp_values)
File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 1083, in assert_numpy_array_equal
assert_attr_equal('dtype', left, right, obj=obj)
File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 878, in assert_attr_equal
left_attr, right_attr)
File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 1018, in raise_assert_detail
raise AssertionError(msg)
AssertionError: numpy array are different
Attribute "dtype" are different
[left]: int64
[right]: int32