BUG: multi-type SparseDataFrame fixes and improvements by sstanovnik · Pull Request #13917 · pandas-dev/pandas (original) (raw)
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
- 1 tests added / passed
- passes
git diff upstream/master | flake8 --diff - whatsnew entry
Types were incorrectly determined when slicing SparseDataFrames with
multiple dtypes (such as float and object) into SparseSeries.
import pandas as pd sdf = pd.SparseDataFrame([[1, 2, 'a'], [4, 5, 'b']]) sdf.iloc[[0]] # OK sdf.iloc[0] # ValueError: could not convert string to float: 'a'
No existing issue covers this.
Current coverage is 85.30% (diff: 100%)
@@ master #13917 diff @@
Files 139 139
Lines 50157 50157
Methods 0 0
Messages 0 0
Branches 0 0
Hits 42785 42785
Misses 7372 7372
Partials 0 0
Powered by Codecov. Last update b7abef4...8c7d1ea
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't np.find_common_type do this? (or .common_type)?
should create a pandas version of this to isolate (e.g. this won't handle non-numpy types)
as sparse mainly supports float64, adding more tests for mixed dtype is appreciated (maybe in the same file as separate class).
Types were incorrectly determined when slicing SparseDataFrames with multiple dtypes (such as float and object). Also enables type inference for SparseArrays by default.
sstanovnik changed the title
BUG: slicing single rows of multi-type SparseDataFrames. BUG: multi-type SparseDataFrame fixes and improvements
I used numpy's find_common_type instead of that local function, this changed a test (test_block_internals.py), see if that numpy behaviour is desired.
Another (maybe non-minor) change is the default parameter in sparse/array.py, but that didn't seem to really break anything.
I also added some new tests as suggested, but don't feel confident in adding the proposed pandas alternative to the numpy find_common_type - my knowledge of pandas dtypes isn't really great. It would likely be similar to _interleaved_dtype in core/internals.py right?
| def _lcd_dtype(l): |
|---|
| """ find the lowest dtype that can accomodate the given types """ |
| m = l[0].dtype |
| for x in l[1:]: |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what I meant was can start off by simply moving this (your version or new one with np.find_common_type to pandas.types.cast (and if we have specific tests move similarly; if not, ideally add some). we can add the pandas specific functionaility later.
can u also add tests to check normal SparseArray and its dtypes (related to the default change)?
current default (float64) is because it doesn't fully support int and bool, should be solved by #13849.
It turns out this PR works without the default argument change, I was too hasty to change it. Your PR fixes that better, so I reverted the change.
Common type discovery moved to types/cast.py:_find_common_type. I did notice, however, that types/common.py:_lcd_dtypes exists (seems to only work with numpy dtypes, but differently than np.find_common_type), but changing either implementation to the other broke some things, so I left it as it is.
yep, need to fix this.
@sstanovnik can you create a new issue for this.
[jreback-~/pandas] grin _lcd_dtype pandas
pandas/core/frame.py:
47 : _lcd_dtypes,
3705 : new_dtype = _lcd_dtypes(this_dtype, other_dtype)
pandas/core/internals.py:
4438 : def _lcd_dtype(l):
4473 : lcd = _lcd_dtype(counts[IntBlock])
4489 : return _lcd_dtype(counts[FloatBlock] + counts[SparseBlock])
pandas/types/common.py:
389 : def _lcd_dtypes(a_dtype, b_dtype):
| self.assertEqual(values.dtype, np.int64) |
|---|
| # guess all ints are cast to uints.... |
| # B uint64 forces float because there are other signed int types |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this might fix another bug, can you search for uint64 issues and see?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the issue as a reference here
and in the whatsnew
Opened issue. Moved the tests, added new tests. Found and processed #10364.
| class TestCommonTypes(tm.TestCase): |
| def setUp(self): |
| super(TestCommonTypes, self).setUp() |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need setUp here
| self.assertEqual(_find_common_type([np.object]), np.object) |
|---|
| self.assertEqual(_find_common_type([np.int16, np.int64]), |
| np.int64) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
group them
eg ints
floats
easier to read
| - ``pd.Timedelta(None)`` is now accepted and will return ``NaT``, mirroring ``pd.Timestamp`` (:issue:`13687`) |
|---|
| - ``Timestamp``, ``Period``, ``DatetimeIndex``, ``PeriodIndex`` and ``.dt`` accessor have gained a ``.is_leap_year`` property to check whether the date belongs to a leap year. (:issue:`13727`) |
| - ``pd.read_hdf`` will now raise a ``ValueError`` instead of ``KeyError``, if a mode other than ``r``, ``r+`` and ``a`` is supplied. (:issue:`13623`) |
| - ``.values`` will now return ``np.float64`` with a ``DataFrame`` with ``np.int64`` and ``np.uint64`` dtypes, conforming to ``np.find_common_type`` (:issue:`10364`, :issue:`13917`) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataFrame.values will now ....with a frame of mixed int64 and uint64 dtypes.....
| - Bug in ``SparseDataFrame`` doesn't respect passed ``SparseArray`` or ``SparseSeries`` 's dtype and ``fill_value`` (:issue:`13866`) |
|---|
| - Bug in ``SparseArray`` and ``SparseSeries`` don't apply ufunc to ``fill_value`` (:issue:`13853`) |
| - Bug in ``SparseSeries.abs`` incorrectly keeps negative ``fill_value`` (:issue:`13853`) |
| - Bug in single row slicing on multi-type ``SparseDataFrame``s: types were previously forced to float (:issue:`13917`) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
, types where previously...
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This reads fine to me? There's a colon after SparseDataFrames
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change : to ,
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh wait sorry I misread your comment.
Thanks for your patience.
Labels
Unexpected or buggy dtype conversions
Related to indexing on series/frames, not to indexes themselves