Allow indices to be mapped through through dictionaries or series by nateyoder · Pull Request #15081 · pandas-dev/pandas
Conversation
- closes #12756 (DataFrame indices can't map through dictionaries or series)
- tests added / passed
- passes `git diff upstream/master | flake8 --diff`
- whatsnew entry
@@ -2395,6 +2394,25 @@ def get_indexer_for(self, target, **kwargs):
        indexer, _ = self.get_indexer_non_unique(target, **kwargs)
        return indexer

    def get_values_from_dict(self, input_dict):

make this a private function (_get_values_from_dict)
    Parameters
    ----------
    input_dict : dict

call this data
    Returns
    -------
    Union[np.array, list]

this should always return np.ndarray no?
@@ -2438,8 +2456,8 @@ def map(self, mapper):
        Parameters
        ----------
        mapper : callable
            Function to be applied.
        mapper : Union[function, dict, Series]

use {callable, dict, Series}
say new in 0.20.0 for (dict/Series)
@@ -2450,7 +2468,18 @@ def map(self, mapper):
        """
        from .multi import MultiIndex
        mapped_values = self._arrmap(self.values, mapper)
        if isinstance(mapper, ABCSeries):

for a dict, why can't you just Series(data) and then treat it like a Series?
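A minimal sketch of the idea being suggested here: wrap a dict mapper in a Series up front so dicts and Series share one lookup path. The helper name and return types below are illustrative assumptions, not the PR's actual code:

```python
import pandas as pd
from pandas.api.types import is_dict_like

def map_index(index, mapper):
    # Treat a plain dict the same as a Series by wrapping it first.
    if is_dict_like(mapper) and not isinstance(mapper, pd.Series):
        mapper = pd.Series(mapper)
    if isinstance(mapper, pd.Series):
        # Label-based lookup; labels absent from the mapper come back as NaN.
        return pd.Index(mapper.reindex(index).values)
    # Callable fallback: apply element-wise.
    return pd.Index([mapper(x) for x in index])

idx = pd.Index(['a', 'b', 'c'])
print(map_index(idx, {'a': 1, 'b': 2}))  # third label is missing -> NaN
```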
?
so I think it's worth the effort to combine this code with Series.map (in reality you can probably just move the Series.map code to core.base, I think, with maybe some modifications here).
def get_values_from_dict(self, input_dict):
    """Return the values of the input dictionary in the order the keys are
    in the index. np.nan is returned for index values not in the
    dictionary.

same doc-string

    Returns
    -------
    Union[np.array, list]

I think you can move this to tseries.common, then for the DatetimeIndex you do a super call (e.g. with the dict_compat)
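For orientation, a rough sketch of a helper matching that docstring; it illustrates the described behaviour and is not the implementation under review:

```python
import numpy as np

def _get_values_from_dict(index, data):
    # Values come back in the order of the index's labels; labels that are
    # not present in the dict become np.nan.
    if len(data) == 0:
        return np.array([np.nan] * len(index))
    return np.array([data.get(key, np.nan) for key in index])
```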
@@ -3511,6 +3511,10 @@ def test_map(self):
        result = index.map(lambda x: x + 1)
        expected = index + 1
        tm.assert_index_equal(result, expected)
        series_map = pd.Series(expected, index)
        tm.assert_index_equal(index.map(series_map), expected)

try with an empty, try with some NaT's
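Roughly what the requested extra cases might look like as a standalone snippet; the datetime index and the expected all-NaN result for an empty mapper are assumptions based on the discussion later in this thread:

```python
import numpy as np
import pandas as pd
import pandas.testing as tm

index = pd.date_range('2013-01-01', periods=3)

# empty mapper: every label is missing, so everything maps to NaN
tm.assert_index_equal(index.map({}), pd.Index([np.nan] * len(index)))

# mapper containing NaT for one of the labels
nat_map = pd.Series(index.tolist(), index=index)
nat_map.iloc[0] = pd.NaT
result = index.map(nat_map)  # first position should come back as NaT
```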
rng = timedelta_range('1 day', periods=10)
f = lambda x: x.days
result = rng.map(f)
exp = Int64Index([f(x) for x in rng])
tm.assert_index_equal(result, exp)
s = Series(exp, index=rng)

same as above (you might wish to move these tests to test_datetimelike to avoid duplication)
exp_values = [f(x) for x in rng]
exp = Index(exp_values, dtype='<U8')
tm.assert_index_equal(rng.map(f), exp)

test for empty
Hi @jreback sorry for the delay. Work got a little crazy. Let me know what other changes you'd like to see made.
jreback changed the title from "Allow indeices to be mapped through through dictionaries or series" to "Allow indices to be mapped through through dictionaries or series"
@@ -139,6 +139,8 @@ Other enhancements
- ``pd.merge_asof()`` gained the option ``direction='backward'|'forward'
- ``Series/DataFrame.asfreq()`` have gained a ``fill_value`` parameter, to fill missing values (:issue:`3715`).
- ``Series/DataFrame.resample.asfreq`` have gained a ``fill_value`` parameter, to fill missing values during resampling (:issue:`3715`).
- ``Index.map`` can now accept series and dictionary input object.

add the issue number
@nateyoder this was quite close. can you rebase / update and we can get it in?
Hi Jeff, my apologies, as work has swamped me again. I'll try to get everything up to date and working again this weekend and you can let me know if there are other changes you'd like. Thanks for your patience.
@nateyoder great thanks. yeah when you rebase will take another look. feel free to ping.
Hi @jreback. Apologies again on the delay. I updated the tests to include some of the missing cases but was a little unsure how to re-use the mapping code from series like you had suggested.
I took a shot at combining the code, but at least in my current implementation I still had to include some object-specific code. If you've got a better idea let me know and I'll be happy to take a look.
I also took this opportunity to not create a series from a dict but instead use an index directly, which, as you can see from the ASV results below, provided a slight but pretty consistent performance improvement for dicts and didn't significantly slow down series maps:
asv continuous -f 1.1 upstream/master HEAD -b ^series_methods.series_map -E conda:2.7
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Installing into conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
· Running 4 total benchmarks (2 commits * 1 environments * 2 benchmarks)
[ 0.00%] · For pandas commit hash 99e11b4b:
[ 0.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.......................................
[ 0.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] ··· Running series_methods.series_map_dict.time_series_map_dict 1.61±0.03ms
[ 50.00%] ··· Running series_methods.series_map_series.time_series_map_series 995±20μs
[ 50.00%] · For pandas commit hash 19fc8dac:
[ 50.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 50.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· Running series_methods.series_map_dict.time_series_map_dict 2.10±0.09ms
[100.00%] ··· Running series_methods.series_map_series.time_series_map_series 970±60μs

       before           after         ratio
     [19fc8dac]       [99e11b4b]
-     2.10±0.09ms      1.61±0.03ms     0.77  series_methods.series_map_dict.time_series_map_dict

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
EDIT: Note that given that the above method did NOT work in both 2.7 and 3, this performance improvement was reverted and the included changes do not significantly change performance.
Sorry, I didn't test with 2.7. I'll go back to just using series even though it's a little slower.
Sorry for the thrashing around. I wasn't familiar with what should be an index vs a multi-index when mapping with a dictionary made of tuples. I ended up just going back to using a series because I couldn't keep the performance improvement going straight through the _ensure_index function.
@@ -521,6 +521,7 @@ Other Enhancements
- The ``display.show_dimensions`` option can now also be used to specify
  whether the length of a ``Series`` should be shown in its repr (:issue:`7117`).
- ``parallel_coordinates()`` has gained a ``sort_labels`` keyword arg that sorts class labels and the colours assigned to them (:issue:`15908`)

remove for now. This will be in 0.21.0 (though file doesn't exist yet, don't create it).

Removed
@@ -8,10 +8,10 @@
from pandas.core.dtypes.missing import isnull
from pandas.core.dtypes.generic import ABCDataFrame, ABCSeries, ABCIndexClass
from pandas.core.dtypes.common import is_object_dtype, is_list_like, is_scalar
from pandas.core.dtypes.common import is_object_dtype, is_list_like, is_scalar, is_extension_type

this causes lint issues, use parens.

👍
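That is, wrapping the long import in parentheses so it can span lines without tripping the line-length check, along these lines:

```python
from pandas.core.dtypes.common import (
    is_object_dtype, is_list_like, is_scalar, is_extension_type)
```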
@@ -933,6 +933,44 @@ def _reduce(self, op, name, axis=0, skipna=True, numeric_only=None,
                klass=self.__class__.__name__, op=name))
        return func(**kwds)

    def _map_values(self, values, arg, na_action=None):
        if is_extension_type(self.dtype):

can you add a doc-string

👍
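For context, a loose sketch of what a shared mapping helper on the common base class might do for dict, Series, and callable mappers. The structure and names here are illustrative assumptions, not the code under review:

```python
import numpy as np
import pandas as pd

def map_values(values, mapper, na_action=None):
    # dict/Series mappers: build a Series once and look values up by label,
    # avoiding a Python-level call per element.
    if isinstance(mapper, dict):
        mapper = pd.Series(mapper)
    if isinstance(mapper, pd.Series):
        indexer = mapper.index.get_indexer(values)
        new_values = mapper.values.take(indexer)
        # positions not found in the mapper come back as -1; mark them missing
        return np.where(indexer == -1, np.nan, new_values)
    # callable mapper: optionally skip missing values, then apply element-wise
    if na_action == 'ignore':
        return np.array([x if pd.isna(x) else mapper(x) for x in values])
    return np.array([mapper(x) for x in values])

print(map_values(np.array(['a', 'b', 'z']), {'a': 1, 'b': 2}))  # last label missing -> NaN
```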
    Parameters
    ----------
    data : {dict, DictWithoutMissing}

just dict

👍
@@ -2706,31 +2726,36 @@ def groupby(self, values):
        return result

    def map(self, mapper):
        """Apply mapper function to an index.
    def map(self, arg, na_action=None):

technically you need to deprecate and rename mapper -> arg; add a versionadded (0.21.0) for na_action.

Is this something I need to do? If so, can you show me how you've done this in the past? Otherwise should I just change arg -> mapper?
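A generic sketch of the usual keyword-rename pattern (accept both names for a release, warn on the old one, forward to the new one). This is only an illustration of the idea, not how the PR ultimately resolved the mapper/arg question; pandas also ships a deprecate_kwarg decorator in pandas.util._decorators that packages the same behaviour:

```python
import warnings

def _resolve_mapper(mapper=None, arg=None):
    """Accept the deprecated 'arg' keyword but steer callers towards 'mapper'."""
    if arg is not None:
        warnings.warn("'arg' is deprecated, use 'mapper' instead",
                      FutureWarning, stacklevel=3)
        mapper = arg
    return mapper
```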
tm.assert_index_equal(self.index.map(dict_map), expected)

# empty mappable
nan_index = Index([pd.np.nan] * len(self.index))

use np.nan

👍
result = self.index.map(lambda x: pd.NaT if x == self.index[0] else x)
expected = Index([pd.NaT] + self.index[1:].tolist())
tm.assert_index_equal(result, expected)
series_map = pd.Series(expected, self.index)

blank line between cases

👍
index = PeriodIndex([2005, 2007, 2009], freq='A')
result = index.map(lambda x: x + 1)

what does this do now, raise? put a test for that

No, it still works. I understood from your previous comment on Jan 9th that I could move it to test_datetimelike but have since tested it using pandas/tests/indexes/datetimelike.py
@@ -74,3 +74,29 @@ def test_union(self):
        for case in cases:
            result = first.union(case)
            self.assertTrue(tm.equalContents(result, everything))

ideally could push some of these (NaT & empty) into pandas/tests/indexes/datetimelike.py so that they get tested for all DTI, PI, and TDI

then in the sub-classes call the super-class for testing (and add additional); so these are all called test_map.

Sorry, I misunderstood your comment on the last PR and thought you asked me to put these here (test_datetimelike) rather than there. Thanks for making that clearer.
def test_map_with_categorical_series(self):
    # GH 12756
    a = pd.Index([1, 2, 3, 4])
    b = pd.Series(["even", "odd", "even", "odd"],

this is really odd. you are mapping the codes??

I wanted to make sure that if someone did try to do this (not sure why you would) it would work as expected. So yes, it maps the codes in this case. Is there a different preferred behavior?
pls rebase / update when you can
nateyoder (Contributor, Author) left a comment
Sorry for the long delay here. I ran into (and am still running into) some issues with IntervalIndex in which the get_indexer function appears to be buggy (#16410). I skipped that test for now but depending on the desired behavior will probably have to do a custom map function for that? Also I'm not sure I fully understand the new pytest behavior so sorry if there are issues there as well.
@@ -74,3 +74,29 @@ def test_union(self):
        for case in cases:
            result = first.union(case)
            self.assertTrue(tm.equalContents(result, everything))

    def test_map(self):
        expected = self.index + 1

👍
@@ -1398,6 +1398,17 @@ def get_value_maybe_box(self, series, key):
                                              key, tz=self.tz)
        return _maybe_box(self, values, series, key)

    @Appender(_index_shared_docs['_get_values_from_dict'])
    def _get_values_from_dict(self, data):

See the above comment.
@@ -704,6 +704,14 @@ def __rsub__(self, other):
    def _add_delta(self, other):
        return NotImplemented

    @Appender(_index_shared_docs['_get_values_from_dict'])
    def _get_values_from_dict(self, data):

I would assume it does get hit indirectly, as it was essentially pulled into the subclasses from the pd.Series instantiation. It is essentially replicating the behavior that was previously in lines 188-203 of pandas/core/series.py and allowing each object to handle it as seen fit rather than doing it in an if conditional. Personally this way makes more sense to me, but since I ended up not making use of this functionality outside of the Series instantiation I'm happy to revert the changes if you'd prefer.
@nateyoder can you rebase / update. sorry didn't get to review this sooner.
@nateyoder going to finally get this in! can you rebase and move the note to 0.22.0?
Hi @jreback. Sorry the problem has totally been on my side. Yes I'd love to get it in as well though I unfortunately won't be able to work on it this weekend. If I try to get you updates early next week would that work? Sorry for the delay!
@nateyoder sure! that is fine. just ping when ready.
…d add dict performance test
also refactored to remove turning things into object dtypes unnecessarily
PR:
In [1]: np.random.seed(1234)
...: map_size = 10000
...: s = Series(np.random.randint(0, map_size, 100000))
...: map_series = Series(map_size - np.arange(map_size))
...:
In [2]: %timeit s.map(map_series)
773 µs ± 36.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
0.21.0:
In [2]: %timeit s.map(map_series)
9.3 ms ± 391 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Looks good, just a small comment
tm.assert_index_equal(self.index.map(series_map), expected)
dict_map = {i: e for e, i in zip(expected, self.index)}
tm.assert_index_equal(self.index.map(dict_map), expected)

One question: in the base tests you have the case of an empty mapping generating an Index of NaNs. Should there be a test here as well that it gives an Index of NaTs?
there is no easy way to do this actually, e.g.

In [2]: pd.date_range('20130101', periods=3).map({})
Out[2]: Float64Index([nan, nan, nan], dtype='float64')

I could special case this I suppose (e.g. empty dict).
I actually tried to fix this, but it goes down a rabbit hole. if you want to open an issue ok I guess.
# Dictionary does not have a default. Thus it's safe to
# convert to a Series for efficiency.
from pandas import Series
mapper = Series(mapper, index=mapper.keys())
Is passing index=mapper.keys() necessary?
nope
actually this is to handle some odd-ball cases where you have tuples as keys in the dictionary
def map(self, mapper):
    """Apply mapper function to an index.
def map(self, mapper, na_action=None):
    """Map values of Series using input correspondence (which can be a

The first sentence should be a single line so that it looks OK in the autosummary table.

done
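In other words, something along these lines, where the one-line summary stays short and the detail moves to the extended description (the exact wording here is only a suggestion, not the docstring that was merged):

```python
def map(self, mapper, na_action=None):
    """Map values using an input correspondence (a dict, Series, or function).

    The longer explanation can wrap over the following lines; only the first
    sentence needs to fit on a single line for the autosummary table.
    """
```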
@@ -2245,14 +2230,14 @@ def unstack(self, level=-1, fill_value=None):
    # ----------------------------------------------------------------------
    # function application

    def map(self, arg, na_action=None):
    def map(self, mapper, na_action=None):

Could you add a release note for this in API-breaking changes.

reverted (was an error)
@jreback thanks for doing all the work to finish it up!! Really appreciate you picking up the slack!