ENH: fastpath indexer API proposal (draft) · Issue #6328 · pandas-dev/pandas
The discussion in #6134 has inspired an idea that I'm writing down for
discussion. The idea is pretty obvious so it should've been considered before,
but I still think pandas as it is right now can benefit from it.
My main complaint about pandas when using it non-interactively is that
lookups are significantly slower than with ndarray containers. I do realize
that this happens because of the many ways indexing may be done, but at some
point I've really started thinking about ditching pandas in some
performance-critical paths of my project and replacing them with the dreadful
dict/ndarray combo. Not only does writing arr = df.values[df.index.get_loc(key)]
get old pretty fast, it's also slower when the frame contains different
dtypes, and then you need to go deeper to fix that.
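To make the workaround concrete, here is a small sketch of the manual lookup and of why mixed dtypes hurt it. The frames and column names are made up for illustration. It shows behaviour only and makes no timing claims.

```python
import numpy as np
import pandas as pd

# Homogeneous frame: .values is a plain float64 ndarray.
df_num = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]},
    index=["x", "y", "z"],
)
pos = df_num.index.get_loc("y")  # label -> integer position
row = df_num.values[pos]         # plain ndarray lookup, no pandas indexing

# Mixed dtypes: .values has to upcast every column to a common dtype,
# which for floats plus strings is object, so the ndarray is no longer cheap.
df_mixed = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "s": ["p", "q", "r"]},
    index=["x", "y", "z"],
)
mixed_values = df_mixed.values
```

With mixed dtypes the "go deeper" step usually means pulling out one column at a time instead of the whole 2-D block.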
Now I thought: what if this slowdown could be reduced by creating fastpath
indexers that look like the IndexSlice from #6134 and convey a message to
pandas indexing facilities, like "trust me, I've done all the preprocessing,
just look it up already"? I'm talking about something like this (the names
are arbitrary and chosen for illustrative purposes only):
masked_rows = df.fastloc[pd.bool_slice[bool_array]]
or
masked_rows = df.fastloc[pd.bool_series_slice[bool_series]]
or
rows_3_and_10 = df.fastloc[pd.pos_slice[3, 10]]
or
rows_3_through_10 = df.fastloc[pd.range_slice[3:10]]
or
rows_for_two_days = df.fastloc[pd.tpos_slice['2014-01-01', '2014-01-08']]
Given the actual slice objects will have a common base class, the
implementation could be as easy as:
class FastLocAttribute(object):
    def __init__(self, container):
        self._container = container

    def __getitem__(self, smth):
        if not isinstance(smth, FastpathIndexer):
            raise TypeError("Indexing object is not a FastpathIndexer")
        # open to custom FastpathIndexer implementations
        return smth.getitem(self._container)
        # or a better encapsulated, but not so open:
        # return self._container._index_method[type(smth)](smth)
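A self-contained sketch of the "open" variant might look like the following. Everything except FastLocAttribute is hypothetical and exists only to make the example run: the FastpathIndexer base class, the BoolSlice and PosSlice subclasses, and the SliceFactory helper are not pandas APIs. The lookups simply delegate to iloc, so this shows the dispatch mechanism, not an actual speedup.

```python
import numpy as np
import pandas as pd


class FastpathIndexer:
    """Hypothetical common base class for all fastpath indexers."""

    def __init__(self, key):
        self.key = key

    def getitem(self, container):
        raise NotImplementedError


class BoolSlice(FastpathIndexer):
    """Caller promises `key` is a boolean ndarray of the right length."""

    def getitem(self, container):
        return container.iloc[np.flatnonzero(self.key)]


class PosSlice(FastpathIndexer):
    """Caller promises `key` holds integer positions."""

    def getitem(self, container):
        positions = self.key if isinstance(self.key, tuple) else (self.key,)
        return container.iloc[list(positions)]


class SliceFactory:
    """Hypothetical helper: `factory[key]` wraps `key` in an indexer."""

    def __init__(self, indexer_cls):
        self._indexer_cls = indexer_cls

    def __getitem__(self, key):
        return self._indexer_cls(key)


class FastLocAttribute:
    def __init__(self, container):
        self._container = container

    def __getitem__(self, smth):
        if not isinstance(smth, FastpathIndexer):
            raise TypeError("Indexing object is not a FastpathIndexer")
        # The indexer itself decides how to perform the lookup.
        return smth.getitem(self._container)


bool_slice = SliceFactory(BoolSlice)
pos_slice = SliceFactory(PosSlice)

df = pd.DataFrame({"a": [0, 1, 2, 3, 4]})
fastloc = FastLocAttribute(df)  # stand-in for a real `df.fastloc` attribute

masked_rows = fastloc[bool_slice[np.array([True, False, True, False, False])]]
rows_1_and_3 = fastloc[pos_slice[1, 3]]
```

Passing anything that is not a FastpathIndexer, such as a plain list, raises TypeError instead of being guessed at.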
Cons:
- a change in public API
- one more lookup type
- inconvenient to use interactively
Pros:
- adheres to the Zen of Python (explicit is better than implicit)
- when used in programs, most of the time you know what the indexing object
  will look like and how you want to use its contents (e.g. no guessing whether
  np.array([0,1,0,1]) is a boolean mask or a series of "takeable" indices)
- lengthier than existing lookup schemes, but still shorter than jumping through
  the hoops of NDFrame and Index internals to avoid premature pessimization
  (also, more reliable w.r.t. new releases)
- the fastpath indexing API could be used in pandas internally for speed (and
  clarity, as in "interesting, what does this function pass to df.loc[...],
  let's find this out")
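The np.array([0,1,0,1]) ambiguity mentioned in the pros can be seen with plain NumPy: the same values select different rows depending on whether they are read as positions or as a mask. The data array here is made up for illustration.

```python
import numpy as np

data = np.array([10, 20, 30, 40])
key = np.array([0, 1, 0, 1])

# Integer dtype: treated as positions to take.
as_positions = data[key]          # elements at positions 0, 1, 0, 1

# Same values cast to bool: treated as a mask.
as_mask = data[key.astype(bool)]  # elements where the mask is True
```

An explicit indexer type would state which of the two readings is intended, so the lookup code would not have to infer it from the dtype.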