BUG: preserve object-dtype index when accessing DataFrame column / PERF: improve perf of Series fastpath constructor by jorisvandenbossche · Pull Request #42950 · pandas-dev/pandas (original) (raw)

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

@jorisvandenbossche

This came up while discussing the CoW proof of concept (and need for item_cache / speed of creating a Series of a column). Those small changes gives a decent reduction of some overhead (~30% reduction):

arr = np.array([1, 2, 3]) idx = pd.Index(["a", "b", "c"])

pd.Series(arr, index=idx, name="name", fastpath=True)

In [5]: %timeit pd.Series(arr, index=idx, name="name", fastpath=True)
10.9 µs ± 202 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  # <-- master
7.48 µs ± 187 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  # <-- PR

@jorisvandenbossche

jbrockmendel

# labels may exceeds datetime bounds,
# or not be a DatetimeIndex
pass
if labels._is_all_dates:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently this is being relied upon in the pytables code though ..

And looking into it further, the change I made here of course changes behaviour, as it no longer try to infer object dtype as datetime in the fastpath.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although that should probably be considered as a bug fix. With master:

In [1]: df = pd.DataFrame({'a': [1, 2, 3]}, index=pd.date_range("2012", periods=3).astype(object))

In [2]: df.index
Out[2]: Index([2012-01-01 00:00:00, 2012-01-02 00:00:00, 2012-01-03 00:00:00], dtype='object')

In [3]: df['a'].index
Out[3]: DatetimeIndex(['2012-01-01', '2012-01-02', '2012-01-03'], dtype='datetime64[ns]', freq=None)

where the index of the column-as-Series becomes different ..

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC we deprecated this, then un-deprecated bc we didn't have good warning handling/suppression, and were going to try to re-deprecate before long

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a reference (issue/PR) for this?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is #36697, but that's for the non-fastpath way, I think? While here it's specifically the fastpath that I am changing (which is not used when a user creates a Series directly themselves).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when we did that i dont remember any distinction being made between fastpath vs non-fastpath

@jorisvandenbossche

@jorisvandenbossche

@jorisvandenbossche

jbrockmendel

@github-actions

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@mroeschke

Appears this PR has been dormant for a while closing to clear the queue. Feel free to reopen when you have time to work on this PR and address the comment.

@jorisvandenbossche

@jorisvandenbossche

@jreback

@jorisvandenbossche not averse to these PRs but can you merge master & address any comments esp on the older ones

@jorisvandenbossche

jreback

jbrockmendel

@jbrockmendel

Current thought: the if labels._is_all_dates is a PITA that we should rip out in 2.0 (we tried deprecating before and it was messy). Could we just avoid going through these paths by using _from_mgr in the relevant places?

@jorisvandenbossche

Could we just avoid going through these paths by using _from_mgr in the relevant places?

I suppose you mean as alternative to passing fastpath=True, right?

First, the Series(..) fastpath still sets a name as well (which is not included in _from_mgr, so that would make replacing it more cumbersome, or a name keyword could be added to _from_mgr of course)

Looking at where fastpath=True is currently used, this is limited to:

So from that, I don't think it is super easy to change fastpath=True to _from_mgr

@jorisvandenbossche

BTW, I still need to add a test case for this (the output is on master, with this PR both will be object dtype):

In [1]: df = pd.DataFrame({'a': [1, 2, 3]}, index=pd.date_range("2012", periods=3).astype(object))

In [2]: df.index
Out[2]: Index([2012-01-01 00:00:00, 2012-01-02 00:00:00, 2012-01-03 00:00:00], dtype='object')

In [3]: df['a'].index
Out[3]: DatetimeIndex(['2012-01-01', '2012-01-02', '2012-01-03'], dtype='datetime64[ns]', freq=None)

And we need to decide if we see that as a bug fix.

@jbrockmendel

And we need to decide if we see that as a bug fix.

definitely looks like a bugfix, nice catch

@jorisvandenbossche

@jreback

can you rebase i think we merged some parts of this elsewhere

@jorisvandenbossche

@jorisvandenbossche

i think we merged some parts of this elsewhere

Any more specific pointer?

@jorisvandenbossche

@jorisvandenbossche jorisvandenbossche changed the titlePERF: improve perf of Series fastpath constructor BUG: preserve object-dtype index when accessing DataFrame columns / PERF: improve perf of Series fastpath constructor

Dec 6, 2021

@jorisvandenbossche jorisvandenbossche changed the titleBUG: preserve object-dtype index when accessing DataFrame columns / PERF: improve perf of Series fastpath constructor BUG: preserve object-dtype index when accessing DataFrame column / PERF: improve perf of Series fastpath constructor

Dec 6, 2021

jbrockmendel

@jorisvandenbossche

@jorisvandenbossche

jbrockmendel

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel

@jorisvandenbossche

jreback

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls also rebase

- Bug in :meth:`DataFrame.loc.__setitem__` changing dtype when indexer was completely ``False`` (:issue:`37550`)
- Bug in :meth:`IntervalIndex.get_indexer_non_unique` returning boolean mask instead of array of integers for a non unique and non monotonic index (:issue:`44084`)
- Bug in :meth:`IntervalIndex.get_indexer_non_unique` not handling targets of ``dtype`` 'object' with NaNs correctly (:issue:`44482`)
- Bug in getting a column from a DataFrame with an object-dtype row index with datetime-like values: the resulting Series now preserves the exact object-dtype Index from the parent DataFrame (:issue:`42950`)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to 1.5

@jorisvandenbossche

@jorisvandenbossche

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request

Jul 13, 2022

@jorisvandenbossche @yehoshuadimarsky

Labels