Remove use_nullable_dtypes and add dtype_backend keyword by phofl · Pull Request #51853 · pandas-dev/pandas (original) (raw)

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@mroeschke this is heavy unfortunately. Should be ready for a first look for wording and similar, I expect tests to fail since http tests timed out locally which made running tests tricky. Will fix when ci is through

I think we could split theoretically,

sql
arrow stuff (orc, parquet, feather)
to_numeric and convert_dtypes
everything else

Let me know what you prefer

phofl mentioned this pull request

Mar 9, 2023

Looks good in general. I do think it'd make things much easier to split this in multiple PRs. In particular for you, and we could work in the different components in parallel, so you don't need to do all the work yourself. In any case, great that you already have this, and since it looks like the CI is almost green here, also ok to merge all together, makes reviewing more complex than by parts, but fine with me.

Only to general questions:

Why was it decided to use nodefault instead of "numpy" for the numpy dtypes? Seeing the PR seems to make things more complex and difficult to understand than using dtype_backend='numpy' by default.
Why in some places we're using use_backend_dtype: bool? Again, seems like propagating dtype_backend and checking if different from numpy would be clearer (if that's what it means to use dtype backend, which is not very clear)

I guess I'm missing something, I don't see the advantage.

Answers to your two questions:

We agreed that no one should set numpy explicitly as dtype_backend right now, so we don’t allow it. I’ll create a follow up pr raising on invalid inputs after this is in.

i am using use_dtype_backend in situations where there is a second condition involved (like in read_csv where a supplied dtype takes precedence over the dtype_backend) or when we’d have to check many times for the default, this made it less readable to me

We agreed that no one should set numpy explicitly as dtype_backend right now

I see. Is this with the idea to eventually deprecate the numpy backend? Or why don't we want users to be able to use dtype='numpy'?

i am using use_dtype_backend in situations where there is a second condition involved (like in read_csv where a supplied dtype takes precedence over the dtype_backend) or when we’d have to check many times for the default, this made it less readable to me

Makes sense, thanks for clarifying, it wasn't obvious from the diff.

I see. Is this with the idea to eventually deprecate the numpy backend? Or why don't we want users to be able to use dtype='numpy'?

I think the rational is to setup a future where we can more easily change the default dtype backend to numpy_nullable or pyarrow.

Co-authored-by: Matthew Roeschke 10647082+mroeschke@users.noreply.github.com

columns,
coerce_float: bool = True,
use_nullable_dtypes: bool = False,
dtype_backend: DtypeBackend \| Literal["numpy"] = "numpy",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these internal usages of dtype_backend, is there a particular reason why you're using "numpy" instead of no_default?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are checking dtype_backend == "numpy" quite often in sql and this reads better than checking dtype_backend is/is not lib.no_default. This clutters the if conditions too much imo.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay that's fine then. Can you make sure we have tests where our public facing function with dtype_backend="numpy" raises a ValueError?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't yet. I wanted to a check for all functions in a follow up. This would have made this pr even larger.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Let make sure we have that checking before we release the RC

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ll start a pr that builds on top of this one

I think the rational is to setup a future where we can more easily change the default dtype backend to numpy_nullable or pyarrow.

@mroeschke @phofl I'm still a bit confused, sorry. If we allow dtype_backend='numpy then users can use that, and make sure that a future change in the pandas default won't change their backend. I fail to see how removing the numpy backend option is helping.

@jorisvandenbossche mind chiming in with your original reason for not having a dtype_backend="numpy" option (we didn't document this sufficiently in our meeting notes)

How would this interact with the dtype keyword for functions like read_csv?

Should that overwrite dtype backend or should the pyarrow equivalent be returned?
e.g.

>>> import pandas as pd
>>> pd.options.mode.nullable_dtypes = True
>>> pd.options.mode.dtype_backend = "pyarrow"
>>> from io import StringIO
>>> pd.read_csv(StringIO("a,b,c\n1,2,3"), dtype="int64")

Explicit dtypes take precedence

My bad, did not read comment above.

This is also the most logical thing. A possible use case: users want numpy dtypes except string columns, they should be arrow backed for example. There are probably more where Mixing to a certain extent is beneficial

Getting back to numpy backend: right now a numpy backend doesn’t make any sense, it’s the default. Nobody should have to request it explicitly. In a future where we potentially switch the backend option (not sure if/when we get there) we have to support the new default backend everywhere. This would also mean that we would need a global option to opt into, this is where a numpy backend could make sense, depending on what we want to do exactly. At this point we can either get rid of the keyword or add a new option (probably the first). So right now it doesn’t help anyone and also doesn’t block anything in the future

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes here LGTM

So right now it doesn’t help anyone and also doesn’t block anything in the future

Yes, that sums it up for me. There is really no reason that anyone should now explicitly do dtype_backend="numpy", so why would we then allow it?
Not having it keeps it more flexible to do whatever we want later (we can still add it later if at some point that makes sense).

(one could also argue about the name "numpy" being slightly confusing/incorrect, since they are not all numpy dtypes (categorical, datetimetz, interval, etc), although they all use "numpy semantics", one could say)

implementation, even if no nulls are present.
dtype_backend : {"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames.
Which dtype backend to use. If
set to True, nullable dtypes or pyarrow dtypes are used for all dtypes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is True also an option here?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, this shouldn't accept True

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, updated

msg += (
"Use dtype_backend='numpy_nullable' instead of use_nullable_dtype=True."
)
warnings.warn(msg, FutureWarning, stacklevel=find_stack_level())

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would people think about not yet actually deprecating this keyword? (for parquet (and feather?), i.e. where it already existed)
That gives us a bit more time to see if what we decide now with dtype_backed is actually the way to go.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm we could still un-deprecate this later in an 2.x release. I think I would rather prefer being consistent now with dtype_backend rather than allowing both use_nullable_dtypes and dtype_backend

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah agreed I'd rather make this consistent as well

Comment on lines 68 to 66

use_nullable_dtypes : bool, default False
Whether or not to use nullable dtypes as default when reading data. If
set to True, nullable dtypes are used for all dtypes that have a nullable
implementation, even if no nulls are present.

.. note::

The nullable dtype implementation can be configured by calling
``pd.set_option("mode.dtype_backend", "pandas")`` to use
numpy-backed nullable dtypes or
``pd.set_option("mode.dtype_backend", "pyarrow")`` to use
pyarrow-backed nullable dtypes (using ``pd.ArrowDtype``).
dtype_backend : {"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames
Which dtype_backend to use, e.g. whether a DataFrame should have NumPy
arrays, nullable extension dtypes or pyarrow dtypes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commenting here, but it's about the docstring addition in general:

I would keep the notion of "nullable dtypes are used for all dtypes that have a nullable implementation, even if no nulls are present." for the case of "nullable_numpy" (and can mention that for pyarrow it is used for all dypes)
I would leave out the "extension", since also the pyarrow ones are based on extension dtypes. And also some of the default numpy ones are as well actually, so the "should have NumPy arrays" is not fully correct (but that's maybe too nitpicky for the docstring)
I think we should mention everywhere briefly that this is still experimental to set.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the docstrings

Merging this to rebase the other pr. Happy to address comments regarding wording in a follow up

phofl deleted the dtype_backend branch

March 13, 2023 18:20

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

Checkout backport branch and update it.

git checkout 2.0.x
git pull

Cherry pick the first parent branch of the this PR on top of the older branch:

git cherry-pick -x -m1 ab76540ada429ffb6f10785f1623c47a997b5763

You will likely have some merge/cherry-pick conflict here, fix them and commit:

git commit -am 'Backport PR #51853: Remove use_nullable_dtypes and add dtype_backend keyword'

Push to a named branch:

git push YOURFORK 2.0.x:auto-backport-of-pr-51853-on-2.0.x

Create a PR against branch 2.0.x, I would have named this PR:

"Backport PR #51853 on branch 2.0.x (Remove use_nullable_dtypes and add dtype_backend keyword)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

phofl added a commit to phofl/pandas that referenced this pull request

Mar 13, 2023

phofl mentioned this pull request

Mar 13, 2023

5 tasks

phofl added a commit that referenced this pull request

Mar 13, 2023

…d dtype_backend keyword) (#51935)

Remove use_nullable_dtypes and add dtype_backend keyword (#51853)

Remove use_nullable_dtypes and add dtype_backend keyword by phofl · Pull Request #51853 · pandas-dev/pandas (original) (raw)

Conversation

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Labels