ENH: add in extension dtype registry by jreback · Pull Request #21185 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation63 Commits22 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

jreback

@codecov

jbrockmendel

^^^^^^^^^^^^^^^^^^^^^
- ``ExtensionArray`` has gained the abstract methods ``.dropna()`` and ``.append()``, and attribute ``array_type`` (:issue:`21185`)
- ``ExtensionDtype`` has gained the ability to instantiate from string dtypes, e.g. ``decimal`` would instaniate a registered ``DecimalDtype`` (:issue:`21185`)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo instantiate

Dr-Irv

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that there are changes in this PR that go beyond just adding in the extension dtype registry. Wasn't your goal to just make this PR about the registry?

@jreback

split out ops to a separate PR #21191

jorisvandenbossche

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't look at all the tests yet.

In addition:

ExtensionType Changes
^^^^^^^^^^^^^^^^^^^^^
- ``ExtensionArray`` has gained the abstract methods ``.dropna()`` and ``.append()``, and attribute ``array_type`` (:issue:`21185`)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the addition of dropna and append orthogonal to the registry, or are they in some way needed for that?

I think we should be hesitant in adding more and more methods to the ExtensionArray if they are not strictly needed

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, these are necessary for followons. The api is not fleshed out enough.

"""Construct a new ExtensionArray from a sequence of scalars.
Parameters
----------
scalars : Sequence
Each element will be an instance of the scalar type for this
array, ``cls.dtype.type``.
copy : boolean, default True
if True, copy the underlying data

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this exactly mean?
Often, the scalars objects are either not stored as is, and then copy or not copy makes no difference. Or if they are stored, they don't necessarily have a 'copy' method.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same guarantee with copy we have already, if its True, then copy if possible.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add that to the docstring?

* _concat_same_type
* array_type

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not an attribute of the array class I think?

# dispatch on extension dtype if needed
if is_extension_array_dtype(dtype):
return dtype.array_type._from_sequence(arr, copy=copy)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should there be any validation about arr ? Eg that it is 1D?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this validation is a contract of the Array itself

"""
Parameters
----------
dtype : PandasExtension Dtype

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you document constructor parameter?
and "PandasExtension Dtype" -> "ExtensionDtype"

----------
dtype : PandasExtension Dtype
"""
if not issubclass(dtype, (PandasExtensionDtype, ExtensionDtype)):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't ExtensionDtype be enough? (our internals one subclass both I think?)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not yet, are extension dtypes are not all ExtensiDtypes yet (this is why we have the whole PandasExtensionDtype bizness)

if string.startswith('period[') or string.startswith('Period['):
# do not parse string like U as period[U]
return PeriodDtype.construct_from_string(string)
raise TypeError("could not construct PeriodDtype")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider rather changing construct_from_string itself? As those don't really adhere to the ExtensionDtype requirements I think?

If we keep both, I would make this one private.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

class BaseOpsTests(BaseExtensionTests):
"""Various Series and DataFrame ops methos."""
pass

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a left-over from splitting it from the previous PR?

@jreback

@jreback

see if can make this dispatch to create_array_type as a function

@jreback

jorisvandenbossche

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some more comments.

From my previous review:

We need to discuss what is the exact contract for this registry and array_type: where is it used, and how is it used?

Can you add some explanation about this to the docs/docstrings? (eg that it relies on _from_sequence)

"""Construct a new ExtensionArray from a sequence of scalars.
Parameters
----------
scalars : Sequence
Each element will be an instance of the scalar type for this
array, ``cls.dtype.type``.
copy : boolean, default True
if True, copy the underlying data

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add that to the docstring?

type
"""
if array is None:
return cls

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think cls is never the correct return? (since it is a dtype class, while this should return an array class?)

""" Registry for dtype inference
We can directly construct dtypes in pandas_dtypes if they are
a type; the registry allows us to register an extension dtype

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sentence "We can directly construct dtypes in pandas_dtypes if they are a type" is not really clear to me. What does "if they are a type" mean? To which context does this refer?

'dtype, expected',
[('int64', None),
('interval', IntervalDtype()),
('interval[int64]', IntervalDtype()),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you replace this (or also add) one that is not the default?
Eg ('interval[datetime64[ns]]', IntervalDtype('datetime64[ns]')),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(although that is maybe covered with the period or datetime64 one below. Is 'interval[datetime64[ns]]' already tested for IntervalDtype?)

dtype = data.dtype
expected = pd.Series(data)
result = pd.Series(np.array(data), dtype=dtype)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a restriction on what the result of __array__ should be? Maybe not necessarily the scalars?

So maybe better to do list(data) or data.astype(object)

@pytest.mark.xfail(reason="not implemented constructor from dtype")
def test_from_dtype(self, data):
# construct from our dtype & string dtype
pass

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you actually implement this? (only change needed would be to register DecimalDtype I think?)
As that way we actually have a test for external dtypes registering?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do in subsequent PR's trying to keep this diff down.

# TODO(extension)
@pytest.mark.xfail(reason=(
"raising AssertionError as this is not implemented, "
"though easy enough to do"))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can change the AssertionError to a ValueError in the code, and then we can still test this.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is ok for now

@pytest.mark.xfail(reason="not implemented constructor from dtype")
def test_from_dtype(self, data):
# construct from our dtype & string dtype
pass

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we test here that an appropriate error is raised (instead of skipping the test) ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above

@@ -109,6 +109,11 @@ class ExtensionDtype(_DtypeOpsMixin):
* name
* construct_from_string
Optionally one can override construct_array_type for construction
with the name of this dtype via the Registry

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would give some additional explanation of what the Registry is, because now this is not explained here?

ExtensionType Changes
^^^^^^^^^^^^^^^^^^^^^
- ``ExtensionArray`` has gained the abstract methods ``.dropna()`` and ``.append()`` (:issue:`21185`)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said before, I am not convinced we should necessarily add those methods (certainly append), so would personally leave this for another PR to discuss

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with not adding append.

Could go either way on dropna.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i need them for subsquent PR's I guess will move them.

TomAugspurger

ExtensionType Changes
^^^^^^^^^^^^^^^^^^^^^
- ``ExtensionArray`` has gained the abstract methods ``.dropna()`` and ``.append()`` (:issue:`21185`)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with not adding append.

Could go either way on dropna.

- ``ExtensionArray`` has gained the abstract methods ``.dropna()`` and ``.append()`` (:issue:`21185`)
- ``ExtensionDtype`` has gained the ability to instantiate from string dtypes, e.g. ``decimal`` would instantiate a registered ``DecimalDtype``; furthermore
the dtype has gained the ``construct_array_type`` (:issue:`21185`)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'the dtype' -> ExtensionDtype has gained :meth:`ExtensionDtype.construct_array_type`

.. _whatsnew_0240.api.other:
Other API Changes
^^^^^^^^^^^^^^^^^
-
- Invalid consruction of ``IntervalDtype`` will now always raise a ``TypeError`` rather than a ``ValueError`` if the subdtype is invalid (:issue:`21185`)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consruction -> construction.

@@ -379,6 +383,16 @@ def fillna(self, value=None, method=None, limit=None):
new_values = self.copy()
return new_values
def dropna(self):
""" Return ExtensionArray without NA values

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the leading space. Add a trailing .

Parameters
----------
array : array-like, optional

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche by "this" you mean the array argument right? I'm also wondering that.

@@ -8,6 +8,64 @@
from .base import ExtensionDtype, _DtypeOpsMixin
class Registry(object):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without looking at the uses yet, could we simplify this a by just allowing string lookup? Ideally, registry would be a simple class holding a dict mapping dtype.name -> ExtensionDtype.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current code also supports finding the dtype for eg 'interval[int64]' and not just interval (so parametrized versions of the strings)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and I suppose we want that to support .astype('interval[int64]'). That's fair...

@@ -800,7 +800,7 @@ def astype(self, dtype, copy=True):
@cache_readonly
def dtype(self):
"""Return the dtype object of the underlying data"""
return IntervalDtype.construct_from_string(str(self.left.dtype))
return IntervalDtype(str(self.left.dtype))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left.dtype.name? I'm not sure when these differ, but using .name seems safer.

@jreback

@Dr-Irv

@jreback I'm wondering if there is a meta issue with the registry that should be thought about. Let's say that pandas registers 10 types - they are always in the registry for all pandas users. Now let's say you have 2 different people who independently extend pandas with the EA capabilities creating libraries ABC and DEF. But they happen to both call their extension array type MyEAType . Now if someone wants to use both libraries ABC and DEF, there will be a collision in the registry of MyEAType. So does there need to be some "master" registry in the documentation containing the names of all EA types that people have registered?

I realize this is a highly unlikely occurrence.

Put another way, shouldn't there be code for the registry that says "this type has already been registered"?

Also, the implementation of the registry is using a list. Shouldn't you use a dict or set?

@jreback

@Dr-Irv you don;t need to register the EA types, this is only for the purpose of translating a string dtype name -> EA dtype (e.g. 'categorical' -> Categorical), basically for convenience.

so sure you could have a collision, but then user would have to deal with that (this wouldn't show up on pandas because we won't have colliding names).

@jreback

@Dr-Irv I had this an OrderdedDict but its actually not necessary, and is actually not correct. If we have a string, we have to ask the dtypes can you construct a type from it. The issue is that multiple strings can map to a dtype when its parameterized, e.g. datetime64[ns, US/Eastern] is a valid string dtype as is datetime64[ns, Asia/Tokyo] and w/o imbuing the knowledge to say its 'good' inside the Register (bad idea), we can't know something works unless we duck type it.

@Dr-Irv

@jreback Thanks for the clarification of the need versus require of registration of EA types.

On the implementation, wouldn't OrderedDict be faster once there are a lot of types?

TomAugspurger

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from adding EA.append, I'm +1 on this.

Given that NumPy doesn't implement it, I don't think we should either.

@jreback

I had removed append FYI (it should not longer be in this PR)

@jreback

any final comments. going to rebase.

@jreback

@jreback

@jreback

TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request

Aug 14, 2018

@TomAugspurger

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request

Oct 1, 2018

@jreback