BUG: IntegerArray/FloatingArray constructors mismatched NAs by jbrockmendel · Pull Request #44514 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation27 Commits17 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

jbrockmendel

Same yak-shaving as #44495 (which turned out to be a dead end for this particular yak, but still a perf bump)

@jbrockmendel

jreback

jorisvandenbossche

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The setitem change / bug fix is unrelated to the constructor fix? Or it's because you are testing that through setitem as well?

mask2 = isna(values)
if not (mask == mask2).all():
# e.g. if we have a timedelta64("NaT")
raise TypeError(f"{values.dtype} cannot be converted to a FloatingDtype")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, could it be libmissing.is_numeric_na that already raises on encountering a "non-numeric NA"? (is there a use case for is_numeric_na to not be strict about this, i.e. to get a "partial" mask?)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were you planning to address this one?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

planning to address in a follow-up

@jbrockmendel

The setitem change / bug fix is unrelated to the constructor fix? Or it's because you are testing that through setitem as well?

The setitem bug was identified first and the cause tracked back to the constructor.

jreback

@jreback

@jbrockmendel

jorisvandenbossche

@@ -357,6 +364,31 @@ def test_setitem_series(self, data, full_indexer):
)
self.assert_series_equal(result, expected)
def test_setitem_frame_2d_values(self, data, using_array_manager, request):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this test out of the extension base tests, or remove the need to use using_array_manager? (this is not defined by external users of those tests, and would be a bit annoying to replicate)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

@jbrockmendel

gentle ping; this is a blocker for fixing a bug in Series.where, which in turn should allow us to share some more Block methods.

@jbrockmendel

jorisvandenbossche

df = pd.DataFrame({"A": data})
# Avoiding using_array_manager fixture
# https://github.com/pandas-dev/pandas/pull/44514#discussion\_r754002410

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for changing to not use the fixture. But generally, is this actually needed to have as base extension test? (since the fix was inside the Blocks code, it's not really testing a specific behaviour that the EA needs to have?)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this actually needed to have as base extension test?

This seems like the best way to systematically test it for all EA dtypes.

it's not really testing a specific behaviour that the EA needs to have?

That seems like it applies to most tests that aren't directly testing EA methods.

@jorisvandenbossche

Two questions, but feel free to merge as well

@jbrockmendel

@jbrockmendel

@jbrockmendel

@jbrockmendel

updated to raise inside is_numeric_na

@jreback

@jbrockmendel

i think comments have all been addressed here? bugfix follow-up is ready.

This was referenced

Nov 30, 2021

@jorisvandenbossche

@jbrockmendel

Yah, looks like a lot of time is being taken in is_numeric_na.

from asv_bench.benchmarks.groupby import *
self = Cumulative()
self.setup("Float64", "cumsum")

%timeit self.time_frame_transform("Float64", "cumsum")

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.791    1.791 {built-in method builtins.exec}
        1    0.000    0.000    1.791    1.791 <string>:1(<module>)
       10    0.014    0.001    1.791    0.179 groupby.py:569(time_frame_transform)
       10    0.000    0.000    1.773    0.177 generic.py:1179(transform)
       10    0.000    0.000    1.773    0.177 groupby.py:1608(_transform)
       10    0.000    0.000    1.773    0.177 groupby.py:3117(cumsum)
       10    0.000    0.000    1.773    0.177 generic.py:1103(_cython_transform)
       10    0.000    0.000    1.762    0.176 managers.py:1266(grouped_reduce)
       50    0.000    0.000    1.762    0.035 blocks.py:381(apply)
       50    0.000    0.000    1.758    0.035 generic.py:1118(arr_func)
       50    0.000    0.000    1.758    0.035 ops.py:919(_cython_operation)
       50    0.000    0.000    1.699    0.034 ops.py:587(cython_operation)
       50    0.001    0.000    1.697    0.034 ops.py:319(_ea_wrap_cython_operation)
       50    0.000    0.000    1.559    0.031 ops.py:376(_reconstruct_ea_result)
       50    0.000    0.000    1.558    0.031 floating.py:263(_from_sequence)
       50    0.001    0.000    1.557    0.031 floating.py:85(coerce_to_array)
       50    1.551    0.031    1.551    0.031 {pandas._libs.missing.is_numeric_na}
       50    0.000    0.000    0.134    0.003 ops.py:434(_cython_op_ndim_compat)

Looks like we are calling is_numeric_na in cases where we have float64 dtype, so can just use np.isnan. Should be an easy patch.