BUG: Fix Series constructor for Categorical with index by cbertinato · Pull Request #19714 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation39 Commits9 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

Fixes Series constructor so that ValueError is raised when a Categorical and index of incorrect length are given. Closes issue #19342

tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thanks.

@@ -690,6 +690,7 @@ Categorical
- Bug in :meth:`Index.astype` with a categorical dtype where the resultant index is not converted to a :class:`CategoricalIndex` for all types of index (:issue:`18630`)
- Bug in :meth:`Series.astype` and ``Categorical.astype()`` where an existing categorical data does not get updated (:issue:`10696`, :issue:`18593`)
- Bug in :class:`Index` constructor with ``dtype=CategoricalDtype(...)`` where ``categories`` and ``ordered`` are not maintained (issue:`19032`)
- Bug in :class:`Series` constructor with ``Categorical`` where an error is not raised when an index of incorrect length is given (:issue:`19342`)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe say "index of different length". It could be the categorical that's incorrect :)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good point. Will do.

map(lambda x: x, range(3))])
def test_constructor_index_mismatch(self, input):
# GH 19342
pytest.raises(ValueError, Series, input, index=np.arange(4))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also check the error message.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do!

Hello @cbertinato! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on February 26, 2018 at 14:53 Hours UTC

# raises an error
idx = np.arange(4)
if compat.PY2:
typs = types.GeneratorType

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need all of this, its just confusing just to construct an error message. just do a simpler check on the error.

@@ -210,6 +210,11 @@ def __init__(self, data=None, index=None, dtype=None, name=None,
raise ValueError("cannot specify a dtype with a "
"Categorical unless "
"dtype='category'")
if index is not None and len(index) != len(data):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should go a little further down after

if index is None:
    ....
else:
      # this the check here

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

though maybe this should go in _sanitize_array around L3242

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see an advantage to moving it into _sanitize_array versus putting it in the if at L230. But I could be missing something. What do you think?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it’s prob ok here but a bit lower
early failure is good

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Proposed a placement in this next push. Placing it in the

at ~L226 as an else breaks some tests if data is a scalar or SingleBlockManager. It's difficult to catch all cases if we put it in _sanitize_array because of the returns in the if cases. So the next best place appears to be after the call to _sanitize_array. Not ideal with regard to early failure, but really the only place that I can see to put a single check just to be able to do len(data). Not much different from the current location except that it catches cases other than Categorical.

@@ -690,6 +690,7 @@ Categorical
- Bug in :meth:`Index.astype` with a categorical dtype where the resultant index is not converted to a :class:`CategoricalIndex` for all types of index (:issue:`18630`)
- Bug in :meth:`Series.astype` and ``Categorical.astype()`` where an existing categorical data does not get updated (:issue:`10696`, :issue:`18593`)
- Bug in :class:`Index` constructor with ``dtype=CategoricalDtype(...)`` where ``categories`` and ``ordered`` are not maintained (issue:`19032`)
- Bug in :class:`Series` constructor with ``Categorical`` where an error is not raised when an index of different length is given (:issue:`19342`)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put this in reshaping

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you merge in master and fix the merge conflict. Also a couple linting errors.

@@ -844,6 +844,7 @@ Reshaping
- Improved error message for :func:`DataFrame.merge` when there is no common merge key (:issue:`19427`)
- Bug in :func:`DataFrame.join` which does an outer instead of a left join when being called with multiple DataFrames and some have non-unique indices (:issue:`19624`)
- :func:`Series.rename` now accepts ``axis`` as a kwarg (:issue:`18589`)
- Bug in :class:`Series` constructor with ``Categorical`` where an error is not raised when an index of different length is given (:issue:`19342`)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify error -> ValueError

@@ -5,6 +5,7 @@

from datetime import datetime, timedelta
from collections import OrderedDict
import types

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These imports aren't needed now.

@@ -238,6 +239,11 @@ def __init__(self, data=None, index=None, dtype=None, name=None,
data = _sanitize_array(data, index, dtype, copy,
raise_cast_failure=True)

if index is not None and len(index) != len(data):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should go a touch higher,

            if index is None:
                if not is_list_like(data):
                    data = [data]
                index = com._default_index(len(data))
           else:
                # add here

            # create/copy the manager
            if isinstance(data, SingleBlockManager):
                if dtype is not None:
                    data = data.astype(dtype=dtype, errors='ignore',
                                       copy=copy)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Added a scalar check that lets scalars through, so we are assuming that the Series is shaped correctly when the scalar is broadcast to fit the index, which is probably ok.

if index is None:
    ...
else:
    if isscalar(data) and len(index) != len(data):
        ...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind. A few other inputs that break. np.array and np.dtype appear to be just two of them. Unless we add specific checks for these, I think we may need to move it lower, below _sanitize_array.

@@ -226,6 +227,11 @@ def __init__(self, data=None, index=None, dtype=None, name=None,
if not is_list_like(data):
data = [data]
index = com._default_index(len(data))
else:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can make this an elif here

i rebased. ping on green.

# a scalar numpy array is list-like but doesn't
# have a proper length
try:
if len(data) > 1 and len(index) != len(data):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, did this change? 0-len should be ok, can you add a test

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0-len gets caught deeper, in the SingleBlockManager, but len 1 gets caught here. It should be let through to be broadcast in _sanitize_array. I'll add a test.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, would be ok with catching both cases here, or are they different?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I think catching the 0-len case here would be good for consistency. We don't want to catch the len 1 case because it will get broadcast, so it will look something like:

                # a scalar numpy array is list-like but doesn't
                # have a proper length
                try:
                    if len(data) != 1 and len(index) != len(data):

Unless the intention is not to broadcast a list-like of length 1. One could argue that it would be better to raise an error instead of broadcasting. If one wanted to broadcast a scalar, then just pass a scalar.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you would have to show the test which fails for this, len(data) == 0 is valid

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test test_apply_subset in tests/io/formats/test_style.py raises an error. The traceback indicates the input to the Series constructor is:

data = ['color: baz'], index = RangeIndex(start=0, stop=2, step=1), dtype = None

This should be valid. Checking that len(data) != 1 lets this case pass.

@@ -418,8 +418,8 @@ def test_constructor_numpy_scalar(self):
# GH 19342
# construction with a numpy scalar
# should not raise
result = Series(np.array(100), index=np.arange(4))
expected = Series(100, index=np.arange(4))
result = Series(np.array(100), index=np.arange(4), dtype='int64')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh, ok thanks

# a scalar numpy array is list-like but doesn't
# have a proper length
try:
if len(data) > 1 and len(index) != len(data):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you would have to show the test which fails for this, len(data) == 0 is valid

# a scalar numpy array is list-like but doesn't
# have a proper length
try:
if len(data) != 1 and len(index) != len(data):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still not convinced about this, what fails for len(data)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remove len(data) != 1 and run python -m pytest pandas/tests/io/formats/test_style.py I get:

                try:
                    if len(index) != len(data):
                        raise ValueError(
                            'Length of passed values is {val}, '
                            'index implies {ind}'
>                           .format(val=len(data), ind=len(index)))
E                           ValueError: ('Length of passed values is 1, index implies 2', 'occurred at index A')

pandas/core/series.py:246: ValueError

I should add a test for this case in test_constructors.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok add a test in test_constructors. which is this failing on in test_style?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_apply_subset. Input to the Series constructor is:

data = ['color: baz'],
index = RangeIndex(start=0, stop=2, step=1), 
dtype = None, 
name = 'A', 
copy = False, 
fastpath = False

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this should raise! if the data is a scalar this is ok, but we can't broadcast a list like that (well we can, but we shoudn't)

In [2]: data = ['color: baz']
   ...: index = pd.RangeIndex(start=0, stop=2, step=1)
   ...: 

In [3]: pd.Series(data, index)
Out[3]: 
0    color: baz
1    color: baz
dtype: object

Fixes Series constructor so that ValueError is raised when a Categorical and index of different length are given.

I agree. It shouldn’t broadcast a list like that. We can remove the check and see if there’s anywhere else where this breaks. If not, then fix the test in test_style?

I wasn’t sure whether anything else relied on this behavior.

I agree. It shouldn’t broadcast a list like that. We can remove the check and see if there’s anywhere else where this breaks. If not, then fix the test in test_style?

yes, and add this as an additional test in test_constructor.

Modified test setup in io/formats/test_style.py accordingly

thanks @cbertinato sometimes the seemingly small changes are hard!

Thanks for the help and advice!

harisbal pushed a commit to harisbal/pandas that referenced this pull request

Feb 28, 2018