Slice by column then by index fails if columns/rows are repeated. · Issue #6121 · pandas-dev/pandas

We've found a problem where selecting repeated columns from a DataFrame and then selecting repeated rows from the result fails with a "Cannot create BlockManager._ref_locs" assertion error.

The dataframe is very simple:

df = pd.DataFrame(np.arange(25.).reshape(5,5), index=['a', 'b', 'c', 'd', 'e'], columns=['a', 'b', 'c', 'd', 'e'])

And we pull the data out like this:

z = df[['a', 'c', 'a']]
z.ix[['a', 'c', 'a']]
Traceback (most recent call last):
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/ipython-1.1.0_1_ahl1-py2.7.egg/IPython/core/interactiveshell.py", line 2830, in run_code
    exec code_obj in self.user_global_ns, self.user_ns
  File "", line 1, in <module>
    z.ix[['a', 'c', 'a']]
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/indexing.py", line 56, in __getitem__
    return self._getitem_axis(key, axis=0)
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/indexing.py", line 744, in _getitem_axis
    return self._getitem_iterable(key, axis=axis)
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/indexing.py", line 816, in _getitem_iterable
    convert=False)
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/generic.py", line 1164, in take
    new_data = self._data.take(indices, axis=baxis)
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/internals.py", line 3366, in take
    ref_items=new_axes[0], axis=axis)
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/internals.py", line 2337, in apply
    do_integrity_check=do_integrity_check)
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/internals.py", line 1990, in __init__
    self._set_ref_locs(do_refs=True)
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/internals.py", line 2130, in _set_ref_locs
    'have _ref_locs set' % (block, labels))
AssertionError: Cannot create BlockManager._ref_locs because block [FloatBlock: [a], 1 x 3, dtype: float64] with duplicate items [Index([u'a', u'c', u'a'], dtype='object')] does not have _ref_locs set

If instead we take a copy of the intermediate step, then it works:

z = df[['a', 'c', 'a']].copy()
z.ix[['a', 'c', 'a']]
Out[89]:
    a   c   a
a   0   2   0
c  10  12  10
a   0   2   0

[3 rows x 3 columns]
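For anyone who wants to check this on their own install, here is a rough self-contained version of the two cases above. One assumption to flag: it uses .loc in place of .ix, because .ix was removed in later pandas releases, so this is a sketch of the comparison rather than a replay of the exact 0.13.0 failure. The names picked, without_copy and with_copy are just illustrative.

```python
import numpy as np
import pandas as pd

labels = ['a', 'b', 'c', 'd', 'e']
df = pd.DataFrame(np.arange(25.).reshape(5, 5), index=labels, columns=labels)

# Repeated labels, used first for the columns and then for the rows.
picked = ['a', 'c', 'a']

# Assumption: .loc stands in for .ix, which newer pandas no longer provides.
without_copy = df[picked].loc[picked]
with_copy = df[picked].copy().loc[picked]

# Both routes should give the same 3 x 3 frame, whatever the history of the
# intermediate object.
print(without_copy.equals(with_copy))
print(without_copy)
```

On a version affected by this report, I'd expect the without_copy line to be the one that raises.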

This means that if you have several functions which each do a part of the data processing, you need to know the history of an object to know whether a given operation on it will work. I think .ix should always succeed on a DataFrame or Series, regardless of how it was constructed.
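To make the pipeline point concrete, here is a minimal sketch of the kind of code I mean. The function names select_columns and select_rows are hypothetical, and .loc is used in place of .ix on the assumption of a newer pandas where .ix is gone. The .copy() in the first step is the workaround described above; the second step has no way of knowing whether the first one applied it.

```python
import numpy as np
import pandas as pd


def select_columns(frame, columns):
    """Hypothetical pipeline step: pick columns by label, repeats allowed."""
    # The .copy() is the workaround from this report; without it, the next
    # step is the one that failed on 0.13.0.
    return frame[columns].copy()


def select_rows(frame, rows):
    """Hypothetical pipeline step: pick rows by label, repeats allowed."""
    # This step cannot tell how `frame` was built, and should not need to.
    return frame.loc[rows]


labels = ['a', 'b', 'c', 'd', 'e']
df = pd.DataFrame(np.arange(25.).reshape(5, 5), index=labels, columns=labels)

result = select_rows(select_columns(df, ['a', 'c', 'a']), ['a', 'c', 'a'])
print(result.shape)
```

Having to put a defensive copy in every step that might return repeated labels is what I'd like to avoid.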

(I've read the discussion at #6056 about chained operations - but it's not something you can avoid if you have a pipeline of small steps instead of one big step).

This wasn't an issue in 0.11.0 but is failing in 0.13.0 and the latest master. Here's the output of installed versions when running on the master:

commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.18-308.el5
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB

pandas: 0.13.0-292-g4dcecb0
Cython: 0.16
numpy: 1.7.1
scipy: 0.9.0
statsmodels: None
patsy: None
scikits.timeseries: None
dateutil: 1.5
pytz: None
bottleneck: 0.6.0
tables: 2.3.1-1
numexpr: 2.0.1
matplotlib: 1.1.1
openpyxl: None
xlrd: 0.8.0
xlwt: None
xlsxwriter: None
sqlalchemy: None
lxml: 2.3.6
bs4: None
html5lib: None
bq: None
apiclient: None