BUG: fix multi-column sort that includes Categoricals / concat (GH7848/GH7864) by jreback · Pull Request #7850 · pandas-dev/pandas
OK, I now see where the reverse sorting happens in _codes_as_ordered (that was too much magic for my amateur numpy knowledge), but I would still vote for inlining this function.
There are also still two bugs, which didn't show up in the unittests :-( :
edit: The second (and third) is the problem with the "-1 to 0/len(levels)" conversion. Not sure why the first shows up, but I think it is because np.where(mask, n, n-codes-1) only works in some specific cases, e.g. [0,1,2,3,4] becomes [5-0-1, 5-1-1, ...] = [4,3,2,1,0], but [0,1,1,2,3] becomes [3,2,2,1,0].
class TestCategorical(tm.TestCase):
    [....]
    def test_sort(self):
        [...]
        # reverse
        cat = Categorical(["a", "c", "c", "b", "d"], ordered=True)
        res = cat.order(ascending=False)
        exp_val = np.array(["d", "c", "c", "b", "a"], dtype=object)
        exp_levels = np.array(["a", "b", "c", "d"], dtype=object)
        # FIXME: res.__array__() ends up as ['d' 'c' 'b' 'b' 'a']
        # self.assert_numpy_array_equal(res.__array__(), exp_val)
        self.assert_numpy_array_equal(res.levels, exp_levels)

        # some NaN positions
        cat = Categorical(["a", "c", "b", "d", np.nan], ordered=True)
        res = cat.order(ascending=False, na_position='last')
        exp_val = np.array(["d", "c", "b", "a", np.nan], dtype=object)
        exp_levels = np.array(["a", "b", "c", "d"], dtype=object)
        # FIXME: IndexError: Out of bounds on buffer access (axis 0)
        # self.assert_numpy_array_equal(res.__array__(), exp_val)
        self.assert_numpy_array_equal(res.levels, exp_levels)

        cat = Categorical(["a", "c", "b", "d", np.nan], ordered=True)
        res = cat.order(ascending=False, na_position='first')
        exp_val = np.array([np.nan, "d", "c", "b", "a"], dtype=object)
        exp_levels = np.array(["a", "b", "c", "d"], dtype=object)
        # FIXME: IndexError: Out of bounds on buffer access (axis 0)
        # self.assert_numpy_array_equal(res.__array__(), exp_val)
        self.assert_numpy_array_equal(res.levels, exp_levels)
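
To make the "-1 to len(levels)" problem concrete, here is a minimal NumPy-only sketch. It is not the code from this PR: the helpers `reversed_codes`, `order_descending` and `decode` are hypothetical names for illustration, and the sketch assumes that a code of -1 marks a missing value. It shows that the flipped codes can contain `len(levels)`, which is not a valid index into the levels, and one possible way to get a descending order that leaves the codes untouched.

```python
import numpy as np


def reversed_codes(codes, n_levels):
    # Hypothetical helper mirroring np.where(mask, n, n - codes - 1).
    # Assumption: a code of -1 marks a missing value.
    codes = np.asarray(codes)
    mask = codes == -1
    return np.where(mask, n_levels, n_levels - codes - 1)


def order_descending(codes, na_position="last"):
    # Hypothetical alternative: sort the valid codes, reverse them, and
    # put the -1 entries at the requested end without rewriting any code.
    codes = np.asarray(codes)
    mask = codes == -1
    valid = np.sort(codes[~mask])[::-1]
    missing = codes[mask]
    if na_position == "first":
        return np.concatenate([missing, valid])
    return np.concatenate([valid, missing])


def decode(codes, levels):
    # Map codes back to values; -1 becomes NaN instead of indexing levels.
    codes = np.asarray(codes)
    mask = codes == -1
    out = np.empty(len(codes), dtype=object)
    out[~mask] = levels[codes[~mask]]
    out[mask] = np.nan
    return out


levels = np.array(["a", "b", "c", "d"], dtype=object)
codes = np.array([0, 2, 1, 3, -1])  # "a", "c", "b", "d", NaN

# The flipped codes map the missing value to len(levels), so using them
# directly as indices into levels would be out of bounds.
flipped = reversed_codes(codes, len(levels))
print(flipped.max() >= len(levels))

# Ordering the original codes keeps every non-missing code a valid index.
print(decode(order_descending(codes, "last"), levels))
print(decode(order_descending(np.array([0, 2, 2, 1, 3])), levels))
```

With this approach the duplicated "c" in the first test case stays duplicated, and the NaN cases never index past the end of the levels, because the codes are only reordered and never rewritten.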