str.cat does not align on index? · Issue #18657 · pandas-dev/pandas

The implicit index alignment that pandas performs for operations between different DataFrames/Series is great, and most of the time it just works. It does so consistently enough that I expect different Series to be aligned on their indices before any operation is performed.
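To make that expectation concrete, here is a minimal sketch of the alignment in question, using ordinary arithmetic. The Series names and values are made up for illustration; the only pandas feature relied on is `+` between two `pd.Series`.

```python
import pandas as pd

# Two Series whose indices overlap only partially (illustrative values)
left = pd.Series([1, 2, 3], index=[0, 1, 2])
right = pd.Series([10, 20, 30], index=[2, 1, 3])

# `+` matches values by index label, not by position; labels present in
# only one operand yield NaN in the result
total = left + right
print(total)
# 0     NaN
# 1    22.0
# 2    13.0
# 3     NaN
```

Label 2 pairs `3` with `10` even though the two values sit at different positions, which is the behaviour one would also hope for from `str.cat`.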

For some reason, str.cat does not seem to do so.

import pandas as pd # 0.21.0
import numpy as np # 1.13.3
col = pd.Series(['a','b','c','d','e','f','g','h','i','j'])

# choose random subsets
ss1 = [8, 1, 2, 0, 6] # list(col.sample(5).index) 
ss2 = [4, 0, 9, 2, 6] # list(col.sample(5).index)

# perform str.cat
col.loc[ss1].str.cat(col.loc[ss2], sep = '').sort_index()
# 0    ac <-- UNMATCHED!
# 1    ba <-- UNMATCHED!
# 2    cj <-- UNMATCHED!
# 6    gg <-- correct by sheer luck
# 8    ie <-- UNMATCHED!

# Compare this with, for example, Boolean operations on unaligned Series
# (which match on the indices and return a Series indexed by the union of both):
# str.cat is inconsistent with that behaviour!
b = col.loc[ss1].astype(bool) & col.loc[ss2].astype(bool)
b
# 0     True
# 1    False
# 2     True
# 4    False
# 6     True
# 8    False
# 9    False

# if we manually align the Series
# (easy here by masking from the Series we just subsampled, hard in practice),
# then the NaNs are handled as expected:
m = col.where(np.isin(col.index, ss1)).str.cat(col.where(np.isin(col.index, ss2)), sep = '')
m
# 0     aa
# 1    NaN
# 2     cc
# 3    NaN
# 4    NaN
# 5    NaN
# 6     gg
# 7    NaN
# 8    NaN
# 9    NaN

# based on the normal pandas behaviour for unaligned Series
# (for example the Boolean "and" above), the following would be
# the expected result of col.loc[ss1].str.cat(col.loc[ss2], sep = '').sort_index() !
m.loc[b.index]
# 0     aa <-- MATCHED!
# 1    NaN
# 2     cc <-- MATCHED!
# 4    NaN
# 6     gg <-- MATCHED!
# 8    NaN
# 9    NaN
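One possible workaround that avoids the masking step, sketched here on the assumption that the union of both indices is the desired result: align the two Series explicitly with `Series.align` and only then call `str.cat`. The helper name `cat_aligned` is hypothetical and not part of pandas.

```python
import pandas as pd

col = pd.Series(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
ss1 = [8, 1, 2, 0, 6]
ss2 = [4, 0, 9, 2, 6]

def cat_aligned(left, right, sep=''):
    """Hypothetical helper: align both Series on the union of their
    indices, then concatenate element-wise."""
    # After an outer align both Series share the same index, so the
    # concatenation pairs values by label; labels missing from one side
    # are filled with NaN
    left_al, right_al = left.align(right, join='outer')
    return left_al.str.cat(right_al, sep=sep).sort_index()

result = cat_aligned(col.loc[ss1], col.loc[ss2])
print(result)
# 0     aa
# 1    NaN
# 2     cc
# 4    NaN
# 6     gg
# 8    NaN
# 9    NaN
```

This reproduces `m.loc[b.index]` from above. Newer pandas releases (0.23 and later) also accept a `join` argument on `str.cat`, for example `join='outer'`, which should make the explicit `align` call unnecessary there.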