BUG: read_excel return empty dataframe when using usecols by jacksonjos · Pull Request #20480 · pandas-dev/pandas (original) (raw)
Take a look at the examples below comparing the current pandas behavior to the new one,
so maybe is easier to understand.
@jreback, do you get it what is happening here?
# CURRENT BEHAVIOR
In [5]: pd.read_excel('pandas-jackson/pandas/tests/io/data/test1.xlsx', usecols=[0, 2, 3])
Out[5]:
B C
2000-01-03 3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05 0.731168 -0.537677
2000-01-06 1.567621 0.003641
2000-01-07 0.571455 -1.611639
2000-01-10 0.246462 0.588543
2000-01-11 1.340307 1.195778
In [6]: pd.read_excel('pandas-jackson/pandas/tests/io/data/test1.xlsx', usecols=[0, 2, 3],
... index_col=0)
Out[6]:
B C
2000-01-03 3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05 0.731168 -0.537677
2000-01-06 1.567621 0.003641
2000-01-07 0.571455 -1.611639
2000-01-10 0.246462 0.588543
2000-01-11 1.340307 1.195778
# Index 0 is not in used_cols, so it's not possible to use it in the index
In [7]: pd.read_excel('pandas-jackson/pandas/tests/io/data/test1.xlsx', usecols=[1, 2])
Out[7]:
A B
0 0.980269 3.685731
1 1.047916 -0.041232
2 0.498581 0.731168
3 1.120202 1.567621
4 -0.487094 0.571455
5 0.836649 0.246462
6 -0.157161 1.340307
# index_col is related to the 0-th col in the cols selected in usecols. In this case, [1, 2].
# So, the date column is not the index_col
In [8]: pd.read_excel('pandas-jackson/pandas/tests/io/data/test1.xlsx', usecols=[1, 2],
... index_col=0)
Out[8]:
B
A
0.980269 3.685731
1.047916 -0.041232
0.498581 0.731168
1.120202 1.567621
-0.487094 0.571455
0.836649 0.246462
-0.157161 1.340307
In [9]: pd.read_excel('pandas-jackson/pandas/tests/io/data/test1.xlsx', usecols=[0, 2, 3],
... index_col=1)
Out[9]:
B C
3.685731 2000-01-03 -0.364217
-0.041232 2000-01-04 -0.161812
0.731168 2000-01-05 -0.537677
1.567621 2000-01-06 0.003641
0.571455 2000-01-07 -1.611639
0.246462 2000-01-10 0.588543
1.340307 2000-01-11 1.195778
#PROPOSED BEHAVIOR
In [1]: pd.read_excel('pandas/tests/io/data/test1.xlsx', usecols=[0, 2, 3])
Out[1]:
A C D
2000-01-03 0.980269 -0.364217 -1.159738
2000-01-04 1.047916 -0.161812 0.212549
2000-01-05 0.498581 -0.537677 1.346270
2000-01-06 1.120202 0.003641 0.675253
2000-01-07 -0.487094 -1.611639 0.103469
2000-01-10 0.836649 0.588543 1.062782
2000-01-11 -0.157161 1.195778 -1.097007
In [2]: pd.read_excel('pandas/tests/io/data/test1.xlsx', usecols=[0, 2, 3], index_col=0)
Out[2]:
A C D
2000-01-03 0.980269 -0.364217 -1.159738
2000-01-04 1.047916 -0.161812 0.212549
2000-01-05 0.498581 -0.537677 1.346270
2000-01-06 1.120202 0.003641 0.675253
2000-01-07 -0.487094 -1.611639 0.103469
2000-01-10 0.836649 0.588543 1.062782
2000-01-11 -0.157161 1.195778 -1.097007
# Index 0 is not in used_cols, but TextParser receives all columns. So, the default index
# is the first column because it does not have a column name (1st column header is
# empty).
In [3]: pd.read_excel('pandas/tests/io/data/test1.xlsx', usecols=[1, 2])
Out[3]:
B C
2000-01-03 3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05 0.731168 -0.537677
2000-01-06 1.567621 0.003641
2000-01-07 0.571455 -1.611639
2000-01-10 0.246462 0.588543
2000-01-11 1.340307 1.195778
# index_col is the 0-th col in the DataFrame, not the 0-th in usecols. In this case, [1, 2].
# So, the date column is selected to be the index because is the first column in the
# DataFrame.
In [4]: pd.read_excel('pandas/tests/io/data/test1.xlsx', usecols=[1, 2], index_col=0)
Out[4]:
B C
2000-01-03 3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05 0.731168 -0.537677
2000-01-06 1.567621 0.003641
2000-01-07 0.571455 -1.611639
2000-01-10 0.246462 0.588543
2000-01-11 1.340307 1.195778
In [5]: pd.read_excel('pandas/tests/io/data/test1.xlsx', usecols=[1, 2], index_col=1)
Out[5]:
B C
3.685731 2000-01-03 -0.364217
-0.041232 2000-01-04 -0.161812
0.731168 2000-01-05 -0.537677
1.567621 2000-01-06 0.003641
0.571455 2000-01-07 -1.611639
0.246462 2000-01-10 0.588543
1.340307 2000-01-11 1.195778
I'm also dumping the behavior of csv you used as an example because I saw that it
diverges from current and proposed behavior of read_excel. So, maybe we can settle,
or agree about how read_excel is supposed to behave.
In [11]: pd.read_csv('tmp.csv', usecols=[0, 2, 3])
Out[11]:
Unnamed: 0 B C
0 2000-01-03 3.685731 -0.364217
1 2000-01-04 -0.041232 -0.161812
2 2000-01-05 0.731168 -0.537677
3 2000-01-06 1.567621 0.003641
4 2000-01-07 0.571455 -1.611639
5 2000-01-10 0.246462 0.588543
6 2000-01-11 1.340307 1.195778
In [12]: pd.read_csv('tmp.csv', usecols=[0, 2, 3], index_col=0)
Out[12]:
B C
2000-01-03 3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05 0.731168 -0.537677
2000-01-06 1.567621 0.003641
2000-01-07 0.571455 -1.611639
2000-01-10 0.246462 0.588543
2000-01-11 1.340307 1.195778
In [13]: pd.read_csv('tmp.csv', usecols=[1, 2])
Out[13]:
A B
0 0.980269 3.685731
1 1.047916 -0.041232
2 0.498581 0.731168
3 1.120202 1.567621
4 -0.487094 0.571455
5 0.836649 0.246462
6 -0.157161 1.340307
In [14]: pd.read_csv('tmp.csv', usecols=[1, 2], index_col=0)
Out[14]:
B
A
0.980269 3.685731
1.047916 -0.041232
0.498581 0.731168
1.120202 1.567621
-0.487094 0.571455
0.836649 0.246462
-0.157161 1.340307
What do you think?