BUG: read_excel return empty dataframe when using usecols by jacksonjos · Pull Request #20480 · pandas-dev/pandas (original) (raw)

#@chris-b1,

Take a look at the examples below comparing the current pandas behavior to the new one,
so maybe is easier to understand.
@jreback, do you get it what is happening here?

# CURRENT BEHAVIOR
In [5]: pd.read_excel('pandas-jackson/pandas/tests/io/data/test1.xlsx', usecols=[0, 2, 3])
Out[5]: 
                   B         C
2000-01-03  3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05  0.731168 -0.537677
2000-01-06  1.567621  0.003641
2000-01-07  0.571455 -1.611639
2000-01-10  0.246462  0.588543
2000-01-11  1.340307  1.195778

In [6]: pd.read_excel('pandas-jackson/pandas/tests/io/data/test1.xlsx', usecols=[0, 2, 3],
...   index_col=0)
Out[6]: 
                   B         C
2000-01-03  3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05  0.731168 -0.537677
2000-01-06  1.567621  0.003641
2000-01-07  0.571455 -1.611639
2000-01-10  0.246462  0.588543
2000-01-11  1.340307  1.195778

# Index 0 is not in used_cols, so it's not possible to use it in the index
In [7]: pd.read_excel('pandas-jackson/pandas/tests/io/data/test1.xlsx', usecols=[1, 2])
Out[7]: 
          A         B
0  0.980269  3.685731
1  1.047916 -0.041232
2  0.498581  0.731168
3  1.120202  1.567621
4 -0.487094  0.571455
5  0.836649  0.246462
6 -0.157161  1.340307

# index_col is related to the 0-th col in the cols selected in usecols. In this case, [1, 2].
# So, the date column is not the index_col
In [8]: pd.read_excel('pandas-jackson/pandas/tests/io/data/test1.xlsx', usecols=[1, 2],
...     index_col=0)
Out[8]: 
                  B
A                  
 0.980269  3.685731
 1.047916 -0.041232
 0.498581  0.731168
 1.120202  1.567621
-0.487094  0.571455
 0.836649  0.246462
-0.157161  1.340307

In [9]: pd.read_excel('pandas-jackson/pandas/tests/io/data/test1.xlsx', usecols=[0, 2, 3],
...    index_col=1)
Out[9]: 
                   B         C
 3.685731 2000-01-03 -0.364217
-0.041232 2000-01-04 -0.161812
 0.731168 2000-01-05 -0.537677
 1.567621 2000-01-06  0.003641
 0.571455 2000-01-07 -1.611639
 0.246462 2000-01-10  0.588543
 1.340307 2000-01-11  1.195778

#PROPOSED BEHAVIOR

In [1]: pd.read_excel('pandas/tests/io/data/test1.xlsx', usecols=[0, 2, 3])
Out[1]: 
                   A         C         D
2000-01-03  0.980269 -0.364217 -1.159738
2000-01-04  1.047916 -0.161812  0.212549
2000-01-05  0.498581 -0.537677  1.346270
2000-01-06  1.120202  0.003641  0.675253
2000-01-07 -0.487094 -1.611639  0.103469
2000-01-10  0.836649  0.588543  1.062782
2000-01-11 -0.157161  1.195778 -1.097007

In [2]: pd.read_excel('pandas/tests/io/data/test1.xlsx', usecols=[0, 2, 3], index_col=0)
Out[2]: 
                   A         C         D
2000-01-03  0.980269 -0.364217 -1.159738
2000-01-04  1.047916 -0.161812  0.212549
2000-01-05  0.498581 -0.537677  1.346270
2000-01-06  1.120202  0.003641  0.675253
2000-01-07 -0.487094 -1.611639  0.103469
2000-01-10  0.836649  0.588543  1.062782
2000-01-11 -0.157161  1.195778 -1.097007

# Index 0 is not in used_cols, but TextParser receives all columns. So, the default index
# is the first column because it does not have a column name (1st column header is
# empty).
In [3]: pd.read_excel('pandas/tests/io/data/test1.xlsx', usecols=[1, 2])
Out[3]: 
                   B         C
2000-01-03  3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05  0.731168 -0.537677
2000-01-06  1.567621  0.003641
2000-01-07  0.571455 -1.611639
2000-01-10  0.246462  0.588543
2000-01-11  1.340307  1.195778

# index_col is the 0-th col in the DataFrame, not the 0-th in usecols. In this case, [1, 2].
# So, the date column is selected to be the index because is the first column in the
# DataFrame.
In [4]: pd.read_excel('pandas/tests/io/data/test1.xlsx', usecols=[1, 2], index_col=0)
Out[4]: 
                   B         C
2000-01-03  3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05  0.731168 -0.537677
2000-01-06  1.567621  0.003641
2000-01-07  0.571455 -1.611639
2000-01-10  0.246462  0.588543
2000-01-11  1.340307  1.195778

In [5]: pd.read_excel('pandas/tests/io/data/test1.xlsx', usecols=[1, 2], index_col=1)
Out[5]: 
                   B         C
 3.685731 2000-01-03 -0.364217
-0.041232 2000-01-04 -0.161812
 0.731168 2000-01-05 -0.537677
 1.567621 2000-01-06  0.003641
 0.571455 2000-01-07 -1.611639
 0.246462 2000-01-10  0.588543
 1.340307 2000-01-11  1.195778


I'm also dumping the behavior of csv you used as an example because I saw that it
diverges from current and proposed behavior of read_excel. So, maybe we can settle,
or agree about how read_excel is supposed to behave.

In [11]: pd.read_csv('tmp.csv', usecols=[0, 2, 3])
Out[11]: 
   Unnamed: 0         B         C
0  2000-01-03  3.685731 -0.364217
1  2000-01-04 -0.041232 -0.161812
2  2000-01-05  0.731168 -0.537677
3  2000-01-06  1.567621  0.003641
4  2000-01-07  0.571455 -1.611639
5  2000-01-10  0.246462  0.588543
6  2000-01-11  1.340307  1.195778

In [12]: pd.read_csv('tmp.csv', usecols=[0, 2, 3], index_col=0)
Out[12]: 
                   B         C
2000-01-03  3.685731 -0.364217
2000-01-04 -0.041232 -0.161812
2000-01-05  0.731168 -0.537677
2000-01-06  1.567621  0.003641
2000-01-07  0.571455 -1.611639
2000-01-10  0.246462  0.588543
2000-01-11  1.340307  1.195778

In [13]: pd.read_csv('tmp.csv', usecols=[1, 2])
Out[13]: 
          A         B
0  0.980269  3.685731
1  1.047916 -0.041232
2  0.498581  0.731168
3  1.120202  1.567621
4 -0.487094  0.571455
5  0.836649  0.246462
6 -0.157161  1.340307

In [14]: pd.read_csv('tmp.csv', usecols=[1, 2], index_col=0)
Out[14]: 
                  B
A                  
 0.980269  3.685731
 1.047916 -0.041232
 0.498581  0.731168
 1.120202  1.567621
-0.487094  0.571455
 0.836649  0.246462
-0.157161  1.340307

What do you think?