bug using read_excel
and index_col
option · Issue #18792 · pandas-dev/pandas (original) (raw)
Code Sample, a copy-pastable example if possible
Example 1 (weird column/value assignment)
import pandas as pd
# Writing excel file with no column name in first column
tab = pd.DataFrame([['a', 'x'], ['b', 'y']], columns=['col1', 'col2'])
print('Original:\n', tab)
tab.to_excel('tmp_can_del.xlsx')
# The table read using read_excel assigns values of col1 to col2
new_tab = pd.read_excel('tmp_can_del.xlsx', index_col=2)
print('Read:\n', new_tab)
Example 2 (index_col=None seems not to work)
import pandas as pd
# Writing excel file with no column name in first column
tab = pd.DataFrame([['a', 'x'], ['b', 'y']], columns=['col1', 'col2'], index=['i1', 'i2'])
print('Original:\n', tab)
tab.to_excel('tmp_can_del.xlsx')
# The table read using read_excel should have three columns (`Unnamed: 0`, `col1`, `col2`) but just has two
new_tab = pd.read_excel('tmp_can_del.xlsx', index_col=None)
print('Read:\n', new_tab)
print(new_tab.columns)
Problem description
The index_col
option in pd.read_excel
seems not to behave properly when the first column name is missing (for example, when a file is written using to_excel(..., index=True)
).
Example 1
We end up with a DataFrame where the data for col2
was originally the data below col1
. This does not seem intuitive.
Original:
col1 col2
0 a x
1 b y
Read:
col1 col2
x 0 a
y 1 b
Example 2
It seems like the index_col=None
option does not work. The first column of the file is treated as the DataFrame index.
Original:
col1 col2
i1 a x
i2 b y
Read:
col1 col2
i1 a x
i2 b y
Index(['col1', 'col2'], dtype='object')
Expected Output
Example 1
If column names are not specified, I would expect the data assigned to col1
to be that below col1
in the original excel file (same for col2
). Something like this:
Read:
col1 col2
x a x
y b y
Example 2
I would expect the newly read tab to look somewhat like this (as is the case when doing the same operations using write_csv
and read_csv
):
Unnamed: 0 col1 col2
0 i1 a x
1 i2 b y
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.8.0-58-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8
pandas: 0.21.1
pytest: 3.1.2
pip: 9.0.1
setuptools: 37.0.0
Cython: 0.25.2
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: 0.7.3
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None