ENH: Allow overwriting existing sheets when appending to excel files (original) (raw)

As far as I can tell there is no way to write a DataFrame to a specific (named) sheet in an existing Excel file, creating the sheet if it doesn't exist or overwriting if it does. The closest thing to this that exists is the openpyxl append mode, but if you try to write to sheet_which_already_exists with this, the data will be written to sheet_which_already_exists1. Using write mode overwrites the entire book.

Being able to overwrite existing sheets would be useful for populating "raw data" sheets in existing reports/publications on a regular basis, with pivot tables etc updating automatically.

This seems to be a very common issue:
https://stackoverflow.com/questions/62618680/overwrite-an-excel-sheet-with-pandas-dataframe-without-affecting-other-sheets
https://stackoverflow.com/questions/47602533/append-a-pandas-dataframe-to-an-existing-excel-table?noredirect=1&lq=1
https://stackoverflow.com/questions/34744863/python-how-to-use-excelwriter-to-write-into-an-existing-worksheet?noredirect=1&lq=1
#28653
#29990

Describe the solution you'd like

There are a few potential behaviours someone might want when they write to an existing sheet in append mode:

Originally I considered adding a flag like overwrite=True to pd.ExcelWriter but I don't think this would be sufficient to cover all the potential behaviours, so would suggest an enum along the lines of the if_exists option in df.to_sql, with the following options:

This option would only be relevant when using mode="a" with the openpyxl engine.

API breaking implications

The new_sheet option would replicate the current behaviour, so this could be made the default.

Additional context

Demonstration of current behaviour:

import pandas as pd from openpyxl import load_workbook import io

df1 = pd.DataFrame({ "col1": ["apple", "banana", "carrot"], "col2": [1, 2, 3], }) df2 = pd.DataFrame({"col1": [5]})

excel = io.BytesIO()

with pd.ExcelWriter(excel, engine="openpyxl", mode="w") as writer: # write two sheets df1.to_excel(writer, sheet_name="sheet1", index=False) df1.to_excel(writer, sheet_name="sheet2", index=False)

with pd.ExcelWriter(excel, engine="openpyxl", mode="a") as writer: # try to overwrite one of them df2.to_excel(writer, sheet_name="sheet1", index=False)

wb = load_workbook(excel)

sheet1, sheet2, sheet11

print(wb.sheetnames)

Why does it work this way? At the moment, when the openpyxl ExcelWriter is initialised in append mode it loads the workbook, but does not update self.sheets to reflect the sheets in the loaded file.
https://github.com/pandas-dev/pandas/blob/master/pandas/io/excel/_openpyxl.py#L54-L58

This seems unintentional and means that when write_cells is called with a sheet name that already exists in the file, it is not found in self.sheets:
https://github.com/pandas-dev/pandas/blob/master/pandas/io/excel/_openpyxl.py#L414-L419
so we follow the else branch, and when assigning to wks.title openpyxl suffixes the sheet name to avoid a naming conflict.