ENH Read mutiple excel sheets in single API call by jnmclarty · Pull Request #9450 · pandas-dev/pandas (original) (raw)

Enables reading of multiple excel sheets in a single API call, reducing read time substantially.

Essentially, 2O(n) becomes O(1) + O(n).

Before

dfs = {}
for sheet in [‘Sheet1’,’Sheet2’,’Sheet3’]:
    #We have to open the file 3 times = Super Slow
    dfs[sheet] = pd.read_excel("TestData.xlsx",sheetname=sheet)

This PR

#We open the file only 1 time = Much Faster, and it's just one python line
dfs = pd.read_excel("TestData.xlsx",sheetname=[‘Sheet1’,’Sheet2’,’Sheet3’])

...but as a bonus...

#the above is the same as...
dfs = pd.read_excel_sheets("TestData.xlsx",[‘Sheet1’,’Sheet2’,’Sheet3’])
#or, assuming those are there are only the 3 sheets...
dfs = pd.read_excel_sheets("TestData.xlsx")
#and this also works. But the dictionary returned has the integer keys instead of strings.
dfs = pd.read_excel_sheets("TestData.xlsx",[0,1,2])

Return Objects

For the sheetname argument, specifying an int or a string, gets you a DataFrame (same as before the PR). But, specify a list of strings/int, returns a dictionary of dataframes. Specify None, and get all sheets in dictionary of dataframes, where the keys are the sheetnames.

More Discussion

The first commit (2db986d) implements 100% of the functionality, while the next two commits (5c2304c and 1a53b01) layer on optional changes to the documentation and API (of the two, only one is actually mandatory, but keeping both is the most explicit). Since the argument, sheetname, in read_excel was not plural, and defaults to 0, to maintain backwards compatibility AND create the “read all sheets” functionality associated with the non-default of None, the documentation of the argument gets really awkward. Hence, creating read_excel_sheets with an argument sheetnames which defaults to None instead of 0.

Things I need feedback on:

Confirm print statement acceptability
Confirm if sheetname argument documentation should be changed (or not, to avoid confusion of dictionary-based return cases)
Confirm if read_excel_sheets addition to API is right approach

Things I still have to do:

Test/write tests
Update the release notes
Build the docs and check

Note: There might have been another way of doing this, but I couldn't easily figure it out, (SO/google wasn't helpful). I just wanted to read my 623 sheets without it taking hours.