ENH Read mutiple excel sheets in single API call by jnmclarty · Pull Request #9450 · pandas-dev/pandas (original) (raw)
Enables reading of multiple excel sheets in a single API call, reducing read time substantially.
Essentially, 2O(n) becomes O(1) + O(n).
Before
dfs = {}
for sheet in [‘Sheet1’,’Sheet2’,’Sheet3’]:
#We have to open the file 3 times = Super Slow
dfs[sheet] = pd.read_excel("TestData.xlsx",sheetname=sheet)
This PR
#We open the file only 1 time = Much Faster, and it's just one python line
dfs = pd.read_excel("TestData.xlsx",sheetname=[‘Sheet1’,’Sheet2’,’Sheet3’])
...but as a bonus...
#the above is the same as...
dfs = pd.read_excel_sheets("TestData.xlsx",[‘Sheet1’,’Sheet2’,’Sheet3’])
#or, assuming those are there are only the 3 sheets...
dfs = pd.read_excel_sheets("TestData.xlsx")
#and this also works. But the dictionary returned has the integer keys instead of strings.
dfs = pd.read_excel_sheets("TestData.xlsx",[0,1,2])
Return Objects
For the sheetname argument, specifying an int or a string, gets you a DataFrame (same as before the PR). But, specify a list of strings/int, returns a dictionary of dataframes. Specify None, and get all sheets in dictionary of dataframes, where the keys are the sheetnames.
More Discussion
The first commit (2db986d) implements 100% of the functionality, while the next two commits (5c2304c and 1a53b01) layer on optional changes to the documentation and API (of the two, only one is actually mandatory, but keeping both is the most explicit). Since the argument, sheetname, in read_excel was not plural, and defaults to 0, to maintain backwards compatibility AND create the “read all sheets” functionality associated with the non-default of None, the documentation of the argument gets really awkward. Hence, creating read_excel_sheets with an argument sheetnames which defaults to None instead of 0.
Things I need feedback on:
- Confirm print statement acceptability
- Confirm if sheetname argument documentation should be changed (or not, to avoid confusion of dictionary-based return cases)
- Confirm if read_excel_sheets addition to API is right approach
Things I still have to do:
- Test/write tests
- Update the release notes
- Build the docs and check
Note: There might have been another way of doing this, but I couldn't easily figure it out, (SO/google wasn't helpful). I just wanted to read my 623 sheets without it taking hours.