Using pandas to Read Large Excel Files in Python (original) (raw)

In this tutorial you’re going to learn how to work with large Excel files in pandas, focusing on reading and analyzing an xls file and then working with a subset of the original data.

This tutorial utilizes Python (tested with 64-bit versions of v2.7.9 and v3.4.3), pandas (v0.16.1), and XlsxWriter (v0.7.3). We recommend using the Anaconda distribution to quickly get started, as it comes pre-installed with all the needed libraries.

Reading the File

The first file we’ll work with is a compilation of all the car accidents in England from 1979-2004, to extract all accidents that happened in London in the year 2000.

Excel

Start by downloading the source ZIP file from data.gov.uk, and extract the contents. Then try to open Accidents7904.csv in Excel. Be careful. If you don’t have enough memory, this could very well crash your computer.

What happens?

You should see a “File Not Loaded Completely” error since Excel can only handle one million rows at a time.

We tested this in LibreOffice as well and received a similar error - “The data could not be loaded completely because the maximum number of rows per sheet was exceeded.”

To solve this, we can open the file in pandas. Before we start, the source code is on Github.

pandas

Within a new project directory, activate a virtualenv, and then install pandas:

Now let’s build the script. Create a file called pandas_accidents.py and the add the following code:

Here, we imported pandas, read in the file—which could take some time, depending on how much memory your system has—and outputted the total number of rows the file has as well as the available headers (e.g., column titles).

When ran, you should see:

So, there are over six millions rows! No wonder Excel choked. Turn your attention to the list of headers, the first one in particular:

This should read Accident_Index. What’s with the extra \xef\xbb\xbf at the beginning? Well, the \x actually means that the value is hexadecimal, which is a Byte Order Mark, indicating that the text is Unicode.

Why does it matter to us?

You cannot assume the files you read are clean. They might contain extra symbols like this that can throw your scripts off.

This file is good, in that it is otherwise clean - but many files have missing data, data in internal inconsistent format, etc.. So any time you have a file to analyze, the first thing you must do is clean it. How much cleaning? Enough to allow you to do some analysis. Follow the KISS principle.

What sort of cleanup might you require?

Analyzing

For those of you who know SQL, you can use the SELECT, WHERE, AND/OR statements with different keywords to refine your search. We can do the same in pandas, and in a way that is more programmer friendly.

To start off, let’s find all the accidents that happened on a Sunday. Looking at the headers above, there is a Day_of_Weeks field, which we will use.

In the ZIP file you downloaded, there’s a file called Road-Accident-Safety-Data-Guide-1979-2004.xls, which contains extra info on the codes used. If you open it up, you will see that Sunday has the code 1.

That’s how simple it is.

Here, we targeted the Day_of_Weeks field and returned a DataFrame with the condition we checked for - day of week == 1.

When ran you should see:

As you can see, there were 693,847 accidents that happened on a Sunday.

Let’s make our query more complicated: Find out all accidents that happened on a Sunday and involved more than twenty cars:

Run the script. Now we have 10 accidents:

Let’s add another condition: weather.

Open the Road-Accident-Safety-Data-Guide-1979-2004.xls, and go to the Weather sheet. You’ll see that the code 2 means, “Raining with no heavy winds”.

Add that to our query:

So there were four accidents that happened on a Sunday, involving more than twenty cars, while it was raining:

We could continue making this more and more complicated, as needed. For now, we’ll stop since our main interest is to look at accidents in London.

If you look at Road-Accident-Safety-Data-Guide-1979-2004.xls again, there is a sheet called Police Force. The code for 1 says, “Metropolitan Police”. This is what is more commonly known as Scotland Yard, and is the police force responsible for most (though not all) of London. For our case, this is good enough, and we can extract this info like so:

Run the script. This created a new DataFrame with the accidents handled by the “Metropolitan Police” from 1979 to 2004 on a Sunday:

What if you wanted to create a new DataFrame that only contains accidents in the year 2000?

The first thing we need to do is convert the date format to one which Python can understand using the pd.to_datetime() function. This takes a date in any format and converts it to a format that we can understand (yyyy-mm-dd). Then we can create another DataFrame that only contains accidents for 2000:

When ran, you should see:

So, this is a bit confusing at first. Normally, to filter an array you would just use a for loop with a conditional:

However, you really shouldn’t define your own loop since many high-performance libraries, like pandas, have helper functions in place. In this case, the above code loops over all the elements and filters out data outside the set dates, and then returns the data points that do fall within the dates.

Nice!

Converting

Chances are that, while using pandas, everyone else in your organization is stuck with Excel. Want to share the DataFrame with those using Excel?

First, we need to do some cleanup. Remember the byte order mark we saw earlier? That causes problems when writing this data to an Excel file - pandas throws a UnicodeDecodeError. Why? Because the rest of the text is decoded as ASCII, but the hexadecimal values can’t be represented in ASCII.

We could write everything as Unicode, but remember this byte order mark is an unnecessary (to us) extra we don’t want or need. So we will get rid of it by renaming the column header:

This is the way to rename a column in pandas; a bit complicated, to be honest. inplace = True is needed because we want to modify the existing structure, and not create a copy, which is what pandas does by default.

Now we can save the data to Excel:

Make sure to install XlsxWriter before running:

If all went well, this should have created a file called London_Sundays_2000.xlsx, and then saved our data to Sheet1. Open this file up in Excel or LibreOffice, and confirm that the data is correct.

Conclusion

So, what did we accomplish? Well, we took a very large file that Excel could not open and utilized pandas to-

  1. Open the file.
  2. Perform SQL-like queries against the data.
  3. Create a new XLSX file with a subset of the original data.

Keep in mind that even though this file is nearly 800MB, in the age of big data, it’s still quite small. What if you wanted to open a 4GB file? Even if you have 8GB or more of RAM, that might still not be possible since much of your RAM is reserved for the OS and other system processes. In fact, my laptop froze a few times when first reading in the 800MB file. If I opened a 4GB file, it would have a heart attack.

So how do we proceed?

The trick is not to open the whole file in one go. That’s what we’ll look at in the next blog post. Until then, analyze your own data. Leave questions or comments below. You can grab the code for this tutorial from the repo.