pandas: How to Read and Write Files (original) (raw)

Watch Now This tutorial has a related video course created by the Real Python team. Watch it together with the written tutorial to deepen your understanding: Reading and Writing Files With pandas

pandas is a powerful and flexible Python package that allows you to work with labeled and time series data. It also provides statistics methods, enables plotting, and more. One crucial feature of pandas is its ability to write and read Excel, CSV, and many other types of files. Functions like the pandas read_csv() method enable you to work with files effectively. You can use them to save the data and labels from pandas objects to a file and load them later as pandas Series or DataFrame instances.

In this tutorial, you’ll learn:

Let’s start reading and writing files!

Installing pandas

The code in this tutorial is executed with CPython 3.7.4 and pandas 0.25.1. It would be beneficial to make sure you have the latest versions of Python and pandas on your machine. You might want to create a new virtual environment and install the dependencies for this tutorial.

First, you’ll need the pandas library. You may already have it installed. If you don’t, then you can install it with pip:

Once the installation process completes, you should have pandas installed and ready.

Anaconda is an excellent Python distribution that comes with Python, many useful packages like pandas, and a package and environment manager called Conda. To learn more about Anaconda, check out Setting Up Python for Machine Learning on Windows.

If you don’t have pandas in your virtual environment, then you can install it with Conda:

Conda is powerful as it manages the dependencies and their versions. To learn more about working with Conda, you can check out the official documentation.

Preparing Data

In this tutorial, you’ll use the data related to 20 countries. Here’s an overview of the data and sources you’ll be working with:

This is how the data looks as a table:

COUNTRY POP AREA GDP CONT IND_DAY
CHN China 1398.72 9596.96 12234.78 Asia
IND India 1351.16 3287.26 2575.67 Asia 1947-08-15
USA US 329.74 9833.52 19485.39 N.America 1776-07-04
IDN Indonesia 268.07 1910.93 1015.54 Asia 1945-08-17
BRA Brazil 210.32 8515.77 2055.51 S.America 1822-09-07
PAK Pakistan 205.71 881.91 302.14 Asia 1947-08-14
NGA Nigeria 200.96 923.77 375.77 Africa 1960-10-01
BGD Bangladesh 167.09 147.57 245.63 Asia 1971-03-26
RUS Russia 146.79 17098.25 1530.75 1992-06-12
MEX Mexico 126.58 1964.38 1158.23 N.America 1810-09-16
JPN Japan 126.22 377.97 4872.42 Asia
DEU Germany 83.02 357.11 3693.20 Europe
FRA France 67.02 640.68 2582.49 Europe 1789-07-14
GBR UK 66.44 242.50 2631.23 Europe
ITA Italy 60.36 301.34 1943.84 Europe
ARG Argentina 44.94 2780.40 637.49 S.America 1816-07-09
DZA Algeria 43.38 2381.74 167.56 Africa 1962-07-05
CAN Canada 37.59 9984.67 1647.12 N.America 1867-07-01
AUS Australia 25.47 7692.02 1408.68 Oceania
KAZ Kazakhstan 18.53 2724.90 159.41 Asia 1991-12-16

You may notice that some of the data is missing. For example, the continent for Russia is not specified because it spreads across both Europe and Asia. There are also several missing independence days because the data source omits them.

You can organize this data in Python using a nested dictionary:

Each row of the table is written as an inner dictionary whose keys are the column names and values are the corresponding data. These dictionaries are then collected as the values in the outer data dictionary. The corresponding keys for data are the three-letter country codes.

You can use this data to create an instance of a pandas DataFrame. First, you need to import pandas:

Now that you have pandas imported, you can use the DataFrame constructor and data to create a DataFrame object.

data is organized in such a way that the country codes correspond to columns. You can reverse the rows and columns of a DataFrame with the property .T:

Now you have your DataFrame object populated with the data about each country.

Versions of Python older than 3.6 did not guarantee the order of keys in dictionaries. To ensure the order of columns is maintained for older versions of Python and pandas, you can specify index=columns:

Now that you’ve prepared your data, you’re ready to start working with files!

Using the pandas read_csv() and .to_csv() Functions

A comma-separated values (CSV) file is a plaintext file with a .csv extension that holds tabular data. This is one of the most popular file formats for storing large amounts of data. Each row of the CSV file represents a single table row. The values in the same row are by default separated with commas, but you could change the separator to a semicolon, tab, space, or some other character.

Write a CSV File

You can save your pandas DataFrame as a CSV file with .to_csv():

That’s it! You’ve created the file data.csv in your current working directory. You can expand the code block below to see how your CSV file should look:

This text file contains the data separated with commas. The first column contains the row labels. In some cases, you’ll find them irrelevant. If you don’t want to keep them, then you can pass the argument index=False to .to_csv().

Read a CSV File

Once your data is saved in a CSV file, you’ll likely want to load and use it from time to time. You can do that with the pandas read_csv() function:

In this case, the pandas read_csv() function returns a new DataFrame with the data and labels from the file data.csv, which you specified with the first argument. This string can be any valid path, including URLs.

The parameter index_col specifies the column from the CSV file that contains the row labels. You assign a zero-based column index to this parameter. You should determine the value of index_col when the CSV file contains the row labels to avoid loading them as data.

You’ll learn more about using pandas with CSV files later on in this tutorial. You can also check out Reading and Writing CSV Files in Python to see how to handle CSV files with the built-in Python library csv as well.

Using pandas to Write and Read Excel Files

Microsoft Excel is probably the most widely-used spreadsheet software. While older versions used binary .xls files, Excel 2007 introduced the new XML-based .xlsx file. You can read and write Excel files in pandas, similar to CSV files. However, you’ll need to install the following Python packages first:

You can install them using pip with a single command:

You can also use Conda:

Please note that you don’t have to install all these packages. For example, you don’t need both openpyxl and XlsxWriter. If you’re going to work just with .xls files, then you don’t need any of them! However, if you intend to work only with .xlsx files, then you’re going to need at least one of them, but not xlwt. Take some time to decide which packages are right for your project.

Write an Excel File

Once you have those packages installed, you can save your DataFrame in an Excel file with .to_excel():

The argument 'data.xlsx' represents the target file and, optionally, its path. The above statement should create the file data.xlsx in your current working directory. That file should look like this:

mmst-pandas-rw-files-excel

The first column of the file contains the labels of the rows, while the other columns store data.

Read an Excel File

You can load data from Excel files with read_excel():

read_excel() returns a new DataFrame that contains the values from data.xlsx. You can also use read_excel() with OpenDocument spreadsheets, or .ods files.

You’ll learn more about working with Excel files later on in this tutorial. You can also check out Using pandas to Read Large Excel Files in Python.

Understanding the pandas IO API

pandas IO Tools is the API that allows you to save the contents of Series and DataFrame objects to the clipboard, objects, or files of various types. It also enables loading data from the clipboard, objects, or files.

Write Files

Series and DataFrame objects have methods that enable writing data and labels to the clipboard or files. They’re named with the pattern .to_<file-type>(), where <file-type> is the type of the target file.

You’ve learned about .to_csv() and .to_excel(), but there are others, including:

There are still more file types that you can write to, so this list is not exhaustive.

These methods have parameters specifying the target file path where you saved the data and labels. This is mandatory in some cases and optional in others. If this option is available and you choose to omit it, then the methods return the objects (like strings or iterables) with the contents of DataFrame instances.

The optional parameter compression decides how to compress the file with the data and labels. You’ll learn more about it later on. There are a few other parameters, but they’re mostly specific to one or several methods. You won’t go into them in detail here.

Read Files

pandas functions for reading the contents of files are named using the pattern .read_<file-type>(), where <file-type> indicates the type of the file to read. You’ve already seen the pandas read_csv() and read_excel() functions. Here are a few others:

These functions have a parameter that specifies the target file path. It can be any valid string that represents the path, either on a local machine or in a URL. Other objects are also acceptable depending on the file type.

The optional parameter compression determines the type of decompression to use for the compressed files. You’ll learn about it later on in this tutorial. There are other parameters, but they’re specific to one or several functions. You won’t go into them in detail here.

Working With Different File Types

The pandas library offers a wide range of possibilities for saving your data to files and loading data from files. In this section, you’ll learn more about working with CSV and Excel files. You’ll also see how to use other types of files, like JSON, web pages, databases, and Python pickle files.

CSV Files

You’ve already learned how to read and write CSV files. Now let’s dig a little deeper into the details. When you use .to_csv() to save your DataFrame, you can provide an argument for the parameter path_or_buf to specify the path, name, and extension of the target file.

path_or_buf is the first argument .to_csv() will get. It can be any string that represents a valid file path that includes the file name and its extension. You’ve seen this in a previous example. However, if you omit path_or_buf, then .to_csv() won’t create any files. Instead, it’ll return the corresponding string:

Now you have the string s instead of a CSV file. You also have some missing values in your DataFrame object. For example, the continent for Russia and the independence days for several countries (China, Japan, and so on) are not available. In data science and machine learning, you must handle missing values carefully. pandas excels here! By default, pandas uses the NaN value to replace the missing values.

The continent that corresponds to Russia in df is nan:

This example uses .loc[] to get data with the specified row and column names.

When you save your DataFrame to a CSV file, empty strings ('') will represent the missing data. You can see this both in your file data.csv and in the string s. If you want to change this behavior, then use the optional parameter na_rep:

This code produces the file new-data.csv where the missing values are no longer empty strings. You can expand the code block below to see how this file should look:

Now, the string '(missing)' in the file corresponds to the nan values from df.

When pandas reads files, it considers the empty string ('') and a few others as missing values by default:

If you don’t want this behavior, then you can pass keep_default_na=False to the pandas read_csv() function. To specify other labels for missing values, use the parameter na_values:

Here, you’ve marked the string '(missing)' as a new missing data label, and pandas replaced it with nan when it read the file.

When you load data from a file, pandas assigns the data types to the values of each column by default. You can check these types with .dtypes:

The columns with strings and dates ('COUNTRY', 'CONT', and 'IND_DAY') have the data type object. Meanwhile, the numeric columns contain 64-bit floating-point numbers (float64).

You can use the parameter dtype to specify the desired data types and parse_dates to force use of datetimes:

Now, you have 32-bit floating-point numbers (float32) as specified with dtype. These differ slightly from the original 64-bit numbers because of smaller precision. The values in the last column are considered as dates and have the data type datetime64. That’s why the NaN values in this column are replaced with NaT.

Now that you have real dates, you can save them in the format you like:

Here, you’ve specified the parameter date_format to be '%B %d, %Y'. You can expand the code block below to see the resulting file:

The format of the dates is different now. The format '%B %d, %Y' means the date will first display the full name of the month, then the day followed by a comma, and finally the full year.

There are several other optional parameters that you can use with .to_csv():

Here’s how you would pass arguments for sep and header:

The data is separated with a semicolon (';') because you’ve specified sep=';'. Also, since you passed header=False, you see your data without the header row of column names.

The pandas read_csv() function has many additional options for managing missing data, working with dates and times, quoting, encoding, handling errors, and more. For instance, if you have a file with one data column and want to get a Series object instead of a DataFrame, then you can pass squeeze=True to read_csv(). You’ll learn later on about data compression and decompression, as well as how to skip rows and columns.

JSON Files

JSON stands for JavaScript object notation. JSON files are plaintext files used for data interchange, and humans can read them easily. They follow the ISO/IEC 21778:2017 and ECMA-404 standards and use the .json extension. Python and pandas work well with JSON files, as Python’s json library offers built-in support for them.

You can save the data from your DataFrame to a JSON file with .to_json(). Start by creating a DataFrame object again. Use the dictionary data that holds the data about countries and then apply .to_json():

This code produces the file data-columns.json. You can expand the code block below to see how this file should look:

data-columns.json has one large dictionary with the column labels as keys and the corresponding inner dictionaries as values.

You can get a different file structure if you pass an argument for the optional parameter orient:

The orient parameter defaults to 'columns'. Here, you’ve set it to index.

You should get a new file data-index.json. You can expand the code block below to see the changes:

data-index.json also has one large dictionary, but this time the row labels are the keys, and the inner dictionaries are the values.

There are few more options for orient. One of them is 'records':

This code should yield the file data-records.json. You can expand the code block below to see the content:

data-records.json holds a list with one dictionary for each row. The row labels are not written.

You can get another interesting file structure with orient='split':

The resulting file is data-split.json. You can expand the code block below to see how this file should look:

data-split.json contains one dictionary that holds the following lists:

If you don’t provide the value for the optional parameter path_or_buf that defines the file path, then .to_json() will return a JSON string instead of writing the results to a file. This behavior is consistent with .to_csv().

There are other optional parameters you can use. For instance, you can set index=False to forgo saving row labels. You can manipulate precision with double_precision, and dates with date_format and date_unit. These last two parameters are particularly important when you have time series among your data:

In this example, you’ve created the DataFrame from the dictionary data and used to_datetime() to convert the values in the last column to datetime64. You can expand the code block below to see the resulting file:

In this file, you have large integers instead of dates for the independence days. That’s because the default value of the optional parameter date_format is 'epoch' whenever orient isn’t 'table'. This default behavior expresses dates as an epoch in milliseconds relative to midnight on January 1, 1970.

However, if you pass date_format='iso', then you’ll get the dates in the ISO 8601 format. In addition, date_unit decides the units of time:

This code produces the following JSON file:

The dates in the resulting file are in the ISO 8601 format.

You can load the data from a JSON file with read_json():

The parameter convert_dates has a similar purpose as parse_dates when you use it to read CSV files. The optional parameter orient is very important because it specifies how pandas understands the structure of the file.

There are other optional parameters you can use as well:

Note that you might lose the order of rows and columns when using the JSON format to store your data.

HTML Files

An HTML is a plaintext file that uses hypertext markup language to help browsers render web pages. The extensions for HTML files are .html and .htm. You’ll need to install an HTML parser library like lxml or html5lib to be able to work with HTML files:

You can also use Conda to install the same packages:

Once you have these libraries, you can save the contents of your DataFrame as an HTML file with .to_html():

This code generates a file data.html. You can expand the code block below to see how this file should look:

This file shows the DataFrame contents nicely. However, notice that you haven’t obtained an entire web page. You’ve just output the data that corresponds to df in the HTML format.

.to_html() won’t create a file if you don’t provide the optional parameter buf, which denotes the buffer to write to. If you leave this parameter out, then your code will return a string as it did with .to_csv() and .to_json().

Here are some other optional parameters:

You use parameters like these to specify different aspects of the resulting files or strings.

You can create a DataFrame object from a suitable HTML file using read_html(), which will return a DataFrame instance or a list of them:

This is very similar to what you did when reading CSV files. You also have parameters that help you work with dates, missing values, precision, encoding, HTML parsers, and more.

Excel Files

You’ve already learned how to read and write Excel files with pandas. However, there are a few more options worth considering. For one, when you use .to_excel(), you can specify the name of the target worksheet with the optional parameter sheet_name:

Here, you create a file data.xlsx with a worksheet called COUNTRIES that stores the data. The string 'data.xlsx' is the argument for the parameter excel_writer that defines the name of the Excel file or its path.

The optional parameters startrow and startcol both default to 0 and indicate the upper left-most cell where the data should start being written:

Here, you specify that the table should start in the third row and the fifth column. You also used zero-based indexing, so the third row is denoted by 2 and the fifth column by 4.

Now the resulting worksheet looks like this:

mmst-pandas-rw-files-excel-shifted

As you can see, the table starts in the third row 2 and the fifth column E.

.read_excel() also has the optional parameter sheet_name that specifies which worksheets to read when loading data. It can take on one of the following values:

Here’s how you would use this parameter in your code:

Both statements above create the same DataFrame because the sheet_name parameters have the same values. In both cases, sheet_name=0 and sheet_name='COUNTRIES' refer to the same worksheet. The argument parse_dates=['IND_DAY'] tells pandas to try to consider the values in this column as dates or times.

There are other optional parameters you can use with .read_excel() and .to_excel() to determine the Excel engine, the encoding, the way to handle missing values and infinities, the method for writing column names and row labels, and so on.

SQL Files

pandas IO tools can also read and write databases. In this next example, you’ll write your data to a database called data.db. To get started, you’ll need the SQLAlchemy package. To learn more about it, you can read the official ORM tutorial. You’ll also need the database driver. Python has a built-in driver for SQLite.

You can install SQLAlchemy with pip:

You can also install it with Conda:

Once you have SQLAlchemy installed, import create_engine() and create a database engine:

Now that you have everything set up, the next step is to create a DataFrame object. It’s convenient to specify the data types and apply .to_sql().

.astype() is a very convenient method you can use to set multiple data types at once.

Once you’ve created your DataFrame, you can save it to the database with .to_sql():

The parameter con is used to specify the database connection or engine that you want to use. The optional parameter index_label specifies how to call the database column with the row labels. You’ll often see it take on the value ID, Id, or id.

You should get the database data.db with a single table that looks like this:

mmst-pandas-rw-files-db

The first column contains the row labels. To omit writing them into the database, pass index=False to .to_sql(). The other columns correspond to the columns of the DataFrame.

There are a few more optional parameters. For example, you can use schema to specify the database schema and dtype to determine the types of the database columns. You can also use if_exists, which says what to do if a database with the same name and path already exists:

You can load the data from the database with read_sql():

The parameter index_col specifies the name of the column with the row labels. Note that this inserts an extra row after the header that starts with ID. You can fix this behavior with the following line of code:

Now you have the same DataFrame object as before.

Note that the continent for Russia is now None instead of nan. If you want to fill the missing values with nan, then you can use .fillna():

.fillna() replaces all missing values with whatever you pass to value. Here, you passed float('nan'), which says to fill all missing values with nan.

Also note that you didn’t have to pass parse_dates=['IND_DAY'] to read_sql(). That’s because your database was able to detect that the last column contains dates. However, you can pass parse_dates if you’d like. You’ll get the same results.

There are other functions that you can use to read databases, like read_sql_table() and read_sql_query(). Feel free to try them out!

Pickle Files

Pickling is the act of converting Python objects into byte streams. Unpickling is the inverse process. Python pickle files are the binary files that keep the data and hierarchy of Python objects. They usually have the extension .pickle or .pkl.

You can save your DataFrame in a pickle file with .to_pickle():

Like you did with databases, it can be convenient first to specify the data types. Then, you create a file data.pickle to contain your data. You could also pass an integer value to the optional parameter protocol, which specifies the protocol of the pickler.

You can get the data from a pickle file with read_pickle():

read_pickle() returns the DataFrame with the stored data. You can also check the data types:

These are the same ones that you specified before using .to_pickle().

As a word of caution, you should always beware of loading pickles from untrusted sources. This can be dangerous! When you unpickle an untrustworthy file, it could execute arbitrary code on your machine, gain remote access to your computer, or otherwise exploit your device in other ways.

Working With Big Data

If your files are too large for saving or processing, then there are several approaches you can take to reduce the required disk space:

You’ll take a look at each of these techniques in turn.

Compress and Decompress Files

You can create an archive file like you would a regular one, with the addition of a suffix that corresponds to the desired compression type:

pandas can deduce the compression type by itself:

Here, you create a compressed .csv file as an archive. The size of the regular .csv file is 1048 bytes, while the compressed file only has 766 bytes.

You can open this compressed file as usual with the pandas read_csv() function:

read_csv() decompresses the file before reading it into a DataFrame.

You can specify the type of compression with the optional parameter compression, which can take on any of the following values:

The default value compression='infer' indicates that pandas should deduce the compression type from the file extension.

Here’s how you would compress a pickle file:

You should get the file data.pickle.compress that you can later decompress and read:

df again corresponds to the DataFrame with the same data as before.

You can give the other compression methods a try, as well. If you’re using pickle files, then keep in mind that the .zip format supports reading only.

Choose Columns

The pandas read_csv() and read_excel() functions have the optional parameter usecols that you can use to specify the columns you want to load from the file. You can pass the list of column names as the corresponding argument:

Now you have a DataFrame that contains less data than before. Here, there are only the names of the countries and their areas.

Instead of the column names, you can also pass their indices:

Expand the code block below to compare these results with the file 'data.csv':

You can see the following columns:

Simlarly, read_sql() has the optional parameter columns that takes a list of column names to read:

Again, the DataFrame only contains the columns with the names of the countries and areas. If columns is None or omitted, then all of the columns will be read, as you saw before. The default behavior is columns=None.

Omit Rows

When you test an algorithm for data processing or machine learning, you often don’t need the entire dataset. It’s convenient to load only a subset of the data to speed up the process. The pandas read_csv() and read_excel() functions have some optional parameters that allow you to select which rows you want to load:

Here’s how you would skip rows with odd zero-based indices, keeping the even ones:

In this example, skiprows is range(1, 20, 2) and corresponds to the values 1, 3, …, 19. The instances of the Python built-in class range behave like sequences. The first row of the file data.csv is the header row. It has the index 0, so pandas loads it in. The second row with index 1 corresponds to the label CHN, and pandas skips it. The third row with the index 2 and label IND is loaded, and so on.

If you want to choose rows randomly, then skiprows can be a list or NumPy array with pseudo-random numbers, obtained either with pure Python or with NumPy.

Force Less Precise Data Types

If you’re okay with less precise data types, then you can potentially save a significant amount of memory! First, get the data types with .dtypes again:

The columns with the floating-point numbers are 64-bit floats. Each number of this type float64 consumes 64 bits or 8 bytes. Each column has 20 numbers and requires 160 bytes. You can verify this with .memory_usage():

.memory_usage() returns an instance of Series with the memory usage of each column in bytes. You can conveniently combine it with .loc[] and .sum() to get the memory for a group of columns:

This example shows how you can combine the numeric columns 'POP', 'AREA', and 'GDP' to get their total memory requirement. The argument index=False excludes data for row labels from the resulting Series object. For these three columns, you’ll need 480 bytes.

You can also extract the data values in the form of a NumPy array with .to_numpy() or .values. Then, use the .nbytes attribute to get the total bytes consumed by the items of the array:

The result is the same 480 bytes. So, how do you save memory?

In this case, you can specify that your numeric columns 'POP', 'AREA', and 'GDP' should have the type float32. Use the optional parameter dtype to do this:

The dictionary dtypes specifies the desired data types for each column. It’s passed to the pandas read_csv() function as the argument that corresponds to the parameter dtype.

Now you can verify that each numeric column needs 80 bytes, or 4 bytes per item:

Each value is a floating-point number of 32 bits or 4 bytes. The three numeric columns contain 20 items each. In total, you’ll need 240 bytes of memory when you work with the type float32. This is half the size of the 480 bytes you’d need to work with float64.

In addition to saving memory, you can significantly reduce the time required to process data by using float32 instead of float64 in some cases.

Use Chunks to Iterate Through Files

Another way to deal with very large datasets is to split the data into smaller chunks and process one chunk at a time. If you use read_csv(), read_json() or read_sql(), then you can specify the optional parameter chunksize:

chunksize defaults to None and can take on an integer value that indicates the number of items in a single chunk. When chunksize is an integer, read_csv() returns an iterable that you can use in a for loop to get and process only a fragment of the dataset in each iteration:

In this example, the chunksize is 8. The first iteration of the for loop returns a DataFrame with the first eight rows of the dataset only. The second iteration returns another DataFrame with the next eight rows. The third and last iteration returns the remaining four rows.

In each iteration, you get and process the DataFrame with the number of rows equal to chunksize. It’s possible to have fewer rows than the value of chunksize in the last iteration. You can use this functionality to control the amount of memory required to process data and keep that amount reasonably small.

Conclusion

You now know how to save the data and labels from pandas DataFrame objects to different kinds of files. You also know how to load your data from files and create DataFrame objects.

You’ve used the pandas read_csv() and .to_csv() methods to read and write CSV files. You also used similar methods to read and write Excel, JSON, HTML, SQL, and pickle files. These functions are very convenient and widely used. They allow you to save or load your data in a single function or method call.

You’ve also learned how to save time, memory, and disk space when working with large data files:

You’ve mastered a significant step in the machine learning and data science process! If you have any questions or comments, then please put them in the comments section below.

Watch Now This tutorial has a related video course created by the Real Python team. Watch it together with the written tutorial to deepen your understanding: Reading and Writing Files With pandas