Python pandas: Tricks & Features You May Not Know (original) (raw)

Watch Now This tutorial has a related video course created by the Real Python team. Watch it together with the written tutorial to deepen your understanding: Idiomatic pandas: Tricks & Features You May Not Know

pandas is a foundational library for analytics, data processing, and data science. It’s a huge project with tons of optionality and depth.

This tutorial will cover some lesser-used but idiomatic pandas capabilities that lend your code better readability, versatility, and speed, à la the Buzzfeed listicle.

If you feel comfortable with the core concepts of Python’s pandas library, hopefully you’ll find a trick or two in this article that you haven’t stumbled across previously. (If you’re just starting out with the library, 10 Minutes to pandas is a good place to start.)

1. Configure Options & Settings at Interpreter Startup

You may have run across pandas’ rich options and settings system before.

It’s a huge productivity saver to set customized pandas options at interpreter startup, especially if you work in a scripting environment. You can use pd.set_option() to configure to your heart’s content with a Python or IPython startup file.

The options use a dot notation such as pd.set_option('display.max_colwidth', 25), which lends itself well to a nested dictionary of options:

If you launch an interpreter session, you’ll see that everything in the startup script has been executed, and pandas is imported for you automatically with your suite of options:

Let’s use some data on abalone hosted by the UCI Machine Learning Repository to demonstrate the formatting that was set in the startup file. The data will truncate at 14 rows with 4 digits of precision for floats:

You’ll see this dataset pop up in other examples later as well.

2. Make Toy Data Structures With pandas’ Testing Module

Hidden way down in pandas’ testing module are a number of convenient functions for quickly building quasi-realistic Series and DataFrames:

There are around 30 of these, and you can see the full list by calling dir() on the module object. Here are a few:

These can be useful for benchmarking, testing assertions, and experimenting with pandas methods that you are less familiar with.

3. Take Advantage of Accessor Methods

Perhaps you’ve heard of the term accessor, which is somewhat like a getter (although getters and setters are used infrequently in Python). For our purposes here, you can think of a pandas accessor as a property that serves as an interface to additional methods.

pandas Series have three of them:

Yes, that definition above is a mouthful, so let’s take a look at a few examples before discussing the internals.

.cat is for categorical data, .str is for string (object) data, and .dt is for datetime-like data. Let’s start off with .str: imagine that you have some raw city/state/ZIP data as a single field within a pandas Series.

pandas string methods are vectorized, meaning that they operate on the entire array without an explicit for loop:

For a more involved example, let’s say that you want to separate out the three city/state/ZIP components neatly into DataFrame fields.

You can pass a regular expression to .str.extract() to “extract” parts of each cell in the Series. In .str.extract(), .str is the accessor, and .str.extract() is an accessor method:

This also illustrates what is known as method-chaining, where .str.extract(regex) is called on the result of addr.str.replace('.', ''), which cleans up use of periods to get a nice 2-character state abbreviation.

It’s helpful to know a tiny bit about how these accessor methods work as a motivating reason for why you should use them in the first place, rather than something like addr.apply(re.findall, ...).

Each accessor is itself a bona fide Python class:

.str maps to StringMethods.
.dt maps to CombinedDatetimelikeProperties.
.cat routes to CategoricalAccessor.

These standalone classes are then “attached” to the Series class using a CachedAccessor. It is when the classes are wrapped in CachedAccessor that a bit of magic happens.

CachedAccessor is inspired by a “cached property” design: a property is only computed once per instance and then replaced by an ordinary attribute. It does this by overloading the .__get__() method, which is part of Python’s descriptor protocol.

The second accessor, .dt, is for datetime-like data. It technically belongs to pandas’ DatetimeIndex, and if called on a Series, it is converted to a DatetimeIndex first:

The third accessor, .cat, is for Categorical data only, which you’ll see shortly in its own section.

4. Create a DatetimeIndex From Component Columns

Speaking of datetime-like data, as in daterng above, it’s possible to create a pandas DatetimeIndex from multiple component columns that together form a date or datetime:

Finally, you can drop the old individual columns and convert to a Series:

The intuition behind passing a DataFrame is that a DataFrame resembles a Python dictionary where the column names are keys, and the individual columns (Series) are the dictionary values. That’s why pd.to_datetime(df[datecols].to_dict(orient='list')) would also work in this case. This mirrors the construction of Python’s datetime.datetime, where you pass keyword arguments such as datetime.datetime(year=2000, month=1, day=15, hour=10).

5. Use Categorical Data to Save on Time and Space

One powerful pandas feature is its Categorical dtype.

Even if you’re not always working with gigabytes of data in RAM, you’ve probably run into cases where straightforward operations on a large DataFrame seem to hang up for more than a few seconds.

pandas object dtype is often a great candidate for conversion to category data. (object is a container for Python str, heterogeneous data types, or “other” types.) Strings occupy a significant amount of space in memory:

Now, what if we could take the unique colors above and map each to a less space-hogging integer? Here is a naive implementation of that:

You’ll notice immediately that memory usage is just about cut in half compared to when the full strings are used with object dtype.

Earlier in the section on accessors, I mentioned the .cat (categorical) accessor. The above with mapper is a rough illustration of what is happening internally with pandas’ Categorical dtype:

“The memory usage of a Categorical is proportional to the number of categories plus the length of the data. In contrast, an object dtype is a constant times the length of the data.” (Source)

In colors above, you have a ratio of 2 values for every unique value (category):

As a result, the memory savings from converting to Categorical is good, but not great:

However, if you blow out the proportion above, with a lot of data and few unique values (think about data on demographics or alphabetic test scores), the reduction in memory required is over 10 times:

A bonus is that computational efficiency gets a boost too: for categorical Series, the string operations are performed on the .cat.categories attribute rather than on each original element of the Series.

In other words, the operation is done once per unique category, and the results are mapped back to the values. Categorical data has a .cat accessor that is a window into attributes and methods for manipulating the categories:

In fact, you can reproduce something similar to the example above that you did manually:

All that you need to do to exactly mimic the earlier manual output is to reorder the codes:

Notice that the dtype is NumPy’s int8, an 8-bit signed integer that can take on values from -127 to 128. (Only a single byte is needed to represent a value in memory. 64-bit signed ints would be overkill in terms of memory usage.) Our rough-hewn example resulted in int64 data by default, whereas pandas is smart enough to downcast categorical data to the smallest numerical dtype possible.

Most of the attributes for .cat are related to viewing and manipulating the underlying categories themselves:

There are a few caveats, though. Categorical data is generally less flexible. For instance, if inserting previously unseen values, you need to add this value to a .categories container first:

If you plan to be setting values or reshaping data rather than deriving new computations, Categorical types may be less nimble.

6. Introspect Groupby Objects via Iteration

When you call df.groupby('x'), the resulting pandas groupby objects can be a bit opaque. This object is lazily instantiated and doesn’t have any meaningful representation on its own.

You can demonstrate with the abalone dataset from example 1:

Alright, now you have a groupby object, but what is this thing, and how do I see it?

Before you call something like grouped.apply(func), you can take advantage of the fact that groupby objects are iterable:

Each “thing” yielded by grouped.__iter__() is a tuple of (name, subsetted object), where name is the value of the column on which you’re grouping, and subsetted object is a DataFrame that is a subset of the original DataFrame based on whatever grouping condition you specify. That is, the data gets chunked by group:

Relatedly, a groupby object also has .groups and a group-getter, .get_group():

This can help you be a little more confident that the operation you’re performing is the one you want:

No matter what calculation you perform on grouped, be it a single pandas method or custom-built function, each of these “sub-frames” is passed one-by-one as an argument to that callable. This is where the term “split-apply-combine” comes from: break the data up by groups, perform a per-group calculation, and recombine in some aggregated fashion.

If you’re having trouble visualizing exactly what the groups will actually look like, simply iterating over them and printing a few can be tremendously useful.

7. Use This Mapping Trick for Membership Binning

Let’s say that you have a Series and a corresponding “mapping table” where each value belongs to a multi-member group, or to no groups at all:

In other words, you need to map countries to the following result:

What you need here is a function similar to pandas’ pd.cut(), but for binning based on categorical membership. You can use pd.Series.map(), which you already saw in example #5, to mimic this:

This should be significantly faster than a nested Python loop through groups for each country in countries.

Here’s a test drive:

Let’s break down what’s going on here. (Sidenote: this is a great place to step into a function’s scope with Python’s debugger, pdb, to inspect what variables are local to the function.)

The objective is to map each group in groups to an integer. However, Series.map() will not recognize 'ab'—it needs the broken-out version with each character from each group mapped to an integer. This is what the dictionary comprehension is doing:

This dictionary can be passed to s.map() to map or “translate” its values to their corresponding group indices.

8. Understand How pandas Uses Boolean Operators

You may be familiar with Python’s operator precedence, where and, not, and or have lower precedence than arithmetic operators such as <, <=, >, >=, !=, and ==. Consider the two statements below, where < and > have higher precedence than the and operator:

pandas (and NumPy, on which pandas is built) does not use and, or, or not. Instead, it uses &, |, and ~, respectively, which are normal, bona fide Python bitwise operators.

These operators are not “invented” by pandas. Rather, &, |, and ~ are valid Python built-in operators that have higher (rather than lower) precedence than arithmetic operators. (pandas overrides dunder methods like .__ror__() that map to the | operator.) To sacrifice some detail, you can think of “bitwise” as “elementwise” as it relates to pandas and NumPy:

It pays to understand this concept in full. Let’s say that you have a range-like Series:

I would guess that you may have seen this exception raised at some point:

What’s happening here? It’s helpful to incrementally bind the expression with parentheses, spelling out how Python expands this expression step by step:

The expression s % 2 == 0 & s > 3 is equivalent to (or gets treated as) ((s % 2) == (0 & s)) and ((0 & s) > 3). This is called expansion: x < y <= z is equivalent to x < y and y <= z.

Okay, now stop there, and let’s bring this back to pandas-speak. You have two pandas Series that we’ll call left and right:

You know that a statement of the form left and right is truth-value testing both left and right, as in the following:

The problem is that pandas developers intentionally don’t establish a truth-value (truthiness) for an entire Series. Is a Series True or False? Who knows? The result is ambiguous:

The only comparison that makes sense is an elementwise comparison. That’s why, if an arithmetic operator is involved, you’ll need parentheses:

In short, if you see the ValueError above pop up with boolean indexing, the first thing you should probably look to do is sprinkle in some needed parentheses.

9. Load Data From the Clipboard

It’s a common situation to need to transfer data from a place like Excel or Sublime Text to a pandas data structure. Ideally, you want to do this without going through the intermediate step of saving the data to a file and afterwards reading in the file to pandas.

You can load in DataFrames from your computer’s clipboard data buffer with pd.read_clipboard(). Its keyword arguments are passed on to pd.read_table().

This allows you to copy structured text directly to a DataFrame or Series. In Excel, the data would look something like this:

Its plain-text representation (for example, in a text editor) would look like this:

a b c d 0 1 inf 1/1/00 2 7.389056099 N/A 5-Jan-13 4 54.59815003 nan 7/24/18 6 403.4287935 None NaT

Simply highlight and copy the plain text above, and call pd.read_clipboard():

10. Write pandas Objects Directly to Compressed Format

This one’s short and sweet to round out the list. As of pandas version 0.21.0, you can write pandas objects directly to gzip, bz2, zip, or xz compression, rather than stashing the uncompressed file in memory and converting it. Here’s an example using the abalone data from trick #1:

In this case, the size difference is 11.6x:

Want to Add to This List? Let Us Know

Hopefully, you were able to pick up a couple of useful tricks from this list to lend your pandas code better readability, versatility, and performance.

If you have something up your sleeve that’s not covered here, please leave a suggestion in the comments or as a GitHub Gist. We will gladly add to this list and give credit where it’s due.