Pandas Cheat Sheet PDF (original) (raw)

Pandas

This cheat sheet—part of our Complete Guide to NumPy, pandas, and Data Visualization—offers a handy reference for essential pandas commands, focused on efficient data manipulation and analysis. Using examples from the Fortune 500 Companies Dataset, it covers key pandas operations such as reading and writing data, selecting and filtering DataFrame values, and performing common transformations.

You'll find easy-to-follow examples for grouping, sorting, and aggregating data, as well as calculating statistics like mean, correlation, and summary statistics. Whether you're cleaning datasets, analyzing trends, or visualizing data, this cheat sheet provides concise instructions to help you navigate pandas’ powerful functionality.

Designed to be practical and actionable, this guide ensures you can quickly apply pandas’ versatile data manipulation tools in your workflow.

Have the Dataquest Pandas Cheat Sheet at your fingertips when you need it!

Table of Contents

Importing data

Importing Data

IMPORT, READ_CSV, READ_TABLE, READ_EXCEL, READ_SQL, READ_JSON, READ_HTML, CLIPBOARD, DATAFRAME

Exporting data

Exporting Data

TO_CSV, TO_EXCEL, TO_SQL, TO_JSON, TO_HTML, TO_CLIPBOARD

Create test objects

Create Test Objects

DATAFRAME, SERIES, INDEX

Working with dataframes

Working with DataFrames

DATAFRAME BASICS, DATAFRAME VALUES, LOC, ILOC, BOOLEAN MASKS, BOOLEAN OPERATORS, DATA EXPLORATION, ASSIGNING VALUES, BOOLEAN INDEXING

View and inspect data

View & Inspect Data

FREQUENCY TABLE, HISTOGRAM, VERTICAL BAR PLOT, HORIZONTAL BAR PLOT, LINE PLOT, SCATTER PLOT, HEAD, TAIL, SHAPE, INFO, DESCRIBE, VALUE_COUNTS, APPLY

Data cleaning

Data Cleaning

COLUMNS, ISNULL, NOTNULL, DROPNA, FILLNA, ASTYPE, REPLACE, RENAME, SET_INDEX, FINDING CORRELATION, CONVERTING A COLUMN TO DATETIME

Filter sort and group by

Filter, Sort, & Group By

COLUMNS, SORT_VALUES, GROUP BY, PIVOT_TABLE, APPLY

Join and combine

Join & Combine

APPEND, CONCAT, JOIN

Statistics

Statistics

DESCRIBE, MEAN, CORR, COUNT, MAX, MIN, MEDIAN, STD

Importing data

Importing Data

Syntax for

How to use

Explained

IMPORT

Import the library using its standard alias

READ_CSV

Reads from a CSV file

READ_TABLE

Reads from a delimited text file (like TSV)

READ_EXCEL

Reads from an Excel file

READ_SQL

pd.read_sql(query, connection_object)

Reads from a SQL table/database

READ_JSON

pd.read_json(json_string)

Reads from a JSON formatted string, URL or file

READ_HTML

Parses an html URL, string or file and extracts tables to a list of dataframes

CLIPBOARD

Reads the contents of your clipboard

DATAFRAME

Reads from a dict; keys for columns names, values for data as lists

Exporting data

Exporting Data

Syntax

How to use

Explained

TO_CSV

Writes to a CSV file

TO_EXCEL

Writes to an Excel file

TO_SQL

df.to_sql(table_name, connection_object)

Writes to a SQL table

TO_JSON

Writes to a file in JSON format

TO_HTML

Writes to an HTML table

TO_CLIPBOARD

Writes to the clipboard

Create test objects

Create Test Objects

Syntax for

How to use

Explained

DATAFRAME

pd.DataFrame(np.random.rand(20, 5))

5 columns and 20 rows of random floats

SERIES

Creates a series from an existing list object

INDEX

df.index = pd.date_range('1900/1/30', periods=df.shape[0])

Adds a date index

Working with dataframes

Working with DataFrames

Syntax for

How to use

Explained

DATAFRAME BASICS

f500 = pd.read_csv('f500.csv', index_col=0)

Read a CSV file into a DataFrame

Return the data type of each column in a DataFrame

Return the dimensions of a DataFrame

SELECTING DATAFRAME VALUES

Select the rank column from f500

f500[["country", "rank"]]

Select the country and rank columns from f500

first_five = f500.head(5)

Select the first five rows from f500

LOC

big_movers = f500.loc[["Aviva", "HP", "JD.com", "BHP Billiton"], 
                      ["rank", "previous_rank"]]
 
bottom_companies = f500.loc["National Grid":"AutoNation", 
                            ["rank", "sector", "country"]] 

revenue_giants = f500.loc[["Apple", "Industrial & Commercial Bank of China", 
                          "China Construction Bank", "Agricultural Bank of China"],
                          "revenues":"profit_change"]

Use .loc[] to select rows and columns from f500 by label—rows are specified first, followed by columns. You can select individual rows/columns or multiple by passing a list, and label-based slicing includes both the start and end labels.

ILOC

third_row_first_col = f500.iloc[2, 0]

Select the third row, first column by integer location

second_row = f500.iloc[1]

Select the second row by integer location

BOOLEAN MASKS

rev_is_null = f500["revenue_change"].isnull()

Check for null values in the revenue_change column

rev_change_null = f500[rev_is_null]

Filtering using Boolean array

f500[f500["previous_rank"].notnull()]

Filter rows where previous_rank is not null

BOOLEAN OPERATORS

filter_big_rev_neg_profit = (f500["revenues"] > 100000) & 
                            (f500["profits"] < 0)

Create a Boolean filter for companies with revenues greater than 100,000 and profits less than 0

DATA EXPLORATION

revs = f500["revenues"]
summary_stats = revs.describe()

Generate summary statistics for the revenues column in f500

country_freqs = f500["country"].value_counts()

Count the occurrences of each country in f500

ASSIGNING VALUES

top5_rank_revenue["year_founded"] = 0

Set the year_founded column to 0

f500.loc["Dow Chemical", "ceo"] = "Jim Fitterling"

Update the CEO of Dow Chemical to Jim Fitterling

BOOLEAN INDEXING

kr_bool = f500["country"] == "South Korea"
top_5_kr = f500[kr_bool].head()

Filter rows for South Korea and display the top 5

f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan
prev_rank_after = f500["previous_rank"].value_counts(dropna=False).head()

Replace 0 with NaN in the previous_rank column and shows the top 5 most common values

View and inspect data

View & Inspect Data

Syntax for

How to use

Explained

FREQUENCY TABLE

Generate a frequency table from a Series object

Series.value_counts().sort_index()

Generate a sorted frequency table from a Series object

HISTOGRAM

Series.plot.hist()
plt.show()

Generate a histogram from a Series object

VERTICAL BAR PLOT

Series.plot.bar()
plt.show()

Generate a vertical bar plot from a Series object

HORIZONTAL BAR PLOT

Series.plot.barh()
plt.show()

Generate a horizontal bar plot from a Series object

LINE PLOT

DataFrame.plot.line(x='col_1', y='col_2')
plt.show()

Generate a line plot from a DataFrame object

SCATTER PLOT

DataFrame.plot.scatter(x='col_1', y='col_2')
plt.show()

Generate a scatter plot from a DataFrame object

HEAD

First n rows of the DataFrame

TAIL

Last n rows of the DataFrame

SHAPE

Number of rows and columns

INFO

Index, Datatype and Memory information

DESCRIBE

Summary statistics for numerical columns

VALUE_COUNTS

s.value_counts(dropna=False)

Views unique values and counts

APPLY

df.apply(pd.Series.value_counts) 

Unique values and counts for all columns

Data cleaning

Data Cleaning

Syntax for

How to use

Explained

COLUMNS

df.columns = ['a', 'b', 'c']

Renames columns

ISNULL

Checks for null Values, Returns Boolean Array

NOTNULL

Opposite of pd.isnull()

DROPNA

Drops all rows that contain null values

Drops all columns that contain null values

df.dropna(axis=1, thresh=n)

Drops all rows have have less than n non-null values

FILLNA

Replaces all null values with x

Replaces all null values with the mean (mean can be replaced with almost any function from the statistics section)

ASTYPE

Converts the datatype of the Series to float

REPLACE

Replaces all values equal to 1 with one

REPLACE

s.replace([1, 3], ['one','three'])

Replaces all 1 with 'one' and 3 with 'three'

RENAME

df.rename(columns=lambda x: x + 1)

Mass renaming of columns

df.rename(columns={'old_name': 'new_name'})

Selective renaming of columns

df.rename(index=lambda x: x + 1)

Mass renaming of index

SET_INDEX

df.set_index('column_one')

Selectively sets the index

FINDING CORRELATION

f500['revenues'].corr(f500[profits])

Calculate Pearson's r correlation between revenues and profits

Calculate the Pearson's r correlation matrix between all columns of f500

f500.corr()[['revenues', 'profits', 'assets']]

Calculate the correlation matrix for f500 and select the correlations for the revenues, profits, and assets columns

CONVERTING A COLUMN TO DATETIME

f500['founding_date'] = f500.to_datetime(f500['founding_date'])

Convert the founding_date column in f500 to datetime format

Filter sort and group by

Filter, Sort, & Group By

Syntax for

How to use

Explained

COLUMNS

Rows where the col column is greater than 0.5

df[(df[col] > 0.5) & (df[col] < 0.7)]

Rows where 0.7 > col > 0.5

SORT_VALUES

Sorts values by col1 in ascending order

df.sort_values(col2, ascending=False)

Sorts values by col2 in descending order

df.sort_values([col1, col2], ascending=[True, False])

Sorts values by col1 in ascending order then col2 in descending order

GROUPBY

Returns a groupby object for values from one column

Returns a groupby object values from multiple columns

df.groupby(col1)[col2].mean()

Returns the mean of the values in col2, grouped by the values in col1 (mean can be replaced with almost any function from the statistics section

PIVOT_TABLE

df.pivot_table(index=col1, 
               values=[col2, col3], 
               aggfunc=mean)

Creates a pivot table that groups by col1 and calculates the mean of col2 and col3

GROUPBY

df.groupby(col1).agg(np.mean)

Finds the average across all columns for every unique col1 group

APPLY

Applies a function across each column

Applies a function across each row

Join and combine

Join & Combine

Syntax for

How to use

Explained

APPEND

Adds the rows in df1 to the end of df2 (number of columns should be identical)

CONCAT

pd.concat([df1, df2], axis=1)

Adds the columns in df1 to the end of df2 (number of rows should be identical)

JOIN

df1.join(df2, on=col1, how='inner')

SQL-style joins the columns in df1 with the columns on df2 where the rows for col have identical values. how can be one of 'left', 'right', 'outer', 'inner'

Statistics

Statistics

Syntax for

How to use

Explained

DESCRIBE

Summary statistics for numerical columns

MEAN

Returns the mean of all columns

CORR

Returns the correlation between columns in a DataFrame

COUNT

Returns the number of non-null values in each DataFrame column

MAX

Returns the highest value in each column

MIN

Returns the lowest value in each column

MEDIAN

Returns the median of each column

STD

Returns the standard deviation of each column