Altcademy - a Forbes magazine logo Best Coding Bootcamp 2023

How to filter rows in Pandas

Understanding Pandas DataFrames

Before diving into the specifics of filtering rows in Pandas, let's get a better understanding of what a DataFrame is. Think of a DataFrame as a table, much like you would see in a spreadsheet program like Excel. It has rows and columns, with each column having a name and each row being a record of some sort.

In Pandas, a DataFrame is a central data structure that allows you to store and manipulate tabular data in rows and columns. It's like a powerful, supercharged Excel sheet that you can control with Python code.

Setting Up Your Environment

To start filtering rows in Pandas, you need to have Python and Pandas installed on your computer. If you're not familiar with setting up Python packages, there are many resources online to guide you through the installation process.

Once you have Pandas installed, you'll need to import it into your Python script or notebook. You can do this by writing:

import pandas as pd

The pd is an alias for Pandas. It's like giving it a nickname, so you don't have to type 'pandas' every time you want to use a function from the Pandas library.

Creating a Simple DataFrame

Let's create a simple DataFrame to work with. This will give us some data to filter. Here's an example:

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 42, 35, 23, 34],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Portland']
}

df = pd.DataFrame(data)

This code creates a DataFrame with three columns: 'Name', 'Age', and 'City', and five rows of data.

Basic Filtering: Using Conditions

Suppose you want to find all the people in your DataFrame who are older than 30. In Pandas, you can do this by applying a condition to the DataFrame:

older_than_30 = df[df['Age'] > 30]

This line of code is like asking a question: "Which rows have an 'Age' greater than 30?" Pandas then gives you a new DataFrame with only the rows that meet that condition.

Filtering with Multiple Conditions

What if you want to filter by more than one condition? For example, you might want to find all people older than 30 who also live in New York. You can do this by combining conditions with an 'and' operator, which in Pandas is the & symbol:

older_than_30_in_ny = df[(df['Age'] > 30) & (df['City'] == 'New York')]

Here, the parentheses are important. They tell Python to evaluate the conditions inside them before combining them with the &. It's like saying, "First, find everyone over 30, and then among those, find the ones who live in New York."

Using the .query() Method

Pandas also provides a method called .query() that allows you to filter rows using a query string. This can be more readable, especially when you have complex conditions:

older_than_30_in_ny = df.query("Age > 30 and City == 'New York'")

This does the same thing as the previous example but uses a string to express the conditions. It's like writing a mini-sentence that describes what you're looking for.

Filtering with .isin() for Multiple Values in a Column

Sometimes you want to filter rows based on whether a column's value is in a list of possible values. For instance, let's say you want to find all people in the DataFrame who live in either New York or Chicago. You can use the .isin() method:

cities = ['New York', 'Chicago']
in_ny_or_chicago = df[df['City'].isin(cities)]

This is like asking, "Are these people living in any of the cities on my list?" If the answer is yes, those rows are included in the new DataFrame.

Filtering Out Missing Data

In real-world data, you'll often encounter missing values. In Pandas, these are represented as NaN (Not a Number). To filter out rows with missing data in a specific column, you can use the .dropna() method:

clean_df = df.dropna(subset=['Age'])

This line tells Pandas to drop any rows where the 'Age' column has missing data. It's like saying, "If you don't know the age, remove that record from my table."

Using .loc and .iloc for More Control

Sometimes you need more control over which rows you're filtering. This is where .loc and .iloc come in handy. The .loc method allows you to select rows and columns by labels, and .iloc does the same by integer positions.

For example, if you want to get all rows where 'Age' is greater than 30 and only the 'Name' and 'City' columns, you can do:

older_than_30_names_cities = df.loc[df['Age'] > 30, ['Name', 'City']]

This is like giving precise instructions: "Give me the 'Name' and 'City' for everyone over 30."

Intuitive Analogies to Understand Filtering

Filtering rows in a DataFrame can be likened to sorting through a deck of cards. Imagine each card is a row in your DataFrame, and the characteristics of the card (suit, number) are the columns. When you're looking for all the hearts, you're applying a filter to the deck, much