Altcademy - a Forbes magazine logo Best Coding Bootcamp 2023

How to filter Pandas dataframe

Understanding the Basics of Filtering in Pandas

Filtering data is a common task in data analysis. It's like sorting through a basket of fruits to pick out only the apples. In programming, particularly with the Pandas library in Python, filtering allows you to view or analyze a subset of your data that meets certain conditions.

Pandas is an open-source data manipulation and analysis library for Python that provides flexible data structures. These structures are designed to make it easy to work with structured (tabular, multidimensional, potentially heterogeneous) and time series data.

Getting Started with Pandas DataFrames

Before we dive into filtering, it's important to understand what a DataFrame is. A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). You can think of it as a spreadsheet or a SQL table.

Here's an example of creating a simple DataFrame using Pandas:

import pandas as pd

# Create a simple DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 30, 18, 22, 29],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
}

df = pd.DataFrame(data)

print(df)

This code will produce the following DataFrame:

      Name  Age         City
0    Alice   24     New York
1      Bob   30  Los Angeles
2  Charlie   18      Chicago
3    David   22      Houston
4      Eva   29      Phoenix

Basic Filtering Techniques

Filtering in Pandas is done by selecting rows that satisfy a certain condition. Let's say you want to find all people in your DataFrame who are older than 25. You can do this by:

older_than_25 = df[df['Age'] > 25]

print(older_than_25)

This code will output:

  Name  Age         City
1  Bob   30  Los Angeles
4  Eva   29      Phoenix

The expression df['Age'] > 25 creates a boolean series that looks like [False, True, False, False, True]. When you use this series to index the DataFrame, Pandas selects only the rows where the value is True.

Combining Multiple Conditions

What if you want to filter using more than one condition? For instance, finding people over 25 who also live in New York. In Pandas, you can combine conditions using & (and) or | (or) operators.

Here's how you can apply multiple conditions:

older_than_25_in_ny = df[(df['Age'] > 25) & (df['City'] == 'New York')]

print(older_than_25_in_ny)

This will result in an empty DataFrame, as no one in the original DataFrame meets both conditions. Make sure to wrap each condition in parentheses to avoid any ambiguity in the order of operations.

Filtering Using query()

Another way to filter data in Pandas is by using the query() method. This method allows you to express the filter conditions as a string, which can be more readable at times.

Here's the previous example using query():

older_than_25_in_ny = df.query("Age > 25 & City == 'New York'")

print(older_than_25_in_ny)

This yields the same result as before. The query() method is especially useful when you have long or complex filtering conditions.

Using Functions to Filter

Sometimes, you might want to use a more complex function to determine if a row should be included in the filtered dataset. This is where the apply() method comes in handy. It allows you to apply a custom function along an axis of the DataFrame.

Let's filter for names that start with the letter 'A':

def starts_with_a(name):
    return name.startswith('A')

names_starting_with_a = df[df['Name'].apply(starts_with_a)]

print(names_starting_with_a)

The output will be:

    Name  Age      City
0  Alice   24  New York

The apply() method runs the starts_with_a function for each value in the 'Name' column and returns a boolean series that is then used to filter the DataFrame.

Filtering with isin()

If you need to filter rows based on whether the column's value is in a set of interest, you can use the isin() method. Suppose you want to filter the DataFrame to show only the rows where the city is either New York or Chicago:

cities = ['New York', 'Chicago']
in_cities = df[df['City'].isin(cities)]

print(in_cities)

This will display:

      Name  Age      City
0    Alice   24  New York
2  Charlie   18   Chicago

Handling Missing Data

In real-world data, you often encounter missing values. Pandas represents these as NaN (Not a Number). To filter out rows with missing data, you can use dropna():

# Suppose we have a DataFrame with missing values
df_with_nan = df.copy()
df_with_nan.loc[2, 'Age'] = None  # Introduce a missing value

# Drop rows with any column having NA/null data
clean_df = df_with_nan.dropna()

print(clean_df)

The resulting DataFrame will exclude the row with the missing 'Age'.

Conclusion

Filtering is an essential skill when it comes to data analysis with Pandas. It allows you to focus on the most relevant pieces of your dataset, just like finding the perfect ingredients for a gourmet dish. By mastering the techniques we discussed, you can slice and dice your data in any way you need. Remember to think of filtering as setting up a series of gates that only let through the data that meets your criteria. With practice, you'll be able to handle even the most complex datasets with ease. Keep experimenting with different filtering methods and conditions, and soon enough, you'll be able to uncover valuable insights from your data with just a few lines of code.