
How to filter columns in Pandas

Understanding DataFrames in Pandas

Before we dive into the specifics of filtering columns in Pandas, let's understand what a DataFrame is, as it's the central structure we'll be working with. You can think of a DataFrame as a table, much like one you would find in a spreadsheet. It's composed of rows and columns, with each column having a name that you can use to access it.

Getting Started with Pandas

To start working with Pandas, you need to have it installed on your computer. If you haven't done so, you can install it using pip, which is a package manager for Python:

pip install pandas

Once installed, you can import Pandas in your Python script like this:

import pandas as pd

Here, pd is a common alias used for Pandas, so you don't have to type pandas every time you want to use a function from the library.

Creating a Sample DataFrame

To show you how to filter columns, we'll need a DataFrame to work with. Let's create a simple one:

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, 22, 32, 29],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
}

df = pd.DataFrame(data)

This code will give us a DataFrame that looks like this:

      Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago
3    David   32      Houston
4      Eva   29      Phoenix
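Before filtering, it can help to confirm which column names are actually available. The sketch below rebuilds the same sample DataFrame and inspects it with the standard columns and shape attributes; the variable names are simply the ones used in this article.

```python
import pandas as pd

# Same sample data as above
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, 22, 32, 29],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
}
df = pd.DataFrame(data)

# List the column names you can filter on
print(df.columns.tolist())  # ['Name', 'Age', 'City']

# (number of rows, number of columns)
print(df.shape)  # (5, 3)
```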

Basic Column Filtering

Filtering columns in Pandas is like asking a question: "Can you show me only this specific information from the table?" Let's say we want only the 'Name' and 'Age' columns from our DataFrame. Here's how you do it:

filtered_df = df[['Name', 'Age']]

The output will be:

      Name  Age
0    Alice   24
1      Bob   27
2  Charlie   22
3    David   32
4      Eva   29

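Notice the double square brackets [[ ]]. The outer pair is the DataFrame's indexing operator, and the inner pair is a Python list of column names. Because we pass a list, the result is a new DataFrame containing only those columns.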
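To see why the brackets matter, here is a small sketch comparing single and double brackets on the same sample data. With a single column name you get a Series back; with a list, even a one-item list, you get a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, 22, 32, 29],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
})

# Single brackets with one name: a one-dimensional Series
ages = df['Age']
print(type(ages).__name__)  # Series

# Double brackets with a one-item list: a DataFrame with one column
ages_df = df[['Age']]
print(type(ages_df).__name__)  # DataFrame
print(ages_df.shape)  # (5, 1)
```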

Using loc and iloc to Filter Columns

Pandas provides two indexers, loc and iloc, for more advanced filtering. Both are used with square brackets rather than called like functions. loc does label-based indexing, which means you filter by column name. iloc does position-based indexing, which means you filter by each column's integer position, counting from 0.
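As a quick side-by-side, the sketch below selects the same two columns once by label and once by position, using the sample data from this article. In this DataFrame, 'Name' sits at position 0 and 'City' at position 2.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, 22, 32, 29],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
})

# Select by column label
by_label = df.loc[:, ['Name', 'City']]

# Select the same columns by integer position (0-based)
by_position = df.iloc[:, [0, 2]]

# Both approaches return identical DataFrames
print(by_label.equals(by_position))  # True
```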

Label-based Indexing with loc

If you want to select all rows but only specific columns by their names, you can use loc like this:

filtered_df = df.loc[:, ['Name', 'City']]

This will give you:

      Name         City
0    Alice     New York
1      Bob  Los Angeles
2  Charlie      Chicago
3    David      Houston
4      Eva      Phoenix

The : before the comma means "select all rows," and the list after the comma specifies the columns you want.
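loc also accepts a slice of labels in place of a list. As a minimal sketch on the same sample data, the slice below picks every column from 'Name' through 'Age'. Unlike ordinary Python slices, a label slice in loc includes both endpoints.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, 22, 32, 29],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
})

# Label slice: both 'Name' and 'Age' are included in the result
name_to_age = df.loc[:, 'Name':'Age']
print(name_to_age.columns.tolist())  # ['Name', 'Age']
```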

Position-based Indexing with iloc

Sometimes, you might not know the column names or you just prefer using their integer positions. Here's how you can do this with `