
How to Use a Pandas DataFrame

Understanding Pandas DataFrames: A Beginner's Guide

When embarking on the journey of learning programming, especially data analysis with Python, one of the most powerful tools you'll encounter is the Pandas library. At the heart of Pandas is the DataFrame, a structure that allows you to store and manipulate tabular data efficiently and intuitively. Think of a DataFrame as a table or a spreadsheet that you can program.

What is a DataFrame?

A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). Imagine a DataFrame as a table with rows and columns, where rows represent individual records (entries) and columns represent different attributes or features of these records.
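The terms in that definition can be checked directly on a small DataFrame. The following is a minimal sketch; the variable name people and its sample values are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

print(people.shape)          # (number of rows, number of columns)
print(list(people.columns))  # column labels
print(list(people.index))    # row labels (a default integer range here)
print(people.dtypes)         # heterogeneous: each column has its own data type

# Size-mutable: adding a column changes the shape of the same object.
people['City'] = ['New York', 'Los Angeles', 'Chicago']
print(people.shape)
```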

Creating a DataFrame

To start using DataFrames, you'll first need to import the Pandas library. If you haven't installed Pandas yet, you can do so by running pip install pandas in your command line or terminal.

import pandas as pd

Now, let's create our first DataFrame. You can create a DataFrame from various data structures like lists, dictionaries, or even from external files like CSVs. Here's an example of creating a DataFrame from a dictionary:

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)
print(df)

This code will output:

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
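The other sources mentioned above work in much the same way. As a rough sketch, the example below builds the same small table from a list of dictionaries, from a list of lists, and from CSV text. The CSV content is held in an in-memory string (an assumption made so the example runs without a file); with a real file you would pass its path to pd.read_csv.

```python
import io
import pandas as pd

# A list of dictionaries: each dictionary becomes one row.
records = [
    {'Name': 'Alice', 'Age': 25},
    {'Name': 'Bob', 'Age': 30},
]
df_from_records = pd.DataFrame(records)

# A list of lists: column names are supplied separately.
rows = [['Alice', 25], ['Bob', 30]]
df_from_rows = pd.DataFrame(rows, columns=['Name', 'Age'])

# CSV text read from an in-memory buffer (stands in for a file on disk).
csv_text = "Name,Age\nAlice,25\nBob,30\n"
df_from_csv = pd.read_csv(io.StringIO(csv_text))

print(df_from_records)
print(df_from_rows)
print(df_from_csv)
```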

Accessing Data in DataFrame

Selecting Columns

To access the data in a DataFrame, you can select columns using their names. For example, to get all the names from the DataFrame df, you would do:

names = df['Name']
print(names)

This will give you:

0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object
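Selecting one column returns a Series, while passing a list of column names returns a smaller DataFrame. Here is a short sketch of the difference, rebuilding the sample data so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

ages = df['Age']                     # single label -> Series
name_and_city = df[['Name', 'City']] # list of labels -> DataFrame

print(type(ages).__name__)
print(type(name_and_city).__name__)
print(name_and_city)
```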

Selecting Rows

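Rows are accessed with the .loc and .iloc indexers, which take square brackets rather than parentheses. .loc is label-based: you specify the labels of the rows and columns you want to select. .iloc is position-based: you specify rows and columns by their integer position, starting at 0. With the default index the row labels are the integers 0, 1, 2, ..., so the two often look alike, but they differ as soon as the index has other labels.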

Here's how you can use .iloc to get the first row of the DataFrame:

first_row = df.iloc[0]
print(first_row)

And the output will be:

Name      Alice
Age          25
City    New York
Name: 0, dtype: object
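The difference between the two indexers is easier to see when the row labels are not plain integers. The sketch below assumes a DataFrame whose index holds hypothetical string labels ('alice', 'bob', 'charlie'); these labels are not part of the original example.

```python
import pandas as pd

df = pd.DataFrame(
    {'Age': [25, 30, 35], 'City': ['New York', 'Los Angeles', 'Chicago']},
    index=['alice', 'bob', 'charlie'],  # hypothetical row labels
)

print(df.loc['bob'])          # row selected by label
print(df.iloc[1])             # the same row selected by position
print(df.loc['bob', 'City'])  # single cell: row label, column label
print(df.iloc[1, 1])          # the same cell by position

# Label slices include the end label; position slices exclude the end position.
print(df.loc['alice':'bob'])  # two rows
print(df.iloc[0:1])           # one row
```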

Modifying DataFrames

Adding Columns

You can add new columns to your DataFrame simply by assigning a new column label and passing the data for that column. For instance, if you want to add a column for email addresses:

df['Email'] = ['alice@example.com', 'bob@example.com', 'charlie@example.com']
print(df)

Now, the DataFrame df will include an 'Email' column.
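New columns do not have to come from a hand-written list. As a sketch, the example below derives one column from an existing one, fills another with a single value, and uses assign to get a modified copy. The column names AgeInMonths, Country, and IsOver30 are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

# Computed element-wise from an existing column.
df['AgeInMonths'] = df['Age'] * 12

# A single value is repeated for every row.
df['Country'] = 'USA'

# assign returns a new DataFrame and leaves df itself untouched.
with_flag = df.assign(IsOver30=df['Age'] > 30)

print(df)
print(with_flag)
```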

Deleting Columns

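To remove columns, you can use the drop method. drop returns a new DataFrame and leaves the original unchanged, so store the result under a new name if you still need the dropped column later:

df_without_age = df.drop('Age', axis=1)  # axis=1 specifies that we want to drop a column, not a row
print(df_without_age)

The 'Age' column is absent from df_without_age, while df still contains it. The examples that follow rely on df keeping its 'Age' column.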
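drop also accepts the columns= and index= keywords, which some readers find clearer than axis. A minimal sketch, with the sample data rebuilt so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# columns= is equivalent to passing the label with axis=1.
names_only = df.drop(columns=['Age', 'City'])

# index= drops rows by their label instead.
without_first_row = df.drop(index=0)

print(names_only)
print(without_first_row)
print(list(df.columns))  # df itself is unchanged
```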

Filtering Data

Often, you'll want to work with a subset of your data based on certain conditions. Let's filter our DataFrame to only include people older than 25:

older_than_25 = df[df['Age'] > 25]
print(older_than_25)

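This outputs only the rows where 'Age' is greater than 25 (Bob and Charlie in the original data). The filter needs the 'Age' column, so df must still contain it at this point.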
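Conditions can be combined. The sketch below uses & for "and", | for "or", and isin for membership in a list. Each condition needs its own parentheses because & and | bind more tightly than comparisons in Python.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# Both conditions must hold.
over_25_in_chicago = df[(df['Age'] > 25) & (df['City'] == 'Chicago')]

# Either condition may hold.
under_30_or_in_la = df[(df['Age'] < 30) | (df['City'] == 'Los Angeles')]

# Rows whose city appears in a given list.
in_selected_cities = df[df['City'].isin(['New York', 'Chicago'])]

print(over_25_in_chicago)
print(under_30_or_in_la)
print(in_selected_cities)
```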

Grouping and Aggregating Data

Grouping data is a common operation that involves splitting your data into groups and then applying a function to each group independently. For example, if you want to know the average age of people in each city:

average_age_by_city = df.groupby('City')['Age'].mean()
print(average_age_by_city)

This will give you the average age per city.
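Grouping is more informative when several rows share a group. The sketch below uses hypothetical data in which cities repeat (these rows are not from the original example) and computes several aggregates at once with agg.

```python
import pandas as pd

# Hypothetical data where several people live in the same city.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 28],
    'City': ['New York', 'Chicago', 'Chicago', 'New York', 'Chicago'],
})

# One value per city.
mean_age = df.groupby('City')['Age'].mean()

# Several aggregates per city, one column for each function.
summary = df.groupby('City')['Age'].agg(['count', 'mean', 'max'])

print(mean_age)
print(summary)
```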

Merging and Joining DataFrames

In real-world scenarios, data often comes in multiple sets that you need to combine. Pandas provides several methods to merge DataFrames, such as concat, merge, and join.

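Here's a simple example using concat to combine two DataFrames vertically, stacking the rows of one under the other. Any column present in only one of the DataFrames is filled with NaN for the rows that lack it: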

additional_data = pd.DataFrame({
    'Name': ['David', 'Eva'],
    'Age': [40, 28],
    'City': ['Boston', 'Denver']
})

df = pd.concat([df, additional_data]).reset_index(drop=True)
print(df)
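While concat stacks tables, merge combines them side by side by matching values in a shared column. The following is a minimal sketch; the cities table and its State column are assumptions added for illustration.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# Hypothetical lookup table; it deliberately has no row for Los Angeles.
cities = pd.DataFrame({
    'City': ['New York', 'Chicago'],
    'State': ['NY', 'IL'],
})

# Default (inner) merge keeps only rows whose City appears in both tables.
inner = people.merge(cities, on='City')

# A left merge keeps every row of people; missing matches become NaN.
left = people.merge(cities, on='City', how='left')

print(inner)
print(left)
```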

Visualizing Data

Pandas also integrates with Matplotlib, a plotting library, to enable you to visualize your data directly from DataFrames. For instance, to plot the ages of people in your DataFrame:

import matplotlib.pyplot as plt

df['Age'].plot(kind='bar')
plt.show()

This will display a bar chart of the ages.
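By default the bars are labeled with the row index (0, 1, 2). As a sketch, the version below labels each bar with the person's name and adds axis text. It assumes Matplotlib is installed, selects the non-interactive 'Agg' backend so that it also runs without a display, and writes the image to an in-memory buffer instead of opening a window.

```python
import io

import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit this to open a window with plt.show()
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

# x picks the column used for bar labels, y the column used for bar heights.
ax = df.plot(kind='bar', x='Name', y='Age', legend=False)
ax.set_ylabel('Age')
ax.set_title('Age by person')

# Render the figure as PNG bytes; pass a file name instead to save to disk.
buffer = io.BytesIO()
ax.get_figure().savefig(buffer, format='png')
plt.close(ax.get_figure())
```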

Conclusion: The Power of Pandas at Your Fingertips

DataFrames are a powerful tool for anyone learning programming and data analysis. They provide a flexible and intuitive way to handle tabular data. As you've seen, with just a few lines of Python code, you can create, modify, filter, group, combine, and visualize data. Pandas lets you focus on the data and the analysis rather than on the mechanics of looping over rows and columns yourself.

As you continue to explore Pandas, you'll discover more advanced features and techniques that can help you to wrangle even the messiest of data sets. Remember, the key to becoming proficient in data analysis with Pandas is practice and exploration. So, dive into your data, play around with DataFrames, and unlock the insights that await you. Happy coding!