How to use Pandas in Python

Getting Started with Pandas

Imagine you have a huge pile of papers, each filled with rows and columns of data, like a high school's grade report for every student. If you wanted to sort, filter, or manipulate that data by hand, it would take ages! This is where Pandas comes to the rescue. Pandas is a powerful Python library that helps you manage and analyze large datasets with ease, almost like having a super-powered, data-organizing assistant at your side.

Installing Pandas

Before we dive into using Pandas, we need to ensure it's installed on your system. If you have Python installed, installing Pandas is as simple as running a single command in your terminal or command prompt:

pip install pandas

Understanding Data Structures: Series and DataFrames

Pandas has two primary data structures: Series and DataFrames.

  • Series: A Series is like a column in a spreadsheet. It's a one-dimensional labeled array that can hold data of any type (numbers, strings, etc.).
  • DataFrame: A DataFrame is a two-dimensional data structure, like a spreadsheet or a SQL table. It's essentially a collection of Series objects that share the same row index and together form a table.
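
To make the relationship between the two structures concrete, here is a minimal sketch. The variable names (ages, cities, people) and the use of names as index labels are illustrative assumptions, not part of the example that follows.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels
ages = pd.Series([25, 30, 35], index=['Alice', 'Bob', 'Charlie'], name='Age')
print(ages['Bob'])  # look up a value by its label

# A DataFrame built from two Series that share the same index
cities = pd.Series(['New York', 'Paris', 'London'], index=ages.index, name='City')
people = pd.DataFrame({'Age': ages, 'City': cities})
print(people)

# Selecting a single column gives back a Series
age_column = people['Age']
print(type(age_column).__name__)
```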

Let's create our first DataFrame with some example data:

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Paris', 'London']
}

df = pd.DataFrame(data)

print(df)

Running the above code will display:

      Name  Age      City
0    Alice   25  New York
1      Bob   30     Paris
2  Charlie   35    London

Here, df is our DataFrame containing names, ages, and cities. It's like a table with rows and columns, where each column is a Series.

Reading Data from Files

In real-world scenarios, you won't manually create all your data within the code. You'll likely be working with data stored in files like CSV, Excel, or databases. Let's see how to read a CSV file into a DataFrame:

# Assuming you have a file named 'data.csv'
df = pd.read_csv('data.csv')
print(df)

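read_csv is a function that reads a comma-separated values (CSV) file and converts it into a DataFrame. By default, the first line of the file is used as the column names. Pandas supports many other file formats, like Excel (with read_excel, which needs an extra package such as openpyxl for .xlsx files), JSON (with read_json), and more.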
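
If you don't have a CSV file at hand, you can still try read_csv. The sketch below feeds it an in-memory string through io.StringIO, which stands in for a file on disk; the sample rows and the name csv_df are made up for illustration.

```python
import io
import pandas as pd

# Stand-in for the contents of a CSV file on disk (hypothetical sample data)
csv_text = """Name,Age,City
Alice,25,New York
Bob,30,Paris
Charlie,35,London
"""

# read_csv accepts a file path or any file-like object;
# the first line becomes the column names
csv_df = pd.read_csv(io.StringIO(csv_text))
print(csv_df)

# Pandas infers a type for each column while reading
print(csv_df.dtypes)
```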

Exploring Data

Now that we have our data in a DataFrame, we can start exploring it. We often start by getting a quick overview with the following commands:

# Display the first 5 rows of the DataFrame
print(df.head())

# Display the last 5 rows of the DataFrame
print(df.tail())

# Get a summary of the DataFrame structure
# (info() prints its report itself and returns None, so no print() is needed)
df.info()

# Get statistical summaries of numerical columns
print(df.describe())
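
A few more quick checks are often useful at this stage. This is a small sketch on a hand-built DataFrame (the name people is an assumption), showing the shape and dtypes attributes and the optional row count that head accepts.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Paris', 'London']
})

# (number of rows, number of columns)
print(people.shape)

# The data type pandas inferred for each column
print(people.dtypes)

# head and tail accept the number of rows to show
first_two = people.head(2)
print(first_two)

# describe() covers numeric columns by default;
# include='all' also summarizes text columns
print(people.describe(include='all'))
```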

Selecting and Manipulating Data

You can think of a DataFrame like a cake. Sometimes you want a slice of it, or maybe you want to change the flavor of a layer. In Pandas, this is done by selecting and manipulating the data.

Selecting Columns

To select a single column, use the column's name:

ages = df['Age']
print(ages)

For multiple columns, pass a list of column names:

subset = df[['Name', 'City']]
print(subset)

Selecting Rows

Rows can be selected by their position using iloc or by their label using loc:

# Select the first row by position
first_row = df.iloc[0]
print(first_row)

# Select the row whose index label is 1 (an integer label, not the string '1')
row_with_label_1 = df.loc[1]
print(row_with_label_1)
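
With the default 0, 1, 2, ... index, positions and labels happen to coincide, which can hide the difference between iloc and loc. The sketch below uses string labels (a hypothetical index chosen for illustration) to make the distinction visible.

```python
import pandas as pd

people = pd.DataFrame(
    {'Age': [25, 30, 35], 'City': ['New York', 'Paris', 'London']},
    index=['alice', 'bob', 'charlie'],  # hypothetical string labels
)

# iloc counts positions: this is the second row, whatever its label is
by_position = people.iloc[1]

# loc looks up the index label
by_label = people.loc['bob']

# Both refer to the same row here
print(by_position.equals(by_label))

# loc can also pick a single cell by row label and column name
print(people.loc['charlie', 'City'])
```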

Filtering Data

Sometimes you only want the rows that meet certain conditions. For example, to select only the rows where the age is greater than 30:

older_than_30 = df[df['Age'] > 30]
print(older_than_30)
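
Filters can also be combined. The following sketch assumes the same three-person sample data (rebuilt here as people so the snippet runs on its own) and shows the & operator and the isin method.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Paris', 'London']
})

# Combine conditions with & (and) or | (or).
# Each condition needs its own parentheses.
over_25_not_paris = people[(people['Age'] > 25) & (people['City'] != 'Paris')]
print(over_25_not_paris)

# isin keeps the rows whose value appears in the given list
in_europe = people[people['City'].isin(['Paris', 'London'])]
print(in_europe)
```

Python's plain and / or keywords don't work element by element on a Series, which is why the & and | operators are used here.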

Adding and Deleting Columns

To add a new column, assign to it. A single value such as True is repeated for every row:

df['Employed'] = True
print(df)

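To delete a column, use drop with axis=1 (which means "columns"). drop returns a new DataFrame rather than changing the original, so assign the result back: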

df = df.drop('Employed', axis=1)
print(df)

Performing Operations on Data

Pandas allows you to apply functions to your data to perform calculations or transformations.

Applying Functions

For example, to add 10 years to everyone's age:

df['Age'] = df['Age'].apply(lambda x: x + 10)
print(df)
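
For simple arithmetic like this, apply is not strictly needed: a Series supports operators directly, and that form is usually shorter and faster. A minimal comparison, using a small stand-in DataFrame named people:

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

# Element-by-element with apply
with_apply = people['Age'].apply(lambda x: x + 10)

# The same result using arithmetic on the whole column at once
vectorized = people['Age'] + 10

print(with_apply.tolist())
print(vectorized.tolist())
```

apply remains useful when the transformation is a custom function that has no built-in column-wide equivalent.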

Aggregations

Aggregations are operations that summarize your data. For instance, finding the average age:

average_age = df['Age'].mean()
print(f"The average age is {average_age}")
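
Other common aggregations work the same way, and agg can compute several at once. The sketch below uses a small stand-in DataFrame (people) so it runs on its own.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

# Individual aggregations
oldest = people['Age'].max()
total = people['Age'].sum()
print(oldest, total)

# Several aggregations in one call; the result is a Series
# indexed by the aggregation names
summary = people['Age'].agg(['min', 'max', 'mean'])
print(summary)
```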

Merging and Joining Data

If you have data spread across multiple tables, you may need to combine them. This is similar to putting together pieces of a puzzle to see the whole picture.

Concatenating DataFrames

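To combine DataFrames vertically (stacking one on top of the other), use concat. Passing ignore_index=True renumbers the rows from 0 instead of keeping each DataFrame's original labels: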

df2 = pd.DataFrame({
    'Name': ['Dave', 'Eva'],
    'Age': [40, 28],
    'City': ['Tokyo', 'Berlin']
})

combined_df = pd.concat([df, df2], ignore_index=True)
print(combined_df)

Joining DataFrames

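To combine DataFrames horizontally, based on a common key, use merge. By default merge performs an inner join, so only the names that appear in both DataFrames (here Alice and Bob) are kept: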

df3 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Dave'],
    'Salary': [70000, 80000, 90000]
})

merged_df = df.merge(df3, on='Name')
print(merged_df)
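
The how parameter of merge controls which rows survive the join. The sketch below rebuilds two small tables (named people and salaries here for illustration) and compares the default inner join with left and outer joins.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})
salaries = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Dave'],
    'Salary': [70000, 80000, 90000]
})

# Default (how='inner'): only names present in both tables
inner = people.merge(salaries, on='Name')

# how='left': every row of people; missing salaries become NaN
left = people.merge(salaries, on='Name', how='left')

# how='outer': every name from either table
outer = people.merge(salaries, on='Name', how='outer')

print(inner)
print(left)
print(outer)
```

Charlie has no salary and Dave has no age, so those cells show up as NaN in the left and outer results.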

Visualization with Pandas

Visualizing your data can give you insights that are not obvious from just looking at numbers. Pandas integrates with Matplotlib, a Python plotting library, to enable data visualization.

# You need to install matplotlib first
# pip install matplotlib

import matplotlib.pyplot as plt

df['Age'].plot(kind='hist')
plt.show()

This code will show a histogram of the ages, allowing you to see the distribution of ages within your dataset.

Conclusion: The Power of Pandas

Pandas is a powerful tool that turns complex data manipulation into a series of simple tasks. By learning to wield it, you can load, filter, transform, combine, and plot large datasets in a few lines of code, making it possible to uncover insights and make decisions based on actual data. It's like being given a set of glasses that brings the essential details of a blurry picture into sharp focus. As you continue to practice and explore the functionalities of Pandas, you'll find that your ability to manage and understand data grows steadily. So, embrace the journey of learning Pandas, and let your data analysis skills flourish!