How to use Pandas

Getting Started with Pandas

Pandas is a powerful library in Python that allows us to work with data in a way that is both intuitive and efficient. Think of it as a supercharged Excel spreadsheet that you can program. It's used widely in data analysis, data cleaning, and data visualization.

Understanding Data Structures in Pandas

Pandas has two main data structures: DataFrame and Series.

What is a DataFrame?

Imagine a DataFrame as a table with rows and columns, much like a sheet in Excel. Each column can have a different type of data (numeric, string, datetime, etc.), and each row represents an entry in the dataset.

What is a Series?

A Series, on the other hand, is like a single column of that table. It's a one-dimensional array holding data of any type.

Installing Pandas

Before we dive into using Pandas, we need to make sure it's installed on your system. If you already have Python installed, installing Pandas is as simple as running the following command in your command prompt or terminal:

pip install pandas

Importing Pandas

Once installed, you can access Pandas in your Python script by importing it. It's common practice to import Pandas with the alias pd for convenience:

import pandas as pd

Creating Your First DataFrame

Let's start by creating a DataFrame from scratch. We'll use a dictionary where keys will become column names and values will become the data in the columns.

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Paris', 'London']
}

df = pd.DataFrame(data)
print(df)

This will output:

      Name  Age      City
0    Alice   25  New York
1      Bob   30     Paris
2  Charlie   35    London
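
Each column of a DataFrame is itself a Series, which ties the two data structures together. The following is a minimal sketch that reuses two columns of the sample data; the variable names are illustrative only.

```python
import pandas as pd

# A standalone Series built from a list; pandas assigns the default index 0, 1, 2.
ages = pd.Series([25, 30, 35], name='Age')

# The same values stored as a column of a DataFrame.
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
age_column = df['Age']

# Selecting one column returns a Series with the same values and index.
print(isinstance(age_column, pd.Series))  # True
print(ages.equals(age_column))            # True
```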

Reading Data from Files

One of the most common tasks is to read data from a file. Pandas makes this easy with functions like read_csv for reading CSV files, which are plain text files with values separated by commas.

df = pd.read_csv('path_to_file.csv')

Replace 'path_to_file.csv' with the actual path to your CSV file.
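
To try read_csv without a file on disk, you can pass it an in-memory text buffer instead of a path. This is a sketch under that assumption; the CSV content below is made up to mirror the sample data used earlier.

```python
import io
import pandas as pd

# Hypothetical CSV content, held in memory so the example needs no file.
csv_text = """Name,Age,City
Alice,25,New York
Bob,30,Paris
Charlie,35,London
"""

# read_csv accepts a path or a file-like object; the first line becomes the header.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (3, 3)
print(list(df.columns))  # ['Name', 'Age', 'City']
```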

Exploring Your Data

Once your data is loaded into a DataFrame, you'll want to explore it. Here are a few methods to help you understand your dataset better:

  • head(): Shows the first few rows of the DataFrame.
  • tail(): Shows the last few rows of the DataFrame.
  • describe(): Provides a statistical summary of numerical columns.
  • info(): Prints a concise summary of the DataFrame, including the number of non-null entries in each column.

print(df.head())
print(df.tail())
print(df.describe())
df.info()  # info() prints its summary directly and returns None, so it needs no print()
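
A few related tools are often handy at this stage. The sketch below assumes the three-row sample DataFrame from earlier and shows head() with an explicit row count, plus the shape and dtypes attributes.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Paris', 'London'],
})

# head() and tail() accept the number of rows to return (the default is 5).
first_two = df.head(2)
print(first_two)

# shape is a (rows, columns) tuple; dtypes lists the data type of each column.
print(df.shape)  # (3, 3)
print(df.dtypes)
```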

Selecting and Filtering Data

Selecting Columns

To select a single column, use the column's name:

ages = df['Age']
print(ages)

For multiple columns, pass a list of column names:

subset = df[['Name', 'City']]
print(subset)

Filtering Rows

To filter rows based on a condition, use a boolean expression:

older_than_30 = df[df['Age'] > 30]
print(older_than_30)

This will display all rows where the 'Age' column has values greater than 30.
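
Conditions can also be combined. A minimal sketch, again assuming the sample data from earlier: wrap each condition in parentheses and join them with & (and) or | (or), because Python's own and/or keywords do not work element-wise on a Series.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Paris', 'London'],
})

# Rows where Age is above 25 AND City is not Paris.
over_25_not_paris = df[(df['Age'] > 25) & (df['City'] != 'Paris')]
print(over_25_not_paris)  # only Charlie's row matches

# Rows where Age is below 30 OR City is London.
young_or_london = df[(df['Age'] < 30) | (df['City'] == 'London')]
print(young_or_london)    # Alice's and Charlie's rows match
```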

Modifying Data

Adding Columns

You can add a new column to a DataFrame just as you would add a new key-value pair to a dictionary. The list you assign must contain one value per row:

df['Employed'] = [True, False, True]
print(df)
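
A new column can also be computed from an existing one. In this sketch the column name 'Age in 5 Years' is a made-up example; the arithmetic is applied to every row at once.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

# Arithmetic on a column is applied element-wise, so no loop is needed.
df['Age in 5 Years'] = df['Age'] + 5
print(df)
```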

Modifying Values

To change a single value, use the loc indexer with the row label and column name:

df.loc[0, 'Age'] = 26
print(df)

This changes the age of the first row to 26.
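
loc also accepts a boolean condition in place of a row label, which lets you update every matching row in one step. A minimal sketch, assuming the sample data from earlier; the replacement value 'Berlin' is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Paris', 'London'],
})

# Set City to 'Berlin' for every row whose Name is 'Bob'.
df.loc[df['Name'] == 'Bob', 'City'] = 'Berlin'
print(df)
```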

Handling Missing Data

Missing data is a common issue. Pandas provides isnull() to flag missing values and dropna() to remove the rows that contain them:

print(df.isnull())
df_clean = df.dropna()
print(df_clean)
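
The sample DataFrame has no gaps, so the calls above change nothing. The sketch below uses a made-up dataset with one missing age to show the effect, and adds fillna() as one possible alternative to dropping rows; filling with the column mean is just one choice among many.

```python
import pandas as pd

# Hypothetical data: Bob's age is missing (NaN).
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, float('nan'), 35],
})

# Count the missing values in each column.
print(df.isnull().sum())

# Option 1: drop every row that contains a missing value.
dropped = df.dropna()

# Option 2: fill the gap, here with the mean of the known ages (30.0).
filled = df.fillna({'Age': df['Age'].mean()})

print(dropped)
print(filled)
```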

Grouping and Aggregating Data

Grouping data can be useful when you want to perform a calculation on subsets of your dataset. The groupby() method is used for this purpose:

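grouped = df.groupby('City')
print(grouped['Age'].mean())  # select the numeric column; averaging text columns such as 'Name' raises an error in pandas 2.0+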

This would give you the average age for each city.
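
Grouping is more informative when a group holds more than one row. The sketch below uses a made-up four-row dataset (the extra person and the repeated cities are assumptions for illustration) and computes two aggregates at once with agg().

```python
import pandas as pd

# Hypothetical data in which each city appears twice.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Age': [25, 30, 35, 45],
    'City': ['New York', 'Paris', 'Paris', 'New York'],
})

# For each city, compute the mean age and the number of people.
summary = df.groupby('City')['Age'].agg(['mean', 'count'])
print(summary)
```

With this data, New York averages 35.0 and Paris averages 32.5, each over two people.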

Merging and Joining Data

Sometimes you'll have data spread across multiple DataFrames. You can combine these DataFrames using methods like merge():

df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [70000, 80000]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

merged_df = pd.merge(df1, df2, on='Name')
print(merged_df)

This will merge df1 and df2 on the 'Name' column.
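
By default, merge() keeps only the keys found in both DataFrames (an inner join). The how parameter changes that. A minimal sketch with made-up data in which the names only partly overlap:

```python
import pandas as pd

salaries = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [70000, 80000]})
ages = pd.DataFrame({'Name': ['Alice', 'Charlie'], 'Age': [25, 35]})

# Inner join (the default): only names present in both DataFrames survive.
inner = pd.merge(salaries, ages, on='Name')
print(inner)  # only Alice's row

# Left join: keep every row of the left DataFrame; missing ages become NaN.
left = pd.merge(salaries, ages, on='Name', how='left')
print(left)   # Alice and Bob; Bob's Age is NaN
```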

Visualizing Data

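Pandas integrates with Matplotlib, a plotting library, to enable data visualization. Matplotlib is a separate package, so install it first if needed (for example, with pip install matplotlib):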

import matplotlib.pyplot as plt

df['Age'].plot(kind='hist')
plt.show()

This will show a histogram of the 'Age' column.

Conclusion: The Power of Pandas at Your Fingertips

Congratulations! You've just scratched the surface of what Pandas can do. With these basics under your belt, you're well on your way to becoming proficient in data manipulation and analysis. Remember, learning Pandas is like learning to ride a bicycle: it might seem tricky at first, but with practice it becomes second nature. So keep experimenting with different datasets, try out new methods, and watch how your data storytelling skills grow. Happy data wrangling!