
How to create new columns in Pandas

Understanding Pandas DataFrames

Before we dive into the process of creating new columns in Pandas, let's first understand what a DataFrame is. Think of a DataFrame as a big table of data, similar to a sheet in Excel. It has rows and columns, where rows represent individual records (like different students in a class), and columns represent different attributes or features of those records (like the names, ages, or grades of the students).

Pandas is a powerful Python library that allows us to work with these tables efficiently. It's like having a Swiss Army knife for data manipulation in Python!

Setting Up Your Environment

To start playing with Pandas, you first need to make sure you have it installed. You can do this by running the following command in your terminal or command prompt:

pip install pandas

Once installed, you can import Pandas in your Python script or notebook using:

import pandas as pd

pd is the conventional alias for Pandas. It works like a nickname, so you don't have to type pandas every time you want to use a function from the library.

Creating a Simple DataFrame

Before we add new columns, we need a DataFrame to work with. Let's create a simple one:

import pandas as pd

# Create a DataFrame using a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
}

df = pd.DataFrame(data)

print(df)

This code will output:

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35

Here, we created a DataFrame with two columns: "Name" and "Age".

Adding Columns Using Assignment

One of the simplest ways to add a new column to a DataFrame is to assign to a new column name with the assignment operator =. When you assign a list, it must contain exactly one value per row. For example, let's add a new column called "City":

df['City'] = ['New York', 'Los Angeles', 'Chicago']
print(df)

This will give us:

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
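As a small sketch of how assignment behaves with other inputs, the example below assigns a single value (which Pandas repeats for every row) and then tries a list of the wrong length. The "Country" column and the length_error variable are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

# A single value is broadcast to every row ('Country' is a hypothetical column)
df['Country'] = 'USA'

# A list with the wrong number of values is rejected with a ValueError
try:
    df['City'] = ['New York', 'Los Angeles']  # only 2 values for 3 rows
    length_error = None
except ValueError as exc:
    length_error = str(exc)

print(df)
print(length_error)
```

If you see this kind of error in your own code, check that the list you are assigning has the same number of items as the DataFrame has rows.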

Using the assign Method

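Pandas also provides a method called assign that allows you to create new columns. Unlike direct assignment, assign does not modify the DataFrame it is called on. It returns a new DataFrame with the extra columns, which is why the example below stores the result back in df. This method is useful when you want to create multiple columns at once or when you want to chain several operations together.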

df = df.assign(
    Salary=[70000, 80000, 90000],
    Department=['HR', 'Tech', 'Finance']
)
print(df)

Now your DataFrame has two more columns, "Salary" and "Department":

      Name  Age         City  Salary Department
0    Alice   25     New York   70000         HR
1      Bob   30  Los Angeles   80000       Tech
2  Charlie   35      Chicago   90000    Finance
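The assign method also accepts functions, which is what makes it convenient in a chain. The sketch below assumes a small standalone DataFrame. The "Bonus" and "TotalPay" columns are invented for illustration, and the 10% bonus is an arbitrary figure.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Salary': [70000, 80000, 90000]
})

# Each lambda receives the DataFrame as it looks at that point in the call,
# so TotalPay can use the Bonus column defined just before it.
with_bonus = df.assign(
    Bonus=lambda d: d['Salary'] // 10,            # hypothetical 10% bonus
    TotalPay=lambda d: d['Salary'] + d['Bonus']
)

print(with_bonus)
print(df.columns.tolist())  # the original DataFrame is unchanged
```

Because assign returns a new DataFrame, df still has only its original two columns after this code runs.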

Creating Columns Based on Other Columns

Often, you'll want to create a new column based on the values of other columns. For instance, let's say we want to add a column that shows if a person is over 30 years old.

df['Over30'] = df['Age'] > 30
print(df)

This code will add a new boolean column (True or False) indicating whether each person is over 30:

      Name  Age         City  Salary Department  Over30
0    Alice   25     New York   70000         HR   False
1      Bob   30  Los Angeles   80000       Tech   False
2  Charlie   35      Chicago   90000    Finance    True
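The same idea extends beyond comparisons. As a rough sketch, the example below derives one column with arithmetic and another with NumPy's where function, which picks between two values based on a condition. The "AgeInFiveYears" and "AgeGroup" column names are assumptions made for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

# Arithmetic on a column is applied to every row at once
df['AgeInFiveYears'] = df['Age'] + 5

# np.where(condition, value_if_true, value_if_false) builds a label column
df['AgeGroup'] = np.where(df['Age'] > 30, 'Over 30', '30 or under')

print(df)
```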

Using Functions to Create Columns

You can also use functions to create new columns. For example, let's create a column that contains a personalized message for each person.

def create_message(row):
    return f"Hello, {row['Name']} from {row['City']}!"

df['Message'] = df.apply(create_message, axis=1)
print(df)

The apply method runs the create_message function once for each row in the DataFrame. The axis=1 parameter tells Pandas to pass each row to the function. With the default, axis=0, the function would receive each column instead. The output is:

      Name  Age         City  Salary Department  Over30                       Message
0    Alice   25     New York   70000         HR   False   Hello, Alice from New York!
1      Bob   30  Los Angeles   80000       Tech   False  Hello, Bob from Los Angeles!
2  Charlie   35      Chicago   90000    Finance    True  Hello, Charlie from Chicago!
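For simple string building like this, a row-by-row function is not the only option. The sketch below, which assumes a small standalone DataFrame, builds the same message by combining whole columns with +, and compares the result with the apply version. The "MessageVectorized" column name is made up for the comparison.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'City': ['New York', 'Los Angeles', 'Chicago']
})

def create_message(row):
    return f"Hello, {row['Name']} from {row['City']}!"

# Row-by-row version, as in the example above
df['Message'] = df.apply(create_message, axis=1)

# Column-wise version: string columns can be joined with +
df['MessageVectorized'] = 'Hello, ' + df['Name'] + ' from ' + df['City'] + '!'

# Both approaches produce the same text for every row
print((df['Message'] == df['MessageVectorized']).all())
```

Column-wise operations are usually faster on large DataFrames, while apply remains handy when the logic is too involved to express with column operations.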

Handling Missing Data When Creating Columns

Sometimes, you might not have data for every row in a new column you're creating. Pandas handles missing data using a special value called NaN (Not a Number). Let's add a column with some missing values:

import numpy as np

df['PreviousEmployer'] = ['Company A', np.nan, 'Company C']
print(df)

np.nan is NumPy's NaN value, and it is a common way to mark a missing entry in Pandas. The output will show NaN where the data is missing:

      Name  Age         City  Salary Department  Over30                       Message PreviousEmployer
0    Alice   25     New York   70000         HR   False   Hello, Alice from New York!        Company A
1      Bob   30  Los Angeles   80000       Tech   False  Hello, Bob from Los Angeles!              NaN
2  Charlie   35      Chicago   90000    Finance    True  Hello, Charlie from Chicago!        Company C
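Once a column contains missing values, you will often want to flag or fill them. The following is a minimal sketch using the notna and fillna methods. The "HasPreviousEmployer" and "PreviousEmployerFilled" column names and the 'Unknown' placeholder are choices made for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'PreviousEmployer': ['Company A', np.nan, 'Company C']
})

# notna() returns True where a value is present and False where it is missing
df['HasPreviousEmployer'] = df['PreviousEmployer'].notna()

# fillna() returns a copy of the column with missing values replaced
df['PreviousEmployerFilled'] = df['PreviousEmployer'].fillna('Unknown')

print(df)
```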

Using insert to Add Columns at Specific Locations

If you want to add a new column at a specific position in the DataFrame, you can use the insert method. Let's say we want to insert a "Gender" column as the second column in our DataFrame:

df.insert(1, 'Gender', ['Female', 'Male', 'Male'])
print(df)

The first argument to insert is the zero-based position where you want to place the new column (so 1 means the second column), the second argument is the column name, and the third is the data for the column. Note that insert modifies the DataFrame in place and returns None, so you should not assign its result back to df.

      Name  Gender  Age         City  Salary Department  Over30                       Message PreviousEmployer
0    Alice  Female   25     New York   70000         HR   False   Hello, Alice from New York!        Company A
1      Bob    Male   30  Los Angeles   80000       Tech   False  Hello, Bob from Los Angeles!              NaN
2  Charlie    Male   35      Chicago   90000    Finance    True  Hello, Charlie from Chicago!        Company C
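If you would rather not count positions by hand, one option is to look up the position of an existing column. The sketch below assumes a small standalone DataFrame and places "Gender" right after "Age" (a variation on the example above). It also shows that inserting a column name that already exists raises a ValueError by default. The position and duplicate_error variables are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
})

# get_loc returns the zero-based position of an existing column
position = df.columns.get_loc('Age') + 1
df.insert(position, 'Gender', ['Female', 'Male', 'Male'])

# Inserting the same column name again is rejected by default
try:
    df.insert(0, 'Gender', ['Female', 'Male', 'Male'])
    duplicate_error = None
except ValueError as exc:
    duplicate_error = str(exc)

print(df.columns.tolist())
print(duplicate_error)
```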

Conclusion: Expanding Your Data Horizons

Creating new columns in a DataFrame is a fundamental skill in data manipulation with Pandas. It allows you to enrich your data, derive new insights, and prepare your data for further analysis or visualization. Whether you're adding simple calculated columns, applying functions for more complex operations, or dealing with missing data, Pandas provides a versatile set of tools for column creation.

Remember, the key to becoming proficient in data manipulation is practice. Don't hesitate to experiment with the different methods we've discussed and explore the extensive Pandas documentation for more advanced techniques. As you grow more comfortable with these tools, you'll find that the possibilities for transforming your data are limited only by your imagination. Keep exploring, and happy data wrangling!