
How to create a new column in Pandas

Understanding Pandas and DataFrames

Before we create a new column in a Pandas DataFrame, let's briefly look at what Pandas is and what a DataFrame represents. Pandas is an open-source Python library for data manipulation and analysis, built around a small set of data structures. One of these structures is the DataFrame, a two-dimensional labeled table much like one you would find in a spreadsheet. Each column holds the values of one attribute, and each row represents a single record.

Getting Started with Pandas

To start using Pandas, we first need to import it. We can do this with the following line of code:

import pandas as pd

pd is the conventional alias for Pandas. With it, every function and class in the library is available under the short pd prefix, for example pd.DataFrame.

Creating a New Column from Scratch

Creating a new column in a DataFrame is akin to adding a new feature to our data. Let's say we have a DataFrame that contains information about fruits and their prices. We want to add a new column that shows the stock quantity for each fruit.

First, let's create our simple DataFrame:

data = {
    'Fruit': ['Apple', 'Banana', 'Cherry', 'Date'],
    'Price': [1.2, 0.5, 1.5, 1.0]
}

df = pd.DataFrame(data)

Now, to add a new column, we assign a list of values to a new column name. The list must contain exactly one value per row of the DataFrame:

df['Stock'] = [20, 35, 15, 50]

After this, our DataFrame df will have a new column named 'Stock'.
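To make the result concrete, the short sketch below rebuilds the example DataFrame, adds the 'Stock' column, and prints it. It also tries a list of the wrong length to show the error pandas raises in that case; the three-item list and the 'Broken' column name are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Cherry', 'Date'],
    'Price': [1.2, 0.5, 1.5, 1.0],
})

# One value per row: the frame has 4 rows, so the list has 4 items.
df['Stock'] = [20, 35, 15, 50]
print(df)

# A list whose length does not match the number of rows is rejected.
length_error = None
try:
    df['Broken'] = [1, 2, 3]  # hypothetical 3-item list for a 4-row frame
except ValueError as exc:
    length_error = exc
    print('ValueError:', exc)
```

If the values are not known yet, assigning a single placeholder value and filling it in later avoids the length check.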

Using a Constant Value for the New Column

If we want to add a new column where every row has the same value, we can assign a single value instead of a list. For instance, to add a column that indicates the currency of the prices:

df['Currency'] = 'USD'

Now, every row in the 'Currency' column will contain the string 'USD'.
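As a small illustration, the sketch below shows that a single value is repeated for every row regardless of how many rows the DataFrame has. The 'TaxRate' column and its value are hypothetical and only demonstrate that the same works for numbers.

```python
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Cherry', 'Date'],
    'Price': [1.2, 0.5, 1.5, 1.0],
})

# A single (scalar) value is broadcast to every row.
df['Currency'] = 'USD'

# Hypothetical numeric constant, not part of the original example.
df['TaxRate'] = 0.08

print(df)
```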

Creating a Column Based on Operations with Existing Columns

We can also create a new column by performing operations on existing columns. For example, to calculate the total value of the stock for each fruit, we can multiply the 'Price' column by the 'Stock' column:

df['TotalValue'] = df['Price'] * df['Stock']

This operation is performed element-wise: pandas pairs the values of 'Price' and 'Stock' that share the same row label and multiplies them, so each row gets its own 'TotalValue'. No explicit loop is needed.
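The sketch below repeats the calculation on the example data and also shows assign(), an alternative that returns a new DataFrame instead of modifying the existing one. The variable name with_total is an assumption made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Cherry', 'Date'],
    'Price': [1.2, 0.5, 1.5, 1.0],
    'Stock': [20, 35, 15, 50],
})

# assign() returns a copy with the extra column; df itself is unchanged.
with_total = df.assign(TotalValue=df['Price'] * df['Stock'])
print(with_total)

# Bracket assignment adds the column to df in place.
df['TotalValue'] = df['Price'] * df['Stock']
```

assign() is convenient when chaining several steps, while bracket assignment is the shorter choice for a one-off column.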

Using Functions to Create Columns

If we have more complex logic for our new column, we can define a function and apply it to a column of the DataFrame. Suppose we want to categorize fruits based on their price: 'Cheap' for prices less than 1, 'Moderate' for prices from 1 up to and including 1.5, and 'Expensive' for prices higher than 1.5. We can write a function and use the apply() method:

def categorize_price(price):
    if price < 1:
        return 'Cheap'
    elif price <= 1.5:
        return 'Moderate'
    else:
        return 'Expensive'

df['PriceCategory'] = df['Price'].apply(categorize_price)

Called on a single column, the apply() method calls the function once for each element and returns the results as a new column.
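When the logic needs more than one column, apply() can also be called on the whole DataFrame with axis=1, in which case the function receives one row at a time. The sketch below shows both forms on the example data; describe_row and the 'Summary' column are hypothetical names introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Cherry', 'Date'],
    'Price': [1.2, 0.5, 1.5, 1.0],
    'Stock': [20, 35, 15, 50],
})

def categorize_price(price):
    if price < 1:
        return 'Cheap'
    elif price <= 1.5:
        return 'Moderate'
    else:
        return 'Expensive'

# Column form: the function receives one price at a time.
df['PriceCategory'] = df['Price'].apply(categorize_price)

# Row form: with axis=1 the function receives one row at a time.
# describe_row is a hypothetical helper, not from the original article.
def describe_row(row):
    return f"{row['Fruit']} ({row['PriceCategory']}), stock {row['Stock']}"

df['Summary'] = df.apply(describe_row, axis=1)
print(df[['Fruit', 'PriceCategory', 'Summary']])
```

Row-wise apply() runs a Python function call per row, so for large DataFrames an expression on whole columns is usually faster when one is available.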

Conditional Column Creation with np.where

For conditional column creation, we can use NumPy's where function. This is useful for creating binary or flag columns based on a condition. Let's add a column to flag whether a fruit is in stock, assuming a stock quantity of 10 or less means it's not in stock:

import numpy as np

df['InStock'] = np.where(df['Stock'] > 10, 'Yes', 'No')

This line will check each row's 'Stock' value, and if it's greater than 10, 'Yes' will be assigned to the 'InStock' column for that row; otherwise, 'No'.
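np.where covers two outcomes. For three or more, NumPy's select function is one option: it takes a list of conditions and a matching list of values, and uses the first condition that holds for each row. The sketch below shows both on the example data; the 'StockLevel' column, its thresholds, and its labels are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Cherry', 'Date'],
    'Stock': [20, 35, 15, 50],
})

# Two outcomes: np.where(condition, value_if_true, value_if_false).
df['InStock'] = np.where(df['Stock'] > 10, 'Yes', 'No')

# Several outcomes: conditions are checked in order, first match wins.
# Thresholds and labels here are hypothetical.
conditions = [df['Stock'] >= 30, df['Stock'] >= 20]
labels = ['High', 'Medium']
df['StockLevel'] = np.select(conditions, labels, default='Low')

print(df)
```

Because the conditions are evaluated in order, the stricter threshold has to come first; otherwise every stock of 30 or more would already match the 20 threshold.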

Adding a Column with Data from Another DataFrame

Sometimes, we might need to add a column that comes from another DataFrame. For example, suppose we have another DataFrame with discount information:

discount_data = {
    'Fruit': ['Apple', 'Banana', 'Cherry', 'Date'],
    'Discount': [0.1, 0.2, 0.15, 0.05]
}

discount_df = pd.DataFrame(discount_data)

We can merge this with our original DataFrame:

df = df.merge(discount_df, on='Fruit', how='left')

The merge method combines the two DataFrames by matching rows on the 'Fruit' column. The how='left' argument means that all entries from the original df will be kept, even if there's no matching entry in discount_df; such rows get a missing value (NaN) in the 'Discount' column.
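To see how how='left' treats a fruit with no match, the sketch below merges against a discount table that deliberately omits 'Date'. The partial_discounts table and the choice to treat a missing discount as zero are assumptions for this example, not part of the original data.

```python
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Cherry', 'Date'],
    'Price': [1.2, 0.5, 1.5, 1.0],
})

# Hypothetical discount table with no entry for 'Date'.
partial_discounts = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Cherry'],
    'Discount': [0.1, 0.2, 0.15],
})

merged = df.merge(partial_discounts, on='Fruit', how='left')
print(merged)  # all four fruits are kept

# Count the rows that found no match in partial_discounts.
n_missing = int(merged['Discount'].isna().sum())

# One possible policy (an assumption): a missing discount means none.
merged['Discount'] = merged['Discount'].fillna(0.0)
merged['DiscountedPrice'] = merged['Price'] * (1 - merged['Discount'])
print(merged)
```

Whether a missing discount should become zero, stay missing, or be treated as an error depends on the data; checking the count of unmatched rows first makes that decision explicit.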

Conclusion: The Versatility of Data Manipulation with Pandas

In the realm of data manipulation, Pandas stands as a versatile and powerful tool, and creating new columns is a fundamental aspect of shaping and enriching your data. Whether you're setting up a straightforward list of values, calculating from existing columns, or even integrating complex logic, Pandas offers a variety of ways to achieve your goal. The ability to seamlessly add and manipulate columns in a DataFrame empowers you to prepare your data for analysis, visualization, or any other process that might follow in your data journey. As you become more familiar with these operations, you'll find that they become second nature, allowing you to handle data with both precision and creativity. Remember, each new column is a step towards unveiling insights and telling the story hidden within your data.