Altcademy - a Forbes magazine logo Best Coding Bootcamp 2023

How to merge multiple dataframes in Pandas

Understanding DataFrames in Pandas

Imagine you're organizing a collection of recipes from various cookbooks into one mega cookbook. Each recipe is written on its own page, and you want to combine these pages to create sections such as desserts, appetizers, and main courses. In the world of data analysis with Python, these pages can be thought of as DataFrames, and the process of organizing them into sections is similar to merging multiple DataFrames.

A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). Simply put, it's like a spreadsheet or a SQL table that you can manipulate with Python.

Preparing to Merge: Understanding Your Data

Before you start merging, it's important to understand the structure and content of your DataFrames. This includes knowing the column names, the types of data they contain, and how the DataFrames relate to each other. It's like knowing what ingredients you have before you start cooking.

Here's a simple example of a DataFrame:

import pandas as pd

# Create a simple DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

print(df)

This code will output:

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35

The Basics of Merging DataFrames

Merging DataFrames is akin to putting together pieces of a puzzle. In Pandas, the merge() function is one of the main tools to do this. It combines DataFrames based on common columns or indices, similar to a JOIN operation in SQL.

Here's a basic example of merging two DataFrames:

# First DataFrame
df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

# Second DataFrame
df2 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Dave'],
    'Height': [165, 180, 175]
})

# Merge DataFrames on the 'Name' column
merged_df = pd.merge(df1, df2, on='Name')

print(merged_df)

The result will be:

     Name  Age  Height
0   Alice   25     165
1     Bob   30     180

Notice how 'Charlie' and 'Dave' are not included in the merged DataFrame. This is because the default merge operation is an 'inner join', which only includes records that have matching values in both DataFrames.

Different Types of Merges

Just like there are various ways to organize your recipes, there are different types of merges in Pandas:

  • Inner Merge: Combines only the common elements. It's the default type of merge in Pandas.
  • Outer Merge: Combines all elements from both DataFrames, filling in missing values with NaN (Not a Number).
  • Left Merge: Includes all elements from the left DataFrame and the common elements from the right DataFrame.
  • Right Merge: Includes all elements from the right DataFrame and the common elements from the left DataFrame.

Imagine you're merging two decks of cards. An inner merge would give you only the cards that are found in both decks. An outer merge would give you a mega deck with all the cards from both decks, and where a card is missing from one deck, you'd leave an empty slot. A left merge would ensure you keep all the cards from the left deck, and a right merge would do the same for the right deck.

Here's how you can perform an outer merge:

# Outer merge
outer_merged_df = pd.merge(df1, df2, on='Name', how='outer')

print(outer_merged_df)

The result will be:

      Name   Age  Height
0    Alice  25.0   165.0
1      Bob  30.0   180.0
2  Charlie  35.0     NaN
3     Dave   NaN   175.0

Merging on Multiple Columns

Sometimes, you might want to merge DataFrames based on more than one column. This is similar to organizing recipes by both the type of dish and the main ingredient.

Let's say we have two DataFrames, and we want to merge them based on both the 'Name' and 'Age' columns:

# DataFrame A
df_a = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Gender': ['F', 'M', 'M']
})

# DataFrame B
df_b = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 36],  # Note that Charlie's age is different here
    'Height': [165, 180, 175]
})

# Merge on multiple columns
multi_merged_df = pd.merge(df_a, df_b, on=['Name', 'Age'])

print(multi_merged_df)

The result will be:

     Name  Age Gender  Height
0   Alice   25      F     165
1     Bob   30      M     180

Charlie is not included because his age does not match in both DataFrames.

Handling Non-Matching Column Names

What if the columns you want to merge on have different names in each DataFrame? It's like trying to match a recipe's 'main ingredient' with another's 'primary component'. In Pandas, you can specify which columns to use for merging in each DataFrame.

Here's an example:

# DataFrame A
df_a = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

# DataFrame B with different column names
df_b = pd.DataFrame({
    'PersonName': ['Alice', 'Bob', 'Charlie'],
    'PersonAge': [25, 30, 35],
    'Height': [165, 180, 175]
})

# Merge with different column names
merged_df = pd.merge(df_a, df_b, left_on=['Name', 'Age'], right_on=['PersonName', 'PersonAge'])

print(merged_df)

The result will be:

      Name  Age PersonName  PersonAge  Height
0    Alice   25      Alice         25     165
1      Bob   30        Bob         30     180
2  Charlie   35    Charlie         35     175

Concatenating DataFrames

Sometimes, instead of merging on columns, you might just want to stack DataFrames on top of each other or side by side. This is called concatenation. It's like stacking pages of recipes to make a thicker book, without worrying about matching the content.

Here's an example of vertical concatenation:

# DataFrame A
df_a = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

# DataFrame B with the same columns
df_b = pd.DataFrame({
    'Name': ['Dave', 'Eve', 'Frank'],
    'Age': [40, 45, 50]
})

# Concatenate DataFrames vertically
concatenated_df = pd.concat([df_a, df_b], ignore_index=True)

print(concatenated_df)

The result will be:

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3     Dave   40
4      Eve   45
5    Frank   50

And here's an example of horizontal concatenation:

# DataFrame A
df_a = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

# DataFrame C with different columns
df_c = pd.DataFrame({
    'Gender': ['F', 'M', 'M'],
    'Height': [165, 180, 175]
})

# Concatenate DataFrames horizontally
concatenated_df = pd.concat([df_a, df_c], axis=1)

print(concatenated_df)

The result will be:

      Name  Age Gender  Height
0    Alice   25      F     165
1      Bob   30      M     180
2  Charlie   35      M     175

Dealing with Missing Data After Merging

After merging, you might end up with missing data, represented by NaN. This could be because some records didn't have a match in the other DataFrame. It's like having a recipe with a missing ingredient. You can choose to fill in the missing data with a default value, drop the incomplete records, or find a suitable replacement.

Here's how to fill in NaN with a default value:

# Fill NaN with a default value
merged_df_filled = merged_df.fillna({'Height': 0})

print(merged_df_filled)

Conclusion

Merging multiple DataFrames in Pandas can be compared to assembling a complex jigsaw puzzle. Each piece of the puzzle is a DataFrame, and your task is to find how they fit together to reveal the bigger picture. Whether you're performing an inner merge, outer merge, or simply concatenating DataFrames, the key is to understand the relationships between your data sets and how to align them correctly.

Just as a well-organized cookbook makes it easier to find the perfect recipe, mastering the art of merging DataFrames will streamline your data analysis process, allowing you to glean insights and craft narratives from your data with ease. As you become more familiar with these techniques, you'll find that the once-daunting task of data manipulation becomes as simple and satisfying as following a well-written recipe. Keep experimenting with different merge types and strategies, and soon you'll be able to handle even the most complex data with confidence and precision.