
How to drop duplicate rows in Pandas

Understanding Data Duplication

When working with datasets in programming, especially for data analysis or data science tasks, it's common to encounter duplicate rows. These are rows that have identical information across all columns, much like photocopies of a document in a stack of papers. In the real world, this could happen due to data entry errors, merging data from multiple sources, or as a consequence of other data processing steps.

In the context of the Python programming language, the Pandas library is a powerhouse for data manipulation. It provides a simple yet powerful way to handle tabular data (think of tables like in Excel) through DataFrame objects. Imagine a DataFrame as a table with rows and columns, where each row represents an observation or record, and each column represents a property or feature of that observation.

Getting Started with Pandas

Before diving into how to remove duplicate rows, let's set up a basic Pandas DataFrame. To work with Pandas, you first need to import it, usually under the alias pd.

import pandas as pd

Now, let's create a simple DataFrame with some duplicate rows to work with:

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
    'Age': [25, 30, 35, 25, 30],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles']
}

df = pd.DataFrame(data)
print(df)

This code will produce a DataFrame that looks like this:

     Name  Age         City
0   Alice   25     New York
1     Bob   30  Los Angeles
2 Charlie   35      Chicago
3   Alice   25     New York
4     Bob   30  Los Angeles

In this table, you'll notice that the first and fourth rows are identical, as are the second and fifth rows. These are what we call duplicate rows.

Identifying Duplicates

Before removing duplicates, it's often useful to first identify them. Pandas provides the duplicated() method for this purpose. It returns a boolean Series (a sequence of True/False values) in which a row is marked True if it is an exact copy of a row that appeared earlier in the DataFrame.

duplicates = df.duplicated()
print(duplicates)

The output will be:

0    False
1    False
2    False
3     True
4     True
dtype: bool

This tells us that rows 3 and 4 are marked True because they duplicate rows 0 and 1. By default, duplicated() flags every occurrence after the first, leaving the first occurrence marked False.
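Because the result is a boolean Series, you can do more with it than just print it. One common trick, sketched below with the same sample data, is to sum the Series to count duplicates (True counts as 1), and duplicated() also accepts the same keep parameter as drop_duplicates(), so keep=False marks every member of a duplicate group rather than only the repeats:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
    'Age': [25, 30, 35, 25, 30],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles']
})

# Summing a boolean Series counts the True values, i.e. the duplicate rows
num_duplicates = df.duplicated().sum()
print(num_duplicates)  # 2

# keep=False flags every row that has a duplicate anywhere in the frame
all_marked = df.duplicated(keep=False)
print(all_marked.sum())  # 4

# Boolean indexing with the mask shows the duplicate rows themselves
print(df[df.duplicated()])
```

This is handy for a quick sanity check on a large dataset before you decide whether dropping duplicates is the right move.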

Removing Duplicate Rows

Now that we know how to identify duplicates, we can proceed to remove them using the drop_duplicates() method. This method returns a new DataFrame without the duplicate rows.

df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)

The resulting DataFrame will be:

     Name  Age         City
0   Alice   25     New York
1     Bob   30  Los Angeles
2 Charlie   35      Chicago

By default, drop_duplicates() keeps the first occurrence of the duplicate rows and removes the subsequent ones. But what if you wanted to keep the last occurrence or remove all instances of duplicates entirely? You can do this by using the keep parameter.

# Keep the last occurrence
df_keep_last = df.drop_duplicates(keep='last')
print(df_keep_last)

# Remove all duplicates
df_remove_all = df.drop_duplicates(keep=False)
print(df_remove_all)

The output will be:

For df_keep_last:
     Name  Age         City
2 Charlie   35      Chicago
3   Alice   25     New York
4     Bob   30  Los Angeles

For df_remove_all:
     Name  Age     City
2 Charlie   35  Chicago

Specifying Columns for De-duplication

Sometimes, you might want to remove duplicates based on specific columns rather than the entire row. For instance, you might want to remove rows with duplicate names, regardless of their age or city. You can specify the subset of columns to consider for identifying duplicates with the subset parameter.

df_no_duplicate_names = df.drop_duplicates(subset=['Name'])
print(df_no_duplicate_names)

In the output, each name appears only once:

     Name  Age         City
0   Alice   25     New York
1     Bob   30  Los Angeles
2 Charlie   35      Chicago
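The subset and keep parameters also work together. As a sketch, suppose the sample data is tweaked so that Bob's last row lists a different city (Boston here is an invented value for illustration); de-duplicating on Name and Age while keeping the last occurrence then retains the Boston row:

```python
import pandas as pd

# Same sample data, except the final Bob row has a different city
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
    'Age': [25, 30, 35, 25, 30],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Boston']
})

# Only Name and Age are compared, so the two Bob rows count as duplicates
# even though their cities differ; keep='last' retains the Boston row
deduped = df.drop_duplicates(subset=['Name', 'Age'], keep='last')
print(deduped)
```

Note that the differing City values are simply ignored when the subset doesn't include that column, so choose the subset carefully: whichever occurrence is kept silently decides which City value survives.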

Understanding the inplace Parameter

When you use drop_duplicates(), by default, it returns a new DataFrame. However, if you want to remove duplicates from the original DataFrame without creating a new one, you can use the inplace parameter.

df.drop_duplicates(inplace=True)
print(df)

This modifies df directly and returns None rather than a new DataFrame, so don't write df = df.drop_duplicates(inplace=True), or you'll overwrite df with None. After the call, df itself is duplicate-free.
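One related detail: drop_duplicates() keeps the surviving rows' original index labels, which can leave gaps in the index. If you want a clean 0-to-n-1 index afterwards, the standard reset_index(drop=True) method renumbers the rows, as this sketch with the sample data shows:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
    'Age': [25, 30, 35, 25, 30],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles']
})

# keep='last' retains the later copies, so the surviving labels are 2, 3, 4
df2 = df.drop_duplicates(keep='last')
print(df2.index.tolist())  # [2, 3, 4]

# reset_index(drop=True) renumbers the rows 0..n-1 and discards the old labels
df2 = df2.reset_index(drop=True)
print(df2.index.tolist())  # [0, 1, 2]
```

Without drop=True, reset_index would keep the old labels as a new column named "index", which is rarely what you want here.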

Intuition and Analogy

Think of a DataFrame as a guest list for a party. If the same person is mistakenly invited multiple times, you'd want to remove the duplicate invitations. Similarly, in a DataFrame, you want to ensure that each row represents a unique observation to avoid skewing your data analysis.

Conclusion

In the world of data manipulation with Pandas, removing duplicate rows is akin to tidying up your workspace. It's about ensuring that the information you're working with is accurate and representative of the real-world scenario you're analyzing. The drop_duplicates() method is your broom, sweeping away the clutter of repetition and leaving you with a pristine dataset where every row tells a unique story.

By understanding how to identify and remove duplicate rows, you're one step closer to becoming a master of data. Remember, the cleaner your data, the clearer your insights. With the power of Pandas and the knowledge you've gained, you're now well-equipped to tackle the common task of de-duplication and ensure the integrity of your data-driven projects. Happy coding, and may your datasets always be clean and your insights sharp!