How to drop duplicates in Pandas

Understanding Duplicates in DataFrames

When you're working with data in Python, it's not uncommon to encounter duplicate rows. Imagine you have a list of attendees for a series of events, and some people attend multiple events. If you're only interested in the unique list of attendees, you'll want to remove the duplicates. In the world of data manipulation with Python, the Pandas library is your go-to tool for such tasks.

Think of a DataFrame in Pandas as a table, much like one you'd see in a spreadsheet. It has labeled rows and columns, and sometimes some of those rows are repeated. These repeated rows are what we call duplicates.

Identifying Duplicates with duplicated()

Before dropping duplicates, it's essential to identify them. Pandas provides the duplicated() function for this purpose. This function returns a boolean Series, a list-like object where each item corresponds to a row in the DataFrame and indicates whether that row is a duplicate of an earlier row (True) or not (False).

import pandas as pd

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
    'Age': [24, 27, 22, 24, 27],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles']
}

df = pd.DataFrame(data)

# Identifying duplicates
print(df.duplicated())

In this example, the last two rows are duplicates of the first two rows, so the output will show True for those rows.
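For the sample DataFrame above, the printed Series looks like this:

0    False
1    False
2    False
3     True
4     True
dtype: bool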

Removing Duplicates with drop_duplicates()

Once you've identified duplicates, you can remove them using the drop_duplicates() function. By default, this function keeps the first occurrence of the duplicate row and removes the rest.

# Removing duplicates
df_unique = df.drop_duplicates()
print(df_unique)

After running this code, df_unique will contain only the unique rows; the second occurrences of the 'Alice' and 'Bob' rows are gone.
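The printed result should look like this:

      Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago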

Keeping the Last Occurrence

Sometimes, you might want to keep the last occurrence of a duplicate row instead of the first. You can do this by passing the argument keep='last' to the drop_duplicates() function.

# Keeping the last occurrence of duplicates
df_unique_last = df.drop_duplicates(keep='last')
print(df_unique_last)

Now, the DataFrame df_unique_last will keep the second occurrence of 'Alice' and 'Bob' and remove the first ones.
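The output should be:

      Name  Age         City
2  Charlie   22      Chicago
3    Alice   24     New York
4      Bob   27  Los Angeles

Notice that the surviving rows keep their original index labels (2, 3, and 4); drop_duplicates() does not renumber the index.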

Dropping Duplicates Based on Specific Columns

In some cases, you might want to consider a row duplicate only if certain columns have the same values. You can specify these columns using the subset parameter.

# Dropping duplicates based on specific columns
df_unique_name = df.drop_duplicates(subset=['Name'])
print(df_unique_name)

Here, only the 'Name' column is considered when looking for duplicates, so the DataFrame df_unique_name keeps just one row per name. With subset=['Name'], rows count as duplicates based on 'Name' alone; they would be dropped even if their 'Age' and 'City' values differed.
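For this sample data the result happens to match the full-row deduplication, because each name always appears with the same age and city:

      Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago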

In-Place Deletion

If you don't need to keep the original DataFrame, you can remove duplicates in place rather than assigning the result to a new variable. Use the inplace=True parameter for this; note that the method then modifies df directly and returns None.

# In-place deletion of duplicates
df.drop_duplicates(inplace=True)
print(df)

Now, df itself has the duplicates removed, and no new DataFrame is created.

Handling Missing Values

When dealing with duplicates, it's essential to consider how missing values (denoted by NaN in Pandas) are handled. By default, rows with NaN in the same places are considered duplicates. However, you might want to treat rows with missing values as unique. Pandas doesn't provide a direct way to do this, but you can fill NaN with unique values before dropping duplicates and then revert the changes.
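Here is a minimal sketch of that fill-and-revert idea. The placeholder naming scheme (__missing_<column>_<row>__) is made up purely for illustration; any values guaranteed not to collide with your real data would work:

import numpy as np
import pandas as pd

# Two 'Alice' rows differ only in their missing City; by default they
# count as duplicates, so one of them would normally be dropped.
df_nan = pd.DataFrame({
    'Name': ['Alice', 'Alice', 'Bob'],
    'City': [np.nan, np.nan, 'Los Angeles']
})

# Fill each NaN with a placeholder that is unique per row, so rows that
# differ only by missing values no longer compare equal.
filled = df_nan.copy()
for col in filled.columns:
    missing = filled[col].isna()
    filled.loc[missing, col] = [f'__missing_{col}_{i}__' for i in filled.index[missing]]

# Drop duplicates on the filled copy, then select the surviving rows
# from the original DataFrame so the NaN values come back untouched.
df_nan_unique = df_nan.loc[filled.drop_duplicates().index]
print(df_nan_unique)  # all three rows survive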

Intuition and Analogies

To help you understand the concept of dropping duplicates, let's use an analogy. Imagine you have a deck of playing cards, and some cards are repeated. If you want a deck with only unique cards, you'd go through the deck, check each card, and remove the ones that are the same as cards you've already seen.

In this analogy, the deck of cards is your DataFrame, and the process of checking and removing duplicates is what drop_duplicates() does. The subset parameter is like deciding to look only at the numbers or the suits of the cards to determine if they are duplicates.
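If you want to see the analogy in code, here is a toy deck (made-up data) deduplicated by suit alone, just as subset=['Name'] ignored 'Age' and 'City' above:

# A toy 'deck': keep one card per suit, ignoring rank
cards = pd.DataFrame({
    'Rank': ['Ace', 'Ace', 'King', 'Queen'],
    'Suit': ['Spades', 'Hearts', 'Spades', 'Hearts']
})
print(cards.drop_duplicates(subset=['Suit']))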

Practical Example with Real Data

Let's apply what we've learned to a more realistic dataset. We'll use the drop_duplicates() function on a dataset containing information about books.

# Load a dataset of books
books_data = {
    'Title': ['The Hobbit', '1984', 'The Hobbit', '1984'],
    'Author': ['J.R.R. Tolkien', 'George Orwell', 'J.R.R. Tolkien', 'George Orwell'],
    'Year': [1937, 1949, 1937, 1949]
}

books_df = pd.DataFrame(books_data)

# Drop duplicates
books_df_unique = books_df.drop_duplicates()
print(books_df_unique)

In this dataset, we have duplicate entries for "The Hobbit" and "1984". Using drop_duplicates(), we can easily get a DataFrame with unique books.
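The deduplicated DataFrame prints as:

        Title          Author  Year
0  The Hobbit  J.R.R. Tolkien  1937
1        1984   George Orwell  1949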

Conclusion

Dropping duplicates in Pandas is a crucial step in data cleaning that helps ensure the accuracy of your analysis. It's like tidying up your workspace before starting a project; it makes everything that follows more manageable and your results more reliable. With the drop_duplicates() function, you have a powerful tool at your disposal to streamline your datasets. Whether you're a novice coder or an experienced data wrangler, mastering the art of handling duplicates is an essential skill in your programming toolkit. Remember, clean data leads to clearer insights, and with Pandas, you're well-equipped to achieve just that.