Skip to main content
HomeAbout PythonLearn Python

Pandas Drop Duplicates Tutorial

Learn how to drop duplicates in Python using pandas.
Sep 2020  · 4 min read

Removing duplicates is an essential skill to get accurate counts because you often don't want to count the same thing multiple times. In Python, this could be accomplished by using the Pandas module, which has a method known as drop_duplicates.

Let's understand how to use it with the help of a few examples.

Dropping Duplicate Names

Let's say you have a dataframe that contains vet visits, and the vet's office wants to know how many dogs of each breed have visited their office. However, there are dogs like Max and Stella, who have visited the vet more than once in your dataset. Hence, you cannot just count the number of each breed in the breed column.

print(vet_visits)
          date     name        breed  weight_kg
0   2018-09-02    Bella     Labrador      24.87
1   2019-06-07      Max     Labrador      28.35
2   2018-01-17   Stella    Chihuahua       1.51
3   2019-10-19     Lucy    Chow Chow      24.07
..         ...      ...          ...        ...
71  2018-01-20   Stella    Chihuahua       2.83
72  2019-06-07      Max    Chow Chow      24.01
73  2018-08-20     Lucy    Chow Chow      24.40
74  2019-04-22      Max     Labrador      28.54

To achieve this, you would remove rows that contain a dog name already listed earlier, or in other words, you will extract a dog with each name from the dataset once.

You would do this using the drop_duplicates method. It takes an argument subset, which is the column we want to find or duplicates based on - in this case, we want all the unique names.

vet_visits.drop_duplicates(subset="name")
          date     name        breed  weight_kg
0   2018-09-02    Bella     Labrador      24.87
1   2019-06-07      Max    Chow Chow      24.01
2   2019-03-19  Charlie       Poodle      24.95
3   2018-01-17   Stella    Chihuahua       1.51
4   2019-10-19     Lucy    Chow Chow      24.07
7   2019-03-30   Cooper    Schnauzer      16.91
10  2019-01-04   Bernie  St. Bernard      74.98
(6  2019-06-07      Max     Labrador      28.35)

But, what if we have dogs with the same name?

Dropping Duplicate Pairs

In that case, we need to consider more than just name when dropping duplicates. Since Max and Max are different breeds, we can drop the rows with pairs of names and breeds listed earlier in the dataset.

unique_dogs = vet_visits.drop_duplicates(subset=["name", "breed"])
print(unique_dogs)
          date     name        breed  weight_kg
0   2018-09-02    Bella     Labrador      24.87
1   2019-03-13      Max    Chow Chow      24.13
2   2019-03-19  Charlie       Poodle      24.95
3   2018-01-17   Stella    Chihuahua       1.51
4   2019-10-19     Lucy    Chow Chow      24.07
6   2019-06-07      Max     Labrador      28.35
7   2019-03-30   Cooper    Schnauzer      16.91
10  2019-01-04   Bernie  St. Bernard      74.98

To base our duplicate dropping on multiple columns, we can pass a list of column names to the subset argument, in this case, name and breed.

Now both Max's have been included.

Interactive Example

In this exercise, you'll create some new DataFrames using unique values from sales. sales is available, and pandas is imported as pd.

You will perform the following steps:

  • First, you will remove rows of sales with duplicate pairs of store and type and save as store_types and print the head.
  • Then, you will remove rows of sales with duplicate pairs of store and department and save as store_depts and print the head.
  • Subset the rows that are holiday weeks, and drop the duplicate dates, saving as holiday_dates.
  • Finally, select the date column of holiday_dates, and print the holiday_dates dataframe.
# Drop duplicate store/type combinations
store_types = sales.drop_duplicates(subset=["store", "type"])
print(store_types.head())

# Drop duplicate store/department combinations
store_depts = sales.drop_duplicates(subset=["store", "department"])
print(store_depts.head())

# Subset the rows that are holiday weeks and drop duplicate dates
holiday_dates = sales[sales["is_holiday"]].drop_duplicates(subset="date")

# Print date col of holiday_dates
print(holiday_dates["date"])

When we run the above code, it produces the following result:

      store type  department       date  weekly_sales  is_holiday  temperature_c  fuel_price_usd_per_l  unemployment
0         1    A           1 2010-02-05      24924.50       False          5.728                 0.679         8.106
901       2    A           1 2010-02-05      35034.06       False          4.550                 0.679         8.324
1798      4    A           1 2010-02-05      38724.42       False          6.533                 0.686         8.623
2699      6    A           1 2010-02-05      25619.00       False          4.683                 0.679         7.259
3593     10    B           1 2010-02-05      40212.84       False         12.411                 0.782         9.765
    store type  department       date  weekly_sales  is_holiday  temperature_c  fuel_price_usd_per_l  unemployment
0       1    A           1 2010-02-05      24924.50       False          5.728                 0.679         8.106
12      1    A           2 2010-02-05      50605.27       False          5.728                 0.679         8.106
24      1    A           3 2010-02-05      13740.12       False          5.728                 0.679         8.106
36      1    A           4 2010-02-05      39954.04       False          5.728                 0.679         8.106
48      1    A           5 2010-02-05      32229.38       False          5.728                 0.679         8.106
498    2010-09-10
691    2011-11-25
2315   2010-02-12
6735   2012-09-07
6810   2010-12-31
6815   2012-02-10
6820   2011-09-09
Name: date, dtype: datetime64[ns]

Try it for yourself.

To learn more about counting and aggregating data, please see this video from our course Data Manipulation with pandas.

This content is taken from DataCamp’s Data Manipulation with pandas course by Maggie Matsui and Richie Cotton.

Topics

Learn more about Python and pandas

Certification available

Course

Data Manipulation with pandas

4 hr
350.4K
Learn how to import and clean data, calculate statistics, and create visualizations with pandas.
See DetailsRight Arrow
Start Course
See MoreRight Arrow
Related

Mastering the Pandas .explode() Method: A Comprehensive Guide

Learn all you need to know about the pandas .explode() method, covering single and multiple columns, handling nested data, and common pitfalls with practical Python code examples.
Adel Nehme's photo

Adel Nehme

5 min

Python NaN: 4 Ways to Check for Missing Values in Python

Explore 4 ways to detect NaN values in Python, using NumPy and Pandas. Learn key differences between NaN and None to clean and analyze data efficiently.
Adel Nehme's photo

Adel Nehme

5 min

Seaborn Heatmaps: A Guide to Data Visualization

Learn how to create eye-catching Seaborn heatmaps
Joleen Bothma's photo

Joleen Bothma

9 min

Test-Driven Development in Python: A Beginner's Guide

Dive into test-driven development (TDD) with our comprehensive Python tutorial. Learn how to write robust tests before coding with practical examples.
Amina Edmunds's photo

Amina Edmunds

7 min

Exponents in Python: A Comprehensive Guide for Beginners

Master exponents in Python using various methods, from built-in functions to powerful libraries like NumPy, and leverage them in real-world scenarios to gain a deeper understanding.
Satyam Tripathi's photo

Satyam Tripathi

9 min

Python Linked Lists: Tutorial With Examples

Learn everything you need to know about linked lists: when to use them, their types, and implementation in Python.
Natassha Selvaraj's photo

Natassha Selvaraj

9 min

See MoreSee More