
SQL Remove Duplicates: Comprehensive Methods and Best Practices

Explore the different methods for filtering out and permanently removing duplicate rows using SQL. Learn the practical applications of how to remove duplicates in SQL Server, MySQL, and PostgreSQL.
Oct 29, 2024  · 8 min read

Duplicate records are a common issue that can compromise data integrity and database performance. Removing these duplicates is essential for maintaining data accuracy, optimizing storage, and improving query performance. In this article, we will explore various techniques for removing duplicate rows in SQL, tailored to various use cases and database management systems.

As we get started, I highly recommend taking DataCamp’s Introduction to SQL and Learn SQL courses to build foundational knowledge of extracting and analyzing data using SQL. I also find the downloadable SQL Basics Cheat Sheet a helpful reference because it covers the most common SQL functions.

Understanding Duplicate Rows in SQL

Duplicate rows in SQL refer to records within a table that contain identical values across all or selected columns. The common causes of duplicate rows in SQL include the following:

  • Missing Primary Keys: When tables lack a defined primary key or unique constraint, there is no mechanism to prevent the insertion of duplicate data. This can happen when a table is not normalized and/or there are transitive dependency issues.
  • Data Integration Issues: When merging datasets from different sources, improper joins or inconsistencies in data formats can accidentally introduce duplicates.
  • Manual Data Entry Errors: Human error, such as entering the same record multiple times, is another common cause of duplicate rows.

In the rest of the article, we will look at how to remove duplicates in SQL, and we will divide the article into two blocks. In the first section, we will cover how to remove duplicates in the data that you are retrieving for a report or dashboard; in the second section, we will look at how to remove duplicates in the database.

Methods for Removing Duplicates in the Data You Retrieve

There are different methods of removing duplicates while retrieving records in SQL. The exact syntax depends on the DBMS, such as SQL Server, MySQL, or PostgreSQL. In this section, we will look at these methods while highlighting any special considerations for each database. Keep in mind that DISTINCT and the ROW_NUMBER() filter simply return unique records and do not modify the underlying table, while the GROUP BY and self-join examples below go a step further and delete the duplicate rows.

Using DISTINCT keyword

The DISTINCT keyword is used in a SELECT statement to retrieve unique rows. The DISTINCT keyword syntax for removing duplicates is similar for MySQL, PostgreSQL, and SQL Server databases. The query below will retrieve unique customer names from the customers table.

SELECT DISTINCT Name 
FROM customers;
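To see this behavior end to end, here is a minimal sketch using Python's built-in sqlite3 module; the table and sample rows are invented for illustration:

```python
import sqlite3

# In-memory database with duplicate customer names
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (ID INTEGER, Name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Alice"), (2, "Bob"), (3, "Alice"), (4, "Bob")])

# DISTINCT filters the result set; the table itself is untouched
unique_names = [row[0] for row in
                conn.execute("SELECT DISTINCT Name FROM customers ORDER BY Name")]
row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]

print(unique_names)  # ['Alice', 'Bob']
print(row_count)     # 4 -- all rows still present
```

Note that the table still contains all four rows afterward, which is the key difference between filtering duplicates and deleting them.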

Using GROUP BY with aggregate functions

The GROUP BY clause, combined with aggregate functions like MAX(), MIN(), or COUNT(), can help remove duplicate records from a table. Grouping lets you choose which record to retain in each group while deleting the other duplicates.

Suppose you want to delete duplicate customer records but keep the one with the highest ID. You will use the GROUP BY clause with the MAX() function, as shown below.

-- Delete duplicate rows from the 'customers' table (aliased as c1)
DELETE c1
FROM customers c1
-- Find the maximum ID for each unique Name
JOIN (
    SELECT Name, MAX(ID) AS MaxID
    FROM customers
    GROUP BY Name
) c2
-- Match rows based on 'Name' and keep the row with the maximum ID
ON c1.Name = c2.Name 
AND c1.ID < c2.MaxID;

MySQL and SQL Server support the above DELETE ... JOIN syntax. PostgreSQL does not; there, you can express the same logic with DELETE ... USING or a subquery.
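If you need a form that also runs where DELETE ... JOIN is unavailable, a NOT IN subquery keeps the highest ID per name. A sketch using Python's sqlite3 module with invented sample data (in MySQL, the subquery would additionally need to be wrapped in a derived table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (ID INTEGER, Name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Alice"), (2, "Alice"), (3, "Bob")])

# Keep only the row with the highest ID for each Name
conn.execute("""
    DELETE FROM customers
    WHERE ID NOT IN (SELECT MAX(ID) FROM customers GROUP BY Name)
""")

remaining = conn.execute("SELECT ID, Name FROM customers ORDER BY ID").fetchall()
print(remaining)  # [(2, 'Alice'), (3, 'Bob')]
```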

Using ROW_NUMBER() with Common Table Expressions (CTE)

With the ROW_NUMBER() function combined with a Common Table Expression (CTE), you can filter out duplicates based on your criteria. The ROW_NUMBER function, when used with PARTITION BY and ORDER BY clauses, assigns a unique sequential number to each row. This method allows for filtering out the rows that do not meet the required criteria.

The following query identifies duplicates and returns only the first occurrence of each name.

-- Common Table Expression (CTE) to rank rows based on 'Name'
WITH CTE AS (
    SELECT ID, Name, ROW_NUMBER() OVER (PARTITION BY Name ORDER BY ID ASC) AS RowNum
    FROM customers
)
-- Select only the unique records where RowNum = 1
SELECT ID, Name
FROM CTE
WHERE RowNum = 1;

This method works well for modern versions of SQL Server, MySQL, and PostgreSQL. It is useful for larger datasets or more complex conditions, as it allows you to specify exactly which duplicate to keep.
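As a quick check that the CTE returns only first occurrences, here is a sketch using Python's sqlite3 module (SQLite 3.25+ ships window functions; sample data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (ID INTEGER, Name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Alice"), (2, "Alice"), (3, "Bob")])

# ROW_NUMBER() ranks rows inside each Name partition; RowNum = 1 is the keeper
rows = conn.execute("""
    WITH CTE AS (
        SELECT ID, Name,
               ROW_NUMBER() OVER (PARTITION BY Name ORDER BY ID ASC) AS RowNum
        FROM customers
    )
    SELECT ID, Name FROM CTE WHERE RowNum = 1 ORDER BY ID
""").fetchall()

print(rows)  # [(1, 'Alice'), (3, 'Bob')]
```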

Removing duplicates using self-JOIN

A self-join allows you to compare a table to itself, making it helpful in identifying and removing duplicate rows by comparing records based on specific criteria. The following example uses the self-join to delete the row with the higher ID, keeping only the first occurrence of each name.

-- Delete duplicate rows using self-join
DELETE c1
FROM customers c1
JOIN customers c2
ON c1.Name = c2.Name AND c1.ID > c2.ID;

The above DELETE ... JOIN syntax works in MySQL and SQL Server; in PostgreSQL, the equivalent is DELETE FROM customers c1 USING customers c2 WHERE c1.Name = c2.Name AND c1.ID > c2.ID. Check out our Intermediate SQL course to learn more about using aggregate functions and joins to filter data.
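A portable variant of the same delete uses a correlated EXISTS, which runs unchanged on all three systems. A sqlite3 sketch with invented sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (ID INTEGER, Name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Alice"), (2, "Alice"), (3, "Alice"), (4, "Bob")])

# Delete any row for which an earlier (lower-ID) row with the same Name exists
conn.execute("""
    DELETE FROM customers
    WHERE EXISTS (
        SELECT 1 FROM customers c2
        WHERE c2.Name = customers.Name AND c2.ID < customers.ID
    )
""")

remaining = conn.execute("SELECT ID, Name FROM customers ORDER BY ID").fetchall()
print(remaining)  # [(1, 'Alice'), (4, 'Bob')]
```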

Methods for Removing Duplicates in the Database

While you can remove duplicate records using queries, you can also permanently delete them from the database. This approach is important for maintaining data quality. The following methods are used to remove duplicates from the database.

Using ROW_NUMBER() and DELETE

The ROW_NUMBER() function assigns a sequential number to rows within a defined partition. When used with the DELETE statement, it helps identify duplicates by ranking rows based on specific columns and removing unwanted records. This method applies to modern versions of MySQL (from 8.0), PostgreSQL, and SQL Server.

Suppose you want to remove duplicate customer records based on the Name column, keeping only the first occurrence (smallest ID):

-- Common Table Expression (CTE) to rank rows based on 'Name'
WITH CTE AS (
    SELECT ID, Name, ROW_NUMBER() OVER (PARTITION BY Name ORDER BY ID ASC) AS RowNum
    FROM customers
)
-- Delete rows from the 'customers' table where the row number is greater than 1
DELETE FROM customers
WHERE ID IN (SELECT ID FROM CTE WHERE RowNum > 1);
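The statement can be exercised in miniature with Python's sqlite3 module, which also accepts a CTE in front of DELETE (sample data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (ID INTEGER, Name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Alice"), (2, "Alice"), (3, "Bob"), (4, "Bob")])

# Rank rows per Name, then delete everything after the first occurrence
conn.execute("""
    WITH CTE AS (
        SELECT ID,
               ROW_NUMBER() OVER (PARTITION BY Name ORDER BY ID ASC) AS RowNum
        FROM customers
    )
    DELETE FROM customers
    WHERE ID IN (SELECT ID FROM CTE WHERE RowNum > 1)
""")

remaining = conn.execute("SELECT ID, Name FROM customers ORDER BY ID").fetchall()
print(remaining)  # [(1, 'Alice'), (3, 'Bob')]
```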

Using DELETE with subquery

Sometimes, a simple DELETE with a subquery can remove duplicates from the database. This method is suitable for older versions of MySQL or PostgreSQL where ROW_NUMBER() might not be available.

The query below deletes rows from the customers table where the ID is not the minimum for each Name, keeping only the row with the smallest ID for each unique Name. MySQL does not allow a DELETE to select directly from the table being modified, so the subquery is wrapped in a derived table.

-- Delete rows from the 'customers' table
DELETE FROM customers
WHERE ID NOT IN (
    -- Derived table (needed for MySQL) holding the minimum ID per unique Name
    SELECT MinID FROM (
        SELECT MIN(ID) AS MinID
        FROM customers
        GROUP BY Name
    ) keepers
);

Using GROUP BY with HAVING clause

When you need to check for duplicate values in specific columns, the GROUP BY clause combined with the HAVING clause can be used to identify duplicates. This method allows you to delete specific rows based on the given criteria. This method is compatible with SQL Server, MySQL, and PostgreSQL.

The following query deletes, in each group of repeated names, every row except the one with the smallest ID. The HAVING clause restricts the work to groups that actually contain duplicates, and the derived-table wrapper lets MySQL delete from the same table it reads.

-- Delete rows from the 'customers' table where there are duplicates
DELETE FROM customers
WHERE ID IN (
    -- Subquery to find IDs of duplicate rows beyond the first occurrence
    SELECT ID FROM (
        SELECT c.ID
        FROM customers c
        JOIN (
            SELECT Name, MIN(ID) AS MinID
            FROM customers
            GROUP BY Name
            HAVING COUNT(*) > 1
        ) d ON c.Name = d.Name AND c.ID > d.MinID
    ) dup
);

Using temporary tables for batch processing

Temporary tables are efficient for batch processing and removing duplicates in large datasets, where a single query could cause performance issues. The following query creates a temporary table that stores the minimum customer_id for each customer_name, then deletes rows from the customers table whose customer_id is not in temp_customers.

-- Create a temporary table
CREATE TEMPORARY TABLE temp_customers AS
SELECT MIN(customer_id) AS ID, customer_name
FROM customers
GROUP BY customer_name;
DELETE FROM customers
WHERE customer_id NOT IN (SELECT ID FROM temp_customers);

The above syntax using CREATE TEMPORARY TABLE is only supported in MySQL and PostgreSQL databases. 
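A small sqlite3 sketch of the two-step pattern, with invented sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, customer_name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Alice"), (2, "Alice"), (3, "Bob")])

# Stage the IDs to keep in a temporary table, then delete everything else
conn.executescript("""
    CREATE TEMPORARY TABLE temp_customers AS
    SELECT MIN(customer_id) AS ID, customer_name
    FROM customers
    GROUP BY customer_name;

    DELETE FROM customers
    WHERE customer_id NOT IN (SELECT ID FROM temp_customers);
""")

remaining = conn.execute(
    "SELECT customer_id, customer_name FROM customers ORDER BY customer_id"
).fetchall()
print(remaining)  # [(1, 'Alice'), (3, 'Bob')]
```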

Remove Duplicates in SQL Server

SQL Server offers different methods of removing duplicate records from the database. These methods include using DISTINCT with INTO, ROW_NUMBER(), and temporary tables.

Using DISTINCT with INTO

You can use the DISTINCT keyword with SELECT ... INTO to create a new table containing only unique records. Once you verify the new table holds the expected records, you can drop the old one. The following example creates the unique_customers table with unique records from the customers table.

-- Select distinct rows from 'customers' and create a new table 'unique_customers'
SELECT DISTINCT *
INTO unique_customers
FROM customers;
-- Drop the original 'customers' table to remove it from the database
DROP TABLE customers;
-- Rename the 'unique_customers' table to 'customers' to replace the original table
EXEC sp_rename 'unique_customers', 'customers';

Using ROW_NUMBER()

You can also use the ROW_NUMBER() function to remove duplicate records in SQL Server. Assume you have a Customers table with duplicate rows based on the CustomerName column, and you want to delete all but the first occurrence in each duplicate group.

-- Common Table Expression (CTE) to assign a row number to each customer 
WITH CTE AS (
    SELECT CustomerID, CustomerName, ROW_NUMBER() OVER (PARTITION BY CustomerName ORDER BY CustomerID ASC) AS RowNum
    FROM Customers
)
-- Delete rows from the CTE
DELETE FROM CTE
WHERE RowNum > 1;

Using temporary table

Since SQL Server does not support the CREATE TEMPORARY TABLE statement, you use SELECT ... INTO instead. Temporary table names in SQL Server are prefixed with #.

-- Create a temporary table
SELECT MIN(CustomerID) AS ID, CustomerName
INTO #temp_customers
FROM customers
GROUP BY CustomerName;
-- Delete rows from the 'customers' table where the ID is not in the temporary table
DELETE FROM customers
WHERE CustomerID NOT IN (SELECT ID FROM #temp_customers);
-- Optionally drop the temporary table after use
DROP TABLE #temp_customers;

I suggest trying our SQL Server Fundamentals skill track to improve your joining tables and data analysis skills. The SQL Server Developer career track will equip you with the skills to write, troubleshoot, and optimize your queries using SQL Server.  

Best Practices

Duplicate rows are a common problem affecting data quality and database performance. Consider the following best practices to prevent duplicate records from being inserted in your database.

  • Use Primary Keys: The primary key column ensures that each record contains unique information, preventing duplicate values from entering the table.
  • Implement Unique Constraints: Applying unique constraints to any column ensures no duplicates exist across non-primary key columns, such as email addresses or phone numbers.
  • Proper Database Design and Normalization: Effective schema design and database normalization help reduce redundancy and duplicate data. This approach ensures each record is stored in specific tables.
  • Use Unique Indexes: Use unique indexes to ensure that certain column combinations are unique without requiring full table-level constraints across the entire dataset.
  • Regular Data Audits: Perform regular data audits by running queries to identify potential duplicates based on your business rules.
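The first two practices can be enforced directly in the schema, so a duplicate insert fails instead of silently landing in the table. A sqlite3 sketch with an invented customers schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        ID INTEGER PRIMARY KEY,   -- unique row identifier
        Email TEXT UNIQUE         -- unique constraint on a non-key column
    )
""")
conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")

# A second row with the same email violates the UNIQUE constraint
try:
    conn.execute("INSERT INTO customers VALUES (2, 'a@example.com')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)  # True
```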

Conclusion

Identifying and removing duplicate rows is important to maintaining database efficiency and data accuracy. It is always a best practice to back up your data before making modifications to ensure no accidental data loss occurs.

If you are interested in becoming a proficient data analyst, check out our Associate Data Analyst in SQL career track to learn the necessary skills. The Reporting in SQL course is also appropriate if you want to learn how to build professional dashboards using SQL. Finally, I recommend obtaining the SQL Associate Certification to demonstrate your mastery of using SQL for data analysis and stand out among other data professionals.


Author: Allan Ouko
I create articles that simplify data science and analytics, making them easy to understand and accessible.

Frequently Asked SQL Questions

What causes duplicate rows in SQL databases?

Duplicate rows can occur due to several factors, including improper database design, missing primary keys, data integration from multiple sources, manual data entry errors, or data migration issues where validation isn’t properly enforced.

Can I prevent duplicates based on multiple columns?

Yes, you can enforce uniqueness across multiple columns using composite keys or unique constraints. This ensures that combinations of values across those columns remain unique.
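For example, a composite UNIQUE constraint on two invented name columns (sketched with Python's sqlite3 module):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Composite UNIQUE constraint: the (first_name, last_name) pair must be unique
conn.execute("""
    CREATE TABLE customers (
        first_name TEXT,
        last_name  TEXT,
        UNIQUE (first_name, last_name)
    )
""")
conn.execute("INSERT INTO customers VALUES ('Alice', 'Smith')")
conn.execute("INSERT INTO customers VALUES ('Alice', 'Jones')")  # OK: pair differs

# Repeating an existing pair violates the constraint
try:
    conn.execute("INSERT INTO customers VALUES ('Alice', 'Smith')")
    pair_rejected = False
except sqlite3.IntegrityError:
    pair_rejected = True

print(pair_rejected)  # True
```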

How does the DISTINCT keyword remove duplicate rows?

Using the DISTINCT keyword only removes duplicates in the query results and does not alter the underlying data.

Which method can you use to permanently delete duplicate records from the database?

You can use ROW_NUMBER() with DELETE, DELETE with subquery, GROUP BY with HAVING clause, and temporary tables for batch processing to permanently delete duplicate rows from the database.

Can duplicates affect the performance of my database?

Yes, duplicates can negatively impact performance by increasing storage costs, slowing queries, and complicating data analysis.
