Find and delete duplicate rows in MySQL with examples

Introduction

Sometimes, unwanted duplicate data in the database creates problems for the application. It gives an error message and we search for the problem in the program. Even though you might have handled exceptions with proper error messages, we still need to identify the root cause of the issue. Here, we are talking about duplicate rows in the database, meaning, a group of columns or all the columns of a table have the same data in multiple rows. In this topic, we will discuss identifying and deleting duplicate rows from a MySQL table.

How does duplicate data get created?

Let us take an example - Employee table can have the columns - name, employee ID, email and salary. The employee ID and email are supposed to be unique individually, and the application never allows the same employee ID or email for two or more rows in a table. However, there could be data migration or data import that can create duplicate data, when database level constraints are not present. In this case, we need to fix the data loading program, but still, we need to check if duplicate rows are being created. To identify and remove the duplicates we will write SQL queries on the table.

Why do we need to identify them

Duplicate rows can result in an error for an application. Because additional rows are unwanted and program will raise an exception. Even if data is not being used, it occupies extra space in the database. So, we should identify them and remove them from the database.

How to identify duplicate rows in a table?

We can use GROUP BY with HAVING clause in our SQL statement to identify the duplicate rows. We can use SQL count() function by grouping the columns for which we want to view duplicate data. If you want to see duplicate emails, use GROUP BY on the email column only. If you want to get duplicate data for name, email and salary, then use GROUP BY on all these columns.

Let us create a table named "employees" in MySQL. We will insert some data in this table with duplicate rows. Create table scripts with insert statements are given below:


CREATE TABLE `employees` (
  `id` int(11) NOT NULL,
  `emp_id` int(11) NOT NULL,
  `name` varchar(255) NOT NULL,
  `email` varchar(255) NOT NULL,
  `gender` varchar(255) NOT NULL,
  `salary` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci;
    
INSERT INTO `employees` (`id`, `emp_id`, `name`, `email`, `gender`, `salary`) VALUES
(1, 2230, 'John Doe', 'jdoe@test.com', 'M', 20000),
(2, 2624, 'Robert S', 'roberts@test.com', 'M', 10000),
(3, 9654, 'Joe Stewart', 'jstewart@test.com', 'M', 40000),
(4, 2262, 'Julie Agarwal', 'sagarwal@test.com', 'F', 25000),
(5, 2471, 'Amanda J', 'amandaj@test.com', 'F', 18000),
(6, 2230, 'John Doe', 'jdoe@test.com', 'M', 20000),
(7, 9654, 'Joe Stewart', 'jstewart@test.com', 'M', 40000);
     
ALTER TABLE `employees`
  ADD PRIMARY KEY (`id`);

ALTER TABLE `employees`
  MODIFY `id` int(11) NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=8;

Structure of the employees table and the inserted data are given below:

employees table structure to show example of duplicate rows

Table data:

employees table data to identify duplicate rows in sql

You can see there are duplicate rows above as marked by arrows.

We will use phpMyAdmin SQL query window to write and execute the queries. To identify the duplicate rows, we will use below SQL statement:


  SELECT emp_id, name, email, gender, salary, count(*)
  FROM employees
  GROUP BY emp_id, name, email, gender, salary
  HAVING count(*) > 1

You can see we are selecting count of rows along with the columns for which we want to get the duplicates and using GROUP BY on those columns. Then we are using HAVING clause with count more than one. After executing the above query, we get below results:

employees table structure to show example of dupliacte rows

You can see the query has identified the duplicate rows with count as 2 for two rows, now how to delete these duplicates rows?.

MySQL query to delete duplicate rows from a table

We want to keep the row that was created first and remove the one created later. This means we want to delete the row with higher id. So, if you look at the table data, our delete statement should delete the last two rows which are unwanted, i.e., rows with id 6 and 7 should be deleted.

Delete Statement 1


DELETE FROM employees
WHERE EXISTS (SELECT 'x' FROM employees emp
              WHERE  employees.emp_id = emp.emp_id
              AND employees.name = emp.name
              AND employees.email = emp.email
              AND employees.gender = emp.gender
              AND employees.salary = emp.salary
              AND employees.id > emp.id)

After executing the above query, we get below results:

mysql query to identify duplicate rows

If you now look at the table data we see that rows with id 6 and 7 are deleted. Duplicate rows are removed now.

mysql query to show duplicate rows in a table

Delete Statement 2

Another way to delete the duplicate rows is to select the MIN(id) in a subquery and delete other rows which are not in the list of min ids. See the below SQL statement:


  DELETE FROM employees
  WHERE id NOT IN (SELECT MIN(id) FROM employees 
                  GROUP BY emp_id, name, email, gender, salary)

Delete duplicate data in SQLWatch YouTube Video

Conclusion

SQL statements with GROUP BY and HAVING clauses can be used to identify the count of rows with the same data for one or more columns. These are very useful to identify duplicates. We have discussed the SQL statement to identify the duplicate rows and then another SQL statement to delete only the duplicate rows. I hope, this will be useful for you.