Find and Delete Duplicate Rows in MySQL with Examples

Introduction

Sometimes, unwanted duplicate data in the database creates problems for the application. It gives an error message and we search for the problem in the program. Even though you might have handled exceptions with proper error messages, we still need to identify the root cause of the issue. Here, we are talking about duplicate rows in the database, meaning, a group of columns or all the columns of a table have the same data in multiple rows. In this topic, we will discuss how to identify and delete duplicate rows from a MySQL table.

How does duplicate data get created?

Let us take an example of the Employee table that can have columns - name, employee ID, email and salary. The employee ID and email are supposed to be unique individually, and the application never allows the same employee ID or email for two or more rows in a table. However, there could be data migration or data import that can create duplicate data, when database-level constraints are not present. In this case, we need to fix the data loading program, but still, we need to check if duplicate rows are being created. To identify and remove duplicate rows in a table, we will write the SQL queries on the table.

Why do we need to identify them

Duplicate rows can result in an error for an application. Because additional rows are unwanted and program will raise an exception. Even if data is not being used, it occupies extra space in the database. So, we should identify them and remove them from the database.

How to identify duplicate rows in a table?

We can use GROUP BY with HAVING clause in our SQL statement to identify the duplicate rows. We can use the SQL count() function by grouping the columns for which we want to view duplicate data. If you want to get duplicate emails, use GROUP BY on the email column only. If you want to get duplicate data for name, email and salary, then use GROUP BY on all these columns.

Let us create a table named "employees" in MySQL. We will insert some data in this table with duplicate rows. Create table scripts with insert statements are given below:


CREATE TABLE `employees` (
  `id` int(11) NOT NULL,
  `emp_id` int(11) NOT NULL,
  `name` varchar(255) NOT NULL,
  `email` varchar(255) NOT NULL,
  `gender` varchar(255) NOT NULL,
  `salary` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci;
    
INSERT INTO `employees` (`id`, `emp_id`, `name`, `email`, `gender`, `salary`) VALUES
(1, 2230, 'John Doe', 'jdoe@test.com', 'M', 20000),
(2, 2624, 'Robert S', 'roberts@test.com', 'M', 10000),
(3, 9654, 'Joe Stewart', 'jstewart@test.com', 'M', 40000),
(4, 2262, 'Julie Agarwal', 'sagarwal@test.com', 'F', 25000),
(5, 2471, 'Amanda J', 'amandaj@test.com', 'F', 18000),
(6, 2230, 'John Doe', 'jdoe@test.com', 'M', 20000),
(7, 9654, 'Joe Stewart', 'jstewart@test.com', 'M', 40000);
     
ALTER TABLE `employees`
  ADD PRIMARY KEY (`id`);

ALTER TABLE `employees`
  MODIFY `id` int(11) NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=8;

Structure of the employees table and the inserted data are given below:

employees table structure to show example of duplicate rows

Table data:

employees table data to identify duplicate rows in sql

You can see there are duplicate rows above as marked by arrows.

We will use phpMyAdmin SQL query window to write and execute the queries. To identify the duplicate rows, we will use below SQL statement:


  SELECT emp_id, name, email, gender, salary, count(*)
  FROM employees
  GROUP BY emp_id, name, email, gender, salary
  HAVING count(*) > 1

You can see we are selecting the count of rows along with the columns for which we want to get the duplicates and using GROUP BY on those columns. Then we use the HAVING clause with a count of more than one. After executing the above query, we get the below results:

employees table structure to show example of dupliacte rows

You can see that the query has identified the duplicate rows with a count of 2 for two rows, now how do we delete these duplicate rows?.

MySQL query to delete duplicate rows from a table

We want to keep the row created first and remove the one created later. This means we want to delete the row with a higher id. So, if you look at the table data, our delete statement should delete the last two unwanted rows, i.e., the rows with id 6 and 7 should be deleted.

Delete Statement 1


DELETE FROM employees
WHERE EXISTS (SELECT 'x' FROM employees emp
              WHERE  employees.emp_id = emp.emp_id
              AND employees.name = emp.name
              AND employees.email = emp.email
              AND employees.gender = emp.gender
              AND employees.salary = emp.salary
              AND employees.id > emp.id)

After executing the above query, we get the below results:

mysql query to identify duplicate rows

If you take a look at the table data now, you will see that the rows with IDs 6 and 7 are deleted. So, the duplicate rows are removed now.

mysql query to show duplicate rows in a table

Delete Statement 2

Another way to delete the duplicate rows is to select the MIN(id) in a subquery and delete the other rows that are not in the list of min IDs. See the below SQL statement:


  DELETE FROM employees
  WHERE id NOT IN (SELECT MIN(id) FROM employees 
                  GROUP BY emp_id, name, email, gender, salary)

Delete duplicate data in SQLWatch YouTube Video

Conclusion

SQL statements with GROUP BY and HAVING clauses can be used to identify the count of rows with the same data for one or more columns. These are very useful to identify duplicates. We have discussed the SQL statement to identify the duplicate rows and then another SQL statement to delete only the duplicate rows. I hope this will be useful to you.

Post a Comment

Save my Name and Email id for future comments