Eliminating Duplicate Records From MySQL Tables


Anyone who works with database driven development to any extent, will occasionally run into a situation where duplicate information is added to their database. I have personally run into such a problem on many occasions since I started working with database driven software.

Being able to quickly undo data corruption is extremely important in production databases. As much as we like to have a perfect production environment; it is often very difficult to do dedicate the time and resources necessary to avoid all mistakes.

The following example is a great way to remove duplicates, based upon a set of fields. Basically you create an unique index on a list of fields (composite key) of your specification.

Let’s say you have the following table:

mysql> desc result;
+-------------------------+--------------+------+-----+---------+----------------+
| Field                   | Type         | Null | Key | Default | Extra          |
+-------------------------+--------------+------+-----+---------+----------------+
| id                      | bigint(20)   | NO   | PRI | NULL    | auto_increment |
| comment                 | varchar(255) | YES  |     | NULL    |                |
| key1_id                 | bigint(20)   | YES  | MUL | NULL    |                |
| key2_id                 | bigint(20)   | YES  | MUL | NULL    |                |
| str1                    | varchar(255) | YES  | MUL | NULL    |                |
| result                  | float        | YES  |     | NULL    |                |
+-------------------------+--------------+------+-----+---------+----------------+

In this example, the fields which you do not want duplicated are as follows:

| key1_id              | bigint(20)   | YES  | MUL | NULL    |                |
| key2_id              | bigint(20)   | YES  | MUL | NULL    |                |
| str1                 | varchar(255) | YES  | MUL | NULL    |                |

In remove duplicates on those three fields, to where there is only one record in the database with the same values for each of those three fields, you would do the following:

alter ignore table result add unique index unique_result (key1_id,key2_id,str1);

I ran this on one of my databases and got the following:

Query OK, 12687876 rows affected (21 min 13.74 sec)
Records: 12687876  Duplicates: 1688  Warnings: 0

In this example, we removed 1688 duplicate records based upon the three specified fields.


Leave a Reply

Your email address will not be published. Required fields are marked *