Eliminating Duplicate Records From MySQL Tables

Any­one who works with data­base dri­ven devel­op­ment to any extent, will occa­sion­al­ly run into a sit­u­a­tion where dupli­cate infor­ma­tion is added to their data­base. I have per­son­al­ly run into such a prob­lem on many occa­sions since I start­ed work­ing with data­base dri­ven soft­ware.

Being able to quick­ly undo data cor­rup­tion is extreme­ly impor­tant in pro­duc­tion data­bas­es. As much as we like to have a per­fect pro­duc­tion envi­ron­ment; it is often very dif­fi­cult to do ded­i­cate the time and resources nec­es­sary to avoid all mis­takes.

The fol­low­ing exam­ple is a great way to remove dupli­cates, based upon a set of fields. Basi­cal­ly you cre­ate an unique index on a list of fields (com­pos­ite key) of your spec­i­fi­ca­tion.

Let’s say you have the fol­low­ing table:

mysql> desc result;
+-------------------------+--------------+------+-----+---------+----------------+
| Field                   | Type         | Null | Key | Default | Extra          |
+-------------------------+--------------+------+-----+---------+----------------+
| id                      | bigint(20)   | NO   | PRI | NULL    | auto_increment |
| comment                 | varchar(255) | YES  |     | NULL    |                |
| key1_id                 | bigint(20)   | YES  | MUL | NULL    |                |
| key2_id                 | bigint(20)   | YES  | MUL | NULL    |                |
| str1                    | varchar(255) | YES  | MUL | NULL    |                |
| result                  | float        | YES  |     | NULL    |                |
+-------------------------+--------------+------+-----+---------+----------------+

In this exam­ple, the fields which you do not want dupli­cat­ed are as fol­lows:

| key1_id              | bigint(20)   | YES  | MUL | NULL    |                |
| key2_id              | bigint(20)   | YES  | MUL | NULL    |                |
| str1                 | varchar(255) | YES  | MUL | NULL    |                |

In remove dupli­cates on those three fields, to where there is only one record in the data­base with the same val­ues for each of those three fields, you would do the fol­low­ing:

alter ignore table result add unique index unique_result (key1_id,key2_id,str1);

I ran this on one of my data­bas­es and got the fol­low­ing:

Query OK, 12687876 rows affected (21 min 13.74 sec)
Records: 12687876  Duplicates: 1688  Warnings: 0

In this exam­ple, we removed 1688 dupli­cate records based upon the three spec­i­fied fields.