Deleting Duplicate Records in a Table

Saturday, January 12. 2008

Question:

How do you delete duplicate rows in a table and still maintain one copy of the duplicate?

Answer:

There are several ways to do this, and the right one depends on how big your table is, whether you have constraints in place, how much programming you want to do, whether you have a surrogate key, and whether you have the luxury of taking the table down. Approaches range from using subselects, to dropping the table and rebuilding it from a temp table populated by a DISTINCT query, to non-set-based techniques such as cursors.
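As a rough illustration of the rebuild route mentioned above, the sketch below copies the distinct rows to a temp table and reloads the original from it. It is PostgreSQL-flavored and rests on several assumptions: sometable and dupcolumn1 through dupcolumn3 are placeholder names, sometable_dedup is a made-up temp table name, the table has no columns other than the ones listed, and no foreign key references it. It also uses TRUNCATE rather than an actual DROP TABLE, so the table definition stays in place.

```sql
-- Sketch only: all table and column names here are placeholders.
BEGIN;

-- Block concurrent reads and writes while the contents are swapped.
LOCK TABLE sometable IN ACCESS EXCLUSIVE MODE;

-- Copy one row per distinct combination into a temp table.
CREATE TEMP TABLE sometable_dedup AS
SELECT DISTINCT dupcolumn1, dupcolumn2, dupcolumn3
FROM sometable;

-- Empty the original, then reload it from the de-duplicated copy.
TRUNCATE TABLE sometable;

INSERT INTO sometable (dupcolumn1, dupcolumn2, dupcolumn3)
SELECT dupcolumn1, dupcolumn2, dupcolumn3
FROM sometable_dedup;

DROP TABLE sometable_dedup;

COMMIT;
```

The table is unavailable to other sessions until the COMMIT, which is the "taking a table down" cost referred to above.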

The approach we often use is this one:


--Keep the row with the highest key in each group of duplicates and delete the rest
DELETE
FROM    sometable
WHERE   someuniquekey NOT IN
        (SELECT   MAX(dup.someuniquekey)
         FROM     sometable AS dup
         GROUP BY dup.dupcolumn1, dup.dupcolumn2, dup.dupcolumn3);

We prefer this approach for the following reasons:

It's the simplest to implement.
It works equally well across many relational databases.
It does not require you to take the table offline. Of course, if you have a foreign key constraint in place, you will need to move the related child records before you can delete the parent.
You don't have to break relationships to do this, as you would with the drop-table approaches.
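Moving the child records could be sketched as follows. This assumes PostgreSQL 8.4 or later (for the window function) and a hypothetical child table childtable whose sometable_key column references sometable.someuniquekey; neither of those two names comes from the example above.

```sql
-- Hypothetical names: childtable and sometable_key are illustrative only.
-- Re-point each child row at the surviving (highest-key) row of its parent's
-- duplicate group, so that the duplicate parents can then be deleted.
UPDATE childtable AS c
SET    sometable_key = keep.keep_key
FROM   (SELECT s.someuniquekey AS old_key,
               MAX(s.someuniquekey) OVER (
                   PARTITION BY s.dupcolumn1, s.dupcolumn2, s.dupcolumn3
               ) AS keep_key
        FROM   sometable AS s) AS keep
WHERE  c.sometable_key = keep.old_key
  AND  keep.old_key <> keep.keep_key;
```

MAX is used so that the children end up on the same row that the delete above keeps. If the child table has its own unique constraints, this update may collide with them and would need adjusting.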

The above presumes you have some sort of unique or primary key, such as a serial number (e.g. autonumber, identity) or a character field with a primary or unique key constraint that prevents duplicates. The prime candidates are a serial key, or the OID if you still build your tables WITH OIDS.

If you don't have any of these unique keys, can you still use this technique? In PostgreSQL you can, but in other databases, such as SQL Server, you would have to add a dummy key first and then drop it afterward. The reason you can always use this technique in PostgreSQL is that every record has another hidden key: the ctid. The ctid is a system column that exists in every PostgreSQL table, is unique for each record in the table, and denotes the physical location of the tuple. Because it is a physical location, a row's ctid changes when the row is updated, so treat it only as a short-lived identifier. Keep in mind that you should only use the ctid if you have absolutely no other unique identifier to use. A regular indexed unique identifier will be more efficient. Below is a demonstration of using the ctid to delete records.
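First, for comparison, the dummy-key route for a database without a ctid might look roughly like the T-SQL sketch below. The column name tmp_dedupe_key is made up for illustration, and sometable and dupcolumn1 through dupcolumn3 are the placeholder names used earlier. GO is the batch separator understood by SQL Server client tools rather than a T-SQL statement; it is there because a newly added column cannot be referenced in the same batch that adds it.

```sql
-- T-SQL sketch; tmp_dedupe_key is a hypothetical column name.

-- 1. Add a throwaway identity column so every row gets a distinct number.
ALTER TABLE sometable ADD tmp_dedupe_key INT IDENTITY(1,1) NOT NULL;
GO

-- 2. Keep the highest-numbered row in each group of duplicates.
DELETE FROM sometable
WHERE tmp_dedupe_key NOT IN
      (SELECT MAX(dup.tmp_dedupe_key)
       FROM sometable AS dup
       GROUP BY dup.dupcolumn1, dup.dupcolumn2, dup.dupcolumn3);
GO

-- 3. Drop the throwaway column again.
ALTER TABLE sometable DROP COLUMN tmp_dedupe_key;
```

The PostgreSQL ctid demonstration follows.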


--Create dummy table with dummy data that has duplicates
CREATE TABLE duptest
(
  first_name character varying(50),
  last_name character varying(50),
  mi character(1),
  name_key serial NOT NULL,
  CONSTRAINT name_key PRIMARY KEY (name_key)
)
WITH (OIDS=FALSE);

INSERT INTO duptest(first_name, last_name, mi)
SELECT chr(65 + mod(f,26)), chr(65 + mod(l,26)), 
CASE WHEN f = (l + 2) THEN chr(65 + mod((l + 2), 26)) ELSE NULL END 
FROM 
	generate_series(1,1000) f
	CROSS JOIN generate_series(1,456) l;
	
--Verify how many unique records we have -
--We have 676 unique sets out of 456,000 records
SELECT first_name, last_name, COUNT(first_name) As totdupes
FROM duptest 
GROUP BY first_name, last_name;
	
--Query returned successfully: 455324 rows affected, 37766 ms execution time.
DELETE FROM duptest
	WHERE 	ctid NOT IN
	(SELECT 	MAX(dt.ctid)
        FROM   		duptest As dt
        GROUP BY 	dt.first_name, dt.last_name);

--Same query but using name_key		
--Query returned successfully: 455324 rows affected, 3297 ms execution time.
DELETE FROM duptest
	WHERE 	name_key NOT IN
	(SELECT 	MAX(dt.name_key)
        FROM   		duptest As dt
        GROUP BY 	dt.first_name, dt.last_name);
		
--Verify we have 676 records in our table
SELECT COUNT(*) FROM duptest;

A slight variation on the above approach is to use a DISTINCT ON query. This one only works in PostgreSQL, since DISTINCT ON is a PostgreSQL extension, but it has the advantage of letting you choose which record to keep based on which has the most information. For example, here we prefer records that have a middle initial over ones that do not. The downside of using DISTINCT ON is that you really need a real key. You can't use the hidden ctid field, but you can use an oid field. Below is the same query using DISTINCT ON:


--Repeat same steps above except using a DISTINCT ON query instead of MAX query
--Query returned successfully: 455324 rows affected, 5422 ms execution time. 
DELETE FROM duptest
	WHERE duptest.name_key
	NOT IN(SELECT DISTINCT ON (dt.first_name, dt.last_name) 
        dt.name_key
       FROM duptest dt 
       ORDER BY dt.first_name, dt.last_name, COALESCE(dt.mi, '') DESC) ;

Note: if you want to pick the record to keep based on which one has the most information, you can change the ORDER BY to something like this (dt.address stands in for any additional column you want to score):

ORDER BY dt.first_name, dt.last_name, (CASE WHEN dt.mi > '' THEN 1 ELSE 0 END + CASE WHEN dt.address > '' THEN 1 ELSE 0 END ..etc) DESC
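Before running a delete like this, it can help to preview which record each group would keep. The query below is a sketch against the duptest table from the demonstration, scoring only the mi column. The name_key DESC tiebreaker is an addition that is not in the original query; without it, the choice among equally scored rows is arbitrary, so the preview only matches the delete if both use the same ORDER BY.

```sql
-- Preview: one row per (first_name, last_name), showing the name_key that a
-- DISTINCT ON delete with this same ORDER BY would keep.
SELECT DISTINCT ON (dt.first_name, dt.last_name)
       dt.first_name, dt.last_name, dt.mi, dt.name_key
FROM duptest dt
ORDER BY dt.first_name, dt.last_name,
         (CASE WHEN dt.mi > '' THEN 1 ELSE 0 END) DESC,  -- prefer rows with a middle initial
         dt.name_key DESC;                               -- tiebreaker (an assumption)
```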

Posted by Leo Hsu and Regina Obe in intermediate, q&a, sql server at 04:06

Comments

Have you checked if this is faster or slower than the form I've seen used many times and have gotten used to:

delete from tab a where exists (select 1 from tab b where a.uniq1=b.uniq1 and a.uniq2=b.uniq2 and a.prkey>b.prkey)

#1 Symbiatch on 2008-01-15 03:56

Glad you asked. For this particular example we chose not to show that approach, since it was considerably slower than the above (so much slower that we didn't bother waiting for it to finish). I suspect it depends on whether you have indexes on the dupe fields and on the ratio of duplicates to non-dupes. This example is odd in that there are more duplicates than rows we are keeping. So we may try it when it's the reverse case, and it may win out.

Just FYI, writing this example with EXISTS would be something like:

DELETE FROM duptest a WHERE EXISTS (SELECT 1 FROM duptest b WHERE a.first_name=b.first_name and a.last_name=b.last_name and a.name_key < b.name_key)

#1.1 Leo and Regina on 2008-01-16 15:07

This method depends upon a unique id. If an auto-number wasn't designed into a table, the table's CTID could be used in place of it.

Since the CTID is a PostgreSQL-ism, some don't like to use it. But it is an option that is available in a case like this.

#2 Richard Broersma Jr. on 2008-01-15 13:39

This post helped a lot. The only problem was type casting. I couldn't run the query without it on pgsql 8.1. Here's a modified version:

DELETE FROM duptest WHERE textin(tidout(ctid)) NOT IN (SELECT max(textin(tidout(t1.ctid))) FROM duptest AS t1 GROUP BY t1.dupid);

#3 Rustam Aliyev on 2009-06-15 18:48

With your preferred method, doesn't the delete query itself also require you to have at least enough hard drive space available to effectively cache your entire database over again?

#4 Derick on 2009-08-19 12:24

You mean the entire table, right? If you don't have enough disk space to cache a single table in your database, then you have serious problems anyway.

#5 Regina on 2009-08-20 12:45

The approach with the ctid seems to be impractical for tables with a lot of records. I tested it with PostgreSQL 8.2 on a table with 8 million rows and cancelled the statement after 15 hours. For every record, the server has to check whether its ctid is outside of a set of 8 million minus N ctids, where N is the number of duplicate rows.

In PostgreSQL 8.4 you can do the same more efficiently with window functions in a subselect (count with partition over).
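A window-function version might look roughly like the sketch below. It assumes PostgreSQL 8.4 or later and the duptest table from the article, and it uses row_number() rather than a count, which is a substitution: numbering the rows in each group makes it easy to keep exactly one. The alias row_ctid is made up for illustration.

```sql
-- Number the rows within each (first_name, last_name) group, then delete
-- every row except the first one in its group.
DELETE FROM duptest
WHERE ctid IN
      (SELECT ranked.row_ctid
       FROM (SELECT dt.ctid AS row_ctid,
                    row_number() OVER (PARTITION BY dt.first_name, dt.last_name
                                       ORDER BY dt.ctid) AS rn
             FROM duptest AS dt) AS ranked
       WHERE ranked.rn > 1);
```

This removes all extra copies in one pass, since every row with rn greater than 1 is a duplicate of the row numbered 1 in its group.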

I resorted to an approach in Perl:

SELECT ctid, first_name, last_name FROM duptest ORDER BY first_name, last_name

Then you iterate over the result set and delete all rows where first_name and last_name are equal to those of the row before. Since you only need the ctid of the duplicate rows in the WHERE clause of the delete statement, you don't need any index for that approach.

#6 Guido Flohr on 2009-11-17 04:37

Your query is indeed much faster, but it deletes only one duplicate per entry at a time --> you have to run it repeatedly if you have more than one duplicate.

#6.1 Jani on 2011-12-07 18:40

I tried the SQL code here and it was quite slow on a table with 750,000 records. It kept timing out on me. With a little tweaking I managed to get it down to under a second.

DELETE FROM sometable
WHERE uniquefield IN
(SELECT max(uniquefield)
FROM sometable
GROUP BY dupcol1
HAVING Count(dupcol1)>1);

Note that I am using only one dupcolumn here; I haven't tried to make it work with more than one.

Simply replace uniquefield with ctid if you must. It will take 2 seconds instead of 1.

#7 Brad Mathews on 2010-01-08 21:01

I've found that using WHERE EXISTS instead of grouping gives much, much better performance when deleting from a large table.

Using the initial example, the query would look like:

DELETE FROM duptest d1 WHERE EXISTS
(SELECT 'x' FROM duptest d2
WHERE d1.first_name = d2.first_name
AND d1.last_name = d2.last_name
AND d1.unique_id < d2.unique_id);

One would need a unique column (PK column perhaps) as the unique_id column, and an index on the first_name and last_name column would be preferable.

On my setup (with 3 grouping columns) it takes 17 seconds to delete around 76,000 duplicate rows from around 1.2 million rows of data. The solution that groups the columns takes much, much longer for me.

#8 Ove Andersen on 2010-07-29 22:17

I tried both of the queries suggested by this blog post, and the variation described in comment #8, on a table with 25 million entries and no unique key (so using ctid).

Both queries took forever (I cancelled them after 6 days each) and used a huge amount of resources (the first one mainly CPU, the second up to 26GB of RAM).

In the end, I wrote a simple Perl script which finished in a few hours (using a temporary table to store the unique entries).

#9 Patrick Meidl on 2011-05-24 03:48

 name |  phone   | birth_date | balance
------+----------+------------+---------
 a    | 555-8628 | 1988-06-10 |   23.00
 b    | 555-0780 | 1986-12-02 |   25.00
 c    | 555-5898 | 1965-06-14 |   46.00
 d    | 555-5797 | 1961-03-18 |   48.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00

Please let me know how to delete these duplicate rows from this table.
I tried the approaches above, but I get only errors instead of the desired result.

#10 nikhil on 2011-11-07 03:00

Postgres OnLine Journal
