|
PostgreSQL Books Saturday, January 12. 2008Deleting Duplicate Records in a TableRecommended Books: SQL Cookbook SQL for Mere Mortals
![]()
Recommended Books: SQL Cookbook SQL for Mere Mortals
Comments
Display comments as
(Linear | Threaded)
Have you checked if this is faster or slower than the form I've seen used many times and have gotten used to:
delete from tab a where exists (select 1 from tab b where a.uniq1=b.uniq1 and a.uniq2=b.uniq2 and a.prkey>b.prkey)
Glad you asked. For this particular example we chose not to show that approach since it was considerably slower than the above (so slower we don't bother waiting for it to finish). I suspect it depends on if you have indexes on the dupe fields and the ratio of duplicates to non-dupes. This example is odd in that there are more duplicates than actual rows we are keeping. So we may try it when its the reverse case and it may win out.
Just FYI. Writing this example with the exists would be something like DELETE FROM duptest a WHERE EXISTS (SELECT 1 FROM duptest b WHERE a.first_name=b.first_name and a.last_name=b.last_name and a.name_key < b.name_key)
This method depends upon a unique id. If an auto-number wasn't designed onto a table, the table.CTID could be used in-place of this.
Since the CTID is a postgresql-ism, some don't like to use it. But it is an option that is available to use in a case like this.
This post helped a lot. The only problem was type casting. I couldn't run query without it on pgsql 8.1. Here's modified version:
DELETE FROM duptest WHERE textin(tidout(ctid)) NOT IN (SELECT max(textin(tidout(t1.ctid))) FROM duptest AS t1 GROUP BY t1.dupid);
In your preferred method, doesn't also mean that the delete query itself requires you to have at least enough hard drive space available to effectively cache your entire database over again?
You mean entire table right? If you don't have enough disk space to cache a single table in your database, then you have serious problems anyway.
The approach with the ctid seems to be impractical for tables with a a lot of records. I tested it with PostgreSQL 8.2 on a table with 8 million rows and cancelled the statement after 15 hours. For every record the server has to check whether its ctid is outside of a a set of 8 million minus N ctids where N is the number of duplicate rows.
In PostgreSQL 8.4 you can do the same more efficiently with window functions in a subselect (count with partion over). I resorted to an approach in Perl: SELECT ctid, first_name, last_name FROM duptest ORDER BY first_name, last_name Then you iterate over the result set and delete all rows where first_name and last_name is equal to the row before. Since you only need the ctid of the duplicte rows in the where clause of the delete statement, you don't need any index for that approach.
I tried the SQL code here and it was quite slow on a table with 750,000 records. It kept timing out on me. A little tweaking I managed to get it down to under a second.
DELETE FROM sometable WHERE uniquefield IN (SELECT max(uniquefield) FROM sometable GROUP BY dupcol1 HAVING Count(dupcol1)>1); Now I am using only one dupcolumn here, haven't tried to make it work with more than one. Simply replace uniquefield with ctid if you must. Will take 2 seconds instead of 1. |
QuicksearchCalendarCategoriesBlog AdministrationEntry's LinksShow tagged entries |
Tracked: Nov 24, 10:54
Tracked: Nov 24, 21:43