You have a large table and the default planner stats collector underestimates distinct counts of a critical query column thus resulting in much less than optimal query plans.
How can you manually set this?
PostgreSQL 9.0 introduced ability to set two settings on table columns: n_distinct and n_distinct_inherited which are described a bit in ALTER TABLE help.
The n_distinct is the estimated number of distinct values for that column
with -1 or any negative number representing a percentage of estimated table count instead of a true count.
n_distinct_inherited is a setting useful for parent tables that denotes the estimated distinct count sum of all a parent's child tables.
By tweaking these settings when they are less than optimal, you can influence the query planner to produce better plans. Why this is necessary is mostly for large
tables where the stat collector will not query the whole table to determine stats. The stats collector generally queries at most 10-30% of a table.
Determine If you need to set counts
It's always nice to have the stat collector do all these things for you especially if you have a table that is constantly updated and distinct counts can fluctuate a lot.
For static tables you may just want to set them manually.
So how do you know whether you should bother or not. Well you can check the current values
the stats collector has with this query:
--determineifyourstatsarefine-compareestimateswithwhatyouknowtheyare--SELECT tablename, schemaname, attname As colname, n_distinct,
array_to_string(most_common_vals, E'\n')AS common_vals,
array_to_string(most_common_freqs, E'\n')As dist_freq
WHERE tablename ='table_of_interest'ORDERBY schemaname, tablename, attname;
What is the difference between CURRENT_TIMESTAMP and clock_timestamp()
CURRENT_TIMESTAMP is an ANSI-SQL Standard variable you will find in many relational databases including PostgreSQL, SQL Server, Firebird, IBM DB2 and MySQL to name a few
that records the start of the transaction. The important thing to keep in mind about it is there is only one entry per transaction so if you have a long running transaction,
you won't be seeing it changing as you go along.
clock_timestamp() is a PostgreSQL function that always returns the current clock's timestamp. I don't think I'm alone in using it for doing simple benchmarking and other things
where for example I need to record the timings of each part of a function within the function using pedestrian RAISE NOTICE debug print statements.
There is another cool way I like using it, and that is for a batch of records each with an expensive function call, benchmarking how long it takes to process each record.
One of the things I'm working on is improving the speed of the tiger_geocoder packaged in PostGIS 2.0. The first root of attack seemed to me would be the normalize_address function
which I was noticing was taking anywhere from 10% to 50% of my time in the geocode process. That's a ton of time if you are trying to batch geocode a ton of records. The thing is
the function is very particular to how badly formed the address is so a whole batch could be held up by one bad apple and since the batch doesn't return until all are processed, it makes
the whole thing seem to take a while.
So rather than looping thru each, I thought it would be cool if I could run the batch, but for each record have it tell me how long it took to process relative to the rest so I could get
a sense of what a problem address looks like. So I wrote this query:
the_time -COALESCE(lag(the_time)OVER(ORDERBY the_time), CURRENT_TIMESTAMP)As process_time,
the_time -CURRENT_TIMESTAMPAs diff_from_start
FROM(SELECT address_1, city, state, zip,
pprint_addy(normalize_address(coalesce(address_1,'')||','||coalesce(city ||'','')|| state ||''|| zip))As pp_addr,
FROM testgeocode LIMIT 1000)As foo )SELECT*FROM ctbenchmark
WHERE process_time >'00:00:00.016'::interval;
Which returned an output something like this:
address_1 | city | state | zip | pp_addr | the_time | process_time | diff_from_start
48 MAIN ST .. | S.. | MA | 021.. | 48 MAIN .. | 2011-05-10 03:24:43.078-04 | 00:00:00.032 | 00:00:00.032
15 ... | | MA | 018... | 15 GREN... | 2011-05-10 03:24:50.796-04 | 00:00:00.031 | 00:00:07.75