You have a large table and the default planner stats collector underestimates distinct counts of a critical query column thus resulting in much less than optimal query plans.
How can you manually set this?
PostgreSQL 9.0 introduced ability to set two settings on table columns: n_distinct and n_distinct_inherited which are described a bit in ALTER TABLE help.
The n_distinct is the estimated number of distinct values for that column
with -1 or any negative number representing a percentage of estimated table count instead of a true count.
n_distinct_inherited is a setting useful for parent tables that denotes the estimated distinct count sum of all a parent's child tables.
By tweaking these settings when they are less than optimal, you can influence the query planner to produce better plans. Why this is necessary is mostly for large
tables where the stat collector will not query the whole table to determine stats. The stats collector generally queries at most 10-30% of a table.
Determine If you need to set counts
It's always nice to have the stat collector do all these things for you especially if you have a table that is constantly updated and distinct counts can fluctuate a lot.
For static tables you may just want to set them manually.
So how do you know whether you should bother or not. Well you can check the current values
the stats collector has with this query:
--determineifyourstatsarefine-compareestimateswithwhatyouknowtheyare--SELECT tablename, schemaname, attname As colname, n_distinct,
array_to_string(most_common_vals, E'\n')AS common_vals,
array_to_string(most_common_freqs, E'\n')As dist_freq
WHERE tablename ='table_of_interest'ORDERBY schemaname, tablename, attname;
What is the difference between CURRENT_TIMESTAMP and clock_timestamp()
CURRENT_TIMESTAMP is an ANSI-SQL Standard variable you will find in many relational databases including PostgreSQL, SQL Server, Firebird, IBM DB2 and MySQL to name a few
that records the start of the transaction. The important thing to keep in mind about it is there is only one entry per transaction so if you have a long running transaction,
you won't be seeing it changing as you go along.
clock_timestamp() is a PostgreSQL function that always returns the current clock's timestamp. I don't think I'm alone in using it for doing simple benchmarking and other things
where for example I need to record the timings of each part of a function within the function using pedestrian RAISE NOTICE debug print statements.
There is another cool way I like using it, and that is for a batch of records each with an expensive function call, benchmarking how long it takes to process each record.
One of the things I'm working on is improving the speed of the tiger_geocoder packaged in PostGIS 2.0. The first root of attack seemed to me would be the normalize_address function
which I was noticing was taking anywhere from 10% to 50% of my time in the geocode process. That's a ton of time if you are trying to batch geocode a ton of records. The thing is
the function is very particular to how badly formed the address is so a whole batch could be held up by one bad apple and since the batch doesn't return until all are processed, it makes
the whole thing seem to take a while.
So rather than looping thru each, I thought it would be cool if I could run the batch, but for each record have it tell me how long it took to process relative to the rest so I could get
a sense of what a problem address looks like. So I wrote this query:
the_time -COALESCE(lag(the_time)OVER(ORDERBY the_time), CURRENT_TIMESTAMP)As process_time,
the_time -CURRENT_TIMESTAMPAs diff_from_start
FROM(SELECT address_1, city, state, zip,
pprint_addy(normalize_address(coalesce(address_1,'')||','||coalesce(city ||'','')|| state ||''|| zip))As pp_addr,
FROM testgeocode LIMIT 1000)As foo )SELECT*FROM ctbenchmark
WHERE process_time >'00:00:00.016'::interval;
Which returned an output something like this:
address_1 | city | state | zip | pp_addr | the_time | process_time | diff_from_start
48 MAIN ST .. | S.. | MA | 021.. | 48 MAIN .. | 2011-05-10 03:24:43.078-04 | 00:00:00.032 | 00:00:00.032
15 ... | | MA | 018... | 15 GREN... | 2011-05-10 03:24:50.796-04 | 00:00:00.031 | 00:00:07.75
We've started to play with PostgreSQL 9.1beta and the PgAdmin III 1.14.0 Beta 1. We'll briefly go over the cool gems
found in PgAdmin III beta 1. Most of the new features are for navigating the upcoming PostgreSQL 9.1. Well first obstacle we ran into
was we can't get our favorite extension, PostGIS, to compile against PostgreSQL 9.1beta though it did with the alphas, so you won't be seeing any windows experimental builds until we resolve this issue.
Details of ticket here? PostGIS 2.0 won't compile for PostgreSQL 9.1 beta1
Despite that minor set back, we decided to push on and navigate the new features by using PgAdmin III 1.14.0 as our Tour Guide. Below is a list of new features you can experience
via PgAdmin III 1.14.0 Beta 1. I'm sure there are more we missed, but these are the ones that were most flashing.