You'll often hear the term planner statistics thrown around by database geeks: "Did you update your statistics?" This lingo isn't even limited
to PostgreSQL, but is part and parcel of how most decent databases work. For example, in PostgreSQL you do a VACUUM ANALYZE to update your planner statistics in addition
to cleaning up dead space. In SQL Server you do an UPDATE STATISTICS. In MySQL you do an
ANALYZE TABLE or a more invasive OPTIMIZE TABLE.
Normally all this "update your stats so your planner can be happy" business is unnecessary unless
you just did a bulk load or a bulk delete, or you notice your queries suddenly slowing down. Most databases update these statistics behind the scenes
on an as-needed basis.
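If you do want to refresh statistics by hand in PostgreSQL, you can do it per table; the table name below is just a placeholder:

-- refresh planner statistics and reclaim dead space for one table
VACUUM ANALYZE sometable;
-- or refresh statistics only, without the vacuum step
ANALYZE sometable;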
What makes SQL really interesting, and a bit different from procedural languages, is that it is declarative (like functional and logical programming languages) and relies on the database planner to come up with strategies for navigating the data. Its strategy is not fixed as it is in procedural languages.
A big part of this strategy is decided by the query planner, which looks at the distributions of the data. Given different WHERE conditions on similar queries, the planner can come up with vastly different strategies if one value occurs significantly
more often in a table than another. This is also the mystery of why it sometimes refuses to use an index on a field: it has decided a table scan is more efficient. It is also why some people consider HINTS evil, since they pollute the declarative nature of the language.
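You can watch the planner make these choices with EXPLAIN. For example, assuming an index exists on land_type (an assumption for illustration), comparing a common value against a rare one may produce different plans:

EXPLAIN SELECT * FROM assets.land WHERE land_type = 'park';
EXPLAIN SELECT * FROM assets.land WHERE land_type = 'vacant land';

Depending on the distribution, the first might come back as a sequential scan while the second uses the index.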
So what do these stat things look like exactly? In PostgreSQL you can see them by interrogating the
pg_stats view. The pg_stats view exists in all currently supported versions of PostgreSQL (7.4+); we're not sure if it existed in prior versions.
Below is a sample query against one of our test databases, along with the output.
SELECT attname As colname, n_distinct,
    array_to_string(most_common_vals, E'\n') AS common_vals,
    array_to_string(most_common_freqs, E'\n') As dist_freq
FROM pg_stats
WHERE schemaname = 'assets' AND tablename = 'land';
colname         | n_distinct | common_vals       | dist_freq
----------------+------------+-------------------+-----------
pid | -1 | |
land_name | -1 | |
land_type | 13 | park | 0.104587
: college : 0.0990826
: highschool : 0.0899083
: hospital : 0.0899083
: 2 family : 0.0862385
: 3 family : 0.0825688
: elementary school : 0.0788991
: commercial : 0.0770642
: government : 0.0752294
: condo : 0.0697248
: police station : 0.0623853
: 1 family : 0.0458716
: vacant land : 0.0385321
the_geom | -1 | |
land_type_other | -0.166972 | |
So the above tells us that land_type has 13 distinct values, with park the most common value in this table (about 10 percent of the sampled
data), and that pid, land_name, and the_geom are fairly distinct (an n_distinct of -1 means every sampled value was unique).
Keep in mind the sample generally does not include all the records, and the sampling can be indirectly controlled per column
with ALTER TABLE tablename ALTER COLUMN column_name SET STATISTICS some_number.
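For example, to raise the statistics target for a single column and then resample (the target of 200 here is just an illustration; the new target only takes effect after the next ANALYZE):

ALTER TABLE assets.land ALTER COLUMN land_type_other SET STATISTICS 200;
ANALYZE assets.land;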
For columns with no specific statistics target set, you can check the default with:

SHOW default_statistics_target;
The statistics setting controls the number of items that can be stored in the most_common_vals, most_common_freqs, and histogram_bounds
arrays for each column, and raising it indirectly increases the number of records sampled. Prior to 8.4 this defaulted to 10, though it could be increased by changing the setting in the postgresql.conf file. In 8.4
the default was increased to 100.
Observe what happens to our data set when we up the statistics target as high as we can for land_type_other and