Friday, June 03. 2011
Since others have shared their PostgreSQL Conference notes on Planet PostgreSQL:
Josh Berkus,
Blake Crosby,
Selena Deckelmann,
Dimitri Fontaine,
Bruce Momjian,
Gabrielle Roth,
Andreas Scherbaum,
and Greg Smith,
I thought I'd do my civic duty and add Paul Ramsey's notes to the mix. His are on his corporate OpenGeo blog which is carried by Planet Geospatial
and Planet OSGeo but not by Planet PostgreSQL.
One thing I admire about Paul is how easily he lets himself be changed by his environment. You sometimes have to be careful what you say to him, since he can take your comments a little too much to heart and change a bit more than you had intended. At any rate, here are his notes: PgCon Notes #1,
PgCon Notes #2, PgCon Notes #3.
Sunday, May 22. 2011
Question: How do you deal with bad stat counts?
You have a large table, and the planner's default stats collection underestimates the distinct count of a critical query column, resulting in far-from-optimal query plans.
How can you set this manually?
PostgreSQL 9.0 introduced the ability to set two settings on table columns: n_distinct and n_distinct_inherited, which are described a bit in the ALTER TABLE help.
n_distinct is the estimated number of distinct values for that column, with a negative number interpreted as a fraction of the estimated table row count rather than an absolute count (so -1 means every row is distinct and -0.5 means about half the rows are).
n_distinct_inherited is a setting useful for parent tables that denotes the estimated distinct count summed across all of a parent's child tables.
By tweaking these settings when the collected values are off, you can influence the query planner to produce better plans. This is necessary mostly for large tables, where the stats collector will not scan the whole table to determine stats; it generally samples at most 10-30% of a table.
Determining if you need to set counts
It's always nice to have the stats collector do all this for you, especially for a table that is constantly updated and whose distinct counts fluctuate a lot. For fairly static tables, you may just want to set the values manually.
So how do you know whether you should bother? You can check the current values the stats collector has with this query:
SELECT tablename, schemaname, attname As colname, n_distinct,
array_to_string(most_common_vals, E'\n') AS common_vals,
array_to_string(most_common_freqs, E'\n') As dist_freq
FROM pg_stats
WHERE tablename = 'table_of_interest'
ORDER BY schemaname, tablename, attname;
You would then compare against your actuals:
SELECT count(DISTINCT column_of_interest) FROM table_of_interest;
This gives you the true current count.
Setting n_distinct and n_distinct_inherited
You may want to bump the computed value up or down depending on how you expect the table to grow. To set the column distinct-count stats, do something like the below, replacing 50 with the count you computed:
ALTER TABLE table_of_interest
ALTER COLUMN column_of_interest
SET (n_distinct=50);
ALTER TABLE table_of_interest
ALTER COLUMN column_of_interest
SET (n_distinct_inherited=50);
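Note that these overrides only take effect at the next ANALYZE of the table. A quick sanity check (a sketch reusing the pg_stats query from above) might look like:
ANALYZE table_of_interest;
-- confirm the override is now what the planner will see
SELECT tablename, attname, n_distinct
FROM pg_stats
WHERE tablename = 'table_of_interest'
  AND attname = 'column_of_interest';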
Tuesday, May 10. 2011
Question: What is the difference between CURRENT_TIMESTAMP and clock_timestamp()?
Answer: CURRENT_TIMESTAMP is an ANSI-SQL standard variable you will find in many relational databases, including PostgreSQL, SQL Server, Firebird, IBM DB2 and MySQL to name a few, that records the start time of the current transaction. The important thing to keep in mind about it is that its value is fixed for the duration of the transaction, so in a long-running transaction you won't see it changing as you go along.
clock_timestamp() is a PostgreSQL function that always returns the current clock time. I don't think I'm alone in using it for simple benchmarking and other tasks where, for example, I need to record the timing of each part of a function within the function using pedestrian RAISE NOTICE debug print statements.
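A quick way to see the difference (output timestamps omitted; the behavior is the point): within a single transaction, CURRENT_TIMESTAMP stays frozen while clock_timestamp() keeps moving.
BEGIN;
SELECT CURRENT_TIMESTAMP AS txn_start, clock_timestamp() AS wall_clock;
SELECT pg_sleep(2);  -- pause for two seconds
-- txn_start is unchanged; wall_clock is about 2 seconds later
SELECT CURRENT_TIMESTAMP AS txn_start, clock_timestamp() AS wall_clock;
COMMIT;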
There is another cool way I like using it: for a batch of records, each involving an expensive function call, benchmarking how long each record takes to process.
One of the things I'm working on is improving the speed of the tiger_geocoder packaged in PostGIS 2.0. The first route of attack seemed to be the normalize_address function, which I noticed was taking anywhere from 10% to 50% of my time in the geocode process. That's a lot of time if you are trying to batch geocode a ton of records. The catch is that the function's speed depends heavily on how badly formed the address is, so a whole batch can be held up by one bad apple, and since the batch doesn't return until all records are processed, the whole thing seems to take a while.
So rather than looping through each record, I thought it would be cool to run the batch but have each record report how long it took to process relative to the rest, so I could get a sense of what a problem address looks like. So I wrote this query:
WITH ctbenchmark
AS
(SELECT *,
the_time - COALESCE(lag(the_time) OVER(ORDER BY the_time), CURRENT_TIMESTAMP) As process_time,
the_time - CURRENT_TIMESTAMP As diff_from_start
FROM (SELECT address_1, city, state, zip,
pprint_addy(normalize_address(coalesce(address_1,'') || ', ' || coalesce(city || ' ','') || state || ' ' || zip)) As pp_addr,
clock_timestamp() As the_time
FROM testgeocode LIMIT 1000) As foo )
SELECT *
FROM ctbenchmark
WHERE process_time > '00:00:00.016'::interval;
This returned output something like the following:
 address_1        | city | state | zip    | pp_addr    | the_time                   | process_time | diff_from_start
------------------+------+-------+--------+------------+----------------------------+--------------+-----------------
 48 MAIN ST ..    | S..  | MA    | 021..  | 48 MAIN .. | 2011-05-10 03:24:43.078-04 | 00:00:00.032 | 00:00:00.032
 15 ...           |      | MA    | 018... | 15 GREN... | 2011-05-10 03:24:50.796-04 | 00:00:00.031 | 00:00:07.75
Friday, May 06. 2011
We've started to play with PostgreSQL 9.1beta and PgAdmin III 1.14.0 Beta 1. We'll briefly go over the cool gems found in PgAdmin III Beta 1. Most of the new features are for navigating the upcoming PostgreSQL 9.1. The first obstacle we ran into was that we can't get our favorite extension, PostGIS, to compile against PostgreSQL 9.1beta (though it did against the alphas), so you won't be seeing any Windows experimental builds until we resolve this issue.
Details of the ticket here: PostGIS 2.0 won't compile for PostgreSQL 9.1 beta1
Despite that minor setback, we decided to push on and navigate the new features using PgAdmin III 1.14.0 as our tour guide. Below is a list of new features you can experience via PgAdmin III 1.14.0 Beta 1. I'm sure there are more we missed, but these are the ones that were most eye-catching.
Continue reading "Navigating PostgreSQL 9.1 beta 1 with PgAdmin III 1.14.0 Beta 1"
Wednesday, March 30. 2011
I am happy to report that the final proof of the PostGIS in Action E-Book got released today, and the printed version is scheduled for release April 11, 2011; it should be available on Amazon and other outlets around then. The other e-Reader formats will come after that.
You can buy it from here or download the two free chapters, if you haven't already.
Each hard-copy purchase comes with a free E-Book version; there is a coupon in the back of the book for redeeming it.
Yes, I know it's been a really really long time.
On the bright side, we produced twice as much content as we had set out to, and that was while keeping things as concise as we could get away with, still managing to cover more than we had planned, and stripping out as many unnecessary words as we could muster.
So 520 pages and almost 2 years later, this is where we are.
A good chunk of the book's additional bulk went into the appendices, which total about 150 pages and focus strictly on PostgreSQL and SQL. After many comments from early reviewers, we thought it unfair not to include a good chunk of PostgreSQL and general relational database content to familiarize programmers and GIS folks with the RDBMS that PostGIS lives in. Most GIS folks, unfortunately, have the hardest time getting up to speed with SQL and standard RDBMS management.
Two free chapters and accompanying code for all chapters
We picked these two free chapters because we thought they would be most beneficial to newcomers and people new to relational databases.
So the free chapters are:
- Chapter 1: What is a spatial database? Provides a fast-paced history of PostGIS, PostgreSQL, and spatial databases, then moves into an even faster journey of converting flat-file restaurant locations to spatial point geometries and loading an ESRI shapefile of roads. It then shows you how to write standard spatial queries and render the results.
- Appendix C: SQL Primer -- goes through querying the information_schema, the common points of writing SELECT, INSERT, UPDATE, and DELETE SQL statements, and the finer points of using aggregate functions, window constructs, and common table expressions, as well as a brief overview of how PostgreSQL stacks up against other relational databases (SQL Server, Oracle, IBM DB2, MySQL, Firebird) in SQL features.
- All the chapter code and accompanying data. It's a bit hefty at 57 MB.
So even if you don't buy our book, we hope you find the free chapters useful.
You can get a more detailed listing of all the chapters from the PostGIS in Action book site.
We'd like to thank all those who supported us through this long and unpredictable journey. Hopefully we'll have several more, though a bit less nerve-racking than this first one.
Saturday, February 19. 2011
Question: You have a system of products and categories, and you want a product to be allowed in multiple categories but in only one main category.
How do you enforce this rule in the database?
Some people will ask: why can't you just deal with this in your application logic? Our general reason is that much of our updating doesn't happen at the application level. We like enforcing rules at the database level because it saves us from ourselves. We are in the business of massaging data, and for this particular kind of example we wanted the database to provide a nice safety net so that we couldn't accidentally assign a product to two main categories.
Answer
There are two approaches we thought of. The obvious one is a primary-category column plus a bridge table holding the secondary categories. That is an ugly solution, because every query has to do a union and always treat the secondary categories differently from the main one, and for most use cases we don't care about distinguishing primary from secondary categories.
The solution we finally settled on was a single bridge table with a boolean flag marking whether a row is the main category, enforcing the only-one-main-category requirement with a partial index. Not all databases support partial indexes; this is one major value of using PostgreSQL -- you have so many more options for implementing logic.
As some people noted in the comments and the reddit entry, SQL Server 2008 has a similar feature called filtered indexes, though PostgreSQL has had partial indexes for as far back as I can remember. This brings up an interesting point I have observed: if you were using PostgreSQL before, you would already know how to use the filtered indexes and multi-row inserts introduced in SQL Server 2008, and the SEQUENCE feature coming in the next version of SQL Server. So we should all use PostgreSQL, because it teaches us how to use the newer versions of SQL Server before they come out. :)
So how does the partial index solution look? NOTE: for simplicity, we are leaving out all the complementary tables and the foreign key constraints that we also have in place.
CREATE TABLE products_categories
(
category_id integer NOT NULL,
product_id integer NOT NULL,
main boolean NOT NULL DEFAULT false,
orderby integer NOT NULL DEFAULT 0,
CONSTRAINT products_categories_pkey PRIMARY KEY (category_id, product_id)
);
CREATE UNIQUE INDEX idx_products_categories_primary
ON products_categories
USING btree
(product_id)
WHERE main = true;
Testing it out, it saves us and gives us a nice informative message to boot.
INSERT INTO products_categories(category_id, product_id, main)
VALUES (1,2,true), (3,2,false), (3,3,true), (4,2,true);
ERROR: duplicate key value violates unique constraint "idx_products_categories_primary"
DETAIL: Key (product_id)=(2) already exists.
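With the partial index in place, pulling each product's main category is a simple flag filter (a quick sketch using the same table; the index guarantees at most one such row per product):
-- each product's main category
SELECT product_id, category_id AS main_category_id
FROM products_categories
WHERE main = true;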
Monday, December 20. 2010
In Part 2 of PL/R we covered how to build PL/R functions that take arrays and output textual renderings of generated R objects, and we then used this in an aggregate SQL query via array_agg. Often when you are building PL/R functions, you'll have R functions that you want to reuse many times, either inside a single PL/R function or across various PL/R functions.
Unfortunately, calling a PL/R function from another PL/R function is not possible unless you do it via an spi.execute call.
There is another way to embed reusable R code in a PostgreSQL database. To share database-stored R code across various PL/R functions, PL/R has a feature called plr_module. In this tutorial we'll learn how to create and register shareable R functions with plr_module. In the next part of this series we'll start to explore generating graphs with PL/R.
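As a taste of what's coming, here is a minimal sketch based on the plr_modules support table described in the PL/R docs (the R function here is a made-up example):
-- PL/R sources each modsrc entry, in modseq order, into the R
-- interpreter when it initializes in a session
CREATE TABLE plr_modules (
  modseq int4,
  modsrc text
);
INSERT INTO plr_modules
  VALUES (1, 'cv <- function(x) sd(x)/mean(x)');
-- make sessions whose R interpreter is already loaded pick up changes
SELECT reload_plr_modules();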
Continue reading "PL/R Part 3: Sharing Functions across PL/R functions with plr_module"
Friday, December 10. 2010
In Intro to PL/R and R, we covered how to enable the PL/R language in the database and wrote some PL/R functions that rendered plain-text reports using the R environment. What makes combining R and PostgreSQL in PL/R most powerful is when you can start writing SQL summary queries that use R functions like any other SQL function.
In this next example, we'll be using PostGIS test runs from tests we autogenerated from the official PostGIS documentation (Documentation Driven Testing (DDT)), as described in the Garden Test section of the PostGIS Developer wiki.
We've also updated some of our logging generator and test patterns, so future results may not match what we demonstrated in the last article.
On a side note: among the changes in the tests was introducing more variants of the empty geometry now supported by PostGIS 2.0.
Our beloved PostGIS 2.0 trunk is at the moment somewhat unstable when working with these new forms of emptiness and with geometries stuffed in inappropriate places. At the moment it doesn't survive the mindless machine-gun battery of tests we have mercilessly inflicted on it.
It's been great fun trying to build a better dummy while watching Paul run around patching holes to make the software more dummy-proof, as the dummy stumbles across questionable but amusing PostGIS use cases not gracefully handled by his new serialization and empty-geometry logic.
On yet another side note, it's nice to see that others are doing similarly wonderful things with documentation. Check out Euler's comment on catalog tables, where he uses the PostgreSQL SGML documentation to autogenerate PostgreSQL catalog table comments, using OpenJade's osx to convert the SGML to XML and then XSL -- similar to what we did with the PostGIS documentation to autogenerate PostGIS function/type comments and as a platform for our test generator.
For our next exercises, we'll be using the power of aggregation to push data into R instead of pg.spi.execute. This will make our functions far more reusable and versatile.
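To give a flavor of the pattern (a sketch; the function, table, and column names here are invented for illustration): a PL/R function that takes an array can be fed by array_agg in a grouped query, so R sees each group's values as a single vector.
-- in PL/R, an unnamed argument arrives in R as arg1
CREATE OR REPLACE FUNCTION r_median(float8[])
RETURNS float8 AS $$
  median(arg1)
$$ LANGUAGE 'plr';

-- push each group's values into R in a single call per group
SELECT test_group, r_median(array_agg(elapsed_secs)) AS median_secs
FROM test_results
GROUP BY test_group;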
Continue reading "PL/R Part 2: Functions that take arguments and the power of aggregation"
Monday, October 11. 2010
One of the neat changes already present in PgAdmin III 1.13dev is the change in plugin architecture. Version 1.13dev allows for multiple plugin*.ini files. How does this work? If you have a plugins.d folder in your PgAdmin III version folder, PgAdmin reads all the .ini files in that folder and loads them as plugins.
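For instance, dropping a file along these lines into plugins.d registers a loader menu item (the field names follow the plugins.ini format covered in our earlier plugin-registration post; the file name, paths, and values here are illustrative):
; plugins.d/postgisgui.ini -- hypothetical example
Title=PostGIS Shapefile and DBF loader
Command="$$PGBINDIR\postgisgui\shp2pgsql-gui.exe" -h "$$HOSTNAME" -p $$PORT -U "$$USERNAME" -d "$$DATABASE"
Description=Open the PostGIS shapefile and dbf loader
KeyFile=$$PGBINDIR\postgisgui\shp2pgsql-gui.exe
Platform=windows
ServerType=postgresql
Database=Yes
SetPassword=Yes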
Recall that in PgAdmin III Plug-in Registration: PostGIS Shapefile and DBF Loader, we demonstrated how to load the PostGIS shapefile and dbf loader as a plugin in PgAdmin III. This time we will demonstrate how to do it using PgAdmin version 1.13. Better yet, we'll show you the new and improved PgAdmin III shapefile and DBF loader in the works for PostGIS 2.0.
Continue reading "PgAdmin III 1.13 - change in plugin architecture and PostGIS Plugins"