Postgres OnLine Journal - Entries from December 2007

Monday, December 31. 2007

Trojan SQL Function Hack - A PL Lemma in Disguise

Have you ever noticed that in PostgreSQL you can put set returning functions in the SELECT part of an sql statement if the function is written in language SQL or C. Try the same trick for PL written functions such as plpgsql, plperl, plr etc, and you get a slap on the wrist of the form ERROR: set-valued function called in context that cannot accept a set. For Plpgsql and other PL languages you must put the set returning function in the FROM clause.

Below is a simple example:


--Build test data
CREATE TABLE test
(
  test_id serial NOT NULL,
  test_date date,
  CONSTRAINT pk_test PRIMARY KEY (test_id)
)
WITH (OIDS=FALSE);

INSERT INTO test(test_date)
	SELECT current_date + n
		FROM generate_series(1,1000) n;
		
--test function with sql
CREATE OR REPLACE FUNCTION fnsqltestprevn(id integer, lastn integer)
  RETURNS SETOF test AS
$$
	SELECT *
	FROM test 
	WHERE test_id < $1 ORDER BY test_id
	LIMIT $2
$$
  LANGUAGE 'sql' VOLATILE;
  
--Test example 1 works fine
SELECT (fnsqltestprevn(6,5)).*;

--Test example 2 works fine
SELECT *
	FROM fnsqltestprevn(6,5);
	
--Same test function written as plpgsql
CREATE OR REPLACE FUNCTION fnplpgsqltestprevn(id integer, prevn integer)
  RETURNS SETOF test AS
$$
DECLARE
    rectest test;
BEGIN
	FOR rectest 
		IN(SELECT * 
			FROM test 
			WHERE test_id < id 
			ORDER BY test_id LIMIT prevn)  
								LOOP
		RETURN NEXT rectest;
	END LOOP;
END;
$$
LANGUAGE 'plpgsql' VOLATILE;

--Test example 1 - gives error
-- ERROR:  set-valued function called in context that cannot accept a set
SELECT (fnplpgsqltestprevn(6,5)).*;

--Test example 2 works fine
SELECT *
	FROM fnplpgsqltestprevn(6,5);

So it appears that PostgreSQL is not quite as democratic as we would like.


--But what if we did this?
CREATE OR REPLACE FUNCTION fnsqltrojtestprevn(id integer, prevn integer)
	RETURNS SETOF test AS
$$
	SELECT * FROM fnplpgsqltestprevn($1, $2);
$$
LANGUAGE 'sql' VOLATILE;

--Test example 1 - works fine
SELECT (fnsqltrojtestprevn(6,5)).*;

--Test example 2 works fine
SELECT *
	FROM fnsqltrojtestprevn(6,5);

All interesting, but so what? you may ask. It is bad practice to put set returning functions in a SELECT clause. Such things are commonly mistakes and should be avoided.

Functional Row Expansion

It turns out that there are a whole class of problems in SQL where the simplest way to achieve the desired result is via a technique we shall call Functional Row Expansion. By that, we mean that for each record in a given set, we want to return another set of records that can not be expressed as a constant join expression. Basically the join expression is different for each record or the function we want to apply is too complicated to be expressed as a static join statement or join at all.

Taking the above example. Lets say for each record in test, you want to return the 4 records preceding including the current one. So basically you want to explode each row into 5 or fewer rows. Your general gut reaction would be do something as follows:

these give error: ERROR: function expression in FROM cannot refer to other relations of same query level

	
		SELECT test.test_id As ref_id, test.test_date as ref_date, targ.*
			FROM test , 
				(SELECT tinner.* 
					FROM test as tinner 
						WHERE tinner.test_id <= test.test_id 
					ORDER BY tinner.test_id LIMIT 5) As targ;
					
		SELECT test.test_id As ref_id, test.test_date as ref_date, targ.*
			FROM test,fnsqltrojtestprevn(test.test_id, 5) As targ;

--But this does what you want

	
SELECT test.test_id As ref_id, test.test_date as ref_date, 
	(fnsqltrojtestprevn(test.test_id, 5)).*
  FROM test

Keep in mind what makes the above tricky is that you want to return at most 4 of the preceding plus current. If you want to return all the preceding plus current, then you can do a trivial self join as follows:

	
SELECT test.test_id As ref_id, test.test_date as ref_date, targ.*
  FROM test INNER JOIN
	 test As targ ON targ.test_id <= test.test_id
  ORDER BY test.test_id, targ.test_id

So as you can see - its sometimes tricky to tell when you need to use this technique and when you don't.

For this trivial example, writing the function as an SQL only function works fine and is the best to use. SQL functions unfortunately lack the ability to define dynamic sql statements, among other deficiencies so resorting to using a pl language is often easier which means you lose this useful feature of sql functions. Stuffing a pl function in an SQL function just might do the trick. We haven't tried this on other pl languages except plpgsql, but we suspect it should work the same.

Posted by Leo Hsu and Regina Obe in advanced, pl programming, plpgsql, sql functions at 17:33 | Comments (2) | Trackbacks (0)

Friday, December 28. 2007

Stored Procedures in PostgreSQL

Printer Friendly

Recommended Books: Fundamentals of Database Design

Question: Does PostgreSQL support stored procedures?

Short Answer: Sort Of as Stored functions.

Longer Answer:
By strict definition it does not. PostgreSQL as of even 8.3 will not support the Create Procedure syntax nor the Call Level calling mechanism that defines a bonafide stored procedure supporting database (this is not entirely true), since EnterpriseDB does suport CREATE PROCEDURE to be compatible with Oracle. In PostgreSQL 8.4, this may change. Check out Pavel Stehule: Stacked Recordset and Pavel Stehule: First Real Procedures on PostgreSQL for details.

For all intents and purposes, PostgreSQL has less of a need for CREATE PROCEDURE than other databases aside from looking more like other databases. For example in SQL Server -> 2005 - although you can write functions that return tables and so forth, you have to resort to writing CLR functions marked as unsafe to actually update data in a stored function. This gets pretty messy and has its own limitations so you have no choice but to use a stored procedures, which can not be called from within an SQL query. In MySQL 5.1 the abilities of functions are even more limiting - they can't even return a dataset. In PostgreSQL, you can write a function marked as VOLATILE that updates data and that can do all sorts of wacky things that are useful but considered by some to be perverse such as the following:


SELECT rule_id, rule_name, fnprocess_rule(rule_id) As process_result
FROM brules
WHERE brules.category = 'Pay Employees' 
ORDER BY brules.rule_order

Another thing stored procedures can usually do that functions can not is to return multiple result sets. PostgreSQL can simulate such behavior by creating a function that returns a set of refcursors. See this .NET example Getting full results in a DataSet object: Using refcursors way down the page, that demonstrates creating a postgresql function that returns a set of refcursors to return multiple result sets using the Npgsql driver.

Prior to PostgreSQL 8.1, people could yell and scream, but PostgreSQL doesn't support Output Parameters. As weird as it is for a function to support such a thing, PostgreSQL 8.1+ do support output parameters and ODBC drivers and such can even use the standard CALL interface to grab those values.

At a glance it appears that PostgreSQL functions do all that stored procedures do plus more. So the question is, is there any reason for PostgreSQL to support bonafide stored procedures aside from the obvious To be more compatible with other databases and not have to answer the philosophical question, But you really don't support stored procedures?.

There must be some efficiency benefits to declaring something as a store procedure and having it called in that way. Not quite sure if anyone has done benchmarks on that. So for the time being PostgreSQL functions have the uncanny role of having a beak like a duck and the flexibility of a beaver, but having the makeup of a Platypus.

Posted by Leo Hsu and Regina Obe in intermediate, mysql, oracle, q&a, sql server at 07:48 | Comments (23) | Trackbacks (0)

Thursday, December 27. 2007

CrossTab Queries in PostgreSQL using tablefunc contrib

Printer Friendly

The generic way of doing cross tabs (sometimes called PIVOT queries) in an ANSI-SQL database such as PostgreSQL is to use CASE statements which we have documented in the article What is a crosstab query and how do you create one using a relational database?.

In this particular issue, we will introduce creating crosstab queries using PostgreSQL tablefunc contrib.

Installing Tablefunc

Tablefunc is a contrib that comes packaged with all PostgreSQL installations - we believe from versions 7.4.1 up (possibly earlier). We will be assuming the one that comes with 8.2 for this exercise. Note in prior versions, tablefunc was not documented in the standard postgresql docs, but the new 8.3 seems to have it documented at http://www.postgresql.org/docs/8.3/static/tablefunc.html.

Often when you create crosstab queries, you do it in conjunction with GROUP BY and so forth. While the astute reader may conclude this from the docs, none of the examples in the docs specifically demonstrate that and the more useful example of crosstab(source_sql,category_sql) is left till the end of the documentation.

To install tablefunc simply open up the share\contrib\tablefunc.sql in pgadmin and run the sql file. Keep in mind that the functions are installed by default in the public schema. If you want to install in a different schema - change the first line that reads
SET search_path = public;

Alternatively you can use psql to install tablefunc using something like the following command:
path\to\postgresql\bin\psql -h localhost -U someuser -d somedb -f "path\to\postgresql\share\contrib\tablefunc.sql"

We will be covering the following functions

crosstab(source_sql, category_sql)
crosstab(source_sql)
Tricking crosstab to give you more than one row header column
Building your own custom crosstab function similar to the crosstab3, crosstab4 etc. examples
Adding a total column to crosstab query

There are a couple of key points to keep in mind which apply to both crosstab functions.

Source SQL must always return 3 columns, first being what to use for row header, second the bucket slot, and third is the value to put in the bucket.
crosstab except for the example crosstab3 ..crosstabN versions return unknown record types. This means that in order to use them in a FROM clause, you need to either alias them by specifying the result type or create a custom crosstab that outputs a known type as demonstrated by the crosstabN flavors. Otherwise you get the common a column definition list is required for functions returning "record" error.
A corrollary to the previous statement, it is best to cast those 3 columns to specific data types so you can be guaranteed the datatype that is returned so it doesn't fail your row type casting.
Each row should be unique for row header, bucket otherwise you get unpredictable results

Setting up our test data

For our test data, we will be using our familiar inventory, inventory flow example. Code to generate structure and test data is shown below.

CREATE TABLE inventory
(
  item_id serial NOT NULL,
  item_name varchar(100) NOT NULL,
  CONSTRAINT pk_inventory PRIMARY KEY (item_id),
  CONSTRAINT inventory_item_name_idx UNIQUE (item_name)
)
WITH (OIDS=FALSE);

CREATE TABLE inventory_flow
(
  inventory_flow_id serial NOT NULL,
  item_id integer NOT NULL,
  project varchar(100),
  num_used integer,
  num_ordered integer,
  action_date timestamp without time zone 
  	NOT NULL DEFAULT CURRENT_TIMESTAMP,
  CONSTRAINT pk_inventory_flow PRIMARY KEY (inventory_flow_id),
  CONSTRAINT fk_item_id FOREIGN KEY (item_id)
      REFERENCES inventory (item_id) 
      ON UPDATE CASCADE ON DELETE RESTRICT
)
WITH (OIDS=FALSE);

CREATE INDEX inventory_flow_action_date_idx
  ON inventory_flow
  USING btree
  (action_date)
  WITH (FILLFACTOR=95);

INSERT INTO inventory(item_name) VALUES('CSCL (g)');
INSERT INTO inventory(item_name) VALUES('DNA Ligase (ul)');
INSERT INTO inventory(item_name) VALUES('Phenol (ul)');
INSERT INTO inventory(item_name) VALUES('Pippette Tip 10ul');


INSERT INTO inventory_flow(item_id, project, num_ordered, action_date)
	SELECT i.item_id, 'Initial Order', 10000, '2007-01-01'
		FROM inventory i;
		
--Similulate usage
INSERT INTO inventory_flow(item_id, project, num_used, action_date)
	SELECT i.item_id, 'MS', n*2, 
		'2007-03-01'::timestamp + (n || ' day')::interval + ((n + 1) || ' hour')::interval
		FROM inventory As i CROSS JOIN generate_series(1, 250) As n
		WHERE mod(n + 42, i.item_id) = 0;
		
INSERT INTO inventory_flow(item_id, project, num_used, action_date)
	SELECT i.item_id, 'Alzheimer''s', n*1, 
		'2007-02-26'::timestamp + (n || ' day')::interval + ((n + 1) || ' hour')::interval
		FROM inventory as i CROSS JOIN generate_series(50, 100) As n
		WHERE mod(n + 50, i.item_id) = 0;
		
INSERT INTO inventory_flow(item_id, project, num_used, action_date)
	SELECT i.item_id, 'Mad Cow', n*i.item_id, 
		'2007-02-26'::timestamp + (n || ' day')::interval + ((n + 1) || ' hour')::interval
		FROM inventory as i CROSS JOIN generate_series(50, 200) As n
		WHERE mod(n + 7, i.item_id) = 0 AND i.item_name IN('Pippette Tip 10ul', 'CSCL (g)');

vacuum analyze;

Using crosstab(source_sql, category_sql)

For this example we want to show the monthly usage of each inventory item for the year 2007 regardless of project. The crosstab we wish to achieve would have columns as follows: item_name, jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec

--Standard group by aggregate query before we pivot to cross tab
--This we use for our source sql
 SELECT i.item_name::text As row_name, to_char(if.action_date, 'mon')::text As bucket, 
 		SUM(if.num_used)::integer As bucketvalue
	FROM inventory As i INNER JOIN inventory_flow As if 
		ON i.item_id = if.item_id
	WHERE (if.num_used <> 0 AND if.num_used IS NOT NULL)
	  AND action_date BETWEEN date '2007-01-01' and date '2007-12-31 23:59'
	GROUP BY i.item_name, to_char(if.action_date, 'mon'), date_part('month', if.action_date)
	ORDER BY i.item_name, date_part('month', if.action_date);

--Helper query to generate lowercase month names - this we will use for our category sql	
SELECT to_char(date '2007-01-01' + (n || ' month')::interval, 'mon') As short_mname 
		FROM generate_series(0,11) n;

--Resulting crosstab query --Note: For this we don't need the order by month since the order of the columns is determined by the category_sql row order


SELECT mthreport.*
	FROM 
	crosstab('SELECT i.item_name::text As row_name, to_char(if.action_date, ''mon'')::text As bucket, 
		SUM(if.num_used)::integer As bucketvalue
	FROM inventory As i INNER JOIN inventory_flow As if 
		ON i.item_id = if.item_id
	  AND action_date BETWEEN date ''2007-01-01'' and date ''2007-12-31 23:59''
	GROUP BY i.item_name, to_char(if.action_date, ''mon''), date_part(''month'', if.action_date)
	ORDER BY i.item_name', 
	'SELECT to_char(date ''2007-01-01'' + (n || '' month'')::interval, ''mon'') As short_mname 
		FROM generate_series(0,11) n')
		As mthreport(item_name text, jan integer, feb integer, mar integer, 
			apr integer, may integer, jun integer, jul integer, 
			aug integer, sep integer, oct integer, nov integer, 
			dec integer)

The output of the above crosstab looks as follows:
crosstab source_sql cat_sql example

Using crosstab(source_sql)

crosstab(source_sql) is much trickier to understand and use than the crosstab(source_sql, category_sql) variant, but in certain situations and certain cases is faster and just as effective. The reason why is that crosstab(source_sql) is not guaranteed to put same named buckets in the same columns especially for sparsely populated data. For example - lets say you have data for CSCL for Jan Mar Apr and data for Phenol for Apr. Then Phenols Apr bucket will be in the same column as CSCL Jan's bucket. This in most cases is not terribly useful and is confusing.

To skirt around this inconvenience one can write an SQL statement that guarantees you have a row for each permutation of Item, Month by doing a cross join. Below is the above written so item month usage fall in the appropriate buckets.

	
	--Code to generate the row tally - before crosstab
	SELECT i.item_name::text As row_name, i.start_date::date As bucket, 
			SUM(if.num_used)::integer As bucketvalue
		FROM (SELECT inventory.*,  
			  date '2007-01-01' + (n || ' month')::interval As start_date,
			  date '2007-01-01' + ((n + 1) || ' month')::interval +  - '1 minute'::interval As end_date
			FROM inventory CROSS JOIN generate_series(0,11) n) As i 
				LEFT JOIN inventory_flow As if 
		ON (i.item_id = if.item_id AND if.action_date BETWEEN i.start_date AND i.end_date)
	GROUP BY i.item_name, i.start_date
	ORDER BY i.item_name, i.start_date;

	
	--Now we feed the above into our crosstab query to achieve the same result as 
	--our crosstab(source, category) example 
	SELECT mthreport.*
	FROM crosstab('SELECT i.item_name::text As row_name, i.start_date::date As bucket, 
			SUM(if.num_used)::integer As bucketvalue
		FROM (SELECT inventory.*,  
			  date ''2007-01-01'' + (n || '' month'')::interval As start_date,
			  date ''2007-01-01'' + ((n + 1) || '' month'')::interval +  - ''1 minute''::interval As end_date
			FROM inventory CROSS JOIN generate_series(0,11) n) As i 
				LEFT JOIN inventory_flow As if 
		ON (i.item_id = if.item_id AND if.action_date BETWEEN i.start_date AND i.end_date)
	GROUP BY i.item_name, i.start_date
	ORDER BY i.item_name, i.start_date;') 
		As mthreport(item_name text, jan integer, feb integer, 
			mar integer, apr integer, 
			may integer, jun integer, jul integer, aug integer, 
			sep integer, oct integer, nov integer, dec integer)

In actuality the above query if you have an index on action_date is probably more efficient for larger datasets than the crosstab(source, category) example since it utilizes a date range condition for each month match.

There are a couple of situations that come to mind where the standard behavior of crosstab of not putting like items in same column is useful. One example is when its not necessary to distiguish bucket names, but order of cell buckets is important such as when doing column rank reports. For example if you wanted to know for each item, which projects has it been used most in and you want the column order of projects to be based on highest usage. You would have simple labels like item_name, project_rank_1, project_rank_2, project_rank_3 and the actual project names would be displayed in project_rank_1, project_rank_2, project_rank_3 columns.

	
SELECT projreport.*
	FROM crosstab('SELECT i.item_name::text As row_name, 
		if.project::text As bucket, 
		if.project::text As bucketvalue
	FROM inventory  i 
			LEFT JOIN inventory_flow As if 
	ON (i.item_id = if.item_id)
	WHERE if.num_used > 0
GROUP BY i.item_name, if.project
ORDER BY i.item_name, SUM(if.num_used) DESC, if.project') 
	As projreport(item_name text, project_rank_1 text, project_rank_2 text, 
			project_rank_3 text)

Output of the above looks like:
Example crosstab column rank report

Tricking crosstab to give you more than one row header column

Recall we said that crosstab requires exactly 3 columns output in the sql source statement. No more and No less. So what do you do when you want your month crosstab by Item, Project, and months columns. One approach is to stuff more than one Item in the item slot by either using a delimeter or using an Array. We shall show the array approach below.


SELECT mthreport.row_name[1] As project, mthreport.row_name[2] As item_name,
	jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec
	FROM 
	crosstab('SELECT ARRAY[if.project::text, i.item_name::text] As row_name,
		to_char(if.action_date, ''mon'')::text As bucket, SUM(if.num_used)::integer As bucketvalue
	FROM inventory As i INNER JOIN inventory_flow As if 
		ON i.item_id = if.item_id
	  AND action_date BETWEEN date ''2007-01-01'' and date ''2007-12-31 23:59''
	  WHERE if.num_used <> 0
	GROUP BY if.project, i.item_name, to_char(if.action_date, ''mon''), 
		date_part(''month'', if.action_date)
	ORDER BY if.project, i.item_name', 
	'SELECT to_char(date ''2007-01-01'' + (n || '' month'')::interval, ''mon'') As short_mname 
		FROM generate_series(0,11) n')
		As mthreport(row_name text[], jan integer, feb integer, mar integer, 
			apr integer, may integer, jun integer, jul integer, 
			aug integer, sep integer, oct integer, nov integer, 
			dec integer)

Result of the above looks as follows:
crosstab with multi row header column

Building your own custom crosstab function

If month tabulations are something you do often, you will quickly become tired of writing out all the months. One way to get around this inconvenience - is to define a type and crosstab alias that returns the well-defined type something like below:


CREATE TYPE tablefunc_crosstab_monthint AS
   (row_name text[],jan integer, feb integer, mar integer, 
	apr integer, may integer, jun integer, jul integer, 
	aug integer, sep integer, oct integer, nov integer, 
	dec integer);
	  
CREATE OR REPLACE FUNCTION crosstabmonthint(text, text)
  RETURNS SETOF tablefunc_crosstab_monthint AS
'$libdir/tablefunc', 'crosstab_hash'
  LANGUAGE 'c' STABLE STRICT;

Then you can write the above query as


SELECT mthreport.row_name[1] As project, mthreport.row_name[2] As item_name,
	jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec
	FROM 
	crosstabmonthint('SELECT ARRAY[if.project::text, i.item_name::text] As row_name, to_char(if.action_date, ''mon'')::text As bucket, 
SUM(if.num_used)::integer As bucketvalue
	FROM inventory As i INNER JOIN inventory_flow As if 
		ON i.item_id = if.item_id
	  AND action_date BETWEEN date ''2007-01-01'' and date ''2007-12-31 23:59''
	  WHERE if.num_used <> 0
	GROUP BY if.project, i.item_name, to_char(if.action_date, ''mon''), date_part(''month'', if.action_date)
	ORDER BY if.project, i.item_name', 
	'SELECT to_char(date ''2007-01-01'' + (n || '' month'')::interval, ''mon'') As short_mname 
		FROM generate_series(0,11) n')
		As mthreport;

Adding a Total column to the crosstab query

Adding a total column to a crosstab query using crosstab function is a bit tricky. Recall we said the source sql should have exactly 3 columns (row header, bucket, bucketvalue). Well that wasn't entirely accurate. The crosstab(source_sql, category_sql) variant of the function allows for a source that has columns row_header, extraneous columns, bucket, bucketvalue. Don't get extraneous columns confused with row headers. They are not the same and if you try to use it as we did for creating multi row columns, you will be leaving out data. For simplicity here is a fast rule to remember.
Extraneous column values must be exactly the same for source rows that have the same row header and they get inserted right before the bucket columns.

We shall use this fact to produce a total column.


--This we use for our source sql
 SELECT i.item_name::text As row_name, 
 	(SELECT SUM(sif.num_used) 
		FROM inventory_flow sif 
			WHERE action_date BETWEEN date '2007-01-01' and date '2007-12-31 23:59'
				AND sif.item_id = i.item_id)::integer As total, 
				to_char(if.action_date, 'mon')::text As bucket, 
 		SUM(if.num_used)::integer As bucketvalue
	FROM inventory As i INNER JOIN inventory_flow As if 
		ON i.item_id = if.item_id
	WHERE (if.num_used <> 0 AND if.num_used IS NOT NULL)
	  AND action_date BETWEEN date '2007-01-01' and date '2007-12-31 23:59'
	GROUP BY i.item_name, total, to_char(if.action_date, 'mon'), date_part('month', if.action_date)
	ORDER BY i.item_name, date_part('month', if.action_date);

--This we use for our category sql	
SELECT to_char(date '2007-01-01' + (n || ' month')::interval, 'mon') As short_mname 
		FROM generate_series(0,11) n;
		
--Now our cross tabulation query
SELECT mthreport.*
	FROM crosstab('SELECT i.item_name::text As row_name, 
	(SELECT SUM(sif.num_used) 
		FROM inventory_flow sif 
			WHERE action_date BETWEEN date ''2007-01-01'' and date ''2007-12-31 23:59''
				AND sif.item_id = i.item_id)::integer As total, 
		to_char(if.action_date, ''mon'')::text As bucket, 
		SUM(if.num_used)::integer As bucketvalue
	FROM inventory As i INNER JOIN inventory_flow As if 
		ON i.item_id = if.item_id
	WHERE (if.num_used <> 0 AND if.num_used IS NOT NULL)
	  AND action_date BETWEEN date ''2007-01-01'' and date ''2007-12-31 23:59''
	GROUP BY i.item_name, total, to_char(if.action_date, ''mon''), date_part(''month'', if.action_date)
	ORDER BY i.item_name, date_part(''month'', if.action_date)', 
	'SELECT to_char(date ''2007-01-01'' + (n || '' month'')::interval, ''mon'') As short_mname 
		FROM generate_series(0,11) n'	
	) 
		As mthreport(item_name text, total integer, jan integer, feb integer, 
			mar integer, apr integer, 
			may integer, jun integer, jul integer, aug integer, 
			sep integer, oct integer, nov integer, dec integer)

Resulting output of our cross tabulation with total column looks like this:

If per chance you wanted to have a total row as well you could do it with a union query in your source sql. Unfotunately PostgreSQL does not support windowing functions that would make the row total not require a union. We'll leave that one as an exercise to figure out.

Another not so obvious observation. You can define a type that say returns 20 bucket columns, but your actual crosstab need not return up to 20 buckets. It can return less and whatever buckets that are not specified will be left blank. With that in mind, you can create a generic type that returns generic names and then in your application code - set the heading based on the category source. Also if you have fewer buckets in your type definition than what is returned, the right most buckets are just left off. This allows you to do things like list the top 5 colors of a garment etc.

Posted by Leo Hsu and Regina Obe in contrib spotlight, intermediate, tablefunc at 03:57 | Comments (16) | Trackbacks (2)

Tuesday, December 18. 2007

Using OpenOffice Base 2.3.1 with PostgreSQL

Printer Friendly

Recommended Books: Getting Started with OpenOffice.org 3 written by OO.org group Database Programming with OO Base

For those who are not familiar with OpenOffice Base. OpenOffice Base is the equivalent of Microsoft Access in the OpenOffice Open source suite. While it is not as feature rich as Microsoft Access, it has been getting increasingly better and has some unique features that even Microsoft Access lacks. Unfortuantely you can't just convert an access mdb to its format like you can with other Open office suite products - Word to Writer Writer to Word etc. However you can open MS Access databases in OOBase, but you can't take advantage of the forms and reports in an MS Access Database.

One thing I always liked about Microsoft Access was the ease with which you could link to various different kinds of datasources and generate rapid queries and so forth. Microsoft Access has a particular feature called Access Projects which ties it very closely with Microsoft SQL Server. What an MS Access Project does is connect you with a specific SQL Server database and allow you to browse all the objects, create forms and reports etc against the objects etc. Unfortunately MS Access Project only works with SQL Server. For other datasources you need to use linked tables and can't make design changes and browse a database as you can with Access Projects.

We had looked at Openoffice Base a while ago and thought they are making progress, but still not quite good enough to put to daily use. When we revisited Open Office Base recently, we were surprised to find a couple of neat nuggets.

They now had a native SDBC driver for postgresql instead of having to rely on the jdbc or odbc driver. You can still use the jdbc and odbc bridges, and unfortunately for Mac OSX users, you are stuck using the jdbc driver.
They have this Access Project like feature except it was better than Access in that it worked with other server side dbs. Any that had a driver - e.g. PostgreSQL, MySQL etc.
It had a relational designer viewer similar to what Access had and when we opened up a PostgreSQL db it laid out all the relationships we had carefully defined before with foreign key constraints etc.

In the next couple of sections we'll lay out how to quickly setup OpenOffice, install the native PostgreSQL driver and JDBC PostgreSQL driver and connect to a PostgreSQL database in OpenOffice Base. Please forgive us for using Windows paths in this. We figured it would be easier for people to follow since most users coming to this site are windows users and a lot of Linux users already use OO and paths are too different from Linux/Mac OSX installs.

Continue reading "Using OpenOffice Base 2.3.1 with PostgreSQL"

Posted by Leo Hsu and Regina Obe in beginner, ms access, oobase, product showcase at 07:05 | Comments (17) | Trackbacks (3)

Saturday, December 15. 2007

PostGIS for geospatial analysis and mapping

Printer Friendly

In later issues we'll be covering other PostgreSQL contribs. We would like to start our first issue with introducing, PostGIS, one of our favorite PostgreSQL contribs. PostGIS spatially enables PostgreSQL in an OpenGeospatial Consortium (OGC) compliant way. PostGIS was one reason we started using PostgreSQL way back in 2001 when Refractions released the first version of PostGIS with the objective of providing affordable basic OGC Compliant spatial functionality to rival the very expensive commercial offerings. There is perhaps nothing more powerful in the geospatial world than the succinct expressiveness of SQL married with spatial operators and functions. Together they allow you to manipulate and analyze space with a single sentence. For details on using Postgis and why you would want to, check out the following links

Just as PostgreSQL has grown over the years, so too has PostGIS and the whole FOSS4G ecosystem. PostGIS has benefited from both the FOSS4G and PostgreSQL growths. On the PostgreSQL, improvements such as improved GIST indexing, bitmap indexes etc and on the FOSS4G side dependency projects such as Geos and Proj4, and JTS, as well as more tools and applications being built on top of it.

In 2001 only UMN Mapserver was available to display PostGIS spatial data. As time has passed, UMN Mapserver has grown, and other Mapping software both Commercial and Open Source have come on board that can utilize PostGIS spatial data directly. On the FOSS side there are many, some being UMN Mapserver, GRASS, uDig, QGIS, GDAL/OGR, FeatureServer, GeoServer, SharpMap, ZigGIS for ArcGIS integration, and on the commercial side you have CadCorp SIS, Manifold, MapDotNet, Safe FME Data Interoperability and ETL tools.

In terms of spatial databases, PostGIS is the most capable open source spatial database extender. While MySQL does have some spatial capabilities, its spatial capabilities are extremely limited particularly in the selectivity of the spatial relational functions which are all MBR only, ability to create spatial indexes on non-MyISAM stores, and lack a lot of the OGC compliant functions such as Intersection, Buffering even in its 5.1 product. For details on this check the MySQL 5.1 docs - Spatial Extensions.

When compared with commercial spatial databases, PostGIS has most of the core functions you will see in the commercial databases such as Oracle Spatial, DB2 Spatial Blade, Informix Spatial Blade, has comparable speed, fewer deployment headaches, but lacks some of the advanced add-ons you will find, such as Oracle Spatial network topology model, Raster Support and Geodetic support. Often times the advanced spatial features are add-ons on top of the standard price of the database software.

Some will argue that for example Oracle provides Locator free of charge in their standard and XE versions, Oracle Locator has a limited set of spatial functions. Oracle's Locator is missing most of the core spatial analysis and geometric manipulation functions like centroid, buffering, intersection and spatial aggregate functions; granted it does sport geodetic functionality that PostGIS is currently lacking. To use those non-locator features requires Oracle Spatial and Oracle Enterprise which would cost upwards of $60,000 per processor. Many have heard of SQL Server 2008 coming out and the new spatial features it will sport which will be available in both the express and the full version. One feature that SQL Server 2008 will have that PostGIS currently lacks is Geodetic support (the round world model so to speak). Aside from that SQL Server 2008 has a glarying omission from a current GIS perspective - and that is the ability to transform from one spatial reference system to another directly in the database and is Windows bound so not an option for anyone who needs or is thinking of cross-platform or in a Unix environment. SQL Server 2008 will probably come closest to PostGIS in terms of price / functionality. The express versions of the commercial offerings have many limitations in terms of size of database and usually limited to one processor use. For any reasonably sized deployment in terms of database size, processor utilization, replication, or ISP/Service Provider/Integrator this is not adequate and for any reasonably large deployment that is not receiving manna from heaven, some of the commercial offerings like Oracle Spatial, are not cost-sensible.

Note that in near future versions PostGIS is planning to have geodetic support and does provide basic network topology support via the PgRouting project and there are plans to incorporate network topology as part of PostGIS.

There is a rise in the use of mapping and geospatial analysis in the world and it is moving out of its GIS comfort zone to mingle more with other IT Infrastructure, General Sciences, and Engineering. Mapping and the whole Geospatial industry is not just a tool for GIS specialists anymore. A lot of this rise is driven by the rise of mapping mashups - things like Google Maps, Microsoft Virtual Earth, and Open data initiatives that are introducing new avenues of map sharing and spatial awareness. This new rise is what many refer to as NeoGeography. NeoGeography is still in its infancy; people are just getting over the excitement of seeing dots in their hometown, and are quickly moving into the next level - where more detailed questions are being asked about those dots and dots are no longer sufficient. We want to draw trails such as trail of hurricane destruction, avian bird flu, track our movement with GPS, draw boundaries and measure the densities of these based on some socio-ecological factor and we need to store all that user generated or tool generated information, and have all that transactional goodness, security and ability to query in an easy way that a relational database offers. This is the level where PostGIS and other spatial databases are most useful.

Posted by Leo Hsu and Regina Obe in contrib spotlight, gis, intermediate, mysql, oracle, postgis, sql server at 02:18 | Comments (22) | Trackback (1)

Monday, December 10. 2007

Serendipity Blogging Software

Printer Friendly

Choosing Blogging Software

When we started blogging, we had several criteria for the blogging software we would use.

Easy to install
Easy to use
Couldn't be a service and the database structure needed to be fairly easy to understand because we needed to mesh it seamlessly with the rest of our site.
Had to support PostgreSQL
Preferably open source and based on technology we understood - that meant either ASP.NET or PHP
As far as code goes we are pretty finicky about those things and for PHP we prefer the Smarty Templating system and PHP ADODB over other PHP paradigms. Part of that, not to insult others, was that it was the first approach we found that worked really well for us, so we stuck with it.

We immediately dismissed wordpress because it was MySQL centric, Blogger etc services were out the door as well. There were not that many blogging applications in .NET and most were very SQL Server centric.

We noticed other PostgreSQL bloggers use predominantly Serendipity, so we thought we'd give it a try.

Serendipity met all our requirements except for the PHP ADODB part. It has a database abstraction layer, but it appears to be a custom one. This we could live with. Below are the features we really liked about it.

Works on both Linux and Windows - IIS and Apache
Easy installation. Install process was literally 10 minutes or less
Supports PostgreSQL, MySQL, and SQLite
Underlying database structure was sane
Its underlying templating system is PHP Smarty-Based
At least it had a database abstraction layer. For .NET development we've built our own because all the ones out there went too far in their abstraction to the point of being counter-intuitive and .NET doesn't really have a pre-packaged database abstraction layer to speak of. So this particular choice of decisions was one we could accept.
Fairly intuitive.
Plug-ins galore - in fact most of our time was spent figuring out which plug-ins we wanted to use.
Ability to assign multiple categories to a blog post and threaded categories
Fairly straight-forward theming system

Choosing Plugins

There are some plugins enabled by default, but can't remember which ones. For the most part they are the common ones people would choose if they chose them. These get you pretty far at least to use the software before you realize hey there is other stuff you can turn on or off. Below are some of the ones we found as must haves or things that should think about changing.

Serendipity has plugins broken up into event plugins and side bar plugins. Side bar plugins can be drag and dropped between the left right middle areas, which is a nice convenience. Event plugins are triggered based on Serendipity system events such as blog posts or comment posts and some aren't really events so to speak but aren't side bar plugins either so they show under events.

Event Plug-ins

To WYSIWYG or Not? Serendipity's WYSIWYG setting is set at the user level which is nice since some people like it and some don't and if you have a group blog this is very useful. Personally we don't care for WYSIWYG especially for a site that shows coding snippets. This is not to say that WYSIWYG in Serendipity is not adequate for a lot of people. Just not for us. We never use the WYSIWYG in Visual Studio either. I blame being brought up writing papers in LATEX for this frustration with WYSIWYG.
Markup: NL2BR - If you are going to be writing your own HTML turn this off for blog body. It screws up your nice formatting since it will literally turn each newline into a break when presented. Should probably always have this turned on for comments otherwise people writing out carefully thought out comments will be frustrated when their paragraphs are squashed.
[S]erendipity [P]lugin [A]ccess [R]epository [T]ool [A]nd [C]ustomization/[U]nification [S]ystem (SPARTACUS) - this is a plug-in that allows you to connect to the Serendipity plug-in and update your plug-in repository - kind of like a YUM for Serendipity.
Announce Entries - this is a plug-in that does an XML-RPC ping post to places like technorati, google, ping-o-matic. You can enable and disable which ones you want posted to by default when your entry is published. Within the entry screen, you can selectively uncheck and check them as well for that particular entry

Sidebar Plugins

We haven't played with these too much. The standard default calendar, category, and search were pretty much what we needed starting off. We liked the Wiki Finder and the links to publish to social bookmarking sites as a nice convenience.

Gripes

We also tried this on a virgin install of PostgreSQL 8.3 Beta 3 and it didn't work. Seems to be some logic in the DB layer of serendipity that uses LIKE instead of = against ids and the fact that PostgreSQL 8.3 has taken out a lot of the default CASTS. I think the serendipity code should be changed in this case since from a cursory glance, doesn't quite look right or efficient, but I'm sure there is a good reason they chose to do things that way.

Posted by Leo Hsu and Regina Obe in adodb, mysql, product showcase, serendipity, smarty, sqlite at 04:58 | Comments (8) | Trackback (1)

Friday, December 07. 2007

Database Abstraction with Updateable Views

Printer Friendly

One of the annoying things about PostgreSQL unlike some other databases we have worked with is that simple views are not automatically updateable. There is some work involved to make views updateable. For simple views, this is annoying, but for more complex views it is a benefit to be able to control how things are updated. In a later version of PostgreSQL perhaps 8.4 or 8.5 this will be ratified and PostgreSQL will enjoy the same simplicity of creating simple updateable views currently offered by MySQL and SQL Server and other DBMSs, but still allow for defining how things should be updated for more complex views. For this exercise we are using PostgreSQL 8.2.5, but most of it should work for lower versions with slight modification.

For this exercise, we shall create a fairly complex updateable view to demonstrate how one goes about doing this.

Here is a scenario where being able to control how a view is updated comes in very handy.

We all know relational databases are great because they give you great mobility on how you slice and dice information. At times for data entry purposes, the good old simple flat file is just more user-friendly.

Problem: You are developing an inventory application for a molecular biology lab and they have the following requirements:

They want to keep track of how much of each supply they use for each project grant for funding purposes and report on that monthly or daily.
They want to keep track of how much of each supply they ordered, what they have left and their usage over time.
They however want data entry to be as simple as possible. They want a simple flat file structure to input data that has columns for each project usage and column for purchase quantity.

They have 2 projects going on. One on Multiple Sclerosis Research (MS) and one on Alzheimer's. Each is funded by different grants and for grant cost allocation purposes, they need to keep track of the supplies they use on each project.
How do you present a flat file inventory entry screen, but behind the scenes have a inventory and inventory transaction scheme so you can run period reports and aggregate summaries and have automatic totaling?
Possible Solution: One way to do it is with a crosstab summary view that is updateable. Views are incredibly useful abstraction tools. You do that in PostgreSQL by creating insert, update, and delete rules on your views. For our particular case, we will not be allowing deletion so we will not have a delete rule.

In our system we have 2 tables for simplicity. inventory and inventory_flow. I know we should have a project lookup table or in 8.3 possibly use an ENUM, but to make this short, we are skipping that.


CREATE TABLE inventory
(
  item_id serial NOT NULL,
  item_name varchar(100) NOT NULL,
  CONSTRAINT pk_inventory PRIMARY KEY (item_id),
  CONSTRAINT inventory_item_name_idx UNIQUE (item_name)
)
WITH (OIDS=FALSE);


CREATE TABLE inventory_flow
(
  inventory_flow_id serial NOT NULL,
  item_id integer NOT NULL,
  project varchar(100),
  num_used integer,
  num_ordered integer,
  action_date timestamp without time zone NOT NULL DEFAULT CURRENT_TIMESTAMP,
  CONSTRAINT pk_inventory_flow PRIMARY KEY (inventory_flow_id),
  CONSTRAINT fk_item_id FOREIGN KEY (item_id)
      REFERENCES inventory (item_id) 
      ON UPDATE CASCADE ON DELETE RESTRICT
)
WITH (OIDS=FALSE);



CREATE VIEW vwinventorysummary As 
	SELECT i.item_id, i.item_name, 
SUM(CASE WHEN iu.project = 'Alzheimer''s' 
               THEN iu.num_used ELSE 0 END) As total_num_used_altz, 
	   SUM(CASE WHEN iu.project = 'MS' THEN iu.num_used ELSE 0 END) As total_num_used_ms, 
           CAST(NULL As integer) As add_num_used_altz, 
           CAST(NULL As integer) As add_num_used_ms,
           CAST(NULL As integer) As add_num_ordered, 
	   SUM(COALESCE(iu.num_ordered,0)) - SUM(COALESCE(iu.num_used,0)) As num_remaining
	FROM inventory i LEFT JOIN inventory_flow iu ON i.item_id = iu.item_id
	GROUP BY i.item_id,  i.item_name;
	
	CREATE RULE updinventory AS
		ON UPDATE TO vwinventorysummary
			DO INSTEAD (
				UPDATE inventory 
                                   SET item_name = NEW.item_name WHERE inventory.item_id = NEW.item_id;

				INSERT INTO inventory_flow(item_id, project, num_used, num_ordered)
						SELECT NEW.item_id, 'Alzheimer''s', NEW.add_num_used_altz, 0
						WHERE NEW.add_num_used_altz IS NOT NULL; 
				INSERT INTO inventory_flow(item_id, project, num_used, num_ordered)
						SELECT NEW.item_id, 'MS', NEW.add_num_used_ms, 0
						WHERE NEW.add_num_used_ms IS NOT NULL;
				INSERT INTO inventory_flow(item_id, project, num_used, num_ordered)
						SELECT NEW.item_id, 'Resupply', 0, NEW.add_num_ordered						
				WHERE NEW.add_num_ordered IS NOT NULL;);
	
	CREATE RULE insinventory AS
		ON INSERT TO vwinventorysummary 
		DO INSTEAD ( 
			INSERT INTO inventory (item_name) VALUES (NEW.item_name);
			INSERT INTO inventory_flow (item_id, project, num_used)  
                            SELECT i.item_id AS new_itemid, 'Altzeimer''s', NEW.add_num_used_altz
					FROM inventory i 
					WHERE i.item_name = NEW.item_name 
                                           AND NEW.add_num_used_altz IS NOT NULL;

			INSERT INTO inventory_flow (item_id, project, num_used)  
                          SELECT i.item_id AS new_item_id, 'MS', NEW.add_num_used_ms
				FROM inventory i
                                WHERE i.item_name = NEW.item_name AND
                                         NEW.add_num_used_ms IS NOT NULL;

			INSERT INTO inventory_flow (item_id, project, num_ordered)  
                             SELECT i.item_id AS new_item_id, 'Initial Supply', NEW.add_num_ordered
				FROM inventory i 
					WHERE i.item_name = NEW.item_name AND   
                                          NEW.add_num_ordered IS NOT NULL;
		);

Now look at what happens when we insert and update our view

 --NOTE: here we are using the new multi-row valued insert feature introduced in 8.2
	INSERT INTO vwinventorysummary(item_name, add_num_ordered) 
		VALUES ('Phenol (ml)', 1000), ('Cesium Chloride (g)', 20000),
				('Chloroform (ml)', 10000), ('DNA Ligase (ml)', 100);
				
	UPDATE vwinventorysummary 
                SET add_num_used_ms = 5, add_num_used_altz = 6 WHERE item_name = 'Cesium Chloride (g)';
	UPDATE vwinventorysummary SET add_num_used_ms = 2 WHERE item_name = 'Phenol (ml)';
	UPDATE vwinventorysummary SET item_name = 'CSCL (g)' WHERE item_name = 'Cesium Chloride (g)';

The slick thing about this is if you were to create a linked table in something like say Microsoft Access and designated item_id as the primary key, then the user could simply open up the table and update as normally and behind the scenes the rules would be working to do the right thing.

Posted by Leo Hsu and Regina Obe in advanced, application development at 02:26 | Comment (1) | Trackbacks (2)

Thursday, December 06. 2007

PostgreSQL 8.3 is just around the Corner

Printer Friendly

PostgreSQL 8.3 is currently in Beta 4 and promises to offer some whoppingly neat features. First before we go over the new features we are excited about in this upcoming release, we'd like to briefly cover what was added in past releases.

The big 8.0 Highlights

The biggest feature in 8.0 was native support for Windows
Dollar quoting syntax for stored functions which made it a lot easier to write stored functions since instead of having to escape quotes you could use the $something$ body of function $something$ delimiter approach.
Next favorite was - Tablespaces so that people no longer needed to resort to messy symlinks to simulate this needed feature
Save points, and improved buffer management.

8.1 Highlights

Introduced Bitmap indexes which allowed for ability to use multiple indexes on a table simultaneously and less need for compound indexes.
Auto vacuuming. Auto Vacuuming was a huge benefit that all could appreciate. It allowed for automatic cleaning of dead space without human or scheduled intervention.
Also introduced in this release was improved shared locking and two phase commit
change in security - introducing login roles and group roles plus prevention of dropping roles that owned objects.
Constraint exclusion which improved speed of inherited tables thus improving table partitioning strategies.

8.2 Highlights

Query optimization improvements
8.2 introduced multi row valued lists insert syntax (example here) similar to what MySQL has . As a side note, SQL Server 2008 will introduce a similar feature as row constructors.
Improved indexed creation that no longer required blocking concurrent insert, create, delete.
Ability to remove table inheritance from a child table without having to rebuild it.
Aggregates that can take multiple inputs and SQL:2003 statistical functions
Introduction of Fill Factor for tables similar to Microsoft SQL Server's Fill Factor functionality

8.3 upcoming Highlights

8.3 has numerous highlights just as previous versions, but we shall focus on our favorite ones.

Support Security Service Provider Interface (SSPI) for authentication on Windows - which presumably will allow PostgreSQL databases to enjoy the same single signon you get with Microsoft SQL Server 7-2005.
GSSAPI with Kerberos authentication as a new and improved authentication scheme for single signon
Numerous performance improvements - too many to itemize - check out Stefan Kaltenbrunner's 8.3 vs. 8.2 simple benchmark
The new QUERY functionality in plpgsql which offers a simpler way of returning result sets
Scrollable cursors in PLPgSQL
Improved shared buffers on windows
TSearch - Full Text Search is now integrated into PostgreSQL instead of being a contrib module
Support for SQL/XML and new XML datatype
ENUM datatype
New add-on feature to PgAdmin III - a PL debugger most compliments of EnterpriseDB

Posted by Leo Hsu and Regina Obe in mysql, new in postgresql, sql server at 00:00 | Comment (1) | Trackbacks (0)

Wednesday, December 05. 2007

How does CLUSTER ON improve index performance

Printer Friendly

CLUSTER Basics

One of the features in PostgreSQL designed to enhance index performance is the use of a clustered index. For people coming from MS SQL Server shops, this may look familiar to you and actually serves the same purpose, but is implemented differently and this implementation distinction is very important to understand and be aware of. In PostgreSQL 8.3 the preferred syntax of how you cluster has changed. For details check out 8.3 CLUSTER 8.2 CLUSTER 8.0 CLUSTER. A lot of what I'm going to say is somewhat of a regurgitation of the docs, but in slightly different words.

First in short - clustering on an index forces the physical ordering of the data to be the same as the index order of the index chosen. Since you can have only one physical order of a table, you can have only one clustered index per table and should carefully pick which index you will use to cluster on or if you even want to cluster. Unlike Microsoft SQL Server, clustering on an index in PostgreSQL does not maintain that order. You have to reapply the CLUSTER process to maintain the order. Clustering helps by reducing page seeks. Once an index search is done and found, pulling out the data on the same page is vastly faster since once you find the start point all successive data nearby is easy picking.

As a corrollary to the above, it doesn't help too much for non-range queries. E.g. if you have dummy ids for records and you are just doing single record select queries, clustering is fairly useless to you. It is only really useful if you are doing range queries like between date ranges or spatial ranges or queries where the neighboring data to an index match is likely to be pulled. For example if you have an order items table, then clustering on a compound index such as order_id,order_item_id may prove useful since neighboring data is something you likely want to pull for range and summations.

Now lets see how we create a clustered index and then talk about the pros and gotchas


--First we create the index
CREATE INDEX member_name_idx
  ON member
  USING btree
  (upper(last_name), upper(first_name));

ALTER TABLE member CLUSTER ON member_name_idx;

Once a clustered index is created to force a recluster, you simply do this


CLUSTER member;

To force a cluster on all tables that have clustered indexes, you do this

CLUSTER

What is FillFactor and how does it affect clustering?

Again those coming from Microsoft SQL Server will recognize FILLFACTOR syntax. IBM Informix also has a FILLFACTOR syntax that serves the same purpose as the SQL Server and PostgreSQL ones. For more details here PostgreSQL docs: Create Index. FillFactor basically creates page gaps in an index page. So a Fill Factor of 80 means leave 20% of an index page empty for updates and inserts to the index so minimal reshuffering of existing data needs to happen as new records are added or indexed fields are updated in the system. This is incorporated into the index creation statement.


CREATE INDEX member_name_idx
  ON member
  USING btree
  (upper(last_name), upper(first_name))
  WITH (FILLFACTOR=80);
ALTER TABLE member CLUSTER ON member_name_idx;

-- note we do this to update the planner statistics information. 
Its important since this information helps the planner at selecting indexes and scan approach.
ANALYZE member;

After an index is created on a table, this information is then used in several scenarios

As stated earlier, as new records are added or indexed fields are updated, indexed values are shuffled into the empty slots and when new pages need to be created, they are created with the specified amount of space left blank.
During vacuuming, reindexing again the specified amount of space is left blank
During CLUSTERING, the clustering process tries to leave records where they are and uses the empty space to shuffle in the new data.

Why should you care?

First for fairly static tables such as large lookup tables, that rarely change or when they change are bulk changes, there is little point in leaving blank space in pages. It takes up disk space and causes Postgres to scan thru useless air. In these cases - you basically want to set your FillFactor high to like 99.

Then there are issues of how data is inserted, if you have only one index and new data usually resides at the end of the index and the indexed field are rarely updated, again having a low fill factor is probably not terribly useful even if the data is updated often. You'll never be using that free space so why have it.

For fairly updated data that changes such that you are randomly adding 10% new data per week or so in middle of page, then a fill factor of say 90 is the general rule of thumb.

Cluster approach benefits and Gotchas

The approach PostgreSQL has taken to cluster means that unlike the SQL Server approach, there is no additional penalty during transactions of having a clustered index. It is simply used to physically order the data and all new data goes to the end of the table. In the SQL Server approach all non-clustered indexes are keyed by the clustered index, which means any change to a clustered field requires rebuilding of the index records for the other indexes and also any insert or update may require some amount of physical shuffling. There are also other consequences with how the planner uses this information that are too detailed to get into.

The Bad and the Ugly

The bad is that since there is no additional overhead aside from the usual index key creation during table inserts and updates, you need to schedule reclustering to maintain your fine order and the clustering causes a table lock. The annoying locking hopefully will be improved in later versions. Scheduling a cluster can be done with a Cron Job or the more OS agnostic PgAgent approach. In another issue, we'll cover how to use PgAgent for backup and other scheduling maintenance tasks such as this.

Posted by Leo Hsu and Regina Obe in basics, intermediate, sql server at 04:55 | Comments (7) | Trackbacks (0)

How to create an index based on a function

Printer Friendly

It is often handy to create indexes that are based on calculations of a function. One reason is that instead of storing the calculated value in the table along with the actual value, you save a bit on table scan speed since your row is thinner and also saves some disk space. It helps search speed if its a common function search.

Case in point, PostgreSQL is case sensitive so in order to do a simple search you will often resort to using upper or ILIKE. In those cases its useful to have an index on upper or lower cased text. Here is an example.



CREATE INDEX mem_name_idx
  ON member
  USING btree
  (upper(last_name), upper(first_name));

Here is another example taken from PostGIS land. Often times you provide your data in various transformation, but for space savings and row seek reasons, you want to only transform your data to the less used projections as needed. One way to do this is to create functional indexes on the commonly used transformations and create views or just write raw SQL that uses these alternative transformations.


   CREATE INDEX parcels_idx_geom_nadm  
       ON parcels USING gist(ST_Transform(the_geom, 26986))

So now when I do a select like this that lists all buildings within 100 meters of my NAD 83 MA Meter State Plane point of interest:


    SELECT bldg_name, ST_Transform(the_geom,26986) As newgeom
        FROM buildings 
        WHERE ST_DWithin(ST_Transform(the_geom, 26986) ,ST_GeomFromText('POINT(235675.754215375 894022.495855985)', 26986),  100)

it will use indexes

Posted by Leo Hsu and Regina Obe in intermediate, q&a at 03:34 | Comments (15) | Trackbacks (0)

Monday, December 31. 2007

Functional Row Expansion

Friday, December 28. 2007

Thursday, December 27. 2007

Installing Tablefunc

Setting up our test data

Using crosstab(source_sql, category_sql)

Using crosstab(source_sql)

Tricking crosstab to give you more than one row header column

Building your own custom crosstab function

Adding a Total column to the crosstab query

Tuesday, December 18. 2007

Saturday, December 15. 2007

Monday, December 10. 2007

Choosing Blogging Software

Choosing Plugins

Event Plug-ins

Sidebar Plugins

Gripes

Friday, December 07. 2007

Thursday, December 06. 2007

The big 8.0 Highlights

8.1 Highlights

8.2 Highlights

8.3 upcoming Highlights

Wednesday, December 05. 2007

CLUSTER Basics

What is FillFactor and how does it affect clustering?

Why should you care?

Cluster approach benefits and Gotchas

The Bad and the Ugly

Quicksearch

Calendar

Categories

Archives

Subscribe

Blog Administration