contrib spotlight

Thursday, December 27. 2007

CrossTab Queries in PostgreSQL using tablefunc contrib

The generic way of doing cross tabs (sometimes called PIVOT queries) in an ANSI-SQL database such as PostgreSQL is to use CASE statements, which we have documented in the article "What is a crosstab query and how do you create one using a relational database?"
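As a reminder of what the CASE approach looks like, here is a minimal sketch. The inline VALUES rows and the column names item_name, mon and num_used are made up for illustration and are not the test data used later in this article. VALUES lists in a FROM clause need PostgreSQL 8.2 or later.

```sql
-- Pivot with plain CASE expressions: one output column per bucket.
-- The sample rows below are hypothetical.
SELECT s.item_name,
       SUM(CASE WHEN s.mon = 'jan' THEN s.num_used END) As jan,
       SUM(CASE WHEN s.mon = 'feb' THEN s.num_used END) As feb
  FROM (VALUES ('CSCL', 'jan', 10),
               ('CSCL', 'feb', 20),
               ('Phenol', 'feb', 5)) As s(item_name, mon, num_used)
 GROUP BY s.item_name
 ORDER BY s.item_name;
-- Returns: CSCL | 10 | 20   and   Phenol | NULL | 5
```

Every bucket has to be spelled out by hand as its own CASE expression, which is the tedium the tablefunc functions try to reduce.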

In this particular issue, we will introduce creating crosstab queries using PostgreSQL tablefunc contrib.

Installing Tablefunc

Tablefunc is a contrib that comes packaged with all PostgreSQL installations - we believe from versions 7.4.1 up (possibly earlier). We will be assuming the one that comes with 8.2 for this exercise. Note in prior versions, tablefunc was not documented in the standard postgresql docs, but the new 8.3 seems to have it documented at http://www.postgresql.org/docs/8.3/static/tablefunc.html.

Often when you create crosstab queries, you do it in conjunction with GROUP BY and so forth. While the astute reader may conclude this from the docs, none of the examples in the docs specifically demonstrate that and the more useful example of crosstab(source_sql,category_sql) is left till the end of the documentation.

To install tablefunc simply open up the share\contrib\tablefunc.sql in pgadmin and run the sql file. Keep in mind that the functions are installed by default in the public schema. If you want to install in a different schema - change the first line that reads
SET search_path = public;

Alternatively you can use psql to install tablefunc using something like the following command:
path\to\postgresql\bin\psql -h localhost -U someuser -d somedb -f "path\to\postgresql\share\contrib\tablefunc.sql"

We will be covering the following topics:

crosstab(source_sql, category_sql)
crosstab(source_sql)
Tricking crosstab to give you more than one row header column
Building your own custom crosstab function similar to the crosstab3, crosstab4 etc. examples
Adding a total column to crosstab query

There are a couple of key points to keep in mind which apply to both crosstab functions.

Source SQL must always return 3 columns, first being what to use for row header, second the bucket slot, and third is the value to put in the bucket.
Except for the example crosstab3 .. crosstabN versions, the crosstab functions return unknown record types. This means that in order to use them in a FROM clause, you need to either alias them by specifying the result type or create a custom crosstab that outputs a known type, as demonstrated by the crosstabN flavors. Otherwise you get the common error: a column definition list is required for functions returning "record".
As a corollary to the previous point, it is best to cast those 3 columns to specific data types so that the returned data types are guaranteed and your row type casting does not fail.
Each row should be unique per (row header, bucket) combination; otherwise you get unpredictable results.
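The points above can be seen in a very small example. This is only a sketch: the inline VALUES rows and the names row_name, bucket, bucketvalue, x and y are invented for illustration, and it assumes tablefunc is already installed in a schema on your search_path.

```sql
-- The source sql returns exactly 3 columns, each cast to an explicit type.
-- The category sql returns the bucket names in the desired column order.
-- The "As ct(...)" column definition list supplies the result type.
SELECT ct.*
  FROM crosstab(
    'SELECT s.row_name::text, s.bucket::text, s.bucketvalue::integer
       FROM (VALUES (''item A'', ''x'', 1),
                    (''item A'', ''y'', 2),
                    (''item B'', ''y'', 3)) As s(row_name, bucket, bucketvalue)
      ORDER BY 1',
    'SELECT c.bucket
       FROM (VALUES (''x''::text), (''y'')) As c(bucket)
      ORDER BY 1'
  ) As ct(row_name text, x integer, y integer);
-- Returns: item A | 1 | 2   and   item B | NULL | 3
```

Leaving off the As ct(...) column definition list is what triggers the "column definition list is required" error mentioned above.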

Setting up our test data

For our test data, we will be using our familiar inventory, inventory flow example. Code to generate structure and test data is shown below.

CREATE TABLE inventory
(
  item_id serial NOT NULL,
  item_name varchar(100) NOT NULL,
  CONSTRAINT pk_inventory PRIMARY KEY (item_id),
  CONSTRAINT inventory_item_name_idx UNIQUE (item_name)
)
WITH (OIDS=FALSE);

CREATE TABLE inventory_flow
(
  inventory_flow_id serial NOT NULL,
  item_id integer NOT NULL,
  project varchar(100),
  num_used integer,
  num_ordered integer,
  action_date timestamp without time zone 
  	NOT NULL DEFAULT CURRENT_TIMESTAMP,
  CONSTRAINT pk_inventory_flow PRIMARY KEY (inventory_flow_id),
  CONSTRAINT fk_item_id FOREIGN KEY (item_id)
      REFERENCES inventory (item_id) 
      ON UPDATE CASCADE ON DELETE RESTRICT
)
WITH (OIDS=FALSE);

CREATE INDEX inventory_flow_action_date_idx
  ON inventory_flow
  USING btree
  (action_date)
  WITH (FILLFACTOR=95);

INSERT INTO inventory(item_name) VALUES('CSCL (g)');
INSERT INTO inventory(item_name) VALUES('DNA Ligase (ul)');
INSERT INTO inventory(item_name) VALUES('Phenol (ul)');
INSERT INTO inventory(item_name) VALUES('Pippette Tip 10ul');


INSERT INTO inventory_flow(item_id, project, num_ordered, action_date)
	SELECT i.item_id, 'Initial Order', 10000, '2007-01-01'
		FROM inventory i;
		
--Simulate usage
INSERT INTO inventory_flow(item_id, project, num_used, action_date)
	SELECT i.item_id, 'MS', n*2, 
		'2007-03-01'::timestamp + (n || ' day')::interval + ((n + 1) || ' hour')::interval
		FROM inventory As i CROSS JOIN generate_series(1, 250) As n
		WHERE mod(n + 42, i.item_id) = 0;
		
INSERT INTO inventory_flow(item_id, project, num_used, action_date)
	SELECT i.item_id, 'Alzheimer''s', n*1, 
		'2007-02-26'::timestamp + (n || ' day')::interval + ((n + 1) || ' hour')::interval
		FROM inventory as i CROSS JOIN generate_series(50, 100) As n
		WHERE mod(n + 50, i.item_id) = 0;
		
INSERT INTO inventory_flow(item_id, project, num_used, action_date)
	SELECT i.item_id, 'Mad Cow', n*i.item_id, 
		'2007-02-26'::timestamp + (n || ' day')::interval + ((n + 1) || ' hour')::interval
		FROM inventory as i CROSS JOIN generate_series(50, 200) As n
		WHERE mod(n + 7, i.item_id) = 0 AND i.item_name IN('Pippette Tip 10ul', 'CSCL (g)');

vacuum analyze;

Using crosstab(source_sql, category_sql)

For this example we want to show the monthly usage of each inventory item for the year 2007 regardless of project. The crosstab we wish to achieve would have columns as follows: item_name, jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec

--Standard group by aggregate query before we pivot to cross tab
--This we use for our source sql
 SELECT i.item_name::text As row_name, to_char(if.action_date, 'mon')::text As bucket, 
 		SUM(if.num_used)::integer As bucketvalue
	FROM inventory As i INNER JOIN inventory_flow As if 
		ON i.item_id = if.item_id
	WHERE (if.num_used <> 0 AND if.num_used IS NOT NULL)
	  AND action_date BETWEEN date '2007-01-01' and timestamp '2007-12-31 23:59'
	GROUP BY i.item_name, to_char(if.action_date, 'mon'), date_part('month', if.action_date)
	ORDER BY i.item_name, date_part('month', if.action_date);

--Helper query to generate lowercase month names - this we will use for our category sql	
SELECT to_char(date '2007-01-01' + (n || ' month')::interval, 'mon') As short_mname 
		FROM generate_series(0,11) n;

--Resulting crosstab query --Note: For this we don't need the order by month since the order of the columns is determined by the category_sql row order


SELECT mthreport.*
	FROM 
	crosstab('SELECT i.item_name::text As row_name, to_char(if.action_date, ''mon'')::text As bucket, 
		SUM(if.num_used)::integer As bucketvalue
	FROM inventory As i INNER JOIN inventory_flow As if 
		ON i.item_id = if.item_id
	  AND action_date BETWEEN date ''2007-01-01'' and timestamp ''2007-12-31 23:59''
	GROUP BY i.item_name, to_char(if.action_date, ''mon''), date_part(''month'', if.action_date)
	ORDER BY i.item_name', 
	'SELECT to_char(date ''2007-01-01'' + (n || '' month'')::interval, ''mon'') As short_mname 
		FROM generate_series(0,11) n')
		As mthreport(item_name text, jan integer, feb integer, mar integer, 
			apr integer, may integer, jun integer, jul integer, 
			aug integer, sep integer, oct integer, nov integer, 
			dec integer)

The output of the above crosstab looks as follows:
[Image: crosstab source_sql cat_sql example]

Using crosstab(source_sql)

crosstab(source_sql) is much trickier to understand and use than the crosstab(source_sql, category_sql) variant, but in certain situations it is faster and just as effective. It is trickier because crosstab(source_sql) is not guaranteed to put same-named buckets in the same columns, especially for sparsely populated data. For example, let's say you have data for CSCL for Jan, Mar and Apr, and data for Phenol for Apr only. Then Phenol's Apr bucket will land in the same column as CSCL's Jan bucket. In most cases this is not terribly useful and is confusing.
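The sparse case just described can be reproduced with a few inline rows. This is a sketch with made-up values; the names month_num, slot_1, slot_2 and slot_3 are illustrative, and it assumes tablefunc is installed.

```sql
-- crosstab(source_sql) fills the value columns left to right for each row_name
-- and ignores what the bucket value actually is.
SELECT ct.*
  FROM crosstab(
    'SELECT s.item_name, s.month_num, s.num_used
       FROM (VALUES (''CSCL''::text, 1, 10),
                    (''CSCL'', 3, 30),
                    (''CSCL'', 4, 40),
                    (''Phenol'', 4, 5)) As s(item_name, month_num, num_used)
      ORDER BY 1, 2'
  ) As ct(item_name text, slot_1 integer, slot_2 integer, slot_3 integer);
-- Returns: CSCL | 10 | 30 | 40   and   Phenol | 5 | NULL | NULL
-- Phenol's April value (5) lands in slot_1, the same column as CSCL's January value.
```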

To skirt around this inconvenience one can write an SQL statement that guarantees a row for each combination of item and month by doing a cross join. Below is the earlier monthly usage query rewritten so that each item's monthly usage falls in the appropriate bucket.

	
	--Code to generate the row tally - before crosstab
	SELECT i.item_name::text As row_name, i.start_date::date As bucket, 
			SUM(if.num_used)::integer As bucketvalue
		FROM (SELECT inventory.*,  
			  date '2007-01-01' + (n || ' month')::interval As start_date,
			  date '2007-01-01' + ((n + 1) || ' month')::interval +  - '1 minute'::interval As end_date
			FROM inventory CROSS JOIN generate_series(0,11) n) As i 
				LEFT JOIN inventory_flow As if 
		ON (i.item_id = if.item_id AND if.action_date BETWEEN i.start_date AND i.end_date)
	GROUP BY i.item_name, i.start_date
	ORDER BY i.item_name, i.start_date;

	
	--Now we feed the above into our crosstab query to achieve the same result as 
	--our crosstab(source, category) example 
	SELECT mthreport.*
	FROM crosstab('SELECT i.item_name::text As row_name, i.start_date::date As bucket, 
			SUM(if.num_used)::integer As bucketvalue
		FROM (SELECT inventory.*,  
			  date ''2007-01-01'' + (n || '' month'')::interval As start_date,
			  date ''2007-01-01'' + ((n + 1) || '' month'')::interval +  - ''1 minute''::interval As end_date
			FROM inventory CROSS JOIN generate_series(0,11) n) As i 
				LEFT JOIN inventory_flow As if 
		ON (i.item_id = if.item_id AND if.action_date BETWEEN i.start_date AND i.end_date)
	GROUP BY i.item_name, i.start_date
	ORDER BY i.item_name, i.start_date;') 
		As mthreport(item_name text, jan integer, feb integer, 
			mar integer, apr integer, 
			may integer, jun integer, jul integer, aug integer, 
			sep integer, oct integer, nov integer, dec integer)

In practice, if you have an index on action_date, the above query is probably more efficient for larger datasets than the crosstab(source, category) example, since it uses a date range condition to match each month.

There are a couple of situations that come to mind where the standard behavior of crosstab of not putting like items in the same column is useful. One example is when it's not necessary to distinguish bucket names, but the order of cell buckets is important, such as when doing column rank reports. For example, suppose you wanted to know, for each item, which projects it has been used in most, and you want the column order of projects to be based on highest usage. You would have simple labels like item_name, project_rank_1, project_rank_2, project_rank_3, and the actual project names would be displayed in the project_rank_1, project_rank_2, project_rank_3 columns.

	
SELECT projreport.*
	FROM crosstab('SELECT i.item_name::text As row_name, 
		if.project::text As bucket, 
		if.project::text As bucketvalue
	FROM inventory  i 
			LEFT JOIN inventory_flow As if 
	ON (i.item_id = if.item_id)
	WHERE if.num_used > 0
GROUP BY i.item_name, if.project
ORDER BY i.item_name, SUM(if.num_used) DESC, if.project') 
	As projreport(item_name text, project_rank_1 text, project_rank_2 text, 
			project_rank_3 text)

Output of the above looks like:
[Image: Example crosstab column rank report]

Tricking crosstab to give you more than one row header column

Recall we said that crosstab requires exactly 3 columns output by the source sql statement, no more and no less. So what do you do when you want your month crosstab to have Item, Project, and month columns? One approach is to stuff more than one item in the row header slot by using either a delimiter or an array. We shall show the array approach below.


SELECT mthreport.row_name[1] As project, mthreport.row_name[2] As item_name,
	jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec
	FROM 
	crosstab('SELECT ARRAY[if.project::text, i.item_name::text] As row_name,
		to_char(if.action_date, ''mon'')::text As bucket, SUM(if.num_used)::integer As bucketvalue
	FROM inventory As i INNER JOIN inventory_flow As if 
		ON i.item_id = if.item_id
	  AND action_date BETWEEN date ''2007-01-01'' and timestamp ''2007-12-31 23:59''
	  WHERE if.num_used <> 0
	GROUP BY if.project, i.item_name, to_char(if.action_date, ''mon''), 
		date_part(''month'', if.action_date)
	ORDER BY if.project, i.item_name', 
	'SELECT to_char(date ''2007-01-01'' + (n || '' month'')::interval, ''mon'') As short_mname 
		FROM generate_series(0,11) n')
		As mthreport(row_name text[], jan integer, feb integer, mar integer, 
			apr integer, may integer, jun integer, jul integer, 
			aug integer, sep integer, oct integer, nov integer, 
			dec integer)

Result of the above looks as follows:
[Image: crosstab with multi row header column]
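For comparison, the delimiter alternative mentioned at the start of this section could look something like the following sketch. It assumes a '|' character never appears in a project or item name and that project is never NULL (a NULL project would make the concatenated row header NULL). The delimiter is our choice for illustration only.

```sql
-- Sketch of the delimiter approach: pack project and item_name into one text
-- row header, then unpack the two parts with split_part in the outer query.
SELECT split_part(mthreport.row_name, '|', 1) As project,
	split_part(mthreport.row_name, '|', 2) As item_name,
	jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec
	FROM
	crosstab('SELECT (if.project || ''|'' || i.item_name)::text As row_name,
		to_char(if.action_date, ''mon'')::text As bucket, SUM(if.num_used)::integer As bucketvalue
	FROM inventory As i INNER JOIN inventory_flow As if
		ON i.item_id = if.item_id
	WHERE if.num_used <> 0
	  AND if.action_date >= date ''2007-01-01'' AND if.action_date < date ''2008-01-01''
	GROUP BY if.project, i.item_name, to_char(if.action_date, ''mon'')
	ORDER BY if.project, i.item_name',
	'SELECT to_char(date ''2007-01-01'' + (n || '' month'')::interval, ''mon'') As short_mname
		FROM generate_series(0,11) n')
		As mthreport(row_name text, jan integer, feb integer, mar integer,
			apr integer, may integer, jun integer, jul integer,
			aug integer, sep integer, oct integer, nov integer,
			dec integer);
```

The array version avoids having to pick a safe delimiter, which is one reason to prefer it when the row header values are free-form text.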

Building your own custom crosstab function

If month tabulations are something you do often, you will quickly become tired of writing out all the months. One way to get around this inconvenience - is to define a type and crosstab alias that returns the well-defined type something like below:


CREATE TYPE tablefunc_crosstab_monthint AS
   (row_name text[],jan integer, feb integer, mar integer, 
	apr integer, may integer, jun integer, jul integer, 
	aug integer, sep integer, oct integer, nov integer, 
	dec integer);
	  
CREATE OR REPLACE FUNCTION crosstabmonthint(text, text)
  RETURNS SETOF tablefunc_crosstab_monthint AS
'$libdir/tablefunc', 'crosstab_hash'
  LANGUAGE 'c' STABLE STRICT;

Then you can write the above query as


SELECT mthreport.row_name[1] As project, mthreport.row_name[2] As item_name,
	jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec
	FROM 
	crosstabmonthint('SELECT ARRAY[if.project::text, i.item_name::text] As row_name, to_char(if.action_date, ''mon'')::text As bucket, 
SUM(if.num_used)::integer As bucketvalue
	FROM inventory As i INNER JOIN inventory_flow As if 
		ON i.item_id = if.item_id
	  AND action_date BETWEEN date ''2007-01-01'' and timestamp ''2007-12-31 23:59''
	  WHERE if.num_used <> 0
	GROUP BY if.project, i.item_name, to_char(if.action_date, ''mon''), date_part(''month'', if.action_date)
	ORDER BY if.project, i.item_name', 
	'SELECT to_char(date ''2007-01-01'' + (n || '' month'')::interval, ''mon'') As short_mname 
		FROM generate_series(0,11) n')
		As mthreport;

Adding a Total column to the crosstab query

Adding a total column to a crosstab query using the crosstab function is a bit tricky. Recall we said the source sql should have exactly 3 columns (row header, bucket, bucketvalue). Well, that wasn't entirely accurate. The crosstab(source_sql, category_sql) variant of the function allows for a source that has the columns row_header, extraneous columns, bucket, bucketvalue. Don't confuse extraneous columns with row headers. They are not the same, and if you try to use them as we did for creating multiple row header columns, you will be leaving out data. For simplicity, here is a quick rule to remember.
Extraneous column values must be exactly the same for source rows that have the same row header and they get inserted right before the bucket columns.

We shall use this fact to produce a total column.


--This we use for our source sql
 SELECT i.item_name::text As row_name, 
 	(SELECT SUM(sif.num_used) 
		FROM inventory_flow sif 
			WHERE action_date BETWEEN date '2007-01-01' and timestamp '2007-12-31 23:59'
				AND sif.item_id = i.item_id)::integer As total, 
				to_char(if.action_date, 'mon')::text As bucket, 
 		SUM(if.num_used)::integer As bucketvalue
	FROM inventory As i INNER JOIN inventory_flow As if 
		ON i.item_id = if.item_id
	WHERE (if.num_used <> 0 AND if.num_used IS NOT NULL)
	  AND action_date BETWEEN date '2007-01-01' and timestamp '2007-12-31 23:59'
	GROUP BY i.item_name, total, to_char(if.action_date, 'mon'), date_part('month', if.action_date)
	ORDER BY i.item_name, date_part('month', if.action_date);

--This we use for our category sql	
SELECT to_char(date '2007-01-01' + (n || ' month')::interval, 'mon') As short_mname 
		FROM generate_series(0,11) n;
		
--Now our cross tabulation query
SELECT mthreport.*
	FROM crosstab('SELECT i.item_name::text As row_name, 
	(SELECT SUM(sif.num_used) 
		FROM inventory_flow sif 
			WHERE action_date BETWEEN date ''2007-01-01'' and timestamp ''2007-12-31 23:59''
				AND sif.item_id = i.item_id)::integer As total, 
		to_char(if.action_date, ''mon'')::text As bucket, 
		SUM(if.num_used)::integer As bucketvalue
	FROM inventory As i INNER JOIN inventory_flow As if 
		ON i.item_id = if.item_id
	WHERE (if.num_used <> 0 AND if.num_used IS NOT NULL)
	  AND action_date BETWEEN date ''2007-01-01'' and timestamp ''2007-12-31 23:59''
	GROUP BY i.item_name, total, to_char(if.action_date, ''mon''), date_part(''month'', if.action_date)
	ORDER BY i.item_name, date_part(''month'', if.action_date)', 
	'SELECT to_char(date ''2007-01-01'' + (n || '' month'')::interval, ''mon'') As short_mname 
		FROM generate_series(0,11) n'	
	) 
		As mthreport(item_name text, total integer, jan integer, feb integer, 
			mar integer, apr integer, 
			may integer, jun integer, jul integer, aug integer, 
			sep integer, oct integer, nov integer, dec integer)

Resulting output of our cross tabulation with total column looks like this:

If by chance you wanted to have a total row as well, you could do it with a union query in your source sql. Unfortunately PostgreSQL (as of 8.3) does not support windowing functions that would make the row total not require a union. We'll leave that one as an exercise to figure out.
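For readers who want a starting point, one possible shape for the union approach is sketched below. It is not the only way to do it. The sort_order column and the 'Total' label are our own illustrative choices, and the sketch assumes no inventory item is actually named Total.

```sql
-- Sketch: per-item rows plus one extra "Total" row header built with UNION ALL.
-- sort_order is used only to push the Total row last; it is not returned
-- to crosstab, so the source still has exactly 3 columns.
SELECT mthreport.*
	FROM crosstab('SELECT src.row_name, src.bucket, src.bucketvalue
	FROM (SELECT 0 As sort_order, i.item_name::text As row_name,
			to_char(if.action_date, ''mon'')::text As bucket,
			SUM(if.num_used)::integer As bucketvalue
		FROM inventory As i INNER JOIN inventory_flow As if
			ON i.item_id = if.item_id
		WHERE if.num_used <> 0
		  AND if.action_date >= date ''2007-01-01'' AND if.action_date < date ''2008-01-01''
		GROUP BY i.item_name, to_char(if.action_date, ''mon'')
		UNION ALL
		SELECT 1, ''Total''::text,
			to_char(if.action_date, ''mon'')::text,
			SUM(if.num_used)::integer
		FROM inventory_flow As if
		WHERE if.num_used <> 0
		  AND if.action_date >= date ''2007-01-01'' AND if.action_date < date ''2008-01-01''
		GROUP BY to_char(if.action_date, ''mon'')) As src
	ORDER BY src.sort_order, src.row_name',
	'SELECT to_char(date ''2007-01-01'' + (n || '' month'')::interval, ''mon'') As short_mname
		FROM generate_series(0,11) n')
		As mthreport(item_name text, jan integer, feb integer,
			mar integer, apr integer,
			may integer, jun integer, jul integer, aug integer,
			sep integer, oct integer, nov integer, dec integer);
```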

Another not so obvious observation: you can define a type that, say, returns 20 bucket columns, but your actual crosstab need not return all 20 buckets. It can return fewer, and whatever buckets are not specified will be left blank. With that in mind, you can create a generic type that returns generic names and then, in your application code, set the headings based on the category source. Also, if you have fewer buckets in your type definition than what is returned, the rightmost buckets are just left off. This allows you to do things like list the top 5 colors of a garment.

Posted by Leo Hsu and Regina Obe in contrib spotlight, intermediate, tablefunc at 03:57

Saturday, December 15. 2007

PostGIS for geospatial analysis and mapping


In later issues we'll be covering other PostgreSQL contribs. We would like to start our first issue by introducing PostGIS, one of our favorite PostgreSQL contribs. PostGIS spatially enables PostgreSQL in an Open Geospatial Consortium (OGC) compliant way. PostGIS was one reason we started using PostgreSQL way back in 2001, when Refractions released the first version of PostGIS with the objective of providing affordable, basic OGC-compliant spatial functionality to rival the very expensive commercial offerings. There is perhaps nothing more powerful in the geospatial world than the succinct expressiveness of SQL married with spatial operators and functions. Together they allow you to manipulate and analyze space with a single sentence. For details on using PostGIS and why you would want to, check out the following links

Just as PostgreSQL has grown over the years, so too has PostGIS and the whole FOSS4G ecosystem. PostGIS has benefited from the growth of both FOSS4G and PostgreSQL. On the PostgreSQL side, it has benefited from improvements such as improved GIST indexing and bitmap indexes, and on the FOSS4G side from dependency projects such as Geos, Proj4 and JTS, as well as from more tools and applications being built on top of it.

In 2001 only UMN Mapserver was available to display PostGIS spatial data. As time has passed, UMN Mapserver has grown, and other Mapping software both Commercial and Open Source have come on board that can utilize PostGIS spatial data directly. On the FOSS side there are many, some being UMN Mapserver, GRASS, uDig, QGIS, GDAL/OGR, FeatureServer, GeoServer, SharpMap, ZigGIS for ArcGIS integration, and on the commercial side you have CadCorp SIS, Manifold, MapDotNet, Safe FME Data Interoperability and ETL tools.

In terms of spatial databases, PostGIS is the most capable open source spatial database extender. While MySQL does have some spatial capabilities, they are extremely limited, particularly in the selectivity of the spatial relational functions (which are all MBR only), the inability to create spatial indexes on non-MyISAM stores, and the lack of many OGC-compliant functions such as Intersection and Buffering, even in its 5.1 product. For details on this check the MySQL 5.1 docs - Spatial Extensions.

When compared with commercial spatial databases, PostGIS has most of the core functions you will see in the commercial databases such as Oracle Spatial, DB2 Spatial Blade, Informix Spatial Blade, has comparable speed, fewer deployment headaches, but lacks some of the advanced add-ons you will find, such as Oracle Spatial network topology model, Raster Support and Geodetic support. Often times the advanced spatial features are add-ons on top of the standard price of the database software.

Some will argue that Oracle, for example, provides Locator free of charge in its standard and XE versions, but Oracle Locator has a limited set of spatial functions. Locator is missing most of the core spatial analysis and geometric manipulation functions such as centroid, buffering, intersection and spatial aggregate functions; granted, it does sport geodetic functionality that PostGIS currently lacks. Using those non-Locator features requires Oracle Spatial and Oracle Enterprise, which would cost upwards of $60,000 per processor. Many have heard of the upcoming SQL Server 2008 and the new spatial features it will sport, which will be available in both the express and the full version. One feature that SQL Server 2008 will have that PostGIS currently lacks is geodetic support (the round world model, so to speak). Aside from that, SQL Server 2008 has a glaring omission from a current GIS perspective: the ability to transform from one spatial reference system to another directly in the database. It is also Windows-bound, so it is not an option for anyone who needs or is considering a cross-platform or Unix environment. SQL Server 2008 will probably come closest to PostGIS in terms of price / functionality. The express versions of the commercial offerings have many limitations in terms of database size and are usually limited to one processor. For any reasonably sized deployment in terms of database size, processor utilization, replication, or ISP/Service Provider/Integrator use, this is not adequate, and for any reasonably large deployment that is not receiving manna from heaven, some of the commercial offerings, like Oracle Spatial, are not cost-sensible.

Note that in near future versions PostGIS is planning to have geodetic support and does provide basic network topology support via the PgRouting project and there are plans to incorporate network topology as part of PostGIS.

There is a rise in the use of mapping and geospatial analysis in the world and it is moving out of its GIS comfort zone to mingle more with other IT Infrastructure, General Sciences, and Engineering. Mapping and the whole Geospatial industry is not just a tool for GIS specialists anymore. A lot of this rise is driven by the rise of mapping mashups - things like Google Maps, Microsoft Virtual Earth, and Open data initiatives that are introducing new avenues of map sharing and spatial awareness. This new rise is what many refer to as NeoGeography. NeoGeography is still in its infancy; people are just getting over the excitement of seeing dots in their hometown, and are quickly moving into the next level - where more detailed questions are being asked about those dots and dots are no longer sufficient. We want to draw trails such as trail of hurricane destruction, avian bird flu, track our movement with GPS, draw boundaries and measure the densities of these based on some socio-ecological factor and we need to store all that user generated or tool generated information, and have all that transactional goodness, security and ability to query in an easy way that a relational database offers. This is the level where PostGIS and other spatial databases are most useful.

Posted by Leo Hsu and Regina Obe in contrib spotlight, gis, intermediate, mysql, oracle, postgis, sql server at 02:18
