Unpivoting data in PostgreSQL

A while ago we demonstrated how to create cross tabulation tables using tablefunc extension aka (Pivot Tables) (basically collapsing rows into columns). Recently someone asked me how to do the reverse (convert columns to rows). He found a solution to the problem here: http://stackoverflow.com/questions/1128737/unpivot-and-postgresql using a combination of array and unnest. That approach is very similar to SQL Server's built-in Pivot SQL predicate. The solution seemed nice enough except similar to the SQL Server Unpivot, it required knowing the column names beforehand so very hard to genericize. So would it be possible to accomplish this feat without knowing the columns names (except for the key) and be able to do it with one SQL statement. I realized that the PostgreSQL hstore extension fit the bill nicely. In this article I'll demonstrate both approaches by creating a view using both.

Installing Hstore

If you don't have hstore installed and are running PostgreSQL 9.1+, you can use the handy

CREATE EXTENSION hstore;

For lower versions, look for the hstore.sql file in share/contrib and run in your database.

Load Our data

For this exercise, I'm going to grab data the Gazetteer zipcode tabulation areas http://www.census.gov/geo/maps-data/data/gazetteer2010.html

First create table to hold the data:

CREATE TABLE zcta5 (zip char(5) primary key, pop10 integer, hu10 integer
 , aland numeric(16,2), awater numeric(16,2)
 , aland_sqmi numeric(16,2), awater_sqmi numeric(16,2)
 , intptlong numeric(8,5), intptlat numeric(8,5));
 

Load the data with psql:

 --in psql
 \cd /Gaz_zcta_national
 \copy zcta5 FROM Gaz_zcta_national.txt DELIMITER E'\t' CSV HEADER

The Hstore unpivot view

We create a view that renders our table unpivoted:

CREATE VIEW vw_zcta_unpivot_hs
AS 
SELECT zip, (h).key, (h).value As val
 FROM (SELECT zip, each(hstore(foo) - 'zip'::text) As h
 FROM zcta5 as foo  ) As unpiv ;

Take it for a test drive by outputting just one zip

 --10 ms uses index too
-- returns 8 rows
SELECT * FROM vw_zcta_unpivot_hs WHERE zip = '02109';

  zip  |     key     |    val
-------+-------------+-----------
 02109 | hu10        | 2462
 02109 | aland       | 449654.00
 02109 | pop10       | 3771
 02109 | awater      | 292691.00
 02109 | intptlat    | -71.05063
 02109 | intptlong   | 42.36722
 02109 | aland_sqmi  | 0.17
 02109 | awater_sqmi | 0.11
 --output the whole thing  3,360ms (264960 records) --
SELECT * from vw_zcta_unpivot_hs;

 -- no network effects-- 
-- 290ms
SELECT count(*) from vw_zcta_unpivot_hs;

The thing that is kind of cool about this approach is that if I add another column to my table, I don't need to redo the view for it to output that new column as a row. To demonstrate

ALTER TABLE zcta5 ADD COLUMN  hu_income10 numeric(12,2);
SELECT * FROM vw_zcta_unpivot_hs WHERE zip = '02109';
  zip  |     key     |    val
-------+-------------+-----------
 02109 | hu10        | 2462
 02109 | aland       | 449654.00
 02109 | pop10       | 3771
 02109 | awater      | 292691.00
 02109 | intptlat    | -71.05063
 02109 | intptlong   | 42.36722
 02109 | aland_sqmi  | 0.17
 02109 | awater_sqmi | 0.11
 02109 | hu_income10 |

The array unpivot view

The second approach is to use array and unnest. As mentioned, the thing I didn't like about this approach is that it's not dynamic - you have to itemize the columns. It also on the downside requires you to cast all the column values before hand if they are not all the same type. It does have a couple of benefits:

To test performance I ran this:

Create the view

 CREATE OR REPLACE VIEW vw_zcta_unpivot_ary
 AS 
SELECT zip
 , unnest(
    array['hu10', 'aland', 'pop10'
     , 'awater', 'intptlat', 'intptlong', 'aland_sqmi', 'awater_sqmi'] 
    ) AS key
 , unnest(
    array[hu10::text, aland::text, pop10::text
      , awater::text, intptlat::text, intptlong::text
      , aland_sqmi::text, awater_sqmi::text] 
    ) AS val
 FROM zcta5;

Test some queries

The single record one takes same time and uses index but shorter plan:

--where filter  10 ms --
SELECT * 
FROM vw_zcta_unpivot_ary
WHERE zip = '02109';

Output the whole view

 -- network effects --
-- 3120 ms  (264960 records)
SELECT * from vw_zcta_unpivot_ary;
 -- no network effects --
-- 200 ms 
SELECT count(*) FROM vw_zcta_unpivot_ary;

So there you have it, two very different approaches for accomplishing the same task.