PostgreSQL 8.4 Faster array building with array_agg

One of the very handy features introduced in PostgreSQL 8.4 is the new aggregate function called array_agg which is a companion function to the unnest function we discussed earlier. This takes a set of elements similar to what COUNT, SUM etc do and builds an array out of them. This approach is faster than the old used array_append , array_accum since it does not rebuild the array on each iteration.

Sadly it does not appear to be completely swappable with array_append as there does not seem to be a mechanism to use it to build your own custom aggregate functions that need to maintain the set of objects flowing thru the aggregate without venturing into C land. This we tried to do in our median example but were unsuccessful.

In PostGIS 1.4 Paul borrowed some of this array_agg logic to make the PostGIS spatial aggregates much much faster with large numbers of geometries. So collecting polygons or making a line out of say 30,000 geometries which normally would have taken 2 minutes or more (just accumulating), got reduced to under 10 seconds in many cases. That did require C code even when installed against PostgreSQL 8.4. Though in PostGIS you reap the benefits as far as geometries go even if you are running lower than 8.4.

We had originally thought array_agg was a PostgreSQL only creation, but it turns out that array_agg is a function defined in the ANSI SQL:2008 specs and for one appears to exist in IBM DB2 as well. I don't think Oracle or any other database supports it as of yet.

As we had demonstrated in the other article, we shall demonstrate the olden days and what array_agg brings to the table to make your life easier.

In the olden days prior to 8.4, when you wanted to get the list of employees that reports to a supervisor as an array and also the salaries of these people as an array so you could pass it to a funky R-statistical function to do interesting calculations with it, you would have had one of two choices.

As mentioned the aggregate solution is personally nicer and elegant if its a function you call often, but was slow when accumulating large numbers of records in each grouping. Below is a very contrived example that uses our favorite system catalog to demonstrate its use and we are doing it because we are too lazy to create dummy data and this query will work in all databases. In the below for each table in our database we are getting two sets of arrays -- one that has the columns in the table and another that has the distinct types of columns in each table.

--The olden days
--Note this is not a terribly interesting example since an array_accum 
-- built with array_append would work fine here as the arrays don't get that big for tables
--but just imagine we had something that had 10,000 elements in each grouping
SELECT t.table_schema, t.table_name, 
 ARRAY(SELECT CAST(c.column_name AS text) 
        FROM information_schema.columns AS c
    WHERE c.table_schema = t.table_schema AND c.table_name = t.table_name) as col_names,
  ARRAY(SELECT DISTINCT CAST(c.data_type AS text) 
            FROM information_schema.columns AS c
        WHERE c.table_schema = t.table_schema 
        AND c.table_name = t.table_name) As dat_types
 FROM information_schema.tables AS t
GROUP BY t.table_schema, t.table_name;


--The 8.4 way
SELECT c.table_schema, c.table_name, 
 array_agg(CAST(c.column_name AS text)) as col_names,
 array_agg(DISTINCT CAST(c.data_type AS text)) As dat_types
 FROM information_schema.columns AS c
GROUP BY c.table_schema, c.table_name;