String Aggregation in PostgreSQL, SQL Server, and MySQL

Friday, December 24. 2010

String Aggregation in PostgreSQL, SQL Server, and MySQL

Question: You have a table of people and a table that specifies the activities each person is involved in. You want to return a result that has one record per person and a column that has a listing of activities for each person separated by semicolons and alphabetically sorted by activity. You also want the whole set alphabetically sorted by person's name.

This is a question we are always asked and since we mentor on various flavors of databases, we need to be able to switch gears and provide an answer that works on the client's database. Most often the additional requirement is that you can't install new functions in the database. This means that for PostgreSQL/SQL Server that both support defining custom aggregates, that is out as an option.

Normally we try to come up with an answer that works in most databases, but sadly the only solution that works in most is to push the problem off to the client front end and throw up your hands and proclaim -- "This ain't something that should be done in the database and is a reporting problem." That is in fact what many database purists do, and all I can say to them is wake up and smell the coffee before you are out of a job. We feel that data transformation is an important function of a database, and if your database is incapable of massaging the data into a format your various client apps can easily digest, WELL THAT's A PROBLEM.

We shall now document this answer rather than trying to answer for the nteenth time. For starter's PostgreSQL has a lot of answers to this question, probably more so than any other, though some are easier to execute than others and many depend on the version of PostgreSQL you are using. SQL Server has 2 classes of answers neither of which is terribly appealing, but we'll go over the ones that don't require you to be able to install .NET stored functions in your database since we said that is often a requirement. MySQL has a fairly simple, elegant and very portable way that it has had for a really long time.

Is the new way always better?

For PostgreSQL and SQL Server, we'll demonstrate more than one approach. Just because we demonstrate newer approaches doesn't make the older approaches obsolete or necessarily worse. Don't feel the need to use something new because it's new. Think of the newer approaches as just extra options that may or may not be faster for your use-case.

Test Table

Below is the script to setup our test tables. This script will work fine for all databases we will discuss here.

CREATE TABLE people(p_name varchar(75) PRIMARY KEY);
INSERT INTO people(p_name) VALUES('Jack');
INSERT INTO people(p_name) VALUES('Jill');
INSERT INTO people(p_name) VALUES('Humpty');
INSERT INTO people(p_name) VALUES('Dopey');

CREATE TABLE people_activities(p_name varchar(75) NOT NULL, 
    activity varchar(100) NOT NULL);
    
ALTER TABLE people_activities 
    ADD CONSTRAINT pk_people_activities 
        PRIMARY KEY (p_name,activity); 
        
INSERT INTO people_activities(p_name, activity)
SELECT 'Jack', 'Hiking'
UNION ALL
SELECT 'Jack', 'Skiing'
UNION ALL
SELECT 'Jack', 'Fishing'
UNION ALL
SELECT 'Jill', 'Hiking'
UNION ALL
SELECT 'Jill', 'Bear Hunting'
UNION ALL
SELECT 'Humpty', 'Tumbling'
UNION ALL
SELECT 'Humpty', 'Seating'
UNION ALL
SELECT 'Humpty', 'Cracking';

-- we want our output to look like --
 p_name |           activities
--------+--------------------------------
 Dopey  |
 Humpty | Cracking;Seating;Tumbling
 Jack   | Fishing;Hiking;Skiing
 Jill   | Bear Hunting;Hiking

The PostgreSQL ways

In PostgreSQL 9.0, PostgreSQL introduced the STRING_AGG function and also the ORDER BY predicate in any aggregate function, which made this task much easier and probably much more efficient as well. Note that in Oracle 11G Release 2, Oracle introduced an aggregate function call LISTAGG which is similar. Details of using LISTAGG and Oracle LISTAGG, the older WM_CONCAT and other string aggregation approaches for Oracle are described in Oracle String Aggregation Techniques.

-- PostgreSQL 9.0 way --
SELECT p.p_name, 
    STRING_AGG(a.activity, ';' ORDER BY a.activity) As activities
FROM people AS p 
    LEFT JOIN people_activities As a ON (p.p_name = a.p_name)
GROUP BY p.p_name
ORDER BY p.p_name;

PostgreSQL 8.4 introduced the ANSI-SQL array_agg function which IBM DB2 also has. Not sure about other databases. This provided another option for doing a string aggregation. Without the ORDER BY feature for aggregates introduced in PostgreSQL 9.0, this option is less powerful.

-- PostgreSQL 8.4+ --
-- This you should use with caution if you care about order
-- It seems to work but well that is probably by accident rather than design
SELECT p.p_name, 
    array_to_string(ARRAY_AGG(a.activity),';') As activities
FROM people AS p 
    LEFT JOIN 
     (SELECT * 
      FROM people_activities 
       ORDER BY p_name, activity) As a ON (p.p_name = a.p_name)
GROUP BY p.p_name
ORDER BY p.p_name;

Here is the PostgreSQL 8.4+ equivalent of a third option that people mentioned is available for SQL Server - the Recursive Window CTE.

-- This is the PostgreSQL 8.4+ equivalent to the SQL Server Window CTE approach --
WITH RECURSIVE
    t1 AS (
        SELECT
            p.p_name,
            a.activity,
            ROW_NUMBER() OVER(PARTITION BY a.p_name ORDER BY a.activity) AS rn
        FROM people AS p
        LEFT JOIN people_activities AS a ON a.p_name = p.p_name
    ),
    t2 AS (
        SELECT
            t1.p_name,
            CAST(t1.activity AS varchar) AS activities,
            t1.rn
        FROM t1
        WHERE t1.rn = 1
        UNION ALL
        SELECT
            t2.p_name,
            t2.activities || ';' || t1.activity,
            t1.rn
        FROM t2
        INNER JOIN t1 ON
            t1.p_name = t2.p_name AND
            t1.rn = t2.rn + 1
    )
SELECT
    t2.p_name,
    COALESCE(t2.activities, '')
FROM t2
INNER JOIN (
    SELECT
        t1.p_name,
        MAX(t1.rn) AS max_rn
    FROM t1
    GROUP BY t1.p_name
) AS t ON
    t.p_name = t2.p_name AND
    t.max_rn = t2.rn
ORDER BY t2.p_name;

One thing that PostgreSQL has always had going for it that both MySQL and SQL Server still lack is phenomenal array support. Arrays are first class citizens in PostgreSQL witha rich set of functions to complement them. They are extremely versatile and make creating things like aggregate functions pretty trivial. For example in MySQL you'd have to become a C programmer to build an aggregate function which is generally out of the reach of most users. In SQL Server you'd have to learn .NET (VB or C#) and write a non-trivial amount of code which is much more accessible than C but not wihtout its issues. In PostgreSQL all you need to know is SQL and write a couple of lines of code to build your own custom aggregate functions. That is just SUPER COOL.

-- All versions of PostgreSQL
SELECT p.p_name, 
    array_to_string(ARRAY(
        SELECT activity 
        FROM people_activities AS a 
            WHERE a.p_name = p.p_name ORDER BY a.activity) ,';') As activities
FROM people AS p 
ORDER BY p.p_name;

SQL Server Ways

SQL Server has a fairly cryptic way of doing this which one would consider a HACK and repurpose of functionality that was designed for something else. On the plus side, we do find its CROSS APPLY and OUTER APPLY functionality which I crave a lot when working with MySQL and a little with PostgreSQL (PostgreSQL has similar but not quite as elegant equivalents). Not so much for this particular use case.

-- SQL Server 2005+  
-- Note this won't work unless you also have SQL Server 2005+ compatibility mode --
-- We haven't done benchmarks to see which approach is faster.
-- use subselect - the stuff is to get rid of the leading ;
-- In theory this should be faster than subselect of SQL 2000
-- Note we use OUTER APPLY here 
-- but for this particular use case CROSS APPLY would work 
-- fine too since one record seem to be returned regardless
-- As a general rule of thumb you should use OUTER APPLY for LEFT JOIN like behavior and 
-- CROSS APPLY as inner join like behavior
SELECT p.p_name, 
       activities
  FROM people AS p OUTER APPLY
    (SELECT STUFF(( SELECT ';' + a.activity
       FROM people_activities AS a
      WHERE a.p_name = p.p_name
      ORDER BY a.activity
        FOR XML PATH('') ), 1,1,'') AS activities ) AS f
  ORDER BY p.p_name;

SQL Server Recursive CTE + Window

Evidentally there is a third way of doing this in SQL Server 2005 that many purists were up in arms about that we neglected to mention.

This is the Window + CTE approach that someone posted on reddit that we revised slightly to be a bit more ANSI-SQL compliant. Note: we listed the PostgrreSQL 8.4+ equivalent of this approach in the PostgreSQL section. The reason many people prefer this approach is that it doesn't leave any XML entities you have to strip with REPLACE. e.g. if you had & in your original text you would have to use REPLACE(final,'&', '&') to strip off the xmlish stuff. The other reason is people felt this was less of a hack. Note: that the PostgreSQL 8.4 variant looks pretty close to SQL Server since both support ANSI SQL Window and CTE. The main differences:

PostgreSQL forces you to put in a RECURSIVE word if any of the subexpressions is RECURSIVE and SQL SERVER would reject that word and consider it invalid.
SQL SERVER uses the non-ANSI SQL + operator for string concatenation while PostgreSQL uses the ANSI SQL ||
SQL Server won't allow text + varchar concatenation and if your varchar is unlimited, you must call it varchar(MAX). PostgreSQL assumes if no length is specified,then its limitless so varchar, text mean much the same in PostgreSQL and varchar(MAX) is equivalent to PostgreSQL varchar.

As far as speed between the two variants, people did some benchmarks on reddit, but there wasn't a consensus on a clear winner.

-- SQL Server 2005+ using Window and CTE
WITH 
    t1 AS (
        SELECT
            p.p_name,
            a.activity,
            ROW_NUMBER() OVER(PARTITION BY a.p_name ORDER BY a.activity) AS rn
        FROM people AS p
        LEFT JOIN people_activities AS a ON a.p_name = p.p_name
    ),
    t2 AS (
        SELECT
            t1.p_name,
            CAST(t1.activity AS varchar(MAX)) AS activities,
            t1.rn
        FROM t1
        WHERE t1.rn = 1
        UNION ALL
        SELECT
            t2.p_name,
            t2.activities + ';' + t1.activity,
            t1.rn
        FROM t2
        INNER JOIN t1 ON
            t1.p_name = t2.p_name AND
            t1.rn = t2.rn + 1
    )
SELECT
    t2.p_name,
    COALESCE(t2.activities, '')
FROM t2
INNER JOIN (
    SELECT
        t1.p_name,
        MAX(t1.rn) AS max_rn
    FROM t1
    GROUP BY t1.p_name
) AS t ON
    t.p_name = t2.p_name AND
    t.max_rn = t2.rn
ORDER BY t2.p_name;

This is the old tried and true way which should work even if you are in SQL Server 2000 compatibility mode or are stuck using SQL Server 2000

-- SQL Server 2000+ 
SELECT p.p_name, 
       STUFF(( SELECT ';' + a.activity
           FROM people_activities AS a
          WHERE a.p_name = p.p_name
          ORDER BY a.activity
            FOR XML PATH('') ), 1,1,'') AS activities
      FROM people AS p
  ORDER BY p.p_name;

The MySQL way

Database purists like to make fun of MySQL, but in some ways it's way ahead of others. Its focus has always been to make the common easy and as a result was more advanced than some other databases. Some might say it catered too much to the whims of its users at the expense of doing things the right way. That's another story. This catering is no doubt a large reason for MySQL's popularity. There is something to be said about listening to the needs of users, which I think most other databases dismiss. For example MySQL was the first of these 3 databases to have a LIMIT OFFSET feature, which PostgreSQL later copied, and SQL Server 2010 (Denali) is finally introducing in the form of OFFSET FETCH NEXT. It was also the first to have the MULTI ROW VALUE insert option which was later adopted by PostgreSQL 8.2 and SQL Server 2008 and I think Oracle is yet to adopt. Heck Oracle still has that arcane FROM DUAL stupid requirement to my knowledge. How Oracle users put up with that nonsense I don't know.

UPDATE: As noted by some below and in reddit comments, a GOTCHA of the MySQL GROUP_CONCAT is that it silently truncates your result at about 1024 characters. As noted by another in the reddit thread "You can change the length at server level with group_concat_max_len up to your max_allowed_packet." or as Charles noted at the Session level with SET SESSION group_concat_max_len = 32768;

In addition it has always had this aggregate function for as far back as MySQL 3.23 days GROUP_CONCAT. It's really convenient to be able to use this on any version of MySQL. Below is how you would use it.

-- MySQL Way --
SELECT p.p_name, 
    GROUP_CONCAT(a.activity ORDER BY a.activity SEPARATOR ';') As activities
FROM people AS p 
    LEFT JOIN people_activities As a ON (p.p_name = a.p_name)
GROUP BY p.p_name
ORDER BY p.p_name;

Posted by Leo Hsu and Regina Obe in 8.2, 8.3, 8.4, 9.0, cte, db2, intermediate, mysql, oracle, postgresql versions, q&a, sql server, window functions at 11:24 | Comments (16) | Trackback (1)

Trackbacks

Trackback specific URI for this entry

PingBack

Weblog: www.postgresonline.com
Tracked: Jan 18, 18:51

Comments

Display comments as (Linear | Threaded)

I predominantly work with MS SQL but also enjoy working with PostgreSQL. You are so right about CROSS APPLY. I've found so many uses for it such as pivoting/aggregating data on a row-by-row basis, building and embedding nested XML nodes and creating comma-delimited lists. It is such a fun addition to T-SQL. I wish PostgreSQL had an equivalent.

#1 Michael on 2010-12-23 01:38 (Reply)

Note that MySQL has a default 1024 character limit for GROUP_CONCAT(), if you try to concat more with this function it will be silently truncated.

#2 Willem (Homepage) on 2010-12-23 09:42 (Reply)

You can control the maximum length of a GROUP_CONCAT using the system variable group_concat_max_len. For example, to set it to 32k:

SET SESSION group_concat_max_len = 32768;

GROUP_CONCAT is one of my favorite little MySQL cheats. I was tickled pink when STRING_AGG was added to PG 9.

#2.1 Charles on 2010-12-23 12:47 (Reply)

What is the max length that string_agg can output? Just for some sick fun I tried feeding the last name from a table with 163,840 rows into it and it's returning a big empty string.

Query looks like this:

select string_agg(last_name, ',' order by last_name)
from people;

#3 Bob (Homepage) on 2010-12-23 10:25 (Reply)

Bob,
There is no max. Most likely you are using PgAdmin III which by default won't show anything larger than 256 characters.

In newer versions of PgAdmin, you can override this default by going to File Options ->Query Tool and changing the max characters per column.

#3.1 Regina on 2010-12-23 10:34 (Reply)

There is a limit - it returns a text field, so it can't go above 1GB.

#3.1.1 Magnus (Homepage) on 2010-12-23 15:11 (Reply)

Magnus,

Really. I thought the text field is limitless since my understanding was it would start using toast for storage, or is that only for data in non-temp tables?

If its only a limitation for output would that mean it would be limitless if you bulk inserted to another table.

#3.1.1.1 Regina on 2010-12-23 18:55 (Reply)

No, it's limited. Any TOASTED value should be detoasted to varlena type.

#3.1.1.1.1 Pavel Stehule (Homepage) on 2010-12-24 07:34 (Reply)

To be sure -- try dumping out using psql. I don't think psql ever truncates.

#3.2 Regina on 2010-12-23 10:46 (Reply)

FFS, use COALESCE() in MSSQL!

#4 amt on 2010-12-23 11:31 (Reply)

Ouch. Your people primary key (varchar(75)) and your activities reference (varchar(50)) do not match.

#5 Tom Davis on 2010-12-23 12:16 (Reply)

Tom,

Thanks for the catch. Fixed.

#5.1 Regina on 2010-12-23 20:08 (Reply)

Regina,
Thanks, I got it to work dumping from psql and by exporting from pgadmin to a file. Visually pgadmin still doesn't show the data(or at least it appears it isn't there. They need a raw text view instead of a grid view for the data output.

#6 Bob on 2010-12-23 14:33 (Reply)

In the SQL Server world, the XML option is preferred for a few reasons. The most important reason is that CTEs can cause an incredible amount of logical quiery complexity which, in turn, can cause the query optimizer to abandon optimization and produce a less than optimal plan. The XML approach is also well known, tested, and (I suspect) has been optimized.

#7 Jeremiah Peschka (Homepage) on 2010-12-28 08:03 (Reply)

Add Comment

Name
Email
Homepage
In reply to
Comment	E-Mail addresses will not be displayed and will only be used for E-Mail notifications. To prevent automated Bots from commentspamming, please enter the string you see in the image below in the appropriate input box. Your comment will only be submitted if the strings match. Please ensure that your browser supports and accepts cookies, or your comment cannot be verified correctly. Enter the string from the spam-prevention image above:
	Remember Information? Subscribe to this entry

String Aggregation in PostgreSQL, SQL Server, and MySQL

Postgres OnLine Journal

PostGIS in Action About the Authors Consulting

Friday, December 24. 2010

String Aggregation in PostgreSQL, SQL Server, and MySQL

Is the new way always better?

Test Table

The PostgreSQL ways

SQL Server Ways

The MySQL way

Entry's Links

Quicksearch

Calendar

Categories

Archives

Subscribe

Blog Administration

String Aggregation in PostgreSQL, SQL Server, and MySQL

Postgres OnLine Journal PostGIS in Action About the Authors Consulting

Friday, December 24. 2010

String Aggregation in PostgreSQL, SQL Server, and MySQL

Is the new way always better?

Test Table

The PostgreSQL ways

SQL Server Ways

The MySQL way

Entry's Links

Quicksearch

Calendar

Categories

Archives

Subscribe

Blog Administration

Postgres OnLine Journal

PostGIS in Action About the Authors Consulting