Friday, December 24. 2010
Recommended Books: PostgreSQL 9 SQL Reference Manual SQL Server MVP Deep Dives High Performance MySQL
Question: You have a table of people and a table that specifies the activities each person is involved in. You want to return a result that has one record per person and a column that has a listing of activities for each person separated by semicolons and alphabetically sorted by activity. You also want the whole set alphabetically sorted by person's name.
This is a question we are always asked and since we mentor on various flavors of databases, we need to be able to switch gears and provide an answer that works on the client's database. Most often the additional requirement is that you can't install new functions in the database. This means that for PostgreSQL/SQL Server that both support defining custom aggregates, that is out as an option.
Normally we try to come up with an answer that works in most databases, but sadly the only solution that works in most is to push the problem off to the client front end and throw up your hands and proclaim -- "This ain't something that should be done in the database and is a reporting problem." That is in fact what many database purists do, and all I can say to them is wake up and smell the coffee before you are out of a job. We feel that data transformation is an important function of a database, and if your database is incapable of massaging the data into a format your various client apps can easily digest, WELL THAT's A PROBLEM.
We shall now document this answer rather than trying to answer for the nteenth time. For starter's PostgreSQL has a lot of answers to this question, probably more so than any other, though some are easier to execute than others and many depend on the version of PostgreSQL you are using. SQL Server has 2 classes of answers neither of which is terribly appealing, but we'll go over the ones that don't require you to be able to install .NET stored functions in your database since we said that is often a requirement. MySQL has a fairly simple, elegant and very portable way that it has had for a really long time.
Is the new way always better?
For PostgreSQL and SQL Server, we'll demonstrate more than one approach. Just because we demonstrate newer approaches doesn't make the older approaches obsolete or necessarily worse. Don't feel the need to use something new because it's new. Think of the newer approaches as just extra options that may or may not be faster for your use-case.
Below is the script to setup our test tables. This script will work fine for all databases we will discuss here.
The PostgreSQL ways
In PostgreSQL 9.0, PostgreSQL introduced the STRING_AGG function and also the ORDER BY predicate in any aggregate function, which made this task much easier and probably much more efficient as well. Note that in Oracle 11G Release 2, Oracle introduced an aggregate function call LISTAGG which is similar. Details of using LISTAGG and Oracle LISTAGG, the older WM_CONCAT and other string aggregation approaches for Oracle are described in Oracle String Aggregation Techniques.
PostgreSQL 8.4 introduced the ANSI-SQL array_agg function which IBM DB2 also has. Not sure about other databases. This provided another option for doing a string aggregation. Without the ORDER BY feature for aggregates introduced in PostgreSQL 9.0, this option is less powerful.
Here is the PostgreSQL 8.4+ equivalent of a third option that people mentioned is available for SQL Server - the Recursive Window CTE.
One thing that PostgreSQL has always had going for it that both MySQL and SQL Server still lack is phenomenal array support. Arrays are first class citizens in PostgreSQL witha rich set of functions to complement them. They are extremely versatile and make creating things like aggregate functions pretty trivial. For example in MySQL you'd have to become a C programmer to build an aggregate function which is generally out of the reach of most users. In SQL Server you'd have to learn .NET (VB or C#) and write a non-trivial amount of code which is much more accessible than C but not wihtout its issues. In PostgreSQL all you need to know is SQL and write a couple of lines of code to build your own custom aggregate functions. That is just SUPER COOL.
SQL Server Ways
SQL Server has a fairly cryptic way of doing this which one would consider a HACK and repurpose of functionality that was designed for something else. On the plus side, we do find its CROSS APPLY and OUTER APPLY functionality which I crave a lot when working with MySQL and a little with PostgreSQL (PostgreSQL has similar but not quite as elegant equivalents). Not so much for this particular use case.
SQL Server Recursive CTE + Window
Evidentally there is a third way of doing this in SQL Server 2005 that many purists were up in arms about that we neglected to mention.
This is the Window + CTE approach that someone posted on reddit that we revised slightly to be a bit more ANSI-SQL compliant. Note: we listed the PostgrreSQL 8.4+ equivalent of this approach in the PostgreSQL section.
The reason many people prefer this approach is that it doesn't leave any XML entities you have to strip with REPLACE. e.g. if you had & in your original text you would have to use
This is the old tried and true way which should work even if you are in SQL Server 2000 compatibility mode or are stuck using SQL Server 2000
-- SQL Server 2000+ SELECT p.p_name, STUFF(( SELECT ';' + a.activity FROM people_activities AS a WHERE a.p_name = p.p_name ORDER BY a.activity FOR XML PATH('') ), 1,1,'') AS activities FROM people AS p ORDER BY p.p_name;
The MySQL way
Database purists like to make fun of MySQL, but in some ways it's way ahead of others. Its focus has always been to make the common easy and as a result was more advanced than some other databases. Some might say it catered too much to the whims of its users at the expense of doing things the right way. That's another story. This catering is no doubt a large reason for MySQL's popularity. There is something to be said about listening to the needs of users, which I think most other databases dismiss. For example MySQL was the first of these 3 databases to have a LIMIT OFFSET feature, which PostgreSQL later copied, and SQL Server 2010 (Denali) is finally introducing in the form of OFFSET FETCH NEXT. It was also the first to have the MULTI ROW VALUE insert option which was later adopted by PostgreSQL 8.2 and SQL Server 2008 and I think Oracle is yet to adopt. Heck Oracle still has that arcane FROM DUAL stupid requirement to my knowledge. How Oracle users put up with that nonsense I don't know.
UPDATE: As noted by some below and in reddit comments, a GOTCHA of the MySQL GROUP_CONCAT is that it silently truncates your result at about 1024 characters. As noted by another in the reddit thread "You can change the length at server level with group_concat_max_len up to your max_allowed_packet." or as Charles noted at the Session level with
In addition it has always had this aggregate function for as far back as MySQL 3.23 days GROUP_CONCAT. It's really convenient to be able to use this on any version of MySQL. Below is how you would use it.
Posted by Leo Hsu and Regina Obe in 8.2, 8.3, 8.4, 9.0, cte, db2, intermediate, mysql, oracle, postgresql versions, q&a, sql server, window functions at 11:24 | Comments (14) | Trackbacks (0)
Defined tags for this entry: common table expressions, mysql, oracle, postgresql 9.0, sql server, string concatenation, window functions
Related entries by tags:
Tracked: Jan 18, 18:51
Recommended Books: PostgreSQL 9 SQL Reference Manual SQL Server MVP Deep Dives High Performance MySQL
Display comments as (Linear | Threaded)
I predominantly work with MS SQL but also enjoy working with PostgreSQL. You are so right about CROSS APPLY. I've found so many uses for it such as pivoting/aggregating data on a row-by-row basis, building and embedding nested XML nodes and creating comma-delimited lists. It is such a fun addition to T-SQL. I wish PostgreSQL had an equivalent.
#1 Michael on 2010-12-23 01:38
Note that MySQL has a default 1024 character limit for GROUP_CONCAT(), if you try to concat more with this function it will be silently truncated.
You can control the maximum length of a GROUP_CONCAT using the system variable group_concat_max_len. For example, to set it to 32k:
SET SESSION group_concat_max_len = 32768;
GROUP_CONCAT is one of my favorite little MySQL cheats. I was tickled pink when STRING_AGG was added to PG 9.
#2.1 Charles on 2010-12-23 12:47
What is the max length that string_agg can output? Just for some sick fun I tried feeding the last name from a table with 163,840 rows into it and it's returning a big empty string.
Query looks like this:
select string_agg(last_name, ',' order by last_name)
There is no max. Most likely you are using PgAdmin III which by default won't show anything larger than 256 characters.
In newer versions of PgAdmin, you can override this default by going to File Options ->Query Tool and changing the max characters per column.
#3.1 Regina on 2010-12-23 10:34
There is a limit - it returns a text field, so it can't go above 1GB.
Really. I thought the text field is limitless since my understanding was it would start using toast for storage, or is that only for data in non-temp tables?
If its only a limitation for output would that mean it would be limitless if you bulk inserted to another table.
#184.108.40.206 Regina on 2010-12-23 18:55
No, it's limited. Any TOASTED value should be detoasted to varlena type.
To be sure -- try dumping out using psql. I don't think psql ever truncates.
#3.2 Regina on 2010-12-23 10:46
FFS, use COALESCE() in MSSQL!
#4 amt on 2010-12-23 11:31
Ouch. Your people primary key (varchar(75)) and your activities reference (varchar(50)) do not match.
#5 Tom Davis on 2010-12-23 12:16
Thanks for the catch. Fixed.
#5.1 Regina on 2010-12-23 20:08
Thanks, I got it to work dumping from psql and by exporting from pgadmin to a file. Visually pgadmin still doesn't show the data(or at least it appears it isn't there. They need a raw text view instead of a grid view for the data output.
#6 Bob on 2010-12-23 14:33
In the SQL Server world, the XML option is preferred for a few reasons. The most important reason is that CTEs can cause an incredible amount of logical quiery complexity which, in turn, can cause the query optimizer to abandon optimization and produce a less than optimal plan. The XML approach is also well known, tested, and (I suspect) has been optimized.
Syndicate This Blog
Show tagged entries