Build Median Aggregate Function in SQL

Sunday, August 10. 2008

One of the things we love most about PostgreSQL is the ease with which one can define new aggregate functions with even a language as succinct as SQL. Normally when we have needed a median function, we've just used the built-in median function in PL/R as we briefly demonstrated in Language Architecture in PostgreSQL.

If all you ever need is a simple median aggregate function, then installing the whole R statistical environment just so you can use PL/R is overkill and much less portable.

In this article we will demonstrate how to create a Median function with nothing but the built-in PostgreSQL SQL language, array constructs, and functions.

Primer on PostgreSQL Aggregate Structure

PostgreSQL has a very simple but elegant architecture for defining aggregates. Aggregates can be defined using any functions, built-in languages and PL languages you have installed. You can even mix and match languages if you want to take advantage of the unique speed optimization/library features of each language. Below are the basic steps to building an aggregate function in PostgreSQL.

Define a state function (the SFUNC) that takes the running state and each value of the result set - this can be in the PL/built-in language of your choosing or you can use one that already exists.
Define a final function (the FINALFUNC) that does something with the final output of the state function - this can be in the PL/built-in language of your choosing or you can use one that already exists.
If the intermediary type (the STYPE) returned by your state function does not exist, then create it.

Now define the aggregate with steps that look something like this:


	CREATE AGGREGATE median(numeric) (
		  SFUNC=array_append,
		  STYPE=numeric[],
		  FINALFUNC=array_median
		);

NOTE: As Tom Lane pointed out in the comments below, the following paragraph is not entirely true. Since all arrays can be cast to the anyarray datatype, you can use anyarray to share one function across all data types, assuming you want all medians to behave the same regardless of data type. We shall demonstrate this in our next aggregate example.
This part is a bit annoying. You need to define an aggregate for each data type you need it to work with that doesn't automatically cast to a predefined type. The above example will only work for numbers because all numbers can be automatically cast to a numeric. However, if we needed a median for dates, we would also need to do:
```
	CREATE AGGREGATE median(timestamp) (
		  SFUNC=array_append,
		  STYPE=timestamp[],
		  FINALFUNC=array_median
		);
```
and also define an array_median function that takes a timestamp[] (dates are automatically cast to timestamp). Keep in mind that PostgreSQL supports function overloading, which means we can give all these functions the same name as long as they take different data type inputs. This lets the final user of our median function simply call the aggregate median() without worrying about whether they are taking a median of dates or numbers.
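To make the overloading point concrete, here is a minimal sketch. The function name describe_input is hypothetical and exists only for this illustration; it shows PostgreSQL picking the overload from the argument type.

```sql
-- Two functions with the same (hypothetical) name, told apart by input type.
CREATE OR REPLACE FUNCTION describe_input(numeric)
  RETURNS text AS
$$ SELECT 'numeric'::text; $$
  LANGUAGE sql IMMUTABLE;

CREATE OR REPLACE FUNCTION describe_input(timestamp)
  RETURNS text AS
$$ SELECT 'timestamp'::text; $$
  LANGUAGE sql IMMUTABLE;

-- The integer is implicitly cast to numeric, the date to timestamp.
SELECT describe_input(42) AS for_number,
       describe_input(CAST('2008-01-01' AS date)) AS for_date;
-- for_number = numeric, for_date = timestamp
```

The same resolution rules are what let median(numeric) and median(timestamp) coexist under one name.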

Build our Median Aggregate

In the steps that follow we shall flesh out the FINALFUNC function. Please note that array_append is a built-in PostgreSQL function that takes an array and an element and returns the array with that element appended. Used as the SFUNC, it simply accumulates every value of the group into one array, so conveniently we don't need to define an SFUNC as we would normally.
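As a quick illustration of why array_append can act as the state function with no extra code, the sketch below calls it directly. The literal values are made up for the example.

```sql
-- array_append(anyarray, anyelement) returns a new array with the element added at the end.
SELECT array_append(ARRAY[1, 2]::numeric[], 3::numeric) AS appended;
-- appended = {1,2,3}

-- An aggregate without an INITCOND starts from a NULL state;
-- appending to a NULL array yields a one-element array.
SELECT array_append(NULL::numeric[], 5::numeric) AS first_step;
-- first_step = {5}

-- NULL inputs are appended as well, which is why the final function must filter them out.
SELECT array_append(ARRAY[1, 2]::numeric[], NULL::numeric) AS with_null;
-- with_null = {1,2,NULL}
```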

Now what makes creating a median aggregate function harder than, say, an average is that it cares about order and needs to look at all items to determine what to return. This means that unlike average, sum, max, min, etc., we need to keep all the values passed to us, re-sort them based on the sorting rules of their data type, and return the middle item. Here is where the beauty of array_append saves us.

Now let's get started. We conveniently have everything we need gratis from PostgreSQL. All we need now are our array_median functions, which will take in the array of items collected during the group process, junk the nulls, re-sort what's left, and then return the middle item.
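The junk-the-nulls, re-sort, pick-the-middle sequence can be tried on a literal array before wrapping it in a function. This is only a sketch: the input values are made up, and it uses unnest (PostgreSQL 8.4 or later) in place of the generate_series subscripting used in the article's functions.

```sql
-- Hypothetical input: what the aggregate might have collected, NULLs included.
SELECT asorted,
       asorted[ceiling(array_upper(asorted, 1) / 2.0)::int] AS middle_item
FROM (
    SELECT ARRAY(
        SELECT v
        FROM unnest(ARRAY[3, NULL, -1, 11, 10, 9]::numeric[]) AS v
        WHERE v IS NOT NULL   -- junk the nulls
        ORDER BY v            -- re-sort what is left
    ) AS asorted
) AS foo;
-- asorted = {-1,3,9,10,11}, middle_item = 9
```

With an even number of non-null items, say {1,2,3,4}, the same index expression gives ceiling(4/2.0) = 2, which is the lower of the two middle items.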

NOTES:

Instead of using array_append directly, you can create an intermediary state function that rejects nulls. That would probably perform better but requires a bit more code.
When there is an even number of items, there are two middle values and the customary thing is to average them. For our particular use case we wanted the result to be in the list, so we simply take the lower of the two middle values.
You will see a division by 2.0 in the code. That is needed because 1/2 is 0 in SQL: dividing two integers returns an integer. To get around that we force the 2 to be a decimal.
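The null-rejecting intermediary mentioned in the first note could look something like the sketch below. The names array_append_notnull and collect_notnull are hypothetical, and we have not benchmarked whether this is actually faster.

```sql
-- Hypothetical state function: append only non-null values to the state array.
CREATE OR REPLACE FUNCTION array_append_notnull(numeric[], numeric)
  RETURNS numeric[] AS
$$
    SELECT CASE WHEN $2 IS NULL THEN $1 ELSE array_append($1, $2) END;
$$
  LANGUAGE sql IMMUTABLE;

-- Hypothetical aggregate that only collects values, to show the state function at work.
DROP AGGREGATE IF EXISTS collect_notnull(numeric);
CREATE AGGREGATE collect_notnull(numeric) (
  SFUNC = array_append_notnull,
  STYPE = numeric[]
);

-- The NULL row never makes it into the collected array.
SELECT collect_notnull(x) AS collected
FROM (VALUES (3::numeric), (NULL), (1)) AS t(x);
```

Pairing the same SFUNC with FINALFUNC=array_median would give a median aggregate whose final function no longer has any nulls to discard.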

So the code looks like this:

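CREATE OR REPLACE FUNCTION array_median(numeric[])
  RETURNS numeric AS
$$
    -- asorted holds the non-null elements of $1 in sorted order.
    -- For an even count, ceiling(n/2.0) picks the lower of the two middle items.
    -- An empty or all-null input yields null.
    SELECT CASE WHEN array_upper($1,1) = 0 THEN null ELSE asorted[ceiling(array_upper(asorted,1)/2.0)] END
    FROM (SELECT ARRAY(SELECT ($1)[n]
            FROM generate_series(1, array_upper($1, 1)) AS n
            WHERE ($1)[n] IS NOT NULL
            ORDER BY ($1)[n]
        ) As asorted) As foo ;
$$
  LANGUAGE 'sql' IMMUTABLE;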


CREATE OR REPLACE FUNCTION array_median(timestamp[])
  RETURNS timestamp AS
$$
    SELECT CASE WHEN array_upper($1,1) = 0 THEN null ELSE asorted[ceiling(array_upper(asorted,1)/2.0)] END
    FROM (SELECT ARRAY(SELECT ($1)[n]
            FROM generate_series(1, array_upper($1, 1)) AS n
            WHERE ($1)[n] IS NOT NULL
            ORDER BY ($1)[n]
        ) As asorted) As foo ;
$$
  LANGUAGE 'sql' IMMUTABLE;


CREATE AGGREGATE median(numeric) (
  SFUNC=array_append,
  STYPE=numeric[],
  FINALFUNC=array_median
);

CREATE AGGREGATE median(timestamp) (
  SFUNC=array_append,
  STYPE=timestamp[],
  FINALFUNC=array_median
);

Now the tests

----TESTS numeric median - 16ms
SELECT m, median(n) As themedian, avg(n) as theavg
FROM generate_series(1, 58, 3) n, generate_series(1,5) m
WHERE n > m*2
GROUP BY m
ORDER BY m;

--Yields
 m | themedian |       theavg
---+-----------+---------------------
 1 |        31 | 31.0000000000000000
 2 |        31 | 32.5000000000000000
 3 |        31 | 32.5000000000000000
 4 |        34 | 34.0000000000000000
 5 |        34 | 35.5000000000000000
 
 SELECT m, n
FROM generate_series(1, 58, 3) n, generate_series(1,5) m
WHERE n > m*2 and m = 1
ORDER BY m, n;
--Yields


 m | n
---+----
 1 |  4
 1 |  7
 1 | 10
 1 | 13
 1 | 16
 1 | 19
 1 | 22
 1 | 25
 1 | 28
 1 | 31
 1 | 34
 1 | 37
 1 | 40
 1 | 43
 1 | 46
 1 | 49
 1 | 52
 1 | 55
 1 | 58

--Test to ensure feeding numbers out of order still works
SELECT avg(x), median(x)
FROM (SELECT 3 As x 
    UNION ALL 
    SELECT - 1 As x 
    UNION ALL 
    SELECT 11 As x
    UNION ALL
    SELECT 10 As x
    UNION ALL
    SELECT 9 As x) As foo;
    
--Yields

 avg | median
-----+--------
 6.4 |      9

---TEST date median -NOTE: average is undefined for dates so we left that out. 16ms
SELECT m, median(CAST('2008-01-01' As date) + n) As themedian
    FROM generate_series(1, 58, 3) n, generate_series(1,5) m
    WHERE n > m*2
GROUP BY m
ORDER BY m;


 m |      themedian
---+---------------------
 1 | 2008-02-01 00:00:00
 2 | 2008-02-01 00:00:00
 3 | 2008-02-01 00:00:00
 4 | 2008-02-04 00:00:00
 5 | 2008-02-04 00:00:00
 
SELECT m, (CAST('2008-01-01' As date) + n) As thedate
FROM generate_series(1, 58, 3) n, generate_series(1,5) m
WHERE n > m*2 AND m = 1
ORDER BY m,n;

--Yields


m |  thedate
--+------------
1 | 2008-01-05
1 | 2008-01-08
1 | 2008-01-11
1 | 2008-01-14
1 | 2008-01-17
1 | 2008-01-20
1 | 2008-01-23
1 | 2008-01-26
1 | 2008-01-29
1 | 2008-02-01
1 | 2008-02-04
1 | 2008-02-07
1 | 2008-02-10
1 | 2008-02-13
1 | 2008-02-16
1 | 2008-02-19
1 | 2008-02-22
1 | 2008-02-25
1 | 2008-02-28

Posted by Leo Hsu and Regina Obe in intermediate, pl programming, sql functions at 21:10 | Comments (19) | Trackbacks (2)

Trackbacks

More Aggregate Fun: Who's on First and Who's on Last
Microsoft Access has these peculiar set of aggregates called First and Last. We try to avoid them because while the concept is useful, we find Microsoft Access's implementation of them a bit broken. MS Access power users we know moving over to somethin

Weblog: Postgres OnLine Journal
Tracked: Aug 12, 22:58

How to create multi-column aggregates
PostgreSQL 8.2 and above has this pretty neat feature of allowing you to define aggregate functions that take more than one column as an input. First we'll start off with a rather pointless but easy to relate to example and then we'll follow up with som

Weblog: Postgres OnLine Journal
Tracked: Mar 05, 22:56

Comments

Uh, you don't need multiple implementations if you use polymorphism.

CREATE OR REPLACE FUNCTION array_median(anyarray)
RETURNS anyelement AS
$$
SELECT CASE WHEN array_upper($1,1) = 0 THEN null ELSE asorted[ceiling(array_upper(asorted,1)/2.0)] END
FROM (SELECT ARRAY(SELECT ($1)[n] FROM
generate_series(1, array_upper($1, 1)) AS n
WHERE ($1)[n] IS NOT NULL
ORDER BY ($1)[n]
) As asorted) As foo ;
$$
LANGUAGE 'sql' IMMUTABLE;

CREATE AGGREGATE median(anyelement) (
SFUNC=array_append,
STYPE=anyarray,
FINALFUNC=array_median
);

#1 Tom Lane on 2008-08-10 23:22 (Reply)

Thanks Tom. That does make it much easier.

#1.1 Regina on 2008-08-11 01:10 (Reply)

This is an example where the aggregate function is relatively slow.

postgres=# select median(a) from foo;
median
--------
503
(1 row)

Time: 90,089 ms

postgres=# select array_median(array(select a from foo order by 1));
array_median
--------------
503
(1 row)

Time: 30,101 ms

#2 Pavel Stehule (Homepage) on 2008-08-11 09:22 (Reply)

Pavel,

This is very interesting. I thought your example difference was because of the presorting, but evidently the aggregation process is adding a lot of overhead as well. What could that be?

I assumed there was a benefit to presorting especially if you have an index on the presorted column, but strangely I am not seeing the benefit in my sample data.

I have an index on the length column of the census2000tiger_arc table. Looking at these 3 - I would have expected the sorted ones to perform better, but they don't by much, if at all. Looking at the explain I realize it's not using the length index I put in but instead the zip index. So I'm guessing presorting only helps if your data has an index on that field and that index is actually used.

--This one returns in 2907 ms
-- midlength = 134.34012, count = 19,084 (without count 2891 ms)
SELECT median(length) as midlength, count(zipl)
from census2000tiger_arc
where zipl between 2000 and 2109;

--This one returns in 2964 ms (without count 2953 ms)
--midlength = 134.34012, count = 19,084

SELECT median(length) as midlength, count(zipl)
from (SELECT * FROM census2000tiger_arc
where zipl between 2000 and 2109 ORDER BY length) as foo;

--Equivalent to yours - 1844 ms, result = 134.34012
select array_median(array(select length FROM census2000tiger_arc
where zipl between 2000 and 2109 ORDER BY length));

#2.1 Regina (Homepage) on 2008-08-11 22:55 (Reply)

Hello,

Simply calling the array_append function repeatedly is expensive - this function isn't optimized for use as an aggregate. Array(subselect) is much faster.

#2.1.1 Pavel Stehule (Homepage) on 2008-08-12 02:23 (Reply)

There is an error with this algorithm, as it doesn't consider the odd/even rule in the calculation. Although not a major error, the results using this function are different than all other implementations of the median function (e.g., R, Excel, etc.).

When there is an even number of items, the median is the average of the inner two middle values (since there is no symmetry to provide a middle value). If there is an odd number of items, the median is the middle index (as applied above). The examples above have odd set counts, so they work as expected. However, even set count examples fail with a "left" bias.

Building on Tom's suggestion above, consider using the following (note that this only works for numeric-like inputs, or other types that can be divided by two):

CREATE OR REPLACE FUNCTION array_median(anyarray)
RETURNS anyelement AS
$$
SELECT CASE
WHEN array_upper($1,1) = 0 THEN null
WHEN mod(array_upper($1,1),2) = 1 THEN asorted[ceiling(array_upper(asorted,1)/2.0)]
ELSE ((asorted[ceiling(array_upper(asorted,1)/2.0)] + asorted[ceiling(array_upper(asorted,1)/2.0)+1])/2.0) END
FROM (SELECT ARRAY(SELECT ($1)[n] FROM
generate_series(1, array_upper($1, 1)) AS n
WHERE ($1)[n] IS NOT NULL
ORDER BY ($1)[n]
) As asorted) As foo ;
$$
LANGUAGE 'sql' IMMUTABLE;

#3 Mike on 2008-09-12 13:03 (Reply)

Mike,

Thanks for the example. Yes I was thinking about providing something like that and ideally for this one, I think you would use numeric[], numeric instead of anyelement.

As I recall things working you can have both definitions of median working and the one that is the closest match to the function is the one that gets used. So if you define this - then it would get used for numerics and the other would get used for everything else. I'll have to try that though since I'm not absolutely sure that's how it works.

#3.1 Regina on 2008-09-14 05:14 (Reply)

I tried to create median aggregate, but it gave me an error:

ERROR: syntax error at or near "("
SQL state: 42601
Character: 36

Thanks

#4 Kursat on 2009-05-08 11:27 (Reply)

Kursat,

Which version of PostgreSQL are you running?

use

SELECT version();

pre 8.0 versions don't support $$ quoting. The other possibility is just a cut and paste typo somewhere.

#4.1 Regina on 2009-05-08 14:28 (Reply)

Pavel, thank you for posting this. Your method made a dramatic difference in the performance of medians.

When using the r_median() function from PL/R combined with your method I am getting back results in 1-2s on a table where using a MEDIAN AGGREGATE was taking literally 30 minutes.

Array mutating is absolutely terrible in Postgres, and if you have a large number of rows, using array_append (or similar things, like R's plr_array_accum) gets exponentially worse as your result set increases.

I am disappointed that Arrays are handled so poorly in Postgres. It basically completely rules out using the CREATE AGGREGATE construct with holistic aggregates.

#5 Dimensionally on 2009-05-08 11:43 (Reply)

Dimensionally -- have you tried the PostgreSQL 8.4 beta? One of the major improvements is a significant speedup of array accumulation in aggregates, so that you get the benefits of what Pavel described and the elegance of an aggregate function.

#5.1 Regina on 2009-05-08 14:26 (Reply)

Thank you Regina, using the array_agg aggregate is a much nicer building block.

Am I correct in understanding that the syntax I need to use is:

SELECT r_median(array_agg(myColumn)) FROM myTable

That is what I am using, and it is working efficiently. However, I would rather be able to do:

SELECT median(myColumn) FROM myTable

But I am happy with the former.

thank you

#5.1.1 Dimensionally on 2009-05-08 16:54 (Reply)

That is one way. I haven't experimented with it myself. I would more likely go the direction of still making a median aggregate function but using array_agg helper functions. If you look at the definition of array_agg -- it looks like this

CREATE AGGREGATE array_agg(anyelement) (
SFUNC=array_agg_transfn,
STYPE=internal,
FINALFUNC=array_agg_finalfn
);

So I would use the array_agg_transfn/finalfn etc to try to roll my own faster aggregates. We are planning to experiment with this in another article and see how it goes.

#5.1.1.1 Regina on 2009-05-11 10:01 (Reply)

Hi, thank you for the additional information.

However, it does not appear to be legal for user-SQL to contain a STYPE of "internal".

I get this error:

"ERROR: aggregate transition data type cannot be internal"

#6 Dimensionally on 2009-05-11 15:01 (Reply)

Dimensionally,

We haven't gotten around to demonstrating it yet, but hope to soon.

In the meantime, take a look at the
/share/contrib/int_aggregate.sql of your PostgreSQL 8.4 install.

It demonstrates how to use the new agg transition functions and the unnest function to build a custom aggregate.

#6.1 Regina on 2009-05-12 13:59 (Reply)

Okay, I looked into this further and sadly I think I was wrong. The int_agg example is nothing but a thin wrapper around the existing array_agg, so it seems that doing what I really wanted to do would require writing a C function for the last step. It's too bad that CREATE AGGREGATE doesn't have another arg like POST CONDITION or something that would allow you to further massage the INT type. That would make this way more elegant and useful.

#6.2 Regina on 2009-05-14 07:14 (Reply)

Your implementation has a bug, in that sets of even numbers don't return the average of the two middle numbers. You can fix it by adding cases for checking if the size of an array is even or odd.

Here's a fix. I used asorted for all the case checks because you could otherwise be doing arithmetic operations on null values, which is never good.

CREATE OR REPLACE FUNCTION array_median(numeric[])
RETURNS numeric AS
$$
SELECT CASE
WHEN array_upper(asorted, 1) = 0 THEN null
WHEN array_upper(asorted, 1) % 2 = 1 then asorted[ceiling(array_upper(asorted,1)/2.0)]
ELSE (asorted[ceiling(array_upper(asorted,1)) / 2.0] + asorted[ceiling(array_upper(asorted,1)) / 2.0 + 1]) / 2.0
END
FROM (
SELECT ARRAY (
SELECT ($1)[n]
FROM generate_series(1, array_upper($1, 1)) AS n
WHERE ($1)[n] IS NOT NULL ORDER BY ($1)[n]
) as asorted
) as foo ;
$$
LANGUAGE 'sql' IMMUTABLE;

#7 Peter Koczan on 2009-11-17 19:19 (Reply)

This is not working anymore with pg 8.4.1 and postgis 1.4.1

Many errors found in geomunion etc.
Need to be refreshed

#8 Bruno Friedmann on 2010-01-10 04:20 (Reply)

Bruno,

What does this article have to do with PostGIS?

Please don't use geomunion anymore. It's been deprecated since PostGIS 1.2 and was not improved on in 1.4. It will be destroyed in PostGIS 2.0.

You should be using ST_Union instead which has the faster Cascade Union functionality.

#8.1 Regina on 2010-01-10 23:19 (Reply)
