Running totals and sums using PostgreSQL 8.4 Windowing functions

One thing that is pretty neat about windowing functions in PostgreSQL 8.4 aside from built-in windowing functions (row_number(), rank(), lead(), lag(), dense_rank(), percent_rank(), cume_dist(), first_value, last_value, nth_value) as documented in the manual Windowing Functions is that you can use windows with most aggregate functions (built-in or custom defined) as well as define your own specific windowing functions. In a later article, we'll demonstrate creating custom windowing functions.

In our PGCon 2009 PostGIS presentation one of the last slides demonstrates using lead() and lag() windowing functions to show a family's income level in the same record with the income levels of the next door neighbors in the fictitious town we created. This is not terribly useful unless you live in a somewhat dysfunctional neighborhood where everyone is concerned about how rich their neighbors are compared to themselves. Our town was very dysfunctional but mostly geographically dysfunctional. We will have much more useful use cases of this as applied to GIS in our upcoming PostGIS in Action book.

Hitoshi Harada and David Fetter did a presentation of this in PGCon 2009 which sadly we missed since we were giving our own presentation. Check out the PGCon2009 PostgreSQL 8.4 Windowing Functions Video. Also check out the slides at Introducing Windowing Functions.

Those who have used SQL Server 2005+, Oracle or IBM DBII are probably familar or have run into examples of Windowing functions in those products. Windowing in PostgreSQL 8.4 works more or less the same way. In a prior article, we demonstrated how to return running totals and sums using rudimentary SQL. To precelebrate the eminent arrival of PostgreSQL 8.4 and the current PostgreSQL 8.4 beta 2 release, we shall demonstrate the same exercise using the new ANSI SQL:2003 Windowing functionality built into the upcoming PostgreSQL 8.4.



--First create our table and dataset
CREATE TABLE orders(order_id serial PRIMARY KEY, customer_id integer, 
      order_datetime timestamp, order_total numeric(10,2));

INSERT INTO orders(customer_id, order_datetime, order_total)
VALUES (1,'2009-05-01 10:00 AM', 500),
    (1,'2009-05-15 11:00 AM', 650),
    (2,'2009-05-11 11:00 PM', 100),
    (2,'2009-05-12 11:00 PM', 5),
       (3,'2009-04-11 11:00 PM', 100),
          (1,'2009-05-20 11:00 AM', 3);

Recall in our previous example we wanted our running total column to include the total of all purchases made by the customer prior to the current purchase. We accomplished this with a self join.


--non windowing version
SELECT n.customer_id, n.order_id, n.order_total, 
    COALESCE(SUM(o.order_total),0) As past_order_total
FROM orders AS n LEFT JOIN orders AS o 
    ON (o.customer_id = n.customer_id 
            AND n.order_datetime > o.order_datetime)
GROUP BY n.customer_id, n.order_datetime, n.order_id, n.order_total
ORDER BY n.customer_id, n.order_datetime, n.order_id;



 customer_id | order_id | order_total | past_order_total
-------------+----------+-------------+---------------
           1 |        7 |      500.00 |             0
           1 |        8 |      650.00 |        500.00
           1 |       12 |        3.00 |       1150.00
           2 |        9 |      100.00 |             0
           2 |       10 |        5.00 |        100.00
           3 |       11 |      100.00 |             0
(6 rows)

The above shows an approach to write running sums using self-joins. If you do not have windowing functions this is a common approach. With windowing functions, you have the option of doing this. Note in our example we want to exclude the current order from our running total sum.


SELECT n.customer_id, n.order_id, n.order_total, 
    SUM(n.order_total) 
        OVER (PARTITION BY n.customer_id 
                      ORDER BY n.order_datetime) - n.order_total As past_order_total
FROM orders AS n 
ORDER BY n.customer_id, n.order_datetime, n.order_id;


 customer_id | order_id | order_total | past_order_total
-------------+----------+-------------+------------------
           1 |        7 |      500.00 |             0.00
           1 |        8 |      650.00 |           500.00
           1 |       12 |        3.00 |          1150.00
           2 |        9 |      100.00 |             0.00
           2 |       10 |        5.00 |           100.00
           3 |       11 |      100.00 |             0.00


An equal statement to the above that uses pure windowing with no subtract would probably use something like a RANGE BETWEEN x PRECEEDING AND Y FOLLOWING etc. Unfortunately PostgreSQL 8.4 doesn't appear to support the Oracle like:
BETWEEN x PRECEDING AND y FOLLOWING
BETWEEN x PRECEDING AND y PRECEDING
As described in Oracle Windowing functions therefore we had to put the - n.order_total to achieve more or less the same result as our previous. Also note we are making the assumption in our prior version that a given customer will not place 2 orders at the same time. Otherwise the results would be different between the 2.

Named windows

One other feature that is nice about window functions is that you can name windows and reuse them, also have windowing different from your basic sort order, and have multiple windowing functions in the same query. Below is a demonstration of this where we sequentially order our orders by order_datetime in descending order and then reuse our existing window to get an ordering for each customer partitioning such that we have a row_number recount for each customer by ascending time.


--multiple windows and multiple window function calls
SELECT row_number() OVER(ORDER BY order_datetime DESC) As rtime_d, 
n.customer_id, row_number() OVER(window_custtime) As cr_num, n.order_id, n.order_total, 
    SUM(n.order_total) OVER (window_custtime) - n.order_total As past_order_total
FROM orders AS n 
WINDOW window_custtime AS (PARTITION BY n.customer_id 
                               ORDER BY n.order_datetime)
ORDER BY  n.order_datetime, n.order_id, n.customer_id;



 rtime_d | customer_id | cr_num | order_id | order_total | past_order_total
---------+-------------+--------+----------+-------------+------------------
       6 |           3 |      1 |       11 |      100.00 |             0.00
       5 |           1 |      1 |        7 |      500.00 |             0.00
       4 |           2 |      1 |        9 |      100.00 |             0.00
       3 |           2 |      2 |       10 |        5.00 |           100.00
       2 |           1 |      2 |        8 |      650.00 |           500.00
       1 |           1 |      3 |       12 |        3.00 |          1150.00
(6 rows)

Below is the PgAdmin graphical explain plan from PgAdmin III 1.10 Beta 3

Observe that using the same window results in one Windows Agg which is a bit clearer in the next example.



--using same window twice results in only one windowagg call
SELECT row_number()  OVER(window_custtime) As rtime_d, 
n.customer_id, lead(order_id) OVER(window_custtime) As cr_num, n.order_id, n.order_total
FROM orders AS n 
WINDOW window_custtime AS (PARTITION BY n.customer_id 
                               ORDER BY n.order_datetime)
ORDER BY  n.order_datetime, n.order_id, n.customer_id;




 rtime_d | customer_id | cr_num | order_id | order_total
---------+-------------+--------+----------+-------------
       1 |           3 |        |        5 |      100.00
       1 |           1 |      2 |        1 |      500.00
       1 |           2 |      4 |        3 |      100.00
       2 |           2 |        |        4 |        5.00
       2 |           1 |      6 |        2 |      650.00
       3 |           1 |        |        6 |        3.00
(6 rows)