Using procedures for batch geocoding and other batch processing

One of the features we are looking forward to in upcoming PostgreSQL 11 is the introduction of procedures via the CREATE PROCEDURE ANSI-SQL construct. The major benefit that sets apart procedures from functions is that procedures are not wrapped in an outer transaction and can have COMMITs within them. This means it's not an all or nothing like it is with functions. Even if you stop a procedure in motion, whatever work has been done and committed is saved. In the case of functions, a stop or failure would roll-back all the work. It also means you can see work in progress of a stored procedure since the work will already have been committed. This is a huge benefit for batch processing. Batch processing covers a lot of use-cases of PostGIS users since a good chunk of PostGIS work involves doing some kind of batch processing of data you get from third-parties or machines.

We'll demonstrate with batch geocoding, using a similar approach we discussed in. psql watch for batch processing. Folks often resort to doing this stuff outside of the database like using Python as a driver to push batches to postgres. Now postgres is capable of doing it all.

In addition to the code and data here, you'll need postgis_tiger_geocoder extension and loaded with MA census data.

Why batch process data?

For geocoding of data or any kind of lengthy processing work, you often do this in batches of say 1, 10, or 50, but never the whole 1,000,000. The reasons for that are

For each update statement, PostgreSQL needs to be able to rollback the update. If your server goes down for whatever reason your last update would be lost. A million records of work is a lot to lose.
Since PostgreSQL needs to be able to rollback, all that workspace to get ready to rollback is occupied and eats up memory. In that sense you may not even have the memory available to handle a batch of 1,000,000 in one bite.

You can't put your batch work in a PostgreSQL function, at least not all of it, because each function has an explicit transaction wrapped around it. This is one of the reasons why trying to do it in a function can take 10 or more times longer than just embedding the logic in a script that pushes the work in smaller commits. Now we've got PostgreSQL procedures where we can put in pretty much the same code as we would have done in our function (except now we can have COMMITS), and can explicitly do the commit after each batch. The great thing about this is if your connection is lost, the server crashes, you loose power, or whatever, the work that has already been committed is not lost. It also means the memory is freed after each commit, and not wait for all the work to be done in the procedure before making available for other processes.

Test Data Set

Our test dataset is small so not very interesting. To simulate more records, we repeat each address 1000 times.

DROP TABLE IF EXISTS addr_to_geocode ;
CREATE TABLE addr_to_geocode(addid serial PRIMARY KEY, address text,
    pt geography(point, 4326), new_address text, rating integer);

INSERT INTO addr_to_geocode(address)
VALUES ('529 Main Street, Boston MA, 02129'),
 ('77 Massachusetts Avenue, Cambridge, MA 02139'),
 ('25 Wizard of Oz, Walaford, KS 99912323'),
 ('26 Capen Street, Medford, MA'),
 ('124 Mount Auburn St, Cambridge, Massachusetts 02138'),
 ('950 Main Street, Worcester, MA 01610');
 
-- repeat address to get more data
INSERT INTO addr_to_geocode(address)
SELECT address 
FROM addr_to_geocode, generate_series(1,1000);

The batch processing procedure

To create our procedure, we are also going to take advantage of the FOR UPDATE SKIP LOCKED feature introduced in PostgreSQL 9.5. This will allow us to have multiple postgres connections calling the same procedure, but each being allocated a different set of records to work on so there is no lock contention.

CREATE OR REPLACE PROCEDURE batch_geocode()
LANGUAGE 'plpgsql' AS 
$$
BEGIN
    WHILE EXISTS (SELECT 1 FROM addr_to_geocode WHERE rating IS NULL) LOOP
    -- we use FOR UPDATE SKIP LOCKED to prevent multiple concurrent procedure calls 
    -- from grabbing the same set of records
        WITH a AS ( SELECT addid, address FROM addr_to_geocode 
                      WHERE rating IS NULL ORDER BY addid LIMIT 5 
                        FOR UPDATE SKIP LOCKED)
        UPDATE addr_to_geocode
            SET (rating, new_address, pt)
            = (COALESCE(g.rating,-1),
                COALESCE ((g.addy).address::text, '')
                || COALESCE(' ' || (g.addy).predirabbrev, '')
                || COALESCE(' ' || (g.addy).streetname,'')
                || ' ' || COALESCE(' ' || (g.addy).streettypeabbrev, '')
                || COALESCE(' ' || (g.addy).location || ', ', '')
                || COALESCE(' ' || (g.addy).stateabbrev, '')
                || COALESCE(' ' || (g.addy).zip, '')
            ,
            ST_SetSRID(g.geomout,4326)::geography
            )
           FROM (SELECT addid, (gc).rating, (gc).addy, (gc).geomout
                  FROM a
                  LEFT JOIN LATERAL geocode(address,1) AS gc ON (true)
                ) AS g
           WHERE g.addid = addr_to_geocode.addid;
           
         COMMIT;
    END LOOP;
    RETURN;
END;
$$;

To call the procedure, you would use the new CALL statement like so:

CALL batch_geocode();

You'll notice that if you then cancel the procedure (e.g. by clicking the stop icon in pgAdmin ), the geocoding that has already been done before canceling the call is not lost, unlike it would be if this were a function.