Postgres OnLine Journal: Jan 2018 - Dec 2018
An in-depth Exploration of the PostgreSQL Open Source Database
 

Table Of Contents

What's new and upcoming in PostgreSQL
Basics
Using PostgreSQL Contribs

What's new and upcoming in PostgreSQL

 

Using procedures for batch geocoding and other batch processing



One of the features we are looking forward to in the upcoming PostgreSQL 11 is the introduction of procedures via the ANSI-SQL CREATE PROCEDURE construct. The major benefit that sets procedures apart from functions is that procedures are not wrapped in an outer transaction and can have COMMITs within them. This means it's not all-or-nothing as it is with functions. Even if you stop a procedure mid-run, whatever work has been done and committed is saved. In the case of functions, a stop or failure would roll back all the work. It also means you can see the work in progress of a stored procedure, since that work will already have been committed. This is a huge benefit for batch processing. Batch processing covers a lot of use-cases of PostGIS users, since a good chunk of PostGIS work involves some kind of batch processing of data you get from third parties or machines.

We'll demonstrate with batch geocoding, using an approach similar to the one we discussed in psql watch for batch processing. Folks often resort to doing this stuff outside of the database, such as using Python as a driver to push batches to PostgreSQL. Now PostgreSQL is capable of doing it all.

In addition to the code and data here, you'll need the postgis_tiger_geocoder extension installed and loaded with MA census data.

Why batch process data?

For geocoding of data, or any kind of lengthy processing work, you often do it in batches of say 1, 10, or 50 records, but never the whole 1,000,000 at once. The reasons for that are:

  • For each update statement, PostgreSQL needs to be able to roll back the update. If your server goes down for whatever reason mid-update, the uncommitted work is lost. A million records of work is a lot to lose.
  • Since PostgreSQL needs to be able to roll back, all the workspace needed to support a rollback is occupied and eats up memory. In that sense you may not even have the memory available to handle a batch of 1,000,000 in one bite.

You can't put your batch work in a PostgreSQL function, at least not all of it, because each function has an implicit transaction wrapped around it. This is one of the reasons why trying to do it in a function can take 10 or more times longer than embedding the logic in a script that pushes the work in smaller commits. Now we've got PostgreSQL procedures, where we can put in pretty much the same code as we would have done in our function (except now we can have COMMITs), and can explicitly commit after each batch. The great thing about this is that if your connection is lost, the server crashes, you lose power, or whatever, the work that has already been committed is not lost. It also means the memory is freed after each commit, rather than being held until all the work in the procedure is done.
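A minimal sketch of the difference (the work_log table and demo_commit procedure are made up for illustration): a COMMIT inside a procedure body is legal and makes each iteration's work durable on its own, whereas the same COMMIT inside a function would raise an error.

```sql
-- Hypothetical illustration only; the real batch logic comes later in this article.
CREATE TABLE work_log(batch integer, done_at timestamptz DEFAULT now());

CREATE PROCEDURE demo_commit()
LANGUAGE plpgsql AS
$$
BEGIN
    FOR i IN 1..3 LOOP
        INSERT INTO work_log(batch) VALUES (i);
        COMMIT;  -- legal in a procedure: batch i is now durable,
                 -- even if iteration i+1 fails or the call is canceled
    END LOOP;
END;
$$;
```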

Test Data Set

Our test dataset is small so not very interesting. To simulate more records, we repeat each address 1000 times.

DROP TABLE IF EXISTS addr_to_geocode ;
CREATE TABLE addr_to_geocode(addid serial PRIMARY KEY, address text,
    pt geography(point, 4326), new_address text, rating integer);

INSERT INTO addr_to_geocode(address)
VALUES ('529 Main Street, Boston MA, 02129'),
 ('77 Massachusetts Avenue, Cambridge, MA 02139'),
 ('25 Wizard of Oz, Walaford, KS 99912323'),
 ('26 Capen Street, Medford, MA'),
 ('124 Mount Auburn St, Cambridge, Massachusetts 02138'),
 ('950 Main Street, Worcester, MA 01610');
 
-- repeat address to get more data
INSERT INTO addr_to_geocode(address)
SELECT address 
FROM addr_to_geocode, generate_series(1,1000);

The batch processing procedure

To create our procedure, we are also going to take advantage of the FOR UPDATE SKIP LOCKED feature introduced in PostgreSQL 9.5. This will allow us to have multiple postgres connections calling the same procedure, but each being allocated a different set of records to work on so there is no lock contention.
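You can see the SKIP LOCKED behavior for yourself with two psql sessions; a rough sketch:

```sql
-- Session 1: claim the first five unprocessed rows and hold the locks
BEGIN;
SELECT addid FROM addr_to_geocode
 WHERE rating IS NULL ORDER BY addid LIMIT 5
   FOR UPDATE SKIP LOCKED;

-- Session 2: the same query does not block; it skips the locked rows
-- and claims the NEXT five, so each caller works on a disjoint set.
```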

CREATE OR REPLACE PROCEDURE batch_geocode()
LANGUAGE plpgsql AS 
$$
BEGIN
    WHILE EXISTS (SELECT 1 FROM addr_to_geocode WHERE rating IS NULL) LOOP
    -- we use FOR UPDATE SKIP LOCKED to prevent multiple concurrent procedure calls 
    -- from grabbing the same set of records
        WITH a AS ( SELECT addid, address FROM addr_to_geocode 
                      WHERE rating IS NULL ORDER BY addid LIMIT 5 
                        FOR UPDATE SKIP LOCKED)
        UPDATE addr_to_geocode
            SET (rating, new_address, pt)
            = (COALESCE(g.rating,-1),
                COALESCE ((g.addy).address::text, '')
                || COALESCE(' ' || (g.addy).predirabbrev, '')
                || COALESCE(' ' || (g.addy).streetname,'')
                || COALESCE(' ' || (g.addy).streettypeabbrev, '')
                || COALESCE(' ' || (g.addy).location || ', ', '')
                || COALESCE(' ' || (g.addy).stateabbrev, '')
                || COALESCE(' ' || (g.addy).zip, '')
            ,
            ST_SetSRID(g.geomout,4326)::geography
            )
           FROM (SELECT addid, (gc).rating, (gc).addy, (gc).geomout
                  FROM a
                  LEFT JOIN LATERAL geocode(address,1) AS gc ON (true)
                ) AS g
           WHERE g.addid = addr_to_geocode.addid;
           
         COMMIT;
    END LOOP;
    RETURN;
END;
$$;

To call the procedure, you would use the new CALL statement like so:

CALL batch_geocode();

You'll notice that if you then cancel the procedure (e.g. by clicking the stop icon in pgAdmin), the geocoding that has already been done before canceling the call is not lost, as it would be if this were a function.
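Because each batch is committed as it completes, you can also watch progress from a second connection while the CALL is still running, for example:

```sql
-- Rows already geocoded vs. total, visible mid-run from another session
SELECT count(*) FILTER (WHERE rating IS NOT NULL) AS geocoded,
       count(*) AS total
FROM addr_to_geocode;
```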


Basics

 

Unpivoting data using JSON functions



Most of our use-cases for the built-in JSON support in PostgreSQL are not about implementing schemaless storage, but about remolding data. Remolding can take the form of restructuring data into JSON documents suitable for web maps, JavaScript charting web apps, or datagrids. It also has uses beyond just outputting data in JSON form. The functions are also useful for unraveling JSON data into a more meaningful relational form.

One of the common cases where we use JSON support is what we call UNPIVOTING data. We demonstrated this in our Postgres Vision 2018 presentation in slide 23. This trick won't work in other relational databases that support JSON, because it also relies on a long-existing PostgreSQL feature: the ability to treat a row as a data field.

Unpivoting data is the process of converting columns to rows. It's the reverse of a cross tabulation (aka PIVOT). So why would you ever need to unpivot data?

Take for example if you had a spreadsheet where your HR person decided to input each employee's hours in columns, where each column represents a separate week.

The spreadsheet data might look something like this:

Employee | 2018-05-04 | 2018-05-11 | 2018-05-18 | 2018-05-25
Regina   |         70 |         40 |         35 |         65
Leo      |         50 |         80 |         45 |         65
Caitlin  |         40 |         25 |         35 |         35
Jane     |         20 |         25 |         15 |         18

If you are using a spreadsheet as your data entry and presentation tool, then this is a great setup. But for storing the data in a relational database and doing arbitrary date summary aggregations, it's easier to work with the data in a format such as the one below: it's more flexible in how you roll up date ranges, and you don't have to add a new column to your table every time a new week comes along.

Employee | PPE        | hours
Regina   | 2018-05-04 |    70
Regina   | 2018-05-11 |    40
Regina   | 2018-05-18 |    35
Regina   | 2018-05-25 |    65
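For instance, with the long format an arbitrary rollup, such as monthly totals, is just a GROUP BY (assuming the staff_hours table used in the insert statement below):

```sql
-- Monthly hours per employee from the long-format table
SELECT employee, date_trunc('month', ppe)::date AS month, sum(hours) AS hours
FROM staff_hours
GROUP BY employee, date_trunc('month', ppe)
ORDER BY employee, month;
```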

Note that you can accomplish this trick using the hstore extension as well, as we covered a while back in Unpivoting data in PostgreSQL. Since the JSON/JSONB functions are built in, no CREATE EXTENSION is required to use them.
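The query below references hours_raw and staff_hours tables that aren't defined in this excerpt; shapes along these lines are assumed (column types are our guesses from the query):

```sql
-- Assumed source table: one column per pay-period-ending date
CREATE TABLE hours_raw("Employee" text PRIMARY KEY,
    "2018-05-04" integer, "2018-05-11" integer,
    "2018-05-18" integer, "2018-05-25" integer);

-- Assumed target table in long format
CREATE TABLE staff_hours(employee text, ppe date, hours numeric,
    PRIMARY KEY (employee, ppe));
```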

INSERT INTO staff_hours(employee, ppe, hours)
SELECT hours_raw."Employee" As employee,  j.key::date AS ppe,  j.val::numeric As hours
FROM hours_raw, LATERAL jsonb_each_text(to_jsonb(hours_raw)) AS j(key,val)
WHERE j.key NOT IN('Employee');

The above query is composed of 4 steps:

  • A row in a query (not just in a table, though we are only showing a table here) can be output as a single composite value. In this case we are going to use our hours_raw table as a set of elements of type hours_raw and feed it to the to_jsonb function.
  • to_jsonb (and its companion to_json) can convert any element (including row elements) into a JSON representation.
  • The jsonb_each_text function returns a set of key/value pairs. When used in conjunction with LATERAL (which is the default for set-returning functions used in FROM), it generates a new set of rows for each row of hours_raw.
  • For this example we assume all columns except for Employee are hour columns whose header is a pay period ending (PPE) date. We only want to return the key/values that do not correspond to the employee name.
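To see the intermediate step, you can run to_jsonb on its own; each row comes back as one JSON document whose keys are the column headers, along the lines of:

```sql
SELECT to_jsonb(hours_raw) FROM hours_raw WHERE "Employee" = 'Regina';
-- {"Employee": "Regina", "2018-05-04": 70, "2018-05-11": 40,
--  "2018-05-18": 35, "2018-05-25": 65}
```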

Using PostgreSQL Contribs

 

http extension for windows 64 and 32-bit



Updated October 19th, 2018 64-bit package for PostgreSQL 11 http extension v1.2.4 release

For those folks on windows who want to do http GETs and POSTs directly from their PostgreSQL server, we've made binaries for the http extension v1.2.2 release for PostgreSQL 10, 9.6, 9.5, and 9.4 Windows 64-bit.

These should work fine on both the PostgreSQL EDB windows distributions as well as the BigSQL windows 64-bit distributions. Instructions are in the ReadMe.txt for where to copy the files for each distribution. Caution should be taken, especially with BigSQL, not to overwrite existing BigSQL files. Many of the files in this package already exist in the BigSQL distribution, since we both compile using the Mingw64 chain.

If you have PostGIS already installed, you will already have many of these files, since things like libcurl and PCRE are also packaged with PostGIS.

Updated November 22nd, 2017 all packages updated to http extension v1.2.2 release

http extension binaries for PostgreSQL 11, 10, 9.6, 9.5, and 9.4 windows 64-bit downloads

The curl library for http is built with SSL support and utilizes the ssleay32.dll packaged with the EDB and BigSQL installs.

http extension binaries for PostgreSQL 10, 9.6, 9.5, and 9.4 windows 32-bit downloads

http quick primer

To enable it in a database after having installed the binaries:

CREATE EXTENSION http;

Do a basic get

SELECT h.content, h.content_type, hkv.value AS dt
FROM http_get('http://postgis.net/tips/') AS h 
    LEFT JOIN LATERAL (SELECT *
                         FROM unnest(h.headers)
                        WHERE field = 'Date') AS hkv ON true;
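The extension handles POSTs the same way; a sketch against the httpbin.org echo service (the URL and form payload are just for illustration):

```sql
SELECT status, content_type, content
FROM http_post('http://httpbin.org/post',
               'a=1&b=2',
               'application/x-www-form-urlencoded');
```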

Check out more examples at: https://github.com/pramsey/pgsql-http


Using PostgreSQL Contribs

 

FDWS for PostgreSQL Windows 32 and 64-bit



We've updated our binaries for PostgreSQL 10 windows, both 32 and 64-bit. The 64-bit should work fine with EnterpriseDb windows as well as BigSQL.

Note this package does not include our favorite FDW ogr_fdw since that is packaged as part of PostGIS packages from EnterpriseDb Stackbuilder and also BigSQL distributions.

If you do not have PostGIS installed (and don't want to for some reason) and want to use ogr_fdw on windows, you can download it from Winnie's PG 10 Extras. ogr_fdw is a great FDW for querying not just spatial data, but also a ton of other file formats and relational databases, since spatial formats are a superset of these.

These packages contain the following FDWs:

  • odbc_fdw (recently updated to work with PostgreSQL 10)
  • file_textarray_fdw, great and fast for working with arbitrary and pesky delimited data, especially where they weren't considerate enough to give you the same number of columns per row.
  • WWW FDW exposes web services as queryable tables. I'll note we haven't really used this since 9.4, because from 9.5 onward we had some random crashes we haven't investigated. That said, the http extension (Windows binaries here) is a better alternative, especially for newer PostgreSQL. We'll do another article soon on the new features of the http extension, which is pretty sweet if you have a need to use web services (even ones that require authentication). We've used it for pushing our client data to SalesForce as well as posting and pulling down files from other third-party sources our clients are subscribed to. Both the http extension and www_fdw rely on Curl for their heavy lifting.

Using PostgreSQL Contribs

 

PostgresVision 2018 Slides and Impressions



Leo and I attended PostgresVision 2018 which ended a couple of days ago.

We gave a talk on spatial extensions with main focus being PostGIS. Here are links to our slides PostgresVision2018_SpatialExtensions HTML version PDF.

Unfortunately there are no slides of the pgRouting part, except the one that says PGRouting Live Demos because Leo will only do live demos. He has no fear of his demos not working.

Side note, if you are on windows and use the PostGIS bundle, all the extensions listed in the PostGIS box of the spatial extensions diagram, as well as the pointcloud, pgRouting, and ogr_fdw are included in the bundle.

Postgres Vision 2018 was surprisingly good. We feared the talks would be mostly focused on upper management folks, but there turned out to be quite a few very technical ones to chew on as well, such as sharding, JIT, WAL, and ZHeap to name some. Though Bruce Momjian's talk, Why Postgres will live forever, wasn't that technical, it was the best we saw. It was inspiring, comical, and full of computer history tidbits. I love computer history.

Most exciting, I got to meet fellow package maintainer Devrim Gündüz for the first time. Devrim manages yum.postgresql.org and as such does packaging for PostGIS in addition to the other 100s of packages he maintains. I only do packaging for PostGIS bundle for windows and even that feels like a lot.


Using PostgreSQL Contribs

 

PGOpen 2018 Data Loading Presentation Slides



At PGOpen 2018 in San Francisco, we gave a talk on 10 ways to load data into Postgres. This is one of the rare talks where we didn't talk much about PostGIS. However we did showcase the tools ogr_fdw, ogr2ogr, and shp2pgsql, which are commonly used for loading spatial data but equally good for loading non-spatial data. Below are the slide links.