Postgres OnLine Journal: Jan 2018 - Dec 2018
An in-depth Exploration of the PostgreSQL Open Source Database
 

Table Of Contents

What's new and upcoming in PostgreSQL
Basics
Using PostgreSQL Contribs

What's new and upcoming in PostgreSQL

 

Using procedures for batch geocoding and other batch processing



One of the features we are looking forward to in the upcoming PostgreSQL 11 is the introduction of procedures via the ANSI-SQL CREATE PROCEDURE construct. The major benefit that sets procedures apart from functions is that procedures are not wrapped in an outer transaction and can have COMMITs within them. This means it's not all-or-nothing as it is with functions. Even if you stop a procedure mid-run, whatever work has been done and committed is saved. In the case of functions, a stop or failure would roll back all the work. It also means you can see the work in progress of a stored procedure, since that work will already have been committed. This is a huge benefit for batch processing. Batch processing covers a lot of use-cases of PostGIS users, since a good chunk of PostGIS work involves some kind of batch processing of data you get from third parties or machines.

We'll demonstrate with batch geocoding, using an approach similar to the one we discussed in psql watch for batch processing. Folks often resort to doing this stuff outside of the database, such as using Python as a driver to push batches to PostgreSQL. Now PostgreSQL is capable of doing it all.

In addition to the code and data here, you'll need the postgis_tiger_geocoder extension installed and loaded with MA census data.

Why batch process data?

For geocoding of data, or any kind of lengthy processing work, you often do it in batches of say 1, 10, or 50 records, but never the whole 1,000,000 at once. The reasons for that are:

  • For each update statement, PostgreSQL needs to be able to roll back the update. If your server goes down for whatever reason mid-update, the uncommitted work is lost. A million records of work is a lot to lose.
  • Since PostgreSQL needs to be able to roll back, all the workspace needed to support a rollback is occupied and eats up memory. In that sense you may not even have the memory available to handle a batch of 1,000,000 in one bite.

You can't put your batch work in a PostgreSQL function, at least not all of it, because each function has an implicit transaction wrapped around it. This is one of the reasons why trying to do it in a function can take 10 or more times longer than embedding the logic in a script that pushes the work in smaller commits. Now we've got PostgreSQL procedures, where we can put in pretty much the same code as we would have done in our function (except now we can have COMMITs), and can explicitly commit after each batch. The great thing about this is that if your connection is lost, the server crashes, you lose power, or whatever, the work that has already been committed is not lost. It also means the memory is freed after each commit, rather than being held until all the work in the procedure is done.
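A minimal sketch of the difference (the work_log table and demo_commit procedure are made up for illustration): a COMMIT inside a procedure body is legal and makes each iteration's work durable on its own, whereas the same COMMIT inside a function would raise an error.

```sql
-- Hypothetical illustration only; the real batch logic comes later in this article.
CREATE TABLE work_log(batch integer, done_at timestamptz DEFAULT now());

CREATE PROCEDURE demo_commit()
LANGUAGE plpgsql AS
$$
BEGIN
    FOR i IN 1..3 LOOP
        INSERT INTO work_log(batch) VALUES (i);
        COMMIT;  -- legal in a procedure: batch i is now durable,
                 -- even if iteration i+1 fails or the call is canceled
    END LOOP;
END;
$$;
```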

Test Data Set

Our test dataset is small so not very interesting. To simulate more records, we repeat each address 1000 times.

DROP TABLE IF EXISTS addr_to_geocode ;
CREATE TABLE addr_to_geocode(addid serial PRIMARY KEY, address text,
    pt geography(point, 4326), new_address text, rating integer);

INSERT INTO addr_to_geocode(address)
VALUES ('529 Main Street, Boston MA, 02129'),
 ('77 Massachusetts Avenue, Cambridge, MA 02139'),
 ('25 Wizard of Oz, Walaford, KS 99912323'),
 ('26 Capen Street, Medford, MA'),
 ('124 Mount Auburn St, Cambridge, Massachusetts 02138'),
 ('950 Main Street, Worcester, MA 01610');
 
-- repeat address to get more data
INSERT INTO addr_to_geocode(address)
SELECT address 
FROM addr_to_geocode, generate_series(1,1000);

The batch processing procedure

To create our procedure, we are also going to take advantage of the FOR UPDATE SKIP LOCKED feature introduced in PostgreSQL 9.5. This will allow us to have multiple postgres connections calling the same procedure, but each being allocated a different set of records to work on so there is no lock contention.
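You can see the SKIP LOCKED behavior for yourself with two psql sessions; a rough sketch:

```sql
-- Session 1: claim the first five unprocessed rows and hold the locks
BEGIN;
SELECT addid FROM addr_to_geocode
 WHERE rating IS NULL ORDER BY addid LIMIT 5
   FOR UPDATE SKIP LOCKED;

-- Session 2: the same query does not block; it skips the locked rows
-- and claims the NEXT five, so each caller works on a disjoint set.
```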

CREATE OR REPLACE PROCEDURE batch_geocode()
LANGUAGE plpgsql AS 
$$
BEGIN
    WHILE EXISTS (SELECT 1 FROM addr_to_geocode WHERE rating IS NULL) LOOP
    -- we use FOR UPDATE SKIP LOCKED to prevent multiple concurrent procedure calls 
    -- from grabbing the same set of records
        WITH a AS ( SELECT addid, address FROM addr_to_geocode 
                      WHERE rating IS NULL ORDER BY addid LIMIT 5 
                        FOR UPDATE SKIP LOCKED)
        UPDATE addr_to_geocode
            SET (rating, new_address, pt)
            = (COALESCE(g.rating,-1),
                COALESCE ((g.addy).address::text, '')
                || COALESCE(' ' || (g.addy).predirabbrev, '')
                || COALESCE(' ' || (g.addy).streetname,'')
                || COALESCE(' ' || (g.addy).streettypeabbrev, '')
                || COALESCE(' ' || (g.addy).location || ', ', '')
                || COALESCE(' ' || (g.addy).stateabbrev, '')
                || COALESCE(' ' || (g.addy).zip, '')
            ,
            ST_SetSRID(g.geomout,4326)::geography
            )
           FROM (SELECT addid, (gc).rating, (gc).addy, (gc).geomout
                  FROM a
                  LEFT JOIN LATERAL geocode(address,1) AS gc ON (true)
                ) AS g
           WHERE g.addid = addr_to_geocode.addid;
           
         COMMIT;
    END LOOP;
    RETURN;
END;
$$;

To call the procedure, you would use the new CALL statement like so:

CALL batch_geocode();

You'll notice that if you then cancel the procedure (e.g. by clicking the stop icon in pgAdmin), the geocoding that has already been done before canceling the call is not lost, as it would be if this were a function.
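Because each batch is committed as it completes, you can also watch progress from a second connection while the CALL is still running, for example:

```sql
-- Rows already geocoded vs. total, visible mid-run from another session
SELECT count(*) FILTER (WHERE rating IS NOT NULL) AS geocoded,
       count(*) AS total
FROM addr_to_geocode;
```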


Basics

 

Unpivoting data using JSON functions



Most of our use-cases for the built-in JSON support in PostgreSQL are not about implementing schemaless storage, but about remolding data. Remolding can take the form of restructuring data into JSON documents suitable for web maps, JavaScript charting web apps, or datagrids. It also has uses beyond just outputting data in JSON form. The functions are also useful for unraveling JSON data into a more meaningful relational form.

One of the common cases where we use JSON support is what we call UNPIVOTING data. We demonstrated this in our Postgres Vision 2018 presentation in slide 23. This trick won't work in other relational databases that support JSON, because it also relies on a long-existing PostgreSQL feature: the ability to treat a row as a data field.

Unpivoting data is the process of converting columns to rows. It's the reverse of a cross tabulation (aka PIVOT). So why would you ever need to unpivot data?

Take for example if you had a spreadsheet where your HR person decided to input each employee's hours in columns, where each column represents a separate week.

The spreadsheet data might look something like this:

Employee | 2018-05-04 | 2018-05-11 | 2018-05-18 | 2018-05-25
Regina   |         70 |         40 |         35 |         65
Leo      |         50 |         80 |         45 |         65
Caitlin  |         40 |         25 |         35 |         35
Jane     |         20 |         25 |         15 |         18

If you are using a spreadsheet as your data entry and presentation tool, then this is a great setup. But for storing the data in a relational database and doing arbitrary date summary aggregations, it's easier to work with the data in a format such as the one below: it's more flexible in how you roll up date ranges, and you don't have to add a new column to your table every time a new week comes along.

Employee | PPE        | hours
Regina   | 2018-05-04 |    70
Regina   | 2018-05-11 |    40
Regina   | 2018-05-18 |    35
Regina   | 2018-05-25 |    65
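For instance, with the long format an arbitrary rollup, such as monthly totals, is just a GROUP BY (assuming the staff_hours table used in the insert statement below):

```sql
-- Monthly hours per employee from the long-format table
SELECT employee, date_trunc('month', ppe)::date AS month, sum(hours) AS hours
FROM staff_hours
GROUP BY employee, date_trunc('month', ppe)
ORDER BY employee, month;
```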

Note that you can accomplish this trick using the hstore extension as well, as we covered a while back in Unpivoting data in PostgreSQL. Since the JSON/JSONB functions are built in, no CREATE EXTENSION is required to use them.
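The query below references hours_raw and staff_hours tables that aren't defined in this excerpt; shapes along these lines are assumed (column types are our guesses from the query):

```sql
-- Assumed source table: one column per pay-period-ending date
CREATE TABLE hours_raw("Employee" text PRIMARY KEY,
    "2018-05-04" integer, "2018-05-11" integer,
    "2018-05-18" integer, "2018-05-25" integer);

-- Assumed target table in long format
CREATE TABLE staff_hours(employee text, ppe date, hours numeric,
    PRIMARY KEY (employee, ppe));
```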

INSERT INTO staff_hours(employee, ppe, hours)
SELECT hours_raw."Employee" As employee,  j.key::date AS ppe,  j.val::numeric As hours
FROM hours_raw, LATERAL jsonb_each_text(to_jsonb(hours_raw)) AS j(key,val)
WHERE j.key NOT IN('Employee');

The above query is composed of 4 steps:

  • A row in a query (not just in a table, though we are only showing a table here) can be output as a single composite value. In this case we are going to use our hours_raw table as a set of elements of type hours_raw and feed it to the to_jsonb function.
  • to_jsonb (and its companion to_json) can convert any element (including row elements) into a JSON representation.
  • The jsonb_each_text function returns a set of key/value pairs. When used in conjunction with LATERAL (which is the default for set-returning functions used in FROM), it generates a new set of rows for each row of hours_raw.
  • For this example we assume all columns except for Employee are hour columns whose header is a pay period ending (PPE) date. We only want to return the key/values that do not correspond to the employee name.
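To see the intermediate step, you can run to_jsonb on its own; each row comes back as one JSON document whose keys are the column headers, along the lines of:

```sql
SELECT to_jsonb(hours_raw) FROM hours_raw WHERE "Employee" = 'Regina';
-- {"Employee": "Regina", "2018-05-04": 70, "2018-05-11": 40,
--  "2018-05-18": 35, "2018-05-25": 65}
```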

Using PostgreSQL Contribs

 

http extension for windows 64 and 32-bit



Updated October 19th, 2018 64-bit package for PostgreSQL 11 http extension v1.2.4 release

For those folks on windows who want to do http GETs and POSTs directly from their PostgreSQL server, we've made binaries for the http extension v1.2.2 release for PostgreSQL 10, 9.6, 9.5, and 9.4 Windows 64-bit.

These should work fine on both the PostgreSQL EDB windows distributions as well as the BigSQL windows 64-bit distributions. Instructions are in the ReadMe.txt for where to copy the files for each distribution. Caution should be taken, especially with BigSQL, not to overwrite existing BigSQL files. Many of the files in this package already exist in the BigSQL distribution, since we both compile using the Mingw64 chain.

If you have PostGIS already installed, you will already have many of these files, since things like libcurl and PCRE are also packaged with PostGIS.

Updated November 22nd, 2017 all packages updated to http extension v1.2.2 release

http extension binaries for PostgreSQL 11, 10, 9.6, 9.5, and 9.4 windows 64-bit downloads

The curl library for http is built with SSL support and utilizes the ssleay32.dll packaged with the EDB and BigSQL installs.

http extension binaries for PostgreSQL 10, 9.6, 9.5, and 9.4 windows 32-bit downloads

http quick primer

To enable it in a database after having installed the binaries:

CREATE EXTENSION http;

Do a basic get

SELECT h.content, h.content_type, hkv.value AS dt
FROM http_get('http://postgis.net/tips/') AS h 
    LEFT JOIN LATERAL (SELECT *
                         FROM unnest(h.headers)
                        WHERE field = 'Date') AS hkv ON true;
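The extension handles POSTs the same way; a sketch against the httpbin.org echo service (the URL and form payload are just for illustration):

```sql
SELECT status, content_type, content
FROM http_post('http://httpbin.org/post',
               'a=1&b=2',
               'application/x-www-form-urlencoded');
```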

Check out more examples at: https://github.com/pramsey/pgsql-http


Using PostgreSQL Contribs

 

FDWS for PostgreSQL Windows 32 and 64-bit



We've updated our binaries for PostgreSQL 10 windows, both 32 and 64-bit. The 64-bit should work fine with EnterpriseDb windows as well as BigSQL.

Note this package does not include our favorite FDW ogr_fdw since that is packaged as part of PostGIS packages from EnterpriseDb Stackbuilder and also BigSQL distributions.

If you do not have PostGIS installed (and don't want to for some reason) and want to use ogr_fdw on windows, you can download it from Winnie's PG 10 Extras. ogr_fdw is a great FDW for querying not just spatial data, but also a ton of other file formats and relational databases, since spatial formats are a superset of these.

These packages contain the following FDWs:

  • odbc_fdw (recently updated to work with PostgreSQL 10)
  • file_textarray_fdw, great and fast for working with arbitrary and pesky delimited data, especially where they weren't considerate enough to give you the same number of columns per row.
  • WWW FDW exposes web services as queryable tables. I'll note we haven't really used this since 9.4, because from 9.5 onward we had some random crashes we haven't investigated. That said, the http extension (Windows binaries here) is a better alternative, especially for newer PostgreSQL. We'll do another article soon on the new features of the http extension, which is pretty sweet if you have a need to use web services (even ones that require authentication). We've used it for pushing our client data to SalesForce as well as posting and pulling down files from other third-party sources our clients are subscribed to. Both the http extension and www_fdw rely on Curl for their heavy lifting.

Using PostgreSQL Contribs

 

PostgresVision 2018 Slides and Impressions



Leo and I attended PostgresVision 2018 which ended a couple of days ago.

We gave a talk on spatial extensions with main focus being PostGIS. Here are links to our slides PostgresVision2018_SpatialExtensions HTML version PDF.

Unfortunately there are no slides of the pgRouting part, except the one that says PGRouting Live Demos because Leo will only do live demos. He has no fear of his demos not working.

Side note, if you are on windows and use the PostGIS bundle, all the extensions listed in the PostGIS box of the spatial extensions diagram, as well as the pointcloud, pgRouting, and ogr_fdw are included in the bundle.

Postgres Vision 2018 was surprisingly good. We feared the talks would be mostly focused on upper management folks, but there turned out to be quite a few very technical ones to chew on as well, such as sharding, JIT, WAL, and ZHeap to name some. Though Bruce Momjian's talk, Why Postgres will live forever, wasn't that technical, it was the best we saw. It was inspiring, comical, and full of computer history tidbits. I love computer history.

Most exciting, I got to meet fellow package maintainer Devrim Gündüz for the first time. Devrim manages yum.postgresql.org and as such does packaging for PostGIS in addition to the other 100s of packages he maintains. I only do packaging for PostGIS bundle for windows and even that feels like a lot.


Using PostgreSQL Contribs

 

PGOpen 2018 Data Loading Presentation Slides



At PGOpen 2018 in San Francisco, we gave a talk on 10 ways to load data into Postgres. This is one of the rare talks where we didn't talk much about PostGIS. However we did showcase the tools ogr_fdw, ogr2ogr, and shp2pgsql, which are commonly used for loading spatial data but equally good for loading non-spatial data. Below are the slide links.