Paris OSGeo Code Sprint 2016 Highlights

Leo and I attended the Paris OSGeo Code Sprint at Mozilla Foundation put together by Oslandia and funded by several companies. It was a great event. There was quite a bit of PostGIS related hacking that happened by many new faces. We have detailed at BostonGIS: OSGeo Code Sprint 2016 highlights some of the more specific PostGIS hacking highlights. Giuseppe Broccolo of 2nd Quadrant already mentioned BRIN for PostGIS: my story at the Code Sprint 2016 in Paris.

ODBC enhancements to GDAL and OGR FDW

One of the things high on our laundry list, was improving performance of the ODBC driver in ogr_fdw, which by extension means improving performance of ODBC driver in GDAL. We've been using ogr_fdw quite a bit, mostly for non-spatial uses.

While I've always had a fondness for the GDAL, which grew a bit with PostGIS raster dependency on GDAL, the onset of ogr_fdw foreign data wrapper means I really love GDAL now because I can bring this swiss army data tool right into my database to serve non-spatial workloads as well. An ETL driven by the power of PostgreSQL. What could be better? Anyway I think my interest got a bit infectious because one guy named Jack came on PostGIS mailing list talking about ODBC in OGR FDW. It was slow for him and I rightfully concluded he had a ton of tables in his SQL Server. I had discovered this myself a while back when trying to pull data from a very large SQL Server database, to which I had to overcome the issue by itemizing the tables I wanted. The fundamental issue is that the GDAL ODBC driver does a load of all table meta data on connection unless you explicitly state in the connection string which you want. Since the connection is called for every query in OGR_FDW, this can be like 5 minutes added to your query time even if you just wanted to query 1 row of a small table in a large database. This was the main thing I at least wanted to know how to start tackling. I pointed at the section in ogr_fdw where the query goes south and traced thru the ODBC gdal driver call. Paul Ramsey (author of ogr_fdw) had some ideas of how the ogr_fdw code could be changed to hold on to the ogr connection instead of opening it up on each query call.

Even Roualt who is the chief developer now of GDAL was at the sprint too, his writeup is here. He was very busy as you can see, so I didn't want to take up too much of his time. He was nice enough though to come over and point at the sections of the PostgreSQL driver that are useful to copy for improving the ODBC driver. I sadly haven't gotten too deep into that yet. The first thing that was nagging me that I needed to fix before jumping into the ODBC driver, was to fix the GDAL configure script, so I could build Native Windows ODBC support under mingw64 without having to hack the configure script each time. This is something I had promised to do like 7 months ago, but had run into a snaffu with my mingw64-w32 env still not liking my revised configure that seemed to work fine under mingw64-w64. I'm happy to say, I've overcome that issue and finally done it and it was my first non-trivial GDAL patch and was accepted for both GDAL 2.0 and GDAL trunk (2.1). So now on to the harder work of improving the ODBC driver.

Two on my todo for the ODBC driver, are to

Raster data and materialized views restore and foreign table gotchas

Ronan Dunklau of Dalibo came up to me to discuss the sorry state of restoring PostGIS raster data and reminded me how I was going to do something about this and how it also affected restoring some materialized views and some cases of PostGIS geometry (like spatial indexes that use ST_Transform). The fundamental issue is that PostGIS, like many other extensions doesn't enforce a schema in which PostGIS should be installed in, and we have many functions that call other functions. Since pg_restore always sets the search_path = pg_catalog, schema_table_view - your data does not come back if PostGIS is not installed in the same schema as your data and restoration of your data requires calling PostGIS functions that call other PostGIS functions during load. Several raster table check constraint functions fall into this bucket, but they are not the only ones. I suspect we are the only extension suffering from this problem because we have more functions than any other extension and might be the only extension that has functions calling other functions.

I had a plan, but my plan would not handle ALTER EXTENSION .. SET search_path. He proposed the interesting idea of using event triggers to handle that situation. With that encouragement, I started hacking away at the problem with a proof of concept that contained hand-coded ALTER FUNCTION and Paul's query to figure out schema PostGIS is installed in. When I got back home I replaced with a perl script, encouraged by Dan Baston who said, "Why don't you just do it for all functions". Dan was also suffering database restore problems because of this issue and he doesn't even use raster. So I've done my first non-trivial perl script.

For PostGIS 2.2, the perl script, generates an SQL file that sets search path of all functions except for those used in relationships that require SQL inlining behavior. I discovered the unpleasant truth that doing - ALTER FUNCTION set search_path=... destroys SQL inlining behavior, which we need to pull off having things like our ST_Intersects function utilize a spatial index. So sadly raster data restore issue is fixed, but materialized views and foreign table calls that utilize ST_Intersects that use ST_Intersects, ST_covers etc are not fixed.

For PostGIS 2.3, I've folded this script into the extension script, so that the search path setting thing happens automagically. I plan to experiment with Ronan's idea of using EVENT TRIGGERS to see if it works well or creates more problems than it solves.