Import Foreign Schema hack with OGR_FDW and reading LibreOffice calc workbooks

PostgreSQL 9.4 and below doesn't support importing whole set of tables from a FOREIGN server, but PostgreSQL 9.5 does with the upcoming Import Foreign Schema. To use will require FDW wrapper designers to be aware of this feature and use the plumbing in their wrappers. IMPORT FOREIGN SCHEMA for ogr_fdw come PostgreSQL 9.5 release is on the features ticket list.

UPDATE: If you are using PostgreSQL 9.5+, you can use the IMPORT FOREIGN SCHEMA feature which is available in ogr_fdw 1.0.1+. We demonstrate this in: ogr fdw IMPORT FOREIGN SCHEMA.

The ogr_fdw comes with this to die for commandline utility called ogr_fdw_info that does generate the table structures for you and will also list all the tables in the Foreign data source if you don't give it a specific table name. So with this utility I wrote a little hack involving using PostgreSQL COPY PROGRAM feature to call out to the ogr_fdw_info commandline tool to figure out the table names and some DO magic to create the tables.

Though ogr_fdw is designed to be a spatial foreign data wrapper, it's turning out to be a pretty nice non-spatial FDW as well especially for reading spreadsheets which we seem to get a lot of. This hack I am about to demonstrate I am demonstrating with LibreOffice/OpenOffice workbook, but works equally well with Excel workbooks and most any data source that OGR supports.

This hack came in really handy for processing workbooks (with several spreadsheets e.g. an Excel or LibreOffice workbook) and SQL Server tables. Sadly I tried it on my MS Access data source, and while it could read the list of tables, PostgreSQL returned blank text when trying to grab the table structure. This was really weird considering from the commandline I can see the table structure. Haven't figured out what caused that. I blame it on some funky character MS Access is spitting out that PostgreSQL is reading as a black hole character or some weird permission issue with reading MS access catalogs from within PostgreSQL. The commandline worked fine using directly even for MS Access and reading the table once the foreign table was created worked fine as well.

The ogr_fdw_info wrapper function

The wrapper function just spits out the SQL to do the work. The reason I have it not do the work is because depending on my needs, I may say want to prefix the table names with something or change the data types it emits. For example in SQL Server, unless the timestamp field is the last field in the table, ogr_fdw doesn't handle it right. I still have to investigate this one. So where it says timestamp, for SQL Server I swap it out for varchar.

Note that I also hard-coded the executable path in there which is very specific to where it will end up if you are running PostgreSQL 9.4 on windows.

Some things to be cognizant of

2016-01-03 - ogr_fdw_table function updated to prevent COPY FROM from mangling layer names with special characters in it. Done by forcing to text with csv and using | (as delimiter which wouldn't exist in most datasources so prevents , from being used which could be found in name of data source.
CREATE OR REPLACE FUNCTION ogr_fdw_sql_table(
    data_source text,
    layer text DEFAULT ''::text)
  RETURNS text AS
$$
BEGIN
        DROP TABLE IF EXISTS ogr_fdw_out;
        CREATE TEMP TABLE ogr_fdw_out(out text);
        IF layer > '' THEN 
        EXECUTE 'COPY ogr_fdw_out FROM PROGRAM ''"C:/Program Files/PostgreSQL/9.4/bin/ogr_fdw_info" -s "' 
            || data_source || '" -l "' 
            || layer || '"'' WITH (format ''csv'', delimiter ''|'')';
        ELSE
                EXECUTE 'COPY ogr_fdw_out FROM PROGRAM ''"C:/Program Files/PostgreSQL/9.4/bin/ogr_fdw_info" -s "' 
                || data_source 
                || '"''  WITH (format ''csv'', delimiter ''|'')';
        END IF;
        RETURN (SELECT string_agg(out,E'\n') from ogr_fdw_out);
END;

$$
  LANGUAGE plpgsql VOLATILE SECURITY DEFINER;

For demonstrating this little puppy in action, I need a workbook with several spreadsheets to do it justice. So I downloaded a LibreOffice spreadsheet template from Monthly Home Budget Workbook and created a new workbook called Budget2015.ods from that. Note that LibreOffice, OpenOffice, and the Microsoft Excel 2007 and above (the .xlsx) format all are some variant of compressed XML. I will reiterate that what data sources you can read with ogr_fdw is dictated by your libgdal library and what support it is compiled with. In the case of reading LibreOffice/OpenOffice workbooks and the newer Microsoft XLSX files, your GDAL library needs to be compiled with Expat. In order to read older Excel files (xls format), you need to have your libgdal built with FreeXL support. The windows builds we created and detailed in PostgreSQL 9.4 bag-o-fdws are built with Expat and FreeXL (and as you can see, offer xlsx and ods support) and xls.

Reading LibreOffice Calc workbooks (ODS) direclty from PostgreSQL

So to test out, I throw my new fangled budget workbook in my FDW folder and ran this little statement.

SELECT ogr_fdw_sql_table('C:/fdw_data/Budget2015.ods');

And the output is this beautiful thing. Now as a non-spatial person, you have to get over the hump that Layer means Table as far as you are concerned

Layers:
  Dashboard
  Form 2 - Equity
  Form 3 - Income
  Form 4 - Lump Sum Planning
  Form 5 - Monthly Cash Flow
  Form 7 - Allocation
  Form 8 - Spending Details
  Form 9 - Work Expenses
  Debt Analysis
  Categories and Items

So we've got a couple of spreadsheets (tables, layers) in this workbook. I'm going to demonstrate how to see the structure with 2 different widely different spreadsheets just to demonstrate a particular behavior of the OGR ODS driver (the XLSX and XLS drivers seem to behave the same as far as I can tell) when faced with something that is not quite tabular looking (or lacking headers) vs. is tabular.

SELECT ogr_fdw_sql_table('C:/fdw_data/Budget2015.ods', 'Dashboard');

The output of the Dashboard layer structure is:

CREATE SERVER myserver
  FOREIGN DATA WRAPPER ogr_fdw
  OPTIONS (
    datasource 'C:/fdw_data/Budget2015.ods',
    format 'ODS' );

CREATE FOREIGN TABLE dashboard (
  fid integer,
  field1 varchar,
  field2 varchar,
  field3 real,
  field4 varchar,
  field5 varchar,
  field6 varchar,
  field7 varchar )
  SERVER myserver
  OPTIONS ( layer 'Dashboard' );

Now if you open up the workbook in LibreOffice Calc and look at the dashboard sheet, you'll see it's got charts and rows of data with no header. It has no header for the table. The only thing that is sure is that column 3 has got numbers. So since OGR needs to ascibe a column name and can't infer one from the top row, it just calls them field1.. fieldn. OGR always adds an additional column, called fid at the beginning which is really a row number identifier. You'll also sometimes see a geom column as the second column if it thinks it's a spatial datasource.

If you were to do the same exercise using:

SELECT ogr_fdw_sql_table('C:/fdw_data/Budget2015.ods','Form 4 - Lump Sum Planning');

You'd get the same CREATE SERVER command since they come from the same workbook, but your CREATE FOREIGN TABLE would look like this:

CREATE FOREIGN TABLE form_4___lump_sum_planning (
  fid integer,
  item varchar,
  annual_amount varchar,
  monthly_amount integer )
  SERVER myserver
  OPTIONS ( layer 'Form 4 - Lump Sum Planning' );

Note how the table name and the field names have been changed so they are legal field and column names for PostgreSQL (not requiring you to quote the columns or table name). For example the spaces in the header of spreadsheet cells got converted to underscore and lower case: e.g. Monthly Amount became monthly_amount. At first glance, it seems a little strange it considered annual_amount varchar and monthly_amount integer. OGR ODS driver by default infers the datatype from the data it sees (there are override environment variables for this, but can't pass them into the OGR_FDW yet I don't think). The Monthly Amount column had just 0s in it and the Annual Amount column had no values. If you go back and fill in the Annual Amount column with numbers (so that monthly_amount recomputes to numbers with decimals), and then you rerun this command, you'll find that the annual_amount and monthly_amount both then show as real data type.

To prepare the data for querying directly from PostgreSQL, we're going to:

  1. Create a schema to house foreign tables: CREATE SCHEMA budget2015;
  2. Create a foreign server pointing to this workbook. For this take the CREATE SERVER statement and replace myserver with: svr_budget2015_workbook.

    CREATE SERVER svr_budget2015_workbook
      FOREIGN DATA WRAPPER ogr_fdw
      OPTIONS (
        datasource 'C:/fdw_data/Budget2015.ods',
        format 'ODS' );
  3. Apply some DO magic to create the foreign tables: This do magic drops the foreign tables if they exist and recreates them. And also replaces references to myserver with the new server name. Finally it prefixes the tables with buget2015. so they get dumped in the budget2015 schema. You could do the same with set search_path before you run the DO.

    DO language plpgsql $$
    DECLARE var_sql text;
    BEGIN 
        var_sql := (WITH conn As (SELECT 'C:/fdw_data/Budget2015.ods'::text As conn),
        cte_tb_names AS 
            (SELECT unnest(
                string_to_array(ogr_fdw_sql_table(conn), E'\n') ) As table_name
                FROM conn)
            , tb AS (SELECT trim(table_name) As table_name
                FROM cte_tb_names WHERE table_name NOT LIKE 'Layers:%' and trim(table_name) > '')
        SELECT 
               string_agg( replace(regexp_replace(ogr_fdw_sql_table(conn, tb.table_name), 
                        'CREATE SERVER (.*);(.*)CREATE FOREIGN TABLE ([a-z0-9\_]+)', 
                  E'DROP FOREIGN TABLE IF EXISTS budget2015.\\3;CREATE FOREIGN TABLE budget2015.\\3'), 
            'myserver','svr_budget2015_workbook'), E'\n') As sql FROM tb, conn);
        
        EXECUTE var_sql;
    END ;$$;
    
    
  4. Finally to verify I got the worksheets as linked foreign tables, I run this query:

    SELECT table_name
        FROM information_schema.tables
    WHERE table_schema='budget2015'
    ORDER BY table_name;
    

    Which outputs this

             table_name
    ----------------------------
     categories_and_items
     dashboard
     debt_analysis
     form_2___equity
     form_3___income
     form_4___lump_sum_planning
     form_5___monthly_cash_flow
     form_7___allocation
     form_8___spending_details
     form_9___work_expenses

Now I can query some tables:

SELECT creditor, amount, interest
    FROM budget2015.debt_analysis 
WHERE interest < 0.08
ORDER BY creditor;
 creditor  | amount | interest
-----------+--------+----------
 Credit 1  |  20000 |   0.0775
 Credit 10 |        |     0.03
 Credit 11 |        |     0.04
 Credit 7  |        |  0.04875
 Credit 8  |        |     0.01
 Credit 9  |        |     0.02