Loading and Processing GPX XML files using PostgreSQL

Simon Greener, wrote an article on how to load GPX xml files into Oracle XMLDB. That got me thinking that I haven't really explored all the XML features that PostgreSQL has to offer and to some extent I've been reticent about XML processed in any database for that matter.

In this article we shall attempt to perform the same feats that Simon did, but with PostgreSQL instead of Oracle XMLDB. Note while we are demonstrating this with a GPX file, the same XPath approach can be used to process any XML file.

PostgreSQL since 8.3 has had ANSI SQL 2003 XML functionality built in. Before 8.3, you could use the xml2 contrib module to achieve the same effect in a not so standards compliant sort of way. In this example we shall demonstrate the built in functionality in 8.3 and above. The key function we will use is the xpath function. XPath is a language used to query XML data and PostgreSQL supports the XPath 1.0 version. The following is a quick primer on XPath that seems useful http://www.zvon.org/xxl/XPathTutorial/General/examples.html. You can also refer to the PostgreSQL XML section of the documentation.

Getting the data

We will use the same sample data Simon used, except sadly we had to change it further because the schema wasn't defined in such a way that PostgreSQL liked or rather I was too stupid to construct the XPath statement in such a fashion that would satisfy the PostgreSQL XML thingy. PostgreSQL seems to require that the schema have a name in addition to a location or at least that is what we concluded. The GPX example Simon had a location but no name. The docs have an example of the form. Where the second argument is an array of 2 dimensional arrays with the first item being the schema name and second the URI for the schema. Also note that the xpath function always returns an array even if there is only one element.

SELECT xpath('/my:a/text()', '<my:a xmlns:my="http://example.com">test</my:a>',
			 ARRAY[ARRAY['my', 'http://example.com']]);

 xpath
--------
 {test}
(1 row)

Suffice it to say, our version is like Simon's revised version except we also stripped off the namespace references since the PostgreSQL XML parser seemed unhappy that the name space defined was never referenced in format xsi:.... or something of that sort . We will be using the xpath version that takes no schema references.

Our revision of Simon's revision can be downloaded from here

The change made is very subtle. Simon had this as the first part

<?xml version="1.0"?>
<gpx version="1.1" creator="Toshihiro Hiraoka" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xmlns="http://www.topografix.com/GPX/1/1" 
 xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd">

and we changed it to:

<?xml version="1.0"?>
<gpx version="1.1" creator="Toshihiro Hiraoka">
Getting the data in the database

Simon used Oracle's get LOB fileopen to get the xml file into the db which he calls from an Oracle stored function. When you think about the closest parallel in PostgreSQL, I would say its the lo_* functions that allow import export of files into the db, though that only allows you to import and export files and not read the file. There is also the pg_read_file which does what we want, but can only read files from the PostgreSQL init cluster. Of course their are other ways. You could use perl or python or some other language such as PLPerlU that has system file access.

For now we'll just create a folder called gpxdir in the PostgreSQL cluster. You can determine the location of your cluster by running as super user

SELECT name, setting FROM pg_settings WHERE name='data_directory';

Now we'll create a function to mirror Simon's getClobDocument, except instead of calling it getClobDocument, we'll call it getXMLDocument because it will return an XML object instead of a CLOB object. Please note -- our getXMLDocument function is marked as SECURITY DEFINER because only super users can use the pg_read_file, so to allow regular users access to this, we have this run in the postgres context and then can give rights to this function to those users we want to who may not have super user rights.

--create the function to load xml doc
CREATE OR REPLACE FUNCTION getXMLDocument(p_filename character varying)
  RETURNS xml AS
$$
---we set the end read to some big number
-- because we are too lazy to grab the length
-- and it will cut of at the EOF anyway
SELECT CAST(pg_read_file(E'gpxdir/' || $1 ,0, 100000000) As xml);
$$
  LANGUAGE 'sql' VOLATILE SECURITY DEFINER;
ALTER FUNCTION getxmldocument(character varying) OWNER TO postgres;


Now to use this function we simply do:

Copy the gpxtestrevised.gpx file into the gpxdir and call the below

SELECT getXMLDocument('gpxtestrevised.gpx');

Next we'll create a table similar to what Simon has called gpx and stuff our xml in there

--create table to store  xml docs
CREATE TABLE gpx
(
  object_name character varying(50) NOT NULL PRIMARY KEY,
  object_value xml
);


--insert xml doc
INSERT INTO gpx(object_name, object_value)
VALUES ('gpxtestrevised.gpx', getXMLDocument('gpxtestrevised.gpx'));

Unfortunately PostgreSQL even in 8.4 is not as rich as Oracle's offering for XMLDB and doesn't have all that fancy validation schema stuff, though if you try to pull an obviously malformed XML document with getXMLDocument it will tell you you are missing tags and so forth. So we are skipping that section of Simon's and moving straight to the fun part.


--Get the metadataname
SELECT (xpath('/gpx/metadata/name/text()', g.object_value))[1] As metadataname
FROM GPX As g;

	  metadataname
------------------------
 Manila to Mt. Pinatubo


--Full meta data
SELECT (xpath('/gpx/metadata/name/text()', g.object_value))[1] as Name,
	   (xpath('/gpx/metadata/desc/text()', g.object_value))[1] as Description,
 (xpath('/gpx/metadata/copyright/year/text()', g.object_value))[1] as Copyright_Year,
 (xpath('/gpx/metadata/copyright/license/text()', g.object_value))[1] as Copyright_License,
(xpath('/gpx/metadata/link/@href', g.object_value))[1] as Hyperlink,
(xpath('/gpx/metadata/link/text/text()', g.object_value))[1] as Hyperlink_Text ,
(xpath('/gpx/metadata/link/time/text()', g.object_value))[1] as Document_DateTime ,
(xpath('/gpx/metadata/link/keywords/text()', g.object_value))[1] as keywords ,
(xpath('/gpx/metadata/bounds/@minlon', g.object_value))[1] as MinLong,
(xpath('/gpx/metadata/bounds/@minlat', g.object_value))[1] as MinLat,
(xpath('/gpx/metadata/bounds/@maxlon', g.object_value))[1] as MaxLong,
(xpath('/gpx/metadata/bounds/@maxlat', g.object_value))[1] as MaxLat
 FROM GPX AS g;


 --the same stuff Simon got
		  name          |          description           | copyright_year |
  copyright_license       |           hyperlink           |  hyperlink_text   |
document_datetime | keywords | minlong | minlat | maxlong | maxlat
------------------------+--------------------------------+----------------+-----
--------------------------+-------------------------------+-------------------+-
------------------+----------+---------+--------+---------+--------
 Manila to Mt. Pinatubo | This is test data for gpx2shp. | 2004           | http
://gpx2shp.sourceforge.jp | http://gpx2shp.sourceforge.jp | Toshihiro Hiraoka |
				  |          | -180.0  | -90.0  | 179.9   | 90.0
(1 row)

And now for the finale -- we shall pull the way points just as Simon did

--Lets extract way points (Simon's is a bit shorter)
-- (the offset here is an ugly hack to force Postgres to use our xml value instead of recopying it as
-- suggested by a commenter to our previous post
--note we were using order by before but OFFSET though still ugly seems cleaner --
--With the offset hack -- this finishes in 895ms.  Without offset hack it takes about 3182 ms (~3 seconds)
SELECT CAST((xpath('/wpt/name/text()', wayp.pt))[1] As varchar(20)) As Name,
	CAST(CAST((xpath('/wpt/@lon', wayp.pt))[1] As varchar) As numeric) As longitude,
	CAST(CAST((xpath('/wpt/@lat', wayp.pt))[1] As varchar) As numeric) As latitude,
	CAST(CAST((xpath('/wpt/ele/text()', wayp.pt))[1] As varchar) As numeric) As Elevation
FROM (SELECT (xpath('/gpx/wpt',g.object_value))[it.i] As pt
FROM (SELECT generate_series(1,array_upper(xpath('/gpx/wpt',g.object_value),1)) As i
FROM GPX As g WHERE object_name = 'gpxtestrevised.gpx') As it
 CROSS JOIN (SELECT object_value
	FROM GPX WHERE object_name = 'gpxtestrevised.gpx') As g OFFSET 0) As wayp;
  name  |   longitude    |   latitude   | elevation
--------+----------------+--------------+------------
 001    |  121.043382715 | 14.636015547 |  45.307495
 002    |  121.042653322 | 14.637198653 |  50.594727
 003    |  121.043165457 | 14.640581002 |  46.989868
 004    |  120.155537082 | 14.975596117 |  38.097656
 005    |  120.236538453 | 15.037303017 | 147.687134
 006    |  120.236548427 | 15.037305867 | 145.043579
 007    |  120.237012533 | 15.038105585 | 160.905151
 008    |  120.237643858 | 15.038478328 | 165.231079
 009    |  120.238984879 | 15.038991300 | 173.882935
 010    |  120.239190236 | 15.039099846 | 166.192383
 011    |  120.241263332 | 15.040223943 | 175.324829
 012    |  120.247956365 | 15.042621084 | 186.860474
 013    |  120.253084749 | 15.043179905 | 208.730347
 014    |  120.254095523 | 15.043297336 | 211.374023
 015    |  120.254105665 | 15.043296246 | 213.296631
 016    |  120.247880174 | 15.042568864 | 189.984863
 017    |  120.246971911 | 15.042486135 | 187.100830
 018    |  120.245966502 | 15.042233923 | 185.418579
 019    |  120.244808039 | 15.041693626 | 181.092651
 020    |  120.244476954 | 15.041558258 | 179.410400
 021    |  120.243841019 | 15.041360026 | 178.689453
 022    |  120.241488637 | 15.040351683 | 176.526489
:
:
 

Scott Bailey has some functions he has written for PostgreSQL that look similar to the extract_value you will find in Oracle that Simon uses in his example. http://scottrbailey.wordpress.com/2009/06/19/xml-parsing-postgres/