XPathing XML data with PostgreSQL

One of my favorite tools and I think that of many folks working with GIS and other kinds of Multimedia is the GDAL/OGR suite. Though I started using it to conquer GIS ETL activities, I found myself using it for problems that are inherently not GIS at all. I talked about the GDAL OGR2OGR component a while ago in GDAL OGR2OGR for Data Loading and this time I'll talk tangentially about its raster capabilities. It is a fantastic tool for converting between various raster formats and applying various raster operations. In PostGIS world the new 2.0 raster functionality puts an SQL wrapper around much of its power. I'm not going to talk about that though except as a fleeting comment to explore later (we've got cool 2 band Map Algebra in PostGIS 2.0 to flaunt its stuff). So what does this have to do with XPathing XML data with PostgreSQL? Well that's what I'm going to talk about what to do with machine generated data that comes at you in XML format.

A lot of machine generated data is hitting us in an XML like form. I talked about GPX data and navigating that in Which by the way GDAL/OGR can load and export easily into/out of a PostGIS enabled database.

GDAL exposes another kind of machine generated data in XML format which turns out to be very useful for all kinds of things. This is Exchangeable image file format (EXIF) data. There are all kinds of random text information embedded in pictures and this varies depending on what camera is taking it. Newer cameras like the ones you have built into your iphone or android embed location based information into them sometimes like where you were standing when you took the picture. Most cameras these days embed the time the picture was taken. This information is important because if you are taking electronic notes while you are snapping your pictures, it provides an easy way to match up your notes with the picture about the object. So what does this EXIF info look like when you point GDAL at it? We'll see.

Stuffing XML data in PostgreSQL, but that's not really relational is it?

Okay you thought PostgreSQL was a relational database right and relational databases have no business mucking with XML data or other kinds of data structures for that matter that don't fit into a nice normalized square hole.

I've got a folder of pictures and holy shmoly they have images and they've got this text information built into the image. I have no time to figure out what this data is or put it in an orderly fashion, so I bring it all in with a structure something like below. Most of it will be junk that I don't even want to waste time putting into a structure. This is just a single insert for demonstration, that stuffs the xml generated by GDAL for each picture into a table. Don't worry I do have a routine for scanning my folder in bringing inserting this info.

CREATE TABLE pics(id serial primary key, pic_file_name varchar(255),  pic_date timestamp, exif xml);
INSERT INTO pics(pic_file_name, exif)
VALUES('my_funny_little_house.jpg', '<PAMDataset>
  <Metadata domain="IMAGE_STRUCTURE">
    <MDI key="EXIF_ApertureValue">(2.97085)</MDI>
    <MDI key="EXIF_BrightnessValue">(7.35543)</MDI>
    <MDI key="EXIF_ColorSpace">1</MDI>
    <MDI key="EXIF_ComponentsConfiguration">0x1 0x2 0x3 00</MDI>
    <MDI key="EXIF_DateTime">2011:10:08 12:33:33</MDI>
    <MDI key="EXIF_DateTimeDigitized">2011:10:08 10:04:23</MDI>
    <MDI key="EXIF_DateTimeOriginal">2011:10:08 10:04:23</MDI>
    <MDI key="EXIF_ExifVersion">0221</MDI>
    <MDI key="EXIF_ExposureMode">0</MDI>
    <MDI key="EXIF_ExposureProgram">2</MDI>
    <MDI key="EXIF_ExposureTime">(0.00271003)</MDI>
    <MDI key="EXIF_Flash">32</MDI>
    <MDI key="EXIF_FlashpixVersion">0100</MDI>
    <MDI key="EXIF_FNumber">(2.8)</MDI>
    <MDI key="EXIF_FocalLength">(3.85)</MDI>
    <MDI key="EXIF_GPSAltitude">(52.4931)</MDI>
    <MDI key="EXIF_GPSAltitudeRef">00</MDI>
    <MDI key="EXIF_GPSImgDirection">(151.026)</MDI>
    <MDI key="EXIF_GPSImgDirectionRef">T</MDI>
    <MDI key="EXIF_GPSLatitude">(43) (18.39) (0)</MDI>
    <MDI key="EXIF_GPSLatitudeRef">N</MDI>
    <MDI key="EXIF_GPSLongitude">(72) (5.21) (0)</MDI>
    <MDI key="EXIF_GPSLongitudeRef">W</MDI>
    <MDI key="EXIF_GPSTimeStamp">(15) (4) (1713)</MDI>
    <MDI key="EXIF_ISOSpeedRatings">64</MDI>
    <MDI key="EXIF_MeteringMode">1</MDI>
    <MDI key="EXIF_Orientation">1</MDI>
    <MDI key="EXIF_PixelXDimension">1536</MDI>
    <MDI key="EXIF_PixelYDimension">2048</MDI>
    <MDI key="EXIF_ResolutionUnit">2</MDI>
    <MDI key="EXIF_SceneCaptureType">0</MDI>
    <MDI key="EXIF_SensingMethod">2</MDI>
    <MDI key="EXIF_Sharpness">1</MDI>
    <MDI key="EXIF_ShutterSpeedValue">(8.52582)</MDI>
    <MDI key="EXIF_SubjectArea">1023 767 614 614</MDI>
    <MDI key="EXIF_WhiteBalance">0</MDI>
    <MDI key="EXIF_XResolution">(72)</MDI>
    <MDI key="EXIF_YCbCrPositioning">1</MDI>
    <MDI key="EXIF_YResolution">(72)</MDI>
  <PAMRasterBand band="1">
    <Metadata domain="IMAGE_STRUCTURE">
  <PAMRasterBand band="2">
    <Metadata domain="IMAGE_STRUCTURE">
  <PAMRasterBand band="3">
    <Metadata domain="IMAGE_STRUCTURE">

I may have use for the other stuff in this xml glob some other day, but for today I really only care about the time the picture was taken which we all know you can't count on the creation time of the file itself since it looses it when copied from device to device, but the EXIF timestamp doesn't change. Perhaps another day I'd be interested in the GPS info, but not today. So how do I grab the timestamp, I use XPATH of course with a little dash of regular expressions.

UPDATE pics SET pic_date = regexp_replace((xpath('/PAMDataset/Metadata[2]/MDI[@key=''EXIF_DateTimeDigitized'']/text()',exif))[1]::text,
 '([0-9]{4}):([0-9]{2}):([0-9]{2})', E'\\1-\\2-\\3')::timestamp

So what did I just do in this snippet of code: I grabbed the second MetaData element and grabbed the first MDI element of that that has key = EXIF_DateTimeDigitized. The format doesn't give me something I can easily cast to a timestamp because of those colons, so I use a regular expression to replace the :s following the Year, month, and day with a - and then I cast to a timestamp. The \\1, \\2,\\3 is backrefencing syntax that would return year, month , day.

Using XPath functions

Perhaps one day someone will need to know about the GPS related elements. Then I could pull out the XPATH/XSLT lovely contains function. Like so:

-- Pull all GPS related elements -- 
SELECT pic_file_name
    , (xpath( '/MDI/@key',exp))[1]::text As key
    , (xpath('/MDI/text()',exp))[1]::text As value
SELECT pic_file_name, 
    unnest(xpath('/PAMDataset/Metadata[2]/MDI[contains(@key,''EXIF_GPS'')]',exif)) As exp
FROM pics) As foo;

      pic_file_name       |           key           |      value
my_funny_little_house.jpg | EXIF_GPSAltitude        | (52.4931)
my_funny_little_house.jpg | EXIF_GPSAltitudeRef     | 00
my_funny_little_house.jpg | EXIF_GPSImgDirection    | (151.026)
my_funny_little_house.jpg | EXIF_GPSImgDirectionRef | T
my_funny_little_house.jpg | EXIF_GPSLatitude        | (43) (18.39) (0)
my_funny_little_house.jpg | EXIF_GPSLatitudeRef     | N
my_funny_little_house.jpg | EXIF_GPSLongitude       | (72) (5.21) (0)
my_funny_little_house.jpg | EXIF_GPSLongitudeRef    | W
my_funny_little_house.jpg | EXIF_GPSTimeStamp       | (15) (4) (1713)