Postgres OnLine Journal: November 2011 / December 2011
An in-depth Exploration of the PostgreSQL Open Source Database
 

Table Of Contents

From the Editors
PostgreSQL Q & A
Basics
Using PostgreSQL Extensions

From the Editors

 

The Pure Relational Database is dead



A lot of redditors took offense at our article XPathing XML data with PostgreSQL, with the general consensus being: if you are going to be stuffing XML in a relational database, where will you stop? That is not what relational databases are designed for. We had committed a sacrilegious sin and, worse yet, were encouraging bad habits by forcing people to think more about the different options they have for storing data in a relational database and, god forbid, demonstrating querying such columns with XML-specific functions. What were we thinking? How dare we try to query XML data with SQL? Perhaps we were thinking like this guy or this guy, both equally misguided spatial relational database folk. Of course we went one step further by actually defining a column as xml and daring to store data in it for later consumption rather than just as an intermediary step.

If I want to store documents that are navigable, I should be using a document database like MongoDB, CouchDB, etc. designed for that kind of stuff. If I've got graphs, I should be using a graph database. This got me thinking that the "Pure Relational Database" is dead, and I'm surprised most people don't seem to realize it.

So while "Relational databases" have changed over the last 25 years, most people's notions of them have not kept up with the pace of its change.

First let me define what I mean by pure. A pure relational database is one with standard meat-and-potatoes types -- text, dates, numbers well suited for counting money and computing how close the world is to total bankruptcy -- which you store as fields in a row of a table, and on which you then define foreign keys / constraints / primary keys to relate them to other tables. You reconstitute your real-world objects by stitching these tables together with joins, return sets using where conditions, and summarize using group bys or other math-like constructs. Don't get me wrong; these are very beautiful things because they allow for easy slicing of dimensions without having to think about all the dimensions that make up an object at once. In fact it was so beautiful that some people thought, "wow, that's cool, but it would be even cooler if I could store more complex objects in those columns, with their own specific needs for querying." And so was born what some refer to as the object relational database: built by people who thought relational but also understood that different types have their own unique needs for querying, storage, indexing, etc.

Nowadays most, if not all, relational-like databases have standardized on some variant of SQL. In essence, though, the pure relational database doesn't allow you to define new types or have exotic types such as arrays, XML, graphs, geometries, rasters, sparse matrices, etc. Much less thinking involved, and you are less likely to shoot yourself in the foot by dumping a bunch of XML in a field and trying to do something with it. When such a database is used to store more complex things such as spreadsheets and other user documents, these are stored as blobs and simply retrieved. Even that use is frowned upon.

Well, most relational databases I can think of nowadays have richer types: PostgreSQL, Oracle, and Firebird all support arrays as a column type. Some even allow you to define custom types and functions to support your custom types: PostgreSQL (I could go on forever), Oracle with its rich user-defined type support, and SQL Server 2005+, with each version getting better for user-defined custom types and introducing more exotic types and supporting infrastructure. Even MySQL/Drizzle does (mostly in the form of different storage engines). Even my favorite lightweight SQLite has tricks under the hood that aren't what I would call relational. E.g. SpatiaLite/RasterLite is a whole geometry type library built on SQLite with functions you can call from SQL, and I'm sure there are lots of middleware tools you don't know about that use the SQLite and Firebird engines for more than relational tasks (HTML5 anyone? CAD anyone?).
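To ground that in a quick PostgreSQL example (the test_scores table here is made up for illustration):

-- an array column living happily in an otherwise plain relational table
CREATE TABLE test_scores(student text PRIMARY KEY, scores integer[]);
INSERT INTO test_scores(student, scores)
VALUES ('alex', ARRAY[90, 85, 100]), ('sam', ARRAY[70, 75]);

-- and still queryable with garden-variety SQL
SELECT student FROM test_scores WHERE 100 = ANY(scores);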


From the Editors

 

The Relational Model is very much alive



In our article The Pure Relational database is dead there were a lot of misunderstandings as a result of our poor choice of words. People thought we were bashing the relational model, because in their minds that was what pure meant. I got hit with a lot of poetic insults. I still can't think of an alternative word for what I meant. Simple doesn't really do it, since even relational databases with just standard types were far from simple when you consider the planner and all the other machinery under the hood protecting you from the underlying storage structure. What I was trying to say is that in the beginning most relational databases supported just a standard set of types which you could not expand on, and most people, when they think relational today, still think just that. That type of relational database is, in my book, dead or almost dead.

How did this all start? Well, whenever we use something like PostgreSQL to store anything complex -- take your pick: geometry data, tree-like structures (which we use ltree for), full-text query constructs, and yes, XML -- we get bashed by some know-it-all with a very narrow view of what a relational database should be doing, who suggests we use a NoSQL database, a graph engine, or a full text engine, or normalize our data more. I have also learned XML is a dirty word to many people. I mistakenly thought XML was a complex type people could relate to, but it turns out they relate to it so well that it brings up tragic memories I can only equate to the Post Traumatic Stress Disorder suffered by war veterans or (early or wrong) technology adopters. That was not my intent either. XML was merely an example. I will not say you should use XML in your tables, but I will also not say you should steer clear of it, as many people wanted me to. I will say its use is rare, but it has its place, just like any other complex type, and it has its own special needs for navigation, indexing, etc., which many relational databases handle well enough.
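To make the ltree case concrete, here is a minimal sketch (the org table and its data are made up for illustration; ltree ships as a contrib module / extension):

-- requires the ltree contrib; on 9.1+ that's: CREATE EXTENSION ltree;
CREATE TABLE org(node ltree PRIMARY KEY);
INSERT INTO org(node)
VALUES ('company'), ('company.sales'), ('company.sales.east'), ('company.it');

-- everything at or below company.sales, no recursive self-joins needed
SELECT node FROM org WHERE node <@ 'company.sales';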

My point was that the relational model, SQL, complex types, and user-defined types work handsomely together. If you look at Oracle, DB2, PostgreSQL, or SQL Server, you'll see they have made lots of enhancements over the years to beef up their indexing, storage, etc., so that yes, your specialty types get treated with the same care as their more standard packaged types. That means you can write SQL that outputs, say, a geometry, a person's name, or a subset of data from an XML field; that can navigate data with a simple btree join or an RTree-based join; and that can then do further filtering using specialty functions, many of which can also make use of specialty indexes or output a subset of a larger object. This combination is, in my book, much more powerful than creating a separate database to store just graphs, or a full text engine, or navigable documents. The reality is that related data comes in all shapes and forms and can be related to other data in a different shape and form. It is rare that the document storage proposed by JSON stores (CouchDB, MongoDB, etc.) is ideal for all your data, just as it is rare that all your data should be finely normalized and broken up into columns with the canned set of types we are used to working with.
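As a sketch of what I mean, assuming hypothetical authors and docs tables, a single statement can mix plain relational joins and filters with a specialty full text filter:

SELECT a.author_name, d.published
FROM docs As d
    INNER JOIN authors As a ON a.author_id = d.author_id -- garden-variety btree join
WHERE d.published >= '2011-01-01' -- garden-variety relational filter
    AND to_tsvector('english', d.body) @@ to_tsquery('raster & sql'); -- specialty filter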

I am not bashing the NoSQL folks by any means. If all you need is what those systems do, then I'm all for simplicity; simplicity is best if it satisfies all or most of your needs or is significantly faster. For my apps, however, I need something a little more capable, something that can handle all my specialty needs: navigating across space, across nested relationships, and every so often querying an arbitrary bag of properties (e.g. hstore, json, xml) without having to join with many more tables, even using a more suitable syntax such as the syntax for querying full text data. This is why I choose PostgreSQL: because it lets me do all of that with my favorite language, SQL, and most of the time with a single SQL statement. Also, please stop with the arguments about how SQL does not mean relational and is not what Codd had intended. I know all of that, but I still like it nevertheless, and whether you like it or not, the databases we classify as relational HAVE standardized on SQL, and SQL ain't a naughty word.


From the Editors

 

GeoInformatics article and new book in the works



Two exciting things happened this past month.

  • Our article on the upcoming PostGIS 2.0 recently came out in GeoInformatics Magazine, December 2011, Issue 8. Check it out here; it starts on page 30.
  • We just got notice that our book proposal has been accepted, and this time it's not about PostGIS.

What is the book about?

This upcoming book is planned to be about PostgreSQL only. We are hoping to showcase some of the features that make PostgreSQL stand out from the pack, with more focus on the 9.1 and 9.2 releases and how PostgreSQL has changed over the past releases from 8.3 on. So: a bit of coverage of the various stored procedure languages you can program in, building aggregates, and the plethora of extensions and data types that can make your database an application platform powerhouse. Also, of course, the basic stuff you'd expect from any enterprise database -- user and permissions management, backup/restore, replication, etc. More on that as the time gets closer and the contract gets under way.

We are really excited at the chance of writing another book. Book writing is one of those things I never thought I would enjoy doing, because I don't really enjoy reading books. I suppose writing is different, because writing is more of an active process than reading, at least for me anyway. It provides the perfect excuse to dig deeper into topics that interest you and call it work, which for some reason reminds me of this xkcd strip: "I'm working."


PostgreSQL Q & A

 

How to create an n-column table really fast Beginner



Ever have the need to create a holding table for, say, spreadsheet data with, say, 100 columns? You need to create a table to hold this stuff. Or perhaps you were feeling in a sadistic mood and wanted to abuse your PostgreSQL database to see how many columns of a specific data type you can create in a table. Here is a quick script to do it:

-- 8.4+ syntax
SELECT 'CREATE TABLE data_import('
|| array_to_string(array_agg('field' || i::text || ' varchar(255)'), ',') || ');'
FROM generate_series(1,10) As i;

-- 9.0+ syntax (string_agg was introduced in 9.0)
SELECT 'CREATE TABLE data_import('
|| string_agg('field' || i::text || ' varchar(255)', ',') || ');'
FROM generate_series(1,10) As i;

Both variants will return output that looks like this:

CREATE TABLE data_import(field1 varchar(255),field2 varchar(255),field3 varchar(255),field4 varchar(255)
    ,field5 varchar(255),field6 varchar(255),field7 varchar(255)
    ,field8 varchar(255),field9 varchar(255),field10 varchar(255));

Now if you want it to also execute, because you are running it as part of an SQL script, you could wrap it in an anonymous function.

 -- wrap it in an anonymous function to run
 -- requires PostgreSQL 9.0+ (DO blocks were introduced in 9.0)
DO LANGUAGE plpgsql
$$
DECLARE var_sql text;
BEGIN
    -- build the CREATE TABLE statement, then execute it
    SELECT 'CREATE TABLE data_import('
        || string_agg('field' || i::text || ' varchar(255)', ',') || ');'
    INTO var_sql
    FROM generate_series(1,10) As i;
    EXECUTE var_sql;
END;
$$;

PostgreSQL Q & A

 

What objects are in a PostgreSQL installed extension Intermediate



As mentioned in Lessons Learned Packaging PostGIS extensions, I am working on PostGIS 2.0.0 extension packaging. One of the things I wanted to know was which objects -- types, functions, etc. -- were installed by my extension. The new packaging system catalogs this relatively easily, but I couldn't find a function or view for it and didn't see one mentioned in the manual, so I created this query, which seems to work pretty well as far as I can tell. The basic idea is that any object recorded in pg_depend as depending on the extension (dependency type 'e') is part of the extension package.

SELECT c.relname As item_type, 
    COALESCE(proc.proname,typ.typname, cd.relname, op.oprname, 
        'CAST(' || cs.typname || ' AS ' || ct.typname || ') ', opcname, opfname) As item_name,
    COALESCE(proc.proisagg,false) As is_agg, oidvectortypes(proc.proargtypes) As arg_types
FROM pg_depend As d INNER JOIN pg_extension As e
    ON d.refobjid = e.oid INNER JOIN pg_class As c ON
        c.oid = d.classid
        LEFT JOIN pg_proc AS proc ON proc.oid = d.objid
        LEFT JOIN pg_type AS typ ON typ.oid = d.objid
        LEFT JOIN pg_class As cd ON cd.oid = d.objid
        LEFT JOIN pg_operator As op ON op.oid = d.objid
        LEFT JOIN pg_cast AS ca ON ca.oid = d.objid
        LEFT JOIN pg_type AS cs ON ca.castsource = cs.oid
        LEFT JOIN pg_type AS ct ON ca.casttarget = ct.oid
        LEFT JOIN pg_opclass As oc ON oc.oid = d.objid
        LEFT JOIN pg_opfamily As ofa ON ofa.oid = d.objid
WHERE d.deptype = 'e' and e.extname = 'postgis'
ORDER BY item_type, item_name;

The output looks like:

 item_type  |            item_name             | is_agg |arg_types
------------+----------------------------------+--------+-----------
pg_cast     | CAST(box2d AS box3d)             | f      |
pg_cast     | CAST(box2d AS geometry)          | f      |
pg_cast     | CAST(box3d AS box)               | f      |
pg_cast     | CAST(box3d AS box2d)             | f      |
pg_cast     | CAST(box3d AS geometry)          | f      |
:
pg_opclass  | btree_geography_ops              | f      |
pg_opclass  | btree_geometry_ops               | f      |
pg_opclass  | gist_geography_ops               | f      |
pg_opclass  | gist_geometry_ops_2d             | f      |
pg_opclass  | gist_geometry_ops_nd             | f      |
:
pg_opfamily | btree_geography_ops              | f      |
pg_opfamily | btree_geometry_ops               | f      |
pg_opfamily | gist_geography_ops               | f      |
pg_opfamily | gist_geometry_ops_2d             | f      |
pg_opfamily | gist_geometry_ops_nd             | f      |
:
pg_operator | &&                               | f      |
pg_operator | &&&                              | f      |
:
pg_proc     | postgis_addbbox                  | f      | geometry
:
pg_type     | spheroid                         | f      |
pg_type     | summarystats                     | f      |
:
pg_type     | wktgeomval                       | f      |
(984 rows)
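As an aside, if you just want a flat, human-readable list and don't need the objects broken apart by catalog, the pg_describe_object function (available since 9.1) can do the heavy lifting; a shorter sketch:

SELECT pg_describe_object(d.classid, d.objid, d.objsubid) As item
FROM pg_depend As d INNER JOIN pg_extension As e
    ON d.refobjid = e.oid
WHERE d.deptype = 'e' AND e.extname = 'postgis'
ORDER BY item;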


PostgreSQL Q & A

 

How big is my database and my other stuff Beginner



In this day and age of clouds and whatnot, it is still important to keep tabs on your database growth, to ensure it doesn't outgrow your disks or so that you can be prepared when it does. A common question is how big your database is, or how big all your databases are combined. Luckily PostgreSQL has long had functions to help you answer this easily. In this little article, I'll demonstrate how to answer these questions and more with some PostgreSQL built-in functions.

How big is my database?

The pg_database_size() function takes the name of a database in the database cluster and returns its size in bytes. Trying to decipher bytes when most databases are in the range of megabytes, gigabytes, or even terabytes can get pretty ugly. Luckily there is a pretty function called pg_size_pretty() which makes these numbers more human readable. In addition, there is a current_database() function which tells us which database we are currently connected to. Combine them all and you have your answer.

SELECT pg_size_pretty( pg_database_size( current_database() ) ) As human_size
    , pg_database_size( current_database() ) As raw_size;
human_size |   raw_size
------------+--------------
 181 GB     | 193841573064

How big are my databases?

In this example we sum up the sizes of all databases in the current PostgreSQL instance. Note: we have to cast the sum to bigint, since the SUM function returns a numeric when summing bigints and pg_size_pretty expects a bigint.

SELECT pg_size_pretty( SUM(pg_database_size(datname))::bigint ) As human_size
    ,  SUM( pg_database_size(datname) )::bigint   As raw_size
    , pg_size_pretty( (SUM(pg_database_size(datname) ) - pg_database_size(current_database() ) )::bigint ) As without_current
    FROM pg_database;

In this case raw_size is useful, since it lets us know we've got more than just this one database in our cluster. Since my current database dwarfs the others in size, I also would like to know how big things are without it.

human_size |   raw_size   | without_current
-----------+--------------+-----------------
181 GB     | 194080676140 | 228 MB

Size family of functions

Querying the PostgreSQL catalog, here are all the size function gems I've found. I'm especially fond of pg_column_size, which tells you how much space a value would take up if it were stored in a table field.
For example (with PostGIS installed): SELECT pg_column_size('LINESTRING(1 2,3 4, 5 6)'::geometry);
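If you don't have PostGIS on hand, the same function works on any built-in type; a quick illustration:

SELECT pg_column_size(1::smallint) As sint -- 2 bytes
    , pg_column_size(1::int) As i -- 4 bytes
    , pg_column_size(1::bigint) As bint -- 8 bytes
    , pg_column_size('abc'::text) As t; -- varlena header plus the data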


       funcname        |                                       description
------------------------+------------------------------------------------------------------------------------------
 pg_column_size         | bytes required to store the value, perhaps with compression
 pg_database_size       | total disk space usage for the specified database
 pg_indexes_size        | disk space usage for all indexes attached to the specified table
 pg_relation_size       | disk space usage for the main fork of the specified table or index
 pg_table_size          | disk space usage for the specified table, 
                          including TOAST, free space and visibility map
 pg_tablespace_size     | total disk space usage for the specified tablespace
 pg_total_relation_size | total disk space usage for the specified table and associated indexes

Basics

 

XPathing XML data with PostgreSQL Intermediate



One of my favorite tools, and I think that of many folks working with GIS and other kinds of multimedia, is the GDAL/OGR suite. Though I started using it to conquer GIS ETL activities, I found myself using it for problems that are inherently not GIS at all. I talked about the GDAL OGR2OGR component a while ago in GDAL OGR2OGR for Data Loading, and this time I'll talk tangentially about its raster capabilities. It is a fantastic tool for converting between various raster formats and applying various raster operations. In the PostGIS world, the new 2.0 raster functionality puts an SQL wrapper around much of its power. I'm not going to talk about that though, except as a fleeting comment to explore later (we've got cool 2 band Map Algebra in PostGIS 2.0 to flaunt). So what does this have to do with XPathing XML data with PostgreSQL? Well, I'm going to talk about what to do with machine generated data that comes at you in XML format.

A lot of machine generated data is hitting us in an XML-like form. I talked about GPX data and navigating it in a prior article; GPX, by the way, GDAL/OGR can load and export easily into/out of a PostGIS enabled database.

GDAL exposes another kind of machine generated data in XML format which turns out to be very useful for all kinds of things: Exchangeable image file format (EXIF) data. There is all kinds of random text information embedded in pictures, and it varies depending on what camera took them. Newer cameras, like the ones built into your iPhone or Android, sometimes embed location-based information, such as where you were standing when you took the picture. Most cameras these days embed the time the picture was taken. This information is important because, if you are taking electronic notes while you are snapping your pictures, it provides an easy way to match up your notes with the picture of the object. So what does this EXIF info look like when you point GDAL at it? We'll see.

Stuffing XML data in PostgreSQL, but that's not really relational is it?

Okay, you thought PostgreSQL was a relational database, right? And relational databases have no business mucking with XML data, or other kinds of data structures for that matter, that don't fit into a nice normalized square hole.

I've got a folder of pictures and, holy shmoly, they've got this text information built into the image. I have no time to figure out what this data is or put it in an orderly fashion, so I bring it all in with a structure something like the one below. Most of it will be junk that I don't even want to waste time putting into a structure. This is just a single insert for demonstration, which stuffs the XML generated by GDAL for each picture into a table. Don't worry, I do have a routine for scanning my folder and inserting this info.

CREATE TABLE pics(id serial primary key, pic_file_name varchar(255),  pic_date timestamp, exif xml);
INSERT INTO pics(pic_file_name, exif)
VALUES('my_funny_little_house.jpg', '<PAMDataset>
  <Metadata domain="IMAGE_STRUCTURE">
    <MDI key="COMPRESSION">JPEG</MDI>
    <MDI key="INTERLEAVE">PIXEL</MDI>
    <MDI key="SOURCE_COLOR_SPACE">YCbCr</MDI>
  </Metadata>
  <Metadata>
    <MDI key="EXIF_ApertureValue">(2.97085)</MDI>
    <MDI key="EXIF_BrightnessValue">(7.35543)</MDI>
    <MDI key="EXIF_ColorSpace">1</MDI>
    <MDI key="EXIF_ComponentsConfiguration">0x1 0x2 0x3 00</MDI>
    <MDI key="EXIF_DateTime">2011:10:08 12:33:33</MDI>
    <MDI key="EXIF_DateTimeDigitized">2011:10:08 10:04:23</MDI>
    <MDI key="EXIF_DateTimeOriginal">2011:10:08 10:04:23</MDI>
    <MDI key="EXIF_ExifVersion">0221</MDI>
    <MDI key="EXIF_ExposureMode">0</MDI>
    <MDI key="EXIF_ExposureProgram">2</MDI>
    <MDI key="EXIF_ExposureTime">(0.00271003)</MDI>
    <MDI key="EXIF_Flash">32</MDI>
    <MDI key="EXIF_FlashpixVersion">0100</MDI>
    <MDI key="EXIF_FNumber">(2.8)</MDI>
    <MDI key="EXIF_FocalLength">(3.85)</MDI>
    <MDI key="EXIF_GPSAltitude">(52.4931)</MDI>
    <MDI key="EXIF_GPSAltitudeRef">00</MDI>
    <MDI key="EXIF_GPSImgDirection">(151.026)</MDI>
    <MDI key="EXIF_GPSImgDirectionRef">T</MDI>
    <MDI key="EXIF_GPSLatitude">(43) (18.39) (0)</MDI>
    <MDI key="EXIF_GPSLatitudeRef">N</MDI>
    <MDI key="EXIF_GPSLongitude">(72) (5.21) (0)</MDI>
    <MDI key="EXIF_GPSLongitudeRef">W</MDI>
    <MDI key="EXIF_GPSTimeStamp">(15) (4) (1713)</MDI>
    <MDI key="EXIF_ISOSpeedRatings">64</MDI>
    <MDI key="EXIF_MeteringMode">1</MDI>
    <MDI key="EXIF_Orientation">1</MDI>
    <MDI key="EXIF_PixelXDimension">1536</MDI>
    <MDI key="EXIF_PixelYDimension">2048</MDI>
    <MDI key="EXIF_ResolutionUnit">2</MDI>
    <MDI key="EXIF_SceneCaptureType">0</MDI>
    <MDI key="EXIF_SensingMethod">2</MDI>
    <MDI key="EXIF_Sharpness">1</MDI>
    <MDI key="EXIF_ShutterSpeedValue">(8.52582)</MDI>
    <MDI key="EXIF_SubjectArea">1023 767 614 614</MDI>
    <MDI key="EXIF_WhiteBalance">0</MDI>
    <MDI key="EXIF_XResolution">(72)</MDI>
    <MDI key="EXIF_YCbCrPositioning">1</MDI>
    <MDI key="EXIF_YResolution">(72)</MDI>
  </Metadata>
  <PAMRasterBand band="1">
    <Metadata domain="IMAGE_STRUCTURE">
      <MDI key="COMPRESSION">JPEG</MDI>
    </Metadata>
  </PAMRasterBand>
  <PAMRasterBand band="2">
    <Metadata domain="IMAGE_STRUCTURE">
      <MDI key="COMPRESSION">JPEG</MDI>
    </Metadata>
  </PAMRasterBand>
  <PAMRasterBand band="3">
    <Metadata domain="IMAGE_STRUCTURE">
      <MDI key="COMPRESSION">JPEG</MDI>
    </Metadata>
  </PAMRasterBand>
</PAMDataset>');

I may have use for the other stuff in this XML glob some other day, but for today I really only care about the time the picture was taken. We all know you can't count on the creation time of the file itself, since that gets lost when the file is copied from device to device, but the EXIF timestamp doesn't change. Perhaps another day I'd be interested in the GPS info, but not today. So how do I grab the timestamp? I use XPath, of course, with a little dash of regular expressions.

UPDATE pics SET pic_date = regexp_replace((xpath('/PAMDataset/Metadata[2]/MDI[@key=''EXIF_DateTimeDigitized'']/text()',exif))[1]::text,
 '([0-9]{4}):([0-9]{2}):([0-9]{2})', E'\\1-\\2-\\3')::timestamp;

So what did I just do in this snippet of code? I grabbed the second Metadata element, and from that grabbed the first MDI element that has key = EXIF_DateTimeDigitized. The format doesn't give me something I can easily cast to a timestamp because of those colons, so I use a regular expression to replace the colons following the year, month, and day with dashes, and then I cast to a timestamp. The \\1, \\2, \\3 is backreferencing syntax that returns the year, month, and day.
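To see just the date mangling in isolation, using the digitized timestamp from our sample EXIF:

SELECT regexp_replace('2011:10:08 10:04:23',
    '([0-9]{4}):([0-9]{2}):([0-9]{2})', E'\\1-\\2-\\3')::timestamp;
-- returns: 2011-10-08 10:04:23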

Using XPath functions

Perhaps one day someone will need to know about the GPS related elements. Then I could pull out XPath's lovely contains function. Like so:

-- Pull all GPS related elements -- 
SELECT pic_file_name
    , (xpath( '/MDI/@key',exp))[1]::text As key
    , (xpath('/MDI/text()',exp))[1]::text As value
FROM (
SELECT pic_file_name, 
    unnest(xpath('/PAMDataset/Metadata[2]/MDI[contains(@key,''EXIF_GPS'')]',exif)) As exp
FROM pics) As foo;

      pic_file_name       |           key           |      value
--------------------------+-------------------------+------------------
my_funny_little_house.jpg | EXIF_GPSAltitude        | (52.4931)
my_funny_little_house.jpg | EXIF_GPSAltitudeRef     | 00
my_funny_little_house.jpg | EXIF_GPSImgDirection    | (151.026)
my_funny_little_house.jpg | EXIF_GPSImgDirectionRef | T
my_funny_little_house.jpg | EXIF_GPSLatitude        | (43) (18.39) (0)
my_funny_little_house.jpg | EXIF_GPSLatitudeRef     | N
my_funny_little_house.jpg | EXIF_GPSLongitude       | (72) (5.21) (0)
my_funny_little_house.jpg | EXIF_GPSLongitudeRef    | W
my_funny_little_house.jpg | EXIF_GPSTimeStamp       | (15) (4) (1713)

Using PostgreSQL Extensions

 

Code Profiling with PostGIS raster



What happens when you take a bunch of auto-generated tests and test results and burn them into rasters with SQL? You get a code fingerprint of sorts. Well, that was the dream I wanted to explore. Now that PostGIS 2.0 has introduced a new spatial type called raster, with lots of functionality including some very cool Map Algebra to go with it, this is possible without too much fuss. My first attempt didn't produce quite what I expected. I have to admit it is absolutely gorgeous, and it will hopefully be useful once I figure out how to read it, tweak the rendering process to achieve more of a breakout of results, or patent it as a fabric style.

Now for the story of how a computer nerd is born and how rasters can be useful to nerds.

When I was a kid of, yeh, 9 or 10 years of age, I used to write code in BASIC, save it to a data tape recorder/player, and then listen to the signals on an audio tape recorder/player. I found it amusing that both devices utilized the same media but expressed the signals a bit differently. I put myself to sleep with those sounds much like normal kids listen to human-generated music: the screeching, the abrupt changes in pitch that sounded to me like cars gliding along and then crashing into each other; I wondered what piece of my code was playing at this very moment. Looking down an electron microscope when I was working in a molecular biology lab in high school brought similar joy. I thought, wow, that looks like the universe being born. Thus started my long-time love of and fascination with data and signal processing.

Rasters (imagery) are matrices (often sparse) and n-dimensional in nature, where you can think of each band as a dimension. They are fundamentally a data storage/presentation mechanism, often for environmentally generated data, with visualization being one mechanism of expression. I recently had a dream about how I would digest the whole PostGIS battery of tests into something I could quickly scan, compare, and overlay much like a doctor's X-ray, and diagnose problems: Ah, I see a problem and it's right there! That is what Map Algebra is for, and the combination with SQL I find pretty hot. Sadly it's a hotness that only computer nerds can probably appreciate, so nothing I can discuss at family gatherings, except possibly with my brother.

That brings up the question: how can you represent right there? What does right there mean when it is not about physical space? If I asked you, given these failures, where in your code is right there?, and you are intimately familiar with your code, you probably have a spot in your brain that has mapped out an almost physical map of all the parts of your code, such that right there is just as physical a place to be as the ground you walk on, though perhaps with more dimensions.

While we think of space as being physical, that is merely one manifestation of it. Its positioning is purely based on how you define its continuity. So I thought to myself: if I could look at the tests in the context of some continuum of space, how would I define it so that I could quickly grasp the meaning of 80,000 tests in one glance? In short: converting my mental map to something visual, with little blip lights telling me where there are problems.

Imagine a grid where each cell defines a different test. The rows and columns are somewhat arbitrarily broken, mostly so I can fit it on my monitor; I could just as well have a 1-row, many-many-column strand like DNA. Now what do I put in those cells? Well, my cells have bands, where each band stores a different metric of my test as a number.

  • I've got a band defining the complexity of inputs to the test (which I visualize as a reddish color).
  • A band to measure how long the test took -- if it fails the value is null (undefined) and this I visualize as greenish.
  • Then the last band I define as the size of my output.

Now my tests are generated by running through functions in sequence, often in family order of functions with family order of inputs, so in a sense a test next door is more closely related than a test further away. So there is a continuum of sorts, even if it is just in my head.

We've just defined a raster, and rasters of this type created from the same battery of tests will map out the same space, assuming we define our space continuum the same. This means you can overlay two batteries of tests and compare them pixel by pixel across each band, and they overlay perfectly. More importantly, you can just look at the generated raster (after you apply some transformations to make it viewable), and with some coaching have those millions of sensors in your eye register: Ah, the problem is right there! And that right there would now have an incontestable visual manifestation.
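Just to hint at the SQL involved (the concrete demo comes later), here is a minimal sketch of such a 3-band test raster; the 100x100 sizing and the cell values are made-up stand-ins for real test metrics:

-- one cell per test; three 32-bit float bands:
-- band 1 = input complexity, band 2 = runtime (null on failure), band 3 = output size
SELECT ST_SetValue(ST_SetValue(ST_SetValue(
        ST_AddBand(ST_AddBand(ST_AddBand(
            ST_MakeEmptyRaster(100, 100, 0, 0, 1, -1, 0, 0, 0),
            '32BF'::text, 0, NULL), '32BF'::text, 0, NULL), '32BF'::text, 0, NULL),
        1, 1, 1, 5.0),  -- test #1, band 1: input complexity
        2, 1, 1, 12.3), -- test #1, band 2: runtime
        3, 1, 1, 42) As test_fingerprint; -- test #1, band 3: output size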

Later on I'll demonstrate some concrete code that explores what we PostGIS devs like to call Seamless Vector Raster operations. Those who think "relational databases" should not be performing these sacrilegious exercises may not want to view the next episode or may choose to keep their eyes closed lest they turn to stone.


Using PostgreSQL Extensions

 

Mail Merging using Hstore Intermediate



For those who aren't familiar with hstore, it's a key/value storage type that is packaged as an extension or contrib module in PostgreSQL 8.2+. In PostgreSQL 9.0 it got a little extra loving in several ways, one of which was the introduction of the hstore(record) casting function that converts a record to an hstore. In this article, I'll demonstrate how you can use this new casting function to do very sleek mail merges right in the database. The only caveat is that it seems to only correctly name the keys if it is fed a real table or view; derived queries such as aggregates, etc. get keys named f1, f2, and so on.
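You can see that caveat with an anonymous row constructor:

SELECT hstore(ROW('DCH', 'John Dewey'));
-- "f1"=>"DCH", "f2"=>"John Dewey"   (the column names are lost)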

If you are on PostgreSQL 9.1 or above, installing hstore is just a CREATE EXTENSION hstore; command away. If you are on a lower version of PostgreSQL, you can usually find hstore.sql in share/contrib.

Converting a record to a key value pair

Records are gnarly things because they are hard to inspect and shuffle around in the database, mainly because it's hard to get at the columns without knowing the column names beforehand. In order to create a generic merging function that will take, say, a template and macro-replace any named column with its value, it would be really nice to have a simple key-value format instead of a row-column format. Here is where hstore comes in handy.

We have a table that looks something like this:

CREATE TABLE clients(client_id varchar(5) PRIMARY KEY, contact_name varchar(100), business varchar(100));
INSERT INTO clients(client_id, contact_name, business)
VALUES ('DCH', 'John Dewey', 'Dewey, Cheetham, and Howe'),
    ('IYS', 'Spencer Tracey', 'Investigators at Your Service'),
    ('ABC', 'Abby Back', 'It is all about ABC');

To convert our records to hstore key-val pairs we do this:

SELECT c.client_id, hstore(c) As mydata
FROM clients As c;

-- output --
client_id |                                             mydata
----------+---------------------------------------------------------------------------------------------------
DCH       | "business"=>"Dewey, Cheetham, and Howe", "client_id"=>"DCH", "contact_name"=>"John Dewey"
IYS       | "business"=>"Investigators at Your Service", "client_id"=>"IYS", "contact_name"=>"Spencer Tracey"
ABC       | "business"=>"It is all about ABC", "client_id"=>"ABC", "contact_name"=>"Abby Back"

Mail Merging the data

To help with mail merging, I wrote this little helper function:

CREATE OR REPLACE FUNCTION merge_replace(param_template text, param_kval hstore 
  , param_beg_encap  text DEFAULT  '$', param_end_encap  text DEFAULT  '$'  ) 
RETURNS text AS 
$$
    DECLARE var_result text := param_template; r record;
    BEGIN 
        -- each() unnests the hstore into (key, value) rows
        FOR r IN (SELECT (each(param_kval)).* ) LOOP
            -- swap each $key$ placeholder for its value (NULLs become '')
            var_result := replace(var_result, param_beg_encap || r.key || param_end_encap, COALESCE(r.value,''));
        END LOOP;
        RETURN var_result;
    END 
$$
language 'plpgsql';

Now we write the query:

SELECT c.client_id
    , merge_replace(template.letter,hstore(c)) As personalized_note
FROM clients As c 
CROSS JOIN  (SELECT 'Dear $contact_name$ of $business$,
You are cordially invited to our party.'::text AS letter) As template;


Which outputs this:
client_id |                   personalized_note
----------+-------------------------------------------------------
DCH       | Dear John Dewey of Dewey, Cheetham, and Howe,        
          | You are cordially invited to our party.
IYS       | Dear Spencer Tracey of Investigators at Your Service,
          | You are cordially invited to our party.
ABC       | Dear Abby Back of It is all about ABC,               
          | You are cordially invited to our party.

Merging Hstores

Now, as I mentioned before, if you try to create a subquery and convert the subquery row to an hstore, your keys become the pretty useless f1, f2, f3, etc., but luckily we can merge hstores. For this exercise, let's suppose we've got some items we REALLY REALLY want, like a little girl wants a pony for Christmas. So we want each invited guest to bring us a present, but we don't want any two guests to bring the same present, because who needs 2 ponies? So we want to suggest a different present to each client.

Our gift table looks like this:

CREATE TABLE gifts(gift_name varchar(100) primary key);
INSERT INTO gifts(gift_name)
VALUES ('A real live fancy mouse')
    , ('Android Phone controlled helicopter with camera')
    , ('Train set'), ('Mouse scarf');

We can merge our clients row with a particular gift row by writing a query like the one below, which uses the hstore || operator to merge 2 hstores:

SELECT c.client_id, my_client_data || g.gift_idea AS mydata
FROM (SELECT client_id, hstore(clients) As my_client_data, ROW_NUMBER()OVER() As gift_num
        FROM clients)  As c INNER JOIN
            (SELECT hstore(gifts) As gift_idea, ROW_NUMBER() OVER() As gift_num 
                FROM gifts) As g
            ON (c.gift_num = g.gift_num);

-- outputs --
 client_id |                                                                              mydata
-----------+-------------------------------------------------------------------------------------------------------------------------------------------
 DCH       | "business"=>"Dewey, Cheetham, and Howe", "client_id"=>"DCH", "gift_name"=>"A real live fancy mouse", "contact_name"=>"John Dewey"
 IYS       | "business"=>"Investigators at Your Service", "client_id"=>"IYS", "gift_name"=>"Android Phone controlled helicopter with camera"
             , "contact_name"=>"Spencer Tracey"
 ABC       | "business"=>"It is all about ABC", "client_id"=>"ABC", "gift_name"=>"Train set", "contact_name"=>"Abby Back"

And now for the finale -- the real purpose of this tutorial -- how to get gifts you really want:

SELECT c.client_id,  merge_replace(template.letter, my_client_data || g.gift_idea ) As personalized_note
FROM (SELECT client_id, hstore(clients) As my_client_data, ROW_NUMBER()OVER() As gift_num
        FROM clients)  As c INNER JOIN
            (SELECT hstore(gifts) As gift_idea, ROW_NUMBER() OVER() As gift_num 
                FROM gifts) As g
            ON (c.gift_num = g.gift_num)
    CROSS JOIN  (SELECT 'Dear $contact_name$ of $business$,
You are cordially invited to our party. We understand your time is very valuable and that you would feel guilty if you
came to our party empty-handed.  Out of the kindness of our heart, we have decided to relieve your burden by 
providing you with a gift recommendation from our wish list.  
Suggested gift offering: $gift_name$'::text AS letter) As template;


-- output --
 client_id |                                                   personalized_note
-----------+------------------------------------------------------------------------------------------------------------------------
 DCH       | Dear John Dewey of Dewey, Cheetham, and Howe,                                                                        
           | You are cordially invited to our party. We understand your time is very valuable and that you would feel guilty if you
           | came to our party empty-handed.  Out of the kindness of our heart, we have decided to relieve your burden by          
           | providing you with a gift recommendation from our wish list.                                                          
           | Suggested gift offering: A real live fancy mouse
           
 IYS       | Dear Spencer Tracey of Investigators at Your Service,                                                                 
           | You are cordially invited to our party. We understand your time is very valuable and that you would feel guilty if you
           | came to our party empty-handed.  Out of the kindness of our heart, we have decided to relieve your burden by          
           | providing you with a gift recommendation from our wish list.
           | Suggested gift offering: Android Phone controlled helicopter with camera
           
 ABC       | Dear Abby Back of It is all about ABC,                                                                                
           | You are cordially invited to our party. We understand your time is very valuable and that you would feel guilty if you
           | came to our party empty-handed.  Out of the kindness of our heart, we have decided to relieve your burden by          
           | providing you with a gift recommendation from our wish list.                                                          
           | Suggested gift offering: Train set