Sunday, November 28. 2010
Recommended Books: Free R reference card / Cheat Sheet PostGIS In Action R in Action R In a Nutshell
In this article we'll provide a summary of what PL/R is and how to get running with it. Since we don't like repeating ourselves, we'll refer you to an article we wrote a while ago which is still fairly relevant today called Up and Running with PL/R (PLR) in PostgreSQL: An almost Idiot's Guide and just fill in the parts that have changed. We should note that particular series was more geared toward the spatial database programmer (PostGIS in particular). There is a lot of overlap between the PL/R, R, and PostGIS user-base which is comprised of many environmental scientists and researchers in need of powerful charting and stats tools to analyse their data who are high on the smart but low on the money human spectrum.
This series will be more of a general PL/R user perspective. We'll follow more of the same style we did with Quick Intro to PL/Python. We'll end our series with a PL/R cheatsheet similar to what we had for PL/Python.
As stated in our State of PostGIS article, we'll be using log files we generated from our PostGIS stress tests. These stress tests were auto-generated from the PostGIS official documentation. The raster tests are comprised of 2,095 query executions exercising all the pixel types supported. The geometry/geograpy tests are comprised of 65,892 spatial SQL queries exercising every PostGIS geometry/geography supported in PostGIS 2.0 -- yes this includes TINS, Triangles,Polyhedral Surfaces, Curved geometries and all dimensions of them. Most queries are unique. If you are curious to see what these log tables look like or want to follow along with these exercises, you can download the tables from here.
What is R and PL/R and why should you care?
R is both a language and an environment for doing statistics and generating graphs and plots. It is GNU-licensed and a common favorite of Universities and Research institutions. PL/R is a procedural language for PostgreSQL that allows you to write database stored functions in R. R is a set-based and domain specific language similar to SQL except unlike the way relational databases treat data, it thinks of data as matrices, lists and vectors. I tend to think of it as a cross between LISP and SQL though more experienced Lisp and R users will probably disagree with me on that. This makes it easier in many cases to tabulate data both across columns as well as across rows. The examples we will show in these exercises, could be done in SQL, but they are much more succinct to write in R. In addition to the language itself, there are a whole wealth of statistical and graphing functions available in R that you will not find in any relational database. These functions are growing as more people contribute packages. Its packaging system called Comprehensive R Archive (CRAN) is similar in concept to Perl's CPAN and the in the works PGXN for PostgreSQL.
What do you need before you can use PL/R?
Our first PL/R stored function: creating data frames and saving to RData file
In R there are 5 basic kinds of data structures: scalars, vectors, arrays, dataframes, and lists. Lists are the most complex of all because they are collections of the other 4 data structures. You can think of a PostgreSQL query output of table as a data frame. In this simple example, we'll demonstrate how to feed a PostgreSQL query into an R dataframe and save it in an R data file.
The most common function used in PL/R is the pg.spi.exec routine which takes as input SQL statements and converts them to R dataframes. In the example above, we are dumping our geometry/geography logging test stats and our raster stats from a PostGIS 2.0 test run and saving them to 2 R DataFrames. In addition we are saving these data frames to a custom R Data file. An R Data file is a file that stores R objects in native R format. It's useful for saving datasets you are working on for easy load back.
Using R interactively
R is both a scriptable and an interactive environment and one that you can also automate a good bit of with PL/R. Before automating things, I like to test processes out interactively in R. For this next example, we'll load back the data we saved using PL/R into R and play with it.
To do so, launch R-GUI or R command line and at the prompt type:
-- yields logid log_label func g1 Min. : 1.0 Length:2095 Length:2095 Length:2095 1st Qu.: 524.5 Class :character Class :character Class :character Median :1048.0 Mode :character Mode :character Mode :character Mean :1048.0 3rd Qu.:1571.5 Max. :2095.0 g2 log_start log_end dur_sec Length:2095 Length:2095 Length:2095 Min. :0.000000 Class :character Class :character Class :character 1st Qu.:0.000000 Mode :character Mode :character Mode :character Median :0.000000 Mean :0.008964 3rd Qu.:0.016000 Max. :0.187000
So we know from the above summary that we have 2095 observations and basic stats about the average duration of each process in our raster log tests. We ran the above test on a fairly new beefed up Winodw 7 and an older clunkier Windows Xp. Both 32-bit running PostgreSQL 9.0 / PostGIS 2.0. The windows xp in general followed the same pattern as the windows 7 except timings were generally 50-80% worse.
One of my favorite functions in R is the
Basically what it does is provide a cross tab of counts. Here is an example use where we take our raster log and output the time durations but each time duration we round to 2 decimals so that our numbers in 0.11, 0.112 etc. collapse to 0.11
If you have a lot of variables coming from the same table or want to run a lot of R steps on the same table, you may want to write something like below which is kind of equivalent to a SELECT a, b FROM sometable; As opposed to our above which is more like a SELECT sometable.a, sometable.b FROM sometable;. I generally like to be more explicit especially when I have a lot of R variables floating around.
From this excerpt, we can tell that 64 bit floating point raster bands have the worst performance in general of any of the raster types. To get a true estimate we would run our simulation on different platforms and many times on same machine as well. We did run this a couple of times on our 32-bit windows XP and 7 and the 64BF is consistently slightly worse than the others. Our Windows 7 is about twice as fast in general but interestingly enough on the Windows 7, we see some suspicious outliers for 16BUI and 1BB. We think this is just turbulence since on other runs 64BF most always shows as the longest running.
0 0.02 0.03 0.05 0.06 0.08 0.09 0.11 0.16 0.19 16BSI 119 52 12 7 0 0 0 0 0 0 16BUI 122 48 18 1 0 0 0 0 1 0 1BB 121 57 8 1 1 1 0 0 0 1 2BUI 132 52 6 0 0 0 0 0 0 0 32BF 114 39 13 15 6 3 0 0 0 0 32BSI 115 41 10 17 6 0 1 0 0 0 32BUI 116 39 14 19 1 0 1 0 0 0 4BUI 130 53 7 0 0 0 0 0 0 0 64BF 113 41 1 2 10 12 10 1 0 0 8BSI 128 59 3 0 0 0 0 0 0 0 8BUI 124 59 5 2 0 0 0 0 0 0
Now we do the same exercise except flipping to see the culprit functions. Here we just want for each function if its better or worse than the 3rd Quantile standard deviation of our tests. Though this query can be done in SQL its much more succinct in R. Note: the 2 spellings in ST_SetValue, yes our machine gun test has spotted an inconsistency in our PostGIS documentation. Shame on us.
FALSE TRUE && 120 1 &< 121 0 &> 121 0 AddRasterColumn 3 8 DropRasterTable 11 0 insert data 10 1 PostGIS_Raster_Lib_Build_Date 1 0 PostGIS_Raster_Lib_Version 1 0 ST_AddBand 13 53 ST_AsBinary 6 5 ST_BandHasNoDataValue 22 0 ST_BandMetaData 7 15 ST_BandNoDataValue 22 0 ST_BandPath 22 0 ST_BandPixelType 22 0 ST_Box2D 11 0 ST_ConvexHull 11 0 ST_DumpAsPolygons 21 1 ST_Envelope 11 0 ST_GeoReference 22 0 ST_Height 11 0 ST_Intersection 220 0 ST_Intersects 438 2 ST_MakeEmptyRaster 14 0 ST_MetaData 11 0 ST_NumBands 11 0 ST_PixelAsPolygon 18 4 ST_PixelSizeX 11 0 ST_PixelSizeY 11 0 ST_Polygon 22 0 ST_Raster2WorldCoordX 22 0 ST_Raster2WorldCoordY 22 0 ST_SetBandHasNoDataValue 13 9 ST_SetBandNoDataValue 11 11 ST_SetGeoReference 11 11 ST_SetPixelSize 13 9 ST_SetSkew 11 11 ST_SetSRID 6 5 ST_SetUpperLeft 6 5 ST_SetValue 41 36 ST_SEtValue 31 24 ST_SkewX 11 0 ST_SkewY 11 0 ST_SRID 11 0 ST_UpperLeftX 11 0 ST_UpperLeftY 11 0 ST_Value 131 1 ST_Width 7 4 ST_World2RasterCoordX 77 0 ST_World2RasterCoordY 77 0
Second PL/R stored function, creating a plain text report
When you are writing PL/R code, you are pretty much just writing R code and living in the R environment. However, there are certain functions in R that you may come to prefer when working in PL/R than you would prefer in R. In PL/R you want data to be output in variables so you can easily return them in PL/R. The PostgreSQL language also doesn't provide a whole lot of mappings for the objects you get back in R. For our next example we want a simple text output of representation of the cross tabs we created and we want it to be returnable as an SQL statement. For this next example, we are going to lean on our favorite R functions for doing this. Other's more experienced may have better ways.
So now we show our reporting function which exercises all the above R functions.
To run our report, we run this command: SELECT output_garden_stats();
The report output can be seen http://www.postgresonline.com/demos/plr/postgis20report.txt
There are a lot of surprising things that we discovered simply looking at counts. Some might be things in the testing scripts (or even the PostGIS documentation) that need to be cleaned up and some may be PostGIS issues or just predictable oddities when a function is applied to 2 different kinds of geometries.
For example for the CURVEPOLYGON report:
So the moral of the story -- You can learn a lot from a dummy.
Join us next time as we continue our exploration of R and PL/R and demonstrate creating aggregates with PL/R and then later creating graphs and plots.
Posted by Leo Hsu and Regina Obe in 9.0, gis, intermediate, pl programming, PLR at 14:19 | Comments (3) | Trackbacks (0)
Related entries by tags:
Tracked: Dec 09, 23:10
Recommended Books: Free R reference card / Cheat Sheet PostGIS In Action R in Action R In a Nutshell
Display comments as (Linear | Threaded)
Hi, I am attempting to install plr. I am running Window XP 32bit, PostgreSQL 8.4, PostGIS 1.5, and R 2.11.0. I have downloaded plr (plr-220.127.116.11-pg84.win32.zip)and copied and pasted the dll into ...\PostgreSQL\8.4\lib. After restarting my computer. No sql file was created, I attempted to create my own (http://www.joeconway.com/plr/doc/plr-install.html) but no dice there either I get "error: could not load library...". Do you have any suggestions for trouble shooting? Thank you.
#1 garret on 2011-07-20 17:42
Why should an sql file be created. The plr.sql should be packaged in the .zip file and you install that in your database. though I think last time, I used the setup file which does have the sql script packaged.
Please try the setup and see if that works better for you.
Also make sure you put in the environment variables in control panel. that is usually the cause of this kind of error.
#1.1 Leo Hsu and Regina Obe on 2011-07-23 22:57
Thank you Leo and Regina, I suppose that is the problem then the package that I downloaded came with no sql file. But the link you have provided seems to have worked. Thank you for your help!
#1.1.1 garret on 2011-08-02 10:13
Syndicate This Blog
Show tagged entries