Sunday, January 27. 2013
This exercise is a continuation of our How to bulk export tables from MS Access. While this approach will work for other databases besides PostgreSQL, you'll probably need to fiddle with the subroutine to make it work for some of them. PostgreSQL is fairly ANSI-SQL compliant, so not too much fiddling should be required to port it to SQL Server, MySQL, Oracle, etc.
Unlike the prior Visual Basic subroutine we showed that exports the tables, this one just creates an SQL script that you run against the already created PostgreSQL database containing the exported data. We didn't test the quote option, though we coded it in the subroutine, since as we said we hate having to quote fields. If perchance you are one of those folks who likes to put spaces in your field names to make them read more like English, then you'll need to quote, or revise the other routine to convert your spaces to _ or something similar.
Continue reading "How to recreate MS Access primary keys and indexes in PG"
Saturday, November 10. 2012
We've been working to persuade clients who are thinking of upgrading or creating new apps to also choose PostgreSQL in the process.
Here I'll itemize some of the obstacles we've run into in making the sale. All of these fall under the "It is not Database X" line item.
By database X I mean SQL Server, MySQL, and Oracle -- for us in exactly that order. Our obstacle focus is probably a bit different from others' since
we are consultants to mostly Windows shops, or consultants to ISVs who have to sell their applications to U.S. government agencies or units of agencies.
Continue reading "Biggest Obstacle to PostgreSQL Adoption: It is not Database X"
Sunday, July 01. 2012
In the last article, Finding Contiguous primary keys, we detailed one of many ways of finding contiguous ranges in data, but the approach would only work on higher-end databases like Oracle 11G, SQL Server 2012, and PostgreSQL 8.4+. In Oracle you'd have to replace the EXCEPT, I think, with MINUS. It wouldn't work on lower Oracle versions because of its use of CTEs. It wouldn't work on lower SQL Server versions because it uses the window LEAD function, which wasn't introduced until SQL Server 2012. Someone on reddit provided a Microsoft SQL Server implementation which we found particularly interesting because it's a bit shorter and it's more cross-platform. You can make it work with minor tweaks on any version of PostgreSQL, MySQL, SQL Server and even MS Access. The only downside I see with this approach is that it uses correlated subqueries, which tend to be slower than window functions. I was curious which one would be faster, and to my surprise, this version beats the window one we described in the prior article. It's in fact a bit embarrassing how well this one performs: it finished in 462 ms on this dataset, while the prior one we proposed took 11 seconds. Without further ado, here is the table we created to test with:
CREATE TABLE s(n int primary key);
INSERT INTO s(n)
SELECT n
FROM generate_series(1,100000) As n
WHERE n % 200 != 0;
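The WHERE n % 200 != 0 knocks out every multiple of 200, so the table has a known gap every 200 rows. The exact query from reddit is in the full article; just to sketch the general correlated-subquery idea (our own rough rendition, not necessarily the same query), each range starts at an n with no predecessor and ends at the nearest following n with no successor:
SELECT s1.n As start_n,
  (SELECT MIN(s2.n) FROM s As s2
    WHERE s2.n >= s1.n
      AND NOT EXISTS (SELECT 1 FROM s As s3 WHERE s3.n = s2.n + 1)) As end_n
FROM s As s1
WHERE NOT EXISTS (SELECT 1 FROM s As s0 WHERE s0.n = s1.n - 1)
ORDER BY start_n;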
Continue reading "Contiguous Ranges of primary keys: a more cross platform and faster approach"
Sunday, December 11. 2011
In our article The Pure Relational database is dead there
were a lot of misunderstandings as a result of our poor choice of words. People thought we were bashing the relational model because, in their minds, that is what
pure meant. I got hit with a lot of poetic insults. I still can't think of an alternative word to use for what I meant. Simple doesn't really do it, as even relational databases with just standard types
were far from simple when you consider the planner and all the other machinery under the hood that protects you from the underlying storage structure. What I was trying to say is that in the beginning most relational databases
supported only a fixed set of standard types that you could not expand on, and most people, when they think relational today, still think of just that. That type of relational database is, in my book, dead or almost dead.
How did this all start? Well, whenever we use something like PostgreSQL to store anything complex -- take your pick: geometry data, tree-like structures (which we use
ltree for), full-text query constructs, and yes, XML -- we get bashed by some know-it-all with a very narrow view of what a relational database should be doing, who suggests we use a NoSQL database, a graph engine, or a full-text engine, or normalize our data more. I have also learned XML is a dirty word to many people. I mistakenly thought XML was a complex type people could
relate to, but it turns out they relate to it so well that it brings up tragic memories I can only equate to the post-traumatic stress suffered by war veterans or (early or wrong) technology adopters. That was not
my intent either. XML was merely an example. I will not say you should use XML in your tables, but I will also not say you should stay clear of it, as many people wanted me to. I will say its use is rare, but it has its place, just as any other complex type does, and it has its own special needs for navigation, indexing, etc., which many relational databases handle well enough.
Continue reading "The Relational Model is very much alive"
Saturday, December 03. 2011
A lot of redditers took offense at our article XPathing XML data with PostgreSQL,
with the general consensus being: if you are going to be stuffing XML into a relational database, where will you stop? That is not what relational databases are designed for.
We had committed a sacrilegious sin and, worse yet, were encouraging bad habits by forcing people to think more about the different options they have for storing data in a relational database and, god forbid, demonstrating how to query such columns with XML-specific functions. What were we thinking? How dare we try to query XML data with SQL? Perhaps we were thinking like this guy or this guy,
both equally misguided spatial relational database folk.
Of course we went one step further by actually defining a column as xml and daring to store data in it for later consumption rather than just as an intermediary step.
If I want to store navigable documents, I should be using a document database like MongoDB, CouchDB, etc. designed for that kind of stuff. If I've got graphs, I should be using a graph database. This got me thinking
that the "Pure Relational Database" is dead, and I'm surprised most people don't seem to realize it.
So while "relational databases" have changed over the last 25 years, most people's notions of them have not kept pace with that change.
First let me define what I mean by pure. A pure relational database is one with standard meat-and-potato types -- text, dates, numbers well suited for counting money and computing how close the world is to total bankruptcy -- which you store as fields in a row of a table, and on which you then define foreign keys / constraints / primary keys to relate them to other tables.
You reconstitute your real-world objects by stitching these tables together with joins, return sets using where conditions, and summarize using group bys or other mathy constructs. Don't get me wrong; these are very beautiful things, because they allow for easy slicing of dimensions without having to think about all the dimensions that make up an object at once. In fact it was so beautiful that some people thought, "wow, that's cool, but it would be even cooler if I could
store more complex objects in those columns, with their own specific needs for querying." And so was born what some refer to as the object relational database, built by people who thought relational but also understood that different types have their own unique needs for querying, storage, indexing, etc.
Nowadays most, if not all, relational-like databases have standardized on some variant of SQL.
In essence though, the pure relational database doesn't allow you to define new types or have exotic types such as arrays, xml, graphs, geometries, rasters, sparse matrices, etc.
There is much less thinking involved, and it is less likely you will shoot yourself in the foot by dumping a bunch of xml in a field and trying to do something with it.
When it is used to store more complex things such as spreadsheets and other user documents, these are stored as blobs and simply retrieved. Even such use is frowned upon.
Well, most relational databases I can think of nowadays have richer types: PostgreSQL, Oracle and Firebird all support arrays as a column type. Some even allow you to define custom types and functions to support your custom types, e.g. PostgreSQL (I could go on forever), Oracle with its rich user-defined type support, and SQL Server 2005+, with each version getting better for user-defined custom types and introducing more exotic types and supporting infrastructure. Even MySQL/Drizzle does (mostly in the form of different storage engines). Even my favorite light-weight SQLite has some tricks under the hood that aren't what I would call relational. E.g. SpatiaLite/RasterLite is a whole geometry type library built on SQLite with functions you can call from SQL, and I'm sure there are lots of middleware tools you don't know about
using the SQLite and Firebird engines for more than relational tasks (e.g. HTML5 anyone? CAD anyone?).
Monday, September 26. 2011
UPDATE: Turns out there is a simpler way of getting rid of roles that have explicit permissions on objects, as Tom Lane pointed out in the comments.
DROP OWNED BY some_role;
Will drop the permissions on objects a role has rights to, even if it doesn't own the objects. Of course this needs to be applied with caution, since it will also drop tables
and other things you don't necessarily want dropped. So it is best to first run a:
REASSIGN OWNED BY some_role TO new_role;
And then run the DROP OWNED BY.
The REASSIGN OWNED BY, which is what we did originally, is not sufficient since it doesn't drop the permissions or reassign
them as we assumed it would. This is noted in the docs.
And then you will be allowed to:
DROP ROLE some_role;
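Putting the pieces together, the safe order of operations looks something like this (illustrative only; adjust the role names and review what will be affected before running it on a real database):
REASSIGN OWNED BY some_role TO new_role;  -- hand ownership of some_role's objects to another role
DROP OWNED BY some_role;                  -- then drop the grants some_role still holds on other objects
DROP ROLE some_role;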
One of the things that is still tricky in PostgreSQL is permission management. Even though 9.0 brought us default privileges and the like, these permissions aren't
retroactive, so they are still a pain to deal with if you already have objects defined in your database.
One of the annoyances we come across is deleting roles. Let's say you have a role and it has explicit permissions to an object.
PostgreSQL won't allow you to delete this role if it owns objects or has explicit permissions to objects. In order to delete it, it seems you have
to go in and clear out all those permissions. To help with that, we wrote a quickie script that will generate a script to revoke all permissions on objects
for a specific role. It looks like this:
Continue reading "Bulk Revoke of Permissions for Specific Group/User role"
Friday, June 03. 2011
PostgreSQL 8.4 introduced the ability to create user-defined variadic functions. These are basically
functions whose final arguments form a variable-length list of values, all of the same type. Depesz went over them two years ago in Waiting for 8.4 variadic functions,
so we are a bit late to the party. In a nutshell, variadic functions are syntactic sugar for functions that would otherwise take arrays. In this article we'll provide some more demonstrations of them to supplement Depesz's article.
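As a quick taste (a toy example of our own, not one from Depesz's article), here is a variadic SQL function that averages however many numbers you hand it; under the hood the arguments arrive packed into a single numeric[] array:
CREATE OR REPLACE FUNCTION mean(VARIADIC vals numeric[])
RETURNS numeric AS
$$
  SELECT avg(v) FROM unnest(vals) As v;
$$ LANGUAGE sql IMMUTABLE;

-- call it with as many arguments as you like
SELECT mean(1, 2, 3, 4.5);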
I was reminded that I had never explored this feature when recently documenting one of the
new PostGIS 2.0 Raster functions, ST_Reclass, which employs this feature.
I think ST_Reclass is a superb function and one of my favorite raster functions thus far, one that I hope to put to good use soon. Our new PostGIS family member, Bborie Park, is running through our
PostGIS Raster milestones much faster than I had dreamed. He's already implemented a good chunk of stuff we discussed in Chapter 13 - PostGIS Raster and had stated you probably wouldn't see in PostGIS 2.0. He's
going a bit faster than I can catalog, so the documentation is already embarrassingly behind the fantastic functionality that is already present in PostGIS 2.0.
Continue reading "Variadic Functions in PostgreSQL"
Tuesday, May 10. 2011
Question: What is the difference between CURRENT_TIMESTAMP and clock_timestamp()?
Answer: CURRENT_TIMESTAMP is an ANSI-SQL standard variable you will find in many relational databases, including PostgreSQL, SQL Server, Firebird, IBM DB2 and MySQL to name a few,
that records the start of the transaction. The important thing to keep in mind about it is that there is only one value per transaction, so if you have a long-running transaction,
you won't see it change as you go along.
clock_timestamp() is a PostgreSQL function that always returns the current clock's timestamp. I don't think I'm alone in using it for simple benchmarking and other things
where, for example, I need to record the timings of each part of a function within the function using pedestrian RAISE NOTICE debug print statements.
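A quick way to see the difference (a toy example of ours):
SELECT CURRENT_TIMESTAMP As txn_start, clock_timestamp() As wall_clock
FROM generate_series(1,3);
-- txn_start shows the same value on every row, while wall_clock creeps forward row by row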
There is another cool way I like using it, and that is for a batch of records each with an expensive function call: benchmarking how long it takes to process each record.
One of the things I'm working on is improving the speed of the tiger_geocoder packaged in PostGIS 2.0. The first route of attack seemed to me to be the normalize_address function,
which I was noticing was taking anywhere from 10% to 50% of my time in the geocode process. That's a ton of time if you are trying to batch geocode a ton of records. The thing is,
the function's speed is very sensitive to how badly formed the address is, so a whole batch could be held up by one bad apple, and since the batch doesn't return until all records are processed, it makes
the whole thing seem to take a while.
So rather than looping through each record, I thought it would be cool if I could run the batch, but for each record have it tell me how long it took to process relative to the rest, so I could get
a sense of what a problem address looks like. So I wrote this query:
WITH ctbenchmark
AS
(SELECT *,
the_time - COALESCE(lag(the_time) OVER(ORDER BY the_time), CURRENT_TIMESTAMP) As process_time,
the_time - CURRENT_TIMESTAMP As diff_from_start
FROM (SELECT address_1, city, state, zip,
pprint_addy(normalize_address(coalesce(address_1,'') || ', ' || coalesce(city || ' ','') || state || ' ' || zip)) As pp_addr,
clock_timestamp() As the_time
FROM testgeocode LIMIT 1000) As foo )
SELECT *
FROM ctbenchmark
WHERE process_time > '00:00:00.016'::interval;
Which returned an output something like this:
 address_1        | city | state | zip    | pp_addr    | the_time                   | process_time | diff_from_start
------------------+------+-------+--------+------------+----------------------------+--------------+-----------------
 48 MAIN ST ..    | S..  | MA    | 021..  | 48 MAIN .. | 2011-05-10 03:24:43.078-04 | 00:00:00.032 | 00:00:00.032
 15 ...           |      | MA    | 018... | 15 GREN... | 2011-05-10 03:24:50.796-04 | 00:00:00.031 | 00:00:07.75
Friday, February 25. 2011
Many of our customers ask us this question, so we thought we'd lay down our thoughts.
The last couple of our articles have been about how to do this and that in PostgreSQL, SQL Server, or MySQL, or about having PostgreSQL coexist with an existing SQL Server install.
A major reason for that is that in many of our projects we have a choice of what database to use for a new piece of an application, as long as it plays nicely with the existing infrastructure.
Our core database competencies are still PostgreSQL, SQL Server, and MySQL, with the balance leaning
more toward PostgreSQL each day. We are perhaps somewhat unique in the PostgreSQL community in that Oracle never comes into our equation of decisions (though Oracle and PostgreSQL are perhaps more similar than the others).
Oracle is too expensive for most of our clientele,
so it's a non-issue, and when our clients do have Oracle, it's been thrust upon them by their ERP/CRM vendor and is essentially off limits to them.
Continue reading "Why choose or not choose PostgreSQL?"
Tuesday, January 18. 2011
In our last article we talked about implementing string aggregation in PostgreSQL, SQL Server, and MySQL. This task is one that makes purist relational database programmers
a bit squeamish. In this article we'll talk about the reverse: how do you deal with data that someone hands you delimited in a single field, which you are asked to explode or re-sort based on some lookup table?
What are the benefits of having a structure such as this?
 p_name | activities
--------+--------------------------------
 Dopey  | Tumbling
 Humpty | Cracking;Seating;Tumbling
 Jack   | Fishing;Hiking;Skiing
 Jill   | Bear Hunting;Hiking
Well, for the casual programmer, or for a simple text-file database that knows nothing about JOINs and so forth, it makes it simple to pull a list of people who like Tumbling:
you simply do a WHERE ';' || activities || ';' LIKE '%;Tumbling;%'. It's handy for security too, because you can check membership with a simple LIKE test and list all the security groups a member belongs to without any joins.
It's quite easy for even the least data-skilled of programmers to work with, because most procedural languages have a split function that can easily parse these into an array useful for stuffing into drop-down lists and so forth. As a consultant to semi-techie people,
I'm often torn by the dilemma of "the way I would program for myself vs. the way that provides the most autonomy to the client". By that I mean,
for example, I try to avoid heavy-weight things like wizards that add bloated dependencies or slow an application down. These bloated dependencies may provide ease to the client but make my debugging life harder. So I weigh the options
and figure out which way works now and also provides me an easy escape route should things like speed or complexity become more of an issue.
This brings us to the topic of what is wrong with this model. It can be slow, because the LIKE condition can't easily take advantage of an index unless you use a full-text index, so it's not ideal where this is the primary filtering factor. It's also prone to pollution, because
you can't easily validate that the values in the field are in your valid set of lookups, and if your lookup table changes, the text can't be forced to change with a CASCADE UPDATE/DELETE rule, etc. In cases where this is of minor consequence,
which is many if referential integrity is not high on your list of requirements, this design is not bad. It might make a purist throw up, but oh well, there is always Dramamine to fall back on. As long as you have done your cost-benefit analysis, I don't think there should be any shame in following this
less-than-respected route.
While you may despise this model, it has its place and it's a fact of life that one day someone will hand it to you and you may need to flip it around a bit. We shall demonstrate how to do that in this article.
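The full cross-platform treatment is in the article itself, but just to sketch the flavor of it in PostgreSQL (8.4+, and assuming a hypothetical people table shaped like the sample above), the semicolon-delimited field can be exploded into one row per person/activity like so:
SELECT p_name, regexp_split_to_table(activities, ';') As activity
FROM people
ORDER BY p_name, activity;
-- yields rows such as (Humpty, Cracking), (Humpty, Seating), (Humpty, Tumbling), ...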
Continue reading "Reverse String Aggregation: Explode concatenated data into separate rows In PostgreSQL, SQL Server and MySQL"
Friday, December 24. 2010
Question: You have a table of people and a table that specifies the activities each person is involved
in. You want to return a result that has one record per person and a column that has a listing of activities for each person
separated by semicolons and alphabetically sorted by activity. You also want the whole set alphabetically sorted by person's name.
This is a question we are always asked, and since we mentor on various flavors of databases,
we need to be able to switch gears and provide an answer that works on the client's database. Most
often the additional requirement is that you can't install new functions in the database. This means that
for PostgreSQL and SQL Server, which both support defining custom aggregates, that option is out.
Normally we try to come up with an answer that works in most databases, but sadly the only solution that works in
most is to push the problem off to the client front end, throw up your hands, and proclaim, "This ain't something that should be
done in the database; it's a reporting problem." That is in fact what many database purists do, and all I can say to them is wake up and smell the coffee before you are out of a job.
We feel that data
transformation is an important function of a database, and if your database is incapable of massaging the data into a format
your various client apps can easily digest, WELL THAT's A PROBLEM.
We shall now document this answer rather than answering it for the umpteenth time. For starters,
PostgreSQL has a lot of answers to this question, probably more than any other database, though some are easier to execute than others
and many depend on the version of PostgreSQL you are using. SQL Server has two classes of answers, neither of which is terribly appealing,
but we'll go over the ones that don't require you to install .NET stored functions in your database since, as we said, that is often a requirement.
MySQL has a fairly
simple, elegant, and very portable way that it has had for a really long time.
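To give a taste of one PostgreSQL answer (assuming 9.0+ for string_agg, and hypothetical people and people_activities tables; the article covers older versions and the other databases too):
SELECT p.p_name,
  string_agg(a.activity, ';' ORDER BY a.activity) As activities
FROM people As p INNER JOIN people_activities As a ON a.p_name = p.p_name
GROUP BY p.p_name
ORDER BY p.p_name;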
Continue reading "String Aggregation in PostgreSQL, SQL Server, and MySQL"
Friday, November 05. 2010
Problem: You have a set of numbers, or characters, or whatever, and you want to find the max or min of the set.
If the values are separate records in a table or query, the answer is well known and respected across all relational databases -- use the aggregate MAX and MIN functions.
But what if you have a set of free-wheeling numbers or text that are not separate records, and you want the max or min of them? Here is where the
least and greatest functions come in handy.
PostgreSQL has had these functions for as far back as I can remember and is not the only database to sport these marvelous functions. Our beloved MySQL and Oracle databases have these functions as well. Sadly, our more beloved SQL Server, even in the SQL Server 2008
variant, lacks these functions.
Okay, how do you use these functions? Like this:
SELECT least(1,-2,5) As num_least, greatest('Bobby', 'Catty', 'Kitty') As greatest_cat;
Yields:
 num_least | greatest_cat
-----------+--------------
        -2 | Kitty
We would classify these functions along the lines of COALESCE. They are like COALESCE because they take an arbitrary number of arguments, and the datatype that is returned
is the highest datatype that all arguments in the function can be autocast to. If there is no such autocast, then you get an error. To demonstrate, guess what happens when you do this:
SELECT least(-1, 'Kitty');
Do this in PostgreSQL, at least in 8.3+, and you get a nice slap if you haven't installed any deprecated autocasts:
ERROR: invalid input syntax for integer: "Kitty"
LINE 1: SELECT least(-1, 'Kitty')
Do this in MySQL -- so friendly and forgiving, and such a great reader of minds -- and you get:
-1
OF COURSE.
I apologize for the ambiguous sarcasm; it's just that sometimes I want my mind read and sometimes I don't, and I just can't figure out whether today is one of those days or the other.
Friday, July 23. 2010
When it comes to naming things in databases and languages, there are various common standards. For many languages the
camel family of naming conventions is very popular. For Unix-based databases,
UPPER case or lower case with underscores is usually the choice, and for databases such as SQL Server and MySQL, which allow you to name your columns with mixed casing
but couldn't care less what case you express them in in selects, you get a mishmash of styles depending on what camp the database user originated from.
So to summarize the key styles and the families of people who use them:
- camelCase : lastName - employed by SmallTalk, Java, Flex, C++ and various C derivative languages.
- Pascal Case: (a variant of Camel Case) -- LastName which is employed by C#, VB.NET, Pascal (and Delphi), and SQL Server (and some MySQL windows converts). Also often used for class names by languages that use standard camelCase for function names.
- lower case _ last_name : often found in C, a favorite among PostgreSQL database users. (some MySQL)
- upper case _ LAST_NAME : a favorite among Oracle Users (some MySQL Oracle defectors)
Being at the crossroads of all of the above, we often have to deal with each of these, as well as with internal schizophrenic strife and external fights.
The internal turmoil is the worst, and is worse than an ambidextrous person trying to figure out which hand to use in battle. In these exercises, we'll demonstrate one way to convert between the various conventions. These
are the first thoughts that came to our mind, so they may not be the most elegant.
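For a flavor of one direction (a quick sketch of our own, not necessarily what the article settles on), a single regular expression gets you from PascalCase or camelCase to lower case underscore:
SELECT lower(regexp_replace('LastName', '([a-z0-9])([A-Z])', E'\\1_\\2', 'g')) As underscored;
-- returns: last_name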
Continue reading "Of Camels and People: Converting back and forth from Camel Case, Pascal Case to underscore lower case"
Wednesday, June 02. 2010
One of the coolest features of PostgreSQL is the ability to write functions using plain old
SQL. It has had this feature for a long time, since before PostgreSQL 8.2, and no other database to our knowledge has it. By SQL we mean sans procedural mumbo jumbo like
loops and whatnot. This is cool for two reasons:
- Plain old SQL is the simplest to write, most anyone can write it, and it is just what the doctor ordered in many cases. PostgreSQL even allows you to write
aggregate functions with plain old SQL. Try to write an aggregate function in SQL Server and
you've got to pull out your Visual Studio this and that, do some compiling and loading, and you had better know C# or VB.NET. Try it in MySQL and you had better learn C.
Do the same in PostgreSQL (where you have a large choice of languages, including SQL) and the code is simple to write. Never mind that
with MySQL and SQL Server you often aren't even allowed to do those kinds of things on a shared server or a server where the IT department is paranoid. The closest
in this much ease would be Oracle, which is unnecessarily verbose.
- Most importantly -- since it is just SQL, a simple user-defined SQL function can often be in-lined into the overall query plan, since
it only uses what is legal in plain old SQL. A minimal sketch of such a function follows this list.
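As that minimal sketch (an illustrative example of our own, not one from the article), an in-linable SQL function is little more than a named SQL expression:
CREATE OR REPLACE FUNCTION fahrenheit_to_celsius(double precision)
RETURNS double precision AS
$$
  SELECT ($1 - 32) * 5.0 / 9.0;
$$ LANGUAGE sql IMMUTABLE;

-- when called in a query, the planner can substitute the expression right into the plan
SELECT fahrenheit_to_celsius(temp_f) FROM weather_readings;  -- weather_readings is a hypothetical table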
This inlining feature is part of the secret sauce that makes PostGIS fast and easy to use.
So instead of writing geom1 && geom2 AND Intersects(geom1,geom2), a user can write
ST_Intersects(geom1,geom2). The shorthand is even more striking when you think of the ST_DWithin function.
With an inlined function, the planner has visibility into the function, breaks apart the
spatial index short-circuit test && from the more exhaustive absolute test Intersects(geom1,geom2),
and has great flexibility in reordering the clauses in the plan.
Continue reading "STRICT on SQL Function Breaks In-lining Gotcha"
Monday, May 17. 2010
For those people coming from Oracle, SQL Server, MySQL or other databases that have soundex functionality,
you may be puzzled, or even frustrated, when you try to do
something like
WHERE soundex('Wushington') = soundex('Washington')
in PostgreSQL and get a "function does not exist" error.
Well, it so happens that there is a soundex function in PostgreSQL, and yes, it is
also called soundex, but it is offered as a contrib module and not installed by default. That module also has other fuzzy string matching functions in addition to soundex.
One of my favorites, the levenshtein distance function, is included as well. In this article
we'll be covering the contrib module packaged as fuzzystrmatch.sql. Details of the module can be found in FuzzyStrMatch.
The contrib module has been around for some time, but has changed slightly from PostgreSQL version to PostgreSQL version. We are covering the 8.4 version in this article.
For those unfamiliar with soundex, it's a basic approach developed by the US Census in the 1930s as a way of sorting
names by pronunciation. Read Census and Soundex for more of the gory history details.
Given that it is an approach designed primarily for the English alphabet, it sort of makes sense why it's not built into PostgreSQL,
which has a more diverse international concern. For example, if you used it to compare two words in Japanese or Chinese,
I don't think it would fare too well in any of the database platforms that support this function.
The original soundex algorithm has been improved over the years. Though it's still the most commonly used today, newer variants
exist, called Metaphone, developed in the 1990s, and Double Metaphone (DMetaphone), developed in 2000, which support additional
consonants in other languages such as Slavic, Celtic, Italian, Spanish, etc.
These two variants are also included in the fuzzystrmatch contrib library. The soundex function still seems to be
the most popularly used, at least in the U.S. This is perhaps because most of the other databases (Oracle, SQL Server, MySQL) have soundex built in but not the metaphone variants,
so in a sense soundex is a more portable function. The other reason is that metaphone and dmetaphone take up a bit more space and
are also more processor-intensive to compute than soundex. We'll demonstrate some differences between them in this article.
To enable soundex and the other fuzzy string matching functions included, just run the
share/contrib/fuzzystrmatch.sql script located in your PostgreSQL install folder. This library is an important piece of arsenal for geocoding and genealogy tracking, particularly with
the U.S. streets and surnames data sets. I come from a long line of Minors, Miners, Burnettes and Burnets.
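Once it's installed, a quick sanity check (our own toy example) shows soundex and levenshtein in action:
SELECT soundex('Wushington') = soundex('Washington') As sounds_alike,
  levenshtein('Wushington', 'Washington') As edit_distance;
-- sounds_alike: t, edit_distance: 1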
For the next set of exercises, we will be using the places dataset we created in Importing Fixed width data into PostgreSQL with just PSQL.
Continue reading "Where is soundex and other warm and fuzzy string things"