Friday, December 24. 2010
Question: You have a table of people and a table that specifies the activities each person is involved
in. You want to return a result that has one record per person and a column that has a listing of activities for each person
separated by semicolons and alphabetically sorted by activity. You also want the whole set alphabetically sorted by person's name.
This is a question we are always asked and since we mentor on various flavors of databases,
we need to be able to switch gears and provide an answer that works on the client's database. Most
often the additional requirement is that you can't install new functions in the database. This means that
for PostgreSQL and SQL Server, which both support defining custom aggregates, that option is out.
Normally we try to come up with an answer that works in most databases, but sadly the only solution that works in
most is to push the problem off to the client front end and throw up your hands and proclaim -- "This ain't something that should be
done in the database and is a reporting problem." That is in fact what many database purists do, and all I can say to them is wake up and smell the coffee before you are out of a job.
We feel that data
transformation is an important function of a database, and if your database is incapable of massaging the data into a format
your various client apps can easily digest, WELL THAT'S A PROBLEM.
We shall now document this answer rather than trying to answer it for the umpteenth time. For starters,
PostgreSQL has a lot of answers to this question, probably more than any other database, though some are easier to execute than others
and many depend on the version of PostgreSQL you are using. SQL Server has two classes of answers, neither of which is terribly appealing,
but we'll go over the ones that don't require you to be able to install .NET stored functions in your database, since we said that is often a requirement.
MySQL has a fairly
simple, elegant and very portable way that it has had for a really long time.
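As a quick preview, here is a minimal sketch of the MySQL flavor, assuming hypothetical tables persons(person_id, person_name) and activities(person_id, activity):

-- MySQL sketch (table and column names are made up for illustration)
SELECT p.person_name,
       GROUP_CONCAT(a.activity ORDER BY a.activity SEPARATOR ';') AS activities
FROM persons AS p INNER JOIN activities AS a ON p.person_id = a.person_id
GROUP BY p.person_name
ORDER BY p.person_name;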
Continue reading "String Aggregation in PostgreSQL, SQL Server, and MySQL"
Friday, November 05. 2010
Problem: You have a set of numbers, or characters, or whatever, and you are trying to find the max or min of the set.
If the values are separate records in a table or query, the answer is well known and respected across all relational databases -- use the aggregate MAX and MIN functions.
But what if you have a set of free-wheeling numbers or text not in separate records, and you want the max or min of each? Here is where the
least and greatest functions come in handy.
PostgreSQL has had these functions for as far back as I can remember, and it is not the only database to sport these marvelous functions. Our beloved MySQL and Oracle databases have these functions as well. Sadly, our more beloved SQL Server - even in the SQL Server 2008
variant - lacks these functions.
Okay, how do you use these functions? Like this:
SELECT least(1,-2,5) As num_least, greatest('Bobby', 'Catty', 'Kitty') As greatest_cat;
Yields:
 num_least | greatest_cat
-----------+--------------
        -2 | Kitty
We would classify these functions along the lines of COALESCE. They are like COALESCE because they take an arbitrary number of arguments, and the datatype that is returned
is the highest datatype that all arguments in the function can be autocast to. If there is no such autocast, well, you get an error. To demonstrate, guess what happens when you do this:
SELECT least(-1, 'Kitty');
Do this in PostgreSQL, at least in 8.3+, and you get a nice slap if you haven't installed any deprecated autocasts:
ERROR: invalid input syntax for integer: "Kitty"
LINE 1: SELECT least(-1, 'Kitty');
Do this in MySQL - so friendly and forgiving, and a great reader of minds - and you get:
-1
OF COURSE
I apologize for the ambiguous sarcasm; it's just that sometimes I want my mind read and sometimes I don't, and I just can't figure out whether today is one of those days or the other.
Thursday, August 12. 2010
This is a question that comes up quite often from Windows users, so we thought we would share how we normally do it. The question is: can you run a PostgreSQL server on your Windows desktop/server box without having to install anything?
The answer is yes, and quite easily. Why would you need to do this? There are a couple of cases -- one, you are developing a single-user app that you want users to be able to run from anywhere without having to install it first.
The other common reason is you aren't allowed to install anything on a user's PC and you also want to package along a database you have already created.
For our purposes, many of our developers develop on portable WAMP-like things, and some of our applications need to work in both MySQL and PostgreSQL, so we need an easy way during development to swap one out for the other.
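To give a flavor of what the approach looks like, here is a rough sketch of the commands involved, assuming you have copied the PostgreSQL binaries into a folder called pgsql (folder names and the port are just placeholders):

REM sketch: create a data cluster the first time (no installer, no service registration)
pgsql\bin\initdb -D pgsql\data -U postgres -E UTF8
REM start the server on a non-default port so it won't clash with an installed instance
pgsql\bin\pg_ctl -D pgsql\data -o "-p 5433" start
REM stop it when you are done
pgsql\bin\pg_ctl -D pgsql\data stop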
Continue reading "Starting PostgreSQL in windows without install"
Friday, July 23. 2010
When it comes to naming things in databases and languages, there are various common standards. For many languages the
camel family of naming conventions is very popular. For Unix-based databases,
UPPER_CASE or lower_case with underscores is usually the choice, and for databases such as SQL Server and MySQL, which allow you to name your columns with mixed casing
but couldn't care less what case you use in selects, you get a mishmash of styles depending on what camp the database user originated from.
So to summarize the key styles and the families of people who favor them:
- camelCase : lastName - employed by Smalltalk, Java, Flex, C++ and various other C-derived languages.
- Pascal Case (a variant of Camel Case) : LastName - employed by C#, VB.NET, Pascal (and Delphi), and SQL Server (and some MySQL Windows converts). Also often used for class names by languages that use standard camelCase for function names.
- lower case underscore : last_name - often found in C, and a favorite among PostgreSQL database users (and some MySQL users).
- UPPER CASE underscore : LAST_NAME - a favorite among Oracle users (and some MySQL Oracle defectors).
Being at the crossroads of all of the above, we often have to deal with each of these conventions, as well as having internal schizophrenic strife and external fights.
The internal turmoil is the worst and is worse than an ambidextrous person trying to figure out which hand to use in battle. For these exercises, we'll demonstrate one way to convert between the various conventions. These
are the first thoughts that came to mind, so they may not be the most elegant.
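As a small preview, one possible PostgreSQL one-liner for going from camelCase to lower case underscore looks like this (the reverse direction and the other conventions are covered in the full article):

-- sketch: camelCase to lower case underscore; PascalCase input would also need a leading underscore trimmed
SELECT lower(regexp_replace('firstName', E'([A-Z])', E'_\\1', 'g'));  -- returns first_name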
Continue reading "Of Camels and People: Converting back and forth from Camel Case, Pascal Case to underscore lower case"
Sunday, February 14. 2010
Every programmer should embrace and use regular expressions (INCLUDING database programmers).
There are many places where regular expressions can be used to reduce a 20-line piece of code to a
one-liner. Why write 20 lines of code when you can write one?
Regular expressions are a domain language just like SQL. Just like SQL they are embedded in many places. You have them in your program editor.
You see them in sed, grep, Perl, PHP, Python, VB.NET, C#,
in ASP.NET validators, and in JavaScript for checking the correctness of input. You have them in PostgreSQL as well, where
you can use them in SQL statements, domain definitions and check constraints. You can mix
regular expressions with SQL. When you mix the two domain languages, you can do enchanting things with a flip of a wrist that
would amaze your less informed friends. Embrace the power of domain languages and mix it up. PostgreSQL
makes that much easier than any other DBMS we can think of.
For more details on using regular expressions in PostgreSQL, check out the manual pages Pattern Matching in PostgreSQL
The problem with regular expressions is that they are slightly different depending on what language environment you are
running them in. Different enough to be frustrating. We'll just focus on their use in PostgreSQL, though these lessons
are applicable to other environments.
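For a taste of what mixing the two looks like, here is a minimal sketch of a regular expression doing input checking in a check constraint (the table and column are made up for illustration):

-- sketch: a one-line regular expression standing in for many lines of validation code
CREATE TABLE contacts (contact_id serial PRIMARY KEY,
    zip varchar(10) CHECK (zip ~ '^[0-9]{5}(-[0-9]{4})?$'));  -- US 5 or 5+4 digit zip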
Continue reading "Regular Expressions in PostgreSQL"
Saturday, January 09. 2010
UPDATE: Thanks, all, for the suggestions. For now we ended up increasing
seq_page_cost from 1 to 2 in the database. That has gotten us back to our old, much faster speeds without a change in code, and it seems to have
improved the speeds of other queries as well, without reducing the speed of any.
ALTER DATABASE mydb SET seq_page_cost=2;
As Jeff suggested, we'll try to come up with a standalone example that exhibits the behavior. The example below was more to demonstrate the construct. Table names and fields were changed to protect the innocent, which is why we didn't bother showing explain plans. The behavior also seems to have to do
with the distribution of data and gets worse when stats are updated (via vacuum analyze). We didn't see this in PostgreSQL 8.3, and this was a system recently upgraded from 8.3 to 8.4.
---ORIGINAL ARTICLE HERE --
This is a very odd thing and I think it has happened to us perhaps once before.
It's a bit puzzling, and we aren't particularly happy with our workaround because it's
something that looks a bit bizarre to a casual observer. The hack is setting enable_seqscan
off for a particular query to force the planner to use the indexes available to it.
What is particularly troubling about this problem is that it wasn't always this way.
This is a piece of query code we've had in an application for a while, and it has worked, shall
I say, really fast: response times of 300 ms - 1 sec for what is not a trivial query against a not
so trivially sized hierarchy of tables.
At any rate, one day this query that we were very happy with suddenly started
hanging, taking 5 minutes to run. Sure, data had been added and so forth, but that didn't
completely explain this sudden change of behavior. The plan it took had changed drastically.
It had suddenly decided to stop using a critical index it had always used. Well, it was still using the index, but only on
the root table, not the children. Querying a child directly proved that it still refused to use the index there,
so it didn't seem to be the hierarchy at fault.
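For the record, the workaround looks something like the sketch below; the table and filter are stand-ins, since the real names were changed to protect the innocent:

-- sketch of the hack: disable seq scans just for the misbehaving query
BEGIN;
SET LOCAL enable_seqscan = off;  -- only affects the current transaction
SELECT obs_date, reading FROM parent_obs WHERE station_id = 12;  -- hypothetical query
COMMIT;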
Continue reading "Forcing the planner's hand with set enable_seqscan off WTF"
Wednesday, October 21. 2009
Vacuuming and analyzing are the processes that remove dead rows and update the statistics of a table.
As of PostgreSQL 8.3, autovacuuming (the process that runs around cleaning up tables) is on by default. If you are
creating a lot of tables and bulk loading data, the vacuumer sometimes gets in your way. One way to get around that is to
disable autovacuuming on the tables you are currently working on and then re-enable it afterward.
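For PostgreSQL 8.4 and above, where autovacuum settings are table storage parameters, the SQL looks roughly like this (the table name is just an example):

-- sketch: turn autovacuum off for a table you are bulk loading (8.4+ storage parameter syntax)
ALTER TABLE staging.big_load SET (autovacuum_enabled = false);
-- ... bulk load and massage your data ...
ALTER TABLE staging.big_load SET (autovacuum_enabled = true);  -- re-enable when done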
You can also do this from the PgAdmin III management console.
Continue reading "Enable and Disable Vacuum per table"
Monday, October 05. 2009
In our prior story about allocating people with the power of window aggregation, we saw our valiant hero and heroine trying
to sort people into elevators
to ensure that each elevator ride was not over capacity. All was good in the world until someone named Frank came along and spoiled the party.
Frank rightfully pointed out that our algorithm was flawed because, should Charlie double his weight, we could end up with one elevator ride over capacity.
We have a plan.
Continue reading "Allocating People into Groups with SQL the Sequel"
Monday, September 28. 2009
This is along the lines of more stupid window function fun and how many ways we can abuse this technology in PostgreSQL. Well, actually, we were using this approach to allocate geographic areas such that each area has approximately the same population
of things. So you can imagine densely populated areas would have smaller regions and more of them, and less dense areas would have larger regions but fewer of them (kind of like US Census tracts).
You also have to think about ways of allocating your regions so you don't end up with a multipolygon where one part is in one part of the world and another part is somewhere else entirely. Using window aggregation in conjunction with spatial sorting algorithms is one approach.
The non-spatial equivalent of this problem is: how do you shove people into an elevator and ensure you don't exceed the capacity of the elevator for each ride? Below is a somewhat naive way of doing this.
The idea is that you keep summing the weights until you reach capacity and then start a new grouping.
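Here is a minimal sketch of that naive idea, assuming a hypothetical passengers(name, weight) table and a 750 lb elevator capacity; the full article works through a better version:

-- sketch: assign people to elevator rides by running total of weight (capacity 750 lbs)
SELECT name, weight,
       ceil(SUM(weight) OVER (ORDER BY name) / 750.0) AS elevator_ride
FROM passengers
ORDER BY elevator_ride, name;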
Continue reading "Allocating People into Groups with Window aggregation"
Wednesday, September 09. 2009
One of the small little treats provided in PostgreSQL 8.4 is the new pg_terminate_backend function. In the
past, when we wanted to kill runaway PostgreSQL queries issued by a database or user (or, hmm, us), we would call the pg_cancel_backend function.
The problem with that is it would simply cancel the query in the backend process, but oftentimes the offending application would simply launch the same query again.
The new pg_terminate_backend function doesn't completely replace pg_cancel_backend, but it basically does
what you do when you go into, say, the Windows Task Manager and kill the offending postgres process, or on Linux call the kill command on a postgres process. It's nicer
in the sense that you can do it all within PostgreSQL, and you can use the pg_stat_activity view to help you out a bit. Also, you are much less likely to accidentally
kill the root postgres process and take the postgres service down altogether.
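In practice it looks something like the sketch below; note that in 8.4 the process id column in pg_stat_activity is called procpid (renamed to pid in later versions), and 12345 is just a placeholder:

-- find the offending backends
SELECT procpid, usename, current_query FROM pg_stat_activity;
-- politely cancel just the running query
SELECT pg_cancel_backend(12345);
-- or kill the whole backend connection
SELECT pg_terminate_backend(12345);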
Continue reading "Terminating Annoying Back Ends"
Thursday, June 18. 2009
This has been bugging me for a long time, and I finally complained about it; Tom Lane kindly gave a reason for the problem: it's by design and not a bug.
So I thought I would post the situation here, without getting into too many embarrassing specifics, in case others have suffered a similar fate and can learn from this.
The situation:
- You create a function, let's call it myniftyfunc(), in the public schema.
- Then you create another function that depends on myniftyfunc(), let's call it mysuperniftyfunc(), also in the public schema.
- Then, because your function is such a super nifty function, you decide to create a functional index with that super function on your table that sits in the mysuperdata schema - mysuperdata.mysupertable.
Your super nifty function is doing its thing; your table is happy; the planner is spitting out your queries lightning fast using the super nifty index on your super table.
The world is good.
One day you decide to restore your nifty database backup, and to your chagrin your nifty index is not there. The planner is no longer happily spitting out your queries lightning fast, and everything has come to a painful crawl.
Your super nifty index is gone. What happened to your super nifty functional index?
I have to admit that I'm the type of person who assumes the public schema is always there and always in the search_path, and that assumption is a flawed one. After all, the public schema is there by default on new databases for convenience,
but one can change it not to be in the search_path, and in fact pg_dump does just that. So if everything you have is kept in the public schema, you don't run into this particular misfortune. If, however, you have your functions in
public and your tables in different schemas, then during restore the search_path is changed to the schema being restored, and your super functional indexes based on super functions that depend on other super functions fail because public is no longer in the search_path.
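To make the failure concrete, here is a boiled-down sketch of the kind of setup that bites you (the function bodies and the indexed column are made up); the cure is to schema-qualify the nested call, e.g. public.myniftyfunc($1), or otherwise not rely on the search_path:

-- sketch of the setup that breaks on restore
CREATE FUNCTION myniftyfunc(myint integer) RETURNS integer AS
  $$ SELECT 1 + $1; $$ LANGUAGE sql IMMUTABLE;
CREATE FUNCTION mysuperniftyfunc(myint integer) RETURNS integer AS
  $$ SELECT myniftyfunc($1); $$ LANGUAGE sql IMMUTABLE;  -- unqualified call: relies on search_path
CREATE INDEX idx_mysupertable_supernifty ON mysuperdata.mysupertable
  USING btree (mysuperniftyfunc(mysuperint));  -- mysuperint is a hypothetical integer column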
Continue reading "Restore of functional indexes gotcha"
Thursday, May 21. 2009
Someone asked me this recently, and since I don't play with custom types much, I'm not sure this is
the best solution. At any rate, here it goes. Note this works for PostgreSQL 8.2 and above. Not sure about 8.1;
I assume it probably works there too.
Let us say you created a custom type something like this:
CREATE TYPE my_type1 AS
(name varchar(150),
rotation_x double precision,
rotation_y double precision,
x_pos integer,
y_pos integer
);
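One way to instantiate it (a preview of what the article walks through; the values are arbitrary) is to cast a ROW constructor to the type:

-- sketch: build an instance of my_type1 and pull a field back out of it
SELECT (CAST(ROW('motor', 0.5, 0, 10, 11) As my_type1)).name;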
Continue reading "Creating instance of custom type"
Sunday, April 26. 2009
Different Linux distros have their preferred places for where stuff goes, and of course the default location on Windows is completely different from those too. So there isn't really one default location where you can find the PostgreSQL data cluster. Of course users can pick their
own locations as well. So what is a casual DBA supposed to do?
The pg_settings view
PostgreSQL has a convenient system view called pg_settings that stores a lot of information. It stores the location of the data cluster, the pg_hba file and other conf files.
In addition to that, you can interrogate it to find out information you would otherwise dig out of the postgresql.conf file. Why sift through the postgresql.conf file (assuming you can already query your PostgreSQL server) when you can find the answers you are looking
for with an SQL query?
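For example, a query along these lines answers the most common questions (the setting names are the standard ones, nothing custom assumed):

-- where is my data cluster, and where are pg_hba.conf and postgresql.conf?
SELECT name, setting FROM pg_settings
WHERE name IN ('data_directory', 'hba_file', 'config_file');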
Continue reading "Where is my data and other stuff"
Saturday, April 18. 2009
This question is one that has come up a number of times in PostGIS newsgroups, worded in many different ways. The situation is that if you use a function a number of times
with the same arguments, PostgreSQL still insists on recalculating the value, even when the function is marked IMMUTABLE. I have tested this on
8.2 and 8.3 with similarly awful results.
This issue is not so much a problem if function calculations are fast, but spatial function calculations are pretty slow relative to most other functions you will use,
especially when dealing with large geometries. As a result, your query could end up twice as slow. Even setting the costs of these functions relatively high does not help the situation.
To demonstrate, here is a non-PostGIS version of the issue that everyone should be able to run; it demonstrates it's not a PostGIS-only issue.
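One commonly cited workaround (sketched here with made-up names, and not necessarily the exact approach the article takes) is to push the expensive call into a subquery that the planner will not flatten, for example with OFFSET 0, so the value is computed once per row:

-- sketch: slow_func and mytable are stand-ins for illustration
SELECT id, precalc, precalc * 2 AS twice_precalc
FROM (SELECT id, slow_func(id) AS precalc FROM mytable OFFSET 0) AS sub;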
Continue reading "How to force PostgreSQL to use a pre-calculated value"
Saturday, March 28. 2009
Even though others have blogged about this in the past and it's well documented in the docs, it's asked frequently enough that we thought we'd post it here again
with a couple of additional twists.
How to determine the size of a database on disk
SELECT pg_size_pretty(pg_database_size('somedatabase')) As fulldbsize;
How to determine the size of a database table on disk
NOTE: There are two functions in PostgreSQL - pg_relation_size and pg_total_relation_size. pg_relation_size just measures the size of the actual table, whereas
pg_total_relation_size includes both the table and all its toast tables and indexes.
SELECT pg_size_pretty(pg_total_relation_size('someschema.sometable')) As fulltblsize,
pg_size_pretty(pg_relation_size('someschema.sometable')) As justthetblsize;
How to determine the size of a database schema
When you are doing a lot of data loading for later massaging to dump into your production tables, you will often generate junk tables in the process and, worse yet, other
people may generate junk tables, period. As a general practice, we like to create a schema just for junk. The reason is that it is extremely easy to
exclude schemas from being backed up and to load individual schemas. So with this approach, we create or move all tables that we are not quite sure are needed, or that we only need temporarily,
into a schema
called junk, scratch, possiblejunk, or whatever you want to call it, and exclude that schema from our backups.
Something of the form:
CREATE SCHEMA scratch;
ALTER TABLE someschema.joeyplaying SET SCHEMA scratch;
After saving the world from junk, you would like to know how much space you have saved your backups from backing up. So you would employ a query something of the form:
SELECT pg_size_pretty(pg_database_size('mycurrentdb')) As fullprod,
pg_size_pretty(CAST(pg_database_size('mycurrentdb') - (SELECT SUM(pg_total_relation_size(table_schema || '.' || table_name) )
FROM information_schema.tables WHERE table_schema = 'scratch') As bigint)) As tobebackedup_size,
pg_size_pretty(CAST((SELECT SUM(pg_total_relation_size(table_schema || '.' || table_name) )
FROM information_schema.tables
WHERE table_schema = 'scratch') As bigint) ) As junk_size;
Size of Geometries in PostGIS tables
PostGIS has a companion function for measuring geometry size which is useful when you want to get a sense of how much space your geometries are taking up on disk
for a set of records.
SELECT ST_Mem_Size(ST_GeomFromText('LINESTRING(220268 150415,220227 150505,220227 150406)'));
SELECT pg_size_pretty(CAST(SUM(ST_Mem_Size(the_geom)) As bigint) ) as totgeomsum
FROM sometable WHERE state = 'MA';